Orchestration Patterns for CustomerOrchestrator and AdminOrchestrator

Status: A1 deliverable. Inputs to ADR-0003 (FSM persistence) and candidate ADR-0005 (orchestration model). Audience: Adrian (solo founder); future Ratiba contributors who will implement M3 onwards. Voice: Opinionated. One fork picked per question. Build on the locked decisions in the ADR-0001 amendment of the same date.


Verdict: Hybrid — deterministic FSM as the spine, with bounded LLM agent loops attached as escape hatches at three named seams: (a) intent classification, (b) slot extraction under ambiguity, and (c) clarification rephrasing.

The booking flow is a workflow, not an agent task, in the Anthropic taxonomy — the steps are knowable in advance (greet, identify, service, slot, confirm, pay), the failure modes are knowable in advance (slot unavailable, payment timeout, customer changes mind), and the cost of an agent "creatively" reordering them is real money. Pure FSM, however, is too brittle for natural Swahili and English where customers say "tafadhali nipange masaji ya tishu kesho saa nane" (one utterance, three slots, one language switch) — a hard-coded regex layer collapses on its first encounter with a real human. The hybrid shape keeps the spine deterministic so the audit trail and the test suite stay tractable, while the LLM does what it is uniquely good at: parsing messy multilingual utterances into structured slot updates and producing human-shaped clarifications when a slot is missing or ambiguous. This shape also matches the load-bearing constraint from Phase C that "conversation state IS canonical state for in-flight bookings" — the FSM is the canonical state and the LLM is a stateless transformation over it.


2. Why-not on the rejected alternatives

2.1 Why-not FSM-only (pure rules, no LLM)

A pure-FSM CustomerOrchestrator would treat every inbound utterance as a string to be matched against keyword tables, dispatch on the match, and reject anything unmatched into a clarification state. This is tempting because it is testable end-to-end with no flaky LLM calls, costs zero per turn, and behaves identically across deploys. It dies on contact with bilingual reality. Swahili appointment booking does not look like English: "kesho saa tatu asubuhi" is "tomorrow at 9 AM" but uses the Swahili clock convention (saa tatu = 3rd hour after dawn = 9 AM, not 3 AM); "Jumatatu" is Monday even though it literally reads as "third day" (the Swahili week counts from Saturday); English speakers say "next Tuesday" and mean either the immediately upcoming Tuesday or the one after depending on whether today is past Wednesday. Hand-coding this with high recall requires either a curated phrasebook per tenant (operational nightmare for a 1-FTE team) or the eventual realization that you are reinventing intent classification badly. We would also pay this cost twice: once on intent ("did the customer mean book or reschedule?") and again on slot extraction ("was 'Mary' the customer name or the staff name?"). The right answer is to let an LLM do the parsing it is already trained for and use the FSM to enforce the invariants the LLM is bad at (one decision per turn, no double-booking, no skipped payment).
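The six-hour offset of the Swahili clock is mechanical once the time-of-day word is known, so it can live in plain code downstream of the LLM rather than in a prompt. A minimal sketch, assuming the extractor has already produced an integer hour and one of four period words; the function name and the period mapping are illustrative assumptions, not part of the design above:

```python
# Hypothetical helper: convert a Swahili-clock hour to a 24-hour clock hour.
# Assumes the extractor yields an integer hour (1-12) plus a period word.
PERIOD_KIND = {"asubuhi": "am", "mchana": "pm", "jioni": "pm", "usiku": "night"}


def swahili_hour_to_24h(saa: int, period: str) -> int:
    if not 1 <= saa <= 12:
        raise ValueError(f"saa must be 1-12, got {saa}")
    kind = PERIOD_KIND[period]  # KeyError on an unknown period word
    hour12 = (saa + 6) % 12     # saa tatu (3) -> 9, saa nane (8) -> 2
    if kind == "am":
        return hour12
    if kind == "pm":
        return hour12 + 12
    # Night runs from 19:00 to early morning: the first hours are PM,
    # the hours from saa sita (midnight) onward are AM.
    return hour12 + 12 if hour12 >= 7 else hour12
```

With this in place the LLM only has to return the raw hour and period word; the conversion itself stays deterministic and unit-testable.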

2.2 Why-not agent-loop-with-tools (full LLM-in-charge ReAct)

The pure-agent shape — model in charge, FSM dissolved into a tool catalogue (list_services, find_slots, confirm_booking, initiate_payment) — is what most "agent framework" demos show, and it is what CrewAI and the OpenAI Agents SDK optimise for. It is wrong for Ratiba on three counts. First, the booking flow has hard real-world invariants — BOOKING_PAYMENT cannot precede BOOKING_CONFIRM, you cannot complete an appointment that was never confirmed, the M-Pesa STK push must reference an appointment_id that exists — and an LLM-in-charge architecture pushes those invariants into prompt instructions ("never call initiate_payment before confirm_booking") which is the weakest place to enforce them; the same model that is being asked to follow the rule is the one deciding whether to follow it. Second, the latency budget is unforgiving on voice (target end-to-end <800 ms per LiveKit's pipeline writeup) and a free-form ReAct loop with tool reflection routinely needs 3–5 LLM hops per turn; a hybrid shape can do the same turn in 1 LLM hop because the FSM already knows what state to advance to once the slot is filled. Third, the Anthropic effective-harnesses post explicitly recommends single-loop with disciplined planning over agent-in-charge for "tightly coupled" workflows — booking is the canonical tightly-coupled workflow.


3. State machine sketch

The canonical CustomerOrchestrator FSM. States are uppercase; the LLM-augmented seams are flagged inline. AdminOrchestrator uses a different, looser FSM (more LLM-driven because admin utterances are open-ended) and is sketched separately.

3.1 CustomerOrchestrator booking-flow FSM

                                 +-----------+
                                 |   IDLE    | <------------------+
                                 +-----+-----+                    |
                                       |                          |
            inbound message            |                          |
                   |                   v                          |
                   |            +-------------+                   |
                   +----------->|    GREET    |                   |
                                +------+------+                   |
                                       |                          |
                             intent classify (LLM)                |
                                       |                          |
    +---------+---------+---------+----+----+---------+-----------+
    |         |         |         |         |         |           |
    v         v         v         v         v         v           v
IDENTIFY   BOOK_*   CANCEL_*   RESCHED   INQUIRY   UNKNOWN     ABANDON
                                                      |           ^
                                                      v           |
                                                CLARIFICATION     |
                                                      |           |
                                                 (>=3 fails)      |
                                                      +-----------+
                                                      |
                                                      v
                                                  ESCALATE

The booking sub-flow:

GREET --> IDENTIFY   (auto-create contact if unknown phone)
  |          |
  |          v
  |     +----+----+
  +---->| SERVICE | <-- LLM slot extract from initial utterance if present
        +----+----+
             |
             v   (auto-skip if 1 staff for service)
        +----+----+
        |  STAFF  |
        +----+----+
             |
             v
        +----+----+
        |  SLOT   | <-- SchedulingService.get_available_slots
        +----+----+
             |
             v
        +----+----+
        | CONFIRM | <-- LLM AnswerShaper produces summary; user clicks button
        +----+----+
             |
             +---if mpesa_enabled---> PAY ----> DONE
             |
             +---else--------------------------> DONE

3.2 Transitions and conditions

| From | To | Condition | Side-effect |
| --- | --- | --- | --- |
| IDLE | GREET | Any inbound message on a fresh thread (no Redis FSM key) | Create FSM thread, persist via TenantScopedSaver |
| GREET | IDENTIFY | Intent classifier returns book with confidence >= 0.85 | Persist intent=book slot |
| GREET | CLARIFICATION | Intent confidence in [0.40, 0.85) | LLM rephrase; offer 3 reply buttons |
| GREET | UNKNOWN | Intent confidence < 0.40 OR no intent | Generic fallback message |
| IDENTIFY | SERVICE | Contact exists or auto-created | Set contact_id in state |
| SERVICE | STAFF | Service slot filled (button click OR LLM-extracted with confidence >= 0.85) | Persist service_id |
| SERVICE | CLARIFICATION | Slot ambiguous (e.g., "massage" matches 3 services) | LLM disambiguation question |
| STAFF | SLOT | Staff slot filled, OR service has only 1 eligible staff (auto-fill), OR user said "anyone" (set null) | Persist staff_id |
| SLOT | CONFIRM | User selected one of offered slots (button) | Lock-pending: tentative reservation in Redis, 5 min TTL |
| CONFIRM | PAY | User clicked Confirm AND tenant.mpesa_enabled == true | appointments row INSERT with status pending, payment_status pending; trigger STK push |
| CONFIRM | DONE | User clicked Confirm AND tenant.mpesa_enabled == false | appointments row INSERT with status confirmed, payment_status unpaid; release Redis lock |
| CONFIRM | SLOT | User clicked Change | Discard tentative; re-enter slot selection |
| CONFIRM | ABANDON | User clicked Cancel | DELETE Redis tentative; clean exit |
| PAY | DONE | M-Pesa callback ResultCode==0 arrives within 60s | Update appointment to confirmed, paid; send WhatsApp confirmation |
| PAY | CLARIFICATION | Callback ResultCode!=0 (timeout, cancelled, NSF) | Offer Retry / Cancel buttons |
| PAY | ABANDON | 60s STK timeout AND no callback AND user idle 2 min | Cancel tentative appointment; soft-deliver "we did not see your payment" |
| any non-terminal | CLARIFICATION | Inbound utterance does not match expected slot AND retry_count < 3 | Increment retry_count, ask LLM to re-prompt |
| CLARIFICATION | ESCALATE | retry_count >= 3 | Notify admin via WhatsApp; pause customer thread (interrupt-and-resume) |
| ESCALATE | IDLE | Admin completes handoff (button: "Hand back to bot") | Resume agent on next inbound message |
| any | ABANDON | 30 min Redis TTL elapses with no inbound message | Persist final state to Postgres checkpoint; delete Redis key |
| DONE | IDLE | Always (on next inbound message) | Fresh thread |
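
One way to keep the transition table enforceable in code is a plain lookup keyed on (state, event), with the LLM only ever producing events. A minimal sketch covering the happy path and the clarification budget; the event names and the `next_state` helper are assumptions made for illustration, not names from the codebase:

```python
# Table-driven transitions for a subset of the booking FSM.
# Event names ("intent_book", "confirm_clicked", ...) are hypothetical.
MAX_CLARIFICATIONS = 3
TERMINAL = {"DONE", "ABANDON"}

TRANSITIONS = {
    ("IDLE", "inbound"): "GREET",
    ("GREET", "intent_book"): "IDENTIFY",
    ("IDENTIFY", "contact_resolved"): "SERVICE",
    ("SERVICE", "service_filled"): "STAFF",
    ("STAFF", "staff_filled"): "SLOT",
    ("SLOT", "slot_selected"): "CONFIRM",
    ("CONFIRM", "change_clicked"): "SLOT",
    ("CONFIRM", "cancel_clicked"): "ABANDON",
    ("PAY", "payment_ok"): "DONE",
    ("PAY", "payment_failed"): "CLARIFICATION",
}


def next_state(state, event, *, mpesa_enabled=False, retry_count=0):
    # The only guarded edge in this subset: CONFIRM branches on the tenant flag.
    if state == "CONFIRM" and event == "confirm_clicked":
        return "PAY" if mpesa_enabled else "DONE"
    target = TRANSITIONS.get((state, event))
    if target is not None:
        return target
    if state in TERMINAL:
        raise ValueError(f"{state} is terminal; start a fresh thread")
    # Unmatched input in a non-terminal state: clarify until the budget runs out.
    return "ESCALATE" if retry_count >= MAX_CLARIFICATIONS else "CLARIFICATION"
```

Because the table is data, the eval suite can assert on it directly without invoking a model.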

3.3 Sad-path notes

  • CLARIFICATION is the catch-all for "the slot you wanted is unfilled or ambiguous". It is not a terminal state — it loops back to the originating state with a refined question. Hard cap at 3 loops to prevent infinite re-asking.
  • ESCALATE uses the LangGraph interrupt() primitive. The graph state is checkpointed, the customer is told "I'm going to hand you over to the team", and the admin's WhatsApp gets a structured briefing card (last 5 turns + extracted slots so far). On admin "Hand back to bot", the graph resumes from ESCALATE into IDLE.
  • ABANDON is a clean terminal — Redis key is deleted, but the Postgres checkpoint is retained for audit + analytics. The next inbound message starts a fresh thread.
  • The AdminOrchestrator FSM is much shallower because admin utterances are inherently open-ended ("how's tomorrow looking", "block Jane Thursday afternoon"). Its only states are IDLE -> ROUTED -> AWAIT_CONFIRMATION -> EXECUTED -> IDLE with the heavy lifting happening inside ROUTED's LLM intent + entity extraction (see §4.3).

4. Intent routing

4.1 Customer side: hybrid (rules-first, LLM-fallback)

Use rules-first pattern matching for the cheap-and-obvious cases, LLM-fallback for everything else. Concrete protocol per inbound utterance:

  1. Channel pre-routing. If the inbound is a WhatsApp button reply (type: "interactive"), the button id IS the intent — no classifier needed. This handles ~60% of customer turns once the FSM is past GREET.

  2. Keyword fast-path. A small per-language regex table catches high-confidence single-word utterances: ^(book|booking|appointment|kuhifadhi|nipange)\b -> book with confidence 0.95; ^(cancel|cancellation|sitaki|futa)\b -> cancel with confidence 0.95. If matched, skip step 3.

  3. LLM classifier (fallback). Use a nano-class model (Claude Haiku) with structured outputs in strict mode — Phase C §10 mandates strict mode for any agent → tool call, and intent routing IS a tool call. Schema:

    {
      "name": "ClassifyCustomerIntent",
      "schema": {
        "type": "object",
        "properties": {
          "intent": {
            "type": "string",
            "enum": ["book", "cancel", "reschedule", "inquiry", "greeting", "unknown"]
          },
          "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
          "language": { "type": "string", "enum": ["en", "sw"] },
          "extracted_slots": {
            "type": "object",
            "properties": {
              "service_hint": { "type": ["string", "null"] },
              "date_hint": { "type": ["string", "null"] },
              "time_hint": { "type": ["string", "null"] },
              "staff_hint": { "type": ["string", "null"] }
            }
          }
        },
        "required": ["intent", "confidence", "language", "extracted_slots"],
        "additionalProperties": false
      },
      "strict": true
    }

    The classifier returns extracted_slots opportunistically — if the user said "nipange masaji kesho saa nane", we want service_hint="masaji", date_hint="kesho", time_hint="saa nane" all populated in one LLM call so the FSM can skip ahead.
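
Steps 1 and 2 of the protocol above can be sketched as a single function that returns nothing when the LLM classifier is needed. This is a minimal sketch: the regexes and the 0.95 confidence come from step 2, while the function name, the 1.0 confidence for buttons, and the assumption that a button id is itself an intent name are illustrative.

```python
import re

# Patterns and the 0.95 confidence are taken from step 2 of the protocol.
FAST_PATH = [
    (re.compile(r"^(book|booking|appointment|kuhifadhi|nipange)\b", re.IGNORECASE), "book"),
    (re.compile(r"^(cancel|cancellation|sitaki|futa)\b", re.IGNORECASE), "cancel"),
]


def classify_fast(text, button_id=None):
    """Return (intent, confidence), or None to fall through to the LLM."""
    if button_id is not None:
        # Assumption: button ids are intent names, so no classifier is needed.
        return button_id, 1.0
    stripped = text.strip()
    for pattern, intent in FAST_PATH:
        if pattern.match(stripped):
            return intent, 0.95
    return None
```

A caller would invoke the LLM classifier only when `classify_fast` returns `None`, which keeps the cheap cases free of model latency and cost.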

4.2 Confidence thresholds

| Confidence band | Action |
| --- | --- |
| >= 0.85 | Auto-route. Treat as decided. |
| >= 0.40 and < 0.85 | Route to CLARIFICATION with the LLM's rephrased question. |
| < 0.40 | Route to UNKNOWN with generic fallback "I didn't catch that. Would you like to book, cancel, or ask a question?" |

These thresholds are starting values, not gospel. They get tuned against the eval suite (A3) once we have 20+ real Swahili transcripts.
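
Since the thresholds are expected to move, it helps to keep them as named constants behind one function so the eval suite can tune them in a single place. A minimal sketch with the starting values above; the function and constant names are assumptions:

```python
# Starting values from the confidence-band table; expected to be tuned.
AUTO_ROUTE_MIN = 0.85
CLARIFY_MIN = 0.40


def route_by_confidence(confidence: float) -> str:
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= AUTO_ROUTE_MIN:
        return "AUTO_ROUTE"
    if confidence >= CLARIFY_MIN:
        return "CLARIFICATION"
    return "UNKNOWN"
```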

4.3 Admin side: LLM-only with structured outputs

Admin utterances are too open-ended to benefit from rules ("how's tomorrow looking", "what did we make this week", "block Jane off Thursday afternoon", "send Mary a reminder about her appointment", "create a new service: hot stone, 90 minutes, 5000 bob"). Use Claude Haiku with strict structured outputs against the AdminIntent enum from PRD §4 Module 8. No keyword fast-path — the grammar of admin commands is not regular. Same >= 0.85 threshold for auto-execute, with the difference that any AdminIntent of *_DELETE or BROADCAST_* always triggers AWAIT_CONFIRMATION regardless of confidence (§5 irreversibility rule).


5. Slot filling under uncertainty

Three concrete heuristics, in order of precedence:

5.1 The irreversibility rule (always confirm)

If executing the slot's downstream action is irreversible or expensive to undo, always confirm, regardless of confidence. Irreversible actions for Ratiba:

  • confirm_booking (writes to appointments, may trigger M-Pesa STK push that costs the customer money)
  • cancel_booking (notifies customer, may trigger refund)
  • initiate_stk_push (debits the customer's phone)
  • broadcast_message (admin-side; can spam an entire customer base)
  • delete_service / disable_service (admin-side; may cancel future bookings)
  • block_staff_time (admin-side; may cancel existing bookings)

These all transition through an explicit AWAIT_CONFIRMATION button-only step. No exceptions, no "we're 99% sure".

5.2 The confidence rule (confirm if uncertain)

For reversible slot fills, confirm if confidence < 0.85 OR if the FSM has rolled back from this slot already in this session (the user said "no, change" once). Concretely:

| Slot | Confidence threshold for auto-fill |
| --- | --- |
| intent | 0.85 |
| service_id | 0.85 (the user can change it next turn cheaply) |
| staff_id | 0.80 (cheaper still: most services have 2-3 staff) |
| slot (date+time) | 0.95 (high cost of misbooking; most users will have clicked a button anyway) |
| language | 0.70 (we can switch back if we got it wrong) |
| name (contact name extraction) | 0.90 (we will reuse this for months, so better to ask than guess wrong) |
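
The irreversibility rule (§5.1) and the confidence rule (§5.2) compose into one predicate, with irreversibility checked first. A minimal sketch under the thresholds listed above; the function name and keyword arguments are illustrative assumptions:

```python
# Action names from §5.1 and thresholds from §5.2.
IRREVERSIBLE_ACTIONS = {
    "confirm_booking", "cancel_booking", "initiate_stk_push",
    "broadcast_message", "delete_service", "disable_service", "block_staff_time",
}
AUTO_FILL_THRESHOLDS = {
    "intent": 0.85, "service_id": 0.85, "staff_id": 0.80,
    "slot": 0.95, "language": 0.70, "name": 0.90,
}


def needs_confirmation(*, slot, confidence, action=None, rolled_back=False):
    """True when the bot must ask before acting on this slot fill."""
    if action in IRREVERSIBLE_ACTIONS:
        return True   # §5.1: always confirm, regardless of confidence
    if rolled_back:
        return True   # §5.2: the user already corrected this slot once
    return confidence < AUTO_FILL_THRESHOLDS[slot]
```

An unknown slot name raises `KeyError` here, which seems preferable to silently auto-filling a slot nobody assigned a threshold to.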

5.3 The "ask once, assume after" rule

Some slots benefit from optimistic assumption with a corrective backstop. The pattern: extract the slot from context (e.g., service_hint from initial utterance), present the next-state response as if the slot were filled, and let the user correct. Example:

  • Customer: "nipange masaji kesho saa nane"
  • LLM extracts: service_hint="masaji", date_hint="kesho", time_hint="saa nane" with mid-confidence (0.70-0.80).
  • Bot does NOT ask "did you mean massage?". Instead it skips to slot selection: "Sawa, masaji ya tishu kesho saa nane. Ningekupa nafasi hizi:" (OK, deep tissue massage tomorrow at 2 PM. I have these slots:) and presents a list. Note that saa nane is the eighth hour on the Swahili clock, i.e. 2 PM, not 8. If the customer wanted Swedish massage instead, they'll correct via "no, swedish" which the LLM picks up in one round-trip.

This violates the strict confidence threshold but pays for itself in turn count. Limit to service_id and staff_id slots only — never apply to date/time (cost of misbooking) or to anything irreversible.

5.4 The clarification budget

Hard cap of 3 clarification rounds per booking attempt. After 3, transition to ESCALATE automatically. This is a guard against infinite re-prompting, not a hard product rule — the eval suite will tell us if 3 is the right number.


6. Tool-calling pattern

Verdict: hybrid — internal services (Scheduling, CRM, Payments, Catalog) exposed as MCP-shaped tool definitions, registered into the LangGraph node via the standard tool-call schema. The "MCP-shape" means the tool definitions live in a dedicated tools/ package with the JSON-schema shape MCP requires (name, description, inputSchema), but Phase 1 dispatches them in-process via direct Python function calls rather than spinning up actual MCP servers.

Phase C §1 made the call: "Adopt MCP as the internal tool-bus shape so calendar/M-Pesa/CRM tools speak the same protocol regardless of which model we route to." The trade-off is well-understood: full MCP servers add a process boundary and JSON-RPC overhead per tool call (cost: 5-15 ms extra latency, plus container management); raw Anthropic tool schemas lock us to Anthropic's wire format. The hybrid is: write the tool definitions in MCP shape so the surface is portable, but execute them in-process via a thin adapter for now.
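
The in-process adapter can be very small. The sketch below keeps each definition in the MCP field layout (name, description, inputSchema) and dispatches with a direct Python call; the registry, the `call_tool` helper, the stub handler, and its placeholder return data are all assumptions for illustration, and the argument check is deliberately shallower than full JSON Schema validation.

```python
from typing import Any, Callable

# name -> {"definition": MCP-shaped dict, "handler": Python callable}
TOOLS: dict[str, dict[str, Any]] = {}


def register_tool(name: str, description: str,
                  input_schema: dict[str, Any], handler: Callable[..., Any]) -> None:
    TOOLS[name] = {
        "definition": {"name": name, "description": description,
                       "inputSchema": input_schema},
        "handler": handler,
    }


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Dispatch in-process after a shallow check against inputSchema."""
    tool = TOOLS[name]
    schema = tool["definition"]["inputSchema"]
    missing = [k for k in schema.get("required", []) if k not in arguments]
    unknown = [k for k in arguments if k not in schema.get("properties", {})]
    if missing or unknown:
        raise ValueError(f"{name}: missing={missing} unknown={unknown}")
    return tool["handler"](**arguments)


def list_active_services(tenant_id: str) -> list[dict]:
    # Stub handler with placeholder data; the real one would query Catalog.
    return [{"id": "svc_1", "name": "Deep tissue massage"}]


register_tool(
    "list_active_services",
    "List bookable services for a tenant.",
    {"type": "object",
     "properties": {"tenant_id": {"type": "string"}},
     "required": ["tenant_id"],
     "additionalProperties": False},
    list_active_services,
)
```

Moving to a real MCP server later would mean serving the same `definition` dicts over a transport while the handlers stay unchanged.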

6.1 Tool catalogue (Phase 1)

| Tool | Surface | Owner | Used by |
| --- | --- | --- | --- |
| list_active_services | (tenant_id) -> list[Service] | Catalog | CustomerOrch SERVICE state, AdminOrch LIST_SERVICES |
| list_eligible_staff | (tenant_id, service_id) -> list[Staff] | Catalog | CustomerOrch STAFF state, AdminOrch |
| get_available_slots | (tenant_id, service_id, staff_id?, date) -> list[Slot] | Scheduling | CustomerOrch SLOT state |
| tentatively_reserve | (tenant_id, slot, ttl_seconds=300) -> reservation_id | Scheduling | CustomerOrch CONFIRM state |
| confirm_booking | (reservation_id, contact_id) -> appointment | Scheduling | CustomerOrch transition CONFIRM->DONE |
| cancel_booking | (appointment_id, reason) -> appointment | Scheduling | both orchestrators |
| initiate_stk_push | (appointment_id, amount, phone) -> checkout_id | Payments | CustomerOrch PAY state |
| lookup_contact | (tenant_id, phone) -> contact? | CRM | both orchestrators |
| upsert_contact | (tenant_id, phone, name?, language?) -> contact | CRM | identity resolver |
| query_bookings | (tenant_id, date_range, status?, staff?) -> list[appointment] | Scheduling | AdminOrch BOOKINGS_* |
| query_revenue | (tenant_id, date_range) -> {total, by_day} | Scheduling | AdminOrch REVENUE_* |
| notify_admin | (tenant_id, message_card) -> void | WhatsApp | CustomerOrch ESCALATE |
| notify_customer | (contact_id, message) -> void | WhatsApp | both orchestrators |

6.2 Why MCP-shape, not raw

Three reasons. First, the shape costs us nothing today (a tool_definition.json per tool is a 20-line file) and saves us a migration when we want to expose tools to a Phase-3 admin "talk to your business data" agent that runs in a different process — the same definitions become a real MCP server with a one-line wrapper. Second, structured-outputs strict-mode enforcement (Phase C §10) is mandatory for any tool call that touches money; MCP's inputSchema field IS a JSON schema, which feeds directly into Claude's strict-mode payload. Third, the tool definitions become the contract for the eval suite (A3) — graders check that the right tool was called with the right shape, independent of the prompt that triggered it.

6.3 Why not full MCP servers in Phase 1

Each MCP server is a separate process with its own lifecycle, transport (stdio or HTTP), and failure mode. For tools that live inside the same Python process and the same tenant transaction, that overhead buys nothing — we'd be JSON-RPC-ing to ourselves. Phase C calls out that "MCP is the right long-term bet but the cost of doing it on day 1 isn't well-quantified" — the in-process adapter is the cheap intermediate that pays the long-term debt without paying for the process boundary today.


7. LangGraph integration shape

Per the locked decision (ADR-0001 amendment), LangGraph + langgraph-checkpoint-postgres is the orchestration runtime, integrated via the spike's Option A TenantScopedSaver. This section spells out the concrete graph shape, not the framework choice.

7.1 Graph shape

One StateGraph per orchestrator type — CustomerBookingGraph and AdminGraph. Each FSM state from §3 maps to one LangGraph node. Conditional edges encode the transition table from §3.2. The state object is a Pydantic model that carries the full FSM state (the same shape that gets persisted to the checkpoint).

# illustrative pseudocode: node bodies are stubs
from typing import Literal
from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, START, END

class BookingState(BaseModel):
    # identity
    tenant_id: str
    contact_id: str | None = None
    phone: str
    channel: Literal["whatsapp", "voice"]
    language: Literal["en", "sw"] = "en"

    # FSM cursor
    current_state: Literal[
        "GREET", "IDENTIFY", "SERVICE", "STAFF",
        "SLOT", "CONFIRM", "PAY", "DONE",
        "CLARIFICATION", "ESCALATE", "ABANDON"
    ] = "GREET"

    # slots
    intent: Literal["book", "cancel", "reschedule", "inquiry"] | None = None
    service_id: str | None = None
    staff_id: str | None = None       # null = "any"
    slot: dict | None = None          # {start_at, end_at, staff_id}
    reservation_id: str | None = None
    appointment_id: str | None = None
    checkout_id: str | None = None

    # bookkeeping
    last_inbound: str
    retry_count: int = 0
    history: list[dict] = Field(default_factory=list)  # last N turns, summarized

# --- nodes ---
async def node_greet(state: BookingState) -> BookingState: ...
async def node_identify(state: BookingState) -> BookingState: ...
async def node_service(state: BookingState) -> BookingState: ...
async def node_staff(state: BookingState) -> BookingState: ...
async def node_slot(state: BookingState) -> BookingState: ...
async def node_confirm(state: BookingState) -> BookingState: ...
async def node_pay(state: BookingState) -> BookingState: ...
async def node_done(state: BookingState) -> BookingState: ...
async def node_clarification(state: BookingState) -> BookingState: ...
async def node_escalate(state: BookingState) -> BookingState: ...
async def node_abandon(state: BookingState) -> BookingState: ...

# --- conditional router ---
def route_after_greet(state: BookingState) -> str:
    # decision based on intent + confidence (set by node_greet's LLM call)
    if state.intent in ("book", "cancel", "reschedule"):
        return "IDENTIFY"
    if state.intent == "inquiry":
        return "DONE"
    return "CLARIFICATION"

# --- graph assembly ---
def build_customer_graph(checkpointer):
    g = StateGraph(BookingState)
    g.add_node("GREET", node_greet)
    g.add_node("IDENTIFY", node_identify)
    g.add_node("SERVICE", node_service)
    g.add_node("STAFF", node_staff)
    g.add_node("SLOT", node_slot)
    g.add_node("CONFIRM", node_confirm)
    g.add_node("PAY", node_pay)
    g.add_node("DONE", node_done)
    g.add_node("CLARIFICATION", node_clarification)
    g.add_node("ESCALATE", node_escalate)
    # ABANDON must be registered as a node because an edge below starts from it
    g.add_node("ABANDON", node_abandon)

    g.add_edge(START, "GREET")
    g.add_conditional_edges("GREET", route_after_greet)
    # ... (one add_conditional_edges per stateful transition)
    g.add_edge("DONE", END)
    g.add_edge("ABANDON", END)

    # ESCALATE uses interrupt() inside the node body, not an edge
    return g.compile(checkpointer=checkpointer)

7.2 TenantScopedSaver injection (per spike Option A)

The compiled graph is built per invocation, not at app startup, because the checkpointer needs to be bound to a connection whose search_path is the inbound message's tenant schema. The ingress handler shape:

# illustrative pseudocode
async def handle_inbound_whatsapp(msg: WhatsAppMessage):
    tenant = await tenant_resolver.resolve_from_business_number(msg.business_number)
    contact_or_admin = await identity_resolver.resolve(tenant, msg.from_)

    # ONE dedicated psycopg connection for this turn, search_path pre-set
    async with tenant_scoped_checkpointer.for_tenant(tenant.schema_name) as saver:
        graph = build_customer_graph(checkpointer=saver)

        thread_id = f"{tenant.id}:{msg.from_}"  # see §8 for full key shape
        config = {"configurable": {"thread_id": thread_id}}

        result = await graph.ainvoke(
            {
                "last_inbound": msg.text,
                "channel": "whatsapp",
                # ... (remaining fields elided)
            },
            config=config,
        )

    await answer_shaper.send(result, channel="whatsapp")

Three properties this guarantees:

  1. Hard tenant isolation. The connection's search_path is set to tenant_xxx, public for the lifetime of the async with block; LangGraph's hardcoded unqualified table names (per the spike §2.4) land in the right schema.
  2. No connection reuse across tenants. The for_tenant context manager either takes a per-tenant micro-pool connection or opens fresh — never a shared pool whose connection might still carry tenant A's search_path when tenant B uses it (the failure mode the spike's §2.6 warned about).
  3. Resume-able. Because thread_id is stable across turns (tenant_id:phone), every invocation that resolves to the same (tenant, customer) pair hydrates the same checkpoint and resumes from the persisted FSM cursor.

7.3 Reference

The graph shape is the standard LangGraph StateGraph + conditional-edges + checkpointer pattern. The interrupt() and Command(resume=...) primitive used inside node_escalate is the LangGraph human-in-the-loop interrupt pattern — the canonical fit for our admin handoff.


8. FSM persistence model

This section is the input to ADR-0003 (FSM persistence).

8.1 Two-tier persistence

Two stores, two roles. Phase C §3 and §10 + the spike both establish the same shape.

| Tier | Store | Lifetime | Purpose |
| --- | --- | --- | --- |
| Hot | Redis 7 | <= 30 min (TTL) | In-flight FSM cursor; idempotency dedup; rate limiting |
| Durable | Postgres (per-tenant schema) | Forever (subject to retention policy) | LangGraph checkpoint for interrupt-and-resume; audit trail; cross-channel migration |

The pattern: Redis is the read-after-write hot path; Postgres is the source of truth on cold-start and the only store that survives a 30-min idle gap.

8.2 Redis schema

| Key | Type | TTL | Value |
| --- | --- | --- | --- |
| ratiba:fsm:{tenant_id}:{phone} | hash | 1800 s (30 min) | {thread_id, current_state, last_seen_at} (the cursor pointing at the LangGraph thread) |
| ratiba:lock:reservation:{tenant_id}:{slot_key} | string (NX) | 300 s (5 min) | appointment_tentative_id; held while CONFIRM is awaited |
| ratiba:dedup:{tenant_id}:{message_id} | string | 86400 s (24 h) | "1"; 360dialog message-id idempotency |
| ratiba:rate:{phone} | string (INCR) | 60 s | inbound count per minute (anti-abuse) |
| ratiba:stkpush:{checkout_id} | hash | 300 s (5 min) | {appointment_id, thread_id, started_at}; used by STK callback to find the right thread to resume |

Eviction: TTL only. We do not evict on memory pressure — Redis is sized for the working set (estimated < 50k active conversations at 1k tenants × 50 active per tenant). If we hit the ceiling, scale Redis vertically before introducing eviction (an evicted FSM cursor without a corresponding Postgres checkpoint would orphan the conversation).
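
Centralising the key shapes and TTLs in one module keeps every writer agreeing with the schema above. A minimal sketch; the function and constant names are assumptions, while the key templates and TTL values are the ones listed in §8.2:

```python
# TTLs in seconds, as listed in the Redis schema.
FSM_TTL_S = 1800
RESERVATION_TTL_S = 300
DEDUP_TTL_S = 86_400
RATE_TTL_S = 60
STKPUSH_TTL_S = 300


def thread_id(tenant_id: str, phone: str) -> str:
    # Deterministic, so it can be rebuilt after a Redis flush.
    return f"{tenant_id}:{phone}"


def fsm_key(tenant_id: str, phone: str) -> str:
    return f"ratiba:fsm:{tenant_id}:{phone}"


def reservation_lock_key(tenant_id: str, slot_key: str) -> str:
    return f"ratiba:lock:reservation:{tenant_id}:{slot_key}"


def dedup_key(tenant_id: str, message_id: str) -> str:
    return f"ratiba:dedup:{tenant_id}:{message_id}"


def rate_key(phone: str) -> str:
    return f"ratiba:rate:{phone}"


def stkpush_key(checkout_id: str) -> str:
    return f"ratiba:stkpush:{checkout_id}"


# Working-set estimate from the eviction note: 1k tenants x 50 active each.
ESTIMATED_ACTIVE_CONVERSATIONS = 1_000 * 50
```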

8.3 Postgres durability story

Per ADR-0001 amendment, the durable store IS LangGraph's checkpoints / checkpoint_blobs / checkpoint_writes / checkpoint_migrations tables, created in each tenant schema by PostgresSaver.setup() invoked at tenant onboarding. The TenantScopedSaver wrapper binds each invocation to the tenant's schema (per spike §3 Option A and the four acceptance criteria there). Every node transition that succeeds in the LangGraph graph triggers a checkpoint write to that tenant's schema.

Retention: keep checkpoints for completed bookings for 90 days (configurable per tenant for compliance), then archive to a cold table. Abandoned bookings get the same 90-day window (useful for "why did this customer drop off" eval traces).

8.4 Recovery semantics on cold start

Three cold-start scenarios:

  1. Backend restart (Redis intact). No-op for in-flight FSMs. Redis still holds ratiba:fsm:* keys; the next inbound message resolves the thread_id and the LangGraph graph hydrates from the latest Postgres checkpoint anyway (Redis only carries the cursor, not the state).
  2. Redis flush + Postgres intact. All in-flight FSM cursors are lost from Redis. On the next inbound message for a (tenant, phone) pair, the orchestrator constructs the deterministic thread_id = "{tenant_id}:{phone}", attempts to load the latest checkpoint from Postgres for that thread_id, and either resumes (if a checkpoint exists and is not in a terminal state) or starts fresh. The ratiba:lock:reservation:* keys are gone, so any tentative reservations are released; the user is prompted to re-pick a slot. This is acceptable degradation.
  3. Postgres tenant schema lost. Catastrophic — appointments, contacts, and FSM checkpoints all gone. Out of scope for FSM design; this is a backup/restore concern.
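
The three scenarios reduce to a small decision over two inputs: whether the Redis cursor survived and what the latest Postgres checkpoint says. The sketch below is one possible reading, not a settled rule: it assumes terminal checkpoints (DONE, ABANDON) are treated as a fresh start, and that CONFIRM is the only state that depends on a Redis reservation lock. The function name and return labels are hypothetical.

```python
from typing import Optional

# Assumption: terminal checkpoints are never resumed.
TERMINAL_STATES = {"DONE", "ABANDON"}
# Assumption: only CONFIRM depends on a live Redis reservation lock.
LOCK_DEPENDENT_STATES = {"CONFIRM"}


def recovery_action(redis_cursor_present: bool,
                    checkpoint_state: Optional[str]) -> str:
    if checkpoint_state is None or checkpoint_state in TERMINAL_STATES:
        return "start_fresh"
    if not redis_cursor_present and checkpoint_state in LOCK_DEPENDENT_STATES:
        # The reservation lock went with the flush, so re-prompt for a slot.
        return "resume_at_slot"
    return "resume"
```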

8.5 Cross-channel migration (WhatsApp → voice mid-flow)

The deterministic thread_id = "{tenant_id}:{phone}" carries across channels. If a customer starts a WhatsApp booking, gets to STAFF, then calls in via voice, the voice channel resolves the same thread_id and the LangGraph graph hydrates from the same checkpoint. Two complications to handle:

  1. Channel switch in the state. The BookingState.channel field is updated to "voice" on the first voice-channel hydration; the AnswerShaper uses this to choose voice-shaped responses (≤2 sentences, no markdown) for the rest of the flow.
  2. Pending Redis lock. If a tentative reservation was held in Redis from the WhatsApp turn and is still within its 5-min TTL, the voice channel inherits it cleanly. If TTL has expired, the voice channel re-prompts for slot selection (see scenario 2 above).

This is an emergent property of the design, not a special-cased migration path — the canonical key is the customer's phone number, and the channel is just metadata.

8.6 Concurrency: parallel WhatsApp + voice for same customer

Edge case but real: customer texts on WhatsApp and immediately calls in before the WhatsApp turn completes. Two ingress handlers will race for the same thread_id. Two layers of defence:

  1. Per-thread mutex via Redis SETNX. Before invoking the graph, acquire ratiba:lock:thread:{thread_id} with 30-second TTL. The losing handler returns a deferred response ("hold on, I'm processing your previous message") and retries on a 1s backoff up to 5s. Above that, fail with a soft message ("system busy, please try again").
  2. LangGraph checkpoint write semantics. Even if mutex 1 fails (e.g., Redis split-brain), LangGraph's checkpoint writes are append-only and parent-chained — the second writer's write either chains correctly or conflicts on a unique constraint (thread_id, checkpoint_id). On conflict, the orchestrator drops the second invocation and lets the customer retry.

This is over-engineered for Phase 1 (the case is rare); the mutex is the must-have, the checkpoint conflict handling is the nice-to-have backstop.
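
The must-have mutex can be sketched as a retry loop around a SET-with-NX call. To keep the example runnable without a Redis server, it uses an in-memory stand-in that mimics `SET key value NX EX`; the class, the function name, and the synchronous `set` call are assumptions (an async Redis client would need `await`).

```python
import asyncio
import time


class InMemoryLockClient:
    """Stand-in for a Redis client; mimics SET key value NX EX semantics."""

    def __init__(self):
        self._expires_at = {}

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        held = self._expires_at.get(key, 0.0) > now
        if nx and held:
            return None  # NX failed: another handler holds the lock
        self._expires_at[key] = (now + ex) if ex is not None else float("inf")
        return True

    def delete(self, key):
        self._expires_at.pop(key, None)


async def acquire_thread_lock(client, thread_id, ttl_s=30,
                              backoff_s=1.0, max_wait_s=5.0):
    """Return True once the lock is held, False if max_wait_s elapses first."""
    key = f"ratiba:lock:thread:{thread_id}"
    waited = 0.0
    while True:
        if client.set(key, "1", nx=True, ex=ttl_s):
            return True
        if waited >= max_wait_s:
            return False  # caller sends the soft "system busy" message
        await asyncio.sleep(backoff_s)
        waited += backoff_s
```

The winning handler should delete the key once the graph invocation finishes; the 30-second TTL is only the safety net for a crashed worker.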


9. Voice-channel turn-taking

Per Phase C §8, the voice stack is LiveKit + Deepgram Nova-3 + ElevenLabs Multilingual v2 with <800 ms end-to-end as the streaming SLO and <500 ms as the "feels human" target.

9.1 Adaptive interruption (barge-in)

Use LiveKit Adaptive Interruption Handling on day 1 of voice (Phase 2). The model is trained on real conversational audio and operates after VAD identifies incoming user audio, using an audio encoder + CNN to distinguish "intentional barge-in" from "backchannel" ("uh-huh", "mm-hmm", "ndio"). Critical property for our bilingual product: per the LiveKit docs, "the adaptive interruption model is meant to be used with any spoken language" — it has not been benchmarked specifically on Swahili backchanneling but works on the audio domain (not transcript domain), so the language-independence claim is plausible. We validate this empirically on the first 20 Swahili call recordings and feed the results into the eval suite.

When adaptive interruption fires:

  1. Cancel TTS playback immediately.
  2. Flush the LiveKit audio output queue.
  3. Restart the STT segment for the user's utterance.
  4. The LangGraph node currently executing returns its in-progress state (no checkpoint write — the cancelled turn never completes).
  5. Next turn picks up from the same checkpoint as if the interrupted turn never happened.

9.2 Filler-clock

Per Phase C §8 + the existing zol-rag voice-stack pattern (captured in user memory), we play a brief "thinking sound" when the LLM call is going to take > 300 ms. The cadence:

| Elapsed since user finished speaking | Audio played |
| --- | --- |
| 0-300 ms | (silence, under perception threshold) |
| 300 ms | First filler: "mm" (en) / "mm" (sw); same audio asset, language-agnostic |
| 4 s | Second filler: "let me check" (en) / "hebu nikague" (sw); language-locked |
| 10 s | Reassurance: "still working on it" (en) / "bado nakushughulikia" (sw) |
| > 30 s | Apologetic timeout: "this is taking longer than expected, let me get someone to help" -> ESCALATE |

The filler player is a separate LiveKit audio track that plays while the LLM call is in flight; it's preempted the moment TTS for the real response starts.
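
The cadence is a step function of elapsed time, which makes it easy to express as data. A minimal sketch assuming the player polls with the elapsed time and tracks for itself which cue it has already played; the function name, cue names, and the "ESCALATE" sentinel are illustrative assumptions:

```python
# (seconds since the user stopped speaking, cue name), in ascending order.
FILLER_SCHEDULE = [
    (0.3, "filler_mm"),
    (4.0, "let_me_check"),
    (10.0, "still_working"),
]
PHRASES = {
    "filler_mm": {"en": "mm", "sw": "mm"},
    "let_me_check": {"en": "let me check", "sw": "hebu nikague"},
    "still_working": {"en": "still working on it", "sw": "bado nakushughulikia"},
}
ESCALATE_AFTER_S = 30.0


def cue_at(elapsed_s: float, language: str):
    """Return the most recent cue that is due, "ESCALATE", or None for silence."""
    if elapsed_s > ESCALATE_AFTER_S:
        return "ESCALATE"
    due = None
    for threshold, cue in FILLER_SCHEDULE:
        if elapsed_s >= threshold:
            due = cue
    return PHRASES[due][language] if due else None
```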

9.3 End-of-turn detection

Use LiveKit's default VAD-based end-of-turn detection with a 700 ms silence threshold (LiveKit docs default is 500 ms; we lengthen to 700 ms to accommodate the slower pacing observed in zol-rag's Swahili calls). This is a tunable that the eval suite revisits — too short and we cut customers off mid-utterance, too long and we feel laggy.

For the customer-facing AnswerShaper response, prefer explicit verbal handoff ("…shall I book it?") over silent end-of-turn — the question mark is a strong signal that the user should speak now, more reliable than silence detection alone.

9.4 What we do NOT use for voice

  • End-to-end audio-to-audio models (GPT-4o realtime, Claude voice) — Phase C §8 rejected these as too opaque for a multi-tenant compliance-sensitive product where we may have to debug "why did the model mishear mtoto as moto". Sequential pipeline only.
  • Multi-agent orchestration over voice — same reasoning as the §1 verdict, doubly so for voice where the latency budget cannot absorb extra LLM hops.

10. Open questions surfaced

These are explicitly punted to ADR-0003 / ADR-0005 to resolve before implementation:

10.1 For ADR-0003 (FSM persistence)

  1. Per-tenant Redis keyspaces vs shared Redis with tenant-prefixed keys? Current design (§8.2) uses prefixed keys (ratiba:fsm:{tenant_id}:...) on a shared Redis. Alternative: separate Redis databases per tenant (Redis supports 16 numbered DBs by default; not a real isolation boundary anyway) or full Redis instances per tenant (operational overkill). The prefix-shared model is the recommendation; needs ADR ratification.
  2. Postgres checkpoint retention policy. §8.3 suggested "90 days then cold-archive" but the cold-archive mechanism is unspecified. Options: a separate checkpoints_archive table per tenant (same schema), a single archive schema across all tenants (breaks the isolation story), or just letting pg_dump + drop be the archive. ADR-0003 should pick one.
  3. Cross-tenant pool sizing. §8 implies one micro-pool per tenant for the psycopg checkpoint connections. At 1k tenants × 2 idle connections = 2k idle connections, which is well beyond what a single Postgres instance can hold without a pooler (default max_connections=100; a tuned VPS can do 500-1000). ADR-0003 needs to specify pool size, eviction, and whether we use PgBouncer for the checkpoint path (probably yes).
  4. Reservation lock semantics under voice + WhatsApp race. §8.6 sketched a Redis SETNX mutex but did not specify the exact retry/fallback semantics for the user-facing message ("system busy, try again" feels rough). ADR-0003 could decide whether to instead serialize via Postgres advisory locks or via a per-thread asyncio queue inside the worker.
  5. Redis flush recovery: do we preserve abandoned bookings? §8.4 scenario 2 said we resume from Postgres if a checkpoint exists. But for an ABANDON checkpoint, do we resume into ABANDON (continuing the abandoned state, which would be odd) or treat the conversation as fresh? The clean answer is "the checkpoint state machine has a final flag; ABANDON is final; resume is a no-op for final states", but ADR-0003 should make it explicit.
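The "final flag" answer to item 5 can be sketched in a few lines. This is one possible reading, pending ADR-0003: `Checkpoint`, `FINAL_STATES`, and `resume_after_flush` are hypothetical names, `COMPLETED` and `CONFIRM` are placeholder state names, and only `ABANDON` comes from the design above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: final states are a fixed set; only ABANDON is named in the design.
FINAL_STATES = frozenset({"ABANDON", "COMPLETED"})


@dataclass(frozen=True)
class Checkpoint:
    thread_id: str
    state: str

    @property
    def final(self) -> bool:
        return self.state in FINAL_STATES


def resume_after_flush(checkpoint: Optional[Checkpoint]) -> Optional[Checkpoint]:
    """Decide what to resume into after a Redis flush.

    Returns the Postgres checkpoint to resume from, or None when the
    conversation should start fresh (no checkpoint, or a final one).
    """
    if checkpoint is None or checkpoint.final:
        return None
    return checkpoint
```

Under this rule an in-flight booking resumes where it left off, while an abandoned one is treated exactly like a thread with no checkpoint at all.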

10.2 For ADR-0005 (orchestration model)

  1. Who owns the intent classifier prompt? §4.3 says Claude Haiku with strict structured outputs. The prompt itself is a load-bearing artifact (the eval suite gates changes to it per Phase C). ADR-0005 should establish where the prompt lives (code vs Langfuse vs both per the Phase C §10 open question on prompt versioning).
  2. Bilingual ambiguity escalation. What happens when the LLM returns language="en" with confidence < 0.7 — do we ask the user explicitly ("English or Swahili?") or default and let them correct? §5.2 set 0.70 as the threshold but didn't specify the disambiguation UX.
  3. Admin handoff briefing card shape. §3.3 said "structured briefing card (last 5 turns + extracted slots so far)". The exact shape of that card (LLM-generated summary? raw transcript? both?) was the same open question Phase C surfaced in Appendix B / A2. Carry forward to ADR-0005 with explicit options.
  4. Multi-turn LLM cost ceiling per booking. A worst-case booking with 3 clarification rounds + escalation could hit 8-10 LLM calls. At Haiku pricing this is ~$0.01 per booking, which is fine; at frontier-model pricing it would be ~$0.30. ADR-0005 should set a cost SLO per booking and specify the model routing rule (Haiku for classifier + slot extraction, frontier model only for clarification rephrasing).
  5. AdminOrchestrator state machine. §3.1 sketched a much shallower FSM for admin (IDLE -> ROUTED -> AWAIT_CONFIRMATION -> EXECUTED -> IDLE) but did not specify the state object, checkpoint shape, or whether admin sessions have a TTL at all. Defer to ADR-0005 — admin flows are open-ended and may benefit from a different shape entirely (e.g., no FSM, just a dispatcher).
  6. MCP tool registry ownership. §6 said tools live in a tools/ package with MCP-shape definitions. Who owns the catalogue, who reviews tool additions, how do we version the tool schemas? ADR-0005 should establish.
  7. Error budget for the LLM strict-mode classifier. Strict mode guarantees that the output conforms to the schema, but the model can still return nonsense values within the schema (e.g., intent="book" when the user asked to cancel). Phase C §10 listed "LLM-as-judge for eval grading" as an open question, which is related. ADR-0005 should declare what counts as a classifier failure and how we measure it.
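The cost arithmetic in item 4 is worth making explicit, because the routing rule changes the worst case. The sketch below is illustrative only: the per-call prices are back-derived from the figures above (~$0.01 and ~$0.30 per 10 calls) rather than taken from any price list, and the call kinds, the 4/3/3 split, and the `booking_cost_usd` helper are assumptions.

```python
# Assumed per-call prices, back-derived from the estimates in item 4.
PRICE_PER_CALL_USD = {
    "haiku": 0.001,    # ~$0.01 per booking / 10 calls
    "frontier": 0.03,  # ~$0.30 per booking / 10 calls
}

# Routing rule proposed in item 4.
ROUTING = {
    "classify": "haiku",
    "extract_slots": "haiku",
    "rephrase_clarification": "frontier",
}


def booking_cost_usd(calls: dict) -> float:
    """Total LLM cost for one booking, given a count of calls per kind."""
    return sum(PRICE_PER_CALL_USD[ROUTING[kind]] * n for kind, n in calls.items())


def within_slo(calls: dict, slo_usd: float) -> bool:
    return booking_cost_usd(calls) <= slo_usd


# Hypothetical worst case: 10 calls, 3 of them clarification rephrasings.
worst_case = {"classify": 4, "extract_slots": 3, "rephrase_clarification": 3}
```

Under these assumed prices the worst case comes to roughly $0.097 per booking, dominated by the three frontier-model calls, so whatever SLO ADR-0005 sets effectively bounds the number of clarification rounds.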

Sources