Orchestration Patterns for CustomerOrchestrator and AdminOrchestrator
Status: A1 deliverable. Inputs to ADR-0003 (FSM persistence) and candidate ADR-0005 (orchestration model). Audience: Adrian (solo founder); future Ratiba contributors who will implement M3 onwards. Voice: Opinionated. One fork picked per question. Build on the locked decisions in the ADR-0001 amendment of the same date.
1. Recommended orchestration model
Verdict: Hybrid — deterministic FSM as the spine, with bounded LLM agent loops attached as escape hatches at three named seams: (a) intent classification, (b) slot extraction under ambiguity, and (c) clarification rephrasing.
The booking flow is a workflow, not an agent task, in the Anthropic taxonomy — the steps are knowable in advance (greet, identify, service, slot, confirm, pay), the failure modes are knowable in advance (slot unavailable, payment timeout, customer changes mind), and the cost of an agent "creatively" reordering them is real money. Pure FSM, however, is too brittle for natural Swahili and English where customers say "tafadhali nipange masaji ya tishu kesho saa nane" (one utterance, three slots, one language switch) — a hard-coded regex layer collapses on its first encounter with a real human. The hybrid shape keeps the spine deterministic so the audit trail and the test suite stay tractable, while the LLM does what it is uniquely good at: parsing messy multilingual utterances into structured slot updates and producing human-shaped clarifications when a slot is missing or ambiguous. This shape also matches the load-bearing constraint from Phase C that "conversation state IS canonical state for in-flight bookings" — the FSM is the canonical state and the LLM is a stateless transformation over it.
2. Why-not on the rejected alternatives
2.1 Why-not FSM-only (pure rules, no LLM)
A pure-FSM CustomerOrchestrator would treat every inbound utterance as a string to be matched against keyword tables, dispatch on the match, and reject anything unmatched into a clarification state. This is tempting because it is testable end-to-end with no flaky LLM calls, costs zero per turn, and behaves identically across deploys. It dies on contact with bilingual reality. Swahili appointment booking does not look like English: "kesho saa tatu asubuhi" is "tomorrow at 9 AM" but uses the Swahili clock convention (saa tatu = 3rd hour after dawn = 9 AM, not 3 AM); "Jumatatu" is Monday even though "tatu" means three, because the Swahili week is counted from Saturday; English speakers say "next Tuesday" and mean either the immediately upcoming Tuesday or the one after depending on whether today is past Wednesday. Hand-coding this with high recall requires either a curated phrasebook per tenant (operational nightmare for a 1-FTE team) or the eventual realization that you are reinventing intent classification badly. We would also pay this cost twice: once on intent ("did the customer mean book or reschedule?") and again on slot extraction ("was 'Mary' the customer name or the staff name?"). The right answer is to let an LLM do the parsing it is already trained for and use the FSM to enforce the invariants the LLM is bad at (one decision per turn, no double-booking, no skipped payment).
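To make the clock-convention trap concrete, here is a minimal sketch of the conversion a rules layer would need. The function name `swahili_hour_to_24h` and the `is_night` flag are hypothetical, and the 06:00 / 18:00 cycle starts are a simplifying assumption. Deciding which cycle an utterance means (from words like "asubuhi" or "usiku", or from context when they are absent) is the hard part and is not modelled here.

```python
def swahili_hour_to_24h(saa: int, is_night: bool) -> int:
    """Convert a Swahili clock hour (1-12) to a 24-hour clock hour.

    Assumption: the day cycle starts counting at 06:00 and the night
    cycle at 18:00, so "saa tatu" in the day cycle is 09:00.
    """
    if not 1 <= saa <= 12:
        raise ValueError(f"saa must be in 1..12, got {saa}")
    offset = 18 if is_night else 6
    return (saa + offset) % 24


# "saa tatu asubuhi" (3rd hour, day cycle) is 09:00
nine_am = swahili_hour_to_24h(3, is_night=False)
# "saa nane" in the day cycle is 14:00
two_pm = swahili_hour_to_24h(8, is_night=False)
# "saa moja usiku" (1st hour, night cycle) is 19:00
seven_pm = swahili_hour_to_24h(1, is_night=True)
```

The arithmetic is trivial; the ambiguity about which cycle applies is what pushes this work to the LLM seam.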
2.2 Why-not agent-loop-with-tools (full LLM-in-charge ReAct)
The pure-agent shape — model in charge, FSM dissolved into a tool catalogue (list_services, find_slots, confirm_booking, initiate_payment) — is what most "agent framework" demos show, and it is what CrewAI and the OpenAI Agents SDK optimise for. It is wrong for Ratiba on three counts. First, the booking flow has hard real-world invariants — BOOKING_PAYMENT cannot precede BOOKING_CONFIRM, you cannot complete an appointment that was never confirmed, the M-Pesa STK push must reference an appointment_id that exists — and an LLM-in-charge architecture pushes those invariants into prompt instructions ("never call initiate_payment before confirm_booking") which is the weakest place to enforce them; the same model that is being asked to follow the rule is the one deciding whether to follow it. Second, the latency budget is unforgiving on voice (target end-to-end <800 ms per LiveKit's pipeline writeup) and a free-form ReAct loop with tool reflection routinely needs 3–5 LLM hops per turn; a hybrid shape can do the same turn in 1 LLM hop because the FSM already knows what state to advance to once the slot is filled. Third, the Anthropic effective-harnesses post explicitly recommends single-loop with disciplined planning over agent-in-charge for "tightly coupled" workflows — booking is the canonical tightly-coupled workflow.
3. State machine sketch
The canonical CustomerOrchestrator FSM. States are uppercase; the LLM-augmented seams are flagged inline. AdminOrchestrator uses a different, looser FSM (more LLM-driven because admin utterances are open-ended) and is sketched separately.
3.1 CustomerOrchestrator booking-flow FSM

                            +-----------+
                            |   IDLE    |  <----------+
                            +-----+-----+             |
                                  |                   |
                inbound message   |                   |
                       |          v                   |
                       |    +-------------+           |
                       +--->|    GREET    |           |
                            +-----+-------+           |
                                  |                   |
                        intent classify (LLM)         |
                                  |                   |
    +--------+--------+--------+--+--+--------+-------+
    |        |        |        |     |        |       |
    v        v        v        v     v        v       v
    IDENTIFY BOOK_* CANCEL_* RESCHED INQUIRY UNKNOWN ABANDON
                                              |       ^
                                              v       |
                                        CLARIFICATION |
                                              |       |
                                         (>=3 fails)  |
                                              +-------+
                                              |
                                              v
                                          ESCALATE

The booking sub-flow:

    GREET --> IDENTIFY   (auto-create contact if unknown phone)
      |          |
      |          v
      |     +----+----+
      +-----> SERVICE | <-- LLM slot extract from initial utterance if present
            +----+----+
                 |
                 v   (auto-skip if 1 staff for service)
            +----+----+
            |  STAFF  |
            +----+----+
                 |
                 v
            +----+----+
            |  SLOT   | <-- SchedulingService.get_available_slots
            +----+----+
                 |
                 v
            +----+----+
            | CONFIRM | <-- LLM AnswerShaper produces summary; user clicks button
            +----+----+
                 |
                 +---if mpesa_enabled---> PAY ----> DONE
                 |
                 +---else--------------------------> DONE

3.2 Transitions and conditions
| From | To | Condition | Side-effect |
|---|---|---|---|
| IDLE | GREET | Any inbound message on a fresh thread (no Redis FSM key) | Create FSM thread, persist via TenantScopedSaver |
| GREET | IDENTIFY | Intent classifier returns book with confidence >= 0.85 | Persist intent=book slot |
| GREET | CLARIFICATION | Intent confidence in [0.40, 0.85) | LLM rephrase; offer 3 reply buttons |
| GREET | UNKNOWN | Intent confidence < 0.40 OR no intent | Generic fallback message |
| IDENTIFY | SERVICE | Contact exists or auto-created | Set contact_id in state |
| SERVICE | STAFF | Service slot filled (button click OR LLM-extracted with confidence >= 0.85) | Persist service_id |
| SERVICE | CLARIFICATION | Slot ambiguous (e.g., "massage" matches 3 services) | LLM disambiguation question |
| STAFF | SLOT | Staff slot filled, OR service has only 1 eligible staff (auto-fill), OR user said "anyone" (set null) | Persist staff_id |
| SLOT | CONFIRM | User selected one of offered slots (button) | Lock-pending: tentative reservation in Redis, 5 min TTL |
| CONFIRM | PAY | User clicked Confirm AND tenant.mpesa_enabled == true | appointments row INSERT with status pending, payment_status pending; trigger STK push |
| CONFIRM | DONE | User clicked Confirm AND tenant.mpesa_enabled == false | appointments row INSERT with status confirmed, payment_status unpaid; release Redis lock |
| CONFIRM | SLOT | User clicked Change | Discard tentative; re-enter slot selection |
| CONFIRM | ABANDON | User clicked Cancel | DELETE Redis tentative; clean exit |
| PAY | DONE | M-Pesa callback ResultCode==0 arrives within 60s | Update appointment to confirmed, paid; send WhatsApp confirmation |
| PAY | CLARIFICATION | Callback ResultCode!=0 (timeout, cancelled, NSF) | Offer Retry / Cancel buttons |
| PAY | ABANDON | 60s STK timeout AND no callback AND user idle 2 min | Cancel tentative appointment; soft-deliver "we did not see your payment" |
| any non-terminal | CLARIFICATION | Inbound utterance does not match expected slot AND retry_count < 3 | Increment retry_count, ask LLM to re-prompt |
| CLARIFICATION | ESCALATE | retry_count >= 3 | Notify admin via WhatsApp; pause customer thread (interrupt-and-resume) |
| ESCALATE | IDLE | Admin completes handoff (button: "Hand back to bot") | Resume agent on next inbound message |
| any | ABANDON | 30 min Redis TTL elapses with no inbound message | Persist final state to Postgres checkpoint; delete Redis key |
| DONE | IDLE | Always (on next inbound message) | Fresh thread |
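The point of keeping this table in code rather than in a prompt is that illegal moves can be rejected mechanically. A minimal sketch of that guard follows; the names `ALLOWED`, `TERMINAL`, and `is_allowed` are hypothetical, and the conditions column (confidence, retry counts, timeouts) is deliberately left out so that only the shape of the graph is checked.

```python
from typing import Optional

# Explicit edges copied from the transition table (conditions omitted).
ALLOWED = {
    "IDLE": {"GREET"},
    "GREET": {"IDENTIFY", "CLARIFICATION", "UNKNOWN"},
    "IDENTIFY": {"SERVICE"},
    "SERVICE": {"STAFF", "CLARIFICATION"},
    "STAFF": {"SLOT"},
    "SLOT": {"CONFIRM"},
    "CONFIRM": {"PAY", "DONE", "SLOT", "ABANDON"},
    "PAY": {"DONE", "CLARIFICATION", "ABANDON"},
    "CLARIFICATION": {"ESCALATE"},
    "ESCALATE": {"IDLE"},
    "DONE": {"IDLE"},
}
TERMINAL = frozenset({"DONE", "ABANDON"})


def is_allowed(src: str, dst: str, origin: Optional[str] = None) -> bool:
    """Return True if the table permits moving from src to dst.

    origin is the state that entered CLARIFICATION, if any; the sad-path
    notes say CLARIFICATION loops back to its originating state.
    """
    if dst == "ABANDON":  # "any -> ABANDON" row (TTL expiry)
        return True
    if dst == "CLARIFICATION" and src not in TERMINAL:  # wildcard row
        return True
    if src == "CLARIFICATION" and origin is not None and dst == origin:
        return True
    return dst in ALLOWED.get(src, set())
```

With a guard like this in front of every state write, a misbehaving LLM seam can at worst trigger a clarification. It cannot jump from SLOT straight to PAY.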
3.3 Sad-path notes
- `CLARIFICATION` is the catch-all for "the slot you wanted is unfilled or ambiguous". It is not a terminal state: it loops back to the originating state with a refined question. Hard cap at 3 loops to prevent infinite re-asking.
- `ESCALATE` uses the LangGraph `interrupt()` primitive. The graph state is checkpointed, the customer is told "I'm going to hand you over to the team", and the admin's WhatsApp gets a structured briefing card (last 5 turns + extracted slots so far). On admin "Hand back to bot", the graph resumes from `ESCALATE` into `IDLE`.
- `ABANDON` is a clean terminal: the Redis key is deleted, but the Postgres checkpoint is retained for audit + analytics. The next inbound message starts a fresh thread.
- The AdminOrchestrator FSM is much shallower because admin utterances are inherently open-ended ("how's tomorrow looking", "block Jane Thursday afternoon"). Its only states are `IDLE -> ROUTED -> AWAIT_CONFIRMATION -> EXECUTED -> IDLE`, with the heavy lifting happening inside `ROUTED`'s LLM intent + entity extraction (see §4.3).
4. Intent routing
4.1 Customer side: hybrid (rules-first, LLM-fallback)
Use rules-first pattern matching for the cheap-and-obvious cases, LLM-fallback for everything else. Concrete protocol per inbound utterance:
1. Channel pre-routing. If the inbound is a WhatsApp button reply (`type: "interactive"`), the button `id` IS the intent; no classifier is needed. This handles ~60% of customer turns once the FSM is past `GREET`.
2. Keyword fast-path. A small per-language regex table catches high-confidence single-word utterances: `^(book|booking|appointment|kuhifadhi|nipange)\b` → `book` with confidence 0.95; `^(cancel|cancellation|sitaki|futa)\b` → `cancel` with confidence 0.95. If matched, skip step 3.
3. LLM classifier (fallback). Use a nano-class model (Claude Haiku) with structured outputs in strict mode. Phase C §10 mandates strict mode for any agent → tool call, and intent routing IS a tool call. Schema:

       {
         "name": "ClassifyCustomerIntent",
         "schema": {
           "type": "object",
           "properties": {
             "intent": {
               "type": "string",
               "enum": ["book", "cancel", "reschedule", "inquiry", "greeting", "unknown"]
             },
             "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
             "language": { "type": "string", "enum": ["en", "sw"] },
             "extracted_slots": {
               "type": "object",
               "properties": {
                 "service_hint": { "type": ["string", "null"] },
                 "date_hint": { "type": ["string", "null"] },
                 "time_hint": { "type": ["string", "null"] },
                 "staff_hint": { "type": ["string", "null"] }
               },
               "required": ["service_hint", "date_hint", "time_hint", "staff_hint"],
               "additionalProperties": false
             }
           },
           "required": ["intent", "confidence", "language", "extracted_slots"],
           "additionalProperties": false
         },
         "strict": true
       }

The classifier returns `extracted_slots` opportunistically: if the user said "nipange masaji kesho saa nane", we want `service_hint="masaji"`, `date_hint="kesho"`, `time_hint="saa nane"` all populated in one LLM call so the FSM can skip ahead.
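The three-step protocol above can be sketched as a single routing function. This is an illustration under stated assumptions, not the implementation: `Routed`, `route_inbound`, and the injected `llm_classify` callable are hypothetical names, case-insensitive matching is an assumption, and the confidence of 1.0 for button replies is a stand-in for "no classifier needed". The two regexes and the 0.95 confidence come from the text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Keyword fast-path table, patterns taken from the protocol above.
KEYWORD_RULES = [
    (re.compile(r"^(book|booking|appointment|kuhifadhi|nipange)\b", re.IGNORECASE), "book"),
    (re.compile(r"^(cancel|cancellation|sitaki|futa)\b", re.IGNORECASE), "cancel"),
]


@dataclass(frozen=True)
class Routed:
    intent: str
    confidence: float
    source: str  # "button", "keyword", or "llm"


def route_inbound(
    text: str,
    button_id: Optional[str],
    llm_classify: Callable[[str], Tuple[str, float]],
) -> Routed:
    # Step 1: a button reply carries its intent in the button id.
    if button_id is not None:
        return Routed(button_id, 1.0, "button")
    # Step 2: cheap regex fast-path; first match wins.
    stripped = text.strip()
    for pattern, intent in KEYWORD_RULES:
        if pattern.match(stripped):
            return Routed(intent, 0.95, "keyword")
    # Step 3: fall back to the (injected) LLM classifier.
    intent, confidence = llm_classify(text)
    return Routed(intent, confidence, "llm")
```

Injecting `llm_classify` keeps the function testable without a model call; in tests it can be a lambda returning a fixed `(intent, confidence)` pair.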
4.2 Confidence thresholds
| Confidence band | Action |
|---|---|
| >= 0.85 | Auto-route. Treat as decided. |
| >= 0.40 and < 0.85 | Route to CLARIFICATION with the LLM's rephrased question. |
| < 0.40 | Route to UNKNOWN with generic fallback "I didn't catch that. Would you like to book, cancel, or ask a question?" |
These thresholds are starting values, not gospel. They get tuned against the eval suite (A3) once we have 20+ real Swahili transcripts.
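As a sketch of how these bands could be expressed so that the eval suite can tune them in one place (the constant and function names here are hypothetical):

```python
# Starting values from the confidence-band table; expected to be tuned.
AUTO_ROUTE_MIN = 0.85
CLARIFY_MIN = 0.40


def action_for_confidence(confidence: float) -> str:
    """Map a classifier confidence to the routing action for its band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= AUTO_ROUTE_MIN:
        return "AUTO_ROUTE"
    if confidence >= CLARIFY_MIN:
        return "CLARIFICATION"
    return "UNKNOWN"
```

Both boundaries are inclusive on the upper band, matching the table: exactly 0.85 auto-routes and exactly 0.40 goes to clarification.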
4.3 Admin side: LLM-only with structured outputs
Admin utterances are too open-ended to benefit from rules ("how's tomorrow looking", "what did we make this week", "block Jane off Thursday afternoon", "send Mary a reminder about her appointment", "create a new service: hot stone, 90 minutes, 5000 bob"). Use Claude Haiku with strict structured outputs against the AdminIntent enum from PRD §4 Module 8. No keyword fast-path — the grammar of admin commands is not regular. Same >= 0.85 threshold for auto-execute, with the difference that any AdminIntent of *_DELETE or BROADCAST_* always triggers AWAIT_CONFIRMATION regardless of confidence (§5 irreversibility rule).
5. Slot filling under uncertainty
Three concrete heuristics, in order of precedence:
5.1 The irreversibility rule (always confirm)
If executing the slot's downstream action is irreversible or expensive to undo, always confirm, regardless of confidence. Irreversible actions for Ratiba:
- `confirm_booking` (writes to `appointments`, may trigger M-Pesa STK push that costs the customer money)
- `cancel_booking` (notifies customer, may trigger refund)
- `initiate_stk_push` (debits the customer's phone)
- `broadcast_message` (admin-side; can spam an entire customer base)
- `delete_service` / `disable_service` (admin-side; may cancel future bookings)
- `block_staff_time` (admin-side; may cancel existing bookings)
These all transition through an explicit AWAIT_CONFIRMATION button-only step. No exceptions, no "we're 99% sure".
5.2 The confidence rule (confirm if uncertain)
For reversible slot fills, confirm if confidence < 0.85 OR if the FSM has rolled back from this slot already in this session (the user said "no, change" once). Concretely:
| Slot | Confidence threshold for auto-fill |
|---|---|
| intent | 0.85 |
| service_id | 0.85 (the user can change it next turn cheaply) |
| staff_id | 0.80 (cheaper still: most services have 2-3 staff) |
| slot (date+time) | 0.95 (high cost of misbooking; most users will have clicked a button anyway) |
| language | 0.70 (we can switch back if we got it wrong) |
| name (contact name extraction) | 0.90 (we will reuse this for months, so better to ask than guess wrong) |
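Rules 5.1 and 5.2 compose into one predicate, with irreversibility taking precedence over confidence. A minimal sketch, assuming hypothetical names (`needs_confirmation`, `downstream_action`, `rolled_back`); the action set and thresholds are copied from the lists above:

```python
from typing import Optional

# 5.1: actions that always require an explicit confirmation step.
IRREVERSIBLE_ACTIONS = frozenset({
    "confirm_booking", "cancel_booking", "initiate_stk_push",
    "broadcast_message", "delete_service", "disable_service",
    "block_staff_time",
})

# 5.2: per-slot auto-fill thresholds.
AUTO_FILL_THRESHOLDS = {
    "intent": 0.85,
    "service_id": 0.85,
    "staff_id": 0.80,
    "slot": 0.95,
    "language": 0.70,
    "name": 0.90,
}


def needs_confirmation(
    slot: str,
    confidence: float,
    downstream_action: Optional[str] = None,
    rolled_back: bool = False,
) -> bool:
    """Return True if the slot fill must be confirmed with the user."""
    if downstream_action in IRREVERSIBLE_ACTIONS:
        return True  # 5.1 wins regardless of confidence
    if rolled_back:
        return True  # the user already said "no, change" for this slot
    return confidence < AUTO_FILL_THRESHOLDS[slot]
```

For example, `needs_confirmation("staff_id", 0.82)` is false, while the same confidence on `slot` would require confirmation because its threshold is 0.95.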
5.3 The "ask once, assume after" rule
Some slots benefit from optimistic assumption with a corrective backstop. The pattern: extract the slot from context (e.g., service_hint from initial utterance), present the next-state response as if the slot were filled, and let the user correct. Example:
- Customer: "nipange masaji kesho saa nane"
- LLM extracts: `service_hint="masaji"`, `date_hint="kesho"`, `time_hint="saa nane"` with mid-confidence (0.70-0.80).
- Bot does NOT ask "did you mean massage?". Instead it skips to slot selection: "Sawa, masaji ya tishu kesho saa nane. Ningekupa nafasi hizi:" (OK, deep tissue massage tomorrow at 2 PM, since saa nane is the 8th hour after dawn. I have these slots:) and presents a list. If the customer wanted Swedish massage instead, they'll correct via "no, swedish", which the LLM picks up in one round-trip.
This violates the strict confidence threshold but pays for itself in turn count. Limit to service_id and staff_id slots only — never apply to date/time (cost of misbooking) or to anything irreversible.
5.4 The clarification budget
Hard cap of 3 clarification rounds per booking attempt. After 3, transition to ESCALATE automatically. This is a guard against infinite re-prompting, not a hard product rule — the eval suite will tell us if 3 is the right number.
6. Tool-calling pattern
Verdict: hybrid — internal services (Scheduling, CRM, Payments, Catalog) exposed as MCP-shaped tool definitions, registered into the LangGraph node via the standard tool-call schema. The "MCP-shape" means the tool definitions live in a dedicated tools/ package with the JSON-schema shape MCP requires (name, description, inputSchema), but Phase 1 dispatches them in-process via direct Python function calls rather than spinning up actual MCP servers.
Phase C §1 made the call: "Adopt MCP as the internal tool-bus shape so calendar/M-Pesa/CRM tools speak the same protocol regardless of which model we route to." The trade-off is well-understood: full MCP servers add a process boundary and JSON-RPC overhead per tool call (cost: 5-15 ms extra latency, plus container management); raw Anthropic tool schemas lock us to Anthropic's wire format. The hybrid is: write the tool definitions in MCP shape so the surface is portable, but execute them in-process via a thin adapter for now.
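A minimal sketch of what "MCP-shaped but dispatched in-process" could look like. Everything here is illustrative: `ToolDefinition`, `REGISTRY`, `register`, `dispatch`, and the stub handler are hypothetical names, and the argument check is a deliberately thin stand-in for real JSON-schema validation. Only the three field names (`name`, `description`, `inputSchema`) come from the MCP shape described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolDefinition:
    name: str
    description: str
    inputSchema: Dict[str, Any]  # JSON schema; field name follows the MCP shape
    handler: Callable[..., Any]  # in-process Python callable (Phase 1)

    def to_mcp_dict(self) -> Dict[str, Any]:
        """The portable part of the definition, without the handler."""
        return {
            "name": self.name,
            "description": self.description,
            "inputSchema": self.inputSchema,
        }


REGISTRY: Dict[str, ToolDefinition] = {}


def register(tool: ToolDefinition) -> None:
    REGISTRY[tool.name] = tool


def dispatch(name: str, arguments: Dict[str, Any]) -> Any:
    """Call a registered tool directly, with a thin argument check."""
    tool = REGISTRY[name]
    required = tool.inputSchema.get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError(f"{name}: missing arguments {missing}")
    unknown = set(arguments) - set(tool.inputSchema.get("properties", {}))
    if unknown:
        raise ValueError(f"{name}: unknown arguments {sorted(unknown)}")
    return tool.handler(**arguments)


def _list_active_services(tenant_id: str) -> list:
    # Hypothetical stub standing in for the Catalog service.
    return [{"id": "svc_1", "tenant_id": tenant_id, "name": "Deep tissue massage"}]


register(ToolDefinition(
    name="list_active_services",
    description="List the services a tenant currently offers.",
    inputSchema={
        "type": "object",
        "properties": {"tenant_id": {"type": "string"}},
        "required": ["tenant_id"],
        "additionalProperties": False,
    },
    handler=_list_active_services,
))
```

Moving to a real MCP server later would mean serving `to_mcp_dict()` over a transport and routing calls to the same handlers; the definitions themselves would not change.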
6.1 Tool catalogue (Phase 1)
| Tool | Surface | Owner | Used by |
|---|---|---|---|
| list_active_services | (tenant_id) -> list[Service] | Catalog | CustomerOrch SERVICE state, AdminOrch LIST_SERVICES |
| list_eligible_staff | (tenant_id, service_id) -> list[Staff] | Catalog | CustomerOrch STAFF state, AdminOrch |
| get_available_slots | (tenant_id, service_id, staff_id?, date) -> list[Slot] | Scheduling | CustomerOrch SLOT state |
| tentatively_reserve | (tenant_id, slot, ttl_seconds=300) -> reservation_id | Scheduling | CustomerOrch CONFIRM state |
| confirm_booking | (reservation_id, contact_id) -> appointment | Scheduling | CustomerOrch transition CONFIRM->DONE |
| cancel_booking | (appointment_id, reason) -> appointment | Scheduling | both orchestrators |
| initiate_stk_push | (appointment_id, amount, phone) -> checkout_id | Payments | CustomerOrch PAY state |
| lookup_contact | (tenant_id, phone) -> contact? | CRM | both orchestrators |
| upsert_contact | (tenant_id, phone, name?, language?) -> contact | CRM | identity resolver |
| query_bookings | (tenant_id, date_range, status?, staff?) -> list[appointment] | Scheduling | AdminOrch BOOKINGS_* |
| query_revenue | (tenant_id, date_range) -> {total, by_day} | Scheduling | AdminOrch REVENUE_* |
| notify_admin | (tenant_id, message_card) -> void | | CustomerOrch ESCALATE |
| notify_customer | (contact_id, message) -> void | | both orchestrators |
6.2 Why MCP-shape, not raw
Three reasons. First, the shape costs us nothing today (a tool_definition.json per tool is a 20-line file) and saves us a migration when we want to expose tools to a Phase-3 admin "talk to your business data" agent that runs in a different process — the same definitions become a real MCP server with a one-line wrapper. Second, structured-outputs strict-mode enforcement (Phase C §10) is mandatory for any tool call that touches money; MCP's inputSchema field IS a JSON schema, which feeds directly into Claude's strict-mode payload. Third, the tool definitions become the contract for the eval suite (A3) — graders check that the right tool was called with the right shape, independent of the prompt that triggered it.
6.3 Why not full MCP servers in Phase 1
Each MCP server is a separate process with its own lifecycle, transport (stdio or HTTP), and failure mode. For tools that live inside the same Python process and the same tenant transaction, that overhead buys nothing — we'd be JSON-RPC-ing to ourselves. Phase C calls out that "MCP is the right long-term bet but the cost of doing it on day 1 isn't well-quantified" — the in-process adapter is the cheap intermediate that pays the long-term debt without paying for the process boundary today.
7. LangGraph integration shape
Per the locked decision (ADR-0001 amendment), LangGraph + langgraph-checkpoint-postgres is the orchestration runtime, integrated via the spike's Option A TenantScopedSaver. This section spells out the concrete graph shape, not the framework choice.
7.1 Graph shape
One StateGraph per orchestrator type — CustomerBookingGraph and AdminGraph. Each FSM state from §3 maps to one LangGraph node. Conditional edges encode the transition table from §3.2. The state object is a Pydantic model that carries the full FSM state (the same shape that gets persisted to the checkpoint).

    # illustrative pseudocode: node bodies are stubs, so this is NOT runnable as-is
    from typing import Literal

    from pydantic import BaseModel, Field
    from langgraph.graph import StateGraph, START, END


    class BookingState(BaseModel):
        # identity
        tenant_id: str
        contact_id: str | None = None
        phone: str
        channel: Literal["whatsapp", "voice"]
        language: Literal["en", "sw"] = "en"
        # FSM cursor
        current_state: Literal[
            "GREET", "IDENTIFY", "SERVICE", "STAFF",
            "SLOT", "CONFIRM", "PAY", "DONE",
            "CLARIFICATION", "ESCALATE", "ABANDON",
        ] = "GREET"
        # slots
        intent: Literal["book", "cancel", "reschedule", "inquiry"] | None = None
        service_id: str | None = None
        staff_id: str | None = None          # null = "any"
        slot: dict | None = None             # {start_at, end_at, staff_id}
        reservation_id: str | None = None
        appointment_id: str | None = None
        checkout_id: str | None = None
        # bookkeeping
        last_inbound: str
        retry_count: int = 0
        history: list[dict] = Field(default_factory=list)  # last N turns, summarized


    # --- nodes (one per FSM state) ---
    async def node_greet(state: BookingState) -> BookingState: ...
    async def node_identify(state: BookingState) -> BookingState: ...
    async def node_service(state: BookingState) -> BookingState: ...
    async def node_staff(state: BookingState) -> BookingState: ...
    async def node_slot(state: BookingState) -> BookingState: ...
    async def node_confirm(state: BookingState) -> BookingState: ...
    async def node_pay(state: BookingState) -> BookingState: ...
    async def node_done(state: BookingState) -> BookingState: ...
    async def node_clarification(state: BookingState) -> BookingState: ...
    async def node_escalate(state: BookingState) -> BookingState: ...
    async def node_abandon(state: BookingState) -> BookingState: ...


    # --- conditional router ---
    def route_after_greet(state: BookingState) -> str:
        # decision based on intent + confidence (set by node_greet's LLM call)
        if state.intent in ("book", "cancel", "reschedule"):
            return "IDENTIFY"
        if state.intent == "inquiry":
            return "DONE"
        return "CLARIFICATION"


    # --- graph assembly ---
    def build_customer_graph(checkpointer):
        g = StateGraph(BookingState)
        g.add_node("GREET", node_greet)
        g.add_node("IDENTIFY", node_identify)
        g.add_node("SERVICE", node_service)
        g.add_node("STAFF", node_staff)
        g.add_node("SLOT", node_slot)
        g.add_node("CONFIRM", node_confirm)
        g.add_node("PAY", node_pay)
        g.add_node("DONE", node_done)
        g.add_node("CLARIFICATION", node_clarification)
        g.add_node("ESCALATE", node_escalate)
        # ABANDON must be a registered node before it can have an edge to END
        g.add_node("ABANDON", node_abandon)

        g.add_edge(START, "GREET")
        g.add_conditional_edges("GREET", route_after_greet)
        # ... (one add_conditional_edges per stateful transition)
        g.add_edge("DONE", END)
        g.add_edge("ABANDON", END)
        # ESCALATE uses interrupt() inside the node body, not an edge
        return g.compile(checkpointer=checkpointer)

7.2 TenantScopedSaver injection (per spike Option A)
The compiled graph is built per invocation, not at app startup, because the checkpointer needs to be bound to a connection whose search_path is the inbound message's tenant schema. The ingress handler shape:

    # illustrative pseudocode
    async def handle_inbound_whatsapp(msg: WhatsAppMessage):
        tenant = await tenant_resolver.resolve_from_business_number(msg.business_number)
        contact_or_admin = await identity_resolver.resolve(tenant, msg.from_)

        # ONE dedicated psycopg connection for this turn, search_path pre-set
        async with tenant_scoped_checkpointer.for_tenant(tenant.schema_name) as saver:
            graph = build_customer_graph(checkpointer=saver)
            thread_id = f"{tenant.id}:{msg.from_}"  # see §8 for full key shape
            config = {"configurable": {"thread_id": thread_id}}
            result = await graph.ainvoke(
                {
                    "last_inbound": msg.text,
                    "channel": "whatsapp",
                    # ...
                },
                config=config,
            )
            await answer_shaper.send(result, channel="whatsapp")

Three properties this guarantees:
- Hard tenant isolation. The connection's `search_path` is set to `tenant_xxx, public` for the lifetime of the `async with` block; LangGraph's hardcoded unqualified table names (per the spike §2.4) land in the right schema.
- No connection reuse across tenants. The `for_tenant` context manager either takes a per-tenant micro-pool connection or opens fresh, never a shared pool whose connection might still carry tenant A's `search_path` when tenant B uses it (the failure mode the spike's §2.6 warned about).
- Resume-able. Because `thread_id` is stable across turns (`tenant_id:phone`), every invocation that resolves to the same `(tenant, customer)` pair hydrates the same checkpoint and resumes from the persisted FSM cursor.
7.3 Reference
The graph shape is the standard LangGraph StateGraph + conditional-edges + checkpointer pattern. The interrupt() and Command(resume=...) primitive used inside node_escalate is the LangGraph human-in-the-loop interrupt pattern — the canonical fit for our admin handoff.
8. FSM persistence model
This section is the input to ADR-0003 (FSM persistence).
8.1 Two-tier persistence
Two stores, two roles. Phase C §3 and §10 + the spike both establish the same shape.
| Tier | Store | Lifetime | Purpose |
|---|---|---|---|
| Hot | Redis 7 | <= 30 min (TTL) | In-flight FSM cursor; idempotency dedup; rate limiting |
| Durable | Postgres (per-tenant schema) | Forever (subject to retention policy) | LangGraph checkpoint for interrupt-and-resume; audit trail; cross-channel migration |
The pattern: Redis is the read-after-write hot path; Postgres is the source of truth on cold-start and the only store that survives a 30-min idle gap.
8.2 Redis schema
| Key | Type | TTL | Value |
|---|---|---|---|
| ratiba:fsm:{tenant_id}:{phone} | hash | 1800 s (30 min) | {thread_id, current_state, last_seen_at} (the cursor pointing at the LangGraph thread) |
| ratiba:lock:reservation:{tenant_id}:{slot_key} | string (NX) | 300 s (5 min) | appointment_tentative_id (held while CONFIRM is awaited) |
| ratiba:dedup:{tenant_id}:{message_id} | string | 86400 s (24 h) | "1" (360dialog message-id idempotency) |
| ratiba:rate:{phone} | string (INCR) | 60 s | inbound count per minute (anti-abuse) |
| ratiba:stkpush:{checkout_id} | hash | 300 s (5 min) | {appointment_id, thread_id, started_at} (used by STK callback to find the right thread to resume) |
Eviction: TTL only. We do not evict on memory pressure; Redis is sized for the working set (estimated at up to 50k active conversations: 1k tenants × 50 active per tenant). If we hit the ceiling, scale Redis vertically before introducing eviction (an evicted FSM cursor without a corresponding Postgres checkpoint would orphan the conversation).
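Keeping key construction and TTLs in one module avoids the schema drifting between writers. A small sketch, with hypothetical helper names; the key patterns and TTL values are the ones in the table above:

```python
# TTLs in seconds, copied from the Redis schema table.
TTL_SECONDS = {
    "fsm": 1800,
    "reservation_lock": 300,
    "dedup": 86400,
    "rate": 60,
    "stkpush": 300,
}


def thread_id(tenant_id: str, phone: str) -> str:
    """Deterministic LangGraph thread id for a (tenant, customer) pair."""
    return f"{tenant_id}:{phone}"


def fsm_key(tenant_id: str, phone: str) -> str:
    return f"ratiba:fsm:{tenant_id}:{phone}"


def reservation_lock_key(tenant_id: str, slot_key: str) -> str:
    return f"ratiba:lock:reservation:{tenant_id}:{slot_key}"


def dedup_key(tenant_id: str, message_id: str) -> str:
    return f"ratiba:dedup:{tenant_id}:{message_id}"


def rate_key(phone: str) -> str:
    return f"ratiba:rate:{phone}"


def stkpush_key(checkout_id: str) -> str:
    return f"ratiba:stkpush:{checkout_id}"
```

The phone number and tenant id in the usage below are made up: `fsm_key("t42", "+254700000001")` yields `ratiba:fsm:t42:+254700000001`.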
8.3 Postgres durability story
Per ADR-0001 amendment, the durable store IS LangGraph's checkpoints / checkpoint_blobs / checkpoint_writes / checkpoint_migrations tables, created in each tenant schema by PostgresSaver.setup() invoked at tenant onboarding. The TenantScopedSaver wrapper binds each invocation to the tenant's schema (per spike §3 Option A and the four acceptance criteria there). Every node transition that succeeds in the LangGraph graph triggers a checkpoint write to that tenant's schema.
Retention: keep checkpoints for completed bookings for 90 days (configurable per tenant for compliance), then archive to a cold table. Abandoned bookings get the same 90-day window (useful for "why did this customer drop off" eval traces).
8.4 Recovery semantics on cold start
Three cold-start scenarios:
1. Backend restart (Redis intact). No-op for in-flight FSMs. Redis still holds `ratiba:fsm:*` keys; the next inbound message resolves the thread_id and the LangGraph graph hydrates from the latest Postgres checkpoint anyway (Redis only carries the cursor, not the state).
2. Redis flush + Postgres intact. All in-flight FSM cursors are lost from Redis. On the next inbound message for a `(tenant, phone)` pair, the orchestrator constructs the deterministic `thread_id = "{tenant_id}:{phone}"`, attempts to load the latest checkpoint from Postgres for that thread_id, and either resumes (if a checkpoint exists and is not in a terminal state) or starts fresh. The `ratiba:lock:reservation:*` keys are gone, so any tentative reservations are released; the user is prompted to re-pick a slot. This is acceptable degradation.
3. Postgres tenant schema lost. Catastrophic: appointments, contacts, and FSM checkpoints are all gone. Out of scope for FSM design; this is a backup/restore concern.
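The resume-or-start decision for the Redis-flush scenario can be sketched as a pure function. The names are hypothetical, and representing a checkpoint as a plain dict with a `current_state` key is a simplifying assumption; a real LangGraph checkpoint is a richer object loaded through the saver.

```python
from typing import Callable, Optional, Tuple

TERMINAL_STATES = frozenset({"DONE", "ABANDON"})


def resume_or_start(
    tenant_id: str,
    phone: str,
    load_latest_checkpoint: Callable[[str], Optional[dict]],
) -> Tuple[str, str]:
    """Return (thread_id, decision), where decision is "resume" or "start_fresh".

    load_latest_checkpoint is injected so the decision can be tested
    without Postgres; it returns None when no checkpoint exists.
    """
    thread_id = f"{tenant_id}:{phone}"
    checkpoint = load_latest_checkpoint(thread_id)
    if checkpoint is None:
        return thread_id, "start_fresh"
    if checkpoint["current_state"] in TERMINAL_STATES:
        return thread_id, "start_fresh"
    return thread_id, "resume"


# Usage with an in-memory stand-in for the checkpoint store (made-up data).
fake_store = {
    "t1:+254700000001": {"current_state": "STAFF"},
    "t1:+254700000002": {"current_state": "DONE"},
}
decision = resume_or_start("t1", "+254700000001", fake_store.get)
```

Because the thread id is derived only from tenant and phone, losing the Redis cursor changes nothing about which checkpoint is found.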
8.5 Cross-channel migration (WhatsApp → voice mid-flow)
The deterministic thread_id = "{tenant_id}:{phone}" carries across channels. If a customer starts a WhatsApp booking, gets to STAFF, then calls in via voice, the voice channel resolves the same thread_id and the LangGraph graph hydrates from the same checkpoint. Two complications to handle:
- Channel switch in the state. The `BookingState.channel` field is updated to `"voice"` on the first voice-channel hydration; the AnswerShaper uses this to choose voice-shaped responses (≤2 sentences, no markdown) for the rest of the flow.
- Pending Redis lock. If a tentative reservation was held in Redis from the WhatsApp turn and is still within its 5-min TTL, the voice channel inherits it cleanly. If TTL has expired, the voice channel re-prompts for slot selection (see scenario 2 above).
This is an emergent property of the design, not a special-cased migration path — the canonical key is the customer's phone number, and the channel is just metadata.
8.6 Concurrency: parallel WhatsApp + voice for same customer
Edge case but real: customer texts on WhatsApp and immediately calls in before the WhatsApp turn completes. Two ingress handlers will race for the same thread_id. Two layers of defence:
1. Per-thread mutex via Redis `SETNX`. Before invoking the graph, acquire `ratiba:lock:thread:{thread_id}` with a 30-second TTL. The losing handler returns a deferred response ("hold on, I'm processing your previous message") and retries on a 1 s backoff for up to 5 s. Above that, fail with a soft message ("system busy, please try again").
2. LangGraph checkpoint write semantics. Even if layer 1 fails (e.g., Redis split-brain), LangGraph's checkpoint writes are append-only and parent-chained, so the second writer's write either chains correctly or conflicts on a unique constraint (`thread_id, checkpoint_id`). On conflict, the orchestrator drops the second invocation and lets the customer retry.
This is over-engineered for Phase 1 (the case is rare); the mutex is the must-have, the checkpoint conflict handling is the nice-to-have backstop.
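The retry behaviour of the mutex can be sketched without a Redis server. In the block below, `FakeRedis` is a hypothetical in-memory stand-in that emulates set-if-absent with a TTL, and `acquire_thread_lock` is a hypothetical helper; the key pattern, 30 s TTL, 1 s backoff, and 5 s ceiling are the values stated above. A production version would issue the equivalent atomic command against Redis and would likely be async.

```python
import time
from typing import Callable, Dict, Tuple


class FakeRedis:
    """In-memory stand-in emulating set-if-absent with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def set_nx(self, key: str, value: str, ttl_s: float) -> bool:
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # someone else holds an unexpired lock
        self._data[key] = (value, now + ttl_s)
        return True

    def delete_if_owner(self, key: str, value: str) -> None:
        entry = self._data.get(key)
        if entry is not None and entry[0] == value:
            del self._data[key]


def acquire_thread_lock(
    store: FakeRedis,
    thread_id: str,
    owner: str,
    ttl_s: float = 30.0,
    backoff_s: float = 1.0,
    max_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to take the per-thread lock, retrying on a fixed backoff.

    Returns False once max_wait_s of backoff has been spent; the caller
    then sends the soft "system busy" message.
    """
    key = f"ratiba:lock:thread:{thread_id}"
    waited = 0.0
    while True:
        if store.set_nx(key, owner, ttl_s):
            return True
        if waited >= max_wait_s:
            return False
        sleep(backoff_s)
        waited += backoff_s
```

Injecting `sleep` and `clock` keeps the retry loop testable in milliseconds; the owner token passed to `delete_if_owner` prevents one handler from releasing another handler's lock.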
9. Voice-channel turn-taking
Per Phase C §8, the voice stack is LiveKit + Deepgram Nova-3 + ElevenLabs Multilingual v2 with <800 ms end-to-end as the streaming SLO and <500 ms as the "feels human" target.
9.1 Adaptive interruption (barge-in)
Use LiveKit Adaptive Interruption Handling on day 1 of voice (Phase 2). The model is trained on real conversational audio and operates after VAD identifies incoming user audio, using an audio encoder + CNN to distinguish "intentional barge-in" from "backchannel" ("uh-huh", "mm-hmm", "ndio"). Critical property for our bilingual product: per the LiveKit docs, "the adaptive interruption model is meant to be used with any spoken language" — it has not been benchmarked specifically on Swahili backchanneling but works on the audio domain (not transcript domain), so the language-independence claim is plausible. We validate this empirically on the first 20 Swahili call recordings and feed the results into the eval suite.
When adaptive interruption fires:
- Cancel TTS playback immediately.
- Flush the LiveKit audio output queue.
- Restart the STT segment for the user's utterance.
- The LangGraph node currently executing returns its in-progress state (no checkpoint write — the cancelled turn never completes).
- Next turn picks up from the same checkpoint as if the interrupted turn never happened.
9.2 Filler-clock
Per Phase C §8 + the existing zol-rag voice-stack pattern (captured in user memory), we play a brief "thinking sound" when the LLM call is going to take > 300 ms. The cadence:
| Elapsed since user finished speaking | Audio played |
|---|---|
| 0-300 ms | (silence — under perception threshold) |
| 300 ms | First filler: "mm" (en) / "mm" (sw) — same audio asset, language-agnostic |
| 4 s | Second filler: "let me check" (en) / "hebu nikague" (sw) — language-locked |
| 10 s | Reassurance: "still working on it" (en) / "bado nakushughulikia" (sw) |
| > 30 s | Apologetic timeout: "this is taking longer than expected; let me get someone to help" → ESCALATE |
The filler player is a separate LiveKit audio track that plays while the LLM call is in flight; it's preempted the moment TTS for the real response starts.
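The cadence table reduces to a lookup on elapsed time. A sketch with hypothetical names (`FILLER_SCHEDULE`, `next_filler`, `should_escalate`); the thresholds and phrases are the ones in the table, and the timeout utterance itself is left to the escalation path rather than modelled here:

```python
from typing import Optional

# (threshold in ms, English phrase, Swahili phrase), in ascending order.
FILLER_SCHEDULE = [
    (300, "mm", "mm"),
    (4_000, "let me check", "hebu nikague"),
    (10_000, "still working on it", "bado nakushughulikia"),
]
ESCALATE_AFTER_MS = 30_000


def next_filler(elapsed_ms: int, language: str) -> Optional[str]:
    """Return the filler for the latest stage reached, or None if still silent."""
    if language not in ("en", "sw"):
        raise ValueError(f"unsupported language: {language}")
    chosen = None
    for threshold_ms, en_phrase, sw_phrase in FILLER_SCHEDULE:
        if elapsed_ms >= threshold_ms:
            chosen = en_phrase if language == "en" else sw_phrase
    return chosen


def should_escalate(elapsed_ms: int) -> bool:
    """True once the wait exceeds the 30 s apologetic-timeout threshold."""
    return elapsed_ms > ESCALATE_AFTER_MS
```

A real player would also track which stages have already been spoken so that each filler plays once; this sketch only answers "which stage are we in".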
9.3 End-of-turn detection
Use LiveKit's default VAD-based end-of-turn detection with a 700 ms silence threshold (LiveKit docs default is 500 ms; we lengthen to 700 ms to accommodate the slower pacing observed in zol-rag's Swahili calls). This is a tunable that the eval suite revisits — too short and we cut customers off mid-utterance, too long and we feel laggy.
For the customer-facing AnswerShaper response, prefer explicit verbal handoff ("…shall I book it?") over silent end-of-turn — the question mark is a strong signal that the user should speak now, more reliable than silence detection alone.
9.4 What we do NOT use for voice
- End-to-end audio-to-audio models (GPT-4o realtime, Claude voice) — Phase C §8 rejected these as too opaque for a multi-tenant compliance-sensitive product where we may have to debug "why did the model mishear mtoto as moto". Sequential pipeline only.
- Multi-agent orchestration over voice — same reasoning as the §1 verdict, doubly so for voice where the latency budget cannot absorb extra LLM hops.
10. Open questions surfaced
These are explicitly punted to ADR-0003 / ADR-0005 to resolve before implementation:
10.1 For ADR-0003 (FSM persistence)
- Per-tenant Redis keyspaces vs shared Redis with tenant-prefixed keys? Current design (§8.2) uses prefixed keys (`ratiba:fsm:{tenant_id}:...`) on a shared Redis. Alternative: separate Redis databases per tenant (Redis supports 16 numbered DBs by default; not a real isolation boundary anyway) or full Redis instances per tenant (operational overkill). The prefix-shared model is the recommendation; needs ADR ratification.
- Postgres checkpoint retention policy. §8.3 suggested "90 days then cold-archive" but the cold-archive mechanism is unspecified. Options: a separate `checkpoints_archive` table per tenant (same schema), a single archive schema across all tenants (breaks the isolation story), or just letting `pg_dump` + drop be the archive. ADR-0003 should pick one.
- Cross-tenant pool sizing. §8 implies one micro-pool per tenant for the psycopg checkpoint connections. At 1k tenants × 2 idle connections = 2k idle connections, which is beyond what a single Postgres instance can comfortably hold (default `max_connections=100`; a tuned VPS can do 500-1000). ADR-0003 needs to specify pool size, eviction, and whether we use PgBouncer for the checkpoint path (probably yes).
- Reservation lock semantics under voice + WhatsApp race. §8.6 sketched a Redis SETNX mutex but did not specify the exact retry/fallback semantics for the user-facing message ("system busy, try again" feels rough). ADR-0003 could decide whether to instead serialize via Postgres advisory locks or via a per-thread asyncio queue inside the worker.
- Redis flush recovery: do we preserve abandoned bookings? §8.4 scenario 2 said we resume from Postgres if a checkpoint exists. But for an `ABANDON` checkpoint, do we resume into ABANDON (continuing the abandoned state, weird) or treat as fresh? The clean answer is "checkpoint state machine has a final flag; ABANDON is final; resume is no-op for final states", but the ADR should make it explicit.
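One possible reading of the "final flag" answer to the Redis flush recovery question, sketched as a small resume helper. Nothing here is ratified: `Checkpoint`, `resume_or_fresh`, the `DONE` and `GREET` state names, and the choice to start a fresh conversation on a final checkpoint are all assumptions for illustration. Only `ABANDON` comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# ABANDON is named in the text; DONE is a hypothetical second final state.
FINAL_STATES = frozenset({"ABANDON", "DONE"})


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical minimal checkpoint shape; the real one is ADR-0003's call."""
    thread_id: str
    state: str

    @property
    def is_final(self) -> bool:
        return self.state in FINAL_STATES


def resume_or_fresh(
    checkpoint: Optional[Checkpoint],
    thread_id: str,
    initial_state: str = "GREET",  # hypothetical entry state
) -> Checkpoint:
    """Resume a non-final checkpoint; treat a missing or final one as a fresh start."""
    if checkpoint is None or checkpoint.is_final:
        return Checkpoint(thread_id=thread_id, state=initial_state)
    return checkpoint
```

Under this reading an abandoned booking is never revived. The old checkpoint stays in Postgres for audit and the customer simply starts over.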
10.2 For ADR-0005 (orchestration model)
- Who owns the intent classifier prompt? §4.3 says Claude Haiku with strict structured outputs. The prompt itself is a load-bearing artifact (the eval suite gates changes to it per Phase C). ADR-0005 should establish where the prompt lives (code vs Langfuse vs both per the Phase C §10 open question on prompt versioning).
- Bilingual ambiguity escalation. What happens when the LLM returns `language="en"` with `confidence < 0.7`: do we ask the user explicitly ("English or Swahili?") or default and let them correct? §5.2 set 0.70 as the threshold but didn't specify the disambiguation UX.
- Admin handoff briefing card shape. §3.3 said "structured briefing card (last 5 turns + extracted slots so far)". The exact shape of that card (LLM-generated summary? raw transcript? both?) was the same open question Phase C surfaced in Appendix B / A2. Carry forward to ADR-0005 with explicit options.
- Multi-turn LLM cost ceiling per booking. A worst-case booking with 3 clarification rounds + escalation could hit 8-10 LLM calls. At Haiku pricing this is ~$0.01 per booking, which is fine; at frontier-model pricing it would be ~$0.30. ADR-0005 should set a cost SLO per booking and specify the model routing rule (Haiku for classifier + slot extraction, frontier model only for clarification rephrasing).
- AdminOrchestrator state machine. §3.1 sketched a much shallower FSM for admin (`IDLE -> ROUTED -> AWAIT_CONFIRMATION -> EXECUTED -> IDLE`) but did not specify the state object, checkpoint shape, or whether admin sessions have a TTL at all. Defer to ADR-0005: admin flows are open-ended and may benefit from a different shape entirely (e.g., no FSM, just a dispatcher).
- MCP tool registry ownership. §6 said tools live in a `tools/` package with MCP-shape definitions. Who owns the catalogue, who reviews tool additions, how do we version the tool schemas? ADR-0005 should establish.
- Error budget for the LLM strict-mode classifier. Strict mode gives us mathematical schema compliance, but the model can still return nonsense values within the schema (e.g., `intent="book"` when the user asked to cancel). Phase C §10 mentioned "LLM-as-judge for eval grading" as an open question, which relates here. ADR-0005 should declare what counts as a classifier failure and how we measure it.
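The cost-ceiling question can be made concrete with a back-of-envelope sketch. The per-call costs below are hypothetical placeholders back-derived from the ballpark figures in the list (roughly $0.01 and $0.30 for a 10-call booking). They are not provider list prices. `ROUTING`, `booking_cost_usd`, and `within_slo` are illustrative names, and the 10-call breakdown is an assumed worst case.

```python
from collections import Counter

# Hypothetical per-call costs in USD, derived from the document's rough totals.
COST_PER_CALL_USD = {"haiku": 0.001, "frontier": 0.03}

# Routing rule as proposed in the list: cheap model everywhere except rephrasing.
ROUTING = {
    "intent_classification": "haiku",
    "slot_extraction": "haiku",
    "clarification_rephrasing": "frontier",
}


def booking_cost_usd(calls: list[str]) -> float:
    """Sum the cost of one booking given the seam behind each LLM call."""
    tiers = Counter(ROUTING[seam] for seam in calls)
    return sum(COST_PER_CALL_USD[tier] * n for tier, n in tiers.items())


def within_slo(calls: list[str], slo_usd: float) -> bool:
    """Check a booking's LLM spend against a per-booking cost SLO."""
    return booking_cost_usd(calls) <= slo_usd


# Assumed worst case: 1 classification, 3 clarification rounds, 3 extra extractions.
worst_case = (
    ["intent_classification"]
    + ["slot_extraction", "clarification_rephrasing"] * 3
    + ["slot_extraction"] * 3
)
print(len(worst_case), round(booking_cost_usd(worst_case), 3))
```

Under these placeholder numbers the mixed routing lands near $0.10 for the worst-case booking. Nearly all of that comes from the three rephrasing calls, so the routing rule and the SLO probably need to be decided together.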
Sources
- Anthropic — Building Effective Agents (workflow vs agent taxonomy)
- Anthropic — Effective harnesses for long-running agents (single-loop discipline)
- Anthropic — Structured outputs documentation (strict-mode schema enforcement)
- LangChain — LangGraph Graph API overview (StateGraph, nodes, conditional edges)
- LangChain — LangGraph interrupts (interrupt-and-resume primitive)
- LangChain — Making it easier to build HITL agents with interrupt
- LiveKit — Adaptive Interruption Handling docs
- LiveKit — Solving unwanted interruptions blog post
- LiveKit — Sequential pipeline architecture for voice agents
- LiveKit — Turn detection and interruptions docs
- Model Context Protocol — modelcontextprotocol.io
- Phase C landscape scan: `docs/research/2026-04-25-agentic-landscape-2026.md`
- Spike: `docs/research/2026-04-25-langgraph-postgressaver-spike.md`
- ADR-0001 (amended 2026-04-25): `docs/adr/ADR-0001-tech-stack.md`
- PRD: `docs/prd/ratiba-prd.md` (§2.1, §4 Modules 7-9 most relevant)