ADR-0005: Orchestration Model

Status: Accepted
Date: 2026-04-25

Context

A1 (docs/research/2026-04-25-orchestration-patterns.md) settled the high-level orchestration shape: a hybrid LangGraph StateGraph with LLM-augmented seams at intent classification, slot extraction, and clarification rephrasing. Multi-agent rejected (Phase C §1 + A1 §2.2); single-loop favored. C1 from spec §12 added MCP-shape internal tool definitions dispatched in-process. Phase B §3 settled where prompts live (in code, with Langfuse as observability-only).

What this ADR settles is the operational scaffolding around that shape: where exactly the prompts go, which model handles which role, how bilingual ambiguity resolves, how runtime cost gets bounded, what the AdminOrchestrator's much-shallower FSM looks like, what shape the tool registry takes, and how classifier errors are measured and escalated.

This ADR also closes a question ADR-0001 left implicitly open: which LLM provider is the orchestrator's default brain. Phase C §1 inferred Claude (cost favourable, presumed Swahili edge), but ADR-0001 itself did not pin the LLM brain; it pinned only STT (Deepgram), TTS (ElevenLabs), and the framework stack. Adrian's preference (2026-04-25) shifted the default to OpenAI GPT-4.1 mini for the high-volume narrow roles, with Claude Haiku reserved for two specific nuanced roles. The mechanism that makes this swappable per-role and per-tenant is the LLMRouter / role-assignments pattern (D5).

ADR-0001's amendment of 2026-04-25 will be updated in a follow-up amendment to reference this ADR for the LLM-brain default, since the LLM-brain inference in that amendment was light-touch.

Decision

Nine concrete decisions, organized as a coherent orchestration model.

1. Architectural recap (inherited, locked here)

| Aspect | Where decided | Recap |
| --- | --- | --- |
| Orchestrator shape | A1 §1 | Hybrid: deterministic LangGraph StateGraph spine + LLM seams at three named points (intent classification, slot extraction, clarification rephrasing). |
| Single-loop vs multi-agent | Phase C §1 + A1 §2.2 | Single-loop. Multi-agent rejected for tightly coupled booking flows. |
| Per-orchestrator graph type | A1 §7 | Two graphs: CustomerBookingGraph + AdminGraph. Compiled per-invocation against a TenantScopedSaver per ADR-0002 D4. |
| Tool definition shape | C1 / A1 §6 | MCP-shape (JSON Schema), dispatched in-process for Phase 1; deferred MCP server boundary until Phase 3+. |
| Structured outputs | Phase C §10 / A1 §4 | Strict mode mandatory at every classifier and tool call. |
| Persistence | ADR-0003 | Two-tier (Redis hot + Postgres LangGraph checkpoint). Fresh thread_id per booking via conversation_threads pointer table. |

This ADR builds on those primitives without re-deriving them.

2. Intent classifier prompt: single bilingual prompt, human-review-only

The intent classifier runs on every customer turn that isn't a WhatsApp button reply. Per Phase B §3, prompts live in code:

backend/app/orchestrator/prompts/intent_classifier/
    system.md        # the single bilingual prompt (handles en + sw)
    metadata.yaml    # version, target role (intent_classifier), eval_baseline, owner

Single bilingual prompt (not split per language). The structured-output schema includes language: enum[en, sw] so the model identifies the customer's language as part of its output. Two prompts would either double LLM calls per turn (language-detection pre-step + classifier) or double the prompt-management surface; neither pays for itself given that frontier-class models handle bilingual input natively.
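A minimal sketch of what the classifier's strict-mode output schema could look like, written as a Python dict. Only the `language: enum[en, sw]` field is taken from the text above; the intent values, the two confidence fields, and the names `INTENT_CLASSIFIER_SCHEMA` and `missing_or_unexpected_keys` are illustrative assumptions, and the real schema may differ.

```python
# Hypothetical strict-mode JSON Schema for the intent classifier output.
# Only `language` (enum en/sw) comes from this ADR; the other fields are assumed.
INTENT_CLASSIFIER_SCHEMA = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["book", "reschedule", "cancel", "unknown"]},
        "intent_confidence": {"type": "number"},
        "language": {"type": "string", "enum": ["en", "sw"]},
        "language_confidence": {"type": "number"},
    },
    "required": ["intent", "intent_confidence", "language", "language_confidence"],
    "additionalProperties": False,
}


def missing_or_unexpected_keys(payload: dict) -> tuple[set, set]:
    """Return (missing, unexpected) top-level keys relative to the schema."""
    required = set(INTENT_CLASSIFIER_SCHEMA["required"])
    allowed = set(INTENT_CLASSIFIER_SCHEMA["properties"])
    return required - set(payload), set(payload) - allowed
```

Because the language comes back in the same response as the intent, no separate language-detection call is needed.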

Human-review-only on every PR per Phase B §6. The intent classifier gates the booking flow which gates payment, so it touches money via the chain. The prompt-eval-reviewer agent (Phase B §10 #4) reads the eval-delta PR comment but does not auto-approve intent-classifier prompt PRs.

3. Bilingual ambiguity disambiguation: default to tenant locale

When the intent classifier returns language confidence below 0.70 (starting threshold from A1 §5.2; eval-tunable), the orchestrator defaults the customer-facing response language to public.tenants.locale (set at tenant onboarding). The customer's next utterance — now likely longer or more discriminating — is re-classified and the orchestrator switches if the new classification crosses the 0.70 threshold.

Why the tenant prior beats other defaults. Most customers of a given tenant speak the same language as the tenant. Defaulting to "whichever the LLM scored higher" (even at low confidence) ignores that prior; explicitly asking ("English or Swahili? / Kiingereza au Kiswahili?") feels robotic for the 99% case where language is detectable. The auto-correction path on the next turn is cheap.

The 0.70 threshold is a starting value; A3's eval suite will refit it from real Swahili pilot transcripts (per spec §12 / Q5 cost-discipline).
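The fallback rule can be sketched as a small pure function. The function and parameter names are hypothetical; treating a confidence exactly equal to the threshold as "confident enough" is an assumption, since the text only says "below 0.70" falls back.

```python
def resolve_response_language(
    detected: str,
    confidence: float,
    tenant_locale: str,
    threshold: float = 0.70,  # starting value; eval-tunable
) -> str:
    """Use the detected language only when its confidence reaches the threshold."""
    if confidence >= threshold:
        return detected
    # Low confidence: fall back to the tenant prior (public.tenants.locale).
    return tenant_locale
```

Calling it again on the next turn gives the auto-correction path: a longer utterance that scores at or above the threshold switches the response language.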

4. Per-booking LLM cost ceiling: $0.05 soft / $0.20 hard, per-tenant

Two ceilings enforced via a current_booking_cost: ContextVar[float] set at booking-thread creation and updated after every LLM call:

| Ceiling | Default | Behaviour on breach |
| --- | --- | --- |
| Soft | $0.05 | Logs event_type=cost.budget.soft_breach (per Phase B §5.2 schema). Orchestrator continues. Surfaces in nightly cost-review dashboard. |
| Hard | $0.20 | Logs event_type=cost.budget.hard_breach. Orchestrator escalates to admin via the existing handoff path (reason code BUDGET_BREACH per A2 §2). Customer still gets served (by the admin); no surprise bill. |

Per-tenant configurable. New columns on public.tenants:

ALTER TABLE public.tenants ADD COLUMN cost_ceiling_soft_usd NUMERIC(6,4) NOT NULL DEFAULT 0.05;
ALTER TABLE public.tenants ADD COLUMN cost_ceiling_hard_usd NUMERIC(6,4) NOT NULL DEFAULT 0.20;

Some tenants justify higher ceilings (dental triage; longer conversations with health context). Override is an UPDATE public.tenants operation; no code change.

Calibration. Defaults are starting values sized for about 10 LLM calls per booking at GPT-4.1 mini pricing. At a typical $0.001-$0.005 per call, ten calls cost $0.01-$0.05, so the $0.05 soft ceiling sits at the top of the expected range and the $0.20 hard ceiling at four times that. After the first 100 production bookings, recalibrate from the observed distribution. This aligns with the cost-discipline principle (project memory: project_cost_discipline.md).

The mechanism interacts cleanly with the auto-debug logging schema: the cost_usd field on every LLM-call event (Phase B §5.1) is the data the contextvar accumulates, so cost can be diagnosed per-conversation or aggregated per-tenant from JSONL logs alone.
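A minimal sketch of the accumulation step, assuming a helper named `record_llm_cost` (hypothetical) that is called after every LLM call. The return values stand in for the logging and escalation described above, and treating "at or above the ceiling" as a breach is an assumption.

```python
from contextvars import ContextVar

# Set to 0.0 at booking-thread creation; one running total per booking context.
current_booking_cost: ContextVar[float] = ContextVar("current_booking_cost", default=0.0)


def record_llm_cost(cost_usd: float, soft_usd: float = 0.05, hard_usd: float = 0.20) -> str:
    """Add one call's cost to the running total and report the budget state."""
    total = current_booking_cost.get() + cost_usd
    current_booking_cost.set(total)
    if total >= hard_usd:
        return "hard_breach"  # caller would log cost.budget.hard_breach and hand off
    if total >= soft_usd:
        return "soft_breach"  # caller would log cost.budget.soft_breach and continue
    return "ok"
```

A production version would presumably emit each breach event only on the first crossing rather than on every later call, and would read the two ceilings from the tenant row.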

5. Model selection and LLM router

A new LLMRouter class in backend/app/orchestrator/llm/router.py maps logical roles to concrete (provider, model) pairs via YAML configuration. Closes the LLM-brain question ADR-0001 left open.

Role-assignment defaults (backend/app/orchestrator/llm/role_assignments.yaml):

defaults:
  intent_classifier:  {provider: openai, model: gpt-4.1-mini}
  slot_extractor:     {provider: openai, model: gpt-4.1-mini}
  answer_shaper:      {provider: openai, model: gpt-4.1-mini}
  handoff_summarizer: {provider: anthropic, model: claude-haiku-4.5}
  reorientation:      {provider: anthropic, model: claude-haiku-4.5}

Why GPT-4.1 mini for the four narrow-task roles. GPT-4.1 mini is cheaper per call than Claude Haiku for the same workload (GPT-4.1 mini lists at roughly $0.40 per 1M input tokens and $1.60 per 1M output tokens); it supports strict-mode structured outputs; and it is multilingual, including Swahili. The four roles are bounded classification / extraction / shaping tasks where the schema is strict and the output is constrained. Vendor diversity also reduces the OpenAI-pulled-the-rug risk (Phase C §2 surfaced this risk generally).
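As a worked example of the per-call arithmetic, the sketch below prices one call from its token counts using the GPT-4.1 mini rates quoted above. The function name and the example token counts are assumptions chosen for illustration, not measured values.

```python
# GPT-4.1 mini list prices quoted in this ADR, in USD per 1M tokens.
INPUT_USD_PER_MTOK = 0.40
OUTPUT_USD_PER_MTOK = 1.60


def estimate_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single call from its token counts."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Hypothetical classifier call: 2,000 prompt tokens in, 150 tokens out.
example_cost = estimate_call_cost(2_000, 150)
```

For that hypothetical call the estimate is about $0.001, the low end of the typical per-call range used to size the cost ceilings.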

Why Claude Haiku for handoff_summarizer + reorientation. These two roles produce free-form bilingual prose that a person reads directly: the handoff brief goes to the admin and shapes how they reply to the customer, and the reorientation message is the agent-to-customer "welcome back" after admin hand-back. Brief quality and reorientation fluency are load-bearing for trust; Claude's reputation for natural-sounding Swahili is worth the higher per-call cost on these two relatively low-volume roles (one call each per handoff, vs many calls per booking for the four narrow roles).

Per-tenant override. New JSONB column on public.tenants:

ALTER TABLE public.tenants
    ADD COLUMN llm_role_overrides JSONB NOT NULL DEFAULT '{}';

Shape: {"intent_classifier": {"provider": "anthropic", "model": "claude-haiku-4.5"}}. Empty default means use the role_assignments.yaml defaults.

Provider abstractions. Two interface implementations in backend/app/orchestrator/llm/providers/:

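from __future__ import annotations  # LLMResponse is not defined in this excerpt

from typing import Protocol


class LLMProvider(Protocol):
    """Interface that every provider adapter implements."""

    async def call(
        self,
        model: str,
        system_prompt: str,
        user_prompt: str,
        schema: dict | None = None,  # for strict-mode structured outputs
        max_tokens: int | None = None,
    ) -> LLMResponse: ...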

OpenAIProvider + AnthropicProvider both implement this. Adding a future provider (Gemini, local Llama, etc.) = new class implementing LLMProvider + entry in role_assignments.yaml. No orchestrator code changes.

Future canary support. The router is also where Phase 3 canary prompt/model routing will plug in (per A3 §6). For Phase 1, the config is static; for Phase 3, the router consults a "prompt assignment service" (deterministic hash of (tenant_id, conversation_id) → assignment) at the start of each conversation. The interface stays the same; the resolution becomes dynamic.
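The deterministic hash mentioned above might look like the sketch below. The bucket count, the percentage knob, and both function names are assumptions; A3 §6 may define the assignment differently.

```python
import hashlib


def canary_bucket(tenant_id: str, conversation_id: str, buckets: int = 100) -> int:
    """Map (tenant_id, conversation_id) to a stable bucket in [0, buckets)."""
    key = f"{tenant_id}:{conversation_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(tenant_id: str, conversation_id: str, canary_percent: int = 5) -> str:
    """Return 'canary' for roughly canary_percent% of conversations, else 'control'."""
    if canary_bucket(tenant_id, conversation_id) < canary_percent:
        return "canary"
    return "control"
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is salted per process for strings, so it would not give the same assignment across workers or restarts.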

6. AdminOrchestrator: shallow 4-state FSM, shared conversation_threads

AdminOrchestrator is dispatcher-style — admin types one command, classifier routes, executor runs (with confirmation step for irreversible actions), result is sent back. Four states:

IDLE ──▶ ROUTED ──▶ AWAIT_CONFIRMATION ──▶ EXECUTED ──▶ IDLE
           │                                  ▲
           └──────────────────────────────────┘
             (read-only commands skip
              AWAIT_CONFIRMATION)

Read-only commands skip AWAIT_CONFIRMATION (stats query, booking list, contact lookup). Irreversible commands always go through it regardless of LLM confidence (per A1 §5.1): delete service, broadcast message, cancel booking, create staff time-block. The safety_class field on each tool's metadata.yaml (D7 below) drives this routing automatically, with no per-tool confirmation boilerplate.
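The four-state machine can be sketched as a single transition function. Two points are assumptions rather than decisions recorded here: a declined confirmation returns to IDLE, and `write` tools take the same direct path as `read` tools.

```python
from enum import Enum


class AdminState(Enum):
    IDLE = "IDLE"
    ROUTED = "ROUTED"
    AWAIT_CONFIRMATION = "AWAIT_CONFIRMATION"
    EXECUTED = "EXECUTED"


def next_state(
    state: AdminState, safety_class: str = "read", confirmed: bool = False
) -> AdminState:
    """Return the state that follows `state` for a command of the given class."""
    if state is AdminState.IDLE:
        return AdminState.ROUTED  # classifier has routed the command
    if state is AdminState.ROUTED:
        if safety_class == "irreversible":
            return AdminState.AWAIT_CONFIRMATION  # always gated
        return AdminState.EXECUTED  # non-irreversible commands skip the gate
    if state is AdminState.AWAIT_CONFIRMATION:
        # Assumption: a declined confirmation abandons the command.
        return AdminState.EXECUTED if confirmed else AdminState.IDLE
    return AdminState.IDLE  # EXECUTED: result sent, ready for the next command
```

Note that the gate depends only on `safety_class`, never on classifier confidence, which matches the rule that irreversible commands always confirm.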

Session TTL: 4 hours (vs 30 min for customer). Admins operate in batches: "what's today like, block Jane Friday afternoon, send Mary a reminder, what's revenue this week" — multiple commands in one sitting. A 30-min TTL would force re-greeting between commands.

Shared conversation_threads table (introduced by ADR-0003) with new column:

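-- Note: on a table that already holds rows, adding a NOT NULL column without a
-- DEFAULT fails in Postgres; backfill existing rows (or add a DEFAULT) first.
ALTER TABLE conversation_threads
    ADD COLUMN actor_type VARCHAR(20) NOT NULL
    CHECK (actor_type IN ('customer', 'admin'));

CREATE INDEX idx_threads_active_admin
    ON conversation_threads (customer_phone)
    WHERE closed_at IS NULL AND actor_type = 'admin';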

(Plus an analogous partial index for actor_type = 'customer'; the existing single partial index from ADR-0003 splits into two.)

Same LangGraph + TenantScopedSaver infrastructure as the customer side. Same thread_id derivation (fresh ULID per session). Same checkpoint tables. Same per-tenant micro-pool. Reusing the infrastructure is free; bifurcating it would be unnecessary cost.

Boundary with handoff (D9). When admin is in the middle of a customer-handoff conversation (HUMAN_DRIVING state on the customer-side LangGraph thread per A2), admin messages route to the handoff handler, NOT to AdminOrchestrator. AdminOrchestrator only sees admin commands that aren't part of an active handoff.

7. MCP-shape tool registry: directory-per-tool with safety_class

Tools live at backend/app/orchestrator/tools/<tool_name>/ with three files. Pattern repeats for the 13 Phase-1 tools listed in A1 §6.1.

backend/app/orchestrator/tools/
    registry.py
    list_active_services/
        definition.json    # MCP shape: {name, description, inputSchema}
        handler.py         # async def invoke(args: dict) -> dict
        metadata.yaml      # version, owner, safety_class, audit_log_required
    confirm_booking/
        definition.json
        handler.py
        metadata.yaml
    initiate_stk_push/
        definition.json
        handler.py
        metadata.yaml
    ... (13 total)

definition.json is pure MCP shape. When Phase 3+ wants to expose any tool to a separate analytics-agent process, the file copies verbatim into the MCP server manifest — no decoder/extractor needed. JSON Schema (with additionalProperties: false) is what Claude/OpenAI strict-mode consumes for structured-output validation, so the file also serves as the runtime schema.

safety_class enum drives confirmation behaviour:

| Class | Examples | Auto-behaviour |
| --- | --- | --- |
| read | list_active_services, lookup_contact, query_bookings | None. Direct dispatch. |
| write | upsert_contact | None unless tenant explicitly enables; idempotent writes. |
| irreversible | confirm_booking, initiate_stk_push, broadcast_message, cancel_booking | The register_tool() machinery wraps these with a confirm_before_invoke decorator that forces an AWAIT_CONFIRMATION step regardless of LLM confidence. |
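One way the automatic wrapping could work is sketched below. `register_tool` and `confirm_before_invoke` are named in this ADR, but their signatures, the `ConfirmationRequired` exception, the `TOOL_REGISTRY` dict, and the `confirmed` flag are all assumptions made for illustration.

```python
import asyncio
import functools

TOOL_REGISTRY: dict = {}  # hypothetical in-process registry: name -> handler


class ConfirmationRequired(Exception):
    """Raised when an irreversible tool is invoked without confirmation."""


def confirm_before_invoke(handler):
    """Wrap a handler so that it refuses to run until the caller confirms."""
    @functools.wraps(handler)
    async def wrapper(args: dict, confirmed: bool = False) -> dict:
        if not confirmed:
            raise ConfirmationRequired(handler.__name__)
        return await handler(args)
    return wrapper


def register_tool(name: str, safety_class: str, handler):
    """Register a handler, adding the confirmation gate for irreversible tools."""
    if safety_class == "irreversible":
        handler = confirm_before_invoke(handler)
    TOOL_REGISTRY[name] = handler
    return handler


async def cancel_booking(args: dict) -> dict:
    return {"cancelled": args["booking_id"]}


register_tool("cancel_booking", "irreversible", cancel_booking)
result = asyncio.run(TOOL_REGISTRY["cancel_booking"]({"booking_id": "b1"}, confirmed=True))
```

In this sketch the FSM would catch `ConfirmationRequired`, move to AWAIT_CONFIRMATION, and re-invoke with `confirmed=True` once the admin agrees; the tool author writes no confirmation code.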

audit_log_required: bool drives whether the tool's invocation emits an audit.config.changed event (per Phase B §5.2 event types) in addition to the standard tool.invoked event. Defaults to true for irreversible tools; configurable per-tool.

Registry validates at startup. Schema check on definition.json (against a meta-schema), handler signature check on handler.py (async def invoke(args: dict) -> dict), required fields check on metadata.yaml. Misshapen tools fail fast, before serving any traffic.
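The three startup checks could be sketched as one validation function operating on already-parsed files. The meta-schema check is simplified here to a required-keys check, and `validate_tool` is a hypothetical name; the real registry may validate more strictly.

```python
import inspect

REQUIRED_DEFINITION_KEYS = {"name", "description", "inputSchema"}
REQUIRED_METADATA_KEYS = {"version", "owner", "safety_class", "audit_log_required"}
SAFETY_CLASSES = {"read", "write", "irreversible"}


def validate_tool(definition: dict, metadata: dict, invoke) -> list[str]:
    """Return a list of problems; an empty list means the tool is well-formed."""
    errors = []
    for key in sorted(REQUIRED_DEFINITION_KEYS - set(definition)):
        errors.append(f"definition.json missing '{key}'")
    for key in sorted(REQUIRED_METADATA_KEYS - set(metadata)):
        errors.append(f"metadata.yaml missing '{key}'")
    if "safety_class" in metadata and metadata["safety_class"] not in SAFETY_CLASSES:
        errors.append(f"unknown safety_class: {metadata['safety_class']!r}")
    # Handler contract from this ADR: async def invoke(args: dict) -> dict
    if not inspect.iscoroutinefunction(invoke):
        errors.append("handler.invoke must be declared with async def")
    elif list(inspect.signature(invoke).parameters) != ["args"]:
        errors.append("handler.invoke must take exactly one parameter named args")
    return errors
```

At startup the registry would run this for every tool directory and refuse to serve traffic if any list comes back non-empty.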

Tool addition is human-review-only per Phase B §6. The prompt-eval-reviewer agent (Phase B §10 #4) reads the eval-delta PR comment when a new tool's definition or handler changes; the reviewer flags scenarios that regressed but does not auto-approve.

8. Classifier error budget: three-signal failure + 2%/5% per-tenant

Strict-mode structured outputs guarantee schema compliance, NOT semantic correctness. The classifier can return intent="book" when the user wanted to cancel — schema-valid, semantically wrong.

Three-signal failure definition. Any of the following counts as a classifier failure:

  1. Confident-unknown. LLM returns intent="unknown" with confidence > 0.5. The model claims confidence in not-knowing — bug in the prompt's "what do I do when I don't know" branch.
  2. FSM-rejected. LLM returns valid intent that the downstream FSM rejects (e.g., intent="reschedule" for a customer with no existing bookings). The classifier and the business logic disagree about what's possible.
  3. Customer-contradicted. LLM returns valid intent that the customer's next-2-turns explicitly contradicts ("no, I meant cancel"). Lagging signal but high-fidelity.

Tracking. A Langfuse score classifier.failure_rate per tenant, computed as a 7-day rolling rate over the three signals. Signal (3) is computed by a nightly job that reads interactions rows for the window and matches contradiction patterns; signals (1) + (2) are recorded inline at classifier-call time.

Two-tier escalation:

| Failure rate (rolling 7-day) | Action |
| --- | --- |
| < 2% | Normal. No action. |
| ≥ 2% | Eval gate flags the tenant on the next prompt PR. The prompt-eval-reviewer agent surfaces it in the PR comment. |
| ≥ 5% | Page Adrian (the prompt is materially broken, or model has regressed for that tenant's traffic mix). |
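The rate and the two tiers above can be sketched as follows. The function names and tier labels are hypothetical, and the sketch assumes each classifier call is counted under at most one of the three signals; if one call can trip two signals, the real job would need to deduplicate first.

```python
def classifier_failure_rate(
    confident_unknown: int,
    fsm_rejected: int,
    customer_contradicted: int,
    total_calls: int,
) -> float:
    """Share of classifier calls in the 7-day window that hit any failure signal."""
    if total_calls == 0:
        return 0.0  # no traffic in the window means nothing to escalate
    failures = confident_unknown + fsm_rejected + customer_contradicted
    return failures / total_calls


def escalation_tier(rate: float) -> str:
    """Map a rolling failure rate onto the two-tier escalation table."""
    if rate >= 0.05:
        return "page"
    if rate >= 0.02:
        return "flag_on_next_prompt_pr"
    return "normal"
```

For example, 3 failures across 100 calls gives a rate of 0.03, which lands in the flag tier but does not page.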

Why a runtime budget in addition to the eval suite. The eval suite catches regressions on golden conversations; production traffic contains utterances no eval scenario predicted. The runtime budget catches drift between eval-suite quality and production quality. Together they bracket the classifier's behaviour from both sides.

9. Handoff boundary

AdminOrchestrator (D6) does NOT handle in-progress customer handoffs. The handoff handler (including the briefing-card shape, the /take / /done / /end magic-phrase commands, the re-handoff lifecycle, and the AI disclosure rules) is entirely ADR-0006's territory. ADR-0006 builds on A2 §3-§9.

ADR-0005 says only this: when admin types in their WhatsApp DID, the IdentityResolver checks whether there's an active customer-handoff for that admin (a HUMAN_DRIVING state on some customer-side LangGraph thread). If yes, the message routes to the handoff handler. If no, the message routes to AdminOrchestrator.
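The routing rule reduces to a single check. In this sketch the caller is assumed to have already looked up the states of the customer-side threads tied to this admin; the function name and the two return labels are hypothetical.

```python
def route_admin_message(customer_thread_states: list[str]) -> str:
    """Pick the handler for an inbound admin message.

    `customer_thread_states` holds the current state of each customer-side
    thread associated with this admin (lookup not shown).
    """
    if any(state == "HUMAN_DRIVING" for state in customer_thread_states):
        return "handoff_handler"
    return "admin_orchestrator"
```

Everything after the branch into the handoff handler belongs to ADR-0006.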

This boundary keeps ADR-0005 tight on orchestration plumbing and ADR-0006 tight on handoff lifecycle. Two ADRs both specifying the briefing card would be a drift opportunity.

Consequences

Positive.

  1. Model-agnostic orchestrator. Per-role swap is a YAML change. The LLMRouter interface stays stable across providers; adding a new provider is a new class. Vendor lock-in is structurally prevented.
  2. Cost discipline enforced at runtime, not just observed. The contextvar mechanism keeps budgets honest per-conversation; the hard ceiling converts a runaway booking into a paged admin who can finish it manually — no surprise bills.
  3. Classifier failures tracked with two-tier escalation. The runtime signal complements the eval-suite signal; bad prompts surface in days rather than weeks.
  4. Admin and customer share infrastructure. Same LangGraph harness, same TenantScopedSaver, same conversation_threads table with discriminator. DRY without sacrificing semantic clarity (the actor_type column makes filtering trivial).
  5. Tool registry's safety_class makes irreversible-action confirmation automatic. No per-tool boilerplate; new tools inherit the right behaviour from their metadata.
  6. GPT-4.1 mini for narrow tasks meaningfully cheapens the per-conversation cost (vs Claude Haiku) while keeping Claude for the two roles where output quality is most load-bearing for customer trust.

Negative.

  1. Two SDK dependencies + two API-key envelopes. OpenAI + Anthropic means pyproject.toml adds the openai package alongside the existing anthropic line, and tenant onboarding must provision keys for both. Mitigated by the LLMRouter abstraction (consumers never import provider SDKs directly).
  2. Role-assignments table is a new artifact that contributors need to understand. Documented in docs/methodology/agentic-development.md (Phase B) under prompt-routing; the YAML file is self-describing but the concept still needs onboarding for newcomers.
  3. current_booking_cost is yet-another-contextvar (joining current_tenant from ADR-0002). Both must be set in a known order at the channel boundary. Documented in backend/app/orchestrator/runner.py with explicit ordering.
  4. GPT-4.1 mini Swahili quality is the same uncomfortably thin question Phase C §A.1 flagged, now applied to the four narrow roles. Resolution path is unchanged: build the calibration set during pilot; the eval suite drives future per-role assignments. If GPT-4.1 mini struggles on Swahili classification, the YAML override flips that role to Claude Haiku without code changes.
  5. The safety_class = irreversible enforcement is convention, not language-enforced. A tool author could omit confirm_before_invoke and the registry wouldn't catch it. Mitigation: a startup-time check that every irreversible tool's handler imports through the registry's register_tool() decorator (which adds the wrap automatically). PR-time check by the prompt-eval-reviewer agent on tool additions.

Neutral.

  1. AdminOrchestrator's 4-hour TTL is a guess. No production data yet on actual admin session length. May tune up or down after first month of pilot.
  2. The 0.70 language-confidence and 0.85 intent-confidence thresholds are starting values from A1 §4.2 / §5.2 and are eval-tunable hyperparameters.
  3. GPT-4.1 mini context window (1M tokens as of 2026-04) is dramatically larger than any single Ratiba conversation needs; no context-window pressure in any plausible scenario.

Alternatives Considered

| Alternative | Rejected because |
| --- | --- |
| Two intent-classifier prompts (one per language). | Doubles prompt-management surface (two metadata.yaml versions to bump, two eval baselines, two PR-review surfaces) and adds a language-detection pre-step. Frontier-class models handle bilingual input natively; per-language tuning isn't worth its operational cost. |
| Bilingual disambiguation by explicit ask ("English or Swahili? / Kiingereza au Kiswahili?"). | Feels robotic for the 99% case where language is detectable. The tenant-locale prior is informative; auto-correction on the next turn is cheap. |
| No formal cost ceiling; track and dashboard only. | Loses the runtime safety net. A buggy prompt that loops the FSM through 100 LLM calls would be a surprise bill; the hard ceiling converts that into a paged admin (no money burned, customer still served). |
| Single-provider orchestrator (no LLMRouter, hardcoded SDK imports). | Vendor lock-in; "swap a model" becomes a code-search-and-replace exercise. The LLMRouter abstraction costs ~50 LoC and makes the swap a YAML change. |
| All-Claude default (per Phase C §1's inference). | Cost is meaningfully higher for the four narrow-task roles vs GPT-4.1 mini; vendor diversity is structurally desirable. Phase C §1's "Claude as primary brain" was an inference, not a documented commitment, and is now superseded by D5. Claude retained for the two nuanced roles where output quality dominates cost. |
| All-OpenAI default (no Claude in defaults table). | The handoff briefing card and the post-handoff reorientation message are bilingual prose that a person reads directly (the admin for the briefing card, the customer for the reorientation message); output quality is load-bearing. Until eval data shows GPT-4.1 mini's bilingual prose quality matches Claude's on these two specific tasks, the safer default is Claude for them. Per-tenant override (llm_role_overrides) lets a cost-sensitive tenant flip these to GPT-4.1 mini if their pilot data justifies it. |
| Decorator-based tool registry (@tool(...) on Python functions, no separate definition.json). | Faster to write a new tool, at the cost of needing a "decorator → JSON schema" extractor that stays in lockstep with the Python source. The pure-JSON-file approach makes the future MCP-server export trivial (copy the file) and the schema is the runtime contract for strict-mode structured outputs. |
| Stateless AdminOrchestrator (no FSM; each command is independent). | Loses the natural place for confirmation steps on irreversible actions: every tool would need to re-implement its own confirmation. The 4-state FSM is small (the boilerplate is in shared confirm_before_invoke infrastructure) and gives the right place for the confirmation gate. |
| Separate admin_threads table (parallel to conversation_threads). | Doubles the migration + query surface. The shared table with actor_type discriminator is DRYer; the partial indices (WHERE actor_type = 'admin' AND closed_at IS NULL) make the per-actor lookups single-index-scan cheap. |
| No formal classifier error budget (trust the eval suite to catch regressions). | The eval suite catches regressions on golden conversations; production traffic contains utterances no eval scenario predicted. Without a runtime signal, classifier drift surfaces only at the next eval-suite refresh, which is a slow feedback loop. |
| Specify the handoff briefing card inside ADR-0005. | Two ADRs both specifying the same artifact is a drift opportunity. ADR-0006 owns handoff entirely; ADR-0005 references and stops. |

References

  • docs/prd/ratiba-prd.md — §2.1 architecture, §4 Modules 7-9 (orchestrator + scheduling + admin)
  • docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — model brain question left open; this ADR closes it
  • docs/adr/ADR-0002-multi-tenant-isolation.md — TenantScopedSaver via per-tenant micro-pools; asyncio contextvar tenant propagation
  • docs/adr/ADR-0003-fsm-persistence.md — conversation_threads pointer table (extended here with actor_type column); Redis key catalogue; per-thread mutex
  • docs/research/2026-04-25-langgraph-postgressaver-spike.md — Option A wrapper integration model
  • docs/research/2026-04-25-orchestration-patterns.md — A1 §1 (hybrid verdict), §3 (FSM sketch), §4 (intent routing), §5 (slot filling), §6 (tool catalogue), §7 (LangGraph integration), §10.2 (open questions resolved here)
  • docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §3 (briefing card shape, deferred entirely to ADR-0006)
  • docs/research/2026-04-25-eval-frameworks.md — A3 §6 (Phase 3 canary support hooks into LLMRouter); §8 (Langfuse instrumentation is where cost_usd and classifier.failure_rate are tracked)
  • docs/methodology/agentic-development.md — Phase B §3 (prompt storage), §4 (eval gate matrix), §5 (auto-debug logging schema: cost.budget.* event types, tool.invoked event), §6 (delegate-vs-human-review boundaries)
  • docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — C1 (MCP-shape tools), Q5 (cost discipline) anchors
  • Project memory — project_cost_discipline.md (cost-conscious per-conversation framing)
  • OpenAI structured outputs documentation (strict mode, JSON Schema enforcement)
  • Anthropic structured outputs documentation (strict mode for tool calls)
  • LangGraph StateGraph + conditional edges + Command(resume=...) API reference