ADR-0005: Orchestration Model
Status: Accepted
Date: 2026-04-25
Context
A1 (docs/research/2026-04-25-orchestration-patterns.md) settled the
high-level orchestration shape: a hybrid LangGraph StateGraph with
LLM-augmented seams at intent classification, slot extraction, and
clarification rephrasing. Multi-agent rejected (Phase C §1 + A1 §2.2);
single-loop favored. C1 from spec §12 added MCP-shape internal tool
definitions dispatched in-process. Phase B §3 settled where prompts
live (in code, with Langfuse as observability-only).
What this ADR settles is the operational scaffolding around that shape: where exactly the prompts go, which model handles which role, how bilingual ambiguity resolves, how runtime cost gets bounded, what the AdminOrchestrator's much-shallower FSM looks like, what shape the tool registry takes, and how classifier errors are measured and escalated.
This ADR also closes a question ADR-0001 left implicitly open: which LLM provider is the orchestrator's default brain. Phase C §1 inferred Claude (cost favourable, presumed Swahili edge) but ADR-0001 itself did not pin the LLM brain — only STT (Deepgram), TTS (ElevenLabs), and the framework stack. Adrian's preference (2026-04-25) shifted the default to OpenAI GPT-4.1 mini for the high-volume narrow roles, with Claude Haiku reserved for two specific nuanced roles. The mechanism that makes this swappable per-role and per-tenant is the LLMRouter / role-assignments pattern (D4).
ADR-0001's amendment of 2026-04-25 will be updated in a follow-up amendment to reference this ADR for the LLM-brain default, since the LLM-brain inference in that amendment was light-touch.
Decision
Nine concrete decisions, organized as a coherent orchestration model.
1. Architectural recap (inherited, locked here)
| Aspect | Where decided | Recap |
|---|---|---|
| Orchestrator shape | A1 §1 | Hybrid: deterministic LangGraph StateGraph spine + LLM seams at three named points (intent classification, slot extraction, clarification rephrasing). |
| Single-loop vs multi-agent | Phase C §1 + A1 §2.2 | Single-loop. Multi-agent rejected for tightly coupled booking flows. |
| Per-orchestrator graph type | A1 §7 | Two graphs: CustomerBookingGraph + AdminGraph. Compiled per-invocation against a TenantScopedSaver per ADR-0002 D4. |
| Tool definition shape | C1 / A1 §6 | MCP-shape (JSON Schema), dispatched in-process for Phase 1; deferred MCP server boundary until Phase 3+. |
| Structured outputs | Phase C §10 / A1 §4 | Strict mode mandatory at every classifier and tool call. |
| Persistence | ADR-0003 | Two-tier (Redis hot + Postgres LangGraph checkpoint). Fresh thread_id per booking via conversation_threads pointer table. |
This ADR builds on those primitives without re-deriving them.
2. Intent classifier prompt: single bilingual prompt, human-review-only
The intent classifier runs on every customer turn that isn't a WhatsApp button reply. Per Phase B §3, prompts live in code:
backend/app/orchestrator/prompts/intent_classifier/
  system.md       # the single bilingual prompt (handles en + sw)
  metadata.yaml   # version, target role (intent_classifier), eval_baseline, owner
Single bilingual prompt (not split per language). The
structured-output schema includes language: enum[en, sw] so the
model identifies the customer's language as part of its output. Two
prompts would either double LLM calls per turn (language-detection
pre-step + classifier) or double the prompt-management surface;
neither pays for itself given that frontier-class models handle
bilingual input natively.
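A minimal sketch of what such an output schema could look like, written as a Python dict in JSON Schema form. Only the language enum comes from this ADR; the other field names, the intent labels, and the helper validate_classifier_output are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical strict-mode output schema for the intent classifier.
# Only the `language` enum is taken from the ADR; every other field is assumed.
INTENT_CLASSIFIER_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["intent", "confidence", "language", "language_confidence"],
    "properties": {
        "intent": {"type": "string", "enum": ["book", "cancel", "reschedule", "unknown"]},
        "confidence": {"type": "number"},
        "language": {"type": "string", "enum": ["en", "sw"]},
        "language_confidence": {"type": "number"},
    },
}


def validate_classifier_output(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload fits.

    Hand-rolled check for this flat schema only, not a general
    JSON Schema validator.
    """
    problems = []
    props = INTENT_CLASSIFIER_SCHEMA["properties"]
    for key in INTENT_CLASSIFIER_SCHEMA["required"]:
        if key not in payload:
            problems.append(f"missing field: {key}")
    for key, value in payload.items():
        if key not in props:
            problems.append(f"unexpected field: {key}")
            continue
        spec = props[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{key} must be a string")
        elif spec["type"] == "number" and not is_number:
            problems.append(f"{key} must be a number")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key} must be one of {spec['enum']}")
    return problems
```

Because language is part of the same structured output as the intent, one LLM call yields both; no separate detection step is needed.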
Human-review-only on every PR per Phase B §6. The intent
classifier gates the booking flow, which gates payment, so it touches
money indirectly. The prompt-eval-reviewer agent (Phase
B §10 #4) reads the eval-delta PR comment but does not auto-approve
intent-classifier prompt PRs.
3. Bilingual ambiguity disambiguation: default to tenant locale
When the intent classifier returns language confidence below 0.70
(starting threshold from A1 §5.2; eval-tunable), the orchestrator
defaults the customer-facing response language to
public.tenants.locale (set at tenant onboarding). The customer's
next utterance — now likely longer or more discriminating — is
re-classified and the orchestrator switches if the new classification
crosses the 0.70 threshold.
Why the tenant prior beats other defaults. Most customers of a given tenant speak the same language as the tenant. Defaulting to "whichever the LLM scored higher" (even at low confidence) ignores that prior; explicitly asking ("English or Swahili? / Kingereza au Kiswahili?") feels robotic for the 99% case where language is detectable. The auto-correction path on the next turn is cheap.
The 0.70 threshold is a starting value; A3's eval suite will refit it from real Swahili pilot transcripts (per spec §12 / Q5 cost-discipline).
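The fallback rule can be sketched as a small pure function. The function name and the assumption that public.tenants.locale holds the same codes as the classifier's language enum (en / sw) are illustrative, not confirmed by the source.

```python
LANGUAGE_CONFIDENCE_THRESHOLD = 0.70  # starting value from A1 §5.2; eval-tunable


def resolve_response_language(
    detected_language: str,
    language_confidence: float,
    tenant_locale: str,
    threshold: float = LANGUAGE_CONFIDENCE_THRESHOLD,
) -> str:
    """Pick the reply language for one turn.

    Below the threshold the tenant locale wins; at or above it the
    classifier's detected language wins. Calling this again on the next
    turn gives the auto-correction behaviour described above.
    """
    if language_confidence >= threshold:
        return detected_language
    return tenant_locale
```

For example, a short greeting scored "en" at 0.55 for a tenant with locale "sw" gets a Swahili reply; if the next, longer message scores "en" at 0.92, the reply language switches to English.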
4. Per-booking LLM cost ceiling: $0.05 soft / $0.20 hard, per-tenant
Two ceilings enforced via a current_booking_cost: ContextVar[float]
set at booking-thread creation and updated after every LLM call:
| Ceiling | Default | Behaviour on breach |
|---|---|---|
| Soft | $0.05 | Logs event_type=cost.budget.soft_breach (per Phase B §5.2 schema). Orchestrator continues. Surfaces in nightly cost-review dashboard. |
| Hard | $0.20 | Logs event_type=cost.budget.hard_breach. Orchestrator escalates to admin via the existing handoff path (reason code BUDGET_BREACH per A2 §2). Customer still gets served (by the admin); no surprise bill. |
Per-tenant configurable. New columns on public.tenants:
ALTER TABLE public.tenants ADD COLUMN cost_ceiling_soft_usd NUMERIC(6,4) NOT NULL DEFAULT 0.05;
ALTER TABLE public.tenants ADD COLUMN cost_ceiling_hard_usd NUMERIC(6,4) NOT NULL DEFAULT 0.20;
Some tenants justify higher ceilings (dental triage; longer
conversations with health context). Override is an UPDATE public.tenants operation; no code change.
Calibration. Defaults are starting values sized for 10 LLM calls
per booking at GPT-4.1 mini pricing (typically $0.001-$0.005 per call,
so 10 calls cost $0.01-$0.05; the soft ceiling sits at the top of that
range and the hard ceiling at 4x the soft ceiling). After the first 100
production bookings, recalibrate from the observed distribution. Aligns
with the cost-discipline principle (project memory:
project_cost_discipline.md).
The mechanism interacts cleanly with the auto-debug logging schema:
the cost_usd field on every LLM-call event (Phase B §5.1) is the
data the contextvar accumulates, so cost can be diagnosed
per-conversation or aggregated per-tenant from JSONL logs alone.
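A minimal sketch of the accumulator, assuming a breach means the running total has reached or passed the ceiling (the ADR does not say whether the comparison is strict). The names BudgetStatus, start_booking_budget, and record_llm_cost are hypothetical; only current_booking_cost and the event-type strings come from this ADR.

```python
from contextvars import ContextVar
from enum import Enum

# Running LLM spend for the booking handled in the current context.
current_booking_cost: ContextVar[float] = ContextVar("current_booking_cost", default=0.0)


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_BREACH = "cost.budget.soft_breach"
    HARD_BREACH = "cost.budget.hard_breach"


def start_booking_budget() -> None:
    """Reset the accumulator at booking-thread creation."""
    current_booking_cost.set(0.0)


def record_llm_cost(cost_usd: float, soft_usd: float = 0.05, hard_usd: float = 0.20) -> BudgetStatus:
    """Add one call's cost and report where the booking stands.

    The caller is expected to log the event type and, on a hard breach,
    escalate to the admin handoff path; neither is done here.
    """
    total = current_booking_cost.get() + cost_usd
    current_booking_cost.set(total)
    if total >= hard_usd:
        return BudgetStatus.HARD_BREACH
    if total >= soft_usd:
        return BudgetStatus.SOFT_BREACH
    return BudgetStatus.OK
```

In this sketch the per-tenant values from cost_ceiling_soft_usd and cost_ceiling_hard_usd would be passed in as soft_usd and hard_usd. Note that it reports a soft breach on every call after the threshold is crossed; deduplicating the log event is left to the caller.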
5. Model selection and LLM router
A new LLMRouter class in backend/app/orchestrator/llm/router.py
maps logical roles to concrete (provider, model) pairs via YAML
configuration. Closes the LLM-brain question ADR-0001 left open.
Role-assignment defaults (backend/app/orchestrator/llm/role_assignments.yaml):
defaults:
  intent_classifier:   {provider: openai, model: gpt-4.1-mini}
  slot_extractor:      {provider: openai, model: gpt-4.1-mini}
  answer_shaper:       {provider: openai, model: gpt-4.1-mini}
  handoff_summarizer:  {provider: anthropic, model: claude-haiku-4.5}
  reorientation:       {provider: anthropic, model: claude-haiku-4.5}
Why GPT-4.1 mini for the four narrow-task roles. Cheaper per call than Claude Haiku for the same workload (~$0.40/1M input tokens + $1.60/1M output tokens); strict-mode structured outputs supported; multilingual including Swahili. The four roles are bounded classification / extraction / shaping tasks where the schema is strict and the output is constrained. Vendor diversity also reduces the OpenAI-pulled-the-rug risk (Phase C §2 surfaced this risk generally).
Why Claude Haiku for handoff_summarizer + reorientation. These two roles produce free-form bilingual prose that shapes what the customer experiences: the handoff brief goes to the admin and informs how they reply, and the reorientation message is the agent-to-customer "welcome back" after admin hand-back, which the customer reads directly. Brief quality and reorientation fluency are load-bearing for trust; Claude's reputation for natural-sounding Swahili is worth the higher per-call cost on these two relatively low-volume roles (one call each per handoff, vs many calls per booking for the narrow roles).
Per-tenant override. New JSONB column on public.tenants:
ALTER TABLE public.tenants
  ADD COLUMN llm_role_overrides JSONB NOT NULL DEFAULT '{}';
Shape: {"intent_classifier": {"provider": "anthropic", "model": "claude-haiku-4.5"}}.
Empty default means use the role_assignments.yaml defaults.
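How the router might combine the YAML defaults with a tenant's override column can be sketched as follows. ModelAssignment and resolve_role are hypothetical names, and the defaults are inlined as a dict here rather than loaded from role_assignments.yaml to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelAssignment:
    provider: str
    model: str


# Mirrors the role-assignment defaults listed above.
DEFAULT_ROLE_ASSIGNMENTS = {
    "intent_classifier": ModelAssignment("openai", "gpt-4.1-mini"),
    "slot_extractor": ModelAssignment("openai", "gpt-4.1-mini"),
    "answer_shaper": ModelAssignment("openai", "gpt-4.1-mini"),
    "handoff_summarizer": ModelAssignment("anthropic", "claude-haiku-4.5"),
    "reorientation": ModelAssignment("anthropic", "claude-haiku-4.5"),
}


def resolve_role(role: str, tenant_overrides: dict) -> ModelAssignment:
    """Resolve a logical role to (provider, model) for one tenant.

    `tenant_overrides` is the parsed llm_role_overrides JSONB value;
    an empty dict means every role falls through to the defaults.
    """
    override = tenant_overrides.get(role)
    if override is not None:
        return ModelAssignment(provider=override["provider"], model=override["model"])
    if role not in DEFAULT_ROLE_ASSIGNMENTS:
        raise ValueError(f"unknown LLM role: {role}")
    return DEFAULT_ROLE_ASSIGNMENTS[role]
```

An override for one role leaves the other roles on their defaults, which is what makes a per-role flip a data change rather than a code change.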
Provider abstractions. Two interface implementations in
backend/app/orchestrator/llm/providers/:
from typing import Protocol

# LLMResponse: shared response type, not shown here
class LLMProvider(Protocol):
    async def call(
        self,
        model: str,
        system_prompt: str,
        user_prompt: str,
        schema: dict | None = None,  # for strict-mode structured outputs
        max_tokens: int | None = None,
    ) -> LLMResponse: ...
OpenAIProvider + AnthropicProvider both implement this.
Adding a future provider (Gemini, local Llama, etc.) = new class
implementing LLMProvider + entry in role_assignments.yaml. No
orchestrator code changes.
Future canary support. The router is also where Phase 3 canary
prompt/model routing will plug in (per A3 §6). For Phase 1, the
config is static; for Phase 3, the router consults a "prompt
assignment service" (deterministic hash of (tenant_id, conversation_id)
→ assignment) at the start of each conversation. The interface stays
the same; the resolution becomes dynamic.
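One way the deterministic assignment could work is to hash the pair into a fixed number of buckets and route the lowest buckets to the canary. This is only a sketch: the function names, the use of SHA-256, the 100-bucket split, and the "canary" / "stable" labels are all assumptions, since the ADR defers the design to Phase 3.

```python
import hashlib


def canary_bucket(tenant_id: str, conversation_id: str, buckets: int = 100) -> int:
    """Map (tenant_id, conversation_id) to a stable bucket in [0, buckets)."""
    key = f"{tenant_id}:{conversation_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer; any fixed slice would do.
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(tenant_id: str, conversation_id: str, canary_percent: int = 5) -> str:
    """Return "canary" for roughly canary_percent of conversations, else "stable"."""
    if canary_bucket(tenant_id, conversation_id) < canary_percent:
        return "canary"
    return "stable"
```

Because the hash depends only on the two identifiers, a conversation keeps the same assignment across turns and process restarts without any stored state.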
6. AdminOrchestrator: shallow 4-state FSM, shared conversation_threads
AdminOrchestrator is dispatcher-style — admin types one command,
classifier routes, executor runs (with confirmation step for
irreversible actions), result is sent back. Four states:
IDLE ──▶ ROUTED ──▶ AWAIT_CONFIRMATION ──▶ EXECUTED ──▶ IDLE
           │                                  ▲
           └──────────────────────────────────┘
             (read-only commands skip
              AWAIT_CONFIRMATION)
Read-only commands skip AWAIT_CONFIRMATION (stats query, booking
list, contact lookup). Irreversible commands always go through it
regardless of LLM confidence (per A1 §5.1): delete service, broadcast
message, cancel booking, create staff time-block. The safety_class
field on each tool's metadata.yaml (D6 below) drives this routing
automatically — no per-tool confirmation boilerplate.
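The four-state machine can be sketched as a single transition function. Two behaviours here are assumptions not stated in the ADR: a declined confirmation returns to IDLE, and write-class tools dispatch directly like read-class tools.

```python
from enum import Enum


class AdminState(Enum):
    IDLE = "idle"
    ROUTED = "routed"
    AWAIT_CONFIRMATION = "await_confirmation"
    EXECUTED = "executed"


def next_state(state: AdminState, safety_class: str, confirmed: bool = False) -> AdminState:
    """Advance the admin FSM by one step for a tool of the given safety_class."""
    if state is AdminState.IDLE:
        return AdminState.ROUTED
    if state is AdminState.ROUTED:
        # Irreversible tools always pass through the confirmation gate,
        # regardless of classifier confidence.
        if safety_class == "irreversible":
            return AdminState.AWAIT_CONFIRMATION
        return AdminState.EXECUTED
    if state is AdminState.AWAIT_CONFIRMATION:
        # Assumption: declining drops the command and returns to IDLE.
        return AdminState.EXECUTED if confirmed else AdminState.IDLE
    # EXECUTED: the result has been sent back, so the session is idle again.
    return AdminState.IDLE
```

Keeping the gate in the transition function rather than in each handler is what lets the safety_class metadata drive confirmation without per-tool code.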
Session TTL: 4 hours (vs 30 min for customer). Admins operate in batches: "what's today like, block Jane Friday afternoon, send Mary a reminder, what's revenue this week" — multiple commands in one sitting. A 30-min TTL would force re-greeting between commands.
Shared conversation_threads table (introduced by ADR-0003) with
new column:
ALTER TABLE conversation_threads
  ADD COLUMN actor_type VARCHAR(20) NOT NULL
  CHECK (actor_type IN ('customer', 'admin'));

CREATE INDEX idx_threads_active_admin
  ON conversation_threads (customer_phone)
  WHERE closed_at IS NULL AND actor_type = 'admin';
(Plus an analogous partial index for actor_type = 'customer'; the
existing single partial index from ADR-0003 splits into two.)
Same LangGraph + TenantScopedSaver infrastructure as the customer
side. Same thread_id derivation (fresh ULID per session). Same
checkpoint tables. Same per-tenant micro-pool. Reusing the
infrastructure is free; bifurcating it would be unnecessary cost.
Boundary with handoff (D9). When admin is in the middle of a
customer-handoff conversation (HUMAN_DRIVING state on the
customer-side LangGraph thread per A2), admin messages route to the
handoff handler, NOT to AdminOrchestrator. AdminOrchestrator only
sees admin commands that aren't part of an active handoff.
7. MCP-shape tool registry: directory-per-tool with safety_class
Tools live at backend/app/orchestrator/tools/<tool_name>/ with three
files. Pattern repeats for the 13 Phase-1 tools listed in A1 §6.1.
backend/app/orchestrator/tools/
  registry.py
  list_active_services/
    definition.json   # MCP shape: {name, description, inputSchema}
    handler.py        # async def invoke(args: dict) -> dict
    metadata.yaml     # version, owner, safety_class, audit_log_required
  confirm_booking/
    definition.json
    handler.py
    metadata.yaml
  initiate_stk_push/
    definition.json
    handler.py
    metadata.yaml
  ... (13 total)
definition.json is pure MCP shape. When Phase 3+ wants to expose
any tool to a separate analytics-agent process, the file copies
verbatim into the MCP server manifest — no decoder/extractor needed.
JSON Schema (with additionalProperties: false) is what Claude/OpenAI
strict-mode consumes for structured-output validation, so the file
also serves as the runtime schema.
safety_class enum drives confirmation behaviour:
| Class | Examples | Auto-behaviour |
|---|---|---|
| read | list_active_services, lookup_contact, query_bookings | None. Direct dispatch. |
| write | upsert_contact | None unless tenant explicitly enables; idempotent writes. |
| irreversible | confirm_booking, initiate_stk_push, broadcast_message, cancel_booking | The register_tool() machinery wraps these with a confirm_before_invoke decorator that forces an AWAIT_CONFIRMATION step regardless of LLM confidence. |
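A rough sketch of how the wrapping in the irreversible row might look. The names register_tool and confirm_before_invoke come from this ADR, but their signatures, the ConfirmationRequired exception, and the confirmed keyword are assumptions made for illustration.

```python
import asyncio
import functools


class ConfirmationRequired(Exception):
    """Raised when an irreversible tool is invoked without confirmation."""


def confirm_before_invoke(handler):
    """Wrap an async tool handler so it refuses to run unconfirmed."""

    @functools.wraps(handler)
    async def wrapper(args: dict, *, confirmed: bool = False) -> dict:
        if not confirmed:
            # The orchestrator would catch this and move to AWAIT_CONFIRMATION.
            raise ConfirmationRequired(handler.__name__)
        return await handler(args)

    return wrapper


def register_tool(registry: dict, name: str, safety_class: str, handler):
    """Store the handler, wrapping it automatically when it is irreversible."""
    if safety_class == "irreversible":
        handler = confirm_before_invoke(handler)
    registry[name] = handler
    return handler
```

In this sketch the tool author never applies the decorator by hand; registering through register_tool is enough, which matches the "no per-tool boilerplate" goal.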
audit_log_required: bool drives whether the tool's invocation
emits an audit.config.changed event (per Phase B §5.2 event types)
in addition to the standard tool.invoked event. Defaults to true
for irreversible tools; configurable per-tool.
Registry validates at startup. Schema check on definition.json
(against a meta-schema), handler signature check on handler.py
(async def invoke(args: dict) -> dict), required fields check on
metadata.yaml. Misshapen tools fail fast, before serving any
traffic.
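The three startup checks could be sketched as one function that returns every problem it finds. This is a simplification: it checks for required keys rather than validating definition.json against a full meta-schema, and the function name and error strings are hypothetical.

```python
import inspect

REQUIRED_DEFINITION_KEYS = {"name", "description", "inputSchema"}
REQUIRED_METADATA_KEYS = {"version", "owner", "safety_class", "audit_log_required"}
SAFETY_CLASSES = {"read", "write", "irreversible"}


def validate_tool(definition: dict, metadata: dict, invoke) -> list:
    """Return a list of validation errors for one tool; empty means valid."""
    errors = []
    for key in sorted(REQUIRED_DEFINITION_KEYS - set(definition)):
        errors.append(f"definition.json missing key: {key}")
    for key in sorted(REQUIRED_METADATA_KEYS - set(metadata)):
        errors.append(f"metadata.yaml missing key: {key}")
    if "safety_class" in metadata and metadata["safety_class"] not in SAFETY_CLASSES:
        errors.append(f"unknown safety_class: {metadata['safety_class']}")
    # Handler contract from the ADR: async def invoke(args: dict) -> dict
    if not inspect.iscoroutinefunction(invoke):
        errors.append("handler.invoke must be declared with async def")
    elif list(inspect.signature(invoke).parameters) != ["args"]:
        errors.append("handler.invoke must take exactly one parameter named args")
    return errors
```

A registry built this way would call validate_tool for every tool directory at import time and refuse to start if any list is non-empty.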
Tool addition is human-review-only per Phase B §6. The
prompt-eval-reviewer agent (Phase B §10 #4) reads the eval-delta PR
comment when a new tool's definition or handler changes; the
reviewer flags scenarios that regressed but does not auto-approve.
8. Classifier error budget: three-signal failure + 2%/5% per-tenant
Strict-mode structured outputs guarantee schema compliance, NOT
semantic correctness. The classifier can return intent="book" when
the user wanted to cancel — schema-valid, semantically wrong.
Three-signal failure definition. Any of the following counts as a classifier failure:
- Confident-unknown. LLM returns intent="unknown" with confidence > 0.5. The model claims confidence in not-knowing, which points to a bug in the prompt's "what do I do when I don't know" branch.
- FSM-rejected. LLM returns a valid intent that the downstream FSM rejects (e.g., intent="reschedule" for a customer with no existing bookings). The classifier and the business logic disagree about what's possible.
- Customer-contradicted. LLM returns a valid intent that the customer explicitly contradicts within the next two turns ("no, I meant cancel"). Lagging signal but high-fidelity.
Tracking. A Langfuse score classifier.failure_rate per tenant,
computed as a 7-day rolling rate over the three signals. Signal (3)
is computed by a nightly job that reads interactions rows for the
window and matches contradiction patterns; signals (1) + (2) are
recorded inline at classifier-call time.
Two-tier escalation:
| Failure rate (rolling 7-day) | Action |
|---|---|
| < 2% | Normal. No action. |
| ≥ 2% | Eval gate flags the tenant on the next prompt PR. The prompt-eval-reviewer agent surfaces it in the PR comment. |
| ≥ 5% | Page Adrian (the prompt is materially broken, or model has regressed for that tenant's traffic mix). |
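The rate and the two-tier mapping in the table above can be sketched in a few lines. The function names and the returned action labels are hypothetical; the 2% and 5% thresholds are the ones from the table.

```python
def classifier_failure_rate(failures: int, total_calls: int) -> float:
    """Failures across all three signals divided by classifier calls in the window."""
    if total_calls == 0:
        return 0.0  # no traffic in the window means nothing to escalate
    return failures / total_calls


def escalation_action(rate: float, flag_at: float = 0.02, page_at: float = 0.05) -> str:
    """Map a rolling 7-day failure rate to the escalation tier."""
    if rate >= page_at:
        return "page"
    if rate >= flag_at:
        return "flag_on_next_prompt_pr"
    return "none"
```

For a tenant with 3 failures in 100 classifier calls over the window, the rate is 0.03 and the action is the PR flag, not a page.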
Why a runtime budget in addition to the eval suite. The eval suite catches regressions on golden conversations; production traffic contains utterances no eval scenario predicted. The runtime budget catches drift between eval-suite quality and production quality. Together they bracket the classifier's behaviour from both sides.
9. Handoff boundary
AdminOrchestrator (D5) does NOT handle in-progress customer
handoffs. The handoff handler — including the briefing-card shape,
the /take / /done / /end magic-phrase commands, the
re-handoff lifecycle, the AI disclosure rules — is entirely
ADR-0006's territory. ADR-0006 builds on A2 §3-§9.
ADR-0005 says only this: when admin types in their WhatsApp DID, the
IdentityResolver checks whether there's an active customer-handoff
for that admin (a HUMAN_DRIVING state on some customer-side
LangGraph thread). If yes, the message routes to the handoff handler.
If no, the message routes to AdminOrchestrator.
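The routing rule reduces to a single lookup. In this sketch the active handoffs are passed in as a plain mapping from admin phone number to the customer thread currently in HUMAN_DRIVING; how the IdentityResolver actually obtains that mapping is not specified here, and the function and label names are hypothetical.

```python
from typing import Optional, Tuple


def route_admin_message(admin_phone: str, active_handoffs: dict) -> Tuple[str, Optional[str]]:
    """Decide which component receives an inbound admin message.

    Returns (destination, customer_thread_id). The thread id is None when
    the message goes to AdminOrchestrator.
    """
    thread_id = active_handoffs.get(admin_phone)
    if thread_id is not None:
        return ("handoff_handler", thread_id)
    return ("admin_orchestrator", None)
```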
This boundary keeps ADR-0005 tight on orchestration plumbing and ADR-0006 tight on handoff lifecycle. Two ADRs both specifying the briefing card would be drift opportunity.
Consequences
Positive.
- Model-agnostic orchestrator. Per-role swap is a YAML change. The LLMRouter interface stays stable across providers; adding a new provider is a new class. Vendor lock-in is structurally prevented.
- Cost discipline enforced at runtime, not just observed. The contextvar mechanism keeps budgets honest per-conversation; the hard ceiling converts a runaway booking into a paged admin who can finish it manually, with no surprise bills.
- Classifier failures tracked with two-tier escalation. The runtime signal complements the eval-suite signal; bad prompts surface in days rather than weeks.
- Admin and customer share infrastructure. Same LangGraph harness, same TenantScopedSaver, same conversation_threads table with discriminator. DRY without sacrificing semantic clarity (the actor_type column makes filtering trivial).
- Tool registry's safety_class makes irreversible-action confirmation automatic. No per-tool boilerplate; new tools inherit the right behaviour from their metadata.
- GPT-4.1 mini for narrow tasks meaningfully cheapens the per-conversation cost (vs Claude Haiku) while keeping Claude for the two roles where output quality is most load-bearing for customer trust.
Negative.
- Two SDK dependencies + two API-key envelopes. OpenAI + Anthropic means pyproject.toml adds the openai package alongside the existing anthropic line, and tenant onboarding must provision keys for both. Mitigated by the LLMRouter abstraction (consumers never import provider SDKs directly).
- Role-assignments table is a new artifact future tenants need to understand. Documented in docs/methodology/agentic-development.md (Phase B) under prompt-routing; the YAML file is self-describing but the concept needs onboarding for new contributors.
- current_booking_cost is yet another contextvar (joining current_tenant from ADR-0002). Both must be set in a known order at the channel boundary. Documented in backend/app/orchestrator/runner.py with explicit ordering.
- GPT-4.1 mini Swahili quality is the same uncomfortably thin question Phase C §A.1 flagged, now applied to the four narrow roles. Resolution path is unchanged: build the calibration set during pilot; the eval suite drives future per-role assignments. If GPT-4.1 mini struggles on Swahili classification, the YAML override flips that role to Claude Haiku without code changes.
- The safety_class = irreversible enforcement is convention, not language-enforced. A tool author could omit confirm_before_invoke and the registry wouldn't catch it. Mitigation: a startup-time check that every irreversible tool's handler is registered through the registry's register_tool() decorator (which adds the wrap automatically), plus a PR-time check by the prompt-eval-reviewer agent on tool additions.
Neutral.
- AdminOrchestrator's 4-hour TTL is a guess. No production data yet on actual admin session length. May tune up or down after first month of pilot.
- The 0.70 language-confidence threshold and the 0.85 intent-confidence threshold used throughout are starting values from A1 §4.2 / §5.2; both are eval-tunable hyperparameters.
- GPT-4.1 mini context window (1M tokens as of 2026-04) is dramatically larger than any single Ratiba conversation needs; no context-window pressure in any plausible scenario.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Two intent-classifier prompts (one per language). | Doubles prompt-management surface (two metadata.yaml versions to bump, two eval baselines, two PR-review surfaces) and adds a language-detection pre-step. Frontier-class models handle bilingual input natively; per-language tuning isn't worth its operational cost. |
| Bilingual disambiguation by explicit ask ("English or Swahili? / Kingereza au Kiswahili?"). | Feels robotic for the 99% case where language is detectable. The tenant-locale prior is informative; auto-correction on the next turn is cheap. |
| No formal cost ceiling; track and dashboard only. | Loses the runtime safety net. A buggy prompt that loops the FSM through 100 LLM calls would be a surprise bill; the hard ceiling converts that into a paged admin (spend capped near the hard ceiling, customer still served). |
| Single-provider orchestrator (no LLMRouter, hardcoded SDK imports). | Vendor lock-in; "swap a model" becomes a code-search-and-replace exercise. The LLMRouter abstraction costs ~50 LoC and makes the swap a YAML change. |
| All-Claude default (per Phase C §1's inference). | Cost is meaningfully higher for the four narrow-task roles vs GPT-4.1 mini; vendor diversity is structurally desirable. Phase C §1's "Claude as primary brain" was an inference, not a documented commitment, and is now superseded by D4. Claude retained for the two nuanced roles where output quality dominates cost. |
| All-OpenAI default (no Claude in defaults table). | The handoff briefing card and the post-handoff reorientation message are bilingual prose that shapes what the customer reads (the briefing card through the admin's reply, the reorientation message directly); output quality is load-bearing. Until eval data shows GPT-4.1 mini's bilingual prose quality matches Claude's on these two specific tasks, the safer default is Claude for them. Per-tenant override (llm_role_overrides) lets a cost-sensitive tenant flip these to GPT-4.1 mini if their pilot data justifies it. |
| Decorator-based tool registry (@tool(...) on Python functions, no separate definition.json). | Faster to write a new tool, at the cost of needing a "decorator → JSON schema" extractor that stays in lockstep with the Python source. The pure-JSON-file approach makes the future MCP-server export trivial (copy the file) and the schema is the runtime contract for strict-mode structured outputs. |
| Stateless AdminOrchestrator (no FSM; each command is independent). | Loses the natural place for confirmation steps on irreversible actions — every tool would need to re-implement its own confirmation. The 4-state FSM is small (the boilerplate is in shared confirm_before_invoke infrastructure) and gives the right place for the confirmation gate. |
| Separate admin_threads table (parallel to conversation_threads). | Doubles the migration + query surface. The shared table with actor_type discriminator is DRYer; the partial indices (WHERE actor_type = 'admin' AND closed_at IS NULL) make the per-actor lookups single-index-scan cheap. |
| No formal classifier error budget (trust the eval suite to catch regressions). | The eval suite catches regressions on golden conversations; production traffic contains utterances no eval scenario predicted. Without a runtime signal, classifier drift surfaces only at the next eval-suite refresh — slow feedback loop. |
| Specify the handoff briefing card inside ADR-0005. | Two ADRs both specifying the same artifact is a drift opportunity. ADR-0006 owns handoff entirely; ADR-0005 references and stops. |
References
- docs/prd/ratiba-prd.md: §2.1 architecture, §4 Modules 7-9 (orchestrator + scheduling + admin)
- docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25): model brain question left open; this ADR closes it
- docs/adr/ADR-0002-multi-tenant-isolation.md: TenantScopedSaver via per-tenant micro-pools; asyncio contextvar tenant propagation
- docs/adr/ADR-0003-fsm-persistence.md: conversation_threads pointer table (extended here with actor_type column); Redis key catalogue; per-thread mutex
- docs/research/2026-04-25-langgraph-postgressaver-spike.md: Option A wrapper integration model
- docs/research/2026-04-25-orchestration-patterns.md: A1 §1 (hybrid verdict), §3 (FSM sketch), §4 (intent routing), §5 (slot filling), §6 (tool catalogue), §7 (LangGraph integration), §10.2 (open questions resolved here)
- docs/research/2026-04-25-human-in-the-loop-handoff.md: A2 §3 (briefing card shape, deferred entirely to ADR-0006)
- docs/research/2026-04-25-eval-frameworks.md: A3 §6 (Phase 3 canary support hooks into LLMRouter); §8 (Langfuse instrumentation is where cost_usd and classifier.failure_rate are tracked)
- docs/methodology/agentic-development.md: Phase B §3 (prompt storage), §4 (eval gate matrix), §5 (auto-debug logging schema: cost.budget.* event types, tool.invoked event), §6 (delegate-vs-human-review boundaries)
- docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12: C1 (MCP-shape tools), Q5 (cost discipline) anchors
- Project memory: project_cost_discipline.md (cost-conscious per-conversation framing)
- OpenAI structured outputs documentation (strict mode, JSON Schema enforcement)
- Anthropic structured outputs documentation (strict mode for tool calls)
- LangGraph StateGraph + conditional edges + Command(resume=...) API reference