Pilot deployment runbook

This page is the script for the M12 W5 pilot validation pass and the M13 real-tenant onboarding pass that follows. Adrian opens it and follows it verbatim — every step, every command, every expected log line. The pilot's job is to surface deviations: places where what the doc said and what the system did diverge. Each deviation becomes a PR.

The four scenarios below cover the load-bearing channels: WhatsApp (Tier-1, the primary surface), voice (Tier-1, the most difficult surface), web widget (Tier-2, the conversion-funnel surface), and SMS handoff (Tier-2 fallback, the cost-discipline surface). Together they exercise every ADR-pinned subsystem: classifier (ADR-0005), FSM (ADR-0003), payments (ADR-0007), handoff (ADR-0006), channel substrate (ADR-0009), and tenant self-service (ADR-0010).

Pre-flight checklist

Four gates before the first scenario fires. Each is a hard pre-condition; do not start if any is red.

Setup not done yet? If this is your first time on this machine, start at Dev setup (one-time installs) and then Onboard a tenant. This page assumes the stack is already bootstrapped.

| Gate | Verify with | If red |
| --- | --- | --- |
| `pilot-preflight.sh` is green | `./scripts/pilot-preflight.sh` exits `0` with all rows `[OK]` | See Local dev runbook / Pre-flight. |
| Full test suite green | Run the backend and frontend test suites; see Testing runbook for the exact command sequence | Stop the pilot. Do not pilot from a red baseline. Open a fix PR first. |
| A fresh tenant is onboarded | A row in `public.tenants`; `tenant_<slug>` schema exists; catalog populated; staff scheduled | Follow Onboard a tenant. For pilot, use slug `spa_pilot`. |
| Env vars set per channel | Daraja sandbox creds, Meta WhatsApp test number on the tenant, LiveKit creds, Africa's Talking sandbox creds | Set in `.env`; see Configuration reference for the full env-var table. Restart the backend; re-run pilot-preflight. |
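
The first gate can be checked mechanically. The sketch below is illustrative only: it assumes each check prints one row with a bracketed status such as `[OK]` or `[FAIL]`, which is inferred from the table rather than from the script itself, and the helper names `preflight_is_green` and `run_preflight` are hypothetical.

```python
import re
import subprocess

# Assumption: each check prints one row carrying a bracketed status, e.g. "[OK]" or "[FAIL]".
STATUS = re.compile(r"\[([A-Z]+)\]")


def preflight_is_green(exit_code: int, output: str) -> bool:
    """True only if the script exited 0 and every status row reads [OK]."""
    statuses = STATUS.findall(output)
    return exit_code == 0 and bool(statuses) and all(s == "OK" for s in statuses)


def run_preflight(script: str = "./scripts/pilot-preflight.sh") -> bool:
    """Run the script from the repo root and apply the gate to its output."""
    result = subprocess.run([script], capture_output=True, text=True)
    return preflight_is_green(result.returncode, result.stdout)
```

An empty output is treated as red on purpose: a script that prints nothing has not demonstrated that any check ran.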

A capture file lives at docs/M12-pilot-deviations.md (created by W5, deleted at W6 close-out — superseded by the close-out memo). Open it in an editor before starting; you'll be writing into it as you go.

Scenario overview

All four scenarios share the same pre-condition → action → expected-result shape. The following diagram shows the common structure and how each scenario branches:

[Diagram: common scenario structure and per-scenario branches]

Scenario 1 — WhatsApp booking happy path

Estimated time: ~15 min. Channel: WhatsApp Cloud API. ADRs exercised: 0003 (FSM), 0005 (orchestration), 0007 (payments), 0008 (Cloud API direct).

Pre-conditions

tenant_spa_pilot is onboarded with at least one staff member, one service, and a recurring weekly schedule.
The tenant's whatsapp_phone_number_id resolves to Meta's free dev test number; your own phone is on the verified-recipients list.
Daraja sandbox shortcode + passkey are set on the tenant.

Action

1. Open WhatsApp on your phone and message the Meta test number: `Hi I'd like to book a haircut tomorrow at 3pm`
2. In a separate terminal, tail the backend log:
   ```
   tail -f backend/.uvicorn.log
   ```
3. Wait for the agent's reply on WhatsApp.
4. Reply `yes` to confirm.
5. The Daraja sandbox simulator pops up; enter the sandbox PIN to approve.
6. Wait for the final confirmation message on WhatsApp.

Sequence

Expected log lines

For how to read structured log fields and filter by thread_id, see Observability.

| Stage | Log expectation | WhatsApp expectation |
| --- | --- | --- |
| Inbound | `webhook.whatsapp signature_valid=True` | (silent) |
| Classify | `classifier intent=book confidence>0.8 lang=en` | (silent) |
| FSM transitions | `state=GREET → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM thread=<ULID>` | Confirmation prompt within ~2s |
| User confirm | `state=CONFIRM → PAYMENT_PENDING` | (silent) |
| STK push | `daraja.client stkpush success CheckoutRequestID=<ws_CO_...>` | M-Pesa pop-up on phone |
| Callback | `webhook.daraja correlation_match thread=<ULID>` | (silent) |
| FSM final | `state=PAYMENT_PENDING → BOOKED` | Final confirmation message |
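
Checking the stages above by eye is error-prone when the log is busy. A minimal sketch of an in-order check follows; it assumes the stage markers appear as literal substrings of the log lines, and the marker list and the function name `first_missing_stage` are illustrative, not part of the backend.

```python
# Stage markers taken from the table above, trimmed to the stable part of each line.
S1_STAGES = [
    "webhook.whatsapp signature_valid=True",
    "classifier intent=book",
    "state=CONFIRM → PAYMENT_PENDING",
    "daraja.client stkpush success",
    "webhook.daraja correlation_match",
    "state=PAYMENT_PENDING → BOOKED",
]


def first_missing_stage(log_lines, stages=S1_STAGES):
    """Return the first marker not found in order, or None if every stage appears.

    The iterator is shared across markers, so each marker is searched only in
    the lines after the previous match (one marker per line is assumed).
    """
    remaining = iter(log_lines)
    for marker in stages:
        if not any(marker in line for line in remaining):
            return marker
    return None
```

A return value of `None` means the run matched the expected sequence; any other value names the stage to record in the deviations file.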

Pass criteria

The final WhatsApp message arrives within ~30s end-to-end.
tenant_spa_pilot.appointments has a new row with status='confirmed'.
tenant_spa_pilot.payments has a row with state='settled' and a populated mpesa_receipt.
No errors or warnings in the log line stream for that thread.

If any of those is false, capture in docs/M12-pilot-deviations.md with severity per the rubric below.

Scenario 2 — Voice booking happy path

Estimated time: ~15 min. Channel: LiveKit SIP + Deepgram + ElevenLabs. ADRs exercised: 0003, 0005, 0007, plus the voice-stack pattern (LiveKit AgentSession, post-process voice rendering).

Pre-conditions

LiveKit container is running (docker compose ps shows ratiba-livekit up).
Deepgram + ElevenLabs API keys set in .env; visible to backend at startup.
The tenant has a SIP-registered phone number routed to LiveKit's local SIP bridge.
M-Pesa STK push has the 90s voice hard cap per ADR-0007 — the voice flow is shorter than WhatsApp by design.

Action

1. Call the tenant's pilot phone number (LiveKit local SIP bridge maps this to a room).
2. After the agent's greeting, speak: `I'd like to book a haircut tomorrow at three pm.`
3. Tail the backend log:
   ```
   tail -f backend/.uvicorn.log
   ```
4. Listen for the agent's confirmation TTS playback over the call.
5. Speak `yes` to confirm.
6. Approve the STK push that fires on your phone (within the 90s voice cap).
7. Listen for the final confirmation TTS.

Sequence

Expected log lines

For log-reading commands and structured log field reference, see Observability.

| Stage | Log expectation | Audio / latency expectation |
| --- | --- | --- |
| Call connect | `livekit.session.start room=<id> participant=<sip>` | Call connects within ~2s |
| STT first turn | `deepgram.transcript text="i'd like to book..." lang=en` | Transcribed within ~1s of speech end |
| Classify + FSM | Same `classifier` + `booking_graph` lines as Scenario 1 | (silent) |
| TTS confirmation | `elevenlabs.synthesize duration_ms=<x>` | Agent voice plays within ~2.5s end-to-end |
| User confirm STT | `deepgram.transcript text="yes"` | (silent) |
| STK push | `daraja.client stkpush success` | M-Pesa pop-up on phone within ~1s |
| Callback | `webhook.daraja correlation_match` | (silent) |
| FSM final | `state=PAYMENT_PENDING → BOOKED` | Final TTS plays |

Pass criteria

End-to-end speech-to-confirmation-speech latency stays under 2.5s at each turn (the M7 target).
STK push fires within the 90s voice cap window.
tenant_spa_pilot.appointments row created; tenant_spa_pilot.payments settled.
Audio playback is intelligible (no garbled TTS, no STT mistranscription on the haircut/yes turns).
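
The two timing criteria can be evaluated from timestamps noted during the call. This is a sketch under stated assumptions: the timestamps are collected by hand or from the log, the point from which the 90s window is measured is left to the caller because this page does not define it, and both function names are hypothetical.

```python
from datetime import datetime, timedelta

TURN_BUDGET = timedelta(seconds=2.5)  # per-turn M7 target from the pass criteria
STK_CAP = timedelta(seconds=90)       # voice hard cap per ADR-0007


def slow_turns(turns):
    """turns: list of (speech_end, agent_audio_start) datetimes.

    Returns the indexes of turns that did not stay under the budget.
    """
    return [i for i, (end, start) in enumerate(turns) if start - end >= TURN_BUDGET]


def stk_within_cap(window_start: datetime, stk_event_at: datetime) -> bool:
    """True if the STK event falls inside the 90s window that opens at window_start."""
    return timedelta(0) <= stk_event_at - window_start <= STK_CAP
```

A non-empty result from `slow_turns` identifies which turn to describe in the deviation row.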

Scenario 3 — Web widget + cross-sell

Estimated time: ~15 min. Channel: Web widget (Tier-2). ADRs exercised: 0009 (channel substrate, COLLECT_PHONE entry-state), 0007 (payments), 0010 (cross-sell).

Pre-conditions

Tenant has at least two services with a service_relations row of type cross_sell between them. For pilot, manicure and pedicure are the canonical pair (M11 fixture).
<tenant>.appointments is empty for both services on the test slot (so the slot is genuinely available).

Sequence — happy path (accept the cross-sell)

Action — happy path (accept the cross-sell)

1. Open http://localhost:3010/widget?tenant=spa_pilot in a browser.
2. Type: `I'd like a manicure tomorrow at 11am.`
3. Watch the widget chrome: it should walk you through COLLECT_PHONE (ADR-0009 Tier-2 entry-state since web has no phone metadata).
4. Enter your phone number when prompted; confirm.
5. Agent replies with the manicure confirmation prompt plus the cross-sell offer for pedicure.
6. Reply `yes` to accept the bundle.
7. STK push fires for the combined amount; approve in Daraja sandbox.
8. Verify both appointments persist.

Expected log lines

For log-reading commands, see Observability.

| Stage | Log expectation | Widget expectation |
| --- | --- | --- |
| Page load | `widget.session.create channel=web` | Widget chrome renders |
| First message | `state=GREET → COLLECT_PHONE` | Phone-collection prompt |
| Phone validate | `phone.validate ok` (or `e164_invalid` with 3-strike re-prompt) | Phone accepted |
| Service + slot | `state=COLLECT_PHONE → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM` | Confirmation prompt + cross-sell offer |
| User accept bundle | `state=CONFIRM → BUNDLE_CONFIRM → PAYMENT_PENDING` | (silent) |
| STK push | `daraja.client stkpush amount_kes=<combined>` | (silent) |
| Callback | `state=PAYMENT_PENDING → BOOKED` | Final confirmation message |
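
To make the phone-validate row concrete, here is a rough sketch of E.164 validation with a 3-strike re-prompt. It is not the backend's implementation: the separator-stripping step, the exact `phone.validate e164_invalid` event text, and returning `None` after the third strike are all assumptions for illustration.

```python
import re

# E.164: a "+" followed by at most 15 digits, the first of which is not 0.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def collect_phone(attempts, max_strikes=3):
    """Walk user entries in order; return (phone, events).

    phone is the first valid E.164 number, or None once max_strikes invalid
    entries have been seen. events mirrors the expected log lines.
    """
    events = []
    for strike, raw in enumerate(attempts, start=1):
        candidate = re.sub(r"[\s\-()]", "", raw)  # drop spaces, dashes, parentheses
        if E164.fullmatch(candidate):
            events.append("phone.validate ok")
            return candidate, events
        events.append("phone.validate e164_invalid")
        if strike >= max_strikes:
            break
    return None, events
```

During the pilot, entering a local-format number such as `0712 345 678` first is a cheap way to exercise the re-prompt before supplying the `+254...` form.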

Pass criteria

tenant_spa_pilot.appointments has two rows with status='confirmed' for the same customer.
tenant_spa_pilot.payments has one row with state='settled' and amount_kes equal to the combined total.
Cross-channel identity: if the same phone has prior WhatsApp activity, customers shows a single row with customer_identities rows for both channels (per ADR-0009 D5 phone-only deterministic merge).
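
The first two criteria can be checked against rows exported from the two tables. The sketch below assumes the rows are already filtered to this test run and loaded as dictionaries; `customer_id` is an assumed column name, while `status`, `state`, and `amount_kes` come from the criteria above.

```python
def bundle_pass(appointments, payments, expected_total_kes):
    """True if the run left two confirmed appointments for one customer
    and exactly one settled payment for the combined amount."""
    confirmed = [a for a in appointments if a["status"] == "confirmed"]
    same_customer = len({a["customer_id"] for a in confirmed}) == 1
    settled = [p for p in payments if p["state"] == "settled"]
    return (
        len(confirmed) == 2
        and same_customer
        and len(settled) == 1
        and settled[0]["amount_kes"] == expected_total_kes
    )
```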

Action — the negative path (decline the cross-sell)

Re-run from Step 1 with a different test slot. At Step 6, reply no. Verify:

Only one appointment row created (manicure only).
The pedicure reservation that was held during BUNDLE_CONFIRM is released — i.e. no orphan row in appointments with status='reserved'.

This negative path tests the pair-release semantics from M11 cross-sell. If an orphan reservation is left behind, that's a small-bug deviation.
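
A small sketch of the decline-path check, assuming the appointment rows created by the run have been exported as dictionaries with the `status` column used above. The function name is hypothetical.

```python
def decline_path_problems(appointments):
    """Return a list of problems; an empty list means the decline path passed."""
    problems = []
    confirmed = [a for a in appointments if a["status"] == "confirmed"]
    if len(confirmed) != 1:
        problems.append(f"expected 1 confirmed appointment, found {len(confirmed)}")
    orphans = [a for a in appointments if a["status"] == "reserved"]
    if orphans:
        problems.append(f"{len(orphans)} orphan reservation(s) left behind")
    return problems
```

Each returned string can be pasted into the observation column of the deviations file.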

Scenario 4 — SMS reminder + admin handoff

Estimated time: ~15 min. Channels: WhatsApp inbound + admin orchestrator + SMS via Africa's Talking sandbox. ADRs exercised: 0006 (handoff Pattern 3), 0009 (SMS as NotificationSink, not Channel).

Pre-conditions

Africa's Talking sandbox credentials in .env; AT_USERNAME=sandbox.
Tenant admin's WhatsApp number set in tenant_admins table.
Existing appointment in tenant_spa_pilot.appointments 24h out so the reminder scheduler picks it up.

Sequence — handoff path

Action — handoff path

1. Open WhatsApp and message the Meta test number with a deliberately vague turn: `So like I was wondering if maybe sometime this week you might could possibly help with a thing for my wife`
2. The classifier should flag low-confidence intent. Per ADR-0006 Pattern 3, the FSM enters HANDOFF_PENDING.
3. Within ~120s the admin (you, on the admin's WhatsApp) receives a briefing card.
4. As the admin, reply with a directive: `Tell them we have spa appointments Friday at 2pm and 4pm.`
5. The orchestrator forwards the response back to the customer.

Expected log lines — handoff

For log-reading guide and jq filter patterns, see Observability.

| Stage | Log expectation |
| --- | --- |
| Inbound vague turn | `classifier confidence<0.6` |
| Handoff trigger | `handoff.start reason=low_confidence pattern=3` |
| Admin notify | `whatsapp.send to=<admin_phone> template=briefing_card` |
| Admin reply | `webhook.whatsapp actor=admin orchestrator_route=true` |
| Customer reply | `whatsapp.send to=<customer_phone>` |

Pass criteria — handoff

Briefing card reaches the admin's WhatsApp within ~120s.
Admin's response reaches the customer.
Handoff log row in <tenant>.handoff_log with pattern=3 and full transcript fold.
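
The handoff trigger and the briefing deadline reduce to two comparisons. In this sketch the 0.6 threshold is inferred from the `confidence<0.6` log expectation and may not match the configured value; both function names are hypothetical.

```python
from datetime import datetime, timedelta

HANDOFF_THRESHOLD = 0.6                      # assumed from the log expectation
BRIEFING_DEADLINE = timedelta(seconds=120)   # from the pass criteria


def should_hand_off(confidence: float) -> bool:
    """Low-confidence turns go to the admin instead of continuing the booking flow."""
    return confidence < HANDOFF_THRESHOLD


def briefing_on_time(handoff_started_at: datetime, briefing_received_at: datetime) -> bool:
    """True if the briefing card arrived within the deadline after handoff.start."""
    return briefing_received_at - handoff_started_at <= BRIEFING_DEADLINE
```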

Action — SMS reminder path

1. Time-shift, or trigger the daily reminder cron manually: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_reminder_dispatch`
2. The reminder scheduler picks up the appointment 24h out.
3. The session-window check finds the customer's WhatsApp 24h window expired, so the reminder falls back to SMS via Africa's Talking.
4. SMS arrives on the customer's phone with the reminder text.

Expected log lines — SMS

| Stage | Log expectation |
| --- | --- |
| Reminder picked | `reminder.candidate appt=<id> due_in_h=24` |
| Window check | `whatsapp.session_window expired=true` |
| SMS dispatch | `sms.africas_talking send phone=<e164> cost_usd=0.003` |

Pass criteria — SMS

SMS arrives on the customer's phone (Africa's Talking sandbox shows the message in its dashboard).
Cost line in the log shows ~$0.003 — the cost-discipline win that motivated SMS-as-fallback in ADR-0009 (cheaper than out-of-window WhatsApp utility templates in Kenya).
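
The window check in the SMS reminder path can be sketched as a single comparison against the customer's last inbound WhatsApp message. This is an illustration, not the scheduler's code: the function name and the treatment of a customer with no inbound message as "expired" are assumptions.

```python
from datetime import datetime, timedelta, timezone

SESSION_WINDOW = timedelta(hours=24)  # WhatsApp session window


def reminder_channel(last_inbound_at, now):
    """Return 'whatsapp' while the 24h session window is open, else 'sms'."""
    if last_inbound_at is not None and now - last_inbound_at < SESSION_WINDOW:
        return "whatsapp"
    return "sms"
```

For the pilot, this means the SMS path only fires if the test phone has sent no WhatsApp message to the tenant in the previous 24 hours; otherwise the reminder goes out on WhatsApp and the SMS log lines never appear.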

Deviation severity rubric

Capture every divergence in docs/M12-pilot-deviations.md. Each row: timestamp | scenario | observation | severity | resolution. Severity drives whether the deviation gets fixed during W5 or deferred to M13.

| Severity | Symbol | Definition | Action |
| --- | --- | --- | --- |
| Doc-only fix | 🟢 | The runbook said X, the system actually does Y. The system is correct; the runbook is wrong. | Open a docs-only PR in the W5 docs-fixes commit. No code change. |
| Small bug | 🟡 | The system has a clear bug; the fix takes under 30 min and is well-scoped. | Open a code PR in the W5 bug-fixes commit. Re-run the affected scenario. |
| Deferred | 🔴 | A real architectural surprise: the fix needs design-level discussion or more than 30 min of work. | Log as M13 carry-forward in the close-out memo + open a GitHub issue. Do not fix during W5. |

The W5 commit cap is 3 commits maximum per the M12 plan v3 R6 mitigation: one pilot-execution commit (the M12-pilot-deviations.md capture file with all observations), one docs-fixes commit (all 🟢 batched), one bug-fixes commit (all 🟡 batched). Anything 🔴 stays in the deviations file as a carry-forward — it does not get a commit during W5.

This cap exists because the pilot is a calibration pass, not a code-change marathon. The point is to learn what the system actually does in pilot conditions and capture that learning, not to rewrite the system from inside the pilot. M13 is the place for architectural responses to 🔴 findings.
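
The rubric and the commit mapping can be written down as a short decision function. This is a sketch of the rules as stated on this page; the names are hypothetical, and treating a fix of exactly 30 minutes as deferred is an assumption, since the rubric does not say.

```python
from enum import Enum


class Severity(Enum):
    DOC_ONLY = "🟢"
    SMALL_BUG = "🟡"
    DEFERRED = "🔴"


# W5 commit each severity is batched into; None means no W5 commit (M13 carry-forward).
W5_COMMIT = {
    Severity.DOC_ONLY: "docs-fixes",
    Severity.SMALL_BUG: "bug-fixes",
    Severity.DEFERRED: None,
}


def classify(system_is_correct: bool, fix_minutes: float, well_scoped: bool) -> Severity:
    """Apply the rubric: wrong doc, small well-scoped bug, or everything else."""
    if system_is_correct:
        return Severity.DOC_ONLY
    if fix_minutes < 30 and well_scoped:
        return Severity.SMALL_BUG
    return Severity.DEFERRED
```

Note that a quick fix that is not well-scoped still lands in the deferred bucket, which matches the intent of keeping W5 to a calibration pass.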

Where to capture

Open docs/M12-pilot-deviations.md in an editor before starting. Use this row shape:

| 2026-05-07 09:34 | S1 WhatsApp | Confirmation message took 47s instead of expected ~30s | 🔴 | M13 carry-forward — investigate Daraja sandbox latency vs production |
| 2026-05-07 09:51 | S3 web widget | "manicure" tokenisation collapsed to "mani cure" in classifier | 🟡 | bug-fixes commit — extend tokenizer test fixture |
| 2026-05-07 10:08 | S2 voice | TTS playback expectation says "~2.5s end-to-end" but actual is "~3.0s on first turn" | 🟢 | docs-fixes commit — update Scenario 2 latency target |
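
If typing rows by hand gets tedious mid-pass, a helper along these lines produces the same shape. It is a convenience sketch, not an existing script; the function name is hypothetical, and escaping literal pipes inside a cell is an added safeguard.

```python
from datetime import datetime


def deviation_row(when: datetime, scenario: str, observation: str,
                  severity: str, resolution: str) -> str:
    """Format one capture row: timestamp | scenario | observation | severity | resolution."""
    cells = [when.strftime("%Y-%m-%d %H:%M"), scenario, observation, severity, resolution]
    # A literal "|" inside a cell would split the row, so escape it.
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"
```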

The file is deleted at W6 close-out; the observations are folded into the close-out memo at ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m12_pilot_readiness_landed.md. That memo becomes the durable record.

What next

During the pass — when something goes wrong, Incidents runbook is the diagnose-and-fix table.
For log reading during the pass — Observability covers structured log fields, jq filter patterns, and the daily digest.
After the pass — the W6 close-out memo aggregates everything into a single milestone retrospective.
For M13 onboarding — re-run this exact runbook against a real tenant. The 4 scenarios become the acceptance criteria for "this tenant is live."
