Pilot deployment runbook
This page is the script for the M12 W5 pilot validation pass and the M13 real-tenant onboarding pass that follows. Adrian opens it and follows it verbatim — every step, every command, every expected log line. The pilot's job is to surface deviations: places where what the doc said and what the system did diverge. Each deviation becomes a PR.
The four scenarios below cover the load-bearing channels: WhatsApp (Tier-1, the primary surface), voice (Tier-1, the most-difficult surface), web widget (Tier-2, the conversion-funnel surface), and SMS handoff (Tier-2 fallback, the cost-discipline surface). Together they exercise every ADR-pinned subsystem: classifier (ADR-0005), FSM (ADR-0003), payments (ADR-0007), handoff (ADR-0006), channel substrate (ADR-0009), and tenant self-service (ADR-0010).
Pre-flight checklist
Four gates before the first scenario fires. Each is a hard pre-condition; do not start if any is red.
Setup not done yet? If this is your first time on this machine, start at Dev setup (one-time installs) and then Onboard a tenant. This page assumes the stack is already bootstrapped.
| Gate | Verify with | If red |
|---|---|---|
| pilot-preflight.sh is green | ./scripts/pilot-preflight.sh exits 0 with all rows [OK] | See Local dev runbook / Pre-flight. |
| Full test suite green | Run the backend and frontend test suites — see Testing runbook for the exact command sequence | Stop the pilot. Do not pilot from a red baseline. Open a fix PR first. |
| A fresh tenant is onboarded | A row in public.tenants; tenant_<slug> schema exists; catalog populated; staff scheduled | Follow Onboard a tenant. For pilot, use slug spa_pilot. |
| Env vars set per channel | Daraja sandbox creds, Meta WhatsApp test number on the tenant, LiveKit creds, Africa's Talking sandbox creds | Set in .env — see Configuration reference for the full env-var table. Restart the backend; re-run pilot-preflight. |
A capture file lives at docs/M12-pilot-deviations.md (created by W5, deleted at W6 close-out — superseded by the close-out memo). Open it in an editor before starting; you'll be writing into it as you go.
Scenario overview
All four scenarios share the same pre-condition → action → expected-result shape. The following diagram shows the common structure and how each scenario branches:
Scenario 1 — WhatsApp booking happy path
Estimated time: ~15 min. Channel: WhatsApp Cloud API. ADRs exercised: 0003 (FSM), 0005 (orchestration), 0007 (payments), 0008 (Cloud API direct).
Pre-conditions
- `tenant_spa_pilot` is onboarded with at least one staff member, one service, and a recurring weekly schedule.
- The tenant's `whatsapp_phone_number_id` resolves to Meta's free dev test number; your own phone is on the verified-recipients list.
- Daraja sandbox shortcode + passkey are set on the tenant.
Action
1. Open WhatsApp on your phone, message the Meta test number: `Hi I'd like to book a haircut tomorrow at 3pm`
2. In a separate terminal: `tail -f backend/.uvicorn.log`
3. Wait for the agent's reply on WhatsApp.
4. Reply `yes` to confirm.
5. Daraja sandbox simulator pops; enter the sandbox PIN to approve.
6. Wait for the final confirmation message on WhatsApp.
Sequence
Expected log lines
For how to read structured log fields and filter by thread_id, see Observability.
| Stage | Log expectation | WhatsApp expectation |
|---|---|---|
| Inbound | webhook.whatsapp signature_valid=True | (silent) |
| Classify | classifier intent=book confidence>0.8 lang=en | (silent) |
| FSM transitions | state=GREET → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM thread=<ULID> | Confirmation prompt within ~2s |
| User confirm | state=CONFIRM → PAYMENT_PENDING | (silent) |
| STK push | daraja.client stkpush success CheckoutRequestID=<ws_CO_...> | M-Pesa pop-up on phone |
| Callback | webhook.daraja correlation_match thread=<ULID> | (silent) |
| FSM final | state=PAYMENT_PENDING → BOOKED | Final confirmation message |
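The table above implies an ordering: each stage should appear in the log after the previous one. A minimal sketch of that check follows, assuming the log is plain text and that the markers from the "Log expectation" column appear verbatim as substrings. Both are assumptions about the log format, and `EXPECTED_STAGES` and `first_missing_stage` are hypothetical names, not part of the codebase.

```python
from typing import Iterable, List, Optional

# Markers copied from the "Log expectation" column. Plain substring matching
# is an assumption about the log format, not a documented contract.
EXPECTED_STAGES: List[str] = [
    "webhook.whatsapp signature_valid=True",
    "classifier intent=book",
    "state=CONFIRM → PAYMENT_PENDING",
    "daraja.client stkpush success",
    "webhook.daraja correlation_match",
    "state=PAYMENT_PENDING → BOOKED",
]


def first_missing_stage(
    lines: Iterable[str], expected: Optional[List[str]] = None
) -> Optional[str]:
    """Return the first marker not seen in order, or None if every stage appeared."""
    remaining = list(EXPECTED_STAGES if expected is None else expected)
    for line in lines:
        # Only the next expected marker can match, which enforces the ordering.
        if remaining and remaining[0] in line:
            remaining.pop(0)
    return remaining[0] if remaining else None
```

Feeding it the captured lines for one thread names the stage where the flow stalled, which is usually the observation worth writing into the deviations file.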
Pass criteria
- The final WhatsApp message arrives within ~30s end-to-end.
- `tenant_spa_pilot.appointments` has a new row with `status='confirmed'`.
- `tenant_spa_pilot.payments` has a row with `state='settled'` and a populated `mpesa_receipt`.
- No errors or warnings in the log line stream for that thread.
If any of those is false, capture in docs/M12-pilot-deviations.md with severity per the rubric below.
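The two database criteria can be checked by counting rows before and after the scenario. The sketch below assumes a DB-API cursor connected to the pilot database. The table and column names come from the pass criteria; the helper names `booking_counts` and `scenario_added_booking` are hypothetical.

```python
def booking_counts(cur) -> dict:
    """Count confirmed appointments and settled payments that carry a receipt."""
    cur.execute(
        "SELECT COUNT(*) FROM tenant_spa_pilot.appointments "
        "WHERE status = 'confirmed'"
    )
    confirmed = cur.fetchone()[0]
    cur.execute(
        "SELECT COUNT(*) FROM tenant_spa_pilot.payments "
        "WHERE state = 'settled' "
        "AND mpesa_receipt IS NOT NULL AND mpesa_receipt <> ''"
    )
    settled = cur.fetchone()[0]
    return {"confirmed": confirmed, "settled": settled}


def scenario_added_booking(before: dict, after: dict) -> bool:
    """True when exactly one confirmed appointment and one settled payment were added."""
    return (
        after["confirmed"] - before["confirmed"] == 1
        and after["settled"] - before["settled"] == 1
    )
```

Take one snapshot before sending the first WhatsApp message and one after the final confirmation; a `False` result is a deviation to capture.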
Scenario 2 — Voice booking happy path
Estimated time: ~15 min. Channel: LiveKit SIP + Deepgram + ElevenLabs. ADRs exercised: 0003, 0005, 0007, plus the voice-stack pattern (LiveKit AgentSession, post-process voice rendering).
Pre-conditions
- LiveKit container is running (`docker compose ps` shows `ratiba-livekit` up).
- Deepgram + ElevenLabs API keys set in `.env`; visible to backend at startup.
- The tenant has a SIP-registered phone number routed to LiveKit's local SIP bridge.
- M-Pesa STK push has the 90s voice hard cap per ADR-0007 — the voice flow is shorter than WhatsApp by design.
Action
1. Call the tenant's pilot phone number (LiveKit local SIP bridge maps this to a room).
2. After the agent's greeting, speak: `I'd like to book a haircut tomorrow at three pm.`
3. Tail the backend log: `tail -f backend/.uvicorn.log`
4. Listen for the agent's confirmation TTS playback over the call.
5. Speak `yes` to confirm.
6. Approve the STK push that fires on your phone (within the 90s voice cap).
7. Listen for the final confirmation TTS.
Sequence
Expected log lines
For log-reading commands and structured log field reference, see Observability.
| Stage | Log expectation | Audio / latency expectation |
|---|---|---|
| Call connect | livekit.session.start room=<id> participant=<sip> | Call connects within ~2s |
| STT first turn | deepgram.transcript text="i'd like to book..." lang=en | Transcribed within ~1s of speech end |
| Classify + FSM | Same classifier + booking_graph lines as Scenario 1 | (silent) |
| TTS confirmation | elevenlabs.synthesize duration_ms=<x> | Agent voice plays within ~2.5s end-to-end |
| User confirm STT | deepgram.transcript text="yes" | (silent) |
| STK push | daraja.client stkpush success | M-Pesa pop-up on phone within ~1s |
| Callback | webhook.daraja correlation_match | (silent) |
| FSM final | state=PAYMENT_PENDING → BOOKED | Final TTS plays |
Pass criteria
- End-to-end speech-to-confirmation-speech latency stays under 2.5s at each turn (the M7 target).
- STK push fires within the 90s voice cap window.
- `tenant_spa_pilot.appointments` row created; `tenant_spa_pilot.payments` settled.
- Audio playback is intelligible (no garbled TTS, no STT mistranscription on the haircut/yes turns).
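The per-turn latency criterion is easy to misjudge by ear, so it helps to note two stopwatch readings per turn and compute the gap. The sketch below assumes those readings are taken by hand or from log timestamps; `Turn` and `slow_turns` are hypothetical names, and only the 2.5s budget comes from the criteria above.

```python
from dataclasses import dataclass
from typing import List, Tuple

TURN_LATENCY_BUDGET_S = 2.5  # the M7 target named in the pass criteria


@dataclass
class Turn:
    speech_end_s: float      # when the caller stopped speaking
    playback_start_s: float  # when the agent's TTS became audible


def slow_turns(
    turns: List[Turn], budget_s: float = TURN_LATENCY_BUDGET_S
) -> List[Tuple[int, float]]:
    """Return (turn number, latency in seconds) for each turn at or over budget."""
    result = []
    for number, turn in enumerate(turns, start=1):
        latency = turn.playback_start_s - turn.speech_end_s
        if latency >= budget_s:
            result.append((number, round(latency, 3)))
    return result
```

An empty list means every turn stayed under the target; anything else gives the turn number and measured latency to record as a deviation.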
Scenario 3 — Web widget + cross-sell
Estimated time: ~15 min. Channel: Web widget (Tier-2). ADRs exercised: 0009 (channel substrate, COLLECT_PHONE entry-state), 0007 (payments), 0010 (cross-sell).
Pre-conditions
- Tenant has at least two services with a `service_relations` row of type `cross_sell` between them. For pilot, `manicure` and `pedicure` are the canonical pair (M11 fixture).
- `<tenant>.appointments` is empty for both services on the test slot (so the slot is genuinely available).
Sequence — happy path (accept the cross-sell)
Action — happy path (accept the cross-sell)
1. Open `http://localhost:3010/widget?tenant=spa_pilot` in a browser.
2. Type: `I'd like a manicure tomorrow at 11am.`
3. Watch the widget chrome; it should walk you through `COLLECT_PHONE` (ADR-0009 Tier-2 entry-state since web has no phone metadata).
4. Enter your phone number when prompted; confirm.
5. Agent replies with the manicure confirmation prompt plus the cross-sell offer for pedicure.
6. Reply `yes` to accept the bundle.
7. STK push fires for the combined amount; approve in Daraja sandbox.
8. Verify both appointments persist.
Expected log lines
For log-reading commands, see Observability.
| Stage | Log expectation | Widget expectation |
|---|---|---|
| Page load | widget.session.create channel=web | Widget chrome renders |
| First message | state=GREET → COLLECT_PHONE | Phone-collection prompt |
| Phone validate | phone.validate ok (or e164_invalid with 3-strike re-prompt) | Phone accepted |
| Service + slot | state=COLLECT_PHONE → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM | Confirmation prompt + cross-sell offer |
| User accept bundle | state=CONFIRM → BUNDLE_CONFIRM → PAYMENT_PENDING | (silent) |
| STK push | daraja.client stkpush amount_kes=<combined> | (silent) |
| Callback | state=PAYMENT_PENDING → BOOKED | Final confirmation message |
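The "Phone validate" row combines an E.164 format check with a three-strike limit. The sketch below illustrates that combination under stated assumptions: the regular expression is the usual E.164 shape (a `+`, then up to 15 digits with a non-zero first digit), while `collect_phone`, the outcome strings, and what happens after the third strike are hypothetical and may differ from the backend's behaviour.

```python
import re
from typing import Iterable, Tuple

# E.164 shape: "+" followed by 2 to 15 digits, the first of which is non-zero.
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")
MAX_STRIKES = 3  # the "3-strike re-prompt" from the table above


def collect_phone(attempts: Iterable[str]) -> Tuple[str, int]:
    """Return (outcome, strikes used) for a sequence of user-entered numbers."""
    strikes = 0
    for raw in attempts:
        candidate = raw.strip().replace(" ", "")  # tolerate spaces in typed input
        if E164_RE.match(candidate):
            return ("ok", strikes)
        strikes += 1
        if strikes >= MAX_STRIKES:
            # Assumption: the third invalid entry ends the re-prompt loop.
            return ("e164_invalid", strikes)
    return ("pending", strikes)
```

During the pilot, entering a deliberately malformed number once before the valid one exercises the re-prompt branch without exhausting the strikes.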
Pass criteria
- `tenant_spa_pilot.appointments` has two rows with `status='confirmed'` for the same customer.
- `tenant_spa_pilot.payments` has one row with `state='settled'` and `amount_kes` equal to the combined total.
- Cross-channel identity: if the same phone has prior WhatsApp activity, `customers` shows a single row with `customer_identities` rows for both channels (per ADR-0009 D5 phone-only deterministic merge).
Action — the negative path (decline the cross-sell)
Re-run from Step 1 with a different test slot. At Step 6, reply no. Verify:
- Only one appointment row created (manicure only).
- The pedicure reservation that was held during `BUNDLE_CONFIRM` is released, i.e. no orphan row in `appointments` with `status='reserved'`.
This negative path tests the pair-release semantics from M11 cross-sell. If an orphan reservation is left behind, that's a small-bug deviation.
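The orphan check reduces to one query: after the decline, no appointment should remain in `status='reserved'`. The sketch below assumes a DB-API cursor on the pilot database and an otherwise idle tenant, so that any remaining reserved row belongs to this run. `reserved_rows` is a hypothetical helper name.

```python
def reserved_rows(cur) -> list:
    """Return every appointment row still held in the 'reserved' state."""
    cur.execute(
        "SELECT * FROM tenant_spa_pilot.appointments WHERE status = 'reserved'"
    )
    return cur.fetchall()
```

An empty list means the hold was released; any returned row can be pasted into the deviations file as the observation.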
Scenario 4 — SMS reminder + admin handoff
Estimated time: ~15 min. Channels: WhatsApp inbound + admin orchestrator + SMS via Africa's Talking sandbox. ADRs exercised: 0006 (handoff Pattern 3), 0009 (SMS as NotificationSink, not Channel).
Pre-conditions
- Africa's Talking sandbox credentials in `.env`; `AT_USERNAME=sandbox`.
- Tenant admin's WhatsApp number set in `tenant_admins` table.
- Existing appointment in `tenant_spa_pilot.appointments` 24h out so the reminder scheduler picks it up.
Sequence — handoff path
Action — handoff path
1. Open WhatsApp, message the Meta test number with a deliberately vague turn: `So like I was wondering if maybe sometime this week you might could possibly help with a thing for my wife`
2. The classifier should flag low-confidence intent. Per ADR-0006 Pattern 3, the FSM enters `HANDOFF_PENDING`.
3. Within ~120s the admin (you, on the admin's WhatsApp) receives a briefing card.
4. As the admin, reply with a directive: `Tell them we have spa appointments Friday at 2pm and 4pm.`
5. The orchestrator forwards the response back to the customer.
Expected log lines — handoff
For log-reading guide and jq filter patterns, see Observability.
| Stage | Log expectation |
|---|---|
| Inbound vague turn | classifier confidence<0.6 |
| Handoff trigger | handoff.start reason=low_confidence pattern=3 |
| Admin notify | whatsapp.send to=<admin_phone> template=briefing_card |
| Admin reply | webhook.whatsapp actor=admin orchestrator_route=true |
| Customer reply | whatsapp.send to=<customer_phone> |
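To confirm the first row of the table from a captured log line, the confidence value can be pulled out and compared with the 0.6 figure. This is a sketch only: it assumes the classifier line carries a `confidence=<number>` field, and it treats 0.6 as the value read off the table, not as the backend's documented cut-off. `handoff_expected` is a hypothetical name.

```python
import re
from typing import Optional

CONFIDENCE_RE = re.compile(r"confidence=([0-9]*\.?[0-9]+)")
HANDOFF_BELOW = 0.6  # read off the log expectation above; an assumption, not a spec


def handoff_expected(log_line: str, threshold: float = HANDOFF_BELOW) -> Optional[bool]:
    """True if the logged confidence is below the threshold, None if no field is found."""
    match = CONFIDENCE_RE.search(log_line)
    if match is None:
        return None
    return float(match.group(1)) < threshold
```

If this returns `False` for the vague turn yet the handoff still fires, or `True` with no `handoff.start` line following, that mismatch is the deviation to record.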
Pass criteria — handoff
- Briefing card reaches the admin's WhatsApp within ~120s.
- Admin's response reaches the customer.
- Handoff log row in `<tenant>.handoff_log` with `pattern=3` and full transcript fold.
Action — SMS reminder path
1. Time-shift or trigger the daily reminder cron manually: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_reminder_dispatch`
2. The reminder scheduler picks the appointment 24h out.
3. The customer's session-window check: WhatsApp 24h window expired → fallback to SMS via Africa's Talking.
4. SMS arrives on the customer's phone with the reminder text.
Expected log lines — SMS
| Stage | Log expectation |
|---|---|
| Reminder picked | reminder.candidate appt=<id> due_in_h=24 |
| Window check | whatsapp.session_window expired=true |
| SMS dispatch | sms.africas_talking send phone=<e164> cost_usd=0.003 |
Pass criteria — SMS
- SMS arrives on the customer's phone (Africa's Talking sandbox shows the message in its dashboard).
- Cost line in the log shows ~$0.003 — the cost-discipline win that motivated SMS-as-fallback in ADR-0009 (cheaper than out-of-window WhatsApp utility templates in Kenya).
Deviation severity rubric
Capture every divergence in docs/M12-pilot-deviations.md. Each row: timestamp | scenario | observation | severity | resolution. Severity drives whether the deviation gets fixed during W5 or deferred to M13.
| Severity | Symbol | Definition | Action |
|---|---|---|---|
| Doc-only fix | 🟢 | The runbook said X, the system actually does Y. The system is correct; the runbook is wrong. | Open a docs-only PR in the W5 docs-fixes commit. No code change. |
| Small bug | 🟡 | The system has a clear bug, fix is < 30 min and well-scoped. | Open a code PR in the W5 bug-fixes commit. Re-run the affected scenario. |
| Deferred | 🔴 | A real architectural surprise — fix needs design-level discussion or > 30 min of work. | Log as M13 carry-forward in the close-out memo + open a GitHub issue. Do not fix during W5. |
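The rubric can be read as a three-way decision. The sketch below encodes it, with the caveat that the table leaves a fix estimate of exactly 30 minutes unspecified; treating it as deferred is an assumption here. `triage` and its parameter names are hypothetical.

```python
def triage(
    system_is_correct: bool,
    fix_minutes: int,
    needs_design_discussion: bool = False,
) -> str:
    """Map a deviation onto the rubric's severity symbol."""
    if system_is_correct:
        return "🟢"  # the runbook is wrong; docs-only fix
    if needs_design_discussion or fix_minutes >= 30:
        return "🔴"  # defer; the 30-minute boundary case is an assumption
    return "🟡"      # clear, well-scoped bug under 30 minutes
```

For example, `triage(False, 10)` gives 🟡, while the same bug with `needs_design_discussion=True` gives 🔴 regardless of the time estimate.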
The W5 commit cap is 3 commits maximum per the M12 plan v3 R6 mitigation: one pilot-execution commit (the M12-pilot-deviations.md capture file with all observations), one docs-fixes commit (all 🟢 batched), one bug-fixes commit (all 🟡 batched). Anything 🔴 stays in the deviations file as a carry-forward — it does not get a commit during W5.
This cap exists because the pilot is a calibration pass, not a code-change marathon. The point is to learn what the system actually does in pilot conditions and capture that learning, not to rewrite the system from inside the pilot. M13 is the place for architectural responses to 🔴 findings.
Where to capture
Open docs/M12-pilot-deviations.md in an editor before starting. Use this row shape:
| Timestamp | Scenario | Observation | Severity | Resolution |
|---|---|---|---|---|
| 2026-05-07 09:34 | S1 WhatsApp | Confirmation message took 47s instead of expected ~30s | 🔴 | M13 carry-forward — investigate Daraja sandbox latency vs production |
| 2026-05-07 09:51 | S3 web widget | "manicure" tokenisation collapsed to "mani cure" in classifier | 🟡 | bug-fixes commit — extend tokenizer test fixture |
| 2026-05-07 10:08 | S2 voice | TTS playback expectation says "~2.5s end-to-end" but actual is "~3.0s on first turn" | 🟢 | docs-fixes commit — update Scenario 2 latency target |
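Rows typed by hand during a pass tend to drift in shape. A small formatter keeps them consistent. The sketch below follows the five-column row shape shown above; `deviation_row` is a hypothetical helper, and escaping a literal pipe inside a cell is an added safeguard, not something the runbook specifies.

```python
from datetime import datetime

SEVERITIES = {"🟢", "🟡", "🔴"}


def deviation_row(
    when: datetime, scenario: str, observation: str, severity: str, resolution: str
) -> str:
    """Format one capture-file row: timestamp | scenario | observation | severity | resolution."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity symbol: {severity!r}")
    cells = [
        when.strftime("%Y-%m-%d %H:%M"),
        scenario,
        observation,
        severity,
        resolution,
    ]
    # A literal "|" inside a cell would split the row, so escape it.
    cells = [cell.replace("|", "\\|") for cell in cells]
    return "| " + " | ".join(cells) + " |"
```

Appending the returned string to `docs/M12-pilot-deviations.md` yields rows in the same shape as the examples above.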
The file is deleted at W6 close-out; the observations are folded into the close-out memo at ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m12_pilot_readiness_landed.md. That memo becomes the durable record.
What next
- During the pass — when something goes wrong, Incidents runbook is the diagnose-and-fix table.
- For log reading during the pass — Observability covers structured log fields, jq filter patterns, and the daily digest.
- After the pass — the W6 close-out memo aggregates everything into a single milestone retrospective.
- For M13 onboarding — re-run this exact runbook against a real tenant. The 4 scenarios become the acceptance criteria for "this tenant is live."