Pilot deployment runbook

This page is the script for the M12 W5 pilot validation pass and the M13 real-tenant onboarding pass that follows. Adrian opens it and follows it verbatim — every step, every command, every expected log line. The pilot's job is to surface deviations: places where what the doc said and what the system did diverge. Each deviation becomes a PR.

The four scenarios below cover the load-bearing channels: WhatsApp (Tier-1, the primary surface), voice (Tier-1, the most difficult surface), web widget (Tier-2, the conversion-funnel surface), and SMS handoff (Tier-2 fallback, the cost-discipline surface). Together they exercise every ADR-pinned subsystem: classifier (ADR-0005), FSM (ADR-0003), payments (ADR-0007), handoff (ADR-0006), channel substrate (ADR-0009), and tenant self-service (ADR-0010).


Pre-flight checklist

Four gates must be green before the first scenario fires. Each is a hard pre-condition; do not start if any is red.

Setup not done yet? If this is your first time on this machine, start at Dev setup (one-time installs) and then Onboard a tenant. This page assumes the stack is already bootstrapped.

| Gate | Verify with | If red |
| --- | --- | --- |
| pilot-preflight.sh is green | ./scripts/pilot-preflight.sh exits 0 with all rows [OK] | See Local dev runbook / Pre-flight. |
| Full test suite green | Run the backend and frontend test suites; see Testing runbook for the exact command sequence | Stop the pilot. Do not pilot from a red baseline. Open a fix PR first. |
| A fresh tenant is onboarded | A row in public.tenants; tenant_<slug> schema exists; catalog populated; staff scheduled | Follow Onboard a tenant. For pilot, use slug spa_pilot. |
| Env vars set per channel | Daraja sandbox creds, Meta WhatsApp test number on the tenant, LiveKit creds, Africa's Talking sandbox creds | Set in .env (see Configuration reference for the full env-var table). Restart the backend; re-run pilot-preflight. |
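
The fourth gate can be spot-checked before restarting the backend. The following is a minimal sketch, not the project's preflight script: apart from AT_USERNAME, which appears later in this runbook, every variable name is a hypothetical placeholder and should be replaced with the names from the Configuration reference.

```python
import os

# Hypothetical variable names per channel, for illustration only.
# AT_USERNAME is the only name this runbook itself mentions.
REQUIRED_ENV = {
    "payments": ["DARAJA_SHORTCODE", "DARAJA_PASSKEY"],
    "voice": ["LIVEKIT_API_KEY", "LIVEKIT_API_SECRET"],
    "sms": ["AT_USERNAME", "AT_API_KEY"],
}


def missing_env(required, env):
    """Return {channel: [names that are unset or empty]} for channels with gaps."""
    gaps = {}
    for channel, names in required.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            gaps[channel] = absent
    return gaps
```

Calling `missing_env(REQUIRED_ENV, os.environ)` returns an empty dict when every listed variable is set; any non-empty result means the gate is red.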

A capture file lives at docs/M12-pilot-deviations.md (created by W5, deleted at W6 close-out — superseded by the close-out memo). Open it in an editor before starting; you'll be writing into it as you go.


Scenario overview

All four scenarios share the same pre-condition → action → expected-result shape.

[Diagram: the common scenario structure and how each scenario branches]


Scenario 1 — WhatsApp booking happy path

Estimated time: ~15 min. Channel: WhatsApp Cloud API. ADRs exercised: 0003 (FSM), 0005 (orchestration), 0007 (payments), 0008 (Cloud API direct).

Pre-conditions

  • tenant_spa_pilot is onboarded with at least one staff member, one service, and a recurring weekly schedule.
  • The tenant's whatsapp_phone_number_id resolves to Meta's free dev test number; your own phone is on the verified-recipients list.
  • Daraja sandbox shortcode + passkey are set on the tenant.

Action

  1. Open WhatsApp on your phone, message the Meta test number:

    Hi I'd like to book a haircut tomorrow at 3pm

  2. In a separate terminal:
    tail -f backend/.uvicorn.log
  3. Wait for the agent's reply on WhatsApp.
  4. Reply yes to confirm.
  5. The Daraja sandbox simulator pops up; enter the sandbox PIN to approve.
  6. Wait for the final confirmation message on WhatsApp.

Sequence

Expected log lines

For how to read structured log fields and filter by thread_id, see Observability.

| Stage | Log expectation | WhatsApp expectation |
| --- | --- | --- |
| Inbound | webhook.whatsapp signature_valid=True | (silent) |
| Classify | classifier intent=book confidence>0.8 lang=en | (silent) |
| FSM transitions | state=GREET → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM thread=<ULID> | Confirmation prompt within ~2s |
| User confirm | state=CONFIRM → PAYMENT_PENDING | (silent) |
| STK push | daraja.client stkpush success CheckoutRequestID=<ws_CO_...> | M-Pesa pop-up on phone |
| Callback | webhook.daraja correlation_match thread=<ULID> | (silent) |
| FSM final | state=PAYMENT_PENDING → BOOKED | Final confirmation message |
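
Checking the FSM rows by eye is error-prone when several threads interleave in the log. The sketch below stitches the state= fragments for one thread into a single path and compares it with the happy path from the table. It assumes plain-text lines in which every transition carries a whitespace-delimited thread=<id> token; the real log shape may differ (see Observability), and the function name is illustrative.

```python
import re

EXPECTED_PATH = ["GREET", "COLLECT_SERVICE", "COLLECT_SLOT", "CONFIRM",
                 "PAYMENT_PENDING", "BOOKED"]

# Assumed line shape: "... state=A → B [→ C ...] ... thread=<id> ..."
STATE_RE = re.compile(r"state=([A-Z_]+(?:\s*→\s*[A-Z_]+)*)")


def observed_path(lines, thread_id):
    """Stitch the state= fragments logged for one thread into a single path."""
    path = []
    for line in lines:
        # Exact token match, so thread=T1 does not also match thread=T10.
        if f"thread={thread_id}" not in line.split():
            continue
        match = STATE_RE.search(line)
        if not match:
            continue
        for state in re.split(r"\s*→\s*", match.group(1)):
            # Consecutive fragments overlap on the boundary state; skip the repeat.
            if not path or path[-1] != state:
                path.append(state)
    return path
```

A run passes this check when `observed_path(lines, thread_id) == EXPECTED_PATH`; any other result is worth a row in the deviations file.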

Pass criteria

  • The final WhatsApp message arrives within ~30s end-to-end.
  • tenant_spa_pilot.appointments has a new row with status='confirmed'.
  • tenant_spa_pilot.payments has a row with state='settled' and a populated mpesa_receipt.
  • No errors or warnings in the log line stream for that thread.

If any of those is false, capture it in docs/M12-pilot-deviations.md with severity per the rubric below.
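
The four criteria can be folded into one helper so that nothing is skipped under time pressure. This is a rough sketch: it takes rows as plain dicts (fetched however you normally query tenant_spa_pilot) and uses only the column names quoted above; the function name and the log_levels argument are assumptions for illustration.

```python
def scenario1_failures(appointments, payments, elapsed_s, log_levels):
    """Return the failed pass criteria; an empty list means Scenario 1 passed.

    appointments, payments: rows (dicts) created by this run.
    elapsed_s: seconds from the first inbound message to the final confirmation.
    log_levels: levels of the log lines seen for the thread, e.g. ["INFO", "INFO"].
    """
    failures = []
    if elapsed_s > 30:
        failures.append(f"final message took {elapsed_s:.0f}s (budget ~30s)")
    if not any(a.get("status") == "confirmed" for a in appointments):
        failures.append("no appointments row with status='confirmed'")
    if not any(p.get("state") == "settled" and p.get("mpesa_receipt") for p in payments):
        failures.append("no settled payments row with mpesa_receipt")
    noisy = [lvl for lvl in log_levels if lvl.upper() in {"WARNING", "ERROR", "CRITICAL"}]
    if noisy:
        failures.append(f"{len(noisy)} warning/error line(s) in the thread's log")
    return failures
```

Each string in the returned list maps to one candidate row in the deviations file.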


Scenario 2 — Voice booking happy path

Estimated time: ~15 min. Channel: LiveKit SIP + Deepgram + ElevenLabs. ADRs exercised: 0003, 0005, 0007, plus the voice-stack pattern (LiveKit AgentSession, post-process voice rendering).

Pre-conditions

  • LiveKit container is running (docker compose ps shows ratiba-livekit up).
  • Deepgram + ElevenLabs API keys set in .env; visible to backend at startup.
  • The tenant has a SIP-registered phone number routed to LiveKit's local SIP bridge.
  • The M-Pesa STK push is subject to the 90s hard cap for voice (ADR-0007); the voice flow is shorter than the WhatsApp flow by design.

Action

  1. Call the tenant's pilot phone number (LiveKit local SIP bridge maps this to a room).
  2. After the agent's greeting, speak: I'd like to book a haircut tomorrow at three pm.
  3. In a separate terminal, tail the backend log:
    tail -f backend/.uvicorn.log
  4. Listen for the agent's confirmation TTS playback over the call.
  5. Speak yes to confirm.
  6. Approve the STK push that fires on your phone (within the 90s voice cap).
  7. Listen for the final confirmation TTS.

Sequence

Expected log lines

For log-reading commands and structured log field reference, see Observability.

| Stage | Log expectation | Audio / latency expectation |
| --- | --- | --- |
| Call connect | livekit.session.start room=<id> participant=<sip> | Call connects within ~2s |
| STT first turn | deepgram.transcript text="i'd like to book..." lang=en | Transcribed within ~1s of speech end |
| Classify + FSM | Same classifier + booking_graph lines as Scenario 1 | (silent) |
| TTS confirmation | elevenlabs.synthesize duration_ms=<x> | Agent voice plays within ~2.5s end-to-end |
| User confirm STT | deepgram.transcript text="yes" | (silent) |
| STK push | daraja.client stkpush success | M-Pesa pop-up on phone within ~1s |
| Callback | webhook.daraja correlation_match | (silent) |
| FSM final | state=PAYMENT_PENDING → BOOKED | Final TTS plays |

Pass criteria

  • End-to-end speech-to-confirmation-speech latency stays under 2.5s at each turn (the M7 target).
  • STK push fires within the 90s voice cap window.
  • tenant_spa_pilot.appointments row created; tenant_spa_pilot.payments settled.
  • Audio playback is intelligible (no garbled TTS, no STT mistranscription on the haircut/yes turns).
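
The 2.5s criterion is easier to judge from timestamps than by ear. The sketch below flags turns that exceed the budget, given pairs of ISO timestamps that you copy from the log. Which two log lines bound a turn is an assumption here (for example the deepgram.transcript line and the following elevenlabs.synthesize line); log timestamps only approximate what the caller hears, and the payment wait between the yes turn and the final TTS should not be counted as a turn.

```python
from datetime import datetime

TURN_BUDGET_S = 2.5  # the M7 target quoted in the pass criteria


def slow_turns(turns, budget_s=TURN_BUDGET_S):
    """turns: list of (label, stt_iso, tts_iso). Return [(label, seconds)] over budget."""
    slow = []
    for label, stt_iso, tts_iso in turns:
        gap = (datetime.fromisoformat(tts_iso) - datetime.fromisoformat(stt_iso)).total_seconds()
        if gap > budget_s:
            slow.append((label, round(gap, 3)))
    return slow
```

An empty result means every measured turn met the target; each returned tuple names a turn to record with its observed latency.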

Scenario 3 — Web widget + cross-sell

Estimated time: ~15 min. Channel: Web widget (Tier-2). ADRs exercised: 0009 (channel substrate, COLLECT_PHONE entry-state), 0007 (payments), 0010 (cross-sell).

Pre-conditions

  • Tenant has at least two services with a service_relations row of type cross_sell between them. For pilot, manicure and pedicure are the canonical pair (M11 fixture).
  • <tenant>.appointments is empty for both services on the test slot (so the slot is genuinely available).

Sequence — happy path (accept the cross-sell)

Action — happy path (accept the cross-sell)

  1. Open http://localhost:3010/widget?tenant=spa_pilot in a browser.
  2. Type: I'd like a manicure tomorrow at 11am.
  3. Watch the widget chrome — it should walk you through COLLECT_PHONE (ADR-0009 Tier-2 entry-state since web has no phone metadata).
  4. Enter your phone number when prompted; confirm.
  5. Agent replies with the manicure confirmation prompt plus the cross-sell offer for pedicure.
  6. Reply yes to accept the bundle.
  7. STK push fires for the combined amount; approve in Daraja sandbox.
  8. Verify both appointments persist.

Expected log lines

For log-reading commands, see Observability.

| Stage | Log expectation | Widget expectation |
| --- | --- | --- |
| Page load | widget.session.create channel=web | Widget chrome renders |
| First message | state=GREET → COLLECT_PHONE | Phone-collection prompt |
| Phone validate | phone.validate ok (or e164_invalid with 3-strike re-prompt) | Phone accepted |
| Service + slot | state=COLLECT_PHONE → COLLECT_SERVICE → COLLECT_SLOT → CONFIRM | Confirmation prompt + cross-sell offer |
| User accept bundle | state=CONFIRM → BUNDLE_CONFIRM → PAYMENT_PENDING | (silent) |
| STK push | daraja.client stkpush amount_kes=<combined> | (silent) |
| Callback | state=PAYMENT_PENDING → BOOKED | Final confirmation message |
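
To make the Phone validate row concrete, here is a minimal sketch of E.164 validation with a three-strike limit. It is not the backend's validator: the function name is hypothetical, the only normalisation is stripping spaces, hyphens, and parentheses, and what the FSM does after the third strike is not specified in this runbook.

```python
import re

# E.164: a leading "+", then at most 15 digits, the first of which is not 0.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")
MAX_STRIKES = 3


def collect_phone(attempts):
    """Return (phone, strikes) for the first valid entry, or (None, strikes) on give-up."""
    strikes = 0
    for raw in attempts:
        candidate = re.sub(r"[\s\-()]", "", raw)
        if E164_RE.fullmatch(candidate):
            return candidate, strikes
        strikes += 1  # would log e164_invalid and re-prompt
        if strikes >= MAX_STRIKES:
            break
    return None, strikes
```

For example, `collect_phone(["0712 345 678", "+254 712 345 678"])` rejects the local-format entry once and then accepts the second entry as `+254712345678`.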

Pass criteria

  • tenant_spa_pilot.appointments has two rows with status='confirmed' for the same customer.
  • tenant_spa_pilot.payments has one row with state='settled' and amount_kes equal to the combined total.
  • Cross-channel identity: if the same phone has prior WhatsApp activity, customers shows a single row with customer_identities rows for both channels (per ADR-0009 D5 phone-only deterministic merge).

Action — the negative path (decline the cross-sell)

Re-run from Step 1 with a different test slot. At Step 6, reply no. Verify:

  • Only one appointment row created (manicure only).
  • The pedicure reservation that was held during BUNDLE_CONFIRM is released — i.e. no orphan row in appointments with status='reserved'.

This negative path tests the pair-release semantics from M11 cross-sell. If an orphan reservation is left behind, log it as a 🟡 small-bug deviation per the severity rubric.
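
The two checks for the decline path can be written down as a small helper. This sketch works on rows as plain dicts for the test customer and slot; only the status column and its 'confirmed' and 'reserved' values come from this page, and the function name is an assumption.

```python
def decline_path_failures(appointments):
    """appointments: rows (dicts) for the test customer and slot after replying no."""
    failures = []
    confirmed = [row for row in appointments if row.get("status") == "confirmed"]
    if len(confirmed) != 1:
        failures.append(f"expected 1 confirmed appointment, found {len(confirmed)}")
    # A row still in 'reserved' is the pedicure hold that should have been released.
    orphans = [row for row in appointments if row.get("status") == "reserved"]
    if orphans:
        failures.append(f"{len(orphans)} orphan reservation(s) left behind")
    return failures
```

An empty list means the pair release worked; any entry is a deviation to record.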


Scenario 4 — SMS reminder + admin handoff

Estimated time: ~15 min. Channels: WhatsApp inbound + admin orchestrator + SMS via Africa's Talking sandbox. ADRs exercised: 0006 (handoff Pattern 3), 0009 (SMS as NotificationSink, not Channel).

Pre-conditions

  • Africa's Talking sandbox credentials in .env; AT_USERNAME=sandbox.
  • Tenant admin's WhatsApp number set in tenant_admins table.
  • Existing appointment in tenant_spa_pilot.appointments 24h out so the reminder scheduler picks it up.

Sequence — handoff path

Action — handoff path

  1. Open WhatsApp, message the Meta test number with a deliberately vague turn:

    So like I was wondering if maybe sometime this week you might could possibly help with a thing for my wife

  2. The classifier should flag low-confidence intent. Per ADR-0006 Pattern 3, the FSM enters HANDOFF_PENDING.
  3. Within ~120s the admin (you, on the admin's WhatsApp) receives a briefing card.
  4. As the admin, reply with a directive: Tell them we have spa appointments Friday at 2pm and 4pm.
  5. The orchestrator forwards the response back to the customer.

Expected log lines — handoff

For log-reading guide and jq filter patterns, see Observability.

| Stage | Log expectation |
| --- | --- |
| Inbound vague turn | classifier confidence<0.6 |
| Handoff trigger | handoff.start reason=low_confidence pattern=3 |
| Admin notify | whatsapp.send to=<admin_phone> template=briefing_card |
| Admin reply | webhook.whatsapp actor=admin orchestrator_route=true |
| Customer reply | whatsapp.send to=<customer_phone> |
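
The table implies both an order and a deadline. The sketch below checks that the five stages appear in sequence and that the briefing card goes out within the ~120s budget. The condensed event labels are hypothetical stand-ins for the log lines above, and measuring the budget from handoff.start is an assumption; the runbook does not say where the clock starts.

```python
EXPECTED_ORDER = [
    "classifier",
    "handoff.start",
    "whatsapp.send:admin",
    "webhook.whatsapp:admin",
    "whatsapp.send:customer",
]
BRIEFING_BUDGET_S = 120


def handoff_failures(events):
    """events: list of (seconds_since_start, label) in log order."""
    failures = []
    # Walk one shared iterator so each expected label must follow the previous one.
    cursor = iter(label for _, label in events)
    for expected in EXPECTED_ORDER:
        if not any(label == expected for label in cursor):
            failures.append(f"missing or out of order: {expected}")
            break
    # Iterating in reverse lets the earliest occurrence of each label win.
    times = {label: t for t, label in reversed(events)}
    start, notify = times.get("handoff.start"), times.get("whatsapp.send:admin")
    if start is not None and notify is not None and notify - start > BRIEFING_BUDGET_S:
        failures.append(f"briefing card took {notify - start:.0f}s (budget ~{BRIEFING_BUDGET_S}s)")
    return failures
```

The order check stops at the first missing stage, since every later stage would be reported as missing too.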

Pass criteria — handoff

  • Briefing card reaches the admin's WhatsApp within ~120s.
  • Admin's response reaches the customer.
  • Handoff log row in <tenant>.handoff_log with pattern=3 and full transcript fold.

Action — SMS reminder path

  1. Time-shift or trigger the daily reminder cron manually:
    docker compose exec backend /app/.venv/bin/python -m app.scripts.run_reminder_dispatch
  2. The reminder scheduler picks the appointment 24h out.
  3. The session-window check finds the customer's WhatsApp 24h window expired, so the reminder falls back to SMS via Africa's Talking.
  4. SMS arrives on the customer's phone with the reminder text.

Expected log lines — SMS

| Stage | Log expectation |
| --- | --- |
| Reminder picked | reminder.candidate appt=<id> due_in_h=24 |
| Window check | whatsapp.session_window expired=true |
| SMS dispatch | sms.africas_talking send phone=<e164> cost_usd=0.003 |
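
The window check behind the second row reduces to a single comparison. A minimal sketch, assuming the decision depends only on the time of the customer's last inbound WhatsApp message; the function name is illustrative and the dispatcher may weigh other factors.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(hours=24)


def reminder_channel(last_inbound_at, now):
    """Pick 'whatsapp' while the 24h session window is open, otherwise 'sms'."""
    if last_inbound_at is not None and now - last_inbound_at < SESSION_WINDOW:
        return "whatsapp"
    # No inbound message on record, or the window has lapsed: fall back to SMS.
    return "sms"
```

If the test customer messaged the WhatsApp number within the last 24 hours, this logic keeps the reminder on WhatsApp and the SMS rows will not appear in the log.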

Pass criteria — SMS

  • SMS arrives on the customer's phone (Africa's Talking sandbox shows the message in its dashboard).
  • Cost line in the log shows ~$0.003 — the cost-discipline win that motivated SMS-as-fallback in ADR-0009 (cheaper than out-of-window WhatsApp utility templates in Kenya).

Deviation severity rubric

Capture every divergence in docs/M12-pilot-deviations.md. Each row: timestamp | scenario | observation | severity | resolution. Severity drives whether the deviation gets fixed during W5 or deferred to M13.

| Severity | Symbol | Definition | Action |
| --- | --- | --- | --- |
| Doc-only fix | 🟢 | The runbook said X, the system actually does Y. The system is correct; the runbook is wrong. | Open a docs-only PR in the W5 docs-fixes commit. No code change. |
| Small bug | 🟡 | The system has a clear bug; the fix is < 30 min and well-scoped. | Open a code PR in the W5 bug-fixes commit. Re-run the affected scenario. |
| Deferred | 🔴 | A real architectural surprise: the fix needs design-level discussion or > 30 min of work. | Log as M13 carry-forward in the close-out memo + open a GitHub issue. Do not fix during W5. |

The W5 commit cap is 3 commits maximum per the M12 plan v3 R6 mitigation: one pilot-execution commit (the M12-pilot-deviations.md capture file with all observations), one docs-fixes commit (all 🟢 batched), one bug-fixes commit (all 🟡 batched). Anything 🔴 stays in the deviations file as a carry-forward — it does not get a commit during W5.

This cap exists because the pilot is a calibration pass, not a code-change marathon. The point is to learn what the system actually does in pilot conditions and capture that learning, not to rewrite the system from inside the pilot. M13 is the place for architectural responses to 🔴 findings.
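
The rubric and the commit cap combine into a small decision function. This is a sketch of the table's logic with illustrative names. The rubric leaves a fix of exactly 30 minutes undefined; treating it as deferred is an assumption made here to stay on the cautious side.

```python
def triage(system_is_correct, fix_minutes=0, needs_design_discussion=False):
    """Return (symbol, destination) for one deviation, following the rubric table."""
    if system_is_correct:
        # The runbook was wrong, not the system.
        return "🟢", "docs-fixes commit"
    if needs_design_discussion or fix_minutes >= 30:
        return "🔴", "M13 carry-forward"
    return "🟡", "bug-fixes commit"
```

Because every 🟢 lands in one commit and every 🟡 in another, the function never produces more than two destinations that need a commit, which together with the pilot-execution commit stays within the cap of three.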


Where to capture

Open docs/M12-pilot-deviations.md in an editor before starting. Use this row shape:

| 2026-05-07 09:34 | S1 WhatsApp | Confirmation message took 47s instead of expected ~30s | 🔴 | M13 carry-forward — investigate Daraja sandbox latency vs production |
| 2026-05-07 09:51 | S3 web widget | "manicure" tokenisation collapsed to "mani cure" in classifier | 🟡 | bug-fixes commit — extend tokenizer test fixture |
| 2026-05-07 10:08 | S2 voice | TTS playback expectation says "~2.5s end-to-end" but actual is "~3.0s on first turn" | 🟢 | docs-fixes commit — update Scenario 2 latency target |
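
Typing rows by hand invites malformed cells. The helper below is a small sketch that emits a row in the shape shown above and rejects unknown severity symbols; the function name is hypothetical, and escaping a literal pipe with a backslash is an assumption about how the file will be rendered.

```python
from datetime import datetime

SEVERITIES = {"🟢", "🟡", "🔴"}


def deviation_row(when, scenario, observation, severity, resolution):
    """Format one capture-file row: timestamp | scenario | observation | severity | resolution."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity symbol: {severity!r}")
    cells = [when.strftime("%Y-%m-%d %H:%M"), scenario, observation, severity, resolution]
    # A literal "|" inside a cell would split the row, so escape it.
    return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"
```

Appending the returned string plus a newline to docs/M12-pilot-deviations.md keeps every row in the same five-column shape.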

The file is deleted at W6 close-out; the observations are folded into the close-out memo at ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m12_pilot_readiness_landed.md. That memo becomes the durable record.


What next

  • During the pass — when something goes wrong, Incidents runbook is the diagnose-and-fix table.
  • For log reading during the pass — Observability covers structured log fields, jq filter patterns, and the daily digest.
  • After the pass — the W6 close-out memo aggregates everything into a single milestone retrospective.
  • For M13 onboarding — re-run this exact runbook against a real tenant. The 4 scenarios become the acceptance criteria for "this tenant is live."