Incident response
Current reality
Ratiba M13 is a pre-pilot, single-operator project. There is no on-call rotation, no paging system, and no SLA with the 5-10 beta customers. The incident process today is:
- Detect — Adrian notices something is broken (direct observation, a beta customer WhatsApp, or the daily operator digest at 07:00 EAT).
- Triage — Use the Incidents runbook failure-mode table. That table is the on-call playbook. It covers the most common failure modes (container unhealthy, WhatsApp signature failure, Daraja callback not arriving, FSM stuck, payment stuck, checkpoint accumulation) with diagnosis steps and copy-pastable fixes. Do not duplicate it here.
- Diagnose — Use the Observability runbook to read structured logs, tail docker compose output, and correlate timestamps.
- Fix, redeploy, verify — `./start-server.sh` restarts the backend + worker; `./start-client.sh` restarts the frontend. Confirm the fix via a test message through the affected channel.
- Note it — Brief the beta customer on WhatsApp if the failure was visible to them. Add a row to the Incidents runbook if the failure mode was new and took more than ~10 minutes to diagnose.
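Before sending the test message in the verify step, it can help to confirm that every container came back up. The following is a minimal sketch, not part of the existing tooling: `unhealthy_services` and `check_compose` are hypothetical helper names, and the sketch assumes `docker compose ps --format json` emits `Service`, `State`, and `Health` fields (either as one JSON array or as one JSON object per line, depending on the Compose version).

```python
import json
import subprocess


def unhealthy_services(ps_output: str) -> list[str]:
    """Return services that are not running, or running but not healthy."""
    text = ps_output.strip()
    if not text:
        return []
    if text.startswith("["):
        # Some Compose v2 releases print a single JSON array.
        containers = json.loads(text)
    else:
        # Others print one JSON object per line.
        containers = [json.loads(line) for line in text.splitlines() if line.strip()]

    bad = []
    for c in containers:
        running = c.get("State") == "running"
        # Assumption: Health is "" when the service defines no healthcheck.
        # "starting" and "unhealthy" are both reported as not yet healthy.
        healthy = c.get("Health", "") in ("", "healthy")
        if not (running and healthy):
            bad.append(c.get("Service", c.get("Name", "?")))
    return bad


def check_compose() -> list[str]:
    """Run `docker compose ps` in the current directory and report bad services."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--all", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return unhealthy_services(result.stdout)
```

An empty list from `check_compose()` only says the containers are up; the test message through the affected channel remains the real confirmation.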
The severity and SLO framework below is a target posture for post-M13 production — not a commitment made to the M13 beta cohort.
Severity & SLO (target)
These definitions and response targets are post-pilot goals, not M13 commitments. They are written down now to establish vocabulary, so that when a paid tenant asks "what's your SLA?" there is a principled answer to build from.
Severity definitions
| Severity | Definition | Example in Ratiba | Target response | Target resolution |
|---|---|---|---|---|
| P1 — Critical | Core booking or payment flow broken for all tenants, OR data integrity at risk | Payment reconciliation down (public.payment_routing reaper failed; unrouted callbacks accumulating in public.payment_callbacks_unrouted); WhatsApp webhook rejecting all inbound messages; Postgres primary unreachable | 15 minutes | 2 hours |
| P2 — Significant | Single-tenant outage or a material feature broken for a subset of tenants | One tenant's catalog import failing (M11 idempotent import error); voice calls not connecting for tenants on a specific LiveKit config; a tenant's FSM stuck in COLLECT_SLOT due to empty staff_schedules | 1 hour | 8 hours |
| P3 — Degraded | Degraded performance or a non-critical feature broken; no booking or payment data at risk | Response latency elevated but bookings completing; a dashboard analytics query timing out; Docusaurus build failing on a docs-only PR | Next business day | 48 hours |
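The targets in the table translate directly into deadlines once an incident has a detection time. The sketch below is illustrative only: `TARGETS`, `Target`, and `deadlines` are hypothetical names, and the P3 response target ("next business day") is left as `None` rather than guessing at business hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# East Africa Time is UTC+3 year-round (no daylight saving).
EAT = timezone(timedelta(hours=3))


@dataclass(frozen=True)
class Target:
    response: Optional[timedelta]  # None means "next business day"
    resolution: timedelta


# Values copied from the severity table above.
TARGETS = {
    "P1": Target(timedelta(minutes=15), timedelta(hours=2)),
    "P2": Target(timedelta(hours=1), timedelta(hours=8)),
    "P3": Target(None, timedelta(hours=48)),
}


def deadlines(severity: str, detected_at: datetime):
    """Return (respond_by, resolve_by) for an incident detected at detected_at."""
    target = TARGETS[severity]
    respond_by = None if target.response is None else detected_at + target.response
    resolve_by = detected_at + target.resolution
    return respond_by, resolve_by
```

For example, a P1 detected from the 07:00 EAT digest would have a respond-by time of 07:15 and a resolve-by time of 09:00 under these targets.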
P1 examples expand on the real failure modes:
- Payment reconciliation down: the daily reaper at 3 AM EAT has not run for >1 day; `public.payment_routing` rows growing unbounded; callbacks arriving for reaped rows are silently written to `public.payment_callbacks_unrouted` but never surfaced to the operator. Risk: customers charged but booking not confirmed.
- All inbound messages failing: WhatsApp `WHATSAPP_APP_SECRET` rotated but `.env` not updated; every webhook returns 403; no new bookings can start for any tenant.
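For the rotated-secret case, one quick way to confirm the diagnosis is to recompute the signature of a captured webhook body with the secret currently in `.env`. This is a sketch under an assumption: that the webhook follows Meta's WhatsApp Cloud API convention, where each request carries an `X-Hub-Signature-256` header of the form `sha256=<hex>`, an HMAC-SHA256 of the raw request body keyed with the app secret. `signature_matches` is a hypothetical helper, not a function from the Ratiba codebase.

```python
import hashlib
import hmac


def signature_matches(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check a captured X-Hub-Signature-256 header against a candidate secret."""
    digest = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # Constant-time comparison, as is usual for signature checks.
    return hmac.compare_digest(expected, header_value)
```

If the check fails with the value from `.env` but passes with the newly rotated secret, the 403s come from the stale secret rather than from a code change. The raw body must be the exact bytes received; re-serialized JSON will not match.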
Classification note for M13: Adrian is the classifier. If it looks like a P1, treat it as a P1. Over-classifying costs a thorough investigation of what turns out to be a P2; under-classifying a real P1 costs customer money going unreconciled.
Response procedure
P1
- Drop everything. Open `docker compose ps` and `docker compose logs -f backend`.
- Work the Incidents runbook triage. Most P1s are in the table.
- Brief affected tenants on WhatsApp within the target response window, even if the fix isn't ready: "We're aware of an issue with [bookings/payments] and are working on it. We'll update you in [N] minutes."
- Fix, verify, communicate resolution.
- File a post-mortem within 48 hours (template below).
P2
- Work the Incidents runbook triage when your current task reaches a safe pause point (within 1 hour).
- Brief the affected tenant if the failure is visible to their customers.
- Fix, verify.
- File a post-mortem within 1 week if the root cause was non-obvious.
P3
- Log it (a GitHub issue or a note in the relevant close-out memo).
- Fix in the next available sprint or alongside related work.
Post-mortem template
Post-mortems are blameless. The goal is a durable record of what happened and what changes will prevent recurrence — not an audit of who did what wrong.
File post-mortems as docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md (not published to
the Docusaurus site; stored in the repo for searchability). For M13 beta, a WhatsApp
voice note to yourself followed by a brief written note is acceptable — the template
below is the target form for production post-mortems.
## Incident: <title> — <date>
**Severity:** P1 / P2 / P3
**Duration:** <start time EAT> → <end time EAT> (<N> minutes)
**Impact:** <which tenants / how many bookings / customers affected; whether payment
data was at risk>
### Timeline
- HH:MM EAT — <event: what was observed or what action was taken>
- HH:MM EAT — <event>
- HH:MM EAT — <resolved / workaround applied>
### Root cause
<One paragraph. What was the proximate cause? What was the underlying cause that made
the proximate cause possible?>
### Contributing factors
- <Factor that made the incident more likely or harder to detect>
- <e.g., "no alert on public.payment_callbacks_unrouted growth">
### What we did right
- <e.g., "daily digest surfaced the anomaly within 8 hours">
- <e.g., "dead-letter table meant no payment data was lost">
### Action items
| Item | Owner | Due |
|---|---|---|
| <specific change, e.g. "add alert when unreviewed callbacks > 0"> | Adrian | <date> |
| <add row to Incidents runbook for this failure mode> | Adrian | <date> |
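Two fields of the template are mechanical: the file name and the duration. The sketch below shows one way to derive both. The helper names are hypothetical, and the slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption, since the naming convention above only fixes the `YYYY-MM-DD-<slug>.md` shape.

```python
import re
from datetime import date, datetime
from pathlib import PurePosixPath

POST_MORTEM_DIR = PurePosixPath("docs/runbook/post-mortems")


def post_mortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md from a title."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return POST_MORTEM_DIR / f"{incident_date.isoformat()}-{slug}.md"


def duration_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes between start and end, for the Duration line."""
    return int((end - start).total_seconds() // 60)
```

For instance, an incident titled "Payment reaper did not run" on 2025-01-06 would be filed as `docs/runbook/post-mortems/2025-01-06-payment-reaper-did-not-run.md`.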
Related
- Incidents runbook — the failure-mode triage table (symptom → diagnosis → fix)
- Observability runbook — how to read structured logs and the daily digest
- Capacity & scaling — the known bottlenecks and why they matter