Incident response

Current reality

Ratiba M13 is a pre-pilot, single-operator project. There is no on-call rotation, no paging system, and no SLA with the 5-10 beta customers. The incident process today is:

1. Detect: Adrian notices something is broken (direct observation, a WhatsApp message from a beta customer, or the daily operator digest at 07:00 EAT).
2. Triage: Use the Incidents runbook failure-mode table. That table is the on-call playbook. It covers the most common failure modes (container unhealthy, WhatsApp signature failure, Daraja callback not arriving, FSM stuck, payment stuck, checkpoint accumulation) with diagnosis steps and copy-pastable fixes. Do not duplicate it here.
3. Diagnose: Use the Observability runbook to read structured logs, tail `docker compose` output, and correlate timestamps.
4. Fix, redeploy, verify: `./start-server.sh` restarts the backend and worker; `./start-client.sh` restarts the frontend. Confirm the fix by sending a test message through the affected channel.
5. Note it: Brief the beta customer on WhatsApp if the failure was visible to them. Add a row to the Incidents runbook if the failure mode was new and took more than ~10 minutes to diagnose.

The severity and SLO framework below is a target posture for post-M13 production — not a commitment made to the M13 beta cohort.

Severity & SLO (target)

These definitions and response targets are aspirational: post-pilot targets, not M13 commitments. They are written down now to establish a shared vocabulary, so that when a paid tenant asks "what's your SLA?" there is a principled answer to build from.

Severity definitions

Severity	Definition	Example in Ratiba	Target response	Target resolution
P1 — Critical	Core booking or payment flow broken for all tenants, OR data integrity at risk	Payment reconciliation down (`public.payment_routing` reaper failed; unrouted callbacks accumulating in `public.payment_callbacks_unrouted`); WhatsApp webhook rejecting all inbound messages; Postgres primary unreachable	15 minutes	2 hours
P2 — Significant	Single-tenant outage or a material feature broken for a subset of tenants	One tenant's catalog import failing (M11 idempotent import error); voice calls not connecting for tenants on a specific LiveKit config; a tenant's FSM stuck in `COLLECT_SLOT` due to empty `staff_schedules`	1 hour	8 hours
P3 — Degraded	Degraded performance or a non-critical feature broken; no booking or payment data at risk	Response latency elevated but bookings completing; a dashboard analytics query timing out; Docusaurus build failing on a docs-only PR	Next business day	48 hours
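
The definitions in the table reduce to a short decision rule. The sketch below is illustrative only: the `Impact` fields, the `classify` function, and the `TARGETS` mapping are hypothetical names that do not exist in the Ratiba codebase, and the target strings are copied from the table rather than enforced anywhere.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass(frozen=True)
class Impact:
    """Hypothetical summary of what an incident affects."""
    core_flow_broken_for_all_tenants: bool = False  # booking or payment flow
    data_integrity_at_risk: bool = False
    tenant_outage_or_material_feature_broken: bool = False


# (target response, target resolution), copied from the severity table.
TARGETS = {
    Severity.P1: ("15 minutes", "2 hours"),
    Severity.P2: ("1 hour", "8 hours"),
    Severity.P3: ("Next business day", "48 hours"),
}


def classify(impact: Impact) -> Severity:
    """Apply the table top-down: the most severe matching row wins."""
    if impact.core_flow_broken_for_all_tenants or impact.data_integrity_at_risk:
        return Severity.P1
    if impact.tenant_outage_or_material_feature_broken:
        return Severity.P2
    return Severity.P3
```

For example, `classify(Impact(data_integrity_at_risk=True))` returns `Severity.P1` even when only one tenant is affected, because the P1 row is checked first.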

P1 examples expand on the real failure modes:

Payment reconciliation down: the daily reaper at 3 AM EAT has not run for more than a day; `public.payment_routing` rows are growing unbounded; callbacks arriving for reaped rows are silently written to `public.payment_callbacks_unrouted` but never surfaced to the operator. Risk: customers charged but booking not confirmed.
All inbound messages failing: the WhatsApp `WHATSAPP_APP_SECRET` was rotated but `.env` was not updated; every webhook returns 403; no new bookings can start for any tenant.
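
To see why a stale secret fails every inbound message at once, recall that Meta signs each webhook payload with HMAC-SHA256 keyed by the app secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below is not Ratiba's handler; the function names and sample secrets are hypothetical, and it only shows the mismatch that a receiver would typically turn into a 403.

```python
import hashlib
import hmac


def signature_for(app_secret: str, raw_body: bytes) -> str:
    """Build the header value a sender would compute for this body."""
    digest = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_valid_signature(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = signature_for(app_secret, raw_body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))


# The sender signs with the rotated secret; the server still holds the old one.
body = b'{"entry": []}'
header = signature_for("new-secret", body)

stale_env_ok = is_valid_signature("old-secret", body, header)   # False
updated_env_ok = is_valid_signature("new-secret", body, header)  # True
```

Because the check depends only on the secret and the raw body, a stale `.env` value rejects every tenant's traffic, which is what makes this failure mode a P1 rather than a P2.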

Classification note for M13: Adrian is the classifier. If it looks like a P1, treat it as a P1. The cost of over-classifying is a thorough investigation of what turns out to be a P2; the cost of under-classifying a real P1 is customer money going unreconciled.

Response procedure

P1

1. Drop everything. Open `docker compose ps` and `docker compose logs -f backend`.
2. Work the Incidents runbook triage. Most P1s are in the table.
3. Brief affected tenants on WhatsApp within the target response window, even if the fix isn't ready: "We're aware of an issue with [bookings/payments] and are working on it. We'll update you in [N] minutes."
4. Fix, verify, communicate resolution.
5. File a post-mortem within 48 hours (template below).

P2

1. Work the Incidents runbook triage when your current task reaches a safe pause point (within 1 hour).
2. Brief the affected tenant if the failure is visible to their customers.
3. Fix, verify.
4. File a post-mortem within 1 week if the root cause was non-obvious.

P3

1. Log it (a GitHub issue or a note in the relevant close-out memo).
2. Fix in the next available sprint or alongside related work.
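
The time limits in the procedure can be turned into concrete timestamps at the moment an incident is detected. This is a minimal sketch with hypothetical helper names. It assumes that both windows are counted from the detection time, which the procedure does not state, and it returns `None` for P3 because that tier has no timed response step.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EAT = timezone(timedelta(hours=3))  # East Africa Time is UTC+3 with no DST

# Windows taken from the response procedure; P3 has no timed steps.
RESPONSE_WINDOW = {"P1": timedelta(minutes=15), "P2": timedelta(hours=1)}
POST_MORTEM_WINDOW = {"P1": timedelta(hours=48), "P2": timedelta(weeks=1)}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Latest time to brief tenants (P1) or start triage (P2)."""
    window = RESPONSE_WINDOW.get(severity)
    return detected_at + window if window else None


def post_mortem_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Latest time to file the post-mortem, assuming the clock starts at detection.

    For P2 this applies only when the root cause was non-obvious.
    """
    window = POST_MORTEM_WINDOW.get(severity)
    return detected_at + window if window else None


detected = datetime(2025, 1, 6, 9, 0, tzinfo=EAT)
p1_brief_by = response_deadline("P1", detected)            # 09:15 EAT the same day
p1_post_mortem_by = post_mortem_deadline("P1", detected)   # 09:00 EAT two days later
```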

Post-mortem template

Post-mortems are blameless. The goal is a durable record of what happened and what changes will prevent recurrence — not an audit of who did what wrong.

File post-mortems as docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md (not published to the Docusaurus site; stored in the repo for searchability). For M13 beta, a WhatsApp voice note to yourself followed by a brief written note is acceptable — the template below is the target form for production post-mortems.

## Incident: <title> — <date>

**Severity:** P1 / P2 / P3
**Duration:** <start time EAT> → <end time EAT> (<N> minutes)
**Impact:** <which tenants / how many bookings / customers affected; whether payment
data was at risk>

### Timeline

- HH:MM EAT — <event: what was observed or what action was taken>
- HH:MM EAT — <event>
- HH:MM EAT — <resolved / workaround applied>

### Root cause

<One paragraph. What was the proximate cause? What was the underlying cause that made
the proximate cause possible?>

### Contributing factors

- <Factor that made the incident more likely or harder to detect>
- <e.g., "no alert on public.payment_callbacks_unrouted growth">

### What we did right

- <e.g., "daily digest surfaced the anomaly within 8 hours">
- <e.g., "dead-letter table meant no payment data was lost">

### Action items

| Item | Owner | Due |
|---|---|---|
| <specific change, e.g. "add alert when unreviewed callbacks > 0"> | Adrian | <date> |
| <add row to Incidents runbook for this failure mode> | Adrian | <date> |
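
The `docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md` naming convention given before the template can be generated rather than typed by hand. The helper below is a sketch, not an existing script in the repo: `slugify` and `post_mortem_path` are hypothetical names, and the slug rule (lowercase ASCII letters and digits joined by hyphens) is an assumption.

```python
import re
from datetime import date
from pathlib import PurePosixPath

POST_MORTEM_DIR = PurePosixPath("docs/runbook/post-mortems")


def slugify(title: str) -> str:
    """Lowercase the title and collapse every non-alphanumeric run into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "incident"  # fall back if the title has no usable characters


def post_mortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build the repo-relative path for a post-mortem file."""
    return POST_MORTEM_DIR / f"{incident_date.isoformat()}-{slugify(title)}.md"


example_path = post_mortem_path(date(2025, 1, 6), "Daraja callback not arriving")
# docs/runbook/post-mortems/2025-01-06-daraja-callback-not-arriving.md
```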

Related

Incidents runbook: the failure-mode triage table (symptom → diagnosis → fix)
Observability runbook: how to read structured logs and the daily digest
Capacity & scaling: the known bottlenecks and why they matter
