Incident response

Current reality

Ratiba M13 is a pre-pilot, single-operator project. There is no on-call rotation, no paging system, and no SLA with the 5-10 beta customers. The incident process today is:

  1. Detect — Adrian notices something is broken (direct observation, a beta customer WhatsApp, or the daily operator digest at 07:00 EAT).
  2. Triage — Use the Incidents runbook failure-mode table. That table is the on-call playbook. It covers the most common failure modes (container unhealthy, WhatsApp signature failure, Daraja callback not arriving, FSM stuck, payment stuck, checkpoint accumulation) with diagnosis steps and copy-pastable fixes. Do not duplicate it here.
  3. Diagnose — Use the Observability runbook to read structured logs, tail docker compose output, and correlate timestamps.
  4. Fix, redeploy, verify: ./start-server.sh restarts the backend + worker; ./start-client.sh restarts the frontend. Confirm the fix by sending a test message through the affected channel.
  5. Note it — Brief the beta customer on WhatsApp if the failure was visible to them. Add a row to the Incidents runbook if the failure mode was new and took more than ~10 minutes to diagnose.
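
As a rough aid for the triage and verify steps, the sketch below lists services that are down or failing their healthcheck. It assumes the text comes from `docker compose ps --all --format json` and that each entry carries `Service`, `State`, and `Health` fields. The exact output shape varies between Compose versions (one JSON array, or one JSON object per line), so both are handled. The function name is hypothetical and not part of the Ratiba codebase.

```python
import json


def unhealthy_services(ps_output: str) -> list[str]:
    """Return services that are not running or report an unhealthy healthcheck.

    `ps_output` is the text printed by `docker compose ps --all --format json`
    (assumed shape: objects with "Service", "State" and "Health" keys).
    """
    text = ps_output.strip()
    if not text:
        return []
    if text.startswith("["):
        # Older Compose releases print a single JSON array.
        containers = json.loads(text)
    else:
        # Newer releases print one JSON object per line.
        containers = [json.loads(line) for line in text.splitlines() if line.strip()]

    bad = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")
        if state != "running" or health == "unhealthy":
            bad.append(container.get("Service", container.get("Name", "<unknown>")))
    return sorted(bad)
```

An empty result after a restart is a necessary check, not a sufficient one: the test message through the affected channel is still the real verification.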

The severity and SLO framework below is a target posture for post-M13 production — not a commitment made to the M13 beta cohort.


Severity & SLO (target)

These definitions and response targets are aspirational — post-pilot targets, not M13 commitments. They are written here to establish vocabulary now so that when a paid tenant asks "what's your SLA?" there is a principled answer to build from.

Severity definitions

| Severity | Definition | Example in Ratiba | Target response | Target resolution |
|---|---|---|---|---|
| P1 (Critical) | Core booking or payment flow broken for all tenants, OR data integrity at risk | Payment reconciliation down (public.payment_routing reaper failed; unrouted callbacks accumulating in public.payment_callbacks_unrouted); WhatsApp webhook rejecting all inbound messages; Postgres primary unreachable | 15 minutes | 2 hours |
| P2 (Significant) | Single-tenant outage or a material feature broken for a subset of tenants | One tenant's catalog import failing (M11 idempotent import error); voice calls not connecting for tenants on a specific LiveKit config; a tenant's FSM stuck in COLLECT_SLOT due to empty staff_schedules | 1 hour | 8 hours |
| P3 (Degraded) | Degraded performance or a non-critical feature broken; no booking or payment data at risk | Response latency elevated but bookings completing; a dashboard analytics query timing out; Docusaurus build failing on a docs-only PR | Next business day | 48 hours |
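
The response and resolution targets can be turned into concrete deadlines with a few lines of code. The sketch below is illustrative only: the names `TARGETS`, `deadlines`, and `end_of_next_business_day` are hypothetical, and "next business day" is assumed to mean the end of the next Monday-to-Friday day in EAT, which the targets themselves do not specify.

```python
from datetime import datetime, timedelta, timezone

EAT = timezone(timedelta(hours=3))  # East Africa Time is UTC+3 year-round

# (target response, target resolution) per severity.
# None for the P3 response stands for "next business day".
TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (None, timedelta(hours=48)),
}


def end_of_next_business_day(moment: datetime) -> datetime:
    """Assumption: 'next business day' ends at 23:59:59 EAT on the next Mon-Fri day."""
    day = moment.astimezone(EAT) + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=23, minute=59, second=59, microsecond=0)


def deadlines(severity: str, detected: datetime) -> tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for a timezone-aware detection time."""
    response, resolution = TARGETS[severity]
    if response is None:
        respond_by = end_of_next_business_day(detected)
    else:
        respond_by = detected + response
    return respond_by, detected + resolution


if __name__ == "__main__":
    detected = datetime(2025, 1, 3, 10, 0, tzinfo=EAT)  # a Friday morning
    for severity in TARGETS:
        respond_by, resolve_by = deadlines(severity, detected)
        print(severity, respond_by.isoformat(), resolve_by.isoformat())
```

Pass a timezone-aware `detected` value; a naive datetime would be interpreted in the machine's local timezone by `astimezone`.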

Two of the P1 examples in more detail, as real failure modes:

  • Payment reconciliation down: daily reaper at 3 AM EAT has not run for >1 day; public.payment_routing rows growing unbounded; callbacks arriving for reaped rows are silently written to public.payment_callbacks_unrouted but never surfaced to the operator. Risk: customers charged but booking not confirmed.
  • All inbound messages failing: WhatsApp WHATSAPP_APP_SECRET rotated but .env not updated; every webhook returns 403; no new bookings can start for any tenant.
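
To see why a rotated secret rejects every inbound message, recall that Meta signs each webhook payload with HMAC-SHA256 using the app secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below shows that check in isolation. It is not Ratiba's actual handler, and the function name is hypothetical.

```python
import hashlib
import hmac


def signature_is_valid(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check an X-Hub-Signature-256 header of the form 'sha256=<hex digest>'."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    received = header_value[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())


if __name__ == "__main__":
    body = b'{"entry": []}'
    # Meta signs with the rotated (new) secret...
    header = "sha256=" + hmac.new(b"new-secret", body, hashlib.sha256).hexdigest()
    # ...while a stale .env still holds the old one, so every request fails.
    print(signature_is_valid("old-secret", body, header))  # False
    print(signature_is_valid("new-secret", body, header))  # True
```

The digest must be computed over the raw request bytes. Re-serialising parsed JSON before hashing produces a different digest and the same blanket rejection.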

Classification note for M13: Adrian is the classifier. If it looks like a P1, treat it as a P1. The cost of over-classifying is a thorough investigation that turns out to be a P2; the cost of under-classifying a real P1 is customer money going unreconciled.
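
The severity definitions and the over-classification rule can be written down as a small decision function. This is a sketch of the reasoning, not an automated classifier in use today. The parameter names are hypothetical, and `looks_like_p1` stands in for the operator's judgment.

```python
from datetime import datetime, timedelta, timezone


def reaper_overdue(last_run: datetime, now: datetime) -> bool:
    """True when the daily reaper has not run for more than one day."""
    return now - last_run > timedelta(days=1)


def classify(
    *,
    core_flow_broken: bool,
    all_tenants_affected: bool,
    data_integrity_at_risk: bool,
    tenant_feature_broken: bool = False,
    looks_like_p1: bool = False,
) -> str:
    """Map observations to P1/P2/P3, erring towards P1 when in doubt."""
    if looks_like_p1 or data_integrity_at_risk:
        return "P1"
    if core_flow_broken and all_tenants_affected:
        return "P1"
    if core_flow_broken or tenant_feature_broken:
        return "P2"
    return "P3"


if __name__ == "__main__":
    now = datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc)
    last_run = now - timedelta(hours=30)
    print(classify(
        core_flow_broken=False,
        all_tenants_affected=False,
        data_integrity_at_risk=reaper_overdue(last_run, now),
    ))  # P1
```

Treating an overdue reaper as a data-integrity risk follows the P1 example above: payments may be charged without the booking being confirmed.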

Response procedure

P1

  1. Drop everything. Run docker compose ps and docker compose logs -f backend.
  2. Work the Incidents runbook triage. Most P1s are in the table.
  3. Brief affected tenants on WhatsApp within the target response window, even if the fix isn't ready: "We're aware of an issue with [bookings/payments] and are working on it. We'll update you in [N] minutes."
  4. Fix, verify, communicate resolution.
  5. File a post-mortem within 48 hours (template below).

P2

  1. Work the Incidents runbook triage when your current task reaches a safe pause point (within 1 hour).
  2. Brief the affected tenant if the failure is visible to their customers.
  3. Fix, verify.
  4. File a post-mortem within 1 week if the root cause was non-obvious.

P3

  1. Log it (a GitHub issue or a note in the relevant close-out memo).
  2. Fix in the next available sprint or alongside related work.
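
The communication and follow-up rules in the procedures above are simple enough to sketch as two helpers. Both function names are hypothetical. The post-mortem window is assumed to start at resolution time, which the procedure does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def holding_message(area: str, minutes_to_next_update: int) -> str:
    """Render the P1 holding message; `area` is e.g. 'bookings' or 'payments'."""
    return (
        f"We're aware of an issue with {area} and are working on it. "
        f"We'll update you in {minutes_to_next_update} minutes."
    )


def post_mortem_due(
    severity: str, resolved_at: datetime, root_cause_obvious: bool = False
) -> Optional[datetime]:
    """Return the post-mortem deadline, or None when none is required."""
    if severity == "P1":
        return resolved_at + timedelta(hours=48)
    if severity == "P2" and not root_cause_obvious:
        return resolved_at + timedelta(weeks=1)
    return None  # P3, or a P2 whose root cause was obvious


if __name__ == "__main__":
    resolved = datetime(2025, 1, 3, 12, 0, tzinfo=timezone(timedelta(hours=3)))
    print(holding_message("payments", 30))
    print(post_mortem_due("P1", resolved))
```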

Post-mortem template

Post-mortems are blameless. The goal is a durable record of what happened and what changes will prevent recurrence — not an audit of who did what wrong.

File post-mortems as docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md (not published to the Docusaurus site; stored in the repo for searchability). For M13 beta, a WhatsApp voice note to yourself followed by a brief written note is acceptable — the template below is the target form for production post-mortems.
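
The file naming convention and the template's Duration line can be generated rather than typed by hand. The helpers below are a sketch with hypothetical names. The slug rule (lowercase, hyphen-separated ASCII) is an assumption, since the convention only specifies `<slug>`.

```python
import re
from datetime import date, datetime, timedelta, timezone

EAT = timezone(timedelta(hours=3))  # East Africa Time, UTC+3


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "incident"


def post_mortem_path(incident_date: date, title: str) -> str:
    """Build docs/runbook/post-mortems/YYYY-MM-DD-<slug>.md."""
    return f"docs/runbook/post-mortems/{incident_date.isoformat()}-{slugify(title)}.md"


def duration_line(start: datetime, end: datetime) -> str:
    """Format the Duration field from two timezone-aware timestamps."""
    start_eat, end_eat = start.astimezone(EAT), end.astimezone(EAT)
    minutes = int((end - start).total_seconds() // 60)
    return f"**Duration:** {start_eat:%H:%M} EAT to {end_eat:%H:%M} EAT ({minutes} minutes)"


if __name__ == "__main__":
    print(post_mortem_path(date(2025, 1, 3), "Reaper did not run"))
    print(duration_line(
        datetime(2025, 1, 3, 3, 0, tzinfo=EAT),
        datetime(2025, 1, 3, 4, 35, tzinfo=EAT),
    ))
```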

## Incident: <title> — <date>

**Severity:** P1 / P2 / P3
**Duration:** <start time EAT> to <end time EAT> (<N> minutes)
**Impact:** <which tenants / how many bookings / customers affected; whether payment
data was at risk>

### Timeline

- HH:MM EAT — <event: what was observed or what action was taken>
- HH:MM EAT — <event>
- HH:MM EAT — <resolved / workaround applied>

### Root cause

<One paragraph. What was the proximate cause? What was the underlying cause that made
the proximate cause possible?>

### Contributing factors

- <Factor that made the incident more likely or harder to detect>
- <e.g., "no alert on public.payment_callbacks_unrouted growth">

### What we did right

- <e.g., "daily digest surfaced the anomaly within 8 hours">
- <e.g., "dead-letter table meant no payment data was lost">

### Action items

| Item | Owner | Due |
|---|---|---|
| <specific change, e.g. "add alert when unreviewed callbacks > 0"> | Adrian | <date> |
| <add row to Incidents runbook for this failure mode> | Adrian | <date> |