Incidents runbook

Failure-mode-to-fix table. Living document — every new failure mode that costs more than ~10 minutes to diagnose should land here as a follow-up PR row. The table is ordered roughly by frequency: containers and env vars at the top, payment + FSM in the middle, test pollution at the bottom.

If a failure isn't on the table yet, work through the Triage flowchart below to narrow it to a known service, then match to the table. If neither works, the Escalation path at the end says who to grab and what to bring.

Triage

Four questions, in order. Most incidents resolve at Q3.

Which service is failing? Run docker compose ps. Anything not healthy (or, for LiveKit, not running) is suspect. See the flowchart below.
When did it start? A failure that surfaced after a git pull is almost always a missing migration (alembic upgrade head) or a new env var. A failure during a working session is usually data-related (a bad row, a stale Redis key, an FSM thread stuck).
What changed? Run git log --oneline -20. If a migration or service definition changed, that's the lead. To see the changes themselves, run git diff HEAD~5 -- backend/app docker-compose.yml.
What's in the logs? See Observability for how to read structured logs and per-service tail commands. The short answer: docker compose logs --since 5m <service>. Most failures self-identify in the last ~50 lines.
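
The first triage question can be partly automated. The sketch below is a minimal illustration, not part of the repo: it assumes the output of docker compose ps --format json, with the Service, State and Health fields that recent Compose v2 releases emit, and it treats livekit as the only service without a healthcheck. The function name suspect_services is hypothetical.

```python
import json


def suspect_services(ps_output: str, no_healthcheck=("livekit",)) -> list[str]:
    """Return services that look suspect in `docker compose ps --format json` output."""
    text = ps_output.strip()
    if not text:
        return []
    # Compose versions differ: some print one JSON array, others one object per line.
    if text.startswith("["):
        rows = json.loads(text)
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]

    suspects = []
    for row in rows:
        service = row.get("Service", row.get("Name", "?"))
        state = row.get("State", "")
        health = row.get("Health", "")
        if service in no_healthcheck:
            # No healthcheck defined (assumed for LiveKit): "running" is the best signal.
            ok = state == "running"
        else:
            ok = state == "running" and health == "healthy"
        if not ok:
            suspects.append(service)
    return suspects
```

Feed it the captured command output, for example suspect_services(subprocess.check_output(["docker", "compose", "ps", "--format", "json"], text=True)), and start reading logs for whatever it returns.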

Triage flowchart

Failure modes

Symptom	Diagnosis	Fix
`docker compose ps` shows a service as `unhealthy` or `exited`	The healthcheck is failing or the container crashed at boot.	Read the logs: `docker compose logs <service>`. For Postgres: usually a port conflict on `:5434` (sister projects use `:5432` and `:5433`) or volume corruption. Restart: `docker compose restart <service>`. If persistent: `docker compose down -v && docker compose up -d` (warning: drops all data).
WhatsApp webhook returns 403 on signature verify	`WHATSAPP_APP_SECRET` is unset or stale. Backend logs show `webhook.whatsapp signature_valid=False`.	Confirm the value matches Meta for Developers → App → Settings → Basic → App Secret. Set in `.env`; restart: `./start-server.sh`. Re-trigger the inbound (Meta retries 5xx but not 4xx — you may need to send a fresh test message).
LiveKit SIP not registering / voice booking attempts fail	LiveKit logs show ICE failures or "no STUN candidate".	Check `docker compose ps` — LiveKit should be `running` with `network_mode: host`. Verify ports: signal `:7890`, TCP RTC `:7891`, UDP RTC `:52000-52050` are not blocked by macOS firewall. Restart: `docker compose restart livekit`. The `docker/livekit.yaml` has `node_ip: 127.0.0.1` and `stun_servers: []` — both required for loopback ICE. ICE warnings in the log are expected on macOS loopback; only act if SIP register actually fails.
STK push fires but no callback received	Daraja sandbox callback URL doesn't reach your machine. `tenant_<slug>.payments` row stuck in `state='initiated'`.	If on a public IP / VPS: verify Daraja sandbox config has the right callback URL. If on a laptop: use `ngrok http 8010` and set the ngrok URL as the Daraja callback URL on the tenant. Check `public.payment_routing` for a correlation row keyed by `CheckoutRequestID`; check `public.payment_callbacks_unrouted` for late callbacks that missed the FSM window (per ADR-0007 D9).
FSM stuck in `COLLECT_SLOT`, agent re-prompts repeatedly	The booking graph can't resolve any slot for the tenant. `tenant_<slug>.staff_schedules` is empty or `tenant_<slug>.staff_blocks` covers the requested window.	Inspect: `docker compose exec postgres psql -U ratiba ratiba -c "SELECT * FROM tenant_<slug>.staff_schedules;"`. If empty, onboard staff via `/admin/catalog` or insert directly: `INSERT INTO tenant_<slug>.staff_schedules (staff_id, day_of_week, start_time, end_time) VALUES ((SELECT id FROM tenant_<slug>.staff LIMIT 1), 1, '09:00', '18:00');`. Repeat for each weekday.
FSM stuck in `GREET` or `COLLECT_SERVICE`; no state progression after intent classify	Classifier returned an intent the FSM's routing table doesn't recognise — usually a new vertical with an unmapped category. Backend log shows `intent_router no_route intent=<x>`.	Check `app/agents/intent_router.py` — add the new intent mapping. Restart: `./start-server.sh`. This symptom is common during new-vertical onboarding before the catalog is seeded.
Payment stuck in `PAYMENT_PENDING`; no callback after 60s	Either the one-shot `stkpushquery` reconciliation poll didn't fire (worker not running) or the customer dismissed the STK pop-up.	Verify the worker is running: `ps aux | grep arq`. Check the FSM state: `docker compose exec redis redis-cli -a ratiba_redis_password GET ratiba:fsm:thread:<ULID>`. To age out the stuck payment, invoke the daily reaper manually: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_payment_reaper`. The reaper drains stuck rows into `public.payment_callbacks_unrouted`.
Daraja STK push returns `errorCode: 500.002.1001` — "request cancelled"	Customer dismissed the STK pop-up (common in sandbox testing if you don't see the pop-up in time). Not a system bug.	Re-send the booking confirmation message manually or start a new booking thread. If this recurs in production, per ADR-0007 the FSM transitions to `PAYMENT_CANCELLED_BY_CUSTOMER` automatically after the 8-minute nudge window.
Voice TTS latency spikes above 2.5s	ElevenLabs quota hit or cold-start. The M7 voice channel has a 2.5s end-to-end latency target per the voice close-out memo.	Check `elevenlabs.synthesize duration_ms` in the log. If `>2500` on every turn, the ElevenLabs API is slow (regional latency or quota). Retry. If `duration_ms` is fine but playback is slow, it's LiveKit room media lag — restart the room.
Backend OOM / hangs; uvicorn worker `>1GB`	LangGraph Postgres checkpoints accumulating per ADR-0003 retention rules. Daily reaper hasn't run.	Inspect: `docker compose exec postgres psql -U ratiba ratiba -c "SELECT count(*) FROM tenant_<slug>.checkpoints;"`. If the count is in the millions, the daily reaper hasn't run for days. Manual run: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_archival_reaper`. Restart the backend: `./start-server.sh`. The reaper should run at 3 AM EAT — if it hasn't, check the cron container and the worker logs for archival errors.
`knowledge_overflow` WARN in backend logs	Per-tenant `knowledge_snippets` table exceeds the ~20-snippet / ~1500-token cap (Phase-0 limit, ADR-0013 D11). Answer quality may degrade due to prompt truncation.	Deactivate low-priority snippets. Postgres `UPDATE` does not accept `ORDER BY` or `LIMIT`, so select the rows in a subquery (this assumes an `id` primary key on the table): `UPDATE tenant_<slug>.knowledge_snippets SET is_active = false WHERE id IN (SELECT id FROM tenant_<slug>.knowledge_snippets WHERE category = 'general' AND is_active ORDER BY created_at LIMIT 5);`. This is the Phase-0 → Phase-1 graduation signal: when it fires regularly, it's time to implement real pgvector retrieval.
Tests fail with `tenant_<slug>_<runid>` already exists	Test pollution from a previous run that didn't clean up, usually a watchdog stall or a forced `Ctrl-C` during the per-scenario fresh-tenant fixture (per ADR-0004).	Drop all leftover test schemas: `docker compose exec -T postgres psql -At -U ratiba ratiba -c "SELECT 'DROP SCHEMA \"' || schema_name || '\" CASCADE;' FROM information_schema.schemata WHERE schema_name LIKE 'test_tenant_%';" | docker compose exec -T postgres psql -U ratiba ratiba`. The `-At` flags make the first `psql` print only the generated statements, without the column header and row-count footer that the second `psql` would otherwise reject. Adjust the `LIKE` pattern if the leftover schemas use a different prefix. Then re-run pytest.
`alembic upgrade head` fails with `Multiple head revisions are present for given argument 'head'`	Two migration heads exist (branching in the revision graph). This can happen after a rebase or cherry-pick.	Run `cd backend && /Users/soft4u/Development/ratiba/backend/.venv/bin/python -m alembic heads` to see the two tip revisions. If both are yours: `alembic merge heads -m "merge"` then re-run upgrade. If one is from an unmerged branch, check out the right branch first.
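
A note on the test-pollution row: in a SQL LIKE pattern, `_` matches any single character, so a prefix pattern can match more schemas than intended. The sketch below shows one way to build the DROP statements with an exact prefix check and proper identifier quoting. It is an illustration under assumptions: the function name drop_statements is hypothetical, the default prefix is copied from the row above, and the input is a list of schema names you have already fetched from information_schema.schemata.

```python
def drop_statements(schema_names, prefix="test_tenant_"):
    """Build DROP SCHEMA statements for schemas whose name starts with `prefix`."""
    statements = []
    for name in schema_names:
        # Exact prefix match: unlike LIKE, "_" here is a literal underscore.
        if not name.startswith(prefix):
            continue
        # Quote the identifier; embedded double quotes are doubled per SQL rules.
        quoted = '"' + name.replace('"', '""') + '"'
        statements.append(f"DROP SCHEMA {quoted} CASCADE;")
    return statements
```

Review the returned statements before running them; CASCADE removes every object in the schema.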

A few extras worth knowing

These don't get their own table row because they're shape-issues, not symptoms-with-fixes. But they recur enough to mention.

alembic upgrade head from the wrong CWD: Alembic reads alembic.ini from the CWD by default and resolves a relative script_location from there. Always run from backend/, never from the repo root. Symptom: FAILED: No 'script_location' key found.
pytest from the wrong CWD — symmetric to the above. Always from backend/. The PYTHONPATH resolution in conftest.py assumes that CWD; running from repo root produces import errors that look like missing modules but are really path issues.
Pyright LSP showing zero errors but CLI showing problems — the CLI is canonical (M10 lesson). LSP caches stale type info across edits, especially after a Pydantic model regen. If LSP and CLI disagree, trust CLI. Run cd backend && /Users/soft4u/Development/ratiba/backend/.venv/bin/pyright explicitly.
Frontend showing stale Tailwind classes after a tailwind.config.ts edit — kill start-client.sh, delete frontend/.next/, re-run. Tailwind v4's incremental cache is sometimes pinned to old class names.
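How to read structured log fields and filter by thread_id: see Observability. The short pattern is docker compose logs --no-log-prefix backend | jq -R 'fromjson? | select(.thread_id == "<ULID>")'. The --no-log-prefix flag drops the container-name prefix that would otherwise make each line invalid JSON, and fromjson? skips lines that are not JSON (tracebacks, startup banners).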
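
When jq is not installed, the same thread_id filter can be sketched in Python. This is a minimal illustration that assumes one JSON object per log line, optionally preceded by a Compose prefix such as "backend-1  | ". The function name and the event values in the usage note are hypothetical.

```python
import json


def lines_for_thread(log_lines, thread_id):
    """Yield parsed log records whose thread_id field equals `thread_id`."""
    for line in log_lines:
        # Skip past any Compose prefix by starting at the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured line (traceback, banner, partial write)
        if isinstance(record, dict) and record.get("thread_id") == thread_id:
            yield record
```

For example, passing the lines of a saved log file and a thread ULID yields only the records for that conversation, which can then be printed in timestamp order.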

Escalation path

If the failure isn't in the table and triage doesn't narrow it:

Grep the close-out memos — ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m*_landed.md. Each milestone close-out logs surprises encountered during that milestone. The same surprise might recur.
Tail every log at once — docker compose logs -f & tail -f backend/.uvicorn.log frontend/.next-dev.log. Cross-reference timestamps. See Observability for the full log-reading guide.
Capture state for Adrian — before pinging, gather:
- Output of docker compose ps.
- Last 200 lines of the relevant log.
- The exact command that triggered the failure.
- Output of git log --oneline -10.
Pilot-window finding goes to the M12 close-out memo — if you hit this during the W5 pilot pass, capture in docs/M12-pilot-deviations.md per the Pilot deployment runbook deviation rubric.
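
The capture-state checklist above can be bundled into one paste-able report. The sketch below is an assumption-laden illustration rather than an existing script: build_report, run_command and DEFAULT_COMMANDS are hypothetical names, and "backend" stands in for whichever service is relevant.

```python
import subprocess

# Commands mirror the checklist; "backend" is a placeholder service name.
DEFAULT_COMMANDS = {
    "docker compose ps": ["docker", "compose", "ps"],
    "last 200 log lines": ["docker", "compose", "logs", "--tail", "200", "backend"],
    "git log --oneline -10": ["git", "log", "--oneline", "-10"],
}


def run_command(argv):
    """Run one command and return its combined output; never raise."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=60)
        return (done.stdout + done.stderr).strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed to run: {exc}>"


def build_report(trigger, commands=DEFAULT_COMMANDS, runner=run_command):
    """Assemble a plain-text bundle: the triggering command, then each capture."""
    parts = [f"=== command that triggered the failure ===\n{trigger}"]
    for title, argv in commands.items():
        parts.append(f"=== {title} ===\n{runner(argv)}")
    return "\n\n".join(parts)
```

Calling print(build_report("<the exact command you ran>")) from the repo root produces a single block of text to attach when escalating. The runner parameter exists so the assembly logic can be exercised without Docker.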

Adding a new failure mode

When you hit a failure that costs more than ~10 minutes to diagnose, add a row to the table above in a follow-up PR. Row shape:

| <symptom — observable behaviour, not internal cause> | <diagnosis — what to check first> | <fix — copy-pastable command sequence> |

Bump last_verified: YYYY-MM-DD to today on the same PR. The frontmatter validator gates this on every build per ADR-0011 D6.
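
The date bump is easy to forget. A minimal sketch of doing it programmatically follows, assuming the frontmatter carries a single line that starts with last_verified: as described above. The function name bump_last_verified is hypothetical, and this is not the validator referenced in ADR-0011.

```python
import re
from datetime import date


def bump_last_verified(text, today=None):
    """Return `text` with its first last_verified line set to `today` (ISO format)."""
    stamp = (today or date.today()).isoformat()
    new_text, count = re.subn(
        r"(?m)^last_verified:.*$",  # the whole line, wherever it sits in the frontmatter
        f"last_verified: {stamp}",
        text,
        count=1,
    )
    if count == 0:
        raise ValueError("no last_verified field found")
    return new_text
```

Raising when the field is missing is deliberate: a silent no-op would let a stale date through to the build.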

What next

Local dev runbook — the daily-driver boot sequence.
Observability — how to read structured logs, tail per-service output, and interpret the daily WhatsApp digest.
Pilot deployment runbook — the script for the W5 pilot validation pass.
