Incidents runbook
Failure-mode-to-fix table. Living document — every new failure mode that costs more than ~10 minutes to diagnose should land here as a follow-up PR row. The table is ordered roughly by frequency: containers and env vars at the top, payment + FSM in the middle, test pollution at the bottom.
If a failure isn't on the table yet, work through the Triage flowchart below to narrow it to a known service, then match to the table. If neither works, the Escalation path at the end says who to grab and what to bring.
Triage
Four questions, in order. Most incidents resolve at Q3.
- Which service is failing? Run docker compose ps. Anything not healthy (or not running for LiveKit) is suspect. See the flowchart below.
- When did it start? A failure that surfaced after a git pull is almost always a missing migration (alembic upgrade head) or a new env var. A failure during a working session is usually data-related (a bad row, a stale Redis key, an FSM thread stuck).
- What changed? Run git log --oneline -20. If a migration or service definition changed, that's the lead. To see the changes themselves: git diff HEAD~5 -- backend/app docker-compose.yml.
- What's in the logs? See Observability for how to read structured logs and per-service tail commands. The short answer: docker compose logs --since 5m <service>. Most failures self-identify in the last ~50 lines.
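The first triage question can be scripted. The sketch below assumes the output of `docker compose ps --format json` (one JSON object per line in newer Compose releases, a single JSON array in older ones) with `Service`, `State` and `Health` fields. The function names and the default `no_healthcheck=("livekit",)` are illustrative assumptions, not code from this repo.

```python
import json


def parse_compose_ps(raw: str) -> list[dict]:
    """Parse `docker compose ps --format json` output.

    Accepts both shapes: one JSON object per line, or one JSON array.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def suspect_services(records: list[dict], no_healthcheck=("livekit",)) -> list[str]:
    """Return services that are not healthy.

    Services named in `no_healthcheck` (an assumed default) only need to
    be running, mirroring the "not running for LiveKit" rule above.
    """
    suspects = []
    for rec in records:
        name = rec.get("Service", "")
        running = rec.get("State") == "running"
        if name in no_healthcheck:
            ok = running
        else:
            ok = running and rec.get("Health") == "healthy"
        if not ok:
            suspects.append(name)
    return suspects
```

Feed it the captured stdout of the command, for example via `subprocess.run(["docker", "compose", "ps", "--format", "json"], capture_output=True, text=True).stdout`. A service with no healthcheck defined reports an empty `Health`, so it is flagged unless it is listed in `no_healthcheck`.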
Triage flowchart
Failure modes
| Symptom | Diagnosis | Fix |
|---|---|---|
| docker compose ps shows a service as unhealthy or exited | The healthcheck is failing or the container crashed at boot. | Read the logs: docker compose logs <service>. For Postgres: usually a port conflict on :5434 (sister projects use :5432 and :5433) or volume corruption. Restart: docker compose restart <service>. If persistent: docker compose down -v && docker compose up -d (warning: drops all data). |
| WhatsApp webhook returns 403 on signature verify | WHATSAPP_APP_SECRET is unset or stale. Backend logs show webhook.whatsapp signature_valid=False. | Confirm the value matches Meta for Developers → App → Settings → Basic → App Secret. Set in .env; restart: ./start-server.sh. Re-trigger the inbound (Meta retries 5xx but not 4xx — you may need to send a fresh test message). |
| LiveKit SIP not registering / voice booking attempts fail | LiveKit logs show ICE failures or "no STUN candidate". | Check docker compose ps — LiveKit should be running with network_mode: host. Verify ports: signal :7890, TCP RTC :7891, UDP RTC :52000-52050 are not blocked by macOS firewall. Restart: docker compose restart livekit. The docker/livekit.yaml has node_ip: 127.0.0.1 and stun_servers: [] — both required for loopback ICE. ICE warnings in the log are expected on macOS loopback; only act if SIP register actually fails. |
| STK push fires but no callback received | Daraja sandbox callback URL doesn't reach your machine. tenant_<slug>.payments row stuck in state='initiated'. | If on a public IP / VPS: verify Daraja sandbox config has the right callback URL. If on a laptop: use ngrok http 8010 and set the ngrok URL as the Daraja callback URL on the tenant. Check public.payment_routing for a correlation row keyed by CheckoutRequestID; check public.payment_callbacks_unrouted for late callbacks that missed the FSM window (per ADR-0007 D9). |
| FSM stuck in COLLECT_SLOT, agent re-prompts repeatedly | The booking graph can't resolve any slot for the tenant. tenant_<slug>.staff_schedules is empty or tenant_<slug>.staff_blocks covers the requested window. | Inspect: docker compose exec postgres psql -U ratiba ratiba -c "SELECT * FROM tenant_<slug>.staff_schedules;". If empty, onboard staff via /admin/catalog or insert directly: INSERT INTO tenant_<slug>.staff_schedules (staff_id, day_of_week, start_time, end_time) VALUES ((SELECT id FROM tenant_<slug>.staff LIMIT 1), 1, '09:00', '18:00');. Repeat for each weekday. |
| FSM stuck in GREET or COLLECT_SERVICE; no state progression after intent classify | Classifier returned an intent the FSM's routing table doesn't recognise, usually a new vertical with an unmapped category. Backend log shows intent_router no_route intent=<x>. | Check app/agents/intent_router.py and add the new intent mapping. Restart: ./start-server.sh. This symptom is common during new-vertical onboarding before the catalog is seeded. |
| Payment stuck in PAYMENT_PENDING; no callback after 60s | Either the one-shot stkpushquery reconciliation poll didn't fire (worker not running) or the customer dismissed the STK pop-up. | Verify the worker is running: ps aux \| grep arq. Check the FSM state: docker compose exec redis redis-cli -a ratiba_redis_password GET ratiba:fsm:thread:<ULID>. Manually invoke the daily reaper to age out: docker compose exec backend /app/.venv/bin/python -m app.scripts.run_payment_reaper. The reaper drains stuck rows into public.payment_callbacks_unrouted. |
| Daraja STK push returns errorCode: 500.002.1001 ("request cancelled") | Customer dismissed the STK pop-up (common in sandbox testing if you don't see the pop-up in time). Not a system bug. | Re-send the booking confirmation message manually or start a new booking thread. If this recurs in production, per ADR-0007 the FSM transitions to PAYMENT_CANCELLED_BY_CUSTOMER automatically after the 8-minute nudge window. |
| Voice TTS latency spikes above 2.5s | ElevenLabs quota hit or cold-start. The M7 voice channel has a 2.5s end-to-end latency target per the voice close-out memo. | Check elevenlabs.synthesize duration_ms in the log. If >2500 on every turn, the ElevenLabs API is slow (regional latency or quota). Retry. If duration_ms is fine but playback is slow, it's LiveKit room media lag — restart the room. |
| Backend OOM / hangs; uvicorn worker >1GB | LangGraph Postgres checkpoints accumulating per ADR-0003 retention rules. Daily reaper hasn't run. | Inspect: docker compose exec postgres psql -U ratiba ratiba -c "SELECT count(*) FROM tenant_<slug>.checkpoints;". If the count is in the millions, the daily reaper hasn't run for days. Manual run: docker compose exec backend /app/.venv/bin/python -m app.scripts.run_archival_reaper. Restart the backend: ./start-server.sh. The reaper should run at 3 AM EAT; if it hasn't, check the cron container and the worker logs for archival errors. |
| knowledge_overflow WARN in backend logs | Per-tenant knowledge_snippets table exceeds the ~20-snippet / ~1500-token cap (Phase-0 limit, ADR-0013 D11). Answer quality may degrade due to prompt truncation. | Deactivate low-priority snippets. PostgreSQL's UPDATE does not accept ORDER BY or LIMIT, so pick the rows in a subquery (this assumes the table has an id primary key): UPDATE tenant_<slug>.knowledge_snippets SET is_active = false WHERE id IN (SELECT id FROM tenant_<slug>.knowledge_snippets WHERE category = 'general' AND is_active ORDER BY created_at LIMIT 5);. This is the Phase-0 to Phase-1 graduation signal: when this fires regularly, it's time to implement real pgvector retrieval. |
| Tests fail with tenant_<slug>_<runid> already exists | Test pollution from a previous run that didn't clean up, usually a watchdog stall or a forced Ctrl-C during the per-scenario fresh-tenant fixture (per ADR-0004). | Drop all leftover test schemas: docker compose exec -T postgres psql -U ratiba ratiba -At -c "SELECT 'DROP SCHEMA \"' \|\| schema_name \|\| '\" CASCADE;' FROM information_schema.schemata WHERE schema_name LIKE 'test_tenant_%';" \| docker compose exec -T postgres psql -U ratiba ratiba. The -At flags make the first psql print bare statements; without them the column header and row-count footer are piped into the second psql as SQL and the first DROP fails with a syntax error. Then re-run pytest. |
| alembic upgrade head fails with Multiple head revisions are present for given argument 'head' | Two migration heads exist (branching in the revision graph); this can happen after a rebase or cherry-pick. Target database is not up to date is a different error: it comes from alembic revision --autogenerate when the database is behind head. | Run cd backend && /Users/soft4u/Development/ratiba/backend/.venv/bin/python -m alembic heads to see the two tip revisions. If both are yours: alembic merge heads -m "merge" then re-run upgrade. If one is from an unmerged branch, check out the right branch first. |
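For the WhatsApp 403 row, it can help to check the secret offline against a captured request. Meta signs each webhook POST with HMAC-SHA256 over the raw request body, keyed by the App Secret, and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below recomputes that value. The function names are hypothetical, and how the backend's own verifier is written is not shown in this runbook.

```python
import hashlib
import hmac


def expected_signature(app_secret: str, raw_body: bytes) -> str:
    """Compute the header value Meta would send for this body and secret."""
    digest = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def signature_matches(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Constant-time comparison against the received X-Hub-Signature-256 header."""
    return hmac.compare_digest(expected_signature(app_secret, raw_body), header_value)
```

If `signature_matches` returns `False` for the value in `.env` but `True` for the App Secret currently shown in Meta for Developers, the local secret is stale. The body must be the exact bytes received; re-serialised JSON will not match.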
A few extras worth knowing
These don't get their own table row because they're shape-issues, not symptoms-with-fixes. But they recur enough to mention.
- alembic upgrade head from the wrong CWD: the Alembic config resolves script_location relative to the CWD. Always run from backend/, never from the repo root. Symptom: FAILED: No 'script_location' key found.
- pytest from the wrong CWD: symmetric to the above. Always run from backend/. The PYTHONPATH resolution in conftest.py assumes that CWD; running from the repo root produces import errors that look like missing modules but are really path issues.
- Pyright LSP showing zero errors but CLI showing problems: the CLI is canonical (M10 lesson). The LSP caches stale type info across edits, especially after a Pydantic model regen. If LSP and CLI disagree, trust the CLI. Run cd backend && /Users/soft4u/Development/ratiba/backend/.venv/bin/pyright explicitly.
- Frontend showing stale Tailwind classes after a tailwind.config.ts edit: kill start-client.sh, delete frontend/.next/, re-run. Tailwind v4's incremental cache is sometimes pinned to old class names.
- How to read structured log fields and filter by thread_id: see Observability. The short pattern is docker compose logs --no-log-prefix backend | jq 'select(.thread_id == "<ULID>")'. The --no-log-prefix flag drops the service-name prefix that Compose puts on every line; with the prefix in place jq cannot parse the lines as JSON.
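When the log stream mixes structured lines with tracebacks or startup banners, a small Python filter is more forgiving than jq. This is a minimal sketch: `filter_by_thread` is a hypothetical helper, and it assumes only that structured lines are JSON objects carrying a `thread_id` field, as described above.

```python
import json
from typing import Iterable, Iterator


def filter_by_thread(lines: Iterable[str], thread_id: str) -> Iterator[dict]:
    """Yield parsed log records whose thread_id matches.

    Tolerates a Compose prefix such as "backend-1  | " and skips any
    line that is not a JSON object.
    """
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON object on this line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # tracebacks, banners, partial lines
        if isinstance(record, dict) and record.get("thread_id") == thread_id:
            yield record
```

Typical use is to pipe the logs into a short script that calls `filter_by_thread(sys.stdin, "<ULID>")` and prints each record.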
Escalation path
If the failure isn't in the table and triage doesn't narrow it:
- Grep the close-out memos: ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m*_landed.md. Each milestone close-out logs surprises encountered during that milestone. The same surprise might recur.
- Tail every log at once: docker compose logs -f & tail -f backend/.uvicorn.log frontend/.next-dev.log. Cross-reference timestamps. See Observability for the full log-reading guide.
- Capture state for Adrian. Before pinging, gather:
  - Output of docker compose ps.
  - Last 200 lines of the relevant log.
  - The exact command that triggered the failure.
  - Output of git log --oneline -10.
  - Output of
- Pilot-window finding goes to the M12 close-out memo: if you hit this during the W5 pilot pass, capture it in docs/M12-pilot-deviations.md per the Pilot deployment runbook deviation rubric.
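The gather list above can be scripted so the bundle looks the same every time. This is a minimal sketch, assuming it is run from the repo root. `capture_state` and `run` are hypothetical helpers, not existing scripts, and `docker compose logs --tail 200 <service>` is one way to get the last 200 lines of a containerised service.

```python
import subprocess
from typing import Callable


def run(cmd: list[str]) -> str:
    """Run a command and return stdout plus stderr; a non-zero exit does not raise."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def capture_state(service: str, failing_command: str,
                  runner: Callable[[list[str]], str] = run) -> str:
    """Assemble the escalation bundle as one text report."""
    sections = [
        ("docker compose ps", runner(["docker", "compose", "ps"])),
        (f"last 200 log lines: {service}",
         runner(["docker", "compose", "logs", "--tail", "200", service])),
        ("command that triggered the failure", failing_command),
        ("git log --oneline -10", runner(["git", "log", "--oneline", "-10"])),
    ]
    return "\n\n".join(f"## {title}\n{body.rstrip()}" for title, body in sections)
```

The `runner` parameter exists so the function can be exercised without Docker. In normal use, call `print(capture_state("backend", "<the exact command>"))` and paste the result into the message.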
Adding a new failure mode
When you hit a failure that costs more than ~10 minutes to diagnose, add a row to the table above in a follow-up PR. Row shape:
| <symptom — observable behaviour, not internal cause> | <diagnosis — what to check first> | <fix — copy-pastable command sequence> |
Bump last_verified: YYYY-MM-DD to today on the same PR. The frontmatter validator gates this on every build per ADR-0011 D6.
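Fix cells often contain shell pipes, and an unescaped `|` splits a Markdown table row into extra columns. The sketch below builds a row in the shape given above and escapes pipes along the way. `format_row` is a hypothetical helper, and it assumes the inputs are raw text that has not been escaped already.

```python
def format_row(symptom: str, diagnosis: str, fix: str) -> str:
    """Return one table row with pipes escaped and newlines collapsed."""

    def cell(text: str) -> str:
        # Collapse all whitespace runs (including newlines) to single spaces,
        # then escape pipes so the cell cannot split the row.
        return " ".join(text.split()).replace("|", "\\|")

    return "| " + " | ".join(cell(c) for c in (symptom, diagnosis, fix)) + " |"
```

For example, `format_row("Worker not running", "Check the process list.", "ps aux | grep arq")` yields a three-column row in which the pipe before `grep` is written as `\|`.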
What next
- Local dev runbook — the daily-driver boot sequence.
- Observability — how to read structured logs, tail per-service output, and interpret the daily WhatsApp digest.
- Pilot deployment runbook — the script for the W5 pilot validation pass.