
Incidents runbook

Failure-mode-to-fix table. Living document — every new failure mode that costs more than ~10 minutes to diagnose should land here as a follow-up PR row. The table is ordered roughly by frequency: containers and env vars at the top, payment + FSM in the middle, test pollution at the bottom.

If a failure isn't on the table yet, work through the Triage flowchart below to narrow it to a known service, then match to the table. If neither works, the Escalation path at the end says who to grab and what to bring.


Triage

Four questions, in order. Most incidents resolve at Q3.

  1. Which service is failing? Run docker compose ps. Any service that is not healthy (or, for LiveKit, not running) is suspect. See the flowchart below.
  2. When did it start? A failure that surfaced after a git pull is almost always a missing migration (alembic upgrade head) or a new env var. A failure during a working session is usually data-related (a bad row, a stale Redis key, an FSM thread stuck).
  3. What changed? Run git log --oneline -20. If a migration or service definition changed, that's the lead; inspect the change with git diff HEAD~5 -- backend/app docker-compose.yml.
  4. What's in the logs? See Observability for how to read structured logs and per-service tail commands. The short answer: docker compose logs --since 5m <service>. Most failures self-identify in the last ~50 lines.
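
The first question can be scripted. The following is a minimal sketch, not part of the repo. It assumes that `docker compose ps --format json` emits objects with `Service`, `State`, and `Health` keys (newer Compose releases print one object per line, older ones a single array) and that the LiveKit service is named `livekit`, as in the restart command in the table. The function name `suspect_services` is hypothetical.

```python
import json


def suspect_services(ps_json: str, no_healthcheck=("livekit",)) -> list[str]:
    """Return services that fail the Q1 rule: not healthy, or not running
    for services listed in no_healthcheck (assumed: LiveKit)."""
    text = ps_json.strip()
    if not text:
        return []
    # Older Compose prints one JSON array; newer prints one object per line.
    if text.startswith("["):
        rows = json.loads(text)
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]

    suspects = []
    for row in rows:
        name = row.get("Service", "")
        running = row.get("State", "") == "running"
        if name in no_healthcheck:
            ok = running
        else:
            ok = running and row.get("Health", "") == "healthy"
        if not ok:
            suspects.append(name)
    return suspects
```

Feed it the output of `docker compose ps -a --format json`; the `-a` flag includes exited containers, which a plain `ps` omits.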

Triage flowchart


Failure modes

| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| `docker compose ps` shows a service as unhealthy or exited | The healthcheck is failing or the container crashed at boot. | Read the logs: `docker compose logs <service>`. For Postgres: usually a port conflict on :5434 (sister projects use :5432 and :5433) or volume corruption. Restart: `docker compose restart <service>`. If persistent: `docker compose down -v && docker compose up -d` (warning: drops all data). |
| WhatsApp webhook returns 403 on signature verify | `WHATSAPP_APP_SECRET` is unset or stale. Backend logs show `webhook.whatsapp signature_valid=False`. | Confirm the value matches Meta for Developers → App → Settings → Basic → App Secret. Set it in `.env`; restart: `./start-server.sh`. Re-trigger the inbound (Meta retries 5xx but not 4xx, so you may need to send a fresh test message). |
| LiveKit SIP not registering / voice booking attempts fail | LiveKit logs show ICE failures or "no STUN candidate". | Check `docker compose ps`: LiveKit should be running with `network_mode: host`. Verify that signal :7890, TCP RTC :7891, and UDP RTC :52000-52050 are not blocked by the macOS firewall. Restart: `docker compose restart livekit`. `docker/livekit.yaml` sets `node_ip: 127.0.0.1` and `stun_servers: []`; both are required for loopback ICE. ICE warnings in the log are expected on macOS loopback; only act if SIP register actually fails. |
| STK push fires but no callback received | Daraja sandbox callback URL doesn't reach your machine. `tenant_<slug>.payments` row stuck in `state='initiated'`. | If on a public IP / VPS: verify the Daraja sandbox config has the right callback URL. If on a laptop: run `ngrok http 8010` and set the ngrok URL as the Daraja callback URL on the tenant. Check `public.payment_routing` for a correlation row keyed by `CheckoutRequestID`; check `public.payment_callbacks_unrouted` for late callbacks that missed the FSM window (per ADR-0007 D9). |
| FSM stuck in COLLECT_SLOT, agent re-prompts repeatedly | The booking graph can't resolve any slot for the tenant. `tenant_<slug>.staff_schedules` is empty or `tenant_<slug>.staff_blocks` covers the requested window. | Inspect: `docker compose exec postgres psql -U ratiba ratiba -c "SELECT * FROM tenant_<slug>.staff_schedules;"`. If empty, onboard staff via /admin/catalog or insert directly: `INSERT INTO tenant_<slug>.staff_schedules (staff_id, day_of_week, start_time, end_time) VALUES ((SELECT id FROM tenant_<slug>.staff LIMIT 1), 1, '09:00', '18:00');`. Repeat for each weekday. |
| FSM stuck in GREET or COLLECT_SERVICE; no state progression after intent classify | Classifier returned an intent the FSM's routing table doesn't recognise, usually a new vertical with an unmapped category. Backend log shows `intent_router no_route intent=<x>`. | Add the new intent mapping in `app/agents/intent_router.py`. Restart: `./start-server.sh`. This symptom is common during new-vertical onboarding before the catalog is seeded. |
| Payment stuck in PAYMENT_PENDING; no callback after 60s | Either the one-shot stkpushquery reconciliation poll didn't fire (worker not running) or the customer dismissed the STK pop-up. | Verify the worker is running: `ps aux \| grep arq`. Check the FSM state: `docker compose exec redis redis-cli -a ratiba_redis_password GET ratiba:fsm:thread:<ULID>`. Manually invoke the daily reaper to age out the stuck row: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_payment_reaper`. The reaper drains stuck rows into `public.payment_callbacks_unrouted`. |
| Daraja STK push returns `errorCode: 500.002.1001` ("request cancelled") | Customer dismissed the STK pop-up (common in sandbox testing if you don't see the pop-up in time). Not a system bug. | Re-send the booking confirmation message manually or start a new booking thread. If this recurs in production, per ADR-0007 the FSM transitions to PAYMENT_CANCELLED_BY_CUSTOMER automatically after the 8-minute nudge window. |
| Voice TTS latency spikes above 2.5s | ElevenLabs quota hit or cold-start. The M7 voice channel has a 2.5s end-to-end latency target per the voice close-out memo. | Check `elevenlabs.synthesize duration_ms` in the log. If it is >2500 on every turn, the ElevenLabs API is slow (regional latency or quota); retry. If `duration_ms` is fine but playback is slow, it's LiveKit room media lag; restart the room. |
| Backend OOM / hangs; uvicorn worker >1GB | LangGraph Postgres checkpoints accumulating per ADR-0003 retention rules. Daily reaper hasn't run. | Inspect: `docker compose exec postgres psql -U ratiba ratiba -c "SELECT count(*) FROM tenant_<slug>.checkpoints;"`. If the count is in the millions, the daily reaper hasn't run for days. Manual run: `docker compose exec backend /app/.venv/bin/python -m app.scripts.run_archival_reaper`. Restart the backend: `./start-server.sh`. The reaper should run at 3 AM EAT; if it hasn't, check the cron container and the worker logs for archival errors. |
| `knowledge_overflow` WARN in backend logs | Per-tenant `knowledge_snippets` table exceeds the ~20-snippet / ~1500-token cap (Phase-0 limit, ADR-0013 D11). Answer quality may degrade due to prompt truncation. | Deactivate the oldest low-priority snippets. PostgreSQL does not accept ORDER BY or LIMIT directly on UPDATE, so select the rows in a subquery (this assumes an `id` primary key): `UPDATE tenant_<slug>.knowledge_snippets SET is_active = false WHERE id IN (SELECT id FROM tenant_<slug>.knowledge_snippets WHERE category = 'general' AND is_active ORDER BY created_at LIMIT 5);`. This is the Phase-0 → Phase-1 graduation signal: when this fires regularly, it's time to implement real pgvector retrieval. |
| Tests fail with `tenant_<slug>_<runid> already exists` | Test pollution from a previous run that didn't clean up, usually a watchdog stall or a forced Ctrl-C during the per-scenario fresh-tenant fixture (per ADR-0004). | Drop all leftover test schemas. The `-At` flags strip psql's header and row-count footer so that only the generated statements are piped back in: `docker compose exec -T postgres psql -U ratiba ratiba -At -c "SELECT format('DROP SCHEMA %I CASCADE;', schema_name) FROM information_schema.schemata WHERE schema_name LIKE 'test_tenant_%';" \| docker compose exec -T postgres psql -U ratiba ratiba`. Then re-run pytest. |
| `alembic upgrade head` fails with `Target database is not up to date` or `Multiple head revisions are present` | Two migration heads exist (branching in the revision graph). This can happen after a rebase or cherry-pick. | Run `cd backend && .venv/bin/python -m alembic heads` to see the two tip revisions. If both are yours: `alembic merge heads -m "merge"`, then re-run the upgrade. If one is from an unmerged branch, check out the right branch first. |
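
For the WhatsApp 403 symptom in the table above, the secret can be checked offline before restarting anything. Meta signs each webhook payload with HMAC-SHA256 keyed by the app secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex>`. The sketch below recomputes that value from a captured request. The function name is hypothetical, and this is not necessarily how the backend's own verifier is written.

```python
import hashlib
import hmac


def signature_matches(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Recompute the X-Hub-Signature-256 value and compare it in constant time.

    raw_body must be the exact bytes received, before any JSON parsing.
    """
    digest = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

If this returns False for the secret in `.env` but True for the one shown in the Meta dashboard, the local value is stale.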

A few extras worth knowing

These don't get their own table row because they're shape-issues, not symptoms-with-fixes. But they recur enough to mention.

  • alembic upgrade head from the wrong CWD — the Alembic config resolves script_location relative to the CWD. Always run from backend/, never from the repo root. Symptom: FAILED: No 'script_location' key found.
  • pytest from the wrong CWD — symmetric to the above. Always from backend/. The PYTHONPATH resolution in conftest.py assumes that CWD; running from repo root produces import errors that look like missing modules but are really path issues.
  • Pyright LSP showing zero errors but CLI showing problems: the CLI is canonical (M10 lesson). The LSP caches stale type info across edits, especially after a Pydantic model regen. If LSP and CLI disagree, trust the CLI. Run cd backend && .venv/bin/pyright explicitly.
  • Frontend showing stale Tailwind classes after a tailwind.config.ts edit — kill start-client.sh, delete frontend/.next/, re-run. Tailwind v4's incremental cache is sometimes pinned to old class names.
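  • How to read structured log fields and filter by thread_id: see Observability. The short pattern is docker compose logs --no-log-prefix backend | jq -R 'fromjson? | select(.thread_id == "<ULID>")'. The --no-log-prefix flag drops the container-name prefix that would otherwise break JSON parsing, and fromjson? skips any line that is not JSON.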
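
When jq is not available, the thread_id filter from the last bullet can be approximated in Python. This is a minimal sketch, assuming each structured log record is a single-line JSON object with a `thread_id` key. The `event` field in the usage example and the function name are hypothetical.

```python
import json


def filter_by_thread(lines, thread_id):
    """Yield parsed log records whose thread_id matches, skipping non-JSON lines."""
    for line in lines:
        # Compose prefixes lines with "<container>  | " unless --no-log-prefix
        # is passed, so parse from the first "{" onwards.
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("thread_id") == thread_id:
            yield record
```

Typical use is to pipe `docker compose logs backend` into a script that calls `filter_by_thread(sys.stdin, "<ULID>")` and prints each record.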

Escalation path

If the failure isn't in the table and triage doesn't narrow it:

  1. Grep the close-out memos: ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_m*_landed.md. Each milestone close-out logs surprises encountered during that milestone. The same surprise might recur.
  2. Tail every log at once: docker compose logs -f & tail -f backend/.uvicorn.log frontend/.next-dev.log. Cross-reference timestamps. See Observability for the full log-reading guide.
  3. Capture state for Adrian — before pinging, gather:
    • Output of docker compose ps.
    • Last 200 lines of the relevant log.
    • The exact command that triggered the failure.
    • Output of git log --oneline -10.
  4. Pilot-window finding goes to the M12 close-out memo — if you hit this during the W5 pilot pass, capture in docs/M12-pilot-deviations.md per the Pilot deployment runbook deviation rubric.
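
The state capture in step 3 can be bundled into one report. This is a minimal sketch, assuming it runs from the repo root; the function names and the report layout are hypothetical, and the `runner` parameter exists only so the commands can be stubbed out.

```python
import subprocess


def run(cmd):
    """Run a command and return stdout plus stderr, never raising on failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return proc.stdout + proc.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"


def capture_state(service, trigger, runner=run):
    """Collect the four items from the escalation checklist into one text report."""
    sections = {
        "docker compose ps": runner(["docker", "compose", "ps"]),
        f"last 200 log lines ({service})": runner(
            ["docker", "compose", "logs", "--no-color", "--tail", "200", service]
        ),
        "triggering command": trigger,
        "git log --oneline -10": runner(["git", "log", "--oneline", "-10"]),
    }
    return "\n\n".join(
        f"== {title} ==\n{body.rstrip()}" for title, body in sections.items()
    )
```

For example, `print(capture_state("backend", "./start-server.sh"))` produces a single block of text to paste into the escalation message.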

Adding a new failure mode

When you hit a failure that costs more than ~10 minutes to diagnose, add a row to the table above in a follow-up PR. Row shape:

| <symptom observable behaviour, not internal cause> | <diagnosis what to check first> | <fix copy-pastable command sequence> |

Bump last_verified: YYYY-MM-DD to today on the same PR. The frontmatter validator gates this on every build per ADR-0011 D6.
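
Literal pipe characters inside a cell (shell pipelines, SQL `||`) split the row unless they are escaped as `\|`. The helper below is a minimal sketch for formatting a row in the shape above, assuming the table is a Markdown pipe table as the row shape suggests. The function name is hypothetical, and the input is expected to be raw, not already escaped.

```python
def format_row(symptom: str, diagnosis: str, fix: str) -> str:
    """Format one failure-mode row, escaping pipes and collapsing newlines."""
    cells = []
    for cell in (symptom, diagnosis, fix):
        flat = " ".join(cell.split())            # a table row must stay on one line
        cells.append(flat.replace("|", "\\|"))   # escape literal pipes
    return "| " + " | ".join(cells) + " |"
```

Calling `format_row("worker idle", "arq not running", "ps aux | grep arq")` yields a row whose fix cell contains `ps aux \| grep arq`.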


What next