Observability
Ratiba's observability stack for M13 is deliberately lean. This page explains what you can see right now, how to read the structured logs, what the daily digest delivers, and what the post-pilot target posture looks like.
Current reality
Per ADR-0012, M13 ships with
lean observability: docker compose logs + structured JSON tailing + a daily
WhatsApp digest. No Loki, Datadog, or Sentry in M13.
Why lean? At 5–10 beta tenants and ~50 bookings per week, adding a full metrics stack before knowing which signals actually matter in production is premature. The cost of running Loki + Grafana on a laptop VPS would dwarf the value. The daily digest and grep-able JSON logs are sufficient to catch regressions, billing anomalies, and FSM edge-cases at this scale. The Target posture section describes the upgrade path when the scale changes.
What you can see today:
| Surface | How | Notes |
|---|---|---|
| Live backend logs | docker compose logs -f backend | Structured JSON via structlog |
| Live worker logs | docker compose logs -f worker | APScheduler job results, reaper output |
| Live frontend logs | docker compose logs -f frontend | Next.js stdout (less structured) |
| Recent logs (last N minutes) | docker compose logs --no-log-prefix --since 15m backend | Pipe the output into jq |
| Daily digest | WhatsApp message at 07:00 EAT (planned for M13 T7; not yet implemented) | Booking counts, errors, handoffs |
| Postgres state | docker compose exec postgres psql -U ratiba ratiba | Query any tenant schema directly |
| Redis FSM state | docker compose exec redis redis-cli -a ratiba_redis_password | GET ratiba:fsm:thread:<ULID> |
For failure-mode-specific diagnosis steps, see the Incidents runbook — this page covers log navigation, not failure remediation.
Reading structured logs
Format
The backend and worker emit structured JSON via structlog to stdout. Each log line is a single JSON object. Example:
{
"event": "dispatcher.intent_classified",
"level": "info",
"timestamp": "2026-05-31T09:14:03.412Z",
"tenant_id": "018f1d2e-...",
"thread_id": "01HX5...",
"intent": "booking",
"fsm_state": "GREETING",
"source": "keyword",
"text_preview": "I'd like to book a haircut"
}
Every line carries at minimum: event, level, timestamp. Most
dispatcher-originated lines also carry tenant_id and thread_id.
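When jq is not at hand, the same one-object-per-line format is easy to read from Python. The following is a minimal sketch, not part of the Ratiba codebase: `parse_log_lines` and `REQUIRED_KEYS` are hypothetical names, and the prefix handling assumes the usual `service-1  | ` prefix that docker compose adds when `--no-log-prefix` is not used.

```python
import json
from typing import Iterable, Iterator

# The three keys every structured line is documented to carry.
REQUIRED_KEYS = {"event", "level", "timestamp"}


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per structured log line and skip everything else."""
    for raw in lines:
        # Ignore anything before the first "{" (for example a compose prefix).
        start = raw.find("{")
        if start == -1:
            continue  # plain-text line, e.g. a startup banner
        try:
            record = json.loads(raw[start:])
        except json.JSONDecodeError:
            continue  # truncated or non-JSON output
        if isinstance(record, dict) and REQUIRED_KEYS <= record.keys():
            yield record


sample = [
    'backend-1  | {"event": "dispatcher.intent_classified", "level": "info", '
    '"timestamp": "2026-05-31T09:14:03.412Z", "intent": "booking"}',
    "Application startup complete.",
]
records = list(parse_log_lines(sample))
```

Feeding the function the lines of a saved `docker compose logs` dump gives a stream of dicts that can be filtered on `event`, `level`, or `tenant_id` in the same way as the jq recipes.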
Common jq recipes
docker compose logs prefixes every line with the service name by default, so each recipe passes --no-log-prefix to hand jq bare JSON.
All errors and warnings in the last 10 minutes:
docker compose logs --no-log-prefix --since 10m backend | jq 'select(.level == "error" or .level == "warning")'
Did a specific booking get a payment initiated?
docker compose logs --no-log-prefix backend \
| jq 'select(.event == "dispatcher.payment.mpesa_initiated" and .thread_id == "01HX5...")'
Which tenant sent the most messages today?
docker compose logs --no-log-prefix --since 24h backend \
| jq -r 'select(.event == "dispatcher.intent_classified") | .tenant_id' \
| sort | uniq -c | sort -rn | head
All payment callbacks received (Daraja + PesaPal):
docker compose logs --no-log-prefix backend \
| jq 'select(.event | startswith("payment.callback."))'
Was the worker's 3 AM reaper healthy?
docker compose logs --no-log-prefix worker \
| jq 'select(.event == "reaper.completed" or .event == "reaper.failed")'
Did any FSM thread hit the recursion cap?
docker compose logs --no-log-prefix backend \
| jq 'select(.event == "dispatcher.graph_recursion_capped")'
Key event names
The table below covers the events most useful for day-to-day operations.
For the complete event catalog, grep -r '"logger\.' backend/app/ and
grep -r '"dispatcher\.' backend/app/.
| Event | Level | Module | What it means |
|---|---|---|---|
| dispatcher.intent_classified | info | orchestrator/dispatcher | Each inbound turn: intent, FSM state, source (state continuation vs fresh keyword classification) |
| dispatcher.intent_classified (source=state_continuation) | info | orchestrator/dispatcher | FSM continued an active booking flow without re-classification |
| dispatcher.human_driving_stub | info | orchestrator/dispatcher | Admin-in-handoff; customer reply echoed without LLM invocation |
| dispatcher.payment.mpesa_initiated | info | orchestrator/dispatcher | M-Pesa STK push fired; carries thread_id, appointment_id, merchant_reference, amount |
| dispatcher.payment.pesapal_initiated | info | orchestrator/dispatcher | PesaPal card order submitted |
| dispatcher.payment.skipped_layer3_guard | info | orchestrator/dispatcher | A second payment attempt blocked (concurrent-payment guard; see ADR-0007) |
| payment.daraja.stk_initiated | info | payments/initiate_daraja | Daraja STK push accepted by Safaricom; carries checkout_request_id, masked phone |
| payment.daraja.poll_completed | info | payments/poll_daraja | One-shot stkpushquery at t=60s; outcome (paid / cancelled / pending) |
| payment.callback.daraja_routed | info | payments/callbacks | Callback successfully routed to tenant FSM |
| payment.callback.daraja_dead_letter | warning | payments/callbacks | Late Daraja callback (FSM already past the payment window); written to public.payment_callbacks_unrouted |
| payment.pesapal.nudge_required | info | payments/pesapal_flow | PesaPal 8-minute nudge reminder sent to customer |
| payment.pesapal.abandoned | info | payments/pesapal_flow | PesaPal 30-minute abandon timeout triggered |
| handoff.triggered | info | orchestrator/handoff | Human-handoff threshold crossed; carries signal, detail, masked phone, llm_cost_usd |
| handoff.cost_soft_ceiling_observed | warning | orchestrator/handoff | Per-booking LLM cost hit the $0.05 soft ceiling (ADR-0005 D3); handoff triggered |
| reaper.completed | info | workers/payments_reaper | Daily 3 AM EAT consolidated reaper finished; check the archived field |
| reaper.failed | error | worker | Consolidated reaper crashed (APScheduler caught; retries next day) |
| worker.finalize_deletion.scan | info | worker | Tenant finalization sweep; candidate_count tells you how many soft-deleted tenants were processed |
| knowledge_gap_candidate | warning | orchestrator/dispatcher | Customer asked an other-intent question for which no knowledge snippet matched; carries question, tenant_id, snippets_available (see Knowledge answers) |
| knowledge_overflow | warning | services/knowledge | The tenant's knowledge_snippets table has exceeded the 20-snippet / 1500-token cap for this intent; this is the Phase-0 → Phase-1 graduation signal (see Knowledge answers) |
| booking_graph.no_active_services | warning | orchestrator/booking_graph | A booking was attempted but the tenant has zero active services in the catalogue |
| booking_graph.json_parse_failed | warning | orchestrator/booking_graph | LLM returned malformed slot JSON; FSM will re-prompt |
| dispatcher.checkpoint_parse_failed | warning | orchestrator/dispatcher | LangGraph checkpoint shape has drifted (usually after an FSM state model change); clears on next fresh thread |
Fields reference
| Field | Type | Present on |
|---|---|---|
| tenant_id | UUID string | All dispatcher, payment, handoff events |
| thread_id | ULID string | Per-turn dispatcher events, payment initiation |
| level | "info" / "warning" / "error" | Every line |
| timestamp | ISO-8601 UTC | Every line |
| intent | "booking" / "inquiry" / "other" | dispatcher.intent_classified |
| fsm_state | FSM state enum value | dispatcher.intent_classified |
| signal | handoff signal enum value | handoff.triggered |
| llm_cost_usd | Decimal string | handoff.triggered |
| checkout_request_id | string | payment.daraja.stk_initiated |
| merchant_reference | string | Both payment initiation events |
| question | string (raw text) | knowledge_gap_candidate |
| intent | Intent literal | knowledge_overflow, knowledge_gap_candidate |
The daily digest
What it is
The daily digest is a brief WhatsApp message sent to the operator phone at 07:00 EAT (East Africa Time, UTC+3) each morning. It summarises the previous 24-hour window so you don't need to grep logs to know if anything went wrong overnight.
The digest covers:
- Total bookings confirmed in the window (FSM transitions to DONE)
- Payment breakdown: M-Pesa successful, PesaPal successful, abandoned
- Handoff count and most common trigger signal
- Error counts by level (warning, error) across backend + worker
- knowledge_gap_candidate count per tenant: the daily nudge to add knowledge snippets
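The "previous 24-hour window" can be pinned down precisely. The sketch below shows one way to compute it, assuming the window ends at the most recent 07:00 EAT; `digest_window` is a hypothetical helper, not a function in the codebase, and the real job may define the window differently.

```python
from datetime import datetime, time, timedelta, timezone

# East Africa Time is a fixed UTC+3 offset with no daylight saving.
EAT = timezone(timedelta(hours=3), name="EAT")
DIGEST_TIME = time(7, 0)


def digest_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) in UTC for the 24 hours ending at the latest 07:00 EAT.

    `now` must be timezone-aware.
    """
    local_now = now.astimezone(EAT)
    end_local = datetime.combine(local_now.date(), DIGEST_TIME, tzinfo=EAT)
    if local_now < end_local:
        # Before 07:00 local time, the latest digest boundary was yesterday.
        end_local -= timedelta(days=1)
    end = end_local.astimezone(timezone.utc)
    return end - timedelta(hours=24), end


start, end = digest_window(datetime(2026, 5, 31, 9, 14, tzinfo=timezone.utc))
```

Returning UTC bounds keeps the window directly comparable with the ISO-8601 UTC `timestamp` field on every log line.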
Implementation status
The digest job is planned for M13 T7 (onboarding wave) and is not yet
implemented in code. The APScheduler worker (app/worker.py) has
infrastructure for registering a 07:00 EAT cron job alongside the existing
3 AM EAT reaper and hourly admin-TTL sweep. When T7 lands, a new
DAILY_DIGEST_JOB_ID constant will appear in app/worker.py and the job
will call a function in app/notifications/ (the existing outbound-only
notification sink module — see app/notifications/__init__.py).
The WhatsApp message uses the M9 admin rail (the same OPERATOR_WHATSAPP_NUMBER
channel the admin dashboard already messages for handoff briefings). No new
WhatsApp template is required; the daily digest is a free-form message within
the 24-hour session window.
Configuration
| Variable | Where | Purpose |
|---|---|---|
| OPERATOR_WHATSAPP_NUMBER | .env (project root) | E.164 number that receives the daily digest and handoff briefings (see Configuration reference) |
Set this before starting the stack. If it is unset, the digest job will log a warning and skip the send rather than crash the worker.
Adding a new metric
When the digest job is implemented, add new metrics by:
- Writing a query against public.tenants + per-tenant schema tables, or aggregating from recent structlog output.
- Appending a line to the DigestPayload dataclass in the digest module.
- Including the field in the WhatsApp message template string.
There is no schema migration required for adding digest metrics — the digest is read-only and aggregates from existing tables.
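Since the digest module does not exist yet, the following is only an illustrative sketch of what those three steps might look like. The `DigestPayload` name comes from the plan above; every field name, the `render_digest` function, and the message wording are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestPayload:
    """Hypothetical payload shape; the real dataclass may differ."""

    bookings_confirmed: int = 0
    mpesa_paid: int = 0
    pesapal_paid: int = 0
    payments_abandoned: int = 0
    handoffs: int = 0
    warnings: int = 0
    errors: int = 0
    # Step 2: the new metric is one extra field (here, dead-lettered callbacks).
    unrouted_callbacks: int = 0


def render_digest(p: DigestPayload) -> str:
    """Step 3: include the new field in the free-form message text."""
    lines = [
        "Ratiba daily digest",
        f"Bookings confirmed: {p.bookings_confirmed}",
        f"Payments: M-Pesa {p.mpesa_paid}, PesaPal {p.pesapal_paid}, "
        f"abandoned {p.payments_abandoned}",
        f"Handoffs: {p.handoffs}",
        f"Log warnings/errors: {p.warnings}/{p.errors}",
        f"Unrouted callbacks: {p.unrouted_callbacks}",
    ]
    return "\n".join(lines)


message = render_digest(DigestPayload(bookings_confirmed=7, unrouted_callbacks=2))
```

Giving every field a default of zero means a metric whose query fails can be left out of the constructor call without breaking the message.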
What to watch
These are the five signals that matter most during the M13 beta. Check them each morning by tailing logs or reviewing the daily digest.
Booking success rate
Healthy: bookings that enter COLLECT_SLOT should reach DONE at a rate
above 60%. A sudden drop indicates either a service catalogue problem
(booking_graph.no_active_services) or an FSM regression.
docker compose logs --no-log-prefix --since 24h backend \
| jq 'select(.event == "dispatcher.intent_classified" and .intent == "booking") | .fsm_state' \
| sort | uniq -c
Compare DONE counts against COLLECT_SLOT entry counts.
Payment reconciliation lag
Any payment.callback.daraja_dead_letter or payment.callback.pesapal_dead_letter
event means a callback arrived after the FSM had already timed out. These land
in public.payment_callbacks_unrouted and need manual reconciliation.
Check daily:
docker compose exec postgres psql -U ratiba ratiba \
-c "SELECT count(*) FROM public.payment_callbacks_unrouted WHERE created_at > NOW() - INTERVAL '24 hours';"
A count of zero is the target. A non-zero count means the daily reaper or callback routing has a timing issue; see the Incidents runbook payment section.
FSM state distribution
A healthy system has the majority of threads in terminal states (DONE,
PAYMENT_FAILED) or GREETING (fresh threads). An accumulation of threads
stuck in COLLECT_SLOT or PAYMENT_PENDING indicates a stall.
docker compose exec -T redis redis-cli -a ratiba_redis_password \
--scan --pattern "ratiba:fsm:thread:*" \
| xargs -I{} docker compose exec -T redis redis-cli -a ratiba_redis_password GET {} \
| jq -r '.fsm_state' | sort | uniq -c | sort -rn
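To turn the raw counts into a single "is anything stalling?" number, the values can be tallied in Python. This sketch assumes each Redis value is a JSON document with an `fsm_state` key, as the pipeline above implies; `state_distribution`, `stalled_share`, and `STALL_STATES` are hypothetical names, and no alert threshold is implied.

```python
import json
from collections import Counter
from typing import Iterable, Optional

# States named above as signs of a stall when threads pile up in them.
STALL_STATES = {"COLLECT_SLOT", "PAYMENT_PENDING"}


def state_distribution(raw_values: Iterable[Optional[str]]) -> Counter:
    """Count threads per FSM state from raw Redis GET results."""
    counts: Counter = Counter()
    for raw in raw_values:
        try:
            doc = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # key expired between SCAN and GET, or value is not JSON
        state = doc.get("fsm_state") if isinstance(doc, dict) else None
        if state:
            counts[state] += 1
    return counts


def stalled_share(counts: Counter) -> float:
    """Fraction of threads currently sitting in a stall-prone state."""
    total = sum(counts.values())
    return sum(counts[s] for s in STALL_STATES) / total if total else 0.0


counts = state_distribution([
    '{"fsm_state": "DONE"}', '{"fsm_state": "DONE"}',
    '{"fsm_state": "GREETING"}', '{"fsm_state": "PAYMENT_PENDING"}',
    None,
])
share = stalled_share(counts)
```

Tracking the share over several mornings is more informative than a single reading, since a few threads in PAYMENT_PENDING at any moment is normal.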
Cost per booking
The handoff.triggered event carries llm_cost_usd — the accumulated LLM
spend for that conversation thread. A healthy booking costs well under $0.05
(the ADR-0005 D3 soft ceiling). Conversations hitting the soft ceiling
automatically escalate to handoff; handoff.cost_soft_ceiling_observed in
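the logs is the signal. Track the average over the last seven days; because llm_cost_usd is logged on handoff.triggered, it covers handed-off conversations only:
docker compose logs --no-log-prefix --since 168h backend \
| jq 'select(.event == "handoff.triggered") | .llm_cost_usd | tonumber' \
| awk '{sum+=$1; n++} END {if (n) print "avg:", sum/n, "n:", n; else print "no handoffs"}'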
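Because `llm_cost_usd` is documented as a decimal string, the same average can be computed without floating-point rounding. A minimal sketch, assuming parsed log records as dicts; `handoff_cost_summary` is a hypothetical helper, and the ceiling constant simply mirrors the $0.05 figure quoted above.

```python
from decimal import Decimal, InvalidOperation
from typing import Iterable, Optional

SOFT_CEILING_USD = Decimal("0.05")  # ADR-0005 D3 soft ceiling, as quoted above


def handoff_cost_summary(records: Iterable[dict]) -> Optional[dict]:
    """Average LLM cost across handoff.triggered events, or None if there are none."""
    costs = []
    for r in records:
        if r.get("event") != "handoff.triggered":
            continue
        try:
            costs.append(Decimal(str(r["llm_cost_usd"])))
        except (KeyError, InvalidOperation):
            continue  # field missing or not a number
    if not costs:
        return None
    return {
        "n": len(costs),
        "avg": sum(costs) / len(costs),
        "at_or_over_ceiling": sum(1 for c in costs if c >= SOFT_CEILING_USD),
    }


summary = handoff_cost_summary([
    {"event": "handoff.triggered", "llm_cost_usd": "0.02"},
    {"event": "handoff.triggered", "llm_cost_usd": "0.06"},
    {"event": "handoff.triggered"},
    {"event": "dispatcher.intent_classified"},
])
```

The `at_or_over_ceiling` count is a cross-check against the number of `handoff.cost_soft_ceiling_observed` warnings in the same period.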
Knowledge overflow frequency
knowledge_overflow warnings fire when a tenant's knowledge_snippets table
exceeds the 20-snippet / ~1500-character cap for a given intent. This is the
Phase-0 → Phase-1 graduation signal: when it fires more than a handful of
times per day, it's time to implement real embedding-based retrieval.
See Knowledge answers for the Phase-1
upgrade path.
docker compose logs --no-log-prefix --since 24h backend | jq 'select(.event == "knowledge_overflow")'
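To see which tenants are actually approaching graduation, the warnings can be grouped per day, tenant, and intent. This sketch makes two assumptions that the source does not confirm: that `knowledge_overflow` lines carry `tenant_id`, and that "more than a handful" means more than five. `overflow_hotspots` and `GRADUATION_THRESHOLD` are hypothetical names.

```python
from collections import Counter
from typing import Iterable

GRADUATION_THRESHOLD = 5  # assumed stand-in for "more than a handful"


def overflow_hotspots(records: Iterable[dict],
                      threshold: int = GRADUATION_THRESHOLD) -> dict:
    """Return {(utc_day, tenant_id, intent): count} for groups above the threshold."""
    counts: Counter = Counter()
    for r in records:
        if r.get("event") != "knowledge_overflow":
            continue
        day = str(r.get("timestamp", ""))[:10]  # YYYY-MM-DD from the ISO-8601 stamp
        key = (day, r.get("tenant_id", "unknown"), r.get("intent", "unknown"))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > threshold}


demo = [
    {"event": "knowledge_overflow", "timestamp": "2026-05-31T09:14:03.412Z",
     "tenant_id": "tenant-a", "intent": "other"}
] * 6 + [
    {"event": "knowledge_overflow", "timestamp": "2026-05-31T10:00:00.000Z",
     "tenant_id": "tenant-b", "intent": "other"}
]
hotspots = overflow_hotspots(demo)
```

Days are derived from the UTC timestamp, so a tenant's evening traffic in EAT may be split across two calendar days.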
Target posture
The following tools are NOT in M13 scope. They are the planned upgrade path for post-pilot scale (50+ tenants, VPS production deployment).
| Tool | Role | When |
|---|---|---|
| Langfuse v4 (self-hosted) | Prompt traces, LLM span observability, prompt version management, eval dataset management | M14 — the Langfuse sink seam already exists in app/llm/adapters/ (langfuse_sink.skip_empty log event); wiring it to a running Langfuse instance is a configuration change, not a code change |
| Grafana + Loki | Log aggregation, dashboards, alerting at scale | M14/M15 when structured logs outgrow a terminal jq session |
| Sentry | Exception tracking, error grouping, release tracking | M14 when unhandled exceptions need automatic triage rather than manual log grepping |
| Prometheus + Grafana | Metrics (booking rate, payment success, FSM state distribution) | M15+ when the signal-to-noise ratio of log-based metrics becomes painful |
The structlog event design is forward-compatible with all of the above: Loki ingests JSON stdout directly; Langfuse traces map one-to-one to dispatcher turns; Sentry exception capture is an SDK initialisation step that doesn't require changing any log call sites.
Until M14 tooling lands, the combination of docker compose logs | jq,
the daily digest, and the
Incidents runbook covers everything needed for a
5–10 tenant beta cohort.