Observability

Ratiba's observability stack for M13 is deliberately lean. This page explains what you can see right now, how to read the structured logs, what the daily digest delivers, and what the post-pilot target posture looks like.

Current reality

Per ADR-0012, M13 ships with lean observability: docker compose logs + structured JSON tailing + a daily WhatsApp digest. No Loki, Datadog, or Sentry in M13.

Why lean? At 5–10 beta tenants and ~50 bookings per week, adding a full metrics stack before knowing which signals actually matter in production is premature. The cost of running Loki + Grafana on a laptop VPS would dwarf the value. The daily digest and grep-able JSON logs are sufficient to catch regressions, billing anomalies, and FSM edge-cases at this scale. The Target posture section describes the upgrade path when the scale changes.

What you can see today:

Surface	How	Notes
Live backend logs	`docker compose logs -f backend`	Structured JSON via structlog
Live worker logs	`docker compose logs -f worker`	APScheduler job results, reaper output
Live frontend logs	`docker compose logs -f frontend`	Next.js stdout (less structured)
Recent logs (last N minutes)	`docker compose logs --since 15m backend`	Add `--no-log-prefix` and pipe to `jq` to filter
Daily digest	WhatsApp message at 07:00 EAT	Booking counts, errors, handoffs
Postgres state	`docker compose exec postgres psql -U ratiba ratiba`	Query any tenant schema directly
Redis FSM state	`docker compose exec redis redis-cli -a ratiba_redis_password`	`GET ratiba:fsm:thread:<ULID>`

For failure-mode-specific diagnosis steps, see the Incidents runbook — this page covers log navigation, not failure remediation.

Reading structured logs

Format

The backend and worker emit structured JSON via structlog to stdout. Each log line is a single JSON object. Example:

{
  "event": "dispatcher.intent_classified",
  "level": "info",
  "timestamp": "2026-05-31T09:14:03.412Z",
  "tenant_id": "018f1d2e-...",
  "thread_id": "01HX5...",
  "intent": "booking",
  "fsm_state": "GREETING",
  "source": "keyword",
  "text_preview": "I'd like to book a haircut"
}

Every line carries at minimum: event, level, timestamp. Most dispatcher-originated lines also carry tenant_id and thread_id.
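The minimum-field contract above can be checked mechanically. The following is a minimal sketch, assuming one JSON object per line with no container prefix; `parse_event` and `REQUIRED_KEYS` are illustrative names, not part of the Ratiba codebase.

```python
import json
from datetime import datetime, timezone

# The three keys the page says every log line carries.
REQUIRED_KEYS = ("event", "level", "timestamp")


def parse_event(line: str) -> dict:
    """Decode one structured log line and enforce the minimum-field contract."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"log line is missing required keys: {missing}")
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    record["timestamp"] = datetime.fromisoformat(
        record["timestamp"].replace("Z", "+00:00")
    )
    return record


sample = (
    '{"event": "dispatcher.intent_classified", "level": "info", '
    '"timestamp": "2026-05-31T09:14:03.412Z", "intent": "booking"}'
)
event = parse_event(sample)
print(event["event"], event["timestamp"].isoformat())
```

Parsing the timestamp into a timezone-aware `datetime` makes later window comparisons (for example, "the last 24 hours") straightforward.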

Common `jq` recipes

By default `docker compose logs` prefixes every line with the container name, which `jq` cannot parse. Each recipe below passes `--no-log-prefix` so that `jq` receives bare JSON.

Errors and warnings in the last 10 minutes:

docker compose logs --no-log-prefix --since 10m backend | jq 'select(.level == "error" or .level == "warning")'

Did a specific booking get a payment initiated?

docker compose logs --no-log-prefix backend \
  | jq 'select(.event == "dispatcher.payment.mpesa_initiated" and .thread_id == "01HX5...")'

Which tenant sent the most messages in the last 24 hours?

docker compose logs --no-log-prefix --since 24h backend \
  | jq -r 'select(.event == "dispatcher.intent_classified") | .tenant_id' \
  | sort | uniq -c | sort -rn | head

All payment callbacks received (Daraja + PesaPal):

docker compose logs --no-log-prefix backend \
  | jq 'select(.event | startswith("payment.callback."))'

Was the worker's 3 AM reaper healthy?

docker compose logs --no-log-prefix worker \
  | jq 'select(.event == "reaper.completed" or .event == "reaper.failed")'

Did any FSM thread hit the recursion cap?

docker compose logs --no-log-prefix backend \
  | jq 'select(.event == "dispatcher.graph_recursion_capped")'
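When a recipe outgrows a one-liner, the same filtering can be done in a short script. The sketch below reproduces the per-tenant count under two assumptions: lines may or may not carry the `<container> | ` prefix that `docker compose logs` adds, and non-JSON lines should be skipped rather than abort the run. `iter_events` and `turns_per_tenant` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per structured log line, skipping anything else."""
    for line in lines:
        # Drop an optional "<container> | " prefix by starting at the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def turns_per_tenant(lines: Iterable[str]) -> Counter:
    """Count dispatcher.intent_classified events per tenant_id."""
    return Counter(
        event["tenant_id"]
        for event in iter_events(lines)
        if event.get("event") == "dispatcher.intent_classified"
        and "tenant_id" in event
    )


# Illustrative input; the tenant IDs are placeholders, not real data.
sample_lines = [
    'backend-1  | {"event": "dispatcher.intent_classified", "level": "info", "tenant_id": "tenant-a"}',
    '{"event": "dispatcher.intent_classified", "level": "info", "tenant_id": "tenant-a"}',
    '{"event": "dispatcher.intent_classified", "level": "info", "tenant_id": "tenant-b"}',
    '{"event": "handoff.triggered", "level": "info", "tenant_id": "tenant-b"}',
    "INFO:     Application startup complete.",
]

for tenant_id, turns in turns_per_tenant(sample_lines).most_common():
    print(turns, tenant_id)
```

In practice the `lines` argument could be `sys.stdin`, fed from a `docker compose logs` pipe.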

Key event names

The table below covers the events most useful for day-to-day operations. For the complete event catalog, grep -r '"logger\.' backend/app/ and grep -r '"dispatcher\.' backend/app/.

Event	Level	Module	What it means
`dispatcher.intent_classified`	info	orchestrator/dispatcher	Each inbound turn — intent, FSM state, source (state continuation vs fresh keyword classification)
`dispatcher.intent_classified` (source=`state_continuation`)	info	orchestrator/dispatcher	FSM continued an active booking flow without re-classification
`dispatcher.human_driving_stub`	info	orchestrator/dispatcher	Admin-in-handoff; customer reply echoed without LLM invocation
`dispatcher.payment.mpesa_initiated`	info	orchestrator/dispatcher	M-Pesa STK push fired; carries `thread_id`, `appointment_id`, `merchant_reference`, `amount`
`dispatcher.payment.pesapal_initiated`	info	orchestrator/dispatcher	PesaPal card order submitted
`dispatcher.payment.skipped_layer3_guard`	info	orchestrator/dispatcher	A second payment attempt blocked (concurrent-payment guard; see ADR-0007)
`payment.daraja.stk_initiated`	info	payments/initiate_daraja	Daraja STK push accepted by Safaricom; carries `checkout_request_id`, masked phone
`payment.daraja.poll_completed`	info	payments/poll_daraja	One-shot `stkpushquery` at t=60s; outcome (paid / cancelled / pending)
`payment.callback.daraja_routed`	info	payments/callbacks	Callback successfully routed to tenant FSM
`payment.callback.daraja_dead_letter`	warning	payments/callbacks	Late Daraja callback (FSM already past the payment window); written to `public.payment_callbacks_unrouted`
`payment.pesapal.nudge_required`	info	payments/pesapal_flow	PesaPal 8-minute nudge reminder sent to customer
`payment.pesapal.abandoned`	info	payments/pesapal_flow	PesaPal 30-minute abandon timeout triggered
`handoff.triggered`	info	orchestrator/handoff	Human-handoff threshold crossed; carries `signal`, `detail`, masked phone, `llm_cost_usd`
`handoff.cost_soft_ceiling_observed`	warning	orchestrator/handoff	Per-booking LLM cost hit the $0.05 soft ceiling (ADR-0005 D3); handoff triggered
`reaper.completed`	info	workers/payments_reaper	Daily 3 AM EAT consolidated reaper finished; check the `archived` field
`reaper.failed`	error	worker	Consolidated reaper crashed (APScheduler caught; retries next day)
`worker.finalize_deletion.scan`	info	worker	Tenant finalization sweep; `candidate_count` tells you how many soft-deleted tenants were processed
`knowledge_gap_candidate`	warning	orchestrator/dispatcher	Customer asked an `other`-intent question for which no knowledge snippet matched; carries `question`, `tenant_id`, `snippets_available` — see Knowledge answers
`knowledge_overflow`	warning	services/knowledge	The tenant's `knowledge_snippets` table has exceeded the 20-snippet / 1500-token cap for this intent; the Phase-0 → Phase-1 graduation signal — see Knowledge answers
`booking_graph.no_active_services`	warning	orchestrator/booking_graph	A booking was attempted but the tenant has zero active services in the catalogue
`booking_graph.json_parse_failed`	warning	orchestrator/booking_graph	LLM returned malformed slot JSON; FSM will re-prompt
`dispatcher.checkpoint_parse_failed`	warning	orchestrator/dispatcher	LangGraph checkpoint shape has drifted (usually after a FSM state model change); clears on next fresh thread

Fields reference

Field	Type	Present on
`tenant_id`	UUID string	All dispatcher, payment, handoff events
`thread_id`	ULID string	Per-turn dispatcher events, payment initiation
`level`	`"info"` / `"warning"` / `"error"`	Every line
`timestamp`	ISO-8601 UTC	Every line
`intent`	`"booking"` / `"inquiry"` / `"other"`	`dispatcher.intent_classified`, `knowledge_overflow`, `knowledge_gap_candidate`
`fsm_state`	FSM state enum value	`dispatcher.intent_classified`
`signal`	handoff signal enum value	`handoff.triggered`
`llm_cost_usd`	Decimal string	`handoff.triggered`
`checkout_request_id`	string	`payment.daraja.stk_initiated`
`merchant_reference`	string	Both payment initiation events
`question`	string (raw text)	`knowledge_gap_candidate`

The daily digest

What it is

The daily digest is a brief WhatsApp message sent to the operator phone at 07:00 EAT (East Africa Time, UTC+3) each morning. It summarises the previous 24-hour window so you don't need to grep logs to know if anything went wrong overnight.

The digest covers:

Total bookings confirmed in the window (FSM transitions to DONE)
Payment breakdown — M-Pesa successful, PesaPal successful, abandoned
Handoff count and most common trigger signal
Error counts by level (warning, error) across backend + worker
knowledge_gap_candidate count per tenant — the daily nudge to add knowledge snippets
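The "previous 24-hour window" that the digest summarises can be pinned down precisely. The sketch below is one way to compute it, assuming the window ends at the most recent 07:00 EAT; `digest_window` is a hypothetical helper, and a fixed UTC+3 offset is used because EAT has no daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

# East Africa Time is a fixed UTC+3 offset with no DST.
EAT = timezone(timedelta(hours=3), name="EAT")


def digest_window(now_utc: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) in UTC for the latest digest at or before now_utc."""
    now_eat = now_utc.astimezone(EAT)
    end_eat = datetime.combine(now_eat.date(), time(7, 0), tzinfo=EAT)
    if end_eat > now_eat:
        # Today's 07:00 has not happened yet, so use yesterday's run.
        end_eat -= timedelta(days=1)
    end_utc = end_eat.astimezone(timezone.utc)
    return end_utc - timedelta(hours=24), end_utc


start, end = digest_window(datetime(2026, 5, 31, 9, 14, tzinfo=timezone.utc))
print(start.isoformat(), "to", end.isoformat())
```

Expressing both bounds in UTC keeps them directly comparable with the `timestamp` field in the logs.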

Implementation status

The digest job is planned for M13 T7 (onboarding wave) and is not yet implemented in code. The APScheduler worker (app/worker.py) has infrastructure for registering a 07:00 EAT cron job alongside the existing 3 AM EAT reaper and hourly admin-TTL sweep. When T7 lands, a new DAILY_DIGEST_JOB_ID constant will appear in app/worker.py and the job will call a function in app/notifications/ (the existing outbound-only notification sink module — see app/notifications/__init__.py).

The WhatsApp message uses the M9 admin rail (the same OPERATOR_WHATSAPP_NUMBER channel the admin dashboard already messages for handoff briefings). No new WhatsApp template is required; the daily digest is a free-form message within the 24-hour session window.

Configuration

Variable	Where	Purpose
`OPERATOR_WHATSAPP_NUMBER`	`.env` (project root)	E.164 number that receives the daily digest and handoff briefings — see Configuration reference

Set this before starting the stack. If it is unset, the digest job will log a warning and skip that run; it will not crash the worker.
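Since the digest job is not yet implemented, the following is only a sketch of how the unset-variable behaviour could look. It uses the standard `logging` module to stay self-contained (the real worker logs through structlog). The function name, the `digest.skipped` messages, and the extra E.164 format check are assumptions.

```python
import logging
import os
import re
from typing import Mapping, Optional

log = logging.getLogger("digest")

# E.164: a leading "+", a non-zero first digit, at most 15 digits in total.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def resolve_operator_number(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the operator number, or None if the digest should be skipped."""
    number = env.get("OPERATOR_WHATSAPP_NUMBER", "").strip()
    if not number:
        log.warning("digest.skipped: OPERATOR_WHATSAPP_NUMBER is unset")
        return None
    if not E164.match(number):
        # Assumption: a malformed number is treated like a missing one.
        log.warning("digest.skipped: OPERATOR_WHATSAPP_NUMBER is not E.164")
        return None
    return number


print(resolve_operator_number({"OPERATOR_WHATSAPP_NUMBER": "+254700000000"}))
```

Returning `None` instead of raising lets the scheduler carry on with its other jobs.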

Adding a new metric

When the digest job is implemented, add new metrics by:

Writing a query against public.tenants + per-tenant schema tables, or aggregating from recent structlog output.
Appending a line to the DigestPayload dataclass in the digest module.
Including the field in the WhatsApp message template string.

There is no schema migration required for adding digest metrics — the digest is read-only and aggregates from existing tables.
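The three steps above can be sketched as follows. Only the `DigestPayload` name comes from this page; every field name, the `render_digest` function, and the message wording are hypothetical, because the digest module does not exist yet. The example adds an `unrouted_callbacks` metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestPayload:
    bookings_confirmed: int
    mpesa_paid: int
    pesapal_paid: int
    payments_abandoned: int
    handoffs: int
    warnings: int
    errors: int
    # Step 2: the new metric is one more field; the default keeps
    # existing callers working until they supply a value.
    unrouted_callbacks: int = 0


def render_digest(payload: DigestPayload) -> str:
    """Build the free-form WhatsApp message text from the payload."""
    lines = [
        "Ratiba daily digest",
        f"Bookings confirmed: {payload.bookings_confirmed}",
        f"Payments: M-Pesa {payload.mpesa_paid}, PesaPal {payload.pesapal_paid}, "
        f"abandoned {payload.payments_abandoned}",
        f"Handoffs: {payload.handoffs}",
        f"Log lines: {payload.warnings} warning, {payload.errors} error",
        # Step 3: surface the new field in the message text.
        f"Unrouted callbacks: {payload.unrouted_callbacks}",
    ]
    return "\n".join(lines)


# Step 1 (the query that produces the number) is omitted; 1 is a placeholder.
print(render_digest(DigestPayload(12, 7, 3, 2, 1, 4, 0, unrouted_callbacks=1)))
```

A frozen dataclass suits the digest's read-only nature: the payload is computed once and only rendered afterwards.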

What to watch

These are the five signals that matter most during the M13 beta. Check them each morning by tailing logs or reviewing the daily digest.

Booking success rate

Healthy: bookings that enter COLLECT_SLOT should reach DONE at a rate above 60%. A sudden drop indicates either a service catalogue problem (booking_graph.no_active_services) or an FSM regression.

docker compose logs --no-log-prefix --since 24h backend \
  | jq 'select(.event == "dispatcher.intent_classified" and .intent == "booking") | .fsm_state' \
  | sort | uniq -c

Compare DONE counts against COLLECT_SLOT entry counts.

Payment reconciliation lag

Any payment.callback.daraja_dead_letter or payment.callback.pesapal_dead_letter event means a callback arrived after the FSM had already timed out. These land in public.payment_callbacks_unrouted and need manual reconciliation. Check daily:

docker compose exec postgres psql -U ratiba ratiba \
  -c "SELECT count(*) FROM public.payment_callbacks_unrouted WHERE created_at > NOW() - INTERVAL '24 hours';"

A count of zero is the target. A non-zero count means the daily reaper or callback routing has a timing issue; see the Incidents runbook payment section.

FSM state distribution

A healthy system has the majority of threads in terminal states (DONE, PAYMENT_FAILED) or GREETING (fresh threads). An accumulation of threads stuck in COLLECT_SLOT or PAYMENT_PENDING indicates a stall.

docker compose exec -T redis redis-cli -a ratiba_redis_password \
  --scan --pattern "ratiba:fsm:thread:*" \
  | xargs -I{} docker compose exec -T redis redis-cli -a ratiba_redis_password GET {} \
  | jq -r '.fsm_state' | sort | uniq -c | sort -rn
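To turn the distribution into a single number to watch, the share of threads in stall-prone states can be computed. This sketch assumes, as the pipeline above does, that each Redis value is a JSON object with an `fsm_state` key; the function names are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# States the text above names as signs of a stall when they accumulate.
STALL_PRONE = {"COLLECT_SLOT", "PAYMENT_PENDING"}


def state_distribution(raw_values: Iterable[str]) -> Counter:
    """Count threads per FSM state from raw Redis GET results."""
    counts: Counter = Counter()
    for raw in raw_values:
        try:
            value = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # expired key or unexpected payload
        if isinstance(value, dict) and "fsm_state" in value:
            counts[value["fsm_state"]] += 1
    return counts


def stalled_share(counts: Counter) -> float:
    """Fraction of threads currently in a stall-prone state."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[state] for state in STALL_PRONE) / total


sample_values = (
    ['{"fsm_state": "DONE"}'] * 3
    + ['{"fsm_state": "GREETING"}']
    + ['{"fsm_state": "COLLECT_SLOT"}', '{"fsm_state": "PAYMENT_PENDING"}']
)
distribution = state_distribution(sample_values)
print(distribution.most_common(), f"stalled: {stalled_share(distribution):.0%}")
```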

Cost per booking

The handoff.triggered event carries llm_cost_usd — the accumulated LLM spend for that conversation thread. A healthy booking costs well under $0.05 (the ADR-0005 D3 soft ceiling). Conversations hitting the soft ceiling automatically escalate to handoff; handoff.cost_soft_ceiling_observed in the logs is the signal. Track the rolling average:

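# --since takes Go-style durations, which have no day unit; 168h is 7 days.
docker compose logs --no-log-prefix --since 168h backend \
  | jq 'select(.event == "handoff.triggered") | .llm_cost_usd | tonumber' \
  | awk '{sum+=$1; n++} END {if (n) print "avg:", sum/n, "n:", n; else print "no handoffs"}'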
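Because `llm_cost_usd` is logged as a Decimal string, the same average can be computed without floating-point rounding. The sketch below is illustrative: `cost_summary` is a hypothetical helper, and the comparison against the ceiling uses "at or above" as an assumption about how the ceiling is applied.

```python
from decimal import Decimal
from typing import Iterable, Optional

SOFT_CEILING_USD = Decimal("0.05")  # ADR-0005 D3 soft ceiling quoted above


def cost_summary(costs: Iterable[str]) -> Optional[dict]:
    """Summarise llm_cost_usd values taken from handoff.triggered events."""
    values = [Decimal(cost) for cost in costs]
    if not values:
        return None  # nothing to average, so avoid dividing by zero
    return {
        "n": len(values),
        "avg": sum(values) / len(values),
        "at_or_over_ceiling": sum(1 for v in values if v >= SOFT_CEILING_USD),
    }


summary = cost_summary(["0.01", "0.03", "0.05"])
print(summary)
```

Note that this average covers only conversations that reached handoff, since `handoff.triggered` is the event that carries the cost field.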

Knowledge overflow frequency

knowledge_overflow warnings fire when a tenant's knowledge_snippets table exceeds the 20-snippet / ~1500-character cap for a given intent. This is the Phase-0 → Phase-1 graduation signal: when it fires more than a handful of times per day, it's time to implement real embedding-based retrieval. See Knowledge answers for the Phase-1 upgrade path.

docker compose logs --no-log-prefix --since 24h backend | jq 'select(.event == "knowledge_overflow")'

Target posture

The following tools are NOT in M13 scope. They are the planned upgrade path for post-pilot scale (50+ tenants, VPS production deployment).

Tool	Role	When
Langfuse v4 (self-hosted)	Prompt traces, LLM span observability, prompt version management, eval dataset management	M14 — the Langfuse sink seam already exists in `app/llm/adapters/` (`langfuse_sink.skip_empty` log event); wiring it to a running Langfuse instance is a configuration change, not a code change
Grafana + Loki	Log aggregation, dashboards, alerting at scale	M14/M15 when structured logs outgrow a terminal `jq` session
Sentry	Exception tracking, error grouping, release tracking	M14 when unhandled exceptions need automatic triage rather than manual log grepping
Prometheus + Grafana	Metrics (booking rate, payment success, FSM state distribution)	M15+ when the signal-to-noise ratio of log-based metrics becomes painful

The structlog event design is forward-compatible with all of the above: Loki ingests JSON stdout directly; Langfuse traces map one-to-one to dispatcher turns; Sentry exception capture is an SDK initialisation step that doesn't require changing any log call sites.

Until M14 tooling lands, the combination of docker compose logs | jq, the daily digest, and the Incidents runbook covers everything needed for a 5–10 tenant beta cohort.
