
Observability

Ratiba's observability stack for M13 is deliberately lean. This page explains what you can see right now, how to read the structured logs, what the daily digest delivers, and what the post-pilot target posture looks like.


Current reality

Per ADR-0012, M13 ships with lean observability: docker compose logs + structured JSON tailing + a daily WhatsApp digest. No Loki, Datadog, or Sentry in M13.

Why lean? At 5–10 beta tenants and ~50 bookings per week, adding a full metrics stack before knowing which signals actually matter in production is premature. The cost of running Loki + Grafana on a laptop VPS would dwarf the value. The daily digest and grep-able JSON logs are sufficient to catch regressions, billing anomalies, and FSM edge-cases at this scale. The Target posture section describes the upgrade path when the scale changes.

What you can see today:

| Surface | How | Notes |
| --- | --- | --- |
| Live backend logs | docker compose logs -f backend | Structured JSON via structlog |
| Live worker logs | docker compose logs -f worker | APScheduler job results, reaper output |
| Live frontend logs | docker compose logs -f frontend | Next.js stdout (less structured) |
| Recent logs (last N minutes) | docker compose logs --since 15m backend | Combine with jq via a pipe |
| Daily digest | WhatsApp message at 07:00 EAT | Booking counts, errors, handoffs |
| Postgres state | docker compose exec postgres psql -U ratiba ratiba | Query any tenant schema directly |
| Redis FSM state | docker compose exec redis redis-cli -a ratiba_redis_password | GET ratiba:fsm:thread:<ULID> |

For failure-mode-specific diagnosis steps, see the Incidents runbook — this page covers log navigation, not failure remediation.


Reading structured logs

Format

The backend and worker emit structured JSON via structlog to stdout. Each log line is a single JSON object. Example:

{
  "event": "dispatcher.intent_classified",
  "level": "info",
  "timestamp": "2026-05-31T09:14:03.412Z",
  "tenant_id": "018f1d2e-...",
  "thread_id": "01HX5...",
  "intent": "booking",
  "fsm_state": "GREETING",
  "source": "keyword",
  "text_preview": "I'd like to book a haircut"
}

Every line carries at minimum: event, level, timestamp. Most dispatcher-originated lines also carry tenant_id and thread_id.
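
When jq is not at hand, the same format can be read from Python. The following is a minimal sketch, not code from the Ratiba repository: the helper names parse_log_line and iter_events are hypothetical, and it assumes lines may arrive with the service-name prefix that docker compose logs adds by default (for example "backend-1  | "). The only facts taken from this page are the three minimum keys.

```python
import json
from typing import Iterable, Iterator, Optional

# Minimum keys that every structured line carries, per this page.
REQUIRED_KEYS = ("event", "level", "timestamp")


def parse_log_line(line: str) -> Optional[dict]:
    """Return the JSON object in one log line, or None for non-structured lines."""
    # Skip any "service-1  | " prefix by starting at the first opening brace.
    start = line.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Drop objects that lack the minimum fields.
    if not all(key in record for key in REQUIRED_KEYS):
        return None
    return record


def iter_events(lines: Iterable[str], event_prefix: str = "") -> Iterator[dict]:
    """Yield parsed records whose event name starts with event_prefix."""
    for line in lines:
        record = parse_log_line(line)
        if record is not None and str(record["event"]).startswith(event_prefix):
            yield record
```

Feeding it sys.stdin lets it sit at the end of a docker compose logs pipe; lines that are not JSON objects are skipped rather than raising.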

Common jq recipes

All errors and warnings in the last 10 minutes:

docker compose logs --no-log-prefix --since 10m backend | jq 'select(.level == "error" or .level == "warning")'

Did a specific booking get a payment initiated?

docker compose logs --no-log-prefix backend \
  | jq 'select(.event == "dispatcher.payment.mpesa_initiated" and .thread_id == "01HX5...")'

Which tenant sent the most messages in the last 24 hours?

docker compose logs --no-log-prefix --since 24h backend \
  | jq -r 'select(.event == "dispatcher.intent_classified") | .tenant_id' \
  | sort | uniq -c | sort -rn | head

All payment callbacks received (Daraja + PesaPal):

docker compose logs --no-log-prefix backend \
  | jq 'select(.event | startswith("payment.callback."))'

Was the worker's 3 AM reaper healthy?

docker compose logs --no-log-prefix worker \
  | jq 'select(.event == "reaper.completed" or .event == "reaper.failed")'

Did any FSM thread hit the recursion cap?

docker compose logs --no-log-prefix backend \
  | jq 'select(.event == "dispatcher.graph_recursion_capped")'

The --no-log-prefix flag strips the service-name prefix that docker compose logs adds to each line by default, so that jq receives bare JSON.

Key event names

The table below covers the events most useful for day-to-day operations. For the complete event catalog, run grep -r '"logger\.' backend/app/ and grep -r '"dispatcher\.' backend/app/.

| Event | Level | Module | What it means |
| --- | --- | --- | --- |
| dispatcher.intent_classified | info | orchestrator/dispatcher | Each inbound turn: intent, FSM state, source (state continuation vs fresh keyword classification) |
| dispatcher.intent_classified (source=state_continuation) | info | orchestrator/dispatcher | FSM continued an active booking flow without re-classification |
| dispatcher.human_driving_stub | info | orchestrator/dispatcher | Admin-in-handoff; customer reply echoed without LLM invocation |
| dispatcher.payment.mpesa_initiated | info | orchestrator/dispatcher | M-Pesa STK push fired; carries thread_id, appointment_id, merchant_reference, amount |
| dispatcher.payment.pesapal_initiated | info | orchestrator/dispatcher | PesaPal card order submitted |
| dispatcher.payment.skipped_layer3_guard | info | orchestrator/dispatcher | A second payment attempt blocked (concurrent-payment guard; see ADR-0007) |
| payment.daraja.stk_initiated | info | payments/initiate_daraja | Daraja STK push accepted by Safaricom; carries checkout_request_id, masked phone |
| payment.daraja.poll_completed | info | payments/poll_daraja | One-shot stkpushquery at t=60s; outcome (paid / cancelled / pending) |
| payment.callback.daraja_routed | info | payments/callbacks | Callback successfully routed to tenant FSM |
| payment.callback.daraja_dead_letter | warning | payments/callbacks | Late Daraja callback (FSM already past the payment window); written to public.payment_callbacks_unrouted |
| payment.pesapal.nudge_required | info | payments/pesapal_flow | PesaPal 8-minute nudge reminder sent to customer |
| payment.pesapal.abandoned | info | payments/pesapal_flow | PesaPal 30-minute abandon timeout triggered |
| handoff.triggered | info | orchestrator/handoff | Human-handoff threshold crossed; carries signal, detail, masked phone, llm_cost_usd |
| handoff.cost_soft_ceiling_observed | warning | orchestrator/handoff | Per-booking LLM cost hit the $0.05 soft ceiling (ADR-0005 D3); handoff triggered |
| reaper.completed | info | workers/payments_reaper | Daily 3 AM EAT consolidated reaper finished; check the archived field |
| reaper.failed | error | worker | Consolidated reaper crashed (APScheduler caught; retries next day) |
| worker.finalize_deletion.scan | info | worker | Tenant finalization sweep; candidate_count tells you how many soft-deleted tenants were processed |
| knowledge_gap_candidate | warning | orchestrator/dispatcher | Customer asked an other-intent question for which no knowledge snippet matched; carries question, tenant_id, snippets_available; see Knowledge answers |
| knowledge_overflow | warning | services/knowledge | The tenant's knowledge_snippets table has exceeded the 20-snippet / 1500-token cap for this intent. This is the Phase-0 → Phase-1 graduation signal; see Knowledge answers |
| booking_graph.no_active_services | warning | orchestrator/booking_graph | A booking was attempted but the tenant has zero active services in the catalogue |
| booking_graph.json_parse_failed | warning | orchestrator/booking_graph | LLM returned malformed slot JSON; FSM will re-prompt |
| dispatcher.checkpoint_parse_failed | warning | orchestrator/dispatcher | LangGraph checkpoint shape has drifted (usually after a FSM state model change); clears on next fresh thread |

Fields reference

| Field | Type | Present on |
| --- | --- | --- |
| tenant_id | UUID string | All dispatcher, payment, handoff events |
| thread_id | ULID string | Per-turn dispatcher events, payment initiation |
| level | "info" / "warning" / "error" | Every line |
| timestamp | ISO-8601 UTC | Every line |
| intent | "booking" / "inquiry" / "other" | dispatcher.intent_classified |
| fsm_state | FSM state enum value | dispatcher.intent_classified |
| signal | handoff signal enum value | handoff.triggered |
| llm_cost_usd | Decimal string | handoff.triggered |
| checkout_request_id | string | payment.daraja.stk_initiated |
| merchant_reference | string | Both payment initiation events |
| question | string (raw text) | knowledge_gap_candidate |
| intent | Intent literal | knowledge_overflow, knowledge_gap_candidate |

The daily digest

What it is

The daily digest is a brief WhatsApp message sent to the operator phone at 07:00 EAT (East Africa Time, UTC+3) each morning. It summarises the previous 24-hour window so you don't need to grep logs to know if anything went wrong overnight.

The digest covers:

  • Total bookings confirmed in the window (FSM transitions to DONE)
  • Payment breakdown — M-Pesa successful, PesaPal successful, abandoned
  • Handoff count and most common trigger signal
  • Error counts by level (warning, error) across backend + worker
  • knowledge_gap_candidate count per tenant — the daily nudge to add knowledge snippets
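
The digest job does not exist yet, so how it will bound the "previous 24-hour window" is not settled. As one plausible reading, the sketch below assumes the window is the 24 hours ending at the most recent 07:00 EAT. The function name digest_window is hypothetical. It uses a fixed UTC+3 offset, which is safe because EAT has no daylight saving.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# EAT is UTC+3 all year, so a fixed offset is sufficient.
EAT = timezone(timedelta(hours=3), name="EAT")


def digest_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) in UTC for the 24 hours ending at the latest 07:00 EAT.

    `now` must be timezone-aware.
    """
    local_now = now.astimezone(EAT)
    end_local = local_now.replace(hour=7, minute=0, second=0, microsecond=0)
    if end_local > local_now:
        # Before 07:00 local time, the latest cutoff was yesterday's.
        end_local -= timedelta(days=1)
    start_local = end_local - timedelta(hours=24)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)
```

Returning UTC bounds keeps them directly comparable with the ISO-8601 UTC timestamp field on each log line.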

Implementation status

The digest job is planned for M13 T7 (onboarding wave) and is not yet implemented in code. The APScheduler worker (app/worker.py) has infrastructure for registering a 07:00 EAT cron job alongside the existing 3 AM EAT reaper and hourly admin-TTL sweep. When T7 lands, a new DAILY_DIGEST_JOB_ID constant will appear in app/worker.py and the job will call a function in app/notifications/ (the existing outbound-only notification sink module — see app/notifications/__init__.py).

The WhatsApp message uses the M9 admin rail (the same OPERATOR_WHATSAPP_NUMBER channel the admin dashboard already messages for handoff briefings). No new WhatsApp template is required; the daily digest is a free-form message within the 24-hour session window.

Configuration

| Variable | Where | Purpose |
| --- | --- | --- |
| OPERATOR_WHATSAPP_NUMBER | .env (project root) | E.164 number that receives the daily digest and handoff briefings; see Configuration reference |

Set this before starting the stack. If it is unset, the digest job will log a warning and skip the send; it does not crash the worker.
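
The skip-on-unset behaviour could look roughly like the sketch below. Everything except the OPERATOR_WHATSAPP_NUMBER variable name is an assumption: the function name resolve_operator_number and the digest.skipped log message are hypothetical, the standard logging module stands in for structlog, and the pattern assumes the number is stored with a leading "+".

```python
import logging
import os
import re
from typing import Mapping, Optional

logger = logging.getLogger("digest")

# E.164: a leading "+", then at most 15 digits, the first of which is non-zero.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")


def resolve_operator_number(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the operator number, or None (after a warning) if it is unusable."""
    raw = env.get("OPERATOR_WHATSAPP_NUMBER", "").strip()
    if not raw:
        logger.warning("digest.skipped: OPERATOR_WHATSAPP_NUMBER is unset")
        return None
    if not E164_PATTERN.match(raw):
        logger.warning("digest.skipped: OPERATOR_WHATSAPP_NUMBER is not E.164")
        return None
    return raw
```

A caller would return early when the result is None, so a missing value costs one warning line per run and nothing else.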

Adding a new metric

When the digest job is implemented, add new metrics by:

  1. Writing a query against public.tenants + per-tenant schema tables, or aggregating from recent structlog output.
  2. Appending a line to the DigestPayload dataclass in the digest module.
  3. Including the field in the WhatsApp message template string.

There is no schema migration required for adding digest metrics — the digest is read-only and aggregates from existing tables.
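
Steps 2 and 3 above might be sketched as follows. Only the DigestPayload name comes from this page. The field names, the render_digest function, and the message wording are illustrative assumptions, since the digest module has not been written yet. The example adds one new metric, unrouted_callbacks, with a default so existing call sites keep working.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestPayload:
    # Hypothetical fields mirroring the digest contents listed on this page.
    bookings_confirmed: int
    mpesa_paid: int
    pesapal_paid: int
    payments_abandoned: int
    handoffs: int
    # Step 2: the new metric, defaulted so older callers do not break.
    unrouted_callbacks: int = 0


def render_digest(payload: DigestPayload) -> str:
    """Build the free-form WhatsApp message body from the payload."""
    lines = [
        "Ratiba daily digest",
        f"Bookings confirmed: {payload.bookings_confirmed}",
        f"Payments: M-Pesa {payload.mpesa_paid}, PesaPal {payload.pesapal_paid}, "
        f"abandoned {payload.payments_abandoned}",
        f"Handoffs: {payload.handoffs}",
        # Step 3: include the new field in the message.
        f"Unrouted callbacks: {payload.unrouted_callbacks}",
    ]
    return "\n".join(lines)
```

Keeping the payload a plain frozen dataclass means a new metric touches exactly two places: one field and one message line.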


What to watch

These are the five signals that matter most during the M13 beta. Check them each morning by tailing logs or reviewing the daily digest.

Booking success rate

Healthy: bookings that enter COLLECT_SLOT should reach DONE at a rate above 60%. A sudden drop indicates either a service catalogue problem (booking_graph.no_active_services) or an FSM regression.

docker compose logs --no-log-prefix --since 24h backend \
  | jq -r 'select(.event == "dispatcher.intent_classified" and .intent == "booking") | .fsm_state' \
  | sort | uniq -c

Compare DONE counts against COLLECT_SLOT entry counts.
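
The jq output counts turns, not bookings, so a thread that spends several turns in COLLECT_SLOT is counted several times. One way to get a per-thread rate is sketched below. It assumes, as the recipe above implies, that both COLLECT_SLOT and DONE appear as fsm_state on dispatcher.intent_classified lines. The function name booking_completion_rate is hypothetical.

```python
from typing import Iterable, Optional


def booking_completion_rate(events: Iterable[dict]) -> Optional[float]:
    """Share of threads seen in COLLECT_SLOT that were also seen in DONE.

    Returns None when no thread entered COLLECT_SLOT.
    """
    entered = set()
    done = set()
    for event in events:
        if event.get("event") != "dispatcher.intent_classified":
            continue
        if event.get("intent") != "booking":
            continue
        thread_id = event.get("thread_id")
        if thread_id is None:
            continue
        state = event.get("fsm_state")
        if state == "COLLECT_SLOT":
            entered.add(thread_id)
        elif state == "DONE":
            done.add(thread_id)
    if not entered:
        return None
    # Only count completions for threads that entered within the same log window.
    return len(done & entered) / len(entered)
```

A result below 0.6 corresponds to the unhealthy range described above.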

Payment reconciliation lag

Any payment.callback.daraja_dead_letter or payment.callback.pesapal_dead_letter event means a callback arrived after the FSM had already timed out. These land in public.payment_callbacks_unrouted and need manual reconciliation. Check daily:

docker compose exec postgres psql -U ratiba ratiba \
-c "SELECT count(*) FROM public.payment_callbacks_unrouted WHERE created_at > NOW() - INTERVAL '24 hours';"

A count of zero is the target. A non-zero count means the daily reaper or callback routing has a timing issue; see the Incidents runbook payment section.

FSM state distribution

A healthy system has the majority of threads in terminal states (DONE, PAYMENT_FAILED) or GREETING (fresh threads). An accumulation of threads stuck in COLLECT_SLOT or PAYMENT_PENDING indicates a stall.

docker compose exec -T redis redis-cli -a ratiba_redis_password \
  --scan --pattern "ratiba:fsm:thread:*" \
  | xargs -I{} docker compose exec -T redis redis-cli -a ratiba_redis_password GET {} \
  | jq -r '.fsm_state' | sort | uniq -c | sort -rn
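
To turn that tally into a single stall indicator, the values returned by GET can be summarised in Python. This is a sketch under the same assumption the jq filter makes, namely that each Redis value is a JSON object with an fsm_state key. The function names and the UNKNOWN / UNPARSEABLE buckets are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# States that indicate a stall when threads accumulate in them, per this page.
STALL_STATES = ("COLLECT_SLOT", "PAYMENT_PENDING")


def state_distribution(raw_values: Iterable[str]) -> Counter:
    """Count threads per FSM state from raw Redis values."""
    counts: Counter = Counter()
    for raw in raw_values:
        try:
            state = json.loads(raw).get("fsm_state", "UNKNOWN")
        except (json.JSONDecodeError, AttributeError, TypeError):
            # Not JSON, not an object, or a missing (None) value.
            state = "UNPARSEABLE"
        counts[state] += 1
    return counts


def stalled_share(counts: Counter) -> float:
    """Fraction of all threads that sit in a stall-prone state."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[state] for state in STALL_STATES) / total
```

What share counts as unhealthy is a judgment call for the operator; this page only says that an accumulation in these two states indicates a stall.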

Cost per booking

The handoff.triggered event carries llm_cost_usd — the accumulated LLM spend for that conversation thread. A healthy booking costs well under $0.05 (the ADR-0005 D3 soft ceiling). Conversations hitting the soft ceiling automatically escalate to handoff; handoff.cost_soft_ceiling_observed in the logs is the signal. Track the rolling average:

docker compose logs --no-log-prefix --since 7d backend \
  | jq 'select(.event == "handoff.triggered") | .llm_cost_usd | tonumber' \
  | awk '{sum+=$1; n++} END {if (n) print "avg:", sum/n, "n:", n; else print "no handoff events"}'
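
Because llm_cost_usd is a Decimal string, the same average can be computed without floating-point rounding. The sketch below is illustrative only: the function name and the returned keys are hypothetical. Like the pipeline above, it only sees conversations that escalated to handoff, since that is the only event this page documents as carrying llm_cost_usd.

```python
from decimal import Decimal
from typing import Iterable, Optional

# Soft ceiling from ADR-0005 D3, as stated on this page.
SOFT_CEILING_USD = Decimal("0.05")


def summarise_handoff_costs(events: Iterable[dict]) -> Optional[dict]:
    """Average llm_cost_usd over handoff.triggered events, or None if there are none."""
    costs = [
        Decimal(str(event["llm_cost_usd"]))
        for event in events
        if event.get("event") == "handoff.triggered"
        and event.get("llm_cost_usd") is not None
    ]
    if not costs:
        return None
    return {
        "n": len(costs),
        "average": sum(costs, Decimal("0")) / len(costs),
        "at_or_over_ceiling": sum(1 for cost in costs if cost >= SOFT_CEILING_USD),
    }
```

The at_or_over_ceiling count should roughly track the number of handoff.cost_soft_ceiling_observed warnings in the same window.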

Knowledge overflow frequency

knowledge_overflow warnings fire when a tenant's knowledge_snippets table exceeds the 20-snippet / ~1500-character cap for a given intent. This is the Phase-0 → Phase-1 graduation signal: when it fires more than a handful of times per day, it's time to implement real embedding-based retrieval. See Knowledge answers for the Phase-1 upgrade path.

docker compose logs --no-log-prefix --since 24h backend | jq 'select(.event == "knowledge_overflow")'

Target posture

The following tools are NOT in M13 scope. They are the planned upgrade path for post-pilot scale (50+ tenants, VPS production deployment).

| Tool | Role | When |
| --- | --- | --- |
| Langfuse v4 (self-hosted) | Prompt traces, LLM span observability, prompt version management, eval dataset management | M14. The Langfuse sink seam already exists in app/llm/adapters/ (langfuse_sink.skip_empty log event); wiring it to a running Langfuse instance is a configuration change, not a code change |
| Grafana + Loki | Log aggregation, dashboards, alerting at scale | M14/M15 when structured logs outgrow a terminal jq session |
| Sentry | Exception tracking, error grouping, release tracking | M14 when unhandled exceptions need automatic triage rather than manual log grepping |
| Prometheus + Grafana | Metrics (booking rate, payment success, FSM state distribution) | M15+ when the signal-to-noise ratio of log-based metrics becomes painful |

The structlog event design is forward-compatible with all of the above: Loki ingests JSON stdout directly; Langfuse traces map one-to-one to dispatcher turns; Sentry exception capture is an SDK initialisation step that doesn't require changing any log call sites.

Until M14 tooling lands, the combination of docker compose logs | jq, the daily digest, and the Incidents runbook covers everything needed for a 5–10 tenant beta cohort.