Skip to main content

Data flow

What this page covers

This page traces the temporal flow of data through Ratiba: a customer's message arrives over a channel, traverses the system, and produces an outbound utterance plus a state mutation. The same substrate carries four flow shapes worth understanding individually:

  1. Inbound customer message — booking path — the canonical hot path. Webhook → adapter → identity → tenant scope → FSM dispatcher → FSM step → checkpoint write → outbound adapter.
  2. Inbound customer message — inquiry/answer path — when the FSM determines the customer is asking a factual question (services, hours, general info), the dispatcher calls inquiry.py, which calls knowledge.fetch_snippets(intent) and injects the result into raw_data["knowledge"] before handing off to answer_shaper.py. See the Knowledge answers explainer for the full "no-RAG RAG" design (ADR-0013).
  3. Payment callback — Safaricom Daraja calls our webhook asynchronously (seconds-to-minutes after STK push); we route the callback back into the originating FSM thread via Postgres pg_notify, or dead-letter it if the correlation row is gone.
  4. Admin reply — an admin sends /handback (or natural-language equivalent) over WhatsApp; the same webhook ingress fans out through the AdminMessageRouter and emits a NOTIFY that resumes the originating customer thread.

State persistence is the two-tier model from ADR-0003: Redis for the live hot path, Postgres LangGraph checkpoints for the warm/cold long-tail. Both are written every FSM step; reads are Redis-first with Postgres hydration on cache miss.

Inbound customer message lifecycle

The hot path runs end-to-end inside one HTTP request to the channel webhook. The diagram below is the canonical happy path for WhatsApp Cloud API; Tier-2 channels (Instagram, Messenger, web widget) follow the same shape — only the adapter and the identity-resolution prelude differ. The diagram shows both the booking branch and the inquiry/knowledge-answer branch at the FSM node fork.

Key invariants:

  • HMAC verification first: an unverified payload never touches identity resolution or the FSM. WhatsApp uses a project-level WHATSAPP_APP_SECRET (per ADR-0008); IG and Messenger use their own project-level secrets (per ADR-0009).
  • Tenant scope via ContextVar: current_tenant is set once at webhook ingress and read transparently by every downstream layer (DB pool selection, prompt rendering, checkpoint key namespace). This is the asyncio-native equivalent of thread-local tenancy. See Identity and tenancy for the full ContextVar propagation description.
  • FSM mutex via Redis SETNX: per-thread serialisation prevents concurrent inbound turns from racing the FSM (ADR-0003 D6: exponential backoff, 10s ceiling, bilingual failure response).
  • Knowledge-snippet injection (inquiry branch): when the FSM classifies a turn as an inquiry intent, knowledge.fetch_snippets(intent) runs a category-routed SQL query against the per-tenant knowledge_snippets table and returns up to 20 snippets (≤1500 tokens). The dispatcher injects them into raw_data["knowledge"] before the answer_shaper call. The answer_shaper user template splices the snippets in at render time so the system prompt byte sequence stays cross-tenant-identical and Anthropic prompt-cache hits are preserved (per ADR-0013
    • ADR-0010 D9). If no snippets match, the existing deflection response fires and a knowledge_gap_candidate log event is emitted. See the full design at Knowledge answers.
  • Checkpoint write per FSM step: LangGraph's AsyncPostgresSaver writes one checkpoint row per node transition; the saver is scoped to the per-tenant schema by TenantScopedSaverFactory.

State persistence — two-tier model (ADR-0003)

Live FSM state lives in Redis (hot path: low-latency reads, payment locks, slot reservations). Long-tail recovery and audit live in Postgres checkpoints_<tenant>.checkpoints (warm/cold path: durable across Redis restart, queryable for replay). The dispatcher reads Redis first; on cache miss it hydrates from the most recent Postgres checkpoint.

Cache-coherence rules:

  • Writes are Redis + Postgres: the dispatcher writes the hydrated state back to Redis after every step; the LangGraph saver writes the checkpoint row in the same logical turn.
  • Reads are Redis-first: Redis miss → load latest checkpoint from Postgres → repopulate Redis. A cold restart of the cache layer is recoverable with no FSM data loss.
  • Retention asymmetry: Redis TTLs are short (30s mutex, 24h channel-switch token). Postgres checkpoints are kept for 90 days, then moved to checkpoints_archive by the daily reaper (ADR-0007 D9: consolidated 3 AM EAT reaper covers checkpoints, payment routing, and handoff log together).

Payment callback flow

Safaricom Daraja calls our webhook asynchronously after the customer completes (or abandons) the STK push prompt. There is no HMAC on Daraja callbacks (per ADR-0007); authenticity comes from source-IP allowlist plus correlation lookup against payments we actually initiated.

Key constraints:

  • Always 200 / Accepted: the webhook returns the canonical Daraja ack regardless of routing outcome. A non-2xx (or non-zero ResultCode) puts Daraja into an indefinite retry loop; the dead-letter table is the correct safety net.
  • Correlation via CheckoutRequestID: this is the only field that ties a callback back to a specific FSM thread. The payment_routing row is written at STK-push time; the daily reaper trims rows older than 90 days, so very late callbacks (rare) land in the dead-letter table.
  • NOTIFY + LISTEN over Redis pub/sub: payment state changes use Postgres pg_notify because the write and the notification must be transactionally atomic (Postgres only delivers NOTIFY on COMMIT).

PesaPal callbacks (cards, ADR-0007 D6) follow the same shape, with the correlation column swapped for the PesaPal OrderTrackingId.

Admin reply flow

When an admin sends /handback (or a natural-language equivalent routed through AdminMessageRouter) over WhatsApp, the same channel ingress is reused. The webhook distinguishes admin from customer by matching the sender phone against public.tenants.admin_phone_e164; admin inbound is then dispatched through the admin command pipeline, which can in turn emit a NOTIFY that resumes the originating customer thread.

The admin ingress reuses every layer below the router: same HMAC verification, same identity resolution (admin sender lookup is one SQL query), same tenant ContextVar, same outbound adapter. The only new pieces are the AdminMessageRouter (NL → slash translation) and the admin_state_events NOTIFY channel, which the customer-thread dispatcher already listens on (the M9 contract).

  • Conversation FSM — node-level walk through booking_graph, cancel_graph, reschedule_graph.
  • Knowledge answers — the "no-RAG RAG" design: knowledge_snippets table, category→intent routing, snippet injection into raw_data["knowledge"], prompt-cache invariant, Phase-0→Phase-1 graduation trigger. (ADR-0013)
  • Channel substrate — the Channel protocol, Tier-1 vs Tier-2 capability flags, adapter registration.
  • Identity and tenancy — how resolve() produces a ResolvedIdentity, ContextVar propagation, cross-channel customer merge (phone-only, deterministic).
  • Payments — STK push lifecycle, PesaPal flow, cancellation reversal, dead-letter triage.
  • Admin orchestrator — admin FSM (4 states, 4h TTL), /handback and NL routing, the dual-channel admin rail (dashboard free + WhatsApp opt-in paid).
  • Voice conversation — how the voice channel replaces the HTTP round-trip with a LiveKit room session; the full-duplex turn-taking modules (barge_in, backchannel, hard_interrupt, listening_ack, voice_speed) sit between the STT stream and the FSM dispatcher.