Data flow
What this page covers
This page traces the temporal flow of data through Ratiba: a customer's message arrives over a channel, traverses the system, and produces an outbound utterance plus a state mutation. The same substrate carries four flow shapes worth understanding individually:
- Inbound customer message — booking path — the canonical hot path. Webhook → adapter → identity → tenant scope → FSM dispatcher → FSM step → checkpoint write → outbound adapter.
- Inbound customer message — inquiry/answer path — when the FSM
determines the customer is asking a factual question (services, hours,
general info), the dispatcher calls
inquiry.py, which callsknowledge.fetch_snippets(intent)and injects the result intoraw_data["knowledge"]before handing off toanswer_shaper.py. See the Knowledge answers explainer for the full "no-RAG RAG" design (ADR-0013). - Payment callback — Safaricom Daraja calls our webhook
asynchronously (seconds-to-minutes after STK push); we route the
callback back into the originating FSM thread via Postgres
pg_notify, or dead-letter it if the correlation row is gone. - Admin reply — an admin sends
/handback(or natural-language equivalent) over WhatsApp; the same webhook ingress fans out through the AdminMessageRouter and emits a NOTIFY that resumes the originating customer thread.
State persistence is the two-tier model from ADR-0003: Redis for the live hot path, Postgres LangGraph checkpoints for the warm/cold long-tail. Both are written every FSM step; reads are Redis-first with Postgres hydration on cache miss.
Inbound customer message lifecycle
The hot path runs end-to-end inside one HTTP request to the channel webhook. The diagram below is the canonical happy path for WhatsApp Cloud API; Tier-2 channels (Instagram, Messenger, web widget) follow the same shape — only the adapter and the identity-resolution prelude differ. The diagram shows both the booking branch and the inquiry/knowledge-answer branch at the FSM node fork.
Key invariants:
- HMAC verification first: an unverified payload never touches
identity resolution or the FSM. WhatsApp uses a project-level
WHATSAPP_APP_SECRET(per ADR-0008); IG and Messenger use their own project-level secrets (per ADR-0009). - Tenant scope via ContextVar:
current_tenantis set once at webhook ingress and read transparently by every downstream layer (DB pool selection, prompt rendering, checkpoint key namespace). This is the asyncio-native equivalent of thread-local tenancy. See Identity and tenancy for the full ContextVar propagation description. - FSM mutex via Redis SETNX: per-thread serialisation prevents concurrent inbound turns from racing the FSM (ADR-0003 D6: exponential backoff, 10s ceiling, bilingual failure response).
- Knowledge-snippet injection (inquiry branch): when the FSM
classifies a turn as an inquiry intent,
knowledge.fetch_snippets(intent)runs a category-routed SQL query against the per-tenantknowledge_snippetstable and returns up to 20 snippets (≤1500 tokens). The dispatcher injects them intoraw_data["knowledge"]before theanswer_shapercall. Theanswer_shaperuser template splices the snippets in at render time so the system prompt byte sequence stays cross-tenant-identical and Anthropic prompt-cache hits are preserved (per ADR-0013- ADR-0010 D9).
If no snippets match, the existing deflection response fires and a
knowledge_gap_candidatelog event is emitted. See the full design at Knowledge answers.
- ADR-0010 D9).
If no snippets match, the existing deflection response fires and a
- Checkpoint write per FSM step: LangGraph's
AsyncPostgresSaverwrites one checkpoint row per node transition; the saver is scoped to the per-tenant schema byTenantScopedSaverFactory.
State persistence — two-tier model (ADR-0003)
Live FSM state lives in Redis (hot path: low-latency reads, payment
locks, slot reservations). Long-tail recovery and audit live in
Postgres checkpoints_<tenant>.checkpoints (warm/cold path: durable
across Redis restart, queryable for replay). The dispatcher reads
Redis first; on cache miss it hydrates from the most recent Postgres
checkpoint.
Cache-coherence rules:
- Writes are Redis + Postgres: the dispatcher writes the hydrated state back to Redis after every step; the LangGraph saver writes the checkpoint row in the same logical turn.
- Reads are Redis-first: Redis miss → load latest checkpoint from Postgres → repopulate Redis. A cold restart of the cache layer is recoverable with no FSM data loss.
- Retention asymmetry: Redis TTLs are short (30s mutex, 24h
channel-switch token). Postgres checkpoints are kept for 90 days,
then moved to
checkpoints_archiveby the daily reaper (ADR-0007 D9: consolidated 3 AM EAT reaper covers checkpoints, payment routing, and handoff log together).
Payment callback flow
Safaricom Daraja calls our webhook asynchronously after the customer completes (or abandons) the STK push prompt. There is no HMAC on Daraja callbacks (per ADR-0007); authenticity comes from source-IP allowlist plus correlation lookup against payments we actually initiated.
Key constraints:
- Always 200 / Accepted: the webhook returns the canonical Daraja
ack regardless of routing outcome. A non-2xx (or non-zero
ResultCode) puts Daraja into an indefinite retry loop; the dead-letter table is the correct safety net. - Correlation via
CheckoutRequestID: this is the only field that ties a callback back to a specific FSM thread. Thepayment_routingrow is written at STK-push time; the daily reaper trims rows older than 90 days, so very late callbacks (rare) land in the dead-letter table. - NOTIFY + LISTEN over Redis pub/sub: payment state changes use
Postgres
pg_notifybecause the write and the notification must be transactionally atomic (Postgres only delivers NOTIFY on COMMIT).
PesaPal callbacks (cards, ADR-0007 D6) follow the same shape, with
the correlation column swapped for the PesaPal OrderTrackingId.
Admin reply flow
When an admin sends /handback (or a natural-language equivalent
routed through AdminMessageRouter) over WhatsApp, the same channel
ingress is reused. The webhook distinguishes admin from customer by
matching the sender phone against public.tenants.admin_phone_e164;
admin inbound is then dispatched through the admin command pipeline,
which can in turn emit a NOTIFY that resumes the originating
customer thread.
The admin ingress reuses every layer below the router: same HMAC
verification, same identity resolution (admin sender lookup is one
SQL query), same tenant ContextVar, same outbound adapter. The only
new pieces are the AdminMessageRouter (NL → slash translation) and
the admin_state_events NOTIFY channel, which the customer-thread
dispatcher already listens on (the M9 contract).
Cross-links
- Conversation FSM — node-level
walk through
booking_graph,cancel_graph,reschedule_graph. - Knowledge answers — the
"no-RAG RAG" design:
knowledge_snippetstable, category→intent routing, snippet injection intoraw_data["knowledge"], prompt-cache invariant, Phase-0→Phase-1 graduation trigger. (ADR-0013) - Channel substrate — the
Channelprotocol, Tier-1 vs Tier-2 capability flags, adapter registration. - Identity and tenancy — how
resolve()produces aResolvedIdentity, ContextVar propagation, cross-channel customer merge (phone-only, deterministic). - Payments — STK push lifecycle, PesaPal flow, cancellation reversal, dead-letter triage.
- Admin orchestrator — admin
FSM (4 states, 4h TTL),
/handbackand NL routing, the dual-channel admin rail (dashboard free + WhatsApp opt-in paid). - Voice conversation — how the
voice channel replaces the HTTP round-trip with a LiveKit room session;
the full-duplex turn-taking modules (
barge_in,backchannel,hard_interrupt,listening_ack,voice_speed) sit between the STT stream and the FSM dispatcher.