ADR-0003: Conversation FSM Persistence Model
Status: Accepted Date: 2026-04-25
Context
Ratiba's product thesis ("the AI agent IS the UI" + "conversation state IS
canonical state for in-flight bookings") makes the conversation FSM a
load-bearing piece of the architecture, not an implementation detail.
A1 §8 (docs/research/2026-04-25-orchestration-patterns.md) settled the
high-level shape: two-tier persistence with Redis 7 as hot cache and
Postgres + LangGraph checkpoint tables as the durable store, accessed
through the per-invocation TenantScopedSaver wrapper from the spike
(docs/research/2026-04-25-langgraph-postgressaver-spike.md).
What this ADR settles is the operational model of that persistence:
how Redis keys are namespaced and isolated, how the LangGraph
thread_id is derived (and how a fresh booking gets a fresh thread),
how reservations and races are mediated, how the 90-day retention from
spec §12 (Q6) actually archives data, and how cold-start recovery
behaves. These are inputs to the orchestrator implementation that
otherwise would be re-derived in five different files.
ADR-0002 settled the multi-tenant primitives (schema-per-tenant, TenantScopedSaver, asyncio contextvar, two-pool pattern) that this ADR builds on directly; ADR-0003 inherits all of them without re-deriving.
Decision
Seven decisions, organized as a coherent persistence model.
1. Two-tier persistence (architectural recap)
Inherited from A1 §8 and locked here as the canonical model.
| Tier | Store | Lifetime | Purpose |
|---|---|---|---|
| Hot | Redis 7 | TTL ≤ 30 min | Conversation thread pointer; reservation locks; webhook dedup; rate limiting |
| Durable | Postgres (per-tenant schema, LangGraph checkpoint tables) | 90 days live + archive thereafter | Source of truth for FSM state; audit trail; cross-channel migration; cold-start recovery |
Read pattern. On every inbound message, the orchestrator first consults Redis for the customer's current thread pointer (microseconds); falls back to Postgres if Redis is cold (single indexed query). LangGraph hydrates the FSM state from the latest checkpoint of the resolved thread.
Write pattern. Every FSM transition writes a checkpoint via
LangGraph's PostgresSaver (durable). Reservation locks, dedup keys,
and rate-limit counters write to Redis only (hot, ephemeral).
2. Redis isolation — shared instance with prefixed keys
Single Redis instance for all tenants, with rigorous key-prefix
discipline enforced by a single key-builder module
(backend/app/persistence/redis_keys.py). Application code never
formats Redis keys directly; the key builder is the only place that
knows the prefix conventions.
Canonical key shapes:
| Key | Type | TTL | Value |
|---|---|---|---|
ratiba:thread:{tenant_id}:{phone_e164} | string | 1800s (30 min) | Current thread_id for the customer's active booking. Set on FSM entry; cleared on final-state entry. |
ratiba:fsm:{tenant_id}:{thread_id} | hash | 1800s (30 min) | Cached FSM cursor: {current_state, last_seen_at, retry_count}. Hot read; durable copy in Postgres checkpoint. |
ratiba:lock:reservation:{tenant_id}:{slot_key} | string (NX) | 300s (5 min) | Tentative slot reservation held while CONFIRM is awaited. |
ratiba:lock:thread:{thread_id} | string (NX) | 30s | Per-thread mutex against parallel WhatsApp + voice ingress (D6). |
ratiba:dedup:wa:{tenant_id}:{message_id} | string | 86400s (24h) | 360dialog message-ID idempotency. |
ratiba:dedup:mpesa:{checkout_request_id} | string | 86400s (24h) | Daraja STK callback idempotency. |
ratiba:dedup:pesapal:{order_tracking_id} | string | 86400s (24h) | PesaPal IPN idempotency. |
ratiba:rate:{phone_e164} | string (INCR) | 60s | Inbound count per minute (anti-abuse; threshold per ADR-0006). |
ratiba:stkpush:{checkout_request_id} | hash | 300s (5 min) | M-Pesa STK callback → conversation routing aid. |
Why shared Redis is acceptable. Redis holds hot cache state only; it never holds canonical state. The canonical isolation lives in Postgres via schema-per-tenant (ADR-0002). A Redis bug or mis-keyed read can at worst leak hot cursor data — the consequence is a transient routing error, not a cross-tenant data leak. Real isolation costs (per-tenant Redis databases or instances) buy nothing proportional to that risk.
Eviction policy. TTL only. No maxmemory-policy eviction. Redis is
sized for the working set (estimated < 50k active conversations at
1k tenants × 50 active per tenant); if we approach the ceiling, scale
Redis vertically before introducing eviction (an evicted FSM cursor
without a matching Postgres checkpoint would orphan the conversation —
unacceptable).
3. Postgres durable store — per-tenant LangGraph checkpoint tables
Inherited from ADR-0002 D2 (created by PostgresSaver(conn=tenant_conn).setup()
in onboarding Phase B) and ADR-0001 amendment (TenantScopedSaver wrapper
per spike Option A). Tables:
checkpoints checkpoint_blobs checkpoint_writes checkpoint_migrations
Live in each tenant schema; accessed exclusively through the
TenantScopedSaverFactory.for_tenant(tenant_id) wrapper from
backend/app/persistence/checkpointer.py (the only module that imports
psycopg, per the ADR-0001 amendment scoping rule).
The wrapper acquires a per-tenant micro-pool connection (size 1-2,
lazy + idle-evicting), pre-sets search_path to the tenant schema,
and constructs a vanilla PostgresSaver(conn=that_connection) for the
duration of one graph invocation.
4. Thread ID and conversation_threads pointer table
Decision: fresh thread_id per booking attempt. Each new booking
is a self-contained LangGraph thread with its own checkpoint chain;
the customer's identity is the stable axis, not the thread.
Thread ID format. {tenant_id}:{phone_e164}:{ulid} where the ULID
is generated at thread creation. ULID is preferred over UUIDv4 because
it is lexically sortable by creation time — SELECT ... ORDER BY thread_id orders bookings chronologically without an extra timestamp.
Example: 9f1c2b71-...:+254712345678:01HFM6Z9YQK4M2EXB7DH3RG0NV.
Pointer table (per-tenant schema):
CREATE TABLE conversation_threads (
customer_phone VARCHAR(20) NOT NULL,
thread_id VARCHAR(100) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
closed_at TIMESTAMPTZ, -- set on final-state entry
closed_reason VARCHAR(40), -- 'done' | 'abandon' | 'closed_by_human'
channel_started VARCHAR(20) NOT NULL, -- 'whatsapp' | 'voice'
PRIMARY KEY (customer_phone, thread_id)
);
CREATE INDEX idx_threads_active
ON conversation_threads (customer_phone)
WHERE closed_at IS NULL;
The partial index idx_threads_active makes the active-thread lookup
a single index scan: SELECT thread_id FROM conversation_threads WHERE customer_phone = $1 AND closed_at IS NULL returns 0 or 1 row.
Resolution flow on inbound message:
1. Read Redis: ratiba:thread:{tenant_id}:{phone}
├── HIT → use that thread_id (hot path)
└── MISS → step 2
2. Query Postgres: conversation_threads WHERE customer_phone = phone
AND closed_at IS NULL
├── 1 row → use that thread_id; cache it back to Redis
└── 0 rows → step 3
3. Generate new ULID; new thread_id = "{tenant}:{phone}:{ulid}"
├── INSERT INTO conversation_threads (customer_phone, thread_id, channel_started)
├── SET ratiba:thread:{tenant}:{phone} = new thread_id (TTL 1800s)
└── New LangGraph thread starts at FSM state GREET
Final-state entry (FSM transitions to DONE, ABANDON, or
CLOSED_BY_HUMAN):
1. UPDATE conversation_threads
SET closed_at = NOW(), closed_reason = '<state>'
WHERE customer_phone = phone AND thread_id = current_thread_id
2. DEL ratiba:thread:{tenant_id}:{phone}
3. The next inbound message for this customer triggers steps 2-3 above
(no active row found → fresh thread)
Why fresh thread per booking. A LangGraph thread's checkpoint chain
is single-purpose. Reusing the same thread_id across multiple
bookings would require either an is_final flag on FSM state with
custom resume logic that knows to ignore final checkpoints (more
complex than this design), or accumulating arbitrarily long checkpoint
chains over a customer's lifetime (slower hydration over time). The
fresh-thread pattern keeps each booking's checkpoint history bounded
and self-contained. Cross-booking analytics ("how many bookings has
this customer made?") is a single SELECT COUNT(*) FROM conversation_threads WHERE customer_phone = phone — affordable.
Cross-channel migration (WhatsApp → voice mid-flow): A1 §8.5's
emergent property still holds. The thread pointer is keyed by
customer_phone, not by channel; the voice ingress resolves the same
pointer the WhatsApp ingress would. The BookingState.channel field
flips on the first voice-channel hydration; the AnswerShaper switches
to voice-shaped responses for the rest of the flow.
5. Retention and cold archive — per-tenant checkpoints_archive
Q6 from spec §12: 90 days live + cold archive thereafter. This ADR specifies the mechanism.
Per-tenant archive table (created in onboarding Phase A alongside
checkpoints):
CREATE TABLE checkpoints_archive (LIKE checkpoints INCLUDING ALL);
CREATE TABLE checkpoint_blobs_archive (LIKE checkpoint_blobs INCLUDING ALL);
CREATE TABLE checkpoint_writes_archive (LIKE checkpoint_writes INCLUDING ALL);
-- checkpoint_migrations is small (< 100 rows lifetime); not archived
LIKE ... INCLUDING ALL clones the column types, defaults, indexes,
and constraints — archive rows are a perfect mirror of live rows.
Archive job (backend/app/persistence/archive_checkpoints.py,
runs daily via the existing reaper infrastructure that also purges
public.payment_routing past TTL):
-- Per-tenant, in tenant schema, single transaction:
BEGIN;
INSERT INTO checkpoints_archive
SELECT * FROM checkpoints
WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM checkpoints
WHERE created_at < NOW() - INTERVAL '90 days';
-- Same for checkpoint_blobs and checkpoint_writes (referencing thread_id of
-- already-archived checkpoints).
COMMIT;
Compliance / dispute access pattern. When an operator needs to investigate a booking past the 90-day window, the diagnostic query unions live + archive:
SELECT * FROM (
SELECT * FROM checkpoints WHERE thread_id = $1
UNION ALL
SELECT * FROM checkpoints_archive WHERE thread_id = $1
) AS all_checkpoints
ORDER BY checkpoint_id;
This is in the auto-debug logging recipe library (docs/methodology/log-diagnostics.md)
once the first such incident lands.
When to revisit. When a single tenant's checkpoints table
exceeds ~10M rows, switch that tenant to declarative partitioning by
month and detach old partitions instead of moving rows. ADR-0003
amendment at that point.
6. Cold-start recovery semantics
Inherited from A1 §8.4 with this ADR's specifics added.
| Scenario | Behaviour |
|---|---|
| Backend restart, Redis intact | No-op for in-flight FSMs. Redis still holds thread pointers; the next inbound message resolves the active thread_id from Redis and LangGraph hydrates from the latest Postgres checkpoint of that thread. |
| Redis flush, Postgres intact | All Redis pointers + locks lost. On next inbound for (tenant, phone), step 1 (Redis hit) misses; step 2 falls back to conversation_threads Postgres query. If an active row exists, the thread is re-adopted and Redis pointer rebuilt; if not, a fresh thread starts. Tentative reservations held in ratiba:lock:reservation:* are gone — the user is re-prompted to pick a slot. Acceptable degradation. |
| Postgres tenant schema lost | Catastrophic — appointments, contacts, threads, checkpoints all gone. Out of scope for this ADR; this is a backup/restore concern. |
Single thread's checkpoint corrupted (a single row in checkpoints is unreadable) | Hydration fails on that thread. Operator can manually CLOSE the thread (UPDATE conversation_threads SET closed_at = NOW(), closed_reason = 'corruption') — the next inbound for the customer starts fresh. The corrupted thread stays in Postgres for forensics. |
7. Reservation-lock and per-thread mutex semantics
Two distinct lock kinds, both via Redis SETNX. Use cases differ.
Reservation lock (ratiba:lock:reservation:{tenant_id}:{slot_key}):
held when a customer enters CONFIRM but hasn't yet committed
(button-tap or M-Pesa STK callback). TTL 300s (5 min). Prevents
two customers from booking the same slot during the 5-min confirmation
window. On CONFIRM → DONE, the lock is replaced by a real row in
appointments. On CONFIRM → ABANDON or TTL expiry, the slot is
released.
Per-thread mutex (ratiba:lock:thread:{thread_id}): held when an
ingress handler is actively processing a turn for that thread. Prevents
parallel WhatsApp + voice ingress from racing the same thread.
| Property | Value |
|---|---|
| TTL | 30 seconds (longer than typical turn processing; expires automatically on handler crash) |
| Retry strategy | Exponential backoff: 500ms → 1s → 2s → 4s, max 4 retries |
| Total wait ceiling | ~10 seconds before the loser gives up |
| Failure response (English) | "One moment please — I'm catching up on your previous message." |
| Failure response (Swahili) | "Subiri kidogo — naendelea na ujumbe wako wa awali." |
| Observability | Every contention event logs event_type=fsm.race.contention with correlation_ids populated for both contenders |
Race semantics are intentionally aimed at functional correctness over optimization. The case (customer simultaneously messages on WhatsApp
- calls in via voice) is genuinely rare in practice; the design accepts ~10 seconds of perceived latency in the rare race case as a trade for clean correctness in 99.9% of the happy-path traffic. Optimization (shorter ceilings, better backoff curves, optimistic concurrency) defers until production data shows the race is more frequent than projected.
Consequences
Positive.
- Each booking is a self-contained LangGraph thread, so checkpoint hydration time is bounded by the length of one booking flow (~10-20 checkpoints) rather than growing with the customer's lifetime relationship with the tenant.
- Cross-booking analytics is trivial.
conversation_threadsis the customer-history table. Single SELECT for "how many bookings," "how many abandoned at PAY," "average booking duration." - Redis-flush recovery is gentle. The Postgres pointer table is the durable fallback; one missed slot reservation is the worst-case user experience.
- Cold archive preserves tenant isolation because the archive
tables live in the tenant's own schema. Compliance / dispute
queries
UNIONlive + archive without crossing schema boundaries. - Race semantics prioritize functional correctness over speed, in line with the "functional excellence first, optimize later" principle stated 2026-04-25.
Negative.
- Two writes per inbound on cold-Redis path. Redis miss → Postgres query → Redis pointer write. Negligible (~5ms additional latency) but worth noting; the hot path is single Redis read.
- Quadratic thread-count growth with customer lifetime. A spa
customer who books 12 times/year over 5 years accumulates 60 rows
in
conversation_threads. Acceptable at any plausible scale; the table per-tenant should stay under 100k rows for the life of any plausible business. - Archive job adds a daily database write window. Mitigation: run
off-peak (3 AM EAT); operations on
checkpointsduring the archive window may see brief lock contention. The job uses transactions per tenant, not a single global transaction, so contention is bounded per-tenant per-batch. - Per-thread mutex on Redis means a Redis outage degrades to "two
handlers might process same thread." LangGraph's checkpoint writes
are append-only and parent-chained, so the second writer's write
either chains correctly or conflicts on a unique constraint
(
thread_id, checkpoint_id). Conflict detection is the backstop when Redis is unavailable; the orchestrator drops the second invocation and the customer can retry.
Neutral.
- ULID for
thread_idis a conventional choice (lexically sortable, time-ordered). Migration to UUIDv7 (similar properties, broader stdlib support) is a one-character change at the generator if Python stdlib eventually ships it natively. - 30-min TTL on the Redis pointer assumes a customer who waits 30 min mid-booking has effectively abandoned. The Postgres pointer is the durable fallback; if they return after 30 min, the next inbound resolves the active thread from Postgres and re-caches Redis.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
Single LangGraph thread_id = "{tenant}:{phone}" reused across bookings with is_final flag on FSM state. | Reusing the thread accumulates an unbounded checkpoint chain over the customer's lifetime — hydration grows linearly with booking count. Adding is_final semantics to LangGraph's resume logic adds complexity without benefit. The fresh-thread-per-booking design was Adrian's explicit call (2026-04-25, D4). |
| Per-tenant Redis databases (Redis's 16 numbered DB feature). | Not a real isolation boundary (FLUSHALL and ACLs cross databases trivially); adds operational complexity (per-tenant DB selection on every connection); buys nothing. |
| Per-tenant Redis instances. | Operationally absurd at any scale (1k tenants = 1k Redis processes, each with its own monitoring/backup window). |
| Single shared archive schema for cold checkpoints across all tenants. | Breaks the schema-per-tenant isolation principle for cold data — bad signal for health-data compliance review. Per-tenant archive tables (D2 = (a)) preserves the pattern. |
pg_dump + drop as the archive mechanism. | Query-unfriendly (compliance/dispute queries can't reach archived data without restore work). The per-tenant archive table is the ergonomic upgrade for trivial extra storage cost. |
Postgres declarative partitioning by month for the live checkpoints table. | Elegant at scale; premature at PoC scale. Every per-tenant checkpoints table would need to be partitioned at creation — additional complexity in onboarding (ADR-0002 D5) for benefit that only materializes at >10M rows per tenant. Reserved as a scale-time amendment to this ADR. |
| Postgres advisory locks instead of Redis SETNX for the per-thread mutex. | Postgres advisory locks are session-bound (released on connection close); the asyncpg pool's connection lifecycle would need careful coordination. Redis SETNX with TTL is simpler and the Redis dependency is already present for the hot-cache layer. |
No mutex; rely solely on LangGraph's checkpoint conflict detection (write fails on (thread_id, checkpoint_id) collision). | Works but produces ugly user-facing failures (the losing handler sees a database error instead of a graceful "one moment" message). Mutex + LangGraph's natural conflict detection together gives both layers of defence. |
| Shorter mutex retry ceiling (e.g., 5s instead of 10s). | Optimizes perceived latency in the rare race case at the cost of higher failure rate. Adrian's "functional excellence first, optimize later" call (D3) keeps the longer ceiling for now. |
References
docs/prd/ratiba-prd.md— §2.2 (Redis 7 + PostgreSQL), §4 Modules 7-9 (orchestrator + scheduling)docs/adr/ADR-0001-tech-stack.md(amended 2026-04-25) — Redis + Postgres pinned; LangGraph + psycopg exceptiondocs/adr/ADR-0002-multi-tenant-isolation.md— schema-per-tenant operational specifics; TenantScopedSaver via per-tenant micro-pools (D4); asyncio contextvar tenant propagation (D7)docs/research/2026-04-25-langgraph-postgressaver-spike.md— Option A TenantScopedSaver wrapper; data-leakage failure mode warningdocs/research/2026-04-25-orchestration-patterns.md— A1 §3 FSM transitions; §7 LangGraph integration; §8 two-tier persistence (canonical inputs to this ADR)docs/research/2026-04-25-human-in-the-loop-handoff.md— A2 §4 driver-as-FSM-dimension; §5 Redis queue for handoff inbounddocs/research/2026-04-25-payments-orchestration.md— A4 §3 payment routing pattern (uses thread_id); §5 idempotency layersdocs/superpowers/specs/2026-04-25-agentic-research-investment-design.md§12 — Q6 (90-day retention) locked- LangGraph
langgraph-checkpoint-postgresv2.x —PostgresSaverAPI, thread-id config,Command(resume=...)semantics - ULID specification —
https://github.com/ulid/spec