ADR-0003: Conversation FSM Persistence Model

Status: Accepted Date: 2026-04-25

Context

Ratiba's product thesis ("the AI agent IS the UI" + "conversation state IS canonical state for in-flight bookings") makes the conversation FSM a load-bearing piece of the architecture, not an implementation detail. A1 §8 (docs/research/2026-04-25-orchestration-patterns.md) settled the high-level shape: two-tier persistence with Redis 7 as hot cache and Postgres + LangGraph checkpoint tables as the durable store, accessed through the per-invocation TenantScopedSaver wrapper from the spike (docs/research/2026-04-25-langgraph-postgressaver-spike.md).

What this ADR settles is the operational model of that persistence: how Redis keys are namespaced and isolated, how the LangGraph thread_id is derived (and how a fresh booking gets a fresh thread), how reservations and races are mediated, how the 90-day retention from spec §12 (Q6) actually archives data, and how cold-start recovery behaves. These are inputs to the orchestrator implementation that otherwise would be re-derived in five different files.

ADR-0002 settled the multi-tenant primitives (schema-per-tenant, TenantScopedSaver, asyncio contextvar, two-pool pattern) that this ADR builds on directly; ADR-0003 inherits all of them without re-deriving.

Decision

Seven decisions, organized as a coherent persistence model.

1. Two-tier persistence (architectural recap)

Inherited from A1 §8 and locked here as the canonical model.

Tier	Store	Lifetime	Purpose
Hot	Redis 7	TTL ≤ 30 min	Conversation thread pointer; reservation locks; webhook dedup; rate limiting
Durable	Postgres (per-tenant schema, LangGraph checkpoint tables)	90 days live + archive thereafter	Source of truth for FSM state; audit trail; cross-channel migration; cold-start recovery

Read pattern. On every inbound message, the orchestrator first consults Redis for the customer's current thread pointer (microseconds); falls back to Postgres if Redis is cold (single indexed query). LangGraph hydrates the FSM state from the latest checkpoint of the resolved thread.

Write pattern. Every FSM transition writes a checkpoint via LangGraph's PostgresSaver (durable). Reservation locks, dedup keys, and rate-limit counters write to Redis only (hot, ephemeral).

2. Redis isolation — shared instance with prefixed keys

Single Redis instance for all tenants, with rigorous key-prefix discipline enforced by a single key-builder module (backend/app/persistence/redis_keys.py). Application code never formats Redis keys directly; the key builder is the only place that knows the prefix conventions.

Canonical key shapes:

Key	Type	TTL	Value
`ratiba:thread:{tenant_id}:{phone_e164}`	string	1800s (30 min)	Current `thread_id` for the customer's active booking. Set on FSM entry; cleared on final-state entry.
`ratiba:fsm:{tenant_id}:{thread_id}`	hash	1800s (30 min)	Cached FSM cursor: `{current_state, last_seen_at, retry_count}`. Hot read; durable copy in Postgres checkpoint.
`ratiba:lock:reservation:{tenant_id}:{slot_key}`	string (NX)	300s (5 min)	Tentative slot reservation held while `CONFIRM` is awaited.
`ratiba:lock:thread:{thread_id}`	string (NX)	30s	Per-thread mutex against parallel WhatsApp + voice ingress (D6).
`ratiba:dedup:wa:{tenant_id}:{message_id}`	string	86400s (24h)	360dialog message-ID idempotency.
`ratiba:dedup:mpesa:{checkout_request_id}`	string	86400s (24h)	Daraja STK callback idempotency.
`ratiba:dedup:pesapal:{order_tracking_id}`	string	86400s (24h)	PesaPal IPN idempotency.
`ratiba:rate:{phone_e164}`	string (INCR)	60s	Inbound count per minute (anti-abuse; threshold per ADR-0006).
`ratiba:stkpush:{checkout_request_id}`	hash	300s (5 min)	M-Pesa STK callback → conversation routing aid.

Why shared Redis is acceptable. Redis holds hot cache state only; it never holds canonical state. The canonical isolation lives in Postgres via schema-per-tenant (ADR-0002). A Redis bug or mis-keyed read can at worst leak hot cursor data — the consequence is a transient routing error, not a cross-tenant data leak. Real isolation costs (per-tenant Redis databases or instances) buy nothing proportional to that risk.

Eviction policy. TTL only. No maxmemory-policy eviction. Redis is sized for the working set (estimated < 50k active conversations at 1k tenants × 50 active per tenant); if we approach the ceiling, scale Redis vertically before introducing eviction (an evicted FSM cursor without a matching Postgres checkpoint would orphan the conversation — unacceptable).

3. Postgres durable store — per-tenant LangGraph checkpoint tables

Inherited from ADR-0002 D2 (created by PostgresSaver(conn=tenant_conn).setup() in onboarding Phase B) and ADR-0001 amendment (TenantScopedSaver wrapper per spike Option A). Tables:

checkpoints   checkpoint_blobs   checkpoint_writes   checkpoint_migrations

Live in each tenant schema; accessed exclusively through the TenantScopedSaverFactory.for_tenant(tenant_id) wrapper from backend/app/persistence/checkpointer.py (the only module that imports psycopg, per the ADR-0001 amendment scoping rule).

The wrapper acquires a per-tenant micro-pool connection (size 1-2, lazy + idle-evicting), pre-sets search_path to the tenant schema, and constructs a vanilla PostgresSaver(conn=that_connection) for the duration of one graph invocation.

4. Thread ID and `conversation_threads` pointer table

Decision: fresh thread_id per booking attempt. Each new booking is a self-contained LangGraph thread with its own checkpoint chain; the customer's identity is the stable axis, not the thread.

Thread ID format. {tenant_id}:{phone_e164}:{ulid} where the ULID is generated at thread creation. ULID is preferred over UUIDv4 because it is lexically sortable by creation time — SELECT ... ORDER BY thread_id orders bookings chronologically without an extra timestamp.

Example: 9f1c2b71-...:+254712345678:01HFM6Z9YQK4M2EXB7DH3RG0NV.

Pointer table (per-tenant schema):

CREATE TABLE conversation_threads (
  customer_phone     VARCHAR(20)  NOT NULL,
  thread_id          VARCHAR(100) NOT NULL,
  created_at         TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  closed_at          TIMESTAMPTZ,                -- set on final-state entry
  closed_reason      VARCHAR(40),                -- 'done' | 'abandon' | 'closed_by_human'
  channel_started    VARCHAR(20)  NOT NULL,      -- 'whatsapp' | 'voice'
  PRIMARY KEY (customer_phone, thread_id)
);

CREATE INDEX idx_threads_active
  ON conversation_threads (customer_phone)
  WHERE closed_at IS NULL;

The partial index idx_threads_active makes the active-thread lookup a single index scan: SELECT thread_id FROM conversation_threads WHERE customer_phone = $1 AND closed_at IS NULL returns 0 or 1 row.

Resolution flow on inbound message:

1. Read Redis: ratiba:thread:{tenant_id}:{phone}
   ├── HIT → use that thread_id (hot path)
   └── MISS → step 2

2. Query Postgres: conversation_threads WHERE customer_phone = phone
              AND closed_at IS NULL
   ├── 1 row → use that thread_id; cache it back to Redis
   └── 0 rows → step 3

3. Generate new ULID; new thread_id = "{tenant}:{phone}:{ulid}"
   ├── INSERT INTO conversation_threads (customer_phone, thread_id, channel_started)
   ├── SET ratiba:thread:{tenant}:{phone} = new thread_id (TTL 1800s)
   └── New LangGraph thread starts at FSM state GREET

Final-state entry (FSM transitions to DONE, ABANDON, or CLOSED_BY_HUMAN):

1. UPDATE conversation_threads
   SET closed_at = NOW(), closed_reason = '<state>'
   WHERE customer_phone = phone AND thread_id = current_thread_id
2. DEL ratiba:thread:{tenant_id}:{phone}
3. The next inbound message for this customer triggers steps 2-3 above
   (no active row found → fresh thread)

Why fresh thread per booking. A LangGraph thread's checkpoint chain is single-purpose. Reusing the same thread_id across multiple bookings would require either an is_final flag on FSM state with custom resume logic that knows to ignore final checkpoints (more complex than this design), or accumulating arbitrarily long checkpoint chains over a customer's lifetime (slower hydration over time). The fresh-thread pattern keeps each booking's checkpoint history bounded and self-contained. Cross-booking analytics ("how many bookings has this customer made?") is a single SELECT COUNT(*) FROM conversation_threads WHERE customer_phone = phone — affordable.

Cross-channel migration (WhatsApp → voice mid-flow): A1 §8.5's emergent property still holds. The thread pointer is keyed by customer_phone, not by channel; the voice ingress resolves the same pointer the WhatsApp ingress would. The BookingState.channel field flips on the first voice-channel hydration; the AnswerShaper switches to voice-shaped responses for the rest of the flow.

5. Retention and cold archive — per-tenant `checkpoints_archive`

Q6 from spec §12: 90 days live + cold archive thereafter. This ADR specifies the mechanism.

Per-tenant archive table (created in onboarding Phase A alongside checkpoints):

CREATE TABLE checkpoints_archive (LIKE checkpoints INCLUDING ALL);
CREATE TABLE checkpoint_blobs_archive (LIKE checkpoint_blobs INCLUDING ALL);
CREATE TABLE checkpoint_writes_archive (LIKE checkpoint_writes INCLUDING ALL);
-- checkpoint_migrations is small (< 100 rows lifetime); not archived

LIKE ... INCLUDING ALL clones the column types, defaults, indexes, and constraints — archive rows are a perfect mirror of live rows.

Archive job (backend/app/persistence/archive_checkpoints.py, runs daily via the existing reaper infrastructure that also purges public.payment_routing past TTL):

-- Per-tenant, in tenant schema, single transaction:
BEGIN;
INSERT INTO checkpoints_archive
  SELECT * FROM checkpoints
  WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM checkpoints
  WHERE created_at < NOW() - INTERVAL '90 days';
-- Same for checkpoint_blobs and checkpoint_writes (referencing thread_id of
-- already-archived checkpoints).
COMMIT;

Compliance / dispute access pattern. When an operator needs to investigate a booking past the 90-day window, the diagnostic query unions live + archive:

SELECT * FROM (
  SELECT * FROM checkpoints WHERE thread_id = $1
  UNION ALL
  SELECT * FROM checkpoints_archive WHERE thread_id = $1
) AS all_checkpoints
ORDER BY checkpoint_id;

This is in the auto-debug logging recipe library (docs/methodology/log-diagnostics.md) once the first such incident lands.

When to revisit. When a single tenant's checkpoints table exceeds ~10M rows, switch that tenant to declarative partitioning by month and detach old partitions instead of moving rows. ADR-0003 amendment at that point.

6. Cold-start recovery semantics

Inherited from A1 §8.4 with this ADR's specifics added.

Scenario	Behaviour
Backend restart, Redis intact	No-op for in-flight FSMs. Redis still holds thread pointers; the next inbound message resolves the active `thread_id` from Redis and LangGraph hydrates from the latest Postgres checkpoint of that thread.
Redis flush, Postgres intact	All Redis pointers + locks lost. On next inbound for `(tenant, phone)`, step 1 (Redis hit) misses; step 2 falls back to `conversation_threads` Postgres query. If an active row exists, the thread is re-adopted and Redis pointer rebuilt; if not, a fresh thread starts. Tentative reservations held in `ratiba:lock:reservation:*` are gone — the user is re-prompted to pick a slot. Acceptable degradation.
Postgres tenant schema lost	Catastrophic — appointments, contacts, threads, checkpoints all gone. Out of scope for this ADR; this is a backup/restore concern.
Single thread's checkpoint corrupted (a single row in `checkpoints` is unreadable)	Hydration fails on that thread. Operator can manually CLOSE the thread (`UPDATE conversation_threads SET closed_at = NOW(), closed_reason = 'corruption'`) — the next inbound for the customer starts fresh. The corrupted thread stays in Postgres for forensics.

7. Reservation-lock and per-thread mutex semantics

Two distinct lock kinds, both via Redis SETNX. Use cases differ.

Reservation lock (ratiba:lock:reservation:{tenant_id}:{slot_key}): held when a customer enters CONFIRM but hasn't yet committed (button-tap or M-Pesa STK callback). TTL 300s (5 min). Prevents two customers from booking the same slot during the 5-min confirmation window. On CONFIRM → DONE, the lock is replaced by a real row in appointments. On CONFIRM → ABANDON or TTL expiry, the slot is released.

Per-thread mutex (ratiba:lock:thread:{thread_id}): held when an ingress handler is actively processing a turn for that thread. Prevents parallel WhatsApp + voice ingress from racing the same thread.

Property	Value
TTL	30 seconds (longer than typical turn processing; expires automatically on handler crash)
Retry strategy	Exponential backoff: 500ms → 1s → 2s → 4s, max 4 retries
Total wait ceiling	~10 seconds before the loser gives up
Failure response (English)	"One moment please — I'm catching up on your previous message."
Failure response (Swahili)	"Subiri kidogo — naendelea na ujumbe wako wa awali."
Observability	Every contention event logs `event_type=fsm.race.contention` with `correlation_ids` populated for both contenders

Race semantics are intentionally aimed at functional correctness over optimization. The case (customer simultaneously messages on WhatsApp

calls in via voice) is genuinely rare in practice; the design accepts ~10 seconds of perceived latency in the rare race case as a trade for clean correctness in 99.9% of the happy-path traffic. Optimization (shorter ceilings, better backoff curves, optimistic concurrency) defers until production data shows the race is more frequent than projected.

Consequences

Positive.

Each booking is a self-contained LangGraph thread, so checkpoint hydration time is bounded by the length of one booking flow (~10-20 checkpoints) rather than growing with the customer's lifetime relationship with the tenant.
Cross-booking analytics is trivial. conversation_threads is the customer-history table. Single SELECT for "how many bookings," "how many abandoned at PAY," "average booking duration."
Redis-flush recovery is gentle. The Postgres pointer table is the durable fallback; one missed slot reservation is the worst-case user experience.
Cold archive preserves tenant isolation because the archive tables live in the tenant's own schema. Compliance / dispute queries UNION live + archive without crossing schema boundaries.
Race semantics prioritize functional correctness over speed, in line with the "functional excellence first, optimize later" principle stated 2026-04-25.

Negative.

Two writes per inbound on cold-Redis path. Redis miss → Postgres query → Redis pointer write. Negligible (~5ms additional latency) but worth noting; the hot path is single Redis read.
Quadratic thread-count growth with customer lifetime. A spa customer who books 12 times/year over 5 years accumulates 60 rows in conversation_threads. Acceptable at any plausible scale; the table per-tenant should stay under 100k rows for the life of any plausible business.
Archive job adds a daily database write window. Mitigation: run off-peak (3 AM EAT); operations on checkpoints during the archive window may see brief lock contention. The job uses transactions per tenant, not a single global transaction, so contention is bounded per-tenant per-batch.
Per-thread mutex on Redis means a Redis outage degrades to "two handlers might process same thread." LangGraph's checkpoint writes are append-only and parent-chained, so the second writer's write either chains correctly or conflicts on a unique constraint (thread_id, checkpoint_id). Conflict detection is the backstop when Redis is unavailable; the orchestrator drops the second invocation and the customer can retry.

Neutral.

ULID for thread_id is a conventional choice (lexically sortable, time-ordered). Migration to UUIDv7 (similar properties, broader stdlib support) is a one-character change at the generator if Python stdlib eventually ships it natively.
30-min TTL on the Redis pointer assumes a customer who waits 30 min mid-booking has effectively abandoned. The Postgres pointer is the durable fallback; if they return after 30 min, the next inbound resolves the active thread from Postgres and re-caches Redis.

Alternatives Considered

Alternative	Rejected because
Single LangGraph `thread_id = "{tenant}:{phone}"` reused across bookings with `is_final` flag on FSM state.	Reusing the thread accumulates an unbounded checkpoint chain over the customer's lifetime — hydration grows linearly with booking count. Adding `is_final` semantics to LangGraph's resume logic adds complexity without benefit. The fresh-thread-per-booking design was Adrian's explicit call (2026-04-25, D4).
Per-tenant Redis databases (Redis's 16 numbered DB feature).	Not a real isolation boundary (`FLUSHALL` and ACLs cross databases trivially); adds operational complexity (per-tenant DB selection on every connection); buys nothing.
Per-tenant Redis instances.	Operationally absurd at any scale (1k tenants = 1k Redis processes, each with its own monitoring/backup window).
Single shared archive schema for cold checkpoints across all tenants.	Breaks the schema-per-tenant isolation principle for cold data — bad signal for health-data compliance review. Per-tenant archive tables (D2 = (a)) preserves the pattern.
`pg_dump` + drop as the archive mechanism.	Query-unfriendly (compliance/dispute queries can't reach archived data without restore work). The per-tenant archive table is the ergonomic upgrade for trivial extra storage cost.
Postgres declarative partitioning by month for the live `checkpoints` table.	Elegant at scale; premature at PoC scale. Every per-tenant `checkpoints` table would need to be partitioned at creation — additional complexity in onboarding (ADR-0002 D5) for benefit that only materializes at >10M rows per tenant. Reserved as a scale-time amendment to this ADR.
Postgres advisory locks instead of Redis SETNX for the per-thread mutex.	Postgres advisory locks are session-bound (released on connection close); the asyncpg pool's connection lifecycle would need careful coordination. Redis SETNX with TTL is simpler and the Redis dependency is already present for the hot-cache layer.
No mutex; rely solely on LangGraph's checkpoint conflict detection (write fails on `(thread_id, checkpoint_id)` collision).	Works but produces ugly user-facing failures (the losing handler sees a database error instead of a graceful "one moment" message). Mutex + LangGraph's natural conflict detection together gives both layers of defence.
Shorter mutex retry ceiling (e.g., 5s instead of 10s).	Optimizes perceived latency in the rare race case at the cost of higher failure rate. Adrian's "functional excellence first, optimize later" call (D3) keeps the longer ceiling for now.

References

docs/prd/ratiba-prd.md — §2.2 (Redis 7 + PostgreSQL), §4 Modules 7-9 (orchestrator + scheduling)
docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — Redis + Postgres pinned; LangGraph + psycopg exception
docs/adr/ADR-0002-multi-tenant-isolation.md — schema-per-tenant operational specifics; TenantScopedSaver via per-tenant micro-pools (D4); asyncio contextvar tenant propagation (D7)
docs/research/2026-04-25-langgraph-postgressaver-spike.md — Option A TenantScopedSaver wrapper; data-leakage failure mode warning
docs/research/2026-04-25-orchestration-patterns.md — A1 §3 FSM transitions; §7 LangGraph integration; §8 two-tier persistence (canonical inputs to this ADR)
docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §4 driver-as-FSM-dimension; §5 Redis queue for handoff inbound
docs/research/2026-04-25-payments-orchestration.md — A4 §3 payment routing pattern (uses thread_id); §5 idempotency layers
docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — Q6 (90-day retention) locked
LangGraph langgraph-checkpoint-postgres v2.x — PostgresSaver API, thread-id config, Command(resume=...) semantics
ULID specification — https://github.com/ulid/spec

Context​

Decision​

1. Two-tier persistence (architectural recap)​

2. Redis isolation — shared instance with prefixed keys​

3. Postgres durable store — per-tenant LangGraph checkpoint tables​

4. Thread ID and conversation_threads pointer table​

5. Retention and cold archive — per-tenant checkpoints_archive​

6. Cold-start recovery semantics​

7. Reservation-lock and per-thread mutex semantics​

Consequences​

Alternatives Considered​

References​