ADR-0001: Tech Stack

Status: Accepted
Date: 2026-04-25
Amended: 2026-04-25 (pinned Python 3.13; added library-currency policy; added the LangGraph + psycopg exception that follows from the PostgresSaver spike, Option A: TenantScopedSaver wrapper).

Context

Ratiba.chat is a conversational-first business management platform for service-based SMBs in East Africa. Architecturally it must support:

WhatsApp as primary channel (Meta Cloud API direct webhook + reply-button / list-message rendering, plus voice-note STT). The WhatsApp ingress was originally pinned to the 360dialog BSP; ADR-0008 (2026-04-26) supersedes that choice in favour of Meta's first-party Cloud API. See §Decision below for the updated table cell.
Voice channel — phone calls bridged through SIP into a real-time AI pipeline (STT → reasoning → TTS) with a sub-second filler clock to keep the caller engaged while the LLM thinks.
Multi-tenant from day one — schema-per-tenant isolation, ready for PII / health-data compliance.
M-Pesa-native payments with a feature switch (Daraja STK push at booking confirmation; some tenants will not enable payments at all).
Admin-via-conversation — business owners run their operation from inside WhatsApp; the web dashboard exists only for bulk operations and initial onboarding.

The PRD specifies the stack at high level; this ADR captures the formal decision so subsequent ADRs can reference "the stack" without ambiguity.

Decision

Layer	Technology	Why
Runtime	Python 3.13 (pinned, amended 2026-04-25)	Released October 2024; ~18 months mature at amendment time; universal library support across the rest of the stack. Python 3.14's headline feature (free-threading) doesn't help an I/O-bound system, and at ~6 months old it still carries long-tail library compatibility risk that isn't worth taking. Re-evaluate Q4 2026.
Backend framework	FastAPI / Pydantic v2	Async-first; matches the toolchain used in trust-relay-workflow + zol-rag for org-wide skill compounding. Pydantic v2 gives typed contracts at every API boundary.
ORM	SQLAlchemy 2.0 (async) + Alembic + asyncpg	Mature schema-per-tenant story (`SET search_path` per session). Alembic handles tenant-aware migrations with the multi-tenant template. asyncpg is the fastest Python pg driver and native to the async runtime.
State store driver exception	`psycopg` 3 alongside `asyncpg` (amended 2026-04-25)	LangGraph's `PostgresSaver` checkpointer uses `psycopg`. The spike at `docs/research/2026-04-25-langgraph-postgressaver-spike.md` confirmed the Python port has no `schema` parameter and no per-call hook. Option A (TenantScopedSaver wrapper) is the chosen integration model — see the spike for the wrapper acceptance criteria. Operational consequence: two Postgres drivers in the backend (`asyncpg` for application queries, `psycopg` exclusively inside the FSM checkpoint path).
Orchestration / FSM	LangGraph + `langgraph-checkpoint-postgres` (amended 2026-04-25)	Interrupt-and-resume API is the closest published primitive to Ratiba's admin-handoff requirement. Adopted via Option A wrapper from the spike. Custom-FSM-on-asyncpg (Option C in the spike) is reserved as an escape hatch for ADR-0005 if LangGraph constrains other architectural choices in M3-M4.
Frontend	Next.js 14+ App Router / TypeScript / Tailwind / shadcn/ui	Same reasoning as the rest of the org's stack. shadcn/ui is consistent with the S4U design-system convention.
Auth	Keycloak 24+	Tenant realms (one realm per tenant for hard isolation of admin user pools); phone-number-based authenticator for admin WhatsApp linking; OIDC for dashboard. Self-hosted, no per-MAU SaaS cost.
Database	PostgreSQL 16, schema-per-tenant	Hard isolation without operational overhead of database-per-tenant. Shared `public` schema for the tenant registry. Row-Level Security available for any cross-tenant tables that arise later (e.g., shared service catalogue).
Cache / FSM state	Redis 7	Session state for booking/admin orchestrators. FSM hydration on each turn keeps the orchestrators stateless. Also: rate limiting + WhatsApp webhook dedup.
WhatsApp ingress	Meta WhatsApp Cloud API (direct); supersedes 360dialog per ADR-0008 (2026-04-26)	Originally pinned to the 360dialog BSP for faster onboarding. ADR-0008 reopened the choice mid-M4 once cost data + multi-tenant fit were clearer: €0/month platform fee vs €49/month BSP, project-level HMAC across all tenants, single-secret rotation, no BSP-shape translation layer. A Twilio long code is used for production phone-number provisioning (verified once with WhatsApp Business Platform via SMS OTP, then claimed by the WABA). Reversibility-by-design: the schema preserves the option to switch to a BSP later if pilot data forces it.
Voice STT	Deepgram Nova-3	Multilingual (Swahili + English), streaming, low latency. Same provider the org already uses in zol-rag, so the playbook (key management, billing, eval) carries over.
Voice TTS	ElevenLabs Multilingual v2	Same rationale as STT — already wired up in zol-rag, voice quality acceptable for Swahili.
Voice telephony	LiveKit (SIP + rooms)	SIP bridge for phone-number ingress; rooms for the agent worker pattern. Same pattern as zol-rag.
Payments	Safaricom Daraja API (M-Pesa STK push)	Non-negotiable for the Kenyan market. Feature-switched per tenant (some tenants will not collect at booking). PRD Annex A reserves the abstraction for future Airtel Money + Equitel / Jenga.
Containers	Docker Compose (dev + small-VPS prod)	Same pattern as the rest of the org. Production path to Kubernetes if scale demands it; not on the roadmap.

Library Currency Policy (added 2026-04-25)

Ratiba treats dependency currency as a methodology concern, not a per-PR guess. The policy applies to every package in backend/pyproject.toml, frontend/package.json, and docusaurus/ratiba/package.json.

At lock-in (initial pinning): every dependency is pinned to its latest stable release at the moment of pinning. No "match the version trust-relay uses." The PoC is the moment to start at the front.
Patch releases (x.y.Z): auto-applied via Renovate (or Dependabot). Most ecosystems treat patches as safe; the eval suite (once A3 lands) gates the merge.
Minor releases (x.Y.0): reviewed monthly by Adrian (or by an automated PR with the eval suite running against it once A3 lands). Adopt unless something breaks.
Major releases (X.0.0): reviewed quarterly. ADR-worthy for any major bump that changes the stack's contracts (e.g., FastAPI 1.0, SQLAlchemy 3.0, Pydantic 3.0, LangGraph 1.0).
Deprecation tracking: subscribe to release notes for the load-bearing libraries (LangGraph, FastAPI, Daraja's official Python SDK if they ship one, Pydantic, ElevenLabs SDK, Deepgram SDK, LiveKit SDK, Keycloak). Phase C flagged the OpenAI Assistants API sunset as the canonical "vendor pulled the rug" risk — the methodology supplement (Phase B) will codify the tracking ritual.

The Python pin (3.13) is reviewed annually with the same discipline. 3.14 → re-evaluate Q4 2026; 3.15 → re-evaluate Q4 2027; etc.
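
As an illustration of the patch / minor / major tiers above, here is a minimal sketch that maps a version bump to its review cadence. It assumes plain numeric `X.Y.Z` versions with no pre-release or build tags. The names `Bump`, `parse`, and `classify` are hypothetical and do not describe Renovate or Dependabot behaviour.

```python
from enum import Enum


class Bump(Enum):
    NONE = "no action"
    PATCH = "auto-apply via bot; eval suite gates merge"
    MINOR = "monthly review; adopt unless something breaks"
    MAJOR = "quarterly review; ADR if the stack's contracts change"


def parse(version: str) -> tuple[int, int, int]:
    """Parse 'X.Y.Z' (or 'X.Y') into integers, padding missing parts with 0."""
    parts = [int(part) for part in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return parts[0], parts[1], parts[2]


def classify(current: str, candidate: str) -> Bump:
    """Classify the bump from the pinned version to a candidate release."""
    cur, new = parse(current), parse(candidate)
    if new <= cur:
        return Bump.NONE  # same version or a downgrade: nothing to review
    if new[0] != cur[0]:
        return Bump.MAJOR
    if new[1] != cur[1]:
        return Bump.MINOR
    return Bump.PATCH
```

For example, `classify("2.0.30", "2.1.0")` returns `Bump.MINOR`, which under the policy lands in the monthly review.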

Consequences

Positive. This stack is essentially the union of trust-relay-workflow and zol-rag's stacks. Three benefits compound from that:

Skill reuse — every tool is one a contributor has already shipped with. No new framework adoption tax.
Operational reuse — Keycloak realm management, Postgres tuning, Redis ops, LiveKit-on-Docker config drift mitigation are all known problems with known solutions in this org.
Library reuse — once Ratiba has shipped its first month of code, shared packages (e.g., a phone-number-to-tenant resolver, a Daraja client, a voice greeting builder) can be lifted into the trustrelay-* shared-package model without retrofit.

Negative. The stack assumes Python proficiency for the backend and TypeScript proficiency for the frontend. The East African contributor pool leans toward JavaScript-everywhere; this is a minor onboarding tax for local hires later. Acceptable trade-off for the first two years.

Negative (amended 2026-04-25). Two Postgres drivers in the backend (asyncpg for app queries, psycopg for the LangGraph FSM checkpoint path). Small operational wart: separate connection-pool tuning, two sets of driver upgrades to track. Mitigation: keep psycopg strictly scoped to the TenantScopedSaver wrapper module (backend/app/persistence/checkpointer.py) so the surface area never spreads. If langgraph-checkpoint-postgres ships an asyncpg adapter or upstream merges the schema parameter from the JS sibling, revisit and consolidate.
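
One way to enforce the "strictly scoped" mitigation is an import-boundary check run in CI. The sketch below is an assumption about how such a guard could look, not an existing Ratiba test: it parses Python sources with the standard `ast` module and reports any file other than the wrapper module that imports `psycopg` (or `psycopg_pool`, its companion pooling package). The helper names and the example paths other than `backend/app/persistence/checkpointer.py` are hypothetical.

```python
import ast

ALLOWED_MODULE = "backend/app/persistence/checkpointer.py"
GUARDED_PACKAGES = {"psycopg", "psycopg_pool"}


def imports_psycopg(source: str) -> bool:
    """Return True if the source has an absolute import of a guarded package."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in GUARDED_PACKAGES for name in names):
            return True
    return False


def violations(sources: dict[str, str]) -> list[str]:
    """Return the paths, other than the allowed module, that import psycopg."""
    return sorted(
        path
        for path, source in sources.items()
        if path != ALLOWED_MODULE and imports_psycopg(source)
    )
```

In a real test the `sources` mapping would be built by reading every `.py` file under `backend/app`; the test passes when `violations(...)` is empty.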

Neutral. Schema-per-tenant has a known scaling ceiling around ~10,000 tenants per Postgres instance (each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.

Alternatives Considered

Alternative	Rejected because
Node / NestJS backend	Splits the org's tech surface into two language stacks. Org has more Python bench depth.
Database-per-tenant	Operationally heavier (n connection pools, n migration runs, n backup windows) for a small additional isolation guarantee that schema-per-tenant already covers.
Self-hosted Mautic / Plivo / Twilio for WhatsApp	Twilio is more expensive at scale; Plivo's WhatsApp coverage is patchy outside India; Mautic isn't a BSP.
No schema isolation, RLS only	RLS-only requires every query to carry tenant context perfectly, every time. One missed `WHERE tenant_id = …` and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural.
Bare-metal Linux SIP + Asterisk	Higher operational burden than LiveKit plus a hosted WhatsApp API (360dialog at the time of the original decision; Meta Cloud API since ADR-0008). Not where we want our engineering time.
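
To illustrate why the "RLS only" row calls schema-per-tenant isolation structural: the tenant is bound once per session through `search_path`, so individual queries carry no tenant filter that could be forgotten. The sketch below only builds the statement. The `tenant_` schema prefix, the slug rules, and the choice to keep `public` on the path (for the shared tenant registry) are assumptions for illustration. Because the schema name is interpolated into the statement as an identifier rather than passed as a bound parameter, the sketch validates it against a strict pattern first.

```python
import re

# Hypothetical slug rule: lowercase letter first, then letters, digits, or
# underscores. Prefix plus slug stays well under Postgres's 63-byte limit.
_SLUG = re.compile(r"[a-z][a-z0-9_]{0,39}")


def tenant_schema(tenant_slug: str) -> str:
    """Map a tenant slug to its schema name, rejecting unsafe identifiers."""
    if not _SLUG.fullmatch(tenant_slug):
        raise ValueError(f"invalid tenant slug: {tenant_slug!r}")
    return f"tenant_{tenant_slug}"


def search_path_statement(tenant_slug: str) -> str:
    """Build the per-session statement that scopes all unqualified table names."""
    return f'SET search_path TO "{tenant_schema(tenant_slug)}", public'
```

A session factory would execute this statement once when a connection is checked out for a request, before any application query runs.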

References

docs/prd/ratiba-prd.md — full product requirements
Cross-project port assignments: CLAUDE.md § Port Assignments
LiveKit-in-Docker config pattern: trust-relay-workflow / zol-rag livekit.yaml (use node_ip: 127.0.0.1, stun_servers: [] for loopback dev)
docs/research/2026-04-25-agentic-landscape-2026.md — Phase C landscape scan (anchor for vocabulary, adopt/reject/park calls)
docs/research/2026-04-25-langgraph-postgressaver-spike.md — source-read spike that produced Option A (TenantScopedSaver wrapper) and the psycopg exception
