ADR-0001: Tech Stack
Status: Accepted
Date: 2026-04-25
Amended: 2026-04-25 — pinned Python 3.13; added library-currency policy; added LangGraph + psycopg exception consequent on the PostgresSaver spike (Option A: TenantScopedSaver wrapper).
Context
Ratiba.chat is a conversational-first business management platform for service-based SMBs in East Africa. Architecturally it must support:
- WhatsApp as primary channel (Meta Cloud API direct webhook + reply-button / list-message rendering, plus voice-note STT). The WhatsApp ingress was originally pinned to the 360dialog BSP; ADR-0008 (2026-04-26) supersedes that choice in favour of Meta's first-party Cloud API. See §Decision below for the updated table cell.
- Voice channel (phone calls bridged through SIP into a real-time AI pipeline, STT → reasoning → TTS, with a sub-second filler clock to keep the caller engaged while the LLM thinks).
- Multi-tenant from day one — schema-per-tenant isolation, ready for PII / health-data compliance.
- M-Pesa-native payments with a feature switch (Daraja STK push at booking confirmation; some tenants will not enable payments at all).
- Admin-via-conversation — business owners run their operation from inside WhatsApp; the web dashboard exists only for bulk operations and initial onboarding.
The PRD specifies the stack at high level; this ADR captures the formal decision so subsequent ADRs can reference "the stack" without ambiguity.
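For the direct Meta Cloud API webhook named in the first bullet, every inbound POST has to be authenticated before any tenant routing happens. Meta signs webhook payloads with an X-Hub-Signature-256 header: the prefix sha256= followed by the hex HMAC-SHA256 of the raw request body, keyed with the app secret. A minimal sketch of that check follows; the function name and the way the secret is passed in are illustrative assumptions, not Ratiba's actual ingress code.

```python
import hashlib
import hmac


def verify_meta_signature(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Return True when header_value carries the HMAC-SHA256 of raw_body.

    Hypothetical helper: header_value is the X-Hub-Signature-256 header,
    raw_body is the unparsed request body exactly as received.
    """
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    received = header_value[len(prefix):]
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest keeps the comparison time independent of where the
    # two values first differ.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The check must run against the raw bytes, before JSON parsing, because re-serialising the payload can change whitespace or key order and invalidate the signature.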
Decision
| Layer | Technology | Why |
|---|---|---|
| Runtime | Python 3.13 (pinned, amended 2026-04-25) | Released October 2024; ~18 months mature at amendment time; universal library support across the rest of the stack. Python 3.14's headline (free-threading) doesn't help an I/O-bound system; the ~6-month-old long-tail compatibility risk isn't worth it. Re-evaluate Q4 2026. |
| Backend framework | FastAPI / Pydantic v2 | Async-first; matches the toolchain used in trust-relay-workflow + zol-rag for org-wide skill compounding. Pydantic v2 gives typed contracts at every API boundary. |
| ORM | SQLAlchemy 2.0 (async) + Alembic + asyncpg | Mature schema-per-tenant story (SET search_path per session). Alembic handles tenant-aware migrations with the multi-tenant template. asyncpg is the fastest Python pg driver and native to the async runtime. |
| State store driver exception | psycopg 3 alongside asyncpg (amended 2026-04-25) | LangGraph's PostgresSaver checkpointer uses psycopg. The spike at docs/research/2026-04-25-langgraph-postgressaver-spike.md confirmed the Python port has no schema parameter and no per-call hook. Option A (TenantScopedSaver wrapper) is the chosen integration model — see the spike for the wrapper acceptance criteria. Operational consequence: two Postgres drivers in the backend (asyncpg for application queries, psycopg exclusively inside the FSM checkpoint path). |
| Orchestration / FSM | LangGraph + langgraph-checkpoint-postgres (amended 2026-04-25) | Interrupt-and-resume API is the closest published primitive to Ratiba's admin-handoff requirement. Adopted via Option A wrapper from the spike. Custom-FSM-on-asyncpg (Option C in the spike) is reserved as an escape hatch for ADR-0005 if LangGraph constrains other architectural choices in M3-M4. |
| Frontend | Next.js 14+ App Router / TypeScript / Tailwind / shadcn/ui | Same reasoning as the rest of the org's stack. shadcn/ui is consistent with the S4U design-system convention. |
| Auth | Keycloak 24+ | Tenant realms (one realm per tenant for hard isolation of admin user pools); phone-number-based authenticator for admin WhatsApp linking; OIDC for dashboard. Self-hosted, no per-MAU SaaS cost. |
| Database | PostgreSQL 16, schema-per-tenant | Hard isolation without operational overhead of database-per-tenant. Shared public schema for the tenant registry. Row-Level Security available for any cross-tenant tables that arise later (e.g., shared service catalogue). |
| Cache / FSM state | Redis 7 | Session state for booking/admin orchestrators. FSM hydration on each turn keeps the orchestrators stateless. Also: rate limiting + WhatsApp webhook dedup. |
| WhatsApp ingress | Meta WhatsApp Cloud API (direct), per ADR-0008 (2026-04-26), which supersedes the original 360dialog choice | Originally pinned to the 360dialog BSP for faster onboarding. ADR-0008 reopened the choice mid-M4 once cost data and multi-tenant fit were clearer: €0/month platform fee vs €49/month for the BSP, project-level HMAC across all tenants, single-secret rotation, and no BSP-shape translation layer. Production phone numbers are provisioned as Twilio long codes (verified once with the WhatsApp Business Platform via SMS OTP, then claimed by the WABA). Reversible by design: the schema preserves the option to switch to a BSP later if pilot data forces it. |
| Voice STT | Deepgram Nova-3 | Multilingual (Swahili + English), streaming, low latency. Same provider the org already uses in zol-rag, so the playbook (key management, billing, eval) carries over. |
| Voice TTS | ElevenLabs Multilingual v2 | Same rationale as STT — already wired up in zol-rag, voice quality acceptable for Swahili. |
| Voice telephony | LiveKit (SIP + rooms) | SIP bridge for phone-number ingress; rooms for the agent worker pattern. Same pattern as zol-rag. |
| Payments | Safaricom Daraja API (M-Pesa STK push) | Non-negotiable for the Kenyan market. Feature-switched per tenant (some tenants will not collect at booking). PRD Annex A reserves the abstraction for future Airtel Money + Equitel / Jenga. |
| Containers | Docker Compose (dev + small-VPS prod) | Same pattern as the rest of the org. Production path to Kubernetes if scale demands it; not on the roadmap. |
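The ORM row relies on SET search_path per session for schema-per-tenant isolation. Postgres does not accept bind parameters in a SET statement, so the schema name ends up interpolated into SQL and has to be validated first. The sketch below shows one way to do that; the tenant_ prefix, the slug rules, and both function names are assumptions made for illustration, not the project's naming convention.

```python
import re

# Hypothetical slug rule: lowercase letters, digits, underscores, 1-40 chars.
# With the "tenant_" prefix this stays under Postgres's 63-byte identifier limit.
_SLUG_RE = re.compile(r"[a-z0-9_]{1,40}")


def tenant_schema(tenant_slug: str) -> str:
    """Map a tenant slug to a schema name, rejecting anything unsafe."""
    if not _SLUG_RE.fullmatch(tenant_slug):
        raise ValueError(f"invalid tenant slug: {tenant_slug!r}")
    return f"tenant_{tenant_slug}"


def search_path_sql(tenant_slug: str) -> str:
    """Build the per-session statement: tenant schema first, then public."""
    # Interpolation is safe only because tenant_schema() has already
    # restricted the value to a fixed character set.
    return f'SET search_path TO "{tenant_schema(tenant_slug)}", public'
```

Keeping public second in the path matches the table's note that the tenant registry lives in the shared public schema, while unqualified table names resolve to the tenant's schema first.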
Library Currency Policy (added 2026-04-25)
Ratiba treats dependency currency as a methodology concern, not a per-PR
guess. The policy applies to every package in backend/pyproject.toml,
frontend/package.json, and docusaurus/ratiba/package.json.
- At lock-in (initial pinning): every dependency is pinned to its latest stable release at the moment of pinning. No "match the version trust-relay uses." The PoC is the moment to start at the front.
- Patch releases (x.y.Z): auto-applied via Renovate (or Dependabot). Most ecosystems treat patches as safe; eval suite (when A3 lands) gates merge.
- Minor releases (x.Y.0): reviewed monthly by Adrian (or by an automated PR with the eval suite running against it once A3 lands). Adopt unless something breaks.
- Major releases (X.0.0): reviewed quarterly. ADR-worthy for any major bump that changes the stack's contracts (e.g., FastAPI 1.0, SQLAlchemy 3.0, Pydantic 3.0, LangGraph 1.0).
- Deprecation tracking: subscribe to release notes for the load-bearing libraries (LangGraph, FastAPI, Daraja's official Python SDK if they ship one, Pydantic, ElevenLabs SDK, Deepgram SDK, LiveKit SDK, Keycloak). Phase C flagged the OpenAI Assistants API sunset as the canonical "vendor pulled the rug" risk; the methodology supplement (Phase B) will codify the tracking ritual.
The Python pin (3.13) is reviewed annually with the same discipline. 3.14 → re-evaluate Q4 2026; 3.15 → re-evaluate Q4 2027; etc.
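The patch / minor / major split above can be expressed as a small classifier that an update bot or review script could call. This is a sketch under stated assumptions: it only handles plain X.Y.Z version strings (no pre-release or build tags), and the Cadence enum and function names are invented for illustration.

```python
from enum import Enum


class Cadence(Enum):
    NONE = "no upgrade"
    AUTO = "auto-apply (patch)"
    MONTHLY = "monthly review (minor)"
    QUARTERLY = "quarterly review (major)"


def parse(version: str) -> tuple[int, int, int]:
    """Parse a strict X.Y.Z string; anything else raises ValueError."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def review_cadence(current: str, candidate: str) -> Cadence:
    """Classify an upgrade from current to candidate per the policy above."""
    cur, cand = parse(current), parse(candidate)
    if cand <= cur:
        return Cadence.NONE  # same version or a downgrade
    if cand[0] != cur[0]:
        return Cadence.QUARTERLY
    if cand[1] != cur[1]:
        return Cadence.MONTHLY
    return Cadence.AUTO
```

For example, moving a dependency from 2.4.1 to 2.4.2 classifies as an auto-applied patch, 2.4.1 to 2.5.0 as a monthly review, and 2.4.1 to 3.0.0 as a quarterly review.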
Consequences
Positive. This stack is essentially the union of trust-relay-workflow and zol-rag's stacks. Three benefits compound from that:
- Skill reuse — every tool is one a contributor has already shipped with. No new framework adoption tax.
- Operational reuse — Keycloak realm management, Postgres tuning, Redis ops, LiveKit-on-Docker config drift mitigation are all known problems with known solutions in this org.
- Library reuse — once Ratiba has shipped its first month of code,
shared packages (e.g., a phone-number-to-tenant resolver, a Daraja
client, a voice greeting builder) can be lifted into the
trustrelay-* shared-package model without retrofit.
Negative. The stack assumes Python proficiency for the backend and TypeScript proficiency for the frontend. The East African contributor pool leans toward JavaScript-everywhere, so this is a minor onboarding tax for local hires later. Acceptable trade-off for the first two years.
Negative (amended 2026-04-25). Two Postgres drivers in the backend
(asyncpg for app queries, psycopg for the LangGraph FSM checkpoint
path). Small operational wart: separate connection-pool tuning, two sets
of driver upgrades to track. Mitigation: keep psycopg strictly scoped
to the TenantScopedSaver wrapper module (backend/app/persistence/checkpointer.py)
so the surface area never spreads. If langgraph-checkpoint-postgres
ships an asyncpg adapter or upstream merges the schema parameter from
the JS sibling, revisit and consolidate.
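The mitigation above (psycopg confined to one module) is easier to keep true if a test enforces it. The sketch below parses Python sources with the standard-library ast module and reports any file other than the checkpointer module that imports psycopg. The function names and the idea of feeding it a path-to-source mapping are assumptions for illustration; a real check would read the files under backend/ from disk.

```python
import ast

# The one module allowed to import psycopg, per the mitigation above.
ALLOWED_MODULE = "backend/app/persistence/checkpointer.py"
# Assumption: the companion pool package is guarded as well.
GUARDED_PACKAGES = {"psycopg", "psycopg_pool"}


def imports_guarded_driver(source: str) -> bool:
    """Return True if the source imports a guarded package anywhere."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in GUARDED_PACKAGES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import; relative imports
            # cannot refer to a third-party driver package.
            if node.level == 0 and (node.module or "").split(".")[0] in GUARDED_PACKAGES:
                return True
    return False


def violations(sources: dict[str, str]) -> list[str]:
    """List paths (sorted) that import the driver outside the allowed module."""
    return sorted(
        path
        for path, source in sources.items()
        if path != ALLOWED_MODULE and imports_guarded_driver(source)
    )
```

A CI test could assert that violations(...) returns an empty list, which turns "the surface area never spreads" from a convention into a failing build.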
Neutral. Schema-per-tenant has a known scaling ceiling of roughly 10,000 tenants per Postgres instance (each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Node / NestJS backend | Splits the org's tech surface into two language stacks. Org has more Python bench depth. |
| Database-per-tenant | Operationally heavier (n connection pools, n migration runs, n backup windows) for a small additional isolation guarantee that schema-per-tenant already covers. |
| Self-hosted Mautic / Plivo / Twilio for WhatsApp | Twilio is more expensive at scale; Plivo's WhatsApp coverage is patchy outside India; Mautic isn't a BSP. |
| No schema isolation, RLS only | RLS-only requires every query to carry tenant context perfectly, every time. One missed WHERE tenant_id = … and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural. |
| Bare-metal Linux SIP + Asterisk | Higher operational burden than LiveKit + 360dialog. Not where we want our engineering time. |
References
- docs/prd/ratiba-prd.md: full product requirements
- Cross-project port assignments: CLAUDE.md § Port Assignments
- LiveKit-in-Docker config pattern: trust-relay-workflow / zol-rag livekit.yaml (use node_ip: 127.0.0.1, stun_servers: [] for loopback dev)
- docs/research/2026-04-25-agentic-landscape-2026.md: Phase C landscape scan (anchor for vocabulary, adopt/reject/park calls)
- docs/research/2026-04-25-langgraph-postgressaver-spike.md: source-read spike that produced Option A (TenantScopedSaver wrapper) and the psycopg exception