ADR-0001: Tech Stack

Status: Accepted
Date: 2026-04-25
Amended: 2026-04-25 (pinned Python 3.13; added library-currency policy; added LangGraph + psycopg exception consequent on the PostgresSaver spike, Option A: TenantScopedSaver wrapper).

Context

Ratiba.chat is a conversational-first business management platform for service-based SMBs in East Africa. Architecturally it must support:

  • WhatsApp as primary channel (Meta Cloud API direct webhook + reply-button / list-message rendering, plus voice-note STT). The WhatsApp ingress was originally pinned to the 360dialog BSP; ADR-0008 (2026-04-26) supersedes that choice in favour of Meta's first-party Cloud API. See §Decision below for the updated table cell.
  • Voice channel: phone calls bridged through SIP into a real-time AI pipeline (STT → reasoning → TTS), with a sub-second filler clock to keep the caller engaged while the LLM thinks.
  • Multi-tenant from day one — schema-per-tenant isolation, ready for PII / health-data compliance.
  • M-Pesa-native payments with a feature switch (Daraja STK push at booking confirmation; some tenants will not enable payments at all).
  • Admin-via-conversation — business owners run their operation from inside WhatsApp; the web dashboard exists only for bulk operations and initial onboarding.
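
The filler clock in the voice bullet above can be sketched with plain asyncio. This is a minimal illustration, not the pipeline's actual code: the 0.7-second threshold, the filler phrase, and the `think` / `speak` callables are all hypothetical stand-ins for the LLM call and the TTS output.

```python
import asyncio
from typing import Awaitable, Callable

FILLER_AFTER_SECONDS = 0.7  # assumed threshold; the ADR only says "sub-second"


async def respond_with_filler(
    think: Callable[[], Awaitable[str]],
    speak: Callable[[str], Awaitable[None]],
    filler: str = "One moment, please.",
    filler_after: float = FILLER_AFTER_SECONDS,
) -> str:
    """Speak `filler` if `think()` has not finished within `filler_after` seconds."""
    reasoning = asyncio.ensure_future(think())
    # asyncio.wait returns at the timeout without cancelling the pending task.
    done, _pending = await asyncio.wait({reasoning}, timeout=filler_after)
    if not done:
        await speak(filler)
    answer = await reasoning
    await speak(answer)
    return answer


async def demo() -> list[str]:
    """Run one slow turn and return everything that was spoken, in order."""
    spoken: list[str] = []

    async def slow_llm() -> str:
        await asyncio.sleep(0.05)  # stand-in for LLM latency
        return "You are booked for 3pm."

    async def speak(text: str) -> None:
        spoken.append(text)  # stand-in for the TTS call

    await respond_with_filler(slow_llm, speak, filler_after=0.01)
    return spoken
```

Using `asyncio.wait` rather than `asyncio.wait_for` matters here: the latter would cancel the reasoning task at the timeout, whereas the caller should hear the filler and then the real answer.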

The PRD specifies the stack at a high level; this ADR captures the formal decision so subsequent ADRs can reference "the stack" without ambiguity.

Decision

| Layer | Technology | Why |
| --- | --- | --- |
| Runtime | Python 3.13 (pinned, amended 2026-04-25) | Released October 2024; ~18 months mature at amendment time; universal library support across the rest of the stack. Python 3.14's headline feature (free-threading) doesn't help an I/O-bound system, and at ~6 months old its long-tail compatibility risk isn't worth taking. Re-evaluate Q4 2026. |
| Backend framework | FastAPI / Pydantic v2 | Async-first; matches the toolchain used in trust-relay-workflow + zol-rag for org-wide skill compounding. Pydantic v2 gives typed contracts at every API boundary. |
| ORM | SQLAlchemy 2.0 (async) + Alembic + asyncpg | Mature schema-per-tenant story (SET search_path per session). Alembic handles tenant-aware migrations with the multi-tenant template. asyncpg is the fastest Python pg driver and native to the async runtime. |
| State store driver exception | psycopg 3 alongside asyncpg (amended 2026-04-25) | LangGraph's PostgresSaver checkpointer uses psycopg. The spike at docs/research/2026-04-25-langgraph-postgressaver-spike.md confirmed the Python port has no schema parameter and no per-call hook. Option A (TenantScopedSaver wrapper) is the chosen integration model; see the spike for the wrapper acceptance criteria. Operational consequence: two Postgres drivers in the backend (asyncpg for application queries, psycopg exclusively inside the FSM checkpoint path). |
| Orchestration / FSM | LangGraph + langgraph-checkpoint-postgres (amended 2026-04-25) | Interrupt-and-resume API is the closest published primitive to Ratiba's admin-handoff requirement. Adopted via Option A wrapper from the spike. Custom-FSM-on-asyncpg (Option C in the spike) is reserved as an escape hatch for ADR-0005 if LangGraph constrains other architectural choices in M3-M4. |
| Frontend | Next.js 14+ App Router / TypeScript / Tailwind / shadcn/ui | Same reasoning as the rest of the org's stack. shadcn/ui is consistent with the S4U design-system convention. |
| Auth | Keycloak 24+ | Tenant realms (one realm per tenant for hard isolation of admin user pools); phone-number-based authenticator for admin WhatsApp linking; OIDC for dashboard. Self-hosted, no per-MAU SaaS cost. |
| Database | PostgreSQL 16, schema-per-tenant | Hard isolation without the operational overhead of database-per-tenant. Shared public schema for the tenant registry. Row-Level Security available for any cross-tenant tables that arise later (e.g., shared service catalogue). |
| Cache / FSM state | Redis 7 | Session state for booking/admin orchestrators. FSM hydration on each turn keeps the orchestrators stateless. Also: rate limiting + WhatsApp webhook dedup. |
| WhatsApp ingress | Meta WhatsApp Cloud API (direct); supersedes 360dialog per ADR-0008 (2026-04-26) | Originally pinned to the 360dialog BSP for faster onboarding. ADR-0008 reopened the choice mid-M4 once cost data + multi-tenant fit were clearer: €0/month platform fee vs €49/month BSP, project-level HMAC across all tenants, single-secret rotation, no BSP-shape translation layer. Twilio long-code for production phone-number provisioning (verified once with WhatsApp Business Platform via SMS OTP, then claimed by the WABA). Reversibility-by-design: schema preserves the option to switch to a BSP later if pilot data forces it. |
| Voice STT | Deepgram Nova-3 | Multilingual (Swahili + English), streaming, low latency. Same provider the org already uses in zol-rag, so the playbook (key management, billing, eval) carries over. |
| Voice TTS | ElevenLabs Multilingual v2 | Same rationale as STT: already wired up in zol-rag, voice quality acceptable for Swahili. |
| Voice telephony | LiveKit (SIP + rooms) | SIP bridge for phone-number ingress; rooms for the agent worker pattern. Same pattern as zol-rag. |
| Payments | Safaricom Daraja API (M-Pesa STK push) | Non-negotiable for the Kenyan market. Feature-switched per tenant (some tenants will not collect at booking). PRD Annex A reserves the abstraction for future Airtel Money + Equitel / Jenga. |
| Containers | Docker Compose (dev + small-VPS prod) | Same pattern as the rest of the org. Production path to Kubernetes if scale demands it; not on the roadmap. |
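
The project-level HMAC mentioned in the WhatsApp ingress row amounts to checking one signature per webhook delivery. A minimal sketch, assuming Meta's documented scheme of an HMAC-SHA256 over the raw request body, keyed with the app secret and delivered as `sha256=<hex>` in the `X-Hub-Signature-256` header. The function and argument names are illustrative, not taken from the Ratiba codebase.

```python
import hashlib
import hmac


def is_valid_signature(app_secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check a `sha256=<hex digest>` signature header against the raw request body."""
    scheme, _, received = signature_header.partition("=")
    if scheme != "sha256" or not received:
        return False
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing;
    # comparing bytes also tolerates non-ASCII garbage in the header.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The check has to run on the raw bytes before any JSON parsing, since re-serialising the payload can change the digest. With a single app secret shared across tenants, rotation touches one value rather than one per tenant.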

Library Currency Policy (added 2026-04-25)

Ratiba treats dependency currency as a methodology concern, not a per-PR guess. The policy applies to every package in backend/pyproject.toml, frontend/package.json, and docusaurus/ratiba/package.json.

  1. At lock-in (initial pinning): every dependency is pinned to its latest stable release at the moment of pinning. No "match the version trust-relay uses." The PoC is the moment to start at the front.
  2. Patch releases (x.y.Z): auto-applied via Renovate (or Dependabot). Most ecosystems treat patches as safe; eval suite (when A3 lands) gates merge.
  3. Minor releases (x.Y.0): reviewed monthly by Adrian (or by an automated PR with the eval suite running against it once A3 lands). Adopt unless something breaks.
  4. Major releases (X.0.0): reviewed quarterly. ADR-worthy for any major bump that changes the stack's contracts (e.g., FastAPI 1.0, SQLAlchemy 3.0, Pydantic 3.0, LangGraph 1.0).
  5. Deprecation tracking: subscribe to release notes for the load-bearing libraries (LangGraph, FastAPI, Daraja's official Python SDK if they ship one, Pydantic, ElevenLabs SDK, Deepgram SDK, LiveKit SDK, Keycloak). Phase C flagged the OpenAI Assistants API sunset as the canonical "vendor pulled the rug" risk — the methodology supplement (Phase B) will codify the tracking ritual.
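
The release cadences in the list above reduce to a comparison of version triples. A rough sketch, assuming plain `X.Y.Z` version strings (pre-release or build suffixes would raise `ValueError` here) and hypothetical names throughout. In practice Renovate or Dependabot typically label the update type themselves; this only makes the policy's mapping explicit.

```python
from enum import Enum


class Bump(Enum):
    """Review cadence per bump type, paraphrasing the policy list."""

    NONE = "no action"
    PATCH = "auto-apply via bot"
    MINOR = "monthly review"
    MAJOR = "quarterly review, ADR if stack contracts change"


def parse(version: str) -> tuple[int, int, int]:
    """Split a strict X.Y.Z string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify(current: str, candidate: str) -> Bump:
    """Decide which review lane a candidate upgrade falls into."""
    cur, new = parse(current), parse(candidate)
    if new <= cur:
        return Bump.NONE  # same version or a downgrade
    if new[0] != cur[0]:
        return Bump.MAJOR
    if new[1] != cur[1]:
        return Bump.MINOR
    return Bump.PATCH
```

For example, `classify("0.115.0", "1.0.0")` lands in the quarterly lane, which is where the policy expects an ADR for a contract-changing release such as FastAPI 1.0.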

The Python pin (3.13) is reviewed annually with the same discipline. 3.14 → re-evaluate Q4 2026; 3.15 → re-evaluate Q4 2027; etc.

Consequences

Positive. This stack is essentially the union of trust-relay-workflow and zol-rag's stacks. Three benefits compound from that:

  1. Skill reuse — every tool is one a contributor has already shipped with. No new framework adoption tax.
  2. Operational reuse — Keycloak realm management, Postgres tuning, Redis ops, LiveKit-on-Docker config drift mitigation are all known problems with known solutions in this org.
  3. Library reuse — once Ratiba has shipped its first month of code, shared packages (e.g., a phone-number-to-tenant resolver, a Daraja client, a voice greeting builder) can be lifted into the trustrelay-* shared-package model without retrofit.

Negative. The stack assumes Python proficiency for the backend and TypeScript proficiency for the frontend. The East African contributor pool leans toward JavaScript-everywhere; this is a minor onboarding tax for local hires later. Acceptable trade-off for the first two years.

Negative (amended 2026-04-25). Two Postgres drivers in the backend (asyncpg for app queries, psycopg for the LangGraph FSM checkpoint path). Small operational wart: separate connection-pool tuning, two sets of driver upgrades to track. Mitigation: keep psycopg strictly scoped to the TenantScopedSaver wrapper module (backend/app/persistence/checkpointer.py) so the surface area never spreads. If langgraph-checkpoint-postgres ships an asyncpg adapter or upstream merges the schema parameter from the JS sibling, revisit and consolidate.
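
One way to enforce the scoping mitigation above is a small check, run in CI, that fails when any module other than the wrapper imports psycopg. The sketch below is an assumption about how that could look, not an existing Ratiba script: the helper names are hypothetical, only the allowed path comes from this ADR, and the match would need extending if `psycopg_pool` should count as well.

```python
import ast
from pathlib import Path

ALLOWED = Path("backend/app/persistence/checkpointer.py")  # path named in this ADR


def imports_psycopg(source: str) -> bool:
    """True if the module source imports psycopg or one of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 skips relative imports such as `from .psycopg import x`
            names = [node.module or ""]
        else:
            continue
        if any(name == "psycopg" or name.startswith("psycopg.") for name in names):
            return True
    return False


def offenders(root: Path, allowed: Path = ALLOWED) -> list[Path]:
    """Python files under `root` that import psycopg but are not the allowed module."""
    return sorted(
        path
        for path in root.rglob("*.py")
        if path.resolve() != allowed.resolve() and imports_psycopg(path.read_text())
    )
```

A CI step could call `offenders(Path("backend"))` and fail the build on a non-empty result, so the second driver cannot leak into application code unnoticed.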

Neutral. Schema-per-tenant has a known scaling ceiling of roughly 10,000 tenants per Postgres instance (each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.

Alternatives Considered

| Alternative | Rejected because |
| --- | --- |
| Node / NestJS backend | Splits the org's tech surface into two language stacks. Org has more Python bench depth. |
| Database-per-tenant | Operationally heavier (n connection pools, n migration runs, n backup windows) for a small additional isolation guarantee that schema-per-tenant already covers. |
| Self-hosted Mautic / Plivo / Twilio for WhatsApp | Twilio is more expensive at scale; Plivo's WhatsApp coverage is patchy outside India; Mautic isn't a BSP. |
| No schema isolation, RLS only | RLS-only requires every query to carry tenant context perfectly, every time. One missed WHERE tenant_id = … and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural. |
| Bare-metal Linux SIP + Asterisk | Higher operational burden than LiveKit + 360dialog. Not where we want our engineering time. |
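
The structural isolation argued for in the RLS-only row can be illustrated by how a session gets scoped to one tenant. A minimal sketch, assuming a `tenant_<slug>` schema naming convention and a slug rule that are not specified in this ADR; only the `public` registry schema and the per-session `SET search_path` come from the Decision table. Because identifiers cannot be passed as bind parameters, the slug is validated before it is interpolated.

```python
import re

# Assumed slug rule (not from the ADR): lowercase, starts with a letter, max 40 chars.
_TENANT_SLUG = re.compile(r"[a-z][a-z0-9_]{0,39}")


def tenant_schema(slug: str) -> str:
    """Map a tenant slug to its schema name, rejecting anything unsafe to interpolate."""
    if not _TENANT_SLUG.fullmatch(slug):
        raise ValueError(f"invalid tenant slug: {slug!r}")
    return f"tenant_{slug}"


def search_path_sql(slug: str) -> str:
    """Statement that scopes a session to one tenant schema plus the shared registry."""
    return f'SET search_path TO "{tenant_schema(slug)}", public'
```

Once a session has run this statement, an unqualified query such as `SELECT * FROM bookings` resolves inside the tenant's schema, so a forgotten tenant filter cannot reach another tenant's rows (the `bookings` table name is illustrative).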

References

  • docs/prd/ratiba-prd.md — full product requirements
  • Cross-project port assignments: CLAUDE.md § Port Assignments
  • LiveKit-in-Docker config pattern: trust-relay-workflow / zol-rag livekit.yaml (use node_ip: 127.0.0.1, stun_servers: [] for loopback dev)
  • docs/research/2026-04-25-agentic-landscape-2026.md — Phase C landscape scan (anchor for vocabulary, adopt/reject/park calls)
  • docs/research/2026-04-25-langgraph-postgressaver-spike.md — source-read spike that produced Option A (TenantScopedSaver wrapper) and the psycopg exception