ADR-0002: Multi-Tenant Isolation Strategy
Status: Accepted Date: 2026-04-25
Context
Ratiba is multi-tenant from day one (PRD §1.4, §2.2). ADR-0001 pinned PostgreSQL 16 with schema-per-tenant as the isolation model and Keycloak realm-per-tenant for admin user-pool isolation. Both decisions remain in force.
What was deferred from ADR-0001 — and what this ADR settles — is the
operational shape of multi-tenancy: how schemas get created, which
tables live in shared public versus per-tenant scope, how Alembic
migrations apply across N tenant schemas, how connection pools avoid the
"data leakage" failure mode the PostgresSaver spike warned about, and
how tenant_id propagates through the async runtime without becoming
a per-call argument.
Two new architectural primitives surfaced during the 2026-04-25 research investment that this ADR also locks in:
public.payment_routing(research locked decision C2): a shared-schema bridge table that maps Daraja / PesaPalmerchant_referencevalues to(tenant_id, schema_name, thread_id). Required because payment-provider webhooks arrive at our backend before tenant resolution. Detailed justification indocs/research/2026-04-25-payments-orchestration.md§3 and the research-investment design spec §12.- TenantScopedSaver wrapper for LangGraph state (spike Option A,
ADR-0001 amendment): per-invocation
psycopgconnection withsearch_pathpre-set, used exclusively for the LangGraph checkpointer. Detailed indocs/research/2026-04-25-langgraph-postgressaver-spike.md.
This ADR ties those operational decisions together into a coherent isolation model that subsequent ADRs (0003 FSM persistence, 0004 testing strategy, 0005 orchestration model, 0006 handoff model, 0007 payments orchestration) can cite without re-deriving.
Decision
Seven specific decisions, taken together.
1. Tenant registry shape — three shared tables in public
The complete public-schema tenant registry consists of exactly three
tables. The discipline: public only holds rows whose lookup happens
before tenant resolution. Anything else lives inside the per-tenant
schema.
| Table | Purpose | Lookup happens before tenant resolution because… |
|---|---|---|
public.tenants | Tenant identity, config, lifecycle state | Channel-boundary lookup by whatsapp_number / voice_number resolves the tenant. |
public.tenant_admins | Admin phone → tenant + Keycloak user | Channel-boundary lookup distinguishes admin from customer when the inbound from arrives. |
public.payment_routing | merchant_reference → tenant + thread_id | Daraja / PesaPal IPN handlers arrive in tenant-less context; need to resolve which tenant before switching schemas. |
The schemas for public.tenants and public.tenant_admins are inherited
from PRD §3.1 (kept verbatim). The schema for public.payment_routing
is new with this ADR:
CREATE TABLE public.payment_routing (
merchant_reference VARCHAR(40) PRIMARY KEY,
tenant_id UUID NOT NULL REFERENCES public.tenants(id),
schema_name VARCHAR(100) NOT NULL,
thread_id VARCHAR(100) NOT NULL,
provider VARCHAR(20) NOT NULL
CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL -- 24h after init; reaper purges
);
CREATE INDEX idx_payment_routing_tenant ON public.payment_routing(tenant_id);
CREATE INDEX idx_payment_routing_expires ON public.payment_routing(expires_at)
WHERE expires_at IS NOT NULL;
Reaper job runs hourly; deletes rows where expires_at < NOW(). Specific
TTL semantics (and what happens to a callback that arrives after TTL) are
ADR-0007 scope.
2. Per-tenant schema contents
Each tenant gets a schema named tenant_<slug> (where <slug> matches
public.tenants.slug, sanitized to [a-z0-9_]). Two phases populate it:
Phase A — App-side migrations (Alembic, asyncpg).
Tables from PRD §3.2 plus the handoff_log table introduced in
docs/research/2026-04-25-human-in-the-loop-handoff.md §5:
services staff staff_services
staff_schedules staff_blocks contacts
appointments interactions payments
handoff_log
Phase B — LangGraph checkpoint tables (psycopg, vendor setup()).
Created by PostgresSaver(conn=tenant_conn).setup() immediately after
Phase A on the same tenant connection:
checkpoints checkpoint_blobs checkpoint_writes checkpoint_migrations
Both phases are part of the onboarding transaction (D5).
3. Migration template — per-tenant invocation, not iterating loop
alembic/env.py is configured to apply migrations against either the
default schema (for public) or a specified --tenant=<slug> schema
(for tenant-scoped migrations). The target_metadata selection branches
on a RATIBA_MIGRATION_SCOPE environment variable.
Rationale. The iterating-loop pattern (one Alembic file that
iterates over public.tenants and applies the change to each schema)
hides per-tenant errors and is hard to retry surgically. Per-tenant
invocation isolates failures: a migration that crashes on tenant 47
doesn't strand 48-100 in inconsistent state.
For bulk operations (apply a new migration to all existing tenants),
scripts/migrate-all-tenants.sh iterates over public.tenants and
runs alembic upgrade head --tenant=<slug> for each, with explicit
per-tenant success/failure logging and a manifest output for partial
retries.
4. Connection pooling — two pools, two patterns
This is the load-bearing decision. The spike's §2.6 warned about the
"data leakage" failure mode where a shared connection pool returns a
connection still carrying the previous tenant's search_path. We
neutralize that risk with two distinct pools serving two workloads:
App-side queries (asyncpg) — single shared pool with strict per-checkout discipline.
- One process-wide
asyncpg.Poolof N connections (sized for peak app load). - Connection acquisition goes through a context-manager
(
backend/app/persistence/session.pyget_tenant_session()) that reads thecurrent_tenantcontextvar (D7) and issuesSET LOCAL search_path TO <tenant_schema>, publicon every borrow. SET LOCAL(notSET) means the search_path resets at the end of the implicit transaction — implicit safety net against forgetting to reset on release.- Acquisition without a
current_tenantcontextvar raises immediately; the pool refuses to hand out an unscoped connection.
LangGraph checkpoint (psycopg) — per-tenant micro-pools.
- The TenantScopedSaver factory (
backend/app/persistence/checkpointer.py) maintains a dict oftenant_id → psycopg_pool.AsyncConnectionPool, each sized 1-2 connections. - Pools are created lazily on first use of a tenant.
- Idle pools are closed after 30 min of no use (background reaper).
- The factory's
for_tenant(tenant_id)returns a dedicated connection withsearch_pathpre-set, used to construct a per-invocationPostgresSaver. - At scale (target: 1,000 tenants × ~50 active per hour), front the whole thing with PgBouncer in transaction-pooling mode; the per-tenant pool count becomes the connection-multiplexing problem PgBouncer is designed for. PgBouncer wiring is deferred until the scale justifies it; explicitly noted in §Consequences.
Why two patterns. App queries are async-asyncpg native and benefit
from a shared pool with rigorous per-checkout discipline. LangGraph's
PostgresSaver does not expose a per-call search_path injection point
(spike §2.4), so the only safe path is dedicated per-tenant connections.
Forking LangGraph (spike Option B) was rejected as carrying maintenance
debt without proportional ergonomic gain; building a custom FSM (spike
Option C) was reserved as an escape hatch only.
5. New-tenant provisioning sequence
1. INSERT public.tenants row → tenant_id, schema_name, status='trial'
2. CREATE SCHEMA tenant_<slug> AUTHORIZATION ratiba
3. SET search_path TO tenant_<slug>; alembic upgrade head --tenant=<slug>
4. PostgresSaver(conn=tenant_conn).setup() -- LangGraph checkpoint tables
5. Keycloak admin API: create realm tenant-<slug>
6. Keycloak admin API: create owner role + initial admin user
7. INSERT public.tenant_admins rows for the registered admin phone(s)
8. COMMIT
Atomicity strategy. Postgres steps 1-4 + 7 wrap in one
transaction. PostgreSQL DDL is transactional, so CREATE SCHEMA
participates in the transaction and rolls back on failure.
Keycloak steps 5-6 are NOT in the Postgres transaction (different system). Failure modes:
- Keycloak realm creation fails: rollback the Postgres transaction; no orphaned schema. Onboarding fails cleanly.
- Keycloak admin user creation fails AFTER realm created: delete the realm via Keycloak admin API (compensating action), then rollback Postgres transaction.
- Keycloak succeeds but step 7 (
tenant_adminsinsert) fails: rare (constraint violation on duplicate phone); compensating action is delete the Keycloak realm + user, then rollback Postgres.
The compensating-action layer is explicit code in
backend/app/tenancy/onboarding.py, not silent best-effort cleanup.
Each compensating action logs an audit.config.changed event per the
auto-debug logging schema.
A failed onboarding leaves zero residue: no orphaned schemas, no
orphaned Keycloak realms, no half-populated tenant_admins rows.
6. Tenant suspension and deletion
Three lifecycle states on public.tenants.status:
| State | Meaning | What changes |
|---|---|---|
active | Normal operation | All channel routing accepts inbound. |
suspended | Temporarily inactive | Channel routing rejects inbound politely; tenant schema and data preserved intact. |
deleted | Hard-deleted | Tenant schema dropped (DROP SCHEMA tenant_<slug> CASCADE); public.tenants row soft-deleted (deleted_at = NOW(), schema_name = NULL); Keycloak realm removed; backups retained per legal retention policy. |
Why soft-delete the registry row. Audit logs (and possibly historical
public.tenant_admins rows or compliance traces) reference
tenants.id via foreign key. Hard-deleting the row would either
cascade-delete the audit trail or fail the constraint. Soft-delete
preserves referential integrity while marking the tenant inactive.
Why hard-drop the schema. Zombie tenant data accumulates storage cost and presents an exfiltration target with no operational benefit. Backups are the recovery path if a deletion was mistaken.
Suspension and deletion both go through
backend/app/tenancy/lifecycle.py. Deletion is gated on a 7-day
"deletion pending" cooling-off period during which the tenant can be
restored without backup recovery.
7. Tenant identifier propagation — asyncio contextvar
The current_tenant contextvar pattern, set once at the channel
boundary and inherited automatically by every async task spawned from
that point:
# backend/app/tenancy/context.py
from contextvars import ContextVar
from dataclasses import dataclass
@dataclass(frozen=True)
class TenantContext:
tenant_id: UUID
schema_name: str # e.g., "tenant_msmama_spa"
timezone: str # e.g., "Africa/Nairobi"
locale: str # "en" | "sw"
mpesa_enabled: bool
current_tenant: ContextVar[TenantContext | None] = ContextVar(
"current_tenant", default=None
)
Set by the IdentityResolver at the channel boundary (WhatsApp webhook, voice ingress):
# backend/app/orchestration/runner.py (sketch)
async def handle_inbound_whatsapp(msg: WhatsAppMessage):
tenant = await tenant_resolver.resolve_from_business_number(msg.business_number)
token = current_tenant.set(tenant)
try:
await orchestrator.handle(msg)
finally:
current_tenant.reset(token)
Read by the connection-acquire wrapper (D4):
# backend/app/persistence/session.py (sketch)
@asynccontextmanager
async def get_tenant_session():
tenant = current_tenant.get()
if tenant is None:
raise RuntimeError("get_tenant_session called outside tenant context")
async with pool.acquire() as conn:
await conn.execute(
f"SET LOCAL search_path TO {tenant.schema_name}, public"
)
yield conn
The RuntimeError on missing context is deliberate: it surfaces a
real bug (someone tried to query the database from a code path that
never ran the IdentityResolver) instead of silently leaking into the
default public schema.
The contextvar module pattern is lifted from trust-relay-workflow's identical mechanism — proven at production scale, mirrors the pattern in the global S4U methodology.
Consequences
Positive.
- Hard tenant isolation by structural design, not by application
discipline. A bug in application code that forgets to filter by
tenant_idcannot leak across tenants because the database connection'ssearch_pathonly sees one tenant's schema. - Webhook routing solves cleanly. The
public.payment_routingtable makes the tenant-less-webhook-context problem a 1-row lookup instead of a heuristic phone-number-matching exercise. - Onboarding is auditable and atomic. Each compensating action logs an explicit event; failures leave no orphans.
- Per-tenant migration isolation. A buggy migration crashing on tenant 47 strands no other tenants.
- The contextvar pattern keeps tenant routing out of every function
signature.
tenant_iddoesn't appear in the call graph except at the channel boundary and the connection-acquire wrapper.
Negative.
- Two Postgres drivers in the backend (asyncpg + psycopg). Carried
forward from ADR-0001 amendment. Operational wart: separate
connection-pool tuning, two sets of driver-version upgrades to
track. Mitigated by scoping psycopg strictly to
backend/app/persistence/checkpointer.py. - Per-tenant micro-pools scale connection count quadratically with
tenants. At 1,000 tenants × 2 idle connections = 2,000 idle
Postgres connections at peak. Default
max_connections=100is insufficient; a tuned VPS does 500-1000. Beyond that, PgBouncer in transaction-pooling mode in front of Postgres becomes mandatory. Acceptable for PoC scale; revisit when active-tenant count exceeds ~100. - Two-phase tenant setup (Alembic + LangGraph saver) is a sequence
that must succeed atomically. Failure modes are documented (D5)
but the second phase requires the LangGraph library's
setup()method to remain stable. A breaking change inlanggraph-checkpoint-postgresmigrations would force a coordinated upgrade across all tenant schemas. Mitigation: the library-currency policy (ADR-0001) flagslanggraph-checkpoint-postgresas load-bearing with 14-day minimum-release-age and explicit reviewer.
Neutral.
- Schema-per-tenant has a known scaling ceiling around ~10,000 tenants per Postgres instance (carried forward from ADR-0001; each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.
- Schema names are derived from
tenants.slugwhich is user-supplied at onboarding. The[a-z0-9_]-only sanitization + uniqueness constraint prevents SQL-injection attacks via schema-name interpolation, but the string interpolation inSET search_path TO {tenant.schema_name}is the one place where the sanitization is load-bearing. Tested extensively; called out in the human-review-only list (Phase B §6).
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Database-per-tenant (instead of schema-per-tenant). | Operationally heavier for the additional isolation guarantee that schema-per-tenant already provides. Same call as ADR-0001; reaffirmed here. |
| Row-Level Security (RLS) on shared tables, no schemas. | RLS-only requires every query to carry tenant context perfectly, every time. One missed WHERE tenant_id = … and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural; the application cannot accidentally see another tenant's rows because the database's search_path doesn't include them. Same call as ADR-0001; reaffirmed here. |
Single shared connection pool with SET search_path per checkout for both asyncpg AND LangGraph. | The LangGraph PostgresSaver issues SQL through a connection it has captured at construction time; the _cursor chokepoint (spike §2.3) does not re-issue SET search_path per call. A shared pool whose search_path could be modified by another concurrent request is the exact "data leakage" failure mode the spike's forum reference (#3274) warned about. Per-tenant micro-pools for the LangGraph path eliminate this class of bug. |
Iterating-loop migration pattern (upgrade() iterates over public.tenants). | Per-tenant invocation (D3) gives clean per-tenant failure isolation. The iterating loop is harder to retry surgically and silently buries per-tenant errors in aggregate logs. |
Hard-delete public.tenants on tenant deletion. | Audit logs and historical references would either cascade-delete (loss of audit trail) or fail the constraint (operational nightmare). Soft-delete preserves referential integrity. |
Pass tenant_id explicitly through every function signature (no contextvar). | Verbose; pollutes every call signature with a parameter that 99% of code never reads; encourages "just default it" when tenant is missing — exactly the bug class contextvar prevents by raising immediately. |
| One Keycloak realm shared across tenants, with groups for tenant separation. | ADR-0001 already chose realm-per-tenant for hard isolation of admin user pools. Shared-realm-with-groups would require careful group-membership filtering on every Keycloak query — a runtime safety check rather than a structural one. Reaffirmed here. |
References
docs/prd/ratiba-prd.md— §1.4 (multi-tenant SaaS), §2.2 (tech stack), §3.1 (shared schema), §3.2 (tenant schema), Annex A (M-Pesa), Annex B (PesaPal cards)docs/adr/ADR-0001-tech-stack.md(amended 2026-04-25) — schema-per-tenant pinned + LangGraph + psycopg exceptiondocs/research/2026-04-25-langgraph-postgressaver-spike.md— Option A TenantScopedSaver wrapper + the "data leakage" failure mode warningdocs/research/2026-04-25-orchestration-patterns.md— A1 §7.2 TenantScopedSaver injection pattern; §8 two-tier persistencedocs/research/2026-04-25-payments-orchestration.md— A4 §3public.payment_routingschema + lookup patterndocs/research/2026-04-25-human-in-the-loop-handoff.md— A2 §9 per-tenant LangGraph checkpoint + handoff_log tabledocs/superpowers/specs/2026-04-25-agentic-research-investment-design.md§12 — Locked Decisions Log (C1-C3 + Q1-Q6 with cite-into-ADR mapping)- trust-relay-workflow contextvar pattern (
backend/app/persistence/session.py) — proven at production scale, lifted as the reference implementation shape - Postgres documentation:
SET LOCALsemantics, transactional DDL,search_pathresolution - LangGraph
langgraph-checkpoint-postgresv2.x —PostgresSaver.setup()table-creation contract