Skip to main content

ADR-0002: Multi-Tenant Isolation Strategy

Status: Accepted Date: 2026-04-25

Context

Ratiba is multi-tenant from day one (PRD §1.4, §2.2). ADR-0001 pinned PostgreSQL 16 with schema-per-tenant as the isolation model and Keycloak realm-per-tenant for admin user-pool isolation. Both decisions remain in force.

What was deferred from ADR-0001 — and what this ADR settles — is the operational shape of multi-tenancy: how schemas get created, which tables live in shared public versus per-tenant scope, how Alembic migrations apply across N tenant schemas, how connection pools avoid the "data leakage" failure mode the PostgresSaver spike warned about, and how tenant_id propagates through the async runtime without becoming a per-call argument.

Two new architectural primitives surfaced during the 2026-04-25 research investment that this ADR also locks in:

  1. public.payment_routing (research locked decision C2): a shared-schema bridge table that maps Daraja / PesaPal merchant_reference values to (tenant_id, schema_name, thread_id). Required because payment-provider webhooks arrive at our backend before tenant resolution. Detailed justification in docs/research/2026-04-25-payments-orchestration.md §3 and the research-investment design spec §12.
  2. TenantScopedSaver wrapper for LangGraph state (spike Option A, ADR-0001 amendment): per-invocation psycopg connection with search_path pre-set, used exclusively for the LangGraph checkpointer. Detailed in docs/research/2026-04-25-langgraph-postgressaver-spike.md.

This ADR ties those operational decisions together into a coherent isolation model that subsequent ADRs (0003 FSM persistence, 0004 testing strategy, 0005 orchestration model, 0006 handoff model, 0007 payments orchestration) can cite without re-deriving.

Decision

Seven specific decisions, taken together.

1. Tenant registry shape — three shared tables in public

The complete public-schema tenant registry consists of exactly three tables. The discipline: public only holds rows whose lookup happens before tenant resolution. Anything else lives inside the per-tenant schema.

TablePurposeLookup happens before tenant resolution because…
public.tenantsTenant identity, config, lifecycle stateChannel-boundary lookup by whatsapp_number / voice_number resolves the tenant.
public.tenant_adminsAdmin phone → tenant + Keycloak userChannel-boundary lookup distinguishes admin from customer when the inbound from arrives.
public.payment_routingmerchant_reference → tenant + thread_idDaraja / PesaPal IPN handlers arrive in tenant-less context; need to resolve which tenant before switching schemas.

The schemas for public.tenants and public.tenant_admins are inherited from PRD §3.1 (kept verbatim). The schema for public.payment_routing is new with this ADR:

CREATE TABLE public.payment_routing (
merchant_reference VARCHAR(40) PRIMARY KEY,
tenant_id UUID NOT NULL REFERENCES public.tenants(id),
schema_name VARCHAR(100) NOT NULL,
thread_id VARCHAR(100) NOT NULL,
provider VARCHAR(20) NOT NULL
CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL -- 24h after init; reaper purges
);

CREATE INDEX idx_payment_routing_tenant ON public.payment_routing(tenant_id);
CREATE INDEX idx_payment_routing_expires ON public.payment_routing(expires_at)
WHERE expires_at IS NOT NULL;

Reaper job runs hourly; deletes rows where expires_at < NOW(). Specific TTL semantics (and what happens to a callback that arrives after TTL) are ADR-0007 scope.

2. Per-tenant schema contents

Each tenant gets a schema named tenant_<slug> (where <slug> matches public.tenants.slug, sanitized to [a-z0-9_]). Two phases populate it:

Phase A — App-side migrations (Alembic, asyncpg). Tables from PRD §3.2 plus the handoff_log table introduced in docs/research/2026-04-25-human-in-the-loop-handoff.md §5:

services staff staff_services
staff_schedules staff_blocks contacts
appointments interactions payments
handoff_log

Phase B — LangGraph checkpoint tables (psycopg, vendor setup()). Created by PostgresSaver(conn=tenant_conn).setup() immediately after Phase A on the same tenant connection:

checkpoints checkpoint_blobs checkpoint_writes checkpoint_migrations

Both phases are part of the onboarding transaction (D5).

3. Migration template — per-tenant invocation, not iterating loop

alembic/env.py is configured to apply migrations against either the default schema (for public) or a specified --tenant=<slug> schema (for tenant-scoped migrations). The target_metadata selection branches on a RATIBA_MIGRATION_SCOPE environment variable.

Rationale. The iterating-loop pattern (one Alembic file that iterates over public.tenants and applies the change to each schema) hides per-tenant errors and is hard to retry surgically. Per-tenant invocation isolates failures: a migration that crashes on tenant 47 doesn't strand 48-100 in inconsistent state.

For bulk operations (apply a new migration to all existing tenants), scripts/migrate-all-tenants.sh iterates over public.tenants and runs alembic upgrade head --tenant=<slug> for each, with explicit per-tenant success/failure logging and a manifest output for partial retries.

4. Connection pooling — two pools, two patterns

This is the load-bearing decision. The spike's §2.6 warned about the "data leakage" failure mode where a shared connection pool returns a connection still carrying the previous tenant's search_path. We neutralize that risk with two distinct pools serving two workloads:

App-side queries (asyncpg) — single shared pool with strict per-checkout discipline.

  • One process-wide asyncpg.Pool of N connections (sized for peak app load).
  • Connection acquisition goes through a context-manager (backend/app/persistence/session.py get_tenant_session()) that reads the current_tenant contextvar (D7) and issues SET LOCAL search_path TO <tenant_schema>, public on every borrow.
  • SET LOCAL (not SET) means the search_path resets at the end of the implicit transaction — implicit safety net against forgetting to reset on release.
  • Acquisition without a current_tenant contextvar raises immediately; the pool refuses to hand out an unscoped connection.

LangGraph checkpoint (psycopg) — per-tenant micro-pools.

  • The TenantScopedSaver factory (backend/app/persistence/checkpointer.py) maintains a dict of tenant_id → psycopg_pool.AsyncConnectionPool, each sized 1-2 connections.
  • Pools are created lazily on first use of a tenant.
  • Idle pools are closed after 30 min of no use (background reaper).
  • The factory's for_tenant(tenant_id) returns a dedicated connection with search_path pre-set, used to construct a per-invocation PostgresSaver.
  • At scale (target: 1,000 tenants × ~50 active per hour), front the whole thing with PgBouncer in transaction-pooling mode; the per-tenant pool count becomes the connection-multiplexing problem PgBouncer is designed for. PgBouncer wiring is deferred until the scale justifies it; explicitly noted in §Consequences.

Why two patterns. App queries are async-asyncpg native and benefit from a shared pool with rigorous per-checkout discipline. LangGraph's PostgresSaver does not expose a per-call search_path injection point (spike §2.4), so the only safe path is dedicated per-tenant connections. Forking LangGraph (spike Option B) was rejected as carrying maintenance debt without proportional ergonomic gain; building a custom FSM (spike Option C) was reserved as an escape hatch only.

5. New-tenant provisioning sequence

1. INSERT public.tenants row → tenant_id, schema_name, status='trial'
2. CREATE SCHEMA tenant_<slug> AUTHORIZATION ratiba
3. SET search_path TO tenant_<slug>; alembic upgrade head --tenant=<slug>
4. PostgresSaver(conn=tenant_conn).setup() -- LangGraph checkpoint tables
5. Keycloak admin API: create realm tenant-<slug>
6. Keycloak admin API: create owner role + initial admin user
7. INSERT public.tenant_admins rows for the registered admin phone(s)
8. COMMIT

Atomicity strategy. Postgres steps 1-4 + 7 wrap in one transaction. PostgreSQL DDL is transactional, so CREATE SCHEMA participates in the transaction and rolls back on failure.

Keycloak steps 5-6 are NOT in the Postgres transaction (different system). Failure modes:

  • Keycloak realm creation fails: rollback the Postgres transaction; no orphaned schema. Onboarding fails cleanly.
  • Keycloak admin user creation fails AFTER realm created: delete the realm via Keycloak admin API (compensating action), then rollback Postgres transaction.
  • Keycloak succeeds but step 7 (tenant_admins insert) fails: rare (constraint violation on duplicate phone); compensating action is delete the Keycloak realm + user, then rollback Postgres.

The compensating-action layer is explicit code in backend/app/tenancy/onboarding.py, not silent best-effort cleanup. Each compensating action logs an audit.config.changed event per the auto-debug logging schema.

A failed onboarding leaves zero residue: no orphaned schemas, no orphaned Keycloak realms, no half-populated tenant_admins rows.

6. Tenant suspension and deletion

Three lifecycle states on public.tenants.status:

StateMeaningWhat changes
activeNormal operationAll channel routing accepts inbound.
suspendedTemporarily inactiveChannel routing rejects inbound politely; tenant schema and data preserved intact.
deletedHard-deletedTenant schema dropped (DROP SCHEMA tenant_<slug> CASCADE); public.tenants row soft-deleted (deleted_at = NOW(), schema_name = NULL); Keycloak realm removed; backups retained per legal retention policy.

Why soft-delete the registry row. Audit logs (and possibly historical public.tenant_admins rows or compliance traces) reference tenants.id via foreign key. Hard-deleting the row would either cascade-delete the audit trail or fail the constraint. Soft-delete preserves referential integrity while marking the tenant inactive.

Why hard-drop the schema. Zombie tenant data accumulates storage cost and presents an exfiltration target with no operational benefit. Backups are the recovery path if a deletion was mistaken.

Suspension and deletion both go through backend/app/tenancy/lifecycle.py. Deletion is gated on a 7-day "deletion pending" cooling-off period during which the tenant can be restored without backup recovery.

7. Tenant identifier propagation — asyncio contextvar

The current_tenant contextvar pattern, set once at the channel boundary and inherited automatically by every async task spawned from that point:

# backend/app/tenancy/context.py
from contextvars import ContextVar
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantContext:
tenant_id: UUID
schema_name: str # e.g., "tenant_msmama_spa"
timezone: str # e.g., "Africa/Nairobi"
locale: str # "en" | "sw"
mpesa_enabled: bool

current_tenant: ContextVar[TenantContext | None] = ContextVar(
"current_tenant", default=None
)

Set by the IdentityResolver at the channel boundary (WhatsApp webhook, voice ingress):

# backend/app/orchestration/runner.py (sketch)
async def handle_inbound_whatsapp(msg: WhatsAppMessage):
tenant = await tenant_resolver.resolve_from_business_number(msg.business_number)
token = current_tenant.set(tenant)
try:
await orchestrator.handle(msg)
finally:
current_tenant.reset(token)

Read by the connection-acquire wrapper (D4):

# backend/app/persistence/session.py (sketch)
@asynccontextmanager
async def get_tenant_session():
tenant = current_tenant.get()
if tenant is None:
raise RuntimeError("get_tenant_session called outside tenant context")
async with pool.acquire() as conn:
await conn.execute(
f"SET LOCAL search_path TO {tenant.schema_name}, public"
)
yield conn

The RuntimeError on missing context is deliberate: it surfaces a real bug (someone tried to query the database from a code path that never ran the IdentityResolver) instead of silently leaking into the default public schema.

The contextvar module pattern is lifted from trust-relay-workflow's identical mechanism — proven at production scale, mirrors the pattern in the global S4U methodology.

Consequences

Positive.

  1. Hard tenant isolation by structural design, not by application discipline. A bug in application code that forgets to filter by tenant_id cannot leak across tenants because the database connection's search_path only sees one tenant's schema.
  2. Webhook routing solves cleanly. The public.payment_routing table makes the tenant-less-webhook-context problem a 1-row lookup instead of a heuristic phone-number-matching exercise.
  3. Onboarding is auditable and atomic. Each compensating action logs an explicit event; failures leave no orphans.
  4. Per-tenant migration isolation. A buggy migration crashing on tenant 47 strands no other tenants.
  5. The contextvar pattern keeps tenant routing out of every function signature. tenant_id doesn't appear in the call graph except at the channel boundary and the connection-acquire wrapper.

Negative.

  1. Two Postgres drivers in the backend (asyncpg + psycopg). Carried forward from ADR-0001 amendment. Operational wart: separate connection-pool tuning, two sets of driver-version upgrades to track. Mitigated by scoping psycopg strictly to backend/app/persistence/checkpointer.py.
  2. Per-tenant micro-pools scale connection count quadratically with tenants. At 1,000 tenants × 2 idle connections = 2,000 idle Postgres connections at peak. Default max_connections=100 is insufficient; a tuned VPS does 500-1000. Beyond that, PgBouncer in transaction-pooling mode in front of Postgres becomes mandatory. Acceptable for PoC scale; revisit when active-tenant count exceeds ~100.
  3. Two-phase tenant setup (Alembic + LangGraph saver) is a sequence that must succeed atomically. Failure modes are documented (D5) but the second phase requires the LangGraph library's setup() method to remain stable. A breaking change in langgraph-checkpoint-postgres migrations would force a coordinated upgrade across all tenant schemas. Mitigation: the library-currency policy (ADR-0001) flags langgraph-checkpoint-postgres as load-bearing with 14-day minimum-release-age and explicit reviewer.

Neutral.

  1. Schema-per-tenant has a known scaling ceiling around ~10,000 tenants per Postgres instance (carried forward from ADR-0001; each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.
  2. Schema names are derived from tenants.slug which is user-supplied at onboarding. The [a-z0-9_]-only sanitization + uniqueness constraint prevents SQL-injection attacks via schema-name interpolation, but the string interpolation in SET search_path TO {tenant.schema_name} is the one place where the sanitization is load-bearing. Tested extensively; called out in the human-review-only list (Phase B §6).

Alternatives Considered

AlternativeRejected because
Database-per-tenant (instead of schema-per-tenant).Operationally heavier for the additional isolation guarantee that schema-per-tenant already provides. Same call as ADR-0001; reaffirmed here.
Row-Level Security (RLS) on shared tables, no schemas.RLS-only requires every query to carry tenant context perfectly, every time. One missed WHERE tenant_id = … and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural; the application cannot accidentally see another tenant's rows because the database's search_path doesn't include them. Same call as ADR-0001; reaffirmed here.
Single shared connection pool with SET search_path per checkout for both asyncpg AND LangGraph.The LangGraph PostgresSaver issues SQL through a connection it has captured at construction time; the _cursor chokepoint (spike §2.3) does not re-issue SET search_path per call. A shared pool whose search_path could be modified by another concurrent request is the exact "data leakage" failure mode the spike's forum reference (#3274) warned about. Per-tenant micro-pools for the LangGraph path eliminate this class of bug.
Iterating-loop migration pattern (upgrade() iterates over public.tenants).Per-tenant invocation (D3) gives clean per-tenant failure isolation. The iterating loop is harder to retry surgically and silently buries per-tenant errors in aggregate logs.
Hard-delete public.tenants on tenant deletion.Audit logs and historical references would either cascade-delete (loss of audit trail) or fail the constraint (operational nightmare). Soft-delete preserves referential integrity.
Pass tenant_id explicitly through every function signature (no contextvar).Verbose; pollutes every call signature with a parameter that 99% of code never reads; encourages "just default it" when tenant is missing — exactly the bug class contextvar prevents by raising immediately.
One Keycloak realm shared across tenants, with groups for tenant separation.ADR-0001 already chose realm-per-tenant for hard isolation of admin user pools. Shared-realm-with-groups would require careful group-membership filtering on every Keycloak query — a runtime safety check rather than a structural one. Reaffirmed here.

References

  • docs/prd/ratiba-prd.md — §1.4 (multi-tenant SaaS), §2.2 (tech stack), §3.1 (shared schema), §3.2 (tenant schema), Annex A (M-Pesa), Annex B (PesaPal cards)
  • docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — schema-per-tenant pinned + LangGraph + psycopg exception
  • docs/research/2026-04-25-langgraph-postgressaver-spike.md — Option A TenantScopedSaver wrapper + the "data leakage" failure mode warning
  • docs/research/2026-04-25-orchestration-patterns.md — A1 §7.2 TenantScopedSaver injection pattern; §8 two-tier persistence
  • docs/research/2026-04-25-payments-orchestration.md — A4 §3 public.payment_routing schema + lookup pattern
  • docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §9 per-tenant LangGraph checkpoint + handoff_log table
  • docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — Locked Decisions Log (C1-C3 + Q1-Q6 with cite-into-ADR mapping)
  • trust-relay-workflow contextvar pattern (backend/app/persistence/session.py) — proven at production scale, lifted as the reference implementation shape
  • Postgres documentation: SET LOCAL semantics, transactional DDL, search_path resolution
  • LangGraph langgraph-checkpoint-postgres v2.x — PostgresSaver.setup() table-creation contract