ADR-0002: Multi-Tenant Isolation Strategy

Status: Accepted Date: 2026-04-25

Context

Ratiba is multi-tenant from day one (PRD §1.4, §2.2). ADR-0001 pinned PostgreSQL 16 with schema-per-tenant as the isolation model and Keycloak realm-per-tenant for admin user-pool isolation. Both decisions remain in force.

What was deferred from ADR-0001 — and what this ADR settles — is the operational shape of multi-tenancy: how schemas get created, which tables live in shared public versus per-tenant scope, how Alembic migrations apply across N tenant schemas, how connection pools avoid the "data leakage" failure mode the PostgresSaver spike warned about, and how tenant_id propagates through the async runtime without becoming a per-call argument.

Two new architectural primitives surfaced during the 2026-04-25 research investment that this ADR also locks in:

public.payment_routing (research locked decision C2): a shared-schema bridge table that maps Daraja / PesaPal merchant_reference values to (tenant_id, schema_name, thread_id). Required because payment-provider webhooks arrive at our backend before tenant resolution. Detailed justification in docs/research/2026-04-25-payments-orchestration.md §3 and the research-investment design spec §12.
TenantScopedSaver wrapper for LangGraph state (spike Option A, ADR-0001 amendment): per-invocation psycopg connection with search_path pre-set, used exclusively for the LangGraph checkpointer. Detailed in docs/research/2026-04-25-langgraph-postgressaver-spike.md.

This ADR ties those operational decisions together into a coherent isolation model that subsequent ADRs (0003 FSM persistence, 0004 testing strategy, 0005 orchestration model, 0006 handoff model, 0007 payments orchestration) can cite without re-deriving.

Decision

Seven specific decisions, taken together.

1. Tenant registry shape — three shared tables in `public`

The complete public-schema tenant registry consists of exactly three tables. The discipline: public only holds rows whose lookup happens before tenant resolution. Anything else lives inside the per-tenant schema.

Table	Purpose	Lookup happens before tenant resolution because…
`public.tenants`	Tenant identity, config, lifecycle state	Channel-boundary lookup by `whatsapp_number` / `voice_number` resolves the tenant.
`public.tenant_admins`	Admin phone → tenant + Keycloak user	Channel-boundary lookup distinguishes admin from customer when the inbound `from` arrives.
`public.payment_routing`	`merchant_reference` → tenant + thread_id	Daraja / PesaPal IPN handlers arrive in tenant-less context; need to resolve which tenant before switching schemas.

The schemas for public.tenants and public.tenant_admins are inherited from PRD §3.1 (kept verbatim). The schema for public.payment_routing is new with this ADR:

CREATE TABLE public.payment_routing (
  merchant_reference VARCHAR(40) PRIMARY KEY,
  tenant_id          UUID NOT NULL REFERENCES public.tenants(id),
  schema_name        VARCHAR(100) NOT NULL,
  thread_id          VARCHAR(100) NOT NULL,
  provider           VARCHAR(20) NOT NULL
                     CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')),
  created_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at         TIMESTAMPTZ NOT NULL  -- 24h after init; reaper purges
);

CREATE INDEX idx_payment_routing_tenant ON public.payment_routing(tenant_id);
CREATE INDEX idx_payment_routing_expires ON public.payment_routing(expires_at)
  WHERE expires_at IS NOT NULL;

Reaper job runs hourly; deletes rows where expires_at < NOW(). Specific TTL semantics (and what happens to a callback that arrives after TTL) are ADR-0007 scope.

2. Per-tenant schema contents

Each tenant gets a schema named tenant_<slug> (where <slug> matches public.tenants.slug, sanitized to [a-z0-9_]). Two phases populate it:

Phase A — App-side migrations (Alembic, asyncpg). Tables from PRD §3.2 plus the handoff_log table introduced in docs/research/2026-04-25-human-in-the-loop-handoff.md §5:

services            staff               staff_services
staff_schedules     staff_blocks        contacts
appointments        interactions        payments
handoff_log

Phase B — LangGraph checkpoint tables (psycopg, vendor setup()). Created by PostgresSaver(conn=tenant_conn).setup() immediately after Phase A on the same tenant connection:

checkpoints   checkpoint_blobs   checkpoint_writes   checkpoint_migrations

Both phases are part of the onboarding transaction (D5).

3. Migration template — per-tenant invocation, not iterating loop

alembic/env.py is configured to apply migrations against either the default schema (for public) or a specified --tenant=<slug> schema (for tenant-scoped migrations). The target_metadata selection branches on a RATIBA_MIGRATION_SCOPE environment variable.

Rationale. The iterating-loop pattern (one Alembic file that iterates over public.tenants and applies the change to each schema) hides per-tenant errors and is hard to retry surgically. Per-tenant invocation isolates failures: a migration that crashes on tenant 47 doesn't strand 48-100 in inconsistent state.

For bulk operations (apply a new migration to all existing tenants), scripts/migrate-all-tenants.sh iterates over public.tenants and runs alembic upgrade head --tenant=<slug> for each, with explicit per-tenant success/failure logging and a manifest output for partial retries.

4. Connection pooling — two pools, two patterns

This is the load-bearing decision. The spike's §2.6 warned about the "data leakage" failure mode where a shared connection pool returns a connection still carrying the previous tenant's search_path. We neutralize that risk with two distinct pools serving two workloads:

App-side queries (asyncpg) — single shared pool with strict per-checkout discipline.

One process-wide asyncpg.Pool of N connections (sized for peak app load).
Connection acquisition goes through a context-manager (backend/app/persistence/session.py get_tenant_session()) that reads the current_tenant contextvar (D7) and issues SET LOCAL search_path TO <tenant_schema>, public on every borrow.
SET LOCAL (not SET) means the search_path resets at the end of the implicit transaction — implicit safety net against forgetting to reset on release.
Acquisition without a current_tenant contextvar raises immediately; the pool refuses to hand out an unscoped connection.

LangGraph checkpoint (psycopg) — per-tenant micro-pools.

The TenantScopedSaver factory (backend/app/persistence/checkpointer.py) maintains a dict of tenant_id → psycopg_pool.AsyncConnectionPool, each sized 1-2 connections.
Pools are created lazily on first use of a tenant.
Idle pools are closed after 30 min of no use (background reaper).
The factory's for_tenant(tenant_id) returns a dedicated connection with search_path pre-set, used to construct a per-invocation PostgresSaver.
At scale (target: 1,000 tenants × ~50 active per hour), front the whole thing with PgBouncer in transaction-pooling mode; the per-tenant pool count becomes the connection-multiplexing problem PgBouncer is designed for. PgBouncer wiring is deferred until the scale justifies it; explicitly noted in §Consequences.

Why two patterns. App queries are async-asyncpg native and benefit from a shared pool with rigorous per-checkout discipline. LangGraph's PostgresSaver does not expose a per-call search_path injection point (spike §2.4), so the only safe path is dedicated per-tenant connections. Forking LangGraph (spike Option B) was rejected as carrying maintenance debt without proportional ergonomic gain; building a custom FSM (spike Option C) was reserved as an escape hatch only.

5. New-tenant provisioning sequence

INSERT public.tenants row → tenant_id, schema_name, status='trial'
CREATE SCHEMA tenant_<slug> AUTHORIZATION ratiba
SET search_path TO tenant_<slug>; alembic upgrade head --tenant=<slug>
PostgresSaver(conn=tenant_conn).setup()  -- LangGraph checkpoint tables
Keycloak admin API: create realm  tenant-<slug>
Keycloak admin API: create owner role + initial admin user
INSERT public.tenant_admins rows for the registered admin phone(s)
COMMIT

Atomicity strategy. Postgres steps 1-4 + 7 wrap in one transaction. PostgreSQL DDL is transactional, so CREATE SCHEMA participates in the transaction and rolls back on failure.

Keycloak steps 5-6 are NOT in the Postgres transaction (different system). Failure modes:

Keycloak realm creation fails: rollback the Postgres transaction; no orphaned schema. Onboarding fails cleanly.
Keycloak admin user creation fails AFTER realm created: delete the realm via Keycloak admin API (compensating action), then rollback Postgres transaction.
Keycloak succeeds but step 7 (tenant_admins insert) fails: rare (constraint violation on duplicate phone); compensating action is delete the Keycloak realm + user, then rollback Postgres.

The compensating-action layer is explicit code in backend/app/tenancy/onboarding.py, not silent best-effort cleanup. Each compensating action logs an audit.config.changed event per the auto-debug logging schema.

A failed onboarding leaves zero residue: no orphaned schemas, no orphaned Keycloak realms, no half-populated tenant_admins rows.

6. Tenant suspension and deletion

Three lifecycle states on public.tenants.status:

State	Meaning	What changes
`active`	Normal operation	All channel routing accepts inbound.
`suspended`	Temporarily inactive	Channel routing rejects inbound politely; tenant schema and data preserved intact.
`deleted`	Hard-deleted	Tenant schema dropped (`DROP SCHEMA tenant_<slug> CASCADE`); `public.tenants` row soft-deleted (`deleted_at = NOW()`, `schema_name = NULL`); Keycloak realm removed; backups retained per legal retention policy.

Why soft-delete the registry row. Audit logs (and possibly historical public.tenant_admins rows or compliance traces) reference tenants.id via foreign key. Hard-deleting the row would either cascade-delete the audit trail or fail the constraint. Soft-delete preserves referential integrity while marking the tenant inactive.

Why hard-drop the schema. Zombie tenant data accumulates storage cost and presents an exfiltration target with no operational benefit. Backups are the recovery path if a deletion was mistaken.

Suspension and deletion both go through backend/app/tenancy/lifecycle.py. Deletion is gated on a 7-day "deletion pending" cooling-off period during which the tenant can be restored without backup recovery.

7. Tenant identifier propagation — asyncio contextvar

The current_tenant contextvar pattern, set once at the channel boundary and inherited automatically by every async task spawned from that point:

# backend/app/tenancy/context.py
from contextvars import ContextVar
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantContext:
    tenant_id: UUID
    schema_name: str          # e.g., "tenant_msmama_spa"
    timezone: str             # e.g., "Africa/Nairobi"
    locale: str               # "en" | "sw"
    mpesa_enabled: bool

current_tenant: ContextVar[TenantContext | None] = ContextVar(
    "current_tenant", default=None
)

Set by the IdentityResolver at the channel boundary (WhatsApp webhook, voice ingress):

# backend/app/orchestration/runner.py (sketch)
async def handle_inbound_whatsapp(msg: WhatsAppMessage):
    tenant = await tenant_resolver.resolve_from_business_number(msg.business_number)
    token = current_tenant.set(tenant)
    try:
        await orchestrator.handle(msg)
    finally:
        current_tenant.reset(token)

Read by the connection-acquire wrapper (D4):

# backend/app/persistence/session.py (sketch)
@asynccontextmanager
async def get_tenant_session():
    tenant = current_tenant.get()
    if tenant is None:
        raise RuntimeError("get_tenant_session called outside tenant context")
    async with pool.acquire() as conn:
        await conn.execute(
            f"SET LOCAL search_path TO {tenant.schema_name}, public"
        )
        yield conn

The RuntimeError on missing context is deliberate: it surfaces a real bug (someone tried to query the database from a code path that never ran the IdentityResolver) instead of silently leaking into the default public schema.

The contextvar module pattern is lifted from trust-relay-workflow's identical mechanism — proven at production scale, mirrors the pattern in the global S4U methodology.

Consequences

Positive.

Hard tenant isolation by structural design, not by application discipline. A bug in application code that forgets to filter by tenant_id cannot leak across tenants because the database connection's search_path only sees one tenant's schema.
Webhook routing solves cleanly. The public.payment_routing table makes the tenant-less-webhook-context problem a 1-row lookup instead of a heuristic phone-number-matching exercise.
Onboarding is auditable and atomic. Each compensating action logs an explicit event; failures leave no orphans.
Per-tenant migration isolation. A buggy migration crashing on tenant 47 strands no other tenants.
The contextvar pattern keeps tenant routing out of every function signature. tenant_id doesn't appear in the call graph except at the channel boundary and the connection-acquire wrapper.

Negative.

Two Postgres drivers in the backend (asyncpg + psycopg). Carried forward from ADR-0001 amendment. Operational wart: separate connection-pool tuning, two sets of driver-version upgrades to track. Mitigated by scoping psycopg strictly to backend/app/persistence/checkpointer.py.
Per-tenant micro-pools scale connection count quadratically with tenants. At 1,000 tenants × 2 idle connections = 2,000 idle Postgres connections at peak. Default max_connections=100 is insufficient; a tuned VPS does 500-1000. Beyond that, PgBouncer in transaction-pooling mode in front of Postgres becomes mandatory. Acceptable for PoC scale; revisit when active-tenant count exceeds ~100.
Two-phase tenant setup (Alembic + LangGraph saver) is a sequence that must succeed atomically. Failure modes are documented (D5) but the second phase requires the LangGraph library's setup() method to remain stable. A breaking change in langgraph-checkpoint-postgres migrations would force a coordinated upgrade across all tenant schemas. Mitigation: the library-currency policy (ADR-0001) flags langgraph-checkpoint-postgres as load-bearing with 14-day minimum-release-age and explicit reviewer.

Neutral.

Schema-per-tenant has a known scaling ceiling around ~10,000 tenants per Postgres instance (carried forward from ADR-0001; each schema's catalog rows add overhead). Beyond that, sharding by tenant becomes the conversation. Not a constraint we need to design around now.
Schema names are derived from tenants.slug which is user-supplied at onboarding. The [a-z0-9_]-only sanitization + uniqueness constraint prevents SQL-injection attacks via schema-name interpolation, but the string interpolation in SET search_path TO {tenant.schema_name} is the one place where the sanitization is load-bearing. Tested extensively; called out in the human-review-only list (Phase B §6).

Alternatives Considered

Alternative	Rejected because
Database-per-tenant (instead of schema-per-tenant).	Operationally heavier for the additional isolation guarantee that schema-per-tenant already provides. Same call as ADR-0001; reaffirmed here.
Row-Level Security (RLS) on shared tables, no schemas.	RLS-only requires every query to carry tenant context perfectly, every time. One missed `WHERE tenant_id = …` and you have a cross-tenant bug. Schema-per-tenant makes the isolation structural; the application cannot accidentally see another tenant's rows because the database's `search_path` doesn't include them. Same call as ADR-0001; reaffirmed here.
Single shared connection pool with `SET search_path` per checkout for both asyncpg AND LangGraph.	The LangGraph PostgresSaver issues SQL through a connection it has captured at construction time; the `_cursor` chokepoint (spike §2.3) does not re-issue `SET search_path` per call. A shared pool whose `search_path` could be modified by another concurrent request is the exact "data leakage" failure mode the spike's forum reference (#3274) warned about. Per-tenant micro-pools for the LangGraph path eliminate this class of bug.
Iterating-loop migration pattern (`upgrade()` iterates over `public.tenants`).	Per-tenant invocation (D3) gives clean per-tenant failure isolation. The iterating loop is harder to retry surgically and silently buries per-tenant errors in aggregate logs.
Hard-delete `public.tenants` on tenant deletion.	Audit logs and historical references would either cascade-delete (loss of audit trail) or fail the constraint (operational nightmare). Soft-delete preserves referential integrity.
Pass `tenant_id` explicitly through every function signature (no contextvar).	Verbose; pollutes every call signature with a parameter that 99% of code never reads; encourages "just default it" when tenant is missing — exactly the bug class contextvar prevents by raising immediately.
One Keycloak realm shared across tenants, with groups for tenant separation.	ADR-0001 already chose realm-per-tenant for hard isolation of admin user pools. Shared-realm-with-groups would require careful group-membership filtering on every Keycloak query — a runtime safety check rather than a structural one. Reaffirmed here.

References

docs/prd/ratiba-prd.md — §1.4 (multi-tenant SaaS), §2.2 (tech stack), §3.1 (shared schema), §3.2 (tenant schema), Annex A (M-Pesa), Annex B (PesaPal cards)
docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — schema-per-tenant pinned + LangGraph + psycopg exception
docs/research/2026-04-25-langgraph-postgressaver-spike.md — Option A TenantScopedSaver wrapper + the "data leakage" failure mode warning
docs/research/2026-04-25-orchestration-patterns.md — A1 §7.2 TenantScopedSaver injection pattern; §8 two-tier persistence
docs/research/2026-04-25-payments-orchestration.md — A4 §3 public.payment_routing schema + lookup pattern
docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §9 per-tenant LangGraph checkpoint + handoff_log table
docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — Locked Decisions Log (C1-C3 + Q1-Q6 with cite-into-ADR mapping)
trust-relay-workflow contextvar pattern (backend/app/persistence/session.py) — proven at production scale, lifted as the reference implementation shape
Postgres documentation: SET LOCAL semantics, transactional DDL, search_path resolution
LangGraph langgraph-checkpoint-postgres v2.x — PostgresSaver.setup() table-creation contract

Context​

Decision​

1. Tenant registry shape — three shared tables in public​

2. Per-tenant schema contents​

3. Migration template — per-tenant invocation, not iterating loop​

4. Connection pooling — two pools, two patterns​

5. New-tenant provisioning sequence​

6. Tenant suspension and deletion​

7. Tenant identifier propagation — asyncio contextvar​

Consequences​

Alternatives Considered​

References​