Identity and tenancy
What it does
Two strict isolation seams cooperate so that every query in Ratiba reaches the correct tenant's data without ever leaking across boundaries.
Schema-per-tenant Postgres (ADR-0002).
One Keycloak realm per tenant, one tenant_<slug> schema in Postgres, isolated
data and an independent per-tenant Alembic migration chain. Two distinct
connection pools sit underneath: a shared asyncpg pool for the public.* registry
(tenant lookup, payment routing, inbound quarantine) and per-tenant psycopg
micro-pools for tenant_<slug>.* operational data. This "two-pool model" separates
registry I/O (high concurrency, low schema cardinality) from operational I/O (medium
concurrency, high schema cardinality) — see Two-pool model below.
Tenant context and ContextVar propagation.
Tenant identity propagates through every async call stack via an asyncio
ContextVar (app/tenancy/context.py::current_tenant). Because Python copies
the contextvars.Context snapshot into every asyncio.create_task call,
background tasks cannot lose tenant scope — the snapshot travels with the coroutine.
get_tenant_session() reads that ContextVar and refuses to issue a connection if
it is unset (loud RuntimeError, never a silent fallback to public). The schema
name is re-validated against ^tenant_[a-z0-9_]+$ before being interpolated into
a SET LOCAL search_path SQL string, closing the injection window.
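The propagation and fail-closed behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in app/tenancy/context.py: the TenantContext fields, search_path_sql(), background_job() and handle_request() are hypothetical names, and no database is involved.

```python
import asyncio
import re
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional

# Same pattern the text quotes; fullmatch() also rejects a trailing newline.
SCHEMA_RE = re.compile(r"^tenant_[a-z0-9_]+$")


@dataclass(frozen=True)
class TenantContext:
    """Hypothetical stand-in; the real fields are not shown on this page."""
    slug: str
    schema_name: str


current_tenant: ContextVar[Optional[TenantContext]] = ContextVar(
    "current_tenant", default=None
)


def search_path_sql() -> str:
    """Build the SET LOCAL statement, failing closed when no tenant is set."""
    tenant = current_tenant.get()
    if tenant is None:
        raise RuntimeError("no active tenant context")
    if not SCHEMA_RE.fullmatch(tenant.schema_name):
        raise ValueError(f"invalid schema name: {tenant.schema_name!r}")
    # Interpolation is acceptable only because the name passed the allow-list.
    return f"SET LOCAL search_path = {tenant.schema_name}"


async def background_job() -> str:
    # Runs in its own task, yet still sees the tenant set by the caller.
    return search_path_sql()


async def handle_request(slug: str) -> str:
    token = current_tenant.set(TenantContext(slug, f"tenant_{slug}"))
    try:
        # create_task() copies the current Context, tenant included.
        return await asyncio.create_task(background_job())
    finally:
        current_tenant.reset(token)
```

Calling asyncio.run(handle_request("demo")) yields the statement for tenant_demo, while calling search_path_sql() with no tenant set raises instead of falling back to public. The snapshot is taken when the task is created, so a task spawned before the tenant is set would not see it.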
Phone-only deterministic cross-channel customer identity (ADR-0009 D4-D5).
Three tenant-scoped tables — customers, customer_sessions,
customer_identities — represent "who is talking to us, on which channel".
Tier-1 channels (WhatsApp + voice) carry the customer's phone in inbound provider
metadata, so identity is known immediately. Tier-2 channels (web widget +
Instagram DM + Messenger DM) land as anonymous sessions and capture the phone
progressively via the COLLECT_PHONE FSM entry-state. When a phone is bound, the
resolver looks for an existing customer with the same phone_e164 across the
tenant: a HIT merges the new session into that customer record (cross-channel
merge); a MISS creates a fresh customer row. Identity matching is strictly
deterministic: no probabilistic name or device fingerprint matching is used.
How it fits in the system
Two-pool model
The two-pool model is the concrete implementation of ADR-0002's pool-isolation
requirement. It is not two separate databases — it is two separate
asyncpg / psycopg connection pools within a single Postgres instance, each
with distinct credentials, pool sizes, and search_path defaults:
| Pool | Driver | Default search_path | Used for |
|---|---|---|---|
| Shared public pool | asyncpg | public | Tenant lookup, payment_routing, inbound_quarantine, payment_callbacks_unrouted |
| Per-tenant micro-pools | psycopg | tenant_<slug> | All operational tables: customers, sessions, services, appointments, checkpoints, commission data |
The shared pool is created once at process startup and reused across all requests. Per-tenant micro-pools are created on first request for a tenant, lazily, and bounded to a small number of connections (typically 2–5) to avoid connection exhaustion in a high-tenant-count deployment.
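A minimal sketch of the lazy, bounded micro-pool idea follows. It assumes a registry object keyed by schema name; MicroPool is a stand-in for a real psycopg pool, and TenantPoolRegistry and its max_size default are hypothetical names and values rather than the project's own.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict


@dataclass
class MicroPool:
    """Stand-in for a real connection pool; it only records its settings."""
    schema_name: str
    max_size: int


class TenantPoolRegistry:
    """Creates one small pool per tenant schema on first use."""

    def __init__(self, max_size: int = 4) -> None:
        self._pools: Dict[str, MicroPool] = {}
        self._lock = asyncio.Lock()
        self._max_size = max_size

    async def get(self, schema_name: str) -> MicroPool:
        pool = self._pools.get(schema_name)
        if pool is None:
            # Serialise creation so concurrent first requests share one pool.
            async with self._lock:
                pool = self._pools.get(schema_name)
                if pool is None:
                    pool = MicroPool(schema_name, self._max_size)
                    self._pools[schema_name] = pool
        return pool


async def demo() -> tuple:
    registry = TenantPoolRegistry(max_size=3)
    a, b, c = await asyncio.gather(
        registry.get("tenant_a"),
        registry.get("tenant_a"),
        registry.get("tenant_b"),
    )
    # Same tenant shares a pool; different tenants get separate pools.
    return (a is b, a is not c, a.max_size)
```

The second lookup inside the lock is what keeps two concurrent first requests for the same tenant from each opening their own pool, which matters when the per-tenant bound is only a handful of connections.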
get_tenant_session() enforces the invariant: if the current_tenant ContextVar is
unset, the call raises immediately. If it is set, the function picks the correct
micro-pool by schema_name, issues SET LOCAL search_path = tenant_<slug>, and
returns the connection. The SET LOCAL is transaction-scoped: Postgres reverts it
when the surrounding transaction ends, whether by commit or rollback, so a
partial failure can never bleed the wrong search_path into a subsequent query on
the same connection.
Tenant context propagation
The asyncio ContextVar mechanism deserves a closer look because it is the
load-bearing cross-cutting concern that makes schema isolation work end-to-end
without passing tenant_id in every function signature.
TenantContext is declared frozen (frozen=True), so no layer can mutate the tenant
in-flight. Each ContextVar.set() call returns a Token that the middleware saves
and passes to ContextVar.reset() on exit to restore the previous value. This is
the standard set/reset pattern for contextvars, and it ensures that nested tasks
or stacked middleware do not corrupt the outer tenant.
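The Token save-and-restore behaviour can be illustrated with a small context manager. This is a sketch under assumptions: tenant_scope() is a hypothetical helper, and a stdlib frozen dataclass stands in for the real TenantContext, whose actual definition is not shown here.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import FrozenInstanceError, dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class TenantContext:
    """Stand-in with a single field; assignment after creation raises."""
    slug: str


current_tenant: ContextVar[Optional[TenantContext]] = ContextVar(
    "current_tenant", default=None
)


@contextmanager
def tenant_scope(ctx: TenantContext) -> Iterator[TenantContext]:
    token = current_tenant.set(ctx)
    try:
        yield ctx
    finally:
        # reset() restores whatever value was active before set().
        current_tenant.reset(token)


def demo() -> List[Optional[str]]:
    seen: List[Optional[str]] = []
    with tenant_scope(TenantContext("outer")):
        with tenant_scope(TenantContext("inner")):
            seen.append(current_tenant.get().slug)
        # Leaving the inner scope restores the outer tenant.
        seen.append(current_tenant.get().slug)
    # Leaving the outer scope restores the default (no tenant).
    seen.append(current_tenant.get())
    return seen
```

Resetting with the token, rather than setting the variable back to None, is what makes stacking safe: the inner scope does not need to know whether an outer tenant existed.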
Atomic onboarding with Keycloak compensating actions
Onboarding a new tenant is a multi-system operation: create a Keycloak realm, write
the public.tenants row, create the tenant_<slug> schema, run per-tenant Alembic
migrations, and create the checkpoints_<slug> schema. Any step can fail.
ADR-0002 specifies a compensating-action pattern: each step that succeeds is
recorded, and if a later step fails, the compensating actions are executed in reverse
order. For example, if Alembic migrations fail after the Keycloak realm was created,
the compensating actions drop the tenant_<slug> schema, roll back the
public.tenants insert, and delete the Keycloak realm. The caller receives a single
clean error; no half-created tenant is left in the registry.
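The compensating-action pattern can be sketched as a loop over (name, action, compensation) triples. Everything here is illustrative: run_with_compensation(), OnboardingError and the step names are hypothetical, and the actions only append to a log instead of touching Keycloak or Postgres.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]


class OnboardingError(Exception):
    """Single error surfaced to the caller after compensation has run."""


def run_with_compensation(steps: List[Step]) -> None:
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo only the steps that succeeded, newest first.
            for _, undo in reversed(done):
                undo()
            raise OnboardingError(f"onboarding failed at step '{name}'") from exc
        done.append((name, compensate))


def demo_log() -> List[str]:
    log: List[str] = []

    def ok(name: str) -> Callable[[], None]:
        return lambda: log.append(name)

    def undo(name: str) -> Callable[[], None]:
        return lambda: log.append(f"undo:{name}")

    def failing_migration() -> None:
        raise RuntimeError("simulated migration failure")

    steps: List[Step] = [
        ("create_realm", ok("create_realm"), undo("create_realm")),
        ("insert_tenant_row", ok("insert_tenant_row"), undo("insert_tenant_row")),
        ("create_schema", ok("create_schema"), undo("create_schema")),
        ("run_migrations", failing_migration, undo("run_migrations")),
    ]
    try:
        run_with_compensation(steps)
    except OnboardingError:
        log.append("error surfaced")
    return log
```

In this sketch a compensation that itself raises would abort the unwind; a production version would likely need to log and continue so that the remaining compensations still run.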
See the onboarding runbook for the operator step-by-step walk-through and the CLI commands that drive this flow.
Phone-only deterministic identity
The resolver logic is deliberately simple: phone E.164 is the only merge key.
Key invariants enforced in app/agents/identity_resolver.py:
- normalize_phone_e164() is called before every lookup and every insert; no raw user input ever touches the phone_e164 unique index.
- link_identity() uses ON CONFLICT DO NOTHING against the composite PK (customer_id, provider, external_id), making it safely replayable.
- A 3-strike re-prompt is enforced in the COLLECT_PHONE FSM node; if the customer refuses or fails three times, the thread transitions to LEAD_CAPTURED and the session remains customer_id=NULL (anonymous lead).
- There is no probabilistic fallback: if the phone does not match, a new customer is created. False merges (two different people sharing a phone) are a known limitation documented in ADR-0009.
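The replay-safety of link_identity() can be demonstrated with an in-memory SQLite table standing in for the Postgres one. This is a sketch, not the DAO in app/persistence/customers.py: the column types, make_db() and count_identities() are assumptions, and SQLite 3.24 or newer is needed for the ON CONFLICT clause.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with the composite PK the text describes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer_identities (
            customer_id INTEGER NOT NULL,
            provider    TEXT    NOT NULL,
            external_id TEXT    NOT NULL,
            PRIMARY KEY (customer_id, provider, external_id)
        )
        """
    )
    return conn


def link_identity(
    conn: sqlite3.Connection, customer_id: int, provider: str, external_id: str
) -> None:
    # A repeated call hits the primary key and is silently skipped.
    conn.execute(
        "INSERT INTO customer_identities (customer_id, provider, external_id) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (customer_id, provider, external_id) DO NOTHING",
        (customer_id, provider, external_id),
    )
    conn.commit()


def count_identities(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM customer_identities").fetchone()[0]
```

Because a duplicate insert is a no-op rather than an error, a retried webhook or a replayed message can call link_identity() again without any special handling.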
Cross-channel merge: the full sequence
A customer talks to the salon on WhatsApp first, then opens the web widget days later. Phone-only deterministic merge (ADR-0009 D5) recognises her on the second channel without asking her to re-register.
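The HIT/MISS decision in that sequence can be sketched in memory. The Session and InMemoryTenantStore classes are hypothetical, the phone is assumed to be already normalised to E.164, and the real bind_phone() in app/agents/identity_resolver.py works against the tenant schema rather than a dict.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    provider: str
    external_id: str
    customer_id: Optional[int] = None  # None means anonymous


class InMemoryTenantStore:
    """One instance per tenant; phone_e164 is the only merge key."""

    def __init__(self) -> None:
        self._customer_by_phone: Dict[str, int] = {}
        self._next_id = itertools.count(1)

    def bind_phone(self, session: Session, phone_e164: str) -> str:
        customer_id = self._customer_by_phone.get(phone_e164)
        if customer_id is None:
            # MISS: no customer has this phone yet, so create one.
            customer_id = next(self._next_id)
            self._customer_by_phone[phone_e164] = customer_id
            outcome = "MISS"
        else:
            # HIT: merge this session into the existing customer.
            outcome = "HIT"
        session.customer_id = customer_id
        return outcome
```

With this sketch, binding a WhatsApp session to a phone returns MISS and creates the customer, and binding a later web session to the same phone returns HIT with the same customer_id. The lookup is an exact match on the phone string, which is why normalisation has to happen before it.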
Prospect story: what this means for a booking
From a customer's point of view, the identity layer is invisible. She books by WhatsApp on Monday. On Thursday she opens the spa's website chat widget. The agent greets her by name ("Hi Mary, welcome back!"), knows her appointment history, and continues the conversation without asking her to sign up or log in. No password. No app. No registration form.
From the tenant owner's point of view, both conversations appear under the same customer profile in the dashboard — booking count, cancellation history, payment status — regardless of which channel the customer used.
Schema reference
All three tables live inside tenant_<slug> — never in public.
| Table | Key columns | Purpose |
|---|---|---|
| customers | id, phone_e164 (unique), display_name, locale | One row per unique customer across all channels |
| customer_sessions | id, customer_id (nullable FK), provider, external_id, promoted_at | One row per channel session; customer_id is NULL until phone is bound |
| customer_identities | customer_id, provider, external_id (composite PK) | Many-to-one: multiple channel handles per customer |
The customer_id FK on customer_sessions being nullable is the key design choice
that allows anonymous sessions to exist before phone capture. promote_session()
sets it and records promoted_at atomically so the promotion is auditable.
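One way to get that atomicity is a single UPDATE that writes both columns at once, sketched here against in-memory SQLite. The column types, the ISO-8601 timestamp format and the "customer_id IS NULL" guard are assumptions made for illustration; the actual promote_session() in app/persistence/customers.py may differ.

```python
import sqlite3
from datetime import datetime, timezone


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer_sessions (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER,          -- NULL until a phone is bound
            provider    TEXT NOT NULL,
            external_id TEXT NOT NULL,
            promoted_at TEXT              -- NULL until promotion
        )
        """
    )
    return conn


def promote_session(conn: sqlite3.Connection, session_id: int, customer_id: int) -> bool:
    """Set customer_id and promoted_at together; return True if a row changed."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE customer_sessions "
            "SET customer_id = ?, promoted_at = ? "
            # Assumed guard: only anonymous sessions can be promoted.
            "WHERE id = ? AND customer_id IS NULL",
            (customer_id, now, session_id),
        )
    return cur.rowcount == 1
```

Since both columns change in one statement, a reader never observes a session with a customer_id but no promoted_at, and the guard makes a repeated promotion a no-op instead of overwriting the audit timestamp.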
Where it lives in code
| Concern | File | Key entry point |
|---|---|---|
| Tenant ContextVar | app/tenancy/context.py | current_tenant ContextVar (L56) + frozen TenantContext (L40) |
| Tenant ORM model | app/tenancy/models.py | Tenant ORM class (L68); public-schema only |
| Schema-scoped session | app/persistence/session.py | get_tenant_session() (L48) — ContextVar gate + SET LOCAL search_path |
| Channel-boundary resolve | app/agents/identity_resolver.py | resolve() (L154) — phone_number_id to tenant + actor_type |
| Customer-side resolve | app/agents/identity_resolver.py | resolve_session() (L326), bind_phone() (L384), link_identity() (L491) |
| Customers DAO | app/persistence/customers.py | get_customer_by_phone() (L114), get_customer_by_alias() (L139), insert_customer() (L175) |
| Sessions DAO | app/persistence/customers.py | get_session() (L238), insert_session() (L257), promote_session() (L283) |
| Identities DAO | app/persistence/customers.py | link_identity() (L305) — ON CONFLICT DO NOTHING against composite PK |
Decisions
- ADR-0002, Multi-tenant isolation: schema-per-tenant + two pools + per-tenant Alembic + ContextVar propagation + atomic onboarding with Keycloak compensating actions.
- ADR-0009, Channel-agnostic conversation substrate: D4 (customers + customer_sessions + customer_identities shape; Tier-1 vs Tier-2 phone-discovery semantics) and D5 (phone-only deterministic cross-channel merge; bind_phone HIT/MISS algorithm).
Try this on local dev
- Inspect the tenant schema after onboarding. From the project root:

      psql "postgresql://postgres:postgres@localhost:5434/ratiba" -c "\dn"

  Lists every schema; you should see public plus one tenant_<slug> per tenant. Then run \dt tenant_demo.* to confirm the tenant-scoped tables (customers, customer_sessions, customer_identities, appointments, services, ...) all exist. For the two-pool model, verify public.tenants has the expected schema_name column: \d public.tenants.

- Watch a cross-channel merge happen. Onboard a tenant with the web widget enabled. Chat first via the WhatsApp test number to create the customer, then open localhost:3010/widget?slug=<tenant> in a private window so you land as a fresh anonymous session. When the FSM hits COLLECT_PHONE, type the same phone you used on WhatsApp. Then run:

      psql "postgresql://postgres:postgres@localhost:5434/ratiba" \
        -c "SELECT id, customer_id, promoted_at FROM tenant_<slug>.customer_sessions ORDER BY created_at;"

  Both rows should reference the same customer_id. The row with promoted_at IS NOT NULL is the web session that was merged.

- ContextVar inspection in a pytest debugger. Drop breakpoint() into any test that exercises a tenant-scoped DAO and run:

      from app.tenancy.context import current_tenant
      current_tenant.get()

  You will see the frozen TenantContext snapshot that the per-scenario fixture installed. If get_tenant_session() ever raises RuntimeError("get_tenant_session() called without an active tenant context"), this is the same primitive failing closed: the ContextVar was never set for the current coroutine scope.

- Verify SET LOCAL search_path scoping. In a test session that exercises get_tenant_session(), check that after the connection is returned to the pool the search path is reset. The SET LOCAL is transaction-scoped: commit or roll back the transaction and re-run SHOW search_path on the connection; it will have reverted to the pool default.
Related
- Channel substrate: Tier-1 vs Tier-2 channel semantics; COLLECT_PHONE as a Tier-2 FSM entry-state.
- Conversation FSM: COLLECT_PHONE node detail, 3-strike handling, and LEAD_CAPTURED terminal state.
- Onboard a tenant: operator step-by-step for running the atomic onboarding flow.
- Configuration: env vars and settings that govern pool sizes, Keycloak realm configuration, and per-tenant schema parameters.