
Identity and tenancy

What it does

Two strict isolation seams cooperate so that every query in Ratiba reaches the correct tenant's data without ever leaking across boundaries.

Schema-per-tenant Postgres (ADR-0002). One Keycloak realm per tenant, one tenant_<slug> schema in Postgres, isolated data and an independent per-tenant Alembic migration chain. Two distinct connection pools sit underneath: a shared asyncpg pool for the public.* registry (tenant lookup, payment routing, inbound quarantine) and per-tenant psycopg micro-pools for tenant_<slug>.* operational data. This "two-pool model" separates registry I/O (high concurrency, low schema cardinality) from operational I/O (medium concurrency, high schema cardinality) — see Two-pool model below.

Tenant context and ContextVar propagation. Tenant identity propagates through every async call stack via an asyncio ContextVar (app/tenancy/context.py::current_tenant). Because asyncio.create_task runs each new task in a copy of the caller's contextvars.Context, a background task spawned while the tenant is set keeps that tenant scope: the snapshot travels with the coroutine. get_tenant_session() reads that ContextVar and refuses to issue a connection if it is unset (loud RuntimeError, never a silent fallback to public). The schema name is re-validated against ^tenant_[a-z0-9_]+$ before being interpolated into a SET LOCAL search_path SQL string, closing the injection window.
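The propagation behaviour can be seen in isolation with a few lines of standard-library Python. This is a minimal sketch, not the Ratiba code: current_tenant here is a hypothetical stand-in that holds a plain schema-name string, and handle_request and background_job are illustrative names.

```python
import asyncio
from contextvars import ContextVar
from typing import List, Optional

# Hypothetical stand-in for app/tenancy/context.py::current_tenant.
current_tenant: ContextVar[Optional[str]] = ContextVar("current_tenant", default=None)


async def background_job() -> Optional[str]:
    # Runs in a task created by the request handler; no tenant argument is passed.
    await asyncio.sleep(0)
    return current_tenant.get()


async def handle_request(schema_name: str) -> Optional[str]:
    token = current_tenant.set(schema_name)
    try:
        # create_task copies the current context, so the task sees this tenant.
        task = asyncio.create_task(background_job())
        return await task
    finally:
        current_tenant.reset(token)


async def main() -> List[Optional[str]]:
    # gather wraps each coroutine in its own task (with its own context copy),
    # so the two concurrent requests cannot observe each other's tenant.
    return await asyncio.gather(
        handle_request("tenant_alpha"),
        handle_request("tenant_beta"),
    )
```

Running asyncio.run(main()) yields each request's own schema name from its background task, and the ContextVar is still unset in the caller afterwards.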

Phone-only deterministic cross-channel customer identity (ADR-0009 D4-D5). Three tenant-scoped tables (customers, customer_sessions, customer_identities) represent "who is talking to us, on which channel". Tier-1 channels (WhatsApp + voice) carry the customer's phone in inbound provider metadata, so identity is known immediately. Tier-2 channels (web widget + Instagram DM + Messenger DM) land as anonymous sessions and capture the phone progressively via the COLLECT_PHONE FSM entry-state. When a phone is bound, the resolver looks for an existing customer with the same phone_e164 within the tenant: a HIT merges the new session into that customer record (cross-channel merge); a MISS creates a fresh customer row. Identity matching is strictly deterministic: no probabilistic name or device-fingerprint matching is used.

How it fits in the system

Two-pool model

The two-pool model is the concrete implementation of ADR-0002's pool-isolation requirement. It is not two separate databases — it is two separate asyncpg / psycopg connection pools within a single Postgres instance, each with distinct credentials, pool sizes, and search_path defaults:

| Pool | Driver | Default search_path | Used for |
| --- | --- | --- | --- |
| Shared public pool | asyncpg | public | Tenant lookup, payment_routing, inbound_quarantine, payment_callbacks_unrouted |
| Per-tenant micro-pools | psycopg | tenant_<slug> | All operational tables: customers, sessions, services, appointments, checkpoints, commission data |

The shared pool is created once at process startup and reused across all requests. Per-tenant micro-pools are created on first request for a tenant, lazily, and bounded to a small number of connections (typically 2–5) to avoid connection exhaustion in a high-tenant-count deployment.
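The lazy, bounded creation of micro-pools can be sketched as a small registry keyed by schema name. Everything below is an assumption for illustration: MicroPoolRegistry and FakePool are hypothetical names, FakePool only records its configuration instead of opening psycopg connections, and the 2 and 5 bounds simply echo the "typically 2–5" figure above.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class FakePool:
    """Stand-in for a real connection pool; it only records its settings."""
    schema_name: str
    min_size: int
    max_size: int


class MicroPoolRegistry:
    """Creates one small pool per tenant schema, on first use."""

    def __init__(self, min_size: int = 2, max_size: int = 5) -> None:
        self._min_size = min_size
        self._max_size = max_size
        self._pools: Dict[str, FakePool] = {}
        self._lock = asyncio.Lock()

    @property
    def size(self) -> int:
        return len(self._pools)

    async def get(self, schema_name: str) -> FakePool:
        pool = self._pools.get(schema_name)
        if pool is not None:
            return pool  # fast path: pool already exists
        async with self._lock:
            # Re-check under the lock so concurrent first requests
            # for the same tenant create exactly one pool.
            pool = self._pools.get(schema_name)
            if pool is None:
                pool = FakePool(schema_name, self._min_size, self._max_size)
                self._pools[schema_name] = pool
            return pool


async def demo() -> Tuple[bool, bool, int]:
    registry = MicroPoolRegistry()
    a, b = await asyncio.gather(
        registry.get("tenant_alpha"), registry.get("tenant_alpha")
    )
    c = await registry.get("tenant_beta")
    return a is b, c is not a, registry.size
```

With N tenants this caps the operational connection count at roughly N times max_size, which is the reason for keeping the per-tenant bound small.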

get_tenant_session() enforces the invariant: if the current_tenant ContextVar is unset, the call raises immediately. If it is set, the function picks the correct micro-pool by schema_name, issues SET LOCAL search_path = tenant_<slug>, and returns the connection. The SET LOCAL is transaction-scoped: the setting lasts only until the end of the current transaction, whether that transaction commits or rolls back, so a partial failure can never bleed the wrong search_path into a subsequent query on the same connection.
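The gate described above can be approximated as follows. This is a synchronous sketch under stated assumptions, not the contents of app/persistence/session.py: RecordingConnection is a hypothetical fake that only records SQL strings, search_path_statement is an illustrative helper, and pool selection is left out.

```python
import re
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, List, Optional

# Hypothetical stand-in for the real ContextVar; holds the schema name only.
current_tenant: ContextVar[Optional[str]] = ContextVar("current_tenant", default=None)
SCHEMA_RE = re.compile(r"^tenant_[a-z0-9_]+$")


class RecordingConnection:
    """Fake connection that records statements instead of talking to Postgres."""

    def __init__(self) -> None:
        self.statements: List[str] = []

    def execute(self, sql: str) -> None:
        self.statements.append(sql)


def search_path_statement(schema_name: str) -> str:
    # Re-validate before interpolating: identifiers cannot be bound as parameters.
    if not SCHEMA_RE.fullmatch(schema_name):
        raise ValueError(f"invalid tenant schema name: {schema_name!r}")
    return f"SET LOCAL search_path = {schema_name}"


@contextmanager
def get_tenant_session(conn: RecordingConnection) -> Iterator[RecordingConnection]:
    schema_name = current_tenant.get()
    if schema_name is None:
        # Fail closed: never fall back to the public schema.
        raise RuntimeError("get_tenant_session() called without an active tenant context")
    statement = search_path_statement(schema_name)
    conn.execute("BEGIN")
    try:
        conn.execute(statement)
        yield conn
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Validating before BEGIN means a malformed schema name never opens a transaction at all; the regular expression is the one quoted earlier in this page.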

Tenant context propagation

The asyncio ContextVar mechanism deserves a closer look because it is the load-bearing cross-cutting concern that makes schema isolation work end-to-end without passing tenant_id in every function signature.

The TenantContext dataclass is frozen (frozen=True in the Pydantic model), so no layer can mutate the tenant in-flight. A new ContextVar.set() call returns a Token that the middleware saves to restore the previous context on exit — this is the standard Python context-manager pattern and ensures nested tasks or middleware stacking does not corrupt the outer tenant.
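The set-then-reset Token pattern looks like this in a minimal form. The sketch assumes a standard-library frozen dataclass with two hypothetical fields (slug, schema_name) and an illustrative tenant_scope helper; the real TenantContext and middleware may differ.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import FrozenInstanceError, dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class TenantContext:
    # Field names are assumptions for this sketch.
    slug: str
    schema_name: str


current_tenant: ContextVar[Optional[TenantContext]] = ContextVar(
    "current_tenant", default=None
)


@contextmanager
def tenant_scope(ctx: TenantContext) -> Iterator[TenantContext]:
    # set() returns a Token that remembers the previous value.
    token = current_tenant.set(ctx)
    try:
        yield ctx
    finally:
        # reset() restores whatever was active before this scope,
        # so nesting cannot corrupt the outer tenant.
        current_tenant.reset(token)
```

Nesting two tenant_scope blocks restores the outer tenant when the inner block exits, and assigning to a field of a TenantContext instance raises FrozenInstanceError.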

Atomic onboarding with Keycloak compensating actions

Onboarding a new tenant is a multi-system operation: create a Keycloak realm, write the public.tenants row, create the tenant_<slug> schema, run per-tenant Alembic migrations, and create the checkpoints_<slug> schema. Any step can fail.

ADR-0002 specifies a compensating-action pattern: each step that succeeds is recorded, and if a later step fails, the compensating actions are executed in reverse order. For example, if Alembic migrations fail after the Keycloak realm was created, the middleware deletes the realm, rolls back the public.tenants insert, and drops the tenant_<slug> schema. The caller receives a single clean error; no half-created tenant is left in the registry.
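The compensating-action pattern itself is small enough to sketch generically. The code below is an illustration, not the onboarding implementation: run_with_compensation and the step names are hypothetical, and each action merely appends to a log instead of calling Keycloak, Postgres, or Alembic.

```python
from typing import Callable, List, Tuple

# (name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> None:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except Exception:
            for _done_name, _done_action, undo in reversed(completed):
                undo()  # a real system would also log and tolerate undo failures
            raise  # the caller sees the original error
        completed.append(step)


def demo() -> List[str]:
    log: List[str] = []

    def do(name: str) -> Callable[[], None]:
        return lambda: log.append(f"do {name}")

    def undo(name: str) -> Callable[[], None]:
        return lambda: log.append(f"undo {name}")

    def failing_migration() -> None:
        raise RuntimeError("migration failed")

    steps: List[Step] = [
        ("keycloak_realm", do("keycloak_realm"), undo("keycloak_realm")),
        ("public_tenants_row", do("public_tenants_row"), undo("public_tenants_row")),
        ("tenant_schema", do("tenant_schema"), undo("tenant_schema")),
        ("alembic_migrations", failing_migration, undo("alembic_migrations")),
    ]
    try:
        run_with_compensation(steps)
    except RuntimeError:
        log.append("error surfaced to caller")
    return log
```

In the demo the failed step is never compensated (it did not complete), while the three completed steps are undone last-to-first before the error reaches the caller.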

See the onboarding runbook for the operator step-by-step walk-through and the CLI commands that drive this flow.

Phone-only deterministic identity

The resolver logic is deliberately simple: phone E.164 is the only merge key.

Key invariants enforced in app/agents/identity_resolver.py:

  • normalize_phone_e164() is called before every lookup and every insert — no raw user input ever touches the phone_e164 unique index.
  • link_identity() uses ON CONFLICT DO NOTHING against the composite PK (customer_id, provider, external_id), making it safely replayable.
  • A 3-strike re-prompt is enforced in the COLLECT_PHONE FSM node; if the customer refuses or fails three times, the thread transitions to LEAD_CAPTURED and the session remains customer_id=NULL (anonymous lead).
  • There is no probabilistic fallback: if the phone does not match, a new customer is created. False merges (two different people sharing a phone) are a known limitation documented in ADR-0009.
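Two of the invariants above, normalisation before lookup and the 3-strike limit, can be illustrated with a short sketch. It rests on several assumptions: the normaliser here only strips punctuation and accepts a leading + or 00 prefix (the real normalize_phone_e164() may handle national formats), PHONE_BOUND is a hypothetical state name, and collect_phone folds a list of replies rather than running inside the FSM.

```python
import re
from typing import List, Optional, Tuple

# Commonly used E.164 shape check: "+", a non-zero digit, up to 15 digits total.
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")
MAX_STRIKES = 3


def normalize_phone_e164(raw: str) -> Optional[str]:
    """Return the phone in E.164 form, or None if it cannot be normalised."""
    cleaned = re.sub(r"[\s\-().]", "", raw)
    if cleaned.startswith("00"):
        cleaned = "+" + cleaned[2:]  # international "00" prefix
    return cleaned if E164_RE.fullmatch(cleaned) else None


def collect_phone(replies: List[str]) -> Tuple[str, Optional[str]]:
    """Fold customer replies into the resulting state and bound phone, if any."""
    strikes = 0
    for reply in replies:
        phone = normalize_phone_e164(reply)
        if phone is not None:
            return ("PHONE_BOUND", phone)  # hypothetical state name
        strikes += 1
        if strikes >= MAX_STRIKES:
            # Session stays anonymous (customer_id NULL).
            return ("LEAD_CAPTURED", None)
    return ("COLLECT_PHONE", None)  # still waiting for a usable reply
```

Note that a valid number arriving after the third failed attempt is ignored in this sketch, because the thread has already moved to LEAD_CAPTURED.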

Cross-channel merge: the full sequence

A customer talks to the salon on WhatsApp first, then opens the web widget days later. Phone-only deterministic merge (ADR-0009 D5) recognises her on the second channel without asking her to re-register.
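The sequence can be walked through end to end with an in-memory database. This is a sketch under loud assumptions: SQLite stands in for the tenant_<slug> Postgres schema, the tables are trimmed to the columns needed here, and resolve_customer is a hypothetical helper; bind_phone and link_identity borrow the names used in this page but are not the real implementations.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    phone_e164 TEXT NOT NULL UNIQUE
);
CREATE TABLE customer_sessions (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),  -- NULL while anonymous
    provider TEXT NOT NULL,
    external_id TEXT NOT NULL,
    promoted_at TEXT
);
CREATE TABLE customer_identities (
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    provider TEXT NOT NULL,
    external_id TEXT NOT NULL,
    PRIMARY KEY (customer_id, provider, external_id)
);
"""


def resolve_customer(conn: sqlite3.Connection, phone_e164: str) -> Tuple[str, int]:
    row = conn.execute(
        "SELECT id FROM customers WHERE phone_e164 = ?", (phone_e164,)
    ).fetchone()
    if row is not None:
        return "HIT", row[0]
    cur = conn.execute("INSERT INTO customers (phone_e164) VALUES (?)", (phone_e164,))
    return "MISS", cur.lastrowid


def link_identity(conn: sqlite3.Connection, customer_id: int, provider: str, external_id: str) -> None:
    # Replay-safe: a duplicate (customer_id, provider, external_id) is ignored.
    conn.execute(
        "INSERT INTO customer_identities (customer_id, provider, external_id) "
        "VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
        (customer_id, provider, external_id),
    )


def bind_phone(conn: sqlite3.Connection, session_id: int, phone_e164: str) -> Tuple[str, int]:
    with conn:  # one transaction: resolve, promote, link
        outcome, customer_id = resolve_customer(conn, phone_e164)
        provider, external_id = conn.execute(
            "SELECT provider, external_id FROM customer_sessions WHERE id = ?",
            (session_id,),
        ).fetchone()
        conn.execute(
            "UPDATE customer_sessions SET customer_id = ?, promoted_at = datetime('now') "
            "WHERE id = ? AND customer_id IS NULL",
            (customer_id, session_id),
        )
        link_identity(conn, customer_id, provider, external_id)
    return outcome, customer_id


def demo() -> Tuple[str, str, bool, int, int, int]:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    phone = "+254712345678"
    with conn:
        # Tier-1 (WhatsApp): phone is known on the first inbound message.
        wa_outcome, wa_customer = resolve_customer(conn, phone)
        conn.execute(
            "INSERT INTO customer_sessions (customer_id, provider, external_id) "
            "VALUES (?, 'whatsapp', ?)", (wa_customer, phone))
        link_identity(conn, wa_customer, "whatsapp", phone)
        # Tier-2 (web widget): lands anonymous, customer_id is NULL.
        web_session = conn.execute(
            "INSERT INTO customer_sessions (provider, external_id) "
            "VALUES ('web_widget', 'widget-abc')").lastrowid
    web_outcome, web_customer = bind_phone(conn, web_session, phone)
    bind_phone(conn, web_session, phone)  # replay: must not duplicate anything
    count = lambda sql: conn.execute(sql).fetchone()[0]
    return (
        wa_outcome, web_outcome, wa_customer == web_customer,
        count("SELECT COUNT(*) FROM customers"),
        count("SELECT COUNT(*) FROM customer_identities"),
        count("SELECT COUNT(*) FROM customer_sessions WHERE promoted_at IS NOT NULL"),
    )
```

The WhatsApp contact is a MISS that creates the customer; the later web-widget bind is a HIT on the same phone_e164, so both sessions end up under one customer row with two linked identities and one promoted session.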

Prospect story: what this means for a booking

From a customer's point of view, the identity layer is invisible. She books by WhatsApp on Monday. On Thursday she opens the spa's website chat widget. The agent greets her by name ("Hi Mary, welcome back!"), knows her appointment history, and continues the conversation without asking her to sign up or log in. No password. No app. No registration form.

From the tenant owner's point of view, both conversations appear under the same customer profile in the dashboard — booking count, cancellation history, payment status — regardless of which channel the customer used.

Schema reference

All three tables live inside tenant_<slug> — never in public.

| Table | Key columns | Purpose |
| --- | --- | --- |
| customers | id, phone_e164 (unique), display_name, locale | One row per unique customer across all channels |
| customer_sessions | id, customer_id (nullable FK), provider, external_id, promoted_at | One row per channel session; customer_id is NULL until phone is bound |
| customer_identities | customer_id, provider, external_id (composite PK) | Many-to-one: multiple channel handles per customer |

The customer_id FK on customer_sessions being nullable is the key design choice that allows anonymous sessions to exist before phone capture. promote_session() sets it and records promoted_at atomically so the promotion is auditable.

Where it lives in code

| Concern | File | Key entry point |
| --- | --- | --- |
| Tenant ContextVar | app/tenancy/context.py | current_tenant ContextVar (L56) + frozen TenantContext (L40) |
| Tenant ORM model | app/tenancy/models.py | Tenant ORM class (L68); public-schema only |
| Schema-scoped session | app/persistence/session.py | get_tenant_session() (L48): ContextVar gate + SET LOCAL search_path |
| Channel-boundary resolve | app/agents/identity_resolver.py | resolve() (L154): phone_number_id to tenant + actor_type |
| Customer-side resolve | app/agents/identity_resolver.py | resolve_session() (L326), bind_phone() (L384), link_identity() (L491) |
| Customers DAO | app/persistence/customers.py | get_customer_by_phone() (L114), get_customer_by_alias() (L139), insert_customer() (L175) |
| Sessions DAO | app/persistence/customers.py | get_session() (L238), insert_session() (L257), promote_session() (L283) |
| Identities DAO | app/persistence/customers.py | link_identity() (L305): ON CONFLICT DO NOTHING against composite PK |

Decisions

Try this on local dev

  1. Inspect the tenant schema after onboarding. From the project root:

    psql "postgresql://postgres:postgres@localhost:5434/ratiba" -c "\dn"

    Lists every schema; you should see public plus one tenant_<slug> per tenant. Then \dt tenant_demo.* to confirm the tenant-scoped tables (customers, customer_sessions, customer_identities, appointments, services, ...) all exist. For the two-pool model, verify public.tenants has the expected schema_name column: \d public.tenants.

  2. Watch a cross-channel merge happen. Onboard a tenant with the web widget enabled. Chat first via the WhatsApp test number to create the customer, then open localhost:3010/widget?slug=<tenant> in a private window so you land as a fresh anonymous session. When the FSM hits COLLECT_PHONE, type the same phone you used on WhatsApp. Then run:

    psql "postgresql://postgres:postgres@localhost:5434/ratiba" \
    -c "SELECT id, customer_id, promoted_at FROM tenant_<slug>.customer_sessions ORDER BY created_at;"

    Both rows should reference the same customer_id. The row with promoted_at IS NOT NULL is the web session that was merged.

  3. ContextVar inspection in a pytest debugger. Drop breakpoint() into any test that exercises a tenant-scoped DAO and run:

    from app.tenancy.context import current_tenant
    current_tenant.get()

    You will see the frozen TenantContext snapshot that the per-scenario fixture installed. If get_tenant_session() ever raises RuntimeError("get_tenant_session() called without an active tenant context"), this is the same primitive failing closed — the ContextVar was never set for the current coroutine scope.

  4. Verify SET LOCAL search_path scoping. In a test session that exercises get_tenant_session(), check that after the connection is returned to the pool the search path is reset. The SET LOCAL is transaction-scoped: commit or rollback the transaction and re-run SHOW search_path on the connection — it will have reverted to the pool default.

  • Channel substrate: Tier-1 vs Tier-2 channel semantics; COLLECT_PHONE as a Tier-2 FSM entry-state.
  • Conversation FSM: COLLECT_PHONE node detail, 3-strike handling, and LEAD_CAPTURED terminal state.
  • Onboard a tenant: operator step-by-step for running the atomic onboarding flow.
  • Configuration: env vars and settings that govern pool sizes, Keycloak realm configuration, and per-tenant schema parameters.