LangGraph PostgresSaver — Multi-Tenant Schema Spike

Date: 2026-04-25
Author: Research subagent (commissioned by Adrian)
Scope: Does langgraph-checkpoint-postgres (Python) support per-tenant SET search_path such that Ratiba can store checkpoints in tenant-isolated Postgres schemas without forking?
Companion docs: docs/research/2026-04-25-agentic-landscape-2026.md (Phase C landscape scan)

1. Verdict

NO, not as-is. YES-WITH-FORK is viable, and a thin wrapper that hands PostgresSaver a pre-configured, tenant-scoped connection is the cleanest workaround if we accept LangGraph at all.

The Python PostgresSaver and AsyncPostgresSaver classes hardcode unqualified table names (checkpoints, checkpoint_blobs, checkpoint_writes, checkpoint_migrations) in every SQL statement, accept no schema or search_path argument anywhere in their constructors or per-call methods, and rely entirely on whatever search_path the underlying connection happens to have. The JS sibling package added a schema parameter; the Python port has not, and an open feature request (forum thread 3274, GitHub issue #7345) confirms maintainers are aware but have not shipped it. This means tenant isolation via schema-per-tenant only works if we control connection checkout — which is feasible but fragile under any pool that rebinds connections mid-flow (psycopg ConnectionPool is one such pool).

2. Evidence

2.1 Constructor signature — no schema parameter

libs/checkpoint-postgres/langgraph/checkpoint/postgres/__init__.py, lines ~40-49:

def __init__(
    self,
    conn: _internal.Conn,
    pipe: Pipeline | None = None,
    serde: SerializerProtocol | None = None,
) -> None:
    super().__init__(serde=serde)
    ...
    self.conn = conn
    self.pipe = pipe
    self.lock = threading.Lock()

No schema, no search_path, no connection_configurator callback. The async variant at aio.py lines ~45-57 is identical in this regard.

2.2 `from_conn_string` opens a bare connection

Same file, lines ~51-63:

@classmethod
@contextmanager
def from_conn_string(
    cls, conn_string: str, *, pipeline: bool = False
) -> Iterator[PostgresSaver]:
    with Connection.connect(
        conn_string, autocommit=True, prepare_threshold=0, row_factory=dict_row
    ) as conn:
        ...

The connection is opened with autocommit=True and never has SET search_path issued against it. The only way to inject one is to bypass from_conn_string entirely and pass a pre-configured Connection (or ConnectionPool with a configure hook) into __init__ directly.
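
A minimal sketch of that bypass, assuming psycopg 3 and a tenant schema that already exists (if it does not, Postgres silently skips it and resolves unqualified names against the next schema on the path, here public). The helper names tenant_schema_name, search_path_options, and tenant_saver are hypothetical and not part of LangGraph. The sketch sets search_path through the libpq options connection parameter, and the remaining connect arguments mirror the ones from_conn_string uses.

```python
import re
from contextlib import contextmanager

# Conservative whitelist: lowercase identifier, at most 63 characters.
_SCHEMA_RE = re.compile(r"[a-z_][a-z0-9_]{0,62}")


def tenant_schema_name(tenant_id: str) -> str:
    """Map a tenant id to a schema name, rejecting anything unsafe to interpolate."""
    schema = f"tenant_{tenant_id.lower()}"
    if not _SCHEMA_RE.fullmatch(schema):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    return schema


def search_path_options(schema: str) -> str:
    """Build the libpq `options` value that sets search_path at session start."""
    if not _SCHEMA_RE.fullmatch(schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    return f"-c search_path={schema},public"


@contextmanager
def tenant_saver(conn_string: str, tenant_id: str):
    """Yield a PostgresSaver bound to one dedicated, tenant-scoped connection."""
    # Imported lazily so the pure helpers above work without the drivers installed.
    from psycopg import Connection
    from psycopg.rows import dict_row
    from langgraph.checkpoint.postgres import PostgresSaver

    with Connection.connect(
        conn_string,
        autocommit=True,
        prepare_threshold=0,
        row_factory=dict_row,
        options=search_path_options(tenant_schema_name(tenant_id)),
    ) as conn:
        yield PostgresSaver(conn)
```

Because the path is fixed when the session starts, no later statement on this connection can run before it is set.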

2.3 `_cursor` does not re-set search_path on checkout

__init__.py lines ~266-301:

@contextmanager
def _cursor(self, *, pipeline: bool = False) -> Iterator[Cursor[DictRow]]:
    with self.lock, _internal.get_connection(self.conn) as conn:
        ...
        with conn.cursor(binary=True, row_factory=dict_row) as cur:
            yield cur

This is the single chokepoint every read/write goes through. There is no per-call hook, no thread_id-aware schema selection, nothing. Whatever search_path is on the connection at this moment is what the checkpoint write hits.

2.4 SQL constants use unqualified table names

libs/checkpoint-postgres/langgraph/checkpoint/postgres/base.py:

UPSERT_CHECKPOINTS_SQL (lines 106-112): INSERT INTO checkpoints (...)
UPSERT_CHECKPOINT_BLOBS_SQL (lines 100-104): INSERT INTO checkpoint_blobs (...)
UPSERT_CHECKPOINT_WRITES_SQL (lines 114-120): INSERT INTO checkpoint_writes (...)
INSERT_CHECKPOINT_WRITES_SQL (lines 122-127): INSERT INTO checkpoint_writes (...)
MIGRATIONS (lines 47-90): CREATE TABLE checkpoint_migrations (...), etc.

Every table reference is unqualified. None are template-string interpolated with a schema variable. A fork that adds schema support has to touch all five SQL constants listed above (and any read-side queries defined in the same module), plus setup(), plus probably _cursor (to issue SET LOCAL search_path per call).

2.5 Upstream confirmation — feature request still open

LangChain forum thread 3274 ("Feature request: Configurable PostgreSQL schema for langgraph-checkpoint-postgres (parity with LangGraphJS)") — the requester explicitly notes that the search_path workaround "can lead to error and can cause data leakage" when connections are managed through a pool. A maintainer redirected the requester to file GitHub issue #7345; no maintainer response on implementation status. The JS package (@langchain/langgraph-checkpoint-postgres) shipped a schema parameter on PostgresSaver.fromConnString(connString, { schema: "custom_schema" }). Python has not.

2.6 The pool problem

The forum-thread warning is real. psycopg_pool.ConnectionPool recycles connections; a connection that was last used by tenant A retains tenant A's search_path until something explicitly resets it. If we hand the same pool to LangGraph, then a checkpoint write for tenant B can land in tenant A's schema. The pool's configure= callback fires on connection creation, not on checkout — wrong hook. The correct hook is reset= (called on check-in) or a check= callback (psycopg-pool 3.2+) that runs on every checkout. Neither is wired through LangGraph's _cursor, so we have to wire it ourselves outside.
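
The failure mode can be reproduced without a database. The sketch below is a toy simulation, not psycopg code: FakeConn and FakePool are hypothetical stand-ins that model only two things, session state surviving check-in and an optional reset callback that runs on check-in.

```python
from collections import deque
from typing import Callable, Optional


class FakeConn:
    """Stand-in for a pooled connection; models only session-level search_path."""

    def __init__(self) -> None:
        self.search_path = "public"


class FakePool:
    """Toy pool that hands back the most recently returned connection first."""

    def __init__(self, size: int, reset: Optional[Callable[[FakeConn], None]] = None):
        self._idle = deque(FakeConn() for _ in range(size))
        self._reset = reset

    def getconn(self) -> FakeConn:
        return self._idle.popleft()

    def putconn(self, conn: FakeConn) -> None:
        if self._reset is not None:
            self._reset(conn)  # runs on check-in, like the reset= hook described above
        self._idle.appendleft(conn)


def write_checkpoint(pool: FakePool, tenant_schema: Optional[str] = None) -> str:
    """Return the schema an unqualified INSERT would resolve to on this checkout."""
    conn = pool.getconn()
    try:
        if tenant_schema is not None:
            conn.search_path = f"{tenant_schema},public"
        return conn.search_path.split(",")[0]
    finally:
        pool.putconn(conn)


def reset_path(conn: FakeConn) -> None:
    conn.search_path = "public"


# Without a reset hook, a checkout that never sets search_path (as _cursor
# does) inherits whatever the previous tenant left behind.
leaky = FakePool(size=1)
write_checkpoint(leaky, "tenant_a")
leaked_into = write_checkpoint(leaky)

# With a reset hook on check-in, the stale path is cleared before reuse.
safe = FakePool(size=1, reset=reset_path)
write_checkpoint(safe, "tenant_a")
safe_target = write_checkpoint(safe)
```

In this model leaked_into is tenant_a while safe_target is public. Resetting on check-in stops the leak but still does not route tenant B's write to tenant B's schema; something must set the correct path on checkout as well.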

3. Concrete options

Option A — Adopt PostgresSaver as-is, with a per-tenant connection wrapper

Approach. Don't use from_conn_string. Build a lightweight TenantScopedSaver factory: per incoming conversation, acquire a dedicated connection (not from a shared pool, or from a small per-tenant sub-pool), SET search_path TO tenant_xxx, public, then construct PostgresSaver(conn=that_connection) for the duration of one graph invocation. Dispose at the end.
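
One way the per-tenant sub-pool variant could look, as a sketch under stated assumptions: psycopg 3 with psycopg_pool installed, and tenant schemas that already exist. make_tenant_pool and TenantPoolRegistry are hypothetical names. The pool's configure= callback is acceptable here only because each pool serves exactly one tenant, so a connection can never be recycled across tenants.

```python
import threading
from typing import Any, Callable, Dict


def make_tenant_pool(conninfo: str, schema: str, max_size: int = 2) -> Any:
    """Build one small psycopg pool whose connections all use one tenant schema."""
    # Lazy imports: assumes psycopg and psycopg_pool are installed at call time.
    from psycopg import sql
    from psycopg.rows import dict_row
    from psycopg_pool import ConnectionPool

    def configure(conn) -> None:
        # Runs once per new connection; Identifier quotes the schema name safely.
        conn.execute(
            sql.SQL("SET search_path TO {}, public").format(sql.Identifier(schema))
        )

    return ConnectionPool(
        conninfo,
        min_size=1,
        max_size=max_size,
        configure=configure,
        kwargs={"autocommit": True, "prepare_threshold": 0, "row_factory": dict_row},
        open=True,
    )


class TenantPoolRegistry:
    """Lazily creates one pool per tenant schema, with a hard cap on pool count."""

    def __init__(self, pool_factory: Callable[[str], Any], max_tenants: int = 100):
        self._factory = pool_factory
        self._max_tenants = max_tenants
        self._pools: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def pool_for(self, schema: str) -> Any:
        with self._lock:
            if schema not in self._pools:
                if len(self._pools) >= self._max_tenants:
                    raise RuntimeError("tenant pool ceiling reached")
                self._pools[schema] = self._factory(schema)
            return self._pools[schema]

    def pool_count(self) -> int:
        with self._lock:
            return len(self._pools)
```

A caller would build the registry with something like TenantPoolRegistry(lambda schema: make_tenant_pool(DSN, schema)), where DSN is a placeholder, and then construct PostgresSaver(registry.pool_for("tenant_a")) per invocation.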

Pros.

Zero fork. Stay on upstream releases.
Honest schema isolation: the connection only ever sees one tenant's schema during its lifetime.
Plays nicely with LangGraph's interrupt/resume because the saver instance is per-invocation anyway.

Cons.

Connection churn. Opening a fresh Postgres connection per WhatsApp turn is heavier than pool checkout (~5-15ms TCP + auth on the same VPS). Mitigable with a per-tenant micro-pool (size 1-2), but pool count then grows linearly with tenant count: 100 tenants means 100 pools and 100-200 idle connections.
The setup() migration step needs to run once per tenant schema, not once globally. Manageable but new operational work for tenant onboarding.
Still relies on convention: any future LangGraph internal that grabs a second connection (e.g., for streaming) bypasses our wrapper. We should grep the source for any other connection acquisition path before committing.

Complexity. M (2-4h) for the wrapper + tenant-onboarding migration hook. Add 2h for an integration test that proves cross-tenant leak is prevented (concurrent writes from tenants A and B, assert each lands in its own schema).

Affected files. backend/app/persistence/checkpointer.py (new), backend/app/tenancy/onboarding.py (extend with checkpoint-schema migration), backend/tests/integration/test_checkpoint_isolation.py (new).

Option B — Fork PostgresSaver, add `schema` parameter

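Approach. Vendor a copy under backend/app/vendor/langgraph_checkpoint_postgres/, parameterise the five SQL constants with a {schema} placeholder, accept schema: str in __init__, and inject SET LOCAL search_path TO {schema}, public at the top of _cursor. SET LOCAL only takes effect inside a transaction block, so on autocommit connections (the mode from_conn_string uses) the fork must either wrap each call in an explicit transaction or rely on the qualified table names alone. Pin the original version we forked from and document the upgrade procedure.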
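
The table-name parameterisation could be prototyped as a rewrite step before committing to hand-edited constants. This is a rough sketch, not the upstream or JS implementation: qualify_tables is a hypothetical helper, the sample statements in the usage lines are illustrative rather than copied from base.py, and a regex rewrite is fragile enough that a real fork should review each resulting statement.

```python
import re

# The four table names the saver uses, per section 2.4.
CHECKPOINT_TABLES = (
    "checkpoint_migrations",
    "checkpoint_writes",
    "checkpoint_blobs",
    "checkpoints",
)

_IDENT_RE = re.compile(r"[a-z_][a-z0-9_]{0,62}")

# Match a table name only as a whole word that is not already qualified:
# the lookbehind rejects a preceding word character, dot, or double quote.
_TABLE_RE = re.compile(r'(?<![\w."])(' + "|".join(CHECKPOINT_TABLES) + r")\b")


def qualify_tables(sql_text: str, schema: str) -> str:
    """Prefix every unqualified checkpoint table reference with a quoted schema."""
    if not _IDENT_RE.fullmatch(schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    return _TABLE_RE.sub(lambda m: f'"{schema}".{m.group(1)}', sql_text)


# Illustrative statements only (not the upstream SQL).
insert_sql = qualify_tables(
    "INSERT INTO checkpoints (thread_id) VALUES (%s)", "tenant_a"
)
select_sql = qualify_tables(
    "SELECT checkpoint_id FROM checkpoint_writes", "tenant_a"
)
```

Column names such as checkpoint_id and index names such as checkpoints_thread_id_idx are left alone because the pattern requires a word boundary after the table name, and applying the function twice does not double-qualify.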

Pros.

Clean ergonomics: PostgresSaver(conn=pool, schema=tenant_schema) per call.
No connection-churn cost — works fine with one shared pool.
Mirrors the JS package's API, so when (if) Python upstream merges the same shape, our migration is mechanical.

Cons.

Maintenance debt. Every LangGraph release we use, we have to reapply the patch. Forum thread 3274 suggests upstream may eventually accept a PR — best case is we contribute the patch and it lands in 3-6 months.
MIGRATIONS constant changes shape (DDL needs CREATE SCHEMA IF NOT EXISTS + qualified table names), which means every minor LangGraph release that touches migrations could conflict.
Forking is a yellow flag in the dependency story for the PRD's "ready for health-data compliance" promise — auditors will ask why we patched a state-store library.

Complexity. M (3-5h) for the fork + parameterisation + tests. Add 1-2h ongoing per LangGraph upgrade to rebase.

Affected files. backend/app/vendor/langgraph_checkpoint_postgres/ (vendored copy of __init__.py, aio.py, base.py), backend/pyproject.toml (drop the upstream dep, or pin and shadow), backend/tests/integration/test_vendored_checkpointer.py (new).

Option C — Reject LangGraph as a runtime dependency, custom FSM on asyncpg + Redis

Approach. Take only the API shape (graph nodes, interrupt(), resume(), checkpoint) and reimplement on top of asyncpg (for durable checkpoint state in the tenant's own schema) plus Redis (for hot conversation state and the FSM resume token). Roughly: a Graph class with add_node / add_edge / compile, an Agent.invoke(state) that runs nodes until completion or Interrupt, a checkpoints table per tenant schema with (thread_id, checkpoint_id, parent_id, state_blob, metadata), and a Redis key ratiba:fsm:{thread_id} holding the live FSM cursor with TTL.
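
To make the scope concrete, here is a deliberately small sketch of the graph runner and interrupt protocol. It is an illustration of the shape, not a design: the names Graph, Runner, and Interrupt are hypothetical, and an in-memory dict stands in for both the per-tenant checkpoints table and the Redis cursor.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Checkpoint = Tuple[Optional[str], State]  # (next node to run, state at that point)


class Interrupt(Exception):
    """Raised by a node that needs outside input before it can continue."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class Graph:
    nodes: Dict[str, Callable[[State], State]] = field(default_factory=dict)
    edges: Dict[str, str] = field(default_factory=dict)
    entry: Optional[str] = None

    def add_node(self, name: str, fn: Callable[[State], State]) -> None:
        self.nodes[name] = fn
        if self.entry is None:
            self.entry = name  # first node added is the entry point

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src] = dst

    def compile(self, store: Dict[str, List[Checkpoint]]) -> "Runner":
        return Runner(self, store)


class Runner:
    def __init__(self, graph: Graph, store: Dict[str, List[Checkpoint]]) -> None:
        self.graph = graph
        self.store = store  # thread_id -> checkpoint history (in-memory stand-in)

    def invoke(self, thread_id: str, state: Optional[State] = None) -> Dict[str, Any]:
        history = self.store.setdefault(thread_id, [])
        if history:
            # Resume: restart at the saved node with new input merged in.
            node, saved = history[-1]
            state = {**saved, **(state or {})}
        else:
            node, state = self.graph.entry, dict(state or {})
        while node is not None:
            try:
                state = self.graph.nodes[node](state)
            except Interrupt as exc:
                history.append((node, state))  # re-run this node on resume
                return {"status": "interrupted", "reason": exc.reason, "state": state}
            node = self.graph.edges.get(node)
            history.append((node, state))
        return {"status": "done", "state": state}


def greet(s: State) -> State:
    return {**s, "greeted": True}


def handoff(s: State) -> State:
    if "agent_reply" not in s:
        raise Interrupt("needs human")
    return {**s, "resolved": True}


g = Graph()
g.add_node("greet", greet)
g.add_node("handoff", handoff)
g.add_edge("greet", "handoff")
store: Dict[str, List[Checkpoint]] = {}
runner = g.compile(store)
first = runner.invoke("t1", {"msg": "hi"})
second = runner.invoke("t1", {"agent_reply": "ok"})
```

Here first stops at the handoff node with status interrupted, and second resumes from the stored checkpoint and finishes. Everything listed under Cons (pending sends, parent checkpoint chains, write idempotency) is absent from this sketch.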

Pros.

Native multi-tenancy: schema lives in the tenant's SET search_path, full stop.
Zero LangGraph dep — no upgrade churn, no forks, no LangChain ecosystem pull.
Full control over the persistence shape: we can index by tenant, by date, by status; we can integrate with the existing audit-log table; we can store M-Pesa correlation IDs on the checkpoint without monkeying with metadata blobs.
Lighter dependency footprint matters for compliance review.

Cons.

We lose LangGraph's ecosystem: prebuilt agents, tools integration, the langgraph dev UI for debugging graphs, future features.
Real engineering work. Estimate the MVP at roughly 1200-1500 LOC, summing the components: Graph/Node/Edge (~200), interrupt/resume protocol (~150), durable checkpoint table + asyncpg DAO (~300), Redis FSM cursor (~150), tests (~400-700).
Reinventing edge cases LangGraph already solved: pending sends, parent checkpoint chain for forking conversations, write idempotency.

Complexity. XL (>6h, must be broken down). Realistic estimate is 2-4 days for a working FSM that covers Ratiba's actual needs (linear conversation graph + one interrupt point for human handoff). Should be split into: (1) checkpoint table + DAO, (2) graph runner + interrupt protocol, (3) Redis cursor + resume, (4) integration tests with simulated WhatsApp turn-taking.

Affected files. backend/app/orchestration/ (new package: graph.py, node.py, interrupt.py, runner.py), backend/app/orchestration/persistence.py (asyncpg checkpoint DAO), backend/app/orchestration/cursor.py (Redis FSM cursor), backend/alembic/versions/xxxx_create_checkpoints.py (per-tenant migration), backend/tests/orchestration/ (full suite).

4. Recommendation

Go with Option A for the spike-to-production path, with a planned escape hatch to Option C if LangGraph proves restrictive in other ways.

Justification: Option A buys us LangGraph's interrupt/resume semantics (the original reason we considered it) at the cost of one custom wrapper class and per-onboarding migration work, both of which we need anyway for the schema-per-tenant architecture. Option B's fork-maintenance overhead is not worth the small ergonomic gain, especially because upstream may eventually ship its own schema parameter (section 2.5), which would make our patch redundant. Option C is the right answer if during M3-M4 we discover LangGraph forces other architectural compromises (e.g., its tool-call protocol doesn't fit our Daraja/Sonner integrations cleanly), but it's premature to commit to that work today.

Suggested acceptance criteria for the Option A wrapper:

TenantScopedCheckpointer.for_tenant(tenant_id) returns a PostgresSaver whose underlying connection has search_path = tenant_{id}, public set before any checkpoint call.
Concurrent invocations across tenants A and B (two graph runs in parallel: asyncio.gather with AsyncPostgresSaver, or two threads with the sync PostgresSaver) write checkpoints exclusively to their own schemas, verified by SELECT count(*) FROM tenant_a.checkpoints and tenant_b.checkpoints after the test.
Tenant onboarding flow runs PostgresSaver.setup() against the new tenant's connection so the four checkpoint tables exist in that schema.
Resume flow: a graph that hits interrupt() on tenant A can be resumed 10 minutes later by TenantScopedCheckpointer.for_tenant("A") and continues from the saved state.
No connection from tenant A's pool is ever returned by tenant B's resolver — enforced by separate per-tenant ConnectionPool instances or a check= callback that asserts current_setting('search_path') matches the expected tenant schema.
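
The last criterion's check= guard might look like the sketch below, assuming psycopg-pool 3.2+ (where check= is called with the connection on every checkout and an exception discards that connection) and connections opened with row_factory=dict_row. first_schema and make_search_path_check are hypothetical names.

```python
from typing import Any, Callable


def first_schema(search_path: str) -> str:
    """Return the first entry of a search_path value such as 'tenant_a, public'."""
    return search_path.split(",")[0].strip().strip('"')


def make_search_path_check(expected_schema: str) -> Callable[[Any], None]:
    """Build a check callback that rejects connections bound to another schema."""

    def check(conn: Any) -> None:
        row = conn.execute("SHOW search_path").fetchone()
        # dict_row yields {"search_path": ...}; fall back to tuple rows.
        current = row["search_path"] if isinstance(row, dict) else row[0]
        if first_schema(current) != expected_schema:
            raise RuntimeError(
                f"connection search_path is {current!r}, expected {expected_schema!r}"
            )

    return check
```

It would be passed as check=make_search_path_check("tenant_a") when constructing that tenant's ConnectionPool. The guard costs one extra round trip per checkout, which is the price of turning a silent cross-tenant write into a loud failure.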

Test cases (testcontainers, no mocks):

Cross-tenant write isolation under concurrency.
Resume across process restart (kill + restart backend mid-interrupt, confirm checkpoint survives).
Tenant deletion: DROP SCHEMA tenant_x CASCADE cleanly removes all checkpoint state for that tenant.
Pool exhaustion: confirm that hitting the per-tenant pool ceiling produces a graceful 503-equivalent rather than leaking onto another tenant's connection.

Risk mitigation:

Before committing, grep the langgraph-checkpoint-postgres source for any path that opens a connection outside _cursor (e.g., during streaming or pipeline mode). If found, that path is a leak vector and forces us to Option B or C.
Pin LangGraph and langgraph-checkpoint-postgres to exact versions; subscribe to release notes for any DDL changes in MIGRATIONS.
Write the integration test in section "Suggested acceptance criteria" #2 first — that test is the gate. If it fails on Option A, fall back to Option B without sunk-cost hesitation.

5. Open questions

Pipeline mode and second connections. _cursor has a pipeline=True branch that uses conn.pipeline(). Does this stay on the same physical connection? The code suggests yes (it uses self.conn), but I have not traced every code path. If pipeline mode ever opens a second connection from a pool, our search_path guarantee breaks. Verify with a focused read of the psycopg pipeline docs before locking Option A in.
AsyncPostgresSaver and asyncpg. The async variant uses psycopg async connections, not asyncpg. Ratiba's existing standard is asyncpg. We need to confirm we're comfortable running both drivers in the same backend, or accept that the LangGraph state store uses psycopg while the rest of the app uses asyncpg. This is fine technically but a small operational wart.
setup() idempotency across tenant schemas. If LangGraph ships a MIGRATIONS change in v0.x, do we have to re-run setup() against every tenant schema? Probably yes. Need to confirm the migration mechanism is per-schema and that we have an Alembic-style story for tenant-fanout migrations.
langgraph dev UI compatibility. The dev UI assumes a single global checkpointer. If we want to use it for debugging, can it be pointed at one tenant schema at a time? Not blocking for production, but worth checking before relying on it as a dev tool.
The forum's "data leakage" warning specifics. The feature-requester flagged that the search_path workaround "can cause data leakage" with pools. We should reproduce the exact failure mode they hit before declaring our wrapper safe — possibly the check= callback is not enough and we genuinely need separate pools per tenant.
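
For the setup() fan-out question, a tenant-by-tenant runner could be as small as the sketch below. It assumes setup() is safe to re-run per schema, which is exactly what that open question still has to confirm. fan_out_setup and setup_schema are hypothetical names, and the schema names passed in are assumed to be validated and to exist already.

```python
from typing import Callable, Dict, Iterable


def fan_out_setup(
    schemas: Iterable[str], run_setup: Callable[[str], None]
) -> Dict[str, str]:
    """Run run_setup(schema) for every schema; return failures keyed by schema."""
    failures: Dict[str, str] = {}
    for schema in schemas:
        try:
            run_setup(schema)
        except Exception as exc:  # keep going so one bad tenant does not block the rest
            failures[schema] = f"{type(exc).__name__}: {exc}"
    return failures


def setup_schema(conn_string: str, schema: str) -> None:
    """Run PostgresSaver.setup() on a connection scoped to one tenant schema."""
    # Lazy imports: assumes psycopg and langgraph-checkpoint-postgres are installed.
    from psycopg import Connection
    from psycopg.rows import dict_row
    from langgraph.checkpoint.postgres import PostgresSaver

    with Connection.connect(
        conn_string,
        autocommit=True,
        prepare_threshold=0,
        row_factory=dict_row,
        options=f"-c search_path={schema},public",  # schema must be pre-validated
    ) as conn:
        PostgresSaver(conn).setup()
```

A deployment step could call fan_out_setup(all_schemas, lambda s: setup_schema(DSN, s)), with all_schemas and DSN as placeholders, and fail the release if the returned dict is non-empty.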
