LangGraph PostgresSaver — Multi-Tenant Schema Spike
Date: 2026-04-25
Author: Research subagent (commissioned by Adrian)
Scope: Does langgraph-checkpoint-postgres (Python) support per-tenant SET search_path such that Ratiba can store checkpoints in tenant-isolated Postgres schemas without forking?
Companion docs: docs/research/2026-04-25-agentic-landscape-2026.md (Phase C landscape scan)
1. Verdict
NO, not as-is. YES-WITH-FORK is viable, but a thin connection-configuration wrapper (Option A below: set search_path on a tenant-dedicated connection or per-tenant pool before constructing the saver) is the cleanest workaround if we accept LangGraph at all.
The Python PostgresSaver and AsyncPostgresSaver classes hardcode unqualified table names (checkpoints, checkpoint_blobs, checkpoint_writes, checkpoint_migrations) in every SQL statement, accept no schema or search_path argument in their constructors or per-call methods, and rely entirely on whatever search_path the underlying connection happens to have. The JS sibling package added a schema parameter; the Python port has not, and an open feature request (forum thread 3274, GitHub issue #7345) shows maintainers are aware but have not shipped it. Tenant isolation via schema-per-tenant therefore only works if we control connection checkout. That is feasible but fragile under any shared pool that hands a recycled connection, with its previous search_path intact, to a different tenant (psycopg ConnectionPool behaves this way unless a reset or check hook intervenes).
2. Evidence
2.1 Constructor signature — no schema parameter
libs/checkpoint-postgres/langgraph/checkpoint/postgres/__init__.py, lines ~40-49:
def __init__(
    self,
    conn: _internal.Conn,
    pipe: Pipeline | None = None,
    serde: SerializerProtocol | None = None,
) -> None:
    super().__init__(serde=serde)
    ...
    self.conn = conn
    self.pipe = pipe
    self.lock = threading.Lock()
No schema, no search_path, no connection_configurator callback. The async variant at aio.py lines ~45-57 is identical in this regard.
2.2 from_conn_string opens a bare connection
Same file, lines ~51-63:
@classmethod
@contextmanager
def from_conn_string(
    cls, conn_string: str, *, pipeline: bool = False
) -> Iterator[PostgresSaver]:
    with Connection.connect(
        conn_string, autocommit=True, prepare_threshold=0, row_factory=dict_row
    ) as conn:
        ...
The connection is opened with autocommit=True and never has SET search_path issued against it. The only way to inject one is to bypass from_conn_string entirely and pass a pre-configured Connection (or ConnectionPool with a configure hook) into __init__ directly.
2.3 _cursor does not re-set search_path on checkout
__init__.py lines ~266-301:
@contextmanager
def _cursor(self, *, pipeline: bool = False) -> Iterator[Cursor[DictRow]]:
    with self.lock, _internal.get_connection(self.conn) as conn:
        ...
        with conn.cursor(binary=True, row_factory=dict_row) as cur:
            yield cur
This is the single chokepoint every read/write goes through. There is no per-call hook, no thread_id-aware schema selection, nothing. Whatever search_path is on the connection at this moment is what the checkpoint write hits.
2.4 SQL constants use unqualified table names
libs/checkpoint-postgres/langgraph/checkpoint/postgres/base.py:
- UPSERT_CHECKPOINTS_SQL (lines 106-112): INSERT INTO checkpoints (...)
- UPSERT_CHECKPOINT_BLOBS_SQL (lines 100-104): INSERT INTO checkpoint_blobs (...)
- UPSERT_CHECKPOINT_WRITES_SQL (lines 114-120): INSERT INTO checkpoint_writes (...)
- INSERT_CHECKPOINT_WRITES_SQL (lines 122-127): INSERT INTO checkpoint_writes (...)
- MIGRATIONS (lines 47-90): CREATE TABLE checkpoint_migrations (...), etc.
Every table reference is unqualified. None are template-string interpolated with a schema variable. A fork that adds schema support has to touch all five SQL constants plus setup() plus probably _cursor (to issue SET LOCAL search_path per call).
2.5 Upstream confirmation — feature request still open
LangChain forum thread 3274 ("Feature request: Configurable PostgreSQL schema for langgraph-checkpoint-postgres (parity with LangGraphJS)") — the requester explicitly notes that the search_path workaround "can lead to error and can cause data leakage" when connections are managed through a pool. A maintainer redirected the requester to file GitHub issue #7345; no maintainer response on implementation status. The JS package (@langchain/langgraph-checkpoint-postgres) shipped a schema parameter on PostgresSaver.fromConnString(connString, { schema: "custom_schema" }). Python has not.
2.6 The pool problem
The forum-thread warning is real. psycopg_pool.ConnectionPool recycles connections; a connection that was last used by tenant A retains tenant A's search_path until something explicitly resets it. If we hand the same pool to LangGraph, then a checkpoint write for tenant B can land in tenant A's schema. The pool's configure= callback fires on connection creation, not on checkout — wrong hook. The correct hook is reset= (called on check-in) or a check= callback (psycopg-pool 3.2+) that runs on every checkout. Neither is wired through LangGraph's _cursor, so we have to wire it ourselves outside.
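The leak described above can be illustrated without a database. The following is a toy model only: FakeConn, ToyPool, write_checkpoint and restore_default are hypothetical names invented for this sketch, and the pool holds a single connection so the reuse is deterministic. It is not psycopg_pool, but it mirrors the relevant behaviour: session state survives check-in unless a reset hook clears it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

DEFAULT_PATH = "public"


@dataclass
class FakeConn:
    """Stand-in for a Postgres session; models only the session search_path."""
    search_path: str = DEFAULT_PATH


@dataclass
class ToyPool:
    """Single-connection pool with an optional hook that runs on check-in."""
    reset: Optional[Callable[[FakeConn], None]] = None
    idle: List[FakeConn] = field(default_factory=lambda: [FakeConn()])

    def getconn(self) -> FakeConn:
        return self.idle.pop()

    def putconn(self, conn: FakeConn) -> None:
        if self.reset is not None:
            self.reset(conn)
        self.idle.append(conn)


def write_checkpoint(pool: ToyPool, set_path_to: Optional[str]) -> str:
    """Return the schema an unqualified INSERT would resolve to."""
    conn = pool.getconn()
    try:
        if set_path_to is not None:
            conn.search_path = set_path_to
        # The first entry of search_path is where unqualified names resolve.
        return conn.search_path.split(",")[0].strip()
    finally:
        pool.putconn(conn)


def restore_default(conn: FakeConn) -> None:
    conn.search_path = DEFAULT_PATH


# Shared pool, no hook: tenant A sets the path, then a caller that
# never sets it (standing in for tenant B) inherits tenant A's schema.
leaky = ToyPool()
write_checkpoint(leaky, "tenant_a, public")
leaked_to = write_checkpoint(leaky, None)

# Same sequence with a reset hook on check-in.
guarded = ToyPool(reset=restore_default)
write_checkpoint(guarded, "tenant_a, public")
safe_target = write_checkpoint(guarded, None)

print(leaked_to, safe_target)
```

In the unguarded pool the second write resolves to tenant_a; with the reset hook it falls back to public. Falling back to public is still the wrong schema for tenant B, which is why a reset hook alone is a safety net and not a substitute for setting the path explicitly per tenant.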
3. Concrete options
Option A — Adopt PostgresSaver as-is, with a per-tenant connection wrapper
Approach. Don't use from_conn_string. Build a lightweight TenantScopedSaver factory: per incoming conversation, acquire a dedicated connection (not from a shared pool, or from a small per-tenant sub-pool), SET search_path TO tenant_xxx, public, then construct PostgresSaver(conn=that_connection) for the duration of one graph invocation. Dispose at the end.
Pros.
- Zero fork. Stay on upstream releases.
- Honest schema isolation: the connection only ever sees one tenant's schema during its lifetime.
- Plays nicely with LangGraph's interrupt/resume because the saver instance is per-invocation anyway.
Cons.
- Connection churn. Opening a fresh Postgres connection per WhatsApp turn is heavier than pool checkout (~5-15ms TCP + auth on the same VPS). Mitigable with a per-tenant micro-pool (size 1-2), but pool count then grows linearly with tenant count: 100 tenants = 100-200 idle connections.
- The setup() migration step needs to run once per tenant schema, not once globally. Manageable but new operational work for tenant onboarding.
- Still relies on convention: any future LangGraph internal that grabs a second connection (e.g., for streaming) bypasses our wrapper. We should grep the source for any other connection acquisition path before committing.
Complexity. M (2-4h) for the wrapper + tenant-onboarding migration hook. Add 2h for an integration test that proves cross-tenant leak is prevented (concurrent writes from tenants A and B, assert each lands in its own schema).
Affected files. backend/app/persistence/checkpointer.py (new), backend/app/tenancy/onboarding.py (extend with checkpoint-schema migration), backend/tests/integration/test_checkpoint_isolation.py (new).
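A minimal sketch of the Option A wrapper, under stated assumptions: tenant_schema and tenant_scoped_saver are hypothetical names, the tenant_<id> naming scheme and the identifier whitelist are illustrative, and the connection arguments simply mirror the ones from_conn_string passes upstream. The psycopg and LangGraph imports are done lazily so the validation helper can be exercised on its own.

```python
import re
from contextlib import contextmanager
from typing import Iterator

# Assumption: tenant ids are short alphanumeric tokens; anything else is rejected
# so the value can never smuggle SQL or extra schemas into search_path.
_TENANT_ID_RE = re.compile(r"[A-Za-z0-9_]{1,48}")


def tenant_schema(tenant_id: str) -> str:
    """Map a tenant id to its schema name (hypothetical tenant_<id> convention)."""
    if not _TENANT_ID_RE.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    # Unquoted identifiers fold to lowercase in Postgres, so normalise here.
    return f"tenant_{tenant_id.lower()}"


@contextmanager
def tenant_scoped_saver(conn_string: str, tenant_id: str) -> Iterator["PostgresSaver"]:
    """Yield a PostgresSaver bound to a dedicated, tenant-scoped connection."""
    # Lazy imports: only needed when a real connection is opened.
    import psycopg
    from psycopg.rows import dict_row
    from langgraph.checkpoint.postgres import PostgresSaver

    schema = tenant_schema(tenant_id)
    with psycopg.connect(
        conn_string, autocommit=True, prepare_threshold=0, row_factory=dict_row
    ) as conn:
        # set_config takes a bound parameter; is_local=false makes the setting
        # last for the whole session, which here is one graph invocation.
        conn.execute(
            "SELECT set_config('search_path', %s, false)",
            (f"{schema}, public",),
        )
        yield PostgresSaver(conn)
        # Leaving the with-block closes the connection, so the path cannot leak.
```

Usage would look like `with tenant_scoped_saver(dsn, "42") as saver: graph = builder.compile(checkpointer=saver)`, where dsn and builder are whatever the application already has. The sketch opens one connection per invocation, which is exactly the churn cost listed under Cons.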
Option B — Fork PostgresSaver, add schema parameter
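Approach. Vendor a copy under backend/app/vendor/langgraph_checkpoint_postgres/, parameterise the five SQL constants with a {schema} placeholder, accept schema: str in __init__, and inject SET LOCAL search_path TO {schema}, public at the top of _cursor. One caveat: SET LOCAL only takes effect inside a transaction block, and upstream opens its connections with autocommit=True, so the fork would have to wrap each call in an explicit transaction or rely on the schema-qualified table names alone. Pin the original version we forked from and document the upgrade procedure.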
Pros.
- Clean ergonomics: PostgresSaver(conn=pool, schema=tenant_schema) per call.
- No connection-churn cost: works fine with one shared pool.
- Mirrors the JS package's API, so when (if) Python upstream merges the same shape, our migration is mechanical.
Cons.
- Maintenance debt. Every LangGraph release we use, we have to reapply the patch. Forum thread 3274 suggests upstream may eventually accept a PR — best case is we contribute the patch and it lands in 3-6 months.
- MIGRATIONS constant changes shape (DDL needs CREATE SCHEMA IF NOT EXISTS + qualified table names), which means every minor LangGraph release that touches migrations could conflict.
- Forking is a yellow flag in the dependency story for the PRD's "ready for health-data compliance" promise: auditors will ask why we patched a state-store library.
Complexity. M (3-5h) for the fork + parameterisation + tests. Add 1-2h ongoing per LangGraph upgrade to rebase.
Affected files. backend/app/vendor/langgraph_checkpoint_postgres/ (vendored copy of __init__.py, aio.py, base.py), backend/pyproject.toml (drop the upstream dep, or pin and shadow), backend/tests/integration/test_vendored_checkpointer.py (new).
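The {schema} parameterisation in Option B can be sketched as follows. This is an illustration, not the upstream code: the two templates reuse the constant names from section 2.4 with their column lists elided exactly as in those excerpts (so they are not executable SQL as written), and quote_schema and render are hypothetical helpers.

```python
import re

# Hypothetical templated forms of two upstream constants; "(...)" stands for
# the elided column lists and VALUES clauses.
UPSERT_CHECKPOINTS_SQL = "INSERT INTO {schema}.checkpoints (...)"
UPSERT_CHECKPOINT_BLOBS_SQL = "INSERT INTO {schema}.checkpoint_blobs (...)"

# Conservative whitelist: lowercase identifier, at most 63 bytes (the
# Postgres identifier length limit).
_IDENT_RE = re.compile(r"[a-z_][a-z0-9_]{0,62}")


def quote_schema(schema: str) -> str:
    """Validate a schema name and return it as a double-quoted identifier."""
    if not _IDENT_RE.fullmatch(schema):
        raise ValueError(f"invalid schema name: {schema!r}")
    return f'"{schema}"'


def render(sql_template: str, schema: str) -> str:
    """Fill the {schema} placeholder with a validated, quoted identifier."""
    return sql_template.format(schema=quote_schema(schema))


rendered = render(UPSERT_CHECKPOINTS_SQL, "tenant_a")
print(rendered)
```

Two design notes. First, str.format treats every brace as a placeholder, so any upstream statement containing literal braces would need them doubled; psycopg's own sql.SQL(...).format(schema=sql.Identifier(name)) composition is probably the sturdier route in a real fork. Second, once every table reference is qualified this way, the statements no longer depend on search_path at all.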
Option C — Reject LangGraph as a runtime dependency, custom FSM on asyncpg + Redis
Approach. Take only the API shape (graph nodes, interrupt(), resume(), checkpoint) and reimplement on top of asyncpg (for durable checkpoint state in the tenant's own schema) plus Redis (for hot conversation state and the FSM resume token). Roughly: a Graph class with add_node / add_edge / compile, an Agent.invoke(state) that runs nodes until completion or Interrupt, a checkpoints table per tenant schema with (thread_id, checkpoint_id, parent_id, state_blob, metadata), and a Redis key ratiba:fsm:{thread_id} holding the live FSM cursor with TTL.
Pros.
- Native multi-tenancy: schema lives in the tenant's SET search_path, full stop.
- Zero LangGraph dep: no upgrade churn, no forks, no LangChain ecosystem pull.
- Full control over the persistence shape: we can index by tenant, by date, by status; we can integrate with the existing audit-log table; we can store M-Pesa correlation IDs on the checkpoint without monkeying with metadata blobs.
- Lighter dependency footprint matters for compliance review.
Cons.
- We lose LangGraph's ecosystem: prebuilt agents, tools integration, the langgraph dev UI for debugging graphs, future features.
- Real engineering work. Estimate the MVP at roughly 800 LOC of implementation plus 400-700 LOC of tests (1200-1500 total): Graph/Node/Edge (~200), interrupt/resume protocol (~150), durable checkpoint table + asyncpg DAO (~300), Redis FSM cursor (~150), tests (~400-700).
- Reinventing edge cases LangGraph already solved: pending sends, parent checkpoint chain for forking conversations, write idempotency.
Complexity. XL (>6h, must be broken down). Realistic estimate is 2-4 days for a working FSM that covers Ratiba's actual needs (linear conversation graph + one interrupt point for human handoff). Should be split into: (1) checkpoint table + DAO, (2) graph runner + interrupt protocol, (3) Redis cursor + resume, (4) integration tests with simulated WhatsApp turn-taking.
Affected files. backend/app/orchestration/ (new package: graph.py, node.py, interrupt.py, runner.py), backend/app/orchestration/persistence.py (asyncpg checkpoint DAO), backend/app/orchestration/cursor.py (Redis FSM cursor), backend/alembic/versions/xxxx_create_checkpoints.py (per-tenant migration), backend/tests/orchestration/ (full suite).
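To give a feel for the size of the Option C core, here is a minimal in-memory sketch of the graph runner and interrupt protocol. All names (Graph, Interrupt, RunResult) are hypothetical, persistence is left out entirely, and the compile step is omitted; the (state, next_node) pair in RunResult is what the checkpoint table and the Redis cursor would store in a fuller version.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]
NodeFn = Callable[[State], State]


class Interrupt(Exception):
    """Raised by a node to pause the run until external input arrives."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class RunResult:
    state: State
    next_node: Optional[str]           # None means the graph ran to completion
    interrupted: Optional[str] = None  # reason given by the pausing node


@dataclass
class Graph:
    """Linear graph: each node has at most one outgoing edge."""
    nodes: Dict[str, NodeFn] = field(default_factory=dict)
    edges: Dict[str, str] = field(default_factory=dict)

    def add_node(self, name: str, fn: NodeFn) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src] = dst

    def invoke(self, state: State, start: str) -> RunResult:
        current: Optional[str] = start
        while current is not None:
            try:
                # Nodes get a copy, so an interrupted node leaves state untouched.
                state = self.nodes[current](dict(state))
            except Interrupt as stop:
                return RunResult(state, current, stop.reason)
            current = self.edges.get(current)
        return RunResult(state, None)


def greet(state: State) -> State:
    state["greeted"] = True
    return state


def handoff(state: State) -> State:
    if "agent_reply" not in state:
        raise Interrupt("needs human")
    state["done"] = True
    return state


graph = Graph()
graph.add_node("greet", greet)
graph.add_node("handoff", handoff)
graph.add_edge("greet", "handoff")

paused = graph.invoke({}, start="greet")
resumed = graph.invoke({**paused.state, "agent_reply": "ok"}, start=paused.next_node)
```

The first call stops at handoff with the greeting already recorded; the second call re-enters at the paused node with the human reply merged in and runs to completion. The edge cases listed under Cons (pending sends, parent checkpoint chains, write idempotency) are deliberately not covered here.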
4. Recommendation
Go with Option A for the spike-to-production path, with a planned escape hatch to Option C if LangGraph proves restrictive in other ways.
Justification: Option A buys us LangGraph's interrupt/resume semantics (the original reason we considered it) at the cost of one custom wrapper class and per-onboarding migration work, both of which we need anyway for the schema-per-tenant architecture. Option B's fork-maintenance overhead is not worth the small ergonomic gain, especially because we'd be carrying a patch that upstream may supersede with its own schema parameter on an unknown timeline (the feature request is open, with no maintainer commitment yet). Option C is the right answer if during M3-M4 we discover LangGraph forces other architectural compromises (e.g., its tool-call protocol doesn't fit our Daraja/Sonner integrations cleanly), but it's premature to commit to that work today.
Suggested acceptance criteria for the Option A wrapper:
1. TenantScopedCheckpointer.for_tenant(tenant_id) returns a PostgresSaver whose underlying connection has search_path = tenant_{id}, public set before any checkpoint call.
2. Concurrent invocations across tenants A and B (asyncio.gather of two graph runs) write checkpoints exclusively to their own schemas, verified by SELECT count(*) FROM tenant_a.checkpoints and tenant_b.checkpoints after the test.
3. Tenant onboarding flow runs PostgresSaver.setup() against the new tenant's connection so the four checkpoint tables exist in that schema.
4. Resume flow: a graph that hits interrupt() on tenant A can be resumed 10 minutes later by TenantScopedCheckpointer.for_tenant("A") and continues from the saved state.
5. No connection from tenant A's pool is ever returned by tenant B's resolver, enforced by separate per-tenant ConnectionPool instances or a check= callback that asserts current_setting('search_path') matches the expected tenant schema.
Test cases (testcontainers, no mocks):
- Cross-tenant write isolation under concurrency.
- Resume across process restart (kill + restart backend mid-interrupt, confirm checkpoint survives).
- Tenant deletion: DROP SCHEMA tenant_x CASCADE cleanly removes all checkpoint state for that tenant.
- Pool exhaustion: confirm that hitting the per-tenant pool ceiling produces a graceful 503-equivalent rather than leaking onto another tenant's connection.
Risk mitigation:
- Before committing, grep the langgraph-checkpoint-postgres source for any path that opens a connection outside _cursor (e.g., during streaming or pipeline mode). If found, that path is a leak vector and forces us to Option B or C.
- Pin LangGraph and langgraph-checkpoint-postgres to exact versions; subscribe to release notes for any DDL changes in MIGRATIONS.
- Write the integration test in section "Suggested acceptance criteria" #2 first. That test is the gate: if it fails on Option A, fall back to Option B without sunk-cost hesitation.
5. Open questions
- Pipeline mode and second connections. _cursor has a pipeline=True branch that uses conn.pipeline(). Does this stay on the same physical connection? The code suggests yes (it uses self.conn), but I have not traced every code path. If pipeline mode ever opens a second connection from a pool, our search_path guarantee breaks. Verify with a focused read of the psycopg pipeline docs before locking Option A in.
- AsyncPostgresSaver and asyncpg. The async variant uses psycopg async connections, not asyncpg. Ratiba's existing standard is asyncpg. We need to confirm we're comfortable running both drivers in the same backend, or accept that the LangGraph state store uses psycopg while the rest of the app uses asyncpg. This is fine technically but a small operational wart.
- setup() idempotency across tenant schemas. If LangGraph ships a MIGRATIONS change in v0.x, do we have to re-run setup() against every tenant schema? Probably yes. Need to confirm the migration mechanism is per-schema and that we have an Alembic-style story for tenant-fanout migrations.
- langgraph dev UI compatibility. The dev UI assumes a single global checkpointer. If we want to use it for debugging, can it be pointed at one tenant schema at a time? Not blocking for production, but worth checking before relying on it as a dev tool.
- The forum's "data leakage" warning specifics. The feature-requester flagged that the search_path workaround "can cause data leakage" with pools. We should reproduce the exact failure mode they hit before declaring our wrapper safe; possibly the check= callback is not enough and we genuinely need separate pools per tenant.
Sources
- langgraph/libs/checkpoint-postgres/langgraph/checkpoint/postgres/__init__.py (main branch)
- Feature request: Configurable PostgreSQL schema for langgraph-checkpoint-postgres (parity with LangGraphJS) — LangChain Forum #3274
- Multi-tenant / per-user checkpoint querying with AsyncPostgresSaver — LangChain Forum #2604
- PostgresSaver | LangGraph.js API Reference (shows the JS schema parameter we want)
- langgraph-checkpoint-postgres on PyPI
- DOC: Postgres Schema for LangGraph Checkpointer — langchain-ai/docs #465
- LangGraph Best Practices — Swarnendu De (multi-tenant pattern guidance)