Capacity & scaling
Current reality
Ratiba M13 is a single-VPS Docker Compose stack serving a 5-10 trusted-customer beta cohort. There has been no load test. The numbers below are engineering estimates derived from the architecture decisions — not measurements from production traffic.
Where load concentrates
The yellow nodes are the first things that would bottleneck under concurrent load. The pink node is the primary cost surface.
Known concentration points
1. Redis SETNX booking mutex (per ADR-0003 D3)
Each active booking thread holds a Redis lock (ratiba:fsm:lock:<thread_id>, 30s TTL
with exponential-backoff retries to a 10s ceiling). This is per-thread, not per-tenant —
two concurrent bookings for the same tenant each hold their own lock. In M13 with <10
active tenants and light traffic, lock contention is not a concern. At higher concurrency
the retry backoff adds latency; the ceiling keeps it bounded.
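
As a rough illustration of how the retry ceiling bounds added latency, the sketch below builds a backoff schedule in integer milliseconds. The key format, 30s TTL, and 10s ceiling come from the text above; the 100 ms base delay, the doubling factor, and the reading of the ceiling as a cap on total wait time are assumptions, and the function names are hypothetical.

```python
LOCK_TTL_MS = 30_000        # from the text: 30s TTL on the lock key
RETRY_CEILING_MS = 10_000   # from the text: retries stop at a 10s ceiling


def lock_key(thread_id: str) -> str:
    """Per-thread lock key, following the ratiba:fsm:lock:<thread_id> pattern."""
    return f"ratiba:fsm:lock:{thread_id}"


def backoff_schedule(base_ms: int = 100, factor: int = 2,
                     ceiling_ms: int = RETRY_CEILING_MS) -> list[int]:
    """Return retry delays that grow exponentially and never sum past ceiling_ms.

    base_ms and factor are illustrative assumptions, not values from ADR-0003.
    """
    delays = []
    waited = 0
    delay = base_ms
    while waited < ceiling_ms:
        # Trim the last delay so the total wait lands exactly on the ceiling.
        step = min(delay, ceiling_ms - waited)
        delays.append(step)
        waited += step
        delay *= factor
    return delays


schedule = backoff_schedule()
# Each retry would attempt something like: SET <lock_key> <token> NX PX 30000
```

Under these assumptions a contended caller makes at most seven retries and waits no longer than 10 seconds in total before giving up.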
2. Shared asyncpg pool (per ADR-0002 D4)
One process-wide pool of DB_POOL_SIZE=20 connections (env-tunable). Every booking FSM
turn that touches the database borrows from this pool. The pool is shared across all
tenants; a spike on one tenant competes with all others. At 5-10 beta tenants with
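infrequent concurrent bookings, 20 connections is generous. The practical limit before
contention becomes visible is roughly DB_POOL_SIZE ÷ avg_turn_duration (the average time
a turn holds a connection, in seconds), which gives sustainable turns per second. This
has not been measured yet.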
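
A back-of-envelope version of that limit follows from Little's law: sustained throughput is the pool size divided by the average time each turn holds a connection. The sketch below is only arithmetic; the 50 ms hold time is a made-up placeholder, since no real figure has been measured, and the function name is hypothetical.

```python
def max_turns_per_second(pool_size: int, avg_hold_ms: float) -> float:
    """Upper bound on FSM turns/s before callers start queueing for a connection."""
    if avg_hold_ms <= 0:
        raise ValueError("avg_hold_ms must be positive")
    # pool_size connections, each freed every avg_hold_ms milliseconds.
    return pool_size * 1000 / avg_hold_ms


# DB_POOL_SIZE=20 is the documented default; 50 ms is a placeholder, not a measurement.
estimate = max_turns_per_second(pool_size=20, avg_hold_ms=50)
```

Doubling the hold time halves the estimate, which is why long-running queries inside a turn matter more than the raw pool size.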
3. Per-tenant psycopg micro-pools (per ADR-0002 D4)
The LangGraph TenantScopedSaver uses a dedicated 1-2 connection pool per tenant, created
lazily and closed after 30 min of inactivity. At M13's beta scale this is fine. The
connection math becomes relevant at ~100 active tenants (2 idle connections each = 200
open Postgres connections against a tuned-VPS limit of 500-1000). This is documented
explicitly in ADR-0002 as the trigger for adding PgBouncer in transaction-pooling mode.
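
The connection math above can be sketched as a small calculation. It assumes a single backend process, every tenant pool at its 2-connection maximum, and the lower end (500) of the quoted Postgres limit; the function names are hypothetical.

```python
def postgres_connections(active_tenants: int, per_tenant_pool: int = 2,
                         shared_pool: int = 20) -> int:
    """Worst-case open connections: all micro-pools full plus the shared asyncpg pool."""
    return active_tenants * per_tenant_pool + shared_pool


def headroom(active_tenants: int, max_connections: int = 500) -> int:
    """Connections left before hitting the assumed Postgres limit."""
    return max_connections - postgres_connections(active_tenants)


at_trigger = postgres_connections(100)   # 200 micro-pool connections + 20 shared
remaining = headroom(100)
```

Each additional backend replica would add its own shared pool and its own set of micro-pools, so the headroom shrinks faster once the backend is scaled horizontally.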
4. The public.* tenant registry (per ADR-0002 D1)
Three tables in the shared public schema (tenants, tenant_admins,
payment_routing) are read on every inbound message at the channel boundary. At M13's
tenant count these are trivially small. PostgreSQL's schema catalog adds overhead per
schema — this becomes relevant around ~10,000 tenants per instance, a scale we are not
designing for now.
5. Per-booking LLM cost ceiling (per ADR-0005 D4)
Every booking consumes LLM tokens at $0.05 soft / $0.20 hard per-booking defaults
(per-tenant configurable). The cost ceiling is the primary capacity-management lever for
M13: it bounds runaway spend on any single booking thread and escalates to the admin
rather than burning uncapped tokens. This is a financial safety rail, not a performance
one — it does not prevent high-frequency bookings from accumulating cost at scale.
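
A minimal sketch of the two-threshold check, using the documented $0.05 / $0.20 defaults. The enum, the function name, and the choice of `>=` at each boundary are assumptions; what the system actually does at each threshold is defined in ADR-0005, not here.

```python
from decimal import Decimal
from enum import Enum


class CeilingState(Enum):
    BELOW_SOFT = "below_soft"
    SOFT_EXCEEDED = "soft_exceeded"
    HARD_EXCEEDED = "hard_exceeded"


def ceiling_state(cost_usd: Decimal,
                  soft_usd: Decimal = Decimal("0.05"),
                  hard_usd: Decimal = Decimal("0.20")) -> CeilingState:
    """Classify a booking's accumulated LLM cost against its soft and hard ceilings."""
    # Check the hard ceiling first so it always wins over the soft one.
    if cost_usd >= hard_usd:
        return CeilingState.HARD_EXCEEDED
    if cost_usd >= soft_usd:
        return CeilingState.SOFT_EXCEEDED
    return CeilingState.BELOW_SOFT


state = ceiling_state(Decimal("0.07"))
```

Decimal is used so that threshold comparisons on small dollar amounts are exact rather than subject to float rounding.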
6. Nightly reaper (per ADR-0007 D5)
A single cron job at 3 AM EAT runs public.payment_routing expiry, per-tenant
checkpoints_archive moves, and handoff_log_archive moves. In M13 this runs as a
Docker worker container. It has not been tested under load. If a run fails or cannot
finish at scale, public.payment_routing and the checkpoint tables keep growing until a
later run catches up.
Practical limits (M13 estimates, unverified)
| Resource | M13 default | First constraint appears when… |
|---|---|---|
| DB connections (asyncpg) | 20 shared | > 20 concurrent FSM turns holding a DB connection |
| DB connections (psycopg micro-pools) | 1-2 per tenant | ~100 active tenants (approach VPS connection limit) |
| Redis memory | single instance, no cap configured | Large checkpoint blobs accumulate; check redis-cli INFO memory |
| LLM cost per booking | $0.05 soft / $0.20 hard | Any individual booking that loops the FSM excessively |
| Tenant registry tables | trivially small | ~10,000 tenants / Postgres instance (catalog overhead) |
We have not load-tested any of these. The table represents design-time estimates from the ADR authors. Real numbers from the M13 beta cohort will calibrate these before M14.
Target posture
The scaling levers below are aspirational — post-pilot targets, not M13 commitments. None are in scope until the beta demonstrates sustained load that warrants them.
Database
- Read replica for reporting and analytics queries (/admin/analytics, Langfuse trace queries, catalog listing). The write path stays on the primary.
- PgBouncer in transaction-pooling mode in front of Postgres once active-tenant count exceeds ~100. ADR-0002 documents this as the known trigger point for the per-tenant micro-pool connection math.
- Per-tenant Alembic invocations remain independent of this: scripts/migrate-all-tenants.sh already handles bulk upgrades per ADR-0002 D3.
Redis
- Redis Sentinel (primary + 2 replicas) for HA once the booking mutex and FSM checkpoints are proven load-bearing at scale. Sentinel is simpler than Cluster for Ratiba's access patterns (key-per-thread, no hash-slot distribution needed).
- Per-tenant key prefix scheme (already in ADR-0003) survives unchanged under Sentinel.
Backend
- Horizontal backend scaling behind a load balancer (nginx or Cloudflare Load Balancer). The current_tenant contextvar and current_booking_cost contextvar (ADR-0005 D4) are both set per-request at the channel boundary; there is no process-level tenant state, so horizontal scaling is architecturally clean.
- Worker horizontal scaling: the ARQ worker is a separate container running the same image with an overridden CMD. Additional worker replicas can be added to the compose stack without code changes.
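
To illustrate why per-request context variables leave no process-level tenant state, the sketch below runs two concurrent handlers that each set their own tenant. Only the current_tenant name comes from the text above; the handler and tenant IDs are hypothetical stand-ins for the real channel boundary.

```python
import asyncio
import contextvars

# Same name as the variable described above; everything else is illustrative.
current_tenant: contextvars.ContextVar[str] = contextvars.ContextVar("current_tenant")


async def handle_message(tenant_id: str) -> str:
    """Hypothetical handler: set the tenant at the boundary, read it back later."""
    current_tenant.set(tenant_id)
    # Yield to the event loop so the other handler runs in between.
    await asyncio.sleep(0)
    return current_tenant.get()


async def main() -> list[str]:
    # gather() wraps each coroutine in its own Task, and each Task gets its own
    # copy of the context, so the two set() calls do not interfere.
    return await asyncio.gather(
        handle_message("tenant_a"),
        handle_message("tenant_b"),
    )


results = asyncio.run(main())
```

Each handler reads back the tenant it set even though the two interleave, which is the property that lets additional backend replicas be added without sharing tenant state between them.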
LLM cost
- Per-tenant cost_ceiling_soft_usd / cost_ceiling_hard_usd columns on public.tenants (ADR-0005 D4) are tunable via UPDATE public.tenants. Calibrate from observed booking cost distribution after the first 100 production bookings per tenant.
- LLM provider failover: the LLMRouter (ADR-0005 D5) is a YAML-config swap; adding a cheaper provider or a Tier-2 fallback is a role_assignments.yaml change, not a code change.
Voice / LiveKit
- LiveKit moves to a dedicated server or LiveKit Cloud once voice traffic justifies it. In M13, LiveKit runs in the same compose stack on the same VPS host. The Q7 lock defers multi-tenant LiveKit production deployment to M14.
Monitoring graduation
- Lean observability (docker logs + daily WhatsApp digest) is right for M13. See the Observability runbook for the current posture and the post-pilot instrumentation targets (Prometheus + Grafana + alerting).
Related
- Observability runbook — what you can see today and how to read it
- ADR-0002 — two-pool model, micro-pool connection math
- ADR-0005 — per-booking cost ceiling and LLMRouter
- ADR-0007 — daily consolidated reaper