Capacity & scaling

Current reality

Ratiba M13 is a single-VPS Docker Compose stack serving a 5-10 trusted-customer beta cohort. There has been no load test. The numbers below are engineering estimates derived from the architecture decisions — not measurements from production traffic.

Where load concentrates

In the diagram, the yellow nodes mark the components that would bottleneck first under concurrent load. The pink node marks the primary cost surface.

Known concentration points

1. Redis SETNX booking mutex (per ADR-0003 D3)

Each active booking thread holds a Redis lock (ratiba:fsm:lock:<thread_id>, 30s TTL with exponential-backoff retries to a 10s ceiling). This is per-thread, not per-tenant — two concurrent bookings for the same tenant each hold their own lock. In M13 with <10 active tenants and light traffic, lock contention is not a concern. At higher concurrency the retry backoff adds latency; the ceiling keeps it bounded.
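As a rough illustration of the acquire-and-retry behaviour described above, the sketch below models the lock key, the 30s TTL, and a backoff schedule whose total wait is capped at 10s. The 100 ms base delay, the doubling factor, the reading of the 10s ceiling as a cap on cumulative wait, and the `try_set_nx` callable are assumptions for illustration, not taken from ADR-0003. In a real deployment that callable would wrap something like redis-py's `set(key, token, nx=True, ex=30)`.

```python
from typing import Callable, List

LOCK_TTL_S = 30            # lock TTL, from the text above
RETRY_CEILING_MS = 10_000  # retry ceiling, from the text above


def lock_key(thread_id: str) -> str:
    """Per-thread (not per-tenant) lock key."""
    return f"ratiba:fsm:lock:{thread_id}"


def backoff_schedule_ms(base_ms: int = 100, factor: int = 2,
                        ceiling_ms: int = RETRY_CEILING_MS) -> List[int]:
    """Exponentially growing waits whose sum never exceeds ceiling_ms.

    base_ms and factor are hypothetical values chosen for the example.
    """
    delays: List[int] = []
    waited = 0
    delay = base_ms
    while waited < ceiling_ms:
        step = min(delay, ceiling_ms - waited)  # trim the last wait to fit
        delays.append(step)
        waited += step
        delay *= factor
    return delays


def acquire(try_set_nx: Callable[[str, int], bool], thread_id: str,
            sleep: Callable[[float], None]) -> bool:
    """Try once, then retry after each backoff wait; False if the budget runs out.

    try_set_nx(key, ttl_s) is a stand-in for a Redis SET ... NX EX call.
    """
    key = lock_key(thread_id)
    if try_set_nx(key, LOCK_TTL_S):
        return True
    for delay_ms in backoff_schedule_ms():
        sleep(delay_ms / 1000)
        if try_set_nx(key, LOCK_TTL_S):
            return True
    return False
```

With these assumed parameters the waits are 100, 200, 400, 800, 1600, 3200 and 3700 ms, so a caller that never gets the lock gives up after 10s in total.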

2. Shared asyncpg pool (per ADR-0002 D4)

One process-wide pool of DB_POOL_SIZE=20 connections (env-tunable). Every booking FSM turn that touches the database borrows from this pool. The pool is shared across all tenants; a spike on one tenant competes with all others. At 5-10 beta tenants with infrequent concurrent bookings, 20 connections is generous. By Little's law, contention becomes visible once throughput approaches roughly DB_POOL_SIZE / avg_turn_duration turns per second, where avg_turn_duration is the time in seconds a turn holds its connection. That duration has not been measured yet.
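The throughput bound follows from Little's law: a pool of N connections, each held for an average of T seconds, sustains at most N / T turns per second before callers start queueing. The sketch below works one case; the 0.25s hold time is a hypothetical placeholder, not a measurement, and `max_turn_rate` is a name invented for this example.

```python
def max_turn_rate(pool_size: int, avg_hold_s: float) -> float:
    """Upper bound on FSM turns per second before callers queue for a connection."""
    if avg_hold_s <= 0:
        raise ValueError("avg_hold_s must be positive")
    return pool_size / avg_hold_s


DB_POOL_SIZE = 20      # M13 default, from the text above
ASSUMED_HOLD_S = 0.25  # hypothetical average connection hold time per turn

print(max_turn_rate(DB_POOL_SIZE, ASSUMED_HOLD_S))  # prints 80.0
```

Halving the hold time doubles the bound, which is why measuring the real per-turn hold time matters more than raising DB_POOL_SIZE.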

3. Per-tenant psycopg micro-pools (per ADR-0002 D4)

The LangGraph TenantScopedSaver uses a dedicated 1-2 connection pool per tenant, created lazily and closed after 30 min of inactivity. At M13's beta scale this is fine. The connection math becomes relevant at ~100 active tenants (2 idle connections each = 200 open Postgres connections against a tuned-VPS limit of 500-1000). This is documented explicitly in ADR-0002 as the trigger for adding PgBouncer in transaction-pooling mode.
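The connection arithmetic above can be written out as a small calculation. This is a sketch under stated assumptions: the `ConnectionBudget` class is hypothetical, it takes the upper end of the 1-2 micro-pool size and the lower end of the 500-1000 limit, and it adds the 20-connection shared asyncpg pool to the total, which the paragraph above does not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionBudget:
    active_tenants: int
    per_tenant_max: int = 2    # upper end of the 1-2 connection micro-pool
    shared_pool: int = 20      # DB_POOL_SIZE default for the asyncpg pool
    postgres_limit: int = 500  # lower end of the 500-1000 tuned-VPS range

    @property
    def micro_pool_connections(self) -> int:
        """Worst case: every active tenant holds a full micro-pool."""
        return self.active_tenants * self.per_tenant_max

    @property
    def total(self) -> int:
        """Micro-pools plus the shared pool (the sum is this sketch's assumption)."""
        return self.micro_pool_connections + self.shared_pool

    @property
    def utilisation(self) -> float:
        """Fraction of the Postgres connection limit in use."""
        return self.total / self.postgres_limit


beta = ConnectionBudget(active_tenants=10)
trigger = ConnectionBudget(active_tenants=100)
print(beta.total, trigger.micro_pool_connections, trigger.total)  # prints 40 200 220
```

At 100 active tenants the worst case is 220 connections, 44% of a 500-connection limit, which leaves headroom but is close enough to plan for PgBouncer.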

4. The public.* tenant registry (per ADR-0002 D1)

Three tables in the shared public schema (tenants, tenant_admins, payment_routing) are read on every inbound message at the channel boundary. At M13's tenant count these are trivially small. PostgreSQL's system catalogs add overhead per schema; this becomes relevant at roughly 10,000 tenants per instance, a scale we are not designing for now.

5. Per-booking LLM cost ceiling (per ADR-0005 D4)

Every booking consumes LLM tokens at $0.05 soft / $0.20 hard per-booking defaults (per-tenant configurable). The cost ceiling is the primary capacity-management lever for M13: it bounds runaway spend on any single booking thread and escalates to the admin rather than burning uncapped tokens. This is a financial safety rail, not a performance one — it does not prevent high-frequency bookings from accumulating cost at scale.
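A minimal sketch of the two-threshold check, assuming the defaults quoted above. The `CeilingState` enum and `check_ceiling` function are hypothetical names, the use of `>=` at each threshold is an assumption, and what the soft state triggers is not specified here.

```python
from decimal import Decimal
from enum import Enum


class CeilingState(Enum):
    OK = "ok"
    SOFT = "soft"  # soft ceiling crossed; the action taken is not specified here
    HARD = "hard"  # hard ceiling crossed; the text says this escalates to the admin


def check_ceiling(spent_usd: Decimal,
                  soft_usd: Decimal = Decimal("0.05"),
                  hard_usd: Decimal = Decimal("0.20")) -> CeilingState:
    """Classify one booking's accumulated LLM spend against its ceilings."""
    if spent_usd >= hard_usd:
        return CeilingState.HARD
    if spent_usd >= soft_usd:
        return CeilingState.SOFT
    return CeilingState.OK


# The ceiling is per booking, so many cheap bookings never trip it.
per_booking = Decimal("0.04")
bookings = 1000
total_spend = per_booking * bookings
print(check_ceiling(per_booking), total_spend)  # prints CeilingState.OK 40.00
```

The last lines show the limitation noted above: a thousand bookings at $0.04 each stay under the soft ceiling individually while still adding up to $40.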

6. Nightly reaper (per ADR-0007 D5)

A single cron job at 3 AM EAT runs public.payment_routing expiry, per-tenant checkpoints_archive moves, and handoff_log_archive moves. In M13 this runs as a Docker worker container. It has not been tested under load; if a run fails or falls behind at scale, public.payment_routing and the checkpoint tables keep growing until a later run catches up.

Practical limits (M13 estimates, unverified)

| Resource | M13 default | First constraint appears when… |
| --- | --- | --- |
| DB connections (asyncpg) | 20 shared | `> 20` concurrent FSM turns holding a DB connection |
| DB connections (psycopg micro-pools) | 1-2 per tenant | `~100` active tenants (approach VPS connection limit) |
| Redis memory | single instance, no cap configured | Large checkpoint blobs accumulate; check `redis-cli INFO memory` |
| LLM cost per booking | `$0.05` soft / `$0.20` hard | Any individual booking that loops the FSM excessively |
| Tenant registry tables | trivially small | `~10,000` tenants / Postgres instance (catalog overhead) |

We have not load-tested any of these. The table represents design-time estimates from the ADR authors. Real numbers from the M13 beta cohort will calibrate these before M14.

Target posture

The scaling levers below are aspirational — post-pilot targets, not M13 commitments. None are in scope until the beta demonstrates sustained load that warrants them.

Database

Read replica for reporting and analytics queries (/admin/analytics, Langfuse trace queries, catalog listing). The write path stays on the primary.
PgBouncer in transaction-pooling mode in front of Postgres once active-tenant count exceeds ~100. ADR-0002 documents this as the known trigger point for the per-tenant micro-pool connection math.
Per-tenant Alembic invocations remain independent of this — scripts/migrate-all-tenants.sh already handles bulk upgrades per ADR-0002 D3.

Redis

Redis Sentinel (primary + 2 replicas) for HA once the booking mutex and FSM checkpoints are proven load-bearing at scale. Sentinel is simpler than Cluster for Ratiba's access patterns (key-per-thread, no hash-slot distribution needed).
Per-tenant key prefix scheme (already in ADR-0003) survives unchanged under Sentinel.

Backend

Horizontal backend scaling behind a load balancer (nginx or Cloudflare Load Balancer). The current_tenant contextvar and current_booking_cost contextvar (ADR-0005 D4) are both set per-request at the channel boundary — there is no process-level tenant state, so horizontal scaling is architecturally clean.
Worker horizontal scaling: the ARQ worker is a separate container running the same image with an overridden CMD. Additional worker replicas can be added to the compose stack without code changes.
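The claim that per-request context variables carry no process-level tenant state can be illustrated with the standard-library `contextvars` module. This is a minimal sketch, not Ratiba's code: `handle_inbound` and the tenant names are hypothetical, and only the variable name `current_tenant` comes from the text above.

```python
import asyncio
from contextvars import ContextVar
from typing import List

current_tenant: ContextVar[str] = ContextVar("current_tenant")


async def handle_inbound(tenant_id: str) -> str:
    """Set the tenant at the boundary, then read it back after yielding."""
    current_tenant.set(tenant_id)
    await asyncio.sleep(0)  # yield so the three requests interleave
    return current_tenant.get()


async def main() -> List[str]:
    # gather() wraps each coroutine in its own task, and each task
    # runs in a copy of the current context.
    return await asyncio.gather(
        *(handle_inbound(t) for t in ("acme", "baraka", "chui"))
    )


results = asyncio.run(main())
print(results)  # prints ['acme', 'baraka', 'chui']
```

Each task sees only the value it set, even though the requests interleave in one process, so adding more processes behind a load balancer does not change the isolation model.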

LLM cost

Per-tenant cost_ceiling_soft_usd / cost_ceiling_hard_usd columns on public.tenants (ADR-0005 D4) are tunable via UPDATE public.tenants. Calibrate from observed booking cost distribution after the first 100 production bookings per tenant.
LLM provider failover: the LLMRouter (ADR-0005 D5) is a YAML-config swap; adding a cheaper provider or a Tier-2 fallback is a role_assignments.yaml change, not a code change.
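One possible way to carry out the calibration step above is to derive candidate ceilings from percentiles of observed per-booking cost. The sketch below is an assumption-heavy illustration: the choice of the 90th and 99th percentiles, the `suggest_ceilings` name, and the four-decimal rounding are all hypothetical, and only the 100-booking minimum comes from the text.

```python
import statistics
from typing import Sequence, Tuple


def suggest_ceilings(costs_usd: Sequence[float],
                     min_samples: int = 100) -> Tuple[float, float]:
    """Return (soft, hard) candidates from observed per-booking costs.

    Uses the 90th and 99th percentiles; both choices are assumptions.
    """
    if len(costs_usd) < min_samples:
        raise ValueError(f"need at least {min_samples} bookings to calibrate")
    # n=100 yields 99 cut points; index 89 is p90 and index 98 is p99.
    # The inclusive method never extrapolates beyond the observed range.
    cuts = statistics.quantiles(costs_usd, n=100, method="inclusive")
    return round(cuts[89], 4), round(cuts[98], 4)


# Synthetic data for the example: 100 bookings costing $0.001 to $0.100.
observed = [i / 1000 for i in range(1, 101)]
soft, hard = suggest_ceilings(observed)
print(soft, hard)
```

The resulting values would then be written to the cost_ceiling_soft_usd and cost_ceiling_hard_usd columns for that tenant, after a human has reviewed them.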

Voice / LiveKit

LiveKit moves to a dedicated server or LiveKit Cloud once voice traffic justifies it. In M13, LiveKit runs in the same compose stack on the same VPS host. The Q7 lock defers multi-tenant LiveKit production deployment to M14.

Monitoring graduation

Lean observability (docker logs + daily WhatsApp digest) is right for M13. See the Observability runbook for the current posture and the post-pilot instrumentation targets (Prometheus + Grafana + alerting).

Related

Observability runbook: what you can see today and how to read it
ADR-0002: two-pool model, micro-pool connection math
ADR-0005: per-booking cost ceiling and LLMRouter
ADR-0007: daily consolidated reaper
