Testing on dev

Ratiba's test suite is structured around a 5-layer test pyramid designed for a system where conversation state is canonical booking state. Classical SQL-fixture testing cannot cover the booking flow; the pyramid reflects that reality. This page is the operator's guide for running and debugging all layers locally.

For the full testing strategy — why this pyramid, how the per-scenario fresh-tenant fixture works, the bilingual judge model, the PII masking floor, and the Phase 3 canary hook — see ADR-0004.

The 5-layer pyramid

Layer	What it tests	Testcontainers needed	Speed
1 — FSM unit	Deterministic state transitions, pure Python	No	Fast (<1s per test)
2 — Full-turn integration	LangGraph checkpointer + TenantScopedSaver	Postgres + Redis	Medium
3 — Transcript replay e2e	WhatsApp webhook → FSM → outbound message	Postgres + Redis + Keycloak	Slow
4 — Golden-conversation snapshots	Conversational quality, bilingual fluency	No (LLM calls, not containers)	Slow (LLM-cost dominated)
5 — Production replay	Regression from real customer transcripts	No (uses prod infra)	Phase 2 only

All layers below Layer 4 run with pytest. Layer 4 uses the deepeval runner via pytest markers. Layer 5 is a Phase 2 capability (not yet active).

Backend pytest

Why the pinned venv command

The project pins Python 3.13 (see requires-python = ">=3.13" in pyproject.toml). macOS pyenv shims resolve to whatever pyenv local reports, often 3.12, so a bare python or pytest silently uses the wrong interpreter and produces confusing errors: SyntaxError on 3.13-only syntax, or ModuleNotFoundError for packages that are installed only in the project venv.

Always invoke pytest through the explicit venv path:

# CWD must be backend/: the pytest config ([tool.pytest.ini_options] in pyproject.toml) is there
cd /Users/soft4u/Development/ratiba/backend

/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest

If you see python3.12 anywhere in the traceback, you are using the wrong interpreter.
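The interpreter check can also be scripted. The following is a minimal sketch, assuming only the >=3.13 floor stated above; the helper name is_supported is hypothetical and does not exist in the repository.

```python
import sys

# Minimum version taken from requires-python = ">=3.13" in pyproject.toml.
REQUIRED = (3, 13)


def is_supported(version_info=None) -> bool:
    """Return True when the given (or running) interpreter meets the floor."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) >= REQUIRED


if __name__ == "__main__":
    # sys.executable shows which binary is actually running, which is the
    # quickest way to spot a pyenv shim instead of backend/.venv/bin/python.
    status = "ok" if is_supported() else "WRONG INTERPRETER"
    print(f"{sys.executable} -> {sys.version.split()[0]} ({status})")
```

Run it with the pinned venv path and with a bare python to compare the two lines of output.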

Standard invocations

# Full suite — xdist -n4 (default from pyproject.toml addopts)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest

# Full suite, single process — use for the final gate before commit (see below)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

# Single test file — useful while writing a test
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/test_tenancy/test_onboarding.py -v

# Single test by name
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -k "test_onboard_creates_realm" -v

# With coverage report
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest --cov=app --cov-report=term-missing

# Run only Layer 1 (no testcontainers, fast)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -m "not integration" -n 0

pytest configuration (`pyproject.toml`)

The [tool.pytest.ini_options] section sets:

Option	Value	Reason
`asyncio_mode`	`"auto"`	No `@pytest.mark.asyncio` decorators required on individual tests
`testpaths`	`["tests"]`	Scoped to `backend/tests/`
`python_files`	`["test_*.py"]`	Standard pytest discovery
`addopts`	`"-n 4"`	xdist 4-worker default (see xdist section below)
`filterwarnings`	`"ignore::DeprecationWarning:pytest_freezegun"`	Suppresses known upstream noise

Coverage thresholds ([tool.coverage.report]): fail_under = 70 globally; safety-critical layers require 90% per the S4U PoC mode rules.

xdist parallelism — the `-n 4` cap

The default addopts = "-n 4" spawns 4 parallel workers, cutting the suite from ~18 minutes to ~6 minutes. Each worker boots its own Postgres 16, Redis 7, and Keycloak 24 testcontainer — that's the key design: containers are per-worker, not shared.

The cap is deliberately 4 and not auto (which on an M4 Mac would be ~10 workers). The reason: Keycloak's /health/ready probe times out when many containers start simultaneously. At 10 workers, 60+ container timeout failures surface — they are not test failures, they are container scheduling artefacts. The 4-worker cap avoids this entirely.

# Force serial (used for the clean-gate run before commit — see below)
-n 0

# 4 workers (default — fastest without container contention)
# (no flag needed; addopts sets it)

# Never use -n auto for this suite
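The arithmetic behind the cap is easy to check: each worker boots three containers, so the number of simultaneous container starts grows linearly with the worker count. A small sketch (the function name is hypothetical; the three container kinds are the ones listed on this page):

```python
# One of each per xdist worker, as described on this page.
CONTAINERS_PER_WORKER = ("postgres", "redis", "keycloak")


def containers_started(workers: int) -> int:
    """Number of containers booted for a run with the given -n value."""
    if workers < 0:
        raise ValueError("workers must be >= 0")
    # With -n 0 the tests run in the main process, which still needs one set.
    sessions = max(workers, 1)
    return sessions * len(CONTAINERS_PER_WORKER)


if __name__ == "__main__":
    for n in (0, 4, 10):
        print(f"-n {n}: {containers_started(n)} containers")
```

At the default of 4 workers that is 12 containers; at 10 workers it is 30, including 10 Keycloak instances competing for the same readiness window.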

Testcontainers — what spins up per worker

The conftest.py session fixtures spin up one container of each kind per xdist worker. Under pytest-xdist each worker is a separate pytest session, so a session-scoped fixture is instantiated once per worker rather than once globally; the worker_id fixture that xdist provides identifies which worker a fixture instance belongs to:

postgres_container: postgres:16-alpine. Session-scoped. Each test gets a function-scoped database_url that rewrites the dialect to postgresql+asyncpg://.
redis_container: redis:7-alpine. Session-scoped. Tests that need Redis use redis_settings_env.
keycloak_container: quay.io/keycloak/keycloak:24.0 in dev mode. Session-scoped, ~10s cold start. Bootstrapped via kcadm.sh inside the container to set sslRequired=NONE on the master realm, which is the only way to allow HTTP admin API calls from the pytest process (see _bootstrap_disable_ssl_required in conftest.py).

Container teardown is handled by the session fixture's try/finally blocks. Leftover containers from a crashed session are cleaned up by Docker.
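The dialect rewrite performed by database_url can be sketched as a pure function. This is an illustration only: the name to_asyncpg_url is hypothetical, and the real fixture in conftest.py may do the rewrite differently.

```python
def to_asyncpg_url(url: str) -> str:
    """Swap whatever Postgres driver the URL names for asyncpg.

    Example input: postgresql+psycopg2://user:pw@localhost:5432/db
    """
    scheme, sep, rest = url.partition("://")
    if not sep or not scheme.startswith("postgres"):
        raise ValueError(f"not a Postgres URL: {url!r}")
    # Everything after '://' (credentials, host, port, database) is kept.
    return f"postgresql+asyncpg://{rest}"


if __name__ == "__main__":
    print(to_asyncpg_url("postgresql+psycopg2://test:test@localhost:54321/test"))
```

Keeping the rewrite in one place means tests never depend on which sync driver the container helper happens to report.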

The `seeded_tenant` / `SeededTenant` fixture

seeded_tenant is the canonical E2E integration fixture, reused across M4 through M13 integration tests. It:

Drops and recreates public schema + any leftover tenant_* schemas from prior crashed runs
Runs alembic upgrade head with RATIBA_MIGRATION_SCOPE=public to migrate the shared schema
Onboards a real tenant via onboard_tenant() — this creates a Keycloak realm, runs per-tenant Alembic migrations, and calls PostgresSaver.setup() (the LangGraph checkpointer setup)
Writes WhatsApp credentials via the SQLAlchemy ORM so the EncryptedText TypeDecorator is exercised
Seeds 3 services (Deep Tissue Massage, Manicure, Pedicure) and 1 staff member with Mon-Fri 09:00–17:00 schedule
Yields a SeededTenant dataclass carrying tenant_id, slug, owner_phone, phone_number_id, access_token_plaintext, services, and staff_id
Cleans up Redis and pool connections in finally
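For orientation, the yielded object could look roughly like the dataclass below. The field names come from the list above; the field types, the frozen setting, and the shape of services (service name mapped to ID) are assumptions, so check the real definition before relying on them.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class SeededTenant:
    """Sketch of the fixture payload; types are assumed, not confirmed."""

    tenant_id: UUID
    slug: str
    owner_phone: str
    phone_number_id: str
    access_token_plaintext: str
    services: dict[str, UUID]  # assumed: service name -> service ID
    staff_id: UUID


# Placeholder values for illustration only.
example = SeededTenant(
    tenant_id=uuid4(),
    slug="demo-spa",
    owner_phone="+254700000000",
    phone_number_id="0000000000",
    access_token_plaintext="not-a-real-token",
    services={
        "Deep Tissue Massage": uuid4(),
        "Manicure": uuid4(),
        "Pedicure": uuid4(),
    },
    staff_id=uuid4(),
)
```

A test then reads what it needs from the object, for example example.services["Manicure"], instead of querying the database for seed data.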

Tests that need the eval-layer per-scenario isolation use the tenant_scoped_eval_environment fixture defined in backend/tests/eval/conftest.py instead. That fixture follows the same full-Alembic + DROP CASCADE pattern but scoped to a single YAML scenario ID with a UUID suffix (schema naming pattern: test_tenant_<scenario_id>_<run_id_8hex>).

Migration-invariant suite

Run the migration-invariant suite whenever you add a migration. The seeded_tenant fixture runs alembic upgrade head from scratch on a fresh container, which means a botched migration (broken SQL, a schema name typo, a :param::jsonb cast where CAST(:param AS jsonb) is needed) will surface immediately. Do not skip this just because unit tests are green: unit tests do not exercise Alembic.

# After adding a migration — always run with -n 0 so the fixture runs once cleanly
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -n 0 -v

Debugging a failure

Step 1 — run single-process first.

Two concurrent pytest processes on the same machine can cause testcontainer timeouts that masquerade as test failures (Docker socket contention, port exhaustion). If you see container-start timeouts in a parallel run that do not reproduce in isolation, the culprit is not the test.

# Reproduce in isolation
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/path/to/failing_test.py -n 0 -v -s

Step 2 — check the interpreter.

If the failure mentions SyntaxError on 3.13-only syntax, or ModuleNotFoundError for a package you know is installed, confirm you are using the pinned venv:

/Users/soft4u/Development/ratiba/backend/.venv/bin/python --version
# Must print Python 3.13.x

Step 3 — the LSP-vs-CLI pyright gotcha.

When the LSP (typescript-lsp / pyright-lsp in the editor) shows zero type errors but pyright on the CLI reports issues, the CLI is canonical. The LSP caches stale type information across edits. Always verify with:

cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pyright app/

Step 4 — the clean single-process gate.

Before marking a task "done", run the full suite serially in a clean terminal (no other pytest process running):

/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

This is the canonical gate. A green xdist run with a concurrent second pytest process running is not sufficient — container timeouts in contended mode produce false failures that hide real ones. The 6-minute xdist run is for day-to-day iteration; the ~18-minute serial run is the final gate.

Frontend Vitest

The frontend test suite uses Vitest + React Testing Library.

Running

# From the frontend directory
cd /Users/soft4u/Development/ratiba/frontend

# Run all tests
npm test

# Run in watch mode (interactive, for development)
npm run test:watch

# Run with coverage
npm run test:coverage

As of the M12 baseline: 81 tests passing.

Snapshot updates

When intentional UI changes break existing snapshots:

# Review the diff first — Vitest shows what changed
npm test

# Accept all changes (Vitest's flag is --update, short form -u)
npm test -- --update

# Accept changes in a specific file
npm test -- --update src/components/booking/BookingCard.test.tsx

Do not blindly accept snapshot updates. Review each diff to confirm the change is the intended UI change, not a regression. Snapshot diffs are a manual review artefact per ADR-0004 §1.

Type-checking frontend code

The frontend uses TypeScript strict mode + shadcn/ui. To type-check without running tests:

cd /Users/soft4u/Development/ratiba/frontend
npm run type-check
# Equivalent: npx tsc --noEmit

Zero type errors are required before commit — same discipline as the backend pyright gate.

DeepEval calibration

What it is

DeepEval is the primary eval runner for Ratiba's Layer 4 golden-conversation tests and Layer 5 production-replay regression tests. It runs LLM-as-judge scoring against YAML golden-conversation scenarios, using language-specific fluency metrics:

SwahiliFluencyMetric — idiomatic Swahili, code-switch handling, cultural appropriateness for Kenyan SMB context
EnglishFluencyMetric — natural Kenyan-English register (warmer than corporate; not translated American English)
MpesaPaymentSafetyMetric — language-agnostic; validates payment flow safety invariants
BookingSlotConsistencyMetric — language-agnostic; validates slot reservation consistency

Scenarios live at backend/tests/eval/conversations/scenarios/. Each scenario YAML carries a language: field that gates which fluency metric runs.
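The language gate can be pictured as a small lookup. This is a sketch under assumptions: the language codes "sw" and "en" are guesses at the YAML values, the function name is hypothetical, and running both language-agnostic metrics on every scenario is assumed rather than confirmed.

```python
# Metric class names as listed above; the language codes are assumed.
FLUENCY_BY_LANGUAGE = {
    "sw": "SwahiliFluencyMetric",
    "en": "EnglishFluencyMetric",
}
LANGUAGE_AGNOSTIC = ("MpesaPaymentSafetyMetric", "BookingSlotConsistencyMetric")


def metrics_for(language: str) -> list[str]:
    """Return the metric names that would run for a scenario's language."""
    try:
        fluency = FLUENCY_BY_LANGUAGE[language]
    except KeyError:
        raise ValueError(f"no fluency metric for language {language!r}") from None
    return [fluency, *LANGUAGE_AGNOSTIC]


if __name__ == "__main__":
    print(metrics_for("sw"))
```

Failing loudly on an unknown language is a deliberate choice in this sketch: a scenario that silently skips fluency scoring would look green without being judged.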

Running the eval suite

cd /Users/soft4u/Development/ratiba/backend

# Full eval suite (LLM-judge calls — expect several minutes + LLM cost)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -v

# Force a cold-cache run (clear cached judge results first)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --deepeval-cache-clear -v

# Run without the DeepEval cache (re-runs every judge call — for debugging)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --no-deepeval-cache -v

# Run a single scenario
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -k "spa_booking_swahili_happy" -v

Reading results

DeepEval outputs a per-scenario pass/fail table after the run. Look for:

Score — the judge's 0.0–1.0 rating for each metric
Threshold — the minimum score set in the scenario YAML (e.g., threshold: 0.80 for SwahiliFluencyMetric)
Reason — the judge's natural-language explanation (load-bearing for calibration review)

A failing scenario prints the full judge explanation. Read it before deciding whether it's a regression or a prompt-rubric calibration issue.

The 4-tuple cache key

DeepEval caches LLM-judge calls to avoid re-running expensive LLM calls between runs. The cache key is a SHA256 of a 4-tuple:

Component	Example	Invalidation trigger
`scenario_id`	`"spa_booking_swahili_happy"`	Different scenario = different conversation
`prompt_version`	`"booking_orchestrator@a1b2c3d"`	Application prompt changed → different model output
`judge_model_version`	`"claude-opus-4.7"`	A judge model upgrade may rate the same output differently
`metric_version`	`"swahili_fluency:1.0.0"`	Rubric change → stale cache entry

The metric_version comes from the __version__ constant in each custom metric module (e.g., backend/tests/eval/metrics/swahili_fluency.py). Bump __version__ whenever you change a rubric — failing to do so leaves stale cached scores that hide the real impact of your rubric change. The PR template includes a checklist item for this.
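To make the invalidation behaviour concrete, here is a minimal sketch of a SHA256 key over the 4-tuple. Serialising the tuple as a compact JSON array is an assumption, and judge_cache_key is a hypothetical name; the real key derivation may differ.

```python
import hashlib
import json


def judge_cache_key(
    scenario_id: str,
    prompt_version: str,
    judge_model_version: str,
    metric_version: str,
) -> str:
    """Hash the four components into one hex digest."""
    # JSON keeps the component boundaries unambiguous, so ("ab", "c") and
    # ("a", "bc") can never collide the way plain concatenation could.
    payload = json.dumps(
        [scenario_id, prompt_version, judge_model_version, metric_version],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


before = judge_cache_key(
    "spa_booking_swahili_happy",
    "booking_orchestrator@a1b2c3d",
    "claude-opus-4.7",
    "swahili_fluency:1.0.0",
)
after = judge_cache_key(
    "spa_booking_swahili_happy",
    "booking_orchestrator@a1b2c3d",
    "claude-opus-4.7",
    "swahili_fluency:1.0.1",  # rubric changed, __version__ bumped
)
```

The two digests differ, so the bumped rubric misses the cache and the judge is called again. Without the bump, the key is unchanged and the old score is served.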

Cache directory: .deepeval-cache/ at project root, gitignored.

Full cache-key rationale and the cache invalidation policy are in ADR-0004 §5.

Monthly calibration ritual

The calibration set at backend/tests/eval/calibration/human_labelled.yaml is the ground-truth reference that keeps the LLM-as-judge honest.

Property	Value
Owner	Adrian
Cadence	Monthly, first Monday of each month
Volume per session	~10 new PII-masked production transcripts
Rating shape	Each transcript scored 1–5 on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness
Kappa target	Cohen's kappa ≥ 0.7 (substantial agreement) against human ratings

Calibration run procedure:

# 1. Add the new PII-masked transcripts to human_labelled.yaml
# 2. Re-run the judge against the full calibration set
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/calibration/ --no-deepeval-cache -v

# 3. Compute Cohen's kappa between judge ratings and human ratings
#    (script at backend/scripts/compute_calibration_kappa.py)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python scripts/compute_calibration_kappa.py

# 4. If kappa < 0.7 — the judge model is not trustworthy; investigate
#    before merging any prompt PR that month

If kappa falls below 0.7, the standard intervention is rubric recalibration — review the scenarios where the judge and human ratings diverge most, identify the systematic disagreement, and update the rubric in the metric module (with __version__ bump).
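For reference, unweighted Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a plain-Python illustration, not the contents of compute_calibration_kappa.py; that script may use a weighted variant suited to ordinal 1–5 ratings. The sample ratings are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative ratings only (1-5 scale, ten transcripts).
human = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
judge = [5, 4, 3, 3, 2, 5, 1, 3, 4, 1]
kappa = cohens_kappa(human, judge)
meets_target = kappa >= 0.7
```

Here the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.2, giving a kappa of about 0.75, which clears the 0.7 target.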

Full calibration ownership, kappa interpretation, and the rationale for monthly vs weekly/quarterly cadence are in ADR-0004 §3.

Production-replay promotion (Phase 2)

In Phase 2, negative-feedback conversations from Langfuse are promoted to regression scenarios via a weekly job at backend/app/eval/from_production/promote_to_scenario.py. The job applies the PII masking floor from backend/app/observability/redact.py before writing any scenario to disk.

Tenants in sensitive verticals (dental, physio, medical, legal) are blocked from production-replay promotion until the future privacy ADR fills in the vertical-specific masking rules. The blocker surfaces as a hard NotImplementedError, and it is intentional.
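The blocker amounts to a guard that runs before any transcript is written. A minimal sketch, assuming the vertical is available as a plain string; the names BLOCKED_VERTICALS and ensure_promotable are hypothetical and the real check in promote_to_scenario.py may look different.

```python
# Verticals listed above as blocked until the privacy ADR lands.
BLOCKED_VERTICALS = frozenset({"dental", "physio", "medical", "legal"})


def ensure_promotable(vertical: str) -> None:
    """Raise if transcripts from this vertical must not be promoted."""
    if vertical.lower() in BLOCKED_VERTICALS:
        raise NotImplementedError(
            f"production-replay promotion is blocked for {vertical!r} tenants "
            "until vertical-specific masking rules exist"
        )


if __name__ == "__main__":
    ensure_promotable("spa")  # passes silently
    try:
        ensure_promotable("dental")
    except NotImplementedError as exc:
        print(f"blocked: {exc}")
```

Raising instead of skipping means a misconfigured job fails visibly rather than quietly promoting nothing.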

CI gates

Ratiba uses a 3-layer quality defence that enforces zero-error commits at every step. The Methodology page describes the full lifecycle; this section covers what blocks at each layer.

Layer 1 — Post-edit linting (editor hook)

Runs after every file save on touched files:

ruff — zero pyflakes (F) errors required. Style categories (E, W, I, N, UP, B, RUF) are surfaced as warnings.
pyright — zero type errors on touched files. CLI is canonical when it disagrees with the LSP.
tsc — zero TypeScript errors on changed frontend files.

These do not block a commit directly but surface issues before they compound.

Layer 2 — Pre-push hook (git hook)

Blocks git push until these pass:

ruff check app/           # zero F-category errors
pyright app/              # zero type errors
pytest tests/ -n 4        # full suite green
npm run type-check        # frontend zero type errors
npm test                  # 81 frontend Vitest tests green

The pre-push hook is the main regression guard. A push that would break CI is caught here.

Layer 3 — Stop verification (mandatory before "done")

Before any task is declared complete, the canonical verification sequence runs:

# Backend — clean single-process full suite (see debugging note above)
cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

# Frontend — all tests + type-check
cd /Users/soft4u/Development/ratiba/frontend
npm test && npm run type-check

"Tests pass" without proof is insufficient per the S4U PoC mode rules. The actual terminal output (pytest result line + coverage %) is required evidence.

Eval gate matrix

The eval gate matrix controls when Layer 4 (DeepEval) runs on CI:

PR type	Eval gate	Blocking?
Prompt or FSM change	Full eval suite	Blocking — PR cannot merge until eval passes
Non-prompt code change	Smoke eval (subset)	Advisory only
Renovate major version bump	Full eval suite	Blocking
Documentation / config change	No eval	N/A
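The matrix above is a plain lookup from PR type to gate. The sketch below encodes it as data; the enum members and the gate_for helper are hypothetical names, and how CI classifies a PR into one of these types is not covered here.

```python
from enum import Enum
from typing import NamedTuple, Optional


class PrType(Enum):
    PROMPT_OR_FSM = "prompt_or_fsm"
    NON_PROMPT_CODE = "non_prompt_code"
    RENOVATE_MAJOR = "renovate_major"
    DOCS_OR_CONFIG = "docs_or_config"


class EvalGate(NamedTuple):
    suite: Optional[str]  # "full", "smoke", or None when no eval runs
    blocking: bool


# One row per line of the gate matrix.
GATE_MATRIX = {
    PrType.PROMPT_OR_FSM: EvalGate("full", True),
    PrType.NON_PROMPT_CODE: EvalGate("smoke", False),
    PrType.RENOVATE_MAJOR: EvalGate("full", True),
    PrType.DOCS_OR_CONFIG: EvalGate(None, False),
}


def gate_for(pr_type: PrType) -> EvalGate:
    """Look up which eval suite runs and whether it blocks the merge."""
    return GATE_MATRIX[pr_type]
```

For example, gate_for(PrType.NON_PROMPT_CODE) yields the smoke suite with blocking set to False, matching the "Advisory only" row.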

The CI workflow for the blocking gate lives at .github/workflows/eval-on-prompt-change.yml. The cost ceiling per PR is RATIBA_EVAL_BUDGET_USD=20 (~$300/month at PoC scale); the eval suite auto-downsamples from 10% cross-check to 0% on budget breach.

See ADR-0004 §1 for the gate matrix rationale and the Q5 cost-ceiling anchor.

Quality thresholds

Metric	Threshold	Gate
Backend coverage	≥ 70% overall, ≥ 90% safety-critical	`fail_under = 70` in `pyproject.toml`
Frontend tests	All 81 passing	Pre-push
Pyright errors	0	Pre-push
Ruff F-category	0	Pre-push
DeepEval kappa	≥ 0.7 (calibration)	Monthly ritual
DeepEval scenario score	Per-scenario threshold (e.g. 0.80 for `SwahiliFluencyMetric`)	Eval gate on prompt PRs

Cross-links

ADR-0004 — Testing strategy under conversation-as-state — the authoritative source for the 5-layer pyramid, per-scenario fresh-tenant fixture, DeepEval 4-tuple cache key, bilingual judge mode, and PII masking floor
Methodology — S4U quality gates, wave structure, and PoC mode rules
Local dev setup — getting the venv and testcontainers bootstrapped before running tests
Incidents — if tests are failing in ways that suggest infra rather than code issues (container port exhaustion, Keycloak realm leak, Redis key pollution)
