Testing on dev
Ratiba's test suite is structured around a 5-layer test pyramid designed for a system where conversation state is canonical booking state. Classical SQL-fixture testing cannot cover the booking flow; the pyramid reflects that reality. This page is the operator's guide for running and debugging all layers locally.
For the full testing strategy — why this pyramid, how the per-scenario fresh-tenant fixture works, the bilingual judge model, the PII masking floor, and the Phase 3 canary hook — see ADR-0004.
The 5-layer pyramid
| Layer | What it tests | Testcontainers needed | Speed |
|---|---|---|---|
| 1 — FSM unit | Deterministic state transitions, pure Python | No | Fast (<1s per test) |
| 2 — Full-turn integration | LangGraph checkpointer + TenantScopedSaver | Postgres + Redis | Medium |
| 3 — Transcript replay e2e | WhatsApp webhook → FSM → outbound message | Postgres + Redis + Keycloak | Slow |
| 4 — Golden-conversation snapshots | Conversational quality, bilingual fluency | No (LLM calls, not containers) | Slow (LLM-cost dominated) |
| 5 — Production replay | Regression from real customer transcripts | No (uses prod infra) | Phase 2 only |
All layers below Layer 4 run with pytest. Layer 4 uses the deepeval runner via pytest markers. Layer 5 is a Phase 2 capability (not yet active).
Backend pytest
Why the pinned venv command
The project requires Python 3.13 or newer (see pyproject.toml requires-python = ">=3.13"). macOS pyenv shims resolve to whatever pyenv local reports (often 3.12), so a bare python or pytest silently runs the wrong interpreter. The result is a confusing SyntaxError on 3.13-only syntax, or a ModuleNotFoundError for a package that is installed only in the venv.
Always invoke pytest through the explicit venv path:
# CWD must be backend/ (pytest reads [tool.pytest.ini_options] from pyproject.toml there)
cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest
If you see python3.12 anywhere in the traceback, you are using the wrong interpreter.
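The interpreter check can also be expressed in code, for example as a guard called early in a conftest.py. This is a minimal sketch, not something the project is known to ship: the function name interpreter_ok and the REQUIRED constant are hypothetical, and the (3, 13) floor is taken from the requires-python value quoted above.

```python
import sys

# Assumption: mirrors requires-python = ">=3.13" from pyproject.toml.
REQUIRED = (3, 13)


def interpreter_ok(version_info=sys.version_info, required=REQUIRED) -> bool:
    """Return True when the (major, minor) version meets the required floor."""
    return tuple(version_info[:2]) >= required


if __name__ == "__main__":
    if not interpreter_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"Wrong interpreter: Python {found}; use the pinned venv")
    print("Interpreter OK:", sys.executable)
```

Failing fast with the interpreter path in the message is usually easier to diagnose than a SyntaxError raised deep inside an import.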
Standard invocations
# Full suite — xdist -n4 (default from pyproject.toml addopts)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest
# Full suite, single process — use for the final gate before commit (see below)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0
# Single test file — useful while writing a test
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/test_tenancy/test_onboarding.py -v
# Single test by name
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -k "test_onboard_creates_realm" -v
# With coverage report
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest --cov=app --cov-report=term-missing
# Run only Layer 1 (no testcontainers, fast)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -m "not integration" -n 0
pytest configuration (pyproject.toml)
The [tool.pytest.ini_options] section sets:
| Option | Value | Reason |
|---|---|---|
| asyncio_mode | "auto" | No @pytest.mark.asyncio decorators required on individual tests |
| testpaths | ["tests"] | Scoped to backend/tests/ |
| python_files | ["test_*.py"] | Standard pytest discovery |
| addopts | "-n 4" | xdist 4-worker default (see xdist section below) |
| filterwarnings | "ignore::DeprecationWarning:pytest_freezegun" | Suppresses known upstream noise |
Coverage thresholds ([tool.coverage.report]): fail_under = 70 globally; safety-critical layers require 90% per the S4U PoC mode rules.
xdist parallelism — the -n 4 cap
The default addopts = "-n 4" spawns 4 parallel workers, cutting the suite from ~18 minutes to ~6 minutes. Each worker boots its own Postgres 16, Redis 7, and Keycloak 24 testcontainer — that's the key design: containers are per-worker, not shared.
The cap is deliberately 4 and not auto (which on an M4 Mac would be ~10 workers). The reason: Keycloak's /health/ready probe times out when many containers start simultaneously. At 10 workers, 60+ container timeout failures surface — they are not test failures, they are container scheduling artefacts. The 4-worker cap avoids this entirely.
# Force serial (used for the clean-gate run before commit — see below)
-n 0
# 4 workers (default — fastest without container contention)
# (no flag needed; addopts sets it)
# Never use -n auto for this suite
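The arithmetic behind the cap is easy to make concrete. The sketch below only counts containers under the per-worker design described above (one Postgres, one Redis, one Keycloak per worker); the helper name containers_started is hypothetical, and it assumes a serial run (-n 0) still boots one set of containers in the main process.

```python
# One container of each kind per xdist worker, as described in this section.
CONTAINERS_PER_WORKER = ("postgres", "redis", "keycloak")


def containers_started(workers: int) -> int:
    """Total containers booted for a run with the given xdist worker count."""
    # Assumption: -n 0 runs in the main process and still needs one full set.
    effective_workers = max(workers, 1)
    return effective_workers * len(CONTAINERS_PER_WORKER)


if __name__ == "__main__":
    for n in (0, 4, 10):
        print(f"-n {n}: {containers_started(n)} containers")
```

At the default of 4 workers that is 12 containers (4 of them Keycloak); at roughly 10 workers it is 30, with 10 Keycloak instances all racing through their readiness probe at once.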
Testcontainers — what spins up per worker
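Each xdist worker is a separate pytest process with its own session, so session-scoped fixtures are created once per worker rather than once globally. The worker_id fixture injected by pytest-xdist identifies the worker (gw0, gw1, and so on; master when xdist is off). The conftest.py session fixtures therefore spin up one container of each kind per worker: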
- postgres_container: postgres:16-alpine. Session-scoped. Each test gets a function-scoped database_url rewriting the dialect to postgresql+asyncpg://.
- redis_container: redis:7-alpine. Session-scoped. Tests that need Redis use redis_settings_env.
- keycloak_container: quay.io/keycloak/keycloak:24.0 in dev mode. Session-scoped, ~10s cold start. Bootstrapped via kcadm.sh inside the container to relax sslRequired=NONE on the master realm (the only way to allow HTTP admin API calls from the pytest process; see _bootstrap_disable_ssl_required in conftest.py).
Container teardown is handled by the session fixture's try/finally blocks. Leftover containers from a crashed session are cleaned up by Docker.
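The dialect rewrite mentioned for database_url can be sketched as a small pure function. This is an illustration under assumptions, not the project's conftest code: the name to_asyncpg_url is hypothetical, and the input is assumed to be a synchronous-driver URL such as postgresql+psycopg2://..., which is the kind of URL a Postgres testcontainer typically hands back.

```python
def to_asyncpg_url(sync_url: str) -> str:
    """Swap the driver part of a Postgres URL for asyncpg, keeping the rest."""
    scheme, sep, rest = sync_url.partition("://")
    # The dialect is the part before any "+driver" suffix.
    dialect = scheme.split("+", 1)[0]
    if not sep or dialect not in ("postgresql", "postgres"):
        raise ValueError(f"Not a Postgres URL: {sync_url!r}")
    return f"postgresql+asyncpg://{rest}"


if __name__ == "__main__":
    print(to_asyncpg_url("postgresql+psycopg2://test:test@localhost:54321/test"))
```

Only the scheme changes; credentials, host, the randomly mapped port, and the database name pass through untouched.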
The seeded_tenant / SeededTenant fixture
seeded_tenant is the canonical E2E integration fixture, reused across M4 through M13 integration tests. It:
- Drops and recreates public schema + any leftover tenant_* schemas from prior crashed runs
- Runs alembic upgrade head with RATIBA_MIGRATION_SCOPE=public to migrate the shared schema
- Onboards a real tenant via onboard_tenant(): this creates a Keycloak realm, runs per-tenant Alembic migrations, and calls PostgresSaver.setup() (the LangGraph checkpointer setup)
- Writes WhatsApp credentials via the SQLAlchemy ORM so the EncryptedText TypeDecorator is exercised
- Seeds 3 services (Deep Tissue Massage, Manicure, Pedicure) and 1 staff member with Mon-Fri 09:00–17:00 schedule
- Yields a SeededTenant dataclass carrying tenant_id, slug, owner_phone, phone_number_id, access_token_plaintext, services, and staff_id
- Cleans up Redis and pool connections in finally
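For orientation, the shape of the yielded object might look like the sketch below. The field names come from the list above; every type annotation, the frozen=True choice, and all sample values are assumptions made for illustration, so check the real definition in conftest.py before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeededTenant:
    """Illustrative shape only; field types are assumptions."""

    tenant_id: str
    slug: str
    owner_phone: str
    phone_number_id: str
    access_token_plaintext: str
    services: tuple[str, ...]
    staff_id: str


if __name__ == "__main__":
    # Hypothetical values, not real fixture data.
    tenant = SeededTenant(
        tenant_id="00000000-0000-0000-0000-000000000001",
        slug="demo-spa",
        owner_phone="+254700000000",
        phone_number_id="1234567890",
        access_token_plaintext="test-token",
        services=("Deep Tissue Massage", "Manicure", "Pedicure"),
        staff_id="staff-1",
    )
    print(tenant.slug, len(tenant.services))
```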
Tests that need the eval-layer per-scenario isolation use the tenant_scoped_eval_environment fixture defined in backend/tests/eval/conftest.py instead. That fixture follows the same full-Alembic + DROP CASCADE pattern but scoped to a single YAML scenario ID with a UUID suffix (schema naming pattern: test_tenant_<scenario_id>_<run_id_8hex>).
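The schema naming pattern is simple enough to sketch. The function name eval_schema_name is hypothetical and the real fixture may build the name differently; the sketch assumes the run ID is a UUID4 truncated to 8 hex characters, and adds a length check because PostgreSQL truncates identifiers longer than 63 bytes by default.

```python
import uuid
from typing import Optional

# PostgreSQL's default identifier limit; longer names are silently truncated.
MAX_IDENTIFIER_BYTES = 63


def eval_schema_name(scenario_id: str, run_id: Optional[str] = None) -> str:
    """Build test_tenant_<scenario_id>_<run_id_8hex> for one scenario run."""
    run_hex = (run_id or uuid.uuid4().hex)[:8]
    name = f"test_tenant_{scenario_id}_{run_hex}"
    if len(name.encode("utf-8")) > MAX_IDENTIFIER_BYTES:
        raise ValueError(f"Schema name too long for Postgres: {name!r}")
    return name


if __name__ == "__main__":
    print(eval_schema_name("spa_booking_swahili_happy"))
```

The random suffix is what lets two runs of the same scenario coexist without clobbering each other's schema.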
Migration-invariant suite
Run the migration-invariant suite whenever you add a migration. The seeded_tenant fixture runs alembic upgrade head from scratch on a fresh container, so a botched migration (broken SQL, a schema-name typo, a ::jsonb cast on a bind parameter where CAST(:param AS jsonb) is needed) surfaces immediately. Do not skip this just because unit tests are green: unit tests do not exercise Alembic.
# After adding a migration — always run with -n 0 so the fixture runs once cleanly
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -n 0 -v
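The jsonb cast mistake can also be caught before the suite runs. The sketch below is a hypothetical lint helper, not part of the project: it flags a named bind parameter immediately followed by a :: cast, on the assumption that this form is what breaks (a bind parameter directly followed by a colon is commonly not recognised as a parameter), and leaves CAST(:param AS jsonb) and ordinary column casts alone.

```python
import re

# A named bind parameter (":name") immediately followed by a "::type" cast.
# The lookbehind keeps chained column casts like col::text::jsonb from matching.
UNSAFE_CAST = re.compile(r"(?<!:):[A-Za-z_]\w*::\w+")


def find_unsafe_casts(sql: str) -> list[str]:
    """Return every ':param::type' occurrence found in a SQL string."""
    return UNSAFE_CAST.findall(sql)


if __name__ == "__main__":
    bad = "INSERT INTO settings (doc) VALUES (:doc::jsonb)"
    good = "INSERT INTO settings (doc) VALUES (CAST(:doc AS jsonb))"
    print(find_unsafe_casts(bad))
    print(find_unsafe_casts(good))
```

A check like this is a convenience only; running the full suite against a fresh container remains the real gate.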
Debugging a failure
Step 1 — run single-process first.
Two concurrent pytest processes against the same machine can cause testcontainer timeouts that masquerade as test failures (Docker socket contention, port exhaustion). If you see container-start timeouts in a parallel run that don't reproduce in isolation, the culprit is not the test.
# Reproduce in isolation
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/path/to/failing_test.py -n 0 -v -s
Step 2 — check the interpreter.
If the failure mentions SyntaxError on 3.13-only syntax, or ModuleNotFoundError for a package you know is installed, confirm you are using the pinned venv:
/Users/soft4u/Development/ratiba/backend/.venv/bin/python --version
# Must print Python 3.13.x
Step 3 — the LSP-vs-CLI pyright gotcha.
When the LSP (typescript-lsp / pyright-lsp in the editor) shows zero type errors but pyright on the CLI reports issues, the CLI is canonical. The LSP caches stale type information across edits. Always verify with:
cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pyright app/
Step 4 — the clean single-process gate.
Before marking a task "done", run the full suite serially in a clean terminal (no other pytest process running):
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0
This is the canonical gate. A green xdist run with a concurrent second pytest process running is not sufficient — container timeouts in contended mode produce false failures that hide real ones. The 6-minute xdist run is for day-to-day iteration; the ~18-minute serial run is the final gate.
Frontend Vitest
The frontend test suite uses Vitest + React Testing Library.
Running
# From the frontend directory
cd /Users/soft4u/Development/ratiba/frontend
# Run all tests
npm test
# Run in watch mode (interactive, for development)
npm run test:watch
# Run with coverage
npm run test:coverage
As of the M12 baseline: 81 tests passing.
Snapshot updates
When intentional UI changes break existing snapshots:
# Review the diff first — Vitest shows what changed
npm test
# Accept all changes
npm test -- --update
# Accept changes in a specific file
npm test -- --update src/components/booking/BookingCard.test.tsx
Do not blindly accept snapshot updates. Review each diff to confirm the change is the intended UI change, not a regression. Snapshot diffs are a manual review artefact per ADR-0004 §1.
Type-checking frontend code
The frontend uses TypeScript strict mode + shadcn/ui. To type-check without running tests:
cd /Users/soft4u/Development/ratiba/frontend
npm run type-check
# Equivalent: npx tsc --noEmit
Zero type errors are required before commit — same discipline as the backend pyright gate.
DeepEval calibration
What it is
DeepEval is the primary eval runner for Ratiba's Layer 4 golden-conversation tests and Layer 5 production-replay regression tests. It runs LLM-as-judge scoring against YAML golden-conversation scenarios, using language-specific fluency metrics:
SwahiliFluencyMetric— idiomatic Swahili, code-switch handling, cultural appropriateness for Kenyan SMB contextEnglishFluencyMetric— natural Kenyan-English register (warmer than corporate; not translated American English)MpesaPaymentSafetyMetric— language-agnostic; validates payment flow safety invariantsBookingSlotConsistencyMetric— language-agnostic; validates slot reservation consistency
Scenarios live at backend/tests/eval/conversations/scenarios/. Each scenario YAML carries a language: field that gates which fluency metric runs.
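How the language: field might gate metric selection can be sketched as follows. The metric names are the ones listed above, represented as plain strings so the sketch runs without DeepEval installed. The language codes "sw" and "en", the function name metrics_for, and the choice to run both language-agnostic metrics on every scenario are assumptions.

```python
# Assumption: scenarios use ISO-style codes in their language: field.
FLUENCY_BY_LANGUAGE = {
    "sw": "SwahiliFluencyMetric",
    "en": "EnglishFluencyMetric",
}

# Assumption: language-agnostic metrics apply to every scenario.
LANGUAGE_AGNOSTIC = ("MpesaPaymentSafetyMetric", "BookingSlotConsistencyMetric")


def metrics_for(scenario: dict) -> list[str]:
    """Pick the fluency metric for the scenario's language, plus shared metrics."""
    language = scenario["language"]
    if language not in FLUENCY_BY_LANGUAGE:
        raise ValueError(f"No fluency metric registered for language {language!r}")
    return [FLUENCY_BY_LANGUAGE[language], *LANGUAGE_AGNOSTIC]


if __name__ == "__main__":
    print(metrics_for({"id": "spa_booking_swahili_happy", "language": "sw"}))
```

Raising on an unknown language, rather than silently skipping the fluency metric, keeps a typo in a scenario file from producing a misleadingly green run.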
Running the eval suite
cd /Users/soft4u/Development/ratiba/backend
# Full eval suite (LLM-judge calls — expect several minutes + LLM cost)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -v
# Force a cold-cache run (clear cached judge results first)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --deepeval-cache-clear -v
# Run without the DeepEval cache (re-runs every judge call — for debugging)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --no-deepeval-cache -v
# Run a single scenario
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -k "spa_booking_swahili_happy" -v
Reading results
DeepEval outputs a per-scenario pass/fail table after the run. Look for:
- Score: the judge's 0.0–1.0 rating for each metric
- Threshold: the minimum score set in the scenario YAML (e.g., threshold: 0.80 for SwahiliFluencyMetric)
- Reason: the judge's natural-language explanation (load-bearing for calibration review)
A failing scenario prints the full judge explanation. Read it before deciding whether it's a regression or a prompt-rubric calibration issue.
The 4-tuple cache key
DeepEval caches LLM-judge calls to avoid re-running expensive LLM calls between runs. The cache key is a SHA256 of a 4-tuple:
| Component | Example | Invalidation trigger |
|---|---|---|
| scenario_id | "spa_booking_swahili_happy" | Different scenario = different conversation |
| prompt_version | "booking_orchestrator@a1b2c3d" | Application prompt changed → different model output |
| judge_model_version | "claude-opus-4.7" | A judge model upgrade may score the same output differently |
| metric_version | "swahili_fluency:1.0.0" | Rubric change → stale cache entry |
The metric_version comes from the __version__ constant in each custom metric module (e.g., backend/tests/eval/metrics/swahili_fluency.py). Bump __version__ whenever you change a rubric — failing to do so leaves stale cached scores that hide the real impact of your rubric change. The PR template includes a checklist item for this.
Cache directory: .deepeval-cache/ at project root, gitignored.
Full cache-key rationale and the cache invalidation policy are in ADR-0004 §5.
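A minimal sketch of a SHA256 key over the 4-tuple, using the example values from the table. The serialisation (a compact JSON array in a fixed field order) and the function name judge_cache_key are assumptions; the actual encoding is defined by the project and ADR-0004 §5, not by this example.

```python
import hashlib
import json


def judge_cache_key(
    scenario_id: str,
    prompt_version: str,
    judge_model_version: str,
    metric_version: str,
) -> str:
    """Hash the 4-tuple into a hex digest; any component change yields a new key."""
    # Assumption: a JSON array gives an unambiguous, order-stable encoding.
    payload = json.dumps(
        [scenario_id, prompt_version, judge_model_version, metric_version],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    before = judge_cache_key(
        "spa_booking_swahili_happy",
        "booking_orchestrator@a1b2c3d",
        "claude-opus-4.7",
        "swahili_fluency:1.0.0",
    )
    after = judge_cache_key(
        "spa_booking_swahili_happy",
        "booking_orchestrator@a1b2c3d",
        "claude-opus-4.7",
        "swahili_fluency:1.0.1",  # rubric bumped
    )
    print(before != after)
```

This is why the __version__ bump matters: if a rubric changes without it, the tuple is unchanged, the key is unchanged, and the old score is served from cache.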
Monthly calibration ritual
The calibration set at backend/tests/eval/calibration/human_labelled.yaml is the ground-truth reference that keeps the LLM-as-judge honest.
| Property | Value |
|---|---|
| Owner | Adrian |
| Cadence | Monthly, first Monday of each month |
| Volume per session | ~10 new PII-masked production transcripts |
| Rating shape | Each transcript scored 1–5 on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness |
| Kappa target | Cohen's kappa ≥ 0.7 (substantial agreement) against human ratings |
Calibration run procedure:
# 1. Add the new PII-masked transcripts to human_labelled.yaml
# 2. Re-run the judge against the full calibration set
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/calibration/ --no-deepeval-cache -v
# 3. Compute Cohen's kappa between judge ratings and human ratings
# (script at backend/scripts/compute_calibration_kappa.py)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python scripts/compute_calibration_kappa.py
# 4. If kappa < 0.7 — the judge model is not trustworthy; investigate
# before merging any prompt PR that month
If kappa falls below 0.7, the standard intervention is rubric recalibration — review the scenarios where the judge and human ratings diverge most, identify the systematic disagreement, and update the rubric in the metric module (with __version__ bump).
Full calibration ownership, kappa interpretation, and the rationale for monthly vs weekly/quarterly cadence are in ADR-0004 §3.
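For reference, unweighted Cohen's kappa over two raters' labels is computed as (observed agreement minus chance agreement) divided by (1 minus chance agreement). The sketch below is a plain-Python version of that textbook formula; it is not the contents of compute_calibration_kappa.py, and it assumes the judge's scores have already been mapped onto the same 1–5 labels the human raters use.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of ratings")
    n = len(rater_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5, 3, 3, 4, 2, 5]  # hypothetical ratings
    judge = [1, 2, 3, 4, 5, 3, 4, 4, 2, 5]
    print(round(cohens_kappa(human, judge), 3))
```

Because the 1–5 scale is ordinal, a weighted kappa would penalise a 3-vs-4 disagreement less than a 1-vs-5 one; whether the project uses the weighted or unweighted form is not stated here.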
Production-replay promotion (Phase 2)
In Phase 2, negative-feedback conversations from Langfuse are promoted to regression scenarios via a weekly job at backend/app/eval/from_production/promote_to_scenario.py. The job applies the PII masking floor from backend/app/observability/redact.py before writing any scenario to disk.
Tenants in sensitive verticals (dental, physio, medical, legal) are blocked from production-replay promotion until the future privacy ADR fills in the vertical-specific masking rules. The blocker surfaces as a hard NotImplementedError, which is intentional.
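The intentional blocker can be pictured as a guard that runs before any transcript is written. This is a sketch only: the names BLOCKED_VERTICALS and ensure_promotable are hypothetical, and the real check in promote_to_scenario.py may be structured differently.

```python
# Verticals listed above as blocked pending the future privacy ADR.
BLOCKED_VERTICALS = frozenset({"dental", "physio", "medical", "legal"})


def ensure_promotable(vertical: str) -> None:
    """Raise NotImplementedError for verticals without masking rules yet."""
    if vertical.lower() in BLOCKED_VERTICALS:
        raise NotImplementedError(
            f"Production-replay promotion is not available for {vertical!r} "
            "tenants until vertical-specific masking rules exist"
        )


if __name__ == "__main__":
    ensure_promotable("spa")  # passes silently
    try:
        ensure_promotable("dental")
    except NotImplementedError as exc:
        print("blocked:", exc)
```

Raising instead of returning a flag means a caller cannot forget to check the result and promote an unmasked transcript by accident.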
CI gates
Ratiba uses a 3-layer quality defence that enforces zero-error commits at every step. The Methodology page describes the full lifecycle; this section covers what blocks at each layer.
Layer 1 — Post-edit linting (editor hook)
Runs after every file save on touched files:
- ruff: zero pyflakes (F) errors required. Style categories (E, W, I, N, UP, B, RUF) are surfaced as warnings.
- pyright: zero type errors on touched files. CLI is canonical when it disagrees with the LSP.
- tsc: zero TypeScript errors on changed frontend files.
These do not block a commit directly but surface issues before they compound.
Layer 2 — Pre-push hook (git hook)
Blocks git push until these pass:
ruff check app/ # zero F-category errors
pyright app/ # zero type errors
pytest tests/ -n 4 # full suite green
npm run type-check # frontend zero type errors
npm test # 81 frontend Vitest tests green
The pre-push hook is the main regression guard. A push that would break CI is caught here.
Layer 3 — Stop verification (mandatory before "done")
Before any task is declared complete, the canonical verification sequence runs:
# Backend — clean single-process full suite (see debugging note above)
cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0
# Frontend — all tests + type-check
cd /Users/soft4u/Development/ratiba/frontend
npm test && npm run type-check
"Tests pass" without proof is insufficient per the S4U PoC mode rules. The actual terminal output (pytest result line + coverage %) is required evidence.
Eval gate matrix
The eval gate matrix controls when Layer 4 (DeepEval) runs on CI:
| PR type | Eval gate | Blocking? |
|---|---|---|
| Prompt or FSM change | Full eval suite | Blocking — PR cannot merge until eval passes |
| Non-prompt code change | Smoke eval (subset) | Advisory only |
| Renovate major version bump | Full eval suite | Blocking |
| Documentation / config change | No eval | N/A |
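The matrix above reads naturally as a lookup table. In the sketch below the PR-type keys, the "full" and "smoke" labels, and the function name eval_gate are hypothetical; the real classification presumably lives in the CI workflow rather than in Python.

```python
from typing import Optional

# PR type -> (eval suite to run, whether a failure blocks the merge).
EVAL_GATES: dict[str, tuple[Optional[str], bool]] = {
    "prompt_or_fsm_change": ("full", True),
    "non_prompt_code_change": ("smoke", False),  # advisory only
    "renovate_major_bump": ("full", True),
    "docs_or_config_change": (None, False),  # no eval runs
}


def eval_gate(pr_type: str) -> tuple[Optional[str], bool]:
    """Look up which eval suite runs for a PR type and whether it blocks."""
    return EVAL_GATES[pr_type]


if __name__ == "__main__":
    for pr_type in EVAL_GATES:
        suite, blocking = eval_gate(pr_type)
        print(f"{pr_type}: suite={suite}, blocking={blocking}")
```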
The CI workflow for the blocking gate lives at .github/workflows/eval-on-prompt-change.yml. The cost ceiling per PR is RATIBA_EVAL_BUDGET_USD=20 (~$300/month at PoC scale); the eval suite auto-downsamples from 10% cross-check to 0% on budget breach.
See ADR-0004 §1 for the gate matrix rationale and the Q5 cost-ceiling anchor.
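The budget-breach behaviour described above can be sketched as a two-state rule. The environment variable name and the 10% and 0% rates come from this section; the function names, the default of 20 when the variable is unset, and treating "spend equal to the ceiling" as a breach are assumptions.

```python
import os
from typing import Mapping

NORMAL_CROSS_CHECK_RATE = 0.10  # 10% cross-check while under budget
BREACHED_CROSS_CHECK_RATE = 0.0  # downsampled to 0% on breach


def budget_ceiling_usd(env: Mapping[str, str] = os.environ) -> float:
    """Read the per-PR ceiling; assumed to default to 20 when unset."""
    return float(env.get("RATIBA_EVAL_BUDGET_USD", "20"))


def cross_check_rate(spent_usd: float, ceiling_usd: float) -> float:
    """Return the cross-check sampling rate for the current spend."""
    # Assumption: reaching the ceiling exactly already counts as a breach.
    if spent_usd >= ceiling_usd:
        return BREACHED_CROSS_CHECK_RATE
    return NORMAL_CROSS_CHECK_RATE


if __name__ == "__main__":
    ceiling = budget_ceiling_usd()
    for spent in (5.0, 19.99, 20.0, 42.0):
        print(f"spent ${spent:.2f}: rate {cross_check_rate(spent, ceiling):.2f}")
```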
Quality thresholds
| Metric | Threshold | Gate |
|---|---|---|
| Backend coverage | ≥ 70% overall, ≥ 90% safety-critical | fail_under = 70 in pyproject.toml |
| Frontend tests | All 81 passing | Pre-push |
| Pyright errors | 0 | Pre-push |
| Ruff F-category | 0 | Pre-push |
| DeepEval kappa | ≥ 0.7 (calibration) | Monthly ritual |
| DeepEval scenario score | Per-scenario threshold (e.g. 0.80 for SwahiliFluencyMetric) | Eval gate on prompt PRs |
Cross-links
- ADR-0004 — Testing strategy under conversation-as-state — the authoritative source for the 5-layer pyramid, per-scenario fresh-tenant fixture, DeepEval 4-tuple cache key, bilingual judge mode, and PII masking floor
- Methodology — S4U quality gates, wave structure, and PoC mode rules
- Local dev setup — getting the venv and testcontainers bootstrapped before running tests
- Incidents — if tests are failing in ways that suggest infra rather than code issues (container port exhaustion, Keycloak realm leak, Redis key pollution)