
Testing on dev

Ratiba's test suite is structured around a 5-layer test pyramid designed for a system where conversation state is canonical booking state. Classical SQL-fixture testing cannot cover the booking flow; the pyramid reflects that reality. This page is the operator's guide for running and debugging all layers locally.

For the full testing strategy — why this pyramid, how the per-scenario fresh-tenant fixture works, the bilingual judge model, the PII masking floor, and the Phase 3 canary hook — see ADR-0004.

The 5-layer pyramid

| Layer | What it tests | Testcontainers needed | Speed |
| --- | --- | --- | --- |
| 1: FSM unit | Deterministic state transitions, pure Python | No | Fast (<1s per test) |
| 2: Full-turn integration | LangGraph checkpointer + TenantScopedSaver | Postgres + Redis | Medium |
| 3: Transcript replay e2e | WhatsApp webhook → FSM → outbound message | Postgres + Redis + Keycloak | Slow |
| 4: Golden-conversation snapshots | Conversational quality, bilingual fluency | No (LLM calls, not containers) | Slow (LLM-cost dominated) |
| 5: Production replay | Regression from real customer transcripts | No (uses prod infra) | Phase 2 only |

Layers 1 to 3 run with plain pytest. Layer 4 uses the deepeval runner via pytest markers. Layer 5 is a Phase 2 capability (not yet active).


Backend pytest

Why the pinned venv command

The project pins Python 3.13 (see requires-python = ">=3.13" in pyproject.toml). On macOS, pyenv shims resolve to whatever pyenv local reports, often 3.12, so a bare python or pytest silently runs the wrong interpreter. The symptoms are confusing: a SyntaxError on 3.13-only syntax, or a ModuleNotFoundError for a package that is installed in the venv.

Always invoke pytest through the explicit venv path:

# CWD must be backend/ (pyproject.toml, which holds the pytest config, is there)
cd /Users/soft4u/Development/ratiba/backend

/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest

If you see python3.12 anywhere in the traceback, you are using the wrong interpreter.
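The interpreter check can also be automated. The sketch below is a hypothetical guard (the names interpreter_ok and require_pinned_interpreter are not from the project) that could be called early in a conftest.py to fail fast, assuming the only requirement is the >=3.13 pin described above.

```python
import sys

# Minimum interpreter version, mirroring requires-python = ">=3.13".
MIN_VERSION = (3, 13)


def interpreter_ok(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True when the (major, minor) pair meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def require_pinned_interpreter() -> None:
    """Raise a clear error instead of a confusing downstream failure."""
    if not interpreter_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise RuntimeError(
            f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ required, "
            f"but {sys.executable} is {found}"
        )
```

Including sys.executable in the message shows at a glance whether a pyenv shim or the venv binary was picked up.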

Standard invocations

# Full suite — xdist -n4 (default from pyproject.toml addopts)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest

# Full suite, single process — use for the final gate before commit (see below)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

# Single test file — useful while writing a test
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/test_tenancy/test_onboarding.py -v

# Single test by name
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -k "test_onboard_creates_realm" -v

# With coverage report
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest --cov=app --cov-report=term-missing

# Run only Layer 1 (no testcontainers, fast)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -m "not integration" -n 0

pytest configuration (pyproject.toml)

The [tool.pytest.ini_options] section sets:

| Option | Value | Reason |
| --- | --- | --- |
| asyncio_mode | "auto" | No @pytest.mark.asyncio decorators required on individual tests |
| testpaths | ["tests"] | Scoped to backend/tests/ |
| python_files | ["test_*.py"] | Standard pytest discovery |
| addopts | "-n 4" | xdist 4-worker default (see xdist section below) |
| filterwarnings | "ignore::DeprecationWarning:pytest_freezegun" | Suppresses known upstream noise |

Coverage thresholds ([tool.coverage.report]): fail_under = 70 globally; safety-critical layers require 90% per the S4U PoC mode rules.

xdist parallelism — the -n 4 cap

The default addopts = "-n 4" spawns 4 parallel workers, cutting the suite from ~18 minutes to ~6 minutes. Each worker boots its own Postgres 16, Redis 7, and Keycloak 24 testcontainer — that's the key design: containers are per-worker, not shared.

The cap is deliberately 4 and not auto (which on an M4 Mac would be ~10 workers). The reason: Keycloak's /health/ready probe times out when many containers start simultaneously. At 10 workers, 60+ container timeout failures surface — they are not test failures, they are container scheduling artefacts. The 4-worker cap avoids this entirely.

# Force serial (used for the clean-gate run before commit — see below)
-n 0

# 4 workers (default — fastest without container contention)
# (no flag needed; addopts sets it)

# Never use -n auto for this suite

Testcontainers — what spins up per worker

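The conftest.py session fixtures spin up one container of each kind per xdist worker. Each pytest-xdist worker is a separate process with its own test session, so a session-scoped fixture is created once per worker rather than once globally. The worker_id fixture that pytest-xdist provides (gw0, gw1, and so on) identifies which worker a fixture is running in: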

  • postgres_container: postgres:16-alpine. Session-scoped. Each test gets a function-scoped database_url that rewrites the dialect to postgresql+asyncpg://.
  • redis_container: redis:7-alpine. Session-scoped. Tests that need Redis use redis_settings_env.
  • keycloak_container: quay.io/keycloak/keycloak:24.0 in dev mode. Session-scoped, ~10s cold start. Bootstrapped via kcadm.sh inside the container to set sslRequired=NONE on the master realm, which is the only way to allow HTTP admin API calls from the pytest process (see _bootstrap_disable_ssl_required in conftest.py).

Container teardown is handled by the session fixture's try/finally blocks. Leftover containers from a crashed session are cleaned up by Docker.
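The dialect rewrite that database_url performs can be sketched as below. This is an illustration only: the helper name to_asyncpg_url is hypothetical, and it assumes the container reports a synchronous URL such as postgresql+psycopg2://... that must be pointed at the asyncpg driver.

```python
def to_asyncpg_url(url: str) -> str:
    """Rewrite a PostgreSQL URL so SQLAlchemy uses the asyncpg driver.

    Only the scheme is touched; credentials, host, port and database
    name are passed through unchanged.
    """
    scheme, separator, rest = url.partition("://")
    if not separator:
        raise ValueError(f"not a URL: {url!r}")
    # Strip any existing driver suffix, e.g. "postgresql+psycopg2".
    dialect = scheme.split("+", 1)[0]
    if dialect not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported dialect: {dialect!r}")
    return f"postgresql+asyncpg://{rest}"
```

Rejecting unknown dialects keeps a misconfigured fixture from silently producing a URL that fails later inside the engine.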

The seeded_tenant / SeededTenant fixture

seeded_tenant is the canonical E2E integration fixture, reused across M4 through M13 integration tests. It:

  1. Drops and recreates public schema + any leftover tenant_* schemas from prior crashed runs
  2. Runs alembic upgrade head with RATIBA_MIGRATION_SCOPE=public to migrate the shared schema
  3. Onboards a real tenant via onboard_tenant() — this creates a Keycloak realm, runs per-tenant Alembic migrations, and calls PostgresSaver.setup() (the LangGraph checkpointer setup)
  4. Writes WhatsApp credentials via the SQLAlchemy ORM so the EncryptedText TypeDecorator is exercised
  5. Seeds 3 services (Deep Tissue Massage, Manicure, Pedicure) and 1 staff member with Mon-Fri 09:00–17:00 schedule
  6. Yields a SeededTenant dataclass carrying tenant_id, slug, owner_phone, phone_number_id, access_token_plaintext, services, and staff_id
  7. Cleans up Redis and pool connections in finally
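For orientation, the yielded object might look roughly like the sketch below. The field names come from step 6; the field types, the frozen setting, and hiding the token from repr are assumptions made for illustration, not the project's actual definition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SeededTenant:
    """Handle for the tenant that the fixture onboarded (types assumed)."""

    tenant_id: str
    slug: str
    owner_phone: str
    phone_number_id: str
    # Assumption: keep the plaintext token out of repr() so it does not
    # leak into pytest failure output.
    access_token_plaintext: str = field(repr=False)
    # Assumption: service name -> service id for the three seeded services.
    services: dict = field(default_factory=dict)
    staff_id: str = ""
```

A test would then read values such as seeded_tenant.services["Manicure"] rather than querying the database for seeded IDs.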

Tests that need the eval-layer per-scenario isolation use the tenant_scoped_eval_environment fixture defined in backend/tests/eval/conftest.py instead. That fixture follows the same full-Alembic + DROP CASCADE pattern but scoped to a single YAML scenario ID with a UUID suffix (schema naming pattern: test_tenant_<scenario_id>_<run_id_8hex>).
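The schema naming pattern can be sketched as a small helper. The function name eval_schema_name and the validation rules are hypothetical; the 63-character check reflects PostgreSQL's default identifier length limit, beyond which names are truncated.

```python
import re
import uuid
from typing import Optional

MAX_IDENTIFIER_LENGTH = 63  # PostgreSQL truncates longer identifiers


def eval_schema_name(scenario_id: str, run_id: Optional[uuid.UUID] = None) -> str:
    """Build test_tenant_<scenario_id>_<run_id_8hex> for one eval run."""
    if not re.fullmatch(r"[a-z0-9_]+", scenario_id):
        raise ValueError(f"scenario_id is not a safe identifier: {scenario_id!r}")
    run_id = run_id or uuid.uuid4()
    name = f"test_tenant_{scenario_id}_{run_id.hex[:8]}"
    if len(name) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(f"schema name too long ({len(name)} chars): {name}")
    return name
```

The random suffix is what lets two runs of the same scenario coexist, for example when a previous run crashed before its DROP CASCADE.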

Migration-invariant suite

Run the migration-invariant suite whenever you add a migration. The seeded_tenant fixture runs alembic upgrade head from scratch on a fresh container, which means a botched migration (broken SQL, schema name typo, missing CAST(:param AS jsonb) vs ::jsonb cast) will surface immediately. Do not skip this just because unit tests are green — unit tests do not exercise Alembic.

# After adding a migration — always run with -n 0 so the fixture runs once cleanly
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/ -n 0 -v

Debugging a failure

Step 1 — run single-process first.

Two concurrent pytest processes on the same machine can cause testcontainer timeouts that masquerade as test failures (Docker socket contention, port exhaustion). If you see container-start timeouts in a parallel run that don't reproduce in isolation, the culprit is not the test.

# Reproduce in isolation
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/path/to/failing_test.py -n 0 -v -s

Step 2 — check the interpreter.

If the failure mentions SyntaxError on 3.13-only syntax, or ModuleNotFoundError for a package you know is installed, confirm you are using the pinned venv:

/Users/soft4u/Development/ratiba/backend/.venv/bin/python --version
# Must print Python 3.13.x

Step 3 — the LSP-vs-CLI pyright gotcha.

When the LSP (typescript-lsp / pyright-lsp in the editor) shows zero type errors but pyright on the CLI reports issues, the CLI is canonical. The LSP caches stale type information across edits. Always verify with:

cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pyright app/

Step 4 — the clean single-process gate.

Before marking a task "done", run the full suite serially in a clean terminal (no other pytest process running):

/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

This is the canonical gate. An xdist run made while a second pytest process is active is not sufficient evidence: container timeouts under contention produce false failures that can mask real ones. The 6-minute xdist run is for day-to-day iteration; the ~18-minute serial run is the final gate.


Frontend Vitest

The frontend test suite uses Vitest + React Testing Library.

Running

# From the frontend directory
cd /Users/soft4u/Development/ratiba/frontend

# Run all tests
npm test

# Run in watch mode (interactive, for development)
npm run test:watch

# Run with coverage
npm run test:coverage

As of the M12 baseline: 81 tests passing.

Snapshot updates

When intentional UI changes break existing snapshots:

# Review the diff first — Vitest shows what changed
npm test

# Accept all changes (Vitest's flag is -u / --update)
npm test -- --update

# Accept changes in a specific file
npm test -- --update src/components/booking/BookingCard.test.tsx

Do not blindly accept snapshot updates. Review each diff to confirm the change is the intended UI change, not a regression. Snapshot diffs are a manual review artefact per ADR-0004 §1.

Type-checking frontend code

The frontend uses TypeScript strict mode + shadcn/ui. To type-check without running tests:

cd /Users/soft4u/Development/ratiba/frontend
npm run type-check
# Equivalent: npx tsc --noEmit

Zero type errors are required before commit — same discipline as the backend pyright gate.


DeepEval calibration

What it is

DeepEval is the primary eval runner for Ratiba's Layer 4 golden-conversation tests and Layer 5 production-replay regression tests. It runs LLM-as-judge scoring against YAML golden-conversation scenarios, using language-specific fluency metrics:

  • SwahiliFluencyMetric — idiomatic Swahili, code-switch handling, cultural appropriateness for Kenyan SMB context
  • EnglishFluencyMetric — natural Kenyan-English register (warmer than corporate; not translated American English)
  • MpesaPaymentSafetyMetric — language-agnostic; validates payment flow safety invariants
  • BookingSlotConsistencyMetric — language-agnostic; validates slot reservation consistency

Scenarios live at backend/tests/eval/conversations/scenarios/. Each scenario YAML carries a language: field that gates which fluency metric runs.
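The language gating can be pictured as a lookup from the scenario's language: field to a fluency metric. In this sketch the language codes "sw" and "en" are assumptions about the YAML values, and so is running both language-agnostic metrics on every scenario. Metric classes are represented by their names only.

```python
# Assumed language codes; the real scenario YAML may use different values.
FLUENCY_BY_LANGUAGE = {
    "sw": "SwahiliFluencyMetric",
    "en": "EnglishFluencyMetric",
}

# Assumption: these run regardless of the scenario language.
LANGUAGE_AGNOSTIC = ("MpesaPaymentSafetyMetric", "BookingSlotConsistencyMetric")


def metrics_for(scenario: dict) -> list:
    """Return the metric names that apply to one parsed scenario."""
    language = scenario.get("language")
    if language not in FLUENCY_BY_LANGUAGE:
        raise ValueError(f"unsupported or missing language: {language!r}")
    return [FLUENCY_BY_LANGUAGE[language], *LANGUAGE_AGNOSTIC]
```

Failing loudly on an unknown language avoids a scenario that silently skips its fluency check and still reports green.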

Running the eval suite

cd /Users/soft4u/Development/ratiba/backend

# Full eval suite (LLM-judge calls — expect several minutes + LLM cost)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -v

# Force a cold-cache run (clear cached judge results first)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --deepeval-cache-clear -v

# Run without the DeepEval cache (re-runs every judge call — for debugging)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ --no-deepeval-cache -v

# Run a single scenario
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/ -k "spa_booking_swahili_happy" -v

Reading results

DeepEval outputs a per-scenario pass/fail table after the run. Look for:

  • Score — the judge's 0.0–1.0 rating for each metric
  • Threshold — the minimum score set in the scenario YAML (e.g., threshold: 0.80 for SwahiliFluencyMetric)
  • Reason — the judge's natural-language explanation (load-bearing for calibration review)

A failing scenario prints the full judge explanation. Read it before deciding whether it's a regression or a prompt-rubric calibration issue.

The 4-tuple cache key

DeepEval caches LLM-judge calls to avoid re-running expensive LLM calls between runs. The cache key is a SHA256 of a 4-tuple:

| Component | Example | Invalidation trigger |
| --- | --- | --- |
| scenario_id | "spa_booking_swahili_happy" | Different scenario = different conversation |
| prompt_version | "booking_orchestrator@a1b2c3d" | Application prompt changed → different model output |
| judge_model_version | "claude-opus-4.7" | A judge model upgrade may rate identical output differently |
| metric_version | "swahili_fluency:1.0.0" | Rubric change → stale cache entry |

The metric_version comes from the __version__ constant in each custom metric module (e.g., backend/tests/eval/metrics/swahili_fluency.py). Bump __version__ whenever you change a rubric — failing to do so leaves stale cached scores that hide the real impact of your rubric change. The PR template includes a checklist item for this.

Cache directory: .deepeval-cache/ at project root, gitignored.

Full cache-key rationale and the cache invalidation policy are in ADR-0004 §5.
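A minimal sketch of how such a key could be derived is shown below. The function name judge_cache_key and the JSON serialisation of the tuple are assumptions; the source only states that the key is a SHA256 over the four components.

```python
import hashlib
import json


def judge_cache_key(
    scenario_id: str,
    prompt_version: str,
    judge_model_version: str,
    metric_version: str,
) -> str:
    """Hash the 4-tuple into a stable hex digest.

    JSON-encoding the list (rather than concatenating strings) keeps the
    component boundaries unambiguous, so ("a", "bc") and ("ab", "c")
    cannot collide.
    """
    payload = json.dumps(
        [scenario_id, prompt_version, judge_model_version, metric_version],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under this scheme, bumping __version__ in a metric module changes metric_version, which changes the digest, so the old cached score is no longer found.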

Monthly calibration ritual

The calibration set at backend/tests/eval/calibration/human_labelled.yaml is the ground-truth reference that keeps the LLM-as-judge honest.

| Property | Value |
| --- | --- |
| Owner | Adrian |
| Cadence | Monthly, first Monday of each month |
| Volume per session | ~10 new PII-masked production transcripts |
| Rating shape | Each transcript scored 1–5 on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness |
| Kappa target | Cohen's kappa ≥ 0.7 (substantial agreement) against human ratings |

Calibration run procedure:

# 1. Add the new PII-masked transcripts to human_labelled.yaml
# 2. Re-run the judge against the full calibration set
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest tests/eval/calibration/ --no-deepeval-cache -v

# 3. Compute Cohen's kappa between judge ratings and human ratings
# (script at backend/scripts/compute_calibration_kappa.py)
/Users/soft4u/Development/ratiba/backend/.venv/bin/python scripts/compute_calibration_kappa.py

# 4. If kappa < 0.7 — the judge model is not trustworthy; investigate
# before merging any prompt PR that month

If kappa falls below 0.7, the standard intervention is rubric recalibration — review the scenarios where the judge and human ratings diverge most, identify the systematic disagreement, and update the rubric in the metric module (with __version__ bump).
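For reference, unweighted Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The sketch below is a plain-Python illustration, not the contents of compute_calibration_kappa.py. It assumes judge scores have already been mapped onto the same 1–5 scale as the human ratings, and that one rating dimension is compared at a time.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("rating lists must be non-empty and equal in length")
    n = len(human)
    # p_o: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # p_e: chance agreement from the two marginal label distributions.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Because the ratings are ordinal, a weighted kappa that penalises a 1-versus-5 disagreement more than a 4-versus-5 one may be a better fit. Which variant the project script uses is not stated here.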

Full calibration ownership, kappa interpretation, and the rationale for monthly vs weekly/quarterly cadence are in ADR-0004 §3.

Production-replay promotion (Phase 2)

In Phase 2, negative-feedback conversations from Langfuse are promoted to regression scenarios via a weekly job at backend/app/eval/from_production/promote_to_scenario.py. The job applies the PII masking floor from backend/app/observability/redact.py before writing any scenario to disk.

Tenants in sensitive verticals (dental, physio, medical, legal) are blocked from production-replay promotion until the future privacy ADR fills in the vertical-specific masking rules. The blocker surfaces as a hard NotImplementedError, and that is intentional.
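The intentional blocker can be sketched as a guard that runs before any masking or file write. The names BLOCKED_VERTICALS and ensure_promotable are hypothetical, and the vertical identifiers are assumed to match the words used above.

```python
# Assumed identifiers for the verticals named in this section.
BLOCKED_VERTICALS = frozenset({"dental", "physio", "medical", "legal"})


def ensure_promotable(vertical: str) -> None:
    """Refuse promotion for verticals without dedicated masking rules."""
    if vertical.strip().lower() in BLOCKED_VERTICALS:
        raise NotImplementedError(
            f"production-replay promotion is blocked for {vertical!r} "
            "until vertical-specific masking rules are defined"
        )
```

Raising NotImplementedError rather than skipping quietly makes the gap visible in the job output instead of letting a transcript slip through with only the generic masking floor applied.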


CI gates

Ratiba uses a 3-layer quality defence that enforces zero-error commits at every step. The Methodology page describes the full lifecycle; this section covers what blocks at each layer.

Layer 1 — Post-edit linting (editor hook)

Runs after every file save on touched files:

  • ruff — zero pyflakes (F) errors required. Style categories (E, W, I, N, UP, B, RUF) are surfaced as warnings.
  • pyright — zero type errors on touched files. CLI is canonical when it disagrees with the LSP.
  • tsc — zero TypeScript errors on changed frontend files.

These do not block a commit directly but surface issues before they compound.

Layer 2 — Pre-push hook (git hook)

Blocks git push until these pass:

ruff check app/ # zero F-category errors
pyright app/ # zero type errors
pytest tests/ -n 4 # full suite green
npm run type-check # frontend zero type errors
npm test # 81 frontend Vitest tests green

The pre-push hook is the main regression guard. A push that would break CI is caught here.

Layer 3 — Stop verification (mandatory before "done")

Before any task is declared complete, the canonical verification sequence runs:

# Backend — clean single-process full suite (see debugging note above)
cd /Users/soft4u/Development/ratiba/backend
/Users/soft4u/Development/ratiba/backend/.venv/bin/python -m pytest -n 0

# Frontend — all tests + type-check
cd /Users/soft4u/Development/ratiba/frontend
npm test && npm run type-check

"Tests pass" without proof is insufficient per the S4U PoC mode rules. The actual terminal output (pytest result line + coverage %) is required evidence.

Eval gate matrix

The eval gate matrix controls when Layer 4 (DeepEval) runs on CI:

| PR type | Eval gate | Blocking? |
| --- | --- | --- |
| Prompt or FSM change | Full eval suite | Blocking: PR cannot merge until eval passes |
| Non-prompt code change | Smoke eval (subset) | Advisory only |
| Renovate major version bump | Full eval suite | Blocking |
| Documentation / config change | No eval | N/A |

The CI workflow for the blocking gate lives at .github/workflows/eval-on-prompt-change.yml. The cost ceiling per PR is RATIBA_EVAL_BUDGET_USD=20 (~$300/month at PoC scale); the eval suite auto-downsamples from 10% cross-check to 0% on budget breach.
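One way to read the budget rule is as a switch on the cross-check sampling rate. The sketch below is an interpretation, not the project's logic: the function name cross_check_rate is hypothetical, and it assumes a breach means that spend has reached the ceiling and that the rate drops in a single step from 10% to 0%.

```python
import os

DEFAULT_BUDGET_USD = "20"
NORMAL_RATE = 0.10  # 10% cross-check while under budget
BREACH_RATE = 0.0   # cross-check disabled once the ceiling is hit


def cross_check_rate(spent_usd: float, env=os.environ) -> float:
    """Return the cross-check sampling rate for the current PR spend."""
    budget = float(env.get("RATIBA_EVAL_BUDGET_USD", DEFAULT_BUDGET_USD))
    # Assumption: reaching the ceiling exactly already counts as a breach.
    return BREACH_RATE if spent_usd >= budget else NORMAL_RATE
```

Passing the environment as a parameter keeps the function testable without mutating os.environ.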

See ADR-0004 §1 for the gate matrix rationale and the Q5 cost-ceiling anchor.

Quality thresholds

| Metric | Threshold | Gate |
| --- | --- | --- |
| Backend coverage | ≥ 70% overall, ≥ 90% safety-critical | fail_under = 70 in pyproject.toml |
| Frontend tests | All 81 passing | Pre-push |
| Pyright errors | 0 | Pre-push |
| Ruff F-category | 0 | Pre-push |
| DeepEval kappa | ≥ 0.7 (calibration) | Monthly ritual |
| DeepEval scenario score | Per-scenario threshold (e.g. 0.80 for SwahiliFluencyMetric) | Eval gate on prompt PRs |

  • ADR-0004 — Testing strategy under conversation-as-state — the authoritative source for the 5-layer pyramid, per-scenario fresh-tenant fixture, DeepEval 4-tuple cache key, bilingual judge mode, and PII masking floor
  • Methodology — S4U quality gates, wave structure, and PoC mode rules
  • Local dev setup — getting the venv and testcontainers bootstrapped before running tests
  • Incidents — if tests are failing in ways that suggest infra rather than code issues (container port exhaustion, Keycloak realm leak, Redis key pollution)