ADR-0004: Testing Strategy Under Conversation-as-State

Status: Accepted
Date: 2026-04-25

Context

Ratiba's "conversation state IS canonical state for in-flight bookings" principle (project memory, PRD §1.4) means ordinary pytest against SQL fixtures cannot cover the booking flow. The orchestrator's correctness is partly probabilistic (LLM outputs) and its state lives in LangGraph checkpoints across two stores, so testing must verify both deterministic FSM transitions and conversational quality.

A3 (docs/research/2026-04-25-eval-frameworks.md) settled the high-level eval architecture: DeepEval as primary eval runner + self-hosted Langfuse v4 as observability backbone, prompt CMS, and customer-feedback collector. The five-layer test pyramid (FSM unit, full-turn integration, transcript replay, golden-conversation snapshots, production replay) replaces the classical pyramid that doesn't survive contact with conversation-as-state.

Spec §12 locked two foundational eval-related decisions:

Q1 — self-host Langfuse from day 1 (no external cloud dependencies; data-residency posture for health-data tenants).
Q5 — $300/month eval-suite cost ceiling at PoC scale; aligned with the cost-discipline principle.

Phase B §4 settled the eval gate matrix: the eval suite is a blocking gate for prompt and FSM PRs, an advisory smoke eval runs for non-prompt code, and Renovate major bumps are blocking. The CI workflow at .github/workflows/eval-on-prompt-change.yml is the canonical gate-runner.

A3 §5 settled the WhatsApp button feedback loop: 👍/👎/💬 buttons captured as Langfuse scores; weekly job promotes negative-feedback conversations to candidate regression scenarios.

What ADR-0004 settles is the operational specifics A3 punted to follow-up: how the TenantScopedSaver gets consumed cleanly in eval fixtures, who owns the calibration set + at what cadence, the PII masking floor for production-replay datasets, the DeepEval cache invalidation policy, the bilingual judge mode, and the LLMRouter forward-compatibility shape that lets Phase 3 canary deployment plug in as an extension rather than a refactor.

This is the last ADR in the post-research chain (ADR-0002 / 0003 / 0005 / 0006 / 0007 / 0004). After this lands, the architectural foundation is complete and M3 implementation can begin.

Decision

Six specific decisions, organized as a coherent testing strategy.

1. Architectural recap (inherited, locked here)

Aspect	Where decided	Recap
Eval stack	A3 §1 + spec §12 Q1	DeepEval primary + self-hosted Langfuse v4 observability
Five-layer test pyramid	A3 §2	(1) FSM unit (mocked LLM); (2) full-turn integration (testcontainers + TenantScopedSaver); (3) transcript replay e2e; (4) golden-conversation snapshots (DeepEval); (5) production replay (Phase 2)
Golden-conversation YAML format	A3 §3	One file per scenario at `backend/tests/eval/conversations/scenarios/`; LLM responses recorded once into `tests/fixtures/llm_recordings/<scenario_id>.jsonl`; manual diff review on snapshot updates
Eval gate matrix	Phase B §4 + A3 §7	Prompts + FSM PRs blocking; non-prompt advisory smoke eval; Renovate majors blocking; `.github/workflows/eval-on-prompt-change.yml` is canonical
Cost ceiling	spec §12 Q5	`RATIBA_EVAL_BUDGET_USD=20` per PR; ~$300/month at PoC scale; cross-check sample auto-downgrades from 10% to 0% on budget breach
Customer feedback bridge	A3 §5	WhatsApp interactive reply buttons → Langfuse `create_score(name='user_feedback')` → weekly promotion of negative-feedback conversations to candidate regression scenarios
LLM-as-judge model	A3 §4	Claude Opus 4.7 primary judge + GPT-4 cross-check on 10% sample; Cohen's kappa target ≥ 0.7 against `human_labelled.yaml` calibration set
Bias mitigations	A3 §4	Position randomization in pairwise comparisons; rubric-based scoring via `ConversationalGEval`; calibration against human-labelled samples; multi-judge majority vote for failing scores; never same model as judge and generator in tight loop
Production eval phasing	A3 §6	Phase 1 = offline only; Phase 2 = + online sampling (every 6h, 5% sample); Phase 3 = + canary deployment (5% of conversations on new prompt for 24h, statistical comparison, auto-rollback)
Observability instrumentation	A3 §8 + Phase B §5	Per-conversation Langfuse trace; `tenant_id`, `conversation_id`, `prompt_version`, `model`, `cost_usd` propagated as metadata via `propagate_attributes`; structured-log JSONL files at `/var/log/ratiba/` as the durable substrate
Library pins	A3 §1	`deepeval >= 3.9` + `langfuse >= 4.0` in `[project.optional-dependencies] eval` group of `backend/pyproject.toml`

ADR-0004 builds on these without re-deriving them.

2. `tenant_scoped_eval_environment` pytest fixture — per-scenario fresh tenant

A3 §10 #2 surfaced the gap: the eval suite needs a clean way to spin up a TenantScopedSaver (per ADR-0002 D4 + ADR-0001 amendment) against a testcontainered Postgres without leaking state across scenarios.

Decision: per-scenario fresh tenant. Each YAML golden-conversation scenario gets its own ephemeral tenant schema, created at scenario start and dropped at scenario end. Maximum hermetic isolation; ~30-80ms overhead per scenario (negligible against the LLM-call costs that dominate).

Fixture contract:

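# backend/tests/eval/conftest.py (sketch)

from uuid import uuid4

import pytest

# create_tenant_schema, run_alembic_migrations, get_psycopg_conn,
# TenantScopedSaverFactory and build_test_orchestrator are project helpers.
# An async fixture under plain @pytest.fixture assumes pytest-asyncio in auto mode.


@pytest.fixture
async def tenant_scoped_eval_environment(scenario_id: str):
    """Spins up a fresh tenant schema for one scenario.

    Schema name: test_tenant_<scenario_id>_<run_id_8hex>
    Lifecycle:
      1. CREATE SCHEMA test_tenant_xxx
      2. SET search_path TO test_tenant_xxx
      3. alembic upgrade head --tenant=test_tenant_xxx
      4. PostgresSaver(get_psycopg_conn(test_tenant_xxx)).setup()
      5. tenant_id = await register_test_tenant(schema_name=test_tenant_xxx)
      6. saver = TenantScopedSaverFactory.for_tenant(tenant_id)
      7. orchestrator = build_test_orchestrator(saver, tenant_id)
      8. yield orchestrator
      9. DROP SCHEMA test_tenant_xxx CASCADE
    """
    # Random suffix: the leading hex digits of a UUIDv7 are timestamp bits,
    # so uuid7().hex[:8] would repeat for runs started close together.
    schema_name = f"test_tenant_{scenario_id}_{uuid4().hex[:8]}"
    async with create_tenant_schema(schema_name) as tenant_id:
        await run_alembic_migrations(schema_name)
        # Awaiting setup() assumes the async saver variant; the synchronous
        # PostgresSaver.setup() is called without await.
        await PostgresSaver(get_psycopg_conn(schema_name)).setup()
        saver = TenantScopedSaverFactory.for_tenant(tenant_id)
        orchestrator = await build_test_orchestrator(saver, tenant_id)
        yield orchestrator
        # cleanup (DROP SCHEMA ... CASCADE) happens via context manager exit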

Belt-and-braces at session start: a pytest_sessionstart hook queries pg_namespace for any leftover test_tenant_* schemas from prior crashed runs and drops them. Defends against the case where a previous test killed the worker before fixture cleanup ran.
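The session-start sweep described above could look roughly like the following. This is a minimal sketch, not the project's conftest: `drop_statements`, `sweep_leftover_schemas`, and the cursor argument (any DB-API style cursor, for example one from psycopg) are illustrative names assumed here. A real `pytest_sessionstart(session)` hook would open a connection to the testcontainered Postgres and pass its cursor in.

```python
import re

# Only names that look exactly like fixture-created schemas are ever dropped.
_TEST_SCHEMA = re.compile(r"test_tenant_[A-Za-z0-9_]+")

# '_' is a single-character wildcard in LIKE, so this query can over-match;
# the regex above is the authoritative filter.
LEFTOVER_QUERY = (
    "SELECT nspname FROM pg_namespace WHERE nspname LIKE 'test_tenant_%'"
)


def drop_statements(schema_names):
    """Build DROP statements for schemas left behind by crashed runs."""
    return [
        f'DROP SCHEMA IF EXISTS "{name}" CASCADE'
        for name in schema_names
        if _TEST_SCHEMA.fullmatch(name)
    ]


def sweep_leftover_schemas(cursor):
    """Drop leftover test schemas; returns how many were dropped."""
    cursor.execute(LEFTOVER_QUERY)
    names = [row[0] for row in cursor.fetchall()]
    statements = drop_statements(names)
    for statement in statements:
        cursor.execute(statement)
    return len(statements)
```

Validating each name against a strict pattern before interpolating it into the DROP statement keeps the sweep from touching any schema the fixture did not create.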

Why per-scenario over alternatives. Postgres CREATE SCHEMA + DROP SCHEMA CASCADE are fast on testcontainered Postgres. The hermetic-isolation guarantee — each scenario starts pristine, no cross-scenario state coupling — eliminates a class of bugs that's hard to debug when it surfaces (test passes alone, fails in suite). Session-wide-with-truncate (rejected) optimizes for a few seconds of total suite runtime at the cost of test interdependence; bad trade for a methodology-disciplined project.

3. Calibration set — ownership and refresh ritual

A3 §10 #4 surfaced the gap: A3 §4's recommended LLM-as-judge calibration relies on a growing set of human-rated conversations that, without explicit ownership and cadence, dies of neglect.

Decision: monthly cadence + ~10 new conversations + Adrian-owned + calendared.

Property	Value
File	`backend/tests/eval/calibration/human_labelled.yaml`
Owner	Adrian (single engineer in current state)
Ownership transition	PR amends ADR-0004 "calibration owner" field when second team member joins
Cadence	Monthly, first Monday of each month
Volume per session	~10 new PII-masked production transcripts
Rating shape	Each transcript scored 1-5 by a Swahili-fluent reviewer on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness
Trigger for promotion	A real production interaction that surprised either way (great handoff resolution worth eval-encoding, OR terrible classifier output worth regression-testing against)
Pairing	Aligned with quarterly model-pin review (Phase B §7) — the quarterly Monday is also a calibration-set Monday. Operational efficiency: one calendar block, two reviews.

Why monthly, not weekly or quarterly. Weekly risks falling behind reality and getting neglected (single-engineer time pressure). Quarterly creates a 3-month drift window where bad calibration data could compound without detection — bad bet for a system whose eval signal is load-bearing for prompt-PR gating.

Calibration kappa check runs as part of the monthly ritual: after adding the new conversations, re-run the LLM-as-judge against the full calibration set and compute Cohen's kappa between judge ratings and human ratings. Target: kappa ≥ 0.7 (substantial agreement threshold per A3 §4). Below 0.7, the judge isn't trustworthy: either reject the judge-model upgrade that caused the drop or recalibrate the rubrics.
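For reference, the kappa check can be sketched in a few lines of standard-library Python. This is an illustrative sketch under assumptions: `cohens_kappa` and `judge_is_trustworthy` are hypothetical names, the computation is the unweighted form of Cohen's kappa (a weighted variant may suit ordinal 1-5 scores better), and each rubric dimension would be checked separately.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Unweighted Cohen's kappa between two equal-length rating lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("rating lists must be non-empty and equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal rate per category.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        (human_counts[c] / n) * (judge_counts[c] / n)
        for c in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def judge_is_trustworthy(human, judge, threshold=0.7):
    """Apply the ADR's kappa >= 0.7 target to one rubric dimension."""
    return cohens_kappa(human, judge) >= threshold
```

Perfect agreement yields 1.0 and agreement no better than chance yields 0.0, so the 0.7 target sits well above chance.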

After ~12 months of monthly additions, the calibration set has ~120 conversations — meaningful sample for the LLM-as-judge calibration.

4. Production-replay PII masking floor + future privacy ADR reference

A3 §10 #6 surfaced this: when negative-feedback conversations get promoted from Langfuse into tests/eval/conversations/from_production/ (per A3 §5 weekly job), what gets masked? The masking floor is concrete and load-bearing; the vertical-specific rules belong to a future privacy ADR.

Masking floor (locked in this ADR):

# backend/app/eval/from_production/promote_to_scenario.py (sketch)

from backend.app.observability.redact import redactor
from backend.app.eval.from_production import redact_extras

async def mask_for_eval(
    conversation_jsonl_blob: str,
    tenant: TenantContext,
) -> str:
    """Apply masking floor + per-vertical extras before promotion."""
    # Floor: same redactor as Phase B §5.4 (single source of truth)
    floor_masked = await redactor.apply(conversation_jsonl_blob)

    # Per-vertical extras (TODO until privacy ADR fills this in)
    if tenant.vertical in ("dental", "physio", "medical", "legal"):
        floor_masked = await redact_extras.apply(floor_masked, tenant.vertical)

    return floor_masked

Floor rules (single source of truth at backend/app/observability/redact.py, shared with Phase B §5.4 structured-log redaction):

Phone numbers → masked to last-4 digits (+254 7** **5 432)
Customer names → replaced with <name> token
Free-text customer messages → numeric runs of length 6+ masked (likely IDs/account numbers); email patterns masked; full names matched against identity table → masked
Money fields (KES amount, M-Pesa receipt, PesaPal OrderTrackingId) → kept verbatim (load-bearing for "double-charge" diagnostic recipes per Phase B §5.6)
Tool args → hashed (sha256), full args go to Langfuse trace if needed for inspection
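A few of the floor rules above can be illustrated with standard-library Python. This is a hedged sketch, not the contents of backend/app/observability/redact.py: the function names, the `<email>` and `<number>` tokens, and the all-asterisk phone format are assumptions made for illustration, and name matching against the identity table is omitted.

```python
import hashlib
import json
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGIT_RUN = re.compile(r"\d{6,}")  # likely IDs or account numbers


def mask_phone(number: str) -> str:
    """Keep only the last four digits of a phone number."""
    digits = re.sub(r"\D", "", number)
    return "*" * max(len(digits) - 4, 0) + digits[-4:]


def mask_free_text(text: str) -> str:
    """Mask emails first, then any remaining run of six or more digits."""
    text = EMAIL.sub("<email>", text)
    return LONG_DIGIT_RUN.sub("<number>", text)


def hash_tool_args(args: dict) -> str:
    """Replace tool arguments with a stable SHA-256 of their canonical JSON."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Masking emails before digit runs avoids mangling an address whose local part happens to contain a long number.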

Vertical-specific extras (TODO until future privacy ADR):

backend/app/eval/from_production/redact_extras.py exists as a TODO-marker module — the floor handles spa/salon/barbershop tenants fully; the file documents the gap explicitly:

# backend/app/eval/from_production/redact_extras.py

async def apply(blob: str, vertical: str) -> str:
    """
    Vertical-specific PII masking extras.

    Currently a hard block: raises NotImplementedError. Implementation is
    deferred to the future privacy ADR, which will analyse Kenya DPA +
    EU AI Act + sector-specific health regulations.

    Tenants with vertical in (dental, physio, medical, legal) are
    BLOCKED from production-replay promotion until this is filled in.
    """
    raise NotImplementedError(
        f"Vertical '{vertical}' requires the future privacy ADR's "
        f"masking rules. Promotion blocked until those are defined."
    )

Effect: Phase 1 production-replay works fully for spa / salon / barbershop tenants. Vertical-specific tenants need the privacy ADR before their conversations can be promoted to regression scenarios — acceptable Phase 1 constraint; surfaces the compliance gap as a hard-blocking error rather than silently shipping unmasked data.

5. DeepEval cache invalidation — 4-tuple cache key

A3 §10 #7 surfaced this: DeepEval caches LLM-judge calls between runs to make re-runs cheap. The cache key needs to invalidate on every input that could change the result.

Cache key: SHA256 of canonicalized JSON 4-tuple:

# backend/tests/eval/cache_key.py (sketch)

import hashlib
import json

def deepeval_cache_key(
    scenario_id: str,
    prompt_version: str,         # e.g., "booking_orchestrator@a1b2c3d"
    judge_model_version: str,    # e.g., "claude-opus-4.7"
    metric_version: str,         # e.g., "swahili_fluency:1.0.0"
) -> str:
    canonical = json.dumps({
        "scenario_id": scenario_id,
        "prompt_version": prompt_version,
        "judge_model_version": judge_model_version,
        "metric_version": metric_version,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

Each component represents a distinct change-cause that MUST invalidate the cached judge call:

Component	Source	Invalidation trigger
`scenario_id`	YAML scenario file `id:` field	Different scenario = different conversation
`prompt_version`	Per Phase B §3 — `<prompt_name>@<git-short-sha>` computed at startup from prompt file's git history	Application prompt changed → potentially different model output → potentially different judge score
`judge_model_version`	Provider model identifier string (e.g., `claude-opus-4.7`)	Judge model upgrade may rate the same output differently
`metric_version`	`__version__` constant exported by each custom metric module	Custom-metric rubric changed → potentially different score

__version__ convention for each custom metric module:

# backend/tests/eval/metrics/swahili_fluency.py
from deepeval.metrics import GEval

__version__ = "1.0.0"  # bump on rubric change

class SwahiliFluencyMetric(GEval):
    ...

Cache location. .deepeval-cache/ at the project root, gitignored. Persists across runs locally; CI runs from cold cache by default (eval gate timing assumes worst-case cold-cache for the eval-budget ceiling per Q5).

Override flags:

pytest --no-deepeval-cache — forced re-run; useful for debugging or when verifying a judge-model swap behaves consistently with the cached pre-swap scores.
pytest --deepeval-cache-clear — clears .deepeval-cache/ before the run.
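The two override flags could be registered in conftest.py along these lines. This is a minimal sketch under assumptions: `clear_cache_if_requested` and `CACHE_DIR` are hypothetical names, and how the eval runner actually consults the flags is not specified by this ADR. `parser.addoption` is pytest's standard hook API for custom command-line options.

```python
import shutil
from pathlib import Path

CACHE_DIR = Path(".deepeval-cache")  # project-root cache location


def pytest_addoption(parser):
    """Register the two cache override flags with pytest."""
    parser.addoption(
        "--no-deepeval-cache",
        action="store_true",
        default=False,
        help="Bypass cached judge results and force a re-run.",
    )
    parser.addoption(
        "--deepeval-cache-clear",
        action="store_true",
        default=False,
        help="Delete .deepeval-cache/ before the run.",
    )


def clear_cache_if_requested(clear_requested: bool, cache_dir: Path = CACHE_DIR) -> bool:
    """Remove the cache directory when asked; returns True if it was removed."""
    if clear_requested and cache_dir.exists():
        shutil.rmtree(cache_dir)
        return True
    return False
```

A session-scoped hook or fixture could then read the flags with `config.getoption("--deepeval-cache-clear")` and call the helper before any scenario runs.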

Why the 4-tuple over alternatives. Coarser (3-tuple, drops metric_version) is risky in a methodology-disciplined project — manual "remember to invalidate when metric changes" gets skipped under deadline pressure → stale eval results mask the regressions the suite exists to catch. Finer (5-tuple, adding metric_threshold) is over-engineering — threshold is the pass/fail line, not an input to the score itself.

6. Bilingual judge mode — language-specific fluency metrics

A3 §10 #8 surfaced this: do we use one fluency metric per language (separate SwahiliFluencyMetric and EnglishFluencyMetric) or one combined BilingualFluencyMetric that auto-detects language and applies the appropriate rubric internally?

Decision: language-specific metrics. Separate modules per language; YAML scenario language: field gates which metric runs.

backend/tests/eval/metrics/
  swahili_fluency.py      # SwahiliFluencyMetric, __version__, Swahili rubric
  english_fluency.py      # EnglishFluencyMetric, __version__, Kenyan-English rubric
  mpesa_payment_safety.py # MpesaPaymentSafetyMetric, language-agnostic
  booking_slot_consistency.py # BookingSlotConsistencyMetric, language-agnostic

Swahili rubric (SwahiliFluencyMetric):

Idiomatic Swahili appropriate for service-business booking context
Avoids Anglicisms unless conventional ("spa", "booking" are conventional; "reservation" is not)
Handles code-switches gracefully (customer mixing English + Swahili mid-message)
Cultural appropriateness for Kenyan SMB context (e.g., "samahani" vs "pole" register)

English rubric (EnglishFluencyMetric):

Natural Kenyan-English register (avoids translated American English)
SMB-context formality (warmer than corporate; more direct than hospitality-luxury)

YAML scenario gating:

# backend/tests/eval/conversations/scenarios/spa_booking_swahili_happy.yaml
id: spa_booking_swahili_happy
language: sw       # gates SwahiliFluencyMetric
assertions:
  - metric: SwahiliFluencyMetric
    threshold: 0.80
  - metric: BookingSlotConsistencyMetric  # language-agnostic
    threshold: 1.0

Optional auto-attach in conftest.py:

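def attach_language_metrics(scenario):
    """Auto-attach language-appropriate fluency metric.

    Test authors don't have to remember which fluency metric to
    specify; the scenario's `language:` field drives the choice.
    """
    fluency_by_language = {"sw": SwahiliFluencyMetric, "en": EnglishFluencyMetric}
    metric_cls = fluency_by_language.get(scenario.language)
    if metric_cls is None:
        return
    # Skip if the scenario already declares this metric explicitly
    # (assumes YAML assertions have been resolved to metric instances).
    if any(isinstance(a, metric_cls) for a in scenario.assertions):
        return
    scenario.assertions.append(metric_cls())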

Why language-specific over combined. A3 §4 emphasized that the Swahili rubric is materially different from the English rubric; combining them inside one metric hides the per-language rubric handling and makes PR review harder ("which language's rubric did this PR change?"). Two modules with separate __version__ constants (per D5) mean that iterating on the Swahili rubric doesn't invalidate cached English-scenario scores, so the two languages get independent iteration cycles.

Language-agnostic single metric (rejected) loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness) — exactly the load-bearing differentiator for Ratiba's product.

7. LLMRouter forward-compatibility for Phase 3 canary

A3 §10 #3 surfaced this: A3 §6's Phase 3 plan introduces canary deployment for new prompt versions. The "prompt assignment service" that decides which prompt version each conversation uses needs to be plug-in-able into the existing LLMRouter (per ADR-0005 D4) without a refactor when Phase 3 lands.

Decision: commit to the deterministic-hash mechanism in the ADR now, sketch the architecture, do not implement until Phase 3.

Phase 1 LLMRouter signature (forward-compatible):

# backend/app/orchestrator/llm/router.py (Phase 1)

from uuid import UUID

async def call_for_role(
    role: str,
    tenant_id: UUID,
    conversation_id: UUID,  # NEW: unused in Phase 1, locked for Phase 3
    system_prompt: str,
    user_prompt: str,
    schema: dict | None = None,
    max_tokens: int | None = None,
) -> LLMResponse:
    # Phase 1: always uses @stable alias from prompts/aliases.yaml.
    # This is the seam where the Phase 3 canary resolver plugs in.
    prompt_version = resolve_prompt_alias(role, "stable")
    assignment = role_assignments[role]
    provider = providers[assignment["provider"]]
    return await provider.call(
        model=assignment["model"],
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        schema=schema,
    )

Phase 3 extension hook (NOT implemented in Phase 1):

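# Phase 3: extension hook (sketch in ADR for forward-compat)

import hashlib
from uuid import UUID

# Assumes an async Redis client created with decode_responses=True,
# so hash fields come back as str rather than bytes.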

async def _resolve_prompt_for_canary(
    role: str,
    tenant_id: UUID,
    conversation_id: UUID,
) -> str:
    """Resolve prompt version via deterministic-hash canary routing.

    Phase 3: plugged into call_for_role between line N and line M.
    """
    rollout = await redis.hgetall(f"ratiba:canary:rollout:{role}")
    if not rollout or int(rollout.get("percentage", 0)) == 0:
        return resolve_prompt_alias(role, "stable")
    bucket = int(
        hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest(),
        16,
    ) % 100
    if bucket < int(rollout["percentage"]):
        return resolve_prompt_alias(role, "canary")
    return resolve_prompt_alias(role, "stable")

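The Phase 1 cost is two parameters that the router signature carries unused for months: conversation_id (new) and tenant_id, the two inputs to the canary hash (neither is read in the Phase 1 sketch above). The Phase 3 win is that canary deployment becomes a ~50-line extension to one function rather than a refactor that touches every call site.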

Why deterministic-hash over alternatives. It's the canonical industry pattern (used by feature-flagging systems and A/B testing frameworks); it is provider-neutral; it uses existing Redis (no-cloud-dependencies principle); and it plays cleanly with the LLMRouter from ADR-0005 D4. Langfuse experiment-routing (rejected) is a vendor-specific bet on a feature that may not ship in time. Per-tenant opt-in (rejected) loses the gradual-rollout property: it would deploy canary to whole tenants instead of 5% of any tenant's traffic, which is bad statistical design.
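The bucketing step of the deterministic-hash mechanism can be isolated as a pure function, which makes its two key properties easy to check: a conversation always lands in the same bucket, and buckets spread roughly evenly. The sketch below reuses the hash expression from the Phase 3 hook; `canary_bucket` and `uses_canary` are hypothetical helper names, not part of the planned router.

```python
import hashlib
from uuid import UUID


def canary_bucket(tenant_id: UUID, conversation_id: UUID) -> int:
    """Map a conversation to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def uses_canary(tenant_id: UUID, conversation_id: UUID, percentage: int) -> bool:
    """True when the conversation falls inside the rollout percentage."""
    return canary_bucket(tenant_id, conversation_id) < percentage
```

Because the bucket depends only on the two IDs, every turn of a conversation resolves to the same prompt version without storing any assignment state, and raising the percentage only adds conversations to the canary group without reshuffling existing ones.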

Consequences

Positive.

Hermetic test isolation eliminates cross-scenario state leak as a bug class (D2). The ~50ms overhead per scenario is dwarfed by LLM-call costs that dominate eval-suite runtime.
Calibration ritual is calendared (D3), aligned with quarterly model-pin review for operational efficiency. Won't die of neglect.
PII masking reuses single-source-of-truth redactor (D4) from Phase B §5.4. Same rules across structured logs + Langfuse traces + production-replay datasets.
Cache key catches every change-cause (D5). No stale eval results that mask regressions — every relevant input is part of the cache key.
Language-specific metrics enable Swahili-tuned iteration without invalidating English-scenario cached scores (D6). Independent iteration cycles for the two languages.
Phase 3 canary is a small extension, not a refactor (D7). Two unused parameters in Phase 1 LLMRouter signature is the entire forward-compatibility cost.
backend/tests/eval/calibration/human_labelled.yaml becomes a long-term asset. ~120 conversations after a year of monthly ritual; meaningful calibration sample.

Negative.

Per-scenario tenant overhead (~50ms × ~50 scenarios = ~2.5s per full eval run) is real but small against the ~15-minute full-suite runtime. Acceptable trade for hermetic isolation.
Calibration ritual creates a recurring time-block on Adrian's calendar. Mitigation: paired with quarterly model-pin review; small monthly cost; explicit owner in the ADR makes it non-skippable.
Vertical-specific PII rules are a TODO until the future privacy ADR. Production-replay BLOCKS for dental / physio / medical / legal tenants until then. Acceptable Phase 1 constraint; surfaces the compliance gap as a hard error.
LLMRouter signature carries conversation_id parameter unused for months until Phase 3. Trivial cost; zero runtime impact.
__version__ convention on custom metrics requires discipline: bumping the version when changing a rubric is a manual step. Mitigation: PR template includes a "did you bump __version__?" checklist item for changes under backend/tests/eval/metrics/.
DeepEval cache directory .deepeval-cache/ adds another gitignored artifact to manage. Standard pattern; trivial.

Neutral.

Manual diff review of golden-conversation snapshots (per A3 §3) remains the load-bearing review step. Not auto-accept.
Conftest auto-attach for language-specific metrics is convenience, not contract. Test authors can skip it and specify metrics explicitly in the YAML.
GPT-4 cross-check on 10% of eval runs (per A3 §4) is the bias-mitigation mechanism for self-preference; tracks Cohen's kappa between Claude judge and GPT-4 judge as a drift signal.
Phase 3 LLMRouter sketch is informational only — the actual implementation lands when canary work begins, and the sketch may be refined at that point based on additional context.

Alternatives Considered

Alternative	Rejected because
Session-wide tenant + per-test truncate for the eval fixture.	Loses the hermetic isolation guarantee; tests become coupled (one test creates a row another test depends on, passes alone, fails in suite). The bug class is real and hard to debug.
Per-class tenant + per-method truncate.	Middle-ground compromise that gives up the isolation guarantee for marginal performance gain. Per-scenario fresh tenant is cheap enough that the compromise isn't worth it.
Weekly calibration cadence (~3 conversations/week).	Risks falling behind reality and getting neglected (single-engineer time pressure). The "first Monday of each month" calendar slot is sustainable; weekly time-blocks slip.
Quarterly calibration cadence (~30 conversations/quarter).	3-month drift window where bad calibration data could compound without detection. Bad bet for a system whose eval signal is load-bearing for prompt-PR gating.
Inline full PII policy in ADR-0004 including vertical rules.	Compliance scope bloat — vertical-specific rules belong to a privacy ADR alongside Kenya DPA + EU AI Act + sector-specific health regs analysis. ADR-0004 ships the floor + TODO marker; privacy ADR fills the marker.
Defer PII handling entirely to the future privacy ADR.	Blocks the production-replay code path indefinitely (A3 §5 weekly job has no spec for masking). Floor + TODO marker is the right Phase 1 posture.
Coarser 3-tuple cache key (drop `metric_version`).	Manual "remember to invalidate when metric changes" gets skipped under deadline pressure → stale eval results mask the regressions the suite exists to catch. Bad pattern in a methodology-disciplined project.
Finer 5-tuple cache key (add `metric_threshold`).	Threshold is the pass/fail line, not an input to the score itself. Same score crossed against different thresholds = same cache value, different assertion outcome. No need to invalidate.
Combined `BilingualFluencyMetric` that auto-detects language internally.	Hides per-language rubric handling — opaque to PR review ("which language's rubric did this PR change?"). Two metric modules with separate `__version__` enable independent iteration.
Single language-agnostic `FluencyMetric`.	Loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness). Ratiba's Swahili-quality differentiator is structural; flat fluency metric flattens the differentiator.
Langfuse experiment-routing primitive for Phase 3 canary.	Vendor-specific bet on a feature that may not exist when Phase 3 needs it (months out). Deterministic-hash pattern is provider-neutral and uses existing Redis.
Per-tenant opt-in for canary (each tenant explicitly enrolled via dashboard).	Loses the gradual-rollout property — canary deploys to whole tenants instead of 5% of any tenant's traffic. Bad statistical design (no within-tenant control group).
Defer Phase 3 LLMRouter signature entirely (no `conversation_id` parameter in Phase 1).	Phase 3 work becomes a refactor that touches every LLMRouter call site instead of a 50-line extension. Avoidable for ~10 lines of forward-compat plumbing in Phase 1.
Skip cross-check sample on full eval runs (Claude judge only, no GPT-4 cross-check).	Loses the bias-mitigation signal for self-preference (Claude generator + Claude judge in tight loop). 10% sample is the small cost for the drift signal.

References

docs/prd/ratiba-prd.md — §1.4 conversational thesis, §4 Modules 7-9 (orchestrator + scheduling + admin)
docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — pinned Python 3.13; library-currency policy
docs/adr/ADR-0002-multi-tenant-isolation.md — D4 TenantScopedSaver via per-tenant micro-pools; D7 asyncio contextvar tenant propagation; D3 per-tenant Alembic invocation (used by fixture's alembic upgrade head --tenant=test_tenant_xxx)
docs/adr/ADR-0003-fsm-persistence.md — conversation_threads pointer table; LangGraph checkpoint tables (created in fixture by PostgresSaver.setup())
docs/adr/ADR-0005-orchestration-model.md — D4 LLMRouter + role-assignments YAML (extended in this ADR D7 with conversation_id parameter); D6 directory-per-tool registry (eval scenarios assert against tool_calls[] per turn)
docs/adr/ADR-0006-handoff-model.md — handoff_log table (eval scenarios for handoff flow read from this); D9 briefing card schema (eval scenarios validate brief generation)
docs/adr/ADR-0007-payments-orchestration.md — PAYMENT_CANCELLED_BY_CUSTOMER state (eval scenarios cover cancellation flow); D9 reversal logic (eval scenarios validate auto-reverse on STK callback to cancelled payment)
docs/research/2026-04-25-langgraph-postgressaver-spike.md — TenantScopedSaver wrapper used by fixture (D2)
docs/research/2026-04-25-orchestration-patterns.md — A1 §3 FSM states (eval scenarios assert on FSM transitions)
docs/research/2026-04-25-eval-frameworks.md — A3 (heavy use throughout; this ADR locks A3's open questions §10)
docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §3 briefing card (eval scenarios validate brief generation)
docs/research/2026-04-25-payments-orchestration.md — A4 §1 payment lifecycle (eval scenarios cover both rails)
docs/methodology/agentic-development.md — Phase B §3 prompt storage (prompt_version in cache key per D5); §4 eval gate matrix (CI workflow); §5 auto-debug logging schema (event_type enum used in eval-suite log assertions); §6 delegate-vs-human-review boundaries (eval-suite changes are human-review per prompt-eval-reviewer agent)
docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — Q1 (self-host Langfuse), Q5 ($300/month eval ceiling) anchors
~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_no_cloud_dependencies.md — drives D7 deterministic-hash over Langfuse experiment-routing
~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_cost_discipline.md — drives D5 cache key (avoid stale results that hide regressions → cost of missed bugs in production)
DeepEval documentation — ConversationalGEval, ConversationSimulator, ConversationalGolden primitives; cache mechanism
Langfuse v4 SDK documentation — OTel-native traces; propagate_attributes; create_score for feedback loop
Cohen's kappa interpretation thresholds — Landis & Koch 1977 (substantial agreement = 0.61-0.80; ≥ 0.7 is the conservative threshold for substantial)
