ADR-0004: Testing Strategy Under Conversation-as-State
Status: Accepted
Date: 2026-04-25
Context
Ratiba's "conversation state IS canonical state for in-flight bookings"
distinction (project memory, PRD §1.4) means ordinary pytest against
SQL fixtures cannot cover the booking flow. The orchestrator's
correctness is partly probabilistic (LLM outputs); its state lives in
LangGraph checkpoints across two stores; testing must verify both
deterministic FSM transitions and conversational quality.
A3 (docs/research/2026-04-25-eval-frameworks.md) settled the
high-level eval architecture: DeepEval as primary eval runner +
self-hosted Langfuse v4 as observability backbone, prompt CMS, and
customer-feedback collector. The five-layer test pyramid (FSM unit,
full-turn integration, transcript replay, golden-conversation
snapshots, production replay) replaces the classical pyramid that
doesn't survive contact with conversation-as-state.
Spec §12 locked two foundational eval-related decisions:
- Q1 — self-host Langfuse from day 1 (no external cloud dependencies; data-residency posture for health-data tenants).
- Q5 — $300/month eval-suite cost ceiling at PoC scale; aligned with the cost-discipline principle.
Phase B §4 settled the eval gate matrix: the eval suite is a blocking
gate for prompt and FSM PRs, non-prompt code changes get an advisory
smoke eval, and Renovate major bumps are blocking. The CI workflow at
.github/workflows/eval-on-prompt-change.yml is the canonical
gate-runner.
A3 §5 settled the WhatsApp button feedback loop: 👍/👎/💬 buttons captured as Langfuse scores; weekly job promotes negative-feedback conversations to candidate regression scenarios.
What ADR-0004 settles is the operational specifics A3 punted to follow-up: how the TenantScopedSaver gets consumed cleanly in eval fixtures, who owns the calibration set + at what cadence, the PII masking floor for production-replay datasets, the DeepEval cache invalidation policy, the bilingual judge mode, and the LLMRouter forward-compatibility shape that lets Phase 3 canary deployment plug in as an extension rather than a refactor.
This is the last ADR in the post-research chain (ADR-0002 / 0003 / 0005 / 0006 / 0007 / 0004). After this lands, the architectural foundation is complete and M3 implementation can begin.
Decision
Six specific decisions, organized as a coherent testing strategy.
1. Architectural recap (inherited, locked here)
| Aspect | Where decided | Recap |
|---|---|---|
| Eval stack | A3 §1 + spec §12 Q1 | DeepEval primary + self-hosted Langfuse v4 observability |
| Five-layer test pyramid | A3 §2 | (1) FSM unit (mocked LLM); (2) full-turn integration (testcontainers + TenantScopedSaver); (3) transcript replay e2e; (4) golden-conversation snapshots (DeepEval); (5) production replay (Phase 2) |
| Golden-conversation YAML format | A3 §3 | One file per scenario at backend/tests/eval/conversations/scenarios/; LLM responses recorded once into tests/fixtures/llm_recordings/<scenario_id>.jsonl; manual diff review on snapshot updates |
| Eval gate matrix | Phase B §4 + A3 §7 | Prompts + FSM PRs blocking; non-prompt advisory smoke eval; Renovate majors blocking; .github/workflows/eval-on-prompt-change.yml is canonical |
| Cost ceiling | spec §12 Q5 | RATIBA_EVAL_BUDGET_USD=20 per PR; ~$300/month at PoC scale; cross-check sample auto-downgrades from 10% to 0% on budget breach |
| Customer feedback bridge | A3 §5 | WhatsApp interactive reply buttons → Langfuse create_score(name='user_feedback') → weekly promotion of negative-feedback conversations to candidate regression scenarios |
| LLM-as-judge model | A3 §4 | Claude Opus 4.7 primary judge + GPT-4 cross-check on 10% sample; Cohen's kappa target ≥ 0.7 against human_labelled.yaml calibration set |
| Bias mitigations | A3 §4 | Position randomization in pairwise comparisons; rubric-based scoring via ConversationalGEval; calibration against human-labelled samples; multi-judge majority vote for failing scores; never same model as judge and generator in tight loop |
| Production eval phasing | A3 §6 | Phase 1 = offline only; Phase 2 = + online sampling (every 6h, 5% sample); Phase 3 = + canary deployment (5% of conversations on new prompt for 24h, statistical comparison, auto-rollback) |
| Observability instrumentation | A3 §8 + Phase B §5 | Per-conversation Langfuse trace; tenant_id, conversation_id, prompt_version, model, cost_usd propagated as metadata via propagate_attributes; structured-log JSONL files at /var/log/ratiba/ as the durable substrate |
| Library pins | A3 §1 | deepeval >= 3.9 + langfuse >= 4.0 in [project.optional-dependencies] eval group of backend/pyproject.toml |
ADR-0004 builds on these without re-deriving them.
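The cost-ceiling row can be read as a small guard in the eval runner. The following is a minimal sketch, assuming the runner tracks its running spend; `eval_budget_usd` and `cross_check_sample_rate` are hypothetical names, and only `RATIBA_EVAL_BUDGET_USD` and the 10% to 0% downgrade come from the recap table.

```python
import os

DEFAULT_CROSS_CHECK_RATE = 0.10  # 10% cross-check sample (A3 §4)


def eval_budget_usd(env=os.environ) -> float:
    """Read the per-PR budget; 20 is the value quoted in the recap table."""
    return float(env.get("RATIBA_EVAL_BUDGET_USD", "20"))


def cross_check_sample_rate(spent_usd: float, budget_usd: float) -> float:
    """Return the cross-check sampling rate, dropping to 0% once spend exceeds the budget."""
    if spent_usd > budget_usd:
        return 0.0
    return DEFAULT_CROSS_CHECK_RATE
```

Whether a breach means "exceeds" or "reaches" the budget is not stated in the source; the sketch assumes "exceeds".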
2. tenant_scoped_eval_environment pytest fixture — per-scenario fresh tenant
A3 §10 #2 surfaced the gap: the eval suite needs a clean way to spin
up a TenantScopedSaver (per ADR-0002 D4 + ADR-0001 amendment) against
a testcontainered Postgres without leaking state across scenarios.
Decision: per-scenario fresh tenant. Each YAML golden-conversation scenario gets its own ephemeral tenant schema, created at scenario start and dropped at scenario end. Maximum hermetic isolation; ~30-80ms overhead per scenario (negligible against the LLM-call costs that dominate).
Fixture contract:
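    # backend/tests/eval/conftest.py (sketch)
    # Project helpers (create_tenant_schema, run_alembic_migrations,
    # get_psycopg_conn, TenantScopedSaverFactory, build_test_orchestrator)
    # come from the application package; their import paths are omitted here.
    import uuid

    import pytest
    from langgraph.checkpoint.postgres import PostgresSaver


    # Assumes pytest-asyncio in auto mode; strict mode needs @pytest_asyncio.fixture.
    @pytest.fixture
    async def tenant_scoped_eval_environment(scenario_id: str):
        """Spins up a fresh tenant schema for one scenario.

        Schema name: test_tenant_<scenario_id>_<run_id_8hex>

        Lifecycle:
        1. CREATE SCHEMA test_tenant_xxx
        2. SET search_path TO test_tenant_xxx
        3. alembic upgrade head --tenant=test_tenant_xxx
        4. PostgresSaver(get_psycopg_conn(test_tenant_xxx)).setup()
        5. tenant_id = await register_test_tenant(schema_name=test_tenant_xxx)
        6. saver = TenantScopedSaverFactory.for_tenant(tenant_id)
        7. orchestrator = build_test_orchestrator(saver, tenant_id)
        8. yield orchestrator
        9. DROP SCHEMA test_tenant_xxx CASCADE
        """
        # Random suffix: the leading hex digits of a UUIDv7 are timestamp bits,
        # so uuid7().hex[:8] repeats for runs started within about a minute.
        schema_name = f"test_tenant_{scenario_id}_{uuid.uuid4().hex[:8]}"
        async with create_tenant_schema(schema_name) as tenant_id:
            await run_alembic_migrations(schema_name)
            # PostgresSaver.setup() is synchronous, so it is not awaited.
            PostgresSaver(get_psycopg_conn(schema_name)).setup()
            saver = TenantScopedSaverFactory.for_tenant(tenant_id)
            orchestrator = await build_test_orchestrator(saver, tenant_id)
            yield orchestrator
        # cleanup (DROP SCHEMA ... CASCADE) happens via context manager exit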
Belt-and-braces at session start: a pytest_sessionstart hook
queries pg_namespace for any leftover test_tenant_* schemas from
prior crashed runs and drops them. Defends against the case where a
previous test killed the worker before fixture cleanup ran.
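The session-start cleanup can be sketched as a pure helper that a `pytest_sessionstart` hook would call. This is an illustration, not the project's code: `TEST_SCHEMA_RE`, `LIST_SCHEMAS_SQL`, and `drop_statements` are hypothetical names, and the database call itself is left out so the filtering logic stays testable without Postgres.

```python
import re

# Schema names produced by the fixture: test_tenant_<scenario_id>_<run_id_8hex>
TEST_SCHEMA_RE = re.compile(r"test_tenant_[A-Za-z0-9_]+")

# A SQL LIKE filter would need its underscores escaped (they are
# single-character wildcards), so this sketch filters in Python instead.
LIST_SCHEMAS_SQL = "SELECT nspname FROM pg_namespace"


def drop_statements(schema_names):
    """Build DROP statements for leftover test schemas only."""
    leftovers = sorted(n for n in schema_names if TEST_SCHEMA_RE.fullmatch(n))
    # fullmatch restricts names to letters, digits and underscores,
    # so double-quoting them here is safe.
    return [f'DROP SCHEMA IF EXISTS "{name}" CASCADE' for name in leftovers]
```

In the hook, the rows returned by `LIST_SCHEMAS_SQL` would be passed to `drop_statements` and each statement executed. If runs can overlap on one database (for example parallel workers), the cleanup would also need a guard so that it does not drop schemas still in use.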
Why per-scenario over alternatives. Postgres CREATE SCHEMA +
DROP SCHEMA CASCADE are fast on testcontainered Postgres. The
hermetic-isolation guarantee — each scenario starts pristine, no
cross-scenario state coupling — eliminates a class of bugs that's
hard to debug when it surfaces (test passes alone, fails in suite).
Session-wide-with-truncate (rejected) optimizes for a few seconds of
total suite runtime at the cost of test interdependence; bad trade
for a methodology-disciplined project.
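One practical detail of the `test_tenant_<scenario_id>_<run_id_8hex>` naming is that PostgreSQL identifiers are limited to 63 bytes, so a long scenario id could silently truncate away the run id. A minimal sketch of a name builder that guards against this follows; `build_schema_name` is a hypothetical helper, not something the fixture contract defines.

```python
import re
import uuid

PG_IDENTIFIER_MAX_BYTES = 63  # PostgreSQL truncates longer identifiers
PREFIX = "test_tenant_"


def build_schema_name(scenario_id: str, run_id: str = "") -> str:
    """Return test_tenant_<scenario_id>_<run_id_8hex>, safe to use unquoted."""
    run_id = run_id or uuid.uuid4().hex[:8]
    # Keep only lowercase letters, digits and underscores.
    slug = re.sub(r"[^a-z0-9_]", "_", scenario_id.lower())
    # Trim the scenario part, never the run id, to stay within the limit.
    room = PG_IDENTIFIER_MAX_BYTES - len(PREFIX) - len(run_id) - 1
    return f"{PREFIX}{slug[:room]}_{run_id}"
```

Trimming the scenario part rather than the suffix keeps the per-run uniqueness intact even for very long scenario ids.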
3. Calibration set — ownership and refresh ritual
A3 §10 #4 surfaced the gap: A3 §4's recommended LLM-as-judge calibration relies on a growing set of human-rated conversations that, without explicit ownership and cadence, dies of neglect.
Decision: monthly cadence + ~10 new conversations + Adrian-owned + calendared.
| Property | Value |
|---|---|
| File | backend/tests/eval/calibration/human_labelled.yaml |
| Owner | Adrian (single engineer in current state) |
| Ownership transition | PR amends ADR-0004 "calibration owner" field when second team member joins |
| Cadence | Monthly, first Monday of each month |
| Volume per session | ~10 new PII-masked production transcripts |
| Rating shape | Each transcript scored 1-5 by a Swahili-fluent reviewer on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness |
| Trigger for promotion | A real production interaction that surprised either way (great handoff resolution worth eval-encoding, OR terrible classifier output worth regression-testing against) |
| Pairing | Aligned with quarterly model-pin review (Phase B §7) — the quarterly Monday is also a calibration-set Monday. Operational efficiency: one calendar block, two reviews. |
Why monthly, not weekly or quarterly. Weekly risks falling behind reality and getting neglected (single-engineer time pressure). Quarterly creates a 3-month drift window where bad calibration data could compound without detection — bad bet for a system whose eval signal is load-bearing for prompt-PR gating.
Calibration kappa check runs as part of the monthly ritual: after adding the new conversations, re-run the LLM-as-judge against the full calibration set and compute Cohen's kappa between judge ratings and human ratings. Target: kappa ≥ 0.7 (substantial agreement threshold per A3 §4). Below 0.7, the judge is not trustworthy: either reject the judge-model upgrade under review or recalibrate the rubrics.
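For reference, the kappa check reduces to a few lines. The sketch below computes unweighted Cohen's kappa between human and judge ratings for one rubric dimension; `cohens_kappa` is an illustrative helper, and treating the 1-5 scores as unordered categories is an assumption (a weighted kappa may suit ordinal scores better).

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Unweighted Cohen's kappa for two equally long lists of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = human_counts.keys() | judge_counts.keys()
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


KAPPA_TARGET = 0.7


def judge_is_trustworthy(human, judge) -> bool:
    return cohens_kappa(human, judge) >= KAPPA_TARGET
```

With three rubric dimensions per transcript, the check would presumably run once per dimension.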
After ~12 months of monthly additions, the calibration set has ~120 conversations — meaningful sample for the LLM-as-judge calibration.
4. Production-replay PII masking floor + future privacy ADR reference
A3 §10 #6 surfaced this: when negative-feedback conversations get
promoted from Langfuse into tests/eval/conversations/from_production/
(per A3 §5 weekly job), what gets masked? The masking floor is concrete
and load-bearing; the vertical-specific rules belong to a future
privacy ADR.
Masking floor (locked in this ADR):
    # backend/app/eval/from_production/promote_to_scenario.py (sketch)
    from backend.app.observability.redact import redactor
    from backend.app.eval.from_production import redact_extras


    async def mask_for_eval(
        conversation_jsonl_blob: str,
        tenant: TenantContext,
    ) -> str:
        """Apply masking floor + per-vertical extras before promotion."""
        # Floor: same redactor as Phase B §5.4 (single source of truth)
        floor_masked = await redactor.apply(conversation_jsonl_blob)
        # Per-vertical extras (TODO until privacy ADR fills this in)
        if tenant.vertical in ("dental", "physio", "medical", "legal"):
            floor_masked = await redact_extras.apply(floor_masked, tenant.vertical)
        return floor_masked
Floor rules (single source of truth at backend/app/observability/redact.py,
shared with Phase B §5.4 structured-log redaction):
- Phone numbers → masked to last-4 digits (+254 7** *** 432)
- Customer names → replaced with <name> token
- Free-text customer messages → numeric runs of length 6+ masked (likely IDs/account numbers); email patterns masked; full names matched against identity table → masked
- Money fields (KES amount, M-Pesa receipt, PesaPal OrderTrackingId) → kept verbatim (load-bearing for "double-charge" diagnostic recipes per Phase B §5.6)
- Tool args → hashed (sha256), full args go to Langfuse trace if needed for inspection
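Two of the floor rules are mechanical enough to sketch: masking long digit runs and emails in free text, and hashing tool arguments. This is an illustration only, not the contents of backend/app/observability/redact.py; the helper names and the `<email>` and `<number>` replacement tokens are assumptions, and phone-number and name masking are not covered here.

```python
import hashlib
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGIT_RUN_RE = re.compile(r"\d{6,}")  # likely IDs / account numbers


def mask_free_text(message: str) -> str:
    """Mask email patterns and numeric runs of length 6+ in a customer message."""
    # Emails first, so digits inside an address are not masked separately.
    message = EMAIL_RE.sub("<email>", message)
    return LONG_DIGIT_RUN_RE.sub("<number>", message)


def hash_tool_args(args: dict) -> str:
    """Replace tool arguments with a stable sha256 digest of their canonical JSON."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalizing the JSON before hashing means the same arguments always produce the same digest regardless of key order, which keeps hashed tool calls comparable across conversations.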
Vertical-specific extras (TODO until future privacy ADR):
backend/app/eval/from_production/redact_extras.py exists as a
TODO-marker module — the floor handles spa/salon/barbershop tenants
fully; the file documents the gap explicitly:
    # backend/app/eval/from_production/redact_extras.py
    async def apply(blob: str, vertical: str) -> str:
        """Vertical-specific PII masking extras.

        Currently always raises NotImplementedError: implementation is
        deferred to the future privacy ADR, which will analyse Kenya DPA +
        EU AI Act + sector-specific health regulations.

        Tenants with vertical in (dental, physio, medical, legal) are
        BLOCKED from production-replay promotion until this is filled in.
        """
        raise NotImplementedError(
            f"Vertical '{vertical}' requires the future privacy ADR's "
            "masking rules. Promotion blocked until those are defined."
        )
Effect: Phase 1 production-replay works fully for spa / salon / barbershop tenants. Vertical-specific tenants need the privacy ADR before their conversations can be promoted to regression scenarios — acceptable Phase 1 constraint; surfaces the compliance gap as a hard-blocking error rather than silently shipping unmasked data.
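How the weekly promotion job reacts to that hard error is not specified above. One plausible shape, sketched here under the assumption that blocked conversations are skipped and recorded rather than aborting the whole job, is shown below; `promote_candidates` and the stand-in `apply_extras` are hypothetical.

```python
import asyncio

BLOCKED_VERTICALS = frozenset({"dental", "physio", "medical", "legal"})


async def apply_extras(blob: str, vertical: str) -> str:
    """Stand-in for redact_extras.apply: refuses until masking rules exist."""
    raise NotImplementedError(f"Vertical '{vertical}' requires the future privacy ADR.")


async def promote_candidates(candidates):
    """Split (conversation_id, vertical, blob) triples into promoted and blocked ids."""
    promoted, blocked = [], []
    for conversation_id, vertical, blob in candidates:
        try:
            if vertical in BLOCKED_VERTICALS:
                blob = await apply_extras(blob, vertical)
            promoted.append(conversation_id)
        except NotImplementedError:
            # Never promote unmasked data; record the id for follow-up instead.
            blocked.append(conversation_id)
    return promoted, blocked


result = asyncio.run(
    promote_candidates([("c1", "spa", "{}"), ("c2", "dental", "{}")])
)
```

Here the spa conversation is promoted and the dental one lands in the blocked list.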
5. DeepEval cache invalidation — 4-tuple cache key
A3 §10 #7 surfaced this: DeepEval caches LLM-judge calls between runs to make re-runs cheap. The cache key needs to invalidate on every input that could change the result.
Cache key: SHA256 of canonicalized JSON 4-tuple:
    # backend/tests/eval/cache_key.py (sketch)
    import hashlib
    import json


    def deepeval_cache_key(
        scenario_id: str,
        prompt_version: str,       # e.g., "booking_orchestrator@a1b2c3d"
        judge_model_version: str,  # e.g., "claude-opus-4.7"
        metric_version: str,       # e.g., "swahili_fluency:1.0.0"
    ) -> str:
        canonical = json.dumps({
            "scenario_id": scenario_id,
            "prompt_version": prompt_version,
            "judge_model_version": judge_model_version,
            "metric_version": metric_version,
        }, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()
Each component represents a distinct change-cause that MUST invalidate the cached judge call:
| Component | Source | Invalidation trigger |
|---|---|---|
| scenario_id | YAML scenario file id: field | Different scenario = different conversation |
| prompt_version | Per Phase B §3: <prompt_name>@<git-short-sha> computed at startup from prompt file's git history | Application prompt changed → potentially different model output → potentially different judge score |
| judge_model_version | Provider model identifier string (e.g., claude-opus-4.7) | Judge model upgrade may rate the same output differently |
| metric_version | __version__ constant exported by each custom metric module | Custom-metric rubric changed → potentially different score |
__version__ convention for each custom metric module:
    # backend/tests/eval/metrics/swahili_fluency.py
    from deepeval.metrics import GEval

    __version__ = "1.0.0"  # bump on rubric change

    class SwahiliFluencyMetric(GEval):
        ...
Cache location. .deepeval-cache/ at the project root, gitignored.
Persists across runs locally; CI runs from cold cache by default
(eval gate timing assumes worst-case cold-cache for the eval-budget
ceiling per Q5).
Override flags:
- pytest --no-deepeval-cache: forced re-run; useful for debugging or when verifying a judge-model swap behaves consistently with the cached pre-swap scores.
- pytest --deepeval-cache-clear: clears .deepeval-cache/ before the run.
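The two flags are project-level options rather than something pytest ships, so they would have to be registered in conftest.py. A minimal sketch, assuming the standard `pytest_addoption` hook and `config.getoption`; the helper `use_cache` and the `CACHE_DIR` constant are hypothetical names.

```python
import shutil
from pathlib import Path

CACHE_DIR = Path(".deepeval-cache")


def pytest_addoption(parser):
    """Register the two cache override flags with pytest."""
    parser.addoption(
        "--no-deepeval-cache", action="store_true", default=False,
        help="Ignore cached judge calls and force a re-run.",
    )
    parser.addoption(
        "--deepeval-cache-clear", action="store_true", default=False,
        help="Delete .deepeval-cache/ before the run.",
    )


def use_cache(config, cache_dir: Path = CACHE_DIR) -> bool:
    """Apply the flags; return True when cached judge calls may be reused."""
    if config.getoption("--deepeval-cache-clear"):
        # ignore_errors covers the cold-cache case where the directory is absent.
        shutil.rmtree(cache_dir, ignore_errors=True)
    return not config.getoption("--no-deepeval-cache")
```

`use_cache` would typically be called once per session, for example from a session-scoped fixture that receives `request.config`.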
Why the 4-tuple over alternatives. Coarser (3-tuple, drops
metric_version) is risky in a methodology-disciplined project — manual
"remember to invalidate when metric changes" gets skipped under
deadline pressure → stale eval results mask the regressions the
suite exists to catch. Finer (5-tuple, adding metric_threshold) is
over-engineering — threshold is the pass/fail line, not an input to
the score itself.
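The trade-off above can be made concrete with a short, self-contained demonstration. The key function repeats the cache-key sketch from this section; the scenario values and the `passes` helper are illustrative assumptions.

```python
import hashlib
import json


def deepeval_cache_key(scenario_id, prompt_version, judge_model_version, metric_version):
    canonical = json.dumps({
        "scenario_id": scenario_id,
        "prompt_version": prompt_version,
        "judge_model_version": judge_model_version,
        "metric_version": metric_version,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


args = ("spa_booking_swahili_happy", "booking_orchestrator@a1b2c3d", "claude-opus-4.7")
base = deepeval_cache_key(*args, "swahili_fluency:1.0.0")
rerun = deepeval_cache_key(*args, "swahili_fluency:1.0.0")
rubric_bump = deepeval_cache_key(*args, "swahili_fluency:1.1.0")


def passes(cached_score: float, threshold: float) -> bool:
    """The threshold is applied after the cache lookup, so it needs no key slot."""
    return cached_score >= threshold
```

An unchanged re-run maps to the same key (cache hit), a rubric bump maps to a new key (cache miss), and one cached score can be checked against different thresholds without re-invoking the judge.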
6. Bilingual judge mode — language-specific fluency metrics
A3 §10 #8 surfaced this: do we use one fluency metric per language
(separate SwahiliFluencyMetric and EnglishFluencyMetric) or one
combined BilingualFluencyMetric that auto-detects language and
applies the appropriate rubric internally?
Decision: language-specific metrics. Separate modules per
language; YAML scenario language: field gates which metric runs.
    backend/tests/eval/metrics/
        swahili_fluency.py            # SwahiliFluencyMetric, __version__, Swahili rubric
        english_fluency.py            # EnglishFluencyMetric, __version__, Kenyan-English rubric
        mpesa_payment_safety.py       # MpesaPaymentSafetyMetric, language-agnostic
        booking_slot_consistency.py   # BookingSlotConsistencyMetric, language-agnostic
Swahili rubric (SwahiliFluencyMetric):
- Idiomatic Swahili appropriate for service-business booking context
- Avoids Anglicisms unless conventional ("spa", "booking" are conventional; "reservation" is not)
- Handles code-switches gracefully (customer mixing English + Swahili mid-message)
- Cultural appropriateness for Kenyan SMB context (e.g., "samahani" vs "pole" register)
English rubric (EnglishFluencyMetric):
- Natural Kenyan-English register (avoids translated American English)
- SMB-context formality (warmer than corporate; more direct than hospitality-luxury)
YAML scenario gating:
    # backend/tests/eval/conversations/scenarios/spa_booking_swahili_happy.yaml
    id: spa_booking_swahili_happy
    language: sw  # gates SwahiliFluencyMetric
    assertions:
      - metric: SwahiliFluencyMetric
        threshold: 0.80
      - metric: BookingSlotConsistencyMetric  # language-agnostic
        threshold: 1.0
Optional auto-attach in conftest.py:
    def attach_language_metrics(scenario):
        """Auto-attach language-appropriate fluency metric.

        Test authors don't have to remember which fluency metric to
        specify; the scenario's `language:` field drives the choice.
        """
        if scenario.language == "sw":
            scenario.assertions.append(SwahiliFluencyMetric())
        elif scenario.language == "en":
            scenario.assertions.append(EnglishFluencyMetric())
Why language-specific over combined. A3 §4 emphasized that the Swahili
rubric is materially different from the English rubric; combining them
inside one metric hides the per-language rubric handling and makes
PR review harder ("which language's rubric did this PR change?").
Two modules with separate __version__ constants (per D5) mean that
iterating on the Swahili rubric doesn't invalidate cached English-scenario
scores, so the two languages get independent iteration cycles.
Language-agnostic single metric (rejected) loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness) — exactly the load-bearing differentiator for Ratiba's product.
7. LLMRouter forward-compatibility for Phase 3 canary
A3 §10 #3 surfaced this: A3 §6's Phase 3 plan introduces canary deployment for new prompt versions. The "prompt assignment service" that decides which prompt version each conversation uses needs to be plug-in-able into the existing LLMRouter (per ADR-0005 D4) without a refactor when Phase 3 lands.
Decision: commit to the deterministic-hash mechanism in the ADR now, sketch the architecture, do not implement until Phase 3.
Phase 1 LLMRouter signature (forward-compatible):
    # backend/app/orchestrator/llm/router.py (Phase 1)
    from uuid import UUID


    async def call_for_role(
        role: str,
        tenant_id: UUID,
        conversation_id: UUID,  # NEW: unused in Phase 1, locked for Phase 3
        system_prompt: str,
        user_prompt: str,
        schema: dict | None = None,
        max_tokens: int | None = None,
    ) -> LLMResponse:
        # Phase 1: always uses @stable alias from prompts/aliases.yaml.
        # Resolved here so Phase 3 can swap in canary routing at this point;
        # the value is not consumed further in this sketch.
        prompt_version = resolve_prompt_alias(role, "stable")
        assignment = role_assignments[role]
        provider = providers[assignment["provider"]]
        return await provider.call(
            model=assignment["model"],
            system_prompt=system_prompt,
            user_prompt=user_prompt,
            schema=schema,
        )
Phase 3 extension hook (NOT implemented in Phase 1):
    # Phase 3: extension hook (sketch in ADR for forward-compat)
    import hashlib
    from uuid import UUID


    async def _resolve_prompt_for_canary(
        role: str,
        tenant_id: UUID,
        conversation_id: UUID,
    ) -> str:
        """Resolve prompt version via deterministic-hash canary routing.

        Phase 3: plugged into call_for_role between line N and line M.
        """
        # Assumes an async Redis client created with decode_responses=True,
        # so hash fields come back as str rather than bytes.
        rollout = await redis.hgetall(f"ratiba:canary:rollout:{role}")
        percentage = int(rollout.get("percentage", 0)) if rollout else 0
        if percentage == 0:
            return resolve_prompt_alias(role, "stable")
        # Same (tenant, conversation) pair always lands in the same bucket 0-99.
        bucket = int(
            hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest(),
            16,
        ) % 100
        if bucket < percentage:
            return resolve_prompt_alias(role, "canary")
        return resolve_prompt_alias(role, "stable")
The Phase 1 cost is two unused parameters in the router signature for months. The Phase 3 win is canary deployment becomes a ~50-line extension to one function rather than a refactor that touches every call site.
Why deterministic-hash over alternatives. It's the canonical industry pattern (used by feature-flagging systems and A/B testing frameworks); it is provider-neutral; it uses existing Redis (no-cloud-dependencies principle); and it plays cleanly with the LLMRouter from ADR-0005 D4. Langfuse experiment-routing (rejected) is a vendor-specific bet on a feature that may not ship in time. Per-tenant opt-in (rejected) loses the gradual-rollout property: it would deploy the canary to whole tenants instead of 5% of any tenant's traffic, which is bad statistical design.
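The two properties that matter for canary routing, stickiness per conversation and an overall share close to the configured percentage, can be checked offline without Redis. A minimal sketch follows; `canary_bucket` and `uses_canary` are illustrative names, and the synthetic UUIDs stand in for real tenant and conversation ids.

```python
import hashlib
import uuid


def canary_bucket(tenant_id: uuid.UUID, conversation_id: uuid.UUID) -> int:
    """Map a (tenant, conversation) pair to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def uses_canary(tenant_id: uuid.UUID, conversation_id: uuid.UUID, percentage: int) -> bool:
    """A conversation is on the canary prompt when its bucket is below the rollout percentage."""
    return canary_bucket(tenant_id, conversation_id) < percentage


tenant = uuid.UUID(int=1)
conversations = [uuid.UUID(int=i) for i in range(10_000)]
# Share of this tenant's conversations routed to the canary at a 5% rollout.
share = sum(uses_canary(tenant, c, 5) for c in conversations) / len(conversations)
```

Because the bucket depends only on the two ids, every turn of a conversation resolves to the same prompt version, and raising the percentage only ever adds conversations to the canary group.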
Consequences
Positive.
- Hermetic test isolation eliminates cross-scenario state leak as a bug class (D2). The ~50ms overhead per scenario is dwarfed by LLM-call costs that dominate eval-suite runtime.
- Calibration ritual is calendared (D3), aligned with quarterly model-pin review for operational efficiency. Won't die of neglect.
- PII masking reuses single-source-of-truth redactor (D4) from Phase B §5.4. Same rules across structured logs + Langfuse traces + production-replay datasets.
- Cache key catches every change-cause (D5). No stale eval results that mask regressions — every relevant input is part of the cache key.
- Language-specific metrics enable Swahili-tuned iteration without invalidating English-scenario cached scores (D6). Independent iteration cycles for the two languages.
- Phase 3 canary is a small extension, not a refactor (D7). Two unused parameters in Phase 1 LLMRouter signature is the entire forward-compatibility cost.
- tests/eval/calibration/human_labelled.yaml becomes a long-term asset. ~120 conversations after a year of monthly ritual; meaningful calibration sample.
Negative.
- Per-scenario tenant overhead (~50ms × ~50 scenarios = ~2.5s per full eval run) is real but small against the ~15-minute full-suite runtime. Acceptable trade for hermetic isolation.
- Calibration ritual creates a recurring time-block on Adrian's calendar. Mitigation: paired with quarterly model-pin review; small monthly cost; explicit owner in the ADR makes it non-skippable.
- Vertical-specific PII rules are a TODO until the future privacy ADR. Production-replay BLOCKS for dental / physio / medical / legal tenants until then. Acceptable Phase 1 constraint; surfaces the compliance gap as a hard error.
- LLMRouter signature carries a conversation_id parameter that stays unused for months until Phase 3. Trivial cost; zero runtime impact.
- The __version__ convention on custom metrics requires discipline: bumping the version when changing a rubric is a manual step. Mitigation: PR template includes a "did you bump __version__?" checklist for changes under tests/eval/metrics/.
- DeepEval cache directory .deepeval-cache/ adds another gitignored artifact to manage. Standard pattern; trivial.
Neutral.
- Manual diff review of golden-conversation snapshots (per A3 §3) remains the load-bearing review step. Not auto-accept.
- Conftest auto-attach for language-specific metrics is convenience, not contract. Test authors can skip it and specify metrics explicitly in the YAML.
- GPT-4 cross-check on 10% of eval runs (per A3 §4) is the bias-mitigation mechanism for self-preference; tracks Cohen's kappa between Claude judge and GPT-4 judge as a drift signal.
- Phase 3 LLMRouter sketch is informational only — the actual implementation lands when canary work begins, and the sketch may be refined at that point based on additional context.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Session-wide tenant + per-test truncate for the eval fixture. | Loses the hermetic isolation guarantee; tests become coupled (one test creates a row another test depends on, passes alone, fails in suite). The bug class is real and hard to debug. |
| Per-class tenant + per-method truncate. | Middle-ground compromise that gives up the isolation guarantee for marginal performance gain. Per-scenario fresh tenant is cheap enough that the compromise isn't worth it. |
| Weekly calibration cadence (~3 conversations/week). | Risks falling behind reality and getting neglected (single-engineer time pressure). The "first Monday of each month" calendar slot is sustainable; weekly time-blocks slip. |
| Quarterly calibration cadence (~30 conversations/quarter). | 3-month drift window where bad calibration data could compound without detection. Bad bet for a system whose eval signal is load-bearing for prompt-PR gating. |
| Inline full PII policy in ADR-0004 including vertical rules. | Compliance scope bloat — vertical-specific rules belong to a privacy ADR alongside Kenya DPA + EU AI Act + sector-specific health regs analysis. ADR-0004 ships the floor + TODO marker; privacy ADR fills the marker. |
| Defer PII handling entirely to the future privacy ADR. | Blocks the production-replay code path indefinitely (A3 §5 weekly job has no spec for masking). Floor + TODO marker is the right Phase 1 posture. |
| Coarser 3-tuple cache key (drop metric_version). | Manual "remember to invalidate when metric changes" gets skipped under deadline pressure → stale eval results mask the regressions the suite exists to catch. Bad pattern in a methodology-disciplined project. |
| Finer 5-tuple cache key (add metric_threshold). | Threshold is the pass/fail line, not an input to the score itself. Same score crossed against different thresholds = same cache value, different assertion outcome. No need to invalidate. |
| Combined BilingualFluencyMetric that auto-detects language internally. | Hides per-language rubric handling, making it opaque to PR review ("which language's rubric did this PR change?"). Two metric modules with separate __version__ enable independent iteration. |
| Single language-agnostic FluencyMetric. | Loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness). Ratiba's Swahili-quality differentiator is structural; flat fluency metric flattens the differentiator. |
| Langfuse experiment-routing primitive for Phase 3 canary. | Vendor-specific bet on a feature that may not exist when Phase 3 needs it (months out). Deterministic-hash pattern is provider-neutral and uses existing Redis. |
| Per-tenant opt-in for canary (each tenant explicitly enrolled via dashboard). | Loses the gradual-rollout property — canary deploys to whole tenants instead of 5% of any tenant's traffic. Bad statistical design (no within-tenant control group). |
| Defer Phase 3 LLMRouter signature entirely (no conversation_id parameter in Phase 1). | Phase 3 work becomes a refactor that touches every LLMRouter call site instead of a 50-line extension. Avoidable for ~10 lines of forward-compat plumbing in Phase 1. |
| Skip cross-check sample on full eval runs (Claude judge only, no GPT-4 cross-check). | Loses the bias-mitigation signal for self-preference (Claude generator + Claude judge in tight loop). 10% sample is the small cost for the drift signal. |
References
- docs/prd/ratiba-prd.md: §1.4 conversational thesis, §4 Modules 7-9 (orchestrator + scheduling + admin)
- docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25): pinned Python 3.13; library-currency policy
- docs/adr/ADR-0002-multi-tenant-isolation.md: D4 TenantScopedSaver via per-tenant micro-pools; D7 asyncio contextvar tenant propagation; D3 per-tenant Alembic invocation (used by fixture's alembic upgrade head --tenant=test_tenant_xxx)
- docs/adr/ADR-0003-fsm-persistence.md: conversation_threads pointer table; LangGraph checkpoint tables (created in fixture by PostgresSaver.setup())
- docs/adr/ADR-0005-orchestration-model.md: D4 LLMRouter + role-assignments YAML (extended in this ADR D7 with conversation_id parameter); D6 directory-per-tool registry (eval scenarios assert against tool_calls[] per turn)
- docs/adr/ADR-0006-handoff-model.md: handoff_log table (eval scenarios for handoff flow read from this); D9 briefing card schema (eval scenarios validate brief generation)
- docs/adr/ADR-0007-payments-orchestration.md: PAYMENT_CANCELLED_BY_CUSTOMER state (eval scenarios cover cancellation flow); D9 reversal logic (eval scenarios validate auto-reverse on STK callback to cancelled payment)
- docs/research/2026-04-25-langgraph-postgressaver-spike.md: TenantScopedSaver wrapper used by fixture (D2)
- docs/research/2026-04-25-orchestration-patterns.md: A1 §3 FSM states (eval scenarios assert on FSM transitions)
- docs/research/2026-04-25-eval-frameworks.md: A3 (heavy use throughout; this ADR locks A3's open questions §10)
- docs/research/2026-04-25-human-in-the-loop-handoff.md: A2 §3 briefing card (eval scenarios validate brief generation)
- docs/research/2026-04-25-payments-orchestration.md: A4 §1 payment lifecycle (eval scenarios cover both rails)
- docs/methodology/agentic-development.md: Phase B §3 prompt storage (prompt_version in cache key per D5); §4 eval gate matrix (CI workflow); §5 auto-debug logging schema (event_type enum used in eval-suite log assertions); §6 delegate-vs-human-review boundaries (eval-suite changes are human-review per prompt-eval-reviewer agent)
- docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12: Q1 (self-host Langfuse), Q5 ($300/month eval ceiling) anchors
- ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_no_cloud_dependencies.md: drives D7 deterministic-hash over Langfuse experiment-routing
- ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_cost_discipline.md: drives D5 cache key (avoid stale results that hide regressions → cost of missed bugs in production)
- DeepEval documentation: ConversationalGEval, ConversationSimulator, ConversationalGolden primitives; cache mechanism
- Langfuse v4 SDK documentation: OTel-native traces; propagate_attributes; create_score for feedback loop
- Cohen's kappa interpretation thresholds: Landis & Koch 1977 (substantial agreement = 0.61-0.80; ≥ 0.7 is the conservative threshold for substantial)