
ADR-0004: Testing Strategy Under Conversation-as-State

Status: Accepted
Date: 2026-04-25

Context

Ratiba's "conversation state IS canonical state for in-flight bookings" distinction (project memory, PRD §1.4) means ordinary pytest against SQL fixtures cannot cover the booking flow. The orchestrator's correctness is partly probabilistic (LLM outputs); its state lives in LangGraph checkpoints across two stores; testing must verify both deterministic FSM transitions and conversational quality.

A3 (docs/research/2026-04-25-eval-frameworks.md) settled the high-level eval architecture: DeepEval as primary eval runner + self-hosted Langfuse v4 as observability backbone, prompt CMS, and customer-feedback collector. The five-layer test pyramid (FSM unit, full-turn integration, transcript replay, golden-conversation snapshots, production replay) replaces the classical pyramid that doesn't survive contact with conversation-as-state.

Spec §12 locked two foundational eval-related decisions:

  • Q1 — self-host Langfuse from day 1 (no external cloud dependencies; data-residency posture for health-data tenants).
  • Q5 — $300/month eval-suite cost ceiling at PoC scale; aligned with the cost-discipline principle.

Phase B §4 settled the eval gate matrix: the eval suite is blocking for prompt and FSM PRs, non-prompt code changes get an advisory smoke eval, and Renovate major bumps are blocking. The CI workflow at .github/workflows/eval-on-prompt-change.yml is the canonical gate-runner.

A3 §5 settled the WhatsApp button feedback loop: 👍/👎/💬 buttons captured as Langfuse scores; weekly job promotes negative-feedback conversations to candidate regression scenarios.

What ADR-0004 settles is the operational specifics A3 punted to follow-up: how the TenantScopedSaver gets consumed cleanly in eval fixtures, who owns the calibration set + at what cadence, the PII masking floor for production-replay datasets, the DeepEval cache invalidation policy, the bilingual judge mode, and the LLMRouter forward-compatibility shape that lets Phase 3 canary deployment plug in as an extension rather than a refactor.

This is the last ADR in the post-research chain (ADR-0002 / 0003 / 0005 / 0006 / 0007 / 0004). After this lands, the architectural foundation is complete and M3 implementation can begin.

Decision

Six specific decisions (sections 2 through 7, cited below as D2 to D7), organized as a coherent testing strategy. Section 1 recaps what is inherited.

1. Architectural recap (inherited, locked here)

| Aspect | Where decided | Recap |
| --- | --- | --- |
| Eval stack | A3 §1 + spec §12 Q1 | DeepEval primary + self-hosted Langfuse v4 observability |
| Five-layer test pyramid | A3 §2 | (1) FSM unit (mocked LLM); (2) full-turn integration (testcontainers + TenantScopedSaver); (3) transcript replay e2e; (4) golden-conversation snapshots (DeepEval); (5) production replay (Phase 2) |
| Golden-conversation YAML format | A3 §3 | One file per scenario at backend/tests/eval/conversations/scenarios/; LLM responses recorded once into tests/fixtures/llm_recordings/<scenario_id>.jsonl; manual diff review on snapshot updates |
| Eval gate matrix | Phase B §4 + A3 §7 | Prompts + FSM PRs blocking; non-prompt advisory smoke eval; Renovate majors blocking; .github/workflows/eval-on-prompt-change.yml is canonical |
| Cost ceiling | spec §12 Q5 | RATIBA_EVAL_BUDGET_USD=20 per PR; ~$300/month at PoC scale; cross-check sample auto-downgrades from 10% to 0% on budget breach |
| Customer feedback bridge | A3 §5 | WhatsApp interactive reply buttons → Langfuse create_score(name='user_feedback') → weekly promotion of negative-feedback conversations to candidate regression scenarios |
| LLM-as-judge model | A3 §4 | Claude Opus 4.7 primary judge + GPT-4 cross-check on 10% sample; Cohen's kappa target ≥ 0.7 against human_labelled.yaml calibration set |
| Bias mitigations | A3 §4 | Position randomization in pairwise comparisons; rubric-based scoring via ConversationalGEval; calibration against human-labelled samples; multi-judge majority vote for failing scores; never same model as judge and generator in tight loop |
| Production eval phasing | A3 §6 | Phase 1 = offline only; Phase 2 = + online sampling (every 6h, 5% sample); Phase 3 = + canary deployment (5% of conversations on new prompt for 24h, statistical comparison, auto-rollback) |
| Observability instrumentation | A3 §8 + Phase B §5 | Per-conversation Langfuse trace; tenant_id, conversation_id, prompt_version, model, cost_usd propagated as metadata via propagate_attributes; structured-log JSONL files at /var/log/ratiba/ as the durable substrate |
| Library pins | A3 §1 | deepeval >= 3.9 + langfuse >= 4.0 in [project.optional-dependencies] eval group of backend/pyproject.toml |

ADR-0004 builds on these without re-deriving them.

2. tenant_scoped_eval_environment pytest fixture — per-scenario fresh tenant

A3 §10 #2 surfaced the gap: the eval suite needs a clean way to spin up a TenantScopedSaver (per ADR-0002 D4 + ADR-0001 amendment) against a testcontainered Postgres without leaking state across scenarios.

Decision: per-scenario fresh tenant. Each YAML golden-conversation scenario gets its own ephemeral tenant schema, created at scenario start and dropped at scenario end. Maximum hermetic isolation; ~30-80ms overhead per scenario (negligible against the LLM-call costs that dominate).

Fixture contract:

# backend/tests/eval/conftest.py (sketch)

import uuid

import pytest

@pytest.fixture
async def tenant_scoped_eval_environment(scenario_id: str):
    """Spins up a fresh tenant schema for one scenario.

    Schema name: test_tenant_<scenario_id>_<run_id_8hex>
    Lifecycle:
    1. CREATE SCHEMA test_tenant_xxx
    2. SET search_path TO test_tenant_xxx
    3. alembic upgrade head --tenant=test_tenant_xxx
    4. PostgresSaver(get_psycopg_conn(test_tenant_xxx)).setup()
    5. tenant_id = await register_test_tenant(schema_name=test_tenant_xxx)
    6. saver = TenantScopedSaverFactory.for_tenant(tenant_id)
    7. orchestrator = build_test_orchestrator(saver, tenant_id)
    8. yield orchestrator
    9. DROP SCHEMA test_tenant_xxx CASCADE
    """
    # uuid4 rather than uuid7: the leading hex digits of a UUIDv7 encode the
    # timestamp, so a truncated prefix repeats for runs started close together.
    schema_name = f"test_tenant_{scenario_id}_{uuid.uuid4().hex[:8]}"
    async with create_tenant_schema(schema_name) as tenant_id:
        await run_alembic_migrations(schema_name)
        # Awaiting setup() assumes an async saver; LangGraph's synchronous
        # PostgresSaver.setup() is not awaitable (AsyncPostgresSaver is).
        await PostgresSaver(get_psycopg_conn(schema_name)).setup()
        saver = TenantScopedSaverFactory.for_tenant(tenant_id)
        orchestrator = await build_test_orchestrator(saver, tenant_id)
        yield orchestrator
    # cleanup (DROP SCHEMA ... CASCADE) happens when the context manager exits
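The schema name in the fixture embeds the scenario id directly. A minimal sketch of a naming helper follows, assuming scenario ids may contain characters that are not valid in an unquoted identifier and relying on PostgreSQL's 63-byte identifier limit. The helper name `eval_schema_name` and the constants are hypothetical and not part of the fixture contract.

```python
import re
import uuid
from typing import Optional

PG_MAX_IDENTIFIER_BYTES = 63  # PostgreSQL truncates longer identifiers
PREFIX = "test_tenant_"


def eval_schema_name(scenario_id: str, run_id: Optional[str] = None) -> str:
    """Build test_tenant_<scenario_id>_<run_id_8hex> as a safe identifier."""
    run_id = run_id or uuid.uuid4().hex[:8]
    # Lower-case and replace anything outside [a-z0-9_] so no quoting is needed.
    slug = re.sub(r"[^a-z0-9_]", "_", scenario_id.lower())
    # Reserve room for the prefix, the separator, and the run id.
    budget = PG_MAX_IDENTIFIER_BYTES - len(PREFIX) - 1 - len(run_id)
    return f"{PREFIX}{slug[:budget]}_{run_id}"
```

Truncating the slug rather than the run id keeps the random suffix intact, so two long scenario ids that share a prefix still map to distinct schemas within one run.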

Belt-and-braces at session start: a pytest_sessionstart hook queries pg_namespace for any leftover test_tenant_* schemas from prior crashed runs and drops them. Defends against the case where a previous test killed the worker before fixture cleanup ran.
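The session-start sweep can be sketched as a pure helper that turns catalog rows into DROP statements. This is an illustration under assumptions: the names `FIND_STALE_SCHEMAS_SQL`, `quote_ident`, and `drop_statements` are hypothetical, and connection handling is omitted because it depends on project helpers. The backslash escapes matter because a bare underscore in a LIKE pattern matches any single character.

```python
from typing import Iterable, List

STALE_SCHEMA_PREFIX = "test_tenant_"

# Escaped underscores: in LIKE, an unescaped "_" is a single-character wildcard.
FIND_STALE_SCHEMAS_SQL = (
    r"SELECT nspname FROM pg_namespace WHERE nspname LIKE 'test\_tenant\_%'"
)


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_statements(schema_names: Iterable[str]) -> List[str]:
    """Return DROP statements for leftover eval schemas only."""
    return [
        f"DROP SCHEMA IF EXISTS {quote_ident(name)} CASCADE"
        for name in sorted(schema_names)
        # Second guard in Python, in case the query is ever loosened.
        if name.startswith(STALE_SCHEMA_PREFIX)
    ]
```

Inside a `pytest_sessionstart(session)` hook, the query would run against the testcontainered Postgres and each returned statement would then be executed.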

Why per-scenario over alternatives. Postgres CREATE SCHEMA + DROP SCHEMA CASCADE are fast on testcontainered Postgres. The hermetic-isolation guarantee — each scenario starts pristine, no cross-scenario state coupling — eliminates a class of bugs that's hard to debug when it surfaces (test passes alone, fails in suite). Session-wide-with-truncate (rejected) optimizes for a few seconds of total suite runtime at the cost of test interdependence; bad trade for a methodology-disciplined project.

3. Calibration set — ownership and refresh ritual

A3 §10 #4 surfaced the gap: A3 §4's recommended LLM-as-judge calibration relies on a growing set of human-rated conversations that, without explicit ownership and cadence, dies of neglect.

Decision: monthly cadence + ~10 new conversations + Adrian-owned + calendared.

| Property | Value |
| --- | --- |
| File | backend/tests/eval/calibration/human_labelled.yaml |
| Owner | Adrian (single engineer in current state) |
| Ownership transition | PR amends ADR-0004 "calibration owner" field when second team member joins |
| Cadence | Monthly, first Monday of each month |
| Volume per session | ~10 new PII-masked production transcripts |
| Rating shape | Each transcript scored 1-5 by a Swahili-fluent reviewer on (1) intent comprehension, (2) response naturalness, (3) cultural appropriateness |
| Trigger for promotion | A real production interaction that surprised either way (great handoff resolution worth eval-encoding, OR terrible classifier output worth regression-testing against) |
| Pairing | Aligned with quarterly model-pin review (Phase B §7): the quarterly Monday is also a calibration-set Monday. Operational efficiency: one calendar block, two reviews. |

Why monthly, not weekly or quarterly. Weekly risks falling behind reality and getting neglected (single-engineer time pressure). Quarterly creates a 3-month drift window where bad calibration data could compound without detection — bad bet for a system whose eval signal is load-bearing for prompt-PR gating.

Calibration kappa check runs as part of the monthly ritual: after adding the new conversations, re-run the LLM-as-judge against the full calibration set; compute Cohen's kappa between judge ratings and human ratings. Target: kappa ≥ 0.7 (substantial agreement threshold per A3 §4). Below 0.7, the judge model isn't trustworthy; either reject the judge-model upgrade under review or recalibrate the rubrics.
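The kappa check can be sketched in a few lines of standard-library Python. This is the unweighted form of Cohen's kappa, which treats the 1-5 ratings as categories; whether A3 §4 intends a weighted variant for the ordinal scale is not stated here, so treat the choice as an assumption. The function name and the sample ratings are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal frequency per category.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


human_scores = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
judge_scores = [5, 4, 4, 3, 4, 2, 4, 5, 3, 3]
kappa = cohens_kappa(human_scores, judge_scores)
# Observed agreement 0.80, chance agreement 0.29, so kappa is about 0.718.
passes_gate = kappa >= 0.7
```

With these sample ratings the judge clears the 0.7 target narrowly, which shows how two disagreements in ten items are already enough to sit near the gate.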

After ~12 months of monthly additions, the calibration set has ~120 conversations — meaningful sample for the LLM-as-judge calibration.

4. Production-replay PII masking floor + future privacy ADR reference

A3 §10 #6 surfaced this: when negative-feedback conversations get promoted from Langfuse into tests/eval/conversations/from_production/ (per A3 §5 weekly job), what gets masked? The masking floor is concrete and load-bearing; the vertical-specific rules belong to a future privacy ADR.

Masking floor (locked in this ADR):

# backend/app/eval/from_production/promote_to_scenario.py (sketch)

from backend.app.observability.redact import redactor
from backend.app.eval.from_production import redact_extras

async def mask_for_eval(
    conversation_jsonl_blob: str,
    tenant: TenantContext,
) -> str:
    """Apply masking floor + per-vertical extras before promotion."""
    # Floor: same redactor as Phase B §5.4 (single source of truth)
    floor_masked = await redactor.apply(conversation_jsonl_blob)

    # Per-vertical extras (TODO until privacy ADR fills this in)
    if tenant.vertical in ("dental", "physio", "medical", "legal"):
        floor_masked = await redact_extras.apply(floor_masked, tenant.vertical)

    return floor_masked

Floor rules (single source of truth at backend/app/observability/redact.py, shared with Phase B §5.4 structured-log redaction):

  • Phone numbers → masked except for the last 4 digits (e.g., +254 7** **5 432)
  • Customer names → replaced with <name> token
  • Free-text customer messages → numeric runs of length 6+ masked (likely IDs/account numbers); email patterns masked; full names matched against identity table → masked
  • Money fields (KES amount, M-Pesa receipt, PesaPal OrderTrackingId) → kept verbatim (load-bearing for "double-charge" diagnostic recipes per Phase B §5.6)
  • Tool args → hashed (sha256), full args go to Langfuse trace if needed for inspection
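The free-text and tool-args rules above can be illustrated with a small regex sketch. It is not the contents of backend/app/observability/redact.py: the patterns, the `<email>` and `<number>` tokens, and the function names are assumptions for illustration, and name masking against the identity table is left out because it needs a lookup.

```python
import hashlib
import json
import re

# Assumed patterns; the real rules live in backend/app/observability/redact.py.
PHONE_RE = re.compile(r"\+254 ?7\d{2} ?\d{3} ?\d{3}")  # Kenyan mobile format
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_DIGITS_RE = re.compile(r"\d{6,}")  # likely IDs or account numbers


def _mask_phone(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    # Keep the country code, the leading 7, and the last four digits.
    return f"+254 7** **{digits[-4]} {digits[-3:]}"


def mask_free_text(text: str) -> str:
    """Apply the free-text floor: phones, then emails, then long digit runs."""
    text = PHONE_RE.sub(_mask_phone, text)  # before the digit-run rule
    text = EMAIL_RE.sub("<email>", text)
    return LONG_DIGITS_RE.sub("<number>", text)


def hash_tool_args(args: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Phone masking runs first so that an unspaced number is not swallowed by the generic digit-run rule. Structured money fields would bypass `mask_free_text` entirely, since the floor keeps them verbatim.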

Vertical-specific extras (TODO until future privacy ADR):

backend/app/eval/from_production/redact_extras.py exists as a TODO-marker module — the floor handles spa/salon/barbershop tenants fully; the file documents the gap explicitly:

# backend/app/eval/from_production/redact_extras.py

async def apply(blob: str, vertical: str) -> str:
    """
    Vertical-specific PII masking extras.

    Currently raises NotImplementedError. Implementation is deferred to
    the future privacy ADR, which will analyse Kenya DPA + EU AI Act +
    sector-specific health regulations.

    Tenants with vertical in (dental, physio, medical, legal) are
    BLOCKED from production-replay promotion until this is filled in.
    """
    raise NotImplementedError(
        f"Vertical '{vertical}' requires the future privacy ADR's "
        f"masking rules. Promotion blocked until those are defined."
    )

Effect: Phase 1 production-replay works fully for spa / salon / barbershop tenants. Vertical-specific tenants need the privacy ADR before their conversations can be promoted to regression scenarios — acceptable Phase 1 constraint; surfaces the compliance gap as a hard-blocking error rather than silently shipping unmasked data.

5. DeepEval cache invalidation — 4-tuple cache key

A3 §10 #7 surfaced this: DeepEval caches LLM-judge calls between runs to make re-runs cheap. The cache key needs to invalidate on every input that could change the result.

Cache key: SHA256 of canonicalized JSON 4-tuple:

# backend/tests/eval/cache_key.py (sketch)

import hashlib
import json

def deepeval_cache_key(
    scenario_id: str,
    prompt_version: str,  # e.g., "booking_orchestrator@a1b2c3d"
    judge_model_version: str,  # e.g., "claude-opus-4.7"
    metric_version: str,  # e.g., "swahili_fluency:1.0.0"
) -> str:
    canonical = json.dumps({
        "scenario_id": scenario_id,
        "prompt_version": prompt_version,
        "judge_model_version": judge_model_version,
        "metric_version": metric_version,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

Each component represents a distinct change-cause that MUST invalidate the cached judge call:

| Component | Source | Invalidation trigger |
| --- | --- | --- |
| scenario_id | YAML scenario file id: field | Different scenario = different conversation |
| prompt_version | Per Phase B §3: <prompt_name>@<git-short-sha> computed at startup from prompt file's git history | Application prompt changed → potentially different model output → potentially different judge score |
| judge_model_version | Provider model identifier string (e.g., claude-opus-4.7) | Judge model upgrade may rate the same output differently |
| metric_version | __version__ constant exported by each custom metric module | Custom-metric rubric changed → potentially different score |
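The invalidation property can be checked directly: changing any one of the four components should produce a different key. The sketch below repeats the key function so it runs on its own; `BASE` and `changed_fields_invalidate` are illustrative names, and the sample values are taken from the inline examples in the key sketch.

```python
import hashlib
import json


def deepeval_cache_key(scenario_id, prompt_version, judge_model_version, metric_version):
    canonical = json.dumps({
        "scenario_id": scenario_id,
        "prompt_version": prompt_version,
        "judge_model_version": judge_model_version,
        "metric_version": metric_version,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


BASE = {
    "scenario_id": "spa_booking_swahili_happy",
    "prompt_version": "booking_orchestrator@a1b2c3d",
    "judge_model_version": "claude-opus-4.7",
    "metric_version": "swahili_fluency:1.0.0",
}


def changed_fields_invalidate(base: dict) -> dict:
    """Map each component to True if changing it alone changes the key."""
    base_key = deepeval_cache_key(**base)
    return {
        name: deepeval_cache_key(**{**base, name: value + "-changed"}) != base_key
        for name, value in base.items()
    }
```

A check like this is cheap enough to keep as a unit test next to the key function, so that a component accidentally dropped from the dictionary is caught before it causes stale scores.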

__version__ convention for each custom metric module:

# backend/tests/eval/metrics/swahili_fluency.py
from deepeval.metrics import GEval

__version__ = "1.0.0"  # bump on rubric change

class SwahiliFluencyMetric(GEval):
    ...

Cache location. .deepeval-cache/ at the project root, gitignored. Persists across runs locally; CI runs from cold cache by default (eval gate timing assumes worst-case cold-cache for the eval-budget ceiling per Q5).

Override flags:

  • pytest --no-deepeval-cache — forced re-run; useful for debugging or when verifying a judge-model swap behaves consistently with the cached pre-swap scores.
  • pytest --deepeval-cache-clear — clears .deepeval-cache/ before the run.
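One way the two flags could be registered is through pytest's `pytest_addoption` hook in conftest.py. This is a sketch under assumptions: `clear_cache_dir` and `CACHE_DIR` are hypothetical names, and how `--no-deepeval-cache` is then applied to DeepEval's own caching is left open here.

```python
import shutil
from pathlib import Path

CACHE_DIR = Path(".deepeval-cache")  # project-root cache location


def pytest_addoption(parser):
    """Register the two cache override flags with pytest."""
    parser.addoption(
        "--no-deepeval-cache",
        action="store_true",
        default=False,
        help="Ignore cached judge calls and force a full re-run.",
    )
    parser.addoption(
        "--deepeval-cache-clear",
        action="store_true",
        default=False,
        help="Delete .deepeval-cache/ before the run.",
    )


def clear_cache_dir(cache_dir: Path = CACHE_DIR) -> None:
    """Remove the cache directory and recreate it empty."""
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)


def pytest_configure(config):
    """Clear the cache up front when the flag is passed."""
    if config.getoption("--deepeval-cache-clear"):
        clear_cache_dir()
```

Clearing in `pytest_configure` rather than in a fixture means the directory is empty before any test is collected.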

Why the 4-tuple over alternatives. Coarser (3-tuple, drops metric_version) is risky in a methodology-disciplined project — manual "remember to invalidate when metric changes" gets skipped under deadline pressure → stale eval results mask the regressions the suite exists to catch. Finer (5-tuple, adding metric_threshold) is over-engineering — threshold is the pass/fail line, not an input to the score itself.

6. Bilingual judge mode — language-specific fluency metrics

A3 §10 #8 surfaced this: do we use one fluency metric per language (separate SwahiliFluencyMetric and EnglishFluencyMetric) or one combined BilingualFluencyMetric that auto-detects language and applies the appropriate rubric internally?

Decision: language-specific metrics. Separate modules per language; YAML scenario language: field gates which metric runs.

backend/tests/eval/metrics/
    swahili_fluency.py           # SwahiliFluencyMetric, __version__, Swahili rubric
    english_fluency.py           # EnglishFluencyMetric, __version__, Kenyan-English rubric
    mpesa_payment_safety.py      # MpesaPaymentSafetyMetric, language-agnostic
    booking_slot_consistency.py  # BookingSlotConsistencyMetric, language-agnostic

Swahili rubric (SwahiliFluencyMetric):

  • Idiomatic Swahili appropriate for service-business booking context
  • Avoids Anglicisms unless conventional ("spa", "booking" are conventional; "reservation" is not)
  • Handles code-switches gracefully (customer mixing English + Swahili mid-message)
  • Cultural appropriateness for Kenyan SMB context (e.g., "samahani" vs "pole" register)

English rubric (EnglishFluencyMetric):

  • Natural Kenyan-English register (avoids translated American English)
  • SMB-context formality (warmer than corporate; more direct than hospitality-luxury)

YAML scenario gating:

# backend/tests/eval/conversations/scenarios/spa_booking_swahili_happy.yaml
id: spa_booking_swahili_happy
language: sw  # gates SwahiliFluencyMetric
assertions:
  - metric: SwahiliFluencyMetric
    threshold: 0.80
  - metric: BookingSlotConsistencyMetric  # language-agnostic
    threshold: 1.0

Optional auto-attach in conftest.py:

def attach_language_metrics(scenario):
    """Auto-attach language-appropriate fluency metric.

    Test authors don't have to remember which fluency metric to
    specify; the scenario's `language:` field drives the choice.
    """
    if scenario.language == "sw":
        scenario.assertions.append(SwahiliFluencyMetric())
    elif scenario.language == "en":
        scenario.assertions.append(EnglishFluencyMetric())
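A table-driven variant of the auto-attach helper is sketched below so it can run without DeepEval installed. The two metric classes are empty stand-ins, the `Scenario` dataclass is a hypothetical minimal shape, and rejecting unknown languages and skipping an already-listed metric are design assumptions, not behavior the ADR specifies.

```python
from dataclasses import dataclass, field
from typing import List


class SwahiliFluencyMetric:  # stand-in for the real metric class
    pass


class EnglishFluencyMetric:  # stand-in for the real metric class
    pass


FLUENCY_METRIC_BY_LANGUAGE = {
    "sw": SwahiliFluencyMetric,
    "en": EnglishFluencyMetric,
}


@dataclass
class Scenario:  # hypothetical minimal scenario shape
    id: str
    language: str
    assertions: List[object] = field(default_factory=list)


def attach_language_metric(scenario: Scenario) -> None:
    """Attach the fluency metric selected by the scenario's language field."""
    try:
        metric_cls = FLUENCY_METRIC_BY_LANGUAGE[scenario.language]
    except KeyError:
        raise ValueError(
            f"no fluency metric for language {scenario.language!r}"
        ) from None
    # Skip if the YAML already listed the metric explicitly.
    if not any(isinstance(a, metric_cls) for a in scenario.assertions):
        scenario.assertions.append(metric_cls())
```

Failing loudly on an unknown language code surfaces a typo in the YAML instead of silently running the scenario without any fluency assertion.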

Why language-specific over combined. A3 §4 emphasized that the Swahili rubric is materially different from the English rubric; combining them inside one metric hides the per-language rubric handling and makes PR review harder ("which language's rubric did this PR change?"). Two modules with separate __version__ constants (per D5) mean that iterating on the Swahili rubric doesn't invalidate cached English-scenario scores, so the two languages iterate independently.

Language-agnostic single metric (rejected) loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness) — exactly the load-bearing differentiator for Ratiba's product.

7. LLMRouter forward-compatibility for Phase 3 canary

A3 §10 #3 surfaced this: A3 §6's Phase 3 plan introduces canary deployment for new prompt versions. The "prompt assignment service" that decides which prompt version each conversation uses needs to be plug-in-able into the existing LLMRouter (per ADR-0005 D4) without a refactor when Phase 3 lands.

Decision: commit to the deterministic-hash mechanism in the ADR now, sketch the architecture, do not implement until Phase 3.

Phase 1 LLMRouter signature (forward-compatible):

# backend/app/orchestrator/llm/router.py (Phase 1)

from uuid import UUID

async def call_for_role(
    role: str,
    tenant_id: UUID,
    conversation_id: UUID,  # NEW: unused in Phase 1, locked for Phase 3
    system_prompt: str,
    user_prompt: str,
    schema: dict | None = None,
    max_tokens: int | None = None,
) -> LLMResponse:
    # Phase 1: always uses @stable alias from prompts/aliases.yaml
    prompt_version = resolve_prompt_alias(role, "stable")
    assignment = role_assignments[role]
    provider = providers[assignment["provider"]]
    return await provider.call(
        model=assignment["model"],
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        schema=schema,
    )

Phase 3 extension hook (NOT implemented in Phase 1):

# Phase 3: extension hook (sketch in ADR for forward-compat)

import hashlib
from uuid import UUID

async def _resolve_prompt_for_canary(
    role: str,
    tenant_id: UUID,
    conversation_id: UUID,
) -> str:
    """Resolve prompt version via deterministic-hash canary routing.

    Phase 3: plugged into call_for_role between line N and line M.
    """
    rollout = await redis.hgetall(f"ratiba:canary:rollout:{role}")
    if not rollout or int(rollout.get("percentage", 0)) == 0:
        return resolve_prompt_alias(role, "stable")
    bucket = int(
        hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest(),
        16,
    ) % 100
    if bucket < int(rollout["percentage"]):
        return resolve_prompt_alias(role, "canary")
    return resolve_prompt_alias(role, "stable")

The Phase 1 cost is two parameters in the router signature (tenant_id and conversation_id in the sketch above) that go unread for months. The Phase 3 win is that canary deployment becomes a ~50-line extension to one function rather than a refactor that touches every call site.

Why deterministic-hash over alternatives. It's the canonical industry pattern (used by feature-flagging systems, A/B testing frameworks); provider-neutral; uses existing Redis (no-cloud-dependencies principle); plays cleanly with the LLMRouter from ADR-0005 D4. Langfuse experiment-routing (rejected) is a vendor-specific bet on a feature that may not ship in time. Per-tenant opt-in (rejected) loses the gradual-rollout property: it would deploy the canary to whole tenants instead of 5% of any tenant's traffic, which is bad statistical design.
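The bucketing step from the Phase 3 sketch can be exercised on its own, without Redis or the router. The snippet below isolates the hash-to-bucket logic and checks the share of synthetic conversations that land in a 5% canary. The names `canary_bucket` and `uses_canary` are hypothetical, and the tenant and conversation ids are made up for the demonstration.

```python
import hashlib
from uuid import UUID


def canary_bucket(tenant_id: UUID, conversation_id: UUID) -> int:
    """Map a (tenant, conversation) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{tenant_id}:{conversation_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def uses_canary(tenant_id: UUID, conversation_id: UUID, percentage: int) -> bool:
    """True if this conversation falls inside the current rollout percentage."""
    return canary_bucket(tenant_id, conversation_id) < percentage


# Synthetic ids for the demonstration only.
tenant = UUID("00000000-0000-0000-0000-000000000001")
conversations = [UUID(int=i) for i in range(10_000)]
canary_share = sum(uses_canary(tenant, c, 5) for c in conversations) / len(conversations)
```

Because the bucket depends only on the two ids, a conversation keeps the same prompt version across turns. Raising the percentage only adds conversations to the canary group, so anyone already in the canary at 5% stays in it at 10%.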

Consequences

Positive.

  1. Hermetic test isolation eliminates cross-scenario state leak as a bug class (D2). The ~50ms overhead per scenario is dwarfed by LLM-call costs that dominate eval-suite runtime.
  2. Calibration ritual is calendared (D3), aligned with quarterly model-pin review for operational efficiency. Won't die of neglect.
  3. PII masking reuses single-source-of-truth redactor (D4) from Phase B §5.4. Same rules across structured logs + Langfuse traces + production-replay datasets.
  4. Cache key catches every change-cause (D5). No stale eval results that mask regressions — every relevant input is part of the cache key.
  5. Language-specific metrics enable Swahili-tuned iteration without invalidating English-scenario cached scores (D6). Independent iteration cycles for the two languages.
  6. Phase 3 canary is a small extension, not a refactor (D7). Two unused parameters in Phase 1 LLMRouter signature is the entire forward-compatibility cost.
  7. tests/eval/calibration/human_labelled.yaml becomes a long-term asset. ~120 conversations after a year of monthly ritual; meaningful calibration sample.

Negative.

  1. Per-scenario tenant overhead (~50ms × ~50 scenarios = ~2.5s per full eval run) is real but small against the ~15-minute full-suite runtime. Acceptable trade for hermetic isolation.
  2. Calibration ritual creates a recurring time-block on Adrian's calendar. Mitigation: paired with quarterly model-pin review; small monthly cost; explicit owner in the ADR makes it non-skippable.
  3. Vertical-specific PII rules are a TODO until the future privacy ADR. Production-replay BLOCKS for dental / physio / medical / legal tenants until then. Acceptable Phase 1 constraint; surfaces the compliance gap as a hard error.
  4. LLMRouter signature carries conversation_id parameter unused for months until Phase 3. Trivial cost; zero runtime impact.
  5. __version__ convention on custom metrics requires discipline — bumping the version when changing rubric is a manual step. Mitigation: PR template includes "did you bump __version__?" checklist for changes under tests/eval/metrics/.
  6. DeepEval cache directory .deepeval-cache/ adds another gitignored artifact to manage. Standard pattern; trivial.

Neutral.

  1. Manual diff review of golden-conversation snapshots (per A3 §3) remains the load-bearing review step. Not auto-accept.
  2. Conftest auto-attach for language-specific metrics is convenience, not contract. Test authors can skip it and specify metrics explicitly in the YAML.
  3. GPT-4 cross-check on 10% of eval runs (per A3 §4) is the bias-mitigation mechanism for self-preference; tracks Cohen's kappa between Claude judge and GPT-4 judge as a drift signal.
  4. Phase 3 LLMRouter sketch is informational only — the actual implementation lands when canary work begins, and the sketch may be refined at that point based on additional context.

Alternatives Considered

| Alternative | Rejected because |
| --- | --- |
| Session-wide tenant + per-test truncate for the eval fixture. | Loses the hermetic isolation guarantee; tests become coupled (one test creates a row another test depends on, passes alone, fails in suite). The bug class is real and hard to debug. |
| Per-class tenant + per-method truncate. | Middle-ground compromise that gives up the isolation guarantee for marginal performance gain. Per-scenario fresh tenant is cheap enough that the compromise isn't worth it. |
| Weekly calibration cadence (~3 conversations/week). | Risks falling behind reality and getting neglected (single-engineer time pressure). The "first Monday of each month" calendar slot is sustainable; weekly time-blocks slip. |
| Quarterly calibration cadence (~30 conversations/quarter). | 3-month drift window where bad calibration data could compound without detection. Bad bet for a system whose eval signal is load-bearing for prompt-PR gating. |
| Inline full PII policy in ADR-0004 including vertical rules. | Compliance scope bloat: vertical-specific rules belong to a privacy ADR alongside Kenya DPA + EU AI Act + sector-specific health regs analysis. ADR-0004 ships the floor + TODO marker; privacy ADR fills the marker. |
| Defer PII handling entirely to the future privacy ADR. | Blocks the production-replay code path indefinitely (A3 §5 weekly job has no spec for masking). Floor + TODO marker is the right Phase 1 posture. |
| Coarser 3-tuple cache key (drop metric_version). | Manual "remember to invalidate when metric changes" gets skipped under deadline pressure → stale eval results mask the regressions the suite exists to catch. Bad pattern in a methodology-disciplined project. |
| Finer 5-tuple cache key (add metric_threshold). | Threshold is the pass/fail line, not an input to the score itself. Same score crossed against different thresholds = same cache value, different assertion outcome. No need to invalidate. |
| Combined BilingualFluencyMetric that auto-detects language internally. | Hides per-language rubric handling; opaque to PR review ("which language's rubric did this PR change?"). Two metric modules with separate __version__ enable independent iteration. |
| Single language-agnostic FluencyMetric. | Loses Swahili-specific fluency signals (idiomatic register, code-switch handling, cultural appropriateness). Ratiba's Swahili-quality differentiator is structural; flat fluency metric flattens the differentiator. |
| Langfuse experiment-routing primitive for Phase 3 canary. | Vendor-specific bet on a feature that may not exist when Phase 3 needs it (months out). Deterministic-hash pattern is provider-neutral and uses existing Redis. |
| Per-tenant opt-in for canary (each tenant explicitly enrolled via dashboard). | Loses the gradual-rollout property: canary deploys to whole tenants instead of 5% of any tenant's traffic. Bad statistical design (no within-tenant control group). |
| Defer Phase 3 LLMRouter signature entirely (no conversation_id parameter in Phase 1). | Phase 3 work becomes a refactor that touches every LLMRouter call site instead of a 50-line extension. Avoidable for ~10 lines of forward-compat plumbing in Phase 1. |
| Skip cross-check sample on full eval runs (Claude judge only, no GPT-4 cross-check). | Loses the bias-mitigation signal for self-preference (Claude generator + Claude judge in tight loop). 10% sample is the small cost for the drift signal. |

References

  • docs/prd/ratiba-prd.md — §1.4 conversational thesis, §4 Modules 7-9 (orchestrator + scheduling + admin)
  • docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — pinned Python 3.13; library-currency policy
  • docs/adr/ADR-0002-multi-tenant-isolation.md — D4 TenantScopedSaver via per-tenant micro-pools; D7 asyncio contextvar tenant propagation; D3 per-tenant Alembic invocation (used by fixture's alembic upgrade head --tenant=test_tenant_xxx)
  • docs/adr/ADR-0003-fsm-persistence.md: conversation_threads pointer table; LangGraph checkpoint tables (created in fixture by PostgresSaver.setup())
  • docs/adr/ADR-0005-orchestration-model.md — D4 LLMRouter + role-assignments YAML (extended in this ADR D7 with conversation_id parameter); D6 directory-per-tool registry (eval scenarios assert against tool_calls[] per turn)
  • docs/adr/ADR-0006-handoff-model.md: handoff_log table (eval scenarios for handoff flow read from this); D9 briefing card schema (eval scenarios validate brief generation)
  • docs/adr/ADR-0007-payments-orchestration.md: PAYMENT_CANCELLED_BY_CUSTOMER state (eval scenarios cover cancellation flow); D9 reversal logic (eval scenarios validate auto-reverse on STK callback to cancelled payment)
  • docs/research/2026-04-25-langgraph-postgressaver-spike.md — TenantScopedSaver wrapper used by fixture (D2)
  • docs/research/2026-04-25-orchestration-patterns.md — A1 §3 FSM states (eval scenarios assert on FSM transitions)
  • docs/research/2026-04-25-eval-frameworks.md — A3 (heavy use throughout; this ADR locks A3's open questions §10)
  • docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §3 briefing card (eval scenarios validate brief generation)
  • docs/research/2026-04-25-payments-orchestration.md — A4 §1 payment lifecycle (eval scenarios cover both rails)
  • docs/methodology/agentic-development.md — Phase B §3 prompt storage (prompt_version in cache key per D5); §4 eval gate matrix (CI workflow); §5 auto-debug logging schema (event_type enum used in eval-suite log assertions); §6 delegate-vs-human-review boundaries (eval-suite changes are human-review per prompt-eval-reviewer agent)
  • docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — Q1 (self-host Langfuse), Q5 ($300/month eval ceiling) anchors
  • ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_no_cloud_dependencies.md: drives D7 deterministic-hash over Langfuse experiment-routing
  • ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_cost_discipline.md — drives D5 cache key (avoid stale results that hide regressions → cost of missed bugs in production)
  • DeepEval documentation — ConversationalGEval, ConversationSimulator, ConversationalGolden primitives; cache mechanism
  • Langfuse v4 SDK documentation — OTel-native traces; propagate_attributes; create_score for feedback loop
  • Cohen's kappa interpretation thresholds — Landis & Koch 1977 (substantial agreement = 0.61-0.80; ≥ 0.7 is the conservative threshold for substantial)