Recommended testing + eval + observability stack for Ratiba's conversational orchestrator

Status: A3 deliverable. Feeds ADR-0004 (testing strategy under conversation-as-state). Audience: Adrian + future Ratiba contributors deciding the eval stack before any orchestration code is written. Voice: Opinionated. Builds on the locked decisions in ADR-0001 as amended 2026-04-25 (Python 3.13, Option A LangGraph wrapper, library currency policy).

Executive verdict (one paragraph, then read on)

Adopt: DeepEval as the primary eval runner + Langfuse as the observability and prompt-management backbone. This is a two-tool stack, not a three-tool stack, and the seam is clean: DeepEval owns "is this agent behaving correctly?" (offline, in CI, on every prompt PR); Langfuse owns "what is happening in production right now, and which prompt version produced it?" (online, traces, scores, datasets). The reason to pair them rather than pick one is structural — DeepEval has the best Python-first conversational metrics surface in 2026 (multi-turn ConversationalGEval, ConversationSimulator, pytest-style integration that drops straight into our existing asyncio_mode = "auto" config), but its hosted Confident AI sidecar is weaker than Langfuse on prompt versioning, session replay, and self-hostability. Langfuse has the production observability story we want (LangChain/LangGraph CallbackHandler, propagated metadata, OSS self-host, prompt CMS), but its eval primitives are thinner than DeepEval's and require more glue. Braintrust is the third path Adrian could justifiably take and would accelerate CI integration via their GitHub Action, but it's hosted-first with usage-based pricing that gets uncomfortable at multi-tenant scale, and the OSS escape hatch is weaker. Maxim is the fourth and we reject it — its differentiator is enterprise-grade governance dashboards, not eval rigour, and we don't need the dashboards.

1. Recommended eval stack (verdict)

Verdict: DeepEval as primary + Langfuse for observability.

Why this pair, and not the others

The four candidates that survived an honest scan:

Stack	What it's best at	Where it breaks down for Ratiba
Langfuse-as-backbone + custom Python eval runner	Observability is best-in-class, OSS, no vendor capture	We'd write the conversational eval primitives DeepEval already ships. Maintenance cost we don't need on a 1-FTE team.
DeepEval primary + Langfuse observability (chosen)	DeepEval ships `ConversationalGEval`, `ConversationSimulator`, `ConversationalGolden`, which are exactly the abstractions we need. Langfuse handles tracing + prompt CMS.	Two tools to wire together. Mitigated by a thin adapter, which we write ourselves (§9), that pushes DeepEval results to Langfuse.
Braintrust as primary	Best CI ergonomics (GitHub Action posts eval diffs as PR comments). Strong "experiment" abstraction.	Hosted-first; OSS story is thin; usage-based pricing for a multi-tenant product where a noisy tenant could spike eval costs. Vendor dependency on a Series-B company.
Maxim + Langfuse	Enterprise dashboards, governance review queues	Overkill. Maxim's differentiator is approvals-and-governance UI, which a solo founder doesn't need until there's a compliance team.

Concrete library pins (recommended at lock-in)

Add to backend/pyproject.toml [project.optional-dependencies]:

eval = [
  "deepeval>=3.9",          # 3.9.7 was the latest stable on 2026-04-14
  "langfuse>=4.0",          # v4 SDK released March 2026; OTel-native
]

The dev group remains unchanged. The eval group is opt-in (uv sync --extra eval) so plain unit-test CI doesn't pay the install cost.

asyncio_mode = "auto" is already set in pyproject, so async test functions that await DeepEval metrics need no per-test @pytest.mark.asyncio decorator. Confirmed compatible: DeepEval's pytest integration follows the same pattern as ours.

2. Test pyramid for conversation-as-state

The classical pyramid (lots of unit tests, fewer integration, even fewer e2e) does not survive contact with conversation-as-state. When state lives in a LangGraph checkpointer that itself lives in Postgres, a "unit test" of a single FSM transition is borderline meaningless without the checkpointer present. The right reframing:

                   ┌──────────────────────────┐
                   │  (5)  Production replay  │   nightly, sampled
                   └──────────────────────────┘
                  ┌────────────────────────────┐
                  │  (4)  Golden conversations  │   on every prompt PR
                  └────────────────────────────┘
                ┌────────────────────────────────┐
                │  (3)  End-to-end transcript    │   on PR (small sample)
                │       replay (testcontainers)  │
                └────────────────────────────────┘
              ┌────────────────────────────────────┐
              │ (2)  Integration: full turn        │   workhorse layer
              │      intent → FSM → tool → answer  │
              └────────────────────────────────────┘
            ┌────────────────────────────────────────┐
            │ (1)  FSM transition: state + utterance │   pure asyncio, fast
            │      → next state + side-effects list  │
            └────────────────────────────────────────┘

Layer 1 — FSM transition (pytest, no LLM, no DB). A single FSM transition given a (BookingState, IncomingMessage) pair returns a (BookingState, list[Effect]) tuple where Effect is a typed enum (SendWhatsApp, CallTool("calendar.find_slots"), EnqueueAdminHandoff, etc.). The LLM is mocked at this level (one of the few MOCK APPROVED cases per S4U methodology, justified because we're testing the FSM, not the model). Should be ~70% of test count, ~10% of test runtime. Lives in backend/tests/unit/fsm/.

Layer 2 — Full conversation turn against testcontainers. A real Postgres testcontainer with the tenant schema + the LangGraph langgraph-checkpoint-postgres saver pointed at it. A real Redis testcontainer for hot state. The LLM call goes through an LLMRouter that, in tests, dispatches to a RecordedLLM that replays canned responses from tests/fixtures/llm_recordings/<test_id>.jsonl (recorded once against the real model, then replayed deterministically). This is the workhorse layer: per the LangGraph adoption decision in ADR-0001 (Option A, TenantScopedSaver), each test gets its own per-tenant saver instance pointed at a freshly-migrated schema in the Postgres testcontainer. Lives in backend/tests/integration/orchestrator/.
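The replay half of that design can be sketched in a few lines. This is an illustration under assumptions, not the planned implementation: the `complete` method name and the one-JSON-object-per-line record shape (`{"prompt": ..., "response": ...}`) are hypothetical, since the recording format is not specified here.

```python
import json
from pathlib import Path


class RecordedLLM:
    """Replays canned completions in recorded order and fails loudly on drift."""

    def __init__(self, recording_path: Path) -> None:
        lines = recording_path.read_text(encoding="utf-8").splitlines()
        # Assumed record shape: {"prompt": "...", "response": "..."} per line.
        self._records = [json.loads(line) for line in lines if line.strip()]
        self._cursor = 0

    async def complete(self, prompt: str) -> str:
        if self._cursor >= len(self._records):
            raise AssertionError("test made more LLM calls than were recorded")
        record = self._records[self._cursor]
        if record["prompt"] != prompt:
            # A changed prompt means the fixture is stale; never fall through to the real model.
            raise AssertionError(f"prompt drift at call {self._cursor}: re-record the fixture")
        self._cursor += 1
        return record["response"]
```

Failing on a prompt mismatch, rather than silently returning the next canned response, is the design choice that keeps a stale recording from masking a prompt change.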

Layer 3 — End-to-end transcript replay. A small set (~10 captured WhatsApp transcripts) replayed through the full backend pipeline — webhook receiver → tenant resolver → orchestrator → answer-shaper → outgoing message — with the LLM still recorded but every other component live. Used as a smoke test on PR. Lives in backend/tests/e2e/transcripts/.

Layer 4 — Golden conversation snapshots (DeepEval). Curated scenarios with assertions on conversation-level metrics: ConversationCompletenessMetric, TurnRelevancyMetric, KnowledgeRetentionMetric, plus our custom Ratiba metrics (SwahiliFluencyMetric, MpesaPaymentSafetyMetric, BookingSlotConsistencyMetric). LLM-as-judge here, with the bias mitigations spelled out in §4. Lives in backend/tests/eval/conversations/.

Layer 5 — Production replay (Langfuse). Nightly job samples N production conversations from Langfuse, replays them through the current build (with masked tenant data, per multi-tenant hygiene), and flags any divergence. This is online eval, covered in §6.

The shape inverts naturally. Layer 1 has more tests than any other layer, but Layer 4 is where we spend cognitive effort on test design, because a regressed prompt is the failure mode that hurts most. This matches the Red Hat 8-stage observation: "the agent might communicate an answer in many different ways, making it difficult to test with frameworks that expect deterministic output" (Red Hat eval-driven dev).

3. Golden-conversation snapshot tests

Format: YAML, one file per scenario, conversation under a `turns:` key

# backend/tests/eval/conversations/scenarios/spa_booking_swahili_happy.yaml
id: spa_booking_swahili_happy
description: Customer books a 60-min massage in Swahili, no slot conflicts
language: sw
tenant_archetype: spa
prompt_version: "booking-orchestrator@2026-04-25.1"
recorded_against:
  llm: claude-sonnet-4-7
  fsm_version: "0.3.2"

scenario: |
  A returning customer wants to book a 60-minute aromatherapy massage
  for tomorrow afternoon. Slots are available at 14:00 and 16:00.
  Customer prefers the earlier slot.

expected_outcome: |
  Booking confirmed for 14:00, M-Pesa STK push initiated, customer
  receives confirmation in Swahili.

turns:
  - role: user
    content: "Habari, nataka kufanya booking ya massage kesho"
  - role: assistant
    expected_intent: collect_service_details
    expected_slots_filled: [service_category]
    must_contain_swahili: true
    must_not_contain: ["I don't understand", "English only"]
  - role: user
    content: "Aromatherapy, kama saa moja"
  - role: assistant
    expected_intent: offer_slots
    expected_tool_calls: ["calendar.find_slots"]
    expected_slots_filled: [service_category, service_name, duration_min]
  # … etc.

assertions:
  - metric: ConversationCompletenessMetric
    threshold: 0.85
  - metric: SwahiliFluencyMetric          # custom, see §4
    threshold: 0.80
  - metric: BookingSlotConsistencyMetric  # custom: agent never offers a slot it didn't fetch
    threshold: 1.0                         # zero tolerance
  - metric: MpesaPaymentSafetyMetric      # custom: STK push only after explicit confirmation
    threshold: 1.0                         # zero tolerance

Why YAML, not JSON/JSONL.

Bilingual content with multi-line natural-language strings reads catastrophically as escaped JSON.
YAML supports comments — load-bearing for "this turn was hand-edited 2026-04-25 because the original transcript had a typo".
One file per scenario means git blame works at the scenario level (which scenario was last touched, by whom, when).
JSONL would be denser but loses the comment + diff-friendly properties; reserve JSONL for tests/fixtures/llm_recordings/ where files are machine-written and never hand-edited.

Where they live.

backend/tests/eval/conversations/
  scenarios/
    spa_booking_swahili_happy.yaml
    spa_booking_english_happy.yaml
    barbershop_double_booking_recovery.yaml
    dental_payment_failure_retry.yaml
    physio_admin_handoff.yaml
    ... (~50 scenarios at PoC, growing to ~200 by Phase 2)
  metrics/
    swahili_fluency.py        # custom DeepEval metric
    mpesa_payment_safety.py
    booking_slot_consistency.py
  conftest.py                 # loads YAML, builds DeepEval test cases
  test_golden_conversations.py  # the actual pytest entry point
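As a rough sketch of the validation step conftest.py might run before building DeepEval test cases, the function below takes one already-parsed scenario (the dict a YAML loader would return) and checks it. The `Scenario` and `Assertion` types and the specific rules (conversation opens with a user turn, roles alternate, thresholds lie in [0, 1]) are assumptions for illustration, not a settled schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Assertion:
    metric: str
    threshold: float


@dataclass(frozen=True)
class Scenario:
    id: str
    language: str
    turns: list[dict[str, Any]]
    assertions: list[Assertion]


def parse_scenario(raw: dict[str, Any]) -> Scenario:
    """Validate one parsed scenario document and return a typed view of it."""
    sid = raw["id"]
    turns = raw["turns"]
    if not turns or turns[0]["role"] != "user":
        raise ValueError(f"{sid}: conversation must open with a user turn")
    for prev, cur in zip(turns, turns[1:]):
        if prev["role"] == cur["role"]:
            raise ValueError(f"{sid}: consecutive {cur['role']} turns")
    for turn in turns:
        if turn["role"] == "user" and not turn.get("content"):
            raise ValueError(f"{sid}: user turn without content")
    assertions = [Assertion(a["metric"], float(a["threshold"])) for a in raw["assertions"]]
    for a in assertions:
        if not 0.0 <= a.threshold <= 1.0:
            raise ValueError(f"{sid}: threshold for {a.metric} outside [0, 1]")
    return Scenario(sid, raw["language"], turns, assertions)
```

Rejecting malformed scenarios at load time keeps a typo in a hand-edited YAML file from surfacing later as a confusing metric failure.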

Update workflow when the FSM changes.

prompt or FSM change pushed to a PR
        │
        ▼
CI runs the full golden-conversation suite
        │
        ├── ALL PASS  → merge proceeds
        │
        └── REGRESSIONS DETECTED → CI fails, blocks merge
                │
                ▼
        Developer runs `pytest tests/eval --update-snapshots`
                │
                ▼
        Tool emits a diff report:
            - Scenario X: 3 turns changed
            - For each: old vs new transcript side-by-side
            - Score deltas per metric
                │
                ▼
        DEVELOPER MANUALLY REVIEWS the diff
                │
                ├── Acceptable change   → commit updated YAML
                │                          (PR template forces a
                │                          "why this change is good" note)
                │
                └── Genuine regression  → fix the prompt/FSM, don't update

Manual review is non-negotiable. Auto-accept of snapshot diffs is how prompt regressions get rubber-stamped into production. The "why this change is good" PR-template note is the social mechanism that forces a human to actually look. This mirrors how Jest / pytest-snapshot shops handle it; the LLM-output case is more dangerous, not less.
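The transcript part of the diff report can be approximated with the standard library. This sketch uses a unified diff rather than the side-by-side layout described above, and the `transcript_diff` name and its inputs (one string per assistant turn) are hypothetical.

```python
import difflib


def transcript_diff(scenario_id: str, old_turns: list[str], new_turns: list[str]) -> str:
    """Return a unified diff of assistant turns, or an empty string if nothing changed."""
    lines = difflib.unified_diff(
        old_turns,
        new_turns,
        fromfile=f"{scenario_id} (committed)",
        tofile=f"{scenario_id} (new run)",
        lineterm="",
    )
    return "\n".join(lines)
```

An empty result means the scenario replayed identically, so the reviewer only sees scenarios that actually changed.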

Recording layer. When a scenario is added or updated, a one-shot script (scripts/record_llm_responses.py) runs the scenario against the real LLM, captures the responses into tests/fixtures/llm_recordings/<scenario_id>.jsonl, and commits both the YAML and the recording in the same PR. CI then runs deterministically off the recording. To re-record (e.g., after a model upgrade), the developer runs the script with --force and reviews the diff.

4. LLM-as-judge patterns

Which model judges Ratiba's bilingual responses

Recommendation: Claude Opus 4.7 as primary judge, GPT-4-class as a cross-check on a 10% sample.

Rationale:

Self-preference bias is real and measurable. Self-Preference Bias in LLM-as-a-Judge (arXiv 2410.21819) shows GPT-4 systematically over-rates its own outputs, and the mechanism is perplexity-based — the judge prefers outputs whose token distribution matches what it would have generated. Since Ratiba's generator is also Claude (per ADR-0001's voice/text leaning), using Claude as the judge is a self-preference trap.
But Swahili coverage breaks the symmetry. GPT-4 and Claude both perform measurably better on Swahili than smaller models, but no rigorous public comparison exists between them on East African Swahili specifically (flagged in Phase C §A.1 as uncomfortably thin). Anecdotally, Claude is reported stronger on Swahili register and code-switching. So we have a tension: the judge that's most accurate on the language is the one most prone to self-preference on the content.
Resolution: use Claude as judge but disclose-and-monitor. Run a 10% cross-check sample through GPT-4 (or Gemini) and compute Cohen's kappa between the two judges. If agreement drops below 0.6, the eval suite logs a warning and the affected scores are flagged for human review. This is cheaper than a full dual-judge setup and catches drift.
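For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the agreement expected by chance from each rater's label frequencies. The sketch below assumes each judge's score has already been reduced to a categorical label (for example pass/fail against the metric threshold); the function name is ours.

```python
from collections import Counter
from collections.abc import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

On a 10% cross-check sample the item count is small, so the kappa estimate is noisy; a single run near the 0.6 line is a prompt to look, not a verdict.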

Bias mitigations (the whole list)

Position randomization in pairwise comparisons. Per Judging the Judges (arXiv 2406.07791), shuffling cut Cohen's kappa from 0.807 to 0.639 — proving the bias was there. Always randomize, always.
Score on a rubric, not a vibe. ConversationalGEval does this by design — chain-of-thought-derived evaluation steps from the criteria, then score per step. We use this primitive.
Calibrate the judge against human-labelled samples. Maintain a tests/eval/calibration/human_labelled.yaml with 50–100 conversations rated by a Swahili-fluent reviewer. On every judge-model upgrade (or quarterly, whichever comes first), re-run the judge against this set and compute Cohen's kappa with the human ratings. Target: kappa ≥ 0.7 (substantial agreement). Below that, the judge model isn't trustworthy and we either reject the upgrade or recalibrate the rubrics.
Use multiple independent judge runs and take a majority vote for any score that fails on the first pass. Reduces single-judge variance without doubling cost on the happy path.
Never use the same model as judge and as generator in a tight loop. If we ever ship an evaluator-optimizer workflow (Anthropic's pattern §1 in Phase C), the optimizer must use a different model than the generator — or at minimum, a different temperature regime — to avoid the closed loop where the generator and judge agree because they're the same brain.
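The "majority vote only after a first-pass failure" mitigation can be sketched as follows. The `judge` callable stands in for one independent judge run returning a score; it and the function name are hypothetical, and the default of two extra runs is an assumption.

```python
from collections.abc import Callable


def passes_with_retry_vote(
    judge: Callable[[], float],
    threshold: float,
    extra_runs: int = 2,
) -> bool:
    """One judge call on the happy path; a majority vote only after a first-pass failure."""
    if judge() >= threshold:
        return True
    # The failed first run counts as one "fail" vote alongside the extra runs.
    votes = [False] + [judge() >= threshold for _ in range(extra_runs)]
    return sum(votes) > len(votes) / 2
```

Note the asymmetry: a first-pass pass is never re-checked, so this scheme lowers false failures but does nothing about false passes. That is acceptable for the rubric-scored metrics, less so for the zero-tolerance ones.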

The Swahili-specific problem and the recommended path

Phase C §A.1 flagged Swahili LLM benchmarking as "uncomfortably thin" public information. The eval-suite implication: we cannot rely on published benchmarks; we must build our own calibration set.

Concrete plan:

Start with 50 hand-labelled conversations from the first 30 days of WhatsApp pilot traffic, rated by a Swahili-fluent reviewer on a 1–5 scale across (a) intent comprehension, (b) response naturalness, (c) cultural appropriateness (e.g., "samahani" vs "pole" register).
Compute kappa between Claude-as-judge and the human reviewer on this set. Below 0.6, escalate.
Grow this set monthly; treat it as the ground truth the judge model is trying to approximate.
Build one custom DeepEval metric, SwahiliFluencyMetric, that uses Claude with a carefully-constructed rubric ("Does the response use idiomatic Swahili appropriate for a service-business booking context? Does it avoid Anglicisms unless they're conventional, e.g., 'spa' is fine, 'booking' is conventional but 'reservation' is not?"). The rubric is the calibration mechanism.

Honest uncertainty: I have not found rigorous published evidence that any single LLM is a reliably better Swahili judge than another. Treat the Claude-primary recommendation as a default that should be re-tested against GPT-4 and Gemini as soon as we have the calibration set.

5. Customer-feedback capture loops

Recommendation: WhatsApp interactive reply buttons (primary) + in-conversation 1–5 rating (fallback)

Primary: WhatsApp interactive reply buttons (thumbs-up / thumbs-down / "comment").

After every booking-flow completion (success or abandoned-after-3-turns), the agent sends an interactive reply-button message:

"Asante! Ulipata huduma nzuri leo? / Did we serve you well today?"
[👍 Ndiyo / Yes]  [👎 La / No]  [💬 Sema zaidi / Tell more]

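The 360dialog webhook delivers a messages[].type = "interactive" payload whose interactive.button_reply.id field carries the ID we assigned to the button (e.g., feedback_thumbs_up_<conversation_id>). (A messages[].type = "button" payload with button.payload is what template quick-reply buttons produce, not interactive reply buttons.) The handler: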

Posts a Langfuse score on the trace_id (which equals conversation_id per §8 instrumentation): langfuse.create_score(trace_id=..., name="user_feedback", value=1.0 or 0.0, data_type="NUMERIC").
If "Tell more" → switches to free-text capture mode, persists the comment as a Langfuse score with comment=<text> and value=null.
The conversation is added to a Langfuse dataset (production_feedback_negative or production_feedback_positive) for later replay through the eval suite.

Why buttons not free-text: users with low screen literacy in our target market click a button reliably; they freeze on "rate from 1 to 5". A button carries no language ambiguity ("Ndiyo" is Swahili, "Yes" is English, both work). Per the 360dialog interactive messages docs, interactive reply buttons are a first-class WhatsApp Cloud API primitive and the webhook payload is well-defined.

Fallback: in-conversation 1–5 rating.

If the customer responds to the agent's "anything else?" with a free-text complaint or compliment, the agent recognizes the affect intent and asks:

"Sawa, asante kwa maoni. Tafadhali kadiria huduma yetu kati ya 1 na 5."
"OK, thanks for the feedback. Please rate our service from 1 to 5."

The customer types a digit. This path is lower-volume but captures higher-signal feedback (a customer who volunteered sentiment is worth more than a passive thumb).

Rejected alternatives:

Post-booking SMS survey: the customer experience is "I just paid, now you're texting me again to ask how I felt about paying" — bad UX, low response rate.
Star ratings: WhatsApp doesn't render them well; numbers are clearer.
Native WhatsApp template-message thumbs (the ones Meta auto-attaches to template messages): per Turn.io's writeup, this feedback goes to Meta, not the business, with no webhook. Useless for our purposes.

How feedback feeds back into the eval suite

The feedback loop closes via the Langfuse → DeepEval bridge:

WhatsApp button click
        │
        ▼
360dialog webhook → backend handler
        │
        ▼
langfuse.create_score(trace_id=conversation_id,
                       name="user_feedback",
                       value=1.0|0.0)
        │
        ▼
[scheduled: weekly]
Langfuse dataset filter:
  WHERE score.user_feedback = 0.0
        AND timestamp >= last_run
  → Export to backend/tests/eval/conversations/from_production/
        │
        ▼
Engineer reviews each negative-feedback conversation,
PII-masks the tenant data, converts to a YAML scenario
        │
        ▼
Adds to the golden-conversation suite as a regression test
"agent must NOT do this again"

This is the load-bearing part. Without this loop, the eval suite ages from "current production reality" into "stale fixtures" within months. With it, every customer thumb-down becomes a future regression test. Adrian flagged this as "load-bearing" — it is.

6. Production eval loops: phased

Phase 1 (now → end of M6, M-Pesa launch): offline only.

Eval suite runs on every prompt PR via CI (§7).
Eval suite runs nightly on main against the full golden-conversation set.
No production sampling — we don't have enough production traffic for it to be statistically meaningful, and the engineering cost of building the sampler isn't justified yet.
Customer feedback is captured (§5) and accumulated into Langfuse datasets, but doesn't yet auto-feed back into eval.

Phase 2 (post-M6, ~50+ conversations/day): add online sampling.

A scheduled job (every 6 hours) samples 5% of completed conversations from Langfuse, replays them through the current build's orchestrator (with the LLM call going to a recording layer to control costs), and compares the replayed FSM transition trace to the original. Divergence → flag for review.
The negative-feedback dataset auto-promotes to a "candidate regression scenarios" queue; an engineer triages weekly and either (a) converts to a formal eval scenario, (b) marks as known-good, or (c) flags as a real bug.
Monitoring dashboards (Langfuse) show feedback-score trend per tenant per prompt version.

Phase 3 (multi-tenant scale, ~1k+ conversations/day): canary on prompt changes.

New prompt versions deploy to 5% of conversations for 24h.
Compare metrics (completion rate, feedback score, latency, cost-per-conversation, escalation rate) between cohort A (current prompt) and cohort B (new prompt).
Statistical significance check before promoting to 100%. If the new prompt is significantly worse on any metric, roll back automatically.
This requires per-conversation prompt-version routing, which means the orchestrator must consult a "prompt assignment service" (deterministic hash of (tenant_id, conversation_id) → prompt_version) at the start of each conversation. Build this in Phase 3, not before.
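The deterministic-hash assignment can be sketched with the standard library. The function name, the bucket count of 100, and the choice of SHA-256 are assumptions; the only fixed points are the (tenant_id, conversation_id) key and the 5% canary share.

```python
import hashlib


def assign_prompt_version(
    tenant_id: str,
    conversation_id: str,
    current: str,
    candidate: str,
    canary_percent: int = 5,
) -> str:
    """Map a conversation to a prompt version; the same inputs always give the same answer."""
    digest = hashlib.sha256(f"{tenant_id}:{conversation_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return candidate if bucket < canary_percent else current
```

Because the assignment is a pure function of the IDs, every turn of a conversation lands on the same prompt version without storing any routing state.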

Why phased. Online sampling and canary deployment are real engineering work (probably 1–2 weeks each), and they pay off only when there's enough traffic to make their conclusions statistically valid. Adopting them in Phase 1 means engineering days spent on infrastructure that processes 3 conversations a night. Wait.

7. Regression detection on prompt changes — CI integration

The eval suite gates a prompt PR via two pytest commands, both scoped to PRs that touch app/orchestrator/prompts/ or app/orchestrator/fsm/. The fast suite runs on every push; the full suite runs once the PR carries the prompt-change label:

# Fast suite (must pass, blocks merge)
# ~2 minutes, runs on every push
pytest backend/tests/eval/conversations \
       -m "smoke" \
       --deepeval-cache \
       -n auto

# Full suite (must pass, blocks merge)
# ~15 minutes, runs on PRs labelled prompt-change
pytest backend/tests/eval/conversations \
       --deepeval-cache \
       --langfuse-publish-results \
       -n 4

The --langfuse-publish-results flag is a custom plugin that publishes the run as a Langfuse "experiment" so we get historical trend data per metric per prompt version. Cost: a few hundred LLM calls per full run, ~$5–15 depending on judge model — affordable on PR cadence.

Concrete CI workflow (.github/workflows/eval-on-prompt-change.yml):

name: Eval on prompt change
on:
  pull_request:
    paths:
      - 'backend/app/orchestrator/prompts/**'
      - 'backend/app/orchestrator/fsm/**'
      - 'backend/tests/eval/**'

jobs:
  golden-conversations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: astral-sh/setup-uv@v4
      - run: uv sync --extra dev --extra eval
      - name: Run golden-conversation eval suite
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY_CI }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY_CI }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY_CI }}
          OPENAI_API_KEY:    ${{ secrets.OPENAI_API_KEY_CI }}  # cross-check judge
        run: |
          cd backend
          pytest tests/eval/conversations \
                 --deepeval-cache \
                 --langfuse-publish-results \
                 -n 4 \
                 --junitxml=eval-results.xml
      - name: Post eval delta as PR comment
        if: always()
        uses: ./.github/actions/post-eval-delta
        with:
          results: eval-results.xml
          baseline-prompt-version: main

The Red Hat 8-stage framework anchors this: stage 4 (multi-type coverage) and stage 5 (CI/CD integration) are exactly what we're building. Their operational data — "20 conversations using different models and different prompting approaches run every night" — is a useful sanity check on volume; we should be in the same order of magnitude. Their use of DeepEval as the eval primitive is corroborating evidence for our choice.

Block, don't warn. A prompt PR that regresses the eval suite must fail CI. Warnings get rubber-stamped; failures get fixed.

8. Observability — Langfuse, with concrete instrumentation plan

Verdict: Langfuse, self-hosted on the production VPS. Aligned with Phase C §10 — and now reinforced by the deeper compare in this round:

vs LangSmith: LangChain/LangGraph integration is similar quality on both, but LangSmith is hosted-only at $39/seat/mo and doesn't offer an OSS escape hatch. Vendor lock-in for a multi-tenant compliance-leaning product is the wrong shape.
vs Helicone: Helicone's gateway model (URL/header swap) is the fastest setup in the world, but it sees only LLM calls — not tool calls, not FSM transitions, not handoffs. For a workflow that's 70% non-LLM logic, that's a dealbreaker.
vs Phoenix: Phoenix is genuinely easier to self-host (single Docker container vs Langfuse's PostgreSQL + ClickHouse + Redis + S3 stack). Several Langfuse features (Prompt Playground, LLM-as-Judge) used to be paywalled even when self-hosted; Langfuse announced in mid-2025 that these were moving into the MIT-licensed core, so check the current licence before counting this against it. This is the most credible alternative. The reason we still pick Langfuse: the prompt CMS and session-replay primitives are stronger, and the Docker Compose burden is something we already accept (we're running 8+ services anyway). If self-host complexity becomes a real pain by M9, revisit Phoenix.
vs Arize AX: Enterprise-grade overkill. Statistical drift detection is impressive, but a solo founder doesn't operate at the scale where it matters yet.

What to instrument

Every conversation produces one Langfuse trace, with these spans:

Event	Span name	Metadata to attach
Inbound message received	`inbound.whatsapp` or `inbound.voice`	`tenant_id`, `conversation_id`, `channel`, `wa_message_id`, `phone_hash`
Tenant resolution	`tenant.resolve`	`tenant_id`, `resolution_method` (channel-mapping or phone-mapping)
FSM transition	`fsm.transition`	`from_state`, `to_state`, `trigger_intent`, `slots_filled`
LLM call	(auto via CallbackHandler)	`prompt_version`, `model`, `input_tokens`, `output_tokens`, `latency_ms`, `cost_usd`
Tool call	`tool.<name>` (e.g., `tool.calendar.find_slots`)	`args_hash`, `result_status`, `latency_ms`
Answer-shaper	`answer_shaper.<channel>`	`output_format` (text/buttons/list/voice), `truncated_to_voice_constraints` (bool)
Outbound send	`outbound.<channel>.send`	`wa_message_id_outbound`, `provider_response_status`
Admin handoff (interrupt)	`handoff.interrupt`	`reason`, `last_n_turns_summary`
Admin handback (resume)	`handoff.resume`	`admin_user_id_hash`, `handoff_duration_s`
User feedback received	(score, not span)	`value`, `comment` (if any)

Metadata propagation pattern (Langfuse SDK v4 idiom):

from langfuse import get_client, propagate_attributes

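langfuse = get_client()

# At the inbound webhook handler (inside an async function).
# Langfuse trace IDs must be 32 lowercase hex characters, so derive one
# deterministically from our conversation ID instead of passing it raw.
# Assumption: create_trace_id(seed=...) is the SDK helper for this; confirm under v4.
trace_id = langfuse.create_trace_id(seed=conversation_id)

with langfuse.start_as_current_observation(
    as_type="span",
    name=f"inbound.{channel}",
    trace_context={"trace_id": trace_id},  # reuse this derived ID when posting scores (§5)
) as root_span: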
    with propagate_attributes(metadata={
        "tenant_id": tenant_id,
        "conversation_id": conversation_id,
        "channel": channel,
        "prompt_version": current_prompt_version_for(tenant_id),
    }):
        # All nested observations inherit tenant_id, prompt_version, etc.
        await orchestrator.handle(message)

Tenant isolation in Langfuse. Use Langfuse projects to separate tenants — one project per major tenant, or one shared project with tenant_id as a metadata key for smaller tenants. The choice matters for data-residency and access-control: production tenants in regulated verticals (dental, medical) get their own project; PoC tenants share. ADR-0004 should formalize this.

Note on the propagated-metadata gotcha. Per Langfuse issue #8493, the LangChain/LangGraph callback handler has a known issue where if metadata is set after the root span is created, it doesn't propagate to the trace itself — only to child spans. Workaround: set metadata in the very first span creation, before the LangGraph .invoke() call. This is a real bug to plan around, not theoretical.

Cost attribution. Langfuse computes cost per span automatically given the model name + token counts. With prompt_version and tenant_id propagated as metadata, we get free dashboards for "cost per tenant per prompt version per day" — which is exactly the slicing needed when one prompt change tanks margins on a single tenant.

9. Sister-project reuse

Honest answer: nothing concrete to lift from zol-rag or trust-relay-workflow yet; both are at similar early-stage maturity. The user-memory files (reference_voice_stack_pattern.md, reference_sister_projects.md) describe patterns at the architectural level, not the eval-framework level. I did not find sister-project source code that already solves the conversation-eval problem in a way Ratiba can copy.

What we can lift conceptually:

trust-relay-workflow's [tool.pytest.ini_options] and [tool.coverage.report] conventions — already lifted into backend/pyproject.toml (note the comments explicitly mirroring trust-relay).
trust-relay-workflow's structured-logging conventions — structlog is already pinned in pyproject and called out in the Phase B auto-debug logging spec. The eval suite should use the same logger configuration so test output and production traces have the same shape (makes reproducing a production failure in a test trivial).
zol-rag's voice-stack pattern — relevant for Phase 2 voice eval, but doesn't help the WhatsApp eval suite that ADR-0004 is about. When voice eval comes online, we should expect to lift the STT-recording / TTS-byte-comparison harness from zol-rag if it exists by then.

Recommend building from scratch for:

The DeepEval/Langfuse glue layer (publish DeepEval results to Langfuse as experiments).
The YAML golden-conversation format and loader.
The custom DeepEval metrics (SwahiliFluencyMetric, MpesaPaymentSafetyMetric, BookingSlotConsistencyMetric).
The Langfuse → eval-suite feedback bridge (negative-feedback conversations into candidate scenarios).
The LLM-recording layer for deterministic CI replay.

Recommend coordinating with sister projects on:

A shared s4u-eval-utils package down the road, if all three projects converge on DeepEval + Langfuse. Premature now — let Ratiba prove the pattern first, then extract.

10. Open questions surfaced for ADR-0004

These are the decisions ADR-0004 needs to take a position on. Each is a real fork in the road that this research doesn't resolve.

Self-hosted Langfuse vs Langfuse Cloud for Phase 1. Self-hosting is consistent with our OSS-leaning posture and tenant-data-residency story, but adds 4 services (Postgres, ClickHouse, Redis, S3) to the dev/prod compose stack. Langfuse Cloud is friction-free but pulls customer conversation data (even masked) onto a third-party SaaS. Recommendation: self-host from day 1, but ADR-0004 must make this explicit because the dev-environment cost is real.
How does the eval suite consume the multi-tenant TenantScopedSaver? Option A from ADR-0001's amendment defines per-tenant savers; the eval suite needs a clean way to spin one up against a testcontainered Postgres without leaking state across scenarios. Recommended pattern: a pytest fixture tenant_scoped_eval_environment that creates a fresh tenant schema, migrates it, builds a TenantScopedSaver pointing at it, and yields the orchestrator. ADR-0004 should specify the fixture's exact contract.
Prompt-version routing in Phase 3 canary. Where does the "prompt assignment service" live? Possibilities: (a) Redis-backed deterministic-hash lookup at the start of each conversation; (b) Langfuse's experiment-routing primitive (if/when it ships); (c) tenant-config field. ADR-0004 can defer this to Phase 3 but should at minimum acknowledge it as a future decision so we don't paint ourselves into a corner.
Calibration-set ownership and refresh cadence. Who curates tests/eval/calibration/human_labelled.yaml and how often does it grow? Suggested: monthly cadence, ~10 new conversations per month, owned by Adrian until there's a second team member. ADR-0004 should name the owner and the refresh policy explicitly, otherwise it dies of neglect.
Cost ceiling on the eval suite. A full eval run with judge-model calls + cross-check sample is ~$10–20. Nightly runs on main alone therefore come to ~$300–600/month (30 runs) before any prompt-PR runs are counted, unless the judge cache absorbs most of the calls. Acceptable at PoC scale, but ADR-0004 should specify the budget ceiling and what happens when it's hit (skip cross-check? sample fewer scenarios? raise the budget?).
PII handling in production-replay datasets. When a negative-feedback conversation is promoted to a regression test, what gets masked? Phone numbers definitely. Customer names probably. Service descriptions (could contain medical info for dental/physio tenants) — definitely for those tenants. ADR-0004 should reference a separate data-handling policy doc, not solve it inline, but must commit to one existing.
DeepEval cache invalidation policy. DeepEval caches LLM-judge calls between runs (cheap re-runs). When does the cache get invalidated? Per prompt-version change, yes. But also when the judge model updates? The DeepEval docs don't make this fully obvious; ADR-0004 should specify our policy (suggested: cache key includes (scenario_id, prompt_version, judge_model_version, metric_version)).
Bilingual judge mode. Do we use one SwahiliFluencyMetric for Swahili turns and one EnglishFluencyMetric for English, or one combined metric that detects language first? Trade-off: combined is cleaner, language-specific is more precise. Recommendation: language-specific, gated by a language: field in the YAML scenario. ADR-0004 should formalize.
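The cache key suggested in the cache-invalidation question above could be derived as below. This is a sketch of our own keying scheme, not of DeepEval's cache internals; whether DeepEval's built-in cache can be keyed this way is unconfirmed, so this may end up living in a wrapper of ours. The function name is hypothetical.

```python
import hashlib
import json


def judge_cache_key(
    scenario_id: str,
    prompt_version: str,
    judge_model_version: str,
    metric_version: str,
) -> str:
    """Stable key that changes whenever any of the four inputs changes."""
    # JSON-encoding the list avoids collisions that naive string joining allows
    # (e.g. "a:b" + "c" vs "a" + "b:c").
    payload = json.dumps(
        [scenario_id, prompt_version, judge_model_version, metric_version],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including judge_model_version in the key means a judge upgrade invalidates every cached score automatically, which is the behaviour the calibration policy in §4 relies on.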

Appendix A — Concrete dependency additions

# Add to backend/pyproject.toml [project.optional-dependencies]:

eval = [
  "deepeval>=3.9",          # 3.9.7 latest stable on 2026-04-14
  "langfuse>=4.0",          # v4 SDK, March 2026; OTel-native
]

The dev group remains unchanged. The eval group is opt-in (uv sync --extra eval) so plain unit-test CI doesn't pay the install cost.
