Agentic Development — S4U Methodology Supplement

1. Scope

This supplement codifies how Ratiba operates around the non-deterministic parts of an agentic product: prompts, model outputs, conversation flows, eval suites, observability traces. It supplements — does not replace — the global S4U methodology at /Users/soft4u/Development/s4u-methodology/docs/methodology.md and the project landing at docs/methodology/index.md. Where the global methodology applies cleanly (design-before-code, ADRs, no-mock default, test evidence, quality gates, subagent dispatch, direct-to-master), defer to it. Where the agentic context introduces failure modes the deterministic playbook doesn't address — silent prompt regressions, eval drift, model deprecation, opaque production failures that need an AI to diagnose them — this document is authoritative.

In scope. Subagent discipline for non-deterministic work, prompt versioning, eval-driven development integration with CI, the auto-debuggable logging schema, delegate-vs-human-review boundaries, model deprecation ritual, library-currency operations, sister-project reuse, harness additions.

Out of scope. Anything already decided in ADR-0001 (the stack, the LangGraph wrapper, psycopg/asyncpg coexistence, the library-currency policy — this doc covers its operations). Anything decided in A1–A4 research (orchestration shape, handoff model, eval stack choice, payments shape — this doc references them, doesn't re-decide). Anything specific to M-Pesa or Daraja credentials handling — that lives with the payments ADR (candidate ADR-0007). Anything telephony-specific to LiveKit (deferred to voice-stack docs in M7).

Voice. Opinionated. One recommendation per question. Where evidence is thin, this document says so explicitly.


2. Subagent-driven development for non-deterministic artifacts

The global S4U rule is "fresh subagent per task, two-stage review." That rule was written for code where the artifact is a function or a migration and "done" is measurable. With prompts, evals, and model swaps, the artifact's correctness is itself a probabilistic judgment, and naive subagent dispatch multiplies the non-determinism instead of containing it.

Decision rule. Dispatch a subagent when the work has at least one deterministic anchor the subagent can return: a passing test, a numeric eval score, a generated artifact with a known shape, or a structured research document. If the artifact is "go iterate on this prompt until it feels right," keep it in the main session — the iteration loop needs a human in the chair. The wrong shape is a subagent that comes back saying "I improved the prompt"; the right shape is a subagent that comes back saying "I added 7 golden-conversation scenarios, the eval suite passes, here are the diffs."

Containment patterns by artifact type.

| Work type | Dispatch a subagent? | Why / how to scope |
| --- | --- | --- |
| Prompt tuning (rewriting the booking-orchestrator system prompt) | Conditional. | Only after a baseline eval suite exists. Subagent's task: "Modify prompt X to satisfy failing scenarios A, B, C; do not regress passing scenarios; deliver the updated prompt file plus the eval run output as a single PR." Without the eval anchor, keep this in main. |
| Adding eval scenarios from production transcripts | Yes. | High-leverage, mechanical. Subagent's task: "Convert these 5 PII-masked transcripts into golden-conversation YAML files following docs/research/2026-04-25-eval-frameworks.md §3, run pytest tests/eval -k <new_ids> to confirm they execute, return diffs." |
| Building a new custom DeepEval metric | Yes. | Bounded surface, testable. Subagent's task: scaffold metric class, wire one positive + one negative test, deliver passing test output. |
| Model replacement (swap Claude Sonnet for Opus on a node) | No (orchestrate from main). | Touches multiple seams: prompt, cost ceiling, eval calibration, possibly the LLM-as-judge cross-check. Run the swap from main; dispatch subagents only for sub-tasks (e.g., "re-record LLM fixtures" or "update cost-ceiling test"). |
| Conversation-flow design (FSM state-graph changes) | No. | Architectural; needs an ADR; needs the design-before-code spec. Brainstorm in main, write spec, then optionally dispatch a subagent for the mechanical state-table implementation. |
| LangGraph node implementation against a written spec | Yes. | Spec is the deterministic anchor. Subagent's task: implement nodes/slot_filler.py to satisfy spec at docs/superpowers/specs/..., deliver passing unit tests + integration test against testcontainered Postgres. |
| Refactoring the answer-shaper for a new channel | Yes. | Code-shaped. Standard S4U dispatch. |
| Investigating a production incident (auto-debuggable logging trail) | Yes, and this is the highest-value case. | The whole logging schema in §5 is designed to make this dispatch crisp: "Diagnose conversation <id>. The trace is at Langfuse <url> and the structured logs are at <path>. Return a markdown post-mortem with timeline, root cause hypothesis, and proposed fix." |

Subagent-prompt template for non-deterministic work. The template that contains the non-determinism lives at docs/methodology/templates/subagent-prompt-nondeterministic.md (harness addition — see §11). The mandatory fields:

  1. Deterministic anchor. What measurable artifact must the subagent return? (Passing tests, eval score above N, file diff matching shape Y.)
  2. Don't-regress invariants. What must remain true? (Existing tests pass, existing eval scenarios don't drop below their current scores, no new dependencies added.)
  3. Stop conditions. When to stop iterating. (After 3 prompt iterations without improvement, hand back; do not silently keep trying.)
  4. Evidence to return. Pasted test output, eval-suite output, file paths touched, list of decisions made and why.
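
The four mandatory fields above can be checked mechanically before a subagent is dispatched. The following is a minimal sketch under stated assumptions: `SubagentBrief` and `missing_fields` are hypothetical names, not part of the template file, and the real template is a markdown document rather than a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class SubagentBrief:
    """Hypothetical container for the four mandatory template fields."""
    deterministic_anchor: str
    dont_regress: list[str] = field(default_factory=list)
    stop_conditions: list[str] = field(default_factory=list)
    evidence_to_return: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of mandatory fields that were left empty."""
        missing = []
        if not self.deterministic_anchor.strip():
            missing.append("deterministic_anchor")
        for name in ("dont_regress", "stop_conditions", "evidence_to_return"):
            if not getattr(self, name):
                missing.append(name)
        return missing


brief = SubagentBrief(
    deterministic_anchor="pytest tests/eval -k booking passes",
    dont_regress=["existing eval scenarios keep their current scores"],
    stop_conditions=["hand back after 3 prompt iterations without improvement"],
)
# evidence_to_return was never filled in, so the brief is not ready to dispatch.
print(brief.missing_fields())
```

A brief with a non-empty result from `missing_fields()` is the "I improved the prompt" shape waiting to happen; refusing to dispatch it is cheaper than reviewing its output.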

The two-stage review remains. First reviewer: did the subagent satisfy the deterministic anchor? (Run the eval, run the tests.) Second reviewer: did the subagent introduce silent quality regressions? (Read the diff; look for "improved" prompts that are subtly worse on edge cases the eval suite doesn't yet cover.)


3. Prompt versioning

Recommendation: prompts live in code, in git, deployed with the binary. Langfuse is observability + experiment tooling, not the source of truth. This is a hybrid model with a clear primary, not a both-and.

Why code-first. Prompts are application logic. They are tested by the eval suite. They are reviewed in PRs. They are deployed atomically with the code that depends on them. Putting them in a live-editable CMS introduces a class of bug that the eval gate cannot catch (someone edits the production prompt at 2 AM; the deployed code is now incompatible with the runtime prompt; nothing in CI saw the change). Langfuse's own docs acknowledge both modes; we pick the safer one for a system that touches real money via M-Pesa.

Why Langfuse alongside. Langfuse gets the trace of which prompt version produced which response, and Langfuse's prompt-management UI is useful for the experimental path (try a draft against recorded conversations before promoting to a code PR). The discipline: anything that lands in Langfuse's prompt CMS is draft; anything that ships is in code.

Storage layout.

backend/app/orchestrator/prompts/
    __init__.py
    registry.py              # loads + caches; assigns version IDs
    booking_orchestrator/
        system.md            # the prompt body
        metadata.yaml        # version, model, created, owner, eval_baseline
    intent_classifier/
        system.md
        metadata.yaml
    slot_extractor_swahili/
        system.md
        metadata.yaml
    handoff_summarizer/
        system.md
        metadata.yaml

Each prompt is a directory because the metadata is load-bearing — version, target model, last eval score, who owns it. The system.md is the body; the metadata.yaml is the contract.

Version identifier. A prompt version is <prompt_name>@<git-short-sha> (e.g., booking_orchestrator@a1b2c3d) computed at startup from the file's git history. This is what gets logged on every span (§5) and what Langfuse sees as prompt_version. Human-readable aliases (@stable, @canary) live in prompts/aliases.yaml and resolve to git SHAs at deploy time.
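
One way the version identifier and alias resolution could look is sketched below. This is an illustration, not the contents of registry.py: the function names are hypothetical, the nested `aliases` mapping is an assumed shape for prompts/aliases.yaml, and `git_short_sha` only works where a git checkout is available at runtime.

```python
import subprocess
from pathlib import Path


def git_short_sha(path: Path) -> str:
    """Short SHA of the last commit that touched `path` (needs a git checkout)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%h", "--", str(path)],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def format_version(prompt_name: str, short_sha: str) -> str:
    """Build the <prompt_name>@<git-short-sha> identifier."""
    return f"{prompt_name}@{short_sha}"


def resolve(ref: str, aliases: dict[str, dict[str, str]]) -> str:
    """Turn `name@stable` into `name@<sha>`; concrete SHAs pass through unchanged."""
    name, _, tag = ref.partition("@")
    sha = aliases.get(name, {}).get(tag, tag)
    return format_version(name, sha)


# Assumed shape of the parsed aliases file: prompt name -> alias -> short SHA.
aliases = {"booking_orchestrator": {"stable": "a1b2c3d", "canary": "e4f5a6b"}}
print(resolve("booking_orchestrator@stable", aliases))
```

Keeping the alias lookup a pure function of the parsed file makes the emergency pin testable without touching git.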

Branching workflow.

1. Engineer creates a branch: prompt/booking-orchestrator-v3
2. Edits system.md + metadata.yaml; bumps version field
3. Runs locally: uv run pytest tests/eval -k booking
4. Opens PR; the path-filtered CI workflow (§4) runs the full eval suite
5. Eval-delta comment posted to PR by post-eval-delta action
6. Two-stage review: eval-delta reviewer + spec-compliance reviewer
7. Merge → tagged release → deploy → Langfuse trace shows new prompt_version

Approval before deploy. A prompt PR cannot merge unless:

  • The eval suite passes (§4).
  • The PR description contains a "why this change is good" note (template enforces).
  • For prompts touching payments or admin handoff: additional human reviewer required (the delegate-vs-human-review list, §6).

Rollback procedure. Two paths, both fast:

  1. Revert the merge commit. Standard git revert + redeploy. Prompt version goes back; eval suite still passes (it's the previously-passing prompt). 5–15 minutes.
  2. Pin via alias for emergency. prompts/aliases.yaml can point @stable at any prior git SHA without a code rebuild — the registry resolves the alias at startup. Useful when the regression isn't in the prompt itself but in surrounding code that the rollback would also undo. Document the pin in an incident note; convert to a proper revert PR within 24h.

What we do not do. No live prompt editing in Langfuse against production. The Langfuse prompt CMS is connected to a separate "experiments" project; nothing routes traffic to it without a code merge.


4. Eval-driven development

Eval gate model: prompts and FSM changes block on the eval suite; non-prompt code changes get advisory eval runs. This is the practical seam between "CI must pass" (prompts, FSM) and "CI is informational" (refactors, infra) so that engineers don't burn eval costs on every typo fix.

Gate matrix.

| File path touched | Eval suite runs? | Gating? | Cost ceiling per run |
| --- | --- | --- | --- |
| backend/app/orchestrator/prompts/** | Full suite | Blocks merge | $15 |
| backend/app/orchestrator/fsm/** | Full suite | Blocks merge | $15 |
| backend/app/orchestrator/nodes/** | Smoke suite | Blocks merge | $3 |
| backend/tests/eval/** | Full suite (self-test) | Blocks merge | $15 |
| backend/app/answer_shaper/** | Smoke suite | Blocks merge | $3 |
| Any other backend code | Skipped | n/a | $0 |
| Renovate dependency PR | Smoke suite | Advisory (warns but doesn't block) | $3 |
| Major Renovate PR (Pydantic, FastAPI, LangGraph, Anthropic SDK) | Full suite | Blocks merge | $15 |

The "smoke suite" is pytest -m smoke over ~10 representative scenarios (~2 min, ~$3). The "full suite" is the entire golden-conversation set (~50 scenarios at PoC, ~15 min, ~$15). Both publish results to Langfuse as experiments so trends are visible across runs.
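
The path rows of the gate matrix can be expressed as a small lookup, which is handy for unit-testing the policy separately from the CI YAML. This is a sketch, not project code: `RULES` and `gate_for` are hypothetical names, and the two Renovate rows are left out because they are selected by PR label rather than by path.

```python
from fnmatch import fnmatch

# (glob, suite, blocks_merge), mirroring the path rows of the matrix.
# fnmatch's "*" also matches "/", so each glob covers nested files.
RULES = [
    ("backend/app/orchestrator/prompts/*", "full", True),
    ("backend/app/orchestrator/fsm/*", "full", True),
    ("backend/tests/eval/*", "full", True),
    ("backend/app/orchestrator/nodes/*", "smoke", True),
    ("backend/app/answer_shaper/*", "smoke", True),
]

RANK = {"skipped": 0, "smoke": 1, "full": 2}


def gate_for(changed_files: list[str]) -> tuple[str, bool]:
    """Return (suite, blocks_merge): the strictest rule any changed file hits."""
    best = ("skipped", False)
    for path in changed_files:
        for pattern, suite, blocks in RULES:
            if fnmatch(path, pattern) and RANK[suite] > RANK[best[0]]:
                best = (suite, blocks)
    return best


print(gate_for(["backend/app/answer_shaper/whatsapp.py", "README.md"]))
```

A PR that touches both a prompt and unrelated code resolves to the full suite, which matches the "block, don't warn" stance for prompt changes.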

Concrete CI integration. The workflow already drafted in docs/research/2026-04-25-eval-frameworks.md §7 is the canonical version. Persist as .github/workflows/eval-on-prompt-change.yml:

name: Eval on prompt change
on:
  pull_request:
    paths:
      - 'backend/app/orchestrator/prompts/**'
      - 'backend/app/orchestrator/fsm/**'
      - 'backend/tests/eval/**'

jobs:
  golden-conversations:
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0  # eval-delta needs git history
      - uses: astral-sh/setup-uv@v4
      - run: uv sync --extra dev --extra eval
        working-directory: backend
      - name: Run golden-conversation eval suite
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY_CI }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY_CI }}
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST_CI }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY_CI }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY_CI }}
          RATIBA_EVAL_BUDGET_USD: "20"
        run: |
          cd backend
          uv run pytest tests/eval/conversations \
            --deepeval-cache \
            --langfuse-publish-results \
            -n 4 \
            --junitxml=eval-results.xml
      - name: Post eval-delta comment
        if: always()
        uses: ./.github/actions/post-eval-delta
        with:
          results: backend/eval-results.xml
          baseline-prompt-version: main

  smoke-on-other-changes:
    # NOTE: the workflow-level `paths` filter above applies to every job,
    # so as written this job only runs for PRs that touch those paths.
    if: github.event.pull_request.changed_files != 0
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v5
      - uses: astral-sh/setup-uv@v4
      - run: uv sync --extra dev --extra eval
        working-directory: backend
      - name: Smoke eval (advisory)
        continue-on-error: true  # advisory: do not fail the PR
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY_CI }}
        run: |
          cd backend
          uv run pytest tests/eval/conversations -m smoke -n auto

Block, don't warn. Per the A3 deliverable: a prompt PR that regresses the eval suite must fail CI. A PR that is path-irrelevant just doesn't trigger the suite at all — silence is better than an ignored warning.

Eval cost guard. RATIBA_EVAL_BUDGET_USD is read by the suite's conftest. If running judge calls would exceed the ceiling, the suite emits a warning and downgrades the cross-check sample from 10% to 0% for the run, rather than running over budget. Documented behaviour beats surprise bills.
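
The downgrade behaviour described above could be sketched as follows. This is an assumption-laden illustration rather than the suite's actual conftest: `cross_check_rate` is a hypothetical helper, and how the projected cost is estimated is left to the caller.

```python
import os
import warnings


def cross_check_rate(projected_usd: float, default_rate: float = 0.10) -> float:
    """Return the judge cross-check sample rate to use for this run.

    Reads RATIBA_EVAL_BUDGET_USD; with no budget set, the default rate applies.
    """
    budget = float(os.environ.get("RATIBA_EVAL_BUDGET_USD", "inf"))
    if projected_usd > budget:
        warnings.warn(
            f"projected eval cost ${projected_usd:.2f} exceeds budget "
            f"${budget:.2f}; disabling judge cross-check for this run"
        )
        return 0.0
    return default_rate


os.environ["RATIBA_EVAL_BUDGET_USD"] = "20"
print(cross_check_rate(12.0))  # under budget: keep the 10% cross-check sample
print(cross_check_rate(25.0))  # over budget: warn and drop the sample to 0%
```

The guard degrades the run instead of aborting it, so the deterministic part of the suite still gates the PR even when the judge budget is exhausted.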


5. Auto-debuggable logging schema

This is the section Adrian flagged as load-bearing. The goal is precise: a fresh Claude session with no project context must be able to diagnose a production issue from logs alone, given only the schema and the conversation/turn/tenant identifiers.

That bar drives every choice below.

5.1 Required fields on every event log line

Every log event emitted by backend code carries the following fields. Some are populated by contextvars (set at request boundary, propagated automatically by structlog), some at the call site. Missing a field is a schema violation enforced at runtime (§5.5).

| Field | Type | Source | Notes |
| --- | --- | --- | --- |
| timestamp | ISO-8601 UTC string | structlog processor | 2026-04-25T13:42:01.123456Z |
| level | enum: debug/info/warning/error/critical | structlog | |
| event | string | call site | Short verb-phrase: fsm.transition, tool.calendar.find_slots.invoked, payment.stk_dispatched |
| tenant_id | UUID string | contextvar | Set at webhook boundary; absent → schema error |
| conversation_id | UUID string | contextvar | Same as Langfuse trace_id (§6) |
| turn_id | int | contextvar | Monotonic per-conversation; 0 for non-turn events |
| event_type | enum: see §5.2 | call site | The dispatchable category |
| fsm_state_before | string or null | contextvar | State name before this event; null on inbound-receive events |
| fsm_state_after | string or null | call site | State name after this event; null on tool/external events |
| model | string or null | call site | claude-sonnet-4-7, deepgram-nova-3, etc.; null if no model used |
| prompt_version | string or null | contextvar | <prompt_name>@<sha> per §3; null if no prompt used |
| latency_ms | int or null | call site | Required on any external-call event |
| cost_usd | float or null | call site | Required on any LLM-call event |
| tool_calls | array of {name, args_hash, result_status, latency_ms} | call site | Empty array if none |
| redacted_payload | object or null | call site | Truncated, PII-masked. Never the full message. See §5.4 |
| correlation_ids | object | contextvar | {wa_message_id, langfuse_trace_id, livekit_session_id, mpesa_merchant_reference}; only populated keys present |
| service | string | startup | Always ratiba-backend; useful when logs co-mingle with other services |
| git_sha | string | startup | Build SHA; disambiguates "which version produced this log" |

5.2 Event types (closed enum)

    inbound.whatsapp
    inbound.voice
    tenant.resolved
    identity.resolved
    fsm.transition
    llm.call
    llm.call.failed
    tool.invoked
    tool.failed
    answer.shaper.rendered
    outbound.send
    outbound.send.failed
    admin.handoff.interrupt
    admin.handoff.resume
    admin.message.relayed
    payment.stk.dispatched
    payment.stk.callback.received
    payment.timeout.fired
    feedback.score.recorded
    error.unhandled
    audit.config.changed

The enum is closed because the diagnostic queries in §5.6 dispatch on it. Adding a new event type requires editing this list and the validator — which is the point: it forces a deliberate schema decision.

5.3 Pydantic model (the source of truth)

The schema lives at backend/app/observability/log_event.py (the implementation), with a JSON-Schema mirror auto-generated to docs/methodology/logging-schema.json for external consumers (a fresh Claude session reads the JSON Schema; the runtime enforces the Pydantic model).

# backend/app/observability/log_event.py
from __future__ import annotations
from datetime import datetime
from enum import Enum
from typing import Any, Literal
from uuid import UUID
from pydantic import BaseModel, ConfigDict, Field, field_validator


class EventType(str, Enum):
    INBOUND_WHATSAPP = "inbound.whatsapp"
    INBOUND_VOICE = "inbound.voice"
    TENANT_RESOLVED = "tenant.resolved"
    IDENTITY_RESOLVED = "identity.resolved"
    FSM_TRANSITION = "fsm.transition"
    LLM_CALL = "llm.call"
    LLM_CALL_FAILED = "llm.call.failed"
    TOOL_INVOKED = "tool.invoked"
    TOOL_FAILED = "tool.failed"
    ANSWER_SHAPER_RENDERED = "answer.shaper.rendered"
    OUTBOUND_SEND = "outbound.send"
    OUTBOUND_SEND_FAILED = "outbound.send.failed"
    ADMIN_HANDOFF_INTERRUPT = "admin.handoff.interrupt"
    ADMIN_HANDOFF_RESUME = "admin.handoff.resume"
    ADMIN_MESSAGE_RELAYED = "admin.message.relayed"
    PAYMENT_STK_DISPATCHED = "payment.stk.dispatched"
    PAYMENT_STK_CALLBACK_RECEIVED = "payment.stk.callback.received"
    PAYMENT_TIMEOUT_FIRED = "payment.timeout.fired"
    FEEDBACK_SCORE_RECORDED = "feedback.score.recorded"
    ERROR_UNHANDLED = "error.unhandled"
    AUDIT_CONFIG_CHANGED = "audit.config.changed"


class ToolCall(BaseModel):
    name: str
    args_hash: str = Field(..., description="sha256 of canonicalised args; full args go to redacted_payload if needed")
    result_status: Literal["ok", "error", "timeout"]
    latency_ms: int


class CorrelationIds(BaseModel):
    model_config = ConfigDict(extra="forbid")
    wa_message_id: str | None = None
    wa_message_id_outbound: str | None = None
    langfuse_trace_id: str | None = None
    livekit_session_id: str | None = None
    mpesa_merchant_reference: str | None = None
    mpesa_checkout_request_id: str | None = None
    pesapal_order_tracking_id: str | None = None


class LogEvent(BaseModel):
    """Canonical log-event shape for Ratiba backend.

    Every structlog call site emits an event that conforms to this model.
    Validation is enforced at runtime by the structlog processor in
    backend/app/observability/processors.py.
    """
    model_config = ConfigDict(extra="forbid")

    timestamp: datetime
    level: Literal["debug", "info", "warning", "error", "critical"]
    event: str = Field(..., min_length=1, max_length=120)
    event_type: EventType
    service: Literal["ratiba-backend"] = "ratiba-backend"
    git_sha: str

    tenant_id: UUID | None = None  # null only for events before tenant resolution
    conversation_id: UUID | None = None
    turn_id: int = 0

    fsm_state_before: str | None = None
    fsm_state_after: str | None = None

    model: str | None = None
    prompt_version: str | None = None
    latency_ms: int | None = None
    cost_usd: float | None = None
    tool_calls: list[ToolCall] = Field(default_factory=list)
    redacted_payload: dict[str, Any] | None = None
    correlation_ids: CorrelationIds = Field(default_factory=CorrelationIds)

    @field_validator("event_type", mode="before")
    @classmethod
    def coerce_event_type(cls, v: Any) -> EventType:
        return v if isinstance(v, EventType) else EventType(v)
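
The `args_hash` field on `ToolCall` is described as a sha256 of canonicalised args. One plausible canonicalisation, shown here as a sketch, is JSON with sorted keys and no insignificant whitespace; the function name and the choice of JSON are assumptions, since the model does not specify the encoding.

```python
import hashlib
import json


def args_hash(args: dict) -> str:
    """sha256 over a canonical JSON encoding (sorted keys, compact separators)."""
    canonical = json.dumps(
        args, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not change the hash, so duplicate tool calls correlate.
first = args_hash({"service": "haircut", "date": "2026-04-25"})
second = args_hash({"date": "2026-04-25", "service": "haircut"})
print(first == second)
```

Whatever encoding is chosen, it has to be stable across deploys, otherwise duplicate-call correlation silently breaks at the release boundary.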

5.4 Redaction and truncation rules

redacted_payload is bounded — it must not become "full message dumped to log." The rules:

  • Hard cap: 2 KB serialized. Anything longer is truncated with a _truncated: true marker. Full payloads live in Langfuse traces (which have richer storage) or in the conversation table.
  • PII masking. Phone numbers are masked to last-4 (e.g. +254 7** **5 432). Names are replaced with <name> token. Free-text customer content is passed through a redactor that masks: numeric runs of length 6+ (likely IDs/account numbers), email patterns, full names matched against the identity table.
  • Money fields are kept verbatim. Amount, currency, M-Pesa receipt — these are needed for "double-charged" diagnostics (§5.6).
  • Tool args are hashed, not stored. args_hash lets us correlate duplicate calls; the full args go to Langfuse if they need to be inspected.

A redactor module lives at backend/app/observability/redact.py and is the single place these rules are defined. PII test cases in tests/observability/test_redact.py.
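
A minimal sketch of how some of these rules could be implemented with the standard library follows. It is not the contents of redact.py: the regexes, the `<email>` and `<digits>` tokens, the `preview` key, and the choice to recognise phone numbers only when they start with `+` are all assumptions, and name masking against the identity table is omitted.

```python
import json
import re

MAX_BYTES = 2048      # hard cap on the serialized payload
PREVIEW_BYTES = 900   # leaves headroom for JSON escaping of the preview
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+\d[\d ]{7,}\d")   # international format only
LONG_DIGITS = re.compile(r"\d{6,}")      # likely IDs or account numbers


def _mask_phone(match: re.Match) -> str:
    """Keep the last four digits, star out the rest."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def redact_text(text: str) -> str:
    """Mask emails, phone numbers, and long numeric runs in free text."""
    text = EMAIL.sub("<email>", text)
    text = PHONE.sub(_mask_phone, text)
    return LONG_DIGITS.sub("<digits>", text)


def cap_payload(payload: dict) -> dict:
    """Enforce the 2 KB cap; oversized payloads become a flagged preview."""
    raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    if len(raw) <= MAX_BYTES:
        return payload
    preview = raw[:PREVIEW_BYTES].decode("utf-8", errors="ignore")
    return {"_truncated": True, "preview": preview}


print(redact_text("call +254 712 345 432 or a@b.co, acct 12345678"))
```

Money fields would bypass `redact_text` entirely and be copied verbatim, per the rule above; only free-text content goes through the masking pass.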

5.5 Destinations and enforcement

Two destinations, one pipeline.

  1. Langfuse traces — for every conversation-bearing event. The structlog processor langfuse_emit mirrors events with a conversation_id into the Langfuse trace tree (§6 instrumentation plan). Langfuse is the human-friendly trace explorer.
  2. Structured-log JSONL files — for all events, written to stdout in production (collected by the container runtime; on the VPS this feeds journald → a daily rotation under /var/log/ratiba/). In dev, structlog renders to console with the ConsoleRenderer for human readability but the same pydantic validation runs.

The JSONL files are the source of truth a fresh Claude session can grep/jq without needing Langfuse access. Langfuse is the better UX; the files are the durable substrate.

Runtime enforcement. A structlog processor at the end of the chain constructs a LogEvent, calls LogEvent.model_validate(), and raises in dev / warns in production on schema violation. Never silently discard. The processor lives at backend/app/observability/processors.py:

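# backend/app/observability/processors.py (sketch)
from pydantic import ValidationError

# Import path assumes backend/ is the package root.
from app.observability.log_event import LogEvent
# `settings` is the application's config object; its import is omitted here.


def enforce_schema(_, __, event_dict):
    """Final structlog processor: validate the event against LogEvent."""
    try:
        LogEvent.model_validate(event_dict)
    except ValidationError as e:
        if settings.environment == "dev":
            raise  # fail fast in dev so the engineer fixes the call site
        # In prod: never drop the original line. This sketch only annotates
        # it; emitting the synthetic error.unhandled event is left out.
        event_dict["_schema_violation"] = e.errors()
    return event_dict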

Pre-commit enforcement. A grep-based pre-commit hook (.claude/hooks/log-call-shape.sh) scans staged Python files for structlog logger calls (log.info, log.warning, log.error, and so on) and rejects any line that doesn't pass an event_type= keyword. This catches the most common schema violation (forgetting the type) before runtime ever sees it. Pattern:

# Reject `log.info("something")` without event_type=
# -H prints the file name even when only one file is staged.
git diff --cached --name-only --diff-filter=ACM | grep '\.py$' | \
  xargs -r grep -HnE 'log\.(info|warning|error|debug|critical)\(' | \
  grep -v 'event_type=' && {
    echo "log call missing event_type=" >&2
    exit 1
  } || exit 0

The hook lives in the project's .claude/hooks/ alongside lint-edited.sh.
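
A line-based grep cannot see an event_type= keyword that sits on a later line of a multi-line call. If that becomes a source of false positives, an AST-based check is one alternative. The sketch below is not the project's hook: the function name is hypothetical, and it assumes the logger variable is literally named `log`, as in the grep pattern.

```python
import ast

LEVELS = {"debug", "info", "warning", "error", "critical"}


def calls_missing_event_type(source: str) -> list[int]:
    """Line numbers of `log.<level>(...)` calls without an event_type= keyword."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_log_call = (
            isinstance(func, ast.Attribute)
            and func.attr in LEVELS
            and isinstance(func.value, ast.Name)
            and func.value.id == "log"
        )
        has_type = any(kw.arg == "event_type" for kw in node.keywords)
        if is_log_call and not has_type:
            missing.append(node.lineno)
    return sorted(missing)


sample = (
    "log.info(\n"
    '    "fsm.transition",\n'
    '    event_type="fsm.transition",\n'
    ")\n"
    'log.error("boom")\n'
)
print(calls_missing_event_type(sample))
```

The multi-line call on lines 1 to 4 passes because the keyword is found on the parsed call node, while the bare `log.error` on line 5 is reported.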

5.6 How a fresh Claude session diagnoses production issues

This is the proof the schema works. Two scenarios.

Scenario A: "this customer reported they were double-charged."

Inputs the operator gives Claude: conversation_id, tenant_id, time window. The diagnostic recipe:

# 1. All events for the conversation
jq -c 'select(.conversation_id=="<id>")' /var/log/ratiba/events.jsonl

# 2. Specifically the payment timeline
jq -c 'select(.conversation_id=="<id>") |
       select(.event_type|startswith("payment."))' \
  /var/log/ratiba/events.jsonl

# 3. Are there two STK dispatches?
jq -c 'select(.conversation_id=="<id>") |
       select(.event_type=="payment.stk.dispatched") |
       {ts:.timestamp, ref:.correlation_ids.mpesa_merchant_reference,
        amount:.redacted_payload.amount_kes}' \
  /var/log/ratiba/events.jsonl

If two payment.stk.dispatched events exist with different mpesa_merchant_reference values, the FSM looped through AWAITING_PAYMENT twice. Cross-reference with payment.stk.callback.received to see how many succeeded; cross-reference with the payments table to see what M-Pesa recorded. The schema's correlation_ids.mpesa_merchant_reference is what makes this single-pass — without it, you're matching on timestamps and amounts and crossing your fingers.
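
Where jq is not available, the same check can be done in a few lines of Python. This is a sketch of recipe 3 only; the function name is hypothetical and the sample events are made up for illustration.

```python
import json
from collections import defaultdict


def stk_dispatches(lines, conversation_id):
    """Group payment.stk.dispatched timestamps by M-Pesa merchant reference."""
    by_ref = defaultdict(list)
    for line in lines:
        ev = json.loads(line)
        if ev.get("conversation_id") != conversation_id:
            continue
        if ev.get("event_type") != "payment.stk.dispatched":
            continue
        ref = (ev.get("correlation_ids") or {}).get("mpesa_merchant_reference")
        by_ref[ref].append(ev.get("timestamp"))
    return dict(by_ref)


# Illustrative events; in production, iterate over the JSONL file instead.
events = [
    json.dumps({"conversation_id": "c1", "event_type": "payment.stk.dispatched",
                "timestamp": "t1",
                "correlation_ids": {"mpesa_merchant_reference": "R1"}}),
    json.dumps({"conversation_id": "c1", "event_type": "payment.stk.dispatched",
                "timestamp": "t2",
                "correlation_ids": {"mpesa_merchant_reference": "R2"}}),
    json.dumps({"conversation_id": "c1", "event_type": "fsm.transition",
                "timestamp": "t3"}),
]
print(stk_dispatches(events, "c1"))
```

More than one key in the result means more than one distinct merchant reference was dispatched for the conversation, which is the double-dispatch signature described above.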

Scenario B: "this booking flow stalled at slot=service."

# 1. FSM transition timeline
jq -c 'select(.conversation_id=="<id>") |
       select(.event_type=="fsm.transition") |
       {ts:.timestamp, before:.fsm_state_before, after:.fsm_state_after}' \
  /var/log/ratiba/events.jsonl

# 2. LLM calls in the SERVICE state
jq -c 'select(.conversation_id=="<id>") |
       select(.fsm_state_before=="SERVICE") |
       select(.event_type|startswith("llm."))' \
  /var/log/ratiba/events.jsonl

# 3. What did the slot extractor see / return?
jq -c 'select(.conversation_id=="<id>") |
       select(.event_type=="llm.call") |
       select(.event=="slot_extractor.invoked")' \
  /var/log/ratiba/events.jsonl

If the FSM never left SERVICE, the slot extractor either failed to extract, returned low confidence, or the LLM call failed (search llm.call.failed). The prompt_version field tells you which extractor prompt produced the bad extraction — directly actionable as a regression test scenario.
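
The stall diagnosis can also be condensed into one pass over the log. The sketch below assumes the JSONL lines are in chronological order; `stall_summary` is a hypothetical helper and the state names in the sample are illustrative.

```python
import json


def stall_summary(lines, conversation_id):
    """Report the last FSM state reached and any failed LLM calls."""
    last_state = None
    failures = []
    for line in lines:
        ev = json.loads(line)
        if ev.get("conversation_id") != conversation_id:
            continue
        if ev.get("event_type") == "fsm.transition":
            last_state = ev.get("fsm_state_after")
        elif ev.get("event_type") == "llm.call.failed":
            failures.append((ev.get("timestamp"), ev.get("prompt_version")))
    return {"last_state": last_state, "llm_failures": failures}


# Illustrative events; in production, iterate over the JSONL file instead.
stalled = [
    json.dumps({"conversation_id": "c1", "event_type": "fsm.transition",
                "fsm_state_before": "GREETING", "fsm_state_after": "SERVICE",
                "timestamp": "t1"}),
    json.dumps({"conversation_id": "c1", "event_type": "llm.call.failed",
                "prompt_version": "slot_extractor_swahili@a1b2c3d",
                "timestamp": "t2"}),
]
print(stall_summary(stalled, "c1"))
```

A `last_state` of SERVICE together with a non-empty `llm_failures` list points at the extractor prompt version to reproduce in a regression scenario.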

The schema is designed so these recipes are short. A field that doesn't shorten a diagnostic recipe shouldn't be in the schema.

5.7 Schema location and ownership

| Artifact | Path | Owner |
| --- | --- | --- |
| Pydantic model (runtime truth) | backend/app/observability/log_event.py | Backend |
| JSON Schema mirror (external readers) | docusaurus/ratiba/static/logging-schema.json (auto-generated by scripts/sync-log-schema.sh in pre-commit) | Auto |
| Redactor + masking rules | backend/app/observability/redact.py | Backend |
| Structlog processor chain | backend/app/observability/processors.py | Backend |
| Pre-commit log-shape hook | .claude/hooks/log-call-shape.sh | Methodology |
| Diagnostic recipe library | docs/methodology/log-diagnostics.md | Methodology (grows with incidents) |

docs/methodology/log-diagnostics.md is a living recipe book. Every production incident postmortem appends one new recipe. The expectation is that within six months it is the most-read internal doc.


6. What to delegate vs. human-review-only

The S4U "subagent-driven" rule is not a license to delegate everything. The two lists below are the explicit boundary. Items in the human-review-only list require Adrian (or any future second human) to be present during the change, not because subagents are untrustworthy, but because the failure modes are adversarial-money or adversarial-tenant-isolation in nature.

Delegate to subagents (with the §2 deterministic-anchor discipline).

  • Routine subagent dispatches against written specs.
  • Scaffolding (new module, new test file, new migration).
  • Eval-suite refresh (adding scenarios, recording new LLM fixtures).
  • Documentation updates (Docusaurus pages, ADR drafts pre-review).
  • Dependency bump triage on Renovate PRs (read changelog, run smoke eval, recommend merge or block).
  • Custom DeepEval metric implementation against a written rubric.
  • Refactoring within a single module when tests cover the surface.
  • Production-incident diagnosis (read logs, propose hypothesis; the fix itself then falls under whichever list in this section covers the code it touches).

Human-review-only (Adrian must read the diff before merge).

  • Booking-flow FSM edits — graph topology changes, new states, new transitions, threshold tuning for handoff triggers.
  • The IdentityResolver — phone-to-customer matching is the hinge of multi-tenant safety. A bug here cross-contaminates tenants.
  • Anything touching M-Pesa Daraja credentials, the merchant_reference scheme, callback signature verification, or the payments table schema.
  • Tenant-creation / tenant-deletion logic, schema-per-tenant migrations.
  • Admin-handoff interrupt/resume code path (a deadlock here strands real conversations).
  • Auth / Keycloak realm management.
  • Any change to the auto-debug logging schema (§5.3) — schema drift is irreversible.
  • ADR drafts — subagent writes; Adrian decides.
  • Any prompt that touches money, refunds, medical advice, legal advice.
  • The PII redactor (backend/app/observability/redact.py) — getting this wrong leaks regulated data.

Spectrum, not binary. The list above is sharp. The reality is many PRs have a delegate-able skeleton and a review-only seam. The discipline is to scope subagent tasks so the review-only seam is left empty, and the human picks it up in the same PR cycle.


7. Model deprecation handling

PRD pins specific models: Claude (LLM brain), Deepgram Nova-3 (STT), ElevenLabs Multilingual v2 (TTS). Phase C flagged the OpenAI Assistants API sunset (August 2026) as the canonical "vendor pulled the rug" risk. Anthropic's public deprecation policy guarantees customers with active deployments at least 60 days' notice before retirement. That's not enough headroom to wing it; it requires ritual.

Quarterly model-pin review (calendared). First Monday of each quarter, 2-hour time block:

  1. Inventory. What models are pinned? Where? Run scripts/list-pinned-models.sh (harness addition, §11) — greps the codebase for model identifiers and prints a table with the file/line where each is pinned, plus the prompt versions targeting them.
  2. Deprecation check. Visit each provider's deprecation page (links below). Subscribe via RSS so you're not relying on memory:
  3. Replacement candidate evaluation. For each model with a deprecation date inside 6 months, identify the replacement and run the eval suite against the replacement (pytest tests/eval --model-override <new>). If the replacement passes within 5% of baseline scores — schedule the swap for the following sprint. If it fails — escalate to a focused investigation.
  4. Calibrate the LLM-as-judge. If the judge model changes, re-run the calibration set (tests/eval/calibration/human_labelled.yaml) and verify Cohen's kappa stays ≥ 0.7 against the human ratings.
  5. Update ADR. Any model pin change is an ADR amendment, not a code PR. The amendment notes the old version, new version, eval delta, and judge-recalibration kappa.
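
Steps 3 and 4 of the review rest on two numeric checks: the replacement's eval score against baseline, and Cohen's kappa between the judge and the human labels. A dependency-free sketch of both follows. The function names are hypothetical, and reading "within 5%" as a relative drop from baseline is an assumption the text does not settle.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Cohen's kappa between two raters labelling the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def within_tolerance(new: float, baseline: float, tol: float = 0.05) -> bool:
    """True if `new` is no more than `tol` (relative) below `baseline`."""
    return new >= baseline * (1 - tol)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)
print(round(kappa, 3), kappa >= 0.7)
```

In this toy sample the raters agree on 3 of 4 items, chance agreement is 0.5, and kappa comes out at 0.5, which would fail the 0.7 bar and trigger recalibration.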

Out-of-band CVE-style deprecation (model removed with short notice). Rare but happens. Procedure:

  1. The first engineer to see the deprecation notice opens an incident issue and pages Adrian.
  2. Eval suite is run against the next-closest replacement immediately, not on the quarterly cadence.
  3. If results are within 10%: hot-deploy the swap behind a feature flag, canary at 10% of traffic for 24h, then 100%. Monitor Langfuse cost and feedback dashboards continuously.
  4. If results are worse than 10%: stay on the deprecating model until the cutoff, accept the eval regression, and prioritize prompt-tuning for the replacement before forced cutover.
  5. Post-incident: ADR amendment + recipe in docs/methodology/log-diagnostics.md for "how we knew the swap was working in production."
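
The 10% canary in step 3 needs a stable way to decide which conversations see the replacement model. The text does not say how traffic is split; hashing the conversation ID into 100 buckets, as sketched below, is one common approach and is offered only as an assumption.

```python
import hashlib


def in_canary(conversation_id: str, percent: int) -> bool:
    """Stable bucketing: a given conversation always gets the same answer."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Roughly one in ten conversations should land in a 10% canary.
share = sum(in_canary(f"conv-{i}", 10) for i in range(10_000)) / 10_000
print(share)
```

Bucketing on the conversation ID rather than per request keeps a single conversation on one model for its whole lifetime, which keeps the Langfuse traces comparable between the two arms.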

Voice-stack note. Deepgram and ElevenLabs deprecations are operationally noisier (audio fixtures need re-recording, voice quality needs a human ear). The ritual is the same; the eval suite for voice is sparser at PoC and grows in M7.


8. Library-currency operations

The policy is in ADR-0001. The operations are below.

8.1 Renovate configuration

We use Renovate (not Dependabot) for its uv lockfile support and for the grouping, scheduling, and release-age controls the configuration below relies on. File: .github/renovate.json5:

{
  $schema: "https://docs.renovatebot.com/renovate-schema.json",
  extends: [
    "config:recommended",
    ":dependencyDashboard",
    ":semanticCommits",
    "group:linters",
    "group:test",
  ],
  timezone: "Africa/Nairobi",
  schedule: ["before 6am on monday"],
  prHourlyLimit: 4,
  prConcurrentLimit: 8,
  labels: ["dependencies"],
  rangeStrategy: "bump",

  // Keep uv.lock fresh for transitive updates
  lockFileMaintenance: {
    enabled: true,
    schedule: ["before 6am on monday"],
  },

  packageRules: [
    // Patch updates auto-merge once smoke-eval CI passes
    {
      matchUpdateTypes: ["patch", "pin", "digest"],
      automerge: true,
      automergeType: "pr",
      platformAutomerge: true,
      labels: ["dependencies", "auto-merge"],
    },
    // Wait one week for releases to bake (catches yanks)
    {
      matchDatasources: ["pypi", "npm"],
      minimumReleaseAge: "7 days",
    },
    // Group minors monthly so we get one review session, not many PRs
    {
      groupName: "monthly minor updates",
      matchUpdateTypes: ["minor"],
      schedule: ["before 6am on the 1st day of the month"],
      labels: ["dependencies", "monthly-minor"],
    },
    // Major updates: one PR each, ADR-required, full eval suite blocks
    {
      groupName: null,
      matchUpdateTypes: ["major"],
      labels: ["dependencies", "major", "needs-adr", "needs-full-eval"],
      reviewers: ["Tsunami-max"],
      automerge: false,
    },
    // Load-bearing libraries: explicit per-package treatment
    {
      matchPackageNames: [
        "langgraph",
        "langgraph-checkpoint-postgres",
        "anthropic",
        "fastapi",
        "pydantic",
        "sqlalchemy",
        "asyncpg",
        "psycopg",
      ],
      labels: ["dependencies", "load-bearing"],
      reviewers: ["Tsunami-max"],
      automerge: false,
      minimumReleaseAge: "14 days",
    },
    // Voice stack pinned strictly; don't auto-bump audio libraries
    {
      matchPackageNames: ["livekit", "livekit-agents", "deepgram-sdk", "elevenlabs"],
      labels: ["dependencies", "voice-stack"],
      automerge: false,
      minimumReleaseAge: "21 days",
    },
  ],

  // Vulnerability alerts always cut the cadence
  vulnerabilityAlerts: {
    enabled: true,
    labels: ["dependencies", "security", "urgent"],
    schedule: ["at any time"],
    automerge: false,
  },
}
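The order of packageRules matters: Renovate applies every matching rule in sequence, and later matches override earlier ones. That is why a patch bump of a load-bearing library does not auto-merge even though the first rule would allow it. The sketch below is a simplified model of that layering, not Renovate's implementation. It tracks only automerge, release age, and group name; it treats the datasource rule as matching every package; and the names effective_policy and RULES are illustrative.

```python
# Simplified stand-ins for the packageRules above. Only the fields needed
# to show "later matching rule wins" are modelled.
LOAD_BEARING = {
    "langgraph", "langgraph-checkpoint-postgres", "anthropic", "fastapi",
    "pydantic", "sqlalchemy", "asyncpg", "psycopg",
}
VOICE_STACK = {"livekit", "livekit-agents", "deepgram-sdk", "elevenlabs"}

# (predicate, settings) pairs, in the same order as the config.
RULES = [
    (lambda pkg, upd: upd in {"patch", "pin", "digest"}, {"automerge": True}),
    # Assumption: every package comes from pypi or npm, so this always matches.
    (lambda pkg, upd: True, {"minimum_release_age_days": 7}),
    (lambda pkg, upd: upd == "minor", {"group": "monthly minor updates"}),
    (lambda pkg, upd: upd == "major", {"automerge": False, "group": None}),
    (lambda pkg, upd: pkg in LOAD_BEARING,
     {"automerge": False, "minimum_release_age_days": 14}),
    (lambda pkg, upd: pkg in VOICE_STACK,
     {"automerge": False, "minimum_release_age_days": 21}),
]


def effective_policy(package: str, update_type: str) -> dict:
    """Apply every matching rule in order; later settings overwrite earlier ones."""
    policy = {"automerge": False, "minimum_release_age_days": 0, "group": None}
    for matches, settings in RULES:
        if matches(package, update_type):
            policy.update(settings)
    return policy
```

Under this model, effective_policy("pydantic", "patch") comes back with automerge off and a 14-day hold, while a patch of a package outside both lists auto-merges after 7 days. Reordering the rules would change those answers, which is the reason to keep the per-package rules last.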

8.2 CI workflow for eval-on-Renovate-PR

The eval workflow already gates prompt PRs (§4). Renovate PRs hit the same workflow via the smoke-on-other-changes job (advisory) plus a dedicated job for major bumps:

# .github/workflows/eval-on-renovate-major.yml
name: Eval on Renovate major
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  full-eval:
    if: contains(github.event.pull_request.labels.*.name, 'needs-full-eval')
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
      - uses: actions/checkout@v5
        with: { fetch-depth: 0 }
      - uses: astral-sh/setup-uv@v4
      - run: uv sync --extra dev --extra eval
        working-directory: backend
      - name: Full eval suite (major bump)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY_CI }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY_CI }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY_CI }}
        run: |
          cd backend
          uv run pytest tests/eval/conversations \
            --langfuse-publish-results \
            -n 4 \
            --junitxml=eval-results.xml
      - name: Eval-delta comment
        if: always()
        uses: ./.github/actions/post-eval-delta
        with:
          results: backend/eval-results.xml
          baseline-prompt-version: main
          extra-context: |
            This is a major dependency bump. ADR is required before merge.
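The post-eval-delta action used in the last step is a proposed composite action; its internals are not specified here. A minimal sketch of the core it would need, assuming the standard pytest JUnit XML layout (testcase elements with optional failure, error, or skipped children), could look like this. The function names load_outcomes and delta_table are illustrative, and fetching the baseline run and posting the comment are left out.

```python
import xml.etree.ElementTree as ET


def load_outcomes(junit_xml: str) -> dict[str, str]:
    """Map each test case id to 'passed', 'failed', or 'skipped'."""
    root = ET.fromstring(junit_xml)
    outcomes: dict[str, str] = {}
    for case in root.iter("testcase"):
        case_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[case_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[case_id] = "skipped"
        else:
            outcomes[case_id] = "passed"
    return outcomes


def delta_table(baseline: dict[str, str], current: dict[str, str]) -> str:
    """Render a markdown table listing only scenarios whose outcome changed."""
    rows = ["| Scenario | Baseline | This PR |", "| --- | --- | --- |"]
    for case_id in sorted(set(baseline) | set(current)):
        before = baseline.get(case_id, "absent")
        after = current.get(case_id, "absent")
        if before != after:
            rows.append(f"| {case_id} | {before} | {after} |")
    return "\n".join(rows)
```

Reporting only changed scenarios keeps the PR comment short enough to read: a major bump that leaves every scenario untouched produces a table with headers and no rows.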

8.3 Deprecation tracking subscriptions

Aggregate in docs/operations/deprecation-watch.md (refreshed weekly by a scheduled GitHub Action that pulls each feed):

8.4 CVE response procedure (breaks the monthly cadence)

When vulnerabilityAlerts fires:

  1. Triage within 4 hours (during EAT business hours; next morning otherwise). Severity assessment: is the affected code path exposed in our runtime?
  2. If exposed and CVSS ≥ 7.0: patch within 24 hours. Smoke eval gates the patch; full eval runs in parallel but does not block the hotfix. Post-deploy: full eval; rollback if it regresses materially.
  3. If exposed and CVSS < 7.0: patch within 7 days, normal eval gating.
  4. If not exposed: apply the patch on the next monthly cadence; note the assessment in the Renovate PR.
  5. Aikido SAST (already in the harness) is the safety net; it should surface the same advisories independently.
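The branching in steps 2 to 4 is small enough to encode, which helps when the triage note in the Renovate PR has to state which path was taken. The sketch below is one way to express it; PatchPlan and plan_cve_response are hypothetical names, and the 4-hour triage window and the Aikido cross-check stay outside it because they are human steps.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PatchPlan:
    deadline: Optional[timedelta]  # None means "next monthly cadence"
    gating_eval: str               # eval level that blocks the merge
    follow_up: str                 # what must happen besides merging


def plan_cve_response(exposed: bool, cvss: float) -> PatchPlan:
    """Translate the triage outcome into the patch path described above."""
    if not exposed:
        # Step 4: not reachable in our runtime, so it waits for the cadence.
        return PatchPlan(None, "normal", "note the assessment in the Renovate PR")
    if cvss >= 7.0:
        # Step 2: hotfix gated by smoke eval only; full eval follows the deploy.
        return PatchPlan(
            timedelta(hours=24),
            "smoke",
            "run full eval post-deploy; roll back on material regression",
        )
    # Step 3: exposed but below the threshold.
    return PatchPlan(timedelta(days=7), "normal", "")
```

For example, plan_cve_response(exposed=True, cvss=9.1) yields a 24-hour deadline gated by the smoke eval, and an unexposed advisory of any score yields no deadline.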

9. Sister-project reuse

Honest accounting: Ratiba is at the same early-stage maturity as trust-relay-workflow and zol-rag. There is no production-grade conversation eval, no battle-tested prompt registry, no auto-debug logging schema in either sister project that we can lift verbatim. What we can lift is shape and convention.

Lift now (already done or trivially done).

  • pyproject.toml shape — [tool.ruff], [tool.pyright], [tool.pytest.ini_options] configurations mirror trust-relay-workflow. Already done; called out in the file's comments.
  • asyncio_mode = "auto" for pytest — same convention; reduces decorator noise. Already done.
  • .claude/hooks/ three-layer defence pattern (post-edit linting → pre-push gate → stop verification) — already done.
  • structlog as the logging foundation (pinned in pyproject) — both sister projects use it; we use the same processor chain idiom.
  • docs/architecture-index.json + scripts/check-docs-sync.sh — to be built from the trust-relay convention when the source tree exists.
  • livekit.yaml config drift mitigations (node_ip: 127.0.0.1, stun_servers: [] for loopback dev) — lift from zol-rag when M7 begins. Captured in user memory.

Lift conceptually, build fresh.

  • The auto-debuggable logging schema (§5) — neither sister project has this. Build for Ratiba; if it works, propose extracting to a shared s4u-observability package once two of the three projects converge.
  • The DeepEval/Langfuse glue layer — build fresh.
  • The YAML golden-conversation format — build fresh.
  • The Langfuse → eval-suite feedback bridge — build fresh.
  • Custom DeepEval metrics (SwahiliFluencyMetric, etc.) — build fresh.

Do not lift.

  • Any LangChain chains/ or agents/ patterns that may live in zol-rag. Per Phase C: LangChain proper is rejected. We use LangGraph and only LangGraph.
  • Any conversation-state-in-Redis-only pattern. Ratiba's canonical store is the LangGraph TenantScopedSaver in Postgres; Redis is hot cache.

Coordination opportunity (deferred). When Ratiba has 6 months of production logs against this schema and the eval suite has run ≥ 1,000 times, propose to the org an s4u-eval-utils shared package. Until then, keep it inside Ratiba.


10. Proposed harness additions

Punch-list. Each item is small enough to be its own PR. File paths are absolute.

  1. Pre-commit hook: log-call-shape.sh
     Trigger: pre-commit, on staged .py files.
     Path: /Users/soft4u/Development/ratiba/.claude/hooks/log-call-shape.sh
     Purpose: reject log calls without event_type= keyword. Sketch in §5.5.

  2. Pre-commit hook: prompt-version-bump.sh
     Trigger: pre-commit, on changes to backend/app/orchestrator/prompts/**/system.md.
     Path: /Users/soft4u/Development/ratiba/.claude/hooks/prompt-version-bump.sh
     Purpose: refuse to commit if system.md changed without bumping metadata.yaml.version.

  3. Pre-commit hook: eval-scenario-naming.sh
     Trigger: pre-commit, on changes to backend/tests/eval/conversations/scenarios/**.yaml.
     Path: /Users/soft4u/Development/ratiba/.claude/hooks/eval-scenario-naming.sh
     Purpose: enforce <sector>_<flow>_<language>_<variant>.yaml naming so directory listings stay scan-friendly at scale.

  4. Reviewer agent: prompt-eval-reviewer
     Trigger: PR opened with file changes under backend/app/orchestrator/prompts/**.
     Path: /Users/soft4u/Development/ratiba/.claude/agents/prompt-eval-reviewer.md
     Purpose: read the eval-delta PR comment, flag scenarios that regressed, write a "merge-or-block" recommendation with reasoning. Two-stage review per S4U.

  5. Reviewer agent: log-schema-reviewer
     Trigger: PR opened with file changes under backend/app/observability/log_event.py or processors.py or redact.py.
     Path: /Users/soft4u/Development/ratiba/.claude/agents/log-schema-reviewer.md
     Purpose: schema changes are irreversible; require human-readable migration note and consumer-impact assessment.

  6. Reviewer agent: model-pin-reviewer
     Trigger: PR opened with file changes that include strings matching the pinned-model regex (claude-*, nova-*, eleven_*, gpt-*).
     Path: /Users/soft4u/Development/ratiba/.claude/agents/model-pin-reviewer.md
     Purpose: ensure model swaps come with eval-suite results + ADR amendment.

  7. Subagent prompt template: subagent-prompt-nondeterministic.md
     Path: /Users/soft4u/Development/ratiba/docs/methodology/templates/subagent-prompt-nondeterministic.md
     Purpose: the §3 template — deterministic anchor, don't-regress invariants, stop conditions, evidence to return.

  8. Script: scripts/list-pinned-models.sh
     Path: /Users/soft4u/Development/ratiba/scripts/list-pinned-models.sh
     Purpose: quarterly model-pin review (§7). Greps codebase for model identifiers, prints table with file/line + which prompts target each.

  9. Script: scripts/sync-log-schema.sh
     Path: /Users/soft4u/Development/ratiba/scripts/sync-log-schema.sh
     Purpose: regenerate docusaurus/ratiba/static/logging-schema.json from the Pydantic LogEvent model. Wired into pre-commit so the JSON Schema can never drift.

  10. Script: scripts/refresh-deprecation-watch.sh
      Path: /Users/soft4u/Development/ratiba/scripts/refresh-deprecation-watch.sh
      Purpose: pull each RSS / changelog from §8.3, write a unified docs/operations/deprecation-watch.md. Run by a weekly GitHub Action.

  11. GitHub Action: eval-on-prompt-change.yml
      Path: /Users/soft4u/Development/ratiba/.github/workflows/eval-on-prompt-change.yml
      Purpose: §4 workflow.

  12. GitHub Action: eval-on-renovate-major.yml
      Path: /Users/soft4u/Development/ratiba/.github/workflows/eval-on-renovate-major.yml
      Purpose: §8.2 workflow.

  13. GitHub Action: weekly-deprecation-watch.yml
      Path: /Users/soft4u/Development/ratiba/.github/workflows/weekly-deprecation-watch.yml
      Purpose: cron weekly; runs scripts/refresh-deprecation-watch.sh, opens an issue if a load-bearing model has a deprecation date inside 180 days.

  14. Composite action: post-eval-delta
      Path: /Users/soft4u/Development/ratiba/.github/actions/post-eval-delta/action.yml
      Purpose: parse eval-results.xml, diff against the baseline branch's last successful run, post a markdown table to the PR.

  15. Renovate config: .github/renovate.json5
      Path: /Users/soft4u/Development/ratiba/.github/renovate.json5
      Purpose: §8.1 configuration verbatim.

  16. Doc: docs/methodology/log-diagnostics.md
      Path: /Users/soft4u/Development/ratiba/docs/methodology/log-diagnostics.md
      Purpose: living recipe book for jq-driven log diagnostics. Seeded with the two recipes from §5.6.

  17. Doc: docs/operations/deprecation-watch.md (auto-maintained)
      Path: /Users/soft4u/Development/ratiba/docs/operations/deprecation-watch.md
      Purpose: weekly snapshot of pinned-model deprecation status.

  18. CLAUDE.md amendment: register the new hooks + agents in the methodology table. Mechanical follow-up after items 1–6 land.
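Items 6 and 8 both rest on the same primitive: finding pinned model identifiers in source text. The punch-list proposes a shell script; the sketch below shows the matching logic in Python so it can be unit-tested. The regex is an assumption built from the four prefixes named in item 6 (claude-*, nova-*, eleven_*, gpt-*), and Pin and find_pins are hypothetical names. Mapping each pin to the prompts that target it is not covered.

```python
import re
from typing import Iterable, NamedTuple

# Assumed pattern: one of the four prefixes from item 6, followed by an
# identifier that ends on a letter or digit (so trailing punctuation is excluded).
MODEL_ID = re.compile(
    r"\b(?:claude-|nova-|eleven_|gpt-)[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"
)


class Pin(NamedTuple):
    path: str
    line: int
    model: str


def find_pins(files: Iterable[tuple[str, str]]) -> list[Pin]:
    """Scan (path, text) pairs and return every model identifier with its location."""
    pins: list[Pin] = []
    for path, text in files:
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in MODEL_ID.finditer(line):
                pins.append(Pin(path, lineno, match.group(0)))
    return pins
```

Taking (path, text) pairs instead of walking the filesystem keeps the function pure; the script wrapper would feed it the files it reads from the repository and print the resulting table.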


11. Open questions surfaced

Decisions this supplement explicitly does not resolve:

  1. Should the Pydantic LogEvent model live in a shared s4u-observability package from day 1? Premature now (one consumer); inevitable later (three consumers). Decide when zol-rag or trust-relay starts asking to borrow it.

  2. Langfuse trace storage retention. Per-tenant project vs one shared project (per A3 §8) intersects with retention policy. Health-data tenants will need a documented retention window. Do we set Langfuse retention per project, or do we run a daily janitor? Defer to ADR-0004.

  3. Eval cost budget governance. The $20 per-PR ceiling is fine at PoC scale. At 50 prompt PRs/month + nightly runs, the bill is real. Who approves a budget bump? What's the kill-switch? Defer until the first month we hit the ceiling.

  4. Subagent dispatch budget. Long-running subagents on Opus tokens add up. Should we put a per-task token ceiling on subagent dispatches the way we have an eval-suite cost ceiling? Worth deciding before subagent-driven development becomes the default for non-trivial tasks. No data yet.

  5. Prompt versioning across tenants. One tenant on prompt v3, another on v2 (because v3 regresses on Swahili and that tenant is Swahili-primary). The §3 storage scheme assumes one global active version. Multi-tenant prompt routing is the Phase 3 canary problem (A3 §6); the storage layout should grow into it without rework. Re-examine when the first tenant requests a held-back version.

  6. Auto-generated diagnostic recipes. Is it worth a small daemon that, on every "incident closed" event, asks an LLM to summarize the diagnostic recipe used and append to log-diagnostics.md? Productivity win but with the usual non-determinism caveat. Park until we have ≥ 5 incidents to seed it from.

  7. Cross-project deprecation watch. zol-rag and trust-relay-workflow pin some of the same libraries (FastAPI, Pydantic, structlog). Do we centralize the deprecation watch script across the three projects or keep three independent copies? Centralizing introduces a fourth thing to maintain. Discuss when one of the projects misses a deprecation the other two caught.

  8. LLM-as-judge for the auto-debug logging itself. Could a Claude session grade whether a log line "would have been diagnosable" via the §5.6 recipes? Recursive eval. Tempting but probably overkill. Park.


Appendix A — File path index (canonical locations referenced above)

| What | Path |
| --- | --- |
| This document | /Users/soft4u/Development/ratiba/docs/methodology/agentic-development.md |
| Methodology landing | /Users/soft4u/Development/ratiba/docs/methodology/index.md |
| Global S4U methodology | /Users/soft4u/Development/s4u-methodology/docs/methodology.md |
| ADR-0001 (tech stack + library policy) | /Users/soft4u/Development/ratiba/docs/adr/ADR-0001-tech-stack.md |
| Phase C landscape | /Users/soft4u/Development/ratiba/docs/research/2026-04-25-agentic-landscape-2026.md |
| A1 orchestration | /Users/soft4u/Development/ratiba/docs/research/2026-04-25-orchestration-patterns.md |
| A2 HITL | /Users/soft4u/Development/ratiba/docs/research/2026-04-25-human-in-the-loop-handoff.md |
| A3 evals | /Users/soft4u/Development/ratiba/docs/research/2026-04-25-eval-frameworks.md |
| A4 payments | /Users/soft4u/Development/ratiba/docs/research/2026-04-25-payments-orchestration.md |
| Backend pyproject | /Users/soft4u/Development/ratiba/backend/pyproject.toml |
| Log event model | backend/app/observability/log_event.py (proposed) |
| Log JSON Schema mirror | docusaurus/ratiba/static/logging-schema.json (proposed) |
| Log diagnostics recipe book | docs/methodology/log-diagnostics.md (proposed) |
| Subagent template | docs/methodology/templates/subagent-prompt-nondeterministic.md (proposed) |
| Renovate config | .github/renovate.json5 (proposed) |
| Eval-on-prompt-change workflow | .github/workflows/eval-on-prompt-change.yml (proposed) |
| Deprecation watch (auto) | docs/operations/deprecation-watch.md (proposed) |

Appendix B — Sources cited