Skip to main content

ADR-0013: Phase 0 Tenant Knowledge Answers ("no-RAG RAG")

Status: Accepted Date: 2026-05-31

Context

Ratiba's conversational agent today handles three inquiry intents — services, hours, and other — through a single bilingual answer_shaper LLM call (app/services/answer_shaper.py). The services and hours intents render from structured tenant data; the other catch-all dead-ends: its raw_data is an empty dict, so the prompt instructs the model to "politely acknowledge you don't have a ready answer and offer to ask the team."

This means the agent cannot answer the most common free-form customer questions a service business gets: cancellation policy, parking, what a service includes, prep instructions, holiday hours, deposit rules. For the Aria Aura Spa pilot (M13) this is a visible gap — these questions arrive constantly on WhatsApp.

The obvious answer is RAG (retrieval-augmented generation), and the sister project zol-rag has a production-grade implementation (pgvector + embeddings + hybrid retrieval + reranking + CRAG). But porting it is ~11–17k LOC of core machinery (plus stripping ~16k LOC of medical-specific code), and — critically — Ratiba has no paying customer yet. Building a vector-retrieval subsystem before a single tenant is in production is speculative generality.

The insight that makes a cheaper path viable: a single service business's entire knowledge base is small — on the order of tens of short snippets, not thousands of documents. At that scale, retrieval is unnecessary: the whole KB fits in the prompt, and the LLM performs the "retrieval" by attention. We can ship the usefulness of RAG (grounded free-form answers) without any retrieval infrastructure, and defer the real thing until a tenant's KB actually outgrows the prompt — a moment we can make observable rather than guess.

This ADR records that deferral and the Phase-0 mechanism. The full design is docs/superpowers/specs/2026-05-31-phase0-knowledge-answers-design.md.

Decision

Phase 0 mechanism: snippets-in-prompt, no retrieval

  • Per-tenant knowledge_snippets table (new per-tenant Alembic revision in the tenant/ lineage; lives inside each tenant schema, auto-scoped by the ADR-0002 search_path routing — no tenant_id column). Columns: id, category, title, body, language (default en, informational only), is_active, created_at, updated_at. Deliberately no embedding vector column — that is the single additive Phase-1 upgrade.
  • categorypolicy | facility | prep | service | hours | general drives intent routing.
  • New seam app/services/knowledge.py::fetch_snippets(intent, *, limit, max_chars) loads active snippets for an intent: services{service, general}, hours{hours, general}, other → all categories (the catch-all). Not filtered by language (an EN-only KB still serves Swahili customers via render-time translation). This function is the seam that becomes the top-k cosine query in Phase 1; its signature and consumers stay identical.
  • Injection via the existing data path: the dispatcher attaches raw_data["knowledge"] = await fetch_snippets(intent) (one line) before the existing shape_answer(...) call. Snippets serialize into raw_data_json and ride in the user template — exactly the path personality already uses (ADR-0010 D9).
  • All three inquiry intents (services + hours + other) draw from the store; services answers additionally surface each service's existing description column.
  • Content is hand-seeded by an idempotent scripts/seed_knowledge.py from a curated per-tenant YAML. No conversational or dashboard authoring in Phase 0.

Prompt-cache invariant preserved

answer_shaper.yaml is cache_eligible: true: its system_message is byte-identical across all tenants, with per-tenant content confined to the user template (ADR-0010 D9). Phase 0 preserves this exactly — snippets enter through raw_data_json (user template), never the system message. The prompt version bumps 0.2.0 → 0.3.0 (feeds the ADR-0004 4-tuple eval cache key); the system-message text changes (new rules to answer from snippets) but remains tenant-invariant.

Cost trade-off (accepted)

Snippets are uncached input tokens on every inquiry turn (the user template is not part of the cached prefix). For a capped KB (~1500 tokens) this is a few hundredths of a cent per turn — well inside ADR-0005's $0.05 soft / $0.20 hard per-booking ceiling. Tracked by the existing record_cost / peek_cost; no new cost plumbing.

Knowledge miss → deflect + log

On a question the KB doesn't cover, keep the existing "I'll check with the team" deflection (no ADR-0006 handoff wiring). On every other-intent turn, emit a structured knowledge_gap_candidate log event (question, tenant, snippets-available count); the M13 daily WhatsApp digest surfaces recurring gaps so Adrian curates snippets to close them. (Honest limitation: other is the classifier's catch-all, so this logs catch-all questions, not a precise "the bot failed here" — precise detection needs LLM self-report, which breaks the plain-text output contract, and is deferred.)

Phase-0 → Phase-1 graduation trigger (observable)

fetch_snippets caps at ~20 snippets / ~1500 tokens (configurable). Below the cap, the entire active KB is injected. Cap overflow is the graduation signal: when a tenant routinely overflows (knowledge_overflow WARN), stuffing-everything no longer fits and real retrieval (Phase 1: embeddings + pgvector + top-k) is finally justified — for that tenant, with evidence. YAGNI expires observably, not by guess.

Consequences

Positive

  • Ships grounded free-form answers in a small backend-only change (one table, one new service function, one dispatcher line, one prompt revision) — no retrieval subsystem, no new infra, no pgvector extension.
  • The pilot's most common questions get real answers from day one.
  • The table and fetch_snippets seam make Phase 1 a purely additive upgrade (add embedding column + backfill + swap the SELECT internals); consumers unchanged.
  • The cost ceiling and graduation trigger are quantified and observable.
  • Prompt-cache eligibility is provably preserved (cross-tenant system_message identity is test-guarded).

Negative / accepted

  • Uncached snippet tokens on every inquiry turn (bounded by the cap; measured; inside the ceiling).
  • No self-service content authoring — Adrian hand-seeds the pilot KB.
  • Gap logging is coarse (catch-all questions, not precise answer failures).
  • On LLM failure, the deterministic fallback renders only structured services/hours data; an other query degrades to today's deflection (acceptable — failure is already silent-fallback by design).

Alternatives Considered

  • Port zol-rag's full RAG now. Rejected as premature: ~11–17k LOC of core machinery + ~16k LOC of medical code to strip, built before any paying customer. Real retrieval is YAGNI until a tenant's KB outgrows the prompt.
  • Seed YAML/MD file per tenant (no DB table). Fastest to pilot, but a dead-end: content can't move to the agent/dashboard later and would have to be migrated into a table for Phase 1 anyway — paying twice. The table costs one migration now and upgrades additively.
  • JSONB blob on the tenant-config row. Avoids a new table but is awkward to query/order/cap and a poor base for the eventual embedding column.
  • Real handoff on every miss (ADR-0006 wiring). Rejected for Phase 0 scope: pulls handoff plumbing and its test surface into a change meant to be minimal. Deflect + gap-log first; wire real escalation when the data shows it is needed.

References

  • Design spec: docs/superpowers/specs/2026-05-31-phase0-knowledge-answers-design.md
  • ADR-0002 (per-tenant schema + search_path scoping)
  • ADR-0004 (eval cache key, fresh-tenant fixture)
  • ADR-0005 (per-booking cost ceiling, answer_shaper role)
  • ADR-0010 D9 (prompt-cache-preserving user-template splice)
  • ADR-0006 (handoff model — referenced, not wired in Phase 0)
  • Reference architecture (deferred Phase-1 target): zol-rag backend/app/services/{rag_service,search_service,embedding_service}.py