ADR-0013: Phase 0 Tenant Knowledge Answers ("no-RAG RAG")
Status: Accepted
Date: 2026-05-31
Context
Ratiba's conversational agent today handles three inquiry intents —
services, hours, and other — through a single bilingual answer_shaper
LLM call (app/services/answer_shaper.py). The services and hours intents
render from structured tenant data; the other catch-all dead-ends: its
raw_data is an empty dict, so the prompt instructs the model to "politely
acknowledge you don't have a ready answer and offer to ask the team."
This means the agent cannot answer the most common free-form customer questions a service business gets: cancellation policy, parking, what a service includes, prep instructions, holiday hours, deposit rules. For the Aria Aura Spa pilot (M13) this is a visible gap — these questions arrive constantly on WhatsApp.
The obvious answer is RAG (retrieval-augmented generation), and the sister project zol-rag has a production-grade implementation (pgvector + embeddings + hybrid retrieval + reranking + CRAG). But porting it is ~11–17k LOC of core machinery (plus stripping ~16k LOC of medical-specific code), and — critically — Ratiba has no paying customer yet. Building a vector-retrieval subsystem before a single tenant is in production is speculative generality.
The insight that makes a cheaper path viable: a single service business's entire knowledge base is small — on the order of tens of short snippets, not thousands of documents. At that scale, retrieval is unnecessary: the whole KB fits in the prompt, and the LLM performs the "retrieval" by attention. We can ship the usefulness of RAG (grounded free-form answers) without any retrieval infrastructure, and defer the real thing until a tenant's KB actually outgrows the prompt — a moment we can make observable rather than guess.
This ADR records that deferral and the Phase-0 mechanism. The full design is
docs/superpowers/specs/2026-05-31-phase0-knowledge-answers-design.md.
Decision
Phase 0 mechanism: snippets-in-prompt, no retrieval
- Per-tenant knowledge_snippets table (new per-tenant Alembic revision in the tenant/ lineage; lives inside each tenant schema, auto-scoped by the ADR-0002 search_path routing, so there is no tenant_id column). Columns: id, category, title, body, language (default en, informational only), is_active, created_at, updated_at. Deliberately no embedding vector column; adding it is the single additive Phase-1 upgrade. category ∈ policy | facility | prep | service | hours | general drives intent routing.
- New seam app/services/knowledge.py::fetch_snippets(intent, *, limit, max_chars) loads active snippets for an intent: services → {service, general}, hours → {hours, general}, other → all categories (the catch-all). Not filtered by language (an EN-only KB still serves Swahili customers via render-time translation). This function is the seam that becomes the top-k cosine query in Phase 1; its signature and consumers stay identical.
- Injection via the existing data path: the dispatcher attaches raw_data["knowledge"] = await fetch_snippets(intent) (one line) before the existing shape_answer(...) call. Snippets serialize into raw_data_json and ride in the user template, exactly the path personality already uses (ADR-0010 D9).
- All three inquiry intents (services + hours + other) draw from the store; services answers additionally surface each service's existing description column.
- Content is hand-seeded by an idempotent scripts/seed_knowledge.py from a curated per-tenant YAML. No conversational or dashboard authoring in Phase 0.
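The intent-to-category routing and the one-line dispatcher injection described above can be sketched as follows. This is a minimal illustration, not the code in app/services/knowledge.py: the in-memory _FAKE_TABLE stands in for the per-tenant knowledge_snippets table, the sample rows are invented, the default values for limit and max_chars are assumptions, and build_raw_data is a hypothetical wrapper around the dispatcher line.

```python
import asyncio
from dataclasses import dataclass

# Intent -> categories, as listed in the decision. None means "all categories".
INTENT_CATEGORIES = {
    "services": {"service", "general"},
    "hours": {"hours", "general"},
    "other": None,  # the catch-all draws from every category
}


@dataclass(frozen=True)
class Snippet:
    category: str
    title: str
    body: str
    is_active: bool = True


# Hypothetical stand-in for the knowledge_snippets table; rows are invented.
_FAKE_TABLE = [
    Snippet("policy", "Cancellation", "Placeholder cancellation policy text."),
    Snippet("hours", "Holidays", "Placeholder holiday hours text."),
    Snippet("service", "Massage", "Placeholder service description."),
    Snippet("general", "Location", "Placeholder location text."),
    Snippet("facility", "Parking", "Retired note.", is_active=False),
]


async def fetch_snippets(intent: str, *, limit: int = 20, max_chars: int = 6000) -> list:
    """Return active snippets for an intent. No language filter is applied."""
    wanted = INTENT_CATEGORIES[intent]
    rows = [
        s for s in _FAKE_TABLE
        if s.is_active and (wanted is None or s.category in wanted)
    ]
    # max_chars mirrors the documented signature; this sketch only models
    # the snippet-count limit, not the character budget.
    return [
        {"category": s.category, "title": s.title, "body": s.body}
        for s in rows[:limit]
    ]


async def build_raw_data(intent: str, raw_data: dict) -> dict:
    # The single line the dispatcher adds before the existing shape_answer(...) call.
    raw_data["knowledge"] = await fetch_snippets(intent)
    return raw_data


if __name__ == "__main__":
    print(asyncio.run(build_raw_data("hours", {})))
```

Because consumers only see the list returned by fetch_snippets, replacing the list comprehension with a top-k similarity query later would not change the dispatcher line.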
Prompt-cache invariant preserved
answer_shaper.yaml is cache_eligible: true: its system_message is
byte-identical across all tenants, with per-tenant content confined to the
user template (ADR-0010 D9). Phase 0 preserves this exactly — snippets
enter through raw_data_json (user template), never the system message. The
prompt version bumps 0.2.0 → 0.3.0 (feeds the ADR-0004 4-tuple eval cache
key); the system-message text changes (new rules to answer from snippets)
but remains tenant-invariant.
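A test guarding this invariant might look roughly like the sketch below. SYSTEM_MESSAGE, USER_TEMPLATE, and render_prompt are hypothetical stand-ins for whatever answer_shaper.yaml and its loader actually provide; the point is only that two tenants with different snippets yield a byte-identical system message and differing user messages.

```python
import json
from string import Template

# Hypothetical stand-ins for the two parts of answer_shaper.yaml.
SYSTEM_MESSAGE = "Answer only from the provided data and knowledge snippets."
USER_TEMPLATE = Template("Question: $question\nData: $raw_data_json")


def render_prompt(question: str, raw_data: dict) -> tuple:
    """Return (system, user). Tenant content only ever reaches the user part."""
    user = USER_TEMPLATE.substitute(
        question=question,
        raw_data_json=json.dumps(raw_data, sort_keys=True),
    )
    return SYSTEM_MESSAGE, user


# Invented snippet content for two different tenants.
tenant_a = {"knowledge": [{"title": "Parking", "body": "Placeholder parking text."}]}
tenant_b = {"knowledge": [{"title": "Deposit", "body": "Placeholder deposit text."}]}

sys_a, user_a = render_prompt("Where do I park?", tenant_a)
sys_b, user_b = render_prompt("Is there a deposit?", tenant_b)

# The cached prefix must be byte-identical across tenants.
assert sys_a.encode("utf-8") == sys_b.encode("utf-8")
# Per-tenant knowledge appears only in the user message.
assert "Parking" in user_a and "Parking" not in sys_a
assert user_a != user_b
```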
Cost trade-off (accepted)
Snippets are uncached input tokens on every inquiry turn (the user
template is not part of the cached prefix). For a capped KB (~1500 tokens)
this is a few hundredths of a cent per turn — well inside ADR-0005's $0.05
soft / $0.20 hard per-booking ceiling. Tracked by the existing
record_cost / peek_cost; no new cost plumbing.
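The arithmetic behind "a few hundredths of a cent" can be checked with a short calculation. The input price below is an assumption chosen for illustration, not a figure from this ADR; substitute the actual model price. The ceilings are the ADR-0005 values quoted above.

```python
SNIPPET_TOKENS = 1500        # capped KB size stated in the text
PRICE_PER_MTOK_USD = 0.15    # ASSUMPTION: illustrative price per million input tokens
SOFT_CEILING_USD = 0.05      # ADR-0005 soft per-booking ceiling
HARD_CEILING_USD = 0.20      # ADR-0005 hard per-booking ceiling


def snippet_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost of sending `tokens` uncached input tokens once."""
    return tokens / 1_000_000 * price_per_mtok


per_turn = snippet_cost_usd(SNIPPET_TOKENS, PRICE_PER_MTOK_USD)
# Turns that snippet tokens alone could consume before reaching the soft
# ceiling. Real bookings also pay for the rest of the prompt and the output.
turns_to_soft = int(SOFT_CEILING_USD / per_turn)

if __name__ == "__main__":
    print(f"per turn: ${per_turn:.6f} ({per_turn * 100:.4f} cents)")
    print(f"snippet-only turns under the soft ceiling: {turns_to_soft}")
```

Under this assumed price, the snippets cost roughly 0.02 cents per turn, so the snippet tokens by themselves would need on the order of two hundred inquiry turns in one booking to reach the soft ceiling.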
Knowledge miss → deflect + log
On a question the KB doesn't cover, keep the existing "I'll check with the
team" deflection (no ADR-0006 handoff wiring). On every other-intent turn,
emit a structured knowledge_gap_candidate log event (question, tenant,
snippets-available count); the M13 daily WhatsApp digest surfaces recurring
gaps so Adrian curates snippets to close them. (Honest limitation: other is
the classifier's catch-all, so this logs catch-all questions, not a precise
"the bot failed here" — precise detection needs LLM self-report, which breaks
the plain-text output contract, and is deferred.)
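The gap event could be emitted along the lines of the sketch below. The event name comes from the text; the logger name, the field names, and the log_knowledge_gap helper are assumptions made for illustration.

```python
import json
import logging
from typing import Optional

# Hypothetical logger name.
logger = logging.getLogger("ratiba.knowledge")


def log_knowledge_gap(
    intent: str, question: str, tenant: str, snippets_available: int
) -> Optional[dict]:
    """Emit knowledge_gap_candidate for every other-intent turn.

    This logs catch-all questions, not confirmed answer failures.
    """
    if intent != "other":
        return None
    event = {
        "event": "knowledge_gap_candidate",
        "question": question,
        "tenant": tenant,
        "snippets_available": snippets_available,
    }
    # One JSON object per log line keeps the event easy to aggregate.
    logger.info(json.dumps(event, ensure_ascii=False))
    return event


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_knowledge_gap("other", "Do you have parking?", "example_tenant", 12)
```

A daily digest job could then group these events by question text per tenant to surface recurring gaps.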
Phase-0 → Phase-1 graduation trigger (observable)
fetch_snippets caps at ~20 snippets / ~1500 tokens (configurable). Below the
cap, the entire active KB is injected. Cap overflow is the graduation
signal: when a tenant routinely overflows (knowledge_overflow WARN),
stuffing-everything no longer fits and real retrieval (Phase 1: embeddings +
pgvector + top-k) is finally justified — for that tenant, with evidence.
YAGNI expires observably, not by guess.
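One way to make the cap and its overflow signal concrete is sketched below. The 20-snippet and 1500-token limits come from the text; the four-characters-per-token estimate, the apply_cap helper, the logger name, and the log message format are assumptions.

```python
import logging

logger = logging.getLogger("ratiba.knowledge")  # hypothetical logger name

MAX_SNIPPETS = 20   # from the text, configurable
MAX_TOKENS = 1500   # from the text, configurable


def estimate_tokens(text: str) -> int:
    # ASSUMPTION: rough heuristic of about four characters per token.
    return max(1, len(text) // 4)


def apply_cap(bodies, *, tenant, max_snippets=MAX_SNIPPETS, max_tokens=MAX_TOKENS):
    """Keep snippets in order until either cap is hit.

    Returns (kept, overflowed). Overflow is the Phase-1 graduation signal.
    """
    kept, used = [], 0
    for body in bodies:
        cost = estimate_tokens(body)
        if len(kept) >= max_snippets or used + cost > max_tokens:
            break
        kept.append(body)
        used += cost
    overflowed = len(kept) < len(bodies)
    if overflowed:
        logger.warning(
            "knowledge_overflow tenant=%s kept=%d total=%d",
            tenant, len(kept), len(bodies),
        )
    return kept, overflowed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    small_kb = ["x" * 40] * 5    # fits: the entire KB is injected
    large_kb = ["x" * 400] * 30  # exceeds the token budget
    print(apply_cap(small_kb, tenant="example_tenant")[1])
    print(apply_cap(large_kb, tenant="example_tenant")[1])
```

Counting how often the warning fires per tenant over a window is what would turn "routinely overflows" into a measurable threshold.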
Consequences
Positive
- Ships grounded free-form answers in a small backend-only change (one table, one new service function, one dispatcher line, one prompt revision), with no retrieval subsystem, no new infra, and no pgvector extension.
- The pilot's most common questions get real answers from day one.
- The table and fetch_snippets seam make Phase 1 a purely additive upgrade (add embedding column + backfill + swap the SELECT internals); consumers unchanged.
- The cost ceiling and graduation trigger are quantified and observable.
- Prompt-cache eligibility is provably preserved (cross-tenant system_message identity is test-guarded).
Negative / accepted
- Uncached snippet tokens on every inquiry turn (bounded by the cap; measured; inside the ceiling).
- No self-service content authoring — Adrian hand-seeds the pilot KB.
- Gap logging is coarse (catch-all questions, not precise answer failures).
- On LLM failure, the deterministic fallback renders only structured services/hours data; an other query degrades to today's deflection (acceptable, because failure is already silent-fallback by design).
Alternatives Considered
- Port zol-rag's full RAG now. Rejected as premature: ~11–17k LOC of core machinery + ~16k LOC of medical code to strip, built before any paying customer. Real retrieval is YAGNI until a tenant's KB outgrows the prompt.
- Seed YAML/MD file per tenant (no DB table). Fastest to pilot, but a dead-end: content can't move to the agent/dashboard later and would have to be migrated into a table for Phase 1 anyway — paying twice. The table costs one migration now and upgrades additively.
- JSONB blob on the tenant-config row. Avoids a new table but is awkward to query/order/cap and a poor base for the eventual embedding column.
- Real handoff on every miss (ADR-0006 wiring). Rejected for Phase 0 scope: pulls handoff plumbing and its test surface into a change meant to be minimal. Deflect + gap-log first; wire real escalation when the data shows it is needed.
References
- Design spec: docs/superpowers/specs/2026-05-31-phase0-knowledge-answers-design.md
- ADR-0002 (per-tenant schema + search_path scoping)
- ADR-0004 (eval cache key, fresh-tenant fixture)
- ADR-0005 (per-booking cost ceiling, answer_shaper role)
- ADR-0010 D9 (prompt-cache-preserving user-template splice)
- ADR-0006 (handoff model — referenced, not wired in Phase 0)
- Reference architecture (deferred Phase-1 target): zol-rag
backend/app/services/{rag_service,search_service,embedding_service}.py