Agentic Application Landscape, 2024–2026
Status: Landscape scan to anchor vocabulary for upcoming Ratiba ADRs (A1 orchestration, A2 human-in-the-loop, A3 evals, B dev-process). Audience: Adrian (solo founder) + future Ratiba contributors. Voice: Opinionated. Short on hedging.
0. Reading map
This document organizes the 2024–2026 agentic landscape by paradigm/team rather than chronology. For each section: what it is, what's stable vs. hype, and an Adopt/Reject/Park call for Ratiba — where Ratiba is a WhatsApp-first + voice-second, multi-tenant, Swahili/English booking agent with M-Pesa STK push and an "AI agent IS the UI" thesis (see the project's load-bearing design distinctions captured in user memory).
The three load-bearing constraints to keep front of mind:
- Tenant resolution happens at the channel boundary (no JWT).
- Conversation state IS canonical state for in-flight bookings.
- Voice and WhatsApp share the orchestration layer; only the AnswerShaper differs.
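The third constraint can be sketched as a single interface with two implementations. This is a minimal illustration, not Ratiba code: `Decision`, `WhatsAppShaper`, `VoiceShaper`, and `deliver` are hypothetical names, and the voice rules (strip markdown, keep at most two sentences) follow the AnswerShaper entry in the glossary.

```python
import re
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Decision:
    """Hypothetical channel-agnostic orchestrator output (not a Ratiba type)."""
    text: str
    options: list[str] = field(default_factory=list)


class AnswerShaper(Protocol):
    def shape(self, decision: Decision) -> dict: ...


class WhatsAppShaper:
    """Keeps markdown and exposes options as reply buttons."""

    def shape(self, decision: Decision) -> dict:
        return {"channel": "whatsapp", "body": decision.text,
                "buttons": list(decision.options)}


class VoiceShaper:
    """Strips markdown markers and keeps at most two sentences."""

    def shape(self, decision: Decision) -> dict:
        plain = re.sub(r"[*_`#]", "", decision.text).strip()
        sentences = re.split(r"(?<=[.!?])\s+", plain)
        return {"channel": "voice", "text": " ".join(sentences[:2])}


def deliver(shaper: AnswerShaper, decision: Decision) -> dict:
    """The orchestrator calls this without knowing which channel it is on."""
    return shaper.shape(decision)
```

The orchestrator produces one `Decision`; only the shaper passed to `deliver` changes between channels.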
1. Anthropic — Claude Agent SDK, MCP, computer use
What it is. Anthropic's canonical agent shape in 2026 has converged on three pieces: (a) the Claude Agent SDK (Python + TypeScript) which wraps message streaming, tool calling, sandboxing, OAuth-vaulted credentials, and a long-running harness; (b) the Model Context Protocol (MCP) — an open, USB-C-style standard for "AI app ↔ external tool/data" wiring, now widely adopted across providers and IDEs; and (c) computer-use as a first-party tool primitive (screenshot + click/type) that the SDK natively understands.
The canonical Claude shape is best read off two Anthropic engineering posts:
- "Building Effective Agents" (Dec 2024, still cited as canonical in 2026): a tight taxonomy of workflows (predefined code paths) vs. agents (LLM-directed loops) with five workflow patterns — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer.
- "Effective harnesses for long-running agents": single-loop, single-threaded master loop with disciplined planning, progress files, and pre-flight tests as the recommended production shape.
Multi-agent vs single-loop — Anthropic's actual position. Both are real, but they apply to different problems. From the multi-agent research system writeup: an orchestrator + parallel subagents architecture beat a single-agent baseline by 90.2% on their internal research eval, but uses ~15× more tokens than a chat session and is explicitly not recommended for tasks "where all agents need the same context or where there are many dependencies between agents — like coding". Claude Code itself ships as a deliberately single-threaded master loop (analysis).
Stable vs. hype.
- Stable: MCP as a wire protocol is now the default integration surface (OpenAI, Google, Vercel AI SDK all building MCP support); tool-use; structured outputs (now GA on Claude); the workflows-vs-agents taxonomy.
- Hype: "spawn 10 subagents and let them collaborate" — Anthropic's own data says this only pays off for breadth-first, parallelizable, high-value research-style tasks. Don't reach for it because it sounds advanced.
Adopt / Reject / Park for Ratiba.
- Adopt: Claude as Phase 1 LLM brain (already implied by ADR-0001 via "best-in-class for Swahili"), and MCP as the internal tool-bus shape so calendar/M-Pesa/CRM tools speak the same protocol regardless of which model we route to.
- Park: multi-agent orchestrator/worker. Booking is tightly interdependent ("which slot? which service? confirm? pay?") — the exact anti-pattern Anthropic warns against. Single-loop with strong FSM scaffolding is the right shape.
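"Single-loop with strong FSM scaffolding" could look roughly like the sketch below. The forward path follows the booking flow named in the glossary (greet → identify → service → slot → confirm → pay); the back-edges, the `DONE` state, and the `advance` helper are assumptions added for illustration.

```python
from enum import Enum


class State(Enum):
    GREET = "greet"
    IDENTIFY = "identify"
    SERVICE = "service"
    SLOT = "slot"
    CONFIRM = "confirm"
    PAY = "pay"
    DONE = "done"  # assumed terminal state


# Allowed transitions. The back-edges (SLOT -> SERVICE, CONFIRM -> SLOT)
# are assumptions so a customer can change their mind.
TRANSITIONS = {
    State.GREET: {State.IDENTIFY},
    State.IDENTIFY: {State.SERVICE},
    State.SERVICE: {State.SLOT},
    State.SLOT: {State.CONFIRM, State.SERVICE},
    State.CONFIRM: {State.PAY, State.SLOT},
    State.PAY: {State.DONE},
    State.DONE: set(),
}


def advance(current: State, proposed: State) -> State:
    """Accept the LLM's proposed next state only if the FSM allows it."""
    if proposed in TRANSITIONS[current]:
        return proposed
    return current  # illegal jump: stay put and let the loop re-prompt
```

The point of the design is that the model proposes and the FSM disposes: a proposal to jump from slot selection straight to payment is simply ignored.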
2. OpenAI — Assistants API (sunset), Agents SDK, Responses API
What it is. OpenAI's agent story in 2026 is a generational reset. The Assistants API is deprecated and sunsets August 26, 2026. The replacement is the Responses API (lower-level, stateless-by-default, with Conversations API for threads) plus the OpenAI Agents SDK — a production-ready evolution of the experimental Swarm framework.
The Agents SDK ships handoffs, guardrails, sessions, tracing, and as of April 2026 a Codex-style harness with sandboxing, long-horizon resume, and subagents. It is provider-agnostic — supports 100+ LLMs via Chat Completions compatibility, not just OpenAI models.
Breaking changes to track.
- Threads/Files/Vector Stores in Assistants → must be re-implemented on Responses + your own storage. Migration is non-trivial; multiple vendor-led "wire-compatible" shims are emerging (example) but those are platform lock-in by another name.
- Swarm → Agents SDK is mostly additive but the agent/handoff abstraction was renamed and tightened.
Stable vs. hype.
- Stable: Responses API as the new primitive; structured outputs with JSON schema (strict mode, 99.9% schema compliance per benchmarks); function calling.
- Hype: "OpenAI Agents SDK as universal orchestration layer" — it's good, but it's still mostly an OpenAI-tilted runtime. Claim of provider-agnosticism is technically true but the ergonomic gradient leans Responses API.
Adopt / Reject / Park for Ratiba.
- Reject as primary brain: Swahili quality and cost-per-conversation favor Claude (and Deepgram for STT) per the existing voice-stack reference. Adopting OpenAI as primary would make us a worse product.
- Park as backup model: keep the option open via an LLM-router abstraction so we can A/B GPT-4-class models for English-only tenants if the cost ratio shifts.
- Adopt the Responses-API patterns conceptually: stateless-call + external state store (in our case Redis-hydrated FSM) maps cleanly to how Ratiba already wants to work.
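The stateless-call plus external-state-store pattern can be sketched as follows. A plain dict stands in for Redis, and `fake_model`, `handle_turn`, and the key format are hypothetical; the only idea taken from the text is that every call hydrates state, passes it to a model that keeps no memory, and writes the result back.

```python
import json
from typing import Callable

# Stand-in for Redis: maps "tenant:conversation" keys to serialized FSM state.
STORE: dict[str, str] = {}


def load_state(key: str) -> dict:
    raw = STORE.get(key)
    return json.loads(raw) if raw else {"slots": {}, "turns": []}


def save_state(key: str, state: dict) -> None:
    STORE[key] = json.dumps(state)


def handle_turn(key: str, user_text: str,
                model: Callable[[dict, str], dict]) -> str:
    state = load_state(key)               # hydrate
    result = model(state, user_text)      # stateless call: all context passed in
    state["slots"].update(result.get("slots", {}))
    state["turns"].append({"user": user_text, "agent": result["reply"]})
    save_state(key, state)                # persist before replying
    return result["reply"]


def fake_model(state: dict, user_text: str) -> dict:
    """Hypothetical stand-in for an LLM call; it keeps no memory of its own."""
    if "haircut" in user_text.lower():
        return {"reply": "Which day works for you?",
                "slots": {"service": "haircut"}}
    service = state["slots"].get("service", "unknown service")
    return {"reply": f"Booking a {service} for {user_text}.",
            "slots": {"day": user_text}}
```

Because the model function receives the full state on every call, swapping the model behind it does not disturb in-flight conversations.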
3. LangChain / LangGraph — state machines, interrupt/resume, HITL
What it is. LangGraph is LangChain's stateful agent framework: you model the agent as a graph of nodes with explicit state, run it on a checkpointer (Postgres, SQLite, Redis), and use interrupt() + Command(resume=…) for human-in-the-loop pauses. It is widely considered the most production-ready Python agent framework in 2026 (survey).
The HITL primitive that matters. The mental model is: a node calls interrupt(payload) → the framework persists the entire graph state via the checkpointer → returns control to the caller → external system (Slack, WhatsApp, web UI) collects the human input → caller invokes the graph again with Command(resume=value) and the same thread_id → execution resumes from the same node, which now sees the human input. This is exactly the shape Ratiba needs for admin handoff: the agent suspends, the human takes the WhatsApp thread, then hands back. (LangChain blog)
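The control flow above can be imitated in a few lines of plain Python. This is deliberately not LangGraph's API: `Interrupt`, `invoke`, `handoff_node`, and the in-memory `CHECKPOINTS` dict are hypothetical stand-ins that only mirror the pause, persist, resume-by-thread_id sequence.

```python
from typing import Optional

# Stand-in for a durable checkpointer keyed by thread_id.
CHECKPOINTS: dict[str, dict] = {}


class Interrupt(Exception):
    """Raised by a node that needs human input before it can continue."""

    def __init__(self, payload: dict) -> None:
        super().__init__("waiting for human input")
        self.payload = payload


def handoff_node(state: dict, resume: Optional[str]) -> dict:
    if resume is None:
        raise Interrupt({"question": f"Customer asks: {state['request']}"})
    return {**state, "admin_reply": resume, "status": "done"}


def invoke(thread_id: str, state: Optional[dict] = None,
           resume: Optional[str] = None) -> dict:
    if resume is not None:
        if thread_id not in CHECKPOINTS:
            raise ValueError(f"no paused thread {thread_id!r}")
        state = CHECKPOINTS.pop(thread_id)   # rehydrate the paused thread
    if state is None:
        raise ValueError("a new thread needs an initial state")
    try:
        return handoff_node(state, resume)
    except Interrupt as pause:
        CHECKPOINTS[thread_id] = state       # persist before returning control
        return {"status": "interrupted", "payload": pause.payload}
```

A first `invoke("t1", state={...})` returns an interrupted status with the payload for the admin; a later `invoke("t1", resume="...")` re-enters the same node with the human's answer.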
Stable vs. hype.
- Stable: the graph + checkpointer + interrupt model; the Postgres checkpointer in production; LangSmith integration for tracing.
- Hype: the broader LangChain umbrella (chains, hubs, etc.) is increasingly considered legacy. "LangChain" as a positive brand is fading; "LangGraph" is what serious teams pick.
- Caveat: the JS port lags Python and has known interrupt-resume bugs as recently as 2025.
Adopt / Reject / Park for Ratiba.
- Adopt the pattern: interrupt-and-resume on a Postgres checkpointer is the canonical shape for our admin-handoff requirement. Whether we literally use LangGraph or build our own thin FSM on top of asyncpg + Redis is a separate ADR (likely A1 orchestration). LangGraph is heavyweight and Python-first (the JS port lags, per the caveat above). Building our own lets us design tenant-scoped checkpointing in from the start, but we should plagiarize the API shape shamelessly.
- Reject: LangChain proper (chains, agents, memory wrappers) — too much abstraction tax for a 1-FTE team.
4. Vercel AI SDK — streamUI / generative UI
What it is. The Vercel AI SDK is a TypeScript-first toolkit (>20M monthly downloads) for building chat/agent UIs. Headline primitive: streamUI — the model can stream React Server Components, not just text, so the UI itself is generated by the model in response to tool calls. The AI SDK is also building native MCP client support.
Stable vs. hype.
- Stable: streamText and useChat for plain chat; tool/function calling; provider abstraction (swap Anthropic ↔ OpenAI ↔ Gemini with one line).
- Hype: generative UI as the primary paradigm. It's beautiful in demos. In production, it makes regression testing nightmarish (your UI is now a non-deterministic LLM output) and it's Next.js + RSC-shaped, which is a load-bearing architectural commitment.
Adopt / Reject / Park for Ratiba.
- Park. In Phase 1 the dashboard is admin fallback only: onboarding, bulk uploads, viewing the tenant config. Generative UI for that is overkill; Tailwind + shadcn handles it. If Phase 3 ever turns the dashboard into a "talk to your business data" agent, this becomes Adopt, since the streamUI pattern is the cleanest answer in the JS world. Until then, no commitment.
5. CrewAI — role-based multi-agent orchestration
What it is. CrewAI is a Python framework where you declare agents with roles ("Researcher", "Writer", "Reviewer"), tasks, and a process (sequential, hierarchical, or hybrid). 40k+ GitHub stars by early 2026. (CrewAI repo). Hierarchical mode adds a manager-agent that coordinates and reviews subagent output.
Stable vs. hype.
- Stable: the role-based DSL; the lowest-friction "Hello World" for multi-agent in Python (under 20 lines).
- Hype: production-readiness. Independent surveys (example) consistently rank CrewAI as "medium production readiness" vs. LangGraph as "highest". The role metaphor is great for marketing demos and internal back-office automations; for customer-facing latency-sensitive systems it adds layers without buying us anything.
Adopt / Reject / Park for Ratiba.
- Reject. Wrong shape for our problem. We have one agent persona (the booking assistant) on each side of the conversation (admin orchestrator + customer orchestrator); the value is in deep FSM control, not in orchestrating multiple personas. Adopting CrewAI would obscure our actual control flow.
6. Microsoft AutoGen — v0.4 redesign
What it is. AutoGen v0.4 (January 2025) was a ground-up rewrite around an asynchronous, event-driven, three-layer architecture (core / agent-chat / extensions). The high-level AgentChat API offers RoundRobinGroupChat and SelectorGroupChat (the latter dynamically picks the next speaker). AutoGen Studio adds real-time agent observability and mid-execution control. Microsoft Research's Magentic-One sits on top as a generalist multi-agent system with an Orchestrator that plans, tracks progress, and re-plans on failure.
Stable vs. hype.
- Stable: the v0.4 redesign delivered on its promises — async-first is correct; the GroupChat pattern remains the most influential multi-agent pattern in published literature.
- Hype: "GroupChat for everything." Same critique as CrewAI — it's a hammer for orchestration problems, but conversational booking isn't an orchestration problem, it's a state-machine problem.
Adopt / Reject / Park for Ratiba.
- Reject. Same logic as CrewAI. AutoGen's natural home is research, internal automation, and complex enterprise workflows — not WhatsApp booking.
7. Letta (formerly MemGPT) — agent memory architectures
What it is. Letta productionizes the MemGPT paper's idea: treat the LLM context window as RAM, with the agent itself responsible for paging knowledge in and out via tool calls.
The three-tier memory model:
- Core memory — small in-context block the agent reads/writes directly (persona, user pinned facts).
- Recall memory — searchable conversation history outside context.
- Archival memory — long-term store the agent queries via explicit tools.
Letta's recent direction (Letta v1 agent loop, Letta Code) is to be the memory primitive that any agent harness can ride on, model-agnostic. Compared to the Mem0/Zep family, Letta uses the model itself to curate memory; Mem0 uses heuristics. (Comparison).
Stable vs. hype.
- Stable: the memory-hierarchy concept (it's just OS paging, applied to context windows) is the right mental model and is referenced throughout 2025–2026 production-grade agent literature.
- Hype: "deploy Letta as your memory backend" for narrow business agents. For a booking flow most "memory" is just a row in the tenants schema (this customer's last appointment, their service preference) — you don't need a self-editing agent for that.
Adopt / Reject / Park for Ratiba.
- Adopt the pattern (memory hierarchy): model what's load-bearing in-context (current FSM slots, tenant persona) vs. recallable (last 5 conversations, summarized) vs. archival (full message history in Postgres for audit/compliance).
- Park Letta as a runtime dependency until we have a Phase-3 "remember preferences across months" requirement that Postgres + a summarizer can't satisfy.
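The Ratiba mapping of the memory hierarchy might be modelled like this. It is a sketch under assumptions: `ConversationMemory` and its methods are hypothetical, lists stand in for Postgres, and the summaries are passed in rather than generated.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationMemory:
    # Core: what must sit in the prompt (FSM slots, tenant persona).
    core: dict = field(default_factory=dict)
    # Recall: summaries of the most recent conversations.
    recall: list[str] = field(default_factory=list)
    # Archival: full message log kept for audit, never sent to the model.
    archival: list[dict] = field(default_factory=list)
    recall_limit: int = 5

    def record(self, role: str, text: str) -> None:
        self.archival.append({"role": role, "text": text})

    def close_conversation(self, summary: str) -> None:
        # Keep only the newest recall_limit summaries.
        self.recall = (self.recall + [summary])[-self.recall_limit:]

    def prompt_context(self) -> str:
        lines = [f"{key}: {value}" for key, value in self.core.items()]
        lines += [f"previously: {summary}" for summary in self.recall]
        return "\n".join(lines)
```

Only `core` and `recall` reach the prompt; `archival` grows without bound but stays out of the context window.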
8. Voice-agent stacks — LiveKit, Pipecat, Vapi, Retell, ElevenLabs
This is the most operationally interesting section for Ratiba because Phase 2 is voice and the latency/barge-in problem is unforgiving.
The proven production pipeline shape (per LiveKit's sequential pipeline writeup):
Audio In → VAD → STT → LLM → TTS → Audio Out
Target latencies:
- VAD: 10–50 ms
- STT first partial: <100 ms (streaming)
- LLM first token: 300–800 ms
- TTS first chunk: 100–200 ms
- End-to-end target: <800 ms streaming, <500 ms feels human.
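The single biggest win is streaming every stage, so time to first audio drops from Σ(full stage durations) to roughly Σ(first-chunk latencies), with steady-state throughput bounded by the slowest stage, max(stages).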
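A quick budget check, assuming each stage starts as soon as its upstream emits a first chunk, so time to first audio is approximately the sum of first-output latencies. The numbers are the upper bounds of the targets listed above; the function names are illustrative.

```python
# Upper bounds of the per-stage first-output targets listed above, in ms.
FIRST_OUTPUT_MS = {"vad": 50, "stt": 100, "llm": 800, "tts": 200}


def time_to_first_audio(stages: dict[str, int]) -> int:
    """Approximate time to first audio when every stage streams."""
    return sum(stages.values())


def llm_budget(target_ms: int, stages: dict[str, int]) -> int:
    """Largest LLM first-token latency that still meets target_ms."""
    others = sum(ms for name, ms in stages.items() if name != "llm")
    return target_ms - others


worst_case = time_to_first_audio(FIRST_OUTPUT_MS)  # 1150
budget_800 = llm_budget(800, FIRST_OUTPUT_MS)       # 450
budget_500 = llm_budget(500, FIRST_OUTPUT_MS)       # 150
```

With every other stage at its upper bound, the LLM gets about 450 ms of first-token latency to meet the 800 ms target and about 150 ms to meet 500 ms, which is why the LLM stage dominates the tuning effort.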
Barge-in / interruption handling. When VAD detects user speech during agent TTS playback: cancel TTS, flush the audio queue, restart the pipeline at STT. Naive barge-in suffers from false positives ("mm-hmm" or background noise stops the agent mid-sentence). LiveKit's Adaptive Interruption Handling (2026) uses an audio-domain classifier — not transcript-domain — to decide whether the user actually wants the floor.
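The cancel, flush, restart sequence can be sketched as a small controller. `PlaybackController` is hypothetical and is not LiveKit's API; the `wants_floor` flag stands in for whatever interruption classifier sits in front of it.

```python
from collections import deque


class PlaybackController:
    """Toy model of agent playback state during a voice turn."""

    def __init__(self) -> None:
        self.audio_queue: deque[str] = deque()
        self.speaking = False
        self.listening = True

    def start_tts(self, chunks: list[str]) -> None:
        self.audio_queue.extend(chunks)
        self.speaking = True
        self.listening = False

    def on_user_speech(self, wants_floor: bool) -> bool:
        """Handle a VAD event; returns True if the agent yielded the floor.

        wants_floor is the verdict of an interruption classifier
        (assumed to exist upstream); backchannels like "mm-hmm" should
        arrive here as False.
        """
        if not self.speaking or not wants_floor:
            return False
        self.speaking = False        # cancel TTS
        self.audio_queue.clear()     # flush queued audio
        self.listening = True        # restart the pipeline at STT
        return True
```

Naive barge-in is the special case where `wants_floor` is always True, which is exactly what produces the false positives described above.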
Filler-clock / "thinking sound" patterns. Production systems play a brief acknowledgement ("mm-hmm", "let me check") when the LLM is going to take >300 ms, so the user doesn't think the line dropped. LiveKit and Pipecat both expose this; on hosted platforms (Vapi, Retell) it's a config toggle.
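One way to implement a filler clock with asyncio, as a sketch: start the LLM call, wait up to the threshold, and speak a filler only if the answer has not arrived. `respond_with_filler`, `llm_call`, and `say` are hypothetical names; the 300 ms default comes from the text.

```python
import asyncio
from typing import Awaitable, Callable


async def respond_with_filler(
    llm_call: Callable[[], Awaitable[str]],
    say: Callable[[str], Awaitable[None]],
    filler: str = "hebu nikague",
    threshold_s: float = 0.3,
) -> str:
    """Speak a filler if the LLM has not answered within threshold_s."""
    task = asyncio.ensure_future(llm_call())
    done, _pending = await asyncio.wait({task}, timeout=threshold_s)
    if not done:
        await say(filler)      # LLM is slow: reassure the caller
    answer = await task        # then wait for the real answer
    await say(answer)
    return answer
```

The LLM task keeps running while the filler plays, so the filler adds no latency to the real answer.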
Platform comparison (2026):
| Platform | Model | Best for | Latency | Notes |
|---|---|---|---|---|
| LiveKit Agents | OSS framework + Cloud | Self-host, custom, SIP-native | <500 ms achievable | Already in ADR-0001. Mature SIP bridge. |
| Pipecat | OSS framework | Self-host, Python pipelines | <500 ms achievable | Daily.co's framework, similar shape to LiveKit. |
| Vapi | Hosted | Fast time-to-first-call | 500–600 ms | Easiest dashboard; least control. |
| Retell | Hosted | Sales/CX teams | 600–800 ms | Polished but hosted-only. |
| ElevenLabs Conv. AI | Hosted | Voice quality + 70+ languages | not benchmarked here | Native MCP, 70-language detection, $11B valuation Feb 2026. |
The crossover threshold (Hamming benchmark) for build-vs-buy is ~10–50K minutes/month: above that, self-hosted (LiveKit/Pipecat) wins on cost, control, and compliance.
Stable vs. hype.
- Stable: the streaming sequential pipeline; barge-in via VAD + interruption events; filler-clock as a pattern; <500 ms feels human as a UX SLO.
- Hype: end-to-end "voice models" (single model that goes audio→audio) — they exist (GPT-4o realtime, etc.) but are still expensive, English-tilted, and lack the per-stage observability you need to debug a Swahili tenant complaining about misrecognition.
Adopt / Reject / Park for Ratiba.
- Adopt: LiveKit + Deepgram Nova-3 + ElevenLabs Multilingual v2 — already in ADR-0001 and validated by the zol-rag voice-stack pattern (captured in user memory). Ship Adaptive Interruption Handling on day 1 of Phase 2.
- Adopt as discipline: filler-clock pattern with Swahili + English prompts ("hebu nikague" / "let me check") wired into the orchestrator any time an LLM call is expected to exceed 300 ms.
- Reject: end-to-end audio-audio models for Phase 2 — too opaque for a multi-tenant compliance-sensitive product where we may have to debug "why did it mishear mtoto as moto".
9. Notable papers/projects still influencing 2026 production
| Paper / Project | Year | Still influential because… |
|---|---|---|
| ReAct | 2022 | The interleaved Reasoning + Acting loop is the default agent skeleton across every framework above. Mostly invisible because it's the substrate. |
| Reflexion | 2023 | "Verbal self-critique" still shows up as the evaluator-optimizer workflow in Anthropic's taxonomy, and as the "reflection" step in CrewAI/LangGraph. Known to suffer from cascading-failure on long horizons; modern variants add explicit recovery paths. |
| Voyager | 2023 | Coined the skill library pattern (agent writes reusable code, stores it, retrieves later). This is the conceptual ancestor of "agents that build their own tools" and informs Letta Code, Devin, Magentic-One. |
| SWE-Agent | 2024 | The agent-computer interface (ACI) idea — design tools for the LLM, not for the human — is now best practice. mini-SWE-agent (~100 LoC, scores >74% on SWE-bench Verified) is the existence proof that less harness, more discipline wins. |
| Devin | 2024 | The first "long-horizon autonomous engineer" framing. Whether or not you respect the marketing, Devin's deployment at Goldman/Santander/Nubank made "agent that runs for hours, commits PRs, asks for review" a category. |
| Magentic-One | 2024 | Microsoft's Orchestrator pattern (plan → track → re-plan on failure) is the academic root of the AutoGen v0.4 GroupChat selector. More research-influence than production-influence. |
Adopt / Reject / Park for Ratiba.
- Adopt as mental models: ReAct loop (substrate of our orchestrator), evaluator-optimizer (use it for the answer-shaper sanity check before sending a WhatsApp message), agent-computer interface (when we design our internal tool surface, design it for the LLM — descriptive names, narrow signatures, returns that already explain themselves).
- Park: skill library (Voyager-style) — premature for booking flows; revisit if we ever let an admin "teach the agent" via natural language.
10. Meta-trends — SLMs, structured outputs, eval-driven dev, observability
Small language models for routine steps. The NVIDIA position paper and LogRocket's primer argue the same thing: most agent invocations (intent classification, tool routing, slot extraction) don't need a frontier model. SLMs (sub-10B params, often local) are 5–20× cheaper at inference and competitive on narrow tasks. The emerging 2026 pattern is router-then-escalate: an SLM does the boring 80%, escalates ambiguous cases to a frontier LLM.
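Router-then-escalate reduces to a confidence check. In this sketch the two "models" are toy functions and the 0.8 threshold is an arbitrary assumption; a real router would call an SLM that returns a label with a calibrated confidence.

```python
from typing import Callable, Tuple


def route_intent(
    text: str,
    small_model: Callable[[str], Tuple[str, float]],
    frontier_model: Callable[[str], str],
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, which_model_answered)."""
    label, confidence = small_model(text)
    if confidence >= min_confidence:
        return label, "slm"
    return frontier_model(text), "frontier"   # escalate ambiguous cases


def toy_small_model(text: str) -> Tuple[str, float]:
    """Hypothetical SLM: keyword match with a made-up confidence."""
    lowered = text.lower()
    if "book" in lowered:
        return "book_appointment", 0.95
    if "cancel" in lowered:
        return "cancel_appointment", 0.9
    return "unknown", 0.3


def toy_frontier_model(text: str) -> str:
    """Hypothetical frontier LLM fallback."""
    return "needs_clarification"
```

Logging which branch answered gives the measurement needed to decide whether the router is worth its complexity.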
Structured outputs / constrained generation. As of 2026, json_schema strict mode is supported by OpenAI (since Aug 2024), Anthropic (GA early 2026), Google Gemini, Cohere, xAI. Constrained decoding gives a mathematical guarantee of schema compliance (invalid tokens get logits set to -inf), not a statistical one. JSON mode without a schema is now considered legacy. Cost: 30–300 extra tokens per call depending on provider. (Reference).
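The "logits set to -inf" mechanism can be shown on a toy vocabulary. This is an illustration of the masking step only, not any provider's decoder; the token names and scores are made up, and `allowed` must contain at least one token from the vocabulary.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every token the schema forbids to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def softmax(logits: dict[str, float]) -> dict[str, float]:
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# At the start of a JSON object or array, only "{" or "[" is valid.
raw = {"{": 1.0, "hello": 5.0, "[": 0.5}
probs = softmax(mask_logits(raw, allowed={"{", "["}))
best = max(probs, key=probs.get)
```

The unconstrained favourite ("hello") ends up with probability exactly zero, so no sampling temperature can select it; that is the sense in which the guarantee is mathematical rather than statistical.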
Eval-driven development. Framed as "the missing discipline" of agentic AI (Adnan Masood, Anthropic's "Demystifying evals"). The recommended starting point is universally the same: 20–50 real failures from production or pilot users, encoded as test cases, run on every prompt/model change. Frameworks: DeepEval, Braintrust, Maxim, LangSmith. Red Hat publishes an 8-stage framework for CI/CD-integrated agent evals.
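The smallest useful eval gate is a list of real failures and a function that refuses to pass if any of them regress. This sketch uses a substring check as the grader, which is an assumption for brevity; `EvalCase`, `failing_cases`, and `gate` are hypothetical names and none of the frameworks listed above is used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    user_text: str
    must_contain: str   # crude grader: expected substring in the reply


def failing_cases(agent: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of cases whose reply lacks the expected substring."""
    return [case.name for case in cases
            if case.must_contain.lower() not in agent(case.user_text).lower()]


def gate(agent: Callable[[str], str], cases: list[EvalCase]) -> bool:
    """True only if every recorded failure case now passes."""
    return not failing_cases(agent, cases)


CASES = [
    EvalCase("asks_for_day", "I need a haircut", "which day"),
    EvalCase("swahili_greeting", "Habari", "karibu"),
]
```

Wired into CI, a `False` from `gate` blocks the prompt or model change that caused it.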
Prompt versioning. Treat prompts as production assets: branching, approval workflows, rollback, eval-on-change. Top platforms 2026: Langfuse, LangSmith, Maxim, Braintrust, Helicone. (Survey).
Observability platforms. (Survey).
- Langfuse — open-source, easy self-host, generous free tier (50k events/mo), strong prompt-mgmt + eval features. Best default for OSS-leaning teams.
- LangSmith — best if you're on LangChain/LangGraph; pricing ($39/seat/mo before any traces) is the friction.
- Helicone — gateway-style (URL/header swap), fastest setup, weak on agent-specific tracing.
- Phoenix (Arize) — Elastic-2.0 OSS, runs in-notebook, strong eval features.
- Arize AX — enterprise-grade, statistical rigor, overkill for solo founders.
Adopt / Reject / Park for Ratiba.
- Adopt structured outputs in strict mode for every agent → tool call. Non-negotiable. The cost of a malformed M-Pesa STK push payload is real money lost.
- Adopt eval-driven dev as soon as we have 20 real WhatsApp transcripts. Encode the failure modes; gate prompt/model changes behind the eval suite. (Likely ADR A3.)
- Adopt Langfuse as the observability backbone. Self-hostable, OSS, prompt management built-in, doesn't lock us to a model vendor. Helicone is tempting for the gateway pattern, but won't show us FSM transitions.
- Park SLM router until we measure that frontier-model cost is a problem. Premature optimization for a PoC; obvious win for a multi-tenant scale-up. Revisit at >1k conversations/day per tenant.
Appendix A — Vocabulary Glossary (Ratiba canonical usage)
| Term | Definition (as we use it in Ratiba) |
|---|---|
| Agent loop | The repeated cycle of (perceive context → decide next action → call tool or respond → observe result). Our orchestrator IS one. |
| Single-loop / single-threaded master loop | One agent instance, one chronological loop, one context window. The shape Ratiba uses — not fan-out subagents. |
| Multi-agent / supervisor pattern | Lead agent spawns parallel subagents on sub-tasks. We don't use this; we acknowledge it exists. |
| Tool calling / function calling | Model emits a structured request to invoke an external function; framework executes it; result is fed back into the next model turn. |
| MCP (Model Context Protocol) | Open standard wire protocol for "agent ↔ tool/data source". Our internal calendar/M-Pesa/CRM tools should expose MCP-shaped servers so we can swap LLM brains. |
| Structured outputs / constrained generation | Forcing the model to emit JSON conforming to a schema via constrained decoding. Mandatory for any tool-call payload that touches money or persistence. |
| ReAct | Interleaved Reasoning + Acting prompt pattern. The substrate of our orchestrator. |
| Workflow vs Agent (Anthropic taxonomy) | Workflow = code-defined paths between LLM calls; Agent = LLM dynamically chooses next step. Our FSM is mostly workflow with agent-shaped escape hatches. |
| Interrupt-and-resume | Persist full agent state, return control to caller (admin), accept human input, resume from same state with thread_id. The pattern for our admin handoff. |
| Checkpointer | The storage layer that makes interrupt-and-resume possible. For us: Postgres (durable) + Redis (hot path). |
| Human-in-the-loop (HITL) | Any pattern where a human approves, overrides, or takes over agent execution. Ratiba's admin-handoff is the canonical case. |
| FSM (finite-state machine) | Our explicit booking-flow state machine: greet → identify → service → slot → confirm → pay. Conversation state IS canonical state until commit. |
| Slot filling | Extracting and tracking the named entities the FSM needs (service, date, time, customer name) across conversation turns. |
| Multi-turn grounding | Making sure each turn's response references prior turns correctly — i.e., the agent doesn't ask "what service?" after the user already said "haircut". |
| Memory hierarchy (MemGPT-style) | Core (in-context) / Recall (searchable conversation) / Archival (cold storage). Ratiba's mapping: FSM slots / summarized conversation history / full Postgres audit log. |
| AnswerShaper | Ratiba-specific: the channel-agnostic interface that turns an orchestrator decision into either WhatsApp output (buttons/lists/markdown) or voice output (≤2 sentences, no markdown). |
| Filler clock | Voice pattern: emit a short acknowledgement ("mm-hmm", "hebu nikague") when LLM is going to take >300 ms, so the user doesn't think the line dropped. |
| Barge-in / interruption | User speaks while agent is talking. Detection: VAD; correct response: cancel TTS, flush queue, restart STT. Use adaptive interruption (audio classifier) to avoid false positives. |
| Generative UI / streamUI | Model streams UI components, not just text. Parked for Ratiba dashboard fallback. |
| RAG (retrieval-augmented generation) | Retrieve relevant context (e.g., tenant-specific FAQs) and inject into prompt. We will use this for tenant-specific knowledge (services, policies). |
| Guardrails | Pre-/post-LLM filters that enforce safety, scope, format, or business rules. Distinct from structured outputs (which enforce shape) — guardrails enforce policy. |
| Eval-driven development | Define 20–50 real-failure test cases; gate every prompt/model/code change on them; never ship a prompt that regresses the eval suite. |
| Prompt versioning | Treat prompts as code: tracked, branched, reviewed, deployable, rollback-able, evaluated on every change. |
| Observability (LLM) | Traces of (prompt, model call, tool calls, latency, cost, user feedback) for every conversation. Our default: Langfuse. |
| SLM router | A small model that does cheap classification/routing and only escalates ambiguous cases to a frontier LLM. Cost optimization, not a Phase-1 concern. |
Appendix B — Open Questions Surfaced
These are decisions the landscape exposes that warrant their own focused investigation before commitment:
A1 — Orchestration patterns
- Do we adopt LangGraph as a runtime dependency, or build a thin FSM on asyncpg + Redis that imitates its interrupt/resume API? Trade-off: LangGraph gives us a battle-tested checkpointer but pulls in the LangChain ecosystem and is Python-first. A custom FSM is leaner but is yet another thing to maintain. Needs a deeper dive into LangGraph's tenant-isolation story (does the Postgres checkpointer support search_path per call?) before we can choose.
- Workflow-vs-agent split: which parts of the booking flow are deterministic FSM transitions and which are LLM-decided? My instinct is "as deterministic as possible, LLM only for slot-filling and ambiguity resolution", but this needs to be specified turn-by-turn.
A2 — Human-in-the-loop
- The admin-handoff "context briefing" is under-specified. What goes in it? Last N turns verbatim? An LLM-generated summary? Both? Who pays for the summary call?
- Hand-back: how does the admin signal "OK agent, you can take it from here"? A magic phrase? A dashboard button? A timeout?
- WhatsApp UX during handoff: does the customer see a "talking to a human now" indicator? (Compliance question — in some jurisdictions, AI-vs-human disclosure is required.)
A3 — Eval frameworks
- We need a Swahili-aware eval suite. Most published eval datasets are English. Do we hand-curate from pilot transcripts (slow, accurate) or synthesize (fast, suspect)? Likely a mix.
- LLM-as-judge for eval grading: which model do we trust to grade Swahili responses? This is a confidence-of-grader problem we haven't seen good public answers for.
B — Development-process patterns
- Prompt versioning storage: do prompts live in code (git-versioned, deployed with the binary) or in Langfuse (live-editable, hot-reloaded)? The right answer is probably "in code, with Langfuse as observability + experiment branch". Needs an ADR.
- Tool surface: do we expose internal tools (calendar, M-Pesa) via MCP (uniform across LLM brains, more ceremony) or as raw function-calling schemas (less ceremony, locks us in)? MCP is the right long-term bet but the cost of doing it on day 1 isn't well-quantified.
Topics where public information was thin enough to be uncomfortable:
- Swahili-language LLM benchmarking. Almost every "Claude vs GPT for X" benchmark is English. Anecdotal reports favor Claude for Swahili but I could not find a rigorous public comparison. Recommendation: before A1 freezes the orchestration layer, run a small in-house bake-off on 50 real Swahili WhatsApp utterances, judged by a Swahili-fluent reviewer.
- Multi-tenant LangGraph checkpointing. I could not confirm whether the Postgres checkpointer supports SET search_path per thread_id, which is load-bearing for our schema-per-tenant model. Recommendation: read LangGraph's PostgresSaver source before committing to it, or assume we'll fork/replace it.
- WhatsApp Cloud API + LLM streaming. WhatsApp doesn't natively support streaming partial responses (it's a request/response message system, not a stream). I could not find a definitive 2026 pattern for "LLM is generating, show typing indicator, send full message when done"; most LiveKit/Pipecat literature assumes voice or web. Recommendation: scope Phase 1 to non-streamed WhatsApp messages with explicit "typing…" indicators.
- M-Pesa STK push from inside an agent loop. Daraja API docs are well-covered, but I found no published pattern for "agent decides to charge → STK push → user is in the middle of a WhatsApp conversation → confirmation arrives async via callback → agent resumes the same conversation thread". This is a payments-orchestration question that probably becomes its own ADR.
Sources cited (in order of appearance)
- Anthropic — Building Effective Agents
- Anthropic — Effective harnesses for long-running agents
- Anthropic — How we built our multi-agent research system
- Anthropic — Effective context engineering for AI agents
- Anthropic — Demystifying evals for AI agents
- Model Context Protocol and GitHub org
- ZenML LLMOps DB — Claude Code single-threaded master loop
- OpenAI — Agents SDK docs
- OpenAI — Swarm (legacy)
- OpenAI — Responses API / Structured Outputs
- OpenAI Community — Assistants API sunset Aug 26, 2026
- Ragwalla — Assistants API → Responses migration guide
- LangChain — LangGraph interrupts
- LangChain — Making it easier to build HITL agents with interrupt
- LangChain — State of agent engineering
- GitHub — LangGraph JS interrupt-resume bug
- Vercel — AI SDK docs and streamUI
- CrewAI — Site, Repo, Docs
- Microsoft — AutoGen v0.4 reimagined and Research blog
- arXiv — Magentic-One
- Letta — Agent memory primer, MemGPT concepts, Letta v1 agent loop, Letta Code
- Vectorize — Mem0 vs Letta
- LiveKit — Sequential pipeline architecture, Adaptive interruption handling docs and blog
- Hamming — Best voice agent stack
- ElevenLabs — Conversational AI, Voice agents 2026 trends
- arXiv — ReAct (2210.03629), Reflexion (2303.11366), Voyager (2305.16291), SWE-Agent (2405.15793)
- Cognition — Devin 2025 performance review
- NVIDIA Research — SLM agents position paper and arXiv 2506.02153
- Red Hat — Eval-driven development
- Firecrawl — Best LLM observability tools 2026
- Langfuse — Phoenix/Arize comparison
- Maxim — Top 5 prompt versioning tools 2026
- Survey — Best multi-agent frameworks 2026