Voice conversation
What it does
Phone calls are a first-class channel in Ratiba. A customer can call the business's dedicated DID, describe what they want in Swahili or English, and hang up with a confirmed booking — no app, no typing, no screen literacy required.
Voice differs from text on one critical axis: it is real-time and simultaneous. A text channel delivers one discrete message at a time; a phone call is a continuous audio stream in both directions at once. The agent is speaking TTS while the customer is already forming their next sentence. That simultaneity is what the full-duplex primitives manage.
The thesis — inherited from ADR-0009
and the channel-substrate design — is that the FSM stays authoritative and
the voice agent is a brain-less proxy. The agent receives a finalised
customer utterance, calls dispatch_inbound_message(channel=VOICE), and
speaks the returned reply. It does not interpret intent or hold state. Every
booking, cancellation, and reschedule transition lives in the same
LangGraph booking graph that WhatsApp and the web widget use.
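The proxy loop described above can be sketched as follows. This is only an illustration under stated assumptions: `DispatchResult`, its `reply_text` field, and the keyword arguments passed to the dispatcher are hypothetical, because the real signature of dispatch_inbound_message is not shown on this page. The point is what the handler does not do: it never inspects the utterance or keeps state between turns.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class DispatchResult:
    # Hypothetical result shape; the real dispatcher's return type may differ.
    reply_text: str


Dispatch = Callable[..., Awaitable[DispatchResult]]
Say = Callable[[str], Awaitable[None]]


async def handle_final_utterance(text: str, dispatch: Dispatch, say: Say) -> None:
    """Forward the finalised utterance to the FSM and speak whatever it returns."""
    result = await dispatch(text=text, channel="VOICE")
    await say(result.reply_text)


# Stand-ins for the FSM dispatcher and the TTS call.
spoken: List[str] = []


async def fake_dispatch(*, text: str, channel: str) -> DispatchResult:
    return DispatchResult(reply_text=f"[{channel}] booked: {text}")


async def fake_say(text: str) -> None:
    spoken.append(text)


asyncio.run(handle_final_utterance("haircut tomorrow 3pm", fake_dispatch, fake_say))
```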
A VoiceStreamEvent seam (app/voice/brain_stream.py) is in place today:
the FSM result is adapted into an AsyncIterator[VoiceStreamEvent] that the
voice layer already consumes. When a future agentic backend lands — streaming
intent classification, mid-sentence generation — it implements the same
AsyncIterator[VoiceStreamEvent] contract and the voice layer is untouched.
The stack:
| Component | Technology |
|---|---|
| Telephony / rooms | LiveKit Agents 1.5.15, SIP bridge (signal :7890, RTC :52000–52050) |
| Speech-to-text | Deepgram Nova-3 (streaming, Swahili + English language ID) |
| Text-to-speech | ElevenLabs Multilingual v2 |
| Voice activity detection | Silero VAD (LiveKit native plugin layer) |
For the channel abstraction — Tier-1 vs Tier-2, capability flags,
cross-channel identity — see Channel substrate.
Voice is a Tier-1 channel: the caller's E.164 phone is known from SIP
metadata the instant the leg picks up; no COLLECT_PHONE gate is needed.
For the handoff model — voice Phase 1 uses Pattern 3 (WhatsApp follow-up after the call, not an in-call handoff) — see Admin orchestrator.
For the 90-second STK hard cap on voice calls, see Payments.
How it fits
The per-call handler in app/voice/agent.py wires every piece at session
build time. DID-to-tenant resolution runs on the shared asyncpg pool (public
schema) before the TenantContext ContextVar is set, so the FSM's schema-aware
pool picks up the right tenant schema for the rest of the call.
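The ordering matters: the DID lookup has to run while no tenant is set, and everything after it sees the tenant schema. A minimal sketch of that ordering with the standard contextvars module follows. The variable name `tenant_schema`, the in-memory `DID_TABLE`, and the `search_path` string are assumptions for illustration, not the app's actual TenantContext or pool logic.

```python
import contextvars
from typing import Optional

# Hypothetical ContextVar; the real TenantContext may carry more than a schema name.
tenant_schema: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "tenant_schema", default=None
)

# Stand-in for the DID -> tenant lookup that runs against the public schema.
DID_TABLE = {"+254700000001": "tenant_salon", "+254700000002": "tenant_dental"}


def resolve_tenant(did: str) -> str:
    try:
        return DID_TABLE[did]
    except KeyError:
        raise LookupError(f"no tenant for DID {did}") from None


def current_search_path() -> str:
    """What a schema-aware pool might apply to each connection it hands out."""
    schema = tenant_schema.get()
    return f"{schema}, public" if schema else "public"


def start_call(did: str) -> str:
    # The lookup happens first, while only the public schema is visible.
    schema = resolve_tenant(did)
    tenant_schema.set(schema)
    return current_search_path()


# Each call runs in its own context, so one call's tenant never leaks into another.
ctx = contextvars.copy_context()
path = ctx.run(start_call, "+254700000001")
```

Running the handler inside a copied context keeps the outer context untouched, which is the property that makes a ContextVar safe across concurrent calls.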
Full-duplex turn-taking
In a half-duplex channel the protocol is simple: the customer sends a message and waits, then the agent replies. Voice cannot work this way: TTS is streaming audio, the customer hears it in real time, and they may start speaking before the agent finishes. Full-duplex means both sides can be "in flight" simultaneously.
End-of-turn detection (endpointing)
The agent must know when the customer has finished their thought before dispatching to the FSM. Too early and a mid-sentence pause triggers a premature round-trip. Too late and the conversation feels sluggish.
app/voice/eot.py ships an EndOfTurnDetector that uses a composite
signal — both conditions must be true before the turn is declared complete:
- Deepgram is_final=True has fired at least once for this utterance. Deepgram emits is_final at sentence boundaries, not merely at pauses. This prevents the detector from firing on a filler-word pause in the middle of a longer thought.
- Silence for at least silence_threshold_ms has elapsed since the last Deepgram event. This prevents firing on a Deepgram sentence boundary that the customer would naturally continue in the next breath.
The effective silence threshold is per-tenant — public.tenants.voice_eot_silence_ms
— defaulting to 700ms when NULL. The project default is tuned for
conversational booking dialogue; verticals with longer-pause patterns (dental
intake vs. salon walk-in) can widen it per-tenant without a code change.
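The composite rule can be sketched as a small state holder. This is a sketch under assumptions, not the EndOfTurnDetector itself: the class name, method names, and the use of caller-supplied millisecond timestamps are all hypothetical. Only the two conditions and the 700 ms NULL fallback come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_SILENCE_MS = 700  # project default when the tenant column is NULL


@dataclass
class EndOfTurnSketch:
    silence_threshold_ms: int = DEFAULT_SILENCE_MS
    saw_final: bool = False
    last_event_ms: Optional[int] = None

    def on_transcript(self, now_ms: int, is_final: bool) -> None:
        """Record any transcript event (interim or final) and when it arrived."""
        self.last_event_ms = now_ms
        self.saw_final = self.saw_final or is_final

    def is_complete(self, now_ms: int) -> bool:
        """True only when a final was seen AND the silence window has elapsed."""
        if not self.saw_final or self.last_event_ms is None:
            return False
        return now_ms - self.last_event_ms >= self.silence_threshold_ms

    def reset(self) -> None:
        """Call at the start of the next customer turn."""
        self.saw_final = False
        self.last_event_ms = None


def threshold_for(tenant_value: Optional[int]) -> int:
    # NULL in the tenant column falls back to the project default.
    return DEFAULT_SILENCE_MS if tenant_value is None else tenant_value


det = EndOfTurnSketch(silence_threshold_ms=threshold_for(None))
det.on_transcript(now_ms=0, is_final=False)     # interim partial
pause_only = det.is_complete(now_ms=900)        # long pause, but no final yet
det.on_transcript(now_ms=1000, is_final=True)   # sentence boundary
too_soon = det.is_complete(now_ms=1300)         # only 300 ms of silence
done = det.is_complete(now_ms=1700)             # 700 ms of silence after the final
```

Neither signal alone completes the turn: `pause_only` fails the first condition and `too_soon` fails the second.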
Min and max endpointing delays (voice_endpointing_min_delay_ms,
voice_endpointing_max_delay_ms) are resolved at session build time by
resolve_voice_config in app/voice/config.py with the same NULL-fallback
pattern. See Configuration for env var
defaults and the full VoiceConfig flag reference.
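A sketch of the NULL-fallback pattern, including the millisecond-to-second conversion implied by the `_ms` column names and the `_s` VoiceConfig field names. The default values and the helper name `resolve_delay_s` are assumptions chosen for illustration; the real defaults come from the environment variables documented in Configuration.

```python
from typing import Mapping, Optional

# Assumed defaults, for illustration only.
DEFAULT_MIN_DELAY_MS = 500
DEFAULT_MAX_DELAY_MS = 3000


def resolve_delay_s(
    row: Mapping[str, Optional[int]], column: str, default_ms: int
) -> float:
    """Return the tenant value in seconds, or the project default when NULL."""
    value = row.get(column)
    # Compare against None explicitly so that a stored 0 is not treated as NULL.
    ms = default_ms if value is None else value
    return ms / 1000.0


tenant_row = {
    "voice_endpointing_min_delay_ms": None,   # NULL, so the default applies
    "voice_endpointing_max_delay_ms": 4500,   # tenant override
}
min_s = resolve_delay_s(tenant_row, "voice_endpointing_min_delay_ms", DEFAULT_MIN_DELAY_MS)
max_s = resolve_delay_s(tenant_row, "voice_endpointing_max_delay_ms", DEFAULT_MAX_DELAY_MS)
```

Using `is None` rather than `value or default_ms` is deliberate: a tenant that sets a column to 0 should get 0, not the default.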
Preemptive generation
The voice agent starts composing the FSM reply as soon as end-of-turn fires —
it does not wait for a TTS play-through to finish before entering the next
turn. The VoiceStreamEvent seam means sentence chunks can be handed to TTS
incrementally, reducing first-word latency. Today the FSM returns a single
batch reply (one sentence event followed by final); the seam is already
exercised so a future streaming-agentic backend requires zero changes to the
voice layer.
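The batch-to-streaming adapter can be sketched as an async generator that yields one sentence event followed by a final event. The event field names (`type`, `text`) and the names `VoiceStreamEventSketch` and `stream_from_batch` are assumptions; the actual TypedDict in app/voice/brain_stream.py may be shaped differently.

```python
import asyncio
from typing import AsyncIterator, List, TypedDict


class VoiceStreamEventSketch(TypedDict):
    type: str  # "sentence" or "final" (assumed field names)
    text: str


async def stream_from_batch(reply_text: str) -> AsyncIterator[VoiceStreamEventSketch]:
    """Adapt a single batch reply into the streaming contract."""
    yield {"type": "sentence", "text": reply_text}
    yield {"type": "final", "text": ""}


async def speak(events: AsyncIterator[VoiceStreamEventSketch]) -> List[str]:
    """Consumer side: hand each sentence to TTS as soon as it arrives."""
    chunks: List[str] = []
    async for event in events:
        if event["type"] == "sentence":
            chunks.append(event["text"])  # a real consumer would call TTS here
    return chunks


chunks = asyncio.run(speak(stream_from_batch("Your booking is confirmed for 3pm.")))
```

Because `speak` depends only on the iterator contract, a backend that yields many sentence events before the final one would work with the same consumer.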
FSM mutex
The ADR-0003 SETNX mutex (30-second TTL, exponential backoff to a 10s ceiling) serialises concurrent turns on the same conversation thread. If two audio events race (unlikely on voice, but possible if a barge-in fires simultaneously with an EOT tick), the mutex guarantees the FSM processes them one at a time. A bilingual failure response is spoken if the lock cannot be acquired.
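A sketch of the acquire loop, using an in-memory stand-in for the key store so that it runs without external services. The 30-second TTL and the 10-second backoff ceiling come from the text. The base delay, the attempt limit, the key format, and every name here are assumptions rather than the ADR-0003 implementation.

```python
import time
from typing import Callable, Dict, List, Tuple


class FakeStore:
    """In-memory stand-in for a key store with SET-if-absent plus TTL."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._keys: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expiry)

    def set_nx(self, key: str, owner: str, ttl_s: float) -> bool:
        now = self._clock()
        held = self._keys.get(key)
        if held is not None and held[1] > now:
            return False  # another owner holds an unexpired lock
        self._keys[key] = (owner, now + ttl_s)
        return True

    def release(self, key: str, owner: str) -> None:
        held = self._keys.get(key)
        if held is not None and held[0] == owner:  # only the owner may release
            del self._keys[key]


def acquire(store: FakeStore, key: str, owner: str, *, ttl_s: float = 30.0,
            base_s: float = 0.1, ceiling_s: float = 10.0, max_attempts: int = 8,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to take the lock, doubling the wait after each failure up to the ceiling."""
    delay = base_s
    for _ in range(max_attempts):
        if store.set_nx(key, owner, ttl_s):
            return True
        sleep(delay)
        delay = min(delay * 2, ceiling_s)
    return False  # the caller would speak the failure response here


# Drive the sketch with a fake clock so no real time passes.
now = [0.0]
waits: List[float] = []


def fake_sleep(seconds: float) -> None:
    waits.append(seconds)
    now[0] += seconds


store = FakeStore(clock=lambda: now[0])
first = acquire(store, "thread:42", "turn-a", sleep=fake_sleep)
second = acquire(store, "thread:42", "turn-b", max_attempts=3, sleep=fake_sleep)
store.release("thread:42", "turn-a")
third = acquire(store, "thread:42", "turn-b", sleep=fake_sleep)
```

The TTL bounds how long a crashed holder can block the thread, and the owner check on release keeps a late turn from freeing a lock it no longer holds.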
Per-tenant feature flags
The VoiceConfig dataclass (app/voice/config.py) is resolved once at call
connect and passed immutably through the session. Eight fields, six booleans
and two floats, govern every full-duplex primitive:
| Field | Controls |
|---|---|
| full_duplex | Master switch: full-duplex primitives on/off |
| endpointing_min_delay_s | EOT minimum silence window (seconds) |
| endpointing_max_delay_s | EOT maximum silence window (seconds) |
| backchannel_filter | Suppress barge-in on backchannel filler words |
| hard_interrupt | Bilingual hard-interrupt stop-pattern detection |
| streaming | VoiceStreamEvent streaming seam enabled |
| listening_ack | Mid-utterance listening acknowledgements |
| adaptive_speed | WPM-adaptive TTS speed multiplier |
Project-level kill-switches override per-tenant flags:
VOICE_LISTENING_ACK_ENABLED and VOICE_ADAPTIVE_SPEED_ENABLED (both
default on; set to "0" or "false" to disable globally). Full reference:
Configuration.
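A sketch of how an immutable config and a project-level kill-switch might combine. The field names come from the table above and the env var names from the text. The frozen-dataclass choice, the helper names, and the case-insensitive, whitespace-tolerant parsing are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoiceConfigSketch:
    full_duplex: bool
    endpointing_min_delay_s: float
    endpointing_max_delay_s: float
    backchannel_filter: bool
    hard_interrupt: bool
    streaming: bool
    listening_ack: bool
    adaptive_speed: bool


def kill_switch_on(env: Mapping[str, str], name: str) -> bool:
    """Default on; the values "0" and "false" disable the feature globally."""
    return env.get(name, "1").strip().lower() not in {"0", "false"}


def effective(tenant_flag: bool, env: Mapping[str, str], name: str) -> bool:
    # The project-level switch can only turn a feature off, never force it on.
    return tenant_flag and kill_switch_on(env, name)


cfg = VoiceConfigSketch(True, 0.5, 3.0, True, True, True, True, True)
env = {"VOICE_LISTENING_ACK_ENABLED": "false"}  # pass os.environ in real use

ack_on = effective(cfg.listening_ack, env, "VOICE_LISTENING_ACK_ENABLED")
speed_on = effective(cfg.adaptive_speed, env, "VOICE_ADAPTIVE_SPEED_ENABLED")

# A frozen dataclass rejects mutation after the session is built.
try:
    cfg.streaming = False  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```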
Barge-in & backchannel
Barge-in
When the customer starts speaking while the agent is delivering TTS, the
LiveKit AgentSession sets handle.interrupted = True on the active
SpeechHandle. The agent reads this via was_interrupted in
app/voice/barge_in.py and emits one barge_in telemetry event per turn
(deduplicated by should_emit_barge_in — a streamed reply is many say()
calls, but only one telemetry event fires per turn).
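The one-event-per-turn rule can be sketched with a small set of already-reported turn ids. The function names match those mentioned above, but their signatures, the `BargeInState` holder, and the integer turn id are assumptions. `FakeHandle` stands in for the LiveKit SpeechHandle.

```python
from dataclasses import dataclass, field
from typing import Set


class FakeHandle:
    """Stand-in for a speech handle exposing an `interrupted` attribute."""

    def __init__(self, interrupted: bool) -> None:
        self.interrupted = interrupted


def was_interrupted(handle: object) -> bool:
    # Tolerate handles that lack the attribute entirely.
    return bool(getattr(handle, "interrupted", False))


@dataclass
class BargeInState:
    emitted_turns: Set[int] = field(default_factory=set)


def should_emit_barge_in(state: BargeInState, turn_id: int, handle: object) -> bool:
    """True at most once per turn, however many say() calls were interrupted."""
    if not was_interrupted(handle) or turn_id in state.emitted_turns:
        return False
    state.emitted_turns.add(turn_id)
    return True


state = BargeInState()
# A streamed reply is several say() calls; all three handles were interrupted.
turn_1 = [should_emit_barge_in(state, 1, FakeHandle(True)) for _ in range(3)]
turn_2_clean = should_emit_barge_in(state, 2, FakeHandle(False))
turn_2_barge = should_emit_barge_in(state, 2, FakeHandle(True))
```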
safe_say in the same module wraps every session.say() call to tolerate
the closing-session race: the filler clock fires on a timer and if the
caller hangs up or the turn ends first, session.say() raises
RuntimeError("AgentSession is closing, cannot use say()"). That error is
benign — the call is already over — so safe_say swallows it with a DEBUG
log instead of letting an ERROR and traceback reach the ops log.
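A sketch of the swallow-and-log wrapper. The error message is the one quoted above. The synchronous call shape, the substring match on "closing", and the boolean return value are assumptions, and the session classes are stand-ins rather than the LiveKit AgentSession.

```python
import logging
from typing import List

logger = logging.getLogger("voice.sketch")


class OpenSession:
    """Stand-in for a healthy session."""

    def __init__(self) -> None:
        self.spoken: List[str] = []

    def say(self, text: str) -> None:
        self.spoken.append(text)


class ClosingSession:
    """Stand-in for a session that is already shutting down."""

    def say(self, text: str) -> None:
        raise RuntimeError("AgentSession is closing, cannot use say()")


def safe_say(session, text: str) -> bool:
    """Speak if possible; treat the closing-session race as benign."""
    try:
        session.say(text)
        return True
    except RuntimeError as exc:
        if "closing" not in str(exc):
            raise  # any other RuntimeError is a real fault and must surface
        logger.debug("say() skipped, session is closing: %s", exc)
        return False


live = OpenSession()
said = safe_say(live, "One moment, please.")
skipped = safe_say(ClosingSession(), "One moment, please.")
```

Re-raising anything that is not the closing-session error keeps the wrapper from hiding real failures behind a DEBUG line.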
Bilingual hard-interrupt
Normal barge-in depends on the customer speaking over the agent. The hard-interrupt provides a stronger override for when the customer is frustrated and wants TTS to stop immediately.
app/voice/hard_interrupt.py scans interim transcripts (not just final
ones) for the pattern: three or more stop-tokens in one transcript event.
Tokens: English stop, Swahili acha / wacha. Single or double
occurrences do not fire — the 3× threshold avoids false positives on
natural phrasings like "stop talking to me about that". When the pattern
matches, session.interrupt() cancels active TTS within approximately 100ms,
independent of the FSM dispatcher round-trip.
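The 3× rule can be sketched as a token count over a single transcript event. The tokens and the threshold come from the text. The exact regex, the case-insensitive match, and counting English and Swahili tokens toward the same total are assumptions about the real is_stop_pattern.

```python
import re

# Word boundaries keep "acha" from matching inside "wacha" or "stop" inside "unstoppable".
STOP_TOKEN = re.compile(r"\b(?:stop|acha|wacha)\b", re.IGNORECASE)
THRESHOLD = 3


def is_stop_pattern(transcript: str) -> bool:
    """True when one transcript event contains three or more stop-tokens."""
    return len(STOP_TOKEN.findall(transcript)) >= THRESHOLD


fires = is_stop_pattern("Stop. STOP. stop!")
mixed = is_stop_pattern("acha, wacha, stop")
natural = is_stop_pattern("stop talking to me about that")
double = is_stop_pattern("acha acha")
```

Because the check runs on interim transcripts, it can fire before the end-of-turn detector completes the turn.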
Backchannel filtering
Customers naturally emit short acknowledgement sounds while listening — "mm", "yeah", "ndio", "sawa" — without intending to interrupt. If those sounds were treated as barge-in events they would cancel TTS mid-sentence on every natural acknowledgement, breaking the flow.
app/voice/backchannel.py exposes is_backchannel(text, language) which
returns True if every token in the interim transcript is a recognised
backchannel word for that language. The agent checks this before treating an
over-talk event as a real interruption.
Bilingual vocabulary:
| Language | Tokens |
|---|---|
| Swahili | ndio, ndiyo, sawa, sawasawa, haya, mm, mh, eeh, aha |
| English | yeah, okay, ok, mhm, uh-huh, right, mm, uh, yep |
Unknown languages fall back to the English set. zol-rag (the sister project this was ported from) reports 30–50% fewer false interrupts with this filter in place.
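A sketch of the every-token check using the vocabulary from the table above. The language codes ("sw", "en"), the tokeniser regex, and the decision to treat an empty transcript as not a backchannel are assumptions.

```python
import re

BACKCHANNEL = {
    "sw": frozenset({"ndio", "ndiyo", "sawa", "sawasawa", "haya", "mm", "mh", "eeh", "aha"}),
    "en": frozenset({"yeah", "okay", "ok", "mhm", "uh-huh", "right", "mm", "uh", "yep"}),
}

# Letters with optional internal hyphens, so "uh-huh" stays one token.
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def is_backchannel(text: str, language: str) -> bool:
    """True if every token is a recognised backchannel word for the language."""
    vocab = BACKCHANNEL.get(language, BACKCHANNEL["en"])  # unknown -> English set
    tokens = TOKEN.findall(text.lower())
    return bool(tokens) and all(token in vocab for token in tokens)


ack_sw = is_backchannel("Sawa, sawa.", "sw")
ack_en = is_backchannel("uh-huh yeah", "en")
real_interrupt = is_backchannel("yeah but wait", "en")
fallback = is_backchannel("ok", "fr")
```

A single non-backchannel token, as in "yeah but wait", is enough to make the over-talk count as a real interruption.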
Listening acknowledgements
When the customer is delivering a long utterance — describing symptoms, listing multiple services, giving a complex schedule — there is a gap between the first words and the final end-of-turn signal. In that gap, silence from the agent sounds like a disconnected call.
app/voice/listening_ack.py fires a brief bilingual affirmation while the
customer is still speaking, via session.say() with
allow_interruptions=True (the ack must never block the next customer word).
Five gates are checked in order before an ack fires:
| Gate | Condition to pass |
|---|---|
| 1 — distress override | distress_active is False (always False in Ratiba; gate kept for zol-rag signature compatibility) |
| 2 — word floor | word_count >= 12 (utterance long enough to warrant an ack) |
| 3 — stability floor | stable_ms >= 1200 (partial has not changed for 1.2s — still actively forming) |
| 4 — per-turn cap | fires_this_turn < 3 (at most 3 acks per customer turn) |
| 5 — throttle | At least 10s since last ack (cross-turn throttle) |
Ack phrases are randomised from a short list per language:
| Language | Phrases |
|---|---|
| Swahili | Naskia. · Endelea. · Niko nawe. · Ninafuatilia. |
| English | I'm listening. · Please go on. · I'm with you. · I'm following. |
Fallback language is English. The feature can be toggled per-tenant via
VoiceConfig.listening_ack and globally via the
VOICE_LISTENING_ACK_ENABLED env kill-switch.
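The five gates and the phrase pick can be sketched as below. The thresholds and phrases come from the tables above. The function signatures, the `AckState` fields, the `record_ack` helper, and the language codes are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Optional

MIN_WORDS = 12
MIN_STABLE_MS = 1200
MAX_FIRES_PER_TURN = 3
THROTTLE_MS = 10_000

ACKS = {
    "sw": ["Naskia.", "Endelea.", "Niko nawe.", "Ninafuatilia."],
    "en": ["I'm listening.", "Please go on.", "I'm with you.", "I'm following."],
}


@dataclass
class AckState:
    fires_this_turn: int = 0            # reset when a new customer turn starts
    last_ack_ms: Optional[int] = None   # kept across turns for the throttle


def should_fire_ack(state: AckState, *, word_count: int, stable_ms: int,
                    now_ms: int, distress_active: bool = False) -> bool:
    """Check the gates in order; the first failing gate blocks the ack."""
    if distress_active:                                   # gate 1
        return False
    if word_count < MIN_WORDS:                            # gate 2
        return False
    if stable_ms < MIN_STABLE_MS:                         # gate 3
        return False
    if state.fires_this_turn >= MAX_FIRES_PER_TURN:       # gate 4
        return False
    if state.last_ack_ms is not None and now_ms - state.last_ack_ms < THROTTLE_MS:
        return False                                      # gate 5
    return True


def record_ack(state: AckState, now_ms: int) -> None:
    state.fires_this_turn += 1
    state.last_ack_ms = now_ms


def pick_ack(language: str, rng: random.Random) -> str:
    return rng.choice(ACKS.get(language, ACKS["en"]))  # fallback is English


state = AckState()
short = should_fire_ack(state, word_count=5, stable_ms=2000, now_ms=0)
first = should_fire_ack(state, word_count=15, stable_ms=1500, now_ms=5_000)
record_ack(state, 5_000)
throttled = should_fire_ack(state, word_count=30, stable_ms=1500, now_ms=9_000)
later = should_fire_ack(state, word_count=40, stable_ms=1500, now_ms=15_000)
phrase = pick_ack("sw", random.Random(0))
```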
WPM-adaptive TTS speed
Customers speak at different paces. A slow speaker is more likely to be confused by fast TTS; matching their pace lowers cognitive load. A fast speaker is less likely to need a slower response.
app/voice/voice_speed.py measures the caller's words-per-minute from the
first interim transcript event to turn-end, maps the result to a WPM bucket,
and applies a speed offset to the ElevenLabs session.tts.update_options
multiplier:
| WPM bucket | Boundary | Offset | Effective speed |
|---|---|---|---|
| slow | < 110 WPM | −0.05 | 0.95× |
| normal | 110–180 WPM | 0.00 | 1.00× |
| fast | > 180 WPM | 0.00 | 1.00× (capped at baseline) |
The speed multiplier is clamped to [0.70, 1.00]. Two additional tiers —
explicit_slow (0.85×) and distress (0.75×) — are present in the module
for zol-rag port fidelity but are not wired in Ratiba yet (distress signal
not yet implemented; explicit-slow not yet surfaced as a customer command).
The feature can be toggled per-tenant via VoiceConfig.adaptive_speed and
globally via the VOICE_ADAPTIVE_SPEED_ENABLED env kill-switch.
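The bucket table and the clamp can be sketched as three small functions. The boundaries, offsets, and clamp range come from the text above. The function signatures, the baseline of 1.00, and the error raised on a non-positive duration are assumptions, and the unwired explicit_slow and distress tiers are left out.

```python
SLOW_BELOW_WPM = 110
FAST_ABOVE_WPM = 180
OFFSETS = {"slow": -0.05, "normal": 0.0, "fast": 0.0}
MIN_SPEED, MAX_SPEED = 0.70, 1.00


def words_per_minute(word_count: int, first_interim_ms: int, turn_end_ms: int) -> float:
    """Caller pace measured from the first interim transcript to turn end."""
    elapsed_ms = turn_end_ms - first_interim_ms
    if elapsed_ms <= 0:
        raise ValueError("turn must have a positive duration")
    return word_count * 60_000 / elapsed_ms


def bucket_for_wpm(wpm: float) -> str:
    if wpm < SLOW_BELOW_WPM:
        return "slow"
    if wpm > FAST_ABOVE_WPM:
        return "fast"
    return "normal"  # 110 to 180 inclusive


def compute_target_speed(wpm: float, baseline: float = 1.0) -> float:
    """Apply the bucket offset, then clamp to the allowed range."""
    speed = baseline + OFFSETS[bucket_for_wpm(wpm)]
    return max(MIN_SPEED, min(MAX_SPEED, speed))


slow_wpm = words_per_minute(word_count=18, first_interim_ms=0, turn_end_ms=12_000)
slow_speed = compute_target_speed(slow_wpm)
fast_speed = compute_target_speed(220.0)
```

Here 18 words over 12 seconds works out to 90 WPM, which lands in the slow bucket. The result would then be passed to the TTS options update named above.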
Where it lives in code
| Module | Responsibility |
|---|---|
| app/voice/agent.py | Per-call handler; DID→tenant lookup; TenantContext set; greeting; on_user_turn_completed loop |
| app/voice/config.py | VoiceConfig dataclass + resolve_voice_config (reads 8 tenant columns + project defaults) |
| app/voice/eot.py | EndOfTurnDetector: composite is_final + silence endpointing |
| app/voice/barge_in.py | safe_say; was_interrupted; should_emit_barge_in telemetry deduplication |
| app/voice/hard_interrupt.py | is_stop_pattern (bilingual 3× regex); maybe_hard_interrupt session cancel |
| app/voice/backchannel.py | is_backchannel: bilingual filler-word ignore-list |
| app/voice/listening_ack.py | should_fire_ack (5-gate); pick_ack_template; AckState; env kill-switch |
| app/voice/voice_speed.py | WpmTracker; bucket_for_wpm; compute_target_speed; SpeedState; env kill-switch |
| app/voice/brain_stream.py | VoiceStreamEvent TypedDict; stream_from_dispatch_result (batch→streaming adapter) |
| app/voice/deepgram.py | Raw Deepgram WebSocket client (used directly by evals and batch paths; not by the per-call handler) |
| app/voice/elevenlabs.py | Raw ElevenLabs client (likewise direct primitives, not the per-call path) |
| app/voice/handoff.py | Voice Phase 1 Pattern 3: WhatsApp follow-up wiring post-call |
| app/voice/payment_listener.py | Postgres LISTEN/NOTIFY hook for STK approval during a voice call |
| app/voice/filler_clock.py | Filler-clock interstitial during FSM dispatch latency |
Related
- Channel substrate: Tier-1 vs Tier-2 channels, capability flags, identity resolution, ChannelKind enum
- Conversation FSM: the LangGraph booking graph that voice dispatches into
- Admin orchestrator: voice Phase 1 Pattern 3 handoff (WhatsApp follow-up after the call)
- Payments: 90s STK hard cap on voice calls
- Configuration: VoiceConfig env vars, per-tenant columns, kill-switches
- Glossary: full-duplex turn-taking, barge-in, backchannel definitions