Voice conversation

What it does

Phone calls are a first-class channel in Ratiba. A customer can call the business's dedicated DID, describe what they want in Swahili or English, and hang up with a confirmed booking — no app, no typing, no screen literacy required.

Voice differs from text on one critical axis: it is real-time and simultaneous. A text channel delivers one discrete message at a time; a phone call is a continuous audio stream in both directions at once. The agent is speaking TTS while the customer is already forming their next sentence. That simultaneity is what the full-duplex primitives manage.

The thesis — inherited from ADR-0009 and the channel-substrate design — is that the FSM stays authoritative and the voice agent is a brain-less proxy. The agent receives a finalised customer utterance, calls dispatch_inbound_message(channel=VOICE), and speaks the returned reply. It does not interpret intent or hold state. Every booking, cancellation, and reschedule transition lives in the same LangGraph booking graph that WhatsApp and the web widget use.

A VoiceStreamEvent seam (app/voice/brain_stream.py) is in place today: the FSM result is adapted into an AsyncIterator[VoiceStreamEvent] that the voice layer already consumes. When a future agentic backend lands — streaming intent classification, mid-sentence generation — it implements the same AsyncIterator[VoiceStreamEvent] contract and the voice layer is untouched.
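
As a rough illustration of that seam, the sketch below adapts a single batch reply into an async stream of events. The event field names (`kind`, `text`) and the function name `stream_from_batch_reply` are assumptions made for this example; they are not taken from app/voice/brain_stream.py.

```python
import asyncio
from typing import AsyncIterator, Literal, TypedDict


class VoiceStreamEvent(TypedDict):
    # Field names are assumptions for illustration only.
    kind: Literal["sentence", "final"]
    text: str


async def stream_from_batch_reply(reply: str) -> AsyncIterator[VoiceStreamEvent]:
    """Adapt one batch reply into the streaming contract (hypothetical helper)."""
    # A batch backend has exactly one chunk to hand to TTS.
    yield {"kind": "sentence", "text": reply}
    # The terminal event tells the consumer that the turn is complete.
    yield {"kind": "final", "text": ""}


async def collect(reply: str) -> list[VoiceStreamEvent]:
    """Drain the stream the way a consumer would, one event at a time."""
    return [event async for event in stream_from_batch_reply(reply)]


events = asyncio.run(collect("Your booking is confirmed."))
```

A streaming backend would yield several `sentence` events before `final`; a consumer written against the iterator would not need to change.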

The stack:

| Component | Technology |
| --- | --- |
| Telephony / rooms | LiveKit Agents 1.5.15, SIP bridge (signal :7890, RTC :52000–52050) |
| Speech-to-text | Deepgram Nova-3 (streaming, Swahili + English language ID) |
| Text-to-speech | ElevenLabs Multilingual v2 |
| Voice activity detection | Silero VAD (LiveKit native plugin layer) |

For the channel abstraction — Tier-1 vs Tier-2, capability flags, cross-channel identity — see Channel substrate. Voice is a Tier-1 channel: the caller's E.164 phone is known from SIP metadata the instant the leg picks up; no COLLECT_PHONE gate is needed.

For the handoff model — voice Phase 1 uses Pattern 3 (WhatsApp follow-up after the call, not an in-call handoff) — see Admin orchestrator.

For the 90-second STK hard cap on voice calls, see Payments.


How it fits

The per-call handler in app/voice/agent.py wires every piece at session build time. DID-to-tenant resolution runs on the shared asyncpg pool (public schema) before the TenantContext ContextVar is set, so the FSM's schema-aware pool picks up the right tenant schema for the rest of the call.
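
The ordering matters: the tenant must be resolved before any schema-aware query runs. A minimal sketch of the pattern using the standard library `contextvars` module follows. The in-memory `DID_TO_TENANT` table, the phone number, and the function names are hypothetical stand-ins for the real lookup on the shared pool.

```python
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for the public-schema DID table.
DID_TO_TENANT = {"+254700000001": "tenant_salon"}

# Holds the tenant schema for the current call; unset until the DID is resolved.
tenant_schema: ContextVar[Optional[str]] = ContextVar("tenant_schema", default=None)


def bind_tenant_for_call(dialed_did: str) -> str:
    """Resolve the dialed DID to a tenant schema and bind it for this call."""
    schema = DID_TO_TENANT.get(dialed_did)
    if schema is None:
        raise LookupError(f"no tenant for DID {dialed_did}")
    tenant_schema.set(schema)
    return schema


def current_schema() -> str:
    """What a schema-aware pool would read before issuing a query."""
    schema = tenant_schema.get()
    if schema is None:
        raise RuntimeError("tenant context not set")
    return schema
```

Because a ContextVar is scoped to the running task, concurrent calls handled in separate tasks each see their own tenant.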


Full-duplex turn-taking

In a half-duplex channel the protocol is simple: the customer sends a message and waits, the agent replies, and the cycle repeats. Voice cannot do this: TTS is streaming audio, the customer hears it in real time, and they may start speaking before the agent finishes. Full-duplex means both sides can be "in flight" simultaneously.

End-of-turn detection (endpointing)

The agent must know when the customer has finished their thought before dispatching to the FSM. Too early and a mid-sentence pause triggers a premature round-trip. Too late and the conversation feels sluggish.

app/voice/eot.py ships an EndOfTurnDetector that uses a composite signal — both conditions must be true before the turn is declared complete:

  1. Deepgram is_final=True has fired at least once for this utterance. Deepgram emits is_final at sentence boundaries, not merely at pauses. This prevents the detector from firing on a filler-word pause in the middle of a longer thought.
  2. Silence for at least silence_threshold_ms has elapsed since the last Deepgram event. This prevents firing on a Deepgram sentence boundary that the customer would naturally continue in the next breath.

The effective silence threshold is per-tenant — public.tenants.voice_eot_silence_ms — defaulting to 700ms when NULL. The project default is tuned for conversational booking dialogue; verticals with longer-pause patterns (dental intake vs. salon walk-in) can widen it per-tenant without a code change.
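
The composite rule can be sketched as a small state holder. This is an illustration of the two conditions described above, not the code in app/voice/eot.py; the class name, method names, and the use of integer millisecond timestamps are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_SILENCE_MS = 700  # project default named in the text


@dataclass
class EndOfTurnSketch:
    silence_threshold_ms: int = DEFAULT_SILENCE_MS
    saw_final: bool = False
    last_event_ms: Optional[int] = None

    def on_transcript(self, now_ms: int, is_final: bool) -> None:
        """Record a transcript event (interim or final) for this utterance."""
        self.last_event_ms = now_ms
        # Condition 1 latches: is_final must have fired at least once.
        self.saw_final = self.saw_final or is_final

    def turn_complete(self, now_ms: int) -> bool:
        """True only when is_final has fired AND the silence window has elapsed."""
        if not self.saw_final or self.last_event_ms is None:
            return False
        return now_ms - self.last_event_ms >= self.silence_threshold_ms

    def reset(self) -> None:
        """Clear state when the next customer utterance starts."""
        self.saw_final = False
        self.last_event_ms = None
```

A long pause with no is_final never completes the turn, and an is_final followed by fresh speech pushes the silence window forward because `last_event_ms` is updated on every event.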

Min and max endpointing delays (voice_endpointing_min_delay_ms, voice_endpointing_max_delay_ms) are resolved at session build time by resolve_voice_config in app/voice/config.py with the same NULL-fallback pattern. See Configuration for env var defaults and the full VoiceConfig flag reference.
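
The NULL-fallback pattern is small enough to show in full. The helper names below are hypothetical, and the 500 ms project default used in the example row is an invented value; only the 700 ms silence default comes from the text above.

```python
from typing import Mapping, Optional


def resolve_ms(tenant_row: Mapping[str, Optional[int]], column: str, default_ms: int) -> int:
    """Return the tenant's value, or the project default when the column is NULL."""
    value = tenant_row.get(column)
    # Compare against None explicitly so that a stored 0 is kept, not replaced.
    return default_ms if value is None else value


def ms_to_seconds(ms: int) -> float:
    """Columns are stored in milliseconds; the config fields are in seconds."""
    return ms / 1000.0


# Example tenant row: one column NULL, one column set.
row = {"voice_eot_silence_ms": None, "voice_endpointing_min_delay_ms": 400}
silence_ms = resolve_ms(row, "voice_eot_silence_ms", 700)
min_delay_s = ms_to_seconds(resolve_ms(row, "voice_endpointing_min_delay_ms", 500))
```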

Preemptive generation

The voice agent starts composing the FSM reply as soon as end-of-turn fires — it does not wait for a TTS play-through to finish before entering the next turn. The VoiceStreamEvent seam means sentence chunks can be handed to TTS incrementally, reducing first-word latency. Today the FSM returns a single batch reply (one sentence event followed by final); the seam is already exercised so a future streaming-agentic backend requires zero changes to the voice layer.

FSM mutex

The ADR-0003 SETNX mutex (30-second TTL, exponential backoff to 10s ceiling) serialises concurrent turns on the same conversation thread. If two audio events race — unlikely on voice but possible if a barge-in fires simultaneously with an EOT tick — the mutex guarantees the FSM processes exactly one at a time. A bilingual failure response is spoken if the lock cannot be acquired.
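
A sketch of the retry loop around such a lock is shown below. The `try_set` callable stands in for the atomic set-if-absent call with a 30-second expiry; the 0.25 s base delay and the `max_wait_s` budget are assumptions for illustration, and only the 10 s ceiling comes from the text.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base_s: float = 0.25, ceiling_s: float = 10.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at the ceiling."""
    delay = base_s
    while True:
        yield min(delay, ceiling_s)
        delay *= 2


def acquire_lock(
    try_set: Callable[[], bool],
    max_wait_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry try_set with backoff until it succeeds or the wait budget is spent."""
    waited = 0.0
    for delay in backoff_delays():
        if try_set():
            return True
        if waited + delay > max_wait_s:
            # Caller should speak the failure response instead of dispatching.
            return False
        sleep(delay)
        waited += delay
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.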

Per-tenant feature flags

The VoiceConfig dataclass (app/voice/config.py) is resolved once at call connect and passed immutably through the session. Eight fields (six booleans and two floats) govern the full-duplex primitives:

| Field | Controls |
| --- | --- |
| full_duplex | Master switch: full-duplex primitives on/off |
| endpointing_min_delay_s | EOT minimum silence window (seconds) |
| endpointing_max_delay_s | EOT maximum silence window (seconds) |
| backchannel_filter | Suppress barge-in on backchannel filler words |
| hard_interrupt | Bilingual hard-interrupt stop-pattern detection |
| streaming | VoiceStreamEvent streaming seam enabled |
| listening_ack | Mid-utterance listening acknowledgements |
| adaptive_speed | WPM-adaptive TTS speed multiplier |

Project-level kill-switches override per-tenant flags: VOICE_LISTENING_ACK_ENABLED and VOICE_ADAPTIVE_SPEED_ENABLED (both default on; set to "0" or "false" to disable globally). Full reference: Configuration.
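
To make the precedence concrete, here is a sketch of an immutable config plus the kill-switch check. The field names follow the table above; the class name, helper names, and the case-insensitive treatment of "false" are assumptions, not the contents of app/voice/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)  # frozen: the config cannot be mutated mid-call
class VoiceConfigSketch:
    full_duplex: bool
    endpointing_min_delay_s: float
    endpointing_max_delay_s: float
    backchannel_filter: bool
    hard_interrupt: bool
    streaming: bool
    listening_ack: bool
    adaptive_speed: bool


def kill_switch_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Default on; "0" or "false" disables the feature globally."""
    # Case-insensitive matching is an assumption of this sketch.
    return env.get(name, "1").strip().lower() not in {"0", "false"}


def effective_flag(tenant_flag: bool, switch: str, env: Mapping[str, str] = os.environ) -> bool:
    """A feature runs only if the tenant enables it and the project has not killed it."""
    return tenant_flag and kill_switch_enabled(switch, env)
```

For example, `effective_flag(cfg.listening_ack, "VOICE_LISTENING_ACK_ENABLED")` stays False for every tenant once the env var is set to "0".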


Barge-in & backchannel

Barge-in

When the customer starts speaking while the agent is delivering TTS, the LiveKit AgentSession sets handle.interrupted = True on the active SpeechHandle. The agent reads this via was_interrupted in app/voice/barge_in.py and emits one barge_in telemetry event per turn (deduplicated by should_emit_barge_in — a streamed reply is many say() calls, but only one telemetry event fires per turn).
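
The once-per-turn deduplication can be sketched with a single flag on per-turn state. The names `TurnTelemetry` and `should_emit_once` are hypothetical; the real should_emit_barge_in may track this differently.

```python
from dataclasses import dataclass


@dataclass
class TurnTelemetry:
    # Reset (by creating a new instance) at the start of every agent turn.
    barge_in_emitted: bool = False


def should_emit_once(handle_interrupted: bool, turn: TurnTelemetry) -> bool:
    """Return True for the first interrupted say() of a turn, False afterwards."""
    if not handle_interrupted or turn.barge_in_emitted:
        return False
    turn.barge_in_emitted = True
    return True


# One streamed reply made of four say() calls; the customer barges in on the second.
turn = TurnTelemetry()
results = [should_emit_once(flag, turn) for flag in (False, True, True, True)]
```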

safe_say in the same module wraps every session.say() call to tolerate the closing-session race: the filler clock fires on a timer and if the caller hangs up or the turn ends first, session.say() raises RuntimeError("AgentSession is closing, cannot use say()"). That error is benign — the call is already over — so safe_say swallows it with a DEBUG log instead of letting an ERROR and traceback reach the ops log.
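
A minimal sketch of that wrapper follows, exercised against a fake session so it runs without LiveKit. Matching on the word "closing" and re-raising every other RuntimeError are assumptions of this sketch; the real safe_say may filter differently.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("voice.sketch")


def safe_say_sketch(session: Any, text: str, **kwargs: Any) -> Optional[Any]:
    """Call session.say(), tolerating only the closing-session race."""
    try:
        return session.say(text, **kwargs)
    except RuntimeError as exc:
        if "closing" in str(exc):
            # Benign: the call is already over, so log quietly and move on.
            logger.debug("say() skipped, session closing: %s", exc)
            return None
        raise  # any other RuntimeError is a real bug and must surface


class ClosingSession:
    """Fake session that behaves like one in the middle of shutdown."""

    def say(self, text: str, **kwargs: Any) -> None:
        raise RuntimeError("AgentSession is closing, cannot use say()")


class BrokenSession:
    """Fake session that fails for an unrelated reason."""

    def say(self, text: str, **kwargs: Any) -> None:
        raise RuntimeError("unexpected failure")
```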

Bilingual hard-interrupt

Normal barge-in depends on the customer speaking over the agent. The hard-interrupt provides a stronger override for when the customer is frustrated and wants TTS to stop immediately.

app/voice/hard_interrupt.py scans interim transcripts (not just final ones) for the pattern: three or more stop-tokens in one transcript event. Tokens: English stop, Swahili acha / wacha. Single or double occurrences do not fire — the 3× threshold avoids false positives on natural phrasings like "stop talking to me about that". When the pattern matches, session.interrupt() cancels active TTS within approximately 100ms, independent of the FSM dispatcher round-trip.
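
The 3× rule can be approximated with a word-boundary regex and a count. This sketch counts stop-tokens anywhere in a single transcript event; whether the real is_stop_pattern requires them to be consecutive is not stated above, so treat that detail as an assumption.

```python
import re

# Word boundaries keep "wacha" from also being counted as "acha".
STOP_TOKEN = re.compile(r"\b(?:stop|acha|wacha)\b", re.IGNORECASE)
MIN_REPEATS = 3


def is_stop_pattern_sketch(transcript: str) -> bool:
    """True when one transcript event contains three or more stop-tokens."""
    return len(STOP_TOKEN.findall(transcript)) >= MIN_REPEATS
```

Mixed-language bursts such as "Stop, acha, wacha!" count toward the same threshold, while a single "stop" inside an ordinary sentence does not.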

Backchannel filtering

Customers naturally emit short acknowledgement sounds while listening — "mm", "yeah", "ndio", "sawa" — without intending to interrupt. If those sounds were treated as barge-in events they would cancel TTS mid-sentence on every natural acknowledgement, breaking the flow.

app/voice/backchannel.py exposes is_backchannel(text, language) which returns True if every token in the interim transcript is a recognised backchannel word for that language. The agent checks this before treating an over-talk event as a real interruption.

Bilingual vocabulary:

| Language | Tokens |
| --- | --- |
| Swahili | ndio, ndiyo, sawa, sawasawa, haya, mm, mh, eeh, aha |
| English | yeah, okay, ok, mhm, uh-huh, right, mm, uh, yep |

Unknown languages fall back to the English set. zol-rag (the sister project this was ported from) reports 30–50% fewer false interrupts with this filter in place.
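
A sketch of the every-token check using the vocabulary above is shown below. The language codes "sw" and "en", the tokenisation regex, and the choice to return False for an empty transcript are assumptions of this example.

```python
import re

BACKCHANNEL = {
    "sw": {"ndio", "ndiyo", "sawa", "sawasawa", "haya", "mm", "mh", "eeh", "aha"},
    "en": {"yeah", "okay", "ok", "mhm", "uh-huh", "right", "mm", "uh", "yep"},
}

# Lowercase words, keeping internal hyphens so "uh-huh" stays one token.
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def is_backchannel_sketch(text: str, language: str) -> bool:
    """True if every token is a backchannel word for the given language."""
    vocab = BACKCHANNEL.get(language, BACKCHANNEL["en"])  # unknown -> English set
    tokens = TOKEN.findall(text.lower())
    # An empty transcript is not treated as a backchannel (assumption).
    return bool(tokens) and all(token in vocab for token in tokens)
```

The all-tokens requirement is what lets "yeah but not Tuesday" through as a real interruption: one non-filler word is enough to fail the check.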

Listening acknowledgements

When the customer is delivering a long utterance — describing symptoms, listing multiple services, giving a complex schedule — there is a gap between the first words and the final end-of-turn signal. In that gap, silence from the agent sounds like a disconnected call.

app/voice/listening_ack.py fires a brief bilingual affirmation while the customer is still speaking, via session.say() with allow_interruptions=True (the ack must never block the next customer word).

Five gates are checked in order before an ack fires:

| Gate | Condition to pass |
| --- | --- |
| 1: distress override | distress_active is False (always False in Ratiba; gate kept for zol-rag signature compatibility) |
| 2: word floor | word_count >= 12 (utterance long enough to warrant an ack) |
| 3: stability floor | stable_ms >= 1200 (partial has not changed for 1.2s; still actively forming) |
| 4: per-turn cap | fires_this_turn < 3 (at most 3 acks per customer turn) |
| 5: throttle | At least 10s since last ack (cross-turn throttle) |
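
The gate sequence reads naturally as a chain of early returns. The thresholds below come from the table; the function signature, the state class, and `record_ack` are hypothetical and may not match should_fire_ack or AckState in app/voice/listening_ack.py.

```python
from dataclasses import dataclass
from typing import Optional

MIN_WORDS = 12
MIN_STABLE_MS = 1200
MAX_ACKS_PER_TURN = 3
MIN_GAP_MS = 10_000


@dataclass
class AckStateSketch:
    fires_this_turn: int = 0            # reset when a new customer turn starts
    last_ack_ms: Optional[int] = None   # survives across turns (cross-turn throttle)


def should_fire_ack_sketch(
    state: AckStateSketch,
    word_count: int,
    stable_ms: int,
    now_ms: int,
    distress_active: bool = False,
) -> bool:
    """Check the five gates in order; any failure blocks the ack."""
    if distress_active:                                   # gate 1
        return False
    if word_count < MIN_WORDS:                            # gate 2
        return False
    if stable_ms < MIN_STABLE_MS:                         # gate 3
        return False
    if state.fires_this_turn >= MAX_ACKS_PER_TURN:        # gate 4
        return False
    if state.last_ack_ms is not None and now_ms - state.last_ack_ms < MIN_GAP_MS:
        return False                                      # gate 5
    return True


def record_ack(state: AckStateSketch, now_ms: int) -> None:
    """Update counters after an ack has actually been spoken."""
    state.fires_this_turn += 1
    state.last_ack_ms = now_ms
```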

Ack phrases are randomised from a short list per language:

| Language | Phrases |
| --- | --- |
| Swahili | Naskia. · Endelea. · Niko nawe. · Ninafuatilia. |
| English | I'm listening. · Please go on. · I'm with you. · I'm following. |

Fallback language is English. The feature can be toggled per-tenant via VoiceConfig.listening_ack and globally via the VOICE_LISTENING_ACK_ENABLED env kill-switch.

WPM-adaptive TTS speed

Customers speak at different paces. A slow speaker is more likely to be confused by fast TTS; matching their pace lowers cognitive load. A fast speaker is less likely to need a slower response.

app/voice/voice_speed.py measures the caller's words-per-minute from the first interim transcript event to turn-end, maps the result to a WPM bucket, and applies the bucket's offset to the ElevenLabs speed multiplier through session.tts.update_options:

| WPM bucket | Boundary | Offset | Effective speed |
| --- | --- | --- | --- |
| slow | < 110 WPM | −0.05 | 0.95× |
| normal | 110–180 WPM | 0.00 | 1.00× |
| fast | > 180 WPM | 0.00 | 1.00× (capped at baseline) |

The speed multiplier is clamped to [0.70, 1.00]. Two additional tiers — explicit_slow (0.85×) and distress (0.75×) — are present in the module for zol-rag port fidelity but are not wired in Ratiba yet (distress signal not yet implemented; explicit-slow not yet surfaced as a customer command).
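
The bucket-and-clamp arithmetic for the three wired buckets can be sketched as follows. Function names are hypothetical and the sketch stops at computing the multiplier; it does not call the TTS plugin.

```python
BASELINE_SPEED = 1.00
MIN_SPEED, MAX_SPEED = 0.70, 1.00
OFFSETS = {"slow": -0.05, "normal": 0.0, "fast": 0.0}


def words_per_minute(word_count: int, elapsed_s: float) -> float:
    """WPM from the first interim transcript to turn-end."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return word_count * 60.0 / elapsed_s


def bucket_for(wpm: float) -> str:
    """Map a WPM reading to its bucket; 110 and 180 both count as normal."""
    if wpm < 110:
        return "slow"
    if wpm > 180:
        return "fast"
    return "normal"


def target_speed(wpm: float) -> float:
    """Baseline plus the bucket offset, clamped to the allowed range."""
    speed = BASELINE_SPEED + OFFSETS[bucket_for(wpm)]
    return round(max(MIN_SPEED, min(MAX_SPEED, speed)), 2)
```

A caller who says 30 words in 20 seconds measures 90 WPM, lands in the slow bucket, and gets a 0.95× multiplier; fast callers stay at 1.00× because the upper clamp equals the baseline.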

The feature can be toggled per-tenant via VoiceConfig.adaptive_speed and globally via the VOICE_ADAPTIVE_SPEED_ENABLED env kill-switch.


Where it lives in code

| Module | Responsibility |
| --- | --- |
| app/voice/agent.py | Per-call handler; DID→tenant lookup; TenantContext set; greeting; on_user_turn_completed loop |
| app/voice/config.py | VoiceConfig dataclass + resolve_voice_config: reads 8 tenant columns + project defaults |
| app/voice/eot.py | EndOfTurnDetector: composite is_final + silence endpointing |
| app/voice/barge_in.py | safe_say; was_interrupted; should_emit_barge_in telemetry deduplication |
| app/voice/hard_interrupt.py | is_stop_pattern (bilingual 3× regex); maybe_hard_interrupt session cancel |
| app/voice/backchannel.py | is_backchannel: bilingual filler-word ignore-list |
| app/voice/listening_ack.py | should_fire_ack (5-gate); pick_ack_template; AckState; env kill-switch |
| app/voice/voice_speed.py | WpmTracker; bucket_for_wpm; compute_target_speed; SpeedState; env kill-switch |
| app/voice/brain_stream.py | VoiceStreamEvent TypedDict; stream_from_dispatch_result: batch→streaming adapter |
| app/voice/deepgram.py | Raw Deepgram WebSocket client (used directly by evals and batch paths; not by the per-call handler) |
| app/voice/elevenlabs.py | Raw ElevenLabs client (likewise direct primitives, not the per-call path) |
| app/voice/handoff.py | Voice Phase 1 Pattern 3: WhatsApp follow-up wiring post-call |
| app/voice/payment_listener.py | Postgres LISTEN/NOTIFY hook for STK approval during a voice call |
| app/voice/filler_clock.py | Filler-clock interstitial during FSM dispatch latency |

  • Channel substrate — Tier-1 vs Tier-2 channels, capability flags, identity resolution, ChannelKind enum
  • Conversation FSM — the LangGraph booking graph that voice dispatches into
  • Admin orchestrator — voice Phase 1 Pattern 3 handoff (WhatsApp follow-up after the call)
  • Payments — 90s STK hard cap on voice calls
  • Configuration — VoiceConfig env vars, per-tenant columns, kill-switches
  • Glossary — full-duplex turn-taking, barge-in, backchannel definitions