Payments
What it does
Payments are the rail Ratiba uses to convert a confirmed slot into a paid
booking. M-Pesa STK push via Daraja is the primary path (per
ADR-0007 D1, cost-discipline mandate:
M-Pesa is never routed through PesaPal — Daraja-direct only).
Cards run via PesaPal as a separate, optional flow with 8min/30min
nudge/abandon scheduling for the longer card-checkout window. Daraja's
stkpushquery fires exactly once at t=60s as a reconciliation poll,
not a long-poll loop. The voice channel imposes a 90s STK hard cap.
A daily 3 AM EAT consolidated reaper sweeps three tables: expired rows
in public.payment_routing, rows aged ≥90 days in
<tenant>.checkpoints_archive, and rows aged ≥90 days in
<tenant>.handoff_log_archive. Late or unroutable callbacks land in
public.payment_callbacks_unrouted (dead-letter table) for manual triage.
A booking thread is held to one in-flight payment at a time — both via
the FSM single-in-flight invariant and via Layer-3 idempotency on
public.payment_routing. Customer-initiated cancellation is a first-class
PAYMENT_CANCELLED_BY_CUSTOMER FSM state with provider-specific
reversal: Daraja relies on STK timeout + auto-reverse; PesaPal actively
voids the order and triggers an auto-refund.
M11 W2 T7+T8 added the atomic ordered-pair reservation primitive
(reserve_pair) — a Redis EVAL Lua script that acquires both a primary
and an adjacent secondary slot in a single round-trip, so the cross-sell
bundle STK can hold two slots without splitting them across concurrent
holders. The single-slot M6 SETNX in _try_reserve_slot is unchanged.
The LLM cost ceiling ($0.05 soft / $0.20 hard per booking) is tracked on
BookingState.total_token_cost_usd and is separate from the payment amount;
see Conversation FSM for how the cost
ceiling signals a handoff trigger.
Prospect story: paying for a booking feels frictionless
From the customer's point of view: she says "book a haircut for tomorrow at 10am", confirms the slot, and within seconds her phone screen lights up with a Safaricom M-Pesa prompt asking her to enter her PIN. She types four digits. The agent replies "Your booking is confirmed — see you tomorrow!". That's it. No browser redirect. No card number. No receipts to print. Payment and booking are one atomic step, entirely within the WhatsApp or phone conversation.
If she is on the web widget and prefers to pay by card, she gets a PesaPal checkout link instead. The agent nudges once at 8 minutes; if she hasn't paid by 30 minutes the slot is released back to the pool.
How it fits in the system
Payment FSM states
The payment FSM runs after the booking FSM reaches BOOKED. It is a separate
PaymentState machine — the booking FSM is already terminal by the time any
payment state exists.
State details:
| State | What happens |
|---|---|
| PAYMENT_PENDING | STK push sent; Redis reservation lock held; payment_routing correlation row alive |
| BOOKED | Daraja ResultCode=0 or PesaPal IPN confirmed; slot locked permanently; reservation key deleted |
| PAYMENT_FAILED | Callback ResultCode != 0 or stkpushquery at t=60s returns a non-zero code; slot released; customer receives a bilingual failure message |
| PAYMENT_CANCELLED_BY_CUSTOMER | Customer sends a cancel intent during PAYMENT_PENDING; Daraja path = let STK timeout + auto-reverse; PesaPal path = active void_order API call + auto-refund; slot released |
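The state table can be read as a small transition map. The sketch below is illustrative only: the state names come from the table, but the event names (provider_confirmed, provider_failed, customer_cancelled) and the next_state helper are assumptions made for this example, not identifiers from the codebase.

```python
from enum import Enum


class PaymentState(str, Enum):
    PAYMENT_PENDING = "PAYMENT_PENDING"
    BOOKED = "BOOKED"
    PAYMENT_FAILED = "PAYMENT_FAILED"
    PAYMENT_CANCELLED_BY_CUSTOMER = "PAYMENT_CANCELLED_BY_CUSTOMER"


# Hypothetical event names; the table describes the triggers only in prose.
TRANSITIONS = {
    (PaymentState.PAYMENT_PENDING, "provider_confirmed"): PaymentState.BOOKED,
    (PaymentState.PAYMENT_PENDING, "provider_failed"): PaymentState.PAYMENT_FAILED,
    (PaymentState.PAYMENT_PENDING, "customer_cancelled"):
        PaymentState.PAYMENT_CANCELLED_BY_CUSTOMER,
}


def next_state(current: PaymentState, event: str) -> PaymentState:
    """Return the state reached from `current` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"no transition from {current.value} on {event!r}") from None
```

Every transition in this sketch leaves PAYMENT_PENDING, so any event received in one of the other three states is rejected.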
Daraja STK push: end-to-end flow
Why one-shot stkpushquery at t=60s?
Safaricom Daraja callbacks are reliable in production but occasionally delayed
(network retries, Safaricom-side queue spikes). A common pattern is to poll in
a loop until the callback arrives, but this wastes Daraja API quota and complicates
the FSM. ADR-0007 chose a single deterministic poll at t=60s: by that point the
STK prompt has either been acted on or dismissed. If the poll also returns
non-final, the FSM waits for the late callback (which must arrive within the
payment_routing.expires_at window) or the reaper cleans up the row at 3 AM EAT.
On the voice channel the customer is on a live call; 90 seconds is the hard cap before the agent tells the customer the payment timed out, to respect their time and avoid dead air.
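A minimal sketch of the one-shot pattern, assuming an asyncio-based worker. The function name await_payment_result, the callback_event and query_status parameters, and the returned labels are hypothetical; query_status stands in for whatever wraps DarajaClient.stk_push_query() and is assumed to return None while the result is non-final.

```python
import asyncio


async def await_payment_result(callback_event, query_status, poll_after_s=60.0):
    """Wait for the callback; if it has not arrived by poll_after_s, poll once.

    callback_event: asyncio.Event set by the webhook path (assumption).
    query_status:   coroutine function returning a result code, or None
                    when the provider has no final answer yet (assumption).
    """
    try:
        await asyncio.wait_for(callback_event.wait(), timeout=poll_after_s)
        return "callback"
    except asyncio.TimeoutError:
        pass

    # Exactly one reconciliation poll; there is deliberately no loop here.
    result_code = await query_status()
    if result_code is None:
        return "await_late_callback"
    return "booked" if result_code == 0 else "failed"
```

Because the poll runs at most once per payment, each STK push costs at most one extra Daraja query.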
PesaPal card flow
PesaPal is the optional card path for web-widget users or tenants who have enabled card payments. M-Pesa is never sent through PesaPal — this is a hard cost discipline rule (ADR-0007 D2): PesaPal charges a percentage fee that makes it uneconomical for M-Pesa transactions that Daraja handles directly at a flat rate.
Both the 8-minute nudge threshold and the 30-minute abandon threshold are
per-tenant configurable via handoff_thresholds JSONB. A tenant running a
busy walk-in clinic might tighten these; a high-ticket dental practice might
relax them.
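As an illustration of per-tenant overrides, the sketch below resolves the two timers from a handoff_thresholds dict and falls back to the 8/30-minute defaults. The key names pesapal_nudge_minutes and pesapal_abandon_minutes, and the CardCheckoutTimers type, are assumptions for this example; the real JSONB layout is not shown here.

```python
from dataclasses import dataclass

DEFAULT_NUDGE_MIN = 8
DEFAULT_ABANDON_MIN = 30


@dataclass(frozen=True)
class CardCheckoutTimers:
    nudge_after_s: int
    abandon_after_s: int


def resolve_card_timers(handoff_thresholds):
    """Build timers from a tenant's handoff_thresholds (dict or None)."""
    cfg = handoff_thresholds or {}
    # Hypothetical key names; missing keys fall back to the documented defaults.
    nudge = int(cfg.get("pesapal_nudge_minutes", DEFAULT_NUDGE_MIN))
    abandon = int(cfg.get("pesapal_abandon_minutes", DEFAULT_ABANDON_MIN))
    if not 0 < nudge < abandon:
        raise ValueError("nudge must be positive and earlier than abandon")
    return CardCheckoutTimers(nudge * 60, abandon * 60)
```

Validating that the nudge precedes the abandon threshold keeps a misconfigured tenant from releasing a slot before the customer has been reminded.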
Slot reservation model
Reservations are the pessimistic lock that prevents two concurrent customers from booking the same slot. They are Redis keys — not database rows — because the booking turn completes in seconds and holding a database row lock across an STK prompt would be impractical.
| Operation | Redis key pattern | When used |
|---|---|---|
| Single-slot reserve | reservation:<tenant_id>:<service_id>:<slot_ts> | M6+ standard booking |
| Ordered-pair reserve | reservation:pair:<tenant_id>:<primary_ts>:<secondary_ts> | M11+ cross-sell bundle |
_try_reserve_slot() uses SET NX EX <ttl> — it either acquires the lock or
returns False immediately (no blocking). The TTL is set to slightly beyond the
STK hard cap (90s on voice, longer on text) so the slot is automatically released
if the payment hangs. reserve_pair() is a Lua script evaluated in one Redis
round-trip: it acquires the primary key first, then the secondary key; if the
secondary acquisition fails it deletes the primary before returning, preventing
a half-acquired pair.
Because Redis runs the script atomically, no other command can interleave between the two SET NX operations. Without this, two concurrent users could each acquire one half of the pair, and neither could complete the bundle until the other's key expired or was rolled back.
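The two reservation primitives can be sketched as follows. This is not the code in app/services/reservations.py: the Lua body, the function names try_reserve_slot_sketch and reserve_pair_sketch, and their signatures are assumptions that follow the behaviour described above. The client is assumed to expose redis-py style set(..., nx=True, ex=...) and eval(script, numkeys, *keys_and_args).

```python
# Illustrative Lua body: take the primary key, then the secondary key, and
# roll the primary back if the secondary is already held.
RESERVE_PAIR_LUA = """
if redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2]) == false then
  return 0
end
if redis.call('SET', KEYS[2], ARGV[1], 'NX', 'EX', ARGV[2]) == false then
  redis.call('DEL', KEYS[1])
  return 0
end
return 1
"""


def try_reserve_slot_sketch(client, key, holder, ttl_s):
    """Single-slot reserve: SET NX EX, returning False at once if already held."""
    return bool(client.set(key, holder, nx=True, ex=ttl_s))


def reserve_pair_sketch(client, primary_key, secondary_key, holder, ttl_s):
    """Ordered-pair reserve: both keys or neither, in one round-trip."""
    result = client.eval(
        RESERVE_PAIR_LUA, 2, primary_key, secondary_key, holder, ttl_s
    )
    return result == 1
```

Passing both keys through KEYS rather than building them inside the script keeps the key names visible to Redis, which is the documented convention for EVAL.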
Concurrent-payment guard
A booking thread can hold at most one in-flight payment at any time. This is enforced at two layers:
- FSM single-in-flight invariant: the PAYMENT_PENDING state is a terminal gate. The booking graph does not emit a second STK push while the first is pending; any new inbound message in this state is queued or deflected.
- Layer-3 idempotency on payment_routing: payment_routing has a unique constraint on (tenant_id, booking_id, status='pending'). An accidental double-insert (e.g. from a webhook retry) is rejected by Postgres before reaching DarajaClient.stk_push().
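One way a database can express "at most one pending row per booking" is a partial unique index. The sketch below demonstrates the idea with Python's built-in sqlite3 module as a stand-in for Postgres, so it runs without a server. The reduced table layout, the index name one_pending_per_booking, and the try_open_payment helper are assumptions for this example; the actual DDL is not shown in this document.

```python
import sqlite3


def make_routing_db():
    """In-memory stand-in for public.payment_routing (reduced, assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payment_routing ("
        " checkout_request_id TEXT PRIMARY KEY,"
        " tenant_id TEXT NOT NULL,"
        " booking_id TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    # Uniqueness applies only to rows that are still pending.
    conn.execute(
        "CREATE UNIQUE INDEX one_pending_per_booking"
        " ON payment_routing (tenant_id, booking_id)"
        " WHERE status = 'pending'"
    )
    return conn


def try_open_payment(conn, checkout_request_id, tenant_id, booking_id):
    """Insert a pending row; return False if one already exists for the booking."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_routing VALUES (?, ?, ?, 'pending')",
                (checkout_request_id, tenant_id, booking_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Once a row leaves the pending status it no longer participates in the index, so a later retry for the same booking can open a fresh payment.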
The payment_routing correlation table
public.payment_routing is the cross-tenant routing index that lets a Daraja or
PesaPal callback arrive at the load balancer with no per-tenant context, and still
find the correct per-tenant payment record.
| Column | Type | Purpose |
|---|---|---|
| checkout_request_id | text PK | Daraja CheckoutRequestID (or PesaPal order_tracking_id) |
| tenant_id | uuid | Which tenant owns this payment |
| payment_id | uuid | FK into tenant_<slug>.payments |
| provider | enum | daraja or pesapal |
| status | enum | pending, routed, abandoned, dead_letter |
| expires_at | timestamptz | When the reaper is allowed to sweep this row |
The webhook handler does a single SELECT on public.payment_routing by
checkout_request_id, extracts tenant_id + payment_id, then fires a Postgres
NOTIFY payment_state on the per-tenant channel. The payment FSM is listening on
that channel (via LISTEN) and receives the notification without any polling.
If the callback arrives after expires_at and the row is already gone, the handler
writes the raw payload into public.payment_callbacks_unrouted for manual triage.
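The routing decision itself is small enough to sketch with in-memory stand-ins. Here a dict plays the role of public.payment_routing and a list plays public.payment_callbacks_unrouted. The RoutingRow type and the route_callback function are hypothetical, and the NOTIFY step is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRow:
    tenant_id: str
    payment_id: str


def route_callback(routing, unrouted, checkout_request_id, payload):
    """Resolve a provider callback to (tenant_id, payment_id), or dead-letter it.

    routing:  dict keyed by checkout_request_id (stand-in for payment_routing)
    unrouted: list collecting raw payloads (stand-in for the dead-letter table)
    """
    row = routing.get(checkout_request_id)
    if row is None:
        # Row swept or never written: keep the raw payload for manual triage.
        unrouted.append(payload)
        return None
    # The caller would now notify the per-tenant payment FSM.
    return row.tenant_id, row.payment_id
```

Returning None for the dead-letter case lets the HTTP edge acknowledge the provider either way, while only routed callbacks reach the payment FSM.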
Daily 3 AM EAT consolidated reaper
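The reaper (app/workers/payments_reaper.py::run_daily_reaper()) runs as a
scheduled worker task. It combines three sweeps into one transaction to avoid
partial cleanup: expired rows in public.payment_routing, rows aged ≥90 days in
<tenant>.checkpoints_archive, and rows aged ≥90 days in
<tenant>.handoff_log_archive.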
The reaper intentionally excludes rows with status='pending' from the
payment_routing sweep — a payment that is still pending at 3 AM is either very
late or stuck, and should remain in the dead-letter path rather than be silently
deleted. An operator can inspect payment_callbacks_unrouted the next morning.
The archive sweeps cover two separate Postgres schemas per tenant
(checkpoints_<slug> and the tenant's own handoff_log_archive) — the reaper
iterates over all tenants in public.tenants and performs the sweep for each one.
See Observability for the reaper.complete log event
format and how to monitor it.
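The timing arithmetic behind the schedule can be sketched as below. EAT is modelled as a fixed UTC+3 offset, since East Africa Time has no daylight saving. The helper names next_reaper_run and archive_cutoff are assumptions for this example and are not taken from payments_reaper.py.

```python
from datetime import datetime, timedelta, timezone

EAT = timezone(timedelta(hours=3), "EAT")  # fixed offset, no DST
ARCHIVE_RETENTION = timedelta(days=90)


def next_reaper_run(now: datetime) -> datetime:
    """Return the next 03:00 EAT strictly after `now` (timezone-aware)."""
    local = now.astimezone(EAT)
    run = local.replace(hour=3, minute=0, second=0, microsecond=0)
    if run <= local:
        run += timedelta(days=1)
    return run


def archive_cutoff(now: datetime) -> datetime:
    """Archive rows created at or before this instant are aged >= 90 days."""
    return now - ARCHIVE_RETENTION
```

For example, a worker started at 00:30 UTC (03:30 EAT) has just missed the window, so its next run falls at 03:00 EAT on the following day.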
Where it lives in code
| Concern | File | Key entry point |
|---|---|---|
| Daraja STK push client | app/payments/daraja.py | DarajaClient.stk_push() (L345) |
| Daraja stkpushquery primitive | app/payments/daraja.py | DarajaClient.stk_push_query() (L444) |
| Daraja t=60s poll job | app/payments/poll_daraja.py | poll_daraja_status() (L115) / schedule_daraja_poll() (L313) |
| PesaPal card client | app/payments/pesapal.py | PesaPalClient.submit_order() (L419) |
| PesaPal nudge/abandon flow | app/payments/pesapal_flow.py | initiate_pesapal_payment() (L247) |
| Atomic pair reservation (M11 T8) | app/services/reservations.py | reserve_pair() (L212) |
| Single-slot reservation (M6 SETNX inline) | app/orchestrator/booking_graph.py | _try_reserve_slot() (L114) |
| Daraja callback handler (HTTP edge) | app/api/webhooks/daraja.py | receive_daraja_callback() (L41) |
| Daily 3 AM EAT consolidated reaper | app/workers/payments_reaper.py | run_daily_reaper() (L100) |
Decisions
- ADR-0007 (Payments orchestration) is the authoritative source: 8min/30min PesaPal nudge/abandon (per-tenant configurable); one-shot Daraja stkpushquery at t=60s; 90s voice STK hard cap; daily 3 AM EAT consolidated reaper; payment_callbacks_unrouted dead-letter table; concurrent-payment prohibition per booking thread; PAYMENT_CANCELLED_BY_CUSTOMER first-class FSM state with hybrid provider-specific reversal; M-Pesa never routed through PesaPal (cost discipline).
Try this on local dev
- Trigger an STK push end-to-end. Boot the stack (docker compose up -d); start a WhatsApp booking against the test tenant with payment_enabled=True; approve the prompt in the Daraja sandbox; verify that the callback handler at app/api/webhooks/daraja.py::receive_daraja_callback fires. Watch structlog for the webhook.daraja.received event and then payment.state.transition with new_state=BOOKED.

- Inspect the correlation row.

      psql postgresql://postgres:postgres@localhost:5434/ratiba \
        -c "SELECT checkout_request_id, tenant_id, payment_id, status, expires_at
            FROM public.payment_routing ORDER BY created_at DESC LIMIT 5;"

  You'll see the (checkout_request_id, tenant_id, payment_id, expires_at) correlation pointer the callback handler uses to route into the per-tenant payments table. The status column should read routed after the callback lands.

- Watch the reservation lock live.

      redis-cli -p 6381 KEYS 'reservation:*'

  Run this while the booking sits in PAYMENT_PENDING: you'll see the SETNX key appear when _try_reserve_slot succeeds. Run it again after the booking completes and the key will be gone (TTL expiry or explicit DEL).

- Test the dead-letter path. Trigger an STK push, then manually delete the payment_routing row before the Daraja sandbox callback arrives. When the callback fires, it will fail to find a routing row and write to public.payment_callbacks_unrouted. Query it:

      psql postgresql://postgres:postgres@localhost:5434/ratiba \
        -c "SELECT * FROM public.payment_callbacks_unrouted ORDER BY created_at DESC LIMIT 5;"

- Simulate the reaper. Run the reaper manually against the dev database to verify it sweeps without errors:

      /Users/soft4u/Development/ratiba/backend/.venv/bin/python -m app.workers.payments_reaper --dry-run

  In dry-run mode it logs what it would delete without committing. Check structlog for the reaper.sweep.dry_run event and row counts.
Related
- Conversation FSM: booking FSM BOOKED terminal state that fires the payment hook; cost ceiling tracking on BookingState.total_token_cost_usd.
- Cross-sell: reserve_pair() ordered-pair Lua script that the cross-sell bundle booking uses.
- Observability: reaper.complete, payment.state.transition, and webhook.daraja.received log events; how to tail and filter them.
- Configuration: DARAJA_* and PESAPAL_* env vars; per-tenant handoff_thresholds JSONB for nudge/abandon timing.
- Testing: how to run payment integration tests against Testcontainers (no real STK push required in CI).