Payments

What it does

Payments are the rail Ratiba uses to convert a confirmed slot into a paid booking. M-Pesa STK push via Daraja is the primary path (per ADR-0007 D1, cost-discipline mandate: M-Pesa is never routed through PesaPal — Daraja-direct only). Cards run via PesaPal as a separate, optional flow with 8min/30min nudge/abandon scheduling for the longer card-checkout window. Daraja's stkpushquery fires exactly once at t=60s as a reconciliation poll, not a long-poll loop. The voice channel imposes a 90s STK hard cap.

A daily 3 AM EAT consolidated reaper sweeps three tables: expired rows in public.payment_routing, rows aged ≥90 days in <tenant>.checkpoints_archive, and rows aged ≥90 days in <tenant>.handoff_log_archive. Late or unroutable callbacks land in public.payment_callbacks_unrouted (dead-letter table) for manual triage.

A booking thread is held to one in-flight payment at a time — both via the FSM single-in-flight invariant and via Layer-3 idempotency on public.payment_routing. Customer-initiated cancellation is a first-class PAYMENT_CANCELLED_BY_CUSTOMER FSM state with provider-specific reversal: Daraja relies on STK timeout + auto-reverse; PesaPal actively voids the order and triggers an auto-refund.

M11 W2 T7+T8 added the atomic ordered-pair reservation primitive (reserve_pair) — a Redis EVAL Lua script that acquires both a primary and an adjacent secondary slot in a single round-trip, so the cross-sell bundle STK can hold two slots without splitting them across concurrent holders. The single-slot M6 SETNX in _try_reserve_slot is unchanged.

The LLM cost ceiling ($0.05 soft / $0.20 hard per booking) is tracked on BookingState.total_token_cost_usd and is separate from the payment amount; see Conversation FSM for how the cost ceiling signals a handoff trigger.

Prospect story: paying for a booking feels frictionless

From the customer's point of view: she says "book a haircut for tomorrow at 10am", confirms the slot, and within seconds her phone screen lights up with a Safaricom M-Pesa prompt asking her to enter her PIN. She types four digits. The agent replies "Your booking is confirmed — see you tomorrow!". That's it. No browser redirect. No card number. No receipts to print. Payment and booking are one atomic step, entirely within the WhatsApp or phone conversation.

If she is on the web widget and prefers to pay by card, she gets a PesaPal checkout link instead. The agent nudges once at 8 minutes; if she hasn't paid by 30 minutes the slot is released back to the pool.

How it fits in the system

Payment FSM states

The payment FSM runs after the booking FSM reaches BOOKED. It is a separate PaymentState machine — the booking FSM is already terminal by the time any payment state exists.

State details:

| State | What happens |
| --- | --- |
| PAYMENT_PENDING | STK push sent; Redis reservation lock held; payment_routing correlation row alive |
| BOOKED | Daraja ResultCode=0 or PesaPal IPN confirmed; slot locked permanently; reservation key deleted |
| PAYMENT_FAILED | Callback ResultCode != 0 or stkpushquery at t=60s returns a non-zero code; slot released; customer receives a bilingual failure message |
| PAYMENT_CANCELLED_BY_CUSTOMER | Customer sends a cancel intent during PAYMENT_PENDING; Daraja path = let STK timeout + auto-reverse; PesaPal path = active void_order API call + auto-refund; slot released |
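
The state list above can be read as a small transition table. The following is a minimal sketch, not the project's actual PaymentState implementation: the four state names come from this page, while the event names (provider_confirmed, provider_failed, customer_cancelled) and the helper functions are hypothetical and exist only for illustration.

```python
from enum import Enum


class PaymentState(str, Enum):
    PAYMENT_PENDING = "PAYMENT_PENDING"
    BOOKED = "BOOKED"
    PAYMENT_FAILED = "PAYMENT_FAILED"
    PAYMENT_CANCELLED_BY_CUSTOMER = "PAYMENT_CANCELLED_BY_CUSTOMER"


# Hypothetical event names; only the state names are taken from the docs.
TRANSITIONS = {
    (PaymentState.PAYMENT_PENDING, "provider_confirmed"): PaymentState.BOOKED,
    (PaymentState.PAYMENT_PENDING, "provider_failed"): PaymentState.PAYMENT_FAILED,
    (PaymentState.PAYMENT_PENDING, "customer_cancelled"):
        PaymentState.PAYMENT_CANCELLED_BY_CUSTOMER,
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} on {event!r}") from None


def releases_slot(state):
    """Failed and cancelled payments give the slot back to the pool."""
    return state in {
        PaymentState.PAYMENT_FAILED,
        PaymentState.PAYMENT_CANCELLED_BY_CUSTOMER,
    }
```

Every transition starts from PAYMENT_PENDING, so the other three states behave as terminal in this sketch; any event arriving after that point raises instead of silently moving the payment.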

Daraja STK push: end-to-end flow

Why one-shot stkpushquery at t=60s?

Safaricom Daraja callbacks are reliable in production but occasionally delayed (network retries, Safaricom-side queue spikes). A common pattern is to poll in a loop until the callback arrives, but this wastes Daraja API quota and complicates the FSM. ADR-0007 chose a single deterministic poll at t=60s: by that point the STK prompt has either been acted on or dismissed. If the poll also returns non-final, the FSM waits for the late callback (which must arrive within the payment_routing.expires_at window) or the reaper cleans up the row at 3 AM EAT.

On the voice channel the customer is on a live call; 90 seconds is the hard cap before the agent tells the customer the payment timed out, to respect their time and avoid dead air.
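
To make the one-shot idea concrete, here is a hedged sketch of a poll that sleeps once, queries once, and never loops. It is not the code in poll_daraja.py: the function names, the use of None to represent a non-final answer, and the string return values are assumptions made for illustration. Only the 60s delay and the 90s voice cap come from this page.

```python
import asyncio

POLL_DELAY_S = 60.0     # from the text: single reconciliation poll at t=60s
VOICE_STK_CAP_S = 90.0  # from the text: hard cap on the voice channel


def classify_poll(result_code):
    """Map a poll result to the next step. None stands for 'not final yet'."""
    if result_code is None:
        return "WAIT_FOR_CALLBACK"
    return "BOOKED" if int(result_code) == 0 else "PAYMENT_FAILED"


def voice_timed_out(elapsed_s):
    """True once a voice-channel payment has reached the hard cap."""
    return elapsed_s >= VOICE_STK_CAP_S


async def poll_once(query, delay_s=POLL_DELAY_S):
    """Sleep once, call `query` once, and classify the answer. No loop."""
    await asyncio.sleep(delay_s)
    return classify_poll(await query())
```

Here `query` is any coroutine function that returns a result code (or None); in the real system that role would presumably be played by DarajaClient.stk_push_query(), whose exact return shape is not documented on this page.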

PesaPal card flow

PesaPal is the optional card path for web-widget users or tenants who have enabled card payments. M-Pesa is never sent through PesaPal — this is a hard cost discipline rule (ADR-0007 D2): PesaPal charges a percentage fee that makes it uneconomical for M-Pesa transactions that Daraja handles directly at a flat rate.

Both the 8-minute nudge threshold and the 30-minute abandon threshold are per-tenant configurable via handoff_thresholds JSONB. A tenant running a busy walk-in clinic might tighten these; a high-ticket dental practice might relax them.
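
As an illustration of per-tenant overrides, the sketch below merges a tenant's handoff_thresholds over the 8/30-minute defaults and decides what the card flow should do at a given elapsed time. The JSONB key names (pesapal_nudge_minutes, pesapal_abandon_minutes) and both function names are hypothetical; this page does not document the real keys.

```python
from dataclasses import dataclass

DEFAULT_NUDGE_MIN = 8     # from the text
DEFAULT_ABANDON_MIN = 30  # from the text


@dataclass(frozen=True)
class CardTimers:
    nudge_min: float
    abandon_min: float


def resolve_card_timers(handoff_thresholds):
    """Merge tenant overrides over the defaults; the nudge must precede abandon."""
    cfg = handoff_thresholds or {}
    timers = CardTimers(
        # Key names are assumptions, not the documented schema.
        nudge_min=cfg.get("pesapal_nudge_minutes", DEFAULT_NUDGE_MIN),
        abandon_min=cfg.get("pesapal_abandon_minutes", DEFAULT_ABANDON_MIN),
    )
    if not 0 < timers.nudge_min < timers.abandon_min:
        raise ValueError("nudge threshold must fall before abandon threshold")
    return timers


def card_action(elapsed_min, timers, already_nudged):
    """Return 'abandon', 'nudge', or 'wait' for an unpaid card checkout."""
    if elapsed_min >= timers.abandon_min:
        return "abandon"
    if elapsed_min >= timers.nudge_min and not already_nudged:
        return "nudge"
    return "wait"
```

A walk-in clinic could then pass something like `{"pesapal_nudge_minutes": 3, "pesapal_abandon_minutes": 10}`, while a tenant with no overrides falls back to 8 and 30.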

Slot reservation model

Reservations are the pessimistic lock that prevents two concurrent customers from booking the same slot. They are Redis keys — not database rows — because the booking turn completes in seconds and holding a database row lock across an STK prompt would be impractical.

| Operation | Redis key pattern | When used |
| --- | --- | --- |
| Single-slot reserve | reservation:<tenant_id>:<service_id>:<slot_ts> | M6+ standard booking |
| Ordered-pair reserve | reservation:pair:<tenant_id>:<primary_ts>:<secondary_ts> | M11+ cross-sell bundle |

_try_reserve_slot() uses SET NX EX <ttl> — it either acquires the lock or returns False immediately (no blocking). The TTL is set to slightly beyond the STK hard cap (90s on voice, longer on text) so the slot is automatically released if the payment hangs. reserve_pair() is a Lua script evaluated in one Redis round-trip: it acquires the primary key first, then the secondary key; if the secondary acquisition fails it deletes the primary before returning, preventing a half-acquired pair.

The Lua atomicity guarantee means no interleaving between the two SETNX operations. Without this, two concurrent users could each acquire one half of the pair, creating a deadlock where neither can complete.
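
The acquire-then-roll-back behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the script in app/services/reservations.py: the Lua body, the two-key layout, and the argument order are guesses based on the prose. The wrapper accepts any client exposing redis-py's `eval(script, numkeys, *keys_and_args)`; the in-memory stand-in mirrors the script's outcome so the example runs without a Redis server, and it does not execute Lua or model TTLs.

```python
RESERVE_PAIR_LUA = """
if redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2]) then
  if redis.call('SET', KEYS[2], ARGV[1], 'NX', 'EX', ARGV[2]) then
    return 1
  end
  redis.call('DEL', KEYS[1])  -- roll back the half-acquired pair
end
return 0
"""


def reserve_pair(client, primary_key, secondary_key, holder, ttl_s):
    """Return True only if both keys were acquired in one EVAL round-trip."""
    result = client.eval(
        RESERVE_PAIR_LUA, 2, primary_key, secondary_key, holder, ttl_s
    )
    return result == 1


class InMemoryStandIn:
    """Test double that reproduces the script's outcome (no Lua, no TTLs)."""

    def __init__(self):
        self.keys = {}

    def eval(self, script, numkeys, *keys_and_args):
        primary, secondary = keys_and_args[:numkeys]
        holder = keys_and_args[numkeys]
        if primary in self.keys or secondary in self.keys:
            return 0  # either half is taken, so the caller ends up holding nothing
        self.keys[primary] = holder
        self.keys[secondary] = holder
        return 1


if __name__ == "__main__":
    fake = InMemoryStandIn()
    print(reserve_pair(fake, "slot:10:00", "slot:10:30", "thread-a", 120))
    print(reserve_pair(fake, "slot:09:30", "slot:10:00", "thread-b", 120))
```

Because Redis runs a script to completion before serving any other command, no second caller can slip in between the two SET calls, which is the property the paragraph above relies on.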

Concurrent-payment guard

A booking thread can hold at most one in-flight payment at any time. This is enforced at two layers:

  1. FSM single-in-flight invariant: the PAYMENT_PENDING state is a terminal gate. The booking graph does not emit a second STK push while the first is pending; any new inbound message in this state is queued or deflected.
  2. Layer-3 idempotency on payment_routing: the table enforces uniqueness on (tenant_id, booking_id) among rows with status='pending'. An accidental double-insert (e.g. from a webhook retry) is rejected by Postgres before reaching DarajaClient.stk_push().
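
The second layer amounts to a uniqueness rule that applies only to pending rows, which in SQL is usually expressed as a partial unique index. The sketch below demonstrates the idea with SQLite from the standard library as a stand-in for Postgres (both accept a WHERE clause on CREATE UNIQUE INDEX). The reduced schema, the row_id column, the index name, and the function name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payment_routing (
        row_id     TEXT PRIMARY KEY,   -- hypothetical surrogate key
        tenant_id  TEXT NOT NULL,
        booking_id TEXT NOT NULL,
        status     TEXT NOT NULL
    );
    -- At most one pending row per (tenant, booking); settled rows are exempt.
    CREATE UNIQUE INDEX one_pending_per_booking
        ON payment_routing (tenant_id, booking_id)
        WHERE status = 'pending';
    """
)


def try_open_payment(row_id, tenant_id, booking_id):
    """Insert a pending row; return False if one is already in flight."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_routing VALUES (?, ?, ?, 'pending')",
                (row_id, tenant_id, booking_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A caller would issue the STK push only when try_open_payment returns True, so a retried request is stopped at the database rather than at the provider.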

The payment_routing correlation table

public.payment_routing is the cross-tenant routing index that lets a Daraja or PesaPal callback arrive at the load balancer with no per-tenant context, and still find the correct per-tenant payment record.

| Column | Type | Purpose |
| --- | --- | --- |
| checkout_request_id | text PK | Daraja CheckoutRequestID (or PesaPal order_tracking_id) |
| tenant_id | uuid | Which tenant owns this payment |
| payment_id | uuid | FK into tenant_<slug>.payments |
| provider | enum | daraja or pesapal |
| status | enum | pending, routed, abandoned, dead_letter |
| expires_at | timestamptz | When the reaper is allowed to sweep this row |

The webhook handler does a single SELECT on public.payment_routing by checkout_request_id, extracts tenant_id + payment_id, then fires a Postgres NOTIFY payment_state on the per-tenant channel. The payment FSM is listening on that channel (via LISTEN) and receives the notification without any polling.

If the callback arrives after expires_at and the row is already gone, the handler writes the raw payload into public.payment_callbacks_unrouted for manual triage.
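
The lookup-or-dead-letter decision can be sketched as a pure function. This is an illustrative model rather than receive_daraja_callback() itself: a dict stands in for public.payment_routing, a list stands in for public.payment_callbacks_unrouted, and the function name is hypothetical. The nested Body.stkCallback.CheckoutRequestID path follows the Daraja STK callback payload.

```python
def route_callback(payload, routing, unrouted):
    """Return (tenant_id, payment_id) to notify, or None after dead-lettering.

    `routing` maps checkout_request_id -> row dict (stand-in for the table).
    `unrouted` collects raw payloads that could not be matched.
    """
    checkout_id = payload["Body"]["stkCallback"]["CheckoutRequestID"]
    row = routing.get(checkout_id)
    if row is None:
        # Row expired or was swept: keep the raw payload for manual triage.
        unrouted.append(payload)
        return None
    row["status"] = "routed"
    return row["tenant_id"], row["payment_id"]


if __name__ == "__main__":
    routing = {
        "ws_CO_123": {"tenant_id": "t-1", "payment_id": "p-9", "status": "pending"}
    }
    unrouted = []
    hit = {"Body": {"stkCallback": {"CheckoutRequestID": "ws_CO_123", "ResultCode": 0}}}
    miss = {"Body": {"stkCallback": {"CheckoutRequestID": "ws_CO_999", "ResultCode": 0}}}
    print(route_callback(hit, routing, unrouted))
    print(route_callback(miss, routing, unrouted), len(unrouted))
```

In the real handler the returned pair would feed the NOTIFY on the per-tenant channel; that step is left out here because it needs a live Postgres connection.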

Daily 3 AM EAT consolidated reaper

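The reaper (app/workers/payments_reaper.py::run_daily_reaper()) runs as a scheduled worker task. It combines three sweeps into one transaction to avoid partial cleanup: expired rows in public.payment_routing, and rows aged ≥90 days in each tenant's checkpoints archive and handoff log archive.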

The reaper intentionally excludes rows with status='pending' from the payment_routing sweep — a payment that is still pending at 3 AM is either very late or stuck, and should remain in the dead-letter path rather than be silently deleted. An operator can inspect payment_callbacks_unrouted the next morning.

The archive sweeps cover two separate Postgres schemas per tenant (checkpoints_<slug> and the tenant's own handoff_log_archive) — the reaper iterates over all tenants in public.tenants and performs the sweep for each one. See Observability for the reaper.complete log event format and how to monitor it.
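
The scheduling and eligibility rules can be expressed in a few lines. The sketch below is not run_daily_reaper(): the function names and the archived_at parameter are hypothetical, and it only models the rules stated on this page (03:00 EAT, a 90-day archive retention, pending routing rows left alone). EAT is modelled as a fixed UTC+3 offset, since East Africa Time has no daylight saving.

```python
from datetime import datetime, timedelta, timezone

EAT = timezone(timedelta(hours=3))       # East Africa Time, no DST
ARCHIVE_RETENTION = timedelta(days=90)   # from the text


def next_reaper_run(now):
    """Next 03:00 EAT strictly after `now` (must be timezone-aware)."""
    local = now.astimezone(EAT)
    run = local.replace(hour=3, minute=0, second=0, microsecond=0)
    if run <= local:
        run += timedelta(days=1)
    return run


def routing_row_sweepable(status, expires_at, now):
    """Expired routing rows are swept unless they are still pending."""
    return status != "pending" and expires_at <= now


def archive_row_sweepable(archived_at, now):
    """Archive rows are swept once they are at least 90 days old."""
    return now - archived_at >= ARCHIVE_RETENTION
```

Keeping the eligibility checks as pure functions makes the pending-row exclusion easy to unit-test without a database.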

Where it lives in code

| Concern | File | Key entry point |
| --- | --- | --- |
| Daraja STK push client | app/payments/daraja.py | DarajaClient.stk_push() (L345) |
| Daraja stkpushquery primitive | app/payments/daraja.py | DarajaClient.stk_push_query() (L444) |
| Daraja t=60s poll job | app/payments/poll_daraja.py | poll_daraja_status() (L115) / schedule_daraja_poll() (L313) |
| PesaPal card client | app/payments/pesapal.py | PesaPalClient.submit_order() (L419) |
| PesaPal nudge/abandon flow | app/payments/pesapal_flow.py | initiate_pesapal_payment() (L247) |
| Atomic pair reservation (M11 T8) | app/services/reservations.py | reserve_pair() (L212) |
| Single-slot reservation (M6 SETNX inline) | app/orchestrator/booking_graph.py | _try_reserve_slot() (L114) |
| Daraja callback handler (HTTP edge) | app/api/webhooks/daraja.py | receive_daraja_callback() (L41) |
| Daily 3 AM EAT consolidated reaper | app/workers/payments_reaper.py | run_daily_reaper() (L100) |

Decisions

  • ADR-0007 — Payments orchestration is the authoritative source: 8min/30min PesaPal nudge/abandon (per-tenant configurable); one-shot Daraja stkpushquery at t=60s; 90s voice STK hard cap; daily 3 AM EAT consolidated reaper; payment_callbacks_unrouted dead-letter table; concurrent-payment prohibition per booking thread; PAYMENT_CANCELLED_BY_CUSTOMER first-class FSM state with hybrid provider-specific reversal; M-Pesa never routed through PesaPal (cost discipline).

Try this on local dev

  1. Trigger an STK push end-to-end. Boot the stack (docker compose up -d); start a WhatsApp booking against the test tenant with payment_enabled=True; approve the prompt in the Daraja sandbox; verify the callback handler at app/api/webhooks/daraja.py::receive_daraja_callback fires — watch structlog for the webhook.daraja.received event and then payment.state.transition with new_state=BOOKED.

  2. Inspect the correlation row.

    psql postgresql://postgres:postgres@localhost:5434/ratiba \
    -c "SELECT checkout_request_id, tenant_id, payment_id, status, expires_at
    FROM public.payment_routing ORDER BY created_at DESC LIMIT 5;"

    You'll see the (checkout_request_id, tenant_id, payment_id, expires_at) correlation pointer the callback handler uses to route into the per-tenant payments table. The status column should read routed after the callback lands.

  3. Watch the reservation lock live.

    redis-cli -p 6381 KEYS 'reservation:*'

    Run this while the booking sits in PAYMENT_PENDING — you'll see the SETNX key appear when _try_reserve_slot succeeds. Run it again after the booking completes and the key will be gone (TTL expiry or explicit DEL).

  4. Test the dead-letter path. Trigger an STK push, then manually delete the payment_routing row before the Daraja sandbox callback arrives. When the callback fires, it will fail to find a routing row and write to public.payment_callbacks_unrouted. Query it:

    psql postgresql://postgres:postgres@localhost:5434/ratiba \
    -c "SELECT * FROM public.payment_callbacks_unrouted ORDER BY created_at DESC LIMIT 5;"

  5. Simulate the reaper. Run the reaper manually against the dev database, from the backend directory using its virtualenv, to verify it sweeps without errors:

    .venv/bin/python -m app.workers.payments_reaper --dry-run

    In dry-run mode it logs what it would delete without committing. Check structlog for the reaper.sweep.dry_run event and row counts.

Related pages

  • Conversation FSM: booking FSM BOOKED terminal state that fires the payment hook; cost ceiling tracking on BookingState.total_token_cost_usd.
  • Cross-sell: reserve_pair() ordered-pair Lua script that the cross-sell bundle booking uses.
  • Observability: reaper.complete, payment.state.transition, and webhook.daraja.received log events; how to tail and filter them.
  • Configuration: DARAJA_* and PESAPAL_* env vars; per-tenant handoff_thresholds JSONB for nudge/abandon timing.
  • Testing: how to run payment integration tests against Testcontainers (no real STK push required in CI).