ADR-0007: Payments Orchestration

Status: Accepted
Date: 2026-04-25

Context

Ratiba's product thesis hinges on payments working cleanly inside the conversation. The customer never leaves WhatsApp for an M-Pesa payment; for card payments via PesaPal they leave briefly but the return-to-conversation must feel seamless. PRD §B.2 specifies a hybrid payment architecture: M-Pesa, Airtel Money, and Equitel via direct provider APIs (Daraja, Airtel Africa, Jenga); cards via PesaPal hosted checkout.

A4 (docs/research/2026-04-25-payments-orchestration.md) settled the high-level orchestration shape: a single LangGraph payment_node that calls interrupt() after dispatching the payment to the provider, suspending the conversation thread until the webhook handler resolves the payment and resumes via Command(resume={...}). Same FSM (AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED / PAYMENT_CONFIRMED) handles both rails — they differ only in policy (timeout values, callback shape).

Spec §12 C2 settled the new architectural primitive that ADR-0002 D1 codified: the public.payment_routing shared bridge table, which maps merchant_reference → (tenant_id, schema_name, thread_id). The table is required because Daraja and PesaPal webhooks arrive without tenant context and need a shared lookup to resolve the right tenant before switching schemas.

ADR-0006 D4 settled the STK-in-flight handoff interaction: when a confidence trigger fires while a payment is in flight, handoff suspends the thread but the M-Pesa callback (when it arrives) is always authoritative — it resolves the payment row + appends a system message to admin's brief panel without resuming the conversation thread.

What ADR-0007 settles is the operational specifics A4 punted to follow-up: PesaPal nudge/abandon timing, Daraja stkpushquery polling cadence, voice STK hold cap, public.payment_routing reaper cadence, late-callback handling, PesaPal card-only policy commitment, concurrent-payment semantics, and customer-initiated cancellation handling (the case Adrian flagged in the 2026-04-25 ADR-0007 brainstorm: customer wants to cancel a pending payment to re-select a service).

Decision

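Eight specific decisions (sections 2 through 9), organized as a coherent payments orchestration model and preceded by a recap of the inherited architecture (section 1). Decision labels follow the section numbers, so D2 through D9 refer to sections 2 through 9.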

1. Architectural recap (inherited, locked here)

| Aspect | Where decided | Recap |
| --- | --- | --- |
| Lifecycle shape | A4 §1 | Initiate → suspend (interrupt()) → callback → resume (Command(resume={...})) |
| Single FSM, two rails | A4 §2 | AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED / PAYMENT_CONFIRMED (this ADR D9 adds PAYMENT_CANCELLED_BY_CUSTOMER) |
| Routing table | spec §12 C2, ADR-0002 D1, A4 §3 | public.payment_routing(merchant_reference PK, tenant_id, schema_name, thread_id, provider, expires_at) |
| payments table additions per A4 §3 | A4 §3 | merchant_reference VARCHAR(40) UNIQUE NOT NULL; thread_id VARCHAR(100) NOT NULL; provider VARCHAR(20) NOT NULL CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')) |
| merchant_reference format | A4 §3 | RTB-{tenant_short}-{ulid}; the 12-char Daraja AccountReference constraint is accommodated by passing tenant_short alone; ULID is merchant_reference PK + Daraja TransactionDesc |
| Triple-layer idempotency | A4 §5 | (1) Postgres mpesa_receipt UNIQUE DEFERRABLE; (2) Redis dedupe key on provider's CheckoutRequestID / OrderTrackingId, 24h TTL; (3) SELECT FOR UPDATE on payments.status before resume |
| TenantScopedSaver via for_tenant factory | ADR-0001 amendment + spike Option A + ADR-0002 D4 + A4 §8 | Webhook handler asks TenantScopedSaverFactory.for_tenant(tenant_id); never constructs a saver against a specific schema directly |
| persist_payment_routing BEFORE provider.initiate_payment | A4 §8 | Non-negotiable order. If reversed, callback can beat us to the lookup table and the resume is lost (Daraja sandbox callbacks have been observed at sub-200ms in the field) |
| STK-in-flight handoff interaction | ADR-0006 D4 | Handoff suspends thread; M-Pesa callback always authoritative; resolves payment row + appends system message to admin's brief panel without resuming the conversation thread |
| Hold-the-line-and-text pattern (voice card flow) | A4 §6 + ADR-0006 D3 | After 25s on voice for a card payment: agent says "I'll text you the link and call back when confirmed," ends call, webhook handler triggers LiveKit outbound SIP callback on PAYMENT_CONFIRMED (Phase 2 only; Phase 1 voice card flow defers to WhatsApp follow-up per ADR-0006 D3) |

ADR-0007 builds on these without re-deriving them.
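The persist-before-initiate ordering in the recap can be illustrated with a small simulation. This is a minimal sketch, not the project's code: `RoutingTable`, `initiate`, and `run` are hypothetical names, the routing table is an in-memory dict standing in for public.payment_routing, and the fake provider fires its callback before `initiate` returns to model a very fast callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RoutingTable:
    """In-memory stand-in for public.payment_routing (hypothetical)."""
    rows: Dict[str, str] = field(default_factory=dict)

    def persist(self, merchant_reference: str, thread_id: str) -> None:
        self.rows[merchant_reference] = thread_id

    def lookup(self, merchant_reference: str) -> Optional[str]:
        return self.rows.get(merchant_reference)


def initiate(routing: RoutingTable,
             provider_initiate: Callable[[str], None],
             merchant_reference: str,
             thread_id: str,
             persist_first: bool = True) -> None:
    """Dispatch a payment; persist_first=False shows the unsafe order."""
    if persist_first:
        routing.persist(merchant_reference, thread_id)
        provider_initiate(merchant_reference)
    else:
        provider_initiate(merchant_reference)
        routing.persist(merchant_reference, thread_id)


def run(persist_first: bool) -> List[Optional[str]]:
    """Return what the callback handler resolved for each callback."""
    routing = RoutingTable()
    resolved: List[Optional[str]] = []

    # Fake provider: the callback lookup happens before initiate() returns.
    def fast_provider(ref: str) -> None:
        resolved.append(routing.lookup(ref))

    initiate(routing, fast_provider, "RTB-demo-01", "thread-1", persist_first)
    return resolved
```

With `persist_first=True` the callback resolves the thread; with the order reversed the lookup misses and the resume would be lost.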

2. PesaPal nudge timing — 8 minutes soft / 30 minutes hard

PesaPal hosted checkout has no fixed timeout (customer is in browser doing 3D Secure, which can take 5+ minutes if issuing-bank OTP SMS is slow — common in Kenya). This ADR locks two timing parameters:

| Parameter | Default | Column |
| --- | --- | --- |
| Soft nudge ("did you finish? tap the link again if you closed the page" + resends link) | 8 minutes | public.tenants.pesapal_nudge_seconds (default 480) |
| Hard abandonment (payments.status='timeout', free slot hold, agent says "looks like the payment didn't go through, want to try again or pay at the venue?") | 30 minutes | public.tenants.pesapal_abandon_seconds (default 1800) |

ALTER TABLE public.tenants
  ADD COLUMN pesapal_nudge_seconds INTEGER NOT NULL DEFAULT 480,
  ADD COLUMN pesapal_abandon_seconds INTEGER NOT NULL DEFAULT 1800;

Why these values. Real-world 3D Secure can take 5+ minutes when the issuing bank's OTP SMS is slow. A 5-minute nudge would interrupt customers mid-payment. 30-minute hard abandonment is the natural window after which a customer who hasn't returned has effectively moved on; longer windows tie up the slot hold without recovery benefit.

Per-tenant configurability lets tenants with high-friction payment demographics (e.g., dental clinics with older customer base, more 3DS issues) loosen the threshold without code changes. The values are eval-tunable after the first 100 production PesaPal transactions per pilot tenant.
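As a rough illustration of how the two per-tenant thresholds could drive the timeout logic, the sketch below maps elapsed time to an action. `PesapalTiming` and `pesapal_action` are hypothetical names, and the "nudge once" behaviour is an assumption rather than something this ADR specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PesapalTiming:
    """Mirrors the two public.tenants columns; defaults match this ADR."""
    nudge_seconds: int = 480
    abandon_seconds: int = 1800


def pesapal_action(elapsed_seconds: int, timing: PesapalTiming, nudged: bool) -> str:
    """Return 'wait', 'nudge', or 'abandon' for a pending PesaPal order."""
    if elapsed_seconds >= timing.abandon_seconds:
        return "abandon"  # payments.status='timeout', free the slot hold
    if elapsed_seconds >= timing.nudge_seconds and not nudged:
        return "nudge"    # resend the link once (assumption: single nudge)
    return "wait"
```

A tenant that loosens the soft threshold to 12 minutes would pass `PesapalTiming(nudge_seconds=720)` and see no nudge at the 10-minute mark.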

3. Daraja stkpushquery polling — one-shot at t=60s only

Daraja's STK push has a 60-second timeout on Safaricom's side. The stkpushquery API is the authoritative way to check status when the callback hasn't arrived. This ADR locks the polling cadence at a single query at t=60s, as a belt-and-braces final check before the FSM transitions to PAYMENT_TIMEOUT.

The t=30s customer-facing nudge ("still waiting, check your phone") specified in A4 §4 is a pure UX gesture — it does not require a backing API check. We do not poll stkpushquery at t=30s.

Why one-shot only. Daraja's median callback latency is 8-15s in the field; if no callback by 60s, something is genuinely off (lossy callback delivery, customer ignored prompt, customer on slow network). One query at t=60s catches the lossy-callback case authoritatively; doubling the call volume to add a t=30s probe buys nothing decision-relevant.

Cost discipline: stkpushquery calls count against Safaricom's per-shortcode quota; minimizing unnecessary calls preserves headroom for legitimate retries.
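The cadence described in this section can be sketched as a tick-by-tick simulation. This is illustrative only: `wait_for_stk` is a hypothetical name, `query_status` stands in for a real stkpushquery call, and the mapping from query result strings to FSM states is an assumption made for the example.

```python
from typing import Callable, List, Optional, Tuple

NUDGE_AT = 30   # customer-facing nudge only; no API call
QUERY_AT = 60   # single stkpushquery before declaring timeout


def wait_for_stk(callback_at: Optional[int],
                 query_status: Callable[[], str]) -> Tuple[str, List[str]]:
    """Simulate one STK wait; returns (outcome, ordered action log)."""
    actions: List[str] = []
    for t in range(0, QUERY_AT + 1):
        if callback_at is not None and t >= callback_at:
            actions.append(f"t={t}: callback")
            return "RESOLVED_BY_CALLBACK", actions
        if t == NUDGE_AT:
            actions.append("t=30: nudge")
        if t == QUERY_AT:
            actions.append("t=60: stkpushquery")
            result = query_status()
            # Hypothetical result strings; anything else counts as a timeout.
            outcome = {"success": "PAYMENT_CONFIRMED",
                       "failed": "PAYMENT_FAILED"}.get(result, "PAYMENT_TIMEOUT")
            return outcome, actions
    return "PAYMENT_TIMEOUT", actions
```

In this sketch a payment with no callback triggers exactly one query, and a payment whose callback lands at t=12 triggers none.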

4. Voice STK hold cap — 90 seconds

When a voice call enters the M-Pesa STK flow, the agent holds the line while waiting for the callback. This ADR locks 90 seconds as the hard cap (60s for STK + 30s for graceful wrap-up).

ALTER TABLE public.tenants
ADD COLUMN voice_stk_max_hold_seconds INTEGER NOT NULL DEFAULT 90;

Lifecycle:

t=0s      STK dispatched. Agent says (in detected language):
          "Nitaomba M-Pesa prompt sasa hivi. Itakuwa kwenye simu yako
          baada ya sekunde tano. Weka PIN yako."
t=15s     Soft hold filler ("Bado tunasubiri…")
t=30s     Soft hold filler again
t=45s     Soft hold filler again
t=60s     STK timeout (no callback). Agent: "Hatuoni malipo. Tujaribu
          tena, ulipe baadaye, au sema 'simu' kupitishia kupiga simu
          tena baadaye?"
t=60-90s  Wrap-up: customer responds; agent confirms next step
          (text new link via WhatsApp, pay at venue, or schedule
          callback)
t=90s     Hard end of call regardless of state

Why 90s. The 30s wrap-up window matters for human-quality voice UX. A customer who has been on hold for 60s waiting for an M-Pesa prompt is in a vulnerable moment; abruptly ending the call at a 60s cap feels brusque. 30s is enough for one back-and-forth, and locking at 90s means the call concludes gracefully without awkward stretches into 120s+.

Per-tenant configurable. High-volume tenants where telephony cost discipline matters more than the wrap-up smoothness can tune down.
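The hold lifecycle can be expressed as a function of the per-tenant cap. The sketch below is an assumption-laden illustration: `voice_hold_timeline` and the event labels are hypothetical, and the rule that the timeout prompt is skipped when the cap leaves no wrap-up window is one possible reading of "tune down".

```python
from typing import List, Tuple


def voice_hold_timeline(max_hold_seconds: int = 90,
                        stk_timeout_seconds: int = 60,
                        filler_every_seconds: int = 15) -> List[Tuple[int, str]]:
    """Build (second, event) pairs for a voice STK hold."""
    events: List[Tuple[int, str]] = [(0, "dispatch_stk")]
    # Soft hold fillers between dispatch and the STK timeout.
    for t in range(filler_every_seconds, stk_timeout_seconds, filler_every_seconds):
        if t < max_hold_seconds:
            events.append((t, "hold_filler"))
    # The timeout prompt opens the wrap-up window, if the cap leaves room for one.
    if stk_timeout_seconds < max_hold_seconds:
        events.append((stk_timeout_seconds, "timeout_prompt"))
    events.append((max_hold_seconds, "hard_end"))
    return events
```

With the default 90s cap this reproduces the lifecycle above; a tenant that sets `voice_stk_max_hold_seconds` to 60 gets the fillers followed directly by the hard end, with no wrap-up window.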

5. Daily reaper at 3 AM EAT — consolidated nightly maintenance

public.payment_routing rows are reaped past their 24-hour expires_at. This ADR locks the reaper cadence at daily at 3 AM EAT (East Africa Time), consolidated with the other nightly mover jobs:

| Job | Source ADR | Action |
| --- | --- | --- |
| public.payment_routing reaper | this ADR D5 | DELETE rows WHERE expires_at < NOW() |
| Per-tenant checkpoints_archive mover | ADR-0003 D2 | INSERT INTO archive + DELETE from live (rows older than 90 days) |
| Per-tenant handoff_log_archive mover | ADR-0006 D7 | INSERT INTO archive + DELETE from live (rows older than 90 days) |

Single cron entry (scripts/nightly-maintenance.sh); single ops dashboard surfaces job outcomes; single failure-handling code path.

Why daily, not hourly. The public.payment_routing table is small. Per-row ~120 bytes; at 1k tenants × 100 payments/day = 100k rows/day. After 24h all expired. Peak unreaped backlog ~12 MB. Hourly reaping is unnecessary load — no operational benefit from cleaning every hour vs every night for a table this small.
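The sizing estimate above is easy to check, and the reaper predicate is equally small. In this sketch the constants are the figures quoted in the paragraph, while `reap` is a hypothetical in-memory stand-in for the nightly DELETE, not the real job.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

TENANTS = 1_000
PAYMENTS_PER_TENANT_PER_DAY = 100
BYTES_PER_ROW = 120

rows_per_day = TENANTS * PAYMENTS_PER_TENANT_PER_DAY   # 100,000 rows
backlog_mb = rows_per_day * BYTES_PER_ROW / 1_000_000  # one day of expired rows, in MB


def reap(rows: List[Dict], now: datetime) -> List[Dict]:
    """Keep rows that have not expired (mirrors DELETE ... WHERE expires_at < NOW())."""
    return [row for row in rows if row["expires_at"] >= now]


now = datetime(2026, 4, 26, 0, 0, tzinfo=timezone.utc)
rows = [
    {"merchant_reference": "RTB-a-old", "expires_at": now - timedelta(hours=1)},
    {"merchant_reference": "RTB-a-new", "expires_at": now + timedelta(hours=5)},
]
remaining = reap(rows, now)
```

Here `backlog_mb` comes out at 12.0, matching the estimate, and only the unexpired row survives the reap.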

3 AM EAT is the canonical low-traffic window for Ratiba's target market: it falls overnight in Kenya, when weekly retail rhythms permit longer maintenance windows.

6. Late-callback dead-letter table

The edge case A4 §9 #6 surfaced: a callback arrives 24+ hours after STK push initiation. The public.payment_routing row has already been reaped (D5: daily reaper over a 24-hour expires_at), so the callback handler can't resolve merchant_reference → (tenant, thread_id). This ADR specifies a dead-letter pattern.

New table in public schema:

CREATE TABLE public.payment_callbacks_unrouted (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  provider VARCHAR(20) NOT NULL CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')),
  raw_payload JSONB NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  attempted_merchant_reference VARCHAR(40),
  reviewed_at TIMESTAMPTZ,
  reviewed_by UUID, -- staff user; null until reviewed
  resolution_note TEXT
);

CREATE INDEX idx_payment_callbacks_unrouted_unreviewed
  ON public.payment_callbacks_unrouted (received_at)
  WHERE reviewed_at IS NULL;

Lifecycle:

1. Callback arrives at /api/v1/webhooks/<provider>/callback/<secret>
2. Handler parses provider's callback into unified shape (extracts
   checkout_id, merchant_reference)
3. Layer 2 idempotency check (Redis dedupe key); skip if duplicate
4. Lookup public.payment_routing by merchant_reference
   ├── HIT  → proceed to Layer 3 + resume per A4 §8
   └── MISS → INSERT INTO public.payment_callbacks_unrouted
              with full raw_payload + attempted_merchant_reference
              Log event_type=payment.callback.unrouted
5. Always return 200 to provider (avoid retry amplification)
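The five steps above can be sketched as a small handler. Everything here is a stand-in chosen for illustration: `CallbackHandler` is a hypothetical class, a Python set replaces the Redis dedupe key, a dict replaces public.payment_routing, and a list replaces public.payment_callbacks_unrouted. The payload is assumed to be already parsed into the unified shape of step 2.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class CallbackHandler:
    # merchant_reference -> (tenant_id, thread_id); stands in for public.payment_routing
    routing: Dict[str, Tuple[str, str]]
    seen: Set[str] = field(default_factory=set)                    # Redis dedupe stand-in
    unrouted: List[Dict[str, Any]] = field(default_factory=list)   # dead-letter stand-in
    resumed: List[Tuple[str, str]] = field(default_factory=list)   # threads handed to resume

    def handle(self, provider: str, payload: Dict[str, Any]) -> int:
        checkout_id = str(payload.get("checkout_id"))
        merchant_reference = payload.get("merchant_reference")

        # Step 3: Layer 2 idempotency, duplicates are acknowledged and skipped.
        if checkout_id in self.seen:
            return 200
        self.seen.add(checkout_id)

        # Step 4: routing lookup; a miss goes to the dead-letter store.
        target = self.routing.get(merchant_reference) if merchant_reference else None
        if target is None:
            self.unrouted.append({
                "provider": provider,
                "raw_payload": payload,
                "attempted_merchant_reference": merchant_reference,
            })
            return 200

        # HIT: the real handler would continue to Layer 3 and resume the thread.
        self.resumed.append(target)
        return 200  # Step 5: always 200, so the provider does not retry
```

Every path returns 200, so a late or duplicate callback never causes the provider to amplify retries.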

The daily ops dashboard surfaces unreviewed rows. Manual reconciliation flow: ops looks at attempted_merchant_reference, finds the corresponding payments row in tenant schemas (or payments_archive if old), and updates it manually if appropriate. A reversal or refund is attempted via the provider API if the customer was charged but no booking was created.

Why dead-letter over alternatives. Late callbacks are vanishingly rare (Daraja and PesaPal callback infrastructure generally lands within minutes); designing the normal path around the rare case is wrong. Extending payment_routing TTL from 24 hours to 90 days to "match" checkpoint retention bloats the table roughly 90x for < 0.1% of callbacks. Phone-number fallback introduces ambiguity (which of the customer's three concurrent bookings does this callback belong to?) that A4 §3 explicitly rejected.

7. PesaPal exclusively for cards — policy commitment

Annex B §B.2 surfaced the PesaPal multi-method commission leak: the hosted page shows ALL payment methods by default, so a customer who selected "Card" in WhatsApp could pay via M-Pesa on the PesaPal page → merchant pays 3.5% PesaPal surcharge instead of the cheap direct-Daraja rate.

This ADR locks an unconditional policy commitment:

PesaPal is exclusively the card rail. M-Pesa, Airtel Money, and Equitel always use their direct provider APIs (Daraja, Airtel Africa, Jenga). The orchestrator's payment-method-selection step never routes mobile-money intents through PesaPal.

Implementation:

  1. Customer's payment-method choice in WhatsApp (button-based per PRD §B.2) drives provider routing in backend/app/services/payments/router.py:

    PROVIDER_BY_METHOD = {
        "mpesa": "daraja",
        "airtel": "airtel_africa",
        "equitel": "jenga",
        "card": "pesapal",
    }
  2. PesaPal SubmitOrderRequest is sent with the payment_method = "CARD" filter (per A4 §7 option (c)).

  3. Implementation verifies whether PesaPal API 3.0 actually supports the filter at integration time (M5/M6 sandbox testing).

    • If filter works → hosted page locked to card-only.
    • If filter unavailable → leak accepted as known-risk. Small surface: only fires when a customer who selected "Card" deliberately navigates to non-card buttons on PesaPal's page. Lost margin bounded; merchant reconciliation gets a footnote.

Why unconditional. Routing M-Pesa through PesaPal would burn 3.5% on every M-Pesa transaction — catastrophic violation of the cost-discipline principle (~14x the per-transaction cost of direct Daraja). The leak is a small surface; deferring PesaPal entirely to ship Phase 1 with M-Pesa-only would lose card capability that PRD Annex B added as part of the canonical product vision.
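A guard around the routing map makes the policy hard to violate by accident. The sketch below reuses the `PROVIDER_BY_METHOD` mapping from the implementation notes; `route_provider`, `MOBILE_MONEY_METHODS`, and `pesapal_surcharge_cents` are hypothetical helpers, and the surcharge function only applies the 3.5% figure quoted in this section.

```python
PROVIDER_BY_METHOD = {
    "mpesa": "daraja",
    "airtel": "airtel_africa",
    "equitel": "jenga",
    "card": "pesapal",
}

MOBILE_MONEY_METHODS = frozenset({"mpesa", "airtel", "equitel"})


def route_provider(method: str) -> str:
    """Resolve the provider for a customer's chosen payment method."""
    try:
        provider = PROVIDER_BY_METHOD[method]
    except KeyError:
        raise ValueError(f"unsupported payment method: {method!r}") from None
    # Policy guard: mobile money must never be routed through PesaPal,
    # even if someone edits the mapping above.
    if method in MOBILE_MONEY_METHODS and provider == "pesapal":
        raise RuntimeError(f"policy violation: {method} routed to pesapal")
    return provider


def pesapal_surcharge_cents(amount_cents: int) -> int:
    """3.5% surcharge in integer cents (35 per 1000), rounded down."""
    return amount_cents * 35 // 1000
```

For a KES 2,000 payment (200,000 cents) the surcharge is 7,000 cents, i.e. KES 70, which is the per-transaction margin at stake if a mobile-money payment leaks onto the hosted page.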

8. Concurrent-payment prohibition per booking thread

A4 §9 #9 surfaced the racing-retries case: a second payment initiation arrives while the first is still in AWAITING_PAYMENT. This ADR enforces single-payment-in-flight per thread.

Mechanism:

  1. FSM-level: AWAITING_PAYMENT is a single state; the FSM cannot dispatch a second STK push or PesaPal order without first transitioning out (to a terminal payment state: PAYMENT_TIMEOUT, PAYMENT_FAILED, PAYMENT_CONFIRMED, or PAYMENT_CANCELLED_BY_CUSTOMER per D9).
  2. Database-level (Layer 3 idempotency): SELECT FOR UPDATE on payments.status in the resume handler (per A4 §5) catches the race even if the FSM check fails. A row with status in ('initiated', 'pending') means a payment is in flight; the second initiation is rejected.
  3. Customer-facing UX on rejection: soft message in detected language. English: "One moment please — we're still processing your previous payment." Swahili: "Subiri kidogo — tunashughulikia malipo yako ya awali."
  4. Observability: every rejected retry logs event_type=payment.race.rejected per Phase B §5.2 with both contenders' correlation_ids populated.

Important interaction with D9: PAYMENT_CANCELLED_BY_CUSTOMER is a terminal state. Cancellation explicitly clears the path for a fresh payment attempt; the prohibition blocks only non-terminal in-flight payments.

Why not allow concurrent payments. Concurrent-payment handling introduces financial complexity (which one wins? what about the loser? do we refund automatically?) that has zero business value. The customer doesn't need to pay twice; the merchant doesn't want to refund. Rejecting cleanly with a friendly message is the correct UX — the customer's previous payment is still resolving; making them wait 60s is far better than risking a double-charge.
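The database-level check reduces to a membership test on the thread's existing payment statuses. This is a minimal sketch: `can_initiate_payment` and `rejection_message` are hypothetical names, the in-flight statuses and the two messages are taken from the mechanism list above, and the real check would run under SELECT FOR UPDATE rather than on an in-memory list.

```python
from typing import Iterable

# Statuses that mean a payment is still in flight on this thread.
IN_FLIGHT_STATUSES = frozenset({"initiated", "pending"})

REJECTION_MESSAGES = {
    "en": "One moment please — we're still processing your previous payment.",
    "sw": "Subiri kidogo — tunashughulikia malipo yako ya awali.",
}


def can_initiate_payment(thread_payment_statuses: Iterable[str]) -> bool:
    """True when no payment on the thread is still in flight."""
    return not any(s in IN_FLIGHT_STATUSES for s in thread_payment_statuses)


def rejection_message(language: str) -> str:
    """Soft rejection text; falls back to English (assumption)."""
    return REJECTION_MESSAGES.get(language, REJECTION_MESSAGES["en"])
```

Terminal statuses such as `timeout`, `failed`, or `cancelled_by_customer` do not block a new attempt, which is the interaction with the cancellation state described above.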

9. Customer-initiated cancellation — hybrid provider-specific

The case Adrian flagged in the 2026-04-25 brainstorm: a customer selects a service, gets the payment link, then changes their mind ("actually I want the deep tissue instead") before paying. The in-flight payment must be cancelled cleanly to allow a second payment attempt.

New first-class FSM state: PAYMENT_CANCELLED_BY_CUSTOMER joins AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED / PAYMENT_CONFIRMED from A4 §2. It is a terminal state, so D8's concurrent-payment prohibition does NOT block subsequent initiations.

New payments.status value: cancelled_by_customer.

Trigger detection. Customer expresses cancel intent during AWAITING_PAYMENT via:

  • Natural language ("actually wait, I want the deep tissue instead", "futa hii", "cancel this please") — picked up by the intent classifier (per ADR-0005 D2 single bilingual prompt).
  • Slash command (rare for customer side; supported for uniformity).
  • Explicit "Cancel" button if the FSM emitted one in the prior turn.

Provider-specific cancellation mechanics (the hybrid):

M-Pesa STK (Daraja has no direct cancel API):

  1. FSM transitions AWAITING_PAYMENT → PAYMENT_CANCELLED_BY_CUSTOMER.
  2. payments row updated: status = 'cancelled_by_customer', cancelled_at = NOW().
  3. Agent responds in detected language (upfront expectation-setting per Fork A): "Sawa, nimebatilisha. Ukiweka PIN tayari, tutarudisha pesa zako ndani ya saa 24." / "OK, cancelling. If you already entered your PIN, we'll refund within 24h."
  4. FSM transitions back to CONFIRM (or wherever customer redirected — likely SERVICE for a re-selection).
  5. STK either times out naturally (60s) OR callback arrives if customer entered PIN before cancelling. On callback arrival:
    • Layer 3 sees status = 'cancelled_by_customer' → does NOT resume the booking flow.
    • Auto-trigger Daraja Reversal API with the callback's receipt number. Reversal is idempotent on Daraja's side.
    • Log event_type=payment.reversal.attempted per Phase B §5.2.
    • On Reversal API success: log payment.reversal.succeeded; audit-only admin notification.
    • On Reversal API failure: log payment.reversal.failed; action-required admin alert; row inserted into public.payment_callbacks_unrouted (re-using the dead-letter pattern from D6) with resolution_note = 'reversal_failed'.

PesaPal hosted (PesaPal supports voiding unpaid orders):

  1. FSM transitions same as above.
  2. payments row updated same as above.
  3. Actively call PesaPal void/cancel API on the order. The call is fire-and-forget (we don't block the FSM transition on it).
  4. Agent responds: "Sawa, link imefutwa." / "OK, the payment link has been cancelled." (No "if you already paid" caveat needed — void usually takes effect before customer can complete 3D Secure.)
  5. FSM transitions back.
  6. If callback arrives (customer completed before void took effect):
    • Layer 3 sees status = 'cancelled_by_customer' → does NOT resume the booking flow.
    • Auto-trigger PesaPal refund API.
    • Same logging + admin notification pattern as M-Pesa.

Common to both rails:

  • Three new event_types per Phase B §5.2: payment.reversal.attempted, payment.reversal.succeeded, payment.reversal.failed.
  • Admin gets notified for any reversal/refund outcome (success = audit-only; failure = action-required).
  • The customer's next message after cancellation triggers a fresh booking flow turn — the FSM is in CONFIRM or SERVICE, ready for a new choice.
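The late-callback branch shared by both rails can be sketched as follows. This is illustrative only: `handle_callback_for_payment` is a hypothetical function, `reverse` stands in for the Daraja Reversal or PesaPal refund call and is assumed to return True on success, and plain lists replace the event log and the dead-letter table.

```python
from typing import Callable, Dict, List


def handle_callback_for_payment(payment: Dict[str, str],
                                receipt: str,
                                reverse: Callable[[str], bool],
                                events: List[str],
                                dead_letter: List[Dict[str, str]]) -> str:
    """Decide what a payment callback does once the payments row is locked."""
    if payment["status"] != "cancelled_by_customer":
        return "resume"  # normal path: continue to resume the booking flow

    # Cancelled payment that was paid anyway: never resume, try to give the money back.
    events.append("payment.reversal.attempted")
    if reverse(receipt):
        events.append("payment.reversal.succeeded")   # audit-only notification
        return "reversed"

    events.append("payment.reversal.failed")          # action-required alert
    dead_letter.append({"receipt": receipt, "resolution_note": "reversal_failed"})
    return "reversal_failed"
```

A failed reversal ends up in the same dead-letter queue as unrouted callbacks, so ops has a single place to look for money that needs manual handling.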

Why upfront customer messaging on M-Pesa (Fork A = (1)). The 1% of customers who actually entered PIN deserve clear expectations about the refund window. The 99% who didn't will mentally discard the "if you" qualifier. Trust > avoiding minor anxiety.

Why first-class FSM state (Fork B = (X)). Consistent with how PAYMENT_TIMEOUT, PAYMENT_FAILED, PAYMENT_CONFIRMED are first-class states. The audit signal matters for "how often do customers cancel mid-payment?" analytics — directly visible in dashboards without having to filter on payments.status.

Consequences

Positive.

  1. Per-tenant tunability extended to PesaPal timing + voice STK cap (D2, D4) means timing parameters are eval-driven, not deploy-driven. A high-volume tenant tunes voice cost; high-friction-payment tenants loosen PesaPal timing.
  2. Cancellation handling preserves payment fidelity even in race conditions (D9). The auto-reverse paths for both rails mean customer money is never permanently misallocated.
  3. Dead-letter pattern handles late callbacks operationally (D6). Rare events get explicit operator visibility instead of silent failure.
  4. PesaPal card-only policy keeps cost discipline (D7). 3.5% surcharge only on card volume (the cheaper-to-process-on-PesaPal choice); never on M-Pesa volume (where direct Daraja saves the 3.5%).
  5. Concurrent-prohibition simplifies the FSM and eliminates a class of double-charge bugs (D8). The single-in-flight invariant is structurally enforced; financial complexity from reversals/refunds is bounded to the deliberate-cancellation case only.
  6. Daily consolidated reaper job (D5) means one nightly maintenance window, one ops dashboard, one failure handling path.
  7. Cancellation FSM state is first-class (D9 Fork B) so "cancellation rate" becomes a directly-queryable analytics signal.

Negative.

  1. public.payment_callbacks_unrouted dead-letter table requires daily review process. Adds an ops touchpoint; if review is neglected the table grows unbounded. Mitigation: ops dashboard surfaces unreviewed rows; weekly automated reminder if count(unreviewed) > 0.
  2. Auto-reversal/refund logic adds two new provider API integrations (Daraja Reversal API, PesaPal Refund API). Both exist but have their own auth, error modes, and rate limits. Implementation cost real but bounded.
  3. Cancellation messaging on M-Pesa (the "if you entered PIN" caveat) is a slightly anxiety-inducing message for the 99% of customers who didn't pay. Mitigation: phrasing is gentle ("if you already entered" not "we may have charged you"), ordering leads with cancellation confirmation.
  4. Per-tenant tunability columns continue accumulating on public.tenants — this ADR adds three more (pesapal_nudge_seconds, pesapal_abandon_seconds, voice_stk_max_hold_seconds). Cumulative growth across ADR-0005 (3 columns), ADR-0006 (4 columns), ADR-0007 (3 columns) is 10 new columns + 3 JSONB. Migration coordination matters; documented schema-evolution log at docs/architecture/tenants-schema-evolution.md maintained alongside the migrations.
  5. PesaPal payment_method filter remains verify-at-implementation-time. If filter is unavailable in API 3.0, the leak is accepted as known-risk. Bounded surface but non-zero.

Neutral.

  1. 8/30/60/90 timing values are starting, eval-tunable per pilot. First 100 production transactions per pilot tenant reveal the actual distribution.
  2. Concurrent-payment prohibition is a hard contract; if pilot data shows the soft-rejection wait feels too long for legitimate retry cases, ADR-0007 amendment can soften (e.g., allow retry after N seconds even before the previous payment terminates).
  3. The Daraja Reversal API and PesaPal Refund API may have their own rate limits and edge cases (e.g., Daraja Reversal can fail for transactions outside their reversal window). These manifest as payment.reversal.failed events that route to the dead-letter admin queue — operationally handled but not invisible.

Alternatives Considered

| Alternative | Rejected because |
| --- | --- |
| Shorter PesaPal nudge timing (5 min soft / 15 min hard). | Real-world 3D Secure can take 5+ minutes when the issuing bank's OTP SMS is slow. A 5-min nudge interrupts customers mid-payment. The 8-min default respects payment-flow reality; per-tenant override available for tenants whose customers actually move faster. |
| Longer PesaPal nudge timing (12 min soft / 60 min hard). | A 60-minute slot hold ties up the calendar without recovery benefit; customers who haven't returned in 30 min have effectively moved on. Per-tenant override available for higher tolerances. |
| Add t=30s Daraja stkpushquery probe before customer nudge. | Doubles API call volume per payment for negligible information gain: the 30s nudge is informational ("still waiting"), not a state-machine decision. Triple polling at t=15s/t=30s/t=60s is over-engineering for a rare lost-callback case the t=60s one-shot already handles. |
| 60s voice STK hold cap (no wrap-up window). | Abruptly ending the call on STK timeout feels brusque; a customer who's been on hold is in a vulnerable moment. The 30s wrap-up window matters for human-quality voice UX. Per-tenant override available for telephony cost discipline. |
| 120s voice STK hold cap (longer wrap-up). | ~30% more telephony cost per failed flow; risks customer fatigue mid-call. The 90s default is the sweet spot. |
| Hourly public.payment_routing reaper. | Unnecessary load: no operational benefit from cleaning every hour vs every night for a table this small (~12 MB peak unreaped backlog). Hourly is a premature optimization; daily off-peak is the right cadence. |
| On-demand-only public.payment_routing cleanup (no scheduled job). | Unbounded growth failure mode if reaper logic ever fails silently. Bad operational pattern. |
| Extend public.payment_routing TTL to 90 days (match LangGraph checkpoint retention). | Bloats the table roughly 90x (24-hour TTL to 90-day TTL) for < 0.1% of callbacks (late-callback edge case). Dead-letter pattern handles the rare case explicitly without bloating the normal-case routing table. |
| Phone-number fallback for unrouted callbacks. | Introduces ambiguity (which of customer's concurrent bookings?). A4 §3 explicitly rejected phone-lookup as the primary routing pattern; rejecting it as fallback for the same reason. |
| Defer PesaPal entirely from Phase 1. | Loses Phase 1 card capability, a real product gap given Annex B is now part of the canonical vision. PesaPal payment_method filter verification is M5/M6 implementation work; not an ADR-blocking question. |
| Use PesaPal for ALL payments (including M-Pesa). | Burns 3.5% on every M-Pesa transaction, a catastrophic violation of cost discipline (~14x the per-transaction cost of direct Daraja). Contradicts PRD §B.2 routing logic. |
| Concurrent-payment first-callback-wins. | Adds reversal logic for the loser; financial complexity for zero business value (customer doesn't need to pay twice). Soft-rejection of the second initiation is the correct UX. |
| Concurrent-payment both-charge (manual ops refund). | Operationally awful; defers a system bug to a manual ops process. Soft-rejection prevents the situation from arising. |
| Soft cancellation for both rails (no PesaPal void). | Leaves PesaPal hosted-page link live for ~24h; customer who returns to the page can accidentally complete payment after cancelling. PesaPal void API is supported and small to call; using it eliminates the "accidental-completion-after-cancel" surface. |
| Hard cancellation with Daraja Reversal API attempted on every cancel. | Daraja Reversal is for already-completed transactions; calling it on a cancelled-but-not-completed STK push doesn't apply. Reversal only fires when callback arrives for a cancelled payment (D9 lifecycle). |
| Silent cancellation messaging on M-Pesa (don't mention "if you already paid"). | The 1% who actually entered PIN deserve clear expectations. Trust beats avoiding the 99%'s minor "wait, did I pay?" thought. |
| Status-flag-only cancellation (no first-class PAYMENT_CANCELLED_BY_CUSTOMER FSM state). | Inconsistent with how other terminal payment states are first-class. Loses the cleanest analytics signal for "how often do customers cancel mid-payment?"; analytics would have to filter on payments.status instead of querying the FSM-state event log. |

References

  • docs/prd/ratiba-prd.md — §1.4 conversational thesis; §3.2 payments table; Annex A (M-Pesa STK push, Airtel, Equitel); Annex B (PesaPal hosted checkout, IPN callback, multi-method-leak surface)
  • docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25) — LangGraph + TenantScopedSaver model; payment provider list
  • docs/adr/ADR-0002-multi-tenant-isolation.md — D1 public.payment_routing shared bridge table; D4 TenantScopedSaver via per-tenant micro-pools; D7 asyncio contextvar tenant propagation
  • docs/adr/ADR-0003-fsm-persistence.md - conversation_threads pointer table; 90-day retention pattern (D2) inherited by the consolidated reaper job (D5); per-thread Redis SETNX mutex
  • docs/adr/ADR-0005-orchestration-model.md — D2 single bilingual intent classifier (picks up cancellation intent for D9); D6 MCP-shape tool registry (initiate_stk_push, initiate_pesapal_order, cancel_payment are tools with safety_class = irreversible)
  • docs/adr/ADR-0006-handoff-model.md — D4 STK-in-flight handoff interaction (callback always authoritative; system message appended to admin's brief panel); D7 handoff_log 90-day retention pattern
  • docs/research/2026-04-25-payments-orchestration.md — A4 (heavy use throughout; this ADR locks A4's open questions)
  • docs/research/2026-04-25-langgraph-postgressaver-spike.md — TenantScopedSaver wrapper (Option A) used for payment-thread checkpointing
  • docs/research/2026-04-25-orchestration-patterns.md — A1 §5.1 irreversibility rule (payments are irreversible tools requiring AWAIT_CONFIRMATION — driven by safety_class per ADR-0005 D6)
  • docs/research/2026-04-25-human-in-the-loop-handoff.md — A2 §6 voice handoff patterns (hold-the-line-and-text reference for Phase 2 voice card flow)
  • docs/methodology/agentic-development.md — Phase B §5 auto-debug logging schema (payment.race.rejected, payment.callback.unrouted, payment.reversal.attempted / succeeded / failed event types extend §5.2 enum); §6 delegate-vs-human-review boundaries (payment code is human-review-only)
  • docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12 — C2 (public.payment_routing shared-schema primitive)
  • ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_cost_discipline.md — per-conversation cost framing (drives D7 PesaPal card-only commitment)
  • Daraja API documentation — STK push, stkpushquery, Reversal API
  • PesaPal API 3.0 documentation — SubmitOrderRequest (payment_method filter — verify at implementation), IPN webhook contract, Refund / Void APIs
  • LangGraph interrupt() + Command(resume=...) API reference
  • ULID specification — for merchant_reference time-sortable IDs