ADR-0007: Payments Orchestration
Status: Accepted
Date: 2026-04-25
Context
Ratiba's product thesis hinges on payments working cleanly inside the conversation. The customer never leaves WhatsApp for an M-Pesa payment; for card payments via PesaPal they leave briefly but the return-to-conversation must feel seamless. PRD §B.2 specifies a hybrid payment architecture: M-Pesa, Airtel Money, and Equitel via direct provider APIs (Daraja, Airtel Africa, Jenga); cards via PesaPal hosted checkout.
A4 (docs/research/2026-04-25-payments-orchestration.md) settled
the high-level orchestration shape: a single LangGraph payment_node
that calls interrupt() after dispatching the payment to the
provider, suspending the conversation thread until the webhook
handler resolves the payment and resumes via Command(resume={...}).
Same FSM (AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED
/ PAYMENT_CONFIRMED) handles both rails — they differ only in
policy (timeout values, callback shape).
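The single-FSM, two-rail shape can be sketched as one transition table plus a per-rail policy map. This is a minimal illustration, not the A4 implementation: the state names come from this ADR (including the cancellation state it adds later), while `TRANSITIONS`, `RAIL_POLICY`, and `advance` are hypothetical names, and the policy values are the M-Pesa STK timeout and PesaPal hard-abandonment defaults quoted in this document.

```python
from enum import Enum


class PaymentState(Enum):
    AWAITING_PAYMENT = "awaiting_payment"
    PAYMENT_TIMEOUT = "payment_timeout"
    PAYMENT_FAILED = "payment_failed"
    PAYMENT_CONFIRMED = "payment_confirmed"
    PAYMENT_CANCELLED_BY_CUSTOMER = "payment_cancelled_by_customer"  # added by this ADR


# Assumed transition table: AWAITING_PAYMENT is the only non-terminal state.
TRANSITIONS = {
    PaymentState.AWAITING_PAYMENT: {
        PaymentState.PAYMENT_TIMEOUT,
        PaymentState.PAYMENT_FAILED,
        PaymentState.PAYMENT_CONFIRMED,
        PaymentState.PAYMENT_CANCELLED_BY_CUSTOMER,
    },
}

# The rails share the table above and differ only in policy.
RAIL_POLICY = {
    "mpesa": {"timeout_seconds": 60},      # Safaricom-side STK timeout
    "pesapal": {"timeout_seconds": 1800},  # hard-abandonment default
}


def advance(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the target state if the move is allowed, else raise."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the rail-specific values in data rather than in branches is what lets one FSM serve both rails.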
Spec §12 C2 settled the new architectural primitive that ADR-0002
D1 codified: public.payment_routing shared bridge table that maps
merchant_reference → (tenant_id, schema_name, thread_id). Required
because Daraja and PesaPal webhooks arrive in tenant-less context
and need a shared lookup to resolve the right tenant before
switching schemas.
ADR-0006 D4 settled the STK-in-flight handoff interaction: when a confidence trigger fires while a payment is in flight, handoff suspends the thread but the M-Pesa callback (when it arrives) is always authoritative — it resolves the payment row + appends a system message to admin's brief panel without resuming the conversation thread.
What ADR-0007 settles is the operational specifics A4 punted to
follow-up: PesaPal nudge/abandon timing, Daraja stkpushquery
polling cadence, voice STK hold cap, public.payment_routing
reaper cadence, late-callback handling, PesaPal card-only policy
commitment, concurrent-payment semantics, and customer-initiated
cancellation handling (the case Adrian flagged in the 2026-04-25
ADR-0007 brainstorm: customer wants to cancel a pending payment to
re-select a service).
Decision
Eight specific decisions, organized as a coherent payments orchestration model.
1. Architectural recap (inherited, locked here)
| Aspect | Where decided | Recap |
|---|---|---|
| Lifecycle shape | A4 §1 | Initiate → suspend (interrupt()) → callback → resume (Command(resume={...})) |
| Single FSM, two rails | A4 §2 | AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED / PAYMENT_CONFIRMED (this ADR D9 adds PAYMENT_CANCELLED_BY_CUSTOMER) |
| Routing table | spec §12 C2, ADR-0002 D1, A4 §3 | public.payment_routing(merchant_reference PK, tenant_id, schema_name, thread_id, provider, expires_at) |
| payments table additions per A4 §3 | A4 §3 | merchant_reference VARCHAR(40) UNIQUE NOT NULL; thread_id VARCHAR(100) NOT NULL; provider VARCHAR(20) NOT NULL CHECK IN ('mpesa', 'airtel', 'equitel', 'pesapal') |
| merchant_reference format | A4 §3 | RTB-{tenant_short}-{ulid}; the 12-char Daraja AccountReference constraint is accommodated by passing tenant_short alone; ULID is merchant_reference PK + Daraja TransactionDesc |
| Triple-layer idempotency | A4 §5 | (1) Postgres mpesa_receipt UNIQUE DEFERRABLE; (2) Redis dedupe key on provider's CheckoutRequestID / OrderTrackingId, 24h TTL; (3) SELECT FOR UPDATE on payments.status before resume |
| TenantScopedSaver via for_tenant factory | ADR-0001 amendment + spike Option A + ADR-0002 D4 + A4 §8 | Webhook handler asks TenantScopedSaverFactory.for_tenant(tenant_id); never constructs a saver against a specific schema directly |
| persist_payment_routing BEFORE provider.initiate_payment | A4 §8 | Non-negotiable order. If reversed, callback can beat us to the lookup table and the resume is lost (Daraja sandbox callbacks have been observed at sub-200ms in the field) |
| STK-in-flight handoff interaction | ADR-0006 D4 | Handoff suspends thread; M-Pesa callback always authoritative; resolves payment row + appends system message to admin's brief panel without resuming the conversation thread |
| Hold-the-line-and-text pattern (voice card flow) | A4 §6 + ADR-0006 D3 | After 25s on voice for a card payment: agent says "I'll text you the link and call back when confirmed," ends call, webhook handler triggers LiveKit outbound SIP callback on PAYMENT_CONFIRMED (Phase 2 only; Phase 1 voice card flow defers to WhatsApp follow-up per ADR-0006 D3) |
ADR-0007 builds on these without re-deriving them.
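The persist-before-initiate ordering from the recap can be illustrated with a provider whose callback fires before `initiate_payment` returns. This is a sketch under stated assumptions: `ROUTING` is an in-memory stand-in for public.payment_routing, `InstantCallbackProvider` and `start_payment` are hypothetical, and `new_ulid` is a compact stand-in for a ULID library (48-bit millisecond timestamp plus 80 random bits in Crockford base32).

```python
import secrets
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
ROUTING = {}  # in-memory stand-in for public.payment_routing


def new_ulid() -> str:
    """48-bit ms timestamp + 80 random bits, Crockford base32, 26 chars."""
    value = (int(time.time() * 1000) << 80) | secrets.randbits(80)
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))


def persist_payment_routing(ref, tenant_id, schema_name, thread_id, provider):
    ROUTING[ref] = {
        "tenant_id": tenant_id,
        "schema_name": schema_name,
        "thread_id": thread_id,
        "provider": provider,
    }


def handle_callback(ref):
    """Webhook side: the routing row, or None if the resume would be lost."""
    return ROUTING.get(ref)


class InstantCallbackProvider:
    """Hypothetical provider whose callback arrives before initiate returns."""

    def __init__(self):
        self.resolved = []

    def initiate_payment(self, ref):
        self.resolved.append(handle_callback(ref))


def start_payment(client, tenant_id, tenant_short, schema_name, thread_id, provider):
    ref = f"RTB-{tenant_short}-{new_ulid()}"
    # Step 1: make the callback routable.
    persist_payment_routing(ref, tenant_id, schema_name, thread_id, provider)
    # Step 2: only now talk to the provider.
    client.initiate_payment(ref)
    return ref
```

Swapping the two steps in `start_payment` makes `handle_callback` return None for the instant callback, which is the lost-resume case the ordering rule exists to prevent. With the fixed `RTB-` prefix, separator, and 26-character ULID, the VARCHAR(40) column leaves room for a `tenant_short` of at most 9 characters.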
2. PesaPal nudge timing — 8 minutes soft / 30 minutes hard
PesaPal hosted checkout has no fixed timeout (customer is in browser doing 3D Secure, which can take 5+ minutes if issuing-bank OTP SMS is slow — common in Kenya). This ADR locks two timing parameters:
| Parameter | Default | Column |
|---|---|---|
| Soft nudge ("did you finish? tap the link again if you closed the page" + resends link) | 8 minutes | public.tenants.pesapal_nudge_seconds (default 480) |
| Hard abandonment (payments.status='timeout', free slot hold, agent says "looks like the payment didn't go through, want to try again or pay at the venue?") | 30 minutes | public.tenants.pesapal_abandon_seconds (default 1800) |
    ALTER TABLE public.tenants
        ADD COLUMN pesapal_nudge_seconds INTEGER NOT NULL DEFAULT 480,
        ADD COLUMN pesapal_abandon_seconds INTEGER NOT NULL DEFAULT 1800;
Why these values. Real-world 3D Secure can take 5+ minutes when the issuing bank's OTP SMS is slow. A 5-minute nudge would interrupt customers mid-payment. 30-minute hard abandonment is the natural window after which a customer who hasn't returned has effectively moved on; longer windows tie up the slot hold without recovery benefit.
Per-tenant configurable lets tenants with high-friction payment demographics (e.g., dental clinics with older customer base, more 3DS issues) loosen the threshold without code changes. Eval-tunable after first 100 production PesaPal transactions per pilot tenant.
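The two thresholds reduce to a small decision function evaluated whenever the orchestrator checks on a pending PesaPal order. A minimal sketch: the defaults mirror the two tenant columns above, while `PesapalTiming`, `pesapal_action`, and the action labels are illustrative names, not part of A4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PesapalTiming:
    nudge_seconds: int = 480     # public.tenants.pesapal_nudge_seconds
    abandon_seconds: int = 1800  # public.tenants.pesapal_abandon_seconds


def pesapal_action(elapsed_seconds: int,
                   timing: PesapalTiming = PesapalTiming(),
                   already_nudged: bool = False) -> str:
    """Return 'wait', 'nudge', or 'abandon' for a pending PesaPal order."""
    if elapsed_seconds >= timing.abandon_seconds:
        return "abandon"  # status='timeout', free the slot hold
    if elapsed_seconds >= timing.nudge_seconds and not already_nudged:
        return "nudge"    # resend the link once
    return "wait"
```

A tenant that loosens the thresholds passes its own values, for example `PesapalTiming(nudge_seconds=600, abandon_seconds=2400)`, with no code change.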
3. Daraja stkpushquery polling — one-shot at t=60s only
Daraja's STK push has a 60-second timeout on Safaricom's side. The
stkpushquery API is the authoritative way to check status when
the callback hasn't arrived. This ADR locks the polling cadence at
one-shot at t=60s as belt-and-braces final check before the FSM
transitions to PAYMENT_TIMEOUT.
The t=30s customer-facing nudge ("still waiting, check your phone")
specified in A4 §4 is a pure UX gesture — it does not require a
backing API check. We do not poll stkpushquery at t=30s.
Why one-shot only. Daraja's median callback latency is 8-15s in the field; if no callback by 60s, something is genuinely off (lossy callback delivery, customer ignored prompt, customer on slow network). One query at t=60s catches the lossy-callback case authoritatively; doubling the call volume to add a t=30s probe buys nothing decision-relevant.
Cost discipline: stkpushquery calls count against Safaricom's
per-shortcode quota; minimizing unnecessary calls preserves
headroom for legitimate retries.
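The cadence can be sketched with asyncio: wait for the callback, send the UX-only nudge at t=30s, and issue exactly one status query at t=60s. This is an illustration under assumptions: `callback_event` is set by the webhook handler, `query_status` is a hypothetical async wrapper around stkpushquery, `send_nudge` is a hypothetical messaging hook, and the result labels are placeholders.

```python
import asyncio


async def _callback_arrived(event: asyncio.Event, seconds: float) -> bool:
    """True if the webhook set the event within `seconds`."""
    try:
        await asyncio.wait_for(event.wait(), timeout=seconds)
        return True
    except asyncio.TimeoutError:
        return False


async def wait_for_stk(callback_event, query_status, send_nudge,
                       nudge_at: float = 30.0, timeout_at: float = 60.0) -> str:
    # Phase 1: wait until the nudge point. No provider call is made here.
    if await _callback_arrived(callback_event, nudge_at):
        return "callback"
    await send_nudge()  # "still waiting, check your phone"; pure UX
    # Phase 2: wait out the rest of the STK window.
    if await _callback_arrived(callback_event, timeout_at - nudge_at):
        return "callback"
    # One-shot authoritative check before PAYMENT_TIMEOUT.
    status = await query_status()
    return status if status in {"confirmed", "failed"} else "timeout"
```

The durations are parameters so the flow can be exercised with millisecond values in tests; production would use the 30s and 60s defaults.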
4. Voice STK hold cap — 90 seconds
When a voice call enters the M-Pesa STK flow, the agent holds the line while waiting for the callback. This ADR locks 90 seconds as the hard cap (60s for STK + 30s for graceful wrap-up).
    ALTER TABLE public.tenants
        ADD COLUMN voice_stk_max_hold_seconds INTEGER NOT NULL DEFAULT 90;
Lifecycle:
    t=0s     STK dispatched. Agent says (in detected language):
             "Nitaomba M-Pesa prompt sasa hivi. Itakuwa kwenye simu yako
             baada ya sekunde tano. Weka PIN yako."
    t=15s    Soft hold filler ("Bado tunasubiri…")
    t=30s    Soft hold filler again
    t=45s    Soft hold filler again
    t=60s    STK timeout (no callback). Agent: "Hatuoni malipo. Tujaribu
             tena, ulipe baadaye, au sema 'simu' kupitishia kupiga simu
             tena baadaye?"
    t=60-90s wrap-up: customer responds; agent confirms next step
             (text new link via WhatsApp, pay at venue, or schedule
             callback)
    t=90s    Hard end of call regardless of state
Why 90s. The 30s wrap-up window matters for human-quality voice UX. Customer who's been on hold for 60s waiting for an M-Pesa prompt is in a vulnerable moment; abruptly ending the call (60s cap) feels brusque. 30s is enough for one back-and-forth; locking at 90s means the call gracefully concludes without awkward stretches into 120s+.
Per-tenant configurable. High-volume tenants where telephony cost discipline matters more than the wrap-up smoothness can tune down.
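The hold timeline follows mechanically from the cap, the STK timeout, and the filler interval. A small sketch, assuming hypothetical event labels and a 15-second filler interval taken from the lifecycle above; only the 90-second default corresponds to a real column (voice_stk_max_hold_seconds).

```python
def voice_hold_schedule(max_hold_seconds: int = 90,
                        stk_timeout_seconds: int = 60,
                        filler_every_seconds: int = 15):
    """Return (second, event) pairs for a voice STK hold."""
    events = [(0, "stk_dispatched")]
    t = filler_every_seconds
    while t < stk_timeout_seconds:
        events.append((t, "hold_filler"))
        t += filler_every_seconds
    events.append((stk_timeout_seconds, "stk_timeout_prompt"))
    # A tenant that tunes the cap down loses whatever falls at or after it.
    events = [e for e in events if e[0] < max_hold_seconds]
    events.append((max_hold_seconds, "hard_end"))
    return events
```

With the cap tuned down to 60 seconds the timeout prompt drops out and the call ends at the STK timeout, which is the brusque behaviour the 30-second wrap-up window is meant to avoid.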
5. Daily reaper at 3 AM EAT — consolidated nightly maintenance
public.payment_routing rows are reaped past their 24-hour
expires_at. This ADR locks the reaper cadence at daily at 3 AM
EAT (East Africa Time), consolidated with the other nightly mover
jobs:
| Job | Source ADR | Action |
|---|---|---|
| public.payment_routing reaper | this ADR D5 | DELETE rows WHERE expires_at < NOW() |
| Per-tenant checkpoints_archive mover | ADR-0003 D2 | INSERT INTO archive + DELETE from live (rows older than 90 days) |
| Per-tenant handoff_log_archive mover | ADR-0006 D7 | INSERT INTO archive + DELETE from live (rows older than 90 days) |
Single cron entry (scripts/nightly-maintenance.sh); single ops
dashboard surfaces job outcomes; single failure-handling code path.
Why daily, not hourly. The public.payment_routing table is
small. Per-row ~120 bytes; at 1k tenants × 100 payments/day = 100k
rows/day. After 24h all expired. Peak unreaped backlog ~12 MB.
Hourly reaping is unnecessary load — no operational benefit from
cleaning every hour vs every night for a table this small.
3 AM EAT is the canonical low-traffic window for Ratiba's target market: it is overnight in Kenya, and weekly retail rhythms permit longer maintenance windows at that hour.
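The sizing estimate behind the daily cadence is plain arithmetic and can be reproduced directly. The inputs are the figures quoted above (about 120 bytes per row, 1k tenants, 100 payments per tenant per day); the function name is illustrative.

```python
def peak_backlog_mb(tenants: int = 1_000,
                    payments_per_tenant_per_day: int = 100,
                    row_bytes: int = 120) -> float:
    """Approximate size of one day's worth of payment_routing rows, in MB."""
    rows_per_day = tenants * payments_per_tenant_per_day  # 100_000 rows
    return rows_per_day * row_bytes / 1_000_000


print(peak_backlog_mb())  # 12.0
```

The estimate counts row payload only; index and tuple overhead would raise the real figure somewhat without changing the conclusion.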
6. Late-callback dead-letter table
The edge case A4 §9 #6 surfaced: a callback arrives 24+ hours after
STK push initiation. The public.payment_routing row has been
reaped (D5 = daily TTL); the callback handler can't resolve
merchant_reference → (tenant, thread_id). This ADR specifies a
dead-letter pattern.
New table in public schema:
    CREATE TABLE public.payment_callbacks_unrouted (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        provider VARCHAR(20) NOT NULL CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal')),
        raw_payload JSONB NOT NULL,
        received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        attempted_merchant_reference VARCHAR(40),
        reviewed_at TIMESTAMPTZ,
        reviewed_by UUID, -- staff user; null until reviewed
        resolution_note TEXT
    );

    CREATE INDEX idx_payment_callbacks_unrouted_unreviewed
        ON public.payment_callbacks_unrouted (received_at)
        WHERE reviewed_at IS NULL;
Lifecycle:
    1. Callback arrives at /api/v1/webhooks/<provider>/callback/<secret>
    2. Handler parses provider's callback into unified shape (extracts
       checkout_id, merchant_reference)
    3. Layer 2 idempotency check (Redis dedupe key): skip if duplicate
    4. Lookup public.payment_routing by merchant_reference
       ├── HIT  → proceed to Layer 3 + resume per A4 §8
       └── MISS → INSERT INTO public.payment_callbacks_unrouted
                  with full raw_payload + attempted_merchant_reference
                  Log event_type=payment.callback.unrouted
    5. Always return 200 to provider (avoid retry amplification)
Daily ops dashboard surfaces unreviewed rows. Manual
reconciliation flow: ops looks at attempted_merchant_reference,
finds the corresponding payments row in tenant schemas (or
payments_archive if old), updates manually if appropriate.
Reversal/refund attempted via provider API if customer was charged
but no booking was created.
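The callback lifecycle above can be sketched as a single handler over in-memory stand-ins. Assumptions are loud here: `CallbackStore` replaces Redis, public.payment_routing, and public.payment_callbacks_unrouted with plain Python containers, the payload is taken to be already parsed into the unified shape, and the string labels returned alongside the status code are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class CallbackStore:
    seen_checkout_ids: set = field(default_factory=set)  # Redis dedupe stand-in
    routing: dict = field(default_factory=dict)          # payment_routing stand-in
    unrouted: list = field(default_factory=list)         # dead-letter stand-in


def handle_callback(store: CallbackStore, provider: str, payload: dict):
    """Return (http_status, outcome). The status is always 200."""
    checkout_id = payload.get("checkout_id")
    merchant_reference = payload.get("merchant_reference")

    # Layer 2 idempotency: drop provider retries of the same callback.
    if checkout_id in store.seen_checkout_ids:
        return 200, "duplicate"
    store.seen_checkout_ids.add(checkout_id)

    route = store.routing.get(merchant_reference)
    if route is None:
        # MISS: keep the full payload for manual reconciliation.
        store.unrouted.append({
            "provider": provider,
            "raw_payload": payload,
            "attempted_merchant_reference": merchant_reference,
        })
        return 200, "unrouted"

    # HIT: real code would continue to Layer 3 and resume the thread.
    return 200, "routed"
```

Returning 200 on every branch is deliberate: a non-2xx answer to an unroutable callback would only invite provider retries that can never succeed.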
Why dead-letter over alternatives. Late callbacks are
vanishingly rare (Daraja and PesaPal callback infrastructure
generally lands within minutes); designing the normal path around
the rare case is wrong. Extending payment_routing TTL to 90 days
to "match" checkpoint retention bloats the table ~90x (24 hours → 90 days) for < 0.1%
of callbacks. Phone-number fallback introduces ambiguity (which of
the customer's three concurrent bookings does this callback belong
to?) that A4 §3 explicitly rejected.
7. PesaPal exclusively for cards — policy commitment
Annex B §B.2 surfaced the PesaPal multi-method commission leak: the hosted page shows ALL payment methods by default, so a customer who selected "Card" in WhatsApp could pay via M-Pesa on the PesaPal page → merchant pays 3.5% PesaPal surcharge instead of the cheap direct-Daraja rate.
This ADR locks an unconditional policy commitment:
PesaPal is exclusively the card rail. M-Pesa, Airtel Money, and Equitel always use their direct provider APIs (Daraja, Airtel Africa, Jenga). The orchestrator's payment-method-selection step never routes mobile-money intents through PesaPal.
Implementation:
- Customer's payment-method choice in WhatsApp (button-based per PRD §B.2) drives provider routing in backend/app/services/payments/router.py:

      PROVIDER_BY_METHOD = {
          "mpesa": "daraja",
          "airtel": "airtel_africa",
          "equitel": "jenga",
          "card": "pesapal",
      }

- PesaPal SubmitOrderRequest is sent with the payment_method = "CARD" filter (per A4 §7 option (c)).
- Implementation verifies whether PesaPal API 3.0 actually supports the filter at integration time (M5/M6 sandbox testing).
  - If filter works → hosted page locked to card-only.
  - If filter unavailable → leak accepted as known-risk. Small surface: only fires when a customer who selected "Card" deliberately navigates to non-card buttons on PesaPal's page. Lost margin bounded; merchant reconciliation gets a footnote.
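The card-only commitment can also be enforced as a runtime invariant around the routing map, so that a future edit to the map cannot silently send mobile money through PesaPal. A minimal sketch: the map is the one from router.py quoted above, while `MOBILE_MONEY_METHODS` and `provider_for` are hypothetical additions.

```python
PROVIDER_BY_METHOD = {
    "mpesa": "daraja",
    "airtel": "airtel_africa",
    "equitel": "jenga",
    "card": "pesapal",
}

MOBILE_MONEY_METHODS = frozenset({"mpesa", "airtel", "equitel"})


def provider_for(method: str) -> str:
    """Resolve the provider for a customer-selected payment method."""
    try:
        provider = PROVIDER_BY_METHOD[method]
    except KeyError:
        raise ValueError(f"unsupported payment method: {method!r}") from None
    # Policy guard: mobile money must never be routed through PesaPal.
    if method in MOBILE_MONEY_METHODS and provider == "pesapal":
        raise RuntimeError(f"policy violation: {method} routed to pesapal")
    return provider
```

The guard is redundant while the map is correct; its value is that a misconfiguration fails loudly at routing time rather than showing up later as a surcharge on reconciliation.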
Why unconditional. Routing M-Pesa through PesaPal would burn 3.5% on every M-Pesa transaction — catastrophic violation of the cost-discipline principle (~14x the per-transaction cost of direct Daraja). The leak is a small surface; deferring PesaPal entirely to ship Phase 1 with M-Pesa-only would lose card capability that PRD Annex B added as part of the canonical product vision.
8. Concurrent-payment prohibition per booking thread
A4 §9 #9 surfaced the racing-retries case: a second payment
initiation arrives while the first is still in AWAITING_PAYMENT.
This ADR enforces single-payment-in-flight per thread.
Mechanism:
- FSM-level: AWAITING_PAYMENT is a single state; the FSM cannot dispatch a second STK push or PesaPal order without first transitioning out (to a terminal payment state: PAYMENT_TIMEOUT, PAYMENT_FAILED, PAYMENT_CONFIRMED, or PAYMENT_CANCELLED_BY_CUSTOMER per D9).
- Database-level (Layer 3 idempotency): SELECT FOR UPDATE on payments.status in the resume handler (per A4 §5) catches the race even if the FSM check fails. A row with status in ('initiated', 'pending') means a payment is in flight; the second initiation is rejected.
- Customer-facing UX on rejection: soft message in detected language. English: "One moment please, we're still processing your previous payment." Swahili: "Subiri kidogo, tunashughulikia malipo yako ya awali."
- Observability: every rejected retry logs event_type=payment.race.rejected per Phase B §5.2 with both contenders' correlation_ids populated.
Important interaction with D9: PAYMENT_CANCELLED_BY_CUSTOMER
is a terminal state. Cancellation explicitly clears the path for a
fresh payment attempt; the prohibition ONLY blocks against
non-terminal in-flight payments.
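The single-in-flight rule and its interaction with cancellation can be sketched over a list of payment rows. This is a single-threaded illustration only: the real check relies on SELECT FOR UPDATE for concurrency safety, and `initiate_payment`, `PaymentRaceRejected`, and the row shape are hypothetical. The in-flight statuses are the two named in the mechanism above.

```python
IN_FLIGHT_STATUSES = {"initiated", "pending"}


class PaymentRaceRejected(Exception):
    """Raised when a thread already has a non-terminal payment."""


def initiate_payment(payments: list, thread_id: str, merchant_reference: str) -> dict:
    """Append a new payment row unless the thread has one in flight."""
    for row in payments:
        if row["thread_id"] == thread_id and row["status"] in IN_FLIGHT_STATUSES:
            # Real code would also log event_type=payment.race.rejected.
            raise PaymentRaceRejected(row["merchant_reference"])
    row = {
        "thread_id": thread_id,
        "merchant_reference": merchant_reference,
        "status": "initiated",
    }
    payments.append(row)
    return row
```

Once the first row moves to any terminal status, including cancelled_by_customer, the same call succeeds for a fresh reference.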
Why not allow concurrent payments. Concurrent-payment handling introduces financial complexity (which one wins? what about the loser? do we refund automatically?) that has zero business value. The customer doesn't need to pay twice; the merchant doesn't want to refund. Rejecting cleanly with a friendly message is the correct UX — the customer's previous payment is still resolving; making them wait 60s is far better than risking a double-charge.
9. Customer-initiated cancellation — hybrid provider-specific
The case Adrian flagged in the 2026-04-25 brainstorm: a customer selects a service, gets the payment link, then changes their mind ("actually I want the deep tissue instead") before paying. The in-flight payment must be cancelled cleanly to allow a second payment attempt.
New first-class FSM state: PAYMENT_CANCELLED_BY_CUSTOMER
joins AWAITING_PAYMENT / PAYMENT_TIMEOUT / PAYMENT_FAILED /
PAYMENT_CONFIRMED from A4 §2. Terminal state; D8's
concurrent-prohibition does NOT block subsequent initiations.
New payments.status value: cancelled_by_customer.
Trigger detection. Customer expresses cancel intent during
AWAITING_PAYMENT via:
- Natural language ("actually wait, I want the deep tissue instead", "futa hii", "cancel this please") — picked up by the intent classifier (per ADR-0005 D2 single bilingual prompt).
- Slash command (rare for customer side; supported for uniformity).
- Explicit "Cancel" button if the FSM emitted one in the prior turn.
Provider-specific cancellation mechanics (the hybrid):
M-Pesa STK (Daraja has no direct cancel API):
- FSM transitions AWAITING_PAYMENT → PAYMENT_CANCELLED_BY_CUSTOMER.
- payments row updated: status = 'cancelled_by_customer', cancelled_at = NOW().
- Agent responds in detected language (upfront expectation-setting per Fork A): "Sawa, nimebatilisha. Ukiweka PIN tayari, tutarudisha pesa zako ndani ya saa 24." / "OK, cancelling. If you already entered your PIN, we'll refund within 24h."
- FSM transitions back to CONFIRM (or wherever customer redirected; likely SERVICE for a re-selection).
- STK either times out naturally (60s) OR callback arrives if customer entered PIN before cancelling. On callback arrival:
  - Layer 3 sees status = 'cancelled_by_customer' → does NOT resume the booking flow.
  - Auto-trigger Daraja Reversal API with the callback's receipt number. Reversal is idempotent on Daraja's side.
  - Log event_type=payment.reversal.attempted per Phase B §5.2.
  - On Reversal API success: log payment.reversal.succeeded; audit-only admin notification.
  - On Reversal API failure: log payment.reversal.failed; action-required admin alert; row inserted into public.payment_callbacks_unrouted (re-using the dead-letter pattern from D6) with resolution_note = 'reversal_failed'.

PesaPal hosted (PesaPal supports voiding unpaid orders):
- FSM transitions same as above.
- payments row updated same as above.
- Actively call PesaPal void/cancel API on the order. The call is fire-and-forget (we don't block the FSM transition on it).
- Agent responds: "Sawa, link imefutwa." / "OK, the payment link has been cancelled." (No "if you already paid" caveat needed, since void usually takes effect before customer can complete 3D Secure.)
- FSM transitions back.
- If callback arrives (customer completed before void took effect):
  - Layer 3 sees status = 'cancelled_by_customer' → does NOT resume the booking flow.
  - Auto-trigger PesaPal refund API.
  - Same logging + admin notification pattern as M-Pesa.
Common to both rails:
- Three new event_types per Phase B §5.2: payment.reversal.attempted, payment.reversal.succeeded, payment.reversal.failed.
- Admin gets notified for any reversal/refund outcome (success = audit-only; failure = action-required).
- The customer's next message after cancellation triggers a fresh booking flow turn: the FSM is in CONFIRM or SERVICE, ready for a new choice.
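The callback-after-cancel path shared by both rails can be sketched as one function. Assumptions: `reverse` is a hypothetical callable wrapping either the Daraja Reversal API or the PesaPal refund API and raising on failure; `events` and `dead_letter` are in-memory stand-ins for the Phase B event log and public.payment_callbacks_unrouted; the return labels are illustrative.

```python
def handle_callback_for_payment(payment: dict, receipt: str, reverse,
                                events: list, dead_letter: list) -> str:
    """Decide what a provider callback does given the payment's status."""
    if payment["status"] != "cancelled_by_customer":
        return "resume"  # normal path: Layer 3 then resume the thread

    # Cancelled but charged: never resume, always try to return the money.
    events.append("payment.reversal.attempted")
    try:
        reverse(receipt)
    except Exception:
        events.append("payment.reversal.failed")
        dead_letter.append({
            "receipt": receipt,
            "resolution_note": "reversal_failed",
        })
        return "action_required"  # admin alert
    events.append("payment.reversal.succeeded")
    return "audit_only"  # admin notification for the record
```

Every branch that touches money emits the attempted event first, so a crash between the attempt and its outcome still leaves a trace for ops to follow up.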
Why upfront customer messaging on M-Pesa (Fork A = (1)). The 1% of customers who actually entered PIN deserve clear expectations about the refund window. The 99% who didn't will mentally discard the "if you" qualifier. Trust > avoiding minor anxiety.
Why first-class FSM state (Fork B = (X)). Consistent with how
PAYMENT_TIMEOUT, PAYMENT_FAILED, PAYMENT_CONFIRMED are
first-class states. The audit signal matters for "how often do
customers cancel mid-payment?" analytics — directly visible in
dashboards without having to filter on payments.status.
Consequences
Positive.
- Per-tenant tunability extended to PesaPal timing + voice STK cap (D2, D4) means timing parameters are eval-driven, not deploy-driven. A high-volume tenant tunes voice cost; high-friction-payment tenants loosen PesaPal timing.
- Cancellation handling preserves payment fidelity even in race conditions (D9). The auto-reverse paths for both rails mean customer money is never permanently misallocated.
- Dead-letter pattern handles late callbacks operationally (D6). Rare events get explicit operator visibility instead of silent failure.
- PesaPal card-only policy keeps cost discipline (D7). 3.5% surcharge only on card volume (the cheaper-to-process-on-PesaPal choice); never on M-Pesa volume (where direct Daraja saves the 3.5%).
- Concurrent-prohibition simplifies the FSM and eliminates a class of double-charge bugs (D8). The single-in-flight invariant is structurally enforced; financial complexity from reversals/refunds is bounded to the deliberate-cancellation case only.
- Daily consolidated reaper job (D5) means one nightly maintenance window, one ops dashboard, one failure handling path.
- Cancellation FSM state is first-class (D9 Fork B) so "cancellation rate" becomes a directly-queryable analytics signal.
Negative.
- public.payment_callbacks_unrouted dead-letter table requires daily review process. Adds an ops touchpoint; if review is neglected the table grows unbounded. Mitigation: ops dashboard surfaces unreviewed rows; weekly automated reminder if count(unreviewed) > 0.
- Auto-reversal/refund logic adds two new provider API integrations (Daraja Reversal API, PesaPal Refund API). Both exist but have their own auth, error modes, and rate limits. Implementation cost real but bounded.
- Cancellation messaging on M-Pesa (the "if you entered PIN" caveat) is a slightly anxiety-inducing message for the 99% of customers who didn't pay. Mitigation: phrasing is gentle ("if you already entered" not "we may have charged you"), ordering leads with cancellation confirmation.
- Per-tenant tunability columns continue accumulating on public.tenants: this ADR adds three more (pesapal_nudge_seconds, pesapal_abandon_seconds, voice_stk_max_hold_seconds). Cumulative growth across ADR-0005 (3 columns), ADR-0006 (4 columns), ADR-0007 (3 columns) is 10 new columns + 3 JSONB. Migration coordination matters; documented schema-evolution log at docs/architecture/tenants-schema-evolution.md maintained alongside the migrations.
- PesaPal payment_method filter remains verify-at-implementation-time. If filter is unavailable in API 3.0, the leak is accepted as known-risk. Bounded surface but non-zero.
Neutral.
- 8/30/60/90 timing values are starting, eval-tunable per pilot. First 100 production transactions per pilot tenant reveal the actual distribution.
- Concurrent-payment prohibition is a hard contract; if pilot data shows the soft-rejection wait feels too long for legitimate retry cases, ADR-0007 amendment can soften (e.g., allow retry after N seconds even before the previous payment terminates).
- The Daraja Reversal API and PesaPal Refund API may have their own rate limits and edge cases (e.g., Daraja Reversal can fail for transactions outside their reversal window). These manifest as payment.reversal.failed events that route to the dead-letter admin queue: operationally handled but not invisible.
Alternatives Considered
| Alternative | Rejected because |
|---|---|
| Shorter PesaPal nudge timing (5 min soft / 15 min hard). | Real-world 3D Secure can take 5+ minutes when the issuing bank's OTP SMS is slow. 5-min nudge interrupts customers mid-payment. The 8-min default respects payment-flow reality; per-tenant override available for tenants whose customers actually move faster. |
| Longer PesaPal nudge timing (12 min soft / 60 min hard). | 60-minute slot hold ties up the calendar without recovery benefit; customers who haven't returned in 30 min have effectively moved on. Per-tenant override available for higher tolerances. |
| Add t=30s Daraja stkpushquery probe before customer nudge. | Doubles API call volume per payment for negligible information gain: the 30s nudge is informational ("still waiting"), not a state-machine decision. Triple polling at t=15s/t=30s/t=60s is over-engineering for a rare lost-callback case the t=60s one-shot already handles. |
| 60s voice STK hold cap (no wrap-up window). | Abruptly ending the call on STK timeout feels brusque; customer who's been on hold is in a vulnerable moment. 30s wrap-up window matters for human-quality voice UX. Per-tenant override available for telephony cost discipline. |
| 120s voice STK hold cap (longer wrap-up). | ~30% more telephony cost per failed flow; risks customer fatigue mid-call. The 90s default is the sweet spot. |
| Hourly public.payment_routing reaper. | Unnecessary load: no operational benefit from cleaning every hour vs every night for a table this small (~12 MB peak unreaped backlog). Hourly is premature optimization; daily off-peak is the right cadence. |
| On-demand-only public.payment_routing cleanup (no scheduled job). | Unbounded growth failure mode if reaper logic ever fails silently. Bad operational pattern. |
| Extend public.payment_routing TTL to 90 days (match LangGraph checkpoint retention). | Bloats the table ~90x (24-hour TTL → 90 days) for < 0.1% of callbacks (late-callback edge case). Dead-letter pattern handles the rare case explicitly without bloating the normal-case routing table. |
| Phone-number fallback for unrouted callbacks. | Introduces ambiguity (which of customer's concurrent bookings?). A4 §3 explicitly rejected phone-lookup as the primary routing pattern; rejecting it as fallback for the same reason. |
| Defer PesaPal entirely from Phase 1. | Loses Phase 1 card capability — a real product gap given Annex B is now part of the canonical vision. PesaPal payment_method filter verification is M5/M6 implementation work; not an ADR-blocking question. |
| Use PesaPal for ALL payments (including M-Pesa). | Burns 3.5% on every M-Pesa transaction — catastrophic violation of cost discipline (~14x the per-transaction cost of direct Daraja). Contradicts PRD §B.2 routing logic. |
| Concurrent-payment first-callback-wins. | Adds reversal logic for the loser; financial complexity for zero business value (customer doesn't need to pay twice). Soft-rejection of the second initiation is the correct UX. |
| Concurrent-payment both-charge (manual ops refund). | Operationally awful; defers a system bug to a manual ops process. Soft-rejection prevents the situation from arising. |
| Soft cancellation for both rails (no PesaPal void). | Leaves PesaPal hosted-page link live for ~24h; customer who returns to the page can accidentally complete payment after cancelling. PesaPal void API is supported and small to call; using it eliminates the "accidental-completion-after-cancel" surface. |
| Hard cancellation with Daraja Reversal API attempted on every cancel. | Daraja Reversal is for already-completed transactions; calling it on a cancelled-but-not-completed STK push doesn't apply. Reversal only fires when callback arrives for a cancelled payment (D9 lifecycle). |
| Silent cancellation messaging on M-Pesa (don't mention "if you already paid"). | The 1% who actually entered PIN deserve clear expectations. Trust beats avoiding the 99%'s minor "wait, did I pay?" thought. |
| Status-flag-only cancellation (no first-class PAYMENT_CANCELLED_BY_CUSTOMER FSM state). | Inconsistent with how other terminal payment states are first-class. Loses the cleanest analytics signal for "how often do customers cancel mid-payment?"; analytics would have to filter on payments.status instead of querying the FSM-state event log. |
References
- docs/prd/ratiba-prd.md: §1.4 conversational thesis; §3.2 payments table; Annex A (M-Pesa STK push, Airtel, Equitel); Annex B (PesaPal hosted checkout, IPN callback, multi-method-leak surface)
- docs/adr/ADR-0001-tech-stack.md (amended 2026-04-25): LangGraph + TenantScopedSaver model; payment provider list
- docs/adr/ADR-0002-multi-tenant-isolation.md: D1 public.payment_routing shared bridge table; D4 TenantScopedSaver via per-tenant micro-pools; D7 asyncio contextvar tenant propagation
- docs/adr/ADR-0003-fsm-persistence.md: conversation_threads pointer table; 90-day retention pattern (D2) inherited by the consolidated reaper job (D5); per-thread Redis SETNX mutex
- docs/adr/ADR-0005-orchestration-model.md: D2 single bilingual intent classifier (picks up cancellation intent for D9); D6 MCP-shape tool registry (initiate_stk_push, initiate_pesapal_order, cancel_payment are tools with safety_class = irreversible)
- docs/adr/ADR-0006-handoff-model.md: D4 STK-in-flight handoff interaction (callback always authoritative; system message appended to admin's brief panel); D7 handoff_log 90-day retention pattern
- docs/research/2026-04-25-payments-orchestration.md: A4 (heavy use throughout; this ADR locks A4's open questions)
- docs/research/2026-04-25-langgraph-postgressaver-spike.md: TenantScopedSaver wrapper (Option A) used for payment-thread checkpointing
- docs/research/2026-04-25-orchestration-patterns.md: A1 §5.1 irreversibility rule (payments are irreversible tools requiring AWAIT_CONFIRMATION, driven by safety_class per ADR-0005 D6)
- docs/research/2026-04-25-human-in-the-loop-handoff.md: A2 §6 voice handoff patterns (hold-the-line-and-text reference for Phase 2 voice card flow)
- docs/methodology/agentic-development.md: Phase B §5 auto-debug logging schema (payment.race.rejected, payment.callback.unrouted, payment.reversal.attempted/succeeded/failed event types extend §5.2 enum); §6 delegate-vs-human-review boundaries (payment code is human-review-only)
- docs/superpowers/specs/2026-04-25-agentic-research-investment-design.md §12: C2 (public.payment_routing shared-schema primitive)
- ~/.claude/projects/-Users-soft4u-Development-ratiba/memory/project_cost_discipline.md: per-conversation cost framing (drives D7 PesaPal card-only commitment)
- Daraja API documentation: STK push, stkpushquery, Reversal API
- PesaPal API 3.0 documentation: SubmitOrderRequest (payment_method filter; verify at implementation), IPN webhook contract, Refund / Void APIs
- LangGraph interrupt() + Command(resume=...) API reference
- ULID specification: for merchant_reference time-sortable IDs