Payments Orchestration: M-Pesa STK Push and PesaPal Hosted Redirect
Status: Research deliverable, feeds candidate ADR-0007 (payments-orchestration).
Builds on: ADR-0001 (amended 2026-04-25), A1 LangGraph spike (docs/research/2026-04-25-langgraph-postgressaver-spike.md), A1 orchestration deliverable, Phase C landscape scan.
Decisions in scope: FSM shape for payment lifecycle, callback resume mechanism, timeout policy per rail, idempotency, multi-channel parity (WhatsApp + voice), the PesaPal multi-method commission leak.
1. Recommended orchestration shape
The canonical lifecycle is initiate → suspend → callback → resume, modelled in LangGraph as a payment node that dispatches the push and then suspends via interrupt(). The agent never polls in the conversation loop; polling is a backstop owned by a worker, not the agent. Concretely: the agent writes a payments row keyed by a Ratiba-owned merchant_reference (the only identifier that survives both rails' callback shapes), invokes the provider client, then yields control via interrupt(). The thread checkpoint sits in the tenant schema until either (a) the provider's webhook hits /api/v1/webhooks/{provider}/callback/{secret}, the handler resolves merchant_reference → (tenant_id, thread_id), and re-invokes the graph with Command(resume={...}); or (b) a worker-driven timeout fires and resumes the same thread with a synthesised {status: "TIMEOUT"} payload.
The two rails differ only in policy, not in shape. STK push is a 60-second strict suspend with one in-conversation status update at 30s and a hard fail-forward at 60s; PesaPal hosted redirect is an open-ended suspend with a soft nudge at 8 minutes and an abandonment fail at 30 minutes. Voice uses the same shape with one extra branch: on a PesaPal payment, past 25 seconds of dead air the agent texts the link, hangs up, and calls back on success; on STK, the agent holds the line for the full 60 seconds (section 4).
This is opinionated: do not build two separate state machines. The PRD already implies this (the FSM has a single BOOKING_PAYMENT state — Annex A §A.5 lines 1515-1554). What the PRD doesn't say but should: the orchestration boundary is between the FSM and the provider, not inside the FSM. The FSM doesn't know it's M-Pesa vs PesaPal; it knows it suspended a payment and is waiting on a callback.
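The "same shape, different policy" claim can be made concrete as a small table of constants. The sketch below is illustrative only: RailPolicy, POLICIES, due_action and the field names are hypothetical, and the numbers are simply the ones proposed in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RailPolicy:
    """Timing constants for one payment rail (hypothetical names)."""
    nudge_after_s: int    # when to send the in-conversation status update
    timeout_after_s: int  # when the worker resumes the thread with TIMEOUT


# Values taken from the recommendation above; pilot data may change them.
POLICIES = {
    "mpesa": RailPolicy(nudge_after_s=30, timeout_after_s=60),
    "pesapal": RailPolicy(nudge_after_s=8 * 60, timeout_after_s=30 * 60),
}


def due_action(provider: str, elapsed_s: float, nudged: bool) -> Optional[str]:
    """Return the worker action that is due now, or None if nothing is due."""
    policy = POLICIES[provider]
    if elapsed_s >= policy.timeout_after_s:
        return "timeout"
    if elapsed_s >= policy.nudge_after_s and not nudged:
        return "nudge"
    return None
```

The FSM never reads these constants; only the timeout worker does, which keeps the rail-specific policy outside the state machine.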
2. FSM state design
These four states sit inside a single BOOKING_PAYMENT super-state (entered after the user confirms slot + service + amount):
| State | Meaning | Entry trigger | Exit triggers |
|---|---|---|---|
| AWAITING_PAYMENT | Push dispatched, waiting on callback | provider.initiate_payment() returned 200 | Callback received (any terminal status), or worker-driven timeout |
| PAYMENT_TIMEOUT | Provider's deadline elapsed without callback | Timeout worker fires Command(resume={status: "TIMEOUT"}) | User retries (back to AWAITING_PAYMENT with new merchant_reference), pays at venue, cancels |
| PAYMENT_FAILED | Callback arrived with non-zero ResultCode (Daraja) or FAILED (PesaPal) | Webhook handler resumes with {status: "FAILED", reason: ...} | User retries, pays at venue, cancels |
| PAYMENT_CONFIRMED | Callback confirmed payment | Webhook handler resumes with {status: "COMPLETED", receipt: ...} | Move to BOOKING_CONFIRMED, dispatch confirmation message |
Transition table — M-Pesa STK push
| From | Event | To | Side effect |
|---|---|---|---|
| BOOKING_CONFIRM | User taps "Pay now" | AWAITING_PAYMENT | payments row inserted with status='initiated', STK dispatched |
| AWAITING_PAYMENT | 30s elapsed, no callback | AWAITING_PAYMENT | Send WhatsApp nudge ("Still waiting, check your phone for the M-Pesa prompt"); does not transition, just emits |
| AWAITING_PAYMENT | 60s elapsed, no callback | PAYMENT_TIMEOUT | payments.status='timeout', worker also fires Daraja stkpushquery once as belt-and-braces |
| AWAITING_PAYMENT | Callback ResultCode=0 | PAYMENT_CONFIRMED | payments.status='completed', mpesa_receipt filled, appointment confirmed |
| AWAITING_PAYMENT | Callback ResultCode!=0 | PAYMENT_FAILED | payments.status='failed', reason logged |
| PAYMENT_TIMEOUT / PAYMENT_FAILED | "Try again" | AWAITING_PAYMENT | New merchant_reference, new payments row, old row stays as audit |
| PAYMENT_TIMEOUT / PAYMENT_FAILED | "Pay at venue" | BOOKING_CONFIRMED | Appointment with payment_status='pay_at_venue' |
Transition table — PesaPal hosted redirect
Same table, three changes:
- 30s nudge becomes an 8-minute nudge ("Tap the link when you're ready — it's still valid").
- 60s timeout becomes a 30-minute abandonment.
- Callback status='COMPLETED' may arrive at minute 0.5 or minute 28; the FSM doesn't care.
The PRD's Annex A §A.5 already sketches this for M-Pesa; what's new is recognising PesaPal as the same FSM with different policy constants, not a parallel branch.
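The transition tables above can be encoded as a single lookup shared by both rails. This is a minimal sketch: the event labels ("pay_now", "callback_ok", and so on) and the names PaymentState, TRANSITIONS and next_state are invented for illustration, and the cancel exits are left out for brevity.

```python
from enum import Enum


class PaymentState(str, Enum):
    BOOKING_CONFIRM = "BOOKING_CONFIRM"
    AWAITING_PAYMENT = "AWAITING_PAYMENT"
    PAYMENT_TIMEOUT = "PAYMENT_TIMEOUT"
    PAYMENT_FAILED = "PAYMENT_FAILED"
    PAYMENT_CONFIRMED = "PAYMENT_CONFIRMED"
    BOOKING_CONFIRMED = "BOOKING_CONFIRMED"


S = PaymentState

# (from_state, event) -> to_state. Event names are illustrative labels.
TRANSITIONS = {
    (S.BOOKING_CONFIRM, "pay_now"): S.AWAITING_PAYMENT,
    (S.AWAITING_PAYMENT, "nudge_due"): S.AWAITING_PAYMENT,  # emit only
    (S.AWAITING_PAYMENT, "timeout"): S.PAYMENT_TIMEOUT,
    (S.AWAITING_PAYMENT, "callback_ok"): S.PAYMENT_CONFIRMED,
    (S.AWAITING_PAYMENT, "callback_failed"): S.PAYMENT_FAILED,
    (S.PAYMENT_TIMEOUT, "try_again"): S.AWAITING_PAYMENT,
    (S.PAYMENT_FAILED, "try_again"): S.AWAITING_PAYMENT,
    (S.PAYMENT_TIMEOUT, "pay_at_venue"): S.BOOKING_CONFIRMED,
    (S.PAYMENT_FAILED, "pay_at_venue"): S.BOOKING_CONFIRMED,
    (S.PAYMENT_CONFIRMED, "confirm_booking"): S.BOOKING_CONFIRMED,
}


def next_state(state: PaymentState, event: str) -> PaymentState:
    """Look up the next state; unknown pairs are rejected, not ignored."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} on {event!r}") from None
```

Nothing in the table mentions M-Pesa or PesaPal: the rail only decides when "nudge_due" and "timeout" are fired.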
3. Resume mechanism on callback
Recommendation: store (tenant_id, thread_id, merchant_reference) as a triple in the payments row, keyed by merchant_reference. The webhook handler resolves the triple, then calls Command(resume=...) against the matching thread_id. No phone-number lookup, no parsing reference strings for embedded data.
Why not the alternatives
- Phone number lookup. Brittle. A customer may have two parallel bookings (one across two tenants, or two services within one tenant). Phone is not unique to a payment attempt.
- Parse merchant_reference to extract embedded thread_id. Tempting because Daraja's AccountReference is <=12 chars, enough to encode a short ID. But (a) PesaPal's merchant_reference is bigger but still constrained; (b) coupling routing logic to a string-format convention is exactly the kind of thing that breaks at 3am when you change ID formats; (c) the tenant routing problem is still unsolved, since you'd need tenant_id in there too.
- Lookup by Daraja's CheckoutRequestID / PesaPal's OrderTrackingId. This is what the PRD's mpesa_checkout_id column is for. Useful as a secondary lookup (covers the rare case where Daraja sends a callback but you lost your local row mid-write). Should be a UNIQUE index, not the primary lookup.
Schema additions to the existing payments table (PRD §3.2 lines 268-283)
-- NOT NULL without a default assumes payments is empty; backfill existing rows first otherwise.
ALTER TABLE payments ADD COLUMN merchant_reference VARCHAR(40) NOT NULL UNIQUE;
ALTER TABLE payments ADD COLUMN thread_id VARCHAR(100) NOT NULL;
ALTER TABLE payments ADD COLUMN provider VARCHAR(20) NOT NULL DEFAULT 'mpesa'
    CHECK (provider IN ('mpesa', 'airtel', 'equitel', 'pesapal'));
-- No separate index on merchant_reference: the UNIQUE constraint already creates one.
CREATE INDEX idx_payments_thread ON payments(thread_id);
tenant_id is implicit via schema-per-tenant, but the webhook handler runs in the public schema (no tenant context yet). So we also need a shared lookup table in public:
CREATE TABLE public.payment_routing (
    merchant_reference VARCHAR(40) PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES public.tenants(id),
    schema_name VARCHAR(100) NOT NULL,
    thread_id VARCHAR(100) NOT NULL,
    provider VARCHAR(20) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL -- 24h after init; reaper job purges
);
The webhook handler does one shared-schema lookup, switches into the tenant schema, then resumes the graph. This sounds like duplication but it's the price of schema-per-tenant: webhooks land in a tenant-less context.
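The shared-schema lookup can be sketched without a database. The example below assumes an in-memory mapping in place of public.payment_routing and treats a row past expires_at as missing; that expiry rule is an assumption for illustration, not a decision this document makes. PaymentRoute and resolve_route are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class PaymentRoute:
    """One row of public.payment_routing (columns as defined above)."""
    merchant_reference: str
    tenant_id: str
    schema_name: str
    thread_id: str
    provider: str
    expires_at: datetime


def resolve_route(
    table: Mapping[str, PaymentRoute],
    merchant_reference: str,
    now: Optional[datetime] = None,
) -> Optional[PaymentRoute]:
    """Return the route for a callback, or None if unknown or past its TTL."""
    now = now or datetime.now(timezone.utc)
    route = table.get(merchant_reference)
    if route is None or route.expires_at <= now:
        return None  # assumption: expired rows are handled like unknown ones
    return route
```

The handler would then switch into route.schema_name and resume route.thread_id; a None result is the "log loudly, still return 200" path.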
merchant_reference format
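Recommend: RTB-{tenant_short}-{ulid}, e.g. RTB-ms3k-01HF8K9.... A ULID gives time-sortable, lexically sortable IDs. The full reference is too long for Daraja's 12-char AccountReference, but the tenant prefix alone fits, and M-Pesa shows that string to the customer on the prompt, so it should be readable. For Daraja specifically, store the full reference in merchant_reference, pass RTB-{tenant_short} as the AccountReference, and pass the ULID as TransactionDesc only if it fits (check the field's length limit in the current Daraja docs; it is believed to be well below a 26-character ULID). Do not rely on Daraja echoing either field back: the documented stkCallback body is keyed by MerchantRequestID and CheckoutRequestID, which is why the handler in section 8 resolves routing by CheckoutRequestID. Verify this in sandbox before ADR-0007 commits to it.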
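A standard-library sketch of the reference format follows. It builds a ULID by hand (48-bit millisecond timestamp plus 80 random bits, Crockford base32) rather than assuming any particular ULID package. The function names and the tenant_short argument are hypothetical.

```python
import os
import time
from typing import Optional

_CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms: Optional[int] = None) -> str:
    """Build a 26-char ULID: 48-bit millisecond timestamp + 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 base32 digits, least significant first
        chars.append(_CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def generate_merchant_reference(tenant_short: str) -> str:
    """Full reference stored in payments.merchant_reference."""
    return f"RTB-{tenant_short}-{new_ulid()}"


def daraja_account_reference(tenant_short: str) -> str:
    """Short, customer-readable string for Daraja's AccountReference."""
    ref = f"RTB-{tenant_short}"
    if len(ref) > 12:
        raise ValueError("AccountReference must be at most 12 characters")
    return ref
```

With a four-character tenant short code the full reference is 35 characters, inside the VARCHAR(40) column, and the AccountReference is 8.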
4. Timeout handling
M-Pesa STK: 60s strict
Daraja times out the prompt after ~60 seconds on Safaricom's side (the customer's phone stops accepting the PIN). Common practice in the field is to poll stkpushquery every 2 seconds, but for Ratiba this is the wrong tool — we're on the agent loop, not a request/response API. Instead:
- t=0: STK dispatched. Agent says "Tafadhali angalia simu yako kwa M-Pesa prompt." / "Please check your phone for the M-Pesa prompt."
- t=30s: No callback yet. Agent emits a soft nudge: "Bado tunasubiri… ukiona prompt, weka PIN yako." / "Still waiting, enter your PIN when you see the prompt." This is a non-transitioning emit; the FSM stays in AWAITING_PAYMENT.
- t=60s: No callback. Worker fires stkpushquery once as a belt-and-braces final check (Daraja's callback delivery is occasionally lossy). If query says "in progress" or "successful", treat as authoritative; otherwise transition to PAYMENT_TIMEOUT and say "Hatuoni malipo. Tujaribu tena?" / "We didn't see the payment. Want to try again?" with [Try again] [Pay at venue] [Cancel].
The 30s nudge is a UX call. Daraja's median callback latency is around 8-15s in the field; if 30s passes, something is meaningfully off (customer ignored the prompt, customer is on a 2G connection, Safaricom is slow). Saying something before the 60s wall hits prevents the perception that the agent died.
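The timeline above can be sketched as a single-process watcher. This is only an illustration of the nudge-then-timeout ordering: a real worker would live outside the agent process, and watch_payment, on_nudge and on_timeout are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def watch_payment(
    callback_seen: asyncio.Event,
    on_nudge: Callable[[], Awaitable[None]],
    on_timeout: Callable[[], Awaitable[None]],
    nudge_after_s: float = 30.0,
    timeout_after_s: float = 60.0,
) -> str:
    """Wait for the callback; nudge once, then time out. Returns the outcome."""
    try:
        await asyncio.wait_for(callback_seen.wait(), timeout=nudge_after_s)
        return "callback"
    except asyncio.TimeoutError:
        await on_nudge()  # non-transitioning emit; FSM stays in AWAITING_PAYMENT
    try:
        remaining = timeout_after_s - nudge_after_s
        await asyncio.wait_for(callback_seen.wait(), timeout=remaining)
        return "callback"
    except asyncio.TimeoutError:
        await on_timeout()  # e.g. one stkpushquery, then resume with TIMEOUT
        return "timeout"
```

The same function covers PesaPal by passing 480 and 1800 seconds instead of the defaults.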
PesaPal: open-ended
PesaPal has no fixed timeout because the customer is in a browser doing 3D Secure, which itself can take 5+ minutes if the issuing bank's OTP SMS is slow. Use 8 minutes for the soft nudge and 30 minutes for hard abandonment.
- t=0: Hosted-page link sent. Agent says "Tap the link, complete payment, come back here. I'll let you know once it's done."
- t=8min: Soft nudge: "Just checking, did you finish the payment? Tap the link again if you closed the page." Resend the link.
- t=30min: Abandon: "Looks like the payment didn't go through. Want to try again or pay at the venue?". Mark payments.status='timeout', free the slot hold (if you've taken one).
8 minutes is opinionated — it's long enough that a customer doing 3D Secure isn't pestered, short enough that an abandoner gets a graceful out before they forget Ratiba exists. Caveat: I'm uncertain about the empirical right value here; the PRD will need real-world calibration in pilot.
Voice: when to drop the line
Voice has a third axis: telephony costs money per minute and dead air is awkward.
- STK push on voice: Hold the line for the full 60s. The agent should fill silence with conversational filler ("Take your time, I'll wait"). If callback arrives in time, confirm verbally and end the call cleanly. If 60s elapses, offer one re-try. Never hold past 90s.
- PesaPal on voice: Hold the line for 25 seconds after sending the link. After 25s, agent says: "I'll text you the link and call you back once it's confirmed. Habari njema, nakurudia." / "Good news, I'll call you back." Hang up. The webhook handler, on success, schedules a callback (LiveKit outbound SIP) to deliver confirmation. This is the hold-the-line-and-text pattern.
The "text the link" step uses 360dialog (WhatsApp) if the caller's number has WhatsApp, otherwise SMS via Daraja's bulk SMS gateway (which is in scope per ADR-0001). Detection: try WhatsApp first; if 360dialog returns "phone not on WhatsApp", fall back to SMS.
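The WhatsApp-first, SMS-fallback detection can be sketched as a try/except around the first sender. The sender callables and the NotOnWhatsApp exception are placeholders: how 360dialog actually reports "phone not on WhatsApp" needs to be checked against its API.

```python
from typing import Callable


class NotOnWhatsApp(Exception):
    """Raised by the (hypothetical) WhatsApp sender when the number has no account."""


def send_payment_link(
    phone: str,
    link: str,
    send_whatsapp: Callable[[str, str], None],
    send_sms: Callable[[str, str], None],
) -> str:
    """Try WhatsApp first, fall back to SMS. Returns the channel that was used."""
    try:
        send_whatsapp(phone, link)
        return "whatsapp"
    except NotOnWhatsApp:
        send_sms(phone, link)
        return "sms"
```

Only the "not on WhatsApp" signal triggers the fallback; any other sender error propagates so it can be retried or logged rather than silently rerouted.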
5. Idempotency under callback retries
Both Daraja and PesaPal will, on occasion, fire the same callback twice. (Daraja's documented behaviour — "callbacks can repeat, your system should update the order once".) Resuming the same LangGraph thread twice is a bug — it transitions through PAYMENT_CONFIRMED → BOOKING_CONFIRMED and may dispatch a second confirmation message, or worse, double-book if the confirmation node has side effects.
Layered defence
Layer 1: Postgres unique constraint (the ground truth).
-- Already added above: merchant_reference UNIQUE
-- Add: receipt cannot be assigned to two payment rows
ALTER TABLE payments ADD CONSTRAINT uniq_mpesa_receipt
UNIQUE (mpesa_receipt) DEFERRABLE INITIALLY DEFERRED;
This catches the worst case: a duplicate callback that somehow makes it past Layer 2 and tries to write a receipt that already exists.
Layer 2: Redis dedupe key on the webhook handler.
import structlog
from redis.asyncio import Redis

logger = structlog.get_logger()  # assumed: a structlog-style logger that accepts keyword fields


async def handle_mpesa_callback(payload: dict, redis: Redis) -> None:
    checkout_id = payload["Body"]["stkCallback"]["CheckoutRequestID"]
    dedupe_key = f"mpesa:cb:{checkout_id}"
    # SET NX with 24h TTL: only the first handler to set the key wins
    won = await redis.set(dedupe_key, "1", nx=True, ex=86400)
    if not won:
        logger.info("mpesa_callback_duplicate_dropped", checkout_id=checkout_id)
        return  # Acknowledged earlier, ignore
    # ... proceed to resume the graph
The dedupe key is the provider's identifier (CheckoutRequestID for Daraja, OrderTrackingId for PesaPal), not our merchant_reference. This is because the duplicate arrives with the same provider ID, and the provider ID is what the provider thinks it's identifying.
Layer 3: payment row state check before resume.
async def resume_payment(merchant_ref: str, status: str, payload: dict):
    # FOR UPDATE only holds its row lock inside a transaction.
    # Assumes db exposes an async transaction() context manager.
    async with db.transaction():
        payment = await db.fetch_one(
            "SELECT status FROM payments WHERE merchant_reference = :ref FOR UPDATE",
            {"ref": merchant_ref},
        )
        if payment["status"] in ("completed", "failed"):
            logger.info("payment_already_terminal", ref=merchant_ref, status=payment["status"])
            return  # Don't re-resume the graph
        # ... update payments.status here, then proceed with Command(resume=...)
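SELECT ... FOR UPDATE locks the payment row for the length of the enclosing transaction, which makes the read+update atomic, so even if Layer 2 fails (Redis down, key expired), Layer 3 catches the duplicate. The lock is released at commit, so the status update has to happen in that same transaction.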
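How Layers 2 and 3 back each other up can be shown with an in-memory stand-in. DedupeStore only mimics the first-writer-wins behaviour of SET NX with a TTL, and a plain dict stands in for the payments table; both are simplifications for illustration, and should_resume is a hypothetical name.

```python
import time
from typing import Callable, Dict

TERMINAL = {"completed", "failed"}


class DedupeStore:
    """In-memory stand-in for Redis SET key value NX EX ttl (sketch only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry: Dict[str, float] = {}
        self._clock = clock

    def set_nx(self, key: str, ttl_s: float) -> bool:
        """Set the key if absent or expired; return True only for the winner."""
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False
        self._expiry[key] = now + ttl_s
        return True


def should_resume(
    store: DedupeStore,
    payments: Dict[str, str],  # merchant_reference -> status
    provider_id: str,
    merchant_ref: str,
) -> bool:
    """Return True only if this callback should resume the graph."""
    if not store.set_nx(f"cb:{provider_id}", 86400):  # Layer 2
        return False
    if payments.get(merchant_ref) in TERMINAL:        # Layer 3
        return False
    return True
```

If the dedupe key has expired or Redis was flushed, the second check still stops a replayed callback from resuming a thread whose payment is already terminal.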
Failure modes if idempotency breaks
- Double STK callback resumed twice. Without idempotency: the FSM transitions CONFIRMED → CONFIRMED and re-emits the WhatsApp confirmation message ("Asante! Booking confirmed for 3pm."). Customer sees two identical messages. Annoying but not financial. However, the confirmation node likely also writes to appointments.status and may trigger reminder scheduling, so re-running it could schedule duplicate reminders. Worst case: if the agent's payment_node includes any "charge handling fee" logic, you double-charge.
- Double PesaPal IPN resumed twice. Same risk profile, plus: PesaPal's IPN sometimes fires once for PENDING and again for COMPLETED for the same transaction. If we treat PENDING as terminal we'd never resume on COMPLETED. Rule: only COMPLETED and FAILED are terminal statuses from PesaPal; PENDING IPNs are logged and discarded.
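The terminal-status rule can be written as two small normalisers, one per rail. This is a sketch under the rules stated in this section; the function names are hypothetical and the status strings are the ones used here, not an exhaustive list of what either provider can send.

```python
from typing import Optional


def daraja_status(result_code: int) -> str:
    """Daraja: ResultCode 0 is success, anything else is a failure."""
    return "COMPLETED" if int(result_code) == 0 else "FAILED"


def pesapal_status(description: str) -> Optional[str]:
    """PesaPal: return a terminal status, or None if the IPN should be discarded."""
    normalised = description.strip().upper()
    if normalised in ("COMPLETED", "FAILED"):
        return normalised
    return None  # e.g. PENDING: log it, acknowledge it, do not resume
```

Returning None rather than a third status makes it hard to resume the graph on a non-terminal IPN by accident.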
What we explicitly do NOT do
We do not return non-200 to the provider on duplicate callbacks. Both Daraja and PesaPal interpret non-200 as "retry me later", which would amplify the duplicate problem. Always 200, then dedupe internally.
6. Multi-channel (voice)
The orchestration is identical to WhatsApp; the differences are in the AnswerShaper and the timeout policy.
STK push on voice
Customer is on a phone call. Agent says (Swahili or English per detected language): "Nitaomba M-Pesa prompt sasa hivi. Itakuwa kwenye simu yako baada ya sekunde tano. Weka PIN yako." / "I'll send the M-Pesa prompt now. It will appear on your phone in about five seconds — please enter your PIN." Then silence with low-volume hold music or filler ("Bado tunasubiri…" every 15s). At 60s, if no callback, offer one retry, then degrade to "I'll text you the M-Pesa request — please complete it when you can, and I'll call you back."
The trick: the customer's phone is the same line they're on with you. M-Pesa STK on Safaricom interrupts with a system overlay, doesn't drop the call. Generally. We've heard reports of older Android handsets dropping the LiveKit SIP audio momentarily — uncertain about how prevalent this is in 2026. Pilot data needed.
Card flow on voice — the side-channel problem
Customer is on the line, wants to pay by card, can't tap a link mid-call. Three options:
- Option (a): Hold the line, text the link, hang up, call back on success. The recommended pattern. Agent: "Nakutumia link kwenye WhatsApp/SMS sasa. Ukimaliza nitakupigia tena." / "Sending you the link on WhatsApp now. I'll call you back when it's confirmed." 360dialog WhatsApp first, SMS via Daraja fallback. Hang up. Webhook handler triggers a LiveKit outbound SIP callback on PAYMENT_CONFIRMED.
- Option (b): Hold the line until customer completes. Realistic only if customer can do 3D Secure in <90s. Most can't. Don't do this by default.
- Option (c): Schedule a callback at customer's chosen time. "Want me to call you in 10 minutes once you've paid?" Useful for elderly users who'll need help finding the link.
Recommendation: Option (a) as default, Option (c) as a fallback if customer says "I can't read links / I need help". Adrian's elderly users are a non-trivial segment of the salon market.
Daraja's bulk SMS gateway as the SMS rail
Per ADR-0001 amendment, Daraja's SMS API is in scope. Use it. Don't add Twilio.
7. PesaPal multi-method commission leak
Annex B §B.2 describes the problem: customer picks "Card" in the WhatsApp flow (because the agent asked "M-Pesa or Card?"), gets a PesaPal hosted page, and the page shows all 8 payment methods. Customer taps M-Pesa instead of Card. PesaPal charges whatever rate corresponds to the method actually selected on the hosted page, not the card rate (3.5%) implied by the original intent. So the agent told the customer "this will be a card payment" but they paid with M-Pesa, which is confusing for the receipt and for the merchant's reconciliation.
The four options
Option (a) — Accept the leak. Do nothing. Customer sees PesaPal page, picks whatever, merchant pays whatever rate. Pros: zero engineering work. Cons: violates the customer's expectation set by the agent ("you said this was card"); merchant's reconciliation gets messy ("why is there an M-Pesa receipt against a card payment intent?"); we lose the ability to route customers to the cheaper rail.
Option (b) — Intercept PesaPal page selection client-side. Inject JS that hides non-card methods. Technically infeasible — we don't control PesaPal's hosted page; we'd need to fork to a custom checkout, which means PCI compliance, which means we shouldn't.
Option (c) — Use PesaPal's payment_method parameter to lock the page. PesaPal's API 3.0 SubmitOrderRequest accepts a payment_method filter that restricts which methods show on the hosted page. This is the one to verify and use. From the API docs (need to confirm in implementation), passing "payment_method": "CARD" in the order submission renders only card options on the hosted page. Uncertainty: I'm not 100% sure this filter exists in PesaPal API 3.0 as of 2026 — needs verification against their current docs at https://developer.pesapal.com/. If it does, this is the answer.
Option (d) — PRD clarification: "card" means card-only. Update the agent's payment-selection step so that picking "Card" sends an order with method-restriction; picking "M-Pesa" doesn't go through PesaPal at all (use Daraja STK directly). This is consistent with the existing PRD architecture — Annex A already handles M-Pesa via Daraja, not PesaPal.
Recommendation
Combine (c) and (d). Submit ADR-0007 with this rule: PesaPal is exclusively the card rail. M-Pesa, Airtel, and Equitel never go through PesaPal — they go through their direct APIs (Daraja, Airtel Africa, Jenga). When the agent offers "Card" and the customer accepts, we submit a PesaPal order with payment_method restricted to card-only. If PesaPal's API doesn't support that restriction in 2026, fall back to (d) alone — accept the leak on the rare case of a customer overriding to M-Pesa, but document it as known-risk.
This eliminates the dual-rail confusion and gives the merchant a clean cost model: M-Pesa = Daraja rate, card = PesaPal card rate, never the wrong combination.
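The routing rule can be sketched as a lookup from the customer's chosen method to a rail. The rail labels and select_rail are hypothetical, and the payment_method restriction is the unverified PesaPal parameter discussed under Option (c): it is included only to show where it would sit if it exists.

```python
from typing import Dict

# Method chosen in the conversation -> rail that handles it.
RAIL_BY_METHOD: Dict[str, str] = {
    "mpesa": "daraja",
    "airtel": "airtel_africa",
    "equitel": "jenga",
    "card": "pesapal",
}


def select_rail(method: str) -> Dict[str, str]:
    """Return the dispatch plan for a method; unknown methods raise KeyError."""
    rail = RAIL_BY_METHOD[method.lower()]
    plan = {"rail": rail}
    if rail == "pesapal":
        # UNVERIFIED: assumes SubmitOrderRequest accepts a card-only filter.
        plan["payment_method"] = "CARD"
    return plan
```

If the filter turns out not to exist, dropping the payment_method key leaves the fallback described above: card still goes to PesaPal, and the leak is documented as known-risk.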
8. Tying back to A1 (LangGraph + Option A wrapper)
Pseudocode showing how the callback resume integrates. This is the load-bearing code shape ADR-0007 will commit to:
# app/payments/graph.py
from langgraph.graph import StateGraph
from langgraph.types import interrupt, Command

from app.checkpoint import TenantScopedSaverFactory


async def payment_node(state: BookingState) -> dict:
    """Initiate payment. Runs once; the suspend lives in await_payment_node."""
    payment_req = build_payment_request(state)
    merchant_ref = generate_merchant_reference(state.tenant_id)

    # Persist routing BEFORE dispatching push (so callback can never beat us)
    await persist_payment_routing(
        merchant_ref=merchant_ref,
        tenant_id=state.tenant_id,
        schema_name=state.schema_name,
        thread_id=state.thread_id,
        provider=state.payment_provider,
    )
    await insert_payment_row(state, merchant_ref)

    # Dispatch the push (Daraja or PesaPal)
    provider = get_provider(state.payment_provider)
    result = await provider.initiate_payment(payment_req, merchant_ref)
    await update_payment_row(merchant_ref, checkout_id=result.checkout_id, status="pending")
    # Assumes BookingState carries a merchant_reference field
    return {"merchant_reference": merchant_ref}


async def await_payment_node(state: BookingState) -> dict:
    """Suspend until callback.

    Kept as its own node because LangGraph re-runs an interrupted node from
    its first line on resume; a dispatch in the same node would fire twice.
    """
    # Suspend; resume value comes from the webhook handler
    callback_result = interrupt({
        "kind": "payment_pending",
        "merchant_reference": state.merchant_reference,
        "provider": state.payment_provider,
    })
    return {"payment_result": callback_result}
# app/webhooks/payments.py
from fastapi import APIRouter, Depends, Request
from langgraph.types import Command
from redis.asyncio import Redis

router = APIRouter()


@router.post("/api/v1/webhooks/mpesa/callback/{secret}")
async def mpesa_callback(
    secret: str,
    request: Request,
    redis: Redis = Depends(get_redis),  # get_redis: app-provided dependency (assumed)
):
    verify_secret(secret)
    payload = await request.json()

    # 1. Parse callback into unified shape
    callback = MpesaProvider.parse_callback(payload)
    checkout_id = callback.checkout_id

    # 2. Layer 2 idempotency: Redis dedupe
    if not await redis.set(f"mpesa:cb:{checkout_id}", "1", nx=True, ex=86400):
        return {"ResultCode": 0}  # Always 200 to provider

    # 3. Resolve checkout_id -> (merchant_reference, tenant_id, schema_name, thread_id)
    routing = await fetch_payment_routing_by_checkout_id(checkout_id)
    if not routing:
        logger.error("mpesa_callback_no_routing", checkout_id=checkout_id)
        return {"ResultCode": 0}

    # 4. Layer 3 idempotency: payment row state check
    async with tenant_schema(routing["schema_name"]):
        terminal = await is_payment_terminal(routing["merchant_reference"])
        if terminal:
            return {"ResultCode": 0}

    # 5. Acquire TenantScopedSaver and resume the graph
    saver = await TenantScopedSaverFactory.for_tenant(routing["tenant_id"])
    graph = compile_booking_graph(checkpointer=saver)
    await graph.ainvoke(
        Command(resume={
            "status": callback.status.value,
            "receipt": callback.receipt_number,
            "amount": callback.amount,
            "raw": callback.raw_payload,
        }),
        config={"configurable": {"thread_id": routing["thread_id"]}},
    )
    return {"ResultCode": 0}
Three things to notice:
- persist_payment_routing runs BEFORE provider.initiate_payment. This is non-negotiable. If the order is reversed and the callback beats us to the lookup table, we lose the resume. Daraja sandbox callbacks have been observed at sub-200ms in the field.
- TenantScopedSaverFactory.for_tenant(tenant_id) is the A1-Option-A wrapper. The webhook handler does not construct a saver against a specific schema directly; it asks the factory, which knows the schema_name → saver mapping.
- The resume value is a dict, not a string. This matters because the payment_node's next step needs receipt, amount, and status to write the appointment row. Keep the resume payload structured.
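The parse step in the handler above can be sketched as follows, assuming the commonly documented stkCallback layout (a CallbackMetadata.Item list of Name/Value pairs that is absent on failure). ParsedCallback and parse_stk_callback are hypothetical stand-ins for MpesaProvider.parse_callback, and status is a plain string here rather than the enum the handler uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ParsedCallback:
    """Unified callback shape (field names mirror the handler above)."""
    checkout_id: str
    status: str
    receipt_number: Optional[str]
    amount: Optional[float]
    raw_payload: Dict[str, Any]


def parse_stk_callback(payload: Dict[str, Any]) -> ParsedCallback:
    """Flatten a Daraja STK callback body into the unified shape."""
    cb = payload["Body"]["stkCallback"]
    # CallbackMetadata is only expected on success, so default to no items.
    items = cb.get("CallbackMetadata", {}).get("Item", [])
    meta = {item["Name"]: item.get("Value") for item in items}
    succeeded = int(cb["ResultCode"]) == 0
    return ParsedCallback(
        checkout_id=cb["CheckoutRequestID"],
        status="COMPLETED" if succeeded else "FAILED",
        receipt_number=meta.get("MpesaReceiptNumber"),
        amount=meta.get("Amount"),
        raw_payload=payload,
    )
```

Keeping raw_payload on the parsed object means the resume dict can carry the original body for audit without the graph having to know Daraja's layout.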
What this looks like for PesaPal
Same shape. PesaPal's IPN gives you OrderTrackingId (their checkout ID) and OrderMerchantReference (your merchant_reference echoed back). The handler is:
from fastapi import Depends  # router, Request, Redis as in the M-Pesa handler module


@router.get("/api/v1/webhooks/pesapal/ipn/{secret}")  # IPN arrives as GET
async def pesapal_ipn(
    secret: str,
    request: Request,
    redis: Redis = Depends(get_redis),  # get_redis: app-provided dependency (assumed)
):
    verify_secret(secret)
    params = dict(request.query_params)  # PesaPal sends GET params not JSON
    order_tracking_id = params["OrderTrackingId"]
    merchant_ref = params["OrderMerchantReference"]
    # Structured acknowledgement, reused on every return path
    ack = {"orderNotificationType": "IPNCHANGE", "orderTrackingId": order_tracking_id, "orderMerchantReference": merchant_ref, "status": 200}

    # Fetch authoritative status (the IPN is just a notification, get the real status)
    status_resp = await pesapal_client.get_transaction_status(order_tracking_id)
    if status_resp.payment_status_description.lower() not in ("completed", "failed"):
        return ack  # PENDING: don't resume yet

    # Dedupe only terminal IPNs, so an earlier PENDING IPN cannot shadow COMPLETED
    if not await redis.set(f"pesapal:ipn:{order_tracking_id}", "1", nx=True, ex=86400):
        return ack

    routing = await fetch_payment_routing_by_merchant_ref(merchant_ref)
    # ... (same shape as mpesa_callback from step 4 onwards)
The PesaPal-specific quirks: IPN is GET not POST, IPN only signals "something changed" (you must call GetTransactionStatus to get the actual status), and the response shape PesaPal expects back is structured (not just HTTP 200).
9. Open questions for ADR-0007 to resolve
- payment_method filter on PesaPal API 3.0. Does the SubmitOrderRequest actually accept a payment-method restriction in 2026? Verify against current PesaPal developer docs. If yes, ADR-0007 commits to it. If no, ADR-0007 documents the leak as known-risk and adds a PRD clarification.
- Daraja stkpushquery as a polling backstop: how often. Recommendation: only at t=60s as a one-shot. But should we also fire it at t=30s as a "is the prompt even alive?" check? Trade-off is API call volume vs. accuracy.
- PesaPal nudge timing (8 min / 30 min). These are guesses. Pilot data should drive the final values. ADR-0007 should mark these as tentative_pending_pilot and reference a follow-up review at first 100 transactions.
- Voice timeout for STK (90s hard cap). Is 90s the right ceiling? Telephony cost vs. UX trade. Need pilot data.
- Reaper job cadence for public.payment_routing. Recommended TTL is 24h on the row, but if a payment legitimately takes 23h59m and the customer comes back to retry, what happens? Probably fine, but ADR should specify reaper cadence (hourly? daily?) and edge case for late-arriving callbacks past TTL.
- What happens if the LangGraph thread has been reaped but a callback arrives? A1 spike concluded thread checkpoints persist indefinitely in Postgres unless explicitly purged. ADR-0007 should declare a retention policy for booking threads (90 days? indefinite?). If a callback arrives on a purged thread, log loudly and write the receipt to payments for reconciliation, but accept the booking flow won't complete.
- Tenant-scoped RSA / signing keys on the webhook side. Daraja doesn't sign callbacks (it relies on IP whitelist + secret-in-URL). PesaPal doesn't sign IPNs (same model). If we ever move to a provider that signs (Stripe, Adyen), the resume mechanism shape doesn't change but the verification step does. Note for forward-compatibility.
- LiveKit outbound SIP call on payment success. Voice flow's "I'll call you back when it's confirmed" needs a queue-based callback dispatcher. Out of scope for ADR-0007 (belongs to the voice-orchestration ADR), but flag as a dependency.
- Concurrent callbacks for the same thread (different rails). Could a customer have a Daraja STK and a PesaPal session open simultaneously? In theory no (FSM only initiates one at a time). In practice, racing retries might cause this. Layer 3 idempotency (state check) should catch it, but ADR should explicitly prohibit concurrent payments on the same booking.
What was new vs. derived from first principles
New (from public 2026 sources, mostly confirmed via search):
- LangGraph Command(resume=...) with thread_id via configurable is the canonical 2026 resume pattern (multiple sources, well-documented).
- Daraja callback retries-on-occasion is explicitly documented industry knowledge.
- PesaPal IPN as GET-with-query-params and the "IPN says something changed, you must fetch status" pattern is documented in their dev portal.
Derived from first principles (Adrian + agent):
- The payment_routing shared-schema table to bridge tenant-less webhook context to tenant-scoped graph state. No public reference for this; it's a direct consequence of schema-per-tenant + LangGraph-per-tenant-saver.
- The triple-layer idempotency (Postgres unique + Redis dedupe + payment-row state check). Each layer individually is standard; the combination is our defensive recipe.
- The "8 minute / 30 minute" PesaPal nudge timing. Pure judgement call, no benchmark.
- The "hold-the-line-and-text" pattern for voice card payments. There's no published 2026 pattern for "agent on phone call hands off to async card payment then calls back" — Phase C was right, this topic is uncomfortably thin.
- The PesaPal multi-method leak strategy. No public discussion of this trade-off found.
Sources
- Daraja: callback handling and idempotency best practices
- Mpesa Daraja: handling callback
- PesaPal RegisterIPNURL
- PesaPal GetTransactionStatus
- PesaPal hosted page payment methods (Kenya)
- LangGraph Interrupts (canonical Python docs)
- LangGraph HITL deployment with FastAPI
- LangGraph interrupt-workflow template
- STK Push polling pattern (Daraja)