The problem
Before an agent payment protocol touches mainnet, the bugs that matter most are not the ones a unit test finds. They are the ones that only appear when forty buyers are hitting the same provider endpoint concurrently, or when the oracle drops an attestation at exactly the wrong moment and the escrow hold lingers for three settlement cycles instead of one. Those bugs live in the timing gaps between components, not in the logic of any single component.
switchboard ships four payment rails simultaneously: x402 HTTP middleware, ZAP binary wire, AgentEscrow.sol, and a spend-policy layer with per-transaction, hourly, and daily caps. Each rail works correctly in isolation. The question that kept us up was what happens when all four run under realistic multi-agent load, with failure injection, before we have real traffic to learn from. The answer was to build a simulator where we could engineer the traffic ourselves.
The second pressure: reproducible bug reports. When a contributor surfaces a timing failure, we need to be able to replay it exactly. An unseeded run produces failures nobody can replay; a fixed seed makes the failure a fixture, not an event.
The model
The simulator initializes five role classes from a single integer seed: buyers, providers, oracles, a treasury, and a HITL (human-in-the-loop) decision point. Each role gets a scripted behavior profile — arrival rate, payment size distribution, failure probability — and runs through a configurable number of settlement rounds. The failure injectors are explicit: oracle dropout, nonce replay attempts, escrow timeout without confirmation, and provider non-delivery mid-flight.
Because every random draw is seeded, the same integer reproduces the same agent population, the same arrival sequence, and the same failure events. A bug surfaced at seed 7419 is filed as seed 7419. The person who triages it runs seed 7419. They see exactly what you saw. This sounds simple; it was not the first design we tried.
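A minimal sketch of the seeding idea, in Python. It is not the switchboard simulator's code: the `Event` record, the `run` function, its parameters, and the 50% arrival placeholder are all assumptions made for illustration. The only point it demonstrates is that drawing every random value from one private, seeded generator makes the whole event sequence a pure function of the seed.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Failure-injector labels mirror the four injectors named in the text;
# the exact identifiers are hypothetical.
FAILURES = ("oracle_dropout", "nonce_replay", "escrow_timeout", "provider_non_delivery")


@dataclass(frozen=True)
class Event:
    round_no: int
    buyer: int
    amount: float
    failure: Optional[str]  # None means no failure was injected


def run(seed: int, buyers: int = 8, rounds: int = 5, failure_prob: float = 0.1) -> List[Event]:
    """Generate a scripted event sequence from a single integer seed."""
    rng = random.Random(seed)  # private generator, independent of global random state
    events = []
    for round_no in range(rounds):
        for buyer in range(buyers):
            if rng.random() < 0.5:  # placeholder arrival rate: buyer sits this round out
                continue
            amount = round(rng.lognormvariate(0.0, 1.0), 4)  # placeholder size distribution
            failure = rng.choice(FAILURES) if rng.random() < failure_prob else None
            events.append(Event(round_no, buyer, amount, failure))
    return events
```

Calling `run(7419)` twice returns identical lists, which is the property that lets a seed number stand in for a bug report.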
Concrete finding one: nonce collision rate is superlinear under naive policy
The first surprise was in the nonce manager. Under the naive policy, each buyer issues nonces by incrementing its own counter, but all buyers share one nonce namespace without coordination, so two buyers can present the same nonce value. The simulator made this visible: at four concurrent buyers the collision rate was low enough to miss in manual testing. At twelve buyers it was noticeable. At thirty-two it was a failure mode, not an edge case.
The collision rate scales with the square of concurrency under the naive scheme because every additional buyer adds a collision surface against all existing buyers, not just one: n buyers form n(n-1)/2 pairs that can collide. The fix, per-buyer nonce namespacing with a chain-ID prefix, is straightforward once you know that is the problem. The simulator made the failure reproducible at an exact seed, which meant we could write a regression test that pins the fixed behavior without relying on real network conditions.
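The quadratic growth is pair counting: 4, 12, and 32 buyers form 6, 66, and 496 buyer pairs respectively. The Python sketch below is a toy under stated assumptions, not the switchboard nonce manager: the class names, the `issue` method, and the choice to count conflicting buyer pairs as the collision metric are all hypothetical. It contrasts a shared namespace with keys prefixed by chain ID and buyer ID.

```python
from collections import defaultdict


def possible_pairs(n_buyers: int) -> int:
    """Distinct buyer pairs that could collide in a shared namespace."""
    return n_buyers * (n_buyers - 1) // 2


class NonceManager:
    """Naive scheme: per-buyer counters, but one shared nonce namespace."""

    def __init__(self):
        self._next = defaultdict(int)     # buyer_id -> next counter value
        self._issued = defaultdict(int)   # key -> how many buyers already hold it
        self.conflicting_pairs = 0

    def _key(self, buyer_id, nonce):
        return nonce  # buyer identity is dropped, so equal counters collide

    def issue(self, buyer_id):
        nonce = self._next[buyer_id]
        self._next[buyer_id] += 1
        key = self._key(buyer_id, nonce)
        # Each earlier holder of this key forms one new conflicting pair.
        self.conflicting_pairs += self._issued[key]
        self._issued[key] += 1
        return key


class NamespacedNonceManager(NonceManager):
    """Namespaced scheme: the key carries a chain-ID prefix and the buyer ID."""

    def __init__(self, chain_id: int):
        super().__init__()
        self.chain_id = chain_id

    def _key(self, buyer_id, nonce):
        return (self.chain_id, buyer_id, nonce)
```

With four buyers each issuing three nonces, the naive manager records 18 conflicting pairs (3 nonce values times 6 buyer pairs) and the namespaced one records none. The chain ID passed in is whatever the deployment uses; any value works for the illustration.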
Concrete finding two: escrow holds skew the settle-latency right tail
The second finding took longer to trust. Under normal conditions the P50 and P95 settle-latency figures look healthy. The simulator's oracle-dropout injector revealed a long-tail skew we had not anticipated: when an oracle drops out during the attestation window, the escrow hold remains open until the timeout fires. That timeout is a protocol parameter, not a per-transaction variable, so every affected hold waits out the same fixed interval. A small share of holds therefore stayed open far longer than the median and pulled the P95 figure up: high enough that a human reading the metrics would see something was wrong, but not know where to look.
The actionable finding: the apparent settle-latency right tail is a proxy for oracle health, not for payment volume. A P95 spike during a period of flat payment volume points to oracle dropout, not buyer congestion. We now monitor oracle health and payment volume as separate signals. The simulator surfaces both in the same CSV export so the relationship is legible before it becomes an incident.
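To make the P50-versus-P95 behavior concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the 30-second timeout, the uniform 0.5 to 2.0 second "healthy" settle time, and the function names do not come from the simulator. It holds payment volume fixed and varies only the oracle dropout probability.

```python
import random
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -((-pct * len(ordered)) // 100))  # integer ceiling of pct% of n
    return ordered[rank - 1]


def settle_latencies(seed: int, n_payments: int, dropout_prob: float,
                     timeout_s: float = 30.0) -> List[float]:
    """Settle latency per payment; a dropped attestation waits out the full timeout."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n_payments):
        if rng.random() < dropout_prob:
            latencies.append(timeout_s)              # hold stays open until timeout fires
        else:
            latencies.append(rng.uniform(0.5, 2.0))  # placeholder healthy settle time
    return latencies


healthy = settle_latencies(seed=7419, n_payments=2000, dropout_prob=0.0)
degraded = settle_latencies(seed=7419, n_payments=2000, dropout_prob=0.1)
```

In this toy, `percentile(healthy, 95)` and both P50 values stay within the healthy range, while `percentile(degraded, 95)` lands on the timeout itself. Volume is identical in both runs, so the only thing that moved the tail is the dropout rate.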
Try it
The simulator runs in the browser at kcolbchain.com/switchboard/simulator.html. The source is in kcolbchain/switchboard. The lab.html companion walks through the failure injectors step by step with annotated output.
What's next
We have five open problems we want PRs on:
- Cross-language wire validation. The ZAP binary wire has a Python encoder and a partial Go mirror. A Go-side simulator consumer would catch any byte-order disagreement before it reaches mainnet agents.
- Adversarial oracle simulation. The current oracle-dropout injector models honest oracles that go offline. Byzantine oracles — ones that sign incorrect attestations intentionally — need a separate injector and a different threat model.
- Spend-policy replay from real traces. The simulator generates synthetic traffic. Replaying a real x402 traffic trace against the spend-policy layer would let us validate BoundedSpendPolicy parameters against production behavior rather than synthetic assumptions.
- Latency profile calibration. The settlement latency model is parameterized but not yet calibrated against on-chain data from Base Sepolia or Lux testnet. Someone with access to historical testnet logs could close that gap.
- Provider non-delivery penalties. The current escrow model refunds the buyer on non-delivery after a timeout. A simulator extension that models partial delivery and proportional settlement would let us evaluate the fairness properties of different refund policies before we commit to one in the contract.
These are all concrete, bounded, and impactful. If any of them match what you are already working on, the issue tracker is at github.com/kcolbchain/switchboard/issues.