Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised $105M in a Series A, Will has spent years building the infrastructure to catch failure modes before they ever reach production. His starting point is uncomfortable: the testing practices most teams rely on are structurally incapable of finding the bugs that cause real incidents.
In this episode, Will traces that argument from its origins at FoundationDB, where a small team used deterministic simulation to ship a near-zero-bug distributed database at speed, to Antithesis's bet that any software system can be made fully testable without rewriting it. The conversation covers fault injection without production risk, the limits of chaos engineering as it's practiced today, and why the explosion of AI-generated code makes a reliable testing foundation more urgent than ever.
How did you end up building a company around software testing?
I came to computer science late. I didn't study it at university — I learned programming on the job doing scientific research and gradually realized tech was where I needed to be. And because I came in sideways, I never really learned how to test software properly. I'd write something, it would work for a while, and then I'd hit a certain scale of complexity where I couldn't hold the whole system in my head anymore. Bugs would start appearing. I'd fix one and introduce two more. Someone told me to write tests. I tried. It didn't really help.
So I was actually a testing skeptic when I joined FoundationDB — a distributed database startup that was later acquired by Apple in 2015. And that's where everything changed. The team there had built a completely new way of testing software. Instead of trying to test a distributed system in the real world — with all its chaos and unpredictability — they built a simulation of it. A simulated network of processes, with packets dropping at random, hard drives crashing, sysadmins doing dumb things, IP addresses changing. They'd run millions of such simulations and verify the database behaved correctly in every single one.
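The mechanism Will describes can be sketched in miniature: drive every "random" event from a single seeded RNG, so one seed fully determines one run. This toy sketch (the function names, retry loop, and drop rate are illustrative, not FoundationDB's actual simulator) delivers messages over a lossy simulated network:

```python
import random

def simulate(seed, messages, drop_rate=0.3):
    """Deliver messages over a lossy simulated network. Every random
    choice comes from one seeded RNG, so the seed determines the run."""
    rng = random.Random(seed)
    delivered, trace = [], []
    for msg in messages:
        attempts = 0
        while True:  # naive retry: resend until the packet gets through
            attempts += 1
            if rng.random() >= drop_rate:  # packet survives the network
                delivered.append(msg)
                trace.append((msg, attempts))
                break
    return delivered, trace

# The same seed replays the exact same execution, drops and all.
run_a = simulate(seed=42, messages=["m1", "m2", "m3"])
run_b = simulate(seed=42, messages=["m1", "m2", "m3"])
assert run_a == run_b

# Different seeds explore different failure histories of the same system.
for seed in range(1000):
    delivered, _ = simulate(seed, ["m1", "m2", "m3"])
    assert delivered == ["m1", "m2", "m3"]  # property must hold in every run
```

The payoff is the same one Will describes: a red run is not a mystery, because replaying its seed reproduces the failure exactly.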
The results were remarkable. In the entire history of that company, I think there were maybe one or two bugs ever reported from production. But the more surprising thing was how fast we moved. When you have no bugs, it becomes trivially easy to detect if a change introduced one. You just look for any red test. And if something's red, you revert the change. All the time you'd normally spend analyzing production incidents, reconstructing what happened, reviewing code — it all shrinks dramatically. A small team built something very difficult, at high velocity, with very high quality. It was unlike anything I'd experienced before or since.
After the acquisition, I worked at Apple, then Google, then a few other places. My old colleagues spread out too. And nowhere did we have that same experience. We kept saying: this is a hundred-dollar bill lying on the floor. So we started Antithesis to bring it to everyone else.
What's actually broken about the way most teams test today?
There are a few interconnected problems. The first is that end-to-end tests — the kind that spin up all your microservices talking to each other in a real environment — are slow, flaky, and painful to debug. They're secretly dependent on timing. You add a sleep somewhere, the test goes green more often, but still fails one in a thousand times. And when it fails, you don't know if you introduced a bug or if it just had a bad moment. So people hate these tests. And because they hate them, they don't write them.
Instead, teams write unit tests, which check small components in isolation. The problem is that most bugs don't live in isolated components — they live at the interfaces between components, in the error handling and retry logic that kicks in when something goes slightly wrong, and in the edge cases that appear when two systems built at different times by different people interact in ways nobody planned for.
But the deeper problem is that all these tests — unit tests, integration tests — only test what you thought to test. You write a test for parsing a customer name and you put in "John Smith." It never occurs to you that some people have three names, some have one, some have names with Chinese characters, some have emojis. You don't think of those cases. So when you deploy to production, it breaks immediately.
What you actually want is full randomization. A test that says: here's a model of the world, do your worst, throw every crazy thing at my software and tell me what happens. For entirely practical reasons — tests are slow, brittle, expensive — nobody does that. But that's the kind of testing that would actually help.
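A small version of that "do your worst" idea can be hand-rolled with nothing but a seeded generator. Here `parse_name` is a deliberately naive example function of the "John Smith" variety (an illustration, not code from the interview), and the fuzzer throws the kinds of names the author never thought of:

```python
import random

def parse_name(full_name):
    """Naive parser: assumes exactly two space-separated parts."""
    first, last = full_name.split(" ")
    return {"first": first, "last": last}

# Adversarial inputs: mononyms, three-part names, CJK, emoji, empty.
CASES = ["John Smith", "Madonna", "Gabriel García Márquez",
         "王小明", "🦄 Sparkles", ""]

def fuzz(trials=200, seed=0):
    """Hammer the parser with random names; collect everything that crashes."""
    rng = random.Random(seed)
    failures = set()
    for _ in range(trials):
        name = rng.choice(CASES)
        try:
            parse_name(name)
        except ValueError:  # unpacking fails on anything but two parts
            failures.add(name)
    return sorted(failures)

print(fuzz())  # every name that crashed the parser
```

Frameworks like Hypothesis industrialize exactly this loop, generating inputs and shrinking failures automatically; the sketch above just shows the principle.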
What's the core idea behind simulation testing, and how does Antithesis make it work?
The fundamental insight is that most of those problems go away if you can force the system under test to be fully deterministic. If a test is green, it's green. If it's red, it's red. You can always go back and see exactly what caused a failure. Debugging becomes dramatically easier. And you can apply intelligent search techniques to find rare, weird failure modes.
The catch is that most software isn't deterministic. It talks over networks where routers take unpredictable amounts of time. It runs on Linux kernels that schedule processes however they want. Real systems have all kinds of non-determinism baked in.
At FoundationDB, we solved this by being extremely disciplined — we wrote all our software to run in a single process, one thread, no outside dependencies, fully simulated. That was powerful but completely impractical for anyone else. Most teams have existing systems running in Java or Go, with frameworks and dependencies they don't control.
So at Antithesis, we went even further. We built a hypervisor — a VM that can run any operating system and any software inside it, fully deterministically. Whatever you're running, if you give it the same inputs, you get the exact same execution every time. Even a one-in-a-million race condition will reproduce perfectly. That's the foundation. Once you have it, you can do almost anything: inject any failure, explore any rare scenario, and when you find something interesting, rewind time to exactly that moment and examine it from every angle.
Can you walk through what this looks like in practice?
Sure. Say you're building something like Kafka — a durable message queue. The core property is simple: messages go in, messages come out, in the same order. How do you test that today? You write an integration test that puts in 1, 2, 3 and reads back 1, 2, 3. You've tested one scenario. You have no idea if the property holds when a machine dies halfway through, when there's a network partition, when the JVM garbage collector kicks in and a process falls asleep, when a data center loses connectivity.
All of those things happen in production. Every day.
What you really want is not one test — you want a test generator that keeps finding new situations and verifying the property holds in all of them. That's what Antithesis does. You write that same simple test. We run it millions of times, varying the timing, the network conditions, the machine failures, the DNS problems — everything we can think of, and everything you didn't think of. If in any single scenario your messages don't come back in order, we tell you exactly what sequence of events led to that failure.
The more interesting scenarios are the combinations. Maybe the client pauses for a long time before sending the third message, and that coincides with a network partition at exactly the right moment. Our system learns from how it explores your software and gets progressively better at finding the combinations that actually break things.
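The "test generator" idea can be illustrated with a toy broker whose only fault is a lost acknowledgement (a made-up bug for illustration, not Kafka or Antithesis internals): the client retries, the broker stores a duplicate, and the ordering-and-exactly-once property breaks only in runs where the network misbehaves.

```python
import random

class ToyQueue:
    """Toy broker: the network can drop the ack, so a client retry
    re-sends a message the broker already stored -- a duplicate bug."""
    def __init__(self, rng, ack_drop_rate=0.2):
        self.rng, self.ack_drop_rate = rng, ack_drop_rate
        self.log = []

    def send(self, msg):
        while True:
            self.log.append(msg)                  # broker stores the message
            if self.rng.random() >= self.ack_drop_rate:
                return                            # ack arrives; client stops
            # ack lost in the network: client retries, broker stores again

def check_fifo_property(seed):
    """One simple test, run under one randomized failure history."""
    rng = random.Random(seed)
    q = ToyQueue(rng)
    sent = ["m1", "m2", "m3"]
    for m in sent:
        q.send(m)
    return q.log == sent  # property: same messages, same order, once each

# The test *generator*: the same simple test, under many random histories.
failing = [s for s in range(200) if not check_fifo_property(s)]
print(f"{len(failing)} of 200 simulated runs violate the ordering property")
```

A single happy-path run passes; only the runs whose seeds happen to drop an ack expose the bug, and each failing seed replays that exact history on demand.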
How does this compare to chaos engineering?
I joke with people that it's like chaos engineering, but without making your actual customers feel pain or setting off your pager. You can do fault injection on your own computer, at any time, and nobody else ever finds out about it.
But there's a deeper difference. Chaos engineering in production is similar to production debugging — you inject a failure, something horrible happens, and you better hope you had the right log statement in there. Otherwise you may never reproduce that situation again. With simulation testing, every failure is deterministically replayable. You find something, you rewind, you look at exactly what happened, you make it happen again. That's a fundamentally different debugging experience.
Where does Antithesis fit in the development workflow?
It can plug in at any stage where you'd run tests. Where exactly depends on how fast your tests are and how much compute you want to spend.
A lot of customers start with nightly runs. All the day's code is merged, you run a big parallel test overnight for twelve hours, and in the morning you see what was introduced the day before. MongoDB still does this. It's a solid starting point.
If you optimize your tests and want faster feedback, you can run on every PR for a shorter window. That catches bugs during the workday rather than the next morning.
The thing none of our customers do yet — but that we did at FoundationDB, and that I want us to get to — is testing before you open a PR. You write some code, you run it through the simulator yourself, and you find out if it has bugs before you bother your colleague with a review. That's the model that makes this really transformative.
What does AI-generated code mean for reliability?
I don't have anything particularly original to say here. These tools are genuinely powerful — you can write a lot of code very fast. But there are two real problems.
One is that you're often writing so fast you don't fully understand what you've produced. People say they read every line Claude gives them. Some of them are lying. And even if you do read it carefully, it's still not inside you the way code you wrote yourself is. If you have to come back six months later and debug it, or change it, the cost hits you all at once.
The other problem is maintenance. AI tools work really well for throwaway scripts, for anything where you can tell at a glance if it's working. They work less well for long-lived, production software where rare edge cases and failure modes matter. That's the barrier right now.
I think powerful testing can move that frontier. If you have something that can rapidly validate AI-generated code against a full range of real-world conditions, you can actually trust the output more. That's what would make AI coding tools fulfill their real promise — not just writing code faster, but building systems you can actually rely on.
Where do you see software development heading?
One way to think about AI coding is that it's another step on a journey we've been on since compilers were invented. A compiler takes something closer to human language and turns it into machine instructions. AI coding tools do something similar, but less deterministic. If these tools get reliable enough — maybe through powerful testing, maybe through formal methods — the developer's job moves up a level. Instead of writing loops, you're describing what functions should do. Eventually maybe you're specifying high-level business requirements and the transformation happens automatically.
The obstacle is specification. If you've ever been a product manager, you know that telling an engineering team what you want and getting back what you meant are two very different things. That problem doesn't go away with AI.
What I find interesting is that specifying software and testing software are actually the same activity. In both cases, you're saying what you expect — what should happen in a given situation, what should never happen. If you could write one description and have it both generate the software via an LLM and generate the property-based tests to validate that software, and the person doesn't have to do any additional work — that's a genuinely exciting vision. It's test-driven development where you don't have to write the tests. And I think nobody would complain about that.