Code Is Cheap, Reliability Isn’t: Owning Production in the AI era w/ Swizec Teller

Swizec Teller
📘 Wrote Scaling Fast
🔁 Debugged a 2AM redirect loop
🧠 Loves system thinking
😄 Not an SRE (by title)

Listen on Spotify and Apple Podcasts!

In this episode, Swizec Teller, author of the bestselling Scaling Fast: Software Engineering Through the Hockey Stick, makes a bold claim: code is cheap, reliability is not.

As AI coding tools accelerate feature development, the real competitive advantage shifts to running systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

From debugging redirect loops to cutting logging bills by an order of magnitude, this conversation dives into what actually makes systems resilient at scale.

Key Topics Discussed

  • Why users don’t buy code, they buy reliable services
  • The “Feynman computer disease” and AI coding rabbit holes
  • Why babysitting AI agents can kill productivity
  • The subtle art of writing useful logs
  • Debugging distributed systems and redirect loops in production
  • Benchmarks for AI SRE agents vs real-world reliability work
  • Accountability in engineering (“sign it with your phone number”)
  • Why SLAs matter more than how the code was produced
  • The future of engineering in a probabilistic, AI-assisted world

In Swizec’s words

Why do you argue that code is cheap, but reliability isn’t?

For the record, I’m not an SRE. I’ve just always worked at small startups where you own your own software and you’re responsible for keeping it running in production.

As engineers, we like to think people buy our code. But most people don’t want a pile of code — they want a service that just works. They don’t want to think about it. It should be invisible.

A pile of code is like a hobby car. It’s fun to work on — until you need it tomorrow morning to get to work. Then you just want it to start. You don’t want to wonder whether someone tightened the screws yesterday.

It’s easy to build the first 90%. The other 90% — turning it into a reliable service — that’s the hard part. That’s the secret sauce. That’s why big tech companies make the big money.

What is the “Feynman computer disease” — and how does it show up with AI coding?

Feynman described this idea that once you have a computer, you start automating everything just because you can. It’s fun. You keep tinkering. You forget that sometimes there’s already a simple solution.

I see that a lot with AI coding and agentic workflows. You get excited about what the agents can do. You spend hours trying to get it to work perfectly — when it might’ve been faster to just write the thing yourself.

This happens to me all the time. I think, “Oh, this is a Cursor one-shot.” It gets 95% right. Then I tweak the prompt. Then I try again. Three hours later it works… and I realize I could’ve done it in 10 minutes.

It’s like a slot machine. Just one more pull. This time it’ll work.

How do you avoid falling into the AI “slot machine” trap?

I don’t have a perfect solution. But I try not to babysit agents.

If I’m watching the agent think, I’m not actually saving time. I’m pair programming — just with slightly less brainpower. I find it hard not to read along when text is scrolling on the screen.

So I use background agents from Slack. I tell it to do something and then I go do other work. When it’s done, I get a message.

If you’re going to use agents, don’t babysit them. Let them work in the background.

Can LLMs realistically automate SRE work?

It depends what you mean.

There are already benchmarks for SRE agents. I think the best one I saw recently could complete around 30% of tasks successfully. That’s impressive — but those are probably the repetitive, annoying tasks.

The hard part of SRE isn’t just reading logs. It’s understanding the whole system at a macro level. Distributed systems, microservices, infrastructure, business logic — how everything fits together.

There’s subtle skill involved. I’ve seen junior engineers add logs that look fine in a PR. But at 2AM, when you’re paged, those logs are useless. Meanwhile, an experienced engineer’s logs give you exactly what you need.
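A toy sketch of that difference — the event fields and messages here are invented for illustration, not taken from the episode. The bare log is true but useless when you're paged; the structured one carries the IDs and status you need to trace the failure across services:

```python
import json


def log_bare(event):
    # What a "looks fine in the PR" log often amounts to at 2AM.
    return "error happened"


def log_with_context(event):
    # Same failure, but with the signals an on-call engineer actually needs:
    # what failed, for whom, and enough identifiers to follow the request.
    return json.dumps({
        "level": "error",
        "msg": "payment capture failed",
        "order_id": event["order_id"],
        "user_id": event["user_id"],
        "upstream_status": event["upstream_status"],
    })
```

The point isn't JSON versus plain text; it's that the useful log is written for the reader who doesn't have the code open in front of them.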

That instinct comes from experience — from getting punched in the face by production systems.

What makes SRE skill so hard to automate?

It’s system-level thinking.

We once had a Lambda producing 1.5 million logs per day. It was blowing up our logging bill. Turns out there was basically a print statement every two lines of code.

You don’t need a full trace of everything happening in production. You need the right signals.
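One way to keep the signal without the volume is to let every warning and error through but sample the chatty per-request noise. This is a rough sketch of that idea, not what Swizec's team actually did:

```python
import logging


class SampleFilter(logging.Filter):
    """Pass every WARNING-and-above record, but only 1 in `n` INFO records.

    A crude guardrail against print-statement-every-two-lines logging:
    errors always get through, routine chatter is sampled down.
    """

    def __init__(self, n=100):
        super().__init__()
        self.n = n
        self.count = 0

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop the signal
        self.count += 1
        return self.count % self.n == 0  # sample the noise
```

Attaching this to a handler with `handler.addFilter(SampleFilter(n=100))` would cut routine log volume by two orders of magnitude while keeping every error.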

We also had a case where something worked in dev but not in prod — even though Terraform made them “identical.” It ended up being one NGINX variable causing an HTTPS redirect loop.
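The episode doesn't name the variable, so this is only an illustration of one common mechanism behind such loops: a TLS-terminating proxy that doesn't forward the original scheme, so the app always sees plain HTTP and issues a redirect on every request. A hypothetical sketch of the app-side decision:

```python
def should_redirect_to_https(headers, request_scheme):
    """Decide whether to redirect a request to HTTPS.

    Behind a TLS-terminating proxy the app sees plain HTTP, so it must
    trust the forwarded scheme header. If the proxy drops or mis-sets
    that header, this returns True for every request: a redirect loop.
    """
    scheme = headers.get("X-Forwarded-Proto", request_scheme)
    return scheme != "https"
```

This is why "identical" dev and prod environments can still diverge: the loop lives in one proxy header, not in the application code Terraform provisioned.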

You still need to know what question to ask. Even with ChatGPT, you need the right prompt. That requires understanding.

How does accountability change in an AI-driven world?

In my book, I talk about ownership. Engineers need to own their systems.

I like to say: I reviewed your PR. It probably works. And we have your phone number if it breaks at 2AM. If you can sign it with your phone number, we’ll ship it to prod.

LLMs don’t have a phone number.

There’s an old IBM quote: a machine can’t make a decision because it can’t be held accountable for that decision.

You can use agents. You can use AI for debugging. But somewhere, there needs to be a human whose job it is to make sure it works.

Does it matter whether the code was written by AI or a human?

It doesn’t matter how you produce the code. It matters whether you meet your SLAs.

In the B2B world, reliability comes with legally binding contracts. If you promise five nines and you don’t deliver, it costs millions.

Customers don’t care if AWS went down. They care that your product didn’t work. They just want their coffee. They don’t care about your supply chain.

If you have unreliable vendors — or unreliable AI — that’s still your responsibility.

Should developers start thinking more like SREs?

Yes.

Code is becoming cheaper and more probabilistic. There’s going to be more of it. Less inspected. Less fully understood.

That means reliability practices matter more: tests, observability, guardrails, good architecture.

All the fundamentals still apply. Factor your code well. Build vertical services that own business processes. Reduce the context required to ship a feature. Make systems scream loudly when something breaks.

Those principles help humans write better code — and they help LLMs too.

Will AI reduce the number of software engineers?

I think we’ll need more.

We’ve been trying to eliminate engineers since COBOL. Spreadsheets were supposed to do it. No-code was supposed to do it.

What actually happens is people build prototypes. Then they want someone else to own it in production.

When it’s time to run a multimillion-dollar system, someone needs to be accountable. Someone needs to get paged.

There’s going to be more code. That means more people needed to run it.

Where to Find Swizec