Every pilot is ready for engine failure: are your engineers?

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident.

Hamed Silatani, co-founder and CEO of Uptime Labs, and former Head of Reliability Engineering at IG Group, has spent two decades watching engineers learn incident response the hard way: alone, under pressure, with no training.

A moment in a hospital delivery room, where a team of nurses responded to a crisis with clockwork precision, crystallized what he'd long suspected: IT spends heavily on tooling and process but almost nothing on training the humans who have to use them under fire.

In this conversation, Hamed breaks down the three skill categories that matter in incident response, explains why communication is always the first pain point leaders name, and makes the case that AI-driven complexity is about to make human incident skills more important, not less.

How did the idea for Uptime Labs come about?

If I really go back to the genesis, it starts around 2009. I mistakenly deleted an entire Oracle database at my employer: purely a missing property, a delayed cascade. I was expecting to get fired the next day, but instead I was praised for what I did. The company was in the middle of a major incident, a difficult situation with no other options. No one else had an idea. And the act of that manager praising me got me thinking: there's an entirely different set of skills and leadership involved in incidents.

I leave that company and join a financial services trading firm in the UK. I get the job in 24 hours, which was the dream of every software engineer in London: working in the City. So I'm thinking, "Am I too good and I don't know it, or is there fine print I missed?" Sure enough, there was fine print. It was a new team and the job was to support the trading system 24/7, because no other engineer wanted to. Week three, I was handed a BlackBerry. This was before Rootly existed, and people carried BlackBerrys for notifications. And sure enough, at 3 AM it goes off. By the time I got to my computer, I had a call from a trader in APAC shouting about how much this was costing them. I didn't even know where to start. The horror of that night, feeling alone, never left me.

As my team grew, I watched every engineer go through the same journey: learning how to deal with incidents on their own, the hard way. And the skills involved aren't technical at all; they're an entirely different set. Fast forward, I climb the management ladder, and one night I find myself being the executive putting engineers under pressure because now I had to report back up to the CEO. That was the moment I thought: this is a losing game. No one is winning.

Then, during my son's birth — a very painful, lengthy three days — a nurse walks in, notices something wrong on a monitor, and presses a red button next to the bed. In a split second, the room was filled with medical staff working like clockwork. They knew exactly what to do, and twenty minutes later I was holding my son. Normal people take time to admire that. I did too — but I also had a light bulb go off: if healthcare can do this, we're definitely missing something in IT.

So I started looking into safety-critical industries and realized healthcare learned many of its practices from aviation. It became obvious: in IT, like those industries, we spend a ton of money on tooling and processes. But they also spend heavily on constantly training their staff to deal with emergencies — communication, coordination, decision-making. That was missing in IT. I tried running simulated incidents within my organization — game days, tabletops — but it was expensive and hard to scale. One a month, at best, for a thousand engineers? Not nearly enough. And that's where I thought: there's a product here. I don't want any other engineer to go through the pain I went through. That became Uptime Labs — essentially what the flight simulator does for pilots, we do for engineers.

What does incident management training look like in industries where it's critical?

Take aviation. Lives are on the line. During my early research, I was interviewing a pilot and they said something that never left me. We were sitting in an open space in London, and they said: "Look up at the sky. How many of those pilots flying planes with people's lives in their hands have experienced engine failure? A very, very small percentage. But how many of them are ready to deal with one if it happens? All of them."

Aviation has a very strong culture of human factors. Pilots and crews regularly train on flight simulators. They practice scenarios that could happen, and the focus isn't just the technical skills of flying the plane — it's how to communicate, how to make decisions under maximum uncertainty. Pilots do have runbooks for certain issues, but when you're up there with an engine failure, you don't have time to flip to page one. You need to be able to act. You want to keep all your cognitive capacity for what matters most: aviating the plane.

They have a universal language — "aviate, navigate, communicate" — that everyone in the industry understands, no matter where you go. In IT, we don't have that common language. We try to borrow some of it, but every business is slightly different. The truth is, responding to incidents and emergencies under uncertainty is, at a high level, very similar regardless of the industry. The skills and practices from aviation, healthcare, or IT — when you look at them side by side, there's enormous overlap.

What skills should incident responders master?

I categorize them in three broad buckets. First is domain knowledge — knowing your environment. Understanding that a small application or server running somewhere in AWS that no one thinks about, if it has a problem, might take down customer sign-ins or payments. That comes with experience of being in the environment.

Second is understanding how technology works and fails. We're all supporting distributed internet systems with layers of network, compute, and data, all connected. At the level of principles, the failures are similar: load issues, network saturation, deadlocks at the application layer. Understanding those layers and getting exposure to how they fail is an important skillset.

The third category is the soft skills, and this is the one that is new to most of us in IT: communication, coordination, decision-making. Incidents are situations of maximum uncertainty. As time passes, the level of coordination needed expands: more people get interested, more people join. Very rarely can one person solve it alone, so you need to convey context to others. You need to know what information different parties care about. There's a core group actually working to resolve the incident, and many other people who for valid reasons want to know what's going on but aren't resolvers. How do you balance the time you spend sharing information with people who aren't fixing the incident versus working with the people who are? Because every minute not spent resolving is a minute the incident runs longer.

At a high level: communication, coordination, decision-making, and sense-making. When an incident happens, your view of how the world works has just broken down. Software isn't behaving the way you expect. Making sense of that is critical.

Which of those skills is the industry lacking the most?

Hard to say definitively, but when I talk to people, and I talk to a lot of people about their incidents, communication comes up first. Not necessarily because it's the most lacking, but because it's the most visible and most painful. Executives are desperate to know what's going on. When they don't hear updates that give them certainty and answer questions rather than raising new ones, they start poking around: talking to that manager, pulling people in. It creates a vicious circle that takes responders away from actually driving the incident. Communication comes up in almost every conversation I have.

The other thing that comes up, in various ways, is engineers freaking out, not knowing what to do. The term "freaking out" came up three times in my conversations just today. Staying calm and composed during incidents is key. And that takes being exposed to many, many incidents so you know it's okay not to know what's going on, and that there's a path to fixing it.

How can engineers develop these skills?

There's a range of ways — I'll start with the most painful and work up. One way is by experiencing major incidents. Being thrown in the deep end. For me, the best learning happened when there wasn't a more senior person available and I had to act as one. It's scary, but it's the best opportunity. A common phenomenon is that when a major incident happens, the heroes — every organization has them — get pulled in and handle everything. Juniors just get to watch. So the first path is painful and takes many years.

The second way is shadowing incidents, but there's an opportunity cost. Software engineers are expected to build features and ship code, so not all organizations proactively push people to shadow. And when you're shadowing as a junior, you see the output of a senior person's decisions, but you never get inside their head. You don't understand why they made those calls.

The next best thing is game days and tabletop exercises. These are effective when they happen, but they're really hard to run. I tried for many years — one a month at best. You need to design scenarios, which requires understanding incident complexity, patterns, behaviors, and the skills you want to develop. That's a whole field — resilience engineering folks have spent years studying it. But any exercise that puts you in a situation of uncertainty where you have to make decisions on your feet, communicate with teammates, and solve a problem under time pressure — that's good.

And then there are realistic simulations: a safe environment that feels real enough to get you to behave the way you would in an actual incident. I went through all the other paths first, and I knew I didn't want any other engineer to suffer through what I did. That's what Uptime Labs is.

There's also classroom-style incident response training from solid people in the market. Those are great for theory. But there's a big, big gap between theory and action.

How is AI changing incident management?

Very topical question. I was at SREcon in Seattle a couple of weeks ago — there was a lot of great conversation on this. The reality is AI has already changed how we build software and work. The skills needed as an engineer are shifting.

The way I look at it: the amount of code being generated is increasing relative to the number of engineers. So system complexity is naturally increasing. With that increased complexity, the skills you need to decompose it when nothing makes sense — dealing with the fear of the unknown, sense-making — become even more important.

AI SRE tools are powerful when used to take some of the toil off incidents. They give you more time to think. Some are already moving to a stage where trivial incidents get handled and fixed before you even know about them, which is great. But what that means is the incidents that do hit humans are the really hard ones. Control gets transferred to you with: "We tried all the easy things — now hold the wheel and save us." At that point, the training you need to manage distress, understand the situation, and make sense of it is even more important.

If anything, incident management will require more training, more focus on sense-making, communication, and coordination. The technical troubleshooting might get handled by AI, but here's the thing: even if there's just one incident that still requires a human, you have to train everyone, because you don't know when that one incident will hit. And the paradox is that the fewer incidents you get, the less natural practice you have. Skill atrophy is real. We're all going to see more productivity, meaning more software generated per human, which means more complexity. We need to prepare for that.

Where does AI fall short as an incident responder?

I ran an experiment a couple of weeks ago. I prompted an LLM to act as an incident manager, then had one of our drills on the other side and copy-pasted messages between the two. The agent actually did pretty well for most of it. Then it hit the key decision point: the scenario unfolds so you have to decide whether to wait for a vendor to fix a data center issue or fail over to another data center.

The AI incident manager suggested disaster recovery, which is fair enough. Then I challenged it: "But if we fail over, the original site might get fixed and we lose time." It said, "Oh yeah, you've got a very good point. Let's stay here." I argued back the other way, and for ten minutes it was flip-flopping back and forth.

The point is: another responsibility that remains for humans is taking accountability and making decisions. Ownership.