A centralized SRE team of 600 engineers as the first line of defense for every incident works, until the business asks you to spread that responsibility across 6,000. Cliff Snyder, senior SRE at Multimedia LLC and a decade-long veteran of LinkedIn's SRE org, walks through the 18-month project to democratize incident response: replacing a patchwork of Jira, Google Docs, and in-house tooling with a purpose-built platform designed for engineers who may have never been on-call before. He gets into the tradeoffs that shaped the rollout — flexibility versus opinionation, exec pushback on even renaming "postmortem" to "retrospective," the reporting miss that almost pushed launch back, and why MTTR is the wrong number when a single incident can burn a thousand engineer-hours. An honest look at what it actually takes to move incident management from tribal knowledge to federated practice at scale.
Why should an organization have a dedicated incident management platform at all?
For a larger organization, the problem is that incident management tooling tends to grow organically over time: different tools and systems accrete into an amalgamation that isn't really serving the business anymore. I'll steal a former colleague's phrase: you find yourself being successful despite the tooling, not because of it. A purpose-built platform that isn't cobbled together out of Jira, in-house systems, and Google Docs is really, really valuable in that situation.
At a smaller or less mature org, the platform provides a different kind of value — it brings standards and practices, sane defaults around severities and workflows, all that incident management goodness. So at either end of the spectrum, large established shop or smaller shop, a dedicated platform can provide huge value.
Do orgs usually build or buy?
It really depends on the org. Places that lean heavily toward "build" usually have large engineering orgs where engineers like building things — so they build. Places that tend to buy anything that isn't part of the core business don't want to spend valuable engineering cycles on something that's important but out-of-band.
Walk us through the incident management transformation you led.
We had a relatively mature incident management process, but it was very centralized. A NOC team would help coordinate, and the SRE team was the first point of escalation for any given incident. The place we wanted to get to was federating operational responsibility across the broader engineering org — putting SWEs directly in the line of fire for their own services.
In that brave new world, we needed systems that didn't depend on the tribal knowledge of battle-hardened, trial-by-fire SREs. We were going to federate this out over thousands of engineers who might be brand new to the company, brand new to their careers, or who had never been on-call before. The question became: how do we make them successful at self-managing incidents without first being experts in the process?
What was the scale of this project?
Around 6,000 engineers across the engineering org. The SRE org was about 600, depending on how you counted. So we were going from 600 SREs who already understood on-call and incident management to 6,000 engineers who maybe didn't.
How did you handle the transformation itself, the people side?
The technology is typically the easy part. It's the people and process that end up being really difficult.
Having a mature incident management process going in was both a blessing and a curse. A blessing because it wasn't greenfield: a lot was already sorted in terms of how incidents were managed and how people thought about them culturally. A curse because a lot of people in high places, VPs and execs, had capital-O Opinions about how the new platform should and shouldn't work.
One small example: we wanted to rename "postmortem" to "retrospective" to align with industry best practices. Even that one change got huge pushback from high places, and we ended up keeping the original term. If renaming one thing is that hard, imagine how thorny the conversations got for everything else.
Did you have to compromise on the platform's value to keep the peace?
Yes, we made compromises. But once a platform is in place, it doesn't have to stay static. You have the opportunity over time to demonstrate where a specific change would have helped, and then to make it. I tried to stay optimistic: change what you can, have the patience to wait for the rest.
One of the benefits we were hoping to get from a purpose-built platform was richer data and insights. The previous cobbled-together system made it really difficult to form a cohesive data story around incidents, and that was one of the things we most wanted to fix.
Flexibility or opinionation: which matters more when you're shopping for a platform?
I'm a little torn. I'd agree you need some flexibility, especially rolling out at a large org with established precedent that the platform needs to support. But opinionation is actually good and useful — if someone's coming in without an established process, they want strong guidance, good industry defaults.
More flexibility means more customization work to turn the platform into something effective. That's part of the reason our project took so long — 18 months end-to-end, including vendor evaluation, budget, legal, and security. Without all those customizations, we could have had a shorter timeline and possibly a more standardized experience at launch.
Did you hit your targets? What numbers did you track?
One thing I wanted to do personally was move away from MTTD and MTTR, mean time to detect and mean time to resolve. There's good research showing those aren't necessarily the most valuable numbers to gather. But we knew we couldn't just scrap them and not have any story around what we were improving; that wasn't going to fly. MTTD and MTTR are probably here to stay in the immediate term.
The way I wanted to reframe it: maybe our MTTD and MTTR are great, but every time we have an incident, we engage a thousand engineers for an hour. That's a really expensive incident. So how do we start thinking about people power, opportunity cost, lost engineering productivity, potential for burnout? Maybe there's one guy who gets called into every incident and is at the end of his rope. Those are the kinds of things I wanted to track.
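To put numbers on that framing, here's a back-of-the-envelope sketch; the headcount, duration, and fully loaded hourly rate are all hypothetical:

```python
# Back-of-the-envelope incident cost in engineer-hours; the headcount,
# duration, and loaded hourly rate below are made-up illustrative numbers.
def incident_cost(engineers_engaged: int, duration_hours: float,
                  loaded_hourly_rate: float = 150.0) -> tuple[float, float]:
    engineer_hours = engineers_engaged * duration_hours
    return engineer_hours, engineer_hours * loaded_hourly_rate

hours, dollars = incident_cost(engineers_engaged=1000, duration_hours=1.0)
print(f"{hours:,.0f} engineer-hours, roughly ${dollars:,.0f} of engineering time")
# Even with a great MTTR, that is a very expensive incident.
```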
Timestamps are easy to collect. Shifting the conversation to something more meaningful is the ongoing battle.
What were the biggest misses from the project?
The biggest miss before launch was not thinking hard enough about how we were going to expose incident data to stakeholders. We focused on enriched data but didn't think enough about reporting flexibility.
As we started roadshowing the platform to execs and running workshops, we kept hearing the same story: "Today I have a TPM run a big complicated JQL query, dump it to CSV, pull it into a Google sheet, pivot-table it — and that's my quarterly reporting. What's the equivalent in the new platform?" We didn't have an answer.
We caught it late. Fixing it pushed launch back a couple of months and required building an in-house system that provided reporting flexibility and apples-to-apples comparisons with the legacy Jira data — so people could still do year-over-year reporting across the migration.
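As a rough illustration of what that apples-to-apples mapping involves, here's a sketch that normalizes one legacy Jira record onto a new-platform shape; the Jira field names, the severity mapping, and the target schema are all hypothetical:

```python
# Map exported legacy Jira issues onto the new platform's record shape
# so year-over-year reports can span the migration. All names hypothetical.
from datetime import datetime

SEVERITY_MAP = {"Blocker": "SEV1", "Critical": "SEV2", "Major": "SEV3", "Minor": "SEV4"}

def normalize_legacy_incident(issue: dict) -> dict:
    """Convert one exported Jira issue into the new platform's schema."""
    return {
        "id": issue["key"],
        "severity": SEVERITY_MAP.get(issue["priority"], "SEV4"),
        # Real Jira timestamps may need custom parsing; ISO format assumed here.
        "detected_at": datetime.fromisoformat(issue["created"]),
        "resolved_at": datetime.fromisoformat(issue["resolutiondate"]),
        "service": issue.get("service", "unknown"),
        "source": "legacy-jira",  # lets reports blend or filter by era
    }

legacy = {"key": "INC-1024", "priority": "Critical",
          "created": "2021-03-04T10:15:00", "resolutiondate": "2021-03-04T12:40:00"}
print(normalize_legacy_incident(legacy))
```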
Post-launch was smoother, though not without drama. An hour before our April 1st launch, the SaaS platform we'd picked had one of the largest outages in its history. We scrambled, it came back up, we got assurances it wouldn't happen again, and we launched on time — but we were sweating.
The other miss: email notifications worked fine with individual addresses in testing, but due to implementation details, they didn't work with Outlook distribution lists or Active Directory groups. That was so easy to test: one group address would have caught it. We ended up burning the midnight oil during launch week to fix it.
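The missing test could have been as small as this sketch; the addresses and SMTP relay are placeholders, not the actual systems involved:

```python
# Notification delivery smoke test covering both recipient types.
# Addresses and relay host are hypothetical placeholders.
import smtplib
from email.message import EmailMessage

RECIPIENTS = [
    "oncall.engineer@example.com",  # individual address: the case we did test
    "incident-team@example.com",    # Outlook/AD group: the case we missed
]

def send_test_notification(recipient: str) -> None:
    msg = EmailMessage()
    msg["From"] = "incident-platform@example.com"
    msg["To"] = recipient
    msg["Subject"] = "[TEST] incident notification delivery check"
    msg.set_content("If this arrives, delivery works for this recipient type.")
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

for rcpt in RECIPIENTS:
    send_test_notification(rcpt)
    print(f"sent to {rcpt}; verify receipt for each recipient type")
```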
Other than that, a million things had to go right for this launch to go smoothly. Missing a hundred of them is a rounding error. It went pretty well, all things considered.
Could AI or MCP have helped with the reporting and attribution problems?
The platform at the time was just putting together AI features — mostly things like "jump into an incident channel and ask the bot for a summary of the latest status."
For reporting, attribution is a hard problem. If I'm a VP with hundreds or thousands of microservices in my product line, and any one of them breaks, it ripples across upstreams and downstreams. Even if an incident wasn't my service's fault, I probably still want to know about the impact. AI could maybe help with "construct a query" or "tell me which incidents actually impacted my space."
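One way to answer "which incidents actually impacted my space" is to walk the dependency graph downstream from the failed service. A minimal sketch, with an invented graph and invented service names:

```python
# Impact attribution via breadth-first search over a service dependency
# graph. The graph below is hypothetical.
from collections import deque

DOWNSTREAM = {  # service -> services that consume it
    "auth": ["feed", "messaging"],
    "feed": ["web-frontend"],
    "messaging": ["web-frontend"],
}

def impacted_services(failed: str) -> set[str]:
    """BFS from the failed service to everything downstream of it."""
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        for consumer in DOWNSTREAM.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# An incident in "auth" ripples into feed, messaging, and web-frontend,
# so owners of those services likely want it on their reports even
# though their code wasn't at fault.
print(impacted_services("auth"))
```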
Another use case we were hopeful about: drop a list of symptoms into an incident channel and let the bot tell you if it's seen similar incidents before, or suggest a fix based on past patterns. We absolutely want to leverage historical incident data to find repeaters, find common patterns, find the systemic issues where one fix solves seven different-looking incidents.
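A minimal sketch of that "have we seen this before?" lookup, using TF-IDF cosine similarity as a stand-in for whatever retrieval a real bot would use; the incident summaries are invented:

```python
# Rank past incidents by textual similarity to a new set of symptoms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = [
    "INC-101: elevated 5xx on feed after cache node eviction",
    "INC-207: login latency spike traced to token service GC pauses",
    "INC-333: feed errors following cache cluster failover",
]

def similar_incidents(symptoms: str, top_k: int = 2):
    matrix = TfidfVectorizer().fit_transform(past_incidents + [symptoms])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sorted(zip(scores, past_incidents), reverse=True)[:top_k]

# The two cache-related feed incidents should rank highest here.
for score, summary in similar_incidents("feed serving 5xx errors, cache unhealthy"):
    print(f"{score:.2f}  {summary}")
```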
But we had interesting conversations about hallucinations. In the limit, maybe the bot tells you "just turn off all of broadcast and there won't be any errors" — technically correct, operationally catastrophic. In a world where we're federating to engineers who haven't been on-call before, a new hire might think that's a perfectly reasonable thing to try. So we wanted to be cautious.
What would you tell someone starting this journey today?
It depends on the org. For us, strong API support and strong integration with third-party and internal systems were non-negotiable. Even though we were buying a purpose-built platform, we knew there'd be integration points internally — that was unavoidable.
At vendor eval time we had an official rubric: non-functional requirements (security, legal), API support, general usability, cost. It had to be a small number of categories — maybe six — because this was what we were taking to execs and VPs for budget approval. They're not going to read a hundred things. Those six were roll-ups of other things we'd noted internally as important to the implementation and rollout.
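The roll-up itself is just a weighted sum across categories; the weights and 0-5 scoring scale in this sketch are illustrative, not the actual rubric:

```python
# Roll many internal criteria up into a handful of exec-facing numbers.
# Categories, weights, and scores are hypothetical.
WEIGHTS = {
    "non_functional": 0.25,  # security and legal requirements
    "api_support": 0.25,
    "integrations": 0.15,
    "usability": 0.20,
    "cost": 0.15,
}

def vendor_score(scores: dict[str, float]) -> float:
    """Weighted roll-up of per-category scores (each 0-5)."""
    return sum(weight * scores.get(cat, 0.0) for cat, weight in WEIGHTS.items())

print(vendor_score({"non_functional": 4, "api_support": 5,
                    "integrations": 4, "usability": 3, "cost": 4}))  # 4.05
```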
Where can people find you?
LinkedIn is the best place. I'm a little old-fashioned about social media — I have accounts elsewhere but don't really monitor them.