De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job

JJ Tang

July 15, 2021

SREs may “own” reliability engineering. But they can succeed in that role only with help from a variety of other stakeholders. If you can’t collaborate and communicate readily with developers, IT engineers and even non-technical teams like PR and legal, you’ll struggle to optimize reliability engineering.

That’s why de-siloing the organization is such a crucial part of managing reliability. Here’s why breaking down the silos that separate SREs from other teams is so important, and practical strategies for doing so.

The risks of siloed reliability engineering

At first glance, you may not think of organizational silos (meaning divisions between different groups or business units that hinder communication and collaboration) as a major challenge for SREs. After all, the SRE role by its very nature is a sort of hybrid one that bridges the gap between development and IT operations, the two main components of a conventional IT organization. SREs are supposed to bring both software engineering and IT Ops skills to the table in order to build as much reliability as possible into the systems they manage.

Yet the fact that the SRE skillset overlaps with that of other disciplines doesn’t automatically eliminate silos between SRE teams and other teams. Those silos tend to persist, for several reasons:

Different goals: SREs don’t share the same goals as other technical teams. The main goal of modern development teams is to release software continuously, not to optimize for reliability. As for IT ops, their focus is on deploying software continuously and responding to incidents effectively when they occur (which is not the same thing as engineering software in such a way that incidents are minimal).
Separate team structures: SREs aren’t necessarily organized as part of development or IT ops teams. SREs tend to exist apart, organized into their own teams, which means they have few if any natural opportunities to interact with other technical teams.
No role in CI/CD for reliability engineering: Perhaps because SREs do different work and have different priorities, they don’t fit naturally into the CI/CD processes that guide the work of other teams. There is no stage of the CI/CD pipeline where SREs somehow insert reliability into the code. Unless SREs actively collaborate with other stakeholders to make reliability a priority across the CI/CD pipeline, it’s easy for reliability to get stuck in its own silo (kind of like security, which is also not a default part of the standard CI/CD pipeline and only gets integrated if you take a DevSecOps approach).
Different measures of success: SREs measure success in terms of metrics like availability and MTTR, and by whether SLOs are met. These metrics may matter somewhat to developers and IT ops, too. But they are generally not as important as other metrics that relate more directly to development and operations work, like application release frequency and performance metrics.
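To make the SRE-side metrics concrete for teams that don't work with them daily, here is a minimal sketch of how availability and MTTR could be derived from a list of incidents and compared against an SLO. The incident timestamps, the 30-day window and the 99.9% target are all hypothetical values chosen for illustration, not figures from any real system.

```python
from datetime import datetime, timedelta

# Hypothetical incidents in a 30-day window: (start, resolved).
incidents = [
    (datetime(2021, 6, 3, 10, 0), datetime(2021, 6, 3, 10, 45)),
    (datetime(2021, 6, 17, 22, 30), datetime(2021, 6, 17, 23, 45)),
]
window = timedelta(days=30)
slo_target = 0.999  # assumed availability objective


def total_downtime(incidents):
    """Sum the duration of every incident."""
    return sum((end - start for start, end in incidents), timedelta())


def availability(incidents, window):
    """Fraction of the window during which the service was up."""
    return 1 - total_downtime(incidents) / window


def mttr(incidents):
    """Mean time to recovery: average incident duration."""
    return total_downtime(incidents) / len(incidents)


observed = availability(incidents, window)
meets_slo = observed >= slo_target

print(f"Availability: {observed:.4%}")
print(f"MTTR: {mttr(incidents)}")
print(f"SLO met: {meets_slo}")
```

With these sample numbers, two hours of downtime in a 30-day window misses the assumed 99.9% objective, which is the kind of result that developers and IT ops rarely see unless it is shared with them.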

The disconnect between SREs and other technical roles matters, of course, because it hampers the ability of the IT organization as a whole to manage reliability efficiently and effectively. When different parts of the IT organization focus on different pursuits and place different priority levels on reliability engineering, you end up with teams that work toward their own individual interests, rather than optimizing outcomes for the business as a whole.

Beyond technical: Reliability engineering and the business

It’s worth noting that it’s not just silos within the IT organization that make it harder to optimize reliability engineering. Divides between SREs and non-technical business units can be just as problematic.

For instance, SREs don’t typically work alongside or in close collaboration with PR and legal teams. But when an incident occurs, communicating with these teams can be paramount, especially if the incident affects customers in a major way. Legal can help SREs determine what the contractual impact of an incident is, or which service disruptions to prioritize in order to minimize the fallout of SLA violations. Likewise, PR can work with SREs to formulate statements about disruptions and estimated recovery times.

But again, just because SREs should collaborate with these teams doesn’t mean they do. These non-technical teams are typically even more siloed from SREs than are developers and IT engineers.

Breaking the silos: 4 practical strategies

So, that’s the problem. The real question is: How do you fix it?

The following four approaches can increase collaboration between SREs and other stakeholders in reliability engineering.

Include all teams in incident response playbooks

Your incident response playbooks probably focus first and foremost on the technical procedures that teams will follow to restore service.

But ideally, the playbooks will also cover other operations -- like communications work by the PR team and contract assessment by the legal team -- that are necessary to ensure a holistic response to incidents. When you build these processes into your playbooks, you make it easier to achieve close collaboration between SREs and other stakeholders.
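One way to picture a playbook that covers every team is as a list of steps, each with an owning team. The sketch below is only an illustration: the `Step` structure, the step descriptions and the rule that PR and legal steps apply to customer-facing incidents are assumptions, not a prescribed playbook format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    owner: str                          # team responsible for the step
    action: str                         # what that team does
    customer_facing_only: bool = False  # assumed rule: skip for internal incidents


# Hypothetical playbook mixing technical and non-technical steps.
PLAYBOOK = [
    Step("SRE", "Triage alerts and restore service"),
    Step("Development", "Roll back or patch the faulty release"),
    Step("PR", "Publish a statement with estimated recovery time",
         customer_facing_only=True),
    Step("Legal", "Assess contractual and SLA impact",
         customer_facing_only=True),
]


def steps_for(customer_facing: bool) -> list[Step]:
    """Return the playbook steps that apply to this kind of incident."""
    return [s for s in PLAYBOOK
            if customer_facing or not s.customer_facing_only]


for step in steps_for(customer_facing=True):
    print(f"{step.owner}: {step.action}")
```

Keeping the non-technical steps in the same structure as the technical ones means PR and legal are part of the response by default, rather than being contacted ad hoc.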

Include all teams in testing

SREs often perform various kinds of tests -- like FMEA assessments -- to evaluate the reliability of systems they manage.

But these tests need not be the responsibility of SREs alone. Other stakeholders from across the IT organization and beyond can and should play a role in identifying reliability weak points and assessing the impact of potential failures within the system.

When you include everyone in reliability testing, you build a stronger culture of shared responsibility.
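As an illustration of how input from several teams can feed one assessment, the sketch below uses the conventional FMEA risk priority number (RPN), which multiplies severity, occurrence and detection scores, each on a 1 to 10 scale. The failure modes, the teams reporting them and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    reported_by: str  # team that identified the weak point
    severity: int     # 1 (negligible) to 10 (catastrophic)
    occurrence: int   # 1 (rare) to 10 (frequent)
    detection: int    # 1 (easily detected) to 10 (hard to detect)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical entries contributed by different teams.
failure_modes = [
    FailureMode("Database failover does not trigger", "SRE", 9, 3, 4),
    FailureMode("Release skips canary stage", "Development", 6, 5, 3),
    FailureMode("Status page not updated during outage", "PR", 5, 6, 7),
    FailureMode("SLA terms unclear for partial outages", "Legal", 7, 4, 8),
]

# Highest-risk items first, regardless of which team raised them.
ranked = sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)

for fm in ranked:
    print(f"{fm.rpn:>4}  {fm.reported_by:<12} {fm.description}")
```

In this made-up example the two highest-ranked risks come from legal and PR, which shows why an assessment limited to SREs can miss failure modes that matter to the business.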

Track the impact of changes on reliability

Ideally, every time a developer writes a new line of code, an IT engineer modifies a production server or a lawyer changes the terms of a customer contract, reliability should be a consideration. But it’s often not, especially within organizations where reliability is seen as something that only SREs have to manage.

To change this, require all stakeholders to assess the consequences for reliability each time they make a change. When thinking about reliability becomes second nature for everyone, you end up with a healthier reliability culture and fewer barriers between SREs and the rest of the organization.

Maintain a blameless culture

Finally, even as you work to make all stakeholders assume ownership of reliability, remember that your culture should nonetheless remain blameless. Just because everyone shares in reliability engineering doesn’t mean that any one group needs to be held responsible when something goes wrong.

Maintaining a blameless culture surrounding reliability is important for ensuring that stakeholders see reliability not as a burden imposed on them, but as an opportunity to collaborate with other teams and reinforce collective success.

Conclusion: Everyone owns reliability engineering

SREs may specialize in reliability engineering, but ultimately, every stakeholder within the business plays a role in building and managing reliable systems. The key to getting the most out of reliability engineering is gaining buy-in from across the organization for collaborating and communicating with SREs, and breaking apart the silos that have conventionally isolated SREs from everyone else.