A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.


The SRE role is one of the most fundamental jobs in incident response and reliability management. That much you may know.

But what does an SRE do, exactly? How is an SRE different from a developer, a DevOps engineer and other technical roles? Are SREs independent teams, or do they work as part of other teams?

This article answers these and other questions in order to provide a complete definition of an SRE. Keep reading for tips on understanding what an SRE actually does, and how to help the SREs in your organization be the best they can be.

So, what is an SRE?

An SRE, or Site Reliability Engineer, is an engineer whose main role is maximizing the reliability of IT systems.

The SRE role is part and parcel of the discipline of Site Reliability Engineering (which, somewhat confusingly, is also represented by the acronym SRE). An SRE, then, is someone who specializes in Site Reliability Engineering within a broader IT organization.

It’s worth noting that the term “Site” within Site Reliability Engineering can be misleading because it implies that SREs only manage the reliability of websites (or, possibly, a local office, if you take “Site” to refer to a worksite or on-premises location). In reality, SREs can help manage any type of system, including but certainly not limited to websites. 

What do SREs do?

Broadly speaking, the job responsibilities of SREs can be broken down into two main categories.

First, SREs take the lead in ensuring that IT systems are designed to be as reliable as possible before they are deployed. An SRE might help developers plan the optimal microservices architecture for maximizing the ability of an application to resist failures, for example. Or, an SRE could help development and IT teams decide which public cloud or clouds to use to host their apps, based on the SLA guarantees and performance records of the various clouds. The goal of activities like these is to minimize the risk that systems will fail or underperform in the first place.
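To make the SLA comparison concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it permits over a 30-day period. The function name and the sample percentages are illustrative assumptions, not figures from any particular cloud provider.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime an availability SLA permits over `days` days."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Hypothetical SLA tiers, used only for illustration.
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% availability allows about {minutes:.1f} minutes of downtime per 30 days")
```

Seen this way, the gap between two SLA tiers that look nearly identical on paper becomes a tangible difference in tolerated downtime, which is the kind of input an SRE might weigh when comparing hosting options.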

Second, SREs play a leading role in responding to incidents when something does go wrong. Although incident response teams include many other roles (like communications leads and customer support leads) in addition to SREs, SREs are typically the experts who oversee the core technical components of incident response.

SRE is a flexible role

SREs can do a variety of other things, too, that don’t fall cleanly into either of the categories described above.

SREs could help Quality Assurance engineers write tests in order to validate the reliability of applications before deployment. They could work alongside IT engineers to perform chaos engineering or interpret monitoring data and solve complex application performance issues, even if those issues aren’t serious enough to be designated as incidents. They could even play a role in deciding which software developers and IT engineers to hire, based on the experience and expertise that SREs deem critical to achieving a strong track record of reliability.

At the end of the day, the SRE role is a highly flexible one. It tends to be more expansive and less strictly defined than jobs like software engineering or IT support. The ability to be agile and apply creative solutions to reliability challenges is part of what makes SREs so valuable within a broader IT organization.

SREs and performance management

One facet of SRE work that can be a little confusing is the role that SREs play in managing performance, as opposed to reliability.

Reliability and performance are distinct but closely related concepts. Reliability is the measure of a system’s ability to deliver adequate levels of functionality. Performance, meanwhile, measures how well a system achieves its intended functionality.

A system could be reliable in the sense that it meets its basic functionality requirements by remaining available and generally responsive. But at the same time, it could underperform because it handles requests more slowly than customers would like.
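The distinction can be sketched in a few lines of Python. The example below computes an availability figure (a reliability signal) and a 95th-percentile latency (a performance signal) from the same request log. The Request type, the sample data, and the choice of the nearest-rank percentile method are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool     # did the system return a correct response?
    latency_ms: float   # how long the response took


def availability(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (a reliability signal)."""
    if not requests:
        raise ValueError("requests must not be empty")
    return sum(r.succeeded for r in requests) / len(requests)


def p95_latency_ms(requests: list[Request]) -> float:
    """95th-percentile latency via the nearest-rank method (a performance signal)."""
    if not requests:
        raise ValueError("requests must not be empty")
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(95 * len(latencies) / 100)  # 1-based nearest rank
    return latencies[rank - 1]


if __name__ == "__main__":
    # Hypothetical log: every request succeeds, but a fifth of them are slow.
    log = [Request(True, 120.0)] * 80 + [Request(True, 900.0)] * 20
    print(f"availability: {availability(log):.1%}")
    print(f"p95 latency: {p95_latency_ms(log):.0f} ms")
```

In this hypothetical log the service is fully available, yet its 95th-percentile latency is 900 ms: reliable by the first measure, underperforming by the second.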

In general, SREs tend to focus on reliability first and foremost. Their main goal is typically ensuring that their organization maintains the basic levels of functionality that it commits to in its SLAs and SLOs. However, because reliability engineering is closely related to performance management, SREs also typically support performance optimization work.

SREs vs. developers, IT engineers and DevOps engineers

There is some debate regarding exactly how SREs should relate to other technical roles, like developers, IT engineers and DevOps engineers.

In general, most organizations treat SREs as a separate team with a unique set of skills and priorities. However, because SREs typically need a blend of software development and IT engineering skills to do their jobs well, it’s not unheard of to integrate SREs directly into IT or development teams.

As for the differences between SREs and DevOps engineers, that’s a weighty subject. Some folks would tell you that SRE and DevOps mean essentially the same thing. But the general consensus is that these are somewhat different roles because SREs rely more heavily on software engineering skills to engineer reliability, whereas DevOps engineers lean on automation and CI/CD tools to help ensure reliable software delivery cycles.

Here again, however, the bottom line is that SRE roles are inherently flexible. There are no hard-and-fast rules about how you have to structure SREs within your organization, or how to distinguish them from other technical stakeholders.

Conclusion

The flexibility of the SRE role is part of what makes SREs so powerful. At the same time, though, it can make SREs somewhat difficult to understand, especially for businesses that got along just fine using traditional IT roles alone, without adding SREs to the mix. However, in today’s world of increasingly complex applications, SREs have become a vital resource for building out agile IT organizations that are prepared to maximize software reliability and performance, regardless of how their systems evolve.