If you know anything about the origins of Site Reliability Engineering, or SRE, you know that the concept was born at Google.
But why did Google establish the SRE role? And how did SRE spread from the search giant to companies of all types -- including but not limited to Web-scale businesses with massive reliability needs?
Keep reading for answers to these questions as we explore and analyze the history of SRE from Google to the present.
Google’s creation of the SRE role
The first SRE team originated at Google in 2003 under the direction of Ben Treynor Sloss, who had begun his career as a software engineer at Oracle and several other companies before joining Google.
Neither Sloss nor Google in general has said much publicly about exactly why the company created the SRE role. However, likely factors included:
- Web-scale reliability needs: Google was one of the first companies with truly massive reliability needs. Circa 2003, most businesses could get away with some downtime or slow page loads; after all, more than 50 million U.S. households still had dialup at the time. But as one of the largest Web companies on the planet, Google was on the frontier of a new type of user experience that involved minimal downtime and latency. Building an SRE team was an obvious step toward achieving that goal.
- Massive infrastructure: Along similar lines, Google was one of the first companies with a truly massive, distributed infrastructure. In 2003, the public cloud was not yet a thing, and few businesses had hundreds of thousands of servers spread across dozens of data centers to manage. But Google did, which is why it needed a strategy that would enable large-scale automation of reliability across this sprawling infrastructure. Most other businesses wouldn’t face this challenge until they started to move to the cloud in the later 2000s.
- There was no DevOps: If Sloss had joined Google five years later than he did, it’s reasonable to imagine that he would have formed a DevOps team instead of an SRE team to achieve the type of code-driven reliability management he envisioned. After all, DevOps and SRE are driven by similar (albeit not identical) methodologies and goals. But, in 2003, DevOps didn’t yet exist, so Google had to invent its own concept from scratch. (Why DevOps emerged in parallel to SRE, instead of merging with it, is a story for another time.)
How did SRE spread beyond Google?
The story of SRE’s gradual expansion from Google into businesses of all types unfolded in two main stages.
SRE spreads to Web-scalers
The first stage involved the adoption of SRE by other large, Web-scale companies similar to Google. Facebook had an SRE team by 2010, according to a blog post from the time. Netflix had established a “core SRE team” by 2016. Uber started writing in the same year about how it uses SRE. LinkedIn was touting its “SRE culture” by 2017.
It’s easy enough to understand why large companies like these would import Google’s SRE concept into their own IT organizations. They faced the same challenges as Google: From early on, they had massive, distributed infrastructures to manage. They also needed to meet ever-steeper user expectations regarding performance and availability. And although most of these companies embraced SRE after DevOps was already well established, that timing probably reflects a recognition, clear by the mid-2010s, that DevOps alone doesn’t guarantee an excellent user experience.
SRE for “ordinary” companies
The second, more interesting stage in the history of SRE is the adoption of SRE by “ordinary” companies -- meaning those without huge server farms to manage or billions of transactions to handle each day. Over the past few years, businesses of all types have begun hiring SREs, even if they don’t face special reliability challenges.
There are two possible explanations for why the SRE role has become a core part of IT organizations writ large. The cynical one is that SRE is just a trendy new name for what used to be called IT operations. In other words, companies that hire SREs today perhaps haven’t really changed how they operate; they’ve just adopted a fancier job title for their IT engineers.
But there’s a less cynical explanation for the widespread adoption of SRE, too. It boils down to the fact that we live in a world where users have extremely high expectations of websites and applications, and traditional IT operations strategies can’t meet them. Today, even if you operate a run-of-the-mill website or a mobile app with just a few thousand users, you need to measure content load times in milliseconds and resolve availability issues in minutes instead of days if you want to keep up with your competition. The concepts, tools and strategies that SREs bring have helped smaller businesses achieve these goals.
Conclusion: The future of SRE
That’s a brief summary of how SRE came to exist as we know it today. But where is it headed next?
It’s impossible to predict the future, of course. But if we had to take a guess, we’d say that SRE will become even more widespread at smaller companies. We also foresee ever-greater use of automation tools to streamline SRE workflows in ways that make it more practical for smaller companies to take advantage of SRE even if they lack large in-house IT teams.
If anything is certain, though, it’s that -- despite having originated as a relatively obscure concept within an elite company two decades ago -- SRE is not going anywhere. Even if the rate of creation of new SRE teams levels off, SRE is so well established at this point across companies of virtually all types and sizes that it’s hard to imagine a future where SRE is not a core part of IT strategies everywhere.