June 25, 2021
5 min read
From chaos engineering to monitoring and beyond, SREs rely on several key types of tools to do their jobs.
Mastering the concepts at the core of reliability is the first step in becoming an SRE. But you also need tools to put those concepts into practice.
Which types of tools do SREs need to do their jobs? And what are the best tools in each category? This article answers these questions by discussing what SREs should think about when building their toolbox. It walks through the key categories of tools for SREs to leverage and suggests specific options in each one.
Chaos engineering, which means experimenting with systems in order to discover and assess problems that you might not otherwise foresee, has become a key concept in the SRE world since Netflix popularized it about a decade ago.
Although chaos engineering might sound like something that, by definition, you would conduct in an ad hoc fashion, there are actually several tools that help SREs perform chaos engineering systematically and efficiently. Chaos engineering tools let SREs define experiments that they want to run on their systems. The tools then execute those experiments automatically and help teams record the results.
Chaos Monkey, the open source chaos engineering tool that Netflix built, is one popular option in this category. Another is Gremlin, a commercial platform that has a more expansive feature set and is more user-friendly than open source alternatives.
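To make the idea of a defined, repeatable experiment concrete, here is a minimal sketch in Python. It is not how Chaos Monkey or Gremlin work internally; the `Experiment` class, the `run_experiment` helper and the toy replica pool are all hypothetical names invented for illustration. The sketch checks that the system is healthy, injects a fault, checks again, rolls back and records the result.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A hypothetical experiment definition: what to break and how to judge the outcome."""
    name: str
    inject_fault: Callable[[], None]
    steady_state_ok: Callable[[], bool]
    rollback: Callable[[], None]


def run_experiment(exp: Experiment) -> dict:
    """Run one experiment and return a record of what happened."""
    # Never inject a fault into a system that is already unhealthy.
    if not exp.steady_state_ok():
        return {"name": exp.name, "status": "skipped"}
    exp.inject_fault()
    try:
        survived = exp.steady_state_ok()
    finally:
        # Always undo the fault, even if the check itself raises.
        exp.rollback()
    return {"name": exp.name, "status": "passed" if survived else "failed"}


# Toy system (an assumption for this sketch): the service counts as
# healthy while at least one replica is up.
replicas = {"a": True, "b": True, "c": True}


def kill_random_replica() -> None:
    victim = random.choice(sorted(replicas))
    replicas[victim] = False


def restore_all() -> None:
    for name in replicas:
        replicas[name] = True


experiment = Experiment(
    name="kill-one-replica",
    inject_fault=kill_random_replica,
    steady_state_ok=lambda: any(replicas.values()),
    rollback=restore_all,
)
result = run_experiment(experiment)
print(result)
```

Real tools add scheduling, blast-radius limits and reporting on top of this loop, but the hypothesis, injection, verification and rollback steps are the core of it.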
Chaos engineering can help SREs find reliability weak points within their environments, but it’s no substitute for continuously monitoring those environments and alerting on problems. Monitoring tools track all layers of your stack and send alerts about issues such as an application that has become slow to respond to user requests or infrastructure that appears to be nearing capacity. By notifying SREs about production problems as quickly as possible, they help teams maintain SLAs even when disruptions occur.
Some monitoring tools also allow teams to perform synthetic monitoring and testing, which means running simulated transactions and evaluating how the environment handles them. Synthetic monitoring and tests are another way to find problems before they become serious disruptions. They complement real-user monitoring, the type of production monitoring described above, which observes actual user transactions.
There are dozens of monitoring and alerting platforms on the market. For example, SREs might choose to work with Datadog, which offers both synthetic monitoring and real-user monitoring in a single package.
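As a rough illustration of what a synthetic check does, the Python sketch below runs a scripted transaction, times it and decides whether the result is acceptable. It does not use Datadog or any vendor API; `run_synthetic_check`, `fake_checkout` and `broken_checkout` are hypothetical names, and the transactions are stand-ins for real HTTP calls so that the example runs without a network.

```python
import time
from typing import Callable


def run_synthetic_check(transaction: Callable[[], int], max_latency_s: float) -> dict:
    """Execute a simulated transaction and evaluate status code and latency."""
    start = time.perf_counter()
    status = None
    error = None
    try:
        status = transaction()
    except Exception as exc:
        # A raised exception counts as a failed check, not a crashed monitor.
        error = str(exc)
    latency_s = time.perf_counter() - start
    ok = error is None and status == 200 and latency_s <= max_latency_s
    return {"ok": ok, "status": status, "latency_s": latency_s, "error": error}


def fake_checkout() -> int:
    """Hypothetical stand-in for a scripted user flow that succeeds."""
    time.sleep(0.01)
    return 200


def broken_checkout() -> int:
    """Hypothetical stand-in for a flow whose dependency is down."""
    raise ConnectionError("upstream unavailable")


healthy = run_synthetic_check(fake_checkout, max_latency_s=1.0)
failing = run_synthetic_check(broken_checkout, max_latency_s=1.0)
print(healthy["ok"], failing["ok"], failing["error"])
```

In practice the transaction would be a real request or browser script run on a schedule from several locations, and a failing result would feed the alerting pipeline.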
There is a big and ongoing debate about whether and how monitoring is different from observability. We won’t get into it here, but the majority viewpoint seems to be that while monitoring focuses on finding problems, observability goes a step further by helping teams investigate them, especially within complex, distributed environments where surface-level transaction monitoring doesn’t always clearly expose the root cause of a problem.
As with monitoring, there are a variety of observability tools on the market (and many platforms market themselves as fitting within both categories). One popular example is Honeycomb, a relatively new observability tool that was built from the start to help SREs and DevOps teams manage cloud-native environments.
When you have multiple monitoring systems in place, you also have multiple sources of alerts.
That makes it important to implement a paging and alerting platform, such as PagerDuty, which collects alerts from across your tools and forwards them to the right engineers. These platforms also manage on-call schedules so that you don’t bother off-duty team members unnecessarily, and they automate alert escalation, which helps to ensure that the right person responds to each alert.
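The routing and escalation logic described here can be sketched in a few lines of Python. This is not PagerDuty’s API or data model; the `Alert` class, the `ESCALATION` table, the engineer names and the `who_to_page` function are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str    # which monitoring tool raised the alert
    service: str   # which service the alert is about
    summary: str


# Hypothetical escalation chains per service, ordered from first responder upward.
ESCALATION = {
    "checkout": ["alice", "bob", "sre-manager"],
    "search": ["carol", "dave"],
}
# Fallback chain for services nobody has claimed.
DEFAULT_CHAIN = ["sre-catchall"]


def who_to_page(alert: Alert, unacknowledged_attempts: int) -> str:
    """Pick the next person to page, escalating one level per unacknowledged page."""
    chain = ESCALATION.get(alert.service, DEFAULT_CHAIN)
    # Stay on the last level of the chain once it has been reached.
    level = min(unacknowledged_attempts, len(chain) - 1)
    return chain[level]


alert = Alert(source="monitoring-a", service="checkout", summary="p99 latency high")
for attempt in range(4):
    print(attempt, who_to_page(alert, attempt))
```

Keying the chain on the service rather than on the tool that raised the alert is one possible design choice: alerts from different monitoring systems about the same service then reach the same on-call engineer.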
How do you know that your reliability performance meets the SLOs that your business has committed to?
If you have just a handful of SLAs to manage, you can assess SLO performance manually. But when you are juggling dozens or more SLAs, each with different SLOs, you need a tool that automates the process of comparing reliability outcomes with SLO commitments.
Tools that address this need occupy a niche that is relatively small, but growing. An example to consider is Nobl9, which not only measures SLO performance, but also helps SREs understand the business impact of different SLOs so that they’ll know which ones are most critical to meet.
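The comparison these tools automate is simple arithmetic at its core. The Python sketch below assumes a request-based SLO and computes whether the target was met and how much of the error budget is left; the `slo_report` function and the numbers are hypothetical and are not taken from Nobl9 or any other product.

```python
def slo_report(good_events: int, total_events: int, slo_target: float) -> dict:
    """Compare measured reliability with an SLO target for one time window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    achieved = good_events / total_events
    # The error budget is the number of bad events the SLO tolerates.
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    # 1.0 means the budget is untouched; a negative value means it is overspent.
    budget_remaining = 1 - actual_bad / allowed_bad
    return {
        "achieved": achieved,
        "met": achieved >= slo_target,
        "budget_remaining": budget_remaining,
    }


# Example window: one million requests against a 99.9% availability SLO.
within_budget = slo_report(good_events=999_400, total_events=1_000_000, slo_target=0.999)
over_budget = slo_report(good_events=998_500, total_events=1_000_000, slo_target=0.999)
print(within_budget["met"], round(within_budget["budget_remaining"], 3))
print(over_budget["met"], round(over_budget["budget_remaining"], 3))
```

With a 99.9% target, one million requests allow roughly 1,000 failures. The first window has 600 failures, so about 40% of the budget remains; the second has 1,500, so the budget is overspent by about half.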
Infrastructure-as-Code, or IaC, has become an almost boring category at this point. But it’s also an essential one for most SREs.
IaC tools let SREs and other engineers provision software environments automatically by writing configuration files that define how the environments should be set up. By making provisioning efficient and consistent, IaC goes far toward helping teams bake reliability into their application deployment and configuration processes.
Most cloud providers offer IaC tools that are compatible with their platforms. There are also third-party solutions like Terraform, which work with most environments.
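The core idea behind declarative IaC is that you describe the environment you want and the tool works out the changes needed to get there. The Python sketch below illustrates that comparison only. It is not Terraform’s planning algorithm or configuration language, and the `plan` function and resource names are hypothetical.

```python
def plan(desired: dict, actual: dict) -> list:
    """Return the actions needed to turn the actual state into the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state, as it might be written in a configuration file.
desired = {
    "web-server": {"size": "medium", "count": 3},
    "database": {"size": "large", "count": 1},
}
# Actual state, as reported by the platform.
actual = {
    "web-server": {"size": "medium", "count": 2},
    "old-cache": {"size": "small", "count": 1},
}

actions = plan(desired, actual)
print(actions)
```

Because the plan is derived from the configuration on every run, applying the same file to two environments yields the same result, which is where the consistency benefit comes from.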
And then there are tools like Open Policy Agent, or OPA, which lets teams define policies as code and enforce them across their environment or delivery chain. Tools like OPA aren’t IaC solutions per se; they’re part of what some analysts have started calling “everything as code” tools, which extend the IaC concept to address all layers and facets of IT environments.
If you’re an SRE who likes being able to define reliability requirements as code and enforce them automatically across the delivery chain, you’ll want to take advantage of IaC tools at a minimum. And it may be worth exploring solutions like OPA as well.
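To show what defining a reliability requirement as code can look like, here is a minimal Python sketch of a check that could run in a delivery pipeline. OPA policies are written in its own language, Rego, so this is not OPA syntax; the `check_deployment` function, the `replicas` and `health_check` fields and the rules themselves are assumptions chosen for illustration.

```python
def check_deployment(config: dict) -> list:
    """Return a list of policy violations for one deployment configuration."""
    violations = []
    # Hypothetical rule: run at least two replicas so one failure is survivable.
    if config.get("replicas", 0) < 2:
        violations.append("replicas must be at least 2")
    # Hypothetical rule: every service must declare a health check.
    if not config.get("health_check"):
        violations.append("health_check must be defined")
    return violations


compliant = {"name": "checkout", "replicas": 3, "health_check": "/healthz"}
risky = {"name": "batch-job", "replicas": 1}

print(check_deployment(compliant))
print(check_deployment(risky))
```

A pipeline step would fail the build whenever the returned list is non-empty, so the requirement is enforced automatically rather than by review.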
When something does go wrong, despite all the time SREs invest in testing, monitoring and optimizing their systems for reliability, it’s critical to have a way to collaborate efficiently.
Automated incident response platforms like Rootly address this requirement. By integrating with monitoring systems, automatically aligning alerts with the right teams, streamlining the creation of virtual war rooms and keeping all stakeholders in constant communication until problems are resolved, incident response platforms ensure that SREs can react quickly and consistently when something breaks.
On top of this, response platforms also help SREs perform postmortems and track action items to aid with learning from incidents.
From monitoring and observability to SLO management and beyond, SREs require a variety of tools to do their jobs efficiently. We’ve suggested some solid options to meet each of these needs, but you can find plenty of other great tools out there, too.