A Primer on the History and Evolution of Incident Management to Today
Many of the concepts SREs take for granted about incident management originated with efforts to fight fires in California in the 1970s.
July 31, 2024
10 mins
Discover the essential SRE tools for monitoring, incident management, automation, and more!
For better or worse, most companies—including their execs and developers—see SREs as superheroes who’ll save them from the evils of downtime and service degradation with their boundless superpowers.
SREs are expected to constantly perform dangerous stunts like production debugging or communicating highly technical issues to angry VPs. They must also be able to manage infrastructure, networks, databases, pipelines, operating systems and much more.
Because there’s so much to work with, it can be a challenge to make sense of all the terms you come across as you navigate your career as an SRE. That’s why we’re putting together a list of the common tools and what to look out for in each of them.
We’ve arranged the tools in concentric circles, from those closest to you as an SRE to those further out but still relevant to your function.
A big part of being an SRE is dealing with incidents, whether they occur discreetly during work hours or as a big scandal during rush hour on a weekend. The tools you use to digest alerts and coordinate a response will be a constant in your work.
Rootly is a modern on-call and incident response tool, used by teams like LinkedIn, Cisco, Canva and Elastic.
Key features:
Best suited for:
PagerDuty is the most popular legacy on-call management tool, founded in 2009. For a few years, PagerDuty was the only reliable alerting solution available to enterprise customers.
However, SRE managers often report frustration with the amount of manual work it takes to configure and manage PagerDuty, due to its assumptions about how software is shipped. Its high costs and aggressive upselling tactics make customers question whether they’re getting a return on their investment.
Key features:
Best suited for:
OpsGenie is another legacy on-call management tool. Owned by Atlassian, OpsGenie is more often found in organizations using other products from the company, like Jira or Confluence. However, OpsGenie has been known for its frequent outages. Customers also often criticize Atlassian for its lack of investment in improving the product over the years.
Key features:
Best suited for:
You can only respond to and resolve an incident if you detect it. Monitoring and observability tools constantly look at data coming from your system to understand if everything is working as expected. If they detect an anomaly, they’ll trigger an alert.
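To make the "detect an anomaly, trigger an alert" idea concrete, here is a minimal sketch in Python. It is not how any of the tools below work internally. It assumes a simple rule: flag a sample when it sits more than a few standard deviations away from the samples just before it. The function name `detect_anomalies`, the `window` size, and the `threshold` are all illustrative choices.

```python
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the preceding window.

    A sample is flagged when it is more than `threshold` standard deviations
    away from the mean of the `window` samples that came right before it.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as an anomaly.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101]
for index in detect_anomalies(latencies_ms):
    print(f"ALERT: sample {index} looks anomalous ({latencies_ms[index]} ms)")
```

Real monitoring stacks usually let you express rules like this declaratively and handle noise, seasonality, and alert routing for you, so treat the sketch as a mental model only.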
Prometheus is a popular open-source system monitoring suite. Prometheus is known for its scalability, reliability, and strong community support (54k stars on GitHub).
Key features:
Best suited for:
Datadog is one of the most used observability vendors. It offers a wide set of products aimed at giving you visibility into the performance and health of your applications, infrastructure and environments.
Key features:
Best suited for:
Most teams these days rely on containers to deploy their software because they make composability easier. However, you’ll end up having dozens of containers flying around when you account for every component of your system. Thus, you need a platform to help you automate, manage, scale, and network containers.
Even though Container Orchestrators usually offer some kind of ‘auto-heal’ feature, as an SRE, you’ll rapidly get accustomed to having to ‘heal’ things yourself.
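The ‘auto-heal’ idea usually boils down to comparing what you asked for with what is actually running, then closing the gap. The Python sketch below illustrates that comparison under simplifying assumptions: state is just a replica count per service, and the function only returns the actions it would take. The name `reconcile` and the action tuples are hypothetical, not part of any orchestrator's API.

```python
def reconcile(desired, running):
    """Compare desired and running replica counts and list corrective actions.

    Both arguments map a service name to a replica count. The result is a
    list of (action, service, count) tuples, ordered by service name.
    """
    actions = []
    for service in sorted(set(desired) | set(running)):
        want = desired.get(service, 0)
        have = running.get(service, 0)
        if have < want:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
    return actions


# Hypothetical cluster state: "api" lost two replicas, "legacy" should be gone.
desired_state = {"api": 3, "worker": 2}
running_state = {"api": 1, "worker": 2, "legacy": 1}
for action, service, count in reconcile(desired_state, running_state):
    print(f"{action} {count} x {service}")
```

When auto-heal fails in practice, it is often because the gap cannot be closed automatically (a bad image, missing capacity, a broken dependency), which is where you come in.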
Kubernetes is the most widely used container orchestration platform. It’s the founding project of the Cloud Native Computing Foundation (CNCF), which means being open-source is at its core.
Key features:
Best suited for:
Before Kubernetes, Docker was pretty much the default way of deploying containers. The Docker ecosystem isn’t fully open source, but that brings the benefits of a more polished product that is easier to set up.
Key features:
Best suited for:
Continuous Integration and Continuous Delivery are pretty much the standard way of developing and shipping software these days. CI/CD pipelines execute a set of steps to build and run software within specified parameters and environments.
Yes, you’ll likely become besties with a wide range of CI/CD pipelines as you try to figure out what went wrong, and why, when you reran the workflows at 3am to bring a service back online.
Infamous and feared, Jenkins has been around since 2011. As a robust and highly customizable open-source solution to build pipelines, Jenkins dominated the market for a few years. However, configuring it and its plugins can rapidly become a challenge for a lot of teams.
Key features:
Best suited for:
CircleCI got into the market around the same time as Jenkins, but it’s more of an open-core solution. Their main offering is a fully featured, cloud-hosted version, but a self-managed, free-to-use version is also available.
Key features:
Best suited for:
GitHub shook the CI/CD world with their release of GitHub Actions in 2019. The clear caveat is that it only works for GitHub customers. But the ergonomics built into GitHub Actions make it the easiest-to-use CI/CD solution in the market at the moment.
Key features:
Best suited for:
A small misconfiguration can cause major outages, so most organizations opt to manage their configurations through a dedicated tool. When you have multiple environments and systems, you definitely need some form of automation around your configs.
As an SRE, you won’t be a stranger to the configuration management tool your team uses, whether you’re fishing for that problematic provisioning step or figuring out if the updates were applied uniformly across services and environments.
Ansible is an open source automation engine that helps teams set up configuration management, among other processes. It’s commercialized by Red Hat through a broader platform offering.
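Checking whether settings were applied uniformly is, at its heart, a diff across environments. The Python sketch below shows one way to spot drift, assuming each environment's configuration is a flat dictionary of simple values. The `find_drift` name, the `"<missing>"` marker, and the sample settings are all hypothetical.

```python
MISSING = "<missing>"  # Marker for a key that an environment does not define.


def find_drift(configs):
    """Return the settings whose values differ between environments.

    `configs` maps an environment name to a flat dict of settings. The result
    maps each drifting key to the value seen in every environment.
    """
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {env: settings.get(key, MISSING) for env, settings in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical settings for two environments.
environments = {
    "staging": {"timeout": 30, "tls": True},
    "production": {"timeout": 60, "tls": True, "replicas": 3},
}
for key, values in find_drift(environments).items():
    print(f"{key} differs: {values}")
```

Not every difference is a problem, since production often needs its own values on purpose. The point is to make differences visible so that the unintended ones stand out.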
Key features:
Best suited for:
Released in 2005, Puppet is still relied on by many organizations due to its maturity and extensive ecosystem. Similar to Ansible, it offers an open-source core and a product around it, as well as services and training.
Key features:
Best suited for:
A full service outage is not the only kind of incident you’ll deal with as an SRE. You’ll likely also want to address any performance degradation in any of your components. And for that you need, yes, more tools.
Application Performance Monitoring tools provide real-time data on response times, error rates, and transaction throughput.
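To give a feel for what those three numbers mean, here is a small Python sketch that computes them from a batch of request records. It assumes each record is a `(duration_ms, status_code)` pair, counts any 5xx status as an error, and uses the nearest-rank method for the 95th percentile. The `summarize` function is illustrative and not taken from any APM product.

```python
def summarize(requests, window_seconds):
    """Compute p95 response time, error rate, and throughput for a time window.

    `requests` is a list of (duration_ms, status_code) tuples observed during
    a window that lasted `window_seconds` seconds.
    """
    if not requests:
        raise ValueError("need at least one request to summarize")

    durations = sorted(duration for duration, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)

    # Nearest-rank percentile: the smallest rank covering 95% of the samples,
    # computed with integer arithmetic to avoid floating-point rounding.
    rank = (95 * len(durations) + 99) // 100

    return {
        "p95_ms": durations[rank - 1],
        "error_rate": errors / len(requests),
        "throughput_rps": len(requests) / window_seconds,
    }


# Hypothetical sample: 20 requests in a 10-second window, 2 of them failing.
sample = [(10 * (i + 1), 500 if i < 2 else 200) for i in range(20)]
print(summarize(sample, window_seconds=10))
```

Percentiles matter more than averages here because a handful of very slow requests can hide behind a healthy-looking mean.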
New Relic offers full-stack observability capabilities, including application performance, user experience, and infrastructure health.
Key features:
Best suited for:
Launched in 2005, Dynatrace is one of the most established application performance monitoring solutions. They offer a large suite of products to cover several use cases within the observability space.
Key features:
Best suited for:
SREs rely on a wide range of tools to keep systems running smoothly. Rootly integrates with all the tools you already use so your incident response experience is as streamlined as possible.
Book a demo with one of our reliability experts to see how Rootly can help you reduce your MTTR.