5 Tips If You’re the 1st SRE Hire by Instacart's First SRE
Best practices for “SRE pioneers” – meaning engineers who are the very first SREs hired at an organization.
September 5, 2024
10 mins
Discover the power of automating your incident response process in 2024. Learn how leveraging modern tools and AI can reduce your Mean Time to Resolution (MTTR) and minimize human error. This article breaks down actionable steps to help SRE teams of any scale improve reliability and efficiency.
It’s 2024: no company is shipping its software to production by manually copying and pasting files from a local machine into a server. In fact, it’s been about ten years since automated pipelines like CI/CD became the norm for delivering software at scale.
Manually deploying code is slow, error-prone, and impossible to coordinate at scale. That’s why organizations invest so much in automated deployments: no matter how complex CI/CD can get, it’s still more reliable and easier than trying to coordinate dozens of teams pasting their code into a server.
Incident management has evolved in the same way: organizations now need faster resolution times while dealing with more complexity. Automation is a necessary practice for high-performing SRE teams of any scale because it helps them hit SLOs, cuts down the time spent on repetitive tasks, and prevents human errors.
Incident response requires you to be thorough in the midst of pressure and uncertainty. That’s not an easy task. Before automation, this is how I would normally handle an incident:
Without automation, there is a pain point in almost every step of the incident resolution process, and all of them could be alleviated with automation. Incident response automation can make collaboration a breeze, from creating the incident Slack (or Microsoft Teams) channel and inviting team members with specific roles to notifying stakeholders automatically and constructing an incident timeline for you.
When your team doesn’t have to be concerned with repetitive tasks, everyone can devote themselves to investigating and resolving the incident.
There are numerous key benefits to automating incident response. Automation can help you consolidate your SRE practice, improve your reliability, and prepare you to scale with your business.
Tony Holmes, Head of SRE at Affirm, explains that a big part of leading a Site Reliability Engineering practice is creating a framework that makes things more consistent and easier to reason about. When your guidelines and processes are comprehensive enough, they give your responders the confidence to make better decisions and the resources to move more nimbly.
Automation in incident response is about codifying the practices that you’ve identified to make your process effective. Automation removes the guesswork for your responders by providing them with communication channels and an environment where they can dive into the incident without worrying about ‘process.’
Using automation can reduce your MTTR by up to 78% because it dramatically changes how your responders handle incidents. Your on-call engineers can focus on addressing the incident at hand instead of having to come up with a process to collaborate or figure out which actions to perform.
Modern incident management tools like Rootly set up all communication and collaboration channels for you, without taking your team out of Slack or Microsoft Teams. Your SRE team can set up workflows to automate tasks like fetching data from Datadog or notifying leadership based on certain triggers, so your responders avoid context-switching and stay focused on resolving the incident.
Incidents are high-stress situations in which you have to deal with a lot of complexity as quickly as you can. No matter how experienced your responders are, details can slip their minds when they are dealing with especially tricky incidents.
Some companies and incident types are especially susceptible to human error. For example, if your company operates in a highly regulated environment, your responders have to keep track of many tasks and perform checks with each incident, many of which can be automated instead.
Your responders free up mental space for more meaningful tasks, knowing that all logs and key events are being tracked automatically for them.
As your company grows, so does the complexity of your infrastructure and the volume of incidents you have to manage. Automation lets your SRE team manage more incidents without having to hire and train at the same exponential pace at which the services they support grow. It’s not only a recruitment budget limitation: there are few SREs in the market, and training them to work independently in your tech ecosystem requires significant time.
Furthermore, once you have to manage several incidents at the same time as part of your daily routine, automation is the only way to move forward. You need your incident management practice to mature across the organization and form repeatable (and improvable) processes. Only by delegating repetitive tasks can you ensure that the process scales.
A common mistake when introducing automation to any process is applying it in the wrong places. Start by taking a good look at your current incident response process. Identify the tasks that are repetitive, time-consuming, or prone to errors. These are the prime candidates for automation. For example, you may find that automating the initial alerting and notification process is a good place to start, or automating the gathering of diagnostic data to kick-start an investigation.
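One way to make that review concrete is to score each manual task by how much time it consumes and how often it goes wrong. The sketch below is only an illustration: the `Task` fields, the `error_weight` factor, and every number in the sample data are hypothetical and should be replaced with figures from your own incident history.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runs_per_month: int      # how often responders perform the task
    minutes_per_run: float   # average hands-on time per run
    error_rate: float        # fraction of runs with a mistake, 0.0 to 1.0


def automation_score(task: Task, error_weight: float = 2.0) -> float:
    """Monthly minutes spent on the task, inflated by how error-prone it is."""
    monthly_minutes = task.runs_per_month * task.minutes_per_run
    return monthly_minutes * (1 + error_weight * task.error_rate)


def rank_candidates(tasks: list[Task]) -> list[Task]:
    """Return tasks ordered from strongest to weakest automation candidate."""
    return sorted(tasks, key=automation_score, reverse=True)


# Hypothetical sample data, for illustration only.
tasks = [
    Task("Create incident channel and invite responders", 20, 5, 0.10),
    Task("Notify stakeholders", 20, 3, 0.25),
    Task("Gather diagnostic data", 20, 15, 0.05),
    Task("Write retrospective narrative", 4, 60, 0.00),
]

for task in rank_candidates(tasks):
    print(f"{automation_score(task):7.1f}  {task.name}")
```

The weighting is deliberately crude. Its purpose is to force an explicit comparison between tasks, not to produce a precise number.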
Before you jump into automation, it’s important to make sure your processes are known and repeatable. This means having clear, well-documented steps for how to handle different types of incidents, including who needs to be involved, what actions should be taken, and how communication should be managed. By standardizing these processes, you make it easier to automate them and ensure that everyone is on the same page.
To get the most out of automation, your incident response tools need to work well together. Make sure that your monitoring, alerting, and communication tools are all integrated so that information can flow smoothly between them. This way, when an incident occurs, your automated workflows can kick in without any manual hand-offs, ensuring a faster and more coordinated response.
With your tools integrated and your processes standardized, you can start setting up automated workflows. These are essentially pre-defined sequences of actions that are triggered when certain conditions are met. For instance, if a service goes down, an automated workflow might send out an alert, create a conference call for the response team, and start collecting logs for analysis. By automating these steps, you can ensure that incidents are handled quickly and consistently every time.
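The trigger-and-actions idea can be sketched in a few lines. This is a minimal, tool-agnostic illustration and not the configuration format of any particular product: the `Workflow` class, the event dictionary shape, and the three action functions are all hypothetical, and the actions only return a description of what a real integration would do.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workflow:
    name: str
    condition: Callable[[dict], bool]                      # when to trigger
    actions: list[Callable[[dict], str]] = field(default_factory=list)


# Placeholder actions: a real setup would call your alerting,
# conferencing, and logging tools here.
def send_alert(event: dict) -> str:
    return f"alert sent for {event['service']}"


def create_call(event: dict) -> str:
    return f"conference call created for {event['service']}"


def collect_logs(event: dict) -> str:
    return f"log collection started for {event['service']}"


def run_workflows(event: dict, workflows: list[Workflow]) -> list[str]:
    """Run every workflow whose condition matches, in declaration order."""
    log = []
    for workflow in workflows:
        if workflow.condition(event):
            for action in workflow.actions:
                log.append(action(event))
    return log


service_down = Workflow(
    name="service-down",
    condition=lambda e: e.get("status") == "down",
    actions=[send_alert, create_call, collect_logs],
)

steps = run_workflows({"service": "checkout", "status": "down"}, [service_down])
for step in steps:
    print(step)
```

Because the condition and the actions are declared up front, every matching incident gets the same steps in the same order, which is where the consistency comes from.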
Not all incident response tools are created equal, so it’s important to choose ones that fit your needs. Look for tools that offer robust automation features, such as automated alerting, workflow orchestration, and incident tracking. Also, consider tools that are easy to integrate with your existing systems and offer flexibility as your needs evolve. Tools like Rootly, which can integrate with Slack, Jira, and Datadog, provide a comprehensive solution for automating incident management and are a good example of what to look for.
A big part of effective incident management is effective communication. Instead of having people create collaboration environments manually every time there is an incident, automate the creation of a Slack channel that has all the tools you’ll need. For example, the incident channel can come with a Zoom conference call ready to use, as well as shortcuts to create Linear tickets related to the incident without leaving Slack.
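As a rough sketch of what the automation does before it touches the chat tool, the snippet below derives a predictable channel name and an ordered setup plan. The naming scheme (`inc-<id>-<date>-<slug>`) and the step names in the plan are assumptions made for illustration. The name is restricted to lowercase letters, digits, and hyphens and capped at 80 characters so that it fits Slack’s channel naming rules. Actually creating the channel would go through your chat tool’s API, which is not shown here.

```python
import re
from datetime import date


def incident_channel_name(incident_id: int, title: str, day: date,
                          max_len: int = 80) -> str:
    """Build a lowercase, hyphen-separated channel name for an incident."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"inc-{incident_id}-{day:%Y%m%d}-{slug}"
    return name[:max_len].rstrip("-")


def channel_setup_plan(name: str, responders: list[str]) -> list[tuple[str, str]]:
    """Ordered (step, argument) pairs; the step names are hypothetical."""
    plan = [("create_channel", name)]
    plan += [("invite", person) for person in responders]
    plan += [("attach_call_link", name), ("pin_ticket_shortcut", name)]
    return plan


name = incident_channel_name(42, "Checkout API: 5xx spike!", date(2024, 9, 5))
for step in channel_setup_plan(name, ["incident-commander", "comms-lead"]):
    print(step)
```

A deterministic name also makes incident channels easy to search for later, for example during a retrospective.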
When you’re trying to get to the bottom of an incident and come up with a way to fix it, notifying different stakeholders is in the back of your mind as a cumbersome, pending task. Automation can help you get rid of these bureaucratic requirements that, although not fun, are fundamental to incident response.
Deciding when and what to write on the status page is a very thoughtful exercise for most teams, who may even have a dedicated comms role in their incident response team to deal with it. However, it is not rare for teams to forget to set the status page back to normal once an incident is resolved. You can configure your incident management tool to update your status page automatically when an incident is resolved so your customers are notified immediately about the good news.
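One way to make the "back to normal" update impossible to forget is to derive each component’s status from the set of unresolved incidents rather than setting it by hand. The sketch below assumes a hypothetical incident record with `state` and `components` keys and two hypothetical status labels. It does not reflect any specific status page product.

```python
def component_statuses(components: list[str], incidents: list[dict]) -> dict[str, str]:
    """Mark a component degraded only while an unresolved incident affects it."""
    statuses = {component: "operational" for component in components}
    for incident in incidents:
        if incident["state"] == "resolved":
            continue  # resolved incidents no longer affect the page
        for component in incident["components"]:
            if component in statuses:
                statuses[component] = "degraded"
    return statuses


components = ["checkout", "search"]
incidents = [{"id": 1, "state": "investigating", "components": ["checkout"]}]
print(component_statuses(components, incidents))

# Resolving the incident is the only action needed: recomputing the
# statuses returns the component to normal.
incidents[0]["state"] = "resolved"
print(component_statuses(components, incidents))
```

Deriving the status this way also handles overlapping incidents: a component stays degraded until every incident that affects it has been resolved.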
After your team has resolved an incident, the last thing they want to do is continue talking about it. But they have to in order to turn the incident into actionable insights. Make running retrospectives less of a burden by automating as much of it as possible. For example, Rootly builds incident timelines automatically for your team and guides them through the steps that you’ve defined so they can have an effective retrospective without any paperwork.
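To show what automatic timeline building amounts to, here is a minimal sketch that merges events from several sources into one chronological view. It is not Rootly’s implementation: the event shape (`at`, `source`, `text`) and the sample events are hypothetical.

```python
from datetime import datetime


def build_timeline(events: list[dict]) -> list[str]:
    """Sort events by time and label each with minutes since the first one."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e["at"])
    start = ordered[0]["at"]
    lines = []
    for event in ordered:
        minutes = int((event["at"] - start).total_seconds() // 60)
        lines.append(f"T+{minutes:>3}m  [{event['source']}] {event['text']}")
    return lines


# Hypothetical events, deliberately out of order as they might arrive
# from different tools.
events = [
    {"at": datetime(2024, 9, 5, 14, 31), "source": "deploy", "text": "Rollback completed"},
    {"at": datetime(2024, 9, 5, 14, 2), "source": "alert", "text": "High error rate on checkout"},
    {"at": datetime(2024, 9, 5, 14, 40), "source": "chat", "text": "Incident marked resolved"},
    {"at": datetime(2024, 9, 5, 14, 5), "source": "chat", "text": "Incident channel created"},
]

for line in build_timeline(events):
    print(line)
```

Capturing events as they happen means the retrospective starts from a complete record instead of from what people remember.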
Rootly is a leading on-call and incident management solution trusted by LinkedIn, Dropbox, Cisco, Webflow, and 100+ high-performing companies. Rootly offers a comprehensive suite with powerful yet simple automations throughout the entire incident lifecycle.
On the alerting side, Rootly allows you to schedule shadow rotations with a single click and automatically detects gaps in your 24/7 coverage. You can connect dozens of alert sources and rely on workflows to route alert urgencies efficiently.
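Detecting coverage gaps is, at its core, an interval problem. The sketch below shows one common way to approach it and makes no claim about how Rootly implements the feature: shifts are assumed to be plain `(start, end)` datetime pairs, and the function name and sample schedule are hypothetical.

```python
from datetime import datetime


def find_coverage_gaps(shifts, window_start, window_end):
    """Return (start, end) pairs inside the window that no shift covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is already covered
    for start, end in sorted(shifts):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


day_start = datetime(2024, 9, 2, 0, 0)
day_end = datetime(2024, 9, 3, 0, 0)
shifts = [
    (datetime(2024, 9, 2, 9, 0), datetime(2024, 9, 2, 17, 0)),
    (datetime(2024, 9, 2, 0, 0), datetime(2024, 9, 2, 8, 0)),
    (datetime(2024, 9, 2, 16, 0), datetime(2024, 9, 2, 22, 0)),
]

for gap_start, gap_end in find_coverage_gaps(shifts, day_start, day_end):
    print(f"Uncovered: {gap_start:%H:%M} to {gap_end:%H:%M} ({gap_end - gap_start})")
```

Sorting the shifts first lets a single pass handle overlapping rotations, since the cursor only ever moves forward.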
When managing incidents with Rootly, you get robust Slack and Microsoft Teams bots packed with automations. Additionally, you can integrate all the tools that your team already relies on, thanks to Rootly’s 70+ native integrations.
PagerDuty is the most common on-call scheduler in the market due to its long-standing position since 2009. However, its exorbitant costs, aggressive upselling, and lack of innovation have pushed many customers to look for alternatives.
Beyond concerns about ROI, PagerDuty customers also report frustrations with how difficult it is to use. Its UI has remained largely unchanged for over a decade, making it challenging for modern teams. Even long-time users often need to Google how to perform basic tasks.
Automated workflows in PagerDuty are a paid add-on and are limited to the alerting side of incidents. This means there is little you can automate regarding incident response collaboration or post-incident tasks through PagerDuty. It’s also a common reason many companies look for PagerDuty alternatives.
Opsgenie is a more corporate alternative to PagerDuty, often bundled with Jira and Confluence for existing Atlassian customers. However, Opsgenie is one of the most unreliable alerting solutions, with reports of unavailability and service disruptions lasting for hours.
Since Atlassian acquired Opsgenie in 2018, no significant investments have been made in the product. Automation in Opsgenie remains nearly non-existent. You can configure basic escalation policies and rotations, but beyond alerting, you’re on your own.
VictorOps was acquired by Splunk and is now commercialized as Splunk On-Call. Splunk On-Call is a straightforward alerting solution for teams using Splunk as their observability platform.
Splunk On-Call functions primarily as a companion product, which makes it easy to set up and use if you only use Splunk and don’t have advanced alerting needs. However, its automations are limited to handling alerts, leaving everything else for you to figure out through other tools.
Rootly is a modern on-call and incident management solution that automates much of the process for your responders. Rootly provides no-code automations for many use cases, from simple to complex workflows connecting 70+ tools like Jira, Notion, and Zoom.
Rootly is trusted by leading SRE teams around the world, including LinkedIn, NVIDIA, and Tripadvisor.
Book a demo with one of our reliability experts to find out how Rootly can help your organization automate your incident management process.