October 8, 2024
6 mins
Reliability is largely about being ready to respond in the midst of uncertainty. This guide shows how playbooks can work as runway lights that help your responders land on an incident effectively. Learn how to design and maintain an incident response playbook.
Playbooks save lives, literally. In 2009, shortly after taking off from New York’s LaGuardia Airport, a plane struck a flock of geese, which took out both of the plane’s engines. The pilot and co-pilot, realizing their situation was critical and time to act was extremely short, pulled out the emergency playbooks stored in the cockpit.
The plane’s playbook for this kind of situation described standard procedures and roles for every crew member. By following Standard Operating Procedures (SOPs) and adapting to the circumstances, the crew managed to land on the Hudson River, although nobody aboard had performed a water landing before, saving all 155 people on board.
Aviation is full of examples where playbooks have helped save the day and contain the panic that is bound to come from any airborne incident. The virtues of playbooks and SOPs are also praised throughout IT governance frameworks and guidance, from ITIL to CISA. Indeed, the repeatability of processes is a common marker of maturity in SRE and DevOps practices.
In this blog post, you’ll learn what an Incident Response Playbook looks like and what I’ve seen work for dozens of customers.
When you get a high-severity alert at 3 a.m., everything around you is darkness. A playbook gives you an illuminated path to follow—a guide to navigating critical situations without additional headaches.
Playbooks provide standard procedures to follow during an incident, including task lists and incident response roles. Having a place to start and concrete tasks to focus on will help any responder get into solution mode much faster than staring at an empty desktop.
A playbook can be a simple page with standard steps at a startup beginning its reliability journey. As your operation grows, so will the scenarios your playbook needs to accommodate. Enterprises often have trees of guidelines that help responders react effectively to a wide variety of situations.
The first step in any response is figuring out the magnitude of what you’re dealing with. If all systems are down and the building is on fire, you’ll need to do very different tasks than if a lower-priority API is taking too long to respond. Figuring out everything on the fly in every incident is complicated and error-prone. That’s why all playbooks start by helping determine the scope and impact of the incident.
Incident severity levels are usually assigned on a scale from SEV3 for a minor issue to SEV0 as the worst possible scenario. Make sure your playbook offers concise definitions—there’s no room for extra words. Focus on specific criteria that can help determine the severity level as quickly as possible. Responders can reassess the severity as they discover more details later on.
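As a rough illustration of what "specific criteria" can look like, here is a minimal Python sketch of a severity classifier. The `Impact` fields and every threshold are hypothetical examples chosen for this sketch, not recommendations; your own playbook should define criteria that match your business.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means a more severe incident (SEV0 is the worst case)."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Impact:
    # Hypothetical criteria; replace with whatever your playbook defines.
    customer_facing: bool
    percent_users_affected: float  # 0.0 to 100.0
    data_loss: bool


def classify(impact: Impact) -> Severity:
    """Return the most severe level whose (assumed) criteria match."""
    if impact.data_loss or impact.percent_users_affected >= 90:
        return Severity.SEV0
    if impact.customer_facing and impact.percent_users_affected >= 25:
        return Severity.SEV1
    if impact.customer_facing:
        return Severity.SEV2
    return Severity.SEV3


# A slow, customer-facing API affecting 5% of users.
print(classify(Impact(customer_facing=True, percent_users_affected=5.0, data_loss=False)).name)
```

Checking the most severe criteria first keeps the rules short, and responders can simply call `classify` again with updated numbers when they reassess the incident.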
Incidents are complex and often require a response team to be assembled. Playbooks should explain which incident response roles are required in each situation and the responsibilities of each one. This helps offload pressure from the team and empowers individuals to take ownership of their actions without stepping on each other’s toes.
Typical response team roles include Incident Commander, Scribe, and Subject Matter Experts (SMEs). The Incident Commander leads the charge, making key decisions and coordinating the response. The Scribe documents every action, creating a timeline for later analysis, while SMEs are called in on demand according to the nature of the incident.
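One way to make role requirements explicit is to write them down as data. The sketch below is only an illustration: the role descriptions come from the paragraph above, but the mapping from severity to required roles is an assumption made for this example.

```python
# Role responsibilities, summarized from the playbook text.
ROLE_DUTIES = {
    "Incident Commander": "Leads the response, makes key decisions, coordinates responders.",
    "Scribe": "Documents every action to build a timeline for later analysis.",
    "Subject Matter Expert": "Joins on demand, depending on the nature of the incident.",
}

# Hypothetical staffing rule: which roles must be filled at each severity.
REQUIRED_ROLES = {
    "SEV0": ["Incident Commander", "Scribe", "Subject Matter Expert"],
    "SEV1": ["Incident Commander", "Scribe"],
    "SEV2": ["Incident Commander"],
    "SEV3": ["Incident Commander"],
}


def unfilled_roles(severity: str, assignments: dict[str, str]) -> list[str]:
    """List the required roles that nobody has picked up yet."""
    return [role for role in REQUIRED_ROLES[severity] if role not in assignments]


# Only the commander has been assigned so far on a SEV1.
for role in unfilled_roles("SEV1", {"Incident Commander": "alice"}):
    print(f"Still needed: {role} ({ROLE_DUTIES[role]})")
```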
By providing clear guidelines on when an incident should be escalated, you empower your responders to confidently ping an executive or wake up a partner when they encounter a roadblock. This can be critical, as it ensures the resources and expertise needed to tackle an incident are available as soon as possible.
Tying escalation policies to incident severity helps prevent overreaction while ensuring critical issues receive the attention they need. Minimize the risk of delays caused by a lack of resources or indecision during an incident by documenting paths that can be taken when an incident becomes more complex than initially considered.
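To show what tying escalation to severity could look like, here is a small Python sketch. The contacts and the time thresholds are invented for illustration; they stand in for whatever your own policy specifies.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    after: timedelta  # time since the incident was declared
    notify: str       # who gets paged at that point


# Hypothetical policy: more severe incidents escalate sooner and higher.
ESCALATION_POLICY = {
    "SEV0": [
        EscalationStep(timedelta(0), "on-call engineer"),
        EscalationStep(timedelta(0), "incident commander"),
        EscalationStep(timedelta(minutes=15), "engineering executive"),
    ],
    "SEV1": [
        EscalationStep(timedelta(0), "on-call engineer"),
        EscalationStep(timedelta(minutes=30), "engineering manager"),
    ],
    "SEV2": [
        EscalationStep(timedelta(0), "on-call engineer"),
        EscalationStep(timedelta(hours=2), "team lead"),
    ],
    "SEV3": [EscalationStep(timedelta(0), "on-call engineer")],
}


def who_to_notify(severity: str, elapsed: timedelta) -> list[str]:
    """Everyone who should have been notified by the given elapsed time."""
    return [s.notify for s in ESCALATION_POLICY[severity] if elapsed >= s.after]


print(who_to_notify("SEV1", timedelta(minutes=45)))
```

Writing the policy as a table like this makes the escalation path something responders can look up rather than something they have to decide under pressure.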
Remove the guesswork from your response team by providing them with clear communication protocols. An approach that works reliably is to have responders collaborate in the same tools they use during their normal workdays, such as Slack or Microsoft Teams.
Your playbook should outline a structure for using Slack or Microsoft Teams to coordinate the incident response process. A common best practice is to codify this structure in a Slack bot that creates the relevant channels and prompts the team as necessary. Incident response platforms like Rootly do this very well, offering dozens of no-code automations to make communication less of a concern for your response team.
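As a sketch of the kind of convention a bot might codify, the Python below builds a predictable channel name and a kickoff message. It deliberately makes no Slack or Teams API calls; the naming scheme, the 80-character cap, the function names, and the example URL are all assumptions for illustration.

```python
import re
from datetime import date

MAX_CHANNEL_NAME_LENGTH = 80  # assumed limit; check your chat tool's rules


def channel_name(incident_id: int, title: str, opened: date) -> str:
    """Build a lowercase, hyphenated channel name from incident details."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"inc-{opened:%Y%m%d}-{incident_id}-{slug}"
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")


def kickoff_message(severity: str, commander: str, playbook_url: str) -> str:
    """First message posted in the channel so everyone shares the same context."""
    return (
        f"{severity} incident declared. "
        f"Incident Commander: {commander}. "
        f"Playbook: {playbook_url}"
    )


print(channel_name(42, "Checkout API: High Latency!", date(2024, 10, 8)))
print(kickoff_message("SEV1", "alice", "https://example.com/playbooks/sev1"))
```

A consistent naming scheme means responders can find the right channel by date or incident number without asking anyone.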
Your playbook should also outline how to manage communications with third parties or public announcements, including pointers on how and when to update your organization’s status page.
To make things easier, we’ve created a free incident comms playbook to help you respond quickly and confidently, no matter the severity of the issue.
Once the dust has settled, an incident retrospective is key to learning from what happened. Your playbook should include a framework for conducting retrospectives, analyzing the response, and identifying areas for improvement. The goal is to refine your playbook and response processes based on what worked and what didn’t.
Conducting these reviews allows your team to grow from each incident, updating your playbook with new insights and strategies. Over time, this continuous improvement leads to faster, more efficient incident responses, helping prevent the same issues from happening again.
You won’t get your playbook right on the first try, or the second. You’ll need several iterations to make your playbook a true and useful guideline for your responders. But even after you have a solid foundation, you’ll want to keep your playbooks relevant by tuning them based on feedback from each incident’s retrospective.
Was the playbook easy to follow? Did it lead you to a good spot when you followed it? Did you find any out-of-date instructions? These are examples of questions you can ask your responders when conducting a retrospective to gain insights on how to improve your playbooks each time.
It happens more often than you’d think: the on-call engineer gets thrown into an incident as Incident Commander but doesn’t know where to start and can’t find the playbooks. The stress and the volume of information during an incident can be overwhelming.
Ideally, your playbooks should be accessible from within the incident response management tool that you use. For example, with Rootly, all responders have a button that takes them to the relevant playbook at the top of the incident channel.
The playbook will help you recognize which parts of your incident response can be automated so responders have all the tools they need ready as soon as possible. Your incident response tool can help your team get rid of manual work and focus on resolving the incident.
For example, you might notice that fetching dashboards from Datadog is a task responders always need to do to assess the impact of an incident. You can create an automation with Rootly that fetches the relevant dashboards and attaches them to the incident’s Slack channel.
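The idea behind that kind of automation can be sketched in a few lines of Python. This is not Rootly's or Datadog's API: the service names, the `example.com` URLs, and the `dashboards_for` helper are hypothetical, and a real automation would post the result to the incident channel instead of printing it.

```python
# Hypothetical mapping from affected service to the dashboards responders
# usually open first. The URLs are placeholders.
DASHBOARDS = {
    "checkout": [
        "https://example.com/dashboards/checkout-latency",
        "https://example.com/dashboards/payments-errors",
    ],
    "search": ["https://example.com/dashboards/search-latency"],
    "payments": ["https://example.com/dashboards/payments-errors"],
}


def dashboards_for(services: list[str]) -> list[str]:
    """Collect dashboard links for the affected services, without duplicates."""
    links: list[str] = []
    for service in services:
        for url in DASHBOARDS.get(service, []):
            if url not in links:
                links.append(url)
    return links


for url in dashboards_for(["checkout", "payments"]):
    print(f"Relevant dashboard: {url}")
```

Unknown services simply yield no links, so the automation never blocks the response when the mapping is incomplete.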
Rootly is the leading incident response management tool trusted by companies like LinkedIn, Dropbox, NVIDIA, and Webflow. Rootly offers a centralized place for you to document and manage your playbooks, with automations that let you attach them to incidents based on their type, severity level, and team.
Talk to one of our reliability experts to learn how Rootly can help your team work with playbooks to expedite your incident response process.