June 5, 2024
4 mins
Not all incidents are created equal. Trying to fit every input an incident declaration might need into a single form can slow down responders and hurt your data quality.
A flight from Barcelona to Madrid takes 1.5 hours, while the train takes 2.5 hours. However, when you fly, you have to check in, go through airport security, and commute to and from the airports to get downtown. This adds at least 3 hours of overhead, which puts the flight at 4.5 hours or more door to door. For the Barcelona-Madrid itinerary, flying might not be the best idea given the overhead. But what if you’re going further? The flight overhead might be worth it.
Much like travel itineraries, incidents require different considerations depending on their nature. Some will be a basic service restart, while others will take you further into the night. What if the incident is a security breach? Or what if it’s a rack that won’t turn on again? Oh, wait, is it impacting an important customer?
Imposing the same process on all incidents will end up bloating the checks, questions, and tasks that your responders have to go through every time. This is what we call “incident overhead”: having so many corner cases in your playbook that you end up increasing your mean time to resolution.
Whether you’re using an incident management tool or hand-rolling your response process, applying the same process to all incidents is likely to force responders to deal with overhead that isn’t justified by their use case. In this article I’ll go over what to consider when designing your incident response process.
Incident overhead builds up little by little. First you realize you need to collect compliance data when a security incident breaks, so you add it to your responders’ playbook. Then, customer success asks you to pay special attention to some accounts that are at risk of churn, so you toss more inputs and guidelines into your response process. Expecting new migration-related inputs? Here are some more for the incident declaration form!
It’s not uncommon to end up with long questionnaires that every responder has to rush through to find the fields relevant to their incident. This not only adds overhead to the response process but also makes human error more likely, which leaves your incident data less consistent.
A way to start reducing your incident overhead is by splitting your response process and forms based on the incident type. For example, the marketing website going down requires very different inputs from a potential security incident in the payments system.
Instead of having everyone fill in a giant form that covers the security incident case, you can have them first select the type of incident and show them a special form only if it’s a security incident.
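As a minimal sketch of that idea, assuming a hand-rolled process rather than any particular tool, the snippet below keeps a short list of fields shared by every incident and adds extra fields only for the selected type. The incident types and field names are hypothetical, chosen purely for illustration.

```python
# Hypothetical incident types and field names, for illustration only.
COMMON_FIELDS = ["title", "summary", "severity"]

EXTRA_FIELDS_BY_TYPE = {
    "security": ["data_exposed", "compliance_contact"],
    "infrastructure": ["affected_rack", "datacenter"],
    "website": ["affected_url"],
}


def fields_for(incident_type: str) -> list[str]:
    """Return the fields a responder sees for the selected incident type."""
    # Unknown types fall back to the common fields only.
    return COMMON_FIELDS + EXTRA_FIELDS_BY_TYPE.get(incident_type, [])


if __name__ == "__main__":
    print(fields_for("website"))
    print(fields_for("security"))
```

With this layout, a responder declaring a website outage never sees the compliance fields, and adding a new incident type means adding one entry to the mapping instead of growing the shared form.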
To reduce the overhead your responders have to deal with when declaring an incident, you need to remove steps and questions from the process. More precisely, only require the actions and answers that are needed in each case.
That might require creating dedicated logic forks in the incident declaration process, such that responders are only exposed to what they need to see.
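One way to picture such a logic fork, sketched here under assumptions, is to attach a condition to each question and evaluate it against the answers given so far. The `Question` class, the question keys, and the conditions below are all hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass
from typing import Callable

Answers = dict[str, object]


@dataclass
class Question:
    key: str
    prompt: str
    # Predicate deciding whether to ask, given the answers so far.
    show_if: Callable[[Answers], bool] = lambda answers: True


# Hypothetical questions; the keys and conditions are assumptions.
QUESTIONS = [
    Question("type", "What type of incident is this?"),
    Question("customer_facing", "Is a customer impacted?"),
    Question(
        "account",
        "Which account is impacted?",
        show_if=lambda a: a.get("customer_facing") is True,
    ),
    Question(
        "compliance_contact",
        "Who is the compliance contact?",
        show_if=lambda a: a.get("type") == "security",
    ),
]


def visible_questions(answers: Answers) -> list[str]:
    """Return the keys of the questions that apply given the answers so far."""
    return [q.key for q in QUESTIONS if q.show_if(answers)]
```

Calling `visible_questions` again after each answer lets the form grow only when an earlier answer makes a follow-up question relevant.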
You can also reduce declaration overhead by presenting questions to responders in the format they need. For example, if your security team uses a different convention to declare severity, you can show them a dropdown with the relevant options.
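The severity example could look roughly like the following sketch. The team names and severity labels are assumptions made for illustration; your teams would define their own conventions.

```python
# Hypothetical severity conventions; real teams would define their own.
DEFAULT_SEVERITIES = ["SEV1", "SEV2", "SEV3"]
SEVERITIES_BY_TEAM = {
    "security": ["Critical", "High", "Medium", "Low"],
}


def severity_options(team: str) -> list[str]:
    """Return the dropdown options for the responder's team."""
    return SEVERITIES_BY_TEAM.get(team, DEFAULT_SEVERITIES)


def validate_severity(team: str, value: str) -> str:
    """Reject values outside the team's convention to keep data consistent."""
    options = severity_options(team)
    if value not in options:
        raise ValueError(f"{value!r} is not a valid severity for {team!r}: {options}")
    return value
```

Validating against the same list that feeds the dropdown keeps the two from drifting apart, which helps with the data consistency problem described earlier.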
Making your incident declaration forms more dynamic can significantly reduce your incident overhead. But maintaining logic forks manually can be a lot of work. Modern incident management tools like Rootly provide robust options to customize your playbooks when you need them.