Making Your On-call and Incident Management Program Stick
Maintaining your incident management practice is as important as creating it. Find out what you can do to keep your engineering organization strong and consistent year over year.
February 26, 2024
This post was contributed by Strong Liang. It has been lightly revised and reposted with his permission from the original article on Medium.
So, you’re training incident commanders (ICs), and you have your group read Google’s SRE books. Everyone knows what they are supposed to do, and you are ready for any incident, right?
Not quite. Half of your team complains that the descriptions are too vague or don’t apply to their situations, and the other half just starts to improvise.
The result? Inconsistent incident responses and burnt-out ICs. Both are undesirable and hard to fix.
The Google books offer helpful general principles. In practice, however, you shouldn’t assume they are immediately applicable to your environment. Some companies have comprehensive, high-quality documentation, mature incident response tooling, and, more importantly, experienced ICs. However, if your organization is early in the process of establishing an incident response practice, you will need to expand on these principles so that your response team can handle the chaos with greater ease.
Below are 3 key steps to make the Google principles clearer and more actionable.
The Google text says:
“the incident commander holds the high-level state about the incident.”
This makes sense at a high level, but the actual behaviors of getting “high-level state” are open to interpretation. As the IC, you need to make sure you and the leadership have the same interpretation. Multiple times, I’ve seen leadership upset about incident handling that the IC felt great about — not an easy piece of feedback to deliver to the IC.
You want to craft guidelines that can drive consistent behaviors from the incident commanders. “[H]olds the high-level state” means being able to answer the following questions:
Your list may have more items to encapsulate all the aspects that lead to optimal TTM (time to mitigate) and timely, clear comms. This can become the contract between stakeholders and the IC, which also enables the IC to self-check if they are doing their job right.
The books talk about the many tasks during an incident, including coordination, communication, and troubleshooting. But priorities are context-dependent: they may change over time, they may compete, or they may get neglected. Here are some examples and ideas for handling each case.
Adapt to Changing Priorities: Incident response can be a sensitive area when it comes to changing expectations, especially if your major customers become unhappy. For example, an established response team I worked with got feedback that the comms were not frequent enough. The team members were confused and pushed back — they had been doing comms all along and hadn’t gotten this feedback before. Management needs to update the response team about major customer concerns and help them understand the resulting process changes.
Balance Competing Priorities: Many incident response tasks seem to be equally important and urgent. In the previous example, after the team improved the comms frequency, we got a new problem: the clarity and consistency took a hit. I said, “Wait a minute, we cannot compromise on quality,” to which a team member asked, “Which one do I prioritize, speed or quality?” It might be tempting to say “both”, but that’s often wishful thinking. These strategies can help you prioritize:
Setting deadlines for tasks: In the above example, I said, “You don’t need to go faster than the policy (e.g., 15 minutes for SEV0). So within that window, prioritize quality. But if you keep seeing delays in sending the comms out, you need to do something to speed things up, such as reducing the level of detail or crowdsourcing the writing.”
Single-tasking and delegation: With multiple high priorities, pick one and delegate the rest. The “A managed incident” example in the Google book illustrates this idea well. The incident commander realized that she couldn’t troubleshoot and command at the same time, so she delegated the command. In the example above, crowdsourcing is also a form of delegation.
Avoid Overlooking Key Priorities: Because of the pressure during a crisis, some assumptions may go unnoticed and cause negative effects. For example, at the beginning of an incident, it’s a top priority to give context to new response team members. However, this often doesn’t happen enough when there are 10+ people joining at different times. It’s common that some people join after the briefing and feel reluctant to ask questions. Under pressure, it’s easy to equate no questions with understanding, when people are actually struggling with context. This is underutilizing critical resources when they are most needed. It is worth making it a priority for ICs to give context periodically — spoken and written — in the communication channel, or assign someone to do so. They may be worried about repeating themselves, but it’s often new information to the group.
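The deadline strategy described under Balance Competing Priorities can be made mechanical. Here is a minimal Python sketch, assuming a hypothetical per-severity comms policy. Only the 15-minute SEV0 window comes from the example above; the SEV1 value, the `COMMS_POLICY` table, and both function names are illustrative assumptions, not part of any real tool.

```python
from datetime import datetime, timedelta

# Maximum time allowed between stakeholder updates, per severity.
# Only the SEV0 value comes from the article; SEV1 is a placeholder assumption.
COMMS_POLICY = {
    "SEV0": timedelta(minutes=15),
    "SEV1": timedelta(minutes=30),
}


def time_left_for_comms(severity: str, last_update: datetime, now: datetime) -> timedelta:
    """Return how much of the comms window remains (negative if overdue)."""
    return COMMS_POLICY[severity] - (now - last_update)


def comms_advice(severity: str, last_update: datetime, now: datetime) -> str:
    """Turn the remaining window into the speed-vs-quality guidance from the text."""
    remaining = time_left_for_comms(severity, last_update, now)
    if remaining < timedelta(0):
        # Past the policy window: speed wins, so trim detail or delegate.
        return "overdue: send a shorter update now or crowdsource the writing"
    # Still inside the window: there is no need to go faster than the policy.
    return "within window: prioritize quality"
```

For instance, calling `comms_advice("SEV0", last_update, datetime.now())` ten minutes after the last update would advise prioritizing quality, since five minutes of the window remain. The point of the sketch is that the policy window, not a vague sense of urgency, decides when speed should override quality.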
Preventing anti-patterns may be just as important as laying out the happy path. You should expand on what the books offer, based on your organization’s culture and maturity. Here are 3 common anti-patterns and how to avoid them:
Once I joined an incident war room to observe the response. There were 20 people in the room, but no one said a word for 5 minutes. I DM’ed the IC to find out what was going on. He said people were hard at work looking at things individually, and he didn’t want to disturb them to ask about their progress. I said, “I hope you’re right, but how do you know?” It could also be that no one was making progress, and people were just secretly counting on others. Without active management, the team tends to slow down and become single-threaded. To avoid this, the IC should periodically break the silence and re-engage the team. If your ICs are still reluctant to interrupt or worried about sounding pushy, here’s some language:
“Team, it’s been 15 minutes since the last update, where are we in identifying the problem?”
“Javier, you were looking into XYZ. Do you have an update?”
“Meeta, would you report your findings in 10 minutes?”
(Note: the frequency of updates depends on the nature of the incident. If an incident lasts for hours, you can negotiate a cadence for timed syncs or progress reports.)
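The check-in prompts above lend themselves to a simple reminder helper. Here is a minimal Python sketch, assuming the IC (or a chat bot) records when each responder last reported. The `Responder` type, the `stale_prompts` function, and the 15-minute default cadence are illustrative assumptions, not an established tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Responder:
    name: str
    task: str              # what this person said they would look into
    last_update: datetime  # when they last reported progress


def stale_prompts(responders, now, cadence=timedelta(minutes=15)):
    """Build a check-in prompt for each responder silent longer than `cadence`."""
    prompts = []
    for r in responders:
        if now - r.last_update > cadence:
            prompts.append(
                f"{r.name}, you were looking into {r.task}. Do you have an update?"
            )
    return prompts
```

The `cadence` parameter reflects the note above: for an incident that lasts hours, the IC would pass a longer, negotiated interval rather than the default.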
This happens when two or more people insist on different directions, causing confusion about what action to take. It may look like progress because the discussion is active, but on closer inspection, no action is being taken. To avoid this, the IC needs to interrupt the flow by reminding the team of the priorities and driving towards actions. If people are talking past each other, focus the team on one idea at a time. If a subgroup is having an important discussion that doesn’t require the rest of the group, create a breakout room for them and ask them to report back after a set time.
This often happens when an executive, not usually involved in technical resolutions, joins the war room. Their intention is to help, but oftentimes it’s counterproductive because of the power dynamics. This can also happen with any long-tenured and/or well-respected experts. They may stand behind a decision not based on best practice, and it can be a challenge to push back. Once, when I oversaw an incident caused by a bad release, a long-tenured expert suggested fixing forward instead of rolling back. The IC questioned the decision; the expert insisted. I thought of saying something, but the conviction of the expert made me second-guess myself (maybe this is a special case with nuances that only the expert can appreciate). Turns out the fix forward was not a good call, since it took at least twice as long as a rollback. That was embarrassing to explain to my boss later. Since then, when someone suggests something outside the norm, the IC whips out a policy doc with the standard procedure. People may still ignore policies, but it tends to make them pause for a moment, perhaps assessing the risk and consequence. That brief moment may be all the IC needs to strengthen their position.
In summary, these solutions empower you to efficiently address challenges in incident management. They include clear definitions, alignment with leadership, detailed operational guidelines, prioritization strategies, and the prevention of anti-patterns, collectively contributing to a structured and efficient incident management approach. Implementing these tailored solutions bridges the gap between theory and practice, ultimately resulting in more successful incident responses with less burnout for you.
Thank you to contributors: Nicolas Gattig, Ashley Sawatsky, Dominic Becker, Victoria Tovar
Zhuang (Strong) Liang is a software engineering leader with over 16 years of experience, specializing in Reliability and Infrastructure at world-class companies like Affirm, Google, and Uber. You can keep up with his posts on Medium.