Part of making incident management “work” at your company is establishing a shared vocabulary and a set of rituals. These cultural artifacts allow your team to ascend to the next level of proficiency in the practice, conveying complex ideas and status to each other with brevity. Achieving this consistently across teams and functions at your company is very difficult work, so if you’ve made it this far: congratulations!
To get here, you’ve probably presented material like your severity scale, tools, and retro format to a large group. You’ve answered a thousand “what if” questions about severity alone, along with questions about a dozen other tools and practices you’ve established. Your organization gets it. When an incident happens, engineers pass issues between teams quickly and easily. They declare a cryptic severity to a crowd of onlookers who know what it means, and they close incidents with confidence. Your efforts have been rewarded! Chaos is contained. But there’s a serious problem on the horizon: new engineers are starting all the time.
In this article we’re going to take a moment and unpack what makes a good incident management training session, and explain why you should absolutely have your SRE team participating in the practice.
As your organization advances in building a culture around incident management, more and more complexity will be abstracted behind words. Your organization becomes more predictable, but at the expense of more and more actions being assumed rather than said out loud. And just like every culture, yours will require rituals in order to replicate itself. Without this mechanism, your culture will erode as engineers depart; at normal rates of attrition, this can happen in as little as a year. When jargon is shared and understood, it increases inclusion. When jargon is used by only a portion of your organization, it’s easy to create rifts without even realizing it.
Introducing New Hires to Your On-Call Culture
Like every initiation rite, your introduction to on-call should have two stages:
- Stage 1: Understanding and observing the practice
- Stage 2: Performing the practice themselves
Your goal in stage one is to reduce fear and anxiety around the topic. Being on call for a new system is intimidating, so it helps to separate the cognitive load of learning the systems and products from that of learning the tools used to manage incidents. For this reason, it’s a best practice to hold this meeting 3-4 weeks before an engineer’s likely first on-call rotation. At many companies, this means the meeting will be part of new-hire training.
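If you schedule these sessions from an on-call roster, the 3-4 week lead time is simple date arithmetic. Here is a minimal sketch; the function name and the example date are hypothetical, and your own roster tooling will likely differ.

```python
from datetime import date, timedelta


def training_window(first_on_call, earliest_weeks=4, latest_weeks=3):
    """Return the (earliest, latest) dates to hold the intro session.

    Assumes the session should land 3-4 weeks before the first rotation;
    adjust the defaults to fit your own onboarding schedule.
    """
    earliest = first_on_call - timedelta(weeks=earliest_weeks)
    latest = first_on_call - timedelta(weeks=latest_weeks)
    return earliest, latest


# Hypothetical example: an engineer whose first rotation starts 2024-06-24.
earliest, latest = training_window(date(2024, 6, 24))
print(f"Hold the session between {earliest} and {latest}")
```

For a first rotation starting on June 24, this suggests holding the session between May 27 and June 3.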
Here’s what you should cover in this training:
- What is an incident?
- What are our levels of incident severity and what do they mean?
- What is the lifecycle of an incident?
- How can I learn the current status of an incident?
- What is expected of the primary on-call for a team?
- What is the role of the secondary on-call for a team?
- What tools do we use for incident management and what is their purpose?
It is absolutely critical that you assume nothing about your participants for this training session. This information can be presented via a quick slide deck – the most important part comes next. After explaining this to your cohort of new hires, you should design a quick and memorable exercise to test for understanding.
Walk through the setup of your pager product of choice and have fun choosing the sound that you’re going to instinctively dread for the next couple of years. Ensure that everyone understands what it will look like to get paged, that pages bypass their phone’s Do Not Disturb settings, and that other simple hiccups are out of the way. Everyone in this meeting should receive at least one example page and send one to a colleague. This will be a very loud meeting, if performed correctly.
With this tool out of the way, you should work on a “tabletop” incident as a group. Have a story or scenario written up in advance. Pick a volunteer from the group to receive the page and unravel the story together. During this activity your volunteer, with help from the group, should walk through the entire process of receiving and communicating about an incident. The group should answer simple questions throughout the process: What is the severity of this? Should we change the severity based on what we’ve just learned? What status would you assign to this incident? Who should be engaged? How should we communicate about this?
Try to make these mock incidents a hands-on, fun experience, with a ridiculous story. It’s important not to overload this exercise with complexity or real-world failure modes. Your goal is to build positive associations with the SRE team, and provide a cohort of new hires with a bonding experience. Cultivating psychological safety is absolutely critical to improving your practices. This exercise provides a low-risk place to begin. The training tests for understanding in a memorable, fun way.
Your attendees will leave your meeting with a moderate degree of comfort with and understanding of the process – over the next few weeks they’ll use what you’ve taught them to make sense of what they’re seeing at work, and if things are going well, they’ll view your SRE team as an approachable source of expertise in these matters.
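One way to have the scenario “written up in advance” is to keep it as a small data structure that the facilitator reads from. The sketch below is only an illustration: the `Beat` and `Scenario` classes and the llama story are invented for this example, and the questions are the ones suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class Beat:
    """One step of the story: what the group learns, and what to ask them."""
    reveal: str
    questions: list


@dataclass
class Scenario:
    title: str
    page_text: str
    beats: list = field(default_factory=list)

    def script(self):
        """Flatten the scenario into facilitator prompts, in order."""
        lines = [f"PAGE: {self.page_text}"]
        for number, beat in enumerate(self.beats, start=1):
            lines.append(f"Beat {number}: {beat.reveal}")
            lines.extend(f"  Ask: {question}" for question in beat.questions)
        return lines


# A deliberately ridiculous, hypothetical story.
scenario = Scenario(
    title="The Office Llama Incident",
    page_text="Llama treat dispenser is returning errors",
    beats=[
        Beat(
            "Only one llama is affected.",
            ["What is the severity of this?", "Who should be engaged?"],
        ),
        Beat(
            "Every llama in the building is now hungry.",
            [
                "Should we change the severity based on what we've just learned?",
                "How should we communicate about this?",
            ],
        ),
    ],
)

for line in scenario.script():
    print(line)
```

Keeping each reveal paired with its questions makes it easy to pause after every beat and let the group answer before the story moves on.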
First On-Call Rotation
This is an important “rite of passage” moment for your new hire, so find a way to show them encouragement and support as they transition from observer to participant. This might be as simple as asking engineering managers on every team to mark a shared calendar with every first rotation. Acts like a high-five or a quick check-in with a new hire before they go on call send an important message: the SRE team is here to help and support you, and we care about how this is going for you.
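If your on-call schedule can be exported as a list of shifts, spotting first rotations can also be automated instead of relying on managers to remember. The following is a rough sketch under that assumption; the `(engineer, start_date)` shift format, the names, and the dates are all made up for illustration.

```python
from datetime import date


def first_rotations(shifts):
    """Map each engineer to the start date of their earliest shift."""
    first = {}
    for engineer, start in shifts:
        if engineer not in first or start < first[engineer]:
            first[engineer] = start
    return first


def upcoming_first_rotations(shifts, today):
    """List (start, engineer) pairs for first rotations that have not started yet."""
    firsts = first_rotations(shifts)
    return sorted(
        (start, engineer) for engineer, start in firsts.items() if start >= today
    )


# Hypothetical schedule export: (engineer, shift start date).
shifts = [
    ("avery", date(2024, 5, 6)),
    ("blake", date(2024, 6, 3)),
    ("avery", date(2024, 6, 10)),
    ("casey", date(2024, 6, 17)),
]

for start, engineer in upcoming_first_rotations(shifts, today=date(2024, 5, 27)):
    print(f"{start.isoformat()}: first on-call rotation for {engineer}")
```

In this example only blake and casey are listed, since avery has already completed a rotation. The output could feed the shared calendar or a reminder to check in with each new hire.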