As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear whether the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to a real incident (or a false alarm).
What is “triage”?
Triage is the initial stage of an incident before the incident is officially declared. During this stage, you’re determining whether the problem meets your organization’s criteria for an “incident”. If you work in a distributed system, you might be trying to validate whether the problem is stemming from your system, or something that needs to be resolved by a third party that interacts with your in-house system. By the end of the triage stage, you’ve determined whether the problem is a new incident, related to an existing incident, or a false alarm, and you’re ready to move forward accordingly by starting, merging, or canceling the incident.
So if triage happens before the actual incident, why should it be codified into your incident management process at all? Let’s talk about it.
Create psychological safety
Starting an incident is scary! Nobody wants to feel like the boy who cried wolf. At many organizations, especially ones that leverage automation in their incident management, starting an incident triggers a series of events, like paging on-call engineers and notifying executives, all of whom will likely have immediate questions that the reporter can’t yet answer. It’s easy to think, “I better look into this before I ring any alarm bells.” But if you have a real incident on your hands, that solo investigation costs valuable time that could be spent assembling a response team and capturing data.
Having a triage process creates a safe and consistent place for someone to say “Hey—I think we might have an incident here, but I’m not sure. Here’s what I’m seeing.” With Rootly’s flexible workflows, you can decide what you want to happen in these cases. For example:
Set a time limit for how long an incident can remain in “triage” with automated Slack reminders
Send automated notifications to the affected product’s Slack channel
Send a “soft page” (a Slack ping or email) to the on-call SRE to notify them of a potential emerging incident being triaged
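To make the first idea concrete, here is a minimal Python sketch of the logic behind a triage time limit with recurring reminders. It is not Rootly's API or workflow configuration: the `TriageIncident` class, the `reminders_due` function, and both durations are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical policy values; pick whatever fits your organization.
TRIAGE_TIME_LIMIT = timedelta(minutes=30)
REMINDER_INTERVAL = timedelta(minutes=10)


@dataclass
class TriageIncident:
    """Hypothetical record of an issue that is still being triaged."""
    title: str
    opened_at: datetime


def reminders_due(incident: TriageIncident, now: datetime) -> int:
    """Return how many reminders should have been sent by `now`.

    No reminder is due until the incident has sat in triage for the full
    time limit. After that, one is due immediately and one more after
    each reminder interval.
    """
    overdue = now - incident.opened_at - TRIAGE_TIME_LIMIT
    if overdue < timedelta(0):
        return 0
    return 1 + overdue // REMINDER_INTERVAL


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    incident = TriageIncident("Checkout errors reported by support", opened)
    # 45 minutes in: 15 minutes past the limit, so two reminders are due.
    print(reminders_due(incident, opened + timedelta(minutes=45)))  # prints 2
```

A scheduler could compare this count against the number of reminders already sent and post to Slack only when it has gone up, which keeps the nudges regular without spamming the channel.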
Create a path for customer support or other non-engineering teams
Even with observability systems in place, sometimes customers pick up on issues before automated alerting does. Maybe your social media manager is suddenly receiving messages from customers claiming there’s an issue with your service. Conversations start popping up in Slack among the customer support team, marketing, etc. At what point do a handful of similar tickets become a trending issue? Who is responsible for noticing and reporting that trend to your engineering team, and what path do they take to do so? You probably don’t want to give your entire customer support team the ability to start an incident, but they should have a consistent way to flag trending issues for engineers to investigate when needed. Introducing triage can ensure that issues picked up outside of your engineering team make it to the right team as quickly as possible, without causing panic by declaring incidents before they’re validated.
Create visibility to avoid duplicating efforts
You know the old saying about trees falling in a forest? The same thing goes for incident investigation. When a problem is being looked into, but nobody has visibility into that investigation, people waste time starting their own investigations or looking for ways to report the issue to someone they think can help. The sooner you can create a single source of truth for people to refer to, the less you run the risk of multiple channels and investigations popping up.
Creating a triage incident gives immediate visibility into your investigation. If it turns out the problem is already being investigated, or is related to an existing incident, you can easily merge your triage incident into the existing one. Any incidents in triage are displayed on your Rootly home page. Just like regular incidents, you can create workflows to notify specific individuals or Slack channels when a triage incident is created.
Capture more accurate incident data sooner
When an incident is officially declared, dedicated conversation spaces like a Slack channel or Zoom room are typically spun up. If an investigation took place elsewhere before the declaration, you’re left to track down and manually carry over all of that context and conversation to your new incident workspace, or lose it completely. That is a real loss: the early stages of incidents reveal valuable data about how issues are being identified, the time to assemble a response and mitigate the issue, and more.
When you create a triage incident in Rootly, a dedicated Slack channel is created automatically. If you validate the problem, you can simply mark the incident as “Started” and kick off your processes from the same channel.
If you solved the issue in triage (or discovered it wasn’t an incident at all), you can “Cancel” the triage incident to resolve it in the triage stage without declaring an official incident.
Either way, you’ll still automatically capture all that important data, like the incident’s timeline and resolution. And because Rootly provides customizable incident metrics, you can easily filter triage incidents out of your dashboards, so no need to worry about your uptime metrics being skewed by false alarms.
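The filtering idea can be sketched in a few lines of Python. This is an illustration of the concept only, not Rootly's data model or metrics API: the `Status` values, the `Incident` fields, and both functions are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical lifecycle states for this sketch."""
    TRIAGE = "triage"
    STARTED = "started"
    RESOLVED = "resolved"
    CANCELED = "canceled"  # closed during triage, never declared


@dataclass
class Incident:
    title: str
    status: Status
    downtime_minutes: int


def declared_incidents(incidents: list[Incident]) -> list[Incident]:
    """Keep only incidents that were officially declared."""
    never_declared = {Status.TRIAGE, Status.CANCELED}
    return [i for i in incidents if i.status not in never_declared]


def total_downtime(incidents: list[Incident]) -> int:
    """Sum downtime across declared incidents, ignoring false alarms."""
    return sum(i.downtime_minutes for i in declared_incidents(incidents))


if __name__ == "__main__":
    history = [
        Incident("Checkout errors", Status.RESOLVED, 42),
        Incident("Slow dashboard (false alarm)", Status.CANCELED, 0),
        Incident("Login failures", Status.STARTED, 10),
        Incident("Possible email delay", Status.TRIAGE, 0),
    ]
    print(total_downtime(history))  # prints 52
```

The triage and canceled records stay in the history, so you can still study how issues were first spotted, but they never reach the uptime calculation.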
If you’re a Rootly user, you can start creating triage incidents today in your web platform or right in Slack with the /incident new command.
💡 Want to learn more about how Rootly automates the incident management process, from discovery to retrospective? Book a free personalized demo with our team or sign up for a 14-day trial today at rootly.com.