March 11, 2022
4 min read
AIOps can bring some value to SREs, but it’s important to maintain a healthy perspective about the limitations of AIOps.
If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier.
Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter.
Which perspective is right? Should SREs embrace AIOps with open arms, or should they resist marketers’ efforts to position AIOps as the latest, greatest tooling innovation in the IT industry?
Those are subjective questions that we can’t answer definitively, but let’s at least gain some perspective by examining what AIOps means for SREs.
As you’ve probably heard by now if you keep up to date with your IT buzzwords, AIOps – which is short for artificial intelligence for IT operations – is the use of AI and machine learning to help automate IT Ops workflows.
The big idea behind AIOps is that, by using AI and ML to perform advanced analysis of large volumes of data from IT systems, IT and SRE teams can solve complex problems more efficiently than they could using a manual approach.
AIOps can, for example, help to surface the root cause of a performance issue in a complex, multi-layered environment like Kubernetes. Or, it could make recommendations about how best to resolve an incident.
AIOps entered the IT lexicon in 2016, when Gartner coined the term. By now, it’s a relatively well-established tool domain.
Although AIOps has been around for some time, it doesn’t yet appear that many SREs have bought into the AIOps revolution. Catchpoint found in a 2021 survey that just 7.5 percent of SREs reported that AIOps tools delivered “high value” to their organizations.
It’s unclear exactly why SREs report low rates of excitement about AIOps. But we’d speculate that there are a few key factors at play:
From an SRE’s perspective, then, AIOps may appear over-hyped, overly complicated and underperforming compared to traditional approaches to SRE.
SREs’ wariness toward AIOps is valid – but only to a point. It’s important not to let suspicions about the limitations of AIOps turn into excuses not to use AIOps at all. AIOps has some value to offer to SREs, even if it’s not perfect.
For example, AIOps can play a role in reducing toil. To the extent that AIOps tools can recognize complex patterns or interrelate data sets more quickly than human engineers, AIOps reduces the time SREs have to spend manually troubleshooting problems or poring over complicated information.
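To make the idea of interrelating data concrete, here is a minimal sketch of alert grouping, which is one simple way to cut down on manual triage. It is a rule-based stand-in rather than real machine learning, and it is not how any particular AIOps product works. The `Alert` type, the `group_alerts` function, the 60-second window and the sample data are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some arbitrary start (hypothetical field)
    service: str      # name of the service that raised the alert
    message: str


def group_alerts(alerts, window_seconds=60.0):
    """Group alerts from the same service that fire close together in time.

    An alert joins the most recent group for its service if it arrives
    within window_seconds of that group's latest alert; otherwise it
    starts a new group.
    """
    groups = []
    latest_group = {}  # service name -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


# Made-up sample data: five raw alerts collapse into three groups.
alerts = [
    Alert(0, "checkout", "p99 latency high"),
    Alert(10, "checkout", "error rate rising"),
    Alert(20, "search", "pod restarted"),
    Alert(30, "checkout", "queue depth growing"),
    Alert(500, "checkout", "p99 latency high"),
]
groups = group_alerts(alerts)
print([len(g) for g in groups])
```

Even a crude rule like this means an engineer looks at three grouped events instead of five separate pages. AIOps tools aim to do the same kind of correlation across far larger and messier data sets.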
AIOps also helps to enable a more proactive approach to monitoring and incident management. If AIOps tools can alert SREs to emerging issues before SREs would otherwise recognize them, AIOps can help the SREs to get in front of the problems before they turn into true incidents. That’s better for SREs and end-users alike.
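As a rough illustration of what flagging an emerging issue can look like, the sketch below marks metric samples that deviate sharply from a trailing window, using a simple rolling z-score. This is a basic statistical technique, not the method used by any specific AIOps tool. The function name, window size, threshold and latency values are assumptions made for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that deviate sharply from recent history.

    Each sample is compared against the mean and standard deviation of the
    previous `window` samples. It is flagged when it sits more than
    `threshold` standard deviations away from that mean.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the check when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(detect_anomalies(latencies_ms))  # prints [7]
```

A detector this simple would produce plenty of false positives on real traffic, which is part of why SREs are right to treat the more ambitious AIOps claims with some caution.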
There is also an argument to be made that AIOps can help SREs do more with fewer engineering resources. If you can use AI to automate some aspects of monitoring and incident response, you can maintain the same levels of availability and performance with fewer human engineers on hand.
None of the above is to say that AIOps can replace SREs, or that it magically solves every problem SREs face. Anyone who believes AIOps is a silver bullet has bought into the marketing hype to an unhealthy degree.
Nonetheless, AIOps tools do offer value to SREs. They make SREs’ jobs easier in some respects, and they can improve reliability outcomes.
So, while it’s wise to maintain a healthy perspective about the limitations of AIOps, SREs shouldn't rule out AIOps tools as one way to improve reliability engineering.