August 13, 2021
5 min read
The Four Golden Signals of monitoring and observability get a lot of things right. But they could be even better.
If you’re an SRE, there’s a decent chance that you live and die by the “Four Golden Signals.” Alongside similar concepts like the RED Method, the Four Golden Signals form the foundation for many a monitoring and observability strategy today.
That’s not a bad thing. In many ways, the Golden Signals excel at distilling complex monitoring processes down into a core set of easy-to-digest concepts.
But increasingly, the Golden Signals are no longer enough to achieve optimal monitoring and observability outcomes. It’s not time to do away with the Golden Signals, but it’s worth rethinking and extending them to meet modern SRE challenges.
Here’s why.
The Four Golden Signals are a set of recommendations about which types of data to collect when monitoring and observing systems. Popularized by Google’s SRE book, they boil down to the idea that SREs should collect four basic types of information from the systems they support: latency, traffic, errors and saturation.
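To make the idea concrete, here is a minimal Python sketch that summarizes one observation window as the four signals named in Google’s SRE book (latency, traffic, errors and saturation). The `Request` record, the `golden_signals` function and the choice of CPU as the saturated resource are all assumptions made for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_used_cores, cpu_capacity_cores):
    """Summarize one observation window as the four signals."""
    count = len(requests)
    return {
        # Latency: mean time to serve a request, in milliseconds.
        "latency_ms": sum(r.duration_ms for r in requests) / count if count else 0.0,
        # Traffic: demand on the system, in requests per second.
        "traffic_rps": count / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": sum(1 for r in requests if not r.ok) / count if count else 0.0,
        # Saturation: how full the constrained resource is (CPU, by assumption).
        "saturation": cpu_used_cores / cpu_capacity_cores,
    }
```

In practice these values would come from your metrics pipeline rather than from an in-memory list, but the shape of the calculation is the same.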
The Golden Signals have several important strengths.
One is that they do a nice job of covering all of the data points an SRE would typically want to collect from an application or system. In other words, even though there are only four signals, they’re comprehensive, making this a simple yet effective way to approach monitoring and observability.
The Golden Signals are also advantageous because they address any type of system. Whether you’re monitoring a SaaS application, a containerized microservices app running in Kubernetes or a monolith hosted on bare metal, the Golden Signals cover pretty much everything you’d need to know about the state of the app itself.
Along similar lines, I like that the Golden Signals don’t try to draw a distinction between application metrics and infrastructure metrics. Historically, SREs tended to treat each of these layers of the stack as a separate entity when it came to monitoring. You’d collect metrics like CPU and memory utilization from your infrastructure, while collecting request rates and error metrics from the app.
The problem with that approach is that the line between applications and infrastructure is not so clear in many modern environments. In Kubernetes, for example, CPU utilization isn’t necessarily a good measure of how much of the total available CPU resources a pod is using, because Kubernetes abstracts the pod from the underlying physical infrastructure and may impose arbitrary resource limits.
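A small Python sketch shows why the same CPU reading can look healthy or alarming depending on what you compare it to. The function name and the example numbers are hypothetical; the only assumption about Kubernetes is that CPU is expressed in millicores and that a pod can have a CPU limit lower than the node’s capacity.

```python
def cpu_ratios(used_millicores, limit_millicores, node_millicores):
    """Express a pod's CPU usage against two different ceilings."""
    return {
        # Share of the whole node's CPU that the pod is consuming.
        "of_node": used_millicores / node_millicores,
        # Share of the pod's own CPU limit that it is consuming.
        "of_limit": used_millicores / limit_millicores,
    }


# Hypothetical pod: using 450m CPU, limited to 500m, on a 4000m (4-core) node.
ratios = cpu_ratios(450, 500, 4000)
```

Measured against the node, this pod looks nearly idle; measured against its own limit, it is close to being throttled. A saturation signal sidesteps the question of which layer the number belongs to.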
Finally, it’s hard not to love how the Golden Signals avoid terminology like “logs” and “metrics.” Instead, they refer to “signals.” That’s nice because, although SREs are primed to think about logs and metrics (and traces, for that matter) as being separate sorts of things, the fact is that they are often overlapping categories of data, and the difference usually doesn’t matter. If you generate logs in AWS CloudWatch based on metrics that you collect from an AWS service, for instance, are those metrics or logs? It doesn’t matter from an observability standpoint. The Golden Signals help teams avoid getting stuck in the mud of trying to force data into different buckets, and help them focus on the data itself, no matter what its form.
Now that we’ve detailed all the things that the Golden Signals get right, let’s look at their shortcomings.
One arguable problem with the Golden Signals is that, although they seem very simple on the surface, they are difficult to apply to a real-world monitoring or observability strategy.
That’s mainly because you often need to collect far more than four signals when supporting a system. In practice, you need to collect at least four signals from every microservice in your application. Collecting the Four Golden Signals just for an application as a whole isn’t very useful because it won’t give you the visibility you need to pinpoint problems that originate in a specific microservice.
Likewise, you may also need to collect signals from your orchestrator, your cloud environment, your network and any other layers in your software stack. Doing so is the only way to know whether a performance or availability issue lies in your application itself, or one of the external resources on which it depends.
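The following Python sketch illustrates how an application-wide signal can hide a problem in one microservice. The service names, the record shape (a `(service, ok)` pair per request) and both helper functions are hypothetical.

```python
from collections import defaultdict


def overall_error_rate(records):
    """Error rate across every request, regardless of service."""
    failures = sum(1 for _, ok in records if not ok)
    return failures / len(records)


def error_rates_by_service(records):
    """Error rate computed separately for each service."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for service, ok in records:
        totals[service] += 1
        if not ok:
            failures[service] += 1
    return {service: failures[service] / totals[service] for service in totals}


# Hypothetical traffic: a busy, healthy catalog service and a small,
# half-broken checkout service.
records = (
    [("catalog", True)] * 90
    + [("checkout", True)] * 5
    + [("checkout", False)] * 5
)
```

With this sample data, the application as a whole reports a 5 percent error rate, while the per-service view shows that half of all checkout requests are failing. The same grouping idea applies to the other three signals and to other layers of the stack.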
A second challenge when using the Golden Signals approach is that it’s not very helpful for identifying and troubleshooting outliers within your data. And, of course, it’s usually the outliers that are the first signs of trouble.
For example, tracking average latency for application requests is great if you want to know how long it takes your app to handle most transactions. It will also alert you to sudden spikes in latency that could reflect a significant issue that impacts many users.
But what average latency monitoring won’t do is help you identify a minority of users or request types that are subject to delays. That’s bad if you’re trying to achieve SLOs of 99 percent or greater. In that case, you need to know about the 1 percent of requests that are not going well.
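Here is a minimal Python sketch of the difference between an average and a tail percentile. The `percentile` helper uses the simple nearest-rank method, which is only one of several ways to define a percentile, and the sample latencies are made up for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 requests served in 100 ms, 2 served in 5,000 ms.
latencies_ms = [100] * 98 + [5000] * 2

average_ms = mean(latencies_ms)
median_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

For this sample, the average works out to 198 ms and the median to 100 ms, both of which look acceptable, while the 99th percentile is 5,000 ms. Tracking a high percentile alongside the average is one way to surface the small share of requests that are not going well.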
Perhaps the greatest shortcoming of the Golden Signals is that they don’t do anything to align technical outcomes with business outcomes, or help ensure that all stakeholders, technical and non-technical, can support reliability.
The Golden Signals are comprehensive from a technical standpoint. They cover all of the information you’d want to know about an application.
But they don’t correlate application performance with business performance. They won’t tell you how changes in application behavior correlate with increases in customer support requests, for example, or with fluctuations in the length of user sessions (a metric that serves as a proxy for user engagement and satisfaction).
In other words, in addition to using the Four Golden Signals for technical monitoring and observability, you should consider incorporating some business-centric signals into your data collection routines. It’s only by pairing business data with technical data that you gain real observability.
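One simple way to pair the two kinds of data is to check how closely a technical signal tracks a business signal over time. The Python sketch below computes a Pearson correlation coefficient by hand; the daily figures are invented for illustration, and a strong correlation is only a prompt for investigation, not proof that one caused the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical daily values: 99th-percentile latency and support tickets opened.
p99_latency_ms = [200, 250, 400, 800]
support_tickets = [10, 12, 18, 34]

correlation = pearson(p99_latency_ms, support_tickets)
```

A value near 1 means the two series rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship. The function divides by zero if either series is constant, so real data would need a guard for that case.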
Again, no one is saying the Golden Signals should go away. They’re a great method for shaping the contours of a modern observability strategy.
But if you’re an SRE considering using the Golden Signals, it’s worth educating yourself about what they don’t do so well. Prepare for the unexpected complexity of applying the Golden Signals to an actual microservices app. Make sure you look for outliers in addition to tracking averages. And contextualize the Golden Signals with business-oriented metrics so that you know how technical changes affect business outcomes.