The Unique Reliability Engineering Requirements of Microservices

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

Written by JJ Tang

Most of the reliability engineering concepts that SREs learn can be applied to any type of application architecture or environment. That doesn’t mean, however, that reliability engineering methodologies should be app-agnostic. On the contrary, SREs should tailor their approach to the type of application they are supporting.

To prove the point, let’s discuss how managing reliability for a microservices-based app is different from working with a monolith.

SRE fundamentals

Before jumping into the unique reliability challenges of microservices, it’s worth noting what doesn’t change about SRE work, regardless of the type of app you’re dealing with.

The fundamental principles that guide SREs are the same in almost any environment. For example, SLOs are important when managing virtually any service or application. So is the automation of SRE responsibilities and the use of techniques like severity levels to help manage incident response.

In this respect, the SRE role is different from many other types of technical roles. Developers tend to specialize in certain programming languages or architectural components (like frontends or backends). IT engineers may tailor their methodologies to the type of OS or cloud environment they have to support (the metrics that an IT operations team cares about when dealing with a Windows-based environment are probably different from those that matter in Kubernetes, for example). Security analysts may approach their work differently depending on the type of industry their business operates in because risks tend to vary between sectors, as do compliance rules.

But with SREs, fundamental concepts tend to be consistent across any type of environment. No one says “I’m a Windows SRE” or “I do SRE for mobile apps.” If you’re an SRE today, you’re expected to be able to do it all.

SRE for microservices: What’s different

But again, that doesn’t mean that SREs can take the same approach to reliability engineering for any type of technology or architecture.

Case in point: Microservices applications. When you’re managing reliability for microservices, you face special challenges that don’t apply in the context of monoliths:

Complex metrics: With a microservices app, metrics that you collect from the application as a whole -- like overall response time or latency -- are less meaningful, because it’s not always clear from the surface how these metrics translate to the state of individual microservices, or which individual microservice is causing a problem.
Disparate data sources: Microservices apps tend to store logs and expose metrics in multiple ways, which makes it harder to collect and aggregate all of the data. You may have to collect individual logs from each microservice, for example, instead of having just one log file to work with.
Constant updates: Individual microservices may be updated on a frequent basis, making it more complicated to keep track of the state of the overall application.
More layers: Microservices apps are typically deployed with the help of containers and orchestration tools. These tools add more layers to the stack -- and more layers mean more reliability risks for SREs to manage. In contrast, with monoliths, you typically have to deal just with the app itself and possibly a virtual machine; there are often no containers or Kubernetes clusters in the mix.
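To make the "disparate data sources" challenge concrete, here is a minimal sketch of stitching per-service logs into a single timeline. It assumes each service writes lines of the form "ISO-timestamp message" and that each log is already sorted by time. The service names, log lines and the `merge_logs` helper are hypothetical, chosen only for illustration; a real log pipeline would be more involved.

```python
import heapq
from datetime import datetime


def parse_line(service, line):
    """Split 'ISO-timestamp message' into (timestamp, service, message)."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), service, message


def merge_logs(logs_by_service):
    """Merge per-service logs (each already time-sorted) into one timeline."""
    streams = [
        [parse_line(service, line) for line in lines]
        for service, lines in logs_by_service.items()
    ]
    # heapq.merge interleaves the sorted streams by timestamp.
    return list(heapq.merge(*streams))


# Hypothetical logs from two microservices (assumed format, not real output).
logs = {
    "checkout": [
        "2024-01-01T10:00:01 payment request sent",
        "2024-01-01T10:00:05 payment timed out",
    ],
    "payments": [
        "2024-01-01T10:00:02 upstream bank slow",
        "2024-01-01T10:00:04 retrying",
    ],
}

timeline = merge_logs(logs)
for stamp, service, message in timeline:
    print(stamp.isoformat(), service, message)
```

Read in isolation, the checkout log only shows a timeout. Interleaved with the payments log, the same events suggest where the delay started.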

At the same time, microservices also require a special approach because they offer some inherent reliability advantages that monoliths lack. Above all, microservices apps are less prone to single points of failure. Even if one microservice fails or becomes slow to respond, the app as a whole may continue to function. In addition, it’s usually easier to fix and redeploy an individual microservice than it is an entire monolith.

Best practices for microservices reliability management

Given the special traits of microservices, SREs should adjust their approach to microservices reliability in a few key ways.

For one, application-level metrics are arguably less important than they would be in other contexts. Instead of fixating on overall application request rates, error rates and duration, SREs should track those same metrics at the level of individual microservices. Of course, you’ll still want to make sure the application as a whole performs adequately, but it’s hard to fix performance issues if you lack visibility into the individual microservices that cause them.
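As a rough illustration of why per-service metrics matter, the sketch below groups request records by microservice and computes a request count, error rate and median duration for each. The record format `(service, duration_ms, ok)`, the sample data and the `per_service_metrics` helper are assumptions made for this example, not the output of any particular monitoring tool.

```python
from collections import defaultdict
from statistics import median


def per_service_metrics(requests):
    """Summarize (service, duration_ms, ok) records per microservice."""
    grouped = defaultdict(list)
    for service, duration_ms, ok in requests:
        grouped[service].append((duration_ms, ok))

    report = {}
    for service, samples in grouped.items():
        durations = [duration for duration, _ in samples]
        errors = sum(1 for _, ok in samples if not ok)
        report[service] = {
            "requests": len(samples),
            "error_rate": errors / len(samples),
            "median_ms": median(durations),
        }
    return report


# Hypothetical request records: (service, duration in ms, succeeded?).
requests = [
    ("catalog", 40, True),
    ("catalog", 60, True),
    ("payments", 900, False),
    ("payments", 1100, True),
    ("payments", 1000, False),
]

report = per_service_metrics(requests)
overall_error_rate = sum(1 for _, _, ok in requests if not ok) / len(requests)
print("overall error rate:", overall_error_rate)
for service, stats in report.items():
    print(service, stats)
```

In this made-up data, the application-wide error rate says that something is wrong, but only the per-service breakdown shows that every failure comes from the payments service.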

Likewise, when setting SLOs for a microservices app, it often makes sense to establish SLOs on the basis of individual microservices -- or at least factor in your microservices architecture when devising SLOs. Identify which microservices within your app are the least reliable, because they limit what the application as a whole can realistically promise, and set SLOs with them in mind.
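One way to reason about this is to estimate what a chain of microservices can deliver end to end. The sketch below assumes a request passes through every listed service in series and that the services fail independently, in which case the combined availability is the product of the individual ones. Both assumptions are simplifications, and the service names and targets are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every service to succeed.

    Assumes independent failures and a strictly serial call chain.
    """
    return prod(availabilities)


def weakest_service(slos):
    """Return the name of the service with the lowest availability target."""
    return min(slos, key=slos.get)


# Hypothetical per-service availability targets.
slos = {"frontend": 0.999, "catalog": 0.999, "payments": 0.995}

combined = serial_availability(slos.values())
print(f"end-to-end availability: {combined:.5f}")
print("weakest service:", weakest_service(slos))
```

Under these assumptions the end-to-end figure comes out lower than any single service's target, which is why an application-level SLO that ignores the least reliable microservice is likely to be missed.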

SREs must also take a more nuanced approach to monitoring and observing the host environment when working with microservices apps. With a monolith, you can usually get away with monitoring metrics and logs from just the application and the host server’s OS. But with microservices running on an orchestrator such as Kubernetes, you need to track Kubernetes logs and metrics, as well as the OS-level metrics from each node in your cluster. And you have to correlate all of this with performance data from each microservice so that you can determine whether the root cause of an issue lies in the microservice, a Kubernetes service, a node or somewhere else.
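The correlation step can be sketched very simply: gather the signals from every layer, keep those that fall inside the incident window, and order them by time. The `Signal` record, the layer names and the sample events below are hypothetical, and the code assumes all timestamps share a common clock. The earliest signal is only a hint about where to start looking, not proof of a root cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    time_s: float  # seconds since a common reference point
    layer: str     # "microservice", "kubernetes" or "node"
    source: str    # which component reported the signal
    detail: str    # what was observed


def signals_in_window(signals, start_s, end_s):
    """Return the signals inside the incident window, oldest first."""
    in_window = (s for s in signals if start_s <= s.time_s <= end_s)
    return sorted(in_window, key=lambda s: s.time_s)


def first_layer_to_degrade(signals, start_s, end_s):
    """Return the layer of the earliest signal in the window, or None."""
    window = signals_in_window(signals, start_s, end_s)
    return window[0].layer if window else None


# Hypothetical signals collected from three layers of the stack.
signals = [
    Signal(130, "microservice", "checkout", "latency increased"),
    Signal(100, "node", "node-2", "memory pressure"),
    Signal(115, "kubernetes", "checkout pod", "pod evicted"),
    Signal(500, "microservice", "catalog", "error spike"),
]

for signal in signals_in_window(signals, 90, 200):
    print(signal.time_s, signal.layer, signal.source, signal.detail)
print("look first at:", first_layer_to_degrade(signals, 90, 200))
```

In this invented example the node reported trouble before Kubernetes evicted the pod and before the microservice slowed down, so the node is a reasonable first place to investigate.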

A final difference between reliability engineering for monoliths and microservices, perhaps, is that with microservices, SREs can get away with taking more risks in production, because it is easier to redeploy (or roll back) a single microservice than an entire monolith. That doesn’t mean that pre-deployment testing isn’t necessary when you’re working with microservices, of course. But in general, microservices make it easier to accept higher levels of risk than you could when dealing with more cumbersome monoliths.

Conclusion: Consistent principles and unique practices

In short, although the fundamental principles and concepts that undergird reliability engineering are the same in any context, SREs should adapt practices to the special requirements of whichever type of environment they are supporting. There are crucial differences between a monolith and a microservices app, and those differences should be reflected in the way SREs approach each type of environment.