July 9, 2024

7 mins

Round Robin escalation policies: do's and don'ts

Minimize alert fatigue by distributing incoming alerts evenly across responders with a Round Robin schedule. This strategy comes in two variations and can benefit some teams more than others.

Written by
Jorge Lainfiesta

The concept of Round Robin comes from sports. It has nothing to do with anyone called Robin, but rather with the French word ruban (ribbon). In a Round Robin tournament, all participants face each other by taking turns. When applied to on-call schedules, a Round Robin escalation policy means that responders assigned to a level will take turns responding to alerts.

When is this strategy useful, and when isn’t it? In this article, we’ll dig into the key aspects of Round Robin escalation policies, including the two types available and best practices to improve your responder team’s dynamics.

On-call schedules and escalation policies

You never know when something will go wrong with your website, app, or a provider you rely on. That’s why it’s crucial to have someone available at any time, even outside business hours or on a holiday, to get your service back to normal. This is where on-call schedules come into the reliability story: you distribute the responsibility for keeping everything running 24/7 across different shifts or rotations that you organize in a schedule.

However, it’s not enough to have a single person available to respond to alerts. What if that on-call person is out of reach for any reason? Maybe their phone ran out of battery, or they got into an accident. That’s when escalation policies come in: they define a hierarchy of layers so that no alert goes through the system without being acknowledged and handled adequately.

Escalation policies let you define a set of responders or teams in each layer, as well as define how you want them to be contacted. If a layer doesn’t acknowledge the alert within a certain timeframe, then the next layer of responders will be notified. The whole escalation policy can be repeated a few times if needed.
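
As a rough illustration of the layered behavior described above, the following Python sketch models a policy as a list of layers, each with its own acknowledgment timeout, and lists who would be notified if nobody ever acknowledged the alert. The class names, responder names, and timeouts are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    responders: list[str]   # who is notified when the alert reaches this layer
    ack_timeout_min: int    # minutes to wait for an acknowledgment


@dataclass
class EscalationPolicy:
    layers: list[Layer]
    repeats: int = 1        # how many times the whole policy runs


def notification_plan(policy: EscalationPolicy) -> list[tuple[int, list[str]]]:
    """Return (minutes since the alert fired, responders notified) pairs
    for an alert that is never acknowledged."""
    plan = []
    elapsed = 0
    for _ in range(policy.repeats):
        for layer in policy.layers:
            plan.append((elapsed, layer.responders))
            elapsed += layer.ack_timeout_min
    return plan


# Hypothetical two-layer policy that is repeated twice.
policy = EscalationPolicy(
    layers=[Layer(["alice"], 5), Layer(["bob", "carol"], 10)],
    repeats=2,
)
for minute, responders in notification_plan(policy):
    print(f"t+{minute} min: notify {', '.join(responders)}")
```

In a real system the walk through the layers would stop as soon as someone acknowledges; the sketch only shows the worst case.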

What is a Round Robin escalation policy?

It’s a quiet Saturday but you’re on-call. You get an alert at 7 pm that interrupts your dinner. You get another alert at 9 pm, interrupting wine time with your partner. At 11 pm, right before bed, another alert pops up. It’s getting annoying. When a new alert wakes you up at 3 am, will you throw the phone at the wall? This feeling is called alert fatigue, and it’s unfortunately common in people with on-call shifts.

A Round Robin escalation policy can help reduce alert fatigue in your team by distributing the alerts evenly among responders. In a Round Robin schedule, incoming alerts are not all given to a single responder. Instead, responders take turns handling incoming alerts in sequential order.
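
To make the turn-taking concrete, here is a minimal Python sketch that hands the four alerts from the Saturday scenario to two responders in sequential order. The responder names and alert descriptions are made up for the example, and the function is an illustration rather than how any specific tool implements it.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(alerts, responders):
    """Pair each alert with the next responder in a fixed, repeating order."""
    turn = cycle(responders)
    return [(alert, next(turn)) for alert in alerts]


alerts = ["19:00 disk full", "21:00 high latency", "23:00 5xx spike", "03:00 db failover"]
assignments = assign_round_robin(alerts, ["alice", "bob"])
for alert, responder in assignments:
    print(f"{alert} -> {responder}")

# Count how many alerts each responder received.
load = Counter(responder for _, responder in assignments)
print(dict(load))
```

With two responders, each one handles two of the four alerts instead of a single person taking all of them.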

Traditionally, implementing a Round Robin escalation policy required a lot of manual work, especially from the on-call manager. The manager would set up a spreadsheet where people could fill in when they responded to an alert and resolved an incident. This manual tracking determined whose turn it was to respond to the next alert.

Screenshot: an example of an escalation policy

However, modern on-call solutions like Rootly On-Call automate this work for you. With a single click, you can make a level in your escalation policy behave like a Round Robin cycle.

Types of Round Robin escalation policies

In a Round Robin escalation policy, responders take turns to address incoming alerts. But what happens when the responder in charge of the next alert doesn’t acknowledge it in time? This is where Round Robin can vary. In an alert-based Round Robin, an alert that is not acknowledged in time jumps to the next escalation level. In a cycle-based Round Robin, instead of jumping to the next level, the alert goes to the next responder in line on the same level.

Alert-based Round Robin

Diagram: how alerts are distributed with an alert-based Round Robin

Alert-based Round Robin escalation policies are the most popular option and are supported even by legacy on-call solutions like PagerDuty. In this type, each incoming alert is assigned to a different responder in turn. If the assigned responder fails to acknowledge the alert they received, the escalation policy notifies the next level.

The advantage of this type is that responders in the same level share alerts and responsibility more or less evenly.
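
The alert-based behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function name, the responder names, and the idea of passing in the set of people who would acknowledge in time are all hypothetical, chosen only to keep the example small.

```python
def route_alert_based(alert_index, level, next_level, acknowledges):
    """Return the responders notified, in order, for one alert.

    alert_index  -- position of the alert in the incoming stream (0, 1, 2, ...)
    level        -- responders sharing the Round Robin level
    next_level   -- responders on the following escalation level
    acknowledges -- set of responders who would acknowledge in time
    """
    assignee = level[alert_index % len(level)]
    notified = [assignee]
    if assignee not in acknowledges:
        # Unacknowledged: skip the rest of this level and escalate.
        notified.extend(next_level)
    return notified


level = ["alice", "bob", "carol"]
next_level = ["dana"]
acknowledges = {"alice", "carol", "dana"}  # bob misses his page

for i in range(3):
    print(f"alert {i}: {route_alert_based(i, level, next_level, acknowledges)}")
```

Here the second alert goes to bob and, because he does not acknowledge it, straight to dana on the next level; alice and carol are not paged for it.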

Cycle-based Round Robin

Diagram: how alerts are distributed with a cycle-based Round Robin

In a cycle-based Round Robin, an incoming alert is assigned to responders in turns. But if the responder who received the alert doesn’t acknowledge it within a specified timeframe, it is passed to the next responder in the Round Robin level. The alert will only escalate to the next level if it goes unacknowledged throughout the entire cycle.

The advantage of this type is that the next level in the escalation policy is less burdened as they don’t have to handle every alert that slips through the previous stage.
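
A comparable Python sketch for the cycle-based variant is shown below. As before, it is a simplified model: the function name, responder names, and the set of people who would acknowledge in time are hypothetical and exist only for illustration.

```python
def route_cycle_based(alert_index, level, next_level, acknowledges):
    """Return the responders notified, in order, for one alert.

    The alert starts with the responder whose turn it is and moves along
    the same level until someone acknowledges it. It reaches next_level
    only after the whole level has been tried.
    """
    start = alert_index % len(level)
    notified = []
    for offset in range(len(level)):
        responder = level[(start + offset) % len(level)]
        notified.append(responder)
        if responder in acknowledges:
            return notified
    # The entire cycle went unacknowledged: escalate.
    notified.extend(next_level)
    return notified


level = ["alice", "bob", "carol"]
next_level = ["dana"]

# bob misses his page, carol picks it up on the same level.
print(route_cycle_based(1, level, next_level, {"alice", "carol"}))
# Nobody on the level acknowledges, so the alert finally escalates.
print(route_cycle_based(1, level, next_level, set()))
```

In the first call the next level is never paged, which is the reduced burden on later levels described above.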

On-call teams who benefit from Round Robin

Most teams can benefit from using a Round Robin escalation policy if they have multiple responders available on call. However, this strategy can be a bigger win for teams that expect a larger number of alerts.

When a team consistently receives alerts across time zones, weekends, and holidays, having a single on-call responder take care of every incident that pops up can be challenging. Not only is it overwhelming for the person on-call, but it also affects their ability to respond effectively to alerts throughout their shift. In the longer term, it causes burnout among responders.

By organizing shifts with multiple responders on-call, you can distribute the alert burden among them to minimize the risk of alert fatigue and improve your overall MTTR.

Best practices for a successful Round Robin schedule

Distributing on-call work through Round Robin escalation policies has several advantages, but it also requires adjusting it to your team’s dynamic and culture.

  • Do not have everyone on rotation on the weekend: this holds even if you have a lot of people lined up in the Round Robin level. Being on-call is mentally taxing, even when the likelihood of getting paged at the end of the queue is low. This also applies to shifts outside of business hours.
  • Take OOO people out of the rotation: it’s true that an OOO responder who gets an alert can ignore it without much fuss, knowing that somebody else will pick it up. But getting a priority notification while you’re relaxing at the beach is not ideal and should be avoided.
  • Never stop training: being on-call is a specific skill that you want to nurture in your team. By having more people who can act as on-call responders you effectively improve your organization’s reliability.
  • Check in with your team: being part of a Round Robin escalation policy can feel better for some team members than others. See how comfortable each team member is and keep iterating on the schedule strategy.
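
If you maintain the rotation yourself, the out-of-office practice above could be approximated with a small filter like the following Python sketch. The data shape (a mapping from responder to inclusive date ranges) and all names and dates are assumptions made for the example.

```python
from datetime import date


def active_rotation(responders, out_of_office, today):
    """Return the responders who are not out of office on `today`.

    out_of_office maps a responder to a list of (first_day, last_day)
    ranges, both inclusive. Responders without an entry are always active.
    """
    def is_away(name):
        return any(start <= today <= end for start, end in out_of_office.get(name, []))

    return [name for name in responders if not is_away(name)]


responders = ["alice", "bob", "carol"]
out_of_office = {"bob": [(date(2024, 7, 8), date(2024, 7, 12))]}

# bob is away this week, so only alice and carol take turns.
print(active_rotation(responders, out_of_office, date(2024, 7, 9)))
```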

Conclusion

Using Round Robin escalation policies can help you mitigate alert fatigue in your team and improve how quickly incidents are resolved. Consider whether an alert-based or a cycle-based Round Robin is the better fit for the structure of your on-call team. Rootly On-Call comes with both strategies built in, so you can enable them with a click on any of your escalation layers. Feel free to schedule a demo if you want to take a closer look.