December 17, 2021
4 min read
SREs face special challenges during the holidays. Here’s how to manage them.
It’s almost a cliché to say you have a love/hate relationship with the holiday season.
But if you’re an SRE, you may have better reasons than the rest of humanity to hate the end-of-year holidays in particular. Indeed, even if you like the holidays themselves, the fact that reliability issues tend to peak during the holiday period is reason enough for SREs to be less than perfectly cheerful at this time of year.
To help make the season a bit less stressful for SREs tasked with keeping critical systems running during times of high demand, here’s a list of what SREs should be thinking about during the holidays, and tips on getting through the season with your SLIs intact.
Let’s start by discussing the two main reasons why some SREs face particular challenges during the holiday season: heavier site load and higher stakes for preventing issues.
It’s no secret that demand for the applications and systems that SREs oversee tends to peak around the holidays.
And we’re not talking here just about sites or apps related to shopping, although those do see peak activity toward the end of the year. A variety of other systems also tend to come under heavy load around the holidays. Internal LOB apps may see a spike in usage as departments close out their books or prepare end-of-year reports. Sites associated with leisure activity may experience high traffic levels when more people take time off from work. Social media apps need to contend with a surge of users uploading photos of their holiday festivities. And so on.
This is a problem for SREs, of course, because with more load comes a higher risk that something will go wrong. An app that works fine when it receives 100 requests per minute might collapse if it’s hit with 1,000, for instance.
At the same time, the fallout from reliability problems may be more severe during the holidays. More traffic means more users who will be disappointed – and who may flock to competitors – if something goes wrong with a site you manage.
Likewise, the fact that IT teams tend to be stretched thin during the holidays, as engineers take vacations, makes it even more important to prevent reliability issues. There will be fewer people around to fix things if they break, so you need to work extra hard to nip problems in the bud.
Those are the challenges SREs face toward the end of the year. Now, let’s look at strategies for addressing them.
Load testing, which evaluates how your apps perform under heavy demand, is valuable at any time of year. Ideally, it will be baked into your regular QA process.
But if it’s not – or even if it is – SREs should consider performing another round of load tests on critical applications ahead of the holiday surge. A fresh round of tests may reveal problems that have crept in since the last one and that could cause applications to crash or slow down under heavy demand.
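To make the idea concrete, here is a minimal load-test sketch in Python. It is an illustration under stated assumptions, not a replacement for a dedicated load-testing tool: `run_load_test` and `fake_handler` are hypothetical names, and `fake_handler` stands in for whatever call actually exercises your service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def run_load_test(handler, total_requests, concurrency):
    """Call `handler` total_requests times using `concurrency` worker threads.

    Returns the error count plus p50 and p95 latency in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        try:
            handler()
            ok = True
        except Exception:
            ok = False  # count any exception as a failed request
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [latency for latency, _ in results]
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "requests": total_requests,
        "errors": sum(1 for _, ok in results if not ok),
        "p50": cuts[49],
        "p95": cuts[94],
    }


def fake_handler():
    # Hypothetical stand-in for a real request to the service under test.
    time.sleep(0.001)


if __name__ == "__main__":
    # Compare normal load with a holiday-style surge of ten times the requests.
    print(run_load_test(fake_handler, total_requests=100, concurrency=5))
    print(run_load_test(fake_handler, total_requests=1000, concurrency=50))
```

Comparing the error count and the p95 latency between the two runs gives a rough sense of how much headroom the application has before the surge arrives.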
Along similar lines, now is a smart time to do some chaos engineering. Chaos engineering means deliberately injecting failures or stress into your systems to see how well they hold up, especially under conditions like those that may materialize during the holidays.
That said, be sure to be smart about when you perform your chaos engineering. It’s not a great idea to kick your systems’ tires when they are under heavy load due to requests from real-world users. Do your experimentation during off hours.
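One way to enforce the off-hours rule is to build it into the experiment itself. The sketch below is a hypothetical Python wrapper, not the API of any chaos engineering tool: `with_chaos`, `InjectedFault`, and the 01:00 to 05:00 window are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, time as clock_time

# Assumption: traffic is lowest between 01:00 and 05:00 local time.
OFF_HOURS_START = clock_time(1, 0)
OFF_HOURS_END = clock_time(5, 0)


class InjectedFault(Exception):
    """Raised in place of a real call when the experiment injects a failure."""


def in_off_hours(now):
    """Return True if `now` falls inside the off-hours window."""
    return OFF_HOURS_START <= now.time() < OFF_HOURS_END


def with_chaos(func, failure_rate, now_fn=datetime.now, rng=random.random):
    """Wrap `func` so that it fails randomly, but only during off hours."""
    def wrapper(*args, **kwargs):
        if in_off_hours(now_fn()) and rng() < failure_rate:
            raise InjectedFault("chaos experiment: simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_inventory():
    # Hypothetical dependency call that the experiment targets.
    return {"sku-123": 7}


# Fail roughly 20% of calls, and only while the off-hours window is open.
chaotic_fetch = with_chaos(fetch_inventory, failure_rate=0.2)
```

Outside the window the wrapper passes every call through untouched, so real users at peak times never see an injected failure.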
Times of high load and high stakes are not good times for introducing major changes into your applications or systems. If you can avoid it, hold off on deploying new updates until the holidays have passed, so that you can stick with what you already know to work.
Make exceptions for critical patches, of course. But otherwise, pause deployments through your CI/CD pipeline until the holidays are behind you.
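A freeze like this can be expressed as a simple gate that a pipeline step consults before deploying. The following Python sketch assumes hypothetical freeze dates and a hypothetical `deploy_allowed` function; how you flag a change as a critical patch depends on your own tooling.

```python
from datetime import date

# Hypothetical freeze window; adjust to your own holiday calendar.
FREEZE_START = date(2021, 12, 20)
FREEZE_END = date(2022, 1, 3)


def deploy_allowed(today, is_critical_patch,
                   freeze_start=FREEZE_START, freeze_end=FREEZE_END):
    """Return True if a deployment may proceed on `today`.

    Deployments are blocked inside the freeze window (both ends inclusive)
    unless the change is flagged as a critical patch.
    """
    in_freeze = freeze_start <= today <= freeze_end
    return is_critical_patch or not in_freeze


if __name__ == "__main__":
    print(deploy_allowed(date(2021, 12, 24), is_critical_patch=False))  # False
    print(deploy_allowed(date(2021, 12, 24), is_critical_patch=True))   # True
    print(deploy_allowed(date(2022, 1, 10), is_critical_patch=False))   # True
```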
When you’re working with a bare-bones team – as you may be during the holidays, when some of your colleagues are on PTO – you’ll need to work extra hard to make sure that incident response is as efficient and automated as possible. The last thing you need is to try to figure out who’s available and who’s not when delegating tasks following a major incident.
That’s why it’s worth investing in incident response automation platforms. Incident response automation is valuable at any time of year, but it’s especially critical during the holidays.
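As a small illustration of what such automation does behind the scenes, the Python sketch below picks the first engineer in an escalation order who is not on PTO. The `Engineer` type, the names, and the dates are hypothetical; a real incident response platform would draw this data from its own scheduling system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Engineer:
    name: str
    pto: tuple = ()  # (start, end) date pairs, both ends inclusive


def is_available(engineer, day):
    """Return True if `day` does not fall inside any of the engineer's PTO ranges."""
    return not any(start <= day <= end for start, end in engineer.pto)


def pick_responder(escalation_order, day):
    """Return the first available engineer in escalation order, or None."""
    for engineer in escalation_order:
        if is_available(engineer, day):
            return engineer
    return None


# Hypothetical rota for illustration only.
ROTA = [
    Engineer("Ana", pto=((date(2021, 12, 23), date(2021, 12, 27)),)),
    Engineer("Bo", pto=((date(2021, 12, 24), date(2021, 12, 25)),)),
    Engineer("Chen"),
]

if __name__ == "__main__":
    responder = pick_responder(ROTA, date(2021, 12, 24))
    print(responder.name if responder else "no one available")  # Chen
```

Resolving availability ahead of time means nobody has to work out who is around in the middle of a major incident.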
If, despite your best efforts, systems do fail during the holidays, you can mitigate the impact more effectively if you know which systems are most critical.
To gain that insight, look at your SLAs and SLOs, and revisit error budgets, to determine which services to prioritize during an outage.
You should also take the broader business context into account when setting priorities. Some applications and services may be more or less important than others during the holidays, even though that difference is probably not reflected in your SLAs.
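The sketch below shows one possible way to combine those two signals in Python: the share of error budget each service has left, and a business weight that reflects how much the service matters during the holidays. The `Service` fields, the `holiday_weight` values, and the ordering rule are all assumptions for illustration, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests seen so far in the SLO window
    failed_requests: int   # failed requests seen so far in the SLO window
    holiday_weight: float = 1.0  # hypothetical business weight; higher matters more


def error_budget_remaining(service):
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - service.slo_target) * service.total_requests
    if allowed_failures == 0:
        return 0.0 if service.failed_requests else 1.0
    return 1 - service.failed_requests / allowed_failures


def prioritize(services):
    """Order services by business weight first, then by least remaining budget."""
    return sorted(
        services,
        key=lambda s: (-s.holiday_weight, error_budget_remaining(s)),
    )


# Hypothetical services and numbers for illustration only.
SERVICES = [
    Service("blog", 0.99, 10_000, 10),
    Service("reports", 0.99, 10_000, 90),
    Service("checkout", 0.999, 1_000_000, 500, holiday_weight=3.0),
]

if __name__ == "__main__":
    for service in prioritize(SERVICES):
        print(service.name, round(error_budget_remaining(service), 2))
```

Here the checkout service comes first because of its business weight, even though the reports service has burned more of its budget. That is exactly the kind of distinction an SLA alone would not capture.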
The holidays are bound to be at least a little stressful, especially if you’re an SRE tasked with keeping critical systems running. But you can minimize the stress by validating the reliability of your critical systems before the holidays put them to the test. Just as important, be strategic about what to prioritize and what to let slide from a reliability perspective during this time of year.