Six tips on how Site Reliability Engineers (SREs) can prepare for the reliability challenges of Black Friday and Cyber Monday 2021


Being an SRE is a tough (if rewarding) job on any day of the year. But it’s especially challenging on Black Friday and Cyber Monday, the post-Thanksgiving shopping events that rank among the biggest online shopping days of the year. For brevity, this guide refers to both as Cyber Monday.

In 2021, Cyber Monday promises to bring not just the standard challenges associated with massive traffic spikes, but also a rise in cyberattacks, which the FBI expects to become more frequent this holiday season. And although security may not be SREs’ main job, they’ll be expected to help security and DevSecOps teams confront the reliability threats that attackers pose.

So, those are the problems SREs face heading into Cyber Monday 2021 and beyond. For solutions, let’s take a look at six best practices for managing the reliability and security challenges that loom this holiday season.

Assess your Cyber Monday stress

The first step in preparing for Cyber Monday is evaluating how much stress the holiday is likely to place on the systems you support.

Obviously, online retailers -- especially those that sell tech products -- will face the biggest increase in load during the Cyber Monday shopping surge.

Businesses that don’t sell things online are less likely to see a major peak in traffic. That said, the fact that Cyber Monday brings more people online in general can increase demand across the board. Don’t assume, then, that Cyber Monday will be business as usual for you just because your business isn’t a retailer.
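To put a rough number on that stress, a back-of-the-envelope capacity estimate can help. The sketch below is only an illustration: the function name, the surge multiplier, the per-instance throughput and the 30% headroom are all hypothetical planning inputs that you would replace with figures from your own monitoring and load tests.

```python
import math


def instances_needed(baseline_rps: float, surge_multiplier: float,
                     rps_per_instance: float, headroom: float = 0.3) -> int:
    """Estimate how many instances are needed to serve a projected peak.

    All arguments are hypothetical planning inputs, not measurements:
    baseline_rps      -- typical requests per second on a normal day
    surge_multiplier  -- how many times larger you expect the peak to be
    rps_per_instance  -- sustained requests per second one instance handles
    headroom          -- extra fraction of capacity kept in reserve
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    peak_rps = baseline_rps * surge_multiplier
    # Reserve spare capacity so that losing an instance does not cause overload.
    required = peak_rps * (1 + headroom) / rps_per_instance
    return math.ceil(required)


if __name__ == "__main__":
    # 400 rps baseline, 5x surge, 250 rps per instance, 30% headroom
    print(instances_needed(baseline_rps=400, surge_multiplier=5, rps_per_instance=250))
```

If the result is well above what you run today, that is a signal to revisit scaling limits and quotas before the event rather than during it.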

Configure auto-scaling -- including for Kubernetes

You probably already know that you can configure auto-scaling for cloud VMs in order to help them accommodate a spike in demand.

But did you know that you may also be able to auto-scale Kubernetes clusters? Kubernetes auto-scaling features can automatically add nodes to your clusters as demand grows, so that they can support a higher load.

Not all Kubernetes distributions support auto-scaling, but most of the cloud-based managed Kubernetes distros do. (You can find details about what each major Kubernetes service supports in our blog on choosing a Kubernetes distro for SREs.) If you use Kubernetes, and auto-scaling is available, be sure to take advantage of it to help manage the Cyber Monday surge.
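Node-level scaling is usually paired with pod-level scaling. For intuition about how a scaling decision is made, here is a minimal sketch of the replica calculation that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name and the default bounds are illustrative assumptions, and the sketch deliberately ignores the tolerance and stabilization behavior that a real autoscaler applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified replica calculation in the style of the Horizontal Pod Autoscaler.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas]. The bounds here are
    illustrative defaults, not recommendations.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never scale below the floor or above the ceiling you configured.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target
    print(desired_replicas(4, current_metric=90, target_metric=60, max_replicas=20))
```

The clamp matters for Cyber Monday planning: if the maximum is set too low, the autoscaler stops adding capacity long before the traffic peaks.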

Distribute your distributed infrastructure even more 

Adding more redundancy to your infrastructure is another relatively simple and highly effective way to prepare for Cyber Monday load increases. It may also help protect against DDoS attacks, should they target your environment (or your cloud host) during the event.

So, consider adding another availability zone or region to your cloud deployment, if you have the time to configure it before Cyber Monday arrives. You could also copy VM or container images to another region (or an entirely different cloud) so that they’re on hand in case you need to spin them up in response to a failure in another region or cloud.
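The failover idea above can be sketched as a simple preference list: try regions in order and use the first one that is currently healthy. This is only an illustration of the decision logic. The function name and the region names are hypothetical, and in practice the health data would come from your own health checks or your load balancer.

```python
from typing import Mapping, Optional, Sequence


def pick_region(preferred: Sequence[str],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy region in preference order, or None.

    Regions missing from the health map are treated as unhealthy,
    so an unknown region is never selected by accident.
    """
    for region in preferred:
        if healthy.get(region, False):
            return region
    return None


if __name__ == "__main__":
    # Hypothetical region names and health states for illustration only.
    order = ["region-a", "region-b", "region-c"]
    status = {"region-a": False, "region-b": True, "region-c": True}
    print(pick_region(order, status))
```

Returning None when nothing is healthy forces the caller to handle the worst case explicitly, which is the scenario an incident playbook should cover.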

Double down on your incident response plan

It’s a pretty safe bet that something will go wrong with your applications or infrastructure over the course of Cyber Monday. But how wrong it goes depends on how quickly and effectively you can respond.

Toward that end, now is the time to verify that you have incident response playbooks in place. Equally important is ensuring you have an incident response platform that helps you manage incidents quickly and efficiently.

Verify your backups

Along similar lines, the lead-up to Cyber Monday is an excellent time to make sure you’re systematically backing up critical data.

Remember, too, to make sure that your backup data is stored in a location where it will remain intact if production systems fail -- or, worse, if they are hacked. To maximize the chances of speedy recovery in the event of the latter scenario, consider “air gapping” your backup data, which means disconnecting it from the network so that remote attackers can’t touch it.
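One basic verification step is confirming that a backup copy is byte-for-byte identical to its source. The sketch below compares SHA-256 checksums using Python's standard library. The function names are illustrative, and a matching checksum only shows that the copy is intact, not that a full restore will succeed, so it complements a test restore rather than replacing one.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for very large backups.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source_path: str, backup_path: str) -> bool:
    """True if the backup file has the same contents as the source file."""
    return sha256_of(source_path) == sha256_of(backup_path)
```

For air-gapped backups, you could record the checksum at backup time and compare against that stored value later, since the source and the backup will not be reachable from the same machine.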

Engineer some chaos

Chaos engineering can help you detect unanticipated reliability issues at any time of year. But it’s especially helpful in preparing for Cyber Monday, when your sites will likely be under higher stress than usual.

So, if you haven’t performed any chaos engineering lately, grab yourself a chaos engineering tool and kick the tires of your system before actual users do.
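To give a feel for what fault injection looks like at its smallest scale, here is a toy sketch that wraps a function so it fails at a configurable rate. It is not a substitute for a real chaos engineering tool, and every name in it (InjectedFault, with_chaos, fetch_price) is hypothetical. It only illustrates the principle of deliberately introducing failures to see how the calling code copes.

```python
import random
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real call to simulate a dependency failure."""


def with_chaos(func: Callable[..., Any], failure_rate: float,
               rng: Optional[random.Random] = None) -> Callable[..., Any]:
    """Wrap func so that roughly failure_rate of calls raise InjectedFault.

    failure_rate is a probability between 0.0 (never fail) and 1.0
    (always fail). Passing a seeded random.Random makes runs repeatable.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # random() returns a value in [0.0, 1.0), so 0.0 never fails
        # and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(sku: str) -> float:
    """Hypothetical stand-in for a call to a downstream service."""
    return 19.99


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_price, failure_rate=0.2, rng=random.Random(42))
    failures = 0
    for _ in range(100):
        try:
            flaky_fetch("sku-123")
        except InjectedFault:
            failures += 1
    print(f"{failures} of 100 calls failed")
```

Run experiments like this in a test or staging environment first, and check whether retries, timeouts and fallbacks behave the way your playbooks assume.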

Conclusion

Cyber Monday may create special reliability stresses. But it doesn’t have to lead to reliability failures. With the right plans and tools in place, SREs can prepare their environments to handle whatever Cyber Monday brings -- from traffic surges to DDoS attacks and beyond.