The Unique Reliability Engineering Requirements of Microservices
Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.
March 4, 2022
6 min read
Does it always make sense to stick to your playbooks? There’s no clear answer, but it’s still something you should think about.
When are you smarter than your playbooks, and when are your playbooks smarter than you?
That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested, and that we should stick to them at all costs.
But, on some level, there is also an instinctual – perhaps even arrogant – tendency to assume that ultimately, playbooks are not a substitute for human knowledge or expertise. Sometimes, the best way to resolve a problem is to deviate from the playbook.
That’s a lesson that came to my mind recently when I listened to this Changelog podcast. At a certain point in the conversation, Nora Jones, Adam Stacoviak and Jerod Santo brought up the experience of Chesley 'Sully' Sullenberger, the pilot who famously landed US Airways Flight 1549 in the Hudson River in 2009 after a collision with a flock of birds rendered the plane’s engines inoperable – a feat that made him the hero of the 2016 film Sully, where Tom Hanks played the title character.
What’s interesting about Sully’s story is that he didn’t do exactly what pilots (or engineers) are trained to do. He didn’t stick completely to the playbook that a pilot is supposed to follow during engine failure, which stipulates that the plane should land at the nearest airport. Instead, he made a decision to crash-land in the Hudson River.
The fact that Sully did this without any loss of human life turned him into a hero. In fact, Sully the movie almost villainizes the National Transportation Safety Board (NTSB) for what the film presents as an unfair investigation of Sully for not sticking to the playbook. (The actual story is tamer; NTSB investigators weren’t nearly as harsh as the film makes them out to be, but that’s sort of beside the point.)
Yet, as the podcasters noted, the difference between heroism and villainy for Sully may just have boiled down to luck. They pointed out that in similar incidents – like the Costa Concordia sinking in 2012 – staff who deviated from playbooks ended up facing stiff penalties. In the Costa Concordia case, the captain of the boat was placed in jail – despite the fact that his decision not to stick rigidly to the playbook most likely reduced the total loss of human life.
Where Sully perhaps just got lucky, then, is that things went very well during his crash landing. If there had been even just a few casualties – especially if they resulted from the plane’s landing in a body of water – it’s easy to imagine Sully having been vilified for not having stuck to the rules that told him to land at an airport instead of ditching in the river.
Yes, the NTSB found in simulations that Sully would have had only about a 50 percent chance of reaching an airport successfully if he had chosen that route. Still, it’s likely that if the landing in the river had not gone smoothly, investigators would have concluded that Sully should have taken his chances on finding an airport, where passengers would at least have faced much lower risks upon landing.
The takeaway from all of the above is what you might call the Sully conundrum. On the one hand, Sully’s story is an example of why breaking the playbook rules is arguably the right thing to do in certain situations. On the other, it could be interpreted as a lesson that the playbook is really the way to go, and Sully just got lucky. There’s really no way to prove what’s right or wrong.
Fortunately, few, if any, Site Reliability Engineers have to make life-or-death decisions, or face jail time for making the wrong choices. But they are often tasked with solving complex problems that existing playbooks may or may not be well designed to handle.
In those cases, SREs have to step back and make the same choice Sully did: Do they stick to the playbooks? Or do they use their intuition and modify their response?
If they choose the playbooks, at least they can’t be accused later on of having mucked things up due to their arrogance. But by not adhering rigidly to the playbooks, perhaps they will solve the problem more effectively than the playbook would, and be hailed as heroes.
It’s hard to spell out a rigid set of criteria for determining how to decide which path to take. But if you did, it might include factors like:
You could think of guidelines like these as a sort of “playbook of playbooks,” in the sense that they can guide decisions on how closely SREs should stick to playbooks. It may not make sense to bake rules like these into your actual incident response plans, but at least consider having conversations about these questions so that your SREs can think about and discuss them before they are faced with an incident that may or may not be best remediated using the playbook. Ultimately, the way you think about how to treat playbooks should be baked into your incident response culture.
In the end, deciding whether to follow playbooks to the letter or not is a subjective and personal choice – and it’s a great example of why you can never fully remove the human element from computing.
But whatever choice you decide to make, the important thing is to think about this issue before you’re in the midst of an incident. You want to know ahead of time what you’ll do if you’re in Sully’s shoes – or, for that matter, the shoes of the many non-heroes who made basically the same choice as Sully, but ended up facing different outcomes.