What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own?
That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud. The incident offers a number of insights for SREs about reliability risks within reliability management software itself – as well as how to work through complex outages efficiently and transparently, as Atlassian has done following the incident.
What caused the Atlassian Cloud outage?
The outage began on April 4, was resolved on April 18, and affected about 400 Atlassian Cloud customer accounts. Atlassian Cloud is a hosted suite of popular Atlassian products, such as Jira and OpsGenie. The outage meant that affected customers could no longer access these tools or the data they managed in them.
According to Atlassian, the problem was triggered by a faulty application migration process. Engineers wrote a script to deactivate an obsolete version of an application. However, due to what Atlassian called a “communication gap” between teams, the script was written in such a way that it deactivated all Atlassian Cloud products, not just the obsolete application.
To make matters worse, the script was apparently configured to delete data permanently, when the intention had been only to mark it for deletion. As a result, data in affected accounts was removed permanently from production environments.
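To make the difference between marking data for deletion and deleting it permanently concrete, here is a minimal Python sketch. It is not Atlassian's script: `Record`, `DeleteMode`, and `deactivate` are hypothetical names invented for illustration, and the in-memory list stands in for a real data store. It shows two guards the description above suggests were missing: the operation only touches records that belong to the targeted application, and permanent deletion must be requested explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class DeleteMode(Enum):
    MARK = "mark"            # soft delete: the record stays and can be restored
    PERMANENT = "permanent"  # hard delete: the record is removed


@dataclass
class Record:  # hypothetical stand-in for a piece of customer data
    record_id: str
    app: str
    marked_for_deletion: bool = False


def deactivate(records, target_app, mode=DeleteMode.MARK, confirm_permanent=False):
    """Deactivate only the records that belong to target_app.

    Returns the list of records that remain in the store afterwards.
    """
    if mode is DeleteMode.PERMANENT and not confirm_permanent:
        raise ValueError("permanent deletion requires confirm_permanent=True")

    remaining = []
    for record in records:
        if record.app != target_app:
            remaining.append(record)  # out of scope: leave untouched
        elif mode is DeleteMode.MARK:
            record.marked_for_deletion = True
            remaining.append(record)  # still present, so it can be restored
        # PERMANENT mode: the record is dropped by not appending it
    return remaining
```

With the default mode, calling `deactivate(records, "legacy-app")` flags only the obsolete application's records and removes nothing; dropping data for good takes both `DeleteMode.PERMANENT` and `confirm_permanent=True`.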
The good, the bad and the ugly from the Atlassian incident
The Atlassian Cloud outage may not be the very worst type of incident imaginable – failures like Facebook’s 2021 outage were arguably worse because they affected more people and because service restoration was complicated by physical access issues – but it was still pretty bad. Production data was permanently deleted, and hundreds of enterprise customers experienced total service disruptions that have lasted several days and counting.
Given the seriousness of the incident, it’s tempting to point fingers at Atlassian engineers for letting an incident like this happen in the first place. They seem to have written a script with some serious issues, then presumably deployed it without testing it first – which is exactly the opposite of what you might call an SRE best practice.
On the other hand, Atlassian deserves lots of points for responding to the incident efficiently and transparently. Although the company was silent at first, it ultimately shared details about what happened and why, even though those details were a bit embarrassing to its engineers.
Crucially, Atlassian also had backups and failover environments in place, which it has used to speed the recovery process. The major reason why the outage has lasted so long, the company said, is that restoring data from backups to production requires integrating backup data for individual customers into storage that is shared by multiple customers, a tedious process that Atlassian apparently can’t perform automatically (or doesn’t want to, presumably because it would be too risky to automate).
Unfortunately for impacted customers, it does not appear that any fallback tools or services were made available while they waited for Atlassian to restore operations. We imagine this poses more than minor issues for teams that rely on tools like Jira to manage projects and OpsGenie to handle incidents. Perhaps those teams have stood up alternative tools in the meantime – or perhaps they have just spent the past several days crossing their fingers, hoping their project and reliability management tools will come back online ASAP. The full outage postmortem can be found here.
Takeaways for SREs from the Atlassian outage
For SREs, then, the key takeaways from this incident would seem to be:
- Always perform dry runs of migration processes in testing environments before putting them into production. Presumably, if Atlassian engineers had tested their application migration script first, they would have noticed its flaws before it took out live customer environments.
- Back up, back up, back up – and make sure you have failover environments where you can rebuild failed services based on backups. While this outage is bad, it would have been 100 times worse if Atlassian had been unable to restore service from backups and the data had been lost permanently.
- Ideally, each customer’s data should be stored separately. As we noted above, the fact that Atlassian used shared storage seems to have been a factor in delaying recovery. That said, it’s hard to fault Atlassian too much on this point; it’s not always practical to isolate data for each user due to the cost and administrative complexity of doing so.
- SRE teams would do well to think about how they’ll respond if their reliability management software itself goes offline. For example, it might be worth extracting and backing up data from your reliability management tools so you can still access it if your tool provider experiences an incident like this.
- Overcommunicate with your customers, early and often. In this case, there was quite a bit of radio silence, which left customers in the dark and wondering what to do. Much of the resulting chatter eventually moved to public forums.
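The first takeaway, a dry run before production, can be sketched in a few lines of Python. This is an illustrative pattern under stated assumptions, not a description of any Atlassian tooling: `plan_migration`, `run_migration`, and the `max_fraction` threshold are hypothetical. The script defaults to reporting what it would touch, and it refuses to apply a change that affects a larger share of the environment than expected.

```python
def plan_migration(all_ids, ids_to_deactivate):
    """Return the set of IDs the script would touch, without changing anything."""
    return set(all_ids) & set(ids_to_deactivate)


def run_migration(all_ids, ids_to_deactivate, max_fraction=0.1, dry_run=True):
    """Report the planned change; apply it only if dry_run is False.

    max_fraction is a hypothetical "blast radius" limit: the largest share
    of the environment a single run is allowed to affect.
    """
    planned = plan_migration(all_ids, ids_to_deactivate)
    fraction = len(planned) / len(all_ids) if all_ids else 0.0
    report = {
        "planned": sorted(planned),
        "fraction": fraction,
        "exceeds_limit": fraction > max_fraction,
        "applied": False,
    }
    if dry_run:
        return report  # nothing is changed; the report is for human review
    if report["exceeds_limit"]:
        raise RuntimeError(
            f"refusing to apply: {fraction:.0%} of targets affected, "
            f"limit is {max_fraction:.0%}"
        )
    # The real deactivation step would go here.
    report["applied"] = True
    return report
```

A dry run that reports `exceeds_limit` as true for a script meant to retire a single obsolete application is the kind of signal that would prompt a second look before anything reaches production.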
The Atlassian Cloud outage is notable both for its length and for the fact that, somewhat ironically, it took out software that teams use to help prevent these types of issues from happening at their own businesses.
The good news is that Atlassian had the necessary resources in place to restore service as quickly as possible. A shared data storage architecture has led to slow recovery, which is unfortunate, but again, it’s hard to blame Atlassian too much for not setting up dedicated storage for each customer.