Atlassian has published a blog post in which CTO Sri Viswanath explains in detail why the provider’s cloud tools went down for some customers. In addition to the already known problems with the script that was run, internal communication failures are said to be to blame. The post also describes the complex recovery process and offers an initial explanation of why it could take another two weeks, as announced.
Failed to delete old standalone instances
The incident arose during the now-native integration of Jira’s “Insight – Asset Management” service into the manufacturer’s products. As part of the migration, legacy standalone versions of “Insight” that were still installed were to be deactivated, Atlassian explains in the blog post.
The team used an existing script for this. In the run-up, however, there was an internal communication error: the team that was to carry out the deactivation received incorrect information from the team that had planned the process. Instead of just the IDs of the affected Insight instances, it was given the IDs of all cloud instances on which the standalone application was installed.
In addition, the script was run in the wrong mode: besides a “mark for deletion” function, which allows deleted data to be restored, it also has a “permanently delete” function, which is really only needed to meet compliance requirements. When the script was run, however, the latter mode was executed, and the data of roughly 400 customers was permanently deleted.
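To make the difference between the two modes concrete, here is a minimal, purely illustrative Python sketch. The names (`Site`, `DeletionScript`, the mode strings) are hypothetical and not taken from Atlassian’s actual tooling; the sketch only shows why a “mark for deletion” pass is recoverable while a “permanently delete” pass is not.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical cloud-instance record (not Atlassian's real data model)."""
    site_id: str
    marked_for_deletion: bool = False


class DeletionScript:
    """Illustrates the two deletion modes described in the blog post."""

    def __init__(self, sites):
        self.sites = {s.site_id: s for s in sites}

    def run(self, ids, mode="mark_for_deletion"):
        for site_id in ids:
            site = self.sites.get(site_id)
            if site is None:
                continue
            if mode == "mark_for_deletion":
                # Soft delete: the record stays and can be restored later.
                site.marked_for_deletion = True
            elif mode == "permanently_delete":
                # Hard delete: the record is removed; no restore is possible.
                del self.sites[site_id]

    def restore(self, site_id):
        """Returns True if the site could be restored, False otherwise."""
        site = self.sites.get(site_id)
        if site is None:
            return False  # permanently deleted sites cannot be brought back
        site.marked_for_deletion = False
        return True
```

In this toy model, running the script with the wrong mode (`permanently_delete`) on the wrong list of IDs, as described above, would make `restore()` fail for every affected site, leaving backups as the only way back.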
Elaborate recovery process
Atlassian maintains data backups across multiple AWS Availability Zones. In the past, these backups only had to be used to restore individual data points, for example when customers had accidentally deleted their own data. The process had not been designed to restore many data sets at once.
The recovery process is also complex and requires, among other things, one-on-one communication with those affected; restoring a single account therefore takes up to five days. In the meantime, however, the company says it has further automated the time-consuming manual process and can now handle up to 60 cases in parallel.
The incident has ‘undermined confidence’
However, the incident and the company’s response time did not meet its own standards, CTO Sri Viswanath continued in the blog post: “We know that incidents like this can undermine trust.” The company therefore plans to publish a more detailed post-incident report, improve its external communication, and provide daily status updates going forward.
Since Tuesday, some Atlassian customers have had no access to the provider’s popular cloud tools such as Jira and Confluence. The company said on Tuesday that outages for individual customers could last up to two weeks. As of April 13, the problem had been fixed for only 45 percent of those affected.