Infrastructure Upgrades for Data Residency
Scheduled Maintenance Report for GitKraken
Postmortem

On Thursday, January 20, we performed a release to add Data Residency support (hosting in the European Union using the AWS Frankfurt, Germany region) and to change the AWS instance class for our production instances.

The release process was estimated to take ~20 minutes, with the majority of that time spent on shutting down the AWS instance, changing the type, and restarting all services.

After resizing the instance, we noticed that our EBS data volumes were performing slowly, to the point of being unusable.

Our initial diagnosis was that the volumes were behaving like volumes recently restored from a backup snapshot, where AWS lazy-loads the data behind the scenes.

We could not find an explanation for this behavior, as we had not made any changes to the data volumes. Working with AWS support, we determined that changing the instance class from r5 to r5b caused the volumes to go through an optimization step behind the scenes. During this optimization, some of the data volumes encountered an issue that resulted in increased latency.

AWS support investigated the issue but could not identify the cause or a possible fix, so the issue had to be escalated to the EBS services team. At that point, neither the time needed for the optimization to finish nor the time needed for the EBS team's investigation was known, and AWS support could not provide an estimate.

Because AWS could not give us an estimate, we decided to restore all data from snapshots created prior to the release. We regularly create backup snapshots, including during each release, so we had a full data set to restore.
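The restore decision above depended on every data volume having a recent, completed snapshot. A pre-release check along these lines could confirm that. This is only a sketch, not our actual tooling: the function name and the one-hour threshold are hypothetical, and the input is assumed to be a list of dictionaries carrying the VolumeId, State, and StartTime fields that the EC2 DescribeSnapshots API returns.

```python
from datetime import datetime, timedelta, timezone


def volumes_missing_fresh_snapshot(volume_ids, snapshots, now, max_age):
    """Return the volume IDs that lack a completed snapshot newer than max_age.

    `snapshots` is assumed to look like the "Snapshots" list in a
    DescribeSnapshots response (VolumeId, State, StartTime).
    """
    covered = {
        snap["VolumeId"]
        for snap in snapshots
        if snap["State"] == "completed" and now - snap["StartTime"] <= max_age
    }
    return sorted(set(volume_ids) - covered)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"VolumeId": "vol-a", "State": "completed",
         "StartTime": now - timedelta(minutes=30)},
        {"VolumeId": "vol-b", "State": "pending",
         "StartTime": now - timedelta(minutes=10)},
    ]
    # Hypothetical volume IDs; any non-empty result should block the release.
    print(volumes_missing_fresh_snapshot(
        ["vol-a", "vol-b"], sample, now, timedelta(hours=1)))
```

A release script could refuse to proceed while the returned list is non-empty.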

At this stage, we rolled back our instance changes and used EBS Fast Snapshot Restore (FSR) to recover all production data from backup snapshots. This process took about 4 hours in total, as we waited for FSR to finish optimizing the snapshots and for enough volume creation credits to accrue to recreate the volumes. Once the volumes were recreated, service returned to normal.
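To illustrate why waiting on credits can add hours to a restore, here is a rough sketch of the FSR credit arithmetic. It assumes the formula in the AWS EBS documentation as we understand it: each FSR-enabled snapshot has, per Availability Zone, a credit bucket whose maximum size and hourly refill rate both equal MAX(1, MIN(10, FLOOR(1024 / snapshot size in GiB))), and creating one fully initialized volume consumes one credit. The function names are hypothetical, the sizes are examples rather than our actual volume sizes, and the estimate assumes the bucket starts full once snapshot optimization completes.

```python
import math


def fsr_credit_rate(snapshot_size_gib: float) -> int:
    """Bucket size (and credits refilled per hour) for one snapshot in one AZ.

    Assumed formula: MAX(1, MIN(10, FLOOR(1024 / snapshot_size_gib))).
    """
    if snapshot_size_gib <= 0:
        raise ValueError("snapshot size must be positive")
    return max(1, min(10, math.floor(1024 / snapshot_size_gib)))


def hours_waiting_for_credits(volumes_needed: int, snapshot_size_gib: float) -> int:
    """Whole hours spent waiting for refills, assuming the bucket starts full."""
    rate = fsr_credit_rate(snapshot_size_gib)
    remaining = max(0, volumes_needed - rate)
    return math.ceil(remaining / rate)


if __name__ == "__main__":
    # Example sizes only; the report does not state the real volume sizes.
    for size_gib in (100, 512, 1024, 4096):
        print(size_gib, fsr_credit_rate(size_gib))
```

Under these assumptions, a snapshot of 1 TiB or more earns only one credit per hour, so recreating several volumes from the same large snapshot means waiting roughly an hour between each.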

We apologize for the interruption. We are reviewing our release and operations processes to prevent similar issues in the future.

Git Integration for Jira Cloud Operations Team

Posted Jan 24, 2022 - 16:12 EST

Completed
The scheduled maintenance has been completed.
Posted Jan 21, 2022 - 00:15 EST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Jan 20, 2022 - 23:45 EST
Scheduled
We will be undergoing scheduled maintenance during this time to add Data Residency capabilities.
Please expect some minor outages and slowdowns during this maintenance window.
Posted Jan 20, 2022 - 22:48 EST
This scheduled maintenance affected: Git Integration for Jira Cloud - Global Region.