Major outage due to inaccessible data volumes
Incident Report for GitKraken
Postmortem

On Thursday, January 20, we deployed a release to add Data Residency support (adding hosting in the European Union using the AWS Frankfurt, Germany region) and to change the AWS instance class for our production instances.

The release process was estimated to take ~20 minutes, with the majority of that time spent on shutting down the AWS instance, changing the type, and restarting all services.
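
For readers who want the mechanics, the resize boils down to three EC2 API calls: stop the instance, change its type, and start it again. Below is a minimal boto3 sketch of that sequence; the instance ID, instance size, and region are placeholders, not our actual configuration or deployment tooling.

```python
# Minimal sketch of the resize step, assuming boto3 and illustrative IDs;
# this is not our actual deployment tooling.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

instance_id = "i-0123456789abcdef0"  # placeholder production instance ID

# Stop the instance (the instance type can only be changed while stopped).
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Change the instance class (the release changed r5 -> r5b; size is illustrative).
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r5b.2xlarge"},
)

# Start the instance again; services are restarted once it is running.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```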

After resizing the instance, we noticed that our EBS data volumes were performing slowly, to the point of being unusable.

The initial diagnosis was that the slow EBS volumes were behaving like volumes recently restored from a backup snapshot (where data needs to be lazy-loaded behind the scenes by AWS).

We could not find an explanation for this behavior, as we had not made any changes to the data volumes. Working with AWS Support, it was determined that changing the instance class from r5 to r5b had caused the volumes to go through an optimization step behind the scenes. During this optimization, we encountered an issue with some of the data volumes, resulting in increased latency.

AWS Support investigated the issue but could not find a lead as to the cause or a possible fix, so the issue had to be escalated to the EBS service team. The timeline for the optimization to finish and/or for the EBS team to complete their investigation was unknown at this time, and AWS Support could not provide even an estimate.

Because AWS could not provide an estimate, we decided to restore all data from snapshots created prior to the release. Since we regularly create backup snapshots, including during each release, we had a full data set to restore from.
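
For context, that backup step amounts to snapshotting each production data volume before any changes are made. The sketch below shows the general idea with boto3; the volume IDs, region, and tagging are illustrative assumptions, not our actual backup tooling.

```python
# Minimal sketch of the pre-release backup step, assuming boto3 and
# placeholder volume IDs; snapshot description/tagging is illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

data_volumes = ["vol-0aaaaaaaaaaaaaaaa", "vol-0bbbbbbbbbbbbbbbb"]  # placeholders

snapshot_ids = []
for volume_id in data_volumes:
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="pre-release backup 2022-01-20",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "purpose", "Value": "release-backup"}],
        }],
    )
    snapshot_ids.append(snap["SnapshotId"])

# Wait until the snapshots are fully captured before proceeding with the release.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=snapshot_ids)
```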

At this stage, we rolled back our instance changes and used EBS Fast Snapshot Restore (FSR) to recover all production data from the backup snapshots. This process took about 4 hours in total, as we waited for FSR to finish its optimizations and for credits to accrue that allowed us to recreate the volumes. Once that was done, things returned to normal.
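
For those curious about the mechanics, the restore roughly corresponds to enabling FSR on the backup snapshots, waiting for it to reach the enabled state, and then creating new volumes from those snapshots. The boto3 sketch below illustrates this under assumed region, availability zone, snapshot IDs, and volume type; it is not our actual recovery tooling.

```python
# Minimal sketch of the restore path, assuming boto3 and placeholder IDs;
# region, availability zone, and volume type are illustrative assumptions.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
az = "us-east-1a"                                   # availability zone is an assumption
snapshot_ids = ["snap-0aaaaaaaaaaaaaaaa", "snap-0bbbbbbbbbbbbbbbb"]  # placeholders

# Enable Fast Snapshot Restore so that volumes created from these snapshots
# are fully initialized (no lazy loading from S3).
ec2.enable_fast_snapshot_restores(
    AvailabilityZones=[az],
    SourceSnapshotIds=snapshot_ids,
)

# Poll until FSR reports the "enabled" state for every snapshot; this is the
# phase where most of the ~4 hours went. (Simplified: per-snapshot volume
# creation credits also have to accrue, which is not modeled here.)
def all_fsr_enabled(snapshot_ids):
    resp = ec2.describe_fast_snapshot_restores(
        Filters=[
            {"Name": "snapshot-id", "Values": snapshot_ids},
            {"Name": "state", "Values": ["enabled"]},
        ]
    )
    return len(resp["FastSnapshotRestores"]) == len(snapshot_ids)

while not all_fsr_enabled(snapshot_ids):
    time.sleep(60)

# Recreate the data volumes from the (now fully initialized) snapshots.
new_volume_ids = [
    ec2.create_volume(SnapshotId=s, AvailabilityZone=az, VolumeType="gp3")["VolumeId"]
    for s in snapshot_ids
]
```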

We apologize for the interruption. We are reviewing our release and operations processes to prevent similar issues in the future.

Git Integration for Jira Cloud Operations Team

Posted Jan 24, 2022 - 16:13 EST

Resolved
It looks like we're out of the woods, and everything is operational again.
We'll continue to keep an eye on things, and will follow up with a postmortem in a few days.
Posted Jan 21, 2022 - 16:06 EST
Investigating
We are still seeing occasional outages. We continue to troubleshoot this issue with AWS.
Posted Jan 21, 2022 - 11:25 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 21, 2022 - 07:34 EST
Update
Hello everyone.

The path we tried to take to get things up and running didn't quite pan out the way AWS promised. We're still working with AWS to figure out the most expedient path forward, but as of right now, we don't have an exact ETA.

I'll update again once we know more.
Posted Jan 21, 2022 - 06:25 EST
Update
A quick update on what's going on:

Some of our data volumes became sick tonight, leading to degraded performance and eventually becoming unusable. We're currently working with AWS to see if they can be fixed.

We're also actively recovering the affected volumes from backup, to try to replace them and see if that resolves the issue.

If all goes well, this outage will last another 2-3 hours.
Posted Jan 21, 2022 - 02:27 EST
Update
We are continuing to work on a fix for this issue.
Posted Jan 21, 2022 - 01:46 EST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 21, 2022 - 01:40 EST
Investigating
We're experiencing a minor outage due to some release hiccups.
We're currently investigating the cause.
Posted Jan 21, 2022 - 00:17 EST
This incident affected: Git Integration for Jira Cloud - Global Region.