One of the first things GitHub did under its new owner over the past week was put the finishing touches on one of the most detailed outage reports you’ll ever see.
Last week GitHub, which millions of developers around the world use every day to build software, was unusable for a sizable portion of Sunday evening into Monday morning after routine maintenance went awry and knocked an important link between pieces of GitHub's infrastructure offline for 43 seconds, said Jason Warner, senior vice president for technology, in a very thorough outage report. Connectivity was restored quickly, but that brief gap set off a cascading series of issues as GitHub engineers realized the outage had left user data in an inconsistent state.
“With this incident, we failed you, and we are deeply sorry,” Warner wrote, days after Microsoft’s $7.5 billion acquisition of the company closed. “While we cannot undo the problems that were created by GitHub’s platform being unusable for an extended period of time, we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.”
You don’t need to be an operations or database nerd to properly appreciate the minute-by-minute detail that Warner provides on how the situation unfolded and how GitHub engineers scrambled to recover, but it would probably help. He said that GitHub was essentially forced to choose between a speedy recovery that might have destroyed some user data and a recovery process that would take a long time but would save all the data.
During that 43-second loss of connectivity between GitHub’s East Coast networking hub and its primary East Coast data center, automated systems and engineers reacted swiftly to fail new writes from GitHub users over to its West Coast data center. But a small amount of in-transit data had already been written to the East Coast data center without yet being replicated to the West Coast location. That meant user data existed in different states in different data centers, and neither one had a completely accurate picture.
That is a problem. The only fix that would prevent data loss was to rebuild the missing user data from backups, which takes a lot of time. And because applications running on the East Coast now had to send their database writes all the way to the Pacific, site performance was bound to suffer throughout that process.
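The divergence Warner describes can be sketched in miniature. This is a hypothetical toy model, not GitHub's actual system: a primary that acknowledges writes before they ship to a replica, so a failover during that window leaves each side with a different view of the data.

```python
# Toy model of asynchronous replication diverging during a failover.
# Hypothetical names (Database, replication_queue); not GitHub's code.

class Database:
    def __init__(self, name):
        self.name = name
        self.rows = {}          # data committed in this data center

    def write(self, key, value):
        self.rows[key] = value

east = Database("east")
west = Database("west")
replication_queue = []          # acknowledged by east, not yet shipped west

def user_write(key, value):
    east.write(key, value)      # east acknowledges the write immediately
    replication_queue.append((key, value))

def replicate_once():
    # Ship any pending writes to the west coast replica.
    while replication_queue:
        key, value = replication_queue.pop(0)
        west.write(key, value)

user_write("pr:1", "opened")
replicate_once()                # this write makes it across the country

user_write("pr:2", "merged")    # still in transit when the link drops...
# --- connectivity lost; failover promotes west; the queue is stranded ---
replication_queue.clear()

west.write("pr:3", "opened")    # new user writes now land on west only

# Neither data center has the complete picture:
print(sorted(east.rows))        # ['pr:1', 'pr:2']
print(sorted(west.rows))        # ['pr:1', 'pr:3']
```

Each side holds a write the other never saw, which is why a fast "pick one side" recovery would have destroyed data, and why rebuilding from backups was the only lossless option.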
That led to frustrated users, and multiple vague public estimates of how long the fix would take only made things worse.
“In our desire to communicate meaningful information to you during the incident, we made several public estimates on time to repair based on the rate of processing of the backlog of data. In retrospect, our estimates did not factor in all variables,” Warner said.
GitHub is reviewing the incident to see what it can learn about its infrastructure decision-making process and its incident response procedures. Warner noted that the company plans to invest in introducing new resiliency concepts like chaos engineering in hopes of preventing an outage of this duration from happening again.
Kudos to Warner and GitHub for their transparency, which only helps everyone involved in engineering the cloud. It’s also a sign that GitHub is going to be very careful when it comes to customer and developer relations during its early days under Microsoft, a company that until recently was quite disliked within the software community.