Facebook says the root cause of its outage Monday was a routine maintenance job gone awry that first crashed the company's entire backbone network and then rendered its DNS servers unavailable.
To make matters worse, the loss of DNS made it impossible for Facebook engineers to remotely access the devices they needed to reach to bring the network back up, so they had to go into the data centers to restart systems manually.
That slowed the recovery, and the data centers' own safeguards slowed it further, because they are designed to make tampering hard for anybody. “They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” according to a Facebook blog post written by Santosh Janardhan, the company's vice president of engineering and infrastructure.
It took time, but once the systems were restored, the network came back up.
Restoring the customer-facing services that run over the network was another lengthy process because turning them up all at once could cause another round of crashes. “Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” Janardhan wrote.
In all, Facebook was down for seven hours and five minutes.
Routine-maintenance foul-up
The outage began when Facebook was taking just part of the backbone network offline for maintenance at 11:39 a.m. EDT. “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” Janardhan wrote.
That wasn’t the plan, and Facebook even had a tool in place to sort out commands that might cause such a catastrophic failure, but it didn’t work. “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command,” according to Janardhan.
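The kind of pre-flight check Janardhan describes can be illustrated with a minimal sketch. Facebook has not published how its audit tool works, so everything here is an assumption made for illustration: the function name `audit_command`, the link names, and the policy of refusing any command that would leave fewer than a minimum number of backbone links up.

```python
def audit_command(links_up, links_to_disable, min_links_remaining=1):
    """Return True if the command is safe to run under this toy policy.

    Hypothetical policy: a command is rejected when running it would
    leave fewer than `min_links_remaining` backbone links in service.
    """
    remaining = set(links_up) - set(links_to_disable)
    return len(remaining) >= min_links_remaining


# Hypothetical backbone with three links.
backbone = {"link-1", "link-2", "link-3"}

# Taking one link down for maintenance passes the audit.
partial_ok = audit_command(backbone, {"link-1"})

# A command that would disable every link is refused.
total_ok = audit_command(backbone, backbone)
```

In this toy model the second command is blocked. The failure Facebook describes is the case where a bug in a check of this kind lets such a command through anyway.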
Once that happened, the DNS was doomed.
DNS was a single point of failure
An automated response to the backbone crash seems to be what took down the DNS, according to Angelique Medina, head of product marketing at Cisco ThousandEyes, which monitors and analyzes internet traffic and outages.
DNS, the Domain Name System, responds to queries about how to translate Web names into IP addresses, and Facebook hosts its own DNS nameservers. “They have an architecture where their DNS service is scaled up or down in relation to server availability,” Medina says. “And when server availability went to zero because the network went down, they decommissioned all their DNS servers.”
That decommissioning was accomplished by Facebook’s DNS nameservers sending messages to internet Border Gateway Protocol (BGP) routers, which store the routes to use to reach specific IP addresses. The routes are routinely advertised to the routers to keep them current on how to direct traffic appropriately.
The route-withdrawal messages from Facebook's DNS servers removed the advertised routes to those servers, making it impossible for BGP routers to send traffic their way. “The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan wrote.
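The sequence described above, in which healthy DNS servers withdraw their own routes once they can no longer reach the data centers, can be modeled in a few lines. This is a toy simulation, not Facebook's implementation: the `Router` and `DnsServer` classes, the server names and the `health_check` logic are all hypothetical, and the prefix comes from the address range reserved for documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy stand-in for a BGP router: maps a prefix to the servers advertising it."""
    routes: dict = field(default_factory=dict)

    def announce(self, prefix, origin):
        self.routes.setdefault(prefix, set()).add(origin)

    def withdraw(self, prefix, origin):
        origins = self.routes.get(prefix, set())
        origins.discard(origin)
        if not origins:
            # No server advertises the prefix any more, so drop the route.
            self.routes.pop(prefix, None)

    def can_reach(self, prefix):
        return prefix in self.routes


@dataclass
class DnsServer:
    name: str
    prefix: str
    running: bool = True  # the server process itself never stops in this model

    def health_check(self, backbone_up, router):
        # Assumed behavior: advertise only while the backbone is reachable.
        if backbone_up:
            router.announce(self.prefix, self.name)
        else:
            router.withdraw(self.prefix, self.name)


def simulate(backbone_up):
    """Return (DNS reachable from the internet, DNS servers still running)."""
    router = Router()
    servers = [
        DnsServer("dns-a", "192.0.2.0/24"),
        DnsServer("dns-b", "192.0.2.0/24"),
    ]
    for server in servers:
        server.health_check(backbone_up, router)
    reachable = router.can_reach("192.0.2.0/24")
    still_running = all(server.running for server in servers)
    return reachable, still_running
```

Calling `simulate(False)` in this model returns a pair showing the servers still running but unreachable, which matches the situation Janardhan describes: the DNS servers were operational, yet no route led to them.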
Even if the DNS servers had still been accessible from the internet, Facebook customers would have lost service because the network they were trying to reach had crashed. Unfortunately for Facebook, its own engineers also lost access to the DNS servers, which were necessary for their remote management platforms to reach the downed backbone systems.
“They don’t use their DNS service just for their customer-facing Web properties,” Medina says. “They also use it for their own internal tools and systems. By taking it down completely, that prevented their network operators or engineers from gaining access to the systems they needed to in order to fix the problem.”
That meant that rather than fix things from a management console, the engineers had to lay hands on data-center devices to bring them back up, one by one.
A more robust architecture would have dual DNS services so one could back up the other. For example, Amazon, whose AWS offers a DNS service, uses two external services, Dyn and UltraDNS, for its DNS, according to Medina.
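How a second DNS service backs up the first can be sketched from the resolver's point of view. The example below is a simplified model under stated assumptions: the provider names, the `resolve_with_fallback` function and the stub lookups are hypothetical, and the IP address comes from a range reserved for documentation. In practice the redundancy is usually achieved by publishing nameserver records from more than one provider, and resolvers handle the retry themselves.

```python
def resolve_with_fallback(hostname, providers):
    """Try each DNS provider in order and return the first answer.

    `providers` is a list of (name, lookup) pairs, where `lookup` is a
    callable that returns an IP address or raises LookupError.
    """
    errors = []
    for name, lookup in providers:
        try:
            return name, lookup(hostname)
        except LookupError as exc:
            errors.append(f"{name}: {exc}")
    raise LookupError("all DNS providers failed: " + "; ".join(errors))


def primary_down(hostname):
    # Hypothetical stub: the primary service is unreachable.
    raise LookupError("no route to nameserver")


def secondary_up(hostname):
    # Hypothetical stub: the secondary service answers normally.
    return "203.0.113.10"


answered_by, address = resolve_with_fallback(
    "www.example.com",
    [("primary", primary_down), ("secondary", secondary_up)],
)
```

With only the primary in the list, the same call would raise an error, which is the single-point-of-failure case.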
Lessons to learn
The incident reveals what networking best practices suggest is a shortcoming of the Facebook architecture. “Why was their DNS effectively a single point of failure here?” Medina says. If there were no underlying backbone failure and the single DNS system failed, that itself might trigger an outage, “so I think having redundant DNS is a big takeaway.”
Another general observation is one that Medina has made about other service-provider outages. “Often times with these outages there are so many interdependencies within their network that one small issue in one part of their overall service architecture experiences an issue, and then it has sort of this cascading effect,” she says.
“A lot of companies are leveraging a lot of internal services, and in doing that there can be unforeseen consequences. That may be more for the technical folks [to analyze], but I do think it’s worth pointing out.”
Copyright © 2021 IDG Communications, Inc.