Facebook’s Vanishing Act Explained

On Monday, Facebook vanished from the Internet, along with the company’s other platforms, Instagram and WhatsApp. Although all of them were back up and functioning by the end of the day, for the more than five hours during which they were down, Facebook potentially lost tens of millions of dollars in revenue. Analytics provider Haystack saw a 32 percent increase in developer throughput during the period Facebook was down, suggesting that developers actually got more work done than usual because of the outages.

But why did Facebook disappear from the Internet to begin with? As it turns out, it was initially a small bug that cascaded into bigger problems. And while Facebook’s accounting of what went wrong checks out, some missing details raised questions for network experts.

Why did Facebook vanish?

According to Facebook’s own explanation of the events, the problems started during routine maintenance on the company’s internal backbone. This backbone is a series of fiber-optic cables and data centers built and operated by Facebook to handle both internal communications and external requests. Any time you log onto your Facebook account, browse Instagram, or send a message on WhatsApp, you’re making such an external request.

Like any company maintaining a portion of the Internet’s infrastructure, Facebook uses software tools to check on the status of its backbone. These tools are relatively simple: One might measure data throughput on a fiber line, for example, or temporarily take down one fiber line to test the redundancy of other lines. “These tools are not big, convoluted systems,” says Yiannis Psaras, a researcher at Protocol Labs.

Apparently, during Monday’s maintenance, a particular tool, instead of taking one line down as intended, sent out a command to take down every line. According to Psaras, it was as though the tool essentially cut every one of Facebook’s fiber lines in half.

The problem compounded from there: Each of Facebook’s own servers, unable to communicate with anything else, assumed it was the source of the fault, and therefore each one took itself offline.
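
As a rough illustration of that logic, here is a minimal sketch of a health check that stops advertising a server’s routes when the backbone looks unreachable. The addresses and helper names are entirely hypothetical, not Facebook’s actual code.

```python
import socket

# Hypothetical internal backbone peers this server expects to reach.
BACKBONE_PEERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def backbone_reachable(peers, port=179, timeout=2.0):
    """Return True if at least one backbone peer accepts a TCP connection."""
    for peer in peers:
        try:
            with socket.create_connection((peer, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False


def withdraw_route_announcements():
    # Stand-in for the real action: stop telling the network how to reach us.
    print("Backbone unreachable: withdrawing this server's route announcements")


def health_check():
    # If nothing on the backbone answers, assume *this* server is the faulty
    # one and take it off the network -- the behavior described above.
    if not backbone_reachable(BACKBONE_PEERS):
        withdraw_route_announcements()


if __name__ == "__main__":
    health_check()
```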

Facebook uses larger data centers to hold all the content on its websites and apps, and smaller servers that handle Domain Name System (DNS) queries. DNS is often referred to as the Internet’s phonebook: it’s the system that converts human-readable domain names (such as spectrum.ieee.org) into IP addresses, the strings of numbers used to locate and retrieve a website’s data.

When Facebook’s DNS servers removed themselves from both Facebook’s internal backbone and the public-facing Internet, no one could reach anything Facebook-related, for the same reason you can’t call someone whose phone number you don’t have. The DNS queries made by people trying to log onto their accounts all failed because there was no server left to answer them with a valid IP address.
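
For a concrete sense of what a DNS lookup does, and what it looked like when it failed, the snippet below uses Python’s standard library to resolve a hostname into IP addresses; during the outage, lookups for Facebook’s domains would have ended up in the error branch.

```python
import socket


def resolve(hostname):
    """Ask DNS for the IP addresses behind a hostname."""
    try:
        results = socket.getaddrinfo(hostname, None)
        # Each result carries a socket address; collect the unique IP addresses.
        return sorted({result[4][0] for result in results})
    except socket.gaierror as err:
        # Roughly what clients saw on Monday: the lookup simply fails.
        return f"DNS lookup failed: {err}"


print(resolve("spectrum.ieee.org"))
```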

As the hours passed, this slowed down the rest of the Internet too. Shiv Panwar, a researcher at NYU Wireless, explains that DNS is hierarchical: if a DNS query runs into a problem, it will check a wider range of servers to see if it can locate the information it needs. It’s the equivalent of switching from a local phonebook to a regional one. In other words, people’s attempts to log onto Facebook and Instagram added load to requests for the rest of the Internet, as their queries searched anywhere and everywhere for the information they were after.
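
As a back-of-the-envelope illustration of why those failed lookups added work elsewhere, consider the toy model below: every failed query walks further up the DNS hierarchy, and impatient apps keep retrying. The numbers and level names here are made up for illustration, not measurements.

```python
# Toy model of resolver load during the outage (illustrative numbers only).
HIERARCHY = ["local cache", "ISP resolver", "TLD servers", "authoritative servers"]


def failed_lookup_cost(retries):
    """Each retry re-checks every level, because none of them can answer."""
    return retries * len(HIERARCHY)


normal_cost = 1  # a cached answer: one lookup, answered locally
outage_cost = failed_lookup_cost(retries=5)

print(f"Lookups per user, normally:      {normal_cost}")
print(f"Lookups per user, during outage: {outage_cost}")
```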

Does Facebook’s explanation make sense?

Yes, although the fact that a tool was the initial culprit surprised both Psaras and Panwar. Recall that the original problem was a tool sending out a command that managed to sever all of the routes between Facebook’s data centers. “Why would the tool have this functionality, even as a backup?” says Psaras.

Psaras explains that because software tools are designed to be simple, each testing just one aspect of a network, it’s a bit odd that a single tool was able to cause a global screwup. It is possible, however, that Facebook does have and use such a tool, and that a bug caused it to take down the entire internal backbone.

Panwar suggests that the tool may have been designed to take down a particular route to test how the rest of the backbone picked up the slack; in other words, a tool designed to test the network’s redundancy. It’s easy to imagine such a tool, meant to take down a route, check the rest of the network, and bring the route back up before moving on to the next, bugging out and taking down every route instead.
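
Here is a speculative sketch of the kind of redundancy test Panwar describes, with invented route names and print statements standing in for real network operations, alongside the sort of bug that could turn it into a backbone-wide outage.

```python
ROUTES = ["route-A", "route-B", "route-C"]  # hypothetical backbone routes


def take_down(route):
    print(f"Taking down {route}")


def check_traffic_rerouted(route):
    print(f"Checking that traffic flows without {route}")


def bring_up(route):
    print(f"Restoring {route}")


def redundancy_test_intended(routes):
    """Intended behavior: disable one route at a time, verify, then restore it."""
    for route in routes:
        take_down(route)
        check_traffic_rerouted(route)
        bring_up(route)  # each route comes back before the next one goes down


def redundancy_test_buggy(routes):
    """Buggy variant: the restore step is skipped, so every route ends up down."""
    for route in routes:
        take_down(route)
        check_traffic_rerouted(route)
    # bring_up() never runs -- by the end of the loop the whole backbone is offline.


redundancy_test_intended(ROUTES)
```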

Could Facebook vanish again?

The answer is never “never,” but it’s unlikely. I’ve mentioned redundancy a few times: Facebook’s backbone, like the rest of the Internet, has redundant routes to get from one location to another. Redundancy gives the Internet resiliency. It’s not enough for one route to go down; every route has to fail to totally disrupt traffic.

The Internet is designed to survive “single-point failures.” These are problems like a chip going bad, a link going down, a backhoe ripping a fiber line underneath a construction site, or someone pulling a plug in a data center. In fact, that’s just the sort of thing Facebook’s tool would have been testing for, if it was indeed supposed to be taking down one route to check the traffic load on the others.

Multi-point failures, on the other hand, are much rarer, because they’re harder to pull off, either accidentally or intentionally. Of course, the fact that it did happen to Facebook is enough evidence that it’s possible, if not probable.

Rest assured, however: now that you’re done reading this, you can log into Facebook or Instagram knowing that they’ll (almost definitely) load.

Source: IEEE Spectrum Telecom Channel