How Facebook Keeps Messenger From Crashing on New Year's Eve

Or, as Ahdout puts it, “once you start falling behind, you fall behind more.”

“The biggest thing we worry about is: How do you prevent that cascading failure from happening?” adds Georgiou.

One way is to perform extensive load testing ahead of time, simulating the volume of messages that Facebook expects on New Year’s Eve based on activity in previous years. (The company declined to share its forecasts, and would not say how many messages were sent in previous years.) Load testing lets the team validate how many messages a given server can handle before traffic must be shifted to other servers in the network.
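A rough sketch of that kind of capacity test, written in Python, might ramp a synthetic message rate against a single server until errors appear. Everything here is a placeholder (the server name, the rates, and the send_synthetic_message stub), not Facebook’s actual tooling:

```python
import random

def send_synthetic_message(server: str, offered_rate: int) -> bool:
    """Placeholder for one synthetic send during a load test.

    Simulates a server whose success rate degrades once the offered rate
    exceeds roughly 50,000 messages per second; a real test would call
    the messaging service itself."""
    simulated_capacity = 50_000
    return random.random() < min(1.0, simulated_capacity / offered_rate)

def estimate_capacity(server: str, start_rate: int = 10_000,
                      step: int = 10_000, max_error_rate: float = 0.01) -> int:
    """Ramp the offered message rate until errors cross a threshold and
    report the last rate the server handled cleanly."""
    rate, last_good = start_rate, 0
    while True:
        failures = sum(
            0 if send_synthetic_message(server, rate) else 1
            for _ in range(1_000)   # sample of sends at this rate
        )
        error_rate = failures / 1_000
        print(f"{server}: {rate:,} msg/s -> {error_rate:.1%} errors")
        if error_rate > max_error_rate:
            return last_good
        last_good = rate
        rate += step

if __name__ == "__main__":
    capacity = estimate_capacity("chat-server-01")
    print(f"Shift traffic away once chat-server-01 nears {capacity:,} msg/s")
```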

During the last New Year’s Eve, for example, one data center struggled with the volume of incoming messages, so the team directed traffic away from that center to another one. Following that incident, the group built tools to allow them to make those kinds of changes more easily this year.
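In simplified form, that kind of traffic shift can be pictured as adjusting per-data-center routing weights. The data-center names and the drain fraction below are purely illustrative:

```python
import random

# Hypothetical routing weights per data center; a real system would push
# these to edge load balancers rather than hold them in process memory.
routing_weights = {"dc-east": 1.0, "dc-europe": 1.0, "dc-asia": 1.0}

def drain(data_center: str, fraction: float) -> None:
    """Reduce the share of new traffic sent to one data center."""
    routing_weights[data_center] *= max(0.0, 1.0 - fraction)

def pick_data_center() -> str:
    """Weighted random choice of a data center for an incoming message."""
    centers = list(routing_weights)
    weights = [routing_weights[c] for c in centers]
    return random.choices(centers, weights=weights, k=1)[0]

# Directing traffic away from a struggling site, as in the incident above:
drain("dc-asia", 0.8)
print([pick_data_center() for _ in range(5)])
```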

In addition to shifting loads, the Messenger team has developed other levers that it can pull “if things get really bad,” says Ahdout. Every new message sent to a server goes into a queue as part of a service called Iris. There, messages are assigned a timeout—a period of time after which the message drops out of the queue to make room for new ones. During a high-volume event, this allows the team to quickly discard certain types of messages, such as read receipts, and focus its resources on delivering the ones that users have composed.

“We set up our systems so that if it comes to that, they start shedding the lowest-priority traffic,” says Ahdout. “So if it came to it, Iris would rather deliver a message and drop the read receipt, rather than drop the message and deliver the read receipt.”
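Putting those two ideas together, per-message timeouts plus priority-based shedding, a toy version of such a queue might look like the Python sketch below. It is not Facebook’s Iris implementation; the message types, priorities, and capacity are placeholders:

```python
import heapq
import time
from dataclasses import dataclass, field

# Lower number = higher priority, so read receipts are shed before messages.
PRIORITY = {"message": 0, "typing_indicator": 1, "read_receipt": 2}

@dataclass(order=True)
class QueuedItem:
    priority: int
    enqueued_at: float = field(compare=False)
    timeout_s: float = field(compare=False)
    payload: str = field(compare=False)

class SheddingQueue:
    """Toy delivery queue: entries expire after their timeout, and when the
    queue is over capacity the lowest-priority entries are dropped first."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap: list[QueuedItem] = []

    def put(self, kind: str, payload: str, timeout_s: float = 5.0) -> None:
        item = QueuedItem(PRIORITY[kind], time.time(), timeout_s, payload)
        heapq.heappush(self._heap, item)
        self._shed_if_needed()

    def _shed_if_needed(self) -> None:
        now = time.time()
        # Drop entries whose timeout has elapsed to make room for new ones.
        live = [i for i in self._heap if now - i.enqueued_at < i.timeout_s]
        # If still over capacity, keep only the highest-priority entries.
        live.sort()                     # sorts by priority number, lowest first
        self._heap = live[: self.capacity]
        heapq.heapify(self._heap)

    def pop(self) -> str | None:
        return heapq.heappop(self._heap).payload if self._heap else None

q = SheddingQueue(capacity=2)
q.put("read_receipt", "alice saw message 41")
q.put("message", "Happy New Year!")
q.put("message", "See you at midnight")
print(q.pop(), "|", q.pop(), "|", q.pop())  # the messages survive; the receipt is shed
```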

Georgiou says the group can also sacrifice the accuracy of the green dot displayed in the Messenger app that indicates a friend is currently online. Slowing the frequency at which the dot is updated can relieve network congestion. The team could also instruct the system to temporarily delay certain functions—such as deleting information about old messages—for a few hours, freeing up the CPUs that would ordinarily perform those tasks to process more messages in the moment.
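Both of those degradations can be pictured as simple functions of a load signal: widen the presence-refresh interval as pressure rises, and postpone cleanup work above some threshold. The numbers below are invented for illustration:

```python
def presence_refresh_interval(load_factor: float) -> float:
    """How often (in seconds) to refresh the online-status dot.

    load_factor is a hypothetical 0.0-1.0 measure of system pressure; as it
    rises, presence updates are sent less often to relieve congestion."""
    base_s, degraded_s = 10.0, 120.0
    load_factor = min(max(load_factor, 0.0), 1.0)
    return base_s + (degraded_s - base_s) * load_factor

def should_run_cleanup(load_factor: float) -> bool:
    """Defer non-urgent background work, such as pruning metadata about old
    messages, while the system is under pressure, freeing CPUs for delivery."""
    return load_factor < 0.7

print(presence_refresh_interval(0.1))   # 21 s under light load
print(presence_refresh_interval(0.9))   # 109 s near peak
print(should_run_cleanup(0.95))         # False: postpone cleanup for a few hours
```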

All of these options fall under the notion of “graceful degradation,” says Ahdout. “Rather than having your service dying on the floor and no one using it, you make it a little less awesome and people can still use it.” Fortunately, the Messenger team didn’t have to resort to any of these measures last year.

Aside from those efforts, Messenger’s engineers also spend a lot of time on efficiency projects designed to make the most of the CPUs and memory within each server. Ahead of New Year’s Eve 2018, for example, the team added a scheduler, which is a program that allows the system to “batch” similar messages together.

“You can imagine that our servers are getting many requests concurrently,” explains Ahdout. “You can bundle some of those together into a single large request before you send it downstream. Doing that, you reduce the computational load on downstream systems.”

Batches are formed based on a principle called affinity, which can be derived from a variety of characteristics. For example, two messages may have higher affinity if they are traveling to the same recipient or require similar resources from the back end. As traffic increases, the Messenger team can have the system batch more aggressively. Doing so increases latency (a message’s round-trip delay) by a few milliseconds, but makes it more likely that all messages will get through.
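A simplified version of affinity-based batching, with the batching window acting as the aggressiveness knob, might look like this. The affinity fields and the five-millisecond window are assumptions made for the sake of the example:

```python
import time
from collections import defaultdict

def affinity_key(request: dict) -> tuple:
    """Requests with the same key have high affinity and can be bundled.
    The fields used here (recipient shard, backend service) are illustrative,
    not Messenger's actual schema."""
    return (request["recipient_shard"], request["backend"])

def batch_requests(requests: list[dict], window_ms: float) -> list[list[dict]]:
    """Group concurrent requests by affinity within a batching window.

    A wider window batches more aggressively: each message may wait a few
    extra milliseconds, but downstream systems see far fewer requests."""
    batches: dict[tuple, list[dict]] = defaultdict(list)
    deadline = time.monotonic() + window_ms / 1000.0
    stream = iter(requests)             # stand-in for an incoming request stream
    while time.monotonic() < deadline:
        request = next(stream, None)
        if request is None:
            break
        batches[affinity_key(request)].append(request)
    return list(batches.values())

incoming = [
    {"recipient_shard": 7, "backend": "delivery", "text": "hny!"},
    {"recipient_shard": 7, "backend": "delivery", "text": "see you at midnight"},
    {"recipient_shard": 3, "backend": "delivery", "text": "happy new year"},
]
# Under heavy load, the window could be widened so batching becomes more aggressive.
for batch in batch_requests(incoming, window_ms=5.0):
    print(f"one downstream call carrying {len(batch)} message(s)")
```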

This year for New Year’s Eve, neither Ahdout nor Georgiou will be on duty as midnight approaches in Asia, when the service sees its largest spike in messages, but Ahdout says he will stay close to his laptop, just in case. “Basically, a lot of this work never really sees the light of day, in the sense that things go well, or if they don’t, we handle them so gracefully that users don’t even know what happened,” he says.

“It’s sort of been a while since there was a major problem,” he adds. Fingers crossed.

Source: IEEE Spectrum Computing