Why operators went down or slowed during the Facebook outage
On October 4, 2021, Facebook experienced a prolonged outage that prevented users around the globe from reaching its services.
At approximately 15:40 UTC, Facebook’s services started to go offline, as users were unable to resolve its domains to IP addresses through the DNS. Facebook’s authoritative nameservers are each covered by a different IP prefix, or Internet “route” (more on that later), covering a range of IP addresses.
We live in a collaborative and interactive world that brings endless possibilities, and mobile operators are among the pillars of that connected, efficient world. Being connected has countless advantages; however, yesterday we were reminded of a potential downside.
Does that mean we should stop connecting and collaborating? No, it means we learn from the incident and handle such events better in the future. Here is my take on what happened and the possible remedies:
The Issue
1. The authoritative DNS (Domain Name System) nameservers of Facebook became unreachable.
2. As a result, when a user’s DNS query for a Facebook domain went out onto the Internet, there was no response (the sketch after this list shows what that looks like from the client side).
3. Thus, for a normal user, applications like Facebook, Instagram and WhatsApp did not work.
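To make the failure mode concrete, here is a minimal sketch, assuming the dnspython library is installed, of what an unanswered query looks like from the client side; the timeout values are arbitrary choices for illustration.

```python
# Illustrative sketch (not Facebook's or any operator's tooling): what "no
# response" looks like from a client, assuming dnspython is installed.
import dns.resolver
import dns.exception

resolver = dns.resolver.Resolver()
resolver.timeout = 2    # seconds to wait per nameserver attempt (assumed value)
resolver.lifetime = 6   # total seconds before giving up (assumed value)

try:
    answer = resolver.resolve("facebook.com", "A")
    print([rr.address for rr in answer])
except dns.exception.Timeout:
    # With the authoritative nameservers unreachable, queries simply time out;
    # the resolver keeps the query "open" until the lifetime expires.
    print("DNS query timed out - no response from the authoritative servers")
except dns.resolver.NoNameservers:
    print("All nameservers failed to answer the query")
```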
The Impact on Operators
1. Since Facebook’s nameservers were not responding, DNS queries began to accumulate on operators’ resolvers (the earlier queries remained open while waiting to time out).
2. The surge grew as users kept retrying their apps, generating far more DNS queries per second than usual (a rough illustration of this amplification follows the list).
3. Together, this slowed DNS resolution overall, so the Internet felt slow to subscribers even for sites unrelated to Facebook.
4. In a few cases it led to downtime, because the internet service provider’s local recursive DNS resolvers could not cope with the surge.
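As a rough illustration of why the surge compounds, the back-of-the-envelope calculation below uses entirely assumed numbers (baseline query rate, share of traffic for the affected domains, retry counts and timeouts); it is not based on measurements from any operator.

```python
# Rough, illustrative arithmetic only - every number below is an assumption.
baseline_qps = 100_000          # normal queries/second at an ISP resolver (assumed)
facebook_share = 0.05           # fraction of queries for the affected domains (assumed)
retries_per_failed_query = 3    # stub/client retries when no answer arrives (assumed)

# Failed queries are retried, and impatient users reload their apps, so the
# affected slice of traffic multiplies instead of disappearing.
extra_qps = baseline_qps * facebook_share * retries_per_failed_query
print(f"Extra queries per second from retries alone: {extra_qps:,.0f}")

# Each unanswered query also stays open until it times out, so the resolver
# holds far more concurrent state than usual.
timeout_seconds = 5             # typical resolver query lifetime (assumed)
open_queries = baseline_qps * facebook_share * timeout_seconds
print(f"Concurrent open queries waiting on timeouts: {open_queries:,.0f}")
```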
Learning
1. Use a DNS anycast architecture to spread the surge across all sites rather than managing it at each site individually.
2. Put relevant DNS protection policies in place, such as a ‘Per Subscriber Limit’, to manage the surge efficiently (a minimal sketch of the idea follows this list).
3. Organise a quick-response team to make short-term configuration changes, such as reducing time-out values.
4. Keep sufficient buffer DNS capacity to absorb a surge (as such incidents do recur).
5. Create a balanced DNS architecture with fallbacks in place.
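By way of illustration of the ‘Per Subscriber Limit’ idea, here is a minimal token-bucket sketch in Python. It is a generic model, not the configuration syntax or feature set of any particular DNS product, and the limit values are assumptions.

```python
# Minimal sketch of a per-subscriber query limit (token bucket) in front of a
# resolver. Real deployments would use the rate-limiting features of their DNS
# platform; the limit values here are assumptions for illustration.
import time
from collections import defaultdict

RATE = 20        # sustained queries/second allowed per subscriber (assumed)
BURST = 40       # short burst allowance per subscriber (assumed)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_query(subscriber_ip: str) -> bool:
    """Return True if this subscriber's query should be forwarded upstream."""
    b = buckets[subscriber_ip]
    now = time.monotonic()
    # Refill tokens for the elapsed time since the last query, capped at BURST.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False   # over the limit: drop, or answer with REFUSED/SERVFAIL

# Example: a subscriber retrying aggressively is throttled once its burst is spent.
for i in range(50):
    if not allow_query("198.51.100.7"):
        print(f"query {i} throttled")
```

In practice an operator would enable the equivalent rate-limiting feature of its resolver platform rather than bolt on custom code; the point of the sketch is only that the limit is tracked per subscriber, so one misbehaving or retry-heavy client cannot consume the whole resolver.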
Now, a word on the DNS. The DNS is so critical to the reachability of sites and web applications, that most major service providers don’t mess about with it. For example, Amazon stores the authoritative DNS records for amazon.com not on its own infrastructure (which, as one of the top public cloud providers, is amongst the most heavily used in the world), but on two separate external DNS services, Dyn (Oracle) and UltraDNS (Neustar) - even though Amazon AWS offers its own DNS service.
People and businesses around the world who depend on the Facebook, WhatsApp and Instagram apps felt the outage across all three platforms, even though the systems are now back up and running. The underlying cause of the outage also impacted many of the internal tools and systems that Facebook uses in its day-to-day operations.
Why didn’t Facebook move their DNS records to an external DNS service provider and get their services back online?
Nameserver records, which are served by top-level domain (TLD) servers (in this case, the com. TLD), can be long-lived records. That makes sense, given that app and site operators do not frequently move their nameservers around, unlike A and AAAA records, which often change very frequently for major sites, since the DNS can be used to balance traffic across application infrastructure and point users to the optimal server for the best experience. In Facebook’s case, the nameserver records have a roughly two-day shelf life, meaning that even if Facebook were to move its records to an external service, it could take up to two days for some users to reach the site, as the original nameserver records would continue to persist in the wilds of the Internet until they expire.
So moving to a secondary provider after the incident began wasn’t a practical option for Facebook to resolve the issue. Better to focus on getting the service back up.
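Anyone can check this caching behaviour for themselves. The snippet below, which assumes the dnspython library is available, looks up the NS records for facebook.com and prints the TTL currently being served; the value observed may differ from the roughly two-day figure described above, depending on when and where it is queried.

```python
# Sketch: inspecting how long nameserver (NS) records may be cached, assuming
# the dnspython library is installed. The TTL returned by a local resolver may
# be lower than the authoritative value if the record is already partly aged.
import dns.resolver

answer = dns.resolver.resolve("facebook.com", "NS")
ttl = answer.rrset.ttl
print(f"NS record TTL: {ttl} seconds (~{ttl / 3600:.1f} hours)")
for ns in answer:
    print("nameserver:", ns.target)
```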
Like DNS, BGP is one of those scary acronyms that frequently comes up when any major event goes down on the Internet. And like DNS, it is essential vocabulary for Internet literacy. BGP is the way that traffic gets routed across the Internet. You can think of it as a telephone chain. I tell Sally (my peer) where to reach me. She in turn calls her friends and neighbors (her peers) and tells them to call her if they want to reach me. They in turn call their contacts (their peers) to tell them the same, and the chain continues until, in theory, anyone who wants to reach me has some “path” to me through a chain of connections — some may be long, some short.
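The telephone-chain analogy can be sketched as a toy path-vector propagation, which is the essential idea behind BGP. The model below is deliberately simplified (no routing policies, no withdrawals, no best-path selection beyond “first path heard”), and the network names and peerings are invented for illustration.

```python
# Toy model of the "telephone chain": each network tells its peers the path it
# knows to a destination, and the news spreads hop by hop. This only
# illustrates path-vector routing; it is nothing like a real BGP implementation.
peers = {
    "FacebookAS": ["TransitA", "TransitB"],
    "TransitA":   ["FacebookAS", "TransitB", "ISP1"],
    "TransitB":   ["FacebookAS", "TransitA", "ISP2"],
    "ISP1":       ["TransitA", "ISP2"],
    "ISP2":       ["TransitB", "ISP1"],
}

def propagate(origin: str) -> dict:
    """Spread a route announcement from `origin`; record each network's path to it."""
    routes = {origin: [origin]}
    frontier = [origin]
    while frontier:
        network = frontier.pop(0)
        for peer in peers[network]:
            if peer not in routes:                       # first path heard wins here
                routes[peer] = routes[network] + [peer]  # peer appends itself to the chain
                frontier.append(peer)
    return routes

for network, path in propagate("FacebookAS").items():
    print(f"{network} reaches FacebookAS via: {' -> '.join(reversed(path))}")

# Withdrawal is the reverse: if FacebookAS stops announcing itself to its
# peers, the chain of paths eventually disappears everywhere.
```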
Even if DNS was the domino that toppled it all, and even if a rogue set of BGP withdrawals was the source of that toppling, like any BGP route change, it can be changed again. Or can it? History tells us that the longest-lived and most damaging outages can most often be laid at the feet of some issue with the control plane. Whether through human error or a bug, if the mechanism that lets network operators control the network and make changes to it is damaged or severed, that’s when things can go very, very wrong. Take, for example, the major Google Cloud outage of June 2019. In that incident, which lasted about four hours, a maintenance operation inadvertently took down all of the network controllers for a region of Google’s network. Without the controllers, the network infrastructure was effectively headless and unable to route traffic. Google’s network engineers were unable to quickly bring the network back online because their access to the network controllers depended on the very network that was down.
A lack of access to its network management systems would certainly have prevented Facebook from rolling back any faulty changes. That loss of access could have been due to a network change that was part of the original route withdrawals that precipitated the outage, or it could have been due to a service dependency (for example, if Facebook’s internal DNS was a dependency for access to an authentication service or another key system).
The Facebook suite collapsed like a house of cards on Monday, leading to the roughly six-hour Great Social Media Blackout of 2021. One thing to keep in mind is that such failures do happen in the tech world and cannot be avoided entirely. However, the impact could have been minimized, or to a large extent eliminated. Facebook ran three different apps from one centralized infrastructure, and to top it all, that infrastructure was a single point of failure for users worldwide: when things go wrong, users across the world are impacted.
This outage demonstrates the risk of so much of the Internet depending on one company, a risk that could have been reduced. Within a few hours of the incident, a Twitter user posted the question: “If WhatsApp isn’t back by tomorrow, is it a holiday?” We don’t want this situation to arise again. One possible solution is to segregate infrastructure by country or region, so that the impact of any failure is localized and minimized.
Such segregation also helps respect the local data laws of each country or region. From a user’s perspective, we also need alternatives: it is not healthy for the entire world to depend on apps from one single company, especially when communication and business transactions rely on them.