• CERTIFICATE
    • Eminent VARs of India
    • Best OEM 2023
  • SYNDICATION
    • AMD
    • DELL TECHNOLOGIES
    • HITACHI
    • LOGMEIN
    • MICROSOFT
    • RIVERBED
    • STORAGECRAFT
    • THALES
  • EVENTS
  • GO DIGITAL
  • INFOGRAPHICS
  • PRESS
    • Press Release PR News Wire
    • Press Release Business Wire
    • GlobeNewsWire
  • SPECIAL
    • WHITE PAPER
    • TECHNOMANIA
    • SME
    • SMART CITY
    • SERVICES
    • EDITOR SPEAK
    • CSR INITIATIVES
    • CHANNEL GURU
    • CHANNEL CHIEF
    • CASE STUDY
  • TECHTREND
    • VAR PANCHAYAT
    • TELECOM
    • SOFTWARE
    • POWER
    • PERIPHERALS
    • NETWORKING
    • LTE
    • CHANNEL BUZZ
    • ASK AN EXPERT
  • SUBSCRIBE
  • Apps
  • Gaming
  • KDS
  • Security
  • Telecom
  • WFH
  • Subscriber to Newsletter
  • April Issue
  • Blogs
  • Vlogs
  • Faceoff AI
    

HOME
NEWS

Why operators went down or slow due to the facebook outage


By VARINDIA - 2021-10-06
Why operators went down or slow due to the facebook outage

On October 4, 2021, Facebook experienced a prolonged outage preventing users from around the globe from reaching its services.

 

We live in a collaborative and interactive world which brings endless possibilities. Mobile operators form the pillars of a connected and efficient world. There are zillion advantages of being connected. However, yesterday was a day when we realized a potential downside.

 

 

Each of those nameservers is covered by a different IP prefix, or Internet “route” (more on that later), covering a range of IP addresses.

 

At approximately 15:40 UTC, Facebook’s service started to go offline, as users were unable to resolve its domains to IP addresses through the DNS.

 

Does that mean we do not connect or collaborate? No, it means we learn from the incident and handle it better in future. Here is my take on what happened and the possible remedies:

 

The Issue

 

1. The authoritative DNS (Domain Name System) nameservers of Facebook became unreachable.

2. It meant - when a user sent a DNS query on the Internet for Facebook, there was no response.

3. Thus, for a normal user, applications like Facebook, Instagram and WhatsApp did not work.

 

The Impact on Operators

 

1. Since Facebook applications were not responding, DNS queries began to accumulate (as the earlier queries were still open).

2. The surge also occurred as users kept trying to connect which generated more DNS queries per second than usual.

3. The above led to an overall slowness of the Internet as all DNS queries were taking more time.

4. In a few cases, it led to downtime as the local DNS of the internet service provider could not manage the surge.

 

Learning

 

1. Use DNS anycast architecture to spread the surge on all sites rather than managing it individually.

2. Put relevant DNS security policies in place like ‘Per Subscriber Limit’ to manage surge efficiently.

3. Organise a Quick Response Team to make short-term configuration changes like reduction of time-out values.

4. Have sufficient buffer DNS capacities to manage surge (as these incidents occur frequently).

5. Create a balanced DNS architecture with fallbacks in place.

 

Now, a word on the DNS. The DNS is so critical to the reachability of sites and web applications, that most major service providers don’t mess about with it. For example, Amazon stores the authoritative DNS records for amazon.com not on its own infrastructure (which, as one of the top public cloud providers, is amongst the most heavily used in the world), but on two separate external DNS services, Dyn (Oracle) and UltraDNS (Neustar) - even though Amazon AWS offers its own DNS service.

 

People and businesses around the world who depend on Facebook,WhatsApp and Instagram apps faced the outage across our platforms. Even thoe the systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems used in day-to-day operations.

 

Why didn’t Facebook move their DNS records to an external DNS service provider and get their services back online?

 

Nameserver records, which are served by top level domain (TLD) servers (in this case, com. TLD) can be long-lived records - which makes sense given that app and site operators are not frequently moving their records around - unlike A and AAAA records, which often change very frequently for major sites, as the DNS can be used to balance traffic across application infrastructure and point users to the optimal server for their best experience. In the case of Facebook, their nameserver records have a two day shelf life (see figure 5), meaning that even if they were to move their records to an external service, it could take up to two days for some users to reach Facebook, as the original nameserver records would continue to persist in the wilds of the Internet until they expire.

 

 

So moving to a secondary provider after the incident began wasn’t a practical option for Facebook to resolve the issue. Better to focus on getting the service back up.

 

Like DNS, BGP is one of those scary acronyms that frequently comes up when any major event goes down on the Internet. And like DNS, it is essential vocabulary for Internet literacy. BGP is the way that traffic gets routed across the Internet. You can think of it as a telephone chain. I tell Sally (my peer) where to reach me. She in turn calls her friends and neighbors (her peers) and tells them to call her if they want to reach me. They in turn call their contacts (their peers) to tell them the same, and the chain continues until, in theory, anyone who wants to reach me has some “path” to me through a chain of connections — some may be long, some short.

 

Even if DNS was the domino that toppled it all, and even if a rogue set of BGP withdrawals was the source of that toppling, like any BGP route change, it can be changed again. Or can it? History tells us that the longest lived and most damaging outages can most often be laid at the feet of some issue with the control plane. Whether through human error or bug, if the mechanism for network operators to control the network — to make changes to it — is damaged or severed, that’s when things can go very very wrong. Take the aforementioned Google outage. In that incident, which lasted about four hours, a maintenance operation inadvertently took down all of the network controllers for a region of Google’s network. Without the controllers, the network infrastructure was effectively headless and unable to route traffic. Google network engineers were unable to quickly bring the network back online because their access to the network controllers depended on the very network that was down.

 

Lack of access to the network management system would certainly have prevented Facebook from rolling back any faulty changes. Access could have been due to some network change that was part of the original route withdrawals that precipitated the outage, or it could have been due to a service dependency (for example, if their internal DNS was a dependency for access to an authentication service or other key system).

 

The Facebook suite collapsed like a house of cards on Monday which led to the Great Social Media Blackout of 2021 for 6 hours. One thing to keep in mind is that such things do happen in the tech world and it cannot be avoided. However, the impact could have been minimized or eliminated to a large extent. Facebook ran three different apps from one centralized infrastructure. To top it all, it was a single point of failure for users worldwide. If things go wrong, users across the world are impacted.

 

This outage demonstrates the risks of the whole Internet being dependent on one company and this could have been minimized. Within few hours of the incident, a Twitter user posted a question: “If WhatsApp isn’t back by tomorrow, is it a holiday?” We don’t want this situation to arise again. A possible solution to this is to segregate infrastructure by country/region so that the impact in case of failure is minimized/localized.

 

It also helps respect the local data laws of that particular country/region. From a user’s perspective, we need to have alternatives available. It is not good for the entire world to be dependent on apps from one single company. Especially when communication, business transactions are dependent on them.

See What’s Next in Tech With the Fast Forward Newsletter

SECURITY
View All
Zscaler announces AI innovations to its Data Protection Platform
Technology

Zscaler announces AI innovations to its Data Protection Platform

by VARINDIA 2024-05-20
SHIELD to enhance Swiggy’s fraud prevention and detection capabilities
Technology

SHIELD to enhance Swiggy’s fraud prevention and detection capabilities

by VARINDIA 2024-05-20
Axis Communications announces its first thermometric camera designed for Zone/Division 2
Technology

Axis Communications announces its first thermometric camera designed for Zone/Division 2

by VARINDIA 2024-05-20
SOFTWARE
View All
Hitachi Vantara and Veeam announce Global Strategic Alliance
Technology

Hitachi Vantara and Veeam announce Global Strategic Alliance

by VARINDIA 2024-05-16
Adobe launches Acrobat AI Assistant for the Enterprise
Technology

Adobe launches Acrobat AI Assistant for the Enterprise

by VARINDIA 2024-05-11
Oracle Database 23ai offers the power of AI to Enterprise Data and Applications
Technology

Oracle Database 23ai offers the power of AI to Enterprise Data and Applications

by VARINDIA 2024-05-10
START - UP
View All
Data Subject Access Request is an integrated module within ID-REDACT®
Technology

Data Subject Access Request is an integrated module within ID-REDACT®

by VARINDIA 2024-04-30
SiMa.ai Secures $70M Funds from Maverick Capital
Technology

SiMa.ai Secures $70M Funds from Maverick Capital

by VARINDIA 2024-04-05
Sarvam AI collaborates with Microsoft to bring its Indic voice LLM to Azure
Technology

Sarvam AI collaborates with Microsoft to bring its Indic voice LLM to Azure

by VARINDIA 2024-02-08

Tweets From @varindiamag

Nothing to see here - yet

When they Tweet, their Tweets will show up here.

CIO - SPEAK
Automation has the potential to greatly improve efficiency and production

Automation has the potential to greatly improve efficiency and production

by VARINDIA
Various approaches are followed to enhance efficiency, productivity, and cost-effectiveness

Various approaches are followed to enhance efficiency, productivity, and cost-effectiveness

by VARINDIA
Technology can be leveraged in several ways to boost efficiency, productivity and reduce cost

Technology can be leveraged in several ways to boost efficiency, productivity and reduce cost

by VARINDIA
Start-Up and Unicorn Ecosystem
GoDaddy harnesses AI power for new domain name recommendations

GoDaddy harnesses AI power for new domain name recommendations

by VARINDIA
UAE’s du Telecom selects STL as a strategic fibre partner

UAE’s du Telecom selects STL as a strategic fibre partner

by VARINDIA
JLR and Dassault Systèmes extend partnership for All Vehicle Programs worldwide

JLR and Dassault Systèmes extend partnership for All Vehicle Programs worldwide

by VARINDIA
Rapyder partners with AWS to accelerate Generative AI led innovation

Rapyder partners with AWS to accelerate Generative AI led innovation

by VARINDIA
ManageEngine integrates its SIEM solution with Constella Intelligence

ManageEngine integrates its SIEM solution with Constella Intelligence

by VARINDIA
Elastic replaces traditional SIEM game with AI-driven security analytics

Elastic replaces traditional SIEM game with AI-driven security analytics

by VARINDIA
Infosys and ServiceNow to transform customer experiences with generative AI-powered solutions

Infosys and ServiceNow to transform customer experiences with generative AI-powered solutions

by VARINDIA
Crayon Software Experts India inaugurates its ISV Incubation Center in Kolkata

Crayon Software Experts India inaugurates its ISV Incubation Center in Kolkata

by VARINDIA
Dassault Systèmes to accelerate EV charging infrastructure development in India

Dassault Systèmes to accelerate EV charging infrastructure development in India

by VARINDIA
Tech Mahindra and Atento to deliver GenAI powered business transformation services

Tech Mahindra and Atento to deliver GenAI powered business transformation services

by VARINDIA
×

Reproduction in whole or in part in any form or medium without express written permission of Kalinga Digital Media Pvt. Ltd. is prohibited.

  • Distributors & VADs
  • Industry Associations
  • Telco's in India
  • Indian Global Leaders
  • Edit Calendar
  • About Us
  • Advertise Us
  • Contact Us
  • Disclaimer
  • Privacy Statement
  • Sitemap

Copyright varindia.com @1999-2024 - All rights reserved.