Facebook Downtime: What Really Happened

Charmie Lyn Flores

Around midday on Monday, October 4, 2021, millions of users discovered they could not use Facebook or its other apps. Facebook, WhatsApp, Instagram, and Oculus were all inaccessible for about six hours. While this isn’t the first outage of its kind, it is by far the most disruptive.

With 2.8 billion active members, Facebook is the most popular social networking platform on the planet. With 2 billion and 1.38 billion users, respectively, WhatsApp and Instagram are close behind. Because many businesses rely on these platforms to communicate with their clients, being unable to use Facebook might result in a financial loss.

Although Facebook and its family of applications are now operational, the impact on the corporation and its consumers may be long-lasting.

What Caused The Outage?

Photo Credit: www.datacenterknowledge.com

When users began to realize that Facebook and Instagram were not loading, theories about the cause began to fly. Some speculated that it was a distributed denial-of-service attack, while others suspected a hostile insider. In a blog post, Facebook described the cause as something far more mundane.

According to the blog post, the outage was caused by configuration changes to Facebook’s backbone routers. The misconfiguration of these routers, which coordinate network traffic between the company’s data centers, triggered a domino effect that disrupted communication across Facebook’s network. The outage even affected Facebook’s internal systems, making it harder for staff to diagnose and resolve the issue.

Facebook hasn’t said what caused the changes in settings, although it may be as simple as a software upgrade gone wrong. A server configuration change was also to blame for the 2019 Facebook outage, which was the company’s worst yet at the time.

Although the most recent outage was shorter, it was more disruptive. Downdetector, a service that tracks outages, logged 10.6 million problem reports, the highest number it has ever recorded.

Now that the details of the outage are known and available, Facebook’s engineering team will assess what went wrong and seek to prevent a similar issue from recurring in the future.

An Extended Delay

Because all of Facebook’s data centers were unreachable, recovery was difficult, and the DNS outage disabled many of the network tools that would normally help engineers investigate and correct the problem.

Because remote management tools were unavailable, staff in the data centers had to diagnose and restart the affected systems manually. “Activating the secure access mechanisms required to get employees onsite and working on the systems took extra time. Only then would we be able to validate the problem and get our backbone back online,” explained Santosh Janardhan, Facebook’s vice president of infrastructure.

Finally, there was the issue of restarting Facebook’s massive worldwide data center network and dealing with an immediate surge in traffic. This problem extends beyond network bottlenecks to data center hardware and power.

“Individual data centers were reporting power usage drops in the tens of megawatts,” Janardhan said. “Suddenly reversing such a reduction in power consumption may put everything from electrical systems to caching at risk.”

The data center industry exists to keep IT equipment running by ensuring that power and network connectivity are always available. Eliminating single points of failure is a core principle, and Monday’s outage demonstrates how hyperscale networks that serve global audiences can also enable massive outages.
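
The point about recovery tools failing along with the service they were meant to repair can be checked ahead of time. The following is a minimal Python sketch, assuming a hand-maintained dependency map. The component names (`remote_console`, `monitoring_dashboard`, `onsite_access`) are hypothetical and are not taken from Facebook's systems. It walks the map to find which recovery tools would still work if a given component went down.

```python
from collections import deque

# Hypothetical dependency map: each key needs every item in its list to work.
# The names are illustrative only and do not describe Facebook's real systems.
DEPENDS_ON = {
    "remote_console": ["dns"],
    "monitoring_dashboard": ["dns"],
    "dns": ["backbone"],
    "backbone": [],
    "onsite_access": [],  # physical access needs no network service
}


def affected_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a failed component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(name)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def usable_tools(failed, tools, depends_on):
    """Return the recovery tools that still work when `failed` is down."""
    broken = affected_by(failed, depends_on)
    return sorted(t for t in tools if t not in broken and t != failed)


if __name__ == "__main__":
    tools = ["remote_console", "monitoring_dashboard", "onsite_access"]
    print(usable_tools("backbone", tools, DEPENDS_ON))
```

In this toy map, a backbone failure leaves only onsite access usable, which mirrors the situation described above. Running a check like this for each critical component is one way to spot recovery tools that share a single point of failure with the systems they are supposed to fix.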

Is User Information Secure?

Given the large number of people who use these platforms, some may be concerned that the outage exposed their personal information. Facebook maintains in an official statement that it has no evidence that user data was compromised. The outage prevented Instagram from loading and left users unable to access Facebook, but it did not expose their personal information.

As more information regarding the outage becomes available, users will be able to determine whether or not their data is safe. There’s no reason for users to be concerned about their data if it’s just a misconfiguration issue caused by a bad update.

Is This Related To Facebook’s Other Recent Issues?

Photo Credit: www.datacenterknowledge.com

This Facebook outage came at a very inconvenient time for the company. Just one day earlier, a former employee alleged that the business tried to hide evidence that it knowingly allowed users to spread misinformation and hate on its platform. Because the two events happened in such quick succession, it is tempting to link them.

If the outage was caused by hacktivists reacting to the news, it might have far-reaching effects. While the False Claims Act contains an anti-retaliation provision to protect whistleblowers, such a costly disruption may prompt Facebook to act. Because the scope of this occurrence is unprecedented, any repercussions and how they might play out in court are unknown.

While the outage’s timing is strange, the occurrences are most likely unrelated. According to Facebook’s explanations, the cause is more likely to be an internal error than an act of hacktivism.

Lessons From The Outage 

This problem is much more serious than people being unable to use Facebook and WhatsApp. While user data appears to be safe, the outage demonstrated how damaging a Facebook outage can be. Because many other sites rely on Facebook for things like user authentication, they may lock people out of numerous accounts if they can’t access Facebook.

Some individuals use Facebook to access smart home devices such as smart TVs and thermostats. As a result, several people were unable to operate basic functions in their homes when the service went down. If a relatively small issue can hamper millions of people’s ability to manage devices in their homes, a larger incident could be disastrous.

This incident demonstrates how many devices and apps rely on a single service nowadays. This allows attackers to inflict extensive damage with a relatively small and targeted attack. Although today’s hyper-connectivity is convenient, it also poses certain concerning hazards.

Caution: Outages Can Affect Anyone

This problem was most likely not caused by cybercrime, but it does highlight what cybercriminals can do. It serves as a stark reminder of the fragility of many of the mechanisms on which people rely.

How To Avoid This Kind Of Disaster?

  • Acknowledge Human Error As A Given And Aim To Compensate For It
Facebook’s infrastructure vice president says engineers were performing routine maintenance. This is reminiscent of an Amazon Web Services (AWS) outage in February 2017 that incapacitated a slew of websites for several hours. Human error contributed to a previous large AWS outage in April 2011, according to the company.
The underlying software should be able to limit the blast radius of any individual command. Facebook had such a control in the form of an audit tool, but a bug prevented the tool from stopping the command. Slack’s outage in January 2021 shows how automation can also cause cascading failures.
  • Conduct Blameless Post-Mortems
Companies that suffer an outage should never point fingers at individuals. Instead, they should consider the bigger picture of which systems and processes could have prevented it, and focus on the technical and organizational means to reduce errors. The guiding question should be: “We’ve already paid for this outage, what benefit can we get?”
  • Stay Away From The Deadly Embrace
That single command sparked a domino effect that shut down the backbone connecting all of Facebook’s data centers. A problem with Facebook’s DNS servers “broke many of the internal tools we’d normally use” to investigate and resolve outages.
There’s a good lesson here: Maintain a deep understanding of dependencies in a network so you’re not caught flat-footed if trouble begins. And have redundancies and fallbacks in place so that efforts to resolve an outage can proceed quickly. The thinking should be similar to how, if a natural disaster takes down first responders’ modern communication systems, they can still turn to older technologies like ham radio channels to do their jobs.
  • Favor Decentralized IT Architectures
It may have surprised many tech industry insiders to discover how monolithic Facebook has been in its IT approach. For whatever reason, the company has wanted to manage its network in a highly centralized manner. This strategy made the outages worse than they should have been.
Another issue was Facebook’s use of a “global control plane” — i.e. a single management point for all of the company’s resources worldwide. With a more decentralized, regional control plane, the apps might have gone offline in one part of the world, say America, but continued working in Europe and Asia. By comparison, AWS and Microsoft Azure use this design and Google has somewhat moved toward it.
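
The regional idea can be illustrated with a small Python sketch. It is an assumption-laden illustration and not a description of how Facebook, AWS, Azure, or Google actually deploy changes. The function `roll_out` and the callbacks `apply_change` and `is_healthy` are hypothetical names. The sketch applies a change to one region at a time and stops as soon as a region reports unhealthy, so the remaining regions are never touched.

```python
def roll_out(change, regions, apply_change, is_healthy):
    """Apply `change` one region at a time and stop at the first unhealthy region.

    Returns (applied, halted_at): the regions that received the change, and the
    region whose health check failed (None if every region stayed healthy).
    """
    applied = []
    for region in regions:
        apply_change(region, change)
        applied.append(region)
        if not is_healthy(region):
            # Stop here so later regions keep running the old configuration.
            return applied, region
    return applied, None


if __name__ == "__main__":
    state = {}  # stand-in for per-region configuration

    def apply_change(region, change):
        state[region] = change

    def is_healthy(region):
        # Toy health check: a region is unhealthy if it holds the bad config.
        return state[region] != "bad-config"

    print(roll_out("bad-config", ["americas", "europe", "asia"], apply_change, is_healthy))
```

With the toy health check above, the bad change reaches only the first region and the rollout halts there. A real system would also need a rollback step and a soak period between regions, which this sketch leaves out.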
Facebook may have suffered the mother of all outages — and back to back at that — but both episodes have provided valuable lessons for other companies to avoid the same fate. These four steps are a great start.
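
Returning to the first step, the idea of limiting the blast radius of a single command can be sketched in Python as a pre-flight check. This is a minimal sketch under stated assumptions and not Facebook's audit tool. The names `check_blast_radius` and `run_command`, and the 10% default limit, are hypothetical. The guard refuses to run any command that targets more than a set fraction of the fleet.

```python
class BlastRadiusExceeded(Exception):
    """Raised when a command would touch more of the fleet than allowed."""


def check_blast_radius(targets, fleet, max_fraction=0.1):
    """Reject a command whose targets exceed `max_fraction` of the fleet."""
    if not fleet:
        raise ValueError("fleet must not be empty")
    unknown = set(targets) - set(fleet)
    if unknown:
        raise ValueError(f"unknown targets: {sorted(unknown)}")
    fraction = len(set(targets)) / len(set(fleet))
    if fraction > max_fraction:
        raise BlastRadiusExceeded(
            f"command touches {fraction:.0%} of the fleet; limit is {max_fraction:.0%}"
        )
    return fraction


def run_command(command, targets, fleet, execute, max_fraction=0.1):
    """Run `execute(command, target)` on each target only if the guard passes."""
    check_blast_radius(targets, fleet, max_fraction)
    return [execute(command, target) for target in targets]


if __name__ == "__main__":
    fleet = [f"router-{i}" for i in range(20)]  # hypothetical router names
    try:
        # Targeting the whole fleet at once is exactly what the guard blocks.
        run_command("drain", fleet, fleet, lambda c, t: f"{c} {t}")
    except BlastRadiusExceeded as err:
        print("blocked:", err)
```

The guard runs before any target is touched, so a command that is too broad fails as a whole and is never partially applied. As the outage showed, a check like this is only as reliable as its own code, so it deserves the same testing as the commands it protects against.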

The Role of AKCP in a Downtime

AKCPro Server

AKCPro Server provides data center managers with a system where they can view the assets, power, connectivity, cooling, and physical security across multiple locations and accurately make changes to their data centers wherever they are located. The remote management and business intelligence capabilities of DCIM software help data center managers achieve their goals of reducing latency while maintaining availability and uptime.

There are many challenges specific to data center management, such as being able to direct technicians to complete changes properly, monitoring data center health across multiple locations, and managing all assets and their connections across the entire data center deployment. Having to manage data centers remotely usually involves a combination of multiple remote management tools, analytic capabilities, and databases. This can easily lead to inaccurate data, incorrect work orders, and poor decision-making.
