Fatal Errors and Fire Drills: Lessons from #FacebookDown 2021

Security, Social Media
Sean Dean
10/14/21

Last Monday I was starting my day by scrolling through Facebook, catching up on all the perfect, creative things my friends had done over the weekend, when the feed stopped working. That happens every so often; either I’m scrolling too fast or my local router needs a reboot. This time was different, though. I refreshed the page and Facebook was gone, DNS not found. What?! So I checked Instagram to see if it was working, then Messenger, then WhatsApp. All down.

As a systems guy, I was now intrigued. I opened up my terminal and ran a dig against facebook.com: no answer. Then I dug Facebook’s nameservers directly and got no answer there either. Facebook was officially down.
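If you want to run the same sanity check yourself without reaching for dig, here’s a minimal sketch in Python (standard library only) that asks the system resolver for each domain and reports whether an answer comes back. The list of domains is just an example.

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        # Ask the operating system's resolver; if DNS is gone,
        # this fails much like the dig queries did.
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # NXDOMAIN, SERVFAIL, or an unreachable resolver all land here.
        return False

for name in ("facebook.com", "instagram.com", "whatsapp.com"):
    print(f"{name}: {'resolves' if resolves(name) else 'no answer'}")
```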

What. 

The. 

Heck. 

By now we all know what happened, and there are some great technical write-ups out there, like these from Cloudflare and Facebook Engineering. What I’d like to talk about, though, is why it took so long for Facebook and their other sites to come back up. Think what you want about Facebook as a company, but one thing is true: they have some really talented systems people. Something extraordinary had to happen to keep them from bouncing the systems back up in under 30 minutes.

Planning for the One in a Trillion Thing

Facebook is a big company with a lot of needs smaller shops do not have, so they’ve been forced (or have chosen) to build a lot of tooling themselves. In a world where mistakes don’t happen that’s a fine choice, but of course we don’t live in that world. Facebook tied all their tooling into the very systems they needed to bring back up, and that was their fatal mistake. For the events that caused the outage to line up in just the right order at the same time is a one-in-a-trillion thing. The more mission-critical your systems are, the more important it is that you plan for the one-in-a-trillion thing.

Somehow, in spite of all their planning, no one at Facebook conceived of their DNS disappearing entirely. That caused a cascade of problems for their teams, from getting into the physical locations where their servers live to logging into the systems themselves. They could have built backdoors that let them bypass the failed systems, but backdoors are security flaws, and someone is sure to exploit them eventually. So while backdoors would have worked, they surely would have caused more problems than they solved. Instead, they should have built out a more heterogeneous, redundant set of tools that doesn’t rely on the same DNS infrastructure, something along the lines of the sketch below.
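To make that concrete, here’s a rough sketch of the redundancy idea, not a description of Facebook’s actual tooling. It assumes the third-party dnspython package (pip install dnspython), and the fallback resolver addresses are placeholders for whatever independent resolvers you actually trust.

```python
import dns.resolver
import dns.exception

# Placeholder out-of-band resolvers, independent of your primary DNS path.
FALLBACK_NAMESERVERS = ["9.9.9.9", "8.8.8.8"]

def resolve_with_fallback(name: str) -> list[str]:
    """Try the normal system resolver first, then independent fallbacks."""
    try:
        answer = dns.resolver.resolve(name, "A")
    except dns.exception.DNSException:
        # Primary path failed: ask resolvers outside the usual chain.
        fallback = dns.resolver.Resolver(configure=False)
        fallback.nameservers = FALLBACK_NAMESERVERS
        answer = fallback.resolve(name, "A")
    return [rr.address for rr in answer]

print(resolve_with_fallback("example.com"))
```

A fallback resolver wouldn’t have saved Facebook’s public DNS, since their authoritative servers themselves dropped off the internet, but the same principle applies to internal tooling: the tools you need during an outage shouldn’t depend on the exact thing that’s broken.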

Scaling Security for your Business

For most companies, using a set of third-party tools that don’t rely on your internal tooling makes a lot of sense. First, the chance that tools from multiple independent services all go down at the same time is extremely low. Second, most companies don’t have the resources to build all their own tooling; spend your valuable time and money building the tools unique to your business and let those who are more adept at the other stuff handle it. Lastly, tool maintenance is time consuming, and you probably don’t have time for it. The cost of running several third-party services will be a fraction of the cost of building and maintaining the tooling yourself. Every company has its own level of comfort with third-party services; find the ones that best fit your security and access needs.

A Case for Creating Chaos on Purpose

Once you have the tools, make sure the people in your organization who need access to them have the proper permissions before something bad happens. Run fire drills so you’re ready to handle the worst. Netflix famously uses its Chaos Monkey tool to force its teams into unexpected failure scenarios. Find a way to replicate that same sort of preparedness in your own environment, perhaps with something like the drill sketched below.
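One cheap way to run that kind of drill is to inject the failure yourself in a controlled setting. The sketch below is plain Python and only an illustration: it temporarily makes every DNS lookup in the process fail so you can watch how your code, and your team, responds.

```python
import contextlib
import socket

@contextlib.contextmanager
def dns_blackout():
    """Temporarily make every DNS lookup in this process fail."""
    original = socket.getaddrinfo

    def broken_getaddrinfo(*args, **kwargs):
        raise socket.gaierror("drill: DNS is unavailable")

    socket.getaddrinfo = broken_getaddrinfo
    try:
        yield
    finally:
        # Always restore the real resolver when the drill ends.
        socket.getaddrinfo = original

# Hypothetical usage inside a test suite or a scheduled drill:
with dns_blackout():
    # Exercise the parts of your system that should survive a DNS
    # outage and confirm they degrade gracefully instead of crashing.
    ...
```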

Major outages are a rarity, especially if your systems team is serious about uptime and security. Having the right setup and preparation for when they do happen will turn a major outage into less of a crisis and get your services back online in a minimal amount of time.