What caused the internet outage that brought down Amazon, Reddit and Gov.uk?

For 45 minutes in the UK morning, a significant chunk of the web stopped working. People trying to visit a huge array of websites, from the Guardian through Gov.uk to Reddit, Hulu and the White House, received a blank white page and an error message telling them the connection was unavailable.

The errors were focused on large websites with substantial traffic, but weren’t universal: users in some places, such as Berlin, Germany, reported no problems throughout the outage.

The cause of the outage was quickly identified as a problem with the “edge cloud” provider Fastly. Within a few minutes, the company admitted on a status page that it was experiencing problems. With the exception of a few providers, including the BBC, which had backup systems in place, every affected website had to wait for Fastly to fix the error before service could be restored.

The company offers a content delivery network service, or CDN. When it works, a CDN is supposed to improve the speed and reliability of the internet. Rather than visitors to a website all having to connect to servers run by that company – which might not even be in the same country as they are – they instead contact Fastly, which runs huge server farms all around the world that host copies of their clients’ websites.

That means that the page loads faster for the user, because the physical signals don’t have to travel as far. It also improves the reliability of the website, by ensuring that if there’s a big spike in traffic, it first hits Fastly’s servers, which are designed to handle a lot of traffic.

In normal times, yes. The company is one of a few major CDN providers: others include Cloudflare and Amazon’s CloudFront. But to give a sense of how well respected Fastly is, Amazon’s own retail website actually runs through Fastly, rather than CloudFront, and has done since May 2020.

We still don’t know the exact details. A Fastly spokesperson said: “We identified a service configuration that triggered disruptions across our POPs” – points of presence, the worldwide network of server farms that Fastly runs – “globally and have disabled that configuration. Our global network is coming back online.” It seems likely that the problem will prove to be a simple configuration error that led to a cascading failure, in which one small problem triggers a bigger one, which triggers an even bigger one, and so on.

With Fastly blaming the outage on a “service configuration” and no further evidence to the contrary, it is vanishingly unlikely that the problems were the result of a malicious attack. The investigation into a similar error at Cloudflare last year should give an idea of the sort of problems that could happen: there, a single error on a physical link between Newark and Chicago caused that connection to fail, which led to traffic overloading a connection between Atlanta and Washington DC. An emergency change to try to deal with that overload instead sent all traffic from the entire network to the Atlanta datacentre, which itself failed, bringing the entire system down.

The growing need for speed online has led to a serious concentration of internet infrastructure in the hands of just a few companies. One choke point is content delivery networks, like those operated by Fastly and Cloudflare. Another is cloud hosts, like Amazon’s AWS, Microsoft’s Azure, and Google Cloud Platform. Those providers fail rarely, because they are large, specialist services which devote huge resources to resilience and reliability. But occasionally, often through human error, they do fail, and can bring huge numbers of sites down with them.

It is possible for a site to run on two or more providers, to provide a backup in case one fails, but doing so is expensive, technically complex, and still unlikely to prevent short-term outages. Gov.uk, for instance, ran a backup CDN on Amazon’s CloudFront service – but switching to the backup required manual intervention.
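The backup arrangement described above can be sketched in a few lines. The provider names, status values and the health check are all hypothetical stand-ins; the point is that even with a healthy backup configured, traffic keeps hitting the failed primary until an operator flips the switch:

```python
# Hedged sketch of a manual CDN failover, as in Gov.uk's setup.
# Provider statuses here are pretend data, not real status feeds.

PROVIDERS = {"primary_cdn": "down", "backup_cdn": "up"}

def healthy(provider):
    """Stand-in for a real health check against the provider."""
    return PROVIDERS[provider] == "up"

def serving_cdn(primary, backup, operator_approved_switch):
    """Which provider actually serves traffic right now."""
    if healthy(primary):
        return primary
    # The backup only takes over once a human approves the switch.
    if operator_approved_switch and healthy(backup):
        return backup
    return primary  # still pointing at the failed provider

print(serving_cdn("primary_cdn", "backup_cdn", operator_approved_switch=False))  # primary_cdn
print(serving_cdn("primary_cdn", "backup_cdn", operator_approved_switch=True))   # backup_cdn
```

An automatic health-checked switch would remove the human step, but it adds its own risks – a flapping check can bounce traffic between providers – which is part of why multi-CDN setups are technically complex as well as expensive.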
