AWS down: How a single network outage rippled through businesses, institutions and the economy

Server racks are pictured in a data center. (Credit: Adobe Stock)

Amazon Web Services (AWS) made worldwide headlines on Monday when the service went offline for hours, disrupting popular apps, services and tools from Zoom and Venmo to Snapchat and Reddit.

According to Reuters, it was the biggest internet disruption since a faulty software update from the cybersecurity firm CrowdStrike caused a global outage last year. It was also not the first time the AWS data center in northern Virginia had been implicated in a major outage.

What causes this type of network outage? What impacts can it have on businesses, institutions and the economy? And is there anything we can do to make such outages less likely in the future?

To answer these questions, we spoke with Levi Perigo, professor of computer science and co-director of CU Boulder’s Professional Master's Program in Network Engineering.

What happened and who did it impact?

The AWS outage was widespread and caused major disruptions for many businesses, resulting in what some experts estimate to be hundreds of billions of dollars in economic impact. The issue stemmed from a failure in the Domain Name System (DNS), which acts as the internet’s “phone book.” DNS translates the easy-to-remember web addresses we use, like amazon.com, into the numeric IP addresses computers use to communicate.

When part of AWS’s internal DNS infrastructure went down, many of the services and websites hosted by AWS lost the ability to “find” each other. To users, it appeared that websites and applications were broken or offline. Even though DNS is a relatively simple technology, it’s a fundamental part of how the internet works, so when it fails, the effects can be enormous.
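
To make the "phone book" idea concrete, here is a minimal Python sketch of services finding each other by name. It is a toy model, not how AWS's DNS works: the `ToyDNS` class, the `connect` helper, the `*.internal.example` hostnames and the `192.0.2.x` addresses (a range reserved for documentation) are all hypothetical.

```python
class ToyDNS:
    """A tiny in-memory stand-in for a DNS service (not a real resolver)."""

    def __init__(self, records: dict[str, str]) -> None:
        self.records = dict(records)
        self.available = True  # set to False to simulate the DNS layer failing

    def resolve(self, name: str) -> str:
        """Translate a name into an address, or raise LookupError."""
        if not self.available:
            raise LookupError(f"DNS unavailable, cannot resolve {name}")
        if name not in self.records:
            raise LookupError(f"no record for {name}")
        return self.records[name]


def connect(dns: ToyDNS, name: str) -> str:
    """Look a service up by name, then report what the caller would see."""
    try:
        address = dns.resolve(name)
    except LookupError as exc:
        return f"{name}: unreachable ({exc})"
    return f"{name}: connected to {address}"


dns = ToyDNS({
    "orders.internal.example": "192.0.2.10",
    "payments.internal.example": "192.0.2.20",
})
print(connect(dns, "orders.internal.example"))

# The services themselves are untouched; only the lookup step fails.
dns.available = False
print(connect(dns, "orders.internal.example"))
```

In the second call the "orders" service is still running at the same address, but nothing can reach it by name. That is roughly why a DNS failure can look like a total outage.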

What causes this type of outage to happen?

Outages like this can happen for several reasons, but most often they come down to human or configuration errors that are amplified by the massive scale of operations at companies like AWS. To manage millions of systems efficiently, large cloud providers rely on network automation, which essentially means using software to configure and control their infrastructure.

In this case, it’s likely that a small misconfiguration or script error was deployed across thousands of systems, resulting in a large-scale failure. These incidents highlight the importance of careful testing, validation, and documentation, especially when automation is involved.
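
One common way to limit the reach of a bad automated change is to validate it first and then roll it out in small waves, stopping as soon as something looks unhealthy. The Python sketch below illustrates that general idea only; it does not describe AWS's tooling. The `validate` checks, the config keys, and the `apply` and `is_healthy` callbacks are hypothetical.

```python
from typing import Callable


def validate(config: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not config.get("dns_servers"):
        problems.append("dns_servers must not be empty")
    if config.get("ttl_seconds", 0) <= 0:
        problems.append("ttl_seconds must be positive")
    return problems


def staged_rollout(
    hosts: list[str],
    config: dict,
    apply: Callable[[str, dict], None],
    is_healthy: Callable[[str], bool],
    wave_size: int = 2,
) -> tuple[list[str], bool]:
    """Apply config wave by wave; halt after the first unhealthy wave.

    Returns the hosts that were updated and whether the rollout completed.
    """
    problems = validate(config)
    if problems:
        # Reject the change before it touches any system.
        raise ValueError("; ".join(problems))

    updated: list[str] = []
    for start in range(0, len(hosts), wave_size):
        wave = hosts[start:start + wave_size]
        for host in wave:
            apply(host, config)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            return updated, False  # stop here; later waves are never touched
    return updated, True
```

With six hosts and a wave size of two, a fault that shows up on the first wave leaves four hosts untouched instead of breaking all six. At the scale of thousands of systems, that difference is what keeps a small error small.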

Could it happen again? How vulnerable are we?

Unfortunately, outages like this are always possible. The more we rely on centralized cloud platforms such as AWS, the more we share in their risk. The scale of this week’s disruption shows just how much of the internet depends on a few key providers. While AWS has been strong and reliable overall, no system, no matter how advanced, is immune to failure.

What can we do to prevent this or reduce the risk of it happening again?

There are ways to reduce risk, though it’s difficult to eliminate it completely. One key strategy is called multi-cloud architecture—using multiple cloud providers (such as AWS, Google Cloud, and Microsoft Azure) to host services rather than relying on just one. This approach helps ensure that if one provider experiences an outage, the others can keep systems running.
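
At its simplest, the failover idea can be sketched in a few lines of Python: try the preferred provider first and fall back to the next one if it cannot be reached. This is an illustration under strong assumptions, not a production design. The `call_with_failover` function and the handler callables are hypothetical, and the sketch assumes the same service is already deployed, with its data kept in sync, on every provider, which is often the hard part in practice.

```python
from typing import Callable


def call_with_failover(backends: dict[str, Callable[[], str]]) -> tuple[str, str]:
    """Try each provider in order; return (provider, response) from the first that works."""
    errors = []
    for provider, handler in backends.items():
        try:
            return provider, handler()
        except ConnectionError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{provider}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def aws_handler() -> str:
    # Simulates the primary provider being unreachable.
    raise ConnectionError("name did not resolve")


def gcp_handler() -> str:
    return "200 OK"


provider, response = call_with_failover({"aws": aws_handler, "gcp": gcp_handler})
print(provider, response)
```

Here the request is served by the second provider because the first raised a connection error. If every provider fails, the function raises an error that lists each failure instead of hiding the problem.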

Ultimately, incidents like this remind us that the internet is now critical infrastructure, and its reliability depends not only on technology, but also on careful design, operational discipline, and shared responsibility across providers and customers alike.