On the morning of Sunday, August 30, CenturyLink’s IP network was reported to be down for about 4.5 hours – probably concentrated in the part of the network that came with their acquisition of Level 3.
From what I’ve read (simplifying greatly), a CenturyLink customer had asked them to block a certain IP address from reaching the customer’s network – which seems like a very routine request for anyone used to dealing with DoS or hacking attempts. Except that someone entered the configuration update incorrectly. Apparently…
“this request was accidentally implemented with wildcards, rather than isolated to a specific IP address” (ThousandEyes CenturyLink Outage analysis)
I don’t know a lot about BGP routing, but I’m pretty sure that if you are supposed to block a specific IP and use a wildcard instead, you end up blocking ALL IP addresses. Which would be bad.
This request was distributed across all the routers in the network, which (presumably) could then no longer be reached because they were blocking all traffic. Fantastic.
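To make the difference concrete, here is a minimal sketch in Python of a block rule for one address versus a wildcard rule. It is a simplified model, not the actual router configuration involved in the outage; the `blocked` helper and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration.

```python
import ipaddress

def blocked(rule, source_ip):
    """Return True if source_ip falls inside the blocked prefix."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule)

# Hypothetical rules: one address the customer wanted blocked vs. a wildcard.
intended_rule = "203.0.113.7/32"  # exactly one IPv4 address
wildcard_rule = "0.0.0.0/0"       # every IPv4 address

# Hypothetical traffic sources: one attacker, two legitimate hosts.
sample_sources = ["203.0.113.7", "198.51.100.20", "192.0.2.1"]

intended_hits = [ip for ip in sample_sources if blocked(intended_rule, ip)]
wildcard_hits = [ip for ip in sample_sources if blocked(wildcard_rule, ip)]

print("intended rule blocks:", intended_hits)
print("wildcard rule blocks:", wildcard_hits)
```

The intended rule catches only the one address, while the wildcard catches every source in the list, including the hosts you would need in order to log in and undo the change.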
It’s always fun to read outage reports – when it’s someone else who had the outage – and see what we can learn from them. In this case there are two big lessons.
Lesson 1: Even a customer-specific change to the core network can cause big problems
What could be more mundane than blocking a certain IP from reaching a particular customer? I’m sure many of you update ACLs on your firewalls all the time.
It’s easy to be complacent about changes that only impact a specific customer – but anytime you’re modifying the core network there’s a risk that something big can go wrong.
Obviously using a wildcard instead of the specific IP address is really bad, and I’ve also seen several issues where an IP address conflict led to a big outage.
It’s not a big deal to add a new device to the network. But if you accidentally reuse the IP of a critical piece of equipment…
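An IP conflict like this is cheap to catch before the new device goes live. The following is a rough sketch in Python, assuming you keep some kind of device inventory; the `find_ip_conflicts` helper, the device names, and the addresses are all made up for illustration.

```python
from collections import defaultdict

def find_ip_conflicts(inventory):
    """Map each IP claimed by more than one device to the devices claiming it."""
    devices_by_ip = defaultdict(list)
    for device, ip in inventory.items():
        devices_by_ip[ip].append(device)
    return {ip: names for ip, names in devices_by_ip.items() if len(names) > 1}

# Hypothetical inventory: the new server reuses the core router's address.
inventory = {
    "core-router-1": "10.0.0.1",
    "sbc-1": "10.0.0.20",
    "new-test-server": "10.0.0.1",
}

conflicts = find_ip_conflicts(inventory)
print(conflicts)
```

A check like this only helps if the inventory is accurate, so it complements a review step rather than replacing it.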
Action: Always write a MOP (method of procedure) for any core network maintenance, and always have someone review it.
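Part of that review can be automated. Below is a minimal sketch in Python of a pre-change check that flags block rules that are suspiciously broad. The `review_block_rule` helper and the `MIN_PREFIX_LEN` thresholds are hypothetical policy choices, not anything CenturyLink is known to use.

```python
import ipaddress

# Hypothetical policy: the broadest block rule allowed without extra sign-off.
MIN_PREFIX_LEN = {4: 24, 6: 48}

def review_block_rule(prefix):
    """Return a list of problems found in a proposed block rule (empty if none)."""
    try:
        network = ipaddress.ip_network(prefix, strict=True)
    except ValueError as exc:
        return [f"not a valid prefix: {exc}"]
    limit = MIN_PREFIX_LEN[network.version]
    if network.prefixlen < limit:
        return [
            f"{network} is broader than /{limit} "
            f"and would block {network.num_addresses} addresses"
        ]
    return []

# Hypothetical rules from a change request: one sane, one wildcard, one typo.
for rule in ["203.0.113.7/32", "0.0.0.0/0", "*"]:
    print(rule, "->", review_block_rule(rule) or "ok")
```

A script like this does not replace a second pair of eyes on the MOP, but it turns "someone typed a wildcard" into a failed check instead of an outage.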
Lesson 2: You always need a backup route
If you had SIP trunks to Level 3 (now CenturyLink) this weekend, I’m guessing they didn’t work too well. If you had SIP trunks to anyone else and your only connection to the internet backbone was via CenturyLink, then I’m guessing they also didn’t work too well.
For critical traffic, you can’t rely on a single provider – even if they’re providing a supposedly redundant network.
This applies both to your IP-based carrier internet connections AND to your voice trunks (whether SIP or TDM-based). You can’t assume that outages won’t happen. You need to plan for them.
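One way to start planning is to list every path your calls can take and ask which ones survive if a given carrier disappears. Here is a minimal sketch in Python, assuming each path is described by its trunk provider and the internet carrier it rides on; the `Path` class, the `surviving_paths` helper, and the names "Provider B" and "Carrier C" are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    trunk_provider: str    # who terminates the SIP or TDM trunk
    internet_carrier: str  # whose network carries the traffic to them

def surviving_paths(paths, failed_carrier):
    """Return the paths that do not depend on the failed carrier at all."""
    return [
        p for p in paths
        if failed_carrier not in (p.trunk_provider, p.internet_carrier)
    ]

# Two trunk providers, but both reached over the same carrier.
paths = [
    Path("CenturyLink", "CenturyLink"),
    Path("Provider B", "CenturyLink"),
]

# The same setup plus a second internet connection from a different carrier.
paths_with_backup = paths + [Path("Provider B", "Carrier C")]

print(surviving_paths(paths, "CenturyLink"))
print(surviving_paths(paths_with_backup, "CenturyLink"))
```

In the first setup nothing survives a CenturyLink outage even though there are two trunk providers; in the second, one path remains. Real dependencies go deeper than this (shared conduits, shared upstream transit), so treat it as a starting checklist rather than a full audit.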
If you have no plan, and would like us to perform an audit of your voice network to identify common issues, drop us a line and we can talk.