The telecoms network has high standards in terms of availability. For years we’ve talked about “five nines” reliability, meaning that service is available 99.999% of the time – equivalent to about 5 minutes of downtime per year. I wish my internet service was that good.
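If you want to check the "about 5 minutes" figure yourself, the arithmetic is simple. Here is a minimal Python sketch; the function name and the choice of 365.25 days per year are my own assumptions for illustration.

```python
# Convert an availability target into allowed downtime per year.
# Assumption: a year is 365.25 days long.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the minutes of downtime per year implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.1f} minutes/year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the last nine is so much harder to earn than the first.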
The only way to attain such a high standard of availability is to create a resilient network: one designed so that if any single component fails, the network as a whole continues to provide service. In other words, it has in-built redundancy. Typically each function is provided by a redundant pair of components, so that if one fails, the other can continue to provide service.
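To see why a redundant pair helps so much, here is a rough sketch of the textbook calculation. It assumes the components fail independently of each other, which is a big assumption: a shared chassis, power feed or software bug breaks it. The function name and the 99% figure are made up for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of identical components where any one can carry the load.

    Assumption: failures are independent, so the group is down only
    when every copy is down at the same time.
    """
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** copies


# Hypothetical component that is available 99% of the time.
for copies in (1, 2, 3):
    print(copies, "copies:", parallel_availability(0.99, copies))
```

Under that independence assumption, a pair of 99% components gets you to roughly 99.99%. In a real network the correlated failures are the ones that hurt, so treat the result as an upper bound.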
Redundancy in a network is a wonderful thing – and we can all agree that there should be no single point-of-failure that could significantly impact service.
But if you continue down this path, things become less clear. If redundancy is good, more redundancy must surely be better, right?
- Instead of having two of each component, why not have three?
- Or maybe we should have a redundant pair of the redundant pairs? I mean, if I have a Session Border Controller (SBC) that is itself internally redundant, wouldn’t it be better to have two SBCs for greater redundancy?
- And what about geo-redundancy? Should we have two separate locations each with a full set of network infrastructure, in case of a natural disaster? And wouldn’t it be better if these two sites were on opposite sides of the country… for better geo-redundancy?
As you can see, it’s easy to go down a rabbit hole here, and all this additional redundancy adds costs – in terms of equipment but also in terms of operations. You end up with a more complex network, which can become harder to maintain.
So how do you decide “the right” amount of redundancy?
To answer this, you have to focus on the potential failures. What failure scenario are you planning for?
For example, in the case of geo-redundancy, does it matter if your geo-redundant location is 10 miles away compared to 500 miles away? That depends on what potential disaster you are protecting against.
- If you live on a flood plain, and both sites could equally be impacted by the same flood, then 10 miles away is not enough.
- If you live in an earthquake zone, then you probably want enough distance so that a single earthquake couldn’t impact both your locations.
- But if you live in Michigan, a state often ranked among the lowest for natural-disaster risk, then maybe there’s no real benefit to the extra distance.
What about a redundant pair of redundant pairs? For example, you might have an SBC made up of a redundant pair of blades, and you are considering adding a second identical SBC for redundancy. Is that ever warranted?
- We start by looking at possible failure scenarios. Is there anything that’s NOT redundant in a single SBC?
- Does it have redundant IP uplinks going to a redundant pair of routers, redundant power supplies going to redundant power sources?
- What about the physical chassis itself? Do you have a pair of blades inside a single chassis? Is there any possibility that the chassis somehow fails – and if so, what are the odds of that?
- Maybe you’re concerned about a disgruntled employee with a large axe? That’s a perfectly valid failure scenario – which might call for geo-redundant SBCs. (You can think of it more like a natural disaster in terms of the impact.)
Maybe everything about the SBC is fully redundant, but the SBC is running on cheap commodity servers, and once they get old each one has a 10% chance of failing in any given month. At some point one of those servers is going to fail. If it takes a week to replace it, then for that week you are running without redundancy, and there’s roughly a 2.5% chance (a quarter of the monthly 10%) that the remaining server also fails and causes a full outage. For me, that’s an unacceptably high risk.
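Here is that back-of-the-napkin calculation as a small Python sketch. It assumes a constant failure rate over the month and a 30-day month; both are simplifications of mine, and the function name is made up for illustration. The 2.5% figure is the simple linear estimate (10% divided by four weeks); compounding gives a slightly lower number.

```python
def failure_probability(monthly_failure_prob: float,
                        window_days: float,
                        days_per_month: float = 30.0) -> float:
    """Chance that one server fails within a window of `window_days`.

    Assumption: the failure rate is constant, so the chance of surviving
    scales as (monthly survival) ** (fraction of a month).
    """
    monthly_survival = 1.0 - monthly_failure_prob
    return 1.0 - monthly_survival ** (window_days / days_per_month)


# The surviving server during a one-week replacement window.
week_risk = failure_probability(0.10, window_days=7)
print(f"Chance the second server fails within the week: {week_risk:.2%}")
```

Either way you land at roughly one chance in forty of a full outage every time a server has to be replaced.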
You could mitigate this risk by adding more redundancy, or by buying higher quality servers, or by reducing the time it takes to swap out a server, or by having a policy to always replace servers every two years. Or all of the above! You can calculate the impact each of these options would have on your risk – and that math should inform your decisions.
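As a rough illustration of what that comparison might look like, here is a sketch that estimates expected full outages per year for a few options. Everything beyond the 10% monthly failure chance and the one-week replacement time is a hypothetical number chosen for the example (the 2% "better server" rate, the one-day swap, the third server), and the model is deliberately crude: constant failure rates, independent failures, and an outage only if every remaining server fails within one replacement window.

```python
import math

DAYS_PER_MONTH = 30.0   # assumption
DAYS_PER_YEAR = 365.0   # assumption


def daily_failure_rate(monthly_failure_prob: float) -> float:
    """Constant daily failure rate that matches the given monthly failure chance."""
    return -math.log(1.0 - monthly_failure_prob) / DAYS_PER_MONTH


def expected_outages_per_year(monthly_failure_prob: float,
                              replace_days: float,
                              servers: int = 2) -> float:
    """Rough expected number of full outages per year.

    An outage is counted when, after a first failure, all remaining
    servers also fail before the replacement arrives.
    """
    rate = daily_failure_rate(monthly_failure_prob)
    first_failures_per_year = servers * rate * DAYS_PER_YEAR
    p_one_more_fails = 1.0 - math.exp(-rate * replace_days)
    p_all_remaining_fail = p_one_more_fails ** (servers - 1)
    return first_failures_per_year * p_all_remaining_fail


# Hypothetical options to compare against the baseline from the text.
options = {
    "baseline (10%/month, 7-day swap)": dict(monthly_failure_prob=0.10, replace_days=7),
    "faster swap (1 day)": dict(monthly_failure_prob=0.10, replace_days=1),
    "better servers (2%/month)": dict(monthly_failure_prob=0.02, replace_days=7),
    "third server": dict(monthly_failure_prob=0.10, replace_days=7, servers=3),
}

for name, params in options.items():
    print(f"{name}: {expected_outages_per_year(**params):.4f} outages/year")
```

The absolute numbers matter less than the ranking. Put a rough cost next to each option and you have the beginnings of a business case.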
Stop with all the math! What’s your point?
I actually have two points… sorry.
- Don’t assume that more redundancy is better. Extra redundancy costs money and it costs resources (to implement and maintain), so you need to evaluate whether additional redundancy is justified.
- Your evaluation should consider the following:
  - What failure scenarios are possible that you should be planning for?
  - How likely are these scenarios? You won’t be able to calculate this perfectly, but even some back-of-the-napkin math will help you see where the biggest risks lie – and also the benefit you’ll receive from adding the additional redundancy.
Your evaluation may show that the additional redundancy is not required, or it may provide compelling data that will help make a case to the executive team that this is an important investment. So I’m sorry, but doing the math is kind of important.
I should perhaps admit to you all that I have a math degree which provided me with skills that I never use (and, to be honest, which I have mostly forgotten) – and so I’m always a sucker for any situation that requires more math!
Our goal is to help all our clients offer great services on a resilient network – so if you’re considering some improvements to your network architecture, feel free to reach out and we can help you figure out the appropriate level of resilience and redundancy.