Today we’ll look at how our best intentions can often lead us down the wrong path, and a surprising approach to tackling the issues that exist in our networks.
Imagine there’s a lake, and the lake is full of water. The sun is out, the lake is calm, and so you decide to sail your boat over the lake. However, halfway across the lake your boat hits some large rocks, gets a hole, starts taking on water, and you drown.
The obvious lesson from this story is to watch out for the rocks. The second lesson is that it would have been much safer if there had been less water in the lake.
It’s possible the metaphor is starting to break down at this point, so let me tie this back to real life.
- The rocks are problems in your network.
- The water level is anything that hides or minimizes those problems, such that they don’t get resolved – particularly in those cases where the problem is hidden by an abundance of something else.
I’m guessing this is still a bit fuzzy, so let me give you some examples that relate more to telecoms.
| Problem | Hidden by… | i.e. an abundance of… |
| --- | --- | --- |
| There’s no documentation for any of our processes | Expert employees who’ve been doing this for 10 years | experience |
| It’s really hard to configure Hosted Business services | A large customer support team who spend a lot of time with each customer to help them set everything up | support resources |
| A particular line card occasionally reboots | Automatic protection switching to a backup | backup hardware |
| A trunk group keeps going out of service | An automatic overflow route | backup routes |
| The IP network does nothing to guarantee quality of service | Plenty of bandwidth to go around | bandwidth |
The above examples work well with the idea of abundance as described by the metaphor of the lake and the rocks, but you can also apply the idea more generally to any scenario where a problem may be hidden. Perhaps excessive minor alarms are hiding one important alarm, or perhaps a customer stops reporting problems because nothing ever seems to get fixed, or perhaps backups have failed and no-one notices until they’re urgently needed.
Why should we care?
This is where things get interesting. We obviously want to provide our customers with a high quality product, and certainly we’d rather protect our customers from a problem than expose them to an outage or a frustrating experience. However, there’s a real danger that we don’t take these problems seriously, or even notice them, because they’re hidden – and that’s where we run into trouble.
Hide a problem from the customer: Good.
Hide a problem from ourselves: Bad!
- If you have an experienced employee who does her job well, and the instructions for all her tasks are in her head, then that works – kinda – for now. But…
  - What happens if she leaves or gets promoted?
  - What happens if there are three people doing the same role, and one of them fixes a flaw in the process – how will you make sure that everyone does it the new, better way in the future?
- If your provisioning process is really confusing, and customers need to spend a lot of time working with your support staff to configure things the way they want, then some customers may go away excited about the awesomeness of your support team. But…
  - Your pricing may be uncompetitive because you have to spend so much on support personnel.
  - Other customers may give up in frustration because it doesn’t work “out of the box”.
  - Customers may experience a high volume of future trouble tickets because they each have a custom solution, configured according to each CSR’s tastes.
- If all hardware problems are masked through redundancy, then that’s great. But…
  - You need to be analyzing and fixing the cause of the hardware problem or else it will keep happening – and your redundancy model only works if each individual failure is rare.
- If you have overflow routes for every trunk group then that’s great. But…
  - Is this hiding a quality or capacity problem on the primary trunk group?
  - Are you paying higher fees for calls on the backup trunk group?
  - Does your test plan validate that all call types work equally well on both routes? What if international calls only work on the primary route?
- If you resolve any complaints about the quality of the IP network by purchasing more bandwidth, then…
  - Your costs will continually rise.
  - You’ll never be able to give any cast-iron guarantees about service quality, even to your most valuable or high-profile customer.
  - Your customers will continually find new ways to use the extra bandwidth after each upgrade, only to become frustrated when they hit the next capacity constraint.
  - [Hat tip to Martin Geddes at this point, who’s done some incredible work (with his colleagues) on the theory and practice of network quality, and why the internet is fundamentally broken.]
The fundamental lesson is this:
If you care about quality, you need to find ways to highlight problems – to bring them to the surface so they can’t be ignored.
So how do we get started?
Unfortunately, this doesn’t come naturally to most organizations. In part that’s because most people have a natural inclination towards an easy life – and while fixing obvious problems fits coherently within that mindset (problems make life bad, fixing them makes life easier again), seeking out problems that are below the surface just seems like a lot of work: “stirring up trouble where there isn’t any”.
The other challenge is that this new approach to exposing and analyzing problems makes people’s mistakes more visible. It takes a great deal of self-confidence and security to feel comfortable talking about mistakes that you made, and their negative consequences. People would much rather avoid the level of deep scrutiny found in the 5 Whys approach I discussed previously, but in order to fully resolve issues we need to be able to have those conversations.
Therefore, it’s important that any attempt to apply Lean concepts to an organization is accompanied by a clear commitment from management to developing people (the third pillar of Lean, after quality and eliminating waste).
If people think they’re going to get disciplined for any mistake, then they’re going to hide their mistakes. However, if you create a culture that values people, sees any errors as a failure of the process (“how can we improve the process/tools so that it’s harder to make this mistake next time?”), and continually looks for opportunities to develop employees, then you can truly have open, inquisitive conversations about problems in your network – which will allow you to create a level of quality previously unseen in your organization.
#1. Take trouble tickets seriously
I’ve already explored this in depth in my article on the 5 Whys approach, so suffice it to say that each trouble ticket is a gift highlighting a flaw in our process. Therefore we need to dig deep into our trouble tickets – it’s not enough to simply make the problem go away; we need to understand how it came to occur in the first place and fix it so it never happens again.
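To make “digging deep” a little more concrete, here’s a minimal sketch (Python, with an entirely hypothetical ticket and field names, not taken from any real ticketing system) of a root-cause record that refuses to count a ticket as done until a chain of whys and a process fix have been captured:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RootCauseAnalysis:
    """Hypothetical record attached to a trouble ticket once service is restored."""
    ticket_id: str
    symptom: str
    whys: List[str] = field(default_factory=list)   # the chain of "why?" answers
    process_fix: Optional[str] = None               # what changes so it can't recur

    def is_complete(self) -> bool:
        # Don't let a ticket close on "service restored" alone:
        # require a reasonable why-chain and a concrete process fix.
        return len(self.whys) >= 3 and self.process_fix is not None

rca = RootCauseAnalysis(
    ticket_id="TT-1042",
    symptom="Customer calls failing on trunk group 7",
)
rca.whys += [
    "Overflow route was carrying all traffic",
    "Primary route was out of service",
    "A config change last week removed the route's codec list",
]
rca.process_fix = "Add codec-list validation to the change checklist"
assert rca.is_complete()
```

The exact shape doesn’t matter; the point is that “service restored” on its own isn’t enough to close the record.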
#2. Run an alarm-free system
The alarms on your switch and other network elements are intended to provoke action. Yet it’s far too easy to end up with a system that has twenty, fifty or one hundred active alarms at any given time – which means you have a ton of known quality issues to address AND it makes it hard to spot a super-important alarm among all the fluff.
Therefore, I’d strongly encourage you to evaluate every alarm that occurs and decide how to act. I generally recommend one of the following choices:
- Investigate & Fix: the alarm represents a real problem. You need to understand the problem then fix it. The alarm should remain active until the problem is fixed.
- Hide: some alarms might be expected in the correct operation of your network. For example, a SIP subscriber might be alarmed when it’s unregistered – and if this is a portable soft-client that might be totally normal. For categories of alarms that are “expected”, I recommend modifying the alarm filters so they don’t appear on your dashboard. I have some additional thoughts about subscriber-level alarms in #7 below.
- Modify thresholds: you may have statistical alarms that trigger when (e.g.) the number of circuits in use on a trunk group hits a certain threshold. When one of these alarms triggers, either decide that the threshold is too conservative and raise it, or – if the alarm represents a genuine problem – “investigate and fix” (by increasing capacity or modifying traffic flows) as described above.
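To make the triage concrete, here’s a small Python sketch. The alarm feed, field names, and categories are all invented for illustration (this is not a real Metaswitch or EMS API); the point is simply that every active alarm gets exactly one of the three dispositions:

```python
# Hypothetical alarm triage, following the three choices described above.

EXPECTED_CATEGORIES = {"SIP subscriber unregistered"}   # e.g. portable soft-clients

def triage(alarm: dict) -> str:
    """Return one of 'investigate-and-fix', 'hide', or 'review-threshold'."""
    if alarm["category"] in EXPECTED_CATEGORIES:
        return "hide"                    # filter off the dashboard
    if alarm.get("type") == "statistical":
        return "review-threshold"        # raise the threshold, or treat as a real problem
    return "investigate-and-fix"         # stays active until the root cause is fixed

active_alarms = [
    {"category": "SIP subscriber unregistered", "type": "state"},
    {"category": "Trunk group utilization high", "type": "statistical"},
    {"category": "Line card reboot", "type": "state"},
]
for alarm in active_alarms:
    print(alarm["category"], "->", triage(alarm))
```

Anything that lands in “investigate-and-fix” stays visible until the underlying problem is gone.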
Personally I like the idea of having the alarm dashboard displayed on a large screen in the NOC for everyone to see, and then each morning you can review active alarms and the ongoing actions to resolve them as part of your stand-up meeting.
#3. Have a manager follow the documentation
This idea assumes that you’ve already documented your processes, in which case it’s great for a team manager to try to follow his/her team’s documentation – this removes the abundance of experienced team members performing the work.
There are a few benefits to this approach:
(a) It tests the documentation.
(b) It helps the manager to better understand some of the detailed tasks performed by the team.
(c) Even if the documentation is perfect, the manager may have a broader understanding of how the process fits into the organization as a whole – and so may have ideas for how to improve things (which should be discussed with the team members – the experts in the process).
#4. Designate a problem-solver
Another way to lower the water level is to take one person out of a team’s regular work and make them a dedicated problem-solver. This can be done as a short-term exercise initially, and the general idea is this:
- Each team member performs their regular tasks, but anytime something doesn’t go as planned they ask the extra person – the problem-solver – to look into the problem.
- By reducing the resources available to the team, that makes it necessary for the team to operate more efficiently (less water in the lake). This shines a light on any problems that disrupt the smooth-flowing of the team’s work, and then we can use the problem-solver to dig deeper into the issues that are exposed (the rocks).
- The problem-solver thereby ends up getting exposure to a variety of flaws in the standard processes in pretty short order – and is tasked both with resolving the problems and with investigating why they occurred. These investigations can then form the basis for a team discussion which ultimately leads to some process improvements to prevent these problems in future.
- A side benefit is that you get a feeling for what proportion of a team’s time is spent simply resolving problems with their regular tasks. For example, if you have a team of 5 and one person solves all the problems while the other 4 do all the work, then you can quickly see that 20% of the team’s time is wasted due to these issues. Of course it’s rarely that clear-cut – but you get the idea (a quick sketch of the arithmetic follows below).
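If you want to put a rough number on that side benefit, a simple time log is enough. Here’s a tiny sketch of the arithmetic, with made-up figures:

```python
# Rough sketch: how much of a team's week goes to firefighting vs. planned work.
# The time log below is invented for illustration.

week_log = [
    {"person": "A", "task": "provisioning",    "hours": 32, "rework_hours": 6},
    {"person": "B", "task": "provisioning",    "hours": 35, "rework_hours": 3},
    {"person": "C", "task": "problem-solver",  "hours": 38, "rework_hours": 38},
]

total = sum(entry["hours"] for entry in week_log)
firefighting = sum(entry["rework_hours"] for entry in week_log)
print(f"{firefighting / total:.0%} of the team's time went to problems, not planned work")
```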
#5. Stress test the IP network
Most of the time, we have an abundance of capacity in the IP network, which masks any errors in the quality of service rules. So in order to seek out these errors, we should stress test the network.
Now obviously, I’m not saying you should actively try to break your core network – at least, not during peak hours. So there are a couple of ways you could approach this.
(a) If you have a lab environment that mirrors a customer’s LAN, you should configure the IP network according to the best practices you recommend to your customers and then apply a heavy load of data traffic to the network (there are various stress testing tools that can do this). With the heavy data load applied you can then either manually make a variety of phone calls to subjectively observe how the IP phones function in this situation, or else you can use a VoIP call generator to simulate a heavy VoIP call load and measure the performance of the network in these conditions. Ideally there should be no noticeable degradation of VoIP performance as long as the number of simultaneous calls stays within your designed limits.
(b) Of course, there could be hidden IP issues in your core network just waiting to appear at an inopportune moment – and we shouldn’t ignore these. While theoretically you could duplicate your entire core network in a lab, in practice it’s incredibly difficult and expensive to replicate the network with any degree of accuracy. This leaves the unappealing option of load testing your real core network – and while this sounds risky, it’s better to plan a load test in a maintenance window in a controlled fashion than to discover problems during your peak hour on a Monday morning.
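For the lab case in (a), even a crude script can surface problems before you bring in a proper VoIP call generator. Here’s a minimal Python sketch that floods the path and checks whether latency to a lab endpoint degrades under load. It assumes a Linux host, an iperf3 server already running in the lab, and uses placeholder host names and an arbitrary threshold:

```python
# Rough lab stress test: saturate the path with iperf3 while sampling
# round-trip latency with ping. Host names and thresholds are placeholders.

import re
import subprocess

TARGET = "lab-gateway.example.com"        # placeholder lab endpoint
IPERF_SERVER = "lab-server.example.com"   # placeholder iperf3 server

# Start a 60-second UDP load (requires iperf3 running in server mode at IPERF_SERVER).
load = subprocess.Popen(
    ["iperf3", "-c", IPERF_SERVER, "-u", "-b", "500M", "-t", "60"],
    stdout=subprocess.DEVNULL,
)

# Sample latency while the load is running (Linux ping syntax).
ping = subprocess.run(
    ["ping", "-c", "30", "-i", "0.5", TARGET],
    capture_output=True, text=True,
)
load.wait()

# Parse the "rtt min/avg/max/mdev = ..." summary line.
match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", ping.stdout)
if match:
    avg_ms, max_ms = float(match.group(2)), float(match.group(3))
    print(f"avg {avg_ms} ms, max {max_ms} ms under load")
    if max_ms > 50:   # arbitrary example threshold for a LAN
        print("Latency spiked under load: QoS rules may not be doing their job")
```

A real test would measure jitter and loss too, and would drive actual calls through the phones, but even this much can expose mis-applied QoS markings.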
#6. Monitor the IP network
Are you actively monitoring the IP network – so you can see congestion and/or voice quality issues as they occur? Or do you rely on customer complaints to spot the problems?
There are a variety of tools you can use to monitor your IP network, but a good place to start (if you’re running on Metaswitch) is the Voice Quality Monitoring functionality available in MetaView Explorer. This makes use of the VQM statistics reported over RTCP by various third-party endpoints in your network, along with the Metaswitch network elements, and uses these to present a graph showing the key network nodes and the network quality between each pair of endpoints. This can be a great early warning system so you can be notified of quality issues when they first occur and hopefully resolve issues before they become critical.
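If you’d like a rough mental model of how the loss, latency and jitter figures reported over RTCP translate into a voice quality score, here’s a small Python sketch of a commonly used simplification of the ITU-T E-model. The thresholds and coefficients are an approximation for trending and alerting, not what MetaView itself computes:

```python
# Simplified E-model: turn loss / latency / jitter into an approximate MOS score.

def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    # Jitter roughly doubles the effective delay; add ~10 ms for codec processing.
    effective_latency = latency_ms + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    r -= 2.5 * loss_pct                      # penalty for packet loss
    r = max(0.0, min(100.0, r))
    # Standard R-factor to MOS conversion.
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(round(estimate_mos(latency_ms=20,  jitter_ms=2,  loss_pct=0.0), 2))  # healthy LAN,   ~4.4
print(round(estimate_mos(latency_ms=300, jitter_ms=60, loss_pct=5.0), 2))  # congested path, ~2.6
```

Even a rough score like this, trended over time per endpoint pair, gives you something to alarm on before customers start calling.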
#7. Subscriber-level alarms
One of the big challenges with keeping an entirely clean alarm panel (#2 above) is the proliferation of subscriber alarms, which could be a genuine issue, or could reflect a conscious decision on the part of the subscriber (e.g. to shut down a soft-client or unplug a phone, or perform some PBX maintenance).
I haven’t seen this done, but a great way to handle this might be to hook these alarms into a BSS system that sends a notification (e.g. a text message or email) to the subscriber whenever their system goes into alarm. That way the subscriber is notified about the problem, and then has the choice either to ignore the message (if it’s expected) or to investigate the problem if not.
You could then filter out all such alarms and you’d both provide a better experience to the customer AND eliminate some waste in your operations team. Anything that improves quality and efficiency simultaneously should make everyone happy!
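To make the idea a little more tangible, here’s a minimal sketch in Python, with placeholder host names and a made-up contact lookup standing in for the BSS integration: when a subscriber-level alarm arrives and we have a contact on file, it’s forwarded to the customer instead of sitting on the NOC dashboard.

```python
# Hypothetical subscriber-alarm notification hook. The alarm feed, contact
# lookup, and SMTP details are placeholders, not a real Metaswitch or BSS interface.

import smtplib
from email.message import EmailMessage

CONTACTS = {"+15551234567": "it-admin@customer.example.com"}   # stand-in for a BSS lookup

def notify_subscriber(alarm: dict) -> None:
    to_addr = CONTACTS.get(alarm["subscriber"])
    if to_addr is None:
        return                      # no contact on file; leave this one for the NOC
    msg = EmailMessage()
    msg["Subject"] = f"Heads-up: {alarm['description']}"
    msg["From"] = "noc-alerts@provider.example.com"
    msg["To"] = to_addr
    msg.set_content(
        "Our monitoring shows your service is reporting an alarm.\n"
        "If this is expected (maintenance, a powered-off phone), ignore this note;\n"
        "otherwise please investigate or contact support."
    )
    with smtplib.SMTP("mail.provider.example.com") as smtp:
        smtp.send_message(msg)

notify_subscriber({
    "subscriber": "+15551234567",
    "description": "SIP endpoint unregistered for 15 minutes",
})
```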
#8. Ask probing questions
Sometimes all it takes to discover the big rocks is to ask some probing questions at a team meeting. Try these out and see what you learn.
- What’s the fastest anyone has ever done X? Why can’t we hit that goal every time?
- How could we provision twice as many lines each day with half the number of people? [We can’t!] Why not? What’s the biggest problem?
In conclusion…
If you can find and solve the problems now, before they impact customers, then you can increase both your quality and your efficiency – helping you form a more perfect network.
If you need some help to get started on this path then contact me and let me know a little about your situation and we can figure out some options. You might also like to check out my System Audit service where we can get a baseline on your current Metaswitch products and check for the most obvious risks and unnecessary costs.
If you’d like to be notified when I publish new articles about implementing the Lean Network in your business, please enter your email address in the box below. I’ll also send you a copy of my recent white paper that tells the story of how I was introduced to this way of thinking and why service providers need to take this seriously in today’s competitive market place.
During 14 years at Metaswitch, Andrew Ward worked as a Customer Support Engineer (CSE) and CSE Manager with hundreds of service providers across North America before becoming obsessed with processes and quality as VP Operations. He’s now the founder of Andrew Ward Consulting LLC, where he works with select service providers to help make The Lean Network a reality. If you’re interested in working with Andrew, please click here.