Smooth sailing or a rocky future? 

April 12, 2017 By Andrew

Service providers who want to excel should beware of calm waters, and seek out the rocks.

Photo by Trevor Bexon via Flickr
This is one of a series of articles exploring how the principles of Lean Manufacturing can be applied to a telecommunications service provider’s network operations – with the aim of creating a perfect smooth-running network with an efficient, quality-focused network operations team – a concept I call The Lean Network.
Across the telecoms landscape today there are more service providers than ever, and very little prevents a customer with a broadband connection from switching to a competing provider. Given the challenges of providing a high quality VoIP product, many service providers are fighting a continual battle with churn – customers canceling their service in the hope that the shiny new offering from provider XYZ will solve all their problems (it rarely does). As I’ve written about previously, today’s market threatens the existence of many telcos, but it also presents a great opportunity for service providers who are willing and able to set themselves apart – by focusing relentlessly on the quality of their service offering.

Today we’ll look at how our best intentions can often lead us down the wrong path, and a surprising approach to tackling the issues that exist in our networks.

But first – I’d like to talk about boats and lakes. And rocks.

Imagine there’s a lake, and the lake is full of water. The sun is out, the lake is calm, and so you decide to sail your boat across it. However, half-way across the lake your boat hits some large rocks, springs a leak, starts taking on water, and you drown.


Now the first thing we learn from this story is that if you want to sail a boat, it might be a good idea to learn how to swim and to carry some life jackets. And that’s all true, but that’s not where we’re headed today.

The second lesson from this story is that it would have been much safer if there had been less water in the lake. 


You see, all that lovely water hid the rocks. Had there been less water, you would have been able to see them. The high water level made you think you were in for a day of smooth sailing, but the rocks were present whether you could see them or not. If only you had drained the lake in advance, you would have had a much more pleasant sailing experience.

It’s possible the metaphor is starting to break down at this point, so let me tie this back to real life.

  • The rocks are problems in your network.
  • The water level is anything that hides or minimizes those problems, such that they don’t get resolved – particularly in those cases where the problem is hidden by an abundance of something else.

This metaphor is usually applied to the manufacturing world, where the water represents inventory. If you have a lot of items in stock, that can hide inefficiencies in the manufacturing process, as inventory flows unevenly through the system. As a result, Lean Manufacturing organizations spend a lot of time trying to minimize their inventory levels. Reducing inventory is worthwhile in itself (less risk of obsolescence, less capital tied up, lower warehousing costs), but lower inventory also exposes inefficiencies and faults in the manufacturing line that must be resolved if you want to run a low-inventory operation (i.e. the lower water exposes the rocks).

I’m guessing this is still a bit fuzzy, so let me give you some examples that relate more to telecoms.

  • Problem: There’s no documentation for any of our processes. Hidden by: expert employees who’ve been doing this for 10 years (an abundance of experience).
  • Problem: It’s really hard to configure Hosted Business services. Hidden by: a large customer support team who spend a lot of time with each customer to help them set everything up (an abundance of support resources).
  • Problem: A particular line card occasionally reboots. Hidden by: automatic protection switching to a backup (an abundance of backup hardware).
  • Problem: A trunk group keeps going out of service. Hidden by: an automatic overflow route (an abundance of backup routes).
  • Problem: The IP network does nothing to guarantee quality of service. Hidden by: plenty of bandwidth to go around (an abundance of bandwidth).

The above examples work well with the idea of abundance as described by the metaphor of the lake and the rocks, but you can also apply the idea more generally to any scenario where a problem may be hidden. Perhaps excessive minor alarms are hiding one important alarm, or perhaps a customer stops reporting problems because nothing ever seems to get fixed, or perhaps backups have failed and no one notices until they’re urgently needed.

Why should we care?

So this is all very well, but is it actually that important? Some of these examples actually seem like good things. Don’t we want to have great customer support and expert employees?

This is where things get interesting. We obviously want to provide our customers with a high quality product, and certainly we’d rather protect our customers from a problem rather than expose them to an outage or a frustrating experience. However, there’s a real danger that we don’t take these problems seriously, or even notice them, because they’re hidden – and that’s where we run into trouble.

Hide a problem from the customer: Good.
Hide a problem from ourselves: Bad!

So let’s look at some of the above examples in more detail – and what could happen if the issues are left unresolved.

  • If you have an experienced employee who does her job well, and the instructions for all her tasks are in her head, then that works – kinda – for now. But…

    • What happens if she leaves or gets promoted?
    • What happens if there are three people doing the same role, and one of them fixes a flaw in the process – how will you make sure that everyone does it the new, better, way in the future?
  • If your provisioning process is really confusing, and customers need to spend a lot of time working with your support staff to configure things the way they want, then some customers may go away excited about the awesomeness of your support team. But…

    • Your pricing may be uncompetitive because you have to spend so much on support personnel.
    • Other customers may give up in frustration because it doesn’t work “out of the box”.
    • Customers may experience a high volume of future trouble tickets because they each have a custom solution, configured according to each CSR’s tastes.
  • If all hardware problems are masked through redundancy, then that’s great. But…

    • You need to analyze and fix the cause of the hardware problem, or it will keep happening – and your redundancy model only works if each individual failure is rare.
  • If you have overflow routes for every trunk group then that’s great. But…

    • Is this hiding a quality or capacity problem on the primary trunk group?
    • Are you paying higher fees for calls on the backup trunk group?
    • Does your test plan validate that all call types work equally well on both routes? What if international calls only work on the primary route?
  • If you resolve any complaints about the quality of the IP network by purchasing more bandwidth, then…

    • Your costs will continually rise.
    • You’ll never be able to give any cast-iron guarantees about service quality, even to your most valuable or high-profile customer.
    • Your customers will continually find new ways to use the extra bandwidth after each upgrade, only to become frustrated when they hit the next capacity constraint.
    • [Hat tip to Martin Geddes at this point, who’s done some incredible work (with his colleagues) on the theory and practice of network quality, and why the internet is fundamentally broken.]

It’s easy to overlook problems that are hidden by strengths in the organization, or excess capacity, or sometimes just by a lack of monitoring for certain types of issues. But if we truly care about quality we can’t afford to ignore these issues – we need to seek them out. Because one day something is going to change – perhaps an employee leaves, or a competitor forces you to lower prices, or two unexpected things happen at the same time – and when that day comes these problems won’t be hidden any more, and you’ll be in trouble.

The fundamental lesson is this:

If you care about quality, you need to find ways to highlight problems – to bring them to the surface so they can’t be ignored. 


Image uploaded to Wikipedia by user Zorankovacevic.

So how do we get started?

I have a few specific suggestions below, but before we dive into them I’d like to warn you that a relentless focus on quality often requires a cultural change. If we start seeking out problems, making them more visible and digging deep into the root causes then that tends to make people uncomfortable.

In part this is because most people have a natural inclination towards an easy life – and while fixing obvious problems fits coherently within that mindset (problems make life bad, fixing them makes life easier again) – seeking out problems that are below the surface just seems like a lot of work: “stirring up trouble where there isn’t any”.

The other consequence of this new approach to exposing and analyzing problems is that people’s mistakes will be more visible. It takes a great deal of self-confidence and security to feel comfortable talking about mistakes that you made, and the negative consequences of those mistakes. People would much rather avoid the level of deep scrutiny found in the 5 Whys approach I discussed previously, but in order to fully resolve issues we need to be able to have those conversations.

Therefore, any attempt to apply Lean concepts to an organization must be accompanied by a clear commitment from management to developing people (the third pillar of Lean after quality and eliminating waste).

If people think they’re going to get disciplined for any mistake, then they’re going to hide their mistakes. However, if you create a culture that values people, sees any errors as a failure of the process (“how can we improve the process/tools so that it’s harder to make this mistake next time?”), and continually looks for opportunities to develop employees, then you can truly have open, inquisitive conversations about problems in your network – which will allow you to create a level of quality previously unseen in your organization.

But enough preaching – let’s look at some specific ways you can make it easier to spot problems that might otherwise be overlooked.

#1. Take trouble tickets seriously
I’ve already explored this in depth in my article on the 5 Whys approach, so suffice it to say that each trouble ticket is a gift highlighting a flaw in our process. We therefore need to dig deep into our trouble tickets – it’s not enough to simply make the problem go away; we need to understand how it came to occur in the first place and fix it so it never happens again.
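
To make that "dig deep" habit concrete, here is a minimal sketch (my own illustration; the field names are not taken from any particular ticketing system) of a root-cause record you might attach to each closed ticket, so the 5 Whys chain and the resulting process fix get captured rather than lost:

```python
# Hypothetical root-cause record for a closed trouble ticket.
# Field names are illustrative only, not from any specific ticketing system.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RootCauseRecord:
    ticket_id: str
    symptom: str                                     # what the customer experienced
    whys: List[str] = field(default_factory=list)    # the 5 Whys chain, first to last
    root_cause: str = ""                             # the process flaw the chain ends at
    process_fix: str = ""                            # the change that prevents recurrence

    def is_complete(self) -> bool:
        """Treat the ticket as truly 'done' only once the chain and the fix are recorded."""
        return len(self.whys) >= 3 and bool(self.root_cause) and bool(self.process_fix)

record = RootCauseRecord(
    ticket_id="T-10432",
    symptom="Outbound international calls failed for 40 minutes",
    whys=[
        "Calls overflowed to the backup trunk group",
        "The backup route was never tested with international calls",
        "Our route turn-up checklist only covers domestic call types",
    ],
    root_cause="Turn-up checklist doesn't exercise every call type on backup routes",
    process_fix="Add international and toll-free test calls to the turn-up checklist",
)
print(record.is_complete())   # True once all fields are filled in
```

The specific fields matter less than the discipline: a ticket isn't closed until the chain of whys and the process change are written down somewhere the whole team can see them.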

#2. Run an alarm-free system
The alarms on your switch and other network elements are intended to provoke action. Yet it’s far too easy to end up with a system that has twenty, fifty or one hundred active alarms at any given time – which means you have a ton of known quality issues to address AND it’s hard to spot a super-important alarm among all the fluff.

Therefore, I’d strongly encourage you to evaluate every alarm that occurs and decide how to act. I generally recommend one of the following choices:

  • Investigate & Fix: the alarm represents a real problem. You need to understand the problem then fix it. The alarm should remain active until the problem is fixed.
  • Hide: some alarms might be expected in the correct operation of your network. For example, a SIP subscriber might be alarmed when it’s unregistered – and if this is a portable soft-client that might be totally normal. For categories of alarms that are “expected”, I recommend modifying the alarm filters so they don’t appear on your dashboard. I have some additional thoughts about subscriber-level alarms in #7 below.
  • Modify thresholds: you may have statistical alarms that trigger when (e.g.) the number of circuits in use on a trunk group hits a certain threshold. When one of these alarms triggers, either decide that the threshold is too conservative and increase it, or, if the alarm represents a genuine problem, “investigate and fix” it (by increasing capacity or modifying traffic flows) as mentioned above.
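
As a rough illustration of how that daily triage might look in practice, here is a minimal sketch that buckets an exported alarm list into the three actions above. The alarm categories, field names and policy table are invented for the example; they are not Metaswitch alarm identifiers:

```python
# Sketch only: classify exported alarms into the three actions discussed above.
# Alarm categories and field names here are invented for illustration.
DEFAULT_ACTION = "investigate_and_fix"

TRIAGE_POLICY = {
    "SIP_SUBSCRIBER_UNREGISTERED": "hide",              # expected for portable soft-clients
    "TRUNK_GROUP_HIGH_UTILIZATION": "review_threshold",
}

def triage(alarms):
    """Bucket each active alarm; anything unrecognized defaults to investigate-and-fix."""
    buckets = {"investigate_and_fix": [], "hide": [], "review_threshold": []}
    for alarm in alarms:
        action = TRIAGE_POLICY.get(alarm["category"], DEFAULT_ACTION)
        buckets[action].append(alarm)
    return buckets

active = [
    {"id": 101, "category": "SIP_SUBSCRIBER_UNREGISTERED"},
    {"id": 102, "category": "LINE_CARD_REBOOT"},
]
for action, items in triage(active).items():
    print(action, [a["id"] for a in items])
```

The point isn’t the code; it’s that every alarm lands in exactly one of the three buckets, and anything you don’t recognize defaults to “investigate and fix” rather than being quietly ignored.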

Personally I like the idea of having the alarm dashboard displayed on a large screen in the NOC for everyone to see, and then each morning you can review active alarms and the ongoing actions to resolve them as part of your stand-up meeting. 


#3. Ask a manager to perform each common task following the written processes
This idea assumes that you’ve already documented your processes, in which case it’s great for a team manager to try to follow his/her team’s documentation – this removes the abundance of experienced team members performing the work.

There are a few benefits to this approach:
(a) it tests the documentation;
(b) it helps the manager to better understand some of the detailed tasks performed by the team;
(c) even if the documentation is perfect, the manager may have a broader understanding of how the process fits into the organization as a whole – and so may have ideas for how to improve things (which should be discussed with the team members – the experts in the process).

#4. Remove a person from a team – to focus on improvements
This can be done as a short-term exercise initially, and the general idea is this.

  • Each team member performs their regular tasks, but anytime something doesn’t go as planned they ask the extra person – the problem-solver – to look into the problem. 
  • Reducing the resources available to the team forces the team to operate more efficiently (less water in the lake). This shines a light on any problems that disrupt the smooth flow of the team’s work, and we can then use the problem-solver to dig deeper into the issues that are exposed (the rocks).
  • The problem-solver thereby ends up getting exposure to a variety of flaws in the standard processes in pretty short order – and is tasked both with resolving the problems and with investigating why they occurred. These investigations can then form the basis for a team discussion which ultimately leads to process improvements that prevent these problems in future.
  • A side benefit is that you get a feeling for what proportion of a team’s time is spent simply resolving problems with their regular tasks. For example, if you have a team of 5 and one person solves all the problems and the other 4 do all the work – then you can quickly see that 20% of the team’s time is wasted due to these issues. Of course it’s rarely that perfect – but you get the idea.

#5. Stress test the IP network
Most of the time, we have an abundance of capacity in the IP network, which masks any errors in the quality of service rules. So in order to seek out these errors, we should stress test the network. 

Now obviously, I’m not saying you should actively try to break your core network – at least, not during peak hours. So there are a couple of ways you could approach this.

(a) If you have a lab environment that mirrors a customer’s LAN, you should configure the IP network according to the best practices you recommend to your customers and then apply a heavy load of data traffic to the network (there are various stress testing tools that can do this). With the heavy data load applied you can then either manually make a variety of phone calls to subjectively observe how the IP phones function in this situation, or else you can use a VoIP call generator to simulate a heavy VoIP call load and measure the performance of the network in these conditions. Ideally there should be no noticeable degradation of VoIP performance as long as the number of simultaneous calls stays within your designed limits.

(b) Of course, there could be hidden IP issues in your core network just waiting to appear at an inopportune moment – and we shouldn’t ignore these. While theoretically you could duplicate your entire core network in a lab, in practice it’s incredibly difficult and expensive to replicate the network with any degree of accuracy. This leaves the unappealing option of load testing your real core network – and while this sounds risky, it’s better to plan a load test in a maintenance window in a controlled fashion than to discover problems during your peak hour on a Monday morning.

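If it helps, here is a rough lab-only sketch of the kind of harness I have in mind for option (a): step up background UDP load with a traffic generator while you make test calls, and note the load level at which quality starts to degrade. It assumes iperf3 is installed and that an iperf3 server is already listening on a lab host (the address below is a placeholder):

```python
# Lab-only sketch: apply increasing background UDP load while test calls are made.
# Assumes iperf3 is installed and a lab iperf3 server is listening at LAB_SERVER.
import subprocess

LAB_SERVER = "192.0.2.10"                  # placeholder lab address, not a real host
LOAD_STEPS_MBPS = [100, 300, 500, 700, 900]

for mbps in LOAD_STEPS_MBPS:
    print(f"Applying {mbps} Mbps of background UDP traffic for 120 seconds...")
    load = subprocess.Popen(
        ["iperf3", "-c", LAB_SERVER, "-u", "-b", f"{mbps}M", "-t", "120"]
    )
    input("Load is running: make your test calls now, then press Enter...")
    load.wait()   # let this load step finish before moving to the next level
```

A call generator and automated quality measurements would obviously be better than manual test calls, but even this crude version tells you whether your recommended LAN configuration protects voice traffic under load.
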
#6. Proactively monitor the IP network
Are you actively monitoring the IP network – so you can see congestion and/or voice quality issues as they occur? Or do you rely on customer complaints to spot the problems?

There are a variety of tools you can use to monitor your IP network, but a good place to start (if you’re running on Metaswitch) is the Voice Quality Monitoring functionality available in MetaView Explorer. This uses the VQM statistics reported over RTCP by the Metaswitch network elements and by various third-party endpoints in your network to present a graph of the key network nodes and the network quality between each pair of endpoints. It can act as a great early warning system, notifying you of quality issues when they first occur so you can hopefully resolve them before they become critical.
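
Whatever tool you use, the underlying idea is simple: turn the per-path loss, latency and jitter numbers into an estimated voice quality score and alert when it drops, rather than waiting for complaints. Here is a minimal sketch (this is not the MetaView Explorer API) using a commonly cited approximation of the ITU-T G.107 E-model to estimate MOS from those three measurements; the endpoint names, figures and alert threshold are made up for the example:

```python
# Early-warning sketch (not the MetaView Explorer API): estimate MOS per network
# path from loss/latency/jitter using a common approximation of the ITU-T E-model.

def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    effective_latency = latency_ms + 2 * jitter_ms + 10.0
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    r -= 2.5 * loss_pct                      # penalize packet loss
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

ALERT_THRESHOLD = 3.8   # pick your own floor for "acceptable" quality

# Hypothetical per-path measurements: (latency ms, jitter ms, packet loss %)
paths = {
    ("core-sbc", "access-gw-1"): (20, 3, 0.0),
    ("core-sbc", "access-gw-2"): (180, 40, 3.0),
}
for (a, b), (lat, jit, loss) in paths.items():
    mos = estimate_mos(lat, jit, loss)
    if mos < ALERT_THRESHOLD:
        print(f"WARNING: estimated MOS {mos:.2f} on the path between {a} and {b}")
```

Feed it real measurements on a schedule and page someone when a path dips, and you have exposed the rocks without waiting for the water level to fall on its own.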

#7. Subscriber-level alarms
One of the big challenges with keeping an entirely clean alarm panel (#2 above) is the proliferation of subscriber alarms, which could be a genuine issue, or could reflect a conscious decision on the part of the subscriber (e.g. to shut down a soft-client or unplug a phone, or perform some PBX maintenance). 

I haven’t seen this done, but a great way to handle it might be to hook these alarms into a BSS system that sends a notification (e.g. a text message or email) to the subscriber whenever their system goes into alarm. That way the subscriber is notified about the problem, and can then choose either to ignore the message (if it’s expected) or to investigate the problem if not.
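
To make the idea a little more concrete, here is a purely hypothetical sketch of what that hook could look like. As far as I know no such integration exists off the shelf, so the event fields, the BSS lookup and the email sender are all stand-ins you would replace with whatever your systems actually provide:

```python
# Purely hypothetical: forward subscriber-level alarms to the affected customer.
# The event fields and the injected lookup/send functions are stand-ins, not a real API.

def notify_subscriber(alarm_event: dict, lookup_contact, send_email) -> bool:
    """Return True if the alarm was routed to the subscriber instead of the NOC."""
    if alarm_event.get("scope") != "subscriber":
        return False                       # network-level alarms still go to the NOC
    contact = lookup_contact(alarm_event["directory_number"])   # BSS lookup (assumed)
    if contact is None:
        return False                       # no contact on file; keep it on the NOC dashboard
    send_email(
        to=contact["email"],
        subject="Heads-up: your phone service appears to be offline",
        body=(
            f"We noticed {alarm_event['directory_number']} went unregistered at "
            f"{alarm_event['timestamp']}. If this was planned (e.g. PBX maintenance or a "
            "soft-client you shut down) you can ignore this message; otherwise please "
            "check your equipment or reply to this email for help."
        ),
    )
    return True
```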

You could then filter out all such alarms and you’d both provide a better experience to the customer AND eliminate some waste in your operations team. Anything that improves quality and efficiency simultaneously should make everyone happy!

#8. Team thought experiments
Sometimes all it takes to discover the big rocks is to ask some probing questions at a team meeting. Try these out and see what you learn.

  • What’s the fastest anyone has ever done X? Why can’t we hit that goal every time?
  • How could we provision twice as many lines each day with half the number of people? [We can’t!] Why not? What’s the biggest problem?

In conclusion…

There are a lot of ideas here, and you may have more of your own, but the core question is simple: what problems are hidden because we have too much water in our lake?

If you can find and solve the problems now, before they impact customers, then you can increase both your quality and your efficiency – helping you form a more perfect network.

If you need some help to get started on this path then contact me and let me know a little about your situation and we can figure out some options. You might also like to check out my System Audit service where we can get a baseline on your current Metaswitch products and check for the most obvious risks and unnecessary costs.

If you’d like to be notified when I publish new articles about implementing the Lean Network in your business, please enter your email address in the box below. I’ll also send you a copy of my recent white paper that tells the story of how I was introduced to this way of thinking and why service providers need to take this seriously in today’s competitive market place.





During 14 years at Metaswitch Andrew Ward worked as a Customer Support Engineer (CSE) and CSE Manager with hundreds of service providers across North America before becoming obsessed with processes and quality as VP Operations. He’s now the founder of Andrew Ward Consulting LLC where he works with select service providers to help make The Lean Network a reality. If you’re interested to work with Andrew please click here.

