I imagine most of you have already heard that SpaceX attempted to launch a huge rocket on Thursday last week (April 20, 2023). According to Elon Musk, this was a great success – he tweeted, “Congrats @SpaceX team on an exciting test launch of Starship!”.
Check out the video below and see for yourself – I agree with “exciting”. The launch involved a lot of flames and smoke, then after a few minutes the rocket started doing a slow-motion loop-the-loop, before experiencing a “rapid unscheduled disassembly” around the 4-minute mark.
It would be easy to poke fun at this event, but actually, there are a few important lessons we can learn from SpaceX’s $3B experiment.
A good mantra for anyone working with clients or customers is “under promise, over deliver” – which means that you always want to exceed expectations. In SpaceX’s case, they actually tried to do this. Ahead of the launch, Musk had said that he would consider the test a success if the Starship rocket didn’t explode on the launchpad. During the webcast, SpaceX commentator Kate Tice said “Everything after clearing the tower was icing on the cake”.
So by these standards, the Starship rocket lasted 3 minutes longer than expected. Sadly though, there are limitations to what you can achieve through expectation management. However low you set people’s expectations, they’re generally going to view the public explosion of a $3 billion rocket as a setback.
Nevertheless, you’ve got to give them points for effort – and any of us working in customer service should always aim to set clear expectations and then do everything we can to exceed them.
Value of Data
The reason people at SpaceX are happy is that they gathered a ton of data from the test launch, which will allow them to build a better rocket next time. One of the things I’ve always appreciated about the Metaswitch platform is that it’s constantly gathering data about what’s happening. Service Assurance Server provides detailed information on all calls, and if you ever experience a software protection switch (SPS), a large diagnostic file is produced that allows Metaswitch technical support to understand exactly what was happening at the time of the incident – so they can learn from it.
As an engineer, I find that all failures hurt, but by far the worst situation is where something goes wrong and you don’t know why. If you can gather useful diagnostics from your applications (or packet captures) at the time of a problem, that gives you a fighting chance of understanding what happened and fixing the problem in the future. It costs SpaceX $3B for each test, so I would imagine their engineers will be trying to fix all the problems they can find before they try again.
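For your own applications, one way to capture diagnostics at the moment of a problem is to keep a rolling buffer of recent log messages and snapshot it whenever an error is logged. The following is only a minimal sketch in Python using the standard logging module; the class name RecentEventsHandler, the logger name and the sample messages are all made up for illustration, and this is not how the Metaswitch platform produces its diagnostic files.

```python
import collections
import logging


class RecentEventsHandler(logging.Handler):
    """Hypothetical handler: remember the last `capacity` log lines and
    snapshot them whenever an ERROR (or worse) is logged."""

    def __init__(self, capacity=100):
        super().__init__(level=logging.DEBUG)
        # A deque with maxlen silently discards the oldest entry when full.
        self.recent = collections.deque(maxlen=capacity)
        # Each entry is the list of lines captured at one failure.
        self.dumps = []

    def emit(self, record):
        self.recent.append(self.format(record))
        if record.levelno >= logging.ERROR:
            # Keep the context leading up to (and including) the failure.
            self.dumps.append(list(self.recent))
            self.recent.clear()


def demo():
    """Log a few routine events followed by one failure; return the dumps."""
    log = logging.getLogger("diagnostics_demo")
    log.setLevel(logging.DEBUG)
    log.propagate = False  # keep the demo output out of the root logger
    handler = RecentEventsHandler(capacity=3)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    log.addHandler(handler)
    try:
        for i in range(5):
            log.debug("call %d set up", i)
        log.error("call 5 dropped")
    finally:
        log.removeHandler(handler)
    return handler.dumps
```

In a real deployment you would probably write each snapshot to a file rather than hold it in memory, but the idea is the same: the detail is already there when the failure happens, so you are not left trying to reproduce the problem afterwards.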
When I watched the launch video for the first time, I got nervous. There’s a moment where the huge rocket starts to spiral out of control – and my first thought was, “What if this falls out of the sky and hits someone’s house?”
Luckily, SpaceX had thought of this risk, and so the rocket contained an automated “flight termination system” which detected that things were not going well, and promptly blew up the rocket.
The idea of a backup plan is critical for so many aspects of telecoms operations – from overflow trunks, to data backups, to roll-back procedures for any maintenance activity. Whatever you’re doing, you need to consider the risk that something will go wrong, and figure out (in advance) what you’ll do if that happens.
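The same thinking applies to scripted maintenance: take a snapshot before you change anything, and restore it automatically if a step fails. Here is a minimal sketch in Python, assuming the configuration is a plain dictionary; the function name rollback_on_failure and the "route" setting are hypothetical, and a real roll-back procedure would also need to cover external state that a snapshot like this cannot capture.

```python
import contextlib
import copy


@contextlib.contextmanager
def rollback_on_failure(config):
    """Snapshot `config` before a change and restore it if the change raises."""
    snapshot = copy.deepcopy(config)
    try:
        yield config
    except Exception:
        # Something went wrong: put the configuration back exactly as it was,
        # then re-raise so the operator still sees the failure.
        config.clear()
        config.update(snapshot)
        raise


def demo():
    """Attempt a change that fails its (made-up) health check."""
    config = {"route": "primary-trunk"}
    try:
        with rollback_on_failure(config) as cfg:
            cfg["route"] = "new-trunk"
            raise RuntimeError("health check failed")
    except RuntimeError:
        pass
    return config
```

Calling demo() leaves the configuration on the primary trunk, because the failed change was rolled back. The point is that the recovery path was decided before the change started, not improvised after it failed.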
Importance of Learning and Persistence
This isn’t the first time SpaceX has experienced launch failures. Back in 2006-2008, SpaceX was working on its first rocket, the Falcon 1, and the first three launch attempts all failed. Following these failures, SpaceX, Tesla and Elon Musk were all very nearly bankrupt at the same time, and Musk was so stressed that he would wake from nightmares, “screaming and in physical pain”.
But SpaceX finally managed to launch their rocket on 28 September 2008 (two weeks after Lehman Brothers collapsed), and this success saved the company.
Sometimes we are hired by telcos that are experiencing a lot of technical problems – multiple overlapping issues that make it really hard to understand what’s wrong or even if we’re making progress. It can be really hard to persist and trust the process in these situations. It’s rarely possible to wave a magic wand and fix everything – but if you stick with it, take a methodical approach and keep fixing the issues, eventually you’ll see progress.