Captains log, April 20th 2018. It’s day ??? of winter. I’ve packed my entire desk up and decided to move to Sunnyvale. There’s no way there’s winter in a place that has “Sunny” in the name of the city right?
Well, obligatory crappy weather complaint aside…let’s get on to today’s topic, everyone’s favourite word – Outages.
This post is going to be a bit more philosophical than practical, but I’m hoping to give people some insight into the culture we’re trying to cultivate within the CloudLink team and are trying to instill back into the wider organization.
I’ll start with a little back story:
On April 4th 2018 the team had what, as my boss in a slight feverish panic standing next to desk put it, an “operational event”. Also known as an outage. This was for all intents and purposes the very first outage for us in our public cloud infrastructure. Sure we’d had a few things here and there that may have caused a slight service degradation, or a planned maintenance window that we did some work within, but this was the first bona fide time that something went wrong and we didn’t know about it. I’ll spare you all the nitty gritty details here. Long story short, we had a maintenance script that ran at 9PM, it deleted a bunch of users, and at 930AM the next morning we realized what had happened. By 1030AM everything was back up and running. We actually spent a good portion of time discussing the impact and how widespread the issue was than it actually took to fix the problem.
That we had an outage isn’t really the point of this blog post, it’s more about how we handled it and what we did about it.
We’re trying to build a culture of transparency, and to avoid falling into the trap of finding who to blame for a problem. This is a fairly popular model in some of the newer, hipper, technology companies like Netflix and we’re finding this a very positive and constructive approach to take. I’ll focus on each of those things independently.
Right off the bat, all of our laundry is available automatically on our public status page https://status.mitel.io. This page is dynamically driven so if we’re having a problem it doesn’t require an engineer, or support/operations person to actually flip the switch from green to red. That’s a fairly scary step, but we stand behind the product we’ve developed and our culture around quick resolutions that this isn’t too big a deal. Nothing ever really stays red without some sort of explanation as to why. Which is where our incident reporting comes in. We’ve adopted the idea that if there is a problem that is affecting more than one customer, that it’s an incident and we post about it. Much better a customer or partner see a red flag, and then immediately see’s that we know about it and we’re working towards a resolution as opposed to everything looking peachy in our status page but their application is not working. There’s a theory out there called the Service Recovery Paradox that says a customer will think more highly of a company after they experience an outage with their service. The reasoning being that a successful recovery from a fault leads to an increased feeling of confidence in the company. I believe this is true, but only up until the point that a technology company is transparent about the issue. If there was a problem, a customer experienced it, and it got fixed – without ever a word from the company, that customer is probably going to assume the company never knew about it and it magically fixed itself. Even if the problem did magically fix itself – which in some cases it does – it’s still beneficial to explain what happened.
This is where the three pillars of transparency are going to come in, and if you read the postmortem linked above (with the exception of the apology pillar) I wrote following this model.
- Show your understanding of what happened
- Explain your remediation plan
I wont go into detail about these, but you should go watch the first 10 minutes of this video for a really good explanation on them, which is exactly where we got this philosophy from.
In this specific case we haven’t decided if we’re doing public postmortems yet, but for the purposes of ensuring the company still has faith in our product it’s important to still cover these off at least internally for now. The powers that be need to see that we understand exactly what happened, and how we’re going to get better to help keep it from happening in the future. We really do hope to extend this transparency to our customers, and will continue to bang the drum of culture change to allow us to do that.
Playing the No-Blame-Game
This is a tough one. As part of any investigation into an outage, or root cause analysis, or fix development – it’s easy to slip into the mode of trying to find out who caused the problem and blame them, as opposed to fixing the DevOps process that could have prevented bad code from getting out into the wild in the first place, or the platform for not being robust enough to handle whatever happened. This shifts the focus from a developer or team from learning via a negative experience (which is never a good way to learn) into learning through a positive one. It becomes a technical challenge to overcome (which engineers love…don’t you) as opposed to being that one thing in your career you never forget and hang your head in shame (I have a few of those myself ). By taking this approach, not only does everyone feel more comfortable knowing they can make mistakes without getting fired, but it actually improves your overall product. You focus on making the platform more robust, and making the automated processes of DevOps more intelligent to the type of work you’re doing. Everyone wins.
Out of this entire experience, we ended up 9 actionable back log items to address the outage. As such, our processes, and DevOps automation are better than ever, and it highlighted exactly why we utilize the release cadence that we do.
So that’s a little bit of what we’re trying to do here from a DevOps culture perspective to try and improve how we do things at Mitel. As always, we’re always learning, and always growing – so things will change and improve over time.
Until next time,