Word of the Day: Outage

Captain's log, April 20th 2018. It's day ??? of winter. I've packed up my entire desk and decided to move to Sunnyvale. There's no way there's winter in a place that has "Sunny" in the name of the city, right?

Well, obligatory crappy weather complaint aside…let’s get on to today’s topic, everyone’s favourite word – Outages.

This post is going to be a bit more philosophical than practical, but I'm hoping to give people some insight into the culture we're trying to cultivate within the CloudLink team and instill back into the wider organization.

I’ll start with a little back story:

On April 4th 2018 the team had what my boss, in a slightly feverish panic standing next to my desk, called an "operational event". Also known as an outage. This was, for all intents and purposes, the very first outage for our public cloud infrastructure. Sure, we'd had a few things here and there that may have caused a slight service degradation, or a planned maintenance window that we did some work within, but this was the first bona fide time that something went wrong and we didn't know about it. I'll spare you all the nitty-gritty details. Long story short: we had a maintenance script that ran at 9PM, it deleted a bunch of users, and at 9:30AM the next morning we realized what had happened. By 10:30AM everything was back up and running. We actually spent more time discussing the impact and how widespread the issue was than it took to fix the problem.


That we had an outage isn't really the point of this blog post; it's more about how we handled it and what we did about it.

We're trying to build a culture of transparency, and to avoid falling into the trap of looking for someone to blame for a problem. This is a fairly popular model in some of the newer, hipper technology companies like Netflix, and we're finding it a very positive and constructive approach to take. I'll cover each of those things in turn.

Transparency

Right off the bat, all of our dirty laundry is aired automatically on our public status page https://status.mitel.io. The page is dynamically driven, so if we're having a problem it doesn't require an engineer or support/operations person to actually flip the switch from green to red. That's a fairly scary step, but we stand behind the product we've developed, and our culture of quick resolutions means this isn't too big a deal. Nothing ever really stays red without some sort of explanation as to why, which is where our incident reporting comes in. We've adopted the idea that if a problem is affecting more than one customer, it's an incident and we post about it. Much better that a customer or partner sees a red flag and then immediately sees that we know about it and are working towards a resolution, as opposed to everything looking peachy on our status page while their application is not working.

There's a theory out there called the Service Recovery Paradox that says a customer will think more highly of a company after they experience an outage with its service. The reasoning is that a successful recovery from a fault leads to an increased feeling of confidence in the company. I believe this is true, but only if the company is transparent about the issue. If there was a problem, a customer experienced it, and it got fixed without ever a word from the company, that customer is probably going to assume the company never knew about it and it magically fixed itself. Even if the problem did magically fix itself, which in some cases it does, it's still beneficial to explain what happened.

This is where the three pillars of transparency come in; if you read the postmortem linked above, you'll see I wrote it following this model (with the exception of the apology pillar):

  1. Apologize
  2. Show your understanding of what happened
  3. Explain your remediation plan

I won't go into detail about these, but you should go watch the first 10 minutes of this video for a really good explanation of them; it's exactly where we got this philosophy from.

In this specific case we haven't decided whether we're doing public postmortems yet, but for the purposes of ensuring the company still has faith in our product, it's important to cover these off at least internally for now. The powers that be need to see that we understand exactly what happened and how we're going to get better, to help keep it from happening in the future. We really do hope to extend this transparency to our customers, and will continue to bang the drum of culture change to allow us to do that.

Playing the No-Blame-Game

This is a tough one. As part of any investigation into an outage, root cause analysis, or fix development, it's easy to slip into the mode of trying to find out who caused the problem and blame them, as opposed to fixing the DevOps process that could have prevented bad code from getting out into the wild in the first place, or the platform that wasn't robust enough to handle whatever happened. Taking blame out of the equation shifts a developer or team from learning via a negative experience (which is never a good way to learn) to learning through a positive one. It becomes a technical challenge to overcome (which engineers love…don't you?) instead of that one thing in your career you never forget and hang your head in shame about (I have a few of those myself :D). By taking this approach, not only does everyone feel more comfortable knowing they can make mistakes without getting fired, but it actually improves your overall product. You focus on making the platform more robust, and on making your automated DevOps processes smarter about the type of work you're doing. Everyone wins.

Out of this entire experience, we ended up with 9 actionable backlog items to address the outage. As a result, our processes and DevOps automation are better than ever, and the whole thing highlighted exactly why we use the release cadence that we do.

So that's a little bit of what we're trying to do from a DevOps culture perspective to improve how we do things at Mitel. As always, we're learning and growing, so things will change and improve over time.

Until next time,

James.

Automagically Secure PowerShell Remote Sessions

In my previous post, PowerShell for Good and Sometimes for Evil, I detailed a few steps that you can take to help secure your systems from malicious use of remote PowerShell sessions. I very quickly realized, as I'm sure many will, that the manual steps are great for understanding the concept of what's being done…but they don't really help me in a real-life administration setting. I'd now need to manually configure a bunch of different items on potentially dozens of servers.

That’s no good. If only there was some sort of technology out there to help me script these kinds of tedious administrative tasks to make life easier. Oh wait, isn’t that what this blog is all about?

Right! So to make this configuration a little more effective in the real world, let's put together a script to do the work for me across a deployment. Just a couple of points about how I view PowerShell and writing scripts before I dive into my first PowerShell script on this blog. Of course it's an art form and everyone does it differently, so hopefully this gives you an understanding of my scripting style:

  • When I initially write a script I try to make it as simple as possible to make sure it does what I want, then layer on usability and extensibility later. It's a lot easier to start with a simple script and get it working before adding a bunch of error handling and script parameters. A good chunk of my scripts starting out here will be fairly basic in this sense; I'd prefer to give you something I know will work and that shows you what the script does, and you can add whatever logging or environment logic fits your needs.
  • I try to avoid a lot of the shorthand and aliasing that exists in PowerShell. Those two things are extremely helpful when you're a PowerShell guru and you want to fire out massive scripts that would otherwise take a lot of time to write, but a lot of my scripts end up being consumed by people with very basic PowerShell knowledge, and shorthand and aliases just make a script very difficult to follow.
  • I try to always work off the latest and greatest version of PowerShell. This is really a no-brainer; with every single release it gets better and easier to use. So if you have bits of script that don't apply because you're using PSv2, please upgrade – you won't regret it, I promise.

Alright, with all that out of the way, let's dive into the script.

With this script I'm assuming that all your servers are part of your domain, and that each has a Computer certificate available. In a later iteration of this script we could integrate a certificate request from the Enterprise CA, but for simplicity let's assume you've already got one. I'm also assuming you have the Remote Server Administration Tools (RSAT) installed, as well as the Group Policy Management feature enabled if you wish to do the GPO bits.

The first thing we'll want to do is get some administrator credentials for all the operations we want to perform. It's bad practice to hard-code usernames and passwords in your scripts, so let's use Get-Credential to collect them securely.
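Something like this does the trick (the $adminCreds variable name is just a placeholder I'll reuse later):

    # Prompt once for domain admin credentials and hold onto them for the remote work below
    $adminCreds = Get-Credential -Message "Enter administrator credentials"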

Now we’ll want to retrieve a list of machines we want to apply this to. We can do this using the Get-ADComputer cmdlet (don’t forget to have the RSAT feature enabled on Windows or you won’t have the cmdlets).

I keep my servers in a root OU named Servers, so my search base comes out looking like this (of course, replace local613 and com with your own domain and suffix).
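Something along these lines should do it (the filter and name selection here are my own choices; tweak to taste):

    # The ActiveDirectory module ships with RSAT and auto-loads in PowerShell 3.0+
    # Pull every computer object under the Servers OU and keep just the names
    $serverList = Get-ADComputer -Filter * -SearchBase "OU=Servers,DC=local613,DC=com" -Credential $adminCreds |
        Select-Object -ExpandProperty Name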

Great, now that I've got a list of machines in my $serverList variable, let's get to work by setting up WinRM to accept HTTPS connections. I'll do this remotely using the Invoke-Command cmdlet along with the administrator credentials I collected earlier.
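Roughly like so, reusing $serverList and $adminCreds from above (I'm leaning on winrm quickconfig here rather than building the listener by hand):

    # Remotely enable an HTTPS listener on every server in the list
    Invoke-Command -ComputerName $serverList -Credential $adminCreds -ScriptBlock {
        winrm quickconfig -transport:https -force
    }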

I'm using the -force switch on the WinRM command since there should already be an HTTP listener running on the server, and WinRM will otherwise complain that there's an existing configuration and that the WinRM service is already running.

PowerShell for Good and Sometimes for Evil

I read an interesting article on a flight back home yesterday detailing a brazen bank heist in Russia in which the robbers used remote PowerShell commands to instruct an ATM to spit out hundreds of thousands of dollars in cash. Interestingly, the ATM itself wasn't hacked in the way I expected, with someone gaining access to an input panel and loading up some malware; instead the hackers had managed to get into the broader bank network, create a series of tunnels to eventually reach the ATM network, and issue the commands from there. This specific attack used malware that self-deleted, but either by mistake or code error it left some logs behind, which allowed security researchers to backtrack and figure out what had happened. You can read the full article here.

This got me wondering: in a world of secure network deployments, where there may be some tunneling from other networks, how can I protect systems from executing malicious code from a remote source? The most obvious answer is to block the PowerShell Remoting and WinRM ports on the firewall from the broader network (ports 5985 and 5986 for HTTP and HTTP/s respectively). That should generally protect the systems unless the firewall is compromised or physical access to the servers is obtained. It's a nice solution since it doesn't restrict me from using Remote PowerShell sessions when an authorized party has access to the system.
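If you'd rather enforce that on the hosts themselves as well as at the network edge, a Windows Firewall rule along these lines would do it (the 10.0.0.0/8 range is purely an example; scope it to whatever you consider the untrusted side):

    # Block inbound WinRM / PowerShell Remoting traffic arriving from the broader network
    New-NetFirewallRule -DisplayName "Block WinRM from untrusted networks" `
        -Direction Inbound -Protocol TCP -LocalPort 5985,5986 `
        -RemoteAddress 10.0.0.0/8 -Action Block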

I can also change these ports from the defaults to obfuscate them somewhat from the outside. This is good security practice, since using default ports just makes things easy for an attacker, and the more roadblocks we put up the less likely they are to be successful. We can change the port WinRM listens on with a single command.
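In my case that looks something like this, using the WSMan: drive (8530 is just an arbitrary non-default port I picked):

    # Point the WinRM listener at a non-default port; the * matches every listener
    Set-Item WSMan:\localhost\Listener\*\Port -Value 8530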

You’ll want to replace * with the auto-completed Listener name (if you have an HTTP and an HTTP/s listener), and you’ll have to run it once for the HTTP listener, then run it again with a different port for the HTTP/s listener if one exists. In my case I don’t yet have an HTTP/s WinRM listener configured so I can get away with using the wildcard.

Restarting the WinRM service is also required.
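That's just a one-liner:

    # Restart WinRM so the listener picks up the new port
    Restart-Service -Name WinRM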

Now when I'm initiating a Remote PowerShell session, I need to make sure I specify the now non-standard port using the -Port switch on Enter-PSSession.
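Something along these lines (server01 is a placeholder name, and the port matches whatever you chose above):

    # Connect to the remote server over the custom WinRM port
    Enter-PSSession -ComputerName server01.local613.com -Port 8530 -Credential (Get-Credential)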

This puts me in a fairly comfortable position in the event that someone does gain access to my closed network and attempts to do some PowerShell remoting. My final step to really help me sleep at night is to make sure that when I am using PowerShell remote sessions I'm doing so securely, lest a pesky network sniffer gain valuable intel on the system from my administrative work. To do this, I'm going to use a couple of Group Policy settings to ensure the WinRM service isn't accepting insecure connections, in case I forget to connect securely during a 3AM emergency service call.

Within my default domain policy, I’ll configure a couple settings:

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow Basic Authentication -> Disabled

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow CredSSP Authentication -> Disabled

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow Unencrypted Traffic -> Disabled
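
If you want to sanity-check that the policy has actually landed on a given server, the WSMan: drive exposes the same settings. A quick read-only spot check might look like this:

    # All three should report false once the GPO has applied
    Get-Item WSMan:\localhost\Service\AllowUnencrypted
    Get-Item WSMan:\localhost\Service\Auth\Basic
    Get-Item WSMan:\localhost\Service\Auth\CredSSP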

Now that we've made these changes in the GPO, I'll have to go configure WinRM for HTTP/s on my original server. There's a great Microsoft Support article on the subject here, but for brevity, here are the steps:

  • Launch the Certificates Snapin for Local Computer
  • Ensure a Server Authentication certificate exists under Personal\Certificates; if not, request one through your domain CA
  • Use this command to quick configure WinRM
    winrm quickconfig -transport:https

Now when I connect, I'll need to make sure I'm connecting over the HTTPS listener rather than the default HTTP one.
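With Enter-PSSession that means adding the -UseSSL switch; something like this (again, the server name is a placeholder, and you'd add -Port as well if you've also moved the HTTPS listener off 5986):

    # Connect over the encrypted HTTPS listener instead of plain HTTP
    Enter-PSSession -ComputerName server01.local613.com -UseSSL -Credential (Get-Credential)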

It's not perfect, and it won't stop everyone, but as my father used to say, it's just enough to keep the honest people honest. For those going the less-than-honest route, I also found this white paper from the Black Hat USA 2014 conference on Investigating PowerShell Attacks to be extremely interesting.

There are a few good resources out there on the security considerations of remote PowerShell sessions, specifically this one from JuanPablo Jofre on MSDN, and this one from Ed Wilson of the Scripting Guys.