Playing in the SANdbox: Subject Alternate Names on ACM Certificates in CloudFormation

Depending on your view, the pace at which AWS updates and changes its services can either be a complete nightmare or something that keeps you coming into work every day. Luckily for me, I see it as the latter.

Constant AWS updates mean that every time I come back to revisit a problem, or I’m surfing the documentation around CloudFormation, there’s always a new way to solve something, or a different and better way to do things.

When I first started tackling ACM certificates in CloudFormation you couldn’t specify Subject Alternative Names as part of the request – which left you creating a separate certificate for every subdomain you needed a certificate to cover. Luckily, back in the ’70s Larry Tesler invented Copy & Paste, so it wasn’t too big a deal…at least until you ran into the maximum number of certificates you can request in a year (a limit that has thankfully been increased since I first started doing ACM requests in CloudFormation, so you shouldn’t realistically hit it).

Now that Subject Alternative Names (SANs) are supported, my CloudFormation is simplified quite a bit, but the certificate approvals get a bit tricky: since I’m going to be adding subdomains, I need the approval email to go to an address on a domain that actually exists.
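For illustration, a template fragment along these lines covers the SAN case – the domain names are examples, and ValidationDomain is what steers the approval email to the apex domain rather than the (nonexistent) subdomain:

```yaml
Resources:
  SiteCertificate:
    Type: AWS::CertificateManager::Certificate
    Properties:
      DomainName: example.com
      SubjectAlternativeNames:
        - www.example.com
        - api.example.com
      # Send validation email for each subdomain to the apex domain,
      # which actually has a mailbox behind it
      DomainValidationOptions:
        - DomainName: www.example.com
          ValidationDomain: example.com
        - DomainName: api.example.com
          ValidationDomain: example.com
```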

Continue reading “Playing in the SANdbox: Subject Alternate Names on ACM Certificates in CloudFormation”

Carefully Poking Holes: Using Cross Account Custom Authorizers in API Gateway

First off, apologies for the brief hiatus. I hit a bit of a busy period with work and fell off the posting wagon.

AWS recently introduced support for API Gateway to use a Lambda custom authorizer from another AWS account. Previously the Lambda custom authorizer had to exist in the same AWS account as the API Gateway, which caused problems in our architecture since we want to use a single token service for REST APIs across all accounts.

We originally solved this problem with what we dubbed the Auth Proxy. The Auth Proxy Lambda lives in an S3 bucket in a shared account and can be deployed with CloudFormation by referencing its location. The bucket policy is configured to allow CloudFormation to get the zip package during a deployment. Finally, when we run the CloudFormation stack we look up the ARN of the Token Lambda, store it as an environment variable for the Auth Proxy, and then add permissions for the Auth Proxy in that account to invoke the Token Lambda.

Phew, that’s a lot of steps for something that would be so much easier if we could just point API Gateway at the Lambda in the other account. Now that API Gateway supports exactly that, we needed to tackle the delicate process of opening up cross-account permissions.
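The cross-account piece boils down to a resource policy on the authorizer Lambda allowing API Gateway in the other account to invoke it. A hedged sketch – the function name, account ID, region, and API ID are all placeholders:

```shell
# Run in the account that owns the token Lambda: permit API Gateway,
# scoped to the other account's API authorizers, to invoke it
aws lambda add-permission \
  --function-name TokenAuthorizer \
  --statement-id apigw-cross-account-invoke \
  --action lambda:InvokeFunction \
  --principal apigateway.amazonaws.com \
  --source-arn "arn:aws:execute-api:us-east-1:111111111111:a1b2c3d4e5/authorizers/*"
```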

Continue reading “Carefully Poking Holes: Using Cross Account Custom Authorizers in API Gateway”

Proactively Plugging Leaks: Securing CloudFormation Stacks in AWS CodePipeline

In a recent chat with our AWS Solutions Architect, he pointed me in the direction of some really cool open source DevOps tools from the guys over at Stelligent. They have a bunch of neat utilities and frameworks on their public GitHub, and they do a number of very interesting DevOps podcasts that are worth taking some time to listen to.

Anyways, one of the utilities I really like is cfn-nag, which is designed to scan CloudFormation templates for known vulnerabilities and bad practices. This fills a gap in my DevOps portfolio: I know most best practices and what I should or shouldn’t do from an infrastructure standpoint (most of the time), but my developers shouldn’t have to be AWS security experts. Since stacks are maintained by developers after I hand them off, I need a way to ensure they don’t accidentally introduce security leaks into the platform. And since one of my mantras is to automate everything, I can’t expect developers to remember to scan their stacks before they get deployed.
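As a taste of what that automation looks like, here’s a minimal buildspec that runs cfn-nag as a build stage – the template path is an assumption for illustration:

```yaml
version: 0.2
phases:
  install:
    commands:
      # cfn-nag is distributed as a Ruby gem
      - gem install cfn-nag
  build:
    commands:
      # Non-zero exit on violations fails the build, and the pipeline stops
      - cfn_nag_scan --input-path templates/
```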

Continue reading “Proactively Plugging Leaks: Securing CloudFormation Stacks in AWS CodePipeline”

Making Snowflakes: Avoiding Collisions in CloudFormation Stack Outputs

Generally speaking I’m pretty lucky, because all our microservices operate in their own accounts (check out AWS Super Glue: The Lambda Function for why we do this) and I don’t have to worry too much about uniqueness of AWS resources. There are a few exceptions, of course, for resources that must be unique across the entire AWS platform, like S3 bucket names and CloudFront CNAMEs. Since I’ve been so lucky, I haven’t had to solve the problem of stack outputs and the requirement that they be unique within an AWS account. Our microservices are fairly lightweight, so I can get away with all my infrastructure existing within a single CloudFormation template. In my recent Web App Blue/Green deployment, however, I started running into stack outputs colliding.

Before we get too far into this, let’s take a step back and bring the CloudFormation rookies up to speed. If you just want to check out my solution, go ahead and skip down.
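As a preview, one common way to keep exports from colliding is to namespace them with the stack name – the resource and output names here are illustrative, not from my actual stacks:

```yaml
Outputs:
  ApiUrl:
    Description: Invoke URL for this stack's API
    Value: !Sub "https://${RestApi}.execute-api.${AWS::Region}.amazonaws.com/prod"
    Export:
      # AWS::StackName makes the export unique per stack within the account
      Name: !Sub "${AWS::StackName}-ApiUrl"
```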

Continue reading “Making Snowflakes: Avoiding Collisions in CloudFormation Stack Outputs”

Crash Test Dummy: Building, and Testing Angular Apps in AWS CodeBuild

As part of an upcoming post on how we achieved Blue/Green functionality within AWS, I wanted to cover a technical hurdle we overcame this week: how to build and test a web app in AWS CodeBuild.

So what’s the big deal? AWS CodeBuild lets you use a whole bunch of curated containers that have all kinds of frameworks and tools built in. If that’s not enough for you out of the box, the buildspec gives you excellent control over running scripts, including installing packages in Ubuntu. And if even that doesn’t work for you, you can simply curate your own Docker image and publish it to the Elastic Container Registry (ECR) (though more on the limitations of ECR in CI/CD in another post) or Docker Hub. With all these tools and approaches at my disposal, there’s no way I can’t have an AWS CodeBuild environment that meets my needs. But, spoiler alert, the environment isn’t my problem – it’s the build and test framework my developers have implemented.
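For context, a typical buildspec for an Angular build/test in CodeBuild looks something like this – a sketch assuming the Angular CLI is a dev dependency and the output lands in dist/:

```yaml
version: 0.2
phases:
  install:
    commands:
      - npm ci
  build:
    commands:
      # Run unit tests once, headlessly, then produce a production build
      - npx ng test --watch=false --browsers=ChromeHeadless
      - npx ng build --prod
artifacts:
  base-directory: dist
  files:
    - '**/*'
```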

Continue reading “Crash Test Dummy: Building, and Testing Angular Apps in AWS CodeBuild”

AWS Super Glue: The Lambda Function

About halfway through our development cycle, in a meeting with our AWS Solutions Architect we received what we affectionately refer to as the “AWS Bomb”.

Up until that point we had been developing our platform with the idea that all the micro services and resources required to run them all should exist within a single AWS account. Just about all of the AWS documentation up until this point, and guidance from AWS, revolved around the idea of a single account…however AWS had just come up with this new architecture to help keep cloud platforms secure, and minimize the impact (or blast radius) of a security breach. If you’re interested in reading more about what this architecture looks like, or why you should use it – check out the Multiple Account Security Strategy Solution Brief from AWS.

We immediately started working towards adopting this new architecture, which came with a whole other host of headaches that we slowly worked through and solved on a microservice-by-microservice basis. This was fairly easy for the microservices themselves, as they generally only communicate within their own account, and when they have to reach another microservice they can just do so via the external API routes.

Where we ran into some interesting challenges was in our CI/CD process. Luckily we were just starting to design and develop the CI/CD process, so we didn’t have to start over with this new architecture in mind, but it did mean we had to pioneer some things that AWS hadn’t covered in their built-in CI/CD tooling.

Continue reading “AWS Super Glue: The Lambda Function”

This is How We DevOps

I realized that in writing this blog, I assumed that everyone knew why the DevOps model is so important, and what CI/CD means, and why we do it.

That’s probably a bad assumption. As a new(ish) development ethos, DevOps isn’t widely adopted, and it can be difficult to sell DevOps as a revolution in development to an organization. I’m going to take a bit of time to explain why we do DevOps here, and what our guiding principles are whenever we develop our CI/CD processes.

Why we DevOps

Okay, I know some people are going to say “hey, James…DevOps isn’t a verb!” but it’s a made-up word and a contraction of Development and Operations anyway, so I’m gonna use it in weird ways simply because I can.

The core tenet of DevOps is exactly that: amalgamating Development and Operations teams. For us, we do this because it allows our development teams to develop, and push to AWS, from anywhere, anytime. This is achievable because we automate just about everything, and the automation allows us to put in all kinds of security blankets for builds, testing, and security to help developers not make mistakes and push bad code directly to the public. This is a huge improvement over waterfall and really empowers our developers to drive change in the product at a rapid pace. If someone is excited about a new feature and works until 4AM getting it working, why wait another year before actually releasing it? And here’s the best part: when developers know that the code they commit is going up to the cloud platform right away, they’re more likely to check and recheck for good measure before it goes (even with the automated safety nets).

My team is all about removing barriers for Developers, while maintaining safety of the platform for our customers. My job is done right if a Developer doesn’t even notice all the checks and balances that go on…that is unless they make a mistake and we catch it.

Finally we cover off the Operations side of things. With all this automation done, the demand for monitoring and notification is high.

If a build fails in the cloud, and nobody is around to see it, did it really fail? -Me, just now

We utilize CloudWatch metrics extensively through DataDog to handle a vast majority of our day to day cloud operations. Everything that goes on shows up on a fancy dashboard I’ve built and I can see the health of the platform at a glance. Beyond that, we have a number of monitors and alarms that will send emails, and messages to our operations slack channel whenever something has gone wrong. We make extensive use of outlier detection to determine what kind of behaviors are out of the norm.

In addition, we have a number of services and checks that we have designed and implemented to watch the health of our actual service. It’s hard to use third party tools out of the box when you’re developing a first party product, so this one went hand in hand between the DevOps team and the engineers working on the microservices. They put in API routes to query health, and we put in automated checks and alarms that leverage those routes. There are a handful of other Lambdas that check the status of things like our CodeBuild projects, to notify teams when builds and deployments fail, or to notify the DevOps and media teams when the STUN services are experiencing issues.

This proactive monitoring and alarming lets us address and fix issues at the development level, before a customer ever even contacts our support team. We typically identify, and at a minimum backlog a fix, before our customers ever notice there was a problem. This makes for a really great experience for everyone, developers, operations, support, managers, and most importantly our customers.

The Tenets of our DevOps Practice

Now, with all of that introduction out of the way – what DevOps is, and why it’s so important and awesome – let’s get into the meat and potatoes of my DevOps practice. These are my commandments when it comes to designing, developing, and implementing DevOps and CI/CD processes.

  1. Automate everything. Everything that can be automated, should be. Even if it starts as a manual process to figure out how to do it, we always immediately turn that into code and make it run automatically. If we didn’t, DevOps just wouldn’t scale with our platform.
  2. Remove barriers. The last thing I want to be is the guy holding the master key for everything, because I don’t want my phone ringing in the middle of the night because an emergency code fix has to be pushed into the cloud. If I followed commandment #1, then there’s no reason to call me. Just go ahead and push the code.
  3. Implement safety nets. Nobody is perfect, and we all make mistakes. As soon as you admit that, and allow developers to make mistakes, they’re going to be much faster at iterating on code and producing some neat innovations. Let them make mistakes, just catch them before they become a problem for the public.
  4. Never stop improving. The speed of cloud development demands that we are never fully done with our work. AWS releases new services, improves security, and gives us new features and functionality. We are always dedicated to going back and re-working code and processes to use the latest and greatest in services and features. The more I can push off to AWS to handle instead of my custom code, the better.
  5. Always be teaching. Especially in a company where DevOps isn’t a thing yet, it’s important to always be teaching others how to do their own DevOps work. Not only does this make the concept of DevOps proliferate your organization, but it also shares the burden of work across the entire development team. It takes time, but I promise you will reap the rewards when developers maintain their own CI/CD pipelines, leaving you to keep focused on commandment #4.
  6. Secure everything. It’s worth the effort to start with the least possible permissions across the board up front, rather than having to go back and dial permissions in later. Dialing in later causes outages, downtime, and unintended behavior. Spend the extra bit of time in the cycle up front and make sure you’re not leaving things wide open (especially S3 buckets).

Generally these are my top-level drivers for everything. There will be “sub” commandments within each of those to drive me down the right path, but so long as I keep these in mind with every top-level decision I make, I know I’m moving in the right direction.

Got a good one I missed? Send me an email – I’d love to hear what some of your guiding practices are.

Until next time.


Word of the Day: Outage

Captain’s log, April 20th, 2018. It’s day ??? of winter. I’ve packed my entire desk up and decided to move to Sunnyvale. There’s no way there’s winter in a place that has “Sunny” in the name of the city, right?

Well, obligatory crappy weather complaint aside…let’s get on to today’s topic, everyone’s favourite word – Outages.

This post is going to be a bit more philosophical than practical, but I’m hoping to give people some insight into the culture we’re trying to cultivate within the CloudLink team and are trying to instill back into the wider organization.

I’ll start with a little back story:

On April 4th, 2018 the team had what my boss – standing next to my desk in a slight feverish panic – described as an “operational event”. Also known as an outage. This was, for all intents and purposes, the very first outage for us in our public cloud infrastructure. Sure, we’d had a few things here and there that may have caused a slight service degradation, or a planned maintenance window that we did some work within, but this was the first bona fide time that something went wrong and we didn’t know about it. I’ll spare you all the nitty gritty details here. Long story short: we had a maintenance script that ran at 9 PM, it deleted a bunch of users, and at 9:30 AM the next morning we realized what had happened. By 10:30 AM everything was back up and running. We actually spent more time discussing the impact and how widespread the issue was than it took to fix the problem.


That we had an outage isn’t really the point of this blog post; it’s more about how we handled it and what we did about it.

We’re trying to build a culture of transparency, and to avoid falling into the trap of finding someone to blame for a problem. This is a fairly popular model at some of the newer, hipper technology companies like Netflix, and we’re finding it a very positive and constructive approach to take. I’ll focus on each of those things independently.

Transparency


Right off the bat, all of our dirty laundry is available automatically on our public status page. This page is dynamically driven, so if we’re having a problem it doesn’t require an engineer or a support/operations person to actually flip the switch from green to red. That’s a fairly scary step, but we stand behind the product we’ve developed, and our culture of quick resolutions means this isn’t too big a deal. Nothing ever really stays red without some sort of explanation as to why, which is where our incident reporting comes in. We’ve adopted the idea that if a problem is affecting more than one customer, it’s an incident and we post about it. Much better that a customer or partner see a red flag and then immediately see that we know about it and are working towards a resolution, as opposed to everything looking peachy on our status page while their application is not working.

There’s a theory out there called the Service Recovery Paradox that says a customer will think more highly of a company after they experience an outage with its service. The reasoning is that a successful recovery from a fault leads to an increased feeling of confidence in the company. I believe this is true, but only to the extent that the technology company is transparent about the issue. If there was a problem, a customer experienced it, and it got fixed – without ever a word from the company – that customer is probably going to assume the company never knew about it and it magically fixed itself. Even if the problem did magically fix itself – which in some cases it does – it’s still beneficial to explain what happened.

This is where the three pillars of transparency come in; if you read the postmortem linked above, you’ll see I wrote it following this model (with the exception of the apology pillar).

  1. Apologize
  2. Show your understanding of what happened
  3. Explain your remediation plan

I won’t go into detail about these, but you should go watch the first 10 minutes of this video for a really good explanation of them, which is exactly where we got this philosophy from.

In this specific case we haven’t decided if we’re doing public postmortems yet, but for the purposes of ensuring the company still has faith in our product it’s important to still cover these off at least internally for now. The powers that be need to see that we understand exactly what happened, and how we’re going to get better to help keep it from happening in the future. We really do hope to extend this transparency to our customers, and will continue to bang the drum of culture change to allow us to do that.

Playing the No-Blame-Game

This is a tough one. As part of any investigation into an outage – the root cause analysis, or the fix development – it’s easy to slip into the mode of trying to find out who caused the problem and blame them, as opposed to fixing the DevOps process that could have prevented bad code from getting out into the wild in the first place, or the platform for not being robust enough to handle whatever happened. Avoiding blame shifts a developer or team from learning via a negative experience (which is never a good way to learn) to learning through a positive one. It becomes a technical challenge to overcome (which engineers love…don’t you?) as opposed to that one thing in your career you never forget and hang your head in shame over (I have a few of those myself :D). By taking this approach, not only does everyone feel more comfortable knowing they can make mistakes without getting fired, but it actually improves your overall product. You focus on making the platform more robust, and making the automated processes of DevOps more intelligent about the type of work you’re doing. Everyone wins.

Out of this entire experience, we ended up with nine actionable backlog items to address the outage. As a result, our processes and DevOps automation are better than ever, and it highlighted exactly why we utilize the release cadence that we do.

So that’s a little bit of what we’re trying to do here from a DevOps culture perspective to improve how we do things at Mitel. As ever, we’re always learning and always growing – so things will change and improve over time.

Until next time,


Automagically Secure PowerShell Remote Sessions

In my previous post, PowerShell for Good and Sometimes for Evil, I detailed a few steps that you can take to help secure your systems from malicious use of remote PowerShell sessions. I very quickly realized, as I’m sure many will, that the manual steps are great for understanding the concept of what’s being done…but they don’t really help me in a real-life administration setting. I would need to manually configure a bunch of different items on potentially dozens of servers.

That’s no good. If only there was some sort of technology out there to help me script these kinds of tedious administrative tasks to make life easier. Oh wait, isn’t that what this blog is all about?

Right! So to make this configuration a little more real-world effective, let’s put together a script to do this work for me within a deployment. Just a couple of points before I really dive into my first PowerShell script on this blog, about how I view PowerShell and writing scripts. Of course it’s an art form and everyone does it differently, so hopefully this gives you an understanding of my scripting style:

  • When I initially write a script I try to make it as simple as possible to make sure it does what I want it to do, then layer on usability and extensibility later. It’s a lot easier to start with a simple script and get it working before adding a bunch of error handling and script parameters. A good chunk of my scripts starting out here will be fairly basic in this sense; I’d prefer to give you something I know will work and an understanding of what the script does, and you can add whatever logging or environment logic fits your needs.
  • I try to avoid a lot of the shorthand and aliasing that exists in PowerShell. Those two things are extremely helpful when you’re a PowerShell guru and you want to fire out scripts that are massive and would otherwise take a lot of time to write. I find a lot of my scripts end up being consumed by people who have a very basic PowerShell knowledge, and using shorthand and aliases just makes a script very difficult to follow.
  • I try to always work off the latest and greatest version of PowerShell. This is really a no-brainer, every single release it gets better and easier to use. So if you have bits of script that don’t apply because you’re using PSv2, please upgrade – you won’t regret it, I promise.

Alright, with all that out of the way, let’s dive into the script.

With this script I’m assuming that all your servers are part of your domain, and each has a Computer certificate available. In a later iteration of this script we could integrate in a certificate request from the Enterprise CA, but for simplicity let’s assume you’ve already got one. I’m also assuming you have the Remote Server Administration Tools (RSAT) installed, as well as the Group Policy Management feature enabled if you wish to do the bits with the GPO.

The first thing we’ll want to do is get some administrator credentials for all the operations we want to perform. It’s bad practice to hard-code usernames and passwords in your scripts, so let’s use Get-Credential to collect them securely.
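A minimal sketch of that collection step – the variable name and prompt text are my own:

```powershell
# Securely prompt for domain administrator credentials at run time
# instead of hard-coding them in the script
$adminCredential = Get-Credential -Message "Enter domain administrator credentials"
```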

Now we’ll want to retrieve a list of machines we want to apply this to. We can do this using the Get-ADComputer cmdlet (don’t forget to have the RSAT feature enabled on Windows or you won’t have the cmdlets).

I keep my servers in a root OU named Servers, so my search base comes out looking like this (of course, replace local613 and com with your own domain name and suffix).
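Reconstructing that lookup – the OU name, domain components, and $serverList variable all come from the text:

```powershell
# Gather every computer object under the Servers OU
$serverList = Get-ADComputer -Filter * -SearchBase "OU=Servers,DC=local613,DC=com"
```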

Great, now I’ve got a list of machines in my $serverList variable, so let’s get to work by setting up WinRM to accept HTTPS connections. I’ll do this remotely using the Invoke-Command cmdlet and the administrator credentials I collected earlier.
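A sketch of that remote configuration step – the variable names are illustrative, matching the credentials and server list gathered earlier:

```powershell
# Enable an HTTPS WinRM listener on each server via a remote command
foreach ($server in $serverList) {
    Invoke-Command -ComputerName $server.Name -Credential $adminCredential -ScriptBlock {
        winrm quickconfig -transport:https -force
    }
}
```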

I’m using the -force switch on the WinRM command since there should already be an HTTP listener running on the server, and WinRM will otherwise complain that there’s an existing configuration and the WinRM service is already running.

PowerShell for Good and Sometimes for Evil

I read an interesting article on a flight back home yesterday which detailed a brazen bank heist in Russia in which the bank robbers used remote PowerShell commands to instruct an ATM to spit out hundreds of thousands of dollars in cash. Interestingly, the ATM itself wasn’t hacked in the way I expected – someone gaining access to an input panel and loading up some malware. Instead, the hackers had managed to get into the broader bank network, create a series of tunnels to eventually reach the ATM network, and issue the commands. This specific attack used malware that self-deleted, but either by mistake or code error it left some logs behind, which allowed security researchers to backtrack and figure out what had happened. You can read the full article here.

This got me wondering: in a world of secure network deployments – where there may be some tunneling from other networks – how can I protect systems from executing malicious code from a remote source? The most obvious answer is to block the PowerShell Remoting and WinRM ports on the firewall from the broader network (ports 5985 and 5986 for HTTP and HTTP/s respectively). That should generally protect the systems unless the firewall is compromised or physical access to the servers is obtained. This is a nice solution since it doesn’t restrict me from using Remote PowerShell sessions when an authorized party has access to the system.

I can also change these ports from the defaults to obfuscate them somewhat from the outside. This is a good security practice since using default ports just makes things easy for an attacker – and the more roadblocks we put up, the less likely they are to be successful. We can change the port for WinRM using the following command:
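Something like the following should do it – port 8886 here is an arbitrary example; pick your own:

```powershell
# Move the WinRM listener off the default port (5985 for HTTP);
# the * wildcard stands in for the listener name
Set-Item -Path WSMan:\localhost\Listener\*\Port -Value 8886 -Force
```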

You’ll want to replace * with the auto-completed Listener name (if you have an HTTP and an HTTP/s listener), and you’ll have to run it once for the HTTP listener, then run it again with a different port for the HTTP/s listener if one exists. In my case I don’t yet have an HTTP/s WinRM listener configured so I can get away with using the wildcard.

Restarting the WinRM service is required, and can be done with the following command
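A sketch of that restart:

```powershell
# Restart WinRM so the new listener configuration takes effect;
# -Force restarts it even if dependent services are running
Restart-Service -Name WinRM -Force
```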

Now when I’m initiating a Remote PowerShell session, I need to ensure I’m specifying the now non-standard port by using the -Port switch in Enter-PSSession
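For example – the server name is a placeholder, and the port matches whatever non-standard value you configured:

```powershell
# Connect over the non-default WinRM port configured earlier
Enter-PSSession -ComputerName SERVER01 -Port 8886 -Credential (Get-Credential)
```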

This puts me in a fairly comfortable position now in the event that someone does gain access to my closed network and attempts to do some PowerShell remoting. My final step to really help me sleep at night is to make sure that when I am using PowerShell remote sessions I’m doing so securely, lest a pesky network sniffer gain valuable intel on the system based on my administrative work. To do this, I’m going to use a couple Group Policy objects to ensure the WinRM Service isn’t accepting insecure connections, in case I forget to connect securely during a 3AM emergency service call.

Within my default domain policy, I’ll configure a couple settings:

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow Basic Authentication -> Disabled

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow CredSSP Authentication -> Disabled

Computer Configuration\Policies\Administrative Templates\Windows Components\Windows Remote Management (WinRM)\WinRM Service\Allow Unencrypted Traffic -> Disabled

Now that we’ve made these changes in the GPO, I’ll have to go configure WinRM for HTTP/s on my original server. There’s a great Microsoft Support article on the subject here, but for brevity, here are the steps:

  • Launch the Certificates Snapin for Local Computer
  • Ensure within Personal\Certificates a Server Authenticating certificate exists, if not request one through your domain CA
  • Use this command to quick configure WinRM
    winrm quickconfig -transport:https

Now when I connect I’ll need to ensure that I use the following command
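Presumably that command is Enter-PSSession with the -UseSSL switch – the server name here is a placeholder:

```powershell
# -UseSSL forces the session over the HTTPS listener (default port 5986)
Enter-PSSession -ComputerName SERVER01 -UseSSL -Credential (Get-Credential)
```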

It’s not perfect, and it won’t stop everyone, but as my father used to say, it’s just enough to keep the honest people honest. For those going the less-than-honest route, I also found this white paper from the Black Hat USA 2014 conference on Investigating PowerShell Attacks to be extremely interesting.

There are a few good resources out there on the security considerations of remote PowerShell sessions, specifically this one from JuanPablo Jofre on MSDN, and this one from Ed Wilson of the Scripting Guys.