The Wrath of Khan: Introducing Known Issues into the Public Cloud

I’ve spent the last few months going into some technical aspects of DevOps here, so this week I’m going to pivot a little and return to a more philosophical DevOps topic.

A colleague swung by my desk today to get an opinion on a situation, and it basically boils down to this:

If I need to release a fix to the public that resolves a critical issue and has gone through its appropriate soak cycle, but introduces a new minor issue, do I release the initial fix or wait for both issues to be resolved and soaked?

At first glance it seems like a bit of a no-brainer, but as you peel back the layers and apply different software methodologies, things can get a bit muddled.

Assuming we have a soak cycle in a development cloud before going public, let’s analyze two different approaches:

  1. Delay your fix to public, fix both bugs, and allow for appropriate test and soak time in your development environment before going to the world
  2. Push your fix to public, resolve the minor bug, allow it to soak in your development environment, and then release the fix for the minor bug

So now that we’ve got those out, let’s see what they look like.

In approach 1, we went to test, found a bug, sent the whole release back to design, then came back to test, and then went to release. This is normal development practice regardless of where your software runs and regardless of whether your testing is manual or automated. It does, however, mean we’re leaving a critical problem running in the public cloud for longer than we’d like.

In approach 2, we’re feeling a bit more agile and leaning on CI/CD a bit more, knowing that we can release on a more frequent schedule. We can move forward with deploying the critical fix, send the minor issue back to design for resolution, and then roll its fix into the next push to public. This means that we’re knowingly introducing a bug that may affect a small number of users in order to resolve a problem that affects a large number of users.

Now that we have a good idea of the problem and the potential ways forward, we have to decide which way to go. We’re oversimplifying this example, so your decision will have numerous other factors to consider from a business standpoint:

  1. The financial impact
  2. The reputation impact
  3. Your release cycle
  4. Exactly how impactful the original critical issue is
  5. Your ability to identify and notify those users who will be affected by the new problem, or your ability to identify even IF anyone will be affected by the new issue

The list could probably go on and on and on depending on your line of business and how you approach development.

In our specific case for this issue, we’re able to identify that the minor issue affects a feature that isn’t in use yet or is being used by a small enough subset of special trial customers that we are okay introducing the minor bug to fix the big one.

…the needs of the many outweigh the needs of the few
– Spock, The Wrath of Khan

Our release cycle means the bug will only be public facing for a week (or less), it poses no financial or reputation impact, and we can clearly identify which customers will be affected and how.

The interesting point about this issue is that it can be approached and solved differently each time, since the variables around the problem are fluid. Next time, the critical issue may be less critical and the minor issue a higher priority, making the gap in severity much smaller. That would likely drive us to hold off and fix both issues simultaneously.

If you haven’t run into this problem set yet, it’s worth thinking through with your development teams and leadership group so you can pre-define some logic that makes the decision easier. For example, leadership may decide that any financial impact is unbearable and must be avoided.

That puts a huge weight on which of the issues causes or solves a financial impact and guides the process in that direction. Conversely, development may indicate that the release cycle doesn’t allow for keeping a critical issue in the public for an extended period. At the end of the day, it’s going to be a balancing act, with DevOps in the middle.
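
To make the idea of pre-defined logic a bit more concrete, here is a minimal Python sketch of what such a rule set might look like. Everything in it is an assumption for illustration: the Severity levels, the Issue fields, the should_release_now function, and the min_severity_gap threshold are hypothetical names, not anything we actually run. It encodes the example leadership rule (never knowingly introduce a financial impact), the ability to identify affected customers, and the severity gap between the two issues.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; use whatever your team already tracks.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Issue:
    severity: Severity
    has_financial_impact: bool
    affected_users_identifiable: bool


def should_release_now(fixed: Issue, introduced: Issue,
                       min_severity_gap: int = 2) -> bool:
    """Return True to ship the fix now (approach 2),
    False to hold and fix both issues first (approach 1)."""
    # Example leadership rule: never knowingly introduce a financial impact.
    if introduced.has_financial_impact:
        return False
    # If we can't tell who the new issue will hit, hold the release.
    if not introduced.affected_users_identifiable:
        return False
    # An existing financial impact tips the scale toward shipping now.
    if fixed.has_financial_impact:
        return True
    # Otherwise ship only when the gap in severity is wide enough.
    return fixed.severity - introduced.severity >= min_severity_gap


critical = Issue(Severity.CRITICAL, False, True)
minor = Issue(Severity.MINOR, False, True)
major = Issue(Severity.MAJOR, False, True)

ship_critical_fix = should_release_now(critical, minor)  # wide gap
ship_major_fix = should_release_now(major, minor)        # narrow gap
```

With these assumed defaults, the first case (critical fix, minor new bug, affected customers identifiable) comes out as release now, while the narrower major-versus-minor gap comes out as hold. The real value is less in the code than in getting development and leadership to agree on the rules and thresholds ahead of time.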