So far, we’ve noticed several reasons why a Bundle causes waste. The first was the change itself, as each change has the potential to cause an instability. We can’t stop changing, because we’d be out of a job (in a future series we’ll see that maybe we can). The second was the interval between deployments. The longer it is, the more changes accumulate, the bigger the Bundle gets, the longer the hunt becomes and the more time is wasted. And the bigger the Bundle is, the riskier the revert is. If the interval is the source of the waste, we should try to do some unbundling by shortening it.
Not Luck, Statistics
It’s not a matter of luck that deployments fail and reverts are needed. The longer the interval between deployments, the higher the odds of it happening. Let’s have a look at the probabilities of a deployment. A series of changes can be treated as a binomial distribution: each change is an independent trial that is either a success or a failure, so we can use a simple formula to calculate the odds that a deployment of (n) independent changes succeeds.
For starters, let’s put the odds of an independent change not introducing instability (once deployed) at P(s) = 95%. We’ve committed 10 changes to the main branch, and we’re about to deploy them together. What are the odds of a revert? They are equal to the odds of having at least one faulty change out of a group of 10.
Pf(10) = 1 – (0.95)^10 = ~0.4.
The odds of a revert are 40%. And that is just for 10 changes.
If the team were to commit 20 changes the odds of a revert would be 64%.
If the team were to commit 40 changes the odds of a revert would be 87%.
Honestly, I have no idea what my own odds are. Let’s push them up to P(s) = 99%, one mistake in a 100 changes. The odds of a revert of 40 changes would then drop from 87% to 33%. But 150 changes bundled together would still be all but doomed to fail (about 78%). We may not do 150 daily changes today, but maybe tomorrow. We’re about to see that happening sooner than we thought.
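Just to show the arithmetic, here is a minimal Python sketch (my own illustration, not any tooling from the story) that reproduces these numbers:

```python
def revert_odds(p_success: float, n_changes: int) -> float:
    """Odds of at least one faulty change among n independent changes."""
    return 1 - p_success ** n_changes

# Reproducing the numbers above
for p in (0.95, 0.99):
    for n in (1, 10, 20, 40, 150):
        print(f"P(s) = {p:.0%}, {n:>3} changes -> revert odds {revert_odds(p, n):.0%}")

# P(s) = 95%:  10 -> 40%, 20 -> 64%, 40 -> 87%
# P(s) = 99%:  40 -> 33%, 150 -> 78%
```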
The Shortest Interval
The shorter the interval, the fewer changes accumulate. That means the shortest interval possible is that of a single independent change, which would put the odds of a failed deployment all the way down to:
Pf(1) = 1 – P(s)^1 = 1 – P(s)
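With the numbers from before, that is Pf(1) = 1 – 0.95 = 5%, or 1 – 0.99 = 1%, for each and every deployment.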
As the deployment is no longer of a Bundle, the odds of the next deployment succeeding or failing are only as good as our coding skills and our investment in tests, Continuous Integration (CI) and other preventive measures. Welcome to the world of Continuous Deployment (CD). [Please notice that Continuous Integration and Continuous Deployment are not the same; I’ll write a series about this one day.]
If we go back to our application’s timeline, we can see how CD changes it.
In Continuous Deployment, each deployment is a reaction to a change event.
We gain two rewards from this:
- If something goes sideways after a deployment, we quickly know which change caused it: the one change that was deployed. The smaller the change, the easier it is to trace and fix. Our hunts, the cause of many inefficiencies, become tremendously shorter.
- Reverts become safer, as reverting a single change is far less risky. There is less chance of causing another unintended instability.
A CD platform requires much effort, but it is not enough on its own. There are two prerequisites for it to be beneficial.
Practice and Monitoring
The year was 2021, I was a Group Leader at RapidAPI. The company had Jenkins with CI and CD pipelines. Tests, although with insufficient coverage, were running in the background. Every merge to main automatically deployed to Staging. The deployment to production was manual.
RapidAPI had a server application full of code, a GraphQL Gateway. It was not owned by any single group. It was both a Bundle and not a Bundle, depending on which group changed it. If the group had the practice of deploying to Production immediately after Staging, it was not a Bundle, as it was continuously deployed. There were groups that did not practice as such. They accumulated changes. And reverted. Sometimes including other groups’ changes, causing havoc. Practice matters.
Even with a suitable practice, faulty changes do happen. We are not perfect. But when they do, we need to know as fast as possible that something went wrong, and what went wrong. At RapidAPI we had a Zero Error Logs policy during a stable state. We put a lot of effort into cleaning up all of our logs and making them more informational. That allowed our log analytics platform to fire pinpointed alarms when something went wrong. We had proper monitoring. So when we mistakenly took down the entire Monetization operation, we knew about it within 3 minutes. We easily inspected the last change with a git comparison, and deployed a fix.
Investing in Continuous Deployment pipelines, good practices and proper monitoring is a way to remove inefficiencies. To prevent instabilities is to avoid Velocity Drops. Alas, the probabilities remain the same.
Against No Odds
The group and I started practicing small commits and small changes for good reasons. Many of our owned services had already evolved into Monoliths, but we managed because there weren’t that many changes to them. Until, out of nowhere, RapidAPI went through a huge technological shift and business pivot. That shift required a lot of big changes to these Monoliths, which were interconnected with one another. The changes were repeatedly thought to be done, but QA kept failing them in Staging. By practicing better we managed to overcome this. We slowly rolled small changes all the way to production, one service after the other. We made it. After the technological shift ended, we went back to our day-to-day coding of features and bugs.
But the change in practice had some interesting consequences. The practice resulted in reduced time waste, which was invested back into beneficial work. Our Velocity was at an all-time high. We were doing about 50 small changes a week (compared to 3-4 big ones) to the Monetization service, one after the other. That resulted in more reverts than ever.
Let’s revisit the binomial distribution formula and see why that occurred.
With a careful look we can see it is agnostic to time intervals and to Bundling. The odds of at least one of (n) changes failing are the same whether you deploy them all at once or one after the other. Yes, the odds of the next deployment of a single change succeeding are P(s). But the odds of the next 50 single-change deployments all succeeding are still low. A failed deployment is still inevitable. Continuous Deployment does not prevent this. It’s not supposed to. CD does not unbundle a Bundle. It just mitigates the risks and consequences of having Bundles. Such would be the case of faulty changes discovered only in retrospect.
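But first, to see that the odds really are agnostic to Bundling, here is a small simulation sketch (hypothetical numbers, P(s) = 95% and 50 changes, not measured data): whether the changes ship as one bundle or as single deployments one after the other, the odds of hitting at least one failed deployment along the way are identical.

```python
import random

P_SUCCESS = 0.95   # assumed odds that a single change is not faulty
N_CHANGES = 50
TRIALS = 100_000

def at_least_one_failure(n: int) -> bool:
    """Simulate n independent changes; True if any of them turns out faulty."""
    return any(random.random() > P_SUCCESS for _ in range(n))

# Bundled: all 50 changes ship in a single deployment.
bundled = sum(at_least_one_failure(N_CHANGES) for _ in range(TRIALS)) / TRIALS

# Continuous: 50 single-change deployments, one after the other.
continuous = sum(
    any(at_least_one_failure(1) for _ in range(N_CHANGES)) for _ in range(TRIALS)
) / TRIALS

print(f"analytical 1 - P(s)^50: {1 - P_SUCCESS ** N_CHANGES:.1%}")  # ~92.3%
print(f"bundled deployment:     {bundled:.1%}")
print(f"continuous deployments: {continuous:.1%}")
```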
At RapidAPI we had a faulty change that was discovered only after 36 hours, due to a time zone difference. The change itself was a small bug fix. It turned out to have a side effect on another flow we were not familiar with, a flow that can only be manually triggered by one specific RapidAPI customer. Thirty-six hours, at our Velocity, meant 30 small changes later. We had to go on a hunt.
But it was just the one hunt, compared to the oh so many that were saved!
Containment
A faulty change being deployed is only a matter of time. When it happens, it is better to be prepared, as it may take too long to recover from it. Another strategy is to make sure the effects of a failure are limited, contained. With a minimal blast radius. Especially in sensitive operations.
There are many strategies for doing so, such as gradual rollouts, blue-green deployments, feature flags, application splits, etc. We won’t be covering them in depth (a tiny sketch of the simplest one follows below), but have a look at the consequences of proper containment vs. zero containment.
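Just as a taste, here is a minimal feature-flag sketch in Python (the flag name, the billing functions and the environment-variable store are all hypothetical, for illustration only):

```python
import os

def charge_with_legacy_flow(invoice: dict) -> str:
    return f"charged {invoice['amount']} via the legacy flow"   # known-good path

def charge_with_new_flow(invoice: dict) -> str:
    return f"charged {invoice['amount']} via the new flow"      # the risky new change

def charge_customer(invoice: dict) -> str:
    # Hypothetical flag, read from an environment variable (or a flag service).
    if os.environ.get("NEW_BILLING_FLOW", "off") == "on":
        return charge_with_new_flow(invoice)
    return charge_with_legacy_flow(invoice)

print(charge_customer({"amount": 42}))

# If the new flow turns out to be faulty, flipping NEW_BILLING_FLOW back to "off"
# limits the blast radius without a revert and without a redeploy.
```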
There was another perspective to the faulty change in RapidAPI’s Monetization service. You’d expect that only the Monetization-related features would stop working when the service is down. Alas, when it went down it took down the entire customer-facing frontend application. It had been like that for years. As we knew we were about to make some risky changes, we untangled it. It took us a few days to do something that was thought to be impossible. A few weeks afterwards we weren’t careful enough and brought the service down. You know what happened then? Nothing. Our Pingdom did not fire. Our customers didn’t notice anything and, most importantly, neither did the CEO. Crisis averted!
Or at least that’s what we thought, because we forgot about the second tightly coupled part of the system, a less visible one: the one that serves all incoming API requests. It was only half a crisis averted. Still, it would have been much harder to fix a full-blown crisis than half of one, especially under the pressure of an entire company and its customers breathing down our necks. Having the peace and quiet to put out one fire instead of two is extremely beneficial.
At the retrospective we proposed a containment for this part as well. Although it was agreed to be a good plan, it was not executed, because now is not the time. It never is, right?
Next, let’s talk about another evolution process. The one that evolves our applications into Legacy.