
Rolling out Changes: Crossing Time Boundaries

Reading Time: 7 minutes

In the previous chapter, we encountered a split that was a technical must. We split our ancient single-server web application into two modern applications: a backend application deployed to our servers, and a frontend application deployed to our customers. The two now share a client-server relationship.

The newly added physical boundaries between them forced our applications to be deployed independently, one after the other. We’ve seen the consequences of this for our engineers, their deployments, and our development workflow.

We’ve started talking about how to ease those consequences by reducing how often our applications need to Change together. One way was rearranging the Modules between them, and we ended up seeing there are a lot of considerations to it. In this chapter, we’re going to focus on one of them, by inspecting not physical boundaries, but time boundaries. Because rolling out Changes takes time.

Transformers Rollout!

Every Change needs to be rolled out, but not all Rollouts are the same. Every Change needs to eventually reach all of its destinations, which are sometimes remote. Some face physical constraints; others face business or technical dependency constraints. All of them cause a Delay. The further away from us our application is, the more complicated the Rollout and the longer the Delays we would experience. The more customers and destinations we have, the longer the Rollout and the more Delays we would experience.

For a while, RapidAPI (2020-2021) provided an enterprise version for a few tens of enterprises. The SaaS version was continuously deployed a few times a day, but the enterprise version was not. It was built and packaged only once every three months. Only later was it installed at one enterprise after another, according to their schedules. It was a gradual Rollout which resulted in a potential Delay of 6 to 9 months for a Change. It is a very common business constraint that companies face when they work with large enterprises. It should make us wonder whether it was an agile project or a waterfall project that left no room for Feedback.

Silo’s physical devices (2019), a B2C product, had a physical distance to cross. An initial version was burned at the factory prior to shipping. Shipping would have taken about a month and a half, with another month or so to be distributed to resellers and maybe a week or two until a device was purchased. Only then would it have made a first connection to the internet and fetched the latest version. It adds up to about a 3-month Delay. In practice it could have been a lot more, because Americans buy Christmas gifts months ahead. The same is true for all connected devices, such as Amazon’s Echo Dot, Amazon’s Kindle, and any Smart TV by Samsung or Sony.

At DealPly (2014), our Rollout was much simpler, but still gradual. We had a browser addon running in about 20 million browsers worldwide, which continuously updated itself. We were practicing extreme programming, resulting in a new version being released a few times a day. Meanwhile, our addon would try to fetch a new version only once every 24 hours. As each application instance’s timer started with a different seed time, it created a gradual Rollout: 10% of our customers would get our latest Change within a 1-hour Delay, and after a 24-hour Delay, 95% of them would.
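To make the mechanism concrete, here is a minimal sketch of such a jittered self-update loop, assuming a Node.js runtime (Node 18+ for the global fetch). The endpoint, version constant, and install helper are hypothetical illustrations, not DealPly’s actual code.

```typescript
// Minimal sketch of a jittered self-update loop (illustrative only).
const CHECK_INTERVAL_MS = 24 * 60 * 60 * 1000; // one check per 24 hours
const CURRENT_VERSION = "1.4.2";               // baked in at build time

async function downloadAndInstall(url: string): Promise<void> {
  // Placeholder: a real updater would verify a signature before swapping binaries.
  console.log(`would download and install from ${url}`);
}

async function checkForUpdate(): Promise<void> {
  const res = await fetch("https://updates.example.com/latest"); // hypothetical endpoint
  const { version, url } = (await res.json()) as { version: string; url: string };
  if (version !== CURRENT_VERSION) {
    await downloadAndInstall(url);
  }
}

// Each instance starts its timer at a random offset within the window, so a
// fleet of millions of clients spreads its checks evenly and a new version
// rolls out gradually instead of all at once.
const initialDelayMs = Math.random() * CHECK_INTERVAL_MS;
setTimeout(() => {
  void checkForUpdate();
  setInterval(() => void checkForUpdate(), CHECK_INTERVAL_MS);
}, initialDelayMs);
```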

Rollouts of many modern frontend applications behave much the same. Browsers have a local cache, and by default they adhere to the Last-Modified HTTP header. It is a relationship between the browser and the server that can be adjusted. There is also another cache for our content and resources: Content Distribution Networks (CDNs). They have an API which can be invoked to invalidate the cache upon a Change, or to set a Time to Live (TTL) in advance. All of these caches create a tradeoff between Rollouts and Delays on one side, and our servers processing unnecessary load and wasting network resources on the other.
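As a rough illustration of those knobs, here is a minimal sketch using Node’s built-in http module. The one-hour max-age is an arbitrary example value; the tradeoff lives in that single number.

```typescript
// Minimal sketch: serving a frontend bundle with the two cache knobs above.
import { createServer } from "node:http";

const deployedAt = new Date(); // stands in for the bundle's build time

createServer((req, res) => {
  res.setHeader("Last-Modified", deployedAt.toUTCString());
  // A one-hour TTL: browsers and CDN edges may keep serving the old bundle
  // for up to an hour after we deploy a Change. A longer TTL means less load
  // on our servers, but a longer Delay for every Rollout.
  res.setHeader("Cache-Control", "public, max-age=3600");
  res.end("/* application bundle */");
}).listen(8080);
```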

Apple and Google each have hundreds of millions of customers. It took 6 months for Apple’s iOS 15 to reach 63% of their customers, and another 4.5 months to reach almost 90%. Google’s Android 12 reached about 30% of US customers within 5 months, and took another 3 months to reach about 42%. A Rollout of an operating system could easily take years. It can be a problem for us, as these Rollouts are far beyond our control; if we depend on one, we would experience its Delay.

Ownership

For a more concrete and frequent use case, let’s recall that agents cross physical boundaries. Specifically, we can think of New Relic’s APM or Elastic’s Filebeat, both sending telemetry from one application to another and constituting a client-server relationship. Let’s play with the virtual boundaries between them.

Let’s assume we are the agent provider, owned by our team in our company. We also own and develop the remote collector. When we wish to make a Change to the collector, we can just go ahead and deploy it because our team also owns the server it’s running on.

But when we wish to make a Change to the agent, we run into a problem. The agents are installed on servers owned by our company, but not by our team. They would be updated according to the other team’s schedule, maybe in the next Sprint, which is a Delay of a week or two. That is because physical boundaries can lead to organizational boundaries and vice versa. Physical ones can determine the Rollout; organizational ones cause frictions that can determine the Delay.

When our agent is installed on our customers’ servers, the Delay can potentially be months. Our customers may not be in a hurry to update, or may have their own very slow workflows to approve each and every update. A self-updating agent sounds like a good solution, but maybe not a 100% viable one. Large enterprises may not allow software components to update freely, as they fear the consequences of an uncontrolled and unchecked update. Let’s not forget that making a robust self-updater can be a challenge of its own.

As a consequence, when our customers do decide to update our agent, they might be updating a bundle of Changes. The same goes for every application where the decision to update resides with the customer. We may not even know who our customers really are. It would be true for our B2C mobile application, where the updater is owned by the OS and an end user may choose not to update for a while. It would be true for our open-source packages, whose distribution is controlled by package managers such as npm for Node.js or pip for Python.

Rolling Instabilities

Because every Change is a potential Instability, its impact also depends on the Rollout. It would be more accurate to say it depends on the differences between our applications’ Rollouts. Recovering our collector is a matter of hours and days; recovering our agent would be a matter of days and weeks. Recovery for our frontend application would be a matter of hours; for our backend, it would be minutes.

A web application’s Rollout is a matter of minutes; for a native mobile application it might be a matter of weeks. React Native, an alternative to a native mobile application, can use CodePush, which significantly simplifies and shortens Rollouts. For the very same reason, desktop applications became obsolete: their Rollout was a matter of weeks, while for a web application it is a matter of minutes.

We should also factor in how much we have to lose because of a Delay. Let’s presume we’ve decided to put some of our eCommerce Pricing logic into our frontend web application. That one-hour Delay now has the potential to create a higher financial loss than the few-minutes Delay of our backend applications. A new feature that requires a Change only to our collector would make more customers happier sooner than one that requires a Change to our agent.

Never at Once

I can think of only one scenario where there are no Delays: when there is not only a single-server application, but only a single instance of it running. But that is also a single point of failure and an unreliable system. In a highly available system, where at least two instances are running concurrently, this scenario does not exist.

If we have multiple instances of an application running on multiple servers we own, Changing all of them together takes time, so they never Change at once. It is actually a gradual Rollout with a very small Delay, with all of its consequences.

It is possible to create a facade around it, making it seem like they were all Changed at once. One technique is a blue-green deployment, where new instances are deployed adjacent to the old ones, and traffic is re-routed and switched between the old and new, back and forth. The switch itself would take maybe a few milliseconds. But let’s not forget the dangers of time travel, of reverting.
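For illustration, here is a minimal sketch of that switch: a tiny proxy that routes all traffic to one of two upstream pools, and can flip between them, or revert, in a single assignment. The hosts and ports are assumptions, and a production setup would use a real load balancer rather than this toy.

```typescript
// Minimal sketch of a blue-green traffic switch (illustrative only).
import { createServer, request } from "node:http";

const upstreams = { blue: 9001, green: 9002 }; // two pools of instances
let active: keyof typeof upstreams = "blue";   // which pool takes traffic

createServer((req, res) => {
  // Forward every incoming request to the currently active pool.
  const proxied = request(
    {
      host: "localhost",
      port: upstreams[active],
      path: req.url,
      method: req.method,
      headers: req.headers,
    },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    }
  );
  req.pipe(proxied);
}).listen(8080);

// Somewhere in an admin endpoint or deployment script:
// active = "green"; // the switch; set back to "blue" to revert
```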

It is still a facade, meaning it has prerequisites, such as being stateless, having no sessions, or draining in-flight requests. It requires a load balancer, which has its own considerations, is not always in place, and needs to be highly available on its own. And the greatest facade exists on our workstation, where everything actually does Change at once. But “it’s working on my computer” is meaningless to Rollouts.

If nothing can be Changed at once, there will always be Delays. It is only a question of how long a Delay. And Delays are the reason we need to maintain backward compatibility, practice versioning, and test and maintain contracts/APIs between applications.
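A minimal sketch of what that backward compatibility looks like in practice, with hypothetical field names: during a Rollout window, old and new clients coexist, so the server must accept both payload shapes.

```typescript
// Minimal sketch: a tolerant server handler surviving a gradual Rollout.
type PriceRequestV1 = { sku: string };
type PriceRequestV2 = { sku: string; currency: string }; // field added in v2

function priceFor(sku: string, currency: string): number {
  return 42; // placeholder pricing logic
}

function handlePriceRequest(body: PriceRequestV1 | PriceRequestV2): number {
  // Old clients still inside the Rollout window won't send `currency`;
  // default it instead of rejecting them.
  const currency = "currency" in body ? body.currency : "USD";
  return priceFor(body.sku, currency);
}
```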

In this chapter and the previous one, we’ve seen some of the considerations and tradeoffs we need to be aware of prior to splitting an application and distributing and splitting Modules. With both physical and time boundaries to cross, we’re next going to explore another kind of relationship to split: the one with our data and our database. On this, starting in the next chapter.
