In the last two chapters, we’ve discussed when an optional split is beneficial and when it is not. In one chapter we saw it depends on the frequency of Change, and in the other we saw it depends on Cohesion of Change. Both can be determined in advance and used to anticipate the result of a split.
We’ve seen that a beneficial split is one that increases our Throughput and maybe even our Reliability. But we’ve only reviewed use cases that ended up with beneficial splits.
In this chapter, we’re going to review a use case with a non-beneficial outcome. Through it we’re finally going to see the entire picture of what splitting applications has to do with Throughput, Bottlenecks and the Change Stream. And it will end with a twist I made sure you won’t see coming!
Following Microservices Architecture, RapidAPI had an application between the frontend and the backend applications. It was a server-side application called API Gateway, a layer shared by all of the backend applications. It took care of some shared logic, such as authentication and authorization. But I suspect that, due to the company’s structure, it ended up with more responsibilities than that. It also became the GQL Endpoint and the Backend for Frontend.
As it had multiple responsibilities, it had many Modules. It is not an easy task to plan how multiple Modules should intersect in a single application, especially when they are added gradually. I guess this is why the application ended up with no design or an incorrect one. There were also multiple engineers contributing code to this application: it withstood the Throughput of Change of 10-15 engineers. Change fuels the evolutionary processes, and as that Throughput hit a design that could not withstand it, the application eventually evolved into a Monolith.
These were the same conditions that had evolved RapidAPI’s frontend into a Monolith as well, one that we had split into micro-frontends. Micro-frontends came to fruition only recently, whereas splitting backend applications has been an established practice for decades. Because there was a technical solution that we had chosen not to apply, it was an Inefficiency. The Monolith itself was a Bottleneck to be removed from our development workflow.
But that wasn’t the only Bottleneck the company experienced. When a Product Manager had a new feature request, a Cause to Change that turns into a Change, an engineer would need to make at least 4 Changes to 4 different applications.
Problem was, it required the involvement of at least three engineers from three different teams, each owning only a part of the Change. As ownership entails gradual Rollouts and Delays, there were a lot of those. Combine them all and we got a Bottleneck in the development workflow, resulting in a Back Pressure for delivering product Changes.
Applications were owned by engineers, but no one owned a Change. When there was a bug in one of Monetization’s features, who would be the one looking into it? No matter which engineer in which team we’d choose, the odds are the ball would be kicked between them, as they each owned a different part of it. Another Bottleneck. It would take longer to fix bugs, thus keeping our customers unsatisfied for longer.
This scenario is not limited to multiple applications. It happens with a single one as well. A Change will be fragmented into multiple Changes because of a mismatch between the Change Stream and the application’s design and/or the company structure. Splitting applications does not guarantee that we’d avoid fragmentation, and in the next chapter we’ll see it can also be the cause of it.
As the company’s structure had something to do with it, a defragmentation process had started. A reorganization.
The company restructured into a team per business domain. Each team consisted mainly of full stack engineers, who took end-to-end ownership of entire Products and Features. As expected, it reduced the frequency and overhead of orchestrating between multiple product managers and teams. It also somewhat reduced the product delivery Delays and somewhat solved the ownership issues. This is what we expect when we better match the Change Stream.
In contrast, and not surprisingly, the reorganization had no effect on the application called API Gateway. Because awareness is a human trait, applications cannot be aware of an entity called an organization or of its structure. API Gateway still had to withstand the very same Throughput, because all engineers still needed to make a Change to it for almost any Cause, bug or feature. The fact that engineers switched teams changed nothing. It was still the same Monolith going through the same evolutionary processes at the same rate. And it was still a Bottleneck.
The reorganization did uncover something very interesting, a potential Inefficiency that may have existed forever. Although Changes to 4 applications had always been needed, only when a single person made all of them together did we notice how frequently that was required. No matter what the Task was, no matter which team it was.
Frequency of Penalty
Separating the frontend from the backend application is a practical and technical must. So is persisting the data in a database. The decision to add another application, the API Gateway / Backend for Frontend, was an optional one. Since it was optional, we already know that whether it was a beneficial or non-beneficial split has a lot to do with frequency of Change and Cohesion of Change. For this, let’s first examine API Gateway as a single application: its grouping of Modules, its intersecting Modules and the Intention to Change them.
In the context of a single application, we wondered which model is better (1) on the left or (2) on the right. For a single Change coming through tunnel III, where we’d need to Change A & D together, there would be a penalty to pay in model (2) that does not exist in model (1). The penalty of making two Changes to two independent groups of Modules, instead of just one Change to the intersection between them.
Is it a penalty worth paying? To answer that, we asked ourselves how frequently Changes to both together would be coming through tunnel III. It was followed by a realization that it’s more about the distribution of Changes between multiple tunnels, between Directions of Change. If most Changes are coming through tunnel III, then model (1) would eventually be more beneficial than model (2).
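To make the idea of distribution concrete, here is a minimal sketch. It assumes we keep a log of which tunnel each Change came through. The log itself, the tunnel labels and the function names are made up for illustration and are not RapidAPI data. It only measures how often the model (2) penalty would be paid, not how large that penalty is.

```python
from collections import Counter

def tunnel_distribution(changes):
    """Return the share of Changes that came through each tunnel."""
    counts = Counter(changes)
    total = len(changes)
    return {tunnel: count / total for tunnel, count in counts.items()}

def penalty_frequency(changes, penalized_tunnel="III"):
    """Share of Changes that would pay the model (2) penalty.

    Assumption: only Changes through tunnel III (A and D together)
    are penalized in model (2), as described in the text.
    """
    if not changes:
        return 0.0
    return changes.count(penalized_tunnel) / len(changes)

# Hypothetical log: one entry per Change, labeled by its tunnel.
changes = ["III", "I", "III", "III", "II", "III", "I", "III"]

distribution = tunnel_distribution(changes)
share = penalty_frequency(changes)
print(distribution)
print(f"Changes paying the model (2) penalty: {share:.1%}")
```

In this made-up log most Changes come through tunnel III, so under model (2) the penalty would be paid on most Changes. A log dominated by tunnels I and II would point the other way.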
These are all properties of the Change Stream, an abstract force beyond our control. As such, they also apply to multiple applications. The only difference is that unlike grouping Modules in a single application, a split adds a physical boundary for our applications to cross. So we have two states to compare between, before and after a split. And we also have two comparable penalties.
We’ve reviewed the outcomes of putting a physical boundary between two applications.
They can no longer Change together, which forces us to deploy one application after the other, and this creates an Inefficiency for our engineer. They also cannot Change at once, which entails a careful design for our applications and their Rollouts. And there is now a client-server relationship to maintain between the two. This is our penalty to pay when a single Change comes through tunnel III.
If most Changes are coming through tunnel III, we’d be paying this penalty on a regular basis. It would be a constant Inefficiency and a Bottleneck in the development workflow. This would be a non-beneficial split.
Eventually Beneficial Design
For everyone in the Monetization team, that was a daily penalty to pay. Because Monetization’s business logic was split between API Gateway and its backend service, every new feature required a redundant Change to API Gateway. Contrary to its purpose, the Backend for Frontend’s affinity ended up being with the backend service, and not with the frontend, of which RapidAPI had only one. If two entities constantly Change together, they are cohesive. And if they are cohesive, it would make sense for them to be together as one.
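One way to check whether two applications "constantly Change together" is to count it. The sketch below assumes each Change is recorded as the set of applications it touched. The application names and the sample history are hypothetical, and the ratio is just one simple measure of cohesion, not a formal definition from this series.

```python
def co_change_ratio(change_sets, a, b):
    """Share of Changes touching `a` or `b` that touched both together."""
    either = [c for c in change_sets if a in c or b in c]
    if not either:
        return 0.0
    both = [c for c in either if a in c and b in c]
    return len(both) / len(either)

# Hypothetical history: each set lists the applications one Change touched.
change_sets = [
    {"frontend", "api-gateway", "monetization-service"},
    {"api-gateway", "monetization-service"},
    {"frontend", "api-gateway", "monetization-service"},
    {"frontend"},
    {"api-gateway", "monetization-service"},
]

gateway_backend = co_change_ratio(change_sets, "api-gateway", "monetization-service")
gateway_frontend = co_change_ratio(change_sets, "api-gateway", "frontend")
print(f"api-gateway with monetization-service: {gateway_backend:.0%}")
print(f"api-gateway with frontend: {gateway_frontend:.0%}")
```

In this invented history the gateway always Changes with the backend service and only sometimes with the frontend, which is the kind of affinity described above.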
Let’s review a non-final design we came up with to solve the above. We came up with it using all the insights we’ve seen in this series and in the Change Driven Design series. At the time of writing, it was not yet completed at RapidAPI and it has probably ended up different. Not because it was a non-beneficial design, but because there were far more urgent business issues to attend to.
All the business logic from the API Gateway would be moved to the backend services, where it belongs. The role of Authentication & Authorization would be its own layer. Whether it would be a package or a microservice was to be decided (more on that later in this series). Each backend service would expose a GQL Endpoint and keep the current GQL contracts, thus maintaining backward compatibility during the Change.
The GQL Federation would do the cross queries between services, just like the previous Backend for Frontend Modules did. Only this time, it would have an affinity with the frontend.
The database would remain one for the time being, as we were continuously querying the entire dataset when we needed to investigate charge and customer issues. To overcome noisy neighbors [ref to Noise] and to scale, read replicas can be used when eventually needed. We’ve also started to consolidate, move and mirror Monetization-related data from other domains and services.
Following up on the principle of what shouldn’t Change together shouldn’t Change together, we’ve grouped multiple Flows together into business subdomains. One of those was the charge subdomain. It was a business critical process/flow that was not expected to Change any time soon, and thus would remain stable. And it had its own clear and independent Causes to Change. Charge would be isolated from the rest of the applications, which were expected to Change. Very similar to how we split Quota and Analytics to gain a higher reliability.
[Note: while researching for this chapter, I’ve also run into Event Storming, a collaborative thought process that ends with the same outcomes as the above principles and does not contradict them]
Charge itself would have been split further into smaller applications, due to its asynchronous nature (beyond the scope of this series). But no matter how much further it would end up being split, it would still be isolated from all the other subdomains and their applications. This would also allow us to do a more beneficial and continuous refactor, one endpoint or subdomain at a time.
We should notice that the design actually has more layers of applications than before. The difference is that when Ellah from the Monetization team wishes to make a Change, she would need to make it neither to the GQL Federation nor to the Authentication and Authorization layer. There is no business logic there, so she will have one less Change to make. They have their own Causes to Change, and their own independent Rollouts. Which is an interesting Technical vs. Non-Technical mutual exclusivity.
As there would be one less application to Change when needed, there would be one less deployment to make. It would be a Bottleneck removed for the Monetization team, thus would increase their Throughput. That is guaranteed. As Change is what fuels the evolutionary processes, we predicted what would happen to each one. As each would have its own independent Direction of Change, each would evolve independently and slower than any other combination.
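The "more layers, fewer Changes" point can be counted with a small sketch. The application names below, and which of them a feature Change touches, are assumptions made for illustration. The text only says that at least four applications had to Change before, and that the GQL Federation and the Authentication and Authorization layer would hold no business logic afterwards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Application:
    name: str
    changes_with_feature: bool  # True if a product feature Change must touch it

def deployments_per_feature(apps):
    """Count the applications a single feature Change has to be deployed to."""
    return sum(1 for app in apps if app.changes_with_feature)

# Hypothetical "before": the gateway holds business logic, so it Changes too.
before = [
    Application("frontend", True),
    Application("api-gateway", True),
    Application("monetization-service", True),
    Application("database", True),
]

# Hypothetical "after": more layers, but two of them hold no business logic.
after = [
    Application("frontend", True),
    Application("gql-federation", False),
    Application("auth", False),
    Application("monetization-service", True),
    Application("database", True),
]

print(len(before), "layers,", deployments_per_feature(before), "deployments per feature")
print(len(after), "layers,", deployments_per_feature(after), "deployments per feature")
```

Under these assumptions the design grows from four layers to five, while a feature Change goes from four deployments to three.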
Specifically, each would evolve into a Monolith more slowly because the Throughput would be split between them. But let’s not forget the design also determines how soon it will happen. As our applications would be correctly cohesive, there would not be unnecessary bundling. Legacy is beyond our control but mitigated to allow gradual technological shifts.
Our applications would end up being at the right size. Because size doesn’t matter. Or maybe it does. On this, in the next chapter.