In the previous chapter, we went through some pull and poll based designs, in order to untangle the bundling of our Actors through the backend. We talked about how frequently our customers Change their data, and how quickly they expect any Change to propagate and take effect. We’ve seen how to match our designs with these expectations.
We also learned of a method to validate our designs, by checking our costs and the parameters by which they scale. Through it, it seems we have exhausted our pull/poll based designs. In this chapter, we’re moving towards push based designs. Hopefully one of those will prove a worthier candidate for a system architecture.
Event Driven Change
In our poll based design from the previous chapter, we had presumed for our story’s sake that S3 would be able to support such a design. It doesn’t. It has an entirely different mechanism, based on signals, on notifications.
S3 sends out many notifications. One of them tells whoever wishes to know that the file has {CHANGED}. Someone, like our very own microservice. On AWS, bridging between our microservice and S3 can be done quite quickly. We’d need to attach a Serverless Function to this notification, and AWS Lambda would be consuming it one notification at a time.
During consumption, our Lambda would be invoking our microservice’s HTTP endpoint, commanding it to refetch the data from S3. The effort to do the above is measured in hours. Done correctly, our Serverless Function would be a frozen application as well.
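A minimal sketch of such a Serverless Function, in Python, assuming the standard S3 event notification structure; the refresh endpoint and its URL are hypothetical placeholders:

```python
import os
import urllib.request

# hypothetical endpoint exposed by our microservice
REFRESH_URL = os.environ.get("REFRESH_URL", "http://our-microservice.internal/refresh")

def handler(event, context):
    # S3 delivers one or more records per notification
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # command our microservice to refetch data.json from S3
        request = urllib.request.Request(REFRESH_URL, method="POST")
        with urllib.request.urlopen(request) as response:
            print(f"refresh triggered for s3://{bucket}/{key}: {response.status}")
```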
As a result, there will no longer be any sort of polling whatsoever. Only a notification pushed and applications pushing one another. A dramatic decrease in costs.
Although simple, a custom coded application on AWS Lambda may not always be the best fit, and it also restricts and couples our design to one kind of compute. Fortunately, another method exists for our microservice to indirectly receive these notifications. It would require putting more effort into setting up a small messaging platform, based on SNS/SQS/EventBridge. Once set up, our microservice can register to it and would start receiving those notifications. In a later chapter we’ll be deep diving into those messaging components, setting up an entire messaging platform.
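The consuming side could look roughly like this, assuming an SNS topic fanning out to an SQS queue owned by our microservice; the queue URL and the refetch callback are illustrative placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-json-changed"  # hypothetical

def consume_forever(refetch):
    # long-poll the queue; each received message is one {CHANGED} notification
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for message in response.get("Messages", []):
            notification = json.loads(message["Body"])  # SNS envelope around the S3 event
            refetch()  # refetch data.json from S3 into memory
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```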
When multiple instances of our microservice are involved, each instance can register to our messaging platform on its own. Doing so, they would remain almost consistent with one another, as all of our instances would receive the same notifications at roughly the same time. Hard consistency, however, would not be feasible.
Both of the designs above are implementations of the well known Publish/Subscribe pattern, a messaging pattern.
Assuming Sizing
Even with these push based designs, the number of times we need to re-fetch the data in its entirety remains the same. As our customers Change the data far more frequently than our required propagation time of 5 minutes, we are performing more fetches than required. To save some costs, we can code our application to aggregate our {CHANGED} notifications, and perform a fetch only once per interval. That sounds like a suitable design, meeting both costs and expectations.
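A sketch of that aggregation, assuming a 5 minute interval and an illustrative fetch_data_json() helper; notifications merely mark the data as dirty, and at most one fetch happens per interval:

```python
import threading
import time

FETCH_INTERVAL_SECONDS = 5 * 60   # our propagation expectation
_dirty = threading.Event()

def on_changed_notification(message):
    _dirty.set()  # cheap: only remember that something has Changed

def refetch_loop(fetch_data_json):
    # at most one full fetch per interval, no matter how many notifications arrived
    while True:
        time.sleep(FETCH_INTERVAL_SECONDS)
        if _dirty.is_set():
            _dirty.clear()
            fetch_data_json()
```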
Let’s complicate the expectations a little bit with a different kind of applicative configuration data, one of security and permissions. Once a permission is granted, it’s okay for it to take a few minutes to propagate. Unfortunately, a few minutes is just what it takes for a very mad hacker to create some serious damage. So revoking is an operation which must be propagated as soon as possible.
The quickest solution would be to remove the aggregating code and go back to refetching the file on every notification. Although quick, it would also reincur the very costs we have already saved. In the previous chapter we’ve seen that for RapidAPI with its 72 microservice instances, the costs could easily have been about $36 a year per 1MB, and that they scale linearly with file size. A warning sign.
Let’s do a quick leetcode exercise to see how unrealistic 1MB can be. In AWS, the smallest permission policy I could think of was an empty one. I generated one and it weighs about 82 bytes. I’ve added two permissions, one to fetch our data.json file and another to get its metadata. It now weighs 418 bytes. On a rough average, each permission weighs about 168 bytes. So for an AWS customer that has 10 applications, each granted 10 permissions, it would sum up to 17.62KB. You’d have to trust me that this is one very light customer, and a 1MB file would hold only about 60 of those. We can expect our file size to easily grow larger and larger, and quite fast, because AWS already has a million customers.
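The same back-of-the-envelope arithmetic, written out as a few lines of Python:

```python
empty_policy = 82              # bytes, an empty policy
with_two_permissions = 418     # bytes, after adding two permissions
per_permission = (with_two_permissions - empty_policy) // 2   # ~168 bytes each

one_customer = 10 * (empty_policy + 10 * per_permission)      # 10 apps x 10 permissions each
print(one_customer)                      # 17,620 bytes, about 17.62KB
print((1024 * 1024) // one_customer)     # about 59 such customers fit into 1MB
```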
We can optimize our data.json file ourselves, but it might be non-beneficial, as we would be putting effort into something a database already takes care of for us. Data persistence is optimized for size exactly because of this tradeoff; any under-optimization would incur costs in networking. And when it comes to S3, the fact that costs scale linearly with file size would remain.
Assuming Dead Ends
Here’s a crazy idea: what if, instead of S3 reporting that our file has {CHANGED}, it would report what has Changed in the file? If possible, it would mean the emitted {CHANGED} message would also contain the entirety of the newly granted permission. When a permission is revoked, the message body would contain nothing but the permission identifier. Both messages would be consumed by our microservice, which would update its own in-memory copy. There would no longer be a need to re-fetch the file in its entirety, except on boot. Problem solved!
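Were such messages possible, applying them in our microservice would be trivial; a sketch, assuming a hypothetical message schema with action, id and permission fields:

```python
permissions = {}  # the microservice's in-memory copy, keyed by permission identifier

def on_changed_message(message: dict):
    if message["action"] == "granted":
        # the message carries the entire newly granted permission
        permissions[message["id"]] = message["permission"]
    elif message["action"] == "revoked":
        # the message carries nothing but the permission identifier
        permissions.pop(message["id"], None)
```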
Problem is, S3 can not do that. It can not report what has {CHANGED} within the file. It is also a blackbox, so we can not Change its code. In our fourth chapter, we differentiated between a design and a system architecture by reviewing dead-ends. As we may have reached another one, it is a good chance to elaborate on this matter.
In this chapter and the previous one, we are trying to resolve the Food Inventory problem at Silo: data shared in a way that bundles our Actors together through the backend. In a way, we have found a design that resolves it. Unfortunately there is a hidden assumption underneath it, preventing it from being a system architecture. The assumption that every storage we’ll forever and ever choose will be able to report that its data has {CHANGED}. Not all storages implement the concept of Change Data Capture (CDC).
If we force it on every future design of our colleagues, we’d be Restricting them from using ElasticSearch, ClickHouse and many other databases. Our “can” assumption of today turns into a “must” requirement of the future. It is not about feasibility. We could replace S3 in our design with a storage that supports CDC. It would make no difference, as all we would have done is replace one assumption with its exact opposite. Were it to become a system architecture, it would be one Restricting others from using S3 entirely.
A single Restricted design which solves one problem may be less of a concern for us. But a system architecture that Restricts all current and future designs is an eventually non-beneficial architecture. One day it will also Restrict us from fulfilling business requirements with our designs. This is exactly why architecture is hard to do. Assumptions need to be made, but each one Restricts and degrades it. But is assuming less any better? A conundrum.
[Note: for a different perspective on architecting, I recommend reading the follow up book Projection about project and product management for engineers. Specifically, the series The Winding Road about Roadmaps, architectural ones included.]
Assuming Less
A middle way between assuming one thing or the other is to assume both, or to assume neither. To bypass S3, let’s search for another component to emit the {CHANGED} event, one guaranteed to be able to do so. None better than the one that knows what and how it is going to Change our file: our web application. Not coincidentally, it is also an Actor and where our behavioral/business logic is.
Our web application would continue to load the entire JSON file, still weighing several MBs. It would add/remove/change a specific node and put the file back into S3. To our custom coded application, we’ll add the {CHANGED} emission containing the newly set permission. As our microservice already knows how to consume messages, all that is left is for it to parse the message differently.
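A sketch of the emitting side inside our web application, assuming an SNS topic as the messaging platform; the bucket, key, topic ARN and message schema are illustrative assumptions:

```python
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-json-changed"  # hypothetical topic

def grant_permission(data: dict, permission_id: str, permission: dict):
    data["permissions"][permission_id] = permission            # change the specific node
    s3.put_object(Bucket="our-bucket", Key="data.json",        # put the file back first
                  Body=json.dumps(data).encode("utf-8"))
    sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(        # then emit {CHANGED}
        {"action": "granted", "id": permission_id, "permission": permission}))
```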
Except during boot up, there will no longer be any re-fetching done by our microservice. File size is no longer a concern for it, though it may one day become one for our web application. As messages are consumed as soon as they are produced, propagation time is no longer a concern for us either. Consequently, we’ve found a design that not only satisfies a freshness of a few minutes, but satisfies near real time freshness.
As emitting a {CHANGED} message is no longer done by our storage, we require no additional capability of it. We have assumed less of it. However, we do now assume our web application can communicate with our messaging platform. The same goes for the messaging platform and our microservice. Done correctly, our assumption would be as little as assuming all three can perform and accept HTTP requests.
To assume that any two applications can form a client-server relationship is such a light assumption and such a basic thing to implement, that it is hard to imagine what it could Restrict. It’s starting to look like a candidate for a system architecture. Unless the Actor’s WiFi or internet is down, which for Silo and some other companies is quite possible, though only a temporary state.
We can say there is also an assumption made about our messaging platform, of it supporting the Publish/Subscribe pattern. It would need to fan out and spread messages to multiple instances and between multiple applications, and those would need to be able to register to it. And there is another assumption here, for it to be fast enough to provide a consistent looking experience to our customers. More on this later in this series.
Assuming Scaling
Our very last play with our frequencies is where one customer frequently changes his data. Or where we have millions of customers doing so less frequently. Or both. No matter which technical parameter we play with this time, it would mean our design has to handle a lot more traffic. Does it scale, and what should be scaled to meet this?
Continuing with our leetcode exercise, let’s say AWS has about 1 million customers. Each has one policy and each performs one action on it a day during working hours, either granting or revoking. If these actions are evenly distributed throughout those working hours, we’re talking about 2k actions a minute. To paint the picture with some real numbers, 2k per minute is not a lot. RapidAPI’s data platform consumed 80k messages per minute and that’s not a lot either. And it was ready for 10x more than that.
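The same estimate as a quick calculation, assuming an 8 hour working day:

```python
customers = 1_000_000
actions_per_day = customers * 1            # one grant or revoke per customer per day
working_minutes = 8 * 60                   # assumption: an 8 hour working day
print(actions_per_day / working_minutes)   # ~2,083 actions a minute, i.e. roughly 2k
```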
Scaling for message consumption is no different than scaling to meet incoming HTTP requests directly from clients/Actors. For our microservice instances to consume messages fast enough to maintain our consistent experience, more CPU for each and every instance might do the trick. For them to continue storing the entirety of the data in-memory, we can add more RAM. This is called vertical scaling, or scaling up. It is feasible as long as all the data fits in memory.
Of course more resources cost more money, but it is done within minutes and may be cheaper than a redesign. It’s a good option to keep in our pocket, to buy us the time needed, to postpone the redesign until it is eventually beneficial to execute.
Our messaging platform would need to scale as well. AWS’s managed services such as SNS, SQS, AWS IoT and EventBridge have near infinite capacity and throughput. If we were to manage it on our own with RabbitMQ or HiveMQ, we could first vertically scale those with more CPU and RAM. We could also scale them horizontally/out, adding more instances, shards and replicas. Hopefully these can be automated and taken care of by our underlying Compute infrastructure.
Although both our microservice and our messaging platform scale up and down according to the same incoming traffic, each does so independently on its own. As expected from mutually exclusive applications.
Applicatively speaking, our design breaks. Starting with the data.json file on S3, as it would be concurrently read and written to. The second thing to break is everything, all at once. If someone were to Change the data structure of our file or of our {CHANGED} messages, and inadvertently introduce an Instability, all of our microservices would no longer be able to process updates, rendering all of their consumers/clients helpless.
There are still a few loose ends to take care of, but it is starting to look like a good candidate for a system architecture. Before tying those, we should validate our design by inspecting an alternative. In the next chapter, we’re going to implement our Publish/Subscribe pattern with a database.