Through the years I was asked many times, by colleagues and during job interviews, whether I would have done anything differently. It was a really hard question for me to answer, and not only because Silo's system architecture never saw the light of day. It was because the company did not shut down due to any Technical Cause. Perhaps due to other kinds of Causes, or Causes that were not in my power to prevent from arising.
I really wish I could offer hindsight such as "although we did our best, we ended up paying too much for AWS, which caused our pricing to be too high in our market segment of smart kitchen appliances". Instead, in the years that followed our adventure at Silo, I mostly dwelled on everything else that led the company to shut down, incidents and matters I've kept out of this book. Meaning there's enough material for another one to write in my retirement, tentatively titled The Book of Failures.
Since Silo, I've worked for another company called RapidAPI. Many of the insights gained and the work done at Silo had an effect on RapidAPI. They will also have one on my next job, which I was accepted to as I was writing these very lines. Or as I call it, I got the opportunity to gather material for another book.
Since Silo, I've also had a lot of job interviews, where I shared the work we'd done. I've met with many fellow architects. I've worked with people who were far more experienced than me. So in a way, I did gain a lot of insights. I've shared some of those already, spread along the chapters of this book. In this one, I'm going to share the last few. And one final thought.
The Real Cost-Effectiveness
Throughout this series, we've talked a lot about the importance of costs. In October 2022, just as I was writing about costs, I had a job interview with the CTO of a medium-sized startup in Israel. He asked me what parameters I consider in my designs. My first answer didn't surprise him, but my second very much did: "I always check costs and alternative costs. Not only does it give me a cost baseline, it also uncovers cost leakages, which may translate to faults in my designs."
I'll be honest, I've done only part of the math required to know how much it would cost to run Silo's system architecture. However, in this series we have pointed out exactly its two most expensive parts: the message fan-out and the Event Analytics. We showed that both of those costs can potentially be reduced, but we might also be missing the bigger picture of these costs.
Let's imagine a future day when we get the bill for Event Analytics. It stands at 200 thousand US dollars. Not coincidentally, I've picked a number equivalent to one year's salary of a software engineer with 10 years of experience in Israel. If Event Analytics saves the company from hiring him, it would be a cost-effective expense – and it is. Because it avoids Inefficiencies in advance, and because we've proved its value in Silo's PoC. Moreover, the entire system architecture was created to slow down wasteful evolutionary processes as much as it can, both within a single application (The Change Driven Design series) and between multiple applications (The Breaking Change series).
But that's on paper. Can it actually be measured? No. Because we cannot measure what we have prevented from occurring; you cannot measure "not a loss". Or at the very least, that's what I thought.
I had an interview with another CTO of an Israeli startup. He wished his company would perform like the ones in his favorite book, Accelerate. I found it most amusing that my potential new manager, the next interviewer, had not read it. I had, within a single weekend, because I prefer to come to interviews as prepared as possible.
Accelerate is the most extensive research ever done on the relationship between engineering-product efficiency and business results. The two are correlated: the more inefficient the former, the worse the long-term business results. If Accelerate is about correlation, The Change Factor is about causation. In it we've pointed out exactly where Inefficiencies are, and there's plenty more in the follow-up book, Projection. It's good to know that someone else has proven that knowing how to avoid Inefficiencies is an eventually beneficial effort. It is indeed cost-effective.
Okay, but one day I'll be gone. There will be another CTO or architect after me. If he does not know these matters, he'll go ahead and kill the Event Analytics because it's the most expensive component. It was the same with job interviews: I needed to convince someone unaware of these topics that what looks like the most expensive and most complicated system architecture is not over-engineered, but one that actually saves money and effort.
Non-Human Actors
We should notice what Silo's three main Actors had in common: they were all operated by humans. More than that, Food Management is a User Journey and an experience for humans. But not all Actors are operated by humans, and not all experiences are for them.
We had planned for such a non-human Actor to join us soon, a future integration with a company called Yummly, owned by Whirlpool. But as our User Journeys were already eventually consistent and asynchronous, adding Yummly as an Actor would have no effect. They were to receive relevant Events from us, just like each and every one of our own Actors does. If they needed to trigger our API to participate in our Food Management experience, they'd do so in a mutually exclusive way, just like any other Actor and application. But that is because Yummly's effect on Food Management would eventually be presented to our human customers.
Some Actors are non-human; they are machines. Some of them require hard consistency and extremely fresh data, and in some cases they are the main Actors. High-frequency trading, for example, is done by machines that are extremely sensitive to even sub-millisecond latency. But when human Actors participate in a non-human experience, it may still be an eventually consistent experience for the humans. HFT must be sub-ms, but a UI that shows all trades made may show them with a slight delay, as a human cannot grasp a sub-ms experience. Humans cannot perceive anything below 250-400ms.
So both humans and non-humans can live within the same system, still maintaining Cohesion, Reliability and mutual exclusivity. All a non-human Actor needs to do is emit Events, and all the platform needs to do is withstand the throughput.
An actual scenario I took part in was at RapidAPI (2021). We had a proxy allowing consumption of APIs, a middleware between an HTTP request and reply, for which every millisecond of delay counts. It asynchronously emitted messages to our data platform. A non-human Actor consuming the API would know no difference. A human Actor would be able to view the API's analytics about a minute afterward.
A human Actor who wished to update his API's definition, something that changes infrequently, got immediate feedback that the change was good to go, but the actual update propagating to the proxy was a matter of seconds, an eventually consistent experience.
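To make that concrete, here is a minimal sketch of such fire-and-forget emission, assuming an asyncio-based proxy. The class name, fields and transport are hypothetical illustrations, not RapidAPI's actual implementation; the point is that the hot path only enqueues, while a background task ships batches to the data platform, so analytics can never add latency to the request/reply path.

```python
import asyncio
import json
import time

class AsyncEventEmitter:
    """Fire-and-forget emitter: the proxy's hot path never waits on the data platform."""

    def __init__(self, flush_interval: float = 0.1):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._flush_interval = flush_interval

    def emit(self, event_type: str, payload: dict) -> None:
        # Non-blocking, called per request from the proxy middleware.
        self._queue.put_nowait({"type": event_type, "payload": payload, "ts": time.time()})

    async def run(self) -> None:
        # Background flusher: drains the queue and ships batches onward.
        while True:
            await asyncio.sleep(self._flush_interval)
            batch = []
            while not self._queue.empty():
                batch.append(self._queue.get_nowait())
            if batch:
                await self._publish(batch)

    async def _publish(self, batch: list) -> None:
        # Stand-in for the real transport (e.g. a Kafka producer).
        print(f"shipped {len(batch)} event(s): {json.dumps(batch)}")

async def demo() -> None:
    emitter = AsyncEventEmitter()
    flusher = asyncio.create_task(emitter.run())
    emitter.emit("API_CALLED", {"api": "weather", "status": 200})  # returns immediately
    await asyncio.sleep(0.3)  # give the background task a chance to flush
    flusher.cancel()

asyncio.run(demo())
```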
Thinking of non-human Actors might shed light on a possible limitation of this system architecture: a scenario of two non-human Actors sharing a low-latency, hard-consistency experience. Once there's a messaging platform between them, a contradiction arises with the product requirements that gave birth to this architecture. The two non-humans can be joined into a single Actor, such as a Platform API. The two together would be mutually exclusive from all others, but not from each other.
Fitting Topologies
After Silo, at RapidAPI (2021), I was also managing the data team, where our mission was to replace the company's entire data platform, one that was also destined to be a messaging platform.
We had a Gateway planned, and we were wondering whether we should place a message broker that would fan out and route incoming messages to either the data platform (Kafka-Spark) or a future, yet unknown messaging platform (probably RabbitMQ), something similar to Silo's.
Roman, who had recently joined us and was far more experienced than all of us combined in data-driven infrastructures, said it was unnecessary. Whoever wishes to publish a message knows what kind of message it is and to which platform it should go. He said it would be better to simply expose an endpoint per platform in our Gateway and let the publisher decide, because the publisher knows best. If there were ever a need for it, such as long-term retention or auditing, the messaging platform could forward messages to the data platform.
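Roman's "endpoint per platform" might look something like the following sketch, assuming a FastAPI gateway. The endpoint paths and the stand-in producer functions are my own illustration, not RapidAPI's actual code; the essential part is that choosing the endpoint is how the publisher declares the message's kind.

```python
from fastapi import FastAPI

app = FastAPI()

# Stand-ins for the real clients (e.g. a Kafka producer, a RabbitMQ
# channel), kept as plain functions so the sketch stays self-contained.
async def send_to_data_platform(event: dict) -> None:
    print("-> Kafka (data platform):", event)

async def send_to_messaging_platform(event: dict) -> None:
    print("-> RabbitMQ (messaging platform):", event)

@app.post("/analytics")
async def publish_analytical(event: dict):
    # The publisher chose this endpoint, so it has already decided the
    # message is analytical and belongs on the data platform.
    await send_to_data_platform(event)
    return {"accepted": True}

@app.post("/messages")
async def publish_operational(event: dict):
    # Operational messages go to the messaging platform; if retention
    # or auditing is ever needed, that platform can forward onward.
    await send_to_messaging_platform(event)
    return {"accepted": True}
```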
We made no such distinction at Silo, so it was a bit strange for me to hear his design. I really wanted to get to the bottom of how he saw it so differently than I did. We sat for about an hour and a half, and together we went through Silo's architecture. He couldn't find anything wrong with it, and was sad to hear it didn't go live. He too wondered what it would have looked like live. However, we managed to put a finger on the source of the difference between the two architectures.
All of Silo’s Actors were emitting a canonical message format, Events. Some Events were reporting an action, such as BUTTON_CLICKED. Some Events were reporting progress in a state/experience, such as LABEL_DONE. All were reported to one client-facing Gateway, AWS IoT.
But whether an Event was just an {analytical} one for tracking usage, or an {operational} one for another application's consumption, was not solely the publisher's decision, because it was a reactive system. Meaning that before delivery, an Event can be both {operational} and {analytical} at the same time.
If the Event Analytics wishes to consume LABEL_DONE, it will be an {analytical} message. If the Food Management Service wishes to consume the very same LABEL_DONE Event, it will be an {operational} one. It is up to whoever consumes the Event to determine what kind of Event it is. And I'm not sure how unique that is to Silo's architecture.
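A minimal sketch of that idea, with hypothetical field names and a hypothetical consumer registry: the canonical Event carries no kind at all, and its kind is derived from whoever subscribes to it.

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class Event:
    """Canonical message format: note there is no 'kind' field."""
    name: str            # e.g. "LABEL_DONE", "BUTTON_CLICKED"
    actor_id: str
    payload: dict = field(default_factory=dict)
    ts: float = field(default_factory=time.time)

# Hypothetical registry: which consumer is of which kind, and which
# Events it subscribes to.
CONSUMER_KIND = {
    "event-analytics": "analytical",
    "food-management": "operational",
}
SUBSCRIPTIONS = {
    "event-analytics": {"LABEL_DONE", "BUTTON_CLICKED"},
    "food-management": {"LABEL_DONE"},
}

def kinds_of(event: Event) -> set[str]:
    # The Event's kind is a property of its consumers, not its publisher.
    return {CONSUMER_KIND[c] for c, names in SUBSCRIPTIONS.items() if event.name in names}

print(kinds_of(Event(name="LABEL_DONE", actor_id="device-42")))
# {'analytical', 'operational'} -> the same Event is both at once
```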
Throughput
I had another job interview, this time with a software architecture team leader of a large Asian-Israeli enterprise. We came to the matter of {operational} vs. {analytical} messages. I told him about the non-distinction we'd made at Silo. He said it was an extremely dangerous non-distinction to make.
When I asked him how come, he said it was about throughput, and gave an example. Unlike operational messages, analytical messages are sampled at a very high rate, like the CPU usage of an entire data center sampled consistently every 50ms. In these cases, he said, you'd need different kinds of infrastructure components. It's the kind of throughput messaging infrastructures are expected to fail to handle, and that's what streaming infrastructures are for.
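To put a hedged number on his example: the fleet size below is my assumption, not his, but it shows how quickly a 50ms sampling rate escalates into the tens of millions of messages per minute.

```python
# Back-of-the-envelope throughput for the 50ms CPU-sampling example.
machines = 20_000                    # hypothetical data center size
samples_per_sec = 1 / 0.050          # one sample every 50ms -> 20 per second
msgs_per_sec = machines * samples_per_sec

print(f"{msgs_per_sec:,.0f} msgs/sec")           # 400,000 msgs/sec
print(f"{msgs_per_sec * 60:,.0f} msgs/minute")   # 24,000,000 msgs/minute
```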
That wasn't news to me, and I knew Silo's architecture was built to easily withstand tens of millions of messages per minute. I really had no idea how it could be that both of us, a world-class architect and I, were right.
A few days afterwards it hit me. AWS IoT is a message broker intentionally designed to withstand the extremely high sampling rates of millions of sensors. When a platform manages to withstand a high throughput of {analytical} messages, surely it can handle the {operational} ones. It was technically enabled to be agnostic to what kind of message passes through it.
From Silo's messaging platform's point of view, there was no distinction between the two, only in routing: the {analytical} ones to streaming components, such as Silo's Event Analytics and Event Store, which are based on Kinesis Streams; the {operational} ones to a messaging component, Silo's service-to-service message broker, based on SNS.
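The routing itself is then trivial, as this illustrative sketch shows. It is written in Python for readability; in Silo's actual design this role was played by the broker's forwarding to Kinesis Streams and SNS, and the mapping below is hypothetical.

```python
# The broker stays agnostic to message kind; only delivery differs.
ROUTES = {
    "analytical": "kinesis",   # streaming: Event Analytics, Event Store
    "operational": "sns",      # messaging: service-to-service broker
}

def deliver(event_name: str, kinds: set[str]) -> list[str]:
    """Return the platforms a single Event is forwarded to."""
    return [ROUTES[kind] for kind in sorted(kinds)]

# LABEL_DONE is consumed both ways, so the broker forwards it twice:
print(deliver("LABEL_DONE", {"analytical", "operational"}))  # ['kinesis', 'sns']
```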
But there might be a theoretical throughput even AWS IoT could not handle, and it would then be required to publish those messages directly to a streaming platform. And if per-message acknowledgment is required, there is Pulsar for that.
If he reads this one day, maybe then he'll hire me. Maybe one day. If I'm actually right about what I wrote above.
Data-Driven Company
While working for RapidAPI, I've gained more experience in the discipline of Big Data and data infrastructures. Out of it, I can theorize a kind of company or use case where this system architecture might not be a beneficial fit, or might require some tweaking.
Let's recall that in our system architecture, each of our Services has a dedicated view stored in a dedicated database for persistence. A view which is derived from the actual data, the Events. A view that is also a fraction of all of our data. It's important to ask ourselves whether it's also a fraction in size, as GBs are a resource we pay for.
For example, if I remember correctly, at RapidAPI we stored about 0.5TB of data a day in the company's source of truth. And that's not a lot, by the way. Meanwhile, the derived view of one of the company's data-driven products grew by only a few tens of MBs a day. It really was a small fraction, nothing to worry about.
Now imagine a company whose source of truth grows by TBs a day, and which also has two Services whose views each grow by TBs a day as well. This may very well be data duplication and/or data redundancy we'd be paying dearly for. It is also sometimes a necessary evil, because dedicated views are required for Command Query Responsibility Segregation (CQRS), and/or to withstand both extremely high and concurrent write and read throughputs.
In a scenario where our multiple Services and multiple databases continue to grow both in number through the years and in TBs a day, it will cost dearly, a cost that I cannot say for sure would remain cost-effective even with Cohesion and Reliability.
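A hedged back-of-the-envelope for that scenario; the growth rates, service count and $/GB-month price below are all my assumptions, but they show how the derived views, not the source of truth, come to dominate the bill.

```python
# All figures are illustrative assumptions, not measured values.
SOURCE_GROWTH_TB_PER_DAY = 2.0     # source of truth
VIEW_GROWTH_TB_PER_DAY = 1.0       # per dedicated view
SERVICES = 8                       # each Service has its own view
PRICE_PER_GB_MONTH = 0.023         # ballpark object-storage pricing

def monthly_storage_cost_after(years: int) -> float:
    days = years * 365
    source_tb = SOURCE_GROWTH_TB_PER_DAY * days
    views_tb = VIEW_GROWTH_TB_PER_DAY * days * SERVICES
    total_gb = (source_tb + views_tb) * 1024
    return total_gb * PRICE_PER_GB_MONTH   # the monthly bill at that point

for y in (1, 3, 5):
    print(f"after {y} year(s): ~${monthly_storage_cost_after(y):,.0f}/month")
# after 1 year(s): ~$85,965/month ... and the views are 80% of it
```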
The Baton
I think one of Silo's system architecture's weak spots is not in its technicalities per se. It is complex, but it is also just a drawing of some infrastructure components connected to one another, plus a canonical message format. But saying that would also belittle it.
For someone else to reverse engineer it back to the thoughts, ideas and processes that led to it, all the very fine and entangled details, would not be an easy task at all. As such, I see two last couplings in this system architecture.
First, it is tightly coupled to itself. It is extendable, but removing any single component of it will break its cost-effectiveness, and might break it entirely. If someone were to remove the Event Store, no more dedicated views could be produced, and it would not be possible to launch new Services. It might be a very delicate architecture. I'm not sure it can go live in any way other than "all together at once, when the company is founded".
Lastly, it might be tightly coupled to a single employee in the company. It just happened to be me. I had the opportunity to pass the baton to Kiril and Guy, though probably with a lot of information lost along the way. They did not have the opportunity to really do anything with it, as the company shut down shortly afterwards.
An architecture that was never proven inheritable by someone else, is it a beneficial one? Is it now, after the highly detailed description in this book? Luckily, I no longer care. I did not write this book so that you, the one reading these lines, would prove it for me or continue my work. I wrote it so that you can make your own architecture for your own needs. For me, it was both a burden lifted and a dream come true. For you, it might require reading it all over again to gain a few more underlying insights.
Either way, you now know better than before. Now you can do better. I believe in you. And I am thankful to you for reading this. By doing so, you've brought value to my work. Because you and I, we all just play our little parts in The Force of Change. We are The Change Factor.