After a long journey, we managed to define what Eventualism is in the previous chapter. I find that thinking in terms of “it will eventually happen” is also beneficial while designing. Saying it pushes us away from the unhelpful option of doing nothing; it makes us stop disregarding the situation at hand and give it serious thought. The outcome is a process of discovery that can lead to better and more robust designs. It helps us better grasp what we should actually do now, what should be done later, and what should not be done at all because a good enough reason does not yet exist.
Everything Fails
Back in university (2008), I attended a class in distributed systems. The grade was composed of a 10% theoretical test and a 90% hands-on design and coding assignment. Because it was also a course about robust systems, the expectation was that, no matter what, the submitted application would eventually recover and continue.
We created a peer-to-peer, real-time auction platform from scratch. Part of the assignment’s final submission was to get it up and running on 5 computers, and the tutor would go around at random and start unplugging network cables. A minute later he would plug them back in. The only way to succeed was to design and code under the assumption that everything will fail, properly anticipate it, and test what happens when it does. The assignment took the 4 of us about 3 months, and we got a perfect score of 100.
That’s how I do designs to this very day: going through one component after another and saying “this one will eventually fail. What will happen then?”. That applies both to how the system reacts to Change and to how it behaves at runtime. It is also the basis of Chaos Engineering, which is an excellent practice for ensuring and validating that systems are robust.
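To make the “what will happen then?” question concrete, here is a minimal Python sketch of the kind of test it leads to. It is not the code from the university assignment: FlakyLink and send_with_retry are hypothetical names invented for illustration. The link simulates an unplugged cable for a fixed number of calls, and the sender keeps retrying with exponential backoff until the link comes back or the attempts run out.

```python
import time


class FlakyLink:
    """Hypothetical network link that stays 'unplugged' for the first N sends."""

    def __init__(self, outage_calls: int) -> None:
        self.remaining_outage = outage_calls
        self.delivered: list[str] = []

    def send(self, payload: str) -> None:
        # Simulate the tutor pulling the cable: fail until the outage is over.
        if self.remaining_outage > 0:
            self.remaining_outage -= 1
            raise ConnectionError("cable unplugged")
        self.delivered.append(payload)


def send_with_retry(link: FlakyLink, payload: str,
                    max_attempts: int = 5, base_delay: float = 0.0) -> int:
    """Send with exponential backoff and return the number of attempts used.

    Re-raises the last ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            link.send(payload)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


link = FlakyLink(outage_calls=3)
attempts = send_with_retry(link, "bid:42")
print(attempts, link.delivered)
```

The interesting part is the second case: when the outage outlasts max_attempts, the error surfaces instead of being swallowed, and that is the point at which the design has to say what happens next.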
At Gartner (2016) I did a design review for a colleague. It was a flow that involved about 7 components spread worldwide. “Let me tell you what will eventually happen. A fellow engineer will accidentally delete the database, because he has permission to do so. It will happen in Israel on a Sunday morning. That will bring down our website entirely. The data is held by another company located on the US west coast, meaning it will be 36 hours until we get a response from them, because they don’t work on Sundays, plus the time zone difference. When they do respond, they will find that the SSH key is obsolete and a new one needs to be generated. That will involve security compliance, which will take a day or two. Once they get the new key, they will try to run the copy process that no one has run in a year, and it won’t work. It will take 3-5 days until it is back up and running again. That’s 7-8 days of downtime”.
I must say that Eventualism has another drawback: sometimes its results are so obscure that it’s no wonder my colleague said “well, that’s what you think”. I can’t blame him for deciding to do absolutely nothing with the new information, as back then I couldn’t even explain what I was doing. Three months afterward it happened. Word for word. Luckily it was before launch and in the staging environment. The entire development team was on hold for two weeks because of it. Not coincidentally, it also meant a two-week delay in delivery. “How did you know?!” he asked me. “The colleague who deleted the database. That was the most likely part, because every human eventually makes an error. The rest just happened because enough time had passed for it to happen”.
Open Heart Surgery
A message broker was a central component of Silo’s system (2019). I won’t go into the details right now; I’ll just say that asynchronous communication between the physical devices and other applications was a must. The only question was which message broker to choose. There were basically two pricing models for fully managed brokers:
- Pay per message, less than a cent per message (SNS with or without SQS)
- Pay for running time, which for a 24/7 server adds up to hundreds of dollars per month per environment (ActiveMQ or RabbitMQ)
At the time, I had no idea how many messages a single physical device would emit. I had no idea how many devices the company was going to have. The only sure thing was that everyone wished for a million units as soon as possible. Nobody could even guess how long it was going to take to sell and ship that amount. It was also foreseeable that it would be at least a year until product launch. I was working with barely any viable information, only wishes, to make an economics-based decision. With which model are we going to pay less? Supposedly it was a binary decision, the kind our binary-thinking minds love so much.
I did some math and saw it would be about 2-4 years until the two models reached a point of equilibrium, an equal monthly bill. The obvious choice for today was to pay per message. In a very early stage startup you are not supposed to worry about what is going to happen in 4 years: “don’t worry about it”, “don’t even think about it”, “we’ll see when we get there”. Thinking like that is just not me.
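The original numbers behind that math are not recorded here, so the following Python sketch only shows the shape of the calculation. Every input is a made-up assumption: a fixed bill of 300 dollars per month, 0.50 dollars per million messages, 10,000 messages per device per month, and a fleet growing by a constant 2,000 devices each month. The function names are hypothetical as well.

```python
import math


def break_even_messages(fixed_monthly_cost: float,
                        price_per_million: float) -> int:
    """Monthly message count at which pay-per-message costs as much as
    the fixed, pay-for-running-time bill."""
    return math.ceil(fixed_monthly_cost / price_per_million * 1_000_000)


def months_until_break_even(fixed_monthly_cost: float,
                            price_per_million: float,
                            messages_per_device: int,
                            devices_added_per_month: int) -> int:
    """First month in which the per-message bill reaches the fixed bill.

    Assumes (hypothetically) linear growth: the fleet gains the same
    number of devices every month, starting from zero.
    """
    monthly_growth = messages_per_device * devices_added_per_month
    if monthly_growth <= 0:
        raise ValueError("the fleet must grow for a break-even to exist")
    threshold = break_even_messages(fixed_monthly_cost, price_per_million)
    # Integer ceiling division: months needed to reach the threshold.
    return -(-threshold // monthly_growth)


# All inputs below are illustrative assumptions, not Silo's real figures.
print(break_even_messages(300.0, 0.5))
print(months_until_break_even(300.0, 0.5, 10_000, 2_000))
```

With these made-up inputs the break-even lands at 600 million messages per month, reached in month 30, which is two and a half years. Change any one assumption and the answer moves by years, which is exactly why the estimate could only ever be a range.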
Out of experience I knew this day would eventually come. I’ve seen it before. I’ve worked for companies bleeding money and investing months of effort into cost optimizations because they had either grown old, were in financial trouble, or experienced hyper-growth out of success. The worst part is, I’ve also experienced doing so where there was no architecture, so the system ended up being closed off to changes. And I might also have simply done the math wrong!
I knew I needed to pick the cheaper one today. Then I told myself “the message broker will eventually be replaced. It is only a matter of time”. That thought alone opened my mind to a new design goal: to make sure that replacing it would be neither impossible nor hard in the future. Weeks later, when coding started, that goal translated into entirely hiding messaging (not only the message broker) from the applications. Technically speaking, it meant coding an interface called IPubPub with two virtual functions, publish and subscribe. That was only about 2 hours of work. Its potential was saving hundreds of thousands of dollars in effort on the risky open heart surgery of replacing a message broker at exactly the wrong time.
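The Silo code itself is not shown here, so the Python sketch below is only an assumption about what a two-function messaging interface of this kind might look like. PubSub, InMemoryPubSub, report_temperature and the topic naming are all hypothetical. The point it illustrates is that application code depends on the interface alone, so a broker-specific implementation can be swapped in later without touching the callers.

```python
from abc import ABC, abstractmethod
from typing import Callable

Handler = Callable[[bytes], None]


class PubSub(ABC):
    """Hypothetical messaging interface: the only thing applications see."""

    @abstractmethod
    def publish(self, topic: str, message: bytes) -> None:
        ...

    @abstractmethod
    def subscribe(self, topic: str, handler: Handler) -> None:
        ...


class InMemoryPubSub(PubSub):
    """Stand-in implementation. A real one would wrap a broker client."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def publish(self, topic: str, message: bytes) -> None:
        # Deliver synchronously to every handler registered for the topic.
        for handler in list(self._handlers.get(topic, [])):
            handler(message)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)


def report_temperature(bus: PubSub, device_id: str, celsius: float) -> None:
    """Application code: knows about PubSub, never about the broker behind it."""
    bus.publish(f"devices/{device_id}/temperature", str(celsius).encode())


bus = InMemoryPubSub()
received: list[bytes] = []
bus.subscribe("devices/d1/temperature", received.append)
report_temperature(bus, "d1", 4.5)
print(received)
```

Replacing the broker then means writing one new class that implements publish and subscribe and changing the single place where the bus is constructed.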
When to Scale
At RapidAPI (2021) we had a technical roadmap to replace AWS Kinesis with Kafka. There was also a business plan that included more and more enterprise customers joining, each larger than the last. We knew of a customer at the end of the sales pipeline that had x3 the throughput of anything we had ever dealt with. It was only a matter of months until they went live. We also knew that there were customers out there with x10 the throughput, and that it was only a matter of time until they entered the sales pipeline.
As we’re binary thinkers, we thought only of two options:
- Plan and build for x3 throughput, and when the time comes, “don’t worry, we’ll see what happens”. The upside is the minimal effort required. The downside is the future risk of a complete redesign and of failing to meet business demands.
- Plan and build for x10 throughput. The upside is meeting future demand ahead of time. The downside is investing effort before the need has actually surfaced: a waste of time.
With Eventualism we said “x3 is required today; x10 throughput will eventually be required”. That led the team to ask how much time we would have to go from x3 to x10. The sales team answered that there would be at least 3 months’ notice. We came up with a design and technical roadmap whose first milestone was x3 and which left the door open to scaling up to x10 within just three months. That way we invested only the effort that was required immediately. We pushed to the future the tasks that were not beneficial at that point in time, while making sure they could be done when they do become beneficial one day.
Vendor Locking
“You are working at a very early stage start-up. You should not even think about these things so early”. That’s what several architects have told me, both from AWS and independent ones. Which is exactly what I told Silo’s investors (2018). And then I added “I somewhat disagree with them” followed by a dramatic pause.
I took a deep breath in. “Eventually we’d need to leave AWS. There are some small tweaks to the design we need to do today to make sure that when the time comes it would not be impossible to do so”. I said so with confidence.
Just the mere thought of “eventually it would happen” got me investigating whether vendor locking is still a thing or just a buzzword, and discovering what a vendor unlocking process looks like. I came prepared with scenarios that could have dramatic effects on the company, followed by small actions to be taken today to prevent some of them and be prepared for others.
One of these scenarios could arise on the day we want to sell the company to someone who is not Amazon, like Google or Walmart. During the technical due diligence we would find out that it is impossible to leave AWS, or that doing so would require years of effort. That would kill the deal. This must not happen. “Killing the deal” is an outcome of binary thinking: there will be a deal or there won’t be a deal. But in between, this very same scenario could give Amazon leverage over Silo and may lead to a lower valuation in the deal. And investors want to invest in unicorns without their horns cut off.
A more likely scenario could happen a few years from now, once we have grown big enough to get a discount from AWS. The starting point of the negotiation would be different. Instead of AWS knowing that we can never leave them, they would know it would take us a year to switch to Google. This could open up the possibility of a larger discount, which means a lower cost to the company.
In the next chapter, we’ll see how Eventualism affects our designs and systems through analysis on the most important component of all – ourselves.