For the last few years I’ve taken part in a philosophical debate on how to better handle concurrency and throughput. I would say there are three main approaches:
- Programmatically, in the application layer:
  - Multi-threaded programming
  - Low-level languages over high-level ones (C++ over Java, for example)
  - Code/OS optimizations
  - Garbage collection tuning
  - (and many others)
- Scaling your infrastructure layer:
  - Launching larger servers
  - Scaling out with more instances (servers, containers or functions)
- Combinations of the above two
I see two main reasons that tilt the balance towards infrastructure scaling:
- The costs of infrastructure and infrastructure maintenance keep going down, while personnel costs keep going up
- It is done once, at the infrastructure layer, instead of customizing each application individually. It may not cover all of your applications, but think Pareto 80/20.
Achieving higher throughput in the application layer requires certain expertise, probably per application. Multi-threaded development, as “basic” as it is considered to be, is still cognitively hard and time consuming. Coding in low-level C++ instead of high-level Java or Node.JS hurts your velocity, as it takes more time to code, and requires personnel adjustments (yes, you may have to let people go when you switch languages). Code and OS optimizations are a rare and costly expertise.
You could, however, choose to jump from 2 cores to 96(!) cores. It would cost an annual payment of $34,500, which is equivalent to only about two months’ salary of a single experienced engineer ($200K a year). Putting his efforts towards the money makers would be a far higher return on investment (ROI) than having him sit down and optimize his code [see We Were Born to Run: [In]stability and Velocity].
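A back-of-the-envelope sketch of that comparison; both figures are the rough assumptions above, nothing more:

```python
# Back-of-the-envelope only: both figures are the assumptions from the text,
# not a quote from any price list or payroll.
instance_cost_per_year = 34_500    # 96-core server, annual payment
engineer_cost_per_year = 200_000   # experienced engineer, annual salary

months_of_salary = instance_cost_per_year / (engineer_cost_per_year / 12)
print(f"The 96-core server costs ~{months_of_salary:.1f} months of one engineer's time")
```

Roughly two months. Every month that engineer spends on the money makers instead of on low-level optimization only needs to be worth more than that for the bigger server to win.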
Scaling your infrastructure requires know-how as well. It takes expertise and skilled personnel, so there are costs and learning curves to consider. But that work would not be done by your development team, thus maintaining their focus on what they do best. If done correctly, scaling your infrastructure can be application agnostic: done once, it holds for many applications. This is what Containers, Functions and Orchestrators are for [see Cloud Computing Evolution series]. Do not make the mistake of thinking it would apply to all of your applications, but an 80/20 Pareto is already of high value, maybe even higher [see The Road to Know-where: Velocity and Customer Experience].
The choice between one or the other, Programmatic or Infrastructure, is as always – somewhere in between. The Middle Way. It depends.
From failure to success
I’m aware of a large, successful Israeli startup with a B2B IoT product that required near real-time analysis and response times, around 400ms end-to-end. One of the founders, the CTO, was a C++ developer, so naturally, when he started coding, that was his weapon of choice. Going with what you know is a good and valid consideration, but it should not be the only one. Two years later, they failed to scale. They had a single monolith and a highly skilled team of C++ developers. They spent months of blood, sweat and tears on low-level code optimizations, until they eventually gave up.
They brought in an architect who specializes in DevOps and AWS architecture. Within 3 months the new architecture team put the good old application, with minimal adjustments, into a container and launched more instances – and it scaled. Later on they switched from C++ to Node.JS for higher Velocity (faster coding) and changed strategy to focus on features. They ended up being sold for a good few hundred million dollars, as their tech and product were unique in the market. I know both of these guys, the C++ developer and the AWS architect. They are both the best of people and the best at what they do. Only one approach prevailed.
No way around it
A different use case I stumbled upon was a backend engineer of a FinTech company who came in for an interview. You can learn A LOT about software from interviewees. After he showed some of the design of the trading platform he was working on, I wondered and asked him why they hadn’t launched more than one server to scale throughput. According to him, as trading platforms require sub-millisecond performance, adding any other entity in the way (a load balancer or a DNS resolver, for example), and with it more server hops, is considered a bad solution. They preferred to pay a huge rent for offices adjacent to stock exchange buildings in order to gain a few more nanoseconds. I have never worked in FinTech and it sounds absurd, but also very reasonable. I’d take his word for it.
There are use cases where scaling infrastructure is not feasible, not a good practice, or maybe even overkill. It could be that a “simple” primary/backup solution is sufficient.
Infrastructure first, application second
At Dealply (2014), I was working on a catalog parsing and indexing platform, which needed to concurrently parse about 40 different sources. It took a good few months of work to reach a steady state of a single source pipeline working from end to end. The company’s CTO, Daniel Barkan, stopped me a second before I started refactoring it into a platform.
Naturally, I wanted to code more threads. He suggested otherwise. “If it’s already working end to end, run multiple instances of it on different servers. Just launch a few more and distribute the pipelines between them. Get these 40 sources up and running. Get the job done sooner, then start optimizing costs.” I tend to listen to other people, so I did as suggested.
A month afterwards we had 40 servers up and running, which cost quite a lot, but income was pouring in, so the ROI was already positive. The project had proved itself worthy of further investment. Then, instead of coding concurrent processing, I went with scheduled executions. I manually “bin packed” sources that ran at different times onto shared servers, and reduced the number of servers to just 10. Infrastructure first, application second.
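The “bin packing” itself was nothing fancy and was done by hand; the sketch below only illustrates the idea, with made-up source names, run windows and a simple first-fit placement (the real pipelines and schedules were of course different):

```python
# Hypothetical sketch: sources that run in non-overlapping time windows can
# share a server instead of each getting its own. Names, hours and the
# first-fit placement are invented for illustration.
sources = [
    ("source-a", range(0, 6)),    # runs 00:00-06:00
    ("source-b", range(6, 12)),   # runs 06:00-12:00
    ("source-c", range(12, 18)),
    ("source-d", range(0, 6)),
    ("source-e", range(18, 24)),
]

servers = []  # each server holds (name, hours) pairs whose hours never overlap

for name, hours in sources:
    for server in servers:
        if all(not (set(hours) & set(other)) for _, other in server):
            server.append((name, hours))    # fits alongside existing sources
            break
    else:
        servers.append([(name, hours)])     # no fit found: launch a new server

print(f"{len(sources)} sources packed onto {len(servers)} servers")
for i, server in enumerate(servers, 1):
    print(f"server {i}: {[name for name, _ in server]}")
```

First get everything running, each pipeline on its own server; only then fold the ones whose schedules never overlap onto shared machines.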
At Silo (2019), where the application layer itself was a complete unknown [see I don’t know], it was decided to go with infrastructure scaling first. It would simplify coding almost all future applications, as scaling had already been taken care of. Or at the very least, it would postpone optimizing them as far into the future as possible.
The hellish core
Having the ability to scale infrastructure at ease, literally with the press of a button, is no excuse for being a bad developer. At Wiser (2016), I inherited a Java-based web scraping platform with a completely rotten core and a backlog full of features on top of it. I won’t go deep into the technical details, but suffice it to say that the more web scrapers it ran, the slower it got. That was bad for business. Really bad. While the platform was running there were 30,000(!) threads constantly context switching. My team and I were continuously working on new features the sales team needed to close deals (the money makers!), while continuously patching the core more and more.
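To make the symptom concrete: the problem wasn’t the amount of work, it was the number of live threads. A toy sketch (in Python for brevity; the real platform was Java, and this is neither its code nor its fix) of thread-per-scraper versus a bounded pool doing the same work:

```python
# Toy illustration of the symptom only; not the platform's actual (Java) code.
from concurrent.futures import ThreadPoolExecutor
import time

def scrape(url):
    time.sleep(0.01)   # stand-in for network wait + parsing
    return url

urls = [f"https://example.com/{i}" for i in range(30_000)]

# Design A (the rotten core, in spirit): one live thread per scraper, i.e.
# ~30,000 threads for the OS scheduler to context-switch between.
# Design B: the same 30,000 tasks, but only 64 threads alive at any moment.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(scrape, urls))

print(f"{len(results)} scrapes completed with 64 worker threads")
```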
To buy us more time, we also enlarged our servers to the maximum cores possible back then (32 cores). A server cost $3,000 a month, and we had two of those. It was quite a lot back then, as we ate up half of the company’s AWS budget. After a year of work, we managed to pull through, at least business-wise. Within a 24-hour cycle all the data was scraped and the customers were satisfied. But alas, the core still did not scale. We were at a turning point on what to do about it.
We had to choose between two paths. One would be about three months’ work to refactor the core itself, which would make it ready for infrastructure scaling once needed. The other would be a complete rewrite from scratch, built on microservices and designed to scale from the start.
Both options made technical sense. I argued that business-wise it would be wrong to rewrite, as it would take effort that would be better spent on business needs, such as smart monitoring and machine learning capabilities.
Unfortunately, due to financial reasons 75% of the personnel were let go, me included. The company did get back on its feet, and a rewrite was decided upon. As far as I know, it took four engineers about a year to reach a customer-satisfying system with a steady 24-hour cycle. As I wasn’t a part of it, I will always wonder whether they made the right choice.
Be wary of the distributed
Somewhere around 2016, I attended a lecture by Gett’s chief architect, Lior Bar On, about microservices. It was back when the hype around microservices was just getting started [see future article about microservices]. “The first rule of distributed systems”, he yelled at the audience, “is not! to! do! distributed!! Systems!!!”.
He is 100% correct. If you haven’t done it before and your team has no experience with it – think ten times before walking blindly into the world of distributed systems and infrastructure scaling [see the Serverless Development series and future sections about distributed systems]. Distributed systems are a can of worms. That’s why I love them. Know your use case and select the scaling strategy that is best for you. Sometimes the simplest solution is the better one.
Although I’m an advocate of infrastructure scaling first, and I chose to do so at Silo, that decision was partly based on my having vast experience with distributed systems, being able to encapsulate it and ease the developers’ effort, and being able to pass this knowledge on. It was under the assumption that I’d have at least a four-year run at the company (which I nearly did).
In hindsight, and from a managerial position, I find this to be a very fragile assumption. No architecture should depend on a single person. Architecture has a lot to do with HR. Although I did pass it on and taught it well, it might still have relied too much on me being a part of the company. I wouldn’t say I would have made a different call, only that I didn’t give it enough thought back then.