Meeting throughput targets can be achieved either by scaling the infrastructure or within the application itself [see Scaling Strategies: Infrastructure or Applicative Scaling]. Both Containers and Functions delegate throughput and scaling control to the infrastructure, but they do so differently, and they also scale differently.
A single Container instance, just like any familiar application, can handle multiple requests concurrently. To scale throughput, you launch more instances in advance and distribute traffic (requests or tasks) between them. Functions work the other way around: when a request or task arrives, a function instance is launched on demand, and each function instance handles exactly one request or task at a time. Function concurrency and load balancing are handled entirely by the infrastructure layer.
This distinction has major consequences for the application itself.
Less coding, no task distribution
Let’s presume you’re coding a new application from scratch and you’ve finished implementing the business flow that processes a single task. If you were to follow regular application development and put it into a Container, your next step would be concurrent processing, most likely with multithreading or something similar. Notice that neither Containers nor Functions are recycled or restarted after every execution.
Afterwards, you’d need to make sure that a specific task or request actually reaches your application, to somehow “get the task” to it. That would be either a push model (e.g., putting it behind a web application) or a pull/poll model (e.g., from a queue or a file system). That’s more coding. If you required multiple instances of your application, either for resilience or for more processing power, you’d then have to distribute these tasks or requests between the instances. That’s even more coding, or launching and maintaining additional components such as a load balancer.
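To make this concrete, here is a minimal sketch of the pull-model code you’d own in the Container world. The queue, worker count, and `process_task` body are all hypothetical stand-ins for your real queue client and business flow:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real queue client (SQS, RabbitMQ, ...).
task_queue: "queue.Queue[int]" = queue.Queue()

def process_task(task: int) -> int:
    # Your single-task business flow goes here (placeholder logic).
    return task * 2

results = []
results_lock = threading.Lock()

def worker_loop() -> None:
    # Pull model: each worker polls the queue for the next task.
    while True:
        try:
            task = task_queue.get(timeout=0.1)
        except queue.Empty:
            return  # no more work in this sketch
        with results_lock:
            results.append(process_task(task))
        task_queue.task_done()

for t in range(5):
    task_queue.put(t)

# The concurrency you must write and tune yourself in the Container model.
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(worker_loop)

print(sorted(results))  # [0, 2, 4, 6, 8] -- each task processed exactly once
```

None of this code is business logic; it is all polling and concurrency plumbing, and it still says nothing about distributing tasks across multiple instances.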
Functions provide another option, as concurrency handling and task distribution are encapsulated away from your application by the infrastructure. It can be a push model, where a function is directly invoked by an HTTP request. It can also be a poll/pull model, by integrating with existing entities [will be covered later in this series].
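With Functions, everything above shrinks to the single-task business flow itself. A minimal AWS Lambda handler sketch (the doubling logic is a hypothetical placeholder for your flow):

```python
def handler(event, context):
    # The infrastructure guarantees this instance handles one event at a time,
    # so no threading, polling, or load-balancing code is needed here.
    task = event["task"]
    return {"result": task * 2}
```

The same handler can be wired to an HTTP endpoint (push) or to a queue integration (pull) without changing a line of application code.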
Reusing and utilizing a fully managed, battle-tested infrastructure means development time saved and reinvested in the “money maker” [see We Were Born to Run: [In]stability and Velocity]. “Outsourcing” it to someone who is better at it than you’d ever be translates into a more highly available and resilient system.
For another use case, think about a machine learning clustering process over a few hundred vectors. A single clustering task can easily take tens of seconds, up to minutes. Some algorithms can’t be parallelized, and even when they can, each thread utilizes a CPU core in its entirety (100% CPU usage). On top of that, there is the business constraint of customers expecting results as fast as possible. Oh yeah, and it is a B2C product, so expect to withstand a varying throughput of 100k requests per hour.
The above is an example of an application that can barely process more than one task at a time due to the nature of the task itself.
Containers scale slower than Functions, which makes it harder to withstand such throughput:
- Container instance launch time on Fargate is measured in tens of seconds to minutes. A Lambda instance launch time is in the hundreds of milliseconds to a few seconds.
- Fargate has a soft limit of 100 concurrently running tasks/containers; Lambda has a soft limit of 1,000. Soft limits in AWS are not hard maximums, but if you needed 10k concurrency, the request would most likely be denied on Fargate.
- You’d still need to manage and tune scaling the number of Container instances up and down, as they are not launched on demand per task.
Note that Container instance launch time can be somewhat lowered on ECS, and the concurrency soft limit of ECS is 1,000 [see Managing Your Own Cluster later in this series].
It is possible to meet such throughput with Containers, but it requires a lot of coding and a lot of know-how that not every developer has. Functions may cost you more, perhaps a lot more (don’t forget to check that), but it may be worth the development time saved and the resilience gained. At the very least, your solution would be up, running, and meeting throughput much sooner.
Let’s presume that your application has an external dependency with a throttling limit: a database or a 3rd-party service that can’t scale indefinitely and has a maximum number of concurrent requests.
Ordinarily, if you have a single-instance application (in a Container or not), you’d code a throttler, maybe within a client factory, that enforces these limits. But when you scale up the number of instances, e.g. for high availability, how would multiple instances share the same throttler?
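For a single instance, such a throttler can be as small as a semaphore wrapped around the dependency’s client. A sketch, where the limit of 10 and the wrapped call are assumptions:

```python
import threading

class Throttler:
    """Caps concurrent calls to a dependency from this one instance."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Blocks callers beyond max_concurrent until a slot frees up.
        with self._sem:
            return fn(*args, **kwargs)

# e.g., produced by a client factory alongside the dependency's client
throttler = Throttler(max_concurrent=10)
result = throttler.call(lambda x: x + 1, 41)
print(result)  # 42
```

This works perfectly for one instance, and not at all across several: each instance would hold its own semaphore, oblivious to the others.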
The simplest solution would be to limit each instance to 1/(instance count) of the limit, but that is far from optimal (although there is nothing wrong with keeping it simple!):
- One instance can be hit more than another
- High variance in the dependency’s response time will cause an unbalanced limit between the instances (one instance would use 100% of its own share and deny requests, while the other instances only used 75% of theirs)
- This solution takes no account of Auto Scaling / a dynamic number of instances
What about an independent throttling application? That may be a step forward, but it would merely propagate the above problems to another application; at least if you came up with a solution there, it would be a reusable one. A truly viable solution would be a distributed one, such as a distributed lock or a shared, fast, scalable resource with sharding (Redis!). That is some heavy engineering and coding to do.
Instead, you can leverage Lambda’s own limits together: one request/task at a time per instance + a concurrent execution limit ⇒ a maximum number of concurrent incoming requests ⇒ throttled requests to dependent resources.
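With Lambda, capping concurrency becomes a one-line configuration change instead of a distributed system to build. For example, with the AWS CLI (the function name is a placeholder):

```shell
# Cap the function at 50 concurrent executions, so at most
# 50 requests can ever hit the throttled dependency at once.
aws lambda put-function-concurrency \
  --function-name my-task-processor \
  --reserved-concurrent-executions 50
```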
That works when invoking the function synchronously (it returns 200 after the invocation completes successfully, or 429 with TooManyRequestsException when the limit is reached). It works even better when invoking it asynchronously, which returns 202 while Lambda’s internal retry mechanism handles the 429 failures for you.
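In the synchronous case, the caller owns the retry-on-429 logic. A sketch with a stubbed invoke standing in for the real Lambda client (boto3 would raise TooManyRequestsException rather than return a bare status code):

```python
import time

def invoke_with_retry(invoke, max_attempts=5, base_delay=0.01):
    """Retry a synchronous invocation while the concurrency limit pushes back."""
    for attempt in range(max_attempts):
        status, payload = invoke()
        if status != 429:          # 200 on success
            return status, payload
        # Exponential backoff before retrying the throttled call.
        time.sleep(base_delay * (2 ** attempt))
    return 429, None

# Stub: fail twice with 429, then succeed, mimicking a briefly saturated limit.
attempts = {"n": 0}
def fake_invoke():
    attempts["n"] += 1
    return (429, None) if attempts["n"] < 3 else (200, "done")

result = invoke_with_retry(fake_invoke)
print(result)  # (200, 'done')
```

With asynchronous invocation (202), Lambda’s own retry queue plays exactly this role, so you write none of this.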