
Delivery: Reliable Messaging

Reading Time: 10 minutes

Throughout this entire series, we’ve talked about a messaging platform in the abstract. We’ve used it to implement a Publish/Subscribe pattern. Later on, we used it as a technical enabler for our system architecture, one which allows multiple Actors and their backend tuples to Reliably live together.

Through it, we’ve defined and assumed its minimal requirements for forming a relationship with our Event Handler, which should be able to register to, consume from and publish messages in the form of Events to the platform. Routing and fan-out to single and multiple consumers would also be required of it.

In this chapter, we’re going to dive into a few more requirements of a messaging platform. As this topic is very broad, we’ll explore only some of the concepts we should be familiar with in order to correctly review the selection made in Silo for the components of its messaging platform. We’ll mostly focus on messaging Reliability and on meeting Silo’s product requirements, a unique case of a B2C IoT physical device that was expected to be manufactured in the millions.

Asynchronicity

An Actor in an idle state can, out of nowhere, suddenly receive an Event and react to it. For it, this would be an eventually consistent experience, one shared between multiple Actors. For this to happen, the Event needs to make its way to the Actor somehow.

During our design and thought process, we always had millions of devices in mind, working concurrently though perhaps infrequently. For us, methods like polling and long polling were out of the question. Both would entail either millions of clients hitting our servers, keeping millions of open connections, or both. It had to be some kind of push mechanism, an asynchronous one. But what is asynchronous?

We should differentiate between asynchronous communication and an asynchronous process/Flow. In this implementation and many others, our messaging platform communicates directly and synchronously with the rest of our applications; invoking HTTP endpoints is only one method out of many.

They do not do so asynchronously; they do not continuously process a bit stream. This is not the asynchronicity of the network/transport layers of the OSI model, but of the application layer. They first consume the message in its entirety and only then reply with HTTP 200 OK (an ACK), noting the message has been processed in full. That is done one message after the other.

There is another kind of ACK, HTTP 202 Accepted, noting the message has been received in its entirety and will be processed some unknown time later. I would assume that some messaging platforms would reply with either 200 or 202, and that other, non-HTTP protocols would have some other form of ACKs.

Either way, once a message has been published to our messaging platform, it will later be delivered to and processed by multiple consumers, indirectly. Although each one receives the message over a synchronous communication line, the entire process/Flow/experience is asynchronous. The latter is what we refer to as asynchronous in this chapter.

It is the responsibility of each of our Event Handler applications to return an ACK only after a message has been processed successfully; an ACK only after the business logic went through as expected. Nobody ever said it always does.
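To make this concrete, here is a minimal sketch of such an Event Handler endpoint, assuming an HTTP-invoked consumer. The route, payload shape and handle_event() are hypothetical placeholders, not Silo’s actual code.

```python
# A minimal sketch of an Event Handler that ACKs only after successful processing.
from flask import Flask, request

app = Flask(__name__)

def handle_event(event: dict) -> None:
    """Run the business logic; raise on failure (placeholder)."""
    ...  # e.g. validate the Actor's state, write to the database

@app.route("/events", methods=["POST"])
def consume():
    event = request.get_json()
    try:
        handle_event(event)   # process the message in its entirety
    except Exception:
        return "", 500        # no ACK: the platform should redeliver later
    return "", 200            # ACK: processed successfully, do not redeliver
```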

Reliably Publishing

For web developers, the common practice is for the browser to retry when the server denies a request. For example, when a browser wishes to fetch some data and the server replies with HTTP 500, the browser would try again and again until it gives up. Only then would a visual indicator appear to our customer, saying “The service is currently unavailable, please try again later.”.

The same goes for our Actor trying to publish an Event to the messaging platform while it is down. After a few retries, it would give up. The same goes for data fetching done with a tuple of XXX_REQUESTED and XXX_REPLIED Events; it would require the exact same error handling strategy as the browser’s, because the tuple implements a synchronous direct invocation in an asynchronous manner. This is at-most-once publishing.
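A rough sketch of that browser-like strategy, assuming a hypothetical publish() call into the platform: retry a few times with a growing delay, then give up and let the customer see the error. Whatever was never published is lost, hence at-most-once.

```python
import time

def publish_with_retries(publish, event, attempts=3, base_delay=0.5):
    """Try to publish an Event; give up after a few attempts (at-most-once)."""
    for attempt in range(attempts):
        try:
            publish(event)                            # hypothetical call to the platform
            return True
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))   # simple exponential backoff
    return False   # caller shows "The service is currently unavailable, please try again later."
```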

Let’s say that exactly when a message has been persisted on the platform for later delivery, the platform goes down without sending an ACK. That would lead the client to retry and publish again later. Once the platform goes back up, it will receive the very same message again, ending with two messages persisted and waiting to be delivered. Duplicates. That is at-least-once publishing.

Some platforms do overcome duplicates. AWS SQS, in its FIFO queues, uses deduplication IDs to identify and drop duplicate messages, though only within a 5-minute time window. AWS SNS, on the contrary, does nothing to handle duplication and leaves it to the consumer. So it is not a given.
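When the platform does not deduplicate for us, a consumer can keep its own short-lived cache of recently seen message IDs, similar in spirit to that 5-minute window. A minimal in-memory sketch; a real deployment would likely use a shared store such as a database or Redis.

```python
import time

class RecentlySeen:
    """Drop messages whose ID was already seen within a time window."""
    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.seen: dict[str, float] = {}   # message_id -> last seen timestamp

    def is_duplicate(self, message_id: str) -> bool:
        now = time.time()
        # evict entries older than the window
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if message_id in self.seen:
            return True
        self.seen[message_id] = now
        return False
```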

But as our system architecture is also based on Events that are later reacted upon, an Actor can not simply give up after multiple retries and drop Events. It is an eventually consistent experience; all Events must eventually be published. When our messaging infrastructure is down, or when our Actor is unable to communicate with it, it would have to collect the Events internally and publish them once possible. That is exactly-once publishing, but it is not up to the publisher alone.
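A sketch of that collect-internally-and-publish-once-possible behavior, again assuming a hypothetical publish() call. In real life the buffer would have to be persisted to the device’s storage, otherwise a crash would still drop Events; and notice that a lost ACK would still produce duplicates, which is exactly why it is not up to the publisher alone.

```python
from collections import deque

class BufferingPublisher:
    """Never drop an Event: keep it until the platform accepts it."""
    def __init__(self, publish):
        self.publish = publish    # hypothetical call into the messaging platform
        self.pending = deque()    # should be persistent storage in real life

    def emit(self, event: dict) -> None:
        self.pending.append(event)
        self.flush()

    def flush(self) -> None:
        while self.pending:
            try:
                self.publish(self.pending[0])   # a lost ACK may still yield duplicates
                self.pending.popleft()
            except ConnectionError:
                break   # platform is down; keep the Events and retry on the next flush()
```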

The platform itself may not support an exactly-once publishing quality of service at all. Not all do. Some are only able to guarantee at-least-once or at-most-once publishing. On others, we will deliberately set a lower QoS, because exactly-once publishing is slower, affecting the throughput and scaling of the platform.

As not all messaging platforms support it anyhow, it might be better to assume less, so the platform can eventually be replaced when needed. Silo’s system architecture required all Events to eventually be published, so at-most-once publishing was not an option. On the other end, the effort needed to guarantee exactly-once publishing might eventually prove not to be worth it.

Reliably Delivering

On the other end of the platform, messages are delivered to consumers. For them, it’s a whole other mess. It is not guaranteed that the messaging platform will deliver a message exactly once, nor that it will deliver it to all registered consumers. When an Actor fails to publish a message, it retries later. But what does a messaging platform do when it fails to deliver one?

We can say that the platform delivers the message on behalf of the Actor. The platform does exactly what the Actor would have done in case of a failure: retry. And that is exactly why exactly-once delivery is extremely hard to achieve.

Furthermore, even if achieved, it would not be 100% bulletproof. Consuming applications will continue to go down, and human customers will continue to repeatedly hit the submit or refresh buttons, when something does not work as expected. Duplication would always be an issue.

Maybe that’s why, in some platforms and implementations, handling duplicates and retries is left to the consuming applications. It’s a burden added to our engineers when we add a messaging platform, not to mention the burden of distributed transactions between multiple consumers, or of a message splitter and aggregator.

On a side note, there are platforms that enable end-to-end exactly-once semantics, in both publishing and delivering. Two notable ones are Kafka and Flink, though we should notice that both are streaming platforms. Whether streaming platforms should be used as messaging ones is a very hot topic. They differ by not providing two essential messaging requirements: individual message acknowledgement and single-message processing. This can be overcome with additional coding in the consumers and producers, with much effort. One streaming platform does provide these natively, and that is Pulsar.

Reliably Retrying

As a messaging platform decouples applications, it also functions as a stockpile between them. This stockpile is expected to remain around zero for as long as the consumers keep matching the delivery throughput. When they can’t keep up, some messages will be delivered and consumed after a delay. In the edge case where the consumers are entirely down, messages will continue to be stockpiled, persisted for a much delayed consumption.

In contrast to a direct client-server request, the throughput here is much more within our control. When we have millions of clients, it would be entirely impossible to orchestrate them all. But if all requests are funneled through our messaging platform, it can better control the combined throughput. Instead of a million requests hitting a Service all at once, it would be a million messages first published and then slowly and safely delivered. It creates a smoothing effect.

In turn, the smoothing effect would prevent excessive uncontrolled pressure on our backend Services. It would also give our Services and our databases some time to scale up and match the required throughput, time which they do not have when being directly bombarded with client requests. A more Reliable system.

To understand how a messaging platform does so, let’s have a look at AWS Lambda’s internal queue, which is utilized when the function is invoked asynchronously. It will retain each message for up to 6 hours. That is time given to spin up more Serverless Function instances, to scale up more consumers to match the required delivery throughput. It could also be used to scale up a database with more shards and replicas. The faster and more automated the scaling is, the shorter this time window can be. For AWS Lambda, it’s fully automated and resources spin up fast.
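For reference, both the retention window and the retry count are configurable per function. A minimal boto3 sketch, with a placeholder function name:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep undelivered events for up to 6 hours and retry a failed consumption twice.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-event-handler",       # hypothetical function name
    MaximumEventAgeInSeconds=6 * 60 * 60,  # how long the internal queue retains the event
    MaximumRetryAttempts=2,                # retries after a failed invocation
)
```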

Unfortunately, lack of resources is not the only cause of failure. If consumption fails due to a bug in the code or a database failure, a retry mechanism kicks in. AWS Lambda will retry only up to 2 times, with a 1-minute delay between retries. Meaning a consumer has only about 2 minutes to get a hold of itself, which is far less than the time it takes an engineer to fix an issue.

For an offline batch processing job, a 1-minute delay between retries might be acceptable. But for a customer-facing product, it might be entirely unacceptable, as it would lead to customers being annoyed by extreme delays.

Murderous Letters

After those 6 hours, or after retries have been exhausted, these messages would be moved to a Dead Letter Queue/Channel for further inspection or a much delayed consumption. Purging away a message that can not be processed is a must. Unless that is done, the platform is susceptible to a most dangerous poison pill.

Imagine a million messages that for some reason can not be processed at all, maybe due to a broken contract or a bug in the consumption code. Unless purged away, they will block the consumption of all the messages trailing after the one-millionth one. For Silo, if this were to occur, it would mean all of its smart kitchen appliances becoming useless. Not just those of one customer, but of them all. Unacceptable.

Messages do not have to be purged away instantly; they can also be skipped for a while. Different kinds of platforms have different mechanisms to ensure that. In most queues, a fixed message visibility timeout can be set, leading to a retry every constant number of milliseconds. Set to a low enough number, a retry would have no effect on the customer experience. Unfortunately, it might also not be enough time for a consumer to recover or handle pressure during high load.
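With SQS, for example, both mechanisms are queue attributes: the visibility timeout hides a consumed-but-unacknowledged message for a fixed period before redelivering it, and a redrive policy moves a message aside to a Dead Letter Queue after too many failed consumptions. A hedged sketch, with placeholder URLs and ARNs:

```python
import json
import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/main-queue",  # placeholder
    Attributes={
        # A received-but-not-deleted (not ACKed) message reappears after 30 seconds.
        "VisibilityTimeout": "30",
        # After 5 failed consumptions, move the message to the dead letter queue.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:main-queue-dlq",  # placeholder
            "maxReceiveCount": "5",
        }),
    },
)
```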

A push-based message broker, such as SNS, has multiple retry strategies. Most notable is the exponential backoff, which gives more than enough time to scale up and overcome high pressure, and to handle jitters and latencies. Under normal load there would be no latency, and under heavy load some latency would be experienced. As for extreme loads, Silo’s experience between Actors is eventually consistent anyhow.
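As an illustration, for HTTP/S subscriptions SNS exposes this through a delivery policy set on the subscription. A hedged boto3 sketch with a placeholder subscription ARN and round numbers:

```python
import json
import boto3

sns = boto3.client("sns")

# Retry failed deliveries to an HTTP/S endpoint with exponential backoff.
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:events:placeholder-id",  # placeholder
    AttributeName="DeliveryPolicy",
    AttributeValue=json.dumps({
        "healthyRetryPolicy": {
            "minDelayTarget": 1,             # seconds before the first retry
            "maxDelayTarget": 60,            # cap on the delay between retries
            "numRetries": 20,
            "backoffFunction": "exponential",
        }
    }),
)
```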

Eventually Expected

Silo was all about an eventually consistent experience, leading to an eventually consistent persistence. For this, a message must eventually be consumed. It could be consumed immediately when the consumer is up, or a long time later, because it might go down for a while. Either way, the consumption itself must eventually succeed. As such, the delivery must be continuously retried until consumed.

In our system architecture, by publishing Events that applications independently react to, we’ve avoided most of the issues above. As almost any two applications are not even familiar with one another, they have no expectation of delivery at all. The publisher is agnostic to whether it’s an exactly-once delivery or an at-least-once delivery.

We had another option planned. We already had a synchronous bridge, a distributed lock used to bridge between our asynchronous Flows and synchronous requests. It could have been reused to avoid publishing duplicate messages, as an “in-processing” state already existed in the form of the lock itself.
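A rough sketch of how such a reuse could look, assuming a Redis-backed lock keyed by the Event’s ID (all names are hypothetical): if the key is already held, the Event is currently being, or was recently, processed, and can be dropped as a duplicate.

```python
import redis

r = redis.Redis()   # assuming a reachable Redis instance

def try_claim(event_id: str, ttl_seconds: int = 300) -> bool:
    """Atomically claim an Event for processing; False means a duplicate."""
    # SET key value NX EX ttl: only succeeds if the key does not exist yet.
    return r.set(f"in-processing:{event_id}", "1", nx=True, ex=ttl_seconds) is True
```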

We mostly ensured that all of our Services would be idempotent, making sure the same operation done twice would have no additional effect on the end result. It was done either by dropping Events received at incorrect states, or by making sure to perform only idempotent actions on our data. It was mostly implemented with the database, checking whether an action had an effect on the data or not. For example, with an SQL-based one, it was ensuring 0 rows would be affected when the same Event was delivered and consumed twice.
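A minimal sketch of that pattern with sqlite3, using a made-up table and states: the update only matches the expected current state, so a second delivery of the same Event affects 0 rows and is simply dropped.

```python
import sqlite3

conn = sqlite3.connect("silo.db")   # placeholder database

def consume_container_sealed(container_id: str) -> bool:
    """Idempotently move a container to SEALED; return False on a duplicate Event."""
    cur = conn.execute(
        "UPDATE containers SET state = 'SEALED' WHERE id = ? AND state = 'OPEN'",
        (container_id,),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows affected -> same Event consumed twice, no effect
```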

In this chapter, we’ve reviewed some technical considerations of what makes a messaging platform into a Reliable one. In the next chapter, we’re going to use these very considerations to overview Silo’s unique messaging platform.
