03 The Change Factor, 05 Future Changes

Post Office: A Messaging Platform

Reading Time: 19 minutes

In the previous chapter, we reviewed some technical concerns and considerations for choosing a fitting messaging platform. We looked at it mostly through the perspective of Reliability: the Reliability of publishing and the Reliability of delivery/consumption. Unlike synchronous, direct client-server invocation, with asynchronous Flows a message may not be published exactly once, and may not be delivered exactly once. And as we design for everything to eventually fail, we inspected how platforms act upon failures and high-load scenarios, and the different traps they hold.

We also explored why, in Silo, we decided our platform should guarantee only at-least-once on both ends, as Silo’s experience itself is eventually consistent between Actors. One of those Actors is a highly Reliable B2C IoT smart kitchen appliance, destined to be manufactured in the millions.

In this chapter, based on the above, we’ll review the decisions made in composing Silo’s messaging platform: an implementation of its system architecture, one that fuses together Event Driven Architecture, Enterprise Messaging Architecture and The Reactive Manifesto. One that also minimizes future Inefficiencies.

We’ll start with the PoC stage, and move to the full implementation with its data pipelines for the Event Source and Event Analytics. Although it did not get to see any traffic before the company had to shut down operations around Covid-19, there is much to learn from the process itself and the end result.

Proof of Reliability

It was arguable whether a smart kitchen appliance is an IoT device or not. Back in 2017, most IoT devices were just small sensors with WiFi or Bluetooth capabilities, constantly sending high-throughput telemetry to backend services for further inspection, 24 hours a day. There were also some B2C products, such as Smart Light Bulbs, that were connected 24/7 and barely sent any data; the bulbs were just waiting for signals to turn themselves on and off. Such was the expected case for Silo: millions of concurrently connected devices, mostly in idle state, similar to Amazon Echo Dot with Alexa.

Out of Silo’s unique product requirements, of an eventually consistent experience between multiple Actors, came the thought of asynchronicity. Our past experience with distributed systems and ensuring Reliability led us to an initial thought of a mediator between the clients and the backend Services: a message broker that would act as a circuit breaker in case of failure.

IoT was a big buzzword in 2017, so our mind was set on looking for dedicated solutions for IoT products. As AWS was our chosen cloud provider, we considered AWS IoT. Just by reading its documentation, we were blown away by how many issues we’d need to take care of. Issues like remote updates, version management and fleet tracking. And security. Lots of security.

In web and mobile development these issues are taken for granted, as they’re already taken care of by the browser or the OS. For physical devices, they’re something we have to handle on our own. As AWS IoT would bring us two thirds of the way to resolving them, we decided to PoC it.

Our PoC interacted a lot with AWS IoT. To ensure it could both publish and consume Events, we used the MQTT over WebSocket protocol, keeping a connection open 24/7. Our PoC also included an implementation of a simple on-boarding Flow and a complete remote update Flow, both utilizing multiple features of AWS IoT, capabilities well beyond those of a plain message broker.

Besides our PoC physical device, we also implemented PoCs for the company’s two major data pipelines on the backend. AWS IoT seamlessly connects and directly publishes incoming Events to many other AWS services. Two of those were AWS Kinesis Stream, on which we based our Event Analysis, and AWS Kinesis Firehose, on which we based our future Event Store / Source of Truth.

After about a year of hard work, we delivered 30 devices to users in the US to get the feedback we needed. Once it was live, something curious happened. The Events Indexer crashed almost every other day. But even while it was down, the device kept on seamlessly emitting Events to AWS IoT, which seamlessly kept pushing them to Kinesis Stream. When we got to work in the morning and restarted it, within a minute our Elasticsearch had caught up with all the Events emitted.

We had proven that with a message broker, even when a future backend Service goes down, our devices would not be affected. And with some tuning, we could turn it into a system architecture that would make sure the device simply keeps working uninterrupted. Now, there is no need to prove that a message broker does what it does, although it’s nice to see it in action. These incidents mostly proved to ourselves that we could turn our Reliability and Efficiency requirements into a reality.

Stream Broker

In between, we had a thought of using Kinesis itself as a message broker. In the previous chapter we talked about the difference between messaging platforms and streaming platforms: per-message processing, per-message acknowledgments and a Dead Letter Queue are all required. We could have coded all of those into each of our Kinesis Consumers; such implementations do exist.
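
A minimal sketch of what such a consumer might look like, where the stream layout, the DLQ URL and the handler are assumptions, and checkpointing is left out entirely:

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")
sqs = boto3.client("sqs")

# Hypothetical DLQ; the queue and stream names are assumptions for this sketch.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-dlq"


def consume_shard(stream_name, shard_id, handle):
    """Per-message processing over one Kinesis shard, emulating a Dead Letter Queue."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            try:
                handle(json.loads(record["Data"]))  # per-message "acknowledgment"
            except Exception:
                # A poison message: park it aside instead of blocking the shard.
                sqs.send_message(QueueUrl=DLQ_URL, MessageBody=record["Data"].decode())
        iterator = batch["NextShardIterator"]
        time.sleep(0.2)  # stay under the 5 read transactions per second per shard
```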

Besides those, another of Kinesis’ limitations was our gravest concern: each shard supports only up to five read transactions per second. This would introduce a potential constant delay to each delivery, 200ms on average. Since a request/reply tuple of Events [ref to Canonical Events] crosses the stream twice, fulfilling a synchronous request could add anywhere between 400ms and 2 seconds of delay, not to mention chains of Events with Saga patterns and distributed transactions. It would have led to a horribly annoying experience for our customers, as jitter becomes noticeable above a 250ms-400ms delay. It was a no-go.

The thought itself taught us a lesson: we now knew our requirements better. We’d need a message broker with lower latencies to provide a smooth experience. We also learned that Kinesis consumers pull from the stream at an interval. For us, alternatives based on a similar mechanism, such as pulling from queues, would carry the same potential to introduce delays in the experience. Probably one of the reasons we later insisted on a message broker with a push mechanism.

Coupling

After the PoC, we went ahead to design our final physical product and the entire supporting architecture. One that would include not only a physical device, but also a mobile application and a complicated Alexa Skill. Multiple Actors. It would also include something we’d made sure to avoid as much as we could during the PoC – multiple backend Services and multiple databases.

We started to inspect how to deliver messages from AWS IoT to the backend applications, where some of our User Journeys live. Back then (2019), the primary way to attach custom code to process messages was to attach a Serverless Function to a topic in the broker. AWS IoT invokes AWS Lambda asynchronously, whose retry mechanism we reviewed in the last chapter [ref to Reliability Retrying]: it drops messages after only 3 tries, with a minute of delay between each one. Not Reliable enough for our needs, to guarantee eventual consistency.

Costs were also a concern for us. With millions of devices, the throughput would be high enough that we’d be paying dearly for using solely Functions and not Containers. A greater concern was that Serverless Functions have technical limitations of their own, some of which would add significant latency to our customer experience. Being tightly coupled to one single kind of underlying compute, an expensive and limiting one, is a no-go.

Today (2022), after reviewing AWS IoT again, I see two more ways have been added. The first is a Step Function, AWS’s dedicated framework for state machines. It might have done the trick, but once again we would be closed off and coupled to a specific framework.

The second addition is a direct HTTP request, with a Container possibly at its end. Unfortunately, AWS IoT does not retry sufficiently or reliably when doing so: it tries only up to 3 times, each try timing out after 3 seconds, and then discards the message. There is no Dead Letter Queue, although with some effort one could be built, as the errors are published to CloudWatch. Once again, not Reliable enough to guarantee our eventual consistency.

Putting it all together, not only was it problematic for AWS IoT to reliably deliver messages from our Actors to our custom code, it would also be impractical for it to serve as a message broker between our backend applications.

Queueing

AWS IoT can also deliver Events to SQS, a fully managed and extremely Reliable queueing platform. Any kind of compute can continuously pull messages from it. Although we preferred push mechanisms, we considered a queue-based design as well.

Our main concern was that messages from millions of devices would be funneled into a joint queue. Naturally, the more devices being used concurrently, the more concurrent messages would need to be processed. In theory, contention might arise between multiple devices, as one device might have to wait for another device’s messages to be consumed first. Or worse, it creates an opening for a potential poison pill once enough devices emit broken messages due to an Instability slowly rolling out.

We thought about multiple queues. One option was a queue per Event name, one queue for XXX_REQUESTED, another for XXX_REPLIED and so on. It would ease our concerns and reduce the odds of the issues described above, but would not make them go away.

We thought about a queue per customer and/or device. But that would mean millions of queues, most of them completely idle. There is no limit to the number of queues we can create on SQS, nor a cost for idle ones. But creating one during a customer’s B2C on-boarding Flow is complicated and slow, and managing millions of such resources would be a huge burden. Such problems arise less when working on a B2B or SaaS/enterprise product, as customers number only in the thousands and there’s less sensitivity to on-boarding time.

Lastly, a queue per customer wasn’t actually feasible for our messaging platform and our Publish/Subscribe pattern. Only one consuming application can be attached to a queue. Even if we attached a single consumer to all the queues together, there would still be no message fan-out; we would be unable to have multiple consumers.

Identity Crisis

We had only two options remaining. The first was to replace AWS IoT with another broker; it wasn’t really an option, because there were no fully managed alternatives back then. The second was connecting AWS IoT to SNS, a fully managed service-to-service message broker. Connecting two message brokers sounds very odd, but the differences between the two are very interesting ones.

SNS satisfies all Reliability requirements, as well as any other message broker does (kind of, more on this later). It could easily fan out messages to multiple consumers according to Event name, the routing we require. It also scales indefinitely and can deliver with a push mechanism. There was also nothing technically preventing our Actors, our customer-facing applications, from publishing messages to it just like our backend ones. But security issues did.

By attaching IAM Policies to them, backend applications are granted permission to publish to a specific set of SNS topics. The same can be done for applications running outside our cloud, including client web/mobile applications. But those must first identify themselves, and only then would they be granted permission to affect only their own resources. In Silo’s case, that meant publishing only their own Events, meaning to their own topics. Exactly the same problem we had with queues.
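
For the backend side, such a grant is a plain IAM policy. A minimal sketch, assuming hypothetical topic, role and Event names:

```python
import json

import boto3

iam = boto3.client("iam")

# A sketch of the kind of grant a backend application would get: publish permission
# scoped to the SNS topics of the Events it owns. All names here are assumptions.
publish_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": [
                "arn:aws:sns:us-east-1:123456789012:ORDER_REQUESTED",
                "arn:aws:sns:us-east-1:123456789012:ORDER_REPLIED",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="orders-service-role",       # hypothetical Service role
    PolicyName="publish-own-events",
    PolicyDocument=json.dumps(publish_policy),
)
```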

Alternatively, a topic per Event name would result in the needed fan-out. It would also create a big security risk, as every publisher would be permitted to publish to a joint resource. A hacker who buys a device could simply and easily publish tons of messages on other customers’ behalf. Not only would we create an opening for them to plant poison pills, but also an opening to impersonate all of our customers.

This fine-grained security, restricting impersonation and restricting each Actor to its own Events and its own Event names, was unfeasible on SNS, and on many other message brokers. It would require some sort of Gateway handling authentication and authorization, sitting between our physical devices and SNS. Not coincidentally, AWS IoT does exactly that. It was natural for us to continue using it, as we had already implemented this level of security during the PoC.

Each device was granted permission to publish only to a very specific set of topics in AWS IoT: a topic per Event name, per household id, per device id. Doing so, we prevented an Actor from publishing Events in the name of other applications and other households, and from publishing unauthorized Events, only the ones expected of it. It would indeed be millions of topics, but those were dynamically created and automatically managed by AWS IoT. No burden on it, on our engineering team, or on our User Journeys.
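
A hedged sketch of what such a per-device restriction can look like in an AWS IoT policy, assuming hypothetical Event names and that the household id is stored as a thing attribute registered at on-boarding:

```python
import json

import boto3

iot = boto3.client("iot")

# Sketch only: a device may publish solely to topics that end with its own
# household id and device (thing) name. Account id and Event names are assumptions.
prefix = "arn:aws:iot:us-east-1:123456789012:topic"
suffix = "${iot:Connection.Thing.Attributes[householdId]}/${iot:Connection.Thing.ThingName}"

device_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iot:Publish",
            "Resource": [
                f"{prefix}/ORDER_REQUESTED/{suffix}",
                f"{prefix}/DEVICE_HEARTBEAT_EMITTED/{suffix}",
            ],
        }
    ],
}

iot.create_policy(
    policyName="device-own-topics-only",
    policyDocument=json.dumps(device_policy),
)
```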

To forward Events to SNS, publish actions on AWS IoT were attached to topic prefixes. The prefix was the Event name itself, so it was easily routed to a matching SNS topic, one per Event name. From there, it would be fanned out to our backend applications. To easily create this routing between two message brokers, we used a Terraform module. It required nothing more than a quick copy-paste and a change of the Event name string, followed by a simple terraform apply command. Creating new Events was fully automated and fault-tolerant for our engineers.
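
We did it with Terraform; an equivalent boto3 sketch of one such routing rule, with hypothetical rule, topic and role names, looks roughly like this:

```python
import boto3

iot = boto3.client("iot")

# Everything published under the Event-name prefix is forwarded to the matching
# SNS topic. Names and ARNs are assumptions for this sketch.
iot.create_topic_rule(
    ruleName="route_ORDER_REQUESTED_to_sns",
    topicRulePayload={
        "sql": "SELECT * FROM 'ORDER_REQUESTED/#'",
        "actions": [
            {
                "sns": {
                    "targetArn": "arn:aws:sns:us-east-1:123456789012:ORDER_REQUESTED",
                    "roleArn": "arn:aws:iam::123456789012:role/iot-to-sns",
                    "messageFormat": "RAW",
                }
            }
        ],
    },
)
```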

With AWS IoT in its role as a Gateway for our Actors, it’s easier to make the distinction between it and SNS: one is a client-oriented message broker, while the other is a backend-oriented one. Silo’s use case required having them both.

Alternative to SNS

Today (2022), another message broker exists in AWS, called EventBridge. I’ve taken the time to review it while writing this chapter, and on the surface it seems like a most fitting alternative to SNS. Both provide the highest Reliability and native connectivity to all other AWS Services, but unfortunately AWS IoT currently does not forward messages to EventBridge.

Back then (2019), another message broker was available on AWS, AmazonMQ. It was a fully managed RabbitMQ, a queueing/messaging platform used by many as a message broker due to its fan-out capabilities to multiple queues. There were three reasons we took it off the table.

The first: AWS IoT does not natively connect, publish or forward messages to AmazonMQ. It was possible to code a bridge between the two, but if I remember correctly, we had no idea how to throttle AWS IoT, and there was no metric to trigger scaling that bridge up and down when needed.

The second reason was AWS’s intentions for it. It was a service dedicated to enterprises, to ease moving their infrastructure to the cloud. We deduced its long-term roadmap would be focused more on enterprise clients and less on our needs as a young startup. Practically speaking, that would mean fewer relevant features in the future compared to the alternatives, although they might eventually pop up, as long as AWS kept internally catching up with RabbitMQ’s open source version.

The third reason was probably an extension of the previous one: AWS themselves said not to use AmazonMQ. They clearly noted that if you’re starting a new company, you should use a combination of SNS and SQS. But there seems to be a contradiction here. If we stated before that SNS is a perfect message broker, why would it need to forward messages to queues rather than deliver directly to consumers? Because it was not perfect. Not yet, anyhow, and let’s see why.

Pushing SNS

The following is a most interesting story, of a solution we came up with that goes against AWS Best Practices. One that might have something to do with me not getting a job as an AWS Solutions Architect.

As with any other component, we needed to make sure SNS would not couple us to a specific compute/technology. It was possible to connect it to an AWS Lambda; unfortunately, SNS invokes it asynchronously, and the same Reliability issue surfaces once again.

I have no idea why it does so. I cannot see a reason why the function isn’t invoked synchronously, reusing SNS’s excellent retry strategy upon failure. For us, circumventing it would merely mean coding a 20-line bridge, a pure/stateless Event Handler: it would consume an Event from SNS and invoke the Lambda synchronously, propagating failures back to SNS for it to retry.
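
One way to picture that bridge (an assumption on our part about where it would run) is as an HTTPS endpoint subscribed to the SNS topic: it invokes the target Lambda synchronously and turns a function error into a failed delivery, so SNS’s retry strategy takes over. A minimal sketch, assuming Flask and a hypothetical target function name:

```python
import json

import boto3
from flask import Flask, Response, request

app = Flask(__name__)
lambda_client = boto3.client("lambda")


@app.post("/events")
def bridge():
    envelope = json.loads(request.data)
    if envelope.get("Type") != "Notification":
        # Subscription confirmation handling is omitted in this sketch.
        return Response(status=200)

    result = lambda_client.invoke(
        FunctionName="order-requested-handler",  # hypothetical target function
        InvocationType="RequestResponse",        # synchronous invocation
        Payload=envelope["Message"].encode(),
    )
    if result.get("FunctionError"):
        # A non-2xx response tells SNS the delivery failed, so it retries for us.
        return Response(status=500)
    return Response(status=200)
```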

To run a consumer in a Container on an orchestrator, AWS suggests forwarding messages from SNS to SQS and having the Containers pull from it. Don’t get me wrong, it is a correct and efficient method. But to this very day, I’m surprised no one suggests using another of SNS’s capabilities: fanning out with HTTP pushes. It would require us to put a load balancer in front of an auto-scaling Container, but that’s exactly how even the simplest of Services operate. HTTP pushes are done by SNS with exponential backoff, which is more Reliable and lower-latency than SQS’s fixed message visibility. Sounds like a superior solution in many ways. So we did just that. Kind of.
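
A sketch of such an HTTPS push subscription, asking SNS for exponential backoff between delivery retries; the topic ARN, endpoint and retry numbers are all assumptions:

```python
import json

import boto3

sns = boto3.client("sns")

# Hypothetical retry tuning; values are placeholders, not our production numbers.
delivery_policy = {
    "healthyRetryPolicy": {
        "numRetries": 30,
        "minDelayTarget": 1,
        "maxDelayTarget": 600,
        "backoffFunction": "exponential",
    }
}

sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:ORDER_REQUESTED",
    Protocol="https",
    Endpoint="https://events.example.com/events",  # the load balancer's endpoint
    Attributes={"DeliveryPolicy": json.dumps(delivery_policy)},
    ReturnSubscriptionArn=True,
)
```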

I was really curious why AWS insists on connecting the two. At first, I thought it had something to do with the message retention period, its Time to Live. If I remember correctly, it was around 6 hours, which sounded like more than enough time to handle scaling and jitters during normal operations. But right next to it was what I theorized to be the technical reason to connect the two: SNS did not have a Dead Letter Queue. After 6 hours, messages were simply discarded. Unreliable, and not a guarantee of our eventual consistency.

We had a plan to overcome this. We already had the thought of putting sidecars/microservices in front of our Services, to make them agnostic to an eventual switch to a cheaper message broker. And as these function as an extension of our messaging platform, they could also move messages to a Dead Letter Queue in SQS just before the retention period was over. For us, it would mean coding something additional, not an entirely new component in the system.

[In hindsight, the effort was premature. It shouldn’t have been a sidecar but a middleware in our Services, to be extracted into a sidecar only once the actual message broker switch was needed.]

When we showed this to two AWS Solutions Architects, they basically said we were somewhat right, but that the coding was not worth the effort. They said to connect it to SQS because that’s the best practice. In a very funny way, we were both right.

A few months afterwards, we started to design the technical implementation of re-routing to a DLQ. Just before rushing into coding, we read the entirety of the SNS documentation again. I remember Kiril, who had recently joined us and taken charge of our backend, telling me something weird was going on.

“Amir, you said the retention period is only 6 hours. But the documentation says it’s 24 hours”. I was turning 35 around that time, so maybe my memory was betraying me. It wasn’t. The documentation had been updated 48 hours before. The latest update also included AWS adding a Dead Letter Queue capability to SNS, by redirecting discarded messages to an SQS queue. Exactly what we were about to code ourselves that very same day. Call it luck, call it a coincidence. But it was effort saved. And we could safely and calmly move forward with SNS. And so we did.
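
For completeness, a minimal sketch of the capability that update introduced, with placeholder ARNs: a subscription’s failed deliveries get redirected to an SQS queue instead of being discarded.

```python
import json

import boto3

sns = boto3.client("sns")

# Sketch only: attach a redrive policy to an existing subscription so that
# deliveries SNS gives up on land in an SQS Dead Letter Queue.
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:ORDER_REQUESTED:<subscription-id>",
    AttributeName="RedrivePolicy",
    AttributeValue=json.dumps(
        {"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:events-dlq"}
    ),
)
```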

The last piece of the puzzle was to forward Events to AWS IoT, and from it to our Actors. We coded some Event Handlers bridging between the two and routing between topics. Our Event Handler framework made sure each Event included the Event name, the household id and the device id. The combination of the three was also a topic on AWS IoT, one a specific device had subscribed to and received Events from. It ensured our Actors could fully participate in our asynchronous User Journeys.
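
A hedged sketch of such a bridging Event Handler, written here in the shape of an SNS-triggered handler with assumed field names:

```python
import json

import boto3

iot_data = boto3.client("iot-data")


def handler(event, context):
    """Fan an Event back out to the device's own AWS IoT topic."""
    for record in event["Records"]:
        body = json.loads(record["Sns"]["Message"])
        # Topic layout per the text: Event name / household id / device id.
        topic = f'{body["event_name"]}/{body["household_id"]}/{body["device_id"]}'
        iot_data.publish(topic=topic, qos=1, payload=json.dumps(body).encode())
```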

We had ourselves a messaging platform, one that any kind of application could publish Events to and receive Events from, in a highly Reliable manner.

Mobile Gateway

When it came to Silo’s physical device and the final implementation, we kind of redid and extended the work done during the PoC. But unlike in the PoC, we also had a mobile application to take care of, and a future, yet-to-be-born web application.

Let’s start at the very end. After inspecting the alternatives, we decided that if we already had a Gateway able to handle millions of clients, it would be most effective to reuse it for all other Actors. Although AWS IoT was tailored for physical devices, there was nothing preventing it from being used with other forms of customer-facing clients.

The only difference was that a physical device identified itself with a certificate pre-burned in the factory, while a mobile application identified itself with AWS Cognito. Specifically, via a social login with an Amazon account, which a customer needed to go through anyhow in order to log in to the Alexa ecosystem.

Looking for an alternative client-facing Gateway was an interesting exercise on its own. The only fully managed alternative available to us was Amazon API Gateway. Without any code, and with just some setup, it could have been used as a proxy to SNS. It integrates with AWS Cognito and AWS IAM, which together answer our requirements for client authentication and authorization.

Same as with the topics, it was possible to create resources behind a URL in the form of /{EVENT_TYPE}/{HOUSEHOLD_ID}/{DEVICE_ID}. With a dynamic, variable-based IAM Policy, we could have granted each client access only to a specific set of URLs. With REST APIs, that could have been done easily.
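
A heavily hedged sketch of that variable-based policy idea; the API id, stage and the mapping of the household segment to the caller’s Cognito identity are made up purely for illustration:

```python
# Sketch only: allow a caller to invoke POST solely on paths scoped to itself.
api_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "execute-api:Invoke",
            "Resource": [
                "arn:aws:execute-api:us-east-1:123456789012:a1b2c3d4e5/prod/POST/"
                "ORDER_REQUESTED/${cognito-identity.amazonaws.com:sub}/*"
            ],
        }
    ],
}
```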

But let’s recall we have a need for two-way communication. Events published by our backend applications need to be delivered to our mobile application, and they have to be delivered during an interaction with it, so mobile Push Notifications wouldn’t do. The only alternative was using API Gateway with WebSockets, just like our physical devices do through AWS IoT.

Unfortunately, once WebSockets are used with API Gateway, there are no URL paths to grant access by. The only way to get it done was with a custom authorizer running as an AWS Lambda. I remember that being the reason we dropped this alternative; I don’t remember whether it was because it required more coding, or because the authorization itself was problematic to code.

Future Costs

Throughout this series we’ve talked about costs, so we need to stop and ask ourselves whether AWS IoT is an expensive solution. For AWS IoT, we pay for connection time, a WebSocket being open 24/7, and per message published. For SNS, we pay per message published, and per message delivered multiplied by the fan-out.

We could only somewhat anticipate how many Events each Actor would publish and receive every day. We took the usage data from our PoC and multiplied it severalfold. We did some math, and it came out to about 5$ per customer for 3 years, all clients combined.

Was that expensive? Compared to other company expenses, it was only about 4% of the expected Customer Acquisition Cost, and roughly the same share of the device manufacturing cost. Compared to expected income/profit, from future revenue streams of customers buying food and accessories, it was negligible.

But there are a few catches here. Call it 1.33$ per customer per year; for a million customers, that’s an annual bill of 1.33m$. That would be a future COO bringing the hammer down on us. Doing cost optimization today, before launch and before we have a million customers, would be non-beneficial. But guaranteeing costs could eventually be lowered would be eventually beneficial. And we can and should design for eventually.

As a client-facing Gateway is a must for us, the baseline cost would be non-zero. Nothing ever comes without a cost. The alternative of using API Gateway would have cost roughly the same. If that is so, we could no longer say AWS IoT was expensive. The design might be expensive, or maybe there is just a toll to pay for Reliability.

We could, however, reduce our applications’ connection time. They do not have to be connected 24/7; they can disconnect after being idle for a little while and reconnect once out of idle. That would reduce the connection cost by about 95% to 99%. We could also reduce the number of messages published, by grouping most of them into an Event Collection published as one message. Both were guaranteed to be technically possible and without effect on the product, and both could be done a year after launch, when actually needed.
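
A sketch of the Event Collection idea on the client side, where the collection Event name and the thresholds are assumptions:

```python
import json
import time


class EventCollector:
    """Buffer low-priority Events and publish them as one grouped message."""

    def __init__(self, publish, max_events=20, max_age_seconds=60):
        self._publish = publish  # e.g. an MQTT client's publish(topic, payload)
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._buffer = []
        self._opened_at = time.monotonic()

    def add(self, event):
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_events
        too_old = time.monotonic() - self._opened_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            # Hypothetical collection Event name; one publish instead of many.
            self._publish("EVENT_COLLECTION_PUBLISHED", json.dumps(self._buffer))
            self._buffer = []
        self._opened_at = time.monotonic()
```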

For SNS, it was a bit different. We couldn’t publish and deliver fewer messages, and the only pricing model SNS has is per message. A different cost model means it would eventually have to be entirely replaced one day, which would not be an easy task at all. The only thing we could do today was to make it not impossible in the future. Guaranteeing that was another responsibility of our Event Handler framework: it made sure all of our Services/business logic/consumers would be entirely agnostic to the components of our messaging platform. One day, when a component is replaced, it would be a purely Technical Cause for Change, completely mutually exclusive from any Behavioral Causes for Change.

That was one rough implementation, mostly because we had rough product requirements. In this chapter we talked only a little about security, but it’s a whole other story to focus on, a story of our threat model and GDPR. On this, in the next chapter.
