
Truthful Insights: Event-Driven Analytics

Reading Time: 9 minutes

In our system architecture, all of our applications communicate indirectly through a messaging platform. To maintain a high level of decoupling, each application emits Events that others can consume and react to. It is a fusion of Event-Driven Development and The Reactive Manifesto, with Events as the canonical message of our Enterprise Messaging Architecture. In turn, it transformed all of our applications into Event Handlers.

In the previous chapter, we saw that these very same Events hold within them a certain truth, an immutable past. By contrast, each database only holds a dedicated view derived from those Events, each view answering a specific Actor's needs, its user interface and its experiences. As a result, no single database could be the source of truth. So we built a data pipeline, flushing all Events from our messaging platform to an Event Store.

In this chapter, we’re going to see another pipeline, one that answers different needs. We’re going to build Event Analytics.

In-Product Sights

When we design a new kind of microwave, we may have some experience with microwaves. We cannot do the same with a product unlike anything else in existence. If nothing remotely similar to it exists, how do you even start to design anything?

It was one of the challenges we had at Silo. Our smart kitchen appliance was designed entirely from scratch, based on nothing but our amazing team’s experience. We iterated on the physical design with 3D printers, and coded the PoC with Amazon Alexa for almost a year and a half. But how could we know if we did well, whether a customer would actually want the product and enjoy using it?

The appliance was first to be sold in the US, to American customers. But we were Israelis, not Americans. We have different cultural food habits, and our kitchens look and are used entirely differently. Giving a device to my 70-year-old mom, who barely cooks anymore, would indeed get us some initial feedback, but an insufficient one. Sorry mom, I know you do your best.

Surely, we could have flown with the device to the US and watched people’s direct experience with it, but that would have been insufficient as well. To get the real feedback we needed, it had to be in their kitchens for a long while. But once there, it would be a few thousand miles away from our engineers and product managers in Israel. From that far away, we still could not actually know how our customers use the product and whether they enjoy it.

As we previously discussed, Silo’s physical product was also a state machine. Some Events were emitted from hardware components, such as buttons emitting BUTTON_PRESSED. Some Events were non-hardware, behavioral ones, such as LABELING_DONE. These were the very same Events that, later in our final system architecture, our backend applications reacted to.

If we look carefully, we’ll notice these Events all have something in common. They hold very fine-grained detail on how customers are using our product. Some describe the customer interaction and the user interface. Some describe the experience coded into the device. It’s all right there in the code! If only we could reach them from far away, they would tell us the story of what actually happened in our customers’ kitchens. But wait, yes we can!

Proving Concepts

As we were working on a prototype and knew nothing but JavaScript, our PoC was coded in Node.js running on a Raspberry Pi. Back then, Node.js was rather new, and the practice was to have independent Modules extend a class called EventEmitter. Each Module would register to an Event type, consume it during runtime and produce another Event. If it sounds vaguely familiar, it’s because it was a simple Publish/Subscribe pattern within a single application.
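
As a rough illustration, assuming hypothetical Module names and a simplified flow, two such Modules could have been wired like this:

const { EventEmitter } = require('events');

// Each Module extends EventEmitter: it emits its own Events and
// registers to the Events of the Modules it depends on.
class ButtonModule extends EventEmitter {
  press(durationMs) {
    this.emit('BUTTON_PRESSED', {
      EVENT_NAME: 'BUTTON_PRESSED',
      TIMESTAMP: Math.floor(Date.now() / 1000),
      PAYLOAD: { duration: durationMs, durationUnit: 'ms' },
    });
  }
}

class LabelingModule extends EventEmitter {
  constructor(buttonModule) {
    super();
    // Consume one Event type, produce another.
    buttonModule.on('BUTTON_PRESSED', () => this.startLabeling());
  }

  startLabeling() {
    this.emit('LABELING_STARTED', {
      EVENT_NAME: 'LABELING_STARTED',
      TIMESTAMP: Math.floor(Date.now() / 1000),
      PAYLOAD: {},
    });
  }
}

// Wiring: a button press drives the labeling flow.
const button = new ButtonModule();
const labeling = new LabelingModule(button);
labeling.on('LABELING_STARTED', (event) => console.log(event.EVENT_NAME));
button.press(456);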

In hindsight, we realized something was missing. Each Module was indeed independent, but there was nothing to correctly orchestrate between them. Each Module managed its own state, but there was no general state machine present. Its absence led to many bugs and made our development harder than necessary.

It was less of a concern for us, as we knew the final product would be coded entirely from scratch in C++ on a dedicated board. And our final EventHandler framework for our backend applications was based on the PoC’s EventEmitter.

As all of these Modules were already communicating between them via Events, we just needed to add another one and have it register to all Events: a Module whose sole responsibility is to publish those to AWS IoT, a message broker designed and dedicated to work with millions of IoT devices at high throughput. Later in this series, we’ll see how it fit into the final messaging platform.

[The final product’s Event Analytics pipeline differed a little from the above]

From AWS IoT, Events were streamed directly into an AWS Kinesis Stream, which flushed them all to ElasticSearch hosted on Elastic Cloud. With Kibana, we could easily query and answer questions like “how many times was button 1 pressed?”, “how many times has labeling started?” and “how many times has labeling ended?”.
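
On the device side, the entry point to that pipeline was the analytics Module described above. A minimal sketch of what it could have looked like, assuming the aws-iot-device-sdk for Node.js and placeholder certificates, host, topic and Event names:

const awsIot = require('aws-iot-device-sdk');

// A Module whose sole responsibility is to forward every Event to AWS IoT.
class AnalyticsModule {
  constructor(producers, deviceId) {
    // Connection details are placeholders, not Silo's actual configuration.
    this.client = awsIot.device({
      keyPath: '/certs/private.pem.key',
      certPath: '/certs/certificate.pem.crt',
      caPath: '/certs/AmazonRootCA1.pem',
      clientId: deviceId,
      host: 'example-ats.iot.us-east-1.amazonaws.com',
    });
    this.topic = `devices/${deviceId}/events`;

    // Register to every Event type the other Modules can emit.
    const eventNames = ['BUTTON_PRESSED', 'LABELING_STARTED', 'LABELING_DONE'];
    for (const producer of producers) {
      for (const name of eventNames) {
        producer.on(name, (event) => this.publish(event));
      }
    }
  }

  publish(event) {
    this.client.publish(this.topic, JSON.stringify(event));
  }
}

With the Modules from the earlier sketch, wiring it up would be a single line: new AnalyticsModule([button, labeling], 'silo-0042').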

Event Structure

Thanks to another unique property of Events, we could also answer even more complicated queries, such as “which labels were given?”.

Our Events were semi-structured, in a JSON format for easy consumption by our Node.js applications. All Events had shared fields, such as EVENT_NAME and TIMESTAMP, making it a well-structured and consistent format. One of those fields was named PAYLOAD, but its value was a nested, unstructured JSON.

For example:

{
   EVENT_NAME: "BUTTON_PRESSED",
   TIMESTAMP: 1664962242,
   PAYLOAD: {
      duration: 456,
      durationUnit: "ms"
   }
}
{
   EVENT_NAME: "LABELING_DONE",
   TIMESTAMP: 1665962242,
   PAYLOAD: {
      labelName: "Strawberries"
   }
}

Although unstructured, PAYLOAD was consistent with the value of EVENT_NAME. All the PAYLOADs of BUTTON_PRESSED Events were structured the same, and the same goes for all the PAYLOADs of LABELING_DONE. [As it is available today, I might now choose to structure Events based on CloudEvents, and maybe extend it if needed.]
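
For illustration only, the LABELING_DONE Event above could roughly map to a CloudEvents 1.0 message like the following; the type, source, id and time values are made-up placeholders:

{
   "specversion": "1.0",
   "type": "com.silo.device.labeling_done",
   "source": "urn:silo:device:0042",
   "id": "9f1c2a7e-0b6d-4d2a-8f3a-2f5d1c9b4e10",
   "time": "2022-10-16T23:17:22Z",
   "datacontenttype": "application/json",
   "data": {
      "labelName": "Strawberries"
   }
}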

ElasticSearch also enables querying unstructured and textual data, without knowing its structure in advance and with only a small Change to the index. With it, it was easy for us to create a word cloud out of the labelName field. It gave us a better picture of the NLP-based challenges we would be facing in our future Alexa Skill. It turns out our customers are really creative in what and how they eat and store. Out of it, we knew what tweaks would improve our product’s design and experience.
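
A terms aggregation is one way to get such a breakdown. One possible query, assuming an index named events and ElasticSearch's default dynamic mapping (which adds a .keyword sub-field to strings):

GET events/_search
{
   "size": 0,
   "query": { "term": { "EVENT_NAME.keyword": "LABELING_DONE" } },
   "aggs": {
      "label_names": {
         "terms": { "field": "PAYLOAD.labelName.keyword", "size": 50 }
      }
   }
}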

By contrast, the Event Store described in the previous chapter did not enable querying the payloads, because that is not required in order to perform an Event Replay. That difference in query flexibility, the lesser one in our Event Store and the higher one in our Event Analytics, is what allowed our engineers to continuously add new Events without any Change to the Event Store, and with only a little tweak to ElasticSearch’s index.

But that wasn’t the greatest impact made on our engineers.

Efficiently Tracing

The most annoying thing about bugs is hunting them down and recreating them, which is exactly why they are one of the biggest sources of Inefficiency. Just imagine how hard it is to fix one on a physical device beyond your reach, used for the first time in ways we could not imagine.

When a customer of ours got into trouble, he’d email us some details of what he had tried to do. But mostly what we needed was when he tried to do it. With the timestamp, we could simply query ElasticSearch for Events within a timeframe for a specific device, and see what happened one millisecond after the other. By reading Events perfectly describing what had happened, plainly so with our very own eyes, we simply knew what a customer had done and what had unexpectedly happened.
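
Such a query is straightforward in ElasticSearch. One possible form, assuming an index named events and a DEVICE_ID field that I'm adding here purely for illustration:

GET events/_search
{
   "size": 1000,
   "sort": [ { "TIMESTAMP": "asc" } ],
   "query": {
      "bool": {
         "filter": [
            { "term": { "DEVICE_ID.keyword": "silo-0042" } },
            { "range": { "TIMESTAMP": { "gte": 1664962000, "lte": 1664965600 } } }
         ]
      }
   }
}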

It was mostly small bugs that got the experiences stuck. Not coincidentally, these collected Events were also the exact input to each and every one of our Modules and their functions. So to recreate a bug, all we had to do was copy an Event from ElasticSearch and paste it into a unit test, or a few Events if needed. That created a localized, small Event Replay, and recreated the bug within minutes.
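
Reusing the hypothetical Modules from the earlier sketch, such a test could look roughly like this (Mocha/Jest-style, with the recorded Event pasted in verbatim):

const { EventEmitter } = require('events');
const assert = require('assert');
// Hypothetical Module under test, from the earlier sketch.
const { LabelingModule } = require('./labelingModule');

// An Event copied from ElasticSearch, exactly as it happened on the device.
const recordedEvent = {
  EVENT_NAME: 'BUTTON_PRESSED',
  TIMESTAMP: 1664962242,
  PAYLOAD: { duration: 456, durationUnit: 'ms' },
};

it('starts labeling when the button is pressed', () => {
  const fakeButton = new EventEmitter();
  const labeling = new LabelingModule(fakeButton);
  const produced = [];
  labeling.on('LABELING_STARTED', (event) => produced.push(event));

  // Replay the recorded Event: a localized, minimal Event Replay.
  fakeButton.emit('BUTTON_PRESSED', recordedEvent);

  assert.strictEqual(produced.length, 1);
});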

We all know how time consuming it can be, so recreating and fixing bugs within minutes is a lot of future Inefficiencies avoided. It is an extremely beneficial outcome, one that aligns with our company’s goal to be efficient. Our engineers’ time is money, money better spent ensuring and paying for the Reliability required.

Merely 48 hours after the first device was first used in the US, it just worked. We fixed tens of bugs and deployed tens of versions, continually stabilizing it until it just kept on working, with no customer complaining. Exactly what we needed to prove to ourselves that we could do it, that we could provide such a high level of Reliability in a very short time.

Efficiently Monitoring

One day we got a call from a salesman at an Israeli company. He was trying to sell us their log analysis platform, one mainly used for real-time monitoring of applicative logs and setting alerts accordingly. I told him, “Thank you, but we don’t have logs”.

He was a bit surprised and couldn’t really believe it. “You know what? We’re up for the challenge. Let’s set up a call for tomorrow.” We wanted to convince him that we didn’t need his platform, so we showed him everything we had done in the PoC. He was convinced, and he let us go. It was another sign for us that we were doing something right.

The very same Event Analytics was used to remotely monitor our devices. We knew in real time how many devices were live at any given moment, from which we could derive how many of our customers kept them constantly on their kitchen counters. We could track how many containers were vacuumed each day. We could also track how many and which applications crashed unexpectedly, and whether the count increased or decreased after each deployment. Out of all of these we could easily create Metrics and KPIs and set alarms on them. A capability needed to ensure the Reliability of a live production environment running millions of devices.

We took these capabilities with us to our final product, and into our system architecture. Same as with our Event Store, all Events from our messaging platform were flushed to it. But unlike in our PoC, we needed to consider data costs.

We had no need to trace issues that occurred more than 15-30 days earlier. There definitely was no need, because those Events are already stored forever and can be replayed from our Event Store. What blocked us from cleaning out older Events, to prevent the endless growth, were the KPIs and Metrics.

In order to query them on time frames such as “day 0 till now”, to see historical, year-over-year or seasonal changes, it is supposedly required to retain them indefinitely. But we also expected hundreds of millions of Events to eventually be produced every day, meaning our ElasticSearch index would grow endlessly and in a costly manner.

To circumvent that we had a plan for a poller application, a plan that wasn’t executed. The poller would query ElasticSearch once a minute, and emit Events such as METRIC_COLLECTED and KPI_CREATED to our messaging platform. Those would be retained forever in our Event Store. Those could also be retained indefinitely in another index in our ElasticSearch, one that would still grow endlessly but at a much slower rate, and would not incur dire costs.
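
A rough sketch of that unexecuted plan, assuming the v7 @elastic/elasticsearch client and placeholder index, field and Event names:

const { Client } = require('@elastic/elasticsearch');

// Placeholder endpoint; the real cluster lived on Elastic Cloud.
const es = new Client({ node: 'https://elastic.example.com:9243' });

// Placeholder for the real messaging-platform producer.
async function publishToMessagingPlatform(event) {
  console.log('would publish:', JSON.stringify(event));
}

async function collectMetrics() {
  const now = Math.floor(Date.now() / 1000);

  // Count containers vacuumed in the last minute.
  const result = await es.count({
    index: 'events',
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'EVENT_NAME.keyword': 'CONTAINER_VACUUMED' } },
            { range: { TIMESTAMP: { gte: now - 60, lt: now } } },
          ],
        },
      },
    },
  });

  await publishToMessagingPlatform({
    EVENT_NAME: 'METRIC_COLLECTED',
    TIMESTAMP: now,
    PAYLOAD: { metric: 'containers_vacuumed_per_minute', value: result.body.count },
  });
}

// Once a minute, forever.
setInterval(collectMetrics, 60 * 1000);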

This very same concept was implemented at RapidAPI a few years later (2021), for their new API Tracing and Analytics product. A customer would get short-term retention for tracing, alongside indefinite retention for metrics. It made it possible to offer the product to existing customers without Changing the company’s financial model.

There was another component in the system architecture we needed to make sure was Reliable: the messaging platform itself, and it was entirely different from anything else we’ve talked about so far. On this, in the next chapter.
