McDonald’s Journey to Event-Driven Architecture

McDonald’s uses Event-Driven Architecture! Luckily for us, they’ve written a couple of blog posts detailing their journey into event-driven architecture. I’m going to go a bit deeper with my thoughts on how their system works and why they built it this way, so you can take away some ideas for your own systems.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

McDonald’s Event-Driven Architecture

It’s always interesting to see companies post details of the architecture of various systems they have. It can be insightful to see what they are doing, why, and what their challenges are. McDonald’s posted behind-the-scenes and how-it-works blog posts detailing their journey to event-driven architecture. More specifically, it’s not that they are new to event-driven architecture, but rather that they wanted a standardized way to implement it across distributed teams of developers with different skill levels.

McDonald's Event-Driven Architecture

There are many different components to their platform. Their infrastructure is within AWS, and they use MSK (Managed Streaming for Apache Kafka) along with ECS, DynamoDB, and API Gateway.

Here’s how everything works together.

Schema Registry

One of their challenges was related to data quality, likely because there was no set definition (schema) for the data within events. If multiple producers produce the same event type, they might not compose it exactly the same way. I believe an event should have a single publisher that owns the schema, to avoid this issue. However, a shared schema could still apply in a message-driven architecture that also uses queues and commands.

At startup, producers use a custom SDK that retrieves all the event schemas from the registry. This allows the producer to validate each event being produced against its schema.

Schema Registry and Producer SDK

If validation passes, the producer can then publish the event to the appropriate Kafka topic using the SDK.

Producer SDK to Kafka
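To make the flow concrete, here’s a minimal sketch of what a producer-side SDK might do, assuming JSON events validated with JSON Schema and the plain Kafka client. The registry URL, topic naming, and event shape are my own illustrations, not McDonald’s actual SDK.

```python
import json

import requests                      # assumption: the registry is exposed over HTTP
from jsonschema import validate      # assumption: events are defined with JSON Schema
from confluent_kafka import Producer

REGISTRY_URL = "https://schema-registry.example.com"        # hypothetical
producer = Producer({"bootstrap.servers": "broker:9092"})   # hypothetical brokers


def publish(event_type: str, version: int, payload: dict) -> None:
    # Fetch the schema for this event type/version (a real SDK would cache these at startup).
    schema = requests.get(f"{REGISTRY_URL}/schemas/{event_type}/{version}").json()

    # Validate the outgoing event; raises jsonschema.ValidationError if it doesn't conform.
    validate(instance=payload, schema=schema)

    # Only a valid event makes it onto the topic, tagged with the schema version used.
    producer.produce(
        topic=event_type,
        value=json.dumps(payload).encode("utf-8"),
        headers=[("schema-version", str(version))],
    )
    producer.flush()


publish("order.placed", 1, {"orderId": "123", "total": 9.99})
```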

As you might expect, the same thing occurs on the consumer side. At startup, consumers use a custom SDK that retrieves all the schemas from the registry, just like the producers do.

Consumer SDK to Registry

Then the consumers can process messages from the Kafka topics and know how to deserialize them based on the schema and its version.

Consumer SDK to Kafka
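On the consumer side, a minimal sketch could look like this, assuming the producer stamped a schema-version header on each message and the payloads are JSON; the topic, consumer group, and handler are hypothetical.

```python
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "loyalty-service",       # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order.placed"])


def handle_order_placed(event: dict, version: str) -> None:
    # Application-specific processing would go here.
    print(f"schema v{version}: {event}")


while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue

    # The schema version the producer stamped on the message tells us how to interpret it.
    headers = dict(msg.headers() or [])
    version = headers.get("schema-version", b"1").decode()

    handle_order_placed(json.loads(msg.value()), version)
```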

Everything within any Kafka topic should be valid based on all the schemas (versioned) within the registry. Data quality issues are solved!

Validation

Of course, not everything goes through the happy path. What if a producer tries to publish an event, but it fails to validate against the schema? The producer then publishes the message to a Dead Letter Queue. Kafka isn’t a queue, so this is a Dead Letter Topic.

Producer to DLQ
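Sketched out, that producer-side fallback might look like the following, assuming JSON Schema validation; the dead letter topic naming convention and error payload are my own, not details from the post.

```python
import json

from confluent_kafka import Producer
from jsonschema import ValidationError, validate

producer = Producer({"bootstrap.servers": "broker:9092"})


def publish_or_dead_letter(topic: str, payload: dict, schema: dict) -> None:
    try:
        validate(instance=payload, schema=schema)
        producer.produce(topic, json.dumps(payload).encode("utf-8"))
    except ValidationError as err:
        # Invalid events go to a dead letter topic along with the validation error,
        # so they can be inspected, fixed, and re-published from an admin UI later.
        producer.produce(
            f"{topic}.dead-letter",
            json.dumps({"payload": payload, "error": err.message}).encode("utf-8"),
        )
    finally:
        producer.flush()
```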

Once a message is in the “DLQ,” there needs to be a way to view, modify, and fix the event so it can be re-published to the correct topic.

For this, an Admin/Utility UI provides this functionality for them.

Admin Utility

Reliable Publishing

The second failure that can occur is failing to publish to Kafka (MSK). Anyone getting involved in Event-Driven Architecture is bound to run into this. You need consistency between making state changes to your business data and publishing your events. When events become critical to your system and its workflows, you need guarantees that the relevant events are published whenever you make a state change to business data.

McDonald’s chose to use DynamoDB to persist any events that cannot be published to Kafka. This means their Publisher SDK will fall back to storing the event data within DynamoDB if it cannot publish to Kafka.

Fallback to DynamoDB
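Here’s a rough sketch of that fallback, assuming JSON events and a fallback table keyed by an event id; the table name, attributes, and error handling are illustrative rather than McDonald’s actual implementation.

```python
import json
import time
import uuid

import boto3
from confluent_kafka import KafkaException, Producer

producer = Producer({"bootstrap.servers": "broker:9092"})
fallback_table = boto3.resource("dynamodb").Table("failed-events")   # hypothetical table


def publish_with_fallback(topic: str, payload: dict) -> None:
    try:
        producer.produce(topic, json.dumps(payload).encode("utf-8"))
        producer.flush()
    except (KafkaException, BufferError):
        # Durably store the event so a retry process can publish it later.
        fallback_table.put_item(Item={
            "eventId": str(uuid.uuid4()),
            "topic": topic,
            "payload": json.dumps(payload),
            "failedAt": int(time.time()),
        })
```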

Using a fallback to some durable storage is a common approach. However, the Outbox Pattern is another common solution. I discussed this and other common issues in a post about the 5 Pitfalls of EDA.

Once the event data is in DynamoDB, they use a Lambda to pull it from DynamoDB and retry publishing it to Kafka. I’d assume they have different retry intervals/backoffs.

Lambda Retry
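And a sketch of the retry side as a scheduled Lambda handler; in practice this might be driven by DynamoDB Streams, use pagination, and apply backoff between attempts, and the names here are hypothetical.

```python
import boto3
from confluent_kafka import Producer

table = boto3.resource("dynamodb").Table("failed-events")     # same hypothetical table
producer = Producer({"bootstrap.servers": "broker:9092"})


def handler(event, context):
    # Pull stored events and retry publishing them to Kafka.
    for item in table.scan().get("Items", []):
        producer.produce(item["topic"], item["payload"].encode("utf-8"))
        producer.flush()
        # Only remove the fallback record once the publish has gone through.
        table.delete_item(Key={"eventId": item["eventId"]})
```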

Gateway

Lastly, if you’re integrating with 3rd parties, or even other teams within a large organization, you’ll need them to publish events. However, they won’t have direct access to your SDK and Kafka. For this, they use API Gateway as an HTTP interface: incoming HTTP requests are handed off to a producer that has the SDK and can publish to Kafka.

Event Gateway

That way, we go through the same validation against the schema in the registry, just as if our own client code were using the producer SDK. This allows external 3rd parties to publish events without using our SDK directly; they use our Event Gateway (HTTP API) instead.
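A minimal sketch of such an event gateway endpoint, using Flask purely for illustration; the route, payload shape, and the publish helper (the producer-SDK sketch from earlier) are assumptions, not details from the post.

```python
from flask import Flask, jsonify, request

from producer_sdk import publish   # hypothetical module wrapping the earlier producer sketch

app = Flask(__name__)


@app.post("/events/<event_type>")
def receive_event(event_type: str):
    payload = request.get_json()
    try:
        # Same schema validation and publish path our own services use.
        publish(event_type, version=1, payload=payload)
    except Exception as err:
        return jsonify({"error": str(err)}), 400
    return jsonify({"status": "accepted"}), 202
```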

Technical Blog Posts

I love when companies publish technical blog posts that give insights into their architecture and design. It’s hard to know the full context, but it’s interesting to see how they solve the issues they run into. Companies face many common issues when using Event-Driven Architecture, but each has unique constraints.

If you have any recommendations for other technical blog post analyses, please leave a comment!


How your “Sr.” Devs incurred Technical Debt

Are you overwhelmed by technical debt? Taking the path of least resistance when implementing new features in a large existing codebase will ultimately turn it into a difficult-to-change turd pile. It’s a vicious circle. Making the “quick change” constantly makes it harder to make future changes. So what’s the solution? Being aware of technical debt, stop solely thinking about data, and give yourself options in your architecture.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Path of Least Resistance

One common reason for a system growing over time and becoming unmaintainable is developers choosing to take the path of least resistance when implementing a change.

This happens for various reasons, such as time constraints, unfamiliarity with the system, lack of domain knowledge, poor overall architecture & design, etc.

For example, let’s say we have a typical web application: an underlying web framework invokes our application logic, which calls through to our domain and then interacts with a database.

Application Request

When a new feature is implemented, it’s common to look at other features as templates for developing it. Or, worse, to take an existing feature and add the code needed for the new feature throughout its stack. I say worse because this can often conflate two concepts that seem similar but are very distinct. Merging the two concepts within the same code path can add complexity.

Layers

This means we may change existing code through the entire stack: the client, web API, application code, domain, and database.

You may decide to piggyback off another feature because of time constraints. It’s not that the feature is difficult to implement; it’s time-consuming, or will take more time than you have. Or, if you’re new to the codebase or it’s brittle, you might be afraid to make changes because you know they can break other parts of the system and you don’t want to cause any regressions.

The path of least resistance is making a change that you know isn’t going to break anything and isn’t overly time-consuming, but it’s not necessarily ideal. It’s likely good for right now but not good for the long run.

Technical Debt

Technical debt isn’t inherently bad. For me, technical debt comes in two forms. The first is when you’re aware of and choosing to take on technical debt at that very moment, knowing it adds value now but will cause issues in the future. Choosing to make this explicit decision isn’t bad.

However, when you’re unaware that you’re making these types of decisions, you’re headed in the wrong direction.

If you’re making explicit decisions about the tradeoffs of technical debt, you’re aware of the debt being incurred. You can then explicitly choose when to pay off (refactor) that debt. For example, with a startup, you might incur debt right now so that you have a future.

On the other hand, if you’re unaware that you’re incurring technical debt, when would you ever realize how much debt has been incurred and needs to be addressed? Taking the path of least resistance without realizing it is one way this happens. While it seems like it’s helping you now, it could be hindering you now and even more so in the future.

Coupling & Cohesion

Software Architecture is about making key decisions at a low cost that give you options in the future. Having a good architecture allows you to evolve your system over time. As a codebase and system grow, it should not hinder future development. I’ve talked about this more in my post What is Software Architecture?

Why is a system brittle and hard to change? Generally, it has a high degree of coupling from higher and lower levels within a system. I find this is often because of the focus on data and informational cohesion rather than functional cohesion.

For example, let’s say we’re working in an e-commerce and warehouse system. There is the concept of a product. When we think about data first, we think of a singular product that holds all the information related to an individual product: the name, price, location in the warehouse, the quantity on hand, whether it is available for sale, etc.

In reality, a system for e-commerce and a warehouse would be huge. A large codebase that multiple departments would use in an organization. Sales, Purchasing, Warehouse (shipping & receiving), Accounting, and more.

In other words, I’m simplifying this example only to show a few different pieces of data related to a product, but in reality, there would be a lot.

When focusing on data primarily, we lose sight of the behaviors that relate to this data. What does the QuantityOnHand have to do with the Price? What does the Location have to do with the Description?

Nothing.

We’ve lumped all aspects into one concept of a product. However, in a large system like this, the concept of a product would exist in many different forms depending on the behaviors provided.

Product Entities

Sales has a concept of the product that cares about the Selling Price and whether we’re selling it. It’s customer-focused.

Purchasing cares about the price from the vendor or manufacturer, which is our cost. It’s vendor-centric.

The warehouse cares about the location of the product in the warehouse and the assumed quantity on hand.

Each logical boundary has a concept of a product but has different concerns in each of its own contexts.

This means that instead of mixing all these different concerns together, we should be driven by the capabilities of each boundary and then the data ownership for those capabilities.
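As a rough sketch of what that separation could look like (the class and field names are my own, not from the post), each boundary gets its own model of a product with only the data and behavior it owns:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SalesProduct:          # Sales: customer-facing concerns
    sku: str
    selling_price: Decimal
    for_sale: bool


@dataclass
class PurchasingProduct:     # Purchasing: vendor-facing concerns
    sku: str
    vendor_id: str
    cost: Decimal


@dataclass
class WarehouseProduct:      # Warehouse: physical/operational concerns
    sku: str
    location: str
    quantity_on_hand: int

    def receive_stock(self, quantity: int) -> None:
        # Behavior lives next to the data it owns.
        self.quantity_on_hand += quantity
```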

Low functional cohesion will lead to a high degree of coupling.

Defining logical boundaries by grouping related behaviors will lead to higher cohesion, which can then lead to loose coupling.

Awareness

Part of taking the path of least resistance is being aware of the trade-offs you are making between coupling and cohesion. Earlier, I mentioned piggybacking off an existing feature to implement a new one. That’s coupling. Again, not a bad thing if that decision is explicit.

Over time, left unchecked, if you’re unaware of the technical debt you’re creating, you’ll end up with a large turd pile that’s brittle and hard to change.

If you are aware, you can choose when to pay down debt (refactor). By making those decisions over time, you can manage the amount of debt incurred, never letting it get out of hand.

I often say a system is a turd pile because nothing is perfect. It’s a constant battle to pay down debt, whether you choose it explicitly or not.


Data Partitioning! Don’t let growth SLOW you down!

Why would you want data partitioning? Data keeps growing. Hopefully, you’re working on a system with active customers and a good lifespan! If so, the amount of stored data will keep growing over time. The challenge is managing that growth, which can start to impact performance. I will discuss how this data is used over time and strategies to keep growth from affecting performance.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Multi-Tenant

In a multi-tenant application, there are multiple ways to silo data. First, you can use the same database instance, where a partition key indicates which data belongs to which tenant.

Multi-Tenant

For example, tables in a relational database would have a TenantId, allowing you to query and retrieve data for a specific tenant.

Partition Key
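A minimal sketch of that, using SQLite as a stand-in for whatever relational database you use; the table, columns, and query are hypothetical:

```python
import sqlite3


def get_open_tickets(conn: sqlite3.Connection, tenant_id: str) -> list:
    # Every query includes the TenantId, so one tenant never sees another's data.
    return conn.execute(
        "SELECT id, subject, status FROM tickets "
        "WHERE tenant_id = ? AND status = 'open'",
        (tenant_id,),
    ).fetchall()
```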

For various reasons, data growth being one of them, you may choose to instead silo data in its own database instance per tenant.

Silo Data

There are many reasons to silo data; check out my post on Multi-tenant Architecture for SaaS for more on various aspects of sharing compute, identity, and more within a Multi-Tenant SaaS architecture.

Regardless of whether you use a partition key or silo data, hopefully you’re working in a system where data is growing! Ultimately, this would mean you’re building a system with a lot of activity and users. Congrats!

The issue is that, over time, more users means more data.

Data Growth

How you initially wrote your system to query specific data may not be ideal now that you have a vastly different volume of data. Queries may have performed well a couple of years ago, but with the volume of data you have now, those queries could be much slower, impacting overall system performance.

Lifecycle

In many line-of-business and enterprise systems, not all data is actively relevant. Many business processes and workflows related to data have a finite lifecycle.

There is generally some initial creation, some work in process, and finally, some completion.

Lifecycle

As an example, let’s use a support ticketing system. When a user opens a support ticket, it goes through a lifecycle until the ticket is finally closed.

Support Ticket

At this point, this support ticket is not likely very active in terms of interactions. A support ticket that was closed 2 years ago is not likely active at all, and users aren’t viewing it.

A small portion of the total data we store in our database is likely active/hot. Older data whose lifecycle has completed is still available to be queried or changed; however, it’s considered warm data that isn’t active.

Warm & Hot Data

Without any type of partitioning, we have a mix of hot and warm data in our database.

Mixed Hot and Warm Data

Instead, we can organize our data by whether it is warm or hot and increase system performance because we deal with a much smaller dataset when performing actions against hot data.

Partitioned Hot & Warm Data

As for implementation, this could mean that when a ticket is closed, we move it to a separate table/collection or even a separate database. The strategy of when to move it is up to you, but the gist is to segregate hot and warm data.
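As a sketch of one possible implementation (SQLite as a stand-in again, with hypothetical table names), closing a ticket could move the row from the hot table to a warm one in a single transaction:

```python
import sqlite3


def close_ticket(conn: sqlite3.Connection, ticket_id: int) -> None:
    with conn:  # one transaction: copy to the warm table, then remove from the hot one
        conn.execute(
            "INSERT INTO tickets_closed SELECT * FROM tickets WHERE id = ?",
            (ticket_id,),
        )
        conn.execute("DELETE FROM tickets WHERE id = ?", (ticket_id,))
```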

Time

Another strategy, which often also maps to hot/warm, is to partition by time. It is very typical in many domains to have a time-bound partition. Excellent examples are fiscal or calendar years for finance, or employee pay periods, often weekly or bi-weekly.

Data Partitioned by Year

Summary

Often transactional data is only needed until it is summarized, at which point the summary becomes the truth. I worked in a system with a high volume of transactional data partitioned by the hour of the day it was received. After each hour, the data was summarized and removed from the hourly table.

Data Partitioned by Time
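A sketch of that kind of hourly roll-up, with a hypothetical schema: summarize the hour into a summary table, then remove the raw transactional rows.

```python
import sqlite3


def summarize_hour(conn: sqlite3.Connection, hour: str) -> None:
    with conn:  # summarize and purge in one transaction
        conn.execute(
            "INSERT INTO hourly_summary (hour, total_amount, transaction_count) "
            "SELECT ?, SUM(amount), COUNT(*) FROM transactions WHERE received_hour = ?",
            (hour, hour),
        )
        conn.execute("DELETE FROM transactions WHERE received_hour = ?", (hour,))
```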

Often these are actual business concepts that you can capture. A warehouse with physical goods performs stock counts to confirm the actual quantity on hand against what is recorded within the system. The database/system is not the point of truth; the warehouse is. Just because the database shows a quantity of 10 for a product does not mean there are 10 in the warehouse. A product could be damaged, stolen, or simply can’t be located. The transactional data of products received & shipped “should” add up to the quantity on hand. But it doesn’t always. That’s why stock counts exist to reconcile what is in the warehouse.

Summary Business Concepts

We can use these business events as a summary. The stock count is the summary event we can use. The other transactional events could be archived.

Cold Data

Speaking of archiving, you do not need to delete data once you’ve summarized it. You can archive it as cold data. Archived data can be restored if needed for compliance or regulatory reasons.

Archived Cold Data

This data isn’t immediately available as it’s archived, and it is removed from warm and hot data storage.

Data Partitioning

Hopefully, this gave you some ideas about how you can partition data over time as it grows. Time is a key aspect, as most business data is time-bound or goes through a finite lifecycle or lifespan. You can also leverage business concepts that inherently are summaries of a point in time, after which the previous transactional data can be archived.
