Event-driven architecture has a lot of benefits, but it comes with its own set of problems. Some of the most common event-driven architecture issues are visibility into workflows and business processes, consistency, idempotency, and consumer lag.
YouTube
Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.
Visibility
While there are undeniable benefits, the trade-off is often increased complexity. One of the primary event-driven architecture issues is visibility, especially when using messaging for workflows.
Let’s consider a scenario where a client places an order. This action sends a command to our broker, which the sales service picks up. The sales service processes the order and publishes an event to a topic within the broker. This triggers a workflow: the billing service consumes the event and responds with a confirmation that the order has been billed. Sales then updates the order status and notifies the shipping service to create a shipping label. But how do we gain visibility into this entire workflow?
In a real-world scenario, you’re likely dealing with multiple consumers kicking off various workflows from a single published event. The challenge lies in tracking the entire process: when it started, what the finish line looks like, and how to gain insights into the services passing messages around. Ideally, we want a timeline showing each step of the process, from when client A publishes the message to service A consuming it, and so on.
OpenTelemetry is a fantastic tool for achieving this visibility. While many are familiar with it for capturing HTTP requests and database calls, it excels in tracking asynchronous workflows too. In my example in the video, I’m using NServiceBus configured with OpenTelemetry and Zipkin to illustrate how it works. By setting breakpoints and running through the process, we can visualize everything happening in real-time.
However, there are costs associated with this visibility. In high-volume systems, capturing every trace can be expensive and impractical. You might find yourself sampling data to manage costs, which means the one trace you need while debugging may never have been recorded. If you’re using OpenTelemetry, I’d love to hear your experiences in the comments: are you self-hosting or using cloud services, and how are the costs affecting you?
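As a rough illustration of sampling, here is a sketch of a ratio-based decision made from the trace ID alone. This is my own simplified version, not the code of any particular tracing library, and the `should_sample` name is hypothetical. The point is that every service seeing the same trace ID reaches the same keep-or-drop decision, so a trace is either captured whole or not at all.

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the hex trace ID as a uniformly distributed number.
    bucket = int(trace_id[-16:], 16)
    return bucket < ratio * 2**64


# The decision is deterministic for a given trace ID and ratio.
low_id = "0" * 32   # bucket 0, kept for any ratio above zero
high_id = "f" * 32  # largest bucket, kept only when ratio is 1.0

keep_low = should_sample(low_id, 0.1)
keep_high = should_sample(high_id, 0.1)
```

The trade-off is exactly the one described above: at a ratio of 0.1, about nine out of ten workflows leave no trace behind.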
Check out my post, Distributed Tracing to discover a Distributed BIG BALL of MUD
Consistency
One of the major event-driven architecture issues is consistency. For example, when a client makes a request to our app service, we change the state in our database and publish an event to alert other consumers. If there’s a bug in the code that leads to incorrect data in the database, we often find ourselves fixing it by hand, meaning we’re in the database manually changing the state.
This can create a disconnect between the database state and the published events, leading to potential inconsistencies.
One solution could be to use a change data capture (CDC) tool to handle these discrepancies. However, this often leads to losing the intent behind the original events, making it challenging to derive the correct state for other consumers. In many cases, using event sourcing negates this issue, as you would look at the event stream to derive the correct state without needing to directly manipulate the database.
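Here is a minimal sketch of that idea, assuming a made-up order stream. The event names (`OrderPlaced`, `OrderTotalCorrected`, `OrderBilled`) and their fields are hypothetical. Instead of editing a row by hand, the fix is appended as another event, so the intent is recorded and any consumer can derive the same state from the stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str
    data: dict


def apply(state: dict, event: Event) -> dict:
    """Return the next state after applying one event."""
    if event.type == "OrderPlaced":
        return {
            "order_id": event.data["order_id"],
            "total": event.data["total"],
            "status": "placed",
        }
    if event.type == "OrderTotalCorrected":
        return {**state, "total": event.data["total"]}
    if event.type == "OrderBilled":
        return {**state, "status": "billed"}
    return state  # unknown event types are ignored in this sketch


def current_state(stream: list[Event]) -> dict:
    """Derive the current state by replaying the stream from the start."""
    state: dict = {}
    for event in stream:
        state = apply(state, event)
    return state


stream = [
    Event("OrderPlaced", {"order_id": "A1", "total": 90}),  # a bug recorded the wrong total
    Event("OrderTotalCorrected", {"total": 100, "reason": "pricing bug"}),
    Event("OrderBilled", {}),
]
state = current_state(stream)
```

The correction is visible in the stream with its reason attached, which is the intent a CDC feed of row changes would not carry.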
At Least Once
Idempotency is yet another hurdle. Processing the same message multiple times can lead to chaos within the system. For instance, if a consumer processes a message that builds an order more than once, it likely won’t end well.
Brokers typically expect an acknowledgment from the consumer that a message was received and/or processed. The usual flow is to receive the message, process it, and then send the acknowledgment back to the broker to confirm it was processed.
One way you can receive the same message again is if you’ve exceeded the timeout for sending back the acknowledgment. Depending on the broker, this is called the invisibility timeout, the visibility timeout, or the lock duration.
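To show how that redelivery happens, here is a small in-memory simulation. It is not any real broker's API. The `InMemoryQueue` class is hypothetical, time is passed in explicitly so the example is deterministic, and real brokers usually acknowledge with a receipt or lock token rather than the message ID.

```python
from dataclasses import dataclass


@dataclass
class _Entry:
    message_id: str
    body: str
    invisible_until: float = 0.0


class InMemoryQueue:
    """Toy queue: a received message is hidden for `visibility_timeout` seconds."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._entries: list[_Entry] = []

    def send(self, message_id: str, body: str) -> None:
        self._entries.append(_Entry(message_id, body))

    def receive(self, now: float):
        """Return the first visible message and hide it until the timeout passes."""
        for entry in self._entries:
            if entry.invisible_until <= now:
                entry.invisible_until = now + self.visibility_timeout
                return entry.message_id, entry.body
        return None

    def ack(self, message_id: str) -> None:
        """Acknowledge: the message is removed and will not be delivered again."""
        self._entries = [e for e in self._entries if e.message_id != message_id]


queue = InMemoryQueue(visibility_timeout=30.0)
queue.send("msg-1", "PlaceOrder")

first = queue.receive(now=0.0)         # delivered, hidden until t=30
hidden = queue.receive(now=10.0)       # nothing visible yet
redelivered = queue.receive(now=45.0)  # no ack within 30s, so it is delivered again
queue.ack("msg-1")
after_ack = queue.receive(now=100.0)   # acknowledged, so nothing is left
```

The consumer that took the message at t=0 may still be working on it at t=45, which is how two handlers can end up processing the same message.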
To address this, we need to ensure our system is idempotent, meaning that processing the same message multiple times doesn’t lead to unintended side effects.
One method is to keep track of processed message IDs in a storage solution, allowing the consumer to check if a message has already been processed. Alternatively, you could limit the number of instances of a consumer to ensure that messages are processed one at a time, although this can impact throughput.
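Tracking processed message IDs could look something like the sketch below. The `IdempotentConsumer` class is hypothetical, and the in-memory set stands in for durable storage. In a real system you would want to record the message ID in the same transaction as the state change, otherwise a crash between the two steps can still cause a duplicate or a lost update.

```python
from typing import Callable


class IdempotentConsumer:
    """Wraps a handler so a message ID is only processed once."""

    def __init__(self, handler: Callable[[str], None]):
        self._handler = handler
        self._processed_ids: set[str] = set()  # stand-in for a durable store

    def handle(self, message_id: str, body: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._processed_ids:
            return False  # already handled, skip the side effects
        self._handler(body)
        self._processed_ids.add(message_id)
        return True


orders: list[str] = []
consumer = IdempotentConsumer(orders.append)

results = [
    consumer.handle("msg-1", "order A"),
    consumer.handle("msg-1", "order A"),  # redelivery of the same message
    consumer.handle("msg-2", "order B"),
]
```

The second delivery of `msg-1` is detected and skipped, so only two orders are created.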
Consumer Lag
Then there’s the issue of consumer lag. If your system produces messages faster than they can be consumed, you’ll end up with a backlog that can overwhelm your system.
The solution here is straightforward: scale out by adding more instances to handle the increased load. This is called the Competing Consumers pattern.
This allows you to process more messages concurrently, thus increasing throughput. However, this can introduce other complexities, especially if you have to maintain order in message processing. Check my post: Message Ordering in Pub/Sub or Queues
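Here is a minimal sketch of competing consumers, using threads and an in-process queue as a stand-in for a broker and separate service instances. The `run_competing_consumers` function and the consumer names are hypothetical.

```python
import queue
import threading


def run_competing_consumers(messages: list, worker_count: int) -> list:
    """Drain one shared queue with several workers; each message goes to exactly one."""
    work: queue.Queue = queue.Queue()
    for message in messages:
        work.put(message)

    processed = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # backlog is drained, this consumer stops
            with lock:
                processed.append((name, message))
            work.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"consumer-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


processed = run_competing_consumers(list(range(20)), worker_count=4)
handled = sorted(message for _name, message in processed)
```

Every message is handled exactly once, but the order in `processed` depends on thread scheduling and is not guaranteed to match the order the messages were sent. That is the ordering complexity mentioned above.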
Event-Driven Architecture Issues
Ultimately, event-driven architecture is not without its challenges. I’m a very vocal and strong advocate for messaging and event-driven architecture. However, it’s important to understand some of the issues you’ll face. Visibility, consistency, idempotency, and consumer lag are just a few hurdles you’ll encounter. It’s important to carefully consider these aspects when designing your system to ensure it remains resilient and efficient.