The world is full of Asynchronous Workflow

The world is asynchronous. Many workflows and business processes you encounter out in the world are long-running and driven by asynchronous systems. Yet as developers, we’re still often writing procedural and synchronous code to model these business processes. I will give one of my favorite examples of going out to eat at a restaurant to illustrate this and how this can be applied to software.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Asynchronous Chicken Wings?

One of my favorite ways to explain asynchronous workflows is to use a restaurant as an example. There are asynchronous workflows everywhere when you eat at a restaurant.

However, many developers can get caught in the trap of only working with synchronous blocking RPC calls between services or boundaries when building systems.

To illustrate, let’s use the restaurant example. We can think of two different service boundaries. The waiter/waitress is one; the kitchen is another. If you were making blocking RPC calls between them, that would mean that when the customer places their food order with the waiter/waitress, they would immediately go to the kitchen and tell the cooking staff. The waiter/waitress would stand there waiting for the food to be finished so they could bring it to the customers’ table. The customer is doing nothing else this entire time; they are simply waiting.

Blocking

That’s not how a restaurant works. Interactions between the customer, waiter/waitress, and the kitchen aren’t blocking.

If you’re developing a microservices or service-oriented system and making service-to-service calls with HTTP or gRPC then you’re making blocking calls.

We know those interactions at a restaurant aren’t blocking. They are asynchronous. When a waiter places an order in the kitchen, they don’t stand there waiting for the food. So why would we write systems that communicate using blocking calls between them?

Asynchronous Workflow

To illustrate how the asynchronous workflow really works, first a waiter/waitress goes to a customers table and gets their order.

Asynchronous Workflow

They may right afterward go to another customer’s table and collect their order as well.

Asynchronous Workflow

Now that they have two orders from different customers, they proceed to place the order with the kitchen.

Asynchronous Workflow

At this time, our customers are free to do whatever they want. The waiter/waitress might be doing some other work unrelated to attending tables. The kitchen, at this point, is working on the two orders that were placed. Once they are finished, the waiter/waitress then gets the food from the kitchen.

Asynchronous Workflow

They would then immediately go and bring the food to the customers’ table.

Asynchronous Workflow

The same process would occur for the other customer order. Once it’s ready, the waiter/waitress would get the food from the kitchen and bring it to the customers’ table.

Nothing is blocking. The waiter/waitress is free to do other tasks while food is being cooked.

Messaging

You can build asynchronous workflows using messaging. Sending commands and publishing events to a message broker. This allows us to have our interactions asynchronous and in isolation. We aren’t temporally coupled between services. Each service can work independently without requiring the other.

When the client tells the wait staff (waiter/waitress) their order, they persist that into their database.

Client calls Wait Staff

They then could create a command for the kitchen boundary to prepare the food.

Wait Staff creates Message/Command

Once the message is sent to the broker, the client and wait staff interaction is complete. At this point, the wait staff isn’t concerned about the kitchen; it knows it will consume the message which tells it to prepare the food. If the kitchen is flooded with messages, preparing the food might take longer than expected. But because this is asynchronous, it doesn’t mean it will fail, just that it will take longer.

Asynchronous Workflow

The kitchen will then consume the message and prepare the food.

Kitchen Consumes Message/Command

Once the food is ready, the kitchen can publish an event back to the broker or possibly provide a reply to the wait staff to notify them that the food is ready.

All these interactions are asynchronous and non-blocking.

Asynchronous Workflow

Long-running business processes and workflows are everywhere. Just look to the real world sometimes for examples of how they work. You can often use them to illustrate how a message or event-driven system would work. Being loosely coupled between services can make your system much more resilient to failures and influx of load.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out the YouTube Membership or Patreon for more info.

You also might like

Follow @CodeOpinion on Twitter

Software Architecture & Design

Get all my latest YouTube Vidoes and Blog Posts on Software Architecture & Design

Event Carried State Transfer: Keep a local cache!

What’s Event Carried State Transfer, and what problem does it solve? Do you have a service that requires data from another service? You’re trying to avoid making a blocking synchronous call between services because this introduces temporal coupling and availability concerns? One solution is Event Carried State Transfer.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Temporal Coupling

If you have a service that needs to get data from another service, you might just think to make an RPC call. There can be many reasons for needing data from another service. Most often, it’s for query purposes to generate a ViewModel/UI/Reporting. If you need data to perform a command because you need data for business logic, then check out my post on Data Consistency Between Services.

The issue with making an RPC call and the temporal coupling that comes with it is availability. If we need to make an RPC call from ServiceA to ServiceB, and ServiceB is unavailable, how do we handle that failure, and what do we return to the client?

Service to Service

We want ServiceA to be available even when ServiceB is unavailable. To do this, we need to remove the temporal coupling so we don’t need to make this RPC call.

This means that ServiceA needs all the data to fulfill the request from the client.

Service has all the data within it's own boundary

Services should be independent. If a client makes a request to any service, that service should not need to make a call to any other service. It has to have all the data required.

Local Cache

One way to accomplish this is to be notified via an event asynchronously when data changes within a service boundary. This allows you to call back the service to get the latest data/state from the service and then update your database, which is acting as a local cache.

For example, if a Customer were changed in ServiceB, it would publish a CustomerChanged event containing the Customer ID that was changed.

Publish Event

When ServiceA consumes that event, it would then do a callback to ServiceB to get the latest state of the Customer.

Consume and Callback Publisher

This allows us to keep a local cache of data from other services. We’re leveraging events to notify other service boundaries that the state has changed within a service boundary. Other services can then call the service to update their local cache.

The downside to this approach is that you could be receiving/accepting a lot of requests for data from other services if you’re publishing many events. From the example, ServiceB could have an increased load handling the requests for data.

You’re still dealing with availability, however. If you consume an event and then make an RPC call to get the latest data, the service isn’t available or responding. As with any cache, it’s going to be stale.

Callback Failure/Availability

Event Carried State Transfer

Instead of having these callbacks to the event’s producer, the event contains the state. This is called Event Carried State Transfer.

If all the relevant data related to the state change is in the event, then ServiceA can simply use the data in the event to update its local cache.

Event Carried State Transfer

There are three key aspects to Event Carried State Transfer: Immutable, Stale, and Versioned.

Events are immutable. When they were published, they represented the state at that moment in time. You can think of them as integration events. They are immutable because you don’t own any of the data. Data ownership belongs to the service that’s publishing the event. You just have a local cache. And as mentioned earlier, you need to expect it to be stale because it’s a cache.

Versioning

There must be a version that increments within the event that represents the point in time when the state was changed. For example, if a CustomerChanged event was published for CustomerID 123 multiple times, even if you’re using FIFO (first-in-first-out) queues, that does not mean you’ll process them in order if you’re using the Competing Consumers Pattern.

Competing Consumers

When you consume an event, you need to know that you haven’t processed a more recent version already. You don’t want to overwrite with older data.

Check out my post Message Ordering in Pub/Sub or Queues and Competing Consumers Pattern for Scalability.

Data Ownership

So what type of data would you want to keep as a local cache updated via Event Carried State Transfer? Generally, reference data from supporting boundaries. Not transactional data.

Because reference data is non-volatile, it fits well for a local cache. This type of data isn’t changing often, so you’re not as likely to be concerned with staleness.

Transactional data, however, I do not see as a good fit. Generally, transactional data should be owned and contained within a service boundary.

An example with an online checkout process. When the client starts the checkout process, it makes requests to the Ordering service.

Start Checkout Process

The client then needs to enter their billing and credit card information. This information isn’t sent directly to the Ordering service but to the Payment service. The payment service would store the billing and credit card information to the ShoppingCartID.

Payment Information

Finally, the order is reviewed, and to complete the order, the client then requests the Ordering service to place the order. At this point, the order service would publish an OrderPlaced event only containing the OrderID and ShoppingCartID.

Place Order

The Payment service would consume the OrderPlaced event and use the ShoppingCartID within the event to look up the credit card information within its database so it can then call the credit card gateway to make the credit card transaction.

Consume OrderPlaced and Process Payment

Event Carried State Transfer

Event Carried State Transfer is a way to keep a local cache of data from other service boundaries. This works well for reference data from supporting boundaries that aren’t changing that often. However be careful about using it with transactional data and don’t force the use of event carried state transfer where you should be directing data to the appropriate boundary.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out the YouTube Membership or Patreon for more info.

Follow @CodeOpinion on Twitter

Software Architecture & Design

Get all my latest YouTube Vidoes and Blog Posts on Software Architecture & Design

Avoiding a QUEUE Backlog Disaster with Backpressure & Flow Control

I advocate a lot for asynchronous messaging. It can add reliability, temporal decoupling, and much more. But what are some of the challenges? One of them is backpressure and flow control. This occurs when you’re producing more messages can you can consume. Meaning you’re piling up messages in your queue and you can never catch up. The queue just keeps growing.

YouTube

Check out my YouTube channel where I post all kinds of content that accompanies my posts including this video showing everything in this post.

Producers and Consumers

Producers send messages to a broker/queue and a Consumer processes those messages. For a simplistic view, we have a single producer and a single consumer.

The producer creates a message and sends it to the broker/queue.

The message can sit on the broker until the consumer is ready to process it. This enables the producer and the consumer to be temporally decoupled.

Temporal Decoupling provided as the queue holds the message

The consumer then processes the message and it is removed from the broker/queue.

Consumer process the message from the queue

As long as you can consume messages on average faster than messages are produced, you won’t get into having a queue backlog.

But since there can be many producers, or because of load, you may start producing more messages at a faster rate than can be consumed.

For example, if you’re producing a single message every second, yet it takes you 1.5 seconds to process the message, you’re going to start filling up the queue. You’ll never be able to catch up and have an empty queue.

Queue backlog

Most systems will have peaks and valleys in terms of how many messages are produced. During the valleys is where the consumer can catch up. But if again, on average, you’re producing more messages than can be processed, you’re going to build a queue backlog.

Competing Consumers

One solution is to add more consumers so that you can process more messages concurrently. Basically, you’re increasing your throughput. You need to match or exceed the rate of production with consumption.

The competing consumer pattern is having multiple instances of the consumer that are competing for messages on the queue.

Competing Consumers of multiple consumer instances

Since we have multiple consumers, we can now process 2 messages concurrently. Since one consumer is busy processing a message, if another message is sent to the queue, we have another consumer that is available.

Each consumer available competes for the next message

The consumer that is available will compete for the next message and process it.

Competing consumers adds more processing which increases throughput

There are a couple of issues with the Competing consumers’ pattern.

The first is if you’re expecting to be processing messages in order. Just because you’re using a FIFO (first-in, first-out) queue, that does not mean you’ll process messages in order as they were produced. Because you’re processing messages concurrently, you could finish processing messages out of order.

The second issue is you’ve moved the bottleneck. Any downstream services that are used when consuming a message will now experience additional load. For example, if you’re interacting with a database, you’re now going to add additional calls to that database because you’re now processing more messages at a given time.

Competing consumers adding additional load on downstream services

Incoming

A queue is like a buffer. My analogy is to think of a queue as a pond of water. There is a stream of water as an inflow on one end, and a stream of water as an outflow on the other.

If the stream of water coming in is larger than the stream of water going out, the water level on the pond will increase. In order to lower the water level, you need to widen the outgoing stream to allow more water to escape. This will lower the water level.

But another way to maintain the water level is to limit the amount of water entering the pond.

In other words, limit the producer to only be able to add so many messages to the queue.

Setting a limit on the broker/queue itself means when the producer tries to send a message to the queue, if it’s reached its limit, it won’t accept the message.

Queue Backlog handled by limiting producer

Because the producer might be a client/UI, you might want to have built-in retry and other ways of handling this failure if you cannot enqueue a message. Generally, I think this way of handling this backpressure is used as a safeguard to not overwhelm the entire system.

Queue Backlog

Ultimately when dealing with queues (and topics!) you need to understand some metrics. The rate at which you’re producing messages. The rate at which you can consume messages. How long are messages sitting in the queue? What’s the lead time, from when it was produced to when it was consumed to be processed? What’s the processing time, and how long does it take to consume a specific message?

Knowing these metrics will allow you to understand how to handle backpressure and flow control. Look at both sides, producing and consuming.

Look at competing consumers’ pattern to increase throughput. Also, look at if there are optimizations to be made for how a consumer is processing a message. Be aware of downstream services that also will be affected by increasing throughput.

Add a safeguard on the producing side to not get into a queue backlog situation you can’t recover from.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design as well as access to source code for any working demo application that I post on my blog or YouTube. Check out the YouTube Membership or Patreon for more info.

Follow @CodeOpinion on Twitter

Software Architecture & Design

Get all my latest YouTube Vidoes and Blog Posts on Software Architecture & Design