Distributed Tracing to discover a Distributed BIG BALL of MUD

Distributed tracing is great for observing how a request flows through a set of distributed services. However, it can also be used as a band-aid to mask a problem that you shouldn’t have.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Distributed Tracing

So why is distributed tracing helpful? When you’re working within a Monolith, you generally have an entire request processed within the same process. This means if there is a failure of any sort, you can capture the entire stack trace.

Monolith

When you’re in a large system that is decomposed into a set of distributed services that interact, it can be very difficult to track where an exception is occurring. It can also be difficult to know the latency or processing time for the entire request and where the bottleneck might be among the service-to-service calls.

As an example, a client makes a request to ServiceA, which needs to call other services, and those services might in turn call yet more services.

Distributed Monolith

With distributed tracing, you could see the flow of a request that passes through multiple distributed services. To illustrate, here’s a timeline of the diagram above.

Distributed Monolith Timeline

So it’s great that distributed tracing gives us a way of observing a request’s flow. The problem is that this style of service-to-service communication leads to another set of challenges beyond tracing.

Distributed tracing in this service-to-service system style is a band-aid for a problem you shouldn’t have. Blocking synchronous calls, such as HTTP, from service to service can introduce issues with latency, fault tolerance, and availability, all because of temporal coupling. Check out my post REST APIs for Microservices? Beware!, which dives deeper into this topic.

Blocking Synchronous Calls

However, not all blocking synchronous calls can be avoided. Specifically, any type of query, such as a request from a client, will naturally be a blocking synchronous call. If you’re doing any type of UI composition, you may choose to use a BFF (Backend for Frontend) or API gateway to do this composition. The BFF makes synchronous calls to all the backing services to get data from each and compose a result for the client.

UI Composition

Distributed tracing in this situation is great! We’ll be able to see which services have the longest response time because, ultimately, if we are making all the calls from the BFF to the backing services concurrently, the slowest response determines the total execution time seen by the client.

UI Composition Distributed Tracing Timeline
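As a rough sketch (with hypothetical backing-service URLs), a BFF endpoint might fan out the calls concurrently and compose the responses, which is exactly the shape that shows up in the trace timeline above:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("bff/orders")]
public class OrderDetailsController : ControllerBase
{
    private readonly IHttpClientFactory httpClientFactory;

    public OrderDetailsController(IHttpClientFactory httpClientFactory)
        => this.httpClientFactory = httpClientFactory;

    [HttpGet("{orderId}")]
    public async Task<IActionResult> Get(string orderId)
    {
        var client = httpClientFactory.CreateClient();

        // Hypothetical backing services; the calls are started concurrently so the
        // total time is roughly the slowest response, not the sum of all of them.
        var orderTask    = client.GetStringAsync($"http://ordering/orders/{orderId}");
        var shippingTask = client.GetStringAsync($"http://shipping/status/{orderId}");
        var billingTask  = client.GetStringAsync($"http://billing/invoices/{orderId}");

        await Task.WhenAll(orderTask, shippingTask, billingTask);

        // Compose a single result for the client from the individual responses.
        return Ok(new
        {
            Order    = await orderTask,
            Shipping = await shippingTask,
            Billing  = await billingTask
        });
    }
}
```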

Workflow

Another great place for distributed tracing is with asynchronous workflows. It has always been very challenging to see the flow of a request executed asynchronously by passing messages via a message broker. Distributed tracing solves that and allows us to visualize that flow.

As an example, the client requests the initial service to perform a command/action.

Start Workflow

The service will then create a message and send it to the message broker for the next service to continue the workflow.

Send Message to Broker

Another service will pick up this message and perform whatever action it needs to take to complete its part of the entire workflow.

Consume Message from another service

Once the second service has completed processing the message, it may send another message to the broker.

Continue workflow by sending another message to the broker

A third service (ServiceC) might pick up that message from the broker and perform some action that is a part of this long-running workflow. And just like the others, it may send a message to the broker once it’s complete.

Consume message from a final service

At this point, ServiceA, which started the entire workflow, may consume that last message sent by ServiceC to do some finalization of the entire workflow.

Initial service completes workflow

Because this entire workflow is executed asynchronously, removing the temporal coupling, the services don’t all have to be online and available at the same time. Each service consumes and produces messages at its own rate, based on its own availability, without causing the entire workflow to fail.

Distributed Tracing timeline
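To make the workflow above concrete, here’s a minimal sketch of a single step using NServiceBus (the messaging library used in the sample below). The message types and the work done are hypothetical; the point is that each service only consumes a message from the broker and produces the next one.

```csharp
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical messages for the workflow above.
public class OrderPlaced : IEvent { public string OrderId { get; set; } }
public class BillOrder : ICommand { public string OrderId { get; set; } }

// ServiceB consumes a message from the broker, does its part of the
// workflow, and continues it by sending the next message.
public class OrderPlacedHandler : IHandleMessages<OrderPlaced>
{
    public async Task Handle(OrderPlaced message, IMessageHandlerContext context)
    {
        // ... perform ServiceB's part of the workflow ...
        await context.Send(new BillOrder { OrderId = message.OrderId });
    }
}
```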

OpenTelemetry & Zipkin

I’ve created a sample app that uses OpenTelemetry with NServiceBus for an asynchronous workflow that can then be visualized with Zipkin. If you want access to the full source code example, check out the YouTube Membership or Patreon for more info.

As an example with ASP.NET Core, I’ve added the OpenTelemetry packages and registered them in ConfigureServices of the Startup class. This adds tracing for NServiceBus, any calls made using HttpClient, and ASP.NET Core itself.
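The registration looks roughly like the following. This is a minimal sketch rather than the exact code from my sample: the instrumentation packages and extension method names vary between OpenTelemetry versions, and the NServiceBus activity source name depends on which integration you use.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();

        services.AddOpenTelemetryTracing(tracing => tracing
            .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("ServiceA"))
            .AddAspNetCoreInstrumentation()   // incoming ASP.NET Core requests
            .AddHttpClientInstrumentation()   // outgoing HttpClient calls
            .AddSqlClientInstrumentation()    // SQL Express / SQL Server calls
            .AddSource("NServiceBus.Core")    // NServiceBus activities (source name depends on the integration)
            .AddZipkinExporter(options =>
                options.Endpoint = new Uri("http://localhost:9411/api/v2/spans")));
    }
}
```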

With NServiceBus, I have a saga that orchestrates sending commands to various logical boundaries to complete the workflow.
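For context, an orchestrating saga along these lines might look like the sketch below. The message types are illustrative and the exact saga-mapping API differs between NServiceBus versions; this is not the actual code from the sample.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical messages exchanged between logical boundaries.
public class PlaceOrder : ICommand { public Guid OrderId { get; set; } }
public class BillOrder : ICommand { public Guid OrderId { get; set; } }
public class OrderBilled : IEvent { public Guid OrderId { get; set; } }
public class ShipOrder : ICommand { public Guid OrderId { get; set; } }

public class OrderWorkflow :
    Saga<OrderWorkflow.OrderWorkflowData>,
    IAmStartedByMessages<PlaceOrder>,
    IHandleMessages<OrderBilled>
{
    public class OrderWorkflowData : ContainSagaData
    {
        public Guid OrderId { get; set; }
    }

    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderWorkflowData> mapper)
    {
        // Correlate every message in the workflow by OrderId.
        mapper.MapSaga(saga => saga.OrderId)
            .ToMessage<PlaceOrder>(msg => msg.OrderId)
            .ToMessage<OrderBilled>(msg => msg.OrderId);
    }

    public Task Handle(PlaceOrder message, IMessageHandlerContext context)
    {
        Data.OrderId = message.OrderId;
        // Tell the Billing boundary to do its part of the workflow.
        return context.Send(new BillOrder { OrderId = message.OrderId });
    }

    public async Task Handle(OrderBilled message, IMessageHandlerContext context)
    {
        // Continue the workflow in the next logical boundary, then end the saga.
        await context.Send(new ShipOrder { OrderId = message.OrderId });
        MarkAsComplete();
    }
}
```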

After running the sample app, I can open up Zipkin and see the entire trace that spans my ASP.NET Core app and the various logical boundaries, including the database calls to SQL Express and the HTTP call to Fedex.com.

Distributed Tracing

Distributed tracing is great for collecting data and observing the flow of a long-running business process, or if you’re doing UI Composition using synchronous request/response involving many different services. However, don’t use it as a crutch to manage a pile of service-to-service synchronous requests/responses that has become difficult to untangle. If anything, use distributed tracing to realize you have a highly coupled distributed monolith so you can remove some of the temporal coupling, making your system more loosely coupled and resilient.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out the YouTube Membership or Patreon for more info.

Avoiding Batch Jobs by a message in the FUTURE

Some people call them cron jobs, scheduled tasks, or batch jobs. Whatever you call it, it’s a process that runs periodically, looking at the state of a database to determine some specific action to take for the various records it finds. If you’ve dealt with this, you probably know it can be a nightmare, especially with failures. And of course, these usually run in the middle of the night, so get ready for a page!

YouTube

Check out my YouTube channel where I post all kinds of content that accompanies my posts including this video showing everything in this post.

Reservation

A while back I placed an online order at a big box store that was set for pick-up at the store, rather than delivery. Here’s the confirmation order:

Order Confirmation Email

Note that the email says that once my order is ready for pick-up, I’ll get an email. This is because an employee at the store has to physically go get the item off the shelf somewhere in the store and then bring it to a holding area for pick-up orders.

Once the item is taken from the shelf by the employee at the store, they set the order as being available for pick-up, which triggers this email I received:

Order Ready Email

The interesting part of this email is that it says I have 7 days to pick up my order, otherwise I’ll be refunded and the item will be put back onto the shelf.

Basically, this is a reservation. For more on the Reservation Pattern, check out my post on Avoiding Distributed Transactions with the Reservation Pattern.

Timeline

There are a couple of different scenarios. The first is that I actually drive to the store and pick up the item that I ordered.

The timeline would be that I placed my order, sometime later the order was reserved (taken from the shelf by an employee), and then later I arrived at the store to complete my reservation/order.

Reservation Completed Timeline

The second scenario is that I place my online order, it’s reserved for me, but I just never show up to pick the item up at the store.

Reservation Expired Timeline

Seven days after the order was reserved, the reservation will expire, which means an employee will take the item and put it back on the shelf so someone else can purchase it.

Batch Jobs

If you’re using batch jobs, how would that work? Typically you’ll have a process that runs every day that will look at the state of the database and then determine what to do. In this example, this might be selecting all the orders that are reserved but have not yet been completed.

Batch jobs looking at the state of the database

At this point, the batch job has to do all of the refunds to credit cards and update the database to set the orders as canceled.

The issue with batch jobs is that they are done in a batch. If there are 100 orders that have expired, the batch job iterates over them one by one, performing these actions. What happens if the process fails at order 42 of 100 because of a bug, a failure to connect to the payment gateway, etc.?

Batch jobs failing

Batch jobs, purely based on the name, are not done in isolation. This means the entire batch fails. Typically batch jobs are run during off-peak hours, which is usually the middle of the night. If it fails in the middle of the night, you’re likely to get a page/alert about the failure.
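To picture the failure mode, here’s a sketch of the kind of nightly batch loop being described. The repository and payment-gateway types are hypothetical; the point is that one unhandled exception part-way through stops every remaining order from being processed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public record ExpiredOrder(Guid Id, string PaymentId, decimal Amount);

// Hypothetical abstractions over the database and payment gateway.
public interface IOrderRepository
{
    Task<IReadOnlyList<ExpiredOrder>> GetReservedOlderThan(TimeSpan age);
    Task MarkCancelled(Guid orderId);
}

public interface IPaymentGateway
{
    Task Refund(string paymentId, decimal amount);
}

public class ExpireReservationsBatchJob
{
    private readonly IOrderRepository orders;
    private readonly IPaymentGateway payments;

    public ExpireReservationsBatchJob(IOrderRepository orders, IPaymentGateway payments)
        => (this.orders, this.payments) = (orders, payments);

    // Runs once a night over every expired reservation.
    public async Task Run()
    {
        var expired = await orders.GetReservedOlderThan(TimeSpan.FromDays(7));

        foreach (var order in expired)
        {
            // If this throws at order 42 of 100 (bug, payment gateway down, ...),
            // the loop stops and orders 43-100 are never refunded or cancelled.
            await payments.Refund(order.PaymentId, order.Amount);
            await orders.MarkCancelled(order.Id);
        }
    }
}
```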

Isolation

Rather, what we want is to handle each individual expired order/reservation in isolation. When an order is reserved, we tell the system, as a reminder, to cancel that reservation/order 7 days in the future.

Expire Reservation Future Message

If the customer doesn’t come and pick up the item at the store, then the “expire reservation” reminder will kick in and execute at the appropriate time, which is 7 days from when the order was reserved.

However, if the customer does come and pick up the item at the store, which completes the order, the “expire reservation” reminder will still be triggered, but it won’t need to do anything since the order is complete. It can exit early and not perform any action.

Reservation Completed Timeline

Each Order will have its own “expire reservation”. Each is executed exactly 7 days from when the order was reserved and each will be executed in isolation. If one fails, it doesn’t affect the others.

Delayed Delivery

So how can we technically implement this type of future reminder? You can use a queue that supports delayed delivery.

When sending a message to a queue, you’re also going to provide the queue with a period of time to delay the delivery to a consumer.

Send message to a Queue with Delayed Delivery

This means the queue will not make the message visible to consumers until the delay period has elapsed. Consumers won’t be able to pull the message from the queue, or even see it, until then.

Message not visible during period of time

Once the delayed delivery period has elapsed, the message becomes visible to consumers.

Message visible again

The consumer will then pull the message and process it.

Consumer processes message

What this allows us to do is send a message to be processed in the future.

In the example of our online order, this means that once an Order is reserved and the item is taken from the shelf, we enqueue an “expire reservation” message with delayed delivery of 7 days. After the 7 days, we process that “expire reservation” message. If the order is already completed, the process just exits early. If it hasn’t been completed, it does the credit card refund and updates the database to set the Order as canceled.
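As a sketch of how this might look with a framework that supports delayed delivery (here using NServiceBus’ SendOptions.DelayDeliveryWith; the message types and order store are hypothetical), the handler that reacts to the reservation schedules the future message, and the handler that processes it checks whether there’s still anything to do:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

public class OrderReserved : IEvent { public Guid OrderId { get; set; } }
public class ExpireReservation : ICommand { public Guid OrderId { get; set; } }

// Hypothetical order store abstraction.
public interface IOrderStore
{
    Task<bool> IsCompleted(Guid orderId);
    Task RefundAndCancel(Guid orderId);
}

// When the order is reserved, schedule the "expire reservation" message 7 days out.
public class OrderReservedHandler : IHandleMessages<OrderReserved>
{
    public Task Handle(OrderReserved message, IMessageHandlerContext context)
    {
        var options = new SendOptions();
        options.DelayDeliveryWith(TimeSpan.FromDays(7));
        return context.Send(new ExpireReservation { OrderId = message.OrderId }, options);
    }
}

// Seven days later this runs, in isolation, for this one order only.
public class ExpireReservationHandler : IHandleMessages<ExpireReservation>
{
    private readonly IOrderStore orders;
    public ExpireReservationHandler(IOrderStore orders) => this.orders = orders;

    public async Task Handle(ExpireReservation message, IMessageHandlerContext context)
    {
        if (await orders.IsCompleted(message.OrderId))
            return; // already picked up: nothing to do, exit early

        // Refund the credit card and set the order as cancelled.
        await orders.RefundAndCancel(message.OrderId);
    }
}
```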

Avoiding Batch Jobs

You can avoid batch jobs and have smaller units of work in isolation by telling your system to do something in the future. Leveraging a queue that supports delayed delivery is one example of how you can accomplish this.

The workload will be smoothed out across time and provide isolation so you can handle failure at an individual level of work.

Sidecar Pattern for Abstracting Common Concerns

What is the sidecar pattern? Applications and services often have generic concerns such as health checks, configuration, and metrics, as well as how they communicate with each other, either directly or through messaging. Services usually implement these concerns using libraries or SDKs. How can you share these concerns across all these services so you’re not implementing them in every service? The sidecar and ambassador patterns might be a good fit to solve this problem.

YouTube

Check out my YouTube channel where I post all kinds of content that accompanies my posts including this video showing everything in this post.

Shared Concerns

Regardless of the platform, you’re going to leverage libraries/packages/SDKs to handle common concerns like health checks, configuration, metrics, and more. Each service will have these common concerns and will need to use the libraries for its respective platform.

For example, say you had two services, one written in .NET and the other written in Go. Each service would leverage libraries in its own ecosystem to provide this common functionality.

Shared concerns for different services

Even with libraries, you still need to define how you’re handling and dealing with these common concerns. You may be using the same underlying infrastructure for both of them. As an example, each service might be publishing metrics to CloudWatch.

Wouldn’t it be nice if there were a standardized way for each service to handle these shared concerns?

Sidecar pattern

The sidecar pattern allows you to extract the common concerns from your service and host them in a separate process, known as a sidecar.

sidecar process runs locally to service

In the containerized world, this is often thought of as a separate container from your service; however, it really is just a separate process that runs locally alongside your service.

With a sidecar, your service can now interact with a separate process that will handle the common concerns. The sidecar can perform health checks on your service, your service can send metrics to the sidecar, etc.
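For illustration, the interaction could be as simple as the service talking to a local HTTP endpoint exposed by the sidecar. The endpoint, port, and payload here are entirely hypothetical; the point is that the service only knows about the local sidecar, not the metrics backend behind it.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;

// The sidecar listens on localhost next to the service (hypothetical port and API).
var sidecar = new HttpClient { BaseAddress = new Uri("http://localhost:9000") };

// Send a metric to the sidecar; the sidecar forwards it to CloudWatch (or whatever
// backend it's configured for), handling credentials, batching, and retries for us.
await sidecar.PostAsJsonAsync("/metrics", new
{
    name = "orders_placed",
    value = 1,
    tags = new { service = "ServiceA" }
});
```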

Ambassador Pattern

Back to the example of sending metrics to CloudWatch. With the sidecar, we can also apply the Ambassador Pattern. This makes the sidecar a proxy for sending data to CloudWatch. Our service interacts with the sidecar, using a common API, and the sidecar is in charge of sending the data to CloudWatch. It’s not just metrics and CloudWatch; this applies to any external service.

sidecar as a proxy to external services

The benefit is that the sidecar handles any failures, retries, backoffs, etc. We don’t have to implement all kinds of retry logic in our service; rather, that’s a common concern handled by the sidecar when communicating with external services.

If we want to make an HTTP request to an external service, we proxy it through the sidecar.

sidecar handling retry logic

If there is a transient failure, the sidecar is responsible for doing the retry, while still maintaining the connection to our service as it does so.

Handling transient failures and retries

Abstraction

While each service has its own instance of a sidecar, you can use the same sidecar for many different services. This gives each service the exact same interface to all of the shared concerns the sidecar provides.

This means you can also use it for things like a message broker.

sidecar as abstraction to message broker

If you read any of my other blog posts or watch my videos on YouTube, you know I’m an advocate for loose coupling between services using messages, rather than blocking synchronous request-response.

A sidecar can provide a common abstraction over a message broker. This means that each service doesn’t have to interact with the broker directly, nor does it need to use the specific libraries or SDKs for that broker. The sidecar is providing a common API for sending and consuming messages and is abstracting the underlying message broker.

different services to AMQP supported message broker

This means you could have one service using .NET and another using Python, both exchanging messages via a broker supporting AMQP. Each service would be completely unaware of what the underlying transport and message broker is.
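Again as a purely hypothetical sketch, the service’s view of “publish a message” could just be another call to the local sidecar, which owns the AMQP connection and the broker-specific SDK:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;

// Same local sidecar as before (hypothetical port and API): the service never
// references an AMQP client library or knows which broker is behind it.
var sidecar = new HttpClient { BaseAddress = new Uri("http://localhost:9000") };

// Publish an event through the sidecar's common messaging API.
await sidecar.PostAsJsonAsync("/publish/order-placed", new { orderId = "ABC-123" });
```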

Trade-offs

So why would you want to use the sidecar and ambassador patterns? If you have services that are using different languages/platforms and you want to standardize common concerns. Instead of each service implementing these concerns using the native packages/libraries/SDKs of its respective platform, the sidecar pattern allows you to define a common API that each service uses. This allows you to focus more on what your service actually provides rather than on common infrastructure or concerns.

One trade-off to mention is latency. If you’re using a sidecar with the ambassador pattern to proxy requests to external services, a message broker, etc., you’re going to be adding latency. While this latency might not be much, it’s something to note.

If you don’t have a system that’s composed of many services, or they all use the same language/platform, then a sidecar could just be unneeded complexity. You could leverage an internal shared package/library that each service uses to define a common API for shared concerns; it doesn’t need to be a sidecar. Again, your context matters. Use the patterns when you have the problems these patterns solve.
