
Distributed Tracing to discover a Distributed BIG BALL of MUD

Distributed tracing is great for observing how a request flows through a set of distributed services. However, it can also be used as a band-aid to mask a problem that you shouldn’t have.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

Distributed Tracing

So why is distributed tracing helpful? When you’re working within a Monolith, you generally have an entire request processed within the same process. This means if there is a failure of any sort, you can capture the entire stack trace.

Monolith

When you’re in a large system decomposed into a set of interacting distributed services, it can be very difficult to track down where an exception is occurring. It can also be difficult to know the latency or processing time of the entire request and which service-to-service calls are the bottleneck.

As an example, if a client makes a request to ServiceA, it may need to call other services, which in turn may make calls to yet other services.

Distributed Monolith

With distributed tracing, you could see the flow of a request that passes through multiple distributed services. To illustrate, here’s a timeline of the diagram above.

Distributed Monolith Timeline

So it’s great that distributed tracing gives us a way of observing a request’s flow. The problem is that this style of service-to-service communication leads to another set of challenges beyond tracing.

Distributed tracing in this service-to-service system style is a band-aid for a problem you shouldn’t have. Blocking synchronous calls, such as HTTP, from service to service can introduce issues with latency, fault tolerance, and availability, all because of temporal coupling. Check out my post REST APIs for Microservices? Beware!, which dives deeper into this topic.

Blocking Synchronous Calls

However, not all blocking synchronous calls can be avoided. Specifically, any type of query, such as a request from a client, will naturally be a blocking synchronous call. If you’re doing any type of UI composition, you may choose to use a BFF (Backend for frontend) or API gateway to do this composition. The BFF makes synchronous calls to all services to get data from each to compose a result for the client.

UI Composition

Distributed tracing in this situation is great! We’ll be able to see which services have the longest response time because, ultimately, if we are making all calls from the BFF to the backing services concurrently, the slowest response determines the total execution time seen by the client.

UI Composition Distributed Tracing Timeline
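To make the timing concrete, here’s a minimal sketch of concurrent BFF composition using Python’s asyncio. The service names and response times are made up for illustration; the point is that total latency tracks the slowest backing call, not the sum of all of them.

```python
import asyncio
import time

# Hypothetical backing services with different simulated response times.
async def product_service():
    await asyncio.sleep(0.05)   # 50 ms
    return {"name": "Widget"}

async def pricing_service():
    await asyncio.sleep(0.20)   # 200 ms -- the slowest call
    return {"price": 9.99}

async def inventory_service():
    await asyncio.sleep(0.10)   # 100 ms
    return {"in_stock": 42}

async def bff_compose():
    # All calls run concurrently, so total latency is roughly the slowest
    # service's response time, not the sum of all three.
    product, pricing, inventory = await asyncio.gather(
        product_service(), pricing_service(), inventory_service()
    )
    return {**product, **pricing, **inventory}

start = time.perf_counter()
view_model = asyncio.run(bff_compose())
elapsed = time.perf_counter() - start
print(view_model)   # composed result returned to the client
print(f"{elapsed:.2f}s")  # roughly 0.2s, dominated by the slowest service
```

A distributed trace of this request would show three overlapping spans under the BFF span, with the pricing call visibly stretching the total.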

Workflow

Another great place for distributed tracing is with asynchronous workflows. It has always been very challenging to see the flow of a request executed asynchronously by passing messages via a message broker. Distributed tracing solves that and allows us to visualize that flow.

As an example, the client requests the initial service to perform a command/action.

Start Workflow

The service will then create a message and send it to the message broker for the next service to continue the workflow.

Send Message to Broker

Another service will pick up this message and perform whatever action it needs to take to complete its part of the entire workflow.

Consume Message from another service

Once the second service has completed processing the message, it may send another message to the broker.

Continue workflow by sending another message to the broker

A third service (ServiceC) might pick up that message from the broker and perform some action that is a part of this long-running workflow. And just like the others, it may send a message to the broker once it’s complete.

Consume message from a final service

At this point, ServiceA, which started the entire workflow, may consume the last message sent by ServiceC to do some finalization of the entire workflow.

Initial service completes workflow

Because this entire workflow was executed asynchronously, removing the temporal coupling, the services don’t all have to be online and available at the same time. Each service consumes and produces messages at its own rate and availability without causing the entire workflow to fail.

Distributed Tracing timeline
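To sketch how a tracing backend can stitch such an asynchronous workflow into one trace, here’s a simplified Python illustration (not a real tracing API; all names are invented): each message carries the trace context in its headers, and every consuming service records a span that links back to the producer’s span.

```python
import uuid

broker = []          # stand-in for the message broker queue
recorded_spans = []  # stand-in for the tracing backend (e.g. Zipkin)

def send(message, trace_id, parent_span_id):
    # The trace context travels with the message as headers.
    broker.append({"body": message,
                   "headers": {"trace_id": trace_id,
                               "parent_span_id": parent_span_id}})

def handle(service_name, next_message=None):
    msg = broker.pop(0)
    ctx = msg["headers"]
    span_id = uuid.uuid4().hex[:8]
    # Record a span that links back to the producer's span via parent_span_id.
    recorded_spans.append({"service": service_name, "span_id": span_id,
                           "trace_id": ctx["trace_id"],
                           "parent": ctx["parent_span_id"]})
    if next_message:
        # Continue the workflow, propagating the same trace_id onward.
        send(next_message, ctx["trace_id"], span_id)

# ServiceA starts the workflow and sends the first message.
trace_id = uuid.uuid4().hex
send("PlaceOrder", trace_id, parent_span_id="root")
handle("ServiceB", next_message="OrderBilled")
handle("ServiceC", next_message="OrderShipped")
handle("ServiceA")  # ServiceA finalizes the workflow

print(len(recorded_spans), "spans recorded for trace", trace_id[:8])
```

Because every span shares one trace_id and each span points at its parent, the backend can render the whole asynchronous workflow as a single timeline, even though no service ever called another directly.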

OpenTelemetry & Zipkin

I’ve created a sample app that uses OpenTelemetry with NServiceBus for an asynchronous workflow that can then be visualized with Zipkin. If you want access to the full source code example, check out the YouTube Membership or Patreon for more info.

As an example with ASP.NET Core, I’ve added the OpenTelemetry packages and registered them in the ConfigureServices method of the Startup class. This adds tracing for NServiceBus, any calls using the HttpClient, and ASP.NET Core itself.

With NServiceBus I have a saga that is orchestrating sending commands to various logical boundaries to complete the workflow.

After running the sample app, I can open up Zipkin and see the entire trace that spans my ASP.NET Core app as it goes through the various logical boundaries, including the database calls to SQL Express and the HTTP call to Fedex.com.

Distributed Tracing

Distributed tracing is great for collecting data and observing the flow of a long-running business process, or if you’re doing UI composition using synchronous request/response involving many different services. However, don’t use it as a crutch because a pile of service-to-service synchronous requests/responses is proving difficult to manage. If anything, use distributed tracing to realize you have a highly coupled distributed monolith, so you can remove some of the temporal coupling, making your system more loosely coupled and resilient.

Join!

Developer-level members of my YouTube channel or Patreon get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out the YouTube Membership or Patreon for more info.

You also might like

Follow @CodeOpinion on Twitter

Software Architecture & Design

Get all my latest YouTube Videos and Blog Posts on Software Architecture & Design

Fintech Mindset to Software Design

If you’re creating line-of-business or enterprise-type software, I think one of the most valuable skills you can have isn’t technical. Rather, it’s understanding how the business domain you are in works. One way is following how money flows through a system by having a fintech mindset.


Revenue & Cost

I was on the Azure DevOps Podcast, where I mentioned that a big influence on my career was working with an Accountant. No surprise, this has affected how I look at line-of-business and enterprise-type systems, how they can be decomposed, and how money flows through the system.

At a high level, I’m thinking about Revenue and Cost.

For context, I’m going to talk about project-based work to illustrate.

How does the company generate revenue? How does the software system we are building fit into how the company generates revenue?

How does the company incur costs to generate that revenue? How do we keep track of those costs in the system, similar to how we track revenue?

We generally have customers, sales, and invoicing on the revenue side.

Revenue Side

There is some type of sales process involved with a CRM that ultimately leads to a sale, after which we invoice the customer to get paid.

On the cost side, we may have employees on a salary or working per hour. We pay them at some interval (weekly, bi-weekly, monthly) based on their salary or hours worked (timesheet).

Employee Cost Side

Other costs can occur, such as materials or services from outside vendors. We will create purchase orders and get receipts.

Vendor Cost Side

Follow the Money

When thinking about decomposing a system and understanding a domain, I find it helps to follow the money. This can lead to understanding the workflows and processes that generate revenue and cost.

In the case of project-based work, the heart is in the execution: the actual project. There are many ways to “sell” a project, such as Time & Materials, Fixed Price, etc. The same goes for costs; some employees might be salaried (fixed), some may be hourly.

Regardless, the point is the execution of the project is where the complexity lies. This is likely to be the heart of the system you’re building.

Execution

For example, if you’ve worked on a software project before or done any project management, you can relate to this being where the complexity lies.

So if you were writing a system to manage the full lifecycle of project-based work, the execution of the project would be the core. Many boundaries would act purely in a supporting role, such as CRM, Employee, Payroll, Vendors, and Invoicing. You don’t necessarily need to write your own; this functionality is usually pretty generic, and you can buy it off the shelf or as software-as-a-service.

And this makes sense because you’re not about to write a CRM or Accounting system. You can integrate with those to push/pull the relevant data required.

On the other hand, Timesheets, Sales, and Purchasing may be something you choose to write because there isn’t anything specific enough for the context/niche of what you’re building.

Core and Supporting Boundaries

Fintech Mindset

I’ve written software for Distribution, Accounting, Manufacturing, and various forms of Transportation, and they all have this in common. Have a fintech mindset by understanding revenue and costs and how each is generated.

Here’s a screenshot of QuickBooks, an accounting system typically used for small businesses. It gives a good example of how everything is related and how money flows.

Quickbooks

In writing line-of-business and enterprise-type systems, I’ve always found the best developers are the ones who have great technical ability as well as equally strong domain knowledge. Having a fintech mindset and understanding how money flows helps.


Event Carried State Transfer: Keep a local cache!

What’s Event Carried State Transfer, and what problem does it solve? Do you have a service that requires data from another service? You’re trying to avoid making a blocking synchronous call between services because this introduces temporal coupling and availability concerns? One solution is Event Carried State Transfer.


Temporal Coupling

If you have a service that needs to get data from another service, you might just think to make an RPC call. There can be many reasons for needing data from another service. Most often, it’s for query purposes to generate a ViewModel/UI/Reporting. If you need data to perform a command because you need data for business logic, then check out my post on Data Consistency Between Services.

The issue with making an RPC call and the temporal coupling that comes with it is availability. If we need to make an RPC call from ServiceA to ServiceB, and ServiceB is unavailable, how do we handle that failure, and what do we return to the client?

Service to Service

We want ServiceA to be available even when ServiceB is unavailable. To do this, we need to remove the temporal coupling so we don’t need to make this RPC call.

This means that ServiceA needs all the data to fulfill the request from the client.

Service has all the data within its own boundary

Services should be independent. If a client makes a request to any service, that service should not need to make a call to any other service. It has to have all the data required.

Local Cache

One way to accomplish this is to be notified asynchronously via an event when data changes within a service boundary. The event allows you to call the owning service back to get its latest data/state and then update your own database, which acts as a local cache.

For example, if a Customer were changed in ServiceB, it would publish a CustomerChanged event containing the Customer ID that was changed.

Publish Event

When ServiceA consumes that event, it would then do a callback to ServiceB to get the latest state of the Customer.

Consume and Callback Publisher

This allows us to keep a local cache of data from other services. We’re leveraging events to notify other service boundaries that the state has changed within a service boundary. Other services can then call the service to update their local cache.
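A minimal sketch of this notify-then-callback pattern, with hypothetical names: the event carries only the CustomerID, and the consumer calls the owning service back for the latest state before updating its local cache.

```python
local_cache = {}  # ServiceA's local copy of customer data it doesn't own

def fetch_customer_from_service_b(customer_id):
    # Placeholder for the RPC/HTTP callback to ServiceB, the data owner.
    return {"id": customer_id, "name": "Jane Doe", "status": "active"}

def on_customer_changed(event):
    # The event is just a notification -- it carries the ID, not the state.
    customer_id = event["customer_id"]
    latest = fetch_customer_from_service_b(customer_id)  # callback to publisher
    local_cache[customer_id] = latest

on_customer_changed({"type": "CustomerChanged", "customer_id": 123})
print(local_cache[123]["name"])  # "Jane Doe"
```

Note that every consumed event triggers a callback, which is exactly where the extra load on the publisher (and the lingering availability concern) comes from.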

The downside to this approach is that you could be receiving/accepting a lot of requests for data from other services if you’re publishing many events. From the example, ServiceB could have an increased load handling the requests for data.

You’re still dealing with availability, however. If you consume an event and then make an RPC call to get the latest data, the service may not be available or responding. And as with any cache, the data is going to be stale.

Callback Failure/Availability

Event Carried State Transfer

Instead of having these callbacks to the event’s producer, the event contains the state. This is called Event Carried State Transfer.

If all the relevant data related to the state change is in the event, then ServiceA can simply use the data in the event to update its local cache.

Event Carried State Transfer

There are three key aspects to Event Carried State Transfer: Immutable, Stale, and Versioned.

Events are immutable. When they were published, they represented the state at that moment in time. You can think of them as integration events. They are immutable because you don’t own any of the data. Data ownership belongs to the service that’s publishing the event. You just have a local cache. And as mentioned earlier, you need to expect it to be stale because it’s a cache.

Versioning

There must be a version that increments within the event that represents the point in time when the state was changed. For example, if a CustomerChanged event was published for CustomerID 123 multiple times, even if you’re using FIFO (first-in-first-out) queues, that does not mean you’ll process them in order if you’re using the Competing Consumers Pattern.

Competing Consumers

When you consume an event, you need to know that you haven’t processed a more recent version already. You don’t want to overwrite with older data.
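Here’s a small sketch of such a version guard (field names are illustrative): the event carries the full state plus a version, and the consumer ignores any event older than what its local cache already holds.

```python
local_cache = {}  # customer_id -> {"version": ..., "data": ...}

def on_customer_changed(event):
    customer_id = event["customer_id"]
    cached = local_cache.get(customer_id)
    # With competing consumers, events can arrive out of order; never let an
    # older version overwrite a newer one.
    if cached and cached["version"] >= event["version"]:
        return  # stale event -- ignore it
    local_cache[customer_id] = {"version": event["version"],
                                "data": event["data"]}

# Version 2 arrives before version 1 (out-of-order delivery).
on_customer_changed({"customer_id": 123, "version": 2,
                     "data": {"name": "Jane D."}})
on_customer_changed({"customer_id": 123, "version": 1,
                     "data": {"name": "Jane"}})

print(local_cache[123])  # still version 2 -- the older event was discarded
```

The guard makes event handling idempotent for replays of the same version, too, which is useful since most brokers only guarantee at-least-once delivery.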

Check out my post Message Ordering in Pub/Sub or Queues and Competing Consumers Pattern for Scalability.

Data Ownership

So what type of data would you want to keep as a local cache updated via Event Carried State Transfer? Generally, reference data from supporting boundaries. Not transactional data.

Because reference data is non-volatile, it fits well for a local cache. This type of data isn’t changing often, so you’re not as likely to be concerned with staleness.

Transactional data, however, I do not see as a good fit. Generally, transactional data should be owned and contained within a service boundary.

As an example, consider an online checkout process. When the client starts the checkout process, it makes requests to the Ordering service.

Start Checkout Process

The client then needs to enter their billing and credit card information. This information isn’t sent to the Ordering service but directly to the Payment service. The Payment service stores the billing and credit card information against the ShoppingCartID.

Payment Information

Finally, the order is reviewed, and to complete the order, the client requests the Ordering service to place the order. At this point, the Ordering service would publish an OrderPlaced event containing only the OrderID and ShoppingCartID.

Place Order

The Payment service would consume the OrderPlaced event and use the ShoppingCartID within the event to look up the credit card information within its database so it can then call the credit card gateway to make the credit card transaction.

Consume OrderPlaced and Process Payment
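A rough sketch of that last step, with hypothetical names and a stubbed gateway call: the Payment service already holds the card data keyed by ShoppingCartID, so the OrderPlaced event needs only the IDs rather than carrying the card details.

```python
# The Payment service's own database, populated earlier in the checkout.
payment_db = {"cart-42": {"card": "4111-****-****-1111", "billing": "Jane Doe"}}
charges = []

def charge_gateway(card, order_id):
    # Placeholder for the real credit card gateway call.
    charges.append({"order_id": order_id, "card": card})
    return True

def on_order_placed(event):
    # Look up the card info stored during checkout, keyed by ShoppingCartID.
    payment_info = payment_db[event["shopping_cart_id"]]
    charge_gateway(payment_info["card"], event["order_id"])

on_order_placed({"type": "OrderPlaced", "order_id": "order-7",
                 "shopping_cart_id": "cart-42"})
print(charges[0]["order_id"])  # "order-7"
```

The design choice here is that sensitive transactional data was directed to the boundary that owns it, so the event never has to transfer it.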

Event Carried State Transfer

Event Carried State Transfer is a way to keep a local cache of data from other service boundaries. It works well for reference data from supporting boundaries that isn’t changing often. However, be careful about using it with transactional data, and don’t force event carried state transfer where you should instead be directing data to the appropriate boundary.
