Distributed Systems Consistency: Mistake Nobody Warns You About!


One of the most common, and most overlooked, issues when writing a distributed system is consistency. Something happens in one part of your system that should trigger something else in another part, except it never does, and that can be a nightmare to deal with. It's easy to miss and incredibly common, so let's jump into some code as an example.

YouTube

Check out my YouTube channel, where I post all kinds of content accompanying my posts, including this video showing everything in this post.

https://www.youtube.com/watch?v=fiqHpA20AHE


Example

Here's a simple example using NServiceBus. You can run it yourself by grabbing the source code.

This is a straightforward ASP.NET Core controller. I'm generating an ID from a GUID and then using EF Core to create an entity. The entity is really simple: it has that ID and a property called Processed, which we set to false. Then we save that record to our database.

After that, I send MyMessage, which will be processed asynchronously. The controller doesn't wait for that processing; it returns to the client immediately.

NServiceBus will invoke this handler separately, asynchronously, from the code above.

We're going to use the DbContext to pull out the same entity and set the Processed property to true.
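To make the flow concrete, here is a minimal sketch of the controller and handler described above. It is not the NServiceBus sample itself: it's Python, with an in-memory SQLite database standing in for the EF Core DbContext and a plain deque standing in for the queue. The table name and function names are assumptions made for illustration.

```python
import sqlite3
import uuid
from collections import deque

# Assumptions: SQLite stands in for the EF Core DbContext and a deque for the
# queue behind NServiceBus. Table and function names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_entity (id TEXT PRIMARY KEY, processed INTEGER NOT NULL)")
queue = deque()


def create_entity() -> str:
    """Controller action: save the record, send the message, return at once."""
    entity_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back if the block raises
        db.execute("INSERT INTO my_entity (id, processed) VALUES (?, 0)", (entity_id,))
    queue.append({"type": "MyMessage", "id": entity_id})  # a separate step
    return entity_id


def handle_my_message(message: dict) -> None:
    """Handler: find the same entity and mark it as processed."""
    with db:
        db.execute("UPDATE my_entity SET processed = 1 WHERE id = ?", (message["id"],))


def is_processed(entity_id: str) -> bool:
    row = db.execute("SELECT processed FROM my_entity WHERE id = ?", (entity_id,)).fetchone()
    return bool(row[0])


entity_id = create_entity()
processed_before = is_processed(entity_id)  # the handler has not run yet

# Later, and independently of the request, the "bus" delivers the message.
handle_my_message(queue.popleft())
processed_after = is_processed(entity_id)
```

The point to notice is that the save and the send are two separate steps, and the handler runs on its own schedule after the request has already returned.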

In a real application, this comes up whenever you want to process work asynchronously in the background. The example I use with e-commerce is processing the payment asynchronously, separately from placing the order. These are very different concerns; they don't need to run at the same time.

The same thing applies to a user signup; you don't need to send the confirmation email at the exact moment they sign up. That's often done asynchronously. You can think about anything in your system where something needs to happen because something else occurred, like calling a third-party integration. You're often doing this stuff asynchronously.

Now, while that example was very simple, it has the glaring flaw I mentioned at the start.

We’re saving our changes to our database and then sending our message. The assumption here is that these will always happen one after the other, and it will always work. There’s almost this assumption that these are like one atomic operation, but they’re not, and that oversight can significantly impact the consistency of your system.

My example was straightforward; it’s just one after the other. You can imagine that in a real system, there’s a lot more complexity about the logic you’re doing, the state changes you’re making, and the messages you’re sending. When those messages aren’t sent, the other parts of your system have no idea that you made state changes.

In my e-commerce example, when an order is placed and you've got to process the payment, if that message is never sent, you're never going to process the payment. To illustrate that in the sample, all I'm going to do is throw an exception right after we save our state changes because, in reality, you might have a lot more code running at that point. Let's say you get a null reference exception. Who hasn't experienced that?

In the example above, we've persisted data to our database, but we never sent MyMessage. This can have an incredibly negative effect on your system when other parts of it expect that message. Our system is now in an inconsistent state: we've saved data but not sent the relevant message. Consistency in distributed systems is likely one of the biggest issues overlooked by developers new to messaging.
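Here is a rough sketch of that failure, again in Python with SQLite and a deque as stand-ins rather than the original C# sample. The `fail_before_send` flag is a hypothetical switch I've added to simulate the exception between the save and the send.

```python
import sqlite3
import uuid
from collections import deque

# Assumptions: SQLite stands in for the database, a deque for the queue.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_entity (id TEXT PRIMARY KEY, processed INTEGER NOT NULL)")
queue = deque()


def create_entity(fail_before_send: bool) -> str:
    entity_id = str(uuid.uuid4())
    with db:  # the commit happens when this block exits
        db.execute("INSERT INTO my_entity (id, processed) VALUES (?, 0)", (entity_id,))
    if fail_before_send:
        # The row is already committed; raising here cannot undo it.
        raise RuntimeError("simulated failure after saving state")
    queue.append({"type": "MyMessage", "id": entity_id})
    return entity_id


try:
    create_entity(fail_before_send=True)
except RuntimeError:
    pass  # the client sees an error, but the database has already changed

saved_rows = db.execute("SELECT COUNT(*) FROM my_entity").fetchone()[0]
sent_messages = len(queue)
```

After the failed request there is one saved row and zero messages, so nothing downstream will ever learn that the entity exists.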

Solution

So, let's get to a solution for handling consistency in distributed systems, and while doing so, I'll address the elephant in the room.

You might think, well, you shouldn’t even have any of that logic in your controller; that’s bad practice.

I’ll save best practices for this post: Biggest scam in software development? Best practices.

However, you may have a system where you do have logic and persistence in your controllers. So, I'm going to use the exact same example to illustrate the solution because what you should be doing and what people are actually doing in the systems they build are often very different.

You might be thinking that one solution is simply to change the order of these two steps.

Now we send our message first, and then we save our database changes. But this doesn't solve the problem; it just creates a different situation because sending our message and saving our state are not one atomic operation.

Instead, they’re two distinct things. So, by just changing the order, we have a different problem. I’m pushing this message out, saying something’s happened; the reality is it hasn’t happened yet. So you’ve possibly introduced a race condition where now you have a handler that’s going to be invoked possibly before you’ve actually even saved the data if the data ever gets saved at all.
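A small sketch of the reversed order shows the new problem. As before, this is Python with SQLite and a deque as stand-ins, and `fail_before_save` is a hypothetical flag that simulates a failure between the two steps.

```python
import sqlite3
import uuid
from collections import deque

# Assumptions: SQLite stands in for the database, a deque for the queue.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_entity (id TEXT PRIMARY KEY, processed INTEGER NOT NULL)")
queue = deque()


def create_entity_send_first(fail_before_save: bool) -> str:
    entity_id = str(uuid.uuid4())
    # The message goes out first, announcing something that has not happened yet.
    queue.append({"type": "MyMessage", "id": entity_id})
    if fail_before_save:
        raise RuntimeError("simulated failure before saving state")
    with db:
        db.execute("INSERT INTO my_entity (id, processed) VALUES (?, 0)", (entity_id,))
    return entity_id


def handle_my_message(message: dict) -> bool:
    """Returns True only if the entity existed and was updated."""
    with db:
        cursor = db.execute(
            "UPDATE my_entity SET processed = 1 WHERE id = ?", (message["id"],)
        )
    return cursor.rowcount == 1


try:
    create_entity_send_first(fail_before_save=True)
except RuntimeError:
    pass

# The handler receives a message about an entity that was never saved.
handled = handle_my_message(queue.popleft())
saved_rows = db.execute("SELECT COUNT(*) FROM my_entity").fetchone()[0]
```

Even without the simulated failure, the handler could run between the send and the save, which is the race condition described above.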

So what's the solution? Well, it's to have one atomic operation where we save our state and our messages together. We can use our database for this: a database transaction gives us that single atomic operation. Then, as a separate step, we can take the messages we persisted in our database and send them.

To visualize that, when we save our state to our database, within the same transaction, we can serialize our message that we’re trying to send and persist it with our state. It’s one atomic operation; they’re both there.

As I mentioned, the second step is to deserialize the messages we’re trying to send, and then we can push them out to our queues, our topics, etc.

The key part is that if this step fails, we haven’t lost our message; it’s still in our database, and we can retry, pull it back out, and again try to send it back to our queues, topics, etc.
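The two steps can be sketched as follows. This is a simplified illustration in Python with SQLite, not how NServiceBus implements its outbox; the `outbox` table, `dispatch_outbox`, and `broker_down` are names I've made up for the example.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_entity (id TEXT PRIMARY KEY, processed INTEGER NOT NULL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)")
published = []  # stand-in for the real queue or topic


def create_entity() -> str:
    """Step 1: persist the state and the serialized message in one transaction."""
    entity_id = str(uuid.uuid4())
    message = {"type": "MyMessage", "id": entity_id}
    with db:  # both rows commit together, or neither does
        db.execute("INSERT INTO my_entity (id, processed) VALUES (?, 0)", (entity_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(message),))
    return entity_id


def dispatch_outbox(send) -> int:
    """Step 2: send stored messages; delete each row only after its send succeeds."""
    sent = 0
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))  # if this raises, the row stays for a retry
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        sent += 1
    return sent


def broker_down(message: dict) -> None:
    raise ConnectionError("simulated broker outage")


create_entity()
try:
    dispatch_outbox(broker_down)
except ConnectionError:
    pass
pending_after_failure = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

dispatch_outbox(published.append)  # retry once the broker is reachable again
pending_after_retry = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

One consequence of this design worth keeping in mind: if a send succeeds but the delete that follows fails, the message will be sent again on the next pass, so handlers generally need to tolerate duplicates.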

This is the outbox pattern. NServiceBus provides us with a TransactionalSession. I’m going to use it rather than the IMessageSession from the examples above.

The real magic is happening in our ASP.NET Core Filter.

If you're using minimal APIs or ASP.NET Core middleware, you can do the same thing. The filter checks whether the action has an ITransactionalSession parameter, which is what I changed the controller to take, and that's what we get from DI. I open the session and then call the next filter/middleware. Assuming we didn't get any exceptions, we commit.

You’ll also notice in the controller that I took out the call to SaveChangesAsync because we don’t need it anymore. The transactional session we called commit on in our middleware or filter is actually doing it all for us.

Now, with these small changes, we're saving state from our DbContext and our messages as one atomic operation using the outbox pattern.

If there is an exception in the controller action at any point, we will not save the state or ultimately send a message. It doesn’t matter where that exception occurs.
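The open, run, then commit-only-on-success shape of that filter can be sketched with a Python context manager. This is not the NServiceBus TransactionalSession API; `transactional_session`, `action`, and `handle_request` are hypothetical names, and SQLite again stands in for the database and outbox storage.

```python
import json
import sqlite3
import uuid
from contextlib import contextmanager

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_entity (id TEXT PRIMARY KEY, processed INTEGER NOT NULL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)")


@contextmanager
def transactional_session(connection):
    """Hypothetical filter: run the action, commit only if it did not raise."""
    try:
        yield connection
        connection.commit()    # state and outbox rows are committed together
    except Exception:
        connection.rollback()  # nothing written by the action survives
        raise


def action(session, fail: bool) -> None:
    """Controller action: no explicit save; the surrounding session commits."""
    entity_id = str(uuid.uuid4())
    session.execute("INSERT INTO my_entity (id, processed) VALUES (?, 0)", (entity_id,))
    if fail:
        raise RuntimeError("simulated failure inside the action")
    message = {"type": "MyMessage", "id": entity_id}
    session.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(message),))


def handle_request(fail: bool) -> int:
    try:
        with transactional_session(db) as session:
            action(session, fail)
    except RuntimeError:
        return 500
    return 200


def counts() -> tuple:
    entities = db.execute("SELECT COUNT(*) FROM my_entity").fetchone()[0]
    messages = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return entities, messages


failed_status = handle_request(fail=True)
counts_after_failure = counts()
ok_status = handle_request(fail=False)
counts_after_success = counts()
```

The failed request leaves both tables empty, while the successful one leaves exactly one entity and one outbox message, ready for a separate dispatch step to send.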

Distributed Systems Consistency

I suspect one of the issues developers face when writing distributed systems is realizing that when multiple operations occur, they aren't atomic. It's easy to look at the code, assume everything will run without errors, and not realize the impact if there is a failure.

The implications can be very problematic depending on the type of system you’re building. What’s even worse is when they happen infrequently, and you don’t understand what happened, why data is a specific way, or why a message handler didn’t run or process a message and left the system in an inconsistent state.

