
Why Sagas Feel Broken


So, you built an elaborate system with commands, queues, an event-driven architecture, retries, timeouts, and, most importantly, compensating actions all handled within sagas.

But is everything really handled?

Because then you get a call from support. There’s an order where the payment is pending, and it’s been pending for 48 hours. You look into it and see that the payment provider did charge the customer, but your system shows that the payment didn’t go through.

So which is true?

Clearly, the payment provider.

YouTube

Check out my YouTube channel, where I post all kinds of content on Software Architecture & Design, including this video showing everything in this post.

https://youtu.be/MB9KBLVL7Uk

The Mess of Payment Logic

The first and obvious answer is, “Let’s just make it smarter.” We already have different code paths handling different use cases. Maybe we need to add another retry. Maybe, in this example, there was a timeout. The payment provider timed out, we didn’t get a result, but the payment actually did work.

So we think, “We just need to make it smarter and add more logic.”

But that’s exactly how this turns into a mess.

You start with something simple. Capture the payment.

But if there’s a timeout, maybe you need to check again to see if the payment actually went through. Or maybe there’s a payment provider exception. What was the reason for it? Was it a duplicate? Then you need to check the provider because you know you already sent this payment request, or at least you think you did for that exact order.

If it’s an unknown error, maybe you need to flag it for review.

Now you’re handling all these different cases, edge cases, and flows directly in your code. The logic starts spreading everywhere because you’re trying to understand every possible thing that could have happened across a network boundary.
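To make that sprawl concrete, here is a rough sketch of what the “smarter” capture code tends to turn into. Everything in it is hypothetical: `ProviderTimeout`, `ProviderError`, the `ChargedButTimedOut` fake, and the status strings are made-up names for illustration, not any real payment SDK.

```python
class ProviderTimeout(Exception):
    """Hypothetical: the call to the provider timed out."""


class ProviderError(Exception):
    """Hypothetical: the provider rejected the request with a reason code."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def capture_with_inline_guessing(provider, order_id, amount):
    """Every branch below is the caller guessing what happened remotely."""
    try:
        provider.capture(order_id, amount)
        return "captured"
    except ProviderTimeout:
        # Did it time out but still work? Ask again, inline.
        if provider.lookup(order_id) == "succeeded":
            return "captured"
        return "pending"
    except ProviderError as err:
        if err.reason == "duplicate":
            # We think we already sent this request, so check the provider.
            if provider.lookup(order_id) == "succeeded":
                return "captured"
            return "failed"
        # Unknown error: flag it for a human to review.
        return "needs_review"


class ChargedButTimedOut:
    """Hypothetical fake: the charge goes through, but the response is lost."""

    def __init__(self):
        self.charges = {}

    def capture(self, order_id, amount):
        self.charges[order_id] = "succeeded"
        raise ProviderTimeout()

    def lookup(self, order_id):
        return self.charges.get(order_id, "not_found")
```

Three outcomes are already tangled into one function, and every newly discovered failure mode adds another branch at the call site.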

And you might think a saga is the answer to this problem.

Not exactly.

What a Saga Is Actually Good At

A saga is great for coordinating workflow, but it is not the source of truth.

It can tell you:

  • What step am I on?
  • What message should I receive?
  • What command should I send next?

Take a checkout flow. A saga is a great fit there because it makes the workflow explicit. It’s where you understand the process:

  1. An order is placed.
  2. Inventory is reserved.
  3. Payment is captured.
  4. The order is confirmed.

That’s coordination. That’s where sagas fit really well.

But there’s a giant gap here.

The saga is not the point of truth for an external system like your payment provider.

The Missing Piece Is Reconciliation

What you really want alongside a saga is reconciliation.

You want to know:

  • What should be true?
  • What is actually true?
  • What corrective actions or compensating actions can you safely apply?

Anytime you’re dealing with a network boundary to a third party you do not control, you’re going to have uncertainty.

Did the call work?
Did it time out but still work behind the scenes?
Did the provider process the payment while your system never received the response?

This is where reconciliation comes in.

A saga and reconciliation go hand in hand.

A Checkout Flow With a Saga

Here’s an example of a checkout flow in a saga, and it’s incredibly simple, as it should be.

First, we handle the OrderPlaced event. From there, we reserve inventory.

Once inventory is reserved, we handle that event and try to capture the payment. At the same time, we set a timeout for the payment capture.

If the PaymentCaptured event occurs, then we can mark the saga as complete and confirm the order.

But if the timeout happens because we never got the PaymentCaptured event back, then after 15 minutes we send a request for payment reconciliation.
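A minimal, framework-free sketch of that saga might look like the following. Only `OrderPlaced` and `PaymentCaptured` are named in this post; the other message names (`ReserveInventory`, `InventoryReserved`, `CapturePayment`, `PaymentCaptureTimeout`, `ReconcilePayment`, `ConfirmOrder`) are assumptions, and the `sent` and `timeouts` lists stand in for whatever your messaging library uses to send commands and schedule timeouts.

```python
from dataclasses import dataclass, field
from datetime import timedelta

PAYMENT_TIMEOUT = timedelta(minutes=15)  # the 15 minutes from the text


@dataclass
class CheckoutSaga:
    order_id: str
    completed: bool = False
    sent: list = field(default_factory=list)      # commands the saga sent
    timeouts: list = field(default_factory=list)  # timeouts the saga requested

    def handle(self, message: str) -> None:
        if self.completed:
            return  # late or duplicate messages are ignored
        if message == "OrderPlaced":
            self.sent.append("ReserveInventory")
        elif message == "InventoryReserved":
            self.sent.append("CapturePayment")
            # Start the clock on how long we are willing to wait.
            self.timeouts.append(("PaymentCaptureTimeout", PAYMENT_TIMEOUT))
        elif message == "PaymentCaptured":
            self.sent.append("ConfirmOrder")
            self.completed = True
        elif message == "PaymentCaptureTimeout":
            # We stopped waiting; nothing is known to have failed.
            # No compensation here, only a request to reconcile.
            self.sent.append("ReconcilePayment")
```

In this sketch the saga stays open after the timeout, on the assumption that reconciliation will eventually produce an event that completes it. That is a design choice for the example, not something a saga requires.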

A couple things to note.

The timeout does not mean anything failed.

It just means we stopped waiting.

We expected the PaymentCaptured event to arrive within a certain amount of time, say 15 minutes, and it didn’t. So now we need to handle that situation.

Also notice that there are no compensating actions here about reversing inventory or cancelling the order because of a failure.

Because we don’t know that we have a failure.

A Timeout Is Not a Failure

To visualize the flow, the saga sends a capture payment command. That command is received, but now we’re waiting on the third party payment provider.

The payment provider never responds to us.

Eventually, we hit the timeout.

Because the saga requested that timeout, it fires after 15 minutes, and the saga requests payment reconciliation.

That reconciliation process can then ask the payment provider, “Did you actually process this payment?”

And if the payment provider says yes, then we can update our order status because we now know the payment isn’t pending anymore. From the payment provider’s perspective, everything worked.

This isn’t a saga problem.

This is a data drift problem.

It’s about the point of truth.

And that’s where reconciliation comes in.

What Reconciliation Looks Like

Reconciliation should be simple.

We get the order and check its current status. If the order is already confirmed or cancelled, we can exit.

But if it’s still pending, we go to the point of truth. In this case, that’s the payment provider.

We ask the payment provider for the actual payment status.

If the payment was processed successfully by the payment processor or gateway, then we mark the payment as captured and confirm the order.

If the payment actually failed on the provider side, then we mark the payment as failed and cancel the order.

That’s it.

The pattern is:

  1. Know there is drift.
  2. Go to the source of truth.
  3. Compare it against your data.
  4. Apply a safe action.

In this example, the way we knew there was drift was with a timeout. But there are many different ways to trigger reconciliation.

Different Ways to Trigger Reconciliation

A saga timeout can work great, but it doesn’t have to be the only trigger.

It might be something as simple as adding a button that says “Check payment status” so an end user or support person can invoke it themselves.

Polling might not seem like the greatest option, but it might be a good option for your system.

For example, you could have a background service that runs every five minutes, looks in your database for pending payments older than 15 minutes, then iterates through them and requests payment reconciliation.
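As a rough sketch of that polling trigger, assume payments are rows with `order_id`, `status`, and `created_at` fields, and that `request_reconciliation` is a hypothetical callable that sends the reconciliation request. All of those names are assumptions. The scheduling itself is left out; anything that calls `poll_once` every five minutes would do.

```python
from datetime import datetime, timedelta, timezone

PENDING_CUTOFF = timedelta(minutes=15)  # "older than 15 minutes"


def find_stale_pending(payments, now):
    """Return the order ids of payments still pending past the cutoff."""
    return [
        p["order_id"]
        for p in payments
        if p["status"] == "pending" and now - p["created_at"] > PENDING_CUTOFF
    ]


def poll_once(payments, request_reconciliation, now=None):
    """One polling pass: request reconciliation for every stale pending payment."""
    now = now or datetime.now(timezone.utc)
    stale = find_stale_pending(payments, now)
    for order_id in stale:
        request_reconciliation(order_id)
    return stale
```

The job only decides which payments to ask about. What is actually true is still decided by reconciliation against the provider.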

That could be the trigger.

It really can just be polling.

Reconciliation Is Not a Cleanup Job

Reconciliation is not a cleanup job.

It’s about consistency.

I get the sense that people sometimes feel like, “Well, I have to run this job because of a timeout and reconcile with some third party system, but everything should magically always be consistent.”

It won’t be.

There’s nothing wrong with doing reconciliation.

It’s not some dirty cleanup job that means your system is broken.

Sagas are great at coordinating workflows. Reconciliation is about verifying that the state you think the system is in actually matches its real state.

Because there are so many things that can fail.

Trying to shove all these use cases, logic, and edge cases into sagas, or other parts of your system, to handle every possible failure just isn’t going to work.

You don’t need to add all this branching logic every time something gets invoked, like capturing a payment.

You can reconcile.

You can do it safely.

Is something wrong? Yes? Then perform some type of safe action.

Sagas Coordinate. Reconciliation Verifies.

A saga helps coordinate the workflow. It knows what step you’re on, what message you’re waiting for, and what command should happen next.

But it does not know the truth of an external system.

The payment provider does.

So when your system and the provider disagree, don’t keep adding more and more logic into the saga to guess what happened.

Go ask the source of truth.

That’s reconciliation.

And if you’ve felt this pain before, where something in your system is trying to do too much and has all this logic for edge cases, you probably didn’t need more branching.

You probably needed reconciliation.
