Eventual Consistency is a UX Nightmare

Eventual consistency is the term most people use when they read from a different data source than the one they write to. It usually leads to a bad user experience: a user performs some action but then doesn’t see their change reflected in the UI immediately. There can be many reasons for this. You could be using a read replica that is eventually consistent. You could be using Event Sourcing with Projections as your Read Model, which is built asynchronously. Or you could be processing commands asynchronously via a message queue.

YouTube

Check out my YouTube channel where I post all kinds of content that accompanies my posts including this video showing everything that is in this post.

https://www.youtube.com/watch?v=wEUTMuRSZT0&list=PLThyvG1mlMznuNW2tITIGmgQqJikLBqab&index=14

Replication Lag

A very common situation where eventual consistency occurs is replication lag. This happens when you’re using a database with read replicas that are eventually consistent and you perform queries against the read replica rather than the primary database.

You have a client that makes a request to your API to perform some type of command that is going to make a state change.

After the command request has been completed, the client then immediately makes a query request to get data that directly relates to the state change from the command.

The issue is that because the read replica is eventually consistent, the change hasn’t yet been replicated from the primary. This means the query performed by the client gets stale data.

Only after the query has returned stale data does replication occur, bringing the read replica up to date. If the client makes another request, it will get consistent data.

What’s more troublesome is that this is a race condition: sometimes replication will happen before the client’s query. It depends entirely on how quickly the client makes the request and how quickly replication occurs.
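To make the sequence concrete, here is a minimal in-memory sketch of it. The `Primary`, `Replica`, and `replicate` names are made up for illustration and don't correspond to any real database API. Replication is an explicit function call here so the ordering is easy to see; in a real system it runs on its own schedule, which is where the race comes from.

```python
class Primary:
    """Hypothetical primary: accepts writes and remembers unshipped changes."""

    def __init__(self):
        self.data = {}
        self.pending = []  # changes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Replica:
    """Hypothetical read replica: only changes when replication runs."""

    def __init__(self):
        self.data = {}

    def apply(self, changes):
        for key, value in changes:
            self.data[key] = value

    def read(self, key):
        return self.data.get(key)


def replicate(primary, replica):
    # In a real system this happens asynchronously, not when we ask for it.
    replica.apply(primary.pending)
    primary.pending.clear()


primary, replica = Primary(), Replica()
primary.write("display_name", "Old Name")
replicate(primary, replica)

primary.write("display_name", "New Name")  # the command
stale = replica.read("display_name")       # query arrives before replication
replicate(primary, replica)                # replication catches up
fresh = replica.read("display_name")       # a later query sees the change
```

Swap the order of the second `replicate` call and the first read, and the client never notices a problem. That is the race.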

Event Sourcing & Projections

One of the most common questions I get related to CQRS & Event Sourcing is about eventual consistency. This occurs because you will often create projections (read models) that are used for queries. Since projections are built asynchronously, there is a lag between writing to your event stream in your event store and the projection being updated.

In a similar way as above, the client sends a command that makes a state change which results in a new event being appended to your event stream.

The client then immediately calls a query to retrieve data that is related to that state change.

But because our projections (event handlers) haven’t yet processed the event to update our read model (query database), we return stale data to the client.

And just like with replication lag, we have a race condition where the projections may run after that query was made.
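Here is a rough sketch of the same flow with an event stream and a projection. Everything in it is an assumption for illustration: the `EventStore` and `Projection` classes, the `ItemAdded` event shape, and the idea of tracking a `position` checkpoint are not taken from any particular event store.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """Hypothetical append-only stream."""
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)
        return len(self.events)  # the stream version after this append


@dataclass
class Projection:
    """Hypothetical read model built from the stream."""
    position: int = 0  # how many events have been processed so far
    quantities: dict = field(default_factory=dict)

    def catch_up(self, store):
        # Process only the events we haven't seen yet.
        for event in store.events[self.position:]:
            if event["type"] == "ItemAdded":
                sku = event["sku"]
                self.quantities[sku] = self.quantities.get(sku, 0) + event["qty"]
            self.position += 1


store, projection = EventStore(), Projection()

# Command: append a new event to the stream.
version = store.append({"type": "ItemAdded", "sku": "ABC", "qty": 3})

# Query: the projection hasn't processed the event yet.
stale = projection.quantities.get("ABC", 0)

# The projection runs (asynchronously in a real system).
projection.catch_up(store)
fresh = projection.quantities.get("ABC", 0)
```

Comparing `projection.position` with the `version` returned from the append tells you whether the read model has caught up with a given write.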

Async Processing

While not eventual consistency, processing requests/commands asynchronously off a queue can cause a similarly bad user experience.

The client sends a command, but instead of it being handled in-process (blocking), the command is put on a queue.

The client may not be aware that the command wasn’t processed immediately. The client then makes a query, which returns what they believe to be stale data.

However, this isn’t stale data. The command just hasn’t been processed yet because the query won the race against it.

Finally, the message is pulled off the queue and processed and the database is updated.

While this isn’t eventual consistency, it’s a similar UX issue to replication lag and event sourcing projections.
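A small sketch of that sequence, using Python's standard `queue.Queue` as a stand-in for a real message broker. The `send_command` and `handle_next_command` functions and the `status` field are hypothetical.

```python
import queue

commands = queue.Queue()          # stand-in for a message queue
database = {"status": "draft"}    # the single database, no replica involved


def send_command(new_status):
    # The API only enqueues the command and acknowledges it.
    commands.put(new_status)
    return "accepted"


def handle_next_command():
    # A worker would normally do this in the background.
    database["status"] = commands.get_nowait()


ack = send_command("published")   # client sends the command
before = database["status"]       # client queries right away
handle_next_command()             # worker finally processes the message
after = database["status"]        # a later query sees the change
```

The database was never inconsistent here. The client simply asked before the work was done.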

Solutions

In the majority of these cases, this is only a problem for a single user. That user performs a write (command) and then a subsequent read (query) before the data they changed has either been processed or made its way to the read database.

Basically, you’re trying to read your own write, but not in a fully consistent way.

Server Wait

One solution is to have the server block the client until the read database has been updated. When the client sends a command, the server won’t respond until the data has been replicated (or the projection updated, etc.). Once the command completes and the client makes a query, it will not get stale data.
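One way this could look, as a sketch only: the read side exposes the version it has caught up to, and the command handler blocks until that version reaches the one it just wrote. The `ReadStore` class, the version numbers, and the `threading.Timer` used to simulate lag are all assumptions made for the example.

```python
import threading


class ReadStore:
    """Hypothetical read database that tracks the version it has applied."""

    def __init__(self):
        self.version = 0
        self.data = {}
        self._cond = threading.Condition()

    def apply(self, version, key, value):
        with self._cond:
            self.data[key] = value
            self.version = version
            self._cond.notify_all()  # wake up anyone waiting on this version

    def wait_for_version(self, version, timeout):
        with self._cond:
            # Returns True once caught up, False if the timeout expires first.
            return self._cond.wait_for(lambda: self.version >= version, timeout)


def handle_command(read_store, key, value, version, lag=0.05, timeout=2.0):
    # Simulate asynchronous replication that lands after `lag` seconds.
    threading.Timer(lag, read_store.apply, args=(version, key, value)).start()

    # Don't respond to the client until the read side has caught up.
    if not read_store.wait_for_version(version, timeout):
        raise TimeoutError("read store did not catch up in time")
    return version


read_store = ReadStore()
handle_command(read_store, "display_name", "New Name", version=1)
result = read_store.data["display_name"]  # safe to query now
```

The timeout matters. Without it, a stalled replica or projection would hang every command request.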

Client Polling

Instead of having the server wait, you can have the client wait. Once the client performs its write (command), it performs the query, knowing the version of its last read. If the version has not changed, the replication (or projection) has not caught up and the data is still stale. The client waits and then queries again until it gets the updated data.
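A minimal sketch of the polling loop, assuming every query response carries a `version` field and the client remembers the version it last saw. The function name, the response shape, and the retry settings are hypothetical.

```python
import time


def poll_for_newer(query, last_seen_version, attempts=10, delay=0.01):
    """Call `query` until it returns a version newer than the one we had."""
    for _ in range(attempts):
        result = query()
        if result["version"] > last_seen_version:
            return result
        time.sleep(delay)  # still stale, wait a bit and try again
    raise TimeoutError("read model did not catch up")


# Simulated responses: stale twice, then up to date.
responses = iter([
    {"version": 1, "name": "Old Name"},
    {"version": 1, "name": "Old Name"},
    {"version": 2, "name": "New Name"},
])

result = poll_for_newer(lambda: next(responses), last_seen_version=1)
```

Capping the number of attempts keeps the client from polling forever if the read side never catches up.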

Push to Client

Another option is to push to the client once the replication (or projection) has occurred. This could be accomplished with something like WebSockets, where you establish a connection from the client to the server and then have the server notify the client that the replication (or projection) has occurred. At this point, the client can then perform a query to get the latest data.
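The shape of this can be sketched without a real WebSocket. In the example below an `asyncio.Queue` stands in for the open connection, which is purely an assumption to keep the sketch self-contained; the message format and function names are made up too.

```python
import asyncio


async def projection_worker(read_model, connection):
    await asyncio.sleep(0.01)  # simulated projection lag
    read_model["name"] = "New Name"
    # Tell the client the read model is now up to date.
    await connection.put({"type": "projection-updated", "version": 2})


async def client(read_model, connection):
    # Instead of querying right away, wait for the server's push.
    message = await connection.get()
    # Only now perform the query.
    return message["version"], read_model["name"]


async def main():
    read_model = {"name": "Old Name"}
    connection = asyncio.Queue()  # stand-in for a WebSocket connection
    worker = asyncio.create_task(projection_worker(read_model, connection))
    result = await client(read_model, connection)
    await worker
    return result


pushed = asyncio.run(main())
```

The client does no polling at all here. It only queries after it has been told there is something new to read.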

Read from Primary after Write

The last option is to read from the primary after you perform a write. Again, this only applies to the individual user who performed the write (command).

What this means is once the client sends a command, the following query will be directed towards the primary, not the read replica (or query database).

There are different methods to do this. One is simply to use a time window during which all queries are directed to the primary for the client that made a state change. For example, for 5 seconds after a command, all queries from the client that sent the command are directed to the primary.
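A sketch of the time-window approach might look like the following. The `ReadRouter` class, the client ids, and the injectable clock are assumptions for illustration. The 5 second window comes from the example above.

```python
import time


class ReadRouter:
    """Hypothetical router that picks a data source for each query."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._last_write = {}  # client id -> time of that client's last command

    def record_write(self, client_id):
        self._last_write[client_id] = self.clock()

    def choose(self, client_id):
        last = self._last_write.get(client_id)
        if last is not None and self.clock() - last < self.window_seconds:
            return "primary"  # this client wrote recently, avoid stale reads
        return "replica"      # everyone else can tolerate the lag


# A fake clock makes the example deterministic.
now = [100.0]
router = ReadRouter(window_seconds=5.0, clock=lambda: now[0])

router.record_write("alice")     # alice sends a command at t=100
now[0] = 102.0
during = router.choose("alice")  # 2 seconds later: inside the window
other = router.choose("bob")     # bob never wrote anything
now[0] = 106.0
after = router.choose("alice")   # 6 seconds later: window has passed
```

Only the client that made the change pays the cost of hitting the primary. Everyone else keeps reading from the replica.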