
Conversation

@MauriceVanVeen
Member

This PR describes the current read consistencies of JetStream, particularly when using AllowDirect and MirrorDirect, which give low consistency guarantees. For 2.14 we should support more configurability or, in general, higher levels of consistency as opt-in.

This is intended as a replacement for #358, which proposed very manual client-side tracking of received sequences. The desired outcome would be a less manual approach.

This PR is meant to document and start discussion around these topics, specifically asking three high-level questions:

  • What would be expected given different topologies?
  • Should all read requests get higher consistency, or only a few?
  • What should the performance considerations be?

Based on the shared feedback we can more confidently decide what direction to go in.

Signed-off-by: Maurice van Veen <[email protected]>

### Discussion

#### What would be expected given different topologies?
Member Author

Starting a thread for thoughts around this topic:

Location transparency is a big topic for NATS. With that in mind, "it should not matter" where you are connected in the topology. If you need a high consistency read, you shouldn't need to care where you are connected or whether the stream happens to be in the same place.

I'd say that a client connected to a leaf node, connected to a cluster, gatewayed to another cluster that hosts the stream... should also be able to not only write to that stream, but also get a consistent read when it requires it. We shouldn't require the client to be connected directly to the cluster hosting the stream.

connected to the leaf node. Is that okay given the topology, or would the expectation be that the client can always get
the most consistent view of a stream without being "location-specific"?

#### Should all read requests get higher consistency, or only a few?
Member Author

Starting a thread for thoughts around this topic:

On the one hand, having all reads provide a certain consistency level is nice and clear; on the other hand, that makes it all-or-nothing: reads value either availability or consistency, with no in-between.

I'd say there should be an option for a hybrid approach. If a leaf node has a mirror of a given stream and N clients all perform reads, it's fine for these to value availability. But if a single process happens to write to that mirrored stream, with the write going to the origin, then it should also have the option of a consistent read. That would mean a hybrid approach, where client-side you can decide between availability and consistency on a case-by-case basis.

developer will need to decide per process or app which consistency level to use? Is this additional complexity worth the
flexibility?

#### What should the performance considerations be?
Member Author

Starting a thread for thoughts around this topic:

"Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc. Under this mode leader elections would need to trigger way slower, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.

I'd say we take the path taken by others, like etcd, and have reads "go through Raft", as this gives a strict guarantee of linearizable reads when opted in. I realize that reads should be faster than writes, and reads should not generally slow down the throughput of writes. This means a read should not result in an append entry being written into the Raft log, as the throughput would then be shared with writes. Instead, we can optimize this by NOT writing additional data into the log, and "piggybacking" on writes that are happening in parallel. Meaning, if a replicated write happens and the leader has sent it out to its followers but hasn't received quorum yet, then N reads can simply wait for this write to reach quorum, with no need to write additional data into the log. Reads are then simply "queued" after the latest writes, ensuring the least impact while unlocking the highest guarantees.
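To make the piggybacking idea concrete, here is a minimal sketch (purely illustrative, in no way the server's implementation; all names are made up) of queuing reads behind in-flight writes so nothing extra is appended to the Raft log:

```go
// Hypothetical sketch: queue a linearizable read behind the writes that are
// already in flight, instead of appending anything to the Raft log.
// None of these names come from the nats-server code base.
package raftreads

import "sync"

type pendingRead struct {
	waitFor uint64        // log index the read must wait for
	done    chan struct{} // closed once that index has quorum and is applied
}

type readQueue struct {
	mu      sync.Mutex
	applied uint64 // highest index known committed (quorum) and applied
	reads   []*pendingRead
}

// Enqueue registers a read that piggybacks on the newest in-flight write.
// lastIndex is the index of the latest entry the leader has sent to followers.
func (q *readQueue) Enqueue(lastIndex uint64) *pendingRead {
	q.mu.Lock()
	defer q.mu.Unlock()
	r := &pendingRead{waitFor: lastIndex, done: make(chan struct{})}
	if q.applied >= lastIndex {
		close(r.done) // nothing in flight, safe to serve immediately
		return r
	}
	q.reads = append(q.reads, r)
	return r
}

// Advance is called when an index reaches quorum and has been applied.
// All queued reads at or below that index can now be served locally.
func (q *readQueue) Advance(applied uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.applied = applied
	kept := q.reads[:0]
	for _, r := range q.reads {
		if r.waitFor <= applied {
			close(r.done)
		} else {
			kept = append(kept, r)
		}
	}
	q.reads = kept
}
```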


About the leader leases:

"Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc.

I agree on this one. Leases may violate the guarantees, although it would be unlikely if lease duration is chosen in a sensible way.

Under this mode leader elections would need to trigger way slower, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.

I'm not sure I follow. A faulty leader can signal it is stepping down? If a leader wants to step down, it can as well stop serving local reads.

In practice, a lease would be held for a time that is less than the election timeout (and some more to account for max clock drift and latency). We currently use 4 seconds election timeout, which I think is plenty.

Member Author

> I'm not sure I follow. Can a faulty leader signal that it is stepping down? If a leader wants to step down, it can just as well stop serving local reads.

Indeed, a leader that's stepped down, i.e. not a leader anymore, need not serve local reads under the "consistent mode".
The case where a faulty leader can't signal it is stepping down is when the fault itself prevents it. One example is a hard kill, or a network partition where the current leader doesn't know another leader has been elected.

If we were to use leader leases, that would mean we need to ensure a new leader can't be elected before the lease expires. We have a minimum election timeout of 4 seconds, and heartbeats every 1 second (or earlier if a normal append entry gets sent). Would a "lease timeout" of, say, 2 seconds suffice? Meaning, as long as we've got a response from a quorum of servers and the monotonic timestamps we've seen are within this "lease timeout", we can serve a local read?
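A rough sketch of the lease check being asked about here; the 2 second window, the use of monotonic timestamps, and every name below are assumptions taken from this thread, not anything implemented:

```go
// Illustrative only: serve a local read if a quorum of followers has been
// heard from within the lease window, measured on the monotonic clock.
// The 2 second lease and all names are assumptions from this discussion.
package lease

import "time"

const leaseTimeout = 2 * time.Second

// canServeLocalRead reports whether the leader has heard from a quorum of
// servers recently enough that a new leader could not yet have been elected.
// lastAck maps peer ID to the monotonic time of its last heartbeat/append ack.
func canServeLocalRead(lastAck map[string]time.Time, clusterSize int) bool {
	now := time.Now() // time.Time carries a monotonic reading
	fresh := 1        // count the leader itself
	for _, t := range lastAck {
		if now.Sub(t) <= leaseTimeout {
			fresh++
		}
	}
	return fresh > clusterSize/2
}
```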


A partition or a hard kill would prevent sending a step-down request regardless of whether leases are used or not. The system has to rely on election timeouts anyway.

If one can implement leases while keeping the same election timeout we have now, it won't make a difference. Keep in mind that even if you renew a lease every second or so, you can serve lots and lots of local reads in that time.
I'm not necessarily pushing for leases, just want to put it in the right perspective.


About reads through Raft:

> Instead, we can optimize this by NOT writing additional data into the log

It is a good idea to optimize reads by not logging any additional data. However, piggybacking reads onto writes would not perform well in read-heavy workloads (I suppose that KV workloads could be read heavy, for example).

I think it is possible to avoid logging read requests. I would do it like this: when a read request is received, the leader marks the request with the current value of its commit index, that is, the index at which we want to read. Then the leader sends heartbeats to the followers and waits for responses from a majority. The responses confirm that the leader is in fact still leader, and we can serve the read as soon as the read request's index has been applied in the upper layer. The Raft paper also hints at such a solution, towards the end of section 8 in https://raft.github.io/raft.pdf:

> Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.

I think we already have the first requirement in place. We would need to implement the second part.
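A sketch of that second part, the read-index style check described above (the interface and all names are hypothetical, not the nats-server Raft API):

```go
// Sketch of the read-index approach from section 8 of the Raft paper:
// record the commit index, confirm leadership with a heartbeat round,
// then serve once the apply index catches up. Names are hypothetical.
package readindex

import "errors"

type node interface {
	CommitIndex() uint64            // current commit index on the leader
	AppliedIndex() uint64           // highest index applied to the upper layer
	BroadcastHeartbeat() (acks int) // heartbeat round, returns follower acks
	ClusterSize() int
	WaitApplied(index uint64) // blocks until AppliedIndex() >= index
}

var errNotLeader = errors.New("leadership not confirmed")

// linearizableRead returns the index at which it is safe to serve the read,
// without writing anything into the Raft log.
func linearizableRead(n node) (uint64, error) {
	readIndex := n.CommitIndex() // 1. remember where we want to read from

	// 2. confirm we are still leader by exchanging heartbeats with a
	//    majority (the +1 counts the leader itself).
	if n.BroadcastHeartbeat()+1 <= n.ClusterSize()/2 {
		return 0, errNotLeader
	}

	// 3. wait until the state machine has applied up to readIndex; the
	//    local read is then linearizable.
	n.WaitApplied(readIndex)
	return readIndex, nil
}
```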

Member Author
@MauriceVanVeen Nov 12, 2025

> A partition or a hard kill would prevent sending a step-down request regardless of whether leases are used or not. The system has to rely on election timeouts anyway.

Indeed, just mentioning it because the current leader will not step down based on election timeouts, but does its own lostQuorumCheck and has its own lostQuorumInterval. No big deal for leases, but it means the current server will think it's the leader for longer than the 4 second minimum election timeout.

Member Author

> when a read request is received, the leader marks the request with the current value of its commit index, that is, the index at which we want to read. Then the leader sends heartbeats to the followers and waits for responses from a majority.
> ...
> I think we already have the first requirement in place. We would need to implement the second part.

Indeed, we'd only need the second part.

When I mentioned "piggybacking on writes", that assumed there are sufficient writes to piggyback on. If writes are actively happening, there's no need for an additional heartbeat message to be sent. But if the workload is sufficiently read heavy that there are no active writes, a heartbeat would need to be used instead, exactly as you're describing.

I believe we're fully aligned here then 😄


Ahh OK, I initially misunderstood your proposal. All clear now.

| `AllowDirect` enabled. | `$JS.API.DIRECT.GET.<stream>` | If the stream is replicated, the followers will also answer read requests. | Higher availability read responses but with lower consistency. A read request will be randomly served by a server hosting the stream. Recently written data is not guaranteed to be returned on a subsequent read request. |
| `MirrorDirect` enabled on mirror stream. | `$JS.API.DIRECT.GET.<stream>` | If the stream is mirrored, the mirror can also answer read requests. For example a mirror stream in a different cluster or on a leaf node. | Higher availability with potential of fast local read responses but with lowest consistency. Mirrors can be in any relative state to the source. |

Additionally, if a stream is replicated and a consumer is created, there is no guarantee that the consumer can


This example feels a little ambiguous, and it is not clear if you are describing an actual problem or not.

Member Author

It is an actual problem. Imagine a replicated object store where you write a new file. A watcher (using a consumer) will be notified of this file, and a new consumer could be created to read the chunks of that just-written file. However, since there are no guarantees about what the consumer might see, the consumer could get only halfway through the file before the server says "no more pending", because it's a stream follower that doesn't have all the writes yet.
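For illustration, the simplest form of this staleness can be shown with the KV API, whose reads can be served by any replica when AllowDirect is enabled on the bucket's stream; the bucket and key names below are placeholders:

```go
// Sketch of the stale-read window the current behavior leaves open.
// Bucket and key names are placeholders; error handling elided for brevity.
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL)
	defer nc.Close()

	js, _ := nc.JetStream()
	kv, _ := js.KeyValue("ORDERS") // assume an R3 bucket with AllowDirect enabled

	rev, _ := kv.Put("order.1", []byte("paid")) // write goes to the leader

	// The read may be answered by any replica; a follower that has not yet
	// applied the write can return the previous revision (or not found).
	entry, err := kv.Get("order.1")
	if err != nil || entry.Revision() < rev {
		fmt.Println("stale read: this replica has not caught up yet")
	}
}
```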

above)

Linearizable reads would be desirable, but a minimum would be to opt in to at least session-level guarantees such as
reading your own writes and monotonic reads.


Would it make sense to have an additional section for session guarantees? If I remember correctly, you implemented session guarantees, but it introduced some complexity. So I wonder if it would make sense to document your attempt and your conclusions. And maybe we could think of alternative ways to achieve session guarantees. Or maybe you did that somewhere else already?

Member Author

That prior attempt and design is included here: #358

I'm tempted to solely focus on fully linearizable reads though, and not "intermediate" session guarantees. Mostly because there are use cases involving multiple processes that require linearizable reads, which can't be solved by (single-process/client) session guarantees.

If high availability is valued, then the current Direct Get API could still be used while high consistency read requests
could be served by the stream leader only. Would such a hybrid approach even be desirable, given that now the app
developer will need to decide per process or app which consistency level to use? Is this additional complexity worth the
flexibility?
Collaborator

As discussed during the previous attempts, it's critical that one should be able to know that all access to a certain stream is done safely. That is, if the stream is a financial wallet, the stream should be able to enforce that all access to it must be done in a consistent manner.

That does not mean it always has to be that way; we can have a mode where a hybrid approach is used and both consistent and eventually consistent reads are allowed.

But it must be possible to configure a stream to only accept consistent access and reject other attempts against it.

Member Author

> ... one should be able to know that all access to a certain stream is done safely

Agree, and that could perhaps be achieved by a stream setting like `ReadConsistency: high`, where all reads are ensured to be linearizable.

The hybrid could be an opt-in on a per-request basis to enable high consistency for that particular request, even if the stream is not enforcing `ReadConsistency: high` because it's set to `weak`, for example.

> ... the stream should be able to enforce that all access to it must be done in a consistent manner.

The only doubt I have is around "what if somebody makes a mirror of that stream on a separate cluster/leaf?".
The act of adding a mirror will violate this guarantee, regardless of the source having `ReadConsistency: high`, since the mirror can't know the source was configured that way.
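As a purely hypothetical sketch of how those two knobs could be expressed (nothing below exists today; the field names are made up for discussion):

```go
// Purely hypothetical sketch of the two opt-in levels being discussed here;
// neither the ReadConsistency stream field nor the per-request flag exists.
package sketch

// ReadConsistency would be a stream-level setting enforced by the server.
type ReadConsistency string

const (
	ReadConsistencyWeak ReadConsistency = "weak" // current Direct Get behavior
	ReadConsistencyHigh ReadConsistency = "high" // all reads linearizable, enforced
)

// StreamReadConfig is the stream-level knob: with "high" the stream would
// reject any read path that cannot guarantee linearizability.
type StreamReadConfig struct {
	ReadConsistency ReadConsistency `json:"read_consistency,omitempty"`
}

// DirectGetRequest shows the per-request opt-in: even on a "weak" stream a
// single request could ask for a linearizable answer from the leader.
type DirectGetRequest struct {
	LastBySubject string `json:"last_by_subj,omitempty"`
	Linearizable  bool   `json:"linearizable,omitempty"` // hypothetical field
}
```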

Collaborator

Yes, we discussed that we will reasonably only be able to have in-cluster consistency. If possible, we might say a high consistency stream cannot have a MirrorDirect-enabled mirror?

> The hybrid could be an opt-in on a per-request basis

This is exactly what we want to avoid when supporting a mode where the stream is in a strict high consistency mode only.

Member Author

> This is exactly what we want to avoid when supporting a mode where the stream is in a strict high consistency mode only.

Indeed, if the stream is in strict high consistency mode only, there's no way to disable that.
I meant that if the stream is in weak consistency mode, you can still opt in to high consistency on a per-request basis when required.

> If possible, we might say a high consistency stream cannot have a MirrorDirect-enabled mirror?

Not sure if this is possible. Mirrors can be made on a leaf node without even having a connection to what they mirror from. Especially if we allow updating the `ReadConsistency` value, there's no way to enforce that a mirror isn't created against it.

Collaborator

> Indeed, if the stream is in strict high consistency mode only, there's no way to disable that.

Good stuff.

Re mirrors, that's what I expected, so it might be a combo of documentation and, for example, not allowing MirrorDirect to be set if high consistency is enabled on the mirror, or something similar (just an example, probably a bad one). The idea is to have some way to at least communicate that this combination is weird and shouldn't be allowed.

@jpsugar-flow
Contributor

Is something like logical timestamps[1][2] in scope? They provide the ability for replicas to respond to queries causally with only a short delay.

[1] https://cse.msu.edu/~sandeep/publications/wss01/main.pdf
[2] https://cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf

@sciascid

> Is something like logical timestamps[1][2] in scope? They provide the ability for replicas to respond to queries causally with only a short delay.
>
> [1] https://cse.msu.edu/~sandeep/publications/wss01/main.pdf
> [2] https://cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf

@jpsugar-flow We have not considered logical timestamps at this point, nor causal consistency. So thanks for the feedback and suggestions!

The problem I see with [1] is that the system model makes strong assumptions on both clock drift and message delays (see section 2, guarantees G1 and G2), requiring careful tuning (section 6) for the system to work well. I think this could potentially be a problem, given the diverse environments in which NATS can be deployed. For these reasons, I'm not sure [1] would be a good fit for NATS. And [2] seems to be based on the hybrid clock design of [1], but notice that it aims to solve a more general problem: the short paper proposes an alternative to TrueTime, which is used for snapshot reads on multi-partition distributed databases (multi-value / multi-partition read-only transactions). I'm not sure this applies here directly.

@jpsugar-flow
Contributor

Thanks for the reply! The clock guarantees are a problem for some users, but I think they can (with some caveats) be satisfied by users in major cloud databases.

If there's general interest I could start a discussion thread somewhere to not spam up this one?


@MauriceVanVeen marked this pull request as ready for review on December 5, 2025 10:20
@MauriceVanVeen
Member Author

This ADR PR is updated with a concrete design, marking as ready for review.
TL;DR

  • A new AllowDirectLeader stream setting, allowing the leader to respond to DirectGet requests, even if AllowDirect is disabled. (Currently MsgGet requests are used, but these have higher overheads, so this can then use the lighter-weight DirectGet request to the leader instead)
  • A new linearizable read API: $JS.API.DIRECT_LEADER.GET.<stream>. Reads go only/directly to the stream leader, no matter where you're connected, and the stream leader (if replicated) will guarantee linearizability since it "goes through Raft".
  • Clients get per-request and per-object (stream/KV/ObjectStore) opt-ins to linearizable reads. The per-object opt-in allows the user to specify "all requests require linearizability" in one place instead of spread over every request for a given stream 'object'.
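As a rough sketch of what using the proposed API could look like from a client, assuming the request payload mirrors the existing Direct Get API (the stream and subject names are placeholders):

```go
// Sketch of a linearizable read against the proposed DIRECT_LEADER API,
// assuming its request payload mirrors the existing Direct Get API.
// Stream and subject names are placeholders.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, _ := nats.Connect(nats.DefaultURL)
	defer nc.Close()

	// Proposed subject from this ADR: the read is routed to the stream
	// leader regardless of where the client is connected.
	subj := "$JS.API.DIRECT_LEADER.GET.ORDERS"
	req := []byte(`{"last_by_subj":"orders.1"}`)

	msg, err := nc.Request(subj, req, 2*time.Second)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	// Direct Get style replies carry the message body plus headers such as
	// Nats-Stream, Nats-Subject and Nats-Sequence.
	fmt.Printf("leader answered: %s (seq %s)\n", msg.Data, msg.Header.Get("Nats-Sequence"))
}
```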

@MauriceVanVeen
Member Author

(I imagine the 'Discussion' section will need to be removed prior to merge to not clutter the final doc)
