adr/ADR-56.md (159 additions, 0 deletions)

| Revision | Date | Author | Info |
|----------|------------|-----------------------------|---------------------------------------------------|
| 1 | 2025-09-12 | @ripienaar, @MauriceVanVeen | Initial document for R1 `async` persistence model |
| 2 | 2025-10-28 | @MauriceVanVeen | Add read consistencies |
| 3 | 2025-12-05 | @MauriceVanVeen | Add design for linearizable reads |

## Context and Problem Statement

The interactions between `PersistMode:async` and `sync:always` are as follows:
* When the user provides no value for `PersistMode`, the implied default is `default`, but the server will not set this in the configuration; the result of INFO requests will also have it unset
* Setting `PersistMode` to anything other than empty/absent will require API Level 2

## Read Consistencies

The table below describes the read consistencies currently supported by the JetStream API, from the highest consistency
level to the lowest. (A request sketch for these APIs follows the table.)

| Stream configuration | JetStream API | Description | Level of consistency |
|:-----------------------------------------|:----------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `AllowDirect` disabled. | `$JS.API.STREAM.MSG.GET.<stream>` | An API meant only for management outside of the hot path. The request goes to every server, but is normally only answered by the stream leader. | Current highest level of read consistency. Only the stream leader answers, but stale reads are technically possible after leader changes or during network partitions since an old leader could still answer before the current leader does. |
| `AllowDirect` enabled. | `$JS.API.DIRECT.GET.<stream>` | If the stream is replicated, the followers will also answer read requests. | Higher availability read responses but with lower consistency. A read request will be randomly served by a server hosting the stream. Recently written data is not guaranteed to be returned on a subsequent read request. |
| `MirrorDirect` enabled on mirror stream. | `$JS.API.DIRECT.GET.<stream>` | If the stream is mirrored, the mirror can also answer read requests. For example a mirror stream in a different cluster or on a leaf node. | Higher availability with potential of fast local read responses but with lowest consistency. Mirrors can be in any relative state to the source. |
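
For illustration, the sketch below issues the same lookup against both of these APIs using plain core NATS requests. The stream name `ORDERS` and subject `orders.new` are examples only; `last_by_subj` selects the latest message on a subject. The Msg Get response is a JSON API envelope, while the Direct Get response is the raw message with metadata carried in headers.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Both APIs accept a JSON body selecting a message, e.g. by sequence or
	// by the last message on a subject.
	req := []byte(`{"last_by_subj":"orders.new"}`)

	// Management-oriented Msg Get: normally answered by the stream leader,
	// response is a JSON API envelope.
	if msg, err := nc.Request("$JS.API.STREAM.MSG.GET.ORDERS", req, 2*time.Second); err == nil {
		fmt.Printf("msg get: %s\n", msg.Data)
	}

	// Direct Get (requires AllowDirect): any peer hosting the stream may answer,
	// response is the raw message with metadata in headers such as Nats-Sequence.
	if msg, err := nc.Request("$JS.API.DIRECT.GET.ORDERS", req, 2*time.Second); err == nil {
		fmt.Printf("direct get: seq=%s body=%s\n", msg.Header.Get("Nats-Sequence"), msg.Data)
	}
}
```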

Additionally, if a stream is replicated and a consumer is created, there is no guarantee that the consumer can
immediately observe all the messages written at that time. For example, an R1 consumer might be created on a follower
that is not yet up to date on all writes. The consumer will eventually observe all the writes as it keeps fetching new
messages as they come in.

Reviewer:

This example feels a little ambiguous, and it is not clear if you are describing an actual problem or not.

Member Author:

It is an actual problem. Imagine a replicated object store where you write a new file. A watcher (using a consumer) will be notified of this file, and a new consumer could be created to read in the chunks of that just-written file. However, since there are no guarantees about what the consumer might see, the consumer could get only halfway through the file and the server then says "no more pending", because it is a stream follower that does not yet have all the writes.

## Proposal to add linearizability

Newer server versions, such as 2.14+, should support more configurability and, in general, higher levels of consistency
as an opt-in. For example, higher read consistency for consumers can be achieved by having consumer CRUD operations go
through the stream's Raft log instead of the Meta Raft log, which ensures that a consumer created at time X in the
stream log can observe all the stream writes up to time X.

Specifically, higher-consistency message read requests would roughly require:

- `AllowDirect` should not need to be disabled. The `$JS.API.STREAM.MSG.GET.<stream>` API, used when `AllowDirect` is
disabled, has significant overhead since these requests go to ALL servers, not just the servers hosting the stream.
- Batch requests should also be supported; Direct Get allows them, but the Msg Get API above does not.

Linearizable reads would be desirable, but at a minimum users should be able to opt in to session-level guarantees such
as read-your-writes and monotonic reads.

Reviewer:

Would it make sense to have an additional section for session guarantees? If I remember correctly, you implemented session guarantees, but it introduced some complexity. So I wonder if it would make sense to document your attempt and your conclusions. And maybe we could think of alternative ways to achieve session guarantees. Or maybe you did that somewhere else already?

Member Author:

That prior attempt and design is included here: #358

I'm tempted to solely focus on fully linearizable reads though, and not "intermediate" session guarantees. Mostly because there are use cases involving multiple processes which require linearizable reads, and those can't be solved by (single-process/client) session guarantees.


### Discussion

#### What would be expected given different topologies?
Member Author:

Starting a thread for thoughts around this topic:

Location transparency is a big topic for NATS. With that in mind, "it should not matter" where you are connected in the topology. If you need a high-consistency read, you shouldn't need to care where you are connected and whether the stream happens to be in the same place.

I'd say that a client connected to a leaf node, connected to a cluster, gateway-ed to another cluster that hosts the stream, should be able not only to write to that stream but also to get a consistent read when it requires one. We shouldn't require the client to be connected directly to the cluster hosting the stream.


For a replicated stream, the minimum to expect would be that, within the cluster, the read response will always be aware
of the writes performed up to that point: either only the writes performed by the process doing the read request, or
all writes performed by all processes on the same stream (to be discussed later).

But what about when the stream is mirrored on a leaf node and a client is connected to the leaf node? Or similarly if
the client is connected to another cluster in a super cluster, and the stream is mirrored there?

Writes still only go to the source, which will be aware of all writes to the stream up to that point. So a new write may
be immediately reflected in a read request when the client is connected directly to the cluster, but perhaps not when
connected to the leaf node. Is that okay given the topology, or would the expectation be that the client can always get
the most consistent view of a stream without being "location-specific"?

#### Should all read requests get higher consistency, or only a few?
Member Author:

Starting a thread for thoughts around this topic:

On the one hand, having all reads share a certain consistency level is nice and clear; on the other hand, that means reads become all-or-nothing, valuing availability OR consistency with no in-between.

I'd say there should be an option for a hybrid approach. If a leaf node has a mirror of a given stream and N clients all perform reads, it's fine for these to value availability. But if a single process happens to write to that mirrored stream, with the write going to the origin, then it should also have the option of such a consistent read. That would mean a hybrid approach, where client-side you can decide between availability and consistency on a case-by-case basis.


For example, if this were a stream setting like `ReadConsistency: weak/high`, setting it to `high` would mean that ALL
read requests are served by the stream leader only. This has the side effect of lower availability when there's no
leader available at a given time.

But does this actually need to be for ALL read requests, or only a select few?

If high availability is valued, then the current Direct Get API could still be used while high consistency read requests
could be served by the stream leader only. Would such a hybrid approach even be desirable, given that now the app
developer will need to decide per process or app which consistency level to use? Is this additional complexity worth the
flexibility?
Collaborator:

As discussed during the previous attempts, it's critical that one should be able to know that all access to a certain stream is done safely. That is, if the stream is a financial wallet, the stream should be able to enforce that all access to it must be done in a consistent manner.

That does not mean it always has to be that way; we can have a mode where a hybrid approach is used and both consistent and eventually consistent reads are allowed.

But it must be possible to configure a stream to only accept consistent access and reject other attempts against it.

Member Author:

> ... one should be able to know that all access to a certain stream is done safely

Agree, and that could perhaps be achieved by a stream setting like `ReadConsistency: high`, where all reads are ensured to be linearizable.

The hybrid could be an opt-in on a per-request basis to enable this high consistency for that particular request, even if the stream is not enforcing `ReadConsistency: high` because it's set to `weak`, for example.

> ... the stream should be able to enforce that all access to it must be done in a consistent manner.

The only doubt I have is around "what if on a separate cluster/leaf somebody makes a mirror of that stream?". The act of adding a mirror will violate this guarantee, regardless of the source having `ReadConsistency: high`, since the mirror can't know the source was configured that way.

Collaborator:

Yes, we discussed that we will reasonably be able to have in-cluster consistency only. If possible, we might say a high-consistency stream cannot have a mirror-direct-enabled mirror?

> The hybrid could be on an opt-in per request basis

This is exactly what we want to avoid when supporting a mode where the stream is in a strict high-consistency mode only.

Member Author:

> This is exactly what we want to avoid when supporting a mode where the stream is in a strict high consistency mode only

Indeed, if the stream is in strict high-consistency mode only, there's no way to disable that.
I meant that if the stream is in weak consistency, you can on a per-request basis still opt in to high consistency when required.

> if possible we might say a high consistency stream cannot have a mirror direct enabled mirror?

Not sure if this is possible. Mirrors can be made on a leaf node while not even having a connection to what they mirror from. Especially if we allow updating the `ReadConsistency` value, there's no way to enforce that a mirror isn't created against it.

Collaborator:

> Indeed, if the stream is in strict high consistency mode only, there's no way to disable that.

Good stuff.

Re mirrors, that's what I expected, so it might be a combination of documentation and, for example, not allowing mirror direct to be set if you enable high consistency on the mirror, or something similar (just an example, probably a bad one). The idea is to have some way to at least communicate that this is weird and shouldn't be allowed.


#### What should the performance considerations be?
Member Author:

Starting a thread for thoughts around this topic:

"Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc. Under this mode, leader elections would need to trigger much more slowly, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.

I'd say we take the path taken by others, like etcd, and have reads "go through Raft", as this gives a strict guarantee of linearizable reads when opted in. I realize that reads should be faster than writes, and reads should not generally slow down the throughput of writes. This means the read should not result in an append entry being written into the Raft log, as the throughput would then be shared with writes. Instead, we can optimize this by NOT writing additional data into the log and instead "piggybacking" on writes that are happening in parallel. Meaning, if a replicated write happens and the leader has sent it out to its followers but has not received quorum yet, then N reads can simply wait for this write to have quorum, with no need to write additional data into the log. Reads are then simply "queued" after the latest writes, ensuring the least impact while unlocking the highest guarantees.

Reviewer:

About the leader leases:

> "Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc.

I agree on this one. Leases may violate the guarantees, although it would be unlikely if the lease duration is chosen in a sensible way.

> Under this mode leader elections would need to trigger way slower, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.

I'm not sure I follow. A faulty leader can signal it is stepping down? If a leader wants to step down, it can as well stop serving local reads.

In practice, a lease would be held for a time that is less than the election timeout (and some more to account for max clock drift and latency). We currently use a 4-second election timeout, which I think is plenty.

Member Author:

> I'm not sure I follow. A faulty leader can signal it is stepping down? If a leader wants to step down, it can as well stop serving local reads.

Indeed, a leader that has stepped down, i.e. is not the leader anymore, need not serve local reads under the "consistent mode".
Where a faulty leader can't signal it is stepping down is when the fault prevents it. One example could be a hard kill, or a network partition where the current leader doesn't know another leader has been elected.

If we were to use leader leases, that would mean we need to ensure a new leader can't be elected before the lease expires. With the minimum election timeout of 4 seconds, and heartbeats every 1 second (or earlier if a normal append entry gets sent), would a "lease timeout" of, say, 2 seconds suffice? Meaning, as long as we've got a response from a quorum of servers and the monotonic timestamps we've seen are within this "lease timeout", then we can serve a local read?

Reviewer:

A partition or a hard kill would prevent sending a step-down request regardless of whether leases are used or not. The system has to rely on the election timeouts anyway.

If one can implement leases while keeping the same election timeout we have now, it won't make a difference. Keep in mind that even if you renew a lease every second or so, you can serve lots and lots of local reads in that time.
I'm not necessarily pushing for leases, just want to put it in the right perspective.

Reviewer:

About reads through Raft:

> Instead, we can optimize this by NOT writing additional data in the log

It is a good idea to optimize reads by not logging any additional data. However, piggybacking reads onto writes would not perform well in read-heavy workloads (I suppose that KV workloads could be read-heavy, for example).
I think it is possible to avoid logging read requests. I would do it like this: when a read request is received, the leader marks the request with the current value of the index, that is, the index at which we want to read from. Then the leader sends heartbeats to the followers and waits for responses from a majority. The responses will confirm that the leader is in fact still the leader, and we can serve the read as soon as the read request's index has been applied in the upper layer. The Raft paper also hints at such a solution. Towards the end of section 8 in https://raft.github.io/raft.pdf:

> Read-only operations can be handled without writing anything into the log. However, with no additional measures, this
> would run the risk of returning stale data, since the leader responding to the request might have been superseded by a
> newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra
> precautions to guarantee this without using the log. First, a leader must have the latest information on which entries
> are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start
> of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles
> this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must
> check whether it has been deposed before processing a read-only request (its information may be stale if a more recent
> leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the
> cluster before responding to read-only requests.

I think we already have the first requirement in place. We would need to implement the second part.

Member Author (@MauriceVanVeen, Nov 12, 2025):

> A partition or a hard kill would prevent sending step-down request regardless of using leases or not. The system has to rely on the election timeouts anyways.

Indeed, just mentioning it because the current leader will not step down based on election timeouts, but does its own `lostQuorumCheck` and has its own `lostQuorumInterval`. No big deal for leases, but it means the current server will think it's the leader for longer than the 4-second minimum election timeout.

Member Author:

> when a read request is received, the leader marks the request with the current value of the index, that is, the index at which we want to read from. Then the leader sends heartbeats to the followers, and waits responses from a majority.
> ...
> I think we already have the first requirement in place. We would need to implement the second part.

Indeed, we'd only need the second part.

When I mentioned "piggybacking on writes", that assumes there are sufficient writes to piggyback on. If writes are actively happening, there's no need for an additional heartbeat message to be sent. But if the workload is sufficiently read-heavy that there are no active writes, a heartbeat would need to be used instead, exactly as you're describing.

I believe we're fully aligned here then 😄

Reviewer:

Ahh OK, I initially misunderstood your proposal. All clear now.


Tradeoffs can be made regarding performance versus consistency.

Over-simplifying, there are two options:

- Reads go through Raft. This is the simplest way to implement it and to ensure no stale reads happen, but it requires
an additional network round trip for consensus (a sketch of this approach follows the list).
- Reads do not go through Raft. This requires a mechanism like a "leader lease"; read requests can be answered
immediately as before, but it requires timeout tuning and makes a new leader election take much longer to happen.
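
To make the first option concrete, here is an illustrative sketch of the "read index" style approach discussed above. This is not nats-server code; all types and method names are hypothetical. The leader captures its commit index, confirms it is still the leader via a quorum round (a heartbeat, or piggybacking on an in-flight append), waits for that index to be applied, and only then serves the read locally, without appending anything to the Raft log.

```go
// Package readindex sketches a linearizable read that never writes to the log.
package readindex

import (
	"context"
	"errors"
)

// raftNode is a hypothetical, minimal view of what the leader would need.
type raftNode interface {
	IsLeader() bool
	CommitIndex() uint64                                  // highest committed index known to this leader
	ConfirmLeadership(ctx context.Context) error          // quorum heartbeat, or piggyback on an in-flight append
	WaitApplied(ctx context.Context, index uint64) error  // block until the state machine has applied >= index
}

var errNotLeader = errors.New("not leader")

// linearizableRead serves a read without appending to the Raft log:
//  1. capture the current commit index as the "read index",
//  2. confirm we are still leader by reaching a quorum,
//  3. wait until the read index is applied, then read local state.
func linearizableRead(ctx context.Context, n raftNode, read func() ([]byte, error)) ([]byte, error) {
	if !n.IsLeader() {
		return nil, errNotLeader
	}
	readIndex := n.CommitIndex()
	if err := n.ConfirmLeadership(ctx); err != nil { // step 2: heartbeat quorum / piggyback on a write
		return nil, err
	}
	if err := n.WaitApplied(ctx, readIndex); err != nil { // step 3: local state has caught up
		return nil, err
	}
	return read() // safe: no newer leader could have committed beyond readIndex unseen
}
```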

Having reads go through Raft is essentially what etcd also did:
> When we evaluated etcd 0.4.1 in 2014, we found that it exhibited stale reads by default due to an optimization. While
> the Raft paper discusses the need to thread reads through the consensus system to ensure liveness, etcd performed reads
> on any leader, locally, without checking to see whether a newer leader could have more recent state. The etcd team
> implemented an optional quorum flag, and in version 3.0 of the etcd API, made linearizability the default for all
> operations except for watches.
> - https://jepsen.io/analyses/etcd-3.4.3 (2020-01-30)

Having leader leases is essentially what YugabyteDB did:
> Within a shard, Raft ensures linearizability for all operations which go through Raft’s log. However, for performance
> reasons, YugaByte DB does not use Raft’s consensus for reads. Instead, it cheats: reads return the local state from any
> Raft leader immediately, using leader leases to ensure safety. Using `CLOCK_MONOTONIC` for leases (instead of
> `CLOCK_REALTIME`) insulates YugaByte DB from some classes of clock error, such as leap seconds.
> - https://jepsen.io/analyses/yugabyte-db-1.1.9 (2019-03-26)

Generally we hear that users are willing to pay a performance "penalty" for higher consistency. But there are a few
things to consider:

- The earlier point of "Should all read requests get higher consistency, or only a few?": are all reads considered
equal, or should there be a hybrid approach? If hybrid, then going through Raft for only some reads probably makes the
most sense.
- Leader leases are tricky to implement and can (under niche conditions) still result in stale reads. Do we prefer being
able to strictly guarantee no stale reads?
- In some ways NATS KV can be considered similar to etcd's KV; should we make similar choices?

> etcd ensures linearizability for all other operations by default. Linearizability comes with a cost, however, because linearized requests must go through the Raft consensus process. To obtain lower latencies and higher throughput for read requests, clients can configure a request’s consistency mode to serializable, which may access stale data with respect to quorum, but removes the performance penalty of linearized accesses’ reliance on live consensus.
> - https://etcd.io/docs/v3.5/learning/api_guarantees/

### Design

The design introduces 'linearizable reads' to JetStream by adding a new API specifically for this purpose. This allows
location-transparent access: it doesn't matter if a client is connected via a leaf node and several hops away from the
stream leader; if it requires linearizable reads, it can use this new API to get that guarantee. Additionally, this API
is enabled through a new stream setting: `AllowDirectLeader`. If enabled, the leader will also respond to
`$JS.API.DIRECT.GET.<stream>` without requiring `AllowDirect`. This allows clients to migrate away from the
`$JS.API.STREAM.MSG.GET.<stream>` API, which is primarily meant for management purposes.

- Introduce a new API for linearizable reads: `$JS.API.DIRECT_LEADER.GET.<stream>`.
- The new API will be similar to `$JS.API.DIRECT.GET.<stream>` but will go to the stream leader only. If it's a
replicated stream, the read will need to "go through Raft" to ensure linearizability.
- The new API will be enabled by an `AllowDirectLeader` setting on the stream. Once enabled, the new API will be active,
  and the leader will also respond to `$JS.API.DIRECT.GET.<stream>` without requiring `AllowDirect`. This allows clients
  to use the DirectGet API instead of the MsgGet API (a client-side selection sketch follows this list):
- If `AllowDirect` is set, the client should use `$JS.API.DIRECT.GET.<stream>` by default.
- If the user specifies requiring linearizable reads:
- If `AllowDirectLeader` is NOT set, then the client should return an error that 'linearizable reads are not
enabled for this stream'.
- If `AllowDirectLeader` is set, then the client should use the new `$JS.API.DIRECT_LEADER.GET.<stream>` API.
- If the client does not know the current value of `AllowDirectLeader` (since it might not have access to the
stream info), then the client should use the new `$JS.API.DIRECT_LEADER.GET.<stream>` API anyway. A '503 No
Responders' error will be returned to the user, which will either mean there's temporarily no leader
available, or the stream is not configured to allow linearizable reads, but the client can't differentiate
between these two cases.
- If `AllowDirect` is NOT set but `AllowDirectLeader` is set, then the client should use
`$JS.API.DIRECT.GET.<stream>`. The consistency guarantees would be the same as having disabled `AllowDirect` and
the `$JS.API.STREAM.MSG.GET.<stream>` API being used, but without the additional overhead of that API.
- If none are specified, the client falls back to the `$JS.API.STREAM.MSG.GET.<stream>` API.
- The `AllowDirectLeader` setting will require API Level 3.
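
The client-side selection rules above could look roughly like the following sketch. This is hypothetical: `AllowDirectLeader` and the `DIRECT_LEADER` subject are the proposed additions and do not exist in current clients.

```go
// Package jsread sketches how a client might pick the read subject.
package jsread

import "fmt"

// streamInfo is a hypothetical, minimal view of the stream configuration.
type streamInfo struct {
	AllowDirect       bool
	AllowDirectLeader bool // proposed setting, API Level 3
}

// readSubject picks which API a get-message request should use.
// info may be nil if the client has no access to the stream info.
func readSubject(stream string, info *streamInfo, linearizable bool) (string, error) {
	if linearizable {
		// Unknown config: try the leader API anyway; a "503 No Responders" then
		// means either no leader is available or linearizable reads are not enabled.
		if info == nil || info.AllowDirectLeader {
			return fmt.Sprintf("$JS.API.DIRECT_LEADER.GET.%s", stream), nil
		}
		return "", fmt.Errorf("linearizable reads are not enabled for this stream")
	}
	if info != nil && (info.AllowDirect || info.AllowDirectLeader) {
		// With only AllowDirectLeader set, the leader still answers DIRECT.GET.
		return fmt.Sprintf("$JS.API.DIRECT.GET.%s", stream), nil
	}
	// Fall back to the management-oriented Msg Get API.
	return fmt.Sprintf("$JS.API.STREAM.MSG.GET.%s", stream), nil
}
```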

The addition of `AllowDirectLeader` allows for various levels of read consistency, from weakest to strongest (an
illustrative summary in code follows the list):

- Leader/Follower/Mirror reads, high availability with the potential of fast local read responses from a mirror but no
cross-request/session consistency guarantees, weakest consistency: `AllowDirect` and `MirrorDirect` enabled, and a
request to `$JS.API.DIRECT.GET.<stream>`.
- Leader/Follower reads, high availability with no cross-request/session consistency guarantees: `AllowDirect` enabled,
any (up-to-date enough) peer answers requests to `$JS.API.DIRECT.GET.<stream>`.
- Leader reads, high-consistency fast responses, but with the potential of inconsistency during leader changes or
network partitions: `AllowDirect` disabled, and a request to `$JS.API.DIRECT.GET.<stream>` (`AllowDirectLeader`
enabled) or `$JS.API.STREAM.MSG.GET.<stream>`.
- Linearizable reads, highest level of consistency: `AllowDirectLeader` enabled with a request to
`$JS.API.DIRECT_LEADER.GET.<stream>` which is only answered by the stream leader if it can guarantee linearizability.
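
As a compact illustration only (these are not real client or server types), the four levels can be summarised as pairs of stream settings and the request subject used, here for a hypothetical stream named `ORDERS`. `AllowDirectLeader` and the `DIRECT_LEADER` subject are the proposed additions; the other settings exist today.

```go
// Package readlevels lists the read paths described above, weakest to strongest.
package readlevels

type readPath struct {
	AllowDirect       bool
	MirrorDirect      bool // set on the mirror stream
	AllowDirectLeader bool // proposed, API Level 3
	Subject           string
}

var levels = []readPath{
	// Weakest: mirror reads; the subject uses the mirror's own stream name in practice.
	{AllowDirect: true, MirrorDirect: true, Subject: "$JS.API.DIRECT.GET.ORDERS"},
	// Leader/follower reads within the stream's peer set.
	{AllowDirect: true, Subject: "$JS.API.DIRECT.GET.ORDERS"},
	// Leader reads: fast, but stale reads are possible around leader changes.
	{AllowDirectLeader: true, Subject: "$JS.API.DIRECT.GET.ORDERS"},
	// Strongest: linearizable reads, answered by the leader only after a Raft round.
	{AllowDirectLeader: true, Subject: "$JS.API.DIRECT_LEADER.GET.ORDERS"},
}
```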

This design makes linearizability an opt-in and a conscious choice by a user. The server provides all the tools required
for various consistency levels. Clients can ease the user experience by offering:

- Per-request linearizable read opt-in. For example: `js.GetMsg("my-stream", 1, nats.Linearizable())` and
`js.GetLastMsg("stream", "subject", nats.Linearizable())`. This allows the user to value availability by default, but
opt in to linearizable reads for the requests that need it.
- Per-object linearizable read opt-in. Opt-in to linearizable reads for a specific stream, KV or Object Store. All reads
to that 'object' will use linearizable read requests by default, without needing the user to specify this on a
per-request basis. For example:
`js.CreateKeyValue(ctx, jetstream.KeyValueConfig{Bucket: "TEST", Replicas: 3, LinearizableReads: true})`.

The clients are free to implement this in a way that's best for the given language, but should generally provide both
the per-request and per-object options.