ADR-56: Read consistencies #390
Conversation
### Discussion

#### What would be expected given different topologies?
Starting a thread for thoughts around this topic:
Location transparency is a big topic for NATS. With that in mind, "it should not matter" where you are connected in the topology. If you need a high consistency read, you shouldn't need to care where you are connected or whether the stream happens to be in the same place.
I'd say a client connected to a leaf node, which is connected to a cluster that is gatewayed to another cluster hosting the stream, should be able not only to write to that stream but also to get a consistent read when it requires one. We shouldn't require the client to be connected directly to the cluster hosting the stream.
connected to the leaf node. Is that okay given the topology, or would the expectation be that the client can always get the most consistent view of a stream without being "location-specific"?

#### Should all read requests get higher consistency, or only a few?
Starting a thread for thoughts around this topic:
On the one hand, having all reads share a certain consistency level is nice and clear; on the other hand, it becomes all-or-nothing: reads value either availability or consistency, with no in-between.
I'd say there should be an option for a hybrid approach. If a leaf node has a mirror of a given stream and N clients all perform reads, it's fine for those reads to value availability. But if a single process happens to write to that mirrored stream, with the write going to the origin, it should also have the option of such a consistent read. That would mean a hybrid approach, where client-side you can decide between availability and consistency on a case-by-case basis.
developer will need to decide per process or app which consistency level to use? Is this additional complexity worth the flexibility?

#### What should the performance considerations be?
Starting a thread for thoughts around this topic:
"Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc. Under this mode leader elections would need to trigger way slower, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.
I'd say we take the path taken by others, like etcd, and have reads "go through Raft" as this gives a strict guarantee of linearizable reads when opted-in. I realize that reads should be faster than writes, and reads should not generally slow down the throughput of writes. This means the read should not result in an append entry to be written into the Raft log, as the throughput would then be shared with writes. Instead, we can optimize this by NOT writing additional data in the log, instead "piggybacking" on writes that are happening in parallel. Meaning, if a replicated write happens and the leader sends this out to its followers but didn't receive quorum yet, then N reads can simply wait for this write to have quorum with no need to write additional data into the log. Reads are then simply "queued" after the latest writes, ensuring least impact while unlocking highest guarantees.
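To make the "piggybacking" idea concrete, here is a minimal sketch. It is illustrative only, not NATS server code, and all names are made up; it also omits the leadership confirmation discussed further down, focusing purely on how reads wait behind in-flight writes instead of appending anything to the log.

```go
// Illustrative sketch: a leader "queues" a linearizable read behind writes
// already in flight. When the read arrives it records the leader's current
// last log index; once the commit index reaches that point, the read is
// served from local state without adding an entry to the Raft log.
package piggyback

import "sync"

type leader struct {
	mu          sync.Mutex
	lastIndex   uint64 // highest index the leader has proposed to followers
	commitIndex uint64 // highest index known to be replicated to a quorum
	commitCond  *sync.Cond
}

func newLeader() *leader {
	l := &leader{}
	l.commitCond = sync.NewCond(&l.mu)
	return l
}

// onQuorumAck is called when a write has been acknowledged by a majority.
func (l *leader) onQuorumAck(index uint64) {
	l.mu.Lock()
	if index > l.commitIndex {
		l.commitIndex = index
		l.commitCond.Broadcast() // wake up reads waiting on this index
	}
	l.mu.Unlock()
}

// linearizableRead blocks until everything the leader had accepted at the
// time the read arrived is committed, then serves the read locally.
func (l *leader) linearizableRead(serve func()) {
	l.mu.Lock()
	readIndex := l.lastIndex // piggyback on writes already in flight
	for l.commitIndex < readIndex {
		l.commitCond.Wait()
	}
	l.mu.Unlock()
	serve()
}
```

If no writes are in flight, the read is served immediately; the heartbeat exchange discussed below would still be needed to confirm the server is, in fact, still the leader.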
About the leader leases:
"Leader leases" or "distributed locks" don't have strict guarantees, especially when considering clock skew/drift/etc.
I agree on this one. Leases may violate the guarantees, although it would be unlikely if lease duration is chosen in a sensible way.
Under this mode leader elections would need to trigger way slower, which would prevent quick leader elections during faults where the previous leader can't signal it's stepping down.
I'm not sure I follow. A faulty leader can signal it is stepping down? If a leader wants to step down, it can as well stop serving local reads.
In practice, a lease would be held for a time that is less than the election timeout (and some more to account for max clock drift and latency). We currently use 4 seconds election timeout, which I think is plenty.
> I'm not sure I follow. A faulty leader can signal it is stepping down? If a leader wants to step down, it can just as well stop serving local reads.

Indeed, a leader that's stepped down, i.e. is not the leader anymore, need not serve local reads under the "consistent mode".
Where a faulty leader can't signal it is stepping down is where the fault prevents it. One example could be a hard kill, or a network partition where the current leader doesn't know another leader has been elected.
If we were to use leader leases, that would mean we need to ensure a new leader can't be elected before the lease expires. With the minimum election timeout of 4 seconds and heartbeats every 1 second (or earlier if a normal append entry gets sent), would a "lease timeout" of, say, 2 seconds suffice? Meaning, as long as we've got a response from a quorum of servers and the monotonic timestamps we've seen are within this "lease timeout", we can serve a local read?
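As a rough sketch of that check (illustrative only; the 4-second/2-second values from this thread are taken as assumptions rather than decided constants):

```go
// Illustrative sketch of a lease-based local read check: the leader serves a
// read locally only if a quorum of servers has responded within the lease
// window, measured on the monotonic clock.
package lease

import "time"

const (
	electionTimeout = 4 * time.Second // minimum election timeout mentioned above
	leaseTimeout    = 2 * time.Second // hypothetical lease window, well below electionTimeout minus max clock drift
)

// canServeLocalRead reports whether the leader may answer a read locally.
// lastAck holds the monotonic timestamp of the most recent response seen from
// each peer, and quorum is the majority size of the group (leader included).
func canServeLocalRead(now time.Time, lastAck []time.Time, quorum int) bool {
	recent := 1 // the leader counts towards the quorum itself
	for _, ack := range lastAck {
		if now.Sub(ack) <= leaseTimeout {
			recent++
		}
	}
	return recent >= quorum
}
```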
A partition or a hard kill would prevent sending a step-down request regardless of whether leases are used. The system has to rely on election timeouts anyway.
If one can implement leases while keeping the same election timeout we have now, it won't make a difference to election speed. Keep in mind that even if you renew a lease every second or so, you can serve lots and lots of local reads in that time.
I'm not necessarily pushing for leases, just want to put them in the right perspective.
About reads through Raft:

> Instead, we can optimize this by NOT writing additional data into the log

It is a good idea to optimize reads by not logging any additional data. However, piggybacking reads onto writes would not perform well in read-heavy workloads (I suppose KV workloads could be read heavy, for example).
I think it is possible to avoid logging read requests. I would do it like this: when a read request is received, the leader marks the request with the current value of the index, that is, the index we want to read from. Then the leader sends heartbeats to the followers and waits for responses from a majority. The responses will confirm that the leader is in fact still the leader, and we can serve the read as soon as the read request's index has been applied in the upper layer. The Raft paper also hints at such a solution, towards the end of section 8 in https://raft.github.io/raft.pdf:
> Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
I think we already have the first requirement in place. We would need to implement the second part.
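A minimal sketch of that second part, assuming a hypothetical raftNode interface (none of these names exist in the NATS server; this only illustrates the read-index flow described above):

```go
// Illustrative read-index sketch: capture the commit index when the read
// arrives, confirm leadership with one heartbeat round to a majority, then
// wait until the state machine has applied up to that index before serving.
package readindex

import "errors"

var errNotLeader = errors.New("no longer leader")

// raftNode is a hypothetical view of the Raft layer, for illustration only.
type raftNode interface {
	CommitIndex() uint64
	// BroadcastHeartbeat sends a heartbeat round and reports whether a
	// majority of the cluster still acknowledges this leader's term.
	BroadcastHeartbeat() (ackedByMajority bool)
	// WaitApplied blocks until the upper layer has applied up to index.
	WaitApplied(index uint64)
}

func readIndexRead(n raftNode, serve func()) error {
	readIndex := n.CommitIndex() // 1. index the read must observe
	if !n.BroadcastHeartbeat() { // 2. confirm we have not been deposed
		return errNotLeader
	}
	n.WaitApplied(readIndex) // 3. wait for the upper layer to catch up
	serve()                  // 4. serve the read from local state
	return nil
}
```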
> A partition or a hard kill would prevent sending a step-down request regardless of whether leases are used. The system has to rely on election timeouts anyway.

Indeed, just mentioning it because the current leader will not step down based on election timeouts, but does its own lostQuorumCheck and has its own lostQuorumInterval. No big deal for leases, but it means the current server will think it's the leader for longer than the 4-second minimum election timeout.
> when a read request is received, the leader marks the request with the current value of the index, that is, the index we want to read from. Then the leader sends heartbeats to the followers and waits for responses from a majority.
> ...
> I think we already have the first requirement in place. We would need to implement the second part.

Indeed, we'd only need the second part.
When I mentioned "piggybacking on writes", that assumes there are sufficient writes to piggyback on. If writes are actively happening, there's no need for an additional heartbeat message to be sent. But if the workload is sufficiently read heavy that there are no active writes, a heartbeat would need to be used instead, exactly as you're describing.
I believe we're fully aligned here then 😄
Ahh OK, I initially misunderstood your proposal. All clear now.
| `AllowDirect` enabled. | `$JS.API.DIRECT.GET.<stream>` | If the stream is replicated, the followers will also answer read requests. | Higher availability read responses but with lower consistency. A read request will be randomly served by a server hosting the stream. Recently written data is not guaranteed to be returned on a subsequent read request. |
| `MirrorDirect` enabled on mirror stream. | `$JS.API.DIRECT.GET.<stream>` | If the stream is mirrored, the mirror can also answer read requests. For example a mirror stream in a different cluster or on a leaf node. | Higher availability with potential of fast local read responses but with lowest consistency. Mirrors can be in any relative state to the source. |

Additionally, if a stream is replicated and a consumer is created, there is no guarantee that the consumer can
This example feels a little ambiguous, and it is not clear if you are describing an actual problem or not.
It is an actual problem. Imagine a replicated object store where you write a new file. A watcher (using a consumer) will be notified of this file, and a new consumer could be created to read in the chunks of that just-written file. However, since there are no guarantees about what the consumer might see, it could be that the consumer only gets halfway through the file and the server then says "no more pending", because it's a stream follower that doesn't have all the writes yet.
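For reference, this is roughly what the low-consistency read path from the table above looks like from a client today. The stream and subject names below are made up, and the request payload follows the existing Direct Get API; with `AllowDirect`/`MirrorDirect`, any server hosting or mirroring the stream may answer, so a message written just before this request is not guaranteed to be returned yet.

```go
// Minimal sketch of a Direct Get request served by whichever server answers
// first (leader, follower, or mirror), illustrating the weak read path.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Ask for the last message on a subject directly from the stream
	// "ORDERS" (hypothetical stream/subject names).
	req := []byte(`{"last_by_subj":"orders.new"}`)
	resp, err := nc.Request("$JS.API.DIRECT.GET.ORDERS", req, time.Second)
	if err != nil {
		panic(err)
	}
	// Headers such as Nats-Stream, Nats-Sequence and Nats-Time-Stamp describe
	// which message was returned; the body is the stored message payload.
	fmt.Printf("headers: %v\nbody: %s\n", resp.Header, string(resp.Data))
}
```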
above)

Linearizable reads would be desirable, but a minimum would be to opt in to at least session-level guarantees such as reading your own writes and monotonic reads.
Would it make sense to have an additional section for session guarantees? If I remember correctly, you implemented session guarantees, but it introduced some complexity. So I wonder if it would make sense to document your attempt and your conclusions. And maybe we could think of alternative ways to achieve session guarantees. Or maybe you did that somewhere else already?
That prior attempt and design is included here: #358
I'm tempted to solely focus on fully linearizable reads though, and not "intermediate" session guarantees. Mostly because there are use cases involving multiple processes that require linearizable reads and can't be solved by (single-process/client) session guarantees.
If high availability is valued, then the current Direct Get API could still be used while high consistency read requests could be served by the stream leader only. Would such a hybrid approach even be desirable, given that now the app developer will need to decide per process or app which consistency level to use? Is this additional complexity worth the flexibility?
As discussed during the previous attempts, it's critical that one should be able to know that all access to a certain stream is done safely. That is, if the stream is a financial wallet, the stream should be able to enforce that all access to it must be done in a consistent manner.
That does not mean it always has to be that way; we can have a mode where a hybrid approach is used and both consistent and eventually consistent reads are allowed.
But it must be possible to configure a stream to only accept consistent access and reject other attempts against it.
> ... one should be able to know that all access to a certain stream is done safely

Agree, and that could perhaps be achieved by a stream setting like `ReadConsistency: high`, where all reads are ensured to be linearizable.
The hybrid could be an opt-in on a per-request basis to enable high consistency for that particular request, even if the stream is not enforcing `ReadConsistency: high` because it's set to `weak`, for example.

> ... the stream should be able to enforce that all access to it must be done in a consistent manner.

The only doubt I have is around "what if somebody makes a mirror of that stream on a separate cluster/leaf?".
The act of adding a mirror will violate this guarantee, regardless of the source having `ReadConsistency: high`, since the mirror can't know the source was configured that way.
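To make the two knobs being discussed concrete, here is a purely hypothetical sketch; neither `ReadConsistency` nor a per-request read option exists in the NATS API today, and all names are made up for illustration only.

```go
// Hypothetical sketch of the two knobs discussed in this thread: a
// stream-level enforcement setting and a per-request opt-in.
package readconfig

// ReadConsistency is a proposed stream-level setting.
type ReadConsistency string

const (
	// ReadConsistencyWeak keeps today's behaviour: any replica or mirror may answer.
	ReadConsistencyWeak ReadConsistency = "weak"
	// ReadConsistencyHigh enforces linearizable reads for every read against the stream.
	ReadConsistencyHigh ReadConsistency = "high"
)

// StreamReadConfig is what a stream-level setting could look like.
type StreamReadConfig struct {
	Name            string
	Replicas        int
	ReadConsistency ReadConsistency
}

// ReadOpts is a possible per-request override, only meaningful when the stream
// is configured as "weak"; a "high" stream would enforce consistency for all reads.
type ReadOpts struct {
	RequireLinearizable bool
}
```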
Yes, we discussed that we will reasonably only be able to have in-cluster consistency. If possible, we might say a high consistency stream cannot have a mirror with `MirrorDirect` enabled?

> The hybrid could be an opt-in on a per-request basis

This is exactly what we want to avoid when supporting a mode where the stream is in a strict high consistency mode only.
> This is exactly what we want to avoid when supporting a mode where the stream is in a strict high consistency mode only

Indeed, if the stream is in strict high consistency mode only, there's no way to disable that.
I meant that if the stream is in weak consistency mode, you can still opt in to high consistency on a per-request basis when required.

> If possible, we might say a high consistency stream cannot have a mirror with `MirrorDirect` enabled?

Not sure if this is possible. Mirrors can be created on a leaf node without even having a connection to what they mirror from. Especially if we allow updating the `ReadConsistency` value, there's no way to enforce that a mirror isn't created against it.
> Indeed, if the stream is in strict high consistency mode only, there's no way to disable that.

Good stuff.
Re mirrors, that's what I expected, so it might be a combo of documentation and, for example, not allowing `MirrorDirect` to be set if you enable high consistency on the mirror, or something similar (just an example, probably a bad one). The idea is to have some way to at least communicate that this is weird and shouldn't be allowed.
Is something like logical timestamps[1][2] in scope? They provide the ability for replicas to respond to queries causally with only a short delay. [1] https://cse.msu.edu/~sandeep/publications/wss01/main.pdf

@jpsugar-flow We have not considered logical timestamps at this point, nor causal consistency. So thanks for the feedback and suggestions!

Thanks for the reply! The clock guarantees are a problem for some users, but I think they can (with some caveats) be satisfied by users in major cloud databases. If there's general interest I could start a discussion thread somewhere to not spam up this one?
This ADR PR is updated with a concrete design, marking as ready for review.

(I imagine the 'Discussion' section will need to be removed prior to merge to not clutter the final doc.)
This PR describes the current read consistencies of JetStream, particularly when using `AllowDirect` and `MirrorDirect`, which give low consistency guarantees. For 2.14 we should support more configurability or, in general, higher levels of consistency as opt-in. This is intended as a replacement for #358, which proposed very manual client-side tracking of received sequences. The desired outcome would be a less manual approach.

This PR is meant to document and start discussion around these topics, specifically asking three high-level questions:

- What would be expected given different topologies?
- Should all read requests get higher consistency, or only a few?
- What should the performance considerations be?
Based on the shared feedback we can more confidently decide what direction to go in.