fix race condition between forget and failover #105
Conversation
Signed-off-by: yang.qiu <[email protected]>
Signed-off-by: Björn Svensson <[email protected]>
bjosv
left a comment
When running e2e tests in CI I hit the error in ~1 of 4 runs,
but with this fix I'm able to run 20 runs without any errors.
Seems to fix our problem!
internal/valkey/clusterstate.go
// MasterIdFromSelf returns the master node ID that this node reports as its
// own master in CLUSTER NODES (fields[3] of the "myself" line). Returns "-"
// for masters and the master's node ID for replicas.
func (n *NodeState) MasterIdFromSelf() string {
Suggested change:
- func (n *NodeState) MasterIdFromSelf() string {
+ func (n *NodeState) GetPrimaryId() string {
continue
}
// A live replica still considers this failing node its
// master. Forgetting it from the other masters now would
Replace master with primary
internal/valkey/clusterstate.go
// HasReplicaOf returns true if any live node in the cluster state reports
// itself as a replica of the given node ID. This is used to prevent
// CLUSTER FORGET from racing with auto-failover: forgetting a failed
// primary from other masters removes it from their node tables, which
Replace masters with e.g. replica nodes
@bjosv Silly question, how did you run the CI e2e tests many times over? Create your own branch that pulled these changes and repeated them over and over?
I just ran my own branch with a change in the Makefile to run go test 20 times (incl. the fix). Runtime: 1h! 😩 https://github.com/Nordix/valkey-operator/actions/runs/22959266171
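For future reference, `go test` has a built-in way to repeat a suite in one invocation, which avoids looping in the Makefile; this is a sketch, and the e2e package path below is a placeholder, not the operator's actual layout:

```shell
# Repeat each test 20 times in a single run; -count also disables test
# result caching, so every iteration really executes.
go test -count=20 -timeout 120m ./test/e2e/...
```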
…over-bug-fix Signed-off-by: Joseph Heyburn <[email protected]>
Summary
Fix a race condition between `forgetStaleNodes` and Valkey's auto-failover that can permanently prevent a replica from being promoted after its primary dies. See #103 for more context.

The bug
When a primary's deployment is deleted, the controller's `forgetStaleNodes` issues `CLUSTER FORGET` for the dead node from every surviving node. If this runs before Valkey's auto-failover election completes, it removes the dead primary from the other primaries' node tables. Those primaries can then no longer validate the replica's `FAILOVER_AUTH_REQUEST` (they don't recognize the dead node), so they never vote. The replica is permanently stuck as a replica, `findShardPrimary` never finds a primary for the shard, and the cluster enters an infinite reconcile loop.
This is a timing-dependent race: the window is roughly 0.5–1 second between the `fail` flag being set and the failover election completing. It was reported by a user who hit it when deleting a primary deployment.

The fix
Before issuing `CLUSTER FORGET`, check whether any live node in the cluster still reports the failing node as its primary (`HasReplicaOf`). If so, skip the FORGET: the replica needs the dead node in the other primaries' node tables to complete the failover election. Once the failover completes and the replica is promoted, it no longer reports itself as a replica of the dead node, so the next reconcile proceeds with the FORGET as usual.

Changes
- `internal/valkey/clusterstate.go`: add a `HasReplicaOf(nodeId)` method on `ClusterState` that checks whether any node's `CLUSTER NODES` self-report shows it as a replica of the given node ID, and a `MasterIdFromSelf()` helper on `NodeState` that extracts `fields[3]` (the primary's node ID) from the `myself` line.
- `internal/controller/valkeycluster_controller.go`: guard `forgetStaleNodes` with the `HasReplicaOf` check. When skipped, log `"skipping forget; failover pending for node"` at V(1).

Why this is safe
- `HasReplicaOf` returns false → FORGET proceeds immediately. No behavior change.
- The failed node's replica is excluded from `state.Shards` (connection failed) → `HasReplicaOf` returns false → FORGET proceeds. Correct: no failover is possible anyway.
- `HasReplicaOf` returns false → FORGET proceeds. No behavior change.
- `HasReplicaOf` returns true → FORGET is deferred. This is no worse than today, where FORGET runs but the failover is also permanently blocked; with this fix, the failover at least has a chance if the blocking condition resolves.
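To make the mechanics concrete, here is a minimal self-contained sketch of the two helpers and the guard. The types are pared down: the real `NodeState` and `ClusterState` in `internal/valkey/clusterstate.go` carry more fields, and the controller wiring is reduced to a plain function with a stand-in `clusterForget` callback.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeState holds one node's raw CLUSTER NODES output (pared down).
type NodeState struct {
	ClusterNodes string
}

// MasterIdFromSelf extracts fields[3] of the "myself" line: the node ID of
// this node's primary, or "-" if the node is itself a primary. Returns ""
// when no "myself" line is present.
func (n *NodeState) MasterIdFromSelf() string {
	for _, line := range strings.Split(n.ClusterNodes, "\n") {
		fields := strings.Fields(line)
		// fields[2] is the flags field, e.g. "myself,slave" or "myself,master".
		if len(fields) > 3 && strings.Contains(fields[2], "myself") {
			return fields[3]
		}
	}
	return ""
}

// ClusterState is the operator's view of the reachable nodes.
type ClusterState struct {
	Nodes []*NodeState
}

// HasReplicaOf reports whether any live node still claims nodeId as its
// primary, i.e. a failover election for nodeId may still be pending.
func (s *ClusterState) HasReplicaOf(nodeId string) bool {
	for _, n := range s.Nodes {
		if n.MasterIdFromSelf() == nodeId {
			return true
		}
	}
	return false
}

// forgetStaleNode shows the guard: defer CLUSTER FORGET while a replica
// still references the stale node, otherwise issue it.
func forgetStaleNode(s *ClusterState, staleId string, clusterForget func(string)) {
	if s.HasReplicaOf(staleId) {
		// Forgetting now would strip staleId from the voters' node tables
		// and block the election; retry on the next reconcile instead.
		fmt.Println("skipping forget; failover pending for node", staleId)
		return
	}
	clusterForget(staleId)
}

func main() {
	replica := &NodeState{ClusterNodes: "aaa 10.0.0.2:6379@16379 myself,slave deadbeef 0 0 1 connected"}
	primary := &NodeState{ClusterNodes: "bbb 10.0.0.3:6379@16379 myself,master - 0 0 2 connected 0-5460"}
	state := &ClusterState{Nodes: []*NodeState{replica, primary}}

	// The replica still reports deadbeef as its primary, so FORGET is deferred.
	forgetStaleNode(state, "deadbeef", func(id string) { fmt.Println("FORGET", id) })
}
```

Once the failover promotes the replica, its `CLUSTER NODES` self-report changes to `myself,master` with `-` in the primary field, `HasReplicaOf` returns false, and the same call issues the FORGET.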