Skip to content

fix race condition between forget and failover#105

Merged
jdheyburn merged 3 commits intovalkey-io:mainfrom
ysqyang:failover-bug-fix
Apr 10, 2026
Merged

fix race condition between forget and failover#105
jdheyburn merged 3 commits intovalkey-io:mainfrom
ysqyang:failover-bug-fix

Conversation

@ysqyang
Copy link
Copy Markdown
Contributor

@ysqyang ysqyang commented Mar 4, 2026

Summary

Fix a race condition between forgetStaleNodes and Valkey's auto-failover that can permanently prevent a replica from being promoted after its primary dies. See #103 for more context.

The bug

When a primary's deployment is deleted, the controller's forgetStaleNodes issues CLUSTER FORGET for the dead node from every surviving node. If this runs before Valkey's auto-failover election completes, it removes the dead primary from the other masters' node tables. Those masters can then no longer validate the replica's FAILOVER_AUTH_REQUEST (they don't recognize the dead node), so they never vote. The replica is permanently stuck as a slave, findShardPrimary never finds a primary for the shard, and the cluster enters an infinite loop of:

ERROR  command failed: CLUSTER FORGET  {"error": "Can't forget my master!"}
DEBUG  skipping replica; primary not ready yet
DEBUG  missing replicas, requeue..

This is a timing-dependent race. The window is roughly 0.5–1 second between the fail flag being set and the failover election completing. It was reported by a user who hit it when deleting a primary deployment.

The fix

Before issuing CLUSTER FORGET, check whether any live node in the cluster still considers the failing node as its master (HasReplicaOf). If so, skip the FORGET — the replica needs the dead node in the other masters' node tables to complete the failover election. Once the failover completes and the replica is promoted, it no longer reports itself as a slave of the dead node, so the next reconcile will proceed with FORGET normally.

Changes

  • internal/valkey/clusterstate.go: Add HasReplicaOf(nodeId) method on ClusterState that checks if any node's CLUSTER NODES self-report shows it as a replica of the given node ID. Add MasterIdFromSelf() helper on NodeState that extracts fields[3] (master ID) from the myself line.
  • internal/controller/valkeycluster_controller.go: Guard forgetStaleNodes with the HasReplicaOf check. When skipped, log "skipping forget; failover pending for node" at V(1).

Why this is safe

  • Dead replica (not a master): No node claims a replica as its master → HasReplicaOf returns false → FORGET proceeds immediately. No behavior change.
  • Both master and replica are dead: The dead replica isn't in state.Shards (connection failed) → HasReplicaOf returns false → FORGET proceeds. Correct — no failover is possible anyway.
  • Scale-down stale nodes: Drained masters have no replicas left → HasReplicaOf returns false → FORGET proceeds. No behavior change.
  • Failover permanently blocked for other reasons (e.g., replica too far behind): HasReplicaOf returns true, FORGET is deferred. This is no worse than today where FORGET runs but the failover is also permanently blocked. With this fix, at least the failover has a chance if the blocking condition resolves.

@ysqyang ysqyang force-pushed the failover-bug-fix branch from fe6c82e to 6097785 Compare March 4, 2026 19:05
bjosv added a commit to Nordix/valkey-operator that referenced this pull request Mar 11, 2026
Copy link
Copy Markdown
Collaborator

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running e2e tests in CI I hit the error in ~1 of 4 runs,
but with this fix I'm able to run 20 runs without any errors.
Seems to fix our problem!

// MasterIdFromSelf returns the master node ID that this node reports as its
// own master in CLUSTER NODES (fields[3] of the "myself" line). Returns "-"
// for masters and the master's node ID for replicas.
func (n *NodeState) MasterIdFromSelf() string {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (n *NodeState) MasterIdFromSelf() string {
func (n *NodeState) GetPrimaryId() string {

continue
}
// A live replica still considers this failing node its
// master. Forgetting it from the other masters now would
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace master with primary

// HasReplicaOf returns true if any live node in the cluster state reports
// itself as a replica of the given node ID. This is used to prevent
// CLUSTER FORGET from racing with auto-failover: forgetting a failed
// primary from other masters removes it from their node tables, which
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace masters with e.g. replica nodes

@jdheyburn
Copy link
Copy Markdown
Collaborator

jdheyburn commented Mar 12, 2026

@bjosv Silly question, how did you run the CI e2e tests many times over? Create your own branch that pulled these changes and repeated them over and over?

@bjosv
Copy link
Copy Markdown
Collaborator

bjosv commented Mar 13, 2026

@bjosv Silly question, how did you run the CI e2e tests many times over? Create your own branch that pulled these changes and repeated them over and over?

I just ran an own branch with a change in the makefile to run go test 20 times (incl. the fix)
main...Nordix:valkey-operator:refs/heads/collect-valkey-logs

Runtime: 1h ! 😩 https://github.com/Nordix/valkey-operator/actions/runs/22959266171

@jdheyburn jdheyburn marked this pull request as ready for review March 13, 2026 15:13
Copy link
Copy Markdown
Collaborator

@jdheyburn jdheyburn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased from main to pull in the ValkeyNode changes. We've discussed on a previous tech call that we can get this merged in once it had been rebased. Thank you @ysqyang!

@jdheyburn jdheyburn merged commit 2042a47 into valkey-io:main Apr 10, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants