fix race condition between forget and failover #105
Conversation
Signed-off-by: yang.qiu <[email protected]>
Signed-off-by: Björn Svensson <[email protected]>
bjosv
left a comment
When running e2e tests in CI I hit the error in ~1 of 4 runs,
but with this fix I'm able to run 20 runs without any errors.
Seems to fix our problem!
internal/valkey/clusterstate.go
// MasterIdFromSelf returns the master node ID that this node reports as its
// own master in CLUSTER NODES (fields[3] of the "myself" line). Returns "-"
// for masters and the master's node ID for replicas.
func (n *NodeState) MasterIdFromSelf() string {
Suggested change:
- func (n *NodeState) MasterIdFromSelf() string {
+ func (n *NodeState) GetPrimaryId() string {
continue
}
// A live replica still considers this failing node its
// master. Forgetting it from the other masters now would
Replace master with primary
internal/valkey/clusterstate.go
// HasReplicaOf returns true if any live node in the cluster state reports
// itself as a replica of the given node ID. This is used to prevent
// CLUSTER FORGET from racing with auto-failover: forgetting a failed
// primary from other masters removes it from their node tables, which
Replace masters with e.g. replica nodes
@bjosv Silly question, how did you run the CI e2e tests many times over? Create your own branch that pulled these changes and repeated them over and over?
I just ran my own branch with a change in the Makefile to run go test 20 times (incl. the fix). Runtime: 1h! 😩 https://github.com/Nordix/valkey-operator/actions/runs/22959266171
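For future reference, `go test` has a built-in way to repeat a suite in one invocation, which avoids looping in the Makefile; this is a sketch, and the e2e package path below is a placeholder, not the operator's actual layout:

```shell
# Repeat each test 20 times in a single run; -count also disables test
# result caching, so every iteration really executes.
go test -count=20 -timeout 120m ./test/e2e/...
```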
…over-bug-fix Signed-off-by: Joseph Heyburn <[email protected]>
Summary
Fix a race condition between `forgetStaleNodes` and Valkey's auto-failover that can permanently prevent a replica from being promoted after its primary dies. See #103 for more context.

The bug
When a primary's deployment is deleted, the controller's `forgetStaleNodes` issues `CLUSTER FORGET` for the dead node from every surviving node. If this runs before Valkey's auto-failover election completes, it removes the dead primary from the other primaries' node tables. Those primaries can then no longer validate the replica's `FAILOVER_AUTH_REQUEST` (they don't recognize the dead node), so they never vote. The replica is permanently stuck as a replica, `findShardPrimary` never finds a primary for the shard, and the cluster enters an infinite reconcile loop.
This is a timing-dependent race: the window is roughly 0.5–1 second between the `fail` flag being set and the failover election completing. It was reported by a user who hit it when deleting a primary deployment.

The fix
Before issuing `CLUSTER FORGET`, check whether any live node in the cluster still reports the failing node as its primary (`HasReplicaOf`). If so, skip the FORGET: the replica needs the dead node in the other primaries' node tables to complete the failover election. Once the failover completes and the replica is promoted, it no longer reports itself as a replica of the dead node, so the next reconcile proceeds with the FORGET as usual.

Changes
- `internal/valkey/clusterstate.go`: add a `HasReplicaOf(nodeId)` method on `ClusterState` that checks whether any node's `CLUSTER NODES` self-report shows it as a replica of the given node ID, and a `MasterIdFromSelf()` helper on `NodeState` that extracts `fields[3]` (the primary's node ID) from the `myself` line.
- `internal/controller/valkeycluster_controller.go`: guard `forgetStaleNodes` with the `HasReplicaOf` check. When skipped, log `"skipping forget; failover pending for node"` at V(1).

Why this is safe
- `HasReplicaOf` returns false → FORGET proceeds immediately. No behavior change.
- The failed node's replica is excluded from `state.Shards` (connection failed) → `HasReplicaOf` returns false → FORGET proceeds. Correct: no failover is possible anyway.
- `HasReplicaOf` returns false → FORGET proceeds. No behavior change.
- `HasReplicaOf` returns true → FORGET is deferred. This is no worse than today, where FORGET runs but the failover is also permanently blocked; with this fix, the failover at least has a chance if the blocking condition resolves.
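To make the mechanics concrete, here is a minimal self-contained sketch of the two helpers and the guard. The types are pared down: the real `NodeState` and `ClusterState` in `internal/valkey/clusterstate.go` carry more fields, and the controller wiring is reduced to a plain function with a stand-in `clusterForget` callback.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeState holds one node's raw CLUSTER NODES output (pared down).
type NodeState struct {
	ClusterNodes string
}

// MasterIdFromSelf extracts fields[3] of the "myself" line: the node ID of
// this node's primary, or "-" if the node is itself a primary. Returns ""
// when no "myself" line is present.
func (n *NodeState) MasterIdFromSelf() string {
	for _, line := range strings.Split(n.ClusterNodes, "\n") {
		fields := strings.Fields(line)
		// fields[2] is the flags field, e.g. "myself,slave" or "myself,master".
		if len(fields) > 3 && strings.Contains(fields[2], "myself") {
			return fields[3]
		}
	}
	return ""
}

// ClusterState is the operator's view of the reachable nodes.
type ClusterState struct {
	Nodes []*NodeState
}

// HasReplicaOf reports whether any live node still claims nodeId as its
// primary, i.e. a failover election for nodeId may still be pending.
func (s *ClusterState) HasReplicaOf(nodeId string) bool {
	for _, n := range s.Nodes {
		if n.MasterIdFromSelf() == nodeId {
			return true
		}
	}
	return false
}

// forgetStaleNode shows the guard: defer CLUSTER FORGET while a replica
// still references the stale node, otherwise issue it.
func forgetStaleNode(s *ClusterState, staleId string, clusterForget func(string)) {
	if s.HasReplicaOf(staleId) {
		// Forgetting now would strip staleId from the voters' node tables
		// and block the election; retry on the next reconcile instead.
		fmt.Println("skipping forget; failover pending for node", staleId)
		return
	}
	clusterForget(staleId)
}

func main() {
	replica := &NodeState{ClusterNodes: "aaa 10.0.0.2:6379@16379 myself,slave deadbeef 0 0 1 connected"}
	primary := &NodeState{ClusterNodes: "bbb 10.0.0.3:6379@16379 myself,master - 0 0 2 connected 0-5460"}
	state := &ClusterState{Nodes: []*NodeState{replica, primary}}

	// The replica still reports deadbeef as its primary, so FORGET is deferred.
	forgetStaleNode(state, "deadbeef", func(id string) { fmt.Println("FORGET", id) })
}
```

Once the failover promotes the replica, its `CLUSTER NODES` self-report changes to `myself,master` with `-` in the primary field, `HasReplicaOf` returns false, and the same call issues the FORGET.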