PD may not be able to sense the change of internal raft leader in etcd in time #7780

Open
JmPotato opened this issue Jan 31, 2024 · 0 comments · May be fixed by #7846
Labels
type/enhancement The issue or PR belongs to an enhancement.

Enhancement Task

pd-2.log:[2024/01/28 11:13:33.947 +08:00] [INFO] [raft.go:706] ["fef5444e2c4d3d9c became follower at term 4"]
pd-1.log:[2024/01/28 11:13:33.953 +08:00] [INFO] [raft.go:771] ["a22604cca51ee334 became leader at term 4"]

pd-2 lost etcd leadership at 11:13:33.947, and pd-1 was elected as the new etcd leader at 11:13:33.953. However, pd-2 did not step down as PD leader until 11:18:30.646, nearly five minutes later, after which pd-1 was elected as the new PD leader.

pd-2.log:[2024/01/28 11:18:30.646 +08:00] [INFO] [server.go:1687] ["etcd leader changed, resigns pd leadership"] [old-pd-leader-name=tc-pd-2]
pd-1.log:[2024/01/28 11:18:31.627 +08:00] [INFO] [server.go:1529] ["pd leader has changed, try to re-campaign a pd leader"]
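
For reference, the gap between the etcd leader change and the PD leader resignation can be computed directly from the log timestamps above. This is only a minimal sketch; the helper name detectionDelay is made up for this illustration and is not part of PD.

```go
package main

import (
	"fmt"
	"time"
)

// logTimeLayout matches the timestamp format used in the PD log lines above.
const logTimeLayout = "2006/01/02 15:04:05.000 -07:00"

// detectionDelay (hypothetical helper) returns how long it took from the
// etcd leader change until the old PD leader resigned.
func detectionDelay(etcdChange, pdResign string) (time.Duration, error) {
	t0, err := time.Parse(logTimeLayout, etcdChange)
	if err != nil {
		return 0, err
	}
	t1, err := time.Parse(logTimeLayout, pdResign)
	if err != nil {
		return 0, err
	}
	return t1.Sub(t0), nil
}

func main() {
	d, err := detectionDelay(
		"2024/01/28 11:13:33.947 +08:00", // pd-2 became follower at term 4
		"2024/01/28 11:18:30.646 +08:00", // pd-2 resigned PD leadership
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // prints 4m56.699s
}
```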

The PD leader has only one check for detecting that the etcd leader has changed:

pd/server/server.go

Lines 1805 to 1809 in 1c54865

etcdLeader := s.member.GetEtcdLeader()
if etcdLeader != s.member.ID() {
	log.Info("etcd leader changed, resigns pd leadership", zap.String("old-pd-leader-name", s.Name()))
	return
}

So it is reasonable to conclude that m.etcd.Server.Lead(), which backs this check, may keep returning the old etcd leader for some time after the election has already finished.

In the etcd code, there is a small detail inside the raft Ready preparation:

https://github.com/etcd-io/etcd/blob/85b640cee793e25f3837c47200089d14a8392dc7/raft/node.go#L311-L322

So it is possible that the raw raft node has already finished the election while the etcd server above it cannot yet obtain the latest soft state to apply, so the new etcd leader is only set much later. This is more likely to happen when chaos such as an IO hang is injected into etcd.
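
The suspected mechanism can be illustrated with a toy model. This sketch is not etcd's code: the types toyNode and toyServer and all their methods are hypothetical. It assumes that the linked lines refer to the rule that no new Ready is offered while the previous one is still waiting to be advanced. Under that assumption, a server stuck on IO keeps reporting the old leader even though raw raft has already elected a new one.

```go
package main

import "fmt"

// ready is a toy stand-in for one raft Ready batch.
// lead is nonzero only when the batch carries a leader change (soft state).
type ready struct {
	lead    uint64
	entries int
}

// toyNode models only the gating behavior: no new Ready is handed out
// while the previous one is still waiting for advance().
type toyNode struct {
	raftLead    uint64 // leader according to the raw raft state machine
	emittedLead uint64 // last leader value handed to the server
	pending     int    // entries waiting to be persisted
	inFlight    bool   // a Ready was handed out and not yet advanced
}

// poll returns the next Ready, or false if nothing can be emitted yet.
func (n *toyNode) poll() (ready, bool) {
	if n.inFlight {
		return ready{}, false // previous Ready not advanced: nothing new is offered
	}
	if n.pending == 0 && n.raftLead == n.emittedLead {
		return ready{}, false // nothing changed since the last Ready
	}
	rd := ready{entries: n.pending}
	if n.raftLead != n.emittedLead {
		rd.lead = n.raftLead
		n.emittedLead = n.raftLead
	}
	n.pending = 0
	n.inFlight = true
	return rd, true
}

// advance acknowledges the in-flight Ready so the next one can be emitted.
func (n *toyNode) advance() { n.inFlight = false }

// toyServer models the upper layer; lead is what a Lead()-style getter would return.
type toyServer struct{ lead uint64 }

func (s *toyServer) apply(rd ready) {
	if rd.lead != 0 {
		s.lead = rd.lead
	}
}

// simulate returns the leader the server reports while its IO hangs
// and the leader it reports once the hang has cleared.
func simulate() (duringHang, afterHang uint64) {
	const pd1, pd2 = 1, 2
	n := &toyNode{raftLead: pd2, emittedLead: pd2, pending: 3}
	s := &toyServer{lead: pd2}

	rd, _ := n.poll() // the server picks up a batch of entries...
	s.apply(rd)       // ...and then hangs on disk IO before calling advance()

	n.raftLead = pd1 // meanwhile raw raft finishes an election: pd-1 is leader
	if _, ok := n.poll(); ok {
		panic("unexpected Ready while the previous one is in flight")
	}
	duringHang = s.lead

	n.advance() // the IO hang clears and the old Ready is finally advanced
	if rd, ok := n.poll(); ok {
		s.apply(rd)
	}
	afterHang = s.lead
	return duringHang, afterHang
}

func main() {
	during, after := simulate()
	fmt.Println("leader seen during hang:", during) // prints 2 (stale)
	fmt.Println("leader seen after hang:", after)   // prints 1
}
```

In this model the length of the stale window is bounded only by how long the server takes to advance the previous Ready, which would be consistent with the multi-minute delay seen in the logs under an injected IO hang.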

@JmPotato JmPotato added the type/enhancement The issue or PR belongs to an enhancement. label Jan 31, 2024