Coordinator can't stop in time, because of background jobs are still running #5274

Open
han-ian opened this issue Jul 6, 2022 · 6 comments · May be fixed by #5341
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

han-ian commented Jul 6, 2022

Bug Report

I have posted about this on the forum; see https://asktug.com/t/topic/694191/3 .
In short, the problem is that when the coordinator is stopping, the schedulers keep running until all of their jobs have finished.

The following figure shows that it takes a long time for the coordinator to stop, in this case more than 10 minutes from start to finish.
[screenshot omitted]

I think the root cause is that the schedulers and other background jobs have no way to receive a signal to exit while the coordinator is stopping.
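
For illustration, here is a minimal Go sketch of that shape of problem (a simplified model, not PD's actual code): the scheduler loop only checks the quit channel between whole batches of jobs, so stopping the coordinator blocks until the current batch completes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type coordinator struct {
	quit chan struct{}
	wg   sync.WaitGroup
}

func (c *coordinator) runScheduler(jobs []func()) {
	defer c.wg.Done()
	for {
		select {
		case <-c.quit:
			return
		default:
		}
		// The quit channel is only checked here, between batches, so a
		// long batch of jobs delays shutdown until the batch completes.
		for _, job := range jobs {
			job()
		}
	}
}

func (c *coordinator) stop() {
	close(c.quit)
	c.wg.Wait() // blocks until all background loops return
}

func main() {
	c := &coordinator{quit: make(chan struct{})}
	jobs := make([]func(), 100)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(10 * time.Millisecond) }
	}
	c.wg.Add(1)
	go c.runScheduler(jobs)

	time.Sleep(50 * time.Millisecond)
	start := time.Now()
	c.stop()
	// Takes roughly the remainder of the in-flight batch (~1 second here),
	// even though stop() was requested almost immediately.
	fmt.Println("stop took", time.Since(start))
}
```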

What did you do?

The PD leader lease timed out.
[screenshot omitted]

What did you expect to see?

A follower campaigns for leader in time.

What did you see instead?

The follower could not campaign for leader for more than 10 minutes and kept printing the following log.
[screenshot omitted]

What version of PD are you using (pd-server -V)?

5.3.0

han-ian added the type/bug (The issue is confirmed as a bug.) label on Jul 6, 2022
han-ian changed the title from "Coordinator can't stop in time, because of background jobs is still running" to "Coordinator can't stop in time, because of background jobs are still running" on Jul 6, 2022

nolouch (Contributor) commented Jul 6, 2022

Thanks, @in-han. I think the coordinator running too long does not prevent a follower from becoming the leader; the follower should watch for the leader key to expire. Could you show more detailed logs?

han-ian (Author) commented Jul 6, 2022

> Thanks, @in-han. I think the coordinator running too long does not prevent a follower from becoming the leader; the follower should watch for the leader key to expire. Could you show more detailed logs?

Yes, a follower can watch for key expiration. But a follower can only campaign for leader when two conditions hold: a) the PD leader key has expired; b) the follower is the etcd leader. In this case, the old PD leader was still the etcd leader.

These are the logs:
[screenshot omitted]
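
For illustration, here is a minimal Go sketch of those two conditions, using the etcd clientv3 Watch API. The leader key path and the isEtcdLeader helper are assumptions for the example, not PD's actual code:

```go
package main

import (
	"context"
	"log"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// isEtcdLeader is a hypothetical stand-in: in PD the follower compares its
// member ID with the embedded etcd server's current leader ID.
func isEtcdLeader() bool { return false }

func campaign() { log.Println("campaigning for PD leader") }

func watchLeaderKey(cli *clientv3.Client, leaderKey string) {
	ctx := context.Background()
	for resp := range cli.Watch(ctx, leaderKey) {
		for _, ev := range resp.Events {
			// Condition a): the PD leader key expired and was deleted.
			if ev.Type != mvccpb.DELETE {
				continue
			}
			// Condition b): this member must also be the etcd leader;
			// otherwise it keeps waiting (the situation in this issue,
			// where the old PD was still the etcd leader).
			if !isEtcdLeader() {
				log.Println("leader key expired, but not etcd leader; cannot campaign")
				continue
			}
			campaign()
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	watchLeaderKey(cli, "/pd/leader") // assumed key path for the example
}
```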

nolouch (Contributor) commented Jul 6, 2022

Why did the lease expire while the old PD was still the leader? Was there a re-election in which the old PD became the leader again?

BTW, 2000 TiKVs is the largest cluster size I have seen, amazing! Here is an issue tracking improvements to the performance of balance-region: #3744. If you are interested, maybe you can help us.

han-ian (Author) commented Jul 7, 2022

The PD leader lease timed out because it hit a timeout when writing a key to the embedded etcd. You are right, in this case the old PD became the leader again.

Ah, after deployment, we also feel that this cluster is too large!

I think there may be two ways to resolve this problem:
a) Speed up coordinator shutdown (see the sketch after this list). There is already a PR that optimized this, but it wasn't enough. I have some ideas and will try to make a PR.
b) As you mentioned, optimizing scheduler performance can also resolve this problem; I will take a look at it.
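
As a rough illustration of approach (a), here is a minimal Go sketch (an assumed shape of the fix, not the actual PR): pass a context into each background job loop and check it between individual jobs, so cancelling the context lets the coordinator stop promptly instead of waiting for a whole batch to finish.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type coordinator struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func (c *coordinator) runScheduler(ctx context.Context, jobs []func()) {
	defer c.wg.Done()
	for {
		for _, job := range jobs {
			// Checking ctx between individual jobs (not only between
			// batches) is what lets shutdown happen in time.
			select {
			case <-ctx.Done():
				return
			default:
			}
			job()
		}
	}
}

func (c *coordinator) stop() {
	c.cancel()
	c.wg.Wait() // now returns after at most one in-flight job
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	c := &coordinator{cancel: cancel}
	jobs := make([]func(), 100)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(10 * time.Millisecond) }
	}
	c.wg.Add(1)
	go c.runScheduler(ctx, jobs)

	time.Sleep(50 * time.Millisecond)
	start := time.Now()
	c.stop()
	fmt.Println("stop took", time.Since(start)) // at most ~one job duration
}
```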

han-ian (Author) commented Jul 7, 2022

@nolouch Thanks for your reply.

mayjiang0203 commented

/remove-type bug
/type enhancement

ti-chi-bot added the type/enhancement (The issue or PR belongs to an enhancement.) label and removed the type/bug label on Jul 15, 2022