Understanding Executor Bottleneck #1631

Open
qz-fordham opened this issue Jul 27, 2021 · 0 comments

Hi @efeg,

I am working on rebalancing an 8-node Kafka cluster with 4,000+ topics and 120,000K+ replicas. After I triggered the rebalance, I noticed that it involves 5,000+ replica move tasks, 2,000+ leader replica moves, and about 50 GB of data to move.

The data movement is fast, but the leader replica moves are quite slow (by eyeballing them, about 3-10 seconds per task). With 7,000+ tasks, the whole rebalance will take about 5-8 hours. All the machines have decent hardware (CPU, disk, network, geo-location), and the maximum number of concurrent tasks does not exceed 40. Each batch of tasks (inter-broker partition movements) generally finishes in 30 to 120 seconds.
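
For reference, here is a rough back-of-the-envelope sketch of how these numbers add up. It assumes the replica moves run in full batches of 40 and the leader replica moves run one at a time. Both are simplifications on my side, not a description of how the executor actually schedules tasks, and `estimate_hours` is just a hypothetical helper name.

```python
import math


def estimate_hours(replica_tasks, leader_tasks, batch_size, batch_secs, leader_secs):
    """Return (low, high) wall-clock hours for a rebalance.

    Assumption: replica moves run in full batches and leader moves run
    strictly one after another. This is a simplification for estimation only.
    """
    batches = math.ceil(replica_tasks / batch_size)
    # Lower bound uses the fastest observed timings, upper bound the slowest.
    low = batches * batch_secs[0] + leader_tasks * leader_secs[0]
    high = batches * batch_secs[1] + leader_tasks * leader_secs[1]
    return low / 3600, high / 3600


# Figures from this issue: 5,000 replica moves, 2,000 leader moves,
# at most 40 concurrent tasks, 30-120 s per batch, 3-10 s per leader move.
low, high = estimate_hours(5000, 2000, 40, (30, 120), (3, 10))
print(f"{low:.1f} to {high:.1f} hours")
```

Under these assumptions the leader replica moves account for more than half of the total time at both ends of the range, which matches what I see in practice.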

My question is: is this duration expected, or is there something else I can do to speed it up?

I tried to follow your suggestion in "How to speed up rebalance executor", but from what I observed, it leads to admin client timeouts when too many concurrent tasks are running.

Previously, I thought the executor speed was capped by data movement, but I was wrong. I think it is actually capped by the partition/replica movements (registration changes). Please correct me if I am wrong.

Please share your thoughts; I would appreciate any suggestions.

Thank you in advance.
