Understanding Executor Bottleneck #1631

qz-fordham · 2021-07-27T20:27:46Z

I am working on rebalancing an 8-node Kafka cluster with 4,000+ topics and 120,000+ replicas. After triggering the rebalance, I noticed that there are 5,000+ replica move tasks, 2,000+ leader replica moves, and about 50 GB of data to move.

The data movement is fast, but the leader replica moves are quite slow (from eyeballing them, about 3-10 seconds per task). With 7,000+ tasks, the whole rebalance will take about 5-8 hours. All the machines have decent hardware (CPU, disk, network, geo-location), and the maximum number of concurrent tasks does not exceed 40. Each batch of tasks (inter-broker partition movements) generally finishes in 30 to 120 seconds.
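A back-of-the-envelope sketch of this estimate, using the figures above. It rests on two simplifying assumptions that are not confirmed by the executor's actual scheduling: replica moves run in full batches of 40, one batch at a time, and leader moves run strictly one after another. The function names are hypothetical and only serve the illustration.

```python
import math


def batched_duration_s(num_tasks, batch_size, batch_seconds):
    """Seconds needed if tasks run in fixed-size batches, one batch at a time (assumption)."""
    return math.ceil(num_tasks / batch_size) * batch_seconds


def sequential_duration_s(num_tasks, seconds_per_task):
    """Seconds needed if tasks run strictly one after another (assumption)."""
    return num_tasks * seconds_per_task


def estimate_hours(replica_moves, leader_moves, batch_size, batch_seconds, leader_seconds):
    """Total estimated rebalance time in hours under both assumptions."""
    total_s = (
        batched_duration_s(replica_moves, batch_size, batch_seconds)
        + sequential_duration_s(leader_moves, leader_seconds)
    )
    return total_s / 3600


if __name__ == "__main__":
    # Figures from the post: 5,000 replica moves, 2,000 leader moves,
    # at most 40 concurrent tasks, 30-120 s per batch, 3-10 s per leader move.
    low = estimate_hours(5000, 2000, 40, 30, 3)
    high = estimate_hours(5000, 2000, 40, 120, 10)
    print(f"estimated rebalance time: {low:.1f} to {high:.1f} hours")
```

Under these assumptions the range comes out to roughly 2.7 to 9.7 hours, which brackets the 5-8 hours estimated above. In the slow case the leader moves account for more than half of the total.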

My question is: is this duration expected, or is there something else I can do to speed things up?

I tried to follow your suggestions in "How to speed up rebalance executor", but from my observation, this leads to admin client timeouts when too many concurrent tasks are running.

Previously, I thought the executor speed was capped by data movement, but I was wrong. I think it is actually capped by the partition/replica movements themselves (the registration changes). Please correct me if I am wrong.

Please share your thoughts; I would appreciate any suggestions.

Thank you in advance.
