Hi @efeg ,
I am rebalancing an 8-node Kafka cluster with 4,000+ topics and 120,000+ replicas. After I triggered the rebalance, I noticed about 5,000+ replica move tasks, 2,000+ leader replica moves, and about 50 GB of data to move.
The data movement is fast, but the leader replica moves are quite slow (by eyeballing them, about 3 to 10 seconds per task). With 7,000+ tasks, the whole rebalance will take about 5-8 hours. All the machines have decent hardware (CPU, disk, network, geo-location), and the maximum number of concurrent tasks does not exceed 40. Each batch of tasks (inter-broker partition movements) generally finishes in 30 to 120 seconds.
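As a rough sanity check on that estimate, here is a minimal back-of-the-envelope sketch. It uses only the figures above (7,000 tasks, 3 to 10 seconds per task, at most 40 concurrent tasks). The helper `estimate_hours` is hypothetical, and the model is an assumption: it ignores batching overhead and assumes every task in a batch takes the same time.

```python
import math

def estimate_hours(num_tasks, sec_low, sec_high, concurrency=1):
    """Return (low, high) wall-clock hours for num_tasks tasks.

    Assumption: tasks run in batches of `concurrency`, and every task
    in a batch takes between sec_low and sec_high seconds.
    """
    batches = math.ceil(num_tasks / concurrency)
    return (batches * sec_low / 3600, batches * sec_high / 3600)

# Tasks completing one at a time, as eyeballed.
serial_low, serial_high = estimate_hours(7000, 3, 10)
print(f"serial: {serial_low:.1f} to {serial_high:.1f} hours")

# The same tasks if 40 of them really progressed in parallel.
par_low, par_high = estimate_hours(7000, 3, 10, concurrency=40)
print(f"40 concurrent: {par_low:.2f} to {par_high:.2f} hours")
```

Under these assumptions, the serial case at 3 seconds per task comes out to roughly 5.8 hours, which is in line with the 5-8 hour figure. A fully parallel run of 40 tasks would finish well under an hour.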
My question: is this duration expected, or is there something else I can do to speed it up?
I tried following your suggestion in "How to speed up rebalance executor", but from what I observed, it leads to admin client timeouts when too many tasks run concurrently.
Previously, I thought the executor speed was capped by data movement, but I was wrong. I think it is actually capped by the partition/replica movements themselves (registration changes). Please correct me if I am wrong.
Please share your thoughts; I would appreciate any suggestions.
Thank you in advance.