cruise control can't recognize topics which are newly created during the execution of remove_broker? #1708

Open
jiao-zhangS opened this issue Oct 4, 2021 · 4 comments
Labels
correctness A condition affecting the proper functionality.

Comments

@jiao-zhangS
Contributor

jiao-zhangS commented Oct 4, 2021

Hi, while remove_broker was executing for an online broker (call it broker A), some new topics were created. When the remove_broker execution finished, broker A still held some replicas of the new topics. I planned to run remove_broker again to clear the remaining replicas, but the reassignment plan did not include these topics. Also, kafka_cluster_state showed broker A holding 0 replicas (which doesn't reflect reality).
After restarting Cruise Control, kafka_cluster_state showed the correct number of remaining replicas.
Is this expected, or is it a bug?
Has anyone hit something similar? Btw, the version in use is 2.4.36.
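
For anyone trying to reproduce this, here is a minimal sketch of the two REST calls involved: reading kafka_cluster_state and requesting a dry-run remove_broker. The host name is hypothetical, and the port and `/kafkacruisecontrol` prefix are assumed to be the defaults; the helper names are made up for illustration. It only builds the URLs, so nothing is sent to a cluster.

```python
from urllib.parse import urlencode

# Hypothetical address; port 9090 and the /kafkacruisecontrol prefix are
# assumed defaults and may differ in your deployment.
BASE_URL = "http://cruise-control.example.com:9090/kafkacruisecontrol"


def build_url(endpoint, **params):
    """Return the full URL for a Cruise Control REST endpoint."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"


def cluster_state_url():
    # GET this URL to see the replica counts Cruise Control currently reports.
    return build_url("kafka_cluster_state", json="true")


def remove_broker_dry_run_url(broker_id):
    # POST to this URL; dryrun=true asks for a proposal without executing it.
    return build_url("remove_broker", brokerid=broker_id, dryrun="true", json="true")
```

Comparing the replica count from kafka_cluster_state against what the brokers actually host, and checking whether the dry-run proposal mentions the new topics, should be enough to tell whether the metadata is stale.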

@jiao-zhangS jiao-zhangS changed the title CC can't recognize newly created topic during the execution of remove_broker? cruise control can't recognize newly created topic during the execution of remove_broker? Oct 4, 2021
@jiao-zhangS jiao-zhangS changed the title cruise control can't recognize newly created topic during the execution of remove_broker? cruise control can't recognize newly created topics during the execution of remove_broker? Oct 4, 2021
@jiao-zhangS jiao-zhangS changed the title cruise control can't recognize newly created topics during the execution of remove_broker? cruise control can't recognize topics which are newly created during the execution of remove_broker? Oct 4, 2021
@efeg
Collaborator

efeg commented Oct 5, 2021

@jiao-zhangS Thanks for creating the issue.
This is not expected behavior, i.e. kafka_cluster_state should always show the latest metadata, and any follow-up removal should remove replicas (if any) from the to-be-removed brokers.
One potential explanation for this behavior is that you might have a broker in the cluster that is stuck with stale metadata for some reason, and CC happens to receive the metadata from this broker.

Were there any exceptions / errors in (1) CC logs showing that it was unable to communicate with the cluster or (2) Kafka logs showing potential network partitioning?

@jiao-zhangS
Contributor Author

@efeg Thanks for the confirmation and advice.

Were there any exceptions / errors in (1) CC logs showing that it was unable to communicate with the cluster

I checked the CC logs and found no 'Failed to update metadata in **' log. If my understanding is correct, CC updates its metadata via doRefreshMetadata in MetadataClient. Unfortunately, we didn't have debug logging enabled there.

(2) Kafka logs showing potential network partitioning?

On the broker side, we can see metadata requests from CC arriving at multiple brokers continuously, even after the new topics were created (strictly speaking, some topics were re-created). But I am not sure whether all of CC (including the metrics consumer, producer, load monitor, detector, etc.) uses the same piece of metadata.

Btw, this happened on a big cluster with 100+ brokers, and the remove_broker itself took 12+ hours. A couple of hours after the first remove_broker finished, I checked kafka_cluster_state and found that the replica count was incorrect. It seems the metadata was stale for a long time. Is it possible that CC somehow lost the trigger for metadata updates?

kafka_cluster_state should always show the latest metadata

I checked the code, and kafka_cluster_state only fetches cached metadata and doesn't trigger a metadata update. Is my understanding correct?
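
To make the question concrete, here is a toy sketch of the behavior I mean. It is not Cruise Control's code; `CachedMetadataClient`, `cached`, `refresh`, and `replica_count` are hypothetical names used only to illustrate how a reader of a cached snapshot can report 0 replicas for a broker until something refreshes the cache.

```python
import copy


class CachedMetadataClient:
    """Toy stand-in for a client that serves metadata from a local cache."""

    def __init__(self, fetch):
        # fetch is a callable returning {topic: [replica broker ids per partition]}.
        self._fetch = fetch
        self._snapshot = fetch()  # initial snapshot taken at construction

    def cached(self):
        # Returns the last snapshot without contacting the cluster.
        return self._snapshot

    def refresh(self):
        # Replaces the snapshot with a fresh fetch and returns it.
        self._snapshot = self._fetch()
        return self._snapshot


def replica_count(snapshot, broker_id):
    """Count the partitions in the snapshot that have a replica on broker_id."""
    return sum(
        1
        for partitions in snapshot.values()
        for replicas in partitions
        if broker_id in replicas
    )


# Simulated cluster: broker 1 holds nothing when the client starts.
cluster = {"old-topic": [[2, 3], [3, 4]]}
client = CachedMetadataClient(lambda: copy.deepcopy(cluster))

# A topic created later places a replica on broker 1.
cluster["new-topic"] = [[1, 2]]

stale = replica_count(client.cached(), 1)   # still based on the old snapshot
fresh = replica_count(client.refresh(), 1)  # sees the new topic
print(stale, fresh)  # prints "0 1"
```

If kafka_cluster_state behaves like `cached()` here, the stale count would persist until some other component triggers a refresh, which would match what we saw before the restart.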

@efeg
Collaborator

efeg commented Nov 2, 2021

@jiao-zhangS Created #1726 and linked this issue to it.
Recreating a topic with the same name may indeed cause this issue, and we will address that in CC. Once again, thanks for reporting this issue.
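
As a rough illustration only (this is not the planned fix, and nothing here is taken from CC's code), one way to spot a re-created topic is to compare a per-topic identity between two metadata snapshots rather than the name alone. The sketch assumes each snapshot maps a topic name to some unique topic ID, which newer Kafka versions assign; the function name and the ID strings are hypothetical.

```python
def recreated_topics(old_snapshot, new_snapshot):
    """Return names present in both snapshots whose topic ID changed.

    Each snapshot is assumed to be a dict of {topic_name: topic_id}.
    A changed ID under an unchanged name suggests delete-and-recreate.
    """
    shared = old_snapshot.keys() & new_snapshot.keys()
    return sorted(name for name in shared if old_snapshot[name] != new_snapshot[name])


before = {"orders": "id-1", "payments": "id-2"}
after = {"orders": "id-9", "payments": "id-2", "audit": "id-3"}
print(recreated_topics(before, after))  # prints "['orders']"
```

A check based on names alone would treat "orders" as unchanged in this example, even though its partitions and replica placement may be entirely different.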

@efeg efeg added the correctness A condition affecting the proper functionality. label Nov 6, 2021
@lenin-joseph

Hi @efeg, is there any fix for this, or an ETA? Thank you!


3 participants