Feature Request: Broker-Ejection Interface #1499

Open
mgrubent opened this issue Mar 25, 2021 · 2 comments
Labels
functionality A feature request.

Comments

@mgrubent
Contributor

Background

cruise-control manages the distribution of replicas across brokers in a Kafka cluster, according to proposals that it calculates for optimal resource utilization.

cruise-control also has the concept of "anomalies", which it seeks to remediate by specific manipulations of the proposals that it generates.
Illustrative anomalies are disk failure, broker failure, and goal violation.
Further, within goal violation, cruise-control seeks to detect brokers that are "slow", an additional acknowledgement that hardware reality impinges on the abstraction of Kafka brokers with perfectly idealized resources.

Finally, cruise-control is taking steps toward adding brokers to and removing brokers from the cluster, as needed to correctly provision it (#1494).

Problem

Currently, while cruise-control may put brokers in "timeout" for a variety of performance reasons, it has no power to eject them from the Kafka cluster.

Proposed Solution

cruise-control should consider a "broker-ejection interface", similar to the provisioner interface under current development.

Specifically, cruise-control should allow users to supply custom classes that define how cruise-control can effectuate the removal of a host from the cluster.
Since these exact steps will likely differ wildly across Kafka deployments, this is probably as specific as the interface can get without being inapplicable to some deployments.
Of course, for deployments in standardized platforms like Azure, these custom classes may only need to be implemented once, and can be widely shared among similar users.
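
To make the idea concrete, here is a minimal sketch of what such a pluggable interface might look like. The names `BrokerEjector` and `LoggingBrokerEjector` are hypothetical and are not part of cruise-control today; the example implementation only records the broker ids it is handed, where a real one would call into whatever platform the deployment runs on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical plugin interface (an assumption for illustration, not an existing cruise-control API). */
interface BrokerEjector {
    /** Remove the given brokers from the cluster using deployment-specific steps. */
    void eject(Set<Integer> brokerIds) throws Exception;
}

/** Hypothetical example implementation that only records what it was asked to eject. */
class LoggingBrokerEjector implements BrokerEjector {
    final List<Integer> ejected = new ArrayList<>();

    @Override
    public void eject(Set<Integer> brokerIds) {
        // A real implementation would call a cloud API, an orchestrator, or a runbook script here.
        ejected.addAll(brokerIds);
    }
}
```

Keeping the contract this small is deliberate: everything deployment-specific stays inside the user-supplied class.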

For each code path that cruise-control traverses that would result in the automated removal of all replicas from a broker (for whatever reason), users should be able to toggle whether or not that path would call the broker-ejection interface.
This allows for maximum flexibility, since some deployments may want to keep brokers in the cluster under all circumstances, others may want always to eject problem brokers, and still others will be somewhere in between.
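
One way the per-path toggle could be expressed is as a set of enabled triggers parsed from a single config value. This is only a sketch: the `EjectionTrigger` names and the comma-separated format are assumptions made for illustration, not existing cruise-control configuration.

```java
import java.util.EnumSet;
import java.util.Locale;

/** Hypothetical code paths that can end with a broker holding no replicas. */
enum EjectionTrigger { BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, SLOW_BROKER }

/** Hypothetical holder for the per-trigger toggles. */
class EjectionToggles {
    private final EnumSet<EjectionTrigger> enabled;

    private EjectionToggles(EnumSet<EjectionTrigger> enabled) {
        this.enabled = enabled;
    }

    /** Parses a comma-separated list such as "broker_failure, slow_broker"; an empty value enables nothing. */
    static EjectionToggles parse(String csv) {
        EnumSet<EjectionTrigger> result = EnumSet.noneOf(EjectionTrigger.class);
        for (String token : csv.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                // Unknown names fail fast with IllegalArgumentException.
                result.add(EjectionTrigger.valueOf(name.toUpperCase(Locale.ROOT)));
            }
        }
        return new EjectionToggles(result);
    }

    /** True only if the user opted this code path in to ejection. */
    boolean shouldEject(EjectionTrigger trigger) {
        return enabled.contains(trigger);
    }
}
```

With this shape, "never eject" is the empty list and "always eject" is the full list, and everything in between is a subset.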

Probably, for maximum safety, cruise-control should call the ejection hook only after all replicas have been removed from the targeted brokers. Conceivably, users may want to eject prior to evacuation, but this seems unsafe to me, and therefore unlikely.
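
The evacuate-before-eject ordering could be enforced with a small guard that only passes along brokers known to hold zero replicas. The sketch below assumes a hypothetical `EjectionGuard` helper and a caller-supplied map of replica counts per broker; how cruise-control would actually obtain those counts is left open.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, for illustration only. */
class EjectionGuard {
    /** Returns the subset of target brokers that are known to host no replicas. */
    static Set<Integer> safeToEject(Set<Integer> targets, Map<Integer, Integer> replicaCountByBroker) {
        Set<Integer> safe = new TreeSet<>();
        for (int brokerId : targets) {
            // A broker missing from the map is treated as unknown, not as empty, so it is held back.
            Integer count = replicaCountByBroker.get(brokerId);
            if (count != null && count == 0) {
                safe.add(brokerId);
            }
        }
        return safe;
    }
}
```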

Finally, it's an open question to me whether cruise-control should require the interface to return an affirmative ejection-succeeded or ejection-failed signal. This would be nice in theory, but in practice it may be complicated to require many disparate systems to additionally check back in with cruise-control.
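
One possible middle ground, sketched below under assumptions, is to let the hook hand back a future and have cruise-control treat a missing reply as "unknown" rather than as a failure. `EjectionOutcome` and `EjectionOutcomes.await` are hypothetical names; only `CompletableFuture` and its `get(timeout, unit)` call come from the JDK.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical three-valued result, so that silence is distinguishable from failure. */
enum EjectionOutcome { SUCCEEDED, FAILED, UNKNOWN }

/** Hypothetical helper, for illustration only. */
class EjectionOutcomes {
    /** Waits up to timeoutMs for the external system to report back. */
    static EjectionOutcome await(CompletableFuture<Boolean> report, long timeoutMs) {
        try {
            Boolean ok = report.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(ok) ? EjectionOutcome.SUCCEEDED : EjectionOutcome.FAILED;
        } catch (TimeoutException | CancellationException e) {
            // The external system never checked back in; the caller decides what to do with that.
            return EjectionOutcome.UNKNOWN;
        } catch (ExecutionException e) {
            // The hook itself reported an error.
            return EjectionOutcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return EjectionOutcome.UNKNOWN;
        }
    }
}
```

Systems that cannot report back would simply never complete the future, and the result would fall through to `UNKNOWN` after the timeout.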

@efeg added the "functionality" (A feature request.) label Mar 27, 2021
@efeg
Collaborator

efeg commented Apr 28, 2021

@mgrubent Thanks for creating this issue and providing a detailed explanation!

I feel that we can make such an interface even more generic to support (1) removal, (2) addition, and (3) swap (when applicable) of resources from/to/in the cluster. Resources may include, but are not limited to:

  • brokers,
  • disks (for JBOD brokers),
  • racks,
  • partitions
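
A rough sketch of how the more generic shape might look, with the action and the resource type both made explicit. All names here (`ChangeKind`, `ResourceKind`, `ResourceChange`, `ResourceChangeHandler`) are hypothetical and chosen for illustration; they do not describe an existing cruise-control interface.

```java
/** Hypothetical resource types, mirroring the list above. */
enum ResourceKind { BROKER, DISK, RACK, PARTITION }

/** Hypothetical actions: addition, removal, and swap. */
enum ChangeKind { ADD, REMOVE, SWAP }

/** Hypothetical request object describing one change to one resource. */
final class ResourceChange {
    final ChangeKind change;
    final ResourceKind resource;
    final String resourceId;

    ResourceChange(ChangeKind change, ResourceKind resource, String resourceId) {
        this.change = change;
        this.resource = resource;
        this.resourceId = resourceId;
    }
}

/** Hypothetical plugin interface; implementations declare which combinations they can handle. */
interface ResourceChangeHandler {
    boolean supports(ChangeKind change, ResourceKind resource);

    void apply(ResourceChange request) throws Exception;
}

/** Example handler that covers only the original use case of this issue: removing brokers. */
class BrokerRemovalOnlyHandler implements ResourceChangeHandler {
    int applied = 0;

    @Override
    public boolean supports(ChangeKind change, ResourceKind resource) {
        return change == ChangeKind.REMOVE && resource == ResourceKind.BROKER;
    }

    @Override
    public void apply(ResourceChange request) {
        if (!supports(request.change, request.resource)) {
            throw new UnsupportedOperationException(request.change + " of " + request.resource);
        }
        // Deployment-specific removal steps would go here.
        applied++;
    }
}
```

The `supports` check covers the "when applicable" caveat: a handler can decline combinations, such as swapping a rack, that make no sense for a given deployment.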

@bachmanity1
Contributor

@efeg @mgrubent

Can I ask the status of this issue? Is this issue handled by #1710?

> Probably, for maximum safety, cruise-control should call the ejection hook only after all replicas have been removed from the targeted brokers. Conceivably, users may want to eject prior to evacuation, but this seems unsafe to me, and therefore unlikely.

How was this issue handled?

3 participants