Supporting stretch Kafka cluster with Strimzi #129
fvaleri
left a comment
Hi, thanks for the proposal. Left some initial comments.
Can you please put one sentence per line to make the review easier? You can look at one of the other proposals for an example.
The word "cluster" is overloaded in this context, so we should always pay attention and clarify if we are talking about Kubernetes or Kafka.
scholzj
left a comment
Thanks for the proposal. I left some comments.
But TBH, I do not think the level of depth it has is anywhere near where it would need to be to approve or reject anything. It is just a super high-level idea that, without the implementation details, cannot be judged correct or wrong. We cannot approve some API changes and then try to figure out how to implement the code around them. The two need to go hand in hand.
It also almost completely ignores the networking part, which is the most complicated part. It needs to cover how the different mechanisms will be supported and handled, as we should be able to integrate into the cloud native landscape and fit in with the tools already being used in this area. Relying purely on something like Ingress is not enough. So the proposal needs to cover how this will be handled and how we ensure its extensibility.
It would also be nice to cover topics such as:
- How will the installation be handled both on the side clusters as well as on the main Kubernetes cluster
- Testing strategy (how and where will we test this given our resources)
…tion Added details about how to use Submariner for cross cluster communication Contributes to: strimzi#129 Signed-off-by: Aswin A <[email protected]>
…tion Added details about how to use Submariner for cross cluster communication Contributes to: strimzi#129 Signed-off-by: Aswin A <[email protected]>
…tion Added details about how to use Submariner for cross cluster communication Contributes to: strimzi#129 Signed-off-by: Aswin A <[email protected]> Signed-off-by: Mark S Taylor <[email protected]>
Signed-off-by: Aswin A <[email protected]>
Signed-off-by: Aswin A <[email protected]>
…tion Added details about how to use Submariner for cross cluster communication Contributes to: strimzi#129 Signed-off-by: Aswin A <[email protected]>
Moved sentences to separate lines to help with reviews Signed-off-by: Aswin A <[email protected]>
…tion Signed-off-by: Aswin A <[email protected]>
Signed-off-by: Mark S Taylor <[email protected]>
Signed-off-by: Mark S Taylor <[email protected]>
Signed-off-by: Mark S Taylor <[email protected]>
This stretch cluster proposal has been updated significantly to include details of a prototype. We'd like to request a re-review of the proposal, please. Many thanks!
Signed-off-by: Aswin A <[email protected]>
I did another pass and left comments (I still have to go through the discussion on the main page, which I guess caused more changes to the proposal).
095-stretch-cluster.md
Outdated
##### Remote cluster operator configuration
When deploying the operator to remote clusters, the operator must be configured to reconcile only StrimziPodSet resources by setting the existing environment variable:
I think that STRIMZI_POD_SET_RECONCILIATION_ONLY might not be needed.
AFAICS the combination of the strimzi.io/enable-stretch-cluster annotation on the Kafka custom resource and the strimzi.io/remote-podset annotation on the StrimziPodSet should be enough to avoid collisions.
For example ...
- If the user creates a Kafka CR named foo in the central cluster (with strimzi.io/enable-stretch-cluster annotation), we'll have StrimziPodSet (with remote-podset annotation) landing on the remote cluster. This would help the remote cluster operator (strimzipodset controller) to reconcile it (together with all the "local" StrimziPodSet).
- At the same time, the user can create a Kafka CR named foo (again!) in the local cluster as well: it doesn't have strimzi.io/enable-stretch-cluster annotation, so it's local, the cluster operator creates the StrimziPodSet (without strimzi.io/remote-podset annotation) and it's able to reconcile it.
I guess that having two Kafka CRs with the same name (one for the stretched cluster and one for the local cluster) won't be a problem in terms of advertised addresses and quorum voters, because the stretched ones will take the clusterId into account at the DNS name level.
This way the cluster operator on the remote cluster can still operate the other operands (bridge, connect, ...).
But at this point, shouldn't we have an annotation similar to remote-podset for all the other resources (listed later in the proposal), like ConfigMap, Secret and so on, to avoid clashing with the same resources belonging to a local Kafka cluster that has the same name as the stretched one?
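To make the rule described above concrete, here is a minimal sketch in Java. It is not Strimzi code: the class PodSetOwnership and its method names are hypothetical, and treating the annotation value "true" as the marker is an assumption. Only the two annotation names come from this discussion.

```java
import java.util.Map;

// Hypothetical helper, NOT part of Strimzi: illustrates the annotation-based rule only.
final class PodSetOwnership {
    // Annotation names as mentioned in the review comment.
    static final String STRETCH_ANNOTATION = "strimzi.io/enable-stretch-cluster";
    static final String REMOTE_PODSET_ANNOTATION = "strimzi.io/remote-podset";

    private PodSetOwnership() { }

    // Assumption: an annotation counts as set only when its value is "true".
    private static boolean isTrue(Map<String, String> annotations, String key) {
        return "true".equalsIgnoreCase(annotations.get(key));
    }

    /** A Kafka CR is treated as stretched only if it carries the stretch annotation. */
    static boolean isStretchedKafka(Map<String, String> kafkaAnnotations) {
        return isTrue(kafkaAnnotations, STRETCH_ANNOTATION);
    }

    /** A StrimziPodSet created by the central operator carries the remote marker. */
    static boolean isRemotePodSet(Map<String, String> podSetAnnotations) {
        return isTrue(podSetAnnotations, REMOTE_PODSET_ANNOTATION);
    }

    /** A pod set without the remote marker belongs to a local (non-stretched) Kafka CR. */
    static boolean ownedByLocalKafka(Map<String, String> podSetAnnotations) {
        return !isRemotePodSet(podSetAnnotations);
    }
}
```

With this rule, the remote operator's StrimziPodSet controller reconciles both kinds of pod set, while the Kafka reconciliation for a local CR named foo only touches pod sets for which ownedByLocalKafka returns true. The sketch does not address name clashes for ConfigMap, Secret and similar resources, which is the open question raised above.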
- The feature gate will be disabled by default, allowing early adopters and community members to safely test the functionality without affecting production environments.
- After at least two Strimzi releases, and based on user feedback and observed stability, enabling the feature gate by default may be considered.
I agree with having this behind a feature gate, even though enabling a stretch cluster already requires some steps and configuration. A feature gate makes it clear to users that this is a beta feature to be tested before it matures. Of course, we would need a timeline.
### Kafka Connect, Kafka Bridge and MirrorMaker2
This proposal does not cover stretching Kafka Connect, Kafka MirrorMaker 2 or the Kafka Bridge.
These components will be deployed to the central cluster and will function as they do today.
Operators running in remote clusters will not manage KafkaBridge, KafkaConnect, KafkaConnector, or KafkaMirrorMaker2 resources.
As I commented in the same thread, it may be possible to avoid STRIMZI_POD_SET_RECONCILIATION_ONLY and have the remote cluster operator operate all other components deployed locally. For the stretch cluster, it would make sense to continue deploying them in the central one, imho.
Signed-off-by: Aswin A <[email protected]>
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
Thanks for raising this, @ppatierno. You're right, the external bootstrap section definitely deserves a more detailed explanation. I'm currently working on refining that part of the proposal, and Jakub has also shared some valuable questions in this area. I do have a working solution in mind, and I've tested how it behaves in practice, but I want to take a bit more time to properly evaluate the trade-offs and ensure we're proposing the most robust and user-friendly approach. I'll update the proposal soon with a more complete explanation.
As a more general comment on the direction of the proposal ...
During the review over the past weeks, I raised the pluggability issue with @aswinayyolath. It was mostly related to the fact that the proposal made a distinction between using Cilium (which didn't need any operator code change) and Submariner (which needed the creation of a ...).
I think that leveraging a Kubernetes API to abstract the underlying networking would be the best choice. On the other hand, I can also see that the documentation is poor (some sections are empty or TBD), which, to be honest, raises doubts about how actively the community is still working on it. There are also only a few projects implementing it.
On the other hand, I am not sure I understand what @scholzj means by pluggability in order to use projects like "Linkerd, Calico, Ingress, Gateway API, and so on" (as mentioned in one of the comments). Of course, they don't implement the MCS API (or it's just my ignorance here), so I assume what Jakub is referring to is a kind of "Strimzi API for stretched cluster" that someone has to implement via a plugin in order to use their preferred technology. Is my understanding right?
But also, if we think this feature is not in big demand, are we sure we can't be more opinionated and support the MCS API with the few implementations available? A user who wants a stretch cluster should use one of them. I know this might not be possible for various business reasons, but at the same time, do we really want to design a generic new API so that users can implement their own, only to discover months later that no one is going to do so?
Also, I am not sure where the comparison with Confluent is coming from, but I agree with Jakub that the scope and goals of Strimzi are different.
I can understand that discussing a proposal for such a long time can be frustrating, as I can sense from you when reading about "urgency / engagement ..." for customers that IBM has on its side. But as an open source maintainer, I care about the project in the long term, so thinking things through helps (I hope the IBM folks can confirm that the proposal has improved a lot with all the feedback from the community). Even if it means delaying things at the beginning, it pays off in the long run.
Thanks Jakub and Paolo. I do appreciate all the technical collaboration between the IBM and Red Hat folks on this topic, and I agree that it's worth taking time to start down the right path. At the same time, we see users of Kafka in Kubernetes with requirements for stretch clusters - it comes up a lot - and they are adopting or already using proprietary solutions (whether or not those really provide a similar capability, perception is all that matters) or even rolling their own in some cases. Many of these users would adopt Strimzi, and that's what I want to see.
Co-authored-by: Paolo Patierno <[email protected]> Signed-off-by: Aswin A <[email protected]>
@ppatierno The Strimzi API and plugins I'm talking about are a set of Java interfaces as the API and a JAR with the implementation of those interfaces as the plugin, i.e. like our PodSecurityProviders or Kafka connectors, for example.
Well, this looks more like the technical explanation of what to do, which was already pretty clear to me. My question was more about ... are you envisaging a custom Strimzi API for plugins (so something we should have in the proposal, and which doesn't exist at all yet) instead of using a Kubernetes API like the MCS one? And my next question was: why don't you consider the MCS API sufficient (without the pluggability you are requesting)? Because only a few projects implement MCS?
TBH, I think the advantages of having a pluggable interface are pretty obvious. And I think I covered many of them in my comments already anyway. For example:
The MCS API ... I would expect it to be one of the possible implementations. But TBH, for me it is not a Kubernetes API, at least not yet. It is a project worked on by one of the Kubernetes SIGs. I would be happy if it one day helps to standardize things. But while I do not claim to be an expert on it, it does not seem to be there yet. The obvious question marks are the number of implementations and the maturity of the API (4-year-old alpha version?). Why do you think it is the only thing we need to support? So no, for me it is not the obvious choice to hardcode it and dump it on the Strimzi community. And if you want to build this using Kubernetes APIs, the obvious choice would be load balancers, node ports, etc., but even there I would vote for pluggability over having it hardcoded.
Designing the pluggable interface might initially be more complicated. But if you are in it for the long term, I'm 100% sure it is worth it. As a core community, we also reduce some of the effort of developing, maintaining, and testing the various implementations. It will also lead to a cleaner design, as you cannot just hardcode all the stuff into the codebase but have to think about it a bit more.
If you are against pluggability, I would also be curious what you would do if someone comes next month with a proposal to hardcode something else next to it. I do not think you would have any other choice than to accept it. The pluggability I'm proposing gives everyone a clear path without dumping the burden on the core community.
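To illustrate the kind of pluggability discussed in this thread, here is a minimal Java sketch, assuming a plugin is a set of interfaces plus an implementation shipped as a JAR. Nothing here exists in Strimzi: StretchNetworkingProvider, McsProvider, StaticAddressProvider and the advertisedHost method are hypothetical names. The MCS-style name follows the hostname.clusterid.service.namespace.svc.clusterset.local pattern described for headless services in the MCS API; treat that as an assumption to verify against the implementation in use.

```java
import java.util.Map;

// Hypothetical plugin contract; Strimzi defines no such interface today.
interface StretchNetworkingProvider {
    /** Host name that peers in other Kubernetes clusters would use to reach a Kafka pod. */
    String advertisedHost(String podName, String serviceName, String namespace, String clusterId);
}

// One possible plugin: MCS-style DNS names (assumed pattern, see lead-in).
final class McsProvider implements StretchNetworkingProvider {
    @Override
    public String advertisedHost(String podName, String serviceName, String namespace, String clusterId) {
        return String.join(".", podName, clusterId, serviceName, namespace, "svc", "clusterset", "local");
    }
}

// Another possible plugin: addresses supplied up front, e.g. load balancer host names.
final class StaticAddressProvider implements StretchNetworkingProvider {
    private final Map<String, String> hostsByClusterAndPod;

    // Keys are "<clusterId>/<podName>"; this key format is an illustration only.
    StaticAddressProvider(Map<String, String> hostsByClusterAndPod) {
        this.hostsByClusterAndPod = Map.copyOf(hostsByClusterAndPod);
    }

    @Override
    public String advertisedHost(String podName, String serviceName, String namespace, String clusterId) {
        String key = clusterId + "/" + podName;
        String host = hostsByClusterAndPod.get(key);
        if (host == null) {
            throw new IllegalArgumentException("No address configured for " + key);
        }
        return host;
    }
}
```

The operator code would depend only on the interface, so MCS becomes one implementation among several rather than something hardcoded. How an implementation class is discovered (for example with java.util.ServiceLoader from a JAR on the classpath) is deliberately left out of this sketch.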
…cluster and validate KafkaNodePool deployment targets in stretch cluster setups Signed-off-by: Aswin A <[email protected]>
Signed-off-by: Mark S Taylor <[email protected]>
Signed-off-by: Mark S Taylor <[email protected]>
Should we close this given there is no update or progress since May?
Hi @scholzj, thanks for checking in. 🙏 Please don’t close this just yet. We’ve been actively working on performance and latency testing of Strimzi stretch clusters using Submariner and Cilium to understand the cross-cluster behavior. Those tests are now complete, and we finally have some time to focus on the pluggability side of the proposal. We’ll share an update on this shortly.
Signed-off-by: ROHAN ANIL KUMAR <[email protected]>
We acknowledge that it has been some time since the last update to the proposal. Over the past few months, our efforts have been concentrated on developing the pluggable network interface for Stretch Clusters. The current implementation can be found here: A refreshed proposal with updated design and complete details will be published shortly.
@aswinayyolath Honestly, I think the best thing is to close the old proposal and open a new one when you are ready, as I suggested a long time ago. At least you won't need to promise that you will update it.
Thanks for the feedback... your suggestion makes sense. A fresh proposal would definitely help streamline things and avoid carrying forward outdated discussions. Let me sync internally to make sure everything is aligned on our side, and I’ll come back with the next steps. In the meantime, I’ve taken note of your points and fully understand the reasoning.
+1 from me for having a new shiny proposal.
As per the above comments from the Strimzi maintainers, I am closing this proposal. The new proposal will be available here: #187
This proposal describes the design details of a stretch cluster.
Prototype
A working prototype can be deployed using the steps outlined in a draft README that is being iteratively revised.
Note: The prototype might not always exactly align with this proposal so please refer to the README documentation when working with the prototype.
POC implementation