Proposal for Multi-Cluster Inference Gateways #1374
# Multi-Cluster Inference Gateways

Author(s): @robscott, @bexxmodd

## Proposal Status

***Draft***

## Summary
> **Reviewer comment:** I think this proposal would benefit from something that we've done a decent bit in Gateway API: an initial scoping doc that contains definitions, responsibilities, and use-cases that are explicitly out of scope. I want to understand why the inference extension should solve this vs. the gateway. What alternatives exist? Can we afford to assume pod-to-pod connectivity across all clusters? What are the exact boundaries? I think the what and why need to be decided before we go too deep on the how.

Inference Gateways aim to provide efficient routing to LLM workloads running in Kubernetes. In practice, an Inference Gateway is a Gateway that conforms to the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/concepts/conformance/). This Gateway supports a new type of backend - InferencePool. When routing to an [InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/), the Gateway calls out to an “Endpoint Picker” referenced by the InferencePool to get instructions on which specific endpoint within the pool it should route the request to.



### Why Multi-Cluster?

Until now, Inference Gateways have been focused exclusively on routing to a single cluster. Unfortunately, the resources needed to run LLM workloads continue to be scarce, and the desired capacity is rarely available within a single cluster. To address this, we propose expanding InferencePool to support multi-cluster routing.
> **Reviewer comment:** Is a multi-cluster inference pool a gateway "thing" or a gateway inference extension "thing" (e.g., the EPP is responsible for delegating to a remote)?

> **Reviewer comment:** What are the benefits of using a multi-cluster inference pool vs. a front-end routing layer that directs to the relevant gateways? Is it expected to provide a better inference experience, a better UX, ...? The introduction of multi-cluster (IP) routing is non-trivial. Adding Submariner (or some other overlay) could make for a complex system with non-trivial cost and failure modes. If we're leaning towards NOT routing directly to remote endpoints, the use of gateway proxies creates a more robust and scalable system.

> **Author reply:** MC Inference is part of the inference extension rather than the gateway, just as InferencePool is. To address your second question in short: extending the Inference Extension to support MC is motivated by the desire to make it as simple as possible for users. We don't want users manually configuring local Gateways.

> **Reviewer comment:** @bexxmodd I think I have the same confusion as @elevran. I understand that this proposal is about the high-level spec on how we add resources similar to InferencePool (or extend it) to support MC inference. But the implementation will still be in the gateway controllers, right? Or is there a plan to allow someone to provide a standalone plug-in that can magically enable a non-MC-inference-aware GW to route to multi-cluster inference endpoints?

### Goals

* Enable Inference Gateways to route to backends in multiple clusters.
* Follow a pattern that is familiar to users of [Multi-Cluster Services (MCS)](https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/) and/or Gateways.
> **Reviewer comment:** I wonder why this is a goal? What's the benefit of following that pattern? IIRC, we are not going to directly use the ServiceExport/Import API, and one of the non-goals is that we don't want to "be overly prescriptive about implementation details". So that leaves me scratching my head on which part of MCS we are actually following, as we neither reuse the MCS API nor dictate implementation. The UX described below seems generic enough that someone designing it without ever knowing MCS would probably do something similar at a high level. Just to be clear, I am not against it, but I am not sure it has to be a goal. I would agree more if the goal were that we have to use MCS, since it's a somewhat established multi-cluster networking API with implementations.

### Non-Goals

* Be overly prescriptive about implementation details - this should focus on the resulting UX and leave significant flexibility in how it is achieved.
* L4 ClusterIP routing and/or automatic DNS naming - all traffic needs to flow through the Inference Gateway for this pattern to be useful (otherwise the Endpoint Picker itself would be bypassed).

## Proposal

The multi-cluster Inference Gateway model will largely follow the multi-cluster services model, with a few key differences. We will omit DNS and ClusterIP resolution, and avoid a separate resource, e.g. ServiceExport, by inlining the concept within InferencePool. Additionally, we will add support for having separate Endpoint Pickers in each cluster.
> **Reviewer comment:** Can you provide additional context here? Separate as in separate from the EPP ref'd by an InferencePool with an export annotation?

> **Reviewer comment:** How are remote clusters expected to learn of exported/imported InferencePool objects?

> **Author reply:** We'll be replicating how Multi-Cluster Service does that: essentially extending Kubernetes' native service discovery across multiple clusters, creating a logical ClusterSet.

> **Reviewer comment:** FWIW, the "how" part seems to be more of an implementation detail? This is a big "details left to implementation" question regarding credentials, read/write access to clusters in a peer-to-peer or hub model, and push vs. pull semantics.


### API Changes

#### InferencePool

A new `inference.networking.k8s.io/export` annotation is added to InferencePool (a replacement for the ServiceExport resource in MCS). In the future this may become a field, but we’ll start with an annotation to allow for faster iteration. [We’ll avoid using a bool here to align with k8s API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types). The only supported value to start will be `ClusterSet`, until other use-cases are accepted. In the future, we may allow intermediate values such as `Regional`, or domain-prefixed values.
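As a concrete sketch, an exported InferencePool might look like the following. Only the annotation key and the `ClusterSet` value come from this proposal; the `spec` fields and all names are illustrative assumptions and should be checked against the current InferencePool API:

```yaml
# Sketch only: the export annotation is from this proposal; the spec fields
# and names below are illustrative assumptions, not a normative example.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: llama-pool
  namespace: inference
  annotations:
    inference.networking.k8s.io/export: ClusterSet  # export to the ClusterSet
spec:
  selector:
    matchLabels:
      app: llama-server
  targetPorts:
  - number: 8000
  endpointPickerRef:
    name: llama-pool-epp
```

With this shape, "exporting" a pool is a one-line change to an existing InferencePool, which is the simplification over a separate ServiceExport-style resource that the proposal describes.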
> **Reviewer comment:** For `ClusterSet`, the assumption is that the InferencePool is replicated to all (or none) of the member clusters. There is no selective import/export (e.g., if resources in a cluster are maxed out, what should it do with an imported pool?).

> **Author reply:** I'm not sure I understand the question. Are you referring to how the controller coordinates export/import? Also, we want to keep API changes as minimal as possible, so we should leave some things open to the implementation details.

> **Reviewer comment:** Just to clarify, ClusterSet does not have the assumption that a service exists in all its members.

> **Reviewer comment:** I would discourage "selective sharing", as it can be a leaky abstraction to "non-sameness" in a ClusterSet.

> **Reviewer comment:** I think a proposed API spec for InferencePoolImport might help clarify some of these questions, like if/how it conveys addresses to reach remote EPPs or InferencePools, or whether it has any relationship/dependency on a local EPP.

#### InferencePoolImport

A new API that mirrors ServiceImport from the MCS API. This allows anyone in a connected cluster to reference a multi-cluster InferencePool, even if the local cluster does not have a local InferencePool. In the context of Gateway API, that means a Gateway could be configured to reference an InferencePoolImport, even if that cluster did not contain an InferencePool.

This API will be used almost exclusively for tracking endpoints, but unlike MCS, we actually have two distinct sets of endpoints that we could track:

1. Endpoint Pickers
2. Model Server Endpoints

> **Reviewer comment:** Would routing directly to model server endpoints bypass the remote EPP? Does that imply that the EPP is operating with a partial view of the resources in its cluster and their load? Is that reasonable/desirable?
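Since this proposal does not yet define a schema for InferencePoolImport, the following is a purely hypothetical sketch of how it might surface the first set of endpoints (Endpoint Pickers), and how a Gateway without a local InferencePool might reference it. Every field name and value below is an assumption made for illustration, not part of the proposal:

```yaml
# Hypothetical only: InferencePoolImport is named but not specified by this
# proposal. The status layout below is one possible shape, not a real API.
apiVersion: inference.networking.k8s.io/v1alpha1
kind: InferencePoolImport
metadata:
  name: llama-pool        # mirrors the exported InferencePool's name/namespace
  namespace: inference
status:
  clusters:
  - name: cluster-west
    endpointPicker:       # Endpoint Picker endpoints - the happy path
      address: 10.100.4.21
      port: 9002
  - name: cluster-east
    endpointPicker:
      address: 10.200.7.13
      port: 9002
---
# A Gateway in a cluster with no local InferencePool could then reference the
# import from an HTTPRoute backendRef (group/kind usage is also an assumption):
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: inference
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePoolImport
      name: llama-pool
```

A sketch like this makes the earlier review questions concrete: whether the import conveys EPP addresses, model server endpoints, or both is exactly what the final spec would need to decide.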

> **Reviewer comment:** Note: the EPP protocol spec should be updated when this design is finalized (please create a tracker issue).

## Implementation Details

In the happy path, the only type of endpoint that a Gateway would need to know about is Endpoint Pickers. Ultimately, each Gateway will be sending requests to Endpoint Pickers, and then following the directions of that Endpoint Picker. As long as an Endpoint Picker is available, there’s no need to actually propagate the model server endpoints.

> **Reviewer comment:** This requires 2x RTT/BW plus cross-cluster coordination and routing. Couldn't GW(A) delegate to GW(B) and then leave the rest local to B?

> **Author reply:** The Gateway in the cluster doesn't do any actual routing, in single- or multi-cluster. When the Gateway resource in the cluster is created, the gateway controller creates load balancer resources, so when the request is sent by the client, it's received by the L7 LB and routed to the EPP in the appropriate cluster. It's never actually routed to the cluster with the Gateway.

> **Reviewer comment:** I was referring to the L7 LB in each cluster as "the Gateway", not the Gateway resources in the k8s API.

> **Reviewer comment:** I don't really follow this. So the gateway sends to many EPPs, and somehow it follows a single EPP to the model server endpoints. How does the gateway pick which EPP to follow?

### Failure Mode

If the Endpoint Picker is unavailable and the failure mode is configured as “FailOpen”, we could take one of several approaches:

#### Honor FailOpen configuration

This seems to require the Gateway to be aware of at least some model server endpoints, which requires more endpoint propagation.

#### Fail over to another cluster/Endpoint Picker

In a world where there are multiple clusters/Endpoint Pickers to choose from, it may be desirable to fail over to another cluster. Ultimately, though, if all Endpoint Pickers are unavailable, we end up back at the same problem of needing to be aware of model server endpoints.

#### Consider FailOpen “Extended” support for multi-cluster

Given the potential complexity of supporting a FailOpen mode for multi-cluster, we could consider this “Extended” or optional support.

> **Reviewer comment:** +1 on MC failover being "Extended" support.

### Cluster/Endpoint Picker Selection

It’s likely that each Gateway implementation will have somewhat different logic here, but there will likely be at least two common paths:

#### Metrics from model server endpoints

In the case where a Gateway is aware of all model server endpoints, it could theoretically also track metrics for each of these endpoints.

> **Reviewer comment:** This duplicates the work of the EPP.

> **Reviewer comment:** Isn't the EPP tracking metrics for the individual pods? Maybe have the Gateway collect aggregated metrics for the InferencePool?

> **Reviewer comment:** Agreed that this risks an unclear separation of concerns (or duplication of functionality) between a Gateway and EPP.

#### Metrics from Endpoint Picker

Since Gateways are ultimately deciding which Endpoint Picker to send traffic to, it could make sense for Endpoint Pickers to report load/utilization data back to the Gateway to help inform that decision. (This would reflect the utilization of the model server Pods within the local InferencePool managed by each EPP.)

> **Reviewer comment:** The EPP already exposes InferencePool metrics. It will be up to the implementation how to use these metrics to make a routing decision.

> **Reviewer comment:** I'm a bit confused here - how, or in what circumstances, would a Gateway potentially be evaluating InferencePool metrics to make routing decisions? If multiple InferencePoolImport or InferencePool backendRefs are available under a single HTTPRouteRule, does this imply a need for some sort of dynamic …? Would an HTTPRouteRule extensionRef filter potentially be useful for this selection somehow? Is this the theoretical future enhancement referenced below, and currently out of scope?

#### PreferClose/PreferLocal

Use the local cluster by default, and fail over if it is out of capacity.

### Theoretical Future Enhancement: Multi-Cluster Endpoint Pickers

In the future, a more advanced implementation could allow Endpoint Pickers to pick from endpoints in other clusters (relying on the same underlying infrastructure that propagates endpoints for this multi-cluster model). We're intentionally leaving that out of the initial scope, as it's both more complicated to implement and unlikely to be scalable, given the need for Endpoint Pickers to have a very tight feedback loop (usually via frequent scraping of metrics) with each model server Pod in the InferencePool. Extending that model across clusters could become quite costly.

**Pros**:

* Reuses the existing MCS model.
* Simplest possible API model.
* “Export” configuration lives on InferencePool and clearly applies to the entire pool, not just the EPP.
* Can clearly reference an InferencePool in other clusters without having one locally.

**Cons**:

* Does not reuse the MCS API (unclear if this is a con).

## Alternative 1: MCS API for EPP

If we lean into the idea that the only thing a Gateway needs to know is the Endpoint Picker endpoints and which cluster(s) they're associated with, we could build this on top of the MCS API. With this approach, the Endpoint Picker is exposed with a Multi-Cluster Service:
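In MCS terms, this would mean exporting the Endpoint Picker's Service with a ServiceExport. The `multicluster.x-k8s.io/v1alpha1` group/kind is the real MCS API; the Service name and namespace below are illustrative:

```yaml
# Alternative 1 sketch: export the EPP's Service across the ClusterSet using
# the standard MCS API. The EPP Service name below is illustrative.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: llama-pool-epp    # must match the name of the EPP's Service
  namespace: inference
```

Each importing cluster would then see a corresponding ServiceImport for the EPP, but a local InferencePool would still be needed to reference it, which is the source of the extra configuration noted in the cons.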


**Pros**:

* Reuses existing MCS infrastructure.
* Likely relatively simple to implement.

**Cons**:

* Referencing InferencePools in other clusters requires you to create an InferencePool locally.
* Significantly more complex configuration (more YAML, at least).
* A "FailOpen" mode becomes ~impossible if implementations don't actually have some model server endpoints to fall back to.
* In this model, you don’t actually choose to export an InferencePool, you export the Endpoint Picker, which could lead to significant confusion.
* InferencePool is meant to be a replacement for a Service, so it may seem counterintuitive for a user to create a Service to achieve multi-cluster inference.
## Alternative 2: New MCS API | ||
|
||
One of the key pain points we’re seeing here is that the current iteration of the MCS API requires a tight coupling between name/namespace and kind, with Service being the only kind of backend supported right now. This goes against the broader SIG-Network direction of introducing more focused kinds of backends (like InferencePool). To address this, we could create a resource that has an `exportRef` that allows for exporting different types of resources. | ||
|
||
Well we were at it, we could combine the separate `export` and `import` resources that exist today, with `export` acting as the (optional) spec of this new resource, and `import` acting as `status` of the resource. Instead of `import` resources being automatically created, users would create them wherever they wanted to reference or export something to a MultiClusterService. | ||
|
||
Here’s a very rough example: | ||
|
||
```yaml
apiVersion: networking.k8s.io/v1
kind: MultiClusterService
metadata:
  name: bookinfo
  namespace: bookinfo
spec:
  exportRef:
    group: ""        # core API group, which Service belongs to
    kind: Service
    name: bookinfo
  scope: ClusterSet
status:
  conditions:
  - type: Accepted
    status: "True"
    message: "MultiClusterService has been accepted"
    lastTransitionTime: "2025-03-30T01:33:51Z"
  targetCount: 1
  ports:
  - protocol: TCP
    appProtocol: HTTP
    port: 8080
```
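Under this alternative, the same resource shape could export an InferencePool instead of a Service, which is the flexibility the `exportRef` is meant to provide. This is a hypothetical extension of the rough example above, with illustrative names:

```yaml
# Hypothetical: the exportRef points at an InferencePool rather than a Service,
# following the same rough MultiClusterService shape sketched above.
apiVersion: networking.k8s.io/v1
kind: MultiClusterService
metadata:
  name: llama-pool
  namespace: inference
spec:
  exportRef:
    group: inference.networking.k8s.io
    kind: InferencePool
    name: llama-pool
  scope: ClusterSet
```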

### Open Questions

* How can we ensure that cross-cluster connections to the EPP are secure? (Requires resolution of https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/735#issuecomment-3133302612.)
* Can we find a way to configure preferences for where a request should be routed?

### Prior Art

* [GEP-1748: Gateway API Interaction with Multi-Cluster Services](https://gateway-api.sigs.k8s.io/geps/gep-1748/)
* [Envoy Gateway with Multi-Cluster Services](https://gateway.envoyproxy.io/latest/tasks/traffic/multicluster-service/)
* [Multicluster Service API](https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/)
* [Submariner](https://submariner.io/)

### References

* [Original Doc for MultiCluster Inference Gateway](https://docs.google.com/document/d/1QGvG9ToaJ72vlCBdJe--hmrmLtgOV_ptJi9D58QMD2w/edit?tab=t.0#heading=h.q6xiq2fzcaia)
> **Reviewer comment:** I've read the proposal. Overall it looks very nice, and at a very high level it could work (not getting into the details). I think the main part still missing here is the motivation. The only motivation mentioned in this doc is that the cluster may run out of resources. I can share that this idea was proposed multiple times internally at IBM (well before GIE), but the answer was always the same: the cluster can be scaled with more resources, and the complexity of spreading across multiple clusters isn't worth it when looking at the tradeoff. I would try to focus on this point - I think you need to find at least one use case or problem that cannot be solved by scaling a single cluster with more resources. There is no doubt that this proposal adds complexity to GIE, and there should be a real requirement or real use case for us to do that.
> **Author reply:** I agree with your perspective that adding any complexity should be justified, and usually I'm the one who argues against it. However, in this case there's a hard limit on how far resources can be scaled vertically, so in a cloud environment further scaling becomes possible only by adding new regions and scaling horizontally. The expectation for the GWs is that this means adding new clusters. That's why there's strong support for MC inference GW from other vendors like Microsoft, Red Hat, Solo, etc.
> **Reviewer comment:** Have you hit that limit in GCP? It would be great to add that information as background to the proposal.
> **Reviewer comment:** Customers exhausting allocated GPU availability within a given region is currently a very real challenge across multiple cloud vendors.
> **Reviewer comment:** I wasn't arguing otherwise :). I was trying to stress a point: when GPU availability within a given region is exhausted, can a cloud vendor add more resources to that region? Theoretically that's possible, and it solves the problem without the multi-cluster complexity. But scaling also has its limits - we can scale up only to a certain point. The question was whether we know what that limit is, and whether we can document it, in order to understand under what conditions multi-cluster becomes a better solution than scaling a single cluster. Adding this kind of information would strengthen the motivation section and help in deciding whether we should invest in this use case.
> **Author reply:** Yeah, I agree with adding information about scalability to strengthen the motivation. I'll do it in a general sense, as the specific numbers will vary from provider to provider. Currently, GKE Gateway is limited to 1500 pods per regional cluster and 500 pods per zonal cluster.