
Conversation

@bexxmodd (Contributor) commented Aug 14, 2025


netlify bot commented Aug 14, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 58f3d6a
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68a516fe5fcd6400084feff2
😎 Deploy Preview: https://deploy-preview-1374--gateway-api-inference-extension.netlify.app

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bexxmodd
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 14, 2025
@k8s-ci-robot (Contributor)

Welcome @bexxmodd!

It looks like this is your first PR to kubernetes-sigs/gateway-api-inference-extension 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api-inference-extension has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 14, 2025
@k8s-ci-robot (Contributor)

Hi @bexxmodd. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 14, 2025
@bexxmodd (Contributor, Author)

/cc @robscott

@k8s-ci-robot k8s-ci-robot requested a review from robscott August 14, 2025 00:50
@robscott (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 14, 2025
@nirrozenbaum (Contributor)

@bexxmodd can you please remove the .DS_Store files?

@bexxmodd (Contributor, Author) commented Aug 14, 2025

@bexxmodd can you please remove the .DS_Store files?

Removed.

Also, I created PR #1378 to gitignore macOS-generated files.


### Non-Goals

Be overly prescriptive about implementation details - this should focus on the resulting UX and leave significant flexibility in how it is achieved
Contributor

Each should be a bullet and end with a period.

Contributor

Having worked with MCS, my impression was that it was too prescriptive on some aspects (e.g., name sameness, IP addresses being routable across) and too open-ended in others (e.g., how to coordinate Exports/Imports).
Providing (at least) guidance on major interactions and design options would be helpful.

@mikemorris commented Aug 22, 2025

too prescriptive on some aspects (e.g., name sameness, IP addresses being routable across)

My impression is that namespace sameness has largely been proven to be a desirable constraint to facilitate sensible and safe routing between clusters in a trusted group (ClusterSet in MCS nomenclature) to avoid issues like default/api in one cluster being a completely different workload than in another cluster in the group.

The challenge is that depending on the cluster ownership model (such as proliferation facilitated by managed Kubernetes offerings, where each team owns several of their own clusters rather than sharing a few centrally-managed clusters) or the namespace tenancy pattern used by an organization, avoiding naming conflicts can be difficult in larger organizations.

The question around sameness is more about whether there is a need to model and address services outside the trusted group into a cohesive Kubernetes networking model (versus external routing through ingress gateways). An example of this would be a large organization where an "accounts" and "billing" team might each own their own group of clusters for multi-region availability of the services they own, but need to route some requests outside of their domain to services owned by the other team in a different clusterset. In this organizational topology, it would be likely that some service names would conflict, being completely different logical workloads under a clashing common name (like api, db, or web). Whether attempting to model these relationships in some sort of unified way through "mesh federation" is worthwhile may be an open question, but has been out of scope for MCS.

IP addresses are not required to be directly routable - single-network (direct routing) or multi-network (indirection through E/W gateways) are both acceptable implementations for the MCS API.

too open-ended in others (e.g., how to coordinate Exports/Imports)

MCS API was an early attempt to pivot from the issues with KubeFed trying to do too much as a full implementation, and instead just define the spec as an out-of-tree set of CRDs - a pattern which has seen success with Gateway API. There's some lack of precision here due to this being an early attempt, but the pattern has largely proven to be successful at facilitating an engaged community of implementations.

Providing (at least) guidance on major interactions and design options would be helpful.

If at some later point we seriously consider a proposal for alternative 2️⃣, a provisional doc covering these interactions, constraints, and user stories (including things missing from MCS, like topology-aware routing or endpoint weighting inclusive of capacity/load reporting) must be the first step before we jump directly to a proposed API design.


## Proposal

The multi-cluster Inference Gateway model will largely follow the multi-cluster services model, with a few key differences. We will omit DNS and ClusterIP resolution, and avoid a separate resource, e.g. ServiceExport, by inlining the concept within InferencePool. Additionally, we will add support for having separate Endpoint Pickers in each cluster.
Contributor

we will add support for having separate Endpoint Pickers in each cluster.

Can you provide additional context here? Separate as in separate from the EPP ref'd by an InferencePool with an export annotation?

Contributor

How are remote clusters expected to learn of Exported/Imported InferencePool objects?
How will remote access to API masters be secured and coordinated over time?

@bexxmodd (Contributor, Author) commented Aug 20, 2025

How are remote clusters expected to learn of Exported/Imported InferencePool objects? How will remote access to API masters be secured and coordinated over time?

We'll replicate how Multi-Cluster Services (MCS) does this: essentially extending Kubernetes' native service discovery across multiple clusters, creating a logical ClusterSet.


FWIW, the "how" part seems to be more of an implementation detail?

@mikemorris commented Aug 22, 2025

How are remote clusters expected to learn of Exported/Imported InferencePool objects?

The InferencePoolImport resource should be created in each cluster in the ClusterSet in the same namespace and with the same name as any exported InferencePool (likely automatically by some global controller, but details can be impl-specific), so that each cluster has a local reference resource to any exported remote InferencePool.

How will remote access to API masters be secured and coordinated over time?

This is a big "details left to implementation" question regarding credentials, read/write access to clusters in a peer-to-peer or hub model, and push vs pull semantics.
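As a hypothetical sketch of the InferencePoolImport shape described above (the kind name comes from this discussion, but every field below is an assumption rather than a settled API, mirroring MCS ServiceImport's `status.clusters`):

```yaml
# Hypothetical InferencePoolImport, created in each cluster with the same
# name/namespace as the exported InferencePool (all fields are assumptions).
apiVersion: inference.networking.k8s.io/v1alpha1
kind: InferencePoolImport
metadata:
  name: llama-pool        # matches the exported InferencePool
  namespace: default      # namespace sameness across the ClusterSet
status:
  clusters:               # analogous to ServiceImport status.clusters in MCS
  - name: cluster-b       # cluster(s) the pool was exported from
```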

#### InferencePool

A new `inference.networking.k8s.io/export` annotation is added to InferencePool (replacement for ServiceExport resource in MCS). In the future this may become a field, but we’ll start with an annotation to allow for faster iteration. [We’ll avoid using a bool here to align with k8s API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types). The supported values to start will be `Local` and `ClusterSet`. In the future, we may allow for some intermediate values such as Regional or domain-prefixed values.
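For illustration, exporting an InferencePool with the annotation described above might look like this (the annotation key and `ClusterSet` value are from the text; the `apiVersion` and elided spec fields are assumptions):

```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: llama-pool
  namespace: default
  annotations:
    # Export this pool to every cluster in the ClusterSet
    inference.networking.k8s.io/export: ClusterSet
spec:
  # ...selector, target port, and EPP reference as usual...
```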

Contributor

We should only start with ClusterSet until a use case for Local or any other supported value is accepted.

Contributor

For ClusterSet, the assumption is that the InferencePool is replicated to all (or none) of the member clusters. There is no selective import/export (e.g., if resources in a cluster are maxed out, what should it do with an imported pool?).
I think coordination of the export/import process (including, for example, status reporting) should be specified.

Contributor (Author)

We should only start with ClusterSet until a use case for Local or any other supported value is accepted.

Removed Local.

For ClusterSet, the assumption is that the InferencePool is replicated to all (or none) of the member clusters. There is no selective import/export (e.g., if resources in a cluster are maxed out, what should it do with an imported pool?). I think coordination of the export/import process (including, for example, status reporting) should be specified.

I'm not sure I understand the question. Are you referring to how the controller coordinates export/import? Also, we want to keep API changes as minimal as possible, so we should leave some things open to implementation details.

Contributor

  1. Correct - I'm wondering if you want to recommend/enforce that import/export is done to all clusters in the ClusterSet, or whether a controller is free to support selective sharing (e.g., whose policy specification is out of scope).
  2. How, if at all, is Import status reported on an Export? Is there an indication in the exported InferencePool that it was successfully imported, and where? I don't consider those implementation details but part of the API. The controller should be able to coordinate across clusters so the user is aware of status (e.g., where is the import coming from, was the export successful, etc.).


Just to clarify, ClusterSet does not carry the assumption that a service exists in all its members.

@mikemorris commented Aug 22, 2025

I'm wondering if you want to recommend/enforce that import / export is done to all clusters in the ClusterSet or a controller is free to support selective sharing (e.g., whose policy specification is out of scope).

I would discourage "selective sharing" as it can be a leaky abstraction to "non-sameness" in a clusterset (i.e., if default/api is a different workload on clusters A and B, and selective sharing is used as a workaround to export the workload from cluster B to A and C, and separately from A to only B, it would be easy to quickly lose track of service identity).

How, if at all, is Import status reported on an Export? Is there an indication in the exported InferencePool that it was successfully imported, and where? I don't consider those implementation details but part of the API. The controller should be able to coordinate across clusters so the user is aware of status (e.g., where is the import coming from, was the export successful, etc.).

This is the primary utility of status.conditions on the ServiceExport resource (and status.clusters on ServiceImport for "where is the import coming from") in MCS, although messaging conflicts or failed exports can be difficult for some implementations not using a centralized hub model. I think it's a valid question whether this is a required part of the API for an initial implementation, or if something like logs/metrics from a multi-cluster Gateway API Inference Extension controller is sufficient initially.
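For reference, the MCS pattern described above looks roughly like the following; an exported InferencePool could surface analogous status (a sketch based on the MCS API, not a proposed schema):

```yaml
# MCS-style export status (multicluster.x-k8s.io); an InferencePool
# export could mirror these conditions.
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: llama-pool
  namespace: default
status:
  conditions:
  - type: Valid           # export accepted by the controller
    status: "True"
  - type: Conflict        # e.g., spec mismatch with another cluster's export
    status: "False"
```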

@mikemorris commented Aug 22, 2025

I think a proposed API spec for InferencePoolImport might help clarify some of these questions, like if/how it conveys addresses to reach remote EPPs or InferencePools or if it has any relationship/dependency on a local EPP?

@mikemorris commented Aug 22, 2025

FWIW, I was thinking a potential use case for Local might be to create a local InferencePoolImport resource to be able to incrementally migrate Gateways in the local cluster from a local InferencePool over to InferencePoolImport without exposing their local InferencePool to remote traffic yet (but now that I'm thinking about this more, the UX is actually a bit wonky because if the same name/namespace InferencePool on another cluster was already exported to the ClusterSet, then traffic could still immediately start routing off-cluster, and would just additionally have local endpoints available, so maybe that's actually more confusing than helpful...)


1. Endpoint Pickers
1. Model Server Endpoints

@danehans (Contributor) commented Aug 14, 2025

  • We should also consider routing to a remote EPP through a remote cluster Gateway (see the original design doc appendix).

  • Why does InferencePoolImport need to know about the model server endpoints in a remote cluster? I would expect the local Gateway to route to remote EPPs or through a remote Gateway based on one of the following conditions:

    • A local InferencePool exists and no local GPUs are available, e.g. EPP returns a 503/429.
    • A local InferencePool exists and the local Gateway decides the request is better served by an InferencePoolImport.
    • No local InferencePool exists but an InferencePoolImport exists with a status that indicates available GPU resources.

Note: The EPP protocol spec should be updated when this design is finalized (please create a tracker issue).


#### Consider FailOpen “Extended” support for multi-cluster

Given the potential complexity of supporting a FailOpen mode for multi-cluster, we could consider this “Extended” or optional support.
Contributor

+1 on MC failover being "Extended" support.


#### Metrics from model server endpoints

In the case where a Gateway is aware of all model server endpoints, it could theoretically also track metrics for each of these endpoints.
Contributor

This duplicates the work of the EPP.

@bexxmodd (Contributor, Author) commented Aug 18, 2025

This duplicates the work of the EPP.

Isn't the EPP tracking metrics for individual Pods? Maybe have the Gateway collect aggregated metrics for the InferencePool?

@mikemorris commented Aug 22, 2025

Agreed that this risks an unclear separation of concerns (or duplication of functionality) between a Gateway and EPP.


#### Metrics from Endpoint Picker

Since Gateways are ultimately deciding which Endpoint Picker to send traffic to, it could make sense for Endpoint Pickers to report back load/utilization data to the Gateway to help inform that decision. (This would reflect the utilization of model server Pods within the local InferencePool managed by each EPP).
Contributor

EPP already exposes InferencePool metrics. It will be up to the implementation on how to use these metrics to make a routing decision.

@mikemorris commented Aug 22, 2025

I'm a bit confused here - how or in what circumstance would a Gateway potentially be evaluating InferencePool metrics to make routing decisions?

If multiple InferencePoolImport or InferencePool backendRefs are available under a single HTTPRouteRule, does this imply a need for some sort of dynamic weight on backendRef instead of static configuration? If an EPP is only called to select an endpoint after the target InferencePool has been selected, how would metrics be considered?

Would an HTTPRouteRule extensionRef filter potentially be useful for this selection somehow?

Is this the theoretical future enhancement referenced below and currently out of scope?


***Draft***

## Summary
@nirrozenbaum (Contributor) commented Aug 17, 2025

I've read the proposal. Overall it looks very nice, and at a very high level it could work (not getting into the details).
I think the main part that is still missing here is the motivation.
The only motivation mentioned in this doc is that the cluster may run out of resources.

I can share that this idea was proposed multiple times internally at IBM (well before GIE), but the answer was always the same: the cluster can be scaled with more resources, and the complexity of spreading across multiple clusters isn't worth the tradeoff.

I would try to focus on this point: find at least one use case or problem that cannot be solved by scaling a single cluster with more resources.

There is no doubt that this proposal adds complexity to GIE, and there should be a real requirement or real use case for us to do that.

Contributor (Author)

I agree with your perspective that any added complexity should be justified, and usually I'm the one who argues against it. However, in this case there's a hard limit to how much resources can be scaled vertically, so in a cloud environment scaling becomes possible only by adding new regions and scaling horizontally. For Gateways, the expectation is that this means adding new clusters. That's why there's strong support for a multi-cluster Inference Gateway from other vendors like Microsoft, Red Hat, and Solo.

Contributor

there's a hard limit to how much resources can be scaled up vertically

Have you hit that limit in GCP?
It would be great to add that information as background to the proposal.


Customers exhausting allocated GPU availability within a given region is a very real challenge across multiple cloud vendors currently.

Contributor

Customers exhausting allocated GPU availability within a given region is a very real challenge across multiple cloud vendors currently.

I wasn't arguing otherwise :).

I was trying to stress a point: when GPU availability within a given region is exhausted, can a cloud vendor add more resources to that region? Theoretically that's possible and solves the problem without the multi-cluster complexity.
But scaling also has its limits; we can scale up only to a certain point. The question was whether we know what that limit is, and whether we can document it, in order to understand the conditions under which multi-cluster becomes a better solution than scaling the single cluster.

Adding this kind of information can strengthen the motivation section and help in understanding whether we should invest in this use case or not.

Contributor (Author)

Yeah, I agree with adding information about scalability to strengthen the motivation. I'll do it in a general sense, as the specific numbers will vary from provider to provider.

Currently, GKE Gateway is limited to 1500 Pods per regional cluster and 500 Pods per zonal cluster.




### Why Multi-Cluster?

Until now, Inference Gateways have been focused exclusively on routing to a single cluster. Unfortunately, the resources needed to run LLM workloads continue to be scarce, and the desired capacity is rarely available within a single cluster. To address this, we propose expanding InferencePool to support multi-cluster routing.
Contributor

Are multi-cluster inference pools a gateway "thing" or a gateway inference extension "thing" (e.g., the EPP is responsible for delegating to a remote)?
I think the expectation is that gateways would support this directly?

@elevran (Contributor) commented Aug 18, 2025

What are the benefits of using a multi-cluster InferencePool vs. a front-end routing layer that directs to the relevant gateways? Is it expected to provide a better inference experience, a better UX, ...?

The introduction of multi-cluster (IP) routing is non-trivial. Adding Submariner (or some other overlay) could make for a complex system with non-trivial cost and failure modes. If we're leaning towards NOT routing directly to remote endpoints, the use of gateway proxies creates a more robust and scalable system.

Contributor (Author)

Are multi-cluster inference pools a gateway "thing" or a gateway inference extension "thing" (e.g., the EPP is responsible for delegating to a remote)? I think the expectation is that gateways would support this directly?

MC inference is part of the inference extension rather than the gateway, just as InferencePool is.
Where's the expectation for it coming from? Is there a public thread or discussion?

To address your second question in short: the idea of extending the Inference Extension to support MC is motivated by the desire to make it as simple as possible for users. We don't want users manually configuring local Gateways.

Contributor

@bexxmodd Sorry for not being clear in phrasing the questions. Hope the below makes it clearer.

  1. I understand that multi-cluster inferencing is part of the IGW APIs and not the GW API. What I meant to ask was: in your view, which component should implement the handling of imported remote inference pools - is it the role of (1) an EPP, (2) the gateway implementation (e.g., by configuring Envoy), or (3) left implementation-dependent? Regarding the use of "expectation": it was meant to confirm my reading of the design (i.e., that the gateway implementations would take on the role of handling the actual remote routing / traffic paths, and that it is not a feature of the EPP). There is no public thread or discussion - apologies for the confusion.
  2. There are multiple ways to solve this without having users manually configure gateways. For example, using GitOps to replicate the InferencePool along with an L7 LB to select between clusters, or a controller that configures Istio egress and ingress proxies. The main point I'm trying to convey is that multi-cluster flat IP networks (i.e., all pods across all clusters are directly routable from all clusters) can be a complex and fragile solution in many cloud and on-premise deployments. Relying on the installation of a multi-cluster overlay (such as Submariner) makes the proposed multi-cluster inferencing solution depend on a third-party tool to be viable, and I don't think it would make users' lives easier.


I think the expectation is that gateways would support this directly?

@bexxmodd I think I have the same confusion as @elevran. I understand that this proposal is about the high-level spec for how we add resources similar to InferencePool (or extend it) to support MC inference. But the implementation will still be in the gateway controllers, right? Or is there a plan to allow someone to provide a standalone plug-in that can magically enable a non-MC-inference-aware GW to route to multi-cluster inference endpoints?




This API will be used almost exclusively for tracking endpoints, but unlike MCS, we actually have two distinct sets of endpoints that we could track:

1. Endpoint Pickers
1. Model Server Endpoints
Contributor

Would routing directly to model server endpoints bypass the remote EPP? Does that imply that the EPP is operating with a partial view of the resources in its cluster and their load? Is that reasonable/desirable?


## Implementation Details

In the happy path, the only type of endpoint that a Gateway would need to know about is Endpoint Pickers. Ultimately, each Gateway will be sending requests to Endpoint Pickers, and then following the directions of that Endpoint Picker. As long as an Endpoint Picker is available, there’s no need to actually propagate the model server endpoints.
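As a sketch of how a Gateway might consume this (hypothetical: the proposal has not defined whether InferencePoolImport is a valid `backendRef` kind, and the weighting between local and imported pools is an open question):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool         # local pool with its local EPP
      name: llama-pool
    - group: inference.networking.k8s.io
      kind: InferencePoolImport   # hypothetical: remote EPPs via the import
      name: llama-pool
```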
Contributor

This seems to suggest the following flow:

  1. The gateway in cluster A routes to an EPP in cluster B.
  2. The EPP returns (e.g.) a local endpoint in cluster B.
  3. The gateway in cluster A sends the request directly to the endpoint in cluster B.

This requires 2x RTT/bandwidth plus cross-cluster coordination, routing, etc. Couldn't GW(A) delegate to GW(B) and then leave the rest local to B?

Contributor (author):

The Gateway in the cluster doesn't do any actual routing, in either single- or multi-cluster setups. When the Gateway resource is created in the cluster, the gateway controller creates Load Balancer resources, so when a request is sent by the client, it's received by the L7LB and routed to the EPP in the appropriate cluster. It's never actually routed to the cluster with the gateway.

Contributor @elevran (Aug 19, 2025):

I was referring to the L7LB in each cluster as "the Gateway", not the Gateway resource in the k8s API.
Do you mean that there's an additional LB, besides the Envoy/nginx/etc. proxy, that routes directly to the EPP service? Or is it done via the proxy, which is programmed via the Gateway API?
Also, LoadBalancer resources are common in cloud deployments and not necessarily available in on-premise clusters.


### Why Multi-Cluster?

Until now, Inference Gateways have been focused exclusively on routing to a single cluster. Unfortunately, the resources needed to run LLM workloads continue to be scarce, and the desired capacity is rarely available within a single cluster. To address this, we propose expanding InferencePool to support multi-cluster routing.

Contributor:

I think the expectation is that gateways would support this directly?

@bexxmodd I think I have the same confusion as @elevran. I understand that this proposal is about the high-level spec for how we add resources similar to InferencePool (or extend it) to support MC inference. But the implementation will still be in the gateway controllers, right? Or is there a plan to allow someone to provide a standalone plug-in that can magically enable a non-MC-inference-aware GW to route to multi-cluster inference endpoints?

### Goals

* Enable Inference Gateways to route to backends in multiple clusters.
* Follow a pattern that is familiar to users of [Multi-Cluster Services (MCS)](https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/) and/or Gateways.

Contributor:

I wonder why this is a goal? What's the benefit of following that pattern? IIRC, we are not going to directly use the ServiceExport/ServiceImport API, and one of the non-goals is that we don't want to "be overly prescriptive about implementation details". That leaves me scratching my head about which part of MCS we are actually following, since we neither reuse the MCS API nor dictate implementation. The UX described below seems generic enough that someone designing it without ever knowing MCS would probably arrive at something similar at a high level.

Just to be clear, I am not against it, but I am not sure it has to be a goal. I would agree more if the goal were that we have to use MCS, since it's a somewhat established multi-cluster networking API with existing implementations.


## Proposal

The multi-cluster Inference Gateway model will largely follow the multi-cluster services model, with a few key differences. We will omit DNS and ClusterIP resolution, and avoid a separate resource, e.g. ServiceExport, by inlining the concept within InferencePool. Additionally, we will add support for having separate Endpoint Pickers in each cluster.

Contributor:

FWIW, the "how" part seems to belong more in the implementation details?

#### InferencePool

A new `inference.networking.k8s.io/export` annotation is added to InferencePool (replacement for ServiceExport resource in MCS). In the future this may become a field, but we’ll start with an annotation to allow for faster iteration. [We’ll avoid using a bool here to align with k8s API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#primitive-types). The supported values to start will be `Local` and `ClusterSet`. In the future, we may allow for some intermediate values such as Regional or domain-prefixed values.
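A minimal sketch of what an exported pool could look like. Only the annotation key and the `Local`/`ClusterSet` values come from the proposal; the spec fields shown (`selector`, `targetPorts`, `endpointPickerRef`) are illustrative and may differ across InferencePool API versions.

```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: llama-pool
  annotations:
    # Export this pool to all member clusters of the ClusterSet.
    # "Local" would keep it scoped to the cluster it is defined in.
    inference.networking.k8s.io/export: ClusterSet
spec:
  selector:
    matchLabels:
      app: llama-server
  targetPorts:
  - number: 8000
  endpointPickerRef:
    name: llama-epp
```

Because the export marker is inlined on the InferencePool itself, no separate ServiceExport-style resource is needed.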

Contributor:

Just to clarify, ClusterSet does not carry the assumption that a service exists in all of its member clusters.


## Implementation Details

In the happy path, the only type of endpoint that a Gateway would need to know about is Endpoint Pickers. Ultimately, each Gateway will be sending requests to Endpoint Pickers, and then following the directions of that Endpoint Picker. As long as an Endpoint Picker is available, there’s no need to actually propagate the model server endpoints.

Contributor:

> Ultimately, each Gateway will be sending requests to Endpoint Pickers, and then following the directions of that Endpoint Picker.

I don't really follow this. So the gateway sends requests to many EPPs, and somehow it follows a single EPP to the model server endpoints. How does the gateway pick which EPP to follow?


***Draft***

## Summary

Contributor:

I think this proposal would benefit from something we've done a decent bit in Gateway API: an initial scoping doc that contains definitions, responsibilities, and use cases that are explicitly out of scope. I want to understand why the inference extension should solve this vs. the gateway. What alternatives exist? Can we afford to assume pod-to-pod connectivity across all clusters? What are the exact boundaries? I think the what and why need to be decided before we go too deep on the how.
