Skip to content

Conversation

@orfeas-k
Copy link

Add proposal for supporting Istio ambient in Kubeflow.

Context

This has been discussed for more than a year in the Kubeflow community (see kubeflow/manifests#2676). Currently, there is an ongoing GSOC project working on implementing this (see
#809 (comment)), which will be greatly facilitated by agreeing on expected changes.

What is this proposal about?

This proposal:

  • defines the changes needed to make Kubeflow components ambient-compatible
  • outlines a setup that uses Istio in ambient and enables all the existing Authorization and Authentication use cases.

Ref kubeflow/manifests#2676

cc @kimwnasptd @juliusvonkohout @thesuperzapper

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Add proposal for supporting Istio ambient in Kubeflow. This has been
discussed for more than a year in the Kubeflow community (see
kubeflow/manifests#2676). Given that right now, there is an ongoing GSOC
project working on this (see
kubeflow#809 (comment)),
this proposal aims to facilitate the project and:
* defines the changes needed to make Kubeflow component ambient-compatible
* outlines a setup that uses Istio in ambient and enables all the existing
  Authorization and Authentication use cases.

Ref kubeflow/manifests#2676

Signed-off-by: Orfeas Kourkakis <[email protected]>
@orfeas-k orfeas-k force-pushed the proposal-support-istio-ambient branch from 2797df2 to 55c124c Compare July 31, 2025 19:39
* A high-level architecture of Kubeflow that uses Istio ambient and showcases how to:
* Use HTTPRoutes instead of VirtualServices
* Enable the user login flow
* Use waypoints to enforce namespace isolation (L7 and L4 traffic)
Copy link
Member

@juliusvonkohout juliusvonkohout Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Use waypoints to enforce namespace isolation (L7 and L4 traffic)
* Use waypoints to enforce namespace isolation (L7 and L4 traffic)
* Provide Zero overhead namespaces, so without actual user workloads the Pods, CPU and memory used should be 0 (after the seaweedfs PR is merged). So no waypoint proxies are allowed in kubeflow profile namespaces.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also related to #891 (comment).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not needed since it's not an actual goal of this proposal.

To enable authorization for Kubeflow components, the whole `kubeflow` namespace is included in the mesh.

##### Waypoints
To get the best performance out of ambient mode[^6][^7], services are enrolled to a waypoint only when the L7 features are required. For the Kubeflow control plane, only authorizationPolicies targeting `ml-pipeline` and `ml-pipeline-ui` make use of L7 attributes. Thus, a waypoint is deployed in the `kubeflow` namespace while only the services of those two components are enrolled to the waypoint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that notebooks do not use L7 ? I mean everyone with the kubeflow-userid header uses L7 so i guess almost all components in Kubeflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. All those components trust the ingress gateway and the headers it provides. Thus, they accept all requests from the ingress gateway without any filtering done at the AuthorizationPolicy, like kfp components do (api server)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about the jupyterlab itself in the customer namespace. They should be created on demand based on the notebook custom resource.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you refer to the AuthorizationPolicy created by the profile-controller? This is created dynamically when a new user namespace is created and it also allows/denies traffic to jupyterlab server pods. This looks like this, which indeed has L7 rules.

Comment on lines 87 to 88
##### One waypoint per namespace
User namespaces are secured via an AuthorizationPolicy that is created dynamically by kubeflow's `profile-controller`. Most of the rules in this AuthorizationPolicy have L7 features[^8]. In sidecar mode, this AuthorizationPolicy is enforced to all traffic in the namespace by having no `selector`. This is due to the nature of those namespaces, where services are created dynamically. In Istio ambient mode, the only way to enforce a namespace-wide L7 AuthorizationPolicy is by proxying traffic to the whole namespace through the waypoint and then attaching the AuthorizationPolicy to the waypoint. This results in the authorization rules being enforced for all traffic proxied by the waypoint. This means that **a separate waypoint needs to be deployed in every user namespace**.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we want to have exactly the opposite. Never ever waypoint proxies in user namespaces. Also the istio documentation and my talk with @kimwnasptd and the istio/solo.io founder at Kubecon supports that you do not need them in user namespaces. See https://istio.io/latest/docs/ambient/usage/waypoint/#configure-resources-to-use-a-cross-namespace-waypoint-proxy. That is the most essential part that you do not have waypoint proxies there and do not destroy zero overhead user namespaces.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replied in #891 (comment))

Comment on lines 100 to 108
#### One waypoint per namespace
Apparently, deploying one waypoint per namespace adds a resource overhead for every user namespace, while it also reduces network efficiency for requests to that namespace. As mentioned above, this architecture is required by Istio's ambient model to enforce namespace-wide L7 AuthorizationPolicies. Since the `profile-controller` creates a default policy with L7 rules (e.g. path or headers based), one waypoint must correspond only to one namespace. The same Authorization model is not possible using a single cross-namespace waypoint. [Asking in Istio slack](https://istio.slack.com/archives/C041EQL1XMY/p1751905575407449?thread_ts=1751636618.350839&cid=C041EQL1XMY), they confirmed that:
>Typically, the authorization policy would be applied to a service so the destination would be "the thing my policy is attached to".
>...
>I don't think there is a shorthand expression for "all services in X namespace".

This leaves no other way to enforce an L7 AuthorizationPolicy to all services in a namespace, where services are being created dynamically.

Note that the introduced resource overhead affects cases with a large number of **idle** namespaces. For namespaces that are being actively used, the resource footprint is still greatly reduced, given that there are no sidecar containers injected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds really problematic. So i would rather stay with Istio-cni until this is possible. Follow up of https://github.com/kubeflow/community/pull/891/files#r2248473959

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should block on this. Of course, providing zero overhead namespaces without any namespaced resources should be the north star we 're after. The fact though that Istio ambient does not enable us to do so should not be a reason for Kubeflow to not be ambient compatible. Both istio-cni and ambient mode have their pros and cons and Kubeflow should provide administrators the option to decide on their own which mode suits them. For example, clusters with few namespaces but hundreds of workloads per namespace would benefit greatly from using ambient, contrary to clusters with hundreds of idle namespaces and a few workloads per namespace.

As mentioned in the proposal, instead of blocking, I suggest:

  1. Implement code that does not provide enforce any high-level architecture. That means that it will be configurable enough to support both waypoint per namespace and cross-namespace waypoint architectures. If Istio ambient starts supporting L7 namespace isolation using a single cross-namespace waypoint, then the code is already compatible or at minimum needs a few tweaks rather than the whole "sidecar -> istio ambient & gateway API" migration effort.
  2. Provide an ambient overlay in manifests. The Istio-cni will still be the default one but providing an overlay enables users and distributions to try ambient and experience benefits and downsides. This can greatly help the next point.
  3. Ask feature from Istio to support L7 namespace isolation using a single cross-namespace waypoint. I 've already asked them in Slack and plan to bring this up in their meetings too. However, I suspect that they have a good reason this is not supported right now. If there is adoption of Istio ambient by kubeflow users, then this puts us in a better position by providing valid use cases for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with all of @orfeas-k's points.

While having waypoints per namespace will have a different resource footprint from CNI, still this is up to the vendors to decide if they are OK with or not rather than not supporting it in Kubeflow.

The example can keep the CNI, but the project should still give the option to end users to have Kubeflow with K8s Gateway API and ambient.

Copy link
Member

@juliusvonkohout juliusvonkohout Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i agree, lets not make istio ambient the default until this is possible. So for the time being istio-cni is the default and istio-ambient opt-in overlay of istio-cni.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think technically they 'll need to be two separate overlays but let's keep this implementation detail for implementation time. We all agree that Istio-cni will be the default one, provided in the example.

Copy link
Member

@juliusvonkohout juliusvonkohout Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to keep a single istio folder. In there we also have the legacy insecure overlay to disable istio-cni. https://github.com/kubeflow/manifests/tree/master/common/istio/istio-install/overlays/insecure. please lets not create multiple istio versions to maintain, but one base version with overlays.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was brought up in the Istio community meeting by @ca-scribner and it looks like they don't see any blockers to this. It was not an intentional omission, just something that hasn't been implemented or decided on. There was nothing brought up that was a hard blocker, more just challenges to overcome. It would though be a big effort, so before they actually action it they'd want to see more than a single person requesting the feature.

I filed a feature request here https://github.com/istio/istio.io/issues/16781. What we need to do now is to:

  1. get more consensus/demand, which can only be done by getting more people to use the ambient overlay (at least from Kubeflow's side)
  2. bring it up again in a future community meeting once the demand is obvious

I 'm updating the proposal with this information to make sure it is documented.

Include information about the feature request in Istio.

Signed-off-by: Orfeas Kourkakis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants