@RyanRosario RyanRosario commented Nov 20, 2025

What type of PR is this?

kind/cleanup

What this PR does / why we need it:

Adds an E2E test for the multi-port InferencePool enhancement. Currently verifyTrafficRouting is implemented; verifyMetrics will follow.

Which issue(s) this PR fixes:

Fixes #1768

Does this PR introduce a user-facing change?:

NONE


@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 20, 2025
linux-foundation-easycla bot commented Nov 20, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: RyanRosario / name: Ryan R. Rosario (bc9f24d)

netlify bot commented Nov 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit bc9f24d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69379106ea78a70008e85877
😎 Deploy Preview https://deploy-preview-1885--gateway-api-inference-extension.netlify.app
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RyanRosario
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 20, 2025
@k8s-ci-robot
Contributor

Hi @RyanRosario. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 20, 2025
@RyanRosario RyanRosario changed the title [WIP] Add e2e test for multiport InferencePool enhancement Add e2e test for multiport InferencePool enhancement Nov 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2025
@RyanRosario
Author

Hey @danehans and @nirrozenbaum, my first PR is ready for review.

@nirrozenbaum
Contributor

nirrozenbaum commented Nov 25, 2025

/ok-to-test

Thanks @RyanRosario. It seems your PR needs a rebase.
It would be good to resolve the conflicts so we can see whether the tests pass.

Additionally, please note that your commits are not verified. If the PR is ready for review, it would be good to remove the /hold to let others know.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 25, 2025
@RyanRosario
Author

/retest

@RyanRosario
Author

Thank you for your patience!

The failing test seems to be related to issue 1872. Can we continue with review or should 1872 be resolved first?

@nirrozenbaum
Contributor

> Thank you for your patience!
>
> The failing test seems to be related to issue 1872. Can we continue with review or should 1872 be resolved first?

A failing test isn't blocking the review, but it is blocking the merge.
If it is failing due to a flake, triggering a /retest should solve it (eventually).
If it's failing consistently, we might have a hidden issue here.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 2, 2025
@RyanRosario
Author

/hold cancel

All initial feedback regarding the rebase, tests, and global configuration changes has been addressed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2025
go.sum Outdated
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2 h1:jpcvIRr3GLoUoEKRkHKSmGjxb6lWwrBlJsXc+eUYQHM=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2/go.mod h1:Ve9uj1L+deCXFrPOk1LpFXqTg7LCFzFso6PA48q/XZw=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0 h1:hSfpvjjTQXQY2Fol2CS0QHMNs/WI1MOSGzCm1KhM5ec=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0/go.mod h1:Ve9uj1L+deCXFrPOk1LpFXqTg7LCFzFso6PA48q/XZw=
Author

This file will be removed before merge. It seemed to help me pass the CI test (which was passing locally).

sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0 // indirect
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
sigs.k8s.io/randfill v1.0.0 // indirect
)
Author

This file will be removed before merge.

@RyanRosario
Author

Adding @LukeAVanDrie to help with the review and reduce some load.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2025
@LukeAVanDrie
Contributor

Great work on the verification logic! This test looks really good from two standpoints:

  1. Traffic Routing: Proving traffic actually hits different ports (via the x-inference-port header).
  2. Virtual Pod Abstraction: Proving the EPP sees "virtual" pods (via the ...-rank-N metric label).

I have a few suggestions to make the test suite more robust and easier to debug. We want to avoid flakiness where possible and improve maintainability.

Contributor

@LukeAVanDrie LukeAVanDrie left a comment

Great job on this test! The only major change I'm asking for is to simplify the test setup a bit where possible.


var _ = ginkgo.Describe("InferencePool", func() {
	var infObjective *v1alpha2.InferenceObjective
	ginkgo.BeforeEach(func() {
Contributor

You are dynamically modifying the existing vllm-llama3-8b-instruct Deployment in BeforeEach and trying to revert it in AfterEach. If the test crashes or the runner is killed halfway through, AfterEach might not fully restore the state. This leaves the cluster "dirty" (configured for multi-port) which will cause subsequent single-port tests to fail.

I would encourage creating separate test resources that already have the ports and args configured correctly (e.g., testdata/inferencepool-multiport.yaml) with a corresponding Deployment manifest. This way if the test fails, we just delete the new resources, and the original single-port Deployment remains untouched. It also makes the code a bit easier to understand and maintain.

  • In the test, apply this specific manifest.
  • In AfterEach, just delete these resources.

This ensures that even if the test fails cataclysmically, the original environment is untouched. It also removes the need for the complex argument-parsing code in BeforeEach.

Author

Note to Ryan: Create the YAML, read it, and create a new InferencePool from it. This leaves the existing InferencePool alone (though it may bring back the InferenceObject status issue).


curlCmd := getCurlCommand(envoyName, testConfig.NsName, envoyPort, modelName, curlTimeout, t.api, currentPromptOrMessages, false)

resp, err := testutils.ExecCommandInPod(testConfig, "curl", "curl", curlCmd)
Contributor

Wrapping kubectl exec (which is what ExecCommandInPod does) in goroutines adds a lot of complexity (WaitGroups, channels) for a small gain. Since we are only targeting 2 ports, a simple sequential loop is likely enough and much easier to debug.
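
A minimal sketch of such a sequential loop is shown below. It assumes a hypothetical `requestFunc` that performs one request (for example through ExecCommandInPod) and returns the port reported in the x-inference-port response header; `collectPorts` and the port values 8000 and 8001 are likewise illustrative and not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// requestFunc is a hypothetical stand-in for one curl invocation. It returns
// the port reported by the x-inference-port response header.
type requestFunc func(attempt int) (port int, err error)

// collectPorts sends requests one at a time, up to maxRequests, and stops as
// soon as every expected port has been observed. No goroutines, WaitGroups,
// or channels are involved, so a failure points at one specific request.
func collectPorts(send requestFunc, expected []int, maxRequests int) ([]int, error) {
	want := make(map[int]bool, len(expected))
	for _, p := range expected {
		want[p] = true
	}
	seen := make(map[int]bool)
	for i := 0; i < maxRequests && len(seen) < len(want); i++ {
		port, err := send(i)
		if err != nil {
			return nil, fmt.Errorf("request %d failed: %w", i, err)
		}
		if !want[port] {
			return nil, fmt.Errorf("request %d hit unexpected port %d", i, port)
		}
		seen[port] = true
	}
	if len(seen) < len(want) {
		return nil, fmt.Errorf("saw %d of %d ports after %d requests", len(seen), len(want), maxRequests)
	}
	ports := make([]int, 0, len(seen))
	for p := range seen {
		ports = append(ports, p)
	}
	sort.Ints(ports)
	return ports, nil
}

func main() {
	// Fake sender that alternates between two ports.
	alternate := func(attempt int) (int, error) { return 8000 + attempt%2, nil }
	ports, err := collectPorts(alternate, []int{8000, 8001}, 20)
	fmt.Println(ports, err)
}
```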

// Instead of hardcoding arguments, we can instead replace the arguments that need
// to be changed, preserving any others that may exist.
var newArgs []string
skipNext := false
Contributor

If you move to a dedicated manifest file, this entire block of code disappears, making the test much cleaner and easier to maintain.


for _, modelServerPod := range modelServerPods {
	for rank := range numPorts {
		metricQueueSize := fmt.Sprintf(
Contributor

Good verification here!

// This gives us an expected number of trials to collect all coupons (ports).
batches := int(math.Ceil(numPorts * harmonicNumber(numPorts)))
// Send curl requests to verify routing to all target ports in the InferencePool.
gomega.Eventually(func() error {
Contributor

Wrapping the entire batch of request generation inside Eventually can be risky. If one request fails, we retry the whole batch, which is slow and heavy. Since we already wait for the deployment to be ready in BeforeEach, we can probably remove the Eventually wrapper around the traffic generation loop. Instead, just loop 20 times.
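
For a rough sense of how a fixed loop of 20 compares with the coupon-collector estimate in the snippet, the sketch below computes both. It assumes each request lands on one of the ports uniformly and independently, which is a simplification and not a statement about how the endpoint picker actually schedules. `harmonicNumber` is re-implemented here for illustration (the PR's version is not shown), and `expectedRequests` and `missProbability` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
)

// harmonicNumber returns H_n = 1 + 1/2 + ... + 1/n.
func harmonicNumber(n int) float64 {
	h := 0.0
	for i := 1; i <= n; i++ {
		h += 1.0 / float64(i)
	}
	return h
}

// expectedRequests returns ceil(n * H_n), the expected number of uniform
// random requests needed to hit all n ports at least once.
func expectedRequests(numPorts int) int {
	return int(math.Ceil(float64(numPorts) * harmonicNumber(numPorts)))
}

// missProbability returns the probability that at least one of numPorts
// ports is never hit in the given number of uniform, independent requests.
// It uses inclusion-exclusion over the set of missed ports.
func missProbability(numPorts, requests int) float64 {
	pAll := 0.0
	binom := 1.0 // C(numPorts, j), updated incrementally
	for j := 0; j <= numPorts; j++ {
		term := binom * math.Pow(float64(numPorts-j)/float64(numPorts), float64(requests))
		if j%2 == 1 {
			term = -term
		}
		pAll += term
		binom = binom * float64(numPorts-j) / float64(j+1)
	}
	return 1 - pAll
}

func main() {
	fmt.Println("expected requests for 2 ports:", expectedRequests(2))
	fmt.Printf("chance of missing a port in 20 requests: %.2e\n", missProbability(2, 20))
}
```

Under the uniform assumption, two ports need 3 requests in expectation, and 20 requests miss a port with probability 2 × (1/2)^20, roughly 2 in a million.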

If a curl fails, you can use a small retry loop just for that specific command (like you did in generateTraffic), but let's avoid retrying the entire batch verification unless absolutely necessary.

)

const (
	firstPort = 8000
Contributor

If you switch to using a static testdata/inferencepool-multiport.yaml, please make sure to add a comment here saying something like:

// Must match ports defined in testdata/inferencepool-multiport.yaml.

This helps future contributors who might edit the YAML but forget to update the Go test.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 9, 2025
@ahg-g
Contributor

ahg-g commented Dec 10, 2025

Is this PR ready? Anything left to be addressed?

@RyanRosario
Author

> Is this PR ready? Anything left to be addressed?

Hi Abdullah, I am working on addressing Luke's feedback. After that, I would still need an LGTM to get the final stamp, if I understand correctly.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Update E2E tests to include multiport case

6 participants