@RyanRosario RyanRosario commented Nov 20, 2025

What type of PR is this?

kind/cleanup

What this PR does / why we need it:

Adds an E2E test for the multi-port enhancement. verifyTrafficRouting is implemented; verifyMetrics will follow.

Which issue(s) this PR fixes:

Fixes #1768

Does this PR introduce a user-facing change?:

NONE


@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 20, 2025
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 20, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: RyanRosario / name: Ryan R. Rosario (bc9f24d)

@netlify

netlify bot commented Nov 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit bc9f24d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69379106ea78a70008e85877
😎 Deploy Preview https://deploy-preview-1885--gateway-api-inference-extension.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RyanRosario
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 20, 2025
@k8s-ci-robot
Contributor

Hi @RyanRosario. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 20, 2025
@RyanRosario RyanRosario changed the title [WIP] Add e2e test for multiport InferencePool enhancement Add e2e test for multiport InferencePool enhancement Nov 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2025
@RyanRosario
Author

Hey @danehans and @nirrozenbaum , my first PR is ready for review.

@nirrozenbaum
Contributor

nirrozenbaum commented Nov 25, 2025

/ok-to-test

Thanks @RyanRosario. seems like your PR needs a rebase.
it would be good to solve conflicts in order to see if the tests are passing.

additionally - please note that your commits are not verified, and if the PR is ready for review it would be good to remove the /hold to let others know it is ready.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 25, 2025
@RyanRosario
Author

/retest

@RyanRosario
Author

Thank you for your patience!

The failing test seems to be related to issue 1872. Can we continue with review or should 1872 be resolved first?

@nirrozenbaum
Contributor

> Thank you for your patience!
>
> The failing test seems to be related to issue 1872. Can we continue with review or should 1872 be resolved first?

failing test isn't blocking the review but it is blocking the merge.
if this is failing due to a flake, triggering a /retest should solve it (eventually).
if it's failing consistently, we might have a hidden issue here.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 2, 2025
@RyanRosario
Author

/hold cancel

All initial feedback regarding the rebase, tests, and global configuration changes has been addressed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2025
func createInferExt(testConfig *testutils.TestConfig, filePath string) {
	inManifests := testutils.ReadYaml(filePath)

	// This image needs to be updated to open multiple ports and respond.
Contributor

Is this comment still valid?

Author

@RyanRosario RyanRosario Dec 3, 2025

Good catch. The code comment is stale. I have the fix locally and will push it in the next batched commit to avoid triggering a full CI run just for this doc update.

go.sum Outdated
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2 h1:jpcvIRr3GLoUoEKRkHKSmGjxb6lWwrBlJsXc+eUYQHM=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.2/go.mod h1:Ve9uj1L+deCXFrPOk1LpFXqTg7LCFzFso6PA48q/XZw=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0 h1:hSfpvjjTQXQY2Fol2CS0QHMNs/WI1MOSGzCm1KhM5ec=
sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0/go.mod h1:Ve9uj1L+deCXFrPOk1LpFXqTg7LCFzFso6PA48q/XZw=
Author

This file will be removed before merge. It seemed to help me pass the CI test (which was passing locally).

sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.34.0 // indirect
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 // indirect
sigs.k8s.io/randfill v1.0.0 // indirect
)
Author

This file will be removed before merge.

@RyanRosario
Author

Adding @LukeAVanDrie to help review to reduce some load.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2025
@LukeAVanDrie
Contributor

Great work on the verification logic! This test looks really good from two standpoints:

  1. Traffic Routing: Proving traffic actually hits different ports (via the x-inference-port header).
  2. Virtual Pod Abstraction: Proving the EPP sees "virtual" pods (via the ...-rank-N metric label).

I have a few suggestions to make the test suite more robust and easier to debug. We want to avoid flakiness where possible and improve maintainability.

Contributor

@LukeAVanDrie LukeAVanDrie left a comment

Great job on this test! The only major change I'm asking for is to simplify the test setup a bit where possible.


var _ = ginkgo.Describe("InferencePool", func() {
	var infObjective *v1alpha2.InferenceObjective
	ginkgo.BeforeEach(func() {
Contributor

You are dynamically modifying the existing vllm-llama3-8b-instruct Deployment in BeforeEach and trying to revert it in AfterEach. If the test crashes or the runner is killed halfway through, AfterEach might not fully restore the state. This leaves the cluster "dirty" (configured for multi-port) which will cause subsequent single-port tests to fail.

I would encourage creating separate test resources that already have the ports and args configured correctly (e.g., testdata/inferencepool-multiport.yaml) with a corresponding Deployment manifest. This way if the test fails, we just delete the new resources, and the original single-port Deployment remains untouched. It also makes the code a bit easier to understand and maintain.

  • In the test, apply this specific manifest.
  • In AfterEach, just delete these resources.

This ensures that even if the test fails cataclysmically, the original environment is untouched. It also removes the need for the complex argument-parsing code in BeforeEach.

for idx, msg := range originalMessages {
	msgCopy := make(map[string]any, len(msg))
	maps.Copy(msgCopy, msg)
	// Inject a unique nonce into the content of *EACH* message
Contributor

Good catch on adding the 'Nonce' to the prompt. Since our scheduling layer prioritizes prefix caching (affinity), sending identical requests would likely result in them all going to the same pod, which defeats the purpose of this test. Varying the prompt body seems like the best approach here.

I think we can simplify the implementation. Instead of the complex struct reflection logic, consider just prepending a simple string prefix.
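
A minimal sketch of the string-prefix approach, assuming the prompt is a plain string. `withNonce` and the `[req-N]` marker format are hypothetical names chosen for illustration, not code from this PR. For chat-style requests, the same helper could be applied to each message's content.

```go
package main

import "fmt"

// withNonce prepends a per-request marker so that prompts diverge within
// their first few characters. The "[req-N]" format is an arbitrary choice.
func withNonce(prompt string, n int) string {
	return fmt.Sprintf("[req-%d] %s", n, prompt)
}

func main() {
	// Each iteration produces a prompt with a different leading marker.
	for i := 0; i < 3; i++ {
		fmt.Println(withNonce("Say hello.", i))
	}
}
```

Putting the marker at the front rather than the end is the point: prefix-cache affinity keys on the start of the prompt, so a trailing nonce would likely not change routing.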

// Probability: need to compute estimate of number of batches to send to have high confidence of hitting all ports.
// Using the Coupon Collector's Problem formula: n * H_n, where H_n is the nth harmonic number.
// This gives us an expected number of trials to collect all coupons (ports).
batches := int(math.Ceil(numPorts * harmonicNumber(numPorts)))
Contributor

I see you used the "Coupon Collector's Problem" to calculate the necessary requests. This is very cool; however, for E2E tests, we should prioritize determinism and simplicity over efficiency.

For numPorts = 2 this is probably overkill. Instead of calculating the perfect number of requests, let's just brute-force it. Sending 20 requests sequentially is all but guaranteed to hit both ports if the system is working.


curlCmd := getCurlCommand(envoyName, testConfig.NsName, envoyPort, modelName, curlTimeout, t.api, currentPromptOrMessages, false)

resp, err := testutils.ExecCommandInPod(testConfig, "curl", "curl", curlCmd)
Contributor

Wrapping kubectl exec (which is what ExecCommandInPod does) in goroutines adds a lot of complexity (WaitGroups, channels) for a small gain. Since we are only targeting 2 ports, a simple sequential loop is likely enough and much easier to debug.

// Instead of hardcoding arguments, we can instead replace the arguments that need
// to be changed, preserving any others that may exist.
var newArgs []string
skipNext := false
Contributor

If you move to a dedicated manifest file, this entire block of code disappears, making the test much cleaner and easier to maintain.

}, testConfig.ExistsTimeout, testConfig.Interval).Should(gomega.Succeed())

ginkgo.By("Restarting EPP to force configuration reload")
// We delete the EPP *POD*, not the deployment. The Deployment will recreate it immediately.
Contributor

Nice! This is a good call.


for _, modelServerPod := range modelServerPods {
	for rank := range numPorts {
		metricQueueSize := fmt.Sprintf(
Contributor

Good verification here!

Contributor

If this test flakes in CI, it helps to know why. Inside your verification loop, can you add a GinkgoWriter log to print the final map of actualPort and actualModel before the assertions?

Example: ginkgo.GinkgoWriter.Printf("Port distribution: %v\n", actualPort)

This way, if it fails, we can see if it was a total connectivity failure (empty map) or a distribution failure (stuck on one port).

// This gives us an expected number of trials to collect all coupons (ports).
batches := int(math.Ceil(numPorts * harmonicNumber(numPorts)))
// Send curl requests to verify routing to all target ports in the InferencePool.
gomega.Eventually(func() error {
Contributor

Wrapping the entire batch of request generation inside Eventually can be risky. If one request fails, we retry the whole batch, which is slow and heavy. Since we already wait for the deployment to be ready in BeforeEach, we can probably remove the Eventually wrapper around the traffic generation loop. Instead, just loop 20 times.

If a curl fails, you can use a small retry loop just for that specific command (like you did in generateTraffic), but let's avoid retrying the entire batch verification unless absolutely necessary.

)

const (
	firstPort = 8000
Contributor

If you switch to using a static testdata/inferencepool-multiport.yaml, please make sure to add a comment here saying something like:

// Must match ports defined in testdata/inferencepool-multiport.yaml.

This helps future contributors who might edit the YAML but forget to update the Go test.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 9, 2025

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Update E2E tests to include multiport case

5 participants