gatewayclass: Enable Horizontal Pod Autoscaling #1326

Miciah · 2025-12-16T21:51:26Z

Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config:

- If the infrastructure topology is "SingleReplica", set minimum to 1.
- Otherwise, set minimum to 2.
- In any case, set maximum to 10.
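
As a rough illustration of the rule above, the selection could look like the following Go sketch. The `TopologyMode` type, its constants, and the `hpaBounds` helper are local stand-ins invented for this example; they are not taken from the PR's diff or from the openshift/api package.

```go
package main

import "fmt"

// TopologyMode is a local stand-in for the infrastructure topology value
// read from the cluster infrastructure config (assumption: the real code
// uses the corresponding openshift/api type instead).
type TopologyMode string

const (
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
)

// hpaBounds is a hypothetical helper that returns the hard-coded HPA
// bounds described above: minimum 1 for SingleReplica, minimum 2 for
// anything else, and maximum 10 in every case.
func hpaBounds(topology TopologyMode) (minReplicas, maxReplicas int32) {
	minReplicas = 2
	if topology == SingleReplicaTopologyMode {
		minReplicas = 1
	}
	return minReplicas, 10
}

func main() {
	modes := []TopologyMode{SingleReplicaTopologyMode, HighlyAvailableTopologyMode}
	for _, mode := range modes {
		lo, hi := hpaBounds(mode)
		fmt.Printf("%s: min=%d max=%d\n", mode, lo, hi)
	}
}
```

Treating every value other than "SingleReplica" as highly available matches the "otherwise" wording of the description; whether the actual change handles unknown values the same way is not shown here.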

openshift-ci · 2025-12-16T21:51:30Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Miciah · 2025-12-16T21:51:33Z

/test e2e-aws-operator

openshift-ci · 2025-12-16T21:51:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gcs278 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Miciah · 2025-12-17T00:47:35Z

e2e-aws-operator passed. deployments.json shows that the istiod pod has 2 replicas:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/deployments.json' | jq '.items.[]|select(.metadata.name=="istiod-openshift-gateway")|.spec.replicas'
2
%

However, there are some errors due to missing metrics:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and .source.component=="horizontal-pod-autoscaler")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
%
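
For anyone who prefers to post-process events.json without jq, the same filter can be approximated in Go. This is only a sketch: the `event` struct covers just the fields the jq expression touches, `filterEvents` is a hypothetical helper, and `sampleEvents` is made-up input (including its timestamps and messages) rather than data from the CI run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// event models only the fields used by the jq filter above.
type event struct {
	Metadata struct {
		Namespace         string `json:"namespace"`
		CreationTimestamp string `json:"creationTimestamp"`
	} `json:"metadata"`
	Source struct {
		Component string `json:"component"`
	} `json:"source"`
	InvolvedObject struct {
		Name string `json:"name"`
	} `json:"involvedObject"`
	FirstTimestamp string `json:"firstTimestamp"`
	Message        string `json:"message"`
}

// sortKey mirrors `.firstTimestamp // .metadata.creationTimestamp`.
// A JSON null leaves FirstTimestamp empty, which triggers the fallback.
func (e event) sortKey() string {
	if e.FirstTimestamp != "" {
		return e.FirstTimestamp
	}
	return e.Metadata.CreationTimestamp
}

// filterEvents (hypothetical helper) keeps events from the given namespace
// and source component, sorts them by timestamp, and formats each one as
// "<object name>: <message>". UTC RFC 3339 timestamps in the same format
// sort correctly as plain strings.
func filterEvents(raw []byte, namespace, component string) ([]string, error) {
	var list struct {
		Items []event `json:"items"`
	}
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	var matched []event
	for _, e := range list.Items {
		if e.Metadata.Namespace == namespace && e.Source.Component == component {
			matched = append(matched, e)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].sortKey() < matched[j].sortKey()
	})
	lines := make([]string, 0, len(matched))
	for _, e := range matched {
		lines = append(lines, e.InvolvedObject.Name+": "+e.Message)
	}
	return lines, nil
}

// sampleEvents is invented test input, not output from the CI job.
const sampleEvents = `{"items": [
  {"metadata": {"namespace": "openshift-ingress", "creationTimestamp": "2025-01-01T00:00:02Z"},
   "source": {"component": "horizontal-pod-autoscaler"},
   "involvedObject": {"name": "istiod-openshift-gateway"},
   "firstTimestamp": null, "message": "second"},
  {"metadata": {"namespace": "openshift-ingress", "creationTimestamp": "2025-01-01T00:00:09Z"},
   "source": {"component": "horizontal-pod-autoscaler"},
   "involvedObject": {"name": "istiod-openshift-gateway"},
   "firstTimestamp": "2025-01-01T00:00:01Z", "message": "first"},
  {"metadata": {"namespace": "default", "creationTimestamp": "2025-01-01T00:00:00Z"},
   "source": {"component": "horizontal-pod-autoscaler"},
   "involvedObject": {"name": "other"}, "message": "other namespace"},
  {"metadata": {"namespace": "openshift-ingress", "creationTimestamp": "2025-01-01T00:00:00Z"},
   "source": {"component": "deployment-controller"},
   "involvedObject": {"name": "other"}, "message": "other component"}
]}`

func main() {
	lines, err := filterEvents([]byte(sampleEvents), "openshift-ingress", "horizontal-pod-autoscaler")
	if err != nil {
		panic(err)
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```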

Moreover, the proxy pods are not scaled out beyond 1 replica:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and (.metadata.name|startswith("test-gateway")) and .source.component=="deployment-controller")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-74dff54c89 from 0 to 1
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-577b5f5db7 from 0 to 1
test-gateway-openshift-default: Scaled down replica set test-gateway-openshift-default-74dff54c89 from 1 to 0
test-gateway-update-openshift-default: Scaled up replica set test-gateway-update-openshift-default-75c4dbd84f from 0 to 1
%

rikatz · 2025-12-17T14:00:00Z

pkg/operator/controller/gatewayclass/controller.go


+	// Watch the cluster infrastructure config in case the infrastructure
+	// topology changes.
+	if err := c.Watch(source.Kind[client.Object](operatorCache, &configv1.Infrastructure{}, reconciler.enqueueRequestForSomeGatewayClass())); err != nil {


Question: if you change the Sail Operator parameter, will the change be reflected in the Pilot config?
Also, if this parameter is also used for the Gateway deployment replicas, will Sail reconcile all the gateways and their HPAs?

(Just some thoughts for testing.)

Answering myself:
Changing the HPA parameter for Pilot does not affect the GatewayClass; it needs its own definition from spec.infrastructure.paramRefs.

rikatz · 2025-12-17T14:11:20Z

testing on my own cluster (AWS, HA):

kubectl get hpa -A
NAMESPACE           NAME                       REFERENCE                             TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
openshift-ingress   istiod-openshift-gateway   Deployment/istiod-openshift-gateway   cpu: <unknown>/80%   2         10        2          36s

It is scaling Pilot, but I am not sure about the gateways; let me test.

rikatz · 2025-12-17T17:00:47Z

IIUC, we can pass the GatewayClass configuration to Sail on the Istio resource, and Sail will create the configmap for us (even when the gateway class is created by the user): https://github.com/openshift-service-mesh/sail-operator/blob/f3a3c6d7e6f2c2ae412b09ce1a78bc93258b4db4/resources/v1.28.0/charts/istiod/templates/gateway-class-configmap.yaml

We can make CIO set this property, and the configmap will be created for us (this even includes setting the annotations for internal-only clusters!). The problem is that today the json.RawMessage used by the Sail API does not accept the content correctly. I have asked the Sail Operator team about this and whether they have been using it. If they have not, we can improve this workflow in Sail Operator and then consume it in CIO.

Miciah · 2025-12-22T03:21:32Z

/test e2e-aws-operator

Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config:

- If the infrastructure topology is "SingleReplica", set minimum to 1.
- Otherwise, set minimum to 2.
- In any case, set maximum to 10.

* pkg/operator/controller/gatewayclass/controller.go (gatewayclassControllerIndexFieldName): New const.
(NewUnmanaged): Watch infrastructures. Initialize fieldIndexer in the reconciler so that it can be used to create an index over gatewayclasses later. The index cannot be created directly in NewUnmanaged as the gatewayclasses resource might not yet exist when NewUnmanaged is called.
(reconciler): Add fieldIndexer and startGatewayclassControllerIndex.
(Reconcile): Get the cluster infrastructure object and pass it to ensureIstio. Create an index over gatewayclasses by spec.controllerName, using fieldIndexer from the reconciler, gatewayclassControllerIndexFieldName for the index field name, and startGatewayclassControllerIndex to ensure the index is only created once, on first reconciliation.

* pkg/operator/controller/gatewayclass/controller_test.go (Test_Reconcile): Add test cases for missing cluster infrastructure config. Add a test case for SingleReplica topology mode. Add a test case with multiple gatewayclasses. Add the cluster infrastructure config object to existingObjects for the existing test cases. Add the expected HPA configuration to the expected Istio CRs in test cases. Initialize fieldIndexer with a fake indexer in the reconciler.
(FakeIndexer, (FakeIndexer).IndexField): New type and method, used in Test_Reconcile.

* pkg/operator/controller/gatewayclass/istio.go (ensureIstio): Add an infraConfig parameter, and pass the argument to desiredIstio. Use the new index to list gatewayclasses with the OpenShift gateway controller name, and pass the list of gatewayclasses to desiredIstio as well.
(desiredIstio): Add an infraConfig parameter and a gatewayclasses parameter, and use the arguments to look up the infrastructure topology mode and configure HPA accordingly.
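
The "create the index only once, on first reconciliation" idea can be sketched with `sync.Once`. This is a guess at the shape of the approach, not the PR's code: the `fieldIndexer` interface here is a simplified local stand-in (the controller-runtime indexer takes more arguments), and `countingIndexer`, `indexFieldName`, and the `reconcile` method are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fieldIndexer is a simplified stand-in for the indexer held by the
// reconciler; it only records which field should be indexed.
type fieldIndexer interface {
	IndexField(field string) error
}

// countingIndexer is a fake indexer that counts IndexField calls, similar
// in spirit to a fake used in unit tests.
type countingIndexer struct {
	calls int
}

func (c *countingIndexer) IndexField(field string) error {
	c.calls++
	return nil
}

// indexFieldName is a placeholder value for the index field name.
const indexFieldName = "spec.controllerName"

type reconciler struct {
	fieldIndexer fieldIndexer
	// startIndex guards index creation so that it happens at most once,
	// deferred until the first reconciliation.
	startIndex sync.Once
	indexErr   error
}

// reconcile creates the index on its first invocation and reuses the
// stored result afterwards. Note that with sync.Once a failed attempt is
// not retried; the stored error is returned on every later call.
func (r *reconciler) reconcile() error {
	r.startIndex.Do(func() {
		r.indexErr = r.fieldIndexer.IndexField(indexFieldName)
	})
	if r.indexErr != nil {
		return fmt.Errorf("failed to create index: %w", r.indexErr)
	}
	// ... the rest of reconciliation would list objects via the index ...
	return nil
}

func main() {
	idx := &countingIndexer{}
	r := &reconciler{fieldIndexer: idx}
	for i := 0; i < 3; i++ {
		if err := r.reconcile(); err != nil {
			panic(err)
		}
	}
	fmt.Println("IndexField calls:", idx.calls)
}
```

Deferring index creation to the first reconciliation avoids touching a resource that may not exist yet at controller construction time, at the cost of carrying a small amount of extra state in the reconciler.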

Miciah · 2025-12-22T14:28:17Z

/test e2e-aws-operator

Miciah · 2025-12-22T21:19:59Z

/test e2e-aws-operator

Don't overwrite the gateway variable's value from createGatewayWithListeners with a possibly nil value from assertGatewaySuccessful. The gateway variable is used in a cleanup handler, which would panic if the variable had a nil value. The gateway value from assertGatewaySuccessful isn't really needed, so it can be safely ignored.

* test/e2e/gateway_api_test.go (testGatewayAPIDNSListenerWithNoHostname)
(testGatewayAPIDNSListenerUpdate): Ignore the return value from assertGatewaySuccessful.
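
The failure mode described in this commit message can be reproduced in miniature. In the sketch below, `gateway`, `createGateway`, `assertSuccessful`, and `runTest` are simplified, hypothetical stand-ins for the e2e helpers; the point is only that a deferred cleanup dereferences the variable, so it must keep the non-nil value from creation.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a minimal stand-in for the Gateway object used by the test.
type gateway struct {
	name string
}

// createGateway always returns a non-nil gateway in this sketch.
func createGateway(name string) *gateway {
	return &gateway{name: name}
}

// assertSuccessful returns a nil gateway together with an error when the
// assertion fails, which is what makes overwriting the caller's variable
// dangerous.
func assertSuccessful(g *gateway, ok bool) (*gateway, error) {
	if !ok {
		return nil, errors.New("gateway was not accepted")
	}
	return g, nil
}

// runTest keeps the value from createGateway and discards the gateway
// returned by assertSuccessful, so the deferred cleanup never sees nil.
func runTest(ok bool) (cleaned string, err error) {
	gw := createGateway("test-gateway")
	defer func() {
		// Cleanup dereferences gw; this would panic if gw had been
		// overwritten with a nil value.
		cleaned = gw.name
	}()

	// Writing `gw, err = assertSuccessful(gw, ok)` here would set gw to
	// nil on failure and make the deferred cleanup panic.
	if _, err = assertSuccessful(gw, ok); err != nil {
		return "", err
	}
	return "", nil
}

func main() {
	cleaned, err := runTest(false)
	fmt.Printf("cleaned=%q err=%v\n", cleaned, err)
}
```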

Miciah · 2025-12-23T02:46:11Z

/test e2e-aws-operator

openshift-ci · 2025-12-23T05:26:34Z

@Miciah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | `c566d6a` | link | true | `/test e2e-aws-operator` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Set default values, such as the logger, in the controller options for unmanaged controllers, namely the gateway-labeler, gateway-service-dns, and gatewayclass controllers.

Before controller-runtime v0.21.0, controller-runtime implicitly set default values for controller options for managed and unmanaged controllers alike. Controller-runtime internally used the DefaultFromConfig method to do so. Since controller-runtime v0.21.0, these default values are not implicitly set for unmanaged controllers[1]. In particular, this means that the logger is not initialized, and so the controller initialization and any reconciliation errors are not logged.

This commit explicitly sets default values by calling DefaultFromConfig for unmanaged controllers.

Follow-up to commit 66485b8.

1. kubernetes-sigs/controller-runtime@d9ff283

* pkg/operator/controller/gateway-labeler/controller.go (NewUnmanaged):
* pkg/operator/controller/gateway-service-dns/controller.go (NewUnmanaged):
* pkg/operator/controller/gatewayclass/controller.go (NewUnmanaged): Use DefaultFromConfig to set default values for controller options.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 16, 2025

rikatz reviewed Dec 17, 2025

Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from 1ea51fc to f407b29 Compare December 22, 2025 03:20

Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from f407b29 to 3e6e8c5 Compare December 22, 2025 14:28

Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from fad49ff to c566d6a Compare December 23, 2025 02:45
