@Miciah Miciah commented Dec 16, 2025

Enable Horizontal Pod Autoscaling (HPA) on Istio. Hard-code the autoscaling parameters based on the cluster infrastructure config:

  • If the infrastructure topology is "SingleReplica", set minimum to 1.
  • Otherwise, set minimum to 2.
  • In any case, set maximum to 10.
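
The rule in the list above can be sketched as a small Go helper. This is only an illustration: `TopologyMode`, `hpaBounds`, and `boundsForTopology` are hypothetical names defined locally for the example, not identifiers taken from this PR or from the OpenShift API package.

```go
package main

import "fmt"

// TopologyMode is a local stand-in for the cluster infrastructure topology
// value; it is deliberately not imported from the OpenShift API package.
type TopologyMode string

const (
	SingleReplica   TopologyMode = "SingleReplica"
	HighlyAvailable TopologyMode = "HighlyAvailable"
)

// hpaBounds holds the hard-coded autoscaling limits (hypothetical type).
type hpaBounds struct {
	Min, Max int32
}

// boundsForTopology returns a minimum of 1 for SingleReplica clusters and a
// minimum of 2 otherwise; the maximum is 10 in every case.
func boundsForTopology(mode TopologyMode) hpaBounds {
	b := hpaBounds{Min: 2, Max: 10}
	if mode == SingleReplica {
		b.Min = 1
	}
	return b
}

func main() {
	for _, mode := range []TopologyMode{SingleReplica, HighlyAvailable} {
		b := boundsForTopology(mode)
		fmt.Printf("%s: min=%d max=%d\n", mode, b.Min, b.Max)
	}
}
```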

openshift-ci bot commented Dec 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Dec 16, 2025
Miciah commented Dec 16, 2025

/test e2e-aws-operator

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gcs278 for approval. For more information see the Code Review Process.

Miciah commented Dec 17, 2025

e2e-aws-operator passed. deployments.json shows that the istiod deployment has 2 replicas:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/deployments.json' | jq '.items.[]|select(.metadata.name=="istiod-openshift-gateway")|.spec.replicas'
2
%

However, there are some errors due to missing metrics:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and .source.component=="horizontal-pod-autoscaler")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
istiod-openshift-gateway: New size: 2; reason: Current number of replicas below Spec.MinReplicas
istiod-openshift-gateway: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
istiod-openshift-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
%
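
For readers who prefer Go over jq, the event filter above can be approximated as follows. This is a sketch under assumptions: the `event` struct covers only the fields the jq expression touches, `hpaMessages` is a hypothetical helper name, the sample data in `main` is invented for illustration, and timestamps are assumed to be same-format UTC strings so that string comparison orders them correctly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// event models only the fields used by the jq filter.
type event struct {
	Metadata struct {
		Namespace         string `json:"namespace"`
		CreationTimestamp string `json:"creationTimestamp"`
	} `json:"metadata"`
	Source struct {
		Component string `json:"component"`
	} `json:"source"`
	InvolvedObject struct {
		Name string `json:"name"`
	} `json:"involvedObject"`
	FirstTimestamp *string `json:"firstTimestamp"`
	Message        string  `json:"message"`
}

// timestamp mirrors jq's `.firstTimestamp // .metadata.creationTimestamp`:
// fall back to the creation timestamp when firstTimestamp is null or absent.
func (e event) timestamp() string {
	if e.FirstTimestamp != nil {
		return *e.FirstTimestamp
	}
	return e.Metadata.CreationTimestamp
}

// hpaMessages returns "<object name>: <message>" for every HPA event in the
// openshift-ingress namespace, ordered by timestamp.
func hpaMessages(data []byte) ([]string, error) {
	var list struct {
		Items []event `json:"items"`
	}
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var matched []event
	for _, e := range list.Items {
		if e.Metadata.Namespace == "openshift-ingress" &&
			e.Source.Component == "horizontal-pod-autoscaler" {
			matched = append(matched, e)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].timestamp() < matched[j].timestamp()
	})
	out := make([]string, 0, len(matched))
	for _, e := range matched {
		out = append(out, e.InvolvedObject.Name+": "+e.Message)
	}
	return out, nil
}

func main() {
	// Invented sample input; the real events.json is much larger.
	sample := []byte(`{"items":[{"metadata":{"namespace":"openshift-ingress",
		"creationTimestamp":"2025-12-16T10:00:00Z"},
		"source":{"component":"horizontal-pod-autoscaler"},
		"involvedObject":{"name":"istiod-openshift-gateway"},
		"firstTimestamp":null,"message":"example message"}]}`)
	msgs, err := hpaMessages(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```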

Moreover, the proxy pods are not scaled out beyond 1 replica:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1326/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/2001047586798047232/artifacts/e2e-aws-operator/gather-extra/artifacts/events.json' | jq -r '[.items.[]|select(.metadata.namespace=="openshift-ingress" and (.metadata.name|startswith("test-gateway")) and .source.component=="deployment-controller")]|sort_by(.firstTimestamp//.metadata.creationTimestamp)|.[]|.involvedObject.name+": "+.message'
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-74dff54c89 from 0 to 1
test-gateway-openshift-default: Scaled up replica set test-gateway-openshift-default-577b5f5db7 from 0 to 1
test-gateway-openshift-default: Scaled down replica set test-gateway-openshift-default-74dff54c89 from 1 to 0
test-gateway-update-openshift-default: Scaled up replica set test-gateway-update-openshift-default-75c4dbd84f from 0 to 1
%


// Watch the cluster infrastructure config in case the infrastructure
// topology changes.
if err := c.Watch(source.Kind[client.Object](operatorCache, &configv1.Infrastructure{}, reconciler.enqueueRequestForSomeGatewayClass())); err != nil {
Member

The question here is: if you change the Sail Operator parameter, will the change be reflected in the Pilot config?
Also, if this parameter is also used for the Gateway deployment's replicas, will Sail reconcile all the gateways and HPAs?

(Just some thoughts for testing.)

Member

Answering myself:
changing the HPA parameter for pilot does not affect the GatewayClass; it needs its own definition from spec.infrastructure.paramRefs

rikatz commented Dec 17, 2025

testing on my own cluster (AWS, HA):

kubectl get hpa -A
NAMESPACE           NAME                       REFERENCE                             TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
openshift-ingress   istiod-openshift-gateway   Deployment/istiod-openshift-gateway   cpu: <unknown>/80%   2         10        2          36s

It is scaling pilot, but I am not sure about the gateways; let me test.

rikatz commented Dec 17, 2025

IIUC, we can pass the GatewayClass configuration to Sail on the Istio resource, and it will create the configmap for us (even when the gateway class is created by the user): https://github.com/openshift-service-mesh/sail-operator/blob/f3a3c6d7e6f2c2ae412b09ce1a78bc93258b4db4/resources/v1.28.0/charts/istiod/templates/gateway-class-configmap.yaml

We can make CIO set this property, and the configmap will be created for us (this even includes setting the annotations for internal-only clusters!). The problem is that today the json.RawMessage used by the Sail API does not accept the content correctly. I have asked the Sail Operator team about this and whether they have been using it; otherwise, we can improve this workflow in Sail Operator and consume it in CIO.

@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from 1ea51fc to f407b29 Compare December 22, 2025 03:20
Miciah commented Dec 22, 2025

/test e2e-aws-operator

Enable Horizontal Pod Autoscaling (HPA) on Istio.  Hard-code the
autoscaling parameters based on the cluster infrastructure config:

- If the infrastructure topology is "SingleReplica", set minimum to 1.
- Otherwise, set minimum to 2.
- In any case, set maximum to 10.

* pkg/operator/controller/gatewayclass/controller.go
(gatewayclassControllerIndexFieldName): New const.
(NewUnmanaged): Watch infrastructures.  Initialize fieldIndexer in the
reconciler so that it can be used to create an index over gatewayclasses
later.  The index cannot be created directly in NewUnmanaged as the
gatewayclasses resource might not yet exist when NewUnmanaged is called.
(reconciler): Add fieldIndexer and startGatewayclassControllerIndex.
(Reconcile): Get the cluster infrastructure object and pass it to
ensureIstio.  Create an index over gatewayclasses by
spec.controllerName, using fieldIndexer from the reconciler,
gatewayclassControllerIndexFieldName for the index field name, and
startGatewayclassControllerIndex to ensure the index is only created
once, on first reconciliation.
* pkg/operator/controller/gatewayclass/controller_test.go
(Test_Reconcile): Add test cases for missing cluster infrastructure
config.  Add a test case for SingleReplica topology mode.  Add a test
case with multiple gatewayclasses. Add the cluster infrastructure config
object to existingObjects for the existing test cases. Add the expected
HPA configuration to the expected Istio CRs in test cases.  Initialize
fieldIndexer with a fake indexer in the reconciler.
(FakeIndexer, (FakeIndexer).IndexField): New type and method,
used in Test_Reconcile.
* pkg/operator/controller/gatewayclass/istio.go (ensureIstio): Add an
infraConfig parameter, and pass the argument to desiredIstio.  Use the
new index to list gatewayclasses with the OpenShift gateway controller
name, and pass the list of gatewayclasses to desiredIstio as well.
* pkg/operator/controller/gatewayclass/istio.go (desiredIstio): Add an
infraConfig parameter and a gatewayclasses parameter, and use the
arguments to look up the infrastructure topology mode and configure
HPA accordingly.
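
The "create the index only once, on first reconciliation" pattern described in the commit message above can be sketched as follows. This is a sketch under assumptions: it supposes that startGatewayclassControllerIndex behaves like a sync.Once, and the `fieldIndexer` interface, `countingIndexer`, the `reconciler` fields, and the `indexFieldName` value are local stand-ins rather than the PR's actual definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// indexFieldName is a placeholder value; the real constant's value is not
// shown in this conversation.
const indexFieldName = "spec.controllerName"

// fieldIndexer is a simplified local stand-in for the indexer that the
// reconciler holds; it is not controller-runtime's interface.
type fieldIndexer interface {
	IndexField(field string) error
}

// countingIndexer records how often an index was requested, much like a
// fake indexer in a unit test would.
type countingIndexer struct{ calls int }

func (c *countingIndexer) IndexField(field string) error {
	c.calls++
	return nil
}

type reconciler struct {
	indexer    fieldIndexer
	startIndex sync.Once
	indexErr   error
}

// reconcile creates the index lazily on the first call, because the indexed
// resource might not exist yet when the controller is constructed.
func (r *reconciler) reconcile() error {
	r.startIndex.Do(func() {
		r.indexErr = r.indexer.IndexField(indexFieldName)
	})
	if r.indexErr != nil {
		return r.indexErr
	}
	// Listing by the index and the rest of reconciliation would go here.
	return nil
}

func main() {
	idx := &countingIndexer{}
	r := &reconciler{indexer: idx}
	for i := 0; i < 3; i++ {
		if err := r.reconcile(); err != nil {
			fmt.Println("reconcile failed:", err)
			return
		}
	}
	fmt.Println("index created", idx.calls, "time(s)")
}
```

One consequence of this particular sketch is that an error from the first attempt is remembered and returned on every later reconciliation; whether the PR retries index creation instead is not stated here.
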
@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from f407b29 to 3e6e8c5 Compare December 22, 2025 14:28
Miciah commented Dec 22, 2025

/test e2e-aws-operator

Miciah commented Dec 22, 2025

/test e2e-aws-operator

Don't overwrite the gateway variable's value from
createGatewayWithListeners with a possibly nil value from
assertGatewaySuccessful.

The gateway variable is used in a cleanup handler, which would panic if
the variable had a nil value.  The gateway value from
assertGatewaySuccessful isn't really needed, so it can be safely
ignored.

* test/e2e/gateway_api_test.go (testGatewayAPIDNSListenerWithNoHostname)
(testGatewayAPIDNSListenerUpdate): Ignore the return value from
assertGatewaySuccessful.
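
The failure mode described in this commit message can be reproduced in miniature. The sketch below uses hypothetical stand-ins (`gateway`, `createGateway`, `assertSuccessful`, `runCase`) rather than the real e2e helpers, and assumes only what the message states: the assertion helper may return a nil gateway, and the cleanup handler dereferences the variable.

```go
package main

import (
	"errors"
	"fmt"
)

type gateway struct{ name string }

// createGateway stands in for the helper that creates the gateway.
func createGateway(name string) (*gateway, error) {
	return &gateway{name: name}, nil
}

// assertSuccessful stands in for the assertion helper; on failure it
// returns a nil gateway along with the error.
func assertSuccessful(name string, ok bool) (*gateway, error) {
	if !ok {
		return nil, errors.New("gateway was not accepted")
	}
	return &gateway{name: name}, nil
}

// runCase reports which gateway the cleanup handler acted on.
func runCase(ok bool) (cleaned string, err error) {
	gw, err := createGateway("test-gateway")
	if err != nil {
		return "", err
	}
	// The cleanup handler dereferences gw, so gw must never become nil.
	defer func() { cleaned = gw.name }()

	// Buggy variant: `gw, err = assertSuccessful(gw.name, ok)` would set gw
	// to nil on failure and make the deferred cleanup panic.
	_, err = assertSuccessful(gw.name, ok)
	return "", err
}

func main() {
	cleaned, err := runCase(false)
	fmt.Println("cleaned up:", cleaned, "error:", err)
}
```
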
@Miciah Miciah force-pushed the gatewayclass-enable-Horizontal-Pod-Autoscaling branch from fad49ff to c566d6a Compare December 23, 2025 02:45
Miciah commented Dec 23, 2025

/test e2e-aws-operator

openshift-ci bot commented Dec 23, 2025

@Miciah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | c566d6a | link | true | /test e2e-aws-operator |


Set default values, such as the logger, in the controller options for
unmanaged controllers, namely the gateway-labeler, gateway-service-dns,
and gatewayclass controllers.

Before controller-runtime v0.21.0, controller-runtime implicitly set
default values for controller options for managed and unmanaged
controllers alike.  Controller-runtime internally used the
DefaultFromConfig method to do so.  Since controller-runtime v0.21.0,
these default values are not implicitly set for unmanaged
controllers[1].  In particular, this means that the logger is not
initialized, and so the controller initialization and any reconciliation
errors are not logged.  This commit explicitly sets default values by
calling DefaultFromConfig for unmanaged controllers.

Follow-up to commit 66485b8.

1. kubernetes-sigs/controller-runtime@d9ff283

* pkg/operator/controller/gateway-labeler/controller.go (NewUnmanaged):
* pkg/operator/controller/gateway-service-dns/controller.go
(NewUnmanaged):
* pkg/operator/controller/gatewayclass/controller.go (NewUnmanaged):
Use DefaultFromConfig to set default values for controller options.
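
To illustrate why the explicit defaulting call matters, here is a standard-library-only model of the pattern. Everything in it is a local stand-in: `options`, `managerDefaults`, and `applyDefaults` are hypothetical names, not controller-runtime types, and the real DefaultFromConfig signature should be taken from the controller-runtime source rather than from this sketch.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// options is a simplified stand-in for a controller's options.
type options struct {
	Logger                  *log.Logger
	MaxConcurrentReconciles int
}

// managerDefaults stands in for the defaults a manager would carry in its
// controller configuration.
func managerDefaults() options {
	return options{
		Logger:                  log.New(os.Stderr, "controller: ", log.LstdFlags),
		MaxConcurrentReconciles: 1,
	}
}

// applyDefaults fills in only the fields the caller left unset, which is the
// role the commit message assigns to DefaultFromConfig.
func (o *options) applyDefaults(d options) {
	if o.Logger == nil {
		o.Logger = d.Logger
	}
	if o.MaxConcurrentReconciles == 0 {
		o.MaxConcurrentReconciles = d.MaxConcurrentReconciles
	}
}

func main() {
	var opts options
	// Without defaulting, the logger is nil, so nothing would be logged.
	fmt.Println("logger set before defaulting:", opts.Logger != nil)

	opts.applyDefaults(managerDefaults())
	fmt.Println("logger set after defaulting:", opts.Logger != nil)
}
```
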