
Conversation

@davidesalerno
Contributor

@davidesalerno davidesalerno commented Oct 31, 2025

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

To detect that a Deployment has been fully rolled out in the past, we use the Deployment Progressing condition with reason NewReplicaSetAvailable, similarly to what is done in library-go (see PR).
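
A minimal Go sketch of the logic outlined above, not the PR's actual code: the helper names are hypothetical, the reason strings are taken from this description, and `NewReplicaSetAvailable` is the signal library-go also relies on.

```go
package ingressprogressing

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// Reason strings taken from the PR description; the actual constants in the
// operator may be named differently.
const (
	reasonReplicasStabilizing = "ReplicasStabilizing"
	reasonPodsStarting        = "PodsStarting"
)

// hasBeenFullyRolledOut reports whether the Deployment completed a rollout at
// some point in the past: a fully rolled-out Deployment carries the
// Progressing condition with status True and reason NewReplicaSetAvailable.
func hasBeenFullyRolledOut(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionTrue &&
			cond.Reason == "NewReplicaSetAvailable" {
			return true
		}
	}
	return false
}

// operatorProgressing sketches the filtering: a rollout whose reason is on the
// ignore list (node reboot, pod restart) still yields DeploymentRollingOut=True
// but no longer drives the operator's Progressing condition.
func operatorProgressing(deploymentRollingOut bool, reason string) bool {
	ignoredReasons := map[string]bool{
		reasonReplicasStabilizing: true,
		reasonPodsStarting:        true,
	}
	return deploymentRollingOut && !ignoredReasons[reason]
}
```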

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 31, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 31, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from frobware and knobunc October 31, 2025 14:16
@davidesalerno
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 31, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from lihongan October 31, 2025 14:17
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@davidesalerno davidesalerno force-pushed the ocpbugs62627master branch 2 times, most recently from 701a14e to 68520e0 Compare November 5, 2025 13:57
@davidesalerno davidesalerno changed the title [WIP] OCPBUGS-62627: cluster operator ingress reported Progressing=True wit… OCPBUGS-62627: cluster operator ingress reported Progressing=True wit… Nov 6, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

To detect that a Deployment has been fully rolled out in the past, we use the Deployment Progressing condition with reason NewReplicaSetAvailable, similarly to what is done in library-go (see PR).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@davidesalerno
Contributor Author

/retest

@lihongan
Contributor

pre-merge tested and passed
/label qe-approved
/verified by @lihongan

## watching co/ingress status while rebooting the node

## with the fix:
$ oc get co/ingress -w
NAME      VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0-2025-11-17-075629-test-ci-ln-scc84qk-latest   True        False         False      23m     


## without the fix
$ oc get co/ingress -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      34m     
ingress   4.21.0-0.nightly-2025-11-13-042845   True        True          False      35m     ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available......
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      36m     

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 17, 2025
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 17, 2025
@openshift-ci-robot
Contributor

@lihongan: This PR has been marked as verified by @lihongan.

Details

In response to this:

pre-merge tested and passed
/label qe-approved
/verified by @lihongan

## watching co/ingress status while rebooting the node

## with the fix:
$ oc get co/ingress -w
NAME      VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0-2025-11-17-075629-test-ci-ln-scc84qk-latest   True        False         False      23m     


## without the fix
$ oc get co/ingress -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      34m     
ingress   4.21.0-0.nightly-2025-11-13-042845   True        True          False      35m     ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available......
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      36m     

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@candita
Contributor

candita commented Nov 25, 2025

Sorry, I forgot to assign this a few weeks back.
/assign @bentito

// Ignore infrastructure-driven deployment rollouts when computing
// the IngressController's Progressing condition.
ignoreReasons: []string{
	ReasonReplicasStabilizing.String(), // Node reboots, pod evictions
Contributor

Does this mean the operator will no longer report Progressing=True during explicit scaling operations? If so, is this intended? Typically, users expect to see a Progressing state when they change the configuration to scale the controller.

Contributor Author

In this case the user should change the Ingress CR to set the number of replicas they would like to have, so the deployment.Status.ObservedGeneration == deployment.Generation check that we do in computeDeploymentRollingOutStatusAndReason will fail and the operator status Progressing=True will be maintained.
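
A minimal sketch of that generation check, assuming appsv1 is k8s.io/api/apps/v1; the helper name is hypothetical and the real computeDeploymentRollingOutStatusAndReason handles more cases.

```go
package ingressprogressing

import appsv1 "k8s.io/api/apps/v1"

// deploymentSpecObserved is a hypothetical helper showing the generation check
// mentioned above. Editing the IngressController (for example, changing the
// number of replicas) updates the router Deployment's spec and bumps its
// Generation; until the deployment controller records that generation in
// Status.ObservedGeneration, the check fails and the operator keeps reporting
// Progressing=True for the user-driven change.
func deploymentSpecObserved(d *appsv1.Deployment) bool {
	return d.Status.ObservedGeneration == d.Generation
}
```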

Contributor

I see what you mean about the generation check failing initially, but I think that window is super brief (just until the deployment controller observes the spec update).

The bulk of the time is usually spent waiting for the new pods to actually become ready. In that state (ObservedGeneration == Generation but AvailableReplicas < Spec.Replicas), the logic returns ReasonPodsStarting. Since ReasonPodsStarting is in the ignoreReasons list, the operator will flip back to Progressing=False almost immediately, even while the new pods are still spinning up.

If we want users to see Progressing=True for the duration of the scale-up (which I think they'd expect), we might need to reconsider ignoring ReasonPodsStarting or find a way to distinguish 'pods starting due to an explicit scale' from 'pods starting due to a restart'.
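
A small sketch of the window described here, with illustrative names rather than the PR's exact code: once the new spec has been observed but replicas are not yet available, the reason becomes PodsStarting, which the ignore list filters out of the operator's Progressing condition.

```go
package ingressprogressing

import appsv1 "k8s.io/api/apps/v1"

// rollingOutReason illustrates the state transition the comment above
// describes; the reason strings mirror the PR description, not necessarily the
// operator's exact code.
func rollingOutReason(d *appsv1.Deployment) string {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	switch {
	case d.Status.ObservedGeneration != d.Generation:
		// The spec change has not been observed yet: user-driven progress.
		return "Reconciling"
	case d.Status.AvailableReplicas < desired:
		// Spec observed, but new pods are still becoming ready: this reason
		// is on the ignore list, so the operator reports Progressing=False.
		return "PodsStarting"
	default:
		// Rollout complete.
		return ""
	}
}
```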

Contributor Author

This should be the behavior requested by the ClusterOperator team: if the pods don't start within the grace period, then the expected status is Degraded=True.

/cc @hongkailiu

Member

Let me understand the situation:

  1. One of these happens

    a. User changes the number of replicas
    b. A node where the controlled pod is running is rebooted

  2. Deployment acts on 1.a or 1.b above (Generation checking)

  3. Pods act on 2

Before this pull, the behaviour was that Progressing stayed True until 3 is done; now it only waits until 2 is done, without waiting for 3.
The result is that Progressing=True could be very short.

We definitely want this if the story starts from 1.b.

"Operators should not report Progressing when they are reconciling (without action) a previously known state."

Could it be overkill for 1.a? (Ideally we would distinguish the two cases and only change the behaviour for 1.b.)

I read the API doc about Progressing again, and it does not give a clear definition of when Progressing should stop.

I feel it is OK that even for 1.a we do not wait for the pods to reconcile to the new number of replicas: Progressing=False once the Deployment has started rolling out the change, even if it has not yet finished.
Arguably, we could say the Deployment has noticed the user action and the cluster operator's work (rolling out the change) is done. Whatever the Deployment wants to do about it is its decision to make.

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Dec 15, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from bentito. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…h reason=Reconciling for a node reboot

Signed-off-by: Davide Salerno <dsalerno@redhat.com>
@openshift-ci
Contributor

openshift-ci bot commented Dec 15, 2025

@davidesalerno: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-pre-release-ossm | 421ee05 | link | false | /test e2e-aws-pre-release-ossm |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot requested a review from hongkailiu December 17, 2025 15:55