
Conversation

@davidesalerno
Contributor

@davidesalerno davidesalerno commented Oct 31, 2025

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

To detect that a Deployment has been fully rolled out in the past, we use the Deployment Progressing condition with reason NewReplicaSetAvailable, similarly to what is done in library-go (see PR).
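
A minimal Go sketch of the logic outlined above, not the PR's actual code: the helper names are hypothetical, the reason strings are taken from this description, and `NewReplicaSetAvailable` is the signal library-go also relies on.

```go
package ingressprogressing

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// Reason strings taken from the PR description; the actual constants in the
// operator may be named differently.
const (
	reasonReplicasStabilizing = "ReplicasStabilizing"
	reasonPodsStarting        = "PodsStarting"
)

// hasBeenFullyRolledOut reports whether the Deployment completed a rollout at
// some point in the past: a fully rolled-out Deployment carries the
// Progressing condition with status True and reason NewReplicaSetAvailable.
func hasBeenFullyRolledOut(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionTrue &&
			cond.Reason == "NewReplicaSetAvailable" {
			return true
		}
	}
	return false
}

// operatorProgressing sketches the filtering: a rollout whose reason is on the
// ignore list (node reboot, pod restart) still yields DeploymentRollingOut=True
// but no longer drives the operator's Progressing condition.
func operatorProgressing(deploymentRollingOut bool, reason string) bool {
	ignoredReasons := map[string]bool{
		reasonReplicasStabilizing: true,
		reasonPodsStarting:        true,
	}
	return deploymentRollingOut && !ignoredReasons[reason]
}
```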

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 31, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 31, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from frobware and knobunc October 31, 2025 14:16
@davidesalerno
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 31, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from lihongan October 31, 2025 14:17
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@davidesalerno davidesalerno force-pushed the ocpbugs62627master branch 2 times, most recently from 701a14e to 68520e0 Compare November 5, 2025 13:57
@davidesalerno davidesalerno changed the title [WIP] OCPBUGS-62627: cluster operator ingress reported Progressing=True wit… OCPBUGS-62627: cluster operator ingress reported Progressing=True wit… Nov 6, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2025
@openshift-ci-robot
Contributor

@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

…h reason=Reconciling for a node reboot

This PR prevents the Cluster Ingress Operator from reporting the Progressing=True condition when there is a node reboot or scale-up, as per the new Cluster Operator requirement (link).

The Cluster Ingress Operator will continue to report the DeploymentRollingOut=True condition with a different reason when there is a node reboot or scale-up, and we exclude those reasons (ReplicasStabilizing and PodsStarting) from the ones that cause the operator's Progressing=True condition.

To detect that a Deployment has been fully rolled out in the past, we use the Deployment Progressing condition with reason NewReplicaSetAvailable, similarly to what is done in library-go (see PR).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@davidesalerno
Contributor Author

/retest

@lihongan
Contributor

pre-merge tested and passed
/label qe-approved
/verified by @lihongan

## watching co/ingress status while rebooting the node

## with the fix:
$ oc get co/ingress -w
NAME      VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0-2025-11-17-075629-test-ci-ln-scc84qk-latest   True        False         False      23m     


## without the fix
$ oc get co/ingress -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      34m     
ingress   4.21.0-0.nightly-2025-11-13-042845   True        True          False      35m     ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available......
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      36m     

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 17, 2025
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 17, 2025
@openshift-ci-robot
Contributor

@lihongan: This PR has been marked as verified by @lihongan.

Details

In response to this:

pre-merge tested and passed
/label qe-approved
/verified by @lihongan

## watching co/ingress status while rebooting the node

## with the fix:
$ oc get co/ingress -w
NAME      VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0-2025-11-17-075629-test-ci-ln-scc84qk-latest   True        False         False      23m     


## without the fix
$ oc get co/ingress -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      34m     
ingress   4.21.0-0.nightly-2025-11-13-042845   True        True          False      35m     ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available......
ingress   4.21.0-0.nightly-2025-11-13-042845   True        False         False      36m     

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@candita
Contributor

candita commented Nov 25, 2025

Sorry, I forgot to assign this a few weeks back.
/assign @bentito

// Ignore infrastructure-driven deployment rollouts when computing
// the IngressController's Progressing condition.
ignoreReasons: []string{
	ReasonReplicasStabilizing.String(), // Node reboots, pod evictions
Contributor

Does this mean the operator will no longer report Progressing=True during explicit scaling operations? If so, is this intended? Typically, users expect to see a Progressing state when they change the configuration to scale the controller.

Contributor Author

In this case the user should change the Ingress CR to set the number of replicas they would like to have, so the deployment.Status.ObservedGeneration == deployment.Generation check that we do in computeDeploymentRollingOutStatusAndReason will fail and the operator status Progressing=True will be maintained.
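
A minimal sketch of that generation check, assuming appsv1 is k8s.io/api/apps/v1; the helper name is hypothetical and the real computeDeploymentRollingOutStatusAndReason handles more cases.

```go
package ingressprogressing

import appsv1 "k8s.io/api/apps/v1"

// deploymentSpecObserved is a hypothetical helper showing the generation check
// mentioned above. Editing the IngressController (for example, changing the
// number of replicas) updates the router Deployment's spec and bumps its
// Generation; until the deployment controller records that generation in
// Status.ObservedGeneration, the check fails and the operator keeps reporting
// Progressing=True for the user-driven change.
func deploymentSpecObserved(d *appsv1.Deployment) bool {
	return d.Status.ObservedGeneration == d.Generation
}
```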

Contributor

I see what you mean about the generation check failing initially, but I think that window is super brief (just until the deployment controller observes the spec update).

The bulk of the time is usually spent waiting for the new pods to actually become ready. In that state (ObservedGeneration == Generation but AvailableReplicas < Spec.Replicas), the logic returns ReasonPodsStarting. Since ReasonPodsStarting is in the ignoreReasons list, the operator will flip back to Progressing=False almost immediately, even while the new pods are still spinning up.

If we want users to see Progressing=True for the duration of the scale-up (which I think they'd expect), we might need to reconsider ignoring ReasonPodsStarting or find a way to distinguish 'pods starting due to an explicit scale' from 'pods starting due to a restart'.
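
A small sketch of the window described here, with illustrative names rather than the PR's exact code: once the new spec has been observed but replicas are not yet available, the reason becomes PodsStarting, which the ignore list filters out of the operator's Progressing condition.

```go
package ingressprogressing

import appsv1 "k8s.io/api/apps/v1"

// rollingOutReason illustrates the state transition the comment above
// describes; the reason strings mirror the PR description, not necessarily the
// operator's exact code.
func rollingOutReason(d *appsv1.Deployment) string {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	switch {
	case d.Status.ObservedGeneration != d.Generation:
		// The spec change has not been observed yet: user-driven progress.
		return "Reconciling"
	case d.Status.AvailableReplicas < desired:
		// Spec observed, but new pods are still becoming ready: this reason
		// is on the ignore list, so the operator reports Progressing=False.
		return "PodsStarting"
	default:
		// Rollout complete.
		return ""
	}
}
```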

Contributor Author

This should be the behavior requested by the ClusterOperator team: if the pods don't start within the grace period, then the expected status is Degraded=True.

/cc @hongkailiu

Member

Let me understand the situation:

  1. One of these happens

    a. User changes the number of replicas
    b. A node where the controlled pod is running is rebooted

  2. Deployment acts on 1.a or 1.b above (Generation checking)

  3. Pods act on 2

Before this pull, the behaviour was that Progressing stayed True until 3 is done; now it only waits until 2 is done, without waiting for 3.
The result is that Progressing=True could be very short.

We definitely want this if the story starts from 1.b.

"Operators should not report Progressing when they are reconciling (without action) a previously known state."

Could it be overkill for 1.a? (Ideally we would distinguish the two cases and only change the behaviour for 1.b.)

I read the API doc about Progressing again, and it does not give a clear definition of when Progressing should stop.

I feel it is OK that even for 1.a we do not wait for the pods to reconcile to the new number of replicas: Progressing=False once the Deployment has started rolling out the change, even if it has not yet finished.
Arguably, we could say the Deployment has noticed the user action and the cluster operator's work (rolling out the change) is done. Whatever the Deployment wants to do about it is its decision to make.

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Dec 15, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from bentito. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…h reason=Reconciling for a node reboot

Signed-off-by: Davide Salerno <dsalerno@redhat.com>
@openshift-ci
Contributor

openshift-ci bot commented Dec 15, 2025

@davidesalerno: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-pre-release-ossm | 421ee05 | link | false | /test e2e-aws-pre-release-ossm |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot requested a review from hongkailiu December 17, 2025 15:55