OCPBUGS-62627: cluster operator ingress reported Progressing=True with reason=Reconciling for a node reboot #1299
base: master
Conversation
@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
701a14e to 68520e0 (Compare)
@davidesalerno: This pull request references Jira Issue OCPBUGS-62627, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
/retest
pre-merge tested and passed
@lihongan: This PR has been marked as verified.
Sorry, I forgot to assign this a few weeks back.
// Ignore infrastructure-driven deployment rollouts when computing
// the IngressController's Progressing condition.
ignoreReasons: []string{
	ReasonReplicasStabilizing.String(), // Node reboots, pod evictions
Does this mean the operator will no longer report Progressing=True during explicit scaling operations? If so, is this intended? Typically, users expect to see a Progressing state when they change the configuration to scale the controller.
In this case the user would change the IngressController CR to set the desired number of replicas; the check deployment.Status.ObservedGeneration == deployment.Generation that we do in computeDeploymentRollingOutStatusAndReason will then fail, and the operator will keep reporting Progressing=True.
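For reference, a rough sketch of the two checks being discussed; the helper name, the "DeploymentUpdating" reason, and the return shape are illustrative assumptions, not the operator's actual computeDeploymentRollingOutStatusAndReason:

```go
import appsv1 "k8s.io/api/apps/v1"

// rollingOutReason sketches the checks discussed above: a generation
// mismatch (spec change not yet observed by the deployment controller)
// and pods that are not yet available for the desired replica count.
func rollingOutReason(d *appsv1.Deployment) (rollingOut bool, reason string) {
	// Editing spec.replicas bumps metadata.generation; until the deployment
	// controller observes it, status.observedGeneration lags behind and the
	// rollout is still considered in progress.
	if d.Status.ObservedGeneration != d.Generation {
		return true, "DeploymentUpdating" // assumed reason name
	}
	// The spec has been observed, but not all desired pods are available yet.
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	if d.Status.AvailableReplicas < desired {
		return true, "PodsStarting" // excluded from Progressing by this PR
	}
	return false, ""
}
```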
I see what you mean about the generation check failing initially, but I think that window is super brief (just until the deployment controller observes the spec update).
The bulk of the time is usually spent waiting for the new pods to actually become ready. In that state (ObservedGeneration == Generation but AvailableReplicas < Spec.Replicas), the logic returns ReasonPodsStarting. Since ReasonPodsStarting is in the ignoreReasons list, the operator will flip back to Progressing=False almost immediately, even while the new pods are still spinning up.
If we want users to see Progressing=True for the duration of the scale-up (which I think they'd expect), we might need to reconsider ignoring ReasonPodsStarting or find a way to distinguish 'pods starting due to explicit scale' from 'pods starting due to restart'.
This is the behaviour requested by the ClusterOperator team: if the pods do not start within the grace period, the expected status is Degraded=True.
/cc @hongkailiu
Let me understand the situation:

1. One of these happens:
   a. The user changes the number of replicas.
   b. A node where the controlled pod is running is rebooted.
2. The Deployment acts on 1.a or 1.b above (generation checking).
3. The pods act on 2.

Before this PR the behaviour was Progressing=True until 3 is done; now it is only until 2 is done, without waiting for 3. The result is that Progressing=True could be very short.

We definitely want this if the story starts from 1.b: "Operators should not report Progressing when they are reconciling (without action) a previously known state."

Could it be overkill for 1.a? (Ideally we would distinguish the two cases and only change the behaviour for 1.b.)

I read the API doc again about Progressing and it does not clearly define at what point Progressing should stop. I feel it is OK that even for 1.a we do not wait for the pods to reconcile to the new number of replicas: Progressing=False once the Deployment has started rolling out the change but has not yet finished. Arguably, we could say the Deployment has noticed the user action and the cluster operator's work (rolling out the change) is done; whatever the Deployment wants to do about it is its decision to make.
68520e0 to 7e5e577 (Compare)
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
OCPBUGS-62627: cluster operator ingress reported Progressing=True with reason=Reconciling for a node reboot. Signed-off-by: Davide Salerno <dsalerno@redhat.com>
7e5e577 to 421ee05 (Compare)
@davidesalerno: The following test failed. Full PR test history. Your PR dashboard.
OCPBUGS-62627: cluster operator ingress reported Progressing=True with reason=Reconciling for a node reboot

This PR will avoid the Cluster Ingress Operator reporting the Progressing=True condition if there is a node reboot or scale-up, as per the new Cluster Operator requirement (link). The Cluster Ingress Operator will continue to report the DeploymentRollingOut=true condition with a different Reason if there is a node reboot or scale-up, and we will exclude these reasons (ReplicasStabilizing and PodsStarting) from the ones causing the operator Progressing=True condition.

In order to detect that a Deployment has been fully rolled out in the past, we will use the Deployment condition Progressing with reason NewReplicaSetAvailable, similarly to what is done in library-go (see PR).
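For context, the detection described above looks roughly like the sketch below; the helper name hasBeenFullyRolledOut is hypothetical, but the condition type, status, and the NewReplicaSetAvailable reason are the standard Deployment condition fields the deployment controller sets once the newest ReplicaSet is fully available:

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasBeenFullyRolledOut reports whether the Deployment has completed a
// rollout at least once, by looking for the Progressing condition with
// reason NewReplicaSetAvailable.
func hasBeenFullyRolledOut(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing &&
			cond.Status == corev1.ConditionTrue &&
			cond.Reason == "NewReplicaSetAvailable" {
			return true
		}
	}
	return false
}
```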