@cstockton
Contributor

The systemd defaults for StartLimitInterval and StartLimitBurst are 10s and 5, with a DefaultRestartUSec of 100ms. Most services set RestartSec to 3, so under most circumstances it takes 15s to restart 5 times and the limit of 5 starts within 10s is not exceeded. However, if other system processes (salt, cloud init) restart the service explicitly, or recovering system services within the --before chain trigger a restart, the limit can be exceeded, causing the service to be marked as failed. Since no services mark gotrue.service as required, it will remain offline until the next explicit restart is issued.

Setting StartLimitInterval and StartLimitBurst to 0, together with Restart=always and RestartSec=3, will prevent gotrue from being marked as failed.
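
For reference, the settings described above could look roughly like this in a unit file. This is a minimal sketch, not the actual diff from this PR: the section placement and the `StartLimitIntervalSec` spelling follow current systemd documentation (older releases also accept `StartLimitInterval` under `[Service]`), and everything else in gotrue.service is omitted.

```ini
# Hypothetical excerpt of gotrue.service; only the restart-related keys are shown.
[Unit]
# An interval of 0 disables start rate limiting, so repeated restarts
# no longer push the unit into the failed state.
StartLimitIntervalSec=0
StartLimitBurst=0

[Service]
# Restart regardless of how the process exited, waiting 3s between attempts.
Restart=always
RestartSec=3
```

The trade-off is that a service that can never start will keep retrying every 3s instead of stopping in a failed state.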

@cstockton cstockton requested review from a team as code owners November 28, 2025 18:14
@cstockton cstockton enabled auto-merge December 2, 2025 13:18
@samrose samrose requested review from darora and pcnc December 2, 2025 13:20
Collaborator

@samrose samrose left a comment

We'll need to create a testing AMI to test these changes thoroughly. I'll ask @LGUG2Z to run these tests, since he will also be helping us find ways to automate this kind of testing.

@samrose samrose requested a review from LGUG2Z December 2, 2025 13:52
Collaborator

@samrose samrose left a comment

When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.

@cstockton
Contributor Author

Hi @samrose - I've just updated the branch. Any updates on this?

@cstockton cstockton force-pushed the cs/gotrue-start-limit-fix branch 2 times, most recently from e097bf1 to 3ef31ba Compare December 8, 2025 17:25
Chris Stockton added 2 commits December 8, 2025 10:28
I've noticed all non-oneshot services set a `RestartSec` of `3s` and rely on the
systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together
these mean that, under typical conditions, a service is restarted indefinitely
until it comes back up, because `(3s * 5) > 10s`. It is still possible for a
service to enter a failed state under some scenarios, so this change
defensively sets both limits to 0/0 to keep services in restart loops.
@cstockton cstockton force-pushed the cs/gotrue-start-limit-fix branch from 3ef31ba to c89c805 Compare December 8, 2025 17:28
@samrose samrose self-requested a review December 11, 2025 05:15
Collaborator

@samrose samrose left a comment

Just needs a rebase
