Add some safety when closing orchestrator and template manager #1609
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
mostly dealt with paying attention to context cancellation
Revert "mostly dealt with paying attention to context cancellation" (reverts commit 7351c9b)
case p.newSlots <- slot:
case <-ctx.Done():
	return ctx.Err()
}
Bug: Resource leak when context cancelled during slot creation
When the context is cancelled in the inner select while trying to send the slot to p.newSlots, the newly created slot from createNetworkSlot is leaked. The slot's network resources are allocated and the newSlotsAvailableCounter is incremented, but the slot is never added to the channel or cleaned up. This leaks network resources (slot storage, network configuration) since Close() only cleans up slots that were successfully sent to the channel.
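A minimal sketch of cleaning that one slot in the cancelled branch; `createNetworkSlot` and `cleanup` are taken from the comment above and the quoted `Return` code, so treat the exact signatures as assumptions rather than the PR's actual change:

```go
slot, err := p.createNetworkSlot(ctx)
if err != nil {
	return fmt.Errorf("failed to create network slot: %w", err)
}

select {
case p.newSlots <- slot:
case <-ctx.Done():
	// The slot was never handed off; release it instead of leaking it.
	if cerr := p.cleanup(ctx, slot); cerr != nil {
		return fmt.Errorf("%w; cleanup: %w", ctx.Err(), cerr)
	}

	return ctx.Err()
}
```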
This would be a force-quit situation; I don't think we're worried about resources left around.
can we clean that one slot regardless?
if the context is done, the slot is still not cleared, correct?
Bug: Ignored error return from network pool Populate
The networkPool.Populate(ctx) function now returns an error, but the return value is discarded and the service wrapper always returns nil. This means any errors from populating the network pool (like ErrClosed or context errors) won't be propagated to the service error handling, preventing proper shutdown signaling when the network pool fails.
packages/orchestrator/main.go#L387-L392
infra/packages/orchestrator/main.go
Lines 387 to 392 in 830dcaf
networkPool := network.NewPool(network.NewSlotsPoolSize, network.ReusedSlotsPoolSize, slotStorage, config.NetworkConfig)
startService("network pool", func(ctx context.Context) error {
	networkPool.Populate(ctx)
	return nil
})
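A sketch of simply forwarding that error so the service wrapper can react to it; this assumes `startService` treats a non-nil return as a reason to start shutting down:

```go
startService("network pool", func(ctx context.Context) error {
	if err := networkPool.Populate(ctx); err != nil {
		return fmt.Errorf("populating network pool: %w", err)
	}

	return nil
})
```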
done := make(chan struct{}, 1)

go func() {
	f.wg.Wait()
we should block new sandboxes from being started at this point, otherwise the wait group could fail
The way this is implemented, the caller sets the status to Draining before calling Wait, which should prevent new sandboxes from being scheduled. I don't think we want to block sandboxes from being scheduled in multiple places, for the same reason we don't want to set a max time limit in multiple places - if implementation changes, we don't want to hunt down all the places we blocked, closed, etc.
I think it might be worth it here though as it is a limitation of using the WaitGroup - you can't add new tasks to the WaitGroup after you do Wait, otherwise it may panic
I don't think it panics in that scenario. I set this playground up, let me know if it looks like what you expected: https://go.dev/play/p/bDPHf1ejwGs
It's not that easy to simulate, but we've already hit this a few times. Here are the links to the source code:
- https://github.com/golang/go/blob/ad91f5d241f3b8e85dc866d0087c3f13af96ef33/src/sync/waitgroup.go#L121C3-L121C69
- https://github.com/golang/go/blob/ad91f5d241f3b8e85dc866d0087c3f13af96ef33/src/sync/waitgroup.go#L213
Here is the PR:
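If we do decide to guard against it, the usual workaround for this WaitGroup limitation is a drain flag checked before every Add. A hypothetical sketch; none of these names come from the PR:

```go
package sandbox

import (
	"errors"
	"sync"
)

// factory pairs a WaitGroup with a drain flag so nothing can call Add
// once Wait has started.
type factory struct {
	mu       sync.Mutex
	draining bool
	wg       sync.WaitGroup
}

// acquire reserves a slot for a new sandbox; it fails once draining has begun.
func (f *factory) acquire() error {
	f.mu.Lock()
	defer f.mu.Unlock()

	if f.draining {
		return errors.New("factory is draining; no new sandboxes")
	}
	f.wg.Add(1)

	return nil
}

// release marks one sandbox as finished.
func (f *factory) release() { f.wg.Done() }

// drain blocks new acquisitions, then waits for in-flight sandboxes.
func (f *factory) drain() {
	f.mu.Lock()
	f.draining = true
	f.mu.Unlock()

	f.wg.Wait()
}
```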
packages/orchestrator/main.go
Outdated
if serviceInfo.GetStatus() == orchestratorinfo.ServiceInfoStatus_Healthy {
	serviceInfo.SetStatus(ctx, orchestratorinfo.ServiceInfoStatus_Draining)

	logger.L().Info(ctx, "Waiting for api to read orchestrator status")
	sleepCtx(closeCtx, 15*time.Second)
you need to wait before you await in the sandbox factory, as new sandboxes can still be scheduled in that time (same for templates)
Assuming that a) the window is quite small, and b) the sync.WaitGroup doesn't panic in that scenario (see reply to other comment), is this still worth complicating the solution to prevent?
if b) is true, then you need to be sure at least that the 15 seconds pass before the await for both templates and sbxs exits
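For clarity, the ordering being asked for is roughly this; `sleepCtx`, `closeCtx` and the factory wait come from this PR's diff, but the exact call shapes here are assumptions:

```go
// 1. Mark the node as draining so the API stops scheduling onto it.
serviceInfo.SetStatus(ctx, orchestratorinfo.ServiceInfoStatus_Draining)

// 2. Give the API the full window to observe the Draining status.
logger.L().Info(ctx, "Waiting for api to read orchestrator status")
sleepCtx(closeCtx, 15*time.Second)

// 3. Only then wait for in-flight sandboxes and template builds.
_ = sandboxFactory.Wait(closeCtx) // error handling omitted in this sketch
```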
packages/orchestrator/main.go
Outdated
logger.L().Info(ctx, "Force shutdown signal received")
cancel()
cancelCloseCtx()
config.ForceStop = true
Bug: Data race on config.ForceStop during shutdown
The goroutine spawned at line 570 writes to config.ForceStop at line 575 when a second signal is received, while the main goroutine reads config.ForceStop at line 619 during the closers loop. These operations can occur concurrently without any synchronization, creating a data race. The goroutine runs independently of closeWg.Wait(), so if a second signal arrives during the shutdown process, the write and read can race.
Additional Locations (1)
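One way to close that race is an atomic flag instead of a plain field, sketched here with the standard library's `atomic.Bool`; whether `Config` itself can be changed this way is an open question:

```go
var forceStop atomic.Bool // sync/atomic; zero value is false

go func() {
	<-sigs
	logger.L().Info(ctx, "Force shutdown signal received")
	cancel()
	cancelCloseCtx()
	forceStop.Store(true) // safe to race with readers
}()

// later, in the closers loop on the main goroutine:
if forceStop.Load() {
	// take the forced-shutdown path
}
```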
logger.L().Info(ctx, "Waiting for sandbox clients to close connections")
sleepCtx(closeCtx, 15*time.Second)
})
Bug: Race condition allows new sandboxes during WaitGroup.Wait
The three closeWg.Go goroutines run in parallel, which means sandboxFactory.Wait() may start before the draining status is set. Since the sandbox creation service doesn't check draining status (unlike the template service), new sandbox requests can still be accepted. Each new sandbox calls wg.Add(1) on the factory's WaitGroup while Wait() is running. This can cause shutdown to hang indefinitely as long as new sandboxes keep being created, since the counter never reaches zero.
This is exactly what we want. Since we're likely to be waiting for quite a while for the orchestrator to shut down, one more sandbox isn't a big deal. It won't deadlock, since the new sandbox will add/subtract to the counter exactly the same way that the rest of them will.
%{ if !update_stanza }
FORCE_STOP = "true"
%{ endif }
should we also remove this if the kill_timeout is now 24h?
I think that's there to prevent single-node clusters from having to wait too long to restart their allocation. I'm not totally sure. They don't seem to serve an important purpose anymore, but I'd like input from @jakubno and @ValentaTomas before we remove them.
packages/orchestrator/main.go
Outdated
go func() {
	<-sigs
	logger.L().Info(ctx, "Force shutdown signal received")
	cancel()
do we need to cancel this context?
(for both this and the previous comment) My theory on "force shutdown" was that, if we were in that scenario, we want everything to bail immediately - a dirty shutdown that leaves files and processes behind is exactly what we want, and cancelling the context gets us that. It effectively simulates setting FORCE_QUIT=true at runtime.
The initial idea behind adding FORCE_QUIT was to avoid waiting for the sbxs/template builds when developing locally, while still shutting everything down cleanly
oooh. definitely not my understanding, I'll clean that up
Bug: Send on closed channel can panic during shutdown
Removing the <-p.done check from the Return method creates a race condition that can cause a panic. During shutdown, Pool.Close() calls close(p.reusedSlots) after close(p.done). If a concurrent Return call (such as from sandbox cleanup running with context.WithoutCancel) passes the first select and is executing ResetInternet when p.reusedSlots gets closed, the subsequent p.reusedSlots <- slot send operation will panic. The old code with case <-p.done: return ErrClosed provided a safe exit when the pool was closing before the channel was closed.
packages/orchestrator/internal/sandbox/network/pool.go#L209-L221
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 209 to 221 in 61d21d2
select {
case <-ctx.Done():
	return ctx.Err()
case p.reusedSlots <- slot:
	returnedSlotCounter.Add(ctx, 1)
	reusableSlotsAvailableCounter.Add(ctx, 1)
default:
	err := p.cleanup(ctx, slot)
	if err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
}
Bug: Sending to closed channel can panic during shutdown
The Return method previously checked p.done before sending to p.reusedSlots, but this check was removed. Since Return is called asynchronously from sandbox cleanup (in a goroutine at sandbox.go lines 999-1005), and Close() calls close(p.reusedSlots) at line 260, there's a race condition: an async Return goroutine may try to send to p.reusedSlots after it's been closed, causing a panic. The select with default case doesn't protect against this because sending to a closed channel panics immediately rather than blocking or falling through to default.
packages/orchestrator/internal/sandbox/network/pool.go#L190-L223
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 190 to 223 in 58a0ed3
func (p *Pool) Return(ctx context.Context, slot *Slot) error {
	// avoid checking p.done, as we want to return the slot even if the pool is closed.
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}

	err := slot.ResetInternet(ctx)
	if err != nil {
		// Cleanup the slot if resetting internet fails
		if cerr := p.cleanup(ctx, slot); cerr != nil {
			return fmt.Errorf("reset internet: %w; cleanup: %w", err, cerr)
		}

		return fmt.Errorf("error resetting slot internet access: %w", err)
	}

	select {
	case <-ctx.Done():
		return ctx.Err()
	case p.reusedSlots <- slot:
		returnedSlotCounter.Add(ctx, 1)
		reusableSlotsAvailableCounter.Add(ctx, 1)
	default:
		err := p.cleanup(ctx, slot)
		if err != nil {
			return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
		}
	}

	return nil
packages/orchestrator/internal/sandbox/network/pool.go#L259-L260
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 259 to 260 in 58a0ed3
close(p.reusedSlots)
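Note that re-adding a `case <-p.done:` to the send select is not enough on its own, because a send on a closed channel panics even when another case is ready. A safer shape is to stop closing `p.reusedSlots` in `Close()` and let `p.done` carry the shutdown signal; a rough sketch against the quoted code, not the PR's actual fix:

```go
// In Close(): signal shutdown, but leave reusedSlots open. With concurrent
// async Return calls there is no safe moment to close it.
close(p.done)

// In Return(): clean the slot up when the pool is done instead of parking it.
select {
case <-ctx.Done():
	return ctx.Err()
case <-p.done:
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
case p.reusedSlots <- slot:
	returnedSlotCounter.Add(ctx, 1)
	reusableSlotsAvailableCounter.Add(ctx, 1)
default:
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
}
```

A straggler can still land in the channel buffer after shutdown; whether that matters depends on how `Close()` drains the remaining slots.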
select {
case <-p.done:
	return nil
Bug: Network slot leaked when pool is closed during Return
When p.done is closed in the Return function, the slot is not cleaned up and is simply dropped, causing a resource leak. The Populate function specifically tries to clean up by calling p.Return(ctx, slot) when the pool closes (lines 132-140), but Return doesn't actually perform cleanup in that case - it just returns nil. The ctx.Done() case properly calls p.cleanup(ctx, slot), but the p.done case should do the same. This was noted in the PR review comment "can we clean that one slot regardless?"
Additional Locations (1)
select {
case <-p.done:
	return nil
Bug: Network slot resource leak when pool is closing
In the Return method, when p.done is received (pool is shutting down), the function returns nil without calling p.cleanup(ctx, slot). At this point, slot.ResetInternet has already succeeded, but the slot's network namespace and storage allocation are never released. This leaks network resources. The ctx.Done() case correctly calls p.cleanup, and the default case also correctly calls p.cleanup, but the p.done case does not. This was noted in PR discussion by @dobrac asking "can we clean that one slot regardless?".
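The change both bot comments are asking for is small: the `p.done` branch can release the slot before returning, mirroring what the `ctx.Done()` and `default` branches already do. A sketch, not the actual diff:

```go
case <-p.done:
	// Pool is closing: the slot will never be reused, so free its network
	// resources instead of dropping it.
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to clean up slot '%d': %w", slot.Idx, err)
	}

	return nil
```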
Can we fix the lint error + resolve the bot convos, so it is clear that this can be reviewed again?
Also hide some "errors" that aren't really errors.
Note
Adds context-driven shutdown across orchestrator/template-manager, stop-all for sandboxes/builds, 24h kill timeouts, context-aware pools/storage, and a new utils.Wait with tests.
- `FORCE_STOP` usage with `Config.StopSandboxesOnExit` (env: "FORCE_STOP") to optionally kill all sandboxes on exit.
- `sleepCtx`, helpers to ignore canceled/invalid-arg/service-done errors.
- `sync.WaitGroup`; new `Wait(ctx)` to block until all exit; ensure decrement via cleanup.
- `StopAll(ctx)` to gracefully stop all sandboxes.
- `Populate(ctx)` return errors and respect context/done channels; fix slot cleanup paths.
- `Storage.Release` to `Release(ctx, *Slot)`; implement in KV/Local/Memory with context support.
- `utils.Wait(ctx, &wg)` and logs clean shutdown; add `StopAllBuilds()`.
- `StopAll()` to mark running builds as cancelled.
- `kill_timeout = "24h"` for `orchestrator` and `template-manager` tasks.
- `.editorconfig`: add `*.hcl` indent rules.
- `utils.Wait(ctx, *sync.WaitGroup)` with comprehensive tests; sandbox factory wait test added.
- `packages/shared/scripts`: run `npm install`, add `dotenv`, bump `e2b` and related deps.
- `go.mod`: add `go.uber.org/atomic`.

Written by Cursor Bugbot for commit 8d10e4a. This will update automatically on new commits.
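The new `utils.Wait(ctx, *sync.WaitGroup)` mentioned above isn't quoted anywhere in this thread; a context-aware WaitGroup wait usually looks something like the sketch below, which may differ from the PR's actual implementation:

```go
package utils

import (
	"context"
	"sync"
)

// Wait blocks until the WaitGroup drains or the context is cancelled,
// whichever happens first.
func Wait(ctx context.Context, wg *sync.WaitGroup) error {
	done := make(chan struct{})

	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```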