Add some safety when closing orchestrator and template manager #1609
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
mostly dealt with paying attention to context cancellation
Revert "mostly dealt with paying attention to context cancellation" (reverts commit 7351c9b)
case p.newSlots <- slot:
case <-ctx.Done():
	return ctx.Err()
}
Bug: Resource leak when context cancelled during slot creation
When the context is cancelled in the inner select while trying to send the slot to p.newSlots, the newly created slot from createNetworkSlot is leaked. The slot's network resources are allocated and the newSlotsAvailableCounter is incremented, but the slot is never added to the channel or cleaned up. This leaks network resources (slot storage, network configuration) since Close() only cleans up slots that were successfully sent to the channel.
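A minimal sketch of cleaning that one slot in the cancelled branch; `createNetworkSlot` and `cleanup` are taken from the comment above and the quoted `Return` code, so treat the exact signatures as assumptions rather than the PR's actual change:

```go
slot, err := p.createNetworkSlot(ctx)
if err != nil {
	return fmt.Errorf("failed to create network slot: %w", err)
}

select {
case p.newSlots <- slot:
case <-ctx.Done():
	// The slot was never handed off; release it instead of leaking it.
	if cerr := p.cleanup(ctx, slot); cerr != nil {
		return fmt.Errorf("%w; cleanup: %w", ctx.Err(), cerr)
	}

	return ctx.Err()
}
```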
This would be a force-quit situation; I don't think we're worried about resources left around.
can we clean that one slot regardless?
if the context is done, the slot is still not cleared, correct?
Bug: Ignored error return from network pool Populate
The networkPool.Populate(ctx) function now returns an error, but the return value is discarded and the service wrapper always returns nil. This means any errors from populating the network pool (like ErrClosed or context errors) won't be propagated to the service error handling, preventing proper shutdown signaling when the network pool fails.
packages/orchestrator/main.go#L387-L392
infra/packages/orchestrator/main.go
Lines 387 to 392 in 830dcaf
networkPool := network.NewPool(network.NewSlotsPoolSize, network.ReusedSlotsPoolSize, slotStorage, config.NetworkConfig)
startService("network pool", func(ctx context.Context) error {
	networkPool.Populate(ctx)
	return nil
})
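A sketch of simply forwarding that error so the service wrapper can react to it; this assumes `startService` treats a non-nil return as a reason to start shutting down:

```go
startService("network pool", func(ctx context.Context) error {
	if err := networkPool.Populate(ctx); err != nil {
		return fmt.Errorf("populating network pool: %w", err)
	}

	return nil
})
```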
done := make(chan struct{}, 1)

go func() {
	f.wg.Wait()
we should block new sandboxes from being started at this point, otherwise the wait group could fail
The way this is implemented, the caller sets the status to Draining before calling Wait, which should prevent new sandboxes from being scheduled. I don't think we want to block sandboxes from being scheduled in multiple places, for the same reason we don't want to set a max time limit in multiple places - if implementation changes, we don't want to hunt down all the places we blocked, closed, etc.
I think it might be worth it here though as it is a limitation of using the WaitGroup - you can't add new tasks to the WaitGroup after you do Wait, otherwise it may panic
I don't think it panics in that scenario. I set this playground up, let me know if it looks like what you expected: https://go.dev/play/p/bDPHf1ejwGs
It's not that easy to simulate, but we've already hit this a few times. Here are the links to the source code:
- https://github.com/golang/go/blob/ad91f5d241f3b8e85dc866d0087c3f13af96ef33/src/sync/waitgroup.go#L121C3-L121C69
- https://github.com/golang/go/blob/ad91f5d241f3b8e85dc866d0087c3f13af96ef33/src/sync/waitgroup.go#L213
Here is the PR:
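If we do decide to guard against it, the usual workaround for this WaitGroup limitation is a drain flag checked before every Add. A hypothetical sketch; none of these names come from the PR:

```go
package sandbox

import (
	"errors"
	"sync"
)

// factory pairs a WaitGroup with a drain flag so nothing can call Add
// once Wait has started.
type factory struct {
	mu       sync.Mutex
	draining bool
	wg       sync.WaitGroup
}

// acquire reserves a slot for a new sandbox; it fails once draining has begun.
func (f *factory) acquire() error {
	f.mu.Lock()
	defer f.mu.Unlock()

	if f.draining {
		return errors.New("factory is draining; no new sandboxes")
	}
	f.wg.Add(1)

	return nil
}

// release marks one sandbox as finished.
func (f *factory) release() { f.wg.Done() }

// drain blocks new acquisitions, then waits for in-flight sandboxes.
func (f *factory) drain() {
	f.mu.Lock()
	f.draining = true
	f.mu.Unlock()

	f.wg.Wait()
}
```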
packages/orchestrator/main.go
Outdated
if serviceInfo.GetStatus() == orchestratorinfo.ServiceInfoStatus_Healthy {
	serviceInfo.SetStatus(ctx, orchestratorinfo.ServiceInfoStatus_Draining)

	logger.L().Info(ctx, "Waiting for api to read orchestrator status")
	sleepCtx(closeCtx, 15*time.Second)
you need to wait before you await in the sandbox factory, as new sandboxes can still be scheduled in that time (same for templates)
Assuming that a) the window is quite small, and b) the sync.WaitGroup doesn't panic in that scenario (see reply to other comment), is this still worth complicating the solution to prevent?
if b) is true, then you need to be sure at least that the 15 seconds pass before the await for both templates and sbxs exits
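For clarity, the ordering being asked for is roughly this; `sleepCtx`, `closeCtx` and the factory wait come from this PR's diff, but the exact call shapes here are assumptions:

```go
// 1. Mark the node as draining so the API stops scheduling onto it.
serviceInfo.SetStatus(ctx, orchestratorinfo.ServiceInfoStatus_Draining)

// 2. Give the API the full window to observe the Draining status.
logger.L().Info(ctx, "Waiting for api to read orchestrator status")
sleepCtx(closeCtx, 15*time.Second)

// 3. Only then wait for in-flight sandboxes and template builds.
_ = sandboxFactory.Wait(closeCtx) // error handling omitted in this sketch
```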
packages/orchestrator/main.go
Outdated
logger.L().Info(ctx, "Force shutdown signal received")
cancel()
cancelCloseCtx()
config.ForceStop = true
Bug: Data race on config.ForceStop during shutdown
The goroutine spawned at line 570 writes to config.ForceStop at line 575 when a second signal is received, while the main goroutine reads config.ForceStop at line 619 during the closers loop. These operations can occur concurrently without any synchronization, creating a data race. The goroutine runs independently of closeWg.Wait(), so if a second signal arrives during the shutdown process, the write and read can race.
Additional Locations (1)
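One way to close that race is an atomic flag instead of a plain field, sketched here with the standard library's `atomic.Bool`; whether `Config` itself can be changed this way is an open question:

```go
var forceStop atomic.Bool // sync/atomic; zero value is false

go func() {
	<-sigs
	logger.L().Info(ctx, "Force shutdown signal received")
	cancel()
	cancelCloseCtx()
	forceStop.Store(true) // safe to race with readers
}()

// later, in the closers loop on the main goroutine:
if forceStop.Load() {
	// take the forced-shutdown path
}
```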
logger.L().Info(ctx, "Waiting for sandbox clients to close connections")
sleepCtx(closeCtx, 15*time.Second)
})
Bug: Race condition allows new sandboxes during WaitGroup.Wait
The three closeWg.Go goroutines run in parallel, which means sandboxFactory.Wait() may start before the draining status is set. Since the sandbox creation service doesn't check draining status (unlike the template service), new sandbox requests can still be accepted. Each new sandbox calls wg.Add(1) on the factory's WaitGroup while Wait() is running. This can cause shutdown to hang indefinitely as long as new sandboxes keep being created, since the counter never reaches zero.
This is exactly what we want. Since we're likely to be waiting for quite a while for the orchestrator to shut down, one more sandbox isn't a big deal. It won't deadlock, since the new sandbox will add/subtract to the counter exactly the same way that the rest of them will.
%{ if !update_stanza }
FORCE_STOP = "true"
%{ endif }
should we also remove this if the kill_timeout is now 24h?
I think that's there to prevent single-node clusters from having to wait too long to restart their allocation. I'm not totally sure. They don't seem to serve an important purpose anymore, but I'd like input from @jakubno and @ValentaTomas before we remove them.
packages/orchestrator/main.go
Outdated
go func() {
	<-sigs
	logger.L().Info(ctx, "Force shutdown signal received")
	cancel()
do we need to cancel this context?
(for both this and the previous comment) My theory on "force shutdown" was that, if we were in that scenario, we want everything to bail immediately - a dirty shutdown that leaves files and processes behind is exactly what we want, and cancelling the context gets us that. It effectively simulates setting FORCE_QUIT=true at runtime.
The initial idea behind adding FORCE_QUIT was to avoid waiting for the sbxs/template builds when developing locally, while still shutting everything down cleanly
oooh. definitely not my understanding, I'll clean that up
Bug: Send on closed channel can panic during shutdown
Removing the <-p.done check from the Return method creates a race condition that can cause a panic. During shutdown, Pool.Close() calls close(p.reusedSlots) after close(p.done). If a concurrent Return call (such as from sandbox cleanup running with context.WithoutCancel) passes the first select and is executing ResetInternet when p.reusedSlots gets closed, the subsequent p.reusedSlots <- slot send operation will panic. The old code with case <-p.done: return ErrClosed provided a safe exit when the pool was closing before the channel was closed.
packages/orchestrator/internal/sandbox/network/pool.go#L209-L221
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 209 to 221 in 61d21d2
select {
case <-ctx.Done():
	return ctx.Err()
case p.reusedSlots <- slot:
	returnedSlotCounter.Add(ctx, 1)
	reusableSlotsAvailableCounter.Add(ctx, 1)
default:
	err := p.cleanup(ctx, slot)
	if err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
}
Bug: Sending to closed channel can panic during shutdown
The Return method previously checked p.done before sending to p.reusedSlots, but this check was removed. Since Return is called asynchronously from sandbox cleanup (in a goroutine at sandbox.go lines 999-1005), and Close() calls close(p.reusedSlots) at line 260, there's a race condition: an async Return goroutine may try to send to p.reusedSlots after it's been closed, causing a panic. The select with default case doesn't protect against this because sending to a closed channel panics immediately rather than blocking or falling through to default.
packages/orchestrator/internal/sandbox/network/pool.go#L190-L223
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 190 to 223 in 58a0ed3
func (p *Pool) Return(ctx context.Context, slot *Slot) error {
	// avoid checking p.done, as we want to return the slot even if the pool is closed.
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}

	err := slot.ResetInternet(ctx)
	if err != nil {
		// Cleanup the slot if resetting internet fails
		if cerr := p.cleanup(ctx, slot); cerr != nil {
			return fmt.Errorf("reset internet: %w; cleanup: %w", err, cerr)
		}

		return fmt.Errorf("error resetting slot internet access: %w", err)
	}

	select {
	case <-ctx.Done():
		return ctx.Err()
	case p.reusedSlots <- slot:
		returnedSlotCounter.Add(ctx, 1)
		reusableSlotsAvailableCounter.Add(ctx, 1)
	default:
		err := p.cleanup(ctx, slot)
		if err != nil {
			return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
		}
	}

	return nil
packages/orchestrator/internal/sandbox/network/pool.go#L259-L260
infra/packages/orchestrator/internal/sandbox/network/pool.go
Lines 259 to 260 in 58a0ed3
close(p.reusedSlots)
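Note that re-adding a `case <-p.done:` to the send select is not enough on its own, because a send on a closed channel panics even when another case is ready. A safer shape is to stop closing `p.reusedSlots` in `Close()` and let `p.done` carry the shutdown signal; a rough sketch against the quoted code, not the PR's actual fix:

```go
// In Close(): signal shutdown, but leave reusedSlots open. With concurrent
// async Return calls there is no safe moment to close it.
close(p.done)

// In Return(): clean the slot up when the pool is done instead of parking it.
select {
case <-ctx.Done():
	return ctx.Err()
case <-p.done:
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
case p.reusedSlots <- slot:
	returnedSlotCounter.Add(ctx, 1)
	reusableSlotsAvailableCounter.Add(ctx, 1)
default:
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
}
```

A straggler can still land in the channel buffer after shutdown; whether that matters depends on how `Close()` drains the remaining slots.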
select {
case <-p.done:
	return nil
Bug: Network slot leaked when pool is closed during Return
When p.done is closed in the Return function, the slot is not cleaned up and is simply dropped, causing a resource leak. The Populate function specifically tries to clean up by calling p.Return(ctx, slot) when the pool closes (lines 132-140), but Return doesn't actually perform cleanup in that case - it just returns nil. The ctx.Done() case properly calls p.cleanup(ctx, slot), but the p.done case should do the same. This was noted in the PR review comment "can we clean that one slot regardless?"
Additional Locations (1)
select {
case <-p.done:
	return nil
Bug: Network slot resource leak when pool is closing
In the Return method, when p.done is received (pool is shutting down), the function returns nil without calling p.cleanup(ctx, slot). At this point, slot.ResetInternet has already succeeded, but the slot's network namespace and storage allocation are never released. This leaks network resources. The ctx.Done() case correctly calls p.cleanup, and the default case also correctly calls p.cleanup, but the p.done case does not. This was noted in PR discussion by @dobrac asking "can we clean that one slot regardless?".
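The change both bot comments are asking for is small: the `p.done` branch can release the slot before returning, mirroring what the `ctx.Done()` and `default` branches already do. A sketch, not the actual diff:

```go
case <-p.done:
	// Pool is closing: the slot will never be reused, so free its network
	// resources instead of dropping it.
	if err := p.cleanup(ctx, slot); err != nil {
		return fmt.Errorf("failed to clean up slot '%d': %w", slot.Idx, err)
	}

	return nil
```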
Can we fix the lint error + resolve the bot convos, so it is clear that this can be reviewed again?
Also hide some "errors" that aren't really errors.
Note
Adds context-driven shutdown across orchestrator/template-manager, stop-all for sandboxes/builds, 24h kill timeouts, context-aware pools/storage, and a new utils.Wait with tests.
- `FORCE_STOP` usage with `Config.StopSandboxesOnExit` (env: "FORCE_STOP") to optionally kill all sandboxes on exit.
- `sleepCtx`, helpers to ignore canceled/invalid-arg/service-done errors.
- `sync.WaitGroup`; new `Wait(ctx)` to block until all exit; ensure decrement via cleanup.
- `StopAll(ctx)` to gracefully stop all sandboxes.
- `Populate(ctx)` return errors and respect context/done channels; fix slot cleanup paths.
- `Storage.Release` to `Release(ctx, *Slot)`; implement in KV/Local/Memory with context support.
- `utils.Wait(ctx, &wg)` and logs clean shutdown; add `StopAllBuilds()`.
- `StopAll()` to mark running builds as cancelled.
- `kill_timeout = "24h"` for `orchestrator` and `template-manager` tasks.
- `.editorconfig`: add `*.hcl` indent rules.
- `utils.Wait(ctx, *sync.WaitGroup)` with comprehensive tests; sandbox factory wait test added.
- `packages/shared/scripts`: run `npm install`, add `dotenv`, bump `e2b` and related deps.
- `go.mod`: add `go.uber.org/atomic`.

Written by Cursor Bugbot for commit 8d10e4a. This will update automatically on new commits.
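The new `utils.Wait(ctx, *sync.WaitGroup)` mentioned above isn't quoted anywhere in this thread; a context-aware WaitGroup wait usually looks something like the sketch below, which may differ from the PR's actual implementation:

```go
package utils

import (
	"context"
	"sync"
)

// Wait blocks until the WaitGroup drains or the context is cancelled,
// whichever happens first.
func Wait(ctx context.Context, wg *sync.WaitGroup) error {
	done := make(chan struct{})

	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```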