Conversation

@djeebus (Contributor) commented Dec 19, 2025

This mostly involved adding a bunch of tests, which uncovered edge cases involving canceled contexts, missed cleanups, etc.

Also moved network operations off the slot and into a struct that can be swapped out for mocks.


Note

Extracts network operations into a pluggable interface and refactors the network pool for context-aware lifecycle, async cleanup, and configurable sizes; adds comprehensive tests and wires new env/config across services.

  • Network module:
    • Operations abstraction: Introduces Operations interface with NetNSOperations implementation; moves networking methods off Slot into network.go (CreateNetwork, RemoveNetwork, ConfigureInternet, ResetInternet).
    • Pool refactor: NewPool(operations, storage, config) replaces size args; Populate returns error, respects context/closed state, and cleans up asynchronously; Return drops when reuse pool is full; adds ErrClosed, isClosed, ignoreClosed; cleanup now logs instead of bubbling errors.
    • Config: New internal/sandbox/network/config.go adds NetworkSlotsReusePoolSize and NetworkSlotsFreshPoolSize (env-driven); storage Release now Release(ctx, s *Slot).
    • Tests & mocks: Adds pool_test.go with extensive coverage for close/cancel/cleanup paths; generates mocks_test.go for Storage and Operations via .mockery.yaml additions.
  • Wiring:
    • Update call sites (benchmark_test.go, cmd/build-template/main.go, main.go) to construct NetNSOperations and use the new NewPool signature; propagate context to Populate and Close (see the sketch after this list).
    • IaC: Adds NETWORK_SLOTS_REUSE_POOL_SIZE and NETWORK_SLOTS_FRESH_POOL_SIZE to orchestrator.hcl and template-manager.hcl env.
  • Dependencies: Adds go.uber.org/atomic (and related sum entries).
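
A rough wiring sketch of what the updated call sites might look like, per the summary above; the constructor name NewNetNSOperations and the Config literal are assumptions, not verbatim PR code:

ops := network.NewNetNSOperations()
pool := network.NewPool(ops, slotStorage, network.Config{
	NetworkSlotsReusePoolSize: 8, // assumed field names; 8/8 defaults per the review below
	NetworkSlotsFreshPoolSize: 8,
})

// Populate now returns an error and respects context cancellation.
if err := pool.Populate(ctx); err != nil {
	logger.L().Error(ctx, "failed to populate network pool", zap.Error(err))
}
defer pool.Close(ctx)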

Written by Cursor Bugbot for commit bb1c165. This will update automatically on new commits.

telemetry.ReportEvent(ctx, "reused network slot")

slot = s
default:

Bug: Pool.Get() can return nil slot when reused slot is acquired

In the Get method, when a reused slot is acquired from the p.reusedSlots channel (lines 157-162), the method sets slot = s and then falls through to the second select statement (lines 166-177). This second select can receive a value from p.newSlots (or nil, once that channel is closed), overwriting the valid reused slot that was just acquired.

The logic should return immediately after successfully getting a reused slot. The second select block should only execute when no reused slot was available (when slot is still nil).

Expected fix: Add if slot != nil { goto configure } after line 164, or restructure the logic to avoid the fallthrough.
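
A minimal sketch of that restructure, assuming the two-select shape described above (names follow the excerpts in this thread):

var slot *Slot

// Try to grab a reused slot without blocking.
select {
case s := <-p.reusedSlots:
	telemetry.ReportEvent(ctx, "reused network slot")
	slot = s
default:
}

// Only wait on a fresh slot when no reused slot was available.
if slot == nil {
	select {
	case s, ok := <-p.newSlots:
		if !ok {
			return nil, ErrClosed
		}
		telemetry.ReportEvent(ctx, "new network slot")
		slot = s
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// ...continue to ConfigureInternet with the acquired slot.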

 // Return the slot to the pool if configuring internet fails
 go func() {
-	if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {
+	if returnErr := p.Return(ctx, slot); returnErr != nil {

Bug: Context used in error recovery goroutine may be canceled

When ConfigureInternet fails, a goroutine is spawned to return the slot to the pool. However, this goroutine uses the original ctx, which may already be canceled (line 183). If ctx is canceled, the Return() call will fail immediately at lines 195-200, and the slot will never be properly cleaned up or returned to the pool.

Should use context.WithoutCancel(ctx) or a background context for the cleanup goroutine to ensure it completes regardless of the original context state.
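
A minimal sketch of that fix, assuming the error path shown in the diff below; the wrapped return value is illustrative:

err := p.operations.ConfigureInternet(ctx, slot, network)
if err != nil {
	// Detach from the request context: WithoutCancel keeps ctx's values
	// but is never canceled, so Return can complete its cleanup.
	cleanupCtx := context.WithoutCancel(ctx)
	go func() {
		if returnErr := p.Return(cleanupCtx, slot); returnErr != nil {
			logger.L().Error(cleanupCtx, "failed to return slot to the pool", zap.Error(returnErr), zap.Int("slot_index", slot.Idx))
		}
	}()
	return nil, fmt.Errorf("failed to configure internet: %w", err)
}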

close(p.newSlots)

for slot := range p.newSlots {
	p.cleanup(ctx, slot)

Bug: Populate cleanup uses potentially canceled context

In the deferred cleanup function (lines 114-120), slots are drained from p.newSlots and cleaned up using ctx. However, if Populate exits due to context cancellation (line 138), this ctx is already canceled. The cleanup operations (RemoveNetwork, Release) will immediately fail the context checks, leaving network namespaces and slots in an unclean state.

Should use context.Background() or context.WithoutCancel(ctx) for the cleanup operations in the defer block to ensure proper cleanup regardless of why Populate is exiting.
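
A minimal sketch of the suggested defer, assuming the drain shape shown in the excerpt below:

defer func() {
	// Detached context: the drain must run even when Populate exits
	// because ctx itself was canceled.
	drainCtx := context.WithoutCancel(ctx)

	close(p.newSlots)
	for slot := range p.newSlots {
		p.cleanup(drainCtx, slot)
	}
}()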

return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
}
// reused slots was full, drop the slot
go p.cleanup(ctx, slot)

Race condition: Return() cleanup goroutines spawned after pool closed

In Return(), when operations fail or context is canceled (lines 206, 213, 217, 225), cleanup goroutines are spawned that use the potentially canceled context. More critically, at line 225, when the reusedSlots channel is full, a cleanup goroutine is spawned even though the pool might be in the process of closing or already closed.

If the pool closes between the initial check at lines 195-201 and the cleanup goroutine's execution, the cleanup calls to p.operations.RemoveNetwork and p.slotStorage.Release might operate on a closed pool state. Consider checking pool closed state within cleanup() or using a background context.
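
One possible shape for that, sketched under the assumption that cleanup owns namespace removal and slot release (per the excerpts in this thread); consulting the isClosed flag mentioned in the summary would be an alternative:

func (p *Pool) cleanup(ctx context.Context, slot *Slot) {
	// Run to completion even if the triggering request was canceled.
	ctx = context.WithoutCancel(ctx)

	if err := p.operations.RemoveNetwork(ctx, slot); err != nil {
		logger.L().Error(ctx, "failed to remove network", zap.Error(err), zap.Int("slot_index", slot.Idx))
	}
	if err := p.slotStorage.Release(ctx, slot); err != nil {
		logger.L().Error(ctx, "failed to release slot", zap.Error(err), zap.Int("slot_index", slot.Idx))
	}
}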

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +179 to 184
 err := p.operations.ConfigureInternet(ctx, slot, network)
 if err != nil {
 	// Return the slot to the pool if configuring internet fails
 	go func() {
-		if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {
+		if returnErr := p.Return(ctx, slot); returnErr != nil {
 			logger.L().Error(ctx, "failed to return slot to the pool", zap.Error(returnErr), zap.Int("slot_index", slot.Idx))


P1: Slot cleanup skipped when ctx canceled after Get failure

When ConfigureInternet fails, the pool now tries to return the slot with the same request context (p.Return(ctx, slot)). Return immediately bails out if that context is canceled, so callers that cancel their context after Get fails will skip ResetInternet/cleanup and leak the slot back out of the pool. The prior code used context.WithoutCancel here to guarantee cleanup regardless of cancellation, so this regresses under canceled contexts and can exhaust the pool over time.

Comment on lines 91 to 92
NETWORK_SLOTS_POOL_SIZE = "100"
NETWORK_SLOTS_TO_PREPOPULATE = "32"


P1: Nomad env vars don't match network pool config keys

The Nomad job sets NETWORK_SLOTS_POOL_SIZE and NETWORK_SLOTS_TO_PREPOPULATE to 100/32, but network.Config still reads NETWORK_SLOTS_REUSE_POOL_SIZE and NETWORK_SLOTS_FRESH_POOL_SIZE (pool.go lines 54-55). Because the configured keys don’t exist, the service falls back to the new 8/8 defaults, shrinking the pool dramatically in production instead of using the intended 100/32 sizing.
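
Assuming the 100/32 sizing is still the intent, the stanza would need the keys network.Config actually reads; which of the old values maps to reuse vs. fresh is an assumption here:

NETWORK_SLOTS_REUSE_POOL_SIZE = "100"
NETWORK_SLOTS_FRESH_POOL_SIZE = "32"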

acquiredSlots.Add(ctx, 1, metric.WithAttributes(attribute.String("pool", "new")))
telemetry.ReportEvent(ctx, "new network slot")

slot = s

Bug: Reused slot leaked when second select always overwrites

The Get function has a critical logic error where the second select statement (lines 146-157) always runs unconditionally, even if a reused slot was already obtained in the first select. When a reused slot is available, it's assigned to slot (line 142), but then the code proceeds to the second select which blocks until a new slot is available and overwrites slot (line 156). This causes the reused slot to be leaked (never returned to the pool), incorrect metrics (both counters decremented), and unnecessary blocking. The second select should only execute when the first select's default case is hit (no reused slot available).

return nil
case <-ctx.Done():
return
default:

Bug: Created slot leaks when context cancels before send

In Populate, when createNetworkSlot succeeds (line 107) but the subsequent select exits via <-p.done or <-ctx.Done() (lines 115-118), the just-created slot is not cleaned up. The defer only drains slots already in p.newSlots, not the local slot variable that was created but never sent. This leaks network resources during shutdown or context cancellation.
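
A minimal sketch of the fix, assuming the select shape this comment describes (createNetworkSlot and p.done follow its naming):

slot, err := p.createNetworkSlot(ctx)
if err != nil {
	return err
}

select {
case p.newSlots <- slot:
case <-p.done:
	// The slot was created but never handed off; the deferred drain only
	// sees slots already in the channel, so clean this one up here.
	p.cleanup(context.WithoutCancel(ctx), slot)
	return nil
case <-ctx.Done():
	p.cleanup(context.WithoutCancel(ctx), slot)
	return ctx.Err()
}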

if err != nil {
	// Return the slot to the pool if configuring internet fails
	go func() {
		if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {

Bug: Removing context.WithoutCancel causes slot leak on cancellation

The goroutine that returns a slot after ConfigureInternet fails changed from using context.WithoutCancel(ctx) to plain ctx. The Return function's first select (lines 175-181) checks ctx.Done() and returns immediately without cleanup if canceled. With the old code, the detached context ensured Return would proceed past this check. Now, if the context is canceled before or during the goroutine execution, Return exits early without cleaning up the slot, causing a resource leak.

@djeebus djeebus marked this pull request as draft December 19, 2025 20:04
@djeebus (Contributor, Author) commented Dec 19, 2025

These are pretty reasonable, going to work through them.

@djeebus (Contributor, Author) commented Dec 20, 2025

Going to extract a warm pool in #1636 first. These changes will likely head over there, or into a follow-up PR.

@djeebus djeebus closed this Dec 20, 2025
@ValentaTomas ValentaTomas deleted the clean-up-pool branch January 2, 2026 15:51