Clean up current implementation of network pool #1634
Conversation
```go
		telemetry.ReportEvent(ctx, "reused network slot")

		slot = s
	default:
```
Bug: Pool.Get() can return nil slot when reused slot is acquired
In the Get method, when a reused slot is acquired from the p.reusedSlots channel (lines 157-162), the method sets slot = s and then falls through to the second select statement (lines 166-177). That second select can return nil from p.newSlots, overwriting the valid reused slot that was just acquired.
The logic should return immediately after successfully getting a reused slot. The second select block should only execute when no reused slot was available (when slot is still nil).
Expected fix: Add if slot != nil { goto configure } after line 164, or restructure the logic to avoid the fallthrough.
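A minimal sketch of the restructured variant, assuming Get roughly follows the shape described here; the channel names come from the review, while Slot, ErrClosed, and the method body are illustrative stubs:

```go
package network

import (
	"context"
	"errors"
)

type Slot struct{ Idx int }

var ErrClosed = errors.New("network pool is closed")

type Pool struct {
	reusedSlots chan *Slot
	newSlots    chan *Slot
}

// get sketches the fix: the blocking select on newSlots only runs when no
// reused slot was obtained, so a reused slot is never overwritten or leaked.
func (p *Pool) get(ctx context.Context) (*Slot, error) {
	var slot *Slot

	// Non-blocking attempt to reuse a slot.
	select {
	case s, ok := <-p.reusedSlots:
		if ok {
			slot = s
		}
	default:
	}

	// Fall back to a fresh slot only if nothing was reused.
	if slot == nil {
		select {
		case s, ok := <-p.newSlots:
			if !ok {
				return nil, ErrClosed
			}
			slot = s
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}

	// ...proceed to internet configuration with a guaranteed non-nil slot...
	return slot, nil
}
```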
```diff
  // Return the slot to the pool if configuring internet fails
  go func() {
-     if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {
+     if returnErr := p.Return(ctx, slot); returnErr != nil {
```
Bug: Context used in error recovery goroutine may be canceled
When ConfigureInternet fails, a goroutine is spawned to return the slot to the pool. However, this goroutine uses the original ctx, which may already be canceled (line 183). If ctx is canceled, the Return() call fails immediately at lines 195-200, and the slot is never properly cleaned up or returned to the pool.
Should use context.WithoutCancel(ctx) or a background context for the cleanup goroutine to ensure it completes regardless of the original context state.
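A sketch of that suggestion, with Return and the logger stubbed out as stand-ins for the real pool code; the only substantive point is the context.WithoutCancel call:

```go
package network

import (
	"context"
	"log"
)

type Slot struct{ Idx int }

type Pool struct{}

// Return is a stub standing in for the real pool's Return method.
func (p *Pool) Return(ctx context.Context, slot *Slot) error { return ctx.Err() }

// onConfigureInternetError sketches the recovery path: the goroutine returns
// the slot on a context that is never canceled, so cleanup proceeds even when
// the original request context is already dead.
func (p *Pool) onConfigureInternetError(ctx context.Context, slot *Slot) {
	cleanupCtx := context.WithoutCancel(ctx) // keeps values/trace, drops cancellation
	go func() {
		if returnErr := p.Return(cleanupCtx, slot); returnErr != nil {
			log.Printf("failed to return slot %d to the pool: %v", slot.Idx, returnErr)
		}
	}()
}
```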
```go
	close(p.newSlots)

	for slot := range p.newSlots {
		p.cleanup(ctx, slot)
```
Bug: Populate cleanup uses potentially canceled context
In the deferred cleanup function (lines 114-120), slots are drained from p.newSlots and cleaned up using ctx. However, if Populate exits due to context cancellation (line 138), this ctx is already canceled. The cleanup operations (RemoveNetwork, Release) will immediately fail the context checks, leaving network namespaces and slots in an unclean state.
Should use context.Background() or context.WithoutCancel(ctx) for the cleanup operations in the defer block to ensure proper cleanup regardless of why Populate is exiting.
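A sketch of the suggested defer logic, reusing the drain loop from the snippet above; cleanup and newSlots are names from the review, the rest is a stub:

```go
package network

import "context"

type Slot struct{ Idx int }

type Pool struct {
	newSlots chan *Slot
}

// cleanup is a stand-in for the real per-slot teardown (RemoveNetwork + Release).
func (p *Pool) cleanup(ctx context.Context, slot *Slot) {}

// drainOnExit sketches the deferred cleanup: it detaches from ctx so that
// RemoveNetwork/Release still run when Populate exits because ctx was canceled.
func (p *Pool) drainOnExit(ctx context.Context) {
	cleanupCtx := context.WithoutCancel(ctx)

	close(p.newSlots)
	for slot := range p.newSlots {
		p.cleanup(cleanupCtx, slot)
	}
}
```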
```go
		return fmt.Errorf("failed to return slot '%d': %w", slot.Idx, err)
	}
	// reused slots was full, drop the slot
	go p.cleanup(ctx, slot)
```
Race condition: Return() cleanup goroutines spawned after pool closed
In Return(), when operations fail or the context is canceled (lines 206, 213, 217, 225), cleanup goroutines are spawned that use the potentially canceled context. More critically, at line 225, when the reusedSlots channel is full, a cleanup goroutine is spawned even though the pool might be closing or already closed.
If the pool closes between the initial check at lines 195-201 and the cleanup goroutine's execution, the cleanup calls to p.operations.RemoveNetwork and p.slotStorage.Release might operate on closed-pool state. Consider checking the pool's closed state within cleanup() or using a background context.
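A sketch of that suggestion, assuming the pool exposes some closed signal; whether a closed pool should skip the release or still perform it depends on what Close already drains, which is an open assumption here:

```go
package network

import (
	"context"
	"log"
)

type Slot struct{ Idx int }

type Pool struct {
	closed chan struct{} // closed by Close() when the pool shuts down (assumed)
}

// removeAndRelease is a stand-in for p.operations.RemoveNetwork + p.slotStorage.Release.
func (p *Pool) removeAndRelease(ctx context.Context, slot *Slot) error { return nil }

// cleanup checks the pool's closed state itself and runs on a detached context,
// so goroutines spawned from Return neither race a concurrent Close nor give up
// because the caller's context was canceled.
func (p *Pool) cleanup(ctx context.Context, slot *Slot) {
	select {
	case <-p.closed:
		// Pool is closing/closed; assume Close's own drain owns the slot.
		log.Printf("pool closed, skipping cleanup of slot %d", slot.Idx)
		return
	default:
	}

	if err := p.removeAndRelease(context.WithoutCancel(ctx), slot); err != nil {
		log.Printf("failed to clean up slot %d: %v", slot.Idx, err)
	}
}
```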
💡 Codex Review
Here are some automated review suggestions for this pull request.
```diff
  err := p.operations.ConfigureInternet(ctx, slot, network)
  if err != nil {
      // Return the slot to the pool if configuring internet fails
      go func() {
-         if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {
+         if returnErr := p.Return(ctx, slot); returnErr != nil {
              logger.L().Error(ctx, "failed to return slot to the pool", zap.Error(returnErr), zap.Int("slot_index", slot.Idx))
```
Slot cleanup skipped when ctx canceled after Get failure
When ConfigureInternet fails, the pool now tries to return the slot with the same request context (p.Return(ctx, slot)). Return bails out immediately if that context is canceled, so callers that cancel their context after Get fails will skip ResetInternet/cleanup and leak the slot out of the pool. The prior code used context.WithoutCancel here to guarantee cleanup regardless of cancellation, so this regresses under canceled contexts and can exhaust the pool over time.
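For reference, a tiny runnable illustration of why the detached context matters here: context.WithoutCancel (Go 1.21+) ignores the parent's cancellation, so an early ctx.Err()/ctx.Done() check inside Return would not trip:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // the caller gave up, e.g. after Get returned an error

	detached := context.WithoutCancel(parent)

	fmt.Println(parent.Err())   // context canceled
	fmt.Println(detached.Err()) // <nil> — cleanup using this context proceeds
}
```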
```hcl
NETWORK_SLOTS_POOL_SIZE      = "100"
NETWORK_SLOTS_TO_PREPOPULATE = "32"
```
Nomad env vars don’t match network pool config keys
The Nomad job sets NETWORK_SLOTS_POOL_SIZE and NETWORK_SLOTS_TO_PREPOPULATE to 100/32, but network.Config still reads NETWORK_SLOTS_REUSE_POOL_SIZE and NETWORK_SLOTS_FRESH_POOL_SIZE (pool.go lines 54-55). Because the configured keys don’t exist, the service falls back to the new 8/8 defaults, shrinking the pool dramatically in production instead of using the intended 100/32 sizing.
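A sketch of why the mismatch is silent, assuming the config falls back to a default when the key is absent; envInt is a hypothetical helper (the real project may read env vars through a config library), and the default of 8 comes from the defaults mentioned above:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer from the environment, falling back to def when the
// key is unset or unparsable — which is exactly what happens when the Nomad
// job exports a differently named variable.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func main() {
	os.Setenv("NETWORK_SLOTS_POOL_SIZE", "100") // what the Nomad job sets

	// The service reads a different key, so the default wins silently.
	reuse := envInt("NETWORK_SLOTS_REUSE_POOL_SIZE", 8)
	fmt.Println(reuse) // 8, not 100
}
```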
```go
		acquiredSlots.Add(ctx, 1, metric.WithAttributes(attribute.String("pool", "new")))
		telemetry.ReportEvent(ctx, "new network slot")

		slot = s
```
Bug: Reused slot leaked when second select always overwrites
The Get function has a critical logic error: the second select statement (lines 146-157) always runs unconditionally, even if a reused slot was already obtained in the first select. When a reused slot is available, it is assigned to slot (line 142), but the code then proceeds to the second select, which blocks until a new slot is available and overwrites slot (line 156). The result is a leaked reused slot (never returned to the pool), incorrect metrics (both counters decremented), and unnecessary blocking. The second select should only execute when the first select's default case is hit (no reused slot available).
```go
		return nil
	case <-ctx.Done():
		return
	default:
```
Bug: Created slot leaks when context cancels before send
In Populate, when createNetworkSlot succeeds (line 107) but the subsequent select exits via <-p.done or <-ctx.Done() (lines 115-118), the just-created slot is not cleaned up. The defer only drains slots already in p.newSlots, not the local slot variable that was created but never sent. This leaks network resources during shutdown or context cancellation.
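A sketch of the missing cleanup path, with names taken from the review text and the types stubbed out: the producer releases the slot itself whenever the hand-off cannot happen, since the deferred drain of p.newSlots never sees it.

```go
package network

import "context"

type Slot struct{ Idx int }

type Pool struct {
	newSlots chan *Slot
	done     chan struct{}
}

// cleanup is a stand-in for the real per-slot teardown (RemoveNetwork + Release).
func (p *Pool) cleanup(ctx context.Context, slot *Slot) {}

// sendOrCleanup hands a just-created slot to the pool, or cleans it up if the
// pool is shutting down or the context is canceled before the send succeeds.
func (p *Pool) sendOrCleanup(ctx context.Context, slot *Slot) error {
	select {
	case p.newSlots <- slot:
		return nil
	case <-p.done:
		p.cleanup(context.WithoutCancel(ctx), slot)
		return nil
	case <-ctx.Done():
		p.cleanup(context.WithoutCancel(ctx), slot)
		return ctx.Err()
	}
}
```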
```go
	if err != nil {
		// Return the slot to the pool if configuring internet fails
		go func() {
			if returnErr := p.Return(context.WithoutCancel(ctx), slot); returnErr != nil {
```
Bug: Removing context.WithoutCancel causes slot leak on cancellation
The goroutine that returns a slot after ConfigureInternet fails changed from using context.WithoutCancel(ctx) to plain ctx. The Return function's first select (lines 175-181) checks ctx.Done() and returns immediately without cleanup if canceled. With the old code, the detached context ensured Return would proceed past this check. Now, if the context is canceled before or during the goroutine execution, Return exits early without cleaning up the slot, causing a resource leak.
These are pretty reasonable, going to work through them. |
Going to extract a warm pool in #1636 first. These changes will likely head over there, or into a follow-up PR.
This mostly involved adding a bunch of tests, which uncovered edge cases involving canceled contexts, missed cleanups, etc.
Also moved network operations off the slot and into a struct that can be swapped out for mocks.
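Roughly, that shape is an interface the pool depends on plus a namespace-backed implementation; the method and type names below come from this PR, but the exact signatures are assumptions:

```go
package network

import "context"

type Slot struct{ Idx int }

// Operations abstracts the per-slot networking work so tests can swap in a
// generated mock; method names follow the PR, signatures are guesses.
type Operations interface {
	CreateNetwork(ctx context.Context, slot *Slot) error
	RemoveNetwork(ctx context.Context, slot *Slot) error
	ConfigureInternet(ctx context.Context, slot *Slot, network string) error
	ResetInternet(ctx context.Context, slot *Slot) error
}

// NetNSOperations would be the production implementation that manipulates
// network namespaces; the pool only ever talks to the interface.
type NetNSOperations struct{ /* netns handles, firewall helpers, ... */ }
```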
Note
Extracts network operations into a pluggable interface and refactors the network pool for context-aware lifecycle, async cleanup, and configurable sizes; adds comprehensive tests and wires new env/config across services.
- `Operations` interface with a `NetNSOperations` implementation; moves networking methods off `Slot` into `network.go` (`CreateNetwork`, `RemoveNetwork`, `ConfigureInternet`, `ResetInternet`).
- `NewPool(operations, storage, config)` replaces size args; `Populate` returns `error`, respects context/closed state, and cleans up asynchronously; `Return` drops when the reuse pool is full; adds `ErrClosed`, `isClosed`, `ignoreClosed`; cleanup now logs instead of bubbling errors.
- `internal/sandbox/network/config.go` adds `NetworkSlotsReusePoolSize` and `NetworkSlotsFreshPoolSize` (env-driven); storage `Release` is now `Release(ctx, s *Slot)`.
- `pool_test.go` with extensive coverage for close/cancel/cleanup paths; generates `mocks_test.go` for `Storage` and `Operations` via `.mockery.yaml` additions.
- `benchmark_test.go`, `cmd/build-template/main.go`, and `main.go` construct `NetNSOperations` and use the new `NewPool` signature; context is propagated to `Populate` and `Close`.
- `NETWORK_SLOTS_REUSE_POOL_SIZE` and `NETWORK_SLOTS_FRESH_POOL_SIZE` added to `orchestrator.hcl` and `template-manager.hcl` env.
- `go.uber.org/atomic` (and related sum entries).

Written by Cursor Bugbot for commit bb1c165.