Skip to content

Conversation

@djeebus
Copy link
Contributor

@djeebus djeebus commented Dec 20, 2025

Once this looks good, we should use it for the nbd device pool as well.


Note

Introduces a reusable, telemetry‑instrumented pooling primitive and adopts it for network slots in the orchestrator.

  • Add utils.WarmPool with OTEL metrics, lifecycle handling, and tests; provide Item/ItemFactory interfaces and generated mocks
  • Refactor orchestrator network pool to utils.WarmPool (network.Pool now wraps a warm pool); add network.Config and Slot.String()
  • Config/infra: expose pool sizing via NETWORK_SLOTS_REUSE_POOL_SIZE and NETWORK_SLOTS_FRESH_POOL_SIZE; set these in Nomad jobs (orchestrator.hcl, template-manager.hcl)
  • Call site updates: switch constructors in benchmark and build-template; main orchestrator uses new pool
  • Tooling: add scripts/fix-meters.sh and improve fix-tracers.sh; run both in make fmt; CI workflow now relies on make fmt (drops explicit tracer fix step)
  • Minor go.mod/sum updates across packages to support new telemetry/logging deps

Written by Cursor Bugbot for commit 9d960af. This will update automatically on new commits. Configure here.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@djeebus djeebus marked this pull request as ready for review December 20, 2025 03:22
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@jakubno jakubno self-assigned this Dec 29, 2025
return err
}

wp.doneOnce.Do(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why do you close freshItems in different function? Why not to keep it consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to read from a closed channel, but you'll panic if you write to a closed channel or close a channel twice. As such, golang encourages you to close channels near the writer, not the reader, and only when you're done writing to it. The freshItems channel gives us an easy place to do that - when the loop has exited, nothing will ever write to the channel again. reusableItems is odd, as its writer doesn't write in a loop, so it's awkward to close it within that channel; we have to resort to closing it in another function, then protecting it by selecting from both the write and a read to done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check the wg.done before running wg.Go; otherwise, you have no guarantee there won't be a new Return called, please add test for that (long running Return) + Close and these Returns concurrently

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the mechanism in Populate is much cleaner and easier to get right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. Okay, nope, this wasn't an option here. I moved it into the Close.

It works perfectly fine if you call Populate before calling Close. If you don't, you end up with a deadlock while trying to empty the channel, as the channel never got closed. Safer to close everything in Close.

func (wp *WarmPool[T]) Get(ctx context.Context) (T, error) {
var item T

recordSuccess := func(source string, attrs ...attribute.KeyValue) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 3 different failures for closed

return err
}

wp.doneOnce.Do(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check the wg.done before running wg.Go; otherwise, you have no guarantee there won't be a new Return called, please add test for that (long running Return) + Close and these Returns concurrently

return err
}

wp.doneOnce.Do(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the mechanism in Populate is much cleaner and easier to get right

})

return err
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition between Close and Populate can cause panic

Close doesn't wait for Populate to exit before closing freshItems. The Populate goroutine runs independently and is never added to wp.wg. When Close is called, it closes wp.done, then immediately calls wp.wg.Wait() (which returns since Populate isn't tracked), then closes wp.freshItems. If Populate is in create() during this sequence, it may try to send to the closed freshItems channel in its select statement. Since both <-wp.done and the send case would be "ready", Go may randomly pick the send, causing a panic.

Additional Locations (1)

Fix in Cursor Fix in Web

timer.success()

return item, err
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timer records both success and failure on create error

In the create function, timer.success() is called unconditionally after timer.failure() when there's an error. When acquisition.Create fails, both timer.failure() and timer.success() are executed, adding both result=failure and result=success attributes to the telemetry metric. The timer.success() call needs to be inside an else branch or the function needs to return early after timer.failure().

Fix in Cursor Fix in Web


return ctx.Err()
case wp.freshItems <- item:
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition can panic when sending to closed channel

There's a race condition between Populate() and Close() that can cause a panic. The cleanup goroutine at line 86 closes freshItems after done is closed. However, Populate's main loop at line 117 may try to send to freshItems in the select statement. If Populate passes the isClosed check at line 96, then Close() closes done, and the cleanup goroutine closes freshItems before Populate enters the select, Go may randomly select the send case. Sending to a closed channel always panics in Go, and the select doesn't prevent this. The race window exists between done being closed and freshItems being closed by the cleanup goroutine.

Additional Locations (1)

Fix in Cursor Fix in Web

for item := range wp.freshItems {
wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
}
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition may panic when closing freshItems channel

When Close() is called, it closes wp.done which unblocks both the cleanup goroutine (line 83) and makes the main loop's select <-wp.done case ready. The cleanup goroutine then closes wp.freshItems at line 86. However, the main loop's select at lines 108-118 has both <-wp.done and wp.freshItems <- item as potentially ready cases, and Go picks randomly among ready cases. If the cleanup goroutine closes wp.freshItems after the select determines the send case is ready but before it executes, sending to the closed channel will cause a panic. This race window exists because both goroutines are triggered by the same close(wp.done) event.

Additional Locations (1)

Fix in Cursor Fix in Web

@djeebus
Copy link
Contributor Author

djeebus commented Dec 31, 2025

Somehow, avoiding race conditions got overly complicated. Going to put this into draft until I can figure out why.

@djeebus djeebus closed this Dec 31, 2025
@djeebus djeebus reopened this Dec 31, 2025
@djeebus djeebus marked this pull request as draft December 31, 2025 01:02
<-cleanup

// close freshItems
println("closing freshItems")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug println statements left in production code

Several println() debug statements were left in the production code. These appear at lines 81, 84, 90, 94, 267, and 271 with messages like "creating clean up", "closing clean up", "waiting for clean up", "closing freshItems", "closing done", and "closing reusableItems". These should be removed before merging.

Additional Locations (1)

Fix in Cursor Fix in Web

for item := range wp.freshItems {
wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
}
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling Populate twice causes panic on closed channel

When Populate() returns (due to context cancellation), the cleanup goroutine at line 88-102 closes freshItems. If Populate() is called again with a new context, the isClosed() check passes (since only wp.done and context are checked), and the code proceeds to write to freshItems at line 126, causing a panic on write to closed channel. The pool has no protection (like sync.Once) to prevent calling Populate() multiple times. While not typical usage, this could cause crashes if someone attempts to restart population after a transient failure.

Additional Locations (1)

Fix in Cursor Fix in Web

// - When done, destroys all items in the `freshItems` channel before returning
// - When an error is encountered, continue trying to create more entries
func (wp *WarmPool[T]) Populate(ctx context.Context) error {
println("creating clean up")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug println statements left in production code

Multiple println debug statements were left in the production code. These appear at lines 81, 84, 90, 94, 267, and 271 with messages like "creating clean up", "closing clean up", "waiting for clean up", "closing freshItems", "closing done", and "closing reusableItems". These will print to stdout and clutter logs in production.

Additional Locations (1)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants