Implement warm pool base struct #1636
base: main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```go
		return err
	}

	wp.doneOnce.Do(func() {
```
nit: why do you close freshItems in a different function? Why not keep it consistent?
It's safe to read from a closed channel, but you'll panic if you write to a closed channel or close a channel twice. As such, Go encourages you to close channels near the writer, not the reader, and only once you're done writing to them. The freshItems channel gives us an easy place to do that: when the loop has exited, nothing will ever write to the channel again. reusableItems is odd, as its writer doesn't write in a loop, so there's no natural point in that function to close it; we have to resort to closing it in another function, then protecting each write by selecting on both the send and a read from done.
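A toy sketch of the two patterns described here, using hypothetical names rather than this PR's code: the producer owns and closes freshItems once its loop exits, while writes to reusableItems are guarded by a select on done. Note that the guard is only safe if reusableItems is closed strictly after all writers have stopped, since a send to a closed channel panics even inside a select.

```go
package pool

// producer is the sole writer to freshItems, so it can close the
// channel itself: once the loop exits, nothing will ever write again.
func producer(done <-chan struct{}, freshItems chan<- int) {
	defer close(freshItems) // close near the writer, after the loop
	for i := 0; ; i++ {
		select {
		case <-done:
			return
		case freshItems <- i:
		}
	}
}

// putBack writes to reusableItems, which is closed elsewhere, so each
// write is protected by selecting on done as well.
func putBack(done <-chan struct{}, reusableItems chan<- int, item int) {
	select {
	case <-done: // shutting down: skip the write
	case reusableItems <- item:
	}
}
```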
You should check wp.done before running wg.Go; otherwise, you have no guarantee there won't be a new Return call. Please add a test for that: a long-running Return, plus Close and these Returns running concurrently.
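A rough sketch of what such a test might look like, assuming hypothetical signatures for Get, Return, and Close, and a hypothetical newTestWarmPool helper:

```go
package utils

import (
	"context"
	"sync"
	"testing"
	"time"
)

func TestCloseRacesLongRunningReturns(t *testing.T) {
	ctx := context.Background()
	wp := newTestWarmPool(t) // hypothetical test constructor

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			item, err := wp.Get(ctx)
			if err != nil {
				return // the pool may already be shutting down
			}
			time.Sleep(10 * time.Millisecond) // simulate a long-running Return
			wp.Return(ctx, item)              // must not panic after Close
		}()
	}

	wp.Close(ctx) // race shutdown against the in-flight Returns
	wg.Wait()
}
```

Running it under `go test -race` would also let the detector flag a write racing the close.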
Good point, the mechanism in Populate is much cleaner and easier to get right
Ugh. Okay, nope, this wasn't an option here. I moved it into Close.
It works perfectly fine if you call Populate before calling Close. If you don't, you end up with a deadlock while trying to empty the channel, as the channel never got closed. Safer to close everything in Close.
```go
func (wp *WarmPool[T]) Get(ctx context.Context) (T, error) {
	var item T

	recordSuccess := func(source string, attrs ...attribute.KeyValue) {
```
We have 3 different failure paths for the closed case
```go
	})

	return err
}
```
Race condition between Close and Populate can cause panic
Close doesn't wait for Populate to exit before closing freshItems. The Populate goroutine runs independently and is never added to wp.wg. When Close is called, it closes wp.done, then immediately calls wp.wg.Wait() (which returns since Populate isn't tracked), then closes wp.freshItems. If Populate is in create() during this sequence, it may try to send to the closed freshItems channel in its select statement. Since both <-wp.done and the send case would be "ready", Go may randomly pick the send, causing a panic.
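This failure mode can be reproduced standalone; a minimal demo, independent of this PR's code, that usually panics within a few iterations:

```go
package main

func main() {
	done := make(chan struct{})
	items := make(chan int)
	close(done)
	close(items) // stands in for the cleanup goroutine winning the race

	for {
		select {
		case <-done:
			// Frequently chosen, but never guaranteed: select picks
			// pseudo-randomly among ready cases.
		case items <- 1:
			// If this case is chosen, the send panics with
			// "send on closed channel".
		}
	}
}
```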
```go
	timer.success()

	return item, err
}
```
Timer records both success and failure on create error
In the create function, timer.success() is called unconditionally after timer.failure() when there's an error. When acquisition.Create fails, both timer.failure() and timer.success() are executed, adding both result=failure and result=success attributes to the telemetry metric. The timer.success() call needs to be inside an else branch or the function needs to return early after timer.failure().
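A minimal sketch of the suggested fix, with outcomeTimer standing in for this PR's timer helper (only the method set matters): record exactly one outcome per call by returning early on failure.

```go
package pool

import "context"

// outcomeTimer is a stand-in for the PR's telemetry timer.
type outcomeTimer interface {
	success()
	failure()
}

// createOne records failure and returns early, so success is never
// also recorded for the same call.
func createOne[T any](ctx context.Context, timer outcomeTimer, create func(context.Context) (T, error)) (T, error) {
	item, err := create(ctx)
	if err != nil {
		timer.failure()
		return item, err // early return instead of falling through
	}
	timer.success()
	return item, nil
}
```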
```go
		return ctx.Err()
	case wp.freshItems <- item:
	}
```
Race condition can panic when sending to closed channel
There's a race condition between Populate() and Close() that can cause a panic. The cleanup goroutine at line 86 closes freshItems after done is closed. However, Populate's main loop at line 117 may try to send to freshItems in the select statement. If Populate passes the isClosed check at line 96, then Close() closes done, and the cleanup goroutine closes freshItems before Populate enters the select, Go may randomly select the send case. Sending to a closed channel always panics in Go, and the select doesn't prevent this. The race window exists between done being closed and freshItems being closed by the cleanup goroutine.
```go
		for item := range wp.freshItems {
			wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
		}
	})
```
Race condition may panic when closing freshItems channel
When Close() is called, it closes wp.done which unblocks both the cleanup goroutine (line 83) and makes the main loop's select <-wp.done case ready. The cleanup goroutine then closes wp.freshItems at line 86. However, the main loop's select at lines 108-118 has both <-wp.done and wp.freshItems <- item as potentially ready cases, and Go picks randomly among ready cases. If the cleanup goroutine closes wp.freshItems after the select determines the send case is ready but before it executes, sending to the closed channel will cause a panic. This race window exists because both goroutines are triggered by the same close(wp.done) event.
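One way out, following the close-near-the-writer rule from the earlier thread, is to make Populate's loop the sole owner of freshItems and close the channel only once the loop can no longer send. A rough sketch under that assumption, reusing names from the diff context but not this PR's actual code:

```go
func (wp *WarmPool[T]) populateLoop(ctx context.Context) error {
	// The sole writer closes freshItems, and only after the loop
	// exits, so no other goroutine can race a send against the
	// close. Close then just closes wp.done and drains freshItems
	// until the range terminates.
	defer close(wp.freshItems)
	for {
		item, err := wp.create(ctx)
		if err != nil {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-wp.done:
				return nil
			default:
				continue // keep trying on transient errors
			}
		}
		select {
		case <-ctx.Done():
			wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
			return ctx.Err()
		case <-wp.done:
			wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
			return nil
		case wp.freshItems <- item:
		}
	}
}
```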
Somehow, avoiding race conditions got overly complicated. Going to put this into draft until I can figure out why.
```go
	<-cleanup

	// close freshItems
	println("closing freshItems")
```
Debug println statements left in production code
Several println() debug statements were left in the production code. These appear at lines 81, 84, 90, 94, 267, and 271 with messages like "creating clean up", "closing clean up", "waiting for clean up", "closing freshItems", "closing done", and "closing reusableItems". These should be removed before merging.
```go
		for item := range wp.freshItems {
			wp.destroy(ctx, item, operationClose, reasonCleanupFresh)
		}
	})
```
Calling Populate twice causes panic on closed channel
When Populate() returns (due to context cancellation), the cleanup goroutine at line 88-102 closes freshItems. If Populate() is called again with a new context, the isClosed() check passes (since only wp.done and context are checked), and the code proceeds to write to freshItems at line 126, causing a panic on write to closed channel. The pool has no protection (like sync.Once) to prevent calling Populate() multiple times. While not typical usage, this could cause crashes if someone attempts to restart population after a transient failure.
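A hypothetical guard against the double call, assuming an added atomic.Bool field on WarmPool (not present in this PR):

```go
package pool

import (
	"context"
	"errors"
	"sync/atomic"
)

// WarmPool is stubbed down to the one hypothetical field this sketch
// needs; the real type in this PR has more.
type WarmPool[T any] struct {
	populating atomic.Bool
}

// Populate rejects a second call instead of letting it race toward a
// send on the already-closed freshItems channel.
func (wp *WarmPool[T]) Populate(ctx context.Context) error {
	if !wp.populating.CompareAndSwap(false, true) {
		return errors.New("warm pool: Populate may only be called once")
	}
	// ... the existing population loop would follow here ...
	_ = ctx
	return nil
}
```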
```go
// - When done, destroys all items in the `freshItems` channel before returning
// - When an error is encountered, continue trying to create more entries
func (wp *WarmPool[T]) Populate(ctx context.Context) error {
	println("creating clean up")
```
Debug println statements left in production code
Multiple println debug statements were left in the production code. These appear at lines 81, 84, 90, 94, 267, and 271 with messages like "creating clean up", "closing clean up", "waiting for clean up", "closing freshItems", "closing done", and "closing reusableItems". These will print to stdout and clutter logs in production.
Once this looks good, we should use it for the nbd device pool as well.
Note
Introduces a reusable, telemetry‑instrumented pooling primitive and adopts it for network slots in the orchestrator.
- `utils.WarmPool` with OTEL metrics, lifecycle handling, and tests; provide `Item`/`ItemFactory` interfaces and generated mocks
- `utils.WarmPool` (`network.Pool` now wraps a warm pool); add `network.Config` and `Slot.String()`
- `NETWORK_SLOTS_REUSE_POOL_SIZE` and `NETWORK_SLOTS_FRESH_POOL_SIZE`; set these in Nomad jobs (`orchestrator.hcl`, `template-manager.hcl`)
- `scripts/fix-meters.sh` and improve `fix-tracers.sh`; run both in `make fmt`; CI workflow now relies on `make fmt` (drops explicit tracer fix step)

Written by Cursor Bugbot for commit 9d960af. This will update automatically on new commits.