Add peer-to-peer metrics tracker #1333
base: main
I'm going to pull a piece out to its own PR, and reopen when that's merged.
this allows us to put sandbox-specific code here, to execute on insert, on delete, etc
2e39e0f to 5bababb
# Conflicts:
#   packages/orchestrator/internal/server/main.go
#   packages/orchestrator/internal/server/sandboxes_test.go
#   packages/orchestrator/internal/service/service_info.go
#   packages/orchestrator/main.go

# Conflicts:
#   packages/orchestrator/internal/cfg/model.go

# Conflicts:
#   packages/orchestrator/internal/cfg/model.go
#   packages/orchestrator/internal/server/main.go
#   packages/orchestrator/internal/server/sandboxes.go
#   packages/orchestrator/internal/service/service_info.go
#   packages/orchestrator/main.go
	}

	// Check if we've reached the max number of starting instances on this node
	acquired := t.startingSandboxes.TryAcquire(1)
Bug: Snapshot resume no longer waits for starting slot
The original code used blocking Acquire(ctx, 1) for snapshot resumes which would wait up to 15 seconds (acquireTimeout) for a starting slot to become available. The refactored AcquireStarting method only uses non-blocking TryAcquire(1), which immediately fails if no slot is available. In sandboxes.go, acquireCtx is still set up with a timeout for snapshots, but this context is never used for blocking acquisition - it's only passed to AcquireStarting which ignores it for semaphore acquisition. This behavioral regression will cause snapshot resumes to fail immediately instead of waiting when slots are temporarily unavailable.
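The difference between the two acquisition modes described above can be sketched as follows. This is a minimal illustration, not the PR's code: `startingSlots`, `acquireStarting`, `isSnapshot`, and `errNoSlot` are hypothetical names, and a buffered channel stands in for the real semaphore so the example needs only the standard library. The 2-second timeout is arbitrary and only plays the role of `acquireTimeout`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// startingSlots is a hypothetical stand-in for the starting-sandbox semaphore.
// A buffered channel keeps the sketch stdlib-only.
type startingSlots chan struct{}

var errNoSlot = errors.New("no starting slot available")

// tryAcquire never blocks: it fails immediately when all slots are taken.
func (s startingSlots) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// acquire blocks until a slot frees up or ctx is done.
func (s startingSlots) acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release frees one previously acquired slot.
func (s startingSlots) release() { <-s }

// acquireStarting waits for a slot on snapshot resumes (bounded by ctx)
// and fails fast for everything else.
func acquireStarting(ctx context.Context, s startingSlots, isSnapshot bool) error {
	if isSnapshot {
		return s.acquire(ctx)
	}
	if !s.tryAcquire() {
		return errNoSlot
	}
	return nil
}

func main() {
	slots := make(startingSlots, 1)
	if err := acquireStarting(context.Background(), slots, false); err != nil {
		panic(err) // the only slot is free here, so this should not happen
	}

	// The slot is taken, so a non-blocking attempt fails right away.
	fmt.Println("fresh start:", acquireStarting(context.Background(), slots, false))

	// Free the slot shortly, as a finishing sandbox start would.
	go func() {
		time.Sleep(50 * time.Millisecond)
		slots.release()
	}()

	// A snapshot resume waits for the slot instead of failing.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fmt.Println("snapshot resume:", acquireStarting(ctx, slots, true))
}
```

In this sketch the context passed to `acquireStarting` is what bounds the wait, which is the behavior the comment says was lost when only `TryAcquire(1)` remained.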
			zap.Error(err),
			zap.String("path", fullPath))
	}
}
Bug: Startup double-counts stale self file on PID reuse
When Run starts, it reads all existing .json files into otherMetrics but doesn't skip selfPath. If a previous process with the same PID crashed (leaving a stale file), and the PID is reused, the stale file gets added to otherMetrics. Since fsnotify events for selfPath are later ignored in the main loop, this entry is never cleared. The result is double-counting: TotalRunningCount() includes both the current process's sandboxes from selfSandboxResources AND stale sandbox counts from otherMetrics[currentPID]. This could cause unnecessary ResourceExhausted errors after crash recovery with PID reuse.
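One possible way to avoid the double count is to skip the process's own file during the initial directory scan. The sketch below is an assumption about how that could look, not the PR's implementation: `isPeerFile`, `loadPeers`, and the `sandboxes` JSON field are hypothetical, and only the `{pid}.json` naming comes from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// isPeerFile reports whether name is a "{pid}.json" file that belongs to a
// process other than selfPID. (Hypothetical helper.)
func isPeerFile(name string, selfPID int) bool {
	if filepath.Ext(name) != ".json" {
		return false
	}
	return name != strconv.Itoa(selfPID)+".json"
}

// loadPeers reads every peer file in dir into a map keyed by file name,
// skipping this process's own (possibly stale) file.
func loadPeers(dir string, selfPID int) (map[string]int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	peers := make(map[string]int)
	for _, e := range entries {
		if e.IsDir() || !isPeerFile(e.Name(), selfPID) {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // the file may have been removed since ReadDir
		}
		// The "sandboxes" field is an assumed schema for this sketch.
		var m struct {
			Sandboxes int `json:"sandboxes"`
		}
		if err := json.Unmarshal(data, &m); err != nil {
			continue // ignore partially written or malformed files
		}
		peers[e.Name()] = m.Sandboxes
	}
	return peers, nil
}

func main() {
	dir, err := os.MkdirTemp("", "sharedstate")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	self := os.Getpid()

	// A stale file left behind by a crashed process that had the same PID.
	stale := filepath.Join(dir, strconv.Itoa(self)+".json")
	if err := os.WriteFile(stale, []byte(`{"sandboxes":5}`), 0o644); err != nil {
		panic(err)
	}
	// A file from a different process.
	peer := filepath.Join(dir, strconv.Itoa(self+1)+".json")
	if err := os.WriteFile(peer, []byte(`{"sandboxes":2}`), 0o644); err != nil {
		panic(err)
	}

	peers, err := loadPeers(dir, self)
	if err != nil {
		panic(err)
	}
	fmt.Println(peers) // contains only the other process's entry
}
```

Comparing by file name mirrors how the main loop is described as ignoring events for `selfPath`, so the startup scan and the watcher apply the same rule.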
Judging from the bot comments (#1333 (comment)), I'm not sure what the current state is.
This consolidates metrics into a single struct that does a few things:
{pid}.json
It also creates server.Limiter for checking starting and running limits.

Note
Adds a shared-state tracker aggregating sandbox allocations across processes and a limiter enforcing running/starting caps, integrated into server, metrics, and config.
internal/sharedstate manager that writes {pid}.json, watches directory via fsnotify, aggregates Allocations and running counts across processes; includes tests.
main.go: subscribe to sandbox.Map, run manager (SharedStateDirectory, SharedStateWriteInterval).
service.Info to source allocated CPU/memory/disk/sandbox metrics from shared state.
internal/server/Limiter using feature flag MaxSandboxesPerNode and a semaphore for per-node starting slots; add tests.
Server.Create to gate sandbox starts and return appropriate GRPC errors; replace in-flight semaphore logic.
SharedStateDirectory, SharedStateWriteInterval, MaxStartingInstances.
sandbox.Map subscriber interface to accept context; update all Insert/Remove call sites.
github.com/fsnotify/fsnotify.
main.go.
Written by Cursor Bugbot for commit a640183. This will update automatically on new commits.
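The per-process file and cross-process aggregation described in the summary can be sketched roughly as below. This is an illustration under stated assumptions: the `Allocations` fields, `writeSelf`, and `total` are hypothetical, the PIDs are made up, and the real manager additionally watches the directory via fsnotify, which is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// Allocations is a hypothetical per-process snapshot; the field set is an
// assumption made for this sketch.
type Allocations struct {
	VCPUs     int64 `json:"vcpus"`
	MemoryMB  int64 `json:"memory_mb"`
	Sandboxes int64 `json:"sandboxes"`
}

// total sums the allocations reported by every process.
func total(byPID map[int]Allocations) Allocations {
	var sum Allocations
	for _, a := range byPID {
		sum.VCPUs += a.VCPUs
		sum.MemoryMB += a.MemoryMB
		sum.Sandboxes += a.Sandboxes
	}
	return sum
}

// writeSelf stores this process's allocations as "{pid}.json" in dir.
func writeSelf(dir string, pid int, a Allocations) error {
	data, err := json.Marshal(a)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, strconv.Itoa(pid)+".json"), data, 0o644)
}

func main() {
	dir, err := os.MkdirTemp("", "sharedstate")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	self := Allocations{VCPUs: 2, MemoryMB: 1024, Sandboxes: 1}
	if err := writeSelf(dir, 100, self); err != nil {
		panic(err)
	}

	// In the real manager, peer values would come from other processes' files.
	byPID := map[int]Allocations{
		100: self,
		200: {VCPUs: 4, MemoryMB: 2048, Sandboxes: 3},
	}
	fmt.Printf("%+v\n", total(byPID)) // {VCPUs:6 MemoryMB:3072 Sandboxes:4}
}
```

Keeping one file per process means each writer only ever touches its own file, so no cross-process locking is needed for writes; readers just sum whatever files are present.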