Add peer-to-peer metrics tracker #1333

djeebus · 2025-10-10T00:28:51Z

This consolidates metrics into a single struct that does a few things:

- exports metrics to a file called {pid}.json
- watches for other processes' files and reads their metrics
- checks full host metrics when handling incoming requests

It also creates server.Limiter for checking the starting and running limits.

Note

Adds a shared-state tracker aggregating sandbox allocations across processes and a limiter enforcing running/starting caps, integrated into server, metrics, and config.

Shared State & Metrics:
- Add internal/sharedstate manager that writes {pid}.json, watches directory via fsnotify, aggregates Allocations and running counts across processes; includes tests.
- Wire into main.go: subscribe to sandbox.Map, run manager (SharedStateDirectory, SharedStateWriteInterval).
- Update service.Info to source allocated CPU/memory/disk/sandbox metrics from shared state.
Sandbox Start Limiting:
- Introduce internal/server/Limiter using feature flag MaxSandboxesPerNode and a semaphore for per-node starting slots; add tests.
- Use limiter in Server.Create to gate sandbox starts and return appropriate GRPC errors; replace in-flight semaphore logic.
API/Plumbing:
- Extend config with SharedStateDirectory, SharedStateWriteInterval, MaxStartingInstances.
- Change sandbox.Map subscriber interface to accept context; update all Insert/Remove call sites.
Deps:
- Add github.com/fsnotify/fsnotify.
Misc:
- Minor variable rename for Google storage limiter in main.go.

Written by Cursor Bugbot for commit a640183. This will update automatically on new commits.

linear · 2025-10-10T00:28:54Z

ENG-3153 Export sandbox metrics to a file readable by other orchestrators

djeebus · 2025-10-10T15:48:06Z

I'm going to pull a piece out to its own PR, and reopen when that's merged.

# Conflicts: # packages/orchestrator/internal/cfg/model.go # packages/orchestrator/internal/server/main.go # packages/orchestrator/internal/server/sandboxes.go # packages/orchestrator/internal/service/service_info.go # packages/orchestrator/main.go

cursor · 2025-12-16T22:13:46Z

packages/orchestrator/internal/server/limiter.go

+	}
+
+	// Check if we've reached the max number of starting instances on this node
+	acquired := t.startingSandboxes.TryAcquire(1)


Bug: Snapshot resume no longer waits for starting slot

The original code used blocking Acquire(ctx, 1) for snapshot resumes which would wait up to 15 seconds (acquireTimeout) for a starting slot to become available. The refactored AcquireStarting method only uses non-blocking TryAcquire(1), which immediately fails if no slot is available. In sandboxes.go, acquireCtx is still set up with a timeout for snapshots, but this context is never used for blocking acquisition - it's only passed to AcquireStarting which ignores it for semaphore acquisition. This behavioral regression will cause snapshot resumes to fail immediately instead of waiting when slots are temporarily unavailable.

Additional Locations (1)

packages/orchestrator/internal/server/sandboxes.go#L85-L92

cursor · 2025-12-16T22:13:46Z

packages/orchestrator/internal/sharedstate/tracker.go

+				zap.Error(err),
+				zap.String("path", fullPath))
+		}
+	}


Bug: Startup double-counts stale self file on PID reuse

When Run starts, it reads all existing .json files into otherMetrics but doesn't skip selfPath. If a previous process with the same PID crashed (leaving a stale file), and the PID is reused, the stale file gets added to otherMetrics. Since fsnotify events for selfPath are later ignored in the main loop, this entry is never cleared. The result is double-counting: TotalRunningCount() includes both the current process's sandboxes from selfSandboxResources AND stale sandbox counts from otherMetrics[currentPID]. This could cause unnecessary ResourceExhausted errors after crash recovery with PID reuse.

ValentaTomas · 2026-01-02T13:53:19Z

Regarding the bot comments, #1333 (comment): I'm not sure what the current state of these is.

djeebus requested review from ValentaTomas, dobrac and jakubno as code owners October 10, 2025 00:28

e2b-request-same-site-reviewers bot requested review from ValentaTomas and removed request for ValentaTomas, dobrac and jakubno October 10, 2025 00:29

djeebus marked this pull request as draft October 10, 2025 15:47

djeebus added 6 commits October 10, 2025 09:01

create a strongly typed sandboxes map

f598bf8

this allows us to put sandbox-specific code here, to execute on insert, on delete, etc

rename struct, add subscribers

4402908

add locking to the subs list

fa22f6a

collect/publish metrics across peers

80ed359

use the errgroup.Group to run the metrics tracker

2fba68d

split up the limiter from the tracker

5bababb

djeebus force-pushed the joint-metrics branch from 2e39e0f to 5bababb Compare October 10, 2025 17:44

djeebus mentioned this pull request Oct 10, 2025

Create a strongly typed sandboxes map #1336

Merged

djeebus added 2 commits October 10, 2025 10:49

make tidy

753a5d5

shrink the tracker a bit

8d31aa5

ValentaTomas changed the title ~~ENG-3153 peer-to-peer metrics tracker~~ Add peer-to-peer metrics tracker Oct 10, 2025

djeebus added 2 commits October 15, 2025 16:52

Merge remote-tracking branch 'origin/main' into joint-metrics

51fd9e5

# Conflicts: # packages/orchestrator/internal/server/main.go # packages/orchestrator/internal/server/sandboxes_test.go # packages/orchestrator/internal/service/service_info.go # packages/orchestrator/main.go

rename some variables and fields

e8db442

djeebus marked this pull request as ready for review October 15, 2025 23:56

e2b-request-same-site-reviewers bot assigned ValentaTomas Oct 15, 2025

djeebus added 2 commits October 16, 2025 17:52

remove obsolete error

ec9c0d2

Merge branch 'main' into joint-metrics

c37153c

# Conflicts: # packages/orchestrator/internal/cfg/model.go

djeebus and others added 5 commits October 16, 2025 18:02

clean up error

94882c5

fix error message

fb54ef7

Merge remote-tracking branch 'origin/main' into joint-metrics

21cc7a8

bring back new error message

d5ab234

linting

f7d02fc

sitole self-requested a review October 21, 2025 09:32

sitole reviewed Oct 21, 2025

packages/orchestrator/internal/cfg/model.go (outdated, resolved)

sitole reviewed Oct 21, 2025

packages/orchestrator/internal/service/service_info.go (resolved)

djeebus and others added 2 commits October 21, 2025 09:33

Merge branch 'main' into joint-metrics

196a06c

rename metrics tracker to shared state manager

39a2b94

djeebus and others added 2 commits October 21, 2025 15:02

protect against nils

835ee09

cursor bot reviewed Dec 16, 2025
