Skip to content

Conversation

@djeebus
Copy link
Contributor

@djeebus djeebus commented Oct 10, 2025

This consolidates metrics into a single struct that does a few things:

  • exports metrics to a file called {pid}.json
  • watches for other files and reads their metrics.
  • when handling incoming requests, check full host metrics

It also creates server.Limiter for checking starting and running limits.


Note

Adds a shared-state tracker aggregating sandbox allocations across processes and a limiter enforcing running/starting caps, integrated into server, metrics, and config.

  • Shared State & Metrics:
    • Add internal/sharedstate manager that writes {pid}.json, watches directory via fsnotify, aggregates Allocations and running counts across processes; includes tests.
    • Wire into main.go: subscribe to sandbox.Map, run manager (SharedStateDirectory, SharedStateWriteInterval).
    • Update service.Info to source allocated CPU/memory/disk/sandbox metrics from shared state.
  • Sandbox Start Limiting:
    • Introduce internal/server/Limiter using feature flag MaxSandboxesPerNode and a semaphore for per-node starting slots; add tests.
    • Use limiter in Server.Create to gate sandbox starts and return appropriate GRPC errors; replace in-flight semaphore logic.
  • API/Plumbing:
    • Extend config with SharedStateDirectory, SharedStateWriteInterval, MaxStartingInstances.
    • Change sandbox.Map subscriber interface to accept context; update all Insert/Remove call sites.
  • Deps:
    • Add github.com/fsnotify/fsnotify.
  • Misc:
    • Minor variable rename for Google storage limiter in main.go.

Written by Cursor Bugbot for commit a640183. This will update automatically on new commits. Configure here.

@linear
Copy link

linear bot commented Oct 10, 2025

@e2b-request-same-site-reviewers e2b-request-same-site-reviewers bot requested review from ValentaTomas and removed request for ValentaTomas, dobrac and jakubno October 10, 2025 00:29
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

@djeebus djeebus marked this pull request as draft October 10, 2025 15:47
@djeebus
Copy link
Contributor Author

djeebus commented Oct 10, 2025

I'm going to pull a piece out to its own PR, and reopen when that's merged.

@ValentaTomas ValentaTomas changed the title ENG-3153 peer-to-peer metrics tracker Add peer-to-peer metrics tracker Oct 10, 2025
# Conflicts:
#	packages/orchestrator/internal/server/main.go
#	packages/orchestrator/internal/server/sandboxes_test.go
#	packages/orchestrator/internal/service/service_info.go
#	packages/orchestrator/main.go
@djeebus djeebus marked this pull request as ready for review October 15, 2025 23:56
cursor[bot]

This comment was marked as outdated.

# Conflicts:
#	packages/orchestrator/internal/cfg/model.go
cursor[bot]

This comment was marked as outdated.

@sitole sitole self-requested a review October 21, 2025 09:32
cursor[bot]

This comment was marked as outdated.

djeebus and others added 2 commits October 21, 2025 15:02
# Conflicts:
#	packages/orchestrator/internal/cfg/model.go
#	packages/orchestrator/internal/server/main.go
#	packages/orchestrator/internal/server/sandboxes.go
#	packages/orchestrator/internal/service/service_info.go
#	packages/orchestrator/main.go
}

// Check if we've reached the max number of starting instances on this node
acquired := t.startingSandboxes.TryAcquire(1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Snapshot resume no longer waits for starting slot

The original code used blocking Acquire(ctx, 1) for snapshot resumes which would wait up to 15 seconds (acquireTimeout) for a starting slot to become available. The refactored AcquireStarting method only uses non-blocking TryAcquire(1), which immediately fails if no slot is available. In sandboxes.go, acquireCtx is still set up with a timeout for snapshots, but this context is never used for blocking acquisition - it's only passed to AcquireStarting which ignores it for semaphore acquisition. This behavioral regression will cause snapshot resumes to fail immediately instead of waiting when slots are temporarily unavailable.

Additional Locations (1)

Fix in Cursor Fix in Web

zap.Error(err),
zap.String("path", fullPath))
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Startup double-counts stale self file on PID reuse

When Run starts, it reads all existing .json files into otherMetrics but doesn't skip selfPath. If a previous process with the same PID crashed (leaving a stale file), and the PID is reused, the stale file gets added to otherMetrics. Since fsnotify events for selfPath are later ignored in the main loop, this entry is never cleared. The result is double-counting: TotalRunningCount() includes both the current process's sandboxes from selfSandboxResources AND stale sandbox counts from otherMetrics[currentPID]. This could cause unnecessary ResourceExhausted errors after crash recovery with PID reuse.

Fix in Cursor Fix in Web

@ValentaTomas
Copy link
Member

From the bot comments, #1333 (comment), not sure what is the state now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants