Problem Description
When creating sandboxes in a self-hosted E2B environment, the first HTTP request to envd consistently fails with "connection reset by peer", while the second request always succeeds. This pattern is 100% reproducible and is visible in Grafana (Service Name: template-manager_orchestrator, Span Name: sandbox-create), as shown in the image below.
Error Details
First HTTP POST /init request:
- Error: read tcp 10.12.0.4:48426->10.11.0.2:49983: read: connection reset by peer
- Error type: *net.OpError
- Duration: ~30ms
- Result: ❌ FAIL
Second HTTP POST /init request:
- Status: HTTP 204 No Content
- Duration: ~36ms
- Result: ✅ SUCCESS
Timeline from OpenTelemetry Traces
Time (ns)            Event                 Notes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1762571273908497000  resume-vm completed   VM is now running
1762571273909003300  set-mmds completed    +0.5ms after resume
1762571273909097700  HTTP request starts   +0.6ms after resume
1762571273939074000  HTTP request ends     ❌ connection reset
1762571273944640300  HTTP retry starts     +5.5ms after failure
1762571273980743700  HTTP retry ends       ✅ success
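The offsets in the Notes column can be re-derived from the raw timestamps. The Python sketch below is purely illustrative arithmetic: the variable and function names are ours, and only the nanosecond values come from the trace.

```python
# Nanosecond timestamps copied from the trace timeline.
events = [
    ("resume-vm completed", 1762571273908497000),
    ("set-mmds completed", 1762571273909003300),
    ("HTTP request starts", 1762571273909097700),
    ("HTTP request ends", 1762571273939074000),
    ("HTTP retry starts", 1762571273944640300),
    ("HTTP retry ends", 1762571273980743700),
]


def delta_ms(start_ns: int, end_ns: int) -> float:
    """Elapsed time between two nanosecond timestamps, in milliseconds."""
    return (end_ns - start_ns) / 1_000_000


ts = dict(events)

# Time from VM resume until the first /init request is sent.
resume_to_request_ms = delta_ms(ts["resume-vm completed"], ts["HTTP request starts"])
# Duration of the failed first request.
first_request_ms = delta_ms(ts["HTTP request starts"], ts["HTTP request ends"])
# Gap between the failure and the retry.
retry_gap_ms = delta_ms(ts["HTTP request ends"], ts["HTTP retry starts"])
# Duration of the successful retry.
retry_request_ms = delta_ms(ts["HTTP retry starts"], ts["HTTP retry ends"])

for label, value in [
    ("resume -> first request", resume_to_request_ms),
    ("first request duration", first_request_ms),
    ("failure -> retry", retry_gap_ms),
    ("retry duration", retry_request_ms),
]:
    print(f"{label}: {value:.2f} ms")
```

The first request is sent about 0.6 ms after resume and fails after roughly 30 ms. The retry starts about 5.6 ms later and completes in roughly 36 ms, which matches the durations reported under Error Details.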
Environment
- VM Technology: Firecracker microVM with snapshot/restore
- envd HTTP Server: Listening on port 49983 inside the VM
- Orchestrator: Go HTTP client making requests to envd
- Network: Direct connection between orchestrator and VM (no proxy/load balancer)
Reproduction
This issue occurs on every single sandbox creation:
import time

from e2b_code_interpreter import Sandbox

# Measure how long it takes until the sandbox is usable.
start_time = time.time()
with Sandbox.create('base') as sandbox:
    creation_time = time.time() - start_time
    print(creation_time)
- Template is created with envd running as HTTP server
- VM is snapshotted via Firecracker
- VM is resumed from snapshot via ResumeVM API
- Orchestrator sends HTTP POST to http://<vm-ip>:49983/init
- First request fails with "connection reset by peer"
- Second request succeeds immediately
Reproduction rate: 100%
Troubleshooting Attempts
Attempt 1: Increase Request Timeout ❌
Hypothesis: The 50ms timeout is too short for the VM to be ready.
Test: Increased ENVD_INIT_TIMEOUT_MILLISECONDS from 50ms to 200ms.
Result: No change. The error is still connection reset by peer, not a timeout error.
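This distinction is why the longer timeout had no effect: a reset is reported as soon as the peer's RST arrives, so the client never gets as far as waiting out the timeout. As a rough illustration in Python (the orchestrator itself is Go, and `classify_failure` is a hypothetical helper, not part of E2B), the two failure classes can be told apart like this:

```python
import socket


def classify_failure(exc: BaseException) -> str:
    """Classify a network failure as 'reset', 'timeout', or 'other'."""
    # ECONNRESET surfaces in Python as ConnectionResetError.
    if isinstance(exc, ConnectionResetError):
        return "reset"
    # socket.timeout is an alias of TimeoutError on Python 3.10+.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    return "other"


print(classify_failure(ConnectionResetError(104, "Connection reset by peer")))
print(classify_failure(socket.timeout("timed out")))
```

In Go, the analogous check on the `*net.OpError` would typically be `errors.Is(err, syscall.ECONNRESET)`.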
Attempt 2: Add Delay After VM Resume ❌
Hypothesis: The network stack or envd process needs initialization time after snapshot resume.
Test: Added 100 ms delay after resumeVM() and before HTTP request:
err = p.client.resumeVM(ctx)
if err != nil {
    return errors.Join(fmt.Errorf("error resuming vm: %w", err), fcStopErr)
}
time.Sleep(100 * time.Millisecond) // Give VM time to initialize
err = p.client.setMmds(ctx, meta)
Questions
- Is this expected behavior? Should we always expect the first request after VM resume to fail?
- Is there a proper way to detect when envd is ready to accept connections after VM resume, such as a health check or readiness probe?
- Why is the TCP connection reset? The VM has been resumed and the envd process should be running, so what causes the server side to send an RST? Could this be related to HTTP keep-alive settings in the microVM?
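To make the second question concrete, a readiness check could look something like the sketch below: a bounded retry that treats a connection reset as "not ready yet" and lets every other error through. This is a minimal Python illustration under our own assumptions. `call_until_ready` and `fake_init` are hypothetical names, the attempt count and delay are arbitrary, and it is not a description of how the orchestrator handles /init.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_until_ready(
    request: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 0.005,
) -> T:
    """Call `request` until it stops raising ConnectionResetError.

    Other exceptions propagate immediately. If every attempt is reset,
    the last ConnectionResetError is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ConnectionResetError:
            if attempt == attempts:
                raise
            time.sleep(delay_s)
    raise AssertionError("unreachable")


# Stand-in for the POST /init call: resets once, then returns 204,
# mirroring the pattern seen in the traces.
calls = {"n": 0}


def fake_init() -> int:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionResetError("connection reset by peer")
    return 204


status = call_until_ready(fake_init)
print(status, calls["n"])
```

A retry like this only hides the symptom. It does not explain why the first connection is reset, which is what the third question is about.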
Any insights on why this happens and how to properly handle it would be greatly appreciated!