Create Sandbox - First HTTP Request to envd Fails After VM Resume from Snapshot #1478

## Problem Description

When creating sandboxes in a self-hosted E2B environment, the **first HTTP request to envd consistently fails** with `connection reset by peer`, while the **second request always succeeds**. This pattern reproduces 100% of the time and is visible in **Grafana** (Service Name: `template-manager_orchestrator`, Span Name: `sandbox-create`), as shown in the image below.

<img width="1522" height="940" alt="Image" src="https://github.com/user-attachments/assets/1efe1a26-2733-4552-a6a0-8456041c54a1" />

### Error Details

**First HTTP POST /init request:**
- Error: `read tcp 10.12.0.4:48426->10.11.0.2:49983: read: connection reset by peer`
- Error type: `*net.OpError`
- Duration: ~30ms
- Result: ❌ FAIL

**Second HTTP POST /init request:**
- Status: HTTP 204 No Content
- Duration: ~36ms
- Result: ✅ SUCCESS

### Timeline from OpenTelemetry Traces

```
Time (ns)            Event                    Notes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1762571273908497000  resume-vm completed      VM is now running
1762571273909003300  set-mmds completed       +0.5ms after resume
1762571273909097700  HTTP request starts      +0.6ms after resume
1762571273939074000  HTTP request ends        ❌ connection reset
1762571273944640300  HTTP retry starts        +5.5ms after failure
1762571273980743700  HTTP retry ends          ✅ success
```
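To double-check the offsets quoted in the timeline, the deltas can be recomputed from the raw nanosecond timestamps. The sketch below only does that arithmetic; the constant and function names are made up for illustration and do not come from the orchestrator code.

```go
package main

import (
	"fmt"
	"time"
)

// Span boundaries copied from the trace above, in Unix nanoseconds.
// The identifier names are illustrative only.
const (
	resumeDone   int64 = 1762571273908497000
	mmdsDone     int64 = 1762571273909003300
	requestStart int64 = 1762571273909097700
	requestEnd   int64 = 1762571273939074000
	retryStart   int64 = 1762571273944640300
	retryEnd     int64 = 1762571273980743700
)

// elapsed returns the time between two Unix-nanosecond timestamps.
func elapsed(from, to int64) time.Duration {
	return time.Duration(to - from)
}

func main() {
	fmt.Println("resume -> set-mmds done:  ", elapsed(resumeDone, mmdsDone))
	fmt.Println("resume -> first request:  ", elapsed(resumeDone, requestStart))
	fmt.Println("first request duration:   ", elapsed(requestStart, requestEnd))
	fmt.Println("failure -> retry start:   ", elapsed(requestEnd, retryStart))
	fmt.Println("retry duration:           ", elapsed(retryStart, retryEnd))
	fmt.Println("resume -> successful init:", elapsed(resumeDone, retryEnd))
}
```

The results agree with the rounded figures above: roughly 0.5 ms and 0.6 ms after resume, about 30 ms for the failed request, about 5.6 ms before the retry, and about 36 ms for the successful one.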

## Environment

- **VM Technology**: Firecracker microVM with snapshot/restore
- **envd HTTP Server**: Listening on port 49983 inside the VM
- **Orchestrator**: Go HTTP client making requests to envd
- **Network**: Direct connection between orchestrator and VM (no proxy/load balancer)

## Reproduction

This issue occurs on **every single sandbox creation**:

```python
import time

from e2b_code_interpreter import Sandbox

# Measure how long it takes to create a sandbox from the 'base' template.
start_time = time.time()
with Sandbox.create('base') as sandbox:
    creation_time = time.time() - start_time
    print(creation_time)
```

1. Template is created with envd running as HTTP server
2. VM is snapshotted via Firecracker
3. VM is resumed from snapshot via `ResumeVM` API
4. Orchestrator sends HTTP POST to `http://<vm-ip>:49983/init`
5. **First request fails** with "connection reset by peer"
6. **Second request succeeds** immediately

**Reproduction rate**: 100%

## Troubleshooting Attempts

### Attempt 1: Increase Request Timeout ❌

**Hypothesis**: The 50ms timeout is too short for the VM to be ready.

**Test**: Increased `ENVD_INIT_TIMEOUT_MILLISECONDS` from 50ms to 200ms.

**Result**: No change. The error is still `connection reset by peer`, not a timeout error.
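For reference, this is roughly how the two failure modes can be told apart in Go. It is a minimal sketch, not the orchestrator's actual error handling: `classify`, `sampleResetError`, and `sampleTimeoutError` are hypothetical helpers, and the `syscall.ECONNRESET` check assumes a Linux host.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// classify reports whether err looks like a peer reset, a timeout, or
// something else. Hypothetical helper, for illustration only.
func classify(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, syscall.ECONNRESET):
		return "reset" // the peer answered with a TCP RST
	case errors.Is(err, context.DeadlineExceeded),
		errors.As(err, &netErr) && netErr.Timeout():
		return "timeout" // a deadline expired before a response arrived
	default:
		return "other"
	}
}

// sampleResetError builds an error shaped like the one in the trace:
// a *net.OpError wrapping ECONNRESET.
func sampleResetError() error {
	return &net.OpError{
		Op:  "read",
		Net: "tcp",
		Err: os.NewSyscallError("read", syscall.ECONNRESET),
	}
}

// sampleTimeoutError builds a *net.OpError caused by an expired deadline.
func sampleTimeoutError() error {
	return &net.OpError{Op: "read", Net: "tcp", Err: os.ErrDeadlineExceeded}
}

func main() {
	fmt.Println(classify(sampleResetError()))
	fmt.Println(classify(sampleTimeoutError()))
	fmt.Println(classify(nil))
}
```

A reset means the peer actively rejected the connection, so a longer timeout would not be expected to change the outcome, which matches the result above.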

### Attempt 2: Add Delay After VM Resume ❌

**Hypothesis**: The network stack or envd process needs initialization time after snapshot resume.

**Test**: Added 100 ms delay after `resumeVM()` and before HTTP request:

```go
err = p.client.resumeVM(ctx)
if err != nil {
    return errors.Join(fmt.Errorf("error resuming vm: %w", err), fcStopErr)
}

time.Sleep(100 * time.Millisecond)  // Give VM time to initialize

err = p.client.setMmds(ctx, meta)
```

## Questions

1. **Is this expected behavior?** Should we always expect the first request after VM resume to fail?

2. **Is there a proper way to detect when envd is ready** to accept connections after VM resume? Some kind of health check or readiness probe?

3. **Why does the TCP connection reset?** The VM is resumed and the envd process should be running. What causes the server side to send RST? Could this be related to the HTTP keep-alive setting in the microVM?
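To make questions 2 and 3 concrete, here is a rough sketch of a readiness-style probe we could use as a workaround. It retries the init call on transport errors and disables keep-alives so that every attempt uses a fresh TCP connection, which would also help rule out connection reuse as the cause. Everything here is an assumption for illustration: `newProbeClient`, `waitForInit`, and `demo` are hypothetical names, the retry count and backoff are arbitrary, and it assumes `POST /init` is safe to repeat. The `demo` function only simulates the observed behavior against a local test server; it does not talk to envd.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// newProbeClient builds a client that opens a fresh TCP connection for every
// request, so a pooled connection is never reused across attempts.
func newProbeClient(perAttempt time.Duration) *http.Client {
	return &http.Client{
		Timeout:   perAttempt,
		Transport: &http.Transport{DisableKeepAlives: true},
	}
}

// waitForInit POSTs to url until it receives any HTTP response or runs out
// of attempts. It returns the status code and the number of attempts used.
func waitForInit(ctx context.Context, c *http.Client, url string, maxAttempts int, backoff time.Duration) (int, int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
		if err != nil {
			return 0, attempt, err
		}
		resp, err := c.Do(req)
		if err == nil {
			resp.Body.Close()
			return resp.StatusCode, attempt, nil
		}
		lastErr = err // transport-level failure, e.g. connection reset
		select {
		case <-ctx.Done():
			return 0, attempt, ctx.Err()
		case <-time.After(backoff):
		}
	}
	return 0, maxAttempts, fmt.Errorf("no response after %d attempts: %w", maxAttempts, lastErr)
}

// demo simulates the reported pattern with a local test server: the first
// connection is closed with a TCP RST, every later request gets 204.
func demo() (status, attempts int, err error) {
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) == 1 {
			conn, _, hjErr := w.(http.Hijacker).Hijack()
			if hjErr != nil {
				return
			}
			if tcp, ok := conn.(*net.TCPConn); ok {
				tcp.SetLinger(0) // close with RST instead of FIN
			}
			conn.Close()
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()
	return waitForInit(context.Background(), newProbeClient(time.Second), srv.URL+"/init", 3, 5*time.Millisecond)
}

func main() {
	status, attempts, err := demo()
	fmt.Println(status, attempts, err)
}
```

This only hides the symptom. We would still like to understand why the first connection after resume is reset at all.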


---

Any insights on why this happens and how to properly handle it would be greatly appreciated!

