Comparative evaluation of Claude Sonnet 4.5 on SWE-Bench-Lite with and without Model Context Protocol (MCP) tools.
This repository runs two parallel evaluations:
- Baseline: Claude Sonnet 4.5 with standard OpenHands tools
- Baseline-with-MCP: Claude Sonnet 4.5 with the Supermodel MCP server, which provides code graph analysis tools
The goal is to measure the impact of MCP-provided code intelligence (dependency graphs, call graphs, code structure analysis) on software engineering task performance.
Clone this repository with submodules:
git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git
cd swe-bench
If you already cloned without submodules, initialize them:
git submodule update --init --recursive
SWE-Bench evaluations require x86_64 (Intel/AMD) architecture. The Docker images used for evaluations are built for linux/amd64 and will not run natively on ARM-based systems (Apple Silicon M1/M2/M3, ARM servers). A quick architecture check is shown after the options list below.
Options:
- Run on an x86_64 Linux server or VM
- Use cloud VMs (AWS EC2, Azure, GCP) with x86_64 instances
- ❌ Will not work on Apple Silicon Macs (M1/M2/M3) without emulation (extremely slow)
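If you are unsure what a given host reports, here is a minimal check (assuming Python 3 is available; `uname -m` on the command line gives the same answer):

```python
# Warn if the host is not x86_64/amd64; the SWE-Bench evaluation images target linux/amd64.
import platform

arch = platform.machine()
print(f"Host architecture: {arch}")
if arch.lower() not in ("x86_64", "amd64"):
    print("Warning: the evaluation Docker images will not run natively on this machine.")
```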
- Docker - Required for isolated evaluation environments
  - Install from docker.com
  - Ensure the Docker daemon is running:
    docker ps
- uv - Python package manager
  curl -LsSf https://astral.sh/uv/install.sh | sh
- API Keys
  export ANTHROPIC_API_KEY="your-anthropic-key"
  export SUPERMODELTOOLS_API_KEY="your-supermodel-key"
Test with a single instance (builds images and runs both evaluations):
./run.sh --build-images --n-limit 1
This will:
- Build Docker images for both baseline and MCP environments
- Run both evaluations in parallel on 1 test instance
- Save results to ./results/baseline/ and ./results/baseline-with-mcp/
# Test with 5 instances (~20-30 minutes)
./run.sh --build-images --n-limit 5
# Build all Docker images (one-time, ~30-60 minutes)
./run.sh --build-images
# Run full SWE-Bench-Lite evaluation (300 instances, several hours)
# Images are cached, so this is much faster
./run.sh
# Create a file with specific instance IDs
echo "scikit-learn__scikit-learn-25500" > instances.txt
echo "django__django-11333" >> instances.txt
# Run on those instances only
./run.sh --build-images --select instances.txt
- Max iterations: 150 (configurable in run.sh)
- Model: claude-sonnet-4-5-20250929
- Dataset: princeton-nlp/SWE-bench_Lite (300 instances; a quick way to inspect it locally is shown after this list)
- Workspace: Docker (isolated containers per instance)
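If you want to browse the dataset before a full run (for example, to pick IDs for an instances.txt file), a minimal sketch assuming the Hugging Face datasets library is installed:

```python
# Inspect SWE-Bench-Lite locally; requires `pip install datasets` (or `uv pip install datasets`).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(ds), "instances")       # 300
print(ds[0]["instance_id"])       # IDs like these go into instances.txt for --select
```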
The baseline-with-mcp configuration adds the Supermodel MCP server:
mcp_config = {
"mcpServers": {
"supermodel": {
"command": "npx",
"args": ["-y", "@supermodeltools/mcp-server@0.4.1"],
"env": {
"SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")
}
}
}
}
This provides Claude with tools to:
- Analyze code dependencies and imports
- Generate call graphs and execution flows
- Understand code structure and relationships
- Navigate large codebases efficiently
Two separate Docker image sets are built:
- Baseline: Standard OpenHands eval-agent-server image
- MCP: Extended image with Node.js 20 for MCP server support
Results are saved to separate directories for comparison:
results/
├── baseline/ # Standard evaluation
│ └── princeton-nlp__SWE-bench_Lite-test/
│ └── anthropic/
│ └── output.jsonl
└── baseline-with-mcp/ # MCP-enhanced evaluation
└── princeton-nlp__SWE-bench_Lite-test/
└── anthropic/
└── output.jsonl
Each output.jsonl contains:
- Generated code patches
- Agent conversation history
- Token usage and cost metrics
- Success/failure status
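The exact field names depend on the OpenHands benchmarks version, so the sketch below only counts records and reports which top-level fields a run produced:

```python
# Peek at an output.jsonl from a finished (or in-progress) run; adjust the path as needed.
import json
from pathlib import Path

path = Path("results/baseline/princeton-nlp__SWE-bench_Lite-test/anthropic/output.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

print(f"{len(records)} instances in {path}")
if records:
    print("Top-level fields:", sorted(records[0].keys()))
```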
Use SWE-Bench evaluation tools to score the patches:
# Baseline results (run from the repository root)
cd baseline/openhands-benchmarks
uv run swebench-eval ../../results/baseline/output.jsonl
# MCP-enhanced results (again from the repository root)
cd baseline-with-mcp/openhands-benchmarks
uv run swebench-eval ../../results/baseline-with-mcp/output.jsonl
./run.sh [OPTIONS]
Options:
--build-images Build Docker images before running (required on first run)
--n-limit N Limit evaluation to first N instances
--select FILE Evaluate only instances listed in FILE (one per line)
Any additional arguments are passed to swebench-infer.
├── run.sh # Main runner script
├── baseline/ # Standard evaluation setup
│ ├── llm_config.json # Claude configuration
│ └── openhands-benchmarks/ # Submodule: OpenHands/benchmarks
└── baseline-with-mcp/ # MCP-enhanced setup
├── llm_config.json # Claude configuration
└── openhands-benchmarks/ # Submodule: supermodeltools/open-hands-benchmarks-with-mcp
- baseline/openhands-benchmarks: OpenHands/benchmarks
  - Standard SWE-Bench evaluation harness
- baseline-with-mcp/openhands-benchmarks: supermodeltools/open-hands-benchmarks-with-mcp
  - Fork with MCP integration
  - Includes Node.js 20 in the Docker image
  - Adds the Supermodel MCP configuration
Error: Docker containers crash or fail to start
Failed to load runtime instance
Container exited unexpectedly
Cause: Running on ARM-based system (Apple Silicon, ARM servers) with x86_64 Docker images.
Solution: Use an x86_64 machine or VM. Example Azure VM setup:
# Create x86_64 VM on Azure
az vm create \
--resource-group swe-bench-rg \
--name swe-bench-vm \
--image Ubuntu2204 \
--size Standard_D8s_v3 \
--admin-username azureuser \
--generate-ssh-keys
# SSH and clone repository
ssh azureuser@<vm-ip>
git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git
If Docker commands fail because the daemon is not running:
# Check Docker status
docker ps
# Start Docker Desktop or the daemon
open /Applications/Docker.app  # macOS
# or systemctl start docker    # Linux
Error: references a workspace in `tool.uv.sources`, but is not a workspace member
Solution: Initialize all nested submodules:
git submodule update --init --recursive
- Evaluations may take several hours due to API rate limits
- Results are saved incrementally - you can stop and resume
- Use --n-limit to test with fewer instances first
# Check Docker disk usage
docker system df
# Clean up old images and containers
docker system prune -a
# Remove specific eval-agent-server images
docker images | grep eval-agent-server
docker rmi <image-id>
If MCP evaluation fails with connection errors:
- Verify SUPERMODELTOOLS_API_KEY is set (a quick check is sketched after this list)
- Check that Node.js 20 is installed in the Docker image
- Test MCP server manually:
npx -y @supermodeltools/mcp-server@0.4.1
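A minimal check that the key is exported in the shell you launch the evaluation from:

```python
# The MCP server reads SUPERMODELTOOLS_API_KEY from the environment,
# so confirm it is visible to processes launched from this shell.
import os

if os.getenv("SUPERMODELTOOLS_API_KEY"):
    print("SUPERMODELTOOLS_API_KEY is set")
else:
    print("SUPERMODELTOOLS_API_KEY is missing; export it before running ./run.sh")
```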
- OpenHands Benchmarks
- SWE-Bench Paper
- SWE-Bench Leaderboard
- Model Context Protocol
- Supermodel MCP Server
To add your own MCP servers or modify the evaluation:
- Fork the baseline-with-mcp/openhands-benchmarks submodule
- Modify benchmarks/swebench/run_infer.py to add your MCP configuration (a sketch is shown after this list)
- Update the Dockerfile if additional dependencies are needed
- Submit a pull request
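As an illustration of the run_infer.py change, a second server can be registered alongside supermodel in the mcp_config dictionary shown above; the "my-analyzer" name, its npm package, and MY_ANALYZER_API_KEY are hypothetical placeholders, not real tools:

```python
import os

# Sketch only: "my-analyzer", "my-analyzer-mcp-server", and MY_ANALYZER_API_KEY are hypothetical.
mcp_config = {
    "mcpServers": {
        "supermodel": {
            "command": "npx",
            "args": ["-y", "@supermodeltools/mcp-server@0.4.1"],
            "env": {"SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")},
        },
        "my-analyzer": {
            "command": "npx",
            "args": ["-y", "my-analyzer-mcp-server"],
            "env": {"MY_ANALYZER_API_KEY": os.getenv("MY_ANALYZER_API_KEY", "")},
        },
    }
}
```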
This repository follows the licenses of its submodules:
- OpenHands Benchmarks: MIT License