Comparative evaluation of Claude Sonnet 4.5 on SWE-Bench-Lite with and without Model Context Protocol (MCP) tools.
This repository runs two parallel evaluations:
- Baseline: Claude Sonnet 4.5 with standard OpenHands tools
- Baseline-with-MCP: Claude Sonnet 4.5 with the Supermodel MCP server, which provides code graph analysis tools
The goal is to measure the impact of MCP-provided code intelligence (dependency graphs, call graphs, code structure analysis) on software engineering task performance.
Clone this repository with submodules:
git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git
cd swe-bench
If you already cloned without submodules, initialize them:
git submodule update --init --recursive
SWE-Bench evaluations require x86_64 (Intel/AMD) architecture. The Docker images used for evaluations are built for linux/amd64 and will not run natively on ARM-based systems (Apple Silicon M1/M2/M3, ARM servers). A quick architecture check is shown after the options list below.
Options:
- Run on an x86_64 Linux server or VM
- Use cloud VMs (AWS EC2, Azure, GCP) with x86_64 instances
- ❌ Will not work on Apple Silicon Macs (M1/M2/M3) without emulation (extremely slow)
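If you are unsure what a given host reports, here is a minimal check (assuming Python 3 is available; `uname -m` on the command line gives the same answer):

```python
# Warn if the host is not x86_64/amd64; the SWE-Bench evaluation images target linux/amd64.
import platform

arch = platform.machine()
print(f"Host architecture: {arch}")
if arch.lower() not in ("x86_64", "amd64"):
    print("Warning: the evaluation Docker images will not run natively on this machine.")
```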
- Docker - Required for isolated evaluation environments
  - Install from docker.com
  - Ensure the Docker daemon is running:
    docker ps
- uv - Python package manager
  curl -LsSf https://astral.sh/uv/install.sh | sh
- API Keys
  export ANTHROPIC_API_KEY="your-anthropic-key"
  export SUPERMODELTOOLS_API_KEY="your-supermodel-key"
Test with a single instance (builds images and runs both evaluations):
./run.sh --build-images --n-limit 1
This will:
- Build Docker images for both baseline and MCP environments
- Run both evaluations in parallel on 1 test instance
- Save results to ./results/baseline/ and ./results/baseline-with-mcp/
# Test with 5 instances (~20-30 minutes)
./run.sh --build-images --n-limit 5
# Build all Docker images (one-time, ~30-60 minutes)
./run.sh --build-images
# Run full SWE-Bench-Lite evaluation (300 instances, several hours)
# Images are cached, so this is much faster
./run.sh
# Create a file with specific instance IDs
echo "scikit-learn__scikit-learn-25500" > instances.txt
echo "django__django-11333" >> instances.txt
# Run on those instances only
./run.sh --build-images --select instances.txt
- Max iterations: 150 (configurable in run.sh)
- Model: claude-sonnet-4-5-20250929
- Dataset: princeton-nlp/SWE-bench_Lite (300 instances; a quick way to inspect it locally is shown after this list)
- Workspace: Docker (isolated containers per instance)
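If you want to browse the dataset before a full run (for example, to pick IDs for an instances.txt file), a minimal sketch assuming the Hugging Face datasets library is installed:

```python
# Inspect SWE-Bench-Lite locally; requires `pip install datasets` (or `uv pip install datasets`).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(ds), "instances")       # 300
print(ds[0]["instance_id"])       # IDs like these go into instances.txt for --select
```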
The baseline-with-mcp configuration adds the Supermodel MCP server:
mcp_config = {
"mcpServers": {
"supermodel": {
"command": "npx",
"args": ["-y", "@supermodeltools/mcp-server@0.4.1"],
"env": {
"SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")
}
}
}
}
This provides Claude with tools to:
- Analyze code dependencies and imports
- Generate call graphs and execution flows
- Understand code structure and relationships
- Navigate large codebases efficiently
Two separate Docker image sets are built:
- Baseline: Standard OpenHands eval-agent-server image
- MCP: Extended image with Node.js 20 for MCP server support
Results are saved to separate directories for comparison:
results/
├── baseline/ # Standard evaluation
│ └── princeton-nlp__SWE-bench_Lite-test/
│ └── anthropic/
│ └── output.jsonl
└── baseline-with-mcp/ # MCP-enhanced evaluation
└── princeton-nlp__SWE-bench_Lite-test/
└── anthropic/
└── output.jsonl
Each output.jsonl contains:
- Generated code patches
- Agent conversation history
- Token usage and cost metrics
- Success/failure status
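The exact field names depend on the OpenHands benchmarks version, so the sketch below only counts records and reports which top-level fields a run produced:

```python
# Peek at an output.jsonl from a finished (or in-progress) run; adjust the path as needed.
import json
from pathlib import Path

path = Path("results/baseline/princeton-nlp__SWE-bench_Lite-test/anthropic/output.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

print(f"{len(records)} instances in {path}")
if records:
    print("Top-level fields:", sorted(records[0].keys()))
```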
Use SWE-Bench evaluation tools to score the patches:
# Baseline results (run from the repository root)
cd baseline/openhands-benchmarks
uv run swebench-eval ../../results/baseline/output.jsonl
# MCP-enhanced results (again from the repository root)
cd baseline-with-mcp/openhands-benchmarks
uv run swebench-eval ../../results/baseline-with-mcp/output.jsonl
./run.sh [OPTIONS]
Options:
--build-images Build Docker images before running (required on first run)
--n-limit N Limit evaluation to first N instances
--select FILE Evaluate only instances listed in FILE (one per line)
Any additional arguments are passed to swebench-infer.
├── run.sh # Main runner script
├── baseline/ # Standard evaluation setup
│ ├── llm_config.json # Claude configuration
│ └── openhands-benchmarks/ # Submodule: OpenHands/benchmarks
└── baseline-with-mcp/ # MCP-enhanced setup
├── llm_config.json # Claude configuration
└── openhands-benchmarks/ # Submodule: supermodeltools/open-hands-benchmarks-with-mcp
- baseline/openhands-benchmarks: OpenHands/benchmarks
  - Standard SWE-Bench evaluation harness
- baseline-with-mcp/openhands-benchmarks: supermodeltools/open-hands-benchmarks-with-mcp
  - Fork with MCP integration
  - Includes Node.js 20 in the Docker image
  - Adds the Supermodel MCP configuration
Error: Docker containers crash or fail to start
Failed to load runtime instance
Container exited unexpectedly
Cause: Running on ARM-based system (Apple Silicon, ARM servers) with x86_64 Docker images.
Solution: Use an x86_64 machine or VM. Example Azure VM setup:
# Create x86_64 VM on Azure
az vm create \
--resource-group swe-bench-rg \
--name swe-bench-vm \
--image Ubuntu2204 \
--size Standard_D8s_v3 \
--admin-username azureuser \
--generate-ssh-keys
# SSH and clone repository
ssh azureuser@<vm-ip>
git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git
If Docker commands fail because the daemon is not running:
# Check Docker status
docker ps
# Start Docker Desktop or the daemon
open /Applications/Docker.app  # macOS
# or systemctl start docker    # Linux
Error: references a workspace in `tool.uv.sources`, but is not a workspace member
Solution: Initialize all nested submodules:
git submodule update --init --recursive
- Evaluations may take several hours due to API rate limits
- Results are saved incrementally - you can stop and resume
- Use --n-limit to test with fewer instances first
# Check Docker disk usage
docker system df
# Clean up old images and containers
docker system prune -a
# Remove specific eval-agent-server images
docker images | grep eval-agent-server
docker rmi <image-id>
If MCP evaluation fails with connection errors:
- Verify SUPERMODELTOOLS_API_KEY is set (a quick check is sketched after this list)
- Check that Node.js 20 is installed in the Docker image
- Test MCP server manually:
npx -y @supermodeltools/mcp-server@0.4.1
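A minimal check that the key is exported in the shell you launch the evaluation from:

```python
# The MCP server reads SUPERMODELTOOLS_API_KEY from the environment,
# so confirm it is visible to processes launched from this shell.
import os

if os.getenv("SUPERMODELTOOLS_API_KEY"):
    print("SUPERMODELTOOLS_API_KEY is set")
else:
    print("SUPERMODELTOOLS_API_KEY is missing; export it before running ./run.sh")
```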
- OpenHands Benchmarks
- SWE-Bench Paper
- SWE-Bench Leaderboard
- Model Context Protocol
- Supermodel MCP Server
To add your own MCP servers or modify the evaluation:
- Fork the baseline-with-mcp/openhands-benchmarks submodule
- Modify benchmarks/swebench/run_infer.py to add your MCP configuration (a sketch is shown after this list)
- Update the Dockerfile if additional dependencies are needed
- Submit a pull request
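As an illustration of the run_infer.py change, a second server can be registered alongside supermodel in the mcp_config dictionary shown above; the "my-analyzer" name, its npm package, and MY_ANALYZER_API_KEY are hypothetical placeholders, not real tools:

```python
import os

# Sketch only: "my-analyzer", "my-analyzer-mcp-server", and MY_ANALYZER_API_KEY are hypothetical.
mcp_config = {
    "mcpServers": {
        "supermodel": {
            "command": "npx",
            "args": ["-y", "@supermodeltools/mcp-server@0.4.1"],
            "env": {"SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")},
        },
        "my-analyzer": {
            "command": "npx",
            "args": ["-y", "my-analyzer-mcp-server"],
            "env": {"MY_ANALYZER_API_KEY": os.getenv("MY_ANALYZER_API_KEY", "")},
        },
    }
}
```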
This repository follows the licenses of its submodules:
- OpenHands Benchmarks: MIT License