SWE-Bench Evaluation: Baseline vs MCP-Enhanced

Comparative evaluation of Claude Sonnet 4.5 on SWE-Bench-Lite with and without Model Context Protocol (MCP) tools.

Overview

This repository runs two parallel evaluations:

  1. Baseline: Claude Sonnet 4.5 with standard OpenHands tools
  2. Baseline-with-MCP: Claude Sonnet 4.5 with Supermodel MCP server providing code graph analysis tools

The goal is to measure the impact of MCP-provided code intelligence (dependency graphs, call graphs, code structure analysis) on software engineering task performance.

Setup

Clone this repository with submodules:

git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git
cd swe-bench

If you already cloned without submodules, initialize them:

git submodule update --init --recursive

Prerequisites

Platform Requirements

⚠️ Important: x86_64 architecture required

SWE-Bench evaluations require x86_64 (Intel/AMD) architecture. The Docker images used for evaluations are built for linux/amd64 and will not run on ARM-based systems (Apple Silicon M1/M2/M3, ARM servers).

Options:

  • Run on an x86_64 Linux server or VM
  • Use cloud VMs (AWS EC2, Azure, GCP) with x86_64 instances
  • ❌ Will not work on Apple Silicon Macs (M1/M2/M3) without emulation (extremely slow)

Software Requirements

  1. Docker - Required for isolated evaluation environments

    • Install from docker.com
    • Ensure Docker daemon is running: docker ps
  2. uv - Python package manager

    curl -LsSf https://astral.sh/uv/install.sh | sh
  3. API Keys

    export ANTHROPIC_API_KEY="your-anthropic-key"
    export SUPERMODELTOOLS_API_KEY="your-supermodel-key"

Quick Start

Test with a single instance (builds images and runs both evaluations):

./run.sh --build-images --n-limit 1

This will:

  • Build Docker images for both baseline and MCP environments
  • Run both evaluations in parallel on 1 test instance
  • Save results to ./results/baseline/ and ./results/baseline-with-mcp/

Running Evaluations

Test Run (Recommended First)

# Test with 5 instances (~20-30 minutes)
./run.sh --build-images --n-limit 5

Full Evaluation

# Build all Docker images (one-time, ~30-60 minutes)
./run.sh --build-images

# Run full SWE-Bench-Lite evaluation (300 instances, several hours)
# Images are cached, so this is much faster
./run.sh

Custom Instance Selection

# Create a file with specific instance IDs
echo "scikit-learn__scikit-learn-25500" > instances.txt
echo "django__django-11333" >> instances.txt

# Run on those instances only
./run.sh --build-images --select instances.txt
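
To pick instance IDs programmatically instead of writing instances.txt by hand, you can browse the dataset with the Hugging Face datasets library. This is a minimal sketch, not part of run.sh; it assumes the datasets package is installed (e.g. uv pip install datasets), and the Django filter is just an example.

# pick_instances.py - write a selection file for ./run.sh --select
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# Example filter: keep only Django instances
selected = [row["instance_id"] for row in ds if row["repo"] == "django/django"]

with open("instances.txt", "w") as f:
    f.write("\n".join(selected) + "\n")

print(f"Wrote {len(selected)} instance IDs to instances.txt")

Then run ./run.sh --build-images --select instances.txt as shown above.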

Configuration

Evaluation Parameters

  • Max iterations: 150 (configurable in run.sh)
  • Model: claude-sonnet-4-5-20250929
  • Dataset: princeton-nlp/SWE-bench_Lite (300 instances)
  • Workspace: Docker (isolated containers per instance)

MCP Integration

The baseline-with-mcp configuration adds the Supermodel MCP server:

mcp_config = {
    "mcpServers": {
        "supermodel": {
            "command": "npx",
            "args": ["-y", "@supermodeltools/mcp-server@0.4.1"],
            "env": {
                "SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")
            }
        }
    }
}

This provides Claude with tools to:

  • Analyze code dependencies and imports
  • Generate call graphs and execution flows
  • Understand code structure and relationships
  • Navigate large codebases efficiently

Docker Images

Two separate Docker image sets are built:

  1. Baseline: Standard OpenHands eval-agent-server image
  2. MCP: Extended image with Node.js 20 for MCP server support

Results

Results are saved to separate directories for comparison:

results/
├── baseline/                          # Standard evaluation
│   └── princeton-nlp__SWE-bench_Lite-test/
│       └── anthropic/
│           └── output.jsonl
└── baseline-with-mcp/                 # MCP-enhanced evaluation
    └── princeton-nlp__SWE-bench_Lite-test/
        └── anthropic/
            └── output.jsonl

Each output.jsonl contains:

  • Generated code patches
  • Agent conversation history
  • Token usage and cost metrics
  • Success/failure status
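
For a quick look at a run, you can tally these records with a short script. This is a sketch only: the exact field names inside each record depend on the OpenHands benchmarks version, so treat model_patch, test_result, and metrics below as assumptions and adjust them to what your output.jsonl actually contains.

# summarize_run.py - rough tally of one output.jsonl (field names are assumptions)
import json
import sys

path = sys.argv[1]  # e.g. results/baseline/.../output.jsonl
records = [json.loads(line) for line in open(path) if line.strip()]

# Count records that produced a non-empty patch and sum the reported cost
with_patch = sum(
    1 for r in records
    if r.get("model_patch") or r.get("test_result", {}).get("git_patch")
)
total_cost = sum(r.get("metrics", {}).get("accumulated_cost", 0) for r in records)

print(f"instances:          {len(records)}")
print(f"non-empty patches:  {with_patch}")
print(f"approx. total cost: ${total_cost:.2f}")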

Evaluating Results

Use SWE-Bench evaluation tools to score the patches:

# Baseline results
cd baseline/openhands-benchmarks
uv run swebench-eval ../../results/baseline/output.jsonl

# MCP-enhanced results
cd baseline-with-mcp/openhands-benchmarks
uv run swebench-eval ../../results/baseline-with-mcp/output.jsonl
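
Once both runs are scored, a small script can diff the outcomes. The sketch below assumes each evaluation writes a JSON report containing a list of resolved instance IDs (shown here as resolved_ids); the report paths and key names are assumptions, so adjust them to what swebench-eval actually produces.

# compare_runs.py - diff resolved instances between the two runs
# (report paths and the "resolved_ids" key are assumptions)
import json

def resolved(report_path):
    with open(report_path) as f:
        return set(json.load(f).get("resolved_ids", []))

baseline = resolved("results/baseline/report.json")
mcp = resolved("results/baseline-with-mcp/report.json")

print(f"baseline resolved: {len(baseline)}")
print(f"mcp resolved:      {len(mcp)}")
print(f"only with MCP:     {sorted(mcp - baseline)}")
print(f"only baseline:     {sorted(baseline - mcp)}")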

Command Line Options

./run.sh [OPTIONS]

Options:
  --build-images     Build Docker images before running (required on first run)
  --n-limit N        Limit evaluation to first N instances
  --select FILE      Evaluate only instances listed in FILE (one per line)

Any additional arguments are passed through to swebench-infer.

Architecture

Repository Structure

.
├── run.sh                          # Main runner script
├── baseline/                       # Standard evaluation setup
│   ├── llm_config.json            # Claude configuration
│   └── openhands-benchmarks/      # Submodule: OpenHands/benchmarks
└── baseline-with-mcp/             # MCP-enhanced setup
    ├── llm_config.json            # Claude configuration
    └── openhands-benchmarks/      # Submodule: supermodeltools/open-hands-benchmarks-with-mcp

Submodules

  • baseline/openhands-benchmarks: OpenHands/benchmarks (standard evaluation harness)
  • baseline-with-mcp/openhands-benchmarks: supermodeltools/open-hands-benchmarks-with-mcp (harness extended with the Supermodel MCP server)

Troubleshooting

Architecture / Platform Issues

Error: Docker containers crash or fail to start

Failed to load runtime instance
Container exited unexpectedly

Cause: Running on ARM-based system (Apple Silicon, ARM servers) with x86_64 Docker images.

Solution: Use an x86_64 machine or VM. Example Azure VM setup:

# Create x86_64 VM on Azure
az vm create \
  --resource-group swe-bench-rg \
  --name swe-bench-vm \
  --image Ubuntu2204 \
  --size Standard_D8s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

# SSH and clone repository
ssh azureuser@<vm-ip>
git clone --recurse-submodules https://github.com/supermodeltools/swe-bench.git

Docker Not Running

# Check Docker status
docker ps

# Start Docker Desktop or daemon
open /Applications/Docker.app  # macOS
# or systemctl start docker      # Linux

Submodule Errors

Error: references a workspace in `tool.uv.sources`, but is not a workspace member

Solution: Initialize all nested submodules:

git submodule update --init --recursive

API Rate Limits

  • Evaluations may take several hours due to API rate limits
  • Results are saved incrementally - you can stop and resume
  • Use --n-limit to test with fewer instances first

Docker Disk Space

# Check Docker disk usage
docker system df

# Clean up old images and containers
docker system prune -a

# Remove specific eval-agent-server images
docker images | grep eval-agent-server
docker rmi <image-id>

MCP Server Connection Issues

If MCP evaluation fails with connection errors:

  1. Verify SUPERMODELTOOLS_API_KEY is set
  2. Check that Node.js 20 is installed in the Docker image
  3. Test MCP server manually: npx -y @supermodeltools/mcp-server@0.4.1
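
You can also exercise the server from Python to confirm the API key works and see which tools it exposes. This sketch uses the official mcp Python SDK (e.g. uv pip install mcp) and is independent of the evaluation harness.

# check_mcp.py - connect to the Supermodel MCP server over stdio and list its tools
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(
        command="npx",
        args=["-y", "@supermodeltools/mcp-server@0.4.1"],
        # Pass the full environment plus the API key so npx stays on PATH
        env={**os.environ, "SUPERMODELTOOLS_API_KEY": os.getenv("SUPERMODELTOOLS_API_KEY", "")},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(main())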

Contributing

To add your own MCP servers or modify the evaluation:

  1. Fork the baseline-with-mcp/openhands-benchmarks submodule
  2. Modify benchmarks/swebench/run_infer.py to add your MCP configuration
  3. Update the Dockerfile if additional dependencies are needed
  4. Submit a pull request

License

This repository follows the licenses of its respective submodules; see each submodule for details.
