Has Anyone Successfully Run the SWE-Lancer Benchmark? #49

ShaneTian · 2025-02-28T06:53:55Z

Description:
I am trying to run the SWE-Lancer benchmark on my system, but I would like to confirm if others have successfully completed the process.

System Details:

OS: CentOS Linux 7 64bit / Linux 4.14.0_1-0-0-51
CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz 3.10/0.00GHz, 96 core
MEM: 377GB
Disk: 8 * 7.3TB
Docker: 27.0.0-rc.2

Steps Taken

✅ Step 1: Environment Setup (Success)

uv sync
source .venv/bin/activate
for proj in nanoeval alcatraz nanoeval_alcatraz; do
  uv pip install -e project/"$proj"
done

✅ Step 2: Docker Build (Success)

docker buildx build \
  -f Dockerfile_x86 \
  --platform linux/amd64 \
  --ssh default=$SSH_AUTH_SOCK \
  --network host \
  -t swelancer \
  .

✅ Step 3: Running Container (Success)

Using the ISSUE_ID=1 environment variable, as suggested in #44 (comment):

docker run -itd --name swelancer-runtime -p 5900:5900 -p 5901:5901 -e ISSUE_ID=1 swelancer

❓ Step 4: Running SWE-Lancer (Partial Failure)

uv run python run_swelancer.py --issue_ids 1 2 3 4 5 6 7 8 9 10 11

I used the gpt-4o-2024-11-20 model and ran the first 11 issues. Each issue takes roughly 20 minutes to an hour.
Seven issues (1, 2, 4, 5, 8, 10, 11) run to completion, but every result is a failure. The other four (3, 6, 7, 9) do not run properly and raise errors.
I am not sure whether something is wrong with my setup.
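One way to separate the issues that crash from the ones that merely fail is to launch each issue ID in its own process and record the exit status. Below is a minimal Python sketch of that idea. The command is the one used in Step 4; the helper names `build_command` and `run_issues` are hypothetical, and treating a non-zero exit code as a crash is an assumption, since the thread does not confirm how `run_swelancer.py` reports errors.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Optional


def build_command(issue_id: int) -> List[str]:
    """Return the Step 4 command restricted to a single issue ID."""
    return ["uv", "run", "python", "run_swelancer.py", "--issue_ids", str(issue_id)]


def run_issues(
    issue_ids: Iterable[int],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> Dict[int, int]:
    """Run each issue separately and map issue ID -> exit code.

    `runner` is injectable so the loop can be exercised without Docker.
    ASSUMPTION: a non-zero exit code means the harness crashed.
    """
    if runner is None:
        # Default: actually launch the benchmark and wait for it to finish.
        runner = lambda cmd: subprocess.run(cmd).returncode
    return {issue_id: runner(build_command(issue_id)) for issue_id in issue_ids}


# Example (launches the real benchmark, so it is left commented out):
# exit_codes = run_issues(range(1, 12))
# crashed = [i for i, code in exit_codes.items() if code != 0]
```

Running the issues one at a time is slower than a single batch, but each exit code and log then belongs to exactly one issue.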

Thanks in advance! 🚀

Lucky-w0y · 2025-03-03T07:15:43Z

Hello, I am also running SWE-Lancer. However, my user-tool does not run successfully. Did you hit this error when the agent calls the user-tool?

bash: cannot set terminal process group (17647): Inappropriate ioctl for device
bash: no job control in this shell
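As far as I know, bash prints these two messages when it is started in interactive mode without a controlling terminal, and they are often warnings rather than the actual cause of a failure. A small Python sketch, assuming the user-tool's stderr is available as text, that drops those two lines so any remaining output is easier to inspect. The helper name `strip_job_control_noise` is hypothetical and not part of SWE-Lancer.

```python
import re

# The two messages quoted above; the PID in parentheses varies between runs.
_NOISE_PATTERNS = (
    re.compile(
        r"bash: cannot set terminal process group \(-?\d+\): "
        r"Inappropriate ioctl for device"
    ),
    re.compile(r"bash: no job control in this shell"),
)


def strip_job_control_noise(stderr: str) -> str:
    """Return stderr without the two bash job-control warnings.

    ASSUMPTION: the warnings are harmless here; whatever is left over
    is more likely to point at the real problem.
    """
    kept = [
        line
        for line in stderr.splitlines()
        if not any(p.fullmatch(line.strip()) for p in _NOISE_PATTERNS)
    ]
    return "\n".join(kept)
```

If the function returns an empty string, the warnings were the only thing on stderr and the failure is probably reported somewhere else, for example in the tool's exit code or stdout.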

BoxiYu · 2025-03-03T12:54:16Z

Hi guys, I have successfully run the SWE-Lancer examples. If you are also on an x86 architecture, you can download my pre-built Docker image here: https://hub.docker.com/repository/docker/cccav/swelancer_x86/general.

Wish you good luck!

moresearch · 2025-03-04T14:13:52Z

@BoxiYu that's great. Could you give a rough idea of the cost of running the examples? And why x86?

BoxiYu · 2025-03-04T14:21:18Z

@moresearch Hi, I only ran it on the two examples, at roughly 2 dollars on average; I did not record the cost precisely. The image at the link was built on an x86 cloud server, and it has run smoothly on my x86 laptop. I could not build it on my ARM device (possibly because of a network error).

moresearch · 2025-03-04T18:56:33Z

@BoxiYu do you think we could, as a community, collectively gather cost data per model and agent implementation? Would you be interested in participating in such an endeavour?

BoxiYu · 2025-03-05T07:11:33Z

I think it may cost a lot to evaluate many models. @moresearch

Xc1ord · 2025-03-07T03:20:46Z

I'm experiencing the same problems while running the SWELancer-Benchmark:

Using the gpt-4o-2024-11-20 model, I've attempted to run the first 5 issues.
uv run python run_swelancer.py --issue_ids 1 2 3 4 5
The 3rd issue fails with an error, and all other results are failures.
The setup process is extremely slow - each container setup runs /SWELancer-Benchmark/runtime_scripts/run.sh with ansible-playbook -i "localhost," --connection=local /app/tests/setup_expensify.yml, where the "Install node modules via npm" step takes at least 20 minutes per issue.

I tried setting the npm registry to use the local npm proxy to improve performance, but this had minimal impact.

    - name: Set npm registry to use the local npm proxy
      shell: |
        source /root/.nvm/nvm.sh
        npm set registry https://registry.npmmirror.com
      args:
        chdir: /app/expensify
        executable: /bin/bash

Questions

Is there a known fix for the 3rd issue error?
How can I optimize the npm installation process to reduce setup time?
Are there configuration changes that could improve overall reliability?

Any guidance would be greatly appreciated.

BoxiYu · 2025-03-07T06:35:38Z

Hi @Xc1ord, I can run issue 3 successfully with my Docker environment, so you may want to try the image I linked above. If your error is caused by network problems, I suggest renting a cloud server, for example on DigitalOcean or GitHub Codespaces.

2025-03-07T06:29:22.072914Z [info     ] Summary:
{'accuracy': np.float64(1.0),
 'aggregations': {'error_breakdown': {},
                  'num_attempts': 1,
                  'num_correct': 1,
                  'num_incorrect': 0,
                  'num_system_error': 0,
                  'num_tasks': 1},
 'is_valid': True,
 'length_stats': {'frac_correct': 1.0,
                  'frac_max_steps': 0.0,
                  'frac_max_time': 0.0,
                  'frac_max_tokens': 0.0,
                  'frac_model_ended': 0.0},
 'metrics_including_errors': {'accuracy': np.float64(1.0),
                              'aggregations': {'error_breakdown': {},
                                               'num_attempts': 1, 
                                               'num_correct': 1,
                                               'num_incorrect': 0,
                                               'num_system_error': 0,
                                               'num_tasks': 1}},
 'missing_tasks': []}
Evaluated 1 tasks with SWELancer component=nanoeval.evaluation
2025-03-07T06:29:22.073040Z [info     ] Evalboard Summary: unknown, add in pr? component=nanoeval.evaluation
SWELancer (r_id=250307062159RHI2DDBJ, m=nanoeval): 100%|████████████████████| 1/1 [07:23<00:00, 443.02s/it, corr=1, errs=0, fail=0]
{'accuracy': np.float64(1.0), 'aggregations': {'num_tasks': 1, 'num_attempts': 1, 'num_correct': 1, 'num_incorrect': 0, 'num_system_error': 0, 'error_breakdown': {}}, 'metrics_including_errors': {'accuracy': np.float64(1.0), 'aggregations': {'num_tasks': 1, 'num_attempts': 1, 'num_correct': 1, 'num_incorrect': 0, 'num_system_error': 0, 'error_breakdown': {}}}, 'missing_tasks': [], 'is_valid': True, 'length_stats': {'frac_correct': 1.0, 'frac_max_time': 0.0, 'frac_max_steps': 0.0, 'frac_max_tokens': 0.0, 'frac_model_ended': 0.0}}
(base) ddq@ddq-ROG-Strix-G18-G814JVR-G814JVR:~/SWELancer-Benchmark$ uv run python run_swelancer.py --issue_ids 3
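When comparing runs, it can help to reduce a summary like the one above to a single verdict. A minimal Python sketch that reads the `aggregations` keys shown in the log; the function name `classify_run` is hypothetical, and the reading of the keys (`num_incorrect` as a graded but wrong attempt, `num_system_error` as a harness or environment failure) is an assumption based on their names, not on nanoeval documentation.

```python
from typing import Any, Mapping


def classify_run(summary: Mapping[str, Any]) -> str:
    """Reduce a nanoeval-style summary dict to a one-word verdict.

    ASSUMPTION: `num_system_error` counts infrastructure failures and
    `num_incorrect` counts attempts that ran but were graded wrong.
    """
    agg = summary["aggregations"]
    if agg["num_system_error"] > 0:
        return "system_error"  # look at the environment first
    if agg["num_incorrect"] > 0:
        return "incorrect"  # harness worked, the model's answer did not pass
    if agg["num_attempts"] > 0 and agg["num_correct"] == agg["num_attempts"]:
        return "all_correct"
    return "incomplete"  # nothing attempted or counts do not add up
```

Applied to the summary above, this returns "all_correct". A run where every result is a failure but nothing crashed would come back as "incorrect", which points at the model or agent rather than the setup.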

Fangzhou66 · 2025-03-13T11:08:42Z

Hi @BoxiYu, did you successfully run a manager task? I have tried many times and it still crashes...

petebachant mentioned this issue Mar 4, 2025

Improve reproducibility and reduce manual setup steps with Calkit #54
