Has Anyone Successfully Run the SWE-Lancer Benchmark? #49

Open
ShaneTian opened this issue Feb 28, 2025 · 9 comments

ShaneTian commented Feb 28, 2025

Description:
I am trying to run the SWE-Lancer benchmark on my system, but I would like to confirm if others have successfully completed the process.

System Details:

  • OS: CentOS Linux 7 64bit / Linux 4.14.0_1-0-0-51
  • CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz 3.10/0.00GHz, 96 core
  • MEM: 377GB
  • Disk: 8 * 7.3TB
  • Docker: 27.0.0-rc.2

Steps Taken

✅ Step 1: Environment Setup (Success)

uv sync
source .venv/bin/activate
for proj in nanoeval alcatraz nanoeval_alcatraz; do
  uv pip install -e project/"$proj"
done

✅ Step 2: Docker Build (Success)

docker buildx build \
  -f Dockerfile_x86 \
  --platform linux/amd64 \
  --ssh default=$SSH_AUTH_SOCK \
  --network host \
  -t swelancer \
  .

✅ Step 3: Running Container (Success)

Using the ISSUE_ID=1 environment variable, per #44 (comment):

docker run -itd --name swelancer-runtime -p 5900:5900 -p 5901:5901 -e ISSUE_ID=1 swelancer

❓ Step 4: Running SWE-Lancer (Partial Failure)

uv run python run_swelancer.py --issue_ids 1 2 3 4 5 6 7 8 9 10 11

I used the gpt-4o-2024-11-20 model and ran the first 11 issues. Each issue takes roughly 20 minutes to an hour.
Seven issues (1, 2, 4, 5, 8, 10, 11) run to completion, but every one of them is graded as a failure. The other four (3, 6, 7, 9) do not run properly and raise errors.
I am not sure whether something is wrong with my setup.
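
One way to separate the issues that crash from those that merely score as failures is to run each issue in its own process and record the exit code. A minimal sketch, assuming that run_swelancer.py exits non-zero when it raises (not confirmed in this thread); `build_command`, `run_each_issue`, `crashed`, and the injectable `runner` are hypothetical names:

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_command(issue_id: int) -> List[str]:
    # Same invocation as in Step 4, restricted to a single issue.
    return ["uv", "run", "python", "run_swelancer.py", "--issue_ids", str(issue_id)]


def run_each_issue(
    issue_ids: Iterable[int],
    runner: Callable[[List[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> Dict[int, int]:
    """Run every issue in its own process and map issue id -> exit code."""
    return {issue_id: runner(build_command(issue_id)) for issue_id in issue_ids}


def crashed(exit_codes: Dict[int, int]) -> List[int]:
    # Assumption: a non-zero exit code means the run raised instead of finishing.
    return sorted(i for i, code in exit_codes.items() if code != 0)
```

Calling `crashed(run_each_issue(range(1, 12)))` from the repository root would then list the issue ids whose process exited abnormally, which keeps one broken issue from hiding the results of the others.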

Thanks in advance! 🚀

@Lucky-w0y

Hello, I am also running SWE-Lancer, but my user-tool does not run successfully. Did you hit this error when the agent calls the user-tool?
bash: cannot set terminal process group (17647): Inappropriate ioctl for device
bash: no job control in this shell
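
Those two bash messages are what bash prints when it is started as an interactive shell without a terminal attached, and they are usually warnings, not the cause of a failure. A minimal sketch for filtering them out of captured stderr so that any real error stands out; `strip_tty_warnings` is a hypothetical helper, not part of SWE-Lancer:

```python
import re
from typing import List

# The two messages quoted above; the number in parentheses is a process id and varies.
_TTY_WARNINGS = [
    re.compile(r"^bash: cannot set terminal process group \(-?\d+\): Inappropriate ioctl for device$"),
    re.compile(r"^bash: no job control in this shell$"),
]


def strip_tty_warnings(stderr: str) -> List[str]:
    """Return the non-empty stderr lines that are not one of the known TTY warnings."""
    kept = []
    for line in stderr.splitlines():
        stripped = line.strip()
        if stripped and not any(p.match(stripped) for p in _TTY_WARNINGS):
            kept.append(stripped)
    return kept
```

If the list comes back empty, the user-tool failure probably has a different cause than these messages.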

@BoxiYu
Contributor

BoxiYu commented Mar 3, 2025

Hi all, I have successfully run the SWE-Lancer examples. If you are also on an x86 machine, you can download my pre-built Docker image here: https://hub.docker.com/repository/docker/cccav/swelancer_x86/general.

Wish you good luck!

@moresearch

@BoxiYu That's great. Could you give a rough idea of what it cost to run the examples? And why x86?

@BoxiYu
Contributor

BoxiYu commented Mar 4, 2025

@moresearch Hi, I only ran two examples, which cost roughly 2 dollars on average, though I did not record it precisely. The image at the link above was built on an x86 cloud server, and it has run smoothly on my x86 laptop. I could not build it on my ARM device (possibly because of a network error).

@moresearch

@BoxiYu Do you think we could, as a community, gather cost data per model and agent implementation? Would you be interested in taking part in such an endeavour?
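
If people do pool cost reports, a shared record format would make them comparable. A minimal sketch of one possible shape; the field names and the `mean_cost` helper are hypothetical, and no real cost figures are implied:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CostRecord:
    model: str       # e.g. "gpt-4o-2024-11-20"
    agent: str       # free-form label for the agent implementation
    issue_id: int
    cost_usd: float  # what the reporter was billed for this issue


def mean_cost(records: Iterable[CostRecord]) -> Dict[Tuple[str, str], float]:
    """Average cost per issue, grouped by (model, agent)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for record in records:
        key = (record.model, record.agent)
        totals[key] += record.cost_usd
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Keeping one record per issue, not one total per run, lets reports that cover different subsets of issues still be averaged together.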

@BoxiYu
Contributor

BoxiYu commented Mar 5, 2025

@moresearch I think evaluating many models could cost a lot.

@Xc1ord

Xc1ord commented Mar 7, 2025

I'm experiencing the same problems while running the SWELancer-Benchmark:

  1. Using the gpt-4o-2024-11-20 model, I've attempted to run the first 5 issues.
    uv run python run_swelancer.py --issue_ids 1 2 3 4 5
  2. The 3rd issue fails with an error, and all other results are failures.
  3. The setup process is extremely slow - each container setup runs /SWELancer-Benchmark/runtime_scripts/run.sh with ansible-playbook -i "localhost," --connection=local /app/tests/setup_expensify.yml, where the "Install node modules via npm" step takes at least 20 minutes per issue.

I tried changing the registry in the "Set npm registry to use the local npm proxy" task to improve performance, but this had minimal impact.

    - name: Set npm registry to use the local npm proxy
      shell: |
        source /root/.nvm/nvm.sh
        npm set registry https://registry.npmmirror.com
      args:
        chdir: /app/expensify
        executable: /bin/bash
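
Before swapping registries in the playbook, it may help to check whether the registry round trip is the bottleneck at all. A rough sketch, assuming `npm` is on the PATH inside the container; `ping_command` and `time_registries` are hypothetical helpers, and a fast ping does not guarantee a fast install:

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, List


def ping_command(registry: str) -> List[str]:
    # `npm ping` checks that the given registry is reachable.
    return ["npm", "ping", "--registry", registry]


def time_registries(
    registries: Iterable[str],
    runner: Callable[[List[str]], int] = lambda cmd: subprocess.run(
        cmd, capture_output=True
    ).returncode,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, float]:
    """Map each reachable registry to the seconds its ping took."""
    timings: Dict[str, float] = {}
    for registry in registries:
        start = clock()
        if runner(ping_command(registry)) == 0:
            timings[registry] = clock() - start
    return timings
```

For example, `time_registries(["https://registry.npmjs.org", "https://registry.npmmirror.com"])` would compare the default registry with the mirror used above. If both respond quickly, the 20 minutes is more likely spent elsewhere in the install step.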

Questions

  1. Is there a known fix for the 3rd issue error?
  2. How can I optimize the npm installation process to reduce setup time?
  3. Are there configuration changes that could improve overall reliability?

Any guidance would be greatly appreciated.

@BoxiYu
Contributor

BoxiYu commented Mar 7, 2025

Hi @Xc1ord, I can run issue 3 successfully in my Docker environment, so you could try the image I linked above. If your error is caused by network problems, I suggest renting a cloud server, for example on DigitalOcean or GitHub Codespaces.

2025-03-07T06:29:22.072914Z [info     ] Summary:
{'accuracy': np.float64(1.0),
 'aggregations': {'error_breakdown': {},
                  'num_attempts': 1,
                  'num_correct': 1,
                  'num_incorrect': 0,
                  'num_system_error': 0,
                  'num_tasks': 1},
 'is_valid': True,
 'length_stats': {'frac_correct': 1.0,
                  'frac_max_steps': 0.0,
                  'frac_max_time': 0.0,
                  'frac_max_tokens': 0.0,
                  'frac_model_ended': 0.0},
 'metrics_including_errors': {'accuracy': np.float64(1.0),
                              'aggregations': {'error_breakdown': {},
                                               'num_attempts': 1, 
                                               'num_correct': 1,
                                               'num_incorrect': 0,
                                               'num_system_error': 0,
                                               'num_tasks': 1}},
 'missing_tasks': []}
Evaluated 1 tasks with SWELancer component=nanoeval.evaluation
2025-03-07T06:29:22.073040Z [info     ] Evalboard Summary: unknown, add in pr? component=nanoeval.evaluation
SWELancer (r_id=250307062159RHI2DDBJ, m=nanoeval): 100%|████████████████████| 1/1 [07:23<00:00, 443.02s/it, corr=1, errs=0, fail=0]
{'accuracy': np.float64(1.0), 'aggregations': {'num_tasks': 1, 'num_attempts': 1, 'num_correct': 1, 'num_incorrect': 0, 'num_system_error': 0, 'error_breakdown': {}}, 'metrics_including_errors': {'accuracy': np.float64(1.0), 'aggregations': {'num_tasks': 1, 'num_attempts': 1, 'num_correct': 1, 'num_incorrect': 0, 'num_system_error': 0, 'error_breakdown': {}}}, 'missing_tasks': [], 'is_valid': True, 'length_stats': {'frac_correct': 1.0, 'frac_max_time': 0.0, 'frac_max_steps': 0.0, 'frac_max_tokens': 0.0, 'frac_model_ended': 0.0}}
(base) ddq@ddq-ROG-Strix-G18-G814JVR-G814JVR:~/SWELancer-Benchmark$ uv run python run_swelancer.py --issue_ids 3 
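
The `aggregations` block in this summary keeps `num_incorrect` and `num_system_error` as separate counters, which may help when reporting whether an issue failed or errored. A small sketch that formats them; `describe_run` is a hypothetical helper, and the meaning of each counter is inferred from its name, not from nanoeval's documentation:

```python
from typing import Any, Dict


def describe_run(aggregations: Dict[str, Any]) -> str:
    """Summarise the counters from the 'aggregations' block of a run summary."""
    tasks = aggregations["num_tasks"]
    correct = aggregations["num_correct"]
    incorrect = aggregations["num_incorrect"]
    system_errors = aggregations["num_system_error"]
    return (
        f"{correct}/{tasks} correct, {incorrect} incorrect, "
        f"{system_errors} system error(s)"
    )


# Counters copied from the summary printed for issue 3.
issue_3 = {
    "error_breakdown": {},
    "num_attempts": 1,
    "num_correct": 1,
    "num_incorrect": 0,
    "num_system_error": 0,
    "num_tasks": 1,
}
print(describe_run(issue_3))
```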

@Fangzhou66

Hi @BoxiYu, did you successfully run a manager task? I have tried many times and it still crashes...


6 participants