
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench


This repo is a duplicate of https://github.com/CUHK-Shenzhen-SE/UTBoost with small modifications for usability.

👋 Overview

UTBoost is a toolkit designed to enhance the test suites in SWE-Bench, which may lack sufficient coverage, resulting in less rigorous evaluation of coding agents. By augmenting the original test cases with additional ones, UTBoost ensures that coding agents thoroughly resolve issues, beyond merely passing human-written tests.

🦜 Environment Setup

  1. Clone the repository:

    git clone --recurse-submodules https://github.com/uiuc-kang-lab/UTBoost.git
  2. Run cp .env.example .env and put your OpenAI API key in .env (a minimal .env sketch follows this list).

  3. Install uv if you have not installed it already:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  4. Install Python and the project dependencies with uv sync.
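
A minimal .env sketch, assuming OPENAI_API_KEY is the only variable you need to set (check .env.example for the authoritative list of variables):

# .env — copied from .env.example, with your own key filled in
OPENAI_API_KEY=sk-...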

💫 Generating test cases

We provide our generated test cases in assets/useful_scripts/dir_generated_test_cases.zip. The confirmed augmented test cases are in assets/useful_scripts/augTest.json.
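
If you just want a quick look at the confirmed augmented test cases, they can be loaded as plain JSON. A minimal sketch (the schema of augTest.json is not documented here, so inspect the keys after loading):

import json

# Load the confirmed augmented test cases (path relative to the repo root).
with open("assets/useful_scripts/augTest.json") as f:
    aug_tests = json.load(f)

# The schema is not documented in this README, so inspect the structure first.
print(type(aug_tests))
if isinstance(aug_tests, dict):
    print(list(aug_tests)[:5])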

To generate your own augmented test cases with UTGenerator, first locate the places where test cases should be added:

# --dataset_split verified: use SWE-bench Verified
# --dataset_slice :2: run only the first two samples
# --temperature 1: for reasoning models
uv run python -m UTGenerator.run_localization \
    --dataset_split verified \
    --dataset_slice :2 \
    --output_folder results/test_localization \
    --file_level \
    --related_level \
    --fine_grain_line_level \
    --top_n 3 \
    --compress \
    --context_window 10 \
    --temperature 1 \
    --num_sample 4 \
    --model o3
uv run python -m UTGenerator.run_localization \
    --merge \
    --output_folder results/test_merge \
    --start_file results/test_localization/loc_outputs.jsonl \
    --num_samples 4
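
To sanity-check the localization step before generating tests, you can read the JSONL outputs directly. A minimal sketch (the per-record field names are not documented here, so print a record to see the actual schema):

import json

# Each line of the localization output is one JSON record.
with open("results/test_localization/loc_outputs.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} localization records")
if records:
    # Field names are not documented here; list them for the first record.
    print(sorted(records[0].keys()))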

Then we can run the test case generation script:

uv run python -m UTGenerator.run_testgen \
    --loc_file results/test_merge/loc_merged_0-1_outputs.jsonl \
    --output_folder results/test_gen/new_gen_testCase_t099_lm01 \
    --loc_interval --top_n=3 --context_window=10 \
    --max_samples 2  --cot --diff_format \
    --gen_and_process 

🤗 SWE-Bench re-evaluation with UTBoost data on Hugging Face

You can use UTBoost in whatever pipeline you prefer. We have uploaded the UTBoost datasets to Hugging Face; you can access the data as follows:

from datasets import load_dataset
swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')
# for verified split, use uiuc-kang-lab/SWE-bench-Verified-UTBoost
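
Each row is a SWE-Bench-style task instance. A small sketch for peeking at one (the column names are assumed to mirror the upstream SWE-Bench datasets, e.g. instance_id and problem_statement):

from datasets import load_dataset

swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')

# Columns are assumed to follow the upstream SWE-Bench schema.
print(swebench.column_names)
row = swebench[0]
print(row['instance_id'])
print(row['problem_statement'][:300])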

To evaluate your coding agent on UTBoost, replace the original SWE-Bench dataset with the UTBoost one. Here is an example of using SWE-bench-Lite-UTBoost in the SWE-Bench evaluation pipeline (https://github.com/SWE-bench):

python -m swebench.harness.run_evaluation \
    --dataset_name uiuc-kang-lab/SWE-bench-Lite-UTBoost \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
    # use --predictions_path 'gold' to verify the gold patches
    # use --run_id to name the evaluation run
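
The predictions file is expected to follow the upstream SWE-bench harness format: one record per instance with instance_id, model_name_or_path, and model_patch (check the harness documentation if these keys have changed). A hypothetical sketch, with the agent name and patch contents as placeholders and an instance_id taken from the report example later in this README:

import json

# Hypothetical predictions written in the SWE-bench harness format.
predictions = [
    {
        "instance_id": "pydata__xarray-7393",       # example instance_id
        "model_name_or_path": "my-coding-agent",     # placeholder agent name
        "model_patch": "diff --git a/... b/...\n",   # unified diff produced by your agent
    },
]

with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")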

If you want to inspect an instance in more detail, we recommend building the Docker container for the corresponding instance_id. We extracted the setup scripts; you can find them in assets/useful_scripts/my_class_list_lite.pkl and assets/useful_scripts/my_class_list_verified.pkl.
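
The .pkl files can be opened with the standard pickle module. A minimal sketch (the object layout is not documented here; if custom classes were pickled, load the file from within this repo so their definitions are importable):

import pickle

# Load the extracted setup scripts for the Lite split.
with open("assets/useful_scripts/my_class_list_lite.pkl", "rb") as f:
    setup_entries = pickle.load(f)

# The structure is not documented in this README; start by checking type and size.
print(type(setup_entries))
if hasattr(setup_entries, "__len__"):
    print(len(setup_entries))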

🖊️ Getting the annotation with the refined parser

The SWE-Bench annotation data has some labeling errors due to defects in the original parser (see, for example, SWE-bench/SWE-bench#314). We think the annotations should be updated to ensure rigorous evaluation. You can find our refined parser here: update_SWE_Bench/log_parsers.py.

We re-ran the SWE-Bench data collection to regenerate the annotations; see update_SWE_Bench/updated_parser_test_instance_dict_verified.json and update_SWE_Bench/updated_parser_test_instance_dict.json.
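
To compare the refreshed annotations against the originals, load the JSON dictionaries directly. A minimal sketch (the dictionaries are assumed to be keyed by instance ID; verify this after loading):

import json

# Annotations regenerated with the refined log parser (Verified split).
with open("update_SWE_Bench/updated_parser_test_instance_dict_verified.json") as f:
    verified_annotations = json.load(f)

print(f"{len(verified_annotations)} entries")
# Print a few keys to confirm how the dictionary is indexed (likely instance IDs).
print(list(verified_annotations)[:3])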

✍️ Report suspicious issues with intramorphic testing

  1. Run intramorphicTesting.py to get the report of suspicious issues (example data is provided in log4test).
  2. Check whether UTGenerator generates effective and harmless test cases, and add them to SWE-Bench after confirmation.
  3. For example, we can examine the instance and its test cases when the gold patch passed and the generated patch failed, given the following report (a script for extracting these difference lines is sketched after the report):
Report for pydata__xarray-7393 is different between gold: log4test/gold-366 and model: log4test/20231010_rag_swellama7b
here is the differences:
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[int32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[float32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes, gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_empty_multiindex, gold: PASSED, gen: FAILED
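
To triage these reports programmatically, the "There is a difference in test cases" lines can be extracted with a short script. A sketch assuming the plain-text report format shown above (report.txt is a placeholder path for a saved report):

import re

# Extract (test_id, gold_status, gen_status) triples from a saved report.
pattern = re.compile(
    r"There is a difference in test cases: (\S+), gold: (\w+), gen: (\w+)"
)

with open("report.txt") as f:  # placeholder path
    for line in f:
        match = pattern.search(line)
        if match:
            test_id, gold_status, gen_status = match.groups()
            print(f"{test_id}: gold={gold_status}, gen={gen_status}")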

📝 Citation

MIT license. Check LICENSE.md.

If you find our work helpful, please use the following citation.

@article{yu2025utboost,
  title={UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench},
  author={Yu, Boxi and Zhu, Yuxuan and He, Pinjia and Kang, Daniel},
  journal={arXiv preprint arXiv:2506.09289},
  year={2025}
}

📰 QA

For any questions, feel free to open a GitHub Issue or email me at [email protected].

😻 Acknowledgement
