This repo is a duplicate of https://github.com/CUHK-Shenzhen-SE/UTBoost with small modifications for usability.
UTBoost is a toolkit designed to enhance the test suites in SWE-Bench, which may lack sufficient coverage, resulting in less rigorous evaluation of coding agents. By augmenting the original test cases with additional ones, UTBoost ensures that coding agents thoroughly resolve issues, beyond merely passing human-written tests.
- Clone the repo:

  ```bash
  git clone --recurse-submodules https://github.com/uiuc-kang-lab/UTBoost.git
  ```

- Copy the example environment file and put your OpenAI API key in `.env`:

  ```bash
  cp .env.example .env
  ```

- Install `uv` if you have not installed it before:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Install Python and the packages:

  ```bash
  uv sync
  ```
We provide our generated test cases in `assets/useful_scripts/dir_generated_test_cases.zip`. The file `assets/useful_scripts/augTest.json` contains our confirmed augmented test cases.
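If you just want to peek at the confirmed augmented test cases, a minimal sketch like the one below can help. It assumes `augTest.json` is plain JSON; the field layout is not documented here, so inspect the loaded object before relying on any keys.

```python
import json

# A sketch only: assumes augTest.json is plain JSON; inspect the object
# before relying on any particular keys.
with open("assets/useful_scripts/augTest.json") as f:
    aug_tests = json.load(f)

print(type(aug_tests))
if isinstance(aug_tests, dict):
    print(list(aug_tests)[:5])  # peek at a few top-level keys
```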
To generate your own augmented test cases with UTGenerator, first locate the places to add test cases:
```bash
# --dataset_split verified : use SWE-bench Verified
# --dataset_slice :2       : only the first two samples
# --temperature 1          : recommended for reasoning models
uv run python -m UTGenerator.run_localization \
    --dataset_split verified \
    --dataset_slice :2 \
    --output_folder results/test_localization \
    --file_level \
    --related_level \
    --fine_grain_line_level \
    --top_n 3 \
    --compress \
    --context_window 10 \
    --temperature 1 \
    --num_sample 4 \
    --model o3
```
Then merge the localization outputs:

```bash
uv run python -m UTGenerator.run_localization \
    --merge \
    --output_folder results/test_merge \
    --start_file results/test_localization/loc_outputs.jsonl \
    --num_samples 4
```
Then we can run the test case generation script:
```bash
uv run python -m UTGenerator.run_testgen \
    --loc_file results/test_merge/loc_merged_0-1_outputs.jsonl \
    --output_folder results/test_gen/new_gen_testCase_t099_lm01 \
    --loc_interval --top_n=3 --context_window=10 \
    --max_samples 2 --cot --diff_format \
    --gen_and_process
```
You can use UTBoost in whatever way you prefer. We have uploaded the UTBoost datasets to Hugging Face, so you can load the data directly:
```python
from datasets import load_dataset

swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')
# For the Verified split, use 'uiuc-kang-lab/SWE-bench-Verified-UTBoost'.
```
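To sanity-check the loaded split, you can inspect a sample. The sketch below assumes the dataset follows the usual SWE-bench schema (the `instance_id` field is an assumption; check the printed column names first):

```python
from datasets import load_dataset

swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')

# The column names come from the dataset itself, so this line is safe as-is.
print(swebench.column_names)

# 'instance_id' is assumed from the usual SWE-bench schema; adjust the field
# name if the printed columns differ.
print(swebench[0].get('instance_id'))
```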
To evaluate your coding agent on UTBoost, replace the SWE-Bench dataset with the UTBoost one. Here is an example of using SWE-bench-Lite-UTBoost in the SWE-Bench evaluation pipeline (https://github.com/SWE-bench):
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name uiuc-kang-lab/SWE-bench-Lite-UTBoost \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
# Use --predictions_path 'gold' to verify the gold patches.
# Use --run_id to name the evaluation run.
```
If you want to inspect more details, we suggest creating the Docker container for the corresponding `instance_id`. We have extracted the setup scripts; you can find them in `assets/useful_scripts/my_class_list_lite.pkl` and `assets/useful_scripts/my_class_list_verified.pkl`.
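To look at the extracted setup scripts, you can unpickle the files. This is only a sketch: it assumes the files are standard pickles whose classes (if any) are importable from this repo, and the object layout inside is not documented here.

```python
import pickle

# A sketch only: assumes a standard pickle file. Inspect the loaded object
# to see how the setup scripts are organized per instance.
with open("assets/useful_scripts/my_class_list_lite.pkl", "rb") as f:
    setup_data = pickle.load(f)

print(type(setup_data))
```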
The SWE-Bench annotation data has some labeling errors due to defects in the original parser (for example, SWE-bench/SWE-bench#314). We believe the annotations should be updated to ensure rigorous evaluation. You can find our refined parser in `update_SWE_Bench/log_parsers.py`.
We re-ran the SWE-Bench data collection to gather the updated annotations; please check `update_SWE_Bench/updated_parser_test_instance_dict_verified.json` and `update_SWE_Bench/updated_parser_test_instance_dict.json`.
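If you want to browse the refreshed annotations, the sketch below loads both JSON dumps. It assumes each file is a JSON object keyed by instance id; verify this against your copy of the files.

```python
import json

# Assumption: each file is a JSON object keyed by instance id.
with open("update_SWE_Bench/updated_parser_test_instance_dict_verified.json") as f:
    verified = json.load(f)
with open("update_SWE_Bench/updated_parser_test_instance_dict.json") as f:
    lite = json.load(f)

print(len(verified), len(lite))
print(list(verified)[:3])  # peek at a few entries
```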
- Run `intramorphicTesting.py` to get the report of suspicious issues (we provide example data in `log4test`).
- Then check whether UTGenerator generates effective and harmless code, and add it to SWE-Bench after confirmation.
- For example, we can examine the instance and the test cases where the gold patch passed and the generated patch failed, given the following report:
```
Report for pydata__xarray-7393 is different between gold: log4test/gold-366 and model: log4test/20231010_rag_swellama7b
here is the differences:
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[int32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[float32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes, gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_empty_multiindex, gold: PASSED, gen: FAILED
```
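The core of this check is a per-test comparison of statuses between the gold run and the model run. Here is an illustrative sketch: the status dicts are hypothetical and do not reflect the actual output format of `intramorphicTesting.py`, which parses the logs for you.

```python
# Illustrative only: these dicts are hypothetical, not the actual
# output format produced by intramorphicTesting.py.
gold = {
    "xarray/tests/test_indexes.py::test_empty_multiindex": "PASSED",
    "xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes": "PASSED",
}
gen = {
    "xarray/tests/test_indexes.py::test_empty_multiindex": "FAILED",
    "xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes": "PASSED",
}

# Report every test whose outcome differs between the gold patch and the
# generated patch -- these are the suspicious cases worth inspecting manually.
for test, gold_status in gold.items():
    gen_status = gen.get(test, "MISSING")
    if gold_status != gen_status:
        print(f"Difference in test case: {test}, gold: {gold_status}, gen: {gen_status}")
```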
MIT license. See `LICENSE.md`.
If you find our work helpful, please use the following citation.
```bibtex
@article{yu2025utboost,
  title={UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench},
  author={Yu, Boxi and Zhu, Yuxuan and He, Pinjia and Kang, Daniel},
  journal={arXiv preprint arXiv:2506.09289},
  year={2025}
}
```
For any questions, feel free to open a GitHub Issue or email me at [email protected].