This repo is a duplicate of https://github.com/CUHK-Shenzhen-SE/UTBoost with small modifications for usability.
UTBoost is a toolkit designed to enhance the test suites in SWE-Bench, which may lack sufficient coverage, resulting in less rigorous evaluation of coding agents. By augmenting the original test cases with additional ones, UTBoost ensures that coding agents thoroughly resolve issues, beyond merely passing human-written tests.
- Clone the repo:

  ```bash
  git clone --recurse-submodules https://github.com/uiuc-kang-lab/UTBoost.git
  ```

- Copy the example environment file and put your OpenAI API key in `.env`:

  ```bash
  cp .env.example .env
  ```

- Install `uv` if you have not installed it before:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Install Python and the packages:

  ```bash
  uv sync
  ```
We provide our generated test cases in `assets/useful_scripts/dir_generated_test_cases.zip`. The file `assets/useful_scripts/augTest.json` contains our confirmed augmented test cases.
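If you just want to peek at the confirmed augmented test cases, a minimal sketch like the one below can help. It assumes `augTest.json` is plain JSON; the field layout is not documented here, so inspect the loaded object before relying on any keys.

```python
import json

# A sketch only: assumes augTest.json is plain JSON; inspect the object
# before relying on any particular keys.
with open("assets/useful_scripts/augTest.json") as f:
    aug_tests = json.load(f)

print(type(aug_tests))
if isinstance(aug_tests, dict):
    print(list(aug_tests)[:5])  # peek at a few top-level keys
```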
To generate your own augmented test cases with UTGenerator, first locate the places to add test cases:
```bash
# --dataset_split verified : use SWE-bench Verified
# --dataset_slice :2       : only the first two samples
# --temperature 1          : recommended for reasoning models
uv run python -m UTGenerator.run_localization \
    --dataset_split verified \
    --dataset_slice :2 \
    --output_folder results/test_localization \
    --file_level \
    --related_level \
    --fine_grain_line_level \
    --top_n 3 \
    --compress \
    --context_window 10 \
    --temperature 1 \
    --num_sample 4 \
    --model o3
```
Then merge the localization outputs:

```bash
uv run python -m UTGenerator.run_localization \
    --merge \
    --output_folder results/test_merge \
    --start_file results/test_localization/loc_outputs.jsonl \
    --num_samples 4
```
Then we can run the test case generation script:
```bash
uv run python -m UTGenerator.run_testgen \
    --loc_file results/test_merge/loc_merged_0-1_outputs.jsonl \
    --output_folder results/test_gen/new_gen_testCase_t099_lm01 \
    --loc_interval --top_n=3 --context_window=10 \
    --max_samples 2 --cot --diff_format \
    --gen_and_process
```
You can use UTBoost in whatever way you prefer. We have uploaded the UTBoost datasets to Hugging Face, so you can load the data directly:
```python
from datasets import load_dataset

swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')
# For the Verified split, use 'uiuc-kang-lab/SWE-bench-Verified-UTBoost'.
```
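To sanity-check the loaded split, you can inspect a sample. The sketch below assumes the dataset follows the usual SWE-bench schema (the `instance_id` field is an assumption; check the printed column names first):

```python
from datasets import load_dataset

swebench = load_dataset('uiuc-kang-lab/SWE-bench-Lite-UTBoost', split='test')

# The column names come from the dataset itself, so this line is safe as-is.
print(swebench.column_names)

# 'instance_id' is assumed from the usual SWE-bench schema; adjust the field
# name if the printed columns differ.
print(swebench[0].get('instance_id'))
```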
To evaluate your coding agent on UTBoost, replace the SWE-Bench dataset with the UTBoost one. Here is an example of using SWE-bench-Lite-UTBoost in the SWE-Bench evaluation pipeline (https://github.com/SWE-bench):
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name uiuc-kang-lab/SWE-bench-Lite-UTBoost \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>
# Use --predictions_path 'gold' to verify the gold patches.
# Use --run_id to name the evaluation run.
```
If you want to inspect more details, we suggest creating the Docker container for the corresponding `instance_id`. We have extracted the setup scripts; you can find them in `assets/useful_scripts/my_class_list_lite.pkl` and `assets/useful_scripts/my_class_list_verified.pkl`.
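To look at the extracted setup scripts, you can unpickle the files. This is only a sketch: it assumes the files are standard pickles whose classes (if any) are importable from this repo, and the object layout inside is not documented here.

```python
import pickle

# A sketch only: assumes a standard pickle file. Inspect the loaded object
# to see how the setup scripts are organized per instance.
with open("assets/useful_scripts/my_class_list_lite.pkl", "rb") as f:
    setup_data = pickle.load(f)

print(type(setup_data))
```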
The SWE-Bench annotation data has some labeling errors due to defects in the original parser (for example, SWE-bench/SWE-bench#314). We believe the annotations should be updated to ensure rigorous evaluation. You can find our refined parser in `update_SWE_Bench/log_parsers.py`.
We re-ran the SWE-Bench data collection to gather the updated annotations; please check `update_SWE_Bench/updated_parser_test_instance_dict_verified.json` and `update_SWE_Bench/updated_parser_test_instance_dict.json`.
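If you want to browse the refreshed annotations, the sketch below loads both JSON dumps. It assumes each file is a JSON object keyed by instance id; verify this against your copy of the files.

```python
import json

# Assumption: each file is a JSON object keyed by instance id.
with open("update_SWE_Bench/updated_parser_test_instance_dict_verified.json") as f:
    verified = json.load(f)
with open("update_SWE_Bench/updated_parser_test_instance_dict.json") as f:
    lite = json.load(f)

print(len(verified), len(lite))
print(list(verified)[:3])  # peek at a few entries
```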
- Run `intramorphicTesting.py` to get the report of suspicious issues (we provide example data in `log4test`).
- Then check whether UTGenerator generates effective and harmless code, and add it to SWE-Bench after confirmation.
- For example, we can examine the instance and the test cases where the gold patch passed and the generated patch failed, given the following report:
```
Report for pydata__xarray-7393 is different between gold: log4test/gold-366 and model: log4test/20231010_rag_swellama7b
here is the differences:
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[int32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_restore_dtype_on_multiindexes[float32], gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes, gold: PASSED, gen: FAILED
There is a difference in test cases: xarray/tests/test_indexes.py::test_empty_multiindex, gold: PASSED, gen: FAILED
```
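The core of this check is a per-test comparison of statuses between the gold run and the model run. Here is an illustrative sketch: the status dicts are hypothetical and do not reflect the actual output format of `intramorphicTesting.py`, which parses the logs for you.

```python
# Illustrative only: these dicts are hypothetical, not the actual
# output format produced by intramorphicTesting.py.
gold = {
    "xarray/tests/test_indexes.py::test_empty_multiindex": "PASSED",
    "xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes": "PASSED",
}
gen = {
    "xarray/tests/test_indexes.py::test_empty_multiindex": "FAILED",
    "xarray/tests/test_indexes.py::test_multiindex_with_various_dtypes": "PASSED",
}

# Report every test whose outcome differs between the gold patch and the
# generated patch -- these are the suspicious cases worth inspecting manually.
for test, gold_status in gold.items():
    gen_status = gen.get(test, "MISSING")
    if gold_status != gen_status:
        print(f"Difference in test case: {test}, gold: {gold_status}, gen: {gen_status}")
```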
MIT license. See `LICENSE.md`.
If you find our work helpful, please use the following citation.
```bibtex
@article{yu2025utboost,
  title={UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench},
  author={Yu, Boxi and Zhu, Yuxuan and He, Pinjia and Kang, Daniel},
  journal={arXiv preprint arXiv:2506.09289},
  year={2025}
}
```
For any questions, feel free to open a GitHub Issue or email me at [email protected].