This repository contains the full dataset and evaluation benchmark introduced in our OSDI'25 paper:
"Principles and Methodologies for Serial Performance Optimization (OSDI' 25)"
Large language models (LLMs) hold promise as assistants for system performance optimization, yet their evaluation in this domain remains underexplored. This repository provides:
- A curated dataset of performance optimization problems and observations, derived from 10 years of SOSP/OSDI papers
- A taxonomy-grounded benchmark to assess LLMs' ability to suggest concrete, actionable system optimizations
- Scripts to evaluate models on their ability to recover real-world optimization strategies
.
├── dataset/
│   ├── dataset.xlsx   # Full training + test data (see below)
│   ├── example_3      # Few-shot prompt examples (N = 3)
│   ├── example_5      # Few-shot prompt examples (N = 5)
│   └── example_10     # Few-shot prompt examples (N = 10)
│
├── eval.py            # Evaluation script (e.g., precision/recall)
├── run_test.sh        # Script to reproduce Figure 7
└── README.md
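
As a rough illustration of the metrics `eval.py` reports, the sketch below computes set-based precision and recall between a model's suggested methodologies and the labeled ones. The function name, data layout, and methodology strings are assumptions for illustration only, not the script's actual interface.

```python
# Illustrative sketch only: set-based precision/recall over methodology labels.
# This is NOT the actual implementation or interface of eval.py.

def precision_recall(predicted: set[str], labeled: set[str]) -> tuple[float, float]:
    """Compare a model's suggested methodologies against a paper's labels."""
    if not predicted or not labeled:
        return 0.0, 0.0
    hits = len(predicted & labeled)       # methodologies the model recovered
    precision = hits / len(predicted)     # fraction of suggestions that are correct
    recall = hits / len(labeled)          # fraction of ground-truth labels recovered
    return precision, recall

# Example: the model suggests two methodologies, one of which matches the labels.
p, r = precision_recall({"batching", "caching"}, {"batching", "precomputation"})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.50
```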
`dataset/dataset.xlsx` contains two sheets:
- Sheet 1: Training dataset distilled from 10 years of OSDI/SOSP papers (2013–2022).
- Sheet 2: Test dataset of 96 OSDI/SOSP papers published in 2024.
- Each entry includes a problem statement, system observations, and labeled methodologies.
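
A minimal way to load both splits with pandas is sketched below; the sheet ordering is taken from the description above, but the exact sheet and column names are assumptions, so verify them against `dataset.xlsx` itself.

```python
# Sketch: load the training and test splits from dataset.xlsx.
# Sheet ordering follows the README; column names are assumptions.
import pandas as pd

sheets = pd.read_excel("dataset/dataset.xlsx", sheet_name=None)  # dict of all sheets
train_df, test_df = list(sheets.values())[:2]  # Sheet 1 = train (2013-2022), Sheet 2 = test (2024)

print(len(train_df), "training entries;", len(test_df), "test entries")
print(train_df.columns.tolist())  # e.g., problem statement, observations, methodologies
```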
If you use this dataset or benchmark, please cite:
```
@inproceedings{park:sysgpt,
title = {{Principles and Methodologies for Serial Performance Optimization}},
author = {Sujin Park and Mingyu Guan and Xiang Cheng and Taesoo Kim},
booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
month = jul,
year = 2025,
}
```