Model-driven autotuning of MPI-IO hints and Lustre striping parameters for HPC applications. The project combines active learning with gradient-boosted and extra-trees regressors to discover high-throughput configurations for two widely used proxy applications: S3D-IO and BT-IO.
- Manual exploration of the parallel I/O stack is slow and error-prone. We automate it by steering benchmarks with Bayesian optimisation.
- Three complementary models shorten the tuning cycle: an active-learning loop (model 1), a fast extra-trees regressor for runtime prediction (model 2), and an XGBoost model for bandwidth prediction (model 3).
- Results, plots, and the full final report are available in `final/` (see `final/code/final.png` and `final/report.pdf`).
- `final/code/` – polished autotuning pipeline, notebooks, trained models, benchmark harness, and figures used in the final report.
- `final/readme` – legacy notes kept for provenance.
- `progress/` – earlier experiments, notebooks, and scripts preserved for reference.
- `scripts/` – helper shell scripts used to sync data during development.
Key files inside `final/code/`:
- `active/` – Jupyter notebooks implementing the three models, plus saved scalers and trained regressors.
- `S3D-IO/` & `btio-pnetcdf-1.1.1/` – benchmark sources and PBS wrappers used to collect measurements.
- `read_config_general.py`, `bt_read_config_general.py` – non-notebook runners that execute a benchmark using the parameters stored in `confex.json` (an illustrative example follows this list).
- `stats.txt`, `BTIOstats.txt`, `S3DIOstats.txt` – consolidated results captured during active-learning runs.
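The exact schema of `confex.json` is defined by the notebooks. Purely as an illustration, a candidate configuration combining MPI-IO hints with Lustre striping values might look like the following sketch (the key names and values here are hypothetical, not the repository's actual schema):

```python
import json

# Hypothetical keys -- the real schema is defined by the notebooks in active/.
candidate = {
    "striping_factor": 8,        # Lustre stripe count
    "striping_unit": 1048576,    # Lustre stripe size in bytes
    "cb_nodes": 4,               # MPI-IO collective-buffering aggregators
    "cb_buffer_size": 16777216,  # MPI-IO collective buffer size in bytes
}

with open("confex.json", "w") as f:
    json.dump(candidate, f, indent=2)
```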
- Access to a cluster with a PBS-compatible scheduler (`qsub`) and Lustre (`lfs`) commands (see the sanity check after this list).
- PnetCDF installed and visible via `PNETCDF_DIR`.
- Python 3.6.7 (tested version), `virtualenv`, Jupyter, and the compiler toolchain required by the benchmarks.
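Before starting, a quick sanity check along these lines (a hypothetical helper, not part of the repository) confirms the tools are reachable:

```python
import os
import shutil

# Confirm the PBS and Lustre command-line tools are on PATH.
for tool in ("qsub", "lfs"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")

# Confirm PnetCDF is discoverable by the benchmark Makefiles.
print("PNETCDF_DIR:", os.environ.get("PNETCDF_DIR", "NOT SET"))
```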
Set up the Python environment:
```bash
python3 -m venv env
source env/bin/activate
pip install -r final/code/requirements.txt
```

Build the benchmarks (adjust `PNETCDF_DIR` if required):
```bash
export PNETCDF_DIR=/path/to/pnetcdf
cd final/code/S3D-IO
mkdir -p output
make
cd ../btio-pnetcdf-1.1.1
make
mkdir -p output
```

All notebooks expect the repository path to be assigned to `project_dir` (include the trailing `/`) and assume the benchmarks were built as above.
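For example, the relevant notebook cell might be edited along these lines (the path is a placeholder for your own checkout):

```python
# Point project_dir at the root of this repository; keep the trailing slash,
# since the notebooks concatenate sub-paths onto it.
project_dir = "/home/username/io-autotuning/"  # placeholder path
```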
Launch a Jupyter notebook server:

```bash
cd final/code/active
jupyter notebook
```

- Open either `S3D-IO active learning.ipynb` or `BTIO active learning.ipynb`.
- Update `project_dir` in the second cell, adjust the command-line arguments for your node/PPN/grid configuration, and execute the notebook (restart-and-run-all).
- The notebook iteratively updates `confex.json`, launches benchmark runs via `read_config_general.py` / `bt_read_config_general.py`, and appends measurements to `stats.txt` or `BTIOstats.txt` (a simplified sketch of one iteration follows this list).
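Conceptually, each active-learning iteration behaves roughly as follows (a simplified sketch, not the notebooks' actual code; the surrogate model and acquisition logic live in the notebooks, and the paths and benchmark arguments below are illustrative):

```python
import json
import subprocess

def evaluate_candidate(params, nodes=4, ppn=16):
    """Write one candidate configuration and measure it on the cluster."""
    # 1. Persist the configuration proposed by the optimiser.
    with open("confex.json", "w") as f:
        json.dump(params, f, indent=2)

    # 2. Launch the benchmark through the CLI runner (S3D-IO shown here); the
    #    runner submits a PBS job, waits, and appends the result to stats.txt.
    subprocess.check_call(
        ["python3", "../read_config_general.py",
         "-c", "200 200 200 2 2 2 0",  # illustrative <nx ny nz npx npy npz restart>
         "-n", str(nodes), "-p", str(ppn)]
    )

# In the notebook, the Bayesian optimiser proposes `params`, evaluate_candidate()
# runs them, and the new measurement is fed back to refit the surrogate model.
```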
- From the same notebook server, open `predicting_time.ipynb` and ensure `project_dir` is correct.
- Run the notebook to regenerate the Extra Trees model artefacts (outlined below).
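In outline, regenerating the runtime model amounts to something like this (a sketch assuming the stats file is pandas-readable and that scikit-learn/joblib are used; the column and artefact names are placeholders):

```python
import joblib
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler

# Placeholder column names; the real feature set is defined in the notebook.
df = pd.read_csv("stats.txt")
X = df.drop(columns=["runtime"])
y = df["runtime"]

scaler = StandardScaler().fit(X)
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(scaler.transform(X), y)

# The notebooks persist artefacts as *.sav (models) and *.save (scalers).
joblib.dump(model, "time_model.sav")
joblib.dump(scaler, "scaler.save")
```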
- Open `predicted_model.ipynb`, comment out cells 3–4 and enable cells 5–6 as instructed inside the notebook to load the time model.
- Update the `os.chdir` paths and `confex.json` locations before executing the notebook to obtain the predicted optimal parameters (the gist of this step is sketched below).
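The essence of this prediction step is to reload the saved model and scaler and score candidate configurations, keeping the one with the best prediction (hypothetical artefact names and an arbitrary three-parameter search space shown here):

```python
import itertools

import joblib
import numpy as np

# Hypothetical artefact names -- use the files written by predicting_time.ipynb.
model = joblib.load("time_model.sav")
scaler = joblib.load("scaler.save")

# Arbitrary illustrative grid over (stripe_count, stripe_size_MB, cb_nodes).
candidates = np.array(list(itertools.product([1, 2, 4, 8], [1, 4, 16], [1, 2, 4])))

predicted_runtime = model.predict(scaler.transform(candidates))
best = candidates[np.argmin(predicted_runtime)]  # lower runtime is better
print("Predicted best configuration:", best)
```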
For bandwidth prediction:

- S3D-IO: run `predicting Write Bandwidth-XGB-BOOST.ipynb`, then execute `predicted_model.ipynb` with cells 3–4 active and paths updated to your environment.
- BT-IO: run `predicting Write Bandwidth-XGB-BOOST-BTIO.ipynb`, then execute `predicted_model-BTIO.ipynb` with the correct paths (see the outline after this list).
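The bandwidth notebooks follow the same pattern with an XGBoost regressor; in outline (placeholder file and column names, and note that bandwidth is maximised rather than minimised):

```python
import pandas as pd
import xgboost as xgb

# Placeholder schema; the notebooks define the real feature and target columns.
df = pd.read_csv("S3DIOstats.txt")
X = df.drop(columns=["write_bandwidth"])
y = df["write_bandwidth"]

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X, y)

# Downstream, predicted_model*.ipynb picks the candidate with the largest
# predicted bandwidth instead of the smallest predicted runtime.
```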
Both notebooks emit the best-performing configuration into `confex.json`, which you can immediately evaluate with the CLI runners:

```bash
cd final/code
python3 read_config_general.py -c "<nx ny nz npx npy npz restart>" -n <nodes> -p <ppn>
python3 bt_read_config_general.py -c "<grid points>" -n <nodes> -p <ppn>
```

The scripts submit a PBS job, wait for completion, parse the generated output, and append the measurements to the corresponding `*stats.txt` file.
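Internally the runners follow a submit-and-poll pattern that can be sketched like this (simplified; the actual scripts generate a benchmark-specific PBS script and parse benchmark-specific output):

```python
import subprocess
import time

def submit_and_wait(pbs_script):
    """Submit a PBS job and block until it leaves the queue."""
    # qsub prints the job identifier on stdout, e.g. "12345.pbsserver".
    job_id = subprocess.check_output(["qsub", pbs_script]).decode().strip()

    # Poll qstat until the job is no longer listed (simplified: some PBS
    # variants keep completed jobs visible for a short grace period).
    while subprocess.call(["qstat", job_id],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL) == 0:
        time.sleep(30)

    return job_id
```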
- `default_S3D.py`, `bt_default.py`, and `default_run.sh` reproduce baseline runs with stock MPI/Lustre settings.
- Plotting helpers (`default-best-plotscript.py`, `btio-default-best-plotscript.py`, `plotcombine.py`) compare default throughput to tuned results (a minimal example follows this list). Generated figures are stored in `plots/`, `bt_plots/`, `somemoreplots/`, and summarised in `final/code/final.png`.
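A minimal version of such a comparison figure, with purely illustrative numbers (matplotlib assumed):

```python
import matplotlib.pyplot as plt

# Purely illustrative values -- real numbers come from the *stats.txt files.
labels = ["default", "tuned"]
bandwidth_mbs = [850.0, 2100.0]

plt.bar(labels, bandwidth_mbs)
plt.ylabel("Write bandwidth (MB/s)")
plt.title("Default vs tuned configuration (illustrative)")
plt.savefig("default_vs_tuned.png", dpi=150)
```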
- `app.log` captures all benchmark submissions made through the Python wrappers.
- `active/result/gbm_trials-*.csv` records every configuration explored by the Bayesian optimiser.
- Intermediate CSVs, trained model pickles (`*.sav`), and scaler dumps (`*.save`) are kept in `active/` for reproducibility.
- If the benchmarks fail to build, hardcode `PNETCDF_DIR` inside the respective Makefiles.
- Ensure Lustre striping commands (`lfs setstripe`) succeed; otherwise adjust permissions or run against a Lustre-backed directory (a quick check is sketched after this list).
- When adding new parameters to the search space, update both the notebooks and the `confex.json` schema accordingly.
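To confirm striping works on your target directory, a quick check along these lines can help (`lfs setstripe -c/-S` is standard Lustre usage; the stripe values and path are arbitrary examples):

```python
import subprocess

target = "final/code/S3D-IO/output"  # any Lustre-backed directory

# Request 4 stripes of 1 MiB each; failure here usually means the directory
# is not on Lustre or you lack permission to change its layout.
subprocess.check_call(["lfs", "setstripe", "-c", "4", "-S", "1m", target])
subprocess.check_call(["lfs", "getstripe", target])
```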
The final write-up (`final/report.pdf`) details the methodology, design choices, and performance gains achieved with this framework.