Parallel I/O Autotuning

Model-driven autotuning of MPI-IO hints and Lustre striping parameters for HPC applications. The project combines active learning with gradient-boosted and extra-trees regressors to discover high-throughput configurations for two widely used proxy applications: S3D-IO and BT-IO.

Why this project?

  • Manual exploration of the parallel I/O stack is slow and error-prone. We automate it by steering benchmarks with Bayesian optimisation.
  • Three complementary models shorten the tuning cycle: an active-learning loop (model 1), a fast extra-trees regressor for runtime prediction (model 2), and an XGBoost model for bandwidth prediction (model 3).
  • Results, plots, and the full final report are available in final/ (see final/code/final.png, final/report.pdf).

Repository layout

  • final/code/ – polished autotuning pipeline, notebooks, trained models, benchmark harness, and figures used in the final report.
  • final/readme – legacy notes kept for provenance.
  • progress/ – earlier experiments, notebooks, and scripts preserved for reference.
  • scripts/ – helper shell scripts that were used to sync data during development.

Key files inside final/code/:

  • active/ – Jupyter notebooks implementing the three models plus saved scalers and trained regressors.
  • S3D-IO/ & btio-pnetcdf-1.1.1/ – benchmark sources and PBS wrappers used to collect measurements.
  • read_config_general.py, bt_read_config_general.py – non-notebook runners that execute a benchmark using the parameters stored in confex.json (a sketch of that file's layout follows this list).
  • stats.txt, BTIOstats.txt, S3DIOstats.txt – consolidated results captured during active-learning runs.
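For orientation, here is a minimal sketch of writing one candidate configuration to confex.json from Python. The key names (Lustre striping and MPI-IO collective-buffering parameters) are assumptions chosen to match the tuned parameter space described above, not the exact schema used by the notebooks; check the notebooks for the real field names.

# Sketch: write one candidate configuration to confex.json.
# Key names are illustrative; the notebooks define the actual schema.
import json

candidate = {
    "striping_factor": 8,        # Lustre stripe count (assumed key name)
    "striping_unit": 1048576,    # Lustre stripe size in bytes (assumed)
    "cb_nodes": 4,               # MPI-IO collective-buffering nodes (assumed)
    "cb_buffer_size": 16777216,  # MPI-IO collective buffer size (assumed)
}

with open("confex.json", "w") as f:
    json.dump(candidate, f, indent=2)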

Prerequisites

  • Access to a cluster with a PBS-compatible scheduler (qsub) and the Lustre client tools (lfs).
  • PnetCDF installed and visible via PNETCDF_DIR.
  • Python 3.6.7 (the tested version), virtualenv, Jupyter, and the compiler toolchain required by the benchmarks.

Set up the Python environment:

python3 -m venv env
source env/bin/activate
pip install -r final/code/requirements.txt

Build the benchmarks (adjust PNETCDF_DIR if required):

export PNETCDF_DIR=/path/to/pnetcdf
cd final/code/S3D-IO
mkdir -p output
make
cd ../btio-pnetcdf-1.1.1
make
mkdir -p output

Running the autotuners

All notebooks expect the repository path to be assigned to project_dir (include the trailing /) and assume the benchmarks were built as above.

Model 1 – Active learning loop

  1. cd final/code/active
  2. jupyter notebook
  3. Open either S3D-IO active learning.ipynb or BTIO active learning.ipynb.
  4. Update project_dir in the second cell, adjust command-line arguments for your node/PPN/grid configuration, and execute the notebook (restart-and-run-all).
  5. The notebook iteratively updates confex.json, launches benchmark runs via read_config_general.py / bt_read_config_general.py, and appends measurements to stats.txt or BTIOstats.txt.
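Conceptually, each iteration follows the pattern sketched below: fit a surrogate on the measurements collected so far, pick a promising unmeasured configuration, run it, and repeat. This is an outline only; run_benchmark is a stand-in for the call into read_config_general.py / bt_read_config_general.py, the candidate grid is invented, and the greedy acquisition step is a simplification of the Bayesian-optimisation strategy used in the notebooks.

# Sketch of the active-learning pattern (not the notebook's exact implementation).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def run_benchmark(config):
    # Stand-in for writing confex.json and launching read_config_general.py;
    # returns a synthetic number so the sketch runs end to end.
    return float(np.sum(config))

candidates = np.random.randint(1, 32, size=(200, 4))  # hypothetical parameter grid
X, y = [], []                                          # measured configs and results

for _ in range(20):                                    # tuning budget
    if len(X) < 5:                                     # bootstrap with random samples
        config = candidates[np.random.randint(len(candidates))]
    else:
        model = ExtraTreesRegressor(n_estimators=200).fit(np.array(X), np.array(y))
        config = candidates[np.argmax(model.predict(candidates))]  # greedy acquisition
    X.append(config)
    y.append(run_benchmark(config))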

Model 2 – Extra Trees time predictor (S3D-IO)

  1. From the same notebook server, open predicting_time.ipynb and ensure project_dir is correct.
  2. Run the notebook to regenerate the Extra Trees model artefacts.
  3. Open predicted_model.ipynb, comment out cells 3–4 and enable cells 5–6 as instructed inside the notebook to load the time model.
  4. Update the os.chdir paths and confex.json locations before executing the notebook to obtain predicted optimal parameters.
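In outline, the time-prediction step trains an Extra Trees regressor on the collected measurements and persists the scaler and model, mirroring the *.save and *.sav artefacts kept in active/. The column names and file names below are assumptions for illustration, not the notebook's exact ones.

# Sketch: train an Extra Trees runtime predictor and persist the artefacts.
import joblib
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("stats.csv")                 # assumed CSV export of stats.txt
X = df.drop(columns=["time"]).values          # tuning parameters (assumed layout)
y = df["time"].values                         # measured runtime

scaler = StandardScaler().fit(X)
model = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(scaler.transform(X), y)

joblib.dump(scaler, "time_scaler.save")       # mirrors the *.save scaler dumps
joblib.dump(model, "time_model.sav")          # mirrors the *.sav model pickles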

Model 3 – XGBoost bandwidth predictor

  • S3D-IO: run predicting Write Bandwidth-XGB-BOOST.ipynb, then execute predicted_model.ipynb with cells 3–4 active and paths updated to your environment.
  • BT-IO: run predicting Write Bandwidth-XGB-BOOST-BTIO.ipynb, then execute predicted_model-BTIO.ipynb with the correct paths.
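The bandwidth model follows the same shape with an XGBoost regressor; again, the feature names and file names below are illustrative assumptions, not the notebooks' exact code.

# Sketch: train an XGBoost write-bandwidth predictor (names are illustrative).
import joblib
import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("stats.csv")                     # assumed export of S3DIOstats.txt
X = df.drop(columns=["write_bandwidth"]).values   # tuning parameters (assumed layout)
y = df["write_bandwidth"].values

model = XGBRegressor(n_estimators=400, max_depth=6, learning_rate=0.1)
model.fit(X, y)
joblib.dump(model, "bandwidth_model.sav")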

Both notebooks emit the best-performing configuration into confex.json, which you can immediately evaluate with the CLI runners:

cd final/code
python3 read_config_general.py -c "<nx ny nz npx npy npz restart>" -n <nodes> -p <ppn>
python3 bt_read_config_general.py -c "<grid points>" -n <nodes> -p <ppn>

The scripts submit a PBS job, wait for completion, parse the generated output, and append the measurements to the corresponding *stats.txt file.
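A simplified sketch of that submit-and-wait pattern is shown below, driving qsub and qstat through subprocess. The actual runners also generate the PBS script from confex.json and parse benchmark-specific output; job.pbs is a placeholder name.

# Sketch: submit a PBS job and poll until it leaves the queue.
import subprocess, time

job_id = subprocess.check_output(["qsub", "job.pbs"], universal_newlines=True).strip()

while True:
    # qstat returns a non-zero exit code once the job is no longer known to the scheduler.
    rc = subprocess.call(["qstat", job_id], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if rc != 0:
        break
    time.sleep(10)

print("Job %s finished; parse its output and append to the *stats.txt file" % job_id)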

Baselines and plotting utilities

  • default_S3D.py, bt_default.py, and default_run.sh reproduce baseline runs with stock MPI/Lustre settings.
  • Plotting helpers (default-best-plotscript.py, btio-default-best-plotscript.py, plotcombine.py) compare default throughput to tuned results. Generated figures are stored in plots/, bt_plots/, somemoreplots/, and summarised in final/code/final.png.
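A minimal example of the kind of default-vs-tuned comparison those helpers produce, using made-up bandwidth values purely for illustration:

# Sketch: bar chart comparing default vs tuned write bandwidth (values are illustrative only).
import matplotlib.pyplot as plt

configs = ["16 procs", "32 procs", "64 procs"]
default_bw = [450, 600, 720]    # MB/s, placeholder numbers
tuned_bw = [900, 1300, 1650]    # MB/s, placeholder numbers

x = range(len(configs))
plt.bar([i - 0.2 for i in x], default_bw, width=0.4, label="default")
plt.bar([i + 0.2 for i in x], tuned_bw, width=0.4, label="tuned")
plt.xticks(list(x), configs)
plt.ylabel("Write bandwidth (MB/s)")
plt.legend()
plt.savefig("default_vs_tuned.png")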

Logs and artefacts

  • app.log captures all benchmark submissions made through the Python wrappers.
  • active/result/gbm_trials-*.csv records every configuration explored by the Bayesian optimiser.
  • Intermediate CSVs, trained model pickles (*.sav), and scaler dumps (*.save) are kept in active/ for reproducibility.
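Reusing those artefacts outside the notebooks looks roughly like the sketch below; the file names and feature order are illustrative, so list active/ for the actual *.save and *.sav names.

# Sketch: load a persisted scaler and model for offline prediction.
import joblib

scaler = joblib.load("active/time_scaler.save")   # assumed file name
model = joblib.load("active/time_model.sav")      # assumed file name

config = [[8, 1048576, 4, 16777216]]              # one candidate, assumed feature order
print(model.predict(scaler.transform(config)))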

Tips & troubleshooting

  • If the benchmarks fail to build, hardcode PNETCDF_DIR inside the respective Makefiles.
  • Ensure Lustre striping commands (lfs setstripe) succeed; otherwise adjust permissions or run against a Lustre-backed directory (a quick check is sketched after this list).
  • When adding new parameters to the search space, update both the notebooks and the confex.json schema accordingly.
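A quick way to verify striping works from Python before launching a full run; the directory path and stripe settings are placeholders.

# Sketch: confirm lfs setstripe succeeds on the benchmark output directory.
import subprocess

result = subprocess.run(
    ["lfs", "setstripe", "-c", "8", "-S", "1m", "output"],   # 8 stripes, 1 MiB stripe size
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
)
if result.returncode != 0:
    print("lfs setstripe failed:", result.stderr.strip())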

Further reading

The final write-up (final/report.pdf) details the methodology, design choices, and performance gains achieved with this framework.
