Move ESM2 scripts to sub-packages (#406)

Move and refactor ESM2 scripts into a package
NVIDIA · Nov 13, 2024 · 2c21102
1 parent cd4f48a
commit 2c21102

Showing 14 changed files with 711 additions and 641 deletions.
diff --git a/README.md b/README.md
@@ -188,8 +188,8 @@ export MY_DATA_SOURCE="pbss"
 # The fastest transformer engine environment variables in testing were the following two
 TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
 ESM2_650M_CKPT=$(download_bionemo_data esm2/650m:2.0 --source $MY_DATA_SOURCE); \
-python  \
-    scripts/protein/esm2/esm2_pretrain.py     \
+
+train_esm2     \
     --train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet     \
     --train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db     \
     --valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet     \

diff --git a/docs/docs/user-guide/examples/bionemo-esm2/pretrain.md b/docs/docs/user-guide/examples/bionemo-esm2/pretrain.md
@@ -277,12 +277,13 @@ llm.train(
 )
 ```
 
-Or simply call `esm2_pretrain.py` directly.
+Or simply use the ESM2 pretrain located in `$WORKDIR/sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py`. This script can be called either by directly using python or the installed executable `train_esm2`:
+
 ```bash
 # Enable fused attention in transformer engine for speed-up
 DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source ngc)
 
-python scripts/protein/esm2/esm2_pretrain.py \
+train_esm2 \
     --train-cluster-path ${DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
     --train-database-path ${DATA_DIR}/2024_03_sanity/train_sanity.db \
     --valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \

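Note on the hunk above: the updated docs say the script can be run either with python directly or through the installed `train_esm2` executable. A minimal sketch of how one module can support both is shown below. It is an illustration only, not the code from this commit: the `build_parser` and `main` names are hypothetical, the packaging entry in the comment is an assumption about how such an executable is commonly registered, and only the three flags visible in this diff are modeled (left optional here so the sketch runs without data).

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags that appear in the diff above."""
    parser = argparse.ArgumentParser(
        prog="train_esm2",
        description="Sketch of a training CLI (hypothetical, not the real script).",
    )
    # Optional in this sketch so it can run without any data on disk.
    parser.add_argument("--train-cluster-path", type=Path, default=None)
    parser.add_argument("--train-database-path", type=Path, default=None)
    parser.add_argument("--valid-cluster-path", type=Path, default=None)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse arguments; a real script would start training from here.

    When argv is None, argparse reads sys.argv[1:], which is what happens
    for both `python this_file.py ...` and an installed executable.
    """
    args = build_parser().parse_args(argv)
    return args


# Assumption: an installed executable is usually registered in packaging
# metadata with an entry of the form
#   <executable name> = "<module path>:<function name>"
# The actual entry used by this commit is not shown in this excerpt.

if __name__ == "__main__":
    # Direct invocation path: `python this_file.py --train-cluster-path ...`
    print(main())
```

With this layout, both invocation styles reach the same `main` function, so the documented flags behave identically whichever way the script is launched.
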
diff --git a/docs/docs/user-guide/getting-started/development.md b/docs/docs/user-guide/getting-started/development.md
@@ -60,8 +60,7 @@ The process for pretraining models from BioNeMo involves running scripts located
 exposes a Command-Line Interface (CLI) that contains and documents the options available for that model.
 
 To pretrain a model, you need to run the corresponding script with the required parameters. For example, to pretrain the
-ESM-2 model, you would run the `esm2_pretrain.py` script located in `scripts/protein/esm2`. Similarly, to pretrain the
-Geneformer model, you would run the `train.py` script located in `scripts/singlecell/geneformer`.
+ESM-2 and Geneformer models, you would call `train_esm2` and `train_geneformer` executables, respectively.
 
 The scripts provide various options that can be customized for pretraining, such as:
 

diff --git a/docs/docs/user-guide/getting-started/index.md b/docs/docs/user-guide/getting-started/index.md
@@ -92,12 +92,7 @@ $ tree -C -I "*.pyc" -I "test_data" -I "test_experiment" -I "test_finettune_expe
 │   ├── gpt-pretrain.py
 │   ├── protein
 │   │   └── esm2
-│   │       ├── esm2_pretrain.py
-│   │       └── test_esm2_pretrain.py
-│   └── singlecell
-│       └── geneformer
-│           ├── test_train.py
-│           └── train.py
+│           └── esm2_dataset_perplexity.py
 # 🟢 All work goes into `sub-packages`
 #  Sub-packages represent individually installable subsets of the bionemo codebase. We recommend that you
 #  create new sub-packages to track your experiments and save any updated models or utilities that you need.