Move ESM2 scripts to sub-packages (#406)
Move and refactor ESM2 scripts into a package
farhadrgh authored Nov 13, 2024
1 parent cd4f48a commit 2c21102
Showing 14 changed files with 711 additions and 641 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -188,8 +188,8 @@ export MY_DATA_SOURCE="pbss"
# The fastest transformer engine environment variables in testing were the following two
TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
ESM2_650M_CKPT=$(download_bionemo_data esm2/650m:2.0 --source $MY_DATA_SOURCE); \
-python \
-scripts/protein/esm2/esm2_pretrain.py \
+
+train_esm2 \
--train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
--train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db \
--valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
5 changes: 3 additions & 2 deletions docs/docs/user-guide/examples/bionemo-esm2/pretrain.md
@@ -277,12 +277,13 @@ llm.train(
)
```

-Or simply call `esm2_pretrain.py` directly.
+Or simply use the ESM2 pretrain located in `$WORKDIR/sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py`. This script can be called either by directly using python or the installed executable `train_esm2`:

```bash
# Enable fused attention in transformer engine for speed-up
DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source ngc)

-python scripts/protein/esm2/esm2_pretrain.py \
+train_esm2 \
--train-cluster-path ${DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
--train-database-path ${DATA_DIR}/2024_03_sanity/train_sanity.db \
--valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
3 changes: 1 addition & 2 deletions docs/docs/user-guide/getting-started/development.md
@@ -60,8 +60,7 @@ The process for pretraining models from BioNeMo involves running scripts located
exposes a Command-Line Interface (CLI) that contains and documents the options available for that model.

To pretrain a model, you need to run the corresponding script with the required parameters. For example, to pretrain the
-ESM-2 model, you would run the `esm2_pretrain.py` script located in `scripts/protein/esm2`. Similarly, to pretrain the
-Geneformer model, you would run the `train.py` script located in `scripts/singlecell/geneformer`.
+ESM-2 and Geneformer models, you would call `train_esm2` and `train_geneformer` executables, respectively.

The scripts provide various options that can be customized for pretraining, such as:

7 changes: 1 addition & 6 deletions docs/docs/user-guide/getting-started/index.md
@@ -92,12 +92,7 @@ $ tree -C -I "*.pyc" -I "test_data" -I "test_experiment" -I "test_finettune_expe
│ ├── gpt-pretrain.py
│ ├── protein
│ │ └── esm2
-│ │ ├── esm2_pretrain.py
-│ │ └── test_esm2_pretrain.py
-│ └── singlecell
-│ └── geneformer
-│ ├── test_train.py
-│ └── train.py
+│ └── esm2_dataset_perplexity.py
# 🟢 All work goes into `sub-packages`
# Sub-packages represent individually installable subsets of the bionemo codebase. We recommend that you
# create new sub-packages to track your experiments and save any updated models or utilities that you need.