NeMo1 -> NeMo2 checkpoint conversion #180
Open
jstjohn wants to merge 26 commits into main from jstjohn/nemo1-checkpoint-connector
Conversation
akoumpa reviewed Sep 27, 2024
sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/connector.py (outdated; resolved)
Does this work for you? I don't think the state-dict loading line actually replaces the meta tensor; I think the contents are filled in.

    import torch
    import torch.nn as nn

    # Define a module with parameters initialized as meta tensors
    class MyModule(nn.Module):
        def __init__(self):
            super(MyModule, self).__init__()
            self.linear = nn.Linear(10, 10, device='meta')  # Meta device

    # Instantiate the module
    module = MyModule()

    # Print the current parameter device (meta)
    print(module.linear.weight.device)  # Should output "meta"

    # Create a state_dict with actual tensors
    state_dict = {
        'linear.weight': torch.randn(10, 10),
        'linear.bias': torch.randn(10),
    }

    # Load the state_dict into the module
    module.load_state_dict(state_dict)

    # Check that the parameters have been replaced with actual tensors
    print(module.linear.weight.device)  # Should output "cpu" or the device of the tensors in state_dict

This prints meta device. And there's a warning (yikes!), not an error!

On Sep 26, 2024, at 7:27 PM, Alexandros Koumparoulis wrote:
@akoumpa commented on this pull request.
In sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/connector.py:
+ def is_te_mapping(model: BioBertLightningModule) -> bool:
+ """Check for TE layers, for now infer this from the config."""
+ return model.config.biobert_spec_option in {
+ BiobertSpecOption.bert_layer_with_transformer_engine_spec,
+ BiobertSpecOption.bert_layer_with_transformer_engine_and_qk_ln_spec,
+ }
+
+ def convert_state(self, source: Dict[str, torch.Tensor], target: BioBertLightningModule) -> BioBertLightningModule:
+ """Convert the input state_dict keys from nemo1 biobert to nemo2 biobert."""
+ te_mapping = self.is_te_mapping(target) # check for TE layers.
+ target.module.cpu()
+ new_state_dict_from_old = {}
+ for k, v in source.items():
+ new_key = nemo1_to_nemo2_biobert_key_mapping(k, new_model_prefix="", te_mapping=te_mapping)
+ new_state_dict_from_old[new_key] = v
+ target.module.load_state_dict(new_state_dict_from_old, strict=not te_mapping)
@jstjohn I would add here something like the following
meta_tensors = list(filter(lambda x: isinstance(x[1], torch.Tensor) and x[1].device.type == 'meta', target.module.state_dict().items()))
assert len(meta_tensors) == 0, meta_tensors
This should print all the tensors that have meta device. The assumption here was that the input state_dict (in this case new_state_dict_from_old) contains all the parameters needed by target.module.
Please let me know if that works.
The other option would be to add a parameter to nemo_setup to initialize on CPU instead of meta, but I want to avoid this: for large models (e.g. 100B parameters) it takes too long, and it's not useful since the initialized parameters will be overwritten with the ones from the checkpoint.
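For reference, a minimal self-contained sketch of the behavior under discussion, assuming PyTorch >= 2.1 (this is an illustration, not a change made in this PR): passing assign=True to load_state_dict replaces meta-device parameters with the tensors from the state dict instead of copying into them in place, after which the suggested no-meta-tensors assertion passes.

    import torch
    import torch.nn as nn

    # A module whose parameters start out on the meta device, as after nemo_setup.
    module = nn.Linear(10, 10, device="meta")

    state_dict = {"weight": torch.randn(10, 10), "bias": torch.randn(10)}

    # assign=True (PyTorch >= 2.1) swaps the meta parameters for the real tensors.
    module.load_state_dict(state_dict, assign=True)

    # The check suggested above now passes: no meta tensors remain.
    meta_tensors = [name for name, t in module.state_dict().items() if t.device.type == "meta"]
    assert len(meta_tensors) == 0, meta_tensors
    print(module.weight.device)  # cpu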
Haha, ChatGPT wrote that code and was all like "this will work". I asked it to run it; it printed meta device. Anyway, I manually tried running this in a Colab instance since I'm AFK, and I can confirm that it just prints a warning and doesn't actually fill in the tensors.
pstjohn reviewed Oct 1, 2024
Yes, it's newer than TOT. It's on the commit I'm working on, on the NeMo side, to fix the checkpoint conversion stuff.

On Sep 30, 2024, at 7:01 PM, Peter St. John wrote:
@pstjohn commented on this pull request.
On 3rdparty/NeMo:
Is this on TOT? Let's just not downgrade if the dependabot updates bring us to a more recent commit
pstjohn reviewed Oct 1, 2024
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py (outdated; resolved)
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py (outdated; resolved)
sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/connector.py (outdated; resolved)
sub-packages/bionemo-llm/src/bionemo/llm/model/biobert/connector.py (outdated; resolved)
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py (outdated; resolved)
jstjohn commented Oct 4, 2024
Comment on lines +26 to +37
"""Usage: | ||
# ESM2 3B | ||
## ESM2 3b checkpoint conversion: | ||
python scripts/protein/esm2/make_nemo2_checkpoints.py --s3-path s3://bionemo-ci/models/esm2nv_3B_converted.nemo --output-path ~/.cache/bionemo/checkpoints/esm2_3B_nemo2 | ||
## ESM2 3b checkpoint upload (recursive since it is a directory) | ||
aws s3 cp --recursive ~/.cache/bionemo/checkpoints/esm2_3B_nemo2 s3://bionemo-ci/models/esm2_3B_nemo2 | ||
# ESM2 650M | ||
## ESM2 650M checkpoint conversion | ||
python scripts/protein/esm2/make_nemo2_checkpoints.py --s3-path s3://bionemo-ci/models/esm2nv_650M_converted.nemo --output-path ~/.cache/bionemo/checkpoints/esm2_650M_nemo2 | ||
## ESM2 650M checkpoint upload | ||
aws s3 cp --recursive ~/.cache/bionemo/checkpoints/esm2_650M_nemo2 s3://bionemo-ci/models/esm2_650M_nemo2 | ||
""" |
TODO: add the steps to create the .tar.gz files, for example:

    cd esm2_650M_nemo2 && tar czvf ../esm2_650M_nemo2.tar.gz *
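If it helps when writing those steps up, here is an equivalent of that command using Python's tarfile module; the paths are the ones from the usage docstring above and are purely illustrative.

    import tarfile
    from pathlib import Path

    # Illustrative paths, mirroring `cd esm2_650M_nemo2 && tar czvf ../esm2_650M_nemo2.tar.gz *`
    ckpt_dir = Path.home() / ".cache/bionemo/checkpoints/esm2_650M_nemo2"
    archive = ckpt_dir.parent / "esm2_650M_nemo2.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Add the directory contents at the archive root (not nested under the
        # directory name), matching the `*` in the shell invocation.
        for item in ckpt_dir.iterdir():
            tar.add(item, arcname=item.name)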
Summary
NeMo1 to NeMo2 checkpoint conversion
Details
Usage
This can be used either in an interactive session or placed in a script to do a one-off conversion of a specific checkpoint in NeMo1 format to a checkpoint in NeMo2 format.
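As a rough illustration of what the connector does internally, here is a minimal sketch of the key-remapping step. The import path for nemo1_to_nemo2_biobert_key_mapping and the model_weights.ckpt file name (extracted from the NeMo1 .nemo tar archive) are assumptions for the sake of the example, not verbatim from this PR.

    import torch

    # Assumed import path for the key-mapping helper used by the connector in this PR.
    from bionemo.llm.model.biobert.connector import nemo1_to_nemo2_biobert_key_mapping

    # Hypothetical path: the weights file extracted from the NeMo1 .nemo tar archive.
    old_state_dict = torch.load("model_weights.ckpt", map_location="cpu")

    # Remap NeMo1 BioBERT keys to their NeMo2 names; te_mapping=True when the target
    # config uses a Transformer Engine layer spec (see is_te_mapping in the connector).
    new_state_dict = {
        nemo1_to_nemo2_biobert_key_mapping(key, new_model_prefix="", te_mapping=True): value
        for key, value in old_state_dict.items()
    }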
Testing
The test checks that the checkpoint can be converted by this function, and that pointing a model at the new NeMo2 checkpoint works as expected when resuming for fine-tuning.
Tests for these changes can be run via: