[bugfix] update results of state_dict loading, embedding resizing to secondary partitions (hpz) #7130

Open · wants to merge 5 commits into master

Conversation

@cyr0930 commented Mar 11, 2025

After this commit (#4906), secondary partitioned tensors are updated only after optimizer.step().

When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should also be updated.

e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
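For context, a minimal sketch (not from this PR) of the kind of in-place update the linked transformers code performs under ZeRO-3; the helper name and the `loaded_weight` argument are placeholders:

```python
# Illustrative sketch only: an in-place update of a ZeRO-3 partitioned parameter,
# roughly the pattern transformers uses when loading weights or resizing embeddings
# after deepspeed.zero.Init. Names here are placeholders, not from this PR.
import torch
import deepspeed


def overwrite_partitioned_param(param, loaded_weight):
    # Gather the full parameter, modify it on rank 0, and let DeepSpeed
    # re-partition it when the context exits.
    with deepspeed.zero.GatheredParameters([param], modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            param.data.copy_(loaded_weight)
    # Under hpz, param.ds_secondary_tensor may still hold the old values at this
    # point, because (per this PR) it is only refreshed around optimizer.step().
```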

@tjruwase (Contributor)

@cyr0930, thanks for this PR. Can you provide more context about the failure this fixes? For example, did you encounter convergence issues after checkpoint loading?

@cyr0930 (Author) commented Mar 12, 2025

Since this commit (#4906), partitioned parameters are updated only when ds_secondary_partition_tensor is None.
ds_secondary_partition_tensors only become None after optimizer.step() is called (that function contains logic that invalidates the secondary tensors).
But there are other cases where partitioned parameters should be updated, such as loading a state_dict or resizing embeddings.
(The explanation above is based on the transformers implementation.)

Currently, after parameter initialization, ds_secondary_partition_tensors are created and exist for each param, so the parameters are not updated when we perform state_dict loading or embedding resizing.
Maybe we could add a function that invalidates ds_secondary_partition and call it after every parameter change.
But IMO that is quite messy, so I decided to revert the has_been_updated part of the previous commit.
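For illustration, a rough sketch of the invalidation-helper alternative mentioned above; the function name is hypothetical and clearing the attribute this way is not existing DeepSpeed API:

```python
# Hypothetical sketch of the rejected alternative: an explicit helper that every
# code path mutating parameters would have to remember to call. Not DeepSpeed API.
import torch


def invalidate_secondary_partitions(module: "torch.nn.Module"):
    for param in module.parameters():
        # Dropping the cached secondary tensor would force it to be rebuilt from
        # the (freshly updated) primary partition on its next use.
        if getattr(param, "ds_secondary_tensor", None) is not None:
            param.ds_secondary_tensor = None
```

The PR instead takes the other route described above, reverting the has_been_updated guard so that re-partitioning itself refreshes the secondary tensor.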

@tjruwase (Contributor) commented Apr 9, 2025

> Currently, after parameter initialization, ds_secondary_partition_tensors are created and exist for each param, so the parameters are not

@cyr0930, apologies for the delay on this. My understanding is that the ds_secondary_partition_tensors are created by the forward pass, correct? But I expect state_dict loading or embedding resizing to happen before the forward pass. Can you please clarify? Thanks

@cyr0930 (Author) commented Apr 10, 2025

I'm not sure because this logic is a bit complicated, but IMO:

While HfArgumentParser.parse_args_into_dataclasses is executed, DeepSpeed ZeRO-3 is enabled here (https://github.com/huggingface/transformers/blob/v4.51.2/src/transformers/training_args.py#L2046).

And while the model is loaded via the from_pretrained method, the deepspeed.zero.Init context is introduced here (https://github.com/huggingface/transformers/blob/v4.51.2/src/transformers/modeling_utils.py#L3727).

This context wraps the initialization of modules, so parameters are converted to DeepSpeed parameters during initialization here (https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L1107).

That's what I've found from debugging so far.
Thanks for the comment, though.
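For reference, a minimal sketch of that initialization path, assuming an illustrative ZeRO-3 config with hpz enabled via zero_hpz_partition_size (the config values and module are placeholders, not from the PR):

```python
# Illustrative sketch of the flow described above: inside deepspeed.zero.Init,
# parameters are converted to partitioned DeepSpeed (ZeRO-3) parameters at module
# construction time, i.e. before any forward pass. Config values are placeholders.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,  # hierarchical partitioning (hpz)
    },
}

with deepspeed.zero.Init(config_dict_or_path=ds_config):
    # Parameters allocated here are immediately partitioned; per the repro below,
    # ds_secondary_tensor can already be attached at this point.
    layer = torch.nn.Linear(1024, 1024)
```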

@tjruwase (Contributor)

@cyr0930, thanks for sharing this information and for debugging. As this is a critical part of hpz, can you please share your repro steps with me so I can try it on my side?

@cyr0930 (Author) commented Apr 14, 2025

This is the minimal reproduction I could put together.
Running this code with the command at the bottom reproduces the issue I encountered :)

deepspeed_init.py

```python
from transformers import AutoModel
from transformers.hf_argparser import HfArgumentParser
from transformers.training_args import TrainingArguments
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# Parsing TrainingArguments (with a DeepSpeed ZeRO-3 config) enables ZeRO-3 as a side effect.
parser = HfArgumentParser((TrainingArguments,))
training_args = parser.parse_args_into_dataclasses()
assert is_deepspeed_zero3_enabled()

# from_pretrained runs under deepspeed.zero.Init; the assert shows that
# ds_secondary_tensor is already attached right after loading, before any forward pass.
model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B")
assert hasattr(model.embed_tokens.weight, "ds_secondary_tensor")
```

Run with:

```
accelerate launch --use_deepspeed --zero_stage 3 deepspeed_init.py
```
