Implement gradient checkpointing in GPTBigCode #41818
Conversation
Support for gradient checkpointing was lost in the major refactoring in PR huggingface#38635 and this PR attempts to re-add it.

I extended the tests to
- test `use_reentrant=True` and `False`
- make sure `model.train` is called so that gradient checkpointing works; this is a limitation of the tests currently used by GPTBigCode
- make sure that one (the first) gradient checkpointing layer is called
- make sure that the same non-zero grads are present for the normal and the checkpointing run - this is something we tripped over before in PEFT due to the possibly incompletely stored runtime environment in the checkpointed forward step, see also peft#2826

Note that the invocation of `GPTBigCodeBlock.forward` has changed:
- `layer_past` is now passed as a keyword argument so that `GradientCheckpointingLayer.__call__` can see and filter this parameter (`use_reentrant=False` fails otherwise)
- `{encoder_}hidden_states` are still passed as positional arguments so that `torch.utils.checkpoint.checkpoint` receives them as positional args and computes gradients for these (kwargs would be filtered by `GradientCheckpointingLayer`); see the sketch below
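To make the two points above concrete, here is a simplified, hypothetical sketch of the described call pattern (not the exact code from the diff; the keyword/positional status of the remaining arguments is an assumption):

```python
# Illustrative sketch of the changed call pattern (not the exact diff):
# tensors that need gradients stay positional so torch.utils.checkpoint.checkpoint
# tracks them; layer_past becomes a keyword argument so GradientCheckpointingLayer
# can see and filter it when checkpointing is active.
outputs = block(
    hidden_states,                    # positional -> gradients flow through checkpoint()
    encoder_hidden_states,            # positional -> gradients flow through checkpoint()
    layer_past=layer_past,            # keyword -> can be filtered by GradientCheckpointingLayer
    attention_mask=attention_mask,    # remaining arguments as keywords (assumed here)
    head_mask=head_mask[i],
    encoder_attention_mask=encoder_attention_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
)
```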
[For maintainers] Suggested jobs to run (before merge): `run-slow: gpt_bigcode`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The tests are neat, though I think we should move them to the common tests. Not exactly sure why it was specially treated here.
And I guess there will be a need for another round to check similar models where the checkpointing layer may have been accidentally overridden 😓 not necessarily in this PR though.
```diff
+    encoder_hidden_states: Optional[torch.Tensor] = None,
     layer_past: Optional[Cache] = None,
     attention_mask: Optional[torch.Tensor] = None,
-    encoder_hidden_states: Optional[torch.Tensor] = None,
```
Let's not change the order here, we could break things for users. Rather change the args/kwargs positions on the module call if necessary.
I'm not sure that this is possible. It is mandatory that we pass `layer_past` as a keyword argument, otherwise `GradientCheckpointingLayer` will not be able to remove it from the kwargs in case of gradient checkpointing. On the other hand, every input that may require gradients (`hidden_states`, `encoder_hidden_states`) must be passed as a positional argument for `checkpoint()` to work. Maybe I'm missing something, but I don't think we can bring those together without moving `encoder_hidden_states` up in the list.
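The constraint on `checkpoint()` can be illustrated with a toy example (plain PyTorch, not transformers code; the names are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

def toy_block(hidden_states, encoder_hidden_states):
    # Stand-in for a transformer block that consumes both tensors.
    return hidden_states * 2 + encoder_hidden_states

hidden_states = torch.randn(2, 4, requires_grad=True)
encoder_hidden_states = torch.randn(2, 4, requires_grad=True)

# Both tensors are passed positionally, so checkpoint() records them as inputs,
# recomputes the forward during backward(), and produces gradients for both.
out = checkpoint(toy_block, hidden_states, encoder_hidden_states, use_reentrant=False)
out.sum().backward()
assert hidden_states.grad is not None and encoder_hidden_states.grad is not None
```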
I mean that the signature should stay the same, e.g. see transformers/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py, lines 586 to 596 in 84d19be:
```python
def forward(
    self,
    hidden_states: Optional[tuple[torch.Tensor]],
    layer_past: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    head_mask: Optional[torch.Tensor] = None,
    encoder_hidden_states: Optional[torch.Tensor] = None,
    encoder_attention_mask: Optional[torch.Tensor] = None,
    use_cache: Optional[bool] = False,
    output_attentions: Optional[bool] = False,
    **kwargs,
```
The calls from the module above will need to be adjusted, like transformers/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py, lines 901 to 910 in 84d19be:
```python
outputs = block(
    hidden_states,
    layer_past,
    attention_mask,
    head_mask[i],
    encoder_hidden_states,  # as a positional argument for gradient checkpointing
    encoder_attention_mask=encoder_attention_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
)
```
Changing the signature is breaking a bit too much!
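For illustration only, one reading of this suggestion (keep the signature and adjust only the call site) is sketched below; it is hypothetical, not code from this PR, and note the author's objection above that `encoder_hidden_states` passed as a keyword would be filtered before `checkpoint()` and thus not receive gradients:

```python
# Hypothetical call adjustment that keeps the original forward signature
# (not code from this PR): only hidden_states stays positional, everything
# else is passed by keyword so GradientCheckpointingLayer can filter the cache.
outputs = block(
    hidden_states,
    layer_past=layer_past,
    attention_mask=attention_mask,
    head_mask=head_mask[i],
    encoder_hidden_states=encoder_hidden_states,  # as a kwarg this conflicts with the gradient requirement above
    encoder_attention_mask=encoder_attention_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
)
```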
```python
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))

    def create_and_check_forward_and_backwards(
        self, config, input_ids, input_mask, token_type_ids, *args, gradient_checkpointing=False
```
I'm a bit surprised that it was overridden here. It would be nicer if we could move this into `test_modeling_common` instead.
Agreed, but I'm not sure how to deal with the fact that not all models use a `GradientCheckpointingLayer`; some call the checkpoint function themselves. Do you have a suggestion for how to deal with that?
We can check if the layer exists somewhere, no? If we do not detect that it exists, raise an error and check which models fail --> all models should already have this, but bigger PRs apparently clashed so it isn't the case anymore.
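A minimal sketch of what such a common-test check could look like (hypothetical helper; assumes `GradientCheckpointingLayer` is importable from `transformers.modeling_layers`):

```python
from transformers.modeling_layers import GradientCheckpointingLayer

def check_has_gradient_checkpointing_layer(model):
    # Hypothetical helper: fail loudly if a model that claims to support
    # gradient checkpointing contains no GradientCheckpointingLayer module,
    # e.g. because it still calls torch.utils.checkpoint manually.
    if not any(isinstance(m, GradientCheckpointingLayer) for m in model.modules()):
        raise AssertionError(
            f"{model.__class__.__name__} supports gradient checkpointing but defines "
            "no GradientCheckpointingLayer."
        )
```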
```python
for _, p in trainable_params:
    p.grad = None

checkpointing_layer = next(m for m in model.modules() if isinstance(m, GradientCheckpointingLayer))
```
Ah ok, so we are bound to the new gradient checkpointing then. Guess there will be a need to check that all models use this properly.
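For context, a simplified sketch of how the detected layer can be verified to actually run during a checkpointed training step (illustrative only, not the exact test code; `input_ids` is a placeholder input):

```python
# Illustrative check that the first GradientCheckpointingLayer is exercised
# during a checkpointed forward/backward pass (not the exact test code).
calls = []
hook = checkpointing_layer.register_forward_hook(lambda module, inputs, output: calls.append(True))

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model.train()  # gradient checkpointing is only active in training mode
loss = model(input_ids, labels=input_ids).loss
loss.backward()

hook.remove()
assert calls, "the first GradientCheckpointingLayer was never called"
```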
```python
result.loss.backward()

non_zero_grads_normal = {n for n, p in trainable_params if p.grad.abs().sum() > 0}
assert non_zero_grads_normal
```
Let's use
```python
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
```
instead of normal asserts. Depends on whether we move the test too, I guess.
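If the test stays here, a sketch of the grad-comparison checks in that style could look like this (`non_zero_grads_checkpointing` is a hypothetical name for the set collected in the checkpointed run):

```python
# Hypothetical rewrite of the bare asserts in the suggested unittest style:
non_zero_grads_normal = {n for n, p in trainable_params if p.grad.abs().sum() > 0}
self.parent.assertTrue(len(non_zero_grads_normal) > 0)

# ... run the gradient-checkpointed forward/backward, collect grads the same way, then compare:
self.parent.assertEqual(non_zero_grads_normal, non_zero_grads_checkpointing)
```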