Conversation

@gante (Member) commented Sep 4, 2025

What does this PR do?

(Carved from #40553, which is becoming messy)

This PR:

  • Updates test_past_key_values_format to support things like GQA or skipped kv cache layers. As a result, we can remove some overwrites/skips 💛
  • Fixes a bug in get_text_config: for legacy models, we were not respecting the attribute_map (i.e. the remapping of config attribute names, which we need to handle carefully). With this PR we do, and this bugfix allows us to remove some of the test skips (see the sketch after this list).
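For context, a minimal sketch of the remapping being fixed. LegacyConfig and its values are hypothetical, but BART-style configs follow this pattern:

# Hypothetical legacy config: the class-level attribute_map says the
# canonical name "num_attention_heads" is stored as "encoder_attention_heads".
class LegacyConfig:
    attribute_map = {"num_attention_heads": "encoder_attention_heads"}

    def __init__(self):
        self.encoder_attention_heads = 16

config = LegacyConfig()

# get_text_config copies attributes onto the returned config. Before this PR
# it wrote the canonical key directly; after this PR it first resolves the
# key through attribute_map so the write lands on the stored attribute:
new_key, value = "num_attention_heads", 32
if new_key in config.attribute_map:
    new_key = config.attribute_map[new_key]  # -> "encoder_attention_heads"
setattr(config, new_key, value)

assert config.encoder_attention_heads == 32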

@@ -728,10 +727,6 @@ def test_training_gradient_checkpointing_use_reentrant(self):
def test_training_gradient_checkpointing_use_reentrant_false(self):
pass

@is_flaky(max_attempts=5, description="Flaky for some input configurations.")
@gante (Member, Author):
(double-checked with flake-finder -- it is no longer flaky)

github-actions bot commented Sep 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: dia, gemma3n, got_ocr2, speecht5, t5gemma

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

Great clean-up, just had a question to clarify about decoder text configs. TBH I didn't know we were using it for flat-structured configs.

Comment on lines +1258 to +1262
# Does the class map the new key into a different attribute name at read time? if so, let's write
# into that attribute instead
if new_key in config_to_return.attribute_map:
new_key = config_to_return.attribute_map[new_key]

@zucchini-nlp (Member):
Not sure I got this. So if we map the new key back through the attribute map, in models like BART we will do num_attention_heads -> encoder_attention_heads. That doesn't look quite right if we asked for a decoder config.
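For reference, the relevant map in BartConfig looks roughly like this:

# Attribute map from transformers' BartConfig: reads and writes to the
# left-hand names are redirected to the right-hand attributes, so even a
# decoder config resolves num_attention_heads to the *encoder* attribute.
attribute_map = {
    "num_attention_heads": "encoder_attention_heads",
    "hidden_size": "d_model",
}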

@gante (Member, Author) commented Sep 5, 2025:
It does not look right indeed 😢 But attribute_map is a class-level attribute, so we can't update it for the new configuration either (i.e. for the config instance returned by get_text_config).

Note that these encoder/decoder attributes in attribute_map come from old models, and that these inconsistencies only show up if someone decides to print the internal variables 👀

This means we are limited to two options, to maintain BC:

  1. [This PR] We use the same mapping all over the code (e.g. config.get_text_config(decoder=True).num_attention_heads to get the number of attention heads in the decoder), but accept that some old configs will have an odd representation because of their attribute map;
  2. [main] Have several if/else scattered across our codebase, like
num_decoder_layers = (
    getattr(config, "decoder_layers", None)  # flat configs case 1
    or getattr(config, "num_decoder_layers", None)  # flat configs case 2
    or decoder_config.num_hidden_layers  # modern default for decoders
)

(If we double down on direction 2, we need to add more if/else cases; our current logic is not robust across all tests.)

Option 1 seems much more reliable in the long run, and it also nudges everyone into using the same names everywhere (as opposed to relying on attribute maps) 🤗
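Concretely, the single access pattern option 1 standardizes on (an illustrative fragment; config stands for any PretrainedConfig instance):

# One canonical pattern everywhere; attribute_map (when a legacy config
# has one) resolves legacy names under the hood.
decoder_config = config.get_text_config(decoder=True)
num_decoder_layers = decoder_config.num_hidden_layers
num_decoder_heads = decoder_config.num_attention_heads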

@gante (Member, Author) commented Sep 5, 2025:
Alternatively, we may be able to update the attribute_map logic to read/write into the target variable, as opposed to mapping the read/writes 👀

Example:
If we have the {"a": "b"} mapping, atm all reads of config.a actually read config.b, without checking whether a exists in the config. Same for writes.

We could instead make reads of config.a read config.a first and, if it doesn't exist, fall back to config.b. All writes would write into config.a.
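A toy sketch of the two behaviours (hypothetical classes, not the actual PretrainedConfig implementation):

MAPPING = {"a": "b"}

class CurrentSemantics:
    # Today: "a" is a pure alias; reads and writes are unconditionally
    # redirected to "b".
    def __init__(self):
        self.b = 1

    def __getattr__(self, key):  # only called when normal lookup fails
        if key in MAPPING:
            return getattr(self, MAPPING[key])
        raise AttributeError(key)

    def __setattr__(self, key, value):
        super().__setattr__(MAPPING.get(key, key), value)

class ProposedSemantics:
    # Proposal: prefer a real "a" when it exists, fall back to "b" on
    # reads; writes land on "a" itself (default __setattr__).
    def __init__(self):
        self.b = 1

    def __getattr__(self, key):  # reached only while "a" was never written
        if key in MAPPING:
            return getattr(self, MAPPING[key])
        raise AttributeError(key)

cur, new = CurrentSemantics(), ProposedSemantics()
cur.a = 2  # silently rewrites b -> two names, one storage slot
new.a = 2  # creates a real a; b keeps its old value -> two storage slots
assert cur.b == 2 and new.a == 2 and new.b == 1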

WDYT?

@zucchini-nlp (Member):
> We could instead make reads of config.a read config.a first and, if it doesn't exist, fall back to config.b. All writes would write into config.a.

This sounds interesting, and slightly breaking, because we will end up with two keys for the same concept. It might raise questions such as which value is correct when inspecting a config visually or serializing it. For example, we might end up with both image_token_id and image_token_index in some VLMs.

Coming back to "Option 1": I see we always check the attribute mapping now. TBH I was expecting get_text_config() to return a different config only when the config structure is nested; otherwise the whole config is a text config and has no other modalities.

In this case I think the current approach is the best we can do, because it helps reduce LOC and is not very breaking. We can ignore the weird naming, as no one would serialize/print the text config, I hope. Let's keep it as is; I also have another option below, feel free to ignore it if it doesn't work.

I looked through the attribute maps in the repo, and they always map to the encoder when an encoder-decoder architecture is used. We could gradually deprecate this pattern and nudge users to fetch the value explicitly with config.encoder_attention_heads. We would need to use consistent naming in encoder-decoder models and promote it for future models. Though this option will take a long time to deprecate, maybe even until v5 🙃

@gante (Member, Author):
@zucchini-nlp

What I'm reading is "let's go with this PR, and try to nudge users away from attribute_map". Is this correct? :)

@zucchini-nlp (Member):
yeap, the second one is the longer-term option to make our lives better

@gante (Member, Author):
Cool!

(Approval please then 💛 )

Comment on lines -1053 to +1054
-text_config = config.get_text_config()
-num_decoder_layers = (
-    getattr(text_config, "decoder_layers", None)
-    or getattr(text_config, "num_decoder_layers", None)
-    or text_config.num_hidden_layers
-)
+num_decoder_layers = decoder_config.num_hidden_layers
@zucchini-nlp (Member):
love this pattern

@gante (Member, Author):
See my comment above 😅

@zucchini-nlp (Member) left a comment:
oops, yep, sorry
