[tests] update test_past_key_values_format and delete overwrites #40701
Conversation
@@ -728,10 +727,6 @@ def test_training_gradient_checkpointing_use_reentrant(self):
     def test_training_gradient_checkpointing_use_reentrant_false(self):
         pass

-    @is_flaky(max_attempts=5, description="Flaky for some input configurations.")
(double-checked with flake-finder -- it is no longer flaky)
[For maintainers] Suggested jobs to run (before merge): run-slow: dia, gemma3n, got_ocr2, speecht5, t5gemma
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Great clean-up, just had a question to clarify about decoder text configs. TBH I didn't know we were using it for flat structured configs.
# Does the class map the new key into a different attribute name at read time? If so, let's write
# into that attribute instead
if new_key in config_to_return.attribute_map:
    new_key = config_to_return.attribute_map[new_key]
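For context, a simplified sketch of the mechanism this snippet relies on: a class-level `attribute_map` that redirects reads and writes of a generic attribute name to the stored one. This is mock code (not the real `PretrainedConfig` implementation), just enough to reproduce the behavior discussed below:

class MockConfig:
    # class-level mapping, as in old encoder-decoder configs: the generic
    # name always points at the *encoder* attribute
    attribute_map = {"num_attention_heads": "encoder_attention_heads"}

    def __setattr__(self, key, value):
        # writes to a mapped name are redirected to the target attribute
        if key in self.attribute_map:
            key = self.attribute_map[key]
        super().__setattr__(key, value)

    def __getattr__(self, key):
        # only called when normal lookup fails: fall back through the map
        if key != "attribute_map" and key in self.attribute_map:
            return getattr(self, self.attribute_map[key])
        raise AttributeError(key)

cfg = MockConfig()
cfg.encoder_attention_heads = 16
cfg.decoder_attention_heads = 8
print(cfg.num_attention_heads)  # 16 -- the generic name resolves to the encoder value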
Not sure I got this. So if we map the new key back through the attribute map, in models like BART we will do `num_attention_heads` -> `encoder_attention_heads`. This doesn't look quite right if we asked for a decoder config.
It does not look right indeed 😢 But `attribute_map` is a class-level attribute, so we can't update it for the new configuration either (i.e. for the config instance returned by `get_text_config`).

Note that these encoder/decoder attributes in `attribute_map` are from old models, and that these inconsistencies only show up if users decide to print internal variables 👀

This means we are limited to two options, to maintain BC:

- [This PR] We use the same mapping all over the code (e.g. `config.get_text_config(decoder=True).num_attention_heads` to get the number of attention heads in the decoder), but accept that some old configs will have an odd representation because of their attribute map;
- [main] Have several if/else branches scattered across our codebase, like

  num_decoder_layers = (
      getattr(config, "decoder_layers", None)  # flat configs case 1
      or getattr(config, "num_decoder_layers", None)  # flat configs case 2
      or decoder_config.num_hidden_layers  # modern default for decoders
  )

(If we double down on direction 2, we need to add more if/else cases; our current logic is not robust in all tests.)

Option 1 seems much more reliable in the long run, and also nudges everyone into using the same names everywhere (as opposed to relying on attribute maps) 🤗
Alternatively, we may be able to update the `attribute_map` logic to read/write the target variable directly, as opposed to remapping the reads/writes 👀

Example: if we have the `{"a": "b"}` mapping, atm all reads of `config.a` actually read `config.b`, without checking whether `a` exists in `config`. Same for writes. We could instead make reads of `config.a` read `config.a` first and, if it doesn't exist, fall back to `config.b`. All writes would write into `config.a`.

WDYT?
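A minimal sketch of the proposed semantics (hypothetical code, not current transformers behavior), using the `{"a": "b"}` mapping from the example:

class ProposedConfig:
    # hypothetical mapping: "a" is the canonical name, "b" the legacy stored name
    attribute_map = {"a": "b"}

    def __init__(self, b):
        self.b = b  # pretend a serialized legacy config only carried "b"

    def __getattr__(self, key):
        # only called when normal lookup fails: reads of "a" fall back to "b"
        if key != "attribute_map" and key in self.attribute_map:
            return getattr(self, self.attribute_map[key])
        raise AttributeError(key)

cfg = ProposedConfig(b=4)
print(cfg.a)  # 4 -- "a" doesn't exist yet, so the read falls back to "b"
cfg.a = 8     # writes are not remapped: this creates a real "a" attribute
print(cfg.a)  # 8 -- "a" now exists, so the fallback is no longer consulted
print(cfg.b)  # 4 -- "a" and "b" now disagree, which is the concern raised below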
> We could instead make reads of config.a read config.a first and, if it doesn't exist, fall back to config.b. All writes would write into config.a.

This sounds interesting, and slightly breaking, because we will end up with two keys for the same concept. It might raise questions such as which value is correct when inspecting visually or serializing configs. For example, we might end up with both `image_token_id` and `image_token_index` in some VLMs.

Coming back to "Option 1", I see we always check for attribute mapping now. TBH I was expecting that `get_text_config()` would return a different config only if the config structure is nested. Otherwise the whole config is a text config and has no other modalities.
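For reference, a simplified sketch of that nested vs. flat distinction (mock classes, not the actual transformers implementation):

class FlatTextConfig:
    """Text-only model: the whole config is already the text config."""
    num_hidden_layers = 12

    def get_text_config(self, decoder=False):
        return self  # nothing nested to return

class NestedMultimodalConfig:
    """VLM-style config: text settings live in a dedicated sub-config."""
    def __init__(self):
        self.text_config = FlatTextConfig()

    def get_text_config(self, decoder=False):
        return self.text_config

# call sites can use one access pattern regardless of config structure
for cfg in (FlatTextConfig(), NestedMultimodalConfig()):
    print(cfg.get_text_config().num_hidden_layers)  # 12 in both cases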
In this case I think the current approach is the best we can do, because it helps reduce LOC and is not very breaking. We can ignore the weird naming, as no one would serialize/print the text config, I hope. Let's either keep it as is, or consider the other option below. Feel free to ignore it if it doesn't work.
Separately: I looked through the attribute maps in the repo, and the generic name always maps to the encoder when the model is encoder-decoder. We could gradually deprecate this pattern from the mapping, and nudge users to explicitly fetch the value with `config.encoder_attention_heads`. We would need to use consistent naming in encoder-decoder models and promote it for future models. Though this option will take a long time to deprecate, maybe even till v5 🙃
What I'm reading is "let's go with this PR, and try to nudge users away from `attribute_map`". Is this correct? :)
Yeap, and the second one is a longer-term effort to make our lives better.
Cool!
(Approval please then 💛)
-        text_config = config.get_text_config()
-        num_decoder_layers = (
-            getattr(text_config, "decoder_layers", None)
-            or getattr(text_config, "num_decoder_layers", None)
-            or text_config.num_hidden_layers
-        )
+        num_decoder_layers = decoder_config.num_hidden_layers
love this pattern
See my comment above 😅
oops, yep, sorry
What does this PR do?

(Carved from #40553, which is becoming messy)

This PR:

- Updates `test_past_key_values_format` to support things like GQA or skipped kv cache layers. As a result, we can remove some overwrites/skips 💛 (see the sketch after this list for the kind of check involved)
- In `get_text_config` + legacy models, fixes a bug: we were not respecting the `attribute_map` (i.e. the remapping of config attributes, which we need to be careful with); with this PR we do. This bugfix allows us to remove some of the test skips.
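For illustration, a minimal sketch of the kind of GQA-aware shape check the updated test needs to perform. This is not the actual test code from the PR: the helper name and the None-for-skipped-layers convention are assumptions, and it targets the legacy tuple cache format for simplicity.

def check_past_key_values_format(past_key_values, config, batch_size, seq_length):
    # With GQA, the cache stores num_key_value_heads heads, which can be fewer
    # than num_attention_heads; fall back to MHA when the attribute is absent.
    num_kv_heads = getattr(config, "num_key_value_heads", None) or config.num_attention_heads
    head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
    expected_shape = (batch_size, num_kv_heads, seq_length, head_dim)
    for layer_idx, layer_cache in enumerate(past_key_values):
        if layer_cache is None:  # assumed convention for layers whose kv cache is skipped
            continue
        key_states, value_states = layer_cache
        assert key_states.shape == expected_shape, f"layer {layer_idx} keys: {key_states.shape}"
        assert value_states.shape == expected_shape, f"layer {layer_idx} values: {value_states.shape}"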