multiple tokenizers with different filenames can save now #41837
Conversation
Running the example from the issue, I get:

```
Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                  ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
```
Hi @AmitMY, can you try it once again?
Strangely, now the error is
It would be good if you added a test for your code. For example:

```python
import tempfile

from transformers.testing_utils import TestCasePlus
from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"
            attributes = ["tokenizer1", "tokenizer2"]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self, tokenizer1: PreTrainedTokenizer, tokenizer2: PreTrainedTokenizer):
                super().__init__(tokenizer1=tokenizer1, tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
        processor = OtherProcessor(tokenizer1=tokenizer1, tokenizer2=tokenizer2)
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)
            assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
```
Cool! Now I guess we need to make all the other tests pass...
What does this PR do?
This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.
The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.
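The collision can be reproduced with plain file I/O, independent of transformers: when two components each write the same default filename into the same directory, the second write silently replaces the first. A minimal sketch (the filename and helper below are invented for illustration):

```python
import json
import os
import tempfile

DEFAULT_NAME = "tokenizer_config.json"  # both components share this default

def save_component(config, directory):
    # Each component writes to the same default filename in the directory.
    with open(os.path.join(directory, DEFAULT_NAME), "w") as f:
        json.dump(config, f)

with tempfile.TemporaryDirectory() as d:
    save_component({"name": "tokenizer1"}, d)
    save_component({"name": "tokenizer2"}, d)  # overwrites tokenizer1's file
    with open(os.path.join(d, DEFAULT_NAME)) as f:
        survivor = json.load(f)
    # Only the last component's data survives the round trip.
    print(survivor["name"])
```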
This update introduces unique naming and loading logic in the ProcessorMixin save/load methods, allowing processors with multiple tokenizers to be safely saved and reloaded without data loss.
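The idea can be sketched without transformers: give each attribute its own save location so that two components of the same type never share file paths. This is a simplified illustration of the approach, not the actual ProcessorMixin implementation; the classes and filenames are invented for the example.

```python
import json
import os
import tempfile

class ToyTokenizer:
    """Stand-in for a tokenizer; always saves to the same default filename."""
    FILENAME = "tokenizer_config.json"

    def __init__(self, vocab):
        self.vocab = vocab

    def save_pretrained(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, self.FILENAME), "w") as f:
            json.dump({"vocab": self.vocab}, f)

    @classmethod
    def from_pretrained(cls, save_directory):
        with open(os.path.join(save_directory, cls.FILENAME)) as f:
            return cls(json.load(f)["vocab"])

class ToyProcessor:
    """Saves each attribute into its own subdirectory to avoid collisions."""
    attributes = ["tokenizer1", "tokenizer2"]

    def __init__(self, tokenizer1, tokenizer2):
        self.tokenizer1 = tokenizer1
        self.tokenizer2 = tokenizer2

    def save_pretrained(self, save_directory):
        for name in self.attributes:
            # A per-attribute subdirectory gives each component unique paths,
            # even when both components use the same default filename.
            getattr(self, name).save_pretrained(os.path.join(save_directory, name))

    @classmethod
    def from_pretrained(cls, save_directory):
        loaded = {
            name: ToyTokenizer.from_pretrained(os.path.join(save_directory, name))
            for name in cls.attributes
        }
        return cls(**loaded)

with tempfile.TemporaryDirectory() as d:
    proc = ToyProcessor(ToyTokenizer(["a", "b"]), ToyTokenizer(["x", "y", "z"]))
    proc.save_pretrained(d)
    reloaded = ToyProcessor.from_pretrained(d)
```

With this layout, both tokenizers round-trip intact even though each writes a file with the same name.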
Fixes #41816
Before submitting
I have read the contributor guideline.
The change was discussed in issue #41816.
I’ve tested the processor save/load logic locally with multiple tokenizers.
No documentation changes were required.
Added/verified tests for multiple sub-tokenizers loading correctly.
Who can review?
Tagging maintainers familiar with processor and tokenizer internals:
@Cyrilvallez
@ArthurZucker