multiple tokenizers with different filenames can save now #41837
Conversation
Running the example from the issue, I get:

```
Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                  ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
```
Hi @AmitMY, can you try it once again?
Strangely, now the error is
It would be good if you added a test for your code. For example:

```python
import tempfile

from transformers.testing_utils import TestCasePlus
from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"
            attributes = ["tokenizer1", "tokenizer2"]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self, tokenizer1: PreTrainedTokenizer, tokenizer2: PreTrainedTokenizer):
                super().__init__(tokenizer1=tokenizer1, tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
        processor = OtherProcessor(tokenizer1=tokenizer1, tokenizer2=tokenizer2)
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)
            assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
```
Cool! Now I guess we need to make all the other tests pass...
What does this PR do?
This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.
The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.
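The collision can be reproduced with plain file I/O, independent of transformers: when two components each write the same default filename into the same directory, the second write silently replaces the first. A minimal sketch (the filename and helper below are invented for illustration):

```python
import json
import os
import tempfile

DEFAULT_NAME = "tokenizer_config.json"  # both components share this default

def save_component(config, directory):
    # Each component writes to the same default filename in the directory.
    with open(os.path.join(directory, DEFAULT_NAME), "w") as f:
        json.dump(config, f)

with tempfile.TemporaryDirectory() as d:
    save_component({"name": "tokenizer1"}, d)
    save_component({"name": "tokenizer2"}, d)  # overwrites tokenizer1's file
    with open(os.path.join(d, DEFAULT_NAME)) as f:
        survivor = json.load(f)
    # Only the last component's data survives the round trip.
    print(survivor["name"])
```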
This update introduces unique naming and loading logic in the ProcessorMixin save/load methods, allowing processors with multiple tokenizers to be safely saved and reloaded without data loss.
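The idea can be sketched without transformers: give each attribute its own save location so that two components of the same type never share file paths. This is a simplified illustration of the approach, not the actual ProcessorMixin implementation; the classes and filenames are invented for the example.

```python
import json
import os
import tempfile

class ToyTokenizer:
    """Stand-in for a tokenizer; always saves to the same default filename."""
    FILENAME = "tokenizer_config.json"

    def __init__(self, vocab):
        self.vocab = vocab

    def save_pretrained(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, self.FILENAME), "w") as f:
            json.dump({"vocab": self.vocab}, f)

    @classmethod
    def from_pretrained(cls, save_directory):
        with open(os.path.join(save_directory, cls.FILENAME)) as f:
            return cls(json.load(f)["vocab"])

class ToyProcessor:
    """Saves each attribute into its own subdirectory to avoid collisions."""
    attributes = ["tokenizer1", "tokenizer2"]

    def __init__(self, tokenizer1, tokenizer2):
        self.tokenizer1 = tokenizer1
        self.tokenizer2 = tokenizer2

    def save_pretrained(self, save_directory):
        for name in self.attributes:
            # A per-attribute subdirectory gives each component unique paths,
            # even when both components use the same default filename.
            getattr(self, name).save_pretrained(os.path.join(save_directory, name))

    @classmethod
    def from_pretrained(cls, save_directory):
        loaded = {
            name: ToyTokenizer.from_pretrained(os.path.join(save_directory, name))
            for name in cls.attributes
        }
        return cls(**loaded)

with tempfile.TemporaryDirectory() as d:
    proc = ToyProcessor(ToyTokenizer(["a", "b"]), ToyTokenizer(["x", "y", "z"]))
    proc.save_pretrained(d)
    reloaded = ToyProcessor.from_pretrained(d)
```

With this layout, both tokenizers round-trip intact even though each writes a file with the same name.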
Fixes #41816
Before submitting
I have read the contributor guideline.
The change was discussed in issue #41816.
I’ve tested the processor save/load logic locally with multiple tokenizers.
No documentation changes were required.
Added/verified tests for multiple sub-tokenizers loading correctly.
Who can review?
Tagging maintainers familiar with processor and tokenizer internals:
@Cyrilvallez
@ArthurZucker