Fix encoding and improve tokenizer testing logic #41840
Summary
This pull request resolves multiple issues in the tokenizer parity testing script used to validate consistency between the slow (Python) and fast (Rust) tokenizers in the Hugging Face transformers library.
The changes correct dataset handling, fix incorrect API usage, and improve overall code clarity, performance, and maintainability.
Key Changes
1. Corrected Dataset Structure Handling
The previous iteration logic assumed nested example fields. However, the dataset structure is flat, with top-level premise and hypothesis keys, so the old access pattern raised TypeError exceptions during iteration.
Updated Implementation:
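A minimal sketch of the corrected pattern, assuming the script loads the data with the datasets library; the load_dataset arguments below are illustrative, not taken from the PR:

```python
from datasets import load_dataset

# Illustrative load call; the actual dataset name and split come from the script.
dataset = load_dataset("xnli", "en", split="validation")

for example in dataset:
    # Each example is a flat dict, e.g. {"premise": str, "hypothesis": str, ...},
    # so the text fields are read directly rather than from nested objects.
    premise = example["premise"]
    hypothesis = example["hypothesis"]
```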
2. Corrected Offset Mapping Handling in check_LTR_mark
The previous code incorrectly accessed enc.offsets after calling encode_plus(), which returns a dict-like BatchEncoding rather than an Encoding object, so the attribute lookup raised AttributeError exceptions.
Fixed Implementation:
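A minimal sketch of the corrected access, assuming check_LTR_mark takes the input line, a token index, and a fast tokenizer (the function name comes from the PR; the body beyond the offset-mapping access is illustrative):

```python
def check_LTR_mark(line, idx, fast_tokenizer):
    # encode_plus returns a dict-like BatchEncoding, so the offsets must be
    # requested explicitly and read by key, not via an `.offsets` attribute.
    enc = fast_tokenizer.encode_plus(line, return_offsets_mapping=True)
    offsets = enc["offset_mapping"]
    # Boundary check: the token index may fall outside the offset list.
    if idx < 0 or idx >= len(offsets):
        return False
    start, end = offsets[idx]
    # U+200E LEFT-TO-RIGHT MARK (the specific check is illustrative).
    return "\u200e" in line[start:end]
```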
Added proper boundary checks to handle cases where index positions are out of range.
3. Added Missing Global Variable Declarations
The counters (perfect, imperfect, wrong, total) were modified inside several functions without being declared as global.
Because an augmented assignment such as total += 1 makes the name function-local, this either raises UnboundLocalError or, where a local binding exists, leaves the module-level counters unchanged.
Fix:
Explicitly declared the counters as global in each function that updates them:
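A minimal sketch of the pattern; the counter names match the PR description, while record_comparison is a hypothetical helper:

```python
perfect = imperfect = wrong = total = 0

def record_comparison(slow_ids, fast_ids):
    # Without this statement, `total += 1` raises UnboundLocalError,
    # because augmented assignment makes the names function-local.
    global perfect, imperfect, wrong, total
    total += 1
    if slow_ids == fast_ids:
        perfect += 1
    else:
        wrong += 1  # `imperfect` would be updated by a similar path elsewhere
```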
4. Refactored Dataset Iteration
Replaced index-based iteration with direct dataset iteration for cleaner and more memory-efficient looping:
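A before/after sketch of the loop; the process helper is a placeholder:

```python
# Before: index-based access fetches each row by position.
for i in range(len(dataset)):
    example = dataset[i]
    process(example)

# After: direct iteration yields rows without manual indexing.
for example in dataset:
    process(example)
```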
This improves readability and eliminates unnecessary indexing operations.
5. General Cleanup and Minor Improvements
Results
Testing
Files Modified
tokenizer_equivalence_test.py