Fix encoding and improve tokenizer testing logic #41840
Summary
This pull request resolves multiple issues in the tokenizer parity testing script used to validate consistency between the slow (Python) and fast (Rust) tokenizers in the Hugging Face transformers library.
The changes correct dataset handling, fix incorrect API usage, and improve overall code clarity, performance, and maintainability.
Key Changes
1. Corrected Dataset Structure Handling
The previous iteration logic assumed nested example fields. However, the dataset structure is flat, with top-level premise and hypothesis keys, so the old access pattern raised TypeError exceptions during iteration.
Updated Implementation:
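A minimal sketch of the corrected pattern, assuming the script loads the data with the datasets library; the load_dataset arguments below are illustrative, not taken from the PR:

```python
from datasets import load_dataset

# Illustrative load call; the actual dataset name and split come from the script.
dataset = load_dataset("xnli", "en", split="validation")

for example in dataset:
    # Each example is a flat dict, e.g. {"premise": str, "hypothesis": str, ...},
    # so the text fields are read directly rather than from nested objects.
    premise = example["premise"]
    hypothesis = example["hypothesis"]
```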
2. Corrected Offset Mapping Handling in check_LTR_mark
The previous code incorrectly accessed enc.offsets after calling encode_plus(), which returns a dict-like BatchEncoding rather than an Encoding object, so the attribute lookup raised AttributeError exceptions.
Fixed Implementation:
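A minimal sketch of the corrected access, assuming check_LTR_mark takes the input line, a token index, and a fast tokenizer (the function name comes from the PR; the body beyond the offset-mapping access is illustrative):

```python
def check_LTR_mark(line, idx, fast_tokenizer):
    # encode_plus returns a dict-like BatchEncoding, so the offsets must be
    # requested explicitly and read by key, not via an `.offsets` attribute.
    enc = fast_tokenizer.encode_plus(line, return_offsets_mapping=True)
    offsets = enc["offset_mapping"]
    # Boundary check: the token index may fall outside the offset list.
    if idx < 0 or idx >= len(offsets):
        return False
    start, end = offsets[idx]
    # U+200E LEFT-TO-RIGHT MARK (the specific check is illustrative).
    return "\u200e" in line[start:end]
```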
Added proper boundary checks to handle cases where index positions are out of range.
3. Added Missing Global Variable Declarations
The counters (perfect, imperfect, wrong, total) were modified inside several functions without being declared as global.
Because an augmented assignment such as total += 1 makes the name function-local, this either raises UnboundLocalError or, where a local binding exists, leaves the module-level counters unchanged.
Fix:
Explicitly declared the counters as global in each function that updates them:
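A minimal sketch of the pattern; the counter names match the PR description, while record_comparison is a hypothetical helper:

```python
perfect = imperfect = wrong = total = 0

def record_comparison(slow_ids, fast_ids):
    # Without this statement, `total += 1` raises UnboundLocalError,
    # because augmented assignment makes the names function-local.
    global perfect, imperfect, wrong, total
    total += 1
    if slow_ids == fast_ids:
        perfect += 1
    else:
        wrong += 1  # `imperfect` would be updated by a similar path elsewhere
```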
4. Refactored Dataset Iteration
Replaced index-based iteration with direct dataset iteration for cleaner and more memory-efficient looping:
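A before/after sketch of the loop; the process helper is a placeholder:

```python
# Before: index-based access fetches each row by position.
for i in range(len(dataset)):
    example = dataset[i]
    process(example)

# After: direct iteration yields rows without manual indexing.
for example in dataset:
    process(example)
```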
This improves readability and eliminates unnecessary indexing operations.
5. General Cleanup and Minor Improvements
Results
Testing
Files Modified
tokenizer_equivalence_test.py