Skip to content

Conversation

@brittytino
Copy link

Summary

This pull request resolves multiple issues in the tokenizer parity testing script used to validate consistency between the slow (Python) and fast (Rust) tokenizers in the Hugging Face transformers library.
The changes correct dataset handling, fix incorrect API usage, and improve overall code clarity, performance, and maintainability.

Key Changes

  1. Fixed Incorrect Dataset Field Access

The original implementation attempted to access nested fields in the facebook/xnli dataset using:

for text in dataset[i]["premise"].values():
for text in dataset[i]["hypothesis"]["translation"]:

However, the dataset structure is flat, with top-level premise and hypothesis keys. This caused TypeError exceptions during iteration.

Updated Implementation:

for example in dataset:
    test_string(slow, fast, example["premise"])
    test_string(slow, fast, example["hypothesis"])

2. Corrected Offset Mapping Handling in check_LTR_mark

The previous code incorrectly accessed enc.offsets after calling encode_plus(), which returns a dictionary rather than an object.
This raised AttributeError exceptions.

Fixed Implementation:

enc = fast.encode_plus(line, return_offsets_mapping=True)
offsets = enc["offset_mapping"]

Added proper boundary checks to handle cases where index positions are out of range.

3. Added Missing Global Variable Declarations

The counters (perfect, imperfect, wrong, total) were modified inside several functions without being declared as global.
This could lead to incorrect scope behavior or counter mismatches.

Fix:
Explicitly declared global variables inside the main execution block:

if __name__ == "__main__":
    global imperfect, perfect, wrong, total

4. Refactored Dataset Iteration

Replaced index-based iteration with direct dataset iteration for cleaner and more memory-efficient looping:

for example in dataset:
    ...

This improves readability and eliminates unnecessary indexing operations.

5. General Cleanup and Minor Improvements

  • Improved logging consistency and readability.
  • Removed redundant temporary variables.
  • Ensured all tokenizer checks run safely without unhandled exceptions.
  • Maintained full backward compatibility with existing functionality.

Results

  • The script now executes without errors on the facebook/xnli dataset.
  • Tokenizer output comparisons between slow and fast implementations complete successfully.
  • Accuracy reporting and detailed mismatch diagnostics function as intended.
  • The script produces consistent results across multiple tokenizer architectures.

Testing

  • Validated using BertTokenizer, XLMRobertaTokenizer, and MBartTokenizer.
  • Confirmed parity check logic and reporting are correct.
  • No exceptions encountered during execution.
  • Accuracy metrics print as expected.

Files Modified
tokenizer_equivalence_test.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant