fglock commented Oct 29, 2025

Summary

This PR improves the transliteration operator (tr/// and y///) by adding Unicode character name support (\N{name}), validating empty \N{}, and fixing a surrogate-handling bug that dropped characters.

Test Results

  • Before: 256/318 passing (80.5%); the test run died at line 1113
  • After: 277/318 passing (87.1%); the test run now completes
  • Improvement: +21 tests (+6.6 percentage points)

Changes

1. ✅ Unicode Character Name Support

  • Added support for \N{name} syntax with actual Unicode character names
  • Resolves names through UnicodeResolver, which is backed by ICU4J (30,000+ names supported)
  • Properly rejects multi-character named sequences

Example:

$s = "é";
$s =~ tr/\N{LATIN SMALL LETTER E WITH ACUTE}/E/;  # Now works!

2. ✅ Empty \N{} Validation

  • Added validation for empty character names
  • Now gives proper error: Unknown charname ''
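
The lookup and the empty-name check from sections 1 and 2 can be sketched in Java as follows. This is a minimal illustration, not the code in UnicodeResolver.java: it swaps ICU4J for the JDK's built-in `Character.codePointOf` (Java 9+) so that it runs without extra dependencies, and the class and method names (`CharNameSketch`, `resolveCharName`) are hypothetical. `Character.codePointOf` only ever returns a single code point, so named sequences are not resolved here either.

```java
// Hypothetical sketch of \N{name} resolution for tr///.
// Assumption: the JDK name table stands in for the ICU4J lookup used by the PR.
final class CharNameSketch {

    // Returns the code point for a Unicode character name,
    // or throws with a Perl-style "Unknown charname" message.
    static int resolveCharName(String name) {
        // An empty \N{} is rejected before any lookup is attempted.
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("Unknown charname ''");
        }
        try {
            // Character.codePointOf throws IllegalArgumentException for unknown names.
            return Character.codePointOf(name);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown charname '" + name + "'");
        }
    }

    public static void main(String[] args) {
        int cp = resolveCharName("LATIN SMALL LETTER E WITH ACUTE");
        System.out.println(Integer.toHexString(cp));
        try {
            resolveCharName("");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting the empty name up front keeps the error message independent of how the underlying name table treats an empty key.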

3. ✅ Surrogate Pair Bug Fix

  • Fixed a bug where the character following an unpaired high surrogate was dropped (e.g. \x{ffff} after \x{d800})
  • Changed the check from isHighSurrogate() to isSupplementaryCodePoint()
  • Standalone surrogates and valid surrogate pairs are now both preserved

Example:

no warnings 'utf8';
$s = "\x{d800}\x{ffff}";
$s =~ tr/\0/A/;
# Before: "\x{d800}" (lost \x{ffff})
# After: "\x{d800}\x{ffff}" (preserved both)
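
The difference between the two checks can be reproduced with a small Java sketch. It is a reconstruction under assumptions, not the code in RuntimeTransliterate.java: the class and method names (`SurrogateScan`, `scanBuggy`, `scanFixed`) are hypothetical, and the loop only collects code points instead of transliterating them. The buggy variant assumes every high surrogate starts a pair and skips the next UTF-16 unit. The fixed variant skips only when the decoded code point is actually supplementary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the surrogate-handling bug and its fix.
final class SurrogateScan {

    // Buggy: skips the unit after ANY high surrogate, even an unpaired one.
    static List<Integer> scanBuggy(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.codePointAt(i));
            if (Character.isHighSurrogate(s.charAt(i))) {
                i++; // wrongly assumes a low surrogate follows
            }
        }
        return out;
    }

    // Fixed: skips the next unit only if the code point really used two units.
    static List<Integer> scanFixed(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            out.add(cp);
            if (Character.isSupplementaryCodePoint(cp)) {
                i++; // valid surrogate pair, consume the low surrogate too
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Unpaired high surrogate followed by U+FFFF, as in the Perl example.
        String s = new String(new char[] {0xD800, 0xFFFF});
        System.out.println(scanBuggy(s).size()); // one element: U+FFFF was skipped
        System.out.println(scanFixed(s).size()); // two elements: both preserved
    }
}
```

`String.codePointAt` returns the lone surrogate itself when no low surrogate follows, so `isSupplementaryCodePoint` is false for it and the following character is visited normally.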

Files Modified

  • src/main/java/org/perlonjava/operators/RuntimeTransliterate.java
  • src/main/java/org/perlonjava/regex/UnicodeResolver.java

Documentation

  • TR_UNICODE_NAME_SUPPORT.md - Technical details
  • TR_IMPROVEMENTS_SUMMARY.md - Complete summary

All changes maintain Perl 5.36 compatibility.

- Add support for \N{name} syntax with actual Unicode character names
  using UnicodeResolver integration with ICU4J (30,000+ names supported)
- Fix empty \N{} validation to give proper 'Unknown charname' error
- Fix surrogate pair handling bug that was incorrectly removing characters
  by checking isSupplementaryCodePoint() instead of isHighSurrogate()

Test improvements: 256/318 (80.5%) -> 277/318 (87.1%)
- Fixed 21 tests (+6.6 percentage points)
- Tests now run to completion (previously died at line 1113)

Examples:
  $s =~ tr/\N{LATIN SMALL LETTER E WITH ACUTE}/E/;  # now works
  $s = "\x{d800}\x{ffff}"; $s =~ tr/\0/A/;  # now preserves both chars

fglock commented Oct 29, 2025

Closing this PR as requested. The tr/// improvements (+21 tests) are solid, but there are concerns about test regressions. Note: Investigation showed the listed regressions (magic.t, avhv.t, etc.) were introduced by PR #62 (strict-refs fix), not by these tr/// changes. The work is preserved in the feature/tr-unicode-support branch for future reference.

fglock closed this Oct 29, 2025