Add Unicode character name support and fix bugs in tr/// operator #63

fglock · 2025-10-29T15:00:03Z

Summary

This PR improves the transliteration operator (tr/// and y///) by adding Unicode character name support (\N{name}) and fixing two bugs: missing validation of empty \N{} names and incorrect surrogate handling.

Test Results

Before: 256/318 passing (80.5%) - test died at line 1113
After: 277/318 passing (87.1%)
Improvement: +21 tests (+6.6 percentage points)

Changes

1. ✅ Unicode Character Name Support

Added support for \N{name} syntax with actual Unicode character names
Uses UnicodeResolver integration with ICU4J (30,000+ names supported)
Properly rejects multi-character named sequences

Example:

$s = "é";
$s =~ tr/\N{LATIN SMALL LETTER E WITH ACUTE}/E/;  # Now works!

2. ✅ Empty \N{} Validation

Added validation for empty character names
Now gives proper error: Unknown charname ''
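
The lookup and validation described in items 1 and 2 can be sketched roughly as follows. This is not the PR's code: the PR resolves names through UnicodeResolver and ICU4J, while this sketch substitutes the JDK's built-in Character.codePointOf (Java 9+) so that it runs without dependencies. The class and method names (CharNameSketch, resolveCharName) are hypothetical; only the error text comes from the description above.

```java
// Hypothetical sketch: resolve the name inside \N{...} to a single code point.
// Assumption: JDK Character.codePointOf stands in for the ICU4J-backed resolver.
final class CharNameSketch {

    static int resolveCharName(String name) {
        // Item 2: an empty name is rejected up front with a Perl-style message.
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("Unknown charname ''");
        }
        try {
            // Item 1: look the name up; the result is exactly one code point.
            return Character.codePointOf(name.trim());
        } catch (IllegalArgumentException e) {
            // Unknown names surface with the same message shape.
            throw new IllegalArgumentException("Unknown charname '" + name + "'");
        }
    }

    public static void main(String[] args) {
        int cp = resolveCharName("LATIN SMALL LETTER E WITH ACUTE");
        System.out.printf("U+%04X%n", cp); // prints U+00E9
    }
}
```

Because the lookup returns a single int, a name that denotes a multi-character sequence has no way to resolve here, which mirrors the rejection of named sequences in item 1.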

3. ✅ Surrogate Pair Bug Fix

Fixed incorrect handling where a character following a lone high surrogate (e.g. \x{ffff} after \x{d800}) was being removed
Changed from checking isHighSurrogate() to isSupplementaryCodePoint()
Now correctly preserves standalone surrogates and valid surrogate pairs

Example:

no warnings 'utf8';
$s = "\x{d800}\x{ffff}";
$s =~ tr/\0/A/;
# Before: "\x{d800}" (lost \x{ffff})
# After: "\x{d800}\x{ffff}" (preserved both)
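
A minimal Java sketch of why the check matters. The actual loop in RuntimeTransliterate.java is not shown in this PR description, so the shape of the bug below is an assumption reconstructed from the example above, and the class and method names are hypothetical. Only the two Character methods named in the fix are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of code point iteration over a Java (UTF-16) string.
final class SurrogateStepSketch {

    // Assumed buggy form: any high surrogate is treated as the start of a pair,
    // so the code unit after a lone high surrogate is skipped.
    static List<Integer> codePointsBuggy(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            out.add(s.codePointAt(i));
            i += Character.isHighSurrogate(s.charAt(i)) ? 2 : 1;
        }
        return out;
    }

    // Fixed form: advance by two code units only when codePointAt actually
    // decoded a supplementary code point, i.e. a valid surrogate pair.
    static List<Integer> codePointsFixed(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.isSupplementaryCodePoint(cp) ? 2 : 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String lone = "\uD800\uFFFF";   // lone high surrogate, then U+FFFF
        System.out.println(codePointsBuggy(lone).size()); // prints 1
        System.out.println(codePointsFixed(lone).size()); // prints 2
    }
}
```

For a valid pair such as "\uD83D\uDE00" both versions yield the single code point U+1F600; they differ only when a high surrogate is not followed by a low surrogate.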

Files Modified

src/main/java/org/perlonjava/operators/RuntimeTransliterate.java
src/main/java/org/perlonjava/regex/UnicodeResolver.java

Documentation

TR_UNICODE_NAME_SUPPORT.md - Technical details
TR_IMPROVEMENTS_SUMMARY.md - Complete summary

All changes maintain Perl 5.36 compatibility.

- Add support for \N{name} syntax with actual Unicode character names using UnicodeResolver integration with ICU4J (30,000+ names supported)
- Fix empty \N{} validation to give proper 'Unknown charname' error
- Fix surrogate pair handling bug that was incorrectly removing characters by checking isSupplementaryCodePoint() instead of isHighSurrogate()

Test improvements: 256/318 (80.5%) -> 277/318 (87.1%)
- Fixed 21 tests (+6.6 percentage points)
- Tests now run to completion (previously died at line 1113)

Examples:

$s =~ tr/\N{LATIN SMALL LETTER E WITH ACUTE}/E/;  # now works
$s = "\x{d800}\x{ffff}"; $s =~ tr/\0/A/;  # now preserves both chars

fglock · 2025-10-29T15:19:54Z

Closing this PR as requested. The tr/// improvements (+21 tests) are solid, but there are concerns about test regressions. Note: Investigation showed the listed regressions (magic.t, avhv.t, etc.) were introduced by PR #62 (strict-refs fix), not by these tr/// changes. The work is preserved in the feature/tr-unicode-support branch for future reference.

fglock closed this Oct 29, 2025
