[MNG-8241] Handle non-BMP characters when comparing versions #2071

elharo · 2025-01-30T02:42:56Z

This PR takes the path of treating all non-ASCII digits as non-numeric. Alternately we could keep the existing behavior for all BMP digits and also treat non-BMP digits as numeric. But we probably shouldn't treat BMP digits and non-BMP digits differently.
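To make the two options concrete, here is a minimal sketch of how each policy classifies a few code points. `AsciiDigitPolicy` and its methods are hypothetical names used only for illustration; this is not the code in `ComparableVersion`.

```java
// Sketch only: AsciiDigitPolicy is a hypothetical helper, not part of Maven.
final class AsciiDigitPolicy {

    /** The policy this PR takes: only '0'..'9' count as numeric. */
    static boolean isAsciiDigit(int codePoint) {
        return codePoint >= '0' && codePoint <= '9';
    }

    /** Reports how the ASCII-only check and Character.isDigit(int) classify one code point. */
    static String describe(int codePoint) {
        return String.format("U+%04X ascii=%b unicode=%b",
                codePoint, isAsciiDigit(codePoint), Character.isDigit(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {
            '7',      // ASCII digit
            0x0667,   // ARABIC-INDIC DIGIT SEVEN (BMP)
            0x1D7E4,  // MATHEMATICAL SANS-SERIF DIGIT TWO (non-BMP), i.e. "\uD835\uDFE4"
            0x00E9    // LATIN SMALL LETTER E WITH ACUTE, not a digit under either policy
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Under the ASCII-only check, the BMP digit and the non-BMP digit land in the same bucket (non-numeric), which is the consistency this PR is after.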

gnodet · 2025-01-30T09:27:15Z

It would make sense to also fix the behavior of the resolver related class:
https://github.com/apache/maven-resolver/blob/fa639fe1e76abc774d5ffd298c3dfa501cf305fc/maven-resolver-util/src/main/java/org/eclipse/aether/util/version/GenericVersion.java#L230-L258
A resolver release is planned very soon, and the whole Maven 4 API uses this class rather than the one in maven-artifact.

gnodet · 2025-01-30T09:34:59Z

...maven-artifact/src/test/java/org/apache/maven/artifact/versioning/ComparableVersionTest.java

+    void testDigitGreaterThanNonAscii() {
+        ComparableVersion c1 = new ComparableVersion("1");
+        ComparableVersion c2 = new ComparableVersion("é");
+        assertTrue(c1.compareTo(c2) > 0, "expected " + "1" + " > " + "\uD835\uDFE4");


The message is wrong: it prints "\uD835\uDFE4", but the version being compared is "é".

gnodet · 2025-01-30T09:35:11Z

...maven-artifact/src/test/java/org/apache/maven/artifact/versioning/ComparableVersionTest.java

+        ComparableVersion c1 = new ComparableVersion("1");
+        ComparableVersion c2 = new ComparableVersion("é");
+        assertTrue(c1.compareTo(c2) > 0, "expected " + "1" + " > " + "\uD835\uDFE4");
+        assertTrue(c2.compareTo(c1) < 0, "expected " + "\uD835\uDFE4" + " < " + "1");


and that one too

gnodet · 2025-01-30T09:47:39Z

compat/maven-artifact/src/main/java/org/apache/maven/artifact/versioning/ComparableVersion.java

@@ -687,7 +700,8 @@ public final void parseVersion(String version) {
                    stack.push(list);
                }
                isCombination = false;
-            } else if (Character.isDigit(c)) {
+                // TODO we might not want to use isDigit here; just check for ASCII digits only
+            } else if (c >= '0' && c <= '9') {


I'm not sure why not. As far as I can tell, the string is later parsed with Integer.parseInt, Long.parseLong, or new BigInteger(), and they all rely on Character.digit(char, int), which should support everything that Character.isDigit(char) considers a digit (but not everything that Character.isDigit(int) does).
I think this has the same effect as we're only considering radix-10 when parsing.
An alternative to support supplemental characters would be to normalise those in a buffer while they are read in this loop and pass on the buffer rather than a substring.

Something like the following, maybe?

    @SuppressWarnings("checkstyle:innerassignment")
    public final void parseVersion(String version) {
        this.value = version;

        items = new ListItem();

        version = version.toLowerCase(Locale.ENGLISH);

        ListItem list = items;

        Deque<Item> stack = new ArrayDeque<>();
        stack.push(list);

        boolean isDigit = false;
        boolean isCombination = false;

        int startIndex = 0;

        StringBuilder normalizedBuffer = new StringBuilder();

        for (int i = 0; i < version.length(); i++) {
            char character = version.charAt(i);
            int c = character;

            // Handle high surrogate
            if (Character.isHighSurrogate(character)) {
                try {
                    char low = version.charAt(i + 1);
                    char[] both = {character, low};
                    c = Character.codePointAt(both, 0);
                    i++;
                } catch (IndexOutOfBoundsException ex) {
                    // Treat high surrogate without low surrogate as a regular character
                }
            }

            // Normalize the Unicode digits (including supplemental plane characters) using Character.digit()
            if (Character.isDigit(c)) {
                // Normalize Unicode digits to ASCII digits
                int normalizedDigit = Character.digit(c, 10);
                if (normalizedDigit != -1) {
                    c = normalizedDigit + '0'; // Convert digit to its ASCII equivalent
                }
            }

            // Build the normalized substring on the fly
            if (c == '.' || c == '-' || i == version.length() - 1) {
                // If we've reached a separator or the end of the string, process the current segment
                if (i == startIndex && c != '-') {
                    list.add(IntItem.ZERO); // Handle empty sections
                } else {
                    // Add the current segment to the list after normalizing it
                    list.add(parseItem(isCombination, isDigit, normalizedBuffer.toString()));
                }
                isCombination = false;
                normalizedBuffer.setLength(0); // Reset the buffer for the next segment
                startIndex = i + 1;

                // Handle separators
                if (c == '-' && !list.isEmpty()) {
                    list.add(list = new ListItem());
                    stack.push(list);
                }
            } else if (c >= '0' && c <= '9') {
                if (!isDigit && i > startIndex) {
                    // Handle combination of a letter and number, like "X1"
                    isCombination = true;

                    if (!list.isEmpty()) {
                        list.add(list = new ListItem());
                        stack.push(list);
                    }
                }

                isDigit = true;
            } else {
                if (isDigit && i > startIndex) {
                    list.add(parseItem(isCombination, true, normalizedBuffer.toString()));
                    normalizedBuffer.setLength(0); // Reset buffer
                    startIndex = i;

                    list.add(list = new ListItem());
                    stack.push(list);
                    isCombination = false;
                }

                isDigit = false;
            }

            // Append the current character to the normalized buffer
            normalizedBuffer.append((char) c);
        }

        if (version.length() > startIndex) {
            // Final segment (if any)
            if (!isDigit && !list.isEmpty()) {
                list.add(list = new ListItem());
                stack.push(list);
            }

            list.add(parseItem(isCombination, isDigit, normalizedBuffer.toString()));
        }

        while (!stack.isEmpty()) {
            list = (ListItem) stack.pop();
            list.normalize();
        }
    }
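The point above about Integer.parseInt relying on Character.digit(char, int) can be checked with a small standalone sketch. `ParseIntDigits` and `tryParse` are hypothetical names for illustration only. Per the Integer.parseInt Javadoc, digits are validated one `char` at a time, so a BMP digit should be accepted while a digit encoded as a surrogate pair should be rejected.

```java
// Sketch only: ParseIntDigits is a hypothetical helper, not part of Maven.
final class ParseIntDigits {

    /** Returns the parsed value, or null when Integer.parseInt rejects the input. */
    static Integer tryParse(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String bmpDigit = "\u0667";           // ARABIC-INDIC DIGIT SEVEN
        String nonBmpDigit = "\uD835\uDFE4";  // U+1D7E4, stored as a surrogate pair

        System.out.println("ASCII:   " + tryParse("7"));
        System.out.println("BMP:     " + tryParse(bmpDigit));
        System.out.println("non-BMP: " + tryParse(nonBmpDigit));

        // The code-point overload of Character.digit does understand the non-BMP digit.
        System.out.println("digit(int): " + Character.digit(nonBmpDigit.codePointAt(0), 10));
    }
}
```

This is why the suggested rewrite normalizes digits into a buffer before handing the segment to the parser: the parser alone would not accept the supplementary-plane digit.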

So, just to be clear, you prefer the alternate approach of treating all Unicode digits as version numbers? As long as they're all base 10, this should work. Looking at the Unicode details, there are some non-base-10 digits. They have a different Unicode character class, so I need to check what Character.isDigit is doing, i.e. whether it's checking for decimal digits or all digits. If it is checking for all digits, I need to check the character class directly. To be specific, we want De, possibly Di, but not Nu character classes.

OK, looks like Character.isDigit is checking specifically for decimal digits, not non-decimal digits.
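A quick way to confirm that observation is to compare Character.isDigit with the general category returned by Character.getType. Note that Java exposes the general category (Nd, Nl, No) rather than the Numeric_Type values (De, Di, Nu) mentioned above, so this is only an approximation of that check. `DigitCategories` is a hypothetical name used for this sketch.

```java
// Sketch only: DigitCategories is a hypothetical helper, not part of Maven.
final class DigitCategories {

    /** True only for general category Nd (decimal digit number). */
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0667, // ARABIC-INDIC DIGIT SEVEN, category Nd
            0x00B2, // SUPERSCRIPT TWO, category No
            0x2166  // ROMAN NUMERAL SEVEN, category Nl
        };
        for (int cp : samples) {
            System.out.printf("U+%04X Nd=%b isDigit=%b%n",
                    cp, isDecimalDigit(cp), Character.isDigit(cp));
        }
    }
}
```

For these samples the two columns agree, which matches the reading that Character.isDigit accepts decimal digits only.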

Ran into a problem with that approach. It treats different forms of the same digit as identical. For example, ASCII 7 compares equal to Arabic-Indic digit 7 (U+0667) even though they are different characters.
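The collision can be reproduced outside of Maven with a small sketch. `DigitNormalization.normalizeDigits` is a hypothetical helper that mimics the digit normalization in the suggested rewrite; it is not the actual `ComparableVersion` code.

```java
// Sketch only: DigitNormalization is a hypothetical helper, not part of Maven.
final class DigitNormalization {

    /** Maps every Unicode decimal digit to its ASCII equivalent; other code points pass through. */
    static String normalizeDigits(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10);
            if (Character.isDigit(cp) && value >= 0) {
                out.append((char) ('0' + value));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String ascii = "1.7";
        String arabicIndic = "1.\u0667";

        // The raw strings differ...
        System.out.println("raw equal:        " + ascii.equals(arabicIndic));
        // ...but they become indistinguishable once digits are normalized.
        System.out.println("normalized equal: "
                + normalizeDigits(ascii).equals(normalizeDigits(arabicIndic)));
    }
}
```

Two version strings that differ as text (and therefore as file names or URLs) would compare as the same version, which is the mismatch described above.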

I can see an argument for normalizing all these digits to ASCII digits for Maven purposes, but that would touch a lot of different places in the code and ecosystem including things like file names and artifact URLs in Maven Central.
I'm going to think about this a little more, but for the moment it feels safer to treat all the non-ASCII digits as distinct non-numeric characters.

Only supporting ASCII digits sounds fine to me too.

elharo · 2025-01-30T12:20:09Z

Probably worth filing a related bug for the resolver issues.

Handle non-BMP characters

bd99835

elharo changed the title ~~Handle non-BMP characters~~ [MNG-8241] Handle non-BMP characters Jan 30, 2025

elharo changed the title ~~[MNG-8241] Handle non-BMP characters~~ [MNG-8241] Handle non-BMP characters when comparing versions Jan 30, 2025

elharo added 2 commits January 29, 2025 20:50

Treat non-ASCII digits as strings

d62e852

spotless

d1a091c

gnodet approved these changes Jan 30, 2025

gnodet self-requested a review January 30, 2025 09:32

gnodet reviewed Jan 30, 2025

elharo marked this pull request as ready for review February 3, 2025 13:43

elharo marked this pull request as draft February 3, 2025 13:43
