
Unicode Support for Case Insensitive Matching in RegExp #14192

Open · wants to merge 8 commits into main

Conversation

@john-wagster commented Feb 3, 2025

About four years ago, ASCII-only case-insensitive matching (apache/lucene-solr#1541) was added to Lucene. In the past couple of years, several requests have been made for case-insensitive matching in Elasticsearch across the rest of UTF-8, which relies on Lucene's RegExp/Automaton support. Previous discussions around the ASCII-only work suggested that this task may be controversial. So I've spent a bit of time exploring options and have submitted this PR as what I believe is the best direction for that support, but I welcome feedback on this approach.

tl;dr: the approach I've taken is to mirror java.util.regex.Pattern, with the belief that downstream users of products like Elasticsearch would expect and welcome consistency in Unicode's edge cases, with case insensitivity handled the same way by Java's Pattern class and Lucene's RegExp class.

More specifically, @jpountz brought up concerns around the handling of characters such as sigma and its variants (Σ, σ, ς) (apache/lucene-solr#1541 (comment)). I spent some time investigating all of the characters in Unicode and tried to explain the edge cases within the RegExp class by enumerating the classes of characters and their behaviors, so we can easily discuss or pivot as desired. I then opted to handle these special classes of characters the same way java.util.regex.Pattern handles them. Often (though I haven't tested all code points) Perl regex seems to treat case insensitivity the same as the Pattern class. For instance, with both the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE matching flags set, the Pattern class treats the three sigma characters (Σ, σ, ς) as the same for the purposes of matching; so σ and ς are a positive match in a case-insensitive regex even though they do not match each other in a case-sensitive context (they are both lowercase).
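
For illustration, this is the java.util.regex behavior being mirrored (plain JDK, independent of this PR):

import java.util.regex.Pattern;

Pattern p = Pattern.compile("σ", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
boolean m1 = p.matcher("Σ").matches(); // true
boolean m2 = p.matcher("ς").matches(); // true: two lowercase forms match each other
// CASE_INSENSITIVE alone only folds ASCII, so the Greek letters no longer match:
Pattern ascii = Pattern.compile("σ", Pattern.CASE_INSENSITIVE);
boolean m3 = ascii.matcher("Σ").matches(); // false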

For this PR I've opted to preserve the best possible matching performance: special cases are handled with a lookup table while compiling the regex to an Automaton. Matching then only has to consider the necessary alternative characters to ensure a match, without any additional consideration for case sensitivity.
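
A minimal sketch of that compile-time expansion, where caseInsensitiveAlternatives is a hypothetical stand-in for the PR's lookup table (Automata and Operations are Lucene's existing automaton utilities):

// Expand one codepoint into an automaton accepting the codepoint plus its
// case-insensitive alternatives. The lookup happens once while parsing the
// regex, so matching itself pays no extra cost for case insensitivity.
int[] altCodepoints = caseInsensitiveAlternatives(codepoint); // hypothetical lookup
List<Automaton> altAutomaton = new ArrayList<>();
for (int alt : altCodepoints) {
  altAutomaton.add(Automata.makeChar(alt));
}
altAutomaton.add(Automata.makeChar(codepoint));
Automaton result = Operations.union(altAutomaton);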

While most characters outside the ASCII range can be handled simply by uppercasing or lowercasing them appropriately, during my review of the Unicode spec I encountered three distinct classes of characters that can be problematic.

  • Class 1 is the set of characters, such as sigma, that match characters beyond their immediate toUpperCase and toLowerCase forms. In spidering the Unicode table with some utility classes, I found that no more than four characters were ever included in a set of matched alternatives.
  • Class 2 is the set of characters that have both a different and distinct toUpperCase and toLowerCase form. An example is Dž, whose uppercase form is DŽ and whose lowercase form is dž.
  • Class 3 is the set of characters that have some cased form that transitions the Basic Multilingual Plane (BMP). These characters are typically not matched by java.util.regex.Pattern (likely for performance reasons, as they transition from, say, 2-byte representations to 4-byte representations in UTF-8), so they are explicitly excluded from matching here as well. As it turns out, that is relatively easy, because Java's Character class toUpperCase and toLowerCase methods for code points, which we were already using in RegExp, do not case these characters. It's worth noting that supporting this is entirely possible: Java's String class toUpperCase and toLowerCase methods do correctly transition characters across the BMP. Lastly, these classes can and do overlap, so if a character is in both Class 3 and Class 2, for instance, we again defer to the behavior seen in java.util.regex.Pattern and ignore the casing across the BMP but case appropriately otherwise, such as with a character that will match its lowercase form in the BMP but not its uppercase form outside the BMP, ΩΙ. (A short JDK illustration follows this list.)
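
A quick JDK-only illustration of the Class 2 and Class 3 behavior described above:

// Class 2: U+01C5 (Dž) has distinct single-codepoint upper and lower forms.
int upper = Character.toUpperCase(0x01C5); // 0x01C4, DŽ
int lower = Character.toLowerCase(0x01C5); // 0x01C6, dž

// Class 3-style exclusion: ß (U+00DF) only uppercases by expanding to two
// characters, so the codepoint-based Character methods leave it unchanged
// while String's locale-aware casing expands it.
int sharp = Character.toUpperCase(0x00DF); // 0x00DF: no single-codepoint mapping
String ss = "ß".toUpperCase(java.util.Locale.ROOT); // "SS"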

Edit:
For anyone just diving into this issue: after discussion with @rmuir, I realized that Pattern is not comprehensive in how it handles casing. Instead, on his recommendation, we are going to discuss using Unicode's latest case folding spec: https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
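
For reference, entries in CaseFolding.txt look like this (status C = common, F = full, S = simple, T = Turkic; the simple 1-1 folds are the C and S lines):

0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
0049; T; 0131; # LATIN CAPITAL LETTER I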

My hope for this PR is still to maintain matching parity with Pattern where it makes sense, such that if Pattern matches something, ideally RegExp would too.

@john-wagster (Author)

@jpountz, @jimczi, @mayya-sharipova: y'all may be interested in this PR, so just tagging you here in case you'd like to follow along.

}
}

public static void generateClass1() {
Author:

Not sure where these utility functions should live; they are nice for validating and regenerating relevant code points, so I'm putting them here for now, but I'm open to suggestions on whether/where they belong.

@rmuir (Member) left a comment:

I think matching Java is OK, but we need to explain what the special cases are doing, because the current explanation doesn't add up.

* with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
* 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
* 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
*/
@rmuir (Member) commented Feb 3, 2025:

I don't understand this comment. Where is this "in or out of BMP transitioning":

ﬗ is U+FB17 ARMENIAN SMALL LIGATURE MEN XEH
ՄԽ is U+0544 ARMENIAN CAPITAL LETTER MEN followed by U+053D ARMENIAN CAPITAL LETTER XEH

All of this is in the BMP. So we need a better explanation of what this is all about since it isn't that.

Using hex would help when reading it, but even with the decimal numbers, you can see that nothing in this comment is outside of the BMP; it is all less than 64k.

Author:

You are right; I'm poking around a bit more to better understand what's happening here. I was debugging through the Pattern class to see how this was handled there and naively assumed it was a single character, but it's being uppercased into two characters within the BMP. Let me go back and double-check these characters, revise the comment to more generically call out that we respect the same behavior as Pattern (not matching when going from a single code point to multiple code points), and remove references to the BMP.

static final Map<Integer, int[]> unstableUnicodeCharacters =
Map.ofEntries(
// these are the set of characters whose casing matches across multiple characters
entry(181, new int[] {924, 956}),
Member:

Please use hex for the Unicode everywhere.

Author:

will do

Member:

Didn't mean to come off short here.

But hex is just the de facto standard; e.g. you'll see that representation in all the Unicode data files. It helps to have an unambiguous way to refer to the same codepoint.

There are quite a few Lucene code reviewers who can look at the hex digits and recognize things: maybe how many bytes a character takes up in UTF-8, maybe whether it is a compatibility character that might explode (﷽), maybe what block it is in, and so on.

Since they are all codepoints (int), it will help to do them all in hex like 0x1F602, as that notation can represent all codepoints. Only bother with \uXXXX when dealing with Java's UTF-16 char type.
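
A tiny JDK illustration of the distinction:

int emoji = 0x1F602; // 😂 FACE WITH TEARS OF JOY: one codepoint, outside the BMP
String s = "\uD83D\uDE02"; // the same codepoint written as a UTF-16 surrogate pair
boolean same = s.codePointAt(0) == emoji; // true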

entry(75, new int[] {8490, 107}),
entry(8491, new int[] {229, 197}),
entry(229, new int[] {8491, 197}),
entry(197, new int[] {8491, 229}),
Member:

For these big sets of numbers, can we express them as some UnicodeSet(s) or properties from the Unicode data files (e.g. has full case-folding) so we understand where they are coming from?

It helps to verify and keep the list updated.

For example, maybe this is some of them (maybe not, it is just an example, as I can't tell if this code is trying to handle characters that expand due to normalization, casing, or both)

[[:Expands_On_NFKC=Yes:]&[[:Changes_When_Casefolded:][:Changes_When_Uppercased:][:Changes_When_Lowercased:]]]

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3AExpands_On_NFKC%3DYes%3A%5D%26%5B%5B%3AChanges_When_Casefolded%3A%5D%5B%3AChanges_When_Uppercased%3A%5D%5B%3AChanges_When_Lowercased%3A%5D%5D%5D&g=&i=

https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt

Author:

Good suggestion; let me poke around a bit more and see if I can somehow do that. My intention was to follow Pattern rather than the Unicode spec (and let Pattern follow some version of the spec), but it would be nice if there were obvious sets that all of these fell into, rather than this hardcoded mapping.

Member:

Yeah, I'm less concerned about what Pattern is doing. I am happy you have a solution though. I don't really want to look at Pattern, and given that Java STILL doesn't expose case folding, I'm guessing I'd be sad if I looked.

From unicode perspective I think this is where we want to move towards?
https://unicode.org/reports/tr18/#Simple_Loose_Matches

So if the answer for this special set is that they are the ones in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt, that will solve my problem. Then we have a way to keep everything up to date and understand what all the exceptions are for.

Author:

I'm sold. I was comparing the CaseFolding spec with the Pattern and String classes today and found a (albeit small) number of characters from the Unicode spec that neither correctly supports, which it seemed to me should pretty clearly be supported, such as:

https://util.unicode.org/UnicodeJsps/character.jsp?a=1C8A&B1=Show
https://util.unicode.org/UnicodeJsps/character.jsp?a=0x1C89&B1=Show

I have some changes related to your other comments, but I'm going to iterate a bit further, and I'll likely need some additional feedback on an approach that incorporates that spec; we may also need to talk about how to maintain the case folding as the Unicode spec iterates. I was hoping to rely heavily on the Pattern class being updated so we didn't have to do as much maintenance on our end to make sure case insensitivity continued to work appropriately.

Also, just to be clear, the intention of the PR is to handle only character casing, not normalization (I just forgot to address that yesterday).

Member:

Yeah, I think we want the simple "L1" 1-1 case folding, no expansion/contraction/normalization/turkish-azeri alt/etc.

A big part of the problem is that Java offers NO method to support this, which is why you are having to create lists of exceptions yourself. Java only offers Character.toUpperCase()/toLowerCase()/toTitleCase(), which are for display purposes.

From an i18n perspective, if I had one Unicode-related feature request for OpenJDK, it would be to add Character.foldCase(), which is for case insensitivity and is different. Imagine if you had this method in Character.java; your PR would be very tiny: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UCharacter.html#foldCase-int-boolean-
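
For reference, here is what that ICU4J method returns for a few of the characters discussed in this thread (not part of the PR; shown only to illustrate simple 1-1 folding, with defaultmapping = true, i.e. non-Turkic):

import com.ibm.icu.lang.UCharacter;

int a = UCharacter.foldCase(0x03A3, true); // 0x03C3: Σ folds to σ
int b = UCharacter.foldCase(0x03C2, true); // 0x03C3: ς also folds to σ
int c = UCharacter.foldCase(0x1E9E, true); // 0x00DF: ẞ simple-folds to ß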

altAutomaton.add(Automata.makeChar(altCodepoints[i]));
}
altAutomaton.add(Automata.makeChar(codepoint));
result = Operations.union(altAutomaton);
Member:

I opened #14193 to try to improve this case. It is hard to think about the worst cases and so on today; from what I see, both the current ASCII-only code and this patch could just use an int[] and get back a minimal automaton.

Author:

awesome; I took a brief look and it looked good to me. Part of the reason I was building these up in a List was related to a PR that @mayya-sharipova submitted recently where she found performance issues with unions: #14169

Member:

yeah that PR does not fix any hard problem (optimizing union and concatenate to be better in general). But it can fix your problem.

And we could test, but I'd guess that this PR should be very fast, and we should consider the option of deprecating the previous ASCII option and just calling this one CASE_INSENSITIVE, because we can make the argument that it is just as fast or faster than the previous ASCII option, and we end up with a cleaner API and less maintenance.

@john-wagster (Author)

Iterated here a bit after the changes in #14193 went in, and also pivoted to using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. I still need to do some additional validation and work through a couple of things here, so this isn't done yet, but feedback is of course always welcome on this direction.


/**
* Generates the case folding set of alternate character mappings based on this spec:
* https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
Member:

I think this one uses a too-huge file? We don't need all the case-folding mappings of Unicode, just the exceptions?

Look at https://www.unicode.org/Public/16.0.0/ucd/SpecialCasing.txt which is only exceptional ones and should be much smaller to deal with.

Author:

Yep, agreed. I want to take a look and compare. In the meantime I've been digging into the Character.toLowerCase and Character.toUpperCase code, and it's more complex than I was hoping; I'm curious about the performance between them. I'm kind of leaning toward not using those, though, and the full CaseFolding.txt file is only around 24kb in memory and could be optimized if we are concerned about that size for any reason. Definitely appreciate your perspective here, though, so I'm curious about your thoughts. In the meantime I'll continue to validate and explore the differences so we have more information.
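
As a rough sketch of what building the mapping from CaseFolding.txt could look like (hypothetical, keeping only the C and S statuses, i.e. the simple 1-1 folds; field layout is per the file's own header):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Parse "<code>; <status>; <mapping>; # <name>" lines into codepoint -> simple fold.
Map<Integer, Integer> simpleFolds = new HashMap<>();
for (String line : Files.readAllLines(Path.of("CaseFolding.txt"))) {
  if (line.isEmpty() || line.startsWith("#")) continue; // skip comments and blanks
  String[] fields = line.split(";");
  String status = fields[1].trim();
  if (status.equals("C") || status.equals("S")) { // ignore F (full) and T (Turkic)
    int code = Integer.parseInt(fields[0].trim(), 16);
    int mapping = Integer.parseInt(fields[2].trim(), 16);
    simpleFolds.put(code, mapping);
  }
}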

Member:

I would lean on those, trust me. I think adding text files and multiple .java files is overkill to start with on this PR?

The main thing I wanted was a proper explanation of what the "magic numbers" are. The if-then-else or switch statement like you had is fine; it just needs to be explained so we can reason about it, and laid out in a way we can maintain if needed and ensure correctness.

Here's a REALLY big switch statement as example: https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java#L370-L376

See how the code is easy to review just because of how it's formatted, with hex notation, a comment with the character, and the character name? And if you look at the top of the file, it references where they come from.

So e.g. in your case, you just add a comment linking to SpecialCasing.txt or whatever the source is, format the exceptional cases better, and so on. My gut says the list of exceptions should be small and very stable, and then it is easy on us.
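
For concreteness, a sketch of what that layout could look like for two of the characters discussed earlier (the method name and entries are illustrative, not the PR's actual table):

// Alternates that Character.toUpperCase/toLowerCase alone would miss, following
// the ASCIIFoldingFilter layout: hex codepoint, the character, and its name.
static int[] alternates(int codepoint) {
  switch (codepoint) {
    case 0x03C2: // ς GREEK SMALL LETTER FINAL SIGMA
      return new int[] {0x03A3, 0x03C3}; // Σ, σ
    case 0x01C5: // Dž LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
      return new int[] {0x01C4, 0x01C6}; // DŽ, dž
    default:
      return null;
  }
}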

Author:

So relying on SpecialCasing.txt by itself is not going to work. It lacks several of the more common mappings like:

0x1C84 # CYRILLIC SMALL LETTER TALL TE
0x1C85 # CYRILLIC SMALL LETTER THREE-LEGGED TE
0x0422 # CYRILLIC CAPITAL LETTER TE
0x0442 # CYRILLIC SMALL LETTER TE

I can combine SpecialCasing.txt with UnicodeData.txt or CaseFolding.txt, but it really doesn't buy us anything: SpecialCasing.txt is just missing information about all of the "simple" folds. I could take CaseFolding.txt and then filter it by what Character.toUpperCase and Character.toLowerCase can map to, but it strikes me that I would be doing that simply to reduce the map size from 24kb, which seems pretty small to me to begin with. And then at runtime we'd be calling Character.toUpperCase and Character.toLowerCase to undo that and get those mappings back.

I can see an argument that there's an upside to not maintaining the mappings that "Java already supports", but then it seems painful to know which version of the Unicode spec we actually support. If we build the mapping from CaseFolding.txt only, it's simpler logic at regex build time and easier to document: we can say we support version 16.0 of the Unicode case folding spec. I can, and already plan to, update the casefolding.txt file with comments about which characters are folding. So I'm not sure there's a huge difference between hardcoding it in a Java file vs. loading it via a text file. If there were a single source, such that we didn't need to generate the mapping and could read directly from, say, CaseFolding.txt or SpecialCasing.txt, hardcoding would make more sense to me, but I'm a little concerned it would be hard to see edits to the spec reflected appropriately without maintaining some kind of utility class like what I currently have as CaseFoldingUtil.

But if you want to push back hard on some of this, I don't mind building a utility as a one-off that outputs a bunch of case statements or Map Entry lines instead of a checked-in utility that outputs a bunch of text lines; the two seem pretty equivalent to me. It seems nice to maintain the mapping utility in Lucene itself, imo. I'd prefer to get something in for Unicode support rather than nothing, though. And particularly for my use case here (I'm trying to clean up and unify regex in Elasticsearch), it would be nice to expose a CaseFolding class that I can reference in ES, instead of, say, using icu4j in ES while Lucene has a potentially slightly different opinion.

Lastly, if the map size of 24kb is concerning, I can also conditionally load it only when the flag is set. I'm kind of thinking this is a good idea anyway, but I didn't want to add unnecessary complexity.

To be confident about some of this, I wrote code locally to do much of it. I can show you the differences between using SpecialCasing.txt vs. all of CaseFolding.txt vs. what I did originally, which was to evaluate all of the matches that Pattern pulls back. Let me know if it would be helpful to review one of those and/or see some snippets in comments here.

Member:

Not worried about what Pattern does, or about offering general-purpose Unicode APIs to Elasticsearch.

But mainly, if we want to bring all Unicode case-folding data into lucene-core and offer some general API around it, I want it to work well enough to be performant and usable by analyzers and so on; boxing and unboxing and hashmaps on every character aren't great for that. The map is also not 24KB: this file has 3,000 lines and is using objects (Integer, int[]) per mapping, so it is much more.

For regexp, it is enough to just have it like the first patch, only documented so we know where the magic numbers are coming from. And parsing a regexp doesn't need to be fast, so that makes it easier to start. If we want to offer some general public API, then it needs to look different.

/**
* A utility class that allows downstream code to be consistent with how Lucene handles case folding
*/
public class CaseFolding {
Member:

Does this really need to be public? If so, can it offer an actual case-folding function: fold(int) -> int?

It doesn't offer that today; it just allows generating all the alternatives that have the same case folding. That method should have another name.

@john-wagster (Author)

@rmuir I made another pass based on your feedback, and I agree with keeping this simple for a first pass. To that end I've done the following:

  • CaseFolding is no longer externally exposed (can revisit this later)
  • CaseFolding has one method to get the set of alternates, renamed from fold, which was previously inaccurate; the idea is to introduce a true fold method in the future if it's valuable.
  • CaseFolding only includes the set of edge cases that are not handled by Character.toLowerCase and Character.toUpperCase, and it is now a switch/case statement with hex codepoints and comments on every character
  • The utility class CaseFoldingUtil for generating the switch / case statement code in the CaseFolding class has been removed.
  • ASCII_CASE_INSENSITIVE flag has been deprecated and has the same behavior as the CASE_INSENSITIVE flag
  • Tried to minimize changes otherwise with less concern for performance during compilation of the regex

Thoughts at this point and whether I've adequately incorporated your feedback would be much appreciated.
