
Unicode Support for Case Insensitive Matching in RegExp #14192

Open · wants to merge 8 commits into main

Conversation

@john-wagster commented Feb 3, 2025

About four years ago, ASCII-only case-insensitive matching (apache/lucene-solr#1541) was added to Lucene. In the past couple of years, several requests have been made for case-insensitive matching in Elasticsearch across the rest of UTF-8, which relies on Lucene's RegExp/Automaton support. Previous discussions around the ASCII-only work suggested that this task may be controversial. So I've spent a bit of time exploring options and have submitted this PR as what I believe is the best direction for that support, but I welcome feedback on this approach.

tl;dr: the approach I've taken is to mirror java.util.regex.Pattern, with the belief that downstream users of products like Elasticsearch would expect and welcome consistency in Unicode's edge cases, with case insensitivity handled the same way by Java's Pattern class and Lucene's RegExp class.

More specifically, @jpountz brought up concerns around the handling of characters such as sigma and its variants (Σ, σ, ς) (apache/lucene-solr#1541 (comment)). I spent some time investigating all of the characters in Unicode and tried to explain the edge cases within the RegExp class by enumerating the classes of characters and their behaviors, so we can easily discuss or pivot as desired. I then opted to handle these special classes of characters the same way java.util.regex.Pattern handles them. Often (though I haven't tested all code points) Perl regex seems to treat case insensitivity the same as the Pattern class. For instance, with both the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE matching flags set, the Pattern class treats the three sigma characters (Σ, σ, ς) as the same for the purposes of matching; so σ and ς are a positive match in a case-insensitive regex even though they do not match each other in a case-sensitive context (they are both lowercase).
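
For illustration, this is the java.util.regex behavior being mirrored (plain JDK, independent of this PR):

import java.util.regex.Pattern;

Pattern p = Pattern.compile("σ", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
boolean m1 = p.matcher("Σ").matches(); // true
boolean m2 = p.matcher("ς").matches(); // true: two lowercase forms match each other
// CASE_INSENSITIVE alone only folds ASCII, so the Greek letters no longer match:
Pattern ascii = Pattern.compile("σ", Pattern.CASE_INSENSITIVE);
boolean m3 = ascii.matcher("Σ").matches(); // false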

For this PR I've opted to preserve the best possible matching performance: special cases are handled with a lookup table while compiling the regex to an Automaton. Matching then only has to consider the necessary alternative characters to ensure a match, without any additional consideration for case sensitivity.
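
A minimal sketch of that compile-time expansion, where caseInsensitiveAlternatives is a hypothetical stand-in for the PR's lookup table (Automata and Operations are Lucene's existing automaton utilities):

// Expand one codepoint into an automaton accepting the codepoint plus its
// case-insensitive alternatives. The lookup happens once while parsing the
// regex, so matching itself pays no extra cost for case insensitivity.
int[] altCodepoints = caseInsensitiveAlternatives(codepoint); // hypothetical lookup
List<Automaton> altAutomaton = new ArrayList<>();
for (int alt : altCodepoints) {
  altAutomaton.add(Automata.makeChar(alt));
}
altAutomaton.add(Automata.makeChar(codepoint));
Automaton result = Operations.union(altAutomaton);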

While most characters outside the ASCII range can be handled simply by uppercasing or lowercasing them appropriately, during my review of the Unicode spec I encountered three distinct classes of characters that can be problematic.

  • Class 1 is the set of characters, such as sigma, that match characters beyond their immediate toUpperCase and toLowerCase forms. In spidering the Unicode table with some utility classes, I found that no more than four characters were ever included in a set of matched alternatives.
  • Class 2 is the set of characters that have both a different and distinct toUpperCase and toLowerCase form. An example is Dž, whose uppercase form is DŽ and whose lowercase form is dž.
  • Class 3 is the set of characters that have some cased form that transitions the Basic Multilingual Plane (BMP). These characters are typically not matched by java.util.regex.Pattern (likely for performance reasons, as they transition from, say, 2-byte representations to 4-byte representations in UTF-8), so they are explicitly excluded from matching here as well. As it turns out, that is relatively easy, because Java's Character class toUpperCase and toLowerCase methods for code points, which we were already using in RegExp, do not case these characters. It's worth noting that supporting this is entirely possible: Java's String class toUpperCase and toLowerCase methods do correctly transition characters across the BMP. Lastly, these classes can and do overlap, so if a character is in both Class 3 and Class 2, for instance, we again defer to the behavior seen in java.util.regex.Pattern and ignore the casing across the BMP but case appropriately otherwise, such as with a character that will match its lowercase form in the BMP but not its uppercase form outside the BMP, ΩΙ. (A short JDK illustration follows this list.)
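
A quick JDK-only illustration of the Class 2 and Class 3 behavior described above:

// Class 2: U+01C5 (Dž) has distinct single-codepoint upper and lower forms.
int upper = Character.toUpperCase(0x01C5); // 0x01C4, DŽ
int lower = Character.toLowerCase(0x01C5); // 0x01C6, dž

// Class 3-style exclusion: ß (U+00DF) only uppercases by expanding to two
// characters, so the codepoint-based Character methods leave it unchanged
// while String's locale-aware casing expands it.
int sharp = Character.toUpperCase(0x00DF); // 0x00DF: no single-codepoint mapping
String ss = "ß".toUpperCase(java.util.Locale.ROOT); // "SS"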

Edit:
For anyone just diving into this issue: after discussion with @rmuir, I realized that Pattern is not comprehensive in how it handles casing. Instead, on his recommendation, we are going to discuss using Unicode's latest case folding spec: https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
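
For reference, entries in CaseFolding.txt look like this (status C = common, F = full, S = simple, T = Turkic; the simple 1-1 folds are the C and S lines):

0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
0049; T; 0131; # LATIN CAPITAL LETTER I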

My hope for this PR is still to maintain matching parity with Pattern where it makes sense, such that if Pattern matches something, ideally RegExp would too.

@john-wagster (Author)

@jpountz, @jimczi, @mayya-sharipova: y'all may be interested in this PR, so just tagging you here in case you'd like to follow along.

}
}

public static void generateClass1() {
Author:

Not sure where these utility functions should live; they are nice for validating and regenerating relevant code points, so I'm putting them here for now, but I'm open to suggestions on whether/where they belong.

@rmuir (Member) left a comment:

I think matching Java is OK, but we need to explain what the special cases are doing, because the current explanation doesn't add up.

* with 4 bytes) and are therefore ignored: 223, 304, 329, 496, 912, 944, 1415, 7830-7834, 8016,
* 8018, 8020, 8022, 8064-8111, 8114-8116, 8118, 8119, 8124, 8130-8132, 8134, 8135, 8140, 8146,
* 8147, 8150, 8151, 8162-8164, 8166-8180, 8182, 8183, 8188, 64256-64262, 64275-64279
*/
@rmuir (Member) commented Feb 3, 2025:

I don't understand this comment. Where is this "in or out of BMP transitioning":

ﬗ is U+FB17 ARMENIAN SMALL LIGATURE MEN XEH
ՄԽ is U+0544 ARMENIAN CAPITAL LETTER MEN followed by U+053D ARMENIAN CAPITAL LETTER XEH

All of this is in the BMP. So we need a better explanation of what this is all about since it isn't that.

Using hex would help when reading it, but even with the decimal numbers, you can see that nothing in this comment is outside of the BMP; it is all less than 64k.

Author:

You are right; I'm poking around a bit more to better understand what's happening here. I was debugging through the Pattern class to see how this was handled there and naively assumed it was a single character, but it's being uppercased into two characters within the BMP. Let me go back and double-check these characters, revise the comment to more generically call out that we respect the same behavior as Pattern (not matching when going from a single code point to multiple code points), and remove references to the BMP.

static final Map<Integer, int[]> unstableUnicodeCharacters =
Map.ofEntries(
// these are the set of characters whose casing matches across multiple characters
entry(181, new int[] {924, 956}),
Member:

Please use hex for the Unicode everywhere.

Author:

will do

Member:

Didn't mean to come off short here.

But hex is just the de facto standard; e.g. you'll see that representation in all the Unicode data files. It helps to have an unambiguous way to refer to the same codepoint.

There are quite a few Lucene code reviewers who can look at the hex digits and recognize things: maybe how many bytes a character takes up in UTF-8, maybe whether it is a compatibility character that might explode (﷽), maybe what block it is in, and so on.

Since they are all codepoints (int), it will help to do them all in hex like 0x1F602, as that notation can represent all codepoints. Only bother with \uXXXX when dealing with Java's UTF-16 char type.
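
A tiny JDK illustration of the distinction:

int emoji = 0x1F602; // 😂 FACE WITH TEARS OF JOY: one codepoint, outside the BMP
String s = "\uD83D\uDE02"; // the same codepoint written as a UTF-16 surrogate pair
boolean same = s.codePointAt(0) == emoji; // true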

entry(75, new int[] {8490, 107}),
entry(8491, new int[] {229, 197}),
entry(229, new int[] {8491, 197}),
entry(197, new int[] {8491, 229}),
Member:

For these big sets of numbers, can we express them as some UnicodeSet(s) or properties from the Unicode data files (e.g. has full case-folding) so we understand where they are coming from?

It helps to verify and keep the list updated.

For example, maybe this is some of them (maybe not, it is just an example, as I can't tell if this code is trying to handle characters that expand due to normalization, casing, or both)

[[:Expands_On_NFKC=Yes:]&[[:Changes_When_Casefolded:][:Changes_When_Uppercased:][:Changes_When_Lowercased:]]]

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3AExpands_On_NFKC%3DYes%3A%5D%26%5B%5B%3AChanges_When_Casefolded%3A%5D%5B%3AChanges_When_Uppercased%3A%5D%5B%3AChanges_When_Lowercased%3A%5D%5D%5D&g=&i=

https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt

Author:

Good suggestion; let me poke around a bit more and see if I can somehow do that. My intention was to follow Pattern rather than the Unicode spec (and let Pattern follow some version of the spec), but it would be nice if there were obvious sets that all of these fell into, rather than this hardcoded mapping.

Member:

Yeah, I'm less concerned about what Pattern is doing. I am happy you have a solution though. I don't really want to look at Pattern, and given that Java STILL doesn't expose case folding, I'm guessing I'd be sad if I looked.

From unicode perspective I think this is where we want to move towards?
https://unicode.org/reports/tr18/#Simple_Loose_Matches

So if the answer for this special set is that they are the ones in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt, that will solve my problem. Then we have a way to keep everything up to date and understand what all the exceptions are for.

Author:

I'm sold. I was comparing the CaseFolding spec with the Pattern and String classes today and found a (albeit small) number of characters from the Unicode spec that neither correctly supports, which it seemed to me should pretty clearly be supported, such as:

https://util.unicode.org/UnicodeJsps/character.jsp?a=1C8A&B1=Show
https://util.unicode.org/UnicodeJsps/character.jsp?a=0x1C89&B1=Show

I have some changes related to your other comments, but I'm going to iterate a bit further, and I'll likely need some additional feedback on an approach that incorporates that spec; we may also need to talk about how to maintain the case folding as the Unicode spec iterates. I was hoping to rely heavily on the Pattern class being updated so we didn't have to do as much maintenance on our end to make sure case insensitivity continued to work appropriately.

Also, just to be clear, the intention of the PR is to handle only character casing, not normalization (I just forgot to address that yesterday).

Member:

Yeah, I think we want the simple "L1" 1-1 case folding, no expansion/contraction/normalization/turkish-azeri alt/etc.

A big part of the problem is that Java offers NO method to support this, which is why you are having to create lists of exceptions yourself. Java only offers Character.toUpperCase()/toLowerCase()/toTitleCase(), which are for display purposes.

From an i18n perspective, if I had one Unicode-related feature request for OpenJDK, it would be to add Character.foldCase(), which is for case insensitivity and is different. Imagine if you had this method in Character.java; your PR would be very tiny: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UCharacter.html#foldCase-int-boolean-
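
For reference, here is what that ICU4J method returns for a few of the characters discussed in this thread (not part of the PR; shown only to illustrate simple 1-1 folding, with defaultmapping = true, i.e. non-Turkic):

import com.ibm.icu.lang.UCharacter;

int a = UCharacter.foldCase(0x03A3, true); // 0x03C3: Σ folds to σ
int b = UCharacter.foldCase(0x03C2, true); // 0x03C3: ς also folds to σ
int c = UCharacter.foldCase(0x1E9E, true); // 0x00DF: ẞ simple-folds to ß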

altAutomaton.add(Automata.makeChar(altCodepoints[i]));
}
altAutomaton.add(Automata.makeChar(codepoint));
result = Operations.union(altAutomaton);
Member:

I opened #14193 to try to improve this case. It is hard to think about the worst cases and so on today; from what I see, both the current ASCII-only code and this patch could just use an int[] and get back a minimal automaton.

Author:

awesome; I took a brief look and it looked good to me. Part of the reason I was building these up in a List was related to a PR that @mayya-sharipova submitted recently where she found performance issues with unions: #14169

Member:

yeah that PR does not fix any hard problem (optimizing union and concatenate to be better in general). But it can fix your problem.

And we could test, but I'd guess that this PR should be very fast, and we should consider the option of deprecating the previous ASCII option and just calling this one CASE_INSENSITIVE, because we can make the argument that it is just as fast or faster than the previous ASCII option, and we end up with a cleaner API and less maintenance.

@john-wagster (Author)

Iterated here a bit after the changes in #14193 went in, and also pivoted to using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. I still need to do some additional validation and work through a couple of things here, so this isn't done yet, but feedback is of course always welcome on this direction.


/**
* Generates the case folding set of alternate character mappings based on this spec:
* https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
Member:

I think this one uses a too-huge file? We don't need all the case-folding mappings of Unicode, just the exceptions?

Look at https://www.unicode.org/Public/16.0.0/ucd/SpecialCasing.txt which is only exceptional ones and should be much smaller to deal with.

Author:

Yep, agreed. I want to take a look and compare. In the meantime I've been digging into the Character.toLowerCase and Character.toUpperCase code, and it's more complex than I was hoping; I'm curious about the performance between them. I'm kind of leaning toward not using those, though, and the full CaseFolding.txt file is only around 24kb in memory and could be optimized if we are concerned about that size for any reason. Definitely appreciate your perspective here, though, so I'm curious about your thoughts. In the meantime I'll continue to validate and explore the differences so we have more information.
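
As a rough sketch of what building the mapping from CaseFolding.txt could look like (hypothetical, keeping only the C and S statuses, i.e. the simple 1-1 folds; field layout is per the file's own header):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Parse "<code>; <status>; <mapping>; # <name>" lines into codepoint -> simple fold.
Map<Integer, Integer> simpleFolds = new HashMap<>();
for (String line : Files.readAllLines(Path.of("CaseFolding.txt"))) {
  if (line.isEmpty() || line.startsWith("#")) continue; // skip comments and blanks
  String[] fields = line.split(";");
  String status = fields[1].trim();
  if (status.equals("C") || status.equals("S")) { // ignore F (full) and T (Turkic)
    int code = Integer.parseInt(fields[0].trim(), 16);
    int mapping = Integer.parseInt(fields[2].trim(), 16);
    simpleFolds.put(code, mapping);
  }
}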

Member:

I would lean on those, trust me. I think adding text files and multiple .java files is overkill to start with on this PR?

The main thing I wanted was a proper explanation of what the "magic numbers" are. The if-then-else or switch statement like you had is fine; it just needs to be explained so we can reason about it, and laid out in a way we can maintain if needed and ensure correctness.

Here's a REALLY big switch statement as example: https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java#L370-L376

See how the code is easy to review just because of how it's formatted, with hex notation, a comment with the character, and the character name? And if you look at the top of the file, it references where they come from.

So e.g. in your case, you just add a comment linking to SpecialCasing.txt or whatever the source is, format the exceptional cases better, and so on. My gut says the list of exceptions should be small and very stable, and then it is easy on us.
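
For concreteness, a sketch of what that layout could look like for two of the characters discussed earlier (the method name and entries are illustrative, not the PR's actual table):

// Alternates that Character.toUpperCase/toLowerCase alone would miss, following
// the ASCIIFoldingFilter layout: hex codepoint, the character, and its name.
static int[] alternates(int codepoint) {
  switch (codepoint) {
    case 0x03C2: // ς GREEK SMALL LETTER FINAL SIGMA
      return new int[] {0x03A3, 0x03C3}; // Σ, σ
    case 0x01C5: // Dž LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
      return new int[] {0x01C4, 0x01C6}; // DŽ, dž
    default:
      return null;
  }
}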

Author:

So relying on SpecialCasing.txt by itself is not going to work. It lacks several of the more common mappings like:

0x1C84 # CYRILLIC SMALL LETTER TALL TE
0x1C85 # CYRILLIC SMALL LETTER THREE-LEGGED TE
0x0422 # CYRILLIC CAPITAL LETTER TE
0x0442 # CYRILLIC SMALL LETTER TE

I can combine SpecialCasing.txt with UnicodeData.txt or CaseFolding.txt, but it really doesn't buy us anything: SpecialCasing.txt is just missing information about all of the "simple" folds. I could take CaseFolding.txt and then filter it by what Character.toUpperCase and Character.toLowerCase can map to, but it strikes me that I would be doing that simply to reduce the map size from 24kb, which seems pretty small to me to begin with. And then at runtime we'd be calling Character.toUpperCase and Character.toLowerCase to undo that and get those mappings back.

I can see an argument that there's an upside to not maintaining the mappings that "Java already supports", but then it seems painful to know which version of the Unicode spec we actually support. If we build the mapping from CaseFolding.txt only, it's simpler logic at regex build time and easier to document: we can say we support version 16.0 of the Unicode case folding spec. I can, and already plan to, update the casefolding.txt file with comments about which characters are folding. So I'm not sure there's a huge difference between hardcoding it in a Java file vs. loading it via a text file. If there were a single source, such that we didn't need to generate the mapping and could read directly from, say, CaseFolding.txt or SpecialCasing.txt, hardcoding would make more sense to me, but I'm a little concerned it would be hard to see edits to the spec reflected appropriately without maintaining some kind of utility class like what I currently have as CaseFoldingUtil.

But if you want to push back hard on some of this, I don't mind building a utility as a one-off that outputs a bunch of case statements or Map Entry lines instead of a checked-in utility that outputs a bunch of text lines; the two seem pretty equivalent to me. It seems nice to maintain the mapping utility in Lucene itself, imo. I'd prefer to get something in for Unicode support rather than nothing, though. And particularly for my use case here (I'm trying to clean up and unify regex in Elasticsearch), it would be nice to expose a CaseFolding class that I can reference in ES, instead of, say, using icu4j in ES while Lucene has a potentially slightly different opinion.

Lastly, if the map size of 24kb is concerning, I can also conditionally load it only when the flag is set. I'm kind of thinking this is a good idea anyway, but I didn't want to add unnecessary complexity.

To be confident about some of this, I wrote code locally to do much of it. I can show you the differences between using SpecialCasing.txt vs. all of CaseFolding.txt vs. what I did originally, which was to evaluate all of the matches that Pattern pulls back. Let me know if it would be helpful to review one of those and/or see some snippets in comments here.

Member:

Not worried about what Pattern does, or about offering general-purpose Unicode APIs to Elasticsearch.

But mainly, if we want to bring all Unicode case-folding data into lucene-core and offer some general API around it, I want it to work well enough to be performant and usable by analyzers and so on; boxing and unboxing and hashmaps on every character aren't great for that. The map is also not 24KB: this file has 3,000 lines and is using objects (Integer, int[]) per mapping, so it is much more.

For regexp, it is enough to just have it like the first patch, only documented so we know where the magic numbers are coming from. And parsing a regexp doesn't need to be fast, so that makes it easier to start. If we want to offer some general public API, then it needs to look different.

/**
* A utility class that allows downstream code to be consistent with how Lucene handles case folding
*/
public class CaseFolding {
Member:

Does this really need to be public? If so, can it offer an actual case-folding function: fold(int) -> int?

It doesn't offer that today; it just allows generating all the alternatives that have the same case folding. That method should have another name.

@john-wagster (Author)

@rmuir I made another pass based on your feedback, and I agree with keeping this simple for a first pass. To that end I've done the following:

  • CaseFolding is no longer externally exposed (can revisit this later)
  • CaseFolding has one method to get the set of alternates, renamed from fold, which was previously inaccurate; the idea is to introduce a true fold method in the future if it's valuable.
  • CaseFolding only includes the set of edge cases that are not handled by Character.toLowerCase and Character.toUpperCase, and it is now a switch/case statement with hex codepoints and comments on every character
  • The utility class CaseFoldingUtil for generating the switch / case statement code in the CaseFolding class has been removed.
  • ASCII_CASE_INSENSITIVE flag has been deprecated and has the same behavior as the CASE_INSENSITIVE flag
  • Tried to minimize changes otherwise with less concern for performance during compilation of the regex

Thoughts at this point and whether I've adequately incorporated your feedback would be much appreciated.
