-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Unicode Support for Case Insensitive Matching #122536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Unicode Support for Case Insensitive Matching #122536
Conversation
|
Hi @john-wagster, I've created a changelog YAML for you. |
benwtrent
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to adjust Regex here. Should "just" adjust the flag passed to Lucene for Regex parsing there?
I am not sure how that is passed through actually (need to do some reading).
| pattern = Strings.toLowercaseAscii(pattern); | ||
| str = Strings.toLowercaseAscii(str); | ||
| // FIXME: attempting to address this bug: https://github.com/elastic/elasticsearch/issues/109385 | ||
| // but really would like regex case insensitive handling to be consistent everywhere rather than a hodge podge that's unclear | ||
| // in the docs so need to reconcile Lucene RegExp, Pattern, and simpleMatch in ES + their docs | ||
| // FIXME: reconcile this behavior and add tests that validate it behaves the same as RegExp for case insensitivity | ||
| // FIXME: add documentation or at least reference lucene docs for this | ||
| // FIXME: consider adding a flag for UNICODE_CASE | ||
| // FIXME: search for and reconcile usages of Strings.toLowerCaseASCII for if they need to use normalizeCase instead | ||
| // FIXME: search for and reconcile usages of java.util.regex.Pattern for if they need to use Lucene RegExp instead | ||
| // or some variant of ES CaseFolding since Lucene RegExp PR is not in place yet (https://github.com/apache/lucene/pull/14192) | ||
| // FIXME: search for and reconcile usages of simpleMatch make sure that supporting broad Unicode support is ok | ||
| // FIXME: search for and reconcile usages of RegExp in ES and make sure they pass the new UNICODE_CASE flag | ||
| // when the Lucene RegExp PR is in place: https://github.com/apache/lucene/pull/14192 | ||
|
|
||
| pattern = CaseFolding.normalizeCase(pattern); | ||
| str = CaseFolding.normalizeCase(str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this. Let's skip this for now. This class is used primarily for simple pattern matching for field names, index names, etc. I didn't find anything that used a query.
The biggest flag to me was PatternAnalyzer, but that is purposefully restrictive and we can expand support in the future. Let's try to unblock the other large regex usecases that don't use this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context since this draft is so old (for anyone subsequently reading it). At the time the draft was created the Regex class was used differently I believe (I think but haven't completely connected the dots in commits that it was used previously by WildcardPattern for instance). But irrespective of that I still think we need to be a bit careful here.
For instance this particular class is no longer a priority or may not be super relevant I agree.
But if I look at CaseInsensitiveWildcardQuery it references AutomatonQueries in Lucene which is trying to do the right thing but doesn't use CaseFolding and may be missing some Unicode codepoints when calling toCaseInsensitiveWildcardAutomaton -> toCaseInsensitiveChar
Makes sense that we can potentially pull out incremental updates here though.
And I do agree that hopefully references to Lucene's RegExp should be easily upgraded by adding in and passing down the a new flag for Unicode casing similar to how we added a new flag in Lucene (https://github.com/apache/lucene/blob/146ef0374f8cb3c36a53be2b73c32255b305a5cb/lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java#L480).
Related to fixing: #109385
I am been investigating a good solution to fixing our various regex engines to support unicode matches similar to utilities like
java.util.regex.Pattern.There is unfortunately not a great way to do this without hard coding a number of edge case in the Unicode spec found here:
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
I've considered various options to deal with some of the edge cases which are detailed in the Lucene PR for RegExp here: apache/lucene#14192
I will refine this PR based on the outcomes in the Lucene PR and see if I can reconcile our various regex approaches to incorporate a consistent case insensitive flag. For now this Draft PR is a placeholder for that work so I don't lose it.
I have a set of utilities for validating and generating these mappings that I've held back for now pending figuring out where I can put them.