Add Automata.makeCharSet/makeCharClass to optimize regexp #14193
Conversation
Previously caseless matching was implemented via code such as this:

`Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))`

Proposed unicode caseless matching (#14192) implements it with repeated unions:

`a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))`
`a2 = Operations.union(a1, Automata.makeChar('y'))`
`a3 = Operations.union(a2, Automata.makeChar('Y'))`

The union operation doesn't return a minimal automaton: improving union would always be nice, but this change offers a simple API for the task that returns half the number of states.
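For concreteness, here is a minimal sketch of how the new call replaces the repeated unions above. It assumes the `Automata.makeCharSet(int[])` method added by this PR (so it won't compile against releases that predate it) and the two-argument `Operations.union` shown above:

```java
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;

public class CaselessCharExample {
  public static void main(String[] args) {
    // Old approach: union of two single-character automata; the result is
    // correct but carries more states than necessary.
    Automaton viaUnion =
        Operations.union(Automata.makeChar('x'), Automata.makeChar('X'));

    // New approach: one automaton that accepts any code point in the set.
    Automaton viaCharSet = Automata.makeCharSet(new int[] {'x', 'X'});

    System.out.println(viaUnion.getNumStates() + " vs " + viaCharSet.getNumStates());
  }
}
```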
This looks good to me. Worst-case scenario, we make this new API delegate to union() when we improve it. I'm curious if you considered passing the code points as var-args for convenience?
I considered it but then didn't use any varargs after Dawid's email about compiler performance problems coming from them.
* Add new "character class" node, which was previously composed by union of many nodes.
* Remove "predefined class" node, which previously built an internal separate regex on the fly; it is just another character class.
* RegExp no longer uses union() internally, except for the union (|) operator.
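As a sketch of what the new character-class node represents (illustrative only, not the parser's actual code; it assumes the `Automata.makeCharClass(int[], int[])` method this PR adds): a class like `[a-z0-9_]` becomes a single automaton built from parallel arrays of range starts and ends, instead of a union of several automata.

```java
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;

public class CharClassExample {
  public static void main(String[] args) {
    // [a-z0-9_] as three ranges: parallel arrays of inclusive start/end code points.
    int[] starts = {'a', '0', '_'};
    int[] ends = {'z', '9', '_'};
    Automaton charClass = Automata.makeCharClass(starts, ends);
    System.out.println(charClass.getNumStates());
  }
}
```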
I generalized this to the change described above. The build failure is bogus: tests pass. It seems to be a bug in the compiler:
That's error-prone that broke trying to do some null analysis :)
anyway, I think this is the right path: rather than fight with union(), let's just get it out of our way. With this change union() is only used for the union operator (`|`).
My example for this one, if you have something like ...
I will figure out what angers error-prone tomorrow. I am rusty with Java, so this PR needs assistance LOL. But all the tests pass.
a few more notes:
This looks great. The only thing that caught my attention would be parallel arguments (from/to) - we could replace it with the more high-level and perhaps more obvious List. But it's such a low-level class that perhaps we shouldn't try to be that fancy and just keep it as is... It's clear what the code is doing (to me at least).
I tried to update error-prone to fix its bugs; it is angry about the way we do Gradle. I will YOLO my way through this stuff.
Like I literally have no idea what this tool is trying to tell me there. But I think error-prone is broken; it depends on too many internals of the Java compiler APIs.
I fixed the error-prone ...
I would propose replacing this checker with ast-grep rules for whatever we need: it is not a good one. The use of internal Java APIs is too crazy.
@dweiss the ...
I opened #14200 for the error-prone situation.
…hodError" This reverts commit 4cd4f8e.
This looks great! I left minor comments... thanks @rmuir.
Crazy how union makes dead states like that, but leave that battle for another day...
  } else {
    a = Automata.makeChar(c);
  }
  break;
case REGEXP_CHAR_RANGE:
- a = Automata.makeCharRange(from, to);
+ a = Automata.makeCharRange(from[0], to[0]);
Do we check that `from.length == 1` and `to.length == 1` somewhere?
I can clean this up. We could actually remove this node (and others) completely if we wanted. It will help, lemme look again.
I looked. toString is a problem, so I want to leave this one be for now. To me it also doesn't make sense to keep adding more "variables" to the RegExp class; it has too many already for these nodes.
  }
}
- /** Like to string, but more verbose (shows the higherchy more clearly). */
+ /** Like to string, but more verbose (shows the hierarchy more clearly). */
Oooh nice typo fix!
}
static RegExp makeCharRange(int flags, int from, int to) {
  if (from > to)
    throw new IllegalArgumentException(
        "invalid range: from (" + from + ") cannot be > to (" + to + ")");
- return newLeafNode(flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, from, to);
+ return newLeafNode(
+     flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, new int[] {from}, new int[] {to});
Oh if this is the only place where we create REGEXP_CHAR_RANGE (the API is not public) then we don't need to otherwise check `from.length == 1` and `to.length == 1`.
yeah it is. The only reason I didn't nuke these RANGE and CHAR nodes is ... toString()!
starts.add('_' + 1);
ends.add('a' - 1);
Heh this range is just one character: ` (back quote).
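To spell out the arithmetic behind that observation (a standalone illustration, not code from the PR):

```java
public class BackquoteRange {
  public static void main(String[] args) {
    // '_' is U+005F and 'a' is U+0061, so both bounds land on U+0060, the backquote.
    int start = '_' + 1; // 0x60
    int end = 'a' - 1;   // 0x60
    System.out.println((char) start + ".." + (char) end); // prints `..`
  }
}
```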
…o match both cases.
OK I tried out a List-based API as an alternative to the array-based API. It isn't fully correct, which is part of my issue, but see it here: 8b535a1

I was excited that it might clean up the code, but it presented some problems. I think most problems are because we are talking about:

array-based API: `Automata.makeCharSet(new int[] {'a', 'A'});`
List-based API: `Automata.makeCharSet(List.of((int) 'a', (int) 'A'));`

It requires the user to make casts because the boxed types in Java suck. Also I'm not happy that with the List-based API, error-prone keeps catching me doing this:
It happens because I keep forgetting to call ... So, I'd like to stick with the array API; net-net I think it is better. I will look at the messy List code that I have in this parser and see if I can improve it to reduce the chances of problems when parsing lists of ranges. I also have other changes in that commit that are unrelated good ones and will poach them into here.
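A minimal sketch of the boxing annoyance described above, assuming a hypothetical List-based `makeCharSet(List<Integer>)` overload (the array-based call is the one the PR actually keeps):

```java
import java.util.List;

public class BoxedCodePoints {
  public static void main(String[] args) {
    // Array-based API: char literals widen to int with no ceremony.
    int[] asArray = new int[] {'a', 'A'};

    // List-based API: without the casts, 'a' and 'A' would autobox to
    // Character and produce a List<Character>, not the List<Integer> the
    // hypothetical overload would need.
    List<Integer> asList = List.of((int) 'a', (int) 'A');

    System.out.println(asArray.length + " " + asList);
  }
}
```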
Does not change the behavior of toString(), which should really be revisited. This is just the internal parse tree used by tests; let's make this one easy at least.
Only the individual characters, not ranges, in a [] class are treated as case-insensitive: this is the previous behavior, unchanged. Test it.
Fix concatenate to remove the dead states it creates, just like intersection/union/etc do. Previously concatenate() didn't take care of this and added dead states to basically every regexp. With the change, many regexps are now minimal. The cleanup runs in linear time, so it doesn't change the complexity of concatenate(); it just cleans up the mess.
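As an illustration of what the cleanup amounts to (a sketch using `Operations.concatenate` and `Operations.removeDeadStates` as in the existing tests, not the PR's diff; the fix folds this step into concatenate itself):

```java
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;

public class ConcatenateCleanup {
  public static void main(String[] args) {
    Automaton left = Automata.makeChar('a');
    Automaton right = Automata.makeChar('b');

    // Before this change, concatenate() could leave dead states behind and a
    // caller had to strip them explicitly; the linear-time cleanup below is
    // what the fix performs internally.
    Automaton ab = Operations.concatenate(left, right);
    Automaton cleaned = Operations.removeDeadStates(ab);

    System.out.println(ab.getNumStates() + " -> " + cleaned.getNumStates());
  }
}
```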
I'm feeling good about this one now. With the change, a lot of regexps now come out minimal from the start, which is a good thing. We also eliminate the overhead of tons of nodes, which is important if we ever want to support caseless range matches (e.g. ...).
Fix the dead-states test to explicitly create dead states, rather than relying on some function to create a mess; it doesn't anymore.
  a = Operations.removeDeadStates(a);
- assertEquals(3, a.getNumStates());
+ assertEquals(1, a.getNumStates());
}
cc @jpountz this is how I fixed this test.
Much better, thanks.
Add Automata.makeCharSet(int[])/makeCharClass(int[],int[]) to optimize regexp.
* Add new "character class" node, which was previously composed by union of many nodes.
* Remove "predefined class" node, which previously built an internal separate regex on the fly; it is just another character class.
* RegExp no longer uses union() internally, except for the union (|) operator.
* Format codepoints in the internal parse tree output with U+%04X.
* Fix concatenate to remove the dead states it creates, just like intersection/union/etc do.
* Fix the dead-states test to explicitly create dead states, rather than relying on some function to create a mess; it doesn't anymore.
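On the parse-tree formatting bullet: `U+%04X` is the standard way to print a code point as upper-case hex padded to at least four digits. A small standalone example (not the PR's code):

```java
public class CodePointFormat {
  public static void main(String[] args) {
    System.out.println(String.format("U+%04X", (int) 'A')); // U+0041
    System.out.println(String.format("U+%04X", 0x1F600));   // U+1F600
  }
}
```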
Fix failure found by TestOperations.testGetRandomAcceptedString.

string: ?+½]+]+Ř*+[\]ᖴﴁ.

expected (before #14193): java.lang.IllegalArgumentException: expected ']' at position 17

actual (after #14193):
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_CONCATENATION
REGEXP_REPEAT_MIN min=1
REGEXP_CHAR char=?
REGEXP_CHAR char=½
REGEXP_REPEAT_MIN min=1
REGEXP_CHAR char=]
REGEXP_CHAR char=
REGEXP_REPEAT_MIN min=1
REGEXP_CHAR char=]
REGEXP_REPEAT_MIN min=1
REGEXP_REPEAT
REGEXP_CHAR char=Ř
REGEXP_CHAR_CLASS starts=[] ends=[]
REGEXP_STRING string=ᖴﴁ
REGEXP_ANYCHAR

The problem is caused by RegExp accepting too much rather than throwing exceptions like it should have. The lenience in the parser comes from "expandPreDefined", which invades on escape-character parsing for character classes (e.g. \s). This one adds a lot of complexity to parsing. Don't invoke expandPreDefined(), except for the set of characters that it explicitly handles. This is also consistent with the way expandPreDefined()'s complexity is managed elsewhere in the parser, such as in parseSimpleExp().

Add parsing tests for testEmptyClass(), which is unchanged by this PR but should be there, and testEscapedInvalidClass(), which fails without the change. When we consume an escape, either it matches the predefined special logic, or we emit a range for next(); no other logic: it is an escape. Add a test for an escaped '-' in a character class.
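A hedged sketch of the behavior this commit restores (not the PR's actual test): with the stricter escape handling, `\]` inside a character class is just an escaped `]`, so a class that is never closed is rejected again.

```java
import org.apache.lucene.util.automaton.RegExp;

public class EscapedClassSketch {
  public static void main(String[] args) {
    try {
      // The Java string "[\\]" is the three characters [ \ ], i.e. an escaped
      // ']' inside an unterminated character class.
      new RegExp("[\\]");
      System.out.println("unexpectedly parsed");
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected as expected: " + expected.getMessage());
    }
  }
}
```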
Before: caseless match of "a":

After:

Before: caseless match of "lucene":

After:

Just like `union`, `concatenate` adds some useless states, but they are less of a problem than the ones from before. I didn't try anything more, such as repeated union or Kleene star, to see if I could make a really bad case; I felt like this was good enough to get it to a better place. We can still look at optimizing union/concatenate separately, but that's always more dangerous and tricky.