Add Automata.makeCharSet/makeCharClass to optimize regexp #14193

rmuir · 2025-02-04T01:46:29Z

Previously caseless matching was implemented via code such as this:

  Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))

Proposed unicode caseless matching (#14192) implements it with repeated unions:

  a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))
  a2 = Operations.union(a1, Automata.makeChar('y'))
  a3 = Operations.union(a2, Automata.makeChar('Y'))

The union operation doesn't return a minimal automaton: improving union would always be nice, but this change offers a simple api for the task that returns half the number of states.

Before: caseless match of "a":

After:

Before: caseless match of "lucene":

After:

Just like the union, the concatenate adds some useless states, but they are less of a problem than the ones from before.

I didn't try anything more such as repeated union or kleene star, to see if I could make a really bad case, I felt like this was good enough, to get it to a better place. We can look at optimizing union/concatenate separately still, but that's always more dangerous and tricky.

Previously caseless matching was implemented via code such as this: Operations.union(Automata.makeChar('x'), Automata.makeChar('X')) Proposed unicode caseless matching implements it with repeated unions: a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X')) a2 = Operations.union(a1, Automata.makeChar('y')) a3 = Operations.union(a2, Automata.makeChar('Y')) The union operation doesn't return a minimal automaton: improving union would always be nice, but this change offers a simple api for the task that returns half the number of states.

rmuir · 2025-02-04T02:02:10Z

also makeCharUnion() comes to mind as a compelling alternative name, since there is already a makeStringUnion(). Naming is hard. just want to get the idea out there, since caseless regex is a common usecase.

jpountz

This looks good to me. Worst-case scenario, we make this new API delegate to union() when we improve it. I'm curious if you considered passing the code points as var-args for convenience?

rmuir · 2025-02-04T12:19:39Z

I considered it but then didn't use any varargs after Dawid's email about compiler performance problems coming from them.

* Add new "character class" node, which was previously composed by union of many nodes. * Remove "predefined class" node, which previously built an internal separate regex on the fly, it is just another character class. * RegExp no longer uses union() internally, except for union (|) operator.

rmuir · 2025-02-05T07:34:18Z

I generalized this to makeCharClass(int[],int[]), added a "character class" node to use it instead of unioning many nodes, replaced the pre-built class functionality with it too.

The build failure is bogus: tests pass. It seems to be bug in compiler:

An exception has occurred in the compiler (21.0.5). Please file a bug against the Java compiler via the Java bug reporting page (https://bugreport.java.com/) after checking the Bug Database (https://bugs.java.com/) for duplicates. Include your program, the following diagnostic, and the parameters passed to the Java compiler in your report. Thank you.

rmuir · 2025-02-05T07:35:06Z

That's error-prone that's broke trying to do some null analysis :)

rmuir · 2025-02-05T07:35:50Z

anyway, I think this is the right path, rather than fight with union(), let's just get it out of our way. with this change union() is only used for union operator (|) and not internally.

rmuir · 2025-02-05T07:56:53Z

My example for this one, if you have something like [^a-gklM-O\s], with the case-insensitive flag maybe, it just calls the new makeCharClass(int[],int[]) method and you get minimal automaton with the ranges you expect from state1->state2. No more walking up and down the parse tree and collecting sub-automata into lists and unioning them, no more sub-regexp creation and unioning for prebuilt classes.

rmuir · 2025-02-05T08:02:13Z

I will figure out what angers the error-prone tomorrow. I am rusty with java, so this PR needs assistance LOL. but all the tests pass.

rmuir · 2025-02-05T08:16:24Z

a few more notes:

maybe we should deprecate union(Automaton, Automaton) and only leave union(List<Automaton>). I see the former approach has proven trappy, let's guide developers to do it the faster way? The deprecation warning itself can be useful here to explain why. Nothing outside of tests uses this signature anymore.
by having a way to create a character class node, if we wanted, we could give it additional boolean @param complement True if the character class should be negated. It would not help performance but would allow us to cleanup on the API side: the problematic complement() is needed and awesome for negated character classes, but the general algorithm is not sustainable and requires determinization. It is a crazy trap: complement() on one of these simple character classes is the perfect algorithm: totalize() to one dead state, then nuke the previous accept state. I definitely wrote the code and deleted it several times to just have this flag, to remove the general-purpose complement() trap elsewhere, which is not efficient at all for all situations.
the PR needs more tests but I fought the existing toString-based ones to exhaustion already. At least tests work and confirm it does what we expect
warning: there may be dead code, that we could remove that I missed. We should look and try to fix RegExp class to not do everything package-private, because it makes it impossible for our tooling to detect dead code. If it used private correctly, tools would tell us if a function is unused by anything.

dweiss

This looks great. The only thing that caught my attention would be parallel arguments (from/to) - we could replace it with the more high-level and perhaps more obvious List. But it's such a low-level class that perhaps we shouldn't try to be that fancy and just keep it as is... It's clear what the code is doing (to me at least).

rmuir · 2025-02-05T14:32:54Z

I tried to update the error prone to fix its bugs, it is angry about the way we do gradle. I will YOLO my way thru this stuff.

The default --should-stop=ifError policy (INIT) is not supported by Error Prone, pass --should-stop=ifError=FLOW instead

rmuir · 2025-02-05T14:34:24Z

Like i literally have no idea what this tool is trying to tell me there. But I think errorprone is broken, it depends on too many internals of the java compiler apis.

rmuir · 2025-02-05T14:51:51Z

I fixed the error-prone

rmuir · 2025-02-05T14:52:41Z

I would propose replacing this checker with ast-grep rules for whatever we need: it is not a good one. the use of internal java APIs is too crazy.

rmuir · 2025-02-05T15:19:01Z

@dweiss the List is a good idea. I did the arrays only because @john-wagster has arrays over on #14192, but in the parser lists are used. The only useful thing about arrays are convenient toString() in hex via IntsRef. I will look into it.

rmuir · 2025-02-05T17:53:48Z

I opened #14200 for the error-prone situation

…hodError" This reverts commit 4cd4f8e.

mikemccand

This looks great! I left minor comments... thanks @rmuir.

Crazy how union makes dead states like that, but leave that battle for another day...

mikemccand · 2025-02-05T20:04:18Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

        } else {
          a = Automata.makeChar(c);
        }
        break;
      case REGEXP_CHAR_RANGE:
-        a = Automata.makeCharRange(from, to);
+        a = Automata.makeCharRange(from[0], to[0]);


Do we check that from.length == 1 and to.length == 1 somewhere?

I can clean this up. We could actually remove this node (and others) completely if we wanted. It will help, lemme look again.

I looked. toString is a problem. so I want to leave this one be for now. To me it also doesn't make sense to keep adding more "variables" to the RegExp class , it has too many already for these nodes.

lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java

mikemccand · 2025-02-05T20:10:31Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

    }
  }

-  /** Like to string, but more verbose (shows the higherchy more clearly). */
+  /** Like to string, but more verbose (shows the hierarchy more clearly). */


Oooh nice typo fix!

mikemccand · 2025-02-05T20:12:26Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

  }

  static RegExp makeCharRange(int flags, int from, int to) {
    if (from > to)
      throw new IllegalArgumentException(
          "invalid range: from (" + from + ") cannot be > to (" + to + ")");
-    return newLeafNode(flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, from, to);
+    return newLeafNode(
+        flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, new int[] {from}, new int[] {to});


Oh if this is the only place where we create REXEGP_CHAR_RANGE (the API is not public) then we don't need to otherwise check from.length == 1 and to.length == 1.

yeah it is. The only reason I didn't nuke these RANGE and CHAR nodes is ... toString()!

mikemccand · 2025-02-05T20:19:23Z

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

+      starts.add('_' + 1);
+      ends.add('a' - 1);


Heh this range is just one character: ` (back quote).

lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java

…o match both cases.

rmuir · 2025-02-06T00:17:13Z

OK I tried out a List-based API as alternative to array-based API. It isn't fully correct, which is part of my issue, but see it here: 8b535a1

I was excited that it might cleanup the code, but it presented some problems. I think most problems are because we are talking about int arrays (codepoint ranges), but common use cases will of course use java char.

array-based API:

Automata.makeCharSet(new int[] {'a', 'A'});

List-based API:

Automata.makeCharSet(List.of((int) 'a', (int) 'A'));

It requires user to make casts because the boxed types in java suck.

Also I'm not happy that with the List-based API, error-prone keeps catching me doing this:

error: [BoxedPrimitiveEquality] Comparison using reference equality instead of value equality. Reference equality of boxed primitive types is usually not useful, as they are value objects, and it is bug-prone, as instances are cached for some values but not others.
          if (from.get(i) == to.get(i)) {
                          ^
    (see https://errorprone.info/bugpattern/BoxedPrimitiveEquality)
  Did you mean 'if (Objects.equals(from.get(i), to.get(i))) {' or 'if (from.get(i).equals(to.get(i))) {'?

It happens because I keep forgetting to call .intValue() in special places.

So, I'd like to stick with array API, net-net I think it is better. I will look at messy List code that I have in this parser and see if I can improve that to reduce chances of problems when parsing lists of ranges. I also have other changes in that commit that are unrelated good ones and will poach them into here.

Does not change the behavior of toString(), which should really be revisited. This is just the internal parse tree used by tests, let's make this one easy at least.

Only the individual characters, not ranges, in a [] class are treated as case-insensitive: this is the previous behavior unchanged. test it.

intersection/union/etc do This one doesn't take care, it adds dead states to basically every regexp. With the change, many regexp are now minimal. It runs in linear time so it doesn't change the complexity of concatenate() but just cleans up the mess.

rmuir · 2025-02-06T05:00:13Z

Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up too many As.

rmuir · 2025-02-06T05:07:36Z

I'm feeling good about this one now, with the change, a lot of regexps now come out minimal from the start, which is a good thing.

We also eliminate overhead of tons of nodes, which is important if we ever want to support caseless range matches (e.g. [a-z] with case-insensitive flag matching 'A'). To implement it generally (not just for ascii), we need to iterate the range and add tons of alternatives, but its an int[] and will all be folded into a single state: basically as good as we can get.

relying on some function to create a mess. it doesn't anymore.

rmuir · 2025-02-06T12:19:21Z

lucene/core/src/test/org/apache/lucene/util/automaton/TestAutomaton.java

    a = Operations.removeDeadStates(a);
-    assertEquals(3, a.getNumStates());
+    assertEquals(1, a.getNumStates());
  }


cc @jpountz this is how i fixed this test.

Much better, thanks.

Add Automata.makeCharSet(int[])/makeCharClass(int[],int[]) to optimize regexp. * Add new "character class" node, which was previously composed by union of many nodes. * Remove "predefined class" node, which previously built an internal separate regex on the fly, it is just another character class. * RegExp no longer uses union() internally, except for union (|) operator. * format codepoints in the internal parse tree output with U+%04X * Fix concatenate to remove the dead states it creates, just like intersection/union/etc do * fix dead-states-test to explicitly create dead states, rather than relying on some function to create a mess. it doesn't anymore.

dweiss · 2025-02-06T16:26:49Z

Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up too many As.

God this is so much nicer compared to that first/ original automaton you included!

string: ?+½]+]+Ř*+[\]ᖴﴁ. expected: before apache#14193 java.lang.IllegalArgumentException: expected ']' at position 17 actual: after apache#14193 REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=? REGEXP_CHAR char=½ REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=] REGEXP_CHAR char= REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=] REGEXP_REPEAT_MIN min=1 REGEXP_REPEAT REGEXP_CHAR char=Ř REGEXP_CHAR_CLASS starts=[] ends=[] REGEXP_STRING string=ᖴﴁ REGEXP_ANYCHAR Problem is caused by RegExp accepting too much rather than throwing exceptions like it should have. The lenience in the parser comes from "expandPreDefined" which invades on escape character parsing for character classes (e.g. \s). This one adds a lot of complexity to parsing. Don't invoke expandPreDefined(), except for the set of characters that it explicitly handles. This is also consistent with the way expandPreDefined()'s complexity is managed elsewhere in the parser, such as in parseSimpleExp(). Add parsing tests for testEmptyClass(), which is unchanged by this PR, but should be there, and testEscapedInvalidClass(), which fails without the change.

* Fix failure found by TestOperations.testGetRandomAcceptedString string: ?+½]+]+Ř*+[\]ᖴﴁ. expected: before #14193 java.lang.IllegalArgumentException: expected ']' at position 17 actual: after #14193 REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_CONCATENATION REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=? REGEXP_CHAR char=½ REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=] REGEXP_CHAR char= REGEXP_REPEAT_MIN min=1 REGEXP_CHAR char=] REGEXP_REPEAT_MIN min=1 REGEXP_REPEAT REGEXP_CHAR char=Ř REGEXP_CHAR_CLASS starts=[] ends=[] REGEXP_STRING string=ᖴﴁ REGEXP_ANYCHAR Problem is caused by RegExp accepting too much rather than throwing exceptions like it should have. The lenience in the parser comes from "expandPreDefined" which invades on escape character parsing for character classes (e.g. \s). This one adds a lot of complexity to parsing. Don't invoke expandPreDefined(), except for the set of characters that it explicitly handles. This is also consistent with the way expandPreDefined()'s complexity is managed elsewhere in the parser, such as in parseSimpleExp(). Add parsing tests for testEmptyClass(), which is unchanged by this PR, but should be there, and testEscapedInvalidClass(), which fails without the change. When we consume an escape, either it matches the predefined special logic, or we emit range for next(), no other logic: it is an escape. Add test for escaped '-' in a character class.

rmuir requested review from dweiss, jpountz and mikemccand February 4, 2025 01:46

github-actions bot added the module:core/other label Feb 4, 2025

rmuir mentioned this pull request Feb 4, 2025

Unicode Support for Case Insensitive Matching in RegExp #14192

Merged

jpountz approved these changes Feb 4, 2025

View reviewed changes

msfroh mentioned this pull request Feb 5, 2025

[BUG] Can (easily) run out of memory with case-insensitive term queries opensearch-project/OpenSearch#16975

Closed

rmuir changed the title ~~Add Automata.makeCharSet(int[]) to optimize caseless matching.~~ Add Automata.makeCharSet/makeCharClass to optimize regexp Feb 5, 2025

dweiss approved these changes Feb 5, 2025

View reviewed changes

Merge branch 'main' into optimize_caseless_regexp

ed0955b

rmuir added 2 commits February 5, 2025 14:23

Revert "disable error-prone as it is failing CI builds with NoSuchMet…

8450f32

…hodError" This reverts commit 4cd4f8e.

Merge branch 'main' into optimize_caseless_regexp

693eb5e

mikemccand approved these changes Feb 5, 2025

View reviewed changes

fix error-prone issue!

3437874

fix comment, so that it is obvious that it is a range. fix ordering t…

fd7c831

…o match both cases.

rmuir added 6 commits February 5, 2025 22:30

format codepoints in the internal parse tree output with U+%04X

c5de072

Does not change the behavior of toString(), which should really be revisited. This is just the internal parse tree used by tests, let's make this one easy at least.

add sanity checks and ROOT

be5d01e

add test for a more complicated char class, that uses various features

a3cce39

test negated char class, make sure it comes back minimal

2e5d7d1

add case-insensitive tests.

96419cd

Only the individual characters, not ranges, in a [] class are treated as case-insensitive: this is the previous behavior unchanged. test it.

fix dead-states-test to explicitly create dead states, rather than

3926a67

relying on some function to create a mess. it doesn't anymore.

jpountz mentioned this pull request Feb 6, 2025

Automatically remove dead states from concatenated automata. #14205

Closed

rmuir commented Feb 6, 2025

View reviewed changes

rmuir merged commit fe42efc into apache:main Feb 6, 2025
6 checks passed

asfgit pushed a commit that referenced this pull request Feb 7, 2025

Add CHANGES entry for #14193 RegExp changes

5184bf8

asfgit pushed a commit that referenced this pull request Feb 7, 2025

Add CHANGES entry for #14193 RegExp changes

2ab9cf4

rmuir added this to the 10.2.0 milestone Feb 7, 2025

rmuir mentioned this pull request Feb 12, 2025

Fix failure found by TestOperations.testGetRandomAcceptedString #14227

Merged

benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 13, 2025

Fixing compilation after Lucene apache/lucene#14193

43b2e87

rmuir mentioned this pull request Mar 20, 2025

Case insensitive regex query with character range #14378

Closed

john-wagster mentioned this pull request Apr 2, 2025

Fixed Character Class and Ranges WildCardQuery Optimizations elastic/elasticsearch#126154

Merged

Add Automata.makeCharSet/makeCharClass to optimize regexp #14193

Add Automata.makeCharSet/makeCharClass to optimize regexp #14193

Conversation

rmuir commented Feb 4, 2025

Uh oh!

rmuir commented Feb 4, 2025

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

rmuir commented Feb 4, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

dweiss left a comment

Choose a reason for hiding this comment

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

rmuir commented Feb 5, 2025

Uh oh!

mikemccand left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rmuir commented Feb 6, 2025

Uh oh!

rmuir commented Feb 6, 2025

Uh oh!

rmuir commented Feb 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dweiss commented Feb 6, 2025

Uh oh!

Uh oh!