Skip to content

Add Automata.makeCharSet/makeCharClass to optimize regexp #14193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 6, 2025

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Feb 4, 2025

Previously caseless matching was implemented via code such as this:

  Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))

Proposed unicode caseless matching (#14192) implements it with repeated unions:

  a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))
  a2 = Operations.union(a1, Automata.makeChar('y'))
  a3 = Operations.union(a2, Automata.makeChar('Y'))

The union operation doesn't return a minimal automaton: improving union would always be nice, but this change offers a simple api for the task that returns half the number of states.

Before: caseless match of "a":
a_before

After:
a_after

Before: caseless match of "lucene":
lucene_before

After:
lucene_after

Just like the union, the concatenate adds some useless states, but they are less of a problem than the ones from before.

I didn't try anything more such as repeated union or kleene star, to see if I could make a really bad case, I felt like this was good enough, to get it to a better place. We can look at optimizing union/concatenate separately still, but that's always more dangerous and tricky.

Previously caseless matching was implemented via code such as this:

  Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))

Proposed unicode caseless matching implements it with repeated unions:

  a1 = Operations.union(Automata.makeChar('x'), Automata.makeChar('X'))
  a2 = Operations.union(a1, Automata.makeChar('y'))
  a3 = Operations.union(a2, Automata.makeChar('Y'))

The union operation doesn't return a minimal automaton: improving union
would always be nice, but this change offers a simple api for the task
that returns half the number of states.
@rmuir
Copy link
Member Author

rmuir commented Feb 4, 2025

also makeCharUnion() comes to mind as a compelling alternative name, since there is already a makeStringUnion(). Naming is hard. just want to get the idea out there, since caseless regex is a common usecase.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Worst-case scenario, we make this new API delegate to union() when we improve it. I'm curious if you considered passing the code points as var-args for convenience?

@rmuir
Copy link
Member Author

rmuir commented Feb 4, 2025

I considered it but then didn't use any varargs after Dawid's email about compiler performance problems coming from them.

* Add new "character class" node, which was previously composed by union
  of many nodes.
* Remove "predefined class" node, which previously built an internal
  separate regex on the fly, it is just another character class.
* RegExp no longer uses union() internally, except for union (|) operator.
@rmuir rmuir changed the title Add Automata.makeCharSet(int[]) to optimize caseless matching. Add Automata.makeCharSet/makeCharClass to optimize regexp Feb 5, 2025
@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I generalized this to makeCharClass(int[],int[]), added a "character class" node to use it instead of unioning many nodes, replaced the pre-built class functionality with it too.

The build failure is bogus: tests pass. It seems to be bug in compiler:

An exception has occurred in the compiler (21.0.5). Please file a bug against the Java compiler via the Java bug reporting page (https://bugreport.java.com/) after checking the Bug Database (https://bugs.java.com/) for duplicates. Include your program, the following diagnostic, and the parameters passed to the Java compiler in your report. Thank you.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

That's error-prone that's broke trying to do some null analysis :)

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

anyway, I think this is the right path, rather than fight with union(), let's just get it out of our way. with this change union() is only used for union operator (|) and not internally.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

My example for this one, if you have something like [^a-gklM-O\s], with the case-insensitive flag maybe, it just calls the new makeCharClass(int[],int[]) method and you get minimal automaton with the ranges you expect from state1->state2. No more walking up and down the parse tree and collecting sub-automata into lists and unioning them, no more sub-regexp creation and unioning for prebuilt classes.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I will figure out what angers the error-prone tomorrow. I am rusty with java, so this PR needs assistance LOL. but all the tests pass.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

a few more notes:

  • maybe we should deprecate union(Automaton, Automaton) and only leave union(List<Automaton>). I see the former approach has proven trappy, let's guide developers to do it the faster way? The deprecation warning itself can be useful here to explain why. Nothing outside of tests uses this signature anymore.
  • by having a way to create a character class node, if we wanted, we could give it additional boolean @param complement True if the character class should be negated. It would not help performance but would allow us to cleanup on the API side: the problematic complement() is needed and awesome for negated character classes, but the general algorithm is not sustainable and requires determinization. It is a crazy trap: complement() on one of these simple character classes is the perfect algorithm: totalize() to one dead state, then nuke the previous accept state. I definitely wrote the code and deleted it several times to just have this flag, to remove the general-purpose complement() trap elsewhere, which is not efficient at all for all situations.
  • the PR needs more tests but I fought the existing toString-based ones to exhaustion already. At least tests work and confirm it does what we expect
  • warning: there may be dead code, that we could remove that I missed. We should look and try to fix RegExp class to not do everything package-private, because it makes it impossible for our tooling to detect dead code. If it used private correctly, tools would tell us if a function is unused by anything.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. The only thing that caught my attention would be parallel arguments (from/to) - we could replace it with the more high-level and perhaps more obvious List. But it's such a low-level class that perhaps we shouldn't try to be that fancy and just keep it as is... It's clear what the code is doing (to me at least).

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I tried to update the error prone to fix its bugs, it is angry about the way we do gradle. I will YOLO my way thru this stuff.

The default --should-stop=ifError policy (INIT) is not supported by Error Prone, pass --should-stop=ifError=FLOW instead

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

Like i literally have no idea what this tool is trying to tell me there. But I think errorprone is broken, it depends on too many internals of the java compiler apis.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I fixed the error-prone

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I would propose replacing this checker with ast-grep rules for whatever we need: it is not a good one. the use of internal java APIs is too crazy.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

@dweiss the List is a good idea. I did the arrays only because @john-wagster has arrays over on #14192, but in the parser lists are used. The only useful thing about arrays are convenient toString() in hex via IntsRef. I will look into it.

@rmuir
Copy link
Member Author

rmuir commented Feb 5, 2025

I opened #14200 for the error-prone situation

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I left minor comments... thanks @rmuir.

Crazy how union makes dead states like that, but leave that battle for another day...

} else {
a = Automata.makeChar(c);
}
break;
case REGEXP_CHAR_RANGE:
a = Automata.makeCharRange(from, to);
a = Automata.makeCharRange(from[0], to[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we check that from.length == 1 and to.length == 1 somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can clean this up. We could actually remove this node (and others) completely if we wanted. It will help, lemme look again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked. toString is a problem. so I want to leave this one be for now. To me it also doesn't make sense to keep adding more "variables" to the RegExp class , it has too many already for these nodes.

}
}

/** Like to string, but more verbose (shows the higherchy more clearly). */
/** Like to string, but more verbose (shows the hierarchy more clearly). */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh nice typo fix!

}

static RegExp makeCharRange(int flags, int from, int to) {
if (from > to)
throw new IllegalArgumentException(
"invalid range: from (" + from + ") cannot be > to (" + to + ")");
return newLeafNode(flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, from, to);
return newLeafNode(
flags, Kind.REGEXP_CHAR_RANGE, null, 0, 0, 0, 0, new int[] {from}, new int[] {to});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh if this is the only place where we create REXEGP_CHAR_RANGE (the API is not public) then we don't need to otherwise check from.length == 1 and to.length == 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it is. The only reason I didn't nuke these RANGE and CHAR nodes is ... toString()!

Comment on lines +1320 to +1321
starts.add('_' + 1);
ends.add('a' - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh this range is just one character: ` (back quote).

@rmuir
Copy link
Member Author

rmuir commented Feb 6, 2025

OK I tried out a List-based API as alternative to array-based API. It isn't fully correct, which is part of my issue, but see it here: 8b535a1

I was excited that it might cleanup the code, but it presented some problems. I think most problems are because we are talking about int arrays (codepoint ranges), but common use cases will of course use java char.

array-based API:

Automata.makeCharSet(new int[] {'a', 'A'});

List-based API:

Automata.makeCharSet(List.of((int) 'a', (int) 'A'));

It requires user to make casts because the boxed types in java suck.

Also I'm not happy that with the List-based API, error-prone keeps catching me doing this:

error: [BoxedPrimitiveEquality] Comparison using reference equality instead of value equality. Reference equality of boxed primitive types is usually not useful, as they are value objects, and it is bug-prone, as instances are cached for some values but not others.
          if (from.get(i) == to.get(i)) {
                          ^
    (see https://errorprone.info/bugpattern/BoxedPrimitiveEquality)
  Did you mean 'if (Objects.equals(from.get(i), to.get(i))) {' or 'if (from.get(i).equals(to.get(i))) {'?

It happens because I keep forgetting to call .intValue() in special places.

So, I'd like to stick with array API, net-net I think it is better. I will look at messy List code that I have in this parser and see if I can improve that to reduce chances of problems when parsing lists of ranges. I also have other changes in that commit that are unrelated good ones and will poach them into here.

rmuir added 6 commits February 5, 2025 22:30
Does not change the behavior of toString(), which should really be
revisited. This is just the internal parse tree used by tests, let's
make this one easy at least.
Only the individual characters, not ranges, in a [] class are treated as
case-insensitive: this is the previous behavior unchanged. test it.
intersection/union/etc do

This one doesn't take care, it adds dead states to basically every
regexp. With the change, many regexp are now minimal.

It runs in linear time so it doesn't change the complexity of
concatenate() but just cleans up the mess.
@rmuir
Copy link
Member Author

rmuir commented Feb 6, 2025

Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up too many As.

Screen_Shot_2025-02-05_at_23 55 55

@rmuir
Copy link
Member Author

rmuir commented Feb 6, 2025

I'm feeling good about this one now, with the change, a lot of regexps now come out minimal from the start, which is a good thing.

We also eliminate overhead of tons of nodes, which is important if we ever want to support caseless range matches (e.g. [a-z] with case-insensitive flag matching 'A'). To implement it generally (not just for ascii), we need to iterate the range and add tons of alternatives, but its an int[] and will all be folded into a single state: basically as good as we can get.

relying on some function to create a mess. it doesn't anymore.
a = Operations.removeDeadStates(a);
assertEquals(3, a.getNumStates());
assertEquals(1, a.getNumStates());
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jpountz this is how i fixed this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thanks.

@rmuir rmuir merged commit fe42efc into apache:main Feb 6, 2025
6 checks passed
rmuir added a commit to rmuir/lucene that referenced this pull request Feb 6, 2025
Add Automata.makeCharSet(int[])/makeCharClass(int[],int[]) to optimize regexp.

* Add new "character class" node, which was previously composed by union
  of many nodes.
* Remove "predefined class" node, which previously built an internal
  separate regex on the fly, it is just another character class.
* RegExp no longer uses union() internally, except for union (|) operator.
* format codepoints in the internal parse tree output with U+%04X
* Fix concatenate to remove the dead states it creates, just like
intersection/union/etc do
* fix dead-states-test to explicitly create dead states, rather than
relying on some function to create a mess. it doesn't anymore.
@dweiss
Copy link
Contributor

dweiss commented Feb 6, 2025

Finally! The concatenate() issue was an easy fix, it neglected to clean up its dead states. All of its partners in crime do this, but the fact we neglect it for concatenate messes up too many As.

Screen_Shot_2025-02-05_at_23 55 55

God this is so much nicer compared to that first/ original automaton you included!

asfgit pushed a commit that referenced this pull request Feb 7, 2025
asfgit pushed a commit that referenced this pull request Feb 7, 2025
@rmuir rmuir added this to the 10.2.0 milestone Feb 7, 2025
rmuir added a commit to rmuir/lucene that referenced this pull request Feb 12, 2025
string: ?+½]+]+Ř*+[\]ᖴﴁ.

expected: before apache#14193
java.lang.IllegalArgumentException: expected ']' at position 17

actual: after apache#14193
REGEXP_CONCATENATION
  REGEXP_CONCATENATION
    REGEXP_CONCATENATION
      REGEXP_CONCATENATION
        REGEXP_CONCATENATION
          REGEXP_CONCATENATION
            REGEXP_CONCATENATION
              REGEXP_CONCATENATION
                REGEXP_REPEAT_MIN min=1
                  REGEXP_CHAR char=?
                REGEXP_CHAR char=½
              REGEXP_REPEAT_MIN min=1
                REGEXP_CHAR char=]
            REGEXP_CHAR char=
          REGEXP_REPEAT_MIN min=1
            REGEXP_CHAR char=]
        REGEXP_REPEAT_MIN min=1
          REGEXP_REPEAT
            REGEXP_CHAR char=Ř
      REGEXP_CHAR_CLASS starts=[] ends=[]
    REGEXP_STRING string=ᖴﴁ
  REGEXP_ANYCHAR

Problem is caused by RegExp accepting too much rather than throwing
exceptions like it should have. The lenience in the parser comes from
"expandPreDefined" which invades on escape character parsing for character
classes (e.g. \s). This one adds a lot of complexity to parsing.

Don't invoke expandPreDefined(), except for the set of characters that it
explicitly handles. This is also consistent with the way
expandPreDefined()'s complexity is managed elsewhere in the parser, such
as in parseSimpleExp().

Add parsing tests for testEmptyClass(), which is unchanged by this PR,
but should be there, and testEscapedInvalidClass(), which fails without
the change.
rmuir added a commit that referenced this pull request Feb 13, 2025
* Fix failure found by TestOperations.testGetRandomAcceptedString

string: ?+½]+]+Ř*+[\]ᖴﴁ.

expected: before #14193
java.lang.IllegalArgumentException: expected ']' at position 17

actual: after #14193
REGEXP_CONCATENATION
  REGEXP_CONCATENATION
    REGEXP_CONCATENATION
      REGEXP_CONCATENATION
        REGEXP_CONCATENATION
          REGEXP_CONCATENATION
            REGEXP_CONCATENATION
              REGEXP_CONCATENATION
                REGEXP_REPEAT_MIN min=1
                  REGEXP_CHAR char=?
                REGEXP_CHAR char=½
              REGEXP_REPEAT_MIN min=1
                REGEXP_CHAR char=]
            REGEXP_CHAR char=
          REGEXP_REPEAT_MIN min=1
            REGEXP_CHAR char=]
        REGEXP_REPEAT_MIN min=1
          REGEXP_REPEAT
            REGEXP_CHAR char=Ř
      REGEXP_CHAR_CLASS starts=[] ends=[]
    REGEXP_STRING string=ᖴﴁ
  REGEXP_ANYCHAR

Problem is caused by RegExp accepting too much rather than throwing
exceptions like it should have. The lenience in the parser comes from
"expandPreDefined" which invades on escape character parsing for character
classes (e.g. \s). This one adds a lot of complexity to parsing.

Don't invoke expandPreDefined(), except for the set of characters that it
explicitly handles. This is also consistent with the way
expandPreDefined()'s complexity is managed elsewhere in the parser, such
as in parseSimpleExp().

Add parsing tests for testEmptyClass(), which is unchanged by this PR,
but should be there, and testEscapedInvalidClass(), which fails without
the change.

When we consume an escape, either it matches the predefined special
logic, or we emit range for next(), no other logic: it is an escape.

Add test for escaped '-' in a character class.
asfgit pushed a commit that referenced this pull request Feb 13, 2025
* Fix failure found by TestOperations.testGetRandomAcceptedString

string: ?+½]+]+Ř*+[\]ᖴﴁ.

expected: before #14193
java.lang.IllegalArgumentException: expected ']' at position 17

actual: after #14193
REGEXP_CONCATENATION
  REGEXP_CONCATENATION
    REGEXP_CONCATENATION
      REGEXP_CONCATENATION
        REGEXP_CONCATENATION
          REGEXP_CONCATENATION
            REGEXP_CONCATENATION
              REGEXP_CONCATENATION
                REGEXP_REPEAT_MIN min=1
                  REGEXP_CHAR char=?
                REGEXP_CHAR char=½
              REGEXP_REPEAT_MIN min=1
                REGEXP_CHAR char=]
            REGEXP_CHAR char=
          REGEXP_REPEAT_MIN min=1
            REGEXP_CHAR char=]
        REGEXP_REPEAT_MIN min=1
          REGEXP_REPEAT
            REGEXP_CHAR char=Ř
      REGEXP_CHAR_CLASS starts=[] ends=[]
    REGEXP_STRING string=ᖴﴁ
  REGEXP_ANYCHAR

Problem is caused by RegExp accepting too much rather than throwing
exceptions like it should have. The lenience in the parser comes from
"expandPreDefined" which invades on escape character parsing for character
classes (e.g. \s). This one adds a lot of complexity to parsing.

Don't invoke expandPreDefined(), except for the set of characters that it
explicitly handles. This is also consistent with the way
expandPreDefined()'s complexity is managed elsewhere in the parser, such
as in parseSimpleExp().

Add parsing tests for testEmptyClass(), which is unchanged by this PR,
but should be there, and testEscapedInvalidClass(), which fails without
the change.

When we consume an escape, either it matches the predefined special
logic, or we emit range for next(), no other logic: it is an escape.

Add test for escaped '-' in a character class.
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants