Add Automata.makeCharSet/makeCharClass to optimize regexp (#14193)
Add Automata.makeCharSet(int[])/makeCharClass(int[],int[]) to optimize regexp.

* Add a new "character class" node, which was previously composed as a
  union of many nodes.
* Remove the "predefined class" node, which previously built a separate
  internal regex on the fly; it is just another character class.
* RegExp no longer uses union() internally, except for the union (|) operator.
* Format codepoints in the internal parse tree output with U+%04X.
* Fix concatenate to remove the dead states it creates, just as
  intersection/union/etc. do.
* Fix the dead-states test to create dead states explicitly, rather than
  relying on some function to leave a mess behind; that no longer happens.
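
The U+%04X bullet above concerns how codepoints appear in the parse tree's debug output. A minimal sketch of that formatting, using only `String.format`; the class and method names are hypothetical and are not taken from the commit:

```java
/** Hypothetical helper illustrating the U+%04X format; not part of the commit. */
final class CodePointFormat {
  /** Renders a codepoint as "U+" followed by at least four uppercase hex digits. */
  static String format(int codePoint) {
    return String.format("U+%04X", codePoint);
  }

  public static void main(String[] args) {
    System.out.println(format('a'));     // U+0061
    System.out.println(format(0x1F600)); // U+1F600
  }
}
```

Zero-padding to four digits keeps BMP codepoints aligned, while supplementary codepoints simply print with more digits.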
rmuir authored Feb 6, 2025
1 parent 7425e43 commit fe42efc
Showing 5 changed files with 329 additions and 104 deletions.
@@ -32,6 +32,7 @@
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collection;
+import java.util.Objects;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefIterator;
 import org.apache.lucene.util.StringHelper;
@@ -140,6 +141,32 @@ public static Automaton makeCharRange(int min, int max) {
     return a;
   }

+  /** Returns a new minimal automaton that accepts any of the provided codepoints */
+  public static Automaton makeCharSet(int[] codepoints) {
+    return makeCharClass(codepoints, codepoints);
+  }
+
+  /** Returns a new minimal automaton that accepts any of the codepoint ranges */
+  public static Automaton makeCharClass(int[] starts, int[] ends) {
+    Objects.requireNonNull(starts);
+    Objects.requireNonNull(ends);
+    if (starts.length != ends.length) {
+      throw new IllegalArgumentException("starts must match ends");
+    }
+    if (starts.length == 0) {
+      return makeEmpty();
+    }
+    Automaton a = new Automaton();
+    int s1 = a.createState();
+    int s2 = a.createState();
+    a.setAccept(s2, true);
+    for (int i = 0; i < starts.length; i++) {
+      a.addTransition(s1, s2, starts[i], ends[i]);
+    }
+    a.finishState();
+    return a;
+  }
+
   /**
    * Constructs sub-automaton corresponding to decimal numbers of length x.substring(n).length().
    */
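
The hunk above builds a character class as two states joined by one transition per range, so matching a single codepoint means checking whether any range contains it. The following is a self-contained sketch of that idea without Lucene on the classpath. `CharClassSketch`, `ofSet`, and `accepts` are hypothetical names for illustration, and the linear range scan is an assumption of this sketch, not how Lucene runs an automaton:

```java
import java.util.Objects;

/** Hypothetical stand-in for a two-state character-class automaton; not Lucene code. */
final class CharClassSketch {
  private final int[] starts;
  private final int[] ends;

  /** One inclusive range [starts[i], ends[i]] per transition from the start to the accept state. */
  CharClassSketch(int[] starts, int[] ends) {
    Objects.requireNonNull(starts);
    Objects.requireNonNull(ends);
    if (starts.length != ends.length) {
      throw new IllegalArgumentException("starts must match ends");
    }
    this.starts = starts.clone();
    this.ends = ends.clone();
  }

  /** A set of single codepoints is a class whose ranges each start and end on the same value. */
  static CharClassSketch ofSet(int... codepoints) {
    return new CharClassSketch(codepoints, codepoints);
  }

  /** Accepts only inputs that are exactly one codepoint long and fall inside some range. */
  boolean accepts(String input) {
    if (input.isEmpty() || input.codePointCount(0, input.length()) != 1) {
      return false;
    }
    int cp = input.codePointAt(0);
    for (int i = 0; i < starts.length; i++) {
      if (starts[i] <= cp && cp <= ends[i]) {
        return true;
      }
    }
    return false; // with zero ranges this always rejects, mirroring the empty automaton
  }

  public static void main(String[] args) {
    CharClassSketch hex = new CharClassSketch(new int[] {'0', 'a'}, new int[] {'9', 'f'});
    System.out.println(hex.accepts("7"));
    System.out.println(hex.accepts("g"));
    System.out.println(ofSet('x', 'y').accepts("y"));
  }
}
```

The number of states stays at two however many ranges are supplied, which is the contrast with composing the same class from a union of per-range automata.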
@@ -148,8 +148,7 @@ public static Automaton concatenate(List<Automaton> l) {
     }

     result.finishState();
-
-    return result;
+    return Operations.removeDeadStates(result);
   }

   /**
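
The change above routes the result of concatenate through dead-state removal. As a rough illustration of what "dead" means here, the sketch below treats a state as live only if it is reachable from the initial state and can itself reach an accept state. It is a plain graph search over an unlabeled adjacency array; `DeadStateSketch` and `liveStates` are hypothetical names, state 0 is assumed to be the initial state, and none of this is Lucene's implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration of live-state computation; not Lucene code. */
final class DeadStateSketch {
  /** edges[s] lists the targets of state s (labels omitted); state 0 is assumed initial. */
  static BitSet liveStates(int[][] edges, BitSet accept) {
    if (edges.length == 0) {
      return new BitSet();
    }
    BitSet start = new BitSet();
    start.set(0);
    BitSet live = search(edges, start);       // reachable from the initial state
    live.and(search(reverse(edges), accept)); // and able to reach an accept state
    return live;
  }

  /** Builds the reversed graph so a forward search from accept states finds their ancestors. */
  private static int[][] reverse(int[][] edges) {
    List<List<Integer>> rev = new ArrayList<>();
    for (int i = 0; i < edges.length; i++) {
      rev.add(new ArrayList<>());
    }
    for (int s = 0; s < edges.length; s++) {
      for (int t : edges[s]) {
        rev.get(t).add(s);
      }
    }
    int[][] out = new int[edges.length][];
    for (int i = 0; i < edges.length; i++) {
      out[i] = rev.get(i).stream().mapToInt(Integer::intValue).toArray();
    }
    return out;
  }

  /** Depth-first search returning every state reachable from the seed set. */
  private static BitSet search(int[][] edges, BitSet seeds) {
    BitSet seen = (BitSet) seeds.clone();
    Deque<Integer> work = new ArrayDeque<>();
    for (int s = seeds.nextSetBit(0); s >= 0; s = seeds.nextSetBit(s + 1)) {
      work.push(s);
    }
    while (!work.isEmpty()) {
      int s = work.pop();
      for (int t : edges[s]) {
        if (!seen.get(t)) {
          seen.set(t);
          work.push(t);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // 0 -> 1 -> 2 (accept); 1 -> 3 is a dead end; 4 -> 2 is unreachable from 0.
    int[][] edges = {{1}, {2, 3}, {}, {}, {2}};
    BitSet accept = new BitSet();
    accept.set(2);
    System.out.println(liveStates(edges, accept)); // {0, 1, 2}
  }
}
```

In the example, state 3 is dropped because it cannot reach an accept state and state 4 because nothing leads to it from the initial state.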