Fix Operations.reverse() to not add non-deterministic dead states #14212

rmuir (Member) commented Feb 6, 2025

Operations.reverse() can create dead states that are also non-deterministic. This is worse than merely creating dead states, because it causes Automaton.isDeterministic() to return false, i.e. the automaton is treated as an NFA. This can lead to unnecessary det() calls, especially as the automaton gets bigger or more complex.

Operations.reverse() serves multiple use-cases today:

  • Search engine use-cases trying to speed up leading wildcards
  • Testing/academic use-case (Brzozowski minimize)

In the search engine use-case, it is used by both Lucene and Solr.

Lucene uses this method for infinite automata (e.g. leading wildcard) to compute a common suffix. If the expression has one (e.g. "*foo"), we reverse the automaton as part of computing it, because such expressions require evaluating many candidates. A memcmp against the suffix can then be used to filter out candidates quickly.
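To make the suffix check concrete, here is a minimal sketch of the idea. It is not Lucene's actual code: the class `SuffixFilterSketch` and its `endsWith` method are hypothetical names, and the common suffix is assumed to have been computed already (that is the step where the reversed automaton is used).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: rejects candidate terms whose
// trailing bytes differ from a precomputed common suffix.
final class SuffixFilterSketch {

  /** Returns true if {@code term} ends with {@code suffix}, comparing raw bytes. */
  static boolean endsWith(byte[] term, byte[] suffix) {
    if (suffix.length > term.length) {
      return false; // term is too short to carry the suffix
    }
    int start = term.length - suffix.length;
    // Range comparison (Java 9+), the Java counterpart of a memcmp.
    return Arrays.equals(term, start, term.length, suffix, 0, suffix.length);
  }

  public static void main(String[] args) {
    // Assumed common suffix for an expression like "*foo".
    byte[] suffix = "foo".getBytes(StandardCharsets.UTF_8);
    for (String term : new String[] {"barfoo", "foobar", "foo", "fo"}) {
      byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
      // Only terms passing this cheap check would need the full automaton.
      System.out.println(term + " passes suffix check: " + endsWith(bytes, suffix));
    }
  }
}
```

Terms that fail the byte comparison can be skipped without running the automaton at all, which is where the speedup for leading wildcards would come from.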

Solr uses this method for a feature where users can opt in to also indexing the reversed form of every term, with a special marker to prevent false positives from the extra reversed terms. At query time, the reversed wildcard queries can be turned into something that looks more like a prefix query: https://github.com/apache/solr/blob/bca4cd630b9cff66ecc0431397a99f5289a6462b/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1291-L1324
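The reversed-term trick can be sketched roughly as follows. This is an illustration only, not Solr's implementation: the class `ReversedWildcardSketch`, its methods, and the choice of `'\u0001'` as the marker are assumptions made for the example, and it only handles patterns of the form `*literal`.

```java
// Hypothetical illustration, not Solr code: index a marked, reversed copy of
// each term so that a leading wildcard becomes a prefix match.
final class ReversedWildcardSketch {

  // Assumed marker character; it keeps reversed terms apart from normal ones.
  static final char MARKER = '\u0001';

  /** Extra form that would be indexed next to the original term. */
  static String indexedReversedForm(String term) {
    return MARKER + new StringBuilder(term).reverse().toString();
  }

  /** Rewrites "*literal" into a prefix pattern over the reversed forms. */
  static String rewriteLeadingWildcard(String pattern) {
    boolean simple = pattern.startsWith("*")
        && pattern.indexOf('*', 1) < 0
        && pattern.indexOf('?') < 0;
    if (!simple) {
      throw new IllegalArgumentException("sketch only handles *literal: " + pattern);
    }
    String literal = pattern.substring(1);
    return MARKER + new StringBuilder(literal).reverse().toString() + "*";
  }

  /** Evaluates a pattern of the form "prefix*" against one indexed form. */
  static boolean matchesPrefixPattern(String indexedForm, String prefixPattern) {
    String prefix = prefixPattern.substring(0, prefixPattern.length() - 1);
    return indexedForm.startsWith(prefix);
  }

  public static void main(String[] args) {
    String query = rewriteLeadingWildcard("*foo");
    for (String term : new String[] {"barfoo", "foobar"}) {
      String reversed = indexedReversedForm(term);
      System.out.println(term + " matches *foo: " + matchesPrefixPattern(reversed, query));
    }
    // An ordinary term that merely looks reversed lacks the marker.
    System.out.println("oofx matches *foo: " + matchesPrefixPattern("oofx", query));
  }
}
```

Because the rewritten pattern starts with the marker, an unmarked term such as "oofx" cannot match it, which is the false-positive protection described above.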

Move Operations.reverse(Automaton, Set) to AutomatonTestUtil, since it is too difficult to improve while also supporting this hook.

Fix Operations.reverse(Automaton) to remove dead states.
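
To show why the dead states matter, here is a toy model of the problem. It is a sketch under stated assumptions, not Lucene's `Automaton` or the actual change in this PR: `ReverseSketch` and all of its methods are hypothetical, labels are single characters rather than ranges, and state 0 is always the initial state. A state counts as dead when it is unreachable from the initial state or cannot reach an accept state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy automaton for illustration only. State 0 is always the initial state.
final class ReverseSketch {
  final int numStates;
  final BitSet accept = new BitSet();
  final List<int[]> transitions = new ArrayList<>(); // {source, label, dest}

  ReverseSketch(int numStates) {
    this.numStates = numStates;
  }

  ReverseSketch addTransition(int source, char label, int dest) {
    transitions.add(new int[] {source, label, dest});
    return this;
  }

  /** Naive reversal: old state s becomes s + 1, new state 0 is initial. */
  ReverseSketch reverseRaw() {
    ReverseSketch r = new ReverseSketch(numStates + 1);
    r.accept.set(1); // the old initial state becomes an accept state
    if (accept.get(0)) {
      r.accept.set(0); // the empty string stays accepted
    }
    for (int[] t : transitions) {
      r.transitions.add(new int[] {t[2] + 1, t[1], t[0] + 1}); // flipped edge
      if (accept.get(t[2])) {
        // The new initial state copies edges leaving the old accept states.
        r.transitions.add(new int[] {0, t[1], t[0] + 1});
      }
    }
    return r;
  }

  /** True if no state has two transitions on the same label. */
  boolean isDeterministic() {
    Set<Long> seen = new HashSet<>();
    for (int[] t : transitions) {
      if (!seen.add(((long) t[0] << 32) | t[1])) {
        return false;
      }
    }
    return true;
  }

  /** Keeps only states reachable from state 0 that can also reach an accept state. */
  ReverseSketch removeDeadStates() {
    BitSet initial = new BitSet();
    initial.set(0);
    BitSet live = reach(initial, true);
    live.and(reach(accept, false));
    ReverseSketch r = new ReverseSketch(numStates); // state numbering is kept
    r.accept.or(accept);
    r.accept.and(live);
    for (int[] t : transitions) {
      if (live.get(t[0]) && live.get(t[2])) {
        r.transitions.add(t);
      }
    }
    return r;
  }

  /** States reachable from {@code start}, following edges forward or backward. */
  private BitSet reach(BitSet start, boolean forward) {
    BitSet seen = (BitSet) start.clone();
    Deque<Integer> work = new ArrayDeque<>();
    for (int s = start.nextSetBit(0); s >= 0; s = start.nextSetBit(s + 1)) {
      work.push(s);
    }
    while (!work.isEmpty()) {
      int s = work.pop();
      for (int[] t : transitions) {
        int from = forward ? t[0] : t[2];
        int to = forward ? t[2] : t[0];
        if (from == s && !seen.get(to)) {
          seen.set(to);
          work.push(to);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Deterministic input: 0 -a-> 1 (accept), plus a dead sink state 2.
    ReverseSketch a = new ReverseSketch(3);
    a.accept.set(1);
    a.addTransition(0, 'a', 1).addTransition(0, 'b', 2).addTransition(1, 'b', 2);
    ReverseSketch raw = a.reverseRaw();
    System.out.println("raw reverse deterministic: " + raw.isDeterministic());
    System.out.println("after removing dead states: "
        + raw.removeDeadStates().isDeterministic());
  }
}
```

In this example the sink state has two incoming `b` edges, so after reversal it has two outgoing `b` edges and the determinism check fails, even though that state is unreachable. Dropping the dead states leaves a single live transition and the check passes again.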
