@RyanL1997 RyanL1997 commented Oct 22, 2025

Description

Fix unexpected shift of extraction for rex with nested capture groups in named groups

The rex command in PPL had a critical bug when a named capture group contained nested unnamed groups. Extracted values shifted out of position and were assigned to the wrong fields, producing incorrect results.

  • Root Cause: The code assumed named groups sit at sequential indices 1, 2, 3, ..., but nested unnamed groups also consume indices, so the named groups can land at non-sequential indices such as 1, 3, 5, ...
  • Solution: Bypass index calculation entirely by using Java's native named group extraction (matcher.group(groupName))

Example of the Bug

Query:

curl -X POST "localhost:9200/_plugins/_ppl" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "source=accounts | rex field=email \"(?<user>(amber|hattie|nanette)[a-z]*)@(?<domain>(pyrami|netagy|quility))\\.(?<tld>(com|org))\" | fields user, domain, tld | head 1"
    }'

Expected Result (correct):

["amberduke", "pyrami", "com"]

Actual Result (wrong):

["amberduke", "amber", "pyrami"]

Root Cause

When Java's regex engine processes a pattern like the one below, it assigns a group number to every capture group, named or unnamed, counting opening parentheses from left to right:

Pattern: (?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))

Group Assignment:

  • Group 0: Entire match
  • Group 1: (?<user>...) ← Named group "user"
  • Group 2: (amber|hattie) ← Unnamed nested group
  • Group 3: (?<domain>...) ← Named group "domain"
  • Group 4: (pyrami|netagy) ← Unnamed nested group
  • Group 5: (?<tld>...) ← Named group "tld"
  • Group 6: (com|org) ← Unnamed nested group
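
The numbering above can be checked directly with java.util.regex. This is a minimal sketch, not code from this PR: the class name GroupNumberingDemo and the sample input are assumptions chosen for illustration, and the pattern keeps [a-z]* inside the user group, as in the example query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingDemo {
    // Pattern modeled on the example query: each named group wraps one unnamed group.
    static final Pattern EMAIL = Pattern.compile(
        "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))");

    // Returns the text captured by groups 0..groupCount() for the first match,
    // or an empty list when the input does not match.
    static List<String> allGroups(String input) {
        Matcher m = EMAIL.matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> groups = allGroups("amberduke@pyrami.com");
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("Group " + i + ": " + groups.get(i));
        }
    }
}
```

For this input the named groups appear at indices 1, 3, and 5, while indices 2, 4, and 6 hold the values captured by the nested unnamed groups.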

Before the fix, the bug was in CalciteRelNodeVisitor.java (lines 265-321), where the code did the following:

  List<String> namedGroups = RegexCommonUtils.getNamedGroupCandidates(patternStr);
  // namedGroups = ["user", "domain", "tld"]

  for (int i = 0; i < namedGroups.size(); i++) {
      extractCall = PPLFuncImpTable.INSTANCE.resolve(
          context.rexBuilder,
          BuiltinFunctionName.REX_EXTRACT,
          fieldRex,
          context.rexBuilder.makeLiteral(patternStr),
          context.relBuilder.literal(i + 1));  // ← WRONG: Assumes sequential named groups
      // ...
  }

The code assumes named groups are at indices 1, 2, 3, ... but the actual indices are 1, 3, 5, ... due to the unnamed nested groups.

With the above buggy logic:

  • REX_EXTRACT(field, pattern, 1) → Gets Group 1 (?<user>...) = "amberduke" → CORRECT
  • REX_EXTRACT(field, pattern, 2) → Gets Group 2 (amber|hattie) = "amber" → WRONG (assigned to domain)
  • REX_EXTRACT(field, pattern, 3) → Gets Group 3 (?<domain>...) = "pyrami" → WRONG (assigned to tld)

The second and third extractions are wrong because indices 2 and 3 do not point at the named groups domain (index 3) and tld (index 5): the unnamed nested groups occupy the indices in between.

LogicalProject(
    user=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 1)],
    domain=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 2)],  -- Wrong!
    tld=[REX_EXTRACT($7, '(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))', 3)]     -- Wrong!
)
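
To contrast the two strategies outside of Calcite, the sketch below reproduces the positional lookup described above next to a lookup by name with matcher.group(groupName), which is the approach the description names as the fix. It is only an illustration: the class RexExtractSketch, its method names, and the hard-coded group list (which the real code obtains from RegexCommonUtils.getNamedGroupCandidates) are assumptions, not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RexExtractSketch {
    static final String PATTERN =
        "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))";

    // Named groups in declaration order (hard-coded here for illustration).
    static final List<String> NAMED_GROUPS = List.of("user", "domain", "tld");

    // Pre-fix behavior: the i-th named group is assumed to live at index i + 1.
    static List<String> extractByPosition(String input) {
        Matcher m = Pattern.compile(PATTERN).matcher(input);
        List<String> values = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i < NAMED_GROUPS.size(); i++) {
                values.add(m.group(i + 1));
            }
        }
        return values;
    }

    // Name-based lookup: the regex engine resolves each name to its real index.
    static List<String> extractByName(String input) {
        Matcher m = Pattern.compile(PATTERN).matcher(input);
        List<String> values = new ArrayList<>();
        if (m.find()) {
            for (String name : NAMED_GROUPS) {
                values.add(m.group(name));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String email = "amberduke@pyrami.com";
        System.out.println("by position: " + extractByPosition(email));
        System.out.println("by name:     " + extractByName(email));
    }
}
```

For the input amberduke@pyrami.com, the positional variant yields the shifted values from the "Actual Result" above, and the name-based variant yields the expected ones.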

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

matchCount++;
} else {
// If extractor returns null, it might indicate an error (like invalid group name)
// Stop processing to avoid infinite loop
Collaborator Author

I'm currently thinking about adding error handling here

Signed-off-by: Jialiang Liang <[email protected]>
@RyanL1997 RyanL1997 changed the title from "[BugFix] Fix the off-by-one error for rex with nested capture groups in named groups" to "[BugFix] Fix unexpected shift of extraction for rex with nested capture groups in named groups" Oct 25, 2025
Swiddis previously approved these changes Oct 28, 2025
fieldRex,
context.rexBuilder.makeLiteral(patternStr),
context.relBuilder.literal(i + 1),
context.rexBuilder.makeLiteral(groupName),
Collaborator

@dai-chen dai-chen Oct 28, 2025

I found there is a namedGroups() API in Matcher (since JDK 20?). If we can get correct index here, we don't need to modify the UDFs below? Alternatively we can move capture name -> index logic here from UDFs?

Collaborator Author

That is correct - the core issue is matching named groups to their correct indices, and Pattern.namedGroups() would be the perfect solution. However, I discovered that we're blocked by a compatibility constraint:

  • Pattern.namedGroups() was introduced in JDK 20
  • We need backward compatibility with JDK 11/17 for 2.19-dev

I agree that directly leveraging the Pattern.namedGroups() is the right architectural approach - we should definitely migrate to it when we fully upgrade to JDK 20+. At that point, it would be a simple one-line change in CalciteRelNodeVisitor.
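
For reference, a hedged sketch of what the name-to-index lookup could look like: Pattern.namedGroups() is public API from JDK 20 and returns a map from group name to group index. To keep the example runnable on JDK 11/17 as well, it calls the method reflectively and returns an empty Optional when the method is not available. The class NamedGroupIndexSketch and the reflective fallback are assumptions made for illustration; they are not part of this PR, which uses matcher.group(groupName) instead.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

public class NamedGroupIndexSketch {
    // Returns the name -> index map on runtimes where Pattern.namedGroups() is public
    // (JDK 20+), or Optional.empty() on older runtimes where getMethod cannot find it.
    @SuppressWarnings("unchecked")
    static Optional<Map<String, Integer>> namedGroupIndices(Pattern pattern) {
        try {
            Method namedGroups = Pattern.class.getMethod("namedGroups");
            return Optional.of((Map<String, Integer>) namedGroups.invoke(pattern));
        } catch (ReflectiveOperationException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(
            "(?<user>(amber|hattie)[a-z]*)@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))");
        Optional<Map<String, Integer>> indices = namedGroupIndices(p);
        if (indices.isPresent()) {
            System.out.println("named group indices: " + indices.get());
        } else {
            System.out.println("Pattern.namedGroups() is not available on this JDK");
        }
    }
}
```

On a JDK 20+ runtime the map for this pattern associates user with 1, domain with 3, and tld with 5, which are the indices the visitor would need to pass instead of i + 1.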

"Rex pattern must contain at least one named capture group");
}

// TODO: Once JDK 20+ is supported, consider using Pattern.namedGroups() API for more efficient
Collaborator Author

For now, I added a TODO here @dai-chen

@RyanL1997 RyanL1997 merged commit 0c1ec27 into opensearch-project:main Oct 29, 2025
51 of 52 checks passed
@RyanL1997 RyanL1997 deleted the rex-extract-fix branch October 29, 2025 00:34
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 29, 2025
…ture groups in named groups (#4641)

(cherry picked from commit 0c1ec27)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
RyanL1997 pushed a commit that referenced this pull request Oct 29, 2025
…ture groups in named groups (#4641) (#4692)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels

  • backport 2.19-dev
  • bug: Something isn't working
  • bugFix
  • PPL: Piped processing language

Development

Successfully merging this pull request may close these issues.

[BUG] rex command off-by-one error with nested capture groups in named groups

3 participants