[BugFix] Fix unexpected shift of extraction for rex with nested capture groups in named groups
#4641
    matchCount++;
} else {
    // If extractor returns null, it might indicate an error (like invalid group name)
    // Stop processing to avoid infinite loop
I'm currently thinking about adding error handling here.
Signed-off-by: Jialiang Liang <[email protected]>
1a8f836 to 25629e0
fieldRex,
context.rexBuilder.makeLiteral(patternStr),
context.relBuilder.literal(i + 1),
context.rexBuilder.makeLiteral(groupName),
I found there is a namedGroups() API in Matcher (since JDK 20?). If we can get the correct index here, we wouldn't need to modify the UDFs below. Alternatively, could we move the capture name -> index logic here from the UDFs?
That is correct - the core issue is matching named groups to their correct indices, and Pattern.namedGroups() would be the perfect solution. However, I discovered that we're blocked by a compatibility constraint:
- Pattern.namedGroups() was introduced in JDK 20
- We need backward compatibility with JDK 11/17 for 2.19-dev
I agree that directly leveraging the Pattern.namedGroups() is the right architectural approach - we should definitely migrate to it when we fully upgrade to JDK 20+. At that point, it would be a simple one-line change in CalciteRelNodeVisitor.
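To make the trade-off in this thread concrete, here is a minimal sketch of how a name -> index map could be built on JDK 11/17 by scanning the pattern text. It is not the code in this PR: the class NamedGroupIndexer and the method namedGroupIndices are hypothetical names, and the scanner is deliberately simple (it handles backslash escapes and character classes, but not \Q...\E quoting or a literal ']' at the start of a class). On JDK 20+, Pattern.namedGroups() returns the same kind of Map<String, Integer> directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of this PR.
final class NamedGroupIndexer {

    /**
     * Maps each named capture group to its 1-based group number by counting
     * every capturing '(' from left to right, the same order the regex engine uses.
     * Limitations (assumed acceptable for a sketch): no \Q...\E support.
     */
    static Map<String, Integer> namedGroupIndices(String regex) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        int groupCount = 0;   // capturing groups seen so far (named and unnamed)
        int classDepth = 0;   // > 0 while inside [...] where '(' is a literal

        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') { i++; continue; }                       // skip the escaped character
            if (c == '[') { classDepth++; continue; }
            if (c == ']' && classDepth > 0) { classDepth--; continue; }
            if (classDepth > 0 || c != '(') { continue; }

            boolean special = i + 1 < regex.length() && regex.charAt(i + 1) == '?';
            if (!special) {
                groupCount++;                                       // plain unnamed capturing group
                continue;
            }
            // "(?<name>" is a named capturing group; "(?<=" and "(?<!" are lookbehinds,
            // and every other "(?" form is non-capturing.
            if (i + 3 < regex.length()
                    && regex.charAt(i + 2) == '<'
                    && Character.isLetter(regex.charAt(i + 3))) {
                int end = regex.indexOf('>', i + 3);
                if (end > 0) {
                    groupCount++;
                    indices.put(regex.substring(i + 3, end), groupCount);
                }
            }
        }
        return indices;
    }
}
```

With a map like this, the visitor could pass the real group number instead of i + 1. Passing the group name and resolving it inside the UDF with matcher.group(groupName) avoids the hand-written scanner entirely, which is one reason to prefer it until JDK 20+ is the baseline.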
    "Rex pattern must contain at least one named capture group");
}

// TODO: Once JDK 20+ is supported, consider using Pattern.namedGroups() API for more efficient
For now, I added a TODO here @dai-chen
Description
Fix unexpected shift of extraction for rex with nested capture groups in named groups
The rex command in PPL had a critical bug when using named capture groups that contained nested unnamed groups. This caused extracted field values to shift by one position, producing incorrect results.
- [...] 1,2,3... but nested groups create non-sequential indices 1,3,5...
- [...] matcher.group(groupName))
Example of the Bug
Query:
Expected Result (correct):
Actual Result (wrong):
Root Cause
When Java's regex engine processes the pattern (?<user>(amber|hattie))[a-z]*, it assigns group numbers to ALL capture groups (named and unnamed):
Pattern:
(?<user>(amber|hattie))[a-z]*@(?<domain>(pyrami|netagy))\.(?<tld>(com|org))
Group Assignment:
- Group 1: (?<user>...) ← Named group "user"
- Group 2: (amber|hattie) ← Unnamed nested group
- Group 3: (?<domain>...) ← Named group "domain"
- Group 4: (pyrami|netagy) ← Unnamed nested group
- Group 5: (?<tld>...) ← Named group "tld"
- Group 6: (com|org) ← Unnamed nested group
Before the fix, the bug is in CalciteRelNodeVisitor.java (lines 265-321). The code assumes named groups are at indices 1, 2, 3, ... but the actual indices are 1, 3, 5, ... due to the unnamed nested groups.
With this buggy logic:
- REX_EXTRACT(field, pattern, 1) → Gets Group 1 (?<user>...) = "amber" → CORRECT
- REX_EXTRACT(field, pattern, 2) → Gets Group 2 (amber|hattie) = "amber" → WRONG
- REX_EXTRACT(field, pattern, 3) → Gets Group 3 (?<domain>...) = "pyrami" → WRONG
The second and third extractions return the wrong group because the unnamed nested groups shift the indices of every named group after the first.
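The shift described above can be reproduced with plain java.util.regex, outside of PPL. The following is a minimal sketch, not the code from CalciteRelNodeVisitor: the class RexGroupShiftDemo, both method names, and the sample input "amberduke@pyrami.com" are assumptions made for illustration. It contrasts sequential-index extraction with name-based extraction via matcher.group(groupName).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class RexGroupShiftDemo {

    static final Pattern PATTERN = Pattern.compile(
        "(?<user>(amber|hattie))[a-z]*@(?<domain>(pyrami|netagy))\\.(?<tld>(com|org))");

    /** Buggy approach: assumes the i-th named group is capture group i + 1. */
    static Map<String, String> extractBySequentialIndex(String input, String... names) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PATTERN.matcher(input);
        if (m.find()) {
            for (int i = 0; i < names.length; i++) {
                out.put(names[i], m.group(i + 1)); // hits nested unnamed groups
            }
        }
        return out;
    }

    /** Name-based approach: the regex engine resolves each name to its real index. */
    static Map<String, String> extractByName(String input, String... names) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PATTERN.matcher(input);
        if (m.find()) {
            for (String name : names) {
                out.put(name, m.group(name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String email = "amberduke@pyrami.com"; // assumed sample input
        // prints {user=amber, domain=amber, tld=pyrami}
        System.out.println(extractBySequentialIndex(email, "user", "domain", "tld"));
        // prints {user=amber, domain=pyrami, tld=com}
        System.out.println(extractByName(email, "user", "domain", "tld"));
    }
}
```

Note that the user field is "amber" in both cases: the named group wraps only (amber|hattie), and the trailing [a-z]* sits outside it.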
Related Issues
- rex command off-by-one error with nested capture groups in named groups #4466
Check List
- [...] --signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.