DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

jcmcote · 2016-04-02T03:44:07Z

All the functions using the java.util.regex.Matcher are currently creating Java string objects to pass into the matcher.reset().
However this creates unnecessary copy of the bytes and a Java string object.
The matcher uses a CharSequence, so instead of making a copy we can create an adapter from the DrillBuffer to the CharSequence interface.
Gains of 25% in execution speed are possible when going over VARCHAR of 36 chars. The gain will be proportional to the size of the VARCHAR.

hsuanyi · 2016-04-10T02:49:51Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java

+
+    @Override
+    public char charAt(int index) {
+        return (char) buffer.getByte(start + index);


We might need to check if the index falls into the legal bounds.

Sounds good. Done.

hsuanyi · 2016-04-10T02:58:46Z

Except for the comments, LGTM. Do you run the unit test ?

hsuanyi · 2016-04-10T03:02:35Z

And can you share the details regarding the performance improvement? For example, how the setting and data look like? Observed % of improvements?

jcmcote · 2016-04-10T18:54:06Z

I have updated the JIRA ticket with the latest patch.
Using this approach you can expect 25% less time to execute the like UDF. I've added details in the JIRA ticket.
All tests pass.

jcmcote · 2016-04-23T01:32:03Z

Just wondering when this patch will be included in the drill code base. I updated the JIRA issue with a new patch file. Thank you

jacques-n · 2016-04-23T01:48:33Z

Yes, we'll get it in there. I was thinking that Hsuan was moving this along. Let me see if one of us can get this in. The contribution is very much appreciated. Sorry that people haven't been more helpful.

jcmcote · 2016-04-23T01:56:07Z

I guess it slipped through the cracks. No worries.

I did not convert the "isnumeric" UDF to use a CharSequence wrapper since I don't really understand how the templating mechanism works. How difficult would it be to change isnumeric to use a wrapper?

hsuanyi · 2016-04-23T02:08:43Z

@jcmcote Sorry for the delay. Let me run some tests and commit it if it passes the tests.
Your contribution is highly appreciated.

jcmcote · 2016-04-26T22:17:19Z

just following up, did the tests all pass?

hsuanyi · 2016-04-26T22:20:37Z

Yes, it passed the tests. I will push it shortly.

hsuanyi · 2016-04-27T00:32:51Z

Just pushed to master; Thanks for the contributions.

jcmcote · 2016-04-29T01:32:56Z

Hey Sean,

You did not apply the correct patch. I made the code review changes you asked and updated the patch in the jira ticket. But somehow the code that made it into the git repo is the initial patch.

Have a look at the one in the jira ticket you will see it has comments in the CharSequenceWrapper.

hsuanyi · 2016-04-29T16:43:47Z

Hi @jcmcote,
I see. But can you do us a favor. Regex_replace()'s behavior changes after your patch
https://issues.apache.org/jira/browse/DRILL-4645

Can you help fix this? And would you mind merging the missing piece with the new patch?

jcmcote · 2016-05-05T22:24:55Z

Hey @hsuanyi
ok, fixed the regex_replace and added a test case. There's a new patch on the ticket. Please apply it and commit.

thanks

jcmcote · 2016-05-10T02:53:33Z

@hsuanyi just following up to see if the patch was applied.

jinfengni · 2016-05-12T18:10:27Z

@jcmcote
I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part of conversation and just could not find the link)?

Thanks for your contribution to Drill.

jcmcote · 2016-05-12T23:02:00Z

I have attached the patch to the Jira ticket.

I'm not sure of the usual process.. but from what I was reading we can
create patches and attache them to Jira tickets.

On Thu, May 12, 2016 at 2:10 PM, Jinfeng Ni [email protected]
wrote:

@jcmcote https://github.com/jcmcote
I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part
of conversation and just could not find the link)?

Thanks for your contribution to Drill.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#458 (comment)

jinfengni · 2016-05-17T21:25:22Z

@jcmcote ,

(copy from my comment in DRILL-4573)

I re-visited your first patch. Looks like that the change you made would cause incorrect result when the input string is a udf-8 with each character consisting of multiple bytes. In particular, the original implementation would encode the byte array with udf-8 (which is the default encoding in drill). However, in your CharSequenceWrapper, you will treat each byte as a character. This will cause incorrect result for case when a character is represented by >1 bytes.

For instance, look at the following example, the first query of regexp_matches will produce wrong result.

{code:sql}
select regexp_matches('München', 'München') res3 from (values(1));
+--------+
| res3 |
+--------+
| false |
+--------+
1 row selected (0.148 seconds)
0: jdbc:drill:zk=local> select regexp_matches('abc', 'abc') from (values(1));
+---------+
| EXPR$0 |
+---------+
| true |
+---------+
1 row selected (0.189 seconds)
0: jdbc:drill:zk=local> select 'München' = 'München' res1 from (values(1));
+-------+
| res1 |
+-------+
| true |
+-------+
1 row selected (0.186 seconds)
{code:sql}

Here is the result for 1st query, without your patch

{code:sql}
select regexp_matches('München', 'München') res3 from (values(1));
+-------+
| res3 |
+-------+
| true |
+-------+
{code:sql}

I think you should modify CharSequenceWrapper, so that the encoding method is honored.

jcmcote · 2016-05-19T00:52:42Z

I know where the issue lies. It's related to the mapping between the unicode encoded bytes from the drillbuffer and the conversion to chars. I'll will address the mapping issue and get back to you.

jinfengni · 2016-05-19T17:23:25Z

@jcmcote , I opened DRILL-4688 to track this incorrect regression. Please use DRILL-4688 to submit either a patch or pull request to address this issue.

As part of the verification for this issue, QA plan to add a set of test suite to cover the cases of multi-byte characters. Previously, there is no such test coverage and hence we are not able to catch this regression.

jinfengni · 2016-05-25T17:31:38Z

@jcmcote , just wanna to check the current status of your latest fix.

FYI, you may consider using couple of existing methods [1] [2], when you decode the byte arrays as UDF8 characters, in CharaSequenceWrapper.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L77

[2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L28

kkhatua · 2018-01-09T19:52:22Z

@jcmcote Did you get a chance to follow up on @jinfengni 's comment on this PR?

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR

71b35ec

hsuanyi reviewed Apr 10, 2016
View reviewed changes

cgivre added the revisit label Feb 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

jcmcote commented Apr 2, 2016

hsuanyi Apr 10, 2016

jcmcote Apr 10, 2016

hsuanyi commented Apr 10, 2016

hsuanyi commented Apr 10, 2016

jcmcote commented Apr 10, 2016

jcmcote commented Apr 23, 2016

jacques-n commented Apr 23, 2016

jcmcote commented Apr 23, 2016

hsuanyi commented Apr 23, 2016

jcmcote commented Apr 26, 2016

hsuanyi commented Apr 26, 2016

hsuanyi commented Apr 27, 2016

jcmcote commented Apr 29, 2016

hsuanyi commented Apr 29, 2016

jcmcote commented May 5, 2016 •

edited

Loading

jcmcote commented May 10, 2016

jinfengni commented May 12, 2016

jcmcote commented May 12, 2016

jinfengni commented May 17, 2016

jcmcote commented May 19, 2016

jinfengni commented May 19, 2016

jinfengni commented May 25, 2016

kkhatua commented Jan 9, 2018

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

Are you sure you want to change the base?

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

Conversation

jcmcote commented Apr 2, 2016

hsuanyi Apr 10, 2016

Choose a reason for hiding this comment

jcmcote Apr 10, 2016

Choose a reason for hiding this comment

hsuanyi commented Apr 10, 2016

hsuanyi commented Apr 10, 2016

jcmcote commented Apr 10, 2016

jcmcote commented Apr 23, 2016

jacques-n commented Apr 23, 2016

jcmcote commented Apr 23, 2016

hsuanyi commented Apr 23, 2016

jcmcote commented Apr 26, 2016

hsuanyi commented Apr 26, 2016

hsuanyi commented Apr 27, 2016

jcmcote commented Apr 29, 2016

hsuanyi commented Apr 29, 2016

jcmcote commented May 5, 2016 • edited Loading

jcmcote commented May 10, 2016

jinfengni commented May 12, 2016

jcmcote commented May 12, 2016

jinfengni commented May 17, 2016

jcmcote commented May 19, 2016

jinfengni commented May 19, 2016

jinfengni commented May 25, 2016

kkhatua commented Jan 9, 2018

jcmcote commented May 5, 2016 •

edited

Loading