Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458

wants to merge 1 commit into from

Conversation

jcmcote
Copy link
Contributor

@jcmcote jcmcote commented Apr 2, 2016

All the functions using the java.util.regex.Matcher are currently creating Java string objects to pass into the matcher.reset().
However this creates unnecessary copy of the bytes and a Java string object.
The matcher uses a CharSequence, so instead of making a copy we can create an adapter from the DrillBuffer to the CharSequence interface.
Gains of 25% in execution speed are possible when going over VARCHAR of 36 chars. The gain will be proportional to the size of the VARCHAR.


@Override
public char charAt(int index) {
return (char) buffer.getByte(start + index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to check if the index falls into the legal bounds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done.

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 10, 2016

Except for the comments, LGTM. Do you run the unit test ?

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 10, 2016

And can you share the details regarding the performance improvement? For example, how the setting and data look like? Observed % of improvements?

@jcmcote
Copy link
Contributor Author

jcmcote commented Apr 10, 2016

I have updated the JIRA ticket with the latest patch.
Using this approach you can expect 25% less time to execute the like UDF. I've added details in the JIRA ticket.
All tests pass.

@jcmcote
Copy link
Contributor Author

jcmcote commented Apr 23, 2016

Just wondering when this patch will be included in the drill code base. I updated the JIRA issue with a new patch file. Thank you

@jacques-n
Copy link
Contributor

Yes, we'll get it in there. I was thinking that Hsuan was moving this along. Let me see if one of us can get this in. The contribution is very much appreciated. Sorry that people haven't been more helpful.

@jcmcote
Copy link
Contributor Author

jcmcote commented Apr 23, 2016

I guess it slipped through the cracks. No worries.

I did not convert the "isnumeric" UDF to use a CharSequence wrapper since I don't really understand how the templating mechanism works. How difficult would it be to change isnumeric to use a wrapper?

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 23, 2016

@jcmcote Sorry for the delay. Let me run some tests and commit it if it passes the tests.
Your contribution is highly appreciated.

@jcmcote
Copy link
Contributor Author

jcmcote commented Apr 26, 2016

just following up, did the tests all pass?

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 26, 2016

Yes, it passed the tests. I will push it shortly.

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 27, 2016

Just pushed to master; Thanks for the contributions.

@jcmcote
Copy link
Contributor Author

jcmcote commented Apr 29, 2016

Hey Sean,

You did not apply the correct patch. I made the code review changes you asked and updated the patch in the jira ticket. But somehow the code that made it into the git repo is the initial patch.

Have a look at the one in the jira ticket you will see it has comments in the CharSequenceWrapper.

@hsuanyi
Copy link
Contributor

hsuanyi commented Apr 29, 2016

Hi @jcmcote,
I see. But can you do us a favor. Regex_replace()'s behavior changes after your patch
https://issues.apache.org/jira/browse/DRILL-4645

Can you help fix this? And would you mind merging the missing piece with the new patch?

@jcmcote
Copy link
Contributor Author

jcmcote commented May 5, 2016

Hey @hsuanyi
ok, fixed the regex_replace and added a test case. There's a new patch on the ticket. Please apply it and commit.

thanks

@jcmcote
Copy link
Contributor Author

jcmcote commented May 10, 2016

@hsuanyi just following up to see if the patch was applied.

@jinfengni
Copy link
Contributor

@jcmcote
I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part of conversation and just could not find the link)?

Thanks for your contribution to Drill.

@jcmcote
Copy link
Contributor Author

jcmcote commented May 12, 2016

I have attached the patch to the Jira ticket.

I'm not sure of the usual process.. but from what I was reading we can
create patches and attache them to Jira tickets.

On Thu, May 12, 2016 at 2:10 PM, Jinfeng Ni [email protected]
wrote:

@jcmcote https://github.com/jcmcote
I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part
of conversation and just could not find the link)?

Thanks for your contribution to Drill.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#458 (comment)

@jinfengni
Copy link
Contributor

@jcmcote ,

(copy from my comment in DRILL-4573)

I re-visited your first patch. Looks like that the change you made would cause incorrect result when the input string is a udf-8 with each character consisting of multiple bytes. In particular, the original implementation would encode the byte array with udf-8 (which is the default encoding in drill). However, in your CharSequenceWrapper, you will treat each byte as a character. This will cause incorrect result for case when a character is represented by >1 bytes.

For instance, look at the following example, the first query of regexp_matches will produce wrong result.

{code:sql}
select regexp_matches('München', 'München') res3 from (values(1));
+--------+
| res3 |
+--------+
| false |
+--------+
1 row selected (0.148 seconds)
0: jdbc:drill:zk=local> select regexp_matches('abc', 'abc') from (values(1));
+---------+
| EXPR$0 |
+---------+
| true |
+---------+
1 row selected (0.189 seconds)
0: jdbc:drill:zk=local> select 'München' = 'München' res1 from (values(1));
+-------+
| res1 |
+-------+
| true |
+-------+
1 row selected (0.186 seconds)
{code:sql}

Here is the result for 1st query, without your patch

{code:sql}
select regexp_matches('München', 'München') res3 from (values(1));
+-------+
| res3 |
+-------+
| true |
+-------+
{code:sql}

I think you should modify CharSequenceWrapper, so that the encoding method is honored.

@jcmcote
Copy link
Contributor Author

jcmcote commented May 19, 2016

I know where the issue lies. It's related to the mapping between the unicode encoded bytes from the drillbuffer and the conversion to chars. I'll will address the mapping issue and get back to you.

@jinfengni
Copy link
Contributor

@jcmcote , I opened DRILL-4688 to track this incorrect regression. Please use DRILL-4688 to submit either a patch or pull request to address this issue.

As part of the verification for this issue, QA plan to add a set of test suite to cover the cases of multi-byte characters. Previously, there is no such test coverage and hence we are not able to catch this regression.

@jinfengni
Copy link
Contributor

@jcmcote , just wanna to check the current status of your latest fix.

FYI, you may consider using couple of existing methods [1] [2], when you decode the byte arrays as UDF8 characters, in CharaSequenceWrapper.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L77

[2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L28

@kkhatua
Copy link
Contributor

kkhatua commented Jan 9, 2018

@jcmcote Did you get a chance to follow up on @jinfengni 's comment on this PR?

@cgivre cgivre added the revisit label Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants