-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR #458
base: master
Are you sure you want to change the base?
Conversation
|
||
@Override | ||
public char charAt(int index) { | ||
return (char) buffer.getByte(start + index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to check if the index falls into the legal bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Done.
Except for the comments, LGTM. Do you run the unit test ? |
And can you share the details regarding the performance improvement? For example, how the setting and data look like? Observed % of improvements? |
I have updated the JIRA ticket with the latest patch. |
Just wondering when this patch will be included in the drill code base. I updated the JIRA issue with a new patch file. Thank you |
Yes, we'll get it in there. I was thinking that Hsuan was moving this along. Let me see if one of us can get this in. The contribution is very much appreciated. Sorry that people haven't been more helpful. |
I guess it slipped through the cracks. No worries. I did not convert the "isnumeric" UDF to use a CharSequence wrapper since I don't really understand how the templating mechanism works. How difficult would it be to change isnumeric to use a wrapper? |
@jcmcote Sorry for the delay. Let me run some tests and commit it if it passes the tests. |
just following up, did the tests all pass? |
Yes, it passed the tests. I will push it shortly. |
Just pushed to master; Thanks for the contributions. |
Hey Sean, You did not apply the correct patch. I made the code review changes you asked and updated the patch in the jira ticket. But somehow the code that made it into the git repo is the initial patch. Have a look at the one in the jira ticket you will see it has comments in the CharSequenceWrapper. |
Hi @jcmcote, Can you help fix this? And would you mind merging the missing piece with the new patch? |
Hey @hsuanyi thanks |
@hsuanyi just following up to see if the patch was applied. |
@jcmcote Can you please submit the new patch as a new PR (or maybe I miss some part of conversation and just could not find the link)? Thanks for your contribution to Drill. |
I have attached the patch to the Jira ticket. I'm not sure of the usual process.. but from what I was reading we can On Thu, May 12, 2016 at 2:10 PM, Jinfeng Ni [email protected]
|
@jcmcote , (copy from my comment in DRILL-4573) I re-visited your first patch. Looks like that the change you made would cause incorrect result when the input string is a udf-8 with each character consisting of multiple bytes. In particular, the original implementation would encode the byte array with udf-8 (which is the default encoding in drill). However, in your CharSequenceWrapper, you will treat each byte as a character. This will cause incorrect result for case when a character is represented by >1 bytes. For instance, look at the following example, the first query of regexp_matches will produce wrong result. {code:sql} Here is the result for 1st query, without your patch {code:sql} I think you should modify CharSequenceWrapper, so that the encoding method is honored. |
I know where the issue lies. It's related to the mapping between the unicode encoded bytes from the drillbuffer and the conversion to chars. I'll will address the mapping issue and get back to you. |
@jcmcote , I opened DRILL-4688 to track this incorrect regression. Please use DRILL-4688 to submit either a patch or pull request to address this issue. As part of the verification for this issue, QA plan to add a set of test suite to cover the cases of multi-byte characters. Previously, there is no such test coverage and hence we are not able to catch this regression. |
@jcmcote , just wanna to check the current status of your latest fix. FYI, you may consider using couple of existing methods [1] [2], when you decode the byte arrays as UDF8 characters, in CharaSequenceWrapper. |
@jcmcote Did you get a chance to follow up on @jinfengni 's comment on this PR? |
All the functions using the java.util.regex.Matcher are currently creating Java string objects to pass into the matcher.reset().
However this creates unnecessary copy of the bytes and a Java string object.
The matcher uses a CharSequence, so instead of making a copy we can create an adapter from the DrillBuffer to the CharSequence interface.
Gains of 25% in execution speed are possible when going over VARCHAR of 36 chars. The gain will be proportional to the size of the VARCHAR.