add-prefix-suffix-ngram-udf #438
why there is |
fuction-1 include item docs listed in fuction |
| -- N-gram approach (fast - with prefiltering) | ||
| SELECT * FROM documents | ||
| WHERE NGRAM(content, 2) LIKE '%ap%' |
Why would we use LIKE here? We can simply do NGRAM(content, 2) = 'ap'?
+1
Thank you Ankit, will address this.
| | -------- | | ||
| | fas,ast | | ||
| ### Prefix Operations |
Remove this; it is unrelated.
Not sure I follow. Why is this unrelated? Don't we also want docs for prefix?
| | ------------ | | ||
| | Apache Pinot | | ||
| ### Advanced Text Matching with N-grams |
I think this is unrelated.
Please address the comment, thanks.
| ## Signature | ||
| > NGRAM(col, n) |
unique_ngrams(col, n)
addressed
| > NGRAM(col, n) | ||
| > | ||
| > PREFIX(col, prefixString) |
prefixes
addressed
| > | ||
| > PREFIX(col, prefixString) | ||
| > | ||
| > SUFFIX(col, suffixString) |
suffixes
addressed
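To make the renamed `prefixes` and `suffixes` functions concrete, here is a minimal Python sketch of what they might compute. The exact Pinot signatures are not shown in this thread, so the `max_length` parameter and the "every prefix or suffix of length 1 up to `max_length`" semantics are assumptions made for illustration only.

```python
def prefixes(value: str, max_length: int) -> list[str]:
    """Return prefixes of `value` with lengths 1..max_length (assumed semantics)."""
    limit = min(max_length, len(value))
    return [value[:i] for i in range(1, limit + 1)]


def suffixes(value: str, max_length: int) -> list[str]:
    """Return suffixes of `value` with lengths 1..max_length (assumed semantics)."""
    limit = min(max_length, len(value))
    return [value[-i:] for i in range(1, limit + 1)]


if __name__ == "__main__":
    print(prefixes("pinot", 3))  # ['p', 'pi', 'pin']
    print(suffixes("pinot", 3))  # ['t', 'ot', 'not']
```

Storing these values in a multi-value column would let a prefix or suffix search become an equality lookup rather than a pattern scan, which appears to be the motivation for the UDFs.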
| ```sql | ||
| SELECT NGRAM('Apache pinot', 2) AS bigrams | ||
| FROM myTable | ||
| ``` |
Add a table config with "transformationConfig" instead of the SQL UDF.
Update the function name from ngram to unique_ngrams.
updated
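For readers following the rename to `unique_ngrams`, the sketch below shows in Python what such a function plausibly returns: the distinct character n-grams of a string. The de-duplication and first-occurrence ordering are assumptions; the Pinot implementation may order or bound its output differently.

```python
def unique_ngrams(value: str, n: int) -> list[str]:
    """Return the distinct character n-grams of `value`, in first-seen order.

    Assumed semantics for illustration; not taken from the Pinot source.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    seen = set()
    grams = []
    for i in range(len(value) - n + 1):
        gram = value[i:i + n]
        if gram not in seen:  # skip n-grams already emitted
            seen.add(gram)
            grams.append(gram)
    return grams


if __name__ == "__main__":
    print(unique_ngrams("fast", 3))    # ['fas', 'ast']
    print(unique_ngrams("banana", 2))  # ['ba', 'an', 'na']
```

The first call reproduces the `fas,ast` result that appears in the documentation diff above.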
001a742 to
de5c295
Compare
| ## Context | ||
| We are onboarding a use case and trying to increase query throughput. We tested that QPS cannot be further improved with existing `REGEXP_LIKE` queries or `TEXT_MATCH` queries. The queries are as follows: |
Update this section to the following:
We were aiming to increase query throughput for a matching use case. Initial testing shows that the current approaches using REGEXP_LIKE and TEXT_MATCH queries have reached their performance limits, and further improvements in QPS cannot be achieved with these methods. The representative queries under evaluation are as follows:
updated
| ``` | ||
| #### Query Examples | ||
Need to add two query examples that use the generated n-grams.
Example 1.
SELECT * FROM T WHERE regexp_like(field, '.*pinot.*') ==>
SELECT * FROM T
WHERE ngram = 'pin' AND ngram = 'ino' AND ngram = 'not'
AND regexp_like(field, '.*pinot.*')
Example 2.
When the search string is no longer than the n-gram length, no validation stage is needed.
SELECT * FROM T WHERE regexp_like(field, '.*pi.*') ==>
SELECT * FROM T WHERE ngram = 'pi'
added, thx!
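The rewrite described in the review comment above can be sketched as a small Python helper that turns a substring search into n-gram equality predicates plus an optional `regexp_like` validation stage. The helper names, the table and column names, and the assumption that the n-gram column is multi-value with fixed-length n-grams are all hypothetical. Search terms shorter than `n` are rejected here because a fixed-length n-gram column could not contain them.

```python
import re


def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def rewrite_substring_query(table: str, text_column: str,
                            ngram_column: str, term: str, n: int) -> str:
    """Build a prefiltered query for a substring search (illustrative only)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(term) < n:
        raise ValueError("term is shorter than the indexed n-gram length")
    # Distinct n-grams of the search term, in first-seen order.
    grams = list(dict.fromkeys(term[i:i + n] for i in range(len(term) - n + 1)))
    predicates = [f"{ngram_column} = {sql_literal(g)}" for g in grams]
    if len(term) > n:
        # The n-gram predicates only prefilter; the regex confirms the match.
        pattern = ".*" + re.escape(term) + ".*"
        predicates.append(f"regexp_like({text_column}, {sql_literal(pattern)})")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(predicates)


if __name__ == "__main__":
    print(rewrite_substring_query("T", "field", "ngram", "pinot", 3))
    print(rewrite_substring_query("T", "field", "ngram", "pi", 2))
```

The validation stage is kept when the term is longer than `n` because a row can contain every n-gram of the term without containing the term itself as a contiguous substring.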
9361aa7 to
fc04681
Compare
New doc for how to add prefix, suffix and ngram UDFs
feature pr: apache/pinot#12392