-
Notifications
You must be signed in to change notification settings - Fork 185
add-prefix-suffix-ngram-udf #438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: latest
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| --- | ||
| description: This section contains reference documentation for the add prefix, suffix and ngram UDFs. | ||
| --- | ||
|
|
||
| # Add Prefix, Suffix & Ngram UDFs | ||
|
|
||
| Adding ngram, prefix, postfix UDFs to improve query throughput by enabling efficient text search through derived columns with inverted indexes. | ||
|
|
||
| ## Context | ||
|
|
||
| We were aiming to increase query throughput for a matching use case. Initial testing shows that the current approaches using `REGEXP_LIKE` and `TEXT_MATCH` queries have reached their performance limits, and further improvements in QPS cannot be achieved with these methods. The representative queries under evaluation are as follows: | ||
|
|
||
| ```sql | ||
| SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, '^data*') | ||
| SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, 'data$') | ||
| SELECT col1, col2 FROM table WHERE REGEXP_LIKE(col3, '*data*') | ||
| SELECT col1, col2 FROM table WHERE TEXT_MATCH(col3, '/data*/') | ||
| ``` | ||
|
|
||
| The plan is to generate derived columns that persist prefix, postfix, and ngram to use inverted indexes to filter results fast, and add text match indexes to do validation after filtering to avoid false positive results. | ||
|
|
||
| ## What is N-gram | ||
|
|
||
| N-gram breaks down text into overlapping sequences of characters. For example: | ||
|
|
||
| **"Apache pinot"** produces: | ||
| - `ap`, `pa`, `ac`, `ch`, `he`, `e `, ` p`, ... | ||
|
|
||
| When you match `ngram(col, "ap")`, it searches for the substring "ap" within the generated n-grams. | ||
|
|
||
| When you match `ngram(col, 'apache')`, it generates n-grams from "apache" (`ap`, `pa`, `ac`, etc.) and matches against the column's n-grams. | ||
|
|
||
| ## When to Use N-gram | ||
|
|
||
| N-grams solve performance issues with: | ||
| - **Text match** - Avoid full scans | ||
| - **REGEXP_LIKE** - Currently requires full scan | ||
| - **Prefix matching** - Reduce O(N * length of string) complexity | ||
| - **Wildcard matching** - Reduce O(N * length) complexity | ||
|
|
||
| ### How to Make Wildcards Fast | ||
| Need to do prefiltering to avoid full scan of all rows. | ||
| **How to prefilter → use ngram!** | ||
|
|
||
| ## Function Parameters | ||
|
|
||
| | Function | Parameters | Description | | ||
| |----------|------------|-------------| | ||
| | `prefixes` | `String input, int maxlength` | Generate array of prefix strings shorter than maxlength | | ||
| | `prefixesWithPrefix` | `String input, int maxlength, @Nullable String prefix` | Generate array of prefix matchers with prepended prefix (e.g. '^' for regex) | | ||
| | `suffixes` | `String input, int maxlength` | Generate array of suffix strings shorter than maxlength | | ||
| | `suffixesWithSuffix` | `String input, int maxlength, @Nullable String suffix` | Generate array of suffix matchers with appended suffix (e.g. '$' for regex) | | ||
| | `uniqueNgrams` | `String input, int length` | Generate array of unique ngrams of exact specified length | | ||
| | `uniqueNgrams` | `String input, int minGram, int maxGram` | Generate array of unique ngrams within length range [minGram, maxGram] | | ||
|
|
||
| ## Usage Examples | ||
|
|
||
| ### Example Query | ||
|
|
||
| ```sql | ||
| SELECT organizationUUID, confirmedEmployeeCount, name | ||
| FROM rta.rta.u4b_organizations | ||
| WHERE deletedAt IS NULL | ||
| AND (IN_SERIALIZED_LIST(entityTypes, 'org:u4b:organization')) | ||
| AND (TEXT_MATCH(name, '/.*pacific.*/ AND /.*equity.*/') | ||
| OR TEXT_MATCH(profileName, '/.*pacific.*/ AND /.*equity.*/')) | ||
| ORDER BY confirmedEmployeeCount DESC | ||
| OFFSET 0 LIMIT 10 | ||
| ``` | ||
|
|
||
| ### Prefix Operations | ||
|
|
||
| ```sql | ||
| SELECT prefixes('data', 3) AS result | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | result | | ||
| | ----------- | | ||
| | ["d", "da", "dat"] | | ||
|
|
||
| ```sql | ||
| SELECT prefixesWithPrefix('data', 3, '^') AS result | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | result | | ||
| | ----------- | | ||
| | ["^d", "^da", "^dat"] | | ||
|
|
||
| ### Suffix Operations | ||
|
|
||
| ```sql | ||
| SELECT suffixes('data', 3) AS result | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | result | | ||
| | ----------- | | ||
| | ["a", "ta", "ata"] | | ||
|
|
||
| ```sql | ||
| SELECT suffixesWithSuffix('data', 3, '$') AS result | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | result | | ||
| | ----------- | | ||
| | ["a$", "ta$", "ata$"] | | ||
|
|
||
| ### N-gram Generation | ||
|
|
||
| #### Table Config with Transformation | ||
|
|
||
| Instead of using SQL UDF, configure n-gram generation during ingestion using `transformationConfig`: | ||
|
|
||
| ```json | ||
| { | ||
| "tableName": "myTable", | ||
| "tableType": "OFFLINE", | ||
| "ingestionConfig": { | ||
| "transformationConfigs": [ | ||
| { | ||
| "columnName": "content_bigrams", | ||
| "transformFunction": "uniqueNgrams(content, 2)" | ||
| }, | ||
| { | ||
| "columnName": "content_ngrams", | ||
| "transformFunction": "uniqueNgrams(content, 2, 4)" | ||
| } | ||
| ] | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| #### Query Examples | ||
|
|
||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. need to add two query example using generated ngrams example1. example2. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added, thx! |
||
| ```sql | ||
| SELECT uniqueNgrams('Apache pinot', 2) AS bigrams | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | bigrams | | ||
| | ------- | | ||
| | ["Ap", "pa", "ac", "ch", "he", "e ", " p", "pi", "in", "no", "ot"] | | ||
|
|
||
| ```sql | ||
| SELECT uniqueNgrams('data', 2, 4) AS ngrams | ||
| FROM myTable | ||
| ``` | ||
|
|
||
| | ngrams | | ||
| | ------ | | ||
| | ["da", "at", "ta", "dat", "ata", "data"] | | ||
|
|
||
| **Using Generated N-grams** | ||
|
|
||
| ```sql | ||
| -- Traditional approach (slow) | ||
| SELECT * FROM T WHERE REGEXP_LIKE(field, '*pino.*') | ||
|
|
||
| -- Optimized approach with n-gram prefiltering | ||
| SELECT * FROM T | ||
| WHERE ngram = 'pin' AND ngram = 'ino' AND ngram = 'not' | ||
| AND REGEXP_LIKE(field, '*pino.*') | ||
| ``` | ||
|
|
||
| ```sql | ||
| -- Traditional approach for short strings | ||
| SELECT * FROM T WHERE REGEXP_LIKE(field, '*pi.*') | ||
|
|
||
| -- Optimized approach (no validation needed when search string ≤ ngram length) | ||
| SELECT * FROM T WHERE ngram = 'pi' | ||
| ``` | ||
|
|
||
| ## How to Use N-gram | ||
|
|
||
| ### Simulated N-gram Approach | ||
|
|
||
| This patch creates a simulated ngram approach by storing the grams into a column: | ||
|
|
||
| 1. **First, use the UDF to generate a ngram column** during data ingestion | ||
| 2. **Translate the query with prefilters** to use the ngram column for fast filtering | ||
| 3. **Trade-off**: Disk size will grow, but query performance improves significantly | ||
|
|
||
| ### Implementation Steps | ||
|
|
||
| 1. Generate derived columns with prefix, postfix, and ngram values | ||
| 2. Create inverted indexes on these derived columns | ||
| 3. Use prefiltering with ngram indexes to reduce candidate rows | ||
| 4. Apply original text match validation to avoid false positives | ||
|
|
||
| For more details, see [GitHub PR #12392](https://github.com/apache/pinot/pull/12392). | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a tableconfig with "transformationConfig" instead of sql UDF
update the function
ngramtounique_ngramsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated