@musclemuller

New doc for how to add prefix, suffix and ngram UDFs
feature pr: apache/pinot#12392

@deemoliu
Contributor

Why is there both a function-1 and a function folder?

@musclemuller
Author

> Why is there both a function-1 and a function folder?

function-1 includes the item docs listed in function.


-- N-gram approach (fast - with prefiltering)
SELECT * FROM documents
WHERE NGRAM(content, 2) LIKE '%ap%'
Collaborator

Why would we use LIKE here? We can simply do NGRAM(content, 2) = 'ap'?

Contributor

+1

Author

Thank you, Ankit. Will address this.
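
A minimal sketch of the point raised in this thread, assuming the n-gram column is multi-valued and holds every bigram of `content`: when the search term is exactly n characters long, an equality (membership) test selects the same rows as a `LIKE '%ap%'` scan over the n-grams. The sample documents and the `bigrams` helper are hypothetical, and Python stands in for the query engine here.

```python
def bigrams(text):
    """Return the set of all 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Hypothetical sample rows: id -> content
documents = {1: "apple pie", 2: "grape", 3: "pinot"}

# Equality on a multi-valued n-gram column behaves like set membership.
by_equality = [doc_id for doc_id, content in documents.items()
               if "ap" in bigrams(content)]

# LIKE '%ap%' applied to each n-gram is a substring scan over the same values.
by_like = [doc_id for doc_id, content in documents.items()
           if any("ap" in gram for gram in bigrams(content))]

# Both predicates select the same rows, so the wildcard match adds nothing here.
assert by_equality == by_like
```

Since every stored value already has length 2, a 2-character term can only match an n-gram by being equal to it, which is why the wildcard form is redundant under these assumptions.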

| -------- |
| fas,ast |
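
The output row above is consistent with a plain sliding-window n-gram generator. The following is a sketch only, assuming the input was the string `fast` with n = 3 (the input is not shown in this excerpt); `ngrams` and `unique_ngrams` are illustrative Python helpers, not the Pinot implementation.

```python
def ngrams(text, n):
    """Return every length-n substring of text, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_ngrams(text, n):
    """Return the distinct n-grams of text, keeping first-seen order."""
    return list(dict.fromkeys(ngrams(text, n)))


print(",".join(unique_ngrams("fast", 3)))  # prints "fas,ast"
```

A string shorter than n yields no n-grams at all in this sketch; how the actual UDF treats that case is not stated here.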

### Prefix Operations
Contributor

Remove this; it is unrelated.

Author

Not sure I follow. Why is this unrelated? Don't we also want to document prefix?

| ------------ |
| Apache Pinot |

### Advanced Text Matching with N-grams
Contributor

I think this is unrelated.

Contributor

@deemoliu deemoliu left a comment

Please address the comments, thanks.

@musclemuller musclemuller requested a review from deemoliu August 6, 2025 22:45

## Signature

> NGRAM(col, n)
Contributor

unique_ngrams(col, n)

Author

addressed


> NGRAM(col, n)
>
> PREFIX(col, prefixString)
Contributor

prefixes

Author

addressed

>
> PREFIX(col, prefixString)
>
> SUFFIX(col, suffixString)
Contributor

suffixes

Author

addressed

```sql
SELECT NGRAM('Apache pinot', 2) AS bigrams
FROM myTable
```
Contributor

Add a table config with "transformationConfig" instead of the SQL UDF.
Rename the function from ngram to unique_ngrams.

Author

updated

@musclemuller musclemuller force-pushed the muller/add-prefix-suffix-ngram-udf branch from 001a742 to de5c295 Compare August 26, 2025 01:16
@musclemuller musclemuller requested a review from deemoliu August 26, 2025 07:14

## Context

We are onboarding a use case and trying to increase query throughput. We tested that QPS cannot be further improved with existing `REGEXP_LIKE` queries or `TEXT_MATCH` queries. The queries are as follows:
Contributor

update this section to the following
-->
We were aiming to increase query throughput for a matching use case. Initial testing shows that the current approaches using REGEXP_LIKE and TEXT_MATCH queries have reached their performance limits, and further improvements in QPS cannot be achieved with these methods. The representative queries under evaluation are as follows:

Author

updated

```

#### Query Examples

Contributor

Need to add two query examples that use the generated n-grams.

Example 1.

SELECT * FROM T WHERE regexp_like(field, '.*pinot.*') ==>

SELECT * FROM T
WHERE ngram = 'pin' AND ngram = 'ino' AND ngram = 'not'
      AND regexp_like(field, '.*pinot.*')

Example 2.
When the search string is no longer than the n-gram length, no validation stage is needed.

SELECT * FROM T WHERE regexp_like(field, '.*pi.*') ==>

SELECT * FROM T WHERE ngram = 'pi'
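
The rewrite in these two examples can be sketched as a small helper that turns a "contains" search term into an n-gram prefilter plus an optional validation predicate. This is a hedged illustration, not part of the PR: `rewrite_contains`, its parameters, and the column names `field` and `ngram` are hypothetical, and it assumes the `ngram` column holds every length-n substring of `field`. Terms shorter than n are left out of the sketch.

```python
import re


def ngrams(text, n):
    """Return the distinct length-n substrings of text, in first-seen order."""
    return list(dict.fromkeys(text[i:i + n] for i in range(len(text) - n + 1)))


def sql_quote(value):
    """Wrap value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def rewrite_contains(term, n, column="field", ngram_column="ngram"):
    """Build a WHERE clause that prefilters on n-grams before the regexp check."""
    if n <= 0 or len(term) < n:
        raise ValueError("this sketch only covers terms of length >= n >= 1")
    prefilter = " AND ".join(
        f"{ngram_column} = {sql_quote(gram)}" for gram in ngrams(term, n)
    )
    if len(term) == n:
        # The term is itself one n-gram, so equality is already exact.
        return prefilter
    # Longer terms: matching n-grams may not be contiguous, so validate.
    pattern = ".*" + re.escape(term) + ".*"
    return f"{prefilter} AND regexp_like({column}, {sql_quote(pattern)})"


print(rewrite_contains("pinot", 3))
print(rewrite_contains("pi", 2))
```

The validation predicate is kept for longer terms because a row can contain all of a term's n-grams without containing the term itself as one contiguous substring.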

Author

added, thx!

@musclemuller musclemuller force-pushed the muller/add-prefix-suffix-ngram-udf branch from 9361aa7 to fc04681 Compare August 26, 2025 18:28
@musclemuller musclemuller requested a review from deemoliu August 26, 2025 18:29
@musclemuller musclemuller requested a review from deemoliu August 26, 2025 22:07