Is the index in Quickwit language independent? #1388

HeenaBansal2009 · 2022-05-06T16:20:26Z


Hi @fmassot,
I have a few more questions about Quickwit indexing.

What kind of tokenizer does the indexer use: ngram or a language-specific tokenizer?
Do Quickwit's token filters handle spaces too? For example, is query = VPN authenticated user supported in the current release?

Thanks.

Answered by fmassot

fmassot · 2022-05-09T21:23:26Z

Maintainer

Currently, you can specify two tokenizers:

the raw tokenizer, which does no processing and keeps the whole text as a single token
the default tokenizer, which splits on whitespace and punctuation (everything that is not alphanumeric), removes long tokens (> 40 bytes), and lowercases each token

The splitting is done by the SimpleTokenizer. Here is the relevant code:

impl<'a> SimpleTokenStream<'a> {
    // search for the end of the current token.
    fn search_token_end(&mut self) -> usize {
        (&mut self.chars)
            .filter(|&(_, ref c)| !c.is_alphanumeric())
            .map(|(offset, _)| offset)
            .next()
            .unwrap_or(self.text.len())
    }
}
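
To make the default behaviour concrete, here is a minimal standalone sketch of the three steps listed above, written with the Rust standard library only. It is not tantivy's code: `default_tokenize`, `raw_tokenize`, and `MAX_TOKEN_BYTES` are names made up for illustration, and whether a token of exactly 40 bytes is kept is an assumption.

```rust
/// Illustrative limit, taken from the "> 40 bytes" rule described above.
const MAX_TOKEN_BYTES: usize = 40;

/// Sketch of the default pipeline: split on non-alphanumeric characters,
/// drop long tokens, then lowercase what is left.
fn default_tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; skip them.
        .filter(|piece| !piece.is_empty())
        // Drop tokens longer than the limit (length measured in UTF-8 bytes).
        .filter(|piece| piece.len() <= MAX_TOKEN_BYTES)
        .map(|piece| piece.to_lowercase())
        .collect()
}

/// Sketch of the raw tokenizer: the input is kept as one unmodified token.
fn raw_tokenize(text: &str) -> Vec<String> {
    vec![text.to_string()]
}
```

Under these assumptions, `default_tokenize("VPN authenticated user")` yields the tokens `vpn`, `authenticated`, and `user`, while `raw_tokenize` returns the whole string as a single token.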

In tantivy you have access to more tokenizers: ngram, stemming for Latin languages, and third-party support for Japanese, Chinese, ...
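
For readers unfamiliar with ngram tokenization, the idea can be sketched in a few lines of plain Rust. This is only an illustration of the technique with a single fixed gram size; `char_ngrams` is a hypothetical helper, not tantivy's ngram tokenizer, which has its own options.

```rust
/// Returns every run of `n` consecutive characters in `text`.
/// Works on characters rather than bytes, so it needs no knowledge
/// of word boundaries in the language being indexed.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows(0)` would panic, and short inputs have no full n-gram.
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}
```

Because it never looks for whitespace, this kind of tokenizer also produces tokens for text without spaces between words, at the cost of a larger index.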

Do Quickwit's token filters handle spaces too? For example, is query = VPN authenticated user supported in the current release?

I'm not sure I understand the query you want to run. Can you give a concrete example?

1 reply

fmassot Jun 2, 2022
Maintainer

@HeenaBansal2009 if you have the time to provide an example of a query and an example of a JSON document, that would be super helpful :)
