Is the index in Quickwit language independent? #1388
Answered by fmassot
HeenaBansal2009 asked this question in Q&A
Hi @fmassot,
Thanks.
Answered by fmassot on May 9, 2022
Currently, you can specify 2 tokenizers:

- `raw`: a tokenizer that does nothing.
- `default`: a tokenizer that splits on whitespace and punctuation (everything that is not alphanumeric), removes long tokens (> 40 bytes), and lowercases each token.

The code of the `SimpleTokenizer` used:

```rust
impl<'a> SimpleTokenStream<'a> {
    // search for the end of the current token.
    fn search_token_end(&mut self) -> usize {
        (&mut self.chars)
            .filter(|&(_, ref c)| !c.is_alphanumeric())
            .map(|(offset, _)| offset)
            .next()
            .unwrap_or(self.text.len())
    }
}
```

In tantivy you have access to more tokenizers: ngram, stemming in Latin languages, third-party support for Japanese, Chinese...

I'm not sure I understand the query you want to run. Can you give a concrete example?
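To make the `default` tokenizer's three steps concrete (split on non-alphanumeric characters, drop long tokens, lowercase), here is a minimal standalone sketch using only the Rust standard library. It is not the Quickwit or tantivy implementation: `default_tokenize` and `MAX_TOKEN_BYTES` are hypothetical names, and the 40-byte cutoff is taken from the description above, so the exact boundary in tantivy may differ.

```rust
/// Hypothetical helper, not part of Quickwit or tantivy: mimics the described
/// behaviour of the `default` tokenizer with the standard library only.
fn default_tokenize(text: &str) -> Vec<String> {
    // Assumed limit, taken from the "> 40 bytes" wording in the answer.
    const MAX_TOKEN_BYTES: usize = 40;

    text
        // Split on every character that is not alphanumeric (Unicode-aware).
        .split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; skip them.
        .filter(|token| !token.is_empty())
        // `str::len` counts bytes, not characters.
        .filter(|token| token.len() <= MAX_TOKEN_BYTES)
        // Lowercase each remaining token.
        .map(|token| token.to_lowercase())
        .collect()
}

fn main() {
    let tokens = default_tokenize("Hello, Wörld! Quickwit-0.3");
    println!("{:?}", tokens); // prints ["hello", "wörld", "quickwit", "0", "3"]
}
```

Because `char::is_alphanumeric` is Unicode-aware, accented letters stay inside a token. Text written without separators between words, such as Japanese, comes out as one single token in this sketch, which is the kind of case the language-specific tantivy tokenizers are meant for.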