Commit 9e59a60

Code, tests, docs complete

a3lem committed Feb 8, 2025
1 parent 76374d8 commit 9e59a60

Showing 6 changed files with 969 additions and 176 deletions.
114 changes: 114 additions & 0 deletions docs/tutorials.md
@@ -214,3 +214,117 @@ assert html_snippet == (
```


## Create a Custom Tokenizer (Text Analyzer)

Tantivy provides several built-in tokenizers and filters that
can be chained together to create new tokenizers (or
'text analyzers') that better fit your needs.

Tantivy-py lets you access these components, assemble them,
and register the result with an index.

Let's walk through creating and registering a custom text analyzer
to see how everything fits together.

### Example

First, let's create a text analyzer. As explained further down,
a text analyzer is a pipeline consisting of one tokenizer and
any number of token filters.

```python
from tantivy import (
    TextAnalyzer,
    TextAnalyzerBuilder,
    Tokenizer,
    Filter,
    Index,
    SchemaBuilder,
)

my_analyzer: TextAnalyzer = (
    TextAnalyzerBuilder(
        # Create a `Tokenizer` instance.
        # It instructs the builder about which type of tokenizer
        # to create internally and with which arguments.
        Tokenizer.regex(r"(?i)([a-z]+)")
    )
    .filter(
        # Create a `Filter` instance.
        # Like `Tokenizer`, this object provides instructions
        # to the builder.
        Filter.lowercase()
    )
    .filter(
        # Define custom stopwords.
        Filter.custom_stopword(["www", "com"])
    )
    # Finally, build a TextAnalyzer,
    # chaining all tokenizer > [filter, ...] steps together.
    .build()
)
```

We can check that our new analyzer is working as expected
by passing some text to its `.analyze()` method.

```python
>>> my_analyzer.analyze('www.this1website1might1exist.com')
['this', 'website', 'might', 'exist']
```
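
Tracing the pipeline: the regex tokenizer matches the alphabetic runs `www`, `this`, `website`, `might`, `exist`, and `com`; the lowercase filter leaves these already-lowercase tokens unchanged; and the stopword filter then drops `www` and `com`.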

The next step is to register our analyzer with an index. Let's
assume we already have one.

```python
index.register_tokenizer("custom_analyzer", my_analyzer)
```

To link an analyzer to a field in the index, pass the
analyzer name to the `tokenizer_name=` parameter of
the `SchemaBuilder`'s `add_text_field()` method.

Here is the schema that was used to construct our index:

```python
schema = (
    SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)
index = Index(schema)
```

Summary:

1. Use `TextAnalyzerBuilder`, `Tokenizer`, and `Filter` to build a `TextAnalyzer`.
2. The analyzer's `.analyze()` method lets you use your analyzer as a tokenizer from Python.
3. Refer to your analyzer's name when building the index schema.
4. Use the same name when registering your analyzer on the index.
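
Putting it all together, the following sketch combines the steps above under the same names (`content`, `custom_analyzer`). The writer and search calls are the standard tantivy-py API covered elsewhere in this tutorial; the sample document and query are illustrative only.

```python
from tantivy import (
    Document,
    Filter,
    Index,
    SchemaBuilder,
    TextAnalyzerBuilder,
    Tokenizer,
)

# Build the analyzer: one tokenizer, two filters.
my_analyzer = (
    TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)"))
    .filter(Filter.lowercase())
    .filter(Filter.custom_stopword(["www", "com"]))
    .build()
)

# The schema refers to the analyzer by name...
schema = (
    SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)
index = Index(schema)

# ...and the analyzer is registered under that same name,
# before any documents are indexed.
index.register_tokenizer("custom_analyzer", my_analyzer)

writer = index.writer()
writer.add_document(Document(content="www.this1website1might1exist.com"))
writer.commit()
index.reload()

# Tokens that survived the pipeline (e.g. "website") are now searchable.
searcher = index.searcher()
query = index.parse_query("website", ["content"])
assert len(searcher.search(query, 1).hits) == 1
```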


### On terminology: Tokenizer vs. Text Analyzer

Tantivy-py mimics Tantivy's interface as closely as possible.
This includes minor terminological inconsistencies, one of
which is how Tantivy distinguishes between 'tokenizers' and
'text analyzers'.

Quite simply, a 'tokenizer' segments text into tokens.
A 'text analyzer' is a pipeline consisting of one tokenizer
and zero or more token filters. The `TextAnalyzer` is the
primary object of interest when talking about how to
change Tantivy's tokenization behavior.
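
To make this concrete, here is a small sketch reusing the constructors from the example above: a builder with no `.filter()` calls still produces a `TextAnalyzer` (a pipeline of one tokenizer and zero filters), and adding a filter changes how tokens are normalized, not how the text is segmented.

```python
from tantivy import Filter, TextAnalyzerBuilder, Tokenizer

# One tokenizer, zero filters: still a TextAnalyzer.
bare = TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)")).build()

# Same tokenizer, one filter: same segmentation, normalized tokens.
lowered = (
    TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)"))
    .filter(Filter.lowercase())
    .build()
)

print(bare.analyze("Hello.World"))     # expected: ['Hello', 'World']
print(lowered.analyze("Hello.World"))  # expected: ['hello', 'world']
```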

Slightly confusingly, though, the `Index` and `SchemaBuilder`
interfaces use 'tokenizer' to mean 'text analyzer'.

This inconsistency can be observed in `SchemaBuilder.add_text_field`, e.g. --

```
SchemaBuilder.add_text_field(..., tokenizer_name=<analyzer name>)
```

-- and in the name of the `Index.register_tokenizer(...)` method, which actually
serves to register a *text analyzer*.

10 changes: 10 additions & 0 deletions src/index.rs
@@ -12,6 +12,7 @@ use crate::{
    schema::Schema,
    searcher::Searcher,
    to_pyerr,
    tokenizer::TextAnalyzer as PyTextAnalyzer,
};
use tantivy as tv;
use tantivy::{
@@ -453,6 +454,15 @@ impl Index {

        Ok((Query { inner: query }, errors))
    }

    /// Register a custom text analyzer by name. (Confusingly,
    /// this is one of the places where Tantivy uses 'tokenizer' to refer to a
    /// TextAnalyzer instance.)
    ///
    // Implementation note: this skips the indirection of exposing TokenizerManager.
    pub fn register_tokenizer(&self, name: &str, analyzer: PyTextAnalyzer) {
        self.index.tokenizers().register(name, analyzer.analyzer);
    }
}

impl Index {
6 changes: 6 additions & 0 deletions src/lib.rs
@@ -11,6 +11,7 @@ mod schema;
mod schemabuilder;
mod searcher;
mod snippet;
mod tokenizer;

use document::{extract_value, extract_value_for_type, Document};
use facet::Facet;
@@ -20,6 +21,7 @@ use schema::{FieldType, Schema};
use schemabuilder::SchemaBuilder;
use searcher::{DocAddress, Order, SearchResult, Searcher};
use snippet::{Snippet, SnippetGenerator};
use tokenizer::{Filter, TextAnalyzer, TextAnalyzerBuilder, Tokenizer};

/// Python bindings for the search engine library Tantivy.
///
@@ -87,6 +89,10 @@ fn tantivy(_py: Python, m: &Bound<PyModule>) -> PyResult<()> {
    m.add_class::<SnippetGenerator>()?;
    m.add_class::<Occur>()?;
    m.add_class::<FieldType>()?;
    m.add_class::<Tokenizer>()?;
    m.add_class::<TextAnalyzerBuilder>()?;
    m.add_class::<Filter>()?;
    m.add_class::<TextAnalyzer>()?;

    m.add_wrapped(wrap_pymodule!(query_parser_error))?;
