Commit 9e59a60

Code, tests, docs complete

a3lem committed Feb 8, 2025
1 parent 76374d8 commit 9e59a60

Showing 6 changed files with 969 additions and 176 deletions.
114 changes: 114 additions & 0 deletions docs/tutorials.md
@@ -214,3 +214,117 @@ assert html_snippet == (
```


## Create a Custom Tokenizer (Text Analyzer)

Tantivy provides several built-in tokenizers and filters that
can be chained together to create new tokenizers (or
'text analyzers') that better fit your needs.

Tantivy-py lets you access these components, assemble them,
and register the result with an index.

Let's walk through creating and registering a custom text analyzer
to see how everything fits together.

### Example

First, let's create a text analyzer. As explained further down,
a text analyzer is a pipeline consisting of one tokenizer and
any number of token filters.

```python
from tantivy import (
    TextAnalyzer,
    TextAnalyzerBuilder,
    Tokenizer,
    Filter,
    Index,
    SchemaBuilder,
)

my_analyzer: TextAnalyzer = (
    TextAnalyzerBuilder(
        # Create a `Tokenizer` instance.
        # It instructs the builder about which type of tokenizer
        # to create internally and with which arguments.
        Tokenizer.regex(r"(?i)([a-z]+)")
    )
    .filter(
        # Create a `Filter` instance.
        # Like `Tokenizer`, this object provides instructions
        # to the builder.
        Filter.lowercase()
    )
    .filter(
        # Define custom stopwords.
        Filter.custom_stopword(["www", "com"])
    )
    # Finally, build a TextAnalyzer,
    # chaining all tokenizer > [filter, ...] steps together.
    .build()
)
```

We can check that our new analyzer is working as expected
by passing some text to its `.analyze()` method.

```python
>>> my_analyzer.analyze('www.this1website1might1exist.com')
['this', 'website', 'might', 'exist']
```
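
Tracing the pipeline: the regex tokenizer matches the alphabetic runs `www`, `this`, `website`, `might`, `exist`, and `com`; the lowercase filter leaves these already-lowercase tokens unchanged; and the stopword filter then drops `www` and `com`.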

The next step is to register our analyzer with an index. Let's
assume we already have one.

```python
index.register_tokenizer("custom_analyzer", my_analyzer)
```

To link an analyzer to a field in the index, pass the
analyzer name to the `tokenizer_name=` parameter of
the `SchemaBuilder`'s `add_text_field()` method.

Here is the schema that was used to construct our index:

```python
schema = (
    SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)
index = Index(schema)
```

Summary:

1. Use `TextAnalyzerBuilder`, `Tokenizer`, and `Filter` to build a `TextAnalyzer`.
2. The analyzer's `.analyze()` method lets you use your analyzer as a tokenizer from Python.
3. Refer to your analyzer's name when building the index schema.
4. Use the same name when registering your analyzer on the index.
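
Putting it all together, the following sketch combines the steps above under the same names (`content`, `custom_analyzer`). The writer and search calls are the standard tantivy-py API covered elsewhere in this tutorial; the sample document and query are illustrative only.

```python
from tantivy import (
    Document,
    Filter,
    Index,
    SchemaBuilder,
    TextAnalyzerBuilder,
    Tokenizer,
)

# Build the analyzer: one tokenizer, two filters.
my_analyzer = (
    TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)"))
    .filter(Filter.lowercase())
    .filter(Filter.custom_stopword(["www", "com"]))
    .build()
)

# The schema refers to the analyzer by name...
schema = (
    SchemaBuilder()
    .add_text_field("content", tokenizer_name="custom_analyzer")
    .build()
)
index = Index(schema)

# ...and the analyzer is registered under that same name,
# before any documents are indexed.
index.register_tokenizer("custom_analyzer", my_analyzer)

writer = index.writer()
writer.add_document(Document(content="www.this1website1might1exist.com"))
writer.commit()
index.reload()

# Tokens that survived the pipeline (e.g. "website") are now searchable.
searcher = index.searcher()
query = index.parse_query("website", ["content"])
assert len(searcher.search(query, 1).hits) == 1
```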


### On terminology: Tokenizer vs. Text Analyzer

Tantivy-py mimics Tantivy's interface as closely as possible.
This includes minor terminological inconsistencies, one of
which is how Tantivy distinguishes between 'tokenizers' and
'text analyzers'.

Quite simply, a 'tokenizer' segments text into tokens.
A 'text analyzer' is a pipeline consisting of one tokenizer
and zero or more token filters. The `TextAnalyzer` is the
primary object of interest when talking about how to
change Tantivy's tokenization behavior.
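
To make this concrete, here is a small sketch reusing the constructors from the example above: a builder with no `.filter()` calls still produces a `TextAnalyzer` (a pipeline of one tokenizer and zero filters), and adding a filter changes how tokens are normalized, not how the text is segmented.

```python
from tantivy import Filter, TextAnalyzerBuilder, Tokenizer

# One tokenizer, zero filters: still a TextAnalyzer.
bare = TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)")).build()

# Same tokenizer, one filter: same segmentation, normalized tokens.
lowered = (
    TextAnalyzerBuilder(Tokenizer.regex(r"(?i)([a-z]+)"))
    .filter(Filter.lowercase())
    .build()
)

print(bare.analyze("Hello.World"))     # expected: ['Hello', 'World']
print(lowered.analyze("Hello.World"))  # expected: ['hello', 'world']
```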

Slightly confusingly, though, the `Index` and `SchemaBuilder`
interfaces use 'tokenizer' to mean 'text analyzer'.

This inconsistency can be observed in `SchemaBuilder.add_text_field`, e.g. --

```
SchemaBuilder.add_text_field(..., tokenizer_name=<analyzer name>)
```

-- and in the name of the `Index.register_tokenizer(...)` method, which actually
serves to register a *text analyzer*.

10 changes: 10 additions & 0 deletions src/index.rs
@@ -12,6 +12,7 @@ use crate::{
    schema::Schema,
    searcher::Searcher,
    to_pyerr,
    tokenizer::TextAnalyzer as PyTextAnalyzer,
};
use tantivy as tv;
use tantivy::{
@@ -453,6 +454,15 @@ impl Index {

        Ok((Query { inner: query }, errors))
    }

    /// Register a custom text analyzer by name. (Confusingly,
    /// this is one of the places where Tantivy uses 'tokenizer' to refer to a
    /// TextAnalyzer instance.)
    ///
    // Implementation note: this skips the indirection of exposing TokenizerManager.
    pub fn register_tokenizer(&self, name: &str, analyzer: PyTextAnalyzer) {
        self.index.tokenizers().register(name, analyzer.analyzer);
    }
}

impl Index {
6 changes: 6 additions & 0 deletions src/lib.rs
@@ -11,6 +11,7 @@ mod schema;
mod schemabuilder;
mod searcher;
mod snippet;
mod tokenizer;

use document::{extract_value, extract_value_for_type, Document};
use facet::Facet;
@@ -20,6 +21,7 @@ use schema::{FieldType, Schema};
use schemabuilder::SchemaBuilder;
use searcher::{DocAddress, Order, SearchResult, Searcher};
use snippet::{Snippet, SnippetGenerator};
use tokenizer::{Filter, TextAnalyzer, TextAnalyzerBuilder, Tokenizer};

/// Python bindings for the search engine library Tantivy.
///
@@ -87,6 +89,10 @@ fn tantivy(_py: Python, m: &Bound<PyModule>) -> PyResult<()> {
    m.add_class::<SnippetGenerator>()?;
    m.add_class::<Occur>()?;
    m.add_class::<FieldType>()?;
    m.add_class::<Tokenizer>()?;
    m.add_class::<TextAnalyzerBuilder>()?;
    m.add_class::<Filter>()?;
    m.add_class::<TextAnalyzer>()?;

    m.add_wrapped(wrap_pymodule!(query_parser_error))?;
