# Add Simplified Linear & RRF Retriever Examples #2026

Changes to `solutions/search/retrievers-examples.md`:

## Add example data [retrievers-examples-setup]

To begin with, let's create the `retrievers_example` index, and add some documents to it.
We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.

```console
PUT retrievers_example
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 3,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "flat"
        }
      },
      "text": {
        "type": "text",
        "copy_to": "text_semantic"
      },
      "text_semantic": {
        "type": "semantic_text"
      },
      "year": {
        "type": "integer"
      },
      "topic": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}
```
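To keep the later examples meaningful, we also add a few documents. The exact contents aren't important for what follows; as a sketch (the texts, vectors, years, topics, and timestamps below are illustrative), we could bulk-index four documents like this, with the `text_semantic` field populated automatically through the `copy_to` mapping:

```console
POST /retrievers_example/_bulk?refresh
{ "index": { "_id": "1" } }
{ "vector": [0.23, 0.67, 0.89], "text": "Large language models are improving information retrieval.", "year": 2024, "topic": ["llm", "information_retrieval"], "timestamp": "2021-01-01T12:10:30" }
{ "index": { "_id": "2" } }
{ "vector": [0.12, 0.56, 0.78], "text": "Artificial intelligence is transforming medicine.", "year": 2023, "topic": ["ai", "medicine"], "timestamp": "2022-01-01T12:10:30" }
{ "index": { "_id": "3" } }
{ "vector": [0.45, 0.32, 0.91], "text": "AI assistants can help with daily tasks.", "year": 2023, "topic": ["ai", "assistants"], "timestamp": "2023-01-01T12:10:30" }
{ "index": { "_id": "4" } }
{ "vector": [0.34, 0.21, 0.98], "text": "Can AI help doctors make better decisions?", "year": 2024, "topic": ["ai", "medicine"], "timestamp": "2024-01-01T12:10:30" }
```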

## Example: Combining query and kNN with RRF [retrievers-examples-combining-standard-knn-retrievers-with-rrf]

First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query.
While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list.
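As a quick refresher on the fusion itself: RRF ignores the magnitude of each retriever's scores and ranks documents using only their positions in each child result list. A sketch of the standard formula (in Elasticsearch, `rank_constant` defaults to 60):

```text
score(d) = sum over each child retriever r of: 1 / (rank_constant + rank_r(d))
```

Because only ranks are used, the BM25 and vector-similarity score ranges never need to be reconciled explicitly.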

To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever.
This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look something like this (the query string and vector values below are illustrative):

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "query_string": {
                "query": "(information retrieval) OR (artificial intelligence)",
                "default_field": "text"
              }
            }
          }
        },
        {
          "knn": {
            "field": "vector",
            "query_vector": [0.23, 0.67, 0.89],
            "k": 3,
            "num_candidates": 5
          }
        }
      ],
      "rank_window_size": 10,
      "rank_constant": 1
    }
  }
}
```

## Example: Hybrid search with linear retriever [retrievers-examples-linear-retriever]

A different, and more intuitive, way to provide hybrid search is to linearly combine the top documents of different retrievers using a weighted sum of their original scores.
Since, as above, the scores can lie in different ranges, we can also specify a `normalizer` to ensure that all scores for a retriever's top-ranked documents lie in a specific range.

To implement this, we define a `linear` retriever along with the set of retrievers that will generate the heterogeneous result sets we want to combine.
We will solve a problem similar to the one above, merging the results of a `standard` and a `knn` retriever.
As the `standard` retriever's scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range.
We will apply the same normalizer to `knn` as well, so that we capture the importance of each document within its own result set.

So, let’s now specify the `linear` retriever whose final score is computed as follows:
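Conceptually: `final_score = weight(standard) * minmax(standard_score) + weight(knn) * minmax(knn_score)`. The request below is a minimal sketch of that setup; the query string, vector values, and the weights of `2` and `1.5` are illustrative choices, not required values:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "query_string": {
                  "query": "(information retrieval) OR (artificial intelligence)",
                  "default_field": "text"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  }
}
```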

This returns the following response, based on the weighted and normalized score of each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
"value": 3,
"relation": "eq"
},
"max_score": -1,
"max_score": 3.5,
"hits": [
{
"_index": "retrievers_example",
"_id": "2",
"_score": -1
"_score": 3.5
},
{
"_index": "retrievers_example",
"_id": "1",
"_score": -2
"_score": 2.3
},
{
"_index": "retrievers_example",
"_id": "3",
"_score": -3
"_score": 0.1
}
]
}
}
```
::::


By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps: we assign each document's timestamp as its score, and then normalize that score to [0, 1].
We can then easily combine the above with a `knn` retriever, as in the following sketch (the `function_score` script shown below is one illustrative way to assign the timestamp as the score):

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "function_score": {
                  "query": {
                    "term": {
                      "topic": "ai"
                    }
                  },
                  "functions": [
                    {
                      "script_score": {
                        "script": {
                          "source": "doc['timestamp'].value.millis"
                        }
                      }
                    }
                  ],
                  "boost_mode": "replace"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  }
}
```

This returns the following response, based on the weighted and normalized score of each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 3.5,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 3.5
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "4",
        "_score": 1.1
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.1
      }
    ]
  }
}
```

::::


## Example: RRF with the multi-field query format [retrievers-examples-rrf-multi-field-query-format]

```yaml {applies_to}
stack: ga 9.1
```

There's an even simpler way to execute a hybrid search, though: we can use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format), which allows us to query multiple fields without explicitly specifying inner retrievers.

The following example uses the multi-field query format to query the `text` and `text_semantic` fields.
Scores from [`text`](elasticsearch://reference/elasticsearch/mapping-reference/text.md) and [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set.
For example, BM25 scores from `text` fields are unbounded, while vector similarity scores from `text_embedding` models lie within the [0, 1] range.
The multi-field query format [handles this normalization for us automatically](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping).

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic"]
    }
  }
}
```

This returns the following response, based on the final `rrf` score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::

We don't even need to specify the `fields` parameter when using the multi-field query format.
If we omit it, the retriever will automatically query fields specified in the `index.query.default_field` index setting, which is set to `*` by default.

This default value will cause the retriever to query every field that either:

- Supports term queries, such as `keyword` and `text` fields
- Is a `semantic_text` field

In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence"
    }
  }
}
```

This returns the following response, based on the final `rrf` score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::

See [wildcard field patterns](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-wildcard-field-patterns) for more information about wildcard resolution.
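For instance, as a sketch against the example index above (the pattern itself is illustrative), a wildcard pattern such as `text*` would resolve to both the `text` and `text_semantic` fields:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text*"]
    }
  }
}
```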

## Example: Linear retriever with the multi-field query format [retrievers-examples-linear-multi-field-query-format]
```yaml {applies_to}
stack: ga 9.1
```

We can also use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format) with the `linear` retriever.

It works much the same way as [on the `rrf` retriever](#retrievers-examples-rrf-multi-field-query-format), with a couple of key differences:

- We can use `^` notation to specify a [per-field boost](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-boosting)

- We must set the `normalizer` parameter to specify the normalization method used to combine [field group scores](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping)

The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic", "topic^2"],
      "normalizer": "minmax"
    }
  }
}
```

This returns the following response based on the normalized score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 1.2
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.1
      }
    ]
  }
}
```

::::

## Example: Grouping results by year with `collapse` [retrievers-examples-collapsing-retriever-results]
