Add Simplified Linear & RRF Retriever Examples #2026
Open: Mikep86 wants to merge 6 commits into elastic:main from Mikep86:simplified-linear-and-rrf-retrievers-docs (+226 −16)
Changes from all commits (6 commits):
- 191c721 Added RRF with multi-field query format example
- 3846eeb Added linear retriever with multi-field query format example
- a40440a Add applies to tags
- b8617a6 Fix typo
- 5775c80 Add note about score range mismatches
- fcb3193 Reference index.query.default_field index setting
@@ -15,7 +15,8 @@

 ## Add example data [retrievers-examples-setup]

-To begin with, let's create the `retrievers_example` index, and add some documents to it. We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.
+To begin with, let's create the `retrievers_example` index, and add some documents to it.
+We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.

 ```console
 PUT retrievers_example
@@ -35,7 +36,11 @@
         }
       },
       "text": {
-        "type": "text"
+        "type": "text",
+        "copy_to": "text_semantic"
+      },
+      "text_semantic": {
+        "type": "semantic_text"
       },
       "year": {
         "type": "integer"
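
The rest of the index creation request is unchanged, so the diff truncates it. As a rough sketch only, the full request after this change might look like the following; the `vector` field's `dims` and `similarity` values, and the reliance on the default inference endpoint for `semantic_text`, are illustrative assumptions, not taken from this PR:

```console
PUT retrievers_example
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 3,
        "similarity": "l2_norm",
        "index": true
      },
      "text": {
        "type": "text",
        "copy_to": "text_semantic"
      },
      "text_semantic": {
        "type": "semantic_text"
      },
      "year": {
        "type": "integer"
      },
      "topic": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}
```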
@@ -103,9 +108,11 @@ | |
|
||
## Example: Combining query and kNN with RRF [retrievers-examples-combining-standard-knn-retrievers-with-rrf] | ||
|
||
First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query. While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list. | ||
First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query. | ||
While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list. | ||
|
||
To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever. This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this: | ||
To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever. | ||
This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this: | ||
|
||
```console | ||
GET /retrievers_example/_search | ||
|
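
The body of this request is unchanged by the PR and therefore truncated in the diff. A minimal sketch of the structure being described, with the query text, vector, and tuning parameters as placeholder assumptions rather than the page's actual values:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "query_string": {
                "query": "(information retrieval) OR (artificial intelligence)"
              }
            }
          }
        },
        {
          "knn": {
            "field": "vector",
            "query_vector": [0.23, 0.67, 0.89],
            "k": 3,
            "num_candidates": 5
          }
        }
      ],
      "rank_window_size": 10,
      "rank_constant": 1
    }
  }
}
```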
@@ -190,9 +197,13 @@

 ## Example: Hybrid search with linear retriever [retrievers-examples-linear-retriever]

-A different, and more intuitive, way to provide hybrid search, is to linearly combine the top documents of different retrievers using a weighted sum of the original scores. Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` that would ensure that all scores for the top ranked documents of a retriever lie in a specific range.
+A different, and more intuitive, way to provide hybrid search is to linearly combine the top documents of different retrievers using a weighted sum of the original scores.
+Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` to ensure that all scores for the top-ranked documents of a retriever lie in a specific range.

-To implement this, we define a `linear` retriever, and along with a set of retrievers that will generate the heterogeneous results sets that we will combine. We will solve a problem similar to the above, by merging the results of a `standard` and a `knn` retriever. As the `standard` retriever’s scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range. We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.
+To implement this, we define a `linear` retriever along with a set of retrievers that will generate the heterogeneous result sets that we will combine.
+We will solve a problem similar to the above by merging the results of a `standard` and a `knn` retriever.
+As the `standard` retriever’s scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range.
+We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.

 So, let’s now specify the `linear` retriever whose final score is computed as follows:

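The formula itself is unchanged and truncated in the diff. As a sketch, the `linear` retriever's final score for a document is the weighted sum of each inner retriever's normalized score, along these lines (the weights are per-retriever configuration, and `minmax` rescales each retriever's scores into [0, 1]):

```latex
score(d) = w_{standard} \cdot \mathrm{minmax}(score_{standard}(d))
         + w_{knn} \cdot \mathrm{minmax}(score_{knn}(d))
```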
@@ -263,22 +274,22 @@
         "value": 3,
         "relation": "eq"
       },
-      "max_score": -1,
+      "max_score": 3.5,
       "hits": [
         {
           "_index": "retrievers_example",
           "_id": "2",
-          "_score": -1
+          "_score": 3.5
         },
         {
           "_index": "retrievers_example",
           "_id": "1",
-          "_score": -2
+          "_score": 2.3
         },
         {
           "_index": "retrievers_example",
           "_id": "3",
-          "_score": -3
+          "_score": 0.1
         }
       ]
     }

@@ -288,7 +299,8 @@
 ::::


-By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assign the timestamp as a score, and then normalizing this score to [0, 1]. Then, we can easily combine the above with a `knn` retriever as follows:
+By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assigning the timestamp as a score, and then normalizing this score to [0, 1].
+Then, we can easily combine the above with a `knn` retriever as follows:

 ```console
 GET /retrievers_example/_search
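
The full request is again truncated in the diff because it is unchanged. A rough sketch of the shape this kind of request takes, with the weights, query values, and timestamp scoring treated as assumptions for illustration:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "function_score": {
                  "query": { "term": { "topic": "ai" } },
                  "functions": [
                    {
                      "script_score": {
                        "script": {
                          "source": "doc['timestamp'].value.millis"
                        }
                      }
                    }
                  ],
                  "boost_mode": "replace"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  }
}
```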
@@ -369,27 +381,156 @@
         "value": 4,
         "relation": "eq"
       },
-      "max_score": -1,
+      "max_score": 3.5,
       "hits": [
         {
           "_index": "retrievers_example",
           "_id": "3",
-          "_score": -1
+          "_score": 3.5
         },
         {
           "_index": "retrievers_example",
           "_id": "2",
-          "_score": -2
+          "_score": 2.0
         },
         {
           "_index": "retrievers_example",
           "_id": "4",
-          "_score": -3
+          "_score": 1.1
         },
         {
           "_index": "retrievers_example",
           "_id": "1",
-          "_score": -4
+          "_score": 0.1
         }
       ]
     }
   }
   ```

 ::::

+## Example: RRF with the multi-field query format [retrievers-examples-rrf-multi-field-query-format]
+```yaml {applies_to}
+stack: ga 9.1
+```
+
+There's an even simpler way to execute a hybrid search, though: we can use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format), which allows us to query multiple fields without explicitly specifying inner retrievers.
+
+The following example uses the multi-field query format to query the `text` and `text_semantic` fields.
+Scores from [`text`](elasticsearch://reference/elasticsearch/mapping-reference/text.md) and [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set.
+
+For example, BM25 scores from `text` fields are unbounded, while vector similarity scores from `text_embedding` models are bounded between [0, 1].
+The multi-field query format [handles this normalization for us automatically](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping).
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "rrf": {
+      "query": "artificial intelligence",
+      "fields": ["text", "text_semantic"]
+    }
+  }
+}
+```
+
+This returns the following response based on the final rrf score for each result.
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 0.8333334,
+    "hits": [
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.25
+      }
+    ]
+  }
+}
+```
+
+::::
+
+We don't even need to specify the `fields` parameter when using the multi-field query format.
+If we omit it, the retriever will automatically query the fields specified in the `index.query.default_field` index setting, which is set to `*` by default.
+
+This default value will cause the retriever to query every field that either:
+
+- Supports term queries, such as `keyword` and `text` fields
+- Is a `semantic_text` field
+
+In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields.
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "rrf": {
+      "query": "artificial intelligence"
+    }
+  }
+}
+```
+
+This returns the following response based on the final rrf score for each result.
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 0.8333334,
+    "hits": [
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.25
+      }
+    ]
+  }
+}
+```
@@ -398,7 +539,76 @@
+
+::::
+
+See [wildcard field patterns](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-wildcard-field-patterns) for more information about wildcard resolution.
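
If we wanted the fields-omitted form to query a narrower set of fields, we could change that setting ourselves. A minimal sketch, assuming we want to limit it to the two text fields; this request is illustrative and not part of this PR:

```console
PUT /retrievers_example/_settings
{
  "index": {
    "query": {
      "default_field": ["text", "text_semantic"]
    }
  }
}
```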
+
+
+## Example: Linear retriever with the multi-field query format [retrievers-examples-linear-multi-field-query-format]
+```yaml {applies_to}
+stack: ga 9.1
+```
+
+We can also use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format) with the `linear` retriever.
+It works much the same way as [on the `rrf` retriever](#retrievers-examples-rrf-multi-field-query-format), with a couple of key differences:
+
+- We can use `^` notation to specify a [per-field boost](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-boosting)
+- We must set the `normalizer` parameter to specify the normalization method used to combine [field group scores](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping)
+
+The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field:
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "linear": {
+      "query": "artificial intelligence",
+      "fields": ["text", "text_semantic", "topic^2"],
+      "normalizer": "minmax"
+    }
+  }
+}
+```
+
+This returns the following response based on the normalized score for each result:
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 2.0,
+    "hits": [
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 2.0
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 1.2
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.1
+      }
+    ]
+  }
+}
+```
+
+::::
+
 ## Example: Grouping results by year with `collapse` [retrievers-examples-collapsing-retriever-results]

Reviewer comment (non-blocking): Reading through this again, I wonder if we should lead in with the simple way and then have the more complicated example below? Try to get more people to see the simplified retriever first? WDYT?