[8.19] Simplified Linear and RRF Retrievers Docs #130842

Merged
81 changes: 64 additions & 17 deletions docs/reference/rest-api/common-parms.asciidoc
@@ -1310,8 +1310,26 @@ See <<index-wait-for-active-shards>>.
end::wait_for_active_shards[]

tag::rrf-retrievers[]

[NOTE]
====
Either `query` or `retrievers` must be specified.
Combining `query` and `retrievers` is not supported.
====

`query`::
(Optional, String)
+
The query to use when using the <<multi-field-query-format, multi-field query format>>.

`fields`::
(Optional, array of strings)
+
The fields to query when using the <<multi-field-query-format, multi-field query format>>.
If not specified, uses the index's default fields from the `index.query.default_field` index setting, which is `*` by default.

`retrievers`::
(Required, array of retriever objects)
(Optional, array of retriever objects)
+
A list of child retrievers to specify which sets of returned top documents
will have the RRF formula applied to them. Each child retriever carries an
@@ -1337,7 +1355,7 @@ This value determines the size of the individual result sets per
query. A higher value will improve result relevance at the cost of performance. The final
ranked result set is pruned down to the search request's <<search-size-param, size>>.
`rank_window_size` must be greater than or equal to `size` and greater than or equal to `1`.
Defaults to the `size` parameter.
Defaults to 10.
end::compound-retriever-rank-window-size[]

tag::compound-retriever-filter[]
@@ -1349,39 +1367,68 @@ according to each retriever's specifications.
end::compound-retriever-filter[]

tag::linear-retriever-components[]

[NOTE]
====
Either `query` or `retrievers` must be specified.
Combining `query` and `retrievers` is not supported.
====

`query`::
(Optional, String)
+
The query to use when using the <<multi-field-query-format, multi-field query format>>.

`fields`::
(Optional, array of strings)
+
The fields to query when using the <<multi-field-query-format, multi-field query format>>.
Fields can include boost values using the `^` notation (e.g., `"field^2"`).
If not specified, uses the index's default fields from the `index.query.default_field` index setting, which is `*` by default.

`normalizer`::
(Optional, String)
+
The normalizer to use when using the <<multi-field-query-format, multi-field query format>>.
See <<linear-retriever-normalizers, normalizers>> for supported values.
Required when `query` is specified.
+
[WARNING]
====
Avoid using `none` as that will disable normalization and may bias the result set towards lexical matches.
See <<multi-field-field-grouping, field grouping>> for more information.
====

`retrievers`::
(Required, array of objects)
(Optional, array of objects)
+
A list of sub-retriever configurations whose result sets will be merged through a weighted sum.
Each configuration can specify a different weight and normalization for its retriever.

Each entry specifies the following parameters:
include::common-parms.asciidoc[tag=compound-retriever-rank-window-size]

include::common-parms.asciidoc[tag=compound-retriever-filter]

* `retriever`::
Each entry in the `retrievers` array specifies the following parameters:

`retriever`::
(Required, a <<retriever, retriever>> object)
+
Specifies the retriever for which the top documents will be computed. The retriever will produce `rank_window_size`
results, which will later be merged based on the specified `weight` and `normalizer`.

* `weight`::
`weight`::
(Optional, float)
+
The weight by which each score of this retriever's top docs will be multiplied. Must be greater than or equal to 0. Defaults to 1.0.

* `normalizer`::
`normalizer`::
(Optional, String)
+
Specifies how we will normalize the retriever's scores, before applying the specified `weight`.
Available values are: `minmax`, and `none`. Defaults to `none`.

** `none`
** `minmax` :
A `MinMaxScoreNormalizer` that normalizes scores based on the following formula
+
```
score = (score - min) / (max - min)
```
Specifies how the retriever’s score will be normalized before applying the specified `weight`.
See <<linear-retriever-normalizers, normalizers>> for supported values.
Defaults to `none`.

See also <<retrievers-examples-linear-retriever, this hybrid search example>> using a linear retriever on how to
independently configure and apply normalizers to retrievers.
234 changes: 232 additions & 2 deletions docs/reference/search/retriever.asciidoc
@@ -121,6 +121,28 @@ POST /restaurants/_bulk?refresh

PUT /movies

PUT /books
{
"mappings": {
"properties": {
"title": {
"type": "text",
"copy_to": "title_semantic"
},
"description": {
"type": "text",
"copy_to": "description_semantic"
},
"title_semantic": {
"type": "semantic_text"
},
"description_semantic": {
"type": "semantic_text"
}
}
}
}

PUT _query_rules/my-ruleset
{
"rules": [
@@ -151,6 +173,8 @@ PUT _query_rules/my-ruleset
DELETE /restaurants

DELETE /movies

DELETE /books
--------------------------------------------------
// TEARDOWN
////
@@ -282,9 +306,19 @@ A retriever that normalizes and linearly combines the scores of other retrievers

include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=linear-retriever-components]

include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=compound-retriever-rank-window-size]
[[linear-retriever-normalizers]]
===== Normalizers

include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=compound-retriever-filter]
The `linear` retriever supports the following normalizers:

* `none`: No normalization
* `minmax`: Normalizes scores based on the following formula:
+
....
score = (score - min) / (max - min)
....

* `l2_norm`: Normalizes scores using the L2 norm of the score values
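
As a rough illustration of the two non-trivial normalizers listed above, the following Python sketch applies each formula to a plain list of scores. It is not the Elasticsearch implementation: the function names are hypothetical, and the handling of empty input, identical scores, and an all-zero score vector is an assumption made only so the sketch is well defined.

```python
import math


def minmax(scores):
    """Rescale scores with (score - min) / (max - min)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: the formula divides by zero here, so this sketch returns zeros.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def l2_norm(scores):
    """Divide each score by the L2 norm (Euclidean length) of the score vector."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        # Assumption: an empty or all-zero vector is returned as zeros.
        return [0.0 for _ in scores]
    return [s / norm for s in scores]


minmax_scores = minmax([2.0, 4.0, 6.0])  # lowest score maps to 0, highest to 1
l2_scores = l2_norm([3.0, 4.0])          # the norm of [3, 4] is 5
```

With `minmax`, the lowest score always maps to 0 and the highest to 1; with `l2_norm`, the relative proportions between scores are preserved.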

[[rrf-retriever]]
==== RRF Retriever
Expand Down Expand Up @@ -912,6 +946,202 @@ GET movies/_search
<1> The `rule` retriever is the outermost retriever, applying rules to the search results that were previously reranked using the `rrf` retriever.
<2> The `rrf` retriever returns results from all of its sub-retrievers, and the output of the `rrf` retriever is used as input to the `rule` retriever.

[discrete]
[[multi-field-query-format]]
=== Multi-field query format

The `linear` and `rrf` retrievers support a multi-field query format that provides a simplified way to define searches across multiple fields without explicitly specifying inner retrievers.
This format automatically generates appropriate inner retrievers based on the field types and query parameters.
This makes it possible to search an index with little or no knowledge of its schema, while also handling normalization across lexical and semantic matches.

[discrete]
[[multi-field-field-grouping]]
==== Field grouping

The multi-field query format groups queried fields into two categories:

- **Lexical fields**: fields that support term queries, such as `keyword` and `text` fields.
- **Semantic fields**: <<semantic-text, `semantic_text` fields>>.

Each field group is queried separately and the scores/ranks are normalized such that each contributes 50% to the final score/rank.
This balances the importance of lexical and semantic fields.
Most indices contain more lexical than semantic fields, and without this grouping the results would often bias towards lexical field matches.

[WARNING]
====
In the `linear` retriever, this grouping relies on using a normalizer other than `none` (i.e., `minmax` or `l2_norm`).
If you use the `none` normalizer, the scores across field groups will not be normalized and the results may be biased towards lexical field matches.
====
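
The effect described in this warning can be sketched in a few lines of Python. This is only a conceptual illustration, not how Elasticsearch computes scores internally: the document IDs and score values are made up, and the helper names `minmax` and `combine` are hypothetical.

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} mapping."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: ((s - lo) / span if span else 0.0) for doc, s in scores.items()}


def combine(lexical, semantic, normalize):
    """Give each field group a 50% share of the final score."""
    lex, sem = normalize(lexical), normalize(semantic)
    docs = set(lex) | set(sem)
    return {doc: 0.5 * lex.get(doc, 0.0) + 0.5 * sem.get(doc, 0.0) for doc in docs}


# Hypothetical scores: lexical scores are unbounded, semantic scores are small.
lexical = {"doc_a": 12.0, "doc_b": 4.0, "doc_c": 2.0}
semantic = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.6}

balanced = combine(lexical, semantic, minmax)
unnormalized = combine(lexical, semantic, lambda scores: dict(scores))

top_balanced = max(balanced, key=balanced.get)
top_unnormalized = max(unnormalized, key=unnormalized.get)
```

In this made-up data set, `doc_b` ranks first once both groups are normalized, while skipping normalization lets the larger lexical scores dominate and puts `doc_a` first.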

[discrete]
[[multi-field-field-boosting]]
==== Linear retriever field boosting

When using the `linear` retriever, fields can be boosted using the `^` notation:

[source,console]
----
GET books/_search
{
"retriever": {
"linear": {
"query": "elasticsearch",
"fields": [
"title^3", <1>
"description^2", <2>
"title_semantic", <3>
"description_semantic^2"
],
"normalizer": "minmax"
}
}
}
----
// TEST[continued]

<1> 3x weight
<2> 2x weight
<3> 1x weight (default)

Due to how the <<multi-field-field-grouping, field group scores>> are normalized, per-field boosts have no effect on the range of the final score.
Instead, they affect the importance of the field's score within its group.

For example, if the schema looks like:

[source,console]
----
PUT /books
{
"mappings": {
"properties": {
"title": {
"type": "text",
"copy_to": "title_semantic"
},
"description": {
"type": "text",
"copy_to": "description_semantic"
},
"title_semantic": {
"type": "semantic_text"
},
"description_semantic": {
"type": "semantic_text"
}
}
}
}
----
// TEST[skip:index created in test setup]

And we run this query:

[source,console]
----
GET books/_search
{
"retriever": {
"linear": {
"query": "elasticsearch",
"fields": [
"title",
"description",
"title_semantic",
"description_semantic"
],
"normalizer": "minmax"
}
}
}
----
// TEST[continued]

The score breakdown would be:

* Lexical fields (50% of score):
** `title`: 50% of lexical fields group score, 25% of final score
** `description`: 50% of lexical fields group score, 25% of final score
* Semantic fields (50% of score):
** `title_semantic`: 50% of semantic fields group score, 25% of final score
** `description_semantic`: 50% of semantic fields group score, 25% of final score

If we apply per-field boosts like so:

[source,console]
----
GET books/_search
{
"retriever": {
"linear": {
"query": "elasticsearch",
"fields": [
"title^3",
"description^2",
"title_semantic",
"description_semantic^2"
],
"normalizer": "minmax"
}
}
}
----
// TEST[continued]

The score breakdown would change to:

* Lexical fields (50% of score):
** `title`: 60% of lexical fields group score, 30% of final score
** `description`: 40% of lexical fields group score, 20% of final score
* Semantic fields (50% of score):
** `title_semantic`: 33.3% of semantic fields group score, 16.7% of final score
** `description_semantic`: 66.7% of semantic fields group score, 33.3% of final score
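
The arithmetic behind these breakdowns can be checked with a short Python sketch. It assumes, as described in the field grouping section, that each group receives an equal share of the final score and that a field's share within its group is proportional to its boost. The helper names `parse_boost` and `final_shares` are hypothetical and only illustrate the calculation.

```python
def parse_boost(field):
    """Split 'title^3' into ('title', 3.0); an unboosted field gets 1.0."""
    name, _, boost = field.partition("^")
    return name, float(boost) if boost else 1.0


def final_shares(groups):
    """Map each field to its share of the final score.

    `groups` maps a group name to the list of field strings in that group.
    """
    group_share = 1.0 / len(groups)
    shares = {}
    for fields in groups.values():
        boosts = dict(parse_boost(f) for f in fields)
        total = sum(boosts.values())
        for name, boost in boosts.items():
            shares[name] = group_share * boost / total
    return shares


shares = final_shares({
    "lexical": ["title^3", "description^2"],
    "semantic": ["title_semantic", "description_semantic^2"],
})
for name, share in shares.items():
    print(f"{name}: {share:.1%} of final score")
```

The shares always sum to 100%, which is why per-field boosts change the relative importance of fields without changing the range of the final score.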

[discrete]
[[multi-field-wildcard-field-patterns]]
==== Wildcard field patterns

Field names support the `*` wildcard character to match multiple fields:

[source,console]
----
GET books/_search
{
"retriever": {
"rrf": {
"query": "machine learning",
"fields": [
"title*", <1>
"*_text" <2>
]
}
}
}
----
// TEST[continued]

<1> Match fields that start with `title`
<2> Match fields that end with `_text`

Note, however, that wildcard field patterns will only resolve to fields that either:

- Support term queries, such as `keyword` and `text` fields
- Are `semantic_text` fields
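
The resolution rule above can be approximated with the following Python sketch. It is not Elasticsearch's resolver: the set of term-queryable types is an assumption limited to the two types named in this section, the `title_length` field is a hypothetical addition to the `books` mapping, and `resolve_fields` is a made-up helper.

```python
from fnmatch import fnmatchcase

# Assumption: only the types named in this section are treated as queryable.
LEXICAL_TYPES = {"text", "keyword"}
SEMANTIC_TYPES = {"semantic_text"}

# Field types from the `books` mapping, plus one hypothetical numeric field.
mapping = {
    "title": "text",
    "description": "text",
    "title_semantic": "semantic_text",
    "description_semantic": "semantic_text",
    "title_length": "integer",  # hypothetical, not part of the example index
}


def resolve_fields(patterns, mapping):
    """Return fields that match any pattern and have a supported type."""
    supported = LEXICAL_TYPES | SEMANTIC_TYPES
    return [
        name
        for name, field_type in mapping.items()
        if field_type in supported
        and any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


resolved = resolve_fields(["title*", "*_text"], mapping)
```

In this sketch `title*` resolves to `title` and `title_semantic`, the hypothetical `title_length` field is skipped because of its type, and `*_text` matches nothing because no field in this mapping ends with `_text`.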

[discrete]
[[multi-field-limitations]]
==== Limitations

- **Single index**: Multi-field queries currently work with single index searches only
- **CCS (Cross Cluster Search)**: Multi-field queries do not support remote cluster searches

[discrete]
[[multi-field-examples]]
==== Examples

- <<retrievers-examples-rrf-multi-field-query-format, RRF with the multi-field query format>>
- <<retrievers-examples-linear-multi-field-query-format, Linear retriever with the multi-field query format>>


[discrete]
[[retriever-common-parameters]]
=== Common usage guidelines