
Commit ca82bfc

Mikep86 and leemthompo authored
Add Simplified Linear & RRF Retriever Examples (#2026)
Adds simplified `linear` and `rrf` retriever examples. Sibling PR to elastic/elasticsearch#130559.

Co-authored-by: Liam Thompson <[email protected]>
1 parent cb1eec0 commit ca82bfc

File tree

1 file changed (+230 -16)


solutions/search/retrievers-examples.md

Lines changed: 230 additions & 16 deletions
@@ -15,7 +15,8 @@ Learn how to combine different retrievers in these hands-on examples.
 
 ## Add example data [retrievers-examples-setup]
 
-To begin with, let's create the `retrievers_example` index, and add some documents to it. We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.
+To begin with, let's create the `retrievers_example` index, and add some documents to it.
+We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.
 
 ```console
 PUT retrievers_example
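
The hunk shows only the first line of this `PUT` request. Combined with the mapping change in the next hunk, the relevant parts of the index creation look roughly like this (a sketch that keeps only the settings and fields visible in this diff; the index's other fields are omitted):

```console
PUT retrievers_example
{
  "settings": {
    "number_of_shards": 1  // single shard, for consistent and reproducible ordering
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        // values indexed into `text` are also indexed into the semantic field
        "copy_to": "text_semantic"
      },
      "text_semantic": {
        "type": "semantic_text"
      },
      "year": {
        "type": "integer"
      }
    }
  }
}
```

With this pattern, a single indexed value serves both lexical (`text`) and semantic (`text_semantic`) retrieval, which is what the multi-field examples later in this diff rely on.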
@@ -35,7 +36,11 @@ PUT retrievers_example
         }
       },
       "text": {
-        "type": "text"
+        "type": "text",
+        "copy_to": "text_semantic"
+      },
+      "text_semantic": {
+        "type": "semantic_text"
       },
       "year": {
         "type": "integer"
@@ -103,9 +108,11 @@ Now that we have our documents in place, let’s try to run some queries using r
 
 ## Example: Combining query and kNN with RRF [retrievers-examples-combining-standard-knn-retrievers-with-rrf]
 
-First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query. While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list.
+First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query.
+While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list.
 
-To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever. This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this:
+To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever.
+This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this:
 
 ```console
 GET /retrievers_example/_search
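
The hunk cuts the request body off after the first line. As a minimal sketch of the structure the paragraph describes, an `rrf` retriever wrapping a `standard` and a `knn` retriever could look like the following; the query string, the `vector` field name, the query vector, and the `rank_window_size`/`rank_constant` values are illustrative assumptions, not taken from the commit:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          // lexical leg: a query_string query against the text field
          "standard": {
            "query": {
              "query_string": {
                "query": "(information retrieval) OR (artificial intelligence)",
                "default_field": "text"
              }
            }
          }
        },
        {
          // vector leg: approximate kNN against a dense_vector field
          "knn": {
            "field": "vector",
            "query_vector": [0.23, 0.67, 0.89],
            "k": 3,
            "num_candidates": 5
          }
        }
      ],
      "rank_window_size": 10,
      "rank_constant": 1
    }
  },
  "_source": false
}
```

RRF then ranks each document by summing `1 / (rank_constant + rank)` across the two result lists, so neither leg's raw score scale matters.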
@@ -190,9 +197,13 @@ This returns the following response based on the final rrf score for each result
 
 ## Example: Hybrid search with linear retriever [retrievers-examples-linear-retriever]
 
-A different, and more intuitive, way to provide hybrid search, is to linearly combine the top documents of different retrievers using a weighted sum of the original scores. Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` that would ensure that all scores for the top ranked documents of a retriever lie in a specific range.
+A different, and more intuitive, way to provide hybrid search is to linearly combine the top documents of different retrievers using a weighted sum of the original scores.
+Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` that would ensure that all scores for the top ranked documents of a retriever lie in a specific range.
 
-To implement this, we define a `linear` retriever, and along with a set of retrievers that will generate the heterogeneous results sets that we will combine. We will solve a problem similar to the above, by merging the results of a `standard` and a `knn` retriever. As the `standard` retriever’s scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range. We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.
+To implement this, we define a `linear` retriever, along with a set of retrievers that will generate the heterogeneous result sets that we will combine.
+We will solve a problem similar to the above by merging the results of a `standard` and a `knn` retriever.
+As the `standard` retriever’s scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range.
+We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.
 
 So, let’s now specify the `linear` retriever whose final score is computed as follows:
 
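The hunk ends just before the formula itself. From the prose above it is a weighted sum of `minmax`-normalized scores, so it takes roughly this shape (the weight symbols here are illustrative names, not the request's syntax):

$$
\text{score} = w_{\text{std}} \cdot \operatorname{minmax}(s_{\text{std}}) + w_{\text{knn}} \cdot \operatorname{minmax}(s_{\text{knn}}),
\qquad
\operatorname{minmax}(s) = \frac{s - s_{\min}}{s_{\max} - s_{\min}}
$$

where $s_{\min}$ and $s_{\max}$ range over the top documents returned by the corresponding retriever.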
@@ -263,22 +274,22 @@ This returns the following response based on the normalized weighted score for e
       "value": 3,
       "relation": "eq"
     },
-    "max_score": -1,
+    "max_score": 3.5,
     "hits": [
       {
         "_index": "retrievers_example",
         "_id": "2",
-        "_score": -1
+        "_score": 3.5
       },
       {
         "_index": "retrievers_example",
         "_id": "1",
-        "_score": -2
+        "_score": 2.3
       },
       {
         "_index": "retrievers_example",
         "_id": "3",
-        "_score": -3
+        "_score": 0.1
       }
     ]
   }
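
A quick sanity check on the updated scores: under `minmax` normalization the top document of each retriever normalizes to exactly 1.0. If the request weighted the two retrievers at, say, 2 and 1.5 (assumed values, since the hunk does not show the request body), a document ranked first by both retrievers would score $2 \cdot 1.0 + 1.5 \cdot 1.0 = 3.5$, which matches the `max_score` of `3.5` on document `2` above.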
@@ -288,7 +299,8 @@ This returns the following response based on the normalized weighted score for e
 ::::
 
 
-By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assign the timestamp as a score, and then normalizing this score to [0, 1]. Then, we can easily combine the above with a `knn` retriever as follows:
+By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assigning the timestamp as a score, and then normalizing this score to [0, 1].
+Then, we can easily combine the above with a `knn` retriever as follows:
 
 ```console
 GET /retrievers_example/_search
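
The request is again truncated by the hunk. A sketch of the idea under stated assumptions: the `standard` leg uses a `function_score` query whose `script_score` replaces the relevance score with the document's `timestamp`, and `linear` then `minmax`-normalizes that leg into [0, 1] before combining it with a `knn` leg. The script, weights, field names, and query vector are illustrative:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "function_score": {
                  "query": { "match_all": {} },
                  "functions": [
                    {
                      "script_score": {
                        // use the timestamp itself as the document's score
                        "script": { "source": "doc['timestamp'].value.millis" }
                      }
                    }
                  ],
                  "boost_mode": "replace"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"  // rescales raw timestamps into [0, 1]
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  },
  "_source": false
}
```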
@@ -369,27 +381,162 @@ Which would return the following results:
       "value": 4,
       "relation": "eq"
     },
-    "max_score": -1,
+    "max_score": 3.5,
     "hits": [
       {
         "_index": "retrievers_example",
         "_id": "3",
-        "_score": -1
+        "_score": 3.5
       },
       {
         "_index": "retrievers_example",
         "_id": "2",
-        "_score": -2
+        "_score": 2.0
       },
       {
         "_index": "retrievers_example",
         "_id": "4",
-        "_score": -3
+        "_score": 1.1
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 0.1
+      }
+    ]
+  }
+}
+```
+
+::::
+
+
+## Example: RRF with the multi-field query format [retrievers-examples-rrf-multi-field-query-format]
+```yaml {applies_to}
+stack: ga 9.1
+```
+
+There's an even simpler way to execute a hybrid search, though: we can use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format), which allows us to query multiple fields without explicitly specifying inner retrievers.
+
+One of the major challenges with hybrid search is normalizing the scores across matches on all field types.
+Scores from [`text`](elasticsearch://reference/elasticsearch/mapping-reference/text.md) and [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set.
+For example, BM25 scores from `text` fields are unbounded, while vector similarity scores from `text_embedding` models are bounded between [0, 1].
+The multi-field query format [handles this normalization for us automatically](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping).
+
+The following example uses the multi-field query format to query every field specified in the `index.query.default_field` index setting, which is set to `*` by default.
+This default value will cause the retriever to query every field that either:
+
+- Supports term queries, such as `keyword` and `text` fields
+- Is a `semantic_text` field
+
+In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields.
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "rrf": {
+      "query": "artificial intelligence"
+    }
+  }
+}
+```
+
+This returns the following response based on the final rrf score for each result.
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 0.8333334,
+    "hits": [
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 0.8333334
       },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.25
+      }
+    ]
+  }
+}
+```
+
+::::
+
+We can also use the `fields` parameter to explicitly specify the fields to query.
+The following example uses the multi-field query format to query the `text` and `text_semantic` fields.
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "rrf": {
+      "query": "artificial intelligence",
+      "fields": ["text", "text_semantic"]
+    }
+  }
+}
+```
+
+::::{note}
+The `fields` parameter also accepts [wildcard field patterns](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-wildcard-field-patterns).
+::::
+
+This returns the following response based on the final rrf score for each result.
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 0.8333334,
+    "hits": [
       {
         "_index": "retrievers_example",
         "_id": "1",
-        "_score": -4
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 0.8333334
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.25
       }
     ]
   }
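
Since the note above says the `fields` parameter also accepts wildcard patterns, here is one short variant: on this index, a `text*` pattern would match both `text` and `text_semantic` (the pattern choice is illustrative):

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text*"]  // expands to text and text_semantic
    }
  }
}
```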
@@ -399,6 +546,73 @@ Which would return the following results:
 ::::
 
 
+## Example: Linear retriever with the multi-field query format [retrievers-examples-linear-multi-field-query-format]
+```yaml {applies_to}
+stack: ga 9.1
+```
+
+We can also use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format) with the `linear` retriever.
+It works much the same way as [on the `rrf` retriever](#retrievers-examples-rrf-multi-field-query-format), with a couple of key differences:
+
+- We can use `^` notation to specify a [per-field boost](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-boosting)
+- We must set the `normalizer` parameter to specify the normalization method used to combine [field group scores](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping)
+
+The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field:
+
+```console
+GET /retrievers_example/_search
+{
+  "retriever": {
+    "linear": {
+      "query": "artificial intelligence",
+      "fields": ["text", "text_semantic", "topic^2"],
+      "normalizer": "minmax"
+    }
+  }
+}
+```
+
+This returns the following response based on the normalized score for each result:
+
+::::{dropdown} Example response
+```console-result
+{
+  "took": 42,
+  "timed_out": false,
+  "_shards": {
+    "total": 1,
+    "successful": 1,
+    "skipped": 0,
+    "failed": 0
+  },
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 2.0,
+    "hits": [
+      {
+        "_index": "retrievers_example",
+        "_id": "2",
+        "_score": 2.0
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "1",
+        "_score": 1.2
+      },
+      {
+        "_index": "retrievers_example",
+        "_id": "3",
+        "_score": 0.1
+      }
+    ]
+  }
+}
+```
+
+::::
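
For intuition on these numbers, one plausible reading of the field-grouping behavior linked above: each field group's score is `minmax`-normalized into [0, 1] before the groups are combined, so with one lexical and one semantic group a document ranked first in both tops out at $1.0 + 1.0 = 2.0$, which matches the `max_score` of `2.0` on document `2`. This is an interpretation offered for illustration; the hunk itself does not spell out how the boosted `topic^2` field enters the sum.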
 
 ## Example: Grouping results by year with `collapse` [retrievers-examples-collapsing-retriever-results]