Learn how to combine different retrievers in these hands-on examples.

## Add example data [retrievers-examples-setup]

To begin with, let's create the `retrievers_example` index, and add some documents to it.
We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.

```console
PUT retrievers_example
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      ...
      "text": {
        "type": "text",
        "copy_to": "text_semantic"
      },
      "text_semantic": {
        "type": "semantic_text"
      },
      "year": {
        "type": "integer"
      },
      ...
    }
  }
}
```

Now that we have our documents in place, let's try to run some queries using retrievers.

## Example: Combining query and kNN with RRF [retrievers-examples-combining-standard-knn-retrievers-with-rrf]

First, let's examine how to combine two different types of queries: a `kNN` query and a `query_string` query.
While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list.

To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever.
This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this:
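
The request below is a sketch of that structure: the `rrf` retriever wraps a `standard` retriever (here a `query_string` query on `text`) and a `knn` retriever. The query string, query vector, `k`, `num_candidates`, `rank_window_size`, and `rank_constant` values are illustrative assumptions rather than values tied to the example data, and the `vector` field is assumed to be a `dense_vector` field in the index mapping.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "query_string": {
                "query": "(information retrieval) OR (artificial intelligence)",
                "default_field": "text"
              }
            }
          }
        },
        {
          "knn": {
            "field": "vector",
            "query_vector": [0.23, 0.67, 0.89],
            "k": 3,
            "num_candidates": 5
          }
        }
      ],
      "rank_window_size": 10,
      "rank_constant": 1
    }
  },
  "_source": false
}
```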

## Example: Hybrid search with linear retriever [retrievers-examples-linear-retriever]

A different, and more intuitive, way to provide hybrid search is to linearly combine the top documents of different retrievers using a weighted sum of the original scores.
Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` to ensure that all scores for the top-ranked documents of a retriever lie in a specific range.

To implement this, we define a `linear` retriever along with a set of retrievers that will generate the heterogeneous result sets that we will combine.
We will solve a problem similar to the above by merging the results of a `standard` and a `knn` retriever.
As the `standard` retriever's scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range.
We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.

So, let's now specify the `linear` retriever whose final score is computed as follows:
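
In sketch form, the final score is a weighted sum of the normalized per-retriever scores, for example `score = 2 * minmax(bm25_score) + 1.5 * minmax(knn_score)` for weights of 2 and 1.5. The request below illustrates this structure; the query string, query vector, weights, and `rank_window_size` are illustrative assumptions, and the `vector` field is again assumed to be a `dense_vector` field in the mapping.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "query_string": {
                  "query": "(information retrieval) OR (artificial intelligence)",
                  "default_field": "text"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  },
  "_source": false
}
```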

This returns the following response based on the normalized weighted score for each result.

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 3.5,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 3.5
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 2.3
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.1
      }
    ]
  }
}
```

::::

By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assigning the timestamp as a score, and then normalizing this score to [0, 1].
Then, we can easily combine the above with a `knn` retriever as follows:
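
The request below sketches one way to express this: a `standard` retriever wraps a `function_score` query that replaces the BM25 score with the document's `timestamp` (via a script score), and a `knn` retriever contributes vector similarity; both are normalized with `minmax` and combined by the `linear` retriever. The `match_all` base query, the script, the weights, the query vector, and `rank_window_size` are illustrative assumptions rather than the original example.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "function_score": {
                  "query": {
                    "match_all": {}
                  },
                  "functions": [
                    {
                      "script_score": {
                        "script": {
                          "source": "doc['timestamp'].value.toInstant().toEpochMilli()"
                        }
                      }
                    }
                  ],
                  "boost_mode": "replace"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 4,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  },
  "_source": false
}
```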

Which would return the following results:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 3.5,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 3.5
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "4",
        "_score": 1.1
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.1
      }
    ]
  }
}
```

::::


## Example: RRF with the multi-field query format [retrievers-examples-rrf-multi-field-query-format]
```yaml {applies_to}
stack: ga 9.1
```

There's an even simpler way to execute a hybrid search though: We can use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format), which allows us to query multiple fields without explicitly specifying inner retrievers.

One of the major challenges with hybrid search is normalizing the scores across matches on all field types.
Scores from [`text`](elasticsearch://reference/elasticsearch/mapping-reference/text.md) and [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set.
For example, BM25 scores from `text` fields are unbounded, while vector similarity scores from `text_embedding` models are bounded between [0, 1].
The multi-field query format [handles this normalization for us automatically](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping).

The following example uses the multi-field query format to query every field specified in the `index.query.default_field` index setting, which is set to `*` by default.
This default value will cause the retriever to query every field that either:

- Supports term queries, such as `keyword` and `text` fields
- Is a `semantic_text` field

In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence"
    }
  }
}
```

This returns the following response based on the final rrf score for each result.

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::
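
For intuition about where such values can come from: RRF scores each document as the sum of `1 / (rank_constant + rank)` over the field groups in which it matches. With an illustrative `rank_constant` of 1, a document ranked first in one field group and second in the other would score `1/(1+1) + 1/(1+2) = 0.8333334`, while a document that only matches a single group at rank 3 would score `1/(1+3) = 0.25`. The constant used here is an assumption for the arithmetic; the retriever's default may differ.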

We can also use the `fields` parameter to explicitly specify the fields to query.
The following example uses the multi-field query format to query the `text` and `text_semantic` fields.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic"]
    }
  }
}
```

::::{note}
The `fields` parameter also accepts [wildcard field patterns](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-wildcard-field-patterns).
::::

This returns the following response based on the final rrf score for each result.

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::


## Example: Linear retriever with the multi-field query format [retrievers-examples-linear-multi-field-query-format]
```yaml {applies_to}
stack: ga 9.1
```

We can also use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format) with the `linear` retriever.
It works much the same way as [on the `rrf` retriever](#retrievers-examples-rrf-multi-field-query-format), with a couple of key differences:

- We can use `^` notation to specify a [per-field boost](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-boosting)
- We must set the `normalizer` parameter to specify the normalization method used to combine [field group scores](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping)

The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic", "topic^2"],
      "normalizer": "minmax"
    }
  }
}
```

This returns the following response based on the normalized score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 1.2
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.1
      }
    ]
  }
}
```

::::


## Example: Grouping results by year with `collapse` [retrievers-examples-collapsing-retriever-results]