Commit c7a482a

Remove vectors from _source transparently (#130382)
## Summary

This PR introduces a new **hybrid mode for the `_source` field** that stores the original source **without dense vector fields**. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in **search and get APIs**, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

## Background

Today, Elasticsearch supports two modes for `_source`:

* **Stored**: Original JSON is persisted as-is.
* **Synthetic**: `_source` is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

* They **don’t compress well**, unlike text.
* They are **already stored in doc values**, so storing them again in `_source` is wasteful.
* Their `_source` representation is often **overly precise** (double precision), which isn’t needed for search/indexing.

While switching to full synthetic is an option, retrieving the full original `_source` (minus vectors) is often faster and more practical than pulling individual fields from individual storage when the number of metadata fields is high.

## What This PR Adds

We’re introducing a **hybrid source mode**:

* Keeps the original `_source`, **minus any `dense_vector` fields**.
* Built on top of the **synthetic source infrastructure**, reusing parts of it.
* Controlled via a **single index-level setting**.

### Key Behavior

* When enabled, `dense_vector` fields are **excluded from `_source` at index time**.
* The setting **also controls whether vectors are returned in search and get APIs**:
  * This matters even for **synthetic source**, as **rebuilding vectors is costly**.
  * You can override behavior at query time using the `exclude_vectors` option.
* The setting is:
  * **Disabled by default**
  * **Protected by a feature flag**
  * Intended to be **enabled by default for new indices** in a follow-up

## Motivation

This hybrid option is designed for use cases where users:

* Want faster reads than full synthetic offers.
* Don’t want the storage cost of large vectors in `_source`.
* Are okay with **some loss of precision** when vectors are rehydrated.

By making this setting default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

## Benchmark Results

Benchmarking this PR against `main` using the `openai` rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric | Main (Baseline) | This PR (Contender) | Change | % Change |
| :----------------------------------------- | :-------------- | :------------------ | :--------------- | :---------- |
| **Indexing throughput (mean)** | 1690.77 docs/s | 2704.57 docs/s | +1013.79 docs/s | **+59.96%** |
| **Indexing time** | 120.25 min | 74.32 min | –45.93 min | **–38.20%** |
| **Merge time** | 132.56 min | 69.28 min | –63.28 min | **–47.74%** |
| **Merge throttle time** | 100.99 min | 36.30 min | –64.69 min | **–64.06%** |
| **Flush time** | 2.71 min | 1.48 min | –1.23 min | **–45.29%** |
| **Refresh count** | 60 | 42 | –18 | **–30.00%** |
| **Dataset / Store size** | 52.29 GB | 19.30 GB | –32.99 GB | **–63.09%** |
| **Young Gen GC time** | 30.64 s | 22.17 s | –8.47 s | **–27.65%** |
| **Search throughput (k=10, multi-client)** | 613 ops/s | 677 ops/s | +64 ops/s | **+10.42%** |
| **Search latency (p99, k=10)** | 29.5 ms | 26.5 ms | –3.0 ms | **–10.43%** |

## Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to **force the inclusion of** vectors by default. This will be addressed in a follow-up, as this PR is already quite large.
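The hybrid behavior described under "Key Behavior" can be pictured with a small, self-contained Java sketch. This is not the code from this PR: the class `HybridSourceSketch`, its methods, and the field name `embedding` are hypothetical, and the sketch only handles top-level fields. It assumes the vector is kept as 32-bit floats outside `_source`, which is one way to see where the "loss of precision" on rehydration could come from.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy model of "store _source minus dense vectors".
// None of these names come from the Elasticsearch code base.
class HybridSourceSketch {

    // Returns a copy of the source without the top-level fields listed in vectorFields.
    static Map<String, Object> stripVectors(Map<String, Object> source, Set<String> vectorFields) {
        Map<String, Object> stored = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if (!vectorFields.contains(entry.getKey())) {
                stored.put(entry.getKey(), entry.getValue());
            }
        }
        return stored;
    }

    // Assumption: the vector component is kept as a 32-bit float and widened
    // back to double when the vector is rehydrated into the response.
    static double roundTrip(double original) {
        return (double) (float) original;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("title", "hello");
        source.put("embedding", List.of(0.1, 0.2, 0.3));

        Map<String, Object> stored = stripVectors(source, Set.of("embedding"));
        System.out.println(stored);                 // prints {title=hello}
        System.out.println(roundTrip(0.1) == 0.1);  // prints false: precision was lost
    }
}
```

The second print illustrates why a vector read back from the index may not be bit-for-bit identical to the one in the original JSON, even though it is equivalent for search purposes.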
1 parent ec5254b commit c7a482a

36 files changed, +1677 -226 lines changed

docs/changelog/130382.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 130382
+summary: Remove vectors from `_source` transparently
+area: "Vector Search"
+type: enhancement
+issues: []

qa/ccs-common-rest/build.gradle

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ apply plugin: 'elasticsearch.internal-yaml-rest-test'
 restResources {
   restApi {
     include 'capabilities', 'cat.shards', '_common', 'bulk', 'count', 'cluster', 'field_caps', 'get', 'knn_search', 'index', 'indices', 'msearch',
-      "nodes.stats", 'search', 'async_search', 'graph', '*_point_in_time', 'info', 'scroll', 'clear_scroll', 'search_mvt', 'eql', 'sql'
+      "nodes.stats", 'search', 'async_search', 'graph', '*_point_in_time', 'info', 'scroll', 'clear_scroll', 'search_mvt', 'eql', 'sql', 'update'
   }
   restTests {
     includeCore 'field_caps', 'msearch', 'search', 'suggest', 'scroll', "indices.resolve_index"

qa/ccs-common-rest/src/yamlRestTest/java/org/elasticsearch/test/rest/yaml/CcsCommonYamlTestSuiteIT.java

Lines changed: 2 additions & 1 deletion
@@ -91,7 +91,8 @@ public class CcsCommonYamlTestSuiteIT extends ESClientYamlSuiteTestCase {
         .setting("xpack.license.self_generated.type", "trial")
         .feature(FeatureFlag.TIME_SERIES_MODE)
         .feature(FeatureFlag.SUB_OBJECTS_AUTO_ENABLED)
-        .feature(FeatureFlag.IVF_FORMAT);
+        .feature(FeatureFlag.IVF_FORMAT)
+        .feature(FeatureFlag.SYNTHETIC_VECTORS);
 
     private static ElasticsearchCluster remoteCluster = ElasticsearchCluster.local()
         .name(REMOTE_CLUSTER_NAME)

qa/ccs-common-rest/src/yamlRestTest/java/org/elasticsearch/test/rest/yaml/RcsCcsCommonYamlTestSuiteIT.java

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ public class RcsCcsCommonYamlTestSuiteIT extends ESClientYamlSuiteTestCase {
         .feature(FeatureFlag.TIME_SERIES_MODE)
         .feature(FeatureFlag.SUB_OBJECTS_AUTO_ENABLED)
         .feature(FeatureFlag.IVF_FORMAT)
+        .feature(FeatureFlag.SYNTHETIC_VECTORS)
         .user("test_admin", "x-pack-test-password");
 
     private static ElasticsearchCluster fulfillingCluster = ElasticsearchCluster.local()

qa/smoke-test-multinode/src/yamlRestTest/java/org/elasticsearch/smoketest/SmokeTestMultiNodeClientYamlTestSuiteIT.java

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ public class SmokeTestMultiNodeClientYamlTestSuiteIT extends ESClientYamlSuiteTe
         .feature(FeatureFlag.DOC_VALUES_SKIPPER)
         .feature(FeatureFlag.USE_LUCENE101_POSTINGS_FORMAT)
         .feature(FeatureFlag.IVF_FORMAT)
+        .feature(FeatureFlag.SYNTHETIC_VECTORS)
         .build();
 
     public SmokeTestMultiNodeClientYamlTestSuiteIT(@Name("yaml") ClientYamlTestCandidate testCandidate) {

rest-api-spec/src/main/resources/rest-api-spec/api/get.json

Lines changed: 4 additions & 0 deletions
@@ -68,6 +68,10 @@
         "type":"list",
         "description":"A list of fields to extract and return from the _source field"
       },
+      "_source_exclude_vectors":{
+        "type":"boolean",
+        "description":"Whether vectors should be excluded from _source"
+      },
       "version":{
         "type":"number",
         "description":"Explicit version number for concurrency control"

rest-api-spec/src/main/resources/rest-api-spec/api/search.json

Lines changed: 4 additions & 0 deletions
@@ -155,6 +155,10 @@
         "type":"list",
         "description":"A list of fields to extract and return from the _source field"
       },
+      "_source_exclude_vectors":{
+        "type":"boolean",
+        "description":"Whether vectors should be excluded from _source"
+      },
       "terminate_after":{
         "type":"number",
         "description":"The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early."
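The two spec changes above add the same boolean query parameter, `_source_exclude_vectors`, to the get and search APIs. As a rough illustration of how a client might pass it, the sketch below only builds the request URIs and does not contact a cluster. The class and method names, the base URL `http://localhost:9200`, the index `my-index`, and the document id `1` are all placeholders, and whether the parameter is honored depends on the feature flag described in the PR.

```java
import java.net.URI;

// Illustrative only: builds request URIs that carry the new query parameter.
// The class, method names, host, index, and id are hypothetical.
class ExcludeVectorsRequestSketch {

    // GET /{index}/_doc/{id}?_source_exclude_vectors=<bool>
    static URI getUri(String baseUrl, String index, String id, boolean excludeVectors) {
        return URI.create(baseUrl + "/" + index + "/_doc/" + id
            + "?_source_exclude_vectors=" + excludeVectors);
    }

    // GET /{index}/_search?_source_exclude_vectors=<bool>
    static URI searchUri(String baseUrl, String index, boolean excludeVectors) {
        return URI.create(baseUrl + "/" + index + "/_search"
            + "?_source_exclude_vectors=" + excludeVectors);
    }

    public static void main(String[] args) {
        // Ask for vectors to be included in _source for a single document.
        System.out.println(getUri("http://localhost:9200", "my-index", "1", false));
        // Ask for vectors to be left out of _source in search hits.
        System.out.println(searchUri("http://localhost:9200", "my-index", true));
    }
}
```

Passing `false` is the case that matters for callers that need the vectors back, such as the reindex scenario mentioned in the PR description.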

rest-api-spec/src/yamlRestTest/java/org/elasticsearch/test/rest/ClientYamlTestSuiteIT.java

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ public class ClientYamlTestSuiteIT extends ESClientYamlSuiteTestCase {
         .feature(FeatureFlag.DOC_VALUES_SKIPPER)
         .feature(FeatureFlag.USE_LUCENE101_POSTINGS_FORMAT)
         .feature(FeatureFlag.IVF_FORMAT)
+        .feature(FeatureFlag.SYNTHETIC_VECTORS)
         .build();
 
     public ClientYamlTestSuiteIT(@Name("yaml") ClientYamlTestCandidate testCandidate) {
