Remove vectors from `_source` transparently #130382

jimczi · 2025-07-01T10:58:04Z

Summary

This PR introduces a new hybrid mode for the _source field that stores the original source without dense vector fields. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in search and get APIs, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

Background

Today, Elasticsearch supports two modes for _source:

- Stored: Original JSON is persisted as-is.
- Synthetic: _source is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

- They don’t compress well, unlike text.
- They are already stored in doc values, so storing them again in _source is wasteful.
- Their _source representation is often overly precise (double precision), which isn’t needed for search/indexing.

While switching to full synthetic source is an option, retrieving the full original _source (minus vectors) is often faster and more practical than reading each field from its own storage when the number of metadata fields is high.

What This PR Adds

We’re introducing a hybrid source mode:

- Keeps the original _source, minus any dense_vector fields.
- Built on top of the synthetic source infrastructure, reusing parts of it.
- Controlled via a single index-level setting.

Key Behavior

- When enabled, dense_vector fields are excluded from _source at index time.
- The setting also controls whether vectors are returned in search and get APIs:
  - This matters even for synthetic source, as rebuilding vectors is costly.
- You can override behavior at query time using the exclude_vectors option.
- The setting is:
  - Disabled by default
  - Protected by a feature flag
  - Intended to be enabled by default for new indices in a follow-up
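
The behavior listed above can be sketched as a toy model in Python. This is not Elasticsearch code: `HybridSourceStore`, `index_doc`, and `get_source` are hypothetical names, the in-memory vector dictionary merely stands in for doc values, and the `exclude_vectors` parameter only mirrors the query-time option described above. Only top-level fields are handled.

```python
from typing import Any, Optional


class HybridSourceStore:
    """Toy model of the hybrid _source mode (hypothetical, illustration only)."""

    def __init__(self, vector_fields: set, exclude_vectors_default: bool = True):
        # Names of the fields treated as dense vectors (assumed to be top-level).
        self.vector_fields = vector_fields
        # Stand-in for the index-level setting: when True, reads omit vectors.
        self.exclude_vectors_default = exclude_vectors_default
        self._sources: dict = {}  # stored source, without vectors
        self._vectors: dict = {}  # stand-in for doc values

    def index_doc(self, doc_id: str, doc: dict) -> None:
        # At index time the vector fields are split off from the stored source.
        self._sources[doc_id] = {
            k: v for k, v in doc.items() if k not in self.vector_fields
        }
        self._vectors[doc_id] = {
            k: list(v) for k, v in doc.items() if k in self.vector_fields
        }

    def get_source(self, doc_id: str, exclude_vectors: Optional[bool] = None) -> Any:
        # A per-request value overrides the index-level default.
        if exclude_vectors is None:
            exclude_vectors = self.exclude_vectors_default
        source = dict(self._sources[doc_id])
        if not exclude_vectors:
            # Rehydrate the vectors only when the caller asks for them.
            source.update(self._vectors[doc_id])
        return source


store = HybridSourceStore(vector_fields={"embedding"})
store.index_doc("1", {"title": "hello", "embedding": [0.5, 0.25]})
print(store.get_source("1"))
print(store.get_source("1", exclude_vectors=False))
```

The point of the sketch is the split: the common read path touches only the vector-free source, and the cost of rebuilding vectors is paid only when a request opts in.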

Motivation

This hybrid option is designed for use cases where users:

- Want faster reads than full synthetic offers.
- Don’t want the storage cost of large vectors in _source.
- Are okay with some loss of precision when vectors are rehydrated.
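
The loss of precision mentioned in the last point can be illustrated with a short Python sketch. It assumes the indexed vector is held as 32-bit floats while the original JSON carried double-precision values; `to_float32` is a hypothetical helper, not part of any Elasticsearch API.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float encoding."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Values as they might appear in the original JSON source (double precision).
original = [0.1, 0.25, 0.123456789012345]

# Values as they would come back after being stored as 32-bit floats.
rehydrated = [to_float32(v) for v in original]

for before, after in zip(original, rehydrated):
    print(f"{before!r} -> {after!r} (abs error {abs(before - after):.3e})")
```

Values that are exactly representable in 32 bits, such as 0.25, survive unchanged; others come back with a small rounding error.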

By making this setting the default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

Benchmark Results

Benchmarking this PR against main using the openai rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric | Main (Baseline) | This PR (Contender) | Change | % Change |
|---|---|---|---|---|
| Indexing throughput (mean) | 1690.77 docs/s | 2704.57 docs/s | +1013.79 | +59.96% |
| Indexing time | 120.25 min | 74.32 min | –45.93 | –38.20% |
| Merge time | 132.56 min | 69.28 min | –63.28 | –47.74% |
| Merge throttle time | 100.99 min | 36.30 min | –64.69 | –64.06% |
| Flush time | 2.71 min | 1.48 min | –1.23 | –45.29% |
| Refresh count | 60 | 42 | –18 | –30.00% |
| Dataset / Store size | 52.29 GB | 19.30 GB | –32.99 GB | –63.09% |
| Young Gen GC time | 30.64 s | 22.17 s | –8.47 | –27.65% |
| Search throughput (k=10, multi-client) | 613 ops/s | 677 ops/s | +64 ops/s | +10.42% |
| Search latency (p99, k=10) | 29.5 ms | 26.5 ms | –3.0 ms | –10.43% |
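
For readers who want to check the "% Change" column, the arithmetic is relative change against the baseline. The sketch below recomputes three rows from the displayed figures; `pct_change` is a hypothetical helper, and because the displayed values are presumably rounded, a recomputed percentage can differ from the table in the last digit.

```python
def pct_change(baseline: float, contender: float) -> float:
    """Relative change of contender versus baseline, in percent."""
    return (contender - baseline) / baseline * 100.0


# (baseline, contender) pairs copied from the table above.
rows = {
    "Indexing throughput (docs/s)": (1690.77, 2704.57),
    "Merge time (min)": (132.56, 69.28),
    "Dataset / Store size (GB)": (52.29, 19.30),
}

for name, (baseline, contender) in rows.items():
    print(f"{name}: {pct_change(baseline, contender):+.2f}%")
```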

Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to force the inclusion of vectors by default. This will be addressed in a follow-up, as this PR is already quite large.

elasticsearchmachine · 2025-07-01T10:58:28Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine · 2025-07-01T10:58:28Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-07-01T10:58:28Z

Hi @jimczi, I've created a changelog YAML for you.

benwtrent

While this touches a ton of files, I am honestly surprised by how few changes were required 😅

We don't need anything special for reindex?

server/src/main/java/org/elasticsearch/index/IndexSettings.java

...rc/yamlRestTest/resources/rest-api-spec/test/search.vectors/240_source_synthetic_vectors.yml

benwtrent · 2025-07-01T12:18:28Z

@jimczi I think you need to add the flag here: org.elasticsearch.test.cluster.FeatureFlag so it's on for testing.

Then you can add it to all the test runners (there are many of them; I would just add it to all the ones that have the IVF_FORMAT feature).

jimczi · 2025-07-02T08:16:02Z

> We don't need anything special for reindex?

Yep, I left that for a follow-up. Just updated the description.

benwtrent

@jimczi do you want to handle rank_vectors and sparse_vector in a separate PR?

Otherwise this looks good.

...rc/yamlRestTest/resources/rest-api-spec/test/search.vectors/240_source_synthetic_vectors.yml

server/src/main/java/org/elasticsearch/index/mapper/MappingLookup.java

jimczi · 2025-07-03T10:33:56Z

I opened #130540 for the failures in https://buildkite.com/elastic/elasticsearch-pull-request/builds/78733. They're unrelated but legit ones.

jimczi · 2025-07-03T11:28:58Z

> do you want to handle rank_vectors and sparse_vector in a separate PR?

Yep that's the plan

martijnvg

I left a few comments, but from a mapping side in general LGTM.

server/src/main/java/org/elasticsearch/index/IndexSettings.java

server/src/main/java/org/elasticsearch/index/mapper/SourceFieldMapper.java

This change modifies reindex behavior to always include vector fields, even if the target index omits embeddings from _source. This prepares for scenarios where embeddings may be automatically excluded (#130382).

jimczi added 5 commits July 1, 2025 11:14

Introduce a new setting and a feature flag for the synthetic vectors mode in mappings

746a233

Introduce a synthetic vectors source loader and a synthetic vectors field loader in mapping

309e4c3

Handle the new setting in search and get api to exclude vectors when it is set to true

9bfea9c

handle recovery and translog when synthetic vectors is on

6c584e9

Add support for synthetic vectors in dense vector field mapper

9357340

jimczi requested a review from a team as a code owner July 1, 2025 10:58

jimczi added the >enhancement, :Search Relevance/Vectors, :StorageEngine/Mapping, and v9.2.0 labels Jul 1, 2025

elasticsearchmachine added the Team:StorageEngine label Jul 1, 2025

elasticsearchmachine added the Team:Search Relevance label Jul 1, 2025

jimczi added 2 commits July 1, 2025 11:58

Update docs/changelog/130382.yaml

4ff2123

Merge branch 'main' into synthetic_vectors

883af2c

benwtrent reviewed Jul 1, 2025

jimczi added 9 commits July 1, 2025 14:18

Propagate feature flag where it's needed

62c98c3

Merge remote-tracking branch 'origin/synthetic_vectors' into synthetic_vectors

ba65346

add yml tests for partial updates and get API

fa02743

Merge remote-tracking branch 'upstream/main' into synthetic_vectors

e419b76

fix propagation of leaf reader

56d7b75

fix RcsCcsCommonYamlTestSuiteIT

52d9278

add yaml tests with the fields option and patch the vectors as list to match the xcontent parsing

1d64c86

Merge remote-tracking branch 'upstream/main' into synthetic_vectors

bacc3fe

Merge branch 'main' into synthetic_vectors

bf61ed2

benwtrent approved these changes Jul 2, 2025

...rc/yamlRestTest/resources/rest-api-spec/test/search.vectors/240_source_synthetic_vectors.yml

server/src/main/java/org/elasticsearch/index/mapper/MappingLookup.java (outdated)

jimczi added 2 commits July 2, 2025 21:10

Merge remote-tracking branch 'upstream/main' into synthetic_vectors

9b383d3

empty line

6cdf89b

Merge branch 'main' into synthetic_vectors

439dae1

jimczi removed the request for review from a team July 3, 2025 11:29

martijnvg reviewed Jul 4, 2025

server/src/main/java/org/elasticsearch/index/IndexSettings.java (outdated)

Update 130382.yaml

6554aee

Single area changelog

martijnvg approved these changes Jul 4, 2025

jimczi added 6 commits July 4, 2025 11:34

apply review comments

27c3a3b

Add comments and make the code more readable

5845c27

Merge remote-tracking branch 'upstream/main' into synthetic_vectors

26ab174

Merge branch 'main' into synthetic_vectors

efaf6bd

Merge remote-tracking branch 'upstream/main' into synthetic_vectors

f0cbd8b

Merge remote-tracking branch 'origin/synthetic_vectors' into synthetic_vectors

e494584

jimczi merged commit c7a482a into elastic:main Jul 7, 2025
32 checks passed

jimczi deleted the synthetic_vectors branch July 7, 2025 09:34

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jul 7, 2025

Add synthetic vectors support for rank_vectors

aa228e6

This change adds the support for synthetic vectors (added in elastic#130382) in the rank_vectors field type.

jimczi mentioned this pull request Jul 7, 2025

Add synthetic vectors support for rank_vectors #130715

Merged

jimczi added a commit that referenced this pull request Jul 7, 2025

Add synthetic vectors support for rank_vectors (#130715)

5a4961b

This change adds the support for synthetic vectors (added in #130382) in the rank_vectors field type.

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jul 7, 2025

Add synthetic vectors support for sparse_vector

453cf48

This change adds the support for synthetic vectors (added in elastic#130382) in the sparse_vector field type.

jimczi mentioned this pull request Jul 7, 2025

Add synthetic vectors support for sparse_vector #130756

Merged

jimczi added a commit that referenced this pull request Jul 7, 2025

Add synthetic vectors support for sparse_vector (#130756)

6d81ff9

This change adds the support for synthetic vectors (added in #130382) in the sparse_vector field type.

jimczi mentioned this pull request Jul 8, 2025

Ensure vectors are always included in reindex actions #130834

Merged
