Remove vectors from _source transparently #130382

Merged: 26 commits, Jul 7, 2025

Conversation

jimczi
Contributor

@jimczi jimczi commented Jul 1, 2025

Summary

This PR introduces a new hybrid mode for the _source field that stores the original source without dense vector fields. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in search and get APIs, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.

Background

Today, Elasticsearch supports two modes for _source:

  • Stored: Original JSON is persisted as-is.
  • Synthetic: _source is reconstructed from doc values at read time.

However, dense vector fields have become problematic:

  • They don’t compress well, unlike text.
  • They are already stored in doc values, so storing them again in _source is wasteful.
  • Their _source representation is often overly precise (double precision), which isn’t needed for search/indexing.
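To make the storage argument concrete, here is a rough, illustrative sketch that compares the size of one vector serialized as JSON text (as it would appear in a stored _source) with the same vector packed as 4-byte floats. The 1536-dimension count, the field name `embedding`, and the random values are assumptions chosen for illustration only; real ratios depend on how the client serializes its floats.

```python
import json
import random


def source_vs_binary_bytes(dims: int, seed: int = 0) -> tuple[int, int]:
    """Return (bytes as JSON text, bytes as packed float32) for one random vector."""
    rng = random.Random(seed)
    # Python floats are 64-bit, so json.dumps emits up to 17 significant digits each.
    vector = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
    json_bytes = len(json.dumps({"embedding": vector}).encode("utf-8"))
    float32_bytes = 4 * dims  # one 4-byte float per dimension
    return json_bytes, float32_bytes


json_bytes, float32_bytes = source_vs_binary_bytes(dims=1536)
print(f"JSON: {json_bytes} B, float32: {float32_bytes} B, "
      f"ratio: {json_bytes / float32_bytes:.1f}x")
```

The JSON form carries many more digits per value than a 32-bit float can represent, which is the "overly precise" overhead described above.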

While switching to full synthetic source is an option, retrieving the full original _source (minus vectors) is often faster and more practical than loading each field from its own storage, especially when the number of metadata fields is high.

What This PR Adds

We’re introducing a hybrid source mode:

  • Keeps the original _source, minus any dense_vector fields.
  • Built on top of the synthetic source infrastructure, reusing parts of it.
  • Controlled via a single index-level setting.
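As a conceptual model only (this is not the Elasticsearch implementation), the hybrid mode can be pictured as splitting each document into a vector-free _source and a separate set of vectors, then putting the vectors back on demand. The function names `split_vectors` and `rehydrate`, the dotted-path convention, and the sample mapping are all hypothetical.

```python
import copy
from typing import Any


def split_vectors(source: dict[str, Any], mapping: dict[str, Any]):
    """Split a document into (_source without vectors, vectors keyed by dotted path)."""
    stored: dict[str, Any] = {}
    vectors: dict[str, list[float]] = {}

    def walk(doc, props, out, prefix):
        for name, value in doc.items():
            field = props.get(name, {})
            path = f"{prefix}{name}"
            if field.get("type") == "dense_vector":
                vectors[path] = value  # kept outside of the stored source
            elif isinstance(value, dict) and "properties" in field:
                out[name] = {}
                walk(value, field["properties"], out[name], path + ".")
            else:
                out[name] = value

    walk(source, mapping.get("properties", {}), stored, "")
    return stored, vectors


def rehydrate(stored: dict[str, Any], vectors: dict[str, list[float]]) -> dict[str, Any]:
    """Rebuild a full document by putting each vector back at its dotted path."""
    doc = copy.deepcopy(stored)
    for path, vector in vectors.items():
        *parents, leaf = path.split(".")
        node = doc
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = vector
    return doc


mapping = {"properties": {
    "title": {"type": "text"},
    "embedding": {"type": "dense_vector"},
    "meta": {"properties": {"lang": {"type": "keyword"},
                            "vec": {"type": "dense_vector"}}},
}}
doc = {"title": "hello", "embedding": [0.1, 0.2], "meta": {"lang": "en", "vec": [0.3]}}
stored, vectors = split_vectors(doc, mapping)
print(stored)
print(vectors)
```

The sketch ignores field names that themselves contain dots, and it returns vectors unchanged, whereas real rehydration may lose precision.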

Key Behavior

  • When enabled, dense_vector fields are excluded from _source at index time.

  • The setting also controls whether vectors are returned in search and get APIs:

    • This matters even for synthetic source, as rebuilding vectors is costly.
  • You can override behavior at query time using the exclude_vectors option.

  • The setting is:

    • Disabled by default
    • Protected by a feature flag
    • Intended to be enabled by default for new indices in a follow-up
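The interaction between the index-level default and the query-time override can be sketched as follows. This is a simplified model, not the actual resolution logic: the function `vectors_excluded` and its parameter names are hypothetical, and placing `exclude_vectors` under `_source` in the request body is an assumption, since the description above only names the option.

```python
import json
from typing import Optional


def vectors_excluded(index_excludes_vectors: bool,
                     request_exclude_vectors: Optional[bool] = None) -> bool:
    """Return True if vectors should be left out of the response.

    The request-level option wins when it is set; otherwise the
    index-level setting decides.
    """
    if request_exclude_vectors is None:
        return index_excludes_vectors
    return request_exclude_vectors


# Assumed request shape: ask for vectors back even if the index excludes them.
search_body = {
    "query": {"match_all": {}},
    "_source": {"exclude_vectors": False},  # assumption: option lives under _source
}
print(json.dumps(search_body, indent=2))
print(vectors_excluded(index_excludes_vectors=True, request_exclude_vectors=False))
```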

Motivation

This hybrid option is designed for use cases where users:

  • Want faster reads than full synthetic offers.
  • Don’t want the storage cost of large vectors in _source.
  • Are okay with some loss of precision when vectors are rehydrated.
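The precision loss mentioned in the last point can be illustrated with a small sketch, assuming the vector elements are kept internally as 32-bit floats while the client originally sent 64-bit values. The sample values are arbitrary.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = [0.1, -0.7342519834, 0.333333333333]
rehydrated = [to_float32(v) for v in original]
for before, after in zip(original, rehydrated):
    print(f"{before!r} -> {after!r} (abs error {abs(before - after):.2e})")
```

Values that are exactly representable in 32 bits (such as 0.5) survive unchanged; most others come back slightly different from what was originally indexed.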

By making this setting the default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.

Benchmark Results

Benchmarking this PR against main using the openai rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:

| Metric | Main (Baseline) | This PR (Contender) | Change | % Change |
|---|---|---|---|---|
| Indexing throughput (mean) | 1690.77 docs/s | 2704.57 docs/s | +1013.79 docs/s | +59.96% |
| Indexing time | 120.25 min | 74.32 min | –45.93 min | –38.20% |
| Merge time | 132.56 min | 69.28 min | –63.28 min | –47.74% |
| Merge throttle time | 100.99 min | 36.30 min | –64.69 min | –64.06% |
| Flush time | 2.71 min | 1.48 min | –1.23 min | –45.29% |
| Refresh count | 60 | 42 | –18 | –30.00% |
| Dataset / Store size | 52.29 GB | 19.30 GB | –32.99 GB | –63.09% |
| Young Gen GC time | 30.64 s | 22.17 s | –8.47 s | –27.65% |
| Search throughput (k=10, multi-client) | 613 ops/s | 677 ops/s | +64 ops/s | +10.42% |
| Search latency (p99, k=10) | 29.5 ms | 26.5 ms | –3.0 ms | –10.43% |
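For readers who want to check the arithmetic, the Change and % Change columns follow the usual relative-difference formula against the baseline. The sketch below recomputes a few rows from the rounded figures shown here; because the published numbers were presumably derived from unrounded measurements, the last digit can differ slightly.

```python
def change(baseline: float, contender: float) -> tuple[float, float]:
    """Return (absolute change, percent change relative to the baseline)."""
    delta = contender - baseline
    return delta, 100.0 * delta / baseline


# (baseline, contender) pairs copied from the benchmark results
rows = {
    "Indexing throughput (docs/s)": (1690.77, 2704.57),
    "Indexing time (min)": (120.25, 74.32),
    "Merge time (min)": (132.56, 69.28),
    "Store size (GB)": (52.29, 19.30),
}
for name, (baseline, contender) in rows.items():
    delta, pct = change(baseline, contender)
    print(f"{name}: {delta:+.2f} ({pct:+.2f}%)")
```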

Miscellaneous

Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to force the inclusion of vectors by default. This will be addressed in a follow-up, as this PR is already quite large.

@jimczi jimczi requested a review from a team as a code owner July 1, 2025 10:58
@jimczi jimczi added >enhancement :Search Relevance/Vectors Vector search :StorageEngine/Mapping The storage related side of mappings v9.2.0 labels Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @jimczi, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 1, 2025
Member

@benwtrent benwtrent left a comment

While this touches a ton of files, I am honestly surprised by how few changes were required 😅

We don't need anything special for reindex?

@benwtrent
Member

@jimczi I think you need to add the flag here: org.elasticsearch.test.cluster.FeatureFlag so it's on for testing.

Then you can add it to all the test runners (there are many of them; I would just add it to all the ones that have the IVF_FORMAT feature).

@jimczi
Contributor Author

jimczi commented Jul 2, 2025

We don't need anything special for reindex?

Yep, I left that for a follow-up. Just updated the description.

Member

@benwtrent benwtrent left a comment

@jimczi do you want to handle rank_vectors and sparse_vector in a separate PR?

Otherwise this looks good.

@jimczi
Contributor Author

jimczi commented Jul 3, 2025

I opened #130540 for the failures in https://buildkite.com/elastic/elasticsearch-pull-request/builds/78733. They're unrelated but legit ones.

@jimczi
Contributor Author

jimczi commented Jul 3, 2025

do you want to handle rank_vectors and sparse_vector in a separate PR?

Yep that's the plan

@jimczi jimczi removed the request for review from a team July 3, 2025 11:29
Single area changelog
Member

@martijnvg martijnvg left a comment

I left a few comments, but from a mapping side in general LGTM.

@jimczi jimczi merged commit c7a482a into elastic:main Jul 7, 2025
32 checks passed
@jimczi jimczi deleted the synthetic_vectors branch July 7, 2025 09:34
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jul 7, 2025
This change adds the support for synthetic vectors (added in elastic#130382) in the rank_vectors field type.
jimczi added a commit that referenced this pull request Jul 7, 2025
This change adds the support for synthetic vectors (added in #130382) in the rank_vectors field type.
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jul 7, 2025
This change adds the support for synthetic vectors (added in elastic#130382) in the sparse_vector field type.
jimczi added a commit that referenced this pull request Jul 7, 2025
This change adds the support for synthetic vectors (added in #130382) in the sparse_vector field type.
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jul 8, 2025
This change modifies reindex behavior to always include vector fields, even if the target index omits embeddings from _source.
This prepares for scenarios where embeddings may be automatically excluded (elastic#130382)
jimczi added a commit that referenced this pull request Jul 9, 2025
This change modifies reindex behavior to always include vector fields, even if the target index omits embeddings from _source.
This prepares for scenarios where embeddings may be automatically excluded (#130382).
Labels
>enhancement, :Search Relevance/Vectors (Vector search), :StorageEngine/Mapping (The storage related side of mappings), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch), Team:StorageEngine, v9.2.0
4 participants