
chore(dataobj): Download pages in 16MB batches #16689

Merged
merged 5 commits into grafana:main from dataobj-parallel-downloads
Mar 12, 2025

Conversation

rfratto
Member

@rfratto rfratto commented Mar 11, 2025

Previously, each page in a call to ReadPages resulted in one request to storage. When data objects were backed by object storage, this added significant latency as roundtrip times accumulated.

This PR batches pages in a call to ReadPages into 16MB windows, following S3's recommendation to use 8MB or 16MB chunks; 16MB was chosen to further reduce roundtrips. Windows are currently downloaded sequentially, though this could be updated to download concurrently if desired.

The effectiveness of this code depends on reading multiple columns and pages at once; this only happens when using dataset.Reader from #16429.

Additionally:

  • Column metadata now also supports batching, rather than downloading the metadata for one column at a time.
  • Dataset wrappers have been updated to retain the batching specified by the caller.

rfratto added 2 commits March 7, 2025 11:19
A common download size for chunked data in S3 is 8MB or 16MB. When
downloading a slice of pages, we find pages that align into an 8MB or
16MB "window" and download that entire set of pages in a single request.

This trades fewer roundtrips for downloading unneeded data: if only
two pages are downloaded and they fit within a 16MB window, the
majority of the data in that window could fall outside the range of
both pages.

This commit adds utilities for identifying windows. The windowing code
is made generic to permit windowing any arbitrary element in the file,
including pages and column metadata.
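A generic windowing utility along these lines might look like the following sketch. The `sized` interface and `window` function are hypothetical stand-ins for the commit's actual utilities; the point is that anything exposing a byte range (pages, column metadata) can be grouped into windows the same way.

```go
package main

import "fmt"

// sized is any file element occupying a byte range: a page, a column's
// metadata region, and so on. Hypothetical interface for illustration.
type sized interface {
	Range() (offset, size int64)
}

// window groups elements (assumed sorted by offset) into runs whose
// total span fits within max bytes; each run can then be downloaded
// with a single ranged request.
func window[T sized](elems []T, max int64) [][]T {
	var out [][]T
	for start := 0; start < len(elems); {
		base, _ := elems[start].Range()
		end := start + 1
		for end < len(elems) {
			off, sz := elems[end].Range()
			if off+sz-base > max {
				break // next element would overflow the window
			}
			end++
		}
		out = append(out, elems[start:end])
		start = end
	}
	return out
}

type page struct{ off, size int64 }

func (p page) Range() (int64, int64) { return p.off, p.size }

func main() {
	// Two 8MB pages span exactly one 16MB window; the third page
	// starts a new one.
	pages := []page{{0, 8_000_000}, {8_000_000, 8_000_000}, {16_000_000, 1_000_000}}
	for i, w := range window(pages, 16_000_000) {
		fmt.Printf("window %d holds %d page(s)\n", i, len(w))
	}
}
```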
@rfratto rfratto requested a review from a team as a code owner March 11, 2025 18:26
@rfratto rfratto force-pushed the dataobj-parallel-downloads branch from 8c67b3b to 53fa52c Compare March 11, 2025 18:28
// storage, 16MB is chosen over 8MB, as it will lead to fewer requests.
//
// [recommendations]: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
const windowSize = 16_000_000
Contributor

Would it make sense to use 16 * 1024 * 1024 instead? It's a similar number but reduces round trip times a little more.

Member Author

We could! But that would be 16 MiB, not 16 MB like S3 suggests.

Member Author
@rfratto rfratto Mar 12, 2025

I'm happy to try either, but I'm worried S3 may align reads on multiples of 1000 instead of 1024, which is why they recommended 8/16 MB 🤔 (or even multiples of 8,000,000)
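For reference, the difference between the two candidate constants is small but not nothing:

```go
package main

import "fmt"

func main() {
	const mb = 16_000_000        // 16 MB (SI units), matching S3's wording
	const mib = 16 * 1024 * 1024 // 16 MiB = 16,777,216 bytes
	fmt.Println(mib - mb)        // 777216 extra bytes per window (~4.9%)
}
```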

@rfratto rfratto requested a review from benclive March 12, 2025 12:34
@rfratto rfratto merged commit 8c4eb42 into grafana:main Mar 12, 2025
60 checks passed
@rfratto rfratto deleted the dataobj-parallel-downloads branch March 12, 2025 14:32