-
Notifications
You must be signed in to change notification settings - Fork 31
Manifest Splitting #767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Manifest Splitting #767
Conversation
dcherian
commented
Feb 21, 2025
•
edited
Loading
edited
- does the config get serialized properly?
- real-world benchmark; test with ERA5
- add ndim based condition (3D vs 4D)
8630650
to
fd1c572
Compare
icechunk/src/format/manifest.rs
Outdated
pub struct ManifestShards(Vec<ManifestExtents>); | ||
|
||
impl ManifestShards { | ||
pub fn default(ndim: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this, but it is certainly tied to ndim
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ManifestSplits is an enum to avoid this?
enum ManifestSplits {
Single,
Multiple(Vec<ManifestExtents>)
}
What I don't like is the empty vector. I wonder if Rust has a NonEmptyVec type, otherwise, a trick people use is:
...
Multiple{ first: ManifestExtents, rest: Vec<ManifestExtents>}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -37,9 +33,77 @@ impl ManifestExtents { | |||
Self(v) | |||
} | |||
|
|||
pub fn contains(&self, coord: &[u32]) -> bool { | |||
self.iter().zip(coord.iter()).all(|(range, that)| range.contains(that)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to start checking on writes that indexes have the proper size for the metadata
e7d9221
to
09476a4
Compare
9c1605f
to
34126a8
Compare
34126a8
to
d816f8b
Compare
76478b1
to
9a8bbc0
Compare
9a8bbc0
to
a64252a
Compare
icechunk/src/session.rs
Outdated
for chunk in chunks { | ||
let shard_index = shards.which(&chunk.coord)?; | ||
sharded_refs | ||
.entry(shard_index) | ||
.or_insert_with(|| Vec::with_capacity(ref_capacity)) | ||
.push(chunk); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am attempting to convert this method to accept a impl Stream<Item=SessionResult<ChunkInfo>>
but I don't see how to convert this groupby logic.
// - 0: 120 | ||
// - path: ./temperature # 4D variable: (time, level, latitude, longitude) | ||
// manifest-split-sizes: | ||
// - "level": 1 # alternatively 0: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs validation. E.g. do these dimensions exist? does that axis number make sense?
} | ||
|
||
static DEFAULT_MANIFEST_PRELOAD_CONFIG: OnceLock<ManifestPreloadConfig> = OnceLock::new(); | ||
static DEFAULT_MANIFEST_SHARDING_CONFIG: OnceLock<ManifestShardingConfig> = | ||
OnceLock::new(); | ||
|
||
impl ManifestConfig { | ||
pub fn merge(&self, other: Self) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merging is for merging a user-defined value with the library's default value.
icechunk/src/config.rs
Outdated
Self { | ||
preload: other.preload.or(self.preload.clone()), | ||
// FIXME: why prioritize one over the other? | ||
sharding: other.sharding.or(self.sharding.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: this could be overwrite instead. We need to careful about ordering after merge.
b512ab1
to
137f283
Compare
Local benchmarks as of late last night: Read: 4x speedup
benchmark code: one array with shape 500_000_000, chunks=1000, shard_size=100_000 so 5 shards with 100_000 chunk refs each. Reading one element (so really one chunk request). def fn():
repo = ic.Repository.open(
storage=synth_dataset.storage,
config=ic.RepositoryConfig(manifest=ic.ManifestConfig(preload=preload)),
)
ds = xr.open_zarr(
repo.readonly_session("main").store,
group=synth_dataset.group,
chunks=None,
consolidated=False,
)
subset = ds.isel(synth_dataset.chunk_selector)
subset[synth_dataset.load_variables].compute() Parameterized over:
Write: 10% slowdown on commit.
benchmark: This benchmarks only sharding: None (default) or shard size = 10_000 refs, so 50 shards in total. even for the default case of writing a single shard. But this is only 0.5s so I don't think it matters much.
|
Local ----- S3 --
* main: Fix `Diff` python typing (#890) Fail when creating Storage for Tigris using s3_compatible (#889) Disallow tag/branch creation with non-existing snapshot (#888) Log errors during listing and deleting of objects (#886) Rust integration tests can run in more object stores. (#884) Update pyo3. (#885) Add expiration to stateful test (#868)
ed14ec4
to
33966cb
Compare
c81875d
to
6fc7eb6
Compare
icechunk/src/format/manifest.rs
Outdated
pub struct ManifestShards(Vec<ManifestExtents>); | ||
|
||
impl ManifestShards { | ||
pub fn default(ndim: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ManifestSplits is an enum to avoid this?
enum ManifestSplits {
Single,
Multiple(Vec<ManifestExtents>)
}
What I don't like is the empty vector. I wonder if Rust has a NonEmptyVec type, otherwise, a trick people use is:
...
Multiple{ first: ManifestExtents, rest: Vec<ManifestExtents>}
/// ] | ||
/// ); | ||
/// assert_eq!(actual, expected); | ||
/// ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you be willing to write some property tests for this function?
@@ -87,6 +100,15 @@ impl ArrayShape { | |||
} | |||
} | |||
|
|||
// Implement indexing for immutable access | |||
impl Index<usize> for ArrayShape { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I don't like about this: panics. Maybe it should return an Option? Not sure if that's something people do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what the trait requires, so our hands are tied, no?
I could add a get
method instead that returns Option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is what i was trying to say, a get to get an option would be safer.
icechunk/src/session.rs
Outdated
@@ -1708,6 +1899,7 @@ async fn fetch_manifest( | |||
/// available in `from` and `to` arguments. | |||
/// | |||
/// Yes, this is horrible. | |||
#[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we no longer use this shit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i do a pass to group the references in to a manifest shard, so I just accumulate in that pass. I can delete.
* main: Bump the rust-dependencies group with 2 updates (#909) Release version 0.2.13 (#907) Skip bytes logging in object_store (#906) More randomness for test repo prefixes (#905) S3 Storage supports setting storage class (#903) Update configuration.md (#899) Add example to exercise high read concurrency (#896) Bump the rust-dependencies group with 2 updates (#897)
51d9d02
to
48bebf2
Compare
3dbac59
to
8c4cc59
Compare