@EdSchouten EdSchouten commented Nov 6, 2022

Buildbarn has invested heavily in using virtual file systems. Both on the worker and client side it's possible to lazily fault in data from the CAS. As Buildbarn implements checksum verification where needed, randomly accessing large files may be slow, since a blob's checksum can only be verified once the blob has been read in full. To address this, this change adds support for composing and decomposing CAS objects, using newly added ConcatenateBlobs() and SplitBlobs() operations.

If implemented naively (e.g., using SHA-256), these operations would not be verifiable. To rephrase: when merely given the checksums of smaller objects, there is no way to obtain the checksum of their concatenation. This is why we suggest that these operations only be used in combination with SHA256TREE (see #235).
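
To illustrate why a tree-shaped digest makes these operations verifiable, here is a minimal sketch of a toy binary Merkle hash built on SHA-256. It is not the SHA256TREE construction from #235; the chunk size, the domain-separation prefixes and all function names are assumptions made up for this example. The point is only that the digest of a large object can be recomputed from the digests of its parts, which a flat SHA-256 over the whole object does not allow.

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical chunk size, chosen only for this sketch


def leaf_hash(chunk: bytes) -> bytes:
    # Prefix leaves and inner nodes differently so they cannot be confused.
    return hashlib.sha256(b"\x00" + chunk).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def root_from_leaves(leaves):
    # Combine hashes pairwise until one remains; an odd node is carried up as-is.
    level = list(leaves)
    while len(level) > 1:
        nxt = [parent_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def toy_tree_hash(data: bytes) -> bytes:
    # Hash of a whole object: split into fixed-size chunks, then build the tree.
    chunks = [data[i:i + CHUNK_SIZE]
              for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    return root_from_leaves([leaf_hash(c) for c in chunks])
```

In this toy scheme, two parts that each hold a power-of-two number of full chunks can be joined without touching their contents: `parent_hash(toy_tree_hash(a), toy_tree_hash(b))` equals `toy_tree_hash(a + b)`. With plain SHA-256 the only way to obtain the digest of `a + b` is to hash all of the data again.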

With these new operations present, there is no true need to use the ByteStream protocol any longer. Writes can be performed by uploading smaller parts through BatchUpdateBlobs(), followed by calling ConcatenateBlobs(). Conversely, reads of large objects can be performed by calling SplitBlobs() and downloading individual parts through BatchReadBlobs(). For compatibility, we still permit the ByteStream protocol to be used. This is a decision we can revisit in REv3.
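
The write and read paths described above can be sketched against an in-memory stand-in for the CAS. Everything here is hypothetical: `FakeCAS`, its method signatures, and the `upload_large` and `download_large` helpers exist only for this example and do not reflect the actual message definitions in this PR. For brevity the fake keys blobs by plain SHA-256, so it shows the call sequence only, not the verification that SHA256TREE would permit.

```python
import hashlib


class FakeCAS:
    """In-memory stand-in; method names mirror the RPCs, signatures are made up."""

    def __init__(self):
        self._blobs = {}

    @staticmethod
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def batch_update_blobs(self, blobs):
        for blob in blobs:
            self._blobs[self.digest(blob)] = blob

    def batch_read_blobs(self, digests):
        return [self._blobs[d] for d in digests]

    def concatenate_blobs(self, part_digests):
        # Join previously uploaded parts into one large object.
        data = b"".join(self._blobs[d] for d in part_digests)
        large_digest = self.digest(data)
        self._blobs[large_digest] = data
        return large_digest

    def split_blobs(self, large_digest, split_size):
        # Decompose a large object into parts of at most split_size bytes.
        data = self._blobs[large_digest]
        parts = [data[i:i + split_size]
                 for i in range(0, len(data), split_size)]
        self.batch_update_blobs(parts)
        return [self.digest(p) for p in parts]


def upload_large(cas, data, part_size):
    # Write path: upload small parts, then ask the CAS to concatenate them.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    cas.batch_update_blobs(parts)
    return cas.concatenate_blobs([cas.digest(p) for p in parts])


def download_large(cas, large_digest, part_size):
    # Read path: ask the CAS to split the object, then fetch the parts.
    part_digests = cas.split_blobs(large_digest, part_size)
    return b"".join(cas.batch_read_blobs(part_digests))
```

A client that only needs a byte range would call `split_blobs()` once and then fetch just the parts covering that range, rather than joining all of them as `download_large` does.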

Fixes: #178

@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch 3 times, most recently from ba20afd to a53b775 Compare November 7, 2022 15:18

@moroten moroten left a comment

This looks like a promising way to get this feature into v2 instead of waiting for v3. Great work!

Comment on lines 1714 to 1739
// The size of the non-trailing blobs to create. It must be less than
// the sizes stored in `digests`. Furthermore, for BLAKE3CONCAT it must
// be a power of 2.
int64 split_size_bytes = 4;

Is there a recommended size to use?

@EdSchouten EdSchouten Nov 7, 2022

It depends. If the client wants really granular access to the object with as much deduplication as possible, it can use blake3concat_min_split_size_bytes. If it simply wants to download the object in its entirety without caring too much about deduplication, it can use the highest power of two that does not exceed max_batch_total_size_bytes.
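
As a rough sketch of that rule of thumb: the helper names and the `want_fine_grained` flag below are invented for this example, and the sketch assumes the server advertises a positive batch size limit.

```python
def largest_power_of_two_at_most(limit: int) -> int:
    # Highest power of two that does not exceed limit.
    if limit < 1:
        raise ValueError("limit must be positive")
    return 1 << (limit.bit_length() - 1)


def choose_split_size(min_split_size_bytes: int,
                      max_batch_total_size_bytes: int,
                      want_fine_grained: bool) -> int:
    if want_fine_grained:
        # Granular access and maximum deduplication: smallest permitted size.
        return min_split_size_bytes
    # Bulk download: largest power of two that still fits in a batch.
    return largest_power_of_two_at_most(max_batch_total_size_bytes)
```

For instance, with a batch limit of 4,000,000 bytes the bulk choice would be 2,097,152 bytes (2^21), since 2^22 already exceeds the limit.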

Comment on lines 2026 to 2029
// The minimum size of blobs that can be created by calling
// [SplitBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.SplitBlobs]
// against a blob that uses the BLAKE3CONCAT digest function,
// disregarding the blob containing trailing data.
//
// If supported, this field MUST have value 2^k, where k > 10. It may
// not exceed `max_batch_total_size_bytes`.
int32 blake3concat_min_split_size_bytes = 9;

Will the server be able to have an efficient implementation if the API is called with different chunk sizes? Would there be an "optimal" value for a static chunk size to reduce the risk of inefficient client-server combinations?

@EdSchouten EdSchouten Nov 7, 2022

I think that in the case of Buildbarn I will just set blake3concat_max_upload_size_bytes == blake3concat_min_split_size_bytes. Then all clients/servers will be in full agreement on what the chunk size is.

The reason I kept them apart is that I tried to keep the upload & download paths separated. I can imagine that if someone operating a cluster discovers that they made a bad choice regarding chunk size, having these separate will make it easier to gradually migrate from one chunk size to the other if needed.

With regards to a one-size-fits-all optimal chunk size, I'm not sure whether such a value exists. I guess it really depends on bandwidth vs. latency. If you're bandwidth constrained, then smaller chunks are better. Conversely, if latency is high, it may be desirable to set the chunk size higher, so that you need to call into ConcatenateBlobs() and SplitBlobs() less frequently.
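
A back-of-the-envelope sketch of that trade-off, ignoring batching and per-request overhead; the function name and the sizes are illustrative assumptions only.

```python
def parts_needed(object_size_bytes: int, chunk_size_bytes: int) -> int:
    # Number of parts an object decomposes into (ceiling division).
    return -(-object_size_bytes // chunk_size_bytes)


one_gib = 1 << 30
for chunk_size in (2048, 1 << 21):
    # A random read fetches at least one whole chunk, so chunk_size bounds
    # the bytes transferred per read; the part count bounds the number of
    # digests that have to be exchanged to fetch the object in full.
    print(chunk_size, parts_needed(one_gib, chunk_size))
```

For a 1 GiB object, 2 KiB chunks give 524,288 parts while 2 MiB chunks give 512: the small chunks transfer less data per random read, the large ones need far fewer parts to be listed and requested.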

@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch 2 times, most recently from 677b4e8 to b8c052f Compare November 8, 2022 10:26
EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request Nov 17, 2022
In PR bazelbuild#233 I proposed the addition of two new ContentAddressableStorage
methods (ConcatenateBlobs and SplitBlobs) that allow one to gain random
access to large CAS objects, while still providing a way to verify data
integrity. As part of that change, I added a new digest function to help
with that, named BLAKE3CONCAT.

This PR adds just this digest function, without bringing in any support
for chunking. This will be done separately, as it was requested that
both these features land independently.

I have also included test vectors for the BLAKE3CONCAT digest function.
I have derived these by modifying the BLAKE3 reference implementation
written in Rust, and rerunning the tool that emits the official test
vectors:

https://github.com/BLAKE3-team/BLAKE3/blob/master/test_vectors/test_vectors.json

Furthermore, I have been able to validate the newly obtained test
vectors using a custom BLAKE3CONCAT implementation that I have written
in Go, which will become part of Buildbarn.
EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request Dec 1, 2022
In PR bazelbuild#233 I proposed the addition of two new ContentAddressableStorage
methods (ConcatenateBlobs and SplitBlobs) that allow one to gain random
access to large CAS objects, while still providing a way to verify data
integrity. As part of that change, I added a new digest function to help
with that, named SHA256TREE.

This PR adds just this digest function, without bringing in any support
for chunking. This will be done separately, as it was requested that
both these features land independently.

I have also included test vectors for the SHA256TREE digest function.
I have derived these by implementing three different versions in the Go
programming language:

- One version that uses regular arithmetic in Go.
- One version for x86-64 that uses AVX2.
- One version for ARM64 that uses the ARMv8 cryptography extensions.

All three versions behave identically.
@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch from b8c052f to 0f05b90 Compare December 26, 2022 17:08
@EdSchouten EdSchouten changed the title from "Add support for chunking of blobs, using a variant of BLAKE3" to "Add support for chunking of blobs, using SHA256TREE" Dec 26, 2022

EdSchouten commented Dec 26, 2022

As #235 and #236 are in my opinion close to a state in which they can be merged, I have gone ahead and reimplemented this PR on top of #235. Changes to the previous version are as follows:

  • As suggested by @EricBurnett + @bergsieker, I have moved the {Concatenate,Split}Blobs() capabilities into a separate message. In theory you could use it in combination with any digest function. The downside of using anything other than SHA256TREE is obviously that client/server-side validation of these requests is either impossible or prohibitively expensive.
  • Related to the above, {Concatenate,Split}Blobs() now take/return hashes of small objects, instead of BLAKE3-style chaining values.
  • FindMissingBlobsRequest now also has a new split_sizes_bytes field. This permits clients to more efficiently increase the lifetime of objects returned by SplitBlobs().

PTAL, ignoring the first commit in this PR. That one is part of #235.

@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch 2 times, most recently from 4c54573 to b659e33 Compare December 26, 2022 18:00
@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch from b659e33 to 990d387 Compare January 22, 2023 10:41
@EdSchouten EdSchouten force-pushed the eschouten/20221106-chunking branch from 990d387 to d921f6f Compare January 22, 2023 10:47
@tylerwilliams

I think this idea is neat! I'm excited about combining this with the idea of transmitting variable size chunks.

I'd like to propose a slight modification: instead of repeated string small_hashes = 2, if we used repeated Digest messages, we could allow for multiple digests of different sizes to be joined into a contiguous chunk. It's also a more natural fit -- small chunks that have been uploaded to the CAS already return a digest.

Separately, what do you think about returning a ConcatenateBlobsResponse from the ConcatenateBlobs call? Though the response matches BatchUpdateBlobsResponse today, it may not in the future, and it's much easier to change this now before clients are using it.

Finally, would it make sense to remove the chunk size in SplitBlobsRequest and let the server decide this value?

sstriker pushed a commit that referenced this pull request Mar 14, 2023
In PR #233 I proposed the addition of two new ContentAddressableStorage
methods (ConcatenateBlobs and SplitBlobs) that allow one to gain random
access to large CAS objects, while still providing a way to verify data
integrity. As part of that change, I added a new digest function to help
with that, named SHA256TREE.

This PR adds just this digest function, without bringing in any support
for chunking. This will be done separately, as it was requested that
both these features land independently.

I have also included test vectors for the SHA256TREE digest function.
I have derived these by implementing three different versions in the Go
programming language:

- One version that uses regular arithmetic in Go.
- One version for x86-64 that uses AVX2.
- One version for ARM64 that uses the ARMv8 cryptography extensions.

All three versions behave identically.
Development

Successfully merging this pull request may close these issues.

V3 idea: Drop support for offset reads & writes. Decompose large CAS objects into shallow Merkle trees