Add blob split and splice API #282
| Though this proposal may solve certain annoyances, it does not fix some other issues: 
 I would rather see us try to solve these issues as part of REv3, by working towards eliminating large CAS objects entirely. See the relevant section in the proposal I've been working on: | 
| This is quite similar to https://github.com/bazelbuild/remote-apis/pull/233/files. @roloffs have you had a chance to review that PR and the related discussions? I think both PRs approach this by adding a separate RPC, which is good and V2-compatible. | 
| @EdSchouten, thanks for sharing the link to your REv3 discussion document and your comments about this proposal. Sorry for not being aware of this document. After going through it, I agree with you that this API extension would not make much sense in REv3, given your proposal in the "Elimination of large CAS objects" section. However, as @sluongng also stated, since this extension is conservative and backwards compatible with REv2, and the release of REv3 is very uncertain right now, it would not harm people who do not use it, but would already provide advantages for people who do, and could also yield insights for your content-defined chunking ideas for REv3, since we also used such an algorithm to split blobs. I also agree with your concern that uploading large files is not covered by this proposal, even though there are relevant use cases for it. However, I can think of a symmetric SpliceBlob rpc that allows splitting a large blob on the client side, uploading only those parts of the blob that are missing on the server side, and then splicing them there. This could be added in this PR as well. @sluongng, thanks for pointing out this PR. Despite the fact that they look very similar, they actually target complementary goals. Let me explain why. While the PR from @EdSchouten introduces rpcs to split and combine blobs, its goal is not to save traffic but to introduce a blob-splitting scheme that makes it possible to verify the integrity of a blob by validating the digests of its chunks without reading the whole chunk data. In order to achieve this, he introduced a new digest function, SHA256TREE, which allows recursive digest calculation. I hope I did not completely misunderstand your intention, @EdSchouten. In contrast, the splitting scheme presented here targets as much reuse as possible, with the final goal of reducing traffic between client and server. E.g., if a large binary in the remote CAS was modified only slightly and you want to use it locally, you would have to download it completely. Using the presented extension, only the binary differences between the two versions, as determined by content-defined chunking, have to be downloaded, which is typically much less than the whole data. As I said, the two splitting schemes are actually complementary and pursue different goals. | 
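To make the reuse argument above concrete, here is a minimal sketch of the split-based download flow. All helper names (`cas`, `local_store`, `split_blob`, `read_blob`) are assumptions for illustration, not the actual RPC or API names from this PR.

```python
# Sketch of downloading a large, slightly modified blob via blob splitting.
# The cas/local_store helpers are hypothetical client-side abstractions.

def fetch_large_blob(cas, local_store, blob_digest):
    # The server splits the blob and returns an ordered list of chunk digests;
    # concatenating the chunks in this order yields the original blob.
    chunk_digests = cas.split_blob(blob_digest)

    # Only chunks that were not seen in earlier fetches are transferred.
    missing = [d for d in chunk_digests if not local_store.contains(d)]
    for digest in missing:
        local_store.put(digest, cas.read_blob(digest))

    # Assemble the requested blob locally from the (mostly reused) chunks.
    return b"".join(local_store.get(d) for d in chunk_digests)
```

For a slightly modified large binary, `missing` would typically cover only the chunks around the changed regions, which is where the traffic savings come from.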
| I think what's missing in this PR is a specification of how the splitting algorithm would look, and the ability to choose different algorithms for the job. In #233, the chunking algorithm was mixed with the Digest algorithm, which I think is a good start as it's customizable. But I can definitely see cases where the Digest algorithm and the chunking algorithm are separated for different combinations (i.e. Reed-Solomon + BLAKE3, FastCDC + SHA256, delta compression + GITSHA1, etc.). And each combination could serve different purposes (deduplication, download parallelization, etc.). It would be nice if you could provide a bit more detail regarding your splitting algorithm of choice as an option here. | 
| While the actual choice of the splitting algorithm is mainly an implementation detail of the remote-execution endpoint (which of course affects the quality of the split result), the essential property of a server is to provide a guarantee to the client whenever it successfully answers a SplitBlob request: concatenating the returned chunks in the given order must reproduce exactly the original blob. 
 Besides this guarantee, in order to increase the reuse factor between different versions of a blob as much as possible, it makes sense to implement a content-defined chunking algorithm. Such algorithms typically result in chunks of variable size and are insensitive to the data-shifting problem of fixed-size chunking. They rely on a rolling-hash function to efficiently compute hash values of consecutive bytes at every byte position in the data stream in order to determine the chunk boundaries. Popular approaches to content-defined chunking include the classic Rabin fingerprint and more recent rolling-hash algorithms such as FastCDC. 
 I have selected FastCDC as the chunking algorithm for the endpoint implementation in our build system, since it has been proven to be very compute-efficient and faster than other rolling-hash algorithms while achieving deduplication ratios similar to the Rabin fingerprint. We have already observed reuse factors of 96-98% for small changes to big file-system images (around 800 MB), and of 75% for a 300 MB executable with debug information. You may want to have a look at our internal design document for more information about this blob-splitting API extension. | 
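As an illustration of the rolling-hash idea (a simplified gear-hash chunker, not the exact FastCDC algorithm used in justbuild, which additionally uses normalized chunking with two masks), a sketch with the 2 KB minimum and roughly 8 KB average chunk sizes quoted above could look like this:

```python
# Minimal, illustrative content-defined chunker based on a gear rolling hash.
# Parameters mirror the values discussed in this thread; the gear table and
# the exact boundary condition are assumptions for illustration only.
import random

_rng = random.Random(42)  # implementer-chosen seed; see the note on gear tables below
_GEAR = [_rng.getrandbits(64) for _ in range(256)]

MIN_SIZE = 2 * 1024    # minimum chunk size: 2 KB
AVG_SIZE = 8 * 1024    # target average chunk size: ~8 KB
MAX_SIZE = 64 * 1024   # hard upper bound so chunks stay finite
MASK = AVG_SIZE - 1    # ~1/8192 boundary probability for a random hash value

def chunk_boundaries(data: bytes):
    """Yield end offsets of content-defined chunks in `data`."""
    start = 0
    while start < len(data):
        h = 0
        end = min(start + MAX_SIZE, len(data))
        cut = end
        for i in range(start, end):
            h = ((h << 1) + _GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
            # Only consider a boundary once the minimum chunk size is reached.
            if i - start + 1 >= MIN_SIZE and (h & MASK) == 0:
                cut = i + 1
                break
        yield cut
        start = cut
```

Because boundaries depend only on the local byte content, an insertion early in the file shifts at most the chunks around the edit, and all later chunks keep their digests, which is what makes the high reuse factors possible.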
| Ah, I think I have realized what's missing here. Your design seems to focus on splitting the blob on the server side for the client to download large blobs, while I was thinking that blob splitting could happen on both the client side and the server side. For example: a game designer may work on some graphic assets, say a really large picture. Subsequent versions of the picture may get chunked on the client side. Then the client can compare the chunk list with the chunks that are already available on the server side, and only upload the parts that are missing. So in the case where both client and server have to split big blobs for efficient download AND upload, it's beneficial for the two sides to agree upon how to split (and put back together) big blobs. | 
| Yes, you are right, this design currently focuses on splitting on the server side and downloading large blobs, but as mentioned in a comment above, I am willing to extend this design proposal by a symmetric SpliceBlob rpc for uploads. Maybe it is worth mentioning that in this case it is not strictly required for client and server to agree upon the same splitting algorithm, since after the first round-trip overhead, the chunking algorithm for each direction ensures efficient reuse anyway. I will update this proposal to handle uploads for you to review. Thank you very much for your interest and the nice suggestions. | 
| Do keep in mind that there could be mixed usage of clients (a) with chunking support and clients (b) without chunking support. So I do believe a negotiation via the initial GetCapabilities RPC, similar to the current Digest and Compressor negotiation, is very desirable, as the server would need to know how to put a split blob upload from (a) back together to serve it to (b). I would recommend throwing the design ideas into #178. It's not yet settled whether chunking support needs to be a V3-exclusive feature or whether we could do it as part of V2. Discussion to help nudge the issue forward would be much appreciated. | 
| @sluongng I have updated the PR with a sharper description of what is meant by this blob-splitting approach and what its goal is, as well as a proposal for the chunked upload of large blobs. Some thoughts about your hints regarding the capabilities negotiation between client and server: 
 This means each side is responsible for its own chunking approach without the other side having to know about it. The other side just needs to be able to concatenate the chunks. Furthermore, it would be difficult to agree on, e.g., the same FastCDC algorithm, since this algorithm internally depends on an array of 256 random numbers (generated by the implementer) and thus could result in completely different chunk boundaries for two different implementations, preventing any reuse between the chunks on the server and the client. I will also put a summary of this blob splitting and splicing concept into #178. It would be nice if this concept could find its way into REv2, since it is just an extension that is free to use and not an invasive modification. | 
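A tiny experiment illustrates the point about the gear table: the same input split with two differently seeded 256-entry tables yields different chunk boundaries, so two independent FastCDC implementations only produce matching chunks if they share the exact table. Python's `random` module stands in here for the implementer-chosen random numbers.

```python
# Two independently generated gear tables almost certainly produce different
# content-defined chunk boundaries on the same data (illustrative sketch).
import random

def gear_table(seed):
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]

def boundaries(data, gear, mask=0x1FFF, min_size=2048):
    cuts, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + gear[b]) & 0xFFFFFFFFFFFFFFFF
        if i - start + 1 >= min_size and (h & mask) == 0:
            cuts.append(i + 1)
            start, h = i + 1, 0
    return cuts

data = random.Random(0).randbytes(1 << 20)   # 1 MiB of fixed pseudo-random data
print(boundaries(data, gear_table(1))[:5])
print(boundaries(data, gear_table(2))[:5])   # almost certainly different cut points
```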
| Do give #272 and my draft PR a read on how client/server could negotiate for a spec leveraging GetCapabilities rpc. Could be useful if you want to have a consistent splitting scheme between client and server. | 
| Hello @sluongng, I have updated the proposal to use the capabilities service as you proposed. It is now possible for a client to determine the chunking algorithms supported by the server and to select one in a SplitBlob request. By this means, the client can select one that it also uses locally, so that both communication directions benefit from the chunking data available on each side. Furthermore, I have added some comments about the lifetime of chunks. Thanks for your time reviewing this PR! | 
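A sketch of how a client might use such a capabilities-based negotiation; the field and method names below are illustrative placeholders, not the exact names from the proposal.

```python
# Hypothetical negotiation flow: ask the server which chunking algorithms it
# supports, prefer one the client also implements locally, otherwise let the
# server pick its default. All names are placeholders, not real proto fields.

LOCALLY_SUPPORTED = ["FASTCDC"]  # algorithms this client can also run itself

def pick_chunking_algorithm(capabilities):
    advertised = set(capabilities.supported_chunking_algorithms)
    for algorithm in LOCALLY_SUPPORTED:
        if algorithm in advertised:
            return algorithm
    return "DEFAULT"  # let the server use its default implementation

def split_with_negotiated_algorithm(cas, blob_digest):
    algorithm = pick_chunking_algorithm(cas.get_capabilities())
    return cas.split_blob(blob_digest, chunking_algorithm=algorithm)
```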
Depending on the software project, large binary artifacts may need to be downloaded from or uploaded to the remote CAS. Examples are executables with debug information, comprehensive libraries, or even whole file-system images. Such artifacts generate a lot of traffic when downloaded or uploaded. The blob split API allows splitting such artifacts into chunks on the remote side, fetching only those parts that are missing locally, and finally assembling the requested blob locally from its chunks. The blob splice API allows splitting such artifacts into chunks locally, uploading only those parts that are missing remotely, and finally splicing the requested blob remotely from its chunks. Since only the binary differences from the last download/upload are fetched/uploaded, the blob split and splice API can significantly reduce network traffic between server and client.
| Hello all, after spending quite some time working on this proposal and its implementation, I have finished incorporating all suggestions made by the reviewers and those that came up during the working-group meeting. Finally, the following high-level features would be added to the REv2 protocol: 
 This whole proposal is fully implemented in our own remote-execution implementation in justbuild and used by the just client. From my side, this proposal is finished and ready for final review. What I would like to know from you is what now needs to be done for this proposal to finally get merged into main. I can also summarize it again at the next working-group meeting, and would ideally like a decision on how to proceed with this proposal. Thank you very much for your efforts. | 
Do we already effectively have the ability to splice blobs by using the bytestream API with read_offset and read_limit?
| // The digest of the blob to be split. |
| Digest blob_digest = 2; |
| |
| // The chunking algorithm to be used. Must be IDENTITY (no chunking) or one of |
Should we instead reject IDENTITY as an invalid argument? I imagine this would only be used by broken clients?
Not sure about that; I have basically copied the pattern from PR #276 to include a sane default value. I leave that open to your decision, and I have no objections to changing this.
I have updated this field from IDENTITY to DEFAULT because, as you mentioned, IDENTITY does not really make sense to be requested by a client. Instead, to provide a proper default value for the chunking-algorithm enum, I have introduced DEFAULT, which means the client does not care which exact chunking algorithm is used by the server and the server should just use its default implementation. If a client wants to negotiate the chunking algorithm more explicitly, it should specify one of the other enum values that are supported and advertised by the server.
I hope this resolves your concerns, @mostynb?
| @mostynb, as far as I have understood the protocol, no. While the bytestream API with read_offset and read_limit allows you to partially read the content of a blob, it does not allow you to create a new blob from a batch of other blobs (its chunks) in the remote CAS. The goal of blob splicing is that, if a client regularly uploads slightly different versions of a large object to the remote CAS, only the binary differences between the versions need to be uploaded, not the entire block of binary data every time. To achieve this, the client splits the large object into reusable chunks (typically by content-defined chunking) and uploads only those chunks (handled as blobs) that are missing in the remote CAS, which is normally most of them when uploading for the first time. If the client needs to upload this large object again, but a slightly different version of it (meaning only a fraction of the binary data has changed), it again splits the object into chunks and checks which chunks are missing in the remote CAS. Normally, content-defined chunking splits the unchanged binary data into the same set of chunks; different chunks are created only where binary differences occur. This means only a fraction of the whole set of chunks needs to be uploaded to the remote CAS in order to reconstruct the second version of the large object there. The actual reconstruction of a large blob on the remote side is done using the splice command with a description of which chunks need to be concatenated (a list of chunk digests available in the remote CAS). The split operation works exactly the other way around, when you regularly download an ever-changing large object from the remote CAS. In that case, the server splits the large object into chunks, and the client fetches only the locally missing chunks and reconstructs the large object from the locally available chunks. Finally, to exploit chunking in both directions at the same time, it makes sense for the client and the server to agree on a chunking algorithm, so that chunks created on either side can be reused. For this, we added a negotiation mechanism for the chunking algorithm. | 
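A minimal sketch of the splice-based upload direction described above, reusing a content-defined chunker such as the one sketched earlier. The CAS helper names (`digest`, `find_missing_blobs`, `upload_blob`, `splice_blob`) are assumptions for illustration, not the exact RPC names from this proposal.

```python
# Sketch of uploading a large, slightly modified blob via blob splicing.
# `chunk_boundaries` is any content-defined chunker; the cas helpers are
# hypothetical stand-ins for the corresponding remote CAS operations.

def upload_large_blob(cas, data: bytes, chunk_boundaries):
    # Split locally with a content-defined chunking algorithm.
    chunks, start = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[start:end])
        start = end

    digests = [cas.digest(c) for c in chunks]
    by_digest = dict(zip(digests, chunks))

    # Unchanged regions reproduce the same chunks, so only the chunks covering
    # the binary differences are reported as missing and actually uploaded.
    for digest in cas.find_missing_blobs(digests):
        cas.upload_blob(digest, by_digest[digest])

    # Ask the server to splice (concatenate) the chunks into the original blob.
    return cas.splice_blob(blob_digest=cas.digest(data), chunk_digests=digests)
```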
| // (Algorithm 2, FastCDC8KB). The algorithm is configured to have the |
| // following properties on resulting chunk sizes. |
| // - Minimum chunk size: 2 KB |
| // - Average chunk size: 8 KB |
I understand that using small chunk sizes, such as 8 KB, can increase the likelihood of deduplication and may also reduce the risk of disk-storage fragmentation. However, have you considered the potential performance overhead of having too many fine-grained CAS blobs?
I can envision that a feature like this could also be beneficial for distributing the load more evenly across multiple CAS shards. But for such use cases, it might make sense to use much larger chunks, perhaps 8 MB? Should we somehow also accommodate larger chunks in this PR?
We experimented with FastCDC on approximately 5TB of real Bazel data from many different codebases and found that 0.5MB is a good trade-off between space savings and metadata overhead. Too small of a chunk size means the metadata for all chunks becomes very large / numerous, while too large of a chunk size means poor space savings.
Thanks @luluz66, I think 0.5 MB is more reasonable than 8 KB.
Do you think there is one value that would fit all, or should a size like this be allowed to be tuned in a flexible way?
Thank you @luluz66 for your experiments. Indeed, we did not evaluate storage-consumption trade-offs, since we were mainly interested in traffic reduction. I think 500 KB is a sane default value for the average chunk size.
0.5 MB was the ideal range for us based on the Bazel-specific data set we were testing against. However, there is no telling whether that number would be different for a different client/server pair or a different data set.
So I think we would want a discovery mechanism for the FastCDC configuration on the server side. The client should follow the server's advertised setting in order to achieve the best result. WDYT?
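For a rough feel for the metadata overhead behind this chunk-size discussion, here is a back-of-the-envelope calculation using the 800 MB image mentioned earlier and assuming on the order of 70 bytes of bookkeeping per chunk digest (an assumption for illustration, not a measured figure):

```python
# Rough chunk-count and metadata estimates for one 800 MB blob at different
# average chunk sizes; 70 bytes per chunk digest is an assumed ballpark.

BLOB_SIZE = 800 * 1024 * 1024
PER_CHUNK_METADATA = 70  # assumed bytes of bookkeeping per chunk digest

for avg_chunk in (8 * 1024, 512 * 1024, 8 * 1024 * 1024):
    n_chunks = BLOB_SIZE // avg_chunk
    overhead = n_chunks * PER_CHUNK_METADATA
    print(f"avg {avg_chunk // 1024:>5} KB -> ~{n_chunks:>7} chunks, "
          f"~{overhead / (1024 * 1024):.1f} MB of chunk metadata")

# avg     8 KB -> ~ 102400 chunks, ~6.8 MB of chunk metadata
# avg   512 KB -> ~   1600 chunks, ~0.1 MB of chunk metadata
# avg  8192 KB -> ~    100 chunks, ~0.0 MB of chunk metadata
```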
| Hi @sluongng, I have updated this PR, removed all the negotiation parts, and updated the comments accordingly. Could you have another look at whether this fits your vision of the split/splice API without negotiation, also from a description point of view? Thank you very much. | 
| After our dedicated meeting to discuss the different large-blobs proposals, we agreed that we would like to have both proposals in the REv2 protocol, because they have different pros and cons. Following that agreement, I have removed all negotiation parts from the split/splice proposal, and it is ready for a final review. A proper negotiation proposal for performance improvement will be a follow-up PR. It would be nice if the key players could give their final remarks or approval for this PR, ideally before our next working-group meeting. Thank you very much. | 
| @sluongng Does this pull request now reflect what we agreed upon during the dedicated large-blob meeting (so that we can merge it), or do you think further changes are necessary? | 
LGTM overall, just a few comments.
I also think it would be worthwhile to specify a negotiation procedure for the chunking algorithm before usage becomes too widespread and it gets harder to do; but it's fine to start without it.
Thanks for the review, I have updated the pull request according to your suggestions. Regarding the negotiation procedure, I would propose handling it in a follow-up PR since, as you mentioned, it is not required for this PR, and this allows a separate concern to be discussed comprehensively in a separate PR. I am awaiting your final decision on how to move on with this PR.
This new rpc allows clients to upload a large blob in chunks (potentially in parallel), and then ask the server to join those chunks into a new large blob. The SplitBlob rpc is not yet supported; I am waiting for the details of a common chunking algorithm to be decided. bazelbuild/remote-apis#282
This is a proposal for a conservative extension of the ContentAddressableStorage service, which allows reducing traffic when blobs are fetched from the remote CAS to the host for local usage or inspection. With this extension, it is possible to ask a remote-execution endpoint to split a specified blob into chunks of a certain average size. These chunks are then stored in the CAS as blobs, and the ordered list of chunk digests is returned. The client can then check which blob chunks are available locally from earlier fetches and fetch only the missing chunks. Using the digest list, the client can splice the requested blob from the locally available chunk data. This extension could especially help to reduce traffic when large binary files are created on the remote side and needed locally, such as executables with debug information, comprehensive libraries, or even whole file-system images. It is a conservative extension, so no client is forced to use it. In our build-system project justbuild, we have implemented this protocol extension for both the server and the client side.