`b2 upload-file` does not calculate SHA1 automatically for large files if it's not been provided by `--sha1` #539
Comments
SHA1 is calculated by the b2 CLI by default. You may bypass this behavior and provide the checksum yourself, but it is not possible to have no checksum at all. Checksums of the fragments and the checksum of the whole file are roughly equivalent; you can use either one to ensure data integrity.
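For illustration, here is a minimal sketch (plain `hashlib`, no B2-specific code) of the two kinds of checksums the comment above refers to: per-fragment SHA1s computed over fixed-size parts, and a single SHA1 over the whole file. Either set is enough to detect corruption; the part size below is just an example value.

```python
import hashlib

PART_SIZE = 100 * 1024 * 1024  # example part size, not a B2 requirement

def whole_and_part_sha1s(path, part_size=PART_SIZE):
    """Compute the whole-file SHA1 and per-part SHA1s in a single pass."""
    whole = hashlib.sha1()
    parts = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            whole.update(chunk)                          # whole-file digest
            parts.append(hashlib.sha1(chunk).hexdigest())  # per-part digest
    return whole.hexdigest(), parts
```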
It doesn't seem to calculate SHA1 for large files. After I uploaded a large file without `--sha1`, no whole-file SHA1 was stored with it.
The Backblaze APIs require that the metadata for a large file be set when you call `b2_start_large_file`. Also, the Backblaze APIs hide the part boundaries once the file has been uploaded. (The S3 APIs do the same thing.) This allows the system to restructure the file on the back end as needed. It would make sense, though, to extend B2 to store the uploaded part sizes and checksums for integrity checking on download. Adding an option to this command-line tool to always compute the SHA1 for the entire file, and set `large_file_sha1` in the file info, would address this. @eonil - Would you be interested in making that change?
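A hedged sketch of what that option could look like: compute the whole-file SHA1 up front and attach it under the `large_file_sha1` file-info key (the convention Backblaze documents for carrying a whole-file SHA1 on large files). It has to be supplied when the large file is started, since the metadata cannot be changed afterwards. The `api.start_large_file` call below is a hypothetical stand-in for whatever SDK wrapper is in use, not the CLI's actual internals.

```python
import hashlib

def file_sha1(path, bufsize=1024 * 1024):
    """Stream the file once to get its whole-file SHA1."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def start_large_file_with_sha1(api, bucket_id, path, file_name):
    # 'large_file_sha1' is the documented file-info key for a whole-file
    # SHA1 on B2 large files; 'api.start_large_file' is a hypothetical
    # wrapper around the b2_start_large_file API call.
    file_info = {"large_file_sha1": file_sha1(path)}
    return api.start_large_file(bucket_id, file_name, "b2/x-auto", file_info)
```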
I am a user, and I am more annoyed that your tool does not try to provide full-cycle integrity (local -> remote -> another local) by default. I don't understand why you think people are going to get annoyed by a "slow and safe" default when they can override it to "fast but unsafe" with an explicit option. IMO, the default should reflect company philosophy, and options should reflect users' demands. The current default of `b2 upload-file` does not.

Today I tried another upload with two 1GB files, and I discovered the same behavior. I'm not interested in patching this codebase or Python. I am going to write my own uploader. Thank you for the suggestion. I hope the issues are at the tool level rather than the API level.

By the way, I really hope Backblaze will provide segment-based checksums. If you de-couple the segment sizes used for hashing and uploading, hash computation won't be duplicated. Since the server verifies the checksum of each segment, this eliminates the potential for the server to have a wrong checksum. Dropbox's "checksum of checksums" method could also be considered, though I am not sure how safe it is.
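For reference, the "checksum of checksums" idea is straightforward to sketch: hash each segment, then hash the concatenation of the segment digests to get a single top-level value. (Dropbox's actual `content_hash` works this way using SHA-256 over 4 MB blocks; the sketch below keeps SHA1 to match the rest of this thread.)

```python
import hashlib

SEGMENT_SIZE = 4 * 1024 * 1024  # 4 MB segments, as in Dropbox's scheme

def checksum_of_checksums(path, segment_size=SEGMENT_SIZE):
    """Hash each segment, then hash the concatenated segment digests."""
    top = hashlib.sha1()
    segment_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(segment_size):
            d = hashlib.sha1(chunk).digest()
            segment_digests.append(d.hex())
            top.update(d)  # top-level hash covers the raw segment digests
    # The top-level digest changes if any segment changes, so it can stand
    # in for a whole-file checksum while still allowing per-segment checks.
    return top.hexdigest(), segment_digests
```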
It's a fair point. For small files local -> remote -> local is fully checked, but for large files only local -> remote is checked. That is indeed the default behavior, and it is also the only behavior the CLI currently supports. We still have the TCP checksum, and due to the use of HTTPS a transmission error would probably trigger a decryption fault; if we assume that works, SHA1 checksums would not be needed at any point. There are, however, a couple of problems with using a whole-file SHA1 checksum for download integrity verification.
However, the B2 backend stores the part checksums (I hope!). If those were exposed, we could verify the integrity of the file upon download. Moreover, in such a scenario the download process could be (somewhat) optimized to parallelize hashing and downloading, avoiding an additional read. A fully optimized transferer implementation, which sets the download chunk size to (a 1/N fraction of) the server-side chunk size, would also be possible and could be offered as an option; that strategy improves performance but potentially consumes slightly more transaction tokens than the non-optimized behavior.

B2 should retain the ability to restructure the file internally: the chunk size, the number of chunks, and the respective checksums could change one day, but that would not really impact checksum verification as long as the client receives a consistent snapshot of the checksums at all times (even during the restructuring process).
Yup. And the downloader could also verify data integrity incrementally. This is important for resource-constrained platforms like mobile apps, because hashing multiple gigabytes takes a long time and is more likely to be interrupted. AFAIK, it's not easy to serialize hasher state, so I'd like to hash segments smaller than 64MB. If you care about mobile devices, this is an important factor.

Also, Backblaze wouldn't have to recalculate checksums as long as they keep the original segment checksums, even if the segments are restructured. I hope they keep the hash segment sizes smaller than 64MB even if they restructure the segments, because, as I said above, dealing with "big" things on mobile is really painful; bigger means more pain. With the current big-file behavior, the best way to write a stable and reliable mobile app is to ignore B2's big-file system and handle segmentation completely on the client side.
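As a sketch of the incremental verification described here, assuming the server exposed a list of per-segment SHA1s (hypothetical; B2 does not expose them today), a downloader could verify each segment as it arrives and checkpoint progress at segment boundaries instead of hashing multiple gigabytes in one interruptible pass:

```python
import hashlib

def verify_stream(chunks, expected_segment_sha1s, segment_size):
    """Verify per-segment SHA1s incrementally while consuming a download.

    'chunks' is any iterable of byte chunks (e.g. an HTTP response body);
    'expected_segment_sha1s' is the hypothetical server-provided list.
    Only one segment's hasher state is live at a time, so an interrupted
    download can resume from the last fully verified segment boundary.
    """
    hasher = hashlib.sha1()
    filled = 0   # bytes hashed in the current segment
    index = 0    # current segment number
    for chunk in chunks:
        while chunk:
            take = min(len(chunk), segment_size - filled)
            hasher.update(chunk[:take])
            chunk = chunk[take:]
            filled += take
            if filled == segment_size:  # segment boundary reached
                if hasher.hexdigest() != expected_segment_sha1s[index]:
                    raise ValueError(f"segment {index} corrupted")
                index, filled, hasher = index + 1, 0, hashlib.sha1()
    if filled:  # final, possibly short, segment
        if hasher.hexdigest() != expected_segment_sha1s[index]:
            raise ValueError(f"segment {index} corrupted")
```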
I have spent an awful lot of time trying to upload files to B2 using client-side AJAX requests (vue-dropzone.js), and even though I supplied the file's valid SHA1 checksum, the B2 server still responds with "checksum did not match data received" with status code 400. I've checked and rechecked the checksums with all the tools I have, and I'm still not able to trace the source of the error. It's as if something happens to the file while it's in transit. I've uploaded the same files using the command-line tool and it works fine, but when I upload via AJAX using the exact same SHA1 checksum it doesn't work. My questions are:
Please inspect my AJAX code and see if I got anything wrong:
Hey, you've posted a comment on a B2 CLI issue, which is written in Python, but you posted JavaScript code. I can't really help you much. What I would suggest is to upload an empty file and inspect the communication between the browser and the server using the browser's F12 network tab. If you make your request identical to what the b2 CLI sends, it is guaranteed to work (the server only knows what you tell it, so it doesn't know what it is speaking with).
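A common cause of the 400 "checksum did not match data received" error is hashing a different byte stream than the one actually transmitted (for example, sending a multipart-encoded or re-encoded body while hashing the raw file). As a minimal sketch, assuming you already have an upload URL and auth token from a `b2_get_upload_url` call, the native `b2_upload_file` endpoint expects `X-Bz-Content-Sha1` to be the hex SHA1 of exactly the request body bytes:

```python
import hashlib
from urllib.parse import quote

import requests

def upload_small_file(upload_url, upload_auth_token, file_name, data: bytes):
    """Upload raw bytes via the native b2_upload_file endpoint.

    X-Bz-Content-Sha1 must be the hex SHA1 of exactly the bytes sent
    as the request body: no multipart wrapper, no re-encoding.
    """
    sha1 = hashlib.sha1(data).hexdigest()
    headers = {
        "Authorization": upload_auth_token,
        "X-Bz-File-Name": quote(file_name),  # file names are URL-encoded
        "Content-Type": "b2/x-auto",         # let B2 guess the content type
        "Content-Length": str(len(data)),
        "X-Bz-Content-Sha1": sha1,
    }
    resp = requests.post(upload_url, headers=headers, data=data)
    resp.raise_for_status()
    return resp.json()
```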
I know this issue has been reported and declined multiple times for various reasons. But I'd like to give more reasons why I need this -- auto-calculation of SHA1 by default.
I think one important thing is missing in older threads.
We upload files to use them later. Having no error on upload is only half the work, and not really helpful for achieving integrity when I later download the file. The integrity scenario should include the state of the file after it has been downloaded.
After downloading, the only way to verify the integrity of a downloaded file is its SHA1 hash. A missing SHA1 means there is no way to verify whether the downloaded file is fine or damaged. (Please correct me if I'm wrong!) Therefore, I think Backblaze should require SHA1 for all uploaded files. A missing SHA1 should be treated as an incomplete upload. (Now I've started to worry about how the Backblaze personal/business backup products deal with the integrity of recovered backups.)
IMO, with that in mind, the `--sha1` option should become an overriding switch rather than an optional attachment. The `b2` command should calculate SHA1 automatically for users who want "full integrity", unless a `--sha1` override has been provided. If the `b2` command line cannot accept a behavioral modification, I think you can provide an extra switch like `--autocalc-sha1`.

Or Backblaze can provide access to the SHA1 hashes of the original uploaded segments (if they exist...). I don't know how files are stored on the server side, but if Backblaze keeps the SHA1 hashes of the original segments and the range of each segment, it's easy to verify on the client side. If this is possible, everything is done simply and beautifully; nothing more is really required.