utils/archive: Add support for detecting archives without extensions #6132
base: master
Conversation
This patch enhances the archive module to detect and extract archive files without proper extensions. Previously, while is_archive() could correctly identify archive files by examining their content, the ArchiveFile class (used by extract()) relied solely on file extensions.

The implementation now:
- Adds content-based detection to ArchiveFile for tar, zip, and compressed archives
- Adds a new is_bzip2_file() function to detect bzip2 files
- Improves error handling in the uncompress function
- Adds comprehensive unit tests for archives with and without extensions

This fixes the issue where extract() would fail with a "file is not an archive" error when trying to extract an archive without a proper extension, even though is_archive() correctly identified it.

Reference: avocado-framework#5997
Signed-off-by: Harvey Lynden <[email protected]>
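For context, the kind of content-based fallback described above can be sketched with nothing but the standard library. The helper name guess_archive_handler below is made up for illustration and is not the function the patch adds:

import tarfile
import zipfile


def guess_archive_handler(path):
    """Illustrative content probe: classify a file by its bytes, not its name."""
    if zipfile.is_zipfile(path):
        return "zip"
    if tarfile.is_tarfile(path):
        return "tar"
    with open(path, "rb") as probe:
        head = probe.read(6)
    if head.startswith(b"\x1f\x8b"):      # gzip magic
        return "gzip"
    if head.startswith(b"BZh"):           # bzip2 magic
        return "bzip2"
    if head.startswith(b"\xfd7zXZ\x00"):  # xz magic
        return "xz"
    return None

According to the description, the patch wires this kind of probing into ArchiveFile so that extract() no longer depends on the extension alone.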
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6132      +/-   ##
==========================================
- Coverage   68.89%   68.73%   -0.16%
==========================================
  Files         203      203
  Lines       22019    22087      +68
==========================================
+ Hits        15170    15182      +12
- Misses       6849     6905      +56

☔ View full report in Codecov by Sentry.
Hi @harvey0100,
Thanks for this! IMO, the extra bzip2 support should come as a separate commit. Also, there are a number of comments below that are more about the general code approach and/or refactoring opportunities. Let me know what you think of them.
self.tmpdir = tempfile.TemporaryDirectory(prefix="avocado_" + __name__)
self.base_dir = self.tmpdir.name

# Create a simple text file to archive
I don't think creating the files and the archives at run time is the best choice, given that:
- It consumes extra time/resources
- It's error prone, in the sense that a bug in a system tar or the like would impact the outcome of a test
- There's already the precedent of shipping "golden" archives in selftests/.data (see the sketch after this list)
- It adds extra dependencies at test run time (like tar itself)
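To illustrate the golden-archive point, here is a rough sketch, assuming archive.extract(path, destination) keeps its current interface; the fixture path under selftests/.data is hypothetical and such a file would have to be added to the tree:

import os
import tempfile
import unittest

from avocado.utils import archive

# Hypothetical pre-built ("golden") archive shipped with the selftests;
# no file with this name exists today.
BASEDIR = os.path.dirname(os.path.abspath(__file__))
GOLDEN_NO_EXTENSION = os.path.join(BASEDIR, ".data", "archive_without_extension")


class GoldenArchiveTest(unittest.TestCase):
    def test_extract_without_extension(self):
        # No tar/gzip binaries and no archive creation at run time:
        # only the extraction path under test is exercised.
        with tempfile.TemporaryDirectory(prefix="avocado_") as dest:
            archive.extract(GOLDEN_NO_EXTENSION, dest)
            self.assertTrue(os.listdir(dest), "Nothing was extracted")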
with archive.ArchiveFile.open(zip_file) as arch:
    self.assertIsNotNone(arch)
    files = arch._engine.namelist()
    self.assertTrue(len(files) > 0, "Archive should contain files")
One idea, not necessary for this patch, but anyway: what if we have some supporting utility code that reads a metadata file associated with an archive? For instance, for the existing selftests/.data/avocado.gz, we could have a selftests/.data/avocado.gz.metadata containing:

{"members": [["avocado", "f1d2d2f924e986ac86fdf7b36c94bcdf32beec15"]]}

The point is to lean towards the idea of shipping "golden" archive files that document themselves and, in theory, could be tested "automatically" to a great extent.
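A minimal sketch of such a supporting helper, assuming the side-car naming and the {"members": [[name, sha1], ...]} layout proposed above; nothing like this exists in the tree yet:

import gzip
import hashlib
import json


def verify_golden_gz(archive_path):
    """Hypothetical check of a golden .gz file against <archive>.metadata."""
    with open(archive_path + ".metadata", "r", encoding="utf-8") as meta:
        members = json.load(meta)["members"]
    # A .gz file holds a single payload, so one (name, sha1) pair is expected.
    _, expected_sha1 = members[0]
    with gzip.open(archive_path, "rb") as payload:
        return hashlib.sha1(payload.read()).hexdigest() == expected_sha1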
# Check for TAR file
elif tarfile.is_tarfile(filename):
    # Detect compression method for tar files
    with open(filename, "rb") as f:
Do you think the logic here could be transformed into a data structure, or even added to the existing _extension_table dict (or a modified version of it)?
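For the sake of discussion, the tar branch could become table-driven along these lines; the table name and layout are an assumption, not the existing _extension_table:

# Hypothetical content-signature table: magic bytes -> tarfile open mode.
_TAR_MAGIC_MODES = (
    (b"\x1f\x8b", "r:gz"),          # gzip-compressed tar
    (b"BZh", "r:bz2"),              # bzip2-compressed tar
    (b"\xfd7zXZ\x00", "r:xz"),      # xz-compressed tar
)


def _tar_open_mode(filename):
    """Pick a tarfile open mode from the file's leading bytes ("r" if uncompressed)."""
    with open(filename, "rb") as probe:
        head = probe.read(8)
    for magic, mode in _TAR_MAGIC_MODES:
        if head.startswith(magic):
            return mode
    return "r"

The same pairs could, in principle, be merged into _extension_table so that extension-based and content-based lookups share one source of truth.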
try:
    with open(path, "rb") as bz2_file:
        # Check for bzip2 magic bytes (BZh)
        return bz2_file.read(3) == b"BZh"
In some places we have constants for the magic bytes, in others, functions that actually check for them. It'd be nice to have a common approach.
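One possible shape for that common approach, sketched with made-up names (module-level constants plus a single probe helper that checks like the new is_bzip2_file() could share):

# Single home for magic-byte constants, used through one probe helper.
GZIP_MAGIC = b"\x1f\x8b"
BZ2_MAGIC = b"BZh"


def _starts_with_magic(path, magic):
    """Return True if the file at `path` begins with the given magic bytes."""
    try:
        with open(path, "rb") as probe:
            return probe.read(len(magic)) == magic
    except OSError:
        return False


def is_bzip2_file(path):
    return _starts_with_magic(path, BZ2_MAGIC)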