Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils/archive: Add support for detecting archives without extensions #6132

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

harvey0100
Copy link
Contributor

This patch enhances the archive module to detect and extract archive files without proper extensions. Previously, while is_archive() could correctly identify archive files by examining their content, the ArchiveFile class (used by extract()) was relying solely on file extensions.

The implementation now:

Adds content-based detection to ArchiveFile for tar, zip, and compressed archives Adds a new is_bzip2_file() function to detect bzip2 files Improves error handling in the uncompress function Adds comprehensive unit tests for archives with and without extensions This fixes the issue where extract() would fail with "file is not an archive" error when trying to extract an archive without a proper extension, even though is_archive() correctly identified it.

Reference: #5997

This patch enhances the archive module to detect and extract archive files
without proper extensions. Previously, while is_archive() could correctly
identify archive files by examining their content, the ArchiveFile class
(used by extract()) was relying solely on file extensions.

The implementation now:

Adds content-based detection to ArchiveFile for tar, zip, and compressed archives
Adds a new is_bzip2_file() function to detect bzip2 files
Improves error handling in the uncompress function
Adds comprehensive unit tests for archives with and without extensions
This fixes the issue where extract() would fail with "file is not an
archive" error when trying to extract an archive without a proper
extension, even though is_archive() correctly identified it.

Reference: avocado-framework#5997
Signed-off-by: Harvey Lynden <[email protected]>
@harvey0100 harvey0100 self-assigned this Mar 3, 2025
@harvey0100 harvey0100 added this to the 110 - Codename TBD milestone Mar 3, 2025
@harvey0100 harvey0100 added customer:QEMU Requirements/issues raised by the QEMU project enhancement labels Mar 3, 2025
@harvey0100 harvey0100 requested review from richtja and clebergnu and removed request for richtja March 3, 2025 13:25
Copy link

codecov bot commented Mar 3, 2025

Codecov Report

Attention: Patch coverage is 31.25000% with 55 lines in your changes missing coverage. Please review.

Project coverage is 68.73%. Comparing base (65c967d) to head (165b045).

Files with missing lines Patch % Lines
avocado/utils/archive.py 31.25% 55 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6132      +/-   ##
==========================================
- Coverage   68.89%   68.73%   -0.16%     
==========================================
  Files         203      203              
  Lines       22019    22087      +68     
==========================================
+ Hits        15170    15182      +12     
- Misses       6849     6905      +56     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@clebergnu clebergnu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @harvey0100,

Thanks for this! IMO, the extra bzip2 support should come as a separate commit. Also, there are a number of comments that are more on the general code approach and/or refactor opportunities. Let me know what you think of them.

self.tmpdir = tempfile.TemporaryDirectory(prefix="avocado_" + __name__)
self.base_dir = self.tmpdir.name

# Create a simple text file to archive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think creating files and the archives at run time is the best choice, given that:

  1. It consumes extra time/resources
  2. It's error prone, in the sense that a bug in a system tar or sorts, would impact the outcome of a test
  3. There's already the precedent of shipping "golden" archives in selftests/.data
  4. It adds extra dependencies at test run time (like tar itself)

with archive.ArchiveFile.open(zip_file) as arch:
self.assertIsNotNone(arch)
files = arch._engine.namelist()
self.assertTrue(len(files) > 0, "Archive should contain files")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea, not necessary for this patch, but anyway: what if we have some supporting utility code that reads a metadata file associated with an archive. For instance, for the existing selftests/.data/avocado.gz, we could have a selftests/.data/avocado.gz.metadata containing:

{"members": [["avocado", "f1d2d2f924e986ac86fdf7b36c94bcdf32beec15"]]}

The point is to lean towards the idea of shipping "golden" archive files that document themselves and in theory could be tested "automatically" to a great extent.

# Check for TAR file
elif tarfile.is_tarfile(filename):
# Detect compression method for tar files
with open(filename, "rb") as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the logic here could be transformed into a data structure or even added to the existing (or a modified version) of the _extension_table dict?

try:
with open(path, "rb") as bz2_file:
# Check for bzip2 magic bytes (BZh)
return bz2_file.read(3) == b"BZh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some places we have constants for the magic bytes, in others, functions that actually check for them. It'd be nice to have a common approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer:QEMU Requirements/issues raised by the QEMU project enhancement
Projects
Status: Review Requested
Development

Successfully merging this pull request may close these issues.

utils/archive: support for detecting archive files without proper extensions
2 participants