@piotrkochan commented Sep 29, 2025

Fixes #167

Image Size Filtering for PhotoRec

This PR implements filtering of recovered image files by dimensions and file size, addressing the requirement to skip thumbnail-sized images during recovery. The feature currently supports JPG and PNG formats with memory-efficient buffering.

Problems Addressed

Excessive I/O for small files: PhotoRec's original architecture opens a file handle for every detected file signature, writes data to disk, then evaluates filters post-recovery. For recoveries with thousands of thumbnails (10-50KB JPG/PNG files), this meant:

  • Opening file handle → writing blocks to disk → reading file back → checking filters → deleting if rejected
  • This I/O pattern repeated thousands of times causes significant slowdown
  • Files were already written to disk before the maximum file size was evaluated

No dimension-based filtering: There was no user-configurable option to filter by image dimensions (width, height, resolution) or by image file size.

Solution

Pre-save filtering with memory buffering: To filter images without wasting I/O, PhotoRec needs to know both dimensions AND file size before creating files on disk. This creates a problem: dimensions are stored in the image header near the start of the file, while the actual file size is known only once the end-of-file marker has been found.

The solution uses memory buffering combined with a new file_check_presave() callback:

Instead of writing files to disk immediately:

  1. Buffer file data in memory up to a size limit (the smallest of: format's max_filesize, filter's max_file_size, and a 100MB cap)
  2. Parse image headers from memory buffer (JPEG SOF markers, PNG IHDR chunks)
  3. Detect end-of-file markers (JPEG EOI 0xFFD9, PNG IEND) to estimate actual file size
  4. Apply both dimension and size filters before creating any file on disk
  5. Only write to disk if all filters pass
  6. If buffer size limit is exceeded, recovery for that file is aborted (rare for JPG/PNG within configured size limits)

This eliminates wasted disk I/O for rejected images entirely. The file_check_presave() callback operates on the memory buffer, where both dimensions and file size are known, so the complete filtering decision is made before any disk write.
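Steps 2 and 3 can be illustrated for PNG with a minimal sketch. It is not the code from this PR: png_scan_buffer() and struct png_scan are hypothetical names, and CRCs are not verified. It relies only on the standard PNG layout (8-byte signature, then chunks made of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Result of scanning a buffered PNG. Hypothetical type, for illustration only. */
struct png_scan {
    uint32_t width;
    uint32_t height;
    size_t   file_size;  /* bytes up to and including IEND; 0 if IEND was not seen */
};

static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 if IHDR was parsed, 0 if the buffer is not a PNG or is too short. */
static int png_scan_buffer(const unsigned char *buf, size_t len, struct png_scan *out)
{
    static const unsigned char sig[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    size_t pos = 8;

    assert(buf != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    /* signature (8) + IHDR chunk: length (4) + type (4) + data (13) + CRC (4) = 33 */
    if (len < 33 || memcmp(buf, sig, 8) != 0)
        return 0;
    if (read_be32(buf + 8) != 13 || memcmp(buf + 12, "IHDR", 4) != 0)
        return 0;
    out->width  = read_be32(buf + 16);
    out->height = read_be32(buf + 20);

    /* Walk the chunk list until IEND or until the buffered data runs out. */
    while (len - pos >= 12) {
        uint32_t chunk_len = read_be32(buf + pos);
        size_t total;
        if (chunk_len > len - pos - 12)
            break;                       /* chunk extends past the buffered data */
        total = (size_t)chunk_len + 12;
        if (memcmp(buf + pos + 4, "IEND", 4) == 0) {
            out->file_size = pos + total;
            break;
        }
        pos += total;
    }
    return 1;
}
```

A caller would treat file_size == 0 as "end not buffered yet" and keep buffering, up to the size limit, before applying the size filter.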

Core Changes

New filtering module (src/image_filter.c, src/image_filter.h):

  • Implements dimension filtering: min/max width (pixels), height (pixels), and combined resolution
  • Resolution filter accepts both pixel count format (307200) and dimension format (640x480)
  • Implements file size filtering: min/max bytes with k/m/g unit support
  • Range-based specification with hyphen notation (e.g., 800-1920 or -1080 for "no min, max 1080")
  • Validates filters to prevent conflicting parameters (pixels vs width/height)
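The hyphen range notation and unit suffixes can be sketched as follows. This is a hedged illustration, not the parser in image_filter.c: parse_value(), parse_range() and struct u64_range are hypothetical names, the 1024-based multipliers are an assumption, a missing bound is stored as 0 to mean "disabled", and the hyphen is treated as mandatory.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct u64_range { uint64_t min; uint64_t max; };  /* 0 means "no bound" (assumption) */

/* Parse one bound such as "800", "100k" or "1.5m"; advances *s past it. */
static int parse_value(const char **s, int allow_units, uint64_t *out)
{
    char *end;
    double v, scaled, mult = 1.0;

    if (!isdigit((unsigned char)**s))
        return -1;
    errno = 0;
    v = strtod(*s, &end);
    if (errno != 0)
        return -1;
    if (allow_units) {
        switch (*end) {
        case 'k': case 'K': mult = 1024.0; end++; break;
        case 'm': case 'M': mult = 1024.0 * 1024.0; end++; break;
        case 'g': case 'G': mult = 1024.0 * 1024.0 * 1024.0; end++; break;
        default: break;
        }
    }
    scaled = v * mult;
    if (scaled >= 18446744073709551616.0)   /* would not fit in uint64_t */
        return -1;
    *out = (uint64_t)scaled;
    *s = end;
    return 0;
}

/* Accepts "800-1920", "800-" (min only) and "-1080" (max only). Returns 0 on success. */
static int parse_range(const char *text, int allow_units, struct u64_range *r)
{
    const char *p = text;

    assert(text != NULL && r != NULL);
    r->min = 0;
    r->max = 0;
    if (*p != '-' && parse_value(&p, allow_units, &r->min) != 0)
        return -1;
    if (*p != '-')
        return -1;                          /* the hyphen is required here */
    p++;
    if (*p != '\0' && parse_value(&p, allow_units, &r->max) != 0)
        return -1;
    if (*p != '\0')
        return -1;                          /* trailing garbage */
    if (r->min != 0 && r->max != 0 && r->min > r->max)
        return -1;                          /* inverted range */
    return 0;
}
```

Passing allow_units = 0 for width and height keeps a stray suffix such as 800k from being accepted as a pixel count.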

File format handlers (src/file_jpg.c, src/file_png.c):

  • Added file_check_presave() callback that evaluates filters on recovered file data (from memory buffer if buffering is active, or from initial read buffer otherwise)
  • JPG: Parses SOF (Start of Frame) markers for dimensions, detects EOI (0xFFD9) for estimated file size
  • PNG: Parses IHDR chunks for dimensions, detects IEND chunks for estimated file size
  • Filters apply to both primary images and extracted thumbnails (e.g., EXIF thumbnails in JPG)
  • Sets the is_image=1 flag in the file_hint_t structures to enable memory buffering for these formats
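As an illustration of the SOF parsing described for JPG, here is a minimal sketch. It is not the code in file_jpg.c: jpg_find_dimensions() is a hypothetical name, and the function covers only the dimension lookup, not EOI detection. It walks the marker segments after SOI and reads height and width from the first Start of Frame segment (markers 0xC0 to 0xCF, excluding 0xC4, 0xC8 and 0xCC).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and fills *width / *height if a SOF segment is found, 0 otherwise. */
static int jpg_find_dimensions(const unsigned char *buf, size_t len,
                               unsigned int *width, unsigned int *height)
{
    size_t pos = 2;

    assert(buf != NULL && width != NULL && height != NULL);
    if (len < 4 || buf[0] != 0xFF || buf[1] != 0xD8)
        return 0;                               /* no SOI marker */

    while (pos + 4 <= len) {
        unsigned int marker, seg_len;

        if (buf[pos] != 0xFF)
            return 0;                           /* not positioned on a marker */
        marker = buf[pos + 1];
        if (marker == 0xFF) {                   /* fill byte before a marker */
            pos++;
            continue;
        }
        if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
            pos += 2;                           /* standalone marker, no length field */
            continue;
        }
        if (marker == 0xD9 || marker == 0xDA)
            return 0;                           /* EOI or SOS reached before any SOF */

        seg_len = ((unsigned int)buf[pos + 2] << 8) | buf[pos + 3];
        if (seg_len < 2)
            return 0;                           /* the length field counts itself */

        if (marker >= 0xC0 && marker <= 0xCF &&
            marker != 0xC4 && marker != 0xC8 && marker != 0xCC) {
            /* SOF payload: precision (1 byte), height (2), width (2), ... */
            if (seg_len < 7 || pos + 9 > len)
                return 0;
            *height = ((unsigned int)buf[pos + 5] << 8) | buf[pos + 6];
            *width  = ((unsigned int)buf[pos + 7] << 8) | buf[pos + 8];
            return 1;
        }
        pos += 2 + (size_t)seg_len;             /* skip marker plus segment */
    }
    return 0;
}
```

Because an EXIF thumbnail is itself a complete JPEG stored inside an APP1 segment, the same function could be applied to the thumbnail bytes once they have been located.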

To enable image filtering for other formats, modify the file format handler (file_*.c) to:

  1. Set is_image=1 in the file_hint_t structure
  2. Implement file_check_presave() callback that:
    • Parses image dimensions from headers in provided buffer
    • Detects end-of-file markers to estimate file size
    • Calls should_skip_image_by_dimensions() and should_skip_image_by_filesize() from image_filter.h
    • Returns 1 to save file, 0 to skip
  3. In header_check_*() function:
    • Set file_recovery_new->file_check_presave = &your_presave_callback
    • Set file_recovery_new->image_filter = file_recovery->image_filter

See file_jpg.c:jpg_maches_image_filtering() and file_png.c:png_maches_image_filtering() for reference implementations.
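To make the extension steps concrete, here is a self-contained sketch of what a presave check for BMP might look like. Everything in it is a stand-in: struct demo_image_filter, demo_skip_by_dimensions(), demo_skip_by_filesize() and bmp_presave_sketch() are hypothetical, because the real signatures of file_check_presave(), should_skip_image_by_dimensions() and should_skip_image_by_filesize() are not shown in this description. BMP stores its total size in the file header, so no end-of-file marker scan is needed. Keeping files whose header cannot be parsed is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PR's filter settings; 0 means "disabled". */
struct demo_image_filter {
    uint64_t min_file_size, max_file_size;
    uint32_t min_width, max_width;
    uint32_t min_height, max_height;
};

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stand-ins for the helpers in image_filter.h: return 1 when the image must be skipped. */
static int demo_skip_by_dimensions(const struct demo_image_filter *f, uint32_t w, uint32_t h)
{
    if (f->min_width  && w < f->min_width)  return 1;
    if (f->max_width  && w > f->max_width)  return 1;
    if (f->min_height && h < f->min_height) return 1;
    if (f->max_height && h > f->max_height) return 1;
    return 0;
}

static int demo_skip_by_filesize(const struct demo_image_filter *f, uint64_t size)
{
    if (f->min_file_size && size < f->min_file_size) return 1;
    if (f->max_file_size && size > f->max_file_size) return 1;
    return 0;
}

/* Returns 1 to save the file, 0 to skip it (the convention listed above). */
static int bmp_presave_sketch(const unsigned char *buf, size_t len,
                              const struct demo_image_filter *f)
{
    uint32_t file_size, width, height;

    assert(buf != NULL);
    if (f == NULL)
        return 1;                               /* filtering not active */
    /* Need "BM", the 14-byte file header and a DIB header of at least 40 bytes. */
    if (len < 26 || buf[0] != 'B' || buf[1] != 'M' || read_le32(buf + 14) < 40)
        return 1;                               /* cannot judge: keep the file */

    file_size = read_le32(buf + 2);             /* total size stored in the header */
    width     = read_le32(buf + 18);
    height    = read_le32(buf + 22);
    if (height & 0x80000000u)
        height = 0u - height;                   /* negative height means top-down rows */

    if (demo_skip_by_dimensions(f, width, height))
        return 0;
    if (demo_skip_by_filesize(f, file_size))
        return 0;
    return 1;
}
```

In a real handler the callback would be registered from header_check_*() as described in step 3, and the stand-in helpers would be replaced by the ones from image_filter.h.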

Memory buffering (src/filegen.c):

  • Reduces disk I/O by buffering file data in memory until filters can be evaluated
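  • Allocates the buffer with calloc() rather than malloc(); on typical platforms a large zeroed allocation is backed by pages that are committed only when first written, so physical memory use grows with the data actually buffered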
  • Buffer size limited to minimum of: file format's max_filesize, image filter's max_file_size, or 100MB hard cap
  • Only enabled for image formats (JPG, PNG) when image filtering is active
  • If buffer allocation fails, memory buffering is disabled for that file
  • If buffer size is exceeded during recovery, that file's recovery is aborted
  • Buffer flushed to disk only if image passes all filters
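The buffer sizing and overflow rules above can be sketched as follows. This is an illustration rather than the code in filegen.c: struct demo_membuf and the demo_* functions are hypothetical, "100MB" is assumed to mean 100 * 1024 * 1024 bytes, and a limit of 0 is assumed to mean "not set".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_HARD_CAP (100ULL * 1024 * 1024)   /* assumed meaning of the 100MB cap */

/* Hypothetical in-memory staging buffer for one file being recovered. */
struct demo_membuf {
    unsigned char *data;
    size_t used;
    size_t cap;
};

/* Smallest of the format limit, the filter limit and the hard cap (0 = not set). */
static uint64_t demo_buffer_limit(uint64_t format_max, uint64_t filter_max)
{
    uint64_t limit = DEMO_HARD_CAP;
    if (format_max != 0 && format_max < limit) limit = format_max;
    if (filter_max != 0 && filter_max < limit) limit = filter_max;
    return limit;
}

/* Returns 0 on success, -1 if allocation failed (caller writes straight to disk). */
static int demo_membuf_init(struct demo_membuf *b, uint64_t limit)
{
    assert(b != NULL && limit != 0 && limit <= DEMO_HARD_CAP);
    b->data = calloc(1, (size_t)limit);
    b->used = 0;
    b->cap  = (b->data != NULL) ? (size_t)limit : 0;
    return (b->data != NULL) ? 0 : -1;
}

/* Returns 0 on success, -1 if the block does not fit (caller aborts this file). */
static int demo_membuf_append(struct demo_membuf *b, const void *block, size_t n)
{
    assert(b != NULL && block != NULL);
    if (n > b->cap - b->used)
        return -1;
    memcpy(b->data + b->used, block, n);
    b->used += n;
    return 0;
}

static void demo_membuf_free(struct demo_membuf *b)
{
    free(b->data);
    b->data = NULL;
    b->used = 0;
    b->cap  = 0;
}
```

The two failure paths map to different outcomes, as in the list above: a failed allocation only disables buffering for that file, while an overflow during recovery aborts it.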

ncurses UI (src/phrecn.c):

  • New submenu: Options → Image size filters
PhotoRec 7.3-WIP, Data Recovery Utility, April 2025
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org

Image size filters : Disabled
Note: These filters apply only to JPG and PNG files


                           min            max
File size:                 [ disabled ] - [ disabled ]
Width (pixels):            [ disabled ] - [ disabled ]
Height (pixels):           [ disabled ] - [480       ]
Resolution (WIDTHxHEIGHT): [100x100   ] - [ disabled ]


Use Arrow keys to select field, Enter to edit, 'c' to clear, 'q' to quit
  • Interactive editor with arrow key navigation, Enter to edit fields, 'c' to clear, 'q' to quit
  • Real-time validation prevents conflicting parameters

CLI (src/phcli.c, /cmd batch mode):

  • Format: imagesize,size,MIN-MAX,width,MIN-MAX,height,MIN-MAX,pixels,MIN-MAX
  • File size accepts units: 100k, 1.5m, 2g (kilobytes/megabytes/gigabytes)
  • Width/height in pixels: 800-1920 (range), 800- (min only), -1080 (max only)
  • Resolution supports two formats:
    • Pixel count: pixels,307200-2073600 (direct pixel values)
    • Dimension format: pixels,640x480-1920x1080 (width×height, auto-multiplied to pixel count)
  • Example: imagesize,size,100k-,width,800-,height,600- (min 100KB, min 800×600)
  • Example: imagesize,pixels,640x480- (min 640×480 resolution = 307200 pixels)
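The two resolution formats can be handled by one small parser, sketched below. It is a hedged illustration, not the code in phcli.c: parse_pixels() and parse_pixel_range() are hypothetical names, and a missing bound is stored as 0 to mean "disabled".

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse one bound: "307200" (pixel count) or "640x480" (multiplied out). Advances *s. */
static int parse_pixels(const char **s, uint64_t *out)
{
    char *end;
    unsigned long long a, b;

    if (!isdigit((unsigned char)**s))
        return -1;
    errno = 0;
    a = strtoull(*s, &end, 10);
    if (errno != 0)
        return -1;
    if (*end == 'x' || *end == 'X') {
        const char *q = end + 1;
        if (!isdigit((unsigned char)*q))
            return -1;                      /* "640x" with no height */
        b = strtoull(q, &end, 10);
        if (errno != 0)
            return -1;
        if (a != 0 && b > UINT64_MAX / a)
            return -1;                      /* width * height would overflow */
        a *= b;
    }
    *out = (uint64_t)a;
    *s = end;
    return 0;
}

/* Accepts "640x480-", "-1920x1080", "307200-2073600" and mixed forms. Returns 0 on success. */
static int parse_pixel_range(const char *text, uint64_t *min_px, uint64_t *max_px)
{
    const char *p = text;

    assert(text != NULL && min_px != NULL && max_px != NULL);
    *min_px = 0;
    *max_px = 0;
    if (*p != '-' && parse_pixels(&p, min_px) != 0)
        return -1;
    if (*p != '-')
        return -1;                          /* the hyphen is required here */
    p++;
    if (*p != '\0' && parse_pixels(&p, max_px) != 0)
        return -1;
    return (*p == '\0') ? 0 : -1;
}
```

With this approach pixels,640x480-1920x1080 and pixels,307200-2073600 produce the same bounds, which matches the "auto-multiplied to pixel count" behaviour described above.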

Session persistence (src/sessionp.c):

  • Filter settings saved/restored in session files
  • Stored in CLI format for consistency

Testing

Python test suite available at https://gist.github.com/piotrkochan/1eb15d8ecb85c866e716bd07ee48d203

The test script automates validation by running PhotoRec against a disk image with various filter configurations, then verifying that recovered files match the specified criteria using ImageMagick's identify command. It tests file size filtering with min/max/range values and unit notation (k/m/g), dimension filtering for width and height with various boundary conditions, and resolution filtering in both pixel count and WIDTHxHEIGHT format. Combined filters with multiple parameters active simultaneously are also tested. The script performs automatic baseline analysis using percentile calculations to generate realistic test ranges based on actual recovered content.

Future Work

This implementation is designed for extensibility:

  • Filter logic abstracted in image_filter.c for easy addition of other image formats (GIF, BMP, TIFF, WebP, etc.)
  • Same pattern (parse headers in memory, apply filters before save) can be copied with minor modifications
  • Memory buffering architecture supports any file format with bounded size
