
Conversation

@mchataigner

**Context**: Reads are slow when a table has many delete files.

**TL;DR**: We can leverage the metadata already available in DuckLake to improve the load time of delete files.

**Problem & Motivation:**

DuckLake stores `file_size` metadata for both data and delete files. For data files, there is already a mechanism to forward this metadata to the `MultiFileReader` and the underlying filesystem. The Parquet reader requires this `file_size` to access the footer metadata. When using an `HTTPFileSystem` instance (e.g., for S3 or Azure), it performs a HEAD request on the file if the metadata fields (`file_size`, `etag`, `last_modified`) are not present. Since all files in DuckLake are immutable, we can apply the same optimization to delete files and avoid these unnecessary HEAD requests.

**Solution:**

Implements a custom multi-file reading path that pre-populates file metadata, eliminating redundant HEAD requests to storage when scanning delete files.

**Key Changes:**

1. **New `DeleteFileFunctionInfo` struct**: Extends `TableFunctionInfo` to carry `DuckLakeFileData` metadata through the table function binding process.

2. **Custom `DeleteFileMultiFileReader` class**:
   - Extends DuckDB's `MultiFileReader` to intercept file list creation
   - Pre-populates `ExtendedOpenFileInfo` with metadata already available from DuckLake:
     - File size (`file_size_bytes`)
     - ETag (empty string as placeholder)
     - Last modified timestamp (set to epoch)
     - Encryption key (if present)
   - Creates a `SimpleMultiFileList` with this extended info upfront
   - Overrides `CreateFileList()` to return the pre-built list, bypassing DuckDB's default file discovery

3. **Modified `ScanDeleteFile()` method**:
   - Changes `parquet_scan` from a const reference to a mutable copy so it can be modified
   - Attaches the `DeleteFileFunctionInfo` and the custom reader factory to the table function
   - Passes the actual `parquet_scan` function to `TableFunctionBindInput` instead of a dummy function, ensuring proper function context

**Performance Impact**: Eliminates HEAD requests to object storage when opening Parquet delete files. This is particularly beneficial with remote storage (S3, Azure, etc.) and tables with many delete files, where the HEAD requests were a significant bottleneck.

@mchataigner force-pushed the mbc/improve_scan_delete_files branch 2 times, most recently from 8dfed69 to c45e07f on December 22, 2025 16:48
@mchataigner changed the title from "Add custom MultiFileReader to avoid HEAD requests when scanning delete files" to "Add custom MultiFileReader for reading delete files" on Dec 22, 2025
Collaborator

@pdet left a comment


Hi @mchataigner, thanks for the PR!
Could you add a MinIO test that demonstrates fewer requests are made?
Could you also retarget it to v1.4?

@mchataigner
Author

@pdet you're welcome, I will make the changes.
Thanks for your feedback.

@mchataigner force-pushed the mbc/improve_scan_delete_files branch from c45e07f to b625d74 on December 26, 2025 23:29
@mchataigner force-pushed the mbc/improve_scan_delete_files branch from b625d74 to db066de on December 26, 2025 23:31
@mchataigner changed the base branch from main to v1.4-andium on December 26, 2025 23:31
@mchataigner requested a review from pdet on December 26, 2025 23:32
@mchataigner
Author

@pdet sorry for the delay, I updated the PR with a formatting fix and added a test with MinIO.

