Add custom MultiFileReader for reading delete files #641
**Context:** Slower read performance when a table has many delete files.

**TL;DR:** We can leverage the metadata already available in DuckLake to improve the load time of delete files.
**Problem & Motivation:**

DuckLake stores `file_size` metadata for both data and delete files. For data files, there is already a mechanism to forward this metadata to the `MultiFileReader` and the underlying filesystem. The Parquet reader requires this `file_size` to access the footer metadata. When using an `HTTPFileSystem` instance (e.g., for S3 or Azure), it performs a HEAD request on the file if the metadata fields (`file_size`, `etag`, `last_modified`) are not present. Since all files in DuckLake are immutable, we can apply the same optimization logic to delete files and avoid these unnecessary HEAD requests.
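To illustrate the mechanism, here is a minimal sketch of how pre-populated metadata travels with a file, assuming the `OpenFileInfo`/`ExtendedOpenFileInfo` API from recent DuckDB versions, where httpfs checks the `file_size`, `etag`, and `last_modified` keys before deciding to issue a HEAD request (the helper function itself is hypothetical):

```cpp
#include "duckdb/common/open_file_info.hpp"
#include "duckdb/common/types/value.hpp"

using namespace duckdb;

// Hypothetical helper: build an OpenFileInfo that already carries the
// metadata DuckLake stores in its catalog. With these keys populated,
// HTTPFileSystem has no reason to issue a HEAD request for the file.
OpenFileInfo FileInfoFromCatalogMetadata(const string &path, idx_t file_size, const string &etag) {
	OpenFileInfo info(path);
	info.extended_info = make_shared_ptr<ExtendedOpenFileInfo>();
	info.extended_info->options["file_size"] = Value::UBIGINT(file_size);
	info.extended_info->options["etag"] = Value(etag);
	// "last_modified" can be filled in the same way when known
	return info;
}
```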
**Solution:**

Implements a custom multi-file reading solution that pre-populates file metadata, eliminating redundant HEAD requests to storage when scanning delete files.

**Key Changes:**
**New `DeleteFileFunctionInfo` struct:** Extends `TableFunctionInfo` to carry `DuckLakeFileData` metadata through the table function binding process.
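A minimal sketch of what this struct could look like, based on the description above rather than the exact DuckLake source:

```cpp
#include "duckdb/function/table_function.hpp"

namespace duckdb {

// Carries DuckLake's catalog metadata for a delete file through table
// function binding, so the custom reader can recover it later via
// TableFunction::function_info.
struct DeleteFileFunctionInfo : public TableFunctionInfo {
	explicit DeleteFileFunctionInfo(DuckLakeFileData delete_file_p)
	    : delete_file(std::move(delete_file_p)) {
	}

	// File path, file_size_bytes, etc., as stored in the DuckLake catalog
	DuckLakeFileData delete_file;
};

} // namespace duckdb
```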
**Custom `DeleteFileMultiFileReader` class:**
- Subclasses `MultiFileReader` to intercept file list creation
- Extends each `OpenFileInfo` with metadata already available from DuckLake (`file_size_bytes`)
- Builds a `SimpleMultiFileList` with this extended info upfront
- Overrides `CreateFileList()` to return the pre-built list, bypassing DuckDB's default file discovery
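A sketch of the reader along these lines; the `CreateFileList()` signature shown here is the one from recent DuckDB versions and may differ slightly per release:

```cpp
#include "duckdb/common/multi_file/multi_file_reader.hpp"

namespace duckdb {

// Returns a pre-built, metadata-enriched file list instead of letting
// DuckDB glob and stat the path itself.
class DeleteFileMultiFileReader : public MultiFileReader {
public:
	explicit DeleteFileMultiFileReader(DuckLakeFileData delete_file_p)
	    : delete_file(std::move(delete_file_p)) {
	}

	shared_ptr<MultiFileList> CreateFileList(ClientContext &context, const vector<OpenFileInfo> &paths,
	                                         FileGlobOptions options) override {
		// Attach the catalog metadata so no HEAD request is needed downstream
		OpenFileInfo info(delete_file.path);
		info.extended_info = make_shared_ptr<ExtendedOpenFileInfo>();
		info.extended_info->options["file_size"] = Value::UBIGINT(delete_file.file_size_bytes);

		vector<OpenFileInfo> files;
		files.push_back(std::move(info));
		// Return the pre-built list, bypassing DuckDB's default file discovery
		return make_shared_ptr<SimpleMultiFileList>(std::move(files));
	}

private:
	DuckLakeFileData delete_file;
};

} // namespace duckdb
```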
**Modified `ScanDeleteFile()` method:**
- Changes `parquet_scan` from a const reference to a mutable copy to allow modification
- Attaches the `DeleteFileFunctionInfo` and the custom reader factory to the table function
- Passes the `parquet_scan` function to `TableFunctionBindInput` instead of a dummy function, ensuring proper function context (see the sketch at the end of this description)

**Performance Impact:** Eliminates HEAD requests to object storage when opening Parquet delete files. This is particularly beneficial when working with remote storage (S3, Azure, etc.) and tables with many delete files, where HEAD requests were causing significant performance bottlenecks.
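Finally, a sketch of the `ScanDeleteFile()` wiring described above. This assumes the `function_info` and `get_multi_file_reader` members on `TableFunction` and the `MultiFileReader::Create()` factory from recent DuckDB; the lookup helper is hypothetical and the bind-input plumbing is elided:

```cpp
// Factory hooked into the table function: recovers the catalog metadata
// from the attached DeleteFileFunctionInfo and builds the custom reader.
static unique_ptr<MultiFileReader> CreateDeleteFileReader(const TableFunction &table) {
	auto &info = table.function_info->Cast<DeleteFileFunctionInfo>();
	return make_uniq<DeleteFileMultiFileReader>(info.delete_file);
}

// Inside ScanDeleteFile(): take a mutable copy of parquet_scan (previously a
// const reference), then attach the info and the reader factory to it.
TableFunction parquet_scan = GetParquetScan(context); // hypothetical lookup
parquet_scan.function_info = make_shared_ptr<DeleteFileFunctionInfo>(delete_file);
parquet_scan.get_multi_file_reader = CreateDeleteFileReader;

// parquet_scan itself (not a dummy function) is then passed into the
// TableFunctionBindInput, so the Parquet bind sees the attached info and
// MultiFileReader::Create(input.table_function) returns the custom reader.
```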