[Feature Request] [Kernel] During Active-AddFile-Log-Replay do not pass the RemoveFile to the parquet (checkpoint) reader #4102

scottsand-db · 2025-01-29T21:40:49Z

ActiveAddFilesIterator.java is the class in Kernel responsible for replaying the delta log and figuring out which AddFiles are indeed active at the given version of the table (i.e. they have not been logically deleted or "tombstoned" by a RemoveFile).

See here for an explanation and summary of the reverse-log-replay logic implemented.

Note that we only look at RemoveFiles from Delta commit (.json) files. We do not look at any RemoveFiles from checkpoint (parquet) files. The reason: to determine whether a given AddFile X is still present in a version of the table, we need to cover two cases.

1. X was read from a json file. Then there may have been a RemoveFile later (also in a json) that removed it. Hence, we must keep track of RemoveFiles from json files.
2. X was read from a checkpoint parquet file. If X was written to the checkpoint file, then it was by definition active at that version of the table. Note that X could still be deleted by a RemoveFile later in a .json, just like in the case above, but there is certainly no RemoveFile in the checkpoint parquet file that removed it.

This means that: we do not need to read any RemoveFiles when we read checkpoint parquet files during active-add-file-log-replay.
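The two cases above can be sketched roughly as follows. This is a simplified model and not the actual ActiveAddFilesIterator code: the class, the `Action` and `LogFile` types, and the use of a bare path as the matching key are all assumptions made for illustration. Files are visited newest first, and RemoveFiles only matter when they come from a commit (.json) file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of reverse log replay. Not the Kernel code.
class ReverseReplaySketch {
    enum Kind { ADD, REMOVE }

    // One file action. `path` stands in for the real key used to match adds and removes.
    static final class Action {
        final Kind kind;
        final String path;
        Action(Kind kind, String path) { this.kind = kind; this.path = path; }
    }

    // The actions of one log file, tagged with whether the file is a checkpoint.
    static final class LogFile {
        final boolean isCheckpoint;
        final List<Action> actions;
        LogFile(boolean isCheckpoint, Action... actions) {
            this.isCheckpoint = isCheckpoint;
            this.actions = Arrays.asList(actions);
        }
    }

    // Returns the paths of active AddFiles. `newestFirst` must be ordered newest to oldest.
    static List<String> activeAdds(List<LogFile> newestFirst) {
        Set<String> removedInJson = new HashSet<>(); // tombstones seen in commit files
        Set<String> addsSeenInJson = new HashSet<>(); // adds already decided from commit files
        List<String> active = new ArrayList<>();
        for (LogFile file : newestFirst) {
            for (Action a : file.actions) {
                if (a.kind == Kind.REMOVE) {
                    // Case 2: a RemoveFile inside a checkpoint can never hide an AddFile,
                    // so it is ignored (and ideally never read at all).
                    if (!file.isCheckpoint) {
                        removedInJson.add(a.path);
                    }
                    continue;
                }
                // Case 1 and case 2: a newer commit may have removed or re-added this path.
                boolean alreadyDecided =
                    removedInJson.contains(a.path) || addsSeenInJson.contains(a.path);
                if (!alreadyDecided) {
                    active.add(a.path);
                }
                if (!file.isCheckpoint) {
                    addsSeenInJson.add(a.path);
                }
            }
        }
        return active;
    }

    // Two commits on top of one checkpoint, listed newest first.
    static List<String> example() {
        return activeAdds(Arrays.asList(
            new LogFile(false, new Action(Kind.REMOVE, "a"), new Action(Kind.ADD, "c")),
            new LogFile(false, new Action(Kind.ADD, "b")),
            new LogFile(true, new Action(Kind.ADD, "a"), new Action(Kind.ADD, "d"))));
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints [c, b, d]
    }
}
```

In the example, "a" comes from the checkpoint but is dropped because a later commit removed it, while "d" survives. Nothing in the loop ever needs a RemoveFile from the checkpoint.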

The feature request: avoid passing in the RemoveFile as part of the read schema to the parquet reader, here during active-add-file-log-replay.

The expected result here is: better performance when reading checkpoint files during active-add-file-log-replay.
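A minimal sketch of the idea, assuming the read schema can be chosen per file. The class and method names are hypothetical, the column lists stand in for the real Kernel schema objects, and detecting a checkpoint by file name is an assumption made only for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: choose the top-level columns to read per log file.
class ReadSchemaSketch {
    // Commit (.json) files need both actions: adds, plus removes that may tombstone them.
    static final List<String> COMMIT_COLUMNS =
        Collections.unmodifiableList(Arrays.asList("add", "remove"));

    // Checkpoint parquet files only need the adds during active-add-file-log-replay.
    static final List<String> CHECKPOINT_COLUMNS = Collections.singletonList("add");

    // Assumption: a checkpoint parquet file is recognized by its name alone.
    static boolean isCheckpointParquet(String fileName) {
        return fileName.contains(".checkpoint.") && fileName.endsWith(".parquet");
    }

    static List<String> readColumnsFor(String fileName) {
        return isCheckpointParquet(fileName) ? CHECKPOINT_COLUMNS : COMMIT_COLUMNS;
    }
}
```

With a columnar format like parquet, leaving `remove` out of the projection means its column chunks do not have to be read or decoded, which is where the saving would come from.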

scovich · 2025-01-29T22:11:01Z

Seems very reasonable -- kernel-rs log replay already does it:
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/scan/mod.rs#L420-L427

scovich · 2025-01-29T22:12:58Z

I think row group skipping is a separate thing tho? At least in kernel-rs we would have to push add.path IS NOT NULL filter down into the parquet reader (which I think we currently do not)?

However -- I would also expect adds and removes to be equally randomly scattered through checkpoint part files, so I doubt that push-down would actually prune any checkpoint parts in practice.
