[Feature Request] [Kernel] During Active-AddFile-Log-Replay do not pass the RemoveFile to the parquet (checkpoint) reader #4102

Open
scottsand-db opened this issue Jan 29, 2025 · 2 comments

Comments

@scottsand-db
Collaborator

scottsand-db commented Jan 29, 2025

ActiveAddFilesIterator.java is the class in Kernel responsible for replaying the delta log and figuring out which AddFiles are indeed active at the given version of the table (i.e. they have not been logically deleted or "tombstoned" by a RemoveFile).

See here for an explanation and summary of the reverse-log-replay logic implemented.

Note that we only look at RemoveFiles from Delta commit (.json) files; we do not look at any RemoveFiles from checkpoint (parquet) files. The reason: to determine whether a given AddFile X is still present in a version of the table, we need to cover two cases.

  1. X was read from a json file. Then there may have been a RemoveFile later (also in a json) that removed it. Hence, we must keep track of RemoveFiles from json files.
  2. X was read from a checkpoint parquet file. Well, if X was written to the checkpoint file, then it was by definition active at that version of the table. Note that X still could be deleted by a RemoveFile later in a .json, just like in the case above, but there is certainly no RemoveFile in the checkpoint parquet file that removed it.

This means that: we do not need to read any RemoveFiles when we read checkpoint parquet files during active-add-file-log-replay.
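
To make the two cases concrete, here is a minimal, self-contained sketch of reverse log replay. It is not the code in ActiveAddFilesIterator.java: the types `Action`, `LogFile`, and the method `activeAdds` are hypothetical, and actions are keyed by path only, which is a simplification made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of reverse log replay; not the Kernel API. */
final class ReverseReplaySketch {
  enum Kind { ADD, REMOVE }

  /** A file action, keyed by path only (a simplification for this sketch). */
  record Action(Kind kind, String path) {}

  /** One log file: either a JSON commit or a checkpoint (parquet) file. */
  record LogFile(boolean isCheckpoint, List<Action> actions) {}

  /** Expects files ordered newest first, with the checkpoint (if any) last. */
  static List<String> activeAdds(List<LogFile> newestFirst) {
    Set<String> tombstones = new HashSet<>(); // paths removed by a JSON commit
    Set<String> seenAdds = new HashSet<>();   // paths already emitted as active
    List<String> active = new ArrayList<>();

    for (LogFile file : newestFirst) {
      // Only JSON commits contribute tombstones. RemoveFiles stored in a
      // checkpoint cannot hide an AddFile from that checkpoint or from any
      // newer commit, so they are never consulted.
      if (!file.isCheckpoint()) {
        for (Action a : file.actions()) {
          if (a.kind() == Kind.REMOVE) {
            tombstones.add(a.path());
          }
        }
      }
      // An AddFile is active if no newer action for the same path was seen.
      for (Action a : file.actions()) {
        if (a.kind() == Kind.ADD
            && !tombstones.contains(a.path())
            && seenAdds.add(a.path())) {
          active.add(a.path());
        }
      }
    }
    return active;
  }
}
```

In this sketch the RemoveFiles of a checkpoint are decoded and then ignored, which is exactly the wasted work the request below aims to avoid.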

The feature request: avoid passing in the RemoveFile as part of the read schema to the parquet reader, here during active-add-file-log-replay.

The expected result here is: better performance when reading checkpoint files during active-add-file-log-replay.
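
A rough sketch of what the change could look like, assuming the read schema is chosen per log file type. `LogFileType` and `readColumns` are hypothetical names used for illustration only; they are not existing Kernel APIs, and the real read schema is a structured type rather than a list of column names.

```java
import java.util.List;

/** Hypothetical helper; the names are illustrative, not Kernel APIs. */
final class ReadSchemaSketch {
  enum LogFileType { JSON_COMMIT, CHECKPOINT_PARQUET }

  /** Top-level action columns to request from the reader for one log file. */
  static List<String> readColumns(LogFileType type) {
    switch (type) {
      case JSON_COMMIT:
        // RemoveFiles in commits can tombstone older AddFiles, so keep both.
        return List.of("add", "remove");
      case CHECKPOINT_PARQUET:
        // RemoveFiles in a checkpoint are never consulted during
        // active-add-file-log-replay, so the column is not requested.
        return List.of("add");
      default:
        throw new IllegalArgumentException("unknown log file type: " + type);
    }
  }
}
```

Because parquet is columnar, leaving `remove` out of the requested columns should let the reader skip decoding that column entirely.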

@scovich
Collaborator

scovich commented Jan 29, 2025

Seems very reasonable -- kernel-rs log replay already does it:
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/scan/mod.rs#L420-L427

@scovich
Collaborator

scovich commented Jan 29, 2025

I think row group skipping is a separate thing tho? At least in kernel-rs we would have to push add.path IS NOT NULL filter down into the parquet reader (which I think we currently do not)?

However -- I would also expect adds and removes to be equally randomly scattered through checkpoint part files, so I doubt that push-down would actually prune any checkpoint parts in practice.
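
To illustrate why that push-down would rarely help, here is a small sketch of the skipping rule. It assumes per-row-group statistics that expose a row count and a null count for `add.path`; `RowGroupStats` and `canSkip` are hypothetical names, not parquet reader or Kernel APIs.

```java
import java.util.List;

/** Hypothetical model of row group skipping for `add.path IS NOT NULL`. */
final class RowGroupSkipSketch {
  /** Assumed per-row-group statistics for the add.path column. */
  record RowGroupStats(long rowCount, long addPathNullCount) {}

  /** A group can be skipped only when every add.path value in it is null. */
  static boolean canSkip(RowGroupStats stats) {
    return stats.addPathNullCount() == stats.rowCount();
  }

  /** Counts the row groups that the filter would allow the reader to skip. */
  static long skippableGroups(List<RowGroupStats> groups) {
    return groups.stream().filter(RowGroupSkipSketch::canSkip).count();
  }
}
```

If adds and removes are interleaved, nearly every row group contains at least one non-null `add.path`, so `canSkip` returns false almost everywhere. Dropping the `remove` column from the read schema is a separate saving that does not depend on how the rows are distributed.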
