ActiveAddFilesIterator.java is the class in Kernel responsible for replaying the delta log and figuring out which AddFiles are indeed active at the given version of the table (i.e. they have not been logically deleted or "tombstoned" by a RemoveFile).
See here for an explanation and summary of the reverse-log-replay logic implemented.
Note that we only look at the RemoveFiles that come from Delta commit (.json) files. We do not look at any RemoveFiles from checkpoint (parquet) files. The reason: if we are looking at a given AddFile X and want to determine whether X is still present in a version of the table, we need to cover two cases.
1. X was read from a json file. Then there may have been a RemoveFile later (also in a json) that removed it. Hence, we must keep track of RemoveFiles from json files.
2. X was read from a checkpoint parquet file. Well, if X was written to the checkpoint file, then it was by definition active at that version of the table. Note that X could still be deleted by a RemoveFile later in a .json, just like in the case above, but there is certainly no RemoveFile in the checkpoint parquet file that removed it.
This means that: we do not need to read any RemoveFiles when we read checkpoint parquet files during active-add-file-log-replay.
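To make the two cases concrete, here is a minimal sketch of reverse log replay. It is not Kernel's actual code: `ReverseReplaySketch`, `LogFile`, `Action`, and `activeAdds` are hypothetical names, and files are keyed by path only (a real implementation may need a richer key, for example one that accounts for deletion vectors).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of reverse log replay; not Kernel's real classes.
class ReverseReplaySketch {
    enum Source { COMMIT_JSON, CHECKPOINT_PARQUET }
    enum Kind { ADD, REMOVE }

    record Action(Kind kind, String path) {}
    record LogFile(Source source, List<Action> actions) {}

    /** Expects log files ordered newest first; returns the paths of active AddFiles. */
    static List<String> activeAdds(List<LogFile> newestFirst) {
        Set<String> tombstones = new HashSet<>(); // paths removed by a JSON commit
        Set<String> seenAdds = new HashSet<>();   // paths already emitted
        List<String> active = new ArrayList<>();
        for (LogFile file : newestFirst) {
            // Case 1: only removes from JSON commits are tracked.
            // Case 2: removes stored in a checkpoint are never consulted.
            if (file.source() == Source.COMMIT_JSON) {
                for (Action a : file.actions()) {
                    if (a.kind() == Kind.REMOVE) {
                        tombstones.add(a.path());
                    }
                }
            }
            for (Action a : file.actions()) {
                if (a.kind() == Kind.ADD
                        && !tombstones.contains(a.path())
                        && seenAdds.add(a.path())) {
                    active.add(a.path());
                }
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<LogFile> log = List.of(
            // Newest commit: removes a.parquet, adds c.parquet.
            new LogFile(Source.COMMIT_JSON, List.of(
                new Action(Kind.REMOVE, "a.parquet"),
                new Action(Kind.ADD, "c.parquet"))),
            // Older checkpoint: its remove entry is ignored by the replay.
            new LogFile(Source.CHECKPOINT_PARQUET, List.of(
                new Action(Kind.ADD, "a.parquet"),
                new Action(Kind.ADD, "b.parquet"),
                new Action(Kind.REMOVE, "old.parquet"))));
        System.out.println(activeAdds(log)); // prints [c.parquet, b.parquet]
    }
}
```

In this sketch the checkpoint's `REMOVE` entry never influences the result, which is the observation the request is built on.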
The feature request: avoid passing in the RemoveFile as part of the read schema to the parquet reader, here during active-add-file-log-replay.
The expected result here is: better performance when reading checkpoint files during active-add-file-log-replay.
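As a rough illustration of the requested change, the read schema could be chosen per file type. This is only a sketch: `ReadSchemaSketch` and `readColumns` are hypothetical and do not reflect Kernel's schema or reader API; `add` and `remove` stand for the top-level action columns.

```java
import java.util.List;

// Hypothetical helper; Kernel's real schema types and reader calls differ.
class ReadSchemaSketch {
    enum Source { COMMIT_JSON, CHECKPOINT_PARQUET }

    /** Top-level action columns to request for active-add-file log replay. */
    static List<String> readColumns(Source source) {
        return switch (source) {
            // JSON commits: removes are needed to build the tombstone set.
            case COMMIT_JSON -> List.of("add", "remove");
            // Checkpoint parquet: removes are never consulted, so skip the column.
            case CHECKPOINT_PARQUET -> List.of("add");
        };
    }

    public static void main(String[] args) {
        System.out.println(readColumns(Source.COMMIT_JSON));        // prints [add, remove]
        System.out.println(readColumns(Source.CHECKPOINT_PARQUET)); // prints [add]
    }
}
```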
I think row group skipping is a separate thing, though? At least in kernel-rs we would have to push an add.path IS NOT NULL filter down into the parquet reader (which I think we currently do not).
However -- I would also expect adds and removes to be equally randomly scattered through checkpoint part files, so I doubt that push-down would actually prune any checkpoint parts in practice.