[Kernel] Parquet writer TableClient APIs and default implementation #2626

Conversation
 * Extends {@link FileStatus} to include additional details such as column level statistics
 * of the data file in the Delta Lake table.
 */
public class DataFileStatus extends FileStatus {
Should we just change the FileStatus to include the DataFileStatistics? This class doesn't seem to be adding a lot of value.
Surely there are cases of FileStatus where statistics just are not useful or applicable? This class seems very useful to me. Please elaborate?
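For context, a minimal sketch of the shape being discussed: a subclass that carries optional column-level statistics on top of plain file metadata, so `FileStatus` consumers that never need stats don't pay for them. The `FileStatus` fields and the `String` stand-in for `DataFileStatistics` are assumptions for illustration, not the Kernel's actual definitions.

```java
import java.util.Optional;

// Hypothetical minimal FileStatus base class, for illustration only.
class FileStatus {
    private final String path;
    private final long size;
    private final long modificationTime;

    FileStatus(String path, long size, long modificationTime) {
        this.path = path;
        this.size = size;
        this.modificationTime = modificationTime;
    }

    public String getPath() { return path; }
    public long getSize() { return size; }
    public long getModificationTime() { return modificationTime; }
}

// Sketch: DataFileStatus adds optional column-level statistics. Keeping it
// as a subclass means plain FileStatus usages (e.g. directory listings)
// are untouched, while data files written to the table can carry stats.
class DataFileStatus extends FileStatus {
    // Stand-in for a DataFileStatistics object (e.g. min/max/null counts).
    private final Optional<String> statistics;

    DataFileStatus(String path, long size, long modificationTime,
                   Optional<String> statistics) {
        super(path, size, modificationTime);
        this.statistics = statistics;
    }

    public Optional<String> getStatistics() { return statistics; }
}
```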
Adds the interface to write a Parquet file and collect the file status and stats. Currently only supports writing int type columns. Once the interfaces are approved, the rest of the column type support will be added.
This reverts commit aab4588.
} else if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_LONG) {
    return new DecimalLongWriter(colName, fieldIndex, columnVector);
}
// TODO: Need to support legacy mode where all decimals are written as binary
make an issue?
Will create issues once this PR lands. Without landing it, we don't know what code we would be referring to.
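For readers following the precision check in the snippet above: the Parquet format lets DECIMAL be backed by INT32 for precision up to 9 digits, INT64 up to 18, and a binary encoding beyond that, which is what the writer-selection chain is dispatching on. A small sketch of that dispatch; the constants match the Parquet spec, but the factory method and return values here are hypothetical, not the PR's actual code.

```java
// Max decimal digits that fit in each Parquet physical type (per the
// Parquet format spec for the DECIMAL logical type).
final class ParquetSchemaUtils {
    static final int DECIMAL_MAX_DIGITS_IN_INT = 9;
    static final int DECIMAL_MAX_DIGITS_IN_LONG = 18;
}

class DecimalWriterFactory {
    // Hypothetical dispatch mirroring the snippet: pick the narrowest
    // physical type that can hold the decimal's precision.
    static String pickWriter(int precision) {
        if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_INT) {
            return "DecimalIntWriter";    // backed by INT32
        } else if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_LONG) {
            return "DecimalLongWriter";   // backed by INT64
        }
        // TODO (from the PR): legacy mode writes all decimals as binary
        return "DecimalBinaryWriter";     // fixed-length byte array / binary
    }
}
```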
@Override
void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
    recordConsumer.addInteger(columnVector.getByte(rowId));
.addInteger ... .getByte? Should it be .getInteger?
It is getByte because the vector stores byte values. Parquet has one physical type, int, for the byte, short and int logical types. Internally it has an encoding mechanism to save space on disk.
@Override
void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
    recordConsumer.addInteger(columnVector.getShort(rowId));
is there an .addShort?
same as above.
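To illustrate the point in these two threads: Parquet's INT32 physical type backs the byte, short and int logical types, so both writers just widen to int before calling addInteger. The sketch below uses simplified stand-in types (not the real parquet-mr RecordConsumer or Kernel ColumnVector) to show that the implicit widening preserves values, including negative bytes.

```java
// Simplified stand-in for parquet-mr's RecordConsumer, illustration only.
interface RecordConsumer {
    void addInteger(int value);
}

abstract class ColumnWriter {
    abstract void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId);
}

// Byte column writer: the vector stores bytes, but Parquet's physical type
// is INT32, so the value is widened (sign-extended) to int.
class ByteWriter extends ColumnWriter {
    private final byte[] values; // stand-in for ColumnVector.getByte

    ByteWriter(byte[] values) { this.values = values; }

    @Override
    void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
        recordConsumer.addInteger(values[rowId]); // implicit byte -> int widening
    }
}

// Short column writer: same idea, short -> int widening.
class ShortWriter extends ColumnWriter {
    private final short[] values; // stand-in for ColumnVector.getShort

    ShortWriter(short[] values) { this.values = values; }

    @Override
    void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
        recordConsumer.addInteger(values[rowId]); // implicit short -> int widening
    }
}
```

There is no addShort on the consumer side in this model; the narrower logical type is recorded in the schema, and the encoder can still pack small ints compactly on disk.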
})
.filter(Objects::nonNull)
.collect(Collectors.toList());
.map(column -> {
is this just indentation change? can we ignore this noise?
It got changed as part of auto-formatting in IntelliJ. Whitespace-only changes are easy to visualize in GitHub.
    return Optional.empty();
}

return metadataList.stream()
how hard is it to not stream and reduce? stream is known to not have great performance
This code is executed once per file. If it were per row, the cost would add up. Also, this happens mostly in tasks on the executors.
private static boolean hasInvalidStatistics(Collection<ColumnChunkMetaData> metadataList) {
    // If any row group does not have stats collected, stats for the file will not be valid
    return metadataList.stream().anyMatch(metadata ->
same here. i wonder if we should avoid stream on the hot path. can you confirm: is this on any sort of hot path? seems like it could be happening multiple times for every file?
This is not the hot path. Stats extraction is done once per file, and mostly in tasks on the executor.
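For anyone weighing the trade-off in this thread: the stream and the plain loop are behaviorally identical, and since the check runs once per file (not per row), the stream's small per-invocation overhead is negligible. A sketch using a stand-in for parquet-mr's ColumnChunkMetaData (the real class exposes statistics differently; this is illustration only):

```java
import java.util.Collection;

// Stand-in for parquet-mr's ColumnChunkMetaData, illustration only.
class ChunkMeta {
    private final boolean hasStats;
    ChunkMeta(boolean hasStats) { this.hasStats = hasStats; }
    boolean hasStatistics() { return hasStats; }
}

class StatsValidity {
    // Stream version, as in the PR: if any row group lacks stats, the
    // file-level stats are invalid. Fine off the hot path.
    static boolean hasInvalidStatisticsStream(Collection<ChunkMeta> metadataList) {
        return metadataList.stream().anyMatch(m -> !m.hasStatistics());
    }

    // Equivalent plain loop, if stream overhead ever mattered.
    static boolean hasInvalidStatisticsLoop(Collection<ChunkMeta> metadataList) {
        for (ChunkMeta m : metadataList) {
            if (!m.hasStatistics()) {
                return true;
            }
        }
        return false;
    }
}
```

Both short-circuit on the first row group without statistics, so the asymptotics are the same; the choice is purely readability versus a constant-factor overhead.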
LGTM!
Which Delta project/connector is this regarding?

Description

Add the following API to ParquetHandler to support writing Parquet files. The default implementation of the above interface uses the parquet-mr library.

How was this patch tested?

Added support for all Delta types except timestamp_ntz. Tested writing different data types with variations of nested levels, null/non-null values, and target file size.

Followup work

Support for MAP, LIST data types.
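To make the described contract concrete, here is a hedged sketch of the call shape: hand the handler batches of data, get back one status per file written. Every type and name below is a simplified stand-in (the Kernel's actual API uses types like CloseableIterator, FilteredColumnarBatch and DataFileStatus, with different signatures); the in-memory stub exists only to show the round trip.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical status returned per written file: path plus row count
// (the real API also carries size, modification time and column stats).
class FileWriteStatus {
    final String path;
    final long rowCount;
    FileWriteStatus(String path, long rowCount) {
        this.path = path;
        this.rowCount = rowCount;
    }
}

// Hypothetical writer interface mirroring the described contract.
interface ParquetWriteHandler {
    List<FileWriteStatus> writeParquetFiles(String directory, Iterator<int[]> batches);
}

// In-memory stub: "writes" one file per batch, just to show the shape.
class StubParquetWriteHandler implements ParquetWriteHandler {
    @Override
    public List<FileWriteStatus> writeParquetFiles(String directory, Iterator<int[]> batches) {
        List<FileWriteStatus> statuses = new ArrayList<>();
        int fileIndex = 0;
        while (batches.hasNext()) {
            int[] batch = batches.next();
            statuses.add(new FileWriteStatus(
                directory + "/part-" + fileIndex++ + ".parquet", batch.length));
        }
        return statuses;
    }
}
```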