
[Kernel] Parquet writer TableClient APIs and default implementation #2626

Merged — 9 commits merged into delta-io:master from the parquetWriter branch on Feb 14, 2024

Conversation

@vkorukanti (Collaborator) commented on Feb 9, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Add the following API to ParquetHandler to support writing Parquet files.

    /**
     * Write the given data batches to Parquet files. Try to keep the size of each Parquet file
     * close to the given size. If the current file exceeds this size, close the current file and
     * start writing to a new file.
     * <p>
     *
     * @param directoryPath Path to the directory where the Parquet files should be written.
     * @param dataIter      Iterator of data batches to write.
     * @param maxFileSize   Target maximum size of the created Parquet file in bytes.
     * @param statsColumns  List of columns to collect statistics for. The statistics collection is
     *                      optional. If the implementation does not support statistics collection,
     *                      it is ok to return no statistics.
     * @return an iterator of {@link DataFileStatus} containing the status of the written files.
     * Each status contains the file path and the optionally collected statistics for the file.
     * It is the responsibility of the caller to close the iterator.
     *
     * @throws IOException if an I/O error occurs during the file writing. This may leave some files
     *                     already written in the directory. It is the responsibility of the caller
     *                     to clean up.
     * @since 3.2.0
     */
    CloseableIterator<DataFileStatus> writeParquetFiles(
            String directoryPath,
            CloseableIterator<FilteredColumnarBatch> dataIter,
            long maxFileSize,
            List<Column> statsColumns) throws IOException;

The default implementation of the above interface uses the parquet-mr library.
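
For illustration, here is a minimal sketch of how a connector might call this API. The import paths, the stats column choice, the target directory, and the file size are assumptions made for the example, not details from the PR:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import io.delta.kernel.client.ParquetHandler;
    import io.delta.kernel.data.FilteredColumnarBatch;
    import io.delta.kernel.expressions.Column;
    import io.delta.kernel.utils.CloseableIterator;
    import io.delta.kernel.utils.DataFileStatus;

    public class ParquetWriteExample {
        // `handler` and `dataIter` are assumed to come from the connector's
        // TableClient and its data pipeline, respectively.
        static void writeAndCollect(
                ParquetHandler handler,
                CloseableIterator<FilteredColumnarBatch> dataIter) throws IOException {
            // Optionally collect statistics for the `id` column.
            List<Column> statsColumns = Arrays.asList(new Column("id"));
            try (CloseableIterator<DataFileStatus> results = handler.writeParquetFiles(
                    "/tmp/delta-table/data",   // target directory (example value)
                    dataIter,
                    128L * 1024 * 1024,        // target max file size: 128 MB
                    statsColumns)) {
                while (results.hasNext()) {
                    DataFileStatus status = results.next();
                    // Each status carries the written file's path, size, and any stats.
                    System.out.println(status.getPath());
                }
            }
        }
    }

Closing the returned iterator in a try-with-resources block satisfies the caller's responsibility noted in the Javadoc.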

How was this patch tested?

Added support for all Delta types except timestamp_ntz. Tested writing different data types with variations of nesting levels, null/non-null values, and target file sizes.

Follow-up work

  • Support 2-level structures for array and primitive type data writing
  • Support INT64 format timestamp writing
  • Uniform support for adding field IDs to intermediate elements in MAP and LIST data types

@vkorukanti requested a review from tdas on February 9, 2024 19:05
* Extends {@link FileStatus} to include additional details such as column level statistics
* of the data file in the Delta Lake table.
*/
public class DataFileStatus extends FileStatus {
@vkorukanti (Collaborator, Author) commented:
Should we just change the FileStatus to include the DataFileStatistics? This class doesn't seem to be adding a lot of value.

Reviewer (Collaborator) replied:
Surely there are cases of FileStatus where statistics just are not useful or applicable? This class seems very useful to me. Please elaborate?
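
For context on this thread, here is a hypothetical sketch of the class shape under discussion. The constructor, the accessor name, and the assumption that FileStatus exposes a (path, size, modificationTime) constructor are all illustrative, not the PR's actual code:

    import java.util.Optional;

    // Hypothetical shape, for discussion only; the PR's actual code may differ.
    public class DataFileStatus extends FileStatus {
        private final Optional<DataFileStatistics> statistics;

        public DataFileStatus(
                String path,
                long size,
                long modificationTime,
                Optional<DataFileStatistics> statistics) {
            super(path, size, modificationTime); // assumes such a super constructor
            this.statistics = statistics;
        }

        /** Column-level statistics, present only if the writer collected them. */
        public Optional<DataFileStatistics> getStatistics() {
            return statistics;
        }
    }

Keeping the statistics in a subclass, as the reviewer suggests, leaves FileStatus lean for listing use cases where statistics are not applicable.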

@vkorukanti force-pushed the parquetWriter branch 3 times, most recently from 7d3e2c7 to a7d0ae7 on February 9, 2024 19:56
@vkorukanti force-pushed the parquetWriter branch 3 times, most recently from fbe6b97 to b6a75ff on February 12, 2024 18:08
Adds the interface to write a Parquet file and collect the file status and stats. Currently only
supports writing int type columns. Once the interfaces are approved, the rest of the column type
support will be added.
@vkorukanti changed the title from "[WIP][Kernel] Parquet writer TableClient APIs" to "[Kernel] Parquet writer TableClient APIs" on Feb 13, 2024
        } else if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_LONG) {
            return new DecimalLongWriter(colName, fieldIndex, columnVector);
        }
        // TODO: Need to support legacy mode where all decimals are written as binary
Reviewer (Collaborator):
make an issue?

@vkorukanti (Collaborator, Author) replied:
Will create issues once this PR is landed. Without landing it, we don't know what we are referring to.
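
For context, Parquet's convention stores decimals of precision up to 9 digits in INT32, up to 18 digits in INT64, and larger precisions in fixed-length binary. Below is a sketch of how the dispatch around the snippet above plausibly looks; only DecimalLongWriter is visible in the excerpt, so the other writer and constant names are assumptions:

        if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_INT) {
            // Precision <= 9 digits fits in the INT32 physical type.
            return new DecimalIntWriter(colName, fieldIndex, columnVector);
        } else if (precision <= ParquetSchemaUtils.DECIMAL_MAX_DIGITS_IN_LONG) {
            // Precision <= 18 digits fits in the INT64 physical type.
            return new DecimalLongWriter(colName, fieldIndex, columnVector);
        } else {
            // Larger precisions fall back to FIXED_LEN_BYTE_ARRAY.
            return new DecimalFixedBinaryWriter(colName, fieldIndex, columnVector);
        }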


    @Override
    void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
        recordConsumer.addInteger(columnVector.getByte(rowId));
Reviewer (Collaborator):
.addInteger .... .getByte ? should it be .getInteger?

@vkorukanti (Collaborator, Author) replied:
It is getByte because the vector stores byte values. Parquet has one physical type, int, for the byte, short, and int logical types. Internally it has an encoding mechanism to save space on disk.
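
To illustrate the point, here is a small standalone example using the parquet-mr schema builder (the column names are made up): logical BYTE and SHORT are annotations over the same INT32 physical type, which is why RecordConsumer only offers addInteger.

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class IntPhysicalTypeExample {
        public static void main(String[] args) {
            // BYTE is an 8-bit signed logical annotation over physical INT32.
            Type byteField = Types.optional(PrimitiveTypeName.INT32)
                    .as(LogicalTypeAnnotation.intType(8, true))
                    .named("byteCol");
            // SHORT is a 16-bit signed annotation over the same physical type.
            Type shortField = Types.optional(PrimitiveTypeName.INT32)
                    .as(LogicalTypeAnnotation.intType(16, true))
                    .named("shortCol");
            System.out.println(byteField);   // optional int32 byteCol (INTEGER(8,true))
            System.out.println(shortField);  // optional int32 shortCol (INTEGER(16,true))
        }
    }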


    @Override
    void writeNonNullRowValue(RecordConsumer recordConsumer, int rowId) {
        recordConsumer.addInteger(columnVector.getShort(rowId));
Reviewer (Collaborator):
is there an .addShort?

@vkorukanti (Collaborator, Author) replied:
Same as above.

        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
        .map(column -> {
Reviewer (Collaborator):
is this just indentation change? can we ignore this noise?

@vkorukanti (Collaborator, Author) replied:
It got changed as part of autoformatting in IntelliJ. Whitespace-only changes are easy to visualize in GitHub's diff view.

            return Optional.empty();
        }

        return metadataList.stream()
Reviewer (Collaborator):
How hard is it to not stream and reduce? Streams are known to not have great performance.

@vkorukanti (Collaborator, Author) commented on Feb 14, 2024:
This code is executed once per file; if it were per row, the cost would add up. Also, this happens mostly in tasks on the executors.


    private static boolean hasInvalidStatistics(Collection<ColumnChunkMetaData> metadataList) {
        // If any row group does not have stats collected, stats for the file will not be valid
        return metadataList.stream().anyMatch(metadata ->
Reviewer (Collaborator):
Same here. I wonder if we should avoid streams on the hot path. Can you confirm: is this on any sort of hot path? It seems like it could be happening multiple times for every file.

@vkorukanti (Collaborator, Author) commented on Feb 14, 2024:
This is not the hot path. Stats extraction is done once per file, and mostly in tasks on the executor.
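
To make the trade-off concrete, here is a plain-loop equivalent of the stream-based check, shown only to illustrate the reviewer's alternative. The predicate in the excerpt above is truncated, so the exact condition here is an assumption based on parquet-mr's Statistics API:

    import java.util.Collection;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

    private static boolean hasInvalidStatistics(Collection<ColumnChunkMetaData> metadataList) {
        for (ColumnChunkMetaData metadata : metadataList) {
            // If any column chunk lacks stats, file-level stats are not valid.
            Statistics<?> stats = metadata.getStatistics();
            if (stats == null || stats.isEmpty()) {
                return true;
            }
        }
        return false;
    }

Either way, the PR keeps the stream version: this runs once per written file, not per row, so the stream overhead is negligible.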

@vkorukanti changed the title from "[Kernel] Parquet writer TableClient APIs" to "[Kernel] Parquet writer TableClient APIs and default implementation" on Feb 14, 2024
@scottsand-db (Collaborator) left a comment:
LGTM!

@vkorukanti merged commit 4ecfa45 into delta-io:master on Feb 14, 2024
6 checks passed
@vkorukanti deleted the parquetWriter branch on May 9, 2024 02:42