Conversation

unical1988 (Contributor):

Added ConversionSource for Parquet files

  • Partition format is specified by a user config.

public InternalSnapshot getCurrentSnapshot() {
List<InternalDataFile> internalDataFiles = getInternalDataFiles();
InternalTable table = getTable(-1L);
return InternalSnapshot.builder()
Contributor:

We'll need to set the version here. I am guessing it should be the last modification time but I need to think it through more.
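For reference, a minimal sketch of what that could look like, assuming InternalSnapshot.builder() exposes a version(String) setter and that InternalDataFile carries a lastModified timestamp; using the latest modification time as the version is only the guess from the comment above, not a settled decision.

    public InternalSnapshot getCurrentSnapshot() {
      List<InternalDataFile> internalDataFiles = getInternalDataFiles();
      InternalTable table = getTable(-1L);
      // Use the newest data file modification time as a stand-in for the snapshot version.
      long latestModificationTime =
          internalDataFiles.stream()
              .mapToLong(InternalDataFile::getLastModified)
              .max()
              .orElse(0L);
      return InternalSnapshot.builder()
          .table(table)
          .version(String.valueOf(latestModificationTime))
          // partitionedDataFiles and other builder fields omitted in this sketch
          .build();
    }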


@Builder
// @NoArgsConstructor(access = AccessLevel.PRIVATE)
public class ParquetConversionSource implements ConversionSource<Long> {
Contributor:

For us to provide support for incremental sync, I think this may need to be a range instead of a singular point. That way we can filter the files by a start and end time when performing the sync.

Contributor:

@unical1988 this is related to the incremental sync paths. We'll want this source to have some time range to find the files that need to be synced. If we don't do this, we will need to limit this source to snapshot syncs.
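As a rough illustration of the range idea (not part of the PR), the file listing could filter by modification time between a sync start and end; the method name and the reliance on the file system's modification time are assumptions.

    // Uses org.apache.hadoop.fs.{FileSystem, Path, LocatedFileStatus, RemoteIterator} and java.time.Instant.
    private List<LocatedFileStatus> getFilesModifiedBetween(
        FileSystem fs, Path basePath, Instant syncStart, Instant syncEnd) throws IOException {
      List<LocatedFileStatus> files = new ArrayList<>();
      RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(basePath, true);
      while (iterator.hasNext()) {
        LocatedFileStatus status = iterator.next();
        long modificationTime = status.getModificationTime();
        // Keep only files written inside the requested sync window [syncStart, syncEnd).
        if (modificationTime >= syncStart.toEpochMilli() && modificationTime < syncEnd.toEpochMilli()) {
          files.add(status);
        }
      }
      return files;
    }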

… PartitionFieldSpec + no file diffs removed files
ParquetMetadata parquetMetadata =
parquetMetadataExtractor.readParquetMetadata(hadoopConf, latestFile.getPath());

List<InternalPartitionField> partitionFields =
Contributor:

The ParquetPartitionSpecExtractor should be used here. The spec extractor is used for defining which fields are used and the ParquetPartitionValueExtractor should be used to get the values for the partition fields per file.

Contributor Author (unical1988), Jul 23, 2025:

In this case, is partitionValueExtractor.extractParquetPartitions() useless? As a matter of fact, it is called further down in the code to extract the partition fields (should the other calls be replaced as well?). Also, should ParquetPartitionValueExtractor have a constructor taking a ParquetPartitionSpecExtractor as a parameter, or should it use one through an instance inside ParquetPartitionValueExtractor?

Contributor:

The spec of the partitions is different from the values of the partitions. When you define a table, you define the partitioning of that table as a whole. For example, you would describe the table as "partitioned on date with day granularity" instead of "partition is 2025-07-29." The extractParquetPartitions is still required for attaching the partition values to the files.
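To illustrate the distinction with XTable's intermediate model (builder field names here are from memory and may differ slightly): the spec describes the table as a whole, the value describes a single file.

    // Spec: the table is partitioned on the `date` field with DAY granularity.
    InternalPartitionField spec =
        InternalPartitionField.builder()
            .sourceField(dateField) // an InternalField for the `date` column
            .transformType(PartitionTransformType.DAY)
            .build();

    // Value: this particular file belongs to the 2025-07-29 partition.
    PartitionValue value =
        PartitionValue.builder()
            .partitionField(spec)
            .range(Range.scalar("2025-07-29"))
            .build();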

@@ -141,7 +140,7 @@ private static Stream<Arguments> testCasesWithPartitioningAndSyncModes() {

private static Stream<Arguments> generateTestParametersForFormatsSyncModesAndPartitioning() {
List<Arguments> arguments = new ArrayList<>();
-    for (String sourceTableFormat : Arrays.asList(HUDI, DELTA, ICEBERG)) {
+    for (String sourceTableFormat : Arrays.asList(HUDI, DELTA, ICEBERG, PARQUET)) {
Contributor:

This will likely cause some issues since we don't have a concept for updates or deletes in the parquet source. It is assumed to be append-only. We can instead set up a new test in this class that creates parquet tables, converts them to all 3 other table formats, and validates the data using the existing helpers.

Contributor Author (unical1988):

I notice you start the tests by creating a SparkSession; can we leverage it for creating the parquet file to convert into the 3 other formats? Is a call to conversionController.sync(conversionConfig, conversionSourceProvider) sufficient to test the conversion? Do you validate the converted data through checkDatasetEquivalence() (all in ITConversionController.java)?

Contributor:

Yes, we can use the Spark session, and your understanding of the methods is correct.
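A rough sketch of such a test, assuming a JUnit @TempDir, the shared SparkSession, and a hypothetical buildConversionConfig helper modeled on the existing config setup in ITConversionController:

    @Test
    void testParquetSourceSnapshotSync() {
      String tableName = "parquet_source_table";
      String basePath = tempDir.resolve(tableName).toString();
      // Write a plain, append-only parquet dataset with the shared SparkSession.
      sparkSession
          .range(0, 100)
          .withColumn("part_col", functions.pmod(functions.col("id"), functions.lit(10)))
          .write()
          .partitionBy("part_col")
          .parquet(basePath);
      // Sync the parquet source to the three other formats.
      ConversionConfig conversionConfig = buildConversionConfig(tableName, basePath); // hypothetical helper
      conversionController.sync(conversionConfig, conversionSourceProvider);
      // Validate the targets against the source with checkDatasetEquivalence(), as in the other tests.
    }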

@@ -27,6 +27,7 @@ public class TableFormat {
public static final String HUDI = "HUDI";
public static final String ICEBERG = "ICEBERG";
public static final String DELTA = "DELTA";
public static final String PARQUET = "PARQUET";

public static String[] values() {
return new String[] {"HUDI", "ICEBERG", "DELTA"};
Contributor:

Should PARQUET be added here?

Contributor Author (unical1988):

It was added here so that TableFormat.PARQUET is defined, but for the old tests to pass it needs to be left out of values().

}

@Override
public boolean isIncrementalSyncSafeFrom(Instant instant) {
Contributor:

This looks like we're just checking if there are files older than the provided instant. Is this because we are assuming the files may be removed from the source table?

Contributor Author (unical1988):

Is there another way to think of it? How would you approach it otherwise?


writeStatus.setStat(writeStat);
return writeStatus;
}

private Map<String, HoodieColumnRangeMetadata<Comparable>> convertColStats(
-      String fileName, List<ColumnStat> columnStatMap) {
+      String fileName, List<ColumnStat> columnStatMap, String fileFormat) {
Contributor:

We cannot have this be dependent on fileFormat as mentioned before. The intermediate object should be standardized so it is not dependent on source format

Contributor Author (unical1988):

It isn't; I forgot to remove the parameter.

Contributor Author (unical1988):

Also, please check with me the current CI error in TestDeltaSync and TestIcebergSync; a tiny error must be in a certain path.

schema,
dataFile.getRecordCount(),
dataFile.getColumnStats(),
dataFile.getFileFormat().toString()))
Contributor:

This needs to be cleaned up as well. The change in expected args is causing the mocks to no longer match and that is causing the test failures for Iceberg.

Contributor Author (unical1988):

You're right, I forgot to revert the changes here too; now done.

@@ -50,7 +50,8 @@ public static IcebergColumnStatsConverter getInstance() {
return INSTANCE;
}

-  public Metrics toIceberg(Schema schema, long totalRowCount, List<ColumnStat> fieldColumnStats) {
+  public Metrics toIceberg(
Contributor:

The changes to this file and others that are just formatting in the Iceberg and Delta paths should be reverted to minimize the diff to just the parts that are essential for review.

import org.apache.xtable.hudi.HudiTestUtil;
import org.apache.xtable.model.sync.SyncMode;

public class TestParquetConversionSource {
Contributor:

We will need to either move this to run in the integration tests by changing Test to IT or it needs to set @Execution(SAME_THREAD) so that it will not try to start a spark session in the same JVM as the other tests like TestDeltaSync. This looks like the reason why TestDeltaSync will fail when run as part of the full suite but not when run in isolation.
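For the second option, the JUnit 5 parallel-execution annotation would look like this on the test class:

    import org.junit.jupiter.api.parallel.Execution;
    import org.junit.jupiter.api.parallel.ExecutionMode;

    @Execution(ExecutionMode.SAME_THREAD)
    public class TestParquetConversionSource {
      // existing tests unchanged; they now share one thread with the other Spark-based tests
    }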

Contributor Author (unical1988):

Yes, renaming the test class with the IT prefix solved it.

.dataType(InternalType.RECORD)
.fields(subFields)
-        .isNullable(isNullable(schema.asGroupType()))
+        .isNullable(
+            isNullable(schema.asGroupType())) // false isNullable(schema.asGroupType()) (TODO causing
Contributor:

Right now this is returning that the top level record schema is nullable which should not be the case. Do we need some special handling for this case?

Contributor Author (unical1988):

As is, it does not seem to cause any error

Contributor:

This is causing errors when syncing to Hudi since it is sending a union of null and the actual schema to the target instead of simply the record schema.

Contributor Author (unical1988):

I will recheck, but the CI does not fail on that.

Contributor Author (unical1988):

It does indeed seem that this requires special handling when the passed schema is the top-level record.
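One possible shape for that special handling, assuming the parquet MessageType is what reaches this method for the top-level schema (illustrative only):

    // The root record is never wrapped in a nullable union; nested groups keep the existing check.
    boolean nullable = !(schema instanceof MessageType) && isNullable(schema.asGroupType());
    // ... then pass `nullable` into the existing builder chain instead of calling the check inline:
    //     .isNullable(nullable)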

Comment on lines 122 to 147
                columnMetaData.getPrimitiveType().getPrimitiveTypeName()
                        == PrimitiveType.PrimitiveTypeName.BINARY // TODO how about DECIMAL, JSON, BSON and ENUM logicalTypes?
                    ? columnMetaData.getPrimitiveType().getLogicalTypeAnnotation() != null
                        ? columnMetaData.getPrimitiveType().getLogicalTypeAnnotation().toString().equals("STRING")
                            ? new String(
                                ((Binary) columnMetaData.getStatistics().genericGetMin()).getBytes(),
                                StandardCharsets.UTF_8)
                            : columnMetaData.getStatistics().genericGetMin()
                        : columnMetaData.getStatistics().genericGetMin()
                    : columnMetaData.getStatistics().genericGetMin(), // if stats are string convert to litteraly a string stat and
Contributor:

Let's move this conversion logic into a helper method and then add a new GH Issue for handling any other logical types.

Contributor Author (unical1988):

Ok
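A sketch of what the extracted helper could look like, reusing exactly the calls from the inline ternary above; the method name and its exact placement are assumptions:

    private static Object convertBinaryStatIfString(ColumnChunkMetaData columnMetaData, Object statValue) {
      PrimitiveType primitiveType = columnMetaData.getPrimitiveType();
      LogicalTypeAnnotation logicalType = primitiveType.getLogicalTypeAnnotation();
      if (primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY
          && logicalType != null
          && "STRING".equals(logicalType.toString())) {
        // BINARY columns annotated as STRING are decoded to a Java String; other logical types
        // (DECIMAL, JSON, BSON, ENUM) are left to the follow-up GitHub issue.
        return new String(((Binary) statValue).getBytes(), StandardCharsets.UTF_8);
      }
      return statValue;
    }

The call site then becomes convertBinaryStatIfString(columnMetaData, columnMetaData.getStatistics().genericGetMin()), and the same helper can be reused for the max value.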

partitionValueExtractor.extractSchemaForParquetPartitions(
parquetMetadataExtractor.readParquetMetadata(hadoopConf, file.getPath()),
file.getPath().toString())),
parentPath.toString());
} catch (java.io.IOException e) {
Contributor:

I think this is just hiding the exception currently and we need to fix this.

Contributor Author (unical1988):

How is it hiding the exception? Which exception?

Contributor:

It catches the exception but does not throw any new exception. The user will not know of the errors.
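For example, rethrowing keeps the failure visible to the caller; ReadException is assumed here to be the project's unchecked wrapper for source read errors (any RuntimeException subclass would serve):

    } catch (java.io.IOException e) {
      // Do not swallow the failure: wrap and rethrow so the sync surfaces the error to the user.
      throw new ReadException("Unable to read parquet metadata for " + file.getPath(), e);
    }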

@@ -188,6 +306,10 @@ private ConversionSourceProvider<?> getConversionSourceProvider(String sourceTab
throw new IllegalArgumentException("Unsupported source format: " + sourceTableFormat);
}
}
/*
test for Parquet file conversion
Contributor:

Let's remove these changes then if they are not required

Comment on lines 32 to 34
public static String getPartitionPathValue(Path tableBasePath, Path filePath) {
return getPartitionPath(tableBasePath, filePath).split("=")[1];
}
Contributor:

What if there are multiple = in the path? Like for /some/path/year=2025/month=08/

Contributor Author (unical1988):

Nested partitions are not handled by that helper; it will need some tweaking.
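A possible tweak for multi-level Hive-style paths such as year=2025/month=08 (the method name and its home in a parquet utils class are assumptions):

    public static Map<String, String> getPartitionValues(Path tableBasePath, Path filePath) {
      Map<String, String> values = new LinkedHashMap<>();
      String relativePath = getPartitionPath(tableBasePath, filePath); // e.g. "year=2025/month=08"
      for (String segment : relativePath.split("/")) {
        int separator = segment.indexOf('=');
        if (separator > 0) {
          // Preserve directory order: partition column name -> raw value from the path.
          values.put(segment.substring(0, separator), segment.substring(separator + 1));
        }
      }
      return values;
    }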

Contributor:

Also this should be moved to the parquet utils directly since Hudi has handling for multiple partition fields already

Contributor Author (unical1988), Aug 25, 2025:

Or can I use the Hudi logic for that in parquet?

Contributor Author (unical1988), Aug 25, 2025:

Btw, since the source field name (timestamp) is different from the partition column (year), I am not sure how the Hudi logic extracts the partition values from the path?

Contributor Author (unical1988), Aug 25, 2025:

Note that I cannot set sourceField to the year column, since the year column is not part of the schema, which makes the Hudi logic for extracting partition values from the path unusable.

Contributor:

We use the user-provided partition spec for Hudi to determine the mapping of a field in the data to the partition path pattern. For example, a field in the data called ts is mapped to year=YYYY/month=MM/day=DD with a config ts:DAY:year=YYYY/month=MM/day=DD.
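Illustratively, such a spec string splits into three parts; splitting with a limit keeps any '=' characters inside the path pattern intact (this is just a reading of the config format described above, not the actual parser):

    String spec = "ts:DAY:year=YYYY/month=MM/day=DD";
    String[] parts = spec.split(":", 3);
    String sourceField = parts[0];  // "ts" (the field in the data)
    String granularity = parts[1];  // "DAY"
    String pathPattern = parts[2];  // "year=YYYY/month=MM/day=DD"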

Contributor Author (unical1988):

The current (Hudi) partitioning logic (reflected in PathBasedPartitionValuesExtractor) rather allows for a config in the following form: ts:DAY:yyyy-mm-dd

Contributor Author (unical1988):

I changed the partitioning for parquet while keeping functions from Hudi; I think it will be able to deal with nested partitions.


import org.apache.xtable.GenericTable;

public class TestSparkParquetTable implements GenericTable<Group, String> {
Contributor:

Is this being used by the tests?

Contributor Author (unical1988):

Yes
