Conversation

Contributor
@aihuaxu aihuaxu commented Jul 27, 2025

Rationale for this change

This is to test the Variant implementation in Parquet Java against the test cases generated in apache/parquet-testing#91 (test cases from Iceberg) and apache/parquet-testing#94 (test cases from Go).

Variant Implementation Compatibility

Overall, the results demonstrate that the Variant implementation is compatible with Iceberg.

Notable observations:

  • Missing logical type annotation in Parquet files:
    Since the Parquet test files do not yet include the variant logical type annotation (this test is pending a Parquet-java release that writes the annotation), a temporary workaround was added to the code.

  • Decimal type inconsistency in Iceberg:
    Encountered a known issue with decimal types in Iceberg (iceberg#13692). Regenerated test data accordingly. The Variant implementation in Parquet-java handles decimal encoding correctly.

  • Error message discrepancies:
    While error cases throw different messages compared to Iceberg, they fail as expected with appropriate exceptions.
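The temporary workaround mentioned above is not shown in the thread; a minimal sketch of the idea in plain Java might look like the following. The structural check (treating a group with binary `metadata` and `value` fields as a Variant) and all names here are assumptions for illustration, not the actual parquet-java code.

```java
import java.util.List;

public class VariantAnnotationWorkaround {
    // Sketch of the described workaround: until the test files are
    // regenerated with a parquet-java release that writes the variant
    // logical type annotation, a group is treated as a Variant when it
    // structurally looks like one, i.e. it has binary "metadata" and
    // "value" fields. Field names are assumptions.
    static boolean looksLikeUnannotatedVariant(List<String> binaryFieldNames) {
        return binaryFieldNames.containsAll(List.of("metadata", "value"));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUnannotatedVariant(List.of("metadata", "value"))); // true
        System.out.println(looksLikeUnannotatedVariant(List.of("id")));                // false
    }
}
```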

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@aihuaxu aihuaxu changed the title Test Variant read from files Test Variant implementation with external test cases Jul 27, 2025
model.setField(record, "value", valuePos, builder.encodedValue());
parent.add(record);
Variant variant = builder.build();
parent.add(variant);
Contributor Author
@cashmand Would this convert to a Variant object here?

Contributor
I'm not sure I understand the question. Are you saying that we should be setting the Variant object directly here, rather than a (metadata, value) record? That might be reasonable; my understanding of Avro, and how it's meant to be used, is pretty weak. cc @rdblue

Contributor Author
That's right. It seems we should return a Variant rather than a (metadata, value) record.

Contributor Author
I want to confirm with you, and then I can make the change.

Contributor
It sounds fine to me; I'd suggest getting @rdblue to approve. Maybe we want to do something similar on the write side, where I think it also currently expects a (metadata, value) pair from Avro rather than a Variant object.

Contributor Author
Confirmed with @rdblue offline: it should return a variant. Let me work on the fix and address the write side as well.

Contributor Author
Reading the mapping in https://iceberg.apache.org/spec/#avro, we are actually expecting a record of metadata and value in Avro for variant.

Looks like we just need to update the test rather than changing the code. @rdblue and @cashmand

(screenshot: the Avro type mapping table for variant from the Iceberg spec)
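A plain-Java stand-in for the shape that mapping describes is sketched below: per the Iceberg spec's Avro mapping, a variant is represented as a record with two binary fields, metadata and value. This is illustrative only and is not the actual parquet-java or Iceberg class.

```java
import java.nio.ByteBuffer;

public class VariantAvroShape {
    // Stand-in for the Avro mapping: a variant maps to a record of
    // (metadata, value), both binary. The byte contents below are
    // arbitrary placeholders, not real variant encodings.
    public record VariantRecord(ByteBuffer metadata, ByteBuffer value) {}

    public static void main(String[] args) {
        VariantRecord v = new VariantRecord(
                ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00}), // encoded metadata (placeholder)
                ByteBuffer.wrap(new byte[] {0x00}));            // encoded value (placeholder)
        System.out.println(v.metadata().remaining() + " " + v.value().remaining());
    }
}
```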

@aihuaxu aihuaxu changed the title Test Variant implementation with external test cases Test Variant implementation with external test cases (not ready for merge) Jul 29, 2025
@aihuaxu aihuaxu force-pushed the test-variant-read-file branch from 6aed836 to d07879a Compare July 30, 2025 01:45
Object record = model.newRecord(null, avroSchema);
model.setField(record, "metadata", metadataPos, metadata.getEncodedBuffer());
model.setField(record, "value", valuePos, builder.encodedValue());
Object record = model.newRecord(null, VARIANT_SCHEMA);
Contributor Author
I think we need this change so that we produce a fixed schema of value and metadata rather than reusing the original Avro schema.

This also fixes the issue that the value field may be missing from the schema, since that is allowed (typed_value exists but value doesn't). In that case we shouldn't read the value field for its position, and the output schema should be fixed.

Contributor
We can't change the expected Avro schema. We can reject it, but the contract is to use the schema that was passed in. I think the earlier code is correct.

Contributor Author
Here we are passing in the shredded schema, which may or may not have a value or typed_value field.

But we want to generate a variant which maps to a record of value and metadata fields, right?

(screenshot: the shredded schema under discussion)

Contributor

@cashmand commented Aug 6, 2025
I think the intent was that the parquet schema should be the shredding schema, but the Avro schema provided for Variant should always be a record of (value, metadata), even if the parquet schema doesn't contain value. So as @rdblue said, we should reject the schema if that isn't what was provided for a Variant column.
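The rejection behavior suggested here could be sketched as follows. This is a hedged plain-Java illustration: the check operates on field names only, and the exception type and message are assumptions rather than what parquet-java actually throws.

```java
import java.util.List;

public class VariantSchemaValidation {
    // Sketch of the suggested contract: rather than substituting a fixed
    // schema, reject any Avro schema for a Variant column that is not a
    // record of exactly (metadata, value). Field names and exception type
    // are assumptions for illustration.
    static void requireVariantRecord(List<String> fieldNames) {
        if (fieldNames.size() != 2
                || !fieldNames.contains("metadata")
                || !fieldNames.contains("value")) {
            throw new IllegalArgumentException(
                    "Variant column requires an Avro record of (metadata, value), got: " + fieldNames);
        }
    }

    public static void main(String[] args) {
        requireVariantRecord(List.of("metadata", "value")); // accepted
        try {
            // A shredded schema (typed_value instead of value) is rejected.
            requireVariantRecord(List.of("metadata", "typed_value"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```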

Contributor Author
@cashmand I followed up with Ryan and got more context around Avro.

For some reader paths, it seems we are passing the Avro schema as a shredded schema, which should not happen. Let me dig into how that happened. The Avro schema here should always be a record of (value, metadata).

@aihuaxu aihuaxu changed the title Test Variant implementation with external test cases (not ready for merge) Test Variant implementation with external test cases Jul 30, 2025
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.LocalInputFile;
import org.assertj.core.api.Assertions;
Contributor
This uses JUnit 5. Does that work in Parquet?

Contributor Author
I have added the dependency in pom.xml and it seems to be working.

@aihuaxu aihuaxu requested review from cashmand and rdblue August 5, 2025 19:51
@aihuaxu aihuaxu force-pushed the test-variant-read-file branch 2 times, most recently from 5f34456 to 950b1b4 Compare August 12, 2025 00:17