Test Variant implementation with external test cases #3258
base: master
Conversation
| model.setField(record, "value", valuePos, builder.encodedValue()); | ||
| parent.add(record); | ||
| Variant variant = builder.build(); | ||
| parent.add(variant); |
@cashmand Would this convert to a Variant object here?
I'm not sure I understand the question. Are you saying that we should be setting the Variant object directly here, rather than a (metadata, value) record? That might be reasonable; my understanding of Avro, and how it's meant to be used, is pretty weak. cc @rdblue
That's right. It seems we should return a Variant rather than a (metadata, value) record.
I wanted to confirm with you first, and then I can make the change.
It sounds fine to me; I'd suggest getting @rdblue to approve. We may want to do something similar on the write side, where I think it also currently expects a (metadata, value) pair from Avro rather than a Variant object.
Confirmed with @rdblue offline: it should return a Variant. Let me work on the fix and address the write side as well.
Reading the mapping in https://iceberg.apache.org/spec/#avro, we are actually expected to produce a record of metadata and value in Avro for variant.
It looks like we just need to update the test rather than change the code. @rdblue @cashmand
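For reference, a minimal sketch of the (metadata, value) record shape implied by that spec mapping, assuming an unshredded variant with two required bytes fields; the class and method names here are illustrative, not the PR's actual code:

```java
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class VariantRecordSketch {
  // Hypothetical Avro schema for an unshredded variant: a record with two
  // required bytes fields, matching the Iceberg spec's Avro mapping.
  static final Schema VARIANT_RECORD = SchemaBuilder.record("variant")
      .fields()
      .requiredBytes("metadata")
      .requiredBytes("value")
      .endRecord();

  // Builds the (metadata, value) record that the read path would hand back,
  // per the conclusion of this thread, instead of a Variant object.
  static GenericRecord toRecord(ByteBuffer metadata, ByteBuffer value) {
    GenericRecord record = new GenericData.Record(VARIANT_RECORD);
    record.put("metadata", metadata);
    record.put("value", value);
    return record;
  }
}
```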
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java (outdated; resolved)
Object record = model.newRecord(null, avroSchema);
model.setField(record, "metadata", metadataPos, metadata.getEncodedBuffer());
model.setField(record, "value", valuePos, builder.encodedValue());
Object record = model.newRecord(null, VARIANT_SCHEMA);
I think we need this change so that we produce a fixed schema of value and metadata rather than reading the original Avro schema.
It also fixes the issue that the value field may be missing from the schema, which is allowed (i.e., typed_value exists but value doesn't). In that case we can't rely on reading the value field's position, so the output schema should be fixed.
We can't change the expected Avro schema. We can reject it, but the contract is to use the schema that was passed in. I think the earlier code is correct.
I think the intent was that the Parquet schema should be the shredding schema, but the Avro schema provided for Variant should always be a record of (value, metadata), even if the Parquet schema doesn't contain value. So as @rdblue said, we should reject the schema if that isn't what was provided for a Variant column.
@cashmand I followed up with Ryan and got more context around Avro.
For some reader paths it seems we are passing the Avro schema as a shredded schema, which should not happen. Let me look into how that happened. The Avro schema here should always be a record of (value, metadata).
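A minimal sketch of the kind of check being discussed, assuming the variant column's Avro schema must be a record containing bytes fields named metadata and value; the method name and error wording are illustrative, not the PR's actual code:

```java
import org.apache.avro.Schema;

public class VariantSchemaCheck {
  // Illustrative validation: reject an Avro schema that is not a record of
  // (metadata: bytes, value: bytes) for a variant column, rather than
  // silently substituting a fixed schema.
  static void validateVariantSchema(Schema avroSchema) {
    if (avroSchema.getType() != Schema.Type.RECORD) {
      throw new IllegalArgumentException("Variant Avro schema must be a record: " + avroSchema);
    }
    Schema.Field metadata = avroSchema.getField("metadata");
    Schema.Field value = avroSchema.getField("value");
    if (metadata == null || metadata.schema().getType() != Schema.Type.BYTES
        || value == null || value.schema().getType() != Schema.Type.BYTES) {
      throw new IllegalArgumentException(
          "Variant Avro schema must have bytes fields 'metadata' and 'value': " + avroSchema);
    }
  }
}
```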
parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java (outdated; resolved)
parquet-avro/src/main/java/org/apache/parquet/avro/AvroVariantConverter.java (resolved)
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.LocalInputFile;
import org.assertj.core.api.Assertions;
This uses JUnit 5. Does that work in Parquet?
I have added the dependency in pom.xml and it seems to be working.
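For what it's worth, a minimal sketch of the JUnit 5 / AssertJ style in question, reading a Parquet file through AvroParquetReader; the test class name and file path are made up for illustration and are not the PR's actual test:

```java
import java.nio.file.Path;

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.LocalInputFile;
import org.assertj.core.api.Assertions;
import org.junit.jupiter.api.Test;

class VariantReadSmokeTest {
  @Test
  void readsFirstRecord() throws Exception {
    // Hypothetical test file path; the real tests use the parquet-testing cases.
    InputFile input = new LocalInputFile(Path.of("target/test-data/variant.parquet"));
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(input).build()) {
      GenericRecord first = reader.read();
      Assertions.assertThat(first).isNotNull();
    }
  }
}
```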
parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java (outdated; resolved)

Rationale for this change
This is to test the Variant implementation in parquet-java against the test cases generated in apache/parquet-testing#91 (test cases from Iceberg) and apache/parquet-testing#94 (test cases from the Go implementation).
Variant Implementation Compatibility
Overall, the results demonstrate that the Variant implementation is compatible with Iceberg.
Notable observations:
Missing logical type annotation in Parquet files:
Since the Parquet test files do not yet include the variant logical type annotation (pending a parquet-java release that includes the annotation), a temporary workaround was added to the code (see the sketch after this list).
Decimal type inconsistency in Iceberg:
Encountered a known issue with decimal types in Iceberg (iceberg#13692) and regenerated the test data accordingly. The Variant implementation in parquet-java handles decimal encoding correctly.
Error message discrepancies:
While the error cases throw different messages than Iceberg does, they fail as expected with appropriate exceptions.
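As a rough illustration of the kind of workaround described above for the missing annotation, and purely an assumption about the approach rather than the PR's actual change: a reader could treat a Parquet group that has binary metadata and value fields as a variant even when the logical type annotation is absent.

```java
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

public class VariantGroupGuess {
  // Heuristic used only as an illustration: a group that has binary
  // "metadata" and "value" fields is assumed to be a variant, since the
  // test files predate the variant logical type annotation.
  static boolean looksLikeVariant(GroupType group) {
    return hasBinaryField(group, "metadata") && hasBinaryField(group, "value");
  }

  private static boolean hasBinaryField(GroupType group, String name) {
    if (!group.containsField(name)) {
      return false;
    }
    Type field = group.getType(name);
    return field.isPrimitive()
        && field.asPrimitiveType().getPrimitiveTypeName()
            == PrimitiveType.PrimitiveTypeName.BINARY;
  }
}
```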
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?