[Parquet]Fix writing columns with all null values after row group flush #24555

Open
wants to merge 1 commit into base: master

@hantangwangd
Member

hantangwangd commented Feb 13, 2025

Description

This PR fixes a bug in the Parquet writer that caused the flaky test in #22907. The root cause is that when Parquet writes a data page for a column that is all null right after a row-group-level flush, the column is handled incorrectly: the page is still marked as dictionary-encoded, but no dictionary page is output.
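
For illustration only, the symptom can be sketched as a consistency check on a toy model of a column chunk. All types and names below (`ColumnChunkCheck`, `DataPage`, `ColumnChunk`, `isConsistent`) are hypothetical and are not Presto or parquet-mr classes; the sketch assumes only what is stated above, namely that a page claims dictionary encoding while its chunk carries no dictionary page.

```java
import java.util.Collections;
import java.util.List;

// Toy model only: every type here is hypothetical, not a Presto or parquet-mr class.
final class ColumnChunkCheck {
    enum Encoding { PLAIN, DICTIONARY }

    // A data page as a reader sees it: its declared encoding and its non-null value count.
    static final class DataPage {
        final Encoding encoding;
        final int nonNullValues;

        DataPage(Encoding encoding, int nonNullValues) {
            this.encoding = encoding;
            this.nonNullValues = nonNullValues;
        }
    }

    // One column's data within a row group: an optional dictionary page plus data pages.
    static final class ColumnChunk {
        final boolean hasDictionaryPage;
        final List<DataPage> pages;

        ColumnChunk(boolean hasDictionaryPage, List<DataPage> pages) {
            this.hasDictionaryPage = hasDictionaryPage;
            this.pages = pages;
        }
    }

    // A chunk is self-consistent only if every dictionary-encoded page
    // has a dictionary page in the same chunk to resolve against.
    static boolean isConsistent(ColumnChunk chunk) {
        for (DataPage page : chunk.pages) {
            if (page.encoding == Encoding.DICTIONARY && !chunk.hasDictionaryPage) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First row group: the column has values and a dictionary page is written.
        ColumnChunk first = new ColumnChunk(true,
                Collections.singletonList(new DataPage(Encoding.DICTIONARY, 100)));
        // Row group written right after the flush: the page is all null and no dictionary
        // page is output, yet the page still claims dictionary encoding.
        ColumnChunk allNull = new ColumnChunk(false,
                Collections.singletonList(new DataPage(Encoding.DICTIONARY, 0)));

        System.out.println("first row group consistent: " + isConsistent(first));
        System.out.println("all-null row group consistent: " + isConsistent(allNull));
    }
}
```

In this toy model the first chunk passes the check and the all-null chunk fails it, which mirrors the mismatch described above.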

Motivation and Context

Fix #22907

Impact

N/A

Test Plan

  • Added the test case TestParquetWriter.testWriteAllNullDataPageAfterRowGroupFlush(), which reproduces the issue without this change.
  • Set the invocation count for testSingleLevelSchemaArrayOfArrayOfStructOfArray() to 1024 and verified that the issue no longer appears.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@jaystarshot
Member

The core reason is that, when Parquet writes a data page with a column that is all null right after row group level flush, the handling of that column is incorrect

It sounds like the reset() method of the columnWriter is incorrect in this case? maybe that would be a better fix?

@hantangwangd
Member Author

It sounds like the reset() method of the columnWriter is incorrect in this case? maybe that would be a better fix?

Yes, I think you're right. The reset() of a ColumnWriter is only invoked on a row-group-level flush, so we can reset and reuse the PrimitiveColumnWriter and fix its reset logic to recreate the ValuesWriter inside it. I will rework the fix this way.

@hantangwangd
Member Author

However, after checking the code and comments in parquet-column, I believe the ValuesWriter should not be reset and reused after each row-group-level flush; see here. It seems that Parquet is designed this way.

Successfully merging this pull request may close these issues.

Flaky test TestParquetReader.testStructOfTwoNestedArrays