
Conversation

mapleFU (Member) commented Sep 1, 2025

Which issue does this PR close?

Rationale for this change

Do not compress a v2 data page when compression quality is poor (the compressed size is greater than or equal to the uncompressed size).

What changes are included in this PR?

Discard the compressed output when it is not smaller than the uncompressed data.

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No

github-actions bot added the parquet (Changes to the parquet crate) label Sep 1, 2025
cmpr.compress(&values_data.buf, &mut buffer)?;
if uncompressed_size <= buffer.len() - buffer_len {
mapleFU (Member Author):

This can also be a "score", like if uncompressed_size <= (buffer.len() - buffer_len) * 0.9 {

Contributor:

I'd suggest maybe going the opposite direction. If compression doesn't give say a 10% size reduction, then don't compress...it will be faster to read. 10% might be too much...maybe we could make it configurable.

Anyway, that's probably out of scope for this PR. Let's just go with the simple test for now...it's still an improvement.

Contributor:

Yeah, I do think that would need to be configurable, as I think some people would treat a 10% size increase as a regression even if they got a faster reader.

A good follow-on ticket for sure.
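
For reference, the ratio check being discussed could look like the sketch below; `worth_compressing` and `min_reduction` are hypothetical names for illustration, not parquet-rs APIs.

```rust
/// Hypothetical helper: keep the compressed bytes only when they beat the
/// uncompressed size by at least `min_reduction` (e.g. 0.10 for "10% smaller").
fn worth_compressing(uncompressed_size: usize, compressed_size: usize, min_reduction: f64) -> bool {
    (compressed_size as f64) < (1.0 - min_reduction) * (uncompressed_size as f64)
}

fn main() {
    // 1000 -> 950 bytes is only a 5% reduction; with a 10% floor, skip compression.
    assert!(!worth_compressing(1000, 950, 0.10));
    // 1000 -> 850 bytes is a 15% reduction; keep the compressed bytes.
    assert!(worth_compressing(1000, 850, 0.10));
    // min_reduction = 0.0 reduces to the simple check in this PR: keep the
    // compressed bytes only when they are strictly smaller.
    assert!(!worth_compressing(1000, 1000, 0.0));
}
```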

mapleFU (Member Author) commented Sep 1, 2025

This needs some added tests; I found a "missing required field PageHeader.uncompressed_page_size" bug here

mapleFU marked this pull request as draft September 1, 2025 14:02
mapleFU marked this pull request as ready for review September 2, 2025 02:12
mapleFU (Member Author) commented Sep 2, 2025

> This needs some added tests; I found a "missing required field PageHeader.uncompressed_page_size" bug here

This is caused by another bug on the user side; it's not related to this problem :-(

etseidl (Contributor) left a comment:

I think this makes a lot of sense, and seems like the intended use of the is_compressed flag.
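
As a point of reference, here is a minimal sketch of that fallback idea. The function and names are hypothetical, not the actual parquet-rs writer internals; the only detail taken from the format is that `DataPageHeaderV2` carries an `is_compressed` flag.

```rust
// Sketch only (hypothetical names): compress the page body, and fall back to
// the raw bytes when compression does not shrink it, recording the outcome so
// DataPageHeaderV2.is_compressed can be set accordingly.
fn encode_page_body<F>(uncompressed: &[u8], compressor: Option<F>) -> (Vec<u8>, bool)
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    if let Some(compress) = compressor {
        let compressed = compress(uncompressed);
        if compressed.len() < uncompressed.len() {
            // Compression paid off: write the compressed bytes, is_compressed = true.
            return (compressed, true);
        }
        // Compression made the page as large or larger: keep the raw bytes and
        // set is_compressed = false so readers can skip decompression.
    }
    (uncompressed.to_vec(), false)
}

fn main() {
    // Toy "codec" that always adds a 4-byte header, so tiny inputs grow.
    let toy_codec = |input: &[u8]| {
        let mut out = vec![0u8; 4];
        out.extend_from_slice(input);
        out
    };
    let (bytes, is_compressed) = encode_page_body(b"abc", Some(toy_codec));
    assert_eq!(bytes, b"abc".to_vec());
    assert!(!is_compressed);
}
```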

mapleFU force-pushed the page-v2-no-compress-if-size-too-lage branch from c35ae85 to 0e198e9 September 5, 2025 14:43
mapleFU (Member Author) commented Sep 5, 2025

I've added a test for this, would you mind taking a look again?

etseidl (Contributor) left a comment:

The test looks good! Thanks

alamb (Contributor) left a comment:

looks good to me -- thank you @mapleFU and @etseidl


get_test_column_writer::<ByteArrayType>(Box::new(page_writer), 0, 0, Arc::new(props));

// Create small, simple data that Snappy compression will likely increase in size
// due to compression overhead for very small data
Contributor:

👍
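
As a standalone illustration of the point in that test comment (this assumes the `snap` crate for raw Snappy and is not part of the test itself): very small inputs typically come back larger than they went in because of Snappy's length prefix and tag bytes.

```rust
// Standalone sketch (assumes the `snap` crate): raw Snappy adds a varint
// length prefix plus literal tag bytes, so a tiny input like this grows,
// which is exactly the case the new check handles by keeping the page
// uncompressed.
fn main() {
    let data = b"abc";
    let compressed = snap::raw::Encoder::new()
        .compress_vec(data)
        .expect("snappy compression failed");
    println!(
        "uncompressed = {} bytes, compressed = {} bytes",
        data.len(),
        compressed.len()
    );
    // With the check from this PR, a page like this would be written
    // uncompressed whenever compressed.len() >= data.len().
}
```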

alamb merged commit 0c7cb2a into apache:main Sep 8, 2025
16 checks passed
alamb (Contributor) commented Sep 8, 2025

Thanks again @mapleFU and @etseidl

mapleFU deleted the page-v2-no-compress-if-size-too-lage branch September 8, 2025 14:20
JigaoLuo (Contributor) commented Sep 8, 2025

Hello everyone,

I just came across this PR and noticed that most of the discussion is happening here, so I’d like to continue the conversation in this thread rather than on the issue page.

I believe the direction of this PR aligns well with a previous issue we discussed in XiangpengHao/liquid-cache#227. I've been working on my own parquet-rewrite tool that touches on similar ideas, particularly the score metric: a kind of breakeven point that decides whether compression should be applied. The goal of this tool is to help the reader skip unnecessary compression that adds overhead without delivering a meaningful size reduction, ultimately improving read performance.

Setting this score is quite tricky and empirical. For now, I've set it at 10%, mainly to catch cases where compression offers no size benefit at all. Here is an example of such a case (at the level of a full column):

[screenshot omitted: a column where compression offers no size benefit]

As a side note, I’ve also made some patches to Xiangpeng’s viewer tool, which I use to inspect my generated Parquet files. This has been instrumental in iterating on my reader implementation.

mapleFU (Member Author) commented Sep 8, 2025

Generally, different levels of data will have different distributions, and, as with what a query optimizer faces, data changes (like frequent insertion or insert-overwrite) might require re-sampling the data. So I think a runtime config here would differ from other configs.

Z-order clustering or other clustering might also change the distribution score. So my current thinking is: a user config could set its own score, maybe a different score for freshly ingested data (which might need fast writes) versus well-clustered data (which might benefit from better compression). 10% is a good intuition, but it's hard to say it is the right value. When the compressed size is greater than the uncompressed size, it is unambiguously worse.
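
If such a score ever becomes a writer option, one hypothetical shape for the knob (none of these names exist in parquet-rs today) might be:

```rust
// Hypothetical config sketch, not an existing parquet-rs API: the minimum size
// reduction a codec must achieve before the compressed bytes are kept, with
// the idea that different write paths could pick different values.
#[derive(Clone, Copy, Debug)]
pub struct CompressionThreshold {
    /// Fraction in [0.0, 1.0): 0.0 keeps any strictly smaller result (the
    /// behavior of this PR); 0.10 requires at least a 10% reduction.
    pub min_reduction: f64,
}

impl CompressionThreshold {
    /// Freshly ingested data: prioritize write speed, demand a bigger win.
    pub const FAST_INGEST: Self = Self { min_reduction: 0.10 };
    /// Well-clustered / compacted data: accept smaller gains to save space.
    pub const WELL_CLUSTERED: Self = Self { min_reduction: 0.0 };
}
```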

JigaoLuo (Contributor) commented Sep 8, 2025

I agree. I’ve been thinking more about this, especially since my focus is primarily on cuDF rather than DataFusion.

At a high level, it's a trade-off between computation (specifically decompression) and I/O (file size reduction). In CPU scenarios like DataFusion, I believe reading compressed Parquet files tends to be computation-bound.

Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues: Parquet: Do not compress v2 data page when compress is bad quality

4 participants