
Conversation

mapleFU (Member) commented Sep 1, 2025

Which issue does this PR close?

Rationale for this change

Do not compress a v2 data page when compression quality is poor (the compressed size is greater than or equal to the uncompressed size).

What changes are included in this PR?

Discard the compressed output when it is not smaller than the uncompressed data.

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No

github-actions bot added the parquet (Changes to the parquet crate) label Sep 1, 2025
cmpr.compress(&values_data.buf, &mut buffer)?;
if uncompressed_size <= buffer.len() - buffer_len {
mapleFU (Member Author):

This can also be a "score", like if uncompressed_size <= (buffer.len() - buffer_len) * 0.9 {

Contributor:

I'd suggest maybe going the opposite direction. If compression doesn't give say a 10% size reduction, then don't compress...it will be faster to read. 10% might be too much...maybe we could make it configurable.

Anyway, that's probably out of scope for this PR. Let's just go with the simple test for now...it's still an improvement.

Contributor:

Yeah, I do think that would need to be configurable, as I think some people would treat a 10% size increase as a regression even if they got a faster reader.

A good follow-on ticket for sure.
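
For reference, the ratio check being discussed could look like the sketch below; `worth_compressing` and `min_reduction` are hypothetical names for illustration, not parquet-rs APIs.

```rust
/// Hypothetical helper: keep the compressed bytes only when they beat the
/// uncompressed size by at least `min_reduction` (e.g. 0.10 for "10% smaller").
fn worth_compressing(uncompressed_size: usize, compressed_size: usize, min_reduction: f64) -> bool {
    (compressed_size as f64) < (1.0 - min_reduction) * (uncompressed_size as f64)
}

fn main() {
    // 1000 -> 950 bytes is only a 5% reduction; with a 10% floor, skip compression.
    assert!(!worth_compressing(1000, 950, 0.10));
    // 1000 -> 850 bytes is a 15% reduction; keep the compressed bytes.
    assert!(worth_compressing(1000, 850, 0.10));
    // min_reduction = 0.0 reduces to the simple check in this PR: keep the
    // compressed bytes only when they are strictly smaller.
    assert!(!worth_compressing(1000, 1000, 0.0));
}
```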

mapleFU (Member Author) commented Sep 1, 2025

This needs some added tests; I found a "missing required field PageHeader.uncompressed_page_size" bug here

mapleFU marked this pull request as draft September 1, 2025 14:02
mapleFU marked this pull request as ready for review September 2, 2025 02:12
mapleFU (Member Author) commented Sep 2, 2025

> This needs some added tests; I found a "missing required field PageHeader.uncompressed_page_size" bug here

This is caused by another bug on the user side; it's not related to this problem :-(

etseidl (Contributor) left a comment:

I think this makes a lot of sense, and seems like the intended use of the is_compressed flag.
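
As a point of reference, here is a minimal sketch of that fallback idea. The function and names are hypothetical, not the actual parquet-rs writer internals; the only detail taken from the format is that `DataPageHeaderV2` carries an `is_compressed` flag.

```rust
// Sketch only (hypothetical names): compress the page body, and fall back to
// the raw bytes when compression does not shrink it, recording the outcome so
// DataPageHeaderV2.is_compressed can be set accordingly.
fn encode_page_body<F>(uncompressed: &[u8], compressor: Option<F>) -> (Vec<u8>, bool)
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    if let Some(compress) = compressor {
        let compressed = compress(uncompressed);
        if compressed.len() < uncompressed.len() {
            // Compression paid off: write the compressed bytes, is_compressed = true.
            return (compressed, true);
        }
        // Compression made the page as large or larger: keep the raw bytes and
        // set is_compressed = false so readers can skip decompression.
    }
    (uncompressed.to_vec(), false)
}

fn main() {
    // Toy "codec" that always adds a 4-byte header, so tiny inputs grow.
    let toy_codec = |input: &[u8]| {
        let mut out = vec![0u8; 4];
        out.extend_from_slice(input);
        out
    };
    let (bytes, is_compressed) = encode_page_body(b"abc", Some(toy_codec));
    assert_eq!(bytes, b"abc".to_vec());
    assert!(!is_compressed);
}
```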

mapleFU force-pushed the page-v2-no-compress-if-size-too-lage branch from c35ae85 to 0e198e9 September 5, 2025 14:43
mapleFU (Member Author) commented Sep 5, 2025

I've added a test for this, would you mind taking a look again?

etseidl (Contributor) left a comment:

The test looks good! Thanks

alamb (Contributor) left a comment:

looks good to me -- thank you @mapleFU and @etseidl


get_test_column_writer::<ByteArrayType>(Box::new(page_writer), 0, 0, Arc::new(props));

// Create small, simple data that Snappy compression will likely increase in size
// due to compression overhead for very small data
Contributor:

👍
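
As a standalone illustration of the point in that test comment (this assumes the `snap` crate for raw Snappy and is not part of the test itself): very small inputs typically come back larger than they went in because of Snappy's length prefix and tag bytes.

```rust
// Standalone sketch (assumes the `snap` crate): raw Snappy adds a varint
// length prefix plus literal tag bytes, so a tiny input like this grows,
// which is exactly the case the new check handles by keeping the page
// uncompressed.
fn main() {
    let data = b"abc";
    let compressed = snap::raw::Encoder::new()
        .compress_vec(data)
        .expect("snappy compression failed");
    println!(
        "uncompressed = {} bytes, compressed = {} bytes",
        data.len(),
        compressed.len()
    );
    // With the check from this PR, a page like this would be written
    // uncompressed whenever compressed.len() >= data.len().
}
```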

alamb merged commit 0c7cb2a into apache:main Sep 8, 2025
16 checks passed
alamb (Contributor) commented Sep 8, 2025

Thanks again @mapleFU and @etseidl

mapleFU deleted the page-v2-no-compress-if-size-too-lage branch September 8, 2025 14:20
JigaoLuo (Contributor) commented Sep 8, 2025

Hello everyone,

I just came across this PR and noticed that most of the discussion is happening here, so I’d like to continue the conversation in this thread rather than on the issue page.

I believe the direction of this PR aligns well with a previous issue we discussed in XiangpengHao/liquid-cache#227. I've been working on my own parquet-rewrite tool that touches on similar ideas, particularly the score metric: a kind of breakeven point that decides whether compression should be applied. The goal of this tool is to help the reader skip unnecessary compression that adds overhead without delivering a meaningful size reduction, ultimately improving read performance.

Setting this score is quite tricky and empirical. For now, I've set it at 10%, mainly to catch cases where compression offers no size benefit at all. Here is an example of such a case (at the level of a full column):

[screenshot omitted: a column where compression offers no size benefit]

As a side note, I’ve also made some patches to Xiangpeng’s viewer tool, which I use to inspect my generated Parquet files. This has been instrumental in iterating on my reader implementation.

mapleFU (Member Author) commented Sep 8, 2025

Generally, different levels of data will have different distributions, and, as with what a query optimizer faces, data changes (like frequent insertion or insert-overwrite) might require re-sampling the data. So I think a runtime config here would differ from other configs.

Z-order clustering or other clustering might also change the distribution score. So my current thinking is: a user config could set its own score, maybe a different score for freshly ingested data (which might need fast writes) versus well-clustered data (which might benefit from better compression). 10% is a good intuition, but it's hard to say it is the right value. When the compressed size is greater than the uncompressed size, it is unambiguously worse.
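
If such a score ever becomes a writer option, one hypothetical shape for the knob (none of these names exist in parquet-rs today) might be:

```rust
// Hypothetical config sketch, not an existing parquet-rs API: the minimum size
// reduction a codec must achieve before the compressed bytes are kept, with
// the idea that different write paths could pick different values.
#[derive(Clone, Copy, Debug)]
pub struct CompressionThreshold {
    /// Fraction in [0.0, 1.0): 0.0 keeps any strictly smaller result (the
    /// behavior of this PR); 0.10 requires at least a 10% reduction.
    pub min_reduction: f64,
}

impl CompressionThreshold {
    /// Freshly ingested data: prioritize write speed, demand a bigger win.
    pub const FAST_INGEST: Self = Self { min_reduction: 0.10 };
    /// Well-clustered / compacted data: accept smaller gains to save space.
    pub const WELL_CLUSTERED: Self = Self { min_reduction: 0.0 };
}
```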

JigaoLuo (Contributor) commented Sep 8, 2025

I agree. I’ve been thinking more about this, especially since my focus is primarily on cuDF rather than DataFusion.

At a high level, it's a trade-off between computation (specifically decompression) and I/O (file size reduction). In CPU scenarios like DataFusion, I believe reading compressed Parquet files tends to be computation-bound.

Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues: Parquet: Do not compress v2 data page when compress is bad quality

4 participants