
High Memory Usage and Long GC Times When Writing Parquet Files #3102

Open
ccl125 opened this issue Dec 10, 2024 · 3 comments

ccl125 commented Dec 10, 2024

Describe the usage question you have. Please include as many useful details as possible.

In my project, I am using the following code to write Parquet files to the server:

ParquetWriter parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        .build();

Each Parquet file contains 30,000 columns. This code is executed by multiple threads simultaneously, which results in increased GC time. Analyzing memory usage, I found that the main memory consumers sit on the following chain:

InternalParquetRecordWriter -> ColumnWriterV1 -> FallbackValuesWriter -> PlainDoubleDictionaryValuesWriter -> IntList

Each thread writes to a file with the same table schema (header), differing only in the filePath.

I initially suspected that the memory usage was caused by the file buffer not being flushed in time. To address this, I tried configuring the writer with the following parameters:

parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        .withMinRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getMinRowCountForPageSizeCheck())
        .withMaxRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getMaxRowCountForPageSizeCheck())
        .withRowGroupSize(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getRowGroupSize())
        .build();

However, these adjustments did not solve the issue. The program still experiences long GC pauses and excessive memory usage.

Expected Behavior

Efficient Parquet file writing with reduced GC time and optimized memory usage when multiple threads are writing files simultaneously.

Observed Behavior
• Increased GC time and excessive memory usage.
• Memory analysis indicates IntList under PlainDoubleDictionaryValuesWriter is the primary consumer of memory.

Request

What are the recommended strategies to mitigate excessive memory usage in this scenario?
Is there a way to share table schema-related objects across threads, or are there other optimizations to reduce memory overhead?

Please let me know if additional information is needed!
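
For reference, here is a minimal self-contained sketch of the write path described above. The schema, file path, and class name are hypothetical placeholders, scaled down from the real workload (~30,000 DOUBLE columns, one writer per thread, each on its own filePath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema; the real schema has ~30,000 DOUBLE columns.
        MessageType messageType = MessageTypeParser.parseMessageType(
                "message sample { required double col_0; required double col_1; }");

        // In the real workload one such writer runs per thread.
        try (ParquetWriter<Group> parquetWriter = ExampleParquetWriter
                .builder(new Path("/tmp/sample.parquet"))
                .withConf(new Configuration())
                .withType(messageType)
                .build()) {
            SimpleGroupFactory groupFactory = new SimpleGroupFactory(messageType);
            // Each file holds a fixed 500 rows.
            for (int row = 0; row < 500; row++) {
                Group group = groupFactory.newGroup()
                        .append("col_0", (double) row)
                        .append("col_1", row * 0.5);
                parquetWriter.write(group);
            }
        }
    }
}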

ccl125 closed this as completed Dec 10, 2024
ccl125 reopened this Dec 10, 2024

ccl125 (Author) commented Dec 12, 2024

I noticed that when I set withDictionaryEncoding(false), the writer switches from using FallbackValuesWriter to PlainValuesWriter. These two have significantly different memory usage. It seems that using PlainValuesWriter might address my issue.

Here is the context:
• Each file has a fixed 500 rows.
• The number of columns varies, ranging from approximately 1 to 30,000.

I would like to know:
1. Can I directly solve the problem by setting withDictionaryEncoding(false)?
2. How will this impact file size, write efficiency, and read performance?
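
For clarity, the change in question is just one extra call on the builder, sketched below against the same setup as above:

parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        // Disable dictionary encoding for every column, so double values are written
        // by PlainValuesWriter instead of buffering dictionary indexes in an IntList.
        .withDictionaryEncoding(false)
        .build();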

wgtmac (Member) commented Dec 31, 2024

In general, dictionary encoding consumes a lot of memory due to buffering all entries. So yes, withDictionaryEncoding(false) is the right approach to reduce the memory footprint in your case. For the resulting file size and read performance, it depends on the data distribution or repetition.

Each file has a fixed 500 rows

I'd say Parquet is not designed for small data in which case the metadata overhead is non-trivial. It is more suitable for 100,000+ rows of data to enjoy the columnar encoding and compression.

ccl125 (Author) commented Jan 13, 2025

Yes, in our business scenario we split the total sample across multiple Parquet files, each with a fixed 500 rows but a varying number of columns. When the column count is high (over 30,000), we see GC pauses lasting over a minute. I changed the configuration to keep dictionary encoding only for BINARY and BOOLEAN columns, while setting withDictionaryEncoding(false) for all other column types. After this change, GC time improved significantly, dropping from minutes to normal millisecond levels.

However, I encountered another issue: after setting withDictionaryEncoding(false), the size of all generated Parquet files increased substantially. For a task with 800,000 rows and 30,000+ columns, the total file size grew from around 20GB to 90GB, while our business requirements cap it at 50GB. I then noticed that ParquetWriter does not enable compression by default (the default codec is UNCOMPRESSED). After adding builder.withCompressionCodec(CompressionCodecName.SNAPPY), the total size dropped to around 30GB, which meets our requirements while also keeping the GC issue solved.

However, we still occasionally see files exceeding 50GB, which never happened before the withDictionaryEncoding(false) change. It seems that SNAPPY compression alone is sometimes less effective than the original setup (dictionary encoding enabled, no explicit compression codec); I suspect the achievable compression ratio depends on the data in each file.
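
For anyone who runs into the same problem, here is a sketch of the configuration described above, assuming a parquet-java version whose builder exposes the per-column withDictionaryEncoding(columnPath, enable) overload. The loop re-enables dictionaries only for BINARY and BOOLEAN columns; the schema handling is illustrative, not our production code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

// ...

ExampleParquetWriter.Builder builder = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        // SNAPPY keeps file sizes in check once dictionaries are off;
        // the default codec is UNCOMPRESSED.
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        // Turn dictionary encoding off globally first...
        .withDictionaryEncoding(false);

// ...then re-enable it only for BINARY and BOOLEAN columns.
for (ColumnDescriptor column : messageType.getColumns()) {
    PrimitiveTypeName type = column.getPrimitiveType().getPrimitiveTypeName();
    if (type == PrimitiveTypeName.BINARY || type == PrimitiveTypeName.BOOLEAN) {
        // Column paths are dot-separated, e.g. "group.field" for nested fields.
        builder.withDictionaryEncoding(String.join(".", column.getPath()), true);
    }
}

ParquetWriter<Group> parquetWriter = builder.build();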
