[SPARK-52582][SQL] Improve the memory usage of XML parser #51287

Open
wants to merge 4 commits into
base: master

xiaonanyang-db
Contributor

@xiaonanyang-db xiaonanyang-db commented Jun 26, 2025

What changes were proposed in this pull request?

Today, the XML parser is not memory efficient. It loads each XML record fully into memory before parsing it, which causes an OOM when an input XML record is large. This PR changes the parser to process an XML record token by token, so the entire record no longer has to be copied into memory ahead of time.
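
To illustrate the token-by-token idea, here is a minimal sketch that uses the JDK's StAX API (`javax.xml.stream`) as a stand-in. It is not the code in this PR: the object `StreamingXmlSketch` and the method `readField` are hypothetical names. The sketch pulls one parser event at a time from an `InputStream` and only ever holds the current field's text, never the full record string.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration; not the parser implemented in this PR.
object StreamingXmlSketch {
  /** Collects the text of every <fieldTag> element nested inside a <rowTag>
    * element, consuming the stream one StAX event at a time. */
  def readField(in: InputStream, rowTag: String, fieldTag: String): Seq[String] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    val values = ArrayBuffer.empty[String]
    var insideRow = false
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            val name = reader.getLocalName
            if (name == rowTag) insideRow = true
            // getElementText reads up to the matching end tag of this field only.
            else if (insideRow && name == fieldTag) values += reader.getElementText
          case XMLStreamConstants.END_ELEMENT =>
            if (reader.getLocalName == rowTag) insideRow = false
          case _ => // whitespace, comments, etc. are skipped
        }
      }
    } finally reader.close()
    values.toSeq
  }
}
```

Calling `StreamingXmlSketch.readField(stream, "ROW", "b")` on a stream of `<ROW>` records returns the `<b>` values in document order. Memory use is bounded by the largest single field rather than by the largest record.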

The optimization is governed by an SQL conf and disabled by default because it comes with two consequences:

  1. XSD validation is not supported in the optimized parser. The current XSD validation works only on a full XML record string, which defeats the purpose of the optimization.
  2. Corrupted-record handling changes. Currently, well-formed records in an XML file are parsed correctly even when a corrupted XML record sits between them; with the optimized parser, the records after the corrupted one are no longer parsed. For example, given the following XML file:
<ROWS>
  <ROW><b>1</b></ROW>
  <ROW><b></ROW>
  <ROW><b>2</b></ROW>
</ROWS>

With the current parser, the first and third records are parsed correctly, and the second one is moved to the corrupted data column. With the new parser, only the first record is parsed, and the whole document is moved to the corrupted data column as the second record.
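
The reason a streaming parser cannot skip past the bad record can be reproduced with a small sketch. It again uses the JDK's StAX API as a stand-in rather than the code in this PR, and `CorruptRecordSketch` and `parse` are hypothetical names. Once the reader hits the mismatched `</ROW>` in the second record it throws, and a forward-only reader has no reliable way to resynchronize at the next `<ROW>`.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamException}
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration; not the parser implemented in this PR.
object CorruptRecordSketch {
  // The sample file from the PR description.
  val sample: String =
    """<ROWS>
      |  <ROW><b>1</b></ROW>
      |  <ROW><b></ROW>
      |  <ROW><b>2</b></ROW>
      |</ROWS>""".stripMargin

  /** Returns the <b> values read before the stream ended or failed,
    * plus a flag that is true if parsing failed. */
  def parse(xml: String): (Seq[String], Boolean) = {
    val bytes = xml.getBytes(StandardCharsets.UTF_8)
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new ByteArrayInputStream(bytes))
    val values = ArrayBuffer.empty[String]
    var failed = false
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "b") {
          values += reader.getElementText
        }
      }
    } catch {
      // Malformed input is fatal for the stream; later records are never reached.
      case _: XMLStreamException => failed = true
    } finally reader.close()
    (values.toSeq, failed)
  }
}
```

On `CorruptRecordSketch.sample`, `parse` yields only the value `1` and reports a failure, which mirrors the behavior described above: the first record is parsed and everything from the corrupted record onward is lost to the stream.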

Why are the changes needed?

Solve the OOM issue in XML ingestion.

Does this PR introduce any user-facing change?

No. The new behavior is disabled by default for now.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xiaonanyang-db xiaonanyang-db marked this pull request as ready for review June 26, 2025 01:46
@github-actions github-actions bot added the SQL label Jun 26, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-52582] Improve the memory usage of XML parser [SPARK-52582][SQL] Improve the memory usage of XML parser Jun 26, 2025
parser.parseStream(
  CodecStreams.createInputStreamWithCloseResource(conf, file.toPath),
  requiredSchema)
if (SQLConf.get.enableOptimizedXmlParser) {
Contributor

Will the optimized XML parser have any negative impacts? Why is it necessary to add a switch instead of directly replacing the old implementation with the optimized XML parser?

Contributor Author

Yes, the new optimized parser has two implications; please see the PR description.

2 participants