[SPARK-52582][SQL] Improve the memory usage of XML parser #51287

xiaonanyang-db · 2025-06-26T01:46:44Z

What changes were proposed in this pull request?

Today, the XML parser is not memory efficient: it loads each XML record into memory before parsing it, which causes an OOM when a record is large. This PR changes the parser to process an XML record token by token, so the entire record is never copied into memory ahead of time.
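To make the token-by-token idea concrete, here is a minimal sketch using the JDK's StAX API (`javax.xml.stream`). It is not the code from this PR. The object name `StreamingXmlSketch`, the method `readBValues`, and the `ROW`/`b` element names are assumptions chosen for illustration. The point is only that values are collected as tokens arrive from the input stream, without first building a string that holds the whole record.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration; not part of Spark.
object StreamingXmlSketch {
  /** Collects the text of every <b> element nested in a <ROW>, one token at a time. */
  def readBValues(in: InputStream): Seq[String] = {
    val factory = XMLInputFactory.newInstance()
    // Merge adjacent character events so each text node arrives as few events as possible.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(in)
    val values = ArrayBuffer.empty[String]
    val text = new StringBuilder
    var insideRow = false
    var insideB = false
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            reader.getLocalName match {
              case "ROW" => insideRow = true
              case "b" if insideRow =>
                insideB = true
                text.clear()
              case _ =>
            }
          case XMLStreamConstants.CHARACTERS if insideB =>
            text.append(reader.getText)
          case XMLStreamConstants.END_ELEMENT =>
            reader.getLocalName match {
              case "b" if insideB =>
                values += text.toString
                insideB = false
              case "ROW" => insideRow = false
              case _ =>
            }
          case _ => // ignore whitespace, comments, and other tokens
        }
      }
    } finally {
      reader.close()
    }
    values.toSeq
  }
}
```

Only the current token and the small `text` buffer are held at any time, so memory use in this sketch depends on the size of individual field values rather than on the size of a record.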

The optimization is governed by an SQL conf and is disabled by default because it has two consequences:

1. XSD validation is not supported in the optimized parser. The current XSD validation works only on a full XML record string, which is incompatible with token-by-token parsing.
2. Corrupted record handling changes. Currently, good records in an XML file are parsed correctly even if a corrupted XML record sits between them; with the optimized parser, records after the corrupted one are no longer parsed. For example, given the following XML file:

<ROWS>
  <ROW><b>1</b></ROW>
  <ROW><b></ROW>
  <ROW><b>2</b></ROW>
</ROWS>

In the current parser, both the 1st and 3rd records are parsed correctly, and the 2nd is moved to the corrupted data column. In the new parser, only the 1st record is parsed, and the whole document is moved to the corrupted data column as the second record.
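The reason a streaming parser cannot skip past the bad record can be sketched with the JDK's StAX reader. This is an illustration under assumptions, not the PR's implementation: `CorruptRecordSketch` and `parseUntilFailure` are hypothetical names. Once the reader hits the mismatched `</ROW>` tag it raises an `XMLStreamException`, and there is no record boundary to resume from, so the 3rd record is never reached.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamException}
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration; not part of Spark.
object CorruptRecordSketch {
  /** Returns the <b> values read before the first malformed token, and whether parsing failed. */
  def parseUntilFailure(xml: String): (Seq[String], Boolean) = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val parsed = ArrayBuffer.empty[String]
    var failed = false
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "b") {
          // getElementText reads up to the matching </b>; it throws on malformed input.
          parsed += reader.getElementText
        }
      }
    } catch {
      // The stream is not well-formed from here on; a streaming reader cannot resynchronize.
      case _: XMLStreamException => failed = true
    } finally {
      reader.close()
    }
    (parsed.toSeq, failed)
  }
}
```

Calling `CorruptRecordSketch.parseUntilFailure` on the sample file above yields only the value from the 1st record plus a failure flag, which matches the behavior described for the optimized parser.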

Why are the changes needed?

Solve the OOM issue in XML ingestion.

Does this PR introduce any user-facing change?

No. The new behavior is disabled by default for now.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

LuciferYang · 2025-06-26T05:15:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XmlDataSource.scala

-    parser.parseStream(
-      CodecStreams.createInputStreamWithCloseResource(conf, file.toPath),
-      requiredSchema)
+    if (SQLConf.get.enableOptimizedXmlParser) {


Will the optimized XML parser have any negative impacts? Why is it necessary to add a switch instead of directly replacing the old implementation with the optimized XML parser?

xiaonanyang-db · 2025-07-01

Yes, the new optimized parser has two implications; please see the PR description.

xiaonanyang-db added 2 commits June 25, 2025 18:41

draft

05e975c

add test

854d3de

xiaonanyang-db marked this pull request as ready for review June 26, 2025 01:46

github-actions bot added the SQL label Jun 26, 2025

HyukjinKwon changed the title ~~[SPARK-52582] Improve the memory usage of XML parser~~ [SPARK-52582][SQL] Improve the memory usage of XML parser Jun 26, 2025

LuciferYang reviewed Jun 26, 2025

xiaonanyang-db added 2 commits June 30, 2025 14:14

fix

6d2a960

address comments

57c9685
