Handle malformed XML in extract_text_from_xml and add edge case test by tharunkumar4562 · Pull Request #71 · AOSSIE-Org/OpenVerifiableLLM

tharunkumar4562 · 2026-03-12T17:00:01Z

Addressed Issues:

Fixes #48

Screenshots/Recordings:

Before fix: malformed XML raised a low-level ParseError without context.

After fix: the error now includes a clear message indicating which
XML dump failed to parse.

Additional Notes:

While running the test suite locally, I noticed that extract_text_from_xml does not provide useful context when encountering malformed XML.

Changes in this PR:

Added error handling around ET.iterparse to catch xml.etree.ElementTree.ParseError.
Log a descriptive error message including the input file path.
Re-raise the error with additional context to improve debugging.
Added a new unit test test_extract_text_from_xml_malformed to verify behaviour when malformed XML is encountered.

This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.

All tests pass locally.

Checklist

My code follows the project's code style and conventions
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

New Features
- Manifest generation added after successful XML extraction.
- Safe, atomic output handling for XML processing to avoid partial writes.
Bug Fixes
- Improved XML parsing error messages and robust cleanup on failure.
- In-loop memory and disk-state cleanup for more reliable processing.
Tests
- Added test coverage for malformed XML error handling.

coderabbitai · 2026-03-12T17:00:19Z

Walkthrough

Implemented resilient XML preprocessing in extract_text_from_xml: uses a temporary file with atomic replace, wraps ET.iterparse in try/except to re-raise ParseError with a descriptive message, ensures flush/fsync and per-element cleanup, generates manifest fields, and adds a unit test for malformed XML.

Changes

Cohort / File(s)	Summary
XML processing & manifest `openverifiablellm/utils.py`	Rewrote extract_text_from_xml to write to a tempfile then atomically replace the output, added `tempfile` import, added try/except to catch `ET.ParseError` and re-raise with a descriptive message, ensured buffer flush/fsync and in-loop element cleanup, added finalization/cleanup of temp files, and extended manifest generation (raw/processed merkle roots, chunk size).
Tests `tests/test_util.py`	Added `test_extract_text_from_xml_malformed` validating that malformed XML raises `ET.ParseError` and that the error message includes "Failed to parse XML".

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Utils as extract_text_from_xml
    participant Parser as ET.iterparse
    participant FS as FileSystem
    participant Manifest as ManifestGenerator

    Caller->>Utils: call extract_text_from_xml(input_path, out_path)
    Utils->>Parser: open input and start iterparse
    Parser-->>Utils: yields elements / raises ParseError
    alt ParseError
      Parser-->>Utils: ParseError
      Utils->>Caller: raise ParseError("Failed to parse XML dump...")
    else Successful parse
      Utils->>FS: write processed text to temp file (flush & fsync)
      Utils->>Utils: elem.clear() for memory-safe cleanup
      Utils->>FS: os.replace(temp -> final output)
      Utils->>Manifest: compute merkle roots & chunk_size_bytes
      Manifest-->>Utils: manifest data
      Utils->>Caller: return success + manifest
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Improve XML extraction logic and update tests #29: Modifies extract_text_from_xml (adds input-format auto-detection); overlaps with parsing and temp-file handling changes.

Suggested labels

Python Lang

Suggested reviewers

Archit381

Poem

🐰 I nibble tags and stash them neat,
Temp files tucked beneath my feet.
If XML stumbles, I will shout,
Hash the roots and clean up spout—
Hooray, the pipeline hops about!

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: adding error handling for malformed XML and a corresponding test case.
Linked Issues check	✅ Passed	The PR addresses all coding requirements from issue `#48`: adds error handling for malformed XML with descriptive logging, implements a unit test verifying ParseError is raised for malformed XML, and maintains clear failure semantics.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#48` objectives: XML error handling improvements in utils.py and a new test in test_util.py. No unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.

In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 132592b.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

openverifiablellm/utils.py

tests/test_util.py

coderabbitai

♻️ Duplicate comments (1)

tests/test_util.py (1)

195-200: ⚠️ Potential issue | 🟠 Major

Tighten the assertion to catch only the specific parse failure.

Using pytest.raises(Exception) is too broad—the test would pass on any exception (e.g., FileNotFoundError, TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation in utils.py raises ET.ParseError with a guaranteed message prefix.

Use the concrete ParseError type and the match parameter to validate the contextual error message:

🧪 Suggested fix

-    # ensure the parse error bubbles up
-    with pytest.raises(Exception) as excinfo:
-        utils.extract_text_from_xml(input_file)
-
-    # elementtree ParseError is expected
-    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+    # ensure the parse error bubbles up with context
+    with pytest.raises(
+        utils.ET.ParseError,
+        match=r"Failed to parse XML dump '.*malformed\.xml':",
+    ):
+        utils.extract_text_from_xml(input_file)

,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 195 - 200, The test currently uses
pytest.raises(Exception) which is too broad; change it to assert that
utils.extract_text_from_xml(input_file) raises the specific
xml.etree.ElementTree.ParseError and validate the message using
pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML"
prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_util.py`:
- Around line 195-200: The test currently uses pytest.raises(Exception) which is
too broad; change it to assert that utils.extract_text_from_xml(input_file)
raises the specific xml.etree.ElementTree.ParseError and validate the message
using pytest.raises(..., match=...) to match the guaranteed "Failed to parse
XML" prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5

📥 Commits

Reviewing files that changed from the base of the PR and between 132592b and 723674a.

📒 Files selected for processing (1)

tests/test_util.py

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_util.py (1)

1-8: ⚠️ Potential issue | 🟡 Minor

Fix import order to resolve CI failure.

The pipeline reports an unsorted import block. Third-party imports (defusedxml, pytest) should be grouped together, separated from standard library imports by a blank line.

🔧 Proposed fix

 import bz2
 import hashlib
 import json
-from defusedxml.ElementTree import ParseError

 import pytest
+from defusedxml.ElementTree import ParseError

 from openverifiablellm import utils

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 1 - 8, Reorder the import block in
tests/test_util.py so standard library imports (bz2, hashlib, json) come first,
then a blank line, then third-party imports (from defusedxml.ElementTree import
ParseError and import pytest), then another blank line and the local package
import (from openverifiablellm import utils); ensure imports are grouped and
separated by blank lines to satisfy the import sorting rule.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 235-244: The iterparse loop over "context" is leaking memory
because processed XML elements ("elem") are never freed; after writing cleaned
text (where you call clean_wikitext and out.write) you must explicitly clear the
element to release memory when using ET.iterparse—so inside the for _, elem in
context loop (after handling text_elem and writing to out) call elem.clear() to
free parsed element data (keeping the existing out.flush() / os.fsync()
behavior); this prevents accumulation when parsing large XML dumps.

---

Outside diff comments:
In `@tests/test_util.py`:
- Around line 1-8: Reorder the import block in tests/test_util.py so standard
library imports (bz2, hashlib, json) come first, then a blank line, then
third-party imports (from defusedxml.ElementTree import ParseError and import
pytest), then another blank line and the local package import (from
openverifiablellm import utils); ensure imports are grouped and separated by
blank lines to satisfy the import sorting rule.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1dcaebff-9acb-4ab3-b7c2-ed8140c49dc3

📥 Commits

Reviewing files that changed from the base of the PR and between 723674a and f7a229a.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

openverifiablellm/utils.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 253-256: The generic "except Exception" block currently calls
temp_output_path.unlink(missing_ok=True) which duplicates cleanup performed in
the finally block and can mask the original exception if unlink itself fails;
remove the unlink call from the "except Exception" handler so it simply
re-raises the exception (matching the ParseError handler), leaving file removal
to the existing finally block that uses
temp_output_path.unlink(missing_ok=True).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6f6ddb04-17e3-4ccd-866e-79aca1a103e3

📥 Commits

Reviewing files that changed from the base of the PR and between f7a229a and d2f1960.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

coderabbitai · 2026-03-15T20:22:13Z

openverifiablellm/utils.py

+    except Exception:
+        if temp_output_path.exists():
+            temp_output_path.unlink(missing_ok=True)
+        raise


🧹 Nitpick | 🔵 Trivial

Redundant cleanup that may mask the original exception.

The explicit unlink in the generic Exception handler is unnecessary since the finally block (lines 261-263) already handles cleanup. If unlink() raises an unexpected error (e.g., PermissionError), it will mask the original exception.

The ParseError handler correctly relies on finally for cleanup—this block should follow the same pattern.

♻️ Proposed fix: remove redundant cleanup

except ET.ParseError as e: msg = f"Failed to parse XML dump '{input_path}': {e}" logger.error(msg) raise ET.ParseError(msg) from e except Exception: - if temp_output_path.exists(): - temp_output_path.unlink(missing_ok=True) raise else:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

except Exception:

if temp_output_path.exists():

temp_output_path.unlink(missing_ok=True)

raise

except Exception:

raise

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 253 - 256, The generic "except Exception" block currently calls temp_output_path.unlink(missing_ok=True) which duplicates cleanup performed in the finally block and can mask the original exception if unlink itself fails; remove the unlink call from the "except Exception" handler so it simply re-raises the exception (matching the ParseError handler), leaving file removal to the existing finally block that uses temp_output_path.unlink(missing_ok=True).

Archit381 · 2026-03-17T19:53:03Z

Already being addressed in another #49

Handle malformed XML in extract_text_from_xml and add edge case test

132592b

github-actions bot added backend python size/M first-time-contributor pending-coderabbit-review labels Mar 12, 2026

github-actions bot added size/M and removed size/M labels Mar 12, 2026

Apply Ruff formatting

723674a

github-actions bot added size/M and removed size/M labels Mar 12, 2026

coderabbitai bot requested changes Mar 12, 2026

View reviewed changes

openverifiablellm/utils.py Show resolved Hide resolved

tests/test_util.py Outdated Show resolved Hide resolved

github-actions bot added size/M and removed size/M labels Mar 12, 2026

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

Fix atomic write in extract_text_from_xml and tighten ParseError test

f7a229a

github-actions bot added size/M and removed size/M labels Mar 15, 2026

coderabbitai bot requested changes Mar 15, 2026

View reviewed changes

openverifiablellm/utils.py Show resolved Hide resolved

Fix XML parsing cleanup and tighten malformed XML test

d2f1960

github-actions bot added size/M and removed size/M labels Mar 15, 2026

coderabbitai bot requested changes Mar 15, 2026

View reviewed changes

Archit381 closed this Mar 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle malformed XML in extract_text_from_xml and add edge case test#71

Handle malformed XML in extract_text_from_xml and add edge case test#71
tharunkumar4562 wants to merge 4 commits intoAOSSIE-Org:mainfrom
tharunkumar4562:malformed-xml-fix

tharunkumar4562 commented Mar 12, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 12, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 15, 2026

Uh oh!

Archit381 commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tharunkumar4562 commented Mar 12, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Checklist

⚠️ AI Notice - Important!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 15, 2026

Choose a reason for hiding this comment

Uh oh!

Archit381 commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tharunkumar4562 commented Mar 12, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 12, 2026 •

edited

Loading