Handle malformed XML in extract_text_from_xml and add edge case test #71

Closed
tharunkumar4562 wants to merge 4 commits into AOSSIE-Org:main from tharunkumar4562:malformed-xml-fix

Conversation

@tharunkumar4562
Contributor

@tharunkumar4562 tharunkumar4562 commented Mar 12, 2026

Addressed Issues:

Fixes #48

Screenshots/Recordings:

Before fix: malformed XML raised a low-level ParseError without context.

image

After fix: the error now includes a clear message indicating which
XML dump failed to parse.

image

Additional Notes:

While running the test suite locally, I noticed that extract_text_from_xml does not provide useful context when encountering malformed XML.

Changes in this PR:

  • Added error handling around ET.iterparse to catch xml.etree.ElementTree.ParseError.
  • Logged a descriptive error message that includes the input file path.
  • Re-raised the error with additional context to improve debugging.
  • Added a new unit test, test_extract_text_from_xml_malformed, to verify behaviour when malformed XML is encountered.

This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.
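
The error-handling change described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `collect_text` is a hypothetical helper, the message format is the one quoted later in the review, and the standard-library `xml.etree.ElementTree` is used so the example runs without extra dependencies (the review comments mention both it and `defusedxml.ElementTree`).

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


def collect_text(input_path):
    """Return the text of every <text> element in an XML file.

    Hypothetical helper for illustration; not the project's function.
    """
    texts = []
    try:
        for _, elem in ET.iterparse(input_path, events=("end",)):
            # Strip any "{namespace}" prefix before comparing the tag name.
            if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
                texts.append(elem.text)
    except ET.ParseError as e:
        # Add the file path so the failure can be traced to a specific dump.
        msg = f"Failed to parse XML dump '{input_path}': {e}"
        logger.error(msg)
        raise ET.ParseError(msg) from e
    return texts
```

One trade-off of this pattern: the new `ParseError` carries only the message, so the `code` and `position` attributes of the original error are reachable only through `__cause__`.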

All tests pass locally.

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Manifest generation added after successful XML extraction.
    • Safe, atomic output handling for XML processing to avoid partial writes.
  • Bug Fixes

    • Improved XML parsing error messages and robust cleanup on failure.
    • In-loop memory and disk-state cleanup for more reliable processing.
  • Tests

    • Added test coverage for malformed XML error handling.

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

Walkthrough

Implemented resilient XML preprocessing in extract_text_from_xml: uses a temporary file with atomic replace, wraps ET.iterparse in try/except to re-raise ParseError with a descriptive message, ensures flush/fsync and per-element cleanup, generates manifest fields, and adds a unit test for malformed XML.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| XML processing & manifest: openverifiablellm/utils.py | Rewrote extract_text_from_xml to write to a tempfile then atomically replace the output, added tempfile import, added try/except to catch ET.ParseError and re-raise with a descriptive message, ensured buffer flush/fsync and in-loop element cleanup, added finalization/cleanup of temp files, and extended manifest generation (raw/processed merkle roots, chunk size). |
| Tests: tests/test_util.py | Added test_extract_text_from_xml_malformed validating that malformed XML raises ET.ParseError and that the error message includes "Failed to parse XML". |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Utils as extract_text_from_xml
    participant Parser as ET.iterparse
    participant FS as FileSystem
    participant Manifest as ManifestGenerator

    Caller->>Utils: call extract_text_from_xml(input_path, out_path)
    Utils->>Parser: open input and start iterparse
    Parser-->>Utils: yields elements / raises ParseError
    alt ParseError
      Parser-->>Utils: ParseError
      Utils->>Caller: raise ParseError("Failed to parse XML dump...")
    else Successful parse
      Utils->>FS: write processed text to temp file (flush & fsync)
      Utils->>Utils: elem.clear() for memory-safe cleanup
      Utils->>FS: os.replace(temp -> final output)
      Utils->>Manifest: compute merkle roots & chunk_size_bytes
      Manifest-->>Utils: manifest data
      Utils->>Caller: return success + manifest
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibble tags and stash them neat,
Temp files tucked beneath my feet.
If XML stumbles, I will shout,
Hash the roots and clean up spout—
Hooray, the pipeline hops about!

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: adding error handling for malformed XML and a corresponding test case. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from issue #48: adds error handling for malformed XML with descriptive logging, implements a unit test verifying ParseError is raised for malformed XML, and maintains clear failure semantics. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #48 objectives: XML error handling improvements in utils.py and a new test in test_util.py. No unrelated modifications detected. |


@github-actions github-actions bot added size/M and removed size/M labels Mar 12, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.

In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.
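
The temp-file-then-replace approach recommended for `utils.py` can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `write_atomically` and its `lines` parameter are hypothetical, and the real function would write the output of `clean_wikitext` rather than arbitrary strings.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(output_path, lines):
    """Write lines to output_path without ever exposing a partial file.

    Hypothetical helper for illustration; not the project's function.
    """
    output_path = Path(output_path)
    # Create the temp file next to the target: os.replace can fail
    # when source and destination are on different filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for line in lines:
                out.write(line + "\n")
            # Push buffered data to disk before the rename.
            out.flush()
            os.fsync(out.fileno())
        os.replace(tmp_path, output_path)
    finally:
        # After a successful replace the temp name is gone, so this
        # only removes leftovers from a failed run.
        tmp_path.unlink(missing_ok=True)
```

If iterating `lines` raises partway through (for example a parse error from a generator), the existing file at `output_path` is left as it was and the temp file is removed.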

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 132592b.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

@github-actions github-actions bot added size/M and removed size/M labels Mar 12, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
tests/test_util.py (1)

195-200: ⚠️ Potential issue | 🟠 Major

Tighten the assertion to catch only the specific parse failure.

Using pytest.raises(Exception) is too broad—the test would pass on any exception (e.g., FileNotFoundError, TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation in utils.py raises ET.ParseError with a guaranteed message prefix.

Use the concrete ParseError type and the match parameter to validate the contextual error message:

🧪 Suggested fix
-    # ensure the parse error bubbles up
-    with pytest.raises(Exception) as excinfo:
-        utils.extract_text_from_xml(input_file)
-
-    # elementtree ParseError is expected
-    assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value)
+    # ensure the parse error bubbles up with context
+    with pytest.raises(
+        utils.ET.ParseError,
+        match=r"Failed to parse XML dump '.*malformed\.xml':",
+    ):
+        utils.extract_text_from_xml(input_file)


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 195 - 200, The test currently uses
pytest.raises(Exception) which is too broad; change it to assert that
utils.extract_text_from_xml(input_file) raises the specific
xml.etree.ElementTree.ParseError and validate the message using
pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML"
prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.
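
Both regular expressions proposed in this comment can be checked without running the test suite, because `pytest.raises(..., match=...)` tests the pattern against the string form of the exception with `re.search`. The sketch below assumes the message format `Failed to parse XML dump '<path>': <detail>` quoted in the review. The path and parser detail are made-up sample values.

```python
import re
import xml.etree.ElementTree as ET

# Sample values, invented for illustration only.
sample_path = "/tmp/pytest-0/malformed.xml"
detail = "mismatched tag: line 1, column 30"

wrapped = ET.ParseError(f"Failed to parse XML dump '{sample_path}': {detail}")
raw = ET.ParseError(detail)  # what the parser raises without added context

strict = r"Failed to parse XML dump '.*malformed\.xml':"
prefix = r"^Failed to parse XML"

# The same check that match= performs: re.search on str(exception).
strict_ok = re.search(strict, str(wrapped)) is not None
prefix_ok = re.search(prefix, str(wrapped)) is not None

# A ParseError without the contextual prefix fails the pattern,
# so the tightened test would catch a regression in the wrapper.
raw_ok = re.search(prefix, str(raw)) is not None
```

The stricter pattern also pins the file name in the message, while the `^Failed to parse XML` prefix only checks that context was added.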

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5

📥 Commits

Reviewing files that changed from the base of the PR and between 132592b and 723674a.

📒 Files selected for processing (1)
  • tests/test_util.py

@github-actions github-actions bot added size/M and removed size/M labels Mar 15, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_util.py (1)

1-8: ⚠️ Potential issue | 🟡 Minor

Fix import order to resolve CI failure.

The pipeline reports an unsorted import block. Third-party imports (defusedxml, pytest) should be grouped together, separated from standard library imports by a blank line.

🔧 Proposed fix
 import bz2
 import hashlib
 import json
-from defusedxml.ElementTree import ParseError

 import pytest
+from defusedxml.ElementTree import ParseError

 from openverifiablellm import utils
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 1 - 8, Reorder the import block in
tests/test_util.py so standard library imports (bz2, hashlib, json) come first,
then a blank line, then third-party imports (from defusedxml.ElementTree import
ParseError and import pytest), then another blank line and the local package
import (from openverifiablellm import utils); ensure imports are grouped and
separated by blank lines to satisfy the import sorting rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 235-244: The iterparse loop over "context" is leaking memory
because processed XML elements ("elem") are never freed; after writing cleaned
text (where you call clean_wikitext and out.write) you must explicitly clear the
element to release memory when using ET.iterparse—so inside the for _, elem in
context loop (after handling text_elem and writing to out) call elem.clear() to
free parsed element data (keeping the existing out.flush() / os.fsync()
behavior); this prevents accumulation when parsing large XML dumps.
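
The `elem.clear()` advice in the inline comment can be illustrated with a small streaming loop. This is a sketch under assumptions: `stream_page_text` and `local_name` are hypothetical names, the `<page>`/`<text>` layout is assumed from the Wikipedia dump format, and the standard-library parser is used in place of whichever module `utils.py` imports as `ET`.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without any "{namespace}" prefix."""
    return tag.rsplit("}", 1)[-1]


def stream_page_text(source, out):
    """Write each page's <text> payload to out, clearing pages as we go.

    Hypothetical helper for illustration; not the project's function.
    """
    count = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        for child in elem.iter():
            if local_name(child.tag) == "text" and child.text:
                out.write(child.text + "\n")
                count += 1
        # Release the page's children, text and attributes once written.
        elem.clear()
    return count
```

Cleared elements remain attached to their parent as empty shells, so for very large inputs it may also be worth detaching them from the root. Whether that is needed here depends on the dump sizes the project targets.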


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1dcaebff-9acb-4ab3-b7c2-ed8140c49dc3

📥 Commits

Reviewing files that changed from the base of the PR and between 723674a and f7a229a.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

@github-actions github-actions bot added size/M and removed size/M labels Mar 15, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 253-256: The generic "except Exception" block currently calls
temp_output_path.unlink(missing_ok=True) which duplicates cleanup performed in
the finally block and can mask the original exception if unlink itself fails;
remove the unlink call from the "except Exception" handler so it simply
re-raises the exception (matching the ParseError handler), leaving file removal
to the existing finally block that uses
temp_output_path.unlink(missing_ok=True).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6f6ddb04-17e3-4ccd-866e-79aca1a103e3

📥 Commits

Reviewing files that changed from the base of the PR and between f7a229a and d2f1960.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • tests/test_util.py

Comment on lines +253 to +256
    except Exception:
        if temp_output_path.exists():
            temp_output_path.unlink(missing_ok=True)
        raise
Contributor


🧹 Nitpick | 🔵 Trivial

Redundant cleanup that may mask the original exception.

The explicit unlink in the generic Exception handler is unnecessary since the finally block (lines 261-263) already handles cleanup. If unlink() raises an unexpected error (e.g., PermissionError), it will mask the original exception.

The ParseError handler correctly relies on finally for cleanup—this block should follow the same pattern.

♻️ Proposed fix: remove redundant cleanup
     except ET.ParseError as e:
         msg = f"Failed to parse XML dump '{input_path}': {e}"
         logger.error(msg)
         raise ET.ParseError(msg) from e
     except Exception:
-        if temp_output_path.exists():
-            temp_output_path.unlink(missing_ok=True)
         raise
     else:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    except Exception:
-        if temp_output_path.exists():
-            temp_output_path.unlink(missing_ok=True)
-        raise
+    except Exception:
+        raise
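
The claim that the `finally` block alone is enough for cleanup can be checked in isolation. The sketch below is a hypothetical stand-in for the structure discussed in this comment (`run_step`, `log`, and `fail` are illustrative names, not taken from `utils.py`): the handler only re-raises, and the temp file is still removed.

```python
from pathlib import Path


def run_step(tmp_path: Path, log: list, fail: bool) -> None:
    """Create a temp file, optionally fail, and clean up in finally only.

    Hypothetical function for illustration; not the project's code.
    """
    try:
        tmp_path.write_text("partial output", encoding="utf-8")
        if fail:
            raise RuntimeError("simulated failure")
    except Exception:
        # No cleanup here: the handler records the event and re-raises.
        log.append("except")
        raise
    finally:
        # Runs on success and on failure, before the exception leaves run_step.
        tmp_path.unlink(missing_ok=True)
        log.append("finally")
```

A handler that only re-raises behaves the same as having no handler, so the `except Exception: raise` clause could also be dropped entirely.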

@Archit381
Member

Already being addressed in #49.

@Archit381 Archit381 closed this Mar 17, 2026
Development

Successfully merging this pull request may close these issues.

[BUG] : Add edge case test for extract_text_from_xml to handle malformed XML

2 participants