Handle malformed XML in extract_text_from_xml and add edge case test#71
Handle malformed XML in extract_text_from_xml and add edge case test#71tharunkumar4562 wants to merge 4 commits intoAOSSIE-Org:mainfrom
Conversation
WalkthroughImplemented resilient XML preprocessing in extract_text_from_xml: uses a temporary file with atomic replace, wraps ET.iterparse in try/except to re-raise ParseError with a descriptive message, ensures flush/fsync and per-element cleanup, generates manifest fields, and adds a unit test for malformed XML. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Utils as extract_text_from_xml
participant Parser as ET.iterparse
participant FS as FileSystem
participant Manifest as ManifestGenerator
Caller->>Utils: call extract_text_from_xml(input_path, out_path)
Utils->>Parser: open input and start iterparse
Parser-->>Utils: yields elements / raises ParseError
alt ParseError
Parser-->>Utils: ParseError
Utils->>Caller: raise ParseError("Failed to parse XML dump...")
else Successful parse
Utils->>FS: write processed text to temp file (flush & fsync)
Utils->>Utils: elem.clear() for memory-safe cleanup
Utils->>FS: os.replace(temp -> final output)
Utils->>Manifest: compute merkle roots & chunk_size_bytes
Manifest-->>Utils: manifest data
Utils->>Caller: return success + manifest
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 225-245: The current parser writes directly to output_path (the
with open(output_path, "w", ...) block) so a late ET.ParseError can leave a
partial or corrupted wiki_clean.txt; instead, write to a temporary file in the
same directory (using tempfile.NamedTemporaryFile or mkstemp) while parsing and
only atomically replace the final output_path with os.replace when parsing
completes successfully; ensure the temp file is opened with encoding="utf-8",
cleaned text is written there (via clean_wikitext), close and sync the temp file
before os.replace, and on exceptions (e.g., ET.ParseError) remove the temp file
and re-raise the error so the original output is preserved.
In `@tests/test_util.py`:
- Around line 195-200: Tighten the test so it only catches the XML parse failure
from defusedxml by using the concrete ParseError type instead of Exception and
assert the contextual message; update the pytest.raises(...) to expect
defusedxml.ElementTree.ParseError (import ParseError at top of
tests/test_util.py or reference defusedxml.ElementTree.ParseError) when calling
utils.extract_text_from_xml, and ensure the assertion checks that the exception
string contains the specific "Failed to parse XML" message produced by
extract_text_from_xml.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: af2b2c77-f0a8-4633-8515-c082d8e119c2
📒 Files selected for processing (2)
openverifiablellm/utils.pytests/test_util.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/test_util.py (1)
195-200:⚠️ Potential issue | 🟠 MajorTighten the assertion to catch only the specific parse failure.
Using
pytest.raises(Exception)is too broad—the test would pass on any exception (e.g.,FileNotFoundError,TypeError), defeating the purpose of verifying the malformed-XML contract. The implementation inutils.pyraisesET.ParseErrorwith a guaranteed message prefix.Use the concrete
ParseErrortype and thematchparameter to validate the contextual error message:🧪 Suggested fix
- # ensure the parse error bubbles up - with pytest.raises(Exception) as excinfo: - utils.extract_text_from_xml(input_file) - - # elementtree ParseError is expected - assert "Failed to parse XML" in str(excinfo.value) or "ParseError" in str(excinfo.value) + # ensure the parse error bubbles up with context + with pytest.raises( + utils.ET.ParseError, + match=r"Failed to parse XML dump '.*malformed\.xml':", + ): + utils.extract_text_from_xml(input_file),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 195 - 200, The test currently uses pytest.raises(Exception) which is too broad; change it to assert that utils.extract_text_from_xml(input_file) raises the specific xml.etree.ElementTree.ParseError and validate the message using pytest.raises(..., match=...) to match the guaranteed "Failed to parse XML" prefix; import xml.etree.ElementTree as ET in the test (or reference ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse XML") around the call to utils.extract_text_from_xml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/test_util.py`:
- Around line 195-200: The test currently uses pytest.raises(Exception) which is
too broad; change it to assert that utils.extract_text_from_xml(input_file)
raises the specific xml.etree.ElementTree.ParseError and validate the message
using pytest.raises(..., match=...) to match the guaranteed "Failed to parse
XML" prefix; import xml.etree.ElementTree as ET in the test (or reference
ET.ParseError) and use pytest.raises(ET.ParseError, match=r"^Failed to parse
XML") around the call to utils.extract_text_from_xml.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 83186ed0-b3fe-43dc-bca3-58ddd8dcb8d5
📒 Files selected for processing (1)
tests/test_util.py
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_util.py (1)
1-8:⚠️ Potential issue | 🟡 MinorFix import order to resolve CI failure.
The pipeline reports an unsorted import block. Third-party imports (
defusedxml,pytest) should be grouped together, separated from standard library imports by a blank line.🔧 Proposed fix
import bz2 import hashlib import json -from defusedxml.ElementTree import ParseError import pytest +from defusedxml.ElementTree import ParseError from openverifiablellm import utils🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 1 - 8, Reorder the import block in tests/test_util.py so standard library imports (bz2, hashlib, json) come first, then a blank line, then third-party imports (from defusedxml.ElementTree import ParseError and import pytest), then another blank line and the local package import (from openverifiablellm import utils); ensure imports are grouped and separated by blank lines to satisfy the import sorting rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 235-244: The iterparse loop over "context" is leaking memory
because processed XML elements ("elem") are never freed; after writing cleaned
text (where you call clean_wikitext and out.write) you must explicitly clear the
element to release memory when using ET.iterparse—so inside the for _, elem in
context loop (after handling text_elem and writing to out) call elem.clear() to
free parsed element data (keeping the existing out.flush() / os.fsync()
behavior); this prevents accumulation when parsing large XML dumps.
---
Outside diff comments:
In `@tests/test_util.py`:
- Around line 1-8: Reorder the import block in tests/test_util.py so standard
library imports (bz2, hashlib, json) come first, then a blank line, then
third-party imports (from defusedxml.ElementTree import ParseError and import
pytest), then another blank line and the local package import (from
openverifiablellm import utils); ensure imports are grouped and separated by
blank lines to satisfy the import sorting rule.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1dcaebff-9acb-4ab3-b7c2-ed8140c49dc3
📒 Files selected for processing (2)
openverifiablellm/utils.pytests/test_util.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 253-256: The generic "except Exception" block currently calls
temp_output_path.unlink(missing_ok=True) which duplicates cleanup performed in
the finally block and can mask the original exception if unlink itself fails;
remove the unlink call from the "except Exception" handler so it simply
re-raises the exception (matching the ParseError handler), leaving file removal
to the existing finally block that uses
temp_output_path.unlink(missing_ok=True).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6f6ddb04-17e3-4ccd-866e-79aca1a103e3
📒 Files selected for processing (2)
openverifiablellm/utils.pytests/test_util.py
| except Exception: | ||
| if temp_output_path.exists(): | ||
| temp_output_path.unlink(missing_ok=True) | ||
| raise |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Redundant cleanup that may mask the original exception.
The explicit unlink in the generic Exception handler is unnecessary since the finally block (lines 261-263) already handles cleanup. If unlink() raises an unexpected error (e.g., PermissionError), it will mask the original exception.
The ParseError handler correctly relies on finally for cleanup—this block should follow the same pattern.
♻️ Proposed fix: remove redundant cleanup
except ET.ParseError as e:
msg = f"Failed to parse XML dump '{input_path}': {e}"
logger.error(msg)
raise ET.ParseError(msg) from e
except Exception:
- if temp_output_path.exists():
- temp_output_path.unlink(missing_ok=True)
raise
else:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception: | |
| if temp_output_path.exists(): | |
| temp_output_path.unlink(missing_ok=True) | |
| raise | |
| except Exception: | |
| raise |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 253 - 256, The generic "except
Exception" block currently calls temp_output_path.unlink(missing_ok=True) which
duplicates cleanup performed in the finally block and can mask the original
exception if unlink itself fails; remove the unlink call from the "except
Exception" handler so it simply re-raises the exception (matching the ParseError
handler), leaving file removal to the existing finally block that uses
temp_output_path.unlink(missing_ok=True).
|
Already being addressed in another #49 |
Addressed Issues:
Fixes #48
Screenshots/Recordings:
Before fix: malformed XML raised a low-level ParseError without context.
After fix: the error now includes a clear message indicating which
XML dump failed to parse.
Additional Notes:
While running the test suite locally, I noticed that
extract_text_from_xmldoes not provide useful context when encountering malformed XML.Changes in this PR:
ET.iterparseto catchxml.etree.ElementTree.ParseError.test_extract_text_from_xml_malformedto verify behaviour when malformed XML is encountered.This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.
All tests pass locally.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Bug Fixes
Tests