Fix: Prevent duplicate output during resume in extract_text_from_xml (Issue #76)#80
ANUBprad wants to merge 4 commits into AOSSIE-Org:main from
Conversation
Hi, I’ve submitted a fix for this issue. Please review.
Walkthrough

extract_text_from_xml() checkpointing now tracks a stable input_identity and an output byte file_offset; output is reconciled to file_offset on resume, checkpoints are saved more frequently (every 100 pages), and they now include file_offset to prevent duplicate appended content.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant XML as Input XML
    participant Proc as extract_text_from_xml()
    participant Out as data/processed/wiki_clean.txt
    participant CP as data/checkpoint.json
    XML->>Proc: stream pages
    Proc->>CP: _load_checkpoint() -> (input_identity, file_offset)
    alt output ahead of checkpoint
        Proc->>Out: truncate file to file_offset
    end
    Proc->>Out: write cleaned chunks
    Proc->>Out: flush & update file_offset = Out.size()
    Proc->>CP: _save_checkpoint(input_identity, pages_processed, file_offset) [periodic / on-exit]
```
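The "output ahead of checkpoint" branch above can be sketched as follows; `reconcile_output` is an illustrative name, not the PR's actual helper:

```python
from pathlib import Path

def reconcile_output(output_path: Path, file_offset: int) -> None:
    """Truncate any bytes written after the last saved checkpoint.

    If a crash happened after pages were flushed but before the
    checkpoint was updated, the output is longer than file_offset;
    cutting it back makes file and checkpoint agree before resuming.
    """
    if not output_path.exists():
        return
    if output_path.stat().st_size > file_offset:
        # Binary mode so truncate() operates on a true byte offset.
        with open(output_path, "r+b") as f:
            f.truncate(file_offset)

# Demo: 10 bytes on disk, but the checkpoint recorded only 4.
p = Path("wiki_clean_demo.txt")
p.write_bytes(b"0123456789")
reconcile_output(p, 4)
print(p.read_bytes())  # → b'0123'
```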
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 318-320: The code recomputes the input SHA-256 by calling
_compute_input_identity(input_path) instead of reusing the already-validated
identity from _load_checkpoint(), which wastes time on large inputs; change
_load_checkpoint() to return the computed input_identity (or add a second return
value) and update the caller (the code that currently sets input_identity =
_compute_input_identity(input_path)) to use the identity returned by
_load_checkpoint() instead of calling _compute_input_identity again; touch the
_load_checkpoint() signature and its call sites (including the place in
openverifiablellm/utils.py where input_identity is set) so the validated
identity is propagated and reused.
- Around line 255-269: count_written_pages currently infers progress by
splitting wiki_clean.txt contents which is lossy and memory-heavy; change it
(and the similar helpers around the other ranges) to persist and return a
physical byte offset (e.g. use the file on-disk size or last-flushed byte count)
instead of counting non-empty page blocks so resumptions can truncate by byte
offset; locate count_written_pages and the similar functions referenced in the
comment, have them return the output file's flushed byte length (via os/stat or
opening the file and seeking to end) and update any truncation logic to truncate
the output file by that byte offset rather than by counting pages.
- Around line 352-356: The checkpointing uses a hardcoded `100` instead of the
shared `CHECKPOINT_INTERVAL`, creating two sources of truth; update the
conditional that checks `pages_written % 100 == 0` to use the
`CHECKPOINT_INTERVAL` constant and ensure any default/definition of
`CHECKPOINT_INTERVAL` matches the intended cadence, leaving the `out.flush()`
and `_save_checkpoint(checkpoint_path, pages_written, input_identity)` calls
unchanged so checkpointing is routed consistently through the
`CHECKPOINT_INTERVAL` setting.
In `@test.xml`:
- Around line 1-14: Add a new <page> entry to the XML fixture that will produce
an empty cleaned output to exercise the resume/skew case; create a page element
(use the same structure as existing pages: <page>, <title>, <revision>, <text>)
with a title like "EmptyOutput" and a <text>{{template}}</text> (or other
content that cleans to nothing) so that processing of that page emits no cleaned
text and triggers the edge case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1d571491-45b0-4620-a698-6e91c8bb38ab
📒 Files selected for processing (2)
openverifiablellm/utils.py
test.xml
```xml
<mediawiki>
  <page>
    <title>Test1</title>
    <revision>
      <text>This is [[sample]] text</text>
    </revision>
  </page>
  <page>
    <title>Test2</title>
    <revision>
      <text>Another {{template}} example</text>
    </revision>
  </page>
</mediawiki>
```
🧹 Nitpick | 🔵 Trivial
Add an empty-output page to this fixture.
Both current pages still produce non-empty cleaned text, so this file cannot exercise the counter skew that happens when a processed page emits nothing. A page like <text>{{template}}</text> would cover the resume failure mode this PR is targeting.
Suggested fixture addition

```diff
 <mediawiki>
   <page>
     <title>Test1</title>
     <revision>
       <text>This is [[sample]] text</text>
     </revision>
   </page>
   <page>
     <title>Test2</title>
     <revision>
       <text>Another {{template}} example</text>
     </revision>
   </page>
+  <page>
+    <title>EmptyAfterClean</title>
+    <revision>
+      <text>{{template}}</text>
+    </revision>
+  </page>
 </mediawiki>
```
Thanks for the detailed review!
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)
215-235: ⚠️ Potential issue | 🟠 Major

Reject checkpoints that are missing a real file_offset.

Line 234 silently turns a missing file_offset into 0. If pages_processed is already non-zero, resume still switches to append mode but skips reconciliation, so legacy/corrupt checkpoints can reproduce the duplicate-tail bug this PR is trying to close. Treat a missing or non-integer offset as an invalid checkpoint once any pages were processed.

Suggested fix

```diff
 pages_processed = data.get("pages_processed")
 stored_identity = data.get("input_identity")
+raw_file_offset = data.get("file_offset")
 current_identity = _compute_input_identity(input_path)

 if not isinstance(pages_processed, int) or pages_processed < 0:
     raise ValueError("Invalid pages_processed value")
+
+if raw_file_offset is None:
+    if pages_processed > 0:
+        raise ValueError("Checkpoint missing file_offset; cannot safely resume")
+    file_offset = 0
+elif not isinstance(raw_file_offset, int) or raw_file_offset < 0:
+    raise ValueError("Invalid file_offset value")
+else:
+    file_offset = raw_file_offset

 if stored_identity != current_identity:
     raise ValueError("Input file changed since checkpoint")
 ...
 return {
     "pages_processed": pages_processed,
     "input_identity": stored_identity,
-    "file_offset": data.get("file_offset", 0),
+    "file_offset": file_offset,
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 215 - 235, The checkpoint loader currently defaults a missing file_offset to 0 which allows resuming with non-zero pages_processed and can reintroduce duplicate-tail bugs; in the function that computes pages_processed/current_identity (referencing pages_processed, stored_identity, current_identity and _compute_input_identity) replace the silent default data.get("file_offset", 0) with explicit validation: read file_offset = data.get("file_offset"), ensure it is an int >= 0, and if pages_processed > 0 treat a missing or non-integer file_offset as an invalid checkpoint by raising ValueError("Invalid or missing file_offset for resumed checkpoint"); keep allowing file_offset=0 only when it is present and valid (or when pages_processed==0).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 334-336: The code currently uses out.tell() after out.flush() to
set file_offset (in the block around out.write/ out.flush / file_offset =
out.tell()), but tell() on a text TextIOWrapper returns an internal position
cookie, not a reliable byte offset; replace that usage by querying the actual
file size via output_path.stat().st_size after flushing and assign that to
file_offset so subsequent binary truncation uses a true byte offset. Locate
occurrences of out.write, out.flush and the file_offset assignment and change
the assignment to use output_path.stat().st_size (keeping the flush) to ensure
accurate byte offsets.
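A small illustration of why the comment prefers stat() over tell(): on a text-mode file, TextIOWrapper.tell() returns an opaque position cookie rather than a guaranteed byte count, while the flushed on-disk size is an honest byte offset. File names here are illustrative:

```python
from pathlib import Path

output_path = Path("tell_demo.txt")
with open(output_path, "w", encoding="utf-8") as out:
    out.write("café\n")  # "é" encodes to two bytes in UTF-8
    out.flush()
    # Persist the true on-disk byte length, not out.tell():
    file_offset = output_path.stat().st_size
print(file_offset)  # → 6
```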
---
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 215-235: The checkpoint loader currently defaults a missing
file_offset to 0 which allows resuming with non-zero pages_processed and can
reintroduce duplicate-tail bugs; in the function that computes
pages_processed/current_identity (referencing pages_processed, stored_identity,
current_identity and _compute_input_identity) replace the silent default
data.get("file_offset", 0) with explicit validation: read file_offset =
data.get("file_offset"), ensure it is an int >= 0, and if pages_processed > 0
treat a missing or non-integer file_offset as an invalid checkpoint by raising
ValueError("Invalid or missing file_offset for resumed checkpoint"); keep
allowing file_offset=0 only when it is present and valid (or when
pages_processed==0).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3c3309e7-6abe-48dc-9eae-af063128b96f
📒 Files selected for processing (1)
openverifiablellm/utils.py
Thanks for pointing that out!
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 317-319: Remove the redundant file seek and existence check: in
the block that handles resuming writes (referencing variables write_mode,
output_path, pages_processed and the call out.seek(0, os.SEEK_END)), delete the
out.seek(...) line and the output_path.exists() condition since files opened
with "a" always append and _load_checkpoint already ensures the file exists when
pages_processed > 0; leave the surrounding resume logic intact.
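The point about append mode can be checked directly: a file opened with "a" positions every write at end-of-file, so an explicit seek adds nothing. A tiny demonstration:

```python
from pathlib import Path

p = Path("append_demo.txt")
p.write_text("head-")
with open(p, "a") as f:  # "a" mode: writes always land at EOF
    f.write("tail")      # no out.seek(0, os.SEEK_END) needed
print(p.read_text())  # → head-tail
```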
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9f7128df-1ef8-4761-8cc3-4f9a6b9a4c7a
📒 Files selected for processing (1)
openverifiablellm/utils.py
Closes #76
Fix: Prevent duplicate output during resume in extract_text_from_xml
Problem
The resume logic relied on checkpoint.json to determine progress, but output
was written continuously. If the process crashed after writing additional
pages but before checkpoint update, the output file could contain more
pages than recorded in the checkpoint.
This caused duplicate content when resuming.
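In miniature, the skew looks like this (all names are illustrative, not the project's code): three pages reach disk, the checkpoint records only two, and a resume that trusts the checkpoint re-appends the tail:

```python
from pathlib import Path

out = Path("crash_demo.txt")
pages = ["page-A\n", "page-B\n", "page-C\n"]

# Crashed run: every page was flushed, but the checkpoint only
# reached 2 before the process died.
out.write_text("".join(pages))
checkpointed_pages = 2

# Naive resume: trust the checkpoint and append the "remaining" pages.
with open(out, "a") as f:
    for page in pages[checkpointed_pages:]:
        f.write(page)

print(out.read_text().count("page-C"))  # → 2, the duplicated tail
```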
Solution
Changes
- extract_text_from_xml() in utils.py

Testing
Impact
Fixes a critical data consistency issue and ensures reliable resume behavior for large dataset preprocessing.
Summary by CodeRabbit
Bug Fixes
Tests