
Fix: Prevent duplicate output during resume in extract_text_from_xml (Issue #76)#80

Open
ANUBprad wants to merge 4 commits into AOSSIE-Org:main from ANUBprad:fix-resume-duplication

Conversation


@ANUBprad ANUBprad commented Mar 21, 2026

Closes #76

Fix: Prevent duplicate output during resume in extract_text_from_xml

Problem

The resume logic relied on checkpoint.json to determine progress, but output
was written continuously. If the process crashed after writing additional
pages but before the checkpoint was updated, the output file could contain
more pages than the checkpoint recorded.

On resume, those unrecorded pages were processed again, producing duplicate content.

Solution

  • Added logic to detect mismatch between output file and checkpoint state
  • Ensured output file is truncated to match checkpoint before resuming
  • Maintained consistency between logical progress and physical output

Changes

  • Added helper functions for counting and truncating processed pages
  • Updated extract_text_from_xml() in utils.py
  • Improved checkpoint reliability
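The reconciliation step described above can be sketched as follows (a minimal sketch: `reconcile_output` and the exact checkpoint layout are illustrative, not the actual helpers in `utils.py`):

```python
import json
import os

def reconcile_output(output_path: str, checkpoint_path: str) -> int:
    """Truncate the output file back to the byte offset recorded in the
    checkpoint, so physical output matches logical progress before resuming."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        file_offset = json.load(f).get("file_offset", 0)
    if os.path.exists(output_path) and os.path.getsize(output_path) > file_offset:
        # The crash happened after extra pages were written but before the
        # checkpoint was updated: drop the unrecorded tail.
        with open(output_path, "r+b") as out:
            out.truncate(file_offset)
    return file_offset
```

With this in place, a resumed run starts appending exactly at the offset the checkpoint vouches for, instead of after whatever partial tail survived the crash.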

Testing

  • Simulated interrupted runs (Ctrl + C)
  • Verified no duplicate content after resume
  • Verified correct truncation behavior when mismatch occurs
  • Tested edge case where output exceeds checkpoint

Impact

Fixes a critical data-consistency issue and ensures reliable resume behavior
for large dataset preprocessing.

Summary by CodeRabbit

  • Bug Fixes

    • Checkpointing now records a stable input identity and exact output byte position to prevent mismatches and improve recovery.
    • More frequent and more robust periodic saves; interruptions now persist the latest output offset for reliable resume.
  • Tests

    • Added an XML test fixture covering pages with link and template content.

@ANUBprad
Author

Hi, I’ve submitted a fix for this issue. Please review.

@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

Walkthrough

extract_text_from_xml() checkpointing now tracks a stable input_identity and an exact output byte offset (file_offset). On resume, the output file is reconciled to file_offset; checkpoints are saved more frequently (every 100 pages) and include file_offset to prevent duplicate appended content.

Changes

Cohort / File(s) Summary
Checkpoint & Resume Logic
openverifiablellm/utils.py
Changed the checkpoint format and APIs to include input_identity and file_offset; reduced CHECKPOINT_INTERVAL to 100. _load_checkpoint() always returns (input_identity, file_offset), computing defaults when missing; _save_checkpoint() now accepts and persists file_offset. extract_text_from_xml() truncates or seeks the output to file_offset, updates file_offset after writes, and saves checkpoints periodically and on interruption or exception.
Test Fixture
test.xml
Added MediaWiki XML test fixture with two pages (Test1, Test2) containing example content for extraction tests.

Sequence Diagram(s)

sequenceDiagram
    participant XML as Input XML
    participant Proc as extract_text_from_xml()
    participant Out as data/processed/wiki_clean.txt
    participant CP as data/checkpoint.json

    XML->>Proc: stream pages
    Proc->>CP: _load_checkpoint() -> (input_identity, file_offset)
    alt output ahead of checkpoint
        Proc->>Out: truncate file to file_offset
    end
    Proc->>Out: write cleaned chunks
    Proc->>Out: flush & update file_offset = Out.size()
    Proc->>CP: _save_checkpoint(input_identity, pages_processed, file_offset) [periodic / on-exit]
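The flow in the diagram might look roughly like this in code (a sketch under assumed names: `save_checkpoint` and `write_pages` are illustrative, only the checkpoint fields and the 100-page cadence come from the summary above):

```python
import json
import os

CHECKPOINT_INTERVAL = 100  # pages between periodic checkpoint saves

def save_checkpoint(path, pages_processed, input_identity, file_offset):
    """Persist logical progress plus the physical byte offset of the output."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"pages_processed": pages_processed,
                   "input_identity": input_identity,
                   "file_offset": file_offset}, f)

def write_pages(cleaned_pages, output_path, checkpoint_path, input_identity):
    """Append cleaned page text, checkpointing every CHECKPOINT_INTERVAL pages."""
    pages_written = 0
    with open(output_path, "a", encoding="utf-8") as out:
        for text in cleaned_pages:
            out.write(text)
            pages_written += 1
            if pages_written % CHECKPOINT_INTERVAL == 0:
                out.flush()
                # stat() reports the true on-disk byte size after the flush;
                # text-mode tell() returns an opaque cookie, not a byte count.
                offset = os.stat(output_path).st_size
                save_checkpoint(checkpoint_path, pages_written,
                                input_identity, offset)
    return pages_written
```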

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #68: Directly modifies checkpoint and resume logic in openverifiablellm/utils.py, including tracking input_identity and file_offset.

Suggested labels

Python Lang

Poem

🐰 I nibble bytes and mark the place,
Where pages pause in time and space,
I trim the tail and mend the thread,
So resumed runs won’t double-read. 🥕

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title directly addresses the main change: preventing duplicate output during resume in extract_text_from_xml, matching the core objective of fixing issue #76.
  • Linked Issues Check — ✅ Passed. The PR implements all key objectives from issue #76: it detects mismatches between the output file and checkpoint, truncates the output to the checkpoint state, maintains file-offset consistency, and prevents duplicate content on resume.
  • Out of Scope Changes Check — ✅ Passed. All changes are directly related to fixing the resume-duplication bug: checkpoint-tracking modifications, file-offset management, and a test-fixture XML file are all within scope.


@github-actions github-actions bot added size/M and removed size/M labels Mar 21, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 318-320: The code recomputes the input SHA-256 by calling
_compute_input_identity(input_path) instead of reusing the already-validated
identity from _load_checkpoint(), which wastes time on large inputs; change
_load_checkpoint() to return the computed input_identity (or add a second return
value) and update the caller (the code that currently sets input_identity =
_compute_input_identity(input_path)) to use the identity returned by
_load_checkpoint() instead of calling _compute_input_identity again; touch the
_load_checkpoint() signature and its call sites (including the place in
openverifiablellm/utils.py where input_identity is set) so the validated
identity is propagated and reused.
- Around line 255-269: count_written_pages currently infers progress by
splitting wiki_clean.txt contents which is lossy and memory-heavy; change it
(and the similar helpers around the other ranges) to persist and return a
physical byte offset (e.g. use the file on-disk size or last-flushed byte count)
instead of counting non-empty page blocks so resumptions can truncate by byte
offset; locate count_written_pages and the similar functions referenced in the
comment, have them return the output file's flushed byte length (via os/stat or
opening the file and seeking to end) and update any truncation logic to truncate
the output file by that byte offset rather than by counting pages.
- Around line 352-356: The checkpointing uses a hardcoded `100` instead of the
shared `CHECKPOINT_INTERVAL`, creating two sources of truth; update the
conditional that checks `pages_written % 100 == 0` to use the
`CHECKPOINT_INTERVAL` constant and ensure any default/definition of
`CHECKPOINT_INTERVAL` matches the intended cadence, leaving the `out.flush()`
and `_save_checkpoint(checkpoint_path, pages_written, input_identity)` calls
unchanged so checkpointing is routed consistently through the
`CHECKPOINT_INTERVAL` setting.

In `@test.xml`:
- Around line 1-14: Add a new <page> entry to the XML fixture that will produce
an empty cleaned output to exercise the resume/skew case; create a page element
(use the same structure as existing pages: <page>, <title>, <revision>, <text>)
with a title like "EmptyOutput" and a <text>{{template}}</text> (or other
content that cleans to nothing) so that processing of that page emits no cleaned
text and triggers the edge case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d571491-45b0-4620-a698-6e91c8bb38ab

📥 Commits

Reviewing files that changed from the base of the PR and between 578bc79 and 1f6dd72.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • test.xml

Comment on lines +1 to +14
<mediawiki>
<page>
<title>Test1</title>
<revision>
<text>This is [[sample]] text</text>
</revision>
</page>
<page>
<title>Test2</title>
<revision>
<text>Another {{template}} example</text>
</revision>
</page>
</mediawiki> No newline at end of file
Contributor


🧹 Nitpick | 🔵 Trivial

Add an empty-output page to this fixture.

Both current pages still produce non-empty cleaned text, so this file cannot exercise the counter skew that happens when a processed page emits nothing. A page like <text>{{template}}</text> would cover the resume failure mode this PR is targeting.

Suggested fixture addition
 <mediawiki>
   <page>
     <title>Test1</title>
     <revision>
       <text>This is [[sample]] text</text>
     </revision>
   </page>
   <page>
     <title>Test2</title>
     <revision>
       <text>Another {{template}} example</text>
     </revision>
   </page>
+  <page>
+    <title>EmptyAfterClean</title>
+    <revision>
+      <text>{{template}}</text>
+    </revision>
+  </page>
 </mediawiki>

@ANUBprad
Author

Thanks for the detailed review!
I've updated the implementation to use file offsets instead of page counting and addressed all the suggested changes. Please let me know if anything else needs improvement.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

215-235: ⚠️ Potential issue | 🟠 Major

Reject checkpoints that are missing a real file_offset.

Line 234 silently turns a missing file_offset into 0. If pages_processed is already non-zero, resume still switches to append mode but skips reconciliation, so legacy/corrupt checkpoints can reproduce the duplicate-tail bug this PR is trying to close. Treat a missing or non-integer offset as an invalid checkpoint once any pages were processed.

Suggested fix
         pages_processed = data.get("pages_processed")
         stored_identity = data.get("input_identity")
+        raw_file_offset = data.get("file_offset")

         current_identity = _compute_input_identity(input_path)

         if not isinstance(pages_processed, int) or pages_processed < 0:
             raise ValueError("Invalid pages_processed value")
+
+        if raw_file_offset is None:
+            if pages_processed > 0:
+                raise ValueError("Checkpoint missing file_offset; cannot safely resume")
+            file_offset = 0
+        elif not isinstance(raw_file_offset, int) or raw_file_offset < 0:
+            raise ValueError("Invalid file_offset value")
+        else:
+            file_offset = raw_file_offset

         if stored_identity != current_identity:
             raise ValueError("Input file changed since checkpoint")
...
         return {
             "pages_processed": pages_processed,
             "input_identity": stored_identity,
-            "file_offset": data.get("file_offset", 0),
+            "file_offset": file_offset,
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 215 - 235, The checkpoint loader
currently defaults a missing file_offset to 0 which allows resuming with
non-zero pages_processed and can reintroduce duplicate-tail bugs; in the
function that computes pages_processed/current_identity (referencing
pages_processed, stored_identity, current_identity and _compute_input_identity)
replace the silent default data.get("file_offset", 0) with explicit validation:
read file_offset = data.get("file_offset"), ensure it is an int >= 0, and if
pages_processed > 0 treat a missing or non-integer file_offset as an invalid
checkpoint by raising ValueError("Invalid or missing file_offset for resumed
checkpoint"); keep allowing file_offset=0 only when it is present and valid (or
when pages_processed==0).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 334-336: The code currently uses out.tell() after out.flush() to
set file_offset (in the block around out.write/ out.flush / file_offset =
out.tell()), but tell() on a text TextIOWrapper returns an internal position
cookie, not a reliable byte offset; replace that usage by querying the actual
file size via output_path.stat().st_size after flushing and assign that to
file_offset so subsequent binary truncation uses a true byte offset. Locate
occurrences of out.write, out.flush and the file_offset assignment and change
the assignment to use output_path.stat().st_size (keeping the flush) to ensure
accurate byte offsets.
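The distinction matters because `tell()` on a text-mode `TextIOWrapper` returns an opaque position cookie rather than a guaranteed byte count, while `os.stat` reports the real on-disk size after a flush. A minimal illustration:

```python
import os
import tempfile

# After flushing, os.stat gives the true byte size of the file, which is
# what a later binary truncate() needs; text-mode tell() is not guaranteed
# to be a plain byte count.
tmp = tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False)
tmp.write("héllo")          # 5 characters, but 6 bytes in UTF-8
tmp.flush()
file_offset = os.stat(tmp.name).st_size
tmp.close()
os.unlink(tmp.name)

assert file_offset == 6     # 'é' encodes as two bytes
```

The byte/character mismatch only shows up with multi-byte content, which is exactly what Wikipedia text contains.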

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 215-235: The checkpoint loader currently defaults a missing
file_offset to 0 which allows resuming with non-zero pages_processed and can
reintroduce duplicate-tail bugs; in the function that computes
pages_processed/current_identity (referencing pages_processed, stored_identity,
current_identity and _compute_input_identity) replace the silent default
data.get("file_offset", 0) with explicit validation: read file_offset =
data.get("file_offset"), ensure it is an int >= 0, and if pages_processed > 0
treat a missing or non-integer file_offset as an invalid checkpoint by raising
ValueError("Invalid or missing file_offset for resumed checkpoint"); keep
allowing file_offset=0 only when it is present and valid (or when
pages_processed==0).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3c3309e7-6abe-48dc-9eae-af063128b96f

📥 Commits

Reviewing files that changed from the base of the PR and between 1f6dd72 and ccd48d7.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@ANUBprad
Author

Thanks for pointing that out!
I've updated it to use the actual file size via stat().st_size to ensure correct truncation behavior.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 317-319: Remove the redundant file seek and existence check: in
the block that handles resuming writes (referencing variables write_mode,
output_path, pages_processed and the call out.seek(0, os.SEEK_END)), delete the
out.seek(...) line and the output_path.exists() condition since files opened
with "a" always append and _load_checkpoint already ensures the file exists when
pages_processed > 0; leave the surrounding resume logic intact.
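For reference, append mode already guarantees the behavior the seek was trying to ensure: writes to a file opened with `"a"` always land at end-of-file regardless of the current position, which is why the explicit seek is redundant. A quick demonstration:

```python
import os
import tempfile

# Files opened with mode "a" direct every write to end-of-file, so an
# explicit seek(0, os.SEEK_END) before writing has no effect on the result.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")
with open(path, "a", encoding="utf-8") as f:
    f.seek(0)               # ignored for writes in append mode
    f.write("second\n")
with open(path, "r", encoding="utf-8") as f:
    content = f.read()
os.unlink(path)

assert content == "first\nsecond\n"
```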

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f7128df-1ef8-4761-8cc3-4f9a6b9a4c7a

📥 Commits

Reviewing files that changed from the base of the PR and between ccd48d7 and 2f13a4e.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py



Development

Successfully merging this pull request may close these issues.

[BUG]: Resume preprocessing can duplicate output after crash between checkpoint saves

1 participant