
Fix: Prevent duplicate output during resume in extract_text_from_xml (Issue #76)#80

Open
ANUBprad wants to merge 4 commits into AOSSIE-Org:main from ANUBprad:fix-resume-duplication

Conversation


@ANUBprad ANUBprad commented Mar 21, 2026

Closes #76

Fix: Prevent duplicate output during resume in extract_text_from_xml

Problem

The resume logic relied on checkpoint.json to determine progress, but output
was written continuously. If the process crashed after writing additional
pages but before the checkpoint was updated, the output file could contain
more pages than the checkpoint recorded.

On resume, those unrecorded pages were processed again, producing duplicate content.

Solution

  • Added logic to detect mismatch between output file and checkpoint state
  • Ensured output file is truncated to match checkpoint before resuming
  • Maintained consistency between logical progress and physical output

Changes

  • Added helper functions for counting and truncating processed pages
  • Updated extract_text_from_xml() in utils.py
  • Improved checkpoint reliability
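The reconciliation step described above can be sketched as follows (a minimal sketch: `reconcile_output` and the exact checkpoint layout are illustrative, not the actual helpers in `utils.py`):

```python
import json
import os

def reconcile_output(output_path: str, checkpoint_path: str) -> int:
    """Truncate the output file back to the byte offset recorded in the
    checkpoint, so physical output matches logical progress before resuming."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        file_offset = json.load(f).get("file_offset", 0)
    if os.path.exists(output_path) and os.path.getsize(output_path) > file_offset:
        # The crash happened after extra pages were written but before the
        # checkpoint was updated: drop the unrecorded tail.
        with open(output_path, "r+b") as out:
            out.truncate(file_offset)
    return file_offset
```

With this in place, a resumed run starts appending exactly at the offset the checkpoint vouches for, instead of after whatever partial tail survived the crash.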

Testing

  • Simulated interrupted runs (Ctrl + C)
  • Verified no duplicate content after resume
  • Verified correct truncation behavior when mismatch occurs
  • Tested edge case where output exceeds checkpoint

Impact

Fixes a critical data-consistency issue and ensures reliable resume behavior
for large dataset preprocessing.

Summary by CodeRabbit

  • Bug Fixes

    • Checkpointing now records a stable input identity and exact output byte position to prevent mismatches and improve recovery.
    • More frequent and more robust periodic saves; interruptions now persist the latest output offset for reliable resume.
  • Tests

    • Added an XML test fixture covering pages with link and template content.

@ANUBprad
Author

Hi, I’ve submitted a fix for this issue. Please review.

@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

Walkthrough

extract_text_from_xml() checkpointing now tracks a stable input_identity and an exact output byte offset (file_offset). On resume, the output file is reconciled to file_offset; checkpoints are saved more frequently (every 100 pages) and include file_offset to prevent duplicate appended content.

Changes

Cohort / File(s) Summary
Checkpoint & Resume Logic
openverifiablellm/utils.py
Changed the checkpoint format and APIs to include input_identity and file_offset; reduced CHECKPOINT_INTERVAL to 100. _load_checkpoint() always returns (input_identity, file_offset), computing defaults when missing; _save_checkpoint() now accepts and persists file_offset. extract_text_from_xml() truncates or seeks the output to file_offset, updates file_offset after writes, and saves checkpoints periodically and on interruption or exception.
Test Fixture
test.xml
Added MediaWiki XML test fixture with two pages (Test1, Test2) containing example content for extraction tests.

Sequence Diagram(s)

sequenceDiagram
    participant XML as Input XML
    participant Proc as extract_text_from_xml()
    participant Out as data/processed/wiki_clean.txt
    participant CP as data/checkpoint.json

    XML->>Proc: stream pages
    Proc->>CP: _load_checkpoint() -> (input_identity, file_offset)
    alt output ahead of checkpoint
        Proc->>Out: truncate file to file_offset
    end
    Proc->>Out: write cleaned chunks
    Proc->>Out: flush & update file_offset = Out.size()
    Proc->>CP: _save_checkpoint(input_identity, pages_processed, file_offset) [periodic / on-exit]
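The flow in the diagram might look roughly like this in code (a sketch under assumed names: `save_checkpoint` and `write_pages` are illustrative, only the checkpoint fields and the 100-page cadence come from the summary above):

```python
import json
import os

CHECKPOINT_INTERVAL = 100  # pages between periodic checkpoint saves

def save_checkpoint(path, pages_processed, input_identity, file_offset):
    """Persist logical progress plus the physical byte offset of the output."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"pages_processed": pages_processed,
                   "input_identity": input_identity,
                   "file_offset": file_offset}, f)

def write_pages(cleaned_pages, output_path, checkpoint_path, input_identity):
    """Append cleaned page text, checkpointing every CHECKPOINT_INTERVAL pages."""
    pages_written = 0
    with open(output_path, "a", encoding="utf-8") as out:
        for text in cleaned_pages:
            out.write(text)
            pages_written += 1
            if pages_written % CHECKPOINT_INTERVAL == 0:
                out.flush()
                # stat() reports the true on-disk byte size after the flush;
                # text-mode tell() returns an opaque cookie, not a byte count.
                offset = os.stat(output_path).st_size
                save_checkpoint(checkpoint_path, pages_written,
                                input_identity, offset)
    return pages_written
```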

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #68: Directly modifies checkpoint and resume logic in openverifiablellm/utils.py, including tracking input_identity and file_offset.

Suggested labels

Python Lang

Poem

🐰 I nibble bytes and mark the place,
Where pages pause in time and space,
I trim the tail and mend the thread,
So resumed runs won’t double-read. 🥕

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title directly addresses the main change: preventing duplicate output during resume in extract_text_from_xml, matching the core objective of fixing issue #76.
  • Linked Issues Check — ✅ Passed. The PR implements all key objectives from issue #76: it detects mismatches between the output file and checkpoint, truncates the output to the checkpoint state, maintains file-offset consistency, and prevents duplicate content on resume.
  • Out of Scope Changes Check — ✅ Passed. All changes are directly related to fixing the resume-duplication bug: checkpoint-tracking modifications, file-offset management, and a test-fixture XML file are all within scope.


@github-actions github-actions bot added size/M and removed size/M labels Mar 21, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 318-320: The code recomputes the input SHA-256 by calling
_compute_input_identity(input_path) instead of reusing the already-validated
identity from _load_checkpoint(), which wastes time on large inputs; change
_load_checkpoint() to return the computed input_identity (or add a second return
value) and update the caller (the code that currently sets input_identity =
_compute_input_identity(input_path)) to use the identity returned by
_load_checkpoint() instead of calling _compute_input_identity again; touch the
_load_checkpoint() signature and its call sites (including the place in
openverifiablellm/utils.py where input_identity is set) so the validated
identity is propagated and reused.
- Around line 255-269: count_written_pages currently infers progress by
splitting wiki_clean.txt contents which is lossy and memory-heavy; change it
(and the similar helpers around the other ranges) to persist and return a
physical byte offset (e.g. use the file on-disk size or last-flushed byte count)
instead of counting non-empty page blocks so resumptions can truncate by byte
offset; locate count_written_pages and the similar functions referenced in the
comment, have them return the output file's flushed byte length (via os/stat or
opening the file and seeking to end) and update any truncation logic to truncate
the output file by that byte offset rather than by counting pages.
- Around line 352-356: The checkpointing uses a hardcoded `100` instead of the
shared `CHECKPOINT_INTERVAL`, creating two sources of truth; update the
conditional that checks `pages_written % 100 == 0` to use the
`CHECKPOINT_INTERVAL` constant and ensure any default/definition of
`CHECKPOINT_INTERVAL` matches the intended cadence, leaving the `out.flush()`
and `_save_checkpoint(checkpoint_path, pages_written, input_identity)` calls
unchanged so checkpointing is routed consistently through the
`CHECKPOINT_INTERVAL` setting.

In `@test.xml`:
- Around line 1-14: Add a new <page> entry to the XML fixture that will produce
an empty cleaned output to exercise the resume/skew case; create a page element
(use the same structure as existing pages: <page>, <title>, <revision>, <text>)
with a title like "EmptyOutput" and a <text>{{template}}</text> (or other
content that cleans to nothing) so that processing of that page emits no cleaned
text and triggers the edge case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d571491-45b0-4620-a698-6e91c8bb38ab

📥 Commits

Reviewing files that changed from the base of the PR and between 578bc79 and 1f6dd72.

📒 Files selected for processing (2)
  • openverifiablellm/utils.py
  • test.xml

Comment on lines +1 to +14
<mediawiki>
<page>
<title>Test1</title>
<revision>
<text>This is [[sample]] text</text>
</revision>
</page>
<page>
<title>Test2</title>
<revision>
<text>Another {{template}} example</text>
</revision>
</page>
</mediawiki> No newline at end of file
Contributor


🧹 Nitpick | 🔵 Trivial

Add an empty-output page to this fixture.

Both current pages still produce non-empty cleaned text, so this file cannot exercise the counter skew that happens when a processed page emits nothing. A page like <text>{{template}}</text> would cover the resume failure mode this PR is targeting.

Suggested fixture addition
 <mediawiki>
   <page>
     <title>Test1</title>
     <revision>
       <text>This is [[sample]] text</text>
     </revision>
   </page>
   <page>
     <title>Test2</title>
     <revision>
       <text>Another {{template}} example</text>
     </revision>
   </page>
+  <page>
+    <title>EmptyAfterClean</title>
+    <revision>
+      <text>{{template}}</text>
+    </revision>
+  </page>
 </mediawiki>

@ANUBprad
Author

Thanks for the detailed review!
I've updated the implementation to use file offsets instead of page counting and addressed all the suggested changes. Please let me know if anything else needs improvement.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

215-235: ⚠️ Potential issue | 🟠 Major

Reject checkpoints that are missing a real file_offset.

Line 234 silently turns a missing file_offset into 0. If pages_processed is already non-zero, resume still switches to append mode but skips reconciliation, so legacy/corrupt checkpoints can reproduce the duplicate-tail bug this PR is trying to close. Treat a missing or non-integer offset as an invalid checkpoint once any pages were processed.

Suggested fix
         pages_processed = data.get("pages_processed")
         stored_identity = data.get("input_identity")
+        raw_file_offset = data.get("file_offset")

         current_identity = _compute_input_identity(input_path)

         if not isinstance(pages_processed, int) or pages_processed < 0:
             raise ValueError("Invalid pages_processed value")
+
+        if raw_file_offset is None:
+            if pages_processed > 0:
+                raise ValueError("Checkpoint missing file_offset; cannot safely resume")
+            file_offset = 0
+        elif not isinstance(raw_file_offset, int) or raw_file_offset < 0:
+            raise ValueError("Invalid file_offset value")
+        else:
+            file_offset = raw_file_offset

         if stored_identity != current_identity:
             raise ValueError("Input file changed since checkpoint")
...
         return {
             "pages_processed": pages_processed,
             "input_identity": stored_identity,
-            "file_offset": data.get("file_offset", 0),
+            "file_offset": file_offset,
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 215 - 235, The checkpoint loader
currently defaults a missing file_offset to 0 which allows resuming with
non-zero pages_processed and can reintroduce duplicate-tail bugs; in the
function that computes pages_processed/current_identity (referencing
pages_processed, stored_identity, current_identity and _compute_input_identity)
replace the silent default data.get("file_offset", 0) with explicit validation:
read file_offset = data.get("file_offset"), ensure it is an int >= 0, and if
pages_processed > 0 treat a missing or non-integer file_offset as an invalid
checkpoint by raising ValueError("Invalid or missing file_offset for resumed
checkpoint"); keep allowing file_offset=0 only when it is present and valid (or
when pages_processed==0).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 334-336: The code currently uses out.tell() after out.flush() to
set file_offset (in the block around out.write/ out.flush / file_offset =
out.tell()), but tell() on a text TextIOWrapper returns an internal position
cookie, not a reliable byte offset; replace that usage by querying the actual
file size via output_path.stat().st_size after flushing and assign that to
file_offset so subsequent binary truncation uses a true byte offset. Locate
occurrences of out.write, out.flush and the file_offset assignment and change
the assignment to use output_path.stat().st_size (keeping the flush) to ensure
accurate byte offsets.
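The distinction matters because `tell()` on a text-mode `TextIOWrapper` returns an opaque position cookie rather than a guaranteed byte count, while `os.stat` reports the real on-disk size after a flush. A minimal illustration:

```python
import os
import tempfile

# After flushing, os.stat gives the true byte size of the file, which is
# what a later binary truncate() needs; text-mode tell() is not guaranteed
# to be a plain byte count.
tmp = tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False)
tmp.write("héllo")          # 5 characters, but 6 bytes in UTF-8
tmp.flush()
file_offset = os.stat(tmp.name).st_size
tmp.close()
os.unlink(tmp.name)

assert file_offset == 6     # 'é' encodes as two bytes
```

The byte/character mismatch only shows up with multi-byte content, which is exactly what Wikipedia text contains.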

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 215-235: The checkpoint loader currently defaults a missing
file_offset to 0 which allows resuming with non-zero pages_processed and can
reintroduce duplicate-tail bugs; in the function that computes
pages_processed/current_identity (referencing pages_processed, stored_identity,
current_identity and _compute_input_identity) replace the silent default
data.get("file_offset", 0) with explicit validation: read file_offset =
data.get("file_offset"), ensure it is an int >= 0, and if pages_processed > 0
treat a missing or non-integer file_offset as an invalid checkpoint by raising
ValueError("Invalid or missing file_offset for resumed checkpoint"); keep
allowing file_offset=0 only when it is present and valid (or when
pages_processed==0).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3c3309e7-6abe-48dc-9eae-af063128b96f

📥 Commits

Reviewing files that changed from the base of the PR and between 1f6dd72 and ccd48d7.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@ANUBprad
Author

Thanks for pointing that out!
I've updated it to use the actual file size via stat().st_size to ensure correct truncation behavior.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 317-319: Remove the redundant file seek and existence check: in
the block that handles resuming writes (referencing variables write_mode,
output_path, pages_processed and the call out.seek(0, os.SEEK_END)), delete the
out.seek(...) line and the output_path.exists() condition since files opened
with "a" always append and _load_checkpoint already ensures the file exists when
pages_processed > 0; leave the surrounding resume logic intact.
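For reference, append mode already guarantees the behavior the seek was trying to ensure: writes to a file opened with `"a"` always land at end-of-file regardless of the current position, which is why the explicit seek is redundant. A quick demonstration:

```python
import os
import tempfile

# Files opened with mode "a" direct every write to end-of-file, so an
# explicit seek(0, os.SEEK_END) before writing has no effect on the result.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")
with open(path, "a", encoding="utf-8") as f:
    f.seek(0)               # ignored for writes in append mode
    f.write("second\n")
with open(path, "r", encoding="utf-8") as f:
    content = f.read()
os.unlink(path)

assert content == "first\nsecond\n"
```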

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f7128df-1ef8-4761-8cc3-4f9a6b9a4c7a

📥 Commits

Reviewing files that changed from the base of the PR and between ccd48d7 and 2f13a4e.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py



Development

Successfully merging this pull request may close these issues.

[BUG]: Resume preprocessing can duplicate output after crash between checkpoint saves

1 participant