
Incremental Preprocessing with Checkpoints to avoid halfway fail-restart #68

Merged
Archit381 merged 12 commits into AOSSIE-Org:main from aniket866:resume-preprocessing
Mar 17, 2026
Conversation

@aniket866
Contributor

@aniket866 aniket866 commented Mar 11, 2026

Addressed Issues:

Fixes #63

Problem: Preprocessing huge Wikipedia dumps takes hours. If something fails halfway, you have to restart from the beginning.

Solution: Add checkpoints: save progress periodically and resume from the last saved point.
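The save/resume flow can be sketched as below. The helper names echo the ones this PR adds (`_save_checkpoint`, `_load_checkpoint`), but the bodies here are a minimal illustration, not the actual implementation:

```python
import json
import os
import tempfile
from pathlib import Path

def save_checkpoint(path: Path, pages_processed: int, input_identity: str) -> None:
    # Write to a temp file, then atomically rename over the target, so a
    # crash mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"pages_processed": pages_processed,
                   "input_identity": input_identity}, f)
    os.replace(tmp, path)

def load_checkpoint(path: Path, input_identity: str) -> int:
    # Resume only if the checkpoint exists, parses, and matches this input.
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return 0
    if data.get("input_identity") != input_identity:
        return 0  # checkpoint belongs to a different dump: start fresh
    return int(data.get("pages_processed", 0))
```

On resume, the extractor skips the first `pages_processed` pages and opens the output file in append mode; on success it deletes the checkpoint file.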

Step 1 — Download the dump

Windows / Linux / Mac

uv run python scripts/download_dump.py --wiki simplewiki --date 20260201 --output-dir data/raw

Step 2 — Start preprocessing and interrupt it

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Wait a few seconds, then press Ctrl+C.


Step 3 — Confirm checkpoint was saved

Windows (CMD)

type data\processed\wiki_clean.checkpoint.json

Linux / Mac

cat data/processed/wiki_clean.checkpoint.json

Expected:

{"pages_processed": 10000}

Note the line count of the output so far:

Windows (CMD)

find /c /v "" data\processed\wiki_clean.txt

Linux / Mac

wc -l data/processed/wiki_clean.txt

Write this number down.


Step 4 — Resume and confirm it continues, not restarts

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Watch the logs — you should see:

INFO - Resuming from checkpoint: 10000 pages already processed

Check line count again — must be higher than Step 3:

Windows (CMD)

find /c /v "" data\processed\wiki_clean.txt

Linux / Mac

wc -l data/processed/wiki_clean.txt

Step 5 — Confirm checkpoint is deleted after success

Windows (CMD)

dir data\processed\wiki_clean.checkpoint.json

Linux / Mac

ls data/processed/wiki_clean.checkpoint.json

Expected: File not found — deleted automatically on successful completion.


Step 6 — Verify fresh restart works

Delete output and checkpoint, then re-run:

Windows (CMD)

del /f data\processed\wiki_clean.checkpoint.json
del /f data\processed\wiki_clean.txt

Linux / Mac

rm -f data/processed/wiki_clean.checkpoint.json
rm -f data/processed/wiki_clean.txt

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Logs should show no "Resuming from checkpoint" — starts clean.


Step 7 — Run manifest verification

Windows / Linux / Mac

uv run python -m openverifiablellm.verify data/raw/simplewiki-20260201-pages-articles.xml.bz2

Expected: ALL CHECKS PASSED — the resumed output is identical to a full uninterrupted run.


Screenshots/Recordings:

Additional Notes:

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Checkpoint-based resumable processing for large operations
    • Manifest generation records links to prior manifests
    • Manifest hashing now excludes the stored predecessor reference so hashes reflect manifest content only
    • New hashing utility exposing raw hash bytes
  • Tests

    • Comprehensive test suite for manifest chain verification, tamper detection, and backward-compatibility scenarios

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Removes parent_manifest_hash from dict manifests before hashing; adds checkpointing and resume support to XML text extraction with periodic saves and atomic checkpoint writes; generate_manifest now records parent_manifest_hash; comprehensive manifest-chain verification tests added.

Changes

  • Manifest Chain Core (openverifiablellm/manifest_chain.py): compute_manifest_hash now strips the parent_manifest_hash field from in-memory dict manifests before canonical serialization and hashing, so the computed hash excludes chain metadata.
  • Preprocessing Pipeline (openverifiablellm/utils.py): added checkpointing constants and helpers (CHECKPOINT_INTERVAL, _checkpoint_path, _compute_input_identity, _load_checkpoint, _save_checkpoint); extract_text_from_xml supports resume/append, periodic flush/checkpoint, and atomic checkpoint removal; generate_manifest fetches and inserts parent_manifest_hash. Also split the SHA-256 helpers: compute_sha256_bytes (raw bytes) and compute_sha256 (hex digest).
  • Tests (tests/test_manifest_chain.py): new test suite covering canonical JSON serialization, manifest hashing (dict and file inputs), parent-hash retrieval, link validation, full-chain verification, tamper scenarios, and backward-compatibility cases.
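The hash-exclusion behavior described above can be sketched as follows. This is an illustration of the idea (drop the chain-link field, canonicalize, hash), not the project's actual compute_manifest_hash:

```python
import hashlib
import json

def compute_manifest_hash(manifest: dict) -> str:
    # Exclude the chain-link field so the hash reflects the manifest's own
    # content, not its position in the chain; then hash canonical JSON
    # (sorted keys, compact separators) for a stable digest.
    content = {k: v for k, v in manifest.items() if k != "parent_manifest_hash"}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this scheme, adding or changing parent_manifest_hash leaves a manifest's hash unchanged, which is what lets each manifest record its predecessor's hash without perturbing its own.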

Sequence Diagram(s)

sequenceDiagram
    participant Extract as Extractor
    participant Check as CheckpointStore
    participant FS as FileStorage
    participant Gen as ManifestGenerator
    participant Hash as HashComputer

    Extract->>Check: load checkpoint (if exists)
    alt checkpoint found
        Check-->>Extract: processed_pages info
        Extract->>FS: open output file (append)
    else no checkpoint
        Extract->>FS: open output file (write)
    end

    loop per page
        Extract->>FS: write page text
        Extract->>Check: periodically save checkpoint
    end

    Extract->>Check: remove checkpoint on success

    Gen->>FS: get_parent_manifest_hash()
    FS-->>Gen: parent_hash (if exists)
    Gen->>Hash: compute_manifest_hash(manifest)  -- removes parent_manifest_hash before canonicalizing
    Hash-->>Gen: manifest_hash
    Gen->>FS: write manifest including `parent_manifest_hash`

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Python Lang, Documentation, Linter

Suggested reviewers

  • Archit381

Poem

🐰 I hop through manifests, neat and fast,
I skip my parent's name when hashing's cast,
Checkpoints cradle pages when runs go long,
I stitch the chain and hum a little song,
Small paws guard hashes, steady and steadfast.

🚥 Pre-merge checks | ✅ 4 passed

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'Incremental Preprocessing with Checkpoints to avoid halfway fail-restart' directly and clearly describes the main change: adding checkpoint support for resumable preprocessing.
  • Linked Issues Check ✅ Passed: the PR implements all primary objectives from issue #63: checkpointing infrastructure, resume capability, periodic saves, checkpoint removal on completion, and manifest chain verification for validation.
  • Out of Scope Changes Check ✅ Passed: changes align with scope: manifest_chain.py handles hash computation excluding parent references; utils.py adds checkpointing and integrates parent manifest tracking; tests verify both features comprehensively.


@aniket866 aniket866 marked this pull request as ready for review March 16, 2026 17:21
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 309-334: The variable pages_written is misleading because it
increments for every page processed (in the loop that iterates ET.iterparse)
even when no text is written; rename it to pages_processed (or
total_pages_handled) to reflect its actual meaning, and update all references
(initialization from pages_already_done, the increment inside the for _, elem in
context loop, and any external uses) so the counter semantics match;
alternatively, if you truly need a count of pages that produced output, keep
pages_written and only increment it when cleaned text is non-empty (i.e., after
out.write(cleaned + "\n\n")), and introduce pages_processed for the per-page
count.
- Around line 197-202: The current _compute_input_identity function swallows all
exceptions and returns an empty string which can falsely match during resume;
change _compute_input_identity to not return "" on error—either let exceptions
propagate (remove the broad try/except) or return a clearly distinct sentinel
(e.g., None) so it cannot compare equal to a valid checksum; then update
_load_checkpoint to explicitly catch the propagated exception or handle the
sentinel and decide whether to fail loudly or start fresh, referencing the same
function names (_compute_input_identity and _load_checkpoint) when making the
edits.
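One way to follow that suggestion is to let I/O errors propagate rather than returning "". The identity scheme below (hash of a file prefix plus size) is hypothetical; the point is the error behavior, since an empty-string identity would compare equal to an empty-string field in a corrupt checkpoint:

```python
import hashlib
from pathlib import Path

def compute_input_identity(input_path: Path) -> str:
    # Hash the first 1 MiB plus the file size. No try/except here: a missing
    # or unreadable input raises, instead of silently returning a value that
    # could falsely match during resume. The caller decides whether to fail
    # loudly or start fresh.
    h = hashlib.sha256()
    with open(input_path, "rb") as f:
        h.update(f.read(1024 * 1024))
    h.update(str(input_path.stat().st_size).encode())
    return h.hexdigest()
```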

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0cc96972-b1fc-40e5-a8f3-6c950c975463

📥 Commits

Reviewing files that changed from the base of the PR and between 6b00bd4 and 40e2f40.

📒 Files selected for processing (2)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/utils.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 234-242: _save_checkpoint currently calls
_compute_input_identity(input_path) on every checkpoint which re-hashes the
entire input file repeatedly; change the code so the input identity is computed
once and reused: compute _compute_input_identity(input_path) once before the
preprocessing loop (or at the top-level loader), pass that cached identity into
_save_checkpoint as an argument (e.g., add parameter input_identity: str and
remove the internal call), and update all callers (the places around the other
checkpoint calls referenced) to supply the precomputed identity instead of
letting _save_checkpoint recompute it.
- Around line 338-342: The exception handler currently uses a broad `except
Exception` which does not catch KeyboardInterrupt, so when Ctrl+C is used the
checkpoint (_save_checkpoint) isn't saved; modify the try/except to add an
explicit `except KeyboardInterrupt:` (or add KeyboardInterrupt to the exception
tuple) that calls `_save_checkpoint(checkpoint_path, pages_written,
input_path)`, logs via `logger.error("Processing interrupted after %d pages. Run
again to resume.", pages_written)`, and then re-raises the KeyboardInterrupt to
preserve normal interrupt behavior; keep the existing generic `except
Exception:` after that to handle other errors and re-raise them as before.
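The handler structure the review asks for looks roughly like this. The processing loop is reduced to a callback, and the checkpoint/logging details are elided; only the exception layout is the point:

```python
def run_with_checkpoints(process_pages, save_checkpoint):
    # KeyboardInterrupt derives from BaseException, not Exception, so a bare
    # `except Exception` never sees Ctrl+C; it needs its own branch.
    pages_written = 0
    try:
        for _ in process_pages():
            pages_written += 1
    except KeyboardInterrupt:
        save_checkpoint(pages_written)
        raise  # re-raise to preserve normal interrupt behavior
    except Exception:
        save_checkpoint(pages_written)
        raise
    return pages_written
```

Both branches save before re-raising, so an interrupted run can resume whether it died from Ctrl+C or from an ordinary error.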

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b9aff65d-a4f2-4c16-9426-f9d2482e5335

📥 Commits

Reviewing files that changed from the base of the PR and between 40e2f40 and e2c3017.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
@Archit381
Member

Fix lint issues and do requested code-rabbit changes

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
@aniket866
Contributor Author

> Fix lint issues and do requested code-rabbit changes

@Archit381 I have applied the requested changes, please have a look.

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

334-345: ⚠️ Potential issue | 🔴 Critical

Type mismatch: input_path (Path) passed where input_identity (str) is expected.

All three _save_checkpoint calls (lines 336, 338, 343) pass input_path (a Path object) instead of input_identity (a string). This causes json.dump to raise TypeError: Object of type PosixPath is not JSON serializable. The exception is caught and logged as "Failed to save checkpoint", but the result is that no checkpoints are ever saved, completely breaking the resume feature.

Compute input_identity once and reuse it across all save calls:

     input_path = Path(input_path)
+    input_identity = _compute_input_identity(input_path)

     # Fixed output path
     project_root = Path.cwd()

Then update the three call sites to pass input_identity instead of input_path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 334 - 345, The checkpoint save calls
pass a Path object (input_path) where a string identity is expected, causing
JSON serialization to fail; compute input_identity once (e.g., str(input_path)
or a dedicated identity extraction) before the loop and replace all three calls
to _save_checkpoint(...) that currently pass input_path with input_identity,
ensuring _save_checkpoint(checkpoint_path, pages_written, input_identity) is
used in the normal checkpoint, KeyboardInterrupt, and generic Exception
handlers; keep usage of pages_written and existing CHECKPOINT_INTERVAL logic
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ed22dfc-4fd1-43d0-9522-edefe7a4567b

📥 Commits

Reviewing files that changed from the base of the PR and between e2c3017 and 66bbeab.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@Archit381 Archit381 merged commit 173cdd5 into AOSSIE-Org:main Mar 17, 2026
5 of 7 checks passed
