
Incremental Preprocessing with Checkpoints to avoid halfway fail-restart #68

Merged
Archit381 merged 12 commits into AOSSIE-Org:main from aniket866:resume-preprocessing
Mar 17, 2026
Conversation

@aniket866
Contributor

@aniket866 aniket866 commented Mar 11, 2026

Addressed Issues:

Fixes #63

Problem: Preprocessing huge Wikipedia dumps takes hours. If something fails halfway, you have to restart from the beginning.

Solution: Add checkpoints: save progress periodically and resume from the last saved point.
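The save/resume flow can be sketched as below. The helper names echo the ones this PR adds (`_save_checkpoint`, `_load_checkpoint`), but the bodies here are a minimal illustration, not the actual implementation:

```python
import json
import os
import tempfile
from pathlib import Path

def save_checkpoint(path: Path, pages_processed: int, input_identity: str) -> None:
    # Write to a temp file, then atomically rename over the target, so a
    # crash mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"pages_processed": pages_processed,
                   "input_identity": input_identity}, f)
    os.replace(tmp, path)

def load_checkpoint(path: Path, input_identity: str) -> int:
    # Resume only if the checkpoint exists, parses, and matches this input.
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return 0
    if data.get("input_identity") != input_identity:
        return 0  # checkpoint belongs to a different dump: start fresh
    return int(data.get("pages_processed", 0))
```

On resume, the extractor skips the first `pages_processed` pages and opens the output file in append mode; on success it deletes the checkpoint file.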

Step 1 — Download the dump

Windows / Linux / Mac

uv run python scripts/download_dump.py --wiki simplewiki --date 20260201 --output-dir data/raw

Step 2 — Start preprocessing and interrupt it

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Wait a few seconds, then press Ctrl+C.


Step 3 — Confirm checkpoint was saved

Windows (CMD)

type data\processed\wiki_clean.checkpoint.json

Linux / Mac

cat data/processed/wiki_clean.checkpoint.json

Expected:

{"pages_processed": 10000}

Note the line count of the output so far:

Windows (CMD)

find /c /v "" data\processed\wiki_clean.txt

Linux / Mac

wc -l data/processed/wiki_clean.txt

Write this number down.


Step 4 — Resume and confirm it continues, not restarts

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Watch the logs — you should see:

INFO - Resuming from checkpoint: 10000 pages already processed

Check line count again — must be higher than Step 3:

Windows (CMD)

find /c /v "" data\processed\wiki_clean.txt

Linux / Mac

wc -l data/processed/wiki_clean.txt

Step 5 — Confirm checkpoint is deleted after success

Windows (CMD)

dir data\processed\wiki_clean.checkpoint.json

Linux / Mac

ls data/processed/wiki_clean.checkpoint.json

Expected: File not found — deleted automatically on successful completion.


Step 6 — Verify fresh restart works

Delete output and checkpoint, then re-run:

Windows (CMD)

del /f data\processed\wiki_clean.checkpoint.json
del /f data\processed\wiki_clean.txt

Linux / Mac

rm -f data/processed/wiki_clean.checkpoint.json
rm -f data/processed/wiki_clean.txt

Windows / Linux / Mac

uv run python -m openverifiablellm.utils data/raw/simplewiki-20260201-pages-articles.xml.bz2

Logs should show no "Resuming from checkpoint" — starts clean.


Step 7 — Run manifest verification

Windows / Linux / Mac

uv run python -m openverifiablellm.verify data/raw/simplewiki-20260201-pages-articles.xml.bz2

Expected: ALL CHECKS PASSED — the resumed output is identical to a full uninterrupted run.


Screenshots/Recordings:

Additional Notes:

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Checkpoint-based resumable processing for large operations
    • Manifest generation records links to prior manifests
    • Manifest hashing now excludes the stored predecessor reference so hashes reflect manifest content only
    • New hashing utility exposing raw hash bytes
  • Tests

    • Comprehensive test suite for manifest chain verification, tamper detection, and backward-compatibility scenarios

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Removes parent_manifest_hash from dict manifests before hashing; adds checkpointing and resume support to XML text extraction with periodic saves and atomic checkpoint writes; generate_manifest now records parent_manifest_hash; comprehensive manifest-chain verification tests added.

Changes

  • Manifest Chain Core (openverifiablellm/manifest_chain.py): compute_manifest_hash now strips the parent_manifest_hash field from in-memory dict manifests before canonical serialization and hashing, so the computed hash excludes chain metadata.
  • Preprocessing Pipeline (openverifiablellm/utils.py): added checkpointing constants and helpers (CHECKPOINT_INTERVAL, _checkpoint_path, _compute_input_identity, _load_checkpoint, _save_checkpoint); extract_text_from_xml supports resume/append, periodic flush/checkpoint, and atomic checkpoint removal; generate_manifest fetches and inserts parent_manifest_hash. Also split the SHA-256 helpers: compute_sha256_bytes (raw bytes) and compute_sha256 (hex digest).
  • Tests (tests/test_manifest_chain.py): new test suite covering canonical JSON serialization, manifest hashing (dict and file inputs), parent-hash retrieval, link validation, full-chain verification, tamper scenarios, and backward-compatibility cases.
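The hash-exclusion behavior described above can be sketched as follows. This is an illustration of the idea (drop the chain-link field, canonicalize, hash), not the project's actual compute_manifest_hash:

```python
import hashlib
import json

def compute_manifest_hash(manifest: dict) -> str:
    # Exclude the chain-link field so the hash reflects the manifest's own
    # content, not its position in the chain; then hash canonical JSON
    # (sorted keys, compact separators) for a stable digest.
    content = {k: v for k, v in manifest.items() if k != "parent_manifest_hash"}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this scheme, adding or changing parent_manifest_hash leaves a manifest's hash unchanged, which is what lets each manifest record its predecessor's hash without perturbing its own.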

Sequence Diagram(s)

sequenceDiagram
    participant Extract as Extractor
    participant Check as CheckpointStore
    participant FS as FileStorage
    participant Gen as ManifestGenerator
    participant Hash as HashComputer

    Extract->>Check: load checkpoint (if exists)
    alt checkpoint found
        Check-->>Extract: processed_pages info
        Extract->>FS: open output file (append)
    else no checkpoint
        Extract->>FS: open output file (write)
    end

    loop per page
        Extract->>FS: write page text
        Extract->>Check: periodically save checkpoint
    end

    Extract->>Check: remove checkpoint on success

    Gen->>FS: get_parent_manifest_hash()
    FS-->>Gen: parent_hash (if exists)
    Gen->>Hash: compute_manifest_hash(manifest)  -- removes parent_manifest_hash before canonicalizing
    Hash-->>Gen: manifest_hash
    Gen->>FS: write manifest including `parent_manifest_hash`

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Python Lang, Documentation, Linter

Suggested reviewers

  • Archit381

Poem

🐰 I hop through manifests, neat and fast,
I skip my parent's name when hashing's cast,
Checkpoints cradle pages when runs go long,
I stitch the chain and hum a little song,
Small paws guard hashes, steady and steadfast.

🚥 Pre-merge checks | ✅ 4 passed

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'Incremental Preprocessing with Checkpoints to avoid halfway fail-restart' directly and clearly describes the main change: adding checkpoint support for resumable preprocessing.
  • Linked Issues Check ✅ Passed: the PR implements all primary objectives from issue #63: checkpointing infrastructure, resume capability, periodic saves, checkpoint removal on completion, and manifest chain verification for validation.
  • Out of Scope Changes Check ✅ Passed: changes align with scope: manifest_chain.py handles hash computation excluding parent references; utils.py adds checkpointing and integrates parent manifest tracking; tests verify both features comprehensively.


@aniket866 aniket866 marked this pull request as ready for review March 16, 2026 17:21
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 309-334: The variable pages_written is misleading because it
increments for every page processed (in the loop that iterates ET.iterparse)
even when no text is written; rename it to pages_processed (or
total_pages_handled) to reflect its actual meaning, and update all references
(initialization from pages_already_done, the increment inside the for _, elem in
context loop, and any external uses) so the counter semantics match;
alternatively, if you truly need a count of pages that produced output, keep
pages_written and only increment it when cleaned text is non-empty (i.e., after
out.write(cleaned + "\n\n")), and introduce pages_processed for the per-page
count.
- Around line 197-202: The current _compute_input_identity function swallows all
exceptions and returns an empty string which can falsely match during resume;
change _compute_input_identity to not return "" on error—either let exceptions
propagate (remove the broad try/except) or return a clearly distinct sentinel
(e.g., None) so it cannot compare equal to a valid checksum; then update
_load_checkpoint to explicitly catch the propagated exception or handle the
sentinel and decide whether to fail loudly or start fresh, referencing the same
function names (_compute_input_identity and _load_checkpoint) when making the
edits.
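One way to follow that suggestion is to let I/O errors propagate rather than returning "". The identity scheme below (hash of a file prefix plus size) is hypothetical; the point is the error behavior, since an empty-string identity would compare equal to an empty-string field in a corrupt checkpoint:

```python
import hashlib
from pathlib import Path

def compute_input_identity(input_path: Path) -> str:
    # Hash the first 1 MiB plus the file size. No try/except here: a missing
    # or unreadable input raises, instead of silently returning a value that
    # could falsely match during resume. The caller decides whether to fail
    # loudly or start fresh.
    h = hashlib.sha256()
    with open(input_path, "rb") as f:
        h.update(f.read(1024 * 1024))
    h.update(str(input_path.stat().st_size).encode())
    return h.hexdigest()
```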

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0cc96972-b1fc-40e5-a8f3-6c950c975463

📥 Commits

Reviewing files that changed from the base of the PR and between 6b00bd4 and 40e2f40.

📒 Files selected for processing (2)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/utils.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 234-242: _save_checkpoint currently calls
_compute_input_identity(input_path) on every checkpoint which re-hashes the
entire input file repeatedly; change the code so the input identity is computed
once and reused: compute _compute_input_identity(input_path) once before the
preprocessing loop (or at the top-level loader), pass that cached identity into
_save_checkpoint as an argument (e.g., add parameter input_identity: str and
remove the internal call), and update all callers (the places around the other
checkpoint calls referenced) to supply the precomputed identity instead of
letting _save_checkpoint recompute it.
- Around line 338-342: The exception handler currently uses a broad `except
Exception` which does not catch KeyboardInterrupt, so when Ctrl+C is used the
checkpoint (_save_checkpoint) isn't saved; modify the try/except to add an
explicit `except KeyboardInterrupt:` (or add KeyboardInterrupt to the exception
tuple) that calls `_save_checkpoint(checkpoint_path, pages_written,
input_path)`, logs via `logger.error("Processing interrupted after %d pages. Run
again to resume.", pages_written)`, and then re-raises the KeyboardInterrupt to
preserve normal interrupt behavior; keep the existing generic `except
Exception:` after that to handle other errors and re-raise them as before.
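The handler structure the review asks for looks roughly like this. The processing loop is reduced to a callback, and the checkpoint/logging details are elided; only the exception layout is the point:

```python
def run_with_checkpoints(process_pages, save_checkpoint):
    # KeyboardInterrupt derives from BaseException, not Exception, so a bare
    # `except Exception` never sees Ctrl+C; it needs its own branch.
    pages_written = 0
    try:
        for _ in process_pages():
            pages_written += 1
    except KeyboardInterrupt:
        save_checkpoint(pages_written)
        raise  # re-raise to preserve normal interrupt behavior
    except Exception:
        save_checkpoint(pages_written)
        raise
    return pages_written
```

Both branches save before re-raising, so an interrupted run can resume whether it died from Ctrl+C or from an ordinary error.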

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b9aff65d-a4f2-4c16-9426-f9d2482e5335

📥 Commits

Reviewing files that changed from the base of the PR and between 40e2f40 and e2c3017.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
@Archit381
Member

Fix lint issues and do requested code-rabbit changes

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
@aniket866
Contributor Author

> Fix lint issues and do requested code-rabbit changes

@Archit381 I have applied the requested changes, please have a look.

@github-actions github-actions bot added size/M and removed size/M labels Mar 17, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

334-345: ⚠️ Potential issue | 🔴 Critical

Type mismatch: input_path (Path) passed where input_identity (str) is expected.

All three _save_checkpoint calls (lines 336, 338, 343) pass input_path (a Path object) instead of input_identity (a string). This causes json.dump to raise TypeError: Object of type PosixPath is not JSON serializable. The exception is caught and logged as "Failed to save checkpoint", but the result is that no checkpoints are ever saved, completely breaking the resume feature.

Compute input_identity once and reuse it across all save calls:

     input_path = Path(input_path)
+    input_identity = _compute_input_identity(input_path)

     # Fixed output path
     project_root = Path.cwd()

Then update the three call sites to pass input_identity instead of input_path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 334 - 345, The checkpoint save calls
pass a Path object (input_path) where a string identity is expected, causing
JSON serialization to fail; compute input_identity once (e.g., str(input_path)
or a dedicated identity extraction) before the loop and replace all three calls
to _save_checkpoint(...) that currently pass input_path with input_identity,
ensuring _save_checkpoint(checkpoint_path, pages_written, input_identity) is
used in the normal checkpoint, KeyboardInterrupt, and generic Exception
handlers; keep usage of pages_written and existing CHECKPOINT_INTERVAL logic
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ed22dfc-4fd1-43d0-9522-edefe7a4567b

📥 Commits

Reviewing files that changed from the base of the PR and between e2c3017 and 66bbeab.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

@Archit381 Archit381 merged commit 173cdd5 into AOSSIE-Org:main Mar 17, 2026
5 of 7 checks passed
