Incremental Preprocessing with Checkpoints to avoid halfway fail-restart #68
Conversation
Note: CodeRabbit automatically paused this review because this branch is under active development, to avoid overwhelming the author with review comments during an influx of new commits.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Extract as Extractor
    participant Check as CheckpointStore
    participant FS as FileStorage
    participant Gen as ManifestGenerator
    participant Hash as HashComputer
    Extract->>Check: load checkpoint (if exists)
    alt checkpoint found
        Check-->>Extract: processed_pages info
        Extract->>FS: open output file (append)
    else no checkpoint
        Extract->>FS: open output file (write)
    end
    loop per page
        Extract->>FS: write page text
        Extract->>Check: periodically save checkpoint
    end
    Extract->>Check: remove checkpoint on success
    Gen->>FS: get_parent_manifest_hash()
    FS-->>Gen: parent_hash (if exists)
    Gen->>Hash: compute_manifest_hash(manifest) (removes parent_manifest_hash before canonicalizing)
    Hash-->>Gen: manifest_hash
    Gen->>FS: write manifest including parent_manifest_hash
```
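The extractor half of the diagram can be sketched in Python. This is an illustrative sketch only, not the PR's actual code: the function names, the JSON checkpoint layout, and the `CHECKPOINT_INTERVAL` value are assumptions.

```python
import json
from pathlib import Path

CHECKPOINT_INTERVAL = 1000  # assumed pages between checkpoint saves


def load_checkpoint(checkpoint_path: Path) -> int:
    """Return the number of pages already done, or 0 if no valid checkpoint."""
    if not checkpoint_path.exists():
        return 0
    try:
        data = json.loads(checkpoint_path.read_text())
        return int(data["pages_written"])
    except (ValueError, KeyError):
        return 0  # corrupt checkpoint: start fresh


def save_checkpoint(checkpoint_path: Path, pages_written: int) -> None:
    checkpoint_path.write_text(json.dumps({"pages_written": pages_written}))


def extract(pages, out_path: Path, checkpoint_path: Path) -> int:
    done = load_checkpoint(checkpoint_path)
    mode = "a" if done else "w"  # append when resuming, write when fresh
    with out_path.open(mode, encoding="utf-8") as out:
        for i, page in enumerate(pages):
            if i < done:  # skip pages written in a previous run
                continue
            out.write(page + "\n\n")
            done += 1
            if done % CHECKPOINT_INTERVAL == 0:
                save_checkpoint(checkpoint_path, done)
    checkpoint_path.unlink(missing_ok=True)  # remove checkpoint on success
    return done
```

On a clean run the checkpoint never survives; on resume, already-written pages are skipped and output is appended rather than truncated.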
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around lines 309-334: The variable `pages_written` is misleading because it increments for every page processed in the `ET.iterparse` loop, even when no text is written. Rename it to `pages_processed` (or `total_pages_handled`) to reflect its actual meaning, and update all references (the initialization from `pages_already_done`, the increment inside the `for _, elem in context` loop, and any external uses) so the counter semantics match. Alternatively, if you truly need a count of pages that produced output, keep `pages_written` and only increment it when the cleaned text is non-empty (i.e., after `out.write(cleaned + "\n\n")`), and introduce `pages_processed` for the per-page count.
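The second option (separate counters) could look like the following sketch; the function name and cleaning logic are illustrative, not the PR's code:

```python
def count_pages(pages, out):
    """Distinguish pages seen from pages that actually produced output."""
    pages_processed = 0  # every page visited by the loop
    pages_written = 0    # only pages whose cleaned text is non-empty
    for raw in pages:
        pages_processed += 1
        cleaned = raw.strip()
        if cleaned:
            out.write(cleaned + "\n\n")
            pages_written += 1  # incremented only after a real write
    return pages_processed, pages_written
```

With this split, resume logic can key off `pages_processed` while progress reports can honestly say how many pages yielded text.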
- Around lines 197-202: The current `_compute_input_identity` function swallows all exceptions and returns an empty string, which can falsely match during resume. Change `_compute_input_identity` so it does not return `""` on error: either let exceptions propagate (remove the broad try/except) or return a clearly distinct sentinel (e.g., `None`) so it can never compare equal to a valid checksum. Then update `_load_checkpoint` to explicitly catch the propagated exception or handle the sentinel, and decide whether to fail loudly or start fresh.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0cc96972-b1fc-40e5-a8f3-6c950c975463
📒 Files selected for processing (2)
openverifiablellm/manifest_chain.py
openverifiablellm/utils.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around lines 234-242: `_save_checkpoint` currently calls `_compute_input_identity(input_path)` on every checkpoint, which re-hashes the entire input file repeatedly. Compute the input identity once and reuse it: call `_compute_input_identity(input_path)` once before the preprocessing loop (or in the top-level loader), pass that cached identity into `_save_checkpoint` as an argument (e.g., add a parameter `input_identity: str` and remove the internal call), and update all callers to supply the precomputed identity instead of letting `_save_checkpoint` recompute it.
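A sketch of the suggested shape. The surrounding `process` loop, the hashing scheme, and the checkpoint interval are hypothetical; the point is that the hash is computed exactly once and passed down:

```python
import hashlib
import json
from pathlib import Path


def _save_checkpoint(checkpoint_path: Path, pages_written: int,
                     input_identity: str) -> None:
    """Persist progress; the identity is passed in, never re-hashed here."""
    checkpoint_path.write_text(json.dumps(
        {"pages_written": pages_written, "input_identity": input_identity}))


def process(pages, input_path: Path, checkpoint_path: Path,
            interval: int = 2) -> int:
    # Hash the input exactly once, up front, instead of per checkpoint.
    input_identity = hashlib.sha256(input_path.read_bytes()).hexdigest()
    pages_written = 0
    for _page in pages:
        pages_written += 1
        if pages_written % interval == 0:
            _save_checkpoint(checkpoint_path, pages_written, input_identity)
    return pages_written
```

For a multi-gigabyte dump this turns O(pages) full-file hashes into one.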
- Around lines 338-342: The exception handler currently uses a broad `except Exception`, which does not catch KeyboardInterrupt, so the checkpoint (`_save_checkpoint`) isn't saved when Ctrl+C is pressed. Add an explicit `except KeyboardInterrupt:` (or add KeyboardInterrupt to the exception tuple) that calls `_save_checkpoint(checkpoint_path, pages_written, input_path)`, logs via `logger.error("Processing interrupted after %d pages. Run again to resume.", pages_written)`, and then re-raises the KeyboardInterrupt to preserve normal interrupt behavior; keep the existing generic `except Exception:` after it to handle other errors and re-raise as before.
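The suggested handler structure, sketched with stand-in callables (`loop_body` and `save` are hypothetical placeholders for the real processing loop and `_save_checkpoint`). The key fact is that `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so a bare `except Exception` never sees it:

```python
import logging

logger = logging.getLogger(__name__)


def run_with_checkpoint(loop_body, save, pages_written=0):
    """Save a checkpoint on Ctrl+C, then re-raise to keep interrupt semantics."""
    try:
        pages_written = loop_body(pages_written)
    except KeyboardInterrupt:
        save(pages_written)  # checkpoint survives the interrupt
        logger.error(
            "Processing interrupted after %d pages. Run again to resume.",
            pages_written)
        raise  # preserve normal Ctrl+C behavior for callers
    except Exception:
        save(pages_written)  # other failures also checkpoint, then propagate
        raise
    return pages_written
```

Re-raising after saving is important: swallowing the interrupt would make Ctrl+C appear to hang or exit cleanly, hiding the fact that the run was incomplete.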
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b9aff65d-a4f2-4c16-9426-f9d2482e5335
📒 Files selected for processing (1)
openverifiablellm/utils.py
Fix lint issues and do the requested CodeRabbit changes

@Archit381 I have applied the requested changes, please have a look
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
334-345: ⚠️ Potential issue | 🔴 Critical
Type mismatch: `input_path` (`Path`) is passed where `input_identity` (`str`) is expected. All three `_save_checkpoint` calls (lines 336, 338, 343) pass `input_path` (a `Path` object) instead of `input_identity` (a string). This causes `json.dump` to raise `TypeError: Object of type PosixPath is not JSON serializable`. The exception is caught and logged as "Failed to save checkpoint", but the result is that no checkpoints are ever saved, completely breaking the resume feature.

Compute `input_identity` once and reuse it across all save calls:

```diff
 input_path = Path(input_path)
+input_identity = _compute_input_identity(input_path)

 # Fixed output path
 project_root = Path.cwd()
```

Then update the three call sites to pass `input_identity` instead of `input_path`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 334-345: the checkpoint save calls pass a `Path` object (`input_path`) where a string identity is expected, causing JSON serialization to fail. Compute `input_identity` once (e.g., `str(input_path)` or a dedicated identity extraction) before the loop and replace all three `_save_checkpoint(...)` calls that currently pass `input_path` with `input_identity`, ensuring `_save_checkpoint(checkpoint_path, pages_written, input_identity)` is used in the normal checkpoint, KeyboardInterrupt, and generic Exception handlers; keep the `pages_written` usage and existing `CHECKPOINT_INTERVAL` logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around lines 334-345: The checkpoint save calls pass a `Path` object (`input_path`) where a string identity is expected, causing JSON serialization to fail. Compute `input_identity` once (e.g., `str(input_path)` or a dedicated identity extraction) before the loop and replace all three `_save_checkpoint(...)` calls that currently pass `input_path` with `input_identity`, ensuring `_save_checkpoint(checkpoint_path, pages_written, input_identity)` is used in the normal checkpoint, KeyboardInterrupt, and generic Exception handlers; keep the `pages_written` usage and existing `CHECKPOINT_INTERVAL` logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1ed22dfc-4fd1-43d0-9522-edefe7a4567b
📒 Files selected for processing (1)
openverifiablellm/utils.py
Addressed Issues:
Fixes #63
Problem: Preprocessing huge Wikipedia dumps takes hours. If something fails halfway, you have to restart from the beginning.
Solution: Add checkpoints: periodically save progress so an interrupted run can resume where it left off.
Step 1 — Download the dump
Windows / Linux / Mac
Step 2 — Start preprocessing and interrupt it
Windows / Linux / Mac
Wait a few seconds, then press Ctrl+C.
Step 3 — Confirm checkpoint was saved
Windows (CMD)
Linux / Mac
Expected:
Note the line count of the output so far:
Windows (CMD)
Linux / Mac
Write this number down.
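The Linux/Mac checks in Step 3 can be scripted along these lines. The file names below are placeholders, not the PR's actual paths (the real output and checkpoint names come from the preprocessing script's own logs); the `printf` lines simulate an interrupted run so the snippet is self-contained:

```shell
#!/bin/sh
# Placeholder paths -- substitute the real ones from the preprocessing logs.
OUT=${OUT:-wiki_text.txt}
CKPT=${CKPT:-.preprocess_checkpoint.json}

# Simulate the state after Ctrl+C (remove these two lines in a real check).
printf 'page one\n\npage two\n\n' > "$OUT"
printf '{"pages_written": 2}' > "$CKPT"

test -f "$CKPT" && echo "checkpoint present"  # Step 3: checkpoint was saved
wc -l < "$OUT"                                # note this count before resuming
```

After resuming (Step 4), running `wc -l` again should report a strictly higher number, and after successful completion (Step 5) the `test -f` check should fail because the checkpoint is deleted.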
Step 4 — Resume and confirm it continues, not restarts
Windows / Linux / Mac
Watch the logs — you should see:
Check line count again — must be higher than Step 3:
Windows (CMD)
Linux / Mac
Step 5 — Confirm checkpoint is deleted after success
Windows (CMD)
Linux / Mac
Expected: File not found — deleted automatically on successful completion.
Step 6 — Verify fresh restart works
Delete output and checkpoint, then re-run:
Windows (CMD)
Linux / Mac
Windows / Linux / Mac
Logs should show no "Resuming from checkpoint" — starts clean.
Step 7 — Run manifest verification
Windows / Linux / Mac
Expected:
`ALL CHECKS PASSED`: the resumed output is identical to a full uninterrupted run.

Screenshots/Recordings:
Additional Notes:
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests