fix: sweep.py wipes search state by default to defeat .completed resume #32

Merged: Jammy2211 merged 1 commit into main from fix/clear-completed-by-default on May 28, 2026

@Jammy2211
Contributor

Summary

Run 322549 came back 3.5× faster than 322548 on the same A100 cell. Looked like cache magic. It wasn't — run 2 didn't sample at all. It loaded the cached samples.csv + Nautilus pickle from run 1 and reported the same total_samples=65500 with a meaningless time_per_eval_ms=2.82.

PyAutoFit's resume gate is the .completed sentinel file (abstract_search.py:520):

if not self.paths.is_complete:
    result = self.start_resume_fit(...)
else:
    result = self.result_via_completed_fit(...)

force_pickle_overwrite=True only re-writes output pickles on the resume path; it doesn't bypass the gate. The previous _samplers.py comment claiming otherwise was wrong, and PR #29's README inherited the same mistake.
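
A toy model of the gate may make the distinction concrete. This is a sketch, not PyAutoFit code: `run_search`, the `result.pickle` filename, and the placement of the overwrite on the cached-load branch are all assumptions made for illustration. The only point it carries over from the text above is that the sentinel alone decides whether sampling happens.

```python
from pathlib import Path
import tempfile


def run_search(output_dir: Path, force_pickle_overwrite: bool = False) -> str:
    """Toy sentinel-gated fit (hypothetical); returns which branch ran."""
    sentinel = output_dir / ".completed"
    if not sentinel.exists():
        # Gate open: this is where sampling would start or resume.
        output_dir.mkdir(parents=True, exist_ok=True)
        sentinel.touch()  # mark the fit as complete
        return "sampled"
    # Gate closed: the cached result is loaded. In this toy the flag only
    # re-writes an output file (assumed name); it never re-opens the gate.
    if force_pickle_overwrite:
        (output_dir / "result.pickle").write_bytes(b"rewritten")
    return "loaded-from-cache"


cell = Path(tempfile.mkdtemp()) / "cell"
print(run_search(cell))                               # first run samples
print(run_search(cell, force_pickle_overwrite=True))  # second run does not
```

Under these assumptions the second call reports a cached load even with the flag set, which is the behaviour that produced the phantom speedup.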

For production (SLaM chained phases) the resume default is correct. For profiling it produces phantom speedups whenever a prior attempt completed sampling — including the post-fit-latent-crash case that PR #30 just fixed, where run 1's sampling completed before the crash and left .completed behind.

Changes

  • sweep.py: adds a --keep-completed flag (default off). When off, sweep removes output/searches/<sampler>/<ds>/<model>/<instrument>/<config>/ before each cell run, wiping .completed, the Nautilus pickle, and the cached samples.csv. It logs one [clear-completed] removed ... line per cell so wipes are auditable.
  • _samplers.py: fix the force_pickle_overwrite docstring; clarify that it controls output-file re-writes, not the resume gate.
  • README.md: rewrite the corresponding paragraph; document the sweep-level wipe as the actual mechanism.
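
The wipe described in the first bullet could look roughly like the sketch below. It is not the PR's `_wipe_search_state`: the function name, signature, return value, and the exact text after the `[clear-completed]` prefix are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def wipe_search_state(search_dir: Path, dry_run: bool = False) -> bool:
    """Remove one cell's search output directory (hypothetical helper).

    Returns True if the directory existed, False if there was nothing to wipe.
    """
    if not search_dir.is_dir():
        return False  # no prior output for this cell: no-op
    if dry_run:
        print(f"[clear-completed] (dry-run) would remove {search_dir}")
        return True
    # Deletes the sentinel and everything nested under the cell directory.
    shutil.rmtree(search_dir)
    print(f"[clear-completed] removed {search_dir}")
    return True
```

Removing the whole directory rather than only `.completed` matches the stated intent of also discarding the sampler pickle and cached samples, so a later run cannot pick up partial state.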

Honest run-1 number

From job 322548's log timestamps:

  • 10:55:35 script start
  • 10:56:01 Visualization warm-up complete → JIT + setup = 26s
  • 10:56:01 → 11:03:13 Nautilus sampling, 65,500 evals → 432 s → 6.6 ms/eval (real)
  • 11:03:13 → 11:07:12 two final perform_updates + latent crash → ~4 min

The 2.82 ms in run 2's JSON is (load + viz wall) / cached-sample-count — not a per-eval cost.
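
The figures above can be re-derived from the quoted timestamps. A small sketch; `seconds_between` is an illustrative helper, and the last line only inverts the ratio described above to show the wall time that 2.82 ms would imply.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


n_evals = 65_500
setup_s = seconds_between("10:55:35", "10:56:01")     # JIT + setup
sampling_s = seconds_between("10:56:01", "11:03:13")  # Nautilus sampling
ms_per_eval = 1000 * sampling_s / n_evals

print(f"setup {setup_s:.0f} s, sampling {sampling_s:.0f} s, {ms_per_eval:.2f} ms/eval")
# setup 26 s, sampling 432 s, 6.60 ms/eval

# Run 2's reported 2.82 ms/eval corresponds to this much wall time in total:
implied_wall_s = 2.82 * n_evals / 1000
print(f"{implied_wall_s:.1f} s")
```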

Test plan

  • sweep.py --help shows --keep-completed.
  • Dry-run prints [clear-completed] (dry-run) would remove ... when the dir exists.
  • Real wipe via _wipe_search_state(...) removes .completed + nested files.
  • No-op when the output dir doesn't exist.
  • Re-submit A100 with the wipe; confirm Nautilus iteration lines appear and per-eval cost lands around 6.6 ms.
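
The first check in the plan can be automated along these lines. This sketch assumes sweep.py builds its CLI with argparse, which the PR does not state; `build_parser` and the help text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for sweep.py's argument parser."""
    parser = argparse.ArgumentParser(prog="sweep.py")
    parser.add_argument(
        "--keep-completed",
        action="store_true",  # absent means False, so the wipe stays on by default
        help="keep existing search output instead of wiping it before each cell",
    )
    return parser


parser = build_parser()
assert "--keep-completed" in parser.format_help()
assert parser.parse_args([]).keep_completed is False
assert parser.parse_args(["--keep-completed"]).keep_completed is True
```

Using `store_true` keeps the profiling-safe behaviour as the default and makes resuming an explicit opt-in.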

🤖 Generated with Claude Code

Discovered while debugging the 3.5× run-2 vs run-1 speedup on the A100
HST MGE submit (job 322549 = 3m43s vs 322548 = 11m40s). Run 2 didn't
actually re-sample — it loaded the cached samples.csv + Nautilus pickle
left by run 1 and reported the same total_samples=65500 with a meaningless
time_per_eval_ms=2.82.

The resume gate is `.completed` (PyAutoFit/abstract_search.py:520-529),
not `force_pickle_overwrite` as the previous comment claimed.
`force_pickle_overwrite=True` only re-writes output pickles on an
existing resume; it does not bypass the gate.

For production (SLaM-style chained phases) the resume default is
correct. For profiling it produces phantom speedups whenever a prior
run completed sampling — even one that crashed in post-fit, as the
latent-crash in PR #29 showed.

- sweep.py: --keep-completed flag (default off). When off, removes
  output/searches/<sampler>/<ds>/<model>/<instrument>/<config>/ before
  each cell run, wiping .completed + Nautilus pickle + samples.csv.
- _samplers.py: correct the docstring claim about force_pickle_overwrite.
- README.md: rewrite the "force_pickle_overwrite defeats .completed"
  paragraph; document the sweep-level wipe as the actual mechanism.

The honest A100 number from run 1's actual sampling window is ~6.6 ms/eval
(432 s for 65500 evals between Visualization warm-up complete and the
first Fit Running update), not the 2.82 ms in run 2's JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Jammy2211 merged commit e5c2220 into main on May 28, 2026
1 check failed
@Jammy2211 deleted the fix/clear-completed-by-default branch on May 28, 2026 at 13:27