fix: sweep.py wipes search state by default to defeat .completed resume #32
Merged
Discovered while debugging the 3.5× run-2 vs run-1 speedup on the A100 HST MGE submit (job 322549 = 3m43s vs 322548 = 11m40s). Run 2 didn't actually re-sample: it loaded the cached samples.csv + Nautilus pickle left by run 1 and reported the same total_samples=65500 with a meaningless time_per_eval_ms=2.82.

The resume gate is `.completed` (PyAutoFit/abstract_search.py:520-529), not `force_pickle_overwrite` as the previous comment claimed. `force_pickle_overwrite=True` only re-writes output pickles on an existing resume; it does not bypass the gate.

For production (SLaM-style chained phases) the resume default is correct. For profiling it produces phantom speedups whenever a prior run completed sampling, even one that crashed in post-fit, as the latent-crash in PR #29 showed.

- sweep.py: --keep-completed flag (default off). When off, removes output/searches/<sampler>/<ds>/<model>/<instrument>/<config>/ before each cell run, wiping .completed + Nautilus pickle + samples.csv.
- _samplers.py: correct the docstring claim about force_pickle_overwrite.
- README.md: rewrite the "force_pickle_overwrite defeats .completed" paragraph; document the sweep-level wipe as the actual mechanism.

The honest A100 number from run 1's actual sampling window is ~6.6 ms/eval (432 s for 65500 evals between Visualization warm-up complete and the first Fit Running update), not the 2.82 ms in run 2's JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
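The per-eval figures quoted in the commit message can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the constants are copied from the message, and the helper name `ms_per_eval` is made up for illustration and is not part of sweep.py.

```python
# Figures quoted in the commit message; nothing here is read from real logs.
SAMPLING_WINDOW_S = 432       # run 1: warm-up complete -> first Fit Running update
TOTAL_SAMPLES = 65_500        # total_samples reported by both runs
REPORTED_MS_PER_EVAL = 2.82   # time_per_eval_ms in run 2's JSON


def ms_per_eval(wall_seconds: float, n_evals: int) -> float:
    """Average wall-clock cost per likelihood evaluation, in milliseconds."""
    return 1000.0 * wall_seconds / n_evals


# Run 1 really sampled, so its window divided by the eval count is meaningful.
honest_ms = ms_per_eval(SAMPLING_WINDOW_S, TOTAL_SAMPLES)

# Run 2's figure, multiplied back out, gives the wall time it was computed from.
implied_wall_s = REPORTED_MS_PER_EVAL * TOTAL_SAMPLES / 1000.0

print(f"run 1: {honest_ms:.2f} ms/eval")
print(f"run 2 implied wall time: {implied_wall_s:.1f} s")
```

Multiplying run 2's 2.82 ms back out gives roughly 185 s, well under the 432 s that run 1 spent sampling the same 65500 evaluations, which is consistent with run 2 having loaded cached samples rather than evaluated anything.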
Summary
Run 322549 came back 3.5× faster than 322548 on the same A100 cell. It looked like cache magic, but it wasn't: run 2 didn't sample at all. It loaded the cached `samples.csv` + Nautilus pickle from run 1 and reported the same `total_samples=65500` with a meaningless `time_per_eval_ms=2.82`.

PyAutoFit's resume gate is the `.completed` sentinel file (`abstract_search.py:520`). `force_pickle_overwrite=True` only re-writes output pickles on the resume path; it doesn't bypass the gate. The previous `_samplers.py` comment claiming otherwise was wrong, and PR #29's README inherited the same mistake.

For production (SLaM chained phases) the resume default is correct. For profiling it produces phantom speedups whenever a prior attempt completed sampling, including the post-fit latent-crash case that PR #30 just fixed, where run 1's sampling completed before the crash and left `.completed` behind.

Changes
- `sweep.py`: `--keep-completed` flag (default off). When off, sweep removes `output/searches/<sampler>/<ds>/<model>/<instrument>/<config>/` before each cell run, wiping `.completed` + Nautilus pickle + cached `samples.csv`. A one-line `[clear-completed] removed ...` log is written per cell so wipes are auditable.
- `_samplers.py`: fix the `force_pickle_overwrite` docstring; clarify it controls output-file re-writes, not the resume gate.
- `README.md`: rewrite the corresponding paragraph; document the sweep-level wipe as the actual mechanism.

Honest run-1 number
From job 322548's log timestamps:
- `10:55:35` script start
- `10:56:01` Visualization warm-up complete → JIT + setup = 26 s
- `10:56:01 → 11:03:13` Nautilus sampling, 65,500 evals → 432 s → 6.6 ms/eval (real)
- `11:03:13 → 11:07:12` two final perform_updates + latent crash → ~4 min

The 2.82 ms in run 2's JSON is `(load + viz wall) / cached-sample-count`, not a per-eval cost.

Test plan
- `sweep.py --help` shows `--keep-completed`.
- Dry run logs `[clear-completed] (dry-run) would remove ...` when the dir exists.
- `_wipe_search_state(...)` removes `.completed` + nested files.

🤖 Generated with Claude Code
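For readers who want to see the shape of the wipe and the checks in the test plan, here is a minimal sketch. It is not the code from this PR: the function name `wipe_search_state_sketch`, its signature, the text after the `[clear-completed]` prefix, and the directory contents used in the demo are all assumptions made for illustration.

```python
import argparse
import shutil
import tempfile
from pathlib import Path


def wipe_search_state_sketch(search_dir: Path, dry_run: bool = False) -> bool:
    """Remove one cell's search directory so the next run cannot resume from it.

    Hypothetical stand-in for the PR's helper. Returns True if the directory existed.
    """
    if not search_dir.is_dir():
        return False
    if dry_run:
        # Assumed log format: only the "[clear-completed]" prefix comes from the PR text.
        print(f"[clear-completed] (dry-run) would remove {search_dir}")
        return True
    shutil.rmtree(search_dir)  # deletes .completed, pickles and samples.csv together
    print(f"[clear-completed] removed {search_dir}")
    return True


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sweep.py")
    # Default off: without the flag, search state is wiped before each cell.
    parser.add_argument("--keep-completed", action="store_true",
                        help="keep existing search output and allow resume")
    return parser


# Demo on a throwaway directory; the nested layout is invented for the example.
with tempfile.TemporaryDirectory() as tmp:
    cell = Path(tmp) / "output" / "searches" / "sampler" / "ds" / "model" / "instrument" / "config"
    (cell / "files").mkdir(parents=True)
    (cell / ".completed").touch()
    (cell / "files" / "samples.csv").write_text("stub\n")

    args = build_parser().parse_args([])  # no flags: keep_completed is False

    dry_ok = wipe_search_state_sketch(cell, dry_run=True)
    still_there = (cell / ".completed").exists()  # dry run must not delete anything

    wiped = False
    if not args.keep_completed:
        wiped = wipe_search_state_sketch(cell)
    gone = not cell.exists()
```

Removing the whole directory rather than only `.completed` matches the behaviour described under Changes, where the sentinel, the Nautilus pickle and the cached `samples.csv` are wiped together so no partial state can be picked up by a later run.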