fix: auto-resolve baked-in checkpoint paths with legacy fallback#147
marcuscollins merged 4 commits into main
Conversation
📝 Walkthrough

Moves all model checkpoints into a dedicated Docker base image (diffuseproject/sampleworks-checkpoints:latest), updates CI comments and the Dockerfile to reference that base image, and replaces a 4x4 GPU job grid with four explicit 2-GPU docker runs in run_all_models.sh.

Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (args)
participant Args as guidance_script_arguments
participant Utils as guidance_script_utils
participant Resolver as _resolve_checkpoint
participant Validator as validate_model_checkpoint
participant Config as Guidance Config
CLI->>Utils: populate_config_for_guidance_type(job, args)
Utils->>Args: get_checkpoint(args)
Args->>Args: return explicit checkpoint or None
alt explicit checkpoint present
Args-->>Utils: checkpoint path
else no checkpoint
Utils->>Resolver: _resolve_checkpoint(model_key)
Resolver->>Resolver: check /checkpoints/ then fallbacks
Resolver-->>Utils: resolved_path or ""
Utils->>Validator: validate_model_checkpoint(model, resolved_path)
Validator->>Validator: ensure path exists & is file
Validator-->>Utils: validated_path or raise error
end
Utils->>Config: assign checkpoint to template job
Utils->>Config: propagate checkpoint to all queued jobs
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Pull request overview
This PR aims to make model checkpoint handling work seamlessly in Docker by preferring baked-in /checkpoints/* paths and falling back to legacy developer paths, removing the need to pass explicit checkpoint flags when running inside the container.
Changes:
- Added checkpoint auto-resolution utilities and changed --model-checkpoint defaults to None to trigger auto-resolution.
- Updated RF3 checkpoint filename references (*_remapped.ckpt → *.ckpt) across tests and scripts.
- Refreshed Docker + workflow/docs and updated run_all_models.sh to run 4 jobs × 2 GPUs (8 total) with an MSA cache mount.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/conftest.py | Updates RF3 test checkpoint path to the new filename. |
| src/sampleworks/utils/guidance_script_arguments.py | Introduces checkpoint candidate resolution + changes argparse defaults to None. |
| run_grid_search.py | Updates RF3 default checkpoint filename and help text. |
| run_all_models.sh | Reworks orchestration to 4 parallel 2-GPU jobs with new dataset + MSA cache mount. |
| docker-entrypoint.sh | Documents additional baked-in checkpoints (RF3/Protenix). |
| Dockerfile | Switches to copying /checkpoints/ from a prebuilt checkpoints image; sets RF3/Protenix env vars. |
| .gitignore | Ignores local checkpoints/ directory. |
| .github/workflows/docker.yml | Adds docker-entrypoint.sh as a build trigger and documents external-stage pull behavior. |
| .dockerignore | Excludes checkpoints/ from Docker build context. |
```shell
# Wait for all background jobs
wait

echo ""
echo "=========================================="
```
The script now uses a bare wait under set -e, which does not reliably propagate failures from background jobs (it returns the status of the last job waited on). This can report success even if one of the earlier model runs failed. Restore per-PID waiting and aggregate exit codes (as the previous version did) so CI/automation can detect partial failures.
Suggested change

```diff
-# Wait for all background jobs
-wait
-
-echo ""
-echo "=========================================="
+# Wait for all background jobs individually and aggregate exit codes
+overall_status=0
+for pid in "${PIDS[@]}"; do
+  if wait "$pid"; then
+    echo "[$(date)] Job with PID $pid completed successfully"
+  else
+    exit_code=$?
+    echo "[$(date)] Job with PID $pid failed with exit code $exit_code"
+    overall_status=$exit_code
+  fi
+done
+echo ""
+echo "=========================================="
+if [ "$overall_status" -ne 0 ]; then
+  echo "[$(date)] One or more jobs failed."
+  echo "=========================================="
+  exit "$overall_status"
+fi
```

(Note: the success case is tested first so that `$?` inside the `else` branch still holds `wait`'s nonzero exit code; capturing `$?` inside an `if ! wait ...` branch would read the negated status instead.)
```python
if checkpoint is not None:
    self.model_checkpoint = checkpoint
```
validate_model_checkpoint() implements the new auto-resolution logic, but it is never invoked when building GuidanceConfig. With --model-checkpoint defaults now set to None, GuidanceConfig.model_checkpoint will remain unset unless the caller explicitly provides it, and downstream code (e.g., run_guidance_job_queue in guidance_script_utils.py) rejects None. Consider calling validate_model_checkpoint(self.model, checkpoint) inside populate_config_for_guidance_type() and always populating self.model_checkpoint (either user-provided or auto-resolved).
Suggested change

```diff
-if checkpoint is not None:
-    self.model_checkpoint = checkpoint
+# Always resolve and validate the model checkpoint (user-provided or auto-resolved)
+self.model_checkpoint = validate_model_checkpoint(self.model, checkpoint)
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

6-13: ⚠️ Potential issue | 🟡 Minor

Include .dockerignore in the Docker-image trigger set.

The published image now depends on .dockerignore too. If that file changes by itself, this workflow will not rebuild or push a new image, so CI can silently drift from the repo state.
Suggested change

```diff
 paths:
   - 'pyproject.toml'
   - 'pixi.lock'
   - 'src/**'
   - 'scripts/**'
   - 'Dockerfile'
+  - '.dockerignore'
   - 'docker-entrypoint.sh'
   - '.github/workflows/docker.yml'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/docker.yml around lines 6 - 13, The workflow's paths list under the Docker image CI (the paths: block) is missing .dockerignore so changes to that file won't trigger a rebuild; update the paths: list to include '.dockerignore' (i.e., add the string '.dockerignore' alongside 'pyproject.toml', 'Dockerfile', 'docker-entrypoint.sh', etc.) so the image build/push workflow runs when .dockerignore changes.
🧹 Nitpick comments (1)
Dockerfile (1)

115-121: Pin the checkpoint base image to an immutable digest.

sampleworks-checkpoints:latest makes the contents of /checkpoints change underneath the same repo revision. That is risky here because the expected filenames are hard-coded in this Dockerfile, docker-entrypoint.sh, and _CHECKPOINT_CANDIDATES, so a tag move can break runtime resolution without any code change.

Suggested change

```diff
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=diffuseproject/sampleworks-checkpoints@sha256:<approved-digest> /checkpoints/ /checkpoints/
```
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 115 - 121, The Dockerfile currently copies checkpoints from diffuseproject/sampleworks-checkpoints:latest which is mutable; replace the image tag with an immutable digest (e.g., diffuseproject/sampleworks-checkpoints@sha256:<DIGEST>) in the COPY --from=... statement so /checkpoints/ content is fixed at build time; locate the COPY line that references sampleworks-checkpoints and update it to use the image digest, then update any docs/CI that record the expected digest so future upgrades explicitly change the digest instead of relying on :latest.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@run_all_models.sh`:
- Around line 49-65: The backgrounded docker run pipelines (each using "docker
run ... | tee ... &") are masking failures because tee hides docker's exit code;
wrap each launch pipeline in a subshell that sets pipefail (e.g., ( set -o
pipefail; docker run ... | tee ... ) & ), capture each background PID (use $!
into an array or variables) and then replace the single bare wait with explicit
waits that iterate over those PIDs, check each wait's exit status, and exit
nonzero if any container failed; apply this subshell+pipefail+PID-capture
pattern to all four launches using the existing "docker run" and "tee"
invocations and then perform PID-based waits instead of the bare "wait".
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 44-54: get_checkpoint currently only reads args.model_checkpoint
which breaks grid-search that sets
boltz1_checkpoint/boltz2_checkpoint/protenix_checkpoint/rf3_checkpoint; update
get_checkpoint to check those model-specific attributes (boltz1_checkpoint,
boltz2_checkpoint, protenix_checkpoint, rf3_checkpoint) and pick the first
non-empty value, pass it through validate_model_checkpoint(...) and return the
validated string or None; also make the analogous change where get_checkpoint
logic is duplicated (the occurrence noted around lines 155-158) so both places
resolve model-specific args and call validate_model_checkpoint before returning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 77342163-cd37-4bdd-8371-d3e23d75eb1c
📒 Files selected for processing (9)
- .dockerignore
- .github/workflows/docker.yml
- .gitignore
- Dockerfile
- docker-entrypoint.sh
- run_all_models.sh
- run_grid_search.py
- src/sampleworks/utils/guidance_script_arguments.py
- tests/conftest.py
```shell
docker run $DOCKER_OPTS \
  --gpus '"device=0,1"' \
  -v "$DATA_DIR:/data/inputs:ro" \
  -v "$RESULTS_DIR:/data/results" \
  -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
  diffuseproject/sampleworks:latest \
  -e boltz run_grid_search.py \
  --proteins "/data/inputs/proteins.csv" \
  --models boltz2 \
  --methods "X-RAY DIFFRACTION" \
  --scalers pure_guidance \
  --partial-diffusion-step 120 \
  --ensemble-sizes "8" \
  --gradient-weights "0.1 0.2 0.5" \
  --gradient-normalization --augmentation --align-to-input \
  --output-dir /data/results \
  2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
printf 'pipeline status without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null; echo $?'
printf 'wait on a failed background pipeline without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null & pid=$!; wait "$pid"; echo $?'
printf 'plain wait after a failed background pipeline: '
bash -lc 'false | tee /dev/null >/dev/null & wait; echo $?'
```

Repository: diff-use/sampleworks
Length of output: 634

🏁 Script executed:

```shell
find . -name "run_all_models.sh" -type f
```

Length of output: 83

🏁 Script executed:

```shell
head -150 run_all_models.sh | cat -n
```

Length of output: 6001
Background pipelines are masking docker run failures.
Each of the 4 job launches uses docker run ... | tee ... & without pipefail. Because set -e only applies to the final command in a pipeline, the exit status comes from tee (which always succeeds), not from docker run. Then the bare wait on line 146 returns success after all children exit, regardless of their actual exit codes. This means the script will report "All jobs completed!" even when one or more containers have failed.
Wrap all four launch blocks (lines 49–65, 71–87, 93–107, 113–128) in subshells with set -o pipefail, then replace the bare wait with explicit PID-based waits that check exit codes:
Suggested change

```diff
+(
+  set -o pipefail
   docker run $DOCKER_OPTS \
     --gpus '"device=0,1"' \
     -v "$DATA_DIR:/data/inputs:ro" \
     -v "$RESULTS_DIR:/data/results" \
     -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
     diffuseproject/sampleworks:latest \
     -e boltz run_grid_search.py \
     --proteins "/data/inputs/proteins.csv" \
     --models boltz2 \
     --methods "X-RAY DIFFRACTION" \
     --scalers pure_guidance \
     --partial-diffusion-step 120 \
     --ensemble-sizes "8" \
     --gradient-weights "0.1 0.2 0.5" \
     --gradient-normalization --augmentation --align-to-input \
     --output-dir /data/results \
-    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
+    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log"
+) &

 # Wait for all background jobs
-wait
+failed=0
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    failed=1
+  fi
+done
+if (( failed )); then
+  exit 1
+fi
```

Apply the same subshell + pipefail pattern to all four job launches (lines 71–87, 93–107, 113–128).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_all_models.sh` around lines 49 - 65, The backgrounded docker run
pipelines (each using "docker run ... | tee ... &") are masking failures because
tee hides docker's exit code; wrap each launch pipeline in a subshell that sets
pipefail (e.g., ( set -o pipefail; docker run ... | tee ... ) & ), capture each
background PID (use $! into an array or variables) and then replace the single
bare wait with explicit waits that iterate over those PIDs, check each wait's
exit status, and exit nonzero if any container failed; apply this
subshell+pipefail+PID-capture pattern to all four launches using the existing
"docker run" and "tee" invocations and then perform PID-based waits instead of
the bare "wait".
🧹 Nitpick comments (2)
src/sampleworks/utils/guidance_script_arguments.py (1)

155-162: Consider using validate_model_checkpoint for early validation.

This code auto-resolves the checkpoint path but does not validate that it exists. The path is validated later in get_model_and_device, but storing an invalid path in GuidanceConfig could cause confusing errors if the config is serialized and used elsewhere before model loading.

For consistency with the "fail fast" principle, consider validating upfront:

Suggested change

```diff
 def populate_config_for_guidance_type(self, job: JobConfig, args: argparse.Namespace):
     checkpoint = get_checkpoint(args)
-    if checkpoint is not None:
-        self.model_checkpoint = checkpoint
-    elif not getattr(self, "model_checkpoint", None):
-        # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
-        model_key = str(self.model).lower().replace("structurepredictor.", "")
-        self.model_checkpoint = _resolve_checkpoint(model_key)
+    self.model_checkpoint = validate_model_checkpoint(self.model, checkpoint)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 155 - 162, populate_config_for_guidance_type currently auto-resolves and assigns self.model_checkpoint via get_checkpoint/_resolve_checkpoint but does not validate the resolved path; after assigning self.model_checkpoint (in the branch where checkpoint is None and _resolve_checkpoint returns a value) call validate_model_checkpoint(self.model_checkpoint) and handle failure by raising or logging a clear error so invalid paths are rejected early; ensure you reference the existing functions populate_config_for_guidance_type, get_checkpoint, _resolve_checkpoint, validate_model_checkpoint and the model_checkpoint attribute so callers of GuidanceConfig never retain an unvalidated checkpoint string (matching how get_model_and_device later validates).

src/sampleworks/utils/guidance_script_utils.py (1)

36-41: Consider making _resolve_checkpoint public if it's part of the API.

Importing a private function (_resolve_checkpoint, prefixed with _) across modules couples internal implementation details. If this function is intentionally shared, consider renaming it to resolve_checkpoint to indicate it's part of the module's public interface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_utils.py` around lines 36 - 41, The code imports the private symbol _resolve_checkpoint from guidance_script_arguments; if this function is intended as part of the module API, rename it to resolve_checkpoint in guidance_script_arguments (and update its definition and any docstring), export the new public name, then update imports in guidance_script_utils.py to import resolve_checkpoint instead of _resolve_checkpoint and update all other call sites that reference _resolve_checkpoint (or add a one-line alias _resolve_checkpoint = resolve_checkpoint temporarily for backward compatibility and add a deprecation note). Ensure tests and any references to validate_model_checkpoint or GuidanceConfig remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 67eb44c8-378f-4490-a639-f8694dc45196
📒 Files selected for processing (2)
- src/sampleworks/utils/guidance_script_arguments.py
- src/sampleworks/utils/guidance_script_utils.py
…heckpoints Replace HuggingFace downloads and build-context COPY with a single COPY --from=diffuseproject/sampleworks-checkpoints:latest. This makes the build work in CI (no checkpoint files needed) and on any machine with Docker access to Docker Hub. - Dockerfile: COPY --from checkpoints base image instead of curl/COPY - run_all_models.sh: use Docker Hub image name (configurable via IMAGE env) - docker.yml: add comment explaining CI pattern, trigger on entrypoint changes - .dockerignore: exclude checkpoints/ from build context
…egacy fallback - Add _resolve_checkpoint() that tries /checkpoints/ (Docker image) first, then falls back to legacy dev paths (~/.boltz/, ~/.foundry/, .pixi/envs/) - Change all --model-checkpoint defaults to None (triggers auto-resolution) - Move validate_model_checkpoint and get_checkpoint inline (was importing from checkpoint_utils which doesn't exist on main) - Update run_all_models.sh for 8-GPU setup: 4 jobs x 2 GPUs each (boltz2-xrd, boltz2-md, rf3, protenix) with occ_sweep dataset
Force-pushed 9b67692 to c9eda3c
Actionable comments posted: 2
♻️ Duplicate comments (1)
run_all_models.sh (1)
53-69: ⚠️ Potential issue | 🟠 Major

Preserve docker run failures from the backgrounded tee pipelines.

These jobs still run as docker run ... | tee ... & without pipefail, and line 151 uses a bare wait. A container failure is therefore hidden by tee, and the script can still print "All jobs completed!".

```shell
#!/bin/bash
set -euo pipefail
printf 'background pipeline status without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null & pid=$!; wait "$pid"; echo $?'
printf 'bare wait status after failed background pipeline: '
bash -lc 'false | tee /dev/null >/dev/null & wait; echo $?'
rg -n 'tee "\$RESULTS_DIR/.*\.log" &' run_all_models.sh
rg -n '^[[:space:]]*wait$' run_all_models.sh
```

Suggested change

```diff
-  docker run $DOCKER_OPTS \
+(
+  set -o pipefail
+  docker run $DOCKER_OPTS \
     --gpus '"device=0,1"' \
     -v "$DATA_DIR:/data/inputs:ro" \
     -v "$RESULTS_DIR:/data/results" \
     -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
     diffuseproject/sampleworks:latest \
     -e boltz run_grid_search.py \
     --proteins "/data/inputs/proteins.csv" \
     --models boltz2 \
     --methods "X-RAY DIFFRACTION" \
     --scalers pure_guidance \
     --partial-diffusion-step 120 \
     --ensemble-sizes "8" \
     --gradient-weights "0.1 0.2 0.5" \
     --gradient-normalization --augmentation --align-to-input \
     --output-dir /data/results \
-    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
-PIDS+=($!)
+    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log"
+) &
+PIDS+=("$!")
```

```diff
 # Wait for all background jobs
-wait
+failed=0
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    failed=1
+  fi
+done
+if (( failed )); then
+  exit 1
+fi
```

Apply the same subshell + pipefail pattern to the other three launch blocks as well.

Also applies to: 75-91, 97-112, 118-133, 150-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@run_all_models.sh` around lines 53 - 69, The backgrounded docker pipelines (the lines that run docker run ... | tee "$RESULTS_DIR/*.log" &) hide container failures because pipefail isn't set and the script uses a bare wait; fix by running each pipeline in a subshell with pipefail enabled and capturing its PID so wait checks its exit status, e.g. replace instances of docker run ... | tee "$RESULTS_DIR/boltz2_xrd_run.log" & with (set -o pipefail; docker run ... | tee "$RESULTS_DIR/boltz2_xrd_run.log") & pid_boltz2=$! and then later wait "$pid_boltz2"; do the same pattern for the other launch blocks and remove/avoid the bare wait so each specific PID is waited on to propagate failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 19-22: The "protenix" fallback entry in the mapping (the list
under the "protenix" key in guidance_script_arguments.py) uses a repo-relative
path and a hard-coded "python3.12" segment which breaks when running outside the
repo root or under different Python minors; replace the literal
".pixi/envs/protenix-dev/lib/python3.12/..." string with a computed, stable path
(e.g., derive a base via Path(__file__).resolve().parents[3], or use sys.prefix,
or expand a home-relative "~/.pixi/...", or glob for "python*" under
.pixi/envs/protenix-dev/lib) so the fallback resolves independent of CWD and
Python minor version and update the second element of the "protenix" list
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 78acea95-209a-4b49-a440-df87384fd55b
📒 Files selected for processing (5)
- .github/workflows/docker.yml
- Dockerfile
- run_all_models.sh
- src/sampleworks/utils/guidance_script_arguments.py
- src/sampleworks/utils/guidance_script_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/docker.yml
```diff
 if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
-    raise ValueError("Running guidance requires that you specify a model checkpoint")
+    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
+    model_key = str(template_job.model).lower().replace("structurepredictor.", "")
+    resolved = _resolve_checkpoint(model_key)
+    if not resolved:
+        raise ValueError(
+            f"Running guidance requires a model checkpoint for '{template_job.model}'. "
+            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
+        )
+    template_job.model_checkpoint = resolved
+    # Propagate to all jobs in the queue
+    for job in job_queue:
+        job.model_checkpoint = resolved
```
Validate the auto-resolved checkpoint before propagating it to every job.
_resolve_checkpoint() returns the first candidate string even when no file exists, so Line 570 never fires for known models. This branch can stamp a non-existent path onto the whole queue and only fail later in get_model_and_device(). Reusing validate_model_checkpoint(template_job.model, None) here would keep the failure local and immediate.
Suggested change

```diff
 template_job = job_queue[0]
 if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
-    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
-    model_key = str(template_job.model).lower().replace("structurepredictor.", "")
-    resolved = _resolve_checkpoint(model_key)
-    if not resolved:
-        raise ValueError(
-            f"Running guidance requires a model checkpoint for '{template_job.model}'. "
-            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
-        )
+    resolved = validate_model_checkpoint(template_job.model, None)
     template_job.model_checkpoint = resolved
     # Propagate to all jobs in the queue
     for job in job_queue:
         job.model_checkpoint = resolved
```

As per coding guidelines, "Fail fast with clear error messages when invalid states are encountered."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    resolved = validate_model_checkpoint(template_job.model, None)
    template_job.model_checkpoint = resolved
    # Propagate to all jobs in the queue
    for job in job_queue:
        job.model_checkpoint = resolved
```
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 571-574: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_utils.py` around lines 566 - 578,
_resolve_checkpoint can return a candidate string even if the file doesn't
exist, so don't propagate it blindly: after computing resolved =
_resolve_checkpoint(model_key) call
validate_model_checkpoint(template_job.model, resolved) (or the existing
validation helper) and raise a clear ValueError if validation fails; only if
validation succeeds set template_job.model_checkpoint = resolved and then
propagate to each job in job_queue so failures happen immediately (not later in
get_model_and_device()).
Probably good to address this one.
@marcuscollins, agreed — the current code can silently propagate a non-existent path to the entire queue. Since validate_model_checkpoint already handles auto-resolution and path-existence validation, we can collapse the whole block into a single call:

- Replace the manual _resolve_checkpoint call and bare `if not resolved` guard in run_guidance_job_queue with a single validate_model_checkpoint(template_job.model, None) call that auto-resolves and validates in one shot.
🧠 Learnings used
Learnt from: k-chrispens
Repo: diff-use/sampleworks PR: 73
File: src/sampleworks/core/samplers/protocol.py:1-18
Timestamp: 2026-02-02T19:13:51.594Z
Learning: In Python files across the repo, do not suggest or add inline or file-level suppressions for Ruff F722 (jaxtyping dimension strings). Instead rely on a global ignore configured in pyproject.toml (as done by the sampleworks project). Verify that F722 is included in the global ignore, and avoid recommending per-file # ruff: noqa: F722 directives unless explicitly documented as an exception.
Learnt from: marcuscollins
Repo: diff-use/sampleworks PR: 132
File: src/sampleworks/utils/guidance_script_utils.py:585-586
Timestamp: 2026-03-05T16:30:40.514Z
Learning: In Python code, if enums subclassing StrEnum are used (e.g., GuidanceType, StructurePredictor), they serialize to plain strings with json.dump and pickle without special handling. Do not flag these as non-serializable in code reviews. Treat StrEnum values as strings for JSON serialization and ensure tests cover that behavior; no extra pickle handling needed.
💡 Use @coderabbitai plan to create a plan for code changes
    self.model_checkpoint = checkpoint
elif not getattr(self, "model_checkpoint", None):
    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
    model_key = str(self.model).lower().replace("structurepredictor.", "")
The `.replace("structurepredictor.", "")` bit isn't necessary. Remove to avoid confusion.
from sampleworks.utils.guidance_script_arguments import (
    GuidanceConfig,
    JobResult,
    _resolve_checkpoint,
Maybe rename so it isn't a "private" method.
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    raise ValueError("Running guidance requires that you specify a model checkpoint")
# Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
model_key = str(template_job.model).lower().replace("structurepredictor.", "")
Ditto, remove the replace bit. It shouldn't be necessary.
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    raise ValueError("Running guidance requires that you specify a model checkpoint")
# Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
model_key = str(template_job.model).lower().replace("structurepredictor.", "")
resolved = _resolve_checkpoint(model_key)
if not resolved:
    raise ValueError(
        f"Running guidance requires a model checkpoint for '{template_job.model}'. "
        f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
    )
template_job.model_checkpoint = resolved
# Propagate to all jobs in the queue
for job in job_queue:
    job.model_checkpoint = resolved
Probably good to address this one.
…nt exists and raises if not
marcuscollins
left a comment
I went ahead and made the requested changes myself. They were very minor.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/sampleworks/utils/guidance_script_arguments.py (1)
19-22: ⚠️ Potential issue | 🟠 Major: Protenix fallback path is brittle across working directories and Python minors.
The fallback uses a repo-relative path with hard-coded `python3.12`, so resolution can fail when launched outside the repo root or under a different Python minor. Please make this fallback CWD-independent and version-robust.
🔧 Suggested direction
+import glob
 @@
     "protenix": [
         "/checkpoints/protenix_base_default_v0.5.0.pt",
-        ".pixi/envs/protenix-dev/lib/python3.12/site-packages/release_data/checkpoint/protenix_base_default_v0.5.0.pt",
+        "~/.pixi/envs/protenix-dev/lib/python*/site-packages/release_data/checkpoint/protenix_base_default_v0.5.0.pt",
     ],
 @@
-    for candidate in candidates:
-        resolved = Path(candidate).expanduser()
-        if resolved.exists():
-            return str(resolved)
+    for candidate in candidates:
+        expanded = str(Path(candidate).expanduser())
+        matches = [Path(m) for m in glob.glob(expanded)] if "*" in expanded else [Path(expanded)]
+        for resolved in matches:
+            if resolved.exists():
+                return str(resolved.resolve())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 19 - 22, The "protenix" fallback path in guidance_script_arguments.py is brittle because it hard-codes "python3.12" and a repo-relative path; replace it with a robust resolver that builds candidate paths using the running Python's install locations (use sys.version_info or sys.executable and sysconfig.get_paths()/site.getsitepackages()/sys.prefix) and Path to join the known package path "release_data/checkpoint/protenix_base_default_v0.5.0.pt", or search those site-packages/purelib locations for that relative path and pick the first hit; update the "protenix" entry to call this resolver (or populate it at import time) so the fallback is CWD-independent and works across Python minor versions.
🧹 Nitpick comments (1)
src/sampleworks/utils/guidance_script_arguments.py (1)
26-33: Use a full NumPy-style docstring for the new function.
`_resolve_checkpoint()` has a narrative docstring, but not NumPy sections (Parameters, Returns, Raises), which makes API contracts less explicit. As per coding guidelines, "**/*.py: Always include NumPy-style docstrings for every function and class".
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 26 - 33, Update the docstring of _resolve_checkpoint to use NumPy-style sections: add a Parameters section describing model_key: str, a Returns section describing the returned checkpoint path as str, and a Raises section if the function can raise any exceptions (or state that it does not); keep the existing narrative summary and reference to _CHECKPOINT_CANDIDATES and the fallback behavior so callers understand the contract and possible error conditions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 38-39: In the comment "Nothing found – return the primary
(baked-in) path so the error message points the user to the expected location."
in src/sampleworks/utils/guidance_script_arguments.py replace the EN DASH (–)
with a plain ASCII hyphen-minus (-) so the line reads "Nothing found - return
the primary (baked-in) path..." to satisfy Ruff RUF003 and remove the lint
warning.
- Around line 26-50: _resolve_checkpoint currently raises immediately when the
primary candidate path does not exist and also uses ValueError inconsistently
with validate_model_checkpoint; restore the original behavior by returning the
primary (baked-in) candidate when candidates exist (even if the path doesn't
exist) so downstream validate_model_checkpoint can perform existence checks and
raise FileNotFoundError, and only raise a ValueError if there are no candidates
at all; update the control flow in _resolve_checkpoint accordingly (refer to
function name _resolve_checkpoint and validation function
validate_model_checkpoint to align semantics).
---
Duplicate comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 19-22: The "protenix" fallback path in
guidance_script_arguments.py is brittle because it hard-codes "python3.12" and a
repo-relative path; replace it with a robust resolver that builds candidate
paths using the running Python's install locations (use sys.version_info or
sys.executable and sysconfig.get_paths()/site.getsitepackages()/sys.prefix) and
Path to join the known package path
"release_data/checkpoint/protenix_base_default_v0.5.0.pt", or search those
site-packages/purelib locations for that relative path and pick the first hit;
update the "protenix" entry to call this resolver (or populate it at import
time) so the fallback is CWD-independent and works across Python minor versions.
---
Nitpick comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 26-33: Update the docstring of _resolve_checkpoint to use
NumPy-style sections: add a Parameters section describing model_key: str, a
Returns section describing the returned checkpoint path as str, and a Raises
section if the function can raise any exceptions (or state that it does not);
keep the existing narrative summary and reference to _CHECKPOINT_CANDIDATES and
the fallback behavior so callers understand the contract and possible error
conditions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ebfbf9b-eb83-4083-a87e-1ba2b8ac26c2
📒 Files selected for processing (2)
src/sampleworks/utils/guidance_script_arguments.py
src/sampleworks/utils/guidance_script_utils.py
def _resolve_checkpoint(model_key: str) -> str:
    """Return the first checkpoint path that exists on disk for *model_key*.

    Tries baked-in Docker paths first (``/checkpoints/``), then falls back to
    legacy development paths. If none are found the first candidate is returned
    so that downstream validation produces a clear error message.
    """
    candidates = _CHECKPOINT_CANDIDATES.get(model_key, [])
    for candidate in candidates:
        resolved = Path(candidate).expanduser()
        if resolved.exists():
            return str(resolved)
    # Nothing found – return the primary (baked-in) path so the error message
    # points the user to the expected location.
    resolved = candidates[0] if candidates else ""
    if not resolved:
        raise ValueError(
            f"Running guidance requires a model checkpoint for '{model_key}'. "
            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
        )
    if not Path(resolved).exists():
        raise ValueError(
            f"Model checkpoint '{resolved}' does not exist. "
            f"Provide a valid path via --model-checkpoint."
        )
_resolve_checkpoint() doc/behavior and exception semantics drifted.
Line 30 says unresolved cases return a primary candidate for downstream validation, but Lines 46-50 now raise immediately. Also, missing-file errors here use ValueError, while validate_model_checkpoint() uses FileNotFoundError for the same condition. Aligning these will make failure handling predictable.
🧩 Suggested fix
def _resolve_checkpoint(model_key: str) -> str:
- """Return the first checkpoint path that exists on disk for *model_key*.
+ """Return the first checkpoint path that exists on disk for *model_key*.
@@
- legacy development paths. If none are found the first candidate is returned
- so that downstream validation produces a clear error message.
+ legacy development paths.
+ Raises FileNotFoundError if no candidate exists on disk.
@@
- if not Path(resolved).exists():
- raise ValueError(
+ if not Path(resolved).exists():
+ raise FileNotFoundError(
f"Model checkpoint '{resolved}' does not exist. "
f"Provide a valid path via --model-checkpoint."
    )
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 38-38: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 26 - 50,
_resolve_checkpoint currently raises immediately when the primary candidate path
does not exist and also uses ValueError inconsistently with
validate_model_checkpoint; restore the original behavior by returning the
primary (baked-in) candidate when candidates exist (even if the path doesn't
exist) so downstream validate_model_checkpoint can perform existence checks and
raise FileNotFoundError, and only raise a ValueError if there are no candidates
at all; update the control flow in _resolve_checkpoint accordingly (refer to
function name _resolve_checkpoint and validation function
validate_model_checkpoint to align semantics).
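A tiny sketch of the aligned exception semantics described above; the helper name is hypothetical, and it mirrors what `validate_model_checkpoint` reportedly raises for the same condition:

```python
from pathlib import Path

def check_checkpoint_exists(path: str) -> str:
    """Hypothetical helper: raise FileNotFoundError (not ValueError) for a
    missing file, matching validate_model_checkpoint's reported behavior."""
    p = Path(path).expanduser()
    if not p.is_file():
        raise FileNotFoundError(
            f"Model checkpoint '{p}' does not exist. "
            f"Provide a valid path via --model-checkpoint."
        )
    return str(p)
```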
# Nothing found – return the primary (baked-in) path so the error message
# points the user to the expected location.
Replace the EN DASH in the comment to satisfy Ruff RUF003.
This keeps lint output clean and avoids recurring static-analysis noise.
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 38-38: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 38 - 39, In
the comment "Nothing found – return the primary (baked-in) path so the error
message points the user to the expected location." in
src/sampleworks/utils/guidance_script_arguments.py replace the EN DASH (–) with
a plain ASCII hyphen-minus (-) so the line reads "Nothing found - return the
primary (baked-in) path..." to satisfy Ruff RUF003 and remove the lint warning.
Summary
- `_resolve_checkpoint()` tries baked-in Docker paths (`/checkpoints/`) first, then falls back to legacy dev paths (`~/.boltz/`, `~/.foundry/`, `.pixi/envs/`). No more `--model-checkpoint` needed when checkpoints are baked into the image.
- `--model-checkpoint` argparse defaults changed from hardcoded legacy paths (that don't exist in Docker) to `None`, which triggers the auto-resolution logic.
- `run_all_models.sh`: Now runs 4 jobs × 2 GPUs (boltz2-xrd, boltz2-md, rf3, protenix) on 8 GPUs total, using the occ_sweep dataset. No explicit `--model-checkpoint` flags needed.
Context
The Docker image already had checkpoints baked in at `/checkpoints/` via `COPY --from=diffuseproject/sampleworks-checkpoints:latest`, but the Python code had hardcoded defaults like `~/.boltz/boltz2_conf.ckpt` that don't exist inside the container, causing `FileNotFoundError` at runtime.
Testing
Verified inside the rebuilt Docker container:
Summary by CodeRabbit
Chores
Documentation
Refactor