fix: auto-resolve baked-in checkpoint paths with legacy fallback#147
marcuscollins merged 4 commits into main
Conversation
📝 Walkthrough

Moves all model checkpoints into a dedicated Docker base image (diffuseproject/sampleworks-checkpoints:latest), updates CI comments and the Dockerfile to reference that base image, and replaces a 4x4 GPU job grid with four explicit 2-GPU docker runs in run_all_models.sh.

Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (args)
participant Args as guidance_script_arguments
participant Utils as guidance_script_utils
participant Resolver as _resolve_checkpoint
participant Validator as validate_model_checkpoint
participant Config as Guidance Config
CLI->>Utils: populate_config_for_guidance_type(job, args)
Utils->>Args: get_checkpoint(args)
Args->>Args: return explicit checkpoint or None
alt explicit checkpoint present
Args-->>Utils: checkpoint path
else no checkpoint
Utils->>Resolver: _resolve_checkpoint(model_key)
Resolver->>Resolver: check /checkpoints/ then fallbacks
Resolver-->>Utils: resolved_path or ""
Utils->>Validator: validate_model_checkpoint(model, resolved_path)
Validator->>Validator: ensure path exists & is file
Validator-->>Utils: validated_path or raise error
end
Utils->>Config: assign checkpoint to template job
Utils->>Config: propagate checkpoint to all queued jobs
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Pull request overview
This PR aims to make model checkpoint handling work seamlessly in Docker by preferring baked-in /checkpoints/* paths and falling back to legacy developer paths, removing the need to pass explicit checkpoint flags when running inside the container.
Changes:
- Added checkpoint auto-resolution utilities and changed --model-checkpoint defaults to None to trigger auto-resolution.
- Updated RF3 checkpoint filename references (*_remapped.ckpt → *.ckpt) across tests and scripts.
- Refreshed Docker + workflow/docs and updated run_all_models.sh to run 4 jobs × 2 GPUs (8 total) with an MSA cache mount.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/conftest.py | Updates RF3 test checkpoint path to the new filename. |
| src/sampleworks/utils/guidance_script_arguments.py | Introduces checkpoint candidate resolution + changes argparse defaults to None. |
| run_grid_search.py | Updates RF3 default checkpoint filename and help text. |
| run_all_models.sh | Reworks orchestration to 4 parallel 2-GPU jobs with new dataset + MSA cache mount. |
| docker-entrypoint.sh | Documents additional baked-in checkpoints (RF3/Protenix). |
| Dockerfile | Switches to copying /checkpoints/ from a prebuilt checkpoints image; sets RF3/Protenix env vars. |
| .gitignore | Ignores local checkpoints/ directory. |
| .github/workflows/docker.yml | Adds docker-entrypoint.sh as a build trigger and documents external-stage pull behavior. |
| .dockerignore | Excludes checkpoints/ from Docker build context. |
```shell
# Wait for all background jobs
wait

echo ""
echo "=========================================="
```
The script now uses a bare wait under set -e, which does not reliably propagate failures from background jobs (it returns the status of the last job waited on). This can report success even if one of the earlier model runs failed. Restore per-PID waiting and aggregate exit codes (as the previous version did) so CI/automation can detect partial failures.
Suggested change

```diff
-# Wait for all background jobs
-wait
-
-echo ""
-echo "=========================================="
+# Wait for all background jobs individually and aggregate exit codes
+overall_status=0
+for pid in "${PIDS[@]}"; do
+  if wait "$pid"; then
+    echo "[$(date)] Job with PID $pid completed successfully"
+  else
+    exit_code=$?
+    echo "[$(date)] Job with PID $pid failed with exit code $exit_code"
+    overall_status=$exit_code
+  fi
+done
+echo ""
+echo "=========================================="
+if [ "$overall_status" -ne 0 ]; then
+  echo "[$(date)] One or more jobs failed."
+  echo "=========================================="
+  exit "$overall_status"
+fi
```

(Note: the success case is tested first so that `$?` inside the `else` branch still holds `wait`'s nonzero exit code; capturing `$?` inside an `if ! wait ...` branch would read the negated status instead.)
```python
if checkpoint is not None:
    self.model_checkpoint = checkpoint
```
validate_model_checkpoint() implements the new auto-resolution logic, but it is never invoked when building GuidanceConfig. With --model-checkpoint defaults now set to None, GuidanceConfig.model_checkpoint will remain unset unless the caller explicitly provides it, and downstream code (e.g., run_guidance_job_queue in guidance_script_utils.py) rejects None. Consider calling validate_model_checkpoint(self.model, checkpoint) inside populate_config_for_guidance_type() and always populating self.model_checkpoint (either user-provided or auto-resolved).
Suggested change

```diff
-if checkpoint is not None:
-    self.model_checkpoint = checkpoint
+# Always resolve and validate the model checkpoint (user-provided or auto-resolved)
+self.model_checkpoint = validate_model_checkpoint(self.model, checkpoint)
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

6-13: ⚠️ Potential issue | 🟡 Minor

Include .dockerignore in the Docker-image trigger set.

The published image now depends on .dockerignore too. If that file changes by itself, this workflow will not rebuild or push a new image, so CI can silently drift from the repo state.
Suggested change

```diff
 paths:
   - 'pyproject.toml'
   - 'pixi.lock'
   - 'src/**'
   - 'scripts/**'
   - 'Dockerfile'
+  - '.dockerignore'
   - 'docker-entrypoint.sh'
   - '.github/workflows/docker.yml'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/docker.yml around lines 6 - 13, The workflow's paths list under the Docker image CI (the paths: block) is missing .dockerignore so changes to that file won't trigger a rebuild; update the paths: list to include '.dockerignore' (i.e., add the string '.dockerignore' alongside 'pyproject.toml', 'Dockerfile', 'docker-entrypoint.sh', etc.) so the image build/push workflow runs when .dockerignore changes.
🧹 Nitpick comments (1)
Dockerfile (1)

115-121: Pin the checkpoint base image to an immutable digest.

sampleworks-checkpoints:latest makes the contents of /checkpoints change underneath the same repo revision. That is risky here because the expected filenames are hard-coded in this Dockerfile, docker-entrypoint.sh, and _CHECKPOINT_CANDIDATES, so a tag move can break runtime resolution without any code change.

Suggested change

```diff
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=diffuseproject/sampleworks-checkpoints@sha256:<approved-digest> /checkpoints/ /checkpoints/
```
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 115 - 121, The Dockerfile currently copies checkpoints from diffuseproject/sampleworks-checkpoints:latest which is mutable; replace the image tag with an immutable digest (e.g., diffuseproject/sampleworks-checkpoints@sha256:<DIGEST>) in the COPY --from=... statement so /checkpoints/ content is fixed at build time; locate the COPY line that references sampleworks-checkpoints and update it to use the image digest, then update any docs/CI that record the expected digest so future upgrades explicitly change the digest instead of relying on :latest.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@run_all_models.sh`:
- Around line 49-65: The backgrounded docker run pipelines (each using "docker
run ... | tee ... &") are masking failures because tee hides docker's exit code;
wrap each launch pipeline in a subshell that sets pipefail (e.g., ( set -o
pipefail; docker run ... | tee ... ) & ), capture each background PID (use $!
into an array or variables) and then replace the single bare wait with explicit
waits that iterate over those PIDs, check each wait's exit status, and exit
nonzero if any container failed; apply this subshell+pipefail+PID-capture
pattern to all four launches using the existing "docker run" and "tee"
invocations and then perform PID-based waits instead of the bare "wait".
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 44-54: get_checkpoint currently only reads args.model_checkpoint
which breaks grid-search that sets
boltz1_checkpoint/boltz2_checkpoint/protenix_checkpoint/rf3_checkpoint; update
get_checkpoint to check those model-specific attributes (boltz1_checkpoint,
boltz2_checkpoint, protenix_checkpoint, rf3_checkpoint) and pick the first
non-empty value, pass it through validate_model_checkpoint(...) and return the
validated string or None; also make the analogous change where get_checkpoint
logic is duplicated (the occurrence noted around lines 155-158) so both places
resolve model-specific args and call validate_model_checkpoint before returning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 77342163-cd37-4bdd-8371-d3e23d75eb1c
📒 Files selected for processing (9)
- .dockerignore
- .github/workflows/docker.yml
- .gitignore
- Dockerfile
- docker-entrypoint.sh
- run_all_models.sh
- run_grid_search.py
- src/sampleworks/utils/guidance_script_arguments.py
- tests/conftest.py
```shell
docker run $DOCKER_OPTS \
  --gpus '"device=0,1"' \
  -v "$DATA_DIR:/data/inputs:ro" \
  -v "$RESULTS_DIR:/data/results" \
  -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
  diffuseproject/sampleworks:latest \
  -e boltz run_grid_search.py \
  --proteins "/data/inputs/proteins.csv" \
  --models boltz2 \
  --methods "X-RAY DIFFRACTION" \
  --scalers pure_guidance \
  --partial-diffusion-step 120 \
  --ensemble-sizes "8" \
  --gradient-weights "0.1 0.2 0.5" \
  --gradient-normalization --augmentation --align-to-input \
  --output-dir /data/results \
  2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
printf 'pipeline status without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null; echo $?'
printf 'wait on a failed background pipeline without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null & pid=$!; wait "$pid"; echo $?'
printf 'plain wait after a failed background pipeline: '
bash -lc 'false | tee /dev/null >/dev/null & wait; echo $?'
```

Repository: diff-use/sampleworks
Length of output: 634

🏁 Script executed:

```shell
find . -name "run_all_models.sh" -type f
```

Length of output: 83

🏁 Script executed:

```shell
head -150 run_all_models.sh | cat -n
```

Length of output: 6001
Background pipelines are masking docker run failures.
Each of the 4 job launches uses docker run ... | tee ... & without pipefail. Because set -e only applies to the final command in a pipeline, the exit status comes from tee (which always succeeds), not from docker run. Then the bare wait on line 146 returns success after all children exit, regardless of their actual exit codes. This means the script will report "All jobs completed!" even when one or more containers have failed.
Wrap all four launch blocks (lines 49–65, 71–87, 93–107, 113–128) in subshells with set -o pipefail, then replace the bare wait with explicit PID-based waits that check exit codes:
Suggested change

```diff
+(
+  set -o pipefail
   docker run $DOCKER_OPTS \
     --gpus '"device=0,1"' \
     -v "$DATA_DIR:/data/inputs:ro" \
     -v "$RESULTS_DIR:/data/results" \
     -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
     diffuseproject/sampleworks:latest \
     -e boltz run_grid_search.py \
     --proteins "/data/inputs/proteins.csv" \
     --models boltz2 \
     --methods "X-RAY DIFFRACTION" \
     --scalers pure_guidance \
     --partial-diffusion-step 120 \
     --ensemble-sizes "8" \
     --gradient-weights "0.1 0.2 0.5" \
     --gradient-normalization --augmentation --align-to-input \
     --output-dir /data/results \
-    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
+    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log"
+) &

 # Wait for all background jobs
-wait
+failed=0
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    failed=1
+  fi
+done
+if (( failed )); then
+  exit 1
+fi
```

Apply the same subshell + pipefail pattern to all four job launches (lines 71–87, 93–107, 113–128).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_all_models.sh` around lines 49 - 65, The backgrounded docker run
pipelines (each using "docker run ... | tee ... &") are masking failures because
tee hides docker's exit code; wrap each launch pipeline in a subshell that sets
pipefail (e.g., ( set -o pipefail; docker run ... | tee ... ) & ), capture each
background PID (use $! into an array or variables) and then replace the single
bare wait with explicit waits that iterate over those PIDs, check each wait's
exit status, and exit nonzero if any container failed; apply this
subshell+pipefail+PID-capture pattern to all four launches using the existing
"docker run" and "tee" invocations and then perform PID-based waits instead of
the bare "wait".
🧹 Nitpick comments (2)
src/sampleworks/utils/guidance_script_arguments.py (1)

155-162: Consider using validate_model_checkpoint for early validation.

This code auto-resolves the checkpoint path but does not validate that it exists. The path is validated later in get_model_and_device, but storing an invalid path in GuidanceConfig could cause confusing errors if the config is serialized and used elsewhere before model loading.

For consistency with the "fail fast" principle, consider validating upfront:

Suggested change

```diff
 def populate_config_for_guidance_type(self, job: JobConfig, args: argparse.Namespace):
     checkpoint = get_checkpoint(args)
-    if checkpoint is not None:
-        self.model_checkpoint = checkpoint
-    elif not getattr(self, "model_checkpoint", None):
-        # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
-        model_key = str(self.model).lower().replace("structurepredictor.", "")
-        self.model_checkpoint = _resolve_checkpoint(model_key)
+    self.model_checkpoint = validate_model_checkpoint(self.model, checkpoint)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 155 - 162, populate_config_for_guidance_type currently auto-resolves and assigns self.model_checkpoint via get_checkpoint/_resolve_checkpoint but does not validate the resolved path; after assigning self.model_checkpoint (in the branch where checkpoint is None and _resolve_checkpoint returns a value) call validate_model_checkpoint(self.model_checkpoint) and handle failure by raising or logging a clear error so invalid paths are rejected early; ensure you reference the existing functions populate_config_for_guidance_type, get_checkpoint, _resolve_checkpoint, validate_model_checkpoint and the model_checkpoint attribute so callers of GuidanceConfig never retain an unvalidated checkpoint string (matching how get_model_and_device later validates).

src/sampleworks/utils/guidance_script_utils.py (1)

36-41: Consider making _resolve_checkpoint public if it's part of the API.

Importing a private function (_resolve_checkpoint, prefixed with _) across modules couples internal implementation details. If this function is intentionally shared, consider renaming it to resolve_checkpoint to indicate it's part of the module's public interface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_utils.py` around lines 36 - 41, The code imports the private symbol _resolve_checkpoint from guidance_script_arguments; if this function is intended as part of the module API, rename it to resolve_checkpoint in guidance_script_arguments (and update its definition and any docstring), export the new public name, then update imports in guidance_script_utils.py to import resolve_checkpoint instead of _resolve_checkpoint and update all other call sites that reference _resolve_checkpoint (or add a one-line alias _resolve_checkpoint = resolve_checkpoint temporarily for backward compatibility and add a deprecation note). Ensure tests and any references to validate_model_checkpoint or GuidanceConfig remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 67eb44c8-378f-4490-a639-f8694dc45196
📒 Files selected for processing (2)
- src/sampleworks/utils/guidance_script_arguments.py
- src/sampleworks/utils/guidance_script_utils.py
…heckpoints Replace HuggingFace downloads and build-context COPY with a single COPY --from=diffuseproject/sampleworks-checkpoints:latest. This makes the build work in CI (no checkpoint files needed) and on any machine with Docker access to Docker Hub. - Dockerfile: COPY --from checkpoints base image instead of curl/COPY - run_all_models.sh: use Docker Hub image name (configurable via IMAGE env) - docker.yml: add comment explaining CI pattern, trigger on entrypoint changes - .dockerignore: exclude checkpoints/ from build context
…egacy fallback - Add _resolve_checkpoint() that tries /checkpoints/ (Docker image) first, then falls back to legacy dev paths (~/.boltz/, ~/.foundry/, .pixi/envs/) - Change all --model-checkpoint defaults to None (triggers auto-resolution) - Move validate_model_checkpoint and get_checkpoint inline (was importing from checkpoint_utils which doesn't exist on main) - Update run_all_models.sh for 8-GPU setup: 4 jobs x 2 GPUs each (boltz2-xrd, boltz2-md, rf3, protenix) with occ_sweep dataset
Force-pushed 9b67692 to c9eda3c
Actionable comments posted: 2
♻️ Duplicate comments (1)
run_all_models.sh (1)
53-69: ⚠️ Potential issue | 🟠 Major

Preserve docker run failures from the backgrounded tee pipelines.

These jobs still run as docker run ... | tee ... & without pipefail, and line 151 uses a bare wait. A container failure is therefore hidden by tee, and the script can still print "All jobs completed!".

```shell
#!/bin/bash
set -euo pipefail
printf 'background pipeline status without pipefail: '
bash -lc 'false | tee /dev/null >/dev/null & pid=$!; wait "$pid"; echo $?'
printf 'bare wait status after failed background pipeline: '
bash -lc 'false | tee /dev/null >/dev/null & wait; echo $?'
rg -n 'tee "\$RESULTS_DIR/.*\.log" &' run_all_models.sh
rg -n '^[[:space:]]*wait$' run_all_models.sh
```

Suggested change

```diff
-  docker run $DOCKER_OPTS \
+(
+  set -o pipefail
+  docker run $DOCKER_OPTS \
     --gpus '"device=0,1"' \
     -v "$DATA_DIR:/data/inputs:ro" \
     -v "$RESULTS_DIR:/data/results" \
     -v "$MSA_CACHE_DIR:/root/.sampleworks/msa" \
     diffuseproject/sampleworks:latest \
     -e boltz run_grid_search.py \
     --proteins "/data/inputs/proteins.csv" \
     --models boltz2 \
     --methods "X-RAY DIFFRACTION" \
     --scalers pure_guidance \
     --partial-diffusion-step 120 \
     --ensemble-sizes "8" \
     --gradient-weights "0.1 0.2 0.5" \
     --gradient-normalization --augmentation --align-to-input \
     --output-dir /data/results \
-    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log" &
-PIDS+=($!)
+    2>&1 | tee "$RESULTS_DIR/boltz2_xrd_run.log"
+) &
+PIDS+=("$!")
```

```diff
 # Wait for all background jobs
-wait
+failed=0
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    failed=1
+  fi
+done
+if (( failed )); then
+  exit 1
+fi
```

Apply the same subshell + pipefail pattern to the other three launch blocks as well.

Also applies to: 75-91, 97-112, 118-133, 150-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@run_all_models.sh` around lines 53 - 69, The backgrounded docker pipelines (the lines that run docker run ... | tee "$RESULTS_DIR/*.log" &) hide container failures because pipefail isn't set and the script uses a bare wait; fix by running each pipeline in a subshell with pipefail enabled and capturing its PID so wait checks its exit status, e.g. replace instances of docker run ... | tee "$RESULTS_DIR/boltz2_xrd_run.log" & with (set -o pipefail; docker run ... | tee "$RESULTS_DIR/boltz2_xrd_run.log") & pid_boltz2=$! and then later wait "$pid_boltz2"; do the same pattern for the other launch blocks and remove/avoid the bare wait so each specific PID is waited on to propagate failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 19-22: The "protenix" fallback entry in the mapping (the list
under the "protenix" key in guidance_script_arguments.py) uses a repo-relative
path and a hard-coded "python3.12" segment which breaks when running outside the
repo root or under different Python minors; replace the literal
".pixi/envs/protenix-dev/lib/python3.12/..." string with a computed, stable path
(e.g., derive a base via Path(__file__).resolve().parents[3], or use sys.prefix,
or expand a home-relative "~/.pixi/...", or glob for "python*" under
.pixi/envs/protenix-dev/lib) so the fallback resolves independent of CWD and
Python minor version and update the second element of the "protenix" list
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 78acea95-209a-4b49-a440-df87384fd55b
📒 Files selected for processing (5)
- .github/workflows/docker.yml
- Dockerfile
- run_all_models.sh
- src/sampleworks/utils/guidance_script_arguments.py
- src/sampleworks/utils/guidance_script_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/docker.yml
```diff
 if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
-    raise ValueError("Running guidance requires that you specify a model checkpoint")
+    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
+    model_key = str(template_job.model).lower().replace("structurepredictor.", "")
+    resolved = _resolve_checkpoint(model_key)
+    if not resolved:
+        raise ValueError(
+            f"Running guidance requires a model checkpoint for '{template_job.model}'. "
+            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
+        )
+    template_job.model_checkpoint = resolved
+    # Propagate to all jobs in the queue
+    for job in job_queue:
+        job.model_checkpoint = resolved
```
Validate the auto-resolved checkpoint before propagating it to every job.
_resolve_checkpoint() returns the first candidate string even when no file exists, so Line 570 never fires for known models. This branch can stamp a non-existent path onto the whole queue and only fail later in get_model_and_device(). Reusing validate_model_checkpoint(template_job.model, None) here would keep the failure local and immediate.
Suggested change

```diff
 template_job = job_queue[0]
 if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
-    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
-    model_key = str(template_job.model).lower().replace("structurepredictor.", "")
-    resolved = _resolve_checkpoint(model_key)
-    if not resolved:
-        raise ValueError(
-            f"Running guidance requires a model checkpoint for '{template_job.model}'. "
-            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
-        )
+    resolved = validate_model_checkpoint(template_job.model, None)
     template_job.model_checkpoint = resolved
     # Propagate to all jobs in the queue
     for job in job_queue:
         job.model_checkpoint = resolved
```

As per coding guidelines, "Fail fast with clear error messages when invalid states are encountered."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    resolved = validate_model_checkpoint(template_job.model, None)
    template_job.model_checkpoint = resolved
    # Propagate to all jobs in the queue
    for job in job_queue:
        job.model_checkpoint = resolved
```
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 571-574: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_utils.py` around lines 566 - 578,
_resolve_checkpoint can return a candidate string even if the file doesn't
exist, so don't propagate it blindly: after computing resolved =
_resolve_checkpoint(model_key) call
validate_model_checkpoint(template_job.model, resolved) (or the existing
validation helper) and raise a clear ValueError if validation fails; only if
validation succeeds set template_job.model_checkpoint = resolved and then
propagate to each job in job_queue so failures happen immediately (not later in
get_model_and_device()).
Probably good to address this one.
@marcuscollins, agreed — the current code can silently propagate a non-existent path to the entire queue. Since validate_model_checkpoint already handles auto-resolution and path-existence validation, we can collapse the whole block into a single call:

- Replace the manual _resolve_checkpoint call and bare `if not resolved` guard in run_guidance_job_queue with a single validate_model_checkpoint(template_job.model, None) call that auto-resolves and validates in one shot.
🧠 Learnings used
Learnt from: k-chrispens
Repo: diff-use/sampleworks PR: 73
File: src/sampleworks/core/samplers/protocol.py:1-18
Timestamp: 2026-02-02T19:13:51.594Z
Learning: In Python files across the repo, do not suggest or add inline or file-level suppressions for Ruff F722 (jaxtyping dimension strings). Instead rely on a global ignore configured in pyproject.toml (as done by the sampleworks project). Verify that F722 is included in the global ignore, and avoid recommending per-file # ruff: noqa: F722 directives unless explicitly documented as an exception.
Learnt from: marcuscollins
Repo: diff-use/sampleworks PR: 132
File: src/sampleworks/utils/guidance_script_utils.py:585-586
Timestamp: 2026-03-05T16:30:40.514Z
Learning: In Python code, if enums subclassing StrEnum are used (e.g., GuidanceType, StructurePredictor), they serialize to plain strings with json.dump and pickle without special handling. Do not flag these as non-serializable in code reviews. Treat StrEnum values as strings for JSON serialization and ensure tests cover that behavior; no extra pickle handling needed.
💡 Use @coderabbitai plan to create a plan for code changes
    self.model_checkpoint = checkpoint
elif not getattr(self, "model_checkpoint", None):
    # Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
    model_key = str(self.model).lower().replace("structurepredictor.", "")
The `.replace("structurepredictor.", "")` bit isn't necessary. Remove to avoid confusion.
from sampleworks.utils.guidance_script_arguments import (
    GuidanceConfig,
    JobResult,
    _resolve_checkpoint,
Maybe rename so it isn't a "private" method.
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    raise ValueError("Running guidance requires that you specify a model checkpoint")
# Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
model_key = str(template_job.model).lower().replace("structurepredictor.", "")
Ditto, remove the replace bit. It shouldn't be necessary.
if template_job.model_checkpoint is None or template_job.model_checkpoint == "":
    raise ValueError("Running guidance requires that you specify a model checkpoint")
# Auto-resolve from baked-in /checkpoints/ or legacy fallback paths
model_key = str(template_job.model).lower().replace("structurepredictor.", "")
resolved = _resolve_checkpoint(model_key)
if not resolved:
    raise ValueError(
        f"Running guidance requires a model checkpoint for '{template_job.model}'. "
        f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
    )
template_job.model_checkpoint = resolved
# Propagate to all jobs in the queue
for job in job_queue:
    job.model_checkpoint = resolved
Probably good to address this one.
…nt exists and raises if not
marcuscollins
left a comment
I went ahead and made the requested changes myself. They were very minor.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/sampleworks/utils/guidance_script_arguments.py (1)
19-22: ⚠️ Potential issue | 🟠 Major: Protenix fallback path is brittle across working directories and Python minors.
The fallback uses a repo-relative path with hard-coded `python3.12`, so resolution can fail when launched outside the repo root or under a different Python minor. Please make this fallback CWD-independent and version-robust.
🔧 Suggested direction
+import glob
 @@
     "protenix": [
         "/checkpoints/protenix_base_default_v0.5.0.pt",
-        ".pixi/envs/protenix-dev/lib/python3.12/site-packages/release_data/checkpoint/protenix_base_default_v0.5.0.pt",
+        "~/.pixi/envs/protenix-dev/lib/python*/site-packages/release_data/checkpoint/protenix_base_default_v0.5.0.pt",
     ],
 @@
-    for candidate in candidates:
-        resolved = Path(candidate).expanduser()
-        if resolved.exists():
-            return str(resolved)
+    for candidate in candidates:
+        expanded = str(Path(candidate).expanduser())
+        matches = [Path(m) for m in glob.glob(expanded)] if "*" in expanded else [Path(expanded)]
+        for resolved in matches:
+            if resolved.exists():
+                return str(resolved.resolve())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 19 - 22, The "protenix" fallback path in guidance_script_arguments.py is brittle because it hard-codes "python3.12" and a repo-relative path; replace it with a robust resolver that builds candidate paths using the running Python's install locations (use sys.version_info or sys.executable and sysconfig.get_paths()/site.getsitepackages()/sys.prefix) and Path to join the known package path "release_data/checkpoint/protenix_base_default_v0.5.0.pt", or search those site-packages/purelib locations for that relative path and pick the first hit; update the "protenix" entry to call this resolver (or populate it at import time) so the fallback is CWD-independent and works across Python minor versions.
🧹 Nitpick comments (1)
src/sampleworks/utils/guidance_script_arguments.py (1)
26-33: Use a full NumPy-style docstring for the new function.
`_resolve_checkpoint()` has a narrative docstring, but not NumPy sections (Parameters, Returns, Raises), which makes API contracts less explicit. As per coding guidelines, "**/*.py: Always include NumPy-style docstrings for every function and class".
Verify each finding against the current code and only fix it if needed. In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 26 - 33, Update the docstring of _resolve_checkpoint to use NumPy-style sections: add a Parameters section describing model_key: str, a Returns section describing the returned checkpoint path as str, and a Raises section if the function can raise any exceptions (or state that it does not); keep the existing narrative summary and reference to _CHECKPOINT_CANDIDATES and the fallback behavior so callers understand the contract and possible error conditions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 38-39: In the comment "Nothing found – return the primary
(baked-in) path so the error message points the user to the expected location."
in src/sampleworks/utils/guidance_script_arguments.py replace the EN DASH (–)
with a plain ASCII hyphen-minus (-) so the line reads "Nothing found - return
the primary (baked-in) path..." to satisfy Ruff RUF003 and remove the lint
warning.
- Around line 26-50: _resolve_checkpoint currently raises immediately when the
primary candidate path does not exist and also uses ValueError inconsistently
with validate_model_checkpoint; restore the original behavior by returning the
primary (baked-in) candidate when candidates exist (even if the path doesn't
exist) so downstream validate_model_checkpoint can perform existence checks and
raise FileNotFoundError, and only raise a ValueError if there are no candidates
at all; update the control flow in _resolve_checkpoint accordingly (refer to
function name _resolve_checkpoint and validation function
validate_model_checkpoint to align semantics).
---
Duplicate comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 19-22: The "protenix" fallback path in
guidance_script_arguments.py is brittle because it hard-codes "python3.12" and a
repo-relative path; replace it with a robust resolver that builds candidate
paths using the running Python's install locations (use sys.version_info or
sys.executable and sysconfig.get_paths()/site.getsitepackages()/sys.prefix) and
Path to join the known package path
"release_data/checkpoint/protenix_base_default_v0.5.0.pt", or search those
site-packages/purelib locations for that relative path and pick the first hit;
update the "protenix" entry to call this resolver (or populate it at import
time) so the fallback is CWD-independent and works across Python minor versions.
---
Nitpick comments:
In `@src/sampleworks/utils/guidance_script_arguments.py`:
- Around line 26-33: Update the docstring of _resolve_checkpoint to use
NumPy-style sections: add a Parameters section describing model_key: str, a
Returns section describing the returned checkpoint path as str, and a Raises
section if the function can raise any exceptions (or state that it does not);
keep the existing narrative summary and reference to _CHECKPOINT_CANDIDATES and
the fallback behavior so callers understand the contract and possible error
conditions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ebfbf9b-eb83-4083-a87e-1ba2b8ac26c2
📒 Files selected for processing (2)
src/sampleworks/utils/guidance_script_arguments.py
src/sampleworks/utils/guidance_script_utils.py
def _resolve_checkpoint(model_key: str) -> str:
    """Return the first checkpoint path that exists on disk for *model_key*.

    Tries baked-in Docker paths first (``/checkpoints/``), then falls back to
    legacy development paths. If none are found the first candidate is returned
    so that downstream validation produces a clear error message.
    """
    candidates = _CHECKPOINT_CANDIDATES.get(model_key, [])
    for candidate in candidates:
        resolved = Path(candidate).expanduser()
        if resolved.exists():
            return str(resolved)
    # Nothing found – return the primary (baked-in) path so the error message
    # points the user to the expected location.
    resolved = candidates[0] if candidates else ""
    if not resolved:
        raise ValueError(
            f"Running guidance requires a model checkpoint for '{model_key}'. "
            f"Provide --model-checkpoint or bake checkpoints into /checkpoints/."
        )
    if not Path(resolved).exists():
        raise ValueError(
            f"Model checkpoint '{resolved}' does not exist. "
            f"Provide a valid path via --model-checkpoint."
        )
_resolve_checkpoint() doc/behavior and exception semantics drifted.
Line 30 says unresolved cases return a primary candidate for downstream validation, but Lines 46-50 now raise immediately. Also, missing-file errors here use ValueError, while validate_model_checkpoint() uses FileNotFoundError for the same condition. Aligning these will make failure handling predictable.
🧩 Suggested fix
def _resolve_checkpoint(model_key: str) -> str:
- """Return the first checkpoint path that exists on disk for *model_key*.
+ """Return the first checkpoint path that exists on disk for *model_key*.
@@
- legacy development paths. If none are found the first candidate is returned
- so that downstream validation produces a clear error message.
+ legacy development paths.
+ Raises FileNotFoundError if no candidate exists on disk.
@@
- if not Path(resolved).exists():
- raise ValueError(
+ if not Path(resolved).exists():
+ raise FileNotFoundError(
f"Model checkpoint '{resolved}' does not exist. "
f"Provide a valid path via --model-checkpoint."
    )
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 38-38: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 26 - 50,
_resolve_checkpoint currently raises immediately when the primary candidate path
does not exist and also uses ValueError inconsistently with
validate_model_checkpoint; restore the original behavior by returning the
primary (baked-in) candidate when candidates exist (even if the path doesn't
exist) so downstream validate_model_checkpoint can perform existence checks and
raise FileNotFoundError, and only raise a ValueError if there are no candidates
at all; update the control flow in _resolve_checkpoint accordingly (refer to
function name _resolve_checkpoint and validation function
validate_model_checkpoint to align semantics).
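A tiny sketch of the aligned exception semantics described above; the helper name is hypothetical, and it mirrors what `validate_model_checkpoint` reportedly raises for the same condition:

```python
from pathlib import Path

def check_checkpoint_exists(path: str) -> str:
    """Hypothetical helper: raise FileNotFoundError (not ValueError) for a
    missing file, matching validate_model_checkpoint's reported behavior."""
    p = Path(path).expanduser()
    if not p.is_file():
        raise FileNotFoundError(
            f"Model checkpoint '{p}' does not exist. "
            f"Provide a valid path via --model-checkpoint."
        )
    return str(p)
```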
# Nothing found – return the primary (baked-in) path so the error message
# points the user to the expected location.
Replace the EN DASH in the comment to satisfy Ruff RUF003.
This keeps lint output clean and avoids recurring static-analysis noise.
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 38-38: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/guidance_script_arguments.py` around lines 38 - 39, In
the comment "Nothing found – return the primary (baked-in) path so the error
message points the user to the expected location." in
src/sampleworks/utils/guidance_script_arguments.py replace the EN DASH (–) with
a plain ASCII hyphen-minus (-) so the line reads "Nothing found - return the
primary (baked-in) path..." to satisfy Ruff RUF003 and remove the lint warning.
Summary
- `_resolve_checkpoint()` tries baked-in Docker paths (`/checkpoints/`) first, then falls back to legacy dev paths (`~/.boltz/`, `~/.foundry/`, `.pixi/envs/`). No more `--model-checkpoint` needed when checkpoints are baked into the image.
- `--model-checkpoint` argparse defaults changed from hardcoded legacy paths (that don't exist in Docker) to `None`, which triggers the auto-resolution logic.
- `run_all_models.sh`: Now runs 4 jobs × 2 GPUs (boltz2-xrd, boltz2-md, rf3, protenix) on 8 GPUs total, using the occ_sweep dataset. No explicit `--model-checkpoint` flags needed.
Context
The Docker image already had checkpoints baked in at `/checkpoints/` via `COPY --from=diffuseproject/sampleworks-checkpoints:latest`, but the Python code had hardcoded defaults like `~/.boltz/boltz2_conf.ckpt` that don't exist inside the container, causing `FileNotFoundError` at runtime.
Testing
Verified inside the rebuilt Docker container:
Summary by CodeRabbit
Chores
Documentation
Refactor