fix(docker): combine pixi installs into single RUN to prevent disk-full #206
marcuscollins merged 5 commits into main
Conversation
The docker-container driver duplicates the checkpoint image in its nested content store (~20 GB for pull + extract), leaving insufficient space for pixi install. The docker driver runs BuildKit in-process, avoiding this duplication. Also adds pull_request trigger and disables push for testing.
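Selecting the in-process driver in CI might look like this (a sketch, assuming the workflow uses `docker/setup-buildx-action`; the step name is illustrative):

```yaml
# Use BuildKit in-process ("docker" driver) instead of the default
# "docker-container" driver, which runs BuildKit inside a container with
# its own content store and therefore duplicates large images during the
# build (~20 GB for the checkpoint image here).
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    driver: docker
```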
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration: defaults · Review profile: CHILL · Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

The Dockerfile consolidates three separate `RUN pixi install` commands into a single `RUN`.
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/docker.yml (1)
13-14: Consider guarding Docker Hub auth/publish for PR events (especially forks).

With line 13 enabling `pull_request`, keep build validation but gate secret-dependent steps so fork PRs don't fail on Docker Hub auth.

Suggested hardening:
```diff
  - name: Login to Docker Hub
+   if: github.event_name != 'pull_request' || !github.event.pull_request.head.repo.fork
    uses: docker/login-action@v4
    with:
      username: ${{ secrets.DOCKERHUB_USERNAME }}
      password: ${{ secrets.DOCKERHUB_TOKEN }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/docker.yml around lines 13 - 14, The workflow enables pull_request events but doesn’t guard secret-dependent Docker Hub auth/publish steps, causing fork PRs to fail; update the Docker login/publish steps (those that use DOCKERHUB_USERNAME/DOCKERHUB_TOKEN or perform docker build-and-push) to run only when the PR is not from a fork by adding an if conditional such as if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork == false (or equivalent check for same-repo PRs), and ensure repository-level permissions/secrets are restricted so secret access is skipped for forked PRs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/docker.yml:
- Line 71: The workflow currently disables image publishing by setting "push:
false" which prevents pushes on real events (e.g., push to main or release
tags); update the workflow to restore publishing for real runs by removing or
changing the test-only override so that "push" is enabled for production events
(e.g., set push: true or conditionally set push: false only for the specific
test job/event), ensuring the Docker login step can push images as intended.
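One common way to express that conditional publish (a sketch, assuming the build step uses `docker/build-push-action`; the tag shown is illustrative, not taken from this workflow):

```yaml
- name: Build and push
  uses: docker/build-push-action@v6
  with:
    context: .
    # Build on every event, but only publish on real runs (push to main,
    # release tags) - never on pull_request runs, where secrets may be
    # unavailable for forks anyway.
    push: ${{ github.event_name != 'pull_request' }}
    tags: diffuseproject/sampleworks:latest
```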
---
Nitpick comments:
In @.github/workflows/docker.yml:
- Around line 13-14: The workflow enables pull_request events but doesn’t guard
secret-dependent Docker Hub auth/publish steps, causing fork PRs to fail; update
the Docker login/publish steps (those that use
DOCKERHUB_USERNAME/DOCKERHUB_TOKEN or perform docker build-and-push) to run only
when the PR is not from a fork by adding an if conditional such as if:
github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork
== false (or equivalent check for same-repo PRs), and ensure repository-level
permissions/secrets are restricted so secret access is skipped for forked PRs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb9cf2bb-d424-494e-b3c0-9c911c0f576e
📒 Files selected for processing (1)
.github/workflows/docker.yml
The checkpoint COPY materializes ~10 GB into the build layer graph, leaving insufficient space for pixi to extract conda packages. Moving it after pixi installs reduces peak disk usage during the build. Added df -h before/after each pixi install and before checkpoint COPY to capture disk usage for debugging.
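The resulting ordering and probes might look like this in the Dockerfile (a sketch reconstructed from the commands quoted in the review below; surrounding context is omitted):

```dockerfile
# Install the pixi environments first, while the layer graph is small.
# The df -h probes are temporary diagnostics to capture peak disk usage.
RUN df -h / && pixi install -e boltz --frozen && df -h /
RUN df -h / && pixi install -e protenix --frozen && df -h /
RUN df -h / && pixi install -e rf3 --frozen && df -h /

# Materialize the ~10 GB checkpoints only after the installs, so the
# checkpoint layer never coexists with pixi's extraction scratch space.
RUN df -h /
COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
```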
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile (1)
113-115: Remove temporary `df -h` probes before merge

The `df -h` probes were useful for debugging, but keeping them permanently adds noisy logs and unnecessary build steps. Since this PR’s test plan treats these as temporary diagnostics, it’s better to drop them after validation.

Proposed cleanup:

```diff
-RUN df -h / && pixi install -e boltz --frozen && df -h /
-RUN df -h / && pixi install -e protenix --frozen && df -h /
-RUN df -h / && pixi install -e rf3 --frozen && df -h /
+RUN pixi install -e boltz --frozen
+RUN pixi install -e protenix --frozen
+RUN pixi install -e rf3 --frozen
 ...
-RUN df -h /
 COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
```

Also applies to: 124-124
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 113 - 115, Remove the temporary df -h probes from the Docker build RUN steps that wrap pixi install commands; specifically edit the RUN lines that call "pixi install -e boltz --frozen", "pixi install -e protenix --frozen", and "pixi install -e rf3 --frozen" to drop the preceding and trailing "df -h / &&" and "&& df -h /" so the steps only run the pixi install commands, and do the same for the similar RUN at the other location mentioned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Line 125: The Dockerfile COPY step uses a mutable tag
"diffuseproject/sampleworks-checkpoints:latest" which makes builds
non-reproducible; replace that tag with an immutable image digest (use the
repository image digest from the registry) and update the COPY --from reference
accordingly (e.g., change COPY
--from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
to use diffuseproject/sampleworks-checkpoints@sha256:<actual-digest>), ensuring
the pinned digest is the one you verified from the registry so future builds use
the exact same source image.
---
Nitpick comments:
In `@Dockerfile`:
- Around line 113-115: Remove the temporary df -h probes from the Docker build
RUN steps that wrap pixi install commands; specifically edit the RUN lines that
call "pixi install -e boltz --frozen", "pixi install -e protenix --frozen", and
"pixi install -e rf3 --frozen" to drop the preceding and trailing "df -h / &&"
and "&& df -h /" so the steps only run the pixi install commands, and do the
same for the similar RUN at the other location mentioned.
Reverting to the original PR #204 ordering (checkpoints COPY before pixi installs) to capture df -h readings and verify disk usage theory. This will likely fail on 75G runners but the df output will confirm why.
Testing whether combining pixi installs into one RUN reduces peak disk via fewer overlay layers. Checkpoints still before pixi (the ordering that failed on 72G runners) to see if single-RUN is enough to fit.
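The combined install described above might look like this (a sketch based on the commands quoted in the review; the actual Dockerfile may differ):

```dockerfile
# One layer for all three environments: conda packages shared across
# them (numpy, CUDA libs, ...) are written into the overlay graph once,
# instead of once per RUN-created layer.
RUN pixi install -e boltz --frozen \
 && pixi install -e protenix --frozen \
 && pixi install -e rf3 --frozen
```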
♻️ Duplicate comments (1)
Dockerfile (1)
109-111: ⚠️ Potential issue | 🟠 Major: Pin checkpoints image to an immutable digest.
Using `:latest` on line 110 makes this build non-reproducible; upstream image changes can silently alter artifacts and disk behavior across CI runs.

Proposed hardening:
```diff
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=diffuseproject/sampleworks-checkpoints@sha256:<verified_digest> /checkpoints/ /checkpoints/
```

```bash
#!/bin/bash
set -euo pipefail

# Verify mutable tag usage in Dockerfile
rg -n 'COPY\s+--from=diffuseproject/sampleworks-checkpoints:latest' Dockerfile

# Resolve the current digest for :latest from Docker Hub (to pin in Dockerfile)
TOKEN="$(curl -fsSL 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:diffuseproject/sampleworks-checkpoints:pull' | jq -r '.token')"
curl -fsSI \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Accept: application/vnd.docker.distribution.manifest.list.v2+json' \
  'https://registry-1.docker.io/v2/diffuseproject/sampleworks-checkpoints/manifests/latest' \
  | tr -d '\r' | rg -i 'docker-content-digest'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 109 - 111, The Dockerfile uses a mutable tag in the COPY --from stage (COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/), making builds non-reproducible; resolve the current immutable digest for diffuseproject/sampleworks-checkpoints:latest (via Docker Hub manifest API or docker pull + docker inspect/manifest) and replace the tag with the pinned digest form (diffuseproject/sampleworks-checkpoints@sha256:<DIGEST>) in the COPY --from instruction; optionally document/update CI to refresh the pinned digest when intentional updates are needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@Dockerfile`:
- Around line 109-111: The Dockerfile uses a mutable tag in the COPY --from
stage (COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/
/checkpoints/), making builds non-reproducible; resolve the current immutable
digest for diffuseproject/sampleworks-checkpoints:latest (via Docker Hub
manifest API or docker pull + docker inspect/manifest) and replace the tag with
the pinned digest form (diffuseproject/sampleworks-checkpoints@sha256:<DIGEST>)
in the COPY --from instruction; optionally document/update CI to refresh the
pinned digest when intentional updates are needed.
Separate RUN commands for each pixi environment create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.), consuming ~37 GB vs ~14 GB in a single layer. This causes disk-full failures on CI runners with 72 GB disks (ubuntu-latest provisions either 72 or 145 GB non-deterministically). Reverts the split introduced in #204 while keeping the other optimizations (free-disk-space, checkpoint layer ordering).
Summary
- Combines the three `RUN pixi install` commands back into a single `RUN`.
- `ubuntu-latest` non-deterministically provisions runners with 72 GB or 145 GB disks; on 72 GB runners, the split-RUN approach exceeds available space during the build.

Root cause investigation
The split was introduced in #204 for better layer caching. However, the three pixi environments share many conda packages, and overlay layers store full copies of files per layer. This ~23 GB overhead pushes the build past the disk limit on smaller runners.
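The arithmetic behind that overhead can be sketched as follows (the per-environment sizes are illustrative, chosen only to match the ~14 GB vs ~37 GB totals reported in this PR; they model the claim that each layer stores its own copy of the shared packages):

```python
# Hypothetical sizes in GB: a large set of conda packages shared by all
# three environments (numpy, CUDA libs, ...) plus each environment's
# unique packages.
shared = 11.5              # packages common to boltz, protenix, and rf3
unique = [1.0, 0.5, 1.0]   # packages unique to each environment

# Single RUN -> one overlay layer: shared files land on disk once.
single_layer = shared + sum(unique)

# Three RUNs -> three layers: per the PR's analysis, each layer carries
# its own full copy of the shared packages it installs.
split_layers = sum(shared + u for u in unique)

print(f"single layer: {single_layer:.1f} GB")                  # 14.0 GB
print(f"split layers: {split_layers:.1f} GB")                  # 37.0 GB
print(f"overhead:     {split_layers - single_layer:.1f} GB")   # 23.0 GB
```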
Disk usage was measured via `df -h` inside the Docker build steps.

Test plan