fix(docker): combine pixi installs into single RUN to prevent disk-full #206

Merged
marcuscollins merged 5 commits into main from
worktree-fix-docker-buildx-driver
Apr 9, 2026

@Abdelsalam-Abbas (Contributor) commented Apr 9, 2026

Summary

  • Combines the three separate RUN pixi install commands back into a single RUN
  • Separate RUNs create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.) across environments: measured ~37 GB (3 layers) vs ~14 GB (1 layer)
  • ubuntu-latest non-deterministically provisions runners with 72 GB or 145 GB disks. On 72 GB runners, the split-RUN approach exceeds available space during build

Root cause investigation

The split was introduced in #204 for better layer caching. However, the three pixi environments share many conda packages, and overlay layers store full copies of files per layer. This ~23 GB overhead pushes the build past the disk limit on smaller runners.
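
A small illustration of why per-layer copies cost more than sharing within one layer. It assumes, and the PR does not state this, that shared package files are hardlinked from a common cache; the `copied` directory stands in for a second layer that cannot share those links. All paths and file names here are hypothetical.

```shell
#!/bin/sh
set -eu

# Scratch directory, removed on exit.
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT

mkdir -p "$work/linked/env_a" "$work/linked/env_b" \
         "$work/copied/env_a" "$work/copied/env_b"

# A 4 MiB stand-in for a package file shared by two environments.
dd if=/dev/urandom of="$work/linked/env_a/lib.bin" bs=1048576 count=4 2>/dev/null

# Same layer: the second environment hardlinks the file, so the data is stored once.
ln "$work/linked/env_a/lib.bin" "$work/linked/env_b/lib.bin"

# Separate layers: each environment ends up with its own full copy.
cp "$work/linked/env_a/lib.bin" "$work/copied/env_a/lib.bin"
cp "$work/linked/env_a/lib.bin" "$work/copied/env_b/lib.bin"

linked_kb="$(du -sk "$work/linked" | awk '{print $1}')"
copied_kb="$(du -sk "$work/copied" | awk '{print $1}')"
echo "hardlinked: ${linked_kb} KiB, copied: ${copied_kb} KiB"
```

On a typical filesystem the copied tree reports roughly twice the usage of the hardlinked tree, which is the same effect the table below attributes to the extra layers.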

Disk usage measured via df -h inside Docker build steps:

| Metric | Split RUN (3 layers) | Single RUN (1 layer) |
| --- | --- | --- |
| boltz | 8 GB | 8 GB |
| protenix | 12 GB (new layer) | 4 GB (shared pkgs deduped) |
| rf3 | 12 GB (new layer) | 2 GB (shared pkgs deduped) |
| Total pixi disk | ~37 GB | ~14 GB |
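
The reported totals can be checked with a few lines of shell arithmetic. This is only a rough sketch: the 72 GB and 145 GB figures are the runner disk sizes quoted in the summary, and the percentages ignore everything else on the disk (base image, checkpoints, build cache), so they understate real pressure.

```shell
#!/bin/sh
# Totals as reported in the table above (GB).
split_total_gb=37
single_total_gb=14

# Overhead attributed to the two extra layers: 37 - 14 = 23.
overhead_gb=$((split_total_gb - single_total_gb))
echo "layer overhead: ${overhead_gb} GB"

# Share of each runner's total disk taken by the pixi environments alone.
for disk_gb in 72 145; do
  split_pct=$((100 * split_total_gb / disk_gb))
  single_pct=$((100 * single_total_gb / disk_gb))
  echo "${disk_gb} GB runner: split ${split_pct}%, single ${single_pct}%"
done
```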

Test plan

  • Verified split-RUN fails on 72 GB runners (jobs 70663224672, 70670232201)
  • Verified single-RUN disk usage via df -h debugging (job 70682099334)
  • Confirmed checkpoint image unchanged since March 25 (same SHA across all builds)
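
Because the runner disk size varies, a preflight check before the build can make a disk-full failure explicit instead of surfacing mid-install. The following is a minimal sketch, not part of this PR: the function names and the 40 GiB default threshold are hypothetical and would need tuning against measured peak usage.

```shell
#!/bin/sh
# Print free space on a mount point in whole GiB.
# Uses POSIX df output (-P) with 1024-byte blocks (-k).
avail_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Succeed if at least $2 GiB are free on mount point $1, otherwise report and fail.
check_disk() {
  free="$(avail_gib "$1")"
  if [ "$free" -lt "$2" ]; then
    echo "only ${free} GiB free on $1, need $2 GiB" >&2
    return 1
  fi
  echo "${free} GiB free on $1"
}

# Hypothetical threshold; override via the environment.
MIN_FREE_GIB="${MIN_FREE_GIB:-40}"
echo "free on /: $(avail_gib /) GiB (threshold ${MIN_FREE_GIB} GiB)"
```

In a workflow step this could be called as `check_disk / "$MIN_FREE_GIB" || exit 1` ahead of the Docker build.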

Summary by CodeRabbit

  • Chores
    • Consolidated the three pixi environment installs in the Dockerfile into a single RUN layer, reducing disk usage during image builds while leaving the installed environments unchanged.

The docker-container driver duplicates the checkpoint image in its
nested content store (~20 GB for pull + extract), leaving insufficient
space for pixi install. The docker driver runs BuildKit in-process,
avoiding this duplication.

Also adds pull_request trigger and disables push for testing.

@coderabbitai Bot (Contributor) commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 64b41697-72dd-476a-bb99-6adbb311c78d

📥 Commits

Reviewing files that changed from the base of the PR and between a0ef63c and c907a49.

📒 Files selected for processing (1)
  • Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile

📝 Walkthrough

The Dockerfile consolidates three separate pixi install commands for the boltz, protenix, and rf3 environments into a single chained RUN instruction. This reduces the installation from three layers to one, trading the per-environment layer caching introduced in #204 for lower disk usage during the build.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docker Build Optimization: Dockerfile | Consolidated three separate RUN pixi install -e <env> --frozen commands into a single chained RUN using && operators for sequential environment installations within one layer. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • xraymemory
  • k-chrispens
  • marcuscollins

Poem

🐰 Layers once three, now melded as one,
Docker builds faster when chaining is done!
Pixi's environments dance in a line,
One cache, one image: an optimization divine!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(docker): combine pixi installs into single RUN to prevent disk-full' accurately describes the main change: consolidating multiple pixi install commands into a single RUN layer to reduce disk usage during Docker builds. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/docker.yml (1)

13-14: Consider guarding Docker Hub auth/publish for PR events (especially forks).

With Line 13 enabling pull_request, keep build validation but gate secret-dependent steps so fork PRs don’t fail on Docker Hub auth.

Suggested hardening
       - name: Login to Docker Hub
+        if: github.event_name != 'pull_request' || !github.event.pull_request.head.repo.fork
         uses: docker/login-action@v4
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_TOKEN }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/docker.yml around lines 13 - 14, The workflow enables
pull_request events but doesn’t guard secret-dependent Docker Hub auth/publish
steps, causing fork PRs to fail; update the Docker login/publish steps (those
that use DOCKERHUB_USERNAME/DOCKERHUB_TOKEN or perform docker build-and-push) to
run only when the PR is not from a fork by adding an if conditional such as if:
github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork
== false (or equivalent check for same-repo PRs), and ensure repository-level
permissions/secrets are restricted so secret access is skipped for forked PRs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/docker.yml:
- Line 71: The workflow currently disables image publishing by setting "push:
false" which prevents pushes on real events (e.g., push to main or release
tags); update the workflow to restore publishing for real runs by removing or
changing the test-only override so that "push" is enabled for production events
(e.g., set push: true or conditionally set push: false only for the specific
test job/event), ensuring the Docker login step can push images as intended.

ℹ️ Review info
Run ID: bb9cf2bb-d424-494e-b3c0-9c911c0f576e

📥 Commits

Reviewing files that changed from the base of the PR and between 41512b1 and ef8a4bd.

📒 Files selected for processing (1)
  • .github/workflows/docker.yml

Comment thread .github/workflows/docker.yml Outdated
The checkpoint COPY materializes ~10 GB into the build layer graph,
leaving insufficient space for pixi to extract conda packages. Moving
it after pixi installs reduces peak disk usage during the build.

Added df -h before/after each pixi install and before checkpoint COPY
to capture disk usage for debugging.

@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
Dockerfile (1)

113-115: Remove temporary df -h probes before merge

The df -h probes were useful for debugging, but keeping them permanently adds noisy logs and unnecessary build steps. Since this PR’s test plan treats these as temporary diagnostics, it’s better to drop them after validation.

Proposed cleanup
-RUN df -h / && pixi install -e boltz --frozen && df -h /
-RUN df -h / && pixi install -e protenix --frozen && df -h /
-RUN df -h / && pixi install -e rf3 --frozen && df -h /
+RUN pixi install -e boltz --frozen
+RUN pixi install -e protenix --frozen
+RUN pixi install -e rf3 --frozen
...
-RUN df -h /
 COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/

Also applies to: 124-124

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 113 - 115, Remove the temporary df -h probes from
the Docker build RUN steps that wrap pixi install commands; specifically edit
the RUN lines that call "pixi install -e boltz --frozen", "pixi install -e
protenix --frozen", and "pixi install -e rf3 --frozen" to drop the preceding and
trailing "df -h / &&" and "&& df -h /" so the steps only run the pixi install
commands, and do the same for the similar RUN at the other location mentioned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile`:
- Line 125: The Dockerfile COPY step uses a mutable tag
"diffuseproject/sampleworks-checkpoints:latest" which makes builds
non-reproducible; replace that tag with an immutable image digest (use the
repository image digest from the registry) and update the COPY --from reference
accordingly (e.g., change COPY
--from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
to use diffuseproject/sampleworks-checkpoints@sha256:<actual-digest>), ensuring
the pinned digest is the one you verified from the registry so future builds use
the exact same source image.

ℹ️ Review info
Run ID: 330f40f0-a2a6-4178-a902-b3f686624d46

📥 Commits

Reviewing files that changed from the base of the PR and between ef8a4bd and 78e7e44.

📒 Files selected for processing (1)
  • Dockerfile

Comment thread Dockerfile Outdated
Reverting to the original PR #204 ordering (checkpoints COPY before pixi
installs) to capture df -h readings and verify the disk usage theory. This
will likely fail on 72 GB runners, but the df output will confirm why.

Testing whether combining pixi installs into one RUN reduces peak disk
usage via fewer overlay layers. Checkpoints are still copied before pixi (the
ordering that failed on 72 GB runners) to see if single-RUN is enough to fit.

@coderabbitai Bot (Contributor) left a comment

♻️ Duplicate comments (1)
Dockerfile (1)

109-111: ⚠️ Potential issue | 🟠 Major

Pin checkpoints image to an immutable digest.

Using :latest on Line 110 makes this build non-reproducible; upstream image changes can silently alter artifacts and disk behavior across CI runs.

Proposed hardening
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=diffuseproject/sampleworks-checkpoints@sha256:<verified_digest> /checkpoints/ /checkpoints/

#!/bin/bash
set -euo pipefail

# Verify mutable tag usage in Dockerfile
rg -n 'COPY\s+--from=diffuseproject/sampleworks-checkpoints:latest' Dockerfile

# Resolve the current digest for :latest from Docker Hub (to pin in Dockerfile)
TOKEN="$(curl -fsSL 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:diffuseproject/sampleworks-checkpoints:pull' | jq -r '.token')"
curl -fsSI \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Accept: application/vnd.docker.distribution.manifest.list.v2+json' \
  'https://registry-1.docker.io/v2/diffuseproject/sampleworks-checkpoints/manifests/latest' \
  | tr -d '\r' | rg -i 'docker-content-digest'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 109 - 111, The Dockerfile uses a mutable tag in the
COPY --from stage (COPY --from=diffuseproject/sampleworks-checkpoints:latest
/checkpoints/ /checkpoints/), making builds non-reproducible; resolve the
current immutable digest for diffuseproject/sampleworks-checkpoints:latest (via
Docker Hub manifest API or docker pull + docker inspect/manifest) and replace
the tag with the pinned digest form
(diffuseproject/sampleworks-checkpoints@sha256:<DIGEST>) in the COPY --from
instruction; optionally document/update CI to refresh the pinned digest when
intentional updates are needed.

ℹ️ Review info
Run ID: dad32474-d386-47c7-9134-276f3fce374c

📥 Commits

Reviewing files that changed from the base of the PR and between 9c856d9 and a0ef63c.

📒 Files selected for processing (1)
  • Dockerfile

Separate RUN commands for each pixi environment create overlay layers
that duplicate shared conda packages (numpy, CUDA libs, etc.), consuming
~37 GB vs ~14 GB in a single layer. This causes disk-full failures on
CI runners with 72 GB disks (ubuntu-latest provisions either 72 or
145 GB non-deterministically).

Reverts the split introduced in #204 while keeping the other
optimizations (free-disk-space, checkpoint layer ordering).

@Abdelsalam-Abbas changed the title from "fix(ci): switch buildx to docker driver to fix disk-full build" to "fix(docker): combine pixi installs into single RUN to prevent disk-full" on Apr 9, 2026
@marcuscollins marcuscollins merged commit bc2cb94 into main Apr 9, 2026
1 check passed
@k-chrispens k-chrispens deleted the worktree-fix-docker-buildx-driver branch April 22, 2026 00:25