feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image #101


Merged (8 commits) on Jun 26, 2025
4 changes: 3 additions & 1 deletion .github/configurations/vllm-tensorizer.yml
@@ -3,4 +3,6 @@ vllm-commit:
 flashinfer-commit:
 - 'v0.2.6.post1'
 base-image:
-- 'ghcr.io/coreweave/ml-containers/torch-extras:es-compute-12.0-67208ca-nccl-cuda12.9.0-ubuntu22.04-nccl2.27.3-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1'
+- 'ghcr.io/coreweave/ml-containers/torch-extras:es-cuda-12.9.1-74755e9-nccl-cuda12.9.1-ubuntu22.04-nccl2.27.5-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1'
+lean-base-image:
+- 'ghcr.io/coreweave/ml-containers/torch-extras:es-cuda-12.9.1-74755e9-base-cuda12.9.1-ubuntu22.04-torch2.7.1-vision0.22.1-audio2.7.1-abi1'
1 change: 1 addition & 0 deletions .github/workflows/vllm-tensorizer.yml
@@ -26,3 +26,4 @@ jobs:
 VLLM_COMMIT=${{ matrix.vllm-commit }}
 FLASHINFER_COMMIT=${{ matrix.flashinfer-commit }}
 BASE_IMAGE=${{ matrix.base-image }}
+LEAN_BASE_IMAGE=${{ matrix.lean-base-image }}
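For context on how a matrix entry like `lean-base-image` typically reaches the Dockerfile, the wiring might look roughly like the sketch below. The job name, step, and use of `docker/build-push-action` are assumptions for illustration; only the build-arg names and matrix keys come from this PR, and the image tags are elided placeholders:

```yaml
# Hypothetical workflow shape -- not the repo's exact file.
jobs:
  build:
    strategy:
      matrix:
        vllm-commit: ['<pinned-sha>']          # placeholder
        flashinfer-commit: ['v0.2.6.post1']
        base-image: ['<full dev image tag>']    # placeholder
        lean-base-image: ['<runtime image tag>']  # placeholder
    runs-on: ubuntu-latest
    steps:
      - uses: docker/build-push-action@v6
        with:
          context: vllm-tensorizer
          build-args: |
            VLLM_COMMIT=${{ matrix.vllm-commit }}
            FLASHINFER_COMMIT=${{ matrix.flashinfer-commit }}
            BASE_IMAGE=${{ matrix.base-image }}
            LEAN_BASE_IMAGE=${{ matrix.lean-base-image }}
```

Each matrix key fans out into one build per combination, and each `build-args` line overrides the matching `ARG` declared at the top of the Dockerfile.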
8 changes: 5 additions & 3 deletions vllm-tensorizer/Dockerfile
@@ -1,11 +1,13 @@
-ARG BASE_IMAGE="ghcr.io/coreweave/ml-containers/torch-extras:es-compute-12.0-67208ca-nccl-cuda12.9.0-ubuntu22.04-nccl2.27.3-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1"
+ARG BASE_IMAGE="ghcr.io/coreweave/ml-containers/torch-extras:es-cuda-12.9.1-74755e9-nccl-cuda12.9.1-ubuntu22.04-nccl2.27.5-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1"
+ARG LEAN_BASE_IMAGE="ghcr.io/coreweave/ml-containers/torch-extras:es-cuda-12.9.1-74755e9-base-cuda12.9.1-ubuntu22.04-torch2.7.1-vision0.22.1-audio2.7.1-abi1"
 
 FROM scratch AS freezer
 WORKDIR /
 COPY --chmod=755 freeze.sh /
 
 FROM ${BASE_IMAGE} AS builder-base
 
-ARG MAX_JOBS="16"
+ARG MAX_JOBS="32"
 
 RUN ldconfig
 
@@ -81,7 +83,7 @@ RUN --mount=type=bind,from=flashinfer-downloader,source=/git/flashinfer,target=/
 WORKDIR /wheels
 
 
-FROM ${BASE_IMAGE} AS base
+FROM ${LEAN_BASE_IMAGE} AS base

Contributor:

Tell me if this is a dumb question, but if this is merging to main, and LEAN_BASE_IMAGE is replacing BASE_IMAGE, does this change anything about the not-lean builds? I'm not entirely sure how merging a branch to main affects the build pipeline, so this could be a misunderstanding on my part.

Collaborator:
This PR is intended to replace all builds with slimmed-down builds, so yes, it replaces the existing ones.

@sangstar (Contributor) Jun 25, 2025:

Risking sounding pedantic here, but why don't we just call it BASE_IMAGE then and do away with the larger one? And I take it there's no benefit in using the larger one? Wouldn't an NCCL-equipped image be useful for distributed inference with vLLM?

Collaborator:

I'm not sure what you mean. There are two base images because one is used for compiling vLLM and one is used for the final image artifact being produced. The one used for compilation will be larger because it includes the compiler and dev libraries.
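The two-base-image pattern described above can be sketched as a minimal multi-stage Dockerfile. This is an illustrative reduction, not the actual Dockerfile from this PR: the image tags, paths, and wheel-building step are placeholders, and only the `BASE_IMAGE`/`LEAN_BASE_IMAGE` arg names and the stage split mirror the real change:

```dockerfile
# Placeholder tags -- the real defaults are the torch-extras images above.
ARG BASE_IMAGE="example.com/torch-extras:dev-with-compilers"
ARG LEAN_BASE_IMAGE="example.com/torch-extras:runtime-only"

# Heavy stage: carries nvcc, headers, and dev libraries needed to compile
# vLLM and flashinfer into wheels. None of this reaches the final image.
FROM ${BASE_IMAGE} AS builder
WORKDIR /build
COPY . .
RUN pip wheel --no-deps --wheel-dir /wheels .

# Lean stage: runtime-only base plus the prebuilt wheels copied over.
FROM ${LEAN_BASE_IMAGE} AS base
COPY --from=builder /wheels /tmp/wheels
RUN pip install --no-cache-dir /tmp/wheels/*.whl && rm -rf /tmp/wheels
```

Because `COPY --from=builder` pulls in only the named artifacts, the compiler toolchain and dev libraries in the builder stage add nothing to the size of the final `base` image.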

@sangstar (Contributor) Jun 25, 2025:

Just wondering whether the NCCL libraries remain on the final image, since it would be good to still be able to do vLLM distributed inference. I'm not solid on the specifics of which exact deps are needed for distributed inference; I just want to make sure this final image can still do inference with model parallelism now that the nccl tag is no longer in play for the base image.

WORKDIR /workspace
