feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image #101


Merged
merged 8 commits into main from jp/testing/slim-vllm-image on Jun 26, 2025

Conversation

JustinPerlman
Contributor

@JustinPerlman JustinPerlman commented Jun 24, 2025

This PR optimizes the existing multi-stage Docker build for the vllm-tensorizer image, significantly reducing its final size (by about 9.5 GiB) for more efficient deployment and faster pod spin-up.

Key Changes:

  • Multi-Stage Architecture: Separated build-time dependencies from runtime essentials by defining distinct base images for different build stages (an illustrative build sketch follows this list).
    • Builder Stages: Utilize a larger nccl variant (BUILDER_BASE_IMAGE) for compilation, ensuring all CUDA development tools (nvcc, libcublas-dev, etc.) are present.
    • Final Stage: Employ a barebones base variant (LEAN_BASE_IMAGE) for the runtime image, drastically reducing the final image footprint.
  • CUDA Version: Updated CUDA version from 12.9.0 to 12.9.1.
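
For illustration, a local build invocation might wire the two base images in roughly as follows. This is a sketch only: the build-context path, the output tag, and the placeholder image references are assumptions, and it presumes both variables are exposed as Dockerfile ARGs (as the diff suggests for LEAN_BASE_IMAGE).

# Placeholder values -- substitute real tags. The builder image is the larger
# nccl/devel variant (nvcc, libcublas-dev, ...); the lean image is the barebones
# runtime-only base variant.
BUILDER_BASE_IMAGE='<nccl-devel variant tag>'
LEAN_BASE_IMAGE='<barebones base variant tag>'

docker build ./vllm-tensorizer \
  --build-arg BUILDER_BASE_IMAGE="${BUILDER_BASE_IMAGE}" \
  --build-arg LEAN_BASE_IMAGE="${LEAN_BASE_IMAGE}" \
  --tag vllm-tensorizer:slim-test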

Testing:

  • The image has been successfully built and tested on CoreWeave's H100 GPU cluster, including launching the vLLM inference server and performing basic text generation.

To reproduce testing (a readiness-check sketch follows the steps):

  1. Launch the container on an H100 node
  2. Launch the server: vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &
  3. Test inference requests: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What'\''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'
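
Since the model takes a little while to load after step 2, a readiness poll before step 3 avoids curl hitting a server that isn't up yet. A minimal sketch, assuming the standard vLLM OpenAI-compatible server health endpoint:

# Block until the vLLM server reports healthy; then the step-3 request can be sent.
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
  sleep 2
done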

@JustinPerlman JustinPerlman requested a review from Eta0 June 24, 2025 18:36
@JustinPerlman JustinPerlman self-assigned this Jun 24, 2025
@JustinPerlman JustinPerlman added the enhancement New feature or request label Jun 24, 2025
@Eta0
Collaborator

Eta0 commented Jun 24, 2025

The current description is misleading: this PR does not introduce a multi-stage build, because the build was already multi-stage. It just changes the existing multi-stage build to use different base images for the builder and final image stages.

Can you speak to how much this reduced the image size specifically, with some before & after numbers?

@JustinPerlman JustinPerlman changed the title from "feat(vllm-tensorizer): Implement Multi-Stage Build for Slimmer Inference Image" to "feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image" Jun 24, 2025
Eta0
Eta0 previously approved these changes Jun 24, 2025
@Eta0
Collaborator

Eta0 commented Jun 24, 2025

Can you speak to how much this reduced the image size specifically, with some before & after numbers?

To answer my own question, this:

IMAGE_TAGS=('ghcr.io/coreweave/ml-containers/vllm-tensorizer:'{'c87fc8f-b6553be1bc75f046b00046a4ad7576364d03c835','jp-testing-slim-vllm-image-f238cc3-b6553be1bc75f046b00046a4ad7576364d03c835'})
for IMAGE_TAG in "${IMAGE_TAGS[@]}"; do
  docker pull -q "${IMAGE_TAG}" && \
  docker inspect -f "{{ .Size }}" "${IMAGE_TAG}" \
  | awk '{{ print ": " $1 " bytes (" $1/(2**30) " GiB)\n" }}'
done

Shows that it was reduced substantially in uncompressed size:

ghcr.io/coreweave/ml-containers/vllm-tensorizer:c87fc8f-b6553be1bc75f046b00046a4ad7576364d03c835
: 32258861368 bytes (30.0434 GiB)

ghcr.io/coreweave/ml-containers/vllm-tensorizer:jp-testing-slim-vllm-image-f238cc3-b6553be1bc75f046b00046a4ad7576364d03c835
: 22064905303 bytes (20.5495 GiB)

So a reduction of about 9.5 GiB.

@JustinPerlman JustinPerlman requested a review from sangstar June 25, 2025 14:24
Contributor

@sangstar sangstar left a comment


Based on this part of your PR body:

To reproduce testing:

  • Launch the container on an H100 node
  • Launch the server: vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &
  • Test inference requests: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What'\''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'

Have you tried explicitly using pytest and directly running some of their test scripts? Did that not end up working? You could at least try pytest tests/openai/test_tensorizer_entrypoint.py. This is just my test script, which directly tests serialization, deserialization, and a completion all in one.
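
For reference, a rough sketch of what running that inside the new image could look like. The image tag, the location of the vLLM source checkout, and whether pytest ships in the runtime image are all assumptions here:

docker run --gpus all --rm -it <vllm-tensorizer image tag> bash
# inside the container:
cd /path/to/vllm-source            # placeholder -- wherever a checkout containing the tests lives
pip install pytest                 # only needed if pytest isn't already in the image
pytest tests/openai/test_tensorizer_entrypoint.py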

@@ -81,7 +83,7 @@ RUN --mount=type=bind,from=flashinfer-downloader,source=/git/flashinfer,target=/
 WORKDIR /wheels


-FROM ${BASE_IMAGE} AS base
+FROM ${LEAN_BASE_IMAGE} AS base
Contributor


Tell me if this is a dumb question, but if this is merging to main, and LEAN_BASE_IMAGE is replacing BASE_IMAGE, does this change anything about the not-lean builds? I'm not entirely sure how merging a branch to main affects the build pipeline, so this could be a misunderstanding on my part.

Collaborator


This PR is intended to replace all builds with slimmed-down builds, so yes, this replaces the existing (non-lean) builds.

Contributor

@sangstar sangstar Jun 25, 2025


At the risk of sounding pedantic, why don't we just call it BASE_IMAGE then and do away with the larger one? And I take it there's no benefit in using the larger one? Wouldn't an NCCL-equipped image be useful for distributed inference with vLLM?

Collaborator


I'm not sure what you mean. There are two base images because one is used for compiling vLLM and one is used for the final image artifact being produced. The one used for compilation will be larger because it includes the compiler and dev libraries.

Contributor

@sangstar sangstar Jun 25, 2025


Just wondering whether the NCCL libraries remain on the final image, since it'll be good to still be able to do vLLM distributed inference. I'm not solid on the specifics of exactly which deps are needed for distributed inference -- just wanting to make sure this final image can still do inference with model parallelism now that the nccl tag is no longer in play for the base image.
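
One way to sanity-check this would be to inspect the final image directly. A minimal sketch (the image tag is a placeholder, and this only confirms that NCCL shared libraries and PyTorch's NCCL binding are present, not full multi-GPU behavior):

docker run --rm <final vllm-tensorizer image tag> bash -c '
  # List any NCCL shared libraries baked into the image (pip-installed or system-wide):
  find / -name "libnccl*.so*" -not -path "/proc/*" 2>/dev/null
  # Print the NCCL version that PyTorch is linked against:
  python3 -c "import torch; print(torch.cuda.nccl.version())"
'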


@zachspar Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/15883288647
Image: ``

@JustinPerlman JustinPerlman requested review from Eta0 and sangstar June 25, 2025 21:53

@Eta0 Eta0 merged commit b42c222 into main Jun 26, 2025
2 checks passed
@Eta0 Eta0 deleted the jp/testing/slim-vllm-image branch June 26, 2025 17:56