feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image #101


Merged
merged 8 commits into main from jp/testing/slim-vllm-image on Jun 26, 2025

Conversation

JustinPerlman
Contributor

@JustinPerlman JustinPerlman commented Jun 24, 2025

This PR optimizes the existing multi-stage Docker build for the vllm-tensorizer image, significantly reducing its final size (by about 9.5 GiB) for more efficient deployment and faster pod spin-up.

Key Changes:

  • Multi-Stage Architecture: Separated build-time dependencies from runtime essentials by defining distinct base images for different build stages (an illustrative build sketch follows this list).
    • Builder Stages: Utilize a larger nccl variant (BUILDER_BASE_IMAGE) for compilation, ensuring all CUDA development tools (nvcc, libcublas-dev, etc.) are present.
    • Final Stage: Employ a barebones base variant (LEAN_BASE_IMAGE) for the runtime image, drastically reducing the final image footprint.
  • CUDA Version: Updated CUDA version from 12.9.0 to 12.9.1.
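
For illustration, a local build invocation might wire the two base images in roughly as follows. This is a sketch only: the build-context path, the output tag, and the placeholder image references are assumptions, and it presumes both variables are exposed as Dockerfile ARGs (as the diff suggests for LEAN_BASE_IMAGE).

# Placeholder values -- substitute real tags. The builder image is the larger
# nccl/devel variant (nvcc, libcublas-dev, ...); the lean image is the barebones
# runtime-only base variant.
BUILDER_BASE_IMAGE='<nccl-devel variant tag>'
LEAN_BASE_IMAGE='<barebones base variant tag>'

docker build ./vllm-tensorizer \
  --build-arg BUILDER_BASE_IMAGE="${BUILDER_BASE_IMAGE}" \
  --build-arg LEAN_BASE_IMAGE="${LEAN_BASE_IMAGE}" \
  --tag vllm-tensorizer:slim-test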

Testing:

  • The image has been successfully built and tested on CoreWeave's H100 GPU cluster, including launching the vLLM inference server and performing basic text generation.

To reproduce testing (a readiness-check sketch follows the steps):

  1. Launch the container on an H100 node
  2. Launch the server: vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &
  3. Test inference requests: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What'\''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'
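
Since the model takes a little while to load after step 2, a readiness poll before step 3 avoids curl hitting a server that isn't up yet. A minimal sketch, assuming the standard vLLM OpenAI-compatible server health endpoint:

# Block until the vLLM server reports healthy; then the step-3 request can be sent.
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
  sleep 2
done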

@JustinPerlman JustinPerlman requested a review from Eta0 June 24, 2025 18:36
@JustinPerlman JustinPerlman self-assigned this Jun 24, 2025
@JustinPerlman JustinPerlman added the enhancement New feature or request label Jun 24, 2025
@Eta0
Collaborator

Eta0 commented Jun 24, 2025

The current description is misleading: this PR does not introduce a multi-stage build, because the build was already multi-stage. It just changes the existing multi-stage build to use different base images for the builder and final image stages.

Can you speak to how much this reduced the image size specifically, with some before & after numbers?

@JustinPerlman JustinPerlman changed the title from "feat(vllm-tensorizer): Implement Multi-Stage Build for Slimmer Inference Image" to "feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image" Jun 24, 2025
Eta0
Eta0 previously approved these changes Jun 24, 2025
@Eta0
Collaborator

Eta0 commented Jun 24, 2025

Can you speak to how much this reduced the image size specifically, with some before & after numbers?

To answer my own question, this:

IMAGE_TAGS=('ghcr.io/coreweave/ml-containers/vllm-tensorizer:'{'c87fc8f-b6553be1bc75f046b00046a4ad7576364d03c835','jp-testing-slim-vllm-image-f238cc3-b6553be1bc75f046b00046a4ad7576364d03c835'})
for IMAGE_TAG in "${IMAGE_TAGS[@]}"; do
  docker pull -q "${IMAGE_TAG}" && \
  docker inspect -f "{{ .Size }}" "${IMAGE_TAG}" \
  | awk '{{ print ": " $1 " bytes (" $1/(2**30) " GiB)\n" }}'
done

Shows that it was reduced substantially in uncompressed size:

ghcr.io/coreweave/ml-containers/vllm-tensorizer:c87fc8f-b6553be1bc75f046b00046a4ad7576364d03c835
: 32258861368 bytes (30.0434 GiB)

ghcr.io/coreweave/ml-containers/vllm-tensorizer:jp-testing-slim-vllm-image-f238cc3-b6553be1bc75f046b00046a4ad7576364d03c835
: 22064905303 bytes (20.5495 GiB)

So a reduction of about 9.5 GiB.

@JustinPerlman JustinPerlman requested a review from sangstar June 25, 2025 14:24
Contributor

@sangstar sangstar left a comment


Based on this part of your PR body:

To reproduce testing:

  • Launch the container on an H100 node
  • Launch the server: vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &
  • Test inference requests: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What'\''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'

Have you tried explicitly using pytest and directly running some of their test scripts? Did that not end up working? You could at least try pytest tests/openai/test_tensorizer_entrypoint.py. This is just my test script, which directly tests serialization, deserialization, and a completion all in one.
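
For reference, a rough sketch of what running that inside the new image could look like. The image tag, the location of the vLLM source checkout, and whether pytest ships in the runtime image are all assumptions here:

docker run --gpus all --rm -it <vllm-tensorizer image tag> bash
# inside the container:
cd /path/to/vllm-source            # placeholder -- wherever a checkout containing the tests lives
pip install pytest                 # only needed if pytest isn't already in the image
pytest tests/openai/test_tensorizer_entrypoint.py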

@@ -81,7 +83,7 @@ RUN --mount=type=bind,from=flashinfer-downloader,source=/git/flashinfer,target=/
 WORKDIR /wheels


-FROM ${BASE_IMAGE} AS base
+FROM ${LEAN_BASE_IMAGE} AS base
Contributor


Tell me if this is a dumb question, but if this is merging to main, and LEAN_BASE_IMAGE is replacing BASE_IMAGE, does this change anything about the not-lean builds? I'm not entirely sure how merging a branch to main affects the build pipeline, so this could be a misunderstanding on my part.

Collaborator


This PR is intended to replace all builds with slimmed-down builds, so yes, this replaces the existing (non-lean) builds.

Contributor

@sangstar sangstar Jun 25, 2025


At the risk of sounding pedantic, why don't we just call it BASE_IMAGE then and do away with the larger one? And I take it there's no benefit in using the larger one? Wouldn't an NCCL-equipped image be useful for distributed inference with vLLM?

Collaborator


I'm not sure what you mean. There are two base images because one is used for compiling vLLM and one is used for the final image artifact being produced. The one used for compilation will be larger because it includes the compiler and dev libraries.

Contributor

@sangstar sangstar Jun 25, 2025


Just wondering whether the NCCL libraries remain on the final image, since it'll be good to still be able to do vLLM distributed inference. I'm not solid on the specifics of exactly which deps are needed for distributed inference -- just wanting to make sure this final image can still do inference with model parallelism now that the nccl tag is no longer in play for the base image.
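
One way to sanity-check this would be to inspect the final image directly. A minimal sketch (the image tag is a placeholder, and this only confirms that NCCL shared libraries and PyTorch's NCCL binding are present, not full multi-GPU behavior):

docker run --rm <final vllm-tensorizer image tag> bash -c '
  # List any NCCL shared libraries baked into the image (pip-installed or system-wide):
  find / -name "libnccl*.so*" -not -path "/proc/*" 2>/dev/null
  # Print the NCCL version that PyTorch is linked against:
  python3 -c "import torch; print(torch.cuda.nccl.version())"
'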


@zachspar Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/15883288647
Image: ``

@JustinPerlman JustinPerlman requested review from Eta0 and sangstar June 25, 2025 21:53

@Eta0 Eta0 merged commit b42c222 into main Jun 26, 2025
2 checks passed
@Eta0 Eta0 deleted the jp/testing/slim-vllm-image branch June 26, 2025 17:56