feat(vllm-tensorizer): Optimize Multi-Stage Build for Slimmer Inference Image #101
Conversation
Implement multi-stage build with distinct builder (dev tools) and lean base (runtime) images for smaller size. Increase `MAX_JOBS` to 32 for speedier CI.
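As a rough illustration of how the builder/runtime split could be exercised locally, a sketch only: the example base image tags and the build context path are placeholders, and it assumes `BUILDER_BASE_IMAGE`, `LEAN_BASE_IMAGE`, and `MAX_JOBS` are exposed as build args (only the names themselves come from this PR):

```bash
# Sketch: build the image while overriding the builder/runtime base images
# and the build parallelism. The image tags below are placeholders, not the
# defaults used in the Dockerfile.
docker build vllm-tensorizer/ \
  --build-arg BUILDER_BASE_IMAGE=<cuda-nccl-devel-image> \
  --build-arg LEAN_BASE_IMAGE=<cuda-base-runtime-image> \
  --build-arg MAX_JOBS=32 \
  -t vllm-tensorizer:local
```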
The current description is misleading. This PR does not introduce a multi-stage build, because the Dockerfile was already a multi-stage build; it changes the existing multi-stage build to use different base images for the builder and final image stages. Can you speak to how much this reduced the image size specifically, with some before & after numbers?
To answer my own question, this:

```bash
IMAGE_TAGS=('ghcr.io/coreweave/ml-containers/vllm-tensorizer:'{'c87fc8f-b6553be1bc75f046b00046a4ad7576364d03c835','jp-testing-slim-vllm-image-f238cc3-b6553be1bc75f046b00046a4ad7576364d03c835'})
for IMAGE_TAG in "${IMAGE_TAGS[@]}"; do
  docker pull -q "${IMAGE_TAG}" && \
    docker inspect -f "{{ .Size }}" "${IMAGE_TAG}" \
    | awk '{ print ": " $1 " bytes (" $1/(2**30) " GiB)\n" }'
done
```

shows that it was reduced substantially in uncompressed size:
So a reduction of about 9.5 GiB.
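A similar check for compressed (registry) size could look like the sketch below; this is an assumption-laden variant, not something from the PR: it requires `jq` and only works if the tag resolves to a single-architecture image manifest rather than a manifest list.

```bash
# Sketch: sum the compressed layer sizes reported by the registry manifest.
# Assumes the tag points at a single-arch image manifest (not a manifest list)
# and that jq is installed.
for IMAGE_TAG in "${IMAGE_TAGS[@]}"; do
  docker manifest inspect "${IMAGE_TAG}" \
    | jq -r '[.layers[].size] | add | "\(.) bytes (\(. / pow(2;30)) GiB) compressed"'
done
```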
Based on this part of your PR body:

> To reproduce testing:
> - Launch the container on an H100 node
> - Launch the server: `vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &`
> - Test inference requests: `curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'`
Have you tried explicitly using pytest and directly running some of their test scripts? Did that not end up working? You could at least try `pytest tests/openai/test_tensorizer_entrypoint.py`. This is just my test script, which directly tests serialization, deserialization, and a completion all in one.
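A rough sketch of how that could be run inside the image (hedged: the image tag is a placeholder, and it assumes the vLLM test sources are present in the container's working directory and that `pytest` can be installed on the fly; the test path is just the one mentioned above):

```bash
# Sketch: run the suggested tensorizer entrypoint test inside the slim image.
# The image tag is a placeholder; the location of the vLLM test tree inside
# the container is an assumption.
docker run --rm --gpus all \
  ghcr.io/coreweave/ml-containers/vllm-tensorizer:<tag> \
  bash -c 'pip install pytest && pytest tests/openai/test_tensorizer_entrypoint.py'
```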
vllm-tensorizer/Dockerfile (Outdated)

```diff
@@ -81,7 +83,7 @@ RUN --mount=type=bind,from=flashinfer-downloader,source=/git/flashinfer,target=/
 WORKDIR /wheels


-FROM ${BASE_IMAGE} AS base
+FROM ${LEAN_BASE_IMAGE} AS base
```
Tell me if this is a dumb question, but if this is merging to main, and `LEAN_BASE_IMAGE` is replacing `BASE_IMAGE`, does this change anything about the not-lean builds? I'm not entirely sure how merging a branch to main affects the build pipeline, so this could be a misunderstanding on my part.
This PR is intended to replace all builds with slimmed-down builds, so yes, this replaces the existing builds.
Risking sounding pedantic here, but why don't we just call it `BASE_IMAGE` then and do away with the larger one? And I take it there's zero benefit to using the larger one? Wouldn't an image that includes `nccl` be useful for distributed inference with vLLM?
I'm not sure what you mean. There are two base images because one is used for compiling vLLM and one is used for the final image artifact being produced. The one used for compilation will be larger because it includes the compiler and dev libraries.
Just wondering whether the NCCL libraries remain on the final image, since it'll be good to still be able to do vLLM distributed inference. I'm not solid on the specifics of exactly which deps are needed for distributed inference -- just wanting to make sure this final image can still do inference with model parallelism now that the `nccl` tag is no longer in play for the base image.
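One hedged way to sanity-check this on the built image (sketch only: the image tag is a placeholder, and these commands only verify that NCCL is discoverable, not that multi-GPU inference actually works):

```bash
# Sketch: check whether NCCL is still available inside the final image.
# The image tag is a placeholder; adjust to an actual build.
IMAGE=ghcr.io/coreweave/ml-containers/vllm-tensorizer:<tag>

# System-level NCCL shared libraries, if any:
docker run --rm "${IMAGE}" bash -c 'ldconfig -p | grep -i nccl || echo "no system libnccl found"'

# NCCL bundled with the PyTorch wheel (what torch.distributed typically uses):
docker run --rm "${IMAGE}" python -c 'import torch; print(torch.distributed.is_nccl_available(), torch.cuda.nccl.version())'
```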
@zachspar Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/15883288647
Co-authored-by: Eta <[email protected]>
@JustinPerlman Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/15906790295
This PR optimizes the existing multi-stage Docker build for the vllm-tensorizer image, significantly reducing its final size (by about 9.5 GiB) for more efficient deployment and faster pod spin-up.
Key Changes:
- Uses the `nccl` variant (`BUILDER_BASE_IMAGE`) for compilation, ensuring all CUDA development tools (`nvcc`, `libcublas-dev`, etc.) are present.
- Uses the `base` variant (`LEAN_BASE_IMAGE`) for the runtime image, drastically reducing the final image footprint.

Testing:
- Launched the `vLLM` inference server and performed basic text generation.

To reproduce testing:
- Launch the container on an H100 node
- Launch the server: `vllm serve facebook/opt-125m --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9 &`
- Test inference requests: `curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "facebook/opt-125m", "prompt": "Hello, my name is Mike Intrator. What''s your favorite color?", "max_tokens": 7, "temperature": 0.0 }'`