Fix same family model issue and add test #9

Open · wants to merge 63 commits into base: rebase_mm

Commits (63)
cb4e1c3
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
70dc14f
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
278be67
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d9b4b3f
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
a79b122
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
9a8bff0
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
3ed5e73
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
5fc5ce0
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
8c1e77f
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
98f47f2
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
c83919c
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
fa6ecb9
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
c82b432
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
3132aac
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
661175b
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
40bc242
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
16ee07f
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-redhat Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
41685cb
copy origin file to mm_ files
jikunshang Nov 22, 2024
883fac0
support multi models
jikunshang Nov 22, 2024
655b610
add model arg to original client
xuechendi Dec 3, 2024
e4ad1c5
fix GPU memory allocation for multi_models
xuechendi Dec 5, 2024
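The final four commits carry this PR's own change on top of the rebase: the origin files are copied into mm_ variants, multiple models are supported, the original client gains a model argument, and GPU memory allocation is fixed for multi-model runs. As a rough illustration only (the base URL and model names below are placeholders, not values taken from this PR), selecting between two co-hosted models from an OpenAI-compatible client might look like this:

```python
# Hypothetical client-side sketch: picking between two models hosted by one server.
# The base URL and model names are placeholders, not values taken from this PR.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for model_name in ("facebook/opt-125m", "facebook/opt-350m"):
    completion = client.completions.create(
        model=model_name,  # the model argument selects which loaded model serves the request
        prompt="Hello, my name is",
        max_tokens=16,
    )
    print(model_name, "->", completion.choices[0].text)
```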
6 changes: 3 additions & 3 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -21,7 +21,7 @@ steps:
       podSpec:
         priorityClassName: perf-benchmark
         containers:
-        - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        - image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
           command:
           - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
           resources:
@@ -51,7 +51,7 @@ steps:
       queue: H200
     plugins:
     - docker#v5.12.0:
-        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
         command:
         - bash
         - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
@@ -71,7 +71,7 @@ steps:
       queue: H100
     plugins:
     - docker#v5.12.0:
-        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
         command:
         - bash
         - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,6 @@
 #!/bin/sh
-TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
-URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
+TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
+URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
 
 TIMEOUT_SECONDS=10
 
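Only the head of wait-for-image.sh appears in this hunk; outside the diff, the script presumably keeps polling the manifest URL until the commit-tagged image is published or the timeout expires. A minimal Python sketch of that polling pattern, assuming the same token and manifest endpoints as above (an illustration, not the script itself):

```python
# Hedged sketch of the image-availability polling that wait-for-image.sh performs.
# The endpoints mirror the TOKEN/URL lines above; the retry loop itself is an assumption.
import json
import os
import time
import urllib.error
import urllib.request

repo = "q9t5s3a7/vllm-ci-postmerge-repo"
commit = os.environ["BUILDKITE_COMMIT"]
timeout_seconds = 10

token_url = (
    "https://public.ecr.aws/token?service=public.ecr.aws"
    f"&scope=repository:{repo}:pull"
)
manifest_url = f"https://public.ecr.aws/v2/{repo}/manifests/{commit}"

token = json.load(urllib.request.urlopen(token_url))["token"]

deadline = time.time() + timeout_seconds
while time.time() < deadline:
    request = urllib.request.Request(
        manifest_url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request):
            print("image is available")
            break
    except urllib.error.HTTPError:
        time.sleep(2)  # manifest not published yet; retry until the deadline
else:
    raise SystemExit("timed out waiting for the CI image")
```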
17 changes: 15 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
 steps:
 - label: "Build wheel - CUDA 12.1"
   agents:
-    queue: cpu_queue
+    queue: cpu_queue_postmerge
   commands:
   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
   - "mkdir artifacts"
@@ -18,11 +18,24 @@ steps:
 - label: "Build wheel - CUDA 11.8"
   # depends_on: block-build-cu118-wheel
   agents:
-    queue: cpu_queue
+    queue: cpu_queue_postmerge
   commands:
   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
   - "mkdir artifacts"
   - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
   - "bash .buildkite/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"
+
+- block: "Build release image"
+  depends_on: ~
+  key: block-release-image-build
+
+- label: "Build release image"
+  depends_on: block-release-image-build
+  agents:
+    queue: cpu_queue_postmerge
+  commands:
+  - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
+  - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
7 changes: 5 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -12,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
 trap remove_docker_container EXIT
 remove_docker_container
 
-# Run the image and launch offline inference
-docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
+# Run the image and test offline inference/tensor parallel
+docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
+python3 examples/offline_inference.py
+python3 examples/offline_inference_cli.py -tp 2
+'
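The reworked XPU job now exercises both the plain offline-inference example and a 2-way tensor-parallel run (`-tp 2`) inside the same container. A minimal sketch of what such a tensor-parallel run does, using vLLM's public `LLM` API; the model name is a placeholder and this is not the actual contents of `examples/offline_inference_cli.py`:

```python
# Hedged sketch: 2-way tensor-parallel offline inference with vLLM's public API.
# The model name is a placeholder; offline_inference_cli.py takes "-tp" from the CLI instead.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# tensor_parallel_size=2 mirrors the "-tp 2" flag in the CI command above.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```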
9 changes: 6 additions & 3 deletions .buildkite/test-pipeline.yaml
@@ -334,7 +334,6 @@ steps:
   commands:
   - pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
   - pytest -v -s models/embedding/language -m core_model
-  - pytest -v -s models/embedding/vision_language -m core_model
 
 - label: Language Models Test (Extended) # 50min
   optional: true
@@ -346,7 +345,6 @@ steps:
   commands:
   - pytest -v -s models/decoder_only/language -m 'not core_model and not quant_model'
   - pytest -v -s models/embedding/language -m 'not core_model'
-  - pytest -v -s models/embedding/vision_language -m 'not core_model'
 
 - label: Multi-Modal Models Test (Standard) # 26min
   #mirror_hardwares: [amd]
@@ -359,6 +357,7 @@ steps:
   commands:
   - pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
   - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
+  - pytest -v -s models/embedding/vision_language -m core_model
   - pytest -v -s models/encoder_decoder/language -m core_model
   - pytest -v -s models/encoder_decoder/vision_language -m core_model
 
@@ -376,6 +375,7 @@ steps:
   # https://github.com/huggingface/transformers/issues/34307
   - pytest -v -s models/decoder_only/vision_language/test_phi3v.py
   - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
+  - pytest -v -s models/embedding/vision_language -m 'not core_model'
   - pytest -v -s models/encoder_decoder/language -m 'not core_model'
   - pytest -v -s models/encoder_decoder/vision_language -m 'not core_model'
 
@@ -430,6 +430,9 @@ steps:
   - vllm/model_executor/models/
   - tests/distributed/
   - vllm/compilation
+  - vllm/worker/worker_base.py
+  - vllm/worker/worker.py
+  - vllm/worker/model_runner.py
   commands:
   - pytest -v -s ./compile/test_basic_correctness.py
   - pytest -v -s ./compile/test_wrapper.py
@@ -443,6 +446,7 @@ steps:
   - pip install -e ./plugins/vllm_add_dummy_model
   - pytest -v -s distributed/test_distributed_oot.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
+  - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/disagg_test.py
 
 - label: Multi-step Tests (4 GPUs) # 36min
   working_dir: "/vllm-workspace/tests"
@@ -477,7 +481,6 @@ steps:
 
 - label: LoRA TP Test (Distributed)
   num_gpus: 4
-  soft_fail: true
   source_file_dependencies:
   - vllm/lora
   - tests/lora
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -522,7 +522,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 5259c586c403a4e4d8bf69973c159b40cc346fb9
+    GIT_TAG 04325b6798bcc326c86fb35af62d05a9c8c8eceb
     GIT_PROGRESS TRUE
     # Don't share the vllm-flash-attn build between build types
     BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn