vLLM on A100s

#41
by fsaudm - opened

I'm running into a CUDA OOM issue. I'm trying to serve a bf16 version I found here on HF (opensourcerelease/DeepSeek-V3-bf16), since I only have A100s.

I have 9 nodes with 2 GPUs each (so 18 A100s of 80 GB, 1440 GB in total). I thought 685B params in half precision would be somewhere around 1350 GB, plus some overhead. Any thoughts? I am also trying to offload to CPU but still getting CUDA OOM...

vllm serve opensourcerelease/DeepSeek-V3-bf16 \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 5000 \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 540 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 9 \
  --trust-remote-code

Any thoughts? Please help :(
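
For anyone checking the arithmetic in the post above, here is a rough sketch in Python. It uses only the numbers quoted in the post (685B params, 18 A100s of 80 GB, `--gpu-memory-utilization 0.7`) and assumes 2 bytes per parameter for bf16. It ignores activations, KV cache, and how `--cpu-offload-gb` is actually applied, so treat it as an estimate rather than a description of what vLLM allocates.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
PARAMS = 685e9          # parameter count quoted in the post
BYTES_PER_PARAM = 2     # assumption: bf16 stores each parameter in 2 bytes
NUM_GPUS = 9 * 2        # 9 nodes x 2 GPUs each
GPU_MEM_GB = 80         # A100 80 GB
GPU_MEM_UTIL = 0.7      # value passed to --gpu-memory-utilization


def weights_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in decimal GB."""
    return params * bytes_per_param / 1e9


def usable_gpu_gb(num_gpus, gpu_mem_gb, util):
    """GPU memory the engine may use when capped at `util` of each GPU."""
    return num_gpus * gpu_mem_gb * util


weights = weights_gb(PARAMS, BYTES_PER_PARAM)
raw_total = NUM_GPUS * GPU_MEM_GB
usable = usable_gpu_gb(NUM_GPUS, GPU_MEM_GB, GPU_MEM_UTIL)
shortfall = weights - usable

print(f"weights only:        {weights:.0f} GB")
print(f"raw GPU memory:      {raw_total} GB")
print(f"usable at 0.7:       {usable:.0f} GB")
print(f"weights minus usable: {shortfall:.0f} GB")
```

Under these assumptions the weights alone come to about 1370 GB, slightly more than the 1350 GB estimate, and capping utilization at 0.7 leaves roughly 1008 GB of GPU memory. So the weights do not fit on the GPUs by themselves at that setting, and the run depends on CPU offload covering the rest.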

Following this thread. I also ran into a similar problem.
The official docs provide pipelines for H-series GPUs, but there seems to be no example for A100-series cards.
Also, the fp8 model requires 16 H20 cards, so I thought the bf16 model would require ~32 A100 cards (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/discussions/5).
Has anyone tried quantization?
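
One way to sanity-check the ~32 figure is to scale the 16-card fp8 setup by the bytes-per-parameter ratio. The sketch below does that. The 96 GB per H20 is an assumption (the thread does not state it), and `scaled_card_count` is a hypothetical helper, not anything from vLLM. Scaling the whole memory budget this way also doubles the share used for KV cache and activations, so it likely overestimates.

```python
import math

H20_CARDS = 16      # from the post: fp8 model on 16 H20 cards
H20_MEM_GB = 96     # assumption: memory per H20, not stated in the thread
A100_MEM_GB = 80    # A100 80 GB
FP8_BYTES = 1       # bytes per parameter in fp8
BF16_BYTES = 2      # bytes per parameter in bf16


def scaled_card_count(ref_cards, ref_mem_gb, ref_bytes, new_mem_gb, new_bytes):
    """Scale a reference setup's total memory by the bytes-per-parameter
    ratio, then divide by the new card size and round up."""
    ref_total_gb = ref_cards * ref_mem_gb
    new_total_gb = ref_total_gb * new_bytes / ref_bytes
    return math.ceil(new_total_gb / new_mem_gb)


# Same card size: doubling bytes per parameter simply doubles the card count.
same_size_cards = scaled_card_count(16, 80, FP8_BYTES, 80, BF16_BYTES)

# Under the 96 GB H20 assumption, 80 GB A100s need somewhat more.
a100_cards = scaled_card_count(
    H20_CARDS, H20_MEM_GB, FP8_BYTES, A100_MEM_GB, BF16_BYTES
)

print(same_size_cards, a100_cards)
```

Either way, the estimate lands in the low-to-high 30s, well above 18 cards.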

@HuggingLianWang really? 32 is like a lot of GPUs.. why 32??

There's no point in running it in bf16, since the model is trained in fp8

@xiaoqianWX A100s don't support fp8..

The ~1350 GB figure for 685B params covers only the weights. Once you account for activations, KV cache, and temporary buffers for kernel computation, budgeting roughly double that memory is generally a good idea. Although MoE and MLA/MGA can significantly reduce memory usage, 1440 GB may still fall short of the total needed.
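
As a rough illustration of the "double the weights" rule of thumb, the sketch below turns it into a GPU count. The factor of 2.0 is the heuristic from the comment above, not a measured value, and `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(params, bytes_per_param, gpu_mem_gb, overhead_factor=2.0):
    """Return (memory budget in GB, number of GPUs) for a model, budgeting
    `overhead_factor` times the weight memory for activations, KV cache,
    and temporary buffers."""
    weights_gb = params * bytes_per_param / 1e9
    budget_gb = weights_gb * overhead_factor
    return budget_gb, math.ceil(budget_gb / gpu_mem_gb)


# 685B params in bf16 (2 bytes each) on 80 GB A100s.
budget_gb, a100_count = gpus_needed(685e9, 2, 80)

print(f"budget: {budget_gb:.0f} GB, A100 80 GB cards: {a100_count}")
```

With these inputs the budget is about 2740 GB, or 35 A100s of 80 GB, which is close to the ~32 cards mentioned earlier in the thread and well above the 1440 GB available here.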
