Integrate Flash-Decoding into engine #181
This is ready for review. More benchmarks will be done after it is merged.
LGTM, thanks @masahi!
A follow-up to #177
As I commented in #177 (comment), this PR introduces a breaking change to the build flow (`--use-vllm-attention` is removed). So I recommend merging this PR after other high-priority PRs like #82 are merged. Marked as draft to avoid an early merge.

After this PR, replace `--use-vllm-attention` in your build command with `--paged-kv-cache-type vllm` or `--paged-kv-cache-type flash-decoding`. You also need the latest `for-mlc-serve-jan12`.
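For illustration, a minimal before/after sketch of the flag change. The script name and model path below are placeholders for whatever your existing build command uses; only `--paged-kv-cache-type` and its two values come from this PR.

```bash
# Hypothetical build invocation; everything except --paged-kv-cache-type is a placeholder.

# Before this PR:
#   python build.py --model dist/models/llama-2-13b-chat-hf --use-vllm-attention ...

# After this PR, choose the paged KV cache backend explicitly:
python build.py --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type vllm
# or, to use Flash-Decoding:
python build.py --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type flash-decoding
```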
Preliminary benchmark results
benchmark_throughput.py
Using `--max-num-batched-tokens 4096 --greedy-sampling-ratio 1`
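As a rough sketch of how the comparison below is produced (not the exact commands): build one artifact per KV cache backend, then run the same benchmark against each. Only `benchmark_throughput.py` and the two flags above come from the PR; model-selection arguments are placeholders for your local setup.

```bash
# Sketch only: model/artifact selection flags are placeholders; the two flags below
# are the ones quoted in the PR description.
python benchmark_throughput.py \
    --max-num-batched-tokens 4096 \
    --greedy-sampling-ratio 1      # all requests use greedy sampling
```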
llama 7B fp16
FD (block size 256): Engine Throughput: 43.52 requests/s, 15714.20 tokens/s (437 blocks)
FD (block size 128): Engine Throughput: 44.29 requests/s, 15991.46 tokens/s (874 blocks)
vLLM: Engine Throughput: 42.22 requests/s, 15245.43 tokens/s
Mistral 7B fp16
FD (block size 256): Engine Throughput: 46.68 requests/s, 17859.27 tokens/s (1766 blocks)
FD (block size 128): Engine Throughput: 48.80 requests/s, 18673.48 tokens/s (3533 blocks)
vLLM: Engine Throughput: 52.95 requests/s, 20259.87 tokens/s
llama 13b fp16
FD (block size 256): Engine Throughput: 24.02 requests/s, 8674.84 tokens/s (210 blocks)
FD (block size 128): Engine Throughput: 23.73 requests/s, 8569.77 tokens/s (421 blocks)
vLLM: Engine Throughput: 22.73 requests/s, 8206.14 tokens/s
llama 70b fp16, 2gpu
FD (block size 256): Engine Throughput: 5.09 requests/s, 1839.43 tokens/s (59 blocks)
FD (block size 128): Engine Throughput: 5.70 requests/s, 2057.70 tokens/s (113 blocks)
vLLM: Engine Throughput: 6.01 requests/s, 2168.58 tokens/s (909 blocks)
Mixtral fp16, 2gpu
FD (block size 256): Engine Throughput: 26.84 requests/s, 10270.41 tokens/s (1637 blocks)
FD (block size 128): Engine Throughput: 25.16 requests/s, 9625.92 tokens/s (3274 blocks)
vLLM: Engine Throughput: 26.27 requests/s, 10052.30 tokens/s
llmperf
Using llama 13b fp16
MLC_API_BASE="http://localhost:8000/v1" MLC_API_KEY="xxxxx" python llmperf.py -r 300 -c 30 --max-tokens 150 -f mlc -m dist/models/llama-2-13b-chat-hf
FD
vLLM