Integrate Flash-Decoding into engine #181
Conversation
This is ready for review. More benchmarks will be done after it is merged. You should update to the latest `for-mlc-serve-jan12`.
LGTM, thanks @masahi!
A follow-up to #177
As I commented in #177 (comment), this PR introduces a breaking change to the build flow (`--use-vllm-attention` is removed). So I recommend merging this PR after other high-priority PRs like #82 are merged. Marked as draft to avoid an early merge.

After this PR, replace `--use-vllm-attention` in your build command with `--paged-kv-cache-type vllm` or `--paged-kv-cache-type flash-decoding`.
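For concreteness, a minimal before/after sketch of the flag swap; the `mlc_llm.build` entry point and the model name are illustrative placeholders, not taken from this PR:

```
# Before this PR (the flag below is removed and no longer accepted):
#   python3 -m mlc_llm.build --model llama-2-7b-chat-hf --use-vllm-attention

# After this PR, select the paged KV cache backend explicitly:
python3 -m mlc_llm.build --model llama-2-7b-chat-hf --paged-kv-cache-type flash-decoding

# or, to keep the previous vLLM-style cache:
python3 -m mlc_llm.build --model llama-2-7b-chat-hf --paged-kv-cache-type vllm
```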
You also need the latest `for-mlc-serve-jan12`.

**Preliminary benchmark results**
**benchmark_throughput.py**

Using `--max-num-batched-tokens 4096 --greedy-sampling-ratio 1`.
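For reference, a sketch of how such a run might be launched; only the two flags above come from this PR, while the script path, the `--local-id` flag, and the model name are assumptions:

```
# Hypothetical invocation; only --max-num-batched-tokens and
# --greedy-sampling-ratio are taken from this PR.
python3 serve/benchmarks/benchmark_throughput.py \
    --local-id llama-2-7b-chat-hf-q0f16 \
    --max-num-batched-tokens 4096 \
    --greedy-sampling-ratio 1
```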
**llama 7B fp16**

- FD (block size 256): Engine Throughput: 43.52 requests/s, 15714.20 tokens/s (437 blocks)
- FD (block size 128): Engine Throughput: 44.29 requests/s, 15991.46 tokens/s (874 blocks)
- vLLM: Engine Throughput: 42.22 requests/s, 15245.43 tokens/s

**Mistral 7B fp16**

- FD (block size 256): Engine Throughput: 46.68 requests/s, 17859.27 tokens/s (1766 blocks)
- FD (block size 128): Engine Throughput: 48.80 requests/s, 18673.48 tokens/s (3533 blocks)
- vLLM: Engine Throughput: 52.95 requests/s, 20259.87 tokens/s

**llama 13b fp16**

- FD (block size 256): Engine Throughput: 24.02 requests/s, 8674.84 tokens/s (210 blocks)
- FD (block size 128): Engine Throughput: 23.73 requests/s, 8569.77 tokens/s (421 blocks)
- vLLM: Engine Throughput: 22.73 requests/s, 8206.14 tokens/s

**llama 70b fp16, 2gpu**

- FD (block size 256): Engine Throughput: 5.09 requests/s, 1839.43 tokens/s (59 blocks)
- FD (block size 128): Engine Throughput: 5.70 requests/s, 2057.70 tokens/s (113 blocks)
- vLLM: Engine Throughput: 6.01 requests/s, 2168.58 tokens/s (909 blocks)

**Mixtral fp16, 2gpu**

- FD (block size 256): Engine Throughput: 26.84 requests/s, 10270.41 tokens/s (1637 blocks)
- FD (block size 128): Engine Throughput: 25.16 requests/s, 9625.92 tokens/s (3274 blocks)
- vLLM: Engine Throughput: 26.27 requests/s, 10052.30 tokens/s
**llmperf**

Using llama 13b fp16:

```
MLC_API_BASE="http://localhost:8000/v1" MLC_API_KEY="xxxxx" python llmperf.py -r 300 -c 30 --max-tokens 150 -f mlc -m dist/models/llama-2-13b-chat-hf
```

**FD**
**vLLM**