Integrate Flash-Decoding into engine #181
This is ready for review. More benchmarks will be done after it is merged.
LGTM, thanks @masahi!
A follow-up to #177
As I commented in #177 (comment), this PR introduces a breaking change to the build flow (`--use-vllm-attention` is removed). So I recommend merging this PR after other high-priority PRs like #82 are merged. Marked as draft to avoid an early merge.

After this PR, replace `--use-vllm-attention` in your build command with `--paged-kv-cache-type vllm` or `--paged-kv-cache-type flash-decoding`. You also need the latest `for-mlc-serve-jan12`.
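For illustration, a minimal before/after sketch of the flag change. The script name and model path below are placeholders for whatever your existing build command uses; only `--paged-kv-cache-type` and its two values come from this PR.

```bash
# Hypothetical build invocation; everything except --paged-kv-cache-type is a placeholder.

# Before this PR:
#   python build.py --model dist/models/llama-2-13b-chat-hf --use-vllm-attention ...

# After this PR, choose the paged KV cache backend explicitly:
python build.py --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type vllm
# or, to use Flash-Decoding:
python build.py --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type flash-decoding
```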
Preliminary benchmark results
benchmark_throughput.py
Using `--max-num-batched-tokens 4096 --greedy-sampling-ratio 1`
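As a rough sketch of how the comparison below is produced (not the exact commands): build one artifact per KV cache backend, then run the same benchmark against each. Only `benchmark_throughput.py` and the two flags above come from the PR; model-selection arguments are placeholders for your local setup.

```bash
# Sketch only: model/artifact selection flags are placeholders; the two flags below
# are the ones quoted in the PR description.
python benchmark_throughput.py \
    --max-num-batched-tokens 4096 \
    --greedy-sampling-ratio 1      # all requests use greedy sampling
```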
llama 7B fp16
FD (block size 256): Engine Throughput: 43.52 requests/s, 15714.20 tokens/s (437 blocks)
FD (block size 128): Engine Throughput: 44.29 requests/s, 15991.46 tokens/s (874 blocks)
vLLM: Engine Throughput: 42.22 requests/s, 15245.43 tokens/s
Mistral 7B fp16
FD (block size 256): Engine Throughput: 46.68 requests/s, 17859.27 tokens/s (1766 blocks)
FD (block size 128): Engine Throughput: 48.80 requests/s, 18673.48 tokens/s (3533 blocks)
vLLM: Engine Throughput: 52.95 requests/s, 20259.87 tokens/s
llama 13b fp16
FD (block size 256): Engine Throughput: 24.02 requests/s, 8674.84 tokens/s (210 blocks)
FD (block size 128): Engine Throughput: 23.73 requests/s, 8569.77 tokens/s (421 blocks)
vLLM: Engine Throughput: 22.73 requests/s, 8206.14 tokens/s
llama 70b fp16, 2gpu
FD (block size 256): Engine Throughput: 5.09 requests/s, 1839.43 tokens/s (59 blocks)
FD (block size 128): Engine Throughput: 5.70 requests/s, 2057.70 tokens/s (113 blocks)
vLLM: Engine Throughput: 6.01 requests/s, 2168.58 tokens/s (909 blocks)
Mixtral fp16, 2gpu
FD (block size 256): Engine Throughput: 26.84 requests/s, 10270.41 tokens/s (1637 blocks)
FD (block size 128): Engine Throughput: 25.16 requests/s, 9625.92 tokens/s (3274 blocks)
vLLM: Engine Throughput: 26.27 requests/s, 10052.30 tokens/s
llmperf
Using llama 13b fp16
MLC_API_BASE="http://localhost:8000/v1" MLC_API_KEY="xxxxx" python llmperf.py -r 300 -c 30 --max-tokens 150 -f mlc -m dist/models/llama-2-13b-chat-hf
FD
vLLM