Fix GPU-CPU tensor manipulation. Small performance boost #178

Merged
masahi merged 7 commits into batch-serving from vc/mask_gpu on Jan 30, 2024

Conversation

vvchernov commented Jan 30, 2024

There are two fixes:

  1. The greedy and random masks are created on the CPU but are used on both the CPU and GPU sides; in particular, logits are selected on the GPU through CPU masks (a potential performance reduction). After the fix there are two pairs of masks, one on the CPU and one on the logits device, and the CPU masks are created only when they are needed.
  2. Replace `res_random = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]` with `res_random = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()`. As I understand it, the first form copies the full tensor from GPU to CPU and then copies the slice from the old numpy array into a new one, while the second form takes a slice view (no memory copy) and copies only the sliced tensor, not the full one, from GPU to CPU; a minimal sketch follows below.
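
To illustrate fix 2, here is a minimal, self-contained sketch (the shapes and values of `probs` are made up for illustration; in the serving path `probs` already lives on the GPU):

```python
import torch

# Hypothetical probabilities for 4 sequences over an 8-token vocabulary,
# normalized so each row sums to 1, kept on the GPU as in the serving path.
probs = torch.rand(4, 8, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)

# Before: copy the full (num_seq, 1) result to the host, then slice in numpy.
res_before = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]

# After: slice on the device first (a view, no extra copy), then move only
# the sliced tensor to the host.
res_after = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()
```

As the review below points out, the multinomial output has shape `(num_seq, 1)`, so the amount of data moved from GPU to CPU is essentially the same either way; any difference would come from avoiding the extra host-side numpy copy.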

Results of benchmark:

MISTRAL (`python serve/benchmarks/benchmark_throughput.py --local-id mistral-7b-instruct-q0f16-presharded-1gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --seed 0 --num-prompts 1000`)

batch-serving:
Engine Throughput: 49.18 requests/s, 18816.81 tokens/s
Engine Throughput: 48.36 requests/s, 18503.69 tokens/s
Engine Throughput: 48.79 requests/s, 18668.08 tokens/s
AVERAGE: 48.78 requests/s, 18662.86 tokens/s

Fix 1:
Engine Throughput: 48.92 requests/s, 18717.29 tokens/s
Engine Throughput: 49.37 requests/s, 18891.61 tokens/s
Engine Throughput: 48.73 requests/s, 18647.23 tokens/s
AVERAGE: 49.01 requests/s, 18752.04 tokens/s

Fix 1 + Fix 2:
Engine Throughput: 49.23 requests/s, 18837.64 tokens/s
Engine Throughput: 49.43 requests/s, 18911.63 tokens/s
Engine Throughput: 50.00 requests/s, 19130.03 tokens/s
AVERAGE: 49.55 requests/s, 18959.77 tokens/s

MIXTRAL (`python serve/benchmarks/benchmark_throughput.py --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --seed 0 --num-prompts 1000`)

batch-serving:
Engine Throughput: 23.71 requests/s, 9073.14 tokens/s
Engine Throughput: 23.62 requests/s, 9038.81 tokens/s
Engine Throughput: 23.44 requests/s, 8970.07 tokens/s
AVERAGE: 23.59 requests/s, 9027.34 tokens/s

Fix 1 + Fix 2:
Engine Throughput: 23.69 requests/s, 9064.93 tokens/s
Engine Throughput: 23.64 requests/s, 9045.66 tokens/s
Engine Throughput: 23.58 requests/s, 9022.22 tokens/s
AVERAGE: 23.64 requests/s, 9044.27 tokens/s

Note: the run-to-run fluctuation is fairly large (~1-2%), so several runs were performed and averaged; the measurements could probably be made more rigorous.

vvchernov (Author):

cc @masahi

masahi (Member) commented Jan 30, 2024:

I think the second optimization is not really helping, since the output of multinomial has shape (num_tokens, 1), so the number of elements copied is the same. Can you double-check your performance result? It would be surprising if the second optimization mattered.
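
A quick check of that shape claim (a hypothetical uniform `probs` for illustration):

```python
import torch

probs = torch.full((5, 8), 0.125)        # 5 sequences, 8-token vocabulary
out = torch.multinomial(probs, 1, True)
print(out.shape)         # torch.Size([5, 1])
print(out[:, 0].shape)   # torch.Size([5])
```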

masahi (Member) left a review comment:

Good find!

```diff
@@ -75,13 +75,13 @@ def _is_safe_to_sample(prob_like):
     logits = torch.from_dlpack(logits)
     num_seq = len(sampling_params)

-    mask_random = torch.tensor(
+    mask_random_dvc = torch.tensor(
```
masahi (Member):

`_dvc` is a strange suffix. Just use `_gpu`, or simply `mask_random` (no suffix).

vvchernov (Author):

Hello @masahi! I guess it could potentially run on the CPU in the future. I thought about renaming it to `_gpu`, but given that possibility the name could confuse somebody later. What do you think?

masahi (Member):

The logits here are supposed to always be on the GPU (you can add an assert), so there is no confusion.
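
A minimal sketch of such an assert (its exact wording and placement are illustrative):

```python
# Sketch: fail fast if logits ever arrive on the CPU, since the device-side
# masks are built on logits.device under the assumption that it is a GPU.
assert logits.device.type == "cuda", "logits are expected to live on the GPU"
```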

```python
    [p.sampling_type == SamplingType.RANDOM for p in sampling_params],
    dtype=torch.bool
)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)
```
masahi (Member):

When you create a GPU mask, PyTorch first creates a CPU mask and does a cudaMemcpy under the hood. So you can create the CPU mask once and build the GPU mask by copying it explicitly, which avoids creating the CPU mask twice.
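
A minimal sketch of that suggestion, reusing the names from this PR (the surrounding sampler code, `sampling_params`, and `SamplingType` are taken as given):

```python
# Build the boolean masks once on the CPU.
mask_random_cpu = torch.tensor(
    [p.sampling_type == SamplingType.RANDOM for p in sampling_params],
    dtype=torch.bool,
)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)

# Move each mask to the logits device with one explicit copy, rather than
# constructing a device tensor directly from the Python list (which would
# materialize a temporary CPU tensor and copy it anyway).
mask_random_dvc = mask_random_cpu.to(logits.device)
mask_greedy_dvc = mask_greedy_cpu.to(logits.device)
```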

vvchernov (Author):

Please recheck.

```diff
     )
-    mask_greedy = torch.logical_not(mask_random)
+    mask_greedy_cpu = torch.logical_not(mask_random_cpu)
     if logits.device == torch.device("cpu"):
```
masahi (Member):

Don't need this case.

vvchernov (Author):

It costs nothing, and it keeps support for running the topology on the CPU. As far as I know, @elvin-n ran single-batch on the CPU after small fixes.

masahi merged commit e1bd866 into octoml:batch-serving Jan 30, 2024
1 check passed
vvchernov deleted the vc/mask_gpu branch January 30, 2024 11:19
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024