Fix GPU-CPU tensor manipulation. Small performance boost #178

Merged
masahi merged 7 commits into batch-serving from vc/mask_gpu on Jan 30, 2024

Conversation

vvchernov commented Jan 30, 2024

There are two fixes:

  1. The greedy and random masks are created on the CPU but are used on both the CPU and GPU sides; in particular, logits are selected on the GPU through CPU masks (a potential performance reduction). After the fix there are two pairs of masks, one on the CPU and one on the logits device, and the CPU masks are created only when they are needed.
  2. Replace `res_random = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]` with `res_random = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()`. As I understand it, the first form copies the full tensor from GPU to CPU and then copies the slice from the old numpy array into a new one, while the second form takes a slice view (no memory copy) and copies only the sliced tensor, not the full one, from GPU to CPU; a minimal sketch follows below.
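
To illustrate fix 2, here is a minimal, self-contained sketch (the shapes and values of `probs` are made up for illustration; in the serving path `probs` already lives on the GPU):

```python
import torch

# Hypothetical probabilities for 4 sequences over an 8-token vocabulary,
# normalized so each row sums to 1, kept on the GPU as in the serving path.
probs = torch.rand(4, 8, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)

# Before: copy the full (num_seq, 1) result to the host, then slice in numpy.
res_before = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]

# After: slice on the device first (a view, no extra copy), then move only
# the sliced tensor to the host.
res_after = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()
```

As the review below points out, the multinomial output has shape `(num_seq, 1)`, so the amount of data moved from GPU to CPU is essentially the same either way; any difference would come from avoiding the extra host-side numpy copy.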

Results of benchmark:

MISTRAL (`python serve/benchmarks/benchmark_throughput.py --local-id mistral-7b-instruct-q0f16-presharded-1gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --seed 0 --num-prompts 1000`)

batch-serving:
Engine Throughput: 49.18 requests/s, 18816.81 tokens/s
Engine Throughput: 48.36 requests/s, 18503.69 tokens/s
Engine Throughput: 48.79 requests/s, 18668.08 tokens/s
AVERAGE: 48.78 requests/s, 18662.86 tokens/s

Fix 1:
Engine Throughput: 48.92 requests/s, 18717.29 tokens/s
Engine Throughput: 49.37 requests/s, 18891.61 tokens/s
Engine Throughput: 48.73 requests/s, 18647.23 tokens/s
AVERAGE: 49.01 requests/s, 18752.04 tokens/s

Fix 1 + Fix 2:
Engine Throughput: 49.23 requests/s, 18837.64 tokens/s
Engine Throughput: 49.43 requests/s, 18911.63 tokens/s
Engine Throughput: 50.00 requests/s, 19130.03 tokens/s
AVERAGE: 49.55 requests/s, 18959.77 tokens/s

MIXTRAL (`python serve/benchmarks/benchmark_throughput.py --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --seed 0 --num-prompts 1000`)

batch-serving:
Engine Throughput: 23.71 requests/s, 9073.14 tokens/s
Engine Throughput: 23.62 requests/s, 9038.81 tokens/s
Engine Throughput: 23.44 requests/s, 8970.07 tokens/s
AVERAGE: 23.59 requests/s, 9027.34 tokens/s

Fix 1 + Fix 2:
Engine Throughput: 23.69 requests/s, 9064.93 tokens/s
Engine Throughput: 23.64 requests/s, 9045.66 tokens/s
Engine Throughput: 23.58 requests/s, 9022.22 tokens/s
AVERAGE: 23.64 requests/s, 9044.27 tokens/s

Note: the run-to-run fluctuation is fairly large (~1-2%), so several runs were performed and averaged; the measurements could probably be made more rigorous.

vvchernov (Author):

cc @masahi

masahi (Member) commented Jan 30, 2024:

I think the second optimization is not really helping, since the output of multinomial has shape (num_tokens, 1), so the number of elements copied is the same. Can you double-check your performance result? It would be surprising if the second optimization mattered.
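
A quick check of that shape claim (a hypothetical uniform `probs` for illustration):

```python
import torch

probs = torch.full((5, 8), 0.125)        # 5 sequences, 8-token vocabulary
out = torch.multinomial(probs, 1, True)
print(out.shape)         # torch.Size([5, 1])
print(out[:, 0].shape)   # torch.Size([5])
```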

masahi (Member) left a review comment:

Good find!

```diff
@@ -75,13 +75,13 @@ def _is_safe_to_sample(prob_like):
     logits = torch.from_dlpack(logits)
     num_seq = len(sampling_params)

-    mask_random = torch.tensor(
+    mask_random_dvc = torch.tensor(
```
masahi (Member):

`_dvc` is a strange suffix. Just use `_gpu`, or simply `mask_random` (no suffix).

vvchernov (Author):

Hello @masahi! I guess it could potentially run on the CPU in the future. I thought about renaming it to `_gpu`, but given that possibility the name could confuse somebody later. What do you think?

masahi (Member):

The logits here are supposed to always be on the GPU (you can add an assert), so there is no confusion.
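
A minimal sketch of such an assert (its exact wording and placement are illustrative):

```python
# Sketch: fail fast if logits ever arrive on the CPU, since the device-side
# masks are built on logits.device under the assumption that it is a GPU.
assert logits.device.type == "cuda", "logits are expected to live on the GPU"
```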

```python
    [p.sampling_type == SamplingType.RANDOM for p in sampling_params],
    dtype=torch.bool
)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)
```
masahi (Member):

When you create a GPU mask, PyTorch first creates a CPU mask and does a cudaMemcpy under the hood. So you can create the CPU mask once and build the GPU mask by copying it explicitly, which avoids creating the CPU mask twice.
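
A minimal sketch of that suggestion, reusing the names from this PR (the surrounding sampler code, `sampling_params`, and `SamplingType` are taken as given):

```python
# Build the boolean masks once on the CPU.
mask_random_cpu = torch.tensor(
    [p.sampling_type == SamplingType.RANDOM for p in sampling_params],
    dtype=torch.bool,
)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)

# Move each mask to the logits device with one explicit copy, rather than
# constructing a device tensor directly from the Python list (which would
# materialize a temporary CPU tensor and copy it anyway).
mask_random_dvc = mask_random_cpu.to(logits.device)
mask_greedy_dvc = mask_greedy_cpu.to(logits.device)
```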

vvchernov (Author):

Please recheck.

```diff
     )
-    mask_greedy = torch.logical_not(mask_random)
+    mask_greedy_cpu = torch.logical_not(mask_random_cpu)
     if logits.device == torch.device("cpu"):
```
masahi (Member):

Don't need this case.

vvchernov (Author):

It costs nothing, and it keeps support for running the topology on the CPU. As far as I know, @elvin-n ran single-batch on the CPU after small fixes.

masahi merged commit e1bd866 into octoml:batch-serving Jan 30, 2024
1 check passed
vvchernov deleted the vc/mask_gpu branch January 30, 2024 11:19
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024