Fix GPU-CPU tensor manipulation. Small performance boost #178
Conversation
cc @masahi
I think the second optimization is not really helping since the output of
Good find!
@@ -75,13 +75,13 @@ def _is_safe_to_sample(prob_like):
 logits = torch.from_dlpack(logits)
 num_seq = len(sampling_params)

-mask_random = torch.tensor(
+mask_random_dvc = torch.tensor(
`dvc` is a strange postfix. Just use `gpu`, or simply `mask_random` (no postfix).
Hello @masahi! I guess it could potentially work on CPU in the future. I thought about renaming it to `_gpu`, but that could confuse somebody later if that guess turns out to be right. What do you think?
`logits` here are supposed to always be on GPU (you can add an assert), so there is no confusion.
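A minimal sketch of such an assert, using a hypothetical helper name and assuming `logits` is expected to be a CUDA tensor at this point:

```python
import torch

def check_logits_on_gpu(logits: torch.Tensor) -> None:
    # Hypothetical helper: the sampling path assumes logits always arrive on GPU,
    # so fail loudly if a CPU (or other) tensor slips through.
    assert logits.is_cuda, f"expected logits on GPU, got {logits.device}"
```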
[p.sampling_type == SamplingType.RANDOM for p in sampling_params],
dtype=torch.bool
)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)
When you create a GPU mask, PyTorch will first create a CPU mask and do a `cudaMemcpy` under the hood. So you can create the CPU mask once and build the GPU mask by copying the CPU mask explicitly, which avoids creating the CPU mask twice.
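A hedged sketch of that pattern, with made-up placeholder data; the mask names follow the diff above, and `device` stands in for whatever device `logits` lives on:

```python
import torch

# Placeholder inputs for illustration only.
is_random = [True, False, True]  # stands in for the per-sequence sampling-type check
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Build the boolean masks once on the CPU...
mask_random_cpu = torch.tensor(is_random, dtype=torch.bool)
mask_greedy_cpu = torch.logical_not(mask_random_cpu)

# ...then copy explicitly to the target device. Constructing the mask directly with
# device="cuda" would still materialize a host-side buffer and memcpy it anyway.
mask_random_dvc = mask_random_cpu.to(device)
mask_greedy_dvc = torch.logical_not(mask_random_dvc)
```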
Please recheck.
 )
-mask_greedy = torch.logical_not(mask_random)
+mask_greedy_cpu = torch.logical_not(mask_random_cpu)
 if logits.device == torch.device("cpu"):
Don't need this case.
It costs nothing, and it lets us keep supporting the case where the topology runs on CPU. As far as I know, @elvin-n got single-batch running on CPU after some small fixes.
There are two fixes:
1. Previously the masks and the `logits` coming from GPU could end up on different devices (potential performance reduction). After the fix there are two pairs of masks, on CPU and on the `logits` device side; moreover, the CPU masks are created only if they are needed.
2. Replace `res_random = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]` with `res_random = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()`. As I understand it, in the first case we copy the full tensor from GPU to CPU and then copy the slice from the old NumPy array into a new one, while in the second case we take a slice view (no memory copy) and copy only the sliced tensor, not the full one, from GPU to CPU. (A sketch contrasting the two forms follows the benchmark note below.)

Results of the benchmark:
Note: the run-to-run fluctuation is fairly large (~1-2%), so several runs were performed and averaged; the measurements should probably be taken more carefully.
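For reference, a hedged sketch contrasting the two expressions from fix 2; `probs` and its shape are made-up stand-ins, and, as the reviewer noted, it is debatable how much the reordering helps, since the multinomial output here is only a (num_seq, 1) tensor to begin with:

```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
probs = torch.rand(8, 32000, device=device)  # stand-in for per-sequence vocab probabilities

# Before: copy the full (num_seq, 1) multinomial result to the host, convert to
# NumPy, then take the first column.
res_before = torch.multinomial(probs, 1, True).cpu().numpy()[:, 0]

# After: slice on the device first (a view, no extra allocation there), then move
# only that column to the host and convert it.
res_after = torch.multinomial(probs, 1, True)[:, 0].cpu().numpy()

print(res_before.shape, res_after.shape)  # both (8,)
```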