[Ascend] feat: update do_bench_npu #669
    buffer = runtime.driver.active.get_empty_cache_for_benchmark()
    do_bench_clear(buffer.data_ptr(), buffer.numel(), stream)
    buffer = buffer.float()  # to avoid type cast
    buffer.sum()
Usually we call a Triton kernel through a CPU-side "wrapper" function, and that wrapper may well call "torch.sum" itself. In that case the measured time will be wrong, because the "torch.sum" time is also captured by msprof.
In the do_bench_npu function, buffer.sum() is used to forcefully clear the NPU’s L2 cache before each iteration of the benchmark (when clear_l2_cache=True).
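The point under discussion is where the flush sits relative to the measured region. A minimal host-side sketch, assuming a hypothetical `bench_with_flush` helper, a `bytearray` standing in for the cache-sized buffer, and `time.perf_counter` in place of msprof (this is not the `do_bench_npu` code):

```python
import time

def bench_with_flush(fn, n_iters=5, flush_bytes=1 << 20, clear_l2_cache=True):
    """Time fn() n_iters times, reading a large buffer before each call.

    Hypothetical illustration: sum(buffer) stands in for buffer.sum(),
    and the timer starts only after the flush has finished.
    """
    buffer = bytearray(flush_bytes)  # stand-in for the cache-sized buffer
    flushes = 0
    times = []
    for _ in range(n_iters):
        if clear_l2_cache:
            sum(buffer)  # read the whole buffer (the flush stand-in)
            flushes += 1
        start = time.perf_counter()  # timed region begins after the flush
        fn()
        times.append(time.perf_counter() - start)
    return times, flushes

times, flushes = bench_with_flush(lambda: sum(range(1000)))
```

A profiler-based measurement cannot rely on placement alone, since the profiler records the flush kernel too; how `do_bench_npu` separates the two is not shown in this thread.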
    # Run for 300 μs to raise the frequency to 800.
    mat_a = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
    mat_b = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
    mat_c = torch.matmul(mat_a, mat_b)
In our experience, a kernel that runs for at least 300 µs is needed to bring the NPU up to a high frequency.
This method is used by some open-source libraries, e.g. ByteDance's "mojoOpset".
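Since the 300 µs figure is empirical, it can help to check that the warm-up workload really lasts that long on the target device. A minimal host-side sketch, assuming a hypothetical `run_until_at_least` helper and a CPU loop in place of the 4096x4096 bfloat16 matmul; on an NPU the timing would additionally need a device synchronization before reading the clock:

```python
import time

def run_until_at_least(workload, min_seconds=300e-6, start_size=1024):
    """Grow a warm-up workload until a single call lasts at least min_seconds.

    Hypothetical helper: workload(size) is any callable whose run time
    increases with size.
    """
    size = start_size
    while True:
        start = time.perf_counter()
        workload(size)
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return size, elapsed
        size *= 2  # too short: double the problem size and retry

size, elapsed = run_until_at_least(lambda n: sum(range(n)))
```

The returned `size` is the smallest tried size whose single call crossed the threshold, which is the property the comment above asks of the warm-up kernel.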
The warmup in do_bench_npu is implemented in two stages:
- Pre‑warmup (1 call per function) – ensures the runtime environment is “ready” without being measured
- Profiled warmup (warmup calls per function) – allows the device to reach a steady state, but these calls are excluded from the final results.
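The two stages can be sketched as follows. This is a host-side illustration under stated assumptions: `bench_two_stage` and the `active` parameter are hypothetical names, `warmup` mirrors the parameter mentioned above, and `time.perf_counter` stands in for the profiler.

```python
import time

def bench_two_stage(fns, warmup=3, active=5):
    """Return per-function timings, keeping only the active calls."""
    # Stage 1: pre-warmup, one unmeasured call per function.
    for fn in fns:
        fn()
    results = []
    for fn in fns:
        samples = []
        # Stage 2 plus measurement: every call here is timed ("profiled").
        for _ in range(warmup + active):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        # Drop the first `warmup` samples; report only the active ones.
        results.append(samples[warmup:])
    return results

calls = {"n": 0}

def work():
    calls["n"] += 1
    sum(range(1000))

results = bench_two_stage([work])
```

Here `work` runs 1 + 3 + 5 times in total, but only the last 5 timings are returned.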