[Ascend] feat: update do_bench_npu #669

Open
winter77-x wants to merge 1 commit into flagos-ai:triton_v3.2.x from winter77-x:feat_update_benchmark

Conversation

@winter77-x

Collaborator

No description provided.

@winter77-x winter77-x requested a review from zhzhcookie as a code owner June 8, 2026 07:59
@CLAassistant

CLAassistant commented Jun 8, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


wangtao489 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

buffer = runtime.driver.active.get_empty_cache_for_benchmark()
do_bench_clear(buffer.data_ptr(), buffer.numel(), stream)
buffer = buffer.float() # to avoid type cast
buffer.sum()

Collaborator

Usually we call a Triton kernel through a CPU-side "wrapper" function, and that wrapper may well call "torch.sum" itself. In that case the measured time will be wrong, because the time of this "torch.sum" is also captured by msprof.

Collaborator Author

In the do_bench_npu function, buffer.sum() is used to forcefully clear the NPU’s L2 cache before each iteration of the benchmark (when clear_l2_cache=True).
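
One possible reading of the concern in this thread can be sketched without any device. The sketch assumes a profiler that records every launched op as a (name, duration) pair; that format, the durations, and the names `profiled_run`, `mean_time_us`, and `my_triton_kernel` are hypothetical and are not taken from msprof or from this PR. It shows that a cache-clearing `sum` recorded next to the kernel under test inflates the result, while filtering it out by name would also drop a `sum` launched by the user's own wrapper.

```python
from typing import List, Sequence, Tuple

# Hypothetical trace record: (op name, duration in microseconds).
Event = Tuple[str, float]


def profiled_run(wrapper_ops: List[Event], n_iters: int,
                 clear_l2_cache: bool) -> List[Event]:
    """Simulate a trace in which an optional cache-clearing 'sum'
    precedes every benchmark iteration."""
    trace: List[Event] = []
    for _ in range(n_iters):
        if clear_l2_cache:
            trace.append(("sum", 50.0))  # stands in for buffer.sum()
        trace.extend(wrapper_ops)
    return trace


def mean_time_us(trace: List[Event], n_iters: int,
                 skip_names: Sequence[str] = ()) -> float:
    """Average per-iteration time, ignoring ops whose name is in skip_names."""
    total = sum(d for name, d in trace if name not in skip_names)
    return total / n_iters


# Assumed wrapper: one Triton kernel plus its own torch.sum call.
wrapper = [("my_triton_kernel", 100.0), ("sum", 20.0)]
trace = profiled_run(wrapper, n_iters=4, clear_l2_cache=True)

true_time = sum(d for _, d in wrapper)                  # what we want to report
unfiltered = mean_time_us(trace, 4)                     # includes the cache-clear sum
filtered = mean_time_us(trace, 4, skip_names=("sum",))  # drops the wrapper's sum too
print(true_time, unfiltered, filtered)
```

Under these assumptions neither the unfiltered nor the name-filtered average matches the wrapper's real cost, which is why the cache-clearing op needs to be distinguishable from the ops launched by the function under test.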

# Run for 300 μs to raise the frequency to 800.
mat_a = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
mat_b = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
mat_c = torch.matmul(mat_a, mat_b)

Collaborator

In our experience, a kernel that runs for at least 300 µs is needed to make the NPU run at a high frequency.
This method is used by some open-source libraries, e.g. ByteDance's "mojoOpset".

Collaborator Author

The warmup in do_bench_npu is implemented in two stages:

  1. Pre‑warmup (1 call per function): ensures the runtime environment is “ready” without being measured.
  2. Profiled warmup (warmup calls per function): allows the device to reach a steady state; these calls are excluded from the final results.
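
The two stages described above can be sketched as follows. This is a minimal illustration and not the code in this PR: `bench_two_stage`, `warmup`, and `rep` are hypothetical names, and wall-clock timing on the host stands in for the profiler that do_bench_npu relies on.

```python
import time
from typing import Callable, Dict, List


def bench_two_stage(fns: Dict[str, Callable[[], None]],
                    warmup: int, rep: int) -> Dict[str, List[float]]:
    """Return `rep` timing samples (seconds) per function."""
    # Stage 1: pre-warmup, one call per function, not timed at all.
    for fn in fns.values():
        fn()

    results: Dict[str, List[float]] = {}
    for name, fn in fns.items():
        samples: List[float] = []
        # Stage 2: the first `warmup` timed calls are discarded,
        # only the following `rep` calls are kept.
        for i in range(warmup + rep):
            start = time.perf_counter()
            fn()
            elapsed = time.perf_counter() - start
            if i >= warmup:
                samples.append(elapsed)
        results[name] = samples
    return results


# Usage: count how often the function under test is really called.
calls = {"n": 0}


def work() -> None:
    calls["n"] += 1


samples = bench_two_stage({"work": work}, warmup=3, rep=5)
print(len(samples["work"]), calls["n"])  # 5 kept samples, 9 calls in total
```

The function runs 1 + 3 + 5 times, but only the last 5 measurements are reported.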

3 participants