[Ascend] feat: update do_bench_npu by winter77-x · Pull Request #669 · flagos-ai/FlagTree

winter77-x · 2026-06-08T07:59:23Z

No description provided.

CLAassistant · 2026-06-08T07:59:30Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

wangtao489 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

neoskypig · 2026-06-08T08:23:16Z

        buffer = runtime.driver.active.get_empty_cache_for_benchmark()
-        do_bench_clear(buffer.data_ptr(), buffer.numel(), stream)
+        buffer = buffer.float()  # to avoid type cast
+        buffer.sum()


Usually we call a Triton kernel through a CPU-side "wrapper" function, and that wrapper may well use "torch.sum" itself. In that case the reported time will be wrong, because the time of this "torch.sum" is also captured by msprof.

In the do_bench_npu function, buffer.sum() is used to forcefully clear the NPU’s L2 cache before each iteration of the benchmark (when clear_l2_cache=True).
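The concern raised above can be illustrated with a small sketch. It assumes, purely for illustration, that the profiler reports one record per kernel as a (name, duration) pair and that a sum launched by the harness is indistinguishable by name from a sum launched inside the user's wrapper. The record names, the durations, and both helper functions are hypothetical; none of this is msprof output or the actual do_bench_npu logic.

```python
# Hypothetical per-iteration profiler records: (kernel name, duration in microseconds).
# Names and numbers are made up for illustration only.
records = [
    ("ReduceSum", 40.0),       # cache flush issued by the harness via buffer.sum()
    ("triton_kernel", 120.0),  # the kernel under test
    ("ReduceSum", 15.0),       # torch.sum called inside the user's wrapper
]


def total_time(recs):
    """Sum every captured record, including the harness's own flush."""
    return sum(duration for _, duration in recs)


def time_excluding(recs, name):
    """Sum every record except those whose kernel name matches `name`."""
    return sum(duration for kernel, duration in recs if kernel != name)


# What the wrapper really costs: its Triton kernel plus its own torch.sum.
wrapper_cost = 120.0 + 15.0

counted_everything = total_time(records)                # flush is counted too
dropped_by_name = time_excluding(records, "ReduceSum")  # wrapper's sum is lost too

print(wrapper_cost, counted_everything, dropped_by_name)
```

Under these assumptions neither number matches the wrapper's real cost: counting everything adds the flush, and filtering by name also removes the wrapper's own reduction. A dedicated flush kernel with a unique name, as in the removed do_bench_clear call, would not have this ambiguity.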

neoskypig · 2026-06-08T08:25:52Z

-    # Run for 300 μs to raise the frequency to 800.
-    mat_a = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
-    mat_b = torch.randn(4096, 4096).to(dtype=torch.bfloat16).npu()
-    mat_c = torch.matmul(mat_a, mat_b)


In our experience, a kernel that runs for ">= 300 us" is needed to bring the NPU up to a high frequency.
This method is used by some open-source libraries, e.g. ByteDance's "mojoOpset".

The warmup in do_bench_npu is implemented in two stages:

Pre‑warmup (1 call per function) – ensures the runtime environment is “ready” without being measured.

Profiled warmup (warmup calls per function) – allows the device to reach a steady state, but these calls are excluded from the final results.
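A minimal host-side sketch of the ideas in this thread follows. It is not the do_bench_npu implementation: time.perf_counter stands in for msprof, a busy loop stands in for the ">= 300 us" device kernel the reviewer mentions, and the names preheat, bench, warmup, and active are hypothetical.

```python
import time


def preheat(min_seconds=300e-6):
    """Busy-loop for at least `min_seconds`.

    Host-side stand-in for a sustained device kernel that raises the clock
    frequency before any measurement starts.
    """
    start = time.perf_counter()
    while time.perf_counter() - start < min_seconds:
        pass
    return time.perf_counter() - start


def bench(fns, warmup=5, active=20, timer=time.perf_counter):
    """Return the mean time per call for each function in `fns`."""
    # Stage 1: pre-warmup. One untimed call per function.
    for fn in fns:
        fn()

    results = []
    for fn in fns:
        samples = []
        # Stage 2 and measurement: every call is timed, but the first
        # `warmup` samples are dropped from the reported result.
        for _ in range(warmup + active):
            t0 = timer()
            fn()
            samples.append(timer() - t0)
        kept = samples[warmup:]
        results.append(sum(kept) / len(kept))
    return results


calls = []


def work():
    """Toy workload that records how often it was invoked."""
    calls.append(1)
    sum(range(1000))


elapsed = preheat()
means = bench([work], warmup=5, active=20)
print(f"preheat ran for {elapsed * 1e6:.0f} us, mean call time {means[0] * 1e6:.2f} us")
```

With warmup=5 and active=20 the toy workload runs 26 times in total: one pre-warmup call, five discarded warmup calls, and twenty measured calls. Whether the two warmup stages alone bring the device to its high-frequency state, without a separate preheat step, is the open question in this review thread.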

update do_bench_npu

d321088

winter77-x requested a review from zhzhcookie as a code owner June 8, 2026 07:59

github-actions Bot added ascend triton_v3.2.x labels Jun 8, 2026

neoskypig reviewed Jun 8, 2026

winter77-x wants to merge 1 commit into flagos-ai:triton_v3.2.x from winter77-x:feat_update_benchmark


winter77-x Jun 10, 2026
