feat: add DeepSeek-V4 inference on Ascend with TP/EP/DP/CP parallel support by Song-begin · Pull Request #375 · XPU-Forces/mojo_opset

Song-begin · 2026-06-26T07:36:20Z

Features

support DeepSeek-V4 model inference on Ascend
support DeepSeek-V4 distributed inference with TP / EP / DP / CP
support DeepSeek-V4 multi-batch inference
support MTP for DeepSeek-V4
support graph mode / npugraph-ex execution for DeepSeek-V4
support single-node and multi-node inference deployment
support AscendC / torch_npu operators required by DeepSeek-V4


gemini-code-assist

Code Review

This pull request introduces AscendC backend support, custom operators (such as MojoCompressor, MojoHcPost, MojoHcPre, MojoScatterNdUpdateAsc, and MojoRMSNormDynamicQuant), and multi-node/multi-card LLM inference scripts optimized for DeepSeek-V4. Key feedback includes fixing a logic bug in llm_inference.py that builds the model twice when --transformers is enabled, vectorizing a loop in MojoScatterNdUpdateAsc to avoid performance-degrading host-device synchronizations, removing the evaluation-at-import anti-pattern of torch.npu.current_device() in a function signature, and replacing hardcoded NPU device indices (13) in the sparse attention tests to prevent failures on standard hardware.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-26T07:40:04Z

+        device: str = f"npu:{torch.npu.current_device()}",
+
+    ):
+        return torch.ops.custom.npu_sparse_attn_sharedkv_metadata(


Using torch.npu.current_device() as a default argument in the function signature is a Python anti-pattern. Default arguments are evaluated once, when the function is defined (typically at module import), not on each call. This can raise an error if the NPU is not yet initialized when the module is imported, or it can silently lock the default device to npu:0 (or whichever device was active at import time) for all ranks in a multi-NPU environment.

Suggested change

-        device: str = f"npu:{torch.npu.current_device()}",
-
-    ):
-        return torch.ops.custom.npu_sparse_attn_sharedkv_metadata(
+        device: str = None,
+
+    ):
+        if device is None:
+            device = f"npu:{torch.npu.current_device()}"
+        return torch.ops.custom.npu_sparse_attn_sharedkv_metadata(
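
The import-time evaluation described in this comment can be reproduced without NPU hardware. The following is a minimal sketch, assuming stand-in `current_device()` and `set_device()` functions (hypothetical, used here in place of the `torch.npu` calls) and two illustrative function names that are not part of the PR:

```python
from typing import Optional

# Stand-ins for torch.npu.current_device() / set_device(); hypothetical,
# used only so the example runs without NPU hardware.
_active_device = 0


def current_device() -> int:
    return _active_device


def set_device(index: int) -> None:
    global _active_device
    _active_device = index


# Anti-pattern: the f-string default is evaluated once, when `def` executes.
def metadata_eager(device: str = f"npu:{current_device()}") -> str:
    return device


# Fix: use None as a sentinel and resolve the device at call time.
def metadata_lazy(device: Optional[str] = None) -> str:
    if device is None:
        device = f"npu:{current_device()}"
    return device


set_device(3)  # e.g. a rank selects its own device after the module is imported
print(metadata_eager())  # npu:0
print(metadata_lazy())   # npu:3
```

The eager variant keeps reporting the device that was active when the function was defined, which is the behavior the review comment warns about.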

gemini-code-assist · 2026-06-26T07:40:04Z

+    else:
+        model = build_model_from_hf(


When --transformers is enabled, the model is built on lines 1717-1723. However, because AutoModelForCausalLM does not have a load_weights attribute, the execution falls into the else block on line 1870, causing the model to be built a second time. This redundant model creation can lead to high memory usage or Out-Of-Memory (OOM) errors.

Suggested change

-    else:
-        model = build_model_from_hf(
+    elif not args.transformers:
+        model = build_model_from_hf(
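
The effect of the `elif` guard can be sketched in isolation. The control flow below is a simplified guess at the shape of the script, not the actual code in llm_inference.py; `build_transformers_model`, `build`, and the `build_calls` log are hypothetical names introduced for illustration:

```python
from types import SimpleNamespace

build_calls = []  # records which builder ran, so double builds are visible


def build_transformers_model():
    # Hypothetical stand-in for the AutoModelForCausalLM path: no load_weights.
    build_calls.append("transformers")
    return SimpleNamespace(name="transformers")


def build_model_from_hf():
    # Hypothetical stand-in for the native builder.
    build_calls.append("mojo")
    return SimpleNamespace(name="mojo", load_weights=lambda: None)


def build(args):
    model = build_transformers_model() if args.transformers else None
    if hasattr(model, "load_weights"):
        model.load_weights()
    elif not args.transformers:  # a plain `else:` would rebuild in both cases
        model = build_model_from_hf()
    return model


build(SimpleNamespace(transformers=True))
print(build_calls)  # ['transformers']
```

With a plain `else:` in place of the `elif`, the same call would record two builds, which is the redundant allocation the comment describes.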

gemini-code-assist · 2026-06-26T07:40:04Z

+        for i in range(idx.numel()):
+            j = int(idx[i].item())
+            if j >= 0:
+                var[j, :] = update[i, :]
+        return var


Using a Python loop with .item() on NPU tensors causes a host-device synchronization at every iteration. For large inputs (e.g., u = 8192 in tests), this results in thousands of syncs and severely degrades performance. The loop can be replaced with vectorized PyTorch indexing, which removes the per-element synchronizations and is typically far faster.

Suggested change

-        for i in range(idx.numel()):
-            j = int(idx[i].item())
-            if j >= 0:
-                var[j, :] = update[i, :]
-        return var
+        idx = indices.to(dtype=torch.int64).view(-1)
+        valid_mask = idx >= 0
+        if valid_mask.any():
+            var[idx[valid_mask]] = update[valid_mask]
+        return var
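
To check that the vectorized form matches the loop, the two can be compared on CPU tensors. This is a sketch under stated assumptions: the function names are hypothetical, the shapes are small illustrative ones, and the indices are unique. With repeated indices the loop is last-write-wins, while indexed assignment does not guarantee a write order, so the two may differ.

```python
import torch


def scatter_rows_loop(var, indices, update):
    # Reference behavior: one .item() call (and one sync on a device) per row.
    idx = indices.to(dtype=torch.int64).view(-1)
    for i in range(idx.numel()):
        j = int(idx[i].item())
        if j >= 0:
            var[j, :] = update[i, :]
    return var


def scatter_rows_vectorized(var, indices, update):
    # Negative indices mark rows to skip; select the valid rows with a mask.
    idx = indices.to(dtype=torch.int64).view(-1)
    valid_mask = idx >= 0
    # Indexing with an all-False mask is a no-op, so no explicit guard is used.
    var[idx[valid_mask]] = update[valid_mask]
    return var


var = torch.zeros(6, 4)
indices = torch.tensor([4, -1, 0, 2])  # unique non-negative targets, one skip
update = torch.arange(16, dtype=torch.float32).view(4, 4)

expected = scatter_rows_loop(var.clone(), indices, update)
actual = scatter_rows_vectorized(var.clone(), indices, update)
print(torch.equal(expected, actual))  # True
```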

gemini-code-assist · 2026-06-26T07:40:04Z

+def test_mojo_sparse_attn_sharedkv_metadata():
+    if not (hasattr(torch, 'npu') and torch.npu.is_available()):
+        pytest.skip("NPU not available!")
+


Hardcoding NPU device index 13 will cause the test to fail on standard systems with fewer than 14 NPUs (e.g., standard 8-NPU or 1-NPU nodes). It is better to use device 0 or dynamically query the current device.

Suggested change

+    torch.npu.set_device(0)
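
One way to "dynamically query" a usable device is to derive the index from the launcher's rank and clamp it to the devices that exist. The helper below is a hypothetical sketch, not code from the PR; it assumes the launcher exports `LOCAL_RANK` (as torchrun does) and takes the device count as a plain integer so it can run without NPU hardware:

```python
import os
from typing import Mapping, Optional


def resolve_device_index(
    device_count: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """Return a device index that exists on this host (hypothetical helper)."""
    if device_count < 1:
        raise RuntimeError("no devices available")
    env = os.environ if env is None else env
    try:
        index = int(env.get("LOCAL_RANK", "0"))
    except ValueError:
        index = 0  # malformed value: fall back to the first device
    # Out-of-range ranks also fall back to device 0 instead of failing.
    return index if 0 <= index < device_count else 0


print(resolve_device_index(8, {"LOCAL_RANK": "13"}))   # 0
print(resolve_device_index(16, {"LOCAL_RANK": "13"}))  # 13
```

In a test this could feed the device selection, for example `torch.npu.set_device(resolve_device_index(torch.npu.device_count()))`, assuming torch_npu exposes `device_count()` in the same way `torch.cuda` does.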

gemini-code-assist · 2026-06-26T07:40:04Z

+        layout_kv='PA_ND',
+        has_ori_kv=True,
+        has_cmp_kv=False,
+        device='npu:13'


Hardcoding NPU device index 13 will cause the test to fail on standard systems with fewer than 14 NPUs. It is better to use device 0 or dynamically query the current device.

Suggested change

-        device='npu:13'
+        device='npu:0'

gemini-code-assist · 2026-06-26T07:40:04Z

+    if not (hasattr(torch, 'npu') and torch.npu.is_available()):
+        pytest.skip("NPU not available!")
+
+    torch.npu.set_device(13)


Hardcoding NPU device index 13 will cause the test to fail on standard systems with fewer than 14 NPUs. It is better to use device 0 or dynamically query the current device.

Suggested change

-    torch.npu.set_device(13)
+    torch.npu.set_device(0)

gemini-code-assist · 2026-06-26T07:40:04Z

+        layout_kv='PA_ND',
+        has_ori_kv=True,
+        has_cmp_kv=False,
+        device='npu:13'


Hardcoding NPU device index 13 will cause the test to fail on standard systems with fewer than 14 NPUs. It is better to use device 0 or dynamically query the current device.

Suggested change

-        device='npu:13'
+        device='npu:0'

zhangjihang-BD and others added 30 commits May 12, 2026 15:34

Add ascendc backend

e45520a

add AscendcHCpost

5d659e6

add hcpost,moe_gating_top_k

0a22212

add deepseek v4 forward, 2layer forward success

14b76d7

add operators

dc3630b

merge modeling

1fffdef

Add ops-transformer submodule, update .gitmodules

1265274

EP8 Inference dialogue OK

b7120b9

quant_lightning_indexer & kv_sparse_attn

e59f18b

replace mojo ops

542da5d

fix long sequence bug

f4d784c

replace torch_npu

78e978c

adapt multibatch forward

dbc92c2

Merge branch 'tongbowen/dpskv4_ascend' of github.com:XPU-Forces/mojo_opset into tongbowen/dpskv4_ascend_multibatch_v1

926ca0f

multi-batch conversations work normally with mojo operator

994ea00

support npugraph-ex

ec3bce4

npugraph-ex:fix long prompt

c9134d1

npugraph-ex: fix bs2 acc

ab5ed6b

dsv4_long_prompt_test

25beca2

graph/eager both rely on attn-metadata

e5a43c4

optimize prepare_input

9af0364

optimize get_slot_mappint_decode fast_path:prepare_input:12ms

eadad1c

add multi_stream/transpose opt/matmulNZ

c5445d4

fix transpose clean code

fa05f46

optimize kv cache manager

73df59b

enable static_kernel

52267a3

support MTP

a5eaeab

refactor: deepseek v4 runtime input assembly & indexer/moe safe optimization

5466cd4

add prefill CP+DP,decode DP+EP

c9d3e54

Merge remote-tracking branch 'origin/tongbowen/dpskv4_ascend' into temp

790a5ca

bowentong-HW and others added 17 commits May 29, 2026 10:04

add prefill:CP+EP,decode:DP+EP

4f57177

add multi-node inference

772adf9

refactor code about MTP

ca7897c

fix EP/DP/CP

7e5e998

Merge remote-tracking branch 'origin/tongbowen/dpskv4_ascend' into temp

3e4a96d

# Conflicts: # examples/llm_inference.py # mojo_opset/modeling/deepseekv4/mojo_deepseek_v4.py

optimize decode com_slot_mapping compute

4a7e13b

fix MTP Different ranks are unstable

1ea94a6

optimize prefill calculation in host

d29ca09

resolve graph bug, simplify code

24d833b

add perf_time

735ff9f

Move DeepSeek V4 CP into distributed/parallel

bb83f09

Move DeepSeek V4 DP into distributed/parallel

aadcb50

Move DeepSeek V4 EP into distributed/parallel

73bd431

Move DeepSeek V4 TP into distributed/parallel

f41ba69

clean code for parallel

dcc27ff

add prefect eplb

14c7314

ep8 tpot_model_ms=25.5

ba91628

gemini-code-assist Bot reviewed Jun 26, 2026

Song-begin wants to merge 47 commits into master from tongbowen/dpskv4_ascend

8 participants
