Tongbowen/dpskv4 ascend by bowentong-HW · Pull Request #322 · XPU-Forces/mojo_opset

bowentong-HW · 2026-05-22T09:42:32Z

Add DeepSeek V4 inference:

Support EP8 inference;
Support multi-batch inference;
Support npugraph-ex;
Integrate cann-recipes operators;
Only W8A8 INT8 quantization is supported.


gemini-code-assist

Code Review

This pull request introduces support for DeepSeek-V4 inference, adding several custom NPU operators, distributed execution capabilities, and graph mode compilation via torch.compile. The feedback highlights several areas for improvement, including the removal of hardcoded device indices and debug print statements, addressing performance bottlenecks caused by frequent host-device synchronization, and ensuring exception safety when modifying global PyTorch settings. Additionally, the reviewer pointed out redundant tensor initializations and operations that could be simplified for better efficiency.

gemini-code-assist · 2026-05-22T09:52:22Z

+        layout_kv: str = 'PA_ND',
+        has_ori_kv: bool = True,
+        has_cmp_kv: bool = False,
+        device: str = 'npu:0',


The device index is hardcoded to 'npu:0'. This will cause issues in multi-NPU environments where the current process might be assigned to a different device. Use f"npu:{torch.npu.current_device()}" or a device parameter passed from the model.

Suggested change

-        device: str = 'npu:0',
+        device: str = f"npu:{torch.npu.current_device()}",
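One caveat with this suggestion: Python evaluates default argument values once, when the function is defined, so an f-string default captures whichever device was current at import time rather than at call time. The sketch below illustrates the difference and one possible alternative, a `None` default resolved inside the function. `resolve_device`, `eager_default`, and `fake_current_device` are hypothetical names; the fake stands in for `torch.npu.current_device()` so the example runs without an NPU.

```python
from typing import Callable, Optional

# Mutable stand-in for the process's current device index. In real code this
# would come from torch.npu.current_device(); it is faked so the sketch runs anywhere.
_current = {"index": 0}


def fake_current_device() -> int:
    return _current["index"]


def eager_default(device: str = f"npu:{fake_current_device()}") -> str:
    # The f-string in the signature is evaluated once, at definition time.
    return device


def resolve_device(
    device: Optional[str] = None,
    current_index: Callable[[], int] = fake_current_device,
) -> str:
    # Hypothetical helper: None means "use whatever device is current right now".
    if device is not None:
        return device
    return f"npu:{current_index()}"


_current["index"] = 3               # e.g. the launcher later binds this rank to device 3
stale = eager_default()             # value captured when the function was defined
fresh = resolve_device()            # value looked up at call time
explicit = resolve_device("npu:5")  # an explicit argument always wins
```

Passing the device down from the model, as the comment also proposes, avoids the lookup entirely.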

gemini-code-assist · 2026-05-22T09:52:23Z

+    input_ids = torch.full((len(encoded), max_len), pad_token_id, dtype=torch.long)
+    attention_mask = torch.zeros((len(encoded), max_len), dtype=torch.bool)
+    for idx, ids in enumerate(encoded):
+        flat = ids.squeeze(0).cpu()


Moving tensors to CPU within a loop during batch padding can significantly degrade performance due to frequent synchronization between the host and the NPU. It is more efficient to perform these operations on the device or pre-process the data before the inference loop.
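One way to follow the second option (pre-processing before the inference loop) is to pad plain token-ID lists on the host and build each tensor only once. This is a minimal sketch, assuming right padding and a hypothetical helper named `pad_batch`; the padding side used by the PR is not visible in the excerpt above.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_token_id: int
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-pad token-ID lists to a common length (hypothetical helper)."""
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids: List[List[int]] = []
    attention_mask: List[List[bool]] = []
    for seq in sequences:
        pad = max_len - len(seq)
        # Real tokens first, then padding; the mask marks real tokens as True.
        input_ids.append(list(seq) + [pad_token_id] * pad)
        attention_mask.append([True] * len(seq) + [False] * pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_token_id=0)
# The nested lists can then be converted once, for example with
# torch.tensor(ids, dtype=torch.long), and moved to the device in a single copy.
```

The loop still runs per sequence, but it touches only host memory, so no device synchronization happens inside it.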

gemini-code-assist · 2026-05-22T09:52:23Z

+        torch.set_default_dtype(torch.bfloat16)
+        with no_init_weights():
+            model = model_class(hf_config, num_layers=args.num_layers, ep_size=ep_size, ep_rank=ep_rank)
+        torch.set_default_dtype(origin_dtype)


Changing the global default dtype using torch.set_default_dtype is not exception-safe here. If an error occurs during model construction, the global state will remain altered, potentially affecting subsequent operations. Consider using a try...finally block or a context manager to ensure the original dtype is restored.

Suggested change

-        torch.set_default_dtype(torch.bfloat16)
-        with no_init_weights():
-            model = model_class(hf_config, num_layers=args.num_layers, ep_size=ep_size, ep_rank=ep_rank)
-        torch.set_default_dtype(origin_dtype)
+        origin_dtype = torch.get_default_dtype()
+        try:
+            torch.set_default_dtype(torch.bfloat16)
+            with no_init_weights():
+                model = model_class(hf_config, num_layers=args.num_layers, ep_size=ep_size, ep_rank=ep_rank)
+        finally:
+            torch.set_default_dtype(origin_dtype)
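The context-manager alternative mentioned in the comment could look roughly like the sketch below. `temporary_value` is a hypothetical helper, not part of PyTorch or this repository, and a small dictionary stands in for the global default dtype so the example runs without PyTorch installed.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def temporary_value(
    get: Callable[[], T], set_: Callable[[T], None], value: T
) -> Iterator[None]:
    """Apply set_(value) and restore the previous value on exit (hypothetical helper)."""
    previous = get()
    set_(value)
    try:
        yield
    finally:
        # Runs on normal exit and also when the body raises.
        set_(previous)


# Stand-in for a process-wide setting such as the default dtype.
_state = {"dtype": "float32"}


def get_dtype() -> str:
    return _state["dtype"]


def set_dtype(value: str) -> None:
    _state["dtype"] = value


seen_inside = None
try:
    with temporary_value(get_dtype, set_dtype, "bfloat16"):
        seen_inside = get_dtype()
        raise RuntimeError("model construction failed")  # simulated failure
except RuntimeError:
    pass

restored = get_dtype()  # back to the original value despite the exception
```

With PyTorch, the same helper could presumably be used as `with temporary_value(torch.get_default_dtype, torch.set_default_dtype, torch.bfloat16):` around the model construction.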

gemini-code-assist · 2026-05-22T09:52:23Z

+        query = query.clone().contiguous()
+        key = key.clone().contiguous()
+        weights = weights.clone().contiguous() if weights is not None else None
+        query_dequant_scale = query_dequant_scale.clone().contiguous()
+        key_dequant_scale = key_dequant_scale.clone().contiguous()


Calling .clone().contiguous() is redundant because contiguous() already returns a copy if the tensor is not already contiguous. If the tensor is already contiguous, clone() creates an unnecessary deep copy. Using just .contiguous() is more efficient.

Suggested change

-        query = query.clone().contiguous()
-        key = key.clone().contiguous()
-        weights = weights.clone().contiguous() if weights is not None else None
-        query_dequant_scale = query_dequant_scale.clone().contiguous()
-        key_dequant_scale = key_dequant_scale.clone().contiguous()
+        query = query.contiguous()
+        key = key.contiguous()
+        weights = weights.contiguous() if weights is not None else None
+        query_dequant_scale = query_dequant_scale.contiguous()
+        key_dequant_scale = key_dequant_scale.contiguous()
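The behavior described here can be checked on CPU tensors. The sketch below compares storage pointers and also shows a caveat: when the input is already contiguous, `.contiguous()` returns a tensor that shares storage with the input, so dropping `.clone()` is only safe if the downstream operator does not write to its inputs in place. Whether the NPU operators in this PR do so is not visible from the excerpt and is left as an open assumption.

```python
import torch

x = torch.arange(6).reshape(2, 3)  # already contiguous
same_storage = x.contiguous().data_ptr() == x.data_ptr()  # no copy made
clone_is_copy = x.clone().data_ptr() != x.data_ptr()      # clone always copies

t = x.t()  # transposed view, not contiguous
made_copy = t.contiguous().data_ptr() != t.data_ptr()     # copy made only here

# Aliasing caveat: an in-place write through the .contiguous() result is
# visible in x, because both refer to the same storage.
alias = x.contiguous()
alias.add_(100)
caller_saw_write = bool(x[0, 0] == 100)
```

If the operator does mutate its inputs, keeping a single `.clone()` (without the extra `.contiguous()` on already contiguous data) would preserve the caller's tensors.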

gemini-code-assist · 2026-05-22T09:52:23Z

+        Returns:
+            Output tensor with the same shape as residual.
+        """
+        print('qqqq')


A debug print statement print('qqqq') was left in the production code. Please remove it to keep the logs clean.

gemini-code-assist · 2026-05-22T09:52:23Z

+        index_score = torch.zeros(
+            (batch_size, q_seq_len, k_seq_len),
+            dtype=torch.float32,
+            device=query.device,
+        )


The index_score tensor is initialized twice with the same parameters. The first initialization at line 445 is redundant.

gemini-code-assist · 2026-05-22T09:52:23Z

+            return y_out, expert_idx_out, norm_out_fp32
+        return y_out, expert_idx_out, norm_out_fp32


The if not out_flag check is redundant because both branches return the exact same values. The logic can be simplified to a single return statement.

Suggested change

-            return y_out, expert_idx_out, norm_out_fp32
-        return y_out, expert_idx_out, norm_out_fp32
+        return y_out, expert_idx_out, norm_out_fp32

gemini-code-assist · 2026-05-22T09:52:23Z

+    if not (hasattr(torch, 'npu') and torch.npu.is_available()):
+        pytest.skip("NPU not available!")
+
+    torch.npu.set_device(13)


Hardcoding the NPU device index to 13 will cause the test to fail on systems with fewer than 14 NPUs. Use torch.npu.set_device(0) or detect an available device dynamically.

Suggested change

-    torch.npu.set_device(13)
+    torch.npu.set_device(0)
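For the "detect an available device dynamically" option, a test could validate the requested index against the number of visible devices and skip otherwise. The sketch below is one possible shape, assuming a hypothetical environment variable `TEST_NPU_DEVICE` and a hypothetical helper `pick_device_index`; the device count is passed in as a plain integer so the example runs without an NPU.

```python
import os
from typing import Mapping, Optional


def pick_device_index(
    device_count: int,
    env: Mapping[str, str] = os.environ,
    var: str = "TEST_NPU_DEVICE",  # hypothetical variable name
) -> Optional[int]:
    """Return a valid device index, or None when the test should be skipped."""
    if device_count <= 0:
        return None
    raw = env.get(var)
    if raw is None:
        return 0  # default to the first visible device
    try:
        index = int(raw)
    except ValueError:
        return None
    # Reject indices that do not exist on this machine.
    return index if 0 <= index < device_count else None


on_small_box = pick_device_index(8, env={"TEST_NPU_DEVICE": "13"})
on_big_box = pick_device_index(16, env={"TEST_NPU_DEVICE": "13"})
default_choice = pick_device_index(8, env={})
```

In the test itself, a `None` result would translate into `pytest.skip(...)`, and any other result into the argument of `torch.npu.set_device`.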


luohaocheng and others added 19 commits May 13, 2026 03:26

add AscendcHCpost

5d659e6

add hcpost,moe_gating_top_k

0a22212

add deepseek v4 forward, 2layer forward success

14b76d7

add operators

dc3630b

merge modeling

1fffdef

Add ops-transformer submodule, update .gitmodules

1265274

EP8 Inference dialogue OK

b7120b9

quant_lightning_indexer & kv_sparse_attn

e59f18b

replace mojo ops

542da5d

fix long sequence bug

f4d784c

replace torch_npu

78e978c

adapt multibatch forward

dbc92c2

Merge branch 'tongbowen/dpskv4_ascend' of github.com:XPU-Forces/mojo_…

926ca0f

…opset into tongbowen/dpskv4_ascend_multibatch_v1

multi-batch conversations work normally with mojo operator

994ea00

support npugraph-ex

ec3bce4

npugraph-ex:fix long prompt

c9134d1

npugraph-ex: fix bs2 acc

ab5ed6b

dsv4_long_prompt_test

25beca2

graph/eager both rely on attn-metadata

e5a43c4

gemini-code-assist Bot reviewed May 22, 2026


bowentong-HW and others added 10 commits May 24, 2026 08:39

optimize prepare_input

9af0364

optimize get_slot_mappint_decode fast_path:prepare_input:12ms

eadad1c

add multi_stream/transpose opt/matmulNZ

c5445d4

fix transpose clean code

fa05f46

optimize kv cache manager

73df59b

enable static_kernel

52267a3

support MTP

a5eaeab

refactor: deepseek v4 runtime input assembly & indexer/moe safe optim…

5466cd4

…ization

add prefill CP+DP,decode DP+EP

c9d3e54

Merge remote-tracking branch 'origin/tongbowen/dpskv4_ascend' into temp

790a5ca

bowentong-HW and others added 2 commits May 29, 2026 10:04

add prefill:CP+EP,decode:DP+EP

4f57177

add multi-node inference

772adf9

bowentong-HW force-pushed the tongbowen/dpskv4_ascend branch from 101146e to 772adf9 on May 30, 2026 07:18

Song-begin and others added 15 commits June 1, 2026 16:54

refactor code about MTP

ca7897c

fix EP/DP/CP

7e5e998

Merge remote-tracking branch 'origin/tongbowen/dpskv4_ascend' into temp

3e4a96d

# Conflicts: # examples/llm_inference.py # mojo_opset/modeling/deepseekv4/mojo_deepseek_v4.py

optimize decode com_slot_mapping compute

4a7e13b

fix MTP Different ranks are unstable

1ea94a6

optimize prefill calculation in host

d29ca09

resolve graph bug, simplify code

24d833b

add perf_time

735ff9f

Move DeepSeek V4 CP into distributed/parallel

bb83f09

Move DeepSeek V4 DP into distributed/parallel

aadcb50

Move DeepSeek V4 EP into distributed/parallel

73bd431

Move DeepSeek V4 TP into distributed/parallel

f41ba69

clean code for parallel

dcc27ff

add prefect eplb

14c7314

ep8 tpot_model_ms=25.5

ba91628

bowentong-HW wants to merge 46 commits into dev/dpskv4_ascend from tongbowen/dpskv4_ascend

7 participants
