6 changes: 6 additions & 0 deletions .github/workflows/vllm_ascend_test_nightly_a3.yaml
@@ -52,6 +52,9 @@ jobs:
- name: multi-node-deepseek-pd
config_file_path: DeepSeek-V3.yaml
size: 2
- name: multi-node-deepseek-pd-layerwise
config_file_path: DeepSeek-V3-layerwise.yaml
size: 2
- name: multi-node-qwen3-dp
config_file_path: Qwen3-235B-A3B.yaml
size: 2
@@ -61,6 +64,9 @@
- name: multi-node-qwenw8a8-2node
config_file_path: Qwen3-235B-W8A8.yaml
size: 2
- name: multi-node-qwenw8a8-2node-layerwise
config_file_path: Qwen3-235B-W8A8-layerwise.yaml
size: 2
- name: multi-node-glm-2node
config_file_path: GLM-4_5.yaml
size: 2
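One gotcha with matrix includes: entry names should stay unique so per-job results remain distinguishable. Below is a quick local check for that; a sketch only, which walks the workflow YAML generically since the full job nesting is not shown in this diff:

    # Sketch: flag duplicate test-matrix names in the workflow file.
    # Walks the YAML generically because the exact nesting is not shown here.
    import collections

    import yaml  # PyYAML


    def iter_matrix_names(node):
        if isinstance(node, dict):
            if "name" in node and "config_file_path" in node:
                yield node["name"]
            for value in node.values():
                yield from iter_matrix_names(value)
        elif isinstance(node, list):
            for value in node:
                yield from iter_matrix_names(value)


    with open(".github/workflows/vllm_ascend_test_nightly_a3.yaml") as f:
        names = list(iter_matrix_names(yaml.safe_load(f)))

    duplicates = [n for n, c in collections.Counter(names).items() if c > 1]
    assert not duplicates, f"duplicate matrix names: {duplicates}"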
112 changes: 112 additions & 0 deletions tests/e2e/nightly/multi_node/config/models/DeepSeek-V3-layerwise.yaml
@@ -0,0 +1,112 @@
# For disaggregated mode, set disaggregated_prefill.enabled: true and set the following parameters:
#   prefiller_host_index: the hosts indices of the nodes running prefillers
#   decoder_host_index: the hosts indices of the nodes running decoders
# Suppose we have **4 nodes** running a 2P1D setup (2 Prefillers + 1 Decoder):
#   ┌───────────────┬───────────────┬───────────────┬───────────────┐
#   │   node0       │   node1       │   node2       │   node3       │
#   │  Prefiller #1 │  Prefiller #2 │    Decoder    │    Decoder    │
#   └───────────────┴───────────────┴───────────────┴───────────────┘
# For the prefiller nodes, the hosts are node0 and node1.
# For the decoder nodes, we only have 1 decoder node (dp+tp+ep spans node2 and node3, where node3 runs in headless mode).
# So the prefiller_host_index is [0, 1], and the decoder_host_index is [2].
Comment on lines +4 to +11 (Contributor, severity: high):

The comment describes a 4-node setup, but the configuration below uses num_nodes: 2. This is misleading. Please update the comment to reflect the actual 2-node (1 prefiller, 1 decoder) setup being configured:
# Suppose we have **2 nodes** running a 1P1D setup (1 Prefiller + 1 Decoder):
#   ┌───────────────┬───────────────┐
#   │   node0       │   node1       │
#   │   Prefiller   │   Decoder     │
#   └───────────────┴───────────────┘
# For the prefiller node, the host is node0.
# For the decoder node, the host is node1.
# So the prefiller_host_index is [0], and the decoder_host_index is [1].
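For illustration, this is how the harness might resolve those indices against cluster_hosts; a minimal sketch with hypothetical IPs and resolution logic (the real lookup lives in the test harness):

    # Hypothetical: resolve host indices to node IPs (example addresses).
    cluster_hosts = ["192.0.2.10", "192.0.2.11"]  # node0, node1
    prefiller_host_index = [0]
    decoder_host_index = [1]

    prefillers = [cluster_hosts[i] for i in prefiller_host_index]  # ['192.0.2.10']
    decoders = [cluster_hosts[i] for i in decoder_host_index]      # ['192.0.2.11']
    print("prefillers:", prefillers, "decoders:", decoders)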

test_name: "test DeepSeek-V3 disaggregated_prefill"
model: "vllm-ascend/DeepSeek-V3-W8A8"
num_nodes: 2
npu_per_node: 16
env_common:
VLLM_USE_MODELSCOPE: true
OMP_PROC_BIND: false
OMP_NUM_THREADS: 100
HCCL_BUFFSIZE: 1024
SERVER_PORT: 8080
NUMEXPR_MAX_THREADS: 128
DISAGGREGATED_PREFILL_PROXY_SCRIPT: "examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py"
# For non-Kubernetes deployments, list the IPs of all nodes used, in order, as follows:
# cluster_hosts: []
disaggregated_prefill:
enabled: true
prefiller_host_index: [0]
decoder_host_index: [1]

deployment:
-
server_cmd: >
vllm serve "vllm-ascend/DeepSeek-V3-W8A8"
--host 0.0.0.0
--port $SERVER_PORT
--data-parallel-size 2
--data-parallel-size-local 2
--tensor-parallel-size 8
--seed 1024
--enforce-eager
--enable-expert-parallel
--max-num-seqs 16
--max-model-len 8192
--max-num-batched-tokens 8192
--quantization ascend
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
}'

-
server_cmd: >
vllm serve "vllm-ascend/DeepSeek-V3-W8A8"
--host 0.0.0.0
--port $SERVER_PORT
--data-parallel-size 2
--data-parallel-size-local 2
--tensor-parallel-size 8
--seed 1024
--quantization ascend
--max-num-seqs 16
--max-model-len 8192
--max-num-batched-tokens 8192
--enable-expert-parallel
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--additional-config '{"torchair_graph_config":{"enabled":true}}'
--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
}'
benchmarks:
acc:
case_type: accuracy
dataset_path: vllm-ascend/gsm8k-lite
request_conf: vllm_api_general_chat
dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
max_out_len: 4096
batch_size: 512
baseline: 95
threshold: 5
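With both roles up behind the proxy script, a quick smoke test can confirm the end-to-end wiring. A minimal sketch, assuming the proxy listens on SERVER_PORT and forwards the same OpenAI-compatible API that vllm serve exposes (host and prompt are illustrative):

    import requests

    # Smoke test against the disaggregated-prefill proxy (host/port assumed).
    resp = requests.post(
        "http://127.0.0.1:8080/v1/completions",
        json={
            "model": "vllm-ascend/DeepSeek-V3-W8A8",
            "prompt": "The capital of France is",
            "max_tokens": 16,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])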
90 changes: 90 additions & 0 deletions tests/e2e/nightly/multi_node/config/models/Qwen3-235B-W8A8-layerwise.yaml
@@ -0,0 +1,90 @@
test_name: "test Qwen3-235B-A22B-W8A8 disaggregated_prefill"
model: "vllm-ascend/Qwen3-235B-A22B-W8A8"
num_nodes: 2
npu_per_node: 16
env_common:
VLLM_USE_MODELSCOPE: true
OMP_PROC_BIND: false
OMP_NUM_THREADS: 100
HCCL_BUFFSIZE: 1024
SERVER_PORT: 8080
NUMEXPR_MAX_THREADS: 128
DISAGGREGATED_PREFILL_PROXY_SCRIPT: "examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py"
# For non-Kubernetes deployments, list the IPs of all nodes used, in order, as follows:
# cluster_hosts: []
disaggregated_prefill:
enabled: true
prefiller_host_index: [0]
decoder_host_index: [1]

deployment:
-
server_cmd: >
vllm serve "vllm-ascend/Qwen3-235B-A22B-W8A8"
--host 0.0.0.0
--port $SERVER_PORT
--data-parallel-size 2
--data-parallel-size-local 2
--tensor-parallel-size 8
--seed 1024
Contributor (severity: high):

The producer (prefill node) deployment is missing the --enforce-eager flag. In disaggregated prefill setups, it's common to run the prefill stage in eager mode for flexibility, while the decode stage runs in graph mode for performance. The DeepSeek-V3-layerwise.yaml configuration in this PR follows this pattern. For consistency and to ensure correct behavior, please add this flag:

        --seed 1024
        --enforce-eager

--enable-expert-parallel
--max-num-seqs 16
--max-model-len 8192
--max-num-batched-tokens 8192
--quantization ascend
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
}'

-
server_cmd: >
vllm serve "vllm-ascend/Qwen3-235B-A22B-W8A8"
--host 0.0.0.0
--port $SERVER_PORT
--data-parallel-size 2
--data-parallel-size-local 2
--tensor-parallel-size 8
--seed 1024
--quantization ascend
--max-num-seqs 16
--max-model-len 8192
--max-num-batched-tokens 8192
--enable-expert-parallel
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
Contributor (severity: high):

The consumer (decode node) deployment is missing the configuration to enable graph mode (--additional-config '{"torchair_graph_config":{"enabled":true}}'). This is inconsistent with the DeepSeek-V3-layerwise.yaml config, where the decoder is explicitly configured to run in graph mode for better performance. Please add this configuration for consistency and to leverage the performance optimization:

        --gpu-memory-utilization 0.9
        --additional-config '{"torchair_graph_config":{"enabled":true}}'

--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 2,
"tp_size": 8
}
}
}'
benchmarks:
Contributor (severity: high):

The benchmarks section is empty. For the CI to run a meaningful test, it needs benchmark configuration, such as an accuracy test. This seems to be an oversight, as the DeepSeek-V3-layerwise.yaml file includes a benchmark configuration. Please add the appropriate benchmark configuration, similar to the other test file:

benchmarks:
  acc:
    case_type: accuracy
    dataset_path: vllm-ascend/gsm8k-lite
    request_conf: vllm_api_general_chat
    dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
    max_out_len: 4096
    batch_size: 512
    baseline: 95
    threshold: 5
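For context, a plausible reading of baseline and threshold, assuming threshold is the allowed absolute drop in accuracy from baseline (the authoritative check lives in the nightly benchmark harness):

    def accuracy_gate(measured: float, baseline: float = 95.0, threshold: float = 5.0) -> bool:
        """Assumed semantics: pass while accuracy stays within `threshold` points of `baseline`."""
        return measured >= baseline - threshold

    assert accuracy_gate(92.3)        # 92.3 >= 90.0 -> pass
    assert not accuracy_gate(89.9)    # 89.9 <  90.0 -> fail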
