[P/D] add layerwise connector CI #4468
base: main
The first new file (+112 lines) adds a DeepSeek-V3 disaggregated-prefill CI config:

```yaml
# For disaggregated mode, set is_disaggregated: true and set the following parameters:
#   prefiller_host_index: the hosts index of the node(s) running the prefiller
#   decoder_host_index: the hosts index of the node(s) running the decoder
# Example: suppose we have 4 nodes running a 2P1D setup (2 Prefillers + 1 Decoder):
# ┌───────────────┬───────────────┬───────────────┬───────────────┐
# │     node0     │     node1     │     node2     │     node3     │
# │  Prefiller #1 │  Prefiller #2 │    Decoder    │    Decoder    │
# └───────────────┴───────────────┴───────────────┴───────────────┘
# For the prefiller nodes, the hosts are node0 and node1.
# For the decoder nodes, there is only 1 decoder instance (dp+tp+ep spans node2 and
# node3, where node3 runs in headless mode).
# So prefiller_host_index is [0, 1] and decoder_host_index is [2].
test_name: "test DeepSeek-V3 disaggregated_prefill"
model: "vllm-ascend/DeepSeek-V3-W8A8"
num_nodes: 2
npu_per_node: 16
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
  NUMEXPR_MAX_THREADS: 128
  DISAGGREGATED_PREFILL_PROXY_SCRIPT: "examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py"
# For non-Kubernetes deployment, list the IPs of all nodes used, in order, as follows:
# cluster_hosts: []
disaggregated_prefill:
  enabled: true
  prefiller_host_index: [0]
  decoder_host_index: [1]
```
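The host-index mapping described in the comment above can be sketched as a small helper. This is a minimal illustration, not part of this PR; `resolve_roles` is a hypothetical name, and the node names stand in for the real IPs that would go in `cluster_hosts`:

```python
# Sketch of the host-index -> role mapping described in the config comment
# (hypothetical helper, not part of this PR). cluster_hosts lists node IPs
# in order; prefiller_host_index / decoder_host_index select entries from it.

def resolve_roles(cluster_hosts, prefiller_host_index, decoder_host_index):
    """Return (prefiller_hosts, decoder_hosts) for the given index lists."""
    prefillers = [cluster_hosts[i] for i in prefiller_host_index]
    decoders = [cluster_hosts[i] for i in decoder_host_index]
    return prefillers, decoders

# The 4-node 2P1D example from the comment: node3 runs headless under the
# decoder instance on node2, so it does not appear in decoder_host_index.
hosts = ["node0", "node1", "node2", "node3"]
p, d = resolve_roles(hosts, prefiller_host_index=[0, 1], decoder_host_index=[2])
print(p, d)  # ['node0', 'node1'] ['node2']
```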
```yaml
deployment:
  - server_cmd: >
      vllm serve "vllm-ascend/DeepSeek-V3-W8A8"
      --host 0.0.0.0
      --port $SERVER_PORT
      --data-parallel-size 2
      --data-parallel-size-local 2
      --tensor-parallel-size 8
      --seed 1024
      --enforce-eager
      --enable-expert-parallel
      --max-num-seqs 16
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --quantization ascend
      --trust-remote-code
      --no-enable-prefix-caching
      --gpu-memory-utilization 0.9
      --kv-transfer-config
      '{"kv_connector": "MooncakeLayerwiseConnector",
        "kv_role": "kv_producer",
        "kv_port": "30000",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
        "kv_connector_extra_config": {
          "prefill": {"dp_size": 2, "tp_size": 8},
          "decode": {"dp_size": 2, "tp_size": 8}
        }
      }'
  - server_cmd: >
      vllm serve "vllm-ascend/DeepSeek-V3-W8A8"
      --host 0.0.0.0
      --port $SERVER_PORT
      --data-parallel-size 2
      --data-parallel-size-local 2
      --tensor-parallel-size 8
      --seed 1024
      --quantization ascend
      --max-num-seqs 16
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --enable-expert-parallel
      --trust-remote-code
      --no-enable-prefix-caching
      --gpu-memory-utilization 0.9
      --additional-config '{"torchair_graph_config":{"enabled":true}}'
      --kv-transfer-config
      '{"kv_connector": "MooncakeLayerwiseConnector",
        "kv_role": "kv_consumer",
        "kv_port": "30200",
        "engine_id": "1",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
        "kv_connector_extra_config": {
          "prefill": {"dp_size": 2, "tp_size": 8},
          "decode": {"dp_size": 2, "tp_size": 8}
        }
      }'
benchmarks:
  acc:
    case_type: accuracy
    dataset_path: vllm-ascend/gsm8k-lite
    request_conf: vllm_api_general_chat
    dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
    max_out_len: 4096
    batch_size: 512
    baseline: 95
    threshold: 5
```
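One invariant worth noting in this config: the `--kv-transfer-config` payload is plain JSON, and for each role `dp_size * tp_size` should match the 16 NPUs per node the job requests. The sketch below (not part of this PR, just an illustration of that arithmetic) parses the producer payload and checks it:

```python
import json

# Sanity-check sketch (not part of this PR): the kv-transfer-config string
# passed to vllm serve is plain JSON. For this config, each vllm instance
# uses dp_size * tp_size = 2 * 8 = 16 NPUs, matching npu_per_node: 16.
kv_transfer_config = json.loads("""
{"kv_connector": "MooncakeLayerwiseConnector",
 "kv_role": "kv_producer",
 "kv_port": "30000",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
 "kv_connector_extra_config": {
   "prefill": {"dp_size": 2, "tp_size": 8},
   "decode": {"dp_size": 2, "tp_size": 8}
 }}
""")

npu_per_node = 16  # from the config above
for role, sizes in kv_transfer_config["kv_connector_extra_config"].items():
    assert sizes["dp_size"] * sizes["tp_size"] == npu_per_node, role
print("kv_connector_extra_config consistent with npu_per_node")
```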
The second new file (+90 lines) adds the matching Qwen3-235B-A22B CI config:

```yaml
test_name: "test Qwen3-235B-A22B-W8A8 disaggregated_prefill"
model: "vllm-ascend/Qwen3-235B-A22B-W8A8"
num_nodes: 2
npu_per_node: 16
env_common:
  VLLM_USE_MODELSCOPE: true
  OMP_PROC_BIND: false
  OMP_NUM_THREADS: 100
  HCCL_BUFFSIZE: 1024
  SERVER_PORT: 8080
  NUMEXPR_MAX_THREADS: 128
  DISAGGREGATED_PREFILL_PROXY_SCRIPT: "examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py"
# For non-Kubernetes deployment, list the IPs of all nodes used, in order, as follows:
# cluster_hosts: []
disaggregated_prefill:
  enabled: true
  prefiller_host_index: [0]
  decoder_host_index: [1]
deployment:
  - server_cmd: >
      vllm serve "vllm-ascend/Qwen3-235B-A22B-W8A8"
      --host 0.0.0.0
      --port $SERVER_PORT
      --data-parallel-size 2
      --data-parallel-size-local 2
      --tensor-parallel-size 8
      --seed 1024
```

> **Contributor** (inline review comment): The producer (prefill node) deployment is missing the `--seed 1024`.

The producer `server_cmd` continues, followed by the consumer deployment:

```yaml
      --enforce-eager
      --enable-expert-parallel
      --max-num-seqs 16
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --quantization ascend
      --trust-remote-code
      --no-enable-prefix-caching
      --gpu-memory-utilization 0.9
      --kv-transfer-config
      '{"kv_connector": "MooncakeLayerwiseConnector",
        "kv_role": "kv_producer",
        "kv_port": "30000",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
        "kv_connector_extra_config": {
          "prefill": {"dp_size": 2, "tp_size": 8},
          "decode": {"dp_size": 2, "tp_size": 8}
        }
      }'
  - server_cmd: >
      vllm serve "vllm-ascend/Qwen3-235B-A22B-W8A8"
      --host 0.0.0.0
      --port $SERVER_PORT
      --data-parallel-size 2
      --data-parallel-size-local 2
      --tensor-parallel-size 8
      --seed 1024
      --quantization ascend
      --max-num-seqs 16
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --enable-expert-parallel
      --trust-remote-code
      --no-enable-prefix-caching
      --gpu-memory-utilization 0.9
```

> **Contributor** (inline review comment): The consumer (decode node) deployment is missing the configuration to enable graph mode (`--additional-config '{"torchair_graph_config":{"enabled":true}}'`).

The consumer `server_cmd` continues:

```yaml
      --kv-transfer-config
      '{"kv_connector": "MooncakeLayerwiseConnector",
        "kv_role": "kv_consumer",
        "kv_port": "30200",
        "engine_id": "1",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_layerwise_connector",
        "kv_connector_extra_config": {
          "prefill": {"dp_size": 2, "tp_size": 8},
          "decode": {"dp_size": 2, "tp_size": 8}
        }
      }'
benchmarks:
```

> **Contributor** (inline review comment): The `benchmarks:` section:
>
> ```yaml
> benchmarks:
>   acc:
>     case_type: accuracy
>     dataset_path: vllm-ascend/gsm8k-lite
>     request_conf: vllm_api_general_chat
>     dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
>     max_out_len: 4096
>     batch_size: 512
>     baseline: 95
>     threshold: 5
> ```

> **Contributor** (review comment on the first file): The comment describes a 4-node setup, but the configuration below uses `num_nodes: 2`. This is misleading. Please update the comment to reflect the actual 2-node (1 prefiller, 1 decoder) setup being configured.
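The mismatch the reviewer flags can also be caught mechanically. The sketch below (hypothetical, not part of this PR) checks that the host indices are distinct and within `num_nodes`; note it deliberately flags the 4-node example's indices when paired with `num_nodes: 2`:

```python
# Hypothetical consistency check (not part of this PR): every host index
# must be unique and fall in [0, num_nodes). Headless nodes are not listed
# in either index, so this is a bounds check, not a full topology check.

def check_host_indices(num_nodes, prefiller_host_index, decoder_host_index):
    used = list(prefiller_host_index) + list(decoder_host_index)
    no_overlap = len(set(used)) == len(used)   # a node is prefiller XOR decoder
    in_range = all(0 <= i < num_nodes for i in used)
    return no_overlap and in_range

# The actual config in this PR: 1 prefiller host + 1 decoder host, 2 nodes.
print(check_host_indices(2, [0], [1]))        # True
# The 4-node example's indices combined with num_nodes: 2 (the reviewer's
# point): index 2 is out of range for a 2-node job.
print(check_host_indices(2, [0, 1], [2]))     # False
```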