Request: please add support for this setup on AMD ROCm GPUs.
# DFlash on ROCm/vLLM Deployment Notes
## Goal
Try to run the following speculative decoding pair with vLLM on ROCm:
- Target model: `Qwen3.5-27B`
- Draft model: `z-lab/Qwen3.5-27B-DFlash`
## What was verified
### Container start command
```bash
docker run -d --name dflash-vllm-rocm-nightly \
--ipc host \
--network host \
--privileged \
--cap-add SYS_ADMIN \
--cap-add SYS_PTRACE \
--device /dev/kfd:/dev/kfd \
--device /dev/dri:/dev/dri \
--device /dev/mem:/dev/mem \
--group-add render \
--security-opt seccomp=unconfined \
-e GLOO_SOCKET_IFNAME=eth0 \
-e NCCL_SOCKET_IFNAME=eth0 \
-e VLLM_HOST_IP=10.50.0.59 \
-v /mnt/volume_atl1_d4t/HF:/model:ro \
--entrypoint bash \
rocm/vllm-dev:nightly \
-lc "exec vllm serve /model/Qwen/Qwen3.5-27B \
--speculative-config '{\"method\":\"dflash\",\"model\":\"/model/Qwen/Qwen3.5-27B-DFlash\",\"num_speculative_tokens\":15}' \
--tensor-parallel-size 1 \
--dtype auto \
--enable-prefix-caching \
--max-model-len 8192 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0 \
--port 18000 \
--served-model-name qwen35-27b-dflash \
--api-key amd \
--trust-remote-code"
```
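For reference, if the server does come up, a request like the one below should exercise the deployment through vLLM's OpenAI-compatible API. This is a sketch based only on the flags above (port 18000, API key `amd`, served model name `qwen35-27b-dflash`); on ROCm the server never got this far.

```bash
# Smoke test against the OpenAI-compatible endpoint started by the command above.
curl -s http://localhost:18000/v1/chat/completions \
  -H "Authorization: Bearer amd" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen35-27b-dflash",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```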
## Failure path A: default ROCm backend

When not forcing `--attention-backend flash_attn`, the service failed during backend selection.

Key error:

```
ValueError: No valid attention backend found for rocm with AttentionSelectorConfig(... use_non_causal=True).
Reasons: {ROCM_ATTN: [non-causal attention not supported], ROCM_AITER_UNIFIED_ATTN: [non-causal attention not supported], TRITON_ATTN: [non-causal attention not supported]}.
```
### Interpretation

This means:
- DFlash internally requires `use_non_causal=True`
- the ROCm attention backends currently available to vLLM do not support that execution mode
- therefore DFlash cannot run on this path, even though it is recognized by vLLM

Important clarification:
- removing `--attention-backend flash_attn` does not remove the need for non-causal attention
- the need for non-causal attention comes from the DFlash execution path itself
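For completeness, forcing the FlashAttention backend does not have to go through the CLI flag. A minimal sketch, assuming the `VLLM_ATTENTION_BACKEND` environment variable is honored by this nightly build and is equivalent in intent to `--attention-backend flash_attn` (serve arguments abbreviated):

```bash
# Force the FlashAttention backend via environment variable instead of the CLI flag.
# Assumption: VLLM_ATTENTION_BACKEND=FLASH_ATTN is respected by this nightly image.
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve /model/Qwen/Qwen3.5-27B \
  --speculative-config '{"method":"dflash","model":"/model/Qwen/Qwen3.5-27B-DFlash","num_speculative_tokens":15}' \
  --port 18000 \
  --trust-remote-code
```

Either way of forcing the backend leads to failure path B below.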
## Failure path B: forcing `--attention-backend flash_attn`

When forcing `--attention-backend flash_attn`, the run progressed much farther:
- target model loaded successfully
- draft model loaded successfully
- weights were loaded
- compilation/profile steps ran
Then it failed with:

```
AssertionError: FlashAttention version not detected.
```

This happened even though `pip list | grep flash` showed flash-related packages installed in the container.
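To distinguish "listed by pip" from "importable at runtime", a quick check like the one below can be run inside the container. This is a sketch; it assumes the upstream `flash_attn` module name and the container name used above.

```bash
# Verify that FlashAttention is importable (not just listed by pip) inside the running container.
docker exec dflash-vllm-rocm-nightly python3 -c \
  "import flash_attn; print(flash_attn.__version__)"
```

If this import fails or reports no version, vLLM's "FlashAttention version not detected" assertion is consistent with the package not actually being usable on this ROCm build.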
## Short summary

DFlash is recognized by the ROCm vLLM nightly, but the run still fails because the ROCm attention backends do not support the non-causal attention required by DFlash. Forcing `flash_attn` avoids that backend-selection failure but then fails later because FlashAttention is not successfully detected by vLLM.