Failed on AMD ROCm GPU with vLLM #72

@alexhegit

Description

Request to support DFlash on AMD ROCm GPUs with vLLM.

# DFlash on ROCm/vLLM Deployment Notes


## Goal

Try to run the following speculative decoding pair with vLLM on ROCm:

- Target model: `Qwen3.5-27B`
- Draft model: `z-lab/Qwen3.5-27B-DFlash`
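
The pair is wired together through vLLM's `--speculative-config` JSON (see the serve command below). For readability, here is that value expanded, using an illustrative shell variable:

```bash
# Illustrative only: expanded form of the --speculative-config value
# passed to vllm serve below (SPEC_CONFIG is a hypothetical name).
SPEC_CONFIG='{
  "method": "dflash",
  "model": "/model/Qwen/Qwen3.5-27B-DFlash",
  "num_speculative_tokens": 15
}'
```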


## What was verified

### Container start command

```bash
docker run -d --name dflash-vllm-rocm-nightly \
  --ipc host \
  --network host \
  --privileged \
  --cap-add SYS_ADMIN \
  --cap-add SYS_PTRACE \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri:/dev/dri \
  --device /dev/mem:/dev/mem \
  --group-add render \
  --security-opt seccomp=unconfined \
  -e GLOO_SOCKET_IFNAME=eth0 \
  -e NCCL_SOCKET_IFNAME=eth0 \
  -e VLLM_HOST_IP=10.50.0.59 \
  -v /mnt/volume_atl1_d4t/HF:/model:ro \
  --entrypoint bash \
  rocm/vllm-dev:nightly \
  -lc "exec vllm serve /model/Qwen/Qwen3.5-27B \
    --speculative-config '{\"method\":\"dflash\",\"model\":\"/model/Qwen/Qwen3.5-27B-DFlash\",\"num_speculative_tokens\":15}' \
    --tensor-parallel-size 1 \
    --dtype auto \
    --enable-prefix-caching \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.9 \
    --host 0.0.0.0 \
    --port 18000 \
    --served-model-name qwen35-27b-dflash \
    --api-key amd \
    --trust-remote-code"
```
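
Not reached in this report, but for completeness: once a server does come up, a minimal sanity check against the OpenAI-compatible endpoint would look like this (port, API key, and served model name taken from the command above):

```bash
# Minimal chat completion request against the endpoint configured above.
curl http://localhost:18000/v1/chat/completions \
  -H "Authorization: Bearer amd" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen35-27b-dflash",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```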

## Failure path A: default ROCm backend

When not forcing `--attention-backend flash_attn`, the service failed during attention backend selection.

Key error:

```text
ValueError: No valid attention backend found for rocm with AttentionSelectorConfig(... use_non_causal=True).
Reasons: {ROCM_ATTN: [non-causal attention not supported], ROCM_AITER_UNIFIED_ATTN: [non-causal attention not supported], TRITON_ATTN: [non-causal attention not supported]}.
```

### Interpretation

This means:

1. DFlash internally requires `use_non_causal=True`.
2. The ROCm attention backends currently available to vLLM do not support that execution mode.
3. Therefore DFlash cannot run on this path, even though vLLM recognizes it.

Important clarification:

- Removing `--attention-backend flash_attn` does not remove the need for non-causal attention.
- The need for non-causal attention comes from the DFlash execution path itself.
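
For anyone reproducing this, the backend can presumably also be forced through vLLM's `VLLM_ATTENTION_BACKEND` environment variable (assumption: this nightly treats it the same as the `--attention-backend` flag); this is the route that leads to Failure path B below:

```bash
# Assumed equivalent to passing --attention-backend flash_attn;
# remaining flags identical to the container command above.
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /model/Qwen/Qwen3.5-27B \
  --speculative-config '{"method":"dflash","model":"/model/Qwen/Qwen3.5-27B-DFlash","num_speculative_tokens":15}'
```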

## Failure path B: forcing `--attention-backend flash_attn`

When forcing `--attention-backend flash_attn`, the run progressed much further:

- the target model loaded successfully
- the draft model loaded successfully
- weights were loaded
- compilation/profiling steps ran

Then it failed with:

```text
AssertionError: FlashAttention version not detected.
```

Even though `pip list | grep flash` showed:

```text
flash_attn 2.8.3
```
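
A quick way to tell "package installed" apart from "module actually importable" (a hypothetical diagnostic; an import failure, e.g. from a CUDA-only `flash_attn` build on ROCm, would explain the detection error even though `pip list` finds the package):

```bash
# pip metadata alone doesn't prove the compiled extension loads;
# try importing it in the same environment the server uses.
docker exec dflash-vllm-rocm-nightly python3 -c \
  "import flash_attn; print(flash_attn.__version__)"
```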

## Short summary

DFlash is recognized by the ROCm vLLM nightly, but the run still fails because the available ROCm attention backends do not support the non-causal attention DFlash requires. Forcing `flash_attn` avoids that selection failure but then fails later because vLLM does not detect a usable FlashAttention build, even though the `flash_attn` package is installed.
