Failed on AMD ROCm GPU with vLLM #72

@alexhegit

Description

Request to support DFlash on AMD ROCm GPUs with vLLM.

# DFlash on ROCm/vLLM Deployment Notes


## Goal

Try to run the following speculative decoding pair with vLLM on ROCm:

- Target model: `Qwen3.5-27B`
- Draft model: `z-lab/Qwen3.5-27B-DFlash`
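
The pair is wired together through vLLM's `--speculative-config` JSON (see the serve command below). For readability, here is that value expanded, using an illustrative shell variable:

```bash
# Illustrative only: expanded form of the --speculative-config value
# passed to vllm serve below (SPEC_CONFIG is a hypothetical name).
SPEC_CONFIG='{
  "method": "dflash",
  "model": "/model/Qwen/Qwen3.5-27B-DFlash",
  "num_speculative_tokens": 15
}'
```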


## What was verified

### Container start command

```bash
docker run -d --name dflash-vllm-rocm-nightly \
  --ipc host \
  --network host \
  --privileged \
  --cap-add SYS_ADMIN \
  --cap-add SYS_PTRACE \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri:/dev/dri \
  --device /dev/mem:/dev/mem \
  --group-add render \
  --security-opt seccomp=unconfined \
  -e GLOO_SOCKET_IFNAME=eth0 \
  -e NCCL_SOCKET_IFNAME=eth0 \
  -e VLLM_HOST_IP=10.50.0.59 \
  -v /mnt/volume_atl1_d4t/HF:/model:ro \
  --entrypoint bash \
  rocm/vllm-dev:nightly \
  -lc "exec vllm serve /model/Qwen/Qwen3.5-27B \
    --speculative-config '{\"method\":\"dflash\",\"model\":\"/model/Qwen/Qwen3.5-27B-DFlash\",\"num_speculative_tokens\":15}' \
    --tensor-parallel-size 1 \
    --dtype auto \
    --enable-prefix-caching \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.9 \
    --host 0.0.0.0 \
    --port 18000 \
    --served-model-name qwen35-27b-dflash \
    --api-key amd \
    --trust-remote-code"
```
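
Not reached in this report, but for completeness: once a server does come up, a minimal sanity check against the OpenAI-compatible endpoint would look like this (port, API key, and served model name taken from the command above):

```bash
# Minimal chat completion request against the endpoint configured above.
curl http://localhost:18000/v1/chat/completions \
  -H "Authorization: Bearer amd" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen35-27b-dflash",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```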

## Failure path A: default ROCm backend

When not forcing `--attention-backend flash_attn`, the service failed during attention backend selection.

Key error:

```text
ValueError: No valid attention backend found for rocm with AttentionSelectorConfig(... use_non_causal=True).
Reasons: {ROCM_ATTN: [non-causal attention not supported], ROCM_AITER_UNIFIED_ATTN: [non-causal attention not supported], TRITON_ATTN: [non-causal attention not supported]}.
```

### Interpretation

This means:

1. DFlash internally requires `use_non_causal=True`.
2. The ROCm attention backends currently available to vLLM do not support that execution mode.
3. Therefore DFlash cannot run on this path, even though vLLM recognizes it.

Important clarification:

- Removing `--attention-backend flash_attn` does not remove the need for non-causal attention.
- The need for non-causal attention comes from the DFlash execution path itself.
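
For anyone reproducing this, the backend can presumably also be forced through vLLM's `VLLM_ATTENTION_BACKEND` environment variable (assumption: this nightly treats it the same as the `--attention-backend` flag); this is the route that leads to Failure path B below:

```bash
# Assumed equivalent to passing --attention-backend flash_attn;
# remaining flags identical to the container command above.
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /model/Qwen/Qwen3.5-27B \
  --speculative-config '{"method":"dflash","model":"/model/Qwen/Qwen3.5-27B-DFlash","num_speculative_tokens":15}'
```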

## Failure path B: forcing `--attention-backend flash_attn`

When forcing `--attention-backend flash_attn`, the run progressed much further:

- the target model loaded successfully
- the draft model loaded successfully
- weights were loaded
- compilation/profiling steps ran

Then it failed with:

```text
AssertionError: FlashAttention version not detected.
```

Even though `pip list | grep flash` showed:

```text
flash_attn 2.8.3
```
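
A quick way to tell "package installed" apart from "module actually importable" (a hypothetical diagnostic; an import failure, e.g. from a CUDA-only `flash_attn` build on ROCm, would explain the detection error even though `pip list` finds the package):

```bash
# pip metadata alone doesn't prove the compiled extension loads;
# try importing it in the same environment the server uses.
docker exec dflash-vllm-rocm-nightly python3 -c \
  "import flash_attn; print(flash_attn.__version__)"
```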

## Short summary

DFlash is recognized by the ROCm vLLM nightly, but the run still fails because the available ROCm attention backends do not support the non-causal attention DFlash requires. Forcing `flash_attn` avoids that selection failure but then fails later because vLLM does not detect a usable FlashAttention build, even though the `flash_attn` package is installed.
