Description
❓ General Questions
When using MLC LLM with ROCm on a Radeon 7900 XTX, I am noticing a very large time to first token. With context lengths around 4k, I'm seeing delays of 4–5 seconds before generation starts when running Llama 3.1 8B at q4f16_1 quantization.
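To make the report concrete, this is roughly the kind of measurement I'm doing, sketched from the MLCEngine quick-start in the docs. The model id and prompt below are placeholders rather than my exact setup, which is a local Llama 3.1 8B q4f16_1 build with a ~4k-token context:

```python
# Rough sketch of how I'm observing time to first token with a long prompt.
# Model id and prompt are placeholders for my actual Llama 3.1 8B q4f16_1
# build and ~4k-token context.
import time
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC"  # placeholder model id
engine = MLCEngine(model)

# Stand-in for my real ~4k-token context.
long_prompt = "Summarize the following notes:\n" + ("lorem ipsum " * 2000)

start = time.perf_counter()
first_token_time = None
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": long_prompt}],
    model=model,
    stream=True,
):
    if first_token_time is None:
        # Time from request submission until the first streamed chunk arrives.
        first_token_time = time.perf_counter() - start
    # Drain the rest of the stream normally.

print(f"time to first token: {first_token_time:.2f} s")
engine.terminate()
```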
I've been using MLC LLM for quite a while on CUDA with an Nvidia 3080 with great success. I never benchmarked the performance, but I was using it for a real-time assistant and never noticed large delays.
One more factor to consider: on the Nvidia GPU, I was running MLC in a Docker container (using the WSL2 integration), whereas with the AMD GPU I am now running MLC directly within WSL2. This is mostly because it doesn't currently seem possible to pass ROCm support through Docker's WSL2 integration.
I've tried tweaking the context size and prefill chunk size to no avail, and I've tried different models with similar results. My next course of action was going to be trying Vulkan directly in Windows, but based on everything I've read I would expect ROCm to perform better than Vulkan.
Does anyone have any insight into what could cause a slowdown like this?