When using MLC LLM with ROCm on a Radeon 7900 XTX, I'm noticing a very large time to first token. With context lengths around 4k, I'm seeing 4-5 second delays before generation starts when running Llama 3.1 8B at q4f16_1 quantization.
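For reference, the delay I'm describing is time to first streamed token over the server's OpenAI-compatible endpoint. A minimal sketch of how it can be measured (the base URL, port, and model id below are placeholders for my setup, adjust them to yours):

```python
import time
from openai import OpenAI

# Rough time-to-first-token measurement over the OpenAI-compatible endpoint
# exposed by `mlc_llm serve`. The base URL, port, and model id are
# placeholders -- substitute whatever your server actually reports.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

prompt = "word " * 4000  # crude stand-in for a ~4k-token context

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct-q4f16_1-MLC",  # placeholder model id
    messages=[{"role": "user", "content": prompt}],
    stream=True,
    max_tokens=32,
)
for chunk in stream:
    # the first chunk that carries text marks the end of prefill
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"time to first token: {time.perf_counter() - start:.2f}s")
        break
```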
I've been using MLC LLM for quite a while on CUDA with an Nvidia 3080 with great success. I never benchmarked the performance, but I was using it for a real-time assistant and never noticed large delays.
One more factor to consider: on the Nvidia GPU I was running MLC in a Docker container (using the WSL2 integration), whereas on the AMD GPU I'm running MLC directly inside WSL2. That's mostly because it doesn't currently seem possible to pass ROCm support through Docker's WSL2 integration.
I've tried tweaking the context window size and prefill chunk size to no avail, and I've tried different models with similar results. My next step was going to be trying Vulkan directly on Windows, but based on everything I've read I'd expect ROCm to outperform Vulkan.
Does anyone have any insight into what could cause a slowdown like this?
After some testing, I found that setting my server to "interactive" mode made it behave much more like I expected: the first response took about the same time, but subsequent responses were much, much faster. This leads me to think something is wrong with KV cache handling at batch sizes greater than 1 on ROCm.
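For anyone hitting the same thing, this is the switch I mean. I've been setting the mode on `mlc_llm serve`, so the sketch below of doing the same through the Python engine is an assumption about that API (the `mode` keyword and the model id are placeholders, not a verified recipe):

```python
from mlc_llm import MLCEngine

# Sketch of the "interactive" vs "server" mode switch. I set this on the
# serve command; the `mode` keyword and the model id here are assumptions
# about the Python API rather than something I've verified.
model = "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC"  # placeholder build
engine = MLCEngine(model, mode="interactive")  # vs. mode="server" (batching)

for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "hello"}],
    model=model,
    stream=True,
):
    for choice in chunk.choices:
        print(choice.delta.content or "", end="", flush=True)
print()
engine.terminate()
```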
Based on this, I think it is likely the same underlying issue as #2992.
In addition to the server mode, the context window size also seems to matter a lot. In my testing, if I set the context window size to anything less than double the size of the context I'm actually using, I see the same issue, as though the input isn't being cached at all (see the sketch after the numbers below).
As an example, if I'm using a context of around 4.5k tokens:
- Server in interactive mode with an 8k context window size:
  - First command: 5s delay
  - Subsequent commands: 5s delay
- Server in interactive mode with a 10k context window size:
  - First command: 5s delay
  - Subsequent commands: 0.5s delay
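Those numbers come from essentially this test: send the same long prompt twice and compare time to first token between the first and second request. Same placeholder endpoint and model id as the earlier sketch:

```python
import time
from openai import OpenAI

# Same measurement as the earlier sketch, run twice with an identical prompt.
# With a 10k context window the second request comes back in ~0.5s; with an
# 8k window both requests take ~5s, as if nothing was cached.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
prompt = "lorem ipsum " * 1500  # stand-in for a ~4.5k-token context

def time_to_first_token(text: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Llama-3.1-8B-Instruct-q4f16_1-MLC",  # placeholder model id
        messages=[{"role": "user", "content": text}],
        stream=True,
        max_tokens=16,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

print(f"first request:  {time_to_first_token(prompt):.2f}s")
print(f"second request: {time_to_first_token(prompt):.2f}s")
```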