
Very slow time to first token on ROCm #3119

Open
Jyers opened this issue Feb 5, 2025 · 2 comments
Labels
question Question about the usage

Comments


Jyers commented Feb 5, 2025

❓ General Questions

When using MLC LLM with ROCm on a Radeon 7900 XTX, I am noticing a very long time to first token. With context lengths around 4k, I'm seeing delays of 4 to 5 seconds before generation starts when running Llama 3.1 8B at q4f16_1 quantization.

I've been using MLC LLM for quite a while on CUDA with an Nvidia 3080 with great success. I never benchmarked the performance, but I was using it for a real-time assistant and never noticed large delays.

One more factor to consider: when running on the Nvidia GPU, I was running MLC in a Docker container (using the WSL2 integration). With the AMD GPU, however, I am now running MLC directly within WSL, mostly because it currently doesn't seem possible to pass ROCm support through Docker's WSL2 integration.

I've tried tweaking the context size and prefill chunk size to no avail, and I've tried different models with similar results. My next course of action was going to be trying Vulkan directly in Windows, but based on everything I've read, I would expect ROCm to perform better than Vulkan.
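
For reference, this is roughly how I've been launching the server and overriding those settings. The model path and the exact flag names are from my setup and the MLC docs I'm following, so they may differ slightly between versions:

```sh
# Serve Llama 3.1 8B (q4f16_1) on the ROCm device.
# --overrides adjusts the context window and prefill chunk size at launch;
# the values shown are just ones I experimented with.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
  --device rocm \
  --overrides "context_window_size=8192;prefill_chunk_size=2048"
```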

Does anyone have any insight into what could cause a slowdown like this?

Jyers added the question label on Feb 5, 2025

Jyers commented Feb 9, 2025

After some testing, I found that setting my server to "interactive" mode made it behave much more like I expected. The first response took about the same time, but subsequent responses were much faster. This leads me to think that something is wrong with cache handling at batch sizes greater than 1 on ROCm.
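
For context, the only change was the mode flag at launch (again, option names may vary by MLC version):

```sh
# Restarting the server in interactive mode (single-sequence serving)
# is what made subsequent requests fast for me.
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
  --device rocm \
  --mode interactive
```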

Based on this, I think it is likely to be the same underlying issue as: #2992


Jyers commented Feb 10, 2025

Another small update:

In addition to the server mode, the context window size also seems to matter a lot. Based on my testing, if I set the context window size to anything less than roughly double the context I'm actually using, I see the same issue: the input appears not to be cached at all.

As an example, with a context of around 4.5k tokens:

Server in interactive mode with an 8k context window:
  First command: ~5s delay
  Subsequent commands: ~5s delay

Server in interactive mode with a 10k context window:
  First command: ~5s delay
  Subsequent commands: ~0.5s delay
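
For what it's worth, I'm approximating time to first token by timing the first streamed byte from the OpenAI-compatible endpoint with curl; the host, port, and model name below are just my local defaults and would need adjusting for other setups:

```sh
# Time to first byte of a streamed chat completion, which roughly tracks
# time to first token. %{time_starttransfer} is curl's TTFB metric.
curl -s -N -o /dev/null \
  -w 'time to first byte: %{time_starttransfer}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"model": "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC", "messages": [{"role": "user", "content": "Hello"}], "stream": true}' \
  http://127.0.0.1:8080/v1/chat/completions
```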
