
[BUG] Windows 11 Tensor Parallelism slow #760

Open
frenzybiscuit opened this issue Mar 23, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@frenzybiscuit

OS: Windows

GPU Library: CUDA 12.x

Python version: 3.12

PyTorch version: 2.6.0

Model: No response

Describe the bug

With TabbyAPI on Windows 11 Pro 64-bit (without WSL), regular GPU split mode works fine, but tensor parallelism tanks performance by roughly 25%.

On Debian 12 on the same machine, tensor parallelism increases performance by roughly 25%.

I tried WSL2 to get around this, but performance there was less than half of what it was on native Windows 11.

There are no logs or errors that I can see.

Reproduction steps

Install latest cuda, visual studio and git. Then install exllamav2 & TabbyAPI.

Expected behavior

Tensor parallelism should improve performance on Windows 11, as it does on Debian 12.

Logs

.

Additional context

.

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@frenzybiscuit frenzybiscuit added the bug Something isn't working label Mar 23, 2025
@turboderp
Member

Tensor-P in ExLlama was always an experimental feature. It's a single-threaded multi-device approach that CUDA just doesn't like. I tried multi-threading but that only makes it worse since CUDA itself is single-threaded and blocks the entire process on every API call or kernel launch anyway. On Linux, the overhead this causes doesn't seem to outweigh the benefits of tensor-P completely, but apparently Windows 11 switches things up now? I would have to investigate.
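The serialization described above can be illustrated with a toy, stdlib-only simulation (no real CUDA involved). Here a process-wide `driver_lock` is a hypothetical stand-in for CUDA serializing API calls and kernel launches within a single process, which is why adding threads does not add concurrency:

```python
import threading
import time

# Hypothetical stand-in for the CUDA driver serializing every API call /
# kernel launch within one process (this is a sketch, not real CUDA).
driver_lock = threading.Lock()

def launch_kernel(device_id, launch_cost=0.001):
    # Every "launch" must take the process-wide lock, so launches issued
    # from different threads cannot overlap.
    with driver_lock:
        time.sleep(launch_cost)  # stand-in for launch overhead

def worker(device_id, n_launches):
    for _ in range(n_launches):
        launch_kernel(device_id)

def run_multithreaded(n_devices=2, n_launches=50):
    """Time n_devices threads each issuing n_launches 'kernel launches'."""
    threads = [threading.Thread(target=worker, args=(d, n_launches))
               for d in range(n_devices)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_multithreaded()
# Despite two threads, wall time is roughly 2 * 50 * 0.001 s: the shared
# lock serializes everything, so threading adds overhead without overlap.
print(f"{elapsed:.3f}s")
```

In this sketch, doubling the thread count never shortens the wall time, mirroring why multi-threading the single-process dispatcher "only makes it worse".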

On the other hand, the proper way to actually solve it would be with a multi-process framework, which is coming in the form of ExLlamaV3 (soon!). It will be a little while before V3 actually has TP support, but it is better suited for full multi-process parallelism. So I don't think it's worth putting much more time into V2's experimental features at this point.
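To contrast with the threaded sketch above: in a multi-process design each device gets its own process, and with it its own driver state, so launches no longer contend for one process-wide lock. A minimal stdlib sketch (again simulated with `time.sleep`, not real CUDA, and not ExLlamaV3's actual design):

```python
import multiprocessing as mp
import time

def device_worker(device_id, n_launches, launch_cost=0.001):
    # Each process has its own simulated "driver" state, so launches for
    # different devices proceed independently instead of serializing.
    for _ in range(n_launches):
        time.sleep(launch_cost)

def run_multiprocess(n_devices=2, n_launches=50):
    """Time n_devices processes each issuing n_launches 'kernel launches'."""
    procs = [mp.Process(target=device_worker, args=(d, n_launches))
             for d in range(n_devices)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # Workers overlap, so wall time approaches one device's launch time
    # (plus process startup overhead) rather than the sum over devices.
    print(f"{run_multiprocess():.3f}s")
```

The trade-off is process startup and inter-process communication cost, which is why this suits a ground-up design like V3 better than a retrofit of V2's experimental tensor-P.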
