Tensor-P in ExLlama has always been an experimental feature. It's a single-threaded multi-device approach that CUDA just doesn't like. I tried multi-threading, but that only makes things worse, since CUDA itself is single-threaded and blocks the entire process on every API call or kernel launch anyway. On Linux, the overhead this causes doesn't seem to outweigh the benefits of tensor-P completely, but apparently Windows 11 changes things? I would have to investigate.
On the other hand, the proper way to actually solve it would be a multi-process framework, which is coming in the form of ExLlamaV3 (soon!). It will be a little while before V3 actually has TP support, but it is better suited to full multi-process parallelism, so I don't think it's worth putting much more time into V2's experimental features at this point.
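To illustrate the point about single-threaded CUDA: even if you issue work to two devices from two Python threads, nothing is gained when every launch blocks the whole process. Here is a toy model of that effect (pure Python, not real CUDA; the global lock stands in for the driver serializing launches):

```python
import threading
import time

# Toy model (NOT real CUDA): each "kernel launch" holds a global lock,
# mimicking a driver that serializes all launches within one process.
driver_lock = threading.Lock()

def launch_kernel(duration=0.02):
    with driver_lock:          # every launch blocks the entire process
        time.sleep(duration)   # stand-in for per-launch overhead

def run_device(n_launches=5):
    for _ in range(n_launches):
        launch_kernel()

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Single-threaded: issue work to "both devices" one after the other.
serial = timed(lambda: [run_device() for _ in range(2)])

# Multi-threaded: one thread per "device" -- but because launches
# serialize on the lock, there is no wall-clock win.
def threaded():
    ts = [threading.Thread(target=run_device) for _ in range(2)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()

parallel = timed(threaded)
print(f"serial={parallel and serial:.2f}s threaded={parallel:.2f}s")
```

Both timings come out roughly equal, which is why multi-threading the V2 TP loop didn't help; a multi-process design sidesteps the problem because each process gets its own driver context.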
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.6.0
Model
No response
Describe the bug
With TabbyAPI on Windows 11 Pro 64-bit (no WSL), regular GPU split mode works fine, but tensor parallelism tanks performance by roughly 25%.
When using Debian 12 on the same machine, tensor parallelism increases performance by roughly 25%.
I tried WSL2 to get around this, but performance was less than half of what it was on Windows 11 native.
There are no logs or errors that I can see.
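For reference, the two modes compared above are toggled in TabbyAPI's config; a rough sketch of the relevant fragment (key names are assumptions based on TabbyAPI's sample config.yml, not verified here):

```yaml
# config.yml (fragment) -- key names assumed, check TabbyAPI's sample config
model:
  model_name: <model-dir>   # placeholder
  # Regular GPU split (works fine on Windows in this report):
  gpu_split: [12, 12]
  # Tensor parallelism (the slow path on Windows in this report):
  tensor_parallel: true
```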
Reproduction steps
Install latest cuda, visual studio and git. Then install exllamav2 & TabbyAPI.
Expected behavior
Tensor parallelism should improve (or at least not degrade) performance on Windows, as it does on Linux.
Logs
.
Additional context
.
Acknowledgements