Tensor-P in ExLlama has always been an experimental feature. It's a single-threaded multi-device approach that CUDA just doesn't like. I tried multi-threading, but that only makes things worse, since CUDA itself is single-threaded and blocks the entire process on every API call or kernel launch anyway. On Linux, the overhead this causes doesn't seem to outweigh the benefits of tensor-P completely, but apparently Windows 11 changes things? I would have to investigate.
On the other hand, the proper way to actually solve it would be a multi-process framework, which is coming in the form of ExLlamaV3 (soon!). It will be a little while before V3 actually has TP support, but it is better suited to full multi-process parallelism, so I don't think it's worth putting much more time into V2's experimental features at this point.
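To illustrate the point about single-threaded CUDA: even if you issue work to two devices from two Python threads, nothing is gained when every launch blocks the whole process. Here is a toy model of that effect (pure Python, not real CUDA; the global lock stands in for the driver serializing launches):

```python
import threading
import time

# Toy model (NOT real CUDA): each "kernel launch" holds a global lock,
# mimicking a driver that serializes all launches within one process.
driver_lock = threading.Lock()

def launch_kernel(duration=0.02):
    with driver_lock:          # every launch blocks the entire process
        time.sleep(duration)   # stand-in for per-launch overhead

def run_device(n_launches=5):
    for _ in range(n_launches):
        launch_kernel()

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# Single-threaded: issue work to "both devices" one after the other.
serial = timed(lambda: [run_device() for _ in range(2)])

# Multi-threaded: one thread per "device" -- but because launches
# serialize on the lock, there is no wall-clock win.
def threaded():
    ts = [threading.Thread(target=run_device) for _ in range(2)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()

parallel = timed(threaded)
print(f"serial={parallel and serial:.2f}s threaded={parallel:.2f}s")
```

Both timings come out roughly equal, which is why multi-threading the V2 TP loop didn't help; a multi-process design sidesteps the problem because each process gets its own driver context.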
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.6.0
Model
No response
Describe the bug
With TabbyAPI on Windows 11 Pro 64-bit (no WSL), regular GPU split mode works fine, but tensor parallelism tanks performance by roughly 25%.
When using Debian 12 on the same machine, tensor parallelism increases performance by roughly 25%.
I tried WSL2 to get around this, but performance was less than half of what it was on Windows 11 native.
There are no logs or errors that I can see.
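For reference, the two modes compared above are toggled in TabbyAPI's config; a rough sketch of the relevant fragment (key names are assumptions based on TabbyAPI's sample config.yml, not verified here):

```yaml
# config.yml (fragment) -- key names assumed, check TabbyAPI's sample config
model:
  model_name: <model-dir>   # placeholder
  # Regular GPU split (works fine on Windows in this report):
  gpu_split: [12, 12]
  # Tensor parallelism (the slow path on Windows in this report):
  tensor_parallel: true
```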
Reproduction steps
Install latest cuda, visual studio and git. Then install exllamav2 & TabbyAPI.
Expected behavior
Tensor parallelism should improve (or at least not degrade) performance on Windows, as it does on Linux.
Logs
.
Additional context
.
Acknowledgements