Hi,
I ran some experiments using AVX instructions on inter_node_proto_ll and achieved better performance.
Performance with AVX vs. without AVX:
Here is the environment:
Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
256GB DRAM
A100 80GB * 2
100 Gbps Mellanox ConnectX-5
Here is a simple reference implementation (the buffer address and size must be aligned to 256/512): visualxu@1c502b2#diff-45a9034a0c75cbfbbb34e853a43f6513c1d4c933eccf6adca705abe234fc1113
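To illustrate the general idea (this is not the code from the linked commit), here is a minimal sketch of how a CPU-side scan over LL-style buffer lines could be vectorized with AVX2. It assumes a 16-byte line made of two 4-byte data words, each followed by a 4-byte flag, and a buffer aligned to 32 bytes (256 bits). The names `LLLine`, `linesReadyScalar`, `linesReadyAvx2`, and `linesReady` are hypothetical, and the sketch ignores the `volatile` accesses and memory-ordering concerns that real proxy code polling GPU-written memory would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

// Hypothetical stand-in for a 16-byte LL line: data word, flag, data word, flag.
struct alignas(16) LLLine {
  uint32_t data1, flag1, data2, flag2;
};

// Scalar reference: every line must carry the expected flag in both slots.
static bool linesReadyScalar(const LLLine* lines, size_t n, uint32_t flag) {
  for (size_t i = 0; i < n; ++i) {
    if (lines[i].flag1 != flag || lines[i].flag2 != flag) return false;
  }
  return true;
}

#ifdef SKETCH_HAVE_X86
// AVX2 variant: one aligned 256-bit load covers two lines (eight 32-bit lanes).
__attribute__((target("avx2")))
static bool linesReadyAvx2(const LLLine* lines, size_t n, uint32_t flag) {
  // _mm256_load_si256 requires a 32-byte aligned address.
  assert(reinterpret_cast<uintptr_t>(lines) % 32 == 0);
  const __m256i want = _mm256_set1_epi32(static_cast<int>(flag));
  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(lines + i));
    __m256i eq = _mm256_cmpeq_epi32(v, want);
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
    // Flags sit in lanes 1, 3, 5, 7; data lanes are ignored.
    if ((mask & 0xAA) != 0xAA) return false;
  }
  // Handle a trailing odd line with the scalar path.
  return linesReadyScalar(lines + i, n - i, flag);
}
#endif

// Dispatch at run time so the sketch still works on CPUs without AVX2.
static bool linesReady(const LLLine* lines, size_t n, uint32_t flag) {
#ifdef SKETCH_HAVE_X86
  if (__builtin_cpu_supports("avx2")) return linesReadyAvx2(lines, n, flag);
#endif
  return linesReadyScalar(lines, n, flag);
}
```

A caller would declare the buffer with `alignas(32)` (or allocate it with an aligned allocator) before calling `linesReady`. An AVX-512 variant would need 64-byte alignment and would cover four lines per load.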
I hope this idea can be helpful to the NCCL community.
visualxu changed the title from "use avx to speed up inter node protoll" to "using avx to accelerate inter node protoll" on Apr 7, 2025