Using AVX to accelerate inter-node proto LL #1675

Open
visualxu opened this issue Apr 7, 2025 · 0 comments
visualxu commented Apr 7, 2025

Hi,
I ran some experiments using AVX instructions in inter_node_proto_ll and saw better performance.
Performance with and without AVX:
[Image: performance with AVX vs. without AVX]
Test environment:
Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
256 GB DRAM
2 × A100 80GB
100 Gbps Mellanox ConnectX-5
Here is a simple reference implementation (the buffer address and size must be aligned to 256/512):
visualxu@1c502b2#diff-45a9034a0c75cbfbbb34e853a43f6513c1d4c933eccf6adca705abe234fc1113
I hope this idea is helpful to the NCCL community.
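
For readers who want a feel for the alignment requirement mentioned above, here is a minimal sketch of an AVX2 flag check over a buffer of 16-byte lines. It is not the code from the linked commit. The `LLLine` layout (data word, flag, data word, flag), the function names, and the choice to vectorize the flag check are all assumptions made for illustration. It targets x86-64 with GCC or Clang, and it ignores the memory-ordering concerns of reading a buffer that another device writes.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Hypothetical 16-byte line: two 32-bit data words, each followed by a flag.
struct alignas(16) LLLine {
  uint32_t data1, flag1, data2, flag2;
};

// Scalar reference: every flag in every line must equal `flag`.
bool linesReadyScalar(const LLLine* lines, size_t n, uint32_t flag) {
  for (size_t i = 0; i < n; ++i)
    if (lines[i].flag1 != flag || lines[i].flag2 != flag) return false;
  return true;
}

// AVX2 version: checks two lines (32 bytes) per iteration.
// Assumes `lines` is 32-byte aligned and `n` is even.
__attribute__((target("avx2")))
bool linesReadyAvx2(const LLLine* lines, size_t n, uint32_t flag) {
  const __m256i want = _mm256_set1_epi32(static_cast<int>(flag));
  for (size_t i = 0; i < n; i += 2) {
    // Aligned 256-bit load; this is where the alignment requirement comes from.
    __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(lines + i));
    __m256i eq = _mm256_cmpeq_epi32(v, want);
    // One mask bit per 32-bit lane; lanes 1, 3, 5, 7 hold the flags.
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
    if ((mask & 0xAA) != 0xAA) return false;
  }
  return true;
}

// Dispatcher: use AVX2 only when the CPU supports it and the buffer qualifies.
bool linesReady(const LLLine* lines, size_t n, uint32_t flag) {
  bool aligned = reinterpret_cast<uintptr_t>(lines) % 32 == 0 && n % 2 == 0;
  if (aligned && __builtin_cpu_supports("avx2"))
    return linesReadyAvx2(lines, n, flag);
  return linesReadyScalar(lines, n, flag);
}
```

The dispatcher falls back to the scalar loop whenever the address or size does not meet the alignment assumption, so callers do not need to check it themselves. An AVX-512 variant would follow the same pattern with 64-byte alignment and four lines per iteration.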

visualxu changed the title from "use avx to speed up inter node protoll" to "using avx to accelerate inter node protoll" on Apr 7, 2025