Requires a ROCm image with PyTorch (a recent version is recommended, i.e. >= 2.5).
python setup.py develop
See the individual issues below for running instructions.
It relies on GPU peer access over unified memory: each rank allocates an array of flags that the other ranks write to as a notification mechanism.
Run using:
torchrun --nproc-per-node 8 amd_repro/multi_gpu_barrier.py
It hangs after a few iterations despite all attempts to invalidate and write back caches.
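The flag-array scheme above can be sketched with CPU threads standing in for ranks (a hypothetical illustration of the general protocol; the exact flag handling in amd_repro/multi_gpu_barrier.py may differ):

```python
import threading

NUM_RANKS = 4
ITERATIONS = 5

# Each rank owns one array with a slot per peer; peers write their
# generation number into it, and the owner spins until all slots match.
flags = [[0] * NUM_RANKS for _ in range(NUM_RANKS)]

def barrier(rank, gen):
    # Notify every rank (including ourselves) by writing into its array.
    for peer in range(NUM_RANKS):
        flags[peer][rank] = gen
    # Spin until every peer has announced the same generation.
    while any(f < gen for f in flags[rank]):
        pass

results = []
lock = threading.Lock()

def worker(rank):
    for it in range(1, ITERATIONS + 1):
        barrier(rank, it)
        with lock:
            results.append((it, rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On the GPU the spin loop reads flags written by peer devices, which is why cache invalidation/write-back matters there; the CPU-thread version needs none of that.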
It takes over 60 seconds to reserve 256 GB of virtual address space:
python amd_repro/virtual_memory_slow.py
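For contrast, reserving address space without committing any memory is normally near-instant. A Linux-only CPU-side sketch of such a reservation (an analogy for the GPU-side reservation this repro exercises, not the repro itself; the mmap constants below are x86-64 Linux values):

```python
import ctypes
import ctypes.util
import time

# x86-64 Linux mmap constants.
PROT_NONE = 0x0
MAP_PRIVATE = 0x02
MAP_ANONYMOUS = 0x20
MAP_NORESERVE = 0x4000
MAP_FAILED = ctypes.c_void_p(-1).value

libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

SIZE = 256 * 1024**3  # 256 GiB of address space, nothing committed

start = time.perf_counter()
# PROT_NONE + MAP_NORESERVE: pure address-space reservation, no backing pages.
addr = libc.mmap(None, SIZE, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)
elapsed = time.perf_counter() - start
assert addr not in (None, MAP_FAILED)
print(f"reserved 256 GiB of address space in {elapsed:.6f} s")
libc.munmap(ctypes.c_void_p(addr), SIZE)
```

The CPU-side reservation completes in microseconds, which is the kind of cost one would expect from the GPU-side call as well.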
Exporting virtual memory segfaults:
python amd_repro/virtual_memory_export.py
Deallocating virtual memory segfaults:
python amd_repro/virtual_memory_dealloc.py
If IPC is enabled, the event's query() call always returns true.
torchrun --nproc-per-node 2 amd_repro/ipc_event.py
Output:
0 Waiting ...
1 Done
0 Waiting ...
0 Waiting ...
0 Waiting ...
0 Waiting ...
0 Done
Expected output (on NVIDIA H100):
0 Waiting ...
1 Waiting ...
0 Waiting ...
1 Waiting ...
0 Waiting ...
0 Done
1 Done