Commit 329bf87

improve horovod docu

1 parent 7db6b62 commit 329bf87

File tree

1 file changed (+8, -2 lines)

doc/TRAINING.rst

@@ -203,9 +203,15 @@ If you have a capable compute architecture, it is possible to distribute the tra
 Horovod is capable of using MPI and NVIDIA's NCCL for highly optimized inter-process communication.
 It also offers `Gloo <https://github.com/facebookincubator/gloo>`_ as an easy-to-setup communication backend.
 
-For more information about setup or tuning of Horovod please visit `Horovod's Github <https://github.com/horovod/horovod>`_.
+For more information about setup or tuning of Horovod, please visit `Horovod's documentation <https://horovod.readthedocs.io/en/stable/summary_include.html>`_.
 
-To train on 4 machines using 4 GPUs each:
+Horovod is expected to run on heterogeneous systems (e.g. a different number or model of GPUs per machine).
+However, this can cause unpredictable problems and requires manual intervention in the training code.
+Therefore, we only support homogeneous systems, meaning the same hardware and also the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on each machine.
+The only exception is a different number of GPUs per machine, since this can be controlled via ``horovodrun -H``.
+
+Detailed documentation on how to run Horovod is provided `here <https://horovod.readthedocs.io/en/stable/running.html>`_.
+The short command to train on 4 machines using 4 GPUs each:
 
 .. code-block:: bash
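The body of that bash block was not captured in the diff above. As a minimal sketch, this is the generic ``horovodrun`` form for 4 machines with 4 GPU slots each; the hostnames ``server1``–``server4`` and the ``train.py`` entry point are placeholders, not this project's actual invocation:

```bash
# Launch 16 worker processes in total: 4 hosts x 4 slots (GPUs) each.
# host:slots pairs go to -H; -np must equal the sum of all slots.
# server1..server4 and train.py are hypothetical names for illustration.
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 train.py
```

The ``-H`` host/slot list is also what allows the "different number of GPUs per machine" exception mentioned above, e.g. ``-H server1:4,server2:2`` with ``-np 6``.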