 
 These examples are based on the ROCm container provided to you at:
 ```
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
 ```
 
 The examples also assume there is an allocation in place to be used for one or more nodes. That could be accomplished with, e.g.:
@@ -13,7 +13,7 @@ The examples also assume there is an allocation in place to be used for one or m
 With the allocation and container set we can do a quick smoke test to make sure Pytorch can detect the GPUs available in a node:
 ```
 srun singularity exec \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  bash -c '$WITH_CONDA ; \
  python -c "import torch; print(torch.cuda.device_count())"'
 ```
@@ -110,7 +110,7 @@ srun -N1 -n8 --gpus 8 \
  --cpu-bind=mask_cpu=0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000000fe000000,0x00000000000000fe,0x000000000000fe00,0x000000fe00000000,0x0000fe0000000000\
  singularity exec \
  -B .:/workdir \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  /workdir/run.sh \
  python -u /workdir/GPT-neo-IMDB-finetuning-mp.py \
  --model-name gpt-imdb-model \
@@ -128,7 +128,7 @@ srun -N2 -n16 --gpus 16 \
  -B /opt/cray \
  -B /usr/lib64/libcxi.so.1 \
  -B .:/workdir \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif\
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif\
  /workdir/run.sh \
  python -u /workdir/GPT-neo-IMDB-finetuning-mp.py \
  --model-name gpt-imdb-model \
@@ -162,7 +162,7 @@ srun -N2 -n16 --gpus 16 \
  -B /opt/cray \
  -B /usr/lib64/libcxi.so.1 \
  -B .:/workdir \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  /workdir/run-profile.sh \
  python -u /workdir/GPT-neo-IMDB-finetuning-mp.py \
  --model-name gpt-imdb-model \
@@ -225,7 +225,7 @@ srun -N $N -n $((N*8)) --gpus $((N*8)) \
  -B /usr/lib64/libcxi.so.1 \
  -B .:/workdir \
  -B /flash -B /pfs \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  /workdir/run.sh \
  python -u /workdir/cv_example.py \
  -a resnet50 \
@@ -280,7 +280,7 @@ https://github.com/microsoft/DeepSpeedExamples/raw/master/training/imagenet/conf
 Parse the files to create some understanding of the differences.
 
 ### 2. Running DeepSpeed with required dependencies
- This container has DeepSpeed already installed so we will leverage it: `/appl/local/containers/sif-images/lumi-pytorch-rocm-6.1.3-python-3.12-pytorch-v2.4.1.sif`.
+ This container has DeepSpeed already installed so we will leverage it: `/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif`.
 
 You can run the example like the following, however some dependencies might be missing. Can you install those? Can you setup the `spawn` multiprocessing mode?
 ```
@@ -294,7 +294,7 @@ srun -N $N -n $((N*8)) --gpus $((N*8)) \
  -B /usr/lib64/libcxi.so.1 \
  -B .:/workdir \
  -B /flash -B /pfs \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  /workdir/run.sh \
  python -u /workdir/cv_example_ds.py \
  --deepspeed \
@@ -336,7 +336,7 @@ srun -N $N -n $((N*8)) --gpus $((N*8)) \
  -B /usr/lib64/libcxi.so.1 \
  -B .:/workdir \
  -B /flash -B /pfs \
- /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.7.1.sif \
+ /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif \
  /workdir/run.sh \
  python -u /workdir/cv_example.py \
  -a resnet50 \