
Commit ce4d7cc

Adjust project ID for lesson 9
1 parent 689df49 commit ce4d7cc

1 file changed: +7 -7 lines changed

09_Extreme_scale_AI/README.md

Lines changed: 7 additions & 7 deletions
@@ -7,7 +7,7 @@ These examples are based on the ROCm container provided to you at:
 
 The examples also assume there is an allocation in place to be used for one or more nodes. That could be accomplished with, e.g.:
 ```
-N=2 ; salloc -p standard-g --account=project_465001707 --reservation=AI_workshop_2 --threads-per-core 1 --exclusive -N $N --gpus $((N*8)) -t 1:00:00 --mem 0
+N=2 ; salloc -p standard-g --account=project_465001958 --reservation=AI_workshop_2 --threads-per-core 1 --exclusive -N $N --gpus $((N*8)) -t 1:00:00 --mem 0
 ```
 
 With the allocation and container set we can do a quick smoke test to make sure PyTorch can detect the GPUs available in a node:
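The smoke-test command itself falls outside this hunk. As a minimal sketch of what such a check can look like (the container path and the exact singularity invocation are assumptions, not the lesson's command):

```
# Assumed container path - substitute the ROCm container actually provided for the course.
CONTAINER=/path/to/rocm-pytorch-container.sif

# Run a single task on one node of the allocation and ask PyTorch how many GPUs it sees.
srun -N 1 -n 1 --gpus 8 singularity exec "$CONTAINER" \
    python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```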
@@ -203,12 +203,12 @@ We have downloaded in advance the data set (ImageNet) as that is a time consuming
 
 Here's how the data is organized:
 * Reduced set in scratch storage:
-  * /scratch/project_465001707/data-sets/data-resnet-small
+  * /scratch/project_465001958/data-sets/data-resnet-small
 * Reduced set in flash storage:
-  * /flash/project_465001707/data-sets/data-resnet-small
+  * /flash/project_465001958/data-sets/data-resnet-small
 
 * Tarball container for the data set:
-  * /flash/project_465001707/data-sets/data-resnet-small.tar
+  * /flash/project_465001958/data-sets/data-resnet-small.tar
 
 The container is useful to move the data around, as it is much faster to move a single large file than many small files, e.g. it is better to untar a container than to copy an expanded dataset from elsewhere. The folders `/scratch` and `/flash` contain symbolic links, so it is important to mount `/pfs` in your containers, as these links point there.
 
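As a rough sketch of the staging pattern described above (expanding the single tarball and binding `/pfs` so the symbolic links resolve), with a destination directory that is only an illustration:

```
# Expand the single tarball instead of copying thousands of small files;
# the destination directory below is just an example, not the lesson's path.
mkdir -p /scratch/project_465001958/$USER/data
tar -C /scratch/project_465001958/$USER/data \
    -xf /flash/project_465001958/data-sets/data-resnet-small.tar

# When running the container, bind /pfs alongside /scratch and /flash so the
# symbolic links resolve inside the container. CONTAINER is the assumed image
# path from the smoke-test sketch above.
singularity exec -B /pfs,/scratch,/flash "$CONTAINER" ls /scratch/project_465001958
```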
@@ -237,7 +237,7 @@ srun -N $N -n $((N*8)) --gpus $((N*8)) \
 --dist-url "tcp://$(scontrol show hostname "$SLURM_NODELIST" | head -n1):45678" \
 --dist-backend 'nccl' \
 --epochs 2 \
-/flash/project_465001707/data-sets/data-resnet-small
+/flash/project_465001958/data-sets/data-resnet-small
 ```
 Here we are training ResNet-50 over 2 epochs with a batch size of 512 per GPU. We use the same 7 workers as before. The dataset is given by the last argument - we use the small data set, but you are free to try the complete one. The other arguments are similar to what we used before to translate information from the SLURM environment.
 
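As an illustration of how the SLURM environment is typically translated into the distributed-training settings mentioned above (a sketch only; the lesson's launcher may wire this up differently):

```
# Each srun task can derive its distributed identity from SLURM variables:
RANK=$SLURM_PROCID          # global rank of this task across all nodes
WORLD_SIZE=$SLURM_NPROCS    # total number of tasks, i.e. N nodes x 8 GPUs
LOCAL_RANK=$SLURM_LOCALID   # task index within the node, used to pick the GPU
MASTER_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n1)
echo "rank $RANK of $WORLD_SIZE (local $LOCAL_RANK), rendezvous at tcp://$MASTER_NODE:45678"
```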
@@ -306,7 +306,7 @@ srun -N $N -n $((N*8)) --gpus $((N*8)) \
 --local_rank \$SLURM_LOCALID \
 --world-size \$SLURM_NPROCS \
 --epochs 2 \
-/flash/project_465001707/data-sets/data-resnet-small
+/flash/project_465001958/data-sets/data-resnet-small
 ```
 Note that, in spite of this being a similar example to what we tested before, the options and their meaning changed a bit, e.g. the number of workers is per GPU in this case.
 
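To make the per-GPU workers remark concrete, a small back-of-the-envelope check (the 7-worker figure comes from the earlier example; the totals are illustrative and should be weighed against the CPU cores actually available on the node):

```
# With workers specified per GPU, the data-loading processes per node become:
WORKERS_PER_GPU=7
GPUS_PER_NODE=8
echo $((WORKERS_PER_GPU * GPUS_PER_NODE))   # 56 loader processes per node
```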
@@ -323,7 +323,7 @@ You are welcome to try larger data-sets and from different storage types to see
 
 If limited by I/O, we could try in-memory storage. LUMI nodes don't have local SSDs but do have a significant amount of memory, so that could be sufficient for your needs. To store data in memory it is sufficient to place it as files under `/tmp`, as that lives in memory. So we can do:
 ```
-srun tar -C /tmp -xf /flash/project_465001707/data-sets/data-resnet-small.tar
+srun tar -C /tmp -xf /flash/project_465001958/data-sets/data-resnet-small.tar
 ```
 to expand the trimmed-down data set into memory, and then we can just run our model training there:
 ```
