The container is useful for moving the data around, since it is much faster to move a single large file than many small files; e.g. it is better to untar a container than to copy an already expanded dataset from elsewhere. The folders `/scratch` and `/flash` contain symbolic links that point into `/pfs`, so it is important to mount `/pfs` in your containers for these links to resolve.
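As a minimal sketch of binding these paths when running a container (the image name and the command are placeholders, not the actual course container):

```
# Bind /pfs together with the convenience paths so that the /scratch
# and /flash symbolic links resolve inside the container.
srun singularity exec \
    -B /pfs,/scratch,/flash \
    my-pytorch-container.sif \
    ls /flash/project_465001958/data-sets
```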
Here we train ResNet-50 for 2 epochs with a batch size of 512 per GPU. We use the same 7 workers as before. The dataset is given by the last argument - we use the small data set, but you are free to try the complete one. The other arguments are similar to the ones we used before to translate information from the SLURM environment.
Note that, even though this is a similar example to what we tested before, the options and their meaning have changed a bit; e.g. the number of workers is per GPU in this case.
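As a rough sketch of what such an invocation could look like (the script name and option names below are illustrative placeholders, not necessarily the ones used in the actual example):

```
# Illustrative only: 2 epochs, batch size 512 per GPU, 7 data-loader
# workers per GPU, and the data set path as the last argument.
srun singularity exec -B /pfs my-pytorch-container.sif \
    python train_resnet50.py \
        --epochs 2 --batch-size 512 --workers 7 \
        /flash/project_465001958/data-sets/data-resnet-small
```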
You are welcome to try larger data-sets, and data from different storage types, to see how the performance is affected.
If we are limited by I/O, we could try in-memory storage. LUMI nodes don't have local SSDs, but they have a significant amount of memory, so that could be sufficient for your needs. To store data in memory, it is enough to place it as files under `/tmp`, as that file system lives in memory. So we can do:
```
srun tar -C /tmp -xf /flash/project_465001958/data-sets/data-resnet-small.tar
```
to expand the trimmed-down data set into memory, and then we can just run our model training there: