How to enable multi-GPU training (1 model, multiple GPUs) on a server with limited memory?

### **_Description_** 

Hi @lengstrom, thanks for your wonderful work!

My goal is to train a ResNet-18 on ImageNet on my server, using multi-GPU training to speed up the process. The server has four RTX 2080 Ti GPUs and 46 GB of memory, **which is not enough to load ImageNet into memory**.

I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (**Scenario: Large scale datasets** and **Scenario: Multi-GPU training (1 model, multiple GPUs)**).

Right now, I can train a ResNet-18 on a single GPU by setting `os_cache=False`. **However, when I run the provided `train_imagenet.py` with `in_memory=0` and `distributed=1` as shown below, it fails with the errors listed at the bottom.** Could you please tell me how to solve this issue?

---
### **_Command_** 

```
python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    ... \
    --data.in_memory=0 \
    --training.distributed=1
```

---
### **_Message_**


Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.

=> Logging in ...

**Not enough memory; try setting quasi-random ordering
(`OrderOption.QUASI_RANDOM`) in the dataloader constructor's `order` argument.**

**Full error below:**
  0%|                                                                                                          | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
    self.close()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
    self.memory_context.__exit__(None, None, None)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
    self.executor.__exit__(*args)
**AttributeError: 'ProcessCacheContext' object has no attribute 'executor'**
Traceback (most recent call last):
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in <module>
    ImageNetTrainer.launch_from_args()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
    ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
**torch.multiprocessing.spawn.ProcessRaisedException:** 

**-- Process 0 terminated with the following error:**
Traceback (most recent call last):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
    cls.exec(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
    trainer.train()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
    train_loss = self.train_loop(epoch)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
    for ix, (images, target) in enumerate(iterator):
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in __iter__
    return EpochIterator(self, selected_order)
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in __init__
    raise e
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in __init__
    self.memory_context.__enter__()
  File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in __enter__
    self.memory = np.zeros((self.schedule.num_slots, self.page_size),
**numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8**
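
For reference, the size in the last line can be checked by hand from the array shape in the error. The short sketch below only redoes that arithmetic; the names `num_slots` and `page_size` mirror the attributes in the traceback, while `allocation_gib` and `available_gib` are hypothetical helpers added for illustration, and treating the server's 46G as GiB is an assumption.

```python
# Figures copied from the ArrayMemoryError above: shape (29251, 8388608), dtype uint8.
num_slots = 29251      # self.schedule.num_slots in the traceback
page_size = 8388608    # self.page_size in the traceback; 1 byte per uint8 element


def allocation_gib(slots: int, page_bytes: int) -> float:
    """Size in GiB of a (slots, page_bytes) uint8 array (hypothetical helper)."""
    return slots * page_bytes / 2**30


requested_gib = allocation_gib(num_slots, page_size)

# Assumption: the "46G" of server memory from the description, read as GiB.
available_gib = 46

print(f"page size : {page_size / 2**20:.0f} MiB")
print(f"requested : {requested_gib:.1f} GiB")
print(f"available : {available_gib} GiB")
print(f"ratio     : {requested_gib / available_gib:.1f}x")
```

So each page is 8 MiB, and the single buffer requested by one process is already several times larger than the memory of the whole machine, which seems consistent with the "Not enough memory" hint printed before the traceback.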
