Multinode DDP setup

Training needs to run across multiple nodes. Here's a config that should work on the cluster:
```yaml
model:
  model_id: dcae
  sample_size: [720,1280]
  channels: 4
  latent_size: 8
  latent_channels: 128

  ch_0: 256
  ch_max: 2048

  encoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]
  decoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]

  use_middle_block: false

train:
  trainer_id: rec
  data_id: s3_cod_features
  data_kwargs:
    bucket_name: gta-jpegs
    prefix: 1080p-depth-30fs
    include_depth: true
    target_size: [720,1280]

  target_batch_size: 32
  batch_size: 4

  epochs: 200

  opt: AdamW
  opt_kwargs:
    lr: 3.0e-5
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-15

  lpips_type: convnext
  loss_weights:
    kl: 1.0e-6
    lpips: 12.0
    l2: 1.0
    dwt: 0.0

  scheduler: LinearWarmup
  scheduler_kwargs:
    warmup_steps: 3000
    min_lr: 3.0e-6

  checkpoint_dir: checkpoints/gta_128x_depth
  resume_ckpt: null #checkpoints/gta_128x_depth/step_10000.pt

  sample_interval: 1000
  save_interval: 5000

wandb:
  name: ${env:WANDB_USER_NAME}
  project: new_vaes
  run_name: 64x_depth_gta
```

Additional points to consider: the real test is whether `batch_size: 1` with `target_batch_size: 32` (i.e. 4 nodes) works. Note that some multinode configurations always report world_size = 8, even when there are more GPUs across the nodes. If that is the case, you may need to adjust the target batch size accordingly.
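To make the batch-size arithmetic concrete, here is a rough sketch. It assumes the trainer derives gradient accumulation as `target_batch_size / (batch_size * world_size)`, which this issue does not confirm, and that each node has 8 GPUs (inferred from the world_size = 8 remark). The function names are hypothetical and not taken from the repo. `WORLD_SIZE` is the environment variable that `torchrun` sets.

```python
import os
from typing import Mapping


def accumulation_steps(target_batch_size: int, batch_size: int, world_size: int) -> int:
    """Steps needed so that batch_size * world_size * steps == target_batch_size.

    Assumption: the trainer computes accumulation this way.
    """
    per_step = batch_size * world_size
    if per_step <= 0 or target_batch_size % per_step != 0:
        raise ValueError(
            f"target_batch_size={target_batch_size} is not a multiple of "
            f"batch_size * world_size = {per_step}"
        )
    return target_batch_size // per_step


def reported_world_size(env: Mapping[str, str] = os.environ) -> int:
    """World size as the launcher reports it (torchrun sets WORLD_SIZE)."""
    return int(env.get("WORLD_SIZE", "1"))


def effective_batch_size(batch_size: int, true_world_size: int, steps: int) -> int:
    """Samples actually consumed per optimizer step across all GPUs."""
    return batch_size * true_world_size * steps


if __name__ == "__main__":
    true_gpus = 4 * 8  # 4 nodes, assumed 8 GPUs each

    # Correctly reported world size: no accumulation needed.
    steps = accumulation_steps(32, 1, true_gpus)
    print(steps, effective_batch_size(1, true_gpus, steps))

    # World size misreported as 8: the trainer accumulates 4 steps,
    # so the effective batch is 4x larger than intended.
    steps = accumulation_steps(32, 1, 8)
    print(steps, effective_batch_size(1, true_gpus, steps))

    # Compensating by lowering the target to 8 restores an effective batch of 32.
    steps = accumulation_steps(8, 1, 8)
    print(steps, effective_batch_size(1, true_gpus, steps))
```

Under these assumptions, a world size misreported as 8 on 32 GPUs with `target_batch_size: 32` would give an effective batch of 128. Setting the target to 8 would bring it back to 32.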

Multinode DDP setup #36

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Multinode DDP setup #36

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions