Multinode DDP setup #36

@shahbuland

Training needs to run on multiple nodes. Here's a config that should work on the cluster:

model:
  model_id: dcae
  sample_size: [720,1280]
  channels: 4
  latent_size: 8
  latent_channels: 128

  ch_0: 256
  ch_max: 2048

  encoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]
  decoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]

  use_middle_block: false

train:
  trainer_id: rec
  data_id: s3_cod_features
  data_kwargs:
    bucket_name: gta-jpegs
    prefix: 1080p-depth-30fs
    include_depth: true
    target_size: [720,1280]

  target_batch_size: 32
  batch_size: 4

  epochs: 200

  opt: AdamW
  opt_kwargs:
    lr: 3.0e-5
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-15

  lpips_type: convnext
  loss_weights:
    kl: 1.0e-6
    lpips: 12.0
    l2: 1.0
    dwt: 0.0

  scheduler: LinearWarmup
  scheduler_kwargs:
    warmup_steps: 3000
    min_lr: 3.0e-6

  checkpoint_dir: checkpoints/gta_128x_depth
  resume_ckpt: null #checkpoints/gta_128x_depth/step_10000.pt

  sample_interval: 1000
  save_interval: 5000

wandb:
  name: ${env:WANDB_USER_NAME}
  project: new_vaes
  run_name: 64x_depth_gta

Additional points to consider: The real test is whether batch size 1 with a target batch size of 32 (i.e. 4 nodes) works. Note that some multinode configurations always report world_size = 8, even when there are more GPUs across the nodes. If that is the case, you may need to adjust the target batch size accordingly.
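To make the batch size arithmetic concrete, here is a minimal sketch. It assumes the trainer derives gradient accumulation as target_batch_size / (batch_size * world_size), which may not match the repo's actual logic. The function names `grad_accum_steps`, `effective_batch_size`, and `reported_world_size` are hypothetical helpers for illustration only. The `WORLD_SIZE` environment variable is the one exported by `torchrun`.

```python
import os


def grad_accum_steps(target_batch_size: int, batch_size: int, world_size: int) -> int:
    """Micro-batches per optimizer step, assuming
    batch_size * world_size * steps == target_batch_size."""
    per_step = batch_size * world_size
    if per_step <= 0 or target_batch_size % per_step != 0:
        raise ValueError(
            f"target_batch_size={target_batch_size} is not a multiple of "
            f"batch_size * world_size = {per_step}"
        )
    return target_batch_size // per_step


def effective_batch_size(batch_size: int, actual_gpus: int, accum_steps: int) -> int:
    """Samples that really contribute to one optimizer step."""
    return batch_size * actual_gpus * accum_steps


def reported_world_size(env=os.environ) -> int:
    """World size as seen by the process; falls back to 1 when WORLD_SIZE is unset."""
    return int(env.get("WORLD_SIZE", "1"))


# Config from this issue on a single 8-GPU node: no accumulation needed.
single_node = grad_accum_steps(target_batch_size=32, batch_size=4, world_size=8)

# The case to test: batch size 1 on 4 nodes (32 GPUs), world_size reported correctly.
four_nodes = grad_accum_steps(target_batch_size=32, batch_size=1, world_size=32)

# Failure mode: 32 GPUs are running, but world_size is reported as 8.
misreported = grad_accum_steps(target_batch_size=32, batch_size=1, world_size=8)
inflated = effective_batch_size(batch_size=1, actual_gpus=32, accum_steps=misreported)

# Compensating by lowering the target restores an effective batch of 32.
adjusted = grad_accum_steps(target_batch_size=8, batch_size=1, world_size=8)
restored = effective_batch_size(batch_size=1, actual_gpus=32, accum_steps=adjusted)
```

Under these assumptions, a world_size that is stuck at 8 on a 32-GPU job would quietly quadruple the effective batch size, which is why the target may need to be lowered. Logging `reported_world_size()` on every rank at startup is one cheap way to see which case applies.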
