Training needs to run on multiple nodes. Here's a config that should work on the cluster:
model:
  model_id: dcae
  sample_size: [720, 1280]
  channels: 4
  latent_size: 8
  latent_channels: 128
  ch_0: 256
  ch_max: 2048
  encoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]
  decoder_blocks_per_stage: [4, 4, 4, 4, 4, 4, 4]
  use_middle_block: false
train:
  trainer_id: rec
  data_id: s3_cod_features
  data_kwargs:
    bucket_name: gta-jpegs
    prefix: 1080p-depth-30fs
    include_depth: true
    target_size: [720, 1280]
  target_batch_size: 32
  batch_size: 4
  epochs: 200
  opt: AdamW
  opt_kwargs:
    lr: 3.0e-5
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-15
  lpips_type: convnext
  loss_weights:
    kl: 1.0e-6
    lpips: 12.0
    l2: 1.0
    dwt: 0.0
  scheduler: LinearWarmup
  scheduler_kwargs:
    warmup_steps: 3000
    min_lr: 3.0e-6
  checkpoint_dir: checkpoints/gta_128x_depth
  resume_ckpt: null  # checkpoints/gta_128x_depth/step_10000.pt
  sample_interval: 1000
  save_interval: 5000
wandb:
  name: ${env:WANDB_USER_NAME}
  project: new_vaes
  run_name: 64x_depth_gta
Additional points to consider: The real test is whether batch size 1 with target batch size 32 (i.e. 4 nodes) works. Note that some multinode configurations always report world_size = 8, even when there are more GPUs across the nodes; if that is the case, you may need to adjust the target batch size accordingly.
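To make the batch-size arithmetic concrete, here is a rough sketch. It assumes (not confirmed against the trainer code) that the trainer derives gradient accumulation as target_batch_size / (batch_size * world_size), and that gradients are still averaged across every participating GPU even when world_size is reported as 8. The function names are hypothetical, for illustration only.

```python
def accumulation_steps(target_batch_size: int, batch_size: int, world_size: int) -> int:
    """Accumulation steps needed to reach the target batch size.

    Assumption: target = batch_size * world_size * accumulation_steps.
    """
    per_step = batch_size * world_size
    if per_step <= 0 or target_batch_size % per_step != 0:
        raise ValueError(
            f"target_batch_size={target_batch_size} is not a multiple of "
            f"batch_size * world_size = {per_step}"
        )
    return target_batch_size // per_step


def effective_batch_size(batch_size: int, actual_gpus: int, accum: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * actual_gpus * accum


# Intended case: 4 nodes x 8 GPUs, world_size correctly reported as 32.
accum_ok = accumulation_steps(target_batch_size=32, batch_size=1, world_size=32)
eff_ok = effective_batch_size(batch_size=1, actual_gpus=32, accum=accum_ok)

# Problem case: world_size reported as 8 although 32 GPUs participate.
accum_bad = accumulation_steps(target_batch_size=32, batch_size=1, world_size=8)
eff_bad = effective_batch_size(batch_size=1, actual_gpus=32, accum=accum_bad)

# Possible workaround under the same assumption: lower the target so that
# the reported world_size yields a single accumulation step.
accum_fix = accumulation_steps(target_batch_size=8, batch_size=1, world_size=8)
eff_fix = effective_batch_size(batch_size=1, actual_gpus=32, accum=accum_fix)

print(accum_ok, eff_ok)    # 1 32
print(accum_bad, eff_bad)  # 4 128
print(accum_fix, eff_fix)  # 1 32
```

Under these assumptions, a misreported world_size of 8 would quadruple the effective batch size (128 instead of 32), and dropping target_batch_size to 8 would restore it. Logging the reported world size at startup should be enough to tell which case applies on the cluster.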