Add standalone parallel CNN functions for MNIST #20
base: main
Conversation
torch.tensor(calc_in["grads_in"][offset:offset + grad_size].reshape(grad_shape),
             device=param.grad.device)
)
offset += grad_size
If we need to average above, then we need to do it here as well.
param.grad /= group_size  # SH averaging?
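A minimal sketch of the unpack-and-average step being discussed, assuming calc_in["grads_in"] is a flat array of concatenated gradients summed across the group and group_size is the number of contributing processes; the helper name apply_summed_grads is hypothetical.

import torch


def apply_summed_grads(model, calc_in, group_size):
    # calc_in["grads_in"] is assumed to be a 1-D array of concatenated
    # gradients, summed across the group.
    flat = calc_in["grads_in"]
    offset = 0
    for param in model.parameters():
        grad_shape = param.shape
        grad_size = param.numel()
        chunk = flat[offset:offset + grad_size].reshape(grad_shape)
        param.grad = torch.tensor(chunk, device=param.device, dtype=param.dtype)
        param.grad /= group_size  # average here as well, mirroring the line above
        offset += grad_size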
# SH TODO - consider inheriting dist (torch.distributed)
# and overriding dist.all_reduce() to do libE stuff
torch.distributed is a module, not a class, so it cannot be inherited from, but we could extract these lines into a provided function.
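A hedged sketch of that alternative: a provided function with an all_reduce-like shape that flattens the gradient tensors, hands the flat buffer to a caller-supplied exchange function (standing in for the libE round trip), and copies the reduced result back. The names libE_all_reduce and exchange_fn are assumptions for illustration.

import torch


def libE_all_reduce(tensors, exchange_fn):
    # `tensors` are typically the param.grad buffers; `exchange_fn` is a
    # placeholder for the libEnsemble send/receive round trip and is assumed
    # to return a reduced flat array of the same length.
    flat = torch.cat([t.detach().reshape(-1) for t in tensors])
    reduced = torch.as_tensor(exchange_fn(flat.cpu().numpy()), dtype=flat.dtype)
    offset = 0
    for t in tensors:
        n = t.numel()
        t.copy_(reduced[offset:offset + n].reshape(t.shape).to(t.device))
        offset += n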
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()

# Synchronize gradients across processes
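For reference, the usual torch.distributed form of that synchronization step looks like the sketch below; whether this PR's standalone version uses exactly this shape is an assumption.

import torch.distributed as dist


def sync_gradients(model, average=True):
    # Sum each parameter's gradient across ranks; optionally average,
    # which is the point under discussion above.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            if average:
                param.grad /= world_size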
If we do the optimizer step on the gen, the gen does:
optimizer.zero_grad()
gradients.sum()
optimizer.step()
This is for comparison with libE.
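A hedged sketch of that gen-side update, assuming the gen receives one flat gradient array per worker, sums (or averages) them, loads the result into its own model copy, and takes a single optimizer step; gen_optimizer_step and worker_grads are illustrative names.

import numpy as np
import torch


def gen_optimizer_step(model, optimizer, worker_grads, average=True):
    # worker_grads is assumed to be a list of equal-length 1-D numpy arrays,
    # one flat gradient vector per worker.
    optimizer.zero_grad()
    summed = np.sum(worker_grads, axis=0)
    if average:
        summed = summed / len(worker_grads)
    offset = 0
    for param in model.parameters():
        n = param.numel()
        chunk = summed[offset:offset + n].reshape(param.shape)
        param.grad = torch.as_tensor(chunk, device=param.device, dtype=param.dtype)
        offset += n
    optimizer.step()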
Works in parallel on Perlmutter and uses GPUs on one node, but does not work across nodes.
Main loop