The report raises the following question:
For cases where p.ndim > 2, what are m and n? The example code flattens the tensor but still uses the first and second dimension sizes from the original parameter shape to adjust the learning rate, rather than the flattened tensor’s size. Is there a reason for this?
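To make the ambiguity concrete, here is a minimal sketch of the two readings. The scaling rule `sqrt(max(1, m / n))` is used purely for illustration and is an assumption, not necessarily the paper's exact adjustment; the point is only that `n` differs depending on whether it is taken from the original shape or from the flattened 2-D view.

```python
import math

# Hypothetical illustration (not the paper's actual code) of the two ways
# m and n could be chosen when p.ndim > 2. The scaling formula
# sqrt(max(1, m / n)) is an assumed placeholder for the real adjustment.

def adjust_lr_original_dims(shape, lr):
    # Reads m and n from the *original* parameter shape,
    # even though the tensor itself is flattened elsewhere.
    m, n = shape[0], shape[1]
    return lr * math.sqrt(max(1.0, m / n))

def adjust_lr_flattened_dims(shape, lr):
    # Alternative reading: flatten to 2-D first, so n is the
    # product of all trailing dimensions.
    m = shape[0]
    n = math.prod(shape[1:])
    return lr * math.sqrt(max(1.0, m / n))

shape = (64, 3, 3, 3)  # e.g. a conv weight with p.ndim == 4
print(adjust_lr_original_dims(shape, 0.02))   # uses n = 3
print(adjust_lr_flattened_dims(shape, 0.02))  # uses n = 27
```

For a 4-D parameter the two conventions give noticeably different effective learning rates, which is presumably what the question is probing.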