The general way to calculate FLOPs for a conv layer (h, w = kernel size, c = input channels, n = output channels, H, W = output feature-map size; the +1 accounts for the bias):
- Params = n * ( h * w * c + 1 )
- FLOPs = H * W * n * ( h * w * c + 1 )
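The two formulas above can be sketched as plain Python (the function names and the 3x3/64-to-128-channel example are my own, not from the repo):

```python
# Hedged sketch: params/FLOPs of one conv layer, using the symbols above:
# h, w = kernel size, c = input channels, n = output channels,
# H, W = output feature-map size.

def conv_params(n, h, w, c):
    # each of the n filters holds h*w*c weights plus one bias
    return n * (h * w * c + 1)

def conv_flops(H, W, n, h, w, c):
    # every output position (H*W of them per filter) applies the filter once
    return H * W * conv_params(n, h, w, c)

# example: 3x3 conv, 64 -> 128 channels, 56x56 output
print(conv_params(128, 3, 3, 64))        # 73856
print(conv_flops(56, 56, 128, 3, 3, 64)) # 231612416
```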
Besides the above way to calculate FLOPs, you prune channels by the following criterion (a channel is assumed to be pruned if its l2 norm is very small or if the magnitude of its gradient is very small):
```python
if (len(weight_size) == 4) or (len(weight_size) == 2) or (len(weight_size) == 1):
    if p.grad is not None:
        # consider gradients as well: if the gradient norm is below a specific
        # threshold, the parameter is considered removable
        divider_grad = p.grad.data.pow(2).view(nunits, -1).sum(dim=1).pow(0.5)
        eps = 1e-8
        divider_bool_grad = divider_grad.gt(eps).view(-1).float()
        divider_bool = divider_bool_grad * divider_bool

        if (len(weight_size) == 4) or (len(weight_size) == 2):
            # get the per-input-channel gradient norm:
            divider_grad_input = p.grad.data.pow(2).transpose(0, 1).contiguous().view(p.data.size(1), -1).sum(dim=1).pow(0.5)
            divider_bool_grad_input = divider_grad_input.gt(eps).view(-1).float()
            divider_input = p.data.pow(2).transpose(0, 1).contiguous().view(p.data.size(1), -1).sum(dim=1).pow(0.5)
            divider_bool_input = divider_input.gt(eps).view(-1).float()
            # if the gradient is small, the channel is masked out as well
            divider_bool_input = divider_bool_input * divider_bool_grad_input
```
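As I read the snippet, a channel survives only when both its weight l2 norm and its gradient l2 norm exceed eps (the two boolean masks are multiplied, i.e. a logical AND). A minimal NumPy analogue of that masking step, with toy weights and gradients of my own invention:

```python
import numpy as np

eps = 1e-8  # same threshold as in the snippet

def keep_mask(weight, grad):
    """Per-output-channel mask: 1.0 keeps the channel, 0.0 prunes it.

    weight, grad: arrays of shape (out_channels, ...)."""
    # per-channel l2 norms, flattening everything but the channel axis
    w_norm = np.sqrt((weight.reshape(weight.shape[0], -1) ** 2).sum(axis=1))
    g_norm = np.sqrt((grad.reshape(grad.shape[0], -1) ** 2).sum(axis=1))
    # a channel is kept only if BOTH norms exceed eps (multiplying the
    # two 0/1 masks is a logical AND, as in the snippet)
    return (w_norm > eps).astype(float) * (g_norm > eps).astype(float)

w = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])  # channel 1: zero weights
g = np.array([[0.1, 0.2], [0.5, 0.5], [0.0, 0.0]])  # channel 2: zero gradient
print(keep_mask(w, g))  # [1. 0. 0.]
```

So under this criterion a small gradient alone does not mark a channel for pruning; it only keeps a channel from being counted as alive when its gradient norm is effectively zero.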
Channels with a small norm should indeed be discarded. But this sentence confuses me: why can channels with a very small gradient be discarded? If a channel's gradient is small, I would think it has already reached the optimization target of the loss function at this iteration. In addition, a small gradient for a channel does not imply that its l2 norm is small as well.
Can you help me understand the criteria?