The general way to calculate FLOPs for a conv layer (h, w = kernel size, c = input channels, n = output channels, H, W = output feature-map size; the +1 accounts for the bias):
- Params = n * ( h * w * c + 1 )
- FLOPs = H * W * n * ( h * w * c + 1 )
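The two formulas above can be sketched as plain Python (the function names and the 3x3/64-to-128-channel example are my own, not from the repo):

```python
# Hedged sketch: params/FLOPs of one conv layer, using the symbols above:
# h, w = kernel size, c = input channels, n = output channels,
# H, W = output feature-map size.

def conv_params(n, h, w, c):
    # each of the n filters holds h*w*c weights plus one bias
    return n * (h * w * c + 1)

def conv_flops(H, W, n, h, w, c):
    # every output position (H*W of them per filter) applies the filter once
    return H * W * conv_params(n, h, w, c)

# example: 3x3 conv, 64 -> 128 channels, 56x56 output
print(conv_params(128, 3, 3, 64))        # 73856
print(conv_flops(56, 56, 128, 3, 3, 64)) # 231612416
```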
Besides the above way to calculate FLOPs, you prune channels by the following criterion (a channel is assumed to be pruned if its l2 norm is very small or if the magnitude of its gradient is very small):
```python
if (len(weight_size) == 4) or (len(weight_size) == 2) or (len(weight_size) == 1):
    if p.grad is not None:
        # consider gradients as well: if the gradient norm is below a specific
        # threshold, the parameter is considered removable
        divider_grad = p.grad.data.pow(2).view(nunits, -1).sum(dim=1).pow(0.5)
        eps = 1e-8
        divider_bool_grad = divider_grad.gt(eps).view(-1).float()
        divider_bool = divider_bool_grad * divider_bool

        if (len(weight_size) == 4) or (len(weight_size) == 2):
            # get the per-input-channel gradient norm:
            divider_grad_input = p.grad.data.pow(2).transpose(0, 1).contiguous().view(p.data.size(1), -1).sum(dim=1).pow(0.5)
            divider_bool_grad_input = divider_grad_input.gt(eps).view(-1).float()
            divider_input = p.data.pow(2).transpose(0, 1).contiguous().view(p.data.size(1), -1).sum(dim=1).pow(0.5)
            divider_bool_input = divider_input.gt(eps).view(-1).float()
            # if the gradient is small, the channel is masked out as well
            divider_bool_input = divider_bool_input * divider_bool_grad_input
```
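As I read the snippet, a channel survives only when both its weight l2 norm and its gradient l2 norm exceed eps (the two boolean masks are multiplied, i.e. a logical AND). A minimal NumPy analogue of that masking step, with toy weights and gradients of my own invention:

```python
import numpy as np

eps = 1e-8  # same threshold as in the snippet

def keep_mask(weight, grad):
    """Per-output-channel mask: 1.0 keeps the channel, 0.0 prunes it.

    weight, grad: arrays of shape (out_channels, ...)."""
    # per-channel l2 norms, flattening everything but the channel axis
    w_norm = np.sqrt((weight.reshape(weight.shape[0], -1) ** 2).sum(axis=1))
    g_norm = np.sqrt((grad.reshape(grad.shape[0], -1) ** 2).sum(axis=1))
    # a channel is kept only if BOTH norms exceed eps (multiplying the
    # two 0/1 masks is a logical AND, as in the snippet)
    return (w_norm > eps).astype(float) * (g_norm > eps).astype(float)

w = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]])  # channel 1: zero weights
g = np.array([[0.1, 0.2], [0.5, 0.5], [0.0, 0.0]])  # channel 2: zero gradient
print(keep_mask(w, g))  # [1. 0. 0.]
```

So under this criterion a small gradient alone does not mark a channel for pruning; it only keeps a channel from being counted as alive when its gradient norm is effectively zero.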
Channels with a small norm should indeed be discarded. But this sentence confuses me: why can channels with a very small gradient be discarded? If a channel's gradient is small, I would think it has already reached the optimization target of the loss function at this iteration. In addition, a small gradient for a channel does not imply that its l2 norm is small as well.
Can you help me understand the criteria?