
Commit

Merge pull request #49 from NVIDIA/john/csa
Convolution Self-Attention
johnyang-nv authored Jan 10, 2024
2 parents f963a04 + a170fbf commit 3dad2b9
Showing 5 changed files with 374 additions and 0 deletions.
36 changes: 36 additions & 0 deletions ConvSelfAttention/LICENSE
@@ -0,0 +1,36 @@
NVIDIA Source Code License for Convolutional Self-Attention (CSA)


1. Definitions

“Licensor” means any person or entity that distributes its Work.
“Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

2. License Grant

2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

3. Limitations

3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for research or evaluation purposes only.

3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

4. Disclaimer of Warranty.

THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

93 changes: 93 additions & 0 deletions ConvSelfAttention/README.md
@@ -0,0 +1,93 @@
# Convolutional Self-Attention (CSA)

<!-- ![image](resources/image.png) -->
<div align="center">
<img src="./resources/CSA_Block.png" height="500">
</div>



## ***[Emulating attention mechanism in transformer models with a fully convolutional network (link to be fixed to Blogpost)](https://arxiv.org/abs/2204.13791)***<br />
John Yang, Le An, Su Inn Park

Unlike other convolutional models that graft the attention module of transformer models onto a CNN,
Convolutional Self-Attention (CSA) explicitly captures one-to-many relationships among features using only convolutions combined with simple tensor shape manipulations.
As a result, CSA runs out of the box in TensorRT's restricted mode, making it suitable for safety-critical AV production applications.
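To illustrate the core trick (a simplified, single-head sketch of our own with made-up sizes; the actual module lives in `convselfattn.py` below), an attention-like affinity map can be built from a 1x1 convolution whose output channels enumerate the spatial positions, a channel/spatial transpose, and an elementwise product:

```python
import torch
import torch.nn as nn

# Toy sketch: the HW output channels of a 1x1 conv index the HW spatial
# positions, so transposing the channel and spatial axes turns queries
# into keys; affinities need only an elementwise product, no matmul.
B, C, H, W = 2, 32, 14, 14
HW = H * W
to_q = nn.Conv2d(C, HW, kernel_size=1)

feat = torch.randn(B, C, H, W)
q = to_q(feat)                                         # (B, HW, H, W)
k = q.flatten(2).transpose(1, 2).reshape(B, HW, H, W)  # channel<->spatial swap
affinity = q * k                                       # (B, HW, H, W)
```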


<hr>

## Usage


We follow the same setup as the [ConvNeXt](https://github.com/facebookresearch/ConvNeXt) repository for general usage, including training and testing.
For environment preparation, data download, and training/evaluation scripts, please refer to the original repo.

### Setting up CSA

To set up, first clone the ConvNeXt repository and this one:

```bash
git clone https://github.com/facebookresearch/ConvNeXt.git
git clone https://github.com/NVIDIA/DL4AGX.git
```

To place the Convolutional Self-Attention model files within the ConvNeXt implementation, run the following commands, which copy the files to the locations expected by the training/testing commands:

```bash
cp your/path/to/DL4AGX/ConvSelfAttention/convselfattn.py your/path/to/ConvNeXt/models
cp your/path/to/DL4AGX/ConvSelfAttention/implement_CSA.py your/path/to/ConvNeXt/
cd ConvNeXt
```

After copying the required files into the ConvNeXt repository, make sure the files from this repository are located in the following directories of ConvNeXt:

```yaml
ConvNeXt
├ models
│ ...
│ └ convselfattn.py
├ object_detection
├ semantic_segmentation
│ ...
└ implement_CSA.py
```

Then, run the following command so that `main.py` imports the newly added model file:

```bash
python implement_CSA.py
```
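If the script ran correctly, `main.py` now contains the `import models.convselfattn` line among its imports; a quick way to verify:

```bash
grep -n "convselfattn" main.py   # should print the inserted import line
```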




### Training

For training the network with CSA modules,
we use distributed training via `bcprun` on the [NVIDIA Base Command Platform](https://docs.nvidia.com/base-command-platform/user-guide/index.html).

The training was done on two 8-GPU nodes with gradient accumulation every 2 iterations (`--update_freq 2`) to match ConvNeXt's original effective `batch_size=4096` (2 nodes x 8 GPUs x 128 per-GPU batch x 2).

```bash
bcprun --nnodes 2 --npernode 8 --cmd 'python main.py --model convselfattn --drop_path 0.1 \
--lr 4e-3 --batch_size 128 --update_freq 2 --use_amp True --model_ema true \
--model_ema_eval false --data_path your/path/to/dataset/ \
--output_dir /results --sync_BN True --warmup_epochs 20 --epochs 300'
```
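If `bcprun` and Base Command are not available, the same effective batch size (8 GPUs x 128 per-GPU batch x 4 accumulation steps = 4096) can be approximated on a single 8-GPU node; a sketch, assuming the standard `torch.distributed.launch` entry point used by the ConvNeXt repo:

```bash
python -m torch.distributed.launch --nproc_per_node=8 main.py \
    --model convselfattn --drop_path 0.1 \
    --lr 4e-3 --batch_size 128 --update_freq 4 --use_amp True --model_ema true \
    --model_ema_eval false --data_path your/path/to/dataset/ \
    --output_dir /results --sync_BN True --warmup_epochs 20 --epochs 300
```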

### Testing
Once training is done, here is an example evaluation command for an ImageNet-1K pre-trained CSA network:

```bash
python main.py --model convselfattn --eval true --resume your/path/to/trained_model.pth
```

Our CSA network should reach a Top-1 accuracy of `81.30%` with FP32 inference when trained with the command above.


<hr>


## License
The provided code can be used for research or other non-commercial purposes. For details please check the [LICENSE](LICENSE) file.
224 changes: 224 additions & 0 deletions ConvSelfAttention/convselfattn.py
@@ -0,0 +1,224 @@
# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_, DropPath
from timm.models.registry import register_model


class Block_Conv_SelfAttn(nn.Module):
"""
Convolutional Self-Attention module
Parameters
----------
dim : int
Number of input channels.
drop_path : float
Stochastic depth rate. Default: 0.0.
layer_scale_init_value : float
Init value for Layer Scale. Default: 0.
sr_to : int
Target spatial reduction size. Default: 14.
num_heads : int
Number of heads. Default: 4.
mlp_ratio : int
Number to multiply input dimension for the last mlp layer. Default: 3.
neighbors : int
Kernel window size for depth-wise convolution. Default: 7.
resize_mode : string
Algorithm used for resizing: ['nearest' | 'bilinear']. Default: 'bilinear'.
"""
def __init__(self, dim, drop_path=0., layer_scale_init_value=0., sr_to=14, num_heads=4, mlp_ratio=3,
neighbors=7, resize_mode='bilinear', **kwargs):
super().__init__()
self.dim = mlp_ratio * dim
self.num_heads = num_heads
self.resize_mode = resize_mode
self.sr_to = sr_to
self.HW = sr_to ** 2

self.v = nn.Conv2d(dim, dim, kernel_size=neighbors, padding=neighbors//2, groups=dim)
self.act_v = nn.Sequential(nn.BatchNorm2d(dim), Swish(dim, trainable=False))

self.q = nn.Conv2d(dim, self.num_heads * self.HW, 1)
self.norm_q = nn.BatchNorm2d(self.num_heads * self.HW)
self.qk = nn.Conv2d(self.num_heads * self.HW, dim, 1)
self.act_qk = nn.Sequential(nn.BatchNorm2d(dim), nn.Sigmoid())

self.qkv = nn.Conv2d(dim, self.dim, 1)

self.act_qkv = nn.Sequential(nn.BatchNorm2d(self.dim), Swish(self.dim, trainable=False))
self.mlp = nn.Conv2d(self.dim, dim, 1)

self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((1, dim, 1, 1)),
requires_grad=True) if layer_scale_init_value > 0 else None

self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

def forward(self, x):
input = x
_, _, H_og, W_og = input.shape

# TensorRT-8.6.11.4 - restricted mode does NOT support resize with size parameter
if type(H_og) != int:
H_og = H_og.item()
if type(W_og) != int:
W_og = W_og.item()
sr_h, sr_w = self.sr_to / H_og, self.sr_to / W_og

v = self.act_v(self.v(input))

# TensorRT-8.6.11.4 - restricted mode does NOT support resize with size parameter
v_ = torch.nn.functional.interpolate(v, scale_factor=(sr_h, sr_w), mode='bilinear', align_corners=False)
# v_ = torch.nn.functional.interpolate(v, size=(self.sr_to, self.sr_to), mode='bilinear', align_corners=False)

q = self.norm_q(self.q(v_))
B_, C, H, W = q.shape
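        # Keys are obtained by swapping the channel and spatial axes of the
        # per-head (HW x HW) query map: a pure view/transpose, so the block
        # needs no matmul and stays compatible with TensorRT restricted mode.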
k = q.view(B_, self.num_heads, self.HW, self.HW).transpose(3, 2).contiguous().view(B_, self.num_heads * self.HW, H, W)

qk = torch.nn.functional.interpolate(q * k, scale_factor=(1/sr_h, 1/sr_w), mode='bilinear', align_corners=False)
# qk = torch.nn.functional.interpolate(q * k, size=(H_og, W_og), mode='bilinear', align_corners=False)

qk = self.act_qk(self.qk(qk))
x = self.act_qkv(self.qkv(qk * v))
x = self.mlp(x)

if self.gamma is not None:
x = self.gamma * x

return input + self.drop_path(x)


class Swish(nn.Module):
"""
Swish activation [x * sigmoid(x), optionally scaled by a trainable b] : https://arxiv.org/abs/1710.05941v2
Parameters
----------
dim : int
Number of input channels.
trainable : bool
Whether to include a trainable parameter b or not. Default: False.
"""
def __init__(self, dim, trainable=False):
super().__init__()
if trainable:
self.beta = nn.Parameter(torch.ones((1, dim, 1, 1)), requires_grad=True)
else:
self.beta = 1.
self.trainable = trainable

def forward(self, x):
if self.trainable:
x = self.beta * x
        return x * torch.sigmoid(x)


class CSA_backbone(nn.Module):
"""
Backbone Network that incorporates CSA modules.
Parameters
----------
in_chans : int
Number of input channels. Default: 3.
num_classes : int
Number of output classes for prediction. Default: 1000.
depths : list
Numbers of blocks per phase. Default: [3, 3, 9, 3]
dims : list
Numbers of channels for each block per phase. Default: [96, 192, 384, 768]
drop_path_rate : float
Stochastic depth rate. Default: 0.
layer_scale_init_value : float
Init value for Layer Scale. Default: 0.
head_init_scale : float
Init scaling value for classifier weights and biases. Default: 1.
ds_patch : list
Kernel window sizes for downsampling layers per phase. Default: [7, 3, 3, 3]
strides : list
Stride sizes for downsampling layers per phase. Default: [4, 2, 2, 2]
num_heads : list
Numbers of heads per phase. Default: [1, 2, 4, 8]
mlp_dim : list
Numbers to multiply input dimension for the last mlp layer per phase. Default: [2, 2, 2, 2].
sr_to : list
Sizes to reduce feature maps to. Default: [14, 14, 14, 7].
neighbors : int
Kernel window size for depth-wise convolution for all CSA blocks. Default: 5.
resize_mode : string
Algorithm used for resizing: ['nearest' | 'bilinear']. Default: 'bilinear'.
"""
def __init__(self, in_chans=3, num_classes=1000, depths=[3, 3, 9, 3], dims=[96, 192, 384, 768], drop_path_rate=0.,
layer_scale_init_value=0, head_init_scale=1., ds_patch=[7, 3, 3, 3], strides=[4, 2, 2, 2],
num_heads=[1, 2, 4, 8], mlp_dim=[2, 2, 2, 2], sr_to=[14, 14, 14, 7], neighbors=5, resize_mode='bilinear'):
super().__init__()
self.num_phases = len(depths)
self.downsample_layers = nn.ModuleList()
stem = nn.Sequential(
nn.Conv2d(in_chans, dims[0], kernel_size=ds_patch[0], stride=strides[0], padding=ds_patch[0]//2),
nn.BatchNorm2d(dims[0])
)
self.downsample_layers.append(stem)

for i in range(self.num_phases - 1):
downsample_layer = nn.Sequential(
nn.Conv2d(dims[i], dims[i + 1], kernel_size=ds_patch[i + 1], stride=strides[i + 1]),
nn.BatchNorm2d(dims[i + 1])
)
self.downsample_layers.append(downsample_layer)

self.stages = nn.ModuleList()
dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
cur = 0
for i in range(self.num_phases):
stage = nn.Sequential(
*[Block_Conv_SelfAttn(dim=dims[i],
drop_path=dp_rates[cur + j],
layer_scale_init_value=layer_scale_init_value,
num_heads=num_heads[i],
mlp_ratio=mlp_dim[i],
sr_to=sr_to[i],
neighbors=neighbors,
resize_mode=resize_mode) for j in
range(depths[i])]
)
self.stages.append(stage)
cur += depths[i]

self.norm = nn.BatchNorm2d(dims[-1])
self.head = nn.Conv2d(dims[-1], num_classes, 1)

self.apply(self._init_weights)
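        # With 224x224 inputs the final feature map is 6x6, so a fixed 6x6
        # average pool acts as global average pooling (ReduceMean is not
        # supported in TensorRT restricted mode; see forward_features).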
self.avgpool = nn.AvgPool2d(6)
self.head.weight.data.mul_(head_init_scale)
self.head.bias.data.mul_(head_init_scale)

def _init_weights(self, m):
if isinstance(m, (nn.Conv2d, nn.Linear)):
trunc_normal_(m.weight, std=.02)
nn.init.constant_(m.bias, 0)

def forward_features(self, x):
for i in range(self.num_phases):
x = self.downsample_layers[i](x)
x = self.stages[i](x)

# TensorRT-8.6.11.4 - restricted mode does not support ReduceMean
# x = x.mean([-2, -1]).view(x.size(0), x.size(1), 1, 1)
x = self.avgpool(x)
return self.norm(x)

def forward(self, x):
x = self.forward_features(x)
x = self.head(x)
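        # head output is (B, num_classes, 1, 1); squeeze() drops the unit
        # dims (and also the batch dim when B == 1)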
return x.squeeze()


@register_model
def convselfattn(pretrained=False, pretrained_cfg=None, pretrained_cfg_overlay=False, in_22k=False, **kwargs):
model = CSA_backbone(depths=[3, 4, 6, 3], dims=[96, 192, 384, 768], num_heads=[1, 2, 4, 8], mlp_dim=[3, 3, 3, 3],
sr_to=[14, 14, 14, 7], neighbors=5, **kwargs)
return model
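
As a quick smoke test (our own sketch, assuming it is run from the ConvNeXt root after the setup above, with `torch` and the `timm` version pinned by ConvNeXt installed):

```python
import torch
from models.convselfattn import convselfattn

model = convselfattn().eval()
x = torch.randn(1, 3, 224, 224)   # dummy ImageNet-sized input
with torch.no_grad():
    logits = model(x)             # squeeze() yields shape (1000,) for batch size 1
print(logits.shape)
```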

21 changes: 21 additions & 0 deletions ConvSelfAttention/implement_CSA.py
@@ -0,0 +1,21 @@
# Specify the file path
file_path = "main.py"

# The line number where you want to insert the import (1-based index)
line_number = 34

# The line you want to insert
line_to_insert = "import models.convselfattn\n"

# Read the file and store its contents in a list
# (the `with` block closes the file automatically)
with open(file_path, "r") as file:
    lines = file.readlines()

# Insert the line at the desired position (subtract 1 to convert to 0-based index)
lines.insert(line_number - 1, line_to_insert)

# Open the file in write mode and overwrite its contents
with open(file_path, "w") as file:
file.writelines(lines)
Binary file added ConvSelfAttention/resources/CSA_Block.png