[Refactor]: Refactor DETR and Deformable DETR (open-mmlab#8763)
* [Fix] Fix UT to be compatible with pytorch 1.6 (open-mmlab#8707)

* Update

* Update

* force reinstall pycocotools

* Fix build_cuda

* docker install git

* Update

* comment out other jobs to speed up the process

* update

* uncomment

* Update

* Update

* Add comments for --force-reinstall

* [Refactor] Refactor anchor head and base head with boxlist (open-mmlab#8625)

* Refactor anchor head

* Update

* Update

* Update

* Add a series of boxes tools

* Fix box type to support n x box_dim boxes

* revert box type changes

* Add docstring

* refactor retina_head

* Update

* Update

* Fix comments

* modify docstring of coder and ioucalculator

* Replace with_boxlist with use_box_type

* fix: fix config of detr-r18

* fix: modified import of MSDeformAttn in PixelDecoder of Mask2Former

* feat: add TransformerDetector as the base detector of DETR-like detectors

* refactor: refactor modules and configs of DETR

* refactor: refactor DETR-related modules in transformer.py

* refactor: refactor DETR-related modules in transformer.py

* fix: add type comments in detr.py

* correct trainloop in detr_r50 config

* fix: modify the parent class of DETRHead to BaseModule

* refactor: refactor modules and configs of Deformable DETR

* fix: modify the usage of num_query

* fix: modify the usage of num_query in configs

* refactor: replace input_proj of detr with ChannelMapper neck

* refactor: delete multi_apply in DETRHead.forward()

* Update detr_r18_8xb2-500e_coco.py

use the ChannelMapper neck for r18

* change the name of detection_transformer.py to base_detr.py

* refactor: modify the binary-mask construction section of forward_pretransformer

* refactor: utilize abstractmethod

* update ABCMeta to make sure the class TransformerDetector can be reloaded

* add some annotations

* add some annotations

* add some annotations

* refactor: delete _init_transformer in detectors

* refactor: modify args of deformable detr

* refactor: modify calls to super().__init__()

* Update detr_head.py

Remove the multi-level feature handling in function 'predict_by_feat'

* Update detr.py

update init_weights

* add some annotations for the head

* to make sure the head args are the same as the detector's

* to make sure the head args are the same as the detector's

* fix some bugs

* fix: fix bugs of num_pred in DeformableDETRHead

* add kwargs to transformer

* support MLP and sine position embedding

* delete positional encoding

* delete useless postnorm

* Revert "add kwargs to transformer"

This reverts commit a265c1a.

* Update detr_head.py

Update type and shape of args

* Update detr_head.py

fix args docstring in predict_by_feat

* Update base_detr.py

Update docstring for forward_pretransformer

* Update deformable_detr.py

Fix docstring

* to support Conditional DETR with an overridable forward_transformer

* fix: update config files of Two-stage and Box-refine

* replace all bs with batch_size in detr-related files

* update deformable.py and transformer.py

* update docstring in base_detr

* update docstring in base_detr, detr

* doc refine

* Revert "doc refine"

This reverts commit b69da4f.

* doc refine

* doc refine

* update docs of base_detr, detr, and layers/transformer

* fix doc in base_detr

* add origin repo link

* add origin repo link

* refine doc

* refine doc

* refine doc

* refine doc

* refine doc

* refine doc

* refine doc

* refine doc

* doc: add doc of the first edition of Deformable DETR

* batch_size to bs

* refine doc

* refine doc

* feat: add config comments of specific module

* refactor: refactor base DETR class TransformerDetector

* fix: fix wrong return typehint of forward_encoder in TransformerDetector

* refactor: refactor DETR

* refactor: refactor Deformable DETR

* refactor: refactor forward_encoder and pre_decoder

* fix: fix bugs of new edition

* refactor: small modifications

* fix: move get_reference_points to deformable_encoder

* refactor: merge init_ and inter_reference into references in Deformable DETR

* modify docstring of get_valid_ratio in Deformable DETR

* add some docstring

* doc: add docstring of deformable_detr.py

* doc: add docstring of deformable_detr_head.py

* doc: modify docstring of deformable detr

* doc: add docstring of deformable_detr_head.py

* doc: modify docstring of deformable detr

* doc: add docstring of base_detr.py

* doc: refine docstring of base_detr.py

* doc: refine docstring of base_detr.py

* a little change of MLP

* a little change of MLP

* a little change of MLP

* a little change of MLP

* refine config

* refine config

* refine config

* refine doc string for detr

* little refine doc string for detr.py

* tiny modification

* doc: refine docstring of detr.py

* tiny modifications to resolve the conversations

* DETRHead.predict() draft

* tiny modifications to resolve conversations

* refactor: modify arg names and forward strategies of bbox_head

* tiny modifications to resolve the conversations

* support MLP

* fix docstring of function pre_decoder

* fix docstring of function pre_decoder

* fix docstring

* modifications for resolving conversations

* refactor: eradicate key_padding_mask args

* refactor: eradicate key_padding_mask args

* fix: fix bug of deformable detr and resolve some conversations

* refactor: rename base class with DetectionTransformer and other modifications

* fix: fix config of detr

* fix the bug of init

* fix: fix init_weight of DETR and Deformable DETR

* resolve conflict

* fix auto-merge bug

* fix pre-commit bug

* refactor: move the position of encoder and decoder

* delete Transformer in ci test

* delete Transformer in ci test

Co-authored-by: jbwang1997 <[email protected]>
Co-authored-by: KeiChiTse <[email protected]>
Co-authored-by: LYMDLUT <[email protected]>
Co-authored-by: lym <[email protected]>
Co-authored-by: Kei-Chi Tse <[email protected]>
6 people authored and jshilong committed Jan 19, 2023
1 parent d2a3cbb commit 4d30934
Showing 19 changed files with 2,635 additions and 1,771 deletions.
65 changes: 25 additions & 40 deletions configs/deformable_detr/deformable-detr_r50_16xb2-50e_coco.py
@@ -3,6 +3,10 @@
 ]
 model = dict(
     type='DeformableDETR',
+    num_query=300,
+    num_feature_levels=4,
+    with_box_refine=False,
+    as_two_stage=False,
     data_preprocessor=dict(
         type='DetDataPreprocessor',
         mean=[123.675, 116.28, 103.53],
@@ -27,50 +31,31 @@
         act_cfg=None,
         norm_cfg=dict(type='GN', num_groups=32),
         num_outs=4),
+    encoder=dict(  # DeformableDetrTransformerEncoder
+        num_layers=6,
+        layer_cfg=dict(  # DeformableDetrTransformerEncoderLayer
+            self_attn_cfg=dict(  # MultiScaleDeformableAttention
+                embed_dims=256),
+            ffn_cfg=dict(
+                embed_dims=256, feedforward_channels=1024, ffn_drop=0.1))),
+    decoder=dict(  # DeformableDetrTransformerDecoder
+        num_layers=6,
+        return_intermediate=True,
+        layer_cfg=dict(  # DeformableDetrTransformerDecoderLayer
+            self_attn_cfg=dict(  # MultiheadAttention
+                embed_dims=256,
+                num_heads=8,
+                dropout=0.1),
+            cross_attn_cfg=dict(  # MultiScaleDeformableAttention
+                embed_dims=256),
+            ffn_cfg=dict(
+                embed_dims=256, feedforward_channels=1024, ffn_drop=0.1)),
+        post_norm_cfg=None),
+    positional_encoding_cfg=dict(num_feats=128, normalize=True, offset=-0.5),
     bbox_head=dict(
         type='DeformableDETRHead',
-        num_query=300,
         num_classes=80,
-        in_channels=2048,
         sync_cls_avg_factor=True,
-        as_two_stage=False,
-        transformer=dict(
-            type='DeformableDetrTransformer',
-            encoder=dict(
-                type='DetrTransformerEncoder',
-                num_layers=6,
-                transformerlayers=dict(
-                    type='BaseTransformerLayer',
-                    attn_cfgs=dict(
-                        type='MultiScaleDeformableAttention', embed_dims=256),
-                    feedforward_channels=1024,
-                    ffn_dropout=0.1,
-                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
-            decoder=dict(
-                type='DeformableDetrTransformerDecoder',
-                num_layers=6,
-                return_intermediate=True,
-                transformerlayers=dict(
-                    type='DetrTransformerDecoderLayer',
-                    attn_cfgs=[
-                        dict(
-                            type='MultiheadAttention',
-                            embed_dims=256,
-                            num_heads=8,
-                            dropout=0.1),
-                        dict(
-                            type='MultiScaleDeformableAttention',
-                            embed_dims=256)
-                    ],
-                    feedforward_channels=1024,
-                    ffn_dropout=0.1,
-                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
-                                     'ffn', 'norm')))),
-        positional_encoding=dict(
-            type='SinePositionalEncoding',
-            num_feats=128,
-            normalize=True,
-            offset=-0.5),
         loss_cls=dict(
             type='FocalLoss',
             use_sigmoid=True,
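The practical effect of this hunk is that num_query, num_feature_levels, with_box_refine, and as_two_stage now live on the detector itself, and the encoder, decoder, and positional encoding are configured directly on the detector instead of through bbox_head.transformer. As a rough, hypothetical sketch (not a file from this commit) of a downstream override written against the refactored layout, with keys taken from the hunk above and values chosen only for illustration:

# Hypothetical override config; assumes the refactored layout shown above.
_base_ = 'deformable-detr_r50_16xb2-50e_coco.py'

model = dict(
    with_box_refine=True,  # formerly model.bbox_head.with_box_refine
    as_two_stage=True,     # formerly model.bbox_head.as_two_stage
    # Transformer modules can now be tweaked directly on the detector.
    decoder=dict(num_layers=6, return_intermediate=True))

The two one-line diffs that follow show exactly this pattern in the shipped box-refine and two-stage configs.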
@@ -1,2 +1,2 @@
 _base_ = 'deformable-detr_r50_16xb2-50e_coco.py'
-model = dict(bbox_head=dict(with_box_refine=True))
+model = dict(with_box_refine=True)

@@ -1,2 +1,2 @@
 _base_ = 'deformable-detr_refine_r50_16xb2-50e_coco.py'
-model = dict(bbox_head=dict(as_two_stage=True))
+model = dict(as_two_stage=True)
2 changes: 1 addition & 1 deletion configs/detr/detr_r18_8xb2-500e_coco.py
@@ -4,4 +4,4 @@
     backbone=dict(
         depth=18,
         init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')),
-    bbox_head=dict(in_channels=512))
+    neck=dict(in_channels=[512]))
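With the head's input_proj replaced by a ChannelMapper neck, the ResNet-18 variant no longer overrides bbox_head.in_channels; it only points the neck at the 512-channel C5 output of ResNet-18. A sketch of the effective merged neck config, assuming every field except in_channels is inherited unchanged from the r50 base config shown in the next diff:

# Sketch of the merged neck config for the ResNet-18 variant (assumption:
# only in_channels differs from the r50 base config).
neck = dict(
    type='ChannelMapper',
    in_channels=[512],  # ResNet-18 C5 channels; ResNet-50 uses [2048]
    kernel_size=1,
    out_channels=256,  # matches the transformer embed_dims
    act_cfg=None,
    norm_cfg=None,
    num_outs=1)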
78 changes: 42 additions & 36 deletions configs/detr/detr_r50_8xb2-150e_coco.py
@@ -3,6 +3,7 @@
 ]
 model = dict(
     type='DETR',
+    num_query=100,
     data_preprocessor=dict(
         type='DetDataPreprocessor',
         mean=[123.675, 116.28, 103.53],
@@ -19,45 +20,50 @@
         norm_eval=True,
         style='pytorch',
         init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
+    neck=dict(
+        type='ChannelMapper',
+        in_channels=[2048],
+        kernel_size=1,
+        out_channels=256,
+        act_cfg=None,
+        norm_cfg=None,
+        num_outs=1),
+    encoder=dict(  # DetrTransformerEncoder
+        num_layers=6,
+        layer_cfg=dict(  # DetrTransformerEncoderLayer
+            self_attn_cfg=dict(  # MultiheadAttention
+                embed_dims=256,
+                num_heads=8,
+                dropout=0.1),
+            ffn_cfg=dict(
+                embed_dims=256,
+                feedforward_channels=2048,
+                num_fcs=2,
+                ffn_drop=0.1,
+                act_cfg=dict(type='ReLU', inplace=True)))),
+    decoder=dict(  # DetrTransformerDecoder
+        num_layers=6,
+        layer_cfg=dict(  # DetrTransformerDecoderLayer
+            self_attn_cfg=dict(  # MultiheadAttention
+                embed_dims=256,
+                num_heads=8,
+                dropout=0.1),
+            cross_attn_cfg=dict(  # MultiheadAttention
+                embed_dims=256,
+                num_heads=8,
+                dropout=0.1),
+            ffn_cfg=dict(
+                embed_dims=256,
+                feedforward_channels=2048,
+                num_fcs=2,
+                ffn_drop=0.1,
+                act_cfg=dict(type='ReLU', inplace=True))),
+        return_intermediate=True),
+    positional_encoding_cfg=dict(num_feats=128, normalize=True),
     bbox_head=dict(
         type='DETRHead',
         num_classes=80,
-        in_channels=2048,
-        transformer=dict(
-            type='Transformer',
-            encoder=dict(
-                type='DetrTransformerEncoder',
-                num_layers=6,
-                transformerlayers=dict(
-                    type='BaseTransformerLayer',
-                    attn_cfgs=[
-                        dict(
-                            type='MultiheadAttention',
-                            embed_dims=256,
-                            num_heads=8,
-                            dropout=0.1)
-                    ],
-                    feedforward_channels=2048,
-                    ffn_dropout=0.1,
-                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
-            decoder=dict(
-                type='DetrTransformerDecoder',
-                return_intermediate=True,
-                num_layers=6,
-                transformerlayers=dict(
-                    type='DetrTransformerDecoderLayer',
-                    attn_cfgs=dict(
-                        type='MultiheadAttention',
-                        embed_dims=256,
-                        num_heads=8,
-                        dropout=0.1),
-                    feedforward_channels=2048,
-                    ffn_dropout=0.1,
-                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
-                                     'ffn', 'norm')),
-            )),
-        positional_encoding=dict(
-            type='SinePositionalEncoding', num_feats=128, normalize=True),
+        embed_dims=256,
         loss_cls=dict(
             type='CrossEntropyLoss',
             bg_cls_weight=0.1,
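For orientation only (not part of this commit), a config refactored this way would typically be consumed through MMEngine's registry. A minimal build sketch, assuming an mmdet 3.x development environment with this refactor applied and a local checkout that provides the config path below:

# Minimal sketch; assumes mmdet 3.x dev with this refactor applied.
from mmengine.config import Config

from mmdet.registry import MODELS
from mmdet.utils import register_all_modules

register_all_modules()  # register mmdet modules in the default registry scope

cfg = Config.fromfile('configs/detr/detr_r50_8xb2-150e_coco.py')
detector = MODELS.build(cfg.model)

# After the refactor, the transformer encoder and decoder are attributes of
# the detector rather than being wrapped inside bbox_head.transformer.
print(type(detector).__name__)                      # expected: DETR
print(hasattr(detector, 'encoder'), hasattr(detector, 'decoder'))
print(hasattr(detector.bbox_head, 'transformer'))   # expected: False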
(Diffs for the remaining changed files are not shown here.)
