https://arxiv.org/abs/2108.00154
CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention (Wenxiao Wang, Lu Yao, Long Chen, Deng Cai, Xiaofei He, Wei Liu)
A combination of multiscale pooling + dilated local self-attention + dynamic positional encoding. It reports detection results that beat Swin, but... on a 1x schedule. It needs to be tested directly.
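To make the "dilated local self-attention" part concrete, here is a minimal sketch of how tokens on a 2D grid could be grouped before attention is applied within each group. It is an illustration under my own assumptions, not the paper's code: the function names `local_groups` and `dilated_groups` and the parameters `size`, `window`, and `interval` are hypothetical, and only the index grouping is shown, not the attention itself.

```python
from collections import defaultdict


def local_groups(size, window):
    """Group tokens of a size x size grid into adjacent window x window blocks.

    Hypothetical helper: tokens in the same block would attend to each other.
    """
    groups = defaultdict(list)
    for r in range(size):
        for c in range(size):
            groups[(r // window, c // window)].append((r, c))
    return dict(groups)


def dilated_groups(size, interval):
    """Group tokens sampled at a fixed interval (the dilated variant).

    Hypothetical helper: tokens sharing the same offset modulo `interval`
    land in one group, so each group spans the whole grid sparsely.
    """
    groups = defaultdict(list)
    for r in range(size):
        for c in range(size):
            groups[(r % interval, c % interval)].append((r, c))
    return dict(groups)


if __name__ == "__main__":
    # Adjacent 2x2 block in the top-left corner of a 4x4 grid.
    print(local_groups(4, 2)[(0, 0)])
    # Every second token in both directions, starting at (0, 0).
    print(dilated_groups(4, 2)[(0, 0)])
```

Both groupings keep the per-group token count small, so attention cost stays bounded, while the dilated one lets distant positions interact.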
#vit