Conv2Former Reading Notes

Abstract

Conv2Former simplifies self-attention by leveraging a convolutional modulation operation.

Main Content

Models like ResNet mostly aggregate responses over large receptive fields by stacking multiple building blocks and adopting a pyramid network architecture, but they neglect the importance of explicitly modeling global contextual information.

SENet introduces attention-based mechanisms into CNNs to capture long-range dependencies, attaining surprisingly good performance.

The self-attention mechanism in Transformers is able to model global pairwise dependencies, providing a more effective way to encode spatial information. Nevertheless, the computational cost of self-attention when processing high-resolution images is considerable.
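To see the scale of that cost: self-attention forms an N×N attention matrix over the N = H×W tokens, so the number of pairwise weights grows quadratically with resolution. A rough back-of-the-envelope count (ignoring heads and channels, not a measurement from the paper):

```python
def attention_entries(h, w):
    # Number of pairwise attention weights for an h x w token grid.
    n = h * w
    return n * n

print(attention_entries(14, 14))  # 196 tokens  -> 38,416 weights
print(attention_entries(56, 56))  # 3,136 tokens -> 9,834,496 weights
```

Quadrupling the spatial resolution (14×14 → 56×56) multiplies the attention matrix by 256, which is why high-resolution inputs are expensive for plain self-attention.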

Convolutional Modulation

[Figure: self-attention (left) vs. convolutional modulation (right)]

As shown in the left part of the figure above, self-attention computes the output at each pixel as a weighted summation over all other positions. This process can also be mimicked by computing the Hadamard product between the output of a large-kernel convolution and value representations, which the authors call convolutional modulation, as depicted in the right part of the figure.
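The operation above can be sketched in NumPy. This is a minimal illustration under my own assumptions (the helper names, the naive triple-loop depthwise convolution, and the plain channel-mixing projections are mine, not the paper's code): one linear projection is passed through a large-kernel depthwise convolution, then multiplied element-wise with a second "value" projection.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W), kernels: (C, k, k) -> (C, H, W)."""
    C, H, W = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def conv_modulation(x, w_a, w_v, kernels):
    """Convolutional modulation: Hadamard product of a large-kernel
    depthwise-conv branch and a linear 'value' branch.
    x: (C, H, W); w_a, w_v: (C, C) channel-mixing weights."""
    a = depthwise_conv2d(np.einsum('oc,chw->ohw', w_a, x), kernels)  # weights
    v = np.einsum('oc,chw->ohw', w_v, x)                             # values
    return a * v  # element-wise modulation, shape (C, H, W)
```

Unlike the N×N attention matrix, the modulation weights for each output pixel come only from a k×k neighborhood, so the cost grows linearly with the number of pixels.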

The difference is that the convolution kernels are static, while the attention matrix generated by self-attention adapts to the input content.

Simply replacing the self-attention in ViTs with the proposed convolutional modulation operation yields the proposed network, termed Conv2Former.
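Assuming Conv2Former keeps the standard Transformer macro design (pre-norm plus residual connections, which is the usual way such a swap is done), the resulting block can be sketched as follows; the `layer_norm` helper and the function signatures here are simplified placeholders, not the paper's exact implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each spatial position across the channel axis; x: (C, H, W).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv2former_block(x, modulation, mlp):
    # Same skeleton as a Transformer block, with self-attention
    # replaced by the convolutional modulation operation.
    x = x + modulation(layer_norm(x))  # token/spatial mixing
    x = x + mlp(layer_norm(x))         # channel mixing
    return x
```

The residual structure means the block degenerates to the identity when both branches output zero, which is the usual sanity check for such skeletons.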

Another main contribution of the paper is showing that Conv2Former benefits more from convolutions with larger kernels, such as 11×11 and 21×21. The method using 11×11 depthwise convolutions even performs better than recent works using super-large kernel convolutions.

Overall Architecture

[Figure: overall architecture of Conv2Former]