H2former论文阅读记录

摘要

Although methods based on convolutional neural networks (CNNs) have achieved good results, it is weak to model the long-range dependencies, which is very important for
segmentation task to build global context dependencies.The Transformers can establish long-range dependencies among pixels by self-attention, providing a supplement to the local convolution. In addition, multi-scale feature fusion and feature selection are crucial for medical image segmentation tasks, which is ignored by Transformers.However, it is challenging to directly apply self-attention to CNNs due to the quadratic computational complexity for high-resolution feature maps. Therefore, to integrate the merits of CNNs, multi-scale channel attention and Transformers, we propose an efficient hierarchical hybrid vision Transformer (H2Former) for medical image segmentation.（全局上下文依赖关系对于分割任务是十分重要的，但CNN这方面较弱。Transformers通过自我注意建立像素间的长距离相关性，对局部卷积算法提供了补充。另外，多尺度特征融合和特征选择是医学图像分割的关键，而Transformers忽略了这一点。并且，由于高分辨率特征映射的二次型计算复杂度，直接将自注意应用于CNN具有挑战性。因此，为了综合CNN、多尺度通道注意力和Transformers的优点，提出了一种高效的分层混合视觉Transformer（H2Former）用于医学图像分割。）

介绍

However, due to the intrinsic locality of convolution, CNNs failed to model long-rangedependencies, leading to sub-optimal results, and it has the following two short-comings. Firstly, the small convolutional kernels attend to a local region and force the network to focus on local feature patterns rather than the global features.For medical image segmentation, it often requires long-range global information for reliable segmentation, since the shape and size of lesions vary greatly. Secondly, after train-ing, the convolutional kernels are static and fixed, and the well trained parameters cannot adapt to the input image contents. There-fore, CNNs lack flexibility for different inputs with different characteristics.（由于卷积的固有局部性，它不能对长距离的依赖性进行建模，导致了次优结果，并且它存在以下两个缺点。首先，小卷积核关注局部区域，迫使网络关注局部特征模式而不是全局特征。对于医学图像分割，由于病灶的形状和大小变化很大，因此通常需要长范围的全局信息来进行可靠的分割。其次，训练后的卷积核是静态的、固定的，训练好的参数不能适应输入图像的内容。因此，对于具有不同特性的不同输入，CNNs缺乏灵活性。）

With the long-range information interaction and dynamic feature encoding ability, Transformers have been applied for medical image segmentation.（Transformers具有长距离信息交互和动态特征编码能力，已被广泛应用于医学图像分割）

Although long-range dependencies can be modeled well in Transformer, it still has drawbacks and limits the performance. Firstly, the spatial information is ignored ， as it serializes images into 1D tokens, and it is weak in local feature learning,which is crucial for 2D images. Although this problem can be relieved by position encoding, while position encoding needs to adapt to the various input resolution through interpolation,thus affecting the performance. The second drawback of Transformer is the dependence on largescale datasets with quadratic computational complexity, which can be interpreted as low inductive bias in modeling local visual cues.
Finally, Transformer only learns features in a token-wise attention manner with single scale and cannot perceive multi-scale channel-wise feature dependencies, which is harmful for lesions with various shapes and scales. Therefore, there is still room for improvement in hybrid structure of CNNs andTransformers.

（虽然在Transformer中可以很好地建模长距离依赖关系，但它仍然存在缺陷并限制了性能。首先，由于它将图像序列化为一维符号，忽略了空间信息，并且在局部特征学习方面较弱，而局部特征学习对于二维图像是至关重要的。虽然这个问题可以通过位置编码来缓解，但是位置编码需要通过插值来适应各种输入分辨率，从而影响了性能。第二个缺点是Transformer是对具有二次计算复杂度的大规模数据集的依赖，这可以解释为在建模局部视觉线索时的低归纳偏差（****）。最后，Transformer仅以单尺度的令牌式注意方式学习特征，并且不能感知多尺度通道式特征依赖性，这对于具有各种形状和尺度的病变是有害）

主要贡献

• We proposed a hierarchical hybrid model that elegantly integrate the local information of the CNNs, multi-scale channel attention features and long-range features of Transformer within a unified block, which can integrate the merits of them simultaneously, and enhanced the feature representation ability of the model.（提出了一种分层混合模型，将CNN的局部信息、多尺度通道注意力特征和Transformer的远程特征巧妙地集成在一个统一的块内，可以同时集成它们的优点，增强了模型的特征表示能力。）
• A light-weight multi-scale channel attention (MSCA) branch is presented, which serializes the feature maps into multi-scale token pyramids, and then they are calibrated with channel-wise attention. MSCA benefits medical image segmentation task with different shapes and scales and is complementary to token-wise self-attention.（提出了一种轻量级的多尺度通道注意力（MSCA）分支，该分支将特征映射序列化为多尺度令牌金字塔，然后使用通道注意力对其进行校准。MSCA有利于不同形状和尺度的医学图像分割任务）
• Finally, we demonstrate the superiority of our method comprehensively in terms of performance, model parameters, FLOPs and inference time, which outperforms the
competing models on three 2D and two 3D medical image segmentation tasks, which demonstrates the effectiveness of our model

网络框架

Hybrid Transformer Block

#MSCA
class eca_layer(nn.Module):
    """Constructs a ECA module.
    Args:
        channel: Number of channels of the input feature map
        k_size: Adaptive selection of kernel size
    """
    def __init__(self, channel, k_size=3):
        super(eca_layer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False) 
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # feature descriptor on the global spatial information
        y = self.avg_pool(x)

        # Two different branches of ECA module
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)

        # Multi-scale information fusion
        y = self.sigmoid(y)

        return x * y.expand_as(x)+x
class PatchMerging(nn.Module):

    def __init__(self, dim, patch_size=[2,4], norm_layer=nn.LayerNorm):
        super().__init__()
        self.dim = dim
        self.reductions = nn.ModuleList()
        self.patch_size = patch_size
        self.norm = norm_layer(dim)
        self.eca = eca_layer(dim, 3)
        for i, ps in enumerate(patch_size):
            if i == len(patch_size) - 1:
                out_dim = 2 * dim // 2 ** i
            else:
                out_dim = 2 * dim // 2 ** (i + 1)
            stride = 2
            padding = (ps - stride) // 2
            #padding = math.ceil(((ps-1)*dilations[i] + 1 - stride) / 2)
            self.reductions.append(nn.Sequential(nn.Conv2d(dim, out_dim, kernel_size=ps, stride=stride, padding=padding)))

    def forward(self, x):
        B, L, C = x.shape
        x = self.norm(x)
        H = int(np.sqrt(L))
        W = int(np.sqrt(L))
        x = x.view(B, H, W, C).permute(0, 3, 1, 2).contiguous()

        xs = []
        for i in range(len(self.reductions)):
            tmp_x = self.reductions[i](x)
            xs.append(tmp_x)
        x = torch.cat(xs, dim=1)
        x = self.eca(x)
        return x

ms2 = self.MS2(x)
x = self.layer2(x)
x = x+ms2
x = x.flatten(2).transpose(1, 2)
x = self.swin_layers[1] (x)
B, L, C = x.shape
ms3 = self.MS3(x)
x = x.view(B, int(np.sqrt(L)), int(np.sqrt(L)), C).permute(0,3,1, 2)
encoder.append(x)