MLPs Move Into Downstream Vision Tasks: The Latest MLP Architectures for Object Detection and Segmentation

Deep Learning Technology Frontier (深度学习技术前沿), WeChat official account blogger

Published 2021-09-06 11:31:06

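[Introduction] With the publication of ResMLP, MLP-Mixer, and related work, MLP-based backbones have returned to computer vision. For image recognition, MLP-based architectures carry less inductive bias, yet they still reach performance comparable to CNNs and Vision Transformers. How do MLPs fare on other downstream vision tasks? Since June and July 2021, MLPs have moved into downstream vision tasks, with strong MLP architectures appearing for both detection and segmentation. This article summarizes the most recent of these architectures: AS-MLP, open-sourced by ShanghaiTech University and Tencent YouTu; CycleMLP, proposed by The University of Hong Kong and SenseTime; and Baidu's S2-MLP (V1 and V2), the strongest vision MLP architecture at the time of writing.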

AS-MLP, open-sourced by ShanghaiTech University and Tencent YouTu: an axially shifted MLP architecture

paper: https://arxiv.org/abs/2107.08391
code: https://github.com/svip-lab/AS-MLP

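This paper is an exploration of MLP architectures by ShanghaiTech University and Tencent YouTu. Unlike MLP-Mixer, which encodes global spatial features through matrix transposition and a token-mixing MLP, AS-MLP pays more attention to local feature communication. Its core design is an axial shift operation that lets spatial information flow in different directions. First, features are shifted spatially along the horizontal and vertical directions, so that features from different spatial positions end up aligned at the same position. Then an MLP combines these features, which is simple and effective. This lets the model capture more local dependencies and improves performance. The receptive field size and the dilation factor of an AS-MLP block can also be configured. With this simple architecture, AS-MLP outperforms other MLP architectures and reaches performance comparable to Transformer architectures such as Swin Transformer, while requiring fewer FLOPs. For example, AS-MLP obtains 83.3% Top-1 accuracy on ImageNet-1K with 88M parameters and 15.2 GFLOPs, without extra training data. Architecturally, AS-MLP adopts a hierarchical design similar to PVT, so it transfers easily to downstream tasks. AS-MLP is also the first MLP architecture applied to downstream tasks such as object detection and semantic segmentation. It reaches 51.5 mAP on the COCO validation set and 49.5 mIoU on ADE20K, comparable to Transformer architectures.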

Network architecture

Core code

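import torch.nn as nn

# MyNorm and Shift are helper modules defined elsewhere; they are not shown in this excerpt.
class AxialShift(nn.Module):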
    r""" Axial shift  
    Args:
        dim (int): Number of input channels.
        shift_size (int): shift size .
        as_bias (bool, optional):  If True, add a learnable bias to as mlp. Default: True
        proj_drop (float, optional): Dropout ratio of output. Default: 0.0
    """

    def __init__(self, dim, shift_size, as_bias=True, proj_drop=0.):

        super().__init__()
        self.dim = dim
        self.shift_size = shift_size
        self.pad = shift_size // 2
        self.conv1 = nn.Conv2d(dim, dim, 1, 1, 0, groups=1, bias=as_bias)
        self.conv2_1 = nn.Conv2d(dim, dim, 1, 1, 0, groups=1, bias=as_bias)
        self.conv2_2 = nn.Conv2d(dim, dim, 1, 1, 0, groups=1, bias=as_bias)
        self.conv3 = nn.Conv2d(dim, dim, 1, 1, 0, groups=1, bias=as_bias)

        self.actn = nn.GELU()

        self.norm1 = MyNorm(dim)
        self.norm2 = MyNorm(dim)

        self.shift_dim2 = Shift(self.shift_size, 2)                                                   
        self.shift_dim3 = Shift(self.shift_size, 3)

    def forward(self, x):
        """
        Args:
            x: input feature map with shape (B, C, H, W)
        """
        B_, C, H, W = x.shape

        x = self.conv1(x)
        x = self.norm1(x)
        x = self.actn(x)
       
        '''
        x = F.pad(x, (self.pad, self.pad, self.pad, self.pad) , "constant", 0)
        
        xs = torch.chunk(x, self.shift_size, 1)
        def shift(dim):
            x_shift = [ torch.roll(x_c, shift, dim) for x_c, shift in zip(xs, range(-self.pad, self.pad+1))]
            x_cat = torch.cat(x_shift, 1)
            x_cat = torch.narrow(x_cat, 2, self.pad, H)
            x_cat = torch.narrow(x_cat, 3, self.pad, W)
            return x_cat
        x_shift_lr = shift(3)
        x_shift_td = shift(2)
        '''
        
        x_shift_lr = self.shift_dim3(x)
        x_shift_td = self.shift_dim2(x)
        
        x_lr = self.conv2_1(x_shift_lr)
        x_td = self.conv2_2(x_shift_td)

        x_lr = self.actn(x_lr)
        x_td = self.actn(x_td)

        x = x_lr + x_td
        x = self.norm2(x)

        x = self.conv3(x)

        return x

    def extra_repr(self) -> str:
        return f'dim={self.dim}, shift_size={self.shift_size}'

    def flops(self, N):
        # calculate flops for 1 window with token length of N
        flops = 0
        # conv1 
        flops += N * self.dim * self.dim
        # norm 1
        flops += N * self.dim
        # conv2_1 conv2_2
        flops += N * self.dim * self.dim * 2
        # x_lr + x_td
        flops += N * self.dim
        # norm2
        flops += N * self.dim
        # conv3
        flops += N * self.dim * self.dim
        return flops
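
The `Shift` helper used by `AxialShift` is not shown in the listing above. The commented-out block inside `forward` suggests what it does, so the following is a minimal sketch that turns that block into a standalone function. The name `axial_shift` is hypothetical, and this is not necessarily how the repository implements `Shift`.

```python
import torch
import torch.nn.functional as F


def axial_shift(x, shift_size, dim):
    """Shift channel groups of x by different amounts along one spatial axis.

    x: tensor of shape (B, C, H, W); dim=2 shifts vertically, dim=3 horizontally.
    Hypothetical helper reconstructed from the commented-out block in forward().
    """
    B, C, H, W = x.shape
    pad = shift_size // 2
    # Zero-pad both spatial axes so that values rolled past the border are zeros.
    x = F.pad(x, (pad, pad, pad, pad), "constant", 0)
    # Split the channels into groups; group i is shifted by (i - pad) positions.
    # If C is not divisible by shift_size, chunk may return fewer groups.
    groups = torch.chunk(x, shift_size, 1)
    shifted = [torch.roll(g, s, dim) for g, s in zip(groups, range(-pad, pad + 1))]
    out = torch.cat(shifted, 1)
    # Crop the padding away again so the output has the input's spatial size.
    out = torch.narrow(out, 2, pad, H)
    out = torch.narrow(out, 3, pad, W)
    return out


# Three channels, each holding the row [1, 2, 3, 4, 5].
x = torch.arange(1.0, 6.0).view(1, 1, 1, 5).repeat(1, 3, 1, 1)
print(axial_shift(x, shift_size=3, dim=3)[0, :, 0].tolist())
# [[2.0, 3.0, 4.0, 5.0, 0.0], [1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0, 4.0]]
```

After the shift, a single spatial position holds values that originally sat at its left neighbour, itself, and its right neighbour in different channels, which is what lets the following 1x1 convolution mix neighbouring positions.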

CycleMLP: a general-purpose backbone for dense prediction tasks (open-sourced)

Affiliations: The University of Hong Kong, SenseTime
Code: https://github.com/ShoufaChen/CycleMLP
Paper: https://arxiv.org/abs/2107.1022

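This paper is an exploration of MLP architectures by The University of Hong Kong and SenseTime. It proposes CycleMLP, a simple MLP architecture that serves as a general-purpose backbone for visual recognition and dense prediction. CycleMLP aims to provide a competitive baseline for MLP models on object detection, instance segmentation, and semantic segmentation. Compared with other MLP architectures, CycleMLP has two main advantages: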

It handles images of varying sizes flexibly.
By operating on local windows, its computational complexity is linear in image size.

The main contributions of CycleMLP are:

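A new MLP module, CycleFC: a general plug-and-play operation that can replace the token-mixing MLP in MLP-Mixer.
CycleFC has computational complexity linear in image resolution. It is obtained by spreading the sampling points of Channel FC over a larger receptive field, which enlarges the receptive field while keeping the computation unchanged.
CycleMLP reaches 83.2% top-1 accuracy on ImageNet, comparable to Swin Transformer. On ImageNet, COCO, and ADE20K it also outperforms other MLP models.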
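
To make the linear-complexity claim concrete, here is a rough back-of-the-envelope sketch. It assumes square weight matrices, counts one multiply-accumulate per weight use, ignores biases, and assumes (as the text above states) that CycleFC costs the same as a channel FC. The function names are illustrative and do not come from the paper or repository.

```python
def channel_fc_macs(h, w, c):
    """Multiply-accumulates for a linear map over channels at every position.

    By the text's claim, CycleFC keeps this same cost.
    """
    return h * w * c * c


def spatial_fc_macs(h, w, c):
    """Multiply-accumulates for a linear map over all h*w tokens, per channel.

    This is the token-mixing style layer used here only for comparison.
    """
    n = h * w
    return n * n * c


# Doubling the side length quadruples the pixel count: the channel-wise cost
# grows 4x (linear in pixels), the token-mixing cost grows 16x (quadratic).
for side in (14, 28, 56):
    print(side, channel_fc_macs(side, side, 64), spatial_fc_macs(side, side, 64))
```

The token-mixing weight matrix also has a fixed size tied to the number of tokens, which is why such a layer cannot be applied directly to a different input resolution.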

Network architecture

Core code

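import math

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import init
from torch.nn.modules.utils import _pair
from torchvision.ops.deform_conv import deform_conv2d as deform_conv2d_tv


class CycleFC(nn.Module):
    """
    Fully connected layer with a 1x1 weight whose sampling position is offset
    per input channel; the offsets are fixed and applied via deform_conv2d.
    """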

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size,  # re-defined kernel_size, represent the spatial area of staircase FC
        stride: int = 1,
        padding: int = 0,
        dilation: int = 1,
        groups: int = 1,
        bias: bool = True,
    ):
        super(CycleFC, self).__init__()

        if in_channels % groups != 0:
            raise ValueError('in_channels must be divisible by groups')
        if out_channels % groups != 0:
            raise ValueError('out_channels must be divisible by groups')
        if stride != 1:
            raise ValueError('stride must be 1')
        if padding != 0:
            raise ValueError('padding must be 0')

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = _pair(stride)
        self.padding = _pair(padding)
        self.dilation = _pair(dilation)
        self.groups = groups

        self.weight = nn.Parameter(torch.empty(out_channels, in_channels // groups, 1, 1))  # kernel size == 1

        if bias:
            self.bias = nn.Parameter(torch.empty(out_channels))
        else:
            self.register_parameter('bias', None)
        self.register_buffer('offset', self.gen_offset())

        self.reset_parameters()

    def reset_parameters(self) -> None:
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

    def gen_offset(self):
        """
        offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width,
            out_height, out_width]): offsets to be applied for each position in the
            convolution kernel.
        """
        offset = torch.empty(1, self.in_channels*2, 1, 1)
        start_idx = (self.kernel_size[0] * self.kernel_size[1]) // 2
        assert self.kernel_size[0] == 1 or self.kernel_size[1] == 1, self.kernel_size
        for i in range(self.in_channels):
            if self.kernel_size[0] == 1:
                offset[0, 2 * i + 0, 0, 0] = 0
                offset[0, 2 * i + 1, 0, 0] = (i + start_idx) % self.kernel_size[1] - (self.kernel_size[1] // 2)
            else:
                offset[0, 2 * i + 0, 0, 0] = (i + start_idx) % self.kernel_size[0] - (self.kernel_size[0] // 2)
                offset[0, 2 * i + 1, 0, 0] = 0
        return offset

    def forward(self, input: Tensor) -> Tensor:
        """
        Args:
            input (Tensor[batch_size, in_channels, in_height, in_width]): input tensor
        """
        B, C, H, W = input.size()
        return deform_conv2d_tv(input, self.offset.expand(B, -1, H, W), self.weight, self.bias, stride=self.stride,
                                padding=self.padding, dilation=self.dilation)

    def extra_repr(self) -> str:
        s = self.__class__.__name__ + '('
        s += '{in_channels}'
        s += ', {out_channels}'
        s += ', kernel_size={kernel_size}'
        s += ', stride={stride}'
        s += ', padding={padding}' if self.padding != (0, 0) else ''
        s += ', dilation={dilation}' if self.dilation != (1, 1) else ''
        s += ', groups={groups}' if self.groups != 1 else ''
        s += ', bias=False' if self.bias is None else ''
        s += ')'
        return s.format(**self.__dict__)
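
To see the pattern that `gen_offset` produces, the sketch below repeats its arithmetic in plain Python and returns only the non-zero offset component for each input channel. The function name `cycle_offsets` is hypothetical; the formula is copied from the listing above.

```python
def cycle_offsets(in_channels, kernel_size):
    """Return the non-zero offset component for each input channel.

    Mirrors the arithmetic in gen_offset; kernel_size must be (1, k) or (k, 1).
    """
    kh, kw = kernel_size
    assert kh == 1 or kw == 1, kernel_size
    span = kw if kh == 1 else kh       # length of the axis the offsets walk along
    start_idx = (kh * kw) // 2         # makes channel 0 start at offset 0
    return [(i + start_idx) % span - span // 2 for i in range(in_channels)]


print(cycle_offsets(6, (1, 3)))  # [0, 1, -1, 0, 1, -1]
print(cycle_offsets(8, (1, 7)))  # [0, 1, 2, 3, -3, -2, -1, 0]
```

The offsets step through the window and then wrap around, so consecutive channels sample from different neighbouring positions. This cyclic pattern is where the receptive field comes from even though the weight itself is 1x1.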

Baidu's S^2MLP (V1): introducing the spatial-shift operation

Affiliation: Baidu Research
Paper: https://arxiv.org/abs/2106.0747

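To let different patches interact, earlier MLP backbones such as MLP-Mixer add a token-mixing MLP (the feature map is transposed and then passed through an MLP) on top of the usual channel-mixing MLP. To reach strong performance, these methods need pre-training on large datasets. The authors analyze the token-mixing step and find that it is equivalent to a depthwise convolution with a global receptive field and spatially specific weights, and that these two properties make the token-mixing MLP prone to overfitting. They therefore propose a new pure-MLP architecture, the Spatial-shift MLP. It drops token mixing and contains only channel-mixing MLPs, and it adds a spatial-shift operation to strengthen communication between patches. The resulting operation has a local receptive field and is spatial-agnostic.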

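Spatial shift operation: given an input X of shape (W, H, C), the input is first split into groups along the channel dimension. Only four shift directions are used, so there are four groups, each of shape (W, H, C/4). Each group is then shifted in a different direction: the first group is shifted by one step along the W dimension, and the second group by one step along W in the opposite direction. The other two groups are shifted in the same way along the H dimension. In code: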

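import torch

def spatial_shift(x):
    # x: tensor of shape (w, h, c); c is assumed to be divisible by 4
    w, h, c = x.size()
    q = c // 4  # integer division, since slice bounds must be ints
    out = x.clone()  # write into a copy so source and target slices never overlap
    out[1:, :, :q] = x[:w - 1, :, :q]                    # group 1: +1 along w
    out[:w - 1, :, q:2 * q] = x[1:, :, q:2 * q]          # group 2: -1 along w
    out[:, 1:, 2 * q:3 * q] = x[:, :h - 1, 2 * q:3 * q]  # group 3: +1 along h
    out[:, :h - 1, 3 * q:] = x[:, 1:, 3 * q:]            # group 4: -1 along h
    return out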
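
The local receptive field described above can be checked with a small experiment. The sketch below places a single non-zero patch at the centre of a 5x5 grid, applies a four-way shift, and lists where the signal lands. `shift_four_ways` is a hypothetical, self-contained restatement of the shift for this demo; it works on a copy rather than in place, which is an assumption about the intended semantics.

```python
import torch


def shift_four_ways(x):
    """Shift the four channel groups of a (w, h, c) tensor by one step each."""
    w, h, c = x.shape
    q = c // 4
    out = x.clone()  # read from x, write to out
    out[1:, :, :q] = x[:w - 1, :, :q]
    out[:w - 1, :, q:2 * q] = x[1:, :, q:2 * q]
    out[:, 1:, 2 * q:3 * q] = x[:, :h - 1, 2 * q:3 * q]
    out[:, :h - 1, 3 * q:] = x[:, 1:, 3 * q:]
    return out


# One "hot" patch at the centre, with one channel per shift direction.
x = torch.zeros(5, 5, 4)
x[2, 2, :] = 1.0

# Sum over channels to see which positions received the centre's features.
footprint = shift_four_ways(x).sum(dim=-1)
neighbours = torch.nonzero(footprint).tolist()
print(neighbours)  # [[1, 2], [2, 1], [2, 3], [3, 2]]
```

The centre's features end up at exactly its four direct neighbours, so a channel-mixing MLP applied afterwards at any position combines information from the four patches around it.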

S^2MLP (V2): the strongest vision MLP architecture to date, reaching 83.6% Top-1 accuracy on ImageNet

Affiliation: Baidu Research
Paper: https://arxiv.org/abs/2108.0107

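The spatial-shift MLP (S2-MLP) uses a spatial-shift operation and thereby outperforms ResMLP and MLP-Mixer. More recently, Vision Permutator (ViP) and Global Filter Network (GFNet), which use smaller patches and a pyramid structure, have in turn surpassed S2-MLP.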

This paper therefore improves the spatial-shift MLP (S2-MLP) and proposes the S2-MLPv2 model. Compared with S2-MLP, the main changes in S2-MLPv2 are:

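The feature map is expanded along the channel dimension and split into several groups along that dimension. Each group undergoes a different spatial-shift operation, and the groups are then fused by split-attention.
Smaller patches and a hierarchical pyramid structure are used to improve the modeling of fine-grained information and thus recognition accuracy.
With 55M parameters, the proposed S2-MLPv2-Medium reaches 83.6% on ImageNet (no extra pre-training data, input size 224x224).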

Figure: the expanded feature map is split into three parts. The parts are shifted separately and then fused by split-attention.

The code is as follows:

import torch

def spatial_shift1(x):
    # x: (b, w, h, c); the four channel groups shift along +w, -w, +h, -h
    b, w, h, c = x.size()
    q = c // 4  # integer division, since slice bounds must be ints
    out = x.clone()  # write into a copy so source and target slices never overlap
    out[:, 1:, :, :q] = x[:, :w - 1, :, :q]
    out[:, :w - 1, :, q:2 * q] = x[:, 1:, :, q:2 * q]
    out[:, :, 1:, 2 * q:3 * q] = x[:, :, :h - 1, 2 * q:3 * q]
    out[:, :, :h - 1, 3 * q:] = x[:, :, 1:, 3 * q:]
    return out

def spatial_shift2(x):
    # same idea with the group order swapped: +h, -h, +w, -w
    b, w, h, c = x.size()
    q = c // 4
    out = x.clone()
    out[:, :, 1:, :q] = x[:, :, :h - 1, :q]
    out[:, :, :h - 1, q:2 * q] = x[:, :, 1:, q:2 * q]
    out[:, 1:, :, 2 * q:3 * q] = x[:, :w - 1, :, 2 * q:3 * q]
    out[:, :w - 1, :, 3 * q:] = x[:, 1:, :, 3 * q:]
    return out

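import torch.nn as nn

class S2MLPv2(nn.Module):  # "S2-MLPv2" is not a valid Python identifier
    def __init__(self, channels):
        super().__init__()
        self.mlp1 = nn.Linear(channels, channels * 3)
        self.mlp2 = nn.Linear(channels, channels)
        # SplitAttention is not part of the original listing
        self.split_attention = SplitAttention()

    def forward(self, x):
        b, w, h, c = x.size()
        x = self.mlp1(x)  # expand channels from c to 3c
        # split the 3c channels into three parts of width c each
        x1 = spatial_shift1(x[:, :, :, :c])
        x2 = spatial_shift2(x[:, :, :, c:c * 2])
        x3 = x[:, :, :, c * 2:]  # the third part is left unshifted
        a = self.split_attention(x1, x2, x3)
        x = self.mlp2(a)
        return x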
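
The listing uses a `SplitAttention` module without defining it. As a rough illustration of what such a fusion step could look like, here is a minimal sketch: it summarizes the parts into one vector per sample, predicts a weight per part and channel, normalizes the weights across parts with a softmax, and returns the weighted sum. This is an assumed design, not the authors' code. The constructor arguments (`channels`, `k`), the layer sizes, and the GELU activation are all assumptions, and unlike the listing it takes the channel width explicitly.

```python
import torch
import torch.nn as nn


class SplitAttention(nn.Module):
    """Hypothetical fusion of k feature maps, each of shape (b, w, h, c)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.mlp1 = nn.Linear(channels, channels, bias=False)
        self.act = nn.GELU()
        self.mlp2 = nn.Linear(channels, channels * k, bias=False)

    def forward(self, *parts):
        x = torch.stack(parts, dim=1)                     # (b, k, w, h, c)
        b, k, w, h, c = x.shape
        summary = x.sum(dim=(1, 2, 3))                    # (b, c): pooled over parts and positions
        logits = self.mlp2(self.act(self.mlp1(summary)))  # (b, k * c)
        weights = logits.reshape(b, k, c).softmax(dim=1)  # sums to 1 across the k parts
        # Broadcast the per-part, per-channel weights over the spatial axes.
        return (x * weights[:, :, None, None, :]).sum(dim=1)  # (b, w, h, c)


fuse = SplitAttention(channels=8, k=3)
x1, x2, x3 = (torch.randn(2, 4, 4, 8) for _ in range(3))
print(fuse(x1, x2, x3).shape)  # torch.Size([2, 4, 4, 8])
```

Because the weights sum to one across the parts, fusing three identical feature maps returns that feature map unchanged, which is a convenient sanity check for an implementation like this.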

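Summary: compared with Transformer-based methods, the S2-MLPv2 model achieves highly competitive results without self-attention and with fewer parameters. Compared with pioneering MLP-based work such as MLP-Mixer and ResMLP, and with concurrent work including Vision Permutator and GFNet, another important advantage of the spatial-shift MLP is that its structure does not depend on the size of the input image. Weights pre-trained on images of one particular scale can therefore be reused for downstream tasks with inputs of various sizes. Future work will aim to keep improving the recognition accuracy of the spatial-shift MLP architecture. One direction is to try smaller patches and a more advanced four-stage pyramid, as in CycleMLP and AS-MLP, to further reduce FLOPs and narrow the recognition gap to Transformer-based models. (Baidu has not fully open-sourced this work yet; S^2MLP(V3) is probably already on the way!)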