RT-DETR优化改进：Backbone改进｜RIFormer：无需TokenMixer也能达成SOTA性能的极简ViT架构 | CVPR2023

原创

AI小怪兽

发布于 2023-11-20 16:21:37

9450

发布于 2023-11-20 16:21:37

文章被收录于专栏：YOLO大作战

本文独家改进：RIFormer助力RT-DETR ，替换backbone， RIFormer-M36的吞吐量可达1185，同时精度高达82.6%；而PoolFormer-M36的吞吐量为109，精度为82.1%。

推荐指数：五星

1.RIFormer介绍

论文：https://arxiv.org/pdf/2304.05659.pdf

问题：Vision Transformer 已取得长足进步，token mixer，其优秀的建模能力已在各种视觉任务中被广泛证明，典型的 token mixer 为自注意力机制，推理耗时长，计算代价大。直接去除会导致模型结构先验不完整，从而带来显著的准确性下降。本文探索如何去掉 token mixer，并以 poolformer 为基准，探索在保证精度的同时，直接去掉 token mixer 模块！

本文基于重参数机制提出了RepIdentityFormer方案以研究无Token Mixer的架构体系。紧接着，作者改进了学习架构以打破无Token Mixer架构的局限性并总结了优化策略。搭配上所提优化策略后，本文构建了一种极致简单且具有优异性能的视觉骨干，此外它还具有高推理效率优势。

为什么这么做？

Token Mixer是ViT骨干非常重要的组成成分，它用于对不同空域位置信息进行自适应聚合，但常规的自注意力往往存在高计算复杂度与高延迟问题。而直接移除Token Mixer又会导致不完备的结构先验，进而导致严重的性能下降。

Token Mixer是ViT架构中用于空域信息聚合的关键模块，但由于采用了自注意力机制导致其计算量与内存消耗与图像尺寸强相关

重参数方法在各个领域得到了广泛的应用。RIFormer推理时的TokenMixer模块可以视作LN+Identity组合

表 6 展示了 RIFormer 在 ImageNet 分类上的结果。文章主要关注吞吐量指标，因为首要考量是满足边缘设备的延迟要求。如预期所示，比其他类型的骨干拥有明显的速度优势，因为 RIFormer 其构建块不包含任何 token mixer。

2. RIFormer引入到RT-DETR

2.1 加入ultralytics/nn/backbone/RIFormer.py

核心代码：

class RIFormer(nn.Module):
    """RIFormer.
    A PyTorch implementation of RIFormer introduced by:
    `RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer <https://arxiv.org/abs/xxxx.xxxxx>`_
    Args:
        arch (str | dict): The model's architecture. If string, it should be
            one of architecture in ``RIFormer.arch_settings``. And if dict, it
            should include the following two keys:
            - layers (list[int]): Number of blocks at each stage.
            - embed_dims (list[int]): The number of channels at each stage.
            - mlp_ratios (list[int]): Expansion ratio of MLPs.
            - layer_scale_init_value (float): Init value for Layer Scale.
            Defaults to 'S12'.
        norm_cfg (dict): The config dict for norm layers.
            Defaults to ``dict(type='LN2d', eps=1e-6)``.
        act_cfg (dict): The config dict for activation between pointwise
            convolution. Defaults to ``dict(type='GELU')``.
        in_patch_size (int): The patch size of/? input image patch embedding.
            Defaults to 7.
        in_stride (int): The stride of input image patch embedding.
            Defaults to 4.
        in_pad (int): The padding of input image patch embedding.
            Defaults to 2.
        down_patch_size (int): The patch size of downsampling patch embedding.
            Defaults to 3.
        down_stride (int): The stride of downsampling patch embedding.
            Defaults to 2.
        down_pad (int): The padding of downsampling patch embedding.
            Defaults to 1.
        drop_rate (float): Dropout rate. Defaults to 0.
        drop_path_rate (float): Stochastic depth rate. Defaults to 0.
        out_indices (Sequence | int): Output from which network position.
            Index 0-6 respectively corresponds to
            [stage1, downsampling, stage2, downsampling, stage3, downsampling, stage4]
            Defaults to -1, means the last stage.
        frozen_stages (int): Stages to be frozen (all param fixed).
            Defaults to -1, which means not freezing any parameters.
        deploy (bool): Whether to switch the model structure to
            deployment mode. Default: False.
        init_cfg (dict, optional): Initialization config dict
    """  # noqa: E501
 
    # --layers: [x,x,x,x], numbers of layers for the four stages
    # --embed_dims, --mlp_ratios:
    #     embedding dims and mlp ratios for the four stages
    # --downsamples: flags to apply downsampling or not in four blocks
    arch_settings = {
        's12': {
            'layers': [2, 2, 6, 2],
            'embed_dims': [64, 128, 320, 512],
            'mlp_ratios': [4, 4, 4, 4],
            'layer_scale_init_value': 1e-5,
        },
        's24': {
            'layers': [4, 4, 12, 4],
            'embed_dims': [64, 128, 320, 512],
            'mlp_ratios': [4, 4, 4, 4],
            'layer_scale_init_value': 1e-5,
        },
        's36': {
            'layers': [6, 6, 18, 6],
            'embed_dims': [64, 128, 320, 512],
            'mlp_ratios': [4, 4, 4, 4],
            'layer_scale_init_value': 1e-6,
        },
        'm36': {
            'layers': [6, 6, 18, 6],
            'embed_dims': [96, 192, 384, 768],
            'mlp_ratios': [4, 4, 4, 4],
            'layer_scale_init_value': 1e-6,
        },
        'm48': {
            'layers': [8, 8, 24, 8],
            'embed_dims': [96, 192, 384, 768],
            'mlp_ratios': [4, 4, 4, 4],
            'layer_scale_init_value': 1e-6,
        },
    }
 
    def __init__(self,
                 arch='s12',
                 weights = '',
                 in_channels=3,
                 norm_cfg=dict(type='GN', num_groups=1),
                 act_cfg=dict(type='GELU'),
                 in_patch_size=7,
                 in_stride=4,
                 in_pad=2,
                 down_patch_size=3,
                 down_stride=2,
                 down_pad=1,
                 drop_rate=0.,
                 drop_path_rate=0.,
                 out_indices=[0, 2, 4, 6],
                 deploy=False):
 
        super().__init__()
 
        if isinstance(arch, str):
            assert arch in self.arch_settings, \
                f'Unavailable arch, please choose from ' \
                f'({set(self.arch_settings)}) or pass a dict.'
            arch = self.arch_settings[arch]
        elif isinstance(arch, dict):
            assert 'layers' in arch and 'embed_dims' in arch, \
                f'The arch dict must have "layers" and "embed_dims", ' \
                f'but got {list(arch.keys())}.'
 
        layers = arch['layers']
        embed_dims = arch['embed_dims']
        mlp_ratios = arch['mlp_ratios'] \
            if 'mlp_ratios' in arch else [4, 4, 4, 4]
        layer_scale_init_value = arch['layer_scale_init_value'] \
            if 'layer_scale_init_value' in arch else 1e-5
 
        self.patch_embed = PatchEmbed(
            patch_size=in_patch_size,
            stride=in_stride,
            padding=in_pad,
            in_chans=in_channels,
            embed_dim=embed_dims[0])
 
        # set the main block in network
        network = []
        for i in range(len(layers)):
            stage = basic_blocks(
                embed_dims[i],
                i,
                layers,
                mlp_ratio=mlp_ratios[i],
                norm_cfg=norm_cfg,
                act_cfg=act_cfg,
                drop_rate=drop_rate,
                drop_path_rate=drop_path_rate,
                layer_scale_init_value=layer_scale_init_value,
                deploy=deploy)
            network.append(stage)
            if i >= len(layers) - 1:
                break
            if embed_dims[i] != embed_dims[i + 1]:
                # downsampling between two stages
                network.append(
                    PatchEmbed(
                        patch_size=down_patch_size,
                        stride=down_stride,
                        padding=down_pad,
                        in_chans=embed_dims[i],
                        embed_dim=embed_dims[i + 1]))
 
        self.network = nn.ModuleList(network)
 
        if isinstance(out_indices, int):
            out_indices = [out_indices]
        assert isinstance(out_indices, Sequence), \
            f'"out_indices" must by a sequence or int, ' \
            f'get {type(out_indices)} instead.'
        for i, index in enumerate(out_indices):
            if index < 0:
                out_indices[i] = 7 + index
                assert out_indices[i] >= 0, f'Invalid out_indices {index}'
        self.out_indices = out_indices
        if self.out_indices:
            for i_layer in self.out_indices:
                layer = build_norm_layer(norm_cfg,
                                         embed_dims[(i_layer + 1) // 2])[1]
                layer_name = f'norm{i_layer}'
                self.add_module(layer_name, layer)
 
        self.deploy = deploy
        if weights:
            self.load_state_dict(update_weight(self.state_dict(), torch.load(weights)['state_dict']))
        self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
 
    def forward_embeddings(self, x):
        x = self.patch_embed(x)
        return x
 
    def forward_tokens(self, x):
        outs = []
        for idx, block in enumerate(self.network):
            x = block(x)
            if idx in self.out_indices:
                norm_layer = getattr(self, f'norm{idx}')
                x_out = norm_layer(x)
                outs.append(x_out)
        return outs
    
    def forward(self, x):
        # input embedding
        x = self.forward_embeddings(x)
        # through backbone
        x = self.forward_tokens(x)
        return x

详见：

https://blog.csdn.net/m0_63774211/category_12497375.html

我正在参与2023腾讯技术创作特训营第三期有奖征文，组队打卡瓜分大奖！

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

2023腾讯·技术创作特训营第三期

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

2023腾讯·技术创作特训营第三期

登录后参与评论

0 条评论

热度