
明月 Deep Learning Practice 004: Learning the ResNet Network Structure

Author: 明月AI
Published: 2021-10-28 14:18:46
Column: 野生AI架构师

ResNet is certainly famous. I had always just taken it off the shelf without properly studying it, and as an architecture proposed almost five years ago, so much has already been written about it that it felt hard to put pen to paper.

Taking advantage of the holiday, let me go through it properly. Relevant resources:

  1. Original paper: https://arxiv.org/pdf/1512.03385.pdf
  2. PyTorch implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

This article works through ResNet starting from the source code.

1. Overview of ResNet


The original ResNet paper introduces several configurations:

Unless otherwise noted, the screenshots are taken from the original paper.

Depending on network depth, the authors define five ResNet configurations, from 18 to 152 layers. Each configuration consists of five groups of convolutional layers, from conv1 and conv2_x through conv5_x. Breaking these layers down, there are really just three types:

1.1 The plain convolution conv1

conv1 is an ordinary convolution with a 7*7 kernel, 64 output channels and stride 2, producing an output of size 112*112. The table does not state the padding; in the official PyTorch implementation it is 3:

self.inplanes = 64
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(self.inplanes)
self.relu = nn.ReLU(inplace=True)

The input size is given in the paper: 224*224.

It is followed by a max pooling layer:

self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

At this point the output has become 56*56.

Why is the input 224*224, and why 56*56 after pooling?

My understanding is that these are largely parameters found by experiment to work well on the datasets the authors used (which are, of course, popular open datasets). But are they optimal in a specific scenario? I don't think that necessarily follows: in a specific scenario the input size is often fairly standardized, so better parameters may well exist.
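As a quick sanity check on these sizes, here is a minimal sketch of my own (not part of the original article) that pushes a dummy 224*224 input through the conv1 and maxpool layers defined above, with padding assumed to be 3 as in torchvision:

import torch
import torch.nn as nn

# conv1 and maxpool as shown above
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)      # dummy ImageNet-sized input
print(conv1(x).shape)                # torch.Size([1, 64, 112, 112])
print(maxpool(conv1(x)).shape)       # torch.Size([1, 64, 56, 56])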

1.2 The BasicBlock residual block

This is the 3*3-kernel part used in ResNet18 and ResNet34:

import torch.nn as nn

def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)
        return out

A residual block is simply two 3*3 convolutional layers stacked together:

For example, the conv2_x residual block structure of ResNet18:

And the so-called residual connection really comes down to a single line:

out += identity

It is a skip connection that adds the input to the output, to counter the vanishing/exploding gradient problems that deep networks tend to suffer from during training. The authors ran experiments on this in the paper:

When the network grows from 20 layers to 56 layers, both training error and test error increase significantly. This is fairly intuitive: after passing through so many layers, the original features may have all but disappeared.

For the residual addition to work, the input and output of the residual block must have the same shape.

Looking at the source code, there is also a downsample; we will come back to it later.
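To make the shape requirement concrete, here is a small usage sketch of the BasicBlock defined above. The stride-2, 64-to-128-channel setting is my own example; when the shapes differ, a downsample branch has to be supplied so the shortcut can be added:

import torch
import torch.nn as nn

# Shortcut branch: 1x1 conv + BN so the identity matches the block's output (example values)
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
block = BasicBlock(64, 128, stride=2, downsample=downsample)

x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 128, 28, 28]); without downsample, out += identity would fail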

1.3 The Bottleneck residual block

Apart from ResNet18 and ResNet34, the other configurations all use the Bottleneck residual block. As before, let's look at the code first:

import torch.nn as nn

def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)

def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
    
class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)
        return out

Schematically, it looks roughly like this:

The core idea is to use 1*1 convolutions to control the number of channels flexibly, so that the residual addition still works. Concretely, a 1*1 convolution first reduces the number of channels to a quarter (dimensionality reduction), a 3*3 convolution is then applied (keeping the number of channels unchanged), and finally another 1*1 convolution brings the number of channels back up to match the input (dimensionality expansion).

The benefit is a large reduction in the number of parameters. For the block in the figure above, the parameter count is 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69632, whereas implementing it with a BasicBlock would require 3x3x256x256x2 = 1179648 parameters, roughly 17 times more. Going from ResNet34 to ResNet50 the depth increases by nearly half while the computation grows only about 5% (admittedly this comparison is a bit of a cheat, since many of ResNet50's layers are 1*1 convolutions used purely to reduce and expand channels).
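The arithmetic above is easy to verify; a trivial sketch of the two weight counts (biases are disabled in these convolutions):

# 1x1 reduce + 3x3 conv + 1x1 expand, for 256 input/output channels and a width of 64
bottleneck_params = 1*1*256*64 + 3*3*64*64 + 1*1*64*256
# Two 3x3 convolutions at 256 channels, as a BasicBlock-style block would need
basic_params = 3*3*256*256 * 2
print(bottleneck_params, basic_params, basic_params / bottleneck_params)  # 69632 1179648 ~16.9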

Note how the number of channels for the reduction is computed:

width = int(planes * (base_width / 64.)) * groups

Here base_width corresponds to the width_per_group argument used when the model is constructed. With the default values, width simply equals planes; clearly, changing width_per_group and groups changes the number of channels after the reduction.

That said, I personally find it hard to believe that this reduction causes no loss of features; more likely the loss is just small.
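As a usage sketch of the Bottleneck defined above (my own example, matching the first block of conv2_x in ResNet50): inplanes is 64 but the output has planes * expansion = 256 channels, so a downsample branch is needed even at stride 1:

import torch
import torch.nn as nn

# Shortcut branch bringing the identity from 64 to 256 channels (example values)
downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(256),
)
block = Bottleneck(64, 64, stride=1, downsample=downsample)

x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 256, 56, 56]): spatial size kept, channels expanded 4x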

2. The ResNet implementation


Having covered the basics of ResNet, we can now look at its implementation:

import torch
import torch.nn as nn

# conv1x1, BasicBlock and Bottleneck refer to the definitions shown earlier
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x):
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

    def forward(self, x):
        return self._forward_impl(x)

The key method is _make_layer, which builds the residual blocks, and now we can look at downsample. It is triggered when the stride is not 1 or when the input channel count does not match the block's output channel count; in that case a 1*1 convolution adjusts the identity branch, and this downsample is applied only in the first residual block of each stage.
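A minimal sketch to see this behaviour, assuming the ResNet, BasicBlock and conv1x1 definitions above are in scope:

model = ResNet(BasicBlock, [2, 2, 2, 2])  # the ResNet18 configuration

# layer1 keeps stride 1 and 64 channels, so its first block has no downsample
print(model.layer1[0].downsample)  # None
# layer2 halves the spatial size and doubles the channels, so its first block gets one
print(model.layer2[0].downsample)  # Sequential(1x1 conv 64->128 stride 2, BatchNorm2d)
# later blocks in the same stage already match, so no downsample
print(model.layer2[1].downsample)  # None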

From this implementation it is not obvious why _forward_impl is split out into a separate function; implementing it directly in forward seems perfectly fine (the comment hints that it is related to TorchScript).

For training, ResNet18 is constructed roughly as:

ResNet(BasicBlock, [2, 2, 2, 2])

The second argument is the number of residual blocks in each of the four stages, conv2_x through conv5_x.

Now look at how ResNet50 is constructed:

ResNet(Bottleneck, [3, 4, 6, 3])

The others all follow the same pattern.
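The layer counts in the names can be recovered from these arguments. A quick check of my own: each BasicBlock contributes 2 convolutions and each Bottleneck 3, plus conv1 and the final fully connected layer:

print(2 * (2 + 2 + 2 + 2) + 2)  # 18 weighted layers for ResNet18
print(3 * (3 + 4 + 6 + 3) + 2)  # 50 weighted layers for ResNet50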

In addition, the PyTorch implementation of ResNet includes a few extra tricks, such as zero_init_residual.

3. ResNet variants


ResNet has now been out for four or five years and there is plenty of follow-up research. PyTorch also implements two variants:

3.1 ResNeXt

This architecture was proposed at the end of 2016. In the PyTorch implementation, only the construction arguments differ, adding grouped-convolution parameters:

# ResNeXt-101 32x8d model
kwargs['groups'] = 32
kwargs['width_per_group'] = 8
ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
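Plugging these values into the width formula quoted earlier shows the effect: for the first stage (planes=64), the 3*3 convolution runs on 256 channels split into 32 groups instead of on 64 plain channels (a quick check of my own):

groups, width_per_group, planes = 32, 8, 64
width = int(planes * (width_per_group / 64.)) * groups
print(width)  # 256: the grouped 3x3 convolution is much wider than plain ResNet50's 64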

3.2 Wide ResNet

This architecture was proposed in mid-2016, and again the PyTorch implementation differs very little:

kwargs['width_per_group'] = 64 * 2
ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)

The effect of this parameter is to double the number of channels after the reduction compared with standard ResNet, i.e. the reduction goes to 1/2 instead of 1/4.
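Again a quick check of my own with the width formula, for the first stage of this wide variant (planes=64, groups=1):

width = int(64 * (128 / 64.)) * 1
print(width)  # 128, i.e. half of the 256 output channels instead of a quarter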

ResNet has other upgraded versions as well; to be continued.

Originally published 2020-10-02 on the WeChat public account 野生AI架构师.
