Deep Learning: Faster R-CNN Paper Notes

Original article

Author: 肉松

Last modified: 2020-08-14 17:40:49

Included in the column: torch-detection-学习笔记

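1. Introduction

In this paper, we show that an algorithmic change, computing region proposals with a deep convolutional neural network, leads to an elegant and effective solution in which proposal computation is nearly cost-free given the computation of the detection network. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks. By sharing convolutions at test time, the marginal cost of computing proposals is small (e.g. 10ms per image).

To combine RPNs with the Fast R-CNN object detection network, we propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features shared between both tasks.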

2. FASTER R-CNN

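Faster R-CNN is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). In the terminology of the recently popular "attention" mechanisms for neural networks, the RPN module tells the Fast R-CNN module where to look.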

Loss function

Question: why are there 256 anchors for classification but 2400 for regression?
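For reference, the RPN loss as written in the Faster R-CNN paper is reproduced below; it is presumably the formula this heading refers to. In the paper, N_cls is the mini-batch size (256), N_reg is the number of anchor locations (about 2400), and lambda = 10 so that the two terms are roughly equally weighted. Read this way, the two numbers in the question are normalization constants rather than counts of anchors that get classified or regressed.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

Here p_i is the predicted object probability of anchor i, p_i^* is its ground-truth label (1 for a positive anchor, 0 for a negative one), and t_i, t_i^* are the predicted and ground-truth box parameterizations. The factor p_i^* means that the regression term is active only for positive anchors.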

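In our formulation, the features used for regression have the same spatial size (3×3) on the feature map. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As a result, thanks to the design of the anchors, boxes of various sizes can still be predicted even though the features have a fixed size/scale.

The region proposal task is only the first stage of Faster R-CNN; the downstream Fast R-CNN detector refines the proposals. In the second stage of our cascade, region-wise features are adaptively pooled from proposal boxes that cover the features of the regions more faithfully. We believe these features lead to more accurate detections.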

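3. Faster R-CNN Training

Faster R-CNN is trained starting from an already trained model (such as VGG_CNN_M_1024, VGG or ZF). In practice the training process has 6 steps:

1. Starting from the trained model, train the RPN; this corresponds to stage1_rpn_train.pt
2. Use the RPN trained in step 1 to collect proposals; this corresponds to rpn_test.pt
3. Train the Fast R-CNN network for the first time; this corresponds to stage1_fast_rcnn_train.pt
4. Train the RPN for the second time; this corresponds to stage2_rpn_train.pt
5. Use the RPN trained in step 4 to collect proposals again; this corresponds to rpn_test.pt
6. Train the Fast R-CNN network for the second time; this corresponds to stage2_fast_rcnn_train.pt

As you can see, the training process resembles an "iterative" procedure, although it only loops twice. It stops at two rounds because the authors note: "A similar alternating training can be run for more iterations, but we have observed negligible improvements", i.e. more rounds bring no further gain. The rest of this chapter walks through training following the 6 steps above.

Below is a flowchart of the training process, which should make things clearer: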

4. Faster R-CNN Code Reading

PyTorch code: https://github.com/chenyuntc/simple-faster-rcnn-pytorch

The simple-faster-rcnn-pytorch-master/model/utils/ folder: utils usually holds configuration and helper files. Its contents:

1. bbox_tools.py

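The code in bbox_tools.py consists mainly of four functions:

loc2bbox(src_bbox, loc) and bbox2loc(src_bbox, dst_bbox) are a pair of functions that do exactly opposite things.
loc2bbox() takes a source box and location offsets and computes the target box, while bbox2loc() takes a source box and a reference box and computes the offsets between them.
bbox_iou() computes the IoU (intersection over union) between two sets of bboxes.
generate_anchor_base() generates 9 anchors around a reference point, in original-image coordinates; ratios=[0.5, 1, 2] are the aspect ratios and anchor_scales=[8, 16, 32] are the scales.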
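To make the relationship between these functions concrete, here is a minimal NumPy sketch. It is not the repository's code: the `_sketch` names are hypothetical, and it assumes boxes are stored as (y_min, x_min, y_max, x_max) and offsets as (dy, dx, dh, dw) with log-scaled height and width.

```python
import numpy as np


def bbox2loc_sketch(src_bbox, dst_bbox):
    """Offsets (dy, dx, dh, dw) that move src_bbox onto dst_bbox."""
    h = src_bbox[:, 2] - src_bbox[:, 0]
    w = src_bbox[:, 3] - src_bbox[:, 1]
    cy = src_bbox[:, 0] + 0.5 * h
    cx = src_bbox[:, 1] + 0.5 * w

    dst_h = dst_bbox[:, 2] - dst_bbox[:, 0]
    dst_w = dst_bbox[:, 3] - dst_bbox[:, 1]
    dst_cy = dst_bbox[:, 0] + 0.5 * dst_h
    dst_cx = dst_bbox[:, 1] + 0.5 * dst_w

    # Guard against division by zero for degenerate source boxes.
    eps = np.finfo(np.float64).eps
    h = np.maximum(h, eps)
    w = np.maximum(w, eps)

    dy = (dst_cy - cy) / h
    dx = (dst_cx - cx) / w
    dh = np.log(dst_h / h)
    dw = np.log(dst_w / w)
    return np.stack((dy, dx, dh, dw), axis=1)


def loc2bbox_sketch(src_bbox, loc):
    """Inverse of bbox2loc_sketch: apply offsets to src_bbox."""
    h = src_bbox[:, 2] - src_bbox[:, 0]
    w = src_bbox[:, 3] - src_bbox[:, 1]
    cy = src_bbox[:, 0] + 0.5 * h
    cx = src_bbox[:, 1] + 0.5 * w

    new_cy = loc[:, 0] * h + cy
    new_cx = loc[:, 1] * w + cx
    new_h = np.exp(loc[:, 2]) * h
    new_w = np.exp(loc[:, 3]) * w
    return np.stack((new_cy - 0.5 * new_h, new_cx - 0.5 * new_w,
                     new_cy + 0.5 * new_h, new_cx + 0.5 * new_w), axis=1)


def bbox_iou_sketch(bbox_a, bbox_b):
    """Pairwise IoU between (N, 4) and (K, 4) boxes, result shape (N, K)."""
    tl = np.maximum(bbox_a[:, None, :2], bbox_b[None, :, :2])
    br = np.minimum(bbox_a[:, None, 2:], bbox_b[None, :, 2:])
    # Zero out pairs that do not overlap at all.
    inter = np.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)
    area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)


src = np.array([[0.0, 0.0, 10.0, 10.0]])
dst = np.array([[2.0, 2.0, 8.0, 12.0]])
loc = bbox2loc_sketch(src, dst)
restored = loc2bbox_sketch(src, loc)   # should reproduce dst
```

Encoding a box with bbox2loc and decoding it with loc2bbox returns the original box, which is what "a pair of functions that do exactly opposite things" means in practice.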

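In this code the scales correspond to three areas, (16*8)^2, (16*16)^2 and (16*32)^2, i.e. squares with sides of 128, 256 and 512. Three areas times three aspect ratios gives exactly 9 anchors, as illustrated below: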
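The arithmetic can be checked with a short sketch. This is an illustration under assumptions, not the repository's implementation: the function name is hypothetical, boxes are assumed to be (y_min, x_min, y_max, x_max), and the ratio is assumed to stretch the height (h = base * scale * sqrt(ratio)).

```python
import numpy as np


def generate_anchor_base_sketch(base_size=16, ratios=(0.5, 1, 2),
                                anchor_scales=(8, 16, 32)):
    """Return len(ratios) * len(anchor_scales) anchors centered on one cell."""
    center_y = center_x = base_size / 2.0
    anchors = np.zeros((len(ratios) * len(anchor_scales), 4), dtype=np.float32)
    for i, ratio in enumerate(ratios):
        for j, scale in enumerate(anchor_scales):
            # The area stays (base_size * scale)^2 for every ratio.
            h = base_size * scale * np.sqrt(ratio)
            w = base_size * scale * np.sqrt(1.0 / ratio)
            index = i * len(anchor_scales) + j
            anchors[index] = (center_y - h / 2.0, center_x - w / 2.0,
                              center_y + h / 2.0, center_x + w / 2.0)
    return anchors


anchor_base = generate_anchor_base_sketch()
areas = ((anchor_base[:, 2] - anchor_base[:, 0])
         * (anchor_base[:, 3] - anchor_base[:, 1]))
```

With the default arguments this yields 9 anchors whose areas are 128^2, 256^2 and 512^2, repeated once per aspect ratio, all centered on the point (8, 8).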

The function _enumerate_shifted_anchor in region_proposal_network.py produces all the anchors:

def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):  # use anchor_base to generate every anchor for the feature map
    # Enumerate all shifted anchors:
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    # return (K*A, 4)

    # !TODO: add support for torch.CudaTensor
    # xp = cuda.get_array_module(anchor_base)
    # it seems that it can't be boosted using GPU
    import numpy as xp
    shift_y = xp.arange(0, height * feat_stride, feat_stride)
    # vertical offsets (0, 16, 32, ...), one per feature-map row
    shift_x = xp.arange(0, width * feat_stride, feat_stride)
    # horizontal offsets (0, 16, 32, ...), one per feature-map column
    # meshgrid repeats shift_x along the rows and shift_y along the columns,
    # so both outputs have shape (height, width)
    shift_x, shift_y = xp.meshgrid(shift_x, shift_y)
    # shift_x = [[0, 16, 32, ...], [0, 16, 32, ...], [0, 16, 32, ...], ...]
    # shift_y = [[0, 0, 0, ...], [16, 16, 16, ...], [32, 32, 32, ...], ...]
    # Together they form a grid of offsets through which every point of the
    # feature map is mapped to its position in the original image.
    # Each row of shift is an offset pair (y, x, y, x): it moves the top-left
    # and bottom-right corners of every base anchor to that feature-map point.
    shift = xp.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)  # stack as columns

    A = anchor_base.shape[0]
    K = shift.shape[0]
    # This is the key step: add the coordinates of the 9 base anchors to
    # every offset. The four columns shift the top-left and bottom-right
    # corners together, so one broadcast addition yields all the anchors.
    # With K feature-map points and A (9) base anchors each, reshaping to
    # (K * A, 4) gives the final list of anchors.
    anchor = anchor_base.reshape((1, A, 4)) + \
             shift.reshape((1, K, 4)).transpose((1, 0, 2))
    anchor = anchor.reshape((K * A, 4)).astype(xp.float32)
    return anchor

See the comments in the code for the explanation.
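The broadcasting is easier to follow on a tiny input. The sketch below re-implements the same idea with a hypothetical name and two made-up base anchors on a 2 x 3 feature map; it assumes the (y_min, x_min, y_max, x_max) layout used above.

```python
import numpy as np


def enumerate_shifted_anchor_sketch(anchor_base, feat_stride, height, width):
    """Place every base anchor on every feature-map cell."""
    shift_y = np.arange(0, height * feat_stride, feat_stride)
    shift_x = np.arange(0, width * feat_stride, feat_stride)
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    # One (y, x, y, x) row per feature-map cell, in row-major order.
    shift = np.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)
    # (1, A, 4) + (K, 1, 4) broadcasts to (K, A, 4).
    anchor = anchor_base[None, :, :] + shift[:, None, :]
    return anchor.reshape(-1, 4).astype(np.float32)


# Two illustrative base anchors (A = 2) on a 2 x 3 feature map (K = 6).
base = np.array([[-8.0, -8.0, 24.0, 24.0],
                 [-24.0, -24.0, 40.0, 40.0]], dtype=np.float32)
anchors = enumerate_shifted_anchor_sketch(base, feat_stride=16, height=2, width=3)
```

The result has K * A = 12 rows. Rows 0 and 1 are the base anchors at cell (0, 0), rows 2 and 3 are the same anchors shifted 16 pixels in x, and so on. For a feature map of 37 x 50 cells with 9 base anchors the same function would return 37 * 50 * 9 = 16650 anchors, which is the order of magnitude behind the "about 20000" figure used in these notes.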

2. Creator_tools.py

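creator_tools.py also deserves a detailed explanation, because it covers essentially everything the network needs for optimization, and each of its three creators has its own role.

The overall structure contains three main classes: 1. AnchorTargetCreator 2. ProposalCreator 3. ProposalTargetCreator. The first two serve the RPN, and the third serves the RoIHead network.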

1. AnchorTargetCreator

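What does AnchorTargetCreator do? It supplies the training samples for the RPN, the network specific to Faster R-CNN. Anchors need a position correction before they become real RoIs, and the labeled samples produced by AnchorTargetCreator are what the RPN learns from.

Goal: use the ground-truth bboxes of each image to assign a ground truth to every anchor.

Input: the coordinates of the roughly 20000 anchors generated initially, and the ground-truth coordinates of all bboxes in the image.

Output: positive/negative labels of size (20000,) (only up to 128 are 1, the other sampled ones are 0, and all the rest are -1), and regression targets of size (20000, 4) (one for every anchor).

In label, -1 marks anchors that are ignored, 0 marks negative samples (background) and 1 marks positive samples.
Note that although only 256 anchors are sampled, the returned label still covers all anchors; it simply has 128 entries equal to 1, 128 equal to 0, and the rest equal to -1.
The function bbox2loc uses the returned indices argmax_ious to compute the regression targets loc. Then, using the indices recorded earlier, the roughly 15000 anchors that lie inside the image are mapped back to label and loc arrays of length 20000 (all other labels are set to -1 and all other locs to (0, 0, 0, 0)). The two 1×1 convolutions of the RPN output the predicted class label and location parameters loc, and AnchorTargetCreator generates the matching ground truth for them.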

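How AnchorTargetCreator selects samples

AnchorTargetCreator picks 256 of the 20000-odd anchors for binary classification, and provides position regression targets for all of them (15000?). It supplies the ground truth for the predictions. The selection rules are:

1. For each ground-truth bounding box, select the anchor that overlaps it the most as a sample.

2. From the remaining anchors, select those whose overlap with a ground-truth bounding box exceeds 0.7 as samples. Note that the number of positive samples may not exceed 128.

3. From the rest, randomly select anchors whose overlap with gt_bbox is below 0.3 as negative samples, so that positives and negatives total 256.

PS: note that for every sampled anchor gt_label is either 1 or 0, which gives a binary classification. When the regression loss is computed, only positive samples contribute; negative samples are excluded.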
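The three rules can be sketched as a labeling function over a precomputed IoU matrix. This is an illustration, not the repository's AnchorTargetCreator: the function name, its parameters and the use of `np.random.default_rng` are assumptions, and the filtering of anchors that fall outside the image is left out.

```python
import numpy as np


def assign_anchor_labels_sketch(iou, n_sample=256, pos_ratio=0.5,
                                pos_iou_thresh=0.7, neg_iou_thresh=0.3,
                                rng=None):
    """iou has shape (num_anchors, num_gt). Returns labels in {1, 0, -1}."""
    rng = np.random.default_rng() if rng is None else rng
    label = np.full(iou.shape[0], -1, dtype=np.int32)   # -1 = ignore
    max_iou = iou.max(axis=1)

    # Rule 3 (candidates): low overlap with every ground-truth box.
    label[max_iou < neg_iou_thresh] = 0
    # Rule 1: the best anchor for each ground-truth box is positive.
    gt_max_iou = iou.max(axis=0)
    label[np.where(iou == gt_max_iou)[0]] = 1
    # Rule 2: anchors with high overlap are positive.
    label[max_iou >= pos_iou_thresh] = 1

    # Cap the positives, then fill the rest of the batch with negatives.
    n_pos = int(pos_ratio * n_sample)
    pos_index = np.where(label == 1)[0]
    if len(pos_index) > n_pos:
        drop = rng.choice(pos_index, size=len(pos_index) - n_pos, replace=False)
        label[drop] = -1
    n_neg = n_sample - int(np.sum(label == 1))
    neg_index = np.where(label == 0)[0]
    if len(neg_index) > n_neg:
        drop = rng.choice(neg_index, size=len(neg_index) - n_neg, replace=False)
        label[drop] = -1
    return label


# Synthetic example: 600 anchors, one ground-truth box, IoUs from 0 to 0.99.
iou = np.linspace(0.0, 0.99, 600)[:, None]
label = assign_anchor_labels_sketch(iou, rng=np.random.default_rng(0))
```

In this synthetic case more than 128 anchors qualify as positive and more than 128 as negative, so both groups are subsampled to 128 and every other anchor keeps the label -1.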

S is the number of anchors (~20000).

R is the number of bounding boxes (ground truth, e.g. 4).

input:

bbox (array): Coordinates of bounding boxes. Its shape is (R, 4).

anchor (array): Coordinates of anchors. Its shape is (S, 4).

return:

loc: Offsets and scales to match the anchors to the ground truth bounding boxes. Its shape is (S, 4)

label: Labels of anchors with values (1=positive, 0=negative, -1=ignore). Its shape is (S,)

2. ProposalCreator

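ProposalCreator generates the RoIs. The whole process is forward computation only, with no backpropagation, so it can be computed entirely with numpy and tensors.

For each image, the feature map is used to compute, for the H/16 × W/16 × 9 (about 20000) anchors, the probability of being foreground and the corresponding location parameters; this is the forward pass of the RPN. The 12000 anchors with the highest probability are then selected, their positions are corrected with the regression parameters, and non-maximum suppression picks 2000 RoIs.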

class ProposalCreator:
    # unNOTE: I'll make it undifferential
    # unTODO: make sure it's ok
    # It's ok
    """Proposal regions are generated by calling this object.

    The :meth:`__call__` of this object outputs object detection proposals by
    applying estimated bounding box offsets
    to a set of anchors.

    This class takes parameters to control number of bounding boxes to
    pass to NMS and keep after NMS.
    If the parameters are negative, it uses all the bounding boxes supplied
    or keep all the bounding boxes returned by NMS.

    This class is used for Region Proposal Networks introduced in
    Faster R-CNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        nms_thresh (float): Threshold value used when calling NMS.
        n_train_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in train mode.
        n_train_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in train mode.
        n_test_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in test mode.
        n_test_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in test mode.
        force_cpu_nms (bool): If this is :obj:`True`,
            always use NMS in CPU mode. If :obj:`False`,
            the NMS mode is selected based on the type of inputs.
        min_size (int): A parameter to determine the threshold on
            discarding bounding boxes based on their sizes.

    """

    def __init__(self,
                 parent_model,
                 nms_thresh=0.7,
                 n_train_pre_nms=12000,
                 n_train_post_nms=2000,
                 n_test_pre_nms=6000,
                 n_test_post_nms=300,
                 min_size=16
                 ):
        self.parent_model = parent_model
        self.nms_thresh = nms_thresh
        self.n_train_pre_nms = n_train_pre_nms
        self.n_train_post_nms = n_train_post_nms
        self.n_test_pre_nms = n_test_pre_nms
        self.n_test_post_nms = n_test_post_nms
        self.min_size = min_size

    def __call__(self, loc, score,
                 anchor, img_size, scale=1.):
        """input should  be ndarray
        Propose RoIs.

        Inputs :obj:`loc, score, anchor` refer to the same anchor when indexed
        by the same index.

        On notations, :math:`R` is the total number of anchors. This is equal
        to product of the height and the width of an image and the number of
        anchor bases per pixel.

        Type of the output is same as the inputs.

        Args:
            loc (array): Predicted offsets and scaling to anchors.
                Its shape is :math:`(R, 4)`.
            score (array): Predicted foreground probability for anchors.
                Its shape is :math:`(R,)`.
            anchor (array): Coordinates of anchors. Its shape is
                :math:`(R, 4)`.
            img_size (tuple of ints): A tuple :obj:`height, width`,
                which contains image size after scaling.
            scale (float): The scaling factor used to scale an image after
                reading it from a file.

        Returns:
            array:
            An array of coordinates of proposal boxes.
            Its shape is :math:`(S, 4)`. :math:`S` is less than
            :obj:`self.n_test_post_nms` in test time and less than
            :obj:`self.n_train_post_nms` in train time. :math:`S` depends on
            the size of the predicted bounding boxes and the number of
            bounding boxes discarded by NMS.

        """
        # NOTE: when testing, remember to call
        # faster_rcnn.eval()
        # to set self.training = False
        if self.parent_model.training:
            n_pre_nms = self.n_train_pre_nms
            n_post_nms = self.n_train_post_nms
        else:
            n_pre_nms = self.n_test_pre_nms
            n_post_nms = self.n_test_post_nms

        # Convert anchors into proposal via bbox transformations.
        # roi = loc2bbox(anchor, loc)
        roi = loc2bbox(anchor, loc)

        # Clip predicted boxes to image.
        roi[:, slice(0, 4, 2)] = np.clip(
            roi[:, slice(0, 4, 2)], 0, img_size[0])
        roi[:, slice(1, 4, 2)] = np.clip(
            roi[:, slice(1, 4, 2)], 0, img_size[1])

        # Remove predicted boxes with either height or width < threshold.
        min_size = self.min_size * scale
        hs = roi[:, 2] - roi[:, 0]
        ws = roi[:, 3] - roi[:, 1]
        keep = np.where((hs >= min_size) & (ws >= min_size))[0]
        roi = roi[keep, :]
        score = score[keep]

        # Sort all (proposal, score) pairs by score from highest to lowest.
        # Take top pre_nms_topN (e.g. 6000).
        order = score.ravel().argsort()[::-1]
        if n_pre_nms > 0:
            order = order[:n_pre_nms]
        roi = roi[order, :]
        score = score[order]

        # Apply nms (e.g. threshold = 0.7).
        # Take after_nms_topN (e.g. 300).

        # unNOTE: something is wrong here!
        # TODO: remove cuda.to_gpu
        keep = nms(
            torch.from_numpy(roi).cuda(),
            torch.from_numpy(score).cuda(),
            self.nms_thresh)
        if n_post_nms > 0:
            keep = keep[:n_post_nms]
        roi = roi[keep.cpu().numpy()]
        return roi

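The constructor first sets a few parameters, such as nms_thresh=0.7, different sample counts for training and testing, min_size=16 and so on. At the start of __call__ the code picks the training or the testing parameters. Then roi = loc2bbox(anchor, loc) applies the predicted offsets to the anchors, and np.clip(roi[:, slice(0, 4, 2)], 0, img_size[0]) clips the resulting RoIs to the bounds of the image.

Next the height and width of each RoI are computed, and any RoI with either one below the min_size set earlier is dropped. The scores of the remaining RoIs are flattened and sorted, and only the top 12000/6000 by foreground probability are kept (the training and testing configurations respectively). Finally non-maximum suppression removes the duplicates, and the filtered RoIs are returned.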
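The class above delegates the last step to an external `nms` function. For readers who want to see what that step does, here is a plain NumPy sketch of greedy non-maximum suppression. It is a stand-in written for illustration, not the routine the repository calls; the name `nms_sketch` is hypothetical and boxes are assumed to be (y_min, x_min, y_max, x_max).

```python
import numpy as np


def nms_sketch(boxes, scores, iou_thresh=0.7):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        top_left = np.maximum(boxes[best, :2], boxes[rest, :2])
        bottom_right = np.minimum(boxes[best, 2:], boxes[rest, 2:])
        size = np.clip(bottom_right - top_left, 0, None)
        inter = size[:, 0] * size[:, 1]
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop every box that overlaps the best one too much.
        order = rest[iou <= iou_thresh]
    return np.array(keep, dtype=np.int64)


boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [0.0, 1.0, 10.0, 11.0],     # heavy overlap with box 0
                  [50.0, 50.0, 60.0, 60.0]])  # far away from both
scores = np.array([0.9, 0.8, 0.7])
kept = nms_sketch(boxes, scores, iou_thresh=0.7)
```

Box 1 overlaps box 0 with an IoU of about 0.82, above the 0.7 threshold, so it is suppressed; box 2 does not overlap and survives. Slicing the returned indices with `[:n_post_nms]` then gives the final 2000 or 300 RoIs.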

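3. ProposalTargetCreator

ProposalTargetCreator supplies ground-truth samples for training the RoIHead network. And what is the RoIHead network? It takes the RoIs and predicts, for each one, one of n_class classes and the final detection box. It is the network that produces the final output, so the quality of the final result depends entirely on it. Its workflow:

ProposalCreator produces 2000 RoIs, but not all of them are used for training. ProposalTargetCreator filters them down to 128 for training, with the following rules:

1. From the RoIs whose IoU with a ground-truth bbox is at least 0.5, pick some (32 in this experiment) as positive samples.

2. From the RoIs whose IoU with the ground-truth bboxes lies in [0, 0.5), pick some (here 128 - 32 = 96) as negative samples.

3. Then train the RoIHead on these samples.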

class ProposalTargetCreator(object):
    """Assign ground truth bounding boxes to given RoIs.

    The :meth:`__call__` of this class generates training targets
    for each object proposal.
    This is used to train Faster RCNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        n_sample (int): The number of sampled regions.
        pos_ratio (float): Fraction of regions that is labeled as a
            foreground.
        pos_iou_thresh (float): IoU threshold for a RoI to be considered as a
            foreground.
        neg_iou_thresh_hi (float): RoI is considered to be the background
            if IoU is in
            [:obj:`neg_iou_thresh_lo`, :obj:`neg_iou_thresh_hi`).
        neg_iou_thresh_lo (float): See above.

    """

    def __init__(self,
                 n_sample=128,
                 pos_ratio=0.25, pos_iou_thresh=0.5,
                 neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0
                 ):
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_hi = neg_iou_thresh_hi
        self.neg_iou_thresh_lo = neg_iou_thresh_lo  # NOTE:default 0.1 in py-faster-rcnn

    def __call__(self, roi, bbox, label,
                 loc_normalize_mean=(0., 0., 0., 0.),
                 loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
        """Assigns ground truth to sampled proposals.

        This function samples total of :obj:`self.n_sample` RoIs
        from the combination of :obj:`roi` and :obj:`bbox`.
        The RoIs are assigned with the ground truth class labels as well as
        bounding box offsets and scales to match the ground truth bounding
        boxes. As many as :obj:`pos_ratio * self.n_sample` RoIs are
        sampled as foregrounds.

        Offsets and scales of bounding boxes are calculated using
        :func:`model.utils.bbox_tools.bbox2loc`.
        Also, types of input arrays and output arrays are same.

        Here are notations.

        * :math:`S` is the total number of sampled RoIs, which equals \
            :obj:`self.n_sample`.
        * :math:`L` is number of object classes possibly including the \
            background.

        Args:
            roi (array): Region of Interests (RoIs) from which we sample.
                Its shape is :math:`(R, 4)`
            bbox (array): The coordinates of ground truth bounding boxes.
                Its shape is :math:`(R', 4)`.
            label (array): Ground truth bounding box labels. Its shape
                is :math:`(R',)`. Its range is :math:`[0, L - 1]`, where
                :math:`L` is the number of foreground classes.
            loc_normalize_mean (tuple of four floats): Mean values to normalize
                coordinates of bounding boxes.
            loc_normalize_std (tuple of four floats): Standard deviation of
                the coordinates of bounding boxes.

        Returns:
            (array, array, array):

            * **sample_roi**: Regions of interests that are sampled. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_loc**: Offsets and scales to match \
                the sampled RoIs to the ground truth bounding boxes. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_label**: Labels assigned to sampled RoIs. Its shape is \
                :math:`(S,)`. Its range is :math:`[0, L]`. The label with \
                value 0 is the background.

        """
        n_bbox, _ = bbox.shape

        roi = np.concatenate((roi, bbox), axis=0)  # the ground-truth boxes are also added as candidate RoIs

        pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)
        iou = bbox_iou(roi, bbox)
        gt_assignment = iou.argmax(axis=1)
        max_iou = iou.max(axis=1)
        # Offset range of classes from [0, n_fg_class - 1] to [1, n_fg_class].
        # The label with value 0 is the background.
        gt_roi_label = label[gt_assignment] + 1

        # Select foreground RoIs as those with >= pos_iou_thresh IoU.
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
        pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size))
        if pos_index.size > 0:
            pos_index = np.random.choice(
                pos_index, size=pos_roi_per_this_image, replace=False)

        # Select background RoIs as those within
        # [neg_iou_thresh_lo, neg_iou_thresh_hi).
        neg_index = np.where((max_iou < self.neg_iou_thresh_hi) &
                             (max_iou >= self.neg_iou_thresh_lo))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
        neg_roi_per_this_image = int(min(neg_roi_per_this_image,
                                         neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(
                neg_index, size=neg_roi_per_this_image, replace=False)

        # The indices that we're selecting (both positive and negative).
        keep_index = np.append(pos_index, neg_index)
        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0  # negative labels --> 0
        sample_roi = roi[keep_index]

        # Compute offsets and scales to match sampled RoIs to the GTs.
        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)
                       ) / np.array(loc_normalize_std, np.float32))

        return sample_roi, gt_roi_loc, gt_roi_label

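These targets are fed into the full network for training, so the location targets are normalized (the mean is subtracted and the result is divided by the standard deviation).

First bbox.shape gives n_bbox, the number of ground-truth boxes.
Then bbox and rois are concatenated, the required number of positive samples is determined, and bbox_iou computes the IoU. The maximum is taken along each row, returning both the index of the maximum and the IoU value itself.
The indices of the maxima are then used to look up the labels, which are shifted by +1 from [0, n_fg_class - 1] to [1, n_fg_class].
Positive and negative samples are selected according to the maximum IoU; if too many are found, some are randomly dropped.
The positive and negative indices are then concatenated to get the corresponding labels.
The labels of all negative samples are set to 0, which settles the labels of the selected samples.
Finally sample_roi is extracted, loc is computed from its offsets to the matching bbox, and sample_roi, gt_roi_loc and gt_roi_label are returned.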

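3. Building the network

Note that in the network structure diagram, the lines with blue arrows belong to the computation graph, and gradients backpropagate through them. The red lines do not need backpropagation.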

1. region_proposal_network.py

class RegionProposalNetwork(nn.Module):
    """Region Proposal Network introduced in Faster R-CNN.

    Args:
        in_channels (int): The channel size of input.
        mid_channels (int): The channel size of the intermediate tensor.
        ratios (list of floats): This is ratios of width to height of
            the anchors.
        anchor_scales (list of numbers): This is areas of anchors.
            Those areas will be the product of the square of an element in
            :obj:`anchor_scales` and the original area of the reference
            window.
        feat_stride (int): Stride size after extracting features from an
            image.
        initialW (callable): Initial weight value. If :obj:`None` then this
            function uses Gaussian distribution scaled by 0.1 to
            initialize weight.
            May also be a callable that takes an array and edits its values.
        proposal_creator_params (dict): Key valued parameters for
            :class:`model.utils.creator_tools.ProposalCreator`.

    .. seealso::
        :class:`~model.utils.creator_tools.ProposalCreator`

    """

    def __init__(
            self, in_channels=512, mid_channels=512, ratios=[0.5, 1, 2],
            anchor_scales=[8, 16, 32], feat_stride=16,
            proposal_creator_params=dict(),
    ):
        super(RegionProposalNetwork, self).__init__()
        self.anchor_base = generate_anchor_base(
            anchor_scales=anchor_scales, ratios=ratios)
        self.feat_stride = feat_stride
        self.proposal_layer = ProposalCreator(self, **proposal_creator_params)
        n_anchor = self.anchor_base.shape[0]
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)
        self.score = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0)
        self.loc = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0)
        normal_init(self.conv1, 0, 0.01)
        normal_init(self.score, 0, 0.01)
        normal_init(self.loc, 0, 0.01)

    def forward(self, x, img_size, scale=1.):
        """Forward Region Proposal Network.

        Here are notations.

        * :math:`N` is batch size.
        * :math:`C` channel size of the input.
        * :math:`H` and :math:`W` are height and width of the input feature.
        * :math:`A` is number of anchors assigned to each pixel.

        Args:
            x (~torch.autograd.Variable): The Features extracted from images.
                Its shape is :math:`(N, C, H, W)`.
            img_size (tuple of ints): A tuple :obj:`height, width`,
                which contains image size after scaling.
            scale (float): The amount of scaling done to the input images after
                reading them from files.

        Returns:
            (~torch.autograd.Variable, ~torch.autograd.Variable, array, array, array):

            This is a tuple of five following values.

            * **rpn_locs**: Predicted bounding box offsets and scales for \
                anchors. Its shape is :math:`(N, H W A, 4)`.
            * **rpn_scores**:  Predicted foreground scores for \
                anchors. Its shape is :math:`(N, H W A, 2)`.
            * **rois**: A bounding box array containing coordinates of \
                proposal boxes.  This is a concatenation of bounding box \
                arrays from multiple images in the batch. \
                Its shape is :math:`(R', 4)`. Given :math:`R_i` predicted \
                bounding boxes from the :math:`i` th image, \
                :math:`R' = \\sum _{i=1} ^ N R_i`.
            * **roi_indices**: An array containing indices of images to \
                which RoIs correspond to. Its shape is :math:`(R',)`.
            * **anchor**: Coordinates of enumerated shifted anchors. \
                Its shape is :math:`(H W A, 4)`.

        """
        n, _, hh, ww = x.shape
        anchor = _enumerate_shifted_anchor(
            np.array(self.anchor_base),
            self.feat_stride, hh, ww)  # (hh*ww*9, 4)

        n_anchor = anchor.shape[0] // (hh * ww)  # n_anchor = 9
        h = F.relu(self.conv1(x))  # 512 3x3 convolutions, output (n, 512, hh, ww)
        # n_anchor (9) * 4 1x1 convolutions regress the coordinate offsets,
        # output (n, 9*4, hh, ww)
        rpn_locs = self.loc(h)
        # UNNOTE: check whether need contiguous
        # A: Yes
        rpn_locs = rpn_locs.permute(0, 2, 3, 1).contiguous().view(n, -1, 4)  # (n, hh*ww*9, 4)
        rpn_scores = self.score(h)  # (n, 9*2, hh, ww)
        rpn_scores = rpn_scores.permute(0, 2, 3, 1).contiguous()  # (n, hh, ww, 9*2)
        rpn_softmax_scores = F.softmax(rpn_scores.view(n, hh, ww, n_anchor, 2), dim=4)  # (n, hh, ww, 9, 2)
        rpn_fg_scores = rpn_softmax_scores[:, :, :, :, 1].contiguous()
        rpn_fg_scores = rpn_fg_scores.view(n, -1)  # foreground probability of every anchor, (n, hh*ww*9)
        rpn_scores = rpn_scores.view(n, -1, 2)  # raw network output for every anchor on the feature map, (n, hh*ww*9, 2)

        rois = list()
        roi_indices = list()
        # n is the batch size
        for i in range(n): # for each image
            roi = self.proposal_layer(
                rpn_locs[i].cpu().data.numpy(),
                rpn_fg_scores[i].cpu().data.numpy(),
                anchor, img_size,
                scale=scale)
            batch_index = i * np.ones((len(roi),), dtype=np.int32)
            rois.append(roi)  # rois collects the RoIs of every image in the batch
            roi_indices.append(batch_index)
            # concatenated row-wise below (no batch dimension; every row
            # holds the four coordinates of one RoI)

        rois = np.concatenate(rois, axis=0)
        roi_indices = np.concatenate(roi_indices, axis=0)  # with several images, the indices tell which image each RoI belongs to
        # rpn_locs has shape (n, hh*ww*9, 4), rpn_scores has shape (n, hh*ww*9, 2)
        # rois has shape (2000, 4)
        # anchor has shape (hh*ww*9, 4)
        return rpn_locs, rpn_scores, rois, roi_indices, anchor

2. model/faster_rcnn.py

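This script defines FasterRCNN, the base class of Faster R-CNN. As we know, Faster R-CNN has three core steps:

• Feature extraction: take an image as input and produce its feature map

• RPN: given the feature map, produce a set of RoIs

• RoI Head: use the feature map regions corresponding to these RoIs to classify the objects in them and to improve localization accuracy

The FasterRCNN class initializes these three components as self.extractor, self.rpn and self.head.

In the FasterRCNN class, the forward function implements the forward pass.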

def forward(self, x, scale=1.):
        """Forward Faster R-CNN.

        Scaling parameter :obj:`scale` is used by RPN to determine the
        threshold to select small objects, which are going to be
        rejected irrespective of their confidence scores.

        Here are notations used.

        * :math:`N` is the number of batch size
        * :math:`R'` is the total number of RoIs produced across batches. \
            Given :math:`R_i` proposed RoIs from the :math:`i` th image, \
            :math:`R' = \\sum _{i=1} ^ N R_i`.
        * :math:`L` is the number of classes excluding the background.

        Classes are ordered by the background, the first class, ..., and
        the :math:`L` th class.

        Args:
            x (autograd.Variable): 4D image variable.
            scale (float): Amount of scaling applied to the raw image
                during preprocessing.

        Returns:
            Variable, Variable, array, array:
            Returns tuple of four values listed below.

            * **roi_cls_locs**: Offsets and scalings for the proposed RoIs. \
                Its shape is :math:`(R', (L + 1) \\times 4)`.
            * **roi_scores**: Class predictions for the proposed RoIs. \
                Its shape is :math:`(R', L + 1)`.
            * **rois**: RoIs proposed by RPN. Its shape is \
                :math:`(R', 4)`.
            * **roi_indices**: Batch indices of RoIs. Its shape is \
                :math:`(R',)`.

        """
        img_size = x.shape[2:]

        h = self.extractor(x)
        rpn_locs, rpn_scores, rois, roi_indices, anchor = \
            self.rpn(h, img_size, scale)
        roi_cls_locs, roi_scores = self.head(
            h, rois, roi_indices)
        return roi_cls_locs, roi_scores, rois, roi_indices

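3. Class FasterRCNNVGG16

The class FasterRCNNVGG16 inherits from FasterRCNN.

VGG16 is loaded first and then split into a feature-extraction network and a classification network. The first few layers of the feature extractor are frozen and take no part in backpropagation. Then the VGG16RoIHead network is implemented: it takes the feature map, rois and roi_indices as input and outputs roi_cls_locs and roi_scores.

The FasterRCNNVGG16 class instantiates the VGG16 feature extractor, the classifier, the RPN and the VGG16RoIHead network.

In addition, when the weights of the fully connected layers of VGG16RoIHead are initialized, there are two initialization methods, chosen by whether truncated is set (a truncated normal distribution or a plain one).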

class VGG16RoIHead(nn.Module):
    """Faster R-CNN Head for VGG-16 based implementation.
    This class is used as a head for Faster R-CNN.
    This outputs class-wise localizations and classification based on feature
    maps in the given RoIs.
    
    Args:
        n_class (int): The number of classes possibly including the background.
        roi_size (int): Height and width of the feature maps after RoI-pooling.
        spatial_scale (float): Scale of the roi is resized.
        classifier (nn.Module): Two layer Linear ported from vgg16

    """

    def __init__(self, n_class, roi_size, spatial_scale,
                 classifier):
        # n_class includes the background
        super(VGG16RoIHead, self).__init__()
        # the last two fully connected layers of VGG16
        self.classifier = classifier
        self.cls_loc = nn.Linear(4096, n_class * 4)
        self.score = nn.Linear(4096, n_class)
        # initialize the weights of the fully connected layers
        normal_init(self.cls_loc, 0, 0.001)
        normal_init(self.score, 0, 0.01)
        # 21 classes including the background
        self.n_class = n_class
        # 7x7
        self.roi_size = roi_size
        # 1/16
        self.spatial_scale = spatial_scale
        # RoI pooling turns RoIs of different sizes into features of one
        # size, e.g. [300, 512, 7, 7] (the original note says this was
        # implemented with CuPy and compiled on the fly)
        self.roi = RoIPool((self.roi_size, self.roi_size), self.spatial_scale)

    def forward(self, x, rois, roi_indices):
        """Forward the chain.

        We assume that there are :math:`N` batches.

        Args:
            x (Variable): 4D image variable.
            rois (Tensor): A bounding box array containing coordinates of
                proposal boxes.  This is a concatenation of bounding box
                arrays from multiple images in the batch.
                Its shape is :math:`(R', 4)`. Given :math:`R_i` proposed
                RoIs from the :math:`i` th image,
                :math:`R' = \\sum _{i=1} ^ N R_i`.
            roi_indices (Tensor): An array containing indices of images to
                which bounding boxes correspond to. Its shape is :math:`(R',)`.

        """
        # in case roi_indices is  ndarray
        roi_indices = at.totensor(roi_indices).float()
        rois = at.totensor(rois).float()
        indices_and_rois = t.cat([roi_indices[:, None], rois], dim=1)
        # NOTE: important: yx->xy
        xy_indices_and_rois = indices_and_rois[:, [0, 2, 1, 4, 3]]
        # make the tensor contiguous in memory
        indices_and_rois = xy_indices_and_rois.contiguous()

        pool = self.roi(x, indices_and_rois)
        # flatten
        pool = pool.view(pool.size(0), -1)
        # classifier comes from decom_vgg16() and outputs 4096 features
        fc7 = self.classifier(pool)
        roi_cls_locs = self.cls_loc(fc7)
        roi_scores = self.score(fc7)
        return roi_cls_locs, roi_scores

Question: why are the coordinates reordered here?

# NOTE: important: yx->xy
xy_indices_and_rois = indices_and_rois[:, [0, 2, 1, 4, 3]]
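A likely answer, offered as my reading rather than something the notes state: the boxes in this code base are stored as (y_min, x_min, y_max, x_max), while RoI pooling operators such as torchvision.ops.RoIPool expect each row as (batch_index, x1, y1, x2, y2). The index list [0, 2, 1, 4, 3] keeps the batch index in place and swaps each (y, x) pair. The NumPy sketch below uses made-up coordinates to show the effect.

```python
import numpy as np

# One RoI in (y_min, x_min, y_max, x_max) order; the values are illustrative.
rois_yx = np.array([[10.0, 20.0, 110.0, 220.0]], dtype=np.float32)
roi_indices = np.zeros(1, dtype=np.float32)   # the RoI belongs to image 0

# Prepend the batch index: (batch_index, y_min, x_min, y_max, x_max).
indices_and_rois = np.concatenate([roi_indices[:, None], rois_yx], axis=1)

# Swap each (y, x) pair: (batch_index, x_min, y_min, x_max, y_max).
xy_indices_and_rois = indices_and_rois[:, [0, 2, 1, 4, 3]]
```

Skipping the swap would not raise an error; the pooling layer would silently read every box as transposed, which is why the original comment marks the line as important.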

Overall network diagram

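4. Training code: trainer.py

This script defines the class FasterRCNNTrainer. Its constructor takes an instance of the FasterRCNNVGG16 class defined earlier as faster_rcnn, and also sets up the other creators, vis, the optimizer and so on.

It also defines four loss functions plus one combined total loss: rpn_loc_loss, rpn_cls_loss, roi_loc_loss, roi_cls_loss and total_loss.

Let us first look at the constructor of FasterRCNNTrainer: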

class FasterRCNNTrainer(nn.Module):
    """wrapper for conveniently training. return losses

    The losses include:

    * :obj:`rpn_loc_loss`: The localization loss for \
        Region Proposal Network (RPN).
    * :obj:`rpn_cls_loss`: The classification loss for RPN.
    * :obj:`roi_loc_loss`: The localization loss for the head module.
    * :obj:`roi_cls_loss`: The classification loss for the head module.
    * :obj:`total_loss`: The sum of 4 loss above.

    Args:
        faster_rcnn (model.FasterRCNN):
            A Faster R-CNN model that is going to be trained.
    """

    def __init__(self, faster_rcnn):
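        # run the parent module's initialization
        super(FasterRCNNTrainer, self).__init__()

        self.faster_rcnn = faster_rcnn
        # these 2 hyperparameters are used by _fast_rcnn_loc_loss when
        # computing the localization loss
        self.rpn_sigma = opt.rpn_sigma
        self.roi_sigma = opt.roi_sigma

        # target creator create gt_bbox gt_label etc as training targets. 
        # Picks 256 of the 20000 candidate anchors for binary classification
        # and position regression, i.e. provides the ground truth for the
        # positions and classes predicted by the RPN
        self.anchor_target_creator = AnchorTargetCreator()
        # AnchorTargetCreator and ProposalTargetCreator generate training
        # targets (ground truth) and are used only during training.
        # ProposalCreator is how the RPN generates RoIs for Fast R-CNN and is
        # used in both training and testing. At test time the 300 RoIs go
        # straight to the head, while during training ProposalTargetCreator
        # samples them once more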
        self.proposal_target_creator = ProposalTargetCreator()

        self.loc_normalize_mean = faster_rcnn.loc_normalize_mean
        self.loc_normalize_std = faster_rcnn.loc_normalize_std

        self.optimizer = self.faster_rcnn.get_optimizer()
        # visdom wrapper
        self.vis = Visualizer(env=opt.env)

        # indicators for training status
        self.rpn_cm = ConfusionMeter(2)
        self.roi_cm = ConfusionMeter(21)
        self.meters = {k: AverageValueMeter() for k in LossTuple._fields}  # average loss

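Next comes the forward function. Only training with batch_size equal to 1 is supported, so n=1. Each batch takes one image, all the bboxes and labels on that image, and the scale applied to the image during preprocessing.

Both classification losses (RPN and RoI Head) use cross-entropy loss, while the regression losses use smooth_l1_loss.

One more point to note: the RoI regression output is, for example, 128 × 84, whereas the ground-truth location parameters are 128 × 4 and the ground-truth labels 128 × 1. The ground-truth labels are used to index the regression output down to 128 × 4, and during the computation only the regression loss of foreground classes is counted. The implementation differs slightly from Fast R-CNN.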
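The indexing step and the masked smooth L1 loss can be sketched in NumPy as follows. This is an illustration of the idea rather than the trainer's code: both function names are hypothetical, and dividing by the number of samples with label >= 0 is an assumed normalization.

```python
import numpy as np


def select_loc_by_label_sketch(roi_cls_loc, gt_roi_label):
    """(n, n_class * 4) regression output -> (n, 4) using the true class."""
    n = roi_cls_loc.shape[0]
    per_class = roi_cls_loc.reshape(n, -1, 4)         # e.g. (128, 21, 4)
    return per_class[np.arange(n), gt_roi_label]      # e.g. (128, 4)


def smooth_l1_loc_loss_sketch(pred_loc, gt_loc, gt_label, sigma=1.0):
    """Smooth L1 loss in which only foreground rows (label > 0) contribute."""
    sigma2 = sigma ** 2
    weight = (gt_label > 0).astype(pred_loc.dtype)[:, None]
    diff = weight * (pred_loc - gt_loc)
    abs_diff = np.abs(diff)
    # Quadratic near zero, linear further away.
    loss = np.where(abs_diff < 1.0 / sigma2,
                    0.5 * sigma2 * diff ** 2,
                    abs_diff - 0.5 / sigma2)
    # Assumed normalization: number of non-ignored samples.
    return loss.sum() / max(int((gt_label >= 0).sum()), 1)


# Tiny example: 2 RoIs, 3 classes (class 0 is the background).
roi_cls_loc = np.arange(24, dtype=np.float32).reshape(2, 12)
gt_roi_label = np.array([0, 2])
selected = select_loc_by_label_sketch(roi_cls_loc, gt_roi_label)

pred = np.array([[3.0, 3.0, 3.0, 3.0], [0.5, 2.0, 0.0, 0.0]])
target = np.zeros((2, 4))
loss = smooth_l1_loc_loss_sketch(pred, target, np.array([0, 1]))
```

In the example the first RoI is background, so its large errors are masked out; the second RoI contributes 0.125 (quadratic branch) and 1.5 (linear branch) to the sum.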

def forward(self, imgs, bboxes, labels, scale):
        """Forward Faster R-CNN and calculate losses.

        Here are notations used.

        * :math:`N` is the batch size.
        * :math:`R` is the number of bounding boxes per image.

        Currently, only :math:`N=1` is supported.

        Args:
            imgs (~torch.autograd.Variable): A variable with a batch of images.
            bboxes (~torch.autograd.Variable): A batch of bounding boxes.
                Its shape is :math:`(N, R, 4)`.
            labels (~torch.autograd..Variable): A batch of labels.
                Its shape is :math:`(N, R)`. The background is excluded from
                the definition, which means that the range of the value
                is :math:`[0, L - 1]`. :math:`L` is the number of foreground
                classes.
            scale (float): Amount of scaling applied to
                the raw image during preprocessing.

        Returns:
            namedtuple of 5 losses
        """
        # Batch size
        n = bboxes.shape[0]
        if n != 1:
            raise ValueError('Currently only batch size 1 is supported.')
        # imgs has shape (n, c, H, W)
        _, _, H, W = imgs.shape
        img_size = (H, W)

        # Extract image features with the VGG16 layers up to conv5_3
        features = self.faster_rcnn.extractor(imgs)

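        # rpn_locs has shape (1, hh*ww*9, 4) and rpn_scores has shape (1, hh*ww*9, 2),
        # where hh and ww are the feature-map height and width. rois has shape
        # (2000, 4), roi_indices is not used here, and anchor has shape (hh*ww*9, 4).
        # H and W are the image size after preprocessing. The RPN scores the
        # (H/16) x (W/16) x 9 (roughly 20000) anchors for being foreground, keeps the
        # top 12000, and applies NMS to obtain 2000 proposals, i.e. approximate
        # target boxes.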
        rpn_locs, rpn_scores, rois, roi_indices, anchor = \
            self.faster_rcnn.rpn(features, img_size, scale)

        # Since batch size is one, convert variables to singular form
        # bboxes has shape (N, R, 4), so bbox is (R, 4)
        bbox = bboxes[0]
        # labels has shape (N, R), so label is (R,)
        label = labels[0]
        # (hh*ww*9, 2)
        rpn_score = rpn_scores[0]
        # (hh*ww*9, 4)
        rpn_loc = rpn_locs[0]
        # (2000, 4)
        roi = rois

        # Sample RoIs and forward.
        # It's fine to break the computation graph of rois;
        # consider them as constant input.
        # proposal_target_creator produces sample_roi (128, 4), gt_roi_loc (128, 4)
        # and gt_roi_label (128,). The RoIHead network takes sample_roi plus the
        # features as input and outputs the classification (21 classes) and
        # regression (further bbox refinement) predictions, so the ground truth for
        # these is the gt_roi_label and gt_roi_loc returned by ProposalTargetCreator.

        sample_roi, gt_roi_loc, gt_roi_label = self.proposal_target_creator(
            roi,
            at.tonumpy(bbox),
            at.tonumpy(label),
            self.loc_normalize_mean,
            self.loc_normalize_std)
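        # NOTE: all zeros because only batch size 1 is supported for now
        sample_roi_index = t.zeros(len(sample_roi))
        # The head outputs roi_cls_loc (128, 84) and roi_score (128, 21), whereas the
        # ground-truth locations are (128, 4) and the ground-truth labels are (128,)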
        roi_cls_loc, roi_score = self.faster_rcnn.head(
            features,
            sample_roi,
            sample_roi_index)

        # ------------------ RPN losses -------------------#
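        # Feed the roughly 20000 anchors and bbox to anchor_target_creator. It returns
        # regression targets and labels for every anchor: up to 256 sampled anchors are
        # labeled 1 (positive) or 0 (negative), and all others are labeled -1 (ignored).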
        gt_rpn_loc, gt_rpn_label = self.anchor_target_creator(
            at.tonumpy(bbox),
            anchor,
            img_size)
        gt_rpn_label = at.totensor(gt_rpn_label).long()
        gt_rpn_loc = at.totensor(gt_rpn_loc)
        rpn_loc_loss = _fast_rcnn_loc_loss(
            rpn_loc,
            gt_rpn_loc,
            gt_rpn_label.data,
            self.rpn_sigma)

        # NOTE: default value of ignore_index is -100 ...
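        # Cross-entropy between rpn_score (all ~20000 anchors) and the labels from
        # anchor_target_creator; anchors labeled -1 are ignored, so only the 256
        # sampled anchors contribute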
        rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_label.cuda(), ignore_index=-1)
        _gt_rpn_label = gt_rpn_label[gt_rpn_label > -1]
        _rpn_score = at.tonumpy(rpn_score)[at.tonumpy(gt_rpn_label) > -1]
        self.rpn_cm.add(at.totensor(_rpn_score, False), _gt_rpn_label.data.long())

        # ------------------ ROI losses (fast rcnn loss) -------------------#
        # roi_cls_loc is the VGG16RoIHead output with shape (128, 84); n_sample = 128
        n_sample = roi_cls_loc.shape[0]
        # Reshape roi_cls_loc to (128, 21, 4)
        roi_cls_loc = roi_cls_loc.view(n_sample, -1, 4)
        # For each RoI keep only the offsets predicted for its ground-truth class: (128, 4)
        roi_loc = roi_cls_loc[t.arange(0, n_sample).long().cuda(), \
                              at.totensor(gt_roi_label).long()]
        # The 128 labels
        gt_roi_label = at.totensor(gt_roi_label).long()
        # Offsets (dy, dx, dh, dw) between the 128 proposals sampled by
        # proposal_target_creator() and their matched bbox
        gt_roi_loc = at.totensor(gt_roi_loc)
        # Smooth L1 loss
        roi_loc_loss = _fast_rcnn_loc_loss(
            roi_loc.contiguous(),
            gt_roi_loc,
            gt_roi_label.data,
            self.roi_sigma)
        # Cross-entropy loss
        roi_cls_loss = nn.CrossEntropyLoss()(roi_score, gt_roi_label.cuda())

        self.roi_cm.add(at.totensor(roi_score, False), gt_roi_label.data.long())
        # Append the sum of the four losses as the total loss
        losses = [rpn_loc_loss, rpn_cls_loss, roi_loc_loss, roi_cls_loss]
        losses = losses + [sum(losses)]

        return LossTuple(*losses)
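
The RPN classification loss in forward passes ignore_index=-1 so that anchors labeled -1 do not affect the result. The NumPy sketch below shows the idea under the assumption that the loss is the mean negative log-softmax over the non-ignored samples. The function name cross_entropy_ignore is illustrative, and it is not a drop-in replacement for F.cross_entropy.

```python
import numpy as np


def cross_entropy_ignore(scores, labels, ignore_index=-1):
    """Mean cross-entropy over samples whose label is not ignore_index.

    scores: (n, n_class) raw scores (logits).
    labels: (n,) integer labels, with ignore_index marking skipped samples.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    keep = labels != ignore_index
    kept_scores = scores[keep]
    kept_labels = labels[keep]
    # Numerically stable log-softmax along the class axis.
    shifted = kept_scores - kept_scores.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_prob[np.arange(len(kept_labels)), kept_labels]
    return -picked.mean()


# Three anchors with background/foreground scores; the last one is ignored.
scores = np.array([[0.0, 0.0], [2.0, 0.0], [9.0, -9.0]])
labels = np.array([1, 0, -1])
rpn_cls_loss = cross_entropy_ignore(scores, labels)
```

Because the ignored rows are dropped before averaging, changing their scores leaves the loss unchanged.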

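Next is the train_step function, which performs a single optimization step. It first calls self.optimizer.zero_grad() to clear all gradients, then uses the self.forward(imgs, bboxes, labels, scale) function described above to compute all the losses. It then calls losses.total_loss.backward() to backpropagate and compute the gradients, self.optimizer.step() to update the parameters once, and self.update_meters(losses) to refresh the running loss values that feed the visualization. Finally it returns losses. The code is as follows: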

def train_step(self, imgs, bboxes, labels, scale):
    self.optimizer.zero_grad()
    losses = self.forward(imgs, bboxes, labels, scale)
    losses.total_loss.backward()
    self.optimizer.step()
    self.update_meters(losses)
    return losses

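Original content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.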
