3000-Class Object Detection -- R-FCN-3000 at 30fps: Decoupling Detection and Classification

User 1148525

Published 2019-05-26 11:50:29

Paper: R-FCN-3000 at 30fps: Decoupling Detection and Classification. Code will be made available.

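This paper addresses the problem of detecting 3000 object classes in real time. The main idea is to separate object detection from object classification. The proposed R-FCN-3000 outperforms YOLO9000 by 18% and runs at 30 frames per second.

Real-time detection is already fairly mature when there are only a few dozen classes, but in real-world settings the number of object categories reaches into the thousands. The recently proposed fully convolutional class of detectors computes a per-class objectness score for a given image and achieves high accuracy with a limited computational budget. Fully-convolutional representations provide an efficient approach to tasks such as object detection, instance segmentation, tracking, and relationship detection, but they require class-specific sets of filters for each class in order to learn the information relevant to that class. For example, R-FCN / Deformable-R-FCN requires 49/197 position-specific filters for each class, and Retina-Net requires 9 filters for each class for each convolutional feature map.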
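To make the scaling problem concrete, the short sketch below multiplies the per-class filter counts quoted above (49 for R-FCN, 197 for Deformable-R-FCN) by the number of classes. The function name `class_specific_filters` and the class counts in the loop are illustrative assumptions, not something taken from the paper.

```python
# Per-class position-specific filter counts, as quoted in the text above.
FILTERS_PER_CLASS = {"R-FCN": 49, "Deformable-R-FCN": 197}


def class_specific_filters(detector: str, num_classes: int) -> int:
    """Total class-specific filters when every class needs its own filter set."""
    return FILTERS_PER_CLASS[detector] * num_classes


if __name__ == "__main__":
    # Hypothetical class counts, chosen only to show the linear growth.
    for num_classes in (20, 80, 3000):
        for detector in FILTERS_PER_CLASS:
            total = class_specific_filters(detector, num_classes)
            print(f"{detector:>17} with {num_classes:>4} classes: {total} filters")
```

The count grows linearly with the number of classes, which is the cost that the decoupled design is meant to avoid in the localization step.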

The most important aspect of R-FCN-3000 is that it decouples objectness detection from classification, so that adding more classes does not increase the computation of the localization step. In the paper's words: "The key insight behind the proposed R-FCN-3000 architecture is to decouple objectness detection and classification of the detected object so that the computational requirements for localization remain constant as the number of classes increases."

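4.1. Weakly Supervised vs. Supervised? Weakly supervised methods perform worse than supervised ones, so a supervised training procedure is used here. Training uses labeled images from the ImageNet database, where each image contains only 1-2 objects.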

We show that careful design choices with respect to the CNN architecture, loss function and training protocol can yield a large-scale detector trained on the ImageNet classification set with significantly better accuracy compared to weakly supervised detectors

The main idea of R-FCN-3000 is as follows.

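The architecture figure shows two branches. The upper branch decides whether an object is present: it extracts valid candidate regions without regard to their specific class, acting as a super-class detector. The lower branch produces the class information for each candidate region. Finally, the outputs of the two branches are fused to give, for each candidate region, its class and the probability that it contains an object.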
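The fusion step can be sketched as follows. This is a minimal illustration that assumes the two branches are combined by multiplying the super-class probability of each region with its fine-grained class probability; the text above only says the outputs are fused, so the product rule, the background column at index 0, and the names `fuse_scores` and `class_to_super` are all assumptions made for this example.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_scores(superclass_logits: np.ndarray,
                class_logits: np.ndarray,
                class_to_super: np.ndarray) -> np.ndarray:
    """Combine the two branches into one score per (region, class).

    superclass_logits: (R, K + 1), column 0 is assumed to be background.
    class_logits:      (R, C) fine-grained classification logits.
    class_to_super:    (C,) index in [0, K) of the super-class of each class.
    Returns an (R, C) array of fused scores.
    """
    p_super = softmax(superclass_logits)   # (R, K + 1)
    p_class = softmax(class_logits)        # (R, C)
    # Drop the background column, then look up each class's super-class.
    p_super_of_class = p_super[:, 1:][:, class_to_super]  # (R, C)
    # Assumed fusion rule: product of the two probabilities.
    return p_super_of_class * p_class


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5 candidate regions, K = 3 super-classes, C = 6 classes (toy sizes).
    scores = fuse_scores(rng.normal(size=(5, 4)),
                         rng.normal(size=(5, 6)),
                         np.array([0, 0, 1, 1, 2, 2]))
    print(scores.shape)  # (5, 6)
```

With K = 1 the super-class branch reduces to a single objectness probability, and every class score is simply that objectness value times the class probability.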

Super-class Discovery. A 2048-dimensional feature vector is first extracted from the final layer of ResNet-101 to represent each class, so C classes yield C 2048-dimensional feature vectors. Applying K-means clustering to these C vectors gives K super-class clusters. When K is 1, the super-class detector predicts objectness.
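The clustering step can be sketched with a plain NumPy implementation of K-means (Lloyd's algorithm). This is only an illustration: the random vectors stand in for real ResNet-101 class features, the toy dimensions are far smaller than 2048, and the function name `discover_super_classes`, the random initialization, and the iteration cap are assumptions rather than details from the paper.

```python
import numpy as np


def _nearest_center(features: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Index of the closest center for every feature vector."""
    # Squared Euclidean distances via ||f||^2 - 2 f.c + ||c||^2, shape (C, K).
    d = ((features ** 2).sum(axis=1, keepdims=True)
         - 2.0 * features @ centers.T
         + (centers ** 2).sum(axis=1))
    return d.argmin(axis=1)


def discover_super_classes(features: np.ndarray, k: int,
                           num_iters: int = 100, seed: int = 0):
    """Cluster per-class feature vectors (C, D) into k super-classes.

    Returns (assignment, centers): assignment[c] is the super-class index
    of class c, and centers has shape (k, D).
    """
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct class vectors picked at random.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(num_iters):
        assign = _nearest_center(features, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return _nearest_center(features, centers), centers


if __name__ == "__main__":
    # Toy stand-in: 12 classes with 16-dimensional features instead of 2048.
    class_features = np.random.default_rng(1).normal(size=(12, 16))
    assignment, _ = discover_super_classes(class_features, k=3)
    print(assignment)  # one super-class index per class
```

Calling the function with `k=1` puts every class in the same cluster, which matches the case described above where the super-class detector simply predicts objectness.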