Voxel R-CNN: A High-Performance 3D Object Detection Network (AAAI 2021)

3D视觉工坊

Published 2021-07-27 11:36:06


Included in the column: 3D视觉从入门到精通 (3D Vision from Beginner to Mastery)

Author | 柒柒 @ Zhihu

Source | https://zhuanlan.zhihu.com/p/390497086

Editor | 3D视觉工坊

Paper title: Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection

Affiliation: CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China

Code: https://github.com/djiajunustc/Voxel-R-CNN

Paper: https://arxiv.org/pdf/2012.15712.pdf

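The paper in one sentence:

At its core, the paper asks how to extract 3D structure information effectively.

The authors' arguments:

1. Why do point-based methods work well? The authors attribute it to the precise position information these methods retain. That precision costs efficiency, though, because sampling and aggregating point features is very time-consuming.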

Many existing high performance 3D detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. Despite the superior detection accuracy, point-based methods are in general less efficient because it is more costly to search nearest neighbor with the point representation for point set abstraction.

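2. What is the advantage of voxels? Voxelization partitions the point cloud into a regular grid and aggregates the points inside each cell, which gives it a clear edge in feature extraction. The cost is that aggregating the points inside a cell discards each point's precise position.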

The voxel-based methods divide point clouds into regular grids, which are more applicable for convolutional neural networks (CNNs) and more efficient for feature extraction due to superior memory locality. Nevertheless, the downside is that voxelization often causes loss of precise position information.

3. The authors' central claim: a high-performance 3D detector does not need the precise positions of raw points.

In this paper, we take a slightly different viewpoint — we find that precise positioning of raw points is not essential for high performance 3D object detection and that the coarse voxel granularity can also offer sufficient detection accuracy.

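4. The authors see the key difference between voxel-based and point-based methods as follows: voxel-based methods detect on a bird's-eye-view (BEV) map, whereas point-based methods use point features that carry more 3D structure information. What limits voxel-based methods is therefore that detection on a BEV map loses 3D structure information.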

By taking a close look at the underlying mechanisms, we find that the key disadvantage of existing voxel-based methods stems from the fact that they convert 3D feature volumes into BEV representations without ever restoring the 3D structure context.

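The key module of the paper is therefore the one that extracts 3D structure information for voxel-based methods. How did earlier networks handle this?

5. SECOND vs. PV-RCNN. The authors compare two networks with a large performance gap and draw two conclusions:

a) 3D structure information matters a great deal for 3D object detection. PV-RCNN adds a keypoint module that supplies this information, and achieves better accuracy as a result.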

the 3D structure is of significant importance for 3D object detectors, since the BEV representation alone is insufficient to precisely predict bounding boxes in a 3D space. Typically, PV-RCNN integrates voxel features into sampled keypoints with Voxel Set Abstraction. The keypoints works as an intermediate feature representation to effectively preserve 3D structure information.

b) Point-voxel interaction is very time-consuming, and PV-RCNN is much slower than SECOND.

the point-voxel feature interaction is time-consuming and affects the detector's efficiency. However, the point-voxel interaction takes almost half of the overall running time, which makes PV-RCNN much slower than SECOND.
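Points 2 and 4 above can be made concrete with a small sketch. The NumPy code below is illustrative only and is not taken from the Voxel R-CNN repository: `voxelize` and `to_bev` are hypothetical helper names, the voxel size and toy points are made up, and using the mean of the points in a voxel as its feature is just one simple choice. It shows how voxelization replaces exact point positions with integer cell indices, and one common way of turning a 3D feature volume into a BEV map, namely folding the height axis into the channel axis, after which the detector only sees 2D maps.

```python
import numpy as np

def voxelize(points, voxel_size, pc_min):
    """Group points by voxel; return integer voxel indices and per-voxel mean points."""
    idx = np.floor((points - pc_min) / voxel_size).astype(np.int64)   # (N, 3)
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)               # guard against NumPy version differences
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)            # sum the points that fall in each voxel
    counts = np.bincount(inverse, minlength=len(coords))[:, None]
    return coords, sums / counts

def to_bev(volume):
    """Fold the height axis into channels: (C, D, H, W) -> (C * D, H, W)."""
    c, d, h, w = volume.shape
    return volume.reshape(c * d, h, w)

# Toy point cloud (made-up values): the first two points fall into the same voxel.
points = np.array([[0.12, 0.31, 0.05],
                   [0.18, 0.39, 0.07],
                   [1.02, 0.51, 0.95]])
coords, feats = voxelize(points, voxel_size=0.2, pc_min=np.zeros(3))
# Downstream layers see one index and one mean feature for that voxel,
# not the two original point positions.

volume = np.zeros((4, 2, 8, 8))   # toy dense feature volume: C=4, D=2 (height), H=8, W=8
bev = to_bev(volume)              # 2D convolutions now treat height slices as channels
```

Once the volume has been flattened this way, a box head that reads only `bev` has no explicit height axis left to pool over, which is the loss of 3D structure context that the authors point to.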

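So how can features be represented both quickly and effectively? Concretely, the pipeline can be read as:

input point cloud → voxel feature extraction → RPN → Voxel RoI pooling → detection results, as shown in the figure below.

Voxel R-CNN network architecture

The overall network is quite simple, so only the key points are noted here.

How does Voxel RoI pooling work? In short, it represents proposal features with 3D voxels, that is, it aggregates structure information from the 3D voxel features.

a) Voxel Query: match each proposal with nearby voxels. In practice this means finding the voxels around a query location (a grid point of the proposal) by Manhattan distance.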

We exploit Manhattan distance in voxel query and sample up to K voxels within a distance threshold.

b) With the matched voxels in hand, the next step is to extract structure information from the voxel group. The authors introduce voxel RoI pooling layers, which proceed as: proposal → grids → voxel query → aggregate features with PointNet.

It starts by dividing a region proposal into sub-voxels. The center point is taken as the grid point of the corresponding sub-voxel. Specifically, given a grid point, we first exploit voxel query to group a set of neighboring voxels. Then, we aggregate the neighboring voxel features with a PointNet module.

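In other words: divide each proposal into grids, run a voxel query at each grid point to gather the neighboring voxel features, aggregate those into a grid feature representation, and concatenate the grid features into the proposal feature. The authors regard this proposal feature as carrying 3D structure information.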
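Steps a) and b) can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions and is not the authors' implementation: the proposal is an axis-aligned box (real proposals are rotated), the query is a brute-force scan over all non-empty voxels, the "PointNet" is a single linear layer with ReLU followed by max pooling, and the weights are random. The function names `voxel_query`, `grid_points` and `voxel_roi_pool` are hypothetical.

```python
import numpy as np

def voxel_query(grid_idx, voxel_coords, max_dist, k):
    """Indices of up to k non-empty voxels within Manhattan distance max_dist."""
    dist = np.abs(voxel_coords - grid_idx).sum(axis=1)
    return np.flatnonzero(dist <= max_dist)[:k]

def grid_points(box_min, box_max, g):
    """Centers of the g x g x g sub-voxels of an axis-aligned box, shape (g**3, 3)."""
    box_min, box_max = np.asarray(box_min, float), np.asarray(box_max, float)
    steps = (np.arange(g) + 0.5) / g
    gx, gy, gz = np.meshgrid(steps, steps, steps, indexing="ij")
    unit = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return box_min + unit * (box_max - box_min)

def voxel_roi_pool(box_min, box_max, voxel_coords, voxel_feats, voxel_size,
                   weight, g=2, max_dist=2, k=16):
    """proposal -> grids -> voxel query -> PointNet-style aggregation."""
    out = []
    for p in grid_points(box_min, box_max, g):
        p_idx = np.floor(p / voxel_size).astype(np.int64)    # voxel containing the grid point
        nbr = voxel_query(p_idx, voxel_coords, max_dist, k)
        if len(nbr) == 0:                                    # no non-empty voxel nearby
            out.append(np.zeros(weight.shape[1]))
            continue
        centers = (voxel_coords[nbr] + 0.5) * voxel_size     # neighbor voxel centers
        x = np.concatenate([centers - p, voxel_feats[nbr]], axis=1)   # (k', 3 + C)
        out.append(np.maximum(x @ weight, 0.0).max(axis=0))  # linear + ReLU, then max pool
    return np.stack(out)                                     # (g**3, C_out)

# Toy data (made-up values): four non-empty voxels with 5-dimensional features.
rng = np.random.default_rng(0)
voxel_coords = np.array([[0, 0, 0], [1, 0, 0], [3, 3, 3], [9, 9, 9]])
voxel_feats = rng.normal(size=(4, 5))
weight = rng.normal(size=(3 + 5, 8))
pooled = voxel_roi_pool([0.5, 0.5, 0.5], [4.5, 4.5, 4.5],
                        voxel_coords, voxel_feats, voxel_size=1.0, weight=weight)
proposal_feature = pooled.reshape(-1)    # grid features concatenated into one vector
```

Because voxel coordinates are integers on a regular grid, the neighbors within a Manhattan-distance threshold can in principle be enumerated by index offsets rather than searched for; the brute-force scan here is only for readability.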

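Another question is how to stay efficient while still extracting 3D structure information effectively. For this the authors propose Accelerated Local Aggregation. The module essentially processes the voxel features first and the position information afterwards, which greatly reduces the time complexity.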
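A minimal sketch of why this reordering helps, based on my reading of the idea rather than on the released code: a linear layer applied to the concatenation [relative position, voxel feature] splits into a position part and a feature part, and the feature part does not depend on the grid point, so it can be computed once per voxel before grouping. All sizes, the random inputs and the names `W_pos` and `W_feat` below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, C_out = 1000, 16, 32      # non-empty voxels, input channels, output channels
M, K = 216, 16                  # grid points, neighbors gathered per grid point

voxel_feats = rng.normal(size=(N, C))
nbr_idx = rng.integers(0, N, size=(M, K))   # stand-in for the voxel query result
rel_pos = rng.normal(size=(M, K, 3))        # stand-in for voxel center minus grid point
W_pos = rng.normal(size=(3, C_out))
W_feat = rng.normal(size=(C, C_out))

# Plain version: group first, then one linear layer on [rel_pos, feature].
grouped = np.concatenate([rel_pos, voxel_feats[nbr_idx]], axis=-1)   # (M, K, 3 + C)
plain = grouped @ np.concatenate([W_pos, W_feat], axis=0)            # (M, K, C_out)

# Reordered version: transform features once per voxel, gather, then add the position term.
feat_once = voxel_feats @ W_feat                                     # (N, C_out)
fast = feat_once[nbr_idx] + rel_pos @ W_pos                          # (M, K, C_out)

same_result = np.allclose(plain, fast)

# Multiply-accumulate counts of the two orderings for the sizes above.
plain_macs = M * K * (3 + C) * C_out
fast_macs = N * C * C_out + M * K * 3 * C_out
```

The two orderings give the same output, but the feature transform runs over N voxels instead of M × K grouped entries. The saving grows with the number of proposals, since M × K then becomes far larger than N.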

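There is not much to say about the rest: proposals come from a standard RPN, followed by the detection network, and so on.

Experimental results: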