Spatiotemporal Features: Learning Spatiotemporal Features with 3D Convolutional Networks

User 1148525

Published 2018-01-03 15:36:50


Learning Spatiotemporal Features with 3D Convolutional Networks ICCV 2015 http://vlg.cs.dartmouth.edu/c3d/ https://github.com/facebook/C3D

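This paper uses a 3D CNN to analyze video sequences. The spatiotemporal features it learns are called C3D, and the main goal is to find the best 3D filter structure for a 3D CNN.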

Analyzing video data is important, but it is also hard. We argue that an effective video descriptor needs the following four properties: 1) generic, 2) compact, 3) simple, and 4) efficient.

Our C3D features are versatile.

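3 Learning Features with 3D ConvNets

3.1. 3D convolution and pooling

We believe 3D CNNs are well suited to learning spatiotemporal features. Compared with a 2D CNN, a 3D ConvNet can model temporal information through 3D convolution and 3D pooling.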
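To make the idea concrete, here is a minimal pure-Python sketch of a single-channel 3D convolution (strictly a cross-correlation, as in most deep learning frameworks) with no padding and stride 1. It is not the paper's implementation; the function names and the toy clip are made up for illustration. The point is only that the output still has a time axis, so later layers can keep reasoning about motion.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation, no padding, stride 1.

    volume: nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - d + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a d x kh x kw block spanning several frames.
                for dt in range(d):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(v):
    """Return (time, height, width) of a nested-list volume."""
    return (len(v), len(v[0]), len(v[0][0]))


# Toy clip (hypothetical data): 4 frames of 5x5 pixels, frame t filled with value t.
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]
# A 3x3x3 averaging kernel, i.e. temporal depth d = 3.
kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]

features = conv3d_valid(clip, kernel)
print(shape(features))  # (2, 3, 3): the temporal axis survives the convolution
```

A 2D convolution applied to the same clip would either process each frame independently or collapse all frames into one output map, which is the temporal information a 3D kernel is meant to preserve.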

Our approach is to first search for the best 3D ConvNet architecture on a small dataset, and then validate it on a large one.

Because training deep networks on large-scale video datasets is very time-consuming, we first experiment with UCF101, a medium-scale dataset, to search for the best architecture.

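Common network settings: the network takes a short video clip as input and outputs a label over 101 different actions. Some of the settings: UCF101 frames are resized to 128 × 171. Videos are split into non-overlapped 16-frame clips which are then used as input to the networks. The input dimensions are 3 × 16 × 128 × 171, and we also take crops of size 3 × 16 × 112 × 112 as input. The network has 5 convolution layers and 5 pooling layers, followed by 2 fully-connected layers and a softmax loss layer to predict action labels. The convolution layers have 64, 128, 256, 256, and 256 filters respectively. All convolution kernels have size 3 × 3 × d, where d is the kernel temporal depth.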
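The clip preparation described here can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the released C3D code: the helper names are hypothetical, dropping trailing frames that do not fill a whole clip is an assumption, and so is sampling the crop position uniformly at random.

```python
import random

CLIP_LEN = 16                # frames per clip
FRAME_H, FRAME_W = 128, 171  # frame size after resizing
CROP = 112                   # spatial size of the cropped input


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) index pairs of non-overlapping clips, end exclusive.

    Assumption: leftover frames that do not fill a full clip are dropped.
    """
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_window(rng, height=FRAME_H, width=FRAME_W, crop=CROP):
    """Pick a crop window as (top, left, bottom, right), bottom/right exclusive.

    Assumption: the window position is sampled uniformly.
    """
    top = rng.randint(0, height - crop)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


clips = split_into_clips(50)
print(clips)  # [(0, 16), (16, 32), (32, 48)]

rng = random.Random(0)
top, left, bottom, right = random_crop_window(rng)
# Every clip shares one window, so its shape goes from
# 3 x 16 x 128 x 171 to 3 x 16 x 112 x 112.
print((3, CLIP_LEN, bottom - top, right - left))  # (3, 16, 112, 112)
```

Using one crop window for all 16 frames of a clip keeps the frames spatially aligned, which matters once a 3D kernel starts mixing pixels across time.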

According to the findings in 2D ConvNet [37], small receptive fields of 3 × 3 convolution kernels with deeper architectures yield best results. Hence, for our architecture search study we fix the spatial receptive field to 3 × 3 and vary only the temporal depth of the 3D convolution kernels.
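One way to see what varying only the temporal depth costs is to count the convolution parameters as a function of d. The sketch below uses the filter counts and the 3-channel input from the settings above; the helper name, the inclusion of one bias per filter, and the particular depth values tried are assumptions for illustration.

```python
FILTERS = [64, 128, 256, 256, 256]  # filters per convolution layer
IN_CHANNELS = 3                     # RGB input


def conv_params(temporal_depths, filters=FILTERS, in_channels=IN_CHANNELS, k=3):
    """Count weights and biases of the conv layers.

    temporal_depths: one kernel temporal depth d per layer.
    Each layer has c_out kernels of size c_in x d x k x k,
    plus one bias per kernel (assumption).
    """
    total = 0
    c_in = in_channels
    for d, c_out in zip(temporal_depths, filters):
        total += c_out * c_in * d * k * k + c_out
        c_in = c_out
    return total


# Illustrative depths only; d = 1 behaves like a 2D convolution on each frame.
for d in (1, 3, 5, 7):
    print(d, conv_params([d] * len(FILTERS)))
```

Because the spatial size is fixed at 3 × 3, the weight count grows linearly with d, so differences between settings come from the temporal extent of the kernels rather than from a change in spatial receptive field.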

Varying network architectures: