Author: 木石 (https://zhuanlan.zhihu.com/p/129282832). Published with the original author's permission; do not repost without authorization.
The MCG group of Nanjing University and Tencent PCG propose TEA, a lightweight action-recognition model for temporal modeling.
Paper: https://arxiv.org/abs/2004.01398
Code (as of April 19, 2020, not yet released): https://github.com/Phoenix1327/tea-action-recognition
Problems in motion feature learning:
Problems in long-range temporal modeling:
TEA consists of two modules, ME (Motion Excitation) and MTA (Multiple Temporal Aggregation):
ME module: models feature-level temporal differences, i.e., short-range motion.
Design details:
A 1×1 convolution is first used to reduce the channel number.
A 3×3 convolution is then applied at time step t+1; it can match regions of the same object from contexts, avoiding the feature displacement that direct subtraction would cause. It can also be understood as a correction or smoothing operation.
When calculating feature-level motion representations in the ME module, we first apply a channel-wise transformation convolution on features at the time step t + 1. The reason is that motions will cause spatial displacements for the same objects between two frames, and it will result in mismatched motion representation to directly compute differences between displaced features. To address this issue, we add a 3×3 convolution at time step t + 1 attempting to capture the matched regions of the same object from contexts. Moreover, we found that conducting transformation on both t and t + 1 time steps does not improve the performance but introduces more operations.
The ME output is X^o = X + X ⊙ A, where A is the motion attention and ⊙ denotes channel-wise multiplication.
The useless channels will be completely suppressed in SENet, but the static background information can be preserved in our module by introducing a residual connection.
As the figure shows, features differ across channels; the motion-attention operation uses a large attention weight to enhance motion-related channels and a lower attention weight to suppress background channels.
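Putting the design details above together, here is a minimal PyTorch sketch of the ME idea. Module and parameter names are mine, and the reduction ratio of 16, the depthwise 3×3 transform, and the plain-sigmoid attention are assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of the ME module: difference of adjacent-frame features
    drives a channel attention, applied with a residual connection."""
    def __init__(self, channels, n_segment=8, reduction=16):
        super().__init__()
        self.n_segment = n_segment
        self.c_r = channels // reduction
        # 1x1 conv to reduce the channel number
        self.reduce = nn.Conv2d(channels, self.c_r, 1, bias=False)
        # channel-wise (depthwise) 3x3 transform applied at time step t+1
        self.transform = nn.Conv2d(self.c_r, self.c_r, 3, padding=1,
                                   groups=self.c_r, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # 1x1 conv to expand back to the original channel number
        self.expand = nn.Conv2d(self.c_r, channels, 1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N*T, C, H, W), T = n_segment
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        xr = self.reduce(x).view(n, self.n_segment, self.c_r, h, w)
        # motion = transform(x_{t+1}) - x_t; last step padded with zeros
        x_next = self.transform(xr[:, 1:].reshape(-1, self.c_r, h, w))
        x_next = x_next.view(n, self.n_segment - 1, self.c_r, h, w)
        m = x_next - xr[:, :-1]
        m = torch.cat([m, torch.zeros_like(xr[:, :1])], dim=1)
        a = self.sigmoid(self.expand(self.pool(m.reshape(nt, self.c_r, h, w))))
        # residual connection keeps static background information
        return x + x * a
```

Note the final line: unlike SENet-style gating, the input is added back, so channels with near-zero attention are attenuated but not erased.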
MTA module: used for long-range temporal aggregation.
Stacking local 3D/(2+1)D convolutions builds temporal relations by enlarging the receptive field with network depth, which makes optimization difficult:
optimization message delivered from distant frames has been dramatically weakened and cannot be well handled.
Design details:
A temporal convolution is applied per channel fragment. Note that the features must be reshaped before the temporal convolution so that time becomes the convolved dimension.
Placing the TEA block in later stages works better: long-range temporal aggregation is more effective there.
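A minimal sketch of the MTA idea, assuming a Res2Net-style split into 4 channel fragments where each fragment except the first receives the previous fragment's output, then a depthwise 1D temporal convolution followed by a 3×3 spatial convolution. The kernel sizes and names are assumptions; the reshape needed before the temporal convolution is shown explicitly:

```python
import torch
import torch.nn as nn

class MTA(nn.Module):
    """Sketch of the MTA module: hierarchical channel fragments whose
    stacked sub-convolutions enlarge the temporal receptive field."""
    def __init__(self, channels, n_segment=8, n_split=4):
        super().__init__()
        assert channels % n_split == 0
        self.n_segment = n_segment
        self.n_split = n_split
        c = channels // n_split
        # per-fragment depthwise temporal conv (kernel 3) and spatial 3x3 conv
        self.t_convs = nn.ModuleList(
            nn.Conv1d(c, c, 3, padding=1, groups=c, bias=False)
            for _ in range(n_split - 1))
        self.s_convs = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=1, bias=False)
            for _ in range(n_split - 1))

    def temporal_conv(self, x, conv):
        # reshape (N*T, c, H, W) -> (N*H*W, c, T) so the 1D conv runs over time
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        y = x.view(n, self.n_segment, c, h, w).permute(0, 3, 4, 2, 1)
        y = conv(y.reshape(n * h * w, c, self.n_segment))
        return (y.view(n, h, w, c, self.n_segment)
                 .permute(0, 4, 3, 1, 2).reshape(nt, c, h, w))

    def forward(self, x):
        splits = torch.chunk(x, self.n_split, dim=1)
        out = [splits[0]]  # first fragment: identity, no extra convs
        prev = None
        for i in range(1, self.n_split):
            xi = splits[i] if prev is None else splits[i] + prev
            prev = self.s_convs[i - 1](self.temporal_conv(xi, self.t_convs[i - 1]))
            out.append(prev)
        return torch.cat(out, dim=1)
```

Because later fragments reuse earlier fragments' outputs, the effective temporal receptive field grows within a single block instead of only with network depth.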
Initialization of training parameters:
The shift operation follows TSM; the shift in TSM can be viewed as a special 1D convolution.
This special 1D convolution can be constructed as follows:
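Concretely, the shift can be written as a frozen depthwise 1D convolution with kernel size 3: channels that shift left get the kernel (0, 0, 1), channels that shift right get (1, 0, 0), and the remaining channels get the identity kernel (0, 1, 0). A sketch, where the 1/8 shift proportion follows TSM's default and the helper name is mine:

```python
import torch
import torch.nn as nn

def make_shift_conv(channels, n_div=8):
    """Build TSM's shift as a fixed (frozen) depthwise 1D conv over time.
    Assumption: 1/n_div of channels shift left, 1/n_div shift right."""
    conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1,
                     groups=channels, bias=False)
    fold = channels // n_div
    with torch.no_grad():
        conv.weight.zero_()
        conv.weight[:fold, 0, 2] = 1.0           # shift left: output[t] = x[t+1]
        conv.weight[fold:2 * fold, 0, 0] = 1.0   # shift right: output[t] = x[t-1]
        conv.weight[2 * fold:, 0, 1] = 1.0       # identity: output[t] = x[t]
    conv.weight.requires_grad = False
    return conv
```

In TEA this connection is used the other way around: the learnable temporal convolutions can be initialized from these shift kernels, so training starts from a TSM-like behavior. The conv expects input of shape (N, C, T).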
Two inference protocols:
1. Efficient protocol: low-cost evaluation with a single clip and a center crop.
2. Accuracy protocol: higher-cost evaluation that averages predictions over multiple clips and crops per video.
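Whatever the exact clip and crop counts, the accuracy protocol reduces to averaging per-clip predictions; a generic sketch (function name and clip handling are illustrative, not from the paper):

```python
import torch

def clip_average_predict(model, clips):
    """Average logits over a list of clip/crop tensors, then softmax.
    Assumption: `model` maps one clip batch to per-clip logits."""
    with torch.no_grad():
        logits = torch.stack([model(c) for c in clips]).mean(dim=0)
    return logits.softmax(dim=-1)
```

The efficient protocol is the degenerate case where `clips` holds a single center-cropped clip.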