BEVFormer: Learning BEV Representations from Multi-Camera Images (3)

YoungTimes

Published 2023-09-01 08:56:05

Included in the column: 半杯茶的小酒杯

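Continued from the previous parts:

BEVFormer: Learning BEV Representations from Multi-Camera Images (1)

BEVFormer: Learning BEV Representations from Multi-Camera Images (2)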

3. Temporal Self-Attention

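In addition to spatial information, temporal information is also crucial for a visual system to understand its surroundings. For example, temporal information makes it possible to infer the velocity of moving objects and to identify static obstacles.

While the BEV features are being built, the vehicle keeps moving and its surroundings keep changing. However, because the time interval between two frames is small, an object's position at the current moment should lie close to its position at the previous moment. Temporal Self-Attention uses the vehicle's prior motion information together with the Transformer's ability to model context to associate and fuse the historical BEV features with the current BEV Query.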

3.1 Coarse Feature Alignment

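Let the BEV Query at time t be Q and the BEV feature at time t-1 be B_{t-1}. These two features are not spatially aligned, so the vehicle's motion (translation, rotation, etc.) is used to align Q and B_{t-1} in space.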

Processing the historical BEV feature

Rotate the previous BEV (prev_bev) so that it matches the current vehicle heading.

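# rotation angle for sample i, taken from the last CAN bus entry
rotation_angle = kwargs['img_metas'][i]['can_bus'][-1]
# (bev_h * bev_w, C) -> (C, bev_h, bev_w): channel-first layout for rotate
tmp_prev_bev = prev_bev[:, i].reshape(
    bev_h, bev_w, -1).permute(2, 0, 1)
# rotate the previous BEV map about rotate_center
tmp_prev_bev = rotate(tmp_prev_bev, rotation_angle,
    center=self.rotate_center)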
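
To make the effect of this rotation concrete, here is a toy sketch in plain Python that rotates a small 2D grid about its centre with nearest-neighbour sampling. It is not the implementation used in the excerpt, which calls a rotate function on a channel-first tensor. The helper name rotate_grid, the counter-clockwise sign convention, and the zero fill for cells that fall outside the grid are all assumptions made for illustration.

```python
import math


def rotate_grid(grid, angle_deg):
    """Rotate a 2D list about its centre (hypothetical helper, nearest neighbour).

    Assumption: positive angles turn the content counter-clockwise when the
    grid is viewed with row 0 at the top.
    """
    h, w = len(grid), len(grid[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Offset of the output cell from the centre (x right, y down).
            x, y = c - cx, r - cy
            # Inverse mapping: which source cell lands on this output cell?
            src_x = cx + x * cos_a - y * sin_a
            src_y = cy + x * sin_a + y * cos_a
            sc, sr = int(round(src_x)), int(round(src_y))
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = grid[sr][sc]
    return out


# A single active cell to the right of the centre ...
prev_bev_toy = [[0, 0, 0],
                [0, 0, 1],
                [0, 0, 0]]
# ... ends up above the centre after a 90 degree turn.
rotated_toy = rotate_grid(prev_bev_toy, 90.0)
```

The same idea applies per channel to a full BEV feature map: every cell of the previous frame is moved to where it would appear from the current heading, so that features at the same grid index describe roughly the same piece of road.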

Processing the reference points

Compute the offset between two adjacent frames (shift_x and shift_y).

# obtain rotation angle and shift with ego motion
delta_x = np.array([each['can_bus'][0]
      for each in kwargs['img_metas']])
delta_y = np.array([each['can_bus'][1]
      for each in kwargs['img_metas']])
ego_angle = np.array([each['can_bus'][-2] / np.pi * 180 for each in kwargs['img_metas']])

grid_length_y = grid_length[0]
grid_length_x = grid_length[1]
translation_length = np.sqrt(delta_x ** 2 + delta_y ** 2)
translation_angle = np.arctan2(delta_y, delta_x) / np.pi * 180

bev_angle = ego_angle - translation_angle
shift_y = translation_length * \
      np.cos(bev_angle / 180 * np.pi) / grid_length_y / bev_h
shift_x = translation_length * \
      np.sin(bev_angle / 180 * np.pi) / grid_length_x / bev_w
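
The arithmetic in this excerpt can be checked by hand on a single sample. The sketch below restates it with scalars using only the standard library. The function name bev_shift and the numeric values (1 m of travel, 0.512 m grid cells, a 200 x 200 BEV grid) are assumptions chosen for illustration, not values taken from the excerpt.

```python
import math


def bev_shift(delta_x, delta_y, ego_yaw_rad, grid_length, bev_h, bev_w):
    """Scalar re-statement of the excerpt for one sample (hypothetical helper).

    Returns (shift_x, shift_y) as fractions of the BEV width and height.
    """
    grid_length_y, grid_length_x = grid_length
    ego_angle = math.degrees(ego_yaw_rad)
    translation_length = math.hypot(delta_x, delta_y)
    translation_angle = math.degrees(math.atan2(delta_y, delta_x))
    # Angle between the heading and the direction of travel.
    bev_angle = math.radians(ego_angle - translation_angle)
    shift_y = translation_length * math.cos(bev_angle) / grid_length_y / bev_h
    shift_x = translation_length * math.sin(bev_angle) / grid_length_x / bev_w
    return shift_x, shift_y


# Assumed values: 1 m of travel exactly along the heading.
shift_x, shift_y = bev_shift(1.0, 0.0, 0.0, (0.512, 0.512), 200, 200)
```

With these assumed values the vehicle moves 1 / 0.512 ≈ 1.95 cells, and dividing by the 200 rows gives shift_y ≈ 0.0098 while shift_x is 0. The result is a normalized offset, which is why it can be added directly to reference points that live in the [0, 1] range.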

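Add the inter-frame offset to the coordinates of the BEV feature points. The offset is what links each BEV feature point of the current frame to the BEV feature of the previous frame.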

# bug: this code should be 'shift_ref_2d = ref_2d.clone()', we keep this bug for reproducing our results in paper.
shift_ref_2d = ref_2d  # .clone()
shift_ref_2d += shift[:, None, None, :]

......

prev_bev = torch.stack([prev_bev, bev_query], 1).reshape(bs*2, len_bev, -1)
hybird_ref_2d = torch.stack([shift_ref_2d, ref_2d], 1).reshape(bs*2, len_bev, num_bev_level, 2)
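
The consequence of the missing .clone() noted in the code comment can be reproduced in isolation. The sketch below uses NumPy as a stand-in for torch (in-place += on an aliased array behaves the same way, and .copy() plays the role of .clone()). The toy sizes and the variable names prefixed with aliased_ and copied_ are assumptions made for illustration.

```python
import numpy as np

bs, len_bev, num_bev_level = 1, 4, 1           # toy sizes (assumed)
ref_2d = np.full((bs, len_bev, num_bev_level, 2), 0.5)
shift = np.array([[0.01, 0.02]])               # one (shift_x, shift_y) per sample

# Aliased version, as in the excerpt: both names point at the same buffer.
aliased_ref = ref_2d.copy()                    # stand-in for ref_2d
aliased_shift_ref = aliased_ref                # no copy is made here
aliased_shift_ref += shift[:, None, None, :]   # also modifies aliased_ref
aliased_hybrid = np.stack([aliased_shift_ref, aliased_ref], 1)

# Copied version: what the comment in the excerpt says was intended.
copied_shift_ref = ref_2d.copy()
copied_shift_ref += shift[:, None, None, :]
copied_hybrid = np.stack([copied_shift_ref, ref_2d], 1)

# Compare the two halves of each stacked reference tensor.
same_when_aliased = np.allclose(aliased_hybrid[:, 0], aliased_hybrid[:, 1])
same_when_copied = np.allclose(copied_hybrid[:, 0], copied_hybrid[:, 1])
```

In the aliased case both halves of the stacked reference points carry the shift, so the previous-BEV branch and the current-query branch sample at the same locations. With a copy, only the first half is shifted.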

Processing the BEV Query

Add the CAN bus features to the query features.

# add can bus signals
can_bus = bev_queries.new_tensor(
            [each['can_bus'] for each in kwargs['img_metas']])  # [:, :]
can_bus = self.can_bus_mlp(can_bus)[None, :, :]
bev_queries = bev_queries + can_bus * self.use_can_bus
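
The indexing with [None, :, :] implies that one embedding per sample is broadcast over all BEV queries. The NumPy sketch below illustrates that broadcasting only. The (num_query, bs, embed_dims) layout is inferred from the excerpt, while the toy sizes, the 18-dimensional CAN bus vector, and the single linear layer with ReLU standing in for self.can_bus_mlp are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_query, bs, embed_dims, can_bus_dim = 6, 2, 8, 18   # toy sizes (assumed)

bev_queries = rng.normal(size=(num_query, bs, embed_dims))
can_bus = rng.normal(size=(bs, can_bus_dim))

# Hypothetical one-layer stand-in for self.can_bus_mlp.
weight = rng.normal(size=(can_bus_dim, embed_dims))
can_bus_embed = np.maximum(can_bus @ weight, 0.0)[None, :, :]  # (1, bs, embed_dims)

use_can_bus = True  # multiplying by False would zero the term out
fused = bev_queries + can_bus_embed * use_can_bus

# Every query of a given sample receives the same additive offset.
offset = fused - bev_queries
```

Because the offset is identical for every grid position, this step tells the queries how the vehicle moved as a whole rather than anything specific to one BEV cell.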

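3.2 Fine Feature Alignment

Once the features at times t and t-1 have been aligned using the vehicle's motion parameters (ego motion), the remaining fine-grained alignment is left to the network's own attention module, which learns the correction. In the formula below, Q_p is the BEV query at position p and B_{t-1}^{\prime} is B_{t-1} after the coarse alignment.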

\text{TSA}(Q_p, \{Q, B_{t-1}^{\prime}\}) = \sum_{V \in \{Q, B_{t-1}^{\prime}\}} \text{DeformAttn}(Q_p, p, V)

The DeformAttention computation is the same as in BEVFormer: Learning BEV Representations from Multi-Camera Images (2).

query = torch.cat([value[:bs], query], -1)
value = self.value_proj(value)
......
sampling_offsets = self.sampling_offsets(query)
......
attention_weights = self.attention_weights(query)
......
attention_weights = attention_weights.softmax(-1)
......
output = multi_scale_deformable_attn_pytorch(value, spatial_shapes, sampling_locations, attention_weights)

output = output.mean(-1)

output = self.output_proj(output)

return self.dropout(output) + identity
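
One detail worth spelling out is how the two attention results are combined. The TSA formula sums over the two value sources, whereas the excerpt calls output.mean(-1). The NumPy sketch below shows that the two differ only by a constant factor. The toy sizes, the random stand-ins for the two deformable-attention results, and the layout with the two sources on a trailing axis are assumptions, since the excerpt elides the reshaping before the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
num_query, embed_dims, bs, num_bev_queue = 5, 4, 1, 2   # toy sizes (assumed)

# Stand-ins for the deformable-attention results computed against
# the aligned previous BEV and against the current query.
out_prev = rng.normal(size=(num_query, embed_dims, bs))
out_curr = rng.normal(size=(num_query, embed_dims, bs))

# Put the two results on a trailing "queue" axis and average them.
stacked = np.stack([out_prev, out_curr], axis=-1)
fused = stacked.mean(-1)

# The mean equals the sum in the TSA formula scaled by 1 / num_bev_queue.
summed = out_prev + out_curr
```

Since the result is passed through a learned output projection afterwards, the constant factor can be absorbed by the weights, so averaging and summing are interchangeable in practice.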

The network structure corresponding to the code above is as follows: