Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks

https://perceptual.actor/

Abstract

One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks?

We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch and generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a fairly generic perception module for active frameworks.
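As a concrete illustration of this setup: the perception module is a frozen, pretrained mid-level encoder whose output, rather than raw pixels, becomes the policy's observation. The following PyTorch sketch is a minimal, hypothetical rendering of that idea; the encoder, feature dimension, and head architecture are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the authors' code): a frozen mid-level encoder phi(o)
# feeds a small trainable policy head, so RL optimizes only the head while
# the perception module stays fixed.
import torch
import torch.nn as nn

class MidLevelPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, n_actions: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze the perception module
        self.head = nn.Sequential(           # only this head is trained by RL
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # features are fixed, never fine-tuned
            feats = self.encoder(obs).flatten(1)
        return self.head(feats)              # action logits for the active task

# Usage (hypothetical): `encoder` could be any pretrained mid-level network,
# e.g. one trained for depth or surface-normal estimation.
# policy = MidLevelPolicy(encoder, feat_dim=2048, n_actions=4)
```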

We test three core hypotheses:

Hypothesis I: Does mid-level vision provide an advantage in terms of sample efficiency when learning an active task? (Answer: yes.)

Hypothesis II: Do mid-level vision features generalize better to unseen spaces? (Answer: yes.)

Hypothesis III: Can a single fixed feature suffice for arbitrary downstream tasks, or is a set of features essential? (Answer: a set is essential.)
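To make Hypothesis III's conclusion concrete: since no single feature dominates across tasks, a generic perception module can expose a set of frozen encoders and concatenate their outputs into one observation. A minimal sketch under the same assumptions as above; which encoders to include (e.g. depth or surface normals) is purely illustrative, not the paper's prescribed set.

```python
# Hedged sketch: build the policy's observation from a *set* of frozen
# mid-level encoders rather than any single one.
import torch
import torch.nn as nn

class FeatureSet(nn.Module):
    """Concatenates the outputs of several frozen mid-level encoders."""
    def __init__(self, encoders: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        for p in self.parameters():
            p.requires_grad = False          # the whole set stays frozen

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feats = [enc(obs).flatten(1) for enc in self.encoders.values()]
        return torch.cat(feats, dim=1)       # one observation vector per image

# Usage (hypothetical encoder names):
# obs_features = FeatureSet({"depth": depth_net, "normals": normal_net})(obs)
```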

Originally published on the WeChat official account CreateAMind (createamind).

Original publication date: 2019-01-10
