Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks
One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving “vision” is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks?
We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch and generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a rather generic perception module for active frameworks.
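The architecture described above can be sketched as a frozen mid-level encoder whose outputs feed a small trainable policy head. The sketch below is a minimal illustration, not the paper's implementation: the "encoder" is a stand-in (simple image gradients in place of a pretrained network for a task such as surface normals or edges), and the policy is a linear head whose weights would be the only parameters updated by the RL algorithm.

```python
import numpy as np

def midlevel_features(image):
    """Hypothetical frozen mid-level encoder: mean image gradients
    stand in for a pretrained mid-level vision network (an assumption
    for illustration, not the paper's model)."""
    gx = np.diff(image, axis=1).mean()  # horizontal structure
    gy = np.diff(image, axis=0).mean()  # vertical structure
    return np.array([gx, gy])

class LinearPolicy:
    """Lightweight policy head on top of frozen features; only these
    weights would be trained by the downstream RL algorithm."""
    def __init__(self, feat_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_actions, feat_dim))

    def act(self, image):
        feats = midlevel_features(image)  # frozen: never updated
        logits = self.W @ feats
        return int(np.argmax(logits))

policy = LinearPolicy(feat_dim=2, n_actions=3)
frame = np.zeros((8, 8)); frame[:, 4:] = 1.0  # toy observation
action = policy.act(frame)
```

Because the encoder is frozen, the policy sees a low-dimensional, task-relevant summary of the image rather than raw pixels, which is the mechanism the paper credits for improved sample efficiency.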
We test three core hypotheses:
I. whether mid-level vision provides an advantage in the sample efficiency of learning an active task (answer: yes)
II. whether mid-level vision provides an advantage for generalization to unseen spaces (answer: yes)
III. whether a single fixed mid-level vision feature suffices, or a set of features is essential, to support arbitrary active tasks (answer: a set is essential).
Hypothesis I: Does mid-level vision provide an advantage in terms of sample efficiency when learning an active task?
Hypothesis II: Can mid-level vision features generalize better to unseen spaces?
Hypothesis III: Can a single feature support arbitrary downstream tasks? Or is a set of features required?
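One way to act on the answer to Hypothesis III (a set of features is essential) is to pick a compact feature set so that every downstream task has at least one feature that serves it well, which reduces to a set-cover problem. The sketch below uses a standard greedy set-cover heuristic; the task names and the feature-to-task coverage map are made-up placeholders, not results from the study.

```python
# Hypothetical coverage map: which downstream tasks each mid-level
# feature supports well (placeholder values for illustration only).
covers = {
    "surface_normals": {"navigation", "exploration"},
    "object_class":    {"search"},
    "depth":           {"navigation"},
}
tasks = {"navigation", "exploration", "search"}

def greedy_feature_set(covers, tasks):
    """Greedy set cover: repeatedly pick the feature that covers the
    most still-uncovered tasks until all tasks are covered."""
    chosen, uncovered = [], set(tasks)
    while uncovered:
        best = max(covers, key=lambda f: len(covers[f] & uncovered))
        if not covers[best] & uncovered:
            break  # remaining tasks cannot be covered by any feature
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

print(greedy_feature_set(covers, tasks))
```

Under this toy coverage map the greedy pass selects "surface_normals" first (it covers two tasks) and then "object_class", illustrating why a small set can serve tasks that no single feature covers alone.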
Originally published on the WeChat official account CreateAMind (createamind).