
Inverse Reinforcement Learning: Learning a Prior over Human Intent

LEARNING A PRIOR OVER INTENT VIA META-INVERSE REINFORCEMENT LEARNING

https://arxiv.org/abs/1805.12573v3

ABSTRACT

A significant challenge for the practical application of reinforcement learning to real world problems is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert behavior. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a “prior” that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior.

1 INTRODUCTION

Our approach relies on the key observation that related tasks share common structure that we can leverage when learning new tasks. To illustrate, consider a robot navigating through a home.

While the exact reward function we provide to the robot may differ depending on the task, there is structure across the space of useful behaviors, such as navigating to a series of landmarks, and there are certain behaviors we always want to encourage or discourage, such as avoiding obstacles or staying a reasonable distance from humans. This notion agrees with our understanding of why humans can easily infer the intents and goals (i.e., reward functions) of even abstract agents from just one or a few demonstrations (Baker et al., 2007), as humans have access to strong priors about how other humans accomplish similar tasks, accrued over many years. Similarly, our objective is to discover the common structure among different tasks, and encode this structure in a way that can be used to infer reward functions from a few demonstrations.

3.1

Learning in general energy-based models of this form is common in many applications such as structured prediction. However, in contrast to applications where learning can be supervised by millions of labels (e.g. semantic segmentation), the learning problem in Eq. 3 must typically be performed with a relatively small number of example demonstrations. In this work, we seek to address this issue in IRL by providing a way to integrate information from prior tasks to constrain the optimization in Eq. 3 in the regime of limited demonstrations.
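The excerpt does not reproduce Eq. 3 itself. As a point of reference only, the kind of energy-based IRL objective being described here is, in its standard maximum-entropy form (a textbook statement, not a verbatim copy of the paper's equation):

```latex
% MaxEnt IRL: demonstrated trajectories are modeled as exponentially more
% likely under trajectories with higher learned reward R_theta(tau).
\max_{\theta} \; \sum_{\tau \in \mathcal{D}} \log p_{\theta}(\tau),
\qquad
p_{\theta}(\tau) \;=\;
  \frac{\exp\!\big(R_{\theta}(\tau)\big)}
       {\int \exp\!\big(R_{\theta}(\tau')\big)\, d\tau'}
```

With only a handful of demonstrations in the dataset D, this likelihood is weakly constrained, which is exactly the regime the prior over tasks is meant to address.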

4 LEARNING TO LEARN REWARDS

Our goal in meta-IRL is to learn how to learn reward functions across many tasks such that the model can infer the reward function for a new task using only one or a few expert demonstrations. Intuitively, we can view this problem as aiming to learn a prior over the intentions of human demonstrators, such that when given just one or a few demonstrations of a new task, we can combine the learned prior with the new data to effectively determine the human’s reward function. Such a prior is helpful in inverse reinforcement learning settings, since the space of relevant reward functions is much smaller than the space of all possible rewards definable on the raw observations.
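As a concrete illustration of combining a learned prior with a few demonstrations, the following is a minimal sketch (not the authors' implementation): a toy linear reward over trajectory features, a MaxEnt IRL likelihood normalized over an enumerated set of candidate trajectories, and a few gradient steps of adaptation starting from a meta-learned initialization. All names here (`adapt`, `nll_and_grad`, the toy data) are illustrative assumptions.

```python
import numpy as np

def traj_reward(theta, feats):
    # Linear reward: per-step feature vectors summed over the trajectory,
    # then dotted with the reward parameters theta.
    return feats.sum(axis=0) @ theta

def nll_and_grad(theta, demo_feats, candidate_feats):
    # Negative log-likelihood of the demos under p(tau) ~ exp(R_theta(tau)),
    # normalized over an enumerated set of candidate trajectories.
    scores = np.array([traj_reward(theta, f) for f in candidate_feats])
    shifted = scores - scores.max()
    log_z = np.log(np.exp(shifted).sum()) + scores.max()
    demo_scores = np.array([traj_reward(theta, f) for f in demo_feats])
    nll = -(demo_scores - log_z).mean()
    # Gradient = E_model[features] - E_demo[features].
    probs = np.exp(shifted) / np.exp(shifted).sum()
    model_feat = sum(p * f.sum(axis=0) for p, f in zip(probs, candidate_feats))
    demo_feat = np.mean([f.sum(axis=0) for f in demo_feats], axis=0)
    return nll, model_feat - demo_feat

def adapt(theta_prior, demo_feats, candidate_feats, lr=0.1, steps=3):
    # "Fast adaptation": a few gradient steps away from the meta-learned
    # initialization, which plays the role of the prior over intent.
    theta = theta_prior.copy()
    for _ in range(steps):
        _, g = nll_and_grad(theta, demo_feats, candidate_feats)
        theta = theta - lr * g
    return theta

# Toy usage: 2-D features, 4 candidate trajectories of 3 steps each,
# one of which is the single demonstration for the new task.
rng = np.random.default_rng(0)
candidates = [rng.normal(size=(3, 2)) for _ in range(4)]
demos = [candidates[0]]
theta_0 = np.zeros(2)                     # stand-in for a meta-learned prior
theta_task = adapt(theta_0, demos, candidates)
```

In the full method, the initialization itself is meta-trained across many tasks, so that this adaptation step recovers a sensible reward function from very few demonstrations of a new task.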

Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations (ICLR), 2018.

4.2 INTERPRETATION AS LEARNING A PRIOR OVER INTENT

The objective in Eq. 6 optimizes for parameters that enable the reward function to adapt and generalize efficiently on a wide range of tasks. Intuitively, constraining the space of reward functions to lie within a few steps of gradient descent can be interpreted as expressing a “locality” prior over reward function parameters. This intuition can be made more concrete with the following analysis.
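Eq. 6 is not reproduced in this excerpt. As a point of reference only, a generic MAML-style meta-objective over tasks T, which adapts the reward parameters with one inner gradient step on each task's training demonstrations and evaluates on held-out demonstrations of the same task, has the form below (a generic sketch, not a verbatim copy of the paper's Eq. 6):

```latex
% Generic MAML-style meta-objective: inner adaptation on each task's
% demonstrations, outer evaluation on held-out demonstrations of that task.
\min_{\theta} \sum_{\mathcal{T}}
  \mathcal{L}^{\text{test}}_{\mathcal{T}}\!\big(\phi_{\mathcal{T}}\big),
\qquad
\phi_{\mathcal{T}} \;=\; \theta \;-\; \alpha \,
  \nabla_{\theta}\, \mathcal{L}^{\text{train}}_{\mathcal{T}}(\theta)
```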

By viewing IRL as maximum likelihood estimation, we can take the perspective of Grant et al. (2018), who showed that for a linear model, fast adaptation via a few steps of gradient descent in MAML is performing MAP inference over φ, under a Gaussian prior with mean θ and a covariance that depends on the step size, number of steps, and curvature of the loss. This is based on the connection between early stopping and regularization previously discussed in Santos (1996), which we refer the readers to for a more detailed discussion. The interpretation of MAML as imposing a Gaussian prior on the parameters is exact in the case of a likelihood that is quadratic in the parameters (such as the log-likelihood of a Gaussian in terms of its mean). For any non-quadratic likelihood, this is an approximation in a local neighborhood around θ (i.e., up to a convex quadratic approximation). In the case of very complex parameterizations, such as deep function approximators, this is a coarse approximation and unlikely to be the mode of a posterior. However, we can still frame the effect of early stopping and initialization as serving as a prior in a similar way as prior work (Sjöberg & Ljung, 1995; Duvenaud et al., 2016; Grant et al., 2018).

More importantly, this interpretation hints at future extensions to our approach that could benefit from employing more fully Bayesian approaches to reward and goal inference.

This paper is later referenced by "Few-Shot Goal Inference for Visuomotor Learning and Planning", and subsequently by Visual Foresight.

Shared from the WeChat public account CreateAMind (createamind).

Originally published: 2019-01-18
