Code: Learning by Playing – Solving Sparse Reward Tasks from Scratch


https://github.com/HugoCMU/pySACQ

https://zhuanlan.zhihu.com/p/34222231

PySACX

This repo contains a PyTorch implementation of the SAC-X RL algorithm [1]. It uses the Lunar Lander v2 environment from OpenAI Gym. The SAC-X algorithm enables learning complex behaviors from scratch in the presence of multiple sparse reward signals.

Theory

In addition to the main task reward, we define a series of auxiliary rewards. An important assumption is that each auxiliary reward can be evaluated at any state-action pair. The rewards are defined below (a sketch of how they might be implemented follows the lists).

Auxiliary Tasks/Rewards

  • Touch. Maximize the number of legs touching the ground
  • Hover Planar. Minimize the planar movement of the lander craft
  • Hover Angular. Minimize the rotational movement of the lander craft
  • Upright. Minimize the angle of the lander craft
  • Goal Distance. Minimize the distance between the lander craft and the landing pad

Main Task/Reward

  • Landing success. A sparse reward given only when the lander lands successfully
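
Below is a minimal sketch of how such per-task reward functions might be written against the LunarLander-v2 observation vector (x, y, vx, vy, angle, angular velocity, two leg-contact flags). The observation indexing is standard for this environment, but the scaling and exact definitions are illustrative assumptions rather than the repo's actual code.

```python
import numpy as np

# Illustrative auxiliary rewards for LunarLander-v2.
# Observation layout: x, y, vx, vy, angle, angular_velocity,
# left_leg_contact, right_leg_contact.
# Each function can be evaluated at any state-action pair, as the
# SAC-X setup assumes. Scaling here is a sketch, not the repo's code.

def touch(obs, action):
    # Reward legs touching the ground
    return obs[6] + obs[7]

def hover_planar(obs, action):
    # Penalize planar velocity of the lander
    return -(abs(obs[2]) + abs(obs[3]))

def hover_angular(obs, action):
    # Penalize angular velocity
    return -abs(obs[5])

def upright(obs, action):
    # Penalize deviation from an upright orientation
    return -abs(obs[4])

def goal_distance(obs, action):
    # Penalize distance from the landing pad near the origin
    return -np.sqrt(obs[0] ** 2 + obs[1] ** 2)

AUXILIARY_REWARDS = [touch, hover_planar, hover_angular, upright, goal_distance]
```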

Each of these tasks (intentions in the paper) has a task-specific head within the neural networks used to estimate the actor and critic functions. When executing a trajectory during training, the active task (and, correspondingly, the head used within the actor) is switched between the available options. This switching can either be done uniformly at random (SAC-U) or be learned (SAC-Q).
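
As a rough illustration of the SAC-U case, a scheduler might simply resample the active task uniformly at random at a fixed interval within a trajectory. The class interface and switching period below are assumptions for the sketch, not the repo's API.

```python
import random

class UniformScheduler:
    """SAC-U style scheduler: pick the active task (intention) uniformly
    at random every `switch_every` environment steps. The interface and
    the switching period are illustrative assumptions."""

    def __init__(self, num_tasks, switch_every=100):
        self.num_tasks = num_tasks
        self.switch_every = switch_every
        self.current_task = random.randrange(num_tasks)

    def step(self, t):
        # Resample the active task at fixed intervals within a trajectory
        if t % self.switch_every == 0:
            self.current_task = random.randrange(self.num_tasks)
        return self.current_task
```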

The diagrams in the original README show the network architectures for the actor and critic functions. Note the N possible heads for the N possible tasks (intentions in the paper) [1].
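
For concreteness, a shared-trunk actor with one output head per task might look like the following PyTorch sketch. The layer sizes and the Gaussian (continuous-action) parameterization are illustrative assumptions; the actual architecture lives in model.py in the repo.

```python
import torch
import torch.nn as nn

class MultiHeadActor(nn.Module):
    """Shared trunk with one output head per task (intention).
    Layer sizes and the Gaussian head are illustrative assumptions;
    see model.py in the repo for the actual architecture."""

    def __init__(self, obs_dim, act_dim, num_tasks, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One (mean, log_std) head per task
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2 * act_dim) for _ in range(num_tasks)]
        )

    def forward(self, obs, task_id):
        features = self.trunk(obs)
        mean, log_std = self.heads[task_id](features).chunk(2, dim=-1)
        return mean, log_std.clamp(-20, 2)
```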

Learning

Learning the actor (policy function) is done off-policy using a gradient-based approach. Gradients are backpropagated through task-specific versions of the actor using the task-specific versions of the critic (Q function). Importantly, the trajectory (collection of state-action pairs) need not have been collected using the same task-specific actor, which allows learning from data generated by all other actors. The actor policy gradient is computed using the reparameterization trick (code in model.py).
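
A minimal sketch of such a reparameterized actor update, assuming a Gaussian policy head and omitting the entropy term used in SAC-style objectives, could look like the following. The actor and critic signatures are assumptions for the sketch, not the repo's API.

```python
import torch

def actor_loss(actor, critic, obs, task_id):
    """Policy loss via the reparameterization trick: sample actions as a
    deterministic function of the policy parameters plus noise, so the
    gradient of Q flows back into the actor. Signatures are illustrative;
    the entropy bonus used in SAC-style objectives is omitted here."""
    mean, log_std = actor(obs, task_id)
    std = log_std.exp()
    # Reparameterized sample: a = tanh(mean + std * eps), eps ~ N(0, I)
    eps = torch.randn_like(mean)
    action = torch.tanh(mean + std * eps)
    # Maximize the task-specific Q value (minimize its negative)
    q_value = critic(obs, action, task_id)
    return -q_value.mean()
```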

Learning the critic (Q function) is similarly done off-policy. Trajectories are sampled from a buffer collected with target actors (actor policies frozen at a particular learning iteration). The critic targets are computed using the Retrace method (code in model.py).
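
The sketch below shows a simplified Retrace target computation for a single trajectory, using truncated importance weights between the current policy and the behavior policy. Argument names, tensor shapes, and the handling of the final step are illustrative assumptions; the repo's model.py has the actual implementation.

```python
import torch

def retrace_targets(rewards, q_values, next_v, importance, gamma=0.99, lam=1.0):
    """Simplified Retrace target computation for one trajectory.

    rewards:    (T,) task-specific rewards r_t
    q_values:   (T,) target-critic estimates Q'(s_t, a_t)
    next_v:     (T,) expected next values E_{a~pi} Q'(s_{t+1}, a)
    importance: (T,) ratios pi(a_t | s_t) / mu(a_t | s_t)

    Argument names and shapes are illustrative assumptions.
    """
    T = rewards.shape[0]
    # Truncated importance weights c_t = lam * min(1, rho_t)
    c = lam * torch.clamp(importance, max=1.0)
    targets = torch.zeros(T)
    correction = 0.0  # discounted off-policy correction carried back from t+1
    for t in reversed(range(T)):
        targets[t] = rewards[t] + gamma * next_v[t] + gamma * correction
        correction = c[t] * (targets[t] - q_values[t])
    return targets
```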

Instructions

  • Use the local/template.sh script to train many model variations, or use train.py to train an agent directly.

Requirements

  • Python 3.6
  • PyTorch 0.3.0.post4
  • OpenAI Gym
  • tensorboardX

Sources

[1] Riedmiller et al., "Learning by Playing – Solving Sparse Reward Tasks from Scratch," 2018. arXiv:1802.10567.
