The exploration-exploitation trade-off is at the heart of reinforcement learning (RL). However, most continuous control benchmarks used in recent RL research require only local exploration, which has led to algorithms with basic exploration capabilities that perform poorly on benchmarks demanding more versatile exploration. For instance, as demonstrated in our empirical study, state-of-the-art RL algorithms such as DDPG and TD3 are unable to steer a point mass through even small 2D mazes. In this paper, we propose a new algorithm called "Plan, Backplay, Chain Skills" (PBCS), which combines motion planning and reinforcement learning to solve hard-exploration environments. In a first phase, a motion planning algorithm finds a single good trajectory; an RL algorithm is then trained with a curriculum derived from that trajectory, combining a variant of the Backplay algorithm with skill chaining. We show that this method outperforms state-of-the-art RL algorithms in 2D maze environments of various sizes, and that it is able to improve on the trajectory obtained in the motion planning phase.
Original title: PBCS : Efficient Exploration and Exploitation Using a Synergy between Reinforcement Learning and Motion Planning
Original abstract: The exploration-exploitation trade-off is at the heart of reinforcement learning (RL). However, most continuous control benchmarks used in recent RL research only require local exploration. This led to the development of algorithms that have basic exploration capabilities, and behave poorly in benchmarks that require more versatile exploration. For instance, as demonstrated in our empirical study, state-of-the-art RL algorithms such as DDPG and TD3 are unable to steer a point mass in even small 2D mazes. In this paper, we propose a new algorithm called "Plan, Backplay, Chain Skills" (PBCS) that combines motion planning and reinforcement learning to solve hard exploration environments. In a first phase, a motion planning algorithm is used to find a single good trajectory, then an RL algorithm is trained using a curriculum derived from the trajectory, by combining a variant of the Backplay algorithm and skill chaining. We show that this method outperforms state-of-the-art RL algorithms in 2D maze environments of various sizes, and is able to improve on the trajectory obtained by the motion planning phase.
Authors: Guillaume Matheron, Nicolas Perrin, Olivier Sigaud
Original link: https://arxiv.org/abs/2004.11667
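
To make the two-phase structure more concrete, here is a minimal Python sketch of the control flow suggested by the abstract: a planning phase that returns a single trajectory, followed by a Backplay-style backward curriculum with skill chaining. This is only an illustration under stated assumptions, not the authors' implementation; the planner (`plan_trajectory`), the RL trainer (`train_rl`, e.g. one TD3 run from a fixed start state), the goal tests, and the `backplay_step` size are all hypothetical placeholders supplied by the caller.

```python
# Illustrative sketch of the two-phase PBCS idea described in the abstract --
# NOT the authors' code. All callables below are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

State = Sequence[float]
Policy = Callable[[State], Sequence[float]]   # maps a state to an action
GoalTest = Callable[[State], bool]            # true when the skill's goal is reached


@dataclass
class Skill:
    policy: Policy           # frozen controller for one section of the trajectory
    activation_state: State  # earliest planned state this skill was trained from


def pbcs(
    plan_trajectory: Callable[[], List[State]],                   # phase 1, e.g. an RRT-style planner
    train_rl: Callable[[State, GoalTest], Tuple[Policy, bool]],   # e.g. one TD3 run from a fixed start
    final_goal_test: GoalTest,
    make_reach_test: Callable[[State], GoalTest],                 # "get close to this state" goal
    backplay_step: int = 5,
) -> List[Skill]:
    """Plan, Backplay, Chain Skills -- high-level control flow only."""
    # Phase 1: motion planning produces a single feasible trajectory.
    trajectory = plan_trajectory()

    # Phase 2: Backplay-style curriculum. The RL start state walks backwards
    # along the planned trajectory; whenever learning stalls, the current
    # policy is frozen as a skill and a new skill is trained whose goal is
    # to reach the frozen skill's activation region (skill chaining).
    skills: List[Skill] = []
    goal_test = final_goal_test
    current: Optional[Skill] = None
    i = len(trajectory) - 1

    while i > 0:
        i = max(i - backplay_step, 0)   # move the start state further back
        start = trajectory[i]
        policy, success = train_rl(start, goal_test)

        if not success and current is not None:
            # Learning stalled: freeze the current skill and chain a new one
            # whose goal is to reach the frozen skill's activation region.
            skills.append(current)
            goal_test = make_reach_test(current.activation_state)
            policy, success = train_rl(start, goal_test)

        if success:
            current = Skill(policy=policy, activation_state=start)
        # If training still fails, a real implementation would retry with a
        # smaller backplay step or a larger training budget; omitted here.

    if current is not None:
        skills.append(current)
    return list(reversed(skills))  # execution order: earliest skill first
```

At execution time the skills would be run in sequence, each handing control to the next once its successor's activation region is reached; according to the abstract, this learned chain can improve on the raw trajectory produced by the planner. The retry logic for stalled training runs is deliberately left out of the sketch.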