
Episodic Curiosity in Reinforcement Learning

Author: CreateAMind
Published: 2019-07-22

EPISODIC CURIOSITY THROUGH REACHABILITY

https://arxiv.org/abs/1810.02274

https://github.com/google-research/episodic-curiosity

1. “Couch-potato” issues

The exploration-exploitation dilemma is one of the core tensions that shapes reinforcement learning algorithms: how much should an agent explore the environment versus exploit the actions it already knows will yield reward? In reinforcement learning, exploration and exploitation are often treated as opposing forces that constrain curiosity in the reward model. Yet, just as in human cognition, curiosity in reinforcement learning agents results in more robust knowledge, especially in sparse-reward scenarios. So how can we encourage curiosity without it working against the task the agent is meant to solve?

Curiosity can produce a reward bonus b_t which is then summed with the task reward r_t to give an augmented reward r'_t = r_t + b_t. Many modern curiosity formulations aim at maximizing "surprise", i.e. the inability to predict the future. This approach makes intuitive sense but is, in fact, far from perfect. To show why, let us consider a thought experiment. Imagine an agent is put into a 3D maze. There is a precious goal somewhere in the maze which would give a large reward. Now, the agent is also given a remote control to a TV and can switch the channels. Every switch shows a random image. Curiosity formulations which optimize surprise would rejoice, because the result of the channel-switching action is unpredictable. The agent would be drawn to the TV instead of looking for the goal in the environment. So, should we call the channel-switching behaviour curious? Maybe, but it is unproductive for the original sparse-reward goal-reaching task. What would be a definition of curiosity that does not suffer from such "couch-potato" behaviour?

2. Episodic curiosity (EC)

We propose a new curiosity definition based on the following intuition. If the agent knew that the observation after changing a TV channel is only one step away from the observation before the switch, changing the channel would probably not be interesting in the first place (too easy). This intuition can be formalized as giving a reward only for those observations which take some effort to reach (outside the already explored part of the environment). The effort is measured in the number of environment steps.

To estimate novelty, we train a neural-network approximator: given two observations, it predicts how many steps separate them.
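As a rough illustration, the following Python sketch shows one way such a reachability predictor could be supervised from a recorded trajectory: pairs of observations at most k steps apart are labeled reachable, pairs much farther apart are labeled unreachable, and the embedding and comparator networks are then trained on these pairs. This is not the authors' code; make_reachability_pairs, the threshold k, and the negative-pair gap are illustrative assumptions.

```python
import random

def make_reachability_pairs(observations, k=5, gap=5, num_pairs=1000):
    """Sample (o_i, o_j, label) training triples from one trajectory (sketch).

    label = 1 if the two observations are at most k environment steps apart
    (reachable); label = 0 if they are more than gap * k steps apart.
    """
    pairs = []
    n = len(observations)
    while len(pairs) < num_pairs:
        i, j = sorted(random.sample(range(n), 2))
        distance = j - i
        if distance <= k:
            pairs.append((observations[i], observations[j], 1))
        elif distance > gap * k:
            pairs.append((observations[i], observations[j], 0))
        # pairs at intermediate distances are ambiguous and are skipped
    return pairs
```

A network trained to classify these pairs learns which observations are within easy reach of one another, which is what the novelty estimate below relies on.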

To make the description above practically implementable, one piece is still missing. To determine the novelty of the current observation, we need to keep track of what has already been explored in the environment. A natural candidate for that purpose is episodic memory: it stores instances of the past, which makes it easy to apply the reachability approximator to pairs of current and past observations.

Our method works as follows. The agent starts with an empty memory at the beginning of the episode and at every step compares the current observation with the observations in memory to determine novelty. If the current observation is indeed novel (it takes more steps to reach from the observations in memory than a given threshold), the agent rewards itself with a bonus and adds the current observation to the episodic memory. The process continues until the end of the episode.
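A minimal Python sketch of this loop is given below. The names env, policy, embed, and curiosity_bonus are hypothetical stand-ins for the environment, the agent's policy, the embedding network, and the bonus computation (detailed in the next section); the novelty threshold is likewise an assumed hyperparameter.

```python
def run_episode(env, policy, embed, curiosity_bonus, novelty_threshold=0.0):
    """One episode with an episodic-curiosity bonus (illustrative sketch)."""
    memory = []                         # episodic memory, cleared every episode
    obs = env.reset()
    memory.append(embed(obs))           # store the first observation directly
    done = False
    while not done:
        action = policy(obs)
        obs, task_reward, done, info = env.step(action)
        e = embed(obs)                  # embedding of the new observation
        b = curiosity_bonus(memory, e)  # how novel is it w.r.t. memory?
        if b > novelty_threshold:       # novel enough: remember it
            memory.append(e)
        policy.update(obs, action, task_reward + b)  # augmented reward r' = r + b
    return memory
```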

3. Bonus computation algorithm

At every time step, the current observation o goes through the embedding network, producing the embedding vector e = E(o). This embedding vector is compared with the stored embeddings in the memory buffer M = <e_1, ..., e_|M|> via the comparator network C, where |M| is the current number of elements in memory. The comparator network fills the reachability buffer with the values

c_i = C(e_i, e),  i = 1, ..., |M|.

Then the similarity score between the memory buffer and the current embedding is computed from the reachability buffer as

C(M, e) = F(c_1, ..., c_|M|),

where the aggregation function F is a hyperparameter of our method. Theoretically, F = max would be a good choice; in practice, however, it is prone to outliers coming from the parametric embedding and comparator networks. Empirically, we found that the 90th percentile works well as a robust substitute for the maximum. As a curiosity bonus, we take

b = α (β − C(M, e)),

where α > 0 and β are hyperparameters of our method. The value of α depends on the scale of task rewards. The value of β determines the sign of the reward — and thus could bias the episodes to be shorter or longer. Empirically, β = 0.5 works well for fixed-duration episodes, and β = 1 is preferred if an episode could have variable length.
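The bonus computation can be sketched in a few lines of Python. Here comparator stands for the trained comparator network C; treating an empty memory as maximally novel is an illustrative choice, not part of the published formula.

```python
import numpy as np

def curiosity_bonus(memory, e, comparator, alpha=1.0, beta=0.5):
    """Return b = alpha * (beta - C(M, e)) for the current embedding e (sketch)."""
    if not memory:
        return alpha * beta             # empty memory: nothing is reachable yet
    # reachability buffer: c_i = C(e_i, e) for every embedding stored in memory
    c = np.array([comparator(e_i, e) for e_i in memory])
    # aggregate with the 90th percentile as a robust substitute for the maximum
    similarity = np.percentile(c, 90)   # C(M, e) = F(c_1, ..., c_|M|)
    return alpha * (beta - similarity)
```

In practice the comparator and the hyperparameters would be bound first (for example with functools.partial), so that the resulting function matches the curiosity_bonus(memory, e) signature used in the episode-loop sketch above.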

4. Experiments

Task reward as a function of training step for the VizDoom tasks.

Reward as a function of training step for the DMLab tasks.

The following animation shows how the episodic-memory agent collects positive rewards (green) rather than negative rewards (red) while maintaining a buffer of explored locations in memory (blue).
