
Hands-On Reinforcement Learning, Part 1: Q-learning Cat-and-Mouse in Ten Steps (Video and Code)

Author: 用户1908973 | Published 2018-07-20 15:14:15 | Column: CreateAMind

Basic Reinforcement Learning (RL) - An introduction series to Reinforcement Learning (RL) with comprehensive step-by-step tutorials

This tutorial implements a cat-and-mouse game in which the mouse tries to steal cheese while avoiding the cat.

Demo animation of the result (not reproduced here).

The implementation is detailed below.

Background

Value Functions are state-action pair functions that estimate how good a particular action will be in a given state, or what the return for that action is expected to be.

Q-Learning is an off-policy algorithm (it can update the estimated value functions using hypothetical actions, i.e. actions that have not actually been tried) for temporal difference learning (a method to estimate value functions). It can be proven that, given sufficient training, Q-learning converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. Q-learning learns the optimal policy even when actions are selected according to a more exploratory or even random policy.
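For reference, the core of tabular Q-learning is the standard one-step update rule; here is a minimal sketch in Python (the names and the learning-rate/discount values are illustrative, not taken from the tutorial's code):

Code language: python

# Standard tabular Q-learning update: move Q(s, a) toward the observed
# reward plus the discounted best value achievable from the next state.
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    old = Q.get((state, action), 0.0)                       # current estimate
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)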

The lookup is performed according to the lookdist variable, which in this implementation is set to 2 (in other words, the mouse can "see" up to two cells in every direction).
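The exact state encoding lives in the repository; purely as an illustration (hypothetical names such as calc_state and world.get_cell, not the repo's code), an egocentric observation with lookdist = 2 could be built like this:

Code language: python

LOOKDIST = 2  # the mouse "sees" up to two cells in every direction

def calc_state(world, mouse_x, mouse_y):
    # Hypothetical sketch: encode the 5x5 window centred on the mouse as a
    # tuple of cell contents, so the observation can be used as a dict key.
    cells = []
    for dy in range(-LOOKDIST, LOOKDIST + 1):
        for dx in range(-LOOKDIST, LOOKDIST + 1):
            # world.get_cell(...) is an assumed accessor returning e.g.
            # 'wall', 'cat', 'cheese' or 'empty' for that coordinate.
            cells.append(world.get_cell(mouse_x + dx, mouse_y + dy))
    return tuple(cells)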

To finish up reviewing the Mouse implementation, let's look at how Q-learning is implemented:

Q-learning implementation

Let's look at the update method of the Mouse player:

The code is commented so that it is easy to follow, and the implementation matches the Q-learning pseudo-code discussed in the Background section above (note that, for the sake of the implementation, the actions in the Python implementation have been reordered). The full listing is in the repository; a rough sketch of its shape follows.
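A minimal sketch only, assuming a plain dictionary as the Q-table and with the observation and reward passed in as parameters (the reward terms are the ones listed next); this is not the repository's exact code:

Code language: python

import random

class MouseAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = {}                 # (state, action) -> learned value
        self.actions = actions      # e.g. the available movement directions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.last_state = None
        self.last_action = None

    def update(self, state, reward):
        # 'state' is the current egocentric observation (e.g. the tuple built
        # from the lookdist window) and 'reward' is the reward just received
        # (-100 / 50 / -1, as listed below).

        # Learn from the transition (last_state, last_action) -> state.
        if self.last_state is not None:
            key = (self.last_state, self.last_action)
            old = self.Q.get(key, 0.0)
            best_next = max(self.Q.get((state, a), 0.0) for a in self.actions)
            self.Q[key] = old + self.alpha * (reward + self.gamma * best_next - old)

        # Epsilon-greedy choice of the next action.
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.Q.get((state, a), 0.0))

        self.last_state, self.last_action = state, action
        return action

In the sketch the observation and the reward are simply passed in as parameters to keep it self-contained; how they are actually computed is shown in the repository code.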

Rewards are assigned as follows:

  • -100: if the Cat player eats the Mouse
  • 50: if the Mouse player eats the cheese
  • -1: otherwise

The learning algorithm records every state/action/reward combination in a dictionary whose keys are (state, action) tuples and whose values are the corresponding rewards.
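To make that representation concrete, here is a tiny sketch of such a dictionary and of the measurements used in the numbers below (the example keys are purely illustrative):

Code language: python

import sys

Q = {}                                   # (state, action) -> value
state = ('empty',) * 25                  # illustrative state tuple only
Q[(state, 'N')] = -1.0                   # moved, nothing happened
Q[(state, 'E')] = 50.0                   # ate the cheese

print(len(Q))                            # elements (state/action combinations) learned
print(sys.getsizeof(Q))                  # bytes reported for the dict object itself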

Note that the number of elements saved in the dictionary for this simplified 2D environment becomes considerable after a few generations. To get some insight into this, consider the following numbers:

  • After 10 000 generations:
    • 2430 elements (state/action/reward combinations) learned
    • Bytes: 196888 (192 KB)
  • After 100 000 generations:
    • 5631 elements (state/action/reward combinations) learned
    • Bytes: 786712 (768 KB)
  • After 600 000 generations:
    • 9514 elements (state/action/reward combinations) learned
    • Bytes: 786712 (768 KB)
  • After 1 000 000 generations:
    • 10440 elements (state/action/reward combinations) learned
    • Bytes: 786712 (768 KB)

Given the results shown above, one can observe that the value reported by Python's sys.getsizeof appears to stall at 786712 bytes (768 KB). This is because sys.getsizeof on a dictionary measures only the dictionary's own hash table, which grows in large discrete jumps, and not the keys and values it references. We can't provide accurate data, but given the results shown, one can conclude that the elements generated after 1 million generations should require something close to 10 MB of memory for this simplified 2D world.
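To estimate the full footprint rather than just the dictionary container, one can also sum the sizes of the keys and values; a rough sketch, assuming (state, action) tuple keys as above (shared and interned objects are counted repeatedly, so this is only an approximation):

Code language: python

import sys

def deep_size(q_table):
    # Approximate total memory of a {(state, action): value} dict, including
    # the key tuples, their elements and the stored values.
    total = sys.getsizeof(q_table)
    for (state, action), value in q_table.items():
        total += sys.getsizeof((state, action))      # the key tuple itself
        total += sys.getsizeof(action) + sys.getsizeof(value)
        total += sys.getsizeof(state)
        if isinstance(state, tuple):
            total += sum(sys.getsizeof(x) for x in state)
    return total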

Results

Mouse player after 15 000 generations of reinforcement learning (video not reproduced here).

The same player after 150 000 generations of reinforcement learning (video not reproduced here).

Reproduce it yourself

Code language: bash

git clone https://github.com/vmayoral/basic_reinforcement_learning
cd basic_reinforcement_learning
python tutorial1/egoMouseLook.py

To load the './worlds/waco.txt' world, replace line 131 in tutorial1/egoMouseLook.py with the following line (the line number refers to the file at the time the post was written and may differ in later revisions):
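Code language: python

world = cellular.World(Cell, directions=directions, filename='./worlds/waco.txt')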

Originally published 2016-08-24; shared from the CreateAMind WeChat official account.
