Source: https://github.com/rail-berkeley/softlearning (trained for about ten hours on 24 cores)
Pretty cool, right? Today we will discuss some fundamental ideas in reinforcement learning, so that one day you can make an agent like this walk too.
First of all, let's go back to the cooking example. When you cook, you take many actions: adding water, adding eggs, and so on. Every action you take is based on two things, the state and the policy. The state is what your kitchen and your dish look like right now, and the policy tells you what to do in that situation. Of course, your actions lead to some outcome, maybe a sweet cookie or a burned mess, and that outcome is the so-called reward. Let's make it more formal. The picture below illustrates the state-policy-action-reward dynamics.
Just as you want to make as many cookies as you can, the goal of the agent is to maximize its cumulative reward, called the return. Reinforcement learning methods are ways for the agent to learn behaviors that achieve this goal[1].
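To make the loop concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the ToyKitchen class, its "bake" and "serve" actions, and the reward values are assumptions, not part of softlearning or any other library. The state is how many minutes the cookie has baked, the policy maps that state to an action, and the return is taken here to be the plain sum of the rewards collected in one episode.

```python
class ToyKitchen:
    """Hypothetical toy environment: bake a cookie, then serve it."""

    def __init__(self, perfect=3, burn_after=5):
        self.perfect = perfect        # minutes of baking that give a sweet cookie
        self.burn_after = burn_after  # baking longer than this burns the cookie
        self.minutes = 0

    def reset(self):
        # Start a new episode with an unbaked cookie and return the first state.
        self.minutes = 0
        return self.minutes

    def step(self, action):
        # Returns (next_state, reward, done).
        if action == "serve":
            reward = 1.0 if self.minutes == self.perfect else -1.0
            return self.minutes, reward, True
        # Any other action is treated as "bake": one more minute in the oven.
        self.minutes += 1
        if self.minutes > self.burn_after:
            return self.minutes, -1.0, True   # burned, episode over
        return self.minutes, 0.0, False


def patient_policy(state):
    # Policy: look at the state (minutes baked) and choose an action.
    return "serve" if state >= 3 else "bake"


def impatient_policy(state):
    # Always serves immediately, so the cookie is still raw.
    return "serve"


def run_episode(env, policy):
    # The state -> policy -> action -> reward loop, repeated until the episode ends.
    state = env.reset()
    episode_return = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        episode_return += reward   # the return is the cumulative reward
    return episode_return


if __name__ == "__main__":
    print("patient policy return:", run_episode(ToyKitchen(), patient_policy))
    print("impatient policy return:", run_episode(ToyKitchen(), impatient_policy))
```

The two policies face the same kitchen but end up with different returns, and that difference is exactly what a reinforcement learning method tries to exploit when it improves a policy.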
Next time we will dig into these key concepts with precise definitions. Of course, math is inevitable, but we will make it easier to understand. Just like cooking, you will get used to it once you practice enough.
One more thing: the sqn algorithm in yesterday's article was an old version. The one that follows is the latest version, so please read this one instead.
[1] https://spinningup.openai.com/en/latest/spinningup/rl_intro.html