SQN on the Lunar Landing Challenge

Our team combined the SAC algorithm with the DQN method to get a practical way to solve control problems in discrete action spaces. Here is a small demo, created with a much simpler version of SQN.


To understand the SQN algorithm, we first need to introduce some key concepts in reinforcement learning. Let's start with the policy gradient. We don't want to go deep into the math and make you dizzy, so for now think of the policy gradient as a method that lets our lander improve its policy over time. It works just like us: we learn from experience, and we change our approach to life as time goes on.

Recall the first time you cooked. You made a mess of your kitchen, but when you were done you noticed that you shouldn't have put the eggs in so early. That is a so-called negative reward. The same goes for our little lander. When it crashes because of some actions it took, it gets a bad reward, so it changes its policy to avoid taking those actions in similar situations (so-called states).

But how can we change the policy? The answer is simple. We represent the policy with a function that takes the state as input and outputs the action you should take. We can tweak the parameters of this function to lower the probability of choosing the bad action. How do we tweak them, and by how much? In short, we move the parameters along the gradient, with a step scaled by the reward. That is easy to understand: the worse an action is, the more we should prevent it from appearing in the future. We will discuss the gradient in a later article; for now, you can think of it as an oracle that tells you in which direction to change your parameters.
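The update described above can be sketched in a few lines of NumPy. This is only a minimal illustration of a plain policy-gradient (REINFORCE-style) step, not the SQN implementation used in the demo. The linear softmax policy, the function names, the learning rate, and the sizes (8 state features, 4 actions) are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def action_probs(theta, state):
    # Hypothetical linear policy: one row of parameters per action.
    # theta has shape (n_actions, n_features), state has shape (n_features,).
    return softmax(theta @ state)

def policy_gradient_step(theta, state, action, reward, lr=0.1):
    # Gradient of log pi(action | state) with respect to the logits
    # of a softmax policy is (one_hot(action) - probs).
    probs = action_probs(theta, state)
    grad_logits = -probs
    grad_logits[action] += 1.0
    # Chain rule through the linear layer gives an outer product.
    grad_theta = np.outer(grad_logits, state)
    # Move along the gradient, scaled by the reward: a negative reward
    # pushes probability away from the action that was taken.
    return theta + lr * reward * grad_theta

# Assumed sizes for illustration: 4 discrete actions, 8 state features.
theta = np.zeros((4, 8))
state = np.full(8, 0.5)   # made-up state, not real simulator output
bad_action = 2            # the action taken just before a crash

before = action_probs(theta, state)[bad_action]
theta = policy_gradient_step(theta, state, bad_action, reward=-1.0)
after = action_probs(theta, state)[bad_action]
print(before, after)
```

After the step, the probability of the penalized action in the same state is lower than before, which is the "avoid that action in similar situations" behavior the text describes.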

This is only the beginning of an interesting journey. In the future we will publish a series of articles introducing reinforcement learning from scratch.

Originally published on the WeChat official account CreateAMind (createamind)




