
(Keras) Playing TORCS (The Open Racing Car Simulator) with DDPG in 300 Lines of Python Code - Tutorial and Code

Author: CreateAMind · Published 2018-07-25 (original post 2016-11-26)

Video: http://weibo.com/3164120327/EcF8g6jdw

The article walks through the code in detail and introduces the underlying theory; it is an excellent tutorial and project exercise.

Overview

This is the second blog post on reinforcement learning. In this project we demonstrate how to use the Deep Deterministic Policy Gradient (DDPG) algorithm with Keras to play TORCS (The Open Racing Car Simulator), a very interesting AI racing game and research platform.

Installation Dependencies:

  • Python 2.7
  • Keras 1.1.0
  • Tensorflow r0.10
  • gym_torcs

How to Run?

    git clone https://github.com/yanpanlau/DDPG-Keras-Torcs.git
    cd DDPG-Keras-Torcs
    cp *.* ~/gym_torcs
    cd ~/gym_torcs
    python ddpg.py

(Change the flag train_indicator=1 in ddpg.py if you want to train the network)
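Concretely, the flag is just an argument of the main routine in ddpg.py. A minimal sketch of how it is typically wired up (the exact function name and default value are assumptions, check the script itself):

    # Sketch only: function name and default are assumptions about ddpg.py
    def playGame(train_indicator=0):    # 1 = train the networks, 0 = just run the policy
        ...

    if __name__ == "__main__":
        playGame(train_indicator=1)     # flip to 1 (or edit the default) to train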

Why TORCS

I think it is important to study TORCS because:

  • It looks cool; it's really fun to watch an AI learn how to drive
  • You can visualize how the neural network learns over time and inspect its learning process, rather than just looking at the final result
  • It is easy to see when the neural network gets stuck in a local minimum
  • It helps us understand machine-learning techniques for automated driving, which are important for self-driving car technology

Background

In the previous blog post, Using Keras and Deep Q-Network to Play FlappyBird, we demonstrated how to use a Deep Q-Network to play FlappyBird. A big limitation of the Deep Q-Network, however, is that its outputs/actions are discrete, while actions like steering are continuous in car racing. An obvious approach to adapting DQN to continuous domains is simply to discretize the action space, but then we run into the "curse of dimensionality" problem. For example, if you discretize the steering wheel from -90 to +90 degrees in 5-degree steps and the speed from 0 km/h to 300 km/h in 5 km/h steps, you get 36 steering states times 60 velocity states, which equals 2,160 possible combinations. The situation gets worse when you want to build robots for highly specialized tasks, such as brain surgery, that require fine control of actions; naive discretization will not be able to achieve the required precision.

To tackle the continuous action space problem, Google DeepMind devised a new algorithm called Deep Deterministic Policy Gradient (DDPG), which combines three techniques: 1) deterministic policy gradient algorithms, 2) actor-critic methods, and 3) the Deep Q-Network.

The original paper on 1) is not easy for non-machine-learning experts to digest, so I will sketch the idea here. If you are already familiar with the algorithm, you can go directly to the Keras code section.

Actor-Critic Algorithm

The Actor-Critic algorithm is essentially a hybrid method that combines the policy gradient method and the value function method. The policy function is known as the actor, while the value function is referred to as the critic. Essentially, the actor produces the action a given the current state of the environment s, while the critic produces a signal that criticizes the actions made by the actor. It is quite natural in the human world: the junior employee (actor) does the actual work, the boss (critic) criticizes it, and hopefully the junior employee does better next time. In our TORCS example, we use continuous Q-learning (SARSA) as our critic model and the policy gradient method as our actor model. The relationship between the value function/policy function and the Actor-Critic algorithm is illustrated in a figure in the original post.
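To make the division of labour concrete, here is a toy sketch of the two interfaces (plain NumPy with random linear maps, not the Keras networks we build below; the 29-dimensional state and 3-dimensional action merely mirror the TORCS setup used later):

    import numpy as np

    state_dim, action_dim = 29, 3                    # sizes mirroring the TORCS example
    W_actor = np.random.randn(state_dim, action_dim) * 0.01
    W_critic = np.random.randn(state_dim + action_dim) * 0.01

    def actor(s):
        # the "junior employee": proposes an action for the current state
        return np.tanh(s.dot(W_actor))

    def critic(s, a):
        # the "boss": scores how good that action is in that state
        return np.concatenate([s, a]).dot(W_critic)

    s = np.random.randn(state_dim)
    a = actor(s)
    q = critic(s, a)
    print(q)                                         # a single scalar criticism signal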

Keras Code Explanation

Actor Network

Let's first talk about how to build the actor network in Keras. Here we use 2 hidden layers with 300 and 600 hidden units respectively. The output consists of 3 continuous actions: Steering, a single unit with a tanh activation function (-1 means maximum right turn, +1 means maximum left turn); Acceleration, a single unit with a sigmoid activation function (0 means no gas, 1 means full gas); and Brake, another single unit with a sigmoid activation function (0 means no brake, 1 means full brake).

    def create_actor_network(self, state_size,action_dim):
        print("Now we build the model")
        S = Input(shape=[state_size])  
        h0 = Dense(HIDDEN1_UNITS, activation='relu')(S)
        h1 = Dense(HIDDEN2_UNITS, activation='relu')(h0)
        Steering = Dense(1,activation='tanh',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1)   
        Acceleration = Dense(1,activation='sigmoid',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1)   
        Brake = Dense(1,activation='sigmoid',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1)   
        V = merge([Steering,Acceleration,Brake],mode='concat')          
        model = Model(input=S,output=V)
        print("We finished building the model")
        return model, model.trainable_weights, S

We use the Keras merge function to combine the 3 outputs. Smart readers may ask why not use a single Dense layer like this:

V = Dense(3,activation='tanh')(h1)   

There is a reason for that. Using 3 separate Dense() layers allows each continuous action to have its own activation function; for example, using tanh() for acceleration doesn't make sense, since tanh outputs values in the range [-1, 1] while the acceleration lies in the range [0, 1].

Please also note that in the final layer we use a normal initialization with μ = 0 and σ = 1e-4 to ensure that the initial outputs of the policy are near zero.
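(If you are running Keras 2 rather than 1.1.0, the lambda-style init and merge used above no longer exist. A rough, untested equivalent of the same actor head, not the author's code, could look like this:)

    from keras.layers import Input, Dense, Concatenate
    from keras.models import Model
    from keras.initializers import RandomNormal

    def create_actor_network_keras2(state_size, hidden1=300, hidden2=600):
        # Keras 2 sketch of the same architecture: three output heads with
        # near-zero normal initialisation, concatenated into one action vector
        init = RandomNormal(mean=0.0, stddev=1e-4)
        S = Input(shape=[state_size])
        h0 = Dense(hidden1, activation='relu')(S)
        h1 = Dense(hidden2, activation='relu')(h0)
        steering = Dense(1, activation='tanh', kernel_initializer=init)(h1)
        acceleration = Dense(1, activation='sigmoid', kernel_initializer=init)(h1)
        brake = Dense(1, activation='sigmoid', kernel_initializer=init)(h1)
        V = Concatenate()([steering, acceleration, brake])
        return Model(inputs=S, outputs=V)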

Critic Network

The construction of the critic network is very similar to the Deep Q-Network in the previous post, except that here we use 2 hidden layers with 300 and 600 hidden units. The critic network also takes both the state and the action as inputs. According to the DDPG paper, the actions are not included until the 2nd hidden layer of the Q-network. Here we again use the Keras merge function to combine the action and the hidden layer.

    def create_critic_network(self, state_size,action_dim):
        print("Now we build the model")
        S = Input(shape=[state_size])
        A = Input(shape=[action_dim],name='action2')    
        w1 = Dense(HIDDEN1_UNITS, activation='relu')(S)
        a1 = Dense(HIDDEN2_UNITS, activation='linear')(A)
        h1 = Dense(HIDDEN2_UNITS, activation='linear')(w1)
        h2 = merge([h1,a1],mode='sum')    
        h3 = Dense(HIDDEN2_UNITS, activation='relu')(h2)
        V = Dense(action_dim,activation='linear')(h3)  
        model = Model(input=[S,A],output=V)
        adam = Adam(lr=self.LEARNING_RATE)
        model.compile(loss='mse', optimizer=adam)
        print("We finished building the model")
        return model, A, S 

Target Network

It is a well-known fact that directly implementing Q-learning with neural networks is unstable in many environments, including TORCS. The DeepMind team's solution to this problem is to use target networks: we create a copy of the actor network and of the critic network and use them to calculate the target values. The weights of these target networks are then updated by having them slowly track the learned networks:

θ′ ← τθ + (1 − τ)θ′, where τ ≪ 1. This means that the target values are constrained to change slowly, greatly improving the stability of learning.

It is extremely easy to implement target networks in Keras:

    def target_train(self):
        actor_weights = self.model.get_weights()
        actor_target_weights = self.target_model.get_weights()
        for i in xrange(len(actor_weights)):
            actor_target_weights[i] = self.TAU * actor_weights[i] + (1 - self.TAU)* actor_target_weights[i]
        self.target_model.set_weights(actor_target_weights)

Main Code

After we have finished the network setup, let's go through ddpg.py, our main code.

The code simply does the following:

  1. The code receives the sensor input in the form of an array.
  2. The sensor input is fed into our neural network, and the network outputs 3 real numbers (the values of the steering, acceleration and brake).
  3. The network is trained many times, via the Deep Deterministic Policy Gradient, to maximize the future expected reward.

Sensor Input

In TORCS there are 18 different types of sensor input; the details can be found in the Simulated Car Racing Championship: Competition Software Manual. So which sensor inputs should we use? After some trial and error, I found the following inputs useful:

Please note that we normalize some of these values before feeding them into the neural network, and that some sensor inputs are not exposed in gym_torcs. Advanced users need to amend gym_torcs.py to change the parameters. [Check out the function make_observaton()]
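For reference, the state vector is assembled from those readings roughly as follows (a sketch of what ddpg.py does; the namedtuple stand-in for a gym_torcs observation and the wheelSpinVel scaling are illustrative assumptions):

    import numpy as np
    from collections import namedtuple

    # Hypothetical stand-in for a gym_torcs observation, only to make the sketch runnable
    Obs = namedtuple("Obs", "angle track trackPos speedX speedY speedZ wheelSpinVel rpm")
    ob = Obs(angle=0.0, track=np.zeros(19), trackPos=0.0,
             speedX=0.5, speedY=0.0, speedZ=0.0,
             wheelSpinVel=np.zeros(4), rpm=0.3)

    # Stack the readings into one flat state vector for the actor network
    s_t = np.hstack((ob.angle, ob.track, ob.trackPos,
                     ob.speedX, ob.speedY, ob.speedZ,
                     ob.wheelSpinVel / 100.0, ob.rpm))
    print(s_t.shape)    # (29,) with this particular choice of sensors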

Policy Selection

Now we can feed the inputs above into the neural network. The code is actually very simple:

    for j in range(max_steps):
        a_t = actor.model.predict(s_t.reshape(1, s_t.shape[0]))
        ob, r_t, done, info = env.step(a_t[0])

However, we immediately run into two issues. First, how do we decide the reward? Second, how do we explore in the continuous action space?

Design of the rewards

In the original paper, the reward function is the velocity of the car projected along the track direction, V_x cos(θ).

However, I found that the training is not very stable as reported in the original paper.

On both low-dimensional and from pixels, some replicas were able to learn reasonable policies that are able to complete a circuit around the track though other replicas failed to learn a sensible policy

I believe the reason is that with the original reward the AI presses the gas pedal very hard (to get maximum reward), hits the edge, and the episode terminates very quickly; the neural network therefore gets stuck in a very poor local minimum. The newly proposed reward function is described below:

In plain English, we want to maximize the longitudinal velocity (first term), minimize the transverse velocity (second term), and we also penalize the AI if it drives far off the center of the track (third term).
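The exact expression lives in the author's modified gym_torcs.py; a sketch of a reward with this shape (the precise coefficients and any clipping are assumptions) is:

    import numpy as np

    def reward(speed_x, angle, track_pos):
        # longitudinal speed along the track, minus transverse speed,
        # minus a penalty for driving far from the track centre
        return (speed_x * np.cos(angle)
                - np.abs(speed_x * np.sin(angle))
                - speed_x * np.abs(track_pos))

    print(reward(speed_x=1.0, angle=0.1, track_pos=0.05))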

I found the new reward function greatly improves the stability and the learning time of TORCS.

Design of the exploration algorithm

Another issue is how to design a good exploration algorithm in the continuous domain. In the previous blog post we used an ε-greedy policy, where the agent tries a random action some percentage of the time. However, this approach does not work very well in TORCS, because we have 3 actions [steering, acceleration, brake]. If we just pick actions from a uniform random distribution, we generate some boring combinations (e.g., the value of the brake is greater than the value of the acceleration and the car simply does not move). Therefore, we add noise using an Ornstein-Uhlenbeck process to do the exploration.

Ornstein-Uhlenbeck process

What is the Ornstein-Uhlenbeck process? In simple English, it is a stochastic process with mean-reverting properties.

Basically, the most important parameter is the μ of the acceleration noise: you want the car to have some initial velocity and not get stuck in a local minimum where it keeps pressing the brake and never hits the gas pedal. Readers should feel free to change the parameters and see how the AI performs with various combinations. The code for the Ornstein-Uhlenbeck process is saved in OU.py.
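A minimal sketch of the noise step and of how it might be applied to the three actions (the θ, μ, σ values here are illustrative; the ones actually used live in ddpg.py):

    import numpy as np

    def ou_noise(x, mu, theta, sigma):
        # One Ornstein-Uhlenbeck step: pull x back towards the mean mu
        # (mean reversion) while adding Gaussian jitter
        return theta * (mu - x) + sigma * np.random.randn(1)

    # A positive mu for the accelerator biases exploration towards pressing
    # the gas, so the car does not sit still with the brake on
    a_t = np.array([0.0, 0.3, 0.0])                  # [steering, acceleration, brake]
    noise = np.hstack([ou_noise(a_t[0], mu=0.0,  theta=0.6, sigma=0.3),
                       ou_noise(a_t[1], mu=0.5,  theta=1.0, sigma=0.1),
                       ou_noise(a_t[2], mu=-0.1, theta=1.0, sigma=0.05)])
    a_t = a_t + noise
    print(a_t)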

My finding is that, with a sensible exploration policy and the revised reward function, the AI can learn a reasonable policy on the simple track within roughly 200 episodes.

Experience Replay

Similar to the FlappyBird case, we also use Experience Replay to save the transitions (s, a, r, s′) in a replay memory. When training the network, random mini-batches from the replay memory are used instead of only the most recent transition, which greatly improves stability. The following code snippet shows how it is done.

        buff.add(s_t, a_t[0], r_t, s_t1, done)
        # sample a random minibatch of N transitions (si, ai, ri, si+1) from replay buffer
        batch = buff.getBatch(BATCH_SIZE)
        states = np.asarray([e[0] for e in batch])
        actions = np.asarray([e[1] for e in batch])
        rewards = np.asarray([e[2] for e in batch])
        new_states = np.asarray([e[3] for e in batch])
        dones = np.asarray([e[4] for e in batch])
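        # y_t is initialised from the action column only to get an array of the
        # right shape; every entry is overwritten with the Bellman target below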
        y_t = np.asarray([e[1] for e in batch])

        target_q_values = critic.target_model.predict([new_states, actor.target_model.predict(new_states)])    #Still using tf
       
        for k in range(len(batch)):
            if dones[k]:
                y_t[k] = rewards[k]
            else:
                y_t[k] = rewards[k] + GAMMA*target_q_values[k]

Please note that when we calculate target_q_values we use the output of the target network instead of the model itself. Using the slowly varying target network reduces the oscillations of the Q-value estimate, which greatly improves the stability of learning.

Training

The actual training of the neural network is very simple; it contains only 6 lines of code:

        loss += critic.model.train_on_batch([states,actions], y_t) 
        a_for_grad = actor.model.predict(states)
        grads = critic.gradients(states, a_for_grad)
        actor.train(states, grads)
        actor.target_train()
        critic.target_train()

In plain English, we first update the critic by minimizing the loss between the targets y_t and the predicted Q-values, then update the actor along the critic's gradient with respect to the actions, and finally apply the soft updates to both target networks.
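For reference, these six lines implement the two standard DDPG updates from the original paper (restated here for completeness): the critic is trained on the mean-squared Bellman error

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$$

and the actor is updated with the sampled policy gradient

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\,a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i}$$

followed by the soft updates of the two target networks.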
