Code + Video - Truck Self-Driving Simulation with Reinforcement Learning - Play the Game, Enjoy the Scenery


https://github.com/aleju/self-driving-truck

If you are interested, you can come by the office on the weekend to play this game!

[Embedded video demo]

About

This repository contains code to train and run a self-driving truck in Euro Truck Simulator 2. The resulting AI will automatically steer, accelerate and brake. It is trained (mostly) via reinforcement learning and only has access to the buttons W, A, S and D (i.e. it can not directly set the steering wheel angle).

Example video:

Architecture and method

The basic training method follows the standard reinforcement learning approach from the original Atari paper. Additionally, a separation of Q-values in V (value) and A (advantage) - as described in Dueling Network Architectures for Deep Reinforcement Learning - is used. Further, the model tries to predict future states and rewards, similar to the description in Deep Successor Reinforcement Learning. (While that paper uses only predictions for the next timestep, here predictions for the next T timesteps are generated via an LSTM.) To make training faster, a semi-supervised pretraining is applied to the first stage of the whole model (similar to Loss is its own Reward: Self-Supervision for Reinforcement Learning, though here only applied once at the start). That training uses some manually created annotations (e.g. positions of cars and lanes in example images) as well as some automatically generated ones (e.g. canny edges, optical flow).
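As a rough illustration of the dueling decomposition mentioned above, a Q-value head that separates V (value) and A (advantage) might look like the following PyTorch sketch. This is not the repository's actual code; the module, parameter names and sizes (DuelingHead, embedding_size, num_actions) are made up for illustration.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Illustrative dueling head: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).

    Hypothetical sketch; the repository's actual modules differ.
    """
    def __init__(self, embedding_size=512, num_actions=9):
        super().__init__()
        self.value = nn.Linear(embedding_size, 1)                 # V(s)
        self.advantage = nn.Linear(embedding_size, num_actions)   # A(s, a)

    def forward(self, embedding):
        v = self.value(embedding)      # shape (B, 1)
        a = self.advantage(embedding)  # shape (B, num_actions)
        # Subtract the mean advantage so that V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```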

Architecture visualization:

[Gameplay screenshots: playing the game and enjoying the scenery]

The README continues:

There are five components:

  • Embedder 1: A CNN that is pretrained in semi-supervised fashion. The two gradient inputs (see image) are just gradients from 1 to 0 which are supposed to give positional information. (E.g. the mirrors are always at roughly the same positions, so it is logical to detect them partially by their position.) Instance Normalization was used, because Batch Normalization regularly broke, resulting in zero-only vectors during test/eval (seemed like a bug in the framework, would usually go away when using batch sizes >= 2 or staying in training mode).
  • Embedder 2: Takes the results of Embedder 1 and converts them into a vector. Additional inputs are added here. (These are: (1) previous actions, (2) whether the gear is in reverse mode, (3) steering wheel position, (4) previous and current speeds. The current gear state and the speed are read out from the route advisor. The steering wheel position is approximated using a separate CNN.) Keeping this component separate from Embedder 1 makes it possible, in theory, to keep the pretrained weights fixed.
  • Direct Reward: A model that predicts the direct reward, i.e. for (s, a, r), (s', a', r') it predicts r when being in s'. The reward is bound to the range -100 to +100. It predicts the reward value using a softmax over 100 bins.
  • Indirect Reward: A model that predicts future rewards, i.e. for (s, a, r), (s', a', r'), ... it predicts r + gamma*r' + gamma^2*r'' when being in state s. Gamma is set to 0.95. This model uses standard regression. It predicts one value per action, i.e. Q(s, a).
  • Successors: An RNN model that predicts future embeddings (when specific actions are chosen). These future embeddings can then be used to predict future direct and indirect rewards (using the two previous models). This module predicts an additive change to the previously generated embedding (i.e. a residual architecture), so the LSTMs only have to predict the changes (of the embeddings) that were caused by the actions; a sketch of this appears below.

Aside from these, there is also an autoencoder component applied to the embeddings of Embedder 2. However, that component is only trained for some batches, so it is skipped here.
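To make the residual successor prediction more concrete, here is a minimal sketch of how an LSTM could predict future embeddings as additive changes, one timestep at a time. It assumes PyTorch; the class name, attribute names and sizes are hypothetical, and the real model in the repository differs in detail.

```python
import torch
import torch.nn as nn

class SuccessorPredictor(nn.Module):
    """Illustrative residual successor prediction: the LSTM only predicts the
    *change* of the embedding caused by the chosen action.
    Hypothetical sketch, not taken from the repository.
    """
    def __init__(self, embedding_size=512, num_actions=9, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size + num_actions, hidden_size,
                            batch_first=True)
        self.to_delta = nn.Linear(hidden_size, embedding_size)

    def forward(self, embedding, action_plan):
        # embedding:   (B, 512)   current state embedding from Embedder 2
        # action_plan: (B, T, 9)  one-hot actions for the next T timesteps
        B, T, _ = action_plan.shape
        successors = []
        current = embedding
        hidden = None
        for t in range(T):
            step_in = torch.cat([current, action_plan[:, t]], dim=1).unsqueeze(1)
            out, hidden = self.lstm(step_in, hidden)
            # Residual update: next embedding = current embedding + predicted change.
            current = current + self.to_delta(out.squeeze(1))
            successors.append(current)
        return torch.stack(successors, dim=1)  # (B, T, 512)
```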

During application, each game state (i.e. frame/screenshot at 10fps) is embedded via convolutions and fully connected layers to a vector. From that vector, future embeddings (the successors) are predicted. Each such prediction (for each timestep) is dependent on a chosen action (e.g. pressing W+A followed by two times W converts game state vector X into Y). For 9 possible actions (W, W+A, W+D, S, S+A, S+D, A, D, none) and 10 timesteps (i.e. looking 1 second into the future), this leads to roughly 3.5 billion possible chains of actions. This number is decreased to roughly 400 sensible plans (e.g. 10x W, 10x W+A, 3x W+A followed by 7x W, ...). For each such plan, the successors are generated and rewards are predicted (which can be done reasonably fast as the embedding's vector size is only 512). The plans are ordered/ranked by the V-values of their last timesteps and the plan with the highest V-value is chosen. (The predicted direct rewards of successors are currently ignored in the ranking, which seemed to improve the driving a bit.)
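As an illustration of this planning step, ranking candidate plans by the V-value of their last predicted timestep could look roughly like the sketch below. The helper functions predict_successors and predict_v_value are hypothetical stand-ins for the models described above, not functions from the repository.

```python
# Illustrative planning loop; `predict_successors` and `predict_v_value`
# are hypothetical stand-ins for the successor and indirect-reward models.
def choose_plan(current_embedding, plans, predict_successors, predict_v_value):
    """Pick the action plan whose last predicted successor has the highest V-value.

    plans: list of action sequences, e.g. [["W"] * 10, ["W+A"] * 10, ...]
    """
    best_plan, best_v = None, float("-inf")
    for plan in plans:
        successors = predict_successors(current_embedding, plan)  # one embedding per timestep
        v_last = predict_v_value(successors[-1])                  # V-value of the final timestep
        if v_last > best_v:
            best_plan, best_v = plan, v_last
    return best_plan
```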

Reward Function

For a chain (s, a, r, s'), the reward r is mainly dependent on the measured speed at s'. The formula is r = sp*rev + o + d, where

  • sp is the current speed (clipped to the range 0 to 100),
  • rev is 0.25 if the truck is in reverse gear ("drive backwards mode") and otherwise 1,
  • o is -10 if an offence is shown (e.g. "ran over a red light", "crashed into a vehicle"), otherwise 0,
  • d is -50 if the truck has taken damage (shown by a message similar to offences, which stays for around 2 seconds), otherwise 0.

The speed is read out from the game screen (it is shown in the route advisor). Similarly, offences and damages can be recognized using simple pixel comparisons or instance matching in the area of the route advisor (both events lead to shown messages).
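Expressed as code, the reward formula amounts to something like the following sketch. The inputs are assumed to already have been extracted from the route advisor area of the screen; the function name and signature are made up for illustration.

```python
def compute_reward(speed, in_reverse, offence_shown, damage_shown):
    """Sketch of r = sp*rev + o + d as described above."""
    sp = max(0.0, min(100.0, speed))     # clip speed to [0, 100]
    rev = 0.25 if in_reverse else 1.0    # driving backwards is rewarded less
    o = -10.0 if offence_shown else 0.0  # offence message (red light, crash, ...)
    d = -50.0 if damage_shown else 0.0   # damage message
    return sp * rev + o + d
```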

Difficulties

ETS2 is a harder game to play (for an AI) than it may seem at first glance. Some of the difficulties are:

  • The truck's acceleration is underwhelming. The AI has to keep buttons pressed for a long time to slowly accumulate speed, especially on hills. This also makes crashes costly, as getting out of that situation requires accumulating speed all over again.
  • The same is true for braking.
  • The truck has separate forward and reverse gears. You can only drive backwards in reverse mode (and only forwards in forward gear). To switch gears, the speed has to be flat zero with no buttons pressed for a short amount of time; then pressing forward/backward activates forward/backward mode (in the simplest settings). The flat-zero speed in particular is hard for the AI to achieve (see also the slow acceleration/braking above).
  • The truck needs a very large area to turn around.
  • When driving backwards, the front of the truck can collide with the container at the back. Then it can't drive further backwards, limiting the maximum distance that can be covered that way before having to switch again to forward+left/right (depending on the situation).
  • When pressing left or right, the truck does not immediately drive towards the left or right. Instead, these keys merely change the position of the steering wheel. It is common to steer (via left/right button) into a direction for several timesteps (e.g. one second), while the truck keeps driving into the opposite direction the whole time.
  • The steering wheel can turn by more than 360 degrees, so when it looks like you are steering towards the right you might actually be steering towards the left. (Here, the steering wheel is tracked at all times via a CNN that runs separately from the AI's CNN - though that model sometimes produces errors.)
  • The truck can get randomly stuck on objects that aren't even visible from the driver's seat. This may include objects blocking some wheel at the far back of the container. Even as a human driver, that can be annoying and hard to fix.
  • Driving off the road into the greenery can get you stuck at invisible walls (or on tiny hills, see the previous point), which is hard for the AI to understand. (Especially as it has almost no memory - so it might not even recognize in any way that it is stuck.)
  • Sidenote: Most of the game is spent pressing only W (forward) on the highway, which makes it hard to train a model in supervised fashion from recorded gameplay (W + no steering basically always has the maximum probability).

And of course on top of these things, the standard problems with window handling occur (e.g. where is the window; what pixel content does it show; is the game paused; at what speed is the truck driving; how to send keypresses to the window etc.). Also, the whole architecture has to run in (almost-)realtime (theoretically here max 100ms per state, but better <=50ms so that actions have a decent effect before observing the next screen).
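For a feel of the realtime constraint, a fixed-rate 10 fps control loop with a per-frame time budget might look like the sketch below. The function names are placeholders, not the repository's API.

```python
import time

FRAME_BUDGET = 0.1  # 100 ms per game state at 10 fps

def control_loop(grab_screenshot, decide_action, send_keys):
    # Illustrative loop; grab_screenshot / decide_action / send_keys are hypothetical.
    while True:
        start = time.time()
        state = grab_screenshot()
        action = decide_action(state)   # embed, predict successors, rank plans
        send_keys(action)
        elapsed = time.time() - start
        # Ideally finish well below the budget so the keypress takes effect
        # before the next screenshot is observed.
        time.sleep(max(0.0, FRAME_BUDGET - elapsed))
```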

For more details, see the project's README directly:

https://github.com/aleju/self-driving-truck
