# Joint paper from DeepMind, Cambridge, and the Max Planck Institute: Continuous Deep Q-Learning with Model-based Acceleration

1. Introduction

2. Related Work

3. Background

Q^π(x_t, u_t) = E_{r_{i≥t}, x_{i>t} ∼ E, u_{i>t} ∼ π}[R_t | x_t, u_t]    (1)

Q-learning learns a greedy deterministic policy µ(x_t) = argmax_u Q(x_t, u_t), which corresponds to π(u_t|x_t) = δ(u_t = µ(x_t)). Let θ^Q parameterize the action-value function, and let β be an arbitrary exploration policy; learning minimizes the Bellman error, where we fix the target y_t:

L(θ^Q) = E[(Q(x_t, u_t|θ^Q) − y_t)^2],  y_t = r(x_t, u_t) + γ Q(x_{t+1}, µ(x_{t+1}))    (2)

V^π(x_t) = E_{r_{i≥t}, x_{i>t} ∼ E, u_{i≥t} ∼ π}[R_t | x_t]

A^π(x_t, u_t) = Q^π(x_t, u_t) − V^π(x_t).    (3)

π^iLQG_t(u_t|x_t) = N(û_t + k_t + K_t(x_t − x̂_t), −c Q^{−1}_{uu,t})    (4)
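Equation (4) describes a time-varying linear-Gaussian policy: the mean is the nominal action plus a feedback correction, and the covariance is the (scaled, negated) inverse of the action Hessian of the Q-function. A minimal sketch of sampling from such a policy is below; all the concrete values (û, x̂, k, K, Q_uu, c) are made-up stand-ins, only the sampling rule follows the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

u_hat = np.array([0.5, -0.2])          # nominal action u_hat_t
x_hat = np.array([1.0, 0.0, 0.3])      # nominal state x_hat_t
k = np.array([0.1, 0.05])              # open-loop correction k_t
K = np.array([[0.2, 0.0, -0.1],        # feedback gains K_t
              [0.0, 0.3,  0.1]])
Q_uu = -np.array([[2.0, 0.1],          # action Hessian of Q; negative
                  [0.1, 1.5]])         # definite near a maximum
c = 1.0                                # scaling constant

def ilqg_action(x):
    """Sample u_t from the linear-Gaussian iLQG policy of Eq. (4)."""
    mean = u_hat + k + K @ (x - x_hat)
    cov = -c * np.linalg.inv(Q_uu)     # -c * Q_uu^{-1} is positive definite
    return rng.multivariate_normal(mean, cov)

x = np.array([1.1, -0.1, 0.25])
u = ilqg_action(x)
```

Note that the covariance is only valid because Q_uu is negative definite, which holds locally around an action that maximizes Q.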

4. Continuous Q-Learning with Normalized Advantage Functions

In NAF, the Q-function is decomposed as Q(x, u|θ^Q) = A(x, u|θ^A) + V(x|θ^V), with A(x, u|θ^A) = −(1/2)(u − µ(x|θ^µ))^T P(x|θ^P)(u − µ(x|θ^µ)). Here P(x|θ^P) is a state-dependent positive-definite square matrix, parameterized via P(x|θ^P) = L(x|θ^P) L(x|θ^P)^T, where L(x|θ^P) is a lower-triangular matrix whose entries come from a linear output layer of the neural network, with the diagonal terms exponentiated. While this representation is more restrictive than a general neural network function, since the Q-function is quadratic in u, the action that maximizes the Q-function is always given by µ(x|θ^µ). We use this with a deep Q-learning algorithm similar to Mnih et al. (2015), with target networks and replay buffers (Lillicrap et al., 2016). NAF, given in Algorithm 1, is considerably simpler than DDPG.
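The parameterization above can be sketched in a few lines of numpy: a linear layer emits the d(d+1)/2 entries of a lower-triangular L(x), the diagonal is exponentiated, and P = L L^T is then positive definite, so Q(x, u) = V(x) − (1/2)(u − µ)^T P (u − µ) is maximized exactly at u = µ(x). The network outputs, µ, and V below are made-up numbers standing in for real network evaluations.

```python
import numpy as np

d = 3                                              # action dimension (assumed)
raw = np.array([0.2, -0.5, 0.1, 0.3, -0.2, 0.4])   # d*(d+1)/2 linear outputs

# Fill the lower triangle row by row, then exponentiate the diagonal
L = np.zeros((d, d))
L[np.tril_indices(d)] = raw
L[np.diag_indices(d)] = np.exp(np.diag(L))

P = L @ L.T                                        # positive definite by construction

mu = np.array([0.1, -0.3, 0.2])                    # stand-in for mu(x|theta^mu)
V = 1.7                                            # stand-in for V(x|theta^V)

def q_value(u):
    """NAF Q-value: quadratic in u, peaked at u = mu."""
    diff = u - mu
    return V - 0.5 * diff @ P @ diff
```

Because the diagonal of L is strictly positive, P has no zero eigenvalues, and any action other than µ strictly decreases Q; this is what makes the argmax closed-form.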

```
Algorithm 1 Continuous Q-learning with NAF

Randomly initialize normalized Q network Q(x, u|θ^Q).
Initialize target network Q' with weights θ^{Q'} ← θ^Q.
Initialize replay buffer R ← ∅.
for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state x_1 ∼ p(x_1)
    for t = 1, T do
        Select action u_t = µ(x_t|θ^µ) + N_t
        Execute u_t and observe r_t and x_{t+1}
        Store transition (x_t, u_t, r_t, x_{t+1}) in R
        for iteration = 1, I do
            Sample a random minibatch of m transitions from R
            Set y_i = r_i + γ V'(x_{i+1}|θ^{Q'})
            Update θ^Q by minimizing the loss: L = (1/N) Σ_i (y_i − Q(x_i, u_i|θ^Q))^2
            Update the target network: θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
        end for
    end for
end for
```
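The two inner update lines of Algorithm 1 (the target computation and the soft target-network update) can be illustrated with toy linear "networks"; everything here (θ, the minibatch, the value function) is a made-up stand-in, and only the update rules follow the pseudocode.

```python
import numpy as np

gamma, tau = 0.99, 0.005           # discount and soft-update rate (typical values)

def v_target(x_next, target_theta):
    """Stand-in for V'(x_{i+1} | theta^{Q'}); a real network replaces this."""
    return x_next @ target_theta

theta = np.array([0.5, -0.1])      # theta^Q (toy)
target_theta = theta.copy()        # theta^{Q'} initialized to theta^Q

x_next = np.array([[1.0, 0.2],     # toy minibatch of next states, m = 2
                   [0.3, -0.4]])
r = np.array([1.0, 0.0])           # rewards r_i

# Set y_i = r_i + gamma * V'(x_{i+1} | theta^{Q'})
y = r + gamma * v_target(x_next, target_theta)

# (theta^Q would now be updated by minimizing (1/N) * sum_i (y_i - Q_i)^2)

# Soft target update: theta^{Q'} <- tau * theta^Q + (1 - tau) * theta^{Q'}
target_theta = tau * theta + (1 - tau) * target_theta
```

The soft update with small τ keeps the target values slowly moving, which is what stabilizes the regression onto y_i.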

5. Accelerating Learning with Imagination Rollouts

5.1. Model-guided Exploration

5.2. Imagination Rollouts

```
Algorithm 2 Imagination Rollouts

Randomly initialize normalized Q network Q(x, u|θ^Q).
Initialize target network Q' with weights θ^{Q'} ← θ^Q.
Initialize replay buffer R ← ∅ and fictional buffer R_f ← ∅.
Initialize additional buffers B ← ∅, B_old ← ∅ with size nT.
Initialize fitted dynamics model M ← ∅.
for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state x_1
    Select µ'(x, t) from {µ(x|θ^µ), π^iLQG_t(u_t|x_t)} with probabilities {p, 1 − p}
    for t = 1, T do
        Select action u_t = µ'(x_t, t) + N_t
        Execute u_t and observe r_t and x_{t+1}
        Store transition (x_t, u_t, r_t, x_{t+1}, t) in R and B
        if mod(episode · T + t, m) = 0 and M ≠ ∅ then
            Sample m transitions (x_i, u_i, r_i, x_{i+1}, i) from B_old
            Use M to simulate l steps from each sample
            Store all fictional transitions in R_f
        end if
        Sample a random minibatch of m transitions I · l times from R_f and
        I times from R, and update θ^Q, θ^{Q'} as in Algorithm 1 per minibatch.
    end for
    if B is full then
        M ← FitLocalLinearDynamics(B) (see Section 5.3)
        π^iLQG ← OneStep(B, M) (see appendix)
        B_old ← B, B ← ∅
    end if
end for
```
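The imagination-rollout step above takes real states sampled from B_old and rolls a fitted dynamics model forward l steps under the current policy, storing the synthetic transitions in R_f. A minimal sketch, assuming a local linear model x' = A x + B u + c and a made-up stand-in policy (the matrices, policy, and start states are all illustrative):

```python
import numpy as np

# Fitted local linear dynamics (assumed values for illustration)
A = np.array([[0.9, 0.1],
              [0.0, 0.95]])
B = np.array([[0.0],
              [0.1]])
c = np.array([0.01, 0.0])

def mu(x):
    """Stand-in for the current policy mu(x|theta^mu)."""
    return np.array([-0.5 * x[1]])

def imagination_rollout(x0, l):
    """Simulate l model steps from x0; return the fictional transitions."""
    fictional, x = [], x0
    for _ in range(l):
        u = mu(x)
        x_next = A @ x + B @ u + c     # deterministic mean of the model
        fictional.append((x, u, x_next))
        x = x_next
    return fictional

R_f = []                               # fictional buffer
for x0 in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:  # samples from B_old
    R_f.extend(imagination_rollout(x0, l=5))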

5.3. Fitting the Dynamics Model

6. Experiments

6.1. Normalized Advantage Functions

6.2. Evaluating Best-Case Model-Based Improvement with True Models

6.3. Guided Imagination Rollouts with Fitted Dynamics

7. Discussion
