Planning and Learning

Models and Planning

"Model" here means an environment model. Models fall into two broad categories:

1. Distribution models: take the current state and the chosen action as input, and output the full distribution over all possible next states and rewards
2. Sample models: take the current state and the chosen action as input, and output a single sampled next state and reward
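The distinction can be sketched in code. This is a minimal illustration on a hypothetical toy task (the function names and the 0.8/0.2 transition are invented for the example, not taken from the text):

```python
import random

def distribution_model(state, action):
    """Category 1: return the full distribution over outcomes
    as (probability, next_state, reward) triples."""
    return [(0.8, state + 1, 1.0), (0.2, state, 0.0)]

def sample_model(state, action):
    """Category 2: return a single sampled (next_state, reward)."""
    if random.random() < 0.8:
        return state + 1, 1.0
    return state, 0.0

outcomes = distribution_model(0, 0)       # whole distribution at once
next_state, reward = sample_model(0, 0)   # one draw from it
```

A distribution model supports exact expected updates (as in dynamic programming), while a sample model is enough for sample updates such as those used by Dyna-Q below.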

Planning refers to the purposeful use of such a model: a strategy for how to query the model (distinct from the agent's policy) so as to converge to the value function at the lowest cost.

Dyna-Q: Integrating Planning, Acting, and Learning

Dyna-Q interleaves two kinds of updates:

1. Planning updates, from simulated experience generated by an environment model that is itself learned from real experience
2. Direct RL updates, made directly from real experience

Tabular Dyna-Q

Initialize Q(s,a) and Model(s,a) for all s ∈ S and a ∈ A(s)
Do forever:
    (a) S ← current (nonterminal) state
    (b) A ← ε-greedy(S, Q)
    (c) Execute action A; observe resultant reward R and next state S'
    (d) Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) − Q(S,A)]
    (e) Model(S,A) ← R, S'    (assuming a deterministic environment)
    (f) Repeat n times:
        S ← random previously observed state
        A ← random action previously taken in S
        R, S' ← Model(S,A)
        Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) − Q(S,A)]
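The pseudocode above can be sketched in Python. This is a minimal illustration on a hypothetical deterministic chain task (states 0..4, action 1 moves right, reaching state 4 gives reward 1 and ends the episode); the environment and all hyperparameter values are invented for the example:

```python
import random
from collections import defaultdict

def step(s, a):
    """Toy deterministic chain: action 1 moves right, 0 moves left."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0)

def dyna_q(episodes=50, n=10, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s'), deterministic
    for _ in range(episodes):
        s = 0
        while s != 4:
            # (b) epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice([0, 1])
            else:
                a = max([0, 1], key=lambda x: Q[(s, x)])
            s2, r = step(s, a)                            # (c) act
            target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # (d) direct RL
            model[(s, a)] = (r, s2)                       # (e) model learning
            for _ in range(n):                            # (f) planning
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                t = pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)])
                Q[(ps, pa)] += alpha * (t - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

With n = 0 this reduces to plain Q-learning; larger n lets each real transition be amortized over many cheap simulated updates.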

Prioritized sweeping for a deterministic environment

Initialize Q(s,a), Model(s,a) for all s, a, and PQueue to empty
Do forever:
    (a) S ← current (nonterminal) state
    (b) A ← policy(S, Q)
    (c) Execute action A; observe resultant reward R and next state S'
    (d) Model(S,A) ← R, S'
    (e) P ← |R + γ max_a Q(S',a) − Q(S,A)|
    (f) if P > θ, then insert S, A into PQueue with priority P
    (g) Repeat n times, while PQueue is not empty:
        S, A ← first(PQueue)
        R, S' ← Model(S,A)
        Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) − Q(S,A)]
        Repeat, for all S̄, Ā predicted to lead to S:
            R̄ ← predicted reward for S̄, Ā, S
            P ← |R̄ + γ max_a Q(S,a) − Q(S̄, Ā)|
            if P > θ, then insert S̄, Ā into PQueue with priority P
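A sketch of prioritized sweeping, using the same hypothetical chain task as the Dyna-Q example (states 0..4, action 1 moves right, reaching state 4 gives reward 1); the priority queue is a standard binary heap with negated priorities, and all parameter values are illustrative:

```python
import heapq
import random
from collections import defaultdict

def step(s, a):
    """Toy deterministic chain: action 1 moves right, 0 moves left."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0)

def prioritized_sweeping(episodes=30, n=5, alpha=0.5, gamma=0.95,
                         theta=1e-4, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}                       # model[(s, a)] = (r, s')
    predecessors = defaultdict(set)  # s' -> {(s, a)} predicted to lead to s'
    pq = []                          # max-heap via negated priorities
    for _ in range(episodes):
        s = 0
        while s != 4:
            a = rng.choice([0, 1]) if rng.random() < eps else \
                max([0, 1], key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            model[(s, a)] = (r, s2)
            predecessors[s2].add((s, a))
            # (e)-(f): queue the pair if its backup error exceeds theta
            p = abs(r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            if p > theta:
                heapq.heappush(pq, (-p, s, a))
            # (g): planning sweeps, largest error first
            for _ in range(n):
                if not pq:
                    break
                _, ps, pa = heapq.heappop(pq)
                pr, ps2 = model[(ps, pa)]
                t = pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)])
                Q[(ps, pa)] += alpha * (t - Q[(ps, pa)])
                # queue predecessors whose error now exceeds theta
                for (bs, ba) in predecessors[ps]:
                    br, _ = model[(bs, ba)]
                    bp = abs(br + gamma * max(Q[(ps, 0)], Q[(ps, 1)])
                             - Q[(bs, ba)])
                    if bp > theta:
                        heapq.heappush(pq, (-bp, bs, ba))
            s = s2
    return Q

Q = prioritized_sweeping()
```

The key difference from Dyna-Q is that planning updates are not sampled uniformly: changes propagate backward from where the value estimates changed most, which typically converges with far fewer updates.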

Trajectory Sampling

Rollout Algorithms

As explained by Tesauro and Galperin (1997), who experimented with rollout algorithms for playing backgammon, the term rollout "comes from estimating the value of a backgammon position by playing out, i.e., rolling out," the position many times to the game's end with randomly generated sequences of dice rolls, where the moves of both players are made by some fixed policy.

Rollout algorithms are decision-time planning algorithms. They are based on Monte Carlo control (introduced briefly under Model-Free Control), applied to trajectories sampled from the current environment state. They estimate the action values of a given policy by averaging the returns of many simulated trajectories, each of which starts with one of the possible actions and then follows the policy. When the action-value estimates are considered accurate enough, the action with the highest estimated value (or one of a set of such actions) is executed, and the process is repeated from the resulting next state.
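The procedure can be sketched as follows, reusing the hypothetical chain task from the earlier examples and a uniformly random rollout policy (the function names and parameter values are invented for the example):

```python
import random

def step(s, a):
    """Toy deterministic chain: states 0..4, reaching 4 gives reward 1."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0), (s2 == 4)

def rollout_action(s, n_sims=200, gamma=0.95, horizon=30, seed=0):
    """Pick the action whose simulated trajectories average the
    highest return, following a random rollout policy afterwards."""
    rng = random.Random(seed)
    best_a, best_v = None, float("-inf")
    for a in (0, 1):
        total = 0.0
        for _ in range(n_sims):
            s2, r, done = step(s, a)          # trajectory starts with a
            ret, discount = r, gamma
            for _ in range(horizon):          # ...then follows the policy
                if done:
                    break
                s2, r, done = step(s2, rng.choice([0, 1]))
                ret += discount * r
                discount *= gamma
            total += ret
        v = total / n_sims                    # Monte Carlo estimate of q(s, a)
        if v > best_v:
            best_a, best_v = a, v
    return best_a

a = rollout_action(3)
```

Note that nothing is stored between decisions: the action values are estimated fresh at each state, which is what makes this a decision-time planning method rather than a learning method.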

MCTS is also one of the core techniques behind AlphaGo. It is not limited to board-game search: any problem that admits efficient multi-step simulation can use this method.

MCTS does not start each search from scratch at the current state; instead, it reuses the high-value paths starting from the current state that were found by previous searches as candidates.

MCTS can be divided into the four steps shown in the figure below: selection, expansion, simulation (rollout), and backup.

• After backup, the rollout portion of the trajectory does not update the value function; only the selection and expansion portions of the path are updated
• The four steps are executed in order, in a loop, until time runs out or no new states can be found
• Not every search starts from scratch at the current node; the tree structure previously built around the current node is reused
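The four steps above can be sketched compactly with UCT-style selection, again on the hypothetical chain task from the earlier examples (reaching state 4 gives reward 1; class names, the exploration constant, and iteration counts are all illustrative):

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.visits, self.value = 0, 0.0

def step(s, a):
    """Toy deterministic chain: states 0..4, state 4 is the goal."""
    return min(s + 1, 4) if a == 1 else max(s - 1, 0)

def ucb(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts(root_state, n_iters=300, horizon=20, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_iters):
        node, done = root, root_state == 4
        # 1. Selection: descend through fully expanded nodes by UCB
        while not done and len(node.children) == 2:
            a = max((0, 1), key=lambda x: ucb(node.children[x], node))
            node = node.children[a]
            done = node.state == 4
        # 2. Expansion: add one child for an untried action
        if not done:
            a = rng.choice([x for x in (0, 1) if x not in node.children])
            child = Node(step(node.state, a), node)
            node.children[a] = child
            node = child
        # 3. Simulation (rollout): random play from the new node's state;
        #    states visited here are NOT added to the tree
        s = node.state
        for _ in range(horizon):
            if s == 4:
                break
            s = step(s, rng.choice([0, 1]))
        ret = 1.0 if s == 4 else 0.0
        # 4. Backup: update only the selection/expansion path
        while node is not None:
            node.visits += 1
            node.value += ret
            node = node.parent
    return max((0, 1), key=lambda x: root.children[x].visits)

a = mcts(3)
```

The backup loop walks only from the expanded node to the root, matching the first bullet above: the rollout states never enter the tree and carry no stored values. To reuse the tree across decisions (the last bullet), one would keep the chosen child as the next root instead of discarding it.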
