# Reinforcement Learning Series, Part 2: Model-Based Reinforcement Learning

• 1. Policy Iteration
• 1.1 Policy Evaluation
• 1.2 Policy Improvement
• 2. Value Iteration
• 3. Summary (such a catchy title)
• Reinforcement Learning series articles

```python
class Policy_Value:
    def __init__(self, grid_mdp):
        # self.v[state] is the value of state `state`.
        self.v = [0.0 for _ in range(len(grid_mdp.states) + 1)]

        # self.pi[state] is the action the policy takes in state `state`.
        self.pi = dict()
        for state in grid_mdp.states:
            if state in grid_mdp.terminal_states: continue
            self.pi[state] = grid_mdp.actions[0]
```

### 1. Policy Iteration

#### 1.1 Policy Evaluation

Policy evaluation computes the value of the current policy by repeatedly applying the Bellman backup until the values stop changing:

\begin{eqnarray*}
v(s) = R_{s,\pi(s)} + \gamma \sum_{s' \in S} T_{s,\pi(s)}^{s'} v(s') \qquad (1)
\end{eqnarray*}

```python
# grid_mdp is the Markov decision process of the robot-finds-gold problem.
def policy_evaluate(self, grid_mdp):
    for i in range(1000):

        delta = 0.0
        for state in grid_mdp.states:
            if state in grid_mdp.terminal_states: continue
            # The action the current policy outputs.
            action = self.pi[state]
            # t: whether a terminal state was reached
            # s: the state reached by taking `action` in `state`
            # r: the reward
            t, s, r = grid_mdp.transform(state, action)
            # Update the state value.
            new_v         = r + grid_mdp.gamma * self.v[s]
            delta        += abs(self.v[state] - new_v)
            self.v[state] = new_v

        if delta < 1e-6:
            break
```

#### 1.2 Policy Improvement

Policy improvement makes the policy greedy with respect to the current value estimates:

\begin{eqnarray*}
\pi_{i+1}(s,a) =
\left\{\begin{matrix}
1  & a =    \operatorname{argmax}_{a} R_{s,a}+\gamma \sum_{s' \in S}T_{s,a}^{s'}v(s') \\
0  & a \neq \operatorname{argmax}_{a} R_{s,a}+\gamma \sum_{s' \in S}T_{s,a}^{s'}v(s')
\end{matrix}\right. \nonumber
\end{eqnarray*}

```python
def policy_improve(self, grid_mdp):

    for state in grid_mdp.states:
        if state in grid_mdp.terminal_states: continue

        a1      = grid_mdp.actions[0]
        t, s, r = grid_mdp.transform(state, a1)
        v1      = r + grid_mdp.gamma * self.v[s]

        # Find the action a1 with the highest value in the current state.
        for action in grid_mdp.actions:
            t, s, r = grid_mdp.transform(state, action)
            if v1 < r + grid_mdp.gamma * self.v[s]:
                a1 = action
                v1 = r + grid_mdp.gamma * self.v[s]

        # Update the policy.
        self.pi[state] = a1
```
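Policy iteration alternates the two steps above, evaluation then greedy improvement, until the policy stops changing. The block below is a minimal, self-contained sketch of that outer loop; `ChainMDP` is a hypothetical stand-in for the robot-finds-gold MDP (a 1-D chain whose rightmost state is terminal and pays reward 1), not the actual grid_mdp of this series.

```python
# Hypothetical toy MDP: a 1-D chain with states 0..4, where state 4 is
# terminal and entering it pays reward 1. Stands in for grid_mdp.
class ChainMDP:
    def __init__(self):
        self.states = [0, 1, 2, 3, 4]
        self.terminal_states = {4}
        self.actions = ['left', 'right']
        self.gamma = 0.8

    def transform(self, state, action):
        # Deterministic transition: returns (done, next_state, reward).
        s = max(state - 1, 0) if action == 'left' else min(state + 1, 4)
        return (s in self.terminal_states, s, 1.0 if s == 4 else 0.0)

def policy_iteration(mdp):
    v  = {s: 0.0 for s in mdp.states}
    pi = {s: mdp.actions[0] for s in mdp.states
          if s not in mdp.terminal_states}
    while True:
        # Step 1: policy evaluation -- iterate the Bellman backup for the
        # current policy until the values converge.
        while True:
            delta = 0.0
            for s in pi:
                _, s2, r = mdp.transform(s, pi[s])
                new_v  = r + mdp.gamma * v[s2]
                delta += abs(v[s] - new_v)
                v[s]   = new_v
            if delta < 1e-6:
                break
        # Step 2: greedy policy improvement; stop when the policy is stable.
        stable = True
        for s in pi:
            qs = {a: mdp.transform(s, a)[2]
                     + mdp.gamma * v[mdp.transform(s, a)[1]]
                  for a in mdp.actions}
            best = max(qs, key=qs.get)
            if qs[best] > qs[pi[s]]:
                pi[s] = best
                stable = False
        if stable:
            return v, pi
```

On this chain the loop converges to the always-move-right policy, with values 0.8^k for a state k steps from the goal.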

### 2. Value Iteration

```python
def value_iteration(self, grid_mdp):
    for i in range(1000):

        delta = 0.0
        for state in grid_mdp.states:

            if state in grid_mdp.terminal_states: continue

            a1      = grid_mdp.actions[0]
            t, s, r = grid_mdp.transform(state, a1)
            v1      = r + grid_mdp.gamma * self.v[s]

            # Greedily pick the action with the highest one-step backup value.
            for action in grid_mdp.actions:
                t, s, r = grid_mdp.transform(state, action)
                if v1 < r + grid_mdp.gamma * self.v[s]:
                    a1 = action
                    v1 = r + grid_mdp.gamma * self.v[s]

            delta         += abs(v1 - self.v[state])
            self.pi[state] = a1
            self.v[state]  = v1

        if delta < 1e-6:
            break
```
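Unlike policy iteration, value iteration folds the greedy maximization directly into each value sweep, so no separate evaluation phase is needed. As a sanity check, here is a standalone rewrite of the routine exercised end-to-end on a hypothetical toy MDP (a 1-D chain whose rightmost state is terminal and pays reward 1); the class and tolerance are illustrative assumptions, not part of the original code.

```python
# Hypothetical toy MDP: 1-D chain, states 0..4, state 4 terminal, reward 1
# for entering it.
class ChainMDP:
    states = [0, 1, 2, 3, 4]
    terminal_states = {4}
    actions = ['left', 'right']
    gamma = 0.8

    def transform(self, state, action):
        # Deterministic transition: returns (done, next_state, reward).
        s = max(state - 1, 0) if action == 'left' else min(state + 1, 4)
        return (s in self.terminal_states, s, 1.0 if s == 4 else 0.0)

def run_value_iteration(mdp, tol=1e-6):
    # Standalone value iteration: sweep the states, back up the best
    # one-step return, and stop once the values stop moving.
    v  = {s: 0.0 for s in mdp.states}
    pi = {}
    while True:
        delta = 0.0
        for state in mdp.states:
            if state in mdp.terminal_states:
                continue
            best_a, best_q = None, float('-inf')
            for action in mdp.actions:
                _, s2, r = mdp.transform(state, action)
                q = r + mdp.gamma * v[s2]
                if q > best_q:
                    best_a, best_q = action, q
            delta += abs(best_q - v[state])
            v[state], pi[state] = best_q, best_a
        if delta < tol:
            return v, pi
```

The greedy policy it returns always moves right, and the values decay by a factor of gamma per step away from the goal.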
