Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation

绿巨人

Published on 2018-05-17 15:32:49


Collected in the column: 绿巨人专栏


Study notes on: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

Basic Concepts

Policy

For continuing tasks:

\[ G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]

For episodic tasks:

\[ G_t \doteq \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \]

The state-value and action-value functions of a policy \(\pi\):

\[ v_{\pi}(s) \doteq \mathbb{E}_{\pi} [G_t | S_t=s] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right] \]

\[ q_{\pi}(s,a) \doteq \mathbb{E}_{\pi} [G_t | S_t=s,A_t=a] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t=a \right] \]

For a policy that acts greedily with respect to its own action values (in particular, the optimal policy):

\[ v_{\pi}(s) = \max_{a \in \mathcal{A}} q_{\pi}(s,a) \]

\[ \pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a) \]

That is, \(\pi(s)\) is the action that leads to the next state with the maximum value.

\[ \pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a) \]

That is, \(\pi(s)\) is the action with the maximum action value in the current state.

As the formulas above show, \(\pi(s)\) can be determined from either \(v_{\pi}(s)\) or \(q_{\pi}(s,a)\).
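The return and greedy-policy formulas above can be sketched in a few lines of Python. This is only an illustration, not code from the book: the function names `discounted_return` and `greedy_action`, the nested-dictionary layout of the action-value table, and the sample numbers are all assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards through the rewards: G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def greedy_action(q, state):
    """pi(s) = argmax_a q(s, a), with q stored as {state: {action: value}}."""
    actions = q[state]
    return max(actions, key=actions.get)


# Hypothetical data: three rewards and a one-state action-value table.
rewards = [1.0, 0.0, 2.0]
g0 = discounted_return(rewards, gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5

q_table = {"s0": {"left": 0.2, "right": 0.7}}
best = greedy_action(q_table, "s0")  # "right"
```

Computing the return backwards avoids raising \(\gamma\) to a power at every step, and the same loop works for any episodic reward sequence.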

\[
\begin{gathered}
\text{Reinforcement Learning} \doteq \pi_* \\
\updownarrow \\
\pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\
\updownarrow \\
\begin{cases}
\pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\
\pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a)
\end{cases} \\
\updownarrow \\
\begin{cases}
v_*(s), \quad \text{or} \\
q_*(s, a)
\end{cases} \\
\updownarrow \\
\text{approximation cases:} \\
\begin{cases}
\hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\
\hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function}
\end{cases}
\end{gathered}
\]

where \(\theta\) is the value function's weight vector and \(\phi\) is the feature vector.

Goal 3 of reinforcement learning: find the optimal value function \(v_*(s)\) or \(q_*(s,a)\).
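The linear approximation \(\hat{v}(s, \theta) \doteq \theta^T \phi(s)\) is just a dot product between the weight vector and a feature vector. The sketch below assumes plain Python lists for both vectors; the helper names `v_hat` and `one_hot` and the numbers are hypothetical. It also shows that one-hot features turn the linear form back into a table lookup.

```python
def v_hat(theta, phi_s):
    """Linear state-value estimate: v_hat(s, theta) = theta^T phi(s)."""
    return sum(w * x for w, x in zip(theta, phi_s))


def one_hot(index, size):
    """Feature vector with a single 1.0 at position `index` (assumed encoding)."""
    features = [0.0] * size
    features[index] = 1.0
    return features


theta = [0.5, -1.0, 2.0]  # hypothetical weights

# With one-hot features, the estimate is simply the weight stored for that state.
value_of_state_2 = v_hat(theta, one_hot(2, 3))  # 2.0

# With general features, several weights contribute to one estimate.
value_mixed = v_hat(theta, [1.0, 0.5, 0.0])  # 0.5*1.0 + (-1.0)*0.5 + 2.0*0.0 = 0.0
```

The action-value case \(\hat{q}(s, a, \theta)\) has the same shape, with the features computed from the state-action pair instead of the state alone.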

Approximation

Bandit Problem

General Mathematical Notation

Terminology

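episodic tasks - Tasks (reinforcement learning problems) that end after a finite number of steps.

continuing tasks - Tasks (reinforcement learning problems) that run for an unbounded number of steps.

episode - All the steps from the start state (or the current state) to termination.

tabular method - A method that stores information (for example, the value) for every state (or state-action pair) in an array or table.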

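planning method - A method that requires a model, from which state values can be computed. Example: dynamic programming.

learning method - A method that needs no model and estimates state values from simulated or actual experience. Examples: Monte Carlo methods, temporal-difference methods.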
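The difference between planning and learning can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the book's code: the two-state process `MODEL`, the function names, and the episode count are all hypothetical. `policy_evaluation` reads the transition probabilities directly (planning, in the style of dynamic programming), while `monte_carlo_evaluation` only sees sampled episodes (learning) and uses the model purely as a simulator.

```python
import random

# Hypothetical process under a fixed policy: state -> [(prob, next_state, reward)].
# "T" is terminal and therefore has no entry.
MODEL = {
    "A": [(0.5, "B", 0.0), (0.5, "T", 0.0)],
    "B": [(1.0, "T", 1.0)],
}


def policy_evaluation(model, gamma, sweeps=50):
    """Planning: repeatedly apply the Bellman expectation backup using the model."""
    v = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s, transitions in model.items():
            # Terminal states are absent from v and count as value 0.
            v[s] = sum(p * (r + gamma * v.get(s2, 0.0)) for p, s2, r in transitions)
    return v


def sample_episode(model, start, rng):
    """Simulate one episode; returns [(state, reward received on leaving it), ...]."""
    episode, s = [], start
    while s in model:
        weights = [p for p, _, _ in model[s]]
        _, s2, r = rng.choices(model[s], weights=weights)[0]
        episode.append((s, r))
        s = s2
    return episode


def monte_carlo_evaluation(model, start, gamma, episodes, seed=0):
    """Learning: average the sampled returns observed from each visited state."""
    rng = random.Random(seed)
    totals, counts = {}, {}
    for _ in range(episodes):
        g = 0.0
        for s, r in reversed(sample_episode(model, start, rng)):
            g = r + gamma * g  # return from state s onward
            totals[s] = totals.get(s, 0.0) + g
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


v_planned = policy_evaluation(MODEL, gamma=0.9)
v_learned = monte_carlo_evaluation(MODEL, "A", gamma=0.9, episodes=5000)
```

With \(\gamma = 0.9\) the exact values are \(v(B) = 1\) and \(v(A) = 0.5 \times 0.9 \times 1 = 0.45\). The planning result matches these after two sweeps, while the Monte Carlo estimate for state A only approaches 0.45 as more episodes are sampled.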

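on-policy method - The policy being evaluated and improved is the same policy that generates the data.

off-policy method - The policy being evaluated and improved is not the policy that generates the data, so the policy being optimized learns from sample data produced by another source.

target policy - The target policy \(\pi\). In off-policy methods, the policy that is being optimized.

behavior policy - The behavior policy \(\mu\). In off-policy methods, the policy that supplies the sample data.

importance sampling - A technique for estimating values under the target policy \(\pi\) from sample data generated by the behavior policy \(\mu\).

importance sampling ratio - The weight applied to sample data to account for the difference between the target policy \(\pi\) and the behavior policy \(\mu\).

ordinary importance sampling - An unbiased method of estimating a policy's value.

weighted importance sampling - A biased method of estimating a policy's value, usually with lower variance than the ordinary form.

MSE (mean squared error) - The mean of the squared errors.

MDP (Markov decision process) - Markov decision process.

The forward view - We decide how to update each state by looking forward to future rewards and states. For example, the n-step return:

\[ G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \theta_{t+n-1}), \quad 0 \le t \le T-n \]

The backward or mechanistic view - Each update depends on the current TD error combined with eligibility traces of past events. For example, the eligibility trace:

\[ e_0 \doteq 0 \]

\[ e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1} \]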
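Several of the terms above reduce to short computations. The following sketch is illustrative only; the function names, the list-based data layout, and the sample numbers are assumptions rather than anything defined in the notes. It covers the importance sampling ratio (the product over a trajectory of \(\pi(A_k|S_k) / \mu(A_k|S_k)\)), the ordinary and weighted estimators, the n-step return of the forward view, and the eligibility-trace update of the backward view. For the trace, a linear \(\hat{v}\) is assumed, so the gradient \(\nabla \hat{v}(S_t, \theta_t)\) is simply the feature vector \(\phi(S_t)\).

```python
def importance_ratio(target_probs, behavior_probs):
    """Product over one trajectory of pi(A_k|S_k) / mu(A_k|S_k)."""
    rho = 1.0
    for p_pi, p_mu in zip(target_probs, behavior_probs):
        rho *= p_pi / p_mu
    return rho


def ordinary_importance_sampling(returns, ratios):
    """Unbiased estimate: weighted returns divided by the number of episodes."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


def weighted_importance_sampling(returns, ratios):
    """Biased estimate: weighted returns divided by the sum of the ratios."""
    total = sum(ratios)
    if total == 0.0:
        return 0.0  # no episode is compatible with the target policy
    return sum(rho * g for rho, g in zip(ratios, returns)) / total


def n_step_return(rewards, gamma, bootstrap_value):
    """Forward view: n discounted rewards plus gamma^n times the value estimate."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def update_trace(trace, gradient, gamma, lam):
    """Backward view: e_t = grad v_hat(S_t) + gamma * lambda * e_{t-1}."""
    return [g + gamma * lam * e for g, e in zip(gradient, trace)]


# Hypothetical numbers for three episodes collected under the behavior policy.
returns = [1.0, 0.0, 2.0]
ratios = [2.0, 0.5, 0.0]
ordinary = ordinary_importance_sampling(returns, ratios)   # 2.0 / 3
weighted = weighted_importance_sampling(returns, ratios)   # 2.0 / 2.5 = 0.8

g2 = n_step_return([1.0, 1.0], gamma=0.5, bootstrap_value=4.0)  # 1 + 0.5 + 0.25*4 = 2.5

e1 = update_trace([0.0, 0.0], [1.0, 0.0], gamma=0.5, lam=0.5)   # [1.0, 0.0]
e2 = update_trace(e1, [0.0, 1.0], gamma=0.5, lam=0.5)           # [0.25, 1.0]
```

The two estimators differ only in the denominator, which is where the bias and the variance reduction of the weighted form come from.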

References

Originally published on 2017-03-25 on the author's personal site/blog.
