
Reinforcement Learning Reading Notes - 13 - Policy Gradient Methods


Study notes: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

References

If you need a refresher on the mathematical notation used in reinforcement learning, see here first:

Policy Gradient Methods

The value-function-based approach

\text{Reinforcement Learning} \doteq \pi_* \\ \quad \updownarrow \\ \pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\ \quad \updownarrow \\ \begin{cases} \pi(s) = \underset{a}{\operatorname{arg\,max}} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\ \pi(s) = \underset{a}{\operatorname{arg\,max}} \ q_{\pi}(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \begin{cases} v_*(s), \quad \text{or} \\ q_*(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \text{approximation cases:} \\ \begin{cases} \hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\ \hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function} \\ \end{cases} \\ \text{where} \\ \theta \text{ - value function's weight vector} \\
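To make the final approximation step concrete, here is a minimal Python sketch of the two linear forms; the feature vectors and their values are hypothetical placeholders, not anything prescribed by the book.

```python
import numpy as np

# Minimal sketch of linear value-function approximation:
#   v_hat(s, theta)    = theta^T phi(s)
#   q_hat(s, a, theta) = theta^T phi(s, a)
# phi_s and phi_sa are hypothetical feature vectors for a state / state-action pair.

def v_hat(theta, phi_s):
    """State-value estimate."""
    return theta @ phi_s

def q_hat(theta, phi_sa):
    """Action-value estimate."""
    return theta @ phi_sa

theta = np.array([0.5, -1.0, 0.25])   # weight vector (illustrative values)
phi_s = np.array([1.0, 0.0, 2.0])     # phi(s), hypothetical features
print(v_hat(theta, phi_s))            # 0.5*1 + (-1)*0 + 0.25*2 = 1.0
```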

A new approach: policy gradient methods

The policy gradient theorem

Episodic tasks

How to compute the value \eta of a policy

\eta(\theta) \doteq v_{\pi_\theta}(s_0) \\ \text{where} \\ \eta \text{ - the performance measure} \\ v_{\pi_\theta} \text{ - the true value function for } \pi_\theta \text{, the policy determined by } \theta \\ s_0 \text{ - some particular state} \\

  • The policy gradient theorem \nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ \text{where} \\ d_{\pi}(s) \text{ - on-policy distribution, the fraction of time spent in } s \text{ under the target policy } \pi \\ \sum_s d_{\pi}(s) = 1 \\
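The algorithms below all need \nabla_\theta \pi(a|s, \theta) (or its log form). A common concrete choice, not mandated by the theorem itself, is a linear softmax policy \pi(a|s, \theta) \propto \exp(\theta^T x(s,a)); here is a minimal sketch, where x(s, a) is a hypothetical action-feature function. The later sketches reuse these two helpers.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    """pi(.|s, theta) for a linear softmax policy.

    x_sa: array of shape (num_actions, dim), row a holding the features x(s, a).
    """
    prefs = x_sa @ theta          # action preferences h(s, a, theta) = theta^T x(s, a)
    prefs = prefs - prefs.max()   # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, x_sa, a):
    """grad_theta log pi(a|s, theta) = x(s, a) - sum_b pi(b|s, theta) x(s, b)."""
    probs = softmax_policy(theta, x_sa)
    return x_sa[a] - probs @ x_sa
```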

Monte Carlo Policy Gradient (REINFORCE)

  • Gradient of the policy value \begin{align} \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ & = \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi(S_t,a) \nabla_\theta \pi(a|S_t, \theta) \right ] \\ & = \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right ] \end{align}
  • Update rule \begin{align} \theta_{t+1} & \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \\ & = \theta_t + \alpha \gamma^t G_t \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \end{align} (a sketch of this update follows this list)
  • Algorithm description (REINFORCE: A Monte Carlo Policy Gradient Method (episodic)): see the original book; it is not repeated here.
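A minimal sketch of REINFORCE built on the update rule above. It assumes a toy episodic environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`, a hypothetical `features(s)` helper returning the action-feature matrix x(s, ·), and the `softmax_policy` / `grad_log_pi` functions from the earlier sketch.

```python
import numpy as np

def reinforce(env, features, theta, alpha=1e-3, gamma=1.0, num_episodes=1000):
    """REINFORCE: Monte Carlo policy gradient (episodic sketch)."""
    for _ in range(num_episodes):
        # 1. Generate one full episode following pi(.|., theta).
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, features(s))
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # 2. Compute the returns G_t backwards from the end of the episode.
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        # 3. theta_{t+1} = theta_t + alpha * gamma^t * G_t * grad log pi(A_t|S_t, theta)
        for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
            theta = theta + alpha * (gamma ** t) * G_t * grad_log_pi(theta, features(s), a)
    return theta
```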

Monte Carlo Policy Gradient with a baseline (REINFORCE with baseline)

  • Gradient of the policy value \begin{align} \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ & = \sum_s d_{\pi}(s) \sum_{a} \left ( q_{\pi}(s,a) - b(s)\right ) \nabla_\theta \pi(a|s, \theta) \\ \end{align} \\ \because \\ \sum_{a} b(s) \nabla_\theta \pi(a|s, \theta) \\ \quad = b(s) \nabla_\theta \sum_{a} \pi(a|s, \theta) \\ \quad = b(s) \nabla_\theta 1 \\ \quad = 0 \\ \text{where} \\ b(s) \text{ - an arbitrary baseline function, e.g. } b(s) = \hat{v}(s, w) \\
  • Update rule \delta = G_t - \hat{v}(S_t, w) \\ w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\ \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta) \\ (a sketch of this update follows this list)
  • Algorithm description: see the original book; it is not repeated here.
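The same sketch with a learned linear baseline \hat{v}(s, w) = w^T \phi(s). `phi(s)` is a hypothetical state-feature helper; everything else follows the update rule above, with \delta = G_t - \hat{v}(S_t, w) and separate steps for w and \theta.

```python
import numpy as np

def reinforce_with_baseline(env, features, phi, theta, w,
                            alpha=1e-3, beta=1e-2, gamma=1.0, num_episodes=1000):
    """REINFORCE with a learned linear baseline v_hat(s, w) = w^T phi(s) (sketch)."""
    for _ in range(num_episodes):
        # Generate one episode following pi(.|., theta).
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, features(s))
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Returns G_t, computed backwards.
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
            delta = G_t - w @ phi(s)        # delta = G_t - v_hat(S_t, w)
            w = w + beta * delta * phi(s)   # grad_w v_hat(S_t, w) = phi(S_t)
            theta = theta + alpha * (gamma ** t) * delta * grad_log_pi(theta, features(s), a)
    return theta, w
```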

Actor-Critic Methods

This algorithm is essentially:

  1. a TD generalization of REINFORCE with baseline,
  2. plus eligibility traces.

Note: Monte Carlo methods must wait for the current episode to finish before the true return G_t can be computed. TD removes this requirement (and thereby improves efficiency) by using a temporal-difference estimate of the return, G_t^{(1)} \approx G_t, at the cost of some bias. Eligibility traces smooth the gradient of the value function used to update the weight vector: e_t \doteq \gamma \lambda \ e_{t-1} + \nabla \hat{v}(S_t, w_t)

  • Update rule \delta = G_t^{(1)} - \hat{v}(S_t, w) \\ \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\ w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\ \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta) \\
  • Update rule with eligibility traces \delta = R + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\ e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\ w_{t+1} = w_{t} + \beta \delta e^w \\ e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\ \text{where} \\ R + \gamma \hat{v}(s', w) = G_t^{(1)} \\ \delta \text{ - TD error} \\ e^w \text{ - eligibility trace of the state value function} \\ e^{\theta} \text{ - eligibility trace of the policy} \\ (a sketch of these updates follows this list)
  • Algorithm description: see the original book; it is not repeated here.
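A minimal sketch of the episodic actor-critic update with eligibility traces. The running factor `I` plays the role of \gamma^t in the formulas above; the environment and feature helpers are the same assumed interfaces as in the earlier sketches.

```python
import numpy as np

def actor_critic_traces(env, features, phi, theta, w,
                        alpha=1e-3, beta=1e-2, gamma=0.99,
                        lam_w=0.9, lam_theta=0.9, num_episodes=1000):
    """One-step actor-critic with eligibility traces (episodic sketch)."""
    for _ in range(num_episodes):
        s, done = env.reset(), False
        e_w = np.zeros_like(w)           # e^w: trace for the critic weights
        e_theta = np.zeros_like(theta)   # e^theta: trace for the policy weights
        I = 1.0                          # running gamma^t factor
        while not done:
            probs = softmax_policy(theta, features(s))
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)

            # TD error: delta = R + gamma * v_hat(S', w) - v_hat(S, w), with v_hat(terminal) = 0.
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - w @ phi(s)

            # Accumulate traces, weighted by the running gamma^t factor.
            e_w = lam_w * e_w + I * phi(s)
            e_theta = lam_theta * e_theta + I * grad_log_pi(theta, features(s), a)

            w = w + beta * delta * e_w
            theta = theta + alpha * delta * e_theta

            I *= gamma
            s = s_next
    return theta, w
```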

Policy Gradient for Continuing Problems (Average Reward Rate)

  • Policy value formula: for a continuing task, the value of a policy is the average reward per time step \begin{align} \eta(\theta) \doteq r(\theta) & \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E} [R_t|\theta_0=\theta_1=\dots=\theta_{t-1}=\theta] \\ & = \lim_{t \to \infty} \mathbb{E} [R_t|\theta_0=\theta_1=\dots=\theta_{t-1}=\theta] \\ \end{align}
  • Update rule \delta = G_t^{(1)} - \hat{v}(S_t, w) \\ \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\ w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\ \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta) \\
  • Update rule for Actor-Critic with eligibility traces (continuing) \delta = R - \bar{R} + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\ \bar{R} = \bar{R} + \eta \delta \\ e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\ w_{t+1} = w_{t} + \beta \delta e^w \\ e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\ \text{where} \\ R + \gamma \hat{v}(s', w) = G_t^{(1)} \\ \delta \text{ - TD error} \\ e^w \text{ - eligibility trace of the state value function} \\ e^{\theta} \text{ - eligibility trace of the policy} \\ (a sketch of these updates follows this list)
  • Algorithm description (Actor-Critic with eligibility traces (continuing)): see the original book; it is not repeated here. The book draft was not yet finished, so these notes for the chapter stop here.
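For the continuing case, a sketch in the average-reward formulation: the discount term is dropped (i.e. \gamma = 1 in the update above) and an estimate \bar{R} of the average reward is maintained with \bar{R} \leftarrow \bar{R} + \eta \delta; the step size is called `eta_step` here to avoid clashing with the performance measure \eta(\theta). The interfaces are the same assumptions as before, and there are no terminal states.

```python
import numpy as np

def actor_critic_continuing(env, features, phi, theta, w,
                            alpha=1e-3, beta=1e-2, eta_step=1e-2,
                            lam_w=0.9, lam_theta=0.9, num_steps=100_000):
    """Actor-critic with eligibility traces for a continuing task (average-reward sketch)."""
    s = env.reset()
    r_bar = 0.0                          # average reward estimate R_bar
    e_w = np.zeros_like(w)               # e^w
    e_theta = np.zeros_like(theta)       # e^theta
    for _ in range(num_steps):
        probs = softmax_policy(theta, features(s))
        a = np.random.choice(len(probs), p=probs)
        s_next, r, _ = env.step(a)       # continuing task: the done flag is ignored

        # TD error with average reward: delta = R - R_bar + v_hat(S', w) - v_hat(S, w)
        delta = r - r_bar + w @ phi(s_next) - w @ phi(s)
        r_bar += eta_step * delta        # R_bar <- R_bar + eta * delta

        e_w = lam_w * e_w + phi(s)
        e_theta = lam_theta * e_theta + grad_log_pi(theta, features(s), a)

        w = w + beta * delta * e_w
        theta = theta + alpha * delta * e_theta
        s = s_next
    return theta, w, r_bar
```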