# Reinforcement Learning Reading Notes - 09 - On-policy Prediction with Approximation

## Why Approximation Methods Matter

- Tabular methods do not scale to complex environments: with too many states and actions, storing a separate value for every state requires a prohibitive amount of memory.
- The environment may be non-stationary, so past experience does not necessarily apply to future situations; we need a method that generalizes when updating value estimates.
- A table of values has no generalization at all: we want a general method that can also estimate the value of states that have never been visited.

## The Goal of Approximate Prediction Methods

The prediction objective weights the squared approximation error in each state by the on-policy distribution $d(s)$, which measures how much time is spent in each state under $\pi$:

$$\overline{VE}(\theta) \doteq \sum_{s \in \mathcal{S}} d(s) \left[ v_{\pi}(s) - \hat{v}(s, \theta) \right]^2$$

- In episodic tasks:

$$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_{a} \pi(a|\bar{s}) p(s|\bar{s}, a), \quad \forall s \in \mathcal{S}$$

$$d(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}$$

  where
  $h(s)$ - the probability that an episode begins in state $s$
  $\eta(s)$ - the average number of time steps spent in state $s$ within a single episode

- In continuing tasks, $d(s)$ is the stationary distribution of states under $\pi$.

### Stochastic Gradient Descent

Stochastic gradient descent:

$$\begin{align} \theta_{t+1} & \doteq \theta_{t} - \frac{1}{2} \alpha \nabla \left[ v_{\pi}(S_t) - \hat{v}(S_t, \theta_t) \right]^2 \\ & = \theta_{t} + \alpha \left[ v_{\pi}(S_t) - \hat{v}(S_t, \theta_t) \right] \nabla \hat{v}(S_t, \theta_t) \end{align}$$

where
$\nabla f(\theta) \doteq \left( \frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, \cdots, \frac{\partial f(\theta)}{\partial \theta_n} \right)^T$ - the gradient with respect to $\theta$
$\alpha$ - the step size (learning rate)
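As a concrete sketch (not part of the original notes), the update is easy to write down for a linear $\hat{v}(s, \theta) = \theta^T \phi(s)$, whose gradient with respect to $\theta$ is simply $\phi(s)$; the feature vector and numbers below are made up for illustration:

```python
import numpy as np

def sgd_update(theta, phi_s, v_target, alpha):
    """One SGD step for a linear value function v_hat(s) = theta @ phi(s).

    Since the gradient of a linear v_hat w.r.t. theta is just phi(s),
    the update is theta + alpha * (target - v_hat) * phi(s).
    """
    v_hat = theta @ phi_s
    return theta + alpha * (v_target - v_hat) * phi_s

# Illustrative numbers (assumed, not from the text):
theta = np.zeros(3)
phi_s = np.array([1.0, 0.5, 0.25])       # feature vector of some state s
theta = sgd_update(theta, phi_s, v_target=1.0, alpha=0.1)
```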

### Gradient Monte Carlo

Algorithm (Gradient Monte Carlo for estimating $\hat{v} \approx v_{\pi}$):

    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S × ℝⁿ → ℝ
    Initialize value-function weights θ arbitrarily (e.g. θ = 0)
    Repeat (for each episode):
        Generate an episode S₀, A₀, R₁, S₁, A₁, ..., R_T, S_T using π
        For t = 0, 1, ..., T − 1:
            θ ← θ + α [G_t − v̂(S_t, θ)] ∇v̂(S_t, θ)
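The episode loop above can be sketched in Python for a linear $\hat{v}$; the one-hot features and the toy episode are assumptions for illustration, not part of the notes:

```python
import numpy as np

def gradient_mc_episode(theta, episode, alpha, phi, gamma=1.0):
    """Apply the gradient Monte Carlo update over one finished episode.

    episode is a list of (S_t, R_{t+1}) pairs; the returns G_t are
    accumulated backwards, then each state's weights are updated with
    theta <- theta + alpha * (G_t - v_hat(S_t)) * grad v_hat(S_t).
    """
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        x = phi(s)                        # gradient of a linear v_hat
        theta = theta + alpha * (G - theta @ x) * x
    return theta

# Toy 2-state chain with one-hot features (illustrative only)
phi = lambda s: np.eye(2)[s]
theta = gradient_mc_episode(np.zeros(2), [(0, 0.0), (1, 1.0)],
                            alpha=0.1, phi=phi)
```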

### Semi-gradient TD(0)

Algorithm (Semi-gradient TD(0) for estimating $\hat{v} \approx v_{\pi}$):

    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S⁺ × ℝⁿ → ℝ such that v̂(terminal, ·) = 0
    Initialize value-function weights θ arbitrarily (e.g. θ = 0)
    Repeat (for each episode):
        Initialize S
        Repeat (for each step of episode):
            Choose A ~ π(·|S)
            Take action A, observe R, S'
            θ ← θ + α [R + γ v̂(S', θ) − v̂(S, θ)] ∇v̂(S, θ)
            S ← S'
        Until S' is terminal
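A minimal sketch of one semi-gradient TD(0) step for a linear $\hat{v}$ (the function name and numbers are illustrative assumptions):

```python
import numpy as np

def semi_gradient_td0_step(theta, phi_s, r, phi_s_next, alpha, gamma, terminal):
    """One semi-gradient TD(0) update for a linear value function.

    The bootstrap target R + gamma * v_hat(S') treats theta inside the
    target as a constant (hence "semi-gradient"), and v_hat(terminal, .) = 0.
    """
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = r + gamma * v_next - theta @ phi_s   # TD error
    return theta + alpha * delta * phi_s

# One step that lands in a terminal state (illustrative numbers):
theta = semi_gradient_td0_step(np.zeros(2), np.array([1.0, 0.0]), r=1.0,
                               phi_s_next=np.array([0.0, 1.0]),
                               alpha=0.1, gamma=0.9, terminal=True)
```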

## Linear Function Approximation

$$\phi(s) \doteq (\phi_1(s), \phi_2(s), \dots, \phi_n(s))^T$$

$$\hat{v}(s, \theta) \doteq \theta^T \phi(s) = \sum_{i=1}^n \theta_i \phi_i(s)$$

$\phi(s)$ is the feature vector of state $s$. The following sections discuss general ways to construct feature functions.

### Polynomial Basis

Each dimension of $s$ can already be viewed as a feature. The polynomial-basis approach additionally uses higher-order polynomials in the dimensions of $s$ as features. For example, for a two-dimensional state $s = (s_1, s_2)$, one can choose the polynomials $(1, s_1, s_2, s_1 s_2)$ or $(1, s_1, s_2, s_1 s_2, s_1^2, s_2^2, s_1 s_2^2, s_1^2 s_2, s_1^2 s_2^2)$.

$$\phi_i(s) = \prod_{j=1}^d s_j^{c_{i,j}}$$

where
$s = (s_1, s_2, \cdots, s_d)^T$
$c_{i,j}$ - a non-negative integer exponent
$\phi_i(s)$ - the $i$-th polynomial basis function
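A sketch of how such features might be generated, enumerating all exponent combinations up to a chosen degree (the function name is made up for illustration):

```python
import numpy as np
from itertools import product

def polynomial_features(s, degree):
    """Return all products prod_j s_j**c_j with each exponent in 0..degree.

    For a d-dimensional state this yields (degree + 1)**d features,
    e.g. 9 features for d = 2 and degree = 2.
    """
    d = len(s)
    return np.array([np.prod([s[j] ** c[j] for j in range(d)])
                     for c in product(range(degree + 1), repeat=d)])

# For s = (2, 3) and degree 1 this enumerates 1, s2, s1, s1*s2:
feats = polynomial_features((2.0, 3.0), degree=1)
```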
