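Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/Solo95/article/details/102673140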
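Importance sampling (IS) is another commonly used method for estimating the expected value of a function under a given probability distribution. This post first gives a brief introduction to IS and then covers its application to policy evaluation.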
$$
\mathbb{E}_{x\sim q}[f(x)] = \int_x q(x)f(x)\,dx
$$
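As a minimal sketch of the idea (not taken from the original post): when samples can only be drawn from a different distribution $p$, the expectation $\mathbb{E}_{x\sim q}[f(x)]$ can still be estimated by weighting each sample by $q(x)/p(x)$, provided $p(x) > 0$ wherever $q(x) > 0$. The discrete distributions `q` and `p` and the function `f` below are made-up values chosen purely for illustration.

```python
import random

# Hypothetical discrete distributions over the outcomes 0, 1, 2 (illustrative values only).
q = {0: 0.1, 1: 0.3, 2: 0.6}  # target distribution
p = {0: 0.5, 1: 0.3, 2: 0.2}  # sampling distribution; p(x) > 0 wherever q(x) > 0


def f(x):
    """Arbitrary function whose expectation under q we want."""
    return x * x


def exact_expectation(dist, func):
    """Exact expectation of func under a discrete distribution."""
    return sum(prob * func(x) for x, prob in dist.items())


def importance_sampling_estimate(n_samples, rng):
    """Estimate E_{x~q}[f(x)] from samples drawn from p, weighted by q(x)/p(x)."""
    outcomes = list(p.keys())
    weights = list(p.values())
    samples = rng.choices(outcomes, weights=weights, k=n_samples)
    return sum(q[x] / p[x] * f(x) for x in samples) / n_samples


rng = random.Random(0)  # fixed seed so the run is reproducible
exact = exact_expectation(q, f)
estimate = importance_sampling_estimate(100_000, rng)
print(f"exact = {exact:.3f}, IS estimate = {estimate:.3f}")
```

With enough samples the weighted average should land close to the exact value, even though no sample was ever drawn from `q` itself.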
Let $h_j$ denote the history of states, actions, and rewards in episode $j$:

$$
h_j=(s_{j,1},a_{j,1},r_{j,1},s_{j,2},a_{j,2},r_{j,2},\ldots,s_{j,L_j}(\text{terminal}))
$$
Then

$$
\begin{aligned}
p(h_j|\pi,s=s_{j,1}) & = p(a_{j,1}|s_{j,1})\,p(r_{j,1}|s_{j,1},a_{j,1})\,p(s_{j,2}|s_{j,1},a_{j,1})\,p(a_{j,2}|s_{j,2})\,p(r_{j,2}|s_{j,2},a_{j,2})\,p(s_{j,3}|s_{j,2},a_{j,2})\cdots \\
& = \prod_{t=1}^{L_j-1} p(a_{j,t}|s_{j,t})\,p(r_{j,t}|s_{j,t},a_{j,t})\,p(s_{j,t+1}|s_{j,t},a_{j,t}) \\
& = \prod_{t=1}^{L_j-1} \pi(a_{j,t}|s_{j,t})\,p(r_{j,t}|s_{j,t},a_{j,t})\,p(s_{j,t+1}|s_{j,t},a_{j,t})
\end{aligned}
$$

The last factor in each product is the state-transition probability $p(s_{j,t+1}|s_{j,t},a_{j,t})$, and the action probability $p(a_{j,t}|s_{j,t})$ is exactly the policy $\pi(a_{j,t}|s_{j,t})$.
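The product above can be sketched in code as follows. This is only an illustration under stated assumptions: the tiny MDP (states `"A"`, `"B"`, `"END"`, actions `"stay"` and `"go"`), its probability tables, and the function name `history_probability` are all hypothetical and do not come from the original post.

```python
# Hypothetical tabular policy: policy[state][action] = pi(action | state).
policy = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 0.2, "go": 0.8},
}

# Hypothetical dynamics: transition_prob[(state, action)][next_state] = p(s' | s, a).
transition_prob = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 0.9, "A": 0.1},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"END": 1.0},
}

# Hypothetical reward model: reward_prob[(state, action)][reward] = p(r | s, a).
reward_prob = {
    ("A", "stay"): {0.0: 1.0},
    ("A", "go"): {0.0: 1.0},
    ("B", "stay"): {0.0: 1.0},
    ("B", "go"): {1.0: 1.0},
}


def history_probability(steps, terminal_state, policy, reward_prob, transition_prob):
    """Return p(h | pi, s = s_1) for a history given as (state, action, reward) steps
    followed by a terminal state."""
    prob = 1.0
    # The successor of step t is the state of step t+1; the last successor is the terminal state.
    next_states = [state for state, _, _ in steps[1:]] + [terminal_state]
    for (state, action, reward), next_state in zip(steps, next_states):
        prob *= policy[state].get(action, 0.0)                         # pi(a_t | s_t)
        prob *= reward_prob[(state, action)].get(reward, 0.0)          # p(r_t | s_t, a_t)
        prob *= transition_prob[(state, action)].get(next_state, 0.0)  # p(s_{t+1} | s_t, a_t)
    return prob


steps = [("A", "go", 0.0), ("B", "go", 1.0)]
prob = history_probability(steps, "END", policy, reward_prob, transition_prob)
print(f"p(h | pi, s = A) = {prob:.4f}")
```

Note that this computation needs the reward and transition models, which are usually unknown in practice.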
Now let $h_j$ denote the history of states, actions, and rewards in episode $j$, where the actions are sampled from policy $\pi_2$:

$$
h_j=(s_{j,1},a_{j,1},r_{j,1},s_{j,2},a_{j,2},r_{j,2},\ldots,s_{j,L_j}(\text{terminal}))
$$
Then, averaging over the $n$ sampled episodes,

$$
V^{\pi_1}(s)\approx \frac{1}{n}\sum_{j=1}^n \frac{p(h_j|\pi_1,s)}{p(h_j|\pi_2,s)}G(h_j)
$$

where $G(h_j)$ is the return of episode $j$.
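Substituting the factorization of $p(h_j|\pi,s)$ into this ratio, the reward and transition terms are identical in the numerator and the denominator and cancel, which leaves $\prod_{t=1}^{L_j-1} \pi_1(a_{j,t}|s_{j,t}) / \pi_2(a_{j,t}|s_{j,t})$. The estimate can therefore be computed from the two policies alone. The sketch below illustrates this on a made-up two-step problem; the environment, the policies `pi_1` and `pi_2`, the helper names, and the choice of the undiscounted reward sum as $G$ are all assumptions for illustration, not part of the original post.

```python
import random

ACTIONS = ("a", "b")
HORIZON = 2  # hypothetical chain: state 0 -> state 1 -> terminal

# Hypothetical policies: policy[state][action] = probability of the action.
pi_1 = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.7, "b": 0.3}}  # policy to evaluate
pi_2 = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 0.5, "b": 0.5}}  # policy that generated the data


def sample_history(policy, rng):
    """Roll out one episode; action "a" yields reward 1, action "b" yields reward 0."""
    history = []
    for state in range(HORIZON):
        probs = policy[state]
        action = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
        reward = 1.0 if action == "a" else 0.0
        history.append((state, action, reward))
    return history


def importance_weight(history, target, behavior):
    """Product of pi_1(a|s) / pi_2(a|s) over the episode; dynamics terms have cancelled."""
    weight = 1.0
    for state, action, _ in history:
        weight *= target[state][action] / behavior[state][action]
    return weight


def episode_return(history):
    """Undiscounted sum of rewards (an assumption; the post does not define G)."""
    return sum(reward for _, _, reward in history)


def is_value_estimate(histories, target, behavior):
    """Average of importance-weighted returns over all sampled episodes."""
    total = sum(importance_weight(h, target, behavior) * episode_return(h) for h in histories)
    return total / len(histories)


rng = random.Random(0)  # fixed seed so the run is reproducible
histories = [sample_history(pi_2, rng) for _ in range(50_000)]
estimate = is_value_estimate(histories, pi_1, pi_2)
exact = pi_1[0]["a"] + pi_1[1]["a"]  # true V^{pi_1}(0) for this toy problem
print(f"exact = {exact:.3f}, IS estimate = {estimate:.3f}")
```

Every episode here was generated by `pi_2`, yet the weighted average should approach the value of `pi_1`. The weights are products over time steps, so their variance can grow quickly with episode length.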