$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots$ is the return in MDP $M$ under policy $\pi$
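The discounted return above can be computed by a single backward pass over a reward sequence. A minimal sketch (the function name and input format are illustrative, not from the source):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

    Scanning backwards lets each step reuse the suffix sum:
    G <- r + gamma * G.
    """
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```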
First-visit Monte Carlo (MC) policy evaluation: Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i-1} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode
For each state s visited in episode i
For first time t that state s is visited in episode i
Increment counter of total first visits: $N(s) = N(s) + 1$
Increment total return $G(s) = G(s) + G_{i,t}$
Update estimate $V^\pi(s) = G(s) / N(s)$
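The first-visit steps above can be sketched as follows. The episode format, a list of `(state, reward)` pairs per episode, is an assumption for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit MC policy evaluation (sketch).

    `episodes` is a list of trajectories collected under the policy,
    each a list of (state, reward) pairs.
    """
    N = defaultdict(int)          # counter of total first visits per state
    G_total = defaultdict(float)  # summed first-visit returns per state
    for episode in episodes:
        # Compute G_{i,t} for every t by scanning the episode backwards.
        returns = []
        G = 0.0
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (s, _), G_t in zip(episode, returns):
            if s in seen:         # only the FIRST visit to s counts
                continue
            seen.add(s)
            N[s] += 1
            G_total[s] += G_t
    # V^pi(s) = G(s) / N(s): average first-visit return.
    return {s: G_total[s] / N[s] for s in N}
```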
Every-visit MC policy evaluation: Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i-1} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode
For each state s visited in episode i
For every time t that state s is visited in episode i
Increment counter of total visits: $N(s) = N(s) + 1$
Increment total return $G(s) = G(s) + G_{i,t}$
Update estimate $V^\pi(s) = G(s) / N(s)$
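The every-visit variant differs only in that every occurrence of a state contributes its return, not just the first. A sketch under the same assumed episode format:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every-visit MC policy evaluation (sketch)."""
    N = defaultdict(int)          # counter of total visits per state
    G_total = defaultdict(float)  # summed returns per state
    for episode in episodes:
        returns = []
        G = 0.0
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, _), G_t in zip(episode, returns):
            N[s] += 1             # every visit counts, not just the first
            G_total[s] += G_t
    return {s: G_total[s] / N[s] for s in N}
```

On the same episode, a state visited twice averages both of its returns, so every-visit and first-visit estimates generally differ on finite data.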
Incremental MC policy evaluation: Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i-1} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode
For state $s$ visited at time step $t$ in episode $i$:
- Increment counter of total first visits: $N(s) = N(s) + 1$
- Update estimate $V^\pi(s) = V^\pi(s) + \alpha (G_{i,t} - V^\pi(s))$; when $\alpha = \frac{1}{N(s)}$, this is identical to the every-visit MC algorithm
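A single incremental update can be sketched as below (function name and mutable-dict interface are illustrative). With `alpha=None` it uses $\alpha = 1/N(s)$, reproducing the running average of returns; a constant `alpha` yields an exponentially weighted average that favors recent episodes:

```python
def incremental_mc_update(V, N, s, G_t, alpha=None):
    """One incremental MC update of V[s] toward the observed return G_t.

    V: dict mapping state -> current value estimate
    N: dict mapping state -> visit count
    """
    N[s] = N.get(s, 0) + 1
    a = alpha if alpha is not None else 1.0 / N[s]  # alpha = 1/N(s) by default
    v = V.get(s, 0.0)
    V[s] = v + a * (G_t - v)   # V(s) <- V(s) + alpha * (G_t - V(s))
    return V[s]
```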