# 2. The HER Principle

HER is implemented along the following lines:

#### (3) Strategies for Obtaining New Goals

- **final** — replay with the goal corresponding to the final state in each episode: the last state of the episode is taken as the new goal.
- **future** — replay with k random states that come from the same episode as the transition being replayed and were observed after it: k states sampled from later in the same episode serve as the new goals.
- **episode** — replay with k random states coming from the same episode as the transition being replayed: k states sampled from anywhere in the same episode serve as the new goals.
- **random** — replay with k random states encountered so far in the whole training procedure: k states sampled from everything seen during training serve as the new goals.
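As a sketch, the four strategies can be written as a single sampling helper over plain Python lists (`sample_new_goals`, `episode_states`, and `all_states` are illustrative names, not from the paper or the repo):

```python
import random

def sample_new_goals(strategy, t, episode_states, all_states, k=4):
    """Sample substitute goals for the transition at step t of an episode.

    episode_states: states visited in this episode, in time order
    all_states:     every state seen so far in training
    """
    if strategy == 'final':
        return [episode_states[-1]]               # last state of the episode
    if strategy == 'future':
        pool = episode_states[t:]                 # states observed at/after step t
        return random.choices(pool, k=k)
    if strategy == 'episode':
        return random.choices(episode_states, k=k)  # anywhere in the same episode
    if strategy == 'random':
        return random.choices(all_states, k=k)      # anywhere in training history
    raise ValueError(strategy)
```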

# 3. A Simple HER Implementation

https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-HER-Demo

For the RL model we use Double DQN.
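The key point of Double DQN is that the eval network *selects* the next action while the target network *evaluates* it. A tiny numpy illustration (the Q-value arrays are made up for the example):

```python
import numpy as np

q_eval_next = np.array([[1.0, 3.0, 2.0]])   # eval net Q(s', .): picks the action
q_targ      = np.array([[0.5, 1.5, 2.5]])   # target net Q(s', .): scores that action

best_action = np.argmax(q_eval_next, axis=1)                      # action 1
double_dqn_value = q_targ[np.arange(len(q_targ)), best_action]    # 1.5, not max(q_targ)
vanilla_value = np.max(q_targ, axis=1)                            # plain DQN would use 2.5
print(double_dqn_value[0], vanilla_value[0])  # 1.5 2.5
```

Decoupling selection from evaluation is what reduces the overestimation bias of vanilla DQN.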

#### 3.1 Environment Setup

```python
import numpy as np

class BitFlip():
    def __init__(self, n, reward_type):
        self.n = n  # number of bits
        self.reward_type = reward_type

    def reset(self):
        self.goal = np.random.randint(2, size=self.n)   # a random sequence of 0's and 1's
        self.state = np.random.randint(2, size=self.n)  # another random sequence as the initial state
        return np.copy(self.state), np.copy(self.goal)

    def step(self, action):
        self.state[action] = 1 - self.state[action]  # flip this bit
        done = np.array_equal(self.state, self.goal)
        if self.reward_type == 'sparse':
            reward = 0 if done else -1
        else:
            reward = -np.sum(np.square(self.state - self.goal))
        return np.copy(self.state), reward, done

    def render(self):
        print("\rstate :", np.array_str(self.state), end=' ' * 10)
```
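A quick sanity check of the dynamics, re-implementing the flip rule inline so the snippet stands alone: flipping exactly the mismatched bits reaches the goal, and under the sparse reward every step except the last returns -1, which is why random exploration almost never sees a non-negative reward.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 8
goal = rng.randint(2, size=n)
state = rng.randint(2, size=n)

rewards = []
for i in range(n):                      # a "perfect" policy: flip only mismatched bits
    if state[i] != goal[i]:
        state[i] = 1 - state[i]         # what env.step(i) does
        done = np.array_equal(state, goal)
        rewards.append(0 if done else -1)  # sparse reward as in BitFlip

assert np.array_equal(state, goal)      # only the very last flip yields reward 0
```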

#### 3.2 Building the Experience Buffer Class

```python
class Episode_experience():
    def __init__(self):
        self.memory = []

    def add(self, state, action, reward, next_state, done, goal):
        self.memory += [(state, action, reward, next_state, done, goal)]

    def clear(self):
        self.memory = []
```

#### 3.3 Building the DDQN Agent

```python
def _set_model(self):  # set up the value network
    tf.reset_default_graph()
    self.sess = tf.Session()

    self.tfs = tf.placeholder(tf.float32, [None, self.state_size], 'state')
    self.tfs_ = tf.placeholder(tf.float32, [None, self.state_size], 'next_state')
    self.tfg = tf.placeholder(tf.float32, [None, self.goal_size], 'goal')
    self.tfa = tf.placeholder(tf.int32, [None, ], 'action')
    self.tfr = tf.placeholder(tf.float32, [None, ], 'reward')
    self.tfd = tf.placeholder(tf.float32, [None, ], 'done')

    def _build_qnet(state, scope, trainable, reuse):
        with tf.variable_scope(scope, reuse=reuse):
            net = tf.layers.dense(tf.concat([state, self.tfg], axis=1), 256,
                                  activation=tf.nn.relu, trainable=trainable)
            q = tf.layers.dense(net, self.action_size, trainable=trainable)
        return q, tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope)

    self.q_eval, e_params = _build_qnet(self.tfs, 'eval', trainable=True, reuse=False)
    self.q_targ, t_params = _build_qnet(self.tfs_, 'target', trainable=False, reuse=False)

    # soft update: the target net slowly tracks the eval net
    self.update_op = [tf.assign(t, self.tau * e + (1 - self.tau) * t)
                      for t, e in zip(t_params, e_params)]

    if self.use_double_dqn:
        # Double DQN: the eval net picks the action, the target net evaluates it
        q_eval_next, _ = _build_qnet(self.tfs_, 'eval', trainable=True, reuse=True)
        q_eval_next_best_action = tf.argmax(q_eval_next, 1)
        self.q_target_value = tf.reduce_sum(
            self.q_targ * tf.one_hot(q_eval_next_best_action, self.action_size), axis=1)
    else:
        self.q_target_value = tf.reduce_max(self.q_targ, axis=1)

    self.q_target_value = self.tfr + self.gamma * (1 - self.tfd) * self.q_target_value

    if self.clip_target_value:
        self.q_target_value = tf.clip_by_value(self.q_target_value, -1 / (1 - self.gamma), 0)

    self.q_eval_action_value = tf.reduce_sum(self.q_eval * tf.one_hot(self.tfa, self.action_size), axis=1)

    self.loss = tf.losses.mean_squared_error(self.q_target_value, self.q_eval_action_value)
    # train op, needed by replay() below; the learning-rate attribute is assumed
    # to be set in the agent's __init__
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

    self.saver = tf.train.Saver()

    self.sess.run(tf.global_variables_initializer())
```

```python
def choose_action(self, state, goal):
    if np.random.rand() <= self.epsilon:  # epsilon-greedy exploration
        return np.random.randint(self.action_size)
    act_values = self.sess.run(self.q_eval, {self.tfs: state, self.tfg: goal})
    return np.argmax(act_values[0])  # np.argmax is much faster here than tf.argmax
```

```python
def remember(self, ep_experience):
    self.memory += ep_experience.memory
    if len(self.memory) > self.buffer_size:
        self.memory = self.memory[-self.buffer_size:]  # drop the oldest memories
```

```python
def replay(self, optimization_steps):
    if len(self.memory) < self.batch_size:  # not enough transitions yet, do nothing
        return 0

    losses = 0
    for _ in range(optimization_steps):
        minibatch = np.vstack(random.sample(self.memory, self.batch_size))
        ss = np.vstack(minibatch[:, 0])
        acs = minibatch[:, 1]
        rs = minibatch[:, 2]
        nss = np.vstack(minibatch[:, 3])
        ds = minibatch[:, 4]
        gs = np.vstack(minibatch[:, 5])
        loss, _ = self.sess.run([self.loss, self.train_op],
                                {self.tfs: ss, self.tfg: gs, self.tfa: acs,
                                 self.tfr: rs, self.tfs_: nss, self.tfd: ds})
        losses += loss
    return losses / optimization_steps  # return the mean loss
```

```python
def update_target_net(self, decay=True):
    self.sess.run(self.update_op)  # soft-update the target network
    if decay:
        self.epsilon = max(self.epsilon * self.epsilon_decay, self.epsilon_min)
```

Here `update_op` is the soft-update operation defined earlier in `_set_model`:

`self.update_op = [tf.assign(t, self.tau * e + (1 - self.tau) * t) for t, e in zip(t_params, e_params)]  # soft update`
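Numerically, the soft update makes each target weight an exponential moving average of the corresponding eval weight; a minimal illustration with scalars standing in for network parameters:

```python
tau = 0.05
e, t = 1.0, 0.0  # eval weight held fixed; target starts far away
for step in range(200):
    t = tau * e + (1 - tau) * t  # the same rule tf.assign applies per variable
print(round(t, 4))  # -> 1.0: after 200 updates t has essentially converged to e
```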

#### 3.4 Collecting Experience

```python
ep_experience = Episode_experience()      # ordinary experience
ep_experience_her = Episode_experience()  # HER (relabeled) experience
```

```python
for t in range(size):
    action = agent.choose_action([state], [goal])
    next_state, reward, done = env.step(action)
    ep_experience.add(state, action, reward, next_state, done, goal)
    state = next_state
    if done:
        break
```

```python
if use_her:  # the replay strategy can be changed here
    # goal = state  # "final" strategy: substitute goal = final state
    for t in range(len(ep_experience.memory)):
        for k in range(K):
            future = np.random.randint(t, len(ep_experience.memory))  # "future" strategy
            goal = ep_experience.memory[future][3]  # next_state of a future transition
            state = ep_experience.memory[t][0]
            action = ep_experience.memory[t][1]
            next_state = ep_experience.memory[t][3]
            done = np.array_equal(next_state, goal)
            reward = 0 if done else -1
            ep_experience_her.add(state, action, reward, next_state, done, goal)
```
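To see what relabeling buys us, here is a self-contained toy run of the same "future" strategy on a short random episode (the episode is generated inline with a random policy, so the numbers are illustrative only):

```python
import numpy as np

np.random.seed(1)
n, K = 8, 4
goal = np.random.randint(2, size=n)
state = np.random.randint(2, size=n)

# roll out a random policy, recording (state, action, reward, next_state, done, goal)
memory = []
for _ in range(n):
    action = np.random.randint(n)
    next_state = state.copy()
    next_state[action] = 1 - next_state[action]
    done = np.array_equal(next_state, goal)
    memory.append((state, action, 0 if done else -1, next_state, done, goal))
    state = next_state

# HER relabeling with the "future" strategy, exactly as in the snippet above
her_memory = []
for t in range(len(memory)):
    for _ in range(K):
        future = np.random.randint(t, len(memory))
        new_goal = memory[future][3]  # next_state of a future transition
        s, a, ns = memory[t][0], memory[t][1], memory[t][3]
        d = np.array_equal(ns, new_goal)
        her_memory.append((s, a, 0 if d else -1, ns, d, new_goal))

n_success = sum(1 for tr in memory if tr[2] == 0)
n_success_her = sum(1 for tr in her_memory if tr[2] == 0)
print(n_success, n_success_her)  # HER manufactures reward-0 transitions
                                 # even when the original episode has none
```

Whenever the sampled future index equals the current step, the relabeled transition is a guaranteed success, so the agent always gets some informative reward signal to learn from.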

#### 3.5 Training the Model

```python
agent.remember(ep_experience)
agent.remember(ep_experience_her)
ep_experience.clear()
ep_experience_her.clear()

mean_loss = agent.replay(optimization_steps)
agent.update_target_net()
```

# References

1. Original paper: https://arxiv.org/abs/1707.01495v1
2. 《强化学习精要：核心算法与Tensorflow实现》 (The Essence of Reinforcement Learning: Core Algorithms and TensorFlow Implementations)

GitHub repository

https://github.com/NeuronDance/DeepRL

