
PLANET + SAC: Code Implementation and Walkthrough


The code is already running experiments normally. What follows is the minimal-change approach I arrived at after several attempts:

This post adds SAC to PlaNet. I wrote up the detailed reasoning earlier; please read that first:

A detailed explanation of how to add SAC to the PLANET code (TensorFlow)

1. The data has two parts:

1.1 The random episodes collected at the start.

The original design of taking O1...O49 and O2...O50 has a problem: it can never sample the done = true case. So I let the env keep running for one more step after done: during data collection, when done occurs I set a stop flag, and the loop stops on the next iteration based on that flag instead of stopping at the done step itself. In short: if the done step is O99, the episode also collects O100; that is, collection continues up to the step right after done.

def random_episodes(env_ctor, num_episodes, output_dir=None):
  env = env_ctor()  # env is an <ExternalProcess object>.
  env = wrappers.CollectGymDataset(env, output_dir)
  episodes = []
  num_episodes = 5  # NOTE: hard-coded here, overriding the num_episodes argument.
  for _ in range(num_episodes):
    policy = lambda env, obs: env.action_space.sample()
    done = False
    stop = False
    obs = env.reset()
    while not stop:
      if done:
        stop = done  # stop one step *after* done, so the post-done step is also collected
      action = policy(env, obs)
      obs, _, done, info = env.step(action)
    # When done is True, info contains the 'episode' data, which is also written
    # to a file (e.g. "~/planet/log_debug/00001/test_episodes").
    episodes.append(info['episode'])
  return episodes

Here an episode is written to a file. To match the change above, a stop flag is added here as well, because I want to collect one more step after done; in other words, the last two steps of the collected data both have done = true.

This way, when I take o and o2 offset from each other by one step, o has a chance of landing on the done = true case.
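To make the indexing concrete, here is a tiny toy illustration (not from the PlaNet code, with made-up numbers) of the offset slicing and why the extra post-done step matters:

import numpy as np

# Toy episode of 5 observations where done fires at O4 and one extra
# step (O5) was collected after done, as described above.
obs = np.array([1, 2, 3, 4, 5])                     # O1..O5
done = np.array([False, False, False, True, True])

o = obs[:-1]    # O1..O4 -> now includes the done=True step
o2 = obs[1:]    # O2..O5 -> the "next observation" for each o
d = done[:-1]

# Without the extra post-done step, o would end at O3 and the
# done=True transition (O4 -> O5) could never be sampled.
for t in range(len(o)):
    print(o[t], o2[t], d[t])

The _process_step method below is where the episode (including that extra post-done step) actually gets written out.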

def _process_step(self, action, observ, reward, done, info):
  self._transition.update({'action': info['action'], 'reward': reward,'done': done})
  self._transition.update(info)

  #self._transition.update(self._process_observ_sac_next(observ)) # add o_next
  self._episode.append(self._transition)
  self._transition = {}

  if not self._stop:
    self._transition.update(self._process_observ(observ))
  else:
    episode = self._get_episode()
    info['episode'] = episode
    acc_reward = sum(episode['reward'])
    if self.step_error:
      print('step error... this episode will NOT be saved.')

    # control data collection...
    elif acc_reward > COLLECT_EPISODE:
      print('episode length: {}.  accumulative reward({}) > {}... this episode will NOT be saved.'.format(len(self._episode), acc_reward, COLLECT_EPISODE))

    elif self._outdir:
      print('episode length: {}.  accumulative reward({}) <= {}... this episode will be saved.'.format(len(self._episode), acc_reward, COLLECT_EPISODE))
      filename = self._get_filename()
      self._write(episode, filename)
  if done:
    self._stop = done
  return observ, reward, done, info

Also note that the stop flag must be reset to False on every reset:

def reset(self, *args, **kwargs):
  if kwargs.get('blocking', True):
    observ = self._env.reset(*args, **kwargs)       # carla: observ = {'state':..., 'image':array(...)}
    self._stop = False
    return self._process_reset(observ)

1.2 Collecting an episode of data at each collection step during training.

My earlier approach here was overly complicated: I wrote a whole new network for it. Later I found that the minimal-change way is to replace config.planner in the planning part with our SAC policy.

def perform(self, agent_indices, observ, info):
  observ = self._config.preprocess_fn(observ)
  embedded = self._config.encoder({'image': observ[:, None]})[:, 0]
  state = nested.map(
      lambda tensor: tf.gather(tensor, agent_indices),
      self._state)
  prev_action = self._prev_action + 0
  with tf.control_dependencies([prev_action]):
    use_obs = tf.ones(tf.shape(agent_indices), tf.bool)[:, None]
    _, state = self._cell((embedded, prev_action, use_obs), state)
  # The original planner call that the SAC policy replaces:
  # action = self._config.planner(
  #     self._cell, self._config.objective, state, info,
  #     embedded.shape[1:].as_list(),
  #     prev_action.shape[1:].as_list())
  feature = self._cell.features_from_state(state)  # [s, h]
  mu, pi, logp_pi, q1, q2, q1_pi, q2_pi = self._config.actor_critic(feature, prev_action)

  action = pi  # default: sample from the SAC policy if no extra exploration noise is configured
  if self._config.exploration:
    scale = self._config.exploration.scale
    if self._config.exploration.schedule:
      scale *= self._config.exploration.schedule(self._step)
    action = tfd.Normal(mu, scale).sample()
  action = tf.clip_by_value(action, -1, 1)
  remember_action = self._prev_action.assign(action)
  remember_state = nested.map(
      lambda var, val: tf.scatter_update(var, agent_indices, val),
      self._state, state, flatten=True)
  with tf.control_dependencies(remember_state + (remember_action,)):
    return tf.identity(action), tf.constant('')

Note that all of these functions have to be wired up properly in the config:

agent_config = tools.AttrDict(
    cell=cell,
    encoder=graph.encoder,
    planner=params.planner,
    actor_critic=graph.main_actor_critic,
    objective=functools.partial(params.objective, graph=graph),
    exploration=params.exploration,
    preprocess_fn=config.preprocess_fn,
    postprocess_fn=config.postprocess_fn)

The env used by random_episodes and the envs defined here in define_batch_env are created by the same function: envs = [env_ctor() for _ in range(num_agents)]. The only difference is that the batch version creates multiple envs, depending on num_agents.
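For orientation, here is a rough, simplified sketch (paraphrased, not the exact PlaNet source) of how the batched envs are built; BatchEnv and InGraphBatchEnv refer to the batching wrappers in the PlaNet control code:

# A rough, simplified sketch (not the exact PlaNet source) of define_batch_env:
# the same env_ctor that random_episodes calls once is called num_agents times
# here, and the resulting envs are wrapped so they can be stepped together.
def define_batch_env(env_ctor, num_agents):
  envs = [env_ctor() for _ in range(num_agents)]   # same constructor as in random_episodes
  env = batch_env.BatchEnv(envs, blocking=False)   # step all envs with a single call
  env = in_graph_batch_env.InGraphBatchEnv(env)    # expose reset/step as TF ops
  return env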

2. Model design:

The data above passes through the RNN to produce the corresponding features. Once [o, a, r, o2, d] are prepared, they can be fed into the SAC algorithm.

Note that the original losses are removed entirely: there are no train and test phases any more, and a sac phase is added instead.

Note: the execution order of the model is determined entirely by data dependencies and control dependencies. Once you grasp this fundamental rule, you can see how the whole machinery is chained together.
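As a minimal standalone illustration of that rule (not from the PlaNet code): tf.control_dependencies forces an op to run after the listed ops, even when no data flows between them.

import tensorflow as tf  # TF1-style graph mode, matching the code below

counter = tf.get_variable('counter', shape=(), dtype=tf.float32,
                          initializer=tf.zeros_initializer())
increment = counter.assign_add(1.0)

# `read` has no data dependency on `increment`, but the control dependency
# guarantees the increment runs first whenever `read` is evaluated.
with tf.control_dependencies([increment]):
  read = tf.identity(counter)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(read))  # 1.0 -- the assign_add ran before the read

With that in mind, here is the SAC part of the model graph: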

# data for sac
              features = tf.stop_gradient(features)  # stop gradient for features



              hidden_next= features[:, 1:] # s2,s3,s4.......s50
              hidden = features[:, :-1]  # s1,s2,s3.......s49
              reward = obs['reward'][:,:-1]
              action = obs['action'][:,:-1]
              done = obs['done'][:,:-1]

              done = tf.cast(done, dtype=tf.float32)

              reward = tf.reshape(reward,(-1,1))
              action = tf.reshape(action,(-1,2))
              done = tf.reshape(done, (-1, 1))
              hidden_next = tf.reshape(hidden_next, (-1, 250))
              hidden = tf.reshape(hidden, (-1, 250))

              x = tf.placeholder(dtype=tf.float32, shape=(None, 250))
              a = tf.placeholder(dtype=tf.float32, shape=(None, 2))
              mu, pi, logp_pi, q1, q2, q1_pi, q2_pi = main_actor_critic(hidden,action)
              _, _, logp_pi_, _, _,q1_pi_, q2_pi_ = target_actor_critic(hidden_next,action)
              _, pi_ep, _, _, _, _, _ = episode_actor_critic(x, a)


              target_init = tf.group([tf.assign(v_targ, v_main) for v_main, v_targ in
                                      zip(get_vars('main_actor_critic'), get_vars('target_actor_critic'))])

              if args.alpha == 'auto':
                  # Learn the temperature automatically; target entropy = -dim(A) for the 2-d action space.
                  target_entropy = (-np.prod([2, 1]))

                  log_alpha = tf.get_variable('log_alpha', dtype=tf.float32, initializer=1.0)
                  alpha = tf.exp(log_alpha)

                  alpha_loss = tf.reduce_mean(-log_alpha * tf.stop_gradient(logp_pi + target_entropy))

                  alpha_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr * 0.01, name='alpha_optimizer')
                  train_alpha_op = alpha_optimizer.minimize(loss=alpha_loss, var_list=[log_alpha])
              else:
                  alpha = args.alpha  # fixed temperature; using `alpha` below keeps both cases working

              min_q_pi = tf.minimum(q1_pi_, q2_pi_)

              # Targets for Q and V regression
              v_backup = tf.stop_gradient(min_q_pi - alpha * logp_pi)
              q_backup = reward + args.gamma * (1 - done) * v_backup

              # Soft actor-critic losses
              pi_loss = tf.reduce_mean(alpha * logp_pi - q1_pi)
              q1_loss = 0.5 * tf.reduce_mean((q_backup - q1) ** 2)
              q2_loss = 0.5 * tf.reduce_mean((q_backup - q2) ** 2)
              value_loss = q1_loss + q2_loss
              #loss = value_loss + pi_loss
              # Policy train op
              # (has to be separate from value train op, because q1_pi appears in pi_loss)
              pi_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr)
              train_pi_op = pi_optimizer.minimize(pi_loss, var_list=get_vars('main_actor_critic/pi'))

              # Value train op
              # (control dep of train_pi_op because sess.run otherwise evaluates in nondeterministic order)
              value_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr)
              value_params = get_vars('main_actor_critic/q')
              with tf.control_dependencies([train_pi_op]):
                  train_value_op = value_optimizer.minimize(value_loss, var_list=value_params)

              # Polyak averaging for target variables
              # (control flow because sess.run otherwise evaluates in nondeterministic order)
              with tf.control_dependencies([train_value_op]):
                  target_update = tf.group([tf.assign(v_targ, args.polyak * v_targ + (1 - args.polyak) * v_main)
                                            for v_main, v_targ in zip(get_vars('main_actor_critic'), get_vars('target_actor_critic'))])

              # All ops to call during one training step
              if isinstance(args.alpha, Number):
                  train_step_op = [pi_loss, q1_loss, q2_loss, q1, q2, logp_pi, tf.identity(args.alpha),
                                   train_pi_op, train_value_op, target_update]
              else:
                  # auto-tuned alpha: also run the temperature update
                  train_step_op = [pi_loss, q1_loss, q2_loss, q1, q2, logp_pi, alpha,
                                   train_pi_op, train_value_op, target_update, train_alpha_op]


              # train_step_op = tf.cond(
              #     tf.equal(phase, 'sac'),
              #     lambda: pi_loss,
              #     lambda: 0 * tf.get_variable('dummy_loss', (), tf.float32))



              with tf.control_dependencies(train_step_op):
                  train_summary = tf.constant('')





# for sac =================================== if the phase is set to 'sac', the train and test phases are never
# entered, so no planning is performed when collecting episode data.
  collect_summaries = []
  graph = tools.AttrDict(locals())
  with tf.variable_scope('collection'):
    should_collects = []
    for name, params in config.sim_collects.items():
      after, every = params.steps_after, params.steps_every
      should_collect = tf.logical_and(
          tf.equal(phase, 'sac'),
          tools.schedule.binary(step, config.batch_shape[0], after, every))
      collect_summary, score_train = tf.cond(
          should_collect,
          functools.partial(
              utility.simulate_episodes, config, params, graph, name),
          lambda: (tf.constant(''), tf.constant(0.0)),
          name='should_collect_' + params.task.name)
      should_collects.append(should_collect)
      collect_summaries.append(collect_summary)

  # Compute summaries.
  graph = tools.AttrDict(locals())
  with tf.control_dependencies(collect_summaries):
    # summaries, score = tf.cond(
    #     should_summarize,
    #     lambda: define_summaries.define_summaries(graph, config),
    #     lambda: (tf.constant(''), tf.zeros((0,), tf.float32)),
    #     name='summaries')
    summaries=tf.constant('')
    score=tf.zeros((0,), tf.float32)
  with tf.device('/cpu:0'):
    summaries = tf.summary.merge([summaries, train_summary])
    # summaries = tf.summary.merge([summaries, train_summary] + collect_summaries)

    dependencies.append(utility.print_metrics((
        ('score', score_train),
        ('pi_loss', pi_loss),
        ('value_loss', value_loss),
    ), step, config.mean_metrics_every))
  with tf.control_dependencies(dependencies):
    score = tf.identity(score)
  return score, summaries,target_init

3. The session.run part:

At each step the trainer checks which phase it is in; all we have to do is add the sac phase.

In the train function:

if config.sac_steps:
    trainer.add_phase(
        'sac', config.sac_steps, score, summary,
        batch_size=config.batch_shape[0],
        report_every=None,
        log_every=config.train_log_every,
        checkpoint_every=config.train_checkpoint_every)

(Note: the amount of code to change is not large; you just have to understand all of it before you know where the right places to change are.)

4. The checkpoint restore part:

There is also the question of how checkpoints are saved and restored.

4.1 At first, only restore the RNN weights

def _initialize_variables(self, sess, savers, logdirs, checkpoints):
  """Initialize or restore variables from a checkpoint if available.

  Args:
    sess: Session to initialize variables in.
    savers: List of savers to restore variables.
    logdirs: List of directories for each saver to search for checkpoints.
    checkpoints: List of checkpoint names for each saver; None for newest.
  """
  sess.run(tf.group(
      tf.local_variables_initializer(),
      tf.global_variables_initializer()))
  assert len(savers) == len(logdirs) == len(checkpoints)
  for i, (saver, logdir, checkpoint) in enumerate(
      zip(savers, logdirs, checkpoints)):
    logdir = os.path.expanduser(logdir)
    state = tf.train.get_checkpoint_state(logdir)
    if checkpoint:
      checkpoint = os.path.join(logdir, checkpoint)
    if not checkpoint and state and state.model_checkpoint_path:  # determine the checkpoint to restore.
      checkpoint = state.model_checkpoint_path
    if checkpoint:
      # Instead of restoring everything with saver.restore(sess, checkpoint),
      # restore only the encoder and RNN weights:
      variables_to_restore = tf.contrib.framework.get_variables_to_restore()
      variables_restore = [v for v in variables_to_restore if 'encoder' in v.name or 'rnn' in v.name]
      print('-------------------------------- variables to restore --------------------------------')
      print(variables_restore)
      saver1 = tf.train.Saver(variables_restore)
      saver1.restore(sess, checkpoint)

4.2 Then train for a while (the RNN weights receive no gradients)

Once the following line is in place, all the ops upstream of this node (the encoder and the RNN) no longer take part in backpropagation.

# data for sac
features = tf.stop_gradient(features)  # stop gradient for features
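A tiny standalone check (not from the PlaNet code) shows the effect: after tf.stop_gradient, tf.gradients returns None for the upstream variable, so the encoder/RNN weights are never updated by the SAC losses:

import tensorflow as tf  # TF1-style graph mode

w = tf.get_variable('w', shape=(), dtype=tf.float32,
                    initializer=tf.constant_initializer(2.0))
features = w * 3.0                      # stands in for the encoder/RNN output
features = tf.stop_gradient(features)   # cut the graph here, as in the snippet above
loss = tf.reduce_sum(features ** 2)

grad_w, = tf.gradients(loss, [w])
print(grad_w)  # None: no gradient reaches w, so the upstream weights stay frozen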

4.3 Start storing all current variables to checkpoints

The following runs inside the iterate function. The prerequisite is that you first configure how many steps to wait between checkpoints; the saver can also be set up to choose which variables to store.

if self._is_every_steps(
    phase_step, phase.batch_size, phase.checkpoint_every):
  for saver in self._savers:
    self._store_checkpoint(sess, saver, global_step)

4.4 Then you can rewrite the checkpoint file so that the whole graph is restored at initialization. Just replace the model listed below with the newly generated model you need.

model_checkpoint_path: "/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001/model.ckpt-30110130"
all_model_checkpoint_paths: "/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001/model.ckpt-30110130"
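If you prefer not to hand-edit the checkpoint file, the same pointer can also be rewritten programmatically; a minimal sketch, where new_ckpt is whichever newly generated checkpoint you want restored:

import tensorflow as tf

log_dir = '/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001'  # directory containing the 'checkpoint' file
new_ckpt = log_dir + '/model.ckpt-30110130'                         # point this at the model you need

# Rewrites the 'checkpoint' file so that the next restore picks up new_ckpt.
tf.train.update_checkpoint_state(log_dir, new_ckpt)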

All of the code has been pushed to the qiuforsac branch at this link:

https://github.com/createamind/PlanetA

If you have ideas or questions, the author's WeChat is Leslie27ch.

You are welcome to join us! For more content, see the CreateAMind official account menu.
