
PLANET + SAC: Code Implementation and Walkthrough


The code is already running experiments normally. What follows is the minimal-change approach I arrived at after several attempts:

This post adds SAC to PlaNet. I wrote up the detailed reasoning earlier; please read that first:

A detailed explanation of how to add SAC to the PLANET code (TensorFlow)

1. The data has two parts:

1.1 The random episodes collected at the start.

The original design of taking O1...O49 and O2...O50 has a problem: it can never sample the done = true case. So I let the env keep running for one more step after done: during data collection, when done occurs I set a stop flag, and the loop stops on the next iteration based on that flag instead of stopping at the done step itself. In short: if the done step is O99, the episode also collects O100; that is, collection continues up to the step right after done.

def random_episodes(env_ctor, num_episodes, output_dir=None):
  env = env_ctor()  # env is an <ExternalProcess object>.
  env = wrappers.CollectGymDataset(env, output_dir)
  episodes = []
  num_episodes = 5  # NOTE: hard-coded here, overriding the num_episodes argument.
  for _ in range(num_episodes):
    policy = lambda env, obs: env.action_space.sample()
    done = False
    stop = False
    obs = env.reset()
    while not stop:
      if done:
        stop = done  # stop one step *after* done, so the post-done step is also collected
      action = policy(env, obs)
      obs, _, done, info = env.step(action)
    # When done is True, info contains the 'episode' data, which is also written
    # to a file (e.g. "~/planet/log_debug/00001/test_episodes").
    episodes.append(info['episode'])
  return episodes

Here an episode is written to a file. To match the change above, a stop flag is added here as well, because I want to collect one more step after done; in other words, the last two steps of the collected data both have done = true.

This way, when I take o and o2 offset from each other by one step, o has a chance of landing on the done = true case.
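To make the indexing concrete, here is a tiny toy illustration (not from the PlaNet code, with made-up numbers) of the offset slicing and why the extra post-done step matters:

import numpy as np

# Toy episode of 5 observations where done fires at O4 and one extra
# step (O5) was collected after done, as described above.
obs = np.array([1, 2, 3, 4, 5])                     # O1..O5
done = np.array([False, False, False, True, True])

o = obs[:-1]    # O1..O4 -> now includes the done=True step
o2 = obs[1:]    # O2..O5 -> the "next observation" for each o
d = done[:-1]

# Without the extra post-done step, o would end at O3 and the
# done=True transition (O4 -> O5) could never be sampled.
for t in range(len(o)):
    print(o[t], o2[t], d[t])

The _process_step method below is where the episode (including that extra post-done step) actually gets written out.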

def _process_step(self, action, observ, reward, done, info):
  self._transition.update({'action': info['action'], 'reward': reward,'done': done})
  self._transition.update(info)

  #self._transition.update(self._process_observ_sac_next(observ)) # add o_next
  self._episode.append(self._transition)
  self._transition = {}

  if not self._stop:
    self._transition.update(self._process_observ(observ))
  else:
    episode = self._get_episode()
    info['episode'] = episode
    acc_reward = sum(episode['reward'])
    if self.step_error:
      print('step error... this episode will NOT be saved.')

    # control data collection...
    elif acc_reward > COLLECT_EPISODE:
      print('episode length: {}.  accumulative reward({}) > {}... this episode will NOT be saved.'.format(len(self._episode), acc_reward, COLLECT_EPISODE))

    elif self._outdir:
      print('episode length: {}.  accumulative reward({}) <= {}... this episode will be saved.'.format(len(self._episode), acc_reward, COLLECT_EPISODE))
      filename = self._get_filename()
      self._write(episode, filename)
  if done:
    self._stop = done
  return observ, reward, done, info

Also note that the stop flag must be reset to False on every reset:

def reset(self, *args, **kwargs):
  if kwargs.get('blocking', True):
    observ = self._env.reset(*args, **kwargs)       # carla: observ = {'state':..., 'image':array(...)}
    self._stop = False
    return self._process_reset(observ)

1.2 Collecting an episode of data at each collection step during training.

My earlier approach here was overly complicated: I wrote a whole new network for it. Later I found that the minimal-change way is to replace config.planner in the planning part with our SAC policy.

def perform(self, agent_indices, observ, info):
  observ = self._config.preprocess_fn(observ)
  embedded = self._config.encoder({'image': observ[:, None]})[:, 0]
  state = nested.map(
      lambda tensor: tf.gather(tensor, agent_indices),
      self._state)
  prev_action = self._prev_action + 0
  with tf.control_dependencies([prev_action]):
    use_obs = tf.ones(tf.shape(agent_indices), tf.bool)[:, None]
    _, state = self._cell((embedded, prev_action, use_obs), state)
  # The original planner call that the SAC policy replaces:
  # action = self._config.planner(
  #     self._cell, self._config.objective, state, info,
  #     embedded.shape[1:].as_list(),
  #     prev_action.shape[1:].as_list())
  feature = self._cell.features_from_state(state)  # [s, h]
  mu, pi, logp_pi, q1, q2, q1_pi, q2_pi = self._config.actor_critic(feature, prev_action)

  action = pi  # default: sample from the SAC policy if no extra exploration noise is configured
  if self._config.exploration:
    scale = self._config.exploration.scale
    if self._config.exploration.schedule:
      scale *= self._config.exploration.schedule(self._step)
    action = tfd.Normal(mu, scale).sample()
  action = tf.clip_by_value(action, -1, 1)
  remember_action = self._prev_action.assign(action)
  remember_state = nested.map(
      lambda var, val: tf.scatter_update(var, agent_indices, val),
      self._state, state, flatten=True)
  with tf.control_dependencies(remember_state + (remember_action,)):
    return tf.identity(action), tf.constant('')

Note that all of these functions have to be wired up properly in the config:

agent_config = tools.AttrDict(
    cell=cell,
    encoder=graph.encoder,
    planner=params.planner,
    actor_critic=graph.main_actor_critic,
    objective=functools.partial(params.objective, graph=graph),
    exploration=params.exploration,
    preprocess_fn=config.preprocess_fn,
    postprocess_fn=config.postprocess_fn)

The env used by random_episodes and the envs defined here in define_batch_env are created by the same function: envs = [env_ctor() for _ in range(num_agents)]. The only difference is that the batch version creates multiple envs, depending on num_agents.
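For orientation, here is a rough, simplified sketch (paraphrased, not the exact PlaNet source) of how the batched envs are built; BatchEnv and InGraphBatchEnv refer to the batching wrappers in the PlaNet control code:

# A rough, simplified sketch (not the exact PlaNet source) of define_batch_env:
# the same env_ctor that random_episodes calls once is called num_agents times
# here, and the resulting envs are wrapped so they can be stepped together.
def define_batch_env(env_ctor, num_agents):
  envs = [env_ctor() for _ in range(num_agents)]   # same constructor as in random_episodes
  env = batch_env.BatchEnv(envs, blocking=False)   # step all envs with a single call
  env = in_graph_batch_env.InGraphBatchEnv(env)    # expose reset/step as TF ops
  return env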

2. Model design:

The data above passes through the RNN to produce the corresponding features. Once [o, a, r, o2, d] are prepared, they can be fed into the SAC algorithm.

Note that the original losses are removed entirely: there are no train and test phases any more, and a sac phase is added instead.

Note: the execution order of the model is determined entirely by data dependencies and control dependencies. Once you grasp this fundamental rule, you can see how the whole machinery is chained together.
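As a minimal standalone illustration of that rule (not from the PlaNet code): tf.control_dependencies forces an op to run after the listed ops, even when no data flows between them.

import tensorflow as tf  # TF1-style graph mode, matching the code below

counter = tf.get_variable('counter', shape=(), dtype=tf.float32,
                          initializer=tf.zeros_initializer())
increment = counter.assign_add(1.0)

# `read` has no data dependency on `increment`, but the control dependency
# guarantees the increment runs first whenever `read` is evaluated.
with tf.control_dependencies([increment]):
  read = tf.identity(counter)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(read))  # 1.0 -- the assign_add ran before the read

With that in mind, here is the SAC part of the model graph: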

# data for sac
              features = tf.stop_gradient(features)  # stop gradient for features



              hidden_next= features[:, 1:] # s2,s3,s4.......s50
              hidden = features[:, :-1]  # s1,s2,s3.......s49
              reward = obs['reward'][:,:-1]
              action = obs['action'][:,:-1]
              done = obs['done'][:,:-1]

              done = tf.cast(done, dtype=tf.float32)

              reward = tf.reshape(reward,(-1,1))
              action = tf.reshape(action,(-1,2))
              done = tf.reshape(done, (-1, 1))
              hidden_next = tf.reshape(hidden_next, (-1, 250))
              hidden = tf.reshape(hidden, (-1, 250))

              x = tf.placeholder(dtype=tf.float32, shape=(None, 250))
              a = tf.placeholder(dtype=tf.float32, shape=(None, 2))
              mu, pi, logp_pi, q1, q2, q1_pi, q2_pi = main_actor_critic(hidden,action)
              _, _, logp_pi_, _, _,q1_pi_, q2_pi_ = target_actor_critic(hidden_next,action)
              _, pi_ep, _, _, _, _, _ = episode_actor_critic(x, a)


              target_init = tf.group([tf.assign(v_targ, v_main) for v_main, v_targ in
                                      zip(get_vars('main_actor_critic'), get_vars('target_actor_critic'))])

              if args.alpha == 'auto':
                  # Learn the temperature automatically; target entropy = -dim(A) for the 2-d action space.
                  target_entropy = (-np.prod([2, 1]))

                  log_alpha = tf.get_variable('log_alpha', dtype=tf.float32, initializer=1.0)
                  alpha = tf.exp(log_alpha)

                  alpha_loss = tf.reduce_mean(-log_alpha * tf.stop_gradient(logp_pi + target_entropy))

                  alpha_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr * 0.01, name='alpha_optimizer')
                  train_alpha_op = alpha_optimizer.minimize(loss=alpha_loss, var_list=[log_alpha])
              else:
                  alpha = args.alpha  # fixed temperature; using `alpha` below keeps both cases working

              min_q_pi = tf.minimum(q1_pi_, q2_pi_)

              # Targets for Q and V regression
              v_backup = tf.stop_gradient(min_q_pi - alpha * logp_pi)
              q_backup = reward + args.gamma * (1 - done) * v_backup

              # Soft actor-critic losses
              pi_loss = tf.reduce_mean(alpha * logp_pi - q1_pi)
              q1_loss = 0.5 * tf.reduce_mean((q_backup - q1) ** 2)
              q2_loss = 0.5 * tf.reduce_mean((q_backup - q2) ** 2)
              value_loss = q1_loss + q2_loss
              #loss = value_loss + pi_loss
              # Policy train op
              # (has to be separate from value train op, because q1_pi appears in pi_loss)
              pi_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr)
              train_pi_op = pi_optimizer.minimize(pi_loss, var_list=get_vars('main_actor_critic/pi'))

              # Value train op
              # (control dep of train_pi_op because sess.run otherwise evaluates in nondeterministic order)
              value_optimizer = tf.train.AdamOptimizer(learning_rate=args.lr)
              value_params = get_vars('main_actor_critic/q')
              with tf.control_dependencies([train_pi_op]):
                  train_value_op = value_optimizer.minimize(value_loss, var_list=value_params)

              # Polyak averaging for target variables
              # (control flow because sess.run otherwise evaluates in nondeterministic order)
              with tf.control_dependencies([train_value_op]):
                  target_update = tf.group([tf.assign(v_targ, args.polyak * v_targ + (1 - args.polyak) * v_main)
                                            for v_main, v_targ in zip(get_vars('main_actor_critic'), get_vars('target_actor_critic'))])

              # All ops to call during one training step
              if isinstance(args.alpha, Number):
                  train_step_op = [pi_loss, q1_loss, q2_loss, q1, q2, logp_pi, tf.identity(args.alpha),
                                   train_pi_op, train_value_op, target_update]
              else:
                  # auto-tuned alpha: also run the temperature update
                  train_step_op = [pi_loss, q1_loss, q2_loss, q1, q2, logp_pi, alpha,
                                   train_pi_op, train_value_op, target_update, train_alpha_op]


              # train_step_op = tf.cond(
              #     tf.equal(phase, 'sac'),
              #     lambda: pi_loss,
              #     lambda: 0 * tf.get_variable('dummy_loss', (), tf.float32))



              with tf.control_dependencies(train_step_op):
                  train_summary = tf.constant('')





# for sac =================================== if the phase is set to 'sac', the train and test phases are never
# entered, so no planning is performed when collecting episode data.
  collect_summaries = []
  graph = tools.AttrDict(locals())
  with tf.variable_scope('collection'):
    should_collects = []
    for name, params in config.sim_collects.items():
      after, every = params.steps_after, params.steps_every
      should_collect = tf.logical_and(
          tf.equal(phase, 'sac'),
          tools.schedule.binary(step, config.batch_shape[0], after, every))
      collect_summary, score_train = tf.cond(
          should_collect,
          functools.partial(
              utility.simulate_episodes, config, params, graph, name),
          lambda: (tf.constant(''), tf.constant(0.0)),
          name='should_collect_' + params.task.name)
      should_collects.append(should_collect)
      collect_summaries.append(collect_summary)

  # Compute summaries.
  graph = tools.AttrDict(locals())
  with tf.control_dependencies(collect_summaries):
    # summaries, score = tf.cond(
    #     should_summarize,
    #     lambda: define_summaries.define_summaries(graph, config),
    #     lambda: (tf.constant(''), tf.zeros((0,), tf.float32)),
    #     name='summaries')
    summaries=tf.constant('')
    score=tf.zeros((0,), tf.float32)
  with tf.device('/cpu:0'):
    summaries = tf.summary.merge([summaries, train_summary])
    # summaries = tf.summary.merge([summaries, train_summary] + collect_summaries)

    dependencies.append(utility.print_metrics((
        ('score', score_train),
        ('pi_loss', pi_loss),
        ('value_loss', value_loss),
    ), step, config.mean_metrics_every))
  with tf.control_dependencies(dependencies):
    score = tf.identity(score)
  return score, summaries,target_init

3. The session.run part:

At each step the trainer checks which phase it is in; all we have to do is add the sac phase.

In the train function:

if config.sac_steps:
    trainer.add_phase(
        'sac', config.sac_steps, score, summary,
        batch_size=config.batch_shape[0],
        report_every=None,
        log_every=config.train_log_every,
        checkpoint_every=config.train_checkpoint_every)

(Note: the amount of code to change is not large; you just have to understand all of it before you know where the right places to change are.)

4. The checkpoint restore part:

There is also the question of how checkpoints are saved and restored.

4.1 At first, only restore the RNN weights

def _initialize_variables(self, sess, savers, logdirs, checkpoints):
  """Initialize or restore variables from a checkpoint if available.

  Args:
    sess: Session to initialize variables in.
    savers: List of savers to restore variables.
    logdirs: List of directories for each saver to search for checkpoints.
    checkpoints: List of checkpoint names for each saver; None for newest.
  """
  sess.run(tf.group(
      tf.local_variables_initializer(),
      tf.global_variables_initializer()))
  assert len(savers) == len(logdirs) == len(checkpoints)
  for i, (saver, logdir, checkpoint) in enumerate(
      zip(savers, logdirs, checkpoints)):
    logdir = os.path.expanduser(logdir)
    state = tf.train.get_checkpoint_state(logdir)
    if checkpoint:
      checkpoint = os.path.join(logdir, checkpoint)
    if not checkpoint and state and state.model_checkpoint_path:  # determine the checkpoint to restore.
      checkpoint = state.model_checkpoint_path
    if checkpoint:
      # Instead of restoring everything with saver.restore(sess, checkpoint),
      # restore only the encoder and RNN weights:
      variables_to_restore = tf.contrib.framework.get_variables_to_restore()
      variables_restore = [v for v in variables_to_restore if 'encoder' in v.name or 'rnn' in v.name]
      print('-------------------------------- variables to restore --------------------------------')
      print(variables_restore)
      saver1 = tf.train.Saver(variables_restore)
      saver1.restore(sess, checkpoint)

4.2 Then train for a while (the RNN weights receive no gradients)

Once the following line is in place, all the ops upstream of this node (the encoder and the RNN) no longer take part in backpropagation.

# data for sac
features = tf.stop_gradient(features)  # stop gradient for features
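A tiny standalone check (not from the PlaNet code) shows the effect: after tf.stop_gradient, tf.gradients returns None for the upstream variable, so the encoder/RNN weights are never updated by the SAC losses:

import tensorflow as tf  # TF1-style graph mode

w = tf.get_variable('w', shape=(), dtype=tf.float32,
                    initializer=tf.constant_initializer(2.0))
features = w * 3.0                      # stands in for the encoder/RNN output
features = tf.stop_gradient(features)   # cut the graph here, as in the snippet above
loss = tf.reduce_sum(features ** 2)

grad_w, = tf.gradients(loss, [w])
print(grad_w)  # None: no gradient reaches w, so the upstream weights stay frozen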

4.3 Start storing all current variables to checkpoints

The following runs inside the iterate function. The prerequisite is that you first configure how many steps to wait between checkpoints; the saver can also be set up to choose which variables to store.

if self._is_every_steps(
    phase_step, phase.batch_size, phase.checkpoint_every):
  for saver in self._savers:
    self._store_checkpoint(sess, saver, global_step)

4.4 Then you can rewrite the checkpoint file so that the whole graph is restored at initialization. Just replace the model listed below with the newly generated model you need.

model_checkpoint_path: "/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001/model.ckpt-30110130"
all_model_checkpoint_paths: "/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001/model.ckpt-30110130"
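If you prefer not to hand-edit the checkpoint file, the same pointer can also be rewritten programmatically; a minimal sketch, where new_ckpt is whichever newly generated checkpoint you want restored:

import tensorflow as tf

log_dir = '/home/lei/Data/planet/carla_70_30_rgb_car_0.0_E3/00001'  # directory containing the 'checkpoint' file
new_ckpt = log_dir + '/model.ckpt-30110130'                         # point this at the model you need

# Rewrites the 'checkpoint' file so that the next restore picks up new_ckpt.
tf.train.update_checkpoint_state(log_dir, new_ckpt)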

All of the code has been pushed to the qiuforsac branch at this link:

https://github.com/createamind/PlanetA

If you have ideas or questions, the author's WeChat is Leslie27ch.

You are welcome to join us! For more content, see the CreateAMind official account menu.
