Q: RL - Stable Baselines using PyTorch - DQN: Why does CustomModel not learn?

Stack Overflow user

Asked 2022-02-20 21:55:24

1 answer, 364 views, 0 followers, 0 votes

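I want to use the Stable Baselines RL implementation with a custom model. I have simplified my case. I have three questions:

1. Why does it not learn to predict 2? Depending on the initialization, it predicts 4, 7, 3, ...
2. I assumed that CustomCombinedExtractor produces the final discrete prediction in its forward pass, so its output would have dimension 10, but Stable Baselines requires it to output a 64-dim vector. Why is that? Is a further model applied afterwards? How can I deactivate it?
3. "lr_schedule"? What sensible choices do we have?

Here is the code: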

import gym
from gym import spaces
from stable_baselines3 import DQN
from stable_baselines3.dqn import MultiInputPolicy
import numpy as np
import torch.nn as nn
import torch


class CustomEnv(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self):
        super(CustomEnv, self).__init__()
        self.action_space = spaces.Discrete(10)
        self.observation_space = spaces.Dict({
            "vector1": spaces.Box(low=0, high=10, shape=(10,), dtype=np.float32),
            "vector2": spaces.Box(low=0, high=10, shape=(10,), dtype=np.float32)
        })

    def obs(self):
        return dict({
            "vector1": 5*np.ones(10),
            "vector2": 5*np.ones(10)})

    def step(self, action):
        if action == 2:
            reward = 20
        else:
            reward = 0
        return self.obs(), reward, False, dict({})

    def reset(self):
        return self.obs()

    def render(self, mode='human'):
        return None

    def close(self):
        pass

env = CustomEnv()

class CustomCombinedExtractor(MultiInputPolicy):
    def __init__(self, observation_space, action_space, lr_schedule):
        super().__init__(observation_space, action_space, lr_schedule)

        extractors = {}

        total_concat_size = 0
        for key, subspace in observation_space.spaces.items():
            if key == "vector1":
                extractors[key] = nn.Linear(subspace.shape[0], 64)
                total_concat_size += 64
            elif key == "vector2":
                extractors[key] = nn.Linear(subspace.shape[0], 64)
                total_concat_size += 64

        self.extractors = nn.ModuleDict(extractors)
        self._features_dim = 1
        self.features_dim = 1

    def forward(self, observations):
        encoded_tensor_list = []

        x = self.extractors["vector1"](observations["vector1"])
        return x.T


def lr_schedule(x): return 1/x
policy_kwargs = dict(
    features_extractor_class=CustomCombinedExtractor,
    features_extractor_kwargs=dict(
        action_space=spaces.Discrete(10), lr_schedule=lr_schedule),
)

model = DQN(MultiInputPolicy, env, verbose=1,
            buffer_size=1000, policy_kwargs=policy_kwargs)

model.learn(total_timesteps=25000)
model.save("ppo_cartpole")

del model  # remove to demonstrate saving and loading

model = DQN.load("ppo_cartpole")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    print(action)
    obs, rewards, dones, info = env.step(action)
    env.render()

stable-baselines

pytorch

reinforcement-learning

1 Answer

Stack Overflow user

Answered 2022-07-10 11:36:52

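EDIT: I just checked your code again and saw the learning rate schedule: you are passing a function, namely 1/x. Here x is progress_remaining, which starts at 1.0 and goes to 0.0 as learning progresses. So at the end you may run into a division by zero (I am not sure whether progress_remaining actually reaches 0.0 at the end or whether training terminates before that). But the bigger issue is: have you checked which learning rate values this function actually produces? At the start of learning it is 1/1 = 1. By the end of learning it grows to infinity! Normal learning rates are between 5e-5 and 3e-3. So your learning rate is at least 2 orders of magnitude too high! 1/x is not a sensible function here (unless you change it to at least 1/(1-x)*m, where m scales the resulting value to be small enough; I guess that was your idea with 1/x). I suggest you start with a fixed standard learning rate such as 1e-4, and add an lr schedule once that works.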
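To see the scale of the problem, the values that 1/x yields can simply be tabulated. This is only a minimal sketch, assuming the schedule is called with progress_remaining running from 1.0 towards 0.0 as described above; `inverse_schedule` is the question's `lr_schedule` under a more descriptive name, and the sample points are arbitrary.

```python
def inverse_schedule(progress_remaining: float) -> float:
    """The schedule from the question: 1 / progress_remaining."""
    return 1.0 / progress_remaining


# Arbitrary sample points between the start (1.0) and near the end (0.001).
samples = [1.0, 0.5, 0.1, 0.01, 0.001]
rates = {p: inverse_schedule(p) for p in samples}

for p, lr in rates.items():
    print(f"progress_remaining={p:<6} learning_rate={lr:g}")

# A fixed rate inside the usual range, as suggested above.
FIXED_LEARNING_RATE = 1e-4
```

Every sampled value is 1 or larger, far above the usual range. With SB3, the fixed rate would be passed as `DQN(..., learning_rate=1e-4)` instead of a schedule function.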

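Another very strange thing in your code is that the observation is constant. If every one of the 10 actions it can take results in zero change in the observation, it probably cannot learn anything! It can get a reward for action == 2, but in your case it cannot derive anything from the state-action-reward relationship, because there is none. Another possible issue is that your episode (game) never ends! You always return done=False on every step(). It is better to end an episode/game with done=True at some point, to help the model evaluate its performance after a game and learn from it. (Another small suggestion: according to the docs you should normalize observations to -1, +1 (centered at 0), but that should not be the key issue here.)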
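The termination and normalization points could look roughly like the sketch below. It is not a drop-in replacement: `EpisodicEnvSketch`, `normalize`, and `MAX_STEPS` are hypothetical names, the episode length of 50 is an arbitrary choice, and the class only mirrors the reset/step interface used in the question without subclassing `gym.Env`. The observation is still constant, so this addresses only the two smaller issues.

```python
import numpy as np

MAX_STEPS = 50  # hypothetical episode length, chosen only for illustration


def normalize(obs, low=0.0, high=10.0):
    """Rescale values from [low, high] to [-1, 1]."""
    obs = np.asarray(obs, dtype=np.float32)
    return 2.0 * (obs - low) / (high - low) - 1.0


class EpisodicEnvSketch:
    """Mirrors the reset/step interface from the question (not a gym.Env)."""

    def __init__(self):
        self.steps = 0

    def obs(self):
        raw = 5 * np.ones(10, dtype=np.float32)
        return {"vector1": normalize(raw), "vector2": normalize(raw)}

    def reset(self):
        self.steps = 0
        return self.obs()

    def step(self, action):
        self.steps += 1
        reward = 20 if action == 2 else 0
        done = self.steps >= MAX_STEPS  # end the episode after MAX_STEPS steps
        return self.obs(), reward, done, {}
```

In a real `gym.Env`, the declared `observation_space` would also need its bounds changed to low=-1 and high=1 to match the normalized values.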

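In your case it probably makes more sense to combine the two vectors into a single list / 1D array, so that you do not need to create the CustomCombinedExtractor class at all.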
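One way to combine the two vectors, as a sketch: the helper name `flatten_observation` is hypothetical, and the keys are the ones from the question's observation dict.

```python
import numpy as np


def flatten_observation(obs: dict) -> np.ndarray:
    """Concatenate the two 10-element vectors into one 20-element array."""
    return np.concatenate([obs["vector1"], obs["vector2"]]).astype(np.float32)


obs = {"vector1": 5 * np.ones(10), "vector2": 5 * np.ones(10)}
flat = flatten_observation(obs)
print(flat.shape)  # (20,)
```

The environment's `observation_space` would then be a single `spaces.Box(low=0, high=10, shape=(20,), dtype=np.float32)`, and the model could be created with the built-in `"MlpPolicy"` instead of `MultiInputPolicy`.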

Regarding schedules, check this link in the docs for a linear schedule (decreasing from the initial value to 0 over all training steps): https://stable-baselines3.readthedocs.io/en/master/guide/examples.html?highlight=Linear%20schedule#learning-rate-schedule
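A linear schedule of that kind can be sketched in a few lines. This follows the general pattern of a function that takes progress_remaining and returns a learning rate; the initial value of 1e-3 is an assumption picked from within the usual range mentioned above, not a tuned setting.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start) to 0.0 (end of training).
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(1e-3)
print(lr(1.0), lr(0.5), lr(0.0))
```

The resulting callable would be passed as `DQN(..., learning_rate=linear_schedule(1e-3))`. Unlike 1/x, it never exceeds its initial value and has no division by zero at the end of training.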

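For more ideas, check this link: https://stable-baselines.readthedocs.io/en/master/common/schedules.html

If you want to play with sb3, check out this repo: https://github.com/DLR-RM/rl-baselines3-zoo, where you will also find tuned hyperparameters for different models.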


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71199158
