
Truncated BPTT PyTorch implementation problem

Stack Overflow user
Asked on 2020-12-07 10:17:52
1 answer · 211 views · 0 followers · 0 votes

I am trying to implement TBPTT (truncated backpropagation through time) in PyTorch.

I found the implementation below on a forum, and I understand the logic behind the code, but I keep getting an "inplace operation" error.

import time

import torch
import torch.nn as nn


class TBPTT():
    def __init__(self, one_step_module, loss_module, k1, k2, optimizer):
        self.one_step_module = one_step_module
        self.loss_module = loss_module
        self.k1 = k1
        self.k2 = k2
        self.retain_graph = k1 < k2
        # You can also remove all the optimizer code here, and the
        # train function will just accumulate all the gradients in
        # one_step_module parameters
        self.optimizer = optimizer

    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        for j, (inp, target) in enumerate(input_sequence):

            state = states[-1][1].detach()
            state.requires_grad=True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))

            while len(states) > self.k2:
                # Delete stuff that is too old
                del states[0]

            if (j+1)%self.k1 == 0:
                loss = self.loss_module(output, target)

                # note: this is the module-level `optimizer` defined below,
                # not self.optimizer
                optimizer.zero_grad()
                # backprop last module (keep graph only if they ever overlap)
                start = time.time()
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2-1):
                    # if we get all the way back to the "init_state", stop
                    if states[-i-2][0] is None:
                        break
                    curr_grad = states[-i-1][0].grad
                    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
                print("bw: {}".format(time.time()-start))
                optimizer.step()



seq_len = 20
layer_size = 50

idx = 0

class MyMod(nn.Module):
    def __init__(self):
        super(MyMod, self).__init__()
        self.lin = nn.Linear(2*layer_size, 2*layer_size)

    def forward(self, inp, state):
        global idx
        full_out = self.lin(torch.cat([inp, state], 1))
        # out, new_state = full_out.chunk(2, dim=1)
        out = full_out.narrow(1, 0, layer_size)
        new_state = full_out.narrow(1, layer_size, layer_size)
        def get_pr(idx_val):
            def pr(*args):
                print("doing backward {}".format(idx_val))
            return pr
        new_state.register_hook(get_pr(idx))
        out.register_hook(get_pr(idx))
        print("doing fw {}".format(idx))
        idx += 1
        return out, new_state


one_step_module = MyMod()
loss_module = nn.MSELoss()
input_sequence = [(torch.rand(200, layer_size), torch.rand(200, layer_size))] * seq_len

optimizer = torch.optim.SGD(one_step_module.parameters(), lr=1e-3)

runner = TBPTT(one_step_module, loss_module, 5, 7, optimizer)

runner.train(input_sequence, torch.zeros(200, layer_size))
print("done")

The strange thing is: when I first tried to run the code, I kept getting a different error, and after thorough investigation I found that some of the names, such as "one_step_module" and "input_sequence", were shadowing variables from an outer scope. After renaming those variables the code ran fine. Then I tried to modify the code further for my own project, and I started getting the "inplace operation" error. To see what had gone wrong, I reverted the code back to the original version above, but I kept getting the error. I even tried opening a new file and copy-pasting the implementation from scratch, but I still couldn't get the code to run. It's driving me crazy.

Here is the "inplace operation" error I get from the implementation above.

C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17385029792785645
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:100.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 100, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 59, in train
    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Process finished with exit code 1

In case you want to see the specific code that triggered the error, here is the error log with torch anomaly detection enabled (turned on as in the sketch below).
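Anomaly detection is not part of the code above; a minimal sketch of how it is typically enabled before calling runner.train (an assumption about the setup, based on the hint in the error message):

import torch

# Record forward-pass stack traces so that a failing backward points at the
# forward operation that produced the offending tensor.
torch.autograd.set_detect_anomaly(True)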

C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17083358764648438
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:100.)
  Variable._execution_engine.run_backward(
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 41, in train
    output, new_state = self.one_step_module(inp, state)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 78, in forward
    full_out = self.lin(torch.cat([inp111, state111], 1))
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
 (Triggered internally at  ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 60, in train
    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The main problem is that the first iteration is fine, because the loss is computed only with the new hidden states and the detached, requires_grad=True states, but when the second iteration tries to backpropagate through the previous set of hidden states that have already been "backwarded", it raises the error. So in this case, after forwarding on t = 0,1,2,3,4 and t = 5,6,7,8,9, when it tries to backward on t = 9,8,7,6,5,4,3 (because k2 is 7), backward works fine for t = 9,8,7,6,5 but fails at t = 4. Can someone shed some light on this?
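The RuntimeError says that a saved tensor (the [100, 100] transposed weight of the 2*layer_size linear layer) is at version 2 while the retained graph expects version 1. One plausible reading (my interpretation, not stated in the original post) is that optimizer.step() after the first chunk updates self.lin.weight in place, so when the second chunk's backward reaches the retained graph from t = 4, the saved weight no longer matches. A minimal, self-contained sketch (independent of the code above) that reproduces this class of error:

import torch
import torch.nn as nn

# Build a graph, retain it, update the weights in place, then backpropagate
# through the retained graph again: autograd's version-counter check fires.
lin = nn.Linear(3, 3)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)

x = torch.randn(1, 3, requires_grad=True)  # analogous to the detached state with requires_grad=True
y = lin(x)                                  # lin.weight (transposed) is saved for backward
y.sum().backward(retain_graph=True)
opt.step()                                  # in-place update of lin.weight -> version bump

# This backward needs the weight as it was when the graph was built, and raises
# "one of the variables needed for gradient computation has been modified by an
# inplace operation".
y.sum().backward()

In the TBPTT loop above, retain_graph is True (k1 < k2), the optimizer steps at j = 4, and the overlapping backward of the second chunk then reaches the retained but now stale graph at t = 4, which would match the failure observed.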


1 Answer

Stack Overflow user

Answer accepted

Answered on 2020-12-09 11:09:00

After carefully going over the code, I found the bug. The problem was that, after "backward" had been run on the previous hidden states, the optimizer was trying to step on hidden states that had already been computed. I moved the optimizer out of the scope of the for loop and everything works fine!

I'm leaving this answer for anyone who is trying to implement truncated BPTT; one possible reading of the fix is sketched below.
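What "moving the optimizer out of the scope of the for loop" looks like is not spelled out above. As one possible reading (an assumption on my part, not the poster's exact code), here is a sketch of the train method in which the in-place parameter update is deferred until after the loop over the sequence, so no retained graph is ever backpropagated through weights that have already been modified:

    # Sketch of a modified TBPTT.train (assumes the rest of the class above).
    # Gradients are accumulated across chunks and applied once at the end.
    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        self.optimizer.zero_grad()
        for j, (inp, target) in enumerate(input_sequence):
            state = states[-1][1].detach()
            state.requires_grad = True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))

            while len(states) > self.k2:
                del states[0]  # drop states that are too old

            if (j + 1) % self.k1 == 0:
                loss = self.loss_module(output, target)
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2 - 1):
                    if states[-i-2][0] is None:
                        break  # reached the initial state
                    curr_grad = states[-i-1][0].grad
                    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)

        # The in-place update happens only after all chunks have been
        # backpropagated, so retained graphs never see modified weights.
        self.optimizer.step()

Using self.optimizer here also avoids the outer-scope shadowing mentioned in the question.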

Votes: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/65175252
