
Truncated BPTT PyTorch implementation problem

Stack Overflow user
Asked on 2020-12-07 10:17:52
1 answer · 211 views · 0 followers · 0 votes

I am trying to implement TBPTT (truncated backpropagation through time) in PyTorch.

I found the implementation below on a forum, and I understand the logic behind the code, but I keep getting an "inplace operation" error.

import time

import torch
import torch.nn as nn


class TBPTT():
    def __init__(self, one_step_module, loss_module, k1, k2, optimizer):
        self.one_step_module = one_step_module
        self.loss_module = loss_module
        self.k1 = k1
        self.k2 = k2
        self.retain_graph = k1 < k2
        # You can also remove all the optimizer code here, and the
        # train function will just accumulate all the gradients in
        # one_step_module parameters
        self.optimizer = optimizer

    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        for j, (inp, target) in enumerate(input_sequence):

            state = states[-1][1].detach()
            state.requires_grad=True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))

            while len(states) > self.k2:
                # Delete stuff that is too old
                del states[0]

            if (j+1)%self.k1 == 0:
                loss = self.loss_module(output, target)

                # note: this is the module-level `optimizer` defined below,
                # not self.optimizer
                optimizer.zero_grad()
                # backprop last module (keep graph only if they ever overlap)
                start = time.time()
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2-1):
                    # if we get all the way back to the "init_state", stop
                    if states[-i-2][0] is None:
                        break
                    curr_grad = states[-i-1][0].grad
                    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
                print("bw: {}".format(time.time()-start))
                optimizer.step()



seq_len = 20
layer_size = 50

idx = 0

class MyMod(nn.Module):
    def __init__(self):
        super(MyMod, self).__init__()
        self.lin = nn.Linear(2*layer_size, 2*layer_size)

    def forward(self, inp, state):
        global idx
        full_out = self.lin(torch.cat([inp, state], 1))
        # out, new_state = full_out.chunk(2, dim=1)
        out = full_out.narrow(1, 0, layer_size)
        new_state = full_out.narrow(1, layer_size, layer_size)
        def get_pr(idx_val):
            def pr(*args):
                print("doing backward {}".format(idx_val))
            return pr
        new_state.register_hook(get_pr(idx))
        out.register_hook(get_pr(idx))
        print("doing fw {}".format(idx))
        idx += 1
        return out, new_state


one_step_module = MyMod()
loss_module = nn.MSELoss()
input_sequence = [(torch.rand(200, layer_size), torch.rand(200, layer_size))] * seq_len

optimizer = torch.optim.SGD(one_step_module.parameters(), lr=1e-3)

runner = TBPTT(one_step_module, loss_module, 5, 7, optimizer)

runner.train(input_sequence, torch.zeros(200, layer_size))
print("done")

The strange thing is: when I first tried to run the code, I kept getting a different error, and after thorough investigation I found that some of the names, such as "one_step_module" and "input_sequence", were shadowing variables from an outer scope. After renaming those variables the code ran fine. Then I tried to modify the code further for my own project, and I started getting the "inplace operation" error. To see what had gone wrong, I reverted the code back to the original version above, but I kept getting the error. I even tried opening a new file and copy-pasting the implementation from scratch, but I still couldn't get the code to run. It's driving me crazy.

Here is the "inplace operation" error I get from the implementation above.

C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17385029792785645
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:100.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 100, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 59, in train
    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Process finished with exit code 1

In case you want to see the specific code that triggered the error, here is the error log with torch anomaly detection enabled (turned on as in the sketch below).
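Anomaly detection is not part of the code above; a minimal sketch of how it is typically enabled before calling runner.train (an assumption about the setup, based on the hint in the error message):

import torch

# Record forward-pass stack traces so that a failing backward points at the
# forward operation that produced the offending tensor.
torch.autograd.set_detect_anomaly(True)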

C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17083358764648438
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ..\c10\cuda\CUDAFunctions.cpp:100.)
  Variable._execution_engine.run_backward(
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 41, in train
    output, new_state = self.one_step_module(inp, state)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 78, in forward
    full_out = self.lin(torch.cat([inp111, state111], 1))
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
 (Triggered internally at  ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
    runner.train(input_sequence, torch.zeros(200, layer_size))
  File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 60, in train
    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The main problem is that the first iteration is fine, because the loss is computed only with the new hidden states and the detached, requires_grad=True states, but when the second iteration tries to backpropagate through the previous set of hidden states that have already been "backwarded", it raises the error. So in this case, after forwarding on t = 0,1,2,3,4 and t = 5,6,7,8,9, when it tries to backward on t = 9,8,7,6,5,4,3 (because k2 is 7), backward works fine for t = 9,8,7,6,5 but fails at t = 4. Can someone shed some light on this?
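The RuntimeError says that a saved tensor (the [100, 100] transposed weight of the 2*layer_size linear layer) is at version 2 while the retained graph expects version 1. One plausible reading (my interpretation, not stated in the original post) is that optimizer.step() after the first chunk updates self.lin.weight in place, so when the second chunk's backward reaches the retained graph from t = 4, the saved weight no longer matches. A minimal, self-contained sketch (independent of the code above) that reproduces this class of error:

import torch
import torch.nn as nn

# Build a graph, retain it, update the weights in place, then backpropagate
# through the retained graph again: autograd's version-counter check fires.
lin = nn.Linear(3, 3)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)

x = torch.randn(1, 3, requires_grad=True)  # analogous to the detached state with requires_grad=True
y = lin(x)                                  # lin.weight (transposed) is saved for backward
y.sum().backward(retain_graph=True)
opt.step()                                  # in-place update of lin.weight -> version bump

# This backward needs the weight as it was when the graph was built, and raises
# "one of the variables needed for gradient computation has been modified by an
# inplace operation".
y.sum().backward()

In the TBPTT loop above, retain_graph is True (k1 < k2), the optimizer steps at j = 4, and the overlapping backward of the second chunk then reaches the retained but now stale graph at t = 4, which would match the failure observed.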


1 Answer

Stack Overflow user

Answer accepted

Answered on 2020-12-09 11:09:00

After carefully going over the code, I found the bug. The problem was that, after "backward" had been run on the previous hidden states, the optimizer was trying to step on hidden states that had already been computed. I moved the optimizer out of the scope of the for loop and everything works fine!

I'm leaving this answer for anyone who is trying to implement truncated BPTT; one possible reading of the fix is sketched below.
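What "moving the optimizer out of the scope of the for loop" looks like is not spelled out above. As one possible reading (an assumption on my part, not the poster's exact code), here is a sketch of the train method in which the in-place parameter update is deferred until after the loop over the sequence, so no retained graph is ever backpropagated through weights that have already been modified:

    # Sketch of a modified TBPTT.train (assumes the rest of the class above).
    # Gradients are accumulated across chunks and applied once at the end.
    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        self.optimizer.zero_grad()
        for j, (inp, target) in enumerate(input_sequence):
            state = states[-1][1].detach()
            state.requires_grad = True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))

            while len(states) > self.k2:
                del states[0]  # drop states that are too old

            if (j + 1) % self.k1 == 0:
                loss = self.loss_module(output, target)
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2 - 1):
                    if states[-i-2][0] is None:
                        break  # reached the initial state
                    curr_grad = states[-i-1][0].grad
                    states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)

        # The in-place update happens only after all chunks have been
        # backpropagated, so retained graphs never see modified weights.
        self.optimizer.step()

Using self.optimizer here also avoids the outer-scope shadowing mentioned in the question.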

Votes: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/65175252
