I'm trying to implement TBPTT (truncated backpropagation through time) in PyTorch.
I found the implementation below on a forum. I understand the logic behind the code, but I keep getting an "inplace operation" error.
import time

import torch
import torch.nn as nn


class TBPTT():
    def __init__(self, one_step_module, loss_module, k1, k2, optimizer):
        self.one_step_module = one_step_module
        self.loss_module = loss_module
        self.k1 = k1
        self.k2 = k2
        self.retain_graph = k1 < k2
        # You can also remove all the optimizer code here, and the
        # train function will just accumulate all the gradients in
        # one_step_module parameters
        self.optimizer = optimizer

    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        for j, (inp, target) in enumerate(input_sequence):
            state = states[-1][1].detach()
            state.requires_grad = True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))
            while len(states) > self.k2:
                # Delete stuff that is too old
                del states[0]

            if (j + 1) % self.k1 == 0:
                loss = self.loss_module(output, target)

                optimizer.zero_grad()
                # backprop last module (keep graph only if they ever overlap)
                start = time.time()
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2 - 1):
                    # if we get all the way back to the "init_state", stop
                    if states[-i - 2][0] is None:
                        break
                    curr_grad = states[-i - 1][0].grad
                    states[-i - 2][1].backward(curr_grad, retain_graph=self.retain_graph)
                print("bw: {}".format(time.time() - start))

                optimizer.step()

seq_len = 20
layer_size = 50

idx = 0


class MyMod(nn.Module):
    def __init__(self):
        super(MyMod, self).__init__()
        self.lin = nn.Linear(2 * layer_size, 2 * layer_size)

    def forward(self, inp, state):
        global idx
        full_out = self.lin(torch.cat([inp, state], 1))
        # out, new_state = full_out.chunk(2, dim=1)
        out = full_out.narrow(1, 0, layer_size)
        new_state = full_out.narrow(1, layer_size, layer_size)

        def get_pr(idx_val):
            def pr(*args):
                print("doing backward {}".format(idx_val))
            return pr

        new_state.register_hook(get_pr(idx))
        out.register_hook(get_pr(idx))
        print("doing fw {}".format(idx))
        idx += 1
        return out, new_state


one_step_module = MyMod()
loss_module = nn.MSELoss()
input_sequence = [(torch.rand(200, layer_size), torch.rand(200, layer_size))] * seq_len

optimizer = torch.optim.SGD(one_step_module.parameters(), lr=1e-3)

runner = TBPTT(one_step_module, loss_module, 5, 7, optimizer)

runner.train(input_sequence, torch.zeros(200, layer_size))
print("done")
The strange thing is: when I first tried to run the code, I kept getting a different error. After digging into it, I found that some names, such as one_step_module and input_sequence, were clashing with variables from an outer scope. After renaming those variables, the code ran fine. I then modified the code further for my own project, and that's when the "inplace operation" error started. To see where things went wrong, I reverted everything back to the original code above, but I keep getting the error. I even opened a new file and copy-pasted the implementation from scratch, and it still won't run. It's driving me crazy.
Here is the "inplace operation" error I get with the implementation above.
C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17385029792785645
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at ..\c10\cuda\CUDAFunctions.cpp:100.)
Variable._execution_engine.run_backward(
Traceback (most recent call last):
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 100, in <module>
runner.train(input_sequence, torch.zeros(200, layer_size))
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 59, in train
states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Process finished with exit code 1
In case you want to see exactly which forward call triggers the error, here is the error log with torch anomaly detection enabled.
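Anomaly detection is switched on with the single line that the error hint itself suggests, added near the top of the script:

import torch
torch.autograd.set_detect_anomaly(True)  # record the forward stack of each op so a failing backward points at the offending forward call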
C:\Users\bboyj\anaconda3\envs\jinkyu\python.exe C:/Users/bboyj/PycharmProjects/pythonProject/test1.py
doing fw 0
doing fw 1
doing fw 2
doing fw 3
doing fw 4
doing backward 4
doing backward 3
doing backward 2
doing backward 1
doing backward 0
bw: 0.17083358764648438
doing fw 5
doing fw 6
doing fw 7
doing fw 8
doing fw 9
doing backward 9
doing backward 8
doing backward 7
doing backward 6
doing backward 5
doing backward 4
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at ..\c10\cuda\CUDAFunctions.cpp:100.)
Variable._execution_engine.run_backward(
C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py:130: UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
runner.train(input_sequence, torch.zeros(200, layer_size))
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 41, in train
output, new_state = self.one_step_module(inp, state)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 78, in forward
full_out = self.lin(torch.cat([inp111, state111], 1))
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\nn\functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
(Triggered internally at ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward(
Traceback (most recent call last):
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 101, in <module>
runner.train(input_sequence, torch.zeros(200, layer_size))
File "C:/Users/bboyj/PycharmProjects/pythonProject/test1.py", line 60, in train
states[-i-2][1].backward(curr_grad, retain_graph=self.retain_graph)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\bboyj\anaconda3\envs\jinkyu\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 100]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The main issue is that the first iteration is fine, because the loss is computed only with the new hidden states and the detached, requires_grad=True states, but when the second iteration tries to backpropagate through the previous set of hidden states that have already been "backwarded", it raises the error. In this case, after forwarding on t = 0, 1, 2, 3, 4 and then on t = 5, 6, 7, 8, 9, the backward pass tries to go through t = 9, 8, 7, 6, 5, 4, 3 (because k2 is 7): it works fine for t = 9, 8, 7, 6, 5 but fails at t = 4. Can someone explain this issue?
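To make the overlap concrete, here is a small sketch (my own illustration, using the k1 = 5, k2 = 7 schedule from the code above) of which timesteps each backward pass walks through; the second pass revisits t = 4 and t = 3, whose retained graphs were already traversed by the first pass:

k1, k2, seq_len = 5, 7, 20
for j in range(seq_len):
    if (j + 1) % k1 == 0:
        # timesteps this backward pass covers, newest first (the init_state sits at t = -1)
        window = list(range(j, max(j - k2, -1), -1))
        print("backward after t={}: {}".format(j, window))

# backward after t=4: [4, 3, 2, 1, 0]
# backward after t=9: [9, 8, 7, 6, 5, 4, 3]
# backward after t=14: [14, 13, 12, 11, 10, 9, 8]
# backward after t=19: [19, 18, 17, 16, 15, 14, 13]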
Posted on 2020-12-09 11:09:00
After poring over the code, I found the bug. The problem is that the optimizer step modifies the weights in place after the backward pass on one window, while hidden states that were already computed (and are revisited by the next, overlapping backward pass) still need the old values in their graphs. I moved the optimizer out of the scope of the for loop and everything works fine!
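The mechanism can be reproduced in isolation. A minimal sketch (my own, not from the original post): backpropagating a second time through a retained graph after the linear layer's weight has been updated in place raises the same kind of version error on the transposed weight tensor.

import torch
import torch.nn as nn

lin = nn.Linear(4, 4)
x = torch.randn(1, 4, requires_grad=True)
y = lin(x).sum()

y.backward(retain_graph=True)   # first backward pass: fine
with torch.no_grad():
    lin.weight += 0.1           # in-place weight update, like optimizer.step()
y.backward()                    # second pass through the same graph -> RuntimeError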
I'm leaving this answer here for anyone trying to implement truncated BPTT.
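For reference, a sketch of the adjusted train() under that reading (my reconstruction, not the poster's exact code): gradients are accumulated across the truncated windows and self.optimizer steps only once per call, after every retained graph has been backpropagated through.

    def train(self, input_sequence, init_state):
        states = [(None, init_state)]
        self.optimizer.zero_grad()
        for j, (inp, target) in enumerate(input_sequence):
            state = states[-1][1].detach()
            state.requires_grad = True
            output, new_state = self.one_step_module(inp, state)
            states.append((state, new_state))
            while len(states) > self.k2:
                del states[0]  # drop states that are too old

            if (j + 1) % self.k1 == 0:
                loss = self.loss_module(output, target)
                loss.backward(retain_graph=self.retain_graph)
                for i in range(self.k2 - 1):
                    # stop once we reach the init_state
                    if states[-i - 2][0] is None:
                        break
                    curr_grad = states[-i - 1][0].grad
                    states[-i - 2][1].backward(curr_grad, retain_graph=self.retain_graph)

        # stepping only here means no weight is modified in place while an older,
        # retained graph still needs its saved (pre-update) value
        self.optimizer.step()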
https://stackoverflow.com/questions/65175252