Correction: in the last figure of the previous post, "PyTorch Autograd Basics (Part 1)", the label on the red curve was wrong; it should read "b".
That post briefly covered how Autograd works. This one walks through a tiny model to show how gradients are computed and the weights updated after each training batch.
Let's first define the model's structure:
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.layer1 = torch.nn.Linear(20, 10)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(10, 3)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x
The model is simple: just two linear layers, layer1 and layer2, with a ReLU activation after layer1.
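As a quick sanity check, instantiating the class and printing it shows the layer structure (a sketch; the output is the standard repr that torch.nn.Module produces):

print(TinyModel())
# TinyModel(
#   (layer1): Linear(in_features=20, out_features=10, bias=True)
#   (relu): ReLU()
#   (layer2): Linear(in_features=10, out_features=3, bias=True)
# )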
Next we define constants for the batch size and the layer sizes, plus a random input and a random target output for the model:
BATCH_SIZE = 4
DIM_IN = 20
HIDDEN_SIZE = 10
DIM_OUT = 3
some_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)
model = TinyModel()
You may notice that we never set requires_grad=True on any layer. Because the model subclasses torch.nn.Module, every registered parameter is created with requires_grad=True by default; PyTorch assumes we want gradients tracked for all layer weights.
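A quick way to confirm this (a minimal sketch using the model we just built) is to iterate over the registered parameters and inspect their requires_grad flags:

for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# layer1.weight (10, 20) True
# layer1.bias (10,) True
# layer2.weight (3, 10) True
# layer2.bias (3,) True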
print("initial state")
print("layer2.weight: ", model.layer2.weight) # just a small slice
print("layer2.weight.grad: ", model.layer2.weight.grad, end="\n\n")
This prints layer2's initial weights; at this point the gradient is None.
initial state
layer2.weight: Parameter containing:
tensor([[ 0.0455, 0.1137, 0.0758, -0.1928, 0.1358, -0.2644, 0.2034, 0.2180,
-0.0228, -0.0099],
[ 0.1578, 0.0739, 0.1091, 0.0257, -0.2678, 0.0101, 0.0276, 0.1347,
-0.1745, -0.3097],
[-0.2996, 0.3080, 0.1106, -0.2018, 0.1867, -0.2209, -0.2722, 0.0876,
0.2186, -0.0047]], requires_grad=True)
layer2.weight.grad: None
Before training, let's clone these initial weights so the update doesn't overwrite our only copy; we'll need it for verification later.
l2_w0 = torch.clone(model.layer2.weight.data)  # deep copy of the weight tensor
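Why clone rather than plain assignment? A throwaway sketch (using a dummy tensor, not the model) shows that assignment merely aliases the same storage, while clone makes an independent copy:

t = torch.zeros(2)
alias = t           # same underlying storage as t
copy = t.clone()    # independent storage
t += 1.0            # in-place change
print(alias)        # tensor([1., 1.]) -- follows t
print(copy)         # tensor([0., 0.]) -- unaffected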
Next, define the learning rate and the optimizer, and run the first forward pass:
lr = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr)
prediction = model(some_input)
Compute the loss (here, the sum of squared errors against the target) and print it:
loss = (ideal_output - prediction).pow(2).sum()
print("the loss is", end=": ")
print(loss, end="\n\n")
the loss is: tensor(16.0445, grad_fn=<SumBackward0>)
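As an aside, this hand-rolled sum of squared errors matches PyTorch's built-in MSELoss when its reduction is set to "sum" rather than the default mean (a sketch; mse_sum is just a local name):

mse_sum = torch.nn.MSELoss(reduction="sum")
print(mse_sum(prediction, ideal_output))  # prints the same value as the manual loss above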
Now let's call loss.backward() and see what happens:
loss.backward()
print("after loss.backward()")
print("layer2.weight: ", model.layer2.weight) # just a small slice
print("layer2.weight.grad: ", model.layer2.weight.grad, end="\n\n")
after loss.backward()
layer2.weight: Parameter containing:
tensor([[ 0.0455, 0.1137, 0.0758, -0.1928, 0.1358, -0.2644, 0.2034, 0.2180,
-0.0228, -0.0099],
[ 0.1578, 0.0739, 0.1091, 0.0257, -0.2678, 0.0101, 0.0276, 0.1347,
-0.1745, -0.3097],
[-0.2996, 0.3080, 0.1106, -0.2018, 0.1867, -0.2209, -0.2722, 0.0876,
0.2186, -0.0047]], requires_grad=True)
layer2.weight.grad: tensor([[ 2.5977, 0.5576, 0.3378, -1.3451, -4.1733, -4.9013, 0.4293, 1.6931,
-0.0186, -1.0523],
[ 2.8229, 1.8159, -0.1136, 5.3649, 3.1599, 2.2919, 1.3979, -0.5267,
4.2920, -0.1978],
[-3.0656, -0.9230, -0.2846, -1.6104, 2.1102, 3.5630, -0.7106, -1.4640,
-1.9020, -0.2953]])
The weight's gradient has now been computed and is no longer None, but the weights themselves are unchanged: they still hold their initial values.
To actually update the weights, we must call optimizer.step(), which applies the computed gradients to the weights.
optimizer.step()
print("after optimizer.step() (第一次更新模型)")
print("layer2.weight: ", model.layer2.weight) # 只有optimizer.step()之后才会更新weight
print("layer2.weight.grad: ", model.layer2.weight.grad, end="\n\n")
after optimizer.step() (first model update)
layer2.weight: Parameter containing:
tensor([[ 0.0429, 0.1132, 0.0754, -0.1915, 0.1400, -0.2595, 0.2030, 0.2164,
-0.0228, -0.0089],
[ 0.1550, 0.0721, 0.1093, 0.0203, -0.2709, 0.0078, 0.0262, 0.1353,
-0.1788, -0.3095],
[-0.2965, 0.3089, 0.1109, -0.2002, 0.1846, -0.2245, -0.2715, 0.0891,
0.2205, -0.0044]], requires_grad=True)
layer2.weight.grad: tensor([[ 2.5977, 0.5576, 0.3378, -1.3451, -4.1733, -4.9013, 0.4293, 1.6931,
-0.0186, -1.0523],
[ 2.8229, 1.8159, -0.1136, 5.3649, 3.1599, 2.2919, 1.3979, -0.5267,
4.2920, -0.1978],
[-3.0656, -0.9230, -0.2846, -1.6104, 2.1102, 3.5630, -0.7106, -1.4640,
-1.9020, -0.2953]])
After optimizer.step(), the weights have received their first update. Note that optimizer.step() does not touch the gradients; they keep the values that backward() produced.
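For plain SGD (no momentum, no weight decay), optimizer.step() is roughly equivalent to the following hand-written update. This is a sketch for illustration only; running it here would apply a second update on top of the one step() just made:

with torch.no_grad():                 # don't record the update itself
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad  # in-place gradient-descent step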
Let's verify that the weights were updated exactly as we expect:
l2_w1 = l2_w0 - lr * model.layer2.weight.grad.data  # gradient-descent update
print(l2_w1)
print(l2_w1 == model.layer2.weight)
tensor([[ 0.0429, 0.1132, 0.0754, -0.1915, 0.1400, -0.2595, 0.2030, 0.2164,
-0.0228, -0.0089],
[ 0.1550, 0.0721, 0.1093, 0.0203, -0.2709, 0.0078, 0.0262, 0.1353,
-0.1788, -0.3095],
[-0.2965, 0.3089, 0.1109, -0.2002, 0.1846, -0.2245, -0.2715, 0.0891,
0.2205, -0.0044]])
tensor([[True, True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True, True]])
Just as expected:

weight_new = weight_old - learningRate * weight_old.grad

The minus sign is what makes this gradient descent, a numerical method for locating a minimum; see my earlier post "Gradient Descent" for background. (The strict elementwise == comparison happens to hold here because both sides perform the same floating-point arithmetic; in general, torch.allclose is the safer check.)
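To see the update rule in isolation, here is a toy sketch that minimizes f(w) = w**2 (gradient 2*w) with the same recipe of backward, update, and zeroed gradients:

w = torch.tensor([5.0], requires_grad=True)
for _ in range(3):
    f = (w ** 2).sum()    # forward: f(w) = w^2
    f.backward()          # df/dw = 2 * w
    with torch.no_grad():
        w -= 0.1 * w.grad # w_new = w_old - lr * grad
    w.grad.zero_()        # reset for the next iteration
    print(w.item())       # roughly 4.0, 3.2, 2.56 -- each step scales w by (1 - 2*lr) = 0.8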
Again, let's keep a copy of the once-updated weights so the next update doesn't overwrite it; we'll use it for the second verification.
l2_w1 = torch.clone(l2_w1)  # keep an independent copy for the next check
After every model update we must zero the gradients; otherwise each call to loss.backward() adds to the existing gradients instead of replacing them! (A short demonstration follows the output below.)
optimizer.zero_grad()  # zero the gradients; otherwise they accumulate across loss.backward() calls
print("after optimizer.zero_grad()")
print("layer2.weight.grad: ", model.layer2.weight.grad, end="\n\n")
after optimizer.zero_grad()
layer2.weight.grad: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
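Here is the promised demonstration, a self-contained sketch with a dummy tensor: two backward passes without zeroing in between make the gradients add up. (Recent PyTorch releases have optimizer.zero_grad() reset .grad to None by default via set_to_none=True, which serves the same purpose.)

v = torch.ones(3, requires_grad=True)
(v * 2).sum().backward()
print(v.grad)             # tensor([2., 2., 2.])
(v * 2).sum().backward()  # a fresh graph, same leaf tensor
print(v.grad)             # tensor([4., 4., 4.]) -- accumulated, not replaced
v.grad.zero_()            # what zeroing does per parameter
print(v.grad)             # tensor([0., 0., 0.])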
Now recompute the prediction and the loss, then call loss.backward() and optimizer.step() again, performing the second weight update:
# the model's weights have already been updated once
prediction = model(some_input)
loss = (ideal_output - prediction).pow(2).sum()
loss.backward()
# loss.backward()  # calling backward() again without recomputing the loss raises: RuntimeError: Trying to backward through the graph a second time
optimizer.step()
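Incidentally, if two backward passes over the same graph are genuinely needed, the first call must keep the graph alive. A minimal sketch, not part of the running example above:

loss.backward(retain_graph=True)  # preserve the graph for another pass
loss.backward()                   # now legal; the gradients accumulate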
Print the weights and gradients after the second update:
print("after optimizer.step() (第二次更新模型)")
print("layer2.weight: ", model.layer2.weight) # 只有optimizer.step()之后才会更新weight
print("layer2.weight.grad: ", model.layer2.weight.grad, end="\n\n")
after optimizer.step() (second model update)
layer2.weight: Parameter containing:
tensor([[ 0.0404, 0.1126, 0.0751, -0.1902, 0.1441, -0.2546, 0.2026, 0.2147,
-0.0229, -0.0078],
[ 0.1524, 0.0703, 0.1094, 0.0151, -0.2742, 0.0056, 0.0250, 0.1358,
-0.1831, -0.3093],
[-0.2936, 0.3098, 0.1112, -0.1987, 0.1825, -0.2280, -0.2709, 0.0905,
0.2224, -0.0041]], requires_grad=True)
layer2.weight.grad: tensor([[ 2.5134, 0.5676, 0.3401, -1.2741, -4.1204, -4.8144, 0.3902, 1.6643,
0.0844, -1.0339],
[ 2.5896, 1.8101, -0.1178, 5.2262, 3.2895, 2.2297, 1.2444, -0.5366,
4.3429, -0.2041],
[-2.8858, -0.9023, -0.2856, -1.5083, 2.0905, 3.5353, -0.6203, -1.4341,
-1.9003, -0.2885]])
Let's verify once more that the weights were updated as expected:
l2_w2 = l2_w1 - lr * model.layer2.weight.grad.data  # gradient-descent update
print(l2_w2)
print(l2_w2 == model.layer2.weight)
tensor([[ 0.0404, 0.1126, 0.0751, -0.1902, 0.1441, -0.2546, 0.2026, 0.2147,
-0.0229, -0.0078],
[ 0.1524, 0.0703, 0.1094, 0.0151, -0.2742, 0.0056, 0.0250, 0.1358,
-0.1831, -0.3093],
[-0.2936, 0.3098, 0.1112, -0.1987, 0.1825, -0.2280, -0.2709, 0.0905,
0.2224, -0.0041]])
tensor([[True, True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True, True]])
Again as expected, the same update rule still holds:

weight_new = weight_old - learningRate * weight_old.grad
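Putting the pieces together, the steps above form the canonical per-batch training loop. A closing sketch (num_batches is a hypothetical placeholder; real training would also draw a fresh batch each iteration):

for _ in range(num_batches):                         # num_batches: hypothetical iteration count
    optimizer.zero_grad()                            # 1. clear old gradients
    prediction = model(some_input)                   # 2. forward pass
    loss = (ideal_output - prediction).pow(2).sum()  # 3. compute the loss
    loss.backward()                                  # 4. backpropagate gradients
    optimizer.step()                                 # 5. gradient-descent update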