Happy New Year, everyone! Below is my Python implementation of the gradient descent algorithm (univariate linear regression) from Andrew Ng's course, with an optional matplotlib visualization that creates a GIF. Any optimizations/suggestions are welcome.
For those who don't know what gradient descent is: gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a function. To find a local minimum using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. Conversely, taking steps proportional to the positive of the gradient leads toward a local maximum of the function. You can look up the definition on Wikipedia.
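The step rule described above can be sketched on a one-variable toy function. The function, starting point, and learning rate below are illustrative choices, not part of the post:

```python
def gradient_descent_1d(df, x, learning_rate, steps):
    """Repeatedly step against the gradient df from starting point x."""
    for _ in range(steps):
        x = x - learning_rate * df(x)  # step proportional to the negative gradient
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0, learning_rate=0.1, steps=100)
print(round(minimum, 4))  # converges toward 3.0
```

With this learning rate the error shrinks by a constant factor each step, so 100 iterations are more than enough to converge here.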
Links to the datasets used:
https://drive.google.com/open?id=1tztcXVillZTrbPeeCd28djRooM5nkiBZ
https://drive.google.com/open?id=17ZQ4TLA7ThtU-3J-G108a1fzCH72nSFp
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
import imageio
import os
def compute_cost(b, m, data):
    """
    Compute the cost function for univariate linear regression using mean squared error (MSE).
    Args:
        b: Intercept (y = mx + b).
        m: Slope (y = mx + b).
        data: A pandas df with x and y data.
    Return:
        Cost function value.
    """
    data['Mean Squared Error(MSE)'] = (data['Y'] - (m * data['X'] + b)) ** 2
    return data['Mean Squared Error(MSE)'].mean()
def adjust_gradient(b_current, m_current, data, learning_rate):
    """
    Adjust the Theta parameters for the univariate linear equation y^ = hθ(x) = θ0 + θ1x.
    Args:
        b_current: Current intercept (y = mx + b), i.e. Theta(0) / θ0.
        m_current: Current slope (y = mx + b), i.e. Theta(1) / θ1.
        data: A pandas df with x and y data.
        learning_rate: Alpha value.
    Return:
        Adjusted Theta parameters.
    """
    data['b Gradient'] = -(2 / len(data)) * (data['Y'] - ((m_current * data['X']) + b_current))
    data['m Gradient'] = -(2 / len(data)) * data['X'] * (data['Y'] - ((m_current * data['X']) + b_current))
    new_b = b_current - (data['b Gradient'].sum() * learning_rate)
    new_m = m_current - (data['m Gradient'].sum() * learning_rate)
    return new_b, new_m
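As a quick sanity check of the gradient expression in adjust_gradient, the analytic b gradient can be compared against a central finite-difference estimate of the MSE cost. The tiny dataset and test point below are made up for illustration:

```python
import pandas as pd

data = pd.DataFrame({'X': [1.0, 2.0, 3.0], 'Y': [2.0, 4.0, 6.0]})

def mse(b, m):
    # Mean squared error of the line y = m*x + b on the toy data.
    return ((data['Y'] - (m * data['X'] + b)) ** 2).mean()

b, m, eps = 0.5, 0.5, 1e-6
analytic = (-(2 / len(data)) * (data['Y'] - (m * data['X'] + b))).sum()  # same formula as above
numeric = (mse(b + eps, m) - mse(b - eps, m)) / (2 * eps)                # central difference
print(abs(analytic - numeric) < 1e-4)  # True: the two estimates agree
```

The same check works for the m gradient by perturbing m instead of b.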
def gradient_descent(data, b, m, learning_rate, max_iter, visual=False):
    """
    Optimize the Theta values for the univariate linear regression equation y^ = hθ(x) = θ0 + θ1x.
    Args:
        data: A pandas df with x and y data.
        b: Starting b (θ0) value.
        m: Starting m (θ1) value.
        learning_rate: Alpha value.
        max_iter: Maximum number of iterations.
        visual: If True, a GIF progression will be generated.
    Return:
        Optimized values for θ0 and θ1.
    """
    line = np.arange(len(data))
    folder_name = None
    if visual:
        folder_name = str(random.randint(10 ** 6, 10 ** 8))
        os.mkdir(folder_name)
        os.chdir(folder_name)
    for i in range(max_iter):
        b, m = adjust_gradient(b, m, data, learning_rate)
        if visual:
            data['Line'] = (line * m) + b
            data.plot(kind='scatter', x='X', y='Y', figsize=(8, 8), marker='x', color='r')
            plt.plot(data['Line'], color='b')
            plt.grid()
            plt.title(f'y = {m}x + {b}\nCurrent cost: {compute_cost(b, m, data)}\nIteration: {i}\n'
                      f'Alpha = {learning_rate}')
            fig_name = ''.join([str(i), '.png'])
            plt.savefig(fig_name)
            plt.close()
    if visual:
        frames = os.listdir('.')
        frames.sort(key=lambda x: int(x.split('.')[0]))
        frames = [imageio.imread(frame) for frame in frames]
        imageio.mimsave(folder_name + '.gif', frames)
    return b, m
if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    data.columns = ['X', 'Y']
    learning = 0.00001
    initial_b, initial_m = 0, 0
    max_it = 350
    b, m = gradient_descent(data, initial_b, initial_m, learning, max_it, visual=True)
Posted on 2020-01-02 17:26:03
When I ran the script, my first thought was that it had hung, because nothing happened. Apparently my computer just computes too slowly.
To reassure the user that the program is working, add some output that indicates progress. Here is how I added it, but you can make it fancier with a progress bar and the like:
n_reports = 20
for i in range(max_iter):
    if max_iter >= n_reports and i % (max_iter // n_reports) == 1:
        print(f'{i * 100 // max_iter}% ', end='', flush=True)
    ...
print('100%')
n_reports is the number of times the completion percentage is printed. An impatient user will want it printed more often, while a patient user (or one with a faster computer) will want it printed less often.
Python has modules for temporary files and directories, and you should use them instead of creating your own. Also, never change the process's current working directory (the chdir call) unless you have to; it causes very strange problems.
The tempfile and pathlib modules simplify the file handling:
if visual:
    save_dir = pathlib.Path(tempfile.mkdtemp())
...
for i in range(max_iter):
    ...
    if visual:
        ...
        fig_name = f'{i:05}.png'
        plt.savefig(save_dir / fig_name)
        ...
if visual:
    frames = sorted([f.resolve() for f in save_dir.iterdir()])
    frames = [imageio.imread(frame) for frame in frames]
    image_file = save_dir.name + '.gif'
    imageio.mimsave(image_file, frames)
    print(f'Saved image as {image_file}')
I also changed the code so that the filenames are zero-padded. That way you can rely on lexicographic order and don't have to write your own key function.
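A small demonstration of the difference, assuming three frame indices 1, 2, and 10:

```python
unpadded = [f'{i}.png' for i in (1, 2, 10)]
padded = [f'{i:05}.png' for i in (1, 2, 10)]
print(sorted(unpadded))  # ['1.png', '10.png', '2.png'] -- numerically wrong order
print(sorted(padded))    # ['00001.png', '00002.png', '00010.png'] -- correct order
```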
https://codereview.stackexchange.com/questions/234936