
Deep Learning - Optimizing Parameters

Author: freesan44 · Last modified 2021-10-12
  1. Split the data into a training set, a development (dev) set, and a test set.
  2. High variance (e.g. 1% training error, 11% dev error): the parameters overfit the training set. Fixes: more training data, regularization. High bias (e.g. 15% training error, 16% dev error): the train/dev gap is small but the error itself is high, i.e. underfitting. Fixes: a deeper network, more neurons per layer.
  3. Regularization: L1 regularization, L2 regularization, dropout regularization. Dropout ("inverted dropout") is applied when updating W and the dropout masks D during gradient descent; it keeps the parameters smoother and prevents overfitting, and is used mostly in computer vision. With dropout, the cost J may no longer decrease monotonically.
  4. Dropout is usually applied to the hidden layers with the most neurons.
  5. Optimizing the cost function J: gradient descent, Momentum, RMSProp, Adam, etc. Avoiding overfitting: regularization and hyperparameter tuning.
  6. Ways to speed up training: data (mean) normalization and variance normalization. Note: if the training set is normalized, the dev and test sets must be normalized with the same σ and μ (a short sketch follows item 10 below).
  7. Data (mean) normalization: X = X - μ (the mean).
  8. Variance normalization:
    X = X/(σ^2)
  9. Preventing vanishing and exploding gradients: initialize the weights W with a well-chosen initialization formula (such as the He initialization in item 11 below).
  10. Gradient checking: the difference should be below 10^-7 for the gradients to be considered correct. Use it only to debug backprop, not during training. Note: gradient checking cannot be used together with dropout, so first use gradient checking to verify that the gradients are correct, then turn dropout on.
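A minimal sketch of the normalization in items 6 to 8 (my own illustration: fit_normalizer and apply_normalizer are hypothetical helper names and the data is random toy data), showing the dev set being normalized with the μ and σ² computed on the training set:

import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and variance of the training set (one example per column)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    return mu, sigma2

def apply_normalizer(X, mu, sigma2, eps=1e-8):
    """Subtract the training mean and divide by the training variance."""
    return (X - mu) / (sigma2 + eps)

# Toy data: 3 features, 5 training examples, 2 dev examples
X_train = np.random.randn(3, 5) * 10 + 2
X_dev = np.random.randn(3, 2) * 10 + 2

mu, sigma2 = fit_normalizer(X_train)                 # statistics come from the training set only
X_train_norm = apply_normalizer(X_train, mu, sigma2)
X_dev_norm = apply_normalizer(X_dev, mu, sigma2)     # the dev/test sets reuse the same mu and sigma2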
  11. Each Wl should be initialized randomly, not to zeros, in order to break symmetry. He initialization:

# GRADED FUNCTION: initialize_parameters_he

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1  # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
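For example, a minimal usage sketch (assuming NumPy is imported as np and initialize_parameters_he above has been defined; the layer sizes are illustrative):

layers_dims = [2, 20, 3, 1]                 # example layer sizes: input of size 2, three layers
parameters = initialize_parameters_he(layers_dims)
print(parameters["W1"].shape)               # (20, 2)
print(parameters["b1"].shape)               # (20, 1)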
  12. L2 regularization:

# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)  # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = 1/m * lambd/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Backward propagation with regularization:
# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd/m * W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd/m * W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd/m * W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

What L2 regularization is actually doing:

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

L2 regularization affects:

The cost computation: a regularization term is added to the cost.

The backpropagation function: there are extra terms in the gradients with respect to the weight matrices.

Weights end up smaller ("weight decay"): weights are pushed to smaller values.
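To make the "weight decay" point concrete, here is a minimal sketch (my own illustration, not part of the assignment): because dW contains the extra lambd/m * W term, one gradient descent step is the same as first shrinking W by a constant factor and then applying the ordinary unregularized update.

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 2)
dW_unreg = np.random.randn(3, 2)            # gradient of the cross-entropy part only
lambd, m, learning_rate = 0.7, 100, 0.1

# Update with the L2 term included in the gradient
W_new = W - learning_rate * (dW_unreg + lambd / m * W)

# Equivalent "weight decay" form: shrink W first, then apply the unregularized update
W_decay = (1 - learning_rate * lambd / m) * W - learning_rate * dW_unreg

print(np.allclose(W_new, W_decay))          # True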

  13. Dropout: randomly shut down neurons with probability 1 - keep_prob.

# GRADED FUNCTION: forward_propagation_with_dropout

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    D1 = np.random.rand(A1.shape[0], A1.shape[1])    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                              # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                     # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                              # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])    # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                              # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                     # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                              # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

Backward propagation with dropout:

# GRADED FUNCTION: backward_propagation_with_dropout

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Note:
  • A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only during training.
  • Deep learning frameworks such as TensorFlow, PaddlePaddle, Keras, and Caffe come with a dropout layer implementation. Don't stress, you will soon learn some of these frameworks.

What you should remember about dropout:

  • Dropout is a regularization technique.
  • You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 since only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output keeps the same expected value. You can check that this works even when keep_prob is a value other than 0.5.
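A quick numerical check of the last point (a standalone sketch, not part of the assignment): masking with keep_prob and then dividing by keep_prob keeps the mean of the activations roughly unchanged.

import numpy as np

np.random.seed(1)
keep_prob = 0.8
A = np.random.rand(50, 1000)                # some layer activations

D = np.random.rand(*A.shape) < keep_prob    # dropout mask, roughly 80% ones
A_drop = (A * D) / keep_prob                # inverted dropout: shut down neurons, then rescale

print(A.mean())                             # about 0.5
print(A_drop.mean())                        # also about 0.5: the expected value is preserved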
  14. Gradient checking, one-dimensional case:

# GRADED FUNCTION: gradient_check

def gradient_check(x, theta, epsilon=1e-7):
    """
    Implement the backward propagation presented in Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                          # Step 1
    thetaminus = theta - epsilon                         # Step 2
    J_plus = forward_propagation(x, thetaplus)           # Step 3
    J_minus = forward_propagation(x, thetaminus)         # Step 4
    gradapprox = (J_plus - J_minus) / (2 * epsilon)      # Step 5
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                    # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)  # Step 2'
    difference = numerator / denominator                             # Step 3'
    ### END CODE HERE ###

    if difference < 1e-7:
        print("The gradient is correct!")
    else:
        print("The gradient is wrong!")

    return difference

N-dimensional gradient checking:
# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)  # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon  # Step 2
        J_plus[i], _ = forward_propagation_n(X,Y,vector_to_dictionary(thetaplus))  # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)  # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon  # Step 2
        J_minus[i], _ = forward_propagation_n(X,Y,vector_to_dictionary(thetaminus))  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i]-J_minus[i])/(2*epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad-gradapprox)  # Step 1'
    denominator = np.linalg.norm(grad)+np.linalg.norm(gradapprox)  # Step 2'
    difference = numerator/denominator  # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print(
            "\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print(
            "\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

Note:

Gradient checking is slow! Approximating the gradient with ∂J/∂θ ≈ (J(θ+ε) - J(θ-ε)) / (2ε) is computationally costly. For this reason, we don't run gradient checking at every iteration during training. Just a few times to check that the gradient is correct.

Gradient checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

Gradient checking verifies the closeness between the gradients from backpropagation and a numerical approximation of the gradient (computed using forward propagation).

Gradient checking is slow, so we don't run it in every iteration of training. You would typically run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
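As a standalone illustration of these notes (a toy sketch with a made-up one-parameter model J(theta) = theta * x, not the assignment's forward_propagation), the centered difference closely matches the analytic gradient and the relative difference lands far below the 1e-7 threshold:

import numpy as np

def forward(x, theta):
    """Toy model: J(theta) = theta * x."""
    return theta * x

def backward(x, theta):
    """Analytic gradient of J with respect to theta: dJ/dtheta = x."""
    return x

x, theta, epsilon = 2.0, 4.0, 1e-7

# Centered-difference approximation of the gradient
gradapprox = (forward(x, theta + epsilon) - forward(x, theta - epsilon)) / (2 * epsilon)
grad = backward(x, theta)

# Relative difference used by the gradient check
numerator = np.linalg.norm(grad - gradapprox)
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
difference = numerator / denominator

print(difference)                           # far below 1e-7, so the analytic gradient matches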

Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

If there is any infringement, please contact cloudcommunity@tencent.com for removal.
