
Deep Learning - Optimizing Parameters

Author: freesan44 · Last modified 2021-10-12
  1. Split the data into a training set, a development (dev) set, and a test set.
  2. High variance (e.g. 1% training error, 11% dev error): the parameters overfit the training set. Fixes: more training data, regularization. High bias (e.g. 15% training error, 16% dev error): the train/dev gap is small but the error itself is high, i.e. underfitting. Fixes: a deeper network, more neurons per layer.
  3. Regularization: L1 regularization, L2 regularization, dropout regularization. Dropout ("inverted dropout") is applied when updating W and the dropout masks D during gradient descent; it keeps the parameters smoother and prevents overfitting, and is used mostly in computer vision. With dropout, the cost J may no longer decrease monotonically.
  4. Dropout is usually applied to the hidden layers with the most neurons.
  5. Optimizing the cost function J: gradient descent, Momentum, RMSProp, Adam, etc. Avoiding overfitting: regularization and hyperparameter tuning.
  6. Ways to speed up training: data (mean) normalization and variance normalization. Note: if the training set is normalized, the dev and test sets must be normalized with the same σ and μ (a short sketch follows item 10 below).
  7. Data (mean) normalization: X = X - μ (the mean).
  8. Variance normalization:
    X = X/(σ^2)
  9. Preventing vanishing and exploding gradients: initialize the weights W with a well-chosen initialization formula (such as the He initialization in item 11 below).
  10. Gradient checking: the difference should be below 10^-7 for the gradients to be considered correct. Use it only to debug backprop, not during training. Note: gradient checking cannot be used together with dropout, so first use gradient checking to verify that the gradients are correct, then turn dropout on.
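A minimal sketch of the normalization in items 6 to 8 (my own illustration: fit_normalizer and apply_normalizer are hypothetical helper names and the data is random toy data), showing the dev set being normalized with the μ and σ² computed on the training set:

import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and variance of the training set (one example per column)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    return mu, sigma2

def apply_normalizer(X, mu, sigma2, eps=1e-8):
    """Subtract the training mean and divide by the training variance."""
    return (X - mu) / (sigma2 + eps)

# Toy data: 3 features, 5 training examples, 2 dev examples
X_train = np.random.randn(3, 5) * 10 + 2
X_dev = np.random.randn(3, 2) * 10 + 2

mu, sigma2 = fit_normalizer(X_train)                 # statistics come from the training set only
X_train_norm = apply_normalizer(X_train, mu, sigma2)
X_dev_norm = apply_normalizer(X_dev, mu, sigma2)     # the dev/test sets reuse the same mu and sigma2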
  11. Each Wl should be initialized randomly, not to zeros, in order to break symmetry. He initialization:

# GRADED FUNCTION: initialize_parameters_he

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1  # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
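For example, a minimal usage sketch (assuming NumPy is imported as np and initialize_parameters_he above has been defined; the layer sizes are illustrative):

layers_dims = [2, 20, 3, 1]                 # example layer sizes: input of size 2, three layers
parameters = initialize_parameters_he(layers_dims)
print(parameters["W1"].shape)               # (20, 2)
print(parameters["b1"].shape)               # (20, 1)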
  12. L2 regularization:

# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)  # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = 1/m * lambd/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Backward propagation with regularization:
# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd/m * W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd/m * W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd/m * W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

What L2 regularization is actually doing:

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

L2 regularization affects:

The cost computation: a regularization term is added to the cost.

The backpropagation function: there are extra terms in the gradients with respect to the weight matrices.

Weights end up smaller ("weight decay"): weights are pushed to smaller values.
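To make the "weight decay" point concrete, here is a minimal sketch (my own illustration, not part of the assignment): because dW contains the extra lambd/m * W term, one gradient descent step is the same as first shrinking W by a constant factor and then applying the ordinary unregularized update.

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 2)
dW_unreg = np.random.randn(3, 2)            # gradient of the cross-entropy part only
lambd, m, learning_rate = 0.7, 100, 0.1

# Update with the L2 term included in the gradient
W_new = W - learning_rate * (dW_unreg + lambd / m * W)

# Equivalent "weight decay" form: shrink W first, then apply the unregularized update
W_decay = (1 - learning_rate * lambd / m) * W - learning_rate * dW_unreg

print(np.allclose(W_new, W_decay))          # True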

  13. Dropout: randomly shut down neurons with probability 1 - keep_prob.

# GRADED FUNCTION: forward_propagation_with_dropout

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    D1 = np.random.rand(A1.shape[0], A1.shape[1])    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                              # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                     # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                              # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])    # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                              # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                     # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                              # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

Backward propagation with dropout:

# GRADED FUNCTION: backward_propagation_with_dropout

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Note:
  • A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only during training.
  • Deep learning frameworks such as TensorFlow, PaddlePaddle, Keras, and Caffe come with a dropout layer implementation. Don't stress, you will soon learn some of these frameworks.

What you should remember about dropout:

  • Dropout is a regularization technique.
  • You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 since only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output keeps the same expected value. You can check that this works even when keep_prob is a value other than 0.5.
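A quick numerical check of the last point (a standalone sketch, not part of the assignment): masking with keep_prob and then dividing by keep_prob keeps the mean of the activations roughly unchanged.

import numpy as np

np.random.seed(1)
keep_prob = 0.8
A = np.random.rand(50, 1000)                # some layer activations

D = np.random.rand(*A.shape) < keep_prob    # dropout mask, roughly 80% ones
A_drop = (A * D) / keep_prob                # inverted dropout: shut down neurons, then rescale

print(A.mean())                             # about 0.5
print(A_drop.mean())                        # also about 0.5: the expected value is preserved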
  14. Gradient checking, one-dimensional case:

# GRADED FUNCTION: gradient_check

def gradient_check(x, theta, epsilon=1e-7):
    """
    Implement the backward propagation presented in Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                          # Step 1
    thetaminus = theta - epsilon                         # Step 2
    J_plus = forward_propagation(x, thetaplus)           # Step 3
    J_minus = forward_propagation(x, thetaminus)         # Step 4
    gradapprox = (J_plus - J_minus) / (2 * epsilon)      # Step 5
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                    # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)  # Step 2'
    difference = numerator / denominator                             # Step 3'
    ### END CODE HERE ###

    if difference < 1e-7:
        print("The gradient is correct!")
    else:
        print("The gradient is wrong!")

    return difference

N-dimensional gradient checking:
# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)  # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon  # Step 2
        J_plus[i], _ = forward_propagation_n(X,Y,vector_to_dictionary(thetaplus))  # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)  # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon  # Step 2
        J_minus[i], _ = forward_propagation_n(X,Y,vector_to_dictionary(thetaminus))  # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i]-J_minus[i])/(2*epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad-gradapprox)  # Step 1'
    denominator = np.linalg.norm(grad)+np.linalg.norm(gradapprox)  # Step 2'
    difference = numerator/denominator  # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print(
            "\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print(
            "\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

Note:

Gradient checking is slow! Approximating the gradient with ∂J/∂θ ≈ (J(θ+ε) - J(θ-ε)) / (2ε) is computationally costly. For this reason, we don't run gradient checking at every iteration during training. Just a few times to check that the gradient is correct.

Gradient checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

Gradient checking verifies the closeness between the gradients from backpropagation and a numerical approximation of the gradient (computed using forward propagation).

Gradient checking is slow, so we don't run it in every iteration of training. You would typically run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
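As a standalone illustration of these notes (a toy sketch with a made-up one-parameter model J(theta) = theta * x, not the assignment's forward_propagation), the centered difference closely matches the analytic gradient and the relative difference lands far below the 1e-7 threshold:

import numpy as np

def forward(x, theta):
    """Toy model: J(theta) = theta * x."""
    return theta * x

def backward(x, theta):
    """Analytic gradient of J with respect to theta: dJ/dtheta = x."""
    return x

x, theta, epsilon = 2.0, 4.0, 1e-7

# Centered-difference approximation of the gradient
gradapprox = (forward(x, theta + epsilon) - forward(x, theta - epsilon)) / (2 * epsilon)
grad = backward(x, theta)

# Relative difference used by the gradient check
numerator = np.linalg.norm(grad - gradapprox)
denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
difference = numerator / denominator

print(difference)                           # far below 1e-7, so the analytic gradient matches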

Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

If there is any infringement, please contact cloudcommunity@tencent.com for removal.
