Fully Connected Neural Networks (Part 2)

0. Preface
1. Batch Normalization (1.1 What is BN? · 1.2 Forward Pass · 1.3 Backward Pass)
2. Dropout (2.1 What is Dropout? · 2.2 Forward Pass · 2.3 Backward Pass)
3. Fully Connected Nets with an Arbitrary Number of Hidden Layers
4. Training the Model
5. From the Author

0. Preface

A few words of reflection first. People ask me why I post every single day without a break. I have only one answer: persistence.

Also, I have set up a new menu for assignment walkthroughs; you can find it in the official account, and it collects all the previous assignment write-ups!

OK, let's continue with the second part of the fully connected neural network from cs231n's assignment2. This article focuses on building a fully connected network with an arbitrary number of layers. Let's dive in!

1. Batch Normalization

1.1 What is BN?

For what Batch Normalization is, along with the derivations of its forward and backward passes, here is an excellent reference worth bookmarking:

Understanding the backward pass through Batch Normalization Layer

Simply put, Batch Normalization inserts a normalization step between wx+b and f(wx+b) in each layer.

What kind of normalization? Here it means normalizing wx+b to zero mean and unit variance!

Below are the Batch Normalization algorithm and its backward derivation formulas, following the link above.
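The original figure does not reproduce here, so for reference these are the standard training-time equations from the Ioffe & Szegedy paper (the same ones the code below implements), written in LaTeX for a mini-batch of m samples:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta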

1.2 Forward Pass

Both the forward and backward passes go in the layers.py file!

This part is actually quite easy to write, because the comments provide many hints, for example:

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

Inputs and outputs:

Inputs:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass

Full implementation:

The comments with the relevant formulas are already written in; just implement the algorithm above!

def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        # mini-batch mean mu_B, shape (1, D)
        sample_mean = np.mean(x, axis=0, keepdims=True)
        # mini-batch variance sigma_B^2, shape (1, D)
        sample_var = np.var(x, axis=0, keepdims=True)
        # normalize to zero mean and unit variance, shape (N, D)
        x_normalize = (x - sample_mean) / np.sqrt(sample_var + eps)
        # scale and shift
        out = gamma * x_normalize + beta
        cache = (x_normalize, gamma, beta, sample_mean, sample_var, x, eps)
        # update the running statistics (exponential moving average)
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    elif mode == 'test':
        # at test time, normalize with the accumulated running statistics
        x_normalize = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normalize + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
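As a quick sanity check, here is a minimal sketch (it assumes numpy is available and that the batchnorm_forward above is importable from layers.py): in train mode with gamma=1 and beta=0, the output should have per-feature mean close to 0 and variance close to 1.

import numpy as np
# from layers import batchnorm_forward  # assumed import path

N, D = 200, 5
x = 10 + 4 * np.random.randn(N, D)        # features far from zero mean / unit variance
gamma, beta = np.ones(D), np.zeros(D)
bn_param = {'mode': 'train'}

out, _ = batchnorm_forward(x, gamma, beta, bn_param)
print(out.mean(axis=0))                   # each entry close to 0
print(out.var(axis=0))                    # each entry close to 1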

1.3 Backward Pass

The backward pass is the important part. In assignment1 we derived the two-layer network by hand, and the same principle applies here. My own handwritten derivation is a bit messy, so instead of posting it, here is the standard derivation from the web:
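Since the derivation figure is likewise missing here, these are the standard backward formulas that the code below follows step by step:

\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\,\gamma

\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu_B)\cdot\left(-\frac{1}{2}\right)\left(\sigma_B^2+\epsilon\right)^{-3/2}

\frac{\partial L}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2+\epsilon}}\sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i} \;-\; \frac{2}{m}\,\frac{\partial L}{\partial \sigma_B^2}\sum_{i=1}^{m}(x_i-\mu_B)

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial L}{\partial \sigma_B^2}\,\frac{2(x_i-\mu_B)}{m} + \frac{1}{m}\,\frac{\partial L}{\partial \mu_B}

\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}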

Inputs and outputs:

Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)

Full implementation:

I recommend working through this together with the algorithm and the backward formulas above, ideally by hand. One important point: when differentiating with respect to x, the expressions are nested layer by layer, so use a divide-and-conquer strategy as in the algorithm itself: split the computation into subproblems and chain the derivatives. This is simpler and much less error-prone!

def batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N, D = x.shape
    dx_normalized = dout * gamma       # [N,D]
    x_mu = x - sample_mean             # [N,D]
    sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
    dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
    dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - 2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
    dx1 = dx_normalized * sample_std_inv
    dx2 = 2.0/N * dsample_var * x_mu
    dx = dx1 + dx2 + 1.0/N * dsample_mean
    # gamma and beta have shape (D,), so sum over the batch axis without keepdims
    dgamma = np.sum(dout * x_normalized, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dx, dgamma, dbeta
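To be confident in the analytic gradients, compare them against a central-difference numerical gradient. Below is a minimal sketch; num_grad is a hypothetical helper standing in for cs231n's eval_numerical_gradient_array:

import numpy as np

def num_grad(f, x, dout, h=1e-5):
    # central-difference gradient of f at x, contracted with upstream dout
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        pos = f(x).copy()
        x[idx] = old - h
        neg = f(x).copy()
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

N, D = 4, 3
x = np.random.randn(N, D)
gamma, beta = np.random.randn(D), np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

fx = lambda x_: batchnorm_forward(x_, gamma, beta, {'mode': 'train'})[0]
print(np.max(np.abs(dx - num_grad(fx, x, dout))))   # should be on the order of 1e-8 or smaller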

2. Dropout

2.1 What is Dropout?

Dropout can be understood as a regularization technique for suppressing overfitting! During training, each neuron is kept active with probability p. Below is the usual dropout illustration:

Comparing the two panels of that illustration, the obvious difference in figure (b) versus figure (a) is that there are fewer arrows; many of the forward connections have been removed. Dropping too aggressively is uncommon in practice, though, because it can easily discard key information!

2.2 Forward Pass

The forward and backward passes are in the layers.py file!

The comments mention a key point from cs231n; see the link below for what dropout is:

cs231n course notes (direct link)

Inputs and outputs:

Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

Full implementation:

The implementation comes down to one sentence: deactivate units at random!!! Concretely, multiply the input by a binary mask that keeps each element with probability p. Note the mask is also divided by p (inverted dropout, as the cs231n notes recommend), so that no extra scaling is needed at test time.

def dropout_forward(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
    mask = None
    out = None
    if mode == 'train':
        # inverted dropout: keep each unit with probability p, then rescale by
        # 1/p so the expected activation matches the test-time forward pass
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        out=x
    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)
    return out, cache
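With inverted dropout, the train-time output matches the test-time output in expectation, because E[mask] = p * (1/p) = 1. A minimal sketch of that check (assuming the dropout_forward above is importable):

import numpy as np

x = 10 + np.random.randn(500, 500)
for p in [0.3, 0.6, 0.75]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})
    # the three means should be close because the mask is rescaled by 1/p
    print(p, x.mean(), out_train.mean(), out_test.mean())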

2.3 Backward Pass

Inputs and outputs:

Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
Outputs:
    - dx

Full implementation:

The implementation is just the upstream gradient times the local gradient: the upstream gradient is dout, and the local gradient is the stored mask.

def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        # the gradient flows only through the units that were kept
        dx = dout * mask
    elif mode == 'test':
        dx = dout
    return dx
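The mask gates the gradient exactly as it gated the activations: units dropped in the forward pass receive zero gradient. A tiny check (a sketch reusing the two functions above):

import numpy as np

x = np.random.randn(10, 10)
dout = np.random.randn(*x.shape)
out, cache = dropout_forward(x, {'mode': 'train', 'p': 0.5, 'seed': 123})
dx = dropout_backward(dout, cache)
# the gradient is zero exactly where the forward pass dropped the unit
print(np.all((dx == 0) == (out == 0)))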

3. Fully Connected Nets with an Arbitrary Number of Hidden Layers

This part modifies fc_net.py!

I used to be a bit confused by this part, but after rereading the comments today it is quite clear. I recommend reading the TODO notes in the comments; they are very detailed!!!

Take the network construction as an example. The architecture of the fully connected network we build is:

{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

The network has L layers in total; the (L - 1) means the {block} in braces is repeated L - 1 times, just as the comments say! For example, hidden_dims=[100, 100] gives L = 3: two affine - relu blocks followed by a final affine and a softmax.

Inputs and outputs:

The docstring below is kept exactly as it appears in the assignment; leave a comment if anything is unclear!

  Initialize a new FullyConnectedNet.
        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.

The code to fill in sits between the two rows of # signs! For the goal, see the TODO, which is very detailed; briefly, we store W and b for every layer here, and these stored parameters are used later in loss!

class FullyConnectedNet(object):
    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}
        ############################################################################
        num_neurons = [input_dim] + hidden_dims + [num_classes]
        # consecutive pairs in num_neurons define one layer each, hence len(num_neurons) - 1 iterations
        for i in range(len(num_neurons) - 1):
            self.params['W' + str(i + 1)] = np.random.randn(num_neurons[i], num_neurons[i+1]) * weight_scale
            self.params['b' + str(i + 1)] = np.zeros(num_neurons[i+1])
            # i runs up to L-2 at most, and batch normalization sits only between
            # layers (e.g., three nodes have just two gaps), so gamma/beta stop before the last affine
            if self.normalization == 'batchnorm' and i < len(num_neurons) - 2:
                self.params['beta' + str(i + 1)] = np.zeros([num_neurons[i+1]])
                self.params['gamma' + str(i + 1)] = np.ones([num_neurons[i+1]])
        ############################################################################
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed
        self.bn_params = []
        if self.normalization=='batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization=='layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
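To see what the initialization loop produces, instantiate a small network and print the parameter shapes (a sketch; the shapes below follow from the default input_dim=3*32*32 and num_classes=10):

net = FullyConnectedNet(hidden_dims=[100, 50], normalization='batchnorm')
for name in sorted(net.params):
    print(name, net.params[name].shape)
# W1 (3072, 100)  b1 (100,)  gamma1 (100,)  beta1 (100,)
# W2 (100, 50)    b2 (50,)   gamma2 (50,)   beta2 (50,)
# W3 (50, 10)     b3 (10,)
# note: no gamma3/beta3, since batchnorm is not applied after the last affine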

Goal:

Compute the loss and gradients for the fully connected network.

Inputs and outputs:

Inputs:
    - X: Array of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

Returns:
    If y is None, then run a test-time forward pass of the model and return:
    - scores: Array of shape (N, C) giving classification scores, where scores[i, c] is the classification score for X[i] and class c.

Full implementation:

The implementation follows the architecture given in the comment at the very top:

{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

Let's walk through it; the code to fill in sits between the two long rows of # signs:

We call the forward passes of affine, batchnorm, relu, and dropout in order, then compute the loss, and finally call the matching backward passes in reverse order to obtain the gradients!

def loss(self, X, y=None):
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization=='batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        cache = {}
        scores = X 
        ############################################################################
        # forward pass:
        # {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
        for i in range(1, self.num_layers + 1):
            scores, cache['fc'+str(i)] = affine_forward(scores, self.params['W' + str(i)], self.params['b' + str(i)])
            # Do not add relu, batchnorm, dropout after the last layer
            if i < self.num_layers: 
                if self.normalization == "batchnorm":
                    # bn_params is indexed from 0 while layers are numbered from 1, hence self.bn_params[i-1]
                    scores, cache['bn'+str(i)] = batchnorm_forward(scores, self.params['gamma'+str(i)], self.params['beta'+str(i)], self.bn_params[i-1])
                scores, cache['relu'+str(i)] = relu_forward(scores)
                if self.use_dropout:
                    scores, cache['dropout'+str(i)] = dropout_forward(scores, self.dropout_param)
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # compute the loss: softmax data loss plus L2 regularization over every W
        loss, last_grad = softmax_loss(scores, y)
        loss += 0.5 * self.reg * sum([np.sum(self.params['W' + str(i)]**2) for i in range(1, self.num_layers + 1)])
        ############################################################################

        ############################################################################
        # backward pass, walking the layers in reverse order
        for i in range(self.num_layers, 0, -1): 
            if i < self.num_layers: # No ReLU, dropout, Batchnorm for the last layer
                if self.use_dropout:
                    last_grad = dropout_backward(last_grad, cache['dropout' + str(i)])
                last_grad = relu_backward(last_grad, cache['relu' + str(i)])
                if self.normalization == "batchnorm":
                    last_grad, grads['gamma'+str(i)], grads['beta'+str(i)] = batchnorm_backward(last_grad, cache['bn'+str(i)])
            last_grad, grads['W' + str(i)], grads['b' + str(i)] = affine_backward(last_grad, cache['fc' + str(i)])
            grads['W' + str(i)] += self.reg * self.params['W' + str(i)]
        ############################################################################
        return loss, grads
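Before training, a standard sanity check (a minimal sketch, assuming the layer functions from layers.py are in scope): with small random weights and reg=0.0, the initial softmax loss over C classes should be close to ln(C):

import numpy as np

np.random.seed(231)
N, D, C = 3, 50, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([20, 30], input_dim=D, num_classes=C, reg=0.0)
loss, _ = model.loss(X, y)
print(loss)          # should be close to np.log(10) ≈ 2.3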

4. Training the Model

Finally, go back to the FullyConnectedNets.ipynb file, call everything in order, and fill in the code that trains a good model!

hidden_dims = [100] * 4
range_weight_scale = [1e-2, 2e-2, 5e-3]
range_lr = [1e-5, 5e-4, 1e-5]

best_val_acc = -1
best_weight_scale = 0
best_lr = 0

print("Training...")

for weight_scale in range_weight_scale:
    for lr in range_lr:
        model = FullyConnectedNet(hidden_dims=hidden_dims, reg=0.0,
                                 weight_scale=weight_scale)
        solver = Solver(model, data, update_rule='adam',
                        optim_config={'learning_rate': lr},
                        batch_size=100, num_epochs=5,
                        verbose=False)
        solver.train()
        val_acc = solver.best_val_acc  
        print('Weight_scale: %f, lr: %f, val_acc: %f' % (weight_scale, lr, val_acc))

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_weight_scale = weight_scale
            best_lr = lr
            best_model = model
print("Best val_acc: %f" % best_val_acc)
print("Best weight_scale: %f" % best_weight_scale)
print("Best lr: %f" % best_lr)

The target is at least 50% accuracy on the validation set!

The best results from the training above:

Validation set accuracy:  0.528
Test set accuracy:  0.527

This article was first shared on the WeChat public account 光城 (guangcity); author: lightcity.


Originally published: 2018-12-03.

