
Running a Convolutional Neural Network on Mobile for Document Detection (Part 2) -- From VGG to MobileNetV2: A Knowledge Review (Continued)

Author: 腾讯Bugly · Published 2018-06-11

From MobileNet V1 to MobileNet V2

The goal that ResNet, Inception, and Xception pursue is to find a balance between model size, inference speed, and training speed while still reaching higher accuracy. If instead you accept some loss in accuracy in exchange for a smaller model and faster inference, you arrive directly at MobileNet and similar network architectures designed to run on phones or embedded devices.

MobileNet V1 (https://arxiv.org/pdf/1704.04861.pdf) and MobileNet V2 (https://arxiv.org/pdf/1801.04381.pdf) both build their convolutional layers on Depthwise Separable Convolution (similar to Xception, though not exactly the same Separable Convolution that Xception uses). This is one key reason they are small and fast; the other is a layer structure that was carefully designed and tuned through experiments. Below, both versions are implemented in code, following the papers.
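To see why Depthwise Separable Convolution saves so much computation, the cost ratio given in the MobileNet V1 paper can be checked with a few lines of arithmetic. The snippet below is only an illustration; the layer dimensions are made-up example values:

# Rough multiply-accumulate cost of a standard conv vs. a depthwise separable conv.
# Notation follows the MobileNet V1 paper:
# D_f: feature map width/height, D_k: kernel width/height,
# M: input channels, N: output channels.
def standard_conv_cost(D_f, D_k, M, N):
    return D_k * D_k * M * N * D_f * D_f

def separable_conv_cost(D_f, D_k, M, N):
    depthwise = D_k * D_k * M * D_f * D_f   # one D_k x D_k filter per input channel
    pointwise = M * N * D_f * D_f           # 1x1 conv mixing the channels
    return depthwise + pointwise

# Example layer: 56x56 feature map, 3x3 kernel, 128 input and 128 output channels
ratio = separable_conv_cost(56, 3, 128, 128) / float(standard_conv_cost(56, 3, 128, 128))
print(ratio)  # ~0.119, matching the paper's 1/N + 1/(D_k^2) = 1/128 + 1/9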

MobileNet V1

The overall structure of MobileNet V1 is not especially complicated. As in VGG, the layers are simply chained one after another; the differences lie inside each layer, as shown in the figure below:

The figure does not use arrows to show the direction of data flow, but with a little experience with convolutional networks it is easy to see that data flows from top to bottom. The left side is a standard convolution layer, similar in structure to the _vgg_conv2d function in the earlier HED network. (Recall the earlier discussion about the order of Batch Normalization and ReLU: although Batch Normalization can be placed after the activation function, most papers habitually put it before, so the code here strictly follows the papers.) The right side is equivalent to a separable convolution, except that Batch Normalization is applied twice, once after the depthwise conv and once after the pointwise conv.

The paper uses the following table to describe the overall structure:

Below is a straightforward implementation:

def mobilenet_v1(inputs, alpha, is_training):
    if alpha not in [0.25, 0.50, 0.75, 1.0]:
        raise ValueError('alpha must be one of '
                         '`0.25`, `0.50`, `0.75` or `1.0` only.')

    filter_initializer = tf.contrib.layers.xavier_initializer()    
    def _conv2d(inputs, filters, kernel_size, stride, scope=''):
        with tf.variable_scope(scope):
            outputs = tf.layers.conv2d(inputs,
                                    filters, 
                                    kernel_size, 
                                    strides=(stride, stride),
                                    padding='same', 
                                    activation=None,
                                    use_bias=False, 
                                    kernel_initializer=filter_initializer)

            outputs = tf.layers.batch_normalization(outputs, training=is_training)
            outputs = tf.nn.relu(outputs)        
        return outputs    
    def _mobilenet_v1_conv2d(inputs, 
                          pointwise_conv_filters, 
                          depthwise_conv_kernel_size,
                          stride, # stride is just for depthwise convolution
                          scope=''):
        with tf.variable_scope(scope):
            with tf.variable_scope('depthwise_conv'):                
                '''
                The tf.layers module has a tf.layers.separable_conv2d function,
                but its internal call sequence is
                depthwise convolution --> pointwise convolution --> activation func,
                whereas a MobileNet V1 style convolution layer needs
                depthwise conv --> batch norm --> relu --> pointwise conv --> batch norm --> relu,
                so the desired call sequence has to be assembled by other means.
                One option is tf.nn.depthwise_conv2d, but that API is fairly low-level
                and the resulting code is clumsy.
                A better option turned out to be tf.contrib.layers.separable_conv2d:
                if its second argument num_outputs is set to None,
                only the internal depthwise conv2d part runs and the pointwise conv2d
                part is skipped. That is enough to assemble the layer structure
                MobileNet V1 needs.

                TensorFlow provides four APIs all named separable_conv2d,
                with subtle differences between them; interested readers can
                consult the documentation:
                tf.contrib.layers.separable_conv2d [alias tf.contrib.layers.separable_convolution2d]
                VS
                tf.keras.backend.separable_conv2d
                VS
                tf.layers.separable_conv2d
                VS
                tf.nn.separable_conv2d
                '''
                outputs = tf.contrib.layers.separable_conv2d(
                            inputs,            
                            None, # ref https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py
                            depthwise_conv_kernel_size,
                            depth_multiplier=1, # set to 1, as described in the paper
                            stride=(stride, stride),
                            padding='SAME',
                            activation_fn=None,
                            weights_initializer=filter_initializer,
                            biases_initializer=None)

                outputs = tf.layers.batch_normalization(outputs, training=is_training)
                outputs = tf.nn.relu(outputs)     
            with tf.variable_scope('pointwise_conv'):
                # The paper's alpha parameter proportionally shrinks the number of output channels at each pointwise conv
                pointwise_conv_filters = int(pointwise_conv_filters * alpha)
                outputs = tf.layers.conv2d(outputs,
                                        pointwise_conv_filters,
                                        (1, 1), 
                                        padding='same', 
                                        activation=None,
                                        use_bias=False, 
                                        kernel_initializer=filter_initializer)

                outputs = tf.layers.batch_normalization(outputs, training=is_training)
                outputs = tf.nn.relu(outputs)

        return outputs
    def _avg_pool2d(inputs, scope=''):
        inputs_shape = inputs.get_shape().as_list()        
        assert len(inputs_shape) == 4

        pool_height = inputs_shape[1]
        pool_width = inputs_shape[2]        
        with tf.variable_scope(scope):
            outputs = tf.layers.average_pooling2d(inputs,
                                      [pool_height, pool_width],
                                      strides=(1, 1),
                                      padding='valid')        
        return outputs

    '''
    A network that performs classification can usually also serve as the base
    architecture for networks that perform other tasks. To make the code easy
    to reuse, only the convolutional body is implemented here; external callers
    use the returned output and end_points according to their own needs.
    For a classification task, for example, this function can be used as follows:

    image_height = 224
    image_width = 224
    image_channels = 3

    x = tf.placeholder(tf.float32, [None, image_height, image_width, image_channels])
    is_training = tf.placeholder(tf.bool, name='is_training')

    output, net = mobilenet_v1(x, 1.0, is_training)
    print('output shape is: %r' % (output.get_shape().as_list()))

    output = tf.layers.flatten(output)
    output = tf.layers.dense(output,
                        units=1024, # 1024 classes
                        activation=None,
                        use_bias=True,
                        kernel_initializer=tf.contrib.layers.xavier_initializer())
    print('output shape is: %r' % (output.get_shape().as_list()))
    '''
    with tf.variable_scope('mobilenet', 'mobilenet', [inputs]):
        end_points = {}
        net = inputs 

        net = _conv2d(net, 32, [3, 3], stride=2, scope='block0')
        end_points['block0'] = net
        net = _mobilenet_v1_conv2d(net, 64, [3, 3], stride=1, scope='block1')
        end_points['block1'] = net

        net = _mobilenet_v1_conv2d(net, 128, [3, 3], stride=2, scope='block2')
        end_points['block2'] = net
        net = _mobilenet_v1_conv2d(net, 128, [3, 3], stride=1, scope='block3')
        end_points['block3'] = net

        net = _mobilenet_v1_conv2d(net, 256, [3, 3], stride=2, scope='block4')
        end_points['block4'] = net
        net = _mobilenet_v1_conv2d(net, 256, [3, 3], stride=1, scope='block5')
        end_points['block5'] = net

        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=2, scope='block6')
        end_points['block6'] = net
        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=1, scope='block7')
        end_points['block7'] = net
        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=1, scope='block8')
        end_points['block8'] = net
        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=1, scope='block9')
        end_points['block9'] = net
        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=1, scope='block10')
        end_points['block10'] = net
        net = _mobilenet_v1_conv2d(net, 512, [3, 3], stride=1, scope='block11')
        end_points['block11'] = net

        net = _mobilenet_v1_conv2d(net, 1024, [3, 3], stride=2, scope='block12')
        end_points['block12'] = net
        net = _mobilenet_v1_conv2d(net, 1024, [3, 3], stride=1, scope='block13')
        end_points['block13'] = net

        output = _avg_pool2d(net, scope='output')    
        return output, end_points
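As a quick sanity check, the function can be driven like this (a sketch; the expected shapes follow the table in the paper, with the final average pooling collapsing the 7x7 feature map):

x = tf.placeholder(tf.float32, [None, 224, 224, 3])
is_training = tf.placeholder(tf.bool, name='is_training')

# alpha=0.50 halves the output channels of every pointwise conv
output, end_points = mobilenet_v1(x, 0.50, is_training)
print(output.get_shape().as_list())                # [None, 1, 1, 512]
print(end_points['block2'].get_shape().as_list())  # [None, 56, 56, 64]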
MobileNet V2

MobileNet V2 changes considerably more. First, it introduces two new layer structures, shown below:

One obvious difference is that the left-hand structure adopts the residual-network technique. In addition, in both structures a 1x1 convolution is added before the depthwise convolution. In the earlier examples, 1x1 convolutions were used to reduce dimensionality, but in MobileNet V2 this 1x1 convolution in front of the depthwise convolution actually increases dimensionality; this is what the paper's expansion factor parameter means. There is still a 1x1 convolution after the depthwise convolution, but it is not followed by an activation function: it is just a linear transformation, so it is not called a pointwise convolution here but corresponds to the paper's 1x1 projection convolution.
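As a concrete illustration of how the channel count evolves inside one such block, here is the bookkeeping for a made-up example (an expansion factor of 6, as used in most of the network):

# One inverted residual block on a 28x28x32 input, expansion=6:
input_channels = 32
expansion = 6
output_channels = 64

expanded = input_channels * expansion  # expansion 1x1 conv:  28x28x32  -> 28x28x192, ReLU6
# depthwise 3x3 conv, stride 2:                               28x28x192 -> 14x14x192, ReLU6
projected = output_channels            # projection 1x1 conv: 14x14x192 -> 14x14x64, linear
print(expanded, projected)  # 192 64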

The overall network structure is described by the following table:

The implementation is as follows:

def mobilenet_v2_func_blocks(is_training):
    filter_initializer = tf.contrib.layers.xavier_initializer()
    activation_func = tf.nn.relu6    
    def conv2d(inputs, filters, kernel_size, stride, scope=''):
        with tf.variable_scope(scope):            
            with tf.variable_scope('conv2d'):
                outputs = tf.layers.conv2d(inputs,
                                        filters, 
                                        kernel_size, 
                                        strides=(stride, stride),
                                        padding='same', 
                                        activation=None,
                                        use_bias=False, 
                                        kernel_initializer=filter_initializer)

                outputs = tf.layers.batch_normalization(outputs, training=is_training)
                outputs = tf.nn.relu(outputs)

        return outputs
    def _1x1_conv2d(inputs, filters, stride):
        kernel_size = [1, 1]        
        with tf.variable_scope('1x1_conv2d'):
            outputs = tf.layers.conv2d(inputs,
                                    filters, 
                                    kernel_size, 
                                    strides=(stride, stride),
                                    padding='same', 
                                    activation=None,
                                    use_bias=False, 
                                    kernel_initializer=filter_initializer)

            outputs = tf.layers.batch_normalization(outputs, training=is_training)
            # no activation_func
        return outputs    
    def expansion_conv2d(inputs, expansion, stride):
        input_shape = inputs.get_shape().as_list()        
        assert len(input_shape) == 4
        filters = input_shape[3] * expansion

        kernel_size = [1, 1]        
        with tf.variable_scope('expansion_1x1_conv2d'):
            outputs = tf.layers.conv2d(inputs,
                                    filters, 
                                    kernel_size, 
                                    strides=(stride, stride),
                                    padding='same', 
                                    activation=None,
                                    use_bias=False, 
                                    kernel_initializer=filter_initializer)

            outputs = tf.layers.batch_normalization(outputs, training=is_training)
            outputs = activation_func(outputs)        
        return outputs    
    def projection_conv2d(inputs, filters, stride):
        kernel_size = [1, 1]        
        with tf.variable_scope('projection_1x1_conv2d'):
            outputs = tf.layers.conv2d(inputs,
                                    filters, 
                                    kernel_size, 
                                    strides=(stride, stride),
                                    padding='same', 
                                    activation=None,
                                    use_bias=False, 
                                    kernel_initializer=filter_initializer)

            outputs = tf.layers.batch_normalization(outputs, training=is_training)
            # no activation_func
        return outputs

    def depthwise_conv2d(inputs,
                        depthwise_conv_kernel_size,
                        stride):
        with tf.variable_scope('depthwise_conv2d'):
            outputs = tf.contrib.layers.separable_conv2d(
                        inputs,                        
                        None, # https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py
                        depthwise_conv_kernel_size,
                        depth_multiplier=1,
                        stride=(stride, stride),
                        padding='SAME',
                        activation_fn=None,
                        weights_initializer=filter_initializer,
                        biases_initializer=None) 

            outputs = tf.layers.batch_normalization(outputs, training=is_training)
            outputs = activation_func(outputs)

        return outputs
    def avg_pool2d(inputs, scope=''):
        inputs_shape = inputs.get_shape().as_list()        
        assert len(inputs_shape) == 4

        pool_height = inputs_shape[1]
        pool_width = inputs_shape[2]
        with tf.variable_scope(scope):
            outputs = tf.layers.average_pooling2d(inputs,
                                            [pool_height, pool_width],
                                            strides=(1, 1),
                                            padding='valid')
        return outputs
    def inverted_residual_block(inputs, 
                            filters, 
                            stride, 
                            expansion=6, 
                            scope=''):
        assert stride == 1 or stride == 2

        depthwise_conv_kernel_size = [3, 3]
        pointwise_conv_filters = filters        
        with tf.variable_scope(scope):
            net = inputs
            net = expansion_conv2d(net, expansion, stride=1)
            net = depthwise_conv2d(net, depthwise_conv_kernel_size, stride=stride)
            net = projection_conv2d(net, pointwise_conv_filters, stride=1)            
            if stride == 1:
                # if net and inputs have different channel counts
                # (net.get_shape().as_list()[3] != inputs.get_shape().as_list()[3]),
                # use a 1x1 convolution to make the channels equal before adding them
                if net.get_shape().as_list()[3] != inputs.get_shape().as_list()[3]:
                    inputs = _1x1_conv2d(inputs, net.get_shape().as_list()[3], stride=1)

                net = net + inputs
                return net
            else:
                # stride == 2
                return net

    func_blocks = {}
    func_blocks['conv2d'] = conv2d
    func_blocks['inverted_residual_block'] = inverted_residual_block
    func_blocks['avg_pool2d'] = avg_pool2d
    func_blocks['filter_initializer'] = filter_initializer
    func_blocks['activation_func'] = activation_func    
    return func_blocks
def mobilenet_v2(inputs, is_training):
    func_blocks = mobilenet_v2_func_blocks(is_training)
    _conv2d = func_blocks['conv2d'] 
    _inverted_residual_block = func_blocks['inverted_residual_block']
    _avg_pool2d = func_blocks['avg_pool2d']    
    with tf.variable_scope('mobilenet_v2', 'mobilenet_v2', [inputs]):
        end_points = {}
        net = inputs 

        net = _conv2d(net, 32, [3, 3], stride=2, scope='block0_0') # size/2
        end_points['block0'] = net

        net = _inverted_residual_block(net, 16, stride=1, expansion=1, scope='block1_0')
        end_points['block1'] = net

        net = _inverted_residual_block(net, 24, stride=2, scope='block2_0') # size/4
        net = _inverted_residual_block(net, 24, stride=1, scope='block2_1')
        end_points['block2'] = net

        net = _inverted_residual_block(net, 32, stride=2, scope='block3_0') # size/8
        net = _inverted_residual_block(net, 32, stride=1, scope='block3_1') 
        net = _inverted_residual_block(net, 32, stride=1, scope='block3_2')
        end_points['block3'] = net

        net = _inverted_residual_block(net, 64, stride=2, scope='block4_0') # size/16
        net = _inverted_residual_block(net, 64, stride=1, scope='block4_1') 
        net = _inverted_residual_block(net, 64, stride=1, scope='block4_2') 
        net = _inverted_residual_block(net, 64, stride=1, scope='block4_3') 
        end_points['block4'] = net

        net = _inverted_residual_block(net, 96, stride=1, scope='block5_0') 
        net = _inverted_residual_block(net, 96, stride=1, scope='block5_1')
        net = _inverted_residual_block(net, 96, stride=1, scope='block5_2')
        end_points['block5'] = net

        net = _inverted_residual_block(net, 160, stride=2, scope='block6_0') # size/32
        net = _inverted_residual_block(net, 160, stride=1, scope='block6_1') 
        net = _inverted_residual_block(net, 160, stride=1, scope='block6_2') 
        end_points['block6'] = net

        net = _inverted_residual_block(net, 320, stride=1, scope='block7_0')
        end_points['block7'] = net

        net = _conv2d(net, 1280, [1, 1], stride=1, scope='block8_0') 
        end_points['block8'] = net

        output = _avg_pool2d(net, scope='output')    
    return output, end_points
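For a classification task, the returned output can be flattened and fed into a dense layer, mirroring the MobileNet V1 example above (a sketch; the class count of 1000 is just an assumed example):

x = tf.placeholder(tf.float32, [None, 224, 224, 3])
is_training = tf.placeholder(tf.bool, name='is_training')

output, end_points = mobilenet_v2(x, is_training)
print(output.get_shape().as_list())  # [None, 1, 1, 1280]

logits = tf.layers.flatten(output)
logits = tf.layers.dense(logits,
                         units=1000,  # assumed number of classes
                         activation=None,
                         use_bias=True,
                         kernel_initializer=tf.contrib.layers.xavier_initializer())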

MobileNet V2 Style HED

The original HED uses VGG as the base network to obtain feature maps. Following the same idea, the base network part can be replaced with MobileNet V2:

def mobilenet_v2_style_hed(inputs, batch_size, is_training):
    if const.use_kernel_regularizer:
        weights_regularizer = tf.contrib.layers.l2_regularizer(scale=0.0001)
    else:
        weights_regularizer = None

    ####################################################
    func_blocks = mobilenet_v2_func_blocks(is_training)    
    # print('============ func_blocks are: %r' % func_blocks)
    _conv2d = func_blocks['conv2d'] 
    _inverted_residual_block = func_blocks['inverted_residual_block']
    _avg_pool2d = func_blocks['avg_pool2d']
    filter_initializer = func_blocks['filter_initializer']
    activation_func = func_blocks['activation_func']    
    ####################################################

    def _dsn_1x1_conv2d(inputs, filters):
        kernel_size = [1, 1]
        outputs = tf.layers.conv2d(inputs,
                                   filters,
                                   kernel_size, 
                                   padding='same', 
                                   activation=None, ## no activation
                                   use_bias=False, 
                                   kernel_initializer=filter_initializer,
                                   kernel_regularizer=weights_regularizer)

        outputs = tf.layers.batch_normalization(outputs, training=is_training)        
        ## no activation
        return outputs    
    def _output_1x1_conv2d(inputs, filters):
        kernel_size = [1, 1]
        outputs = tf.layers.conv2d(inputs,
                                   filters,
                                   kernel_size, 
                                   padding='same', 
                                   activation=None, ## no activation
                                   use_bias=True, ## use bias
                                   kernel_initializer=filter_initializer,
                                   kernel_regularizer=weights_regularizer)        
        ## no batch normalization
        ## no activation

        return outputs    
    def _dsn_deconv2d_with_upsample_factor(inputs, filters, upsample_factor):
        ## https://github.com/s9xie/hed/blob/master/examples/hed/train_val.prototxt
        ## judging from the original implementation, kernel_size is computed like this
        kernel_size = [2 * upsample_factor, 2 * upsample_factor]
        outputs = tf.layers.conv2d_transpose(inputs,
                                             filters, 
                                             kernel_size, 
                                             strides=(upsample_factor, upsample_factor), 
                                             padding='same', 
                                             activation=None, ## no activation
                                             use_bias=True, ## use bias
                                             kernel_initializer=filter_initializer,
                                             kernel_regularizer=weights_regularizer)        
        ## Conceptually, deconv2d is already the final output layer; the only step left
        ## is a 1x1 conv2d that fuses the outputs of the 5 deconv2d layers together,
        ## so there is no need for batch normalization here

        return outputs    
    with tf.variable_scope('hed', 'hed', [inputs]):
        end_points = {}
        net = inputs        
        ## mobilenet v2 as base net
        with tf.variable_scope('mobilenet_v2'):            
            # The standard mobilenet v2 does not contain these two layers;
            # they are added here to obtain a feature map with the same size as the input image
            net = _conv2d(net, 3, [3, 3], stride=1, scope='block0_0')
            net = _conv2d(net, 6, [3, 3], stride=1, scope='block0_1')

            dsn1 = net
            net = _conv2d(net, 12, [3, 3], stride=2, scope='block0_2') # size/2

            net = _inverted_residual_block(net, 6, stride=1, expansion=1, scope='block1_0')

            dsn2 = net
            net = _inverted_residual_block(net, 12, stride=2, scope='block2_0') # size/4
            net = _inverted_residual_block(net, 12, stride=1, scope='block2_1')

            dsn3 = net
            net = _inverted_residual_block(net, 24, stride=2, scope='block3_0') # size/8
            net = _inverted_residual_block(net, 24, stride=1, scope='block3_1') 
            net = _inverted_residual_block(net, 24, stride=1, scope='block3_2')

            dsn4 = net
            net = _inverted_residual_block(net, 48, stride=2, scope='block4_0') # size/16
            net = _inverted_residual_block(net, 48, stride=1, scope='block4_1') 
            net = _inverted_residual_block(net, 48, stride=1, scope='block4_2') 
            net = _inverted_residual_block(net, 48, stride=1, scope='block4_3') 

            net = _inverted_residual_block(net, 64, stride=1, scope='block5_0') 
            net = _inverted_residual_block(net, 64, stride=1, scope='block5_1')
            net = _inverted_residual_block(net, 64, stride=1, scope='block5_2')

            dsn5 = net        
        ## dsn layers
        with tf.variable_scope('dsn1'):
            dsn1 = _dsn_1x1_conv2d(dsn1, 1)
            ## no need for deconv2d here, dsn1 already matches the input size

        with tf.variable_scope('dsn2'):
            dsn2 = _dsn_1x1_conv2d(dsn2, 1)
            dsn2 = _dsn_deconv2d_with_upsample_factor(dsn2, 1, upsample_factor=2)

        with tf.variable_scope('dsn3'):
            dsn3 = _dsn_1x1_conv2d(dsn3, 1)
            dsn3 = _dsn_deconv2d_with_upsample_factor(dsn3, 1, upsample_factor=4)

        with tf.variable_scope('dsn4'):
            dsn4 = _dsn_1x1_conv2d(dsn4, 1)
            dsn4 = _dsn_deconv2d_with_upsample_factor(dsn4, 1, upsample_factor=8)

        with tf.variable_scope('dsn5'):
            dsn5 = _dsn_1x1_conv2d(dsn5, 1)
            dsn5 = _dsn_deconv2d_with_upsample_factor(dsn5, 1, upsample_factor=16)

        # dsn fuse
        with tf.variable_scope('dsn_fuse'):
            dsn_fuse = tf.concat([dsn1, dsn2, dsn3, dsn4, dsn5], 3)
            dsn_fuse = _output_1x1_conv2d(dsn_fuse, 1)
    return dsn_fuse, dsn1, dsn2, dsn3, dsn4, dsn5

This MobileNet V2 style HED network has the same overall structure as the VGG style HED; the convolution layers used in VGG are simply replaced with the corresponding MobileNet V2 convolution layers. In addition, because the first convolution layer of MobileNet V2 uses stride=2 and therefore does not match the size of the dsn1 layer, two extra stride=1 plain convolution layers are added, and their output is used as the dsn1 layer.
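A quick way to confirm that every side output matches the input resolution is to print the shapes (a sketch; it assumes const.use_kernel_regularizer is defined in the surrounding project):

x = tf.placeholder(tf.float32, [None, 256, 256, 3])
is_training = tf.placeholder(tf.bool, name='is_training')

dsn_fuse, dsn1, dsn2, dsn3, dsn4, dsn5 = mobilenet_v2_style_hed(x, None, is_training)
for name, tensor in [('fuse', dsn_fuse), ('dsn1', dsn1), ('dsn5', dsn5)]:
    # every output should be [None, 256, 256, 1]
    print(name, tensor.get_shape().as_list())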

MobileNet V2 As Base Net

MobileNet was designed as a classification network for phone runtime environments, but just like ResNet, Inception, and Xception, which also perform classification, it can serve as the base net of a network built for other tasks, extracting feature maps from the input image. I have tried mobilenet_v2_style_unet, mobilenet_v2_style_deeplab_v3plus, and mobilenet_v2_style_ssd, and all of them produced visible results.
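The pattern is always the same: take the feature maps in end_points at different scales and feed them into a task-specific decoder. Wiring them into a U-Net style decoder might look roughly like this (purely a sketch; upsample_and_concat is a hypothetical helper, not part of the code above):

output, end_points = mobilenet_v2(x, is_training)

# end_points['block2'] ... end_points['block6'] hold feature maps at
# 1/4, 1/8, 1/16 and 1/32 of the input size; a decoder can upsample the
# deepest map step by step and merge it with the shallower ones.
net = end_points['block6']  # size/32
for skip_name in ['block4', 'block3', 'block2']:
    net = upsample_and_concat(net, end_points[skip_name])  # hypothetical helper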

Android Performance Bottleneck

As a reference point, running this mobilenet_v2_style_hed network on an iPhone 7 Plus together with the subsequent corner-point detection algorithm reaches 12 FPS, which basically meets the real-time requirement. But when we tried to deploy on Android, the FPS was very low even on expensive, well-equipped devices, with obvious stuttering.

After some investigation, a few clues emerged. On the iPhone 7 Plus, the computation is distributed as shown below:

The three kinds of operations in the red box take up most of the CPU time. A rough estimate with these numbers (per-frame times in milliseconds), 1000 / (32 + 30 + 10 + 6) ≈ 12.8, agrees well with the measured FPS. This shows that most of the computation time goes into the neural network, while the OpenCV corner-point detection algorithm takes very little time.
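Written out explicitly, the estimate is simply the reciprocal of the summed per-frame times (assumed here to be milliseconds read off the profiler):

op_times_ms = [32, 30, 10, 6]  # per-frame CPU time of the dominant ops, in ms
fps = 1000.0 / sum(op_times_ms)
print(fps)                     # ~12.8, matching the measured frame rate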

On Android, however, the picture is completely different, as shown below:

Computing with the numbers in the red box gives FPS = 1000 / (232 + 76 + 29 + 16) ≈ 2.8, which does not meet the real-time requirement. The figure also shows that on Android, Batch Normalization consumes a large amount of computation time, no longer in the same order of magnitude as the CPU time consumed by Conv2D, a completely different distribution from the iOS platform. Further debugging revealed that, for historical reasons, our Android app was restricted to 32-bit .so dynamic libraries. After switching to the 64-bit TensorFlow library and re-measuring in a standalone demo app, mobilenet_v2_style_hed on Android behaved close to iOS: still slower, but with the same distribution of CPU time.

So the performance bottleneck is that Batch Normalization executes inefficiently on 32-bit ARM CPUs. We tried recompiling the 32-bit TensorFlow library with various compiler optimization flags, but saw no clear improvement. The final solution was a compromise: use vgg_style_hed without Batch Normalization. After this adjustment, the Android measurements look like the figure below.
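For reference, the adjusted VGG-style layer simply drops the batch_normalization call and re-enables the bias term. A minimal sketch of what such a BN-free convolution layer could look like (the real _vgg_conv2d is not shown in this article, so this is an assumption about its shape):

def _vgg_conv2d_no_bn(inputs, filters, kernel_size):
    # without batch normalization, the conv layer needs its bias back
    outputs = tf.layers.conv2d(inputs,
                               filters,
                               kernel_size,
                               padding='same',
                               activation=None,
                               use_bias=True,
                               kernel_initializer=tf.contrib.layers.xavier_initializer())
    return tf.nn.relu(outputs)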

About TensorFlow Lite

When we deployed the model with TensorFlow 1.7, TensorFlow Lite did not yet support transposed convolution, so TF Lite was not used (a Lite version of transpose_conv.cc (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/kernels/transpose_conv.cc) can now be seen on GitHub). TensorFlow Lite is evolving quickly; when choosing a deployment option in the future, TensorFlow Lite should be preferred over TensorFlow Mobile.

References

xavier init

How to do Xavier initialization on TensorFlow (https://stackoverflow.com/questions/33640581/how-to-do-xavier-initialization-on-tensorflow/36784797)

聊一聊深度学习的weight initialization (https://zhuanlan.zhihu.com/p/25110150)

Batch Normalization

Understanding the backward pass through Batch Normalization Layer (https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)

机器学习里的黑色艺术:normalization, standardization, regularization (https://zhuanlan.zhihu.com/p/29974820)

How could I use Batch Normalization in TensorFlow? (https://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow)

add Batch Normalization immediately before non-linearity or after in Keras? (https://github.com/keras-team/keras/issues/5465)

1x1 Convolution

What does 1x1 convolution mean in a neural network? (https://stats.stackexchange.com/questions/194142/what-does-1x1-convolution-mean-in-a-neural-network)

How are 1x1 convolutions the same as a fully connected layer? (https://datascience.stackexchange.com/questions/12830/how-are-1x1-convolutions-the-same-as-a-fully-connected-layer)

One by One [ 1 x 1 ] Convolution - counter-intuitively useful (https://iamaaditya.github.io/2016/03/one-by-one-convolution/)

Upsampling && Transposed Convolution

Upsampling and Image Segmentation with Tensorflow and TF-Slim (http://warmspringwinds.github.io/tensorflow/tf-slim/2016/11/22/upsampling-and-image-segmentation-with-tensorflow-and-tf-slim/)

Image Segmentation using deconvolution layer in Tensorflow (http://cv-tricks.com/image-segmentation/transpose-convolution-in-tensorflow/)

ResNet && Inception && Xception

Network In Network architecture: The beginning of Inception (http://teleported.in/posts/network-in-network/)

ResNets, HighwayNets, and DenseNets, Oh My! (https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)

Inception modules: explained and implemented (https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/)

TensorFlow implementation of the Xception Model by François Chollet (https://github.com/kwotsin/TensorFlow-Xception)

TensorFlow Lite

TensorFlow Lite 深度解析 (http://developers.googleblog.cn/2018/06/tensorflow-lite-overview.html)
