前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >使用腾讯云 GPU 学习深度学习系列之五:文字的识别与定位

使用腾讯云 GPU 学习深度学习系列之五:文字的识别与定位

原创
作者头像
集智
修改2017-06-19 18:55:50
7.9K5
修改2017-06-19 18:55:50
举报

这是《使用腾讯云GPU学习深度学习》系列文章的第五篇,以车牌识别和简单OCR为例,谈了谈如何进行字母、数字的识别以及定位。本系列文章主要介绍如何使用腾讯云GPU服务器进行深度学习运算,前面主要介绍原理部分,后期则以实践为主。

往期内容:

  1. 使用腾讯云 GPU 学习深度学习系列之一:传统机器学习的回顾
  2. 使用腾讯云 GPU 学习深度学习系列之二:Tensorflow 简明原理
  3. 使用腾讯云 GPU 学习深度学习系列之三:搭建深度神经网络
  4. 使用腾讯云 GPU 学习深度学习系列之四:深度学习的特征工

上一节,我们简要介绍了一些与深度学习相关的数据预处理方法。其中我们特别提到,使用 基于深度学习的 Spatial Transform 方法,可以让“草书” 字体的手写数字同样也可以被高效识别。

但无论是工整书写的 Tensorflow 官网上的 MNIST 教程,还是上节提到“草书”数字,都是 单一的数字识别问题。 但是,在实际生活中,遇到数字、字母识别问题时,往往需要识别一组数字。这时候一个简单的深度神经网络可能就做不到了。本节内容,就是在讨论遇到这种情况时,应该如何调整深度学习模型。

1. 固定长度

固定长度的字符、数字识别,比较常见的应用场景包括:

  • 识别验证码
  • 识别机动车车牌

识别验证码的方法,这篇文章 有详细介绍。不过该文章使用的是版本较早的 Keras1,实际使用时会有一些问题。如果想尝试,根据Jupyter 的提示更改就好,最终效果也是相当不错:

[png]
[png]

我们这里要识别的内容,是中华人民共和国机动车车牌。相比上面例子的 4 位验证码,车牌长度更长,达到了 7 位,并且内容也更加丰富,第一位是各省的汉字简称,第二位是 A-Z 的大写字母,3-7位则是数字、字母混合。

由于车牌涉及个人隐私,我们使用了用户 szad670401 在 Github 上开源的一个车牌生成器,随机的生成一些车牌的图片,用于模型训练。当然这个项目同样提供了完整的 MXNet 深度学习框架编写的代码,我们接下来会用 Keras 再写一个。

首先做些准备工作,从 szad670401 的开源项目中获取必要的文件:

### 从 szad670401 github 项目下载车牌生成器以及字体文件
!git clone https://github.com/szad670401/end-to-end-for-chinese-plate-recognition
!cp -r end-to-end-for-chinese-plate-recognition/* ./
!sed 's/for i in range(batchSize):/l_plateStr = []\n        l_plateImg = []\n        for i in range(batchSize):/g' ./genplate.py  | sed 's/cv2.imwrite(outputPath/l_plateStr.append(plateStr)\n                l_plateImg.append(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))\n                #cv2.imwrite(outputPath/g' | sed 's/img);/img);\n        return l_plateStr,l_plateImg/g'  >genplateRev.py

来看看生成器的效果:

from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.layers import Conv2D, MaxPool2D, Flatten, Dropout, Dense, Input
from keras.optimizers import Adam
from keras.backend.tensorflow_backend import set_session
from keras.utils.vis_utils import model_to_dot
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

from IPython.display import SVG

from genplate import *

%matplotlib inline

np.random.seed(5)
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
set_session(tf.Session(config=config))

chars = ["京", "沪", "津", "渝", "冀", "晋", "蒙", "辽", "吉", "黑", "苏", "浙", "皖", "闽", "赣", "鲁", "豫", "鄂", "湘", "粤", "桂",
             "琼", "川", "贵", "云", "藏", "陕", "甘", "青", "宁", "新", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A",
             "B", "C", "D", "E", "F", "G", "H", "J", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "X",
             "Y", "Z"
             ];

M_strIdx = dict(zip(chars, range(len(chars))))

n_generate = 100
rows = 20
cols = int(n_generate/rows)

G = GenPlate("./font/platech.ttf",'./font/platechar.ttf',"./NoPlates")
l_plateStr,l_plateImg = G.genBatch(100,2,range(31,65),"./plate",(272,72))

l_out = []
for i in range(rows):
    l_tmp = []
    for j in range(cols):
        l_tmp.append(l_plateImg[i*cols+j])

    l_out.append(np.hstack(l_tmp))

    fig = plt.figure(figsize=(10, 10))
    ax  = fig.add_subplot(111)
    ax.imshow( np.vstack(l_out), aspect="auto" )

[1496633839441_7548_1496633834423.png]
[1496633839441_7548_1496633834423.png]

看来 szad670401 开源的车牌生成器,随机生成的车牌确实达到了以假乱真的效果。于是我们基于这个生成器,再自己写一个生成器,用于深度神经网络的数据输入:

def gen(batch_size=32):
    while True:
        l_plateStr,l_plateImg = G.genBatch(batch_size, 2, range(31,65),"./plate",(272,72))
        X = np.array(l_plateImg, dtype=np.uint8)
        ytmp = np.array(list(map(lambda x: [M_strIdx[a] for a in list(x)], l_plateStr)), dtype=np.uint8)
        y = np.zeros([ytmp.shape[1],batch_size,len(chars)])
        for batch in range(batch_size):
            for idx,row_i in enumerate(ytmp[batch]):
                y[idx,batch,row_i] = 1

        yield X, [yy for yy in y]

因为是固定长度,所以我们有个想法,就是既然我们知道识别七次,那就可以用七个模型按照顺序识别。这个思路没有问题,但实际上根据之前卷积神经网络的原理,实际上卷积神经网络在扫描整张图片的过程中,已经对整个图像的内容以及相对位置关系有所了解,所以,七个模型的卷积层实际上是可以共享的。我们实际上可以用一个 一组卷积层+7个全链接层 的架构,来对应输入的车牌图片:

adam = Adam(lr=0.001)

input_tensor = Input((72, 272, 3))
x = input_tensor
for i in range(3):
    x = Conv2D(32*2**i, (3, 3), activation='relu')(x)
    x = Conv2D(32*2**i, (3, 3), activation='relu')(x)
    x = MaxPool2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dropout(0.25)(x)

n_class = len(chars)
x = [Dense(n_class, activation='softmax', name='c%d'%(i+1))(x) for i in range(7)]
model = Model(inputs=input_tensor, outputs=x)
model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])

SVG(model_to_dot(model=model, show_layer_names=True, show_shapes=True).create(prog='dot', format='svg'))

[1496633860342_2464_1496633853445.png]
[1496633860342_2464_1496633853445.png]

训练模型:

best_model = ModelCheckpoint("chepai_best.h5", monitor='val_loss', verbose=0, save_best_only=True)

model.fit_generator(gen(32), steps_per_epoch=2000, epochs=5,
                    validation_data=gen(32), validation_steps=1280,
                    callbacks=[best_model]
)

Epoch 1/5
2000/2000 [==============================] - 547s - loss: 11.1077 - c1_loss: 1.3878 - c2_loss: 0.7512 - c3_loss: 1.1270 - c4_loss: 1.3997 - c5_loss: 1.7955 - c6_loss: 2.3060 - c7_loss: 2.3405 - c1_acc: 0.6157 - c2_acc: 0.7905 - c3_acc: 0.6831 - c4_acc: 0.6041 - c5_acc: 0.5025 - c6_acc: 0.3790 - c7_acc: 0.3678 - val_loss: 3.1323 - val_c1_loss: 0.1970 - val_c2_loss: 0.0246 - val_c3_loss: 0.0747 - val_c4_loss: 0.2076 - val_c5_loss: 0.5099 - val_c6_loss: 1.0774 - val_c7_loss: 1.0411 - val_c1_acc: 0.9436 - val_c2_acc: 0.9951 - val_c3_acc: 0.9807 - val_c4_acc: 0.9395 - val_c5_acc: 0.8535 - val_c6_acc: 0.7065 - val_c7_acc: 0.7190
Epoch 2/5
2000/2000 [==============================] - 546s - loss: 2.7473 - c1_loss: 0.2008 - c2_loss: 0.0301 - c3_loss: 0.0751 - c4_loss: 0.1799 - c5_loss: 0.4407 - c6_loss: 0.9450 - c7_loss: 0.8757 - c1_acc: 0.9416 - c2_acc: 0.9927 - c3_acc: 0.9790 - c4_acc: 0.9467 - c5_acc: 0.8740 - c6_acc: 0.7435 - c7_acc: 0.7577 - val_loss: 1.4777 - val_c1_loss: 0.1039 - val_c2_loss: 0.0118 - val_c3_loss: 0.0300 - val_c4_loss: 0.0665 - val_c5_loss: 0.2145 - val_c6_loss: 0.5421 - val_c7_loss: 0.5090 - val_c1_acc: 0.9725 - val_c2_acc: 0.9978 - val_c3_acc: 0.9937 - val_c4_acc: 0.9824 - val_c5_acc: 0.9393 - val_c6_acc: 0.8524 - val_c7_acc: 0.8609
Epoch 3/5
2000/2000 [==============================] - 544s - loss: 1.7686 - c1_loss: 0.1310 - c2_loss: 0.0156 - c3_loss: 0.0390 - c4_loss: 0.0971 - c5_loss: 0.2689 - c6_loss: 0.6416 - c7_loss: 0.5754 - c1_acc: 0.9598 - c2_acc: 0.9961 - c3_acc: 0.9891 - c4_acc: 0.9715 - c5_acc: 0.9213 - c6_acc: 0.8223 - c7_acc: 0.8411 - val_loss: 1.0954 - val_c1_loss: 0.0577 - val_c2_loss: 0.0088 - val_c3_loss: 0.0229 - val_c4_loss: 0.0530 - val_c5_loss: 0.1557 - val_c6_loss: 0.4247 - val_c7_loss: 0.3726 - val_c1_acc: 0.9849 - val_c2_acc: 0.9987 - val_c3_acc: 0.9948 - val_c4_acc: 0.9861 - val_c5_acc: 0.9569 - val_c6_acc: 0.8829 - val_c7_acc: 0.8994
Epoch 4/5
2000/2000 [==============================] - 544s - loss: 1.4012 - c1_loss: 0.1063 - c2_loss: 0.0120 - c3_loss: 0.0301 - c4_loss: 0.0754 - c5_loss: 0.2031 - c6_loss: 0.5146 - c7_loss: 0.4597 - c1_acc: 0.9677 - c2_acc: 0.9968 - c3_acc: 0.9915 - c4_acc: 0.9773 - c5_acc: 0.9406 - c6_acc: 0.8568 - c7_acc: 0.8731 - val_loss: 0.8221 - val_c1_loss: 0.0466 - val_c2_loss: 0.0061 - val_c3_loss: 0.0122 - val_c4_loss: 0.0317 - val_c5_loss: 0.1085 - val_c6_loss: 0.3181 - val_c7_loss: 0.2989 - val_c1_acc: 0.9870 - val_c2_acc: 0.9986 - val_c3_acc: 0.9969 - val_c4_acc: 0.9910 - val_c5_acc: 0.9696 - val_c6_acc: 0.9117 - val_c7_acc: 0.9182
Epoch 5/5
2000/2000 [==============================] - 553s - loss: 1.1712 - c1_loss: 0.0903 - c2_loss: 0.0116 - c3_loss: 0.0275 - c4_loss: 0.0592 - c5_loss: 0.1726 - c6_loss: 0.4305 - c7_loss: 0.3796 - c1_acc: 0.9726 - c2_acc: 0.9971 - c3_acc: 0.9925 - c4_acc: 0.9825 - c5_acc: 0.9503 - c6_acc: 0.8821 - c7_acc: 0.8962 - val_loss: 0.7210 - val_c1_loss: 0.0498 - val_c2_loss: 0.0079 - val_c3_loss: 0.0132 - val_c4_loss: 0.0303 - val_c5_loss: 0.0930 - val_c6_loss: 0.2810 - val_c7_loss: 0.2458 - val_c1_acc: 0.9862 - val_c2_acc: 0.9987 - val_c3_acc: 0.9971 - val_c4_acc: 0.9915 - val_c5_acc: 0.9723 - val_c6_acc: 0.9212 - val_c7_acc: 0.9336

可见五轮训练后,即便是位置靠后的几位车牌,也实现了 93% 的识别准确率。

展示下模型预测结果:

myfont = FontProperties(fname='./font/Lantinghei.ttc')  
matplotlib.rcParams['axes.unicode_minus']=False  

fig = plt.figure(figsize=(12,12))
l_titles = list(map(lambda x: "".join([M_idxStr[xx] for xx in x]), np.argmax(np.array(model.predict( np.array(l_plateImg) )), 2).T))
for idx,img in enumerate(l_plateImg[0:40]):
    ax = fig.add_subplot(10,4,idx+1)
    ax.imshow(img)
    ax.set_title(l_titles[idx],fontproperties=myfont)
    ax.set_axis_off()

[1496633881060_2501_1496633874378.png]
[1496633881060_2501_1496633874378.png]

可见预测的其实相当不错,很多字体已经非常模糊,模型仍然可以看出来。图中一个错误是 皖TQZ680 被预测成了 皖TQZG8D,当然这也和图片裁剪不当有一定的关系。

2. 不固定长度

车牌的应用场景中,我们固定了长度为7位,并且基于这个预设设计了卷积神经网络。但是在实际运用中,可能长度并不固定。此时如果长度过长,用这个架构也将会导致参数过多,占用过多显存。

针对这种情况,Keras 的案例中,提供了一种基于循环神经网络的方法,在 Keras Example 中有写到。具体而言,就是数据首先通过卷积神经网络部分扫描特征,然后通过循环神经网络部分,同时从左到右、从右到左扫描特征,最后基于扫描的结果,通过计算 Conectionist Temporal Classification(CTC) 损失函数,完成模型训练。

2.1. 循环神经网络

使用循环神经网络,是因为循环神经网络有一个很重要的特点,就是相邻的节点之间,可以相互影响。这里相邻节点,既可以是时间上的(前一秒数据和后一秒数据),也可以是位置关系上的,比如我们这里从左向右扫描,左边一列的扫描结果会影响右边一列的扫描结果。

[image.png]
[image.png]

图片来源:知乎:CNN(卷积神经网络)、RNN(循环神经网络)、DNN(深度神经网络)的内部网络结构有什么区别

2.2. CTC 损失函数

同时,对于循环神经网络的结果,由于长度不固定,可能会有空间上的“错配”:

[image.png]
[image.png]

图片来源:Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

但由于这种错配实际上并没有什么严重的影响,如上图所示, __TH____E_T__H__EE 其实都是 THE 这个单词,因此这里这种错配在损失函数的优化环节中,是需要被忽略掉的。于是这里就使用了CTC 优化函数。CTC 可以在计算过程中,通过综合所有可能情况的排列组合,进而忽略相对的位置关系。

Keras 的 CTC loss 函数位于 https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py 这个文件中,内容如下:

import tensorflow as tf
from tensorflow.python.ops import ctc_ops as ctc

#...

def ctc_batch_cost(y_true, y_pred, input_length, label_length):
    """Runs CTC loss algorithm on each batch element.
    # Arguments
        y_true: tensor `(samples, max_string_length)`
            containing the truth labels.
        y_pred: tensor `(samples, time_steps, num_categories)`
            containing the prediction, or output of the softmax.
        input_length: tensor `(samples, 1)` containing the sequence length for
            each batch item in `y_pred`.
        label_length: tensor `(samples, 1)` containing the sequence length for
            each batch item in `y_true`.
    # Returns
        Tensor with shape (samples,1) containing the
            CTC loss of each element.
    """
    label_length = tf.to_int32(tf.squeeze(label_length))
    input_length = tf.to_int32(tf.squeeze(input_length))
    sparse_labels = tf.to_int32(ctc_label_dense_to_sparse(y_true, label_length))

    y_pred = tf.log(tf.transpose(y_pred, perm=[1, 0, 2]) + 1e-8)

    return tf.expand_dims(ctc.ctc_loss(inputs=y_pred,
                                       labels=sparse_labels,
                                       sequence_length=input_length), 1)

3.3. 完整代码

首先是一些必要的函数:

import os
import itertools
import re
import datetime
import cairocffi as cairo
import editdistance
import numpy as np
from scipy import ndimage
import pylab

from keras import backend as K
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.layers import Input, Dense, Activation, Reshape, Lambda
from keras.layers.merge import add, concatenate
from keras.layers.recurrent import GRU
from keras.models import Model
from keras.optimizers import SGD
from keras.utils.data_utils import get_file
from keras.preprocessing import image
from keras.callbacks import EarlyStopping,Callback

from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
import matplotlib.pyplot as plt

%matplotlib inline

config = tf.ConfigProto()
config.gpu_options.allow_growth=True
set_session(tf.Session(config=config))


OUTPUT_DIR = 'image_ocr'

np.random.seed(55)

# 从 Keras 官方文件中 import 相关的函数
!wget https://raw.githubusercontent.com/fchollet/keras/master/examples/image_ocr.py
from image_ocr import *

必要的参数:

run_name = datetime.datetime.now().strftime('%Y:%m:%d:%H:%M:%S')
start_epoch = 0
stop_epoch  = 200
img_w = 128
img_h = 64
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))

# Network parameters
conv_filters = 16
kernel_size = (3, 3)
pool_size = 2
time_dense_size = 32
rnn_size = 512
input_shape = (img_w, img_h, 1)

使用这些函数以及对应参数构建生成器,生成不固定长度的验证码:

fdir = os.path.dirname(get_file('wordlists.tgz',
                                    origin='http://www.mythic-ai.com/datasets/wordlists.tgz', untar=True))

img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                                 bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                                 minibatch_size=32,
                                 img_w=img_w,
                                 img_h=img_h,
                                 downsample_factor=(pool_size ** 2),
                                 val_split=words_per_epoch - val_words
                                 )
act = 'relu'

构建网络:

input_data = Input(name='the_input', shape=input_shape, dtype='float32')
inner = Conv2D(conv_filters, kernel_size, padding='same',
                   activation=act, kernel_initializer='he_normal',
                   name='conv1')(input_data)
inner = MaxPooling2D(pool_size=(pool_size, pool_size), name='max1')(inner)
inner = Conv2D(conv_filters, kernel_size, padding='same',
                   activation=act, kernel_initializer='he_normal',
                   name='conv2')(inner)
inner = MaxPooling2D(pool_size=(pool_size, pool_size), name='max2')(inner)

conv_to_rnn_dims = (img_w // (pool_size ** 2), (img_h // (pool_size ** 2)) * conv_filters)
inner = Reshape(target_shape=conv_to_rnn_dims, name='reshape')(inner)

# cuts down input size going into RNN:
inner = Dense(time_dense_size, activation=act, name='dense1')(inner)

# Two layers of bidirecitonal GRUs
# GRU seems to work as well, if not better than LSTM:
gru_1 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru1')(inner)
gru_1b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru1_b')(inner)
gru1_merged = add([gru_1, gru_1b])
gru_2 = GRU(rnn_size, return_sequences=True, kernel_initializer='he_normal', name='gru2')(gru1_merged)
gru_2b = GRU(rnn_size, return_sequences=True, go_backwards=True, kernel_initializer='he_normal', name='gru2_b')(gru1_merged)

# transforms RNN output to character activations:
inner = Dense(img_gen.get_output_size(), kernel_initializer='he_normal',
                  name='dense2')(concatenate([gru_2, gru_2b]))
y_pred = Activation('softmax', name='softmax')(inner)

Model(inputs=input_data, outputs=y_pred).summary()
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
# Keras doesn't currently support loss funcs with extra parameters
# so CTC loss is implemented in a lambda layer

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length])

# clipnorm seems to speeds up convergence
sgd = SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)

model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)

# the loss calc occurs elsewhere, so use a dummy lambda func for the loss
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=sgd)
if start_epoch > 0:
    weight_file = os.path.join(OUTPUT_DIR, os.path.join(run_name, 'weights%02d.h5' % (start_epoch - 1)))
    model.load_weights(weight_file)

# captures output of softmax so we can decode the output during visualization
test_func = K.function([input_data], [y_pred])

# 反馈函数,即运行固定次数后,执行反馈函数可保存模型,并且可视化当前训练的效果
viz_cb = VizCallback(run_name, test_func, img_gen.next_val())

模型完整架构如下图所示:

[1496633935790_1069_1496633928873.png]
[1496633935790_1069_1496633928873.png]

执行训练:

model.fit_generator(generator=img_gen.next_train(), steps_per_epoch=(words_per_epoch - val_words),
                        epochs=stop_epoch, validation_data=img_gen.next_val(), validation_steps=val_words,
                        callbacks=[EarlyStopping(patience=10), viz_cb, img_gen], initial_epoch=start_epoch)

Epoch 1/200
12799/12800 [============================>.] - ETA: 0s - loss: 0.4932
Out of 256 samples:  Mean edit distance: 0.000 Mean normalized edit distance: 0.000
12800/12800 [==============================] - 2025s - loss: 0.4931 - val_loss: 3.7432e-04

完成一个 Epoch 后,输出文件夹 image_ocr 里,可以看到,一轮训练后,我们模型训练效果如下:

[1496633954584_9937_1496633947592.png]
[1496633954584_9937_1496633947592.png]

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 1. 固定长度
  • 2. 不固定长度
    • 2.1. 循环神经网络
      • 2.2. CTC 损失函数
        • 3.3. 完整代码
        相关产品与服务
        验证码
        腾讯云新一代行为验证码(Captcha),基于十道安全栅栏, 为网页、App、小程序开发者打造立体、全面的人机验证。最大程度保护注册登录、活动秒杀、点赞发帖、数据保护等各大场景下业务安全的同时,提供更精细化的用户体验。
        领券
        问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档