
Deep Learning Series Notes 15: Recurrent Neural Networks


Vectorizing text means transforming text into numeric tensors. This can be done in several ways:

• Segment the text into words, and transform each word into a vector.

• Segment the text into characters, and transform each character into a vector.

• Extract n-grams of words or characters, and transform each n-gram into a vector. An n-gram is a group of several consecutive words or characters (neighbouring n-grams may overlap); a small sketch of word-level n-gram extraction appears below.

The units into which the text is broken down (words, characters, or n-grams) are called tokens, and breaking text down into tokens is called tokenization.

All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are what gets fed into deep neural networks.
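As a quick illustration of the n-gram idea, here is a minimal sketch in plain Python (the helper name extract_ngrams is made up for this note, not part of any library) that pulls overlapping word-level bigrams out of one of the sample sentences used later:

#Minimal sketch: extracting overlapping word-level n-grams (here bigrams) from a sentence
def extract_ngrams(text, n=2):
    words = text.split()
    #Each n-gram is a tuple of n consecutive words; neighbouring n-grams overlap
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(extract_ngrams('The cat sat on the mat.'))
#[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]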

It's best to think of the Embedding layer as a dictionary that maps integer indices (each standing for a specific word) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors; the Embedding layer is effectively a dictionary lookup.
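To make the dictionary-lookup behaviour concrete, the following is a small sketch (assuming the same keras API used in the code section below; the dummy indices are arbitrary) that pushes a batch of integer word indices through an untrained Embedding layer and checks the shape of the vectors it returns:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

#1000 possible tokens, each mapped to a dense 8-dimensional vector
model = Sequential()
model.add(Embedding(1000, 8, input_length=4))

#A batch of 2 "sentences", each a sequence of 4 integer word indices
dummy_input = np.array([[3, 15, 42, 7],
                        [9, 1, 0, 512]])
vectors = model.predict(dummy_input)
print(vectors.shape)  #(2, 4, 8): one 8-dimensional vector looked up per index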

A recurrent neural network (RNN) processes a sequence by iterating over its elements while keeping a state that contains information about what it has seen so far; in effect, an RNN is a type of neural network with an internal loop. The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews), so one sequence is still considered a single data point: a single input to the network. What changes is that this data point is no longer processed in a single step; rather, the network internally loops over the sequence elements.
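The loop and the carried state can be spelled out in a few lines of NumPy; the sketch below, with random weights standing in for learned parameters and arbitrary dimensions, processes a single sequence timestep by timestep, mixing the current input with the previous state:

import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))   #one input sequence
state_t = np.zeros((output_features,))                   #initial state: all zeros

#Random weight matrices stand in for learned parameters
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:
    #Combine the current input with the previous state (the network's "memory")
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t                                   #this output becomes the next state

final_output_sequence = np.stack(successive_outputs, axis=0)  #shape (timesteps, output_features)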

The LSTM layer is a variant of the SimpleRNN layer that adds a way to carry information across many timesteps. Imagine a conveyor belt running parallel to the sequence you're processing: information from the sequence can jump onto the belt at any point, be transported to a later timestep, and jump off, intact, when it's needed. That is essentially what LSTM does: it saves information for later, thus preventing older signals from gradually vanishing during processing.
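The "conveyor belt" shows up as an extra carry tensor c_t that is updated at every timestep alongside the ordinary output state. The sketch below is a rough NumPy illustration of that data flow (random weights, arbitrary sizes, gate names i/f/k/o chosen for this note), not the actual Keras LSTM implementation:

import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))

state_t = np.zeros((output_features,))   #output state, as in the simple RNN above
c_t = np.zeros((output_features,))       #the carry: the LSTM's "conveyor belt"

def random_params():
    #One (W, U, b) triple per gate; random values stand in for learned weights
    return (np.random.random((output_features, input_features)),
            np.random.random((output_features, output_features)),
            np.random.random((output_features,)))

(Wi, Ui, bi), (Wf, Uf, bf), (Wk, Uk, bk), (Wo, Uo, bo) = [random_params() for _ in range(4)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for input_t in inputs:
    i_t = sigmoid(np.dot(Wi, input_t) + np.dot(Ui, state_t) + bi)   #input gate
    f_t = sigmoid(np.dot(Wf, input_t) + np.dot(Uf, state_t) + bf)   #forget gate
    k_t = np.tanh(np.dot(Wk, input_t) + np.dot(Uk, state_t) + bk)   #candidate values
    o_t = sigmoid(np.dot(Wo, input_t) + np.dot(Uo, state_t) + bo)   #output gate
    c_t = f_t * c_t + i_t * k_t          #the carry is updated, letting older signals flow through
    state_t = o_t * np.tanh(c_t)         #this timestep's output, fed back as the next state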

Advanced usage of recurrent neural networks

• Recurrent dropout: a specific, built-in way to use dropout inside recurrent layers to fight overfitting (a minimal sketch appears after this list).

• Stacking recurrent layers: this increases the representational power of the network (at the cost of a higher computational load).

• Bidirectional recurrent layers: these present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues.
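For the first point, note that Keras exposes two separate constructor arguments: dropout applies to the layer's inputs, while recurrent_dropout applies to the recurrent state (the code section below only uses dropout). A minimal sketch, assuming the same IMDB setup as that code:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(10000, 32))
#dropout drops units of the inputs, recurrent_dropout drops units of the recurrent state
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])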

Code examples

import numpy as np
#Word-level one-hot encoding
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1
#token_index={'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
max_length = 10
results = np.zeros(shape=(len(samples),max_length,max(token_index.values()) + 1))
#len(samples)=2, max_length=10, max(token_index.values())+1=11, so results has shape (2, 10, 11)
for i, sample in enumerate(samples):  #i indexes the sample (sentence)
    #0 The cat sat on the mat.
    #1 The dog ate my homework.
    for j, word in list(enumerate(sample.split()))[:max_length]:  #j indexes the word within the sample
        # list(enumerate(sample.split()))[:max_length]
        # [(0, 'The'), (1, 'cat'), (2, 'sat'), (3, 'on'), (4, 'the'), (5, 'mat.')]
        # [(0, 'The'), (1, 'dog'), (2, 'ate'), (3, 'my'), (4, 'homework.')]
        index = token_index.get(word)   #index is the word's position in the vocabulary
        results[i, j, index] = 1.
        #print('results[',i, ',',j, ',',index,']=',results[i, j, index] )
#print(results)

#Character-level one-hot encoding
import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable
#Map every printable character to an integer index starting at 1; the characters must be
#the dictionary keys so that token_index.get(character) below finds them
token_index = dict(zip(characters, range(1, len(characters) + 1)))
#token_index = {'0': 1, '1': 2, '2': 3, ..., 'a': 11, 'b': 12, ..., 'A': 37, ..., '!': 63, ..., ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample):
        index = token_index.get(character)
        results[i, j, index] = 1.
#print(results)

#Word-level one-hot encoding with Keras
from keras.preprocessing.text import Tokenizer
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
#Create a tokenizer, configured to only take into account the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
#tokenizer= <keras_preprocessing.text.Tokenizer object at 0x000001AC2C60A7B8>
#Build the word index
tokenizer.fit_on_texts(samples)
tokenizer.num_words
#1000
tokenizer.document_count
#2
tokenizer.word_counts
#OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])
tokenizer.index_docs
#defaultdict(<class 'int'>, {4: 1, 3: 1, 2: 1, 1: 2, 5: 1, 8: 1, 7: 1, 6: 1, 9: 1})
tokenizer.index_word
#{1: 'the', 2: 'cat', 3: 'sat', 4: 'on', 5: 'mat', 6: 'dog', 7: 'ate', 8: 'my', 9: 'homework'}
tokenizer.word_index
#{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
tokenizer.__dict__
#{
# 'word_counts': OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)]),
# 'word_docs': defaultdict(<class 'int'>, {'mat': 1, 'sat': 1, 'the': 2, 'cat': 1, 'on': 1, 'dog': 1, 'homework': 1, 'ate': 1, 'my': 1}),
# 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
# 'split': ' ',
# 'lower': True,
# 'num_words': 1000,
# 'document_count': 2,
# 'char_level': False,
# 'oov_token': None,
# 'index_docs': defaultdict(<class 'int'>, {5: 1, 3: 1, 1: 2, 2: 1, 4: 1, 6: 1, 9: 1, 7: 1, 8: 1}),
# 'word_index': {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9},
# 'index_word': {1: 'the', 2: 'cat', 3: 'sat', 4: 'on', 5: 'mat', 6: 'dog', 7: 'ate', 8: 'my', 9: 'homework'}}
#Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
#sequences= [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
#one_hot_results= [[0. 1. 1. ... 0. 0. 0.]
#Recover the word index that was computed
word_index = tokenizer.word_index
#word_index= {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
print('Found %s unique tokens.' % len(word_index))
#Found 9 unique tokens.

#Word-level one-hot encoding with the hashing trick
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
#Store the words as vectors of size 1000. If you have close to 1000 words (or more),
#you will see many hash collisions, which lowers the accuracy of this encoding method
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        #Hash the word into a "random" integer index between 0 and dimensionality - 1
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
print(results)


from keras.preprocessing.text import Tokenizer
#Keras word-level one-hot encoding splits tokens on whitespace, so Chinese text must be segmented into words beforehand
samples = ['我 爱 北京 天安门', '天安门 上 太阳 升']
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

#------------ Learning word embeddings with the Embedding layer --------------
#Instantiate an Embedding layer
from keras.layers import Embedding
#The Embedding layer takes at least two arguments: the number of possible tokens (here 1000, i.e. maximum word index + 1) and the dimensionality of the embeddings (here 64)
embedding_layer = Embedding(1000, 64)

#Load the IMDB data for use with an Embedding layer
from keras.datasets import imdb
from keras import preprocessing
#Number of words to consider as features
max_features = 10000
#Cut the texts off after this many words (among the max_features most common words)
maxlen = 20
#Load the data as lists of integers
#(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="D:/Python36/Coding/PycharmProjects/ttt/imdb.npz", num_words=max_features)
# x_train.shape=(25000,)
#[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385,...])
#list([1, 194, 1153, 194, 8255, 78, 228, 5, 6,...])]
#Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
# x_train.shape=(25000, 20)
# [[  65   16   38 ...   19  178   32]
# [  23    4 1690 ...   16  145   95]
# [1352   13  191 ...    7  129  113]

#Using an Embedding layer and classifier on the IMDB data
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
model = Sequential()
#Specify the maximum input length to the Embedding layer so the embedded inputs can be flattened later. The Embedding layer's activations have shape (samples, maxlen, 8)
model.add(Embedding(10000, 8, input_length=maxlen))
#Flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
#Add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
#1s 61us/step - loss: 0.2839 - acc: 0.8860 - val_loss: 0.5302 - val_acc: 0.7464

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Dense Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Dense Training and validation loss')
plt.legend()
plt.show()

#SimpleRNN example
#SimpleRNN can be run in two different modes: one returns the full sequence of successive outputs for every timestep, a 3D tensor of shape (batch_size, timesteps, output_features);
#the other returns only the last output for each input sequence, a 2D tensor of shape (batch_size, output_features).
#These two modes are controlled by the return_sequences constructor argument.
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
print(model.summary())
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_2 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_1 (SimpleRNN)     (None, 32)                2080
#=================================================================
#Total params: 322,080
#Trainable params: 322,080
#Non-trainable params: 0
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
print(model.summary())
#_________________________________________________________________
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_3 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080
#=================================================================
#Total params: 322,080
#Trainable params: 322,080
#Non-trainable params: 0

#To increase the representational power of a network it is sometimes useful to stack several recurrent layers; in that case, all intermediate layers must return their full sequence of outputs
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
print(model.summary())
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_4 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_3 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_4 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_5 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_6 (SimpleRNN)     (None, 32)                2080
#=================================================================
#Total params: 328,320
#Trainable params: 328,320
#Non-trainable params: 0

#Applying a SimpleRNN model to the IMDB movie-review classification problem
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000
maxlen = 500
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(path="D:/Python36/Coding/PycharmProjects/ttt/imdb.npz", num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
#Train the model with an Embedding layer and a SimpleRNN layer
from keras.layers import Dense
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Simple RNN Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Simple RNN Training and validation loss')
plt.legend()
plt.show()


#Using the LSTM layer in Keras
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('LSTM Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('LSTM Training and validation loss')
plt.legend()
plt.show()


#Using an LSTM layer with dropout in Keras
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32,dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('LSTM with dropout Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('LSTM with dropout Training and validation loss')
plt.legend()
plt.show()

#Training and evaluating a bidirectional LSTM
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras import layers
from keras.models import Sequential
model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Bidirectional LSTM Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Bidirectional LSTM Training and validation loss')
plt.legend()
plt.show()