
Deep Learning Series Notes 15: Recurrent Neural Networks


Vectorizing text means transforming text into numeric tensors. This can be done in several ways:

• Segment the text into words, and transform each word into a vector.

• Segment the text into characters, and transform each character into a vector.

• Extract n-grams of words or characters, and transform each n-gram into a vector. An n-gram is a group of several consecutive words or characters (neighbouring n-grams may overlap); a small sketch of word-level n-gram extraction appears below.

The units into which the text is broken down (words, characters, or n-grams) are called tokens, and breaking text down into tokens is called tokenization.

All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are what gets fed into deep neural networks.
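As a quick illustration of the n-gram idea, here is a minimal sketch in plain Python (the helper name extract_ngrams is made up for this note, not part of any library) that pulls overlapping word-level bigrams out of one of the sample sentences used later:

#Minimal sketch: extracting overlapping word-level n-grams (here bigrams) from a sentence
def extract_ngrams(text, n=2):
    words = text.split()
    #Each n-gram is a tuple of n consecutive words; neighbouring n-grams overlap
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(extract_ngrams('The cat sat on the mat.'))
#[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]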

It's best to think of the Embedding layer as a dictionary that maps integer indices (each standing for a specific word) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors; the Embedding layer is effectively a dictionary lookup.
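To make the dictionary-lookup behaviour concrete, the following is a small sketch (assuming the same keras API used in the code section below; the dummy indices are arbitrary) that pushes a batch of integer word indices through an untrained Embedding layer and checks the shape of the vectors it returns:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

#1000 possible tokens, each mapped to a dense 8-dimensional vector
model = Sequential()
model.add(Embedding(1000, 8, input_length=4))

#A batch of 2 "sentences", each a sequence of 4 integer word indices
dummy_input = np.array([[3, 15, 42, 7],
                        [9, 1, 0, 512]])
vectors = model.predict(dummy_input)
print(vectors.shape)  #(2, 4, 8): one 8-dimensional vector looked up per index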

A recurrent neural network (RNN) processes a sequence by iterating over its elements while keeping a state that contains information about what it has seen so far; in effect, an RNN is a type of neural network with an internal loop. The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews), so one sequence is still considered a single data point: a single input to the network. What changes is that this data point is no longer processed in a single step; rather, the network internally loops over the sequence elements.
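The loop and the carried state can be spelled out in a few lines of NumPy; the sketch below, with random weights standing in for learned parameters and arbitrary dimensions, processes a single sequence timestep by timestep, mixing the current input with the previous state:

import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))   #one input sequence
state_t = np.zeros((output_features,))                   #initial state: all zeros

#Random weight matrices stand in for learned parameters
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:
    #Combine the current input with the previous state (the network's "memory")
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t                                   #this output becomes the next state

final_output_sequence = np.stack(successive_outputs, axis=0)  #shape (timesteps, output_features)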

The LSTM layer is a variant of the SimpleRNN layer that adds a way to carry information across many timesteps. Imagine a conveyor belt running parallel to the sequence you're processing: information from the sequence can jump onto the belt at any point, be transported to a later timestep, and jump off, intact, when it's needed. That is essentially what LSTM does: it saves information for later, thus preventing older signals from gradually vanishing during processing.
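The "conveyor belt" shows up as an extra carry tensor c_t that is updated at every timestep alongside the ordinary output state. The sketch below is a rough NumPy illustration of that data flow (random weights, arbitrary sizes, gate names i/f/k/o chosen for this note), not the actual Keras LSTM implementation:

import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))

state_t = np.zeros((output_features,))   #output state, as in the simple RNN above
c_t = np.zeros((output_features,))       #the carry: the LSTM's "conveyor belt"

def random_params():
    #One (W, U, b) triple per gate; random values stand in for learned weights
    return (np.random.random((output_features, input_features)),
            np.random.random((output_features, output_features)),
            np.random.random((output_features,)))

(Wi, Ui, bi), (Wf, Uf, bf), (Wk, Uk, bk), (Wo, Uo, bo) = [random_params() for _ in range(4)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for input_t in inputs:
    i_t = sigmoid(np.dot(Wi, input_t) + np.dot(Ui, state_t) + bi)   #input gate
    f_t = sigmoid(np.dot(Wf, input_t) + np.dot(Uf, state_t) + bf)   #forget gate
    k_t = np.tanh(np.dot(Wk, input_t) + np.dot(Uk, state_t) + bk)   #candidate values
    o_t = sigmoid(np.dot(Wo, input_t) + np.dot(Uo, state_t) + bo)   #output gate
    c_t = f_t * c_t + i_t * k_t          #the carry is updated, letting older signals flow through
    state_t = o_t * np.tanh(c_t)         #this timestep's output, fed back as the next state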

Advanced usage of recurrent neural networks

• Recurrent dropout: a specific, built-in way to use dropout inside recurrent layers to fight overfitting (a minimal sketch appears after this list).

• Stacking recurrent layers: this increases the representational power of the network (at the cost of a higher computational load).

• Bidirectional recurrent layers: these present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues.
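For the first point, note that Keras exposes two separate constructor arguments: dropout applies to the layer's inputs, while recurrent_dropout applies to the recurrent state (the code section below only uses dropout). A minimal sketch, assuming the same IMDB setup as that code:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(10000, 32))
#dropout drops units of the inputs, recurrent_dropout drops units of the recurrent state
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])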

Code examples

import numpy as np
#Word-level one-hot encoding
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1
#token_index={'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
max_length = 10
results = np.zeros(shape=(len(samples),max_length,max(token_index.values()) + 1))
#len(samples)=2, max_length=10, max(token_index.values())+1=11, so results has shape (2, 10, 11)
for i, sample in enumerate(samples):  #i indexes the sample (sentence)
    #0 The cat sat on the mat.
    #1 The dog ate my homework.
    for j, word in list(enumerate(sample.split()))[:max_length]:  #j indexes the word within the sample
        # list(enumerate(sample.split()))[:max_length]
        # [(0, 'The'), (1, 'cat'), (2, 'sat'), (3, 'on'), (4, 'the'), (5, 'mat.')]
        # [(0, 'The'), (1, 'dog'), (2, 'ate'), (3, 'my'), (4, 'homework.')]
        index = token_index.get(word)   #index is the word's position in the vocabulary
        results[i, j, index] = 1.
        #print('results[',i, ',',j, ',',index,']=',results[i, j, index] )
#print(results)

#Character-level one-hot encoding
import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable
#Map every printable character to an integer index starting at 1; the characters must be
#the dictionary keys so that token_index.get(character) below finds them
token_index = dict(zip(characters, range(1, len(characters) + 1)))
#token_index = {'0': 1, '1': 2, '2': 3, ..., 'a': 11, 'b': 12, ..., 'A': 37, ..., '!': 63, ..., ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample):
        index = token_index.get(character)
        results[i, j, index] = 1.
#print(results)

#Word-level one-hot encoding with Keras
from keras.preprocessing.text import Tokenizer
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
#Create a tokenizer, configured to only take into account the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
#tokenizer= <keras_preprocessing.text.Tokenizer object at 0x000001AC2C60A7B8>
#Build the word index
tokenizer.fit_on_texts(samples)
tokenizer.num_words
#1000
tokenizer.document_count
#2
tokenizer.word_counts
#OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])
tokenizer.index_docs
#defaultdict(<class 'int'>, {4: 1, 3: 1, 2: 1, 1: 2, 5: 1, 8: 1, 7: 1, 6: 1, 9: 1})
tokenizer.index_word
#{1: 'the', 2: 'cat', 3: 'sat', 4: 'on', 5: 'mat', 6: 'dog', 7: 'ate', 8: 'my', 9: 'homework'}
tokenizer.word_index
#{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
tokenizer.__dict__
#{
# 'word_counts': OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)]),
# 'word_docs': defaultdict(<class 'int'>, {'mat': 1, 'sat': 1, 'the': 2, 'cat': 1, 'on': 1, 'dog': 1, 'homework': 1, 'ate': 1, 'my': 1}),
# 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
# 'split': ' ',
# 'lower': True,
# 'num_words': 1000,
# 'document_count': 2,
# 'char_level': False,
# 'oov_token': None,
# 'index_docs': defaultdict(<class 'int'>, {5: 1, 3: 1, 1: 2, 2: 1, 4: 1, 6: 1, 9: 1, 7: 1, 8: 1}),
# 'word_index': {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9},
# 'index_word': {1: 'the', 2: 'cat', 3: 'sat', 4: 'on', 5: 'mat', 6: 'dog', 7: 'ate', 8: 'my', 9: 'homework'}}
#Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
#sequences= [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
#one_hot_results= [[0. 1. 1. ... 0. 0. 0.]
#Recover the word index that was computed
word_index = tokenizer.word_index
#word_index= {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
print('Found %s unique tokens.' % len(word_index))
#Found 9 unique tokens.

#Word-level one-hot encoding with the hashing trick
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
#Store the words as vectors of size 1000. If you have close to 1000 words (or more),
#you will see many hash collisions, which lowers the accuracy of this encoding method
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        #Hash the word into a "random" integer index between 0 and dimensionality - 1
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
print(results)


from keras.preprocessing.text import Tokenizer
#Keras word-level one-hot encoding splits tokens on whitespace, so Chinese text must be segmented into words beforehand
samples = ['我 爱 北京 天安门', '天安门 上 太阳 升']
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

#------------ Learning word embeddings with the Embedding layer --------------
#Instantiate an Embedding layer
from keras.layers import Embedding
#The Embedding layer takes at least two arguments: the number of possible tokens (here 1000, i.e. maximum word index + 1) and the dimensionality of the embeddings (here 64)
embedding_layer = Embedding(1000, 64)

#Load the IMDB data for use with an Embedding layer
from keras.datasets import imdb
from keras import preprocessing
#Number of words to consider as features
max_features = 10000
#Cut the texts off after this many words (among the max_features most common words)
maxlen = 20
#Load the data as lists of integers
#(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="D:/Python36/Coding/PycharmProjects/ttt/imdb.npz", num_words=max_features)
# x_train.shape=(25000,)
#[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385,...])
#list([1, 194, 1153, 194, 8255, 78, 228, 5, 6,...])]
#Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
# x_train.shape=(25000, 20)
# [[  65   16   38 ...   19  178   32]
# [  23    4 1690 ...   16  145   95]
# [1352   13  191 ...    7  129  113]

#Using an Embedding layer and classifier on the IMDB data
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
model = Sequential()
#Specify the maximum input length to the Embedding layer so the embedded inputs can be flattened later. The Embedding layer's activations have shape (samples, maxlen, 8)
model.add(Embedding(10000, 8, input_length=maxlen))
#Flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
#Add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
#1s 61us/step - loss: 0.2839 - acc: 0.8860 - val_loss: 0.5302 - val_acc: 0.7464

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Dense Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Dense Training and validation loss')
plt.legend()
plt.show()

#SimpleRNN example
#SimpleRNN can be run in two different modes: one returns the full sequence of successive outputs for every timestep, a 3D tensor of shape (batch_size, timesteps, output_features);
#the other returns only the last output for each input sequence, a 2D tensor of shape (batch_size, output_features).
#These two modes are controlled by the return_sequences constructor argument.
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
print(model.summary())
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_2 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_1 (SimpleRNN)     (None, 32)                2080
#=================================================================
#Total params: 322,080
#Trainable params: 322,080
#Non-trainable params: 0
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
print(model.summary())
#_________________________________________________________________
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_3 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080
#=================================================================
#Total params: 322,080
#Trainable params: 322,080
#Non-trainable params: 0

#To increase the representational power of a network it is sometimes useful to stack several recurrent layers; in that case, all intermediate layers must return their full sequence of outputs
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
print(model.summary())
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_4 (Embedding)      (None, None, 32)          320000
#_________________________________________________________________
#simple_rnn_3 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_4 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_5 (SimpleRNN)     (None, None, 32)          2080
#_________________________________________________________________
#simple_rnn_6 (SimpleRNN)     (None, 32)                2080
#=================================================================
#Total params: 328,320
#Trainable params: 328,320
#Non-trainable params: 0

#Applying a SimpleRNN model to the IMDB movie-review classification problem
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000
maxlen = 500
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(path="D:/Python36/Coding/PycharmProjects/ttt/imdb.npz", num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
#Train the model with an Embedding layer and a SimpleRNN layer
from keras.layers import Dense
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Simple RNN Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Simple RNN Training and validation loss')
plt.legend()
plt.show()


#Using the LSTM layer in Keras
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('LSTM Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('LSTM Training and validation loss')
plt.legend()
plt.show()


#Using an LSTM layer with dropout in Keras
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32,dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('LSTM with dropout Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('LSTM with dropout Training and validation loss')
plt.legend()
plt.show()

#Training and evaluating a bidirectional LSTM
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras import layers
from keras.models import Sequential
model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,epochs=10,batch_size=32,validation_split=0.2)
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Bidirectional LSTM Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Bidirectional LSTM Training and validation loss')
plt.legend()
plt.show()