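Generating sequential text data with a recurrent neural network. Recurrent neural networks can be used to generate music, images, speech, dialogue for conversational systems, and more.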
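The importance of randomness. Consider one extreme: purely random sampling, where the next character is drawn from a uniform probability distribution and every character is equally likely. This scheme has maximum randomness; in other words, the distribution has maximum entropy. Naturally, it produces nothing interesting. At the other extreme, greedy sampling, which always picks the most likely character, also produces nothing interesting and has no randomness: the corresponding distribution has minimum entropy. Sampling from the "true" probability distribution (the one output by the model's softmax function) is an intermediate point between these two extremes. There are many other intermediate points, with higher or lower entropy, that you may want to explore. Less entropy gives the generated sequences a more predictable structure (so they may look more realistic), while more entropy produces more surprising and creative sequences.
When sampling from a generative model, it is worth exploring different amounts of randomness in the generation process. Because we humans are the ultimate judges of how interesting the generated data is, interestingness is highly subjective, and there is no way to know in advance where the best entropy point lies.
To control the randomness of the sampling process, we introduce a parameter called the softmax temperature, which characterizes the entropy of the probability distribution used for sampling: it determines how surprising or how predictable the choice of the next character will be. Higher temperatures give higher-entropy distributions and more surprising text; lower temperatures give more predictable text. Given a temperature value, a new probability distribution is computed from the original one (the model's softmax output) by reweighting it as follows.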
import numpy as np
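def reweight_distribution(original_distribution, temperature=0.5):
    # original_distribution is a 1D NumPy array of probabilities that sum to 1
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # renormalize, because the reweighted values no longer sum to 1
    return distribution / np.sum(distribution)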
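To make the effect of temperature concrete, the following minimal sketch reweights a small distribution at several temperatures and reports its entropy. The four-value distribution and the entropy helper are illustrative assumptions, not part of the original code; reweight_distribution is repeated only so that the snippet runs on its own.

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Same reweighting as in the listing: scale log-probabilities, then renormalize.
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

def entropy(distribution):
    # Shannon entropy in nats; assumes every probability is strictly positive.
    return float(-np.sum(distribution * np.log(distribution)))

# Hypothetical softmax output over a four-character vocabulary.
original = np.array([0.5, 0.3, 0.15, 0.05])

for temperature in [0.2, 0.5, 1.0, 1.2, 5.0]:
    reweighted = reweight_distribution(original, temperature)
    print(temperature, np.round(reweighted, 3), round(entropy(reweighted), 3))
```

Lower temperatures concentrate probability on the most likely character, so the entropy falls toward 0. Higher temperatures flatten the distribution toward uniform, so the entropy rises toward log 4 in this example. A temperature of 1.0 leaves the distribution unchanged.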
Preparing the data. Download the corpus and convert it to lowercase.
import keras
import numpy as np
path = keras.utils.get_file('nietzsche.txt',origin='')
text = open(path).read().lower()
print('Corpus length:',len(text))
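Next, extract partially overlapping sequences of length maxlen, one-hot encode them, and pack them into a 3D NumPy array of shape (sequences, maxlen, unique_characters). At the same time, prepare an array y containing the corresponding targets: the one-hot encoded character that follows each extracted sequence.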
maxlen = 60  # length of each extracted sequence, in characters
step = 3  # start a new sequence every 3 characters
sentences = []  # extracted sequences
next_chars = []  # targets: the character that follows each sequence
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i:i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
chars = sorted(list(set(text)))  # vocabulary of unique characters
print('Unique characters:', len(chars))
char_indices = dict((char, chars.index(char)) for char in chars)  # character -> index
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):  # one-hot encode
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
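The windowing and one-hot encoding are easier to follow on a tiny input. The sketch below applies the same procedure to a toy string; the corpus "hello world" and the small maxlen and step values are assumptions chosen only to keep the output readable.

```python
import numpy as np

text = "hello world"  # toy corpus, for illustration only
maxlen, step = 4, 2   # much smaller than the 60 and 3 used in the listing

sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i:i + maxlen])   # input window
    next_chars.append(text[i + maxlen])    # character that follows the window

chars = sorted(set(text))
char_indices = {char: idx for idx, char in enumerate(chars)}

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = True
    y[i, char_indices[next_chars[i]]] = True

print(sentences)         # ['hell', 'llo ', 'o wo', 'worl']
print(next_chars)        # ['o', 'w', 'r', 'd']
print(x.shape, y.shape)  # (4, 4, 8) (4, 8)
```

Each row of x holds one window with exactly one True per time step, and the matching row of y marks the single character the model should predict.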
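Building the model. The network is an LSTM layer followed by a Dense layer. RNNs are not the only models that can generate sequential text; 1D convolutional networks can also be used for this task.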
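from keras import layers
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))  # assumed: the source omits the layers; 128 units is a guess
model.add(layers.Dense(len(chars), activation='softmax'))  # assumed: softmax over all characters
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)  # the argument is named lr in Keras releases before 2.3
model.compile(loss='categorical_crossentropy', optimizer=optimizer)  # assumed: loss chosen to match the one-hot targets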
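Training the language model and sampling from it. Given a trained model and a seed text snippet, new text can be generated by repeating the following steps: have the model predict a probability distribution over the next character given the text generated so far, reweight that distribution to the chosen temperature, sample the next character at random from the reweighted distribution, and append the new character to the end of the text.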
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # reweight and renormalize
    probas = np.random.multinomial(1, preds, 1)  # one random draw, returned as a one-hot row
    return np.argmax(probas)  # index of the character that was drawn
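The sample function takes the logarithm of the raw probabilities, which emits a warning whenever the model assigns a character a probability of exactly zero. One possible variant is sketched below. The name sample_stable, the 1e-12 clipping floor, and the use of a NumPy Generator are all assumptions made for illustration, not part of the original code.

```python
import numpy as np

def sample_stable(preds, temperature=1.0, rng=None):
    # Hypothetical helper: same idea as sample(), with guards against log(0).
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Clip so that zero probabilities do not produce -inf (the floor is an assumption).
    logits = np.log(np.clip(preds, 1e-12, None)) / temperature
    logits -= np.max(logits)  # shift so the largest term is exp(0) = 1
    probs = np.exp(logits)
    probs /= np.sum(probs)
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
preds = [0.6, 0.3, 0.1, 0.0]  # made-up model output
for temperature in [0.2, 1.0, 2.0]:
    draws = [sample_stable(preds, temperature, rng) for _ in range(2000)]
    # Empirical frequency of each index at this temperature.
    print(temperature, np.bincount(draws, minlength=len(preds)) / len(draws))
```

At a temperature of 0.2 nearly every draw is index 0, which approaches greedy sampling; at 2.0 the draws spread across the first three indices. Note that clipping gives a zero-probability character a tiny nonzero chance of being drawn at high temperatures, which differs slightly from the original function.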
import random
import sys
for epoch in range(1, 60):
    print('epoch', epoch)
    model.fit(x, y, batch_size=128, epochs=1)  # train for one more pass over the data
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index:start_index + maxlen]  # seed text
    print('--- Generating with seed: "' + generated_text + '"')
    for temperature in [0.2, 0.5, 1.0, 1.2]:  # compare text generated at different temperatures
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)
        for i in range(400):  # generate 400 characters, starting from the seed
            sampled = np.zeros((1, maxlen, len(chars)))  # one-hot encode the current window
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
            preds = model.predict(sampled, verbose=0)[0]  # distribution over the next character
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            generated_text += next_char  # append the sampled character
            generated_text = generated_text[1:]  # drop the oldest character so the window stays maxlen long
            sys.stdout.write(next_char)
        print()
Seed text: 'new faculty, and the jubilation reached its climax when kant.' With epochs=20 and temperature=0.2, the generated text is:
new faculty, and the jubilation reached its climax when kant and such a man
in the same time the spirit of the surely and the such the such
as a man is the sunligh and subject the present to the superiority of the
special pain the most man and strange the subjection of the
special conscience the special and nature and such men the subjection of the
special men, the most surely the subjection of the special
intellect of the subjection of the same things and
new faculty, and the jubilation reached its climax when kant in the eterned
and such man as it's also become himself the condition of the
experience of off the basis the superiory and the special morty of the
strength, in the langus, as which the same time life and "even who
discless the mankind, with a subject and fact all you have to be the stand
and lave no comes a troveration of the man and surely the
conscience the superiority, and when one must be w
cheerfulness, friendliness and kindness of a heart are the sense of the
spirit is a man with the sense of the sense of the world of the
self-end and self-concerning the subjection of the strengthorixes--the
subjection of the subjection of the subjection of the
self-concerning the feelings in the superiority in the subjection of the
subjection of the spirit isn't to be a man of the sense of the
subjection and said to the strength of the sense of the
cheerfulness, friendliness and kindness of a heart are the part of the soul
who have been the art of the philosophers, and which the one
won't say, which is it the higher the and with religion of the frences.
the life of the spirit among the most continuess of the
strengther of the sense the conscience of men of precisely before enough
presumption, and can mankind, and something the conceptions, the
subjection of the sense and suffering and the
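Note that by training a bigger model for longer and on more data, you can obtain generated samples that look more coherent and realistic than these. Even so, do not expect to generate meaningful text other than by random chance: all you are doing is sampling from a statistical model of which characters follow which characters. Language is a communication channel, and there is a difference between what a communication is about and the statistical structure of the messages in which it is encoded.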
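The phrase "a statistical model of which characters follow which characters" can be illustrated with something far cruder than an LSTM. The sketch below counts character bigrams in a toy corpus and samples from those counts; it is a stand-in chosen for illustration, not the model used in this post, and the corpus string and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(text):
    # For each character, count which characters follow it in the text.
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, seed_char, length, rng):
    # Repeatedly sample the next character in proportion to the observed counts.
    out = [seed_char]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # character never seen with a successor
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

corpus = "the spirit of the man and the sense of the world"  # toy corpus
counts = build_bigram_counts(corpus)
print(generate(counts, "t", 40, random.Random(0)))
```

Every adjacent pair of characters in the output appears somewhere in the corpus, yet the result carries no meaning. The LSTM conditions on 60 previous characters rather than one, so its output looks far more like language, but the underlying operation is still sampling from learned statistics.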