Building a Text Generator with an RNN (TensorFlow Eager + tf.keras)

[Introduction] This article is a translation of the new TensorFlow tutorial "Text generation using a RNN with eager execution". The tutorial shows how to use TensorFlow Eager (eager execution) and an RNN to learn to generate text in the style of Shakespeare's works. Given an existing sequence of characters, the model predicts the next character in the sequence; repeating this process produces new text.

Introduction


The tutorial includes runnable code implemented with TensorFlow Eager (eager execution) and tf.keras. Below is a sample of the output produced by the code:

QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m

While some of the generated sentences look grammatical, most of them do not make much sense. The model has not learned the meaning of words, but consider:

  • The model is character-based: when training started, it did not know how to spell words, or even that words are a unit of text.
  • The structure of the output resembles a play: as in the training set, blocks of text usually begin with a speaker's name in capital letters.

The model was trained on sequences of 100 characters, but it is still able to generate longer sequences of text.

Setup


Import the required libraries:

import tensorflow as tf
tf.enable_eager_execution()

import numpy as np
import os
import time

Download the dataset:

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Read the data:

text = open(path_to_file).read()
print ('Length of text: {} characters'.format(len(text)))

Take a look at the first part of the text:

print(text[:1000])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

Build the set of unique characters:

# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
65 unique characters

Text processing


Build the character-to-index mapping:

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
for char,_ in zip(char2idx, range(20)):
    print('{:6s} ---> {:4d}'.format(repr(char), char2idx[char]))
'j'    --->   48
'f'    --->   44
'R'    --->   30
':'    --->   10
'W'    --->   35
';'    --->   11
'o'    --->   53
'b'    --->   40
'K'    --->   23
'L'    --->   24
'O'    --->   27
'h'    --->   46
'm'    --->   51
'u'    --->   59
'H'    --->   20
'z'    --->   64
'!'    --->    2
'S'    --->   31
'N'    --->   26
'Z'    --->   38
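
To make the mapping concrete, here is a small sanity check (a sketch reusing the text and text_as_int defined above) that prints how the first few characters of the text are encoded as integers:

# Show how the first 13 characters of the text map to integer IDs
print('{} ---- characters mapped to int ----> {}'.format(repr(text[:13]), text_as_int[:13]))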

The prediction task:

Given a character or a sequence of characters, we want to predict the most likely next character. During training, the model takes seq_length characters as input and produces seq_length characters as output. For example, if seq_length is 4 and the text is "Hello", the input is "Hell" and the target is "ello".
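
The split described above can be illustrated with a tiny sketch (using the hypothetical seq_length = 4 and the text "Hello" from the example):

# Minimal illustration of the input/target split described above
sample_text = "Hello"
sample_input, sample_target = sample_text[:-1], sample_text[1:]
print(sample_input, '->', sample_target)   # prints: Hell -> ello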

First, split the text into chunks of seq_length + 1 characters:

seq_length = 100
chunks = tf.data.Dataset.from_tensor_slices(text_as_int).batch(seq_length+1, drop_remainder=True)
for item in chunks.take(5):
  print(repr(''.join(idx2char[item.numpy()])))
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'

Extract the input and target from each chunk:

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = chunks.map(split_input_target)
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
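
At each time step the model receives one input character and is expected to predict the character that follows it. A short sketch (reusing input_example and target_example from the loop above) makes this alignment explicit:

# For the first 5 time steps, print the input character and the expected output character
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:2d}: input {} ---> expected output {}".format(
        i, repr(idx2char[input_idx.numpy()]), repr(idx2char[target_idx.numpy()])))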

Use tf.data to shuffle the data and split it into batches:

# Batch size
BATCH_SIZE = 64
# Shuffle buffer size
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
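
A quick shape check (a sketch, taking one batch from the dataset defined above) confirms that each batch of inputs and targets now has shape (BATCH_SIZE, seq_length), i.e. (64, 100):

# Inspect the shape of one batch of inputs and targets
for input_batch, target_batch in dataset.take(1):
    print(input_batch.shape, target_batch.shape)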

The model


The model class, built from the tf.keras Embedding and GRU layers:

class Model(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super(Model, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        if tf.test.is_gpu_available():
            self.gru = tf.keras.layers.CuDNNGRU(self.units,
                                                return_sequences=True,
                                                recurrent_initializer='glorot_uniform',
                                                stateful=True)
        else:
            self.gru = tf.keras.layers.GRU(self.units,
                                           return_sequences=True,
                                           recurrent_activation='sigmoid',
                                           recurrent_initializer='glorot_uniform',
                                           stateful=True)

        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x):
        embedding = self.embedding(x)

        # output at every time step
        # output shape == (batch_size, seq_length, hidden_size) 
        output = self.gru(embedding)

        # The dense layer will output predictions for every time step (seq_length)
        # output shape after the dense layer == (batch_size, seq_length, vocab_size)
        prediction = self.fc(output)

        # return the logits for every time step
        return prediction

Instantiate the model, the optimizer, and the loss function:

# Vocabulary size
vocab_size = len(vocab)
# Embedding dimension
embedding_dim = 256
# Number of RNN units
units = 1024

model = Model(vocab_size, embedding_dim, units)
optimizer = tf.train.AdamOptimizer()

def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
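
Before training, it can be useful to run a single batch through the untrained model to check the output shape and the initial loss (a sketch reusing the dataset, model, and loss_function defined above; with 65 classes the untrained loss should be close to ln(65) ≈ 4.17):

# Run one batch through the untrained model and inspect shape and loss
for input_example_batch, target_example_batch in dataset.take(1):
    example_predictions = model(input_example_batch)
    # (batch_size, seq_length, vocab_size) == (64, 100, 65)
    print(example_predictions.shape)
    print(loss_function(target_example_batch, example_predictions).numpy())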

Build the model, view its summary, and set up checkpoints:

model.build(tf.TensorShape([BATCH_SIZE, seq_length]))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        multiple                  16640     
_________________________________________________________________
gru (GRU)                    multiple                  3935232   
_________________________________________________________________
dense (Dense)                multiple                  66625     
=================================================================
Total params: 4,018,497
Trainable params: 4,018,497
Non-trainable params: 0
_________________________________________________________________
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

Train for a few epochs:

EPOCHS = 5
# Training loop
for epoch in range(EPOCHS):
    start = time.time()

    # Reset the hidden state at the start of every epoch
    model.reset_states()

    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            predictions = model(inp)
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables))

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         loss))
    # Save (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        model.save_weights(checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, loss))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Epoch 3 Batch 0 Loss 1.7645
Epoch 3 Batch 100 Loss 1.6853
Epoch 3 Loss 1.6164
Time taken for 1 epoch 610.0756878852844 sec

Epoch 4 Batch 0 Loss 1.6491
Epoch 4 Batch 100 Loss 1.5350
Epoch 4 Loss 1.5071
Time taken for 1 epoch 609.8330454826355 sec

Epoch 5 Batch 0 Loss 1.4715
Epoch 5 Batch 100 Loss 1.4685
Epoch 5 Loss 1.4042
Time taken for 1 epoch 608.6753587722778 sec

Save the model:

model.save_weights(checkpoint_prefix)

Loading the model:

If you need to load a saved model, you can use the following code:

model = Model(vocab_size, embedding_dim, units)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
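
After rebuilding the model with a batch size of 1, a quick sanity check (a sketch, not part of the original tutorial) is to feed a single character through it and confirm the output shape:

# The rebuilt model expects batch size 1; feed a single character through it
probe_input = tf.expand_dims([char2idx['Q']], 0)  # shape (1, 1)
print(model(probe_input).shape)                   # (1, 1, vocab_size) == (1, 1, 65)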

Use the trained model to generate text:

# Generate text

# Number of characters to generate
num_generate = 1000

# Starting string
start_string = 'Q'

input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)

# Empty list to store the generated characters
text_generated = []

# Lower temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
temperature = 1.0
# Here the batch size is 1
model.reset_states()
for i in range(num_generate):
    predictions = model(input_eval)
    # Remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # Use a multinomial distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

    # Pass the predicted character as the next input to the model, along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

print(start_string + ''.join(text_generated))

Sample output:

QULERBY:
If a body.
But I would me your lood.
Steak ungrace and as this only in the ploaduse,
his they, much you amed on't.

RSCALIO:
Hearn' thousand as your well, and obepional.

ANTONIO:
Can wathach this wam a discure that braichal heep itspose,
Teparmate confoim it: never knor sheep, so litter
Plarence? He,
But thou sunds a parmon servection:
Occh Rom o'ld him sir;
madish yim,
I'll surm let as hand upherity

Shepherd:
Why do I sering their stumble; the thank emo'st yied
Baunted unpluction; the main, sir, What's a meanulainst
Even worship tebomn slatued of his name,
Manisholed shorks you go?

BUCKINGHAM:
We look thus then impare'd least itsiby drumes,
That I, what!
Nurset, fell beshee that which I will
to the near-Volshing upon this aguin against fless
Is done untlein with is the neck,
Thands he shall fear'ds; let me love at officed:
Where else to her awticions, as you hall, my lord.

KING RICHARD II:
I will been another one our accuser less
Tiold, methought to the presench of consiar
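
A side note on the temperature parameter used in the generation loop above: it rescales the logits before sampling, so lower values sharpen the sampling distribution and higher values flatten it. A small hypothetical illustration with made-up logits (not taken from the model):

# Lower temperature sharpens the distribution; higher temperature flattens it
toy_logits = tf.constant([[2.0, 1.0, 0.1]])
for t in [0.5, 1.0, 2.0]:
    print(t, tf.nn.softmax(toy_logits / t).numpy())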

References:

  • https://www.tensorflow.org/tutorials/sequences/text_generation
  • https://github.com/tensorflow/docs/blob/master/site/en/tutorials/sequences/text_generation.ipynb

-END-

Originally published on the WeChat official account: 专知 (Quan_Zhuanzhi)

Original publication date: 2018-11-02
