textgenrnn Text Generation in Practice

sparkexpert

Published 2019-05-26 14:02:22

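Text generation is a fascinating natural language processing task, and deep learning has opened an entirely new technical route to it. As the article The Unreasonable Effectiveness of Recurrent Neural Networks describes, the approach is surprisingly effective. textgenrnn is a concise, efficient library that implements text generation with RNNs; its code base is small and easy to follow. Its architecture combines LSTM layers with an attention layer.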

The project page at https://github.com/minimaxir/textgenrnn describes it as follows:

For the default model, textgenrnn takes in an input of up to 40 characters, converts each character to a 100-D character embedding vector, and feeds those into a 128-cell long-short-term-memory (LSTM) recurrent layer. Those outputs are then fed into another 128-cell LSTM. All three layers are then fed into an Attention layer to weight the most important temporal features and average them together (and since the embeddings + 1st LSTM are skip-connected into the attention layer, the model updates can backpropagate to them more easily and prevent vanishing gradients). That output is mapped to probabilities for up to 394 different characters that they are the next character in the sequence, including uppercase characters, lowercase, punctuation, and emoji. (if training a new model on a new dataset, all of the numeric parameters above can be configured)

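In other words, textgenrnn accepts up to 40 input characters, maps each one to a 100-dimensional character embedding, and passes the embeddings through two stacked 128-cell LSTM layers. The outputs of all three layers feed an attention layer, which weights the most important temporal features and averages them; because the embeddings and the first LSTM are skip-connected to the attention layer, gradients reach them more easily, which helps prevent vanishing gradients. The result is mapped to a probability distribution over up to 394 characters (uppercase, lowercase, punctuation, and emoji) for the next character in the sequence. Crucially, all of these numeric parameters are configurable when training a new model.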
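The attention step described above can be sketched in plain Python. This is only an illustration, not textgenrnn's own layer: it assumes the three layer outputs are concatenated per timestep (100 + 128 + 128 features), and the random `features` and `weights` values are stand-ins for what the trained network would produce.

```python
import math
import random

SEQ_LEN = 40       # maximum input characters (from the README)
EMBED_DIM = 100    # character embedding size (from the README)
LSTM_UNITS = 128   # cells per LSTM layer (from the README)
# Assumption: embeddings, LSTM 1 and LSTM 2 outputs are concatenated per timestep.
FEATURE_DIM = EMBED_DIM + 2 * LSTM_UNITS


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_average(features, weights):
    """Collapse a (timesteps x dims) sequence into a single vector.

    Each timestep gets a scalar score (dot product with `weights`),
    a softmax over time turns the scores into attention weights, and
    the output is the attention-weighted sum of the timesteps.
    """
    scores = [sum(f * w for f, w in zip(step, weights)) for step in features]
    attention = softmax(scores)
    dims = len(features[0])
    pooled = [
        sum(a * step[d] for a, step in zip(attention, features))
        for d in range(dims)
    ]
    return pooled, attention


rng = random.Random(0)
# Hypothetical stand-ins for the concatenated layer outputs and learned weights.
features = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(SEQ_LEN)]
weights = [rng.uniform(-0.1, 0.1) for _ in range(FEATURE_DIM)]

pooled, attention = attention_weighted_average(features, weights)
print(len(pooled), len(attention))
```

The pooled vector has one value per feature, and the attention weights sum to 1 across the 40 timesteps, which is what lets the model emphasize some positions in the input over others.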

Hands-on with the source code:

(1) The default test: generating news.

(2) Generating news in the computing domain.

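One of the generation parameters is temperatures, which sets the sampling temperature of the generated text. Judging only from the output, it is tempting to read temperature as emotional intensity (0.2 fairly negative, 0.5 neutral, 1.0 more upbeat), but what it actually controls is sampling randomness: at low values such as 0.2 the model sticks to its most probable characters and produces conservative, repetitive text, while values near 1.0 sample closer to the model's raw predicted distribution and give more varied, less predictable output.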
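How a sampling temperature reshapes the next-character distribution can be sketched as follows. This is a minimal illustration of the general technique (divide the log-probabilities by the temperature, then renormalize), not a copy of textgenrnn's sampling code; the function names and the four-character distribution are hypothetical.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    # Small epsilon avoids log(0) for characters with zero probability.
    scaled = [math.log(p + 1e-10) / temperature for p in probs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature, rng):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


# Hypothetical next-character probabilities for a 4-character vocabulary.
next_char_probs = [0.5, 0.3, 0.15, 0.05]

for t in (0.2, 0.5, 1.0):
    adjusted = apply_temperature(next_char_probs, t)
    print(t, [round(p, 3) for p in adjusted])

rng = random.Random(42)
picked = sample_index(next_char_probs, 0.5, rng)
```

At a temperature of 0.2 almost all of the probability mass moves to the most likely character, while at 1.0 the distribution is left essentially unchanged, so rarer characters are drawn more often.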

To make it easy to try different temperatures, textgenrnn ships a helper that generates samples at several temperatures. Its code is as follows:

def generate_samples(self, n=3, temperatures=[0.2, 0.5, 1.0], **kwargs):
        for temperature in temperatures:
            print('#'*20 + '\nTemperature: {}\n'.format(temperature) +
                  '#'*20)
            self.generate(n, temperature=temperature, **kwargs)

Running the test again with this helper gives the following results:

(3) Several applications have already been built on this framework, such as composing tweets in the style of Trump.

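The test code is as follows:

from textgenrnn import textgenrnn

# Load the pretrained Trump/dril Twitter weights and the vocabulary file.
textgen = textgenrnn('./weights/realDonaldTrump_dril_twitter_weights.hdf5',
                     './textgenrnn/textgenrnn_vocab.json')
# generate_samples prints its samples and has no return statement,
# so there is nothing useful to assign from it.
textgen.generate_samples()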

As you can see, getting results really is that simple.

However, the project README spells out several caveats:

Notes

You will not get quality generated text 100% of the time, even with a heavily-trained neural network. That's the primary reason viral blog posts/Twitter tweets utilizing NN text generation often generate lots of texts and curate/edit the best ones afterward.
Results will vary greatly between datasets. Because the pretrained neural network is relatively small, it cannot store as much data as RNNs typically flaunted in blog posts. For best results, use a dataset with at least 2,000-5,000 documents. If a dataset is smaller, you'll need to train it for longer by setting num_epochs higher when calling a training method and/or training a new model from scratch. Even then, there is currently no good heuristic for determining a "good" model.
A GPU is not required to retrain textgenrnn, but it will take much longer to train on a CPU. If you do use a GPU, I recommend increasing the batch_size parameter for better hardware utilization.

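In short, the training corpus should contain at least 2,000 to 5,000 documents, and because output quality is inconsistent, the generated text usually needs some manual curation and editing.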
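Before training on a new corpus, the size guideline above can be checked with a few lines of Python. This is a rough sketch under one stated assumption: the corpus is a plain text file with one document per line. The helper names are hypothetical, and the threshold is simply the lower bound of the README's 2,000-5,000 range.

```python
import os
import tempfile

MIN_RECOMMENDED_DOCS = 2000  # lower bound of the README's 2,000-5,000 range


def count_documents(path):
    """Count non-empty lines, treating each line as one document (assumption)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def corpus_is_large_enough(path, minimum=MIN_RECOMMENDED_DOCS):
    """Return True when the corpus meets the recommended document count."""
    return count_documents(path) >= minimum


# Demonstration on a throwaway file containing three documents and one blank line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first headline\n\nsecond headline\nthird headline\n")
    demo_path = tmp.name

doc_count = count_documents(demo_path)
enough = corpus_is_large_enough(demo_path)
os.remove(demo_path)
print(doc_count, enough)
```

A corpus that falls short is not unusable; as the README notes, it may simply need a higher num_epochs or a new model trained from scratch.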

Originally published: 2018-05-04
