Building an RNN Text Generation Model in Just Four Lines of Code

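Generating text is one of those projects that seems fun to newcomers to machine learning and NLP, yet it is also a difficult one. Thankfully, there are all sorts of great resources online for learning how RNNs can be used for text generation, ranging from the theoretical to the in-depth technical. What all of these resources share is that at some point in the process you have to build an RNN model and tune it to do the job.

Building a text generator is clearly a worthwhile exercise, especially while learning, but what if you want to work at a higher level of abstraction? What if you are a data scientist who needs an RNN text generator as a building block to plug into a project? Or what if you are a newcomer who simply wants to experiment or improve your skills? In either case, take a look at textgenrnn, a project that lets you train a text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn was developed by data scientist Max Woolf. It is built on top of Keras and TensorFlow and can generate both character-level and word-level text. The network architecture uses attention weighting to speed up training and improve quality, and it exposes a large number of hyperparameters for tuning, such as RNN size, the number of RNN layers, and whether to use bidirectional RNNs. You can read more about textgenrnn, its features, and its architecture on GitHub or in the accompanying introductory blog post.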

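Since the "Hello, World!" of text generation seems to be generating Trump tweets, let's go with that. textgenrnn's default pre-trained model can easily be trained further on new text, and you can also use textgenrnn to train a new model (just add new_model=True to any of its training functions).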
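As a rough illustration of the new_model=True option and the hyperparameters mentioned above, here is a minimal sketch. The keyword names (rnn_size, rnn_layers, rnn_bidirectional) follow my reading of the textgenrnn README and should be checked against the version you have installed. The values and the helper train_new_model are hypothetical and not taken from the article.

```python
# Hypothetical settings for illustration; the values are not from the article.
# Keyword names are assumed from the textgenrnn README and may differ by version.
NEW_MODEL_SETTINGS = {
    "new_model": True,          # train from scratch instead of fine-tuning the bundled weights
    "rnn_size": 128,            # number of units in each recurrent layer
    "rnn_layers": 2,            # number of stacked recurrent layers
    "rnn_bidirectional": True,  # process the text in both directions
    "num_epochs": 10,           # same epoch count as the article's run
}


def train_new_model(path, settings=NEW_MODEL_SETTINGS):
    """Train a fresh textgenrnn model on the text file at `path` (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without TensorFlow installed.
    from textgenrnn import textgenrnn

    textgen = textgenrnn()
    textgen.train_from_file(path, **settings)
    return textgen
```

Calling train_new_model('trump-tweets.txt') would then replace the fine-tuning run with a from-scratch one. Expect this to need more data and more epochs than fine-tuning the pre-trained model.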

Getting the Data

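This article uses Donald Trump's tweets from January 1, 2014 to June 11, 2018, which covers tweets from both before and after his inauguration as President of the United States (source: the Trump Twitter Archive). Only the text of tweets in that date range is kept and saved to a text file named trump-tweets.txt.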
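The article does not show how the file was produced, so the following is only a sketch of the date filtering step. It assumes the archive has been exported to a CSV file with columns named created_at (formatted as YYYY-MM-DD HH:MM:SS) and text. Both column names and the timestamp format are hypothetical, so adjust them to match your export.

```python
import csv
from datetime import date, datetime

# Date range stated in the article.
START = date(2014, 1, 1)
END = date(2018, 6, 11)


def in_range(created_at, start=START, end=END):
    """Return True if a 'YYYY-MM-DD HH:MM:SS' timestamp falls within [start, end]."""
    day = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").date()
    return start <= day <= end


def export_tweets(csv_path, out_path):
    """Write the text of every in-range tweet to out_path, one tweet per line."""
    kept = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        # 'created_at' and 'text' are assumed column names, not a documented schema.
        for row in csv.DictReader(src):
            if in_range(row["created_at"]):
                # Collapse internal newlines so each tweet stays on a single line.
                dst.write(" ".join(row["text"].split()) + "\n")
                kept += 1
    return kept
```

Keeping one tweet per line matters because, as far as I understand it, train_from_file treats each line of the file as a separate training text.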

Training the Model

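Now let's see how simple it is to generate text with textgenrnn. The following four lines are all we need to import the library, create a text generation object, train the model on trump-tweets.txt for 10 epochs, and then generate some sample tweets.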

from textgenrnn import textgenrnn
textgen = textgenrnn()
textgen.train_from_file('trump-tweets.txt', num_epochs=10)
textgen.generate(5)

After about 30 minutes (training time depends on your hardware), the model produced the following at epoch 10:

My @FoxNews will be self finally complaining about me that so he is a great day and companies and is starting to report the president in safety and more than any mention of the bail of the underaches to the construction and freedom and efforts the politicians and expensive meetings should have bee

The world will be interviewed on @foxandfriends at 7:30pm. Enjoy!

.@JebBush and Fake News Media is a major place in the White House in the service and sense where the people of the debate and his show of many people who is a great press considering the GREAT job on the way to the U.S. A the best and people in the biggest! Thank you!

New Hampshire Trump Int'l Hotel Leadership Barrier Lou Clinton is a forever person politically record supporters have really beginning in the media on the heart of the bad and women who have been succeeded and before you can also work the people are there a time strong and send out the world with

Join me in Maryland at 7:00 A.M. and happened to the WALL and be true the longer of the same sign into the Fake News Media will be a great honor to serve that the Republican Party will be a great legal rate the media with the Best Republican Party and the American people that will be the bill by a...

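Politics aside, and considering that we trained on only about 12,000 tweets for just 10 epochs, these generated tweets are not too bad. textgenrnn's default temperature is 0.5; raising it yields more creative tweets. Let's see what a higher value does: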

textgen.generate(5, temperature=0.9)

“Via-can see this Democrats were the opening at GREAT ENSUS CALL!.@GovSeptorald Taster is got to that the subcent Vote waiting them. @CalkersMajor President Obama will listen for the disaster!Grateful and South Carolina so his real ability and much better-- or big crisis on many signing!It is absolutely dumbers for well tonight. Love us in the great inherition of fast. With bill of badly to forget the greatest puppet at my wedds. No Turnberry is "bigger.” - All

That is less convincing. What about lowering the temperature instead? The output becomes more stable:

textgen.generate(5, temperature=0.1)

The Fake News Media is a great people of the president was a great people of the many people who would be a great people of the president was a big crowd of the statement of the media is a great people of the people of the statement of the people of the people of the world with the statement of th

Thank you @TrumpTowerNY #Trump2016 https://t.co/25551R58350

Thank you for your support! #Trump2016 https://t.co/7eN53P55c

The people of the U.S. has been a great people of the presidential country is a great time and the best thing that the people of the statement of the media is the people of the state of the best thing that the people of the statement of the statement of the problem in the problem and success and t

Thank you @TheBrodyFile tonight at 8:00 A.M. Enjoy!
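To see why a low temperature gives repetitive but stable text while a high one gives more varied and less coherent text, here is a small sketch of standard temperature scaling. I am assuming textgenrnn does something close to this when sampling; the function apply_temperature and the three-way distribution are hypothetical illustrations, not the library's code.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution so each p becomes p ** (1 / temperature), renormalised.

    Assumes every probability is strictly positive.
    """
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical next-character probabilities predicted by a model.
probs = [0.6, 0.3, 0.1]
cold = apply_temperature(probs, 0.1)  # almost all mass on the most likely choice
warm = apply_temperature(probs, 0.9)  # close to the model's original distribution
```

At a temperature of 0.1 the most likely character is picked almost every time, which matches the looping "a great people of the" phrases above. At 0.9 the less likely characters keep a realistic chance of being drawn, which is where the invented words come from.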

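Comparing the two examples gives a clearer picture of what the project does. Of course, these examples are not perfect. There are plenty of other things we could try, and the good news is that if you do not want to implement your own solution, textgenrnn can handle many of them (see the GitHub repository):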

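  • Train our own model from scratch
  • Train on more sample data and for more iterations
  • Tune other hyperparameters
  • Pre-process the data (at the very least, remove the fake URLs)

I would be interested to see how the default textgenrnn model performs out of the box compared with a custom, well-tuned model. Interested readers are encouraged to try it for themselves.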
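For the pre-processing item in the list above, one simple option is to strip links from the training text so the model never learns to invent t.co addresses. This is a minimal sketch using a regular expression. The pattern and the helper strip_urls are my own assumptions, not something the article or textgenrnn provides.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs and tidy up the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())
```

Applying strip_urls to each tweet before writing trump-tweets.txt should keep shortened links out of the generated output, at the cost of dropping tweets that consisted of nothing but a link down to empty lines, which you may want to skip.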

Originally published on the WeChat official account AI科技时讯 (aiblog_research)

Original publication date: 2018-06-23
