Code: https://github.com/hjptriplebee/Chinese_poem_generator. Forks and stars are welcome.
The bot is named MC胖虎 (MC Pang Hu). For now it uses only the simplest, most brute-force approach, implemented with TensorFlow, and the results feel a bit like artificial unintelligence, which fits Pang Hu's character. Here are some samples:
There is plenty of material online on how LSTMs work; if you are not familiar with them, see http://www.jianshu.com/p/9dc9f41f0b29.
This post focuses on how the poem-writing bot is implemented and will not go into much theory or TensorFlow usage. Let's get started.
Preprocessing the training data
About 30,000 Tang poems are used as training data; they can be found in the dataset folder on GitHub. Each poem is stored as "题目:诗句" (the title, a full-width colon, then the poem text), as shown below:
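A hypothetical line in this format might look like the following (for illustration only; the real files are in the dataset folder):

静夜思:床前明月光,疑是地上霜。举头望明月,低头思故乡。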
We first split the title from the content at the full-width colon ":", then clean the data by filtering out bad training samples: poems containing special symbols, or with too few or too many characters, are removed. Finally, a start marker and an end marker are added around each poem to tell the LSTM where it begins and ends; square brackets are used here.
poems = []
file = open(filename, "r")
for line in file:  # every line is a poem
    # print(line)
    title, poem = line.strip().split(":")  # get title and poem
    poem = poem.replace(' ', '')
    if '_' in poem or '《' in poem or '[' in poem or '(' in poem or '(' in poem:
        continue
    if len(poem) < 10 or len(poem) > 128:  # filter poem
        continue
    poem = '[' + poem + ']'  # add start and end signs
    poems.append(poem)
Next, count how often each character appears and remove rare characters that occur only a few times.
# counting words
allWords = {}
for poem in poems:
    for word in poem:
        if word not in allWords:
            allWords[word] = 1
        else:
            allWords[word] += 1

# erase words which are not common
erase = []
for key in allWords:
    if allWords[key] < 2:
        erase.append(key)
for key in erase:
    del allWords[key]
Then sort the characters by frequency and build a mapping from character to ID. Why sort? After sorting, an ID roughly reflects how frequent the character is, and this correlation makes it easier for the model to pick up patterns than an arbitrary, unsorted mapping would.
We also add a space character: since poems vary in length they will be padded with spaces, so the space needs an ID of its own. Finally, each poem is converted into a vector of character IDs.
wordPairs = sorted(allWords.items(), key=lambda x: -x[1])
words, a = zip(*wordPairs)
words += (" ", )
wordToID = dict(zip(words, range(len(words))))  # word to ID
wordTOIDFun = lambda A: wordToID.get(A, len(words))
poemsVector = [([wordTOIDFun(word) for word in poem]) for poem in poems]  # poem to vector
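To make the effect of the sorting concrete, here is a minimal sketch on a made-up toy corpus (not the real dataset); frequent characters receive small IDs and the padding space gets the largest ID:

# Hypothetical toy corpus: three tiny "poems", purely for illustration.
toy = ["[春花秋月。]", "[春风。]", "[春。]"]
counts = {}
for p in toy:
    for ch in p:
        counts[ch] = counts.get(ch, 0) + 1
pairs = sorted(counts.items(), key=lambda x: -x[1])
chars, _ = zip(*pairs)
chars += (" ",)
toyWordToID = dict(zip(chars, range(len(chars))))
# The most frequent characters ('[', ']', '。', '春') get the smallest IDs,
# and the space character gets the largest ID, reserved for padding.
print(toyWordToID)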
Next, build the training batches. Within each batch, every poem is padded with spaces until it reaches the length of the longest poem in that batch. Because the padding consists only of spaces, the model can learn the rule that a space is always followed by another space. X and Y are the input and the output; the output is the input shifted by one position, i.e. given a character the model should predict the next character.
Note: be sure to use np.copy here. This one cost me a lot of debugging!
# padding length to batchMaxLength
batchNum = (len(poemsVector) - 1) // batchSize
X = []
Y = []
# create batch
for i in range(batchNum):
    batch = poemsVector[i * batchSize: (i + 1) * batchSize]
    maxLength = max([len(vector) for vector in batch])
    temp = np.full((batchSize, maxLength), wordTOIDFun(" "), np.int32)
    for j in range(batchSize):
        temp[j, :len(batch[j])] = batch[j]
    X.append(temp)
    temp2 = np.copy(temp)  # copy!!!!!!
    temp2[:, :-1] = temp[:, 1:]
    Y.append(temp2)
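To see what the shift does, and why the copy matters, here is a minimal sketch with made-up IDs (0 stands in for the space ID purely for illustration):

import numpy as np

# Hypothetical toy batch: two padded ID vectors (0 plays the role of the space ID here).
temp = np.array([[1, 2, 3, 4],
                 [5, 6, 0, 0]], dtype=np.int32)   # inputs X
temp2 = np.copy(temp)                             # independent array for the targets Y
temp2[:, :-1] = temp[:, 1:]                       # shift left by one: Y[t] is the character after X[t]
print(temp2)  # [[2 3 4 4] [6 0 0 0]]
# Without np.copy, temp2 would be the same object as temp, so X and Y would end up
# holding the identical shifted array and the inputs already appended to X would be corrupted.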
Building the model
We build an LSTM followed by a softmax layer whose output is the probability of each character. This is essentially a standard LSTM template with the parameters adjusted.
with tf.variable_scope("embedding"):  # embedding
    embedding = tf.get_variable("embedding", [wordNum, hidden_units], dtype=tf.float32)
    inputbatch = tf.nn.embedding_lookup(embedding, gtX)

basicCell = tf.contrib.rnn.BasicLSTMCell(hidden_units, state_is_tuple=True)
stackCell = tf.contrib.rnn.MultiRNNCell([basicCell] * layers)
initState = stackCell.zero_state(np.shape(gtX)[0], tf.float32)
outputs, finalState = tf.nn.dynamic_rnn(stackCell, inputbatch, initial_state=initState)
outputs = tf.reshape(outputs, [-1, hidden_units])

with tf.variable_scope("softmax"):
    w = tf.get_variable("w", [hidden_units, wordNum])
    b = tf.get_variable("b", [wordNum])
    logits = tf.matmul(outputs, w) + b

probs = tf.nn.softmax(logits)
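This snippet assumes a few names defined elsewhere in the repo, and the later snippets call it through a buildModel(wordNum, gtX) function that returns logits, probs, stackCell, initState and finalState. A rough sketch of those assumptions (the values below are placeholders, not the repo's actual settings):

import numpy as np
import tensorflow as tf   # the TensorFlow 1.x API is assumed throughout

wordNum = len(words) + 1   # vocabulary size; one extra slot because unknown characters map to len(words)
hidden_units = 128         # placeholder: LSTM hidden size
layers = 2                 # placeholder: number of stacked LSTM layers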
Training the model
First define the input and output placeholders and build the model, then set up the loss function, learning rate and other parameters.
gtX = tf.placeholder(tf.int32, shape=[batchSize, None])  # input
gtY = tf.placeholder(tf.int32, shape=[batchSize, None])  # output
logits, probs, a, b, c = buildModel(wordNum, gtX)
targets = tf.reshape(gtY, [-1])
# loss
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [targets],
                                                          [tf.ones_like(targets, dtype=tf.float32)],
                                                          wordNum)
cost = tf.reduce_mean(loss)
tvars = tf.trainable_variables()
grads, a = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
learningRate = learningRateBase
optimizer = tf.train.AdamOptimizer(learningRate)
trainOP = optimizer.apply_gradients(zip(grads, tvars))
globalStep = 0
Then training begins. We first look for a checkpoint: if one is found it is restored, otherwise training starts from scratch. The data is then fed in batch by batch, the learning rate decays gradually, and the model is saved every few steps.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    if reload:
        checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
        # if have checkPoint, restore checkPoint
        if checkPoint and checkPoint.model_checkpoint_path:
            saver.restore(sess, checkPoint.model_checkpoint_path)
            print("restored %s" % checkPoint.model_checkpoint_path)
        else:
            print("no checkpoint found!")

    for epoch in range(epochNum):
        if globalStep % learningRateDecreaseStep == 0:  # learning rate decrease by epoch
            learningRate = learningRateBase * (0.95 ** epoch)
        epochSteps = len(X)  # equal to batch
        for step, (x, y) in enumerate(zip(X, Y)):
            # print(x)
            # print(y)
            globalStep = epoch * epochSteps + step
            a, loss = sess.run([trainOP, cost], feed_dict={gtX: x, gtY: y})
            print("epoch: %d steps:%d/%d loss:%3f" % (epoch, step, epochSteps, loss))
            if globalStep % 1000 == 0:
                print("save model")
                saver.save(sess, checkpointsPath + "/poem", global_step=epoch)
Writing poems automatically
Before generating poems we need a helper function that turns the output probabilities into a character. To keep the bot from producing the same poem every time, some randomness is introduced: instead of always choosing the character with the highest probability, the probabilities are mapped onto an interval and a point is sampled from it. A character with a larger probability owns a larger sub-interval and is more likely to be picked, but Pang Hu still has a small chance of choosing some other character. Since every character is sampled this way, no two generated poems are the same.
def probsToWord(weights, words):
    """probs to word"""
    t = np.cumsum(weights)  # prefix sum
    s = np.sum(weights)
    coff = np.random.rand(1)
    index = int(np.searchsorted(t, coff * s))  # large margin has high possibility to be sampled
    return words[index]
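A quick illustration of the sampling behaviour on made-up weights (not real model output):

# Hypothetical probability row over a 4-character toy vocabulary.
toyWords = ["春", "花", "秋", "月"]
toyWeights = np.array([[0.1, 0.6, 0.2, 0.1]])
# "花" should be drawn roughly 60% of the time, but the other characters still appear occasionally.
print([probsToWord(toyWeights, toyWords) for _ in range(10)])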
Then we start writing poems. As before, we first build the model, define the relevant parameters and load the checkpoint.
gtX = tf.placeholder(tf.int32, shape=[1, None])  # input
logits, probs, stackCell, initState, finalState = buildModel(wordNum, gtX)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
    # if have checkPoint, restore checkPoint
    if checkPoint and checkPoint.model_checkpoint_path:
        saver.restore(sess, checkPoint.model_checkpoint_path)
        print("restored %s" % checkPoint.model_checkpoint_path)
    else:
        print("no checkpoint found!")
        exit(0)
We generate generateNum poems. Each poem starts from the left bracket and ends when a right bracket or a space is produced; at every step the output probabilities are converted into a character with probsToWord.
poems = []
for i in range(generateNum):
    state = sess.run(stackCell.zero_state(1, tf.float32))
    x = np.array([[wordToID['[']]])  # init start sign
    probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
    word = probsToWord(probs1, words)
    poem = ''
    while word != ']' and word != ' ':
        poem += word
        if word == '。':
            poem += '\n'
        x = np.array([[wordToID[word]]])
        # print(word)
        probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
        word = probsToWord(probs2, words)
    print(poem)
    poems.append(poem)
We can also write acrostic poems. Building the model and loading the checkpoint work exactly as before; in the generation loop, whenever a punctuation mark is reached we simply force the next input character to be the specified head character. Note that after the punctuation mark we did not use the character the model actually predicted, so the state has to be rolled forward once to skip over that character's generation.
flag = 1
endSign = {-1: ",", 1: "。"}
poem = ''
state = sess.run(stackCell.zero_state(1, tf.float32))
x = np.array([[wordToID['[']]])
probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
for c in characters:
    word = c
    flag = -flag
    while word != ']' and word != ',' and word != '。' and word != ' ':
        poem += word
        x = np.array([[wordToID[word]]])
        probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
        word = probsToWord(probs2, words)
    poem += endSign[flag]
    # keep the context, state must be updated
    if endSign[flag] == '。':
        probs2, state = sess.run([probs, finalState],
                                 feed_dict={gtX: np.array([[wordToID["。"]]]), initState: state})
        poem += '\n'
    else:
        probs2, state = sess.run([probs, finalState],
                                 feed_dict={gtX: np.array([[wordToID[","]]]), initState: state})
print(characters)
print(poem)
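As a usage sketch, the loop above expects characters to hold the acrostic head characters, one per line of the poem, each of which must appear in the vocabulary; a hypothetical example:

characters = "深度学习"   # hypothetical acrostic heads: one generated line per character
# flag starts at 1 and is negated before each head, so the generated lines
# alternate between ending with "," and "。", giving the usual couplet punctuation.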
Training for roughly 20 epochs on a GPU already gives decent results!
Once again, the code is at https://github.com/hjptriplebee/Chinese_poem_generator. Forks and stars are welcome.
A follow-up bot that writes poems from images, MC Pang Hu 2.0, will probably come later.
After all this talk, Pang Hu must be getting angry!