Author | Christian Beckmann
Source | Towards Data Science
Editor | 代码医生团队
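Have you ever thought about creating your own episode of "The Simpsons"?
That was my thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small, just-for-fun project in Natural Language Generation (NLG).
What is Natural Language Generation (NLG)?
"Natural language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form."
In this case, we will see how to train a model that can create new "Simpsons"-style dialogue. As training input we will use the file simpsons_script_lines.csv from the Simpsons dataset.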
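Downloading and preparing the data
First, download the data file. You can get it from "The Simpsons by the Data" on the Kaggle website. Download the file simpsons_script_lines.csv, save it to a folder named "data", and unzip it. After unzipping, it should be about 34 MB.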
https://www.kaggle.com/wcukierski/the-simpsons-by-the-data#simpsons_script_lines.csv
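If you look at the first lines of the file, you will see that this CSV has several columns:
(Figure: the first lines of simpsons_script_lines.csv)
To train the model we only need the plain script text, not the other columns, so we have to extract it from the file.
The easiest way to read such data is usually the Pandas read_csv() function, but it does not work here. The file uses commas as the delimiter, yet the text contains many unescaped commas, which breaks automatic parsing.
So we read the file as plain text and parse it with a regular expression.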
import re

input_file = './data/simpsons_script_lines.csv'

clean_text = ''

with open(input_file, 'r', encoding='utf8') as f:
    for line in f:
        # Capture the script text: everything between the first three
        # numeric columns and the next purely numeric column.
        text = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # Replace spaces in the speaker name (the part before the first
            # colon) with underscores so the name becomes a single word.
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')
            text = ':'.join(text_parts)
            clean_text += text + '\n'

# Show the first ten extracted lines.
print('\n'.join(clean_text.split('\n')[:10]))
The output of this script looks like this:
Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!
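Looking at the output, you can see that we did more than extract the text. We also replaced the spaces in the names with underscores, so "Lisa Simpson" becomes "Lisa_Simpson". This lets us use a name as the start word in the text generation step.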
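To see what the regular expression captures, here is a minimal sketch that applies the same pattern and name handling to a single line. The sample line is hypothetical: its numeric ids and trailing columns are invented for illustration, and only the quoted script text is taken from the output above. The helper name extract_script_line is also an assumption, not part of the original code.

```python
import re

# Same pattern as in the parsing loop above.
PATTERN = re.compile(r'[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,')


def extract_script_line(line):
    """Return 'Speaker_Name: text' for one CSV line, or None if nothing matches."""
    match = PATTERN.search(line)
    if match is None:
        return None
    text = match.group(1).replace('"', '')
    parts = text.split(':')
    parts[0] = parts[0].replace(' ', '_')  # "Miss Hoover" -> "Miss_Hoover"
    return ':'.join(parts)


# Hypothetical CSV line: the ids and trailing columns are made up.
sample = ('1,2,3,"Miss Hoover: No, actually, it was a little of both.",'
          '5000,true,464,3,Miss Hoover')
print(extract_script_line(sample))
```

The commas inside the quoted text do not end the match because the pattern only stops at a comma that is followed by digits (or nothing) and another comma. The flip side is that a script line containing something like ",12," in its text would probably be cut short, so this is a pragmatic parser rather than a robust one.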
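Data preprocessing
Before we can use the text as input for training the model, it needs some additional preprocessing.
We will split the script into an array of words, using spaces as the delimiter. However, punctuation such as periods and exclamation marks makes it hard for the neural network to tell that "bye" and "bye!" are the same word.
To solve this, we create a dictionary that maps each symbol to a token and adds a delimiter (a space) around it. This separates the symbols from the words and makes it easier for the neural network to predict the next word.
In the next step, we use this dictionary to replace the symbols and then build the vocabulary and the lookup tables for the words in the text.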
from collections import Counter

# Map each punctuation symbol to a token that is treated as its own word.
tokenized_punctuation = {
    '.': '||Period||',
    ',': '||Comma||',
    '"': '||Quotation_Mark||',
    ';': '||Semicolon||',
    '!': '||Exclamation_Mark||',
    '?': '||Question_Mark||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '--': '||Dash||',
    '\n': '||Return||'
}

# clean_text is the string built in the parsing step.
text = clean_text

# Surround every token with spaces so that split() separates it from the words.
for key, token in tokenized_punctuation.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

# Build the vocabulary (most frequent words first) and the two lookup tables.
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

# Encode the whole script as a list of word ids.
int_text = [vocab_to_int[word] for word in text]
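The effect of the tokenization is easiest to see on a tiny input. The following sketch uses only two entries of the punctuation dictionary and a made-up two-line script; the names punctuation_tokens and tokenize are assumptions for illustration. It shows that "Bye!" and "Bye" end up as the same word, with the exclamation mark as a separate token.

```python
from collections import Counter

# Two entries of the punctuation dictionary are enough for this example.
punctuation_tokens = {'!': '||Exclamation_Mark||', '\n': '||Return||'}


def tokenize(script):
    """Replace symbols with tokens, lowercase the text, and split on whitespace."""
    for symbol, token in punctuation_tokens.items():
        script = script.replace(symbol, ' {} '.format(token))
    return script.lower().split()


# Made-up mini script.
words = tokenize('Bart_Simpson: Bye!\nLisa_Simpson: Bye')
print(words)

# Vocabulary and lookup table, built the same way as above.
counts = Counter(words)
sorted_vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(sorted_vocab)}
encoded = [vocab_to_int[word] for word in words]
print(encoded)
```

Because "bye" occurs twice it is the most frequent word and receives id 0; both occurrences map to the same id, which is exactly what the network needs in order to treat them as one word.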
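Building the neural network
Now that the data is ready, it is time to create the neural network.
First, we create TensorFlow placeholders for the input, the targets, and the learning rate. The code uses the TensorFlow 1.x API (tf.placeholder, tf.contrib).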
import tensorflow as tf  # TensorFlow 1.x


def get_inputs():
    # Word ids with shape [batch_size, sequence_length]; the input is named so
    # that it can be retrieved from the saved graph later.
    input_placeholder = tf.placeholder(tf.int32, [None, None], name='input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)

    return input_placeholder, targets_placeholder, learning_rate_placeholder
Next, we create an RNN cell and initialize it.
def get_init_cell(batch_size, rnn_size):
    # A single GRU layer wrapped in a MultiRNNCell.
    gru = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([gru])
    # Name the zero state so that it can be retrieved from the saved graph later.
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state
Here we apply an embedding to input_data with TensorFlow and return the embedded sequence.
def get_embed(input_data, vocab_size, embed_dim):
    # One trainable vector of length embed_dim per vocabulary word.
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed
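Conceptually, the embedding lookup replaces every word id with the corresponding row of the embedding matrix. The following plain-Python sketch illustrates that idea without TensorFlow. The helper names make_embedding and embedding_lookup and the tiny sizes (4 words, 3 dimensions) are assumptions for illustration only; the values are drawn at random between -1 and 1, mirroring get_embed.

```python
import random


def make_embedding(vocab_size, embed_dim, seed=0):
    """Create a vocab_size x embed_dim matrix with random values between -1 and 1."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(embed_dim)]
            for _ in range(vocab_size)]


def embedding_lookup(embedding, input_ids):
    """Map ids of shape [batch][time] to vectors of shape [batch][time][embed_dim]."""
    return [[embedding[word_id] for word_id in sequence] for sequence in input_ids]


embedding = make_embedding(vocab_size=4, embed_dim=3)
embedded = embedding_lookup(embedding, [[0, 2, 2], [3, 1, 0]])

# Two sequences, three time steps each, three numbers per word.
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

The same id always yields the same vector, so the two occurrences of id 2 in the first sequence are identical. In the real model these vectors are trainable and are adjusted together with the rest of the network.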
We created an RNN cell in the get_init_cell() function. Now it is time to use that cell to create the RNN.
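def build_rnn(cell, inputs):
    # No initial_state is passed here, so dynamic_rnn starts from a zero
    # state on every call.
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    # Name the final state so that it can be retrieved from the saved graph later.
    final_state = tf.identity(state, name='final_state')
    return outputs, final_state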
Now we put the pieces together to build the final neural network.
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    # Linear layer (no activation) that produces one score per vocabulary word.
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state
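The fully connected layer outputs one raw score (a logit) per vocabulary word at each time step. When the graph is built, a softmax turns these scores into a probability distribution over the vocabulary. Here is a minimal plain-Python sketch of that step; the three logits are hypothetical values for a three-word vocabulary.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result.
    highest = max(logits)
    exps = [math.exp(score - highest) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-word vocabulary
probs = softmax(logits)
print(probs)
```

The word with the highest logit also gets the highest probability, and the ordering of the scores is preserved.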
Training the neural network
To train the neural network, we have to create batches of inputs and targets.
import numpy as np


def get_batches(int_text, batch_size, seq_length):
    # Drop the words at the end that do not fill a complete batch.
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * (batch_size * seq_length)])

    # Shape: (number of batches, input/target, batch size, sequence length)
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length))

    # The target of every word is the word that follows it.
    input_sequences = words.reshape(-1, seq_length)
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)

    # Consecutive sequences go to consecutive batches, in the same row.
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx, 0, target_idx, :] = input_sequences[idx, :]
        batches[input_idx, 1, target_idx, :] = target_sequences[idx, :]

    return batches
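The layout that get_batches produces is easier to follow on a small example. The sketch below is a plain-Python re-implementation of the same indexing (the name get_batches_py is an assumption), run on the numbers 0 to 19 with a batch size of 2 and a sequence length of 3, all chosen purely for illustration.

```python
def get_batches_py(int_text, batch_size, seq_length):
    """Plain-Python version of the batching logic; returns a list of (inputs, targets)."""
    n_batches = len(int_text) // (batch_size * seq_length)
    words = int_text[:n_batches * batch_size * seq_length]
    # Targets are the inputs shifted left by one; the last target wraps around.
    targets = words[1:] + words[:1]

    batches = []
    for batch_idx in range(n_batches):
        inputs, outputs = [], []
        for row in range(batch_size):
            # Sequence number in the original text that lands in this batch and row.
            start = (row * n_batches + batch_idx) * seq_length
            inputs.append(words[start:start + seq_length])
            outputs.append(targets[start:start + seq_length])
        batches.append((inputs, outputs))
    return batches


batches = get_batches_py(list(range(20)), batch_size=2, seq_length=3)
for inputs, targets in batches:
    print(inputs, targets)
```

Two things are visible in the result: every target sequence is its input sequence shifted by one word, and row 0 of batch 1 continues exactly where row 0 of batch 0 stopped. The last two of the 20 numbers are dropped because they do not fill a complete batch.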
Define the hyperparameters for training.
# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200

# Where to save the trained model
save_dir = './save'
Before training can start, we need to build the graph.
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function: every position in the batch gets a weight of 1
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping: limit every gradient value to the range [-1, 1]
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
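The gradient clipping step limits every single gradient value to the range [-1, 1], so that one unusually large gradient cannot cause a huge parameter update. The following plain-Python sketch shows the element-wise operation on a made-up gradient; the list-based clip_by_value here is a stand-in written for illustration, not the TensorFlow function.

```python
def clip_by_value(values, low, high):
    """Clamp every value to the closed interval [low, high]."""
    return [min(max(value, low), high) for value in values]


gradient = [0.3, -2.5, 7.0, -0.9]  # made-up gradient components
print(clip_by_value(gradient, -1.0, 1.0))  # [0.3, -1.0, 1.0, -0.9]
```

Values that are already inside the range are left untouched. Since each component is clipped independently, the direction of the gradient vector can change, which is a known trade-off of clipping by value compared with clipping by norm.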
Now we train the neural network on the preprocessed data. This takes a while: on a GTX 1080 Ti, training with the parameters above took about 4 hours.
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')
The output generated during training should look like this:
...
Epoch 49 Batch 1186/4686 train_loss = 1.737
Epoch 49 Batch 1386/4686 train_loss = 1.839
Epoch 49 Batch 1586/4686 train_loss = 2.050
Epoch 49 Batch 1786/4686 train_loss = 1.798
Epoch 49 Batch 1986/4686 train_loss = 1.751
Epoch 49 Batch 2186/4686 train_loss = 1.680
Epoch 49 Batch 2386/4686 train_loss = 1.641
Epoch 49 Batch 2586/4686 train_loss = 1.912
Epoch 49 Batch 2786/4686 train_loss = 1.811
Epoch 49 Batch 2986/4686 train_loss = 1.949
Epoch 49 Batch 3186/4686 train_loss = 1.821
Epoch 49 Batch 3386/4686 train_loss = 1.664
Epoch 49 Batch 3586/4686 train_loss = 1.735
Epoch 49 Batch 3786/4686 train_loss = 2.175
Epoch 49 Batch 3986/4686 train_loss = 1.710
Epoch 49 Batch 4186/4686 train_loss = 1.969
Epoch 49 Batch 4386/4686 train_loss = 2.055
Epoch 49 Batch 4586/4686 train_loss = 1.862
Model Trained and Saved
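To get a feeling for these numbers: assuming the loss is the average cross-entropy per word measured in nats (which is what sequence_loss computes with its default settings, as far as I can tell), exp(loss) is the perplexity of the model, roughly the number of words it is "choosing between" at each step. The sketch below converts two of the loss values from the log and also computes how many words one epoch covers with the hyperparameters above. The function name perplexity is an assumption for illustration.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert an average cross-entropy (in nats) to a perplexity."""
    return math.exp(cross_entropy_loss)


# Two loss values taken from the training log above.
for loss in (1.737, 2.175):
    print('train_loss = {:.3f} -> perplexity = {:.2f}'.format(loss, perplexity(loss)))

# Words processed per epoch: batches per epoch * batch size * sequence length.
batches_per_epoch, batch_size, seq_length = 4686, 32, 16
words_per_epoch = batches_per_epoch * batch_size * seq_length
print(words_per_epoch)
```

A loss around 1.7 corresponds to a perplexity of roughly 5.7. Keep in mind that this is measured on the training data, so it says little about how well the model handles unseen text.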
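Generating the TV script
With training finished, we are at the last step of this project: generating a new TV script for "The Simpsons"!
First, we need to get the tensors from loaded_graph.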
def get_tensors(loaded_graph):
    # The names match those assigned when the graph was built.
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor
We also need a function that uses the probabilities to select the next word.
def pick_word(probabilities, int_to_vocab):
    # Always pick the most probable word.
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
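Since pick_word always takes the most probable word, the generated text is fully determined by the prime word. A common alternative, which is not what the code in this article does, is to sample the next word according to the predicted probabilities. Here is a minimal sketch of that variant using only the standard library; the function name pick_word_sampled, the toy vocabulary, and the probabilities are all made up for illustration.

```python
import random


def pick_word_sampled(probabilities, int_to_vocab, rng=random):
    """Sample a word id according to its predicted probability."""
    word_ids = list(range(len(probabilities)))
    word_id = rng.choices(word_ids, weights=probabilities, k=1)[0]
    return int_to_vocab[word_id]


# Toy vocabulary and probabilities, invented for this example.
toy_vocab = {0: 'homer_simpson:', 1: 'marge_simpson:', 2: "d'oh"}
toy_probs = [0.7, 0.2, 0.1]

rng = random.Random(42)  # fixed seed so that repeated runs give the same words
print([pick_word_sampled(toy_probs, toy_vocab, rng) for _ in range(5)])
```

Sampling tends to produce more varied text, at the price of occasionally picking unlikely words. Whether that improves the generated scripts here would have to be tried out.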
Finally, we are ready to generate the TV script. Set gen_length to the length (in words) of the script you want to generate.
gen_length = 500

"""
The prime word is used as the start word for the text generation.
To generate different text try different prime words like:
'marge_simpson'
'bart_simpson'
'lisa_simpson'
'seymour_skinner'
'chief_wiggum'
'judge_snyder'
"""
prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input: the last seq_length generated words as ids
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})

        # Use the distribution predicted for the last position
        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)

        gen_sentences.append(pred_word)

    # Remove tokens: turn the punctuation tokens back into symbols
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')

    print(tv_script)
This should give you output like this:
INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: int. simpson house - living room - day)
marge_simpson:(concerned) okay, open your door.
homer_simpson: don't push, marge. we'll be fine.
judge_snyder:(sarcastic) children, you want a night?
homer_simpson:(gulp) oh, i can't believe i wasn't in a car.
chief_wiggum:(to selma) i can't find this map. and she's gonna release that?
homer_simpson:(lots of hair) just like me.
homer_simpson:(shrugs) gimme a try.
homer_simpson:(sweetly) i don't know, but i don't remember that.
marge_simpson:(brightening) are you all right?
homer_simpson: absolutely...
lisa_simpson:(mad) even better!
homer_simpson:(hums)
marge_simpson: oh, homie. that's a doggie door.
homer_simpson:(moan) i don't have computers.
homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?
marge_simpson:(short giggle)
homer_simpson:(happy) oh, marge, i found the two thousand and cars.
marge_simpson:(frustrated sound)
lisa_simpson:(outraged) are you, you're too far to go?
boys:(skeptical) well, i'm gonna be here at the same time.
homer_simpson:(moans) why are you doing us by doing anything?
marge_simpson: well, it always seemed like i'm gonna be friends with...
homer_simpson:(seething) losers!
(simpson_home: int. simpson house -
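The "Remove tokens" step at the end of the generation code explains the formatting of this output, for example why there is no space between the speaker name and an opening parenthesis. The following sketch runs the same replacements on a short, hand-written list of generated words. The names punctuation_tokens and detokenize as well as the word list are assumptions for illustration, and only four dictionary entries are included.

```python
# Four entries of the punctuation dictionary are enough for this example.
punctuation_tokens = {
    '!': '||Exclamation_Mark||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '\n': '||Return||',
}


def detokenize(words):
    """Join generated words and turn punctuation tokens back into symbols."""
    script = ' '.join(words)
    for symbol, token in punctuation_tokens.items():
        # The tokens were lowercased during preprocessing, hence token.lower().
        script = script.replace(' ' + token.lower(), symbol)
    script = script.replace('\n ', '\n')
    script = script.replace('( ', '(')
    return script


# Hand-written stand-in for the words a trained model might generate.
generated = ['homer_simpson:', '||left_parentheses||', 'moans', '||right_parentheses||',
             '||return||', 'lisa_simpson:', 'even', 'better', '||exclamation_mark||']
print(detokenize(generated))
```

Each token is replaced together with the space in front of it, so punctuation attaches directly to the preceding word. The two extra replacements then remove the space that would otherwise follow a line break or an opening parenthesis.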
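Conclusion
We trained a model to generate new text!
As you can see, the generated text does not really make sense. The project is meant to show how to prepare data for training a model and to give a basic idea of how NLG works.
If you like, you can tune the parameters, add more layers, or change their size, and see how the output of the model changes.
The code for this project is on GitHub: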
https://github.com/chricke/medium/blob/master/simpsons-tv-script-generation/Generate%20your%20own%20Simpsons%20TV%20script%20using%20Deep%20Learning.ipynb