The Fun of Deep Learning: How to Generate Your Own "Simpsons" TV Script

代码医生工作室

Published 2019-06-21 17:01:29

Collected in the column: 相约机器人

Author | Christian Beckmann

Source | Towards Data Science

Editor | 代码医生团队

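Ever wanted to write your own episode of "The Simpsons"?

That was my thought when I saw the Simpsons dataset on Kaggle. It is a perfect dataset for a small, just-for-fun project in natural language generation (NLG).

What is natural language generation (NLG)?

"Natural language generation (NLG) is an aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form."

In this article we will see how to train a model that can create new "Simpsons"-style dialogue. As training input we will use the file simpsons_script_lines.csv from the Simpsons dataset.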

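Downloading and preparing the data

First, download the data file. You can get it from "The Simpsons by the Data" on the Kaggle website. Download simpsons_script_lines.csv, save it to a folder named "data", and unzip it. The extracted file should be about 34 MB.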

https://www.kaggle.com/wcukierski/the-simpsons-by-the-data#simpsons_script_lines.csv

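If you look at the first lines of the file, you will see that this CSV has several columns:

[Figure: first lines of simpsons_script_lines.csv]

To train the model we only need the plain text, not any of the other columns, so we have to extract it from the file.

The easiest way to read such data is usually the Pandas read_csv() function, but it does not work in this case. The file uses commas as delimiters, yet the text contains many unescaped commas, which breaks automatic parsing.

So we read the file as plain text and parse it with a regular expression.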

import os
import re

# Path to the extracted CSV file
input_file = os.path.join('.', 'data', 'simpsons_script_lines.csv')

clean_text = ''

with open(input_file, 'r', encoding='utf8') as f:
    for line in f:
        # The raw text column follows three numeric columns and precedes another numeric column
        text = re.search(r'[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # Replace spaces with underscores in the part before the first colon (the speaker name)
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')
            text = ':'.join(text_parts)
            clean_text += text + '\n'

# Print the first 10 lines as a sanity check
print('\n'.join(clean_text.split('\n')[:10]))

The output of this script looks like this:

Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.

Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?

Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?

Lisa_Simpson: That life is worth living.

Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?

Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.

Edna_Krabappel-Flanders: Bart?

Bart_Simpson: Victory party under the slide!

(Apartment_Building: Ext. apartment building - day)

Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!

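Looking at the output, you can see that we did more than extract the text. We also replaced the spaces in the speaker names with underscores, so "Lisa Simpson" became "Lisa_Simpson". This lets us use a name as the starting word for the text generation step.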
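To make the parsing step concrete, here is a minimal sketch that applies the same regular expression and name rewrite to a single row. The sample row is made up for illustration (it is not copied from the dataset), and parse_line is a hypothetical helper name.

```python
import re

# Same pattern as in the parsing loop: three numeric columns, the raw text,
# then another numeric column.
PATTERN = re.compile(r'[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,')


def parse_line(line):
    """Return 'Speaker_Name: text' for one CSV row, or None if it does not match."""
    match = PATTERN.search(line)
    if not match:
        return None
    text = match.group(1).replace('"', '')
    parts = text.split(':')
    parts[0] = parts[0].replace(' ', '_')  # only the speaker name is rewritten
    return ':'.join(parts)


# Hypothetical row with unquoted commas inside the text column
sample = '10,32,5,Lisa Simpson: Hi, Bart, how are you?,8000,true'

naive_fields = sample.split(',')  # splitting on every comma cuts the sentence apart
parsed = parse_line(sample)

print(len(naive_fields))  # more fields than the row has columns
print(parsed)
```

Splitting on every comma yields more fields than the row has columns, while the lazy group (.+?) keeps the whole sentence together because it only stops at a comma that is followed by a numeric column.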

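Data preprocessing

Before we can use this text as input for training the model, some additional preprocessing is needed.

We will split the script into an array of words, using whitespace as the delimiter. However, punctuation such as periods and exclamation marks makes it hard for the neural network to tell that "bye" and "bye!" are the same word.

To solve this, we create a dictionary that maps each symbol to a token and put spaces around the token. This separates the symbols from the words and makes it easier for the network to predict the next word.

In the next step we use this dictionary to replace the symbols, then build the vocabulary and the lookup tables for the words in the text.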

from collections import Counter

# Map each punctuation symbol to a token
tokenized_punctuation = {
    '.': '||Period||',
    ',': '||Comma||',
    '"': '||Quotation_Mark||',
    ';': '||Semicolon||',
    '!': '||Exclamation_Mark||',
    '?': '||Question_Mark||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '--': '||Dash||',
    '\n': '||Return||'
}

# clean_text is the string built in the parsing step
text = clean_text

# Surround every token with spaces so it is split off as its own word
for key, token in tokenized_punctuation.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

# Build the vocabulary, ordered from most to least frequent word
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

# Encode the whole script as a list of word ids
int_text = [vocab_to_int[word] for word in text]
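As a worked example, the sketch below runs the same tokenization and lookup-table steps on a two-line sample. The sample text is invented for illustration, and only a subset of the punctuation dictionary is used.

```python
from collections import Counter

# Subset of the punctuation dictionary, enough for this example
tokens = {'.': '||Period||', '!': '||Exclamation_Mark||', '\n': '||Return||'}

# Invented sample in the same "Speaker_Name: text" format
sample = 'Bart_Simpson: Bye! Bye.\nLisa_Simpson: Bye.'

text = sample
for key, token in tokens.items():
    text = text.replace(key, ' {} '.format(token))
words = text.lower().split()

# Most frequent word gets id 0, the next one id 1, and so on
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = dict(enumerate(sorted_vocab))
vocab_to_int = {word: i for i, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in words]

print(words)
print(int_text)
```

Because the tokens are split off, "bye!" and "bye." both contribute to the single vocabulary entry "bye", which ends up with id 0 as the most frequent word in the sample.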

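Building the neural network

Now that the data is ready, it is time to create the neural network.

First we create TensorFlow placeholders for the input, the targets, and the learning rate.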

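# Note: this code uses the TensorFlow 1.x API (tf.placeholder, tf.contrib)
import tensorflow as tf

def get_inputs():
    # Word ids with shape [batch_size, seq_length]; both dimensions are left dynamic
    input_placeholder = tf.placeholder(tf.int32, [None, None], name='input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)

    return input_placeholder, targets_placeholder, learning_rate_placeholder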

Next we create an RNN cell and initialize it.

def get_init_cell(batch_size, rnn_size):
    # A single GRU layer wrapped in MultiRNNCell
    gru = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([gru])
    # Name the zero state so it can be looked up in the saved graph later
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state

Here we apply an embedding to input_data using TensorFlow and return the embedded sequence.

def get_embed(input_data, vocab_size, embed_dim):
    # Trainable embedding matrix, initialized uniformly in [-1, 1)
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    # Replace every word id with its embedding vector
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed
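Conceptually, an embedding lookup is a row lookup in a table of vectors. The following is a plain-Python sketch of that idea, not the TensorFlow implementation; make_embedding and embedding_lookup are hypothetical helper names and the sizes are chosen only for illustration.

```python
import random


def make_embedding(vocab_size, embed_dim, seed=0):
    """Create a vocab_size x embed_dim table with values drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]


def embedding_lookup(embedding, input_ids):
    """Replace every word id in a batch of sequences with its row of the table."""
    return [[embedding[word_id] for word_id in sequence] for sequence in input_ids]


embedding = make_embedding(vocab_size=5, embed_dim=3)
input_ids = [[0, 2, 2], [4, 1, 0]]  # batch of 2 sequences, 3 word ids each
embedded = embedding_lookup(embedding, input_ids)

# Resulting shape: batch_size x seq_length x embed_dim
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

The same word id always maps to the same vector, so the two occurrences of id 2 in the first sequence share one row of the table. In the real model that table is a trainable variable.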

We created an RNN cell in get_init_cell(). Now it is time to use that cell to create the RNN.

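def build_rnn(cell, inputs):
    # Unroll the cell over the time dimension of the inputs.
    # Note: no initial_state is passed here, so dynamic_rnn starts each run from a zero state.
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    # Name the final state so it can be looked up in the saved graph later
    final_state = tf.identity(state, name='final_state')
    return outputs, final_state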

Now we put it all together to build the final neural network.

def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    # Linear layer (no activation) that maps each RNN output to one logit per vocabulary word
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state

Training the neural network

To train the neural network we have to create batches of inputs and targets.

import numpy as np

def get_batches(int_text, batch_size, seq_length):
    # Drop the trailing words that do not fill a complete batch
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * (batch_size * seq_length)])

    # Shape: (number of batches, input/target, batch size, sequence length)
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length), dtype=np.int32)

    # Targets are the inputs shifted by one word (the last target wraps around to the first word)
    input_sequences = words.reshape(-1, seq_length)
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)

    # Distribute the sequences so that row k of batch i+1 continues row k of batch i
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx, 0, target_idx, :] = input_sequences[idx, :]
        batches[input_idx, 1, target_idx, :] = target_sequences[idx, :]

    return batches
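To see the layout this function produces, here is a compact sketch that uses reshape and transpose instead of the loop; as far as I can tell it yields the same arrangement. The name get_batches_sketch and the toy input (the numbers 0 to 19 standing in for word ids) are assumptions for illustration.

```python
import numpy as np


def get_batches_sketch(int_text, batch_size, seq_length):
    """Return an array of shape (n_batches, 2, batch_size, seq_length)."""
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * batch_size * seq_length])
    targets = np.roll(words, -1)  # next-word targets; the last one wraps around

    # Each of the batch_size rows walks through its own contiguous slice of the text
    x = words.reshape(batch_size, n_batches, seq_length).transpose(1, 0, 2)
    y = targets.reshape(batch_size, n_batches, seq_length).transpose(1, 0, 2)
    return np.stack([x, y], axis=1)


batches = get_batches_sketch(list(range(20)), batch_size=2, seq_length=3)

print(batches.shape)
print(batches[0, 0])  # inputs of the first batch
print(batches[0, 1])  # targets of the first batch
print(batches[1, 0])  # inputs of the second batch
```

With 20 ids, a batch size of 2 and a sequence length of 3, the last two ids are dropped and three batches remain. Row 0 of the first batch holds ids 0 to 2 and row 0 of the second batch holds ids 3 to 5, so each row reads the text in order from one batch to the next.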

Define the hyperparameters used for training.

# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200

# where to save the trained model
save_dir = './save'

Before we can start training, we need to build the graph.

from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function (every position gets weight 1)
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping: limit every gradient element to [-1, 1]
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
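The three numerical pieces of this graph (softmax probabilities, the loss, and gradient clipping) can be illustrated in plain Python. This is a sketch under the assumption that sequence_loss with all-one weights and its default averaging amounts to the mean per-token cross-entropy; the helper names and the toy logits are made up for illustration.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_losses(logit_rows, targets):
    """Cross-entropy of each position: -log(probability of the target word)."""
    return [-math.log(softmax(row)[t]) for row, t in zip(logit_rows, targets)]


def clip_by_value(values, low, high):
    """Clamp every element to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]


# Two positions, vocabulary of three words (toy numbers)
logit_rows = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]

per_token = token_losses(logit_rows, targets)
loss = sum(per_token) / len(per_token)  # mean over all positions
clipped = clip_by_value([-3.2, 0.4, 1.7], -1.0, 1.0)

print(per_token, loss, clipped)
```

The second position has equal logits for all three words, so its loss is log(3): the model is guessing uniformly. Clipping leaves small gradient values unchanged and caps large ones at -1 or 1.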

Now we train the neural network on the preprocessed data. This takes a while: on a GTX 1080 Ti, training with the parameters above took about 4 hours.

batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

The output generated during training should look like this:

...

Epoch  49 Batch 1186/4686   train_loss = 1.737

Epoch  49 Batch 1386/4686   train_loss = 1.839

Epoch  49 Batch 1586/4686   train_loss = 2.050

Epoch  49 Batch 1786/4686   train_loss = 1.798

Epoch  49 Batch 1986/4686   train_loss = 1.751

Epoch  49 Batch 2186/4686   train_loss = 1.680

Epoch  49 Batch 2386/4686   train_loss = 1.641

Epoch  49 Batch 2586/4686   train_loss = 1.912

Epoch  49 Batch 2786/4686   train_loss = 1.811

Epoch  49 Batch 2986/4686   train_loss = 1.949

Epoch  49 Batch 3186/4686   train_loss = 1.821

Epoch  49 Batch 3386/4686   train_loss = 1.664

Epoch  49 Batch 3586/4686   train_loss = 1.735

Epoch  49 Batch 3786/4686   train_loss = 2.175

Epoch  49 Batch 3986/4686   train_loss = 1.710

Epoch  49 Batch 4186/4686   train_loss = 1.969

Epoch  49 Batch 4386/4686   train_loss = 2.055

Epoch  49 Batch 4586/4686   train_loss = 1.862

Model Trained and Saved
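The numbers in this log follow from the hyperparameters. The short calculation below takes the 4686 batches per epoch from the log itself and reproduces which batch indices get printed in the last epoch; logged_batches is a hypothetical helper that mirrors the logging condition in the training loop.

```python
n_batches = 4686            # batches per epoch, as shown in the log
num_epochs = 50
batch_size = 32
seq_length = 16
show_every_n_batches = 200

total_steps = n_batches * num_epochs
tokens_per_epoch = n_batches * batch_size * seq_length


def logged_batches(epoch_i):
    """Batch indices that satisfy the logging condition of the training loop."""
    return [b for b in range(n_batches)
            if (epoch_i * n_batches + b) % show_every_n_batches == 0]


last_epoch = logged_batches(49)
print(total_steps, tokens_per_epoch)
print(last_epoch[:3], last_epoch[-1])
```

The logging condition counts steps across all epochs, not within one epoch, which is why the batch numbers printed for epoch 49 end in 86 rather than being multiples of 200.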

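Generating the TV script

With training finished, we are at the last step of this project: generating a new TV script for "The Simpsons"!

First we need to get the tensors from loaded_graph: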

def get_tensors(loaded_graph):
    # Look up the tensors by the names assigned when the graph was built
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor

We also need a function that uses probabilities to choose the next word.

def pick_word(probabilities, int_to_vocab):
    # Always pick the most probable word
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
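Because pick_word takes the argmax, the same context always produces the same next word. One possible variation, which is not part of the original project, is to sample the next word in proportion to its probability. The sketch below uses only the standard library; pick_word_sampled and the tiny vocabulary are hypothetical.

```python
import random


def pick_word_sampled(probabilities, int_to_vocab, rng=random):
    """Draw the next word at random, weighted by the predicted probabilities."""
    word_ids = list(range(len(probabilities)))
    word_id = rng.choices(word_ids, weights=probabilities, k=1)[0]
    return int_to_vocab[word_id]


# Toy vocabulary and distribution for illustration
int_to_vocab = {0: 'homer_simpson:', 1: 'marge_simpson:', 2: 'bart_simpson:'}
probabilities = [0.6, 0.3, 0.1]

rng = random.Random(42)  # fixed seed so repeated runs draw the same words
samples = [pick_word_sampled(probabilities, int_to_vocab, rng) for _ in range(5)]
print(samples)
```

Sampling trades some coherence for variety: likely words still dominate, but less likely words appear occasionally, which may reduce repetitive output.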

Finally we are ready to generate the TV script. Set gen_length to the length (in generated tokens) of the script you want to generate.

gen_length = 500

# The prime word is used as the start word for the text generation.
# To generate different text try different prime words like:
# 'marge_simpson', 'bart_simpson', 'lisa_simpson',
# 'seymour_skinner', 'chief_wiggum', 'judge_snyder'
prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input: the last seq_length generated words
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})

        # Use the distribution predicted at the last position of the input
        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)

    # Remove tokens: turn the punctuation tokens back into symbols
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')

    print(tv_script)
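The last part of this script turns the punctuation tokens back into symbols. Here is that step in isolation, applied to a short, hand-written list of tokens (not real model output) with a subset of the punctuation dictionary.

```python
# Subset of the punctuation dictionary, enough for this example
tokenized_punctuation = {
    '.': '||Period||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '\n': '||Return||',
}

# Hand-written tokens standing in for generated words
generated = ['homer_simpson:', '||left_parentheses||', 'moans', '||right_parentheses||',
             '||return||', 'marge_simpson:', 'hi', '||period||']

tv_script = ' '.join(generated)

# Replace each token (and the space in front of it) with its symbol
for key, token in tokenized_punctuation.items():
    tv_script = tv_script.replace(' ' + token.lower(), key)

# Remove the spaces left behind after line breaks and opening parentheses
tv_script = tv_script.replace('\n ', '\n')
tv_script = tv_script.replace('( ', '(')

print(tv_script)
```

The tokens are matched in lower case because the whole text was lower-cased during preprocessing. Replacing the leading space together with the token is also why a parenthesis follows the speaker's colon directly, as in "homer_simpson:(moans)".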

This should give you an output like this:

INFO:tensorflow:Restoring parameters from ./save

homer_simpson:(moans)

marge_simpson:(annoyed murmur)

homer_simpson:(annoyed grunt)

(moe's_tavern: ext. moe's - night)

homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?

bart_simpson:(horrified gasp)

(simpson_home: ext. simpson house - day - establishing)

homer_simpson:(worried) i've got a wet!

homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!

(department int. sports arena - night)

seymour_skinner:(chuckles)

chief_wiggum:(laughing) oh, i get it.

seymour_skinner:(snapping) i guess this building is quiet.

homer_simpson:(stunned) what? how'd you like that?

professor_jonathan_frink: uh, well, looks like the little bit of you.

bart_simpson:(to larry) i guess this is clearly justin, right?

homer_simpson:(dismissive snort) oh, i am.

marge_simpson:(pained) hi.

homer_simpson:(pained sound) i thought you might have some good choice.

homer_simpson:(pained) oh, sorry.

(simpson_home: int. simpson house - living room - day)

marge_simpson:(concerned) okay, open your door.

homer_simpson: don't push, marge. we'll be fine.

judge_snyder:(sarcastic) children, you want a night?

homer_simpson:(gulp) oh, i can't believe i wasn't in a car.

chief_wiggum:(to selma) i can't find this map. and she's gonna release that?

homer_simpson:(lots of hair) just like me.

homer_simpson:(shrugs) gimme a try.

homer_simpson:(sweetly) i don't know, but i don't remember that.

marge_simpson:(brightening) are you all right?

homer_simpson: absolutely...

lisa_simpson:(mad) even better!

homer_simpson:(hums)

marge_simpson: oh, homie. that's a doggie door.

homer_simpson:(moan) i don't have computers.

homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?

marge_simpson:(short giggle)

homer_simpson:(happy) oh, marge, i found the two thousand and cars.

marge_simpson:(frustrated sound)

lisa_simpson:(outraged) are you, you're too far to go?

boys:(skeptical) well, i'm gonna be here at the same time.

homer_simpson:(moans) why are you doing us by doing anything?

marge_simpson: well, it always seemed like i'm gonna be friends with...

homer_simpson:(seething) losers!

(simpson_home: int. simpson house -

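Conclusion

We trained a model to generate new text!

As you can see, the generated text does not make much sense. The project is meant to show how to prepare data for training a model and to give a basic idea of how NLG works.

If you like, you can tune the parameters, add more layers, or change their size, and see how the model's output changes.

The code for this project on GitHub: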

https://github.com/chricke/medium/blob/master/simpsons-tv-script-generation/Generate%20your%20own%20Simpsons%20TV%20script%20using%20Deep%20Learning.ipynb


Originally published: 2019-04-01.
