
Deep Learning in Practice: Movie Review Sentiment Analysis from Scratch

Author: AINLP
Published: 2019-10-10

I recently read *Deep Learning with Python*. It is an excellent book, and I recommend it highly.

The book is written by François Chollet, the creator of Keras and an AI researcher at Google. It is a thorough, hands-on introduction to deep learning with Python and Keras, covering applications in computer vision, natural language processing, and generative models, with more than 30 code examples explained step by step. Because the book aims to make AI accessible to a broad audience, readers need no prior background in machine learning. After finishing it, readers should be able to set up their own deep learning environment, build image recognition models, and generate images and text.

The book is excellent in every respect, yet something always felt missing. On reflection, it may be that the author did his job too well: all the data preprocessing is already done for you, which is the only reason "sentiment analysis in 20 lines" is possible. The same issue comes up when learning other deep learning tools: many ship with preprocessed datasets, so learning reduces to calling the right interfaces. In real work, however, preprocessing matters enormously. From data acquisition to cleaning to basic processing (Chinese needs word segmentation; English needs tokenization and truecasing or lowercasing; stop words may need removing; and so on), a lot has to happen before you can "feed" data to a tool. This part seems underserved by current tutorials, which is what motivated this "from scratch" series. The first exercise takes the first text-mining example in *Deep Learning with Python*: classifying movie reviews as positive or negative, a binary classification problem that can also be framed as a sentiment analysis task.

First, the raw data: aclImdb, the Large Movie Review Dataset, released in 2011 by the Stanford AI Lab. It contains 25,000 training reviews and 25,000 test reviews, plus about 50,000 unlabeled auxiliary reviews. The training and test sets each split into 12,500 positive (pos) and 12,500 negative (neg) examples. For details, see the dataset homepage (http://ai.stanford.edu/~amaas/data/sentiment/), the paper *Learning Word Vectors for Sentiment Analysis*, and the README shipped with the data.

Next, download and process the data, Large Movie Review Dataset v1.0, from:

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

After downloading, extract it with `tar -zxvf aclImdb_v1.tar.gz`, then inspect the directory layout with the `tree` command:

```shell
tree aclImdb -L 2
```
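The `tree` output is not reproduced here; based on the dataset's README, the top two levels look roughly like this (auxiliary `.feat` and `urls_*.txt` files omitted; verify against your own copy):

```text
aclImdb
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── neg
│   └── pos
└── train
    ├── neg
    ├── pos
    └── unsup
```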

Next, step into the positive examples of the training set: `cd aclImdb/train/pos/`:

This directory holds 12,500 English reviews. Open one at random to see what the text looks like:

```shell
vim 1234_10.txt
```

I grew up watching this movie ,and I still love it just as much today as when i was a kid. Don't listen to the critic reviews. They are not accurate on this film.Eddie Murphy really shines in his roll.You can sit down with your whole family and everybody will enjoy it.I recommend this movie to everybody to see. It is a comedy with a touch of fantasy.With demons ,dragons,and a little bald kid with God like powers.This movie takes you from L.A. to Tibet , of into the amazing view of the wondrous temples of the mountains in Tibet.Just a beautiful view! So go do your self a favor and snatch this one up! You wont regret it!
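As a preview of the preprocessing applied later (tokenization plus lowercasing), here is a rough standard-library approximation on a sentence from this review. `simple_tokenize` is a crude stand-in for the MosesTokenizer the scripts below actually use, not the article's real tokenizer:

```python
import re

def simple_tokenize(text):
    # Lowercase, then split into runs of letters/digits/apostrophes
    # versus single punctuation marks; a crude approximation of what
    # a real tokenizer such as MosesTokenizer produces.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

review = "Don't listen to the critic reviews. They are not accurate on this film."
tokens = simple_tokenize(review)
print(tokens[:8])  # ["don't", 'listen', 'to', 'the', 'critic', 'reviews', '.', 'they']
```

Even this toy version shows why tokenization matters: the period is split off from "reviews", so punctuation does not pollute the vocabulary.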

Before preprocessing, it helps to fix a goal. The aim here is to reuse Keras's existing interfaces. Keras ships imdb.py, a script that loads the preprocessed IMDB data, but (apparently) no script that produces that data; if it did, this article would be unnecessary. The script lives in the official Keras GitHub repository:

https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py

The script reads two files: imdb_word_index.json and imdb.npz. The former is the word index file, sorted by word frequency from high to low; the first entry is "the": 1.

The latter is a NumPy NPZ file holding several arrays: the IMDB training and test sets with words converted to ids via the word index above. Inspect it:

```python
In [1]: import numpy as np
In [2]: f = np.load('imdb.npz')
In [3]: f.keys()
Out[3]: ['x_test', 'x_train', 'y_train', 'y_test']
In [4]: x_train, y_train, x_test, y_test = f['x_train'], f['y_train'], f['x_test'], f['y_test']
In [5]: len(x_train), len(y_train), len(x_test), len(y_test)
Out[5]: (25000, 25000, 25000, 25000)
In [6]: x_train.shape
Out[6]: (25000,)
In [7]: y_train.shape
Out[7]: (25000,)
...
In [12]: x_train[0:2]
Out[12]:
array([[23022, 309, 6, 3, 1069, ..., 3, 2237, 12, 9, 215],
       [23777, 39, 81226, 14, 739, ..., 6018, 22, 5, 336, 406]], dtype=object)
In [13]: y_train[0:2]
Out[13]: array([1, 1])
In [14]: x_test.shape
Out[14]: (25000,)
In [15]: y_test.shape
Out[15]: (25000,)
In [16]: x_test[0:2]
Out[16]:
array([[10, 432, 2, 216, 11, ..., 64, 9, 156, 22, 1916],
       [281, 676, 164, 985, 5696, ..., 1012, 5, 166, 32, 308]], dtype=object)
In [17]: y_test[0:2]
Out[17]: array([1, 1])
```
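As a sanity check on this format, a toy .npz with the same layout (ragged object arrays of ids plus label arrays) can be created and read back. The file name and ids below are made up for illustration; note that recent NumPy versions require `allow_pickle=True` when loading object arrays:

```python
import os
import tempfile

import numpy as np

# Two ragged reviews-as-id-lists in an object array, plus labels,
# saved under the same keys imdb.npz uses.
x = np.empty(2, dtype=object)
x[0] = [23022, 309, 6]
x[1] = [23777, 39, 81226, 14]
y = np.array([1, 0])

path = os.path.join(tempfile.mkdtemp(), 'toy_imdb.npz')
np.savez(path, x_train=x, y_train=y)

# Object arrays need allow_pickle=True on NumPy >= 1.16.3.
with np.load(path, allow_pickle=True) as f:
    print(sorted(f.keys()))   # ['x_train', 'y_train']
    print(f['x_train'][1])    # [23777, 39, 81226, 14]
```

The object dtype is what allows rows of different lengths, which is why the shapes above are `(25000,)` rather than a 2-D matrix.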

Now the raw aclImdb data can be processed along the same lines. I have set up a GitHub project, AINLP (same name as our WeChat public account, which you are welcome to follow as well); its subproject aclimdb_sentiment_analysis_from_scratch provides several Python scripts. They are compatible with Python 2 and Python 3 and have been run successfully under Python 2.7, 3.6, and 3.7 (other versions untested). Before running them, install the dependencies listed in requirement.txt:

```text
numpy==1.15.2
sacremoses==0.0.5
six==1.11.0
```

Here sacremoses provides the English tokenizer interface. Previously MosesTokenizer was called through NLTK, but that interface was recently removed from NLTK over an open-source licensing issue; sacremoses is a standalone port with an identical interface. The first step is to build the word index, handled by build_word_index.py. Only the training and test data are processed; the unlabeled data (unsup) is ignored:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six
from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def build_word_index(input_dir, output_json):
    word_count = OrderedDict()
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(r".*\d+_\d+.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                if 'unsup' in filepath:
                    continue
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            if word not in word_count:
                                word_count[word] = 0
                            word_count[word] += 1
    words = list(word_count.keys())
    counts = list(word_count.values())
    sorted_idx = np.argsort(counts)
    sorted_words = [words[ii] for ii in sorted_idx[::-1]]
    word_index = OrderedDict()
    for ii, ww in enumerate(sorted_words):
        word_index[ww] = ii + 1
    with open(output_json, 'w') as fp:
        json.dump(word_index, fp)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-id', '--input_dir', type=str, nargs='?',
                        default='./data/aclImdb/',
                        help='input data directory')
    parser.add_argument('-ot', '--output_json', type=str, nargs='?',
                        default='./data/aclimdb_word_index.json',
                        help='output word index dict json')
    args = parser.parse_args()
    input_dir = args.input_dir
    output_json = args.output_json
    build_word_index(input_dir, output_json)
```

Mind the file locations in the script; in my setup the unpacked aclImdb directory sits under data.

If you arrange the data the same way, the script runs as-is; otherwise, point the script's arguments at your own directories. Run:

```shell
python build_word_index.py
```

When it finishes, a word index file appears under data: aclimdb_word_index.json. Because the script uses an OrderedDict, the dumped JSON still shows the word index in sorted order. Note that punctuation is not stripped here, nor are HTML tags removed; interested readers can try refining this further.
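The index-building logic (count every lowercased token, sort by frequency descending, assign ids starting from 1) can be sketched on a toy corpus. This sketch uses `collections.Counter` instead of the script's OrderedDict-plus-`np.argsort`, so the ordering of frequency ties may differ from the real output:

```python
from collections import Counter

# Toy corpus standing in for the 50,000 reviews.
docs = ["the movie was great", "the movie was bad", "great fun"]

# Count every token, then assign ids 1, 2, 3, ... by descending frequency.
counts = Counter(w for doc in docs for w in doc.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
print(word_index['the'])  # 1  (a most-frequent word gets the lowest id)
```

Reserving id 0 (no word maps to it) is deliberate: later scripts use 0 for unknown words and padding.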

The second script, build_data_index.py, processes the training and test sets: using the aclimdb_word_index.json produced above, it converts the plain text into numeric ids, builds four NumPy arrays (x_train, y_train, x_test, y_test), and stores them in an npz file:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import numpy as np
import re
import six
from collections import OrderedDict
from os import walk
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()


def get_word_index(word_index_path):
    with open(word_index_path) as f:
        return json.load(f)


def build_data_index(input_dir, word_index):
    train_x = []
    train_y = []
    for root, dirs, files in walk(input_dir):
        for filename in files:
            if re.match(r".*\d+_\d+.txt", filename):
                filepath = root + '/' + filename
                print(filepath)
                if 'pos' in filepath:
                    train_y.append(1)
                elif 'neg' in filepath:
                    train_y.append(0)
                else:
                    continue
                train_list = []
                with open(filepath, 'r') as f:
                    for line in f:
                        if six.PY2:
                            tokenize_words = tokenizer.tokenize(
                                line.decode('utf-8').strip())
                        else:
                            tokenize_words = tokenizer.tokenize(line.strip())
                        lower_words = [word.lower() for word in tokenize_words]
                        for word in lower_words:
                            train_list.append(word_index.get(word, 0))
                train_x.append(train_list)
    return train_x, train_y


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-trd', '--train_dir', type=str, nargs='?',
                        default='./data/aclImdb/train/',
                        help='train data directory')
    parser.add_argument('-ted', '--test_dir', type=str, nargs='?',
                        default='./data/aclImdb/test/',
                        help='test data directory')
    parser.add_argument('-wip', '--word_index_path', type=str, nargs='?',
                        default='./data/aclimdb_word_index.json',
                        help='aclimdb word index json')
    parser.add_argument('-onz', '--output_npz', type=str, nargs='?',
                        default='./data/aclimdb.npz',
                        help='output npz')
    args = parser.parse_args()
    train_dir = args.train_dir
    test_dir = args.test_dir
    word_index_path = args.word_index_path
    output_npz = args.output_npz
    word_index = get_word_index(word_index_path)
    train_x, train_y = build_data_index(train_dir, word_index)
    test_x, test_y = build_data_index(test_dir, word_index)
    np.savez(output_npz,
             x_train=np.asarray(train_x),
             y_train=np.asarray(train_y),
             x_test=np.asarray(test_x),
             y_test=np.asarray(test_y))
```

Running `python build_data_index.py` produces data/aclimdb.npz, which has the same structure as the official imdb.npz, so I won't expand on it here.
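The per-review conversion inside build_data_index.py boils down to a dictionary lookup with a default of 0 for unknown words. A minimal sketch (the four-word index below is hypothetical, and the real script tokenizes with MosesTokenizer rather than `str.split()`):

```python
# Hypothetical word index; out-of-vocabulary words map to 0,
# mirroring word_index.get(word, 0) in build_data_index.py.
word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}

def encode(text):
    # Lowercase, split, and look up each token, defaulting to 0.
    return [word_index.get(w, 0) for w in text.lower().split()]

print(encode("The movie was TERRIBLE"))  # [1, 2, 3, 0]
```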

At this point both data files are essentially ready. However, the official Keras imdb.py apparently does not accept a local file path, so here is aclimdb.py, a simplified version modeled on imdb.py that loads the two local files above:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: TextMiner (textminer@foxmail.com)
# Copyright 2018 @ AINLP
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np


def get_word_index(path='./data/aclimdb_word_index.json'):
    with open(path) as f:
        return json.load(f)


def load_data(path='./data/aclimdb.npz', num_words=None, skip_top=0,
              seed=113, start_char=1, oov_char=2, index_from=3):
    """A simplified version of the original imdb.py load_data function:
    https://github.com/keras-team/keras/blob/master/keras/datasets/imdb.py
    """
    with np.load(path) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']
    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]
    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]
    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])
    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]
    if not num_words:
        num_words = max([max(x) for x in xs])
    # Reserved ids: 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]
    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
    return (x_train, y_train), (x_test, y_test)
```
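The id-shifting logic in load_data is easy to trace by hand. The sketch below reimplements just that transformation for a single toy sequence, using the same defaults (start_char=1, oov_char=2, index_from=3); it is an illustration, not the function itself:

```python
# Every raw id moves up by index_from, a start marker is prepended,
# and any shifted id outside [skip_top, num_words) collapses to the OOV id,
# matching the 0 (padding), 1 (start), 2 (OOV) reservations in the script.
def shift_ids(seq, num_words, start_char=1, oov_char=2, index_from=3, skip_top=0):
    shifted = [start_char] + [w + index_from for w in seq]
    return [w if skip_top <= w < num_words else oov_char for w in shifted]

print(shift_ids([5, 9998, 40], num_words=10000))  # [1, 8, 2, 43]
```

Here 9998 shifts to 10001, which is at or above num_words and so becomes the OOV id 2.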

Now we can walk quickly through the workflow of section 3.4 of *Deep Learning with Python* on our own processed data. The test environment is macOS, Python 2.7, Keras 2.1.4, TensorFlow 1.6.0, CPU only; this model trains quickly even without a GPU. Run the following from the directory containing the scripts above:

```python
In [1]: import aclimdb
# Note: the script hard-codes the relative path to aclimdb.npz;
# if you run this from elsewhere, pass the path argument.
In [2]: (train_data, train_labels), (test_data, test_labels) = aclimdb.load_data(num_words=10000)
In [3]: train_data[0]
Out[3]:
[1,
 7799,
 1459,
 ...
 11,
 13,
 3320,
 2]
In [4]: train_labels[0]
Out[4]: 0
In [5]: max([max(sequence) for sequence in train_data])
Out[5]: 9999
In [6]: word_index = aclimdb.get_word_index()
In [8]: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
In [9]: decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
In [10]: decoded_review
Out[10]: u'? hi folks < br / > < br / > forget about that movie . john c. should be ashamed that he appears as executive producer in the ? bon ? has never been and will never be an actor and the fx are a joke . < br / > < br / > the first vampires was good ... and it was the only vampires . this thing here just wears the same name . < br / > < br / > just a waste of time thinks ... < br / > < br / > jake ?'
In [11]: import numpy as np
In [13]: def vectorize_sequences(sequences, dimension=10000):
    ...:     results = np.zeros((len(sequences), dimension))
    ...:     for i, sequence in enumerate(sequences):
    ...:         results[i, sequence] = 1
    ...:     return results
    ...:
In [14]: x_train = vectorize_sequences(train_data)
In [15]: x_test = vectorize_sequences(test_data)
In [16]: x_train[0]
Out[16]: array([0., 1., 1., ..., 0., 0., 0.])
In [17]: y_train = np.asarray(train_labels).astype('float32')
In [18]: y_test = np.asarray(test_labels).astype('float32')
In [19]: from keras import models
Using TensorFlow backend.
In [20]: from keras import layers
In [21]: model = models.Sequential()
In [22]: model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
In [23]: model.add(layers.Dense(16, activation='relu'))
In [24]: model.add(layers.Dense(1, activation='sigmoid'))
In [25]: model.compile(optimizer='rmsprop',
    ...:               loss='binary_crossentropy',
    ...:               metrics=['accuracy'])
In [26]: model.fit(x_train, y_train, epochs=4, batch_size=512)
Epoch 1/4
25000/25000 [==============================] - 3s 140us/step - loss: 0.4544 - acc: 0.8192
Epoch 2/4
25000/25000 [==============================] - 2s 93us/step - loss: 0.2632 - acc: 0.9077
Epoch 3/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.2053 - acc: 0.9244
Epoch 4/4
25000/25000 [==============================] - 2s 92us/step - loss: 0.1708 - acc: 0.9388
Out[26]: <keras.callbacks.History at 0x206cfdc10>
In [27]: results = model.evaluate(x_test, y_test)
25000/25000 [==============================] - 4s 145us/step
In [28]: results
Out[28]: [0.2953770682477951, 0.88304]
In [29]: model.predict(x_test)
Out[29]:
array([[9.9612302e-01],
       [9.5416462e-01],
       [1.5807265e-05],
       ...,
       [9.9868757e-01],
       [8.4713501e-01],
       [5.7828808e-01]], dtype=float32)
```
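The vectorize_sequences step in the session above turns each list of word ids into a fixed-length multi-hot vector; shrinking the dimension makes the effect easy to inspect:

```python
import numpy as np

# Same function as in the session, at dimension 8 so the result is visible:
# each sequence becomes one row with 1.0 at every id it contains
# (duplicate ids count once).
def vectorize_sequences(sequences, dimension=8):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

v = vectorize_sequences([[1, 3, 3], [0, 7]])
print(v[0])  # row 0 has ones at positions 1 and 3
```

This is why Out[16] above shows only 0s and 1s: word order and counts are discarded, keeping only presence or absence of each of the 10,000 ids.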

For further details, consult *Deep Learning with Python*; both the English original and the Chinese translation explain them clearly, so I won't elaborate here. Finally, you are welcome to follow our GitHub project AINLP (https://github.com/panyang/AINLP), which will host the articles and tutorials accompanying this series, and our WeChat public account AINLP; questions and feedback are welcome anytime.

Originally published 2018-11-19; shared from the AINLP WeChat public account via the Tencent Cloud self-media sharing program.
