Sentiment Analysis of Apple Phone Reviews (with Python Source Code and Review Data)

机器学习AI算法工程
Published 2018-03-15 13:29:19

First, scrape the review data from the web page. Each page holds ten reviews, and each page is saved as one txt file.

Data link
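The scraping code itself is not included in the post. Below is a minimal sketch of how one page of ten reviews might be fetched and written to a txt file; the review URL, its page parameter, and the "review-text" CSS class are hypothetical placeholders of mine, not the actual page structure.

[python]

# -*- coding: utf-8 -*-
# Minimal scraping sketch. The URL, its paging parameter and the
# "review-text" CSS class are assumptions; adapt them to the real page.
import requests
from bs4 import BeautifulSoup

base_url = "http://example.com/iphone/reviews?page=%d"   # hypothetical URL

for page in range(1, 40):
    html = requests.get(base_url % page, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # each page is assumed to contain ten review elements
    comments = [div.get_text().strip()
                for div in soup.find_all("div", class_="review-text")]
    out = open("C:/Users/user/Desktop/comments/page_%d.txt" % page, "w")
    for c in comments:
        out.write(c.encode("utf-8") + "\n")
    out.close()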

The approach below uses existing sentiment lexicons.

Prepare four dictionaries: stopwords, negation words, degree adverbs, and sentiment words (download links are also provided):

[python]

f = open(r'C:/Users/user/Desktop/stopword.dic')  # stopwords
stopwords = f.readlines()
stopwords = [i.replace("\n", "").decode("utf-8") for i in stopwords]

from collections import defaultdict

# (1) sentiment words
f1 = open(r"C:\Users\user\Desktop\BosonNLP_sentiment_score.txt")
senList = f1.readlines()
senDict = defaultdict()
for s in senList:
    s = s.decode("utf-8").replace("\n", "")
    senDict[s.split(' ')[0]] = float(s.split(' ')[1])

# (2) negation words
f2 = open(r"C:\Users\user\Desktop\notDict.txt")
notList = f2.readlines()
notList = [x.decode("utf-8").replace("\n", "") for x in notList if x != '']

# (3) degree adverbs
f3 = open(r"C:\Users\user\Desktop\degreeDict.txt")
degreeList = f3.readlines()
degreeDict = defaultdict()
for d in degreeList:
    d = d.decode("utf-8")
    degreeDict[d.split(',')[0]] = float(d.split(',')[1])

Import the data and segment it into words:

[python]

import jieba
import os

def sent2word(sentence):
    """
    Segment a sentence into words and delete stopwords.
    """
    segList = jieba.cut(sentence)
    segResult = []
    for w in segList:
        segResult.append(w)
    newSent = []
    for word in segResult:
        if word in stopwords:
            # print "stopword: %s" % word
            continue
        else:
            newSent.append(word)
    return newSent

path = u"C:/Users/user/Desktop/comments/"
listdir = os.listdir(path)
t = []
for i in listdir:
    f = open(path + i).readlines()
    for j in f:
        t.append(sent2word(j))

Now compute a score for each review. Note the weaknesses of this scheme: first, a degree adverb or negation word only modifies the sentiment word that follows it; second, it cannot tell when a nominally negative word is actually used in a positive sense; third, longer sentences are more likely to get higher scores, so the score should probably be divided by the total number of words (a sketch of that normalization follows the scoring function below).

[python]

def class_score(word_lists):
    # Mark each relevant token: 1 = sentiment word, 2 = negation word, 3 = degree adverb
    id = []
    for i in word_lists:
        if i in senDict.keys():
            id.append(1)
        elif i in notList:
            id.append(2)
        elif i in degreeDict.keys():
            id.append(3)
    # Keep the matching tokens themselves, in the same order
    word_nake = []
    for i in word_lists:
        if i in senDict.keys():
            word_nake.append(i)
        elif i in notList:
            word_nake.append(i)
        elif i in degreeDict.keys():
            word_nake.append(i)
    score = 0
    w = 1
    score0 = 0
    for i in range(len(id)):
        if id[i] == 1:      # sentiment word: apply the accumulated weight, then reset it
            score0 = w * senDict[word_nake[i]]
            w = 1
        elif id[i] == 2:    # negation word: flip the weight for the next sentiment word
            w = -1
        elif id[i] == 3:    # degree adverb: scale the weight for the next sentiment word
            w = w * degreeDict[word_nake[i]]
        score = score + score0
        score0 = 0
    return score
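As mentioned above, longer reviews tend to accumulate larger absolute scores. A minimal sketch of the suggested fix, dividing by the number of remaining tokens, is shown here; the helper name class_score_norm is my own, not part of the original code.

[python]

def class_score_norm(word_lists):
    # Length-normalized variant: divide the raw lexicon score by the number
    # of tokens left after stopword removal, so long reviews are not favoured.
    if len(word_lists) == 0:
        return 0.0
    return class_score(word_lists) / float(len(word_lists))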

[python]

import xlwt

wb = xlwt.Workbook()
sheet = wb.add_sheet('score')
num = 390
for i in t[389:]:
    print "Review", num, "score:", class_score(i[:-1])
    sheet.write(num - 1, 0, class_score(i[:-1]))
    num = num + 1
wb.save(r'C:/Users/user/Desktop/result.xls')  # xlwt writes the legacy .xls format

After sorting the scores and plotting them, positive scores clearly outnumber negative ones, which matches the ratings on the original page. However, about half of the 1-star reviews received a positive score, and about a quarter of the 5-star reviews received a negative score. The lexicon-based approach depends heavily on the quality of the dictionaries, and the weaknesses listed above can also bias the scores, so the next step is to try word2vec.
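A minimal sketch of that comparison, assuming the per-review lexicon scores have been collected into a list scores that lines up one-to-one with the star ratings in stars (loaded the same way as in the word2vec section below):

[python]

# Compare the sign of each lexicon score with its star rating.
# Assumes `scores` and `stars` are aligned lists of equal length.
one_star = [s for s, r in zip(scores, stars) if r == 1]
five_star = [s for s, r in zip(scores, stars) if r == 5]
pos_share = sum(1 for s in one_star if s > 0) / float(len(one_star))
neg_share = sum(1 for s in five_star if s < 0) / float(len(five_star))
print "share of 1-star reviews scored positive:", pos_share
print "share of 5-star reviews scored negative:", neg_share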

The word vectors are trained as follows:

[python]

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("corpus.csv")  # load the corpus
model = word2vec.Word2Vec(sentences, size=400)  # train the word2vec model (400-dimensional vectors)
# save the model so it can be reused
model.save("corpus.model")
# corresponding way to load it back:
# model = word2vec.Word2Vec.load("corpus.model")

# load the word2vec model, then export and re-import it in binary format
model = word2vec.Word2Vec.load("corpus.model")
model.save_word2vec_format("corpus.model.bin", binary=True)
model = word2vec.Word2Vec.load_word2vec_format("corpus.model.bin", binary=True)

Load the star ratings:

[python]

stars = open(r"C:\Users\user\Desktop\stars\stars.txt").readlines()
stars = [int(i.split(".")[0]) for i in stars]
# collapse the five star levels into three classes
y = []
for i in stars:
    if i == 1 or i == 2:
        y.append(-1)
    elif i == 3:
        y.append(0)
    elif i == 4 or i == 5:
        y.append(1)

Convert the reviews into sentence vectors; two of them fail to convert and are removed (see the sketch after the code below for how to locate them):

[python]

import numpy as np
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

def getWordVecs(wordList):
    vecs = []
    for word in wordList:
        try:
            vecs.append(model[word])
        except KeyError:
            # the word is not in the vocabulary of the trained model
            continue
    return np.array(vecs, dtype='float')

def buildVecs(list):
    posInput = []
    for line in list:
        resultList = getWordVecs(line)
        # each sentence is represented by the mean of its word vectors;
        # sentences with no in-vocabulary words are skipped
        if len(resultList) != 0:
            resultArray = sum(np.array(resultList)) / len(resultList)
            posInput.append(resultArray)
    return posInput

X = np.array(buildVecs(t))
# reviews 327 and 408 failed to convert, so drop the matching labels
del(y[326])
del(y[407])
y = np.array(y)
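How the two failed reviews (and the "failed words" discussed later) were identified is not shown above; a small sketch like the following, my own addition reusing getWordVecs and the same KeyError check, would locate them:

[python]

# Reviews that yield no word vectors at all -- their labels must be dropped.
failed = [idx for idx, line in enumerate(t) if len(getWordVecs(line)) == 0]
print failed    # 0-based indices, e.g. [326, 407] for reviews 327 and 408

# Tokens the trained word2vec model has no vector for ("failed words").
missing = set()
for line in t:
    for word in line:
        try:
            model[word]
        except KeyError:
            missing.add(word)
for w in list(missing)[:20]:    # show a small sample
    print w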

Reduce the dimensionality with PCA and classify with an SVM:

[python]

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Plot the PCA spectrum
pca = PCA(n_components=400)
pca.fit(X)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# keep the first 100 principal components
X_reduced = PCA(n_components=100).fit_transform(X)

from sklearn.cross_validation import train_test_split
X_reduced_train, X_reduced_test, y_reduced_train, y_reduced_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

from sklearn.svm import SVC
from sklearn import metrics  # for the accuracy score
clf = SVC(C=2, probability=True)
clf.fit(X_reduced_train, y_reduced_train)
pred = clf.predict(X_reduced_test)
scores = []
scores.append(metrics.accuracy_score(pred, y_reduced_test))
print scores

The accuracy after dimensionality reduction is 0.83, roughly the same as the 0.823 obtained with an MLP neural network; the MLP code is below. For word2vec, the result depends on how much vocabulary the corpus covers: printing some of the failed words (see the sketch after the vector-building code above) shows they were simply not found in the corpus, so the corresponding sentence vectors lose some information.

[python]

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(512, input_dim=400, init='uniform', activation='tanh'))
model.add(Dropout(0.7))
# Dropout randomly zeroes part of a layer's outputs during training to reduce overfitting.
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X_reduced_train, y_reduced_train, nb_epoch=20, batch_size=16)
score = model.evaluate(X_reduced_test, y_reduced_test, batch_size=16)
print('Test accuracy: ', score[1])

Original article: http://blog.csdn.net/Jemila/article/details/62887907?locationNum=7&fps=1
