Abstract: This article analyzes the QA pairs under the "Internet" topic on Zhihu to see what users of that topic currently care about. It walks through data crawling, question analysis, top-voted answer analysis, and keyword visualization; we hope the open-source code in this article is useful to you.
Development environment notes:
01
—
Data Acquisition
I. Crawling the Zhihu website
1. Simulating user login
(1) Zhihu login page: https://www.zhihu.com/signin?next=%2F
(2) Set the header info for the browser you use (note the browser version):
header_info = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    "Content-Length": "127",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "DNT": "1",
    "Host": "www.zhihu.com",
    "Origin": "http://www.zhihu.com",
    "Referer": "http://www.zhihu.com/people/xiaofeng-tong-xue/followers",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}
(3) Set the user login info:
data = {
    'account': 'your username',
    'password': 'your password',
    'rememberme': 'true',
}
(4) Login process:
First visit the login page, then post your login credentials:
def login():
    loginurl = 'https://www.zhihu.com/signin?next=%2F'
    global s
    s = requests.session()
    req = s.get(loginurl, headers=header_info)
    print(req)
    loginREQ = s.post(loginurl, headers=header_info, data=data)
    print(loginREQ)
2. Visit the QA page under the "Internet" topic at https://www.zhihu.com/topic/19550517/top-answers
(1) Inspect the page's HTML structure to locate each question's link and title, as follows:
(2) Parse the page with BeautifulSoup to get the question links, then visit each link to fetch the answer content:
response = s.get(music_url, headers=header_info)
soup = BeautifulSoup(response.content, 'html.parser')
questions = soup.findAll('a', attrs={'class': 'question_link'})
for question in questions:
    question_id = question.get('href')
    response = s.get("https://www.zhihu.com" + question_id, headers=header_info)
(3) On each QA page, inspect the HTML structure again and extract the content.
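The extraction step above can be sketched with Python's built-in html.parser, with no external dependencies. The class name RichContent-inner is the answer container used by the crawler in Part 3; the HTML fragment below is an invented stand-in for a real Zhihu answer page:

```python
from html.parser import HTMLParser

class AnswerExtractor(HTMLParser):
    """Collects the text nodes inside <div class="RichContent-inner">."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the target container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1
        elif tag == 'div' and ('class', 'RichContent-inner') in attrs:
            self.depth = 1  # entered the answer container

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

# invented page fragment standing in for a real answer page
html = '<div class="RichContent-inner"><p>answer text</p></div>'
p = AnswerExtractor()
p.feed(html)
print(''.join(p.chunks))  # -> answer text
```

The crawler in Part 3 does the same job with BeautifulSoup's find_all, which is more forgiving of the messy HTML a real page returns.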
The scraped text is saved in JSON format, as follows:
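Each saved line is one JSON object holding the question title under "Q" and the top answer under "A", written by json.dumps in the crawler below; the question and answer strings here are invented placeholders:

```python
import json

# invented placeholder record; real records hold a question title and its top answer
record = {"Q": "共享单车的盈利模式是什么?", "A": "示例答案文本"}

# ensure_ascii=False keeps Chinese characters readable instead of \u escapes
line = json.dumps(record, ensure_ascii=False)
print(line)

# each line round-trips cleanly with json.loads when re-read for visualization
round_trip = json.loads(line)
```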
02
—
Data Visualization
The previous section covered the crawling process (see Part 3 for the complete code); this section shows how to visualize the collected data. The pipeline is word segmentation, stopword removal, word-frequency counting, and visualization, applied separately to the questions and the answers.
1. Question analysis: the questions mainly revolve around keywords such as 王者荣耀 (Honor of Kings), spotting business opportunities, gaming, startups, Baidu, and Xiaomi phones.
2. Answer analysis: for each question, only the most-upvoted answer is analyzed and visualized in the same way. Unlike the questions, the answers feature more keywords such as bike sharing, data, games, and phones.
3. The implementation has three parts: jieba word segmentation, stopword removal, and a wordcloud word-cloud display (see Part 3 for the complete code).
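The segmentation-plus-counting step can be sketched as follows. To stay dependency-free, this sketch starts from a token list where the real code would use jieba.cut on raw answer text, and the stopword set is a made-up stand-in for the stopword file:

```python
from collections import Counter

# made-up stopword set; the real one is loaded from a stopword file
stopwords = {'的', '了', '是', 'a', 'the'}

def word_freq(tokens, stopwords):
    """Count token frequencies after dropping stopwords and blanks.
    (jieba.cut would produce the tokens for real Chinese text.)"""
    return Counter(t for t in tokens if t.strip() and t not in stopwords)

# invented token stream imitating jieba.cut output on an answer
tokens = ['共享单车', '的', '数据', '数据', '游戏']
freq = word_freq(tokens, stopwords)
print(freq.most_common(2))  # [('数据', 2), ('共享单车', 1)]
```

The complete code in Part 3 skips the explicit Counter and lets WordCloud count frequencies internally via generate_from_text, but the filtering logic is the same.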
03
—
Complete Code Walkthrough
1. Data crawling
import json
import os
import time

import requests
from bs4 import BeautifulSoup
header_info = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    "Content-Length": "127",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "DNT": "1",
    "Host": "www.zhihu.com",
    "Origin": "http://www.zhihu.com",
    "Referer": "http://www.zhihu.com/people/xiaofeng-tong-xue/followers",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}
def login():
    loginurl = 'https://www.zhihu.com/signin?next=%2F'
    data = {
        'account': 'your username',
        'password': 'your password',
        'rememberme': 'true',
    }
    global s
    s = requests.session()
    req = s.get(loginurl, headers=header_info)
    print(req)
    loginREQ = s.post(loginurl, headers=header_info, data=data)
    print(loginREQ)
def ContentParser(content):
    dicReturn = {}
    soup = BeautifulSoup(content, "html.parser")
    title = soup.find_all("title")[0].get_text()
    title = title[:-5]  # strip the trailing " - 知乎" site suffix
    print(title)
    AnswerTop = soup.find_all("div", class_="RichContent-inner")[0].get_text()
    dicReturn["Q"] = title
    dicReturn["A"] = AnswerTop
    return dicReturn
def get_topic_music(music_url, fw):
    response = s.get(music_url, headers=header_info)
    soup = BeautifulSoup(response.content, 'html.parser')
    questions = soup.findAll('a', attrs={'class': 'question_link'})
    for question in questions:
        question_id = question.get('href')
        response = s.get("https://www.zhihu.com" + question_id, headers=header_info)
        try:
            dicReturn = ContentParser(response.content)
        except Exception:
            continue
        dicStr = json.dumps(dicReturn, ensure_ascii=False)
        fw.write(dicStr + "\n")
        time.sleep(0.5)
if __name__ == '__main__':
    login()
    path = "your output directory"
    if not os.path.isdir(path):
        os.mkdir(path)
    file_name = path + "QA_zhihu.txt"
    fw = open(file_name, "a+", encoding="utf-8")
    Url = "https://www.zhihu.com/topic/19550517/top-answers?page="
    for i in range(1, 50):
        currentURL = Url + str(i)
        get_topic_music(currentURL, fw)
        time.sleep(2.5)
    fw.close()
2. Data visualization
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import jieba
import json

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# read the scraped data, one JSON record per line, and segment the answers
f = open("your scraped data file", "r", encoding="utf-8")
text = ""
for line in f.readlines():
    line = line.strip("\n")
    line = json.loads(line)["A"]  # parse the JSON record rather than eval it
    text += ' '.join(jieba.cut(line))
    text += ' '
f.close()

# load the stopword list
fstop = open("your stopword file", "r", encoding="utf-8")
stopws = []
for line in fstop:
    line = line.strip("\n").strip()
    stopws.append(line)
stopwset = set(stopws)
fstop.close()

background_Image = plt.imread('your background image')
wc = WordCloud(
    background_color='white',
    mask=background_Image,
    font_path=r'C:\Windows\Fonts\STZHONGS.TTF',
    max_words=200,
    stopwords=stopwset,
    max_font_size=150,
    random_state=30
)
wc.generate_from_text(text)
print('text loaded')
img_colors = ImageColorGenerator(background_Image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
plt.savefig("pro.png", dpi=1600)
print('display success!')