Author | godweiyang
Published by | WeChat official account: 算法码上来 (ID: GodNLP)
The Wang Leehom scandal has been all over the news these past couple of days, and even though I don't usually follow celebrity gossip, I've heard plenty of rumors. Opinions online fall roughly into two camps: those who bash him outright, and those who argue that character and art are separate things. I wanted to see what people actually think of him, so I wrote a crawler to scrape the danmaku (bullet comments) from Bilibili videos.
I've open-sourced the crawler on GitHub in two versions; the repository is below, and the full code is also at the end of this article: https://github.com/godweiyang/bilibili-danmu
As a demo, I'll use a popular Bilibili video analyzing the Wang Leehom incident; the video is at: https://www.bilibili.com/video/BV1tq4y1B7Jz
The video ID is therefore BV1tq4y1B7Jz. Here we'll use the newer version of the crawler, danmu2.py.
Before running it, you need to obtain a cookie. Open the video page, click the small lock icon on the left of the address bar, click Cookie, and expand bilibili.com -> Cookie -> SESSDATA to copy the SESSDATA value. Then press F12, open the Console, and enter document.cookie to get the rest of the cookie. Concatenate the SESSDATA value with that cookie to form the complete cookie, and fill it into the code.
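The final concatenation is just string joining; a minimal sketch with placeholder values (these are made-up stand-ins, not real credentials):

```python
# Placeholder values standing in for the real SESSDATA and document.cookie
sessdata = "your_SESSDATA_value"          # from the address-bar lock icon
console_cookie = "buvid3=abc; b_nut=123"  # from document.cookie in the Console

# danmu2.py appends SESSDATA to the console cookie like this
full_cookie = console_cookie + f";SESSDATA={sessdata}"
print(full_cookie)
```

The resulting string is what goes into the `cookie` variable (and then the request headers) in danmu2.py.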
Run python3 danmu2.py to try it out: first enter the video ID BV1tq4y1B7Jz, then enter the date range for the danmaku, and the crawl completes:
Finally, we can analyze the danmaku however we like. Here I used the word-cloud tool I introduced before; that code is also open source: https://github.com/godweiyang/wordcloud
As you can see, this crop of netizens is actually pretty friendly. The names of some people from the rumors also show up: 李云迪 (Li Yundi), 范玮琪 (Christine Fan), 徐若瑄 (Vivian Hsu), and so on.
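Independent of the rendering (which the word-cloud repo above handles), the core of this kind of analysis is just counting comment frequencies. A stdlib-only sketch with hypothetical sample lines standing in for danmu.txt:

```python
from collections import Counter

# Hypothetical sample comments standing in for the contents of danmu.txt
danmu = ["吃瓜", "人品和艺术无关", "吃瓜", "李云迪", "吃瓜"]

counts = Counter(danmu)
print(counts.most_common(1))  # the single most frequent comment
```

A real word cloud would typically also segment each comment into words (e.g. with jieba) before counting, rather than treating whole comments as units.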
Let's also look at how netizens reacted after EDG won the championship; the video is linked below: https://www.bilibili.com/video/BV1EP4y1j7kV
danmu.py is the simple version. It needs no cookie and calls Bilibili's old API, but it can only fetch a small fraction of the danmaku.
import re

import requests


def get_info(vid):
    # Fetch basic video metadata: title, danmaku count, and the cid of each part
    url = f"https://api.bilibili.com/x/web-interface/view/detail?bvid={vid}"
    response = requests.get(url)
    response.encoding = "utf-8"
    data = response.json()
    info = {}
    info["title"] = data["data"]["View"]["title"]
    info["total_danmaku"] = data["data"]["View"]["stat"]["danmaku"]
    info["num_videos"] = data["data"]["View"]["videos"]
    info["cid"] = [dic["cid"] for dic in data["data"]["View"]["pages"]]
    if info["num_videos"] > 1:
        info["subtitles"] = [dic["part"] for dic in data["data"]["View"]["pages"]]
    for k, v in info.items():
        print(k + ":", v)
    return info


def get_danmu(info):
    # The old XML API returns danmaku as <d p="...">text</d> elements
    all_dms = []
    for i, cid in enumerate(info["cid"]):
        url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
        response = requests.get(url)
        response.encoding = "utf-8"
        data = re.findall('<d p="(.*?)">(.*?)</d>', response.text)
        dms = [d[1] for d in data]
        if info["num_videos"] > 1:
            print("cid:", cid, "danmaku:", len(dms), "subtitle:", info["subtitles"][i])
        all_dms += dms
    print(f"Fetched {len(all_dms)} danmaku in total!")
    return all_dms


if __name__ == "__main__":
    vid = input("Enter video ID: ")
    info = get_info(vid)
    danmu = get_danmu(info)
    with open("danmu.txt", "w", encoding="utf-8") as fout:
        for dm in danmu:
            fout.write(dm + "\n")
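The old API's XML format is what makes danmu.py so short: every danmaku is one `<d p="attributes">text</d>` element, so a single regex with two capture groups pulls out the text. A tiny illustration on a hand-made snippet (not real API output; the attribute values are made up):

```python
import re

# Hand-made snippet in the old API's XML format (attribute values are fake)
xml = (
    '<d p="1.2,1,25,16777215">first comment</d>'
    '<d p="3.4,1,25,16777215">second comment</d>'
)

# Two capture groups: findall returns (attributes, text) tuples
data = re.findall('<d p="(.*?)">(.*?)</d>', xml)
dms = [d[1] for d in data]  # keep only the comment text
print(dms)  # → ['first comment', 'second comment']
```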
danmu2.py is the full version. You have to fill in a cookie by hand; it calls Bilibili's new API and can fetch all the danmaku from any range of dates.
import re
import time

import pandas as pd
import requests
from tqdm import trange

# On the video page, click the address-bar lock icon, then
# Cookie -> bilibili.com -> Cookie -> SESSDATA to get this value
SESSDATA = ""
# On the video page, press F12, open Console, and enter document.cookie
cookie = ""
cookie += f";SESSDATA={SESSDATA}"

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "cookie": cookie,
}


def get_info(vid):
    # Fetch basic video metadata: title, danmaku count, and the cid of each part
    url = f"https://api.bilibili.com/x/web-interface/view/detail?bvid={vid}"
    response = requests.get(url, headers=headers)
    response.encoding = "utf-8"
    data = response.json()
    info = {}
    info["title"] = data["data"]["View"]["title"]
    info["total_danmaku"] = data["data"]["View"]["stat"]["danmaku"]
    info["num_videos"] = data["data"]["View"]["videos"]
    info["cid"] = [dic["cid"] for dic in data["data"]["View"]["pages"]]
    if info["num_videos"] > 1:
        info["subtitles"] = [dic["part"] for dic in data["data"]["View"]["pages"]]
    for k, v in info.items():
        print(k + ":", v)
    return info


def get_danmu(info, start, end):
    # The history API returns at most one day's danmaku per call,
    # so we make one request per day in [start, end]
    date_list = list(pd.date_range(start, end).strftime("%Y-%m-%d"))
    all_dms = []
    for i, cid in enumerate(info["cid"]):
        dms = []
        for j in trange(len(date_list)):
            url = f"https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={cid}&date={date_list[j]}"
            response = requests.get(url, headers=headers)
            response.encoding = "utf-8"
            # The response is a protobuf payload; crudely grab each comment
            # between ':' and '@' and drop the leading length byte
            data = re.findall(r"[:](.*?)[@]", response.text)
            dms += [dm[1:] for dm in data]
            time.sleep(3)  # throttle requests to avoid getting blocked
        if info["num_videos"] > 1:
            print("cid:", cid, "danmaku:", len(dms), "subtitle:", info["subtitles"][i])
        all_dms += dms
    print(f"Fetched {len(all_dms)} danmaku in total!")
    return all_dms


if __name__ == "__main__":
    vid = input("Enter video ID: ")
    info = get_info(vid)
    start = input("Enter start date (YYYY-MM-DD): ")
    end = input("Enter end date (YYYY-MM-DD): ")
    danmu = get_danmu(info, start, end)
    with open("danmu.txt", "w", encoding="utf-8") as fout:
        for dm in danmu:
            fout.write(dm + "\n")
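The trickiest part of danmu2.py is its handling of the response: the history endpoint returns a binary protobuf payload, and instead of pulling in a protobuf parser, the script grabs every run of text between ':' and '@' and strips the leading length byte. A synthetic payload (made up for illustration, not real API output) shows the idea:

```python
import re

# Synthetic stand-in for the protobuf text: each comment is preceded by a
# ':' and a one-byte length marker, and followed by other fields
fake_payload = ":\x05hello@x:\x05world@y"

data = re.findall(r"[:](.*?)[@]", fake_payload)
dms = [dm[1:] for dm in data]  # drop the leading length byte
print(dms)  # → ['hello', 'world']
```

This regex trick is fragile (a comment containing ':' or '@' would confuse it); a proper protobuf decoder would be more robust, but the hack keeps the script dependency-free.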
I'm godweiyang, an algorithm engineer at ByteDance. I ranked first in my major for both my bachelor's and master's degrees in computer science at a lower-tier 985 university, landed SSP offers from three major tech companies during campus recruiting, and specialize in algorithms, machine translation, and model acceleration.