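The author has refined the script that extracts chat pairs from the conversation format of Weibo content; the accuracy of the extracted dialogues is around 99% (because of copyright concerns, we will open source it later);
The author has submitted a categorized list of nearly 8 million user ids, and crawling is now under way across the whole site (because of Weibo's official limitations we can't distribute the full list, only a sample; join our contributor team and we will give every contributor a single, unique part of the id_file);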
Clone this repo: git clone https://github.com/jinfagang/weibo_terminater.git;
Install PhantomJS so that weibo_terminator can get cookies automatically: download it from http://phantomjs.org/download.html, then set the path where you unzipped it in settings/config.py, following the instructions there;
Set up your accounts in settings/accounts.py; multiple accounts are now supported, and terminator will dispatch them automatically;
Run python3 main.py -i realangelababy to scrape a single user; set settings/id_file to scrape multiple users;
If you want to contribute, contact the project administrator via WeChat (jintianiloveu); the administrator will hand you an id_file that is unique within the project;
All data will be saved into ./weibo_detail, stored separately for each id.
Send the collected data to the project administrator.
When all the work is finished, the administrator will distribute all the data as one single file to all contributors. Use it under the WT & TIANEYE COPYRIGHT.
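
The id_file and per-id storage mentioned in the steps above can be pictured with a minimal sketch. This is not the project's actual code: the helper names `load_ids` and `save_user_data`, the one-id-per-line file format, and the JSON output layout are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_ids(id_file):
    """Read user ids from a text file, one id per line (assumed format).

    Blank lines and lines starting with '#' are skipped.
    """
    ids = []
    for line in Path(id_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def save_user_data(user_id, records, out_dir="./weibo_detail"):
    """Write one user's records to a separate file under out_dir.

    The '<user_id>.json' naming is hypothetical, not the project's layout.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{user_id}.json"
    target.write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target
```

Keeping one file per id means a contributor can hand over partial results at any time without merging anything first.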
Research & Discuss Group
We have founded several groups for our project:
Tutorial
This is the part that was lost in the first commit, the usage guide:
That's all, simple and easy.
About cookies
Cookies may still get banned if our scraper keeps fetching information from Weibo, which is exactly why this job has to be done with many people's effort: no one can build such a big corpus single-handedly. If your cookies expire or get banned, we strongly recommend switching to another Weibo account (a friend's or anyone else's) and continuing to scrape. One thing to keep in mind is that weibo_terminator remembers its scraping progress and will resume from where it stopped last time. :)
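
How weibo_terminator stores its progress is not described here, so the following is only a sketch of one way such resuming could work. The `Progress` class, the `progress.json` file name, and the idea of tracking finished user ids are all assumptions, not the project's implementation.

```python
import json
from pathlib import Path


class Progress:
    """Hypothetical checkpoint that records which user ids are finished."""

    def __init__(self, path="progress.json"):
        self.path = Path(path)
        if self.path.exists():
            # Reload the ids finished in an earlier run.
            self.done = set(json.loads(self.path.read_text(encoding="utf-8")))
        else:
            self.done = set()

    def mark_done(self, user_id):
        """Record a finished id and persist the checkpoint immediately."""
        self.done.add(user_id)
        self.path.write_text(json.dumps(sorted(self.done)), encoding="utf-8")

    def pending(self, all_ids):
        """Return the ids that still need scraping, in their original order."""
        return [uid for uid in all_ids if uid not in self.done]
```

Because the checkpoint is written after every finished id, a run that stops when cookies are banned can be restarted with a different account and will skip the ids already completed.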