
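Dynamic Web Scraping with selenium in Python

Article source: 企鹅号 - 机器学习和数学

Hello, everyone! It has been a long time since my last post. A lot has happened in between, and my outlook has changed a great deal. Watching the follower count grow every day, I really don't want to give up, so from now on I will try to post about once a week and share what I am learning. I hope you will keep supporting me, and I will do my best!

selenium is a front-end automated testing tool, and it is generally not recommended as a scraping tool. So why am I showing how to scrape with it? Because it does work for scraping, the approach is intuitive, and the principle is fairly clear.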

1. Installation

Installing selenium is straightforward: it can be installed directly with pip. Open cmd and enter

pip install selenium

and that's it.

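2. Installing chromedriver

chromedriver is the driver program for the Google Chrome browser. I normally use Chrome, so only chromedriver is covered here.

Download link:

Note that the chromedriver version must correspond to the version of Chrome you have installed. To check the Chrome version, open the menu in the top-right corner of the browser and go to Help > About Google Chrome. The specific mapping rules are as follows:

After installing, add the driver's install directory to the system Path. If you don't, the program will raise an error at run time telling you the driver is not on the Path.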
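Before launching a browser, it can help to confirm that the driver really is reachable through Path. The following is a minimal sketch using only the standard library; the helper name find_driver is made up for illustration and is not part of selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver is not on PATH; add its directory to PATH first")
    else:
        print("chromedriver found at", path)
```

If this reports that the driver is missing, fix the Path setting before running the scraper.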

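3. Start scraping

The page to scrape today is https://www.upbit.com/service_center/notice. Clicking the pagination button leaves the URL unchanged, but inspecting the requests with F12 shows how the requested address changes:

https://www.upbit.com/service_center/notice?id=1

The part that changes is the id at the end: 1, 2, 3, and so on.

Before scraping with selenium, the following needs to be set up: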
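Since only the id changes, the list of addresses to visit can be generated rather than clicked through. A small sketch, assuming the ids run upward from 1 as described; the names BASE and notice_url are illustrative only.

```python
BASE = "https://www.upbit.com/service_center/notice"


def notice_url(notice_id):
    """Build the address of a single notice from its id."""
    return "{}?id={}".format(BASE, notice_id)


# The first three addresses: id=1, id=2, id=3
urls = [notice_url(i) for i in range(1, 4)]
for u in urls:
    print(u)
```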

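import time
from selenium import webdriver

# Configure options for the Chrome browser
opt = webdriver.ChromeOptions()
# Run Chrome headless, i.e. scrape without a visible browser window
opt.add_argument('--headless')
# Use Chrome as the browser, with the options set above
browser = webdriver.Chrome(options=opt)
save = []
home = 'https://www.upbit.com/home'
# Once the browser object exists, get() sends a URL to the browser
# and loads that page
browser.get(home)
# Give the page time to finish loading
time.sleep(15)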
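The fixed time.sleep(15) is simple, but it always costs the full 15 seconds. A common alternative is to poll until some condition holds and stop as soon as it does; selenium ships WebDriverWait for this purpose. The sketch below is not selenium's implementation but a hand-rolled, selenium-free version of the same polling idea, so it can be run anywhere. The name wait_until and its parameters are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=15.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)
```

With a browser object in hand, the condition could be a function that checks whether the element you need is already present on the page.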

Next is how to locate HTML elements. In selenium, the methods for locating an element are

find_element_by_id(self, id_)
find_element_by_name(self, name)
find_element_by_class_name(self, name)
find_element_by_tag_name(self, name)
find_element_by_link_text(self, link_text)
find_element_by_partial_link_text(self, link_text)
find_element_by_xpath(self, xpath)
find_element_by_css_selector(self, css_selector)

The id, name, and other values can be found through the browser. The goal of locating an element is to get the information we want, then parse and save it; an element's text is available through its text attribute.
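One caveat for newer installs: the find_element_by_* helpers were removed in selenium 4.3, where the equivalent call is find_element(By.CLASS_NAME, value) and so on, a form that selenium 3 also accepts. The sketch below shows how each helper maps to a locator strategy. To stay runnable without selenium installed, it writes the strategies as plain strings, which are the values of the corresponding By constants (for example By.CLASS_NAME is "class name"). The helper locate is hypothetical and exists only to illustrate the mapping.

```python
# Old helper name -> locator strategy string (the value of the matching
# selenium.webdriver.common.by.By constant, e.g. By.ID == "id").
LOCATORS = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_class_name": "class name",
    "find_element_by_tag_name": "tag name",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_xpath": "xpath",
    "find_element_by_css_selector": "css selector",
}


def locate(driver, old_method, value):
    """Hypothetical helper: look up the strategy for an old helper name
    and call the generic driver.find_element(strategy, value)."""
    return driver.find_element(LOCATORS[old_method], value)
```

In real code you would normally import By and call browser.find_element(By.CLASS_NAME, 'txtB') directly rather than go through a lookup table.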

The complete scraper code is posted below for reference:

import time
from collections import OrderedDict

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from tqdm import trange


def stringpro(inputs):
    # Flatten the text to a single line and trim surrounding whitespace
    inputs = str(inputs)
    return inputs.replace("\n", "").replace("\t", "").strip()


opt = webdriver.ChromeOptions()
opt.add_argument('--headless')
browser = webdriver.Chrome(options=opt)
save = []
home = 'https://www.upbit.com/home'
browser.get(home)
time.sleep(15)

for page in trange(500):
    try:
        rows = OrderedDict()
        url = "https://www.upbit.com/service_center/notice?id={}".format(page)
        browser.get(url)
        # find_element(By.X, value) works in both selenium 3 and selenium 4
        content = browser.find_element(By.CLASS_NAME, 'txtB').text
        title_class = browser.find_element(By.CLASS_NAME, 'titB')
        title = title_class.find_element(By.TAG_NAME, 'strong').text
        times_str = title_class.find_element(By.TAG_NAME, 'span').text
        # The span text has two parts separated by '|'
        times = times_str.split('|')[0].split(" ")[1:]
        num = times_str.split("|")[1].split(" ")[1]
        rows['title'] = title
        rows['times'] = " ".join(times)
        rows['num'] = num
        rows['content'] = stringpro(content)
        save.append(rows)
        print("{},{}".format(page, rows))
    except Exception:
        # Skip ids whose page is missing or cannot be parsed
        continue

# Close the browser once all pages have been visited
browser.quit()

df = pd.DataFrame(save)
df.to_csv("./datasets/www_upbit_com.csv", index=False)
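The string handling in the scraper is easier to follow when tried on its own, away from the browser. The sketch below applies the same split logic to a made-up sample string. The real format of the span text on the site is not shown in this article, so the sample and the helper name parse_meta are assumptions for illustration only.

```python
def stringpro(inputs):
    """Flatten text to a single line and trim surrounding whitespace."""
    return str(inputs).replace("\n", "").replace("\t", "").strip()


def parse_meta(times_str):
    """Split 'label date time|label number' into (times, num)."""
    parts = times_str.split("|")
    # Drop the leading label and keep the rest of the first part
    times = " ".join(parts[0].split(" ")[1:])
    # Take the second space-separated field of the second part
    num = parts[1].split(" ")[1]
    return times, num


# Made-up sample; the real text on the site may look different
sample = "Date 2018.08.10 14:30|Views 1024"
print(parse_meta(sample))
print(stringpro("  first line\n\tsecond line  "))
```

If the real text has spaces around the '|' separator, the indices would need adjusting, which is one reason the scraper wraps this step in try/except.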

If you have any questions, feel free to get in touch~

Published: 2018-08-11 20:38:07
Original link: https://kuaibao.qq.com/s/20180811G1DJUP00?refer=cp_1026
