What you need in order to fetch data:
Python
the HTTP protocol
HTML
requests (for sending HTTP requests) and beautifulsoup (for parsing the results)
Installing requests (it is usually already installed; to check, run: import requests):
1.pip install requests
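2.git clone https://github.com/kennethreitz/requests.git (then run pip install . inside the cloned directory)
3.curl -OL https://github.com/requests/requests/tarball/master (unpack the tarball, then run pip install . inside it)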
Introduction to requests:
https://2.python-requests.org//zh_CN/latest/user/quickstart.html
Installing beautifulsoup (to verify, run: from bs4 import BeautifulSoup):
1.Download page: https://pypi.org/project/beautifulsoup4/#files
2.pip install bs4
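Both install checks above can be run in one go. The following is a minimal sketch that only uses the standard library, so it works whether or not the packages are present; check_installed is a hypothetical helper name, not part of requests or beautifulsoup.

```python
import importlib.util


def check_installed(module_names):
    """Map each module name to True if it can be imported, else False.

    check_installed is a hypothetical helper written for this post.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# "bs4" is the import name of the beautifulsoup4 package.
status = check_installed(["requests", "bs4"])
for name, ok in status.items():
    print(name, "installed" if ok else "missing")
```

Any module reported as missing can then be installed with pip as described above.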
Problems encountered when installing beautifulsoup:
1.bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Fix: pip install lxml
2.If you downloaded the package manually, you may see: /usr/lib/python2.7/site-packages/beautifulsoup4-4.7.1-py2.7.egg/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used.'The soupsieve package is not installed. CSS selectors cannot be used.'
Fix: download the missing packages manually as well, or install with pip install instead.
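The FeatureNotFound error in problem 1 appears when the code asks for the lxml parser but lxml is not installed. Besides installing lxml, one option is to fall back to Python's built-in html.parser. A minimal sketch, assuming a hypothetical helper named pick_parser:

```python
import importlib.util


def pick_parser():
    """Return "lxml" when it is importable, otherwise the built-in "html.parser".

    pick_parser is a hypothetical helper, not part of beautifulsoup.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
print("using parser:", parser)
# The result is meant to be passed on, for example:
#   BeautifulSoup(html_doc, features=parser)
```

The two parsers can build slightly different trees for malformed HTML, so it is worth sticking to one of them once a scraper works.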
1.Basic usage of the requests library
import requests
response = requests.get('http://baidu.com')
# type of the response object
print(type(response))
print(response.url)
print(response.status_code)
print(response.text)
# get cookies
print(response.cookies)
print(type(response.cookies))
for k, v in response.cookies.items():
    print(k + ':' + v)
# if the content is JSON
# print(response.json())
# GET request with query parameters
get_params = {'wd': 'qq'}
response = requests.get("http://www.baidu.com/s", params=get_params)
print(response.text)
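The calls above assume every request succeeds. A slightly more defensive version could look like the sketch below. build_url and fetch_text are hypothetical helper names introduced for illustration, and the 5 second timeout is an arbitrary choice. build_url shows how params ends up in the query string without sending anything over the network.

```python
import requests


def build_url(url, params=None):
    """Return the final URL requests would send, without touching the network."""
    return requests.Request("GET", url, params=params).prepare().url


def fetch_text(url, params=None, timeout=5.0):
    """Fetch a page and return its text, or None if the request fails.

    fetch_text is a hypothetical helper, not part of requests.
    """
    try:
        response = requests.get(url, params=params, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        # instead of silently returning an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print("request failed:", exc)
        return None
    return response.text


# Same parameters as in the example above; no request is sent here.
print(build_url("http://www.baidu.com/s", {"wd": "qq"}))
```

Calling fetch_text("http://www.baidu.com/s", {"wd": "qq"}) would then return the page text, or None after printing the reason for the failure.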
2.Basic usage of beautifulsoup
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>Test BeautifulSoup</title></head>
<body>
<p class="title"><b>Test BeautifulSoup</b></p>
<p class="list">
<a href="http://test1.com" class="href" id="item1">test1</a>
<a href="http://test2.com" class="href" id="item2">test2</a>
</p>
<p class="content">Test1 Test2 Test3 ...</p>
"""
html = BeautifulSoup(html_doc, features="html.parser")
print("result")
# pretty-print with standard indentation
print(html.prettify())
print('title:', html.title)
print('title_name:', html.title.name)
print('title_string:', html.title.string)
print('title_parent_name:', html.title.parent.name)
print('p:', html.p)
print('p_class:', html.p['class'])
print('a:', html.a)
print('all_a:', html.find_all('a'))
print('id=item1:', html.find(id="item1"))
print('a_href:')
for link in html.find_all('a'):
    print(link.get('href'))
print("all_text:", html.get_text())
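In a real crawler the parsed values are usually collected rather than printed. The sketch below gathers the links from a shortened copy of the test document into a list of dictionaries. extract_links is a hypothetical helper name. It uses find_all with class_ rather than CSS selectors, so it does not depend on the soupsieve package mentioned earlier.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test document used above.
html_doc = """
<html><head><title>Test BeautifulSoup</title></head>
<body>
<p class="list">
<a href="http://test1.com" class="href" id="item1">test1</a>
<a href="http://test2.com" class="href" id="item2">test2</a>
</p>
</body></html>
"""


def extract_links(markup):
    """Return one dict per <a class="href"> tag found in the markup.

    extract_links is a hypothetical helper, not part of beautifulsoup.
    """
    soup = BeautifulSoup(markup, features="html.parser")
    links = []
    for a in soup.find_all("a", class_="href"):
        links.append({
            "id": a.get("id"),
            "text": a.get_text(strip=True),
            "href": a.get("href"),
        })
    return links


links = extract_links(html_doc)
for link in links:
    print(link["id"], link["text"], link["href"])
```

The same function could be applied to the text of a page downloaded with requests instead of the inline test document.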
HTTP status codes: https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
requests documentation: http://cn.python-requests.org/zh_CN/latest/
Detailed requests usage: https://www.cnblogs.com/mzc1997/p/7813801.html
beautifulsoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
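Original content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com to have it removed.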