Scraping rental listings with Python and storing them in MySQL

Use the requests library to fetch the page HTML;

use BeautifulSoup (bs4) to parse the page content;

use pymysql to operate the MySQL database;

store the extracted information in MySQL.

The result looks like this:

1. Import the required libraries and set up the database connection. MySQL must already be installed on the local machine.

import requests
from bs4 import BeautifulSoup
import re
import pymysql
import time  # used for the per-page pause in step 6

connect = pymysql.connect(user='root', password='root', host='localhost',
                          port=3306, db='python', charset='utf8')
conn = connect.cursor()
conn.execute("create database if not exists Ganjizufang character set utf8;")
conn.execute("use Ganjizufang;")
sql = """create table if not exists roominfo (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(200), price VARCHAR(200), room_type VARCHAR(200),
    room_size VARCHAR(200), room_direction VARCHAR(200),
    room_total_floor VARCHAR(200), elecator VARCHAR(200),
    room_fixtures VARCHAR(200), viliage_name VARCHAR(200),
    raiway_distance VARCHAR(300), url VARCHAR(200))"""
conn.execute('drop table if exists roominfo;')  # start from an empty table each run
conn.execute(sql)

2. Fetch the listing pages. Set a User-Agent header, request each page with requests.get(), and parse the response with BeautifulSoup and the lxml parser.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
for i in range(1, 20):
    link = 'http://cd.ganji.com/fang1/ditie/m1o' + str(i) + 'l21s2298/'
    r = requests.get(link, headers=headers)
    print('The page is:', i)
    soup = BeautifulSoup(r.text, 'lxml')
    house_list = soup.find_all('div', class_="f-list-item ershoufang-list")
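The page URL only differs in the page index, so the crawl range is easy to check up front. A quick sanity check of the URL pattern the loop above walks through:

```python
# Expand the listing-page URL pattern for pages 1-19
base = 'http://cd.ganji.com/fang1/ditie/m1o{}l21s2298/'
urls = [base.format(i) for i in range(1, 20)]
print(len(urls))   # 19
print(urls[0])     # http://cd.ganji.com/fang1/ditie/m1o1l21s2298/
```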

3. Fetch the detail pages. The listing page carries only a brief description of each room, so we read each item's href, follow it to the detail page, and extract the full room information there.

    for house in house_list:
        # find the link to the next-level (detail) page
        next_link = house.find('dd', class_="dd-item title").find('a', class_='js-title value title-font')
        link1 = 'http://cd.ganji.com' + str(next_link['href'])
        r1 = requests.get(link1, headers=headers)
        soup1 = BeautifulSoup(r1.text, 'lxml')
        title = soup1.find('p', class_='card-title').text.strip()
        price = soup1.find('ul', class_='card-pay f-clear').find('li', class_='price').find('span', class_='num').text.strip()
        # these fields all share the same tag, so they are picked out by position
        er_list = soup1.find('ul', class_='er-list f-clear')
        room_type = er_list.contents[1].text.strip()
        room_size = er_list.contents[3].text.strip()
        room_direction = er_list.contents[5].text.strip()
        room_total_floor = er_list.contents[7].text.strip()
        elecator = er_list.contents[9].text.strip()
        try:
            room_fixtures = er_list.contents[11].text.strip()
        except:
            room_fixtures = 'no info'
        try:
            viliage_name = soup1.find('ul', class_='er-list-two f-clear').find('a', class_='blue').text.strip()
        except:
            viliage_name = 'no info'
        raiway_distance = soup1.find('div', class_='subway-wrap').find('span', class_='content').text.strip()
        try:
            address = soup1.find('ul', class_='er-list-two f-clear').contents[5].text.strip()
        except:
            address = 'no info'
        phone = soup1.find('div', class_='card-info f-fr').find('div', class_='c_phone f-clear')
        # match the phone number with a regex, or use the click-to-reveal approach
        phone_number = re.search(r'[0-9]+', str(phone))

The try/except blocks above are needed because some listings are missing certain fields; reading a missing field would raise an error and stop the program, so absent values are recorded as 'no info'.
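The repeated try/except blocks can also be factored into a small helper. In the sketch below, `safe_extract` and `MissingField` are hypothetical names; the stub stands in for a BeautifulSoup object whose find() returns None when a field is absent:

```python
def safe_extract(getter, default='no info'):
    # Run a scraping expression; fall back to a default when the field is absent
    try:
        return getter().strip()
    except AttributeError:
        return default

class MissingField:
    # Stub mimicking soup.find() on a page without the field: returns None
    def find(self, *args, **kwargs):
        return None

room_fixtures = safe_extract(lambda: MissingField().find('ul').text)
print(room_fixtures)  # no info
```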

4. After each record is extracted, insert it into the MySQL database with INSERT INTO. Building the query with % string interpolation breaks on quotes in the data and invites SQL injection, so the values are passed as parameters instead:

        conn.execute("insert into roominfo(title,price,room_type,room_size,room_direction,"
                     "room_total_floor,elecator,room_fixtures,viliage_name,raiway_distance,url) "
                     "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                     (title, price, room_type, room_size, room_direction, room_total_floor,
                      elecator, room_fixtures, viliage_name, raiway_distance, link1))
        connect.commit()  # commit on the connection, not the cursor

5. The data can also be saved locally as a text or CSV file:

        with open('Ganjizufang.csv', 'a+', encoding='utf-8') as f:
            f.write(title + ',')
            f.write(price + ',')
            f.write(room_type + ',')
            f.write(room_size + ',')
            f.write(room_direction + ',')
            f.write(room_total_floor + ',')
            f.write(elecator + ',')
            f.write(room_fixtures + ',')
            f.write(viliage_name + ',')
            f.write(raiway_distance + ',')
            # f.write(address + ',')
            f.write(link1 + ',')
            f.write('\n')
        # the with block closes the file automatically; no f.close() needed
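Writing the fields with hand-placed commas breaks as soon as a title itself contains a comma; the stdlib csv module quotes such fields automatically. A minimal sketch with made-up sample values:

```python
import csv

row = ['Bright 2-room, near metro', '1500', 'south-facing']  # sample values only
with open('Ganjizufang_demo.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(row)

with open('Ganjizufang_demo.csv', newline='', encoding='utf-8') as f:
    print(next(csv.reader(f)))  # the comma inside the title survives the round trip
```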

6. Pause for two seconds after each page, and close the cursor and the connection at the end.

    time.sleep(2)

conn.close()
connect.close()

The saved CSV looks like this:

7. Closing notes: if you send too many requests, the site will block your IP and demand a captcha before allowing further access. You can disguise your IP by building a proxy pool or routing requests through a proxy server.

8. Appendix: working with the local MySQL database, using the data written above as an example.

1. Add the MySQL installation to the PATH environment variable;
2. Press Windows+R, type cmd to open a command prompt, run mysql, and enter the password;
3. connect Ganjizufang; use Ganjizufang; then select * from roominfo; lists everything in roominfo; select * from roominfo where price
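The appendix queries can also be tried without a MySQL server. The sketch below uses the stdlib sqlite3 module (whose syntax matches the mysql client for these simple statements) with a trimmed-down roominfo table and a made-up row; note that pymysql uses %s placeholders where sqlite3 uses ?:

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
# Trimmed-down roominfo schema with one sample row (values are made up)
cur.execute("create table roominfo (id integer primary key, title text, price text)")
cur.execute("insert into roominfo(title, price) values (?, ?)", ('demo room', '1500'))
rows = cur.execute("select * from roominfo").fetchall()
print(rows)  # [(1, 'demo room', '1500')]
```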

Original link: http://kuaibao.qq.com/s/20180505G0F1EX00?refer=cp_1026
