I'm new to Python and am scraping data from tripadvisor.com, but my output contains duplicated entries. I'd really appreciate your help. Thanks. Here is my code:
from bs4 import BeautifulSoup
import csv, urllib.request
import requests

hotels_pagewise = []
offset = 0
url = 'https://www.tripadvisor.com.au/Hotels-g55711-oa60' + str(offset) + '-Dallas_Texas-Hotels.html#Hotelnames'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# Read the last page number from the pagination link
for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)

# Build the list of result-page URLs (30 hotels per page)
for i in range(0, 9):
    n = i * 30
    if n == 0:
        pageUrl = 'https://www.tripadvisor.com.au/Hotels-g55711-Dallas_Texas-Hotels.html#Hotelnames'
    else:
        pageUrl = 'https://www.tripadvisor.com.au/Hotels-g55711-oa' + str(n) + '-Dallas_Texas-Hotels.html#Hotelnames'
    hotels_pagewise.append(pageUrl)

csvfile = open('hotel.csv', 'w', newline='')
writer = csv.writer(csvfile)
writer.writerow(['name','link'])

# Visit each result page and write one CSV row per listing
for sub_url in hotels_pagewise:
    thepage = urllib.request.urlopen(sub_url)
    soup = BeautifulSoup(thepage, "html.parser")
    text = str(soup)
    hpage = soup.findAll('div', {"class": "listing_title"})
    for link in hpage:
        hotel_link = link.find('a').get('href')
        hotel_link = 'https://www.tripadvisor.com.au/' + hotel_link
        hotel_name = link.text
        print(hotel_name, "-", hotel_link)
        if hotel_link is None:
            print(hotel_name)
        writer.writerow([hotel_name, hotel_link])
csvfile.close()
Posted on 2018-08-02 13:08:39
After the CSV has been written, open the 'hotel.csv' file with pandas and use pandas' drop_duplicates() to remove the duplicate entries from the CSV.
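import pandas as pd
df = pd.read_csv('hotel.csv')
# The CSV header is name,link. keep='first' keeps one copy of each row (keep=False would drop every copy).
df = df.drop_duplicates(subset=['name', 'link'], keep='first')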
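If you would rather not pull in pandas, the same clean-up can probably be done with the standard csv module. This is only a sketch of that alternative, not the approach from the answer above: the helper name dedupe_rows and the sample rows are made up for illustration, and an in-memory io.StringIO stands in for hotel.csv so the snippet runs on its own.

```python
import csv
import io


def dedupe_rows(rows):
    """Return rows in their original order, keeping only the first copy of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data standing in for hotel.csv (header: name,link).
sample = io.StringIO(
    "name,link\n"
    "Hotel A,https://example.com/a\n"
    "Hotel B,https://example.com/b\n"
    "Hotel A,https://example.com/a\n"
)

reader = csv.reader(sample)
header = next(reader)              # keep the header row out of the comparison
unique_rows = dedupe_rows(reader)  # the second "Hotel A" row is dropped

# Write the cleaned rows back out (to memory here; a real file would work the same way).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(unique_rows)
print(out.getvalue())
```

The same seen-set check could also sit inside the scraping loop, right before writer.writerow, so that duplicates never reach the file in the first place.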
https://stackoverflow.com/questions/-100001868