How do I remove duplicates when web scraping to CSV in Python?

Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.

I'm new to Python and am scraping data from tripadvisor.com, but my output contains duplicated entries. I would really appreciate your help. Thanks. Here is my code:

from bs4 import BeautifulSoup
import csv, urllib.request
import requests

hotels_pagewise = []
offset = 0
url = 'https://www.tripadvisor.com.au/Hotels-g55711-oa60' + str(offset) + '-Dallas_Texas-Hotels.html#Hotelnames'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)
    for i in range(0, 9):
        n = i * 30
        if n == 0:
            pageUrl = 'https://www.tripadvisor.com.au/Hotels-g55711-Dallas_Texas-Hotels.html#Hotelnames'
        else:
            pageUrl = 'https://www.tripadvisor.com.au/Hotels-g55711-oa' + str(n) + '-Dallas_Texas-Hotels.html#Hotelnames'
        hotels_pagewise.append(pageUrl)

csvfile = open('hotel.csv', 'w', newline='')
writer = csv.writer(csvfile)
writer.writerow(['name','link'])

for sub_url in hotels_pagewise:
    thepage = urllib.request.urlopen(sub_url)
    soup = BeautifulSoup(thepage, "html.parser")
    text = str(soup)
    hpage = soup.findAll('div', {"class": "listing_title"})


    for link in hpage:
        hotel_link = link.find('a').get('href')
        hotel_link = 'https://www.tripadvisor.com.au/' + hotel_link
        hotel_name = link.text
        print(hotel_name, "-", hotel_link)

        if hotel_link == None:
            print(hotel_name)

        writer.writerow([hotel_name, hotel_link])
csvfile.close()
Answered by a user:

Try using pandas to load the data into a DataFrame.
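
A minimal sketch of what this suggestion could look like: collect the scraped (name, link) pairs in a list, load them into a DataFrame, and drop repeated rows before producing the CSV. The `rows` list is made-up sample data standing in for the scraper output, and `dedupe_rows` is a hypothetical helper name; neither comes from the original post.

```python
import pandas as pd


def dedupe_rows(rows):
    """Return a DataFrame of (name, link) rows with exact repeats removed."""
    df = pd.DataFrame(rows, columns=["name", "link"])
    # keep="first" retains one copy of each repeated row.
    return df.drop_duplicates(keep="first").reset_index(drop=True)


# Made-up sample data standing in for the scraped results (assumption).
rows = [
    ("Hotel A", "https://example.com/hotel-a"),
    ("Hotel B", "https://example.com/hotel-b"),
    ("Hotel A", "https://example.com/hotel-a"),
]

hotels = dedupe_rows(rows)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = hotels.to_csv(index=False)
print(csv_text)
```

In the scraper itself, the `rows` list would be filled inside the loop instead of calling `writer.writerow`, and `hotels.to_csv('hotel.csv', index=False)` would write the file once at the end.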

Answered by a user:

After the CSV has been written, open 'hotel.csv' with pandas and use pandas' drop_duplicates() to remove the duplicate entries from it.

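import pandas as pd
df = pd.read_csv('hotel.csv')
df = df.drop_duplicates(subset=['name', 'link'], keep='first')  # columns match the CSV header; keep one copy of each row
df.to_csv('hotel.csv', index=False)  # write the de-duplicated rows back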
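
Two details of `drop_duplicates()` matter here. The `subset` columns must exist in the file (the scraper in the question writes the header `name,link`), and the `keep` argument decides which copies survive. The call also returns a new DataFrame rather than modifying the original. The sketch below illustrates the difference between `keep='first'` and `keep=False` on made-up sample data; the hotel names and URLs are assumptions, not real scraper output.

```python
import pandas as pd

# Made-up sample data: "Hotel A" appears twice (assumption).
df = pd.DataFrame(
    {
        "name": ["Hotel A", "Hotel B", "Hotel A"],
        "link": [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/a",
        ],
    }
)

# keep="first" (the default) retains the first copy of each repeated row.
kept_first = df.drop_duplicates(subset=["name", "link"], keep="first")

# keep=False discards every row that has a duplicate, including the first copy.
kept_none = df.drop_duplicates(subset=["name", "link"], keep=False)

print(kept_first["name"].tolist())  # ['Hotel A', 'Hotel B']
print(kept_none["name"].tolist())   # ['Hotel B']
```

For a hotel list where each hotel should appear exactly once, `keep='first'` is usually the behavior wanted, since `keep=False` would remove a duplicated hotel entirely.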
