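I am trying to use BeautifulSoup to collect 100 reviews/ratings for a restaurant on Yelp as part of an assignment. I am specifically looking for: the review ID, the review comment, and the review rating.
I am new to Python and I feel like I am missing something very obvious.
Here is what I have so far: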
from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.yelp.com/biz/ichiran-times-square-new-york-4?osq=Ichiban+Ramen'
ourUrl = urllib.request.urlopen(url)
soup = BeautifulSoup(ourUrl,'html.parser')
type(soup)
print(soup.prettify())

for i in soup.find_all('div', {'class':" arrange-unit__373c0__3XPkE arrange-unit-fill__373c0__38Zde border-color--default__373c0__r305k"}):
    ID.append(i.find("div").get("aria-label"))

soup.find('p', {'class':"comment__373c0__Nsutg css-n6i4z7"})

i = soup.find('p', {'class':"comment__373c0__Nsutg css-n6i4z7"})
i.text
review=[]
rating = []
ID = []
for x in range(0,10):
    url = "https://www.yelp.com/biz/ichiran-times-square-new-york-4?osq=Ichiban+Ramen="+str(10*x)
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl,'html.parser')
    #for i in soup,
    for i in soup.find_all('div', {'class':" i-stars__373c0___sZu0 i-stars--regular-5__373c0__20dKs border-color--default__373c0__1yxBb overflow--hidden__373c0__1TJqF"}):
        per_rating = i.text
        rating.append(per_rating)
    for i in soup.find_all('span', {'class':" arrange-unit__373c0__3XPkE arrange-unit-fill__373c0__38Zde border-color--default__373c0__r305k"}):
        ID.append(i.find("div").get("aria-label"))
    for i in soup.find_all('p', {'class':"comment__373c0__Nsutg css-n6i4z7"}):
        per_review=i.text
        review.append(per_review)

len(review)
Below is my attempt at exporting to CSV. In the CSV I only get the review text and nothing else:
with open('Review.csv','a',encoding = 'utf-8') as f:
    for each in review:
        f.write(each+'\n')

Posted on 2021-11-03 14:35:48
Edit (updated)
The problem actually appears to be caused by not targeting the correct tags in the HTML.
# Import regex package
import re

# Narrow down the section that you are searching in to avoid erroneous elements
child = soup.find('div', {'class': 'css-79elbk border-color--default__373c0__1ei3H'})

for x in child.find_all('span', {'class':"fs-block css-m6anxm"}):
    # Ignore the titular "Username"
    if x.text != 'Username':
        ID.append(x.text)

for x in child.find_all('div', {'class':re.compile(r'i-stars.+')}):
    rating.append(x.get('aria-label'))

for x in child.find_all('p', {'class':'comment__373c0__Nsutg css-n6i4z7'}):
    comment = x.find('span', {'class':'raw__373c0__tQAx6'})
    review.append(comment.text)

The ID needs to target the specific element with 'class':"fs-block css-m6anxm", whereas the rating class differs depending on the number of stars awarded, so a regular expression is used to match anything that starts with i-stars.
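To see why the regular expression catches every star rating, it can help to try the pattern on class names directly. The sketch below is only an illustration: it assumes BeautifulSoup tests a compiled pattern against each class token with a search (that is my understanding of its behavior), and the regular-4 token is a hypothetical stand-in for a review with a different star count.

```python
import re

# Same pattern as used in the find_all call above.
stars_pattern = re.compile(r'i-stars.+')

# Class tokens copied from the question's selectors, plus one
# hypothetical "regular-4" token standing in for a 4-star review.
class_tokens = [
    "i-stars__373c0___sZu0",
    "i-stars--regular-5__373c0__20dKs",
    "i-stars--regular-4__373c0__hypothetical",
    "border-color--default__373c0__1yxBb",
    "overflow--hidden__373c0__1TJqF",
]

# Keep only the tokens in which the pattern is found.
matching = [token for token in class_tokens if stars_pattern.search(token)]
print(matching)
```

Only the i-stars tokens are kept, so the same call should work regardless of how many stars a particular review has.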
Original answer
I believe your problem is that you are only looping over review, when you also need to loop over ID and rating:
# Create new_line to work around f-strings issue with '\'
new_line = '\n'
with open('Review.csv','a',encoding = 'utf-8') as f:
    for i in range(len(review)):
        f.write(f'{review[i]},{ID[i]},{rating[i]}{new_line}')

To achieve this, you could also look at the Pandas package.
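One caveat with writing rows by hand is that review text often contains commas or line breaks, which would shift the columns in the file. As a minimal sketch, assuming the three lists line up one-to-one, the standard library's csv module can quote fields for you. The sample values and the in-memory buffer are made up for illustration; in practice you would pass the file object from open('Review.csv', 'a', newline='', encoding='utf-8') instead.

```python
import csv
import io

# Made-up sample data standing in for the scraped lists.
review = ["Great ramen, long wait", "Solid broth"]
ID = ["Alice A.", "Bob B."]
rating = ["5 star rating", "4 star rating"]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)

# Header row, then one row per review; zip stops at the shortest list.
writer.writerow(["review_comment", "review_id", "review_rating"])
for row in zip(review, ID, rating):
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Because zip stops at the shortest list, it is worth checking that review, ID and rating have the same length before writing, otherwise rows may be silently dropped.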
With Pandas, you can create a dataframe and then export it to a number of different file types, including CSV, for example:
# Import Pandas package
import pandas as pd

# Store list values, along with column headings, in a dictionary
d = {'review_comment': review, 'review_id': ID, 'review_rating': rating}

# Create dataframe from the dictionary
df = pd.DataFrame(data=d)

# Export the dataframe as a CSV
df.to_csv('desired/save/location.csv', index=False)

https://stackoverflow.com/questions/69826227