Q: I can't display the HTML code - Beautiful Soup

Stack Overflow user

Asked 2019-05-09 02:55:34

3 answers, 210 views, 0 followers, 1 vote

(I'm a beginner at web scraping.) I want to scrape this link: https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=2

When I try to print repo_list, I get [] instead of the HTML code!

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=2')
soup = BeautifulSoup(page.text, 'html.parser')
repo = soup.find(class_="c-wrap")
print(repo)
repo_list = repo.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche ')
print(repo_list)

python

html

web-scraping

beautifulsoup

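3 Answers

Stack Overflow user

Accepted answer

Answered 2019-05-09 03:55:49

You can pull the data out of the page source with a regular expression, do some string cleaning, and pass the result to json. Each product can then be printed as a dictionary containing the information for one listing.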

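import json
import re

import requests

# Send a browser-like User-Agent header with the request.
r = requests.get('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=2', headers={'User-Agent': 'Mozilla/5.0'})
# Capture the JavaScript object assigned to ava_data.
# The pattern is a raw string so that \s and \. reach the regex engine unchanged.
p = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)
# Strip the indentation, replace non-breaking spaces, and double the backslashes.
x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')
# Remove runs of two or more whitespace characters and literal \r\n sequences.
x = re.sub(r'\s{2,}|\\r\\n', '', x)
data = json.loads(x)

# Each product is a dictionary describing one listing.
for product in data['products']:
    print(product)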

Sample output (from page 3):

{'idannonce': '142830891', 'idagence': '263765', 'idtiers': '284402', 'typedebien': 'Appartement', 'typedetransaction': ['vente'], 'idtypepublicationsourcecouplage': 'SL', 'position': '0', 'codepostal': '77450', 'ville': 'Esbly', 'departement': 'Seine-et-Marne', 'codeinsee': '770171', 'produitsvisibilite': 'AD:AC:BB:AW', 'affichagetype': [{'name': 'liste', 'value': True}], 'cp': '77450', 'etage': '0', 'idtypechauffage': '0', 'idtypecommerce': '0', 'idtypecuisine': 'séparée équipée', 'naturebien': '1', 'si_balcon': '0', 'nb_chambres': '1', 'nb_pieces': '2', 'si_sdbain': '0', 'si_sdEau': '0', 'nb_photos': '14', 'prix': '139900', 'surface': '44'}

To get the price, for example:

product['prix']
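
The same idea can be tried offline. The following is a minimal sketch, assuming the page embeds its data as a JavaScript variable holding valid JSON; the HTML string and the helper name `extract_json_variable` are made up for illustration and are not taken from the real page.

```python
import json
import re

# Hypothetical page source: an inline script that assigns a JSON object
# to a variable, loosely modelled on the ava_data pattern above.
html = """
<html><body>
<script>
var ava_data = {"products": [{"idannonce": "1", "prix": "139900"},
                             {"idannonce": "2", "prix": "250000"}]};
ava_data.logged = false;
</script>
</body></html>
"""


def extract_json_variable(source, name):
    """Return the JSON object assigned to `var <name> = {...};` in source."""
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(\{.*?\});", re.DOTALL
    )
    match = pattern.search(source)
    if match is None:
        raise ValueError(f"variable {name!r} not found")
    return json.loads(match.group(1))


data = extract_json_variable(html, "ava_data")
# The values are strings in the JSON, so convert them before doing arithmetic.
prices = [int(product["prix"]) for product in data["products"]]
print(prices)
```

The lazy `.*?` stops at the first `};`, so this sketch would break if a string value inside the object contained that sequence. A real page may need extra cleaning, as in the answer above.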

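Votes: 1

Stack Overflow user

Answered 2019-05-09 03:15:37

When you call find_all, it returns a list of the matching tags found in that part of the HTML; if no tag matches, it returns an empty list. So the tag you are searching for was not found on the page! There can be many reasons for this. You may have a typo in the class you are searching for, or the value you are searching for may not be a class at all, but an id or some other attribute.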
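
These cases can be reproduced on a small snippet. The sketch below uses made-up markup (not the real page) and the built-in html.parser; the class names are borrowed from the question only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page.
html = """
<div class="c-wrap">
  <div class="c-pa-list c-pa-gold cartouche" id="annonce-1">Listing 1</div>
  <div class="c-pa-list cartouche" id="annonce-2">Listing 2</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A class that occurs nowhere yields an empty list, not an error.
missing = soup.find_all(class_="no-such-class")

# A single class name matches every tag that carries it.
by_class = soup.find_all(class_="cartouche")

# A CSS selector matches tags carrying all the listed classes, in any order.
gold = soup.select("div.cartouche.c-pa-gold")

# The value you are after may be an id rather than a class.
by_id = soup.find_all(id="annonce-2")

print(missing)
print(len(by_class), len(gold), len(by_id))
```

When several class names are passed to `class_` as one string, Beautiful Soup compares that string with the whole class attribute, so a different order or a stray trailing space can be enough to get `[]`. Searching for one class, or using `select` with a CSS selector, tends to be more forgiving.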

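Some pages (usually larger web apps such as Facebook, Instagram, Twitter, etc.) create their classes, ids, and so on dynamically and use small tricks to prevent their data from being scraped. If you want to see what a site allows you to scrape, you can check its robots.txt file.

For example, if you want to scrape reddit, you can go to https://reddit.com/robots.txt to see a list of the URIs you may hit on their domain! Sites often also provide a sitemap, which is an XML (similar to HTML) document full of links to the available pages!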
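
A robots.txt file can also be checked programmatically with the standard library's urllib.robotparser. This is a minimal sketch that parses a made-up robots.txt locally, so no request is sent; the example.com URLs and the "my-scraper" user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no request is made.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch reports whether the given user agent may request the URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/listings?page=2")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)

# site_maps() (Python 3.8+) lists the Sitemap entries, or returns None.
print(parser.site_maps())
```

Against a live site you would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()` before asking `can_fetch`.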

Votes: 2

Stack Overflow user

Answered 2019-05-09 05:41:24

Great tutorial:

https://www.youtube.com/watch?v=ind-mugxMxk

import re
import requests
from bs4 import BeautifulSoup
from babel.numbers import format_currency

session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5', # these parameters can be changed as needed
    'Accept-Encoding': 'gzip, deflate, br',
    'content-type': 'application/json',
    'skip-caching': 'true',
    'DNT': '1',
    'Connection': 'keep-alive',
    'TE': 'Trailers'}
url = 'https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=2'
response = session.get(url, headers=headers)
page = response.text
soup = BeautifulSoup(page, "lxml")
for i, div in enumerate(soup.find_all('div', {'class': 'c-pa-price'}), 1):
    price = div.text
    # this regular expression substitution replaces all non alphanumeric characters but leaves in specialized language characters
    price = re.sub('[^0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06]+','', price)
    # remove extra word Bouquet - optional
    extra_word = re.compile('Bouquet')
    if extra_word.search(price):
        price = price.split('Bouquet')[1]
    price = format_currency(int(price), 'EUR', locale='fr_FR')
    print('Inscription ' + str(i) + ':', price)

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56047324
