
Extracting content from a page rendered with JavaScript, using BeautifulSoup

Stack Overflow user
Asked 2022-11-19 02:57:41
2 answers · 55 views · 0 followers · 0 votes

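I started programming not long ago and ran into this problem. I want to collect stock data from this website: https://statusinvest.com.br/acoes/petr4. Apparently the values are rendered with JavaScript, so BeautifulSoup does not pick them up. I would appreciate any help understanding this.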

[Image: my soup code] [Image: example of the information loaded with JavaScript]


2 Answers

Stack Overflow user

Accepted answer

Posted 2022-11-19 07:47:13

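That section not only needs JavaScript to load, it does not actually load until you scroll to it. You could try to figure out which request and/or which piece of JavaScript renders that section and then replicate it in Python, but I think it would be a bit easier to use Selenium. I even have this function to make it more convenient to automate some of the simpler, common interactions before scraping the HTML: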

#### FIRST PASTE [or DOWNLOAD&IMPORT] FUNCTION DEF from https://pastebin.com/kEC9gPC8 ####
soup = linkToSoup_selenium(
    'https://statusinvest.com.br/acoes/petr4', 
    clickFirst='//strong[@data-item="avg_F"]',  # it actually just has to scroll, not click [but I haven't added an option for that yet]
    ecx='//strong[@data-item="avg_F"][text()!="-"]' # waits till this loads
)
if soup is not None:
    print({
        t.find_previous_sibling().get_text(' ').strip(): t.get_text(' ').strip()
        for t in soup.select('div#payout-section span.title + strong.value')
    })

prints

{'MÉDIA': '83,32%', 'ATUAL': '124,13% \n ( 48,97% acima da média )', 'MENOR\xa0VALOR': '26,35% \n ( 2019 )', 'MAIOR\xa0VALOR': '144,51% \n \n( 2020 )'}
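For readers without the pastebin helper, the scroll-then-wait idea can be sketched with plain Selenium. This is only an illustration, not the helper's actual implementation: the function name `fetch_payout_section_html` and the constant names are hypothetical, it assumes a local Chrome install, and it assumes the placeholder `strong` element is already in the HTML before its value loads. The two XPath expressions are the ones used in the call above.

```python
# XPath expressions taken from the call above.
PLACEHOLDER_XPATH = '//strong[@data-item="avg_F"]'
LOADED_XPATH = '//strong[@data-item="avg_F"][text()!="-"]'


def fetch_payout_section_html(url, timeout=15):
    """Return the page source after the payout section has been scrolled to and filled in."""
    # Imported lazily so the helper can be defined even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Assumption: the placeholder exists before its data loads.
        target = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, PLACEHOLDER_XPATH))
        )
        # Scrolling the element into view is what triggers the lazy load.
        driver.execute_script("arguments[0].scrollIntoView();", target)
        # Wait until the "-" placeholder text has been replaced by a real value.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, LOADED_XPATH))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be passed to `BeautifulSoup(html, 'html.parser')` and queried with the same `select` call shown above.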

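EDIT: I eventually noticed the API used to fetch the data (https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0). You can actually build that URL from the HTML that is available even before the JavaScript loading happens: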

soup.select_one('#payout-section[data-company][data-code]').attrs

should return

{'id': 'payout-section', 'data-company': '408', 'data-code': 'petr4', 'data-category': '1'}

so the URL can be formed with

# The section element carries the company id and ticker code as data attributes.
payout = soup.select_one('#payout-section[data-company][data-code]')
if payout:
    compId, dCode = payout.get('data-company'), payout.get('data-code')
    apiUrl = 'https://statusinvest.com.br/acao'
    apiUrl = f'{apiUrl}/payoutresult?code={dCode}&companyid={compId}&type=0'
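The same URL construction can be wrapped in a small helper so the `type` query parameter becomes an argument. This is a minimal sketch: the function name `build_payout_url` and the constant `API_BASE` are hypothetical, and the endpoint and parameter names are copied from the URL quoted in this answer.

```python
from urllib.parse import urlencode

# Endpoint observed in this answer.
API_BASE = 'https://statusinvest.com.br/acao/payoutresult'


def build_payout_url(code, company_id, window_type=0):
    """Build the payout API URL from the ticker code, company id and `type` value."""
    query = urlencode({'code': code, 'companyid': company_id, 'type': window_type})
    return f'{API_BASE}?{query}'


print(build_payout_url('petr4', '408'))
# https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0
```

Using `urlencode` rather than string formatting keeps the query string valid if a value ever contains characters that need escaping.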

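I think the `type` parameter selects the time window: 0 for 5 years, 1 for 10 years, and 2 for the maximum window. `requests.get(apiUrl, headers=headers).json()` should return something like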

{
    "actual": 124.12623323305537,
    "avg": 83.32096287339556,
    "avgDifference": 48.97359434223362,
    "minValue": 26.353309862919502,
    "minValueRank": 2019,
    "maxValue": 144.51093035368598,
    "maxValueRank": 2020,
    "actual_F": "124,13%",
    "avg_F": "83,32%",
    "avgDifference_F": "48,97% acima da m\u00e9dia",
    "minValue_F": "26,35%",
    "minValueRank_F": "2019",
    "maxValue_F": "144,51%",
    "maxValueRank_F": "2020",
    "chart": {
        "categoryUnique": true,
        "category": [
            "2018",
            "2019",
            "2020",
            "2021",
            "2022"
        ],
        "series": {
            "percentual": [
                {
                    "value": 27.189302754606462,
                    "value_F": "27,19%"
                },
                {
                    "value": 26.353309862919502,
                    "value_F": "26,35%"
                },
                {
                    "value": 144.51093035368598,
                    "value_F": "144,51%"
                },
                {
                    "value": 94.42503816271046,
                    "value_F": "94,43%"
                },
                {
                    "value": 124.12623323305537,
                    "value_F": "124,13%"
                }
            ],
            "proventos": [
                {
                    "value": 7009130357.11,
                    "value_F": "R$ 7.009.130.357,11",
                    "valueSmall_F": "7,01 B"
                },
                {
                    "value": 10577427979.68,
                    "value_F": "R$ 10.577.427.979,68",
                    "valueSmall_F": "10,58 B"
                },
                {
                    "value": 10271836929.54,
                    "value_F": "R$ 10.271.836.929,54",
                    "valueSmall_F": "10,27 B"
                },
                {
                    "value": 100721299707.4,
                    "value_F": "R$ 100.721.299.707,40",
                    "valueSmall_F": "100,72 B"
                },
                {
                    "value": 179966901777.61,
                    "value_F": "R$ 179.966.901.777,61",
                    "valueSmall_F": "179,97 B"
                }
            ],
            "lucroLiquido": [
                {
                    "value": 25779000000.0,
                    "value_F": "R$ 25.779.000.000,00",
                    "valueSmall_F": "25,78 B"
                },
                {
                    "value": 40137000000.0,
                    "value_F": "R$ 40.137.000.000,00",
                    "valueSmall_F": "40,14 B"
                },
                {
                    "value": 7108000000.0,
                    "value_F": "R$ 7.108.000.000,00",
                    "valueSmall_F": "7,11 B"
                },
                {
                    "value": 106668000000.0,
                    "value_F": "R$ 106.668.000.000,00",
                    "valueSmall_F": "106,67 B"
                },
                {
                    "value": 144987000000.0,
                    "value_F": "R$ 144.987.000.000,00",
                    "valueSmall_F": "144,99 B"
                }
            ]
        }
    }
}

and then you can pick out whichever values you want from there. (I think it includes the chart data as well.)
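As an illustration of reading values out of that response, the sketch below pairs each year in `chart.category` with the matching entry in `chart.series.percentual`. The sample dict is a trimmed copy of the response shown above. The function name `payout_by_year` is hypothetical, and it assumes the series entries line up with the categories by position, which fits the sample (the minimum falls on 2019 and the maximum on 2020).

```python
# Trimmed copy of the API response shown above.
payout = {
    "actual": 124.12623323305537,
    "avg": 83.32096287339556,
    "chart": {
        "category": ["2018", "2019", "2020", "2021", "2022"],
        "series": {
            "percentual": [
                {"value": 27.189302754606462, "value_F": "27,19%"},
                {"value": 26.353309862919502, "value_F": "26,35%"},
                {"value": 144.51093035368598, "value_F": "144,51%"},
                {"value": 94.42503816271046, "value_F": "94,43%"},
                {"value": 124.12623323305537, "value_F": "124,13%"},
            ],
        },
    },
}


def payout_by_year(data):
    """Map each chart year to its numeric payout percentage."""
    chart = data["chart"]
    points = chart["series"]["percentual"]
    # Assumption: entries are ordered the same way as the category list.
    return {year: point["value"] for year, point in zip(chart["category"], points)}


by_year = payout_by_year(payout)
print(round(by_year["2020"], 2))  # 144.51
```

The unformatted `value` fields are used instead of the `_F` strings so the results can be compared or plotted without parsing localized text.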

Votes: 0

Stack Overflow user

Posted 2022-11-19 07:41:35

Hopefully OP's next question will contain a minimal, reproducible example. Here is one way of getting some data from that page using requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get('https://statusinvest.com.br/acoes/petr4', headers=headers)
soup = bs(r.text, 'html.parser')
valor_atual = soup.select_one('h3:-soup-contains("Valor atual")').find_next('strong').text
min_52_semanas = soup.select_one('h3:-soup-contains("Min. 52 semanas")').find_next('strong').text
print('Valor atual:', valor_atual)
print('Min. 52 semanas:', min_52_semanas)

### and now some values hydrated in page by Javascript, from an API endpoint:

api_url = 'https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0'
api_headers = {
    'referer': 'https://statusinvest.com.br/acoes/petr4',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get(api_url, headers=api_headers)
print(r.json())

Result in terminal:

Valor atual: 26,54
Min. 52 semanas: 15,85
{'actual': 124.12623323305537, 'avg': 83.32096287339556, 'avgDifference': 48.97359434223362, 'minValue': 26.353309862919502, 'minValueRank': 2019, 'maxValue': 144.51093035368598, 'maxValueRank': 2020, 'actual_F': '124,13%', 'avg_F': '83,32%', 'avgDifference_F': '48,97% acima da média', 'minValue_F': '26,35%', 'minValueRank_F': '2019', 'maxValue_F': '144,51%', 'maxValueRank_F': '2020', 'chart': {'categoryUnique': True, 'category': ['2018', '2019', '2020', '2021', '2022'], 'series': {'percentual': [{'value': 27.189302754606462, 'value_F': '27,19%'}, {'value': 26.353309862919502, 'value_F': '26,35%'}, {'value': 144.51093035368598, 'value_F': '144,51%'}, {'value': 94.42503816271046, 'value_F': '94,43%'}, {'value': 124.12623323305537, 'value_F': '124,13%'}], 'proventos': [{'value': 7009130357.11, 'value_F': 'R$ 7.009.130.357,11', 'valueSmall_F': '7,01 B'}, {'value': 10577427979.68, 'value_F': 'R$ 10.577.427.979,68', 'valueSmall_F': '10,58 B'}, {'value': 10271836929.54, 'value_F': 'R$ 10.271.836.929,54', 'valueSmall_F': '10,27 B'}, {'value': 100721299707.4, 'value_F': 'R$ 100.721.299.707,40', 'valueSmall_F': '100,72 B'}, {'value': 179966901777.61, 'value_F': 'R$ 179.966.901.777,61', 'valueSmall_F': '179,97 B'}], 'lucroLiquido': [{'value': 25779000000.0, 'value_F': 'R$ 25.779.000.000,00', 'valueSmall_F': '25,78 B'}, {'value': 40137000000.0, 'value_F': 'R$ 40.137.000.000,00', 'valueSmall_F': '40,14 B'}, {'value': 7108000000.0, 'value_F': 'R$ 7.108.000.000,00', 'valueSmall_F': '7,11 B'}, {'value': 106668000000.0, 'value_F': 'R$ 106.668.000.000,00', 'valueSmall_F': '106,67 B'}, {'value': 144987000000.0, 'value_F': 'R$ 144.987.000.000,00', 'valueSmall_F': '144,99 B'}]}}}
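The values scraped from the HTML, and the `_F` fields in the API response, are strings in Brazilian number format, with a comma as the decimal separator and dots for thousands. If numbers are needed, a small converter along these lines may help. It is only a sketch: the name `parse_br_number` is hypothetical, and it handles only the plain, `%` and `R$` forms seen above, not abbreviated ones such as `7,01 B`.

```python
def parse_br_number(text):
    """Convert strings like '26,54', '124,13%' or 'R$ 7.009.130.357,11' to float."""
    cleaned = text.replace('R$', '').replace('%', '').strip()
    # Drop thousands separators first, then switch the decimal comma to a dot.
    cleaned = cleaned.replace('.', '').replace(',', '.')
    return float(cleaned)


print(parse_br_number('26,54'))  # 26.54
```

For the API response, the unformatted numeric fields (`value`, `actual`, `avg`, and so on) already provide floats, so this is mainly useful for values taken from the page HTML.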

The BeautifulSoup documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/74497235
