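I want to extract certain paragraphs from the page below: the ones that list the industries reporting growth and contraction, plus the respondents' comments and so on (these appear in several places on the page). The paragraphs usually sit directly above a table. How can I use requests, lxml, or BeautifulSoup to parse the page and select the paragraphs I need?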
https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655
I tried lxml with XPath, but the site changes slightly every month when a new report is published, and the code stops working.
Posted on 2017-02-26 17:19:27
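How close does this code come to what you need?
It uses a regular expression to identify the paragraphs, and a second one to find the line that precedes the respondents' comments. Then it simply displays the results.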
>>> import requests
>>> URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
>>> r = requests.get(URL)
>>> page = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page, 'lxml')
>>> import re
>>> paras = soup.find_all('p', string=re.compile(r'(?:growth|contraction).*? are:'))
>>> saying = soup.find_all('strong', string=re.compile('WHAT RESPONDENTS ARE SAYING'))[0]
>>> for i, para in enumerate(paras):
... 'Paragraph ', i
... para
...
('Paragraph ', 0)
<p>Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.</p>
('Paragraph ', 1)
<p>The 12 industries reporting growth in new orders in January — listed in order — are: Plastics & Rubber Products; Apparel, Leather & Allied Products; Miscellaneous Manufacturing; Chemical Products; Paper Products; Transportation Equipment; Electrical Equipment, Appliances & Components; Petroleum & Coal Products; Primary Metals; Machinery; Fabricated Metal Products; and Food, Beverage & Tobacco Products. The five industries reporting a decrease in new orders during January are: Nonmetallic Mineral Products; Wood Products; Textile Mills; Computer & Electronic Products; and Furniture & Related Products.</p>
('Paragraph ', 2)
<p>The 10 industries reporting growth in production during the month of January — listed in order — are: Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Petroleum & Coal Products; Plastics & Rubber Products; Transportation Equipment; Chemical Products; Machinery; Food, Beverage & Tobacco Products; and Computer & Electronic Products. The five industries reporting a decrease in production during January are: Wood Products; Textile Mills; Nonmetallic Mineral Products; Electrical Equipment, Appliances & Components; and Furniture & Related Products.</p>
('Paragraph ', 3)
<p>Of the 18 manufacturing industries, the 10 reporting employment growth in January — listed in order — are: Textile Mills; Paper Products; Food, Beverage & Tobacco Products; Machinery; Electrical Equipment, Appliances & Components; Chemical Products; Miscellaneous Manufacturing; Transportation Equipment; Computer & Electronic Products; and Nonmetallic Mineral Products. The five industries reporting a decrease in employment in January are: Plastics & Rubber Products; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Printing & Related Support Activities. </p>
('Paragraph ', 4)
<p>The seven industries reporting growth in order backlogs in January — listed in order — are: Wood Products; Plastics & Rubber Products; Electrical Equipment, Appliances & Components; Primary Metals; Fabricated Metal Products; Miscellaneous Manufacturing; and Chemical Products. The seven industries reporting a decrease in order backlogs during January — listed in order — are: Nonmetallic Mineral Products; Textile Mills; Paper Products; Computer & Electronic Products; Food, Beverage & Tobacco Products; Transportation Equipment; and Furniture & Related Products.</p>
('Paragraph ', 5)
<p>The eight industries reporting growth in new export orders in January — listed in order — are: Wood Products; Paper Products; Petroleum & Coal Products; Chemical Products; Fabricated Metal Products; Transportation Equipment; Miscellaneous Manufacturing; and Food, Beverage & Tobacco Products. The four industries reporting a decrease in new export orders during January are: Textile Mills; Nonmetallic Mineral Products; Plastics & Rubber Products; and Machinery. Six industries reported no change in new export orders in January compared to December.</p>
('Paragraph ', 6)
<p>The four industries reporting growth in imports during the month of January are: Furniture & Related Products; Apparel, Leather & Allied Products; Fabricated Metal Products; and Food, Beverage & Tobacco Products. The five industries reporting a decrease in imports during January are: Plastics & Rubber Products; Primary Metals; Nonmetallic Mineral Products; Transportation Equipment; and Computer & Electronic Products. Eight industries reported no change in imports in January compared to December.</p>
>>> saying.find_next_sibling()
<ul style="list-style-type: square;">
<li>“Demand very steady to start the year.” (Chemical Products)</li>
<li>“January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products)</li>
<li>“Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics & Rubber Products)</li>
<li>“Business looks stronger moving into the first quarter of 2017.” (Primary Metals)</li>
<li>“Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage & Tobacco Products)</li>
<li>“Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery)</li>
<li>“Year starting on pace with Q4 2016.” (Transportation Equipment)</li>
<li>“Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing)</li>
<li>“Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum & Coal Products)</li>
<li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
</ul>
>>>
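Since the page layout shifts from month to month, it may help to match on the visible text rather than on the position of the elements. The following is only a sketch of that idea, not a tested scraper for the live site: `SAMPLE_HTML` is a made-up stand-in for `r.text`, the function names `industry_paragraphs`, `split_industries` and `respondent_quotes` are hypothetical, and the patterns assume the wording seen in the January report above ("... are: A; B; and C."). It uses the built-in `html.parser` so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded page (r.text above).
SAMPLE_HTML = """
<div>
  <p>Intro paragraph that mentions growth but lists nothing.</p>
  <p>The two industries reporting growth in new orders in January are: Paper Products; and Machinery. The one industry reporting a decrease in new orders during January is: Wood Products.</p>
  <p><strong>WHAT RESPONDENTS ARE SAYING ...</strong></p>
  <ul>
    <li>"Demand very steady." (Chemical Products)</li>
    <li>"Strong start." (Machinery)</li>
  </ul>
</div>
"""

# Assumed wording: a keyword followed, later in the same paragraph, by "are:".
LIST_PATTERN = re.compile(r"(?:growth|contraction|decrease)\b.*?\bare:", re.DOTALL)


def industry_paragraphs(soup):
    """Return the text of every <p> whose text looks like an industry list."""
    # get_text() is used instead of string= so nested tags do not hide a match.
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p")
        if LIST_PATTERN.search(p.get_text())
    ]


def split_industries(paragraph):
    """Return one list of industry names per 'are:' / 'is:' / 'order:' sentence."""
    lists = []
    for body in re.findall(r"\b(?:are|is|order):\s*([^.]+)\.", paragraph):
        # Items are separated by ';' and the last one is prefixed with 'and'.
        names = [re.sub(r"^and\s+", "", part.strip()) for part in body.split(";")]
        lists.append([name for name in names if name])
    return lists


def respondent_quotes(soup):
    """Return the <li> texts of the first <ul> after the respondents heading."""
    heading = soup.find(string=re.compile("WHAT RESPONDENTS ARE SAYING"))
    if heading is None:
        return []
    quote_list = heading.find_next("ul")
    if quote_list is None:
        return []
    return [li.get_text(strip=True) for li in quote_list.find_all("li")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for text in industry_paragraphs(soup):
    print(split_industries(text))
print(respondent_quotes(soup))
```

Using `find_next("ul")` instead of `find_next_sibling()` is a deliberate choice here: it should still find the list if a later report wraps the heading in an extra tag, at the cost of picking up the wrong list if the heading is ever followed by an unrelated `<ul>`.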
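Posted on 2017-02-26 16:55:45
A third option is PyQuery. It is fast and uses the same selectors as jQuery. You can find the selectors easily with the SelectorGadget extension for Chrome.
After that, all that remains is to use them.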
from pyquery import PyQuery as pq
import requests
url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
doc = pq(content)
respondent = doc(".formatted_content ul").text()
print(respondent)
Output:
“Demand very steady to start the year.” (Chemical Products) “January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products) “Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics & Rubber Products) “Business looks stronger moving into the first quarter of 2017.” (Primary Metals) “Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage & Tobacco Products) “Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery) “Year starting on pace with Q4 2016.” (Transportation Equipment) “Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing) “Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum & Coal Products) “Steady demand from automotive.” (Fabricated Metal Products)
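Because `.text()` joins every list item into one string, the individual comments have to be separated again before they can be used. One possible way, sketched here with the standard library only, is a regular expression that pairs each quoted comment with the industry in parentheses. The names `PAIR` and `split_quotes` are made up for this example, `text` is a shortened copy of the output above, and the pattern assumes curly quotes around each comment and no parentheses inside an industry name.

```python
import re

# Shortened copy of the PyQuery output shown above.
text = (
    "“Demand very steady to start the year.” (Chemical Products) "
    "“Year starting on pace with Q4 2016.” (Transportation Equipment) "
    "“Steady demand from automotive.” (Fabricated Metal Products)"
)

# Group 1: the comment between curly quotes; group 2: the industry in parentheses.
PAIR = re.compile(r"“(.+?)”\s*\(([^)]+)\)")


def split_quotes(joined_text):
    """Return a list of (comment, industry) tuples found in the joined text."""
    return PAIR.findall(joined_text)


for comment, industry in split_quotes(text):
    print(f"{industry}: {comment}")
```

An alternative that avoids the regular expression would be to select the `li` elements themselves and read them one at a time, which PyQuery supports through `.items()`; that keeps the structure the page already provides.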
https://stackoverflow.com/questions/42470597