Q: Using BeautifulSoup, how do I target specific items in a paragraph?

Stack Overflow user

Asked 2017-06-12 01:46:36

1 answer, 30 views, 0 followers, 0 votes

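I am having trouble extracting the information I need from this page: http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264

Ideally, I would like to get the name of each school and the value of its gift.

For example: "California Institute of Technology: from the Gordon and Betty Moore Foundation, $600 million, comprising $300 million over 5 years and $300 million over 10 years; cash and stock; 2001*."

Ideal output: California Institute of Technology, $600 million

(separated by a comma)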

python-3.x

beautifulsoup

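1 Answer

Stack Overflow user

Answered 2017-06-12 02:27:29

You can do this with BeautifulSoup and regular expressions.

BeautifulSoup is a Python library for parsing HTML.

Regular expressions let you search a string for particular patterns.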
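As a quick illustration of the pattern-matching idea, here is a minimal sketch that runs `re.findall` on a sentence shaped like the Caltech example from the question. The sample string and the variable names are assumptions for illustration, not text taken from the page itself.

```python
import re

# Hypothetical sample text, modelled on the example in the question.
sample = ("California Institute of Technology: from the Gordon and Betty "
          "Moore Foundation, $600 million, comprising $300 million over "
          "5 years and $300 million over 10 years; cash and stock; 2001.")

# \$ matches a literal dollar sign; ([0-9]+) captures the digits after it.
amounts = re.findall(r'\$([0-9]+)', sample)
print(amounts)
```

This prints `['600', '300', '300']`. Numbers without a dollar sign in front, such as the year, are not matched.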

from bs4 import BeautifulSoup
import re
import urllib.request

link = 'http://www.chronicle.com/article/Major-Private-Gifts-to-Higher/128264'
# Send a browser-like User-Agent in case the site rejects urllib's default one.
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = BeautifulSoup(sauce, 'html.parser')

university = {}

for x in soup.find_all('p'):
    name_tag = x.find('strong')
    if name_tag is not None:
        name = name_tag.text
        t = x.text
        # Capture the digits that follow each dollar sign. Using + rather
        # than * avoids empty matches when a '$' is not followed by a digit.
        m = re.findall(r'\$([0-9]+)', t)
        if m:
            # A paragraph may list more than one gift value.
            # For example, Caltech has three: ['600', '300', '300'].
            # This can be handled in two ways:
            # either print the first value with m[0],
            # or take the largest with max(m, key=int), since m holds strings.
            print(name + ', ' + m[0])
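One caveat when taking the largest value: `re.findall` returns strings, so a plain `max(m)` compares them character by character rather than numerically. The sketch below shows the difference; the helper name `largest_amount` and the sample text are hypothetical and only there to illustrate the point.

```python
import re

def largest_amount(text):
    """Return the largest dollar figure in text as an int, or None if there is none."""
    found = re.findall(r'\$([0-9]+)', text)
    if not found:
        return None
    # Convert to int before comparing; as strings, '600' sorts above '1000'.
    return max(int(value) for value in found)

# Hypothetical paragraph text with two figures of different lengths.
text = "Example University: $600 million pledged, $1000 million in total."
string_max = max(re.findall(r'\$([0-9]+)', text))
print(string_max)            # string comparison picks '600'
print(largest_amount(text))  # numeric comparison picks 1000
```

For the Caltech paragraph both approaches happen to agree, because all three values have the same number of digits.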

1 vote

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/44490353
