文章/答案/技术大牛

发布

社区首页 >问答首页 >在python中使用regex获取多个重复行

问在python中使用regex获取多个重复行
EN

Stack Overflow用户

提问于 2019-03-24 15:12:11

回答 2查看 34关注 0票数 0

我是RegEx新手，有一个非常大的文本文件，其中的一小部分如下所示：

<div class="hbk-preamble " id="preamble-APG5180">
<div class="hbk-preamble-entry">
<div class="hbk-preamble-icon hbk-preamble-icon_mode"></div>
<p class="hbk-preamble-heading">Offered</p>
<p><a href="index-bylocation-city-melbourne.html">City (Melbourne)</a></p><ul class="hbk-preamble-list__offerings"><li>Summer semester A 2019 (Flexible)</li></ul><p><a href="index-bylocation-clayton.html">Clayton</a></p><ul class="hbk-preamble-list__offerings"><li>First semester 2019 (On-campus)</li></ul>
</div>
</div>
<div class="notes">
<p class="hbk-heading hdg_6">Notes</p>
<p></p><ul>
<li>The unit may be offered as part of the <a class="hbk-screen-url" href="http://www.monash.edu/students/courses/arts/summer-program.html">Summer Arts Program</a><span class="hbk-print-url">Summer Arts Program (<a href="http://www.monash.edu/students/courses/arts/summer-program.html">http://www.monash.edu/students/courses/arts/summer-program.html</a>)</span>.</li>
<li>For more information please visit the <a class="hbk-screen-url" href="https://www.anzsog.edu.au/">ANZSOG webpage</a><span class="hbk-print-url">ANZSOG webpage (<a href="https://www.anzsog.edu.au/">https://www.anzsog.edu.au/</a>)</span>.</li>
</ul>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
<ol princestart="0" start="1" type="1">

我想使用RegEx来只获取其中的“概要”文本：

The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.

我需要文本文件中每个部分的概要文本，我应该怎么做？

到目前为止，我已经使用read和readline读入了我的文本文件，但是我不能建立一个模式来开始。

python

regex

回答 2

Stack Overflow用户

发布于 2019-03-24 15:57:52

我将从不直接回答你的问题开始。我假设你的问题是一个X-Y problem。在您的例子中，您必须处理HTML，因此您有很多强大的工具来处理它。

看看Python的BeautifulSoup：

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

然后，您可以从该soup中提取所需的任何内容。

现在根据您的问题，如果您仍然想使用正则表达式，您可以使用https://regex101.com来帮助您：

演示：https://regex101.com/r/AcozoW/1

<p.*?Notes.*?<li>(.+?)<\/li>

票数 1

Stack Overflow用户

发布于 2019-03-24 15:59:33

我会推荐这个套餐做这件事。您可以尝试如下所示：

import requests
from bs4 import BeautifulSoup
data = requests.get('put website address here')
soup = BeautifulSoup(data.text, 'html.parser')
for i in soup.find_all('h2', {'class':'hbk-heading'}):
    print(i.text.strip())

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/55321503

复制

相似问题

问在python中使用regex获取多个重复行
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在python中使用regex获取多个重复行EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在python中使用regex获取多个重复行
EN