Q: Scraping webpage sentences beginning with certain word

Stack Overflow user

Asked 2013-07-06 21:16:35

2 answers · 1.6K views · 0 followers · 1 vote

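I'm scraping a web page with BeautifulSoup and Python and want to grab every sentence on the page that begins with "Text Start:", as in the markup below. Each sentence also ends with a comma followed by a date in month-day format (5-4 below). There are many such instances, so I'd like to go through the page and return every sentence that begins with "Text Start:".

I've been trying to do this with the BeautifulSoup package but have run into trouble. I think I should be using regular expressions, so I've been trying that, but haven't really made any progress.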

<div class="class">
<div class="time">
<span class="date">07/02/13</span>
<span class="sep">|</span>
<span class="duration">02:15</span>
<div class="clear"></div>
</div>
Text Start: This text changes each time, 5-4
</div>

python

regex

beautifulsoup

2 Answers

Stack Overflow user

Accepted answer

Posted 2013-07-06 21:17:21

Use a regular expression to match specific text contents:

import re

soup.find_all(text=re.compile(r'^\s*Text Start:.*'))

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="class">
... <div class="time">
... <span class="date">07/02/13</span>
... <span class="sep">|</span>
... <span class="duration">02:15</span>
... <div class="clear"></div>
... </div>
... Text Start: This text changes each time, 5-4
... </div>
... ''')
>>> import re
>>> soup.find_all(text=re.compile(r'^\s*Text Start:.*'))
[u'\nText Start: This text changes each time, 5-4\n']
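The matched string still carries the surrounding newlines and the trailing date. As a rough sketch of a follow-up step, assuming Python 3 and the "sentence, month-day" layout described in the question, the pieces could be separated with a second regular expression. The name `split_sentence` is a hypothetical helper, not part of BeautifulSoup.

```python
import re

# "Text Start:" prefix, the sentence, a comma, then a month-day date.
# The date format (digits-digits) is an assumption based on the "5-4" sample.
PATTERN = re.compile(
    r'^\s*Text Start:\s*(?P<sentence>.*),\s*(?P<date>\d{1,2}-\d{1,2})\s*$'
)

def split_sentence(text):
    """Return (sentence, date) for a matched text node, or None if it does not fit."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group('sentence'), m.group('date')

# The list below stands in for the result of the find_all() call.
matches = ['\nText Start: This text changes each time, 5-4\n']
print([split_sentence(t) for t in matches])
```

Because `.*` is greedy, a sentence that itself contains commas is kept whole and only the last comma before the date is treated as the separator.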

Votes: 2

Stack Overflow user

Posted 2013-07-07 01:23:28

You don't need BeautifulSoup for such a simple task. It's much slower than using a regular expression directly.

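Do re.findall(r'^\s*Text Start:.*', page, re.MULTILINE). The re.MULTILINE flag is needed so that ^ matches at the start of every line rather than only at the start of the page.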
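A small sketch of how the `^` anchor behaves with `re.findall`, assuming Python 3. The sample `PAGE` is a shortened copy of the markup from the question, and `find_sentences` is a hypothetical wrapper used only for illustration.

```python
import re

# Sample markup adapted from the question.
PAGE = '''<div class="class">
<div class="time">
<span class="date">07/02/13</span>
</div>
Text Start: This text changes each time, 5-4
</div>
'''

def find_sentences(page, flags=0):
    """Return every line-anchored match of the 'Text Start:' pattern."""
    return re.findall(r'^\s*Text Start:.*', page, flags)

# Default flags: ^ anchors only at the very start of the string.
print(find_sentences(PAGE))
# re.MULTILINE: ^ also anchors right after each newline.
print(find_sentences(PAGE, re.MULTILINE))
```

Since a page normally starts with markup rather than with the sentence itself, the first call finds nothing here, while the second returns the sentence line.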

When scraping a web page, I find it useful to know exactly what the page contains. Personally, I do this:

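# Python 2 code: httplib became http.client in Python 3,
# and print became a function.
from httplib import HTTPConnection

# Open a connection to the host with a 3-second timeout.
hypr = HTTPConnection(host='stackoverflow.com',
                      timeout=3)
# Path of the page to fetch.
rekete = ('/questions/17503336/'
          'scraping-webpage-sentences-'
          'beginning-with-certain-word')

hypr.request('GET', rekete)
page = hypr.getresponse().read()

# Print every line with its index; %r exposes whitespace and line endings.
print '\n'.join('%d %r' % (i, line)
                for i, line in enumerate(page.splitlines(True)))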

This displays:

0 '<!DOCTYPE html>\r\n'
1 '<html>\r\n'
2 '<head>\r\n'
3 '        \r\n'
4 '    <title>python - Scraping webpage sentences beginning with certain word - Stack Overflow</title>\r\n'
5 '    <link rel="shortcut icon" href="https://cdn.sstatic.net/stackoverflow/img/favicon.ico">\r\n'
6 '    <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png">\r\n'
7 '    <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">\r\n'
8 '    \r\n'
9 '    <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>\r\n'
10 '    <script type="text/javascript" src="https://cdn.sstatic.net/js/stub.js?v=d2c9bad99c24"></script>\r\n'
11 '    <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/stackoverflow/all.css?v=2079d4ae31a4">\r\n'
12 '    \r\n'
13 '    <link rel="canonical" href="http://stackoverflow.com/questions/17503336/scraping-webpage-sentences-beginning-with-certain-word">\r\n'
14 '    <link rel="alternate" type="application/atom+xml" title="Feed for question \'Scraping webpage sentences beginning with certain word\'" href="/feeds/question/17503336">\r\n'
15 '    <script type="text/javascript">\r\n'
16 '        \r\n'
17 '        StackExchange.ready(function () {\r\n'
18 '            StackExchange.using("postValidation", function () {\r\n'
19 "                StackExchange.postValidation.initOnBlurAndSubmit($('#post-form'), 2, 'answer');\r\n"
20 '            });\r\n'
21 '\r\n'
22 '            \r\n'
23 "            StackExchange.question.init({showAnswerHelp:true,totalCommentCount:0,shownCommentCount:0,highlightColor:'#F4A83D',backgroundColor:'#FFF',questionId:17503336});\r\n"
24 '\r\n'
25 '            styleCode();\r\n'
26 '\r\n'
27 "                StackExchange.realtime.subscribeToQuestion('1', '17503336');\r\n"
28 '            \r\n'
29 '                    });\r\n'
30 '    </script>\r\n'
31 '\r\n'

etc etc
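The same inspection step can be tried without a network request. This is a minimal sketch, assuming Python 3 and a page already held in a string; `numbered_lines` is a hypothetical helper name and `sample` is a made-up three-line input.

```python
def numbered_lines(page):
    """Return 'index repr(line)' strings, keeping line endings visible."""
    # splitlines(True) keeps the trailing '\r\n' or '\n' on each line,
    # and %r shows those characters as escapes instead of rendering them.
    return ['%d %r' % (i, line)
            for i, line in enumerate(page.splitlines(True))]

sample = '<!DOCTYPE html>\r\n<html>\r\n<head>\r\n'
print('\n'.join(numbered_lines(sample)))
```

Seeing the exact whitespace and line endings this way helps when deciding how a pattern such as `^\s*Text Start:` should be anchored.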

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/17503336
