Q: Scrapy body text only

Stack Overflow user

Asked 2011-03-22 18:59:57

Answers: 2, Views: 8.7K, Followers: 0, Votes: 9

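I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.

I hope someone here can help me scrape all the text from the <body> tag.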

python

scrapy

scrape

scraper

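2 Answers

Stack Overflow user

Accepted answer

Posted 2011-03-22 19:11:46

Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body>? (This assumes it is nested in <html>.) It might be even simpler to use the //body selector: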

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides in the Scrapy documentation.

Votes: 4

Stack Overflow user

Posted 2012-06-09 10:50:29

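It would be nice to get output like that of lynx -nolist -dump, which renders the page and then dumps the visible text. I came close to this by extracting the text of all descendants of the paragraph elements.

I started with //body//text(), which extracts all text nodes inside the body, but that includes the contents of script elements. //body//p gets all paragraph elements in the body, including the implied paragraph tags around untagged text. Extracting the text with //body//p/text() loses the text inside child tags (such as bold, italic, and span). As long as the page has no script tags embedded in paragraphs, //body//p//text() seems to get most of what you want.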
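For comparison, here is a minimal sketch of the same idea (body text without script content) that uses only Python's standard library instead of Scrapy. The class name BodyTextExtractor, the helper body_text, and the sample markup are all hypothetical and chosen for illustration; this is not the approach the answer itself uses. In XPath 1.0, a predicate such as //body//text()[not(ancestor::script)] expresses a similar filter, though it was not tried against the page in this thread.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep raw text only while inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Concatenate the text nodes, then collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample page, not taken from the original thread.
SAMPLE = """<html><head><title>Ignored</title></head>
<body><p>Hello <b>world</b>.</p>
<script>var hidden = 1;</script>
<p>Second paragraph.</p></body></html>"""

print(body_text(SAMPLE))
```

Unlike lynx, this sketch does not render the page, so adjacent block elements with no whitespace between them in the markup would run together.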

In XPath, / selects direct children, while // selects all descendants.
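The difference between direct children and all descendants can be illustrated without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not from the original page).
DOC = """<html><body>
<p>Top-level <b>bold</b> paragraph.</p>
<div><p>Nested paragraph.</p></div>
</body></html>"""

body = ET.fromstring(DOC).find("body")

# "p" matches only <p> elements that are direct children of <body>.
direct = body.findall("p")

# ".//p" matches <p> elements at any depth below <body>.
nested = body.findall(".//p")

first = direct[0]

# .text holds only the text before the first child tag. XPath p/text()
# would also return the text after </b>, but both skip the text inside <b>.
shallow_text = first.text

# itertext() walks every descendant text node, comparable to p//text().
deep_text = "".join(first.itertext())

print(len(direct), len(nested))
print(repr(shallow_text))
print(repr(deep_text))
```

The nested <p> inside <div> is found only by the descendant search, and the word inside <b> appears only in the descendant text.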

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'<body>',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'<body>',
u"? (assuming it's nested in ",
u'<html>',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',
...]

Join the strings together with spaces and you get pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  <body>  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  <body> ? (assuming it's nested in  <html> ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"
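The joined output above still contains doubled spaces, non-breaking spaces (\xa0), and a stray \r\n followed by indentation. One possible cleanup, sketched here in plain Python 3, is to split on whitespace after joining. The function name join_text_nodes is hypothetical, and the sample list holds only a few fragments adapted from that output.

```python
def join_text_nodes(nodes):
    """Join text nodes with spaces and collapse runs of whitespace."""
    # str.split() with no arguments splits on any run of whitespace,
    # including "\r\n" and the non-breaking space "\xa0", and drops
    # the empty pieces, so a second join yields single spaces.
    return " ".join(" ".join(nodes).split())


# A few fragments adapted from the extracted text nodes.
nodes = [
    "Wishing some scholars might be able to help me here scraping all the text from the ",
    "<body>",
    " tag.",
    "about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged",
]

print(join_text_nodes(nodes))
```

This only normalizes spacing. It does not remove the space that the join places before punctuation following an inline element, such as the one before the question mark in the output above.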

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/5390133
