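I am trying to scrape only the text from the body of a page using Python Scrapy, but haven't had any luck yet.
I'm hoping someone here can help me scrape all the text inside the <body> tag.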
Posted on 2011-03-22 19:11:46
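Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector: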
x.select("//body").extract()  # extract body
You can find more information about the selectors Scrapy provides in the Scrapy selectors documentation.
Posted on 2012-06-09 10:50:29
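It would be nice to get output like that of lynx -nolist -dump, which renders the page and then dumps the visible text. I've come close by extracting the text of everything nested inside paragraph elements.
I started with //body//text(), which pulls all the text nodes inside the body, but that includes the contents of script elements. //body//p gets all the paragraph elements in the body, including the implied paragraph tags around untagged text. Extracting the text with //body//p/text() misses text inside child tags (such as bold, italic, and span). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.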
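The script-element problem described above can also be handled by filtering while parsing. The following is only a minimal sketch using Python's standard-library html.parser rather than Scrapy; the class name VisibleTextParser, the helper visible_body_text, and the sample HTML are all made up for illustration. It does not render the page the way lynx does, it simply skips the contents of <script> and <style>.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def visible_body_text(html):
    """Return the body text of an HTML string as one space-joined line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, not taken from the question.
sample = (
    "<html><head><title>Ignored</title></head><body>"
    "<p>Hello <b>bold</b> world.</p>"
    "<script>var x = '<p>not text</p>';</script>"
    "<p>Second paragraph.</p></body></html>"
)
print(visible_body_text(sample))
```

Because each text node is joined with a space, punctuation that directly follows an inline tag ends up separated from it, which is the same trade-off as joining the extracted strings with ' '.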
In XPath, / selects direct children, while // selects descendants at any depth.
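To make the difference concrete, here is a small sketch that imitates those expressions with the standard-library xml.etree.ElementTree instead of Scrapy. ElementTree supports only a subset of XPath and has no text() step, so the helpers direct_text and descendant_text are hypothetical stand-ins for p/text() and p//text(), and the sample markup is invented and must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
doc = ET.fromstring(
    "<html><body>"
    "<p>Plain <b>bold</b> and <i>italic</i> text.</p>"
    "<div><p>Nested paragraph.</p></div>"
    "</body></html>"
)
body = doc.find("body")

# Like /html/body/p: only <p> elements that are direct children of <body>.
direct_paragraphs = body.findall("p")

# Like //body//p: <p> elements at any depth below <body>.
all_paragraphs = body.findall(".//p")


def direct_text(element):
    """Approximate p/text(): text nodes that are direct children of element."""
    parts = [element.text or ""]
    parts += [child.tail or "" for child in element]
    return [part for part in parts if part]


def descendant_text(element):
    """Approximate p//text(): every text node at any depth below element."""
    return list(element.itertext())


first = all_paragraphs[0]
print(len(direct_paragraphs), len(all_paragraphs))
print(direct_text(first))      # misses the text inside <b> and <i>
print(descendant_text(first))  # includes the text inside <b> and <i>
```

The direct-children version drops "bold" and "italic", which is the loss described for //body//p/text().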
% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()
Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'<body>',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'<body>',
u"? (assuming it's nested in ",
u'<html>',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',
...]
Join the strings together with a space and you get pretty good output:
In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the <body> tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the /html/body path to extract <body> ? (assuming it's nested in <html> ). It might be even simpler to use the //body selector: You can find more information about the selectors Scrapy provides here . This is a collaboratively edited question and answer site for professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n tagged asked 1 year ago viewed 280 times active 1 year ago"
https://stackoverflow.com/questions/5390133