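I am using scrapy and lxml-3.2.4 to scrape a few newspaper articles. These articles sometimes contain hyperlinks, which sit in different nodes of the page than the rest of the text. Here is a link to one such article: 1.html
I want to extract the content of the article, and wrote the following code for that: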
hxs = Selector(response)
detailsPath = hxs.xpath('//*[@class="articleContentBox"]')
textall = detailsPath.xpath('//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*')
for text in textall:
    contents = text.xpath('text()').extract()
    for content in contents:
        data.append(unicodedata.normalize('NFKD', content).encode('ascii', 'ignore'))
finaltext = "\n".join(data)
I expect the content of the article to look like this:
Bangalore-based information technology (IT) services firm Wipro is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.
Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.
“Campus hiring has always been strategic to Wipro’s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,” added Kumar.
According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with BITS Pilani and Vellore Institute of Technology. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.
“These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,” said Kumar.
Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. “They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000–6,00,000 per annum,” he said. The average salary an entry level engineer draws is around Rs 3,00,000–Rs 3,50,000 per annum.
“Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,” said Kumar.
But the content actually comes out like this (the text from the hyperlinks ends up at the end):
Bangalore-based information technology (IT) services firm
is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.
Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.
Campus hiring has always been strategic to Wipros hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy, added Kumar.
According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with
and
. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.
These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure, said Kumar.
Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,0006,00,000 per annum, he said. The average salary an entry level engineer draws is around Rs 3,00,000Rs 3,50,000 per annum.
Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations, said Kumar.
Wipro
BITS Pilani
Vellore Institute of Technology
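Please suggest a way (preferably in Python) to extract the elements in the order in which they appear, so that this problem goes away. Thanks in advance.
Posted on 2014-04-18 10:52:05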
If you check the output of the XPath //*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*, you will notice that descendant-or-self::* selects:
<div itemscope itemtype="http://schema.org/Article"> (because of -or-self)
<p itemprop="articleBody"> (a descendant of the div above)
<a class="storyTags" href="... elements, which are descendants of the p (and of the div)
br elements
Using scrapy shell http://www.business-standard.com/article/companies/wipro-on-a-major-recruitment-drive-113122300827_1.html:
>>> import pprint
>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*'))
[<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<div itemscope itemtype="http://schema.o'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<p itemprop="articleBody">\r\n \r\n '>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>]
>>>
Applying .xpath('text()') then extracts the child text nodes of each of these elements.
The div has only whitespace text nodes:
>>> sel.xpath('//*[@class="colL_MktColumn2"]/div/div/self::*/text()').extract()
[u'\r\n ', u'\r\n ']
>>>
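The p has most of the content you want, but notice that the text inside the links is missing (the text of a link is a child text node of the a element, not of the p):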
>>> import pprint
>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div/p/text()').extract())
[u'\r\n \r\n \r\nBangalore-based information technology (IT) services firm ',
u' is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.',
u'\r\n',
u'\r\nLast year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.',
u'\r\n',
u'\r\n\u201cCampus hiring has always been strategic to Wipro\u2019s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,\u201d added Kumar.',
u'\r\n',
u'\r\nAccording to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with ',
u' and ',
u'. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.',
u'\r\n',
u'\r\n\u201cThese are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,\u201d said Kumar.',
u'\r\n',
u'\r\nKumar said candidates who have completed the programmes would draw more salary than that of an engineer. \u201cThey get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000\u20136,00,000 per annum,\u201d he said. The average salary an entry level engineer draws is around Rs 3,00,000\u2013Rs\xa0 3,50,000 per annum.',
u'\r\n',
u'\r\n\u201cOur experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,\u201d said Kumar.']
>>>
Finally, the text nodes of the a elements:
>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div//a/text()').extract())
[u'Wipro', u'BITS Pilani', u'Vellore Institute of Technology']
>>>
The br elements have no child text nodes:
>>> sel.xpath('//*[@class="colL_MktColumn2"]/div/div//br/text()').extract()
[]
>>>
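Putting these observations together explains the ordering in the question: the loop visits the elements in document order and, for each one, emits only its own direct text nodes, so all of the p's text comes out before the text of the a elements. Below is a minimal, standard-library sketch of that behaviour. The markup in SNIPPET is a made-up miniature of the article, and direct_text_nodes is a hypothetical helper that only approximates XPath text() with ElementTree's text/tail model; it is not how Scrapy or lxml evaluate the expression.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed miniature of the article markup (an assumption).
SNIPPET = (
    '<div><p>services firm <a>Wipro</a> is hiring with '
    '<a>BITS Pilani</a>.</p></div>'
)


def direct_text_nodes(element):
    """Approximate XPath text(): only the element's own text children."""
    nodes = [element.text] if element.text else []
    # In ElementTree, text that follows a child is stored as that child's tail.
    nodes += [child.tail for child in element if child.tail]
    return nodes


root = ET.fromstring(SNIPPET)

# Same shape as the loop in the question: every element in document order,
# then that element's direct text nodes.
per_element = [t for el in root.iter() for t in direct_text_nodes(el)]
print(per_element)
# ['services firm ', ' is hiring with ', '.', 'Wipro', 'BITS Pilani']
```

The link texts land at the end because the a elements are visited after the p, even though they appear in the middle of the p's text.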
One solution is to use string() to extract the text representation of <p itemprop="articleBody">:
>>> print(sel.xpath('string(//*[@class="colL_MktColumn2"]/div/div/p)').extract()[0])
Bangalore-based information technology (IT) services firm Wipro is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.
Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.
“Campus hiring has always been strategic to Wipro’s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,” added Kumar.
According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with BITS Pilani and Vellore Institute of Technology. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.
“These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,” said Kumar.
Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. “They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000–6,00,000 per annum,” he said. The average salary an entry level engineer draws is around Rs 3,00,000–Rs 3,50,000 per annum.
“Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,” said Kumar.
>>>
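For comparison, the same idea can be sketched with only the standard library: ElementTree's itertext() walks an element and all of its descendants in document order and yields their text, which is roughly what string() does in XPath. The markup in SNIPPET is a made-up miniature of the article, and ElementTree needs well-formed XML, so for real pages the string() call in the Scrapy shell above (or lxml, which offers an itertext() method as well) is likely the more practical route.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed miniature of the article paragraph (an assumption).
SNIPPET = (
    '<p itemprop="articleBody">services firm '
    '<a class="storyTags">Wipro</a>'
    ' is on a major recruitment drive.</p>'
)

paragraph = ET.fromstring(SNIPPET)

# itertext() yields the paragraph's text and the link's text in document order.
full_text = "".join(paragraph.itertext())
print(full_text)
# services firm Wipro is on a major recruitment drive.
```

Because the text is gathered in a single pass over the paragraph, the link text stays where it appears in the sentence instead of being appended at the end.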
https://stackoverflow.com/questions/23151356