I have no experience navigating XML/HTML with XPath or regex. I have a set of HTML documents from LexisNexis in the following format:
<HTML>
<HEAD>
<STYLE TYPE="text/css"><!--
.c0 { text-align: center; }
.c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; }
.c2 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
.c3 { text-align: center; margin-left: 13%; margin-right: 13%; }
.c4 { text-align: left; }
.c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; }
.c6 { font-family: 'Times New Roman'; font-size: 14pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
.c7 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
.c8 { text-align: left; margin-top: 1em; margin-bottom: 0em; }
.c9 { page-break-before: always; }
.c10 { font-family: 'Times New Roman'; font-size: 10pt; font-style: italic; font-weight: normal; color: #000000; text-decoration: none; }
.c11 { border-collapse: collapse; table-layout: auto; width:100%; }
.c12 { width: 480pt; }
.c13 { text-align: left; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
.c14 { font-family: 'Courier New',Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
.c15 { width: 120pt; }
.c16 { text-align: right; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
.c17 { text-align: right; margin-top: 0em; margin-bottom: 0em; }
.c18 { text-align: center; margin-left: 5%; margin-right: 5%; }
.c19 { margin-left: 30pt; margin-right: 0pt; margin-top: 0em; margin-bottom: 0em; list-style: none; }
.c20 { margin-left: 0pt; margin-right: 0pt; }
.c21 { margin-top: 0em; margin-bottom: 0em; }
.c22 { text-align: left; margin-left: 30pt; margin-top: -12pt; }
--></STYLE>
<!-- LXNComment 2826:543743167 -->
<TITLE> </TITLE>
<META TOPIC="null" DOCUMENTS="500" UPDATED="Tuesday, January 05, 2016 18:08:34 EST" /></HEAD>
<BODY>
<A NAME="DOC_ID_0_0"></A><!-- Hide XML section from browser
<DOC NUMBER=1>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 1301 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
</DIV>
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">August 2, 2001 Thursday</SPAN><SPAN CLASS="c2"> </SPAN><SPAN CLASS="c2"> <BR>City Edition</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Class counts, not race</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">BUTCH MABIN, Lincoln Journal Star</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">1779 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">Lincoln, NE </SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">Study says geography plays role </SPAN></P>
<P CLASS="c8"><SPAN CLASS="c2"> The battle lines dividing both sides of the death penalty debate came into sharp focus with Wednesday's release of a comprehensive study examining the fairness of capital punishment in Nebraska. (cut out the remaining body of text)</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">August 11, 2005</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">A divided time: The Sept. 2, 1994, execution of Harold Otey (above and below) drew more than 1,000 spectators to the Nebraska State Penitentiary - many of them with sharply opposing views of capital punishments. JOURNAL STAR FILE PHOTOS (one photo archived) 3 b/w head photos of Harold Otey, John Joubert and Robert Williams. (photo of Williams not archived)</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2001 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c9"> </DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 1301 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
</DIV>
<BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">February 8, 2004 Sunday</SPAN><SPAN CLASS="c2"> </SPAN><SPAN CLASS="c2"> <BR>City Edition</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Death penalty at crossroads</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">JOE DUGGAN, LINCOLN JOURNAL STAR</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">2493 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">LINCOLN, NE </SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">A legislative bill on lethal injection, U.S. Supreme Court caseand constitutional appeals may affect the future of Nebraska's seven death-row inmates. (cut out the remaining body of text)</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">July 13, 2007</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">1. Nebraska is the only state in the nation to have the electric chair as the sole means of execution, and some wonder whether the law would survive an Eighth Amendment challenge that it is cruel and unusual punishment. 2. Seven inmates are in death row at the Nebraska State Correctional Institution in Tecumseh. 3. Marylyn Felion's portrait of Robert E. Williams, who was executed in 1997. 7 color head photos and statistics of Carey Dean Moore, Charles Jess Palmer, Michael Ryan, John Lotter, David Dunster, Raymond Mata Jr. and Arthur Lee Gales. color head photo of Summerlin JOURNAL STAR FILE PHOTO</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2004 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
</BODY></HTML>
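Now I would like to extract the date of each document, and I tried to follow the guidance given in this now closed question. However, the suggestions there seem to rely on labels (such as "SECTION:"), and the only date label I have is "LOAD-DATE:", which is not always the same as the actual date in the document header. Even so, trying the suggested expression returns nothing: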
> ex <- htmlTreeParse("~/Desktop/example.html", encoding="UTF-8")
> example <- xmlRoot(ex)
> xpathSApply(example, "//DOCFULL/*/*/span[text()='SECTION: ']/..", xmlValue)
NULL
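How can I fix this expression to extract the load date or, better, the actual date of each document?
Is it also possible to account for documents with a missing date, i.e. mark them as NA?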
Posted on 2016-01-09 05:27:03
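Just drop DOCFULL/* and simplify the XPath. In this file the <DOCFULL> tags sit inside HTML comments ("Hide XML section from browser"), so the parsed tree contains no DOCFULL element for the expression to match.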
xpathSApply(example, "//span[text()='SECTION: ']/..", xmlValue)
[1] "SECTION: A; Pg. 1" "SECTION: A; Pg. 1"
xpathSApply(example, "//div[@class='c3']/p[@class='c1']/span[@class='c2'][1]", xmlValue)
[1] "August 2, 2001 Thursday" "February 8, 2004 Sunday"
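The same label-based pattern can be pointed at the load date. The following is a minimal sketch, not part of the original answer: it assumes the label span reads "LOAD-DATE:" (matched with normalize-space so the trailing blank does not matter) and that the value is the next sibling span, as in the sample above. The inline HTML string is a cut-down stand-in for the real file.

```r
library(XML)

# Cut-down stand-in for two LexisNexis documents (assumed structure,
# copied from the sample in the question).
html <- paste0(
  '<html><body>',
  '<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span>',
  '<span class="c2">August 11, 2005</span></p></div>',
  '<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span>',
  '<span class="c2">July 13, 2007</span></p></div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Find each label span, then take the span that immediately follows it.
load_dates <- xpathSApply(
  doc,
  "//span[normalize-space(.) = 'LOAD-DATE:']/following-sibling::span[1]",
  xmlValue
)
print(load_dates)
```

The resulting strings can then be passed to as.Date(load_dates, format = "%B %d, %Y"); note that parsing month names depends on the current locale.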
If a node is missing a label, there are many ways to add NA; this is a common problem.
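One possible way to do this, sketched here under assumptions rather than taken from the original answer: treat each <a name="DOC_ID_..."> anchor as the start of a document, select only the date block that has exactly i such anchors before it, and return NA when nothing matches. The helper name first_or_na and the inline HTML are hypothetical; the class name c3 comes from the sample and may differ between exports.

```r
library(XML)

# Stand-in for the export: the second document has no header date block.
html <- paste0(
  '<html><body>',
  '<a name="DOC_ID_0_0"></a>',
  '<div class="c3"><p class="c1"><span class="c2">August 2, 2001 Thursday</span></p></div>',
  '<div class="c4"><p class="c5"><span class="c6">Class counts, not race</span></p></div>',
  '<a name="DOC_ID_0_1"></a>',
  '<div class="c4"><p class="c5"><span class="c6">Death penalty at crossroads</span></p></div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Number of documents = number of DOC_ID anchors.
n_docs <- length(getNodeSet(doc, "//a[starts-with(@name, 'DOC_ID_')]"))

# XPath template: a c3 block preceded by exactly i DOC_ID anchors
# belongs to document i.
date_xp <- paste0(
  "//div[@class='c3']",
  "[count(preceding-sibling::a[starts-with(@name, 'DOC_ID_')]) = %d]",
  "/p/span[1]"
)

# Hypothetical helper: first match for document i, or NA if there is none.
first_or_na <- function(i) {
  hit <- xpathSApply(doc, sprintf(date_xp, i), xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

dates <- vapply(seq_len(n_docs), first_or_na, character(1))
print(dates)
```

Because the result always has one entry per document, it lines up with other per-document fields when combined into a data frame.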
https://stackoverflow.com/questions/34656667