Locating LexisNexis metadata with XPath in R

Stack Overflow user
Asked 2016-01-07 21:42:42
1 answer, 92 views, 0 followers, 0 votes

I have no experience navigating XML/HTML with XPath or regex, and I have a set of HTML documents from LexisNexis in the following format:

<HTML>
    <HEAD>
        <STYLE TYPE="text/css"><!--
        .c0 { text-align: center; }
        .c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; }
        .c2 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
        .c3 { text-align: center; margin-left: 13%; margin-right: 13%; }
        .c4 { text-align: left; }
        .c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; }
        .c6 { font-family: 'Times New Roman'; font-size: 14pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
        .c7 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
        .c8 { text-align: left; margin-top: 1em; margin-bottom: 0em; }
        .c9 { page-break-before: always; }
        .c10 { font-family: 'Times New Roman'; font-size: 10pt; font-style: italic; font-weight: normal; color: #000000; text-decoration: none; }
        .c11 { border-collapse: collapse; table-layout: auto; width:100%; }
        .c12 { width: 480pt; }
        .c13 { text-align: left; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
        .c14 { font-family: 'Courier New',Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
        .c15 { width: 120pt; }
        .c16 { text-align: right; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
        .c17 { text-align: right; margin-top: 0em; margin-bottom: 0em; }
        .c18 { text-align: center; margin-left: 5%; margin-right: 5%; }
        .c19 { margin-left: 30pt; margin-right: 0pt; margin-top: 0em; margin-bottom: 0em; list-style: none; }
        .c20 { margin-left: 0pt; margin-right: 0pt; }
        .c21 { margin-top: 0em; margin-bottom: 0em; }
        .c22 { text-align: left; margin-left: 30pt; margin-top: -12pt; }
        --></STYLE>
        <!-- LXNComment 2826:543743167 -->
        <TITLE>&nbsp;</TITLE>
        <META TOPIC="null" DOCUMENTS="500" UPDATED="Tuesday, January 05, 2016  18:08:34 EST" /></HEAD>
        <BODY>
<A NAME="DOC_ID_0_0"></A><!-- Hide XML section from browser
<DOC NUMBER=1>
    <DOCFULL> -->
        <BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 1301 DOCUMENTS</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">August 2, 2001 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>City Edition</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Class counts, not race</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">BUTCH MABIN, Lincoln Journal Star</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">1779 words</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">Lincoln, NE </SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">Study says geography plays role </SPAN></P>
            <P CLASS="c8"><SPAN CLASS="c2">  The battle lines dividing both sides of the death penalty debate came into sharp focus with Wednesday's release of a comprehensive study examining the fairness of capital punishment in Nebraska. (cut out the remaining body of text)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">August 11, 2005</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">A divided time: The Sept. 2, 1994, execution of Harold Otey (above and below) drew more than 1,000 spectators to the Nebraska State Penitentiary - many of them with sharply opposing views of capital punishments. JOURNAL STAR FILE PHOTOS (one photo archived) 3 b/w head photos of Harold Otey, John Joubert and Robert Williams. (photo of Williams not archived)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2001 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
        </DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c9">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
    <DOCFULL> -->
        <BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 1301 DOCUMENTS</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">February 8, 2004 Sunday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>City Edition</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Death penalty at crossroads</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">JOE DUGGAN, LINCOLN JOURNAL STAR</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">2493 words</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">LINCOLN, NE </SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">A legislative bill on lethal injection, U.S. Supreme Court caseand constitutional appeals may affect the future of Nebraska's seven death-row inmates. (cut out the remaining body of text)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">July 13, 2007</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">1. Nebraska is the only state in the nation to have the electric chair as the sole means of execution, and some wonder whether the law would survive an Eighth Amendment challenge that it is cruel and unusual punishment. 2. Seven inmates are in death row at the Nebraska State Correctional Institution in Tecumseh. 3. Marylyn Felion's portrait of Robert E. Williams, who was executed in 1997. 7 color head photos and statistics of Carey Dean Moore, Charles Jess Palmer, Michael Ryan, John Lotter, David Dunster, Raymond Mata Jr. and Arthur Lee Gales. color head photo of Summerlin JOURNAL STAR FILE PHOTO</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2004 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
        </DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
</BODY></HTML>

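Now I would like to extract the date of each document, and I tried to follow the guidelines provided in this now-closed question. However, the suggestions there seem to rely on labels (such as "SECTION:"), and the only date label I have is "LOAD-DATE:" (which is not always the same as the actual date in the header). Even so, the suggested expression below does not return anything: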

> library(XML)
> ex <- htmlTreeParse("~/Desktop/example.html", encoding="UTF-8")
> example <- xmlRoot(ex)
> xpathSApply(example, "//DOCFULL/*/*/span[text()='SECTION: ']/..", xmlValue)
NULL

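How can I fix this expression to extract the load date or, better, the actual date of each document?

Is it also possible to keep a running account of the documents whose date is missing (i.e. mark them with NA)?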


1 Answer

Stack Overflow user

Answered 2016-01-09 05:27:03

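Just drop the DOCFULL/* steps and simplify the XPath. In this file the DOC and DOCFULL tags appear only inside HTML comments, so the parser never creates DOCFULL elements and a path that starts at //DOCFULL has nothing to match: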

xpathSApply(example, "//span[text()='SECTION: ']/..", xmlValue)
[1] "SECTION: A; Pg. 1" "SECTION: A; Pg. 1"
xpathSApply(example, "//div[@class='c3']/p[@class='c1']/span[@class='c2'][1]", xmlValue)
[1] "August 2, 2001 Thursday" "February 8, 2004 Sunday"
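
The same label-based pattern can probably be pointed at the load date as well. The following is only a sketch on a toy string modelled on the sample above: the `html` string and the variable names are assumptions made for illustration, and real exports may use different class names or spacing after the label.

```r
library(XML)

# Toy input modelled on the sample above (assumption: real files share this layout).
html <- paste0(
  '<HTML><BODY>',
  '<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> -->',
  '<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">August 2, 2001 Thursday</SPAN></P></DIV>',
  '<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN>',
  '<SPAN CLASS="c2">August 11, 2005</SPAN></P></DIV>',
  '<!-- </DOCFULL> </DOC> -->',
  '</BODY></HTML>'
)
doc <- htmlParse(html, asText = TRUE)

# DOCFULL exists only as comment text, so no element of that name is in the tree.
n_docfull <- length(getNodeSet(doc, "//docfull | //DOCFULL"))

# The value sits in the span that directly follows the label span.
load_dates <- xpathSApply(
  doc,
  "//span[text()='LOAD-DATE: ']/following-sibling::span[1]",
  xmlValue
)
```

Here `n_docfull` should be 0, which is why the original expression found nothing, and `load_dates` should hold only the value span, without the "LOAD-DATE: " label.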

If a node is missing a label, there are a number of ways to add NAs; that is a common question.
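
One possible way to get an NA for a document without a date is to query per document instead of across the whole file. The sketch below assumes that every document starts with an `<A NAME="DOC_ID_...">` anchor, as in the sample, and that the date is the first `c2` span inside the `c3` block. `first_or_na` is a hypothetical helper written for this illustration, not part of the XML package, and the toy `html` string deliberately omits the date block from the second document.

```r
library(XML)

# Toy input (assumption): two documents marked by DOC_ID_ anchors; the second has no date block.
html <- paste0(
  '<HTML><BODY>',
  '<A NAME="DOC_ID_0_0"></A>',
  '<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">August 2, 2001 Thursday</SPAN></P></DIV>',
  '<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Class counts, not race</SPAN></P></DIV>',
  '<A NAME="DOC_ID_0_1"></A>',
  '<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Death penalty at crossroads</SPAN></P></DIV>',
  '</BODY></HTML>'
)
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: first match inside document i, or NA when nothing matches.
first_or_na <- function(doc, i, rel_path) {
  # Keep only nodes that have exactly i DOC_ID_ anchors before them in document order.
  path <- sprintf(
    "%s[count(preceding::a[starts-with(@name, 'DOC_ID_')]) = %d]",
    rel_path, i
  )
  hits <- xpathSApply(doc, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One anchor per document, so the anchor count is the document count.
n_docs <- length(getNodeSet(doc, "//a[starts-with(@name, 'DOC_ID_')]"))
dates <- vapply(
  seq_len(n_docs),
  function(i) first_or_na(doc, i, "//div[@class='c3']/p[@class='c1']/span[@class='c2']"),
  character(1)
)
```

With this toy input, `dates` should contain the first document's header date followed by `NA`, so the result stays aligned with the document order. The class names (`c3`, `c1`, `c2`) come from the sample's generated stylesheet and may differ between downloads, so they are worth checking against each file.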

Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/34656667
