
Extracting data from XML with R

Stack Overflow user
Asked 2016-07-01 12:47:59
Answers: 2, Views: 249, Followers: 0, Votes: 1

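I want to use R to extract data from the XML file below. Normally I would use the read_xml function from the xml2 package together with the %>% operator, but for some reason that does not work here. It doesn't even read the XML.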

invoices <- read_xml(doclist[i]) %>% xml_nodes("page")
invoices
{xml_nodeset (0)}

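The data I want to extract is just the text that directly follows the opening <variantText> tag, and I want to store it in a data frame. So in this example:

Klantbetaalnummer

10450320

Contactgegevens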

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="FineReader 10.0" pagesCount="2" languages="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
    <page width="2479" height="3508" resolution="300">
        <block blockType="Text" blockName="" l="292" t="108" r="590" b="194"><region><rect l="292" t="108" r="590" b="194"/></region>
            <text>
                <par align="Justified" lineSpacing="1200">
                    <line baseline="138" l="298" t="114" r="584" b="138"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8.">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="31"><variantText>Klantbetaalnummer<charParams l="0" t="0" r="0" b="0">K</charParams><charParams l="0" t="0" r="0" b="0">l</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">b</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">l</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">u</charParams><charParams l="0" t="0" r="0" b="0">m</charParams><charParams l="0" t="0" r="0" b="0">m</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">r</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="298" t="114" r="318" b="138" wordStart="1" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="6" wordPenalty="0" meanStrokeWidth="31">K</charParams>
                            <charParams l="319" t="114" r="322" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="31">l</charParams>
                            <charParams l="326" t="120" r="341" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="345" t="120" r="359" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">n</charParams>
                            <charParams l="362" t="114" r="370" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="28" wordPenalty="0" meanStrokeWidth="31">t</charParams>
                            <charParams l="373" t="114" r="388" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">b</charParams>
                            <charParams l="391" t="120" r="406" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="40" wordPenalty="0" meanStrokeWidth="31">e</charParams>
                            <charParams l="408" t="114" r="416" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="28" wordPenalty="0" meanStrokeWidth="31">t</charParams>
                            <charParams l="419" t="120" r="434" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="437" t="120" r="452" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="457" t="114" r="460" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="31">l</charParams>
                            <charParams l="464" t="120" r="478" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">n</charParams>
                            <charParams l="483" t="120" r="497" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="29" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">u</charParams>
                            <charParams l="501" t="120" r="524" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="3" wordPenalty="0" meanStrokeWidth="31">m</charParams>
                            <charParams l="529" t="120" r="552" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="3" wordPenalty="0" meanStrokeWidth="31">m</charParams>
                            <charParams l="556" t="120" r="571" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="40" wordPenalty="0" meanStrokeWidth="31">e</charParams>
                            <charParams l="575" t="120" r="584" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="4" wordPenalty="0" meanStrokeWidth="31">r</charParams></formatting><formatting lang="EnglishUnitedStates" ff="Times New Roman" fs="10."></formatting></line>
                    <line baseline="188" l="298" t="164" r="441" b="188"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8." bold="1">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="50"><variantText>10450320<charParams l="0" t="0" r="0" b="0">1</charParams><charParams l="0" t="0" r="0" b="0">0</charParams><charParams l="0" t="0" r="0" b="0">4</charParams><charParams l="0" t="0" r="0" b="0">5</charParams><charParams l="0" t="0" r="0" b="0">0</charParams><charParams l="0" t="0" r="0" b="0">3</charParams><charParams l="0" t="0" r="0" b="0">2</charParams><charParams l="0" t="0" r="0" b="0">0</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="298" t="164" r="309" b="188" wordStart="1" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="46" serifProbability="67" wordPenalty="0" meanStrokeWidth="50">1</charParams>
                            <charParams l="315" t="164" r="330" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams>
                            <charParams l="332" t="164" r="349" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">4</charParams>
                            <charParams l="352" t="164" r="367" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">5</charParams>
                            <charParams l="370" t="164" r="385" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams>
                            <charParams l="389" t="164" r="404" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="89" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">3</charParams>
                            <charParams l="407" t="164" r="422" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">2</charParams>
                            <charParams l="426" t="164" r="441" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams></formatting></line></par>
            </text>
        </block>
        <block blockType="Text" blockName="" l="1826" t="383" r="2113" b="426"><region><rect l="1826" t="383" r="2113" b="426"/></region>
            <text>
                <par align="Justified">
                    <line baseline="413" l="1832" t="389" r="2107" b="420"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8." bold="1">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="50"><variantText>Contactgegevens<charParams l="0" t="0" r="0" b="0">C</charParams><charParams l="0" t="0" r="0" b="0">o</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">c</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">g</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">g</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">v</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">s</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="1832" t="389" r="1853" b="413" wordStart="1" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="51" wordPenalty="0" meanStrokeWidth="50">C</charParams>
                            <charParams l="1856" t="395" r="1874" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">o</charParams>
                            <charParams l="1877" t="395" r="1893" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">n</charParams>
                            <charParams l="1895" t="389" r="1905" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="33" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">t</charParams>
                            <charParams l="1908" t="395" r="1924" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">a</charParams>
                            <charParams l="1926" t="395" r="1942" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="51" wordPenalty="0" meanStrokeWidth="50">c</charParams>
                            <charParams l="1944" t="389" r="1954" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="33" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">t</charParams>
                            <charParams l="1956" t="395" r="1973" b="420" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="12" wordPenalty="0" meanStrokeWidth="50">g</charParams>
                            <charParams l="1976" t="395" r="1992" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="1995" t="395" r="2012" b="420" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="12" wordPenalty="0" meanStrokeWidth="50">g</charParams>
                            <charParams l="2015" t="395" r="2031" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="2033" t="395" r="2050" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="5" wordPenalty="0" meanStrokeWidth="50">v</charParams>
                            <charParams l="2052" t="395" r="2068" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="2072" t="395" r="2088" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">n</charParams>
                            <charParams l="2091" t="395" r="2107" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="57" wordPenalty="0" meanStrokeWidth="50">s</charParams></formatting></line></par>
            </text>
        </block>
    </page>
</document>

2 Answers

Stack Overflow user

Accepted answer

Answered 2016-07-01 23:03:57

Your document has a namespace associated with it, so you need to specify the namespace in the path. Try this:

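library(xml2)
page <- read_xml("test.xml")
# check for namespaces; the default namespace is listed under the prefix d1
xml_ns(page)

# select the nodes with the namespace prefix
# (xml_find_all takes an XPath expression as its second argument)
nodes <- xml_find_all(page, ".//d1:variantText")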
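
To get from the selected nodes to the data frame the question asks for, something along these lines may help. It is only a sketch: `xml_string` is a hypothetical, trimmed-down sample that reuses the question's default namespace, and the `text()[1]` step is an assumption about what "the text after <variantText>" should mean, namely the text that precedes the first <charParams> child.

```r
library(xml2)

# Hypothetical, trimmed-down sample with the same default namespace as the question's file
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page>
    <variantText>Klantbetaalnummer<charParams>K</charParams><charParams>l</charParams>
    </variantText>
    <variantText>10450320<charParams>1</charParams><charParams>0</charParams>
    </variantText>
    <variantText>Contactgegevens<charParams>C</charParams><charParams>o</charParams>
    </variantText>
  </page>
</document>'

page <- read_xml(xml_string)

# An unprefixed name does not match elements in a default namespace,
# so this yields an empty node set, like the one in the question
unprefixed <- xml_find_all(page, ".//variantText")

# xml2 exposes the default namespace under the prefix d1;
# text()[1] keeps only the first text node of each <variantText>
text_nodes <- xml_find_all(page, ".//d1:variantText/text()[1]")

# Collect the words in a one-column data frame
result <- data.frame(
  variantText = trimws(xml_text(text_nodes)),
  stringsAsFactors = FALSE
)
result
```

Calling xml_text() directly on the <variantText> nodes would also include the letters inside the nested <charParams> children, which is why the sketch selects the first text node instead.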
Votes: 0

Stack Overflow user

Answered 2016-07-01 13:13:28

I haven't looked into why your XML isn't being read, but another solution is to use a regex.

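library(stringr)

# Assumes doclist holds the XML text itself; if it holds file paths, read the files first.
# [^<]* captures the text before the first child tag; (.*) cannot span the line break before </variantText>.
str_match_all(doclist, "<variantText>([^<]*)")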
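
A minimal sketch of the regex route on an in-memory string, assuming the XML has already been read into a single character value. `xml_string` is a hypothetical fragment that mimics the question's layout, with the closing tag on its own line. Regular expressions are fragile against XML in general, so this is best treated as a quick workaround rather than a parser.

```r
library(stringr)

# Hypothetical fragment; the closing tag sits on its own line, as in the question's file
xml_string <- paste(
  "<variantText>Klantbetaalnummer<charParams>K</charParams>",
  "</variantText>",
  "<variantText>10450320<charParams>1</charParams>",
  "</variantText>",
  sep = "\n"
)

# [^<]* stops at the first '<', so only the text before the first child tag is captured
matches <- str_match_all(xml_string, "<variantText>([^<]*)")[[1]]

# Column 1 holds the full match, column 2 the capture group
words <- matches[, 2]
words
```

str_match_all() returns one matrix per input string, which is why the sketch takes the first list element before indexing the capture-group column.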
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/38145858
