
Parsing a 10 GB XML file in R

Stack Overflow user
Asked on 2019-04-04 21:45:16
2 answers · 104 views · 0 followers · 1 vote

I have a 10 GB XML file that I need to parse. A sample of the XML structure is:

<?xml version="1.0" encoding="UTF-8"?>
<proteinAtlas xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://v18.proteinatlas.org/download/proteinatlas.xsd" schemaVersion="2.5">
    <entry version="18" url="http://v18.proteinatlas.org/ENSG00000000003">
        <name>TSPAN6</name>
        <synonym>T245</synonym>
        <synonym>TM4SF6</synonym>
        <synonym>TSPAN-6</synonym>
        <identifier id="ENSG00000000003" db="Ensembl" version="88.38">
            <xref id="O43657" db="Uniprot/SWISSPROT"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
            <proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
            <proteinClass source="MEMSAT-SVM" id="Mf" parent_id="" name="MEMSAT-SVM predicted membrane proteins"/>
            <proteinClass source="Phobius" id="Mg" parent_id="" name="Phobius predicted membrane proteins"/>
            <proteinClass source="SCAMPI" id="Mh" parent_id="" name="SCAMPI predicted membrane proteins"/>
            <proteinClass source="SPOCTOPUS" id="Mi" parent_id="" name="SPOCTOPUS predicted membrane proteins"/>
            <proteinClass source="THUMBUP" id="Mj" parent_id="" name="THUMBUP predicted membrane proteins"/>
            <proteinClass source="TMHMM" id="Mk" parent_id="" name="TMHMM predicted membrane proteins"/>
            <proteinClass source="MDM" id="M1" parent_id="" name="1TM proteins predicted by MDM"/>
            <proteinClass source="MDM" id="M4" parent_id="" name="4TM proteins predicted by MDM"/>
            <proteinClass source="SignalP" id="Sb" parent_id="Se" name="SignalP predicted secreted proteins"/>
            <proteinClass source="HPA" id="Za" parent_id="" name="Predicted intracellular proteins"/>
            <proteinClass source="UniProt" id="Ua" parent_id="" name="UniProt - Evidence at protein level"/>
            <proteinClass source="Kim et al 2014" id="Ea" parent_id="" name="Protein evidence (Kim et al 2014)"/>
            <proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <" source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Evidence at protein level"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
        <tissueExpression source="HPA" technology="IHC" assayType="tissue">
            <summary type="tissue"><![CDATA[Cytoplasmic and membranous expression in most tissues.]]></summary>
            <verification type="reliability" description="Antibody staining mainly consistent with RNA expression data. Pending external verification. ">approved</verification>
            <image imageType="selected"/>
        </tissueExpression>
    </entry>
</proteinAtlas>

So every node I need to parse is an "entry", and it follows the same structure throughout the file.

I found an example online of how to parse the nodes one at a time, and it works well:

library(XML)

# Branch handlers for xmlEventParse: the "entry" function is called once for
# every <entry> node, with the fully built node passed in as `x`.
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//name")
    proteinEviden <- getNodeSet(x, path = "proteinEvidence")
    tissueExpression <- getNodeSet(x, path = "tissueExpression/summary")
    tissueExpression1 <- getNodeSet(x, path = "tissueExpression/verification")
    value <- xmlValue(ns[[1]])
    value2 <- xmlGetAttr(proteinEviden[[1]], "evidence")

    print(value)
    print(value2)

    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(entry = func, getStore = getStore)
}

myfunctions <- branchFunction()

# Stream the file: handlers = NULL so nothing is kept for non-branch events,
# and the "entry" branch handler above is invoked for every <entry> node.
xmlEventParse(
  file = "proteinatlas.xml",
  handlers = NULL,
  branches = myfunctions
)

This works well, but it gets slower as it goes and memory keeps building up. Do you know how to free the memory, or is there any other way to parse such a large XML file?
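For reference, the kind of mitigation I can imagine is sketched below (the counter name and the gc() interval are arbitrary placeholders): keep only the extracted strings in the store environment instead of printing them, and force a garbage collection every few thousand entries. Whether gc() actually releases the memory that libxml2 holds on the C side is exactly what I am unsure about.

library(XML)

branchFunction <- function(gc_every = 5000) {
  store <- new.env()
  count <- 0
  func <- function(x, ...) {
    count <<- count + 1
    # extract only small character values from the entry node
    name   <- xmlValue(getNodeSet(x, path = "//name")[[1]])
    eviden <- xmlGetAttr(getNodeSet(x, path = "proteinEvidence")[[1]], "evidence")
    store[[name]] <- eviden
    # periodically ask R to collect garbage instead of printing every entry
    if (count %% gc_every == 0) gc()
  }
  getStore <- function() { as.list(store) }
  list(entry = func, getStore = getStore)
}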


2 Answers

Stack Overflow user

Accepted answer

Posted on 2019-06-21 03:19:54

I found online that R may not be the best language for parsing an XML file this large. Python does have libraries for this, and they work well. I tried one of them and it seems to work fine.

Votes: 0

Stack Overflow user

Posted on 2019-04-05 18:24:45

Not sure whether it is more memory-efficient, but it can't hurt to try:

library( xml2 )
library( data.table )

#first, parse the xml document
doc <- read_xml( "./test.xml" )

#get all entry-nodes
entry.nodes <- xml_find_all( doc, "//entry")

#if necessary, you can now delete the read-in document 'doc' to free up memory
#  rm( doc )
#   

#build data.table
# will also handle missing attributes/nodes. 
# because xml_find_first will return NA if node is not found
data.table( name = xml_text( xml_find_first( entry.nodes, ".//name" ) ),
            eviden = xml_attr( xml_find_first( entry.nodes, ".//proteinEvidence" ), "evidence" ) 
            )


#      name                    eviden
# 1: TSPAN6 Evidence at protein level
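
The same pattern extends to the other fields mentioned in the question. Below is a sketch, assuming the `entry.nodes` object from above; the `result` name and the extra column names are only illustrative. xml_find_first() returns a missing node (and xml_text()/xml_attr() then return NA) when an entry lacks the element, so rows stay aligned. Once the table is built, the xml2 handles can be dropped so the memory behind the parsed document can be reclaimed.

result <- data.table( name     = xml_text( xml_find_first( entry.nodes, ".//name" ) ),
                      eviden   = xml_attr( xml_find_first( entry.nodes, ".//proteinEvidence" ), "evidence" ),
                      summary  = xml_text( xml_find_first( entry.nodes, ".//tissueExpression/summary" ) ),
                      verified = xml_text( xml_find_first( entry.nodes, ".//tissueExpression/verification" ) ) )

#drop the xml2 handles (skip 'doc' if it was already removed above) and collect garbage
rm( doc, entry.nodes )
invisible( gc() )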
Votes: 0
Original link:

https://stackoverflow.com/questions/55517531
