Q: Python for parsing a non-standard XML file

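Stack Overflow user

Asked 2011-09-07 22:26:50

2 answers · 2.7K views · 0 followers · score 6

My input file is actually multiple XML files appended into a single file. (It comes from Google Patents.) It is structured as follows: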

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python's xml.dom.minidom cannot parse this non-standard file. What is a better way to parse it? I also don't know whether the code below would perform well.

for line in infile:
  if line == '<?xml version="1.0" encoding="UTF-8"?>': 
    xmldoc = minidom.parse(XMLstring)
  else:
    XMLstring += line

xml-parsing

python

2 Answers

Stack Overflow user

Answered 2011-09-07 23:11:20

I would parse each XML chunk separately.

You already seem to be doing that in your sample code. Here is my take on your code:

from xml.dom import minidom

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join the list into one XML string
    # .... parse dom ...

# `file` is assumed to be an already-open file object for the input file
buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk

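Once the file has been broken into separate XML chunks, how you actually parse them depends on your requirements and, to some extent, on your preferences. Options include lxml, minidom, elementtree, expat, BeautifulSoup, and so on.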
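As one illustration of the options above, here is a minimal Python 3 sketch that splits on the XML declaration and parses each chunk with the standard library's xml.etree.ElementTree. The names split_documents and SAMPLE are made up for this example, and the sample data is a hypothetical, heavily shortened stand-in for the real patent file; the real root element and nesting may differ.

```python
import io
import xml.etree.ElementTree as ET

def split_documents(lines):
    """Yield each concatenated XML document as a single string.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    buffer = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith("<?xml ") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)  # final document

# Hypothetical sample: two tiny documents appended into one "file".
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    '<us-patent-grant><doc-number>D0629996</doc-number></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    '<us-patent-grant><doc-number>29316765</doc-number></us-patent-grant>\n'
)

doc_numbers = []
for chunk in split_documents(io.StringIO(SAMPLE)):
    root = ET.fromstring(chunk)  # each chunk is a well-formed document
    for node in root.iter("doc-number"):
        doc_numbers.append(node.text)

print(doc_numbers)  # ['D0629996', '29316765']
```

For a real file you would pass an open file object instead of the io.StringIO wrapper. Because the generator holds only one document in memory at a time, this should scale to large inputs, although I have not measured it against the actual patent data.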

Update:

Starting from scratch, here is how I would do it (using BeautifulSoup):

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]

This returns:

D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...

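Score: 2

Stack Overflow user

Answered 2011-09-07 22:30:35

I don't know minidom, and I don't know much about XML parsing either, but I have used XPath to parse XML/HTML, for example with the lxml module.

You can find some XPath examples here: http://www.w3schools.com/xpath/xpath_examples.asp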
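To give a rough idea of what XPath-style queries look like, here is a small sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (lxml elements offer a fuller xpath() method). The element names publication-reference, application-reference, document-id and country are assumptions made for illustration and are not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document; real patent XML may be laid out differently.
XML = """<us-patent-grant>
  <publication-reference>
    <document-id><country>US</country><doc-number>D0629996</doc-number></document-id>
  </publication-reference>
  <application-reference>
    <document-id><country>US</country><doc-number>29316765</doc-number></document-id>
  </application-reference>
</us-patent-grant>"""

root = ET.fromstring(XML)

# ".//tag" matches the tag at any depth below the root.
all_numbers = [n.text for n in root.findall(".//doc-number")]

# A plain path walks child elements step by step.
pub_number = root.findtext("publication-reference/document-id/doc-number")

# A predicate keeps only elements that have a child with the given text.
us_docs = root.findall(".//document-id[country='US']")

print(all_numbers)   # ['D0629996', '29316765']
print(pub_number)    # D0629996
print(len(us_docs))  # 2
```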

Score: 0

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/7335560
