My input file is actually several XML files appended into a single file. (It comes from Google Patents.) It has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python's xml.dom.minidom cannot parse this non-standard file. Is there a better way to parse it? I don't know whether the code below performs well.
    for line in infile:
        if line == '<?xml version="1.0" encoding="UTF-8"?>':
            xmldoc = minidom.parse(XMLstring)
        else:
            XMLstring += line

Posted on 2011-09-07 23:11:20
I would parse each XML chunk separately.
You already seem to be doing that in your sample code. Here is my take on your code:
    from xml.dom import minidom

    def parse_xml_buffer(buffer):
        dom = minidom.parseString("".join(buffer))  # join list into string of XML
        # .... parse dom ...

    # `file` is assumed to be an already-open file object
    buffer = [file.readline()]  # initialise with the first line
    for line in file:
        if line.startswith("<?xml "):
            parse_xml_buffer(buffer)
            buffer = []  # reset buffer
        buffer.append(line)  # list operations are faster than concatenating strings
    parse_xml_buffer(buffer)  # parse final chunk

Once the file has been split into separate XML chunks, how you actually parse each one depends on your requirements and, to some extent, on your preference. Options include lxml, minidom, elementtree, expat, BeautifulSoup, etc.
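For anyone who wants something to run directly, the chunk-by-chunk idea above can be sketched in Python 3 roughly as follows. This is only a sketch under assumptions: it assumes every document starts with an `<?xml ` declaration at the beginning of a line, the helper names `split_xml_documents` and `doc_numbers` are made up for illustration, and the sample data is hypothetical (the DOCTYPE line is left out for brevity).

```python
import io
from xml.dom import minidom


def split_xml_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumes every document starts with an '<?xml ' declaration at the
    beginning of a line (an assumption about the input, not a general rule).
    """
    buffer = []
    for line in lines:
        # A new declaration marks the end of the previous document.
        if line.startswith("<?xml ") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:  # flush the final document
        yield "".join(buffer)


def doc_numbers(xml_string):
    """Return the text of every <doc-number> element in one document."""
    dom = minidom.parseString(xml_string)
    return [
        node.firstChild.data
        for node in dom.getElementsByTagName("doc-number")
        if node.firstChild is not None
    ]


# Hypothetical sample: two tiny documents appended into one stream.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root_node><doc-number>D0629996</doc-number></root_node>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root_node><doc-number>29316765</doc-number></root_node>\n"
)

if __name__ == "__main__":
    for xml_string in split_xml_documents(io.StringIO(SAMPLE)):
        print(doc_numbers(xml_string))
```

Because the splitter is a generator, only one document is held in memory at a time, which should matter for a large combined file. Passing an open file object instead of the `io.StringIO` wrapper works the same way, since both iterate line by line.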
Update:
Starting from scratch, here is how I would do it (using BeautifulSoup):
    #!/usr/bin/env python
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import (Python 2)

    def separated_xml(infile):
        file = open(infile, "r")
        buffer = [file.readline()]  # start with the first line
        for line in file:
            if line.startswith("<?xml "):
                yield "".join(buffer)  # emit the completed document
                buffer = []
            buffer.append(line)
        yield "".join(buffer)  # emit the final document
        file.close()

    for xml_string in separated_xml("ipgb20110104.xml"):
        soup = BeautifulSoup(xml_string)
        for num in soup.findAll("doc-number"):
            print num.contents[0]

This returns:
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...

Posted on 2011-09-07 22:30:35
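I don't know minidom, and I don't know much about XML parsing either, but I have used XPath to parse XML/HTML, for example with the lxml module.
You can find some XPath examples here: http://www.w3schools.com/xpath/xpath_examples.asp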
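To give a rough idea of what an XPath-style query looks like on one of the split-out chunks, here is a small Python 3 sketch. Note that it uses the limited XPath subset built into the standard library's `xml.etree.ElementTree` rather than lxml, so that it runs without extra installs. The `find_text` helper, the `publication` and `application` element names, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical single document, already split out of the combined file.
XML_CHUNK = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root_node>"
    "<publication><doc-number>D0629996</doc-number></publication>"
    "<application><doc-number>29316765</doc-number></application>"
    "</root_node>"
)


def find_text(xml_string, path):
    """Return the text of every element matching an XPath-style path."""
    root = ET.fromstring(xml_string)
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    # './/doc-number' matches <doc-number> at any depth below the root.
    print(find_text(XML_CHUNK, ".//doc-number"))
    # A more specific path restricts the match to one branch.
    print(find_text(XML_CHUNK, "./application/doc-number"))
```

lxml exposes a fuller XPath 1.0 implementation through its own `xpath()` method, so the same paths should carry over if the standard library's subset turns out to be too restrictive.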
https://stackoverflow.com/questions/7335560