
Q: Parsing a huge XML file with Python, but getting an error

Stack Overflow user

Asked 2016-04-28 19:59:15

Answers: 3 | Views: 298 | Followers: 0 | Votes: 1

I am trying to parse a huge XML file with Python, but I get this error:

    File "parser.py", line 6, in <module>
        event, root = text.next()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1281, in next
        self._root = self._parser.close()
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1654, in close
        self._raiseerror(v)
    File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
        raise err
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

My code currently looks like this:

    import xml.etree.ElementTree as ET
    from StringIO import StringIO

    text = ET.iterparse(StringIO('Posts.xml'), events=('start', 'end', 'start-ns', 'end-ns'))
    text = iter(text)
    event, root = text.next()

    for event, elem in text:
        currId = elem.get('PostTypeId')
        if (currId != '1'):
            root.remove(elem)

    tree.write('cut.xml')

The XML file I am trying to parse looks like this:

    <posts>

     <row FavoriteCount="4" CommentCount="4" AnswerCount="7" Tags="<discussion><answers>" Title="Why would anyone accept an answer?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-09-03T00:42:07.733" LastEditorUserId="99" OwnerUserId="4" Body="<p>I'm looking at the questions proposed during the Area 51 process:</p> <ul> <li>My supervisor thinks that all <code>If</code> statements should include <code>else</code> statements. Do you agree?</li> <li>What are common mistakes in Software Development?</li> <li>Tabs vs. Spaces: What is the one proper indentation character for everything, in every situation, ever?</li> <li>What programming language should I teach to my 4 year old son?</li> <li>What was the turning point of your programming career?</li> </ul> <p>None of these have an answer that should be accepted. The questions are interesting, and the answers would also be informative if the answer was well written and explained why the answerer thinks his method or idea is better. But I can't really see being able to accept an answer to any of these questions.</p> <p>So, if I ask a question, how do I decide if or how to accept an answer? There is no right or wrong answer and just because it works for me doesn't mean I should be floating that answer to the top - unless I'm overlooking something, the questions that are on topic here are very subjective. On Stack Overflow, there are often multiple right solutions to a problem. Here, we have a problem with an infinite number of solutions, none of which are arguably better or worse than any others.</p> <p>Thoughts?</p> " ViewCount="1582" Score="30" CreationDate="2010-09-01T19:32:45.710" PostTypeId="1" Id="1"/>

    <row CommentCount="0" AnswerCount="4" Tags="<discussion><site-attributes><faq-contents><top-7>" Title="What should our FAQ contain?" LastActivityDate="2015-03-18T19:19:24.887" LastEditDate="2015-03-18T19:19:24.887" LastEditorUserId="25936" OwnerUserId="9" Body="<p>One of the big 7 questions.</p> " ViewCount="318" Score="6" CreationDate="2010-09-01T19:34:51.797" PostTypeId="1" Id="2" CommunityOwnedDate="2010-09-02T03:42:26.083"/>

     <row FavoriteCount="8" CommentCount="8" AnswerCount="32" Tags="<discussion><top-7><site-attributes>" Title="What should our domain name be?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-12-20T02:46:31.950" LastEditorUserId="2314" OwnerUserId="9" Body="<blockquote> <p><strong>Possible Duplicate:</strong><br> <a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline">Write an Elevator Pitch / Tagline</a> </p> </blockquote> <h2>Note:</h2> <p>We are closing this domain naming thread. It is asking the <em>entirely</em> wrong question. See this blog post for details: <a href="http://blog.stackoverflow.com/2010/10/domain-names-the-wrong-question/" rel="nofollow">Domain Names: Wrong Question</a> </p> <p>We're going to keep the name programmers.stackexchange.com. But we WILL be setting up redirects from the more "popular" domains names. (e.g. seasonedadvice.com to cooking.stackexchange.com, basicallymoney.com to money.stackexchange.com, and others as we go through the list).</p> <p>New question: "<strong>Write an Elevator Pitch / Tagline!</strong>"</p> <p><a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline"><strong>Click here to contribute ideas and vote.</strong></a> </p> <p><em>[original message text below]</em></p> <p>One of the big 7 questions.</p> <ul> <li>One answer per answer please</li> <li>Only .com domain names please</li> <li>Only untaken domain names please (use whois)</li> </ul> <p>Please use <strong>lowercase characters only</strong> in domain name!<br> DomainName.com is more readable, but we have to register domainname.com!</p> " ViewCount="1146" Score="16" CreationDate="2010-09-01T19:36:08.390" PostTypeId="1" Id="3" CommunityOwnedDate="2010-09-02T03:40:00.467" ClosedDate="2010-10-08T21:02:50.313"/>
    ...

    </posts>

python

xml

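3 Answers

Stack Overflow user

Accepted answer

Answered 2016-04-28 20:43:05

ElementTree.iterparse expects a source: a filename or a file object. You are giving it a string buffer whose content is the text Posts.xml, not the actual content of the file Posts.xml, and that text is obviously not valid XML.

So just get rid of the StringIO call and ElementTree will open the file for you. However, your input file has problems of its own that will keep it from parsing correctly (see sverasch's answer).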
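As a rough illustration of that fix, here is a minimal sketch written for Python 3 (the question itself uses Python 2.7). It passes the path straight to iterparse and keeps only the rows whose PostTypeId is "1". The function name keep_questions is hypothetical. Because the question's `tree` variable is never defined, this sketch swaps in a different technique from the question's root.remove plus tree.write: it writes each kept row out as it streams by. It also assumes the input is already well-formed XML.

```python
import xml.etree.ElementTree as ET


def keep_questions(src_path, dst_path):
    """Stream src_path and copy only <row> elements with PostTypeId="1".

    Hypothetical helper for illustration; returns the number of rows kept.
    """
    # Pass the path itself: iterparse opens the file, no StringIO needed.
    context = ET.iterparse(src_path, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    kept = 0
    with open(dst_path, "wb") as out:
        out.write(b"<posts>\n")
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                if elem.get("PostTypeId") == "1":
                    elem.tail = None  # do not copy inter-row whitespace
                    out.write(ET.tostring(elem))
                    out.write(b"\n")
                    kept += 1
                # Drop rows already handled so memory use stays flat
                # even for a very large file.
                root.clear()
        out.write(b"</posts>\n")
    return kept
```

With the file names from the question, the call would be keep_questions('Posts.xml', 'cut.xml').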

Votes: 1

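Stack Overflow user

Answered 2016-04-28 20:29:02

I ran your sample XML through xmllint (http://linux.die.net/man/1/xmllint) and found that it contains unescaped less-than and greater-than signs.

> <

should be

&gt; &lt;

When the parser reaches one of them, it thinks a new tag or a closing tag has started prematurely.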
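To make the escaping point concrete, here is a small Python 3 sketch using a shortened, made-up row rather than the real dump. It assumes the standard library's xml.sax.saxutils.quoteattr is an acceptable way to produce the attribute value. It shows that a raw `<` inside an attribute is rejected and that the escaped form round-trips.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Tag list as it appears (unescaped) in the sample row.
raw_tags = "<discussion><answers>"

# A raw "<" inside an attribute value is not well-formed XML.
broken = '<row Tags="%s"/>' % raw_tags
try:
    ET.fromstring(broken)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

# quoteattr() escapes &, < and > and adds the surrounding quotes.
fixed = "<row Tags=%s/>" % quoteattr(raw_tags)
row = ET.fromstring(fixed)

# The parser hands back the original, unescaped text.
assert row.get("Tags") == raw_tags
```

Escaping has to happen when the file is written. Once the raw characters are in the file, a generic XML parser cannot tell which `<` was meant as data.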

Votes: 1

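Stack Overflow user

Answered 2016-04-28 20:44:33

You are not reading the file correctly.

StringIO('Posts.xml') does not read the file; it creates a file-like object whose content is the string "Posts.xml".

That is why iterparse complains: the content does not begin with <.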
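This behavior is easy to check. The sketch below uses Python 3's io.StringIO, which is assumed here to stand in for the Python 2 StringIO.StringIO class from the question. It reads the buffer back and then hands the same buffer to iterparse.

```python
import io
import xml.etree.ElementTree as ET

# The buffer holds the file *name*, not the file's contents.
buf = io.StringIO("Posts.xml")
content = buf.read()
buf.seek(0)  # rewind so iterparse sees the same text

error = None
try:
    for _event, _elem in ET.iterparse(buf):
        pass
except ET.ParseError as exc:
    # Same failure class as in the question's traceback, which
    # reports "syntax error: line 1, column 0".
    error = exc

assert content == "Posts.xml"
assert error is not None
```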

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/36924383
