
[Repost] Parsing an XML file with Python

Author: py3study
Published: 2020-01-19 10:44:26
Column: python3

Environment

Python: 3.4.4

Preparing the XML file

First, create an XML file named countries.xml. Its content comes from the official Python documentation.

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Preparing the Python file

Create test_SAX.py, which parses the XML file.

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.sax
import xml.sax.handler


class CountryHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""  # name of the element currently being read
        self.rank = ""
        self.year = ""
        self.gdppc = ""
        self.nei_name = ""
        self.nei_dire = ""

    def startElement(self, tag, attributes):
        # Called at every start tag; remember which element we are inside.
        self.CurrentData = tag
        if tag == "country":
            print("*****Country*****")
            print("Name:", attributes["name"])
        elif tag == "neighbor":
            # Copy the values out; the attributes object may be reused.
            self.nei_name = attributes["name"]
            self.nei_dire = attributes["direction"]

    def endElement(self, tag):
        # Called at every end tag; print what was collected, then reset it.
        if tag == "rank":
            print("Rank:", self.rank)
            self.rank = ""
        elif tag == "year":
            print("Year:", self.year)
            self.year = ""
        elif tag == "gdppc":
            print("Gdppc:", self.gdppc)
            self.gdppc = ""
        elif tag == "neighbor":
            print("Neighbor:", self.nei_name, self.nei_dire)
        self.CurrentData = ""
        self.nei_name = ""
        self.nei_dire = ""

    def characters(self, content):
        # Text may arrive in several chunks, so append instead of overwriting.
        if self.CurrentData == "rank":
            self.rank += content
        elif self.CurrentData == "year":
            self.year += content
        elif self.CurrentData == "gdppc":
            self.gdppc += content


if __name__ == "__main__":
    parser = xml.sax.make_parser()
    # Turn namespace mode off so startElement()/endElement() are called.
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    Handler = CountryHandler()
    parser.setContentHandler(Handler)
    parser.parse("countries.xml")

Output

>python test_SAX.py
*****Country*****
Name: Liechtenstein
Rank: 1
Year: 2008
Gdppc: 141100
Neighbor: Austria E
Neighbor: Switzerland W
*****Country*****
Name: Singapore
Rank: 4
Year: 2011
Gdppc: 59900
Neighbor: Malaysia N
*****Country*****
Name: Panama
Rank: 68
Year: 2011
Gdppc: 13600
Neighbor: Costa Rica W
Neighbor: Colombia E

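Notes

SAX is an event-driven API.

SAX involves three main kinds of objects: readers, handlers, and input sources, that is, the parser, the event handlers, and the source of the input.

The parser reads the input source, such as an XML document, and sends events to the event handler, such as element-start and element-end events.

The event handler responds to those events and processes the data in the XML document.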
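
The division of labour between reader, handler, and input source can be seen in a minimal sketch. The EventLogger class and the one-line sample document are illustrative assumptions, not part of the original article; xml.sax.parseString is used so that no file is needed.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every event the reader sends it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(("text", content.strip()))


logger = EventLogger()
# parseString creates a reader, wraps the bytes in an input source,
# and delivers the resulting events to the handler in document order.
xml.sax.parseString(b"<data><rank>1</rank></data>", logger)

for event in logger.events:
    print(event)
```

The handler never asks for data; it only reacts to whatever the reader reports, which is why SAX can process documents without holding them in memory as a tree.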

parser = xml.sax.make_parser()

Creates and returns a SAX XMLReader object.

See: https://docs.python.org/2/library/xml.sax.html

xml.sax.make_parser([parser_list])
Create and return a SAX XMLReader object. The first parser found will be used. If parser_list is provided, it must be a sequence of strings which name modules that have a function named create_parser(). Modules listed in parser_list will be used before modules in the default list of parsers.

parser.setFeature(xml.sax.handler.feature_namespaces, 0)

Sets the xml.sax.handler.feature_namespaces feature to 0, which turns namespace mode off.

See: https://docs.python.org/2/library/xml.sax.reader.html

XMLReader.setFeature(featurename, value)
Set the featurename to value. If the feature is not recognized, SAXNotRecognizedException is raised. If the feature or its setting is not supported by the parser, SAXNotSupportedException is raised.
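
To see what the namespace feature changes, the following sketch parses the same document with the feature off and on. The handler class, the collect() helper, and the http://example.com/ns URI are hypothetical names chosen for illustration. With the feature off the reader calls startElement() with the raw tag name; with it on, it calls startElementNS() with a (uri, localname) pair instead.

```python
import io
import xml.sax
import xml.sax.handler


class NamespaceProbe(xml.sax.ContentHandler):
    """Hypothetical handler that records which start callback was used."""

    def __init__(self):
        super().__init__()
        self.plain = []      # names seen by startElement()
        self.qualified = []  # names seen by startElementNS()

    def startElement(self, name, attrs):
        self.plain.append(name)

    def startElementNS(self, name, qname, attrs):
        self.qualified.append(name)


DOC = b'<d:data xmlns:d="http://example.com/ns"><d:rank>1</d:rank></d:data>'


def collect(namespaces):
    """Parse DOC with namespace mode switched on or off."""
    handler = NamespaceProbe()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, namespaces)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return handler


off = collect(False)
on = collect(True)
print(off.plain)
print(on.qualified)
```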
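
class CountryHandler( xml.sax.ContentHandler )

The SAX API defines four kinds of handlers: content handlers, DTD handlers, error handlers, and entity resolvers.

A program only needs to implement the interfaces for the events it cares about. Here, for example, we implement only some of the methods of the ContentHandler interface.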

class xml.sax.handler.ContentHandler
This is the main callback interface in SAX, and the one most important to applications. The order of events in this interface mirrors the order of the information in the document.

ContentHandler has many methods. For details, see: https://docs.python.org/2/library/xml.sax.handler.html#contenthandler-objects

Here we first define a CountryHandler class that inherits from xml.sax.ContentHandler, and then implement its startElement(), endElement(), and characters() methods.
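
As an illustration of a second handler kind, here is a minimal sketch of an error handler used alongside a content handler. The CollectingErrorHandler class and the deliberately malformed sample document are assumptions made for this example; only fatalError() is overridden, and the other error callbacks keep their default behaviour.

```python
import xml.sax


class CollectingErrorHandler(xml.sax.ErrorHandler):
    """Hypothetical error handler that records fatal errors, then re-raises."""

    def __init__(self):
        self.problems = []

    def fatalError(self, exception):
        self.problems.append(exception.getMessage())
        raise exception


errors = CollectingErrorHandler()
failed = False
try:
    # The <rank> element is never closed, so the document is not well-formed.
    xml.sax.parseString(b"<data><rank>1</data>", xml.sax.ContentHandler(), errors)
except xml.sax.SAXParseException as exc:
    failed = True
    print("parse failed:", exc.getMessage())
```

Each handler kind is registered separately, so the content handler stays free of error-reporting code.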

def startElement(self, tag, attributes)

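Called when an XML start tag is encountered. tag is the name of the tag, and attributes is a dictionary-like object holding the tag's attributes.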

Signals the start of an element in non-namespace mode.

The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an object of the Attributes interface (see The Attributes Interface) containing the attributes of the element. The object passed as attrs may be re-used by the parser; holding on to a reference to it is not a reliable way to keep a copy of the attributes. To keep a copy of the attributes, use the copy() method of the attrs object.
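
Because the attrs object may be reused, a handler that wants to keep attribute values past the callback should copy them. The sketch below copies them into a plain dict; the NeighborCollector class is a hypothetical name, and attrs.copy() would serve equally well if an Attributes object is preferred over a dict.

```python
import xml.sax


class NeighborCollector(xml.sax.ContentHandler):
    """Hypothetical handler that keeps the attributes of each <neighbor>."""

    def __init__(self):
        super().__init__()
        self.neighbors = []

    def startElement(self, name, attrs):
        if name == "neighbor":
            # Store a copy, not the attrs object itself.
            self.neighbors.append(dict(attrs.items()))


collector = NeighborCollector()
xml.sax.parseString(
    b'<country name="Singapore">'
    b'<neighbor name="Malaysia" direction="N"/>'
    b"</country>",
    collector,
)
print(collector.neighbors)
```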

def endElement(self, tag)

Called when an XML end tag is encountered. tag is the name of the tag.

Signals the end of an element in non-namespace mode.

The name parameter contains the name of the element type, just as with the startElement() event.
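
Since start and end events arrive in matching pairs, a handler can track where it is in the document with a simple stack. This is a sketch of that common pattern, not something the original script does; the PathTracker class is a hypothetical name.

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Hypothetical handler that records the path of each element as it closes."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.closed = []  # full path of each element, in closing order

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.closed.append("/".join(self.stack))
        self.stack.pop()


tracker = PathTracker()
xml.sax.parseString(b"<data><country><rank>1</rank></country></data>", tracker)
print(tracker.closed)
```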
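
def characters(self, content)

Called when character data inside an element is encountered. content holds that text; as the excerpt below notes, the parser may deliver one element's text in several chunks.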

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

content may be a Unicode string or a byte string; the expat reader module produces always Unicode strings.
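
Because one element's text can arrive in several chunks, a robust handler collects the chunks and joins them when the element ends. The sketch below assumes a hypothetical TextCollector class and a sample element containing an entity reference, which is one case where a parser may split the text; how many chunks are actually delivered depends on the parser.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that reassembles the text of each <name> element."""

    def __init__(self):
        super().__init__()
        self.chunks = None  # list of chunks while inside <name>, else None
        self.names = []

    def startElement(self, name, attrs):
        if name == "name":
            self.chunks = []

    def characters(self, content):
        # Only collect text while a <name> element is open.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self.chunks))
            self.chunks = None


collector = TextCollector()
xml.sax.parseString(b"<data><name>Trinidad &amp; Tobago</name></data>", collector)
print(collector.names)
```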

parser.setContentHandler( Handler )

Sets the current ContentHandler to an instance of our own handler. If none is set, content events are discarded.

See: https://docs.python.org/2/library/xml.sax.reader.html

XMLReader.setContentHandler(handler)
Set the current ContentHandler. If no ContentHandler is set, content events will be discarded.

parser.parse("countries.xml")

Starts parsing the XML file.

See: https://docs.python.org/2/library/xml.sax.reader.html

Process an input source, producing SAX events. The source object can be a system identifier (a string identifying the input source – typically a file name or an URL), a file-like object, or an InputSource object. When parse() returns, the input is completely processed, and the parser object can be discarded or reset. As a limitation, the current implementation only accepts byte streams; processing of character streams is for further study.
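
Besides a file name, parse() accepts the other source types mentioned in the excerpt. The sketch below feeds the same bytes once as a file-like object and once wrapped in an InputSource; the RankCounter class and the count_ranks() helper are hypothetical names used only for illustration.

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource


class RankCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <rank> elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "rank":
            self.count += 1


def count_ranks(source):
    """Parse any source accepted by parse() and return the <rank> count."""
    handler = RankCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.count


DOC = b"<data><rank>1</rank><rank>4</rank></data>"

# A file-like object opened in binary mode.
from_stream = count_ranks(io.BytesIO(DOC))

# An explicit InputSource wrapping a byte stream.
source = InputSource()
source.setByteStream(io.BytesIO(DOC))
from_input_source = count_ranks(source)

print(from_stream, from_input_source)
```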

Originally published on the author's personal site/blog on 2019-03-01.
