How do I convert an XML file into a nice pandas DataFrame?

Stack Overflow user
Asked 2015-02-01 11:58:54
5 answers · 163.9K views · 0 followers · 81 votes

Let's assume I have an XML file like this:

<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
    </documents>
</author>

I would like to read this XML file and convert it into a pandas DataFrame:

key                                         type     language    feature            web                         data
e95324a9a6c790ecb95e46cf15bE232ee517651      XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
bc360cfbafc39970587547215162f0db             XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce     XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42         XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
28a45eb2460823499763d70vdf9ca00ddbb665       XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]

This is what I have tried so far, but I get some errors, and there is probably a more efficient way to do this task:

from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))

for i in range(0,len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Could anybody suggest a better approach to this problem?


5 Answers

Stack Overflow user

Accepted answer

Posted 2015-02-02 04:08:37

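You can easily use xml (from the Python standard library) to convert the file to a pandas.DataFrame. Here is what I would do (when reading from a file, replace xml_data with the name of your file or file object):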

import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')

etree = ET.parse(xml_data) #create an ElementTree object 
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document, or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree))).

Have a look at the ElementTree tutorial provided in the xml library documentation.
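
For anyone who wants to try the two generators end to end, here is a minimal sketch. The `<authors>` wrapper element, the shortened keys (`k1`, `k2`, `k3`) and the sample texts are made up for illustration and are not from the question. It also strips the whitespace that surrounds the CDATA text, renames `KEY` to `key`, and picks the columns in the order the question asked for.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: an <authors> wrapper (not in the question) around two
# <author> elements shaped like the one in the question.
SAMPLE_XML = """<authors>
  <author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
      <document KEY="k1" web="www.example.com"><![CDATA[First text
]]>
      </document>
      <document KEY="k2" web="www.example.com"><![CDATA[Second text
]]>
      </document>
    </documents>
  </author>
  <author type="YYY" language="ES" gender="xx" feature="xx" web="foobar.com">
    <documents count="1">
      <document KEY="k3" web="www.example.org"><![CDATA[Third text]]></document>
    </documents>
  </author>
</authors>"""


def iter_docs(author):
    """Yield one dict per <document>, merged with the parent <author> attributes."""
    for doc in author.iter("document"):
        row = dict(author.attrib)
        row.update(doc.attrib)  # the document-level "web" overrides the author-level one
        row["data"] = (doc.text or "").strip()  # drop whitespace around the CDATA text
        yield row


def iter_author(tree):
    """Walk every <author> in the tree and yield its document rows."""
    for author in tree.iter("author"):
        yield from iter_docs(author)


tree = ET.parse(io.StringIO(SAMPLE_XML))
doc_df = pd.DataFrame(list(iter_author(tree))).rename(columns={"KEY": "key"})
doc_df = doc_df[["key", "type", "language", "feature", "web", "data"]]
print(doc_df)
```

Note that because `doc.attrib` is applied after the author attributes, the `web` column ends up holding the document's value rather than the author's; swap the order of the two updates if you want the opposite.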

Votes: 55

Stack Overflow user

Posted 2021-03-25 16:24:25

As of pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)
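
For XML shaped like the one in the question, the default `xpath` (`./*`) would select the single `<documents>` node rather than each `<document>`, so you will probably want to pass `xpath` explicitly. A minimal sketch, assuming a shortened, made-up sample (the keys `a1` and `b2` and the texts are hypothetical) and using `parser="etree"` so that lxml is not required:

```python
import io

import pandas as pd

# Hypothetical sample shaped like the XML in the question.
SAMPLE_XML = """<author type="XXX" language="EN" web="foobar.com">
    <documents count="2">
        <document KEY="a1" web="www.example.com">First text</document>
        <document KEY="b2" web="www.example.com">Second text</document>
    </documents>
</author>"""

# xpath selects the repeating <document> nodes, one row per node;
# parser="etree" uses the standard library parser instead of lxml.
df = pd.read_xml(io.StringIO(SAMPLE_XML), xpath=".//document", parser="etree")
print(df)
```

This gives one row per `<document>` with its `KEY` and `web` attributes. As far as I can tell the element's own text lands in a column named after the tag (`document`), but check that on your pandas version. The attributes of the parent `<author>` are not included, so if you need `type` or `language` in each row, the ElementTree approach above is probably the simpler route.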
Votes: 15

Stack Overflow user

Posted 2018-05-29 14:57:05

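Here is another way to convert XML to a pandas DataFrame. In my case I needed to parse the XML from a string, but the same logic also works when reading from a file.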

import pandas as pd
import xml.etree.ElementTree as ET

xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'

etree = ET.fromstring(xml_str)
dfcols = ['id', 'name']

# DataFrame.append was removed in pandas 2.0 (and was slow inside a loop),
# so collect the rows first and build the frame once.
rows = [[node.get('id'), node.get('name')] for node in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)

df.head()
Votes: 14

Original content provided by Stack Overflow.
Source: https://stackoverflow.com/questions/28259301
