Python XML parser problem

Stack Overflow user
Asked 2019-11-04 16:58:31
3 answers · 171 views · 0 followers · 0 votes

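I'm new to Python, so apologies for the basic question. I am trying to read an XML file into a Python object (preferably pandas). For now I am only printing the variables to check whether I can read them correctly in tabular form.

For this I used xml.etree.ElementTree, but I may not be using it as intended.

Code: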

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()

ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
      'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}

for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),
                     ClinicalData.attrib.get('AuditSubCategoryName'),       # null output due to namespace issue
                     SubjectData.attrib.get('SubjectKey'),
                     SubjectData.attrib.get('SubjectName'),                 # null output due to namespace issue
                     LocationOID,                                           # not sure what the issue is
                     StudyEventData.attrib.get('StudyEventRepeatKey'),
                     AuditRecord.find('DateTimeStamp')                      # not sure what the issue is
                    )

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

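I expect every printed variable to hold the value that appears in the XML file. Please also let me know whether there is a more suitable approach than multiple nested loops.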

3 Answers

Stack Overflow user

Accepted answer

Answered 2019-11-04 17:23:41

Namespaces are a pain when working with ElementTree. See this discussion.

Short answer:

# ODM is the root element obtained with tree.getroot() in the question.
for ClinicalData in ODM:
    for SubjectData in ClinicalData:
        # Look up SiteRef by its namespace-qualified tag instead of looping over every child.
        SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # Prefixed attributes are stored under '{namespace-uri}name'.
                    ClinicalData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    # find() returns an Element; .text gives its string content.
                    AuditRecord.find(
                        '{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').text,
                )
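
As a possible variation on the snippet above, ElementTree's `find`, `findall` and `findtext` accept a prefix-to-URI mapping, which avoids repeating the full `{uri}` string in every element lookup. The sketch below is only an illustration: the prefix `odm`, the constants `NS` and `MDSOL`, and the helper `read_audit_rows` are names made up here, and the XML is an abridged copy of the question's input. Attribute lookups do not take the mapping, so prefixed attributes still need the `{uri}name` form.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the question's input, inlined so the sketch is self-contained.
XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="Activated">
    <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectName="ACC-SUBJ-3">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]">
        <AuditRecord>
          <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
        </AuditRecord>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""

# The prefixes are arbitrary labels chosen for this sketch;
# only the URIs have to match the document.
NS = {
    "odm": "http://www.cdisc.org/ns/odm/v1.3",
    "mdsol": "http://www.mdsol.com/ns/odm/metadata",
}
MDSOL = "{%s}" % NS["mdsol"]  # prefix for namespaced attribute names


def read_audit_rows(root):
    """Collect one tuple per AuditRecord (hypothetical helper)."""
    rows = []
    for clinical in root.findall("odm:ClinicalData", NS):
        for subject in clinical.findall("odm:SubjectData", NS):
            site = subject.find("odm:SiteRef", NS)
            location = site.get("LocationOID") if site is not None else None
            # findall() only yields StudyEventData children, so SiteRef is skipped.
            for event in subject.findall("odm:StudyEventData", NS):
                for audit in event.findall("odm:AuditRecord", NS):
                    rows.append((
                        clinical.get("MetaDataVersionOID"),
                        clinical.get(MDSOL + "AuditSubCategoryName"),
                        subject.get("SubjectKey"),
                        subject.get(MDSOL + "SubjectName"),
                        location,
                        event.get("StudyEventRepeatKey"),
                        audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    ))
    return rows


rows = read_audit_rows(ET.fromstring(XML))
for row in rows:
    print(row)
```

Filtering by tag with `findall` also sidesteps the original `LocationOID` problem: the question's inner loop visited every child of `SubjectData`, so the value read from `SiteRef` was overwritten by the later `StudyEventData` child, which has no such attribute.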
0 votes

Stack Overflow user

Answered 2019-11-04 17:25:05

I think you can use BeautifulSoup to parse the XML:

from bs4 import BeautifulSoup

temp = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>"""



# The "lxml" HTML parser lowercases tag and attribute names, hence the .lower() calls.
soup = BeautifulSoup(temp, "lxml")
ClinicalData = soup.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID = None
for subject in SubjectData:
    SiteRef = subject.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']

print('LocationOID', LocationOID)

Output:

LocationOID 0ACCSP3MAPPING1SITE1
[Finished in 1.2s]
0 votes

Stack Overflow user

Answered 2019-11-04 21:16:58

@Justin I applied your suggestion and it worked, until I broke it.

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

Code:

import xml.etree.ElementTree as ET
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

tree = ET.parse("data.xml")
ODM = tree.getroot()

xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"

def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
             'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
             'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)

    CreationDateTime = ODM.attrib.get('CreationDateTime')

    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text; 
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
                                           SUBJECTUUID,LocationOID,StudyEventOID,
                                           StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
                                           RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
                                           SourceID,UserOID,InstanceId], index=dfcols),
                                        ignore_index=True)

    print(df_xml)
data_reader()

Problem: I get duplicate records. The variables DateTimeStamp, SourceID, UserOID and Measurement_Unit throw runtime errors during assignment.
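
Reading the code above, a few things look like plausible causes of the runtime errors, though this is a guess from the post alone. `'MeasurementUnitRef'.format(xmlns)` contains no `{0}` placeholder, so the namespace is never prepended and `find` returns `None`. The loop `for AuditRecord in ItemData` visits every child, including `MeasurementUnitRef`, which has no `DateTimeStamp` or `SourceID`. `UserRef` is a child of `AuditRecord`, not of `ItemData`. Separately, `'ItemgroupOID'` does not match the attribute name `ItemGroupOID`, so that column would come back empty. The sketch below shows one way to address these points for the item-level columns only; `attr_of`, `read_item_rows`, `NS` and `MDSOL` are hypothetical names, and the XML is an abridged copy of the input.

```python
import xml.etree.ElementTree as ET

# Abridged copy of one ClinicalData block from the input above.
XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="2965">
    <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1">
      <StudyEventData StudyEventOID="FV">
        <FormData FormOID="VS">
          <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
            <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
              <AuditRecord>
                <UserRef UserOID="alscrave2"/>
                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                <ReasonForChange/>
                <SourceID>122841525</SourceID>
              </AuditRecord>
              <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}  # prefix name is arbitrary
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"


def attr_of(parent, path, name):
    """Return attribute `name` of the first child matching `path`, or None."""
    node = parent.find(path, NS)
    return node.get(name) if node is not None else None


def read_item_rows(root):
    """Collect one dict per AuditRecord found under any ItemData."""
    rows = []
    for group in root.iterfind(".//odm:ItemGroupData", NS):
        for item in group.findall("odm:ItemData", NS):
            unit = attr_of(item, "odm:MeasurementUnitRef", "MeasurementUnitOID")
            # findall() yields only AuditRecord children, so MeasurementUnitRef is skipped.
            for audit in item.findall("odm:AuditRecord", NS):
                rows.append({
                    "ItemGroupOID": group.get("ItemGroupOID"),
                    "RecordId": group.get(MDSOL + "RecordId"),
                    "var_name": item.get("ItemOID"),
                    "Value": item.get("Value"),
                    "Measurement_Unit": unit,
                    "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    "SourceID": audit.findtext("odm:SourceID", namespaces=NS),
                    # UserRef lives inside AuditRecord, not directly under ItemData.
                    "UserOID": attr_of(audit, "odm:UserRef", "UserOID"),
                })
    return rows


rows = read_item_rows(ET.fromstring(XML))
for row in rows:
    print(row)
```

If pandas is installed, `pd.DataFrame(rows)` builds the table from the collected dicts in one step; `DataFrame.append`, used in the code above, was removed in pandas 2.0. The duplicate rows are harder to diagnose from the post alone. One possible contributor is that the sample input contains two `ClinicalData` blocks that differ only in `RecordId` and `Value`, so two very similar rows would be expected. The upper-level columns (`StudyOID`, `SubjectName` and so on) would still need the outer loops from the original code.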

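0 votes

The original content of this page is provided by Stack Overflow; translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/58689992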
