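I want to read an XML file containing metadata, extract specific parts, and write them to another file. However, I am stuck right at the start: parsing the 2 MB metadata XML file.
For testing and debugging, I reduced the input file to the smaller sample XML below.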
<?xml version="1.0" encoding="UTF-8"?>
<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1" >
<Study OID="MyStudy">
<GlobalVariables>
<StudyName>MyStudy</StudyName>
<ProtocolName>MyProtocol</ProtocolName>
</GlobalVariables>
<BasicDefinitions>
<MeasurementUnit OID="MU_CM" Name="cm">
<Symbol>
<TranslatedText>cm</TranslatedText>
</Symbol>
</MeasurementUnit>
<MeasurementUnit OID="MU_KG" Name="kg">
<Symbol>
<TranslatedText>kg</TranslatedText>
</Symbol>
</MeasurementUnit>
</BasicDefinitions>
<MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
<Protocol>
<StudyEventRef StudyEventOID="SE_BASELINE" OrderNumber="1" Mandatory="Yes"/>
<StudyEventRef StudyEventOID="SE_3WK" OrderNumber="2" Mandatory="Yes"/>
<StudyEventRef StudyEventOID="SE_6WK" OrderNumber="3" Mandatory="Yes"/>
<StudyEventRef StudyEventOID="SE_9WK" OrderNumber="4" Mandatory="Yes"/>
<StudyEventRef StudyEventOID="SE_12WK" OrderNumber="5" Mandatory="Yes"/>
</Protocol>
<ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date" SASFieldName="BL_D_VDA" Comment="Visit date" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
<Question>
<TranslatedText>Visit date</TranslatedText>
</Question>
</ItemDef>
<ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer" Length="1" SASFieldName="BL_D_MCO" Comment="Medicine code" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
<Question>
<TranslatedText>Medicine code</TranslatedText>
</Question>
<CodeListRef CodeListOID="CL_12345"/>
</ItemDef>
</MetaDataVersion>
</Study>
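</ODM>
I am only interested in the ItemDef elements and their attributes, and I am using xml.etree.ElementTree to parse the XML file. Below is what I have so far, but it never reaches the "-- found ItemDef" part; see the code below.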
# which file to read
FILE_NAME = "mystudy.xml"
ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}
# Import the os module
import os
import xml.etree.ElementTree as ET
import csv
import array as arr
e = ET.parse(os.path.join(os.getcwd(), FILE_NAME))
root = e.getroot()
# testing to see if it parses anything
print(root.get('Description'))
namespace = "{http://www.cdisc.org/ns/odm/v1.3}"
# none of this seems to work..
# col = e.findall('ItemDef')
# col = e.findall('.//ItemDef')
# col = e.findall('(*)ItemDef')
# col = e.findall('{0}ODM/Study/MetaDataVersion/ItemDef'.format(namespace))
col = e.findall('{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))
print("start for-loop")
# iterate all
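for itemdef in col:
    name = itemdef.get('Name')
    print("-- found ItemDef name=", name)
print("finished for-loop")
As I understand it, you have to specify the namespace correctly or nothing will be matched, and that is probably where my mistake is. I searched stackoverflow.com for similar questions and tried several approaches (see the comments in the code), but none of them work.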
Posted on 2021-06-23 21:53:41
Because e.findall() already searches relative to the root element, which is <ODM> itself, remove <ODM> from the XPath expression:
col = e.findall('./{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))
# Study Metadata
# start for-loop
# -- found ItemDef name= BL_D_VISITDATE
# -- found ItemDef name= BL_D_MEDCODE
# finished for-loop
Better yet, pass the dictionary you defined, which maps a prefix to the namespace URI, as the namespaces argument of findall:
ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}
col = e.findall('./d:Study/d:MetaDataVersion/d:ItemDef', namespaces=ns)
# SHORT-HAND FOR ANYWHERE SEARCH
col = e.findall('.//d:ItemDef', namespaces=ns)
https://stackoverflow.com/questions/68107055
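To tie the pieces together, here is a minimal, self-contained sketch of the whole task from the question: find the ItemDef elements through a prefix map, read their attributes, and write them out as CSV. Several things here are assumptions and not part of the original post: the helper names extract_itemdefs and rows_to_csv, the "oc" prefix for the OpenClinica namespace, the choice of columns, and the trimmed inline sample, which stands in for mystudy.xml so the snippet runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, inlined so the sketch runs as-is.
SAMPLE = """<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

# Prefix map; the prefixes "d" and "oc" are arbitrary labels chosen for this sketch.
NS = {
    "d": "http://www.cdisc.org/ns/odm/v1.3",
    "oc": "http://www.openclinica.org/ns/odm_ext_v130/v3.1",
}

FIELDS = ["OID", "Name", "DataType", "FormOIDs"]


def extract_itemdefs(xml_text):
    """Return one dict per ItemDef element (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the <ODM> element itself
    rows = []
    for itemdef in root.findall(".//d:ItemDef", namespaces=NS):
        rows.append({
            # Unprefixed attributes are looked up by their plain name.
            "OID": itemdef.get("OID"),
            "Name": itemdef.get("Name"),
            "DataType": itemdef.get("DataType"),
            # Prefixed attributes are stored as "{namespace-uri}local-name".
            "FormOIDs": itemdef.get("{%s}FormOIDs" % NS["oc"]),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_itemdefs(SAMPLE)
print(rows_to_csv(rows))
```

For the real file, ET.parse("mystudy.xml").getroot() could replace ET.fromstring(SAMPLE), and the CSV text could be written to a file opened with newline="" instead of a StringIO buffer. Note that a path starting with d:ODM would still match nothing here, for the same reason as in the question: the search starts at the ODM element, not above it.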