I am trying to parse an XML file (more precisely, an XLIFF translation file) and convert it into the (slightly different) TMX format.
My source XLIFF file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
<header>
<count-group name="SomeFile.strings">
<count count-type="total" unit="word">2</count>
</count-group>
</header>
<body>
<trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
</body>
</file>
<file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
<header>
<count-group name="SomeOtherFile.strings">
<count count-type="total" unit="word">4</count>
</count-group>
</header>
<body>
<trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
</body>
</file>
(and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)
</xliff>
What I want to do is rearrange and simplify this content, reducing the above to the following format:
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeOtherFile.strings</prop>
<tuv xml:lang="en">
<seg>return to project list</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is again a note</prop>
<seg>povratak na popis projekata</seg>
</tuv>
</tu>
Note that the original XLIFF file may have several <file origin ...> sections, each with many <trans-unit ...> elements (these are the actual strings from that file).
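I have managed to write the part that gives me the "source" and "target" pieces, but what I still need is the information from the <file origin ...> element: where the languages are defined (i.e. "source-language" and "target-language", which I would then write as <tuv xml:lang="en"> and <tuv xml:lang="hr"> for each string), and where I can find the reference to the strings file (e.g. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).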
At the moment I have the following Python code, which nicely extracts the needed "source" and "target" elements:
#!/usr/bin/env python3
#
import sys
from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]
tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
            print("</TransUnit>")
        else:
            print("</TransUnit>")
    else:
        continue
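Now, how can I also extract the "source-language" (i.e. the value "en"), the "target-language" (i.e. the value "hr") and the file reference (i.e. "SomeFile.strings") from the <file origin ...> elements of the original XLIFF file?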
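In addition, I need to keep (remember) that file reference, i.e. <prop type="FileSource">SomeOtherFile.strings</prop>, for every <tu> unit (there may be many, unlike in the example above, where each "file" has only one). So, for example, I would have: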
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>Start</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Početak</seg>
</tuv>
</tu>
so that every <tu> element has a <prop type="FileSource"> element showing which file it came from. I would greatly appreciate any help with this.
Posted on 2019-05-10 22:33:12
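Well, as so often happens, after some more digging I found a workable solution. Perhaps my question was needlessly complicated, and the real problem was figuring out the correct root element and properly handling (and targeting) its children and grandchildren.
Anyway, another Stack Overflow thread put me on the right track, so the solution that works for me now looks like this: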
#!/usr/bin/env python3
#
import sys
import os
from xml.sax.saxutils import escape
from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]
tree = etree.parse(file)
root = tree.getroot()


def clean(text):
    # replace some troublesome and unnecessary codes, then escape &, < and >
    return escape((text or '').replace('^n', ' ').replace('^b', ' '))


print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

# one pass per <file> element; its attributes apply to every string inside it
for element in root.iter('file'):
    FileOrigin = os.path.basename(element.attrib['origin'])
    Source = element.attrib['source-language']
    Target = element.attrib['target-language']
    # now the children: one <tu> per <trans-unit>
    for unit in element.iter('trans-unit'):
        note = unit.findtext('note')
        print("<tu>")
        print("\t<prop type=\"FileSource\">" + escape(FileOrigin) + "</prop>")
        print("\t<tuv xml:lang=\"" + Source + "\">")
        print("\t\t<seg>" + clean(unit.findtext('source')) + "</seg>")
        print("\t</tuv>")
        print("\t<tuv xml:lang=\"" + Target + "\">")
        # the note is optional; when present it goes before <seg>
        if note:
            print("\t\t<prop type=\"Note\">" + clean(note) + "</prop>")
        print("\t\t<seg>" + clean(unit.findtext('target')) + "</seg>")
        print("\t</tuv>")
        print("</tu>")

print("</body>\n</tmx>")
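It could probably be tidied up a bit and given more bells and whistles, but overall this solves my original problem. Maybe it will help someone else trying to parse XLIFF...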
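For very large XLIFF files, a streaming variant closer to the iterparse approach from the question might be worth a look. What follows is only a sketch under assumptions, not part of the solution above: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml, and the names iter_units and SAMPLE are made up for illustration. The idea is to remember the attributes of the current <file> on its "start" event and attach them to each <trans-unit> once that unit is complete.

```python
import io
import os
import xml.etree.ElementTree as ET


def iter_units(xliff_source):
    """Yield one dict per <trans-unit>, tagged with its enclosing <file> (hypothetical helper)."""
    current = {}
    for event, elem in ET.iterparse(xliff_source, events=("start", "end")):
        if event == "start" and elem.tag == "file":
            # Attributes are already available when the "start" event fires.
            current = {
                "file": os.path.basename(elem.get("origin", "")),
                "src_lang": elem.get("source-language"),
                "tgt_lang": elem.get("target-language"),
            }
        elif event == "end" and elem.tag == "trans-unit":
            # On "end" the unit's children have been fully parsed.
            yield dict(
                current,
                source=elem.findtext("source", default=""),
                target=elem.findtext("target", default=""),
                note=elem.findtext("note"),  # None when there is no <note>
            )
            elem.clear()  # drop the finished unit to keep memory use low


# Made-up sample data, trimmed down from the file in the question.
SAMPLE = b"""<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
<body>
<trans-unit id="1"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
<trans-unit id="2"><source>return to project list</source><target>povratak na popis projekata</target></trans-unit>
</body>
</file>
</xliff>"""

units = list(iter_units(io.BytesIO(SAMPLE)))
for unit in units:
    print(unit["file"], unit["src_lang"], unit["source"], "->", unit["tgt_lang"], unit["target"])
```

Each yielded dict carries everything one <tu> needs, so the printing code can stay separate from the parsing code.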
Posted on 2021-05-10 10:13:38
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.ElementTree(file='inputfile.xlf')
root = tree.getroot()
for tag in root.findall('file'):
    # language attributes of each <file> element
    s_value = tag.get('source-language')
    t_value = tag.get('target-language')
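As a rough sketch of how those two values might then be used, the snippet below builds the <tuv> elements with ElementTree instead of printing strings, so that escaping and closing tags are left to the serializer. The helper make_tuv, the constant XML_LANG and the inline sample are assumptions made for illustration only; the Clark-notation attribute name is how ElementTree spells xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tuv(lang, text):
    """Build <tuv xml:lang="..."><seg>text</seg></tuv> (hypothetical helper)."""
    tuv = ET.Element("tuv", {XML_LANG: lang})
    ET.SubElement(tuv, "seg").text = text
    return tuv


# Made-up sample data, trimmed down from the file in the question.
root = ET.fromstring(
    '<xliff version="1.0">'
    '<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">'
    '<body><trans-unit id="8.text"><source>End</source><target>Kraj</target></trans-unit></body>'
    '</file></xliff>'
)

rendered = []
for file_elem in root.findall("file"):
    s_value = file_elem.get("source-language")
    t_value = file_elem.get("target-language")
    for unit in file_elem.iter("trans-unit"):
        tu = ET.Element("tu")
        tu.append(make_tuv(s_value, unit.findtext("source", default="")))
        tu.append(make_tuv(t_value, unit.findtext("target", default="")))
        rendered.append(ET.tostring(tu, encoding="unicode"))

print("\n".join(rendered))
```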
https://stackoverflow.com/questions/56080925