Python: parsing an XML (XLIFF) file (including the header)

Stack Overflow user

Asked 2019-05-10 15:45:13

2 answers · 2.8K views · 0 followers · 2 votes

I am trying to parse an XML file (more precisely, an XLIFF translation file) and convert it to the slightly different TMX format.

My source XLIFF file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
    <header>
      <count-group name="SomeFile.strings">
        <count count-type="total" unit="word">2</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
    <header>
      <count-group name="SomeOtherFile.strings">
        <count count-type="total" unit="word">4</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
    </body>
  </file>

  (and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)

  </xliff>

What I want to do is rearrange and simplify this content, reducing the above to the following format:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeOtherFile.strings</prop>
    <tuv xml:lang="en">
        <seg>return to project list</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is again a note</prop>
        <seg>povratak na popis projekata</seg>
    </tuv>
</tu>

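Note that the original XLIFF file may have several <file origin ...> sections, each with many <trans-unit ...> elements (the actual strings from that file).

I have managed to write the part that gives me the "Source" and "Target" pieces, but what I still need is the information from the <file origin ...> element: where the languages are defined (i.e. "source-language" and "target-language", which I will then write as <tuv xml:lang="en"> and <tuv xml:lang="hr"> for each string), and where I can find the reference to the strings file (e.g. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).

Currently I have the following Python code, which extracts the needed "source" and "target" elements nicely: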

#!/usr/bin/env python3
#

import sys

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]

tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text  + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        # the note is the last child of a trans-unit, so the unit is closed here
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
        print("</TransUnit>")

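Now, how can I also extract the source language (i.e. the value "en"), the target language (i.e. the value "hr") and the file reference (i.e. "SomeFile.strings") from the <file origin ...> element of the original XLIFF file?

In addition, I need to retain (remember) that file reference, i.e.: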

<prop type="FileSource">SomeOtherFile.strings</prop>

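for all the translation units (<tu>) belonging to that file (there may be many, unlike in the example above, where each "file" has only one).

So, for example, I would have: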

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>Start</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Početak</seg>
    </tuv>
</tu>

where each <tu> element has a <prop type="FileSource"> element showing which file it came from.

I would greatly appreciate any help with this.

xliff

python

lxml

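2 Answers

Stack Overflow user

Answered 2019-05-10 22:33:12

Well, as so often happens, after some more digging I found a workable solution. Perhaps my question was needlessly complicated, and the real problem was identifying the correct root element and correctly handling (and targeting) its children and descendants.

In any case, another Stack Overflow thread put me on the right track, so the solution that works for me now looks like this: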

#!/usr/bin/env python3
#

import sys
import os
from xml.sax.saxutils import escape

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]

tree = etree.parse(file)
root = tree.getroot()


def clean(text):
    # replacing some troublesome and unnecessary codes, then escaping &, < and >
    return escape((text or '').replace('^n', ' ').replace('^b', ' '))


print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

# each <file> element carries the origin and the language pair of its units
for element in root.iter('file'):
    FileOrigin = os.path.basename(element.attrib['origin'])
    Source = element.attrib['source-language']
    Target = element.attrib['target-language']
    # now the children: one <tu> per <trans-unit>, so every <tuv> and <tu> gets closed
    for unit in element.iter('trans-unit'):
        note = unit.findtext('note')
        print("<tu>")
        print("\t<prop type=\"FileSource\">" + escape(FileOrigin) + "</prop>")
        print("\t<tuv xml:lang=\"" + Source + "\">")
        print("\t\t<seg>" + clean(unit.findtext('source')) + "</seg>")
        print("\t</tuv>")
        print("\t<tuv xml:lang=\"" + Target + "\">")
        # the note goes before <seg>, as in the target format shown in the question
        if note:
            print("\t\t<prop type=\"Note\">" + clean(note) + "</prop>")
        print("\t\t<seg>" + clean(unit.findtext('target')) + "</seg>")
        print("\t</tuv>")
        print("</tu>")
print("</body>\n</tmx>")

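It could probably be tidied up a bit and given more bells and whistles, but on the whole this solves my original problem. Maybe it will help others who are trying to parse XLIFF...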
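The question started from iterparse, and the requirement to remember the file reference for every unit can also be met while streaming. The following is only a minimal sketch under a few assumptions: it uses the standard-library xml.etree.ElementTree (lxml's etree.iterparse accepts the same events argument), the XLIFF has no namespace (as in the question), and the names iter_units and SAMPLE are hypothetical, introduced purely for illustration.

```python
import io
import os
import xml.etree.ElementTree as ET

# Hypothetical sample, reduced from the XLIFF shown in the question.
SAMPLE = """<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
    <body>
      <trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
      <trans-unit id="9.text"><source>Start</source><target>Početak</target></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr">
    <body>
      <trans-unit id="14.accessibilityLabel"><source>return to project list</source><target>povratak na popis projekata</target></trans-unit>
    </body>
  </file>
</xliff>""".encode("utf-8")


def iter_units(source):
    """Yield (file_name, src_lang, tgt_lang, source, target, note) per <trans-unit>."""
    context = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "file":
            # Attributes are already set on "start", so the file context
            # can be remembered before any of its units are seen.
            context = (
                os.path.basename(elem.get("origin", "")),
                elem.get("source-language"),
                elem.get("target-language"),
            )
        elif event == "end" and elem.tag == "trans-unit":
            # On "end" the unit's children and their text are complete.
            yield context + (
                elem.findtext("source", ""),
                elem.findtext("target", ""),
                elem.findtext("note"),  # None when the unit has no <note>
            )
            elem.clear()  # release the unit's children once it has been consumed


for unit in iter_units(io.BytesIO(SAMPLE)):
    print(unit)
```

Reading the attributes on the "start" event is what lets every later unit inherit its file's name and language pair. A unit without a <note> simply yields None in the last position.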

Votes: 1

Stack Overflow user

Answered 2021-05-10 10:13:38

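# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the same API
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='inputfile.xlf')

root = tree.getroot()

# read both language attributes from each <file> element in a single pass
for tag in root.findall('file'):
    s_value = tag.get('source-language')
    t_value = tag.get('target-language')
    print(s_value, t_value)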
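If the language pair is needed again later, for instance while writing each <tuv>, it may help to keep the values per file rather than in a single pair of variables. A small sketch, assuming un-namespaced XLIFF as in the question. The helper name language_pairs and the reduced SAMPLE string are hypothetical:

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <file> attributes from the question's XLIFF.
SAMPLE = """<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr"/>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr"/>
</xliff>"""


def language_pairs(root):
    """Map each <file>'s base name to its (source-language, target-language) pair."""
    return {
        os.path.basename(f.get("origin", "")): (
            f.get("source-language"),
            f.get("target-language"),
        )
        for f in root.findall("file")
    }


pairs = language_pairs(ET.fromstring(SAMPLE))
print(pairs)
```

Keying the dictionary by the base name of origin matches the value the question wants in <prop type="FileSource">. Two files with the same base name in different folders would collide, so the full origin may be a safer key in that case.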

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56080925
