When training multilingual machine translation (MT) models, we face imbalanced training sets: some languages have far more training data than others. The standard practice is to up-sample the less-resourced languages to increase their representation, and the degree of up-sampling has a large effect on overall performance. In this paper, we propose a method that instead automatically learns how to weight the training data through a data scorer, which is optimized to maximize performance on all test languages. Experiments on two sets of languages, under both one-to-many and many-to-one MT settings, show that our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over which languages' performance is optimized.
Original title: Balancing Training for Multilingual Neural Machine Translation
Original abstract: When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase representation, and the degree of up-sampling has a large effect on the overall performance. In this paper, we propose a method that instead automatically learns how to weight training data through a data scorer that is optimized to maximize performance on all test languages. Experiments on two sets of languages under both one-to-many and many-to-one MT settings show our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over the performance of which languages are optimized.
Original author: Xinyi Wang
Original link: https://arxiv.org/abs/2004.06748
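To make the heuristic baseline mentioned above concrete, here is a minimal sketch of temperature-based up-sampling, where each language is sampled with probability proportional to its corpus size raised to 1/T. It illustrates the standard practice the abstract contrasts against, not the paper's learned data scorer; the function name `temperature_sampling_probs` and the corpus sizes are hypothetical.

```python
import numpy as np

# A minimal sketch of temperature-based up-sampling, the standard heuristic the
# abstract refers to (NOT the paper's learned data scorer). Language k is sampled
# with probability proportional to |D_k|**(1/T): T = 1 keeps the raw data
# distribution, while larger T flattens it, up-sampling low-resource languages.

def temperature_sampling_probs(dataset_sizes, temperature=5.0):
    """Per-language sampling probabilities computed from corpus sizes."""
    sizes = np.asarray(dataset_sizes, dtype=np.float64)
    p = sizes / sizes.sum()          # empirical data distribution
    q = p ** (1.0 / temperature)     # temperature-scaled weights
    return q / q.sum()               # renormalize to a distribution

# Hypothetical corpus sizes: one high-resource and two low-resource languages.
sizes = [1_000_000, 50_000, 10_000]
print(temperature_sampling_probs(sizes, temperature=1.0))  # ~[0.94, 0.05, 0.01]
print(temperature_sampling_probs(sizes, temperature=5.0))  # ~[0.51, 0.28, 0.20]
```

With T = 1 sampling follows the raw data distribution, while larger temperatures push it toward uniform; this is exactly the degree-of-up-sampling knob whose setting, as the abstract notes, has a large effect on overall performance, and which the paper replaces with a data scorer learned to maximize performance on all test languages.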