
NLP Academic Digest [7.14]

Author: arXiv每日学术速递 (arXiv Daily Academic Digest, WeChat public account)
Published 2021-07-27 10:55:37
This article is included in the column: arXiv每日学术速递

Visit www.arxivdaily.com for digests with abstracts, covering CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, plus search, bookmarking, and posting features!

cs.CL: 20 papers today

Transformer (2 papers)

【1】 Combiner: Full Attention Transformer with Sparse Computation Cost

Authors: Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai
Affiliation: University of Alberta
Link: https://arxiv.org/abs/2107.05768
Abstract: Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the sequence length $L$ in attention layers, which restricts application to extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention or through indirect attention to abstractions, which are in turn conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such a factorization for full attention, resulting in the same sub-quadratic cost ($\mathcal{O}(L\log(L))$ or $\mathcal{O}(L\sqrt{L})$). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. Experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
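
To make the conditional-expectation view concrete, here is a minimal formalization (notation is ours, inferred from the abstract rather than quoted from the paper). Standard attention at position $i$ computes

$$\mathrm{attn}(x_i)=\mathbb{E}_{p(j\mid i)}\left[v_j\right],\qquad p(j\mid i)=\mathrm{softmax}_j\!\left(q_i^{\top}k_j/\sqrt{d}\right),$$

and a Combiner-style structured factorization routes attention through regions, with $r(j)$ denoting the region containing position $j$:

$$p(j\mid i)=p\left(j\mid r(j),\,i\right)\,p\left(r(j)\mid i\right).$$

Each position then attends directly to only a small set of locations and reaches every other region through a single abstraction (itself a conditional expectation of that region's embeddings), which is what drives the cost below $\mathcal{O}(L^2)$.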

【2】 Uncertainty-based Query Strategies for Active Learning with Transformers

Authors: Christopher Schröder, Andreas Niekler, Martin Potthast
Affiliation: Leipzig University
Link: https://arxiv.org/abs/2107.05687
Abstract: Active learning is the iterative construction of a classification model through targeted labeling, enabling significant labeling cost savings. As most research on active learning was carried out before transformer-based language models ("transformers") became popular, comparably few papers have investigated how transformers can be combined with active learning to date, despite its practical importance. This can be attributed to the fact that using state-of-the-art query strategies for transformers induces a prohibitive runtime overhead, which effectively cancels out, or even outweighs, the aforementioned cost savings. In this paper, we revisit uncertainty-based query strategies, which had been largely outperformed before but are particularly suited to the context of fine-tuning transformers. In an extensive evaluation on five widely used text classification benchmarks, we show that considerable improvements of up to 14.4 percentage points in area under the learning curve are achieved, as well as a final accuracy close to the state of the art for all but one benchmark, using only between 0.4% and 15% of the training data.
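
For readers unfamiliar with this family of strategies, prediction entropy is one classic uncertainty-based query criterion. The sketch below is a generic illustration of entropy-based selection (our code, not the authors' implementation):

```python
import numpy as np

def entropy_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Select the k unlabeled examples the model is least certain about.

    probs: (n_unlabeled, n_classes) softmax outputs of the current classifier.
    Returns the indices of the k highest-entropy examples to send for labeling.
    """
    # Prediction entropy: H(x) = -sum_c p(c|x) * log p(c|x)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]
```

In each active-learning iteration, the transformer is fine-tuned on the labeled pool, the selected examples are sent for labeling, and the loop repeats.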

Machine Translation (3 papers)

【1】 The IWSLT 2021 BUT Speech Translation Systems

Authors: Hari Krishna Vydana, Martin Karafiát, Lukáš Burget, "Honza" Černocký
Affiliation: Brno University of Technology
Link: https://arxiv.org/abs/2107.06155
Abstract: The paper describes BUT's English-to-German offline speech translation (ST) systems developed for IWSLT 2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation (ASR-MT) models. Their performance is evaluated on the MuST-C Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are utilized to pre-train the ASR and MT models. Speech-translation data is used to jointly optimize the ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR decoder as the input to the MT module. We show that speech translation can be further improved by training the ASR decoder jointly with the MT module using a large amount of text-only MT training data. We also show significant improvements from training an ASR module capable of generating punctuated text, rather than leaving the punctuation task to the MT module.

【2】 Zero-shot Speech Translation

Authors: Tu Anh Dinh
Affiliation: Department of Data Science and Knowledge Engineering, Maastricht University, Maastricht, The Netherlands
Link: https://arxiv.org/abs/2107.06010
Abstract: Speech Translation (ST) is the task of translating speech in one language into text in another language. Traditional cascaded approaches to ST, using Automatic Speech Recognition (ASR) and Machine Translation (MT) systems, are prone to error propagation. End-to-end approaches use only one system to avoid propagating errors, yet are difficult to employ due to data scarcity. We explore zero-shot translation, which enables translating a pair of languages that is unseen during training, thus avoiding the use of end-to-end ST data. Zero-shot translation has been shown to work for multilingual machine translation, yet has not been studied for speech translation. We attempt to build zero-shot ST models that are trained only on ASR and MT tasks but can perform the ST task during inference. The challenge is that the representations of text and audio are significantly different, so the models learn the ASR and MT tasks in different ways, making it non-trivial to perform zero-shot ST. These models tend to output the wrong language when performing zero-shot ST. We tackle this issue by including additional training data and an auxiliary loss function that minimizes the text-audio difference. Our experimental results and analysis show that these methods are promising for zero-shot ST. Moreover, our methods are particularly useful in the few-shot setting where a limited amount of ST data is available, with improvements of up to +11.8 BLEU points compared to direct end-to-end ST models and +3.9 BLEU points compared to ST models fine-tuned from a pre-trained ASR model.
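
The abstract names an auxiliary loss that minimizes the text-audio difference but does not give its exact form; below is a minimal PyTorch sketch of one plausible choice (pooled encoder outputs with an MSE penalty — an assumption on our part, not the paper's formulation):

```python
import torch.nn.functional as F

def modality_gap_loss(audio_repr, text_repr):
    """Hypothetical auxiliary loss: pull the speech encoder's pooled
    representation of an utterance toward the text encoder's pooled
    representation of its transcript.

    audio_repr, text_repr: (batch, hidden) mean-pooled encoder outputs.
    """
    return F.mse_loss(audio_repr, text_repr)

# Joint objective sketch: loss = asr_loss + mt_loss + lam * modality_gap_loss(a, t)
```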

【3】 Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

Authors: Yun Tang, Juan Pino, Xian Li, Changhan Wang, Dmitriy Genzel
Affiliation: Facebook AI
Note: Accepted by ACL 2021
Link: https://arxiv.org/abs/2107.05782
Abstract: Pretraining and multitask learning are widely used to improve speech-to-text translation performance. In this study, we are interested in training a speech-to-text translation model along with an auxiliary text-to-text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities and to preserve more information from the pretrained text translation modules. We observe minimal negative transfer between the two tasks, and sharing more parameters helps transfer knowledge from the text task to the speech task. The analysis also reveals that the modality representation difference at the top decoder layers is still not negligible, and those layers are critical for translation quality. Inspired by these findings, we propose three methods to improve translation quality. First, a parameter-sharing and initialization strategy is proposed to enhance information sharing between the tasks. Second, a novel attention-based regularization is proposed for the encoders, pulling the representations from different modalities closer. Third, an online knowledge distillation method is proposed to enhance knowledge transfer from the text task to the speech task. Our experiments show that the proposed approach improves translation performance by more than 2 BLEU over a strong baseline and achieves state-of-the-art results on the MuST-C English-German, English-French, and English-Spanish language pairs.

Semantic Analysis (2 papers)

【1】 Exploiting Network Structures to Improve Semantic Representation for the Financial Domain

Authors: Chao Feng, Shi-jie We
Affiliations: University of Zurich; Harbin Institute of Technology
Note: 5 pages, 4 figures
Link: https://arxiv.org/abs/2107.05885
Abstract: This paper presents the participation of the MiniTrue team in the FinSim-3 shared task on learning semantic similarities for the financial domain in the English language. Our approach combines contextual embeddings learned by transformer-based language models with network structure embeddings extracted from external knowledge sources, to create more meaningful representations of financial domain entities and terms. For this, two BERT-based language models and a knowledge graph embedding model are used. In addition, we propose a voting function to join the three base models for the final inference. Experimental results show that the model with the knowledge graph embeddings achieves superior results compared to the models with only contextual embeddings. Nevertheless, we also observe that our voting function brings an extra benefit to the final system.

【2】 Enforcing Consistency in Weakly Supervised Semantic Parsing

Authors: Nitish Gupta, Sameer Singh, Matt Gardner
Affiliations: University of Pennsylvania; University of California, Irvine; Allen Institute for AI
Note: Published in ACL 2021
Link: https://arxiv.org/abs/2107.05833
Abstract: The predominant challenge in weakly supervised semantic parsing is that of spurious programs that evaluate to correct answers for the wrong reasons. Prior work uses elaborate search strategies to mitigate the prevalence of spurious programs; however, they typically consider only one input at a time. In this work we explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs. We bias the program search (and thus the model's training signal) towards programs that map the same phrase in related inputs to the same sub-parts in their respective programs. Additionally, we study the importance of designing logical formalisms that facilitate this kind of consistency-based training. We find that a more consistent formalism leads to improved model performance even without consistency-based training. When combined together, these two insights lead to a 10% absolute improvement over the best prior result on the Natural Language Visual Reasoning dataset.

Summarization | Information Extraction (2 papers)

【1】 Fairness-aware Summarization for Justified Decision-Making

Authors: Moniba Keymanesh, Tanya Berger-Wolf, Micha Elsner, Srinivasan Parthasarathy
Affiliation: The Ohio State University
Note: 16 pages, 7 figures
Link: https://arxiv.org/abs/2107.06243
Abstract: In many applications such as recidivism prediction, facility inspection, and benefit assignment, it is important for individuals to know the decision-relevant information for the model's prediction. In addition, the model's predictions should be fairly justified. Essentially, decision-relevant features should provide sufficient information for the predicted outcome and should be independent of the membership of individuals in protected groups such as race and gender. In this work, we focus on the problem of (un)fairness in the justifications of text-based neural models. We tie the explanatory power of the model to fairness in the outcome and propose a fairness-aware summarization mechanism to detect and counteract the bias in such models. Given a potentially biased natural language explanation for a decision, we use a multi-task neural model and an attribution mechanism based on integrated gradients to extract the high-utility and discrimination-free justifications in the form of a summary. The extracted summary is then used to train a model that makes decisions for individuals. Results on several real-world datasets suggest that our method (i) assists users in understanding what information is used for the model's decision and (ii) enhances the fairness of outcomes while significantly reducing demographic leakage.
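
Integrated gradients itself is a standard attribution method; the following PyTorch sketch shows its usual Riemann-sum approximation (illustrative code, not the authors' implementation):

```python
import torch

def integrated_gradients(model, x, baseline, steps: int = 50):
    """Approximate IG attributions for a scalar-output model.

    IG_i(x) = (x_i - x'_i) * (1/m) * sum_{k=1..m} dF(x' + (k/m)(x - x'))/dx_i,
    where x' is a reference input (e.g., an all-zeros or padding embedding).
    """
    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # Point on the straight-line path from the baseline to the input.
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        score = model(point)  # must return a scalar, e.g., one class logit
        grad, = torch.autograd.grad(score, point)
        total_grads += grad
    # Scale the averaged gradients by the input-baseline difference.
    return (x - baseline) * total_grads / steps
```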

【2】 A Dialogue-based Information Extraction System for Medical Insurance Assessment

Authors: Shuang Peng, Mengdi Zhou, Minghui Yang, Haitao Mi, Shaosheng Cao, Zujie Wen, Teng Xu, Hongbin Wang, Lei Liu
Affiliation: Ant Group, Hangzhou, China
Note: To be published in the Findings of ACL 2021
Link: https://arxiv.org/abs/2107.05866
Abstract: In the Chinese medical insurance industry, the assessor's role is essential and requires significant effort to converse with the claimant. This is a highly professional job that involves many parts, such as identifying personal information, collecting related evidence, and producing a final insurance report. Due to the coronavirus (COVID-19) pandemic, the previously offline insurance assessment has to be conducted online. However, for junior assessors, who often lack practical experience, it is not easy to quickly handle such a complex online procedure, yet this is important, as the insurance company needs to decide how much compensation the claimant should receive based on the assessor's feedback. In order to improve assessors' work efficiency and speed up the overall procedure, in this paper we propose a dialogue-based information extraction system that integrates advanced NLP technologies for medical insurance assessment. With the assistance of our system, the average time cost of the procedure is reduced from 55 minutes to 35 minutes, and the total human resource cost is reduced by 30% compared with the previous offline procedure. To date, the system has already served thousands of online claim cases.

GAN | Adversarial | Attacks | Generation (2 papers)

【1】 Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

Authors: Alina Karakanta, Marco Gaido, Matteo Negri, Marco Turchi
Affiliations: Fondazione Bruno Kessler, Via Sommarive, Povo, Trento, Italy; University of Trento, Italy
Note: Accepted at IWSLT 2021
Link: https://arxiv.org/abs/2107.06246
Abstract: Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source-language transcription and timing (i.e., captions). However, the joint generation of source captions and target subtitles not only brings potential output quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models which generate consistent captions and subtitles in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing sufficient flexibility to produce subtitles conforming to language-specific needs and norms.

【2】 Generating Gender Augmented Data for NLP

Authors: Nishtha Jain, Maja Popovic, Declan Groves, Eva Vanmassenhove
Affiliations: ADAPT Centre, Trinity College Dublin; ADAPT Centre, Dublin City University; Microsoft, Dublin; Department of CSAI, Tilburg University
Note: 10 pages, 4 tables
Link: https://arxiv.org/abs/2107.05987
Abstract: Gender bias is a frequent occurrence in NLP-based applications, and is especially pronounced in gender-inflected languages. Bias can appear through associations of certain adjectives and animate nouns with the natural gender of referents, but also due to unbalanced grammatical gender frequencies of inflected words. This type of bias becomes more evident when generating conversational utterances where gender is not specified within the sentence, because most current NLP applications still work on a sentence-level context. As a step towards more inclusive NLP, this paper proposes an automatic and generalisable rewriting approach for short conversational sentences. The rewriting method can be applied to sentences that, without extra-sentential context, have multiple equivalent alternatives in terms of gender. The method can be used both for creating gender-balanced outputs and for creating gender-balanced training data. The proposed approach is based on a neural machine translation (NMT) system trained to 'translate' from one gender alternative to another. Both the automatic and the manual analysis of the approach show promising results for the automatic generation of gender alternatives for conversational sentences in Spanish.

Recognition / Classification (2 papers)

【1】 Conformer-based End-to-end Speech Recognition with Rotary Position Embedding

Authors: Shengqiang Li, Menglong Xu, Xiao-Lei Zhang
Affiliation: CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, China
Note: 5 pages, 3 figures
Link: https://arxiv.org/abs/2107.05907
Abstract: Transformer-based end-to-end speech recognition models have received considerable attention in recent years due to their high training speed and ability to model long-range global context. Position embedding in the transformer architecture is indispensable because it provides supervision for dependency modeling between elements at different positions in the input sequence. To make use of the time order of the input sequence, many works inject information about the relative or absolute position of each element into the input sequence. In this work, we investigate various position embedding methods in the convolution-augmented transformer (Conformer) and adopt a novel implementation named rotary position embedding (RoPE). RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module. To evaluate the effectiveness of the RoPE method, we conducted experiments on the AISHELL-1 and LibriSpeech corpora. Results show that the Conformer enhanced with RoPE achieves superior performance on the speech recognition task. Specifically, our model achieves relative word error rate reductions of 8.70% and 7.27% over the Conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
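
RoPE has a simple closed form; the snippet below is an illustrative PyTorch implementation of the rotation applied to queries or keys (our sketch, using the conventional base of 10000, not the authors' code):

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to a (seq_len, dim) tensor, dim even.

    Each consecutive channel pair is rotated by an angle proportional to the
    absolute position, so q_m . k_n ends up depending only on the offset m - n.
    """
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=x.dtype).unsqueeze(1)          # (L, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=x.dtype) / dim)  # (d/2,)
    angles = pos * freqs                                             # (L, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```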

【2】 Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation

Authors: Evan Williams, Paul Rodrigues, Sieu Tran
Affiliation: Accenture, N. Glebe Rd., Arlington, USA
Note: To appear as: Evan Williams, Paul Rodrigues, Sieu Tran. Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation. In: Faggioli et al., Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum. Bucharest, Romania, 21-24 September 2021
Link: https://arxiv.org/abs/2107.05684
Abstract: This paper discusses the approach used by the Accenture Team for the CLEF2021 CheckThat! Lab, Task 1, to identify whether a claim made on social media would be interesting to a wide audience and should be fact-checked. Twitter training and test data were provided in English, Arabic, Spanish, Turkish, and Bulgarian. Claims were to be classified (check-worthy/not check-worthy) and ranked in priority order for the fact-checker. Our method used deep neural network transformer models with contextually sensitive lexical augmentation applied to the supplied training datasets to create additional training samples. This augmentation approach improved performance for all languages. Overall, our architecture and data augmentation pipeline produced the best submitted system for Arabic, and performance scaled according to the quantity of provided training data for English, Spanish, Turkish, and Bulgarian. This paper investigates the deep neural network architectures for each language as well as the provided data to examine why the approach worked so effectively for Arabic, and discusses additional data augmentation measures that could be useful for this problem.
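
The paper's exact augmentation pipeline is not reproduced in this digest; a minimal sketch of contextually sensitive lexical substitution with a masked language model (the model choice and the `augment` helper are our assumptions) could look like:

```python
from transformers import pipeline

# Multilingual masked LM, since the task covers five languages (an assumption).
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def augment(sentence: str, word: str, top_k: int = 3) -> list[str]:
    """Create extra training samples by masking one word and letting the
    masked LM propose contextually plausible replacements."""
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=top_k)
    # Keep only substitutions that actually differ from the original word.
    return [c["sequence"] for c in candidates
            if c["token_str"].strip().lower() != word.lower()]
```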

Zero/Few/One-Shot | Transfer | Adaptation (1 paper)

【1】 Few-shot Language Coordination by Modeling Theory of Mind

Authors: Hao Zhu, Graham Neubig, Yonatan Bisk
Affiliation: Language Technologies Institute, Carnegie Mellon University
Note: Thirty-eighth International Conference on Machine Learning (ICML 2021)
Link: https://arxiv.org/abs/2107.05697
Abstract: "No man is an island." Humans communicate with a large community by coordinating with different interlocutors within short conversations. This ability has been understudied in research on building neural communicative agents. We study the task of few-shot language coordination: agents quickly adapting to their conversational partners' language abilities. Unlike current communicative agents trained with self-play, we require the lead agent to coordinate with a population of agents with different linguistic abilities, quickly adapting to communicate with unseen agents in the population. This requires the ability to model the partner's beliefs, a vital component of human communication. Drawing inspiration from theory of mind (ToM; Premack & Woodruff, 1978), we study the effect of the speaker explicitly modeling the listener's mental states. The speakers, as shown in our experiments, acquire the ability to predict the reactions of their partner, which helps them generate instructions that concisely express their communicative goal. We examine our hypothesis that instructions generated with ToM modeling yield better communication performance in both a referential game and a language navigation task. Positive results from our experiments hint at the importance of explicitly modeling communication as a socio-pragmatic process.

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 Deep Neural Networks Evolve Human-like Attention Distribution during Reading Comprehension

Authors: Jiajie Zou, Nai Ding
Affiliations: Zhejiang Lab; College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China; Key Laboratory for Biomedical Engineering of Ministry of Education
Link: https://arxiv.org/abs/2107.05799
Abstract: Attention is a key mechanism for information selection in both biological brains and many state-of-the-art deep neural networks (DNNs). Here, we investigate whether humans and DNNs allocate attention in comparable ways when reading a text passage to subsequently answer a specific question. We analyze 3 transformer-based DNNs that reach human-level performance when trained to perform the reading comprehension task. We find that the DNN attention distribution quantitatively resembles the human attention distribution measured by fixation times. Human readers fixate longer on words that are more relevant to the question-answering task, demonstrating that attention is modulated by top-down reading goals, on top of lower-level visual and text features of the stimulus. Further analyses reveal that the attention weights in DNNs are also influenced by both top-down reading goals and lower-level stimulus features, with the shallow layers more strongly influenced by lower-level text features and the deep layers attending more to task-relevant words. Additionally, the deep layers' attention to task-relevant words gradually emerges when pre-trained DNN models are fine-tuned to perform the reading comprehension task, which coincides with the improvement in task performance. These results demonstrate that DNNs can evolve human-like attention distributions through task optimization, which suggests that human attention during goal-directed reading comprehension is a consequence of task optimization.

【2】 A Configurable Multilingual Model is All You Need to Recognize All Languages

Authors: Long Zhou, Jinyu Li, Eric Sun, Shujie Liu
Affiliations: Microsoft Research Asia; Microsoft Speech and Language Group
Link: https://arxiv.org/abs/2107.05876
Abstract: Multilingual automatic speech recognition (ASR) models have shown great promise in recent years because of their simplified model training and deployment process. Conventional methods either train a universal multilingual model without using any language information or use a one-hot language ID (LID) vector to guide the recognition of the target language. In practice, the user can be prompted to pre-select several languages he/she can speak. A multilingual model without LID cannot well utilize the language information set by the user, while a multilingual model with LID can only handle one pre-selected language. In this paper, we propose a novel configurable multilingual model (CMM) which is trained only once but can be configured as different models based on users' choices, by extracting language-specific modules together with a universal model from the trained CMM. In particular, a single CMM can be deployed in any user scenario where the users can pre-select any combination of languages. Trained on 75K hours of transcribed anonymized Microsoft multilingual data and evaluated on 10-language test sets, the proposed CMM improves over the universal multilingual model by 26.0%, 16.9%, and 10.4% relative word error reduction when the user selects 1, 2, or 3 languages, respectively. CMM also performs significantly better on code-switching test sets.

Other (4 papers)

【1】 Indian Legal NLP Benchmarks: A Survey

Authors: Prathamesh Kalamkar, Janani Venugopalan, Ph.D., Vivek Raghavan, Ph.D.
Affiliation: ThoughtWorks
Note: Janani Venugopalan and Vivek Raghavan are joint first authors; this work is funded by EkStep
Link: https://arxiv.org/abs/2107.06056
Abstract: The availability of challenging benchmarks is key to the advancement of AI in a specific field. Since legal text is significantly different from normal English text, there is a need to create separate natural language processing benchmarks for Indian legal text that are challenging and focus on tasks specific to legal systems. This will spur innovation in applications of natural language processing for Indian legal text and will benefit the AI community and the legal fraternity. We review the existing work in this area and propose ideas to create new benchmarks for Indian legal natural language processing.

【2】 On the Difficulty of Translating Free-Order Case-Marking Languages

Authors: Arianna Bisazza, Ahmet Üstün, Stephan Sportel
Affiliation: Center for Language and Cognition, University of Groningen
Note: Accepted to TACL, pre-MIT Press publication version
Link: https://arxiv.org/abs/2107.06055
Abstract: Identifying factors that make certain languages harder to model than others is essential to reaching language equality in future natural language processing technologies. Free-order case-marking languages, such as Russian, Latin, or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also harder to translate with state-of-the-art neural machine translation (NMT) models. Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking. However, in medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.

【3】 Rating Facts under Coarse-to-fine Regimes

Authors: Guojun Wu
Affiliation: National Taiwan University of Science and Technology
Link: https://arxiv.org/abs/2107.06051
Abstract: The rise of manipulated fake news as a political weapon has become a global concern and has highlighted the incapability of manual fact-checking against rapidly produced fake news. Thus, statistical approaches are required if we are to address this problem efficiently. The shortage of publicly available datasets is one major bottleneck of automated fact-checking. To remedy this, we collected 24K manually rated statements from PolitiFact. The class values exhibit a natural order with respect to truthfulness, as shown in Table 1 (of the paper). Thus, our task represents a twist on standard classification, due to the varying degrees of similarity between classes. To investigate this, we defined coarse-to-fine classification regimes, which present a new challenge for classification. To address this, we propose BERT-based models. After training, class similarity is sensible over the multi-class datasets, especially in the fine-grained one. Under all the regimes, BERT achieves the state of the art, while additional layers provide insignificant improvement.

【4】 Quantifying Explainability in NLP and Analyzing Algorithms for Performance-Explainability Tradeoff

Authors: Mitchell Naylor, Christi French, Samantha Terker, Uday Kamath
Note: To appear at the Interpretable ML in Healthcare workshop at ICML 2021. 9 pages (excluding references), 6 figures
Link: https://arxiv.org/abs/2107.05693
Abstract: The healthcare domain is one of the most exciting application areas for machine learning, but a lack of model transparency contributes to a lag in adoption within the industry. In this work, we explore the current art of explainability and interpretability in a case study of clinical text classification, using the task of mortality prediction on MIMIC-III clinical notes. We demonstrate various visualization techniques for fully interpretable methods as well as model-agnostic post hoc attributions, and we provide a generalized method for evaluating the quality of explanations using infidelity and local Lipschitz estimates across model types, from logistic regression to BERT variants. With these metrics, we introduce a framework through which practitioners and researchers can assess the frontier between a model's predictive performance and the quality of its available explanations. We make our code available to encourage continued refinement of these methods.
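
For reference, the two explanation-quality measures named here are commonly defined as follows (standard forms from the explanation-evaluation literature; the paper may use variants). Infidelity measures how well an attribution $\Phi(f,x)$ predicts the change in model output under perturbations $I$:

$$\mathrm{INFD}(\Phi,f,x)=\mathbb{E}_{I\sim\mu_I}\!\left[\left(I^{\top}\Phi(f,x)-\left(f(x)-f(x-I)\right)\right)^{2}\right],$$

and the local Lipschitz estimate measures explanation stability in a neighborhood of $x$:

$$\hat{L}(x)=\max_{x'\in B_{\epsilon}(x)}\frac{\lVert\Phi(f,x)-\Phi(f,x')\rVert_{2}}{\lVert x-x'\rVert_{2}}.$$

Lower infidelity and a lower Lipschitz estimate both indicate higher-quality explanations.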
