
Natural Language Processing Academic Digest [12.17]

By the WeChat official account "arXiv每日学术速递" · Published 2021-12-17 17:43:27

cs.CL: 75 papers in total today

Transformer (4 papers)

【1】 Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Link: https://arxiv.org/abs/2112.09054

Authors: Kyle Richardson, Ashish Sabharwal
Note: Accepted to AAAI-2022, AAAI preprint
Abstract: Investigating the reasoning abilities of transformer models, and discovering new challenging tasks for them, has been a topic of much interest. Recent studies have found these models to be surprisingly strong at performing deductive reasoning over formal logical theories expressed in natural language. A shortcoming of these studies, however, is that they do not take into account that logical theories, when sampled uniformly at random, do not necessarily lead to hard instances. We propose a new methodology for creating challenging algorithmic reasoning datasets that focus on natural language satisfiability (NLSat) problems. The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language. This methodology allows us to distinguish easy from hard instances, and to systematically increase the complexity of existing reasoning benchmarks such as RuleTaker. We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems of substantially increased difficulty. They also exhibit some degree of scale-invariance - the ability to generalize to problems of larger size and scope. Our results, however, reveal important limitations too: a careful sampling of training data is crucial for building models that generalize to larger problems, and transformer models' limited scale-invariance suggests they are far from learning robust deductive reasoning algorithms.
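
The "sample hard SAT instances, then verbalize them" recipe described above can be illustrated with a minimal sketch (our own simplification, not the authors' code): draw random 3-SAT clauses near the well-known clause-to-variable phase-transition ratio, where hard instances concentrate, and render each clause as an English rule.

```python
import random

def sample_hard_3sat(num_vars: int, ratio: float = 4.26, seed: int = 0):
    """Sample a random 3-SAT instance near the empirical phase transition
    (clause/variable ratio of roughly 4.26), where hard instances cluster."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(int(ratio * num_vars)):
        chosen = rng.sample(range(1, num_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in chosen])
    return clauses

def verbalize(clauses):
    """Render each clause as an English rule, yielding an NLSat-style problem text."""
    lines = []
    for clause in clauses:
        parts = [f"proposition {abs(lit)} is {'true' if lit > 0 else 'false'}" for lit in clause]
        lines.append("At least one of the following holds: " + ", or ".join(parts) + ".")
    return "\n".join(lines)

print(verbalize(sample_hard_3sat(num_vars=5)))
```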

【2】 KAT: A Knowledge Augmented Transformer for Vision-and-Language
Link: https://arxiv.org/abs/2112.08614

Authors: Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, Jianfeng Gao
Abstract: The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.

【3】 Block-Skim: Efficient Question Answering for Transformer
Link: https://arxiv.org/abs/2112.08560

Authors: Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, Yuhao Zhu
Note: Published as a conference paper at AAAI 2022
Abstract: Transformer models have achieved promising results on natural language processing (NLP) tasks including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, different from other tasks such as sequence classification, answering the raised question does not necessarily need all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate the Transformer performance. The key idea of Block-Skim is to identify the context that must be further processed and those that could be safely discarded early on during inference. Critically, we find that such information could be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.
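
The core mechanism, reading the self-attention weights to decide which context blocks can be dropped early, can be sketched roughly as follows (our own simplified illustration; in the paper the skimming module is trained jointly with the QA objective):

```python
import torch

def block_importance(attn, question_mask, block_size=32):
    """Score each context block by the attention mass that question tokens assign to it.
    attn:          [heads, seq_len, seq_len] self-attention weights from one layer
    question_mask: [seq_len] bool tensor, True at question-token positions
    Low-scoring blocks are candidates for pruning in later layers."""
    q_to_all = attn[:, question_mask, :].mean(dim=(0, 1))            # [seq_len]
    pad = (-q_to_all.numel()) % block_size
    q_to_all = torch.nn.functional.pad(q_to_all, (0, pad))
    return q_to_all.view(-1, block_size).sum(dim=-1)                 # [num_blocks]

# Usage sketch: keep only the top-k blocks' hidden states for higher layers.
qmask = torch.arange(384) < 30            # pretend the first 30 tokens are the question
scores = block_importance(torch.softmax(torch.randn(12, 384, 384), dim=-1), qmask)
keep_blocks = scores.topk(k=6).indices
```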

【4】 DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization
Link: https://arxiv.org/abs/2112.08414

Authors: Xueying Zhang, Yunjiang Jiang, Yue Shang, Zhaomeng Cheng, Chi Zhang, Xiaochuan Fan, Yun Xiao, Bo Long
Abstract: We propose a novel domain-specific generative pre-training (DS-GPT) method for text generation and apply it to the product title and review summarization problems on E-commerce mobile display. First, we adopt a decoder-only transformer architecture, which fits well for fine-tuning tasks by combining input and output all together. Second, we demonstrate that utilizing only a small amount of pre-training data in related domains is powerful. Pre-training a language model from a general corpus such as Wikipedia or the Common Crawl requires tremendous time and resource commitment, and can be wasteful if the downstream tasks are limited in variety. Our DSGPT is pre-trained on a limited dataset, the Chinese short text summarization dataset (LCSTS). Third, our model does not require product-related human-labeled data. For the title summarization task, the state of the art explicitly uses additional background knowledge in training and predicting stages. In contrast, our model implicitly captures this knowledge and achieves significant improvement over other methods, after fine-tuning on the public Taobao.com dataset. For the review summarization task, we utilize a JD.com in-house dataset, and observe similar improvement over standard machine translation methods which lack the flexibility of fine-tuning. Our proposed work can be simply extended to other domains for a wide range of text generation tasks.

QA | VQA | Question Answering | Dialogue (5 papers)

【1】 Adapting Document-Grounded Dialog Systems to Spoken Conversations using Data Augmentation and a Noisy Channel Model
Link: https://arxiv.org/abs/2112.08844

Authors: David Thulke, Nico Daheim, Christian Dugast, Hermann Ney
Note: Accepted to the DSTC10 workshop at AAAI 2022
Abstract: This paper summarizes our submission to Task 2 of the second track of the 10th Dialog System Technology Challenge (DSTC10) "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations". Similar to the previous year's iteration, the task consists of three subtasks: detecting whether a turn is knowledge seeking, selecting the relevant knowledge document and finally generating a grounded response. This year, the focus lies on adapting the system to noisy ASR transcripts. We explore different approaches to make the models more robust to this type of input and to adapt the generated responses to the style of spoken conversations. For the latter, we get the best results with a noisy channel model that additionally reduces the number of short and generic responses. Our best system achieved the 1st rank in the automatic and the 3rd rank in the human evaluation of the challenge.
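
For readers unfamiliar with noisy-channel reranking of responses, a minimal sketch of the scoring idea (the weights and the component log-probabilities below are placeholders, not the submission's actual configuration):

```python
def noisy_channel_rerank(candidates, log_p_direct, log_p_channel, log_p_lm,
                         lam_channel=1.0, lam_lm=0.3):
    """Rerank candidate responses y for an input x with a noisy-channel objective:
        score(y) = log p(y|x) + lam_channel * log p(x|y) + lam_lm * log p(y)
    The channel term log p(x|y) rewards responses that "explain" the input turn,
    which in practice penalizes short and generic replies."""
    def score(y):
        return log_p_direct[y] + lam_channel * log_p_channel[y] + lam_lm * log_p_lm[y]
    return max(candidates, key=score)

best = noisy_channel_rerank(
    ["ok.", "The hotel offers free breakfast from 7 to 10 am."],
    log_p_direct={"ok.": -1.0, "The hotel offers free breakfast from 7 to 10 am.": -4.0},
    log_p_channel={"ok.": -9.0, "The hotel offers free breakfast from 7 to 10 am.": -2.0},
    log_p_lm={"ok.": -2.0, "The hotel offers free breakfast from 7 to 10 am.": -6.0},
)
print(best)  # the generic "ok." loses because it explains the input turn poorly
```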

【2】 Ditch the Gold Standard: Re-evaluating Conversational Question Answering
Link: https://arxiv.org/abs/2112.08812

Authors: Huihan Li, Tianyu Gao, Manan Goenka, Danqi Chen
Abstract: Conversational question answering (CQA) systems aim to provide natural-language answers to users in information-seeking conversations. Existing CQA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can well generalize to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art CQA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and there is a disagreement between human and gold-history evaluation in terms of model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we discuss the impact of various modeling strategies and future directions towards better conversational question answering systems.

【3】 Utilizing Evidence Spans via Sequence-Level Contrastive Learning for Long-Context Question Answering
Link: https://arxiv.org/abs/2112.08777

Authors: Avi Caciularu, Ido Dagan, Jacob Goldberger, Arman Cohan
Abstract: Long-range transformer models have achieved encouraging results on long-context question answering (QA) tasks. Such tasks often require reasoning over a long document, and they benefit from identifying a set of evidence spans (e.g., sentences) that provide supporting evidence for addressing the question. In this work, we propose a novel method for equipping long-range transformers with an additional sequence-level objective for better identification of supporting evidence spans. We achieve this by proposing an additional contrastive supervision signal in fine-tuning, where the model is encouraged to explicitly discriminate supporting evidence sentences from negative ones by maximizing the question-evidence similarity. The proposed additional loss exhibits consistent improvements on three different strong long-context transformer models, across two challenging question answering benchmarks - HotpotQA and QAsper.
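
A rough sketch of such a question-evidence contrastive objective (our own minimal illustration; the pooling, dimensions and temperature are placeholder choices, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def question_evidence_contrastive_loss(q_vec, sent_vecs, evidence_idx, temperature=0.1):
    """Treat one supporting-evidence sentence as the positive and all other context
    sentences as negatives, and maximize question-evidence similarity with a
    cross-entropy over the (scaled) similarity scores.
    q_vec:     [d]    pooled question representation
    sent_vecs: [n, d] pooled representations of the n context sentences"""
    sims = F.cosine_similarity(q_vec.unsqueeze(0), sent_vecs, dim=-1) / temperature   # [n]
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([evidence_idx]))

aux_loss = question_evidence_contrastive_loss(torch.randn(768), torch.randn(40, 768), evidence_idx=3)
# In fine-tuning this would be added to the usual QA loss, e.g. total = qa_loss + aux_loss.
```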

【4】 QuALITY: Question Answering with Long Input Texts, Yes!
Link: https://arxiv.org/abs/2112.08608

Authors: Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman
Abstract: To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).

【5】 QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization
Link: https://arxiv.org/abs/2112.08542

Authors: Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong
Abstract: Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based metrics and question answering (QA)-based metrics. However, differing experimental setups presented in recent work lead to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 15% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark. Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance on this benchmark. Furthermore, we find that QA-based and entailment-based metrics offer complementary signals and combine the two into a single, learned metric for further performance boost. Through qualitative and quantitative analyses, we point to question generation and answerability classification as two critical components for future work in QA-based metrics.

Machine Translation (4 papers)

【1】 IsometricMT: Neural Machine Translation for Automatic Dubbing
Link: https://arxiv.org/abs/2112.08682

Authors: Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico
Note: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
Abstract: Automatic dubbing (AD) is among the use cases where translations should fit a given length template in order to achieve synchronicity between source and target speech. For neural machine translation (MT), generating translations of length close to the source length (e.g. within +-10% in character count), while preserving quality is a challenging task. Controlling NMT output length comes at a cost to translation quality which is usually mitigated with a two-step approach of generation of n-best hypotheses and then re-ranking them based on length and quality. This work introduces a self-learning approach that allows a transformer model to directly learn to generate outputs that closely match the source length, in short isometric MT. In particular, our approach for isometric MT does not require to generate multiple hypotheses nor any auxiliary scoring function. We report results on four language pairs (English - French, Italian, German, Spanish) with a publicly available benchmark based on TED Talk data. Both automatic and manual evaluations show that our self-learning approach performs on par with more complex isometric MT approaches.
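
The length target mentioned above (staying within roughly +-10% of the source character count) is easy to check; a tiny helper like the following (our own illustration, not from the paper) could be used to label or filter outputs by isometry:

```python
def is_isometric(source: str, translation: str, tolerance: float = 0.10) -> bool:
    """Return True when the translation's character length falls within
    +-tolerance of the source length (the isometric-MT target window)."""
    ratio = len(translation) / max(len(source), 1)
    return (1.0 - tolerance) <= ratio <= (1.0 + tolerance)

print(is_isometric("Thanks a lot for your help!", "Vielen Dank für deine Hilfe!"))
```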

【2】 Amortized Noisy Channel Neural Machine Translation
Link: https://arxiv.org/abs/2112.08670

Authors: Richard Yuanzhe Pang, He He, Kyunghyun Cho
Abstract: Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like "beam search and rerank" (BSR) incur significant computation overhead during inference, making real-world application infeasible. We aim to build an amortized noisy channel NMT model such that greedily decoding from it would generate translations that maximize the same reward as translations generated using BSR. We attempt three approaches: knowledge distillation, 1-step-deviation imitation learning, and Q learning. The first approach obtains the noisy channel signal from a pseudo-corpus, and the latter two approaches aim to optimize toward a noisy-channel MT reward directly. All three approaches speed up inference by 1-2 orders of magnitude. For all three approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU is similar to the quality of BSR-produced translations.

【3】 Can Multilinguality benefit Non-autoregressive Machine Translation?
Link: https://arxiv.org/abs/2112.08570

Authors: Sweta Agrawal, Julia Kreutzer, Colin Cherry
Abstract: Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often implemented using multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and Imputer as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, which quantifies its performance relative to the AR model as model scale increases.

【4】 Prosody-Aware Neural Machine Translation for Dubbing
Link: https://arxiv.org/abs/2112.08548

Authors: Derek Tam, Surafel M. Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico
Note: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
Abstract: We introduce the task of prosody-aware machine translation which aims at generating translations suitable for dubbing. Dubbing of a spoken sentence requires transferring the content as well as the prosodic structure of the source into the target language to preserve timing information. Practically, this implies correctly projecting pauses from the source to the target and ensuring that target speech segments have roughly the same duration as the corresponding source segments. In this work, we propose implicit and explicit modeling approaches to integrate prosody information into neural machine translation. Experiments on English-German/French with automatic metrics show that the simplest of the considered approaches works best. Results are confirmed by human evaluations of translations and dubbed videos.

Semantic Analysis (3 papers)

【1】 Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search
Link: https://arxiv.org/abs/2112.08918

Authors: Rina Buoy, Nguonly Taing, Sovisal Chenda
Abstract: Search is one of the key functionalities in digital platforms and applications such as an electronic dictionary, a search engine, and an e-commerce platform. While the search function in some languages is trivial, Khmer word search is challenging given its complex writing system. Multiple orders of characters and different spelling realizations of words impose a constraint on Khmer word search functionality. Additionally, spelling mistakes are common since robust spellcheckers are not commonly available across the input device platforms. These challenges hinder the use of Khmer language in search-embedded applications. Moreover, due to the absence of WordNet-like lexical databases for Khmer language, it is impossible to establish semantic relation between words, enabling semantic search. In this paper, we propose a set of robust solutions to the above challenges associated with Khmer word search. The proposed solutions include character order normalization, grapheme and phoneme-based spellcheckers, and a Khmer word semantic model. The semantic model is based on the word embedding model that is trained on a 30-million-word corpus and is used to capture the semantic similarities between words.
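
The semantic-search component boils down to nearest-neighbour lookups in the word-embedding space; a minimal sketch (with a toy embedding table standing in for the model trained on the 30-million-word Khmer corpus) might look like this:

```python
import numpy as np

def semantic_candidates(query, embeddings, top_k=5):
    """Return the words whose embeddings are closest (by cosine similarity) to the
    query word's embedding; these can back semantic-aware search when exact string
    matching fails."""
    q = embeddings[query]
    scores = {
        w: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        for w, v in embeddings.items() if w != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy table; a real system would load vectors trained on the Khmer corpus.
toy = {w: np.random.RandomState(i).randn(50) for i, w in enumerate(["river", "water", "mountain", "boat"])}
print(semantic_candidates("river", toy, top_k=2))
```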

【2】 Few-Shot Semantic Parsing with Language Models Trained On Code
Link: https://arxiv.org/abs/2112.08696

Authors: Richard Shin, Benjamin Van Durme
Abstract: Large language models, prompted with in-context examples, can perform semantic parsing with little training data. They do better when we formulate the problem as paraphrasing into canonical utterances, which cast the underlying meaning representations into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. More recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. Since accurately modeling code requires understanding of executable semantics, such models may prove more adept at semantic parsing. In this paper, we test this hypothesis and find that Codex performs better at semantic parsing than equivalent GPT-3 models. We find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps as meaning representations used in semantic parsing are structured similar to code.
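
The few-shot setup described here is essentially prompt construction: show a handful of (utterance, canonical utterance) pairs and ask the model to continue the pattern. A sketch of such a prompt builder (the example pairs are invented placeholders, not from the paper):

```python
def build_parsing_prompt(examples, query):
    """Assemble a few-shot prompt asking a large language model (e.g., GPT-3 or Codex)
    to paraphrase an utterance into a canonical utterance."""
    blocks = [f"Utterance: {utt}\nCanonical: {canon}\n" for utt, canon in examples]
    blocks.append(f"Utterance: {query}\nCanonical:")
    return "\n".join(blocks)

prompt = build_parsing_prompt(
    [("show flights from boston to denver",
      "list flights whose origin is boston and whose destination is denver")],
    "what flights leave seattle after 5 pm",
)
print(prompt)   # the prompt is then sent to the LM; its completion is taken as the parse
```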

【3】 Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis
Link: https://arxiv.org/abs/2112.08532

Authors: Seth Kulick, Neville Ryant, Beatrice Santorini
Abstract: We present the first parsing results on the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.9 million word treebank that is an important resource for research in syntactic change. We describe key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank. We present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). Despite its simplicity, this approach works surprisingly well, suggesting it is possible to recover the original structure with sufficient accuracy to support linguistic applications (e.g., searching for syntactic structures of interest). However, for a subset of function tags (e.g., the tag indicating direct speech), additional work is needed, and we discuss some further limits of this approach. The resulting parser will be used to parse Early English Books Online, a 1.1 billion word corpus whose utility for the study of syntactic change will be greatly increased with the addition of accurate parse trees.

Graph | Knowledge Graph | Knowledge (7 papers)

【1】 Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Link: https://arxiv.org/abs/2112.08995

Authors: Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi
Note: Our code is available at this https URL
Abstract: Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2^21 (roughly 2M) supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.

【2】 Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks
Link: https://arxiv.org/abs/2112.08688

Authors: Akari Asai, Matt Gardner, Hannaneh Hajishirzi
Note: 17 pages
Abstract: Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks such as open question answering and fact verification. These models are trained to generate the final output given the retrieved passages, which can be irrelevant to the original query, leading to learning spurious cues or answer memorization. This work introduces a method to incorporate evidentiality of passages -- whether a passage contains correct evidence to support the output -- into training the generator. We introduce a multi-task learning framework to jointly generate the final output and predict the evidentiality of each passage, leveraging a new task-agnostic method to obtain silver evidentiality labels for supervision. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart with the same-size model and advances the state of the art on FaVIQ-Ambig. We attribute these improvements to both the auxiliary multi-task learning and silver evidentiality mining techniques.
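
The multi-task setup amounts to adding a per-passage binary "is this passage supporting evidence?" loss on top of the usual generation loss; a rough sketch (the mixing weight and tensor shapes are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def evidentiality_guided_loss(gen_loss, evidentiality_logits, silver_labels, alpha=0.5):
    """Combine the generator's loss with a binary evidentiality-prediction loss
    computed against (silver) labels for each retrieved passage.
    evidentiality_logits / silver_labels: [num_passages] float tensors."""
    ev_loss = F.binary_cross_entropy_with_logits(evidentiality_logits, silver_labels)
    return gen_loss + alpha * ev_loss

total = evidentiality_guided_loss(torch.tensor(2.1), torch.randn(20), torch.randint(0, 2, (20,)).float())
```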

【3】 Call for Customized Conversation: Customized Conversation Grounding Persona and Knowledge
Link: https://arxiv.org/abs/2112.08619

Authors: Yoonna Jang, Jungwoo Lim, Yuna Hur, Dongsuk Oh, Suhyune Son, Yeonsoo Lee, Donghoon Shin, Seungryong Kim, Heuiseok Lim
Note: Accepted paper at the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)
Abstract: Humans usually have conversations by making use of prior knowledge about a topic and background information of the people whom they are talking to. However, existing conversational agents and datasets do not consider such comprehensive information, and thus they have a limitation in generating the utterances where the knowledge and persona are fused properly. To address this issue, we introduce a call For Customized conversation (FoCus) dataset where the customized answers are built with the user's persona and Wikipedia knowledge. To evaluate the abilities to make informative and customized utterances of pre-trained language models, we utilize BART and GPT-2 as well as transformer-based models. We assess their generation abilities with automatic scores and conduct human evaluations for qualitative results. We examine whether the model reflects adequate persona and knowledge with our proposed two sub-tasks, persona grounding (PG) and knowledge grounding (KG). Moreover, we show that the utterances of our data are constructed with the proper knowledge and persona through grounding quality assessment.

【4】 Commonsense Knowledge-Augmented Pretrained Language Models for Causal Reasoning Classification
Link: https://arxiv.org/abs/2112.08615

Authors: Pedram Hosseini, David A. Broniatowski, Mona Diab
Abstract: Commonsense knowledge can be leveraged for identifying causal relations in text. In this work, we verbalize triples in ATOMIC2020, a wide coverage commonsense reasoning knowledge graph, to natural language text and continually pretrain a BERT pretrained language model. We evaluate the resulting model on answering commonsense reasoning questions. Our results show that a continually pretrained language model augmented with commonsense reasoning knowledge outperforms our baseline on two commonsense causal reasoning benchmarks, COPA and BCOPA-CE, without additional improvement on the base model or using quality-enhanced data for fine-tuning.

【5】 SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Link: https://arxiv.org/abs/2112.08587

Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang
Abstract: Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.

【6】 Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge
Link: https://arxiv.org/abs/2112.08583

Authors: Ian Porada, Alessandro Sordoni, Jackie Chi Kit Cheung
Abstract: Transformer models pre-trained with a masked-language-modeling objective (e.g., BERT) encode commonsense knowledge as evidenced by behavioral probes; however, the extent to which this knowledge is acquired by systematic inference over the semantics of the pre-training corpora is an open question. To answer this question, we selectively inject verbalized knowledge into the minibatches of a BERT model during pre-training and evaluate how well the model generalizes to supported inferences. We find generalization does not improve over the course of pre-training, suggesting that commonsense knowledge is acquired from surface-level, co-occurrence patterns rather than induced, systematic reasoning.

【7】 NewsClaims: A New Benchmark for Claim Detection from News with Background Knowledge
Link: https://arxiv.org/abs/2112.08544

Authors: Revanth Gangi Reddy, Sai Chinthakindi, Zhenhailong Wang, Yi R. Fung, Kathryn S. Conger, Ahmed S. Elsayed, Martha Palmer, Heng Ji
Note: Preprint
Abstract: Claim detection and verification are crucial for news understanding and have emerged as promising technologies for mitigating misinformation in news. However, most existing work focuses on analysis of claim sentences while overlooking crucial background attributes, such as the claimer, claim objects, and other knowledge connected to the claim. In this work, we present NewsClaims, a new benchmark for knowledge-aware claim detection in the news domain. We re-define the claim detection problem to include extraction of additional background attributes related to the claim and release 529 claims annotated over 103 news articles. In addition, NewsClaims aims to benchmark claim detection systems in emerging scenarios, comprising unseen topics with little or no training data. Finally, we provide a comprehensive evaluation of various zero-shot and prompt-based baselines for this new benchmark.

Summarization | Information Extraction (3 papers)

【1】 CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs
Link: https://arxiv.org/abs/2112.08804

Authors: Tahmid Hasan, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar
Note: Work in progress
Abstract: We present CrossSum, a large-scale dataset comprising 1.65 million cross-lingual article-summary samples in 1500+ language pairs constituting 45 languages. We use the multilingual XL-Sum dataset and align identical articles written in different languages via cross-lingual retrieval using a language-agnostic representation model. We propose a multi-stage data sampling algorithm and fine-tune mT5, a multilingual pretrained model, with explicit cross-lingual supervision with CrossSum and introduce a new metric for evaluating cross-lingual summarization. Results on established and our proposed metrics indicate that models fine-tuned on CrossSum outperform summarization+translation baselines, even when the source and target language pairs are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and also the first-ever that does not rely on English as the pivot language. We are releasing the dataset, alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum.
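
The alignment step, pairing the same article across languages via cross-lingual retrieval over language-agnostic embeddings, can be sketched as a nearest-neighbour search (our own simplification; the encoder, threshold and arrays below are placeholders, e.g. a model such as LaBSE could provide the embeddings):

```python
import numpy as np

def align_articles(emb_lang_a, emb_lang_b, threshold=0.7):
    """Pair each article in language A with its most similar article in language B
    (cosine similarity over language-agnostic embeddings), keeping pairs above a threshold."""
    a = emb_lang_a / np.linalg.norm(emb_lang_a, axis=1, keepdims=True)
    b = emb_lang_b / np.linalg.norm(emb_lang_b, axis=1, keepdims=True)
    sims = a @ b.T                                    # [num_a, num_b]
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs

print(align_articles(np.random.rand(3, 8), np.random.rand(4, 8), threshold=0.0))
```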

【2】 A Proposition-Level Clustering Approach for Multi-Document Summarization
Link: https://arxiv.org/abs/2112.08770

Authors: Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan
Abstract: Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. Clusters were leveraged to indicate information saliency and to avoid redundancy. These methods focused on clustering sentences, even though closely related sentences also usually contain non-aligning information. In this work, we revisit the clustering approach, grouping together propositions for more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference.

【3】 CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Link: https://arxiv.org/abs/2112.08713

Authors: Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev
Abstract: Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained models, substantial amounts of hallucinated content are found during the human evaluation. Pre-trained models are most commonly fine-tuned with cross-entropy loss for text summarization, which may not be an optimal strategy. In this work, we provide a typology of factual errors with annotation data to highlight the types of errors and move away from a binary understanding of factuality. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called ConFiT. Based on our linguistically-informed typology of errors, we design different modular objectives that each target a specific type. Specifically, we utilize hard negative samples with errors to reduce the generation of factual inconsistency. In order to capture the key information between speakers, we also design a dialogue-specific loss. Using human evaluation and automatic faithfulness metrics, we show that our model significantly reduces all kinds of factual errors on the dialogue summarization, SAMSum corpus. Moreover, our model could be generalized to the meeting summarization, AMI corpus, and it produces significantly higher scores than most of the baselines on both datasets regarding word-overlap metrics.

Reasoning | Analysis | Understanding | Explanation (4 papers)

【1】 Inherently Explainable Reinforcement Learning in Natural Language
Link: https://arxiv.org/abs/2112.08907

Authors: Xiangyu Peng, Mark O. Riedl, Prithviraj Ammanabrolu
Abstract: We focus on the task of creating a reinforcement learning agent that is inherently explainable -- with the ability to produce immediate local explanations by thinking out loud while performing a task and analyzing entire trajectories post-hoc to produce causal explanations. This Hierarchically Explainable Reinforcement Learning agent (HEX-RL) operates in Interactive Fictions, text-based game environments in which an agent perceives and acts upon the world using textual natural language. These games are usually structured as puzzles or quests with long-term dependencies in which an agent must complete a sequence of actions to succeed -- providing ideal environments in which to test an agent's ability to explain its actions. Our agent is designed to treat explainability as a first-class citizen, using an extracted symbolic knowledge graph-based state representation coupled with a Hierarchical Graph Attention mechanism that points to the facts in the internal graph representation that most influenced the choice of actions. Experiments show that this agent provides significantly improved explanations over strong baselines, as rated by human participants generally unfamiliar with the environment, while also matching state-of-the-art task performance.

【2】 Distilled Dual-Encoder Model for Vision-Language Understanding
Link: https://arxiv.org/abs/2112.08723

Authors: Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
Note: Work in progress
Abstract: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
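
The distillation signal itself is just a divergence between teacher and student attention maps; a minimal sketch (our own illustration, with shapes and the reduction chosen for brevity rather than taken from the paper):

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(teacher_attn, student_attn, eps=1e-8):
    """KL divergence pushing the student's cross-modal attention distributions
    toward the fusion-encoder teacher's (e.g., image-to-text attention).
    Shapes: [heads, query_len, key_len]; each row is an attention distribution."""
    teacher = teacher_attn.clamp_min(eps)
    student = student_attn.clamp_min(eps)
    # F.kl_div expects the student in log space; this computes KL(teacher || student).
    return F.kl_div(student.log(), teacher, reduction="batchmean")

t = torch.softmax(torch.randn(12, 20, 30), dim=-1)
s = torch.softmax(torch.randn(12, 20, 30), dim=-1)
loss = attention_distillation_loss(t, s)
```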

【3】 Reframing Human-AI Collaboration for Generating Free-Text Explanations
Link: https://arxiv.org/abs/2112.08674

Authors: Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, Yejin Choi
Note: 10 pages main; 13 pages appendix
Abstract: Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using a small number of human-written examples (i.e., in a few-shot manner). We find that (1) authoring higher-quality examples for prompting results in higher quality generations; and (2) surprisingly, in a head-to-head comparison, crowdworkers often prefer explanations generated by GPT-3 to crowdsourced human-written explanations contained within existing datasets. Crowdworker ratings also show, however, that while models produce factual, grammatical, and sufficient explanations, they have room to improve, e.g., along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter that incorporates humans-in-the-loop via binary acceptability judgments. Despite significant subjectivity intrinsic to judging acceptability, our approach is able to consistently filter GPT-3 generated explanations deemed acceptable by humans.

【4】 Explainable Natural Language Processing with Matrix Product States
Link: https://arxiv.org/abs/2112.08628

Authors: Jirawat Tangpanitanon, Chanatip Mangkang, Pradeep Bhadola, Yuichiro Minato, Dimitris Angelakis, Thiparat Chotibut
Note: 25 pages, 7 figures
Abstract: Despite empirical successes of recurrent neural networks (RNNs) in natural language processing (NLP), theoretical understanding of RNNs is still limited due to intrinsically complex computations in RNNs. We perform a systematic analysis of RNNs' behaviors in a ubiquitous NLP task, the sentiment analysis of movie reviews, via the mapping between a class of RNNs called recurrent arithmetic circuits (RACs) and a matrix product state (MPS). Using the von Neumann entanglement entropy (EE) as a proxy for information propagation, we show that single-layer RACs possess a maximum information propagation capacity, reflected by the saturation of the EE. Enlarging the bond dimension of an MPS beyond the EE saturation threshold does not increase the prediction accuracies, so a minimal model that best estimates the data statistics can be constructed. Although the saturated EE is smaller than the maximum EE achievable by the area law of an MPS, our model achieves ~99% training accuracies in realistic sentiment analysis data sets. Thus, low EE alone is not a warrant against the adoption of single-layer RACs for NLP. Contrary to a common belief that long-range information propagation is the main source of RNNs' expressiveness, we show that single-layer RACs also harness high expressiveness from meaningful word vector embeddings. Our work sheds light on the phenomenology of learning in RACs and more generally on the explainability aspects of RNNs for NLP, using tools from many-body quantum physics.
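
For readers unfamiliar with the entanglement entropy used as the information-propagation proxy here: for a bipartition of a state it is S = -sum_i p_i log p_i, where the p_i are the squared, normalized Schmidt (singular) values across the cut. A small numerical sketch (our own illustration, unrelated to the paper's code):

```python
import numpy as np

def entanglement_entropy(state_matrix):
    """Von Neumann entanglement entropy of a bipartition: reshape the state into a
    matrix across the cut, take singular values s_i, normalize p_i = s_i^2 / sum s^2,
    and return -sum p_i log p_i."""
    s = np.linalg.svd(state_matrix, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

# Example: a maximally entangled two-qubit (Bell) state gives EE = log(2).
bell = np.array([[1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)
print(entanglement_entropy(bell), np.log(2))
```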

GAN | Adversarial | Attack | Generation (6 papers)

【1】 Learning and Analyzing Generation Order for Undirected Sequence Models
Link: https://arxiv.org/abs/2112.09097

Authors: Yichen Jiang, Mohit Bansal
Note: EMNLP 2021 Findings (12 pages)
Abstract: Undirected neural sequence models have achieved performance competitive with the state-of-the-art directed sequence models that generate monotonically from left to right in machine translation tasks. In this work, we train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning. We show that the translations decoded by our learned orders achieve higher BLEU scores than the outputs decoded from left to right or decoded by the learned order from Mansimov et al. (2019) on the WMT'14 German-English translation task. On examples with a maximum source and target length of 30 from De-En, WMT'16 English-Romanian, and WMT'21 English-Chinese translation tasks, our learned order outperforms all heuristic generation orders on four out of six tasks. We next carefully analyze the learned order patterns via qualitative and quantitative analysis. We show that our policy generally follows an outer-to-inner order, predicting the left-most and right-most positions first, and then moving toward the middle while skipping less important words at the beginning. Furthermore, the policy usually predicts positions for a single syntactic constituent structure in consecutive steps. We believe our findings could provide more insights on the mechanism of undirected generation models and encourage further research in this direction. Our code is publicly available at https://github.com/jiangycTarheel/undirected-generation

【2】 NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics
Link: https://arxiv.org/abs/2112.08726

Authors: Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi
Abstract: The dominant paradigm for neural text generation is left-to-right decoding from autoregressive language models. Constrained or controllable generation under complex lexical constraints, however, requires foresight to plan ahead feasible future paths. Drawing inspiration from the A* search algorithm, we propose NeuroLogic A*esque, a decoding algorithm that incorporates heuristic estimates of future cost. We develop lookahead heuristics that are efficient for large-scale language models, making our method a drop-in replacement for common techniques such as beam search and top-k sampling. To enable constrained generation, we build on NeuroLogic decoding (Lu et al., 2021), combining its flexibility in incorporating logical constraints with A*esque estimates of future constraint satisfaction. Our approach outperforms competitive baselines on five generation tasks, and achieves new state-of-the-art performance on table-to-text generation, constrained machine translation, and keyword-constrained generation. The improvements are particularly notable on tasks that require complex constraint satisfaction or in few-shot or zero-shot settings. NeuroLogic A*esque illustrates the power of decoding for improving and enabling new capabilities of large-scale language models.
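
In spirit, each decoding step scores a candidate token by its log-probability plus a weighted estimate of how well the best continuation from it will satisfy the constraints (obtained, e.g., by a short greedy rollout). A toy sketch of that scoring rule (not the paper's implementation; the numbers are invented):

```python
def a_star_esque_pick(candidates, log_prob, lookahead_value, lam=1.0):
    """Pick the next token by combining the model's log-probability with a
    heuristic estimate of future constraint satisfaction from a lookahead."""
    return max(candidates, key=lambda tok: log_prob(tok) + lam * lookahead_value(tok))

# Toy usage: the lookahead pulls the choice toward the constraint keyword
# even though the model assigns it a lower immediate log-probability.
next_token = a_star_esque_pick(
    candidates=["the", "mountain"],
    log_prob=lambda t: {"the": -0.5, "mountain": -1.2}[t],
    lookahead_value=lambda t: {"the": 0.0, "mountain": 1.0}[t],  # e.g., fraction of keywords covered
)
print(next_token)
```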

【3】 Taming Repetition in Dialogue Generation
Link: https://arxiv.org/abs/2112.08657

Authors: Yadong Xi, Jiashu Pu, Xiaoxi Mao
Note: Accepted by AAAI-22 W16: Dialog System Technology Challenge (DSTC10)
Abstract: The wave of pre-training language models has been continuously improving the quality of the machine-generated conversations, however, some of the generated responses still suffer from excessive repetition, sometimes repeating words from utterance, sometimes repeating words within self-generated responses, or both. Inappropriate repetition of words can significantly degrade the quality of the generated texts. Penalized sampling is one popular solution, reducing the sampling probability of existing words during inference, however, it is highly vulnerable to the inappropriate setting of the static weight. Setting it too high can yield strange and unrealistic sentences while setting it too low makes the task of suppressing repetition trivial. To remedy the shortcomings of the above methods, we design a context-aware classifier to explicitly decide when to allow repetition and when to employ penalized sampling. Such a classifier can be easily integrated with existing decoding methods, reducing repetitions where appropriate while preserving the diversity of the text. Experimental results demonstrate that our method can generate higher quality and more authentic dialogues.
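
A rough sketch of what "penalized sampling gated by a classifier" looks like at the logits level (our own simplification; in the paper the gate is a trained, context-aware classifier rather than a boolean flag, and the penalty value is a placeholder):

```python
import torch

def gated_penalized_logits(logits, generated_ids, allow_repetition, penalty=1.3):
    """Apply a CTRL-style repetition penalty to tokens that were already generated,
    but only when the (context-aware) gate says repetition should be suppressed."""
    if allow_repetition or not generated_ids:
        return logits
    logits = logits.clone()
    idx = torch.tensor(sorted(set(generated_ids)))
    picked = logits[idx]
    logits[idx] = torch.where(picked > 0, picked / penalty, picked * penalty)
    return logits

adjusted = gated_penalized_logits(torch.randn(32000), generated_ids=[42, 7, 42], allow_repetition=False)
```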

【4】 Guiding Neural Story Generation with Reader Models 标题:用读者模型指导神经故事的生成 链接:https://arxiv.org/abs/2112.08596

作者:Xiangyu Peng,Kaige Xie,Amal Alabdulkarim,Harshith Kayam,Samihan Dani,Mark O. Riedl 摘要:由于日常生活中叙事的普遍性,自动讲故事一直吸引着研究人员的注意力。然而,在使用神经语言模型生成叙述时,保持连贯性并保持主题朝着特定的结尾是一个挑战。在本文中,我们介绍了读者模型的故事生成(StoRM),这是一个框架,在该框架中,读者模型用于推理故事应该进展。读者模型推断出人类读者对虚构故事世界的概念、实体和关系的看法。我们展示了以知识图表示的显式读者模型如何提供故事连贯性,并以实现给定故事世界状态目标的形式提供可控性。实验表明,我们的模型产生的故事更加连贯和主题化,在包括情节合理性和主题性在内的维度上优于基线。我们的系统在不排序的情况下编写给定概念方面也优于大纲引导的故事生成基线。 摘要:Automated storytelling has long captured the attention of researchers for the ubiquity of narratives in everyday life. However, it is challenging to maintain coherence and stay on-topic toward a specific ending when generating narratives with neural language models. In this paper, we introduce Story generation with Reader Models (StoRM), a framework in which a reader model is used to reason about the story should progress. A reader model infers what a human reader believes about the concepts, entities, and relations about the fictional story world. We show how an explicit reader model represented as a knowledge graph affords story coherence and provides controllability in the form of achieving a given story world state goal. Experiments show that our model produces significantly more coherent and on-topic stories, outperforming baselines in dimensions including plot plausibility and staying on topic. Our system also outperforms outline-guided story generation baselines in composing given concepts without ordering.

【5】 Goal-Directed Story Generation: Augmenting Generative Language Models with Reinforcement Learning 标题:目标导向的故事生成:用强化学习增强生成语言模型 链接:https://arxiv.org/abs/2112.08593

作者:Amal Alabdulkarim,Winston Li,Lara J. Martin,Mark O. Riedl 备注:preprint 摘要:大型预先训练的生成性语言模型的出现为人工智能故事生成提供了一个通用框架,通过对模型进行采样来创建继续故事的序列。但是,仅采样不足以生成故事。特别是,很难指导语言模型创建故事以达到特定的目标事件。我们提出了两种基于深度强化学习和奖励塑造的自动化技术来控制计算机生成故事的情节。第一种方法利用近端策略优化对现有的基于转换器的语言模型进行微调,以生成文本连续性,但也可以进行目标搜索。第二种方法从展开的故事中提取一个知识图,该知识图被具有图注意的策略网络用于选择由语言模型生成的候选延续。我们报告了与故事实现给定目标事件的频率相关的自动化指标,以及与基线和破坏相比,人类参与者对连贯性和总体故事质量的排名。 摘要:The advent of large pre-trained generative language models has provided a common framework for AI story generation via sampling the model to create sequences that continue the story. However, sampling alone is insufficient for story generation. In particular, it is hard to direct a language model to create stories to reach a specific goal event. We present two automated techniques grounded in deep reinforcement learning and reward shaping to control the plot of computer-generated stories. The first utilizes proximal policy optimization to fine-tune an existing transformer-based language model to generate text continuations but also be goal-seeking. The second extracts a knowledge graph from the unfolding story, which is used by a policy network with graph attention to select a candidate continuation generated by a language model. We report on automated metrics pertaining to how often stories achieve a given goal event as well as human participant rankings of coherence and overall story quality compared to baselines and ablations.

【6】 Neural Content Extraction for Poster Generation of Scientific Papers 标题:面向科技论文海报生成的神经内容提取 链接:https://arxiv.org/abs/2112.08550

作者:Sheng Xu,Xiaojun Wan 摘要:科学论文的海报生成问题目前研究尚不充分。海报通常呈现论文中最重要的信息,该任务可以被视为一种特殊形式的文档摘要。以往的研究主要集中在海报布局和面板组成上,而忽略了内容提取的重要性。此外,它们的数据集不公开,这阻碍了进一步的研究。在本文中,我们为这项任务从头构建了一个基准数据集。然后,我们提出了一个三步框架来解决这一任务,并重点研究其中的内容提取步骤。为了同时获取海报面板的文本和视觉元素,我们提出了一种神经抽取模型,可同时提取论文章节中的文本、图形和表格。我们在数据集上进行了实验,并进行了消融研究。结果证明了我们提出的模型的有效性。数据集和代码将被发布。 摘要:The problem of poster generation for scientific papers is under-investigated. Posters often present the most important information of papers, and the task can be considered as a special form of document summarization. Previous studies focus mainly on poster layout and panel composition, while neglecting the importance of content extraction. Besides, their datasets are not publicly available, which hinders further research. In this paper, we construct a benchmark dataset from scratch for this task. Then we propose a three-step framework to tackle this task and focus on the content extraction step in this study. To get both textual and visual elements of a poster panel, a neural extractive model is proposed to extract text, figures and tables of a paper section simultaneously. We conduct experiments on the dataset and also perform ablation study. Results demonstrate the efficacy of our proposed model. The dataset and code will be released.

半/弱/无监督|不确定性(7篇)

【1】 Towards Unsupervised Dense Information Retrieval with Contrastive Learning 标题:基于对比学习的无监督密集信息检索 链接:https://arxiv.org/abs/2112.09118

作者:Gautier Izacard,Mathilde Caron,Lucas Hosseini,Sebastian Riedel,Piotr Bojanowski,Armand Joulin,Edouard Grave 摘要:信息检索是自然语言处理中的重要组成部分,可用于问答和事实核查等知识密集型任务。最近,信息检索领域出现了基于神经网络的密集检索器,作为基于词频的经典稀疏方法的替代方案。这些模型在可以获得大型训练集的数据集和任务上取得了最先进的结果。然而,它们不能很好地迁移到没有训练数据的新领域或新应用中,并且常常不如BM25等无监督的词频方法。因此,一个自然的问题是,是否有可能在没有监督的情况下训练密集检索器。在这项工作中,我们探索了对比学习作为训练无监督密集检索器的一种方法的极限,并表明它能带来强大的检索性能。更准确地说,我们在BEIR基准上显示,我们的模型在15个数据集中的11个上优于BM25。此外,当有数千个示例可用时,我们表明在这些示例上微调我们的模型,相比BM25能带来很大的改进。最后,当在MS-MARCO数据集上微调之前将我们的技术用作预训练时,它在BEIR基准上获得了最先进的结果。 摘要:Information retrieval is an important component in natural language processing, for knowledge intensive tasks such as question answering and fact checking. Recently, information retrieval has seen the emergence of dense retrievers, based on neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new domains or applications with no training data, and are often outperformed by term-frequency methods such as BM25 which are not supervised. Thus, a natural question is whether it is possible to train dense retrievers without supervision. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers, and show that it leads to strong retrieval performance. More precisely, we show on the BEIR benchmark that our model outperforms BM25 on 11 out of 15 datasets. Furthermore, when a few thousands examples are available, we show that fine-tuning our model on these leads to strong improvements compared to BM25. Finally, when used as pre-training before fine-tuning on the MS-MARCO dataset, our technique obtains state-of-the-art results on the BEIR benchmark.
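下面是一个批内对比损失(InfoNCE)的最小草图(Python,非论文官方实现),用于说明无监督训练密集检索器的基本做法:把同一文档中随机裁出的两个片段的编码向量作为正样本对,批内其余样本作为负样本;温度系数与向量维度均为示意性取值。

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q, d, temperature=0.05):
    """q, d: [batch, dim],分别是同一文档两个片段经编码器得到的向量"""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.t() / temperature           # [batch, batch] 相似度矩阵
    labels = torch.arange(q.size(0))           # 对角线为正样本,其余为批内负样本
    return F.cross_entropy(logits, labels)

# 用法示例(随机向量仅作演示)
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, d).item())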

【2】 ATM: An Uncertainty-aware Active Self-training Framework for Label-efficient Text Classification 标题:ATM:一种不确定性感知的标签有效文本分类主动自训练框架 链接:https://arxiv.org/abs/2112.08787

作者:Yue Yu,Lingkai Kong,Jieyu Zhang,Rongzhi Zhang,Chao Zhang 摘要:尽管预训练语言模型(LMs)在许多自然语言处理(NLP)任务中取得了巨大的成功,但它们需要过多的标记数据进行微调以获得令人满意的性能。为了提高标记效率,研究人员采用了主动学习(AL),而以往的大多数工作都忽略了未标记数据的潜力。为了释放未标记数据的力量以获得更好的标记效率和模型性能,我们开发了ATM,这是一种利用自我训练来利用未标记数据的新框架,与特定的AL算法无关,用作改进现有AL方法的插件模块。具体来说,具有高不确定性的未标记数据将暴露给oracle进行注释,而具有低不确定性的数据则用于自我训练。为了缓解自训练中标签噪声的传播问题,我们设计了一个简单有效的基于动量的内存库,用于动态聚合来自各个回合的模型预测。通过大量实验,我们证明ATM优于最强的主动学习和自训练基线,平均提高了51.9%的标签效率。 摘要:Despite the great success of pre-trained language models (LMs) in many natural language processing (NLP) tasks, they require excessive labeled data for fine-tuning to achieve satisfactory performance. To enhance the label efficiency, researchers have resorted to active learning (AL), while the potential of unlabeled data is ignored by most of prior work. To unleash the power of unlabeled data for better label efficiency and model performance, we develop ATM, a new framework that leverage self-training to exploit unlabeled data and is agnostic to the specific AL algorithm, serving as a plug-in module to improve existing AL methods. Specifically, the unlabeled data with high uncertainty is exposed to oracle for annotations while those with low uncertainty are leveraged for self-training. To alleviate the label noise propagation issue in self-training, we design a simple and effective momentum-based memory bank to dynamically aggregate the model predictions from all rounds. By extensive experiments, we demonstrate that ATM outperforms the strongest active learning and self-training baselines and improve the label efficiency by 51.9% on average.
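下面是一个按不确定性划分样本并用基于动量的记忆库聚合历史预测的最小草图(Python,非论文官方实现):高不确定性样本送人工标注,低不确定性样本用于自训练;熵阈值、动量系数等均为示意性假设,与论文的具体设定可能不同。

import torch

def split_by_uncertainty(probs, threshold=1.0):
    """probs: [n, num_classes];以预测熵作为不确定性度量"""
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    to_oracle = entropy > threshold            # 高不确定性 -> 交给人工标注
    to_selftrain = ~to_oracle                  # 低不确定性 -> 用于自训练
    return to_oracle, to_selftrain

def update_memory_bank(memory, probs, momentum=0.9):
    """用动量平均聚合各轮预测,缓解自训练中的标签噪声传播"""
    return momentum * memory + (1 - momentum) * probs

# 用法示例(随机概率仅作演示)
probs = torch.softmax(torch.randn(6, 4), dim=-1)
memory = torch.full_like(probs, 1.0 / probs.size(-1))    # 初始化为均匀分布
memory = update_memory_bank(memory, probs)
oracle_mask, selftrain_mask = split_by_uncertainty(probs)
pseudo_labels = memory[selftrain_mask].argmax(dim=-1)    # 低不确定性样本的伪标签
print(oracle_mask, pseudo_labels)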

【3】 Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription 标题:缺陷性重建:低资源历史文献抄写的自我监督预训练 链接:https://arxiv.org/abs/2112.08692

作者:Nikolai Vogler,Jonathan Parkes Allen,Matthew Thomas Miller,Taylor Berg-Kirkpatrick 摘要:我们提出了一种自我监督的预训练方法,用于学习手写和印刷历史文档转录的丰富视觉语言表示。在监督下对我们预先训练的编码器表示进行微调,以实现两种语言的低资源文档转录后,(1)一组异构的手写伊斯兰手稿图像和(2)早期现代英语印刷文档,与从头开始训练的同一监督模型相比,我们显示了识别准确率的显著提高,仅需30行图像转录进行训练。我们的蒙面语言模型风格预训练策略,其中模型经过训练,能够从同一行中采样的干扰物中识别真实的蒙面视觉表示,鼓励学习对涂鸦书写风格和文档中存在的打印噪音保持不变的鲁棒语境化语言表示。 摘要:We present a self-supervised pre-training approach for learning rich visual language representations for both handwritten and printed historical document transcription. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription on two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English printed documents, we show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training. Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations invariant to scribal writing style and printing noise present across documents.

【4】 Analyzing the Limits of Self-Supervision in Handling Bias in Language 标题:浅析自我监督在处理语言偏见中的局限性 链接:https://arxiv.org/abs/2112.08637

作者:Lisa Bauer,Karthik Gopalakrishnan,Spandana Gella,Yang Liu,Mohit Bansal,Dilek Hakkani-Tur 备注:16 pages, 1 figure 摘要:使用自然语言任务描述来提示输入已经成为一种流行的机制,可以在几乎没有上下文监督的情况下,从大规模生成式语言模型中获得相当准确的输出。这也有助于深入了解语言模型仅通过在海量未标注文本语料上的自监督预训练,能在多大程度上捕捉各类下游任务的语义。这样的模型自然也会接触到很多不良内容,比如种族主义和性别歧视的语言,而在这些维度上对模型的认识研究还很有限。在本文中,我们定义并全面评估了这类语言模型在多大程度上捕获了四项偏见相关任务的语义:诊断、识别、提取和改写。我们为这些任务定义了三大类任务描述:陈述、问题和补全,每类中有许多词汇变体。我们研究了在多种解码方法和少样本示例下,使用这些类别以及空任务描述对每项任务进行提示的有效性。我们的分析表明,语言模型能够在不同的偏见维度(如性别和政治立场)上以差异很大的程度执行这些任务。我们相信,通过量化当前自监督目标在完成这些具有社会挑战性的任务上的局限,我们的工作是迈向无偏见语言模型的重要一步。 摘要:Prompting inputs with natural language task descriptions has emerged as a popular mechanism to elicit reasonably accurate outputs from large-scale generative language models with little to no in-context supervision. This also helps gain insight into how well language models capture the semantics of a wide range of downstream tasks purely from self-supervised pre-training on massive corpora of unlabeled text. Such models have naturally also been exposed to a lot of undesirable content like racist and sexist language and there is limited work on awareness of models along these dimensions. In this paper, we define and comprehensively evaluate how well such language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing. We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class. We study the efficacy of prompting for each task using these classes and the null task description across several decoding methods and few-shot examples. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation. We believe our work is an important step towards unbiased language models by quantifying the limits of current self-supervision objectives at accomplishing such sociologically challenging tasks.

【5】 Idiomatic Expression Paraphrasing without Strong Supervision 标题:没有强监督的习语释义 链接:https://arxiv.org/abs/2112.08592

作者:Jianing Zhou,Ziheng Zeng,Hongyu Gong,Suma Bhat 摘要:习语(IE)在自然语言中起着至关重要的作用。在本文中,我们研究了习语句释义(ISP)任务,其目的是通过将句中的IE替换为其字面释义来改写含有IE的句子。缺乏包含惯用-字面平行句的大规模语料库是这项任务的主要挑战,为此我们考虑了两种不同的解决方案。首先,我们提出了一种无监督的ISP方法,它利用IE的上下文信息和定义,不需要平行句子训练集。其次,我们提出了一种弱监督方法,使用回译联合执行释义和含IE句子的生成,以扩充小规模的平行句子训练数据集。该研究的其他重要衍生成果包括一个用IE替换句中字面短语以生成习语表达的模型,以及一个包含惯用/字面句子对的大规模平行数据集。在平行数据集上用自动和人工评估对生成的句子进行实证验证时,与竞争基线相比,所提方案的有效性体现为BLEU相对提升超过5.16分,METEOR超过8.75分,SARI超过19.57分。我们展示了ISP作为英德机器翻译预处理步骤的实用价值。 摘要:Idiomatic expressions (IEs) play an essential role in natural language. In this paper, we study the task of idiomatic sentence paraphrasing (ISP), which aims to paraphrase a sentence with an IE by replacing the IE with its literal paraphrase. The lack of large-scale corpora with idiomatic-literal parallel sentences is a primary challenge for this task, for which we consider two separate solutions. First, we propose an unsupervised approach to ISP, which leverages an IE's contextual information and definition and does not require a parallel sentence training set. Second, we propose a weakly supervised approach using back-translation to jointly perform paraphrasing and generation of sentences with IEs to enlarge the small-scale parallel sentence training dataset. Other significant derivatives of the study include a model that replaces a literal phrase in a sentence with an IE to generate an idiomatic expression and a large scale parallel dataset with idiomatic/literal sentence pairs. The effectiveness of the proposed solutions compared to competitive baselines is seen in the relative gains of over 5.16 points in BLEU, over 8.75 points in METEOR, and over 19.57 points in SARI when the generated sentences are empirically validated on a parallel dataset using automatic and manual evaluations. We demonstrate the practical utility of ISP as a preprocessing step in En-De machine translation.

【6】 Applying SoftTriple Loss for Supervised Language Model Fine Tuning 标题:软三重损失在有监督语言模型精调中的应用 链接:https://arxiv.org/abs/2112.08462

作者:Witold Sosnowski,Anna Wroblewska,Piotr Gawrysiak 摘要:我们引入了一种新的损失函数TripleEntropy,基于交叉熵和SoftTriple损失来改善微调通用知识预训练语言模型时的分类性能。该损失函数可以将用交叉熵损失微调的强RoBERTa基线模型提升约0.02%-2.29%。在常用数据集上的全面测试表明增益是稳定的。训练数据集中的样本越少,增益越高——对于小型数据集,增益为0.78%,中型为0.86%,大型为0.20%,特大型为0.04%。 摘要:We introduce a new loss function TripleEntropy, to improve classification performance for fine-tuning general knowledge pre-trained language models based on cross-entropy and SoftTriple loss. This loss function can improve the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about (0.02% - 2.29%). Thorough tests on popular datasets indicate a steady gain. The fewer samples in the training dataset, the higher gain -- thus, for small-sized dataset it is 0.78%, for medium-sized -- 0.86% for large -- 0.20% and for extra-large 0.04%.

【7】 Self-Supervised Learning for speech recognition with Intermediate layer supervision 标题:带中间层监督的语音识别自监督学习 链接:https://arxiv.org/abs/2112.08778

作者:Chengyi Wang,Yu Wu,Sanyuan Chen,Shujie Liu,Jinyu Li,Yao Qian,Zhenglu Yang 备注:Submitted to ICASSP 2022 摘要:最近,先锋研究发现,语音预训练模型可以解决全堆栈语音处理任务,因为该模型利用底层学习说话人相关信息,顶层编码内容相关信息。由于网络容量有限,我们相信如果该模型专门用于音频内容信息学习,语音识别性能可以进一步提高。为此,我们提出了用于自监督学习的中间层监督(ILS-SSL),通过在中间层上添加额外的SSL损失,迫使模型尽可能集中于内容信息。在LibriSpeech-test-other集上的实验表明,我们的方法明显优于HuBERT,在基本/大型模型的w/o语言模型设置下,相对字错误率降低了23.5%/11.6%。详细的分析表明,我们模型的底层与语音单位有更好的相关性,这与我们的直觉一致,并解释了我们的ASR方法的成功。 摘要:Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model is dedicated to audio content information learning. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly, which achieves a 23.5%/11.6% relative word error rate reduction in the w/o language model setting for base/large models. Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.
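下面用一个最小草图(Python,非论文官方实现)说明"在中间层额外加SSL损失"的做法:在若干中间层与顶层分别对隐状态做掩码单元预测并把损失相加;其中的层号、投影头以及占位的逐层损失形式均为示意性假设。

import torch
import torch.nn as nn

def ssl_loss(hidden, targets, proj):
    """占位的逐层 SSL 损失:把隐状态投影到离散目标单元空间后做交叉熵"""
    logits = proj(hidden)                                   # [batch, seq, num_units]
    return nn.functional.cross_entropy(logits.transpose(1, 2), targets)

def ils_ssl_total_loss(all_hidden_states, targets, projections, supervised_layers=(4, 8, 12)):
    """在若干中间层与顶层上分别计算 SSL 损失并求和,迫使底层也专注于内容信息"""
    total = 0.0
    for layer in supervised_layers:
        total = total + ssl_loss(all_hidden_states[layer], targets, projections[layer])
    return total

# 用法示例(随机张量仅作演示:13 层隐状态、500 个离散目标单元)
hidden_states = [torch.randn(2, 50, 256) for _ in range(13)]
targets = torch.randint(0, 500, (2, 50))
projections = {l: nn.Linear(256, 500) for l in (4, 8, 12)}
print(ils_ssl_total_loss(hidden_states, targets, projections).item())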

检测相关(3篇)

【1】 Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages 标题:利用跨语言特征改进低资源语言的同源检测 链接:https://arxiv.org/abs/2112.08789

作者:Diptesh Kanojia,Raj Dabre,Shubham Dewangan,Pushpak Bhattacharyya,Gholamreza Haffari,Malhar Kulkarni 备注:Published at COLING 2020 摘要:同源词是不同语言中相同词汇形式的变体;例如,西班牙语中的“fonema”和英语中的“phoneme”是同源词,它们都表示“一个声音单位”。自动检测任何两种语言之间的同源词的任务可以帮助后续NLP任务,如跨语言信息检索、计算系统发育和机器翻译。在这篇文章中,我们演示了如何使用跨语言单词嵌入来检测14种印度语言中的同源词。我们的方法引入了使用知识图中的上下文来生成改进的同源检测特征表示。然后,我们评估我们的同源检测机制对神经机器翻译(NMT)的影响,作为下游任务。我们评估了我们在12种印度语言(即梵语、印地语、阿萨姆语、奥里亚语、卡纳达语、古吉拉特语、泰米尔语、泰卢固语、旁遮普语、孟加拉语、马拉地语和马来语)的具有挑战性的数据集上检测同源词的方法。此外,我们还为另外两种印度语言——康卡尼语和尼泊尔语创建了评估数据集。我们观察到,同源检测的F分数提高了18%。此外,我们观察到,使用我们的方法提取的同源词有助于将NMT质量提高2.76 BLEU。我们还公开发布代码、新构建的数据集和跨语言模型。 摘要:Cognates are variants of the same lexical form across different languages; for example 'fonema' in Spanish and 'phoneme' in English are cognates, both of which mean 'a unit of sound'. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian Languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We, then, evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18% points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

【2】 Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation 标题:Twitter-Comms:检测气候、COVID和军事多模式错误信息 链接:https://arxiv.org/abs/2112.08594

作者:Giscard Biamby,Grace Luo,Trevor Darrell,Anna Rohrbach 备注:11 pages, 6 figures 摘要:检测脱离上下文的媒体内容,比如推特上"配文错误"的图像,通常需要检测两种模态之间的不一致。本文描述了我们针对DARPA语义取证(SemaFor)项目的图像-文本不一致性检测挑战的方法。首先,我们收集了Twitter-COMMs,这是一个包含88.4万条推文的大规模多模态数据集,涉及气候变化、COVID-19和军用车辆等主题。我们基于最先进的CLIP模型训练我们的方法,并利用自动生成的随机负样本和困难负样本。然后在一个保密的人工构造评估集上测试我们的方法。我们在项目排行榜上取得了最好的成绩,与零样本CLIP基线相比,在高精度区间的检测率提高了11%。 摘要:Detecting out-of-context media, such as "miscaptioned" images on Twitter, often requires detecting inconsistencies between the two modalities. This paper describes our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program. First, we collect Twitter-COMMs, a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. We train our approach, based on the state-of-the-art CLIP model, leveraging automatically generated random and hard negatives. Our method is then tested on a hidden human-generated evaluation set. We achieve the best result on the program leaderboard, with 11% detection improvement in a high precision regime over a zero-shot CLIP baseline.

【3】 Insta-VAX: A Multimodal Benchmark for Anti-Vaccine and Misinformation Posts Detection on Social Media 标题:Insta-VAX:社交媒体上抗疫苗和错误信息帖子检测的多模态基准 链接:https://arxiv.org/abs/2112.08470

作者:Mingyang Zhou,Mahasweta Chakraborti,Sijia Qian,Zhou Yu,Jingwen Zhang 摘要:在社交媒体上分享反疫苗帖子(包括错误信息帖子)已被证明会造成混乱,降低公众对疫苗的信心,导致疫苗犹豫和对疫苗的抵制。近年来,此类反疫苗帖子以各种语言和视觉形式在在线网络中迅速增多,对有效的内容审核和跟踪提出了巨大挑战。本文扩展了以往利用文本信息理解疫苗信息的工作,提出了Insta-VAX,这是一个新的多模态数据集,由64957篇与人类疫苗相关的Instagram帖子组成。我们对该数据集应用了一个众包注释流程,并由两位训练有素的专家评委进行验证。然后,我们对几个最先进的NLP和计算机视觉分类器进行了基准测试,以检测这些帖子是否表现出反疫苗态度,以及它们是否包含错误信息。大量的实验和分析表明,多模态模型能比单模态模型更准确地对帖子进行分类,但仍需改进,特别是在视觉上下文理解和外部知识协同方面。该数据集和分类器有助于监测和跟踪疫苗讨论,以支持社会科学和公共卫生领域应对疫苗错误信息问题的努力。 摘要:Sharing of anti-vaccine posts on social media, including misinformation posts, has been shown to create confusion and reduce the publics confidence in vaccines, leading to vaccine hesitancy and resistance. Recent years have witnessed the fast rise of such anti-vaccine posts in a variety of linguistic and visual forms in online networks, posing a great challenge for effective content moderation and tracking. Extending previous work on leveraging textual information to understand vaccine information, this paper presents Insta-VAX, a new multi-modal dataset consisting of a sample of 64,957 Instagram posts related to human vaccines. We applied a crowdsourced annotation procedure verified by two trained expert judges to this dataset. We then bench-marked several state-of-the-art NLP and computer vision classifiers to detect whether the posts show anti-vaccine attitude and whether they contain misinformation. Extensive experiments and analyses demonstrate the multimodal models can classify the posts more accurately than the uni-modal models, but still need improvement especially on visual context understanding and external knowledge cooperation. The dataset and classifiers contribute to monitoring and tracking of vaccine discussions for social scientific and public health efforts in combating the problem of vaccine misinformation.

识别/分类(3篇)

【1】 Simple Questions Generate Named Entity Recognition Datasets 标题:简单问题会生成命名实体识别数据集 链接:https://arxiv.org/abs/2112.08808

作者:Hyunjae Kim,Jaehyo Yoo,Seunghyun Yoon,Jinhyuk Lee,Jaewoo Kang 摘要:命名实体识别(NER)是从文本中提取特定类型命名实体的任务。当前的NER模型通常依赖于人工标注的数据集,需要大量涉及目标领域和实体的专业知识。这项工作引入了一种从询问到生成的方法,该方法通过询问简单的自然语言问题自动生成NER数据集,这些问题反映了对实体类型的需求(例如,哪种疾病?)一个开放领域的问答系统。在不使用任何域内资源(即,训练句子、标签或域内词典)的情况下,我们仅在生成的数据集上训练的模型在四个不同域的六个NER基准上的性能大大优于以前的弱监督模型。令人惊讶的是,在NCBI疾病上,我们的模型达到75.5 F1分数,甚至比以前的最佳弱监督模型高出4.1 F1分数,该模型利用了领域专家提供的丰富领域词典。用自然语言描述NER的需求还允许我们为细粒度实体类型(如Award)构建NER模型,我们的模型甚至优于完全监督模型。在三个测试基准上,我们的模型实现了新的最先进的性能。 摘要:Named entity recognition (NER) is a task of extracting named entities of specific types from text. Current NER models often rely on human-annotated datasets requiring the vast engagement of professional knowledge on the target domain and entities. This work introduces an ask-to-generate approach, which automatically generates NER datasets by asking simple natural language questions that reflect the needs for entity types (e.g., Which disease?) to an open-domain question answering system. Without using any in-domain resources (i.e., training sentences, labels, or in-domain dictionaries), our models solely trained on our generated datasets largely outperform previous weakly supervised models on six NER benchmarks across four different domains. Surprisingly, on NCBI-disease, our model achieves 75.5 F1 score and even outperforms the previous best weakly supervised model by 4.1 F1 score, which utilizes a rich in-domain dictionary provided by domain experts. Formulating the needs of NER with natural language also allows us to build NER models for fine-grained entity types such as Award, where our model even outperforms fully supervised models. On three few-shot NER benchmarks, our model achieves new state-of-the-art performance.

【2】 Extreme Zero-Shot Learning for Extreme Text Classification 标题:用于极端文本分类的极端零样本学习 链接:https://arxiv.org/abs/2112.08652

作者:Yuanhao Xiong,Wei-Cheng Chang,Cho-Jui Hsieh,Hsiang-Fu Yu,Inderjit Dhillon 备注:Our code is available at this https URL 摘要:极端多标签文本分类(XMC)问题旨在从大型标签集中为输入文本实例找到最相关的标签。然而,XMC设定面临两个挑战:(1)在动态环境中无法泛化到预测未见过的标签,(2)它需要大量有监督的(实例,标签)对,而这在新兴领域可能很难获得。最近,人们研究了广义零样本XMC(GZ-XMC)设定,并相应提出了ZestXML来处理未见标签,但它仍然需要大量标注的(实例,标签)对。在本文中,我们考虑一个更实际的场景,称为极端零样本XMC(EZ-XMC),其中不需要任何监督,只能获取实例和标签的原始文本。我们还研究了其带有有限监督的扩展——少样本XMC(FS-XMC)。为了利用原始文本学习实例和标签的语义嵌入,我们建议用自监督对比损失对基于Transformer的编码器进行预训练。具体而言,我们开发了一种预训练方法MACLR,该方法通过多尺度自适应聚类、标签正则化和伪正样本对自训练等技术充分利用原始文本。在四个公开EZ-XMC数据集上的实验结果表明,与所有其他领先的基线方法相比,MACLR取得了更优的性能,精确率和召回率平均提高约5-10%。此外,我们还表明,当训练中可用的真实正样本对数量有限时,我们的预训练编码器在FS-XMC上可以进一步改进。通过在这样的少样本子集上微调编码器,MACLR仍然显著优于其他极端分类器。 摘要:The eXtreme Multi-label text Classification (XMC) problem concerns finding most relevant labels for an input text instance from a large label set. However, the XMC setup faces two challenges: (1) it is not generalizable to predict unseen labels in dynamic environments, and (2) it requires a large amount of supervised (instance, label) pairs, which can be difficult to obtain for emerging domains. Recently, the generalized zero-shot XMC (GZ-XMC) setup has been studied and ZestXML is proposed accordingly to handle the unseen labels, which still requires a large number of annotated (instance, label) pairs. In this paper, we consider a more practical scenario called Extreme Zero-Shot XMC (EZ-XMC), in which no supervision is needed and merely raw text of instances and labels are accessible. Few-Shot XMC (FS-XMC), an extension to EZ-XMC with limited supervision is also investigated. To learn the semantic embeddings of instances and labels with raw text, we propose to pre-train Transformer-based encoders with self-supervised contrastive losses. Specifically, we develop a pre-training method MACLR, which thoroughly leverages the raw text with techniques including Multi-scale Adaptive Clustering, Label Regularization, and self-training with pseudo positive pairs. Experimental results on four public EZ-XMC datasets demonstrate that MACLR achieves superior performance compared to all other leading baseline methods, in particular with approximately 5-10% improvement in precision and recall on average. Moreover, we also show that our pre-trained encoder can be further improved on FS-XMC when there are a limited number of ground-truth positive pairs in training. By fine-tuning the encoder on such a few-shot subset, MACLR still outperforms other extreme classifiers significantly.

【3】 CLICKER: A Computational LInguistics Classification Scheme for Educational Resources 标题:CLICKER:一种教育资源的计算语言学分类方案 链接:https://arxiv.org/abs/2112.08578

作者:Swapnil Hingmire,Irene Li,Rena Kawamura,Benjamin Chen,Alexander Fabbri,Xiangru Tang,Yixin Liu,Thomas George,Tammy Liao,Wai Pan Wong,Vanessa Yan,Richard Zhou,Girish K. Palshikar,Dragomir Radev 备注:7 pages, 5 figures, 4 tables 摘要:科学学科的分类方案概述了其知识体系。它还可用于方便访问与该主题相关的研究文章和其他材料。例如,ACM计算分类系统(CCS)用于ACM数字图书馆搜索界面,也用于为计算机科学论文编制索引。我们观察到,对于计算语言学(CL)和自然语言处理(NLP),不存在像CCS或数学学科分类(MSC)这样的综合分类系统。基于对77门大学课程在线讲座的分析,我们提出了一个分类方案——CL/NLP点击器。目前提出的分类法包括334个主题,重点关注CL/NLP的教育方面;它主要以NLP课程的课堂讲稿为基础,但并非完全如此。我们将讨论这种分类法如何在各种实际应用中提供帮助,包括教学平台、资源检索、资源推荐、先决条件链学习和调查生成。 摘要:A classification scheme of a scientific subject gives an overview of its body of knowledge. It can also be used to facilitate access to research articles and other materials related to the subject. For example, the ACM Computing Classification System (CCS) is used in the ACM Digital Library search interface and also for indexing computer science papers. We observed that a comprehensive classification system like CCS or Mathematics Subject Classification (MSC) does not exist for Computational Linguistics (CL) and Natural Language Processing (NLP). We propose a classification scheme -- CLICKER for CL/NLP based on the analysis of online lectures from 77 university courses on this subject. The currently proposed taxonomy includes 334 topics and focuses on educational aspects of CL/NLP; it is based primarily, but not exclusively, on lecture notes from NLP courses. We discuss how such a taxonomy can help in various real-world applications, including tutoring platforms, resource retrieval, resource recommendation, prerequisite chain learning, and survey generation.

Zero/Few/One-Shot|迁移|自适应(2篇)

【1】 Efficient Hierarchical Domain Adaptation for Pretrained Language Models 标题:用于预训练语言模型的高效分层领域自适应 链接:https://arxiv.org/abs/2112.08786

作者:Alexandra Chronopoulou,Matthew E. Peters,Jesse Dodge 摘要:生成语言模型是在不同的、通用的领域语料库上训练的。然而,这限制了它们在较窄领域的适用性,先前的工作表明,持续的领域内训练可以提供进一步的收益。在本文中,我们介绍了一种使用计算效率高的适配器方法将域自适应扩展到许多不同域的方法。我们的方法基于文本域部分重叠的观察,我们将域表示为一个层次树结构,其中树中的每个节点都与一组适配器权重相关联。当与冻结的预训练语言模型相结合时,该方法可以在相关域之间共享参数,同时避免不相关域之间的负面干扰。对于D个域,它是有效的,并且计算成本可扩展为O(log(D))。GPT-2和C4中100个最具代表性的网站中的一大部分的实验结果表明,该领域有了全面的改进。此外,我们还提供了一个用于保持域的推理时间算法,并表明通过树的多条路径的平均可以进一步提高泛化能力,同时只增加了推理的边际成本。 摘要:Generative language models are trained on diverse, general domain corpora. However, this limits their applicability to narrower domains, and prior work has shown that continued in-domain training can provide further gains. In this paper, we introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains, while avoiding negative interference between unrelated ones. It is efficient and computational cost scales as O(log(D)) for D domains. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board improvements in-domain. We additionally provide an inference time algorithm for a held-out domain and show that averaging over multiple paths through the tree enables further gains in generalization, while adding only a marginal cost to inference.

【2】 Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems 标题:领域提示:面向ASR系统的存储和计算高效的领域适配 链接:https://arxiv.org/abs/2112.08718

作者:Saket Dingliwal,Ashish Shenoy,Sravan Bodapati,Ankur Gandhe,Ravi Teja Gadde,Katrin Kirchhoff 备注:4 pages ICASSP submission 摘要:自动语音识别(ASR)系统已在众多不同领域的工业应用中得到使用。由于特定领域的系统在域内评估上优于通用系统,对内存和计算高效的领域自适应的需求是显而易见的。特别是,适配用于对ASR假设重打分的、参数量庞大的基于Transformer的语言模型颇具挑战性。在这项工作中,我们介绍了域提示(domain-prompts),一种只训练少量域token嵌入参数、将基于Transformer的LM引导到特定领域的方法。每个域只需少量额外参数,相比使用未适配LM的基线,我们将词错误率(WER)改善了7-14%。尽管参数效率很高,这些改进与拥有数亿参数的完全微调模型相当。通过对提示长度、数据集规模、初始化方式和域的消融实验,我们为在ASR系统中使用域提示的好处提供了证据。 摘要:Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains. Since domain-specific systems perform better than their generic counterparts on in-domain evaluation, the need for memory and compute-efficient domain adaptation is obvious. Particularly, adapting parameter-heavy transformer-based language models used for rescoring ASR hypothesis is challenging. In this work, we introduce domain-prompts, a methodology that trains a small number of domain token embedding parameters to prime a transformer-based LM to a particular domain. With just a handful of extra parameters per domain, we achieve 7-14% WER improvement over the baseline of using an unadapted LM. Despite being parameter-efficient, these improvements are comparable to those of fully-fine-tuned models with hundreds of millions of parameters. With ablations on prompt-sizes, dataset sizes, initializations and domains, we provide evidence for the benefits of using domain-prompts in ASR systems.
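下面是一个"域提示"思路的最小草图(Python,非论文官方实现):冻结语言模型参数,仅为某个域训练少量可学习的提示向量,并在前向时把它们拼接到输入词嵌入之前;所用的GPT-2模型、提示长度等均为示意性假设。

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)                    # 冻结原模型,只训练域提示

prompt_len, dim = 10, model.config.n_embd
domain_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)   # 每个域训练一份

def domain_logits(text):
    ids = tok(text, return_tensors="pt").input_ids
    tok_emb = model.transformer.wte(ids)                             # [1, seq, dim]
    inputs = torch.cat([domain_prompt.unsqueeze(0), tok_emb], dim=1) # 拼接域提示向量
    return model(inputs_embeds=inputs).logits

logits = domain_logits("patient was prescribed amoxicillin")
print(logits.shape)    # [1, prompt_len + seq, vocab]

训练时只需把domain_prompt交给优化器,用域内文本的语言模型损失更新它即可,模型本身保持不变。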

检索(1篇)

【1】 CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning 标题:CONQRR:基于强化学习的对话式查询重写 链接:https://arxiv.org/abs/2112.08558

作者:Zeqiu Wu,Yi Luan,Hannah Rashkin,David Reitter,Gaurav Singh Tomar 摘要:对于开放域对话式问答(CQA),检索最相关的段落来回答问题十分重要,但与标准的段落检索相比更具挑战性,因为它需要理解完整的对话上下文,而不是单个查询。此外,重新训练成熟的检索器(如最初为非对话查询开发的搜索引擎)可能代价高昂。为了便于使用这些检索器,我们开发了查询重写模型CONQRR,它将对话上下文中的问题重写为独立问题。它通过一个新的奖励函数进行训练,直接面向检索进行优化,并且可以通过强化学习适配任何固定的黑盒检索器。我们展示了CONQRR在最近的开放域CQA数据集(由来自三个不同来源的对话组合而成)上取得了最先进的结果。我们还进行了大量实验,证明CONQRR对任何给定的固定检索器都是有效的。 摘要:For open-domain conversational question answering (CQA), it is important to retrieve the most relevant passages to answer a question, but this is challenging compared with standard passage retrieval because it requires understanding the full dialogue context rather than a single query. Moreover, it can be expensive to re-train well-established retrievers such as search engines that are originally developed for non-conversational queries. To facilitate their use, we develop a query rewriting model CONQRR that rewrites a conversational question in context into a standalone question. It is trained with a novel reward function to directly optimize towards retrieval and can be adapted to any fixed blackbox retriever using reinforcement learning. We show that CONQRR achieves state-of-the-art results on a recent open-domain CQA dataset, a combination of conversations from three different sources. We also conduct extensive experiments to show the effectiveness of CONQRR for any given fixed retriever.

表征(2篇)

【1】 Learning Rich Representation of Keyphrases from Text 标题:从文本中学习关键短语的丰富表示 链接:https://arxiv.org/abs/2112.08547

作者:Mayank Kulkarni,Debanjan Mahata,Ravneet Arora,Rajarshi Bhowmik 摘要:在这项工作中,我们探索如何学习面向任务的语言模型,目标是从文本文档中学习关键短语的丰富表示。我们在判别式和生成式设定下,用不同的掩蔽策略对Transformer语言模型(LMs)进行预训练。在判别式设定中,我们引入了一个新的预训练目标——带替换的关键短语边界填充(KBIR):当用KBIR预训练的LM针对关键短语抽取任务进行微调时,相比SOTA取得了很大的性能提升(F1最高提升9.26个点)。在生成式设定中,我们为BART引入了一种新的预训练设置——KeyBART,它以CatSeq格式重现与输入文本相关的关键短语,而不是重建去噪后的原始输入。这同样使关键短语生成的性能超过了SOTA(F1@M最高提升4.33个点)。此外,我们还在命名实体识别(NER)、问答(QA)、关系抽取(RE)和生成式摘要等任务上微调了预训练语言模型,并取得了与SOTA相当的性能,这表明学习关键短语的丰富表示确实有利于许多其他基础NLP任务。 摘要:In this work, we explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with Replacement (KBIR), showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also led to gains in performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), abstractive summarization and achieve comparable performance with that of the SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other fundamental NLP tasks.

【2】 DocAMR: Multi-Sentence AMR Representation and Evaluation 标题:DocAMR:多句AMR表示与评价 链接:https://arxiv.org/abs/2112.08513

作者:Tahira Naseem,Austin Blodgett,Sadhana Kumaravel,Tim O'Gorman,Young-Suk Lee,Jeffrey Flanigan,Ramón Fernandez Astudillo,Radu Florian,Salim Roukos,Nathan Schneider 摘要:尽管对将英语句子解析为抽象意义表示(AMR)图进行了广泛的研究,并通过Smatch度量将其与黄金图进行了比较,但将完整文档解析为统一的图表示缺乏定义良好的表示和评估。利用前面工作中的超句子层次的共指注释,我们引入了一个简单的算法来推导统一的图表示,避免了过度合并导致的信息丢失和欠合并导致的缺乏连贯性的陷阱。接下来,我们描述了对Smatch度量的改进,使其易于比较文档级图,并使用它重新评估最佳发布的文档级AMR解析器。我们还提出了一种结合顶级AMR解析器和共指消解系统的流水线方法,为未来的研究提供了一个强大的基线。 摘要:Despite extensive research on parsing of English sentences into Abstraction Meaning Representation (AMR) graphs, which are compared to gold graphs via the Smatch metric, full-document parsing into a unified graph representation lacks well-defined representation and evaluation. Taking advantage of a super-sentential level of coreference annotation from previous work, we introduce a simple algorithm for deriving a unified graph representation, avoiding the pitfalls of information loss from over-merging and lack of coherence from under-merging. Next, we describe improvements to the Smatch metric to make it tractable for comparing document-level graphs, and use it to re-evaluate the best published document-level AMR parser. We also present a pipeline approach combining the top performing AMR parser and coreference resolution systems, providing a strong baseline for future research.

Word2Vec|文本|单词(4篇)

【1】 Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL 标题:关注历史:一种对话式Text-to-SQL的上下文建模策略 链接:https://arxiv.org/abs/2112.08735

作者:Yuntao Li,Hanchu Zhang,Yutian Li,Sirui Wang,Wei Wu,Yan Zhang 摘要:会话文本到SQL旨在将多回合自然语言查询转换为相应的SQL表示。将文本转换为SQL最棘手的问题之一是对多轮查询的语义进行建模,并收集当前查询所需的适当信息。本文表明,通过添加每个回合和对整个上下文的摘要来显式地建模语义变化,可以在将会话查询转换为SQL方面带来更好的性能。特别地,我们提出了两个会话建模任务,分别是转折粒度和会话粒度。这两个任务只是作为辅助训练任务来帮助进行多回合会话语义分析。我们对大规模开放域会话文本到SQL数据集进行了实证研究,并取得了最新成果。结果表明,该机制显著提高了多轮语义分析的性能。 摘要:Conversational text-to-SQL aims at converting multi-turn natural language queries into their corresponding SQL representations. One of the most intractable problem of conversational text-to-SQL is modeling the semantics of multi-turn queries and gathering proper information required for the current query. This paper shows that explicit modeling the semantic changes by adding each turn and the summarization of the whole context can bring better performance on converting conversational queries into SQLs. In particular, we propose two conversational modeling tasks in both turn grain and conversation grain. These two tasks simply work as auxiliary training tasks to help with multi-turn conversational semantic parsing. We conducted empirical studies and achieve new state-of-the-art results on large-scale open-domain conversational text-to-SQL dataset. The results demonstrate that the proposed mechanism significantly improves the performance of multi-turn semantic parsing.

【2】 FRUIT: Faithfully Reflecting Updated Information in Text 标题:FRUIT:在文本中忠实反映更新后的信息 链接:https://arxiv.org/abs/2112.08634

作者:Robert L. Logan IV,Alexandre Passos,Sameer Singh,Ming-Wei Chang 备注:v1.0 摘要:像维基百科这样的文本知识库需要付出相当大的努力来保持最新和一致。虽然自动写作助手有可能减轻这一负担,但基于外部知识提出编辑建议的问题尚未得到充分探讨。在本文中,我们引入了新的生成任务"在文本中忠实反映更新后的信息"(FRUIT),其目标是在给定新证据的情况下更新现有文章。我们发布了FRUIT-WIKI数据集,这是一个由超过170K条远程监督数据组成的集合,这些数据来自维基百科的成对快照;同时我们还发布了数据生成管道,以及一个包含914个实例的黄金评估集,其中的编辑保证得到证据支持。我们为常用生成系统以及EDIT5提供了基准结果;EDIT5是我们提出的一种面向编辑的基于T5的方法,确立了当前的最先进水平。我们的分析表明,要构建能够忠实更新文章的模型,需要神经生成模型具备新的能力,这也为许多新应用打开了大门。 摘要:Textual knowledge bases such as Wikipedia require considerable effort to keep up to date and consistent. While automated writing assistants could potentially ease this burden, the problem of suggesting edits grounded in external knowledge has been under-explored. In this paper, we introduce the novel generation task of *faithfully reflecting updated information in text*(FRUIT) where the goal is to update an existing article given new evidence. We release the FRUIT-WIKI dataset, a collection of over 170K distantly supervised data produced from pairs of Wikipedia snapshots, along with our data generation pipeline and a gold evaluation set of 914 instances whose edits are guaranteed to be supported by the evidence. We provide benchmark results for popular generation systems as well as EDIT5 -- a T5-based approach tailored to editing we introduce that establishes the state of the art. Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models, and opens doors to many new applications.

【3】 Masked Measurement Prediction: Learning to Jointly Predict Quantities and Units from Textual Context 标题:蒙版测量预测:学习从文本上下文中联合预测量和单位 链接:https://arxiv.org/abs/2112.08616

作者:Daniel Spokoyny,Ivan Lee,Zhao Jin,Taylor Berg-Kirkpatrick 备注:Preprint 摘要:物理测量在学术论文、工程报告和网络表格中占很大一部分。目前的基准未能正确评估预先训练的语言模型在测量方面的计算能力,阻碍了开发新方法并将其应用于数值任务的研究。为此,我们引入了一个新的任务,蒙蔽测量预测(MMP),模型学习在给定蒙蔽文本的情况下重建一个数字及其相关单元。MMP对于训练新的数字信息模型以及评估现有系统的计算能力都是有用的。为了解决这一问题,我们引入了一种新的生成掩蔽测量(GeMM)模型,该模型可以联合学习预测数字及其单位。我们将我们的模型与各种烧蚀和基线进行细粒度分析比较。我们使用传统预训练Transformer模型(RoBERTa)的线性探测来表明它们的性能明显低于联合训练的数字单元模型,突出了这项新任务的难度和我们提出的预训练方法的好处。我们希望这个框架能加速未来建立更强大的数值推理系统的进程。 摘要:Physical measurements constitute a large portion of numbers in academic papers, engineering reports, and web tables. Current benchmarks fall short of properly evaluating numeracy of pretrained language models on measurements, hindering research on developing new methods and applying them to numerical tasks. To that end, we introduce a novel task, Masked Measurement Prediction (MMP), where a model learns to reconstruct a number together with its associated unit given masked text. MMP is useful for both training new numerically informed models as well as evaluating numeracy of existing systems. In order to address this task, we introduce a new Generative Masked Measurement (GeMM) model that jointly learns to predict numbers along with their units. We perform fine-grained analyses comparing our model with various ablations and baselines. We use linear probing of traditional pretrained transformer models (RoBERTa) to show that they significantly underperform jointly trained number-unit models, highlighting the difficulty of this new task and the benefits of our proposed pretraining approach. We hope this framework accelerates the progress towards building more robust numerical reasoning systems in the future.
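下面用一个最小草图(Python,非论文官方实现)说明"蒙版测量预测"样本的构造方式:把句中的"数值+单位"整体遮蔽,并把数值与单位作为联合预测目标;其中的正则表达式与单位词表仅为示意,真实数据集的抽取方式会更完备。

import re

UNITS = r"(km|m|cm|kg|g|mph|GHz|MB|GB|°C)"
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*" + UNITS)

def make_mmp_example(sentence, mask_token="[MASK]"):
    """把句中第一个"数值 单位"测量整体替换为掩码,返回模型需要重建的目标"""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    number, unit = m.group(1), m.group(2)
    masked = sentence[:m.start()] + mask_token + sentence[m.end():]
    return {"input": masked, "target_number": float(number), "target_unit": unit}

print(make_mmp_example("The probe traveled 4.2 km before losing contact."))
# {'input': 'The probe traveled [MASK] before losing contact.', 'target_number': 4.2, 'target_unit': 'km'}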

【4】 A Deep Learning Approach for Ontology Enrichment from Unstructured Text 标题:一种从非结构化文本中充实本体的深度学习方法 链接:https://arxiv.org/abs/2112.08554

作者:Lalit Mohan Sanagavarapu,Vivek Iyer,Raghu Reddy 备注:Accepted as a book chapter in "Cybersecurity & High-Performance Computing Environments: Integrated Innovations, Practices, and Applications", published by Taylor and Francis. arXiv admin note: substantial text overlap with arXiv:2102.04081 摘要:网络世界中的信息安全是一个令人担忧的主要原因,攻击面数量显著增加。web上现有的关于漏洞、攻击、控制和建议的信息提供了表示知识和执行安全分析以缓解某些问题的机会。以本体的形式表示安全知识有助于异常检测、威胁智能、推理和攻击的相关属性等。这就需要动态和自动化地丰富信息安全本体。然而,现有的基于自然语言处理和ML模型的本体丰富算法在词汇、短语和句子中的概念上下文提取方面存在问题。这激发了对顺序深入学习体系结构的需求,该体系结构遍历文本中的依赖路径,并从学习的路径表示中提取嵌入式漏洞、威胁、控制、产品和其他安全相关概念和实例。在所提出的方法中,部署了在大型DBpedia数据集和2.8GB Wikipedia语料库上训练的双向LSTM以及通用句子编码器,以丰富基于ISO27001的信息安全本体。该模型在高性能计算(HPC)环境中进行训练和测试,以处理Wiki文本维度。当使用本体和网页实例中剔除的概念进行测试以验证健壮性时,该方法的测试精度超过80%。 摘要:Information Security in the cyber world is a major cause for concern, with a significant increase in the number of attack surfaces. Existing information on vulnerabilities, attacks, controls, and advisories available on the web provides an opportunity to represent knowledge and perform security analytics to mitigate some of the concerns. Representing security knowledge in the form of ontology facilitates anomaly detection, threat intelligence, reasoning and relevance attribution of attacks, and many more. This necessitates dynamic and automated enrichment of information security ontologies. However, existing ontology enrichment algorithms based on natural language processing and ML models have issues with contextual extraction of concepts in words, phrases, and sentences. This motivates the need for sequential Deep Learning architectures that traverse through dependency paths in text and extract embedded vulnerabilities, threats, controls, products, and other security-related concepts and instances from learned path representations. In the proposed approach, Bidirectional LSTMs trained on a large DBpedia dataset and Wikipedia corpus of 2.8 GB along with Universal Sentence Encoder is deployed to enrich ISO 27001-based information security ontology. The model is trained and tested on a high-performance computing (HPC) environment to handle Wiki text dimensionality. The approach yielded a test accuracy of over 80% when tested with knocked-out concepts from ontology and web page instances to validate the robustness.

其他神经网络|深度学习|模型|建模(9篇)

【1】 Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants 标题:循环中的模型:用生成式注释助手辅助众包工作者 链接:https://arxiv.org/abs/2112.09062

作者:Max Bartolo,Tristan Thrush,Sebastian Riedel,Pontus Stenetorp,Robin Jia,Douwe Kiela 摘要:在动态对抗性数据收集(DADC)中,人类注释者的任务是找到模型难以正确预测的示例。在DADC收集的训练数据上训练的模型已被证明在对抗性和域外环境中更加稳健,并且人类要愚弄它们也要困难得多。然而,DADC比传统的数据收集更耗时,因此每个示例的成本更高。在这项工作中,我们考察是否能够在不承担额外成本的情况下保持DADC的优势。为此,我们引入了生成式注释助手(GAA),即循环中的生成器模型,它提供注释者可以批准、修改或完全拒绝的实时建议。我们在二十种实验设置下收集了训练数据集,并针对抽取式问答(QA)任务,对标准和对抗性数据收集两种情形下的这种方法进行了详细分析。我们证明了GAA在注释速度方面带来显著的效率优势,同时提高了模型被愚弄的比例。此外,我们还表明,GAA辅助的数据能在各种问答任务上带来更高的下游模型性能。 摘要:In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per example. In this work, we examine if we can maintain the advantages of DADC, without suffering the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits in terms of annotation speed, while leading to improved model fooling rates. In addition, we show that GAA-assisted data leads to higher downstream model performance on a variety of question answering tasks.

【2】 Bridging between Cognitive Processing Signals and Linguistic Features via a Unified Attentional Network 标题:通过统一注意网络在认知加工信号和语言特征之间架起桥梁 链接:https://arxiv.org/abs/2112.08831

作者:Yuqi Ren,Deyi Xiong 摘要:认知加工信号可以用来改善自然语言处理(NLP)任务。然而,这些信号如何与语言信息相关尚不清楚。人类语言处理和语言特征之间的桥梁在神经语言学中得到了广泛的研究,通常是通过高控制刺激的单变量控制实验。这种方法不仅损害了自然阅读的真实性,而且耗时且昂贵。在本文中,我们提出了一种数据驱动的方法来研究认知加工信号与语言特征之间的关系。具体来说,我们提出了一个统一的注意框架,该框架由嵌入层、注意层、编码层和预测层组成,以选择性地将认知加工信号映射到语言特征。我们将映射过程定义为桥接任务,并针对词汇、句法和语义特征开发了12个桥接任务。该框架只需要记录在自然阅读下的认知加工信号作为输入,并且可以使用单个认知数据集检测广泛的语言特征。实验结果的观察结果与先前的神经科学发现一致。除此之外,我们的实验还揭示了一些有趣的发现,例如上下文眼动特征和句子时态之间的相关性。 摘要:Cognitive processing signals can be used to improve natural language processing (NLP) tasks. However, it is not clear how these signals correlate with linguistic information. Bridging between human language processing and linguistic features has been widely studied in neurolinguistics, usually via single-variable controlled experiments with highly-controlled stimuli. Such methods not only compromises the authenticity of natural reading, but also are time-consuming and expensive. In this paper, we propose a data-driven method to investigate the relationship between cognitive processing signals and linguistic features. Specifically, we present a unified attentional framework that is composed of embedding, attention, encoding and predicting layers to selectively map cognitive processing signals to linguistic features. We define the mapping procedure as a bridging task and develop 12 bridging tasks for lexical, syntactic and semantic features. The proposed framework only requires cognitive processing signals recorded under natural reading as inputs, and can be used to detect a wide range of linguistic features with a single cognitive dataset. Observations from experiment results resonate with previous neuroscience findings. In addition to this, our experiments also reveal a number of interesting findings, such as the correlation between contextual eye-tracking features and tense of sentence.

【3】 UniREx: A Unified Learning Framework for Language Model Rationale Extraction 标题:UniREx:语言模型理论基础提取的统一学习框架 链接:https://arxiv.org/abs/2112.08802

作者:Aaron Chan,Maziar Sanjabi,Lambert Mathias,Liang Tan,Shaoliang Nie,Xiaochang Peng,Xiang Ren,Hamed Firooz 备注:14 pages, 6 figures 摘要:提取原理通过突出显示对输出影响最大的文本输入来解释语言模型(LM)对给定任务实例的预测。理想情况下,基本原理提取应该忠实(反映LM的行为)、合理(对人类有意义)、数据高效且快速,而不会牺牲LM的任务性能。先前的基本原理提取工作包括处理这些需求的各种子集的专门方法,但决不是全部五种。狭隘地关注某些desiderata通常是以忽略的desiderata为代价的,因此现有的基本原理提取器在实际应用中往往不切实际。为了应对这一挑战,我们提出了UniREx,这是一个用于基本原理提取的统一且高度灵活的学习框架,它允许用户轻松解释所有五个因素。UniREx支持端到端定制基本原理提取器训练流程,支持任意:(1)启发性/习得性基本原理提取器,(2)忠实性和/或合理性目标的组合,以及(3)黄金基本原理监督量。在三个文本分类数据集中,我们最好的UniREx配置实现了五个desiderata的卓越平衡,与强大的基线相比。此外,UniREx训练的基本原理提取器甚至可以推广到看不见的数据集和任务。 摘要:An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the output. Ideally, rationale extraction should be faithful (reflects LM's behavior), plausible (makes sense to humans), data-efficient, and fast, without sacrificing the LM's task performance. Prior rationale extraction works consist of specialized approaches for addressing various subsets of these desiderata -- but never all five. Narrowly focusing on certain desiderata typically comes at the expense of ignored ones, so existing rationale extractors are often impractical in real-world applications. To tackle this challenge, we propose UniREx, a unified and highly flexible learning framework for rationale extraction, which allows users to easily account for all five factors. UniREx enables end-to-end customization of the rationale extractor training process, supporting arbitrary: (1) heuristic/learned rationale extractors, (2) combinations of faithfulness and/or plausibility objectives, and (3) amounts of gold rationale supervision. Across three text classification datasets, our best UniREx configurations achieve a superior balance of the five desiderata, when compared to strong baselines. Furthermore, UniREx-trained rationale extractors can even generalize to unseen datasets and tasks.

【4】 CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain 标题:CLIN-X:临床领域概念提取的预训练语言模型和跨任务迁移研究 链接:https://arxiv.org/abs/2112.08754

作者:Lukas Lange,Heike Adel,Jannik Strötgen,Dietrich Klakow 摘要:自然语言处理(NLP)领域最近发生了巨大变化,几乎所有任务都转向使用预训练语言模型来解决。尽管在各类任务的基准数据集上带来了很大改进,这些模型在非标准领域(如临床领域)往往表现欠佳,因为预训练文档与目标文档之间存在很大差距。在本文中,我们旨在通过领域特定的语言模型训练来弥合这一差距,并研究其对一系列下游任务和设置的影响。我们介绍了预训练的CLIN-X(Clinical XLM-R)语言模型,并展示了CLIN-X在两种语言的十项临床概念抽取任务上大幅优于其他预训练Transformer模型。此外,我们还展示了如何用我们提出的任务与语言无关的模型架构(基于随机划分上的集成和跨句上下文)进一步改进Transformer模型。我们在低资源和迁移设置下的研究表明,即使缺乏标注数据,模型性能依然稳定:当只有250个标注句子可用时,改进最高可达47个F1点。我们的结果强调了像CLIN-X这样的专用语言模型对非标准领域概念抽取的重要性,同时也表明我们的任务无关模型架构在所测试的任务和语言上都很稳健,因此不需要针对领域或任务的特殊调整。CLIN-X语言模型以及用于微调和迁移模型的源代码已在 https://github.com/boschresearch/clin_x/ 和 Hugging Face 模型中心公开提供。 摘要:The field of natural language processing (NLP) has recently seen a large change towards using pre-trained language models for solving almost any task. Despite showing great improvements in benchmark datasets for various tasks, these models often perform sub-optimal in non-standard domains like the clinical domain where a large gap between pre-training documents and target documents is observed. In this paper, we aim at closing this gap with domain-specific training of the language model and we investigate its effect on a diverse set of downstream tasks and settings. We introduce the pre-trained CLIN-X (Clinical XLM-R) language models and show how CLIN-X outperforms other pre-trained transformer models by a large margin for ten clinical concept extraction tasks from two languages. In addition, we demonstrate how the transformer model can be further improved with our proposed task- and language-agnostic model architecture based on ensembles over random splits and cross-sentence context. Our studies in low-resource and transfer settings reveal stable model performance despite a lack of annotated data with improvements of up to 47 F1 points when only 250 labeled sentences are available. Our results highlight the importance of specialized language models as CLIN-X for concept extraction in non-standard domains, but also show that our task-agnostic model architecture is robust across the tested tasks and languages so that domain- or task-specific adaptations are not required. The CLIN-X language models and source code for fine-tuning and transferring the model are publicly available at https://github.com/boschresearch/clin_x/ and the huggingface model hub.

【5】 DOCmT5: Document-Level Pretraining of Multilingual Language Models 标题:DOCmT5:多语言模型的文档级预训练 链接:https://arxiv.org/abs/2112.08709

作者:Chia-Hsuan Lee,Aditya Siddhant,Viresh Ratnakar,Melvin Johnson 摘要:在本文中,我们介绍了DOCmT5,一个用大规模平行文档预训练的多语言序列到序列语言模型。以往的方法侧重于利用句子级平行数据,而我们尝试构建一个能够理解和生成长文档的通用预训练模型。我们提出了一个简单而有效的预训练目标——文档重排序机器翻译(DrMT),其中模型需要翻译被打乱并掩蔽的输入文档。DrMT在各类文档级生成任务的强基线上带来了一致的改进,包括已见语言对的文档级MT超过12个BLEU点、未见语言对的文档级MT超过7个BLEU点,以及已见语言对的跨语言摘要超过3个ROUGE-1点。我们在WMT20 De-En和IWSLT15 Zh-En文档翻译任务上取得了最先进(SOTA)的结果。我们还对文档预训练的各种因素进行了广泛分析,包括(1)预训练数据质量的影响,以及(2)单语预训练与跨语言预训练相结合的影响。我们计划公开我们的模型检查点。 摘要:In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we try to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective - Document Reordering Machine Translation (DrMT), in which the input documents that are shuffled and masked need to be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) on WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis on various factors for document pre-training, including (1) the effects of pre-training data quality and (2) The effects of combining mono-lingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.

【6】 DREAM: Uncovering Mental Models behind Language Models 标题:DREAM:揭开语言模型背后的心智模型 链接:https://arxiv.org/abs/2112.08656

作者:Yuling Gu,Bhavana Dalvi Mishra,Peter Clark 摘要:在回答情境化问题(例如,关于特定道德困境的问题)时,语言模型(LMs)在多大程度上构建了场景的"心智模型"?虽然认知科学已经表明心智模型在人类解决问题中起着基础性作用,但目前尚不清楚现有LMs的高问答性能是否由类似的模型构建所支撑——如果不是,这能否解释它们众所周知的灾难性失败。我们观察到,Macaw(一种现有的基于T5的LM)在被探查时,能为情境化问题给出有一定用处但并不充分的心智模型(估计准确率=43%,有用性=21%,一致性=42%)。我们提出了DREAM模型,它以情境化问题为输入,生成一个刻画该情境的心智模型,且不需要任何额外的任务特定心智模型训练数据。它通过对现有NLP资源的远程监督继承其社会常识。我们的分析表明,与Macaw相比,DREAM能够生成显著更好的心智模型(估计准确率=67%,有用性=37%,一致性=71%)。最后,DREAM生成的心智模型可以作为情境化QA任务的附加上下文。在三个不同的数据集上,这种附加上下文将Macaw零样本模型的回答准确率提高了+1%到+4%(绝对值)。 摘要:To what extent do language models (LMs) build "mental models" of a scene when answering situated questions (e.g., questions about a specific ethical dilemma)? While cognitive science has shown that mental models play a fundamental role in human problem-solving, it is unclear whether the high question-answering performance of existing LMs is backed by similar model building - and if not, whether that can explain their well-known catastrophic failures. We observed that Macaw, an existing T5-based LM, when probed provides somewhat useful but inadequate mental models for situational questions (estimated accuracy=43%, usefulness=21%, consistency=42%). We propose DREAM, a model that takes a situational question as input to produce a mental model elaborating the situation, without any additional task specific training data for mental models. It inherits its social commonsense through distant supervision from existing NLP resources. Our analysis shows that DREAM can produce significantly better mental models (estimated accuracy=67%, usefulness=37%, consistency=71%) compared to Macaw. Finally, mental models generated by DREAM can be used as additional context for situational QA tasks. This additional context improves the answer accuracy of a Macaw zero-shot model by between +1% and +4% (absolute) on three different datasets.

【7】 Reconsidering the Past: Optimizing Hidden States in Language Models 标题:反思过去:优化语言模型中的隐藏状态 链接:https://arxiv.org/abs/2112.08653

作者:Davis Yoshida,Kevin Gimpel 备注:None 摘要:我们提出了隐状态优化(HSO),这是一种基于梯度的方法,用于在推理时提高Transformer语言模型的性能。与动态评估(Krause et al., 2018)类似,HSO计算语言模型赋予评估文本的对数概率的梯度,但用它来更新缓存的隐状态而不是模型参数。我们用预训练的Transformer-XL和GPT-2语言模型测试HSO,发现在WikiText103和PG-19数据集上困惑度有所改善,尤其是在模型的训练分布之外进行评估时。我们还在最近提出的基于提示的少样本评估设置中展示了增益,从而证明其对下游任务的适用性,同样不需要额外的参数或训练数据。 摘要:We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models at inference time. Similar to dynamic evaluation (Krause et al., 2018), HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters. We test HSO with pretrained Transformer-XL and GPT-2 language models, finding improvement on the WikiText103 and PG-19 datasets in terms of perplexity, especially when evaluating a model outside of its training distribution. We also demonstrate downstream applicability by showing gains in the recently developed prompt-based few-shot evaluation setting, again with no extra parameters or training data.
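下面是一个隐状态优化思路的最小草图(Python,非论文官方实现):把GPT-2缓存的past_key_values视为可微变量,用评估文本的负对数似然对其做一步梯度更新,而不改动模型参数;学习率、示例文本等均为示意性假设,与论文的具体更新规则可能不同。

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                    # 不更新模型参数,只更新缓存隐状态

context = tok("The history of natural language processing", return_tensors="pt").input_ids
new_text = tok(" began long before modern computers existed.", return_tensors="pt").input_ids

with torch.no_grad():
    past = model(context, use_cache=True).past_key_values    # 前文的缓存 key/value

# 把缓存张量变成需要梯度的叶子变量
past = tuple(tuple(t.detach().clone().requires_grad_(True) for t in layer) for layer in past)
states = [t for layer in past for t in layer]

loss = model(new_text, past_key_values=past, labels=new_text).loss
loss.backward()                                # 评估文本的负对数似然对缓存状态求梯度
with torch.no_grad():
    for t in states:
        t -= 1e-3 * t.grad                     # 梯度下降一步,只更新"对过去的记忆"
print(float(loss))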

【8】 Learning To Retrieve Prompts for In-Context Learning 标题:学习为上下文学习检索提示 链接:https://arxiv.org/abs/2112.08633

作者:Ohad Rubin,Jonathan Herzig,Jonathan Berant 摘要:上下文学习(in-context learning)是自然语言理解的一种新范式:一个大型预训练语言模型(LM)把一个测试实例和少量训练示例作为输入,直接解码出输出,而不需要更新任何参数。然而,已有研究表明,其性能在很大程度上取决于所选的训练示例(称为提示)。在这项工作中,我们提出了一种利用标注数据和语言模型为上下文学习检索提示的有效方法。给定一个输入-输出对,我们以某个候选训练示例作为提示,估计在该提示和输入条件下生成输出的概率,并据此将训练示例标注为正例或负例。然后,我们在这些数据上训练一个高效的稠密检索器,用于在测试时检索训练示例作为提示。我们在三个将自然语言话语映射为语义表示的序列到序列任务上评估了该方法,发现它全面且显著地优于先前工作和多个基线。 摘要:In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directly decodes the output without any update to its parameters. However, performance has been shown to strongly depend on the selected training examples (termed prompt). In this work, we propose an efficient method for retrieving prompts for in-context learning using annotated data and a LM. Given an input-output pair, we estimate the probability of the output given the input and a candidate training example as the prompt, and label training examples as positive or negative based on this probability. We then train an efficient dense retriever from this data, which is used to retrieve training examples as prompts at test time. We evaluate our approach on three sequence-to-sequence tasks where language utterances are mapped to meaning representations, and find that it substantially outperforms prior work and multiple baselines across the board.
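下面用一个极简的Python函数示意"用语言模型打分来为候选提示构造正负例"这一步(根据摘要自行构造,并非原论文实现;lm_score为假设存在的打分函数,返回对数概率,提示的拼接格式也仅为示意):

def label_prompt_candidates(train_example, candidates, lm_score, k=5):
    """对每个候选训练示例,估计以其为提示时生成输出y的对数概率;得分最高的k个作为正例、最低的k个作为负例,用于训练稠密检索器。"""
    x, y = train_example["input"], train_example["output"]
    scored = []
    for cand in candidates:
        prompt = cand["input"] + "\t" + cand["output"] + "\n" + x  # 假设的拼接格式
        scored.append((lm_score(prompt, y), cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    positives = [c for _, c in scored[:k]]
    negatives = [c for _, c in scored[-k:]]
    return positives, negatives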

【9】 DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models 标题:DuQM:用于评估问题匹配模型鲁棒性的、含语言学扰动自然问题的中文数据集 链接:https://arxiv.org/abs/2112.08609

作者:Hongyu Zhu,Yan Chen,Jing Yan,Jing Liu,Yu Hong,Ying Chen,Hua Wu,Haifeng Wang 摘要:本文主要研究中文问题匹配的鲁棒性评估。以往分析鲁棒性问题的工作大多只关注一种或几种类型的人工对抗样本。与之不同,我们认为有必要针对自然文本,对模型的语言能力进行全面评估。为此,我们构建了一个中文数据集DuQM,其中包含带有语言学扰动的自然问题,用于评估问题匹配模型的鲁棒性。DuQM包含3个大类、13个子类,共涵盖32种语言学扰动。大量实验表明,DuQM能更好地区分不同模型。更重要的是,DuQM中按语言现象细分的评估结果有助于我们方便地诊断不同模型的优势与弱点。此外,我们的实验结果表明,人工对抗样本上观察到的效果并不能迁移到自然文本上。 摘要:In this paper, we focus on studying robustness evaluation of Chinese question matching. Most of the previous work on analyzing robustness issue focus on just one or a few types of artificial adversarial examples. Instead, we argue that it is necessary to formulate a comprehensive evaluation about the linguistic capabilities of models on natural texts. For this purpose, we create a Chinese dataset namely DuQM which contains natural questions with linguistic perturbations to evaluate the robustness of question matching models. DuQM contains 3 categories and 13 subcategories with 32 linguistic perturbations. The extensive experiments demonstrate that DuQM has a better ability to distinguish different models. Importantly, the detailed breakdown of evaluation by linguistic phenomenon in DuQM helps us easily diagnose the strength and weakness of different models. Additionally, our experiment results show that the effect of artificial adversarial examples does not work on the natural texts.

其他(8篇)

【1】 ADBCMM : Acronym Disambiguation by Building Counterfactuals and Multilingual Mixing 标题:ADBCMM:通过构建反事实和多语言混合来消除缩略语歧义 链接:https://arxiv.org/abs/2112.08991

作者:Yixuan Weng,Fei Xia,Bin Li,Xiusheng Huang,Shizhu He,Kang Liu,Jun Zhao 备注:SDU@AAAI-2022 摘要:科学文献通常包含大量首字母缩略词。消除这些首字母缩略词的歧义有助于研究人员更好地理解文档中词汇的含义。过去,得益于大量的英语文献数据,首字母缩略词消歧任务主要在英语文献上开展;而对其他低资源语言来说,由于缺乏大规模标注数据,这项任务难以取得良好性能,也较少受到关注。针对上述问题,本文提出了一种新的首字母缩略词消歧方法ADBCMM,它通过构建反事实样本和多语言混合,显著提升低资源语言上的性能。具体而言,通过平衡低资源语言中的数据偏差,ADBCMM能够改善数据集之外的测试性能。在SDU@AAAI-22共享任务2(首字母缩略词消歧)中,所提方法在法语和西班牙语赛道上获得第一名。复现结果的代码见 https://github.com/WENGSYX/ADBCMM。 摘要:Scientific documents often contain a large number of acronyms. Disambiguation of these acronyms will help researchers better understand the meaning of vocabulary in the documents. In the past, thanks to large amounts of data from English literature, acronym task was mainly applied in English literature. However, for other low-resource languages, this task is difficult to obtain good performance and receives less attention due to the lack of large amount of annotation data. To address the above issue, this paper proposes a new method for acronym disambiguation, named as ADBCMM, which can significantly improve the performance of low-resource languages by building counterfactuals and multilingual mixing. Specifically, by balancing data bias in low-resource language, ADBCMM will be able to improve the test performance outside the data set. In SDU@AAAI-22 - Shared Task 2: Acronym Disambiguation, the proposed method won first place in French and Spanish. You can repeat our results here https://github.com/WENGSYX/ADBCMM.

【2】 Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling 标题:神经自回归序列建模中过平滑问题的表征与解决 链接:https://arxiv.org/abs/2112.08914

作者:Ilia Kulikov,Maksim Eremeev,Kyunghyun Cho 备注:Ilia Kulikov and Maksim Eremeev contributed equally 摘要:神经自回归序列模型会把概率分散到许多可能的序列上,其中包括空序列或重复序列等退化序列。在这项工作中,我们处理其中一种特定情况,即模型给不合理的短序列赋予了过高概率。我们定义了过平滑率来量化这一问题。在确认神经机器翻译中存在高度的过平滑之后,我们提出在训练期间显式地最小化过平滑率。我们通过一组实验研究了所提正则化对模型分布和解码性能的影响。我们以神经机器翻译任务为测试平台,并考虑三个规模不同的数据集。实验揭示了三个主要发现。首先,我们可以通过调节正则化强度来控制模型的过平滑率。其次,增大过平滑损失的贡献后,序列结束符(<eos>)在不应出现的位置上的概率和排名都大幅下降。第三,所提正则化会影响束搜索的结果,尤其是在使用较大束宽时。在较低的过平滑率下,大束宽下翻译质量(以BLEU衡量)的退化显著减轻,但与较小束宽相比,退化仍然存在。基于这些观察,我们得出结论:高度的过平滑是神经自回归模型中"过短序列获得过高概率"这一退化情形背后的主要原因。 摘要:Neural autoregressive sequence models smear the probability among many possible sequences including degenerate ones, such as empty or repetitive sequences. In this work, we tackle one specific case where the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both model distribution and decoding performance. We use a neural machine translation task as the testbed and consider three different datasets of varying size. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, by enhancing the oversmoothing loss contribution, the probability and the rank of the <eos> token decrease heavily at positions where it is not supposed to be. Third, the proposed regularization impacts the outcome of beam search especially when a large beam is used. The degradation of translation quality (measured in BLEU) with a large beam significantly lessens with lower oversmoothing rate, but the degradation compared to smaller beam sizes remains to exist. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in a neural autoregressive model.
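下面给出"过平滑率"及相应正则项的一个简化PyTorch示意(根据摘要自行构造,并非论文中的精确定义:这里只统计非终止位置上<eos>的对数概率是否高于真实下一个词,并用合页损失加以惩罚):

import torch

def oversmoothing_stats(log_probs, targets, eos_id, margin=0.0):
    """log_probs: 形状(B, T, V)的对数概率;targets: 形状(B, T)的真实词id。返回(过平滑率, 正则损失)。"""
    gold_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # 真实词的对数概率
    eos_lp = log_probs[..., eos_id]                                    # <eos>的对数概率
    non_final = targets.ne(eos_id)                 # 只统计真实词不是<eos>的位置
    denom = non_final.float().sum().clamp(min=1)
    rate = ((eos_lp > gold_lp) & non_final).float().sum() / denom
    hinge = torch.relu(margin + eos_lp - gold_lp) * non_final.float()  # 鼓励真实词概率高于<eos>
    return rate, hinge.sum() / denom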

【3】 Gendered Language in Resumes and its Implications for Algorithmic Bias in Hiring 标题:简历中的性别语言及其对招聘算法偏差的影响 链接:https://arxiv.org/abs/2112.08910

作者:Prasanna Parasurama,João Sedoc 备注:None 摘要:尽管人们越来越担心算法招聘所用NLP模型中的性别偏见,但研究简历中性别化语言的范围与性质的实证工作还很少。我们使用来自IT企业的70.9万份简历语料,训练了一系列用于分类申请人性别的模型,以此衡量简历中编码的性别信息的程度。我们还考察了能否通过删除性别标识词、兴趣爱好、嵌入模型中的性别子空间等方式,将性别信息从简历中模糊掉。我们发现,即使经过这些模糊化处理,简历中仍然存在大量性别信息:一个简单的Tf-Idf模型就能以AUROC=0.75的水平学会分类性别,而更复杂的基于Transformer的模型可以达到AUROC=0.8。我们进一步发现,对性别有预测力的特征与嵌入空间中的性别方向相关性很低,这意味着能够预测性别的信息远不止男性/女性意义上"性别化"的那部分内容。我们讨论了这些发现在招聘场景下对算法偏见和公平性的影响。 摘要:Despite growing concerns around gender bias in NLP models used in algorithmic hiring, there is little empirical work studying the extent and nature of gendered language in resumes. Using a corpus of 709k resumes from IT firms, we train a series of models to classify the gender of the applicant, thereby measuring the extent of gendered information encoded in resumes. We also investigate whether it is possible to obfuscate gender from resumes by removing gender identifiers, hobbies, gender sub-space in embedding models, etc. We find that there is a significant amount of gendered information in resumes even after obfuscation. A simple Tf-Idf model can learn to classify gender with AUROC=0.75, and more sophisticated transformer-based models achieve AUROC=0.8. We further find that gender predictive values have low correlation with gender direction of embeddings -- meaning that, what is predictive of gender is much more than what is "gendered" in the masculine/feminine sense. We discuss the algorithmic bias and fairness implications of these findings in the hiring context.
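摘要中提到的"简单Tf-Idf模型"大致对应如下的sklearn流程(示意代码,使用玩具数据;真实实验基于70.9万份简历,此处的特征与超参数均为假设):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

texts = ["managed a softball team and organized events",
         "captain of the football team, built backend systems",
         "volunteered at a dance studio, python developer",
         "led a robotics club, enjoys woodworking"]
labels = [1, 0, 1, 0]  # 仅为演示的示意标签

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, scores))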

【4】 Looking Outside the Box to Ground Language in 3D Scenes 标题:跳出候选框:在3D场景中定位语言 链接:https://arxiv.org/abs/2112.08879

作者:Ayush Jain,Nikolaos Gkanatsios,Ishita Mediratta,Katerina Fragkiadaki 备注:First two authors contributed equally 摘要:现有的语言定位(grounding)模型通常受制于候选框瓶颈:由预训练检测器给出场景中的候选物体,模型学习从这些候选框中选择答案,而不直接关注原始图像或三维点云。物体检测器通常在固定的物体和属性词表上训练,这对开放域的语言定位往往过于受限,因为话语可能涉及不同抽象层次的视觉实体,例如椅子、椅子腿或椅子前腿的尖端。我们提出了一个绕过候选框瓶颈、在3D场景中定位语言的模型,其主要创新有三点:i)在语言流、点云特征流和3D候选框之间进行迭代注意;ii)使用带非参数实体查询的Transformer解码器,为物体及其部件的指代解码出3D框;iii)把物体检测视为对"由候选类别标签列表构成的指代话语"的定位,从而同时利用3D物体标注和语言定位标注进行联合监督。这些创新在主流3D语言定位基准上相对以往方法带来了显著的量化提升(在SR3D基准上绝对提升最高可达9%)。我们对每项创新进行了消融实验,以展示其对模型性能的贡献。在仅作少量修改后应用于2D图像上的语言定位时,该模型性能与最先进方法相当,而收敛所需的GPU时间只有其一半。代码和模型检查点将发布于 https://github.com/nickgkan/beauty_detr 摘要:Existing language grounding models often use object proposal bottlenecks: a pre-trained detector proposes objects in the scene and the model learns to select the answer from these box proposals, without attending to the original image or 3D point cloud. Object detectors are typically trained on a fixed vocabulary of objects and attributes that is often too restrictive for open-domain language grounding, where an utterance may refer to visual entities at various levels of abstraction, such as a chair, the leg of a chair, or the tip of the front leg of a chair. We propose a model for grounding language in 3D scenes that bypasses box proposal bottlenecks with three main innovations: i) Iterative attention across the language stream, the point cloud feature stream and 3D box proposals. ii) Transformer decoders with non-parametric entity queries that decode 3D boxes for object and part referentials. iii) Joint supervision from 3D object annotations and language grounding annotations, by treating object detection as grounding of referential utterances comprised of a list of candidate category labels. These innovations result in significant quantitative gains (up to +9% absolute improvement on the SR3D benchmark) over previous approaches on popular 3D language grounding benchmarks. We ablate each of our innovations to show its contribution to the performance of the model. When applied on language grounding on 2D images with minor changes, it performs on par with the state-of-the-art while converges in half of the GPU time. The code and checkpoints will be made available at https://github.com/nickgkan/beauty_detr

【5】 δ-SAM: Sharpness-Aware Minimization with Dynamic Reweighting 标题:δ-SAM:带动态重加权的锐度感知最小化 链接:https://arxiv.org/abs/2112.08772

作者:Wenxuan Zhou,Muhao Chen 摘要:深度神经网络往往参数过多,模型泛化并不容易实现。对抗训练通过对在对抗性选择的扰动下的损失变化进行正则化,在提升泛化能力方面显示出有效性。最近提出的锐度感知最小化(SAM)算法采用对抗性权重扰动,鼓励模型收敛到平坦的极小值。遗憾的是,由于计算开销增加,对抗性权重扰动只能在每个批次而非每个实例的粒度上高效近似,从而导致性能下降。在本文中,我们提出在每个批次内对扰动进行动态重加权,其中"未受保护"(unguarded)的实例被赋予更高权重,从而更好地近似逐实例扰动。我们提出了带动态重加权的锐度感知最小化(δ-SAM),通过高效的保护度(guardedness)估计来实现这一思想。在GLUE基准上的实验证明了δ-SAM的有效性。 摘要:Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm adopts adversarial weight perturbation, encouraging the model to converging to a flat minima. Unfortunately, due to increased computational cost, adversarial weight perturbation can only be efficiently approximated per-batch instead of per-instance, leading to degraded performance. In this paper, we propose that dynamically reweighted perturbation within each batch, where unguarded instances are up-weighted, can serve as a better approximation to per-instance perturbation. We propose sharpness-aware minimization with dynamic reweighting (δ-SAM), which realizes the idea with efficient guardedness estimation. Experiments on the GLUE benchmark demonstrate the effectiveness of δ-SAM.
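下面给出"SAM加批内动态重加权"这一思路的简化PyTorch示意(根据摘要自行构造,并非原论文实现;这里用逐样本损失的softmax作为动态权重,只是代替论文中"保护度估计"的一个假设性代理):

import torch

def reweighted_sam_step(model, per_example_loss_fn, inputs, targets, optimizer,
                        rho=0.05, temperature=1.0):
    """per_example_loss_fn(outputs, targets)需返回形状为(B,)的逐样本损失(即reduction='none')。"""
    per_example = per_example_loss_fn(model(inputs), targets)
    weights = torch.softmax(per_example.detach() / temperature, dim=0)  # 批内动态权重(示意)
    (weights * per_example).sum().backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)  # 沿梯度方向的权重扰动(上升步)
            p.add_(e)
            eps[p] = e
    model.zero_grad()
    per_example = per_example_loss_fn(model(inputs), targets)
    (weights * per_example).sum().backward()        # 在扰动后的权重上求加权损失的梯度
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                               # 撤销扰动,回到原始权重
    optimizer.step()
    optimizer.zero_grad()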

【6】 MAVE: A Product Dataset for Multi-source Attribute Value Extraction 标题:MAVE:一种用于多源属性值提取的产品数据集 链接:https://arxiv.org/abs/2112.08663

作者:Li Yang,Qifan Wang,Zac Yu,Anand Kulkarni,Sumit Sanghai,Bin Shu,Jon Elsas,Bhargav Kanagal 备注:10 pages, 7 figures. Accepted to WSDM 2022. Dataset available at this https URL 摘要:属性值提取是指从产品信息中识别感兴趣属性取值的任务。产品属性值在许多电子商务场景中必不可少,例如客服机器人、产品排序、检索和推荐。而在现实世界中,产品的属性值往往不完整并随时间变化,这极大地阻碍了实际应用。为了更好地推动产品属性值提取研究,本文引入了一个新的数据集MAVE。MAVE由精选自Amazon页面的220万个产品组成,在1257个不同类别上包含300万条属性-值标注。MAVE有四个主要且独特的优势:第一,按属性-值样例数量计,MAVE是最大的产品属性值提取数据集;第二,MAVE包含来自产品的多源表示,以较高的属性覆盖率刻画完整的产品信息;第三,相对以往数据集,MAVE覆盖的属性和取值更加多样;最后,MAVE提供了一个极具挑战性的零样本测试集,我们在实验中对此给出了实证说明。我们还提出了一种能从多源产品信息中有效提取属性值的新方法。我们用多个基线进行了大量实验,结果表明MAVE对属性值提取任务是一个有效的数据集,而零样本属性提取也是一项极具挑战性的任务。数据可从 https://github.com/google-research-datasets/MAVE 获取。 摘要:Attribute value extraction refers to the task of identifying values of an attribute of interest from product information. Product attribute values are essential in many e-commerce scenarios, such as customer service robots, product ranking, retrieval and recommendations. While in the real world, the attribute values of a product are usually incomplete and vary over time, which greatly hinders the practical applications. In this paper, we introduce MAVE, a new dataset to better facilitate research on product attribute value extraction. MAVE is composed of a curated set of 2.2 million products from Amazon pages, with 3 million attribute-value annotations across 1257 unique categories. MAVE has four main and unique advantages: First, MAVE is the largest product attribute value extraction dataset by the number of attribute-value examples. Second, MAVE includes multi-source representations from the product, which captures the full product information with high attribute coverage. Third, MAVE represents a more diverse set of attributes and values relative to what previous datasets cover. Lastly, MAVE provides a very challenging zero-shot test set, as we empirically illustrate in the experiments. We further propose a novel approach that effectively extracts the attribute value from the multi-source product information. We conduct extensive experiments with several baselines and show that MAVE is an effective dataset for attribute value extraction task. It is also a very challenging task on zero-shot attribute extraction. Data is available at https://github.com/google-research-datasets/MAVE.

【7】 Human Languages with Greater Information Density Increase Communication Speed, but Decrease Conversation Breadth 标题:人类语言的信息密度越大,交流速度越快,但交谈的广度却越小 链接:https://arxiv.org/abs/2112.08491

作者:Pedro Aceves,James A. Evans 摘要:语言是人类传递信息、达成协调的主要媒介。语言最重要的功能之一是对世界进行分类,使信息得以通过对话传达。虽然我们对人类语言如何在颜色、声音、数字、运动、时间、空间、人类活动、性别、身体部位和生物等语义域中编码信息已有相当多的了解,但对语义信息的全局结构及其对人类交流的影响却知之甚少。借助大规模计算、人工智能技术,以及覆盖宗教、经济、医学、娱乐、政治和技术等15个主题领域、999种语言的大规模平行语料,我们展示了不同语言在信息密度和语义密度上的巨大差异,及其对人类交流与协调的影响。与以往工作不同,我们证明高密度语言比低密度语言传递信息的速度快得多。随后,利用14种语言的9000多段真实对话和140种语言的9万篇维基百科文章,我们表明:由于在更稠密的语言中讨论任一话题的方式更多,对话和文章会在更窄的概念范围内反复折返和循环。这些结果揭示了人类交流通道中一个重要的差异来源,表明语言结构塑造了会话的性质与质感,并对群体、组织、市场和社会的行为产生重要影响。 摘要:Language is the primary medium through which human information is communicated and coordination is achieved. One of the most important language functions is to categorize the world so messages can be communicated through conversation. While we know a great deal about how human languages vary in their encoding of information within semantic domains such as color, sound, number, locomotion, time, space, human activities, gender, body parts and biology, little is known about the global structure of semantic information and its effect on human communication. Using large-scale computation, artificial intelligence techniques, and massive, parallel corpora across 15 subject areas--including religion, economics, medicine, entertainment, politics, and technology--in 999 languages, here we show substantial variation in the information and semantic density of languages and their consequences for human communication and coordination. In contrast to prior work, we demonstrate that higher density languages communicate information much more quickly relative to lower density languages. Then, using over 9,000 real-life conversations across 14 languages and 90,000 Wikipedia articles across 140 languages, we show that because there are more ways to discuss any given topic in denser languages, conversations and articles retrace and cycle over a narrower conceptual terrain. These results demonstrate an important source of variation across the human communicative channel, suggesting that the structure of language shapes the nature and texture of conversation, with important consequences for the behavior of groups, organizations, markets, and societies.

【8】 ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction 标题:ErAConD:用于语法纠错的错误标注对话数据集 链接:https://arxiv.org/abs/2112.08466

作者:Xun Yuan,Derek Pham,Sam Davidson,Zhou Yu 摘要:目前可用的语法纠错(GEC)数据集都是基于规范书面文本构建的,这限制了它们在非正式写作、对话等其他领域的适用性。本文提出了一个新的并行GEC数据集,其数据取自开放域聊天机器人的对话;据我们所知,这是第一个面向对话场景的GEC数据集。为了证明该数据集的实用性,我们使用标注数据对最先进的GEC模型进行微调,使模型精确率提高了16个百分点。这对GEC模型尤为重要,因为在GEC任务中精确率通常被认为比召回率更重要:误报可能会给语言学习者带来严重困扰。我们还提出了一个详细的标注方案,按错误对可理解性的感知影响进行分级,使我们的数据集既可复现又可扩展。实验结果表明,我们的数据能有效提升GEC模型在对话场景下的性能。 摘要:Currently available grammatical error correction (GEC) datasets are compiled using well-formed written text, limiting the applicability of these datasets to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; this dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16 point increase in model precision. This is of particular importance in a GEC model, as model precision is considered more important than recall in GEC tasks since false positives could lead to serious confusion in language learners. We also present a detailed annotation scheme which ranks errors by perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance in conversational scenarios.
