Natural Language Processing Academic Digest [7.16]

WeChat Official Account: arXiv Daily Academic Digest

Published 2021-07-27 11:00:38



Included in the column: arXiv Daily Academic Digest

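Visit www.arxivdaily.com for digests that include abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, along with search, bookmarking, and posting features.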

cs.CL: 21 papers today

BERT (1 paper)

【1】 AutoBERT-Zero: Evolving BERT Backbone from Scratch

Authors: Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, Philip L. H. Yu, Xiaodan Liang, Xin Jiang, Zhenguo Li
Affiliations: The University of Hong Kong, Huawei Noah's Ark Lab, Hong Kong University of Science and Technology, Sun Yat-sen University
Comments: 9 pages
Link: https://arxiv.org/abs/2107.07445
Abstract: Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance in various natural language processing (NLP) tasks. However, the conventional paradigm constructs the backbone by purely stacking the manually designed global self-attention layers, introducing inductive bias and thus leading to sub-optimal. In this work, we propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures. Our well-designed search space (i) contains primitive math operations in the intra-layer level to explore novel attention structures, and (ii) leverages convolution blocks to be the supplementary for attention structure in the inter-layer level to better learn local dependency. We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS. Specifically, we propose Operation-Priority (OP) evolution strategy to facilitate model search via balancing exploration and exploitation. Furthermore, we design a Bi-branch Weight-Sharing (BIWS) training strategy for fast model evaluation. Extensive experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks, proving the architecture's transfer and generalization abilities. Remarkably, AutoBERT-Zero-base outperforms RoBERTa-base (using much more data) and BERT-large (with much larger model size) by 2.4 and 1.4 higher score on GLUE test set. Code and pre-trained models will be made publicly available.
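The abstract says the OP evolution strategy balances exploration and exploitation but does not give its scoring rule. As an illustration only, the sketch below ranks candidate operations with the standard UCB1 score (mean reward plus an exploration bonus). The operation names, the reward signal, and the choice of UCB1 itself are assumptions, not a description of OP-NAS.

```python
import math

def ucb_scores(rewards: dict, c: float = 1.0) -> dict:
    """UCB1 score per operation: mean reward + c * sqrt(2 ln N / n_op).

    `rewards` maps a (hypothetical) operation name to the rewards observed
    for candidate models that used it; N is the total number of observations.
    """
    total = sum(len(r) for r in rewards.values())
    scores = {}
    for op, r in rewards.items():
        if not r:
            scores[op] = math.inf  # untried operations are explored first
        else:
            mean = sum(r) / len(r)
            scores[op] = mean + c * math.sqrt(2 * math.log(total) / len(r))
    return scores

def pick_operation(rewards: dict, c: float = 1.0) -> str:
    """Return the operation with the highest UCB1 score."""
    scores = ucb_scores(rewards, c)
    return max(scores, key=scores.get)
```

With `c=0` the rule is pure exploitation (highest mean wins); larger `c` favors rarely tried operations.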

Machine Translation (1 paper)

【1】 FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

Authors: Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal
Affiliations: Facebook AI Research
Comments: Accepted by IWSLT 2021 as a system paper
Link: https://arxiv.org/abs/2107.06959
Abstract: In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.

Semantic Analysis (2 papers)

【1】 Tailor: Generating and Perturbing Text with Semantic Controls

Authors: Alexis Ross, Tongshuang Wu, Hao Peng, Matthew E. Peters, Matt Gardner
Affiliations: Allen Institute for Artificial Intelligence, Seattle, WA, USA; Paul G. Allen School of Computer Science and Engineering, University of Washington
Link: https://arxiv.org/abs/2107.07150
Abstract: Making controlled perturbations is essential for various tasks (e.g., data augmentation), but building task-specific generators can be expensive. We introduce Tailor, a task-agnostic generation system that perturbs text in a semantically-controlled way. With unlikelihood training, we design Tailor's generator to follow a series of control codes derived from semantic roles. Through modifications of these control codes, Tailor can produce fine-grained perturbations. We implement a set of operations on control codes that can be composed into complex perturbation strategies, and demonstrate their effectiveness in three distinct applications: First, Tailor facilitates the construction of high-quality contrast sets that are lexically diverse, and less biased than original task test data. Second, paired with automated labeling heuristics, Tailor helps improve model generalization through data augmentation: We obtain an average gain of 1.73 on an NLI challenge set by perturbing just 5% of training data. Third, without any finetuning overhead, Tailor's perturbations effectively improve compositionality in fine-grained style transfer, outperforming fine-tuned baselines on 6 transfers.
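To make "operations on control codes that can be composed" concrete, here is a minimal sketch over a hypothetical control-code record. The field names, the operation names, and the bracketed rendering format are assumptions for illustration; they are not Tailor's actual control-code format or API.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple

@dataclass(frozen=True)
class ControlCode:
    """Hypothetical control code: a predicate plus ordered semantic-role slots."""
    predicate: str
    voice: str               # "active" or "passive"
    tense: str               # e.g. "past", "present"
    roles: Tuple[str, ...]   # e.g. ("AGENT", "PATIENT")

Op = Callable[[ControlCode], ControlCode]

def change_tense(tense: str) -> Op:
    return lambda c: replace(c, tense=tense)

def change_voice(voice: str) -> Op:
    return lambda c: replace(c, voice=voice)

def swap_core_roles(c: ControlCode) -> ControlCode:
    """Swap the first two role slots (e.g. AGENT <-> PATIENT)."""
    if len(c.roles) < 2:
        return c
    return replace(c, roles=(c.roles[1], c.roles[0]) + c.roles[2:])

def compose(*ops: Op) -> Op:
    """Apply operations left to right to build a compound perturbation."""
    def run(c: ControlCode) -> ControlCode:
        for op in ops:
            c = op(c)
        return c
    return run

def render(c: ControlCode) -> str:
    """Serialize to a bracketed prefix a generator could condition on (format is hypothetical)."""
    return f"[VERB+{c.voice}+{c.tense}: {c.predicate} | " + " | ".join(c.roles) + "]"
```

Because the record is frozen, every operation returns a new code and the original stays available for building contrast pairs.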

【2】 Transition-based Bubble Parsing: Improvements on Coordination Structure Prediction

Authors: Tianze Shi, Lillian Lee
Affiliations: Cornell University
Link: https://arxiv.org/abs/2107.06905
Abstract: We propose a transition-based bubble parser to perform coordination structure identification and dependency-based syntactic analysis simultaneously. Bubble representations were proposed in the formal linguistics literature decades ago; they enhance dependency trees by encoding coordination boundaries and internal relationships within coordination structures explicitly. In this paper, we introduce a transition system and neural models for parsing these bubble-enhanced structures. Experimental results on the English Penn Treebank and the English GENIA corpus show that our parsers beat previous state-of-the-art approaches on the task of coordination structure prediction, especially for the subset of sentences with complex coordination structures.
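For readers unfamiliar with transition-based parsing, the sketch below applies a sequence of transitions from the classic arc-standard system (SHIFT, LEFT-ARC, RIGHT-ARC) to build dependency arcs. This is the generic textbook system, not the paper's bubble transition system, which adds transitions for coordination bubbles that are not described in the abstract.

```python
def apply_transitions(n_words: int, transitions: list) -> list:
    """Run arc-standard transitions over word indices 0..n_words-1.

    Returns a list of (head, dependent) arcs. A neural parser would predict
    each transition from the current stack/buffer state; here they are given.
    """
    stack, buffer, arcs = [], list(range(n_words)), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))        # move next word onto the stack
        elif t == "LEFT_ARC":
            dep = stack.pop(-2)                # second-from-top becomes a dependent
            arcs.append((stack[-1], dep))      # of the top word
        elif t == "RIGHT_ARC":
            dep = stack.pop()                  # top becomes a dependent
            arcs.append((stack[-1], dep))      # of the new top word
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs
```

For "She eats fish" (indices 0, 1, 2), the sequence SHIFT, SHIFT, LEFT_ARC, SHIFT, RIGHT_ARC attaches both "She" and "fish" to "eats".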

Graph | Knowledge Graph | Knowledge (2 papers)

【1】 Increasing Faithfulness in Knowledge-Grounded Dialogue with Controllable Features

Authors: Hannah Rashkin, David Reitter, Gaurav Singh Tomar, Dipanjan Das
Affiliations: Google Research, New York, NY
Comments: ACL 2021
Link: https://arxiv.org/abs/2107.06963
Abstract: Knowledge-grounded dialogue systems are intended to convey information that is based on evidence provided in a given source text. We discuss the challenges of training a generative neural dialogue model for such systems that is controlled to stay faithful to the evidence. Existing datasets contain a mix of conversational responses that are faithful to selected evidence as well as more subjective or chit-chat style responses. We propose different evaluation measures to disentangle these different styles of responses by quantifying the informativeness and objectivity. At training time, additional inputs based on these evaluation measures are given to the dialogue model. At generation time, these additional inputs act as stylistic controls that encourage the model to generate responses that are faithful to the provided evidence. We also investigate the usage of additional controls at decoding time using resampling techniques. In addition to automatic metrics, we perform a human evaluation study where raters judge the output of these controlled generation models to be generally more objective and faithful to the evidence compared to baseline dialogue systems.
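The training/generation asymmetry described above can be sketched as control tokens prepended to the model input. This is a minimal sketch assuming objectivity and informativeness scores in [0, 1] and a 0.5 threshold; the token strings and input layout are hypothetical, since the abstract does not say how the measures are discretized or encoded.

```python
def control_tokens(objectivity: float, informativeness: float, threshold: float = 0.5) -> list:
    """Map two scores in [0, 1] to discrete control tokens (token names are hypothetical)."""
    return [
        "<objective>" if objectivity >= threshold else "<subjective>",
        "<high-info>" if informativeness >= threshold else "<low-info>",
    ]

def training_input(evidence: str, history: str, objectivity: float, informativeness: float) -> str:
    # At training time the tokens describe the measured style of the gold response.
    tokens = control_tokens(objectivity, informativeness)
    return " ".join(tokens + ["<evidence>", evidence, "<history>", history])

def generation_input(evidence: str, history: str) -> str:
    # At generation time we always request the faithful, informative style.
    return training_input(evidence, history, objectivity=1.0, informativeness=1.0)
```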

【2】 TGIF: Tree-Graph Integrated-Format Parser for Enhanced UD with Two-Stage Generic- to Individual-Language Finetuning

Authors: Tianze Shi, Lillian Lee
Affiliations: Cornell University
Link: https://arxiv.org/abs/2107.06907
Abstract: We present our contribution to the IWPT 2021 shared task on parsing into enhanced Universal Dependencies. Our main system component is a hybrid tree-graph parser that integrates (a) predictions of spanning trees for the enhanced graphs with (b) additional graph edges not present in the spanning trees. We also adopt a finetuning strategy where we first train a language-generic parser on the concatenation of data from all available languages, and then, in a second step, finetune on each individual language separately. Additionally, we develop our own complete set of pre-processing modules relevant to the shared task, including tokenization, sentence segmentation, and multiword token expansion, based on pre-trained XLM-R models and our own pre-training of character-level language models. Our submission reaches a macro-average ELAS of 89.24 on the test set. It ranks top among all teams, with a margin of more than 2 absolute ELAS over the next best-performing submission, and best score on 16 out of 17 languages.
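"Macro-average ELAS" means the per-language scores are averaged with equal weight, so a small-treebank language counts as much as a large one. The sketch contrasts that with a size-weighted mean; the language codes and numbers in the test are toy values, not results from the paper.

```python
def macro_average(scores: dict) -> float:
    """Unweighted mean over languages: each language counts equally."""
    if not scores:
        raise ValueError("no scores")
    return sum(scores.values()) / len(scores)

def size_weighted_average(scores: dict, sizes: dict) -> float:
    """Mean weighted by test-set size, for comparison with the macro average."""
    total = sum(sizes[lang] for lang in scores)
    return sum(scores[lang] * sizes[lang] for lang in scores) / total
```

With toy scores 80.0 and 90.0 on test sets of 100 and 300 units, the macro average is 85.0 while the size-weighted average is 87.5.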

Reasoning | Analysis | Understanding | Interpretation (2 papers)

【1】 Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills

Authors: Ori Yoran, Alon Talmor, Jonathan Berant
Affiliations: Tel-Aviv University, The Allen Institute for AI
Link: https://arxiv.org/abs/2107.07261
Abstract: Models pre-trained with a language modeling objective possess ample world knowledge and language skills, but are known to struggle in tasks that require reasoning. In this work, we propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs, where answering the question requires reasoning over multiple facts in the paragraph. We add a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills such as number comparison, conjunction, and fact composition. To improve data efficiency, we propose sampling strategies that focus training on reasoning skills the model is currently lacking. We evaluate our approach on three reading comprehension datasets that are focused on reasoning, and show that our model, PReasM, substantially outperforms T5, a popular pre-trained encoder-decoder model. Moreover, sampling examples based on current model errors leads to faster training and higher overall performance.
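One simple way to "focus training on reasoning skills the model is currently lacking" is to sample skills in proportion to their current error rate. This is a minimal sketch of that idea under assumed inputs (per-skill error rates measured on held-out examples); it is not necessarily the exact sampling strategy used for PReasM. A small floor keeps skills the model has mastered from being dropped entirely.

```python
import random

def skill_weights(error_rates: dict, floor: float = 0.01) -> dict:
    """Normalize per-skill error rates into sampling probabilities (with a floor)."""
    raw = {skill: max(err, floor) for skill, err in error_rates.items()}
    z = sum(raw.values())
    return {skill: w / z for skill, w in raw.items()}

def sample_skills(error_rates: dict, k: int, seed: int = 0) -> list:
    """Draw k skills for the next training batch, favoring high-error skills."""
    rng = random.Random(seed)
    weights = skill_weights(error_rates)
    skills = list(weights)
    return rng.choices(skills, weights=[weights[s] for s in skills], k=k)
```

After each evaluation round the error rates would be re-measured, so the sampling distribution tracks the model's changing weaknesses.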

【2】 Annotation and Classification of Evidence and Reasoning Revisions in Argumentative Writing

Authors: Tazin Afrin, Elaine Wang, Diane Litman, Lindsay C. Matsumura, Richard Correnti
Affiliations: Learning Research and Development Center, University of Pittsburgh, Pittsburgh, Pennsylvania
Comments: 10 pages, 11 tables, 15th Workshop on Innovative Use of NLP for Building Educational Applications
Link: https://arxiv.org/abs/2107.06990
Abstract: Automated writing evaluation systems can improve students' writing insofar as students attend to the feedback provided and revise their essay drafts in ways aligned with such feedback. Existing research on revision of argumentative writing in such systems, however, has focused on the types of revisions students make (e.g., surface vs. content) rather than the extent to which revisions actually respond to the feedback provided and improve the essay. We introduce an annotation scheme to capture the nature of sentence-level revisions of evidence use and reasoning (the 'RER' scheme) and apply it to 5th- and 6th-grade students' argumentative essays. We show that reliable manual annotation can be achieved and that revision annotations correlate with a holistic assessment of essay improvement in line with the feedback provided. Furthermore, we explore the feasibility of automatically classifying revisions according to our scheme.
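Claims of "reliable manual annotation" are usually backed by a chance-corrected agreement statistic. Cohen's kappa for two annotators is sketched below as one common choice; the abstract does not state which statistic the authors report, and the labels in the test are hypothetical.

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label independently.
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)
```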

Detection (1 paper)

【1】 Multi-Task Learning based Online Dialogic Instruction Detection with Pre-trained Language Models

Authors: Yang Hao, Hang Li, Wenbiao Ding, Zhongqin Wu, Jiliang Tang, Rose Luckin, Zitao Liu
Affiliations: TAL Education Group, Beijing, China; Data Science and Engineering Lab, Michigan State University, USA; UCL Knowledge Lab, London, UK
Comments: AIED'21: The 22nd International Conference on Artificial Intelligence in Education, 2021
Link: https://arxiv.org/abs/2107.07119
Abstract: In this work, we study computational approaches to detect online dialogic instructions, which are widely used to help students understand learning materials, and build effective study habits. This task is rather challenging due to the widely-varying quality and pedagogical styles of dialogic instructions. To address these challenges, we utilize pre-trained language models, and propose a multi-task paradigm which enhances the ability to distinguish instances of different classes by enlarging the margin between categories via contrastive loss. Furthermore, we design a strategy to fully exploit the misclassified examples during the training stage. Extensive experiments on a real-world online educational data set demonstrate that our approach achieves superior performance compared to representative baselines. To encourage reproducible results, we make our implementation online available at https://github.com/AIED2021/multitask-dialogic-instruction.
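"Enlarging the margin between categories via contrastive loss" can be illustrated with the classic pairwise contrastive loss: same-class pairs are pulled together, and different-class pairs are penalized only while they are closer than a margin. This is a minimal pure-Python sketch of that textbook form; the paper's exact loss and how it is weighted against the classification loss are not given in the abstract.

```python
import math

def euclidean(u, v) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pairwise_contrastive_loss(embeddings, labels, margin: float = 1.0) -> float:
    """Average over all pairs: d^2 for same-class pairs, max(0, margin - d)^2 otherwise."""
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = euclidean(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += d ** 2                       # pull same-class pairs together
            else:
                total += max(0.0, margin - d) ** 2    # push different classes beyond the margin
            pairs += 1
    return total / pairs if pairs else 0.0
```

In a multi-task setup this term would typically be added to the cross-entropy loss with a weighting coefficient (an assumption here).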

Recognition / Classification (1 paper)

【1】 Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Authors: Guowei Xu, Wenbiao Ding, Weiping Fu, Zhongqin Wu, Zitao Liu
Affiliations: TAL Education Group, Beijing, China
Comments: ECML-PKDD'21: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2021
Link: https://arxiv.org/abs/2107.07113
Abstract: Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors and inputs to downstream NLP models become noisy. Despite that pre-trained models achieve state-of-the-art performance in many NLP benchmarks, we prove that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. In order to improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases there are only labelled clean texts. Since there is no handwritten pictures corresponding to the text, it is impossible to directly use the recognition model to obtain noisy labelled data. Human resources can be employed to copy texts and take pictures, but it is extremely expensive considering the size of data for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low resource manner. We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noises from clean texts and 2) iteratively mines the hard examples from a large number of simulated samples for optimal performance. 3) To make our model learn noise-invariant representations, a stability loss is employed. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can greatly promote the application of NLP models in actual scenarios, although the algorithm we use is simple and straightforward. We make our codes and three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
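The three components listed in the abstract can be sketched as small functions: character-level noise simulation, loss-based hard example mining, and a stability term. Everything here is an illustrative assumption: the confusion table is hypothetical, the noise model (deletion, substitution, duplication with equal probability) is a simple stand-in for the paper's "multi-source" simulation, and KL divergence is only one possible choice of stability loss.

```python
import math
import random
from typing import Optional

# Hypothetical table of visually confusable characters (not taken from the paper).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "m": "rn", "e": "c"}

def simulate_ocr_noise(text: str, p: float = 0.1, seed: Optional[int] = None) -> str:
    """Corrupt each character with probability p: deletion, substitution, or insertion (equally likely)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p / 3:
            continue                               # deletion
        if r < 2 * p / 3:
            out.append(CONFUSIONS.get(ch, ch))     # substitution (unchanged if no confusable)
        elif r < p:
            out.append(ch * 2)                     # insertion: duplicate the character
        else:
            out.append(ch)                         # keep unchanged
    return "".join(out)

def mine_hard_examples(samples, loss_fn, k: int):
    """Keep the k simulated samples on which the current model has the highest loss."""
    return sorted(samples, key=loss_fn, reverse=True)[:k]

def stability_loss(p_clean, p_noisy, eps: float = 1e-12) -> float:
    """KL(p_clean || p_noisy) between class distributions predicted on clean and noisy text."""
    return sum(pc * math.log((pc + eps) / (pn + eps)) for pc, pn in zip(p_clean, p_noisy))
```

In an iterative loop, many noisy copies would be generated per clean example, the current model's loss would rank them, and only the hardest would be added to the next training round.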

Zero/Few/One-Shot | Transfer | Adaptation (2 papers)

【1】 FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark

Authors: Liang Xu, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Hu Yuan, Huilin Xu, Guoao Wei, Xiang Pan, Hai Hu
Affiliations: CLUE team
Comments: Work in Progress; 8 pages, 3 tables
Link: https://arxiv.org/abs/2107.07498
Abstract: Pretrained Language Models (PLMs) have achieved tremendous success in natural language understanding tasks. While different learning schemes -- fine-tuning, zero-shot and few-shot learning -- have been widely explored and compared for languages such as English, there is comparatively little work in Chinese to fairly and comprehensively evaluate and compare these methods. This work first introduces Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive small sample evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification tasks to machine reading comprehension tasks. Given the high variance of the few-shot learning performance, we provide multiple training/validation sets to facilitate a more accurate and stable evaluation of few-shot modeling. An unlabeled training set with up to 20,000 additional samples per task is provided, allowing researchers to explore better ways of using unlabeled samples. Next, we implement a set of state-of-the-art (SOTA) few-shot learning methods (including PET, ADAPET, LM-BFF, P-tuning and EFL), and compare their performance with fine-tuning and zero-shot learning schemes on the newly constructed FewCLUE benchmark. Our results show that: 1) all five few-shot learning methods exhibit better performance than fine-tuning or zero-shot learning; 2) among the five methods, PET is the best performing few-shot method; 3) few-shot learning performance is highly dependent on the specific task. Our benchmark and code are available at https://github.com/CLUEbenchmark/FewCLUE
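The point of providing multiple training/validation sets is to report a score together with its spread rather than a single, high-variance number. A minimal sketch, assuming a caller-supplied `train_and_eval` function (hypothetical) that trains on one split and returns a score:

```python
import statistics
from typing import Callable, Sequence, Tuple

def evaluate_over_splits(splits: Sequence, train_and_eval: Callable) -> Tuple[float, float]:
    """Run one train/eval cycle per split; return the mean and sample standard deviation."""
    scores = [train_and_eval(split) for split in splits]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

Reporting "mean ± std" across splits makes it visible when a gap between two few-shot methods is smaller than the split-to-split variation.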

【2】 FLEX: Unifying Evaluation for Few-Shot NLP

Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
Affiliations: Allen Institute for AI, Seattle, WA
Comments: First two authors contributed equally. Code and leaderboard available at: this https URL
Link: https://arxiv.org/abs/2107.07170
Abstract: Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. We formulate desiderata for an ideal few-shot NLP benchmark and present FLEX, the first benchmark, public leaderboard, and framework that provides unified, comprehensive measurement for few-shot NLP techniques. FLEX incorporates and introduces new best practices for few-shot evaluation, including measurement of four transfer settings, textual labels for zero-shot evaluation, and a principled approach to benchmark design that optimizes statistical accuracy while keeping evaluation costs accessible to researchers without large compute resources. In addition, we present UniFew, a simple yet strong prompt-based model for few-shot learning which unifies the pretraining and finetuning prompt formats, eschewing complex machinery of recent prompt-based approaches in adapting downstream task formats to language model pretraining objectives. We demonstrate that despite simplicity UniFew achieves results competitive with both popular meta-learning and prompt-based approaches.

Representation (2 papers)

【1】 MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, Louis-Philippe Morency
Affiliations: CMU, Johns Hopkins, Northeastern, Stanford, UT Austin
Comments: Code: this https URL and Website: this https URL
Link: https://arxiv.org/abs/2107.07502
Abstract: Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiBench, a systematic and unified large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MultiBench provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MultiBench offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning. Simply applying methods proposed in different research areas can improve the state-of-the-art performance on 9/15 datasets. Therefore, MultiBench presents a milestone in unifying disjoint efforts in multimodal research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MultiBench, our standardized code, and leaderboards are publicly available, will be regularly updated, and welcomes inputs from the community.

【2】 CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Authors: Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan
Affiliations: Thoughtworks, Ekstep Foundation
Comments: 7 pages, 2 figures
Link: https://arxiv.org/abs/2107.07402
Abstract: We present a CLSRIL-23, a self supervised learning based audio pre-trained model which learns cross lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0 which is solved by training a contrastive task over masked latent speech representations and jointly learns the quantization of latents shared across all languages. We compare the language wise loss during pretraining to compare effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared and our experiments show that multilingual pretraining outperforms monolingual training, in terms of learning speech representations which encodes phonetic similarity of languages and also in terms of performance on down stream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for finetuning in Hindi. All the code models are also open sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state of the art systems will be created using the self supervised approach, especially for low resources Indic languages.
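WER and CER are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below is the standard Levenshtein-based definition; it does not apply any text normalization, and it counts spaces as characters for CER, which some toolkits strip. The abstract does not say whether its 5% and 9.5% decreases are absolute or relative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance (substitution, insertion, deletion each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("empty reference")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("empty reference")
    return edit_distance(reference, hypothesis) / len(reference)
```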

Word2Vec | Text | Words (1 paper)

【1】 HTLM: Hyper-Text Pre-Training and Prompting of Language Models

Authors: Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer
Affiliations: Facebook AI, University of Washington
Link: https://arxiv.org/abs/2107.06955
Abstract: We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
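The zero-shot summarization example in the abstract (infilling the title tag of a page that contains the input text) amounts to building an HTML prompt with a masked title. A minimal sketch, assuming a BART-style `<mask>` token and a very simple page layout; HTLM's actual simplified-HTML format and prompt templates may differ.

```python
import html

MASK = "<mask>"  # assumed BART-style mask token

def summarization_prompt(document_text: str) -> str:
    """Wrap the input text in a page whose <title> is masked for the model to infill."""
    body = html.escape(document_text)  # keep the input from being parsed as markup
    return (f"<html><head><title>{MASK}</title></head>"
            f"<body><p>{body}</p></body></html>")
```

The model's infilled title would then be read off as the summary.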

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 Spanish Language Models

Authors: Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Marta Villegas
Affiliations: Text Mining Unit, Barcelona Supercomputing Center
Link: https://arxiv.org/abs/2107.07253
Abstract: This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019.
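The abstract mentions deduplicated text but not how deduplication was done. As a minimal sketch of the simplest variant, exact deduplication after light normalization (Unicode NFKC, lowercasing, whitespace collapsing) keeps the first occurrence of each normalized document. Near-duplicate detection, which large crawl pipelines often also need, is not covered here.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())

def deduplicate(docs: list) -> list:
    """Keep the first document for each normalized-text hash, preserving order."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```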

【2】 Solving ESL Sentence Completion Questions via Pre-trained Neural Language Models

Authors: Qiongqiong Liu, Tianqiao Liu, Jiafu Zhao, Qiang Fang, Wenbiao Ding, Zhongqin Wu, Feng Xia, Jiliang Tang, Zitao Liu
Affiliations: TAL Education Group, Beijing, China; Data Science and Engineering Lab, Michigan State University, USA; Federation University Australia, Australia
Comments: AIED'21: The 22nd International Conference on Artificial Intelligence in Education, 2021
Link: https://arxiv.org/abs/2107.07122
Abstract: Sentence completion (SC) questions present a sentence with one or more blanks that need to be filled in, three to five possible words or phrases as options. SC questions are widely used for students learning English as a Second Language (ESL) and building computational approaches to automatically solve such questions is beneficial to language learners. In this work, we propose a neural framework to solve SC questions in English examinations by utilizing pre-trained language models. We conduct extensive experiments on a real-world K-12 ESL SC question dataset and the results demonstrate the superiority of our model in terms of prediction accuracy. Furthermore, we run precision-recall trade-off analysis to discuss the practical issues when deploying it in real-life scenarios. To encourage reproducible results, we make our code publicly available at https://github.com/AIED2021/ESL-SentenceCompletion.
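A common way to apply a pre-trained LM to single-blank SC questions is to fill in each option, score the completed sentence, and pick the best; abstaining when the top two scores are close gives a knob for the precision-recall trade-off mentioned above. This is a minimal sketch under those assumptions: `score` is a caller-supplied function (in practice it might be an LM log-likelihood), the `___` blank marker is hypothetical, and the paper's actual framework may differ.

```python
from typing import Callable, List, Optional

def rank_options(sentence: str, options: List[str], score: Callable[[str], float], blank: str = "___"):
    """Fill the blank with each option and sort options by the score of the completed sentence."""
    scores = {opt: score(sentence.replace(blank, opt, 1)) for opt in options}
    return sorted(options, key=scores.get, reverse=True), scores

def answer_or_abstain(sentence: str, options: List[str], score: Callable[[str], float],
                      threshold: float, blank: str = "___") -> Optional[str]:
    """Return the best option, or None when its lead over the runner-up is below threshold."""
    ranked, scores = rank_options(sentence, options, score, blank)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else float("inf")
    return ranked[0] if margin >= threshold else None
```

Raising `threshold` answers fewer questions but, if the scorer is well calibrated, answers them more precisely.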

Other (4 papers)

【1】 Wordcraft: a Human-AI Collaborative Editor for Story Writing

Authors: Andy Coenen, Luke Davis, Daphne Ippolito, Emily Reif, Ann Yuan Affiliation: Google Research Link: https://arxiv.org/abs/2107.07430 Abstract: As neural language models grow in effectiveness, they are increasingly being applied in real-world settings. However, these applications tend to be limited in the modes of interaction they support. In this extended abstract, we propose Wordcraft, an AI-assisted editor for story writing in which a writer and a dialog system collaborate to write a story. Our novel interface uses few-shot learning and the natural affordances of conversation to support a variety of interactions. Our editor provides a sandbox for writers to probe the boundaries of transformer-based language models and paves the way for future human-in-the-loop training pipelines and novel evaluation methods.
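The abstract mentions few-shot learning but does not show Wordcraft's prompts. As an illustration only, the sketch below assembles a generic few-shot prompt for an editing request from a handful of worked examples, which a language model would then complete after the final `Output:`. The `Example` fields, the `Text:/Instruction:/Output:` layout and `build_prompt` are hypothetical, not Wordcraft's actual format, and the model call is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    # One hypothetical worked example shown to the model.
    instruction: str
    text: str
    output: str


def build_prompt(examples: List[Example], instruction: str, text: str) -> str:
    """Concatenate worked examples, then the new request with an open Output slot."""
    parts = [
        f"Text: {ex.text}\nInstruction: {ex.instruction}\nOutput: {ex.output}\n"
        for ex in examples
    ]
    # The final block is left open so the model's continuation is the answer.
    parts.append(f"Text: {text}\nInstruction: {instruction}\nOutput:")
    return "\n".join(parts)


shots = [Example("Rewrite to be funnier.", "The cat sat.", "The cat sat, judging everyone.")]
print(build_prompt(shots, "Rewrite to be scarier.", "The door opened."))
```

Keeping the examples and the request in one fixed layout is what lets a single model serve several kinds of edits by swapping only the instruction and the examples.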

【2】 Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method

Authors: Candy Olivia Mawalim, Masashi Unoki Affiliation: Japan Advanced Institute of Science and Technology, Asahidai, Nomi, Ishikawa, Japan Link: https://arxiv.org/abs/2107.07223 Abstract: Speaker anonymization aims to suppress speaker individuality to protect privacy in speech while preserving the other aspects, such as speech content. One effective solution for anonymization is to modify the McAdams coefficient. In this work, we propose a method to improve the security for speaker anonymization based on the McAdams coefficient by using a speech watermarking approach. The proposed method consists of two main processes: one for embedding and one for detection. In the embedding process, two different McAdams coefficients represent binary bits "0" and "1". The watermarked speech is then obtained by frame-by-frame bit inverse switching. Subsequently, the detection process is carried out by a power spectrum comparison. We conducted objective evaluations with reference to the VoicePrivacy 2020 Challenge (VP2020) and of the speech watermarking with reference to the Information Hiding Challenge (IHC) and found that our method could satisfy the blind detection, inaudibility, and robustness requirements in watermarking. It also significantly improved the anonymization performance in comparison to the secondary baseline system in VP2020.
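For context, the McAdams transformation commonly used in such anonymization systems moves each complex pole of a frame's LPC filter from angle φ to φ^α, where α is the McAdams coefficient, while keeping the pole's radius. The Python sketch below shows that mapping plus a per-frame choice of α from the bit being embedded. The `ALPHA_FOR_BIT` values are hypothetical, the poles are taken as given (in practice they would come from LPC analysis of each frame), and the paper's bit inverse switching, resynthesis and power-spectrum detection are not reproduced.

```python
import cmath
import math
from typing import Dict, List, Sequence

# Hypothetical coefficient pair; the abstract only says two different
# McAdams coefficients encode bits 0 and 1.
ALPHA_FOR_BIT: Dict[int, float] = {0: 0.8, 1: 0.7}


def mcadams_shift(poles: Sequence[complex], alpha: float) -> List[complex]:
    """Map each complex pole angle phi to phi**alpha, keeping its radius.

    Real poles are left unchanged. The angle's sign is preserved so conjugate
    pairs stay conjugate and the resulting filter stays real-valued.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha is assumed to lie in (0, 1] so angles stay below pi")
    out = []
    for p in poles:
        if abs(p.imag) < 1e-12:
            out.append(p)  # real pole: no formant angle to move
            continue
        r, phi = abs(p), cmath.phase(p)
        new_phi = math.copysign(abs(phi) ** alpha, phi)
        out.append(cmath.rect(r, new_phi))
    return out


def poly_from_roots(roots: Sequence[complex]) -> List[complex]:
    """Expand prod(z - r) into polynomial coefficients, highest degree first."""
    coeffs = [1 + 0j]
    for r in roots:
        nxt = coeffs + [0j]
        for i, c in enumerate(coeffs):
            nxt[i + 1] -= r * c  # multiply the running product by (z - r)
        coeffs = nxt
    return coeffs


def embed_bits(frame_poles: Sequence[Sequence[complex]],
               bits: Sequence[int]) -> List[List[complex]]:
    """Shift each frame's poles with the coefficient assigned to its bit."""
    return [mcadams_shift(poles, ALPHA_FOR_BIT[b]) for poles, b in zip(frame_poles, bits)]
```

Because conjugate pairs are shifted symmetrically, `poly_from_roots` on the shifted poles yields (numerically) real coefficients, so the modified frame can still be filtered as an ordinary all-pole LPC filter.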

【3】 From Show to Tell: A Survey on Image Captioning

Authors: Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara Affiliation: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia Link: https://arxiv.org/abs/2107.06912 Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, in the last few years, a large research effort has been devoted to image captioning, i.e. the task of describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoding step and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, and relationships and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results obtained, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview and categorization of image captioning approaches, from visual encoding and text generation to training strategies, used datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in image captioning architectures and training strategies. Moreover, many variants of the problem and its open challenges are analyzed and discussed. The final goal of this work is to serve as a tool for understanding the existing state-of-the-art and highlighting the future directions for an area of research where Computer Vision and Natural Language Processing can find an optimal synergy.

【4】 VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Authors: Hirofumi Inaguma, Tatsuya Kawahara Affiliation: Graduate School of Informatics, Kyoto University, Kyoto, Japan Comments: Accepted at Interspeech 2021 Link: https://arxiv.org/abs/2107.07509 Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
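One plausible reading of the VAD-free reset step, sketched under assumptions rather than taken from the paper: scan per-frame CTC blank posteriors and mark a reset point once a long enough run of confidently blank frames (treated as a pause) has been seen, at which point a streaming decoder would clear its state. `find_reset_points`, `threshold` and `min_run` are hypothetical names and hyperparameters, and the paper's exact criterion may differ.

```python
from typing import Iterable, List


def find_reset_points(blank_probs: Iterable[float], threshold: float = 0.99,
                      min_run: int = 50) -> List[int]:
    """Return frame indices at which to reset the decoder state.

    A reset fires when `min_run` consecutive frames have a CTC blank posterior
    >= `threshold`; the run counter then starts over, so a very long pause can
    trigger several resets.
    """
    resets: List[int] = []
    run = 0
    for t, p in enumerate(blank_probs):
        if p >= threshold:
            run += 1
            if run == min_run:
                resets.append(t)
                run = 0
        else:
            run = 0  # a non-blank frame breaks the pause
    return resets


probs = [0.1] * 5 + [0.995] * 3 + [0.2] + [0.999] * 3
print(find_reset_points(probs, threshold=0.99, min_run=3))  # → [7, 11]
```

Working from CTC posteriors rather than a separate VAD model means the segmentation signal comes from the recognizer itself, which is the motivation the abstract gives for the VAD-free design.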

This article is shared from a WeChat official account as part of the Tencent Cloud self-media syndication program.

Originally published: 2021-07-16.
