
自然语言处理学术速递[6.18]

公众号-arXiv每日学术速递
发布2021-07-02 19:05:28

访问www.arxivdaily.com获取含摘要速递,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏、发帖等功能!点击阅读原文即可访问

cs.CL 方向,今日共计32篇

Transformer(2篇)

【1】 Multi-head or Single-head? An Empirical Comparison for Transformer Training 标题:多头的还是单头的?Transformer训练的实证比较

作者:Liyuan Liu,Jialu Liu,Jiawei Han 机构:University of Illinois at Urbana-Champaign, Google Research 备注:Work in progress 链接:https://arxiv.org/abs/2106.09650 摘要:多头注意在Transformer模型最近的成功中起着至关重要的作用,在各种应用中带来了相对于传统注意的一致性能提升。人们普遍认为,这种有效性源于联合关注多个位置的能力。在本文中,我们首先证明了联合关注多个位置并不是多头注意的独有特征,因为多层单头注意同样可以关注多个位置,并且更有效。然后,我们认为多头注意的主要优点在于训练稳定性:在关注相同数量的位置时,它比单头注意所需的层数更少。例如,24层16头的Transformer(BERT-large)和384层单头Transformer具有相同的总注意头数和大致相同的模型尺寸,而多头Transformer明显更浅。同时,我们证明,借助近年来深度学习的进展,我们可以成功地稳定384层Transformer的训练。由于训练难度不再是瓶颈,明显更深的单头Transformer在不调整超参数的情况下实现了一致的性能提升。 摘要:Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability of jointly attending multiple positions. In this paper, we first demonstrate that jointly attending multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends multiple positions and is more effective. Then, we suggest the main advantage of the multi-head attention is the training stability, since it has less number of layers than the single-head attention, when attending the same number of positions. For example, 24-layer 16-head Transformer (BERT-large) and 384-layer single-head Transformer has the same total attention head number and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
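下面用几行 Python 对摘要中的头数对应关系做一个简单核算(数字取自摘要,仅作示意):

```python
# 注意头总数 = 层数 × 每层头数
bert_large_heads = 24 * 16      # 24 层、每层 16 头的多头 Transformer(BERT-large)
single_head_heads = 384 * 1     # 384 层、每层 1 头的单头 Transformer

# 两种配置的注意头总数相同(均为 384),但单头模型的深度是前者的 16 倍
assert bert_large_heads == single_head_heads == 384
```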

【2】 Probing Image-Language Transformers for Verb Understanding 标题:探索用于动词理解的图像语言转换器

作者:Lisa Anne Hendricks,Aida Nematzadeh 机构:DeepMind 链接:https://arxiv.org/abs/2106.09141 摘要:多模态图像语言转换器在依赖于微调的各种任务(例如,视觉问答和图像检索)上取得了令人印象深刻的结果。我们感兴趣的是揭示它们的预训练表征的质量——特别是,如果这些模型能够区分不同类型的动词,或者如果它们仅仅依赖于给定句子中的名词。为了做到这一点,我们收集了一个由421个动词组成的图像句子对(英语)数据集,这些动词要么是视觉的,要么是在训练前的数据中常见的(即概念性字幕数据集)。我们使用这个数据集来评估预训练的图像语言转换器,发现它们在需要动词理解的情况下比其他词类更失败。我们还调查了哪些类别的动词特别具有挑战性。 摘要:Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, if these models can distinguish different types of verbs or if they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate what category of verbs are particularly challenging.

机器翻译(1篇)

【1】 Lost in Interpreting: Speech Translation from Source or Interpreter? 标题:口译中的迷失:来自原文还是口译员的演讲翻译?

作者:Dominik Macháček,Matúš Žilinec,Ondřej Bojar 机构:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics 备注:to be published at INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.09343 摘要:口译员为多语种会议提供便利,但实际可提供的语种往往少于所需的语种。自动同声翻译可以扩展所提供的语言集。我们研究这样的自动系统是应该跟随原始说话人,还是跟随口译员,以增加延迟为代价换取更好的翻译质量。为了回答这个问题,我们发布了Europarl同声传译语料库(ESIC),包含10小时的欧洲议会英语演讲录音与文字记录,并附有捷克语和德语的同声传译。我们评估了从英语到捷克语的基于说话人和基于口译员的口语翻译系统的质量和延迟。我们研究了人类口译员与一个经过训练、可在一定程度上缩短输出的机器翻译系统相比,在隐式简化和概括方面的差异。最后,我们进行人工评估来衡量每种方法的信息损失。 摘要:Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed. Automatic simultaneous speech translation can extend the set of provided languages. We investigate if such an automatic system should rather follow the original speaker, or an interpreter to achieve better translation quality at the cost of increased delay. To answer the question, we release Europarl Simultaneous Interpreting Corpus (ESIC), 10 hours of recordings and transcripts of European Parliament speeches in English, with simultaneous interpreting into Czech and German. We evaluate quality and latency of speaker-based and interpreter-based spoken translation systems from English to Czech. We study the differences in implicit simplification and summarization of the human interpreter compared to a machine translation system trained to shorten the output to some extent. Finally, we perform human evaluation to measure information loss of each of these approaches.

语义分析(1篇)

【1】 End-to-End Cross-Domain Text-to-SQL Semantic Parsing with Auxiliary Task 标题:具有辅助任务的端到端跨域文本到SQL语义解析

作者:Peng Shi,Tao Yu,Patrick Ng,Zhiguo Wang 机构:David R. Cheriton School of Computer Science, University of Waterloo, Department of Computer Science, The University of Hong Kong, AWS AI Labs 链接:https://arxiv.org/abs/2106.09588 摘要:在这项工作中,我们主要研究跨域文本到SQL语义解析任务中的两个关键组件:模式链接和值填充。为了鼓励模型学习更好的编码能力,我们提出了一个列选择辅助任务,通过显式学习目标赋予编码器相关性匹配能力。此外,考虑到现有解析器大多忽略了合成SQL中的值填充,我们提出了两种值填充方法,在现有的零样本语义解析器与实际应用之间架起桥梁。通过在Spider上的实验,我们提出的框架在数据库内容不可用的情况下,在执行精度和精确集合匹配精度上都优于基线,详细的分析也为后续工作提供了启示。 摘要:In this work, we focus on two crucial components in the cross-domain text-to-SQL semantic parsing task: schema linking and value filling. To encourage the model to learn better encoding ability, we propose a column selection auxiliary task to empower the encoder with the relevance matching capability by using explicit learning targets. Furthermore, we propose two value filling methods to build the bridge from the existing zero-shot semantic parsers to real-world applications, considering most of the existing parsers ignore the values filling in the synthesized SQL. With experiments on Spider, our proposed framework improves over the baselines on the execution accuracy and exact set match accuracy when database contents are unavailable, and detailed analysis sheds light on future work.

Graph|知识图谱|Knowledge(5篇)

【1】 Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study 标题:生物医学知识库完善的科学语言模型:一项实证研究

作者:Rahul Nadkarni,David Wadden,Iz Beltagy,Noah A. Smith,Hannaneh Hajishirzi,Tom Hope 机构:Paul G. Allen School for Computer Science & Engineering, University of Washington, Allen Institute for Artificial Intelligence (AI,) 链接:https://arxiv.org/abs/2106.09700 摘要:生物医学知识图(KG)包含了关于疾病、药物和基因等实体的丰富信息。预测这些图表中缺失的链接可以促进许多重要的应用,例如药物设计和重新调整用途。最近的工作表明,通用领域语言模型(LMs)可以充当“软”KG,并且可以针对KG完成的任务进行微调。在这项工作中,我们研究了KG完成的科学LMs,探索是否可以挖掘他们的潜在知识来加强生物医学链接预测。我们评估了几个领域特定的LMs,在以KG表示的药物和疾病为中心的数据集上对它们进行微调,并用文本实体描述来丰富它们。我们将基于LM的模型与KG嵌入模型集成,使用一种路由器方法,该方法学习将每个输入示例分配给任一类型的模型,并提供性能上的显著提升。最后,我们证明了LM模型在具有新颖科学实体的归纳环境中的优势。我们的数据集和代码是公开的。 摘要:Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.

【2】 Learning Knowledge Graph-based World Models of Textual Environments 标题:学习基于知识图的文本环境世界模型

作者:Prithviraj Ammanabrolu,Mark O. Riedl 机构:School of Interactive Computing, Georgia Institute of Technology 备注:Preprint. Under review 链接:https://arxiv.org/abs/2106.09608 摘要:世界模型提高了学习代理在交互和情境环境中高效运行的能力。这项工作的重点是建立基于文本的游戏环境的世界模型的任务。基于文本的游戏,或称互动叙事,是一种强化学习环境,在这种环境中,主体使用文本自然语言感知并与世界互动。这些环境包含了漫长的、多步骤的谜题或任务,它们编织在一个充满数百个角色、地点和对象的世界中。我们的世界模型同时学习:(1)将世界表示为知识图时,预测由代理行为引起的世界变化;以及(2)生成在世界上操作所需的一组上下文相关的自然语言动作。通过利用知识图和动作的内在结构,我们将此任务定义为一组序列生成问题,并引入基于变换器的多任务体系结构和损失函数来训练它。一项对从未见过的文本世界的Zero-Shot消融研究表明,我们的方法显著优于现有的文本世界建模技术以及我们的每一项贡献的重要性。 摘要:World models improve a learning agent's ability to efficiently operate in interactive and situated environments. This work focuses on the task of building world models of text-based game environments. Text-based games, or interactive narratives, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Our world model learns to simultaneously: (1) predict changes in the world caused by an agent's actions when representing the world as a knowledge graph; and (2) generate the set of contextually relevant natural language actions required to operate in the world. We frame this task as a Set of Sequences generation problem by exploiting the inherent structure of knowledge graphs and actions and introduce both a transformer-based multi-task architecture and a loss function to train it. A zero-shot ablation study on never-before-seen textual worlds shows that our methodology significantly outperforms existing textual world modeling techniques as well as the importance of each of our contributions.

【3】 Classifying vaccine sentiment tweets by modelling domain-specific representation and commonsense knowledge into context-aware attentive GRU 标题:通过将特定领域的表示和常识知识建模为上下文感知的注意力GRU来对疫苗情感推文进行分类

作者:Usman Naseem,Matloob Khushi,Jinman Kim,Adam G. Dunn 机构:School of Computer Science, The University of Sydney, Australia, School of Medical Sciences, The University of Sydney, Australia 备注:Accepted in International Joint Conference on Neural Networks (IJCNN) 2021 链接:https://arxiv.org/abs/2106.09589 摘要:疫苗是一项重要的公共卫生措施,但疫苗犹豫和拒绝接种会造成疫苗覆盖率低的聚集区,降低疫苗接种计划的有效性。社交媒体包含地理位置并详细描述与疫苗相关的顾虑,为估计疫苗接受度面临的新风险提供了机会。对社交媒体帖子(如疫苗相关推文)进行分类的方法,通常使用在通用领域文本上训练的语言模型(LMs)。然而,大规模衡量疫苗情绪的挑战来自语调重音和手势线索的缺失,并且不一定总能获得关于用户的额外信息,例如过去的推文或社交关系。LMs的另一个挑战是缺乏用户元数据中显而易见的常识知识,例如表情符号、正面和负面词语等。在本研究中,为了在信息有限的情况下对疫苗情绪推文进行分类,我们提出了一个由相互连接的组件构成的新型端到端框架,它使用在疫苗相关推文上训练的领域特定LM,并将常识知识建模到具有上下文感知注意力的双向门控循环网络(CK-BiGRU)中。我们进一步利用句法、用户元数据和情感信息来捕捉推文的情绪。我们在两个流行的疫苗相关Twitter数据集上进行了实验,并证明我们提出的方法在识别支持疫苗、反对疫苗和中立推文方面优于最先进的模型。 摘要:Vaccines are an important public health measure, but vaccine hesitancy and refusal can create clusters of low vaccine coverage and reduce the effectiveness of vaccination programs. Social media provides an opportunity to estimate emerging risks to vaccine acceptance by including geographical location and detailing vaccine-related concerns. Methods for classifying social media posts, such as vaccine-related tweets, use language models (LMs) trained on general domain text. However, challenges to measuring vaccine sentiment at scale arise from the absence of tonal stress and gestural cues and may not always have additional information about the user, e.g., past tweets or social connections. Another challenge in LMs is the lack of commonsense knowledge that are apparent in users metadata, i.e., emoticons, positive and negative words etc. In this study, to classify vaccine sentiment tweets with limited information, we present a novel end-to-end framework consisting of interconnected components that use domain-specific LM trained on vaccine-related tweets and models commonsense knowledge into a bidirectional gated recurrent network (CK-BiGRU) with context-aware attention. We further leverage syntactical, user metadata and sentiment information to capture the sentiment of a tweet. We experimented using two popular vaccine-related Twitter datasets and demonstrate that our proposed approach outperforms state-of-the-art models in identifying pro-vaccine, anti-vaccine and neutral tweets.

【4】 Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases 标题:真有知识还是有根据的猜测?重新审视作为知识库的语言模型

作者:Boxi Cao,Hongyu Lin,Xianpei Han,Le Sun,Lingyong Yan,Meng Liao,Tong Xue,Jin Xu 机构:Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, Data Quality Team, WeChat, Tencent Inc., China 备注:Accepted to ACL2021(main conference) 链接:https://arxiv.org/abs/2106.09231 摘要:已有文献表明,经过预训练的掩码语言模型(MLMs,如BERT)在某些数据集上可以获得有竞争力的事实知识抽取性能,表明MLMs有可能成为一种可靠的知识源。在本文中,我们进行了严格的研究,以探讨MLMs在不同抽取范式下的底层预测机制。通过研究MLMs的行为,我们发现以前不错的性能主要归因于过拟合数据集伪影的有偏提示。此外,引入示例和外部上下文之所以能提高知识预测,主要是由于实体类型引导和标准答案泄漏。我们的发现揭示了MLMs的底层预测机制,并强烈质疑先前认为当前MLMs可以作为可靠事实知识库的结论。 摘要:Previous literatures show that pre-trained masked language models (MLMs) such as BERT can achieve competitive factual knowledge extraction performance on some datasets, indicating that MLMs can potentially be a reliable knowledge source. In this paper, we conduct a rigorous study to explore the underlying predicting mechanisms of MLMs over different extraction paradigms. By investigating the behaviors of MLMs, we find that previous decent performance mainly owes to the biased prompts which overfit dataset artifacts. Furthermore, incorporating illustrative cases and external contexts improve knowledge prediction mainly due to entity type guidance and golden answer leakage. Our findings shed light on the underlying predicting mechanisms of MLMs, and strongly question the previous conclusion that current MLMs can potentially serve as reliable factual knowledge bases.

【5】 Can I Be of Further Assistance? Using Unstructured Knowledge Access to Improve Task-oriented Conversational Modeling 标题:我能为您做些什么吗?利用非结构化知识获取改进面向任务的会话建模

作者:Di Jin,Seokhwan Kim,Dilek Hakkani-Tur 机构:Amazon Alexa AI 备注:Presented as a DIALDOC workshop paper at ACL 2021 链接:https://arxiv.org/abs/2106.09174 摘要:以前大多数关于面向任务的对话系统的工作都局限于对领域API的有限覆盖。然而,用户的请求经常超出这些API的范围。这项工作的重点是通过引入外部的、非结构化的知识源来响应这些超出API覆盖范围的用户轮次。我们的方法以流水线方式工作,依次进行寻求知识的轮次检测、知识选择和响应生成。我们为前两个步骤引入了新的数据增强方法,并证明使用从对话上下文中提取的信息可以提高知识选择和端到端性能。通过实验,我们在DSTC9 Track 1基准数据集上的自动和人工评估指标均取得了最先进的性能,验证了我们贡献的有效性。 摘要:Most prior work on task-oriented dialogue systems are restricted to limited coverage of domain APIs. However, users oftentimes have requests that are out of the scope of these APIs. This work focuses on responding to these beyond-API-coverage user turns by incorporating external, unstructured knowledge sources. Our approach works in a pipelined manner with knowledge-seeking turn detection, knowledge selection, and response generation in sequence. We introduce novel data augmentation methods for the first two steps and demonstrate that the use of information extracted from dialogue context improves the knowledge selection and end-to-end performances. Through experiments, we achieve state-of-the-art performance for both automatic and human evaluation metrics on the DSTC9 Track 1 benchmark dataset, validating the effectiveness of our contributions.

推理|分析|理解|解释(4篇)

【1】 pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks 标题:pysentimiento:一个用于情感分析和SocialNLP任务的Python工具包

作者:Juan Manuel Pérez,Juan Carlos Giudici,Franco Luque 机构:Depto. de Computación, Facultad de Cs. Exactas y Naturales, Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Facultad de Matemática, Astronomía y Física, Universidad de Córdoba 备注:4 pages, 2 tables Source code at this https URL Submitted to ASAI/JAIIO 链接:https://arxiv.org/abs/2106.09462 摘要:在过去几年里,从文本中提取观点引起了极大的兴趣,因为社交网络和其他平台上正产生着前所未有的大量用户生成内容。社会科学研究人员在使用观点挖掘工具时遇到的一个问题是,这些工具通常被封装在商业API之后,而且往往不支持英语以外的语言。为了解决这些问题,我们提出了pysentimiento,一个用于情感分析和其他社交NLP任务的多语言Python工具包。这个开源库以黑盒方式为西班牙语和英语提供了最先进的模型,使研究人员能够轻松使用这些技术。 摘要:Extracting opinions from texts has gathered a lot of interest in the last years, as we are experiencing an unprecedented volume of user-generated content in social networks and other places. A problem that social researchers find in using opinion mining tools is that they are usually behind commercial APIs and unavailable for other languages than English. To address these issues, we present pysentimiento, a multilingual Python toolkit for Sentiment Analysis and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish and English in a black-box fashion, allowing researchers to easily access these techniques.

【2】 DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text 标题:DravidianCodeMix:代码混合文本中Dravidian语言的情感分析和攻击性语言识别数据集

作者:Bharathi Raja Chakravarthi,Ruba Priyadharshini,Vigneshwaran Muralidaran,Navya Jose,Shardul Suryawanshi,Elizabeth Sherly,John P. McCrae 机构:Received: date Accepted: date 备注:36 pages 链接:https://arxiv.org/abs/2106.09460 摘要:本文描述了一个多语种,手动注释的数据集的开发,用于三种资源不足的德拉威语,这些语言是由社交媒体评论生成的。该数据集被标注用于情绪分析和攻击性语言识别,总共有超过60000条YouTube评论。该数据集包括泰米尔语英语的约44000条评论,卡纳达语英语的约7000条评论,马拉雅拉姆语英语的约20000条评论。这些数据是由自愿的注释者手工注释的,在Krippendorff的alpha中注释者之间有很高的一致性。数据集包含所有类型的代码混合现象,因为它包含来自多语种国家的用户生成的内容。我们还提出了使用机器学习方法在数据集上建立基准的基线实验。数据集在Github上可用(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset)和Zenodo(https://zenodo.org/record/4750858\#.YJtw0SYo\\米)。 摘要:This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on Github (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858\#.YJtw0SYo\_0M).

【3】 DocNLI: A Large-scale Dataset for Document-level Natural Language Inference 标题:DocNLI:文档级自然语言推理的大规模数据集

作者:Wenpeng Yin,Dragomir Radev,Caiming Xiong 机构:Salesforce Research and ,Yale University 备注:ACL'21 Findings Camera-ready 链接:https://arxiv.org/abs/2106.09449 摘要:自然语言推理(Natural language inference,NLI)是解决关系抽取、问答、摘要等自然语言问题的统一框架,近年来由于大规模标记数据集的出现而得到了广泛的研究。然而,现有的研究大多集中在句子层面的推理上,这限制了自然语言识别在下游自然语言处理问题中的应用范围。本文提出了一种新的大规模文档级NLI数据集DocNLI。DocNLI是从广泛的自然语言处理问题转化而来的,涵盖了多种文本类型。前提总是停留在文档的粒度上,而假设的长度从一句话到数百个单词的段落不等。此外,DocNLI有相当有限的工件,不幸的是广泛存在于一些流行的句子级NLI数据集中。我们的实验表明,即使没有微调,一个基于DocNLI的预训练模型在流行的句子级基准测试上也显示出良好的性能,并且很好地推广到依赖于文档粒度推理的域外NLP任务。特定于任务的微调可以带来进一步的改进。数据、代码和预训练模型可以在https://github.com/salesforce/DocNLI. 摘要:Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems such as relation extraction, question answering, summarization, etc. It has been studied intensively in the past few years thanks to the availability of large-scale labeled datasets. However, most existing studies focus on merely sentence-level inference, which limits the scope of NLI's application in downstream NLP problems. This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DocNLI has pretty limited artifacts which unfortunately widely exist in some popular sentence-level NLI datasets. Our experiments demonstrate that, even without fine-tuning, a model pretrained on DocNLI shows promising performance on popular sentence-level benchmarks, and generalizes well to out-of-domain NLP tasks that rely on inference at document granularity. Task-specific fine-tuning can bring further improvements. Data, code, and pretrained models can be found at https://github.com/salesforce/DocNLI.

【4】 STAN: A stuttering therapy analysis helper 标题:斯坦:口吃治疗分析助手

作者:Sebastian P. Bayerl,Marc Wenninger,Jochen Schmidt,Alexander Wolff von Gudenberg,Korbinian Riedhammer 机构:Technische Hochschule Nürnberg Georg Simon Ohm, Technische Hochschule Rosenheim, Institut der Kasseler Stottertherapie 备注:None 链接:https://arxiv.org/abs/2106.09545 摘要:口吃是一种复杂的言语障碍,表现为说话时声音、音节或单词的重复、延长和阻塞。具体的口吃行为差异很大,因此需要个性化的治疗。治疗过程需要治疗师高度集中精神。我们介绍了STAN,一个帮助言语治疗师进行口吃治疗的系统。这样的自动反馈系统可以降低治疗师的认知负荷,从而使治疗更加一致,并允许在多个疗程中分析口吃。 摘要:Stuttering is a complex speech disorder identified by repetitions, prolongations of sounds, syllables or words and blocks while speaking. Specific stuttering behaviour differs strongly, thus needing personalized therapy. Therapy sessions require a high level of concentration by the therapist. We introduce STAN, a system to aid speech therapists in stuttering therapy sessions. Such an automated feedback system can lower the cognitive load on the therapist and thereby enable a more consistent therapy as well as allowing analysis of stuttering over the span of multiple therapy sessions.

GAN|对抗|攻击|生成相关(2篇)

【1】 Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction 标题:Text2Event:端到端事件提取的可控序列到结构生成

作者:Yaojie Lu,Hongyu Lin,Jin Xu,Xianpei Han,Jialong Tang,Annan Li,Le Sun,Meng Liao,Shaoyi Chen 机构:Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, Data Quality Team, WeChat, Tencent Inc., China 备注:Accepted to ACL2021 (main conference) 链接:https://arxiv.org/abs/2106.09232 摘要:由于事件记录的结构复杂,文本与事件之间存在语义鸿沟,事件抽取具有很大的挑战性。传统的事件记录提取方法是将复杂的结构预测任务分解为多个子任务。在本文中,我们提出了Text2Event,一种序列到结构的生成范式,可以直接从文本中以端到端的方式提取事件。具体来说,我们设计了一个用于统一事件抽取的序列结构网络,一个用于推理过程中事件知识注入的约束解码算法,以及一个用于高效模型学习的课程学习算法。实验结果表明,通过在单个模型中对所有任务进行统一建模,并对不同的标签进行统一预测,我们的方法在有监督学习和转移学习环境下都可以仅使用记录级注释来获得具有竞争力的性能。 摘要:Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.

【2】 Automatic Construction of Evaluation Suites for Natural Language Generation Datasets 标题:自然语言生成数据集评估套件的自动构建

作者:Simon Mille,Kaustubh D. Dhole,Saad Mahamood,Laura Perez-Beltrachini,Varun Gangal,Mihir Kale,Emiel van Miltenburg,Sebastian Gehrmann 机构:Universitat Pompeu Fabra, Amelia Science, IPsoft R&D, trivago N.V., University of Edinburgh, Carnegie Mellon University, Google Research, Tilburg University 链接:https://arxiv.org/abs/2106.09069 摘要:应用于自然语言处理的机器学习方法通常通过把性能总结为一个数字(例如准确率)来评估。由于大多数测试集是从整体数据中构造的独立同分布(i.i.d.)样本,这种方法过度简化了语言的复杂性,并鼓励对数据分布头部的过拟合。因此,罕见的语言现象或关于代表性不足群体的文本没有被平等地纳入评估。为了鼓励更深入的模型分析,研究人员提出使用多个测试集(也称为挑战集)来评估模型的特定能力。在本文中,我们基于这一思想开发了一个框架,它能够在文本到标量、文本到文本或数据到文本的设置中生成受控扰动并识别子集。通过将此框架应用于GEM生成基准,我们提出了一个由80个挑战集组成的评估套件,展示了它所支持的各类分析,并阐明了当前生成模型的局限性。 摘要:Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
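作为示意,下面给出一种可能的"受控扰动"写法:把文本中的数字加一,用来构造对数值敏感的挑战子集。具体扰动类型与实现以论文和 GEM 基准为准,此处函数与扰动规则均为假设性示例:

```python
import re

def perturb_numbers(text: str) -> str:
    """受控扰动示例:将文本中的每个整数加一,生成一个挑战子集的输入。"""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text)

print(perturb_numbers("The team scored 3 goals in 90 minutes."))
# 输出:The team scored 4 goals in 91 minutes.
```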

半/弱/无监督|不确定性(3篇)

【1】 A Self-supervised Method for Entity Alignment 标题:一种自监督的实体对齐方法

作者:Xiao Liu,Haoyun Hong,Xinghao Wang,Zeyi Chen,Evgeny Kharlamov,Yuxiao Dong,Jie Tang 机构:Tsinghua University, University of Oslo, Facebook AI 链接:https://arxiv.org/abs/2106.09395 摘要:实体对齐(entity alignment)的目的是识别不同知识图(KG)之间的等价实体,是构建大规模知识图的一个基本问题。在其发展过程中,监督一直被认为是精确对齐的必要条件。受近期自监督学习进展的启发,我们探讨在多大程度上可以摆脱实体对齐对监督的依赖。现有的有监督方法侧重于把每一对正(有标注)实体拉近。然而,我们的分析表明,实体对齐的学习从推远采样的(未标注)负例中获益,实际上多于从拉近已对齐的正例对中获益。我们基于这一发现提出了SelfKG,设计了一种跨两个KG的对比学习策略。在基准数据集上的大量实验表明,无监督的SelfKG可以匹敌或取得与最先进的有监督基线相当的结果。SelfKG的表现表明,自监督学习为KG中的实体对齐提供了巨大潜力。 摘要:Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing large-scale KGs. Over the course of its development, supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Existing supervised methods for this task focus on pulling each pair of positive (labeled) entities close to each other. However, our analysis suggests that the learning of entity alignment can actually benefit more from pushing sampled (unlabeled) negatives far away than pulling positive aligned pairs close. We present SelfKG by leveraging this discovery to design a contrastive learning strategy across two KGs. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG demonstrates self-supervised learning offers great potential for entity alignment in KGs.

【2】 Denoising Distantly Supervised Named Entity Recognition via a Hypergeometric Probabilistic Model 标题:基于超几何概率模型的远程监督命名实体识别去噪

作者:Wenkai Zhang,Hongyu Lin,Xianpei Han,Le Sun,Huidan Liu,Zhicheng Wei,Nicholas Jing Yuan 机构:Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, Huawei Cloud&AI 备注:Accepted to AAAI2021 链接:https://arxiv.org/abs/2106.09234 摘要:去噪是基于远程监督的命名实体识别中必不可少的步骤。以往的去噪方法大多基于实例级的置信度统计,忽略了不同数据集和实体类型上底层噪声分布的差异,因此难以适应高噪声率的设定。本文提出了超几何学习(HGL),一种同时考虑噪声分布和实例级置信度的远程监督NER去噪算法。具体地说,在神经网络训练过程中,我们将每个批次中的噪声样本自然地建模为服从由噪声率参数化的超几何分布。然后,根据前一训练步骤得到的标签置信度以及该批次中的噪声分布,将批次中的每个实例判定为正确样本或噪声样本。实验表明,HGL能有效地对远程监督检索到的弱标注数据去噪,从而显著提升所训练的模型。 摘要:Denoising is the essential step for distant supervision based named entity recognition. Previous denoising methods are mostly based on instance-level confidence statistics, which ignore the variety of the underlying noise distribution on different datasets and entity types. This makes them difficult to be adapted to high noise rate settings. In this paper, we propose Hypergeometric Learning (HGL), a denoising algorithm for distantly supervised NER that takes both noise distribution and instance-level confidence into consideration. Specifically, during neural network training, we naturally model the noise samples in each batch following a hypergeometric distribution parameterized by the noise-rate. Then each instance in the batch is regarded as either correct or noisy one according to its label confidence derived from previous training step, as well as the noise distribution in this sampled batch. Experiments show that HGL can effectively denoise the weakly-labeled data retrieved from distant supervision, and therefore results in significant improvements on the trained models.
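下面是对"按超几何分布建模每个批次中噪声样本数"这一思想的极简示意(population、noise_rate 等均为假设参数,并非论文的具体实现):

```python
import numpy as np

def mark_noisy_instances(confidences, noise_rate, population=10000, rng=None):
    """按超几何分布估计本批次中的噪声样本数,并把置信度最低的样本标记为噪声(示意)。"""
    rng = rng or np.random.default_rng()
    batch_size = len(confidences)
    n_noisy_pop = int(population * noise_rate)   # 总体中的噪声样本数(假设的总体规模)
    n_clean_pop = population - n_noisy_pop       # 总体中的干净样本数
    # 不放回地从总体中抽取 batch_size 个样本时,其中噪声样本的数量服从超几何分布
    k_noisy = rng.hypergeometric(n_noisy_pop, n_clean_pop, batch_size)
    # 把本批次中标签置信度最低的 k_noisy 个实例视为噪声
    order = np.argsort(confidences)              # 置信度从低到高排序
    is_noisy = np.zeros(batch_size, dtype=bool)
    is_noisy[order[:k_noisy]] = True
    return is_noisy

# 用法示意:confidences 来自上一训练步骤的标签置信度
print(mark_noisy_instances([0.9, 0.2, 0.8, 0.1], noise_rate=0.3))
```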

【3】 De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention 标题:通过因果干预消除远程监督命名实体识别的偏差

作者:Wenkai Zhang,Hongyu Lin,Xianpei Han,Le Sun 机构:Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China 备注:Accepted to ACL2021(main conference) 链接:https://arxiv.org/abs/2106.09233 摘要:远程监督通过字典匹配自动生成训练实例,以解决NER中的数据瓶颈。不幸的是,DS-NER的学习受到严重的字典偏差影响,存在虚假相关,从而削弱了所学模型的有效性和鲁棒性。本文通过结构因果模型(SCM)从根本上解释了字典偏差,将其分为字典内偏差和字典间偏差,并分析了其成因。基于SCM,我们通过因果干预学习去偏的DS-NER。对于字典内偏差,我们进行后门调整,以消除字典这一混杂因子带来的虚假相关。对于字典间偏差,我们提出了一种因果不变性正则化方法,使DS-NER模型对字典的扰动更加鲁棒。在四个数据集和三个DS-NER模型上的实验表明,该方法能显著提高DS-NER的性能。 摘要:Distant supervision tackles the data bottleneck in NER by automatically generating training instances via dictionary matching. Unfortunately, the learning of DS-NER is severely dictionary-biased, which suffers from spurious correlations and therefore undermines the effectiveness and the robustness of the learned models. In this paper, we fundamentally explain the dictionary bias via a Structural Causal Model (SCM), categorize the bias into intra-dictionary and inter-dictionary biases, and identify their causes. Based on the SCM, we learn de-biased DS-NER via causal interventions. For intra-dictionary bias, we conduct backdoor adjustment to remove the spurious correlations introduced by the dictionary confounder. For inter-dictionary bias, we propose a causal invariance regularizer which will make DS-NER models more robust to the perturbation of dictionaries. Experiments on four datasets and three DS-NER models show that our method can significantly improve the performance of DS-NER.

Zero/Few/One-Shot|迁移|自适应(2篇)

【1】 LoRA: Low-Rank Adaptation of Large Language Models 标题:LoRA:大型语言模型的低秩适配

作者:Edward J. Hu,Yelong Shen,Phillip Wallis,Zeyuan Allen-Zhu,Yuanzhi Li,Shean Wang,Weizhu Chen 机构:Microsoft Corporation 链接:https://arxiv.org/abs/2106.09685 摘要:自然语言处理的主导范式包括对一般领域数据的大规模预训练和对特定任务或领域的适应。当我们预先训练较大的模型时,传统的重新训练所有模型参数的微调变得不太可行。以GPT-3175b为例,部署许多微调模型的独立实例(每个实例都有175B参数)是极其昂贵的。我们提出低秩自适应(Low-Rank adaption,简称LoRA),它冻结预先训练好的模型权值,并将可训练的秩分解矩阵注入到Transformer结构的每一层中,大大减少了下游任务的可训练参数数目。对于GPT-3,LoRA算法可使可训练参数的个数减少10000倍,计算硬件需求比全微调算法减少3倍。尽管具有较少的可训练参数、较高的训练吞吐量和没有额外的推理延迟,但在GPT-3和GPT-2上,LoRA在模型质量上的性能与微调相当或更好。我们还对语言模式适应中的秩缺陷进行了实证研究,从而揭示了LoRA的有效性。我们在GPT-2中发布了我们的实现https://github.com/microsoft/LoRA . 摘要:The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on-par or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA. We release our implementation in GPT-2 at https://github.com/microsoft/LoRA .
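LoRA 的核心做法可以用几行 PyTorch 代码示意:冻结预训练权重 W,只训练低秩分解矩阵 B、A,推理时可把 B·A 合并回 W,不增加额外延迟。官方实现见摘要中的 GitHub 链接,下面只是一个简化草图,r、alpha 等取值为假设:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """低秩适配线性层草图:输出 = W x + (B A) x * (alpha / r),其中 W 被冻结。"""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # 预训练权重(此处用随机值占位),冻结不参与训练
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # 可训练的低秩矩阵:A 小随机初始化,B 零初始化,保证训练开始时等价于原模型
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

# 用法示意:只有 lora_A / lora_B 会出现在可训练参数中
layer = LoRALinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```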

【2】 ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling 标题:基于交叉话语上下文和多任务语言建模的电子商务聊天机器人ASR适配

作者:Ashish Shenoy,Sravan Bodapati,Katrin Kirchhoff 机构:Amazon AWS AI, USA 备注:Accepted at ACL-IJCNLP 2021 Workshop on e-Commerce and NLP (ECNLP) 链接:https://arxiv.org/abs/2106.09532 摘要:在涉及货币交易和购买的电子商务语音助理中,自动语音识别(ASR)对槽实体的鲁棒性至关重要。随着有效的领域顺应,跨话语语境线索在消除特定领域内容词歧义方面起着重要作用。在本文中,我们研究了各种技术,以提高语境化,内容词稳健性和领域适应的TransformerXL神经语言模型(NLM)重新审视ASR的N-最佳假设。为了提高语境化程度,我们利用了转折层对话行为和跨话语语境的传承。此外,为了使我们的域通用NLM适应动态的电子商务,我们在域内数据上使用了从精细调整的屏蔽LM派生的嵌入。最后,为了提高对领域内内容词的鲁棒性,我们提出了一个多任务模型,可以联合执行内容词检测和语言建模任务。与无上下文的LSTM-LM基线相比,我们性能最好的NLM rescorer在电子商务音频测试集上的内容WER减少了19.2%,slot标签F1提高了6.4%。 摘要:Automatic Speech Recognition (ASR) robustness toward slot entities are critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross utterance contextual cues play an important role in disambiguating domain specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn level dialogue acts along with cross utterance context carry over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a finetuned masked LM on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that can jointly perform content word detection and language modeling tasks. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer results in a content WER reduction of 19.2% on e-commerce audio test set and a slot labeling F1 improvement of 6.4%.
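N-best 重打分的基本流程可以概括为:对每条假设,把一遍解码得分与神经语言模型得分加权求和后取最优。下面是一个通用示意(λ 取值与打分函数均为假设,并非论文系统的具体实现):

```python
def rescore_nbest(hypotheses, lm_score_fn, lm_weight=0.5):
    """hypotheses: [(文本, 一遍解码得分), ...];lm_score_fn 返回语言模型对数概率。"""
    best_text, best_score = None, float("-inf")
    for text, asr_score in hypotheses:
        total = asr_score + lm_weight * lm_score_fn(text)  # 总分 = ASR 得分 + λ·LM 得分
        if total > best_score:
            best_text, best_score = text, total
    return best_text
```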

表征(2篇)

【1】 Do Large Scale Molecular Language Representations Capture Important Structural Information? 标题:大规模分子语言表示法能捕捉到重要的结构信息吗?

作者:Jerret Ross,Brian Belgodere,Vijil Chenthamarakshan,Inkit Padhi,Youssef Mroueh,Payel Das 机构:IBM Research AI 备注:17 pages, 3 figures 链接:https://arxiv.org/abs/2106.09553 摘要:从分子结构预测化学性质在药物发现和材料设计等许多应用中具有重要意义。与密度泛函理论(DFT)等计算方法相比,基于机器学习的分子性质预测有望以低得多的复杂度实现精确预测。以有监督方式用图神经网络从分子图中提取的特征,已经成为此类任务的强基线。然而,巨大的化学空间加上有限的标注使得监督学习具有挑战性,因此需要学习一种通用的分子表示。最近,在大规模无标注语料上预训练的基于Transformer的语言模型(PTLMs)在许多下游自然语言处理任务中取得了最新的结果。受这一进展的启发,我们在此提出了通过训练一个称为MoLFormer的高效Transformer编码器模型而获得的分子嵌入。该模型采用线性注意力机制,在来自PubChem和ZINC数据集的11亿个未标注分子的一维SMILES序列上进行了高度并行化的训练。实验表明,与现有的基于图和基于指纹的有监督学习基线相比,所学习的分子表示在预测QM8和QM9分子性质这一具有挑战性的任务上具有竞争力。对MoLFormer表示进行进一步的任务特定微调,可提升若干性质预测基准上的性能。这些结果提供了令人鼓舞的证据:大规模分子语言模型能够捕捉到足够的结构信息,从而准确预测量子化学性质乃至更多性质。 摘要:Predicting chemical properties from the structure of a molecule is of great importance in many applications including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at much less complexity, when compared to, for example Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, pre-trained transformer-based language models (PTLMs) on large unlabeled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model was employed with a linear attention mechanism and highly parallelized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively, when compared to existing graph-based and fingerprint-based supervised learning baselines, on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormer representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to be able to accurately predict quantum chemical properties and beyond.

【2】 Biomedical Interpretable Entity Representations 标题:生物医学可解释实体表示法

作者:Diego Garcia-Olano,Yasumasa Onoe,Ioana Baldini,Joydeep Ghosh,Byron C. Wallace,Kush R. Varshney 机构:IBM Research,University of Texas at Austin,Northeastern University 备注:Accepted into Findings of ACL-IJCNLP 2021 链接:https://arxiv.org/abs/2106.09502 摘要:预训练语言模型会产生稠密的实体表示,在以实体为中心的自然语言处理任务中表现强劲,但这种表示无法直接解释。这可能成为模型在生物医学等重要领域落地应用的障碍。最近有一些关于通用可解释表征学习的研究(Onoe和Durrett,2020),但这些领域无关的表征并不容易迁移到生物医学这一重要领域。在本文中,我们通过将实体映射到医学本体中的概念,再将这些概念映射到以其类别作为我们的类型的Wikipedia页面,从大规模生物医学文本语料库中创建了一个新的实体类型系统和训练集。基于这一映射,我们导出了生物医学可解释实体表示(BIERs),其中每一维对应一个细粒度实体类型,取值是给定实体属于相应类型的预测概率。我们提出了一种新方法,利用BIER最终的稀疏表示和中间的稠密表示来辅助模型和实体类型的调试。我们证明BIERs在命名实体消歧和实体标签分类等生物医学任务中取得了很好的性能,并提供错误分析来突出其可解释性的效用,特别是在低监督设置下。最后,我们公开了归纳得到的68K生物医学类型系统、用于训练BIER模型的相应3700万条导出数据三元组,以及我们性能最佳的模型。 摘要:Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations(BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity label classification, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we provide our induced 68K biomedical type system, the corresponding 37 million triples of derived data used to train BIER models and our best performing model.
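"每一维对应一个细粒度实体类型、取值为该类型的预测概率"这一可解释表示可以这样示意(类型名与 logits 均为假设输入,非论文的具体类型系统):

```python
import torch

def bier_vector(type_logits: torch.Tensor, type_names: list) -> dict:
    """把类型 logits 变成可解释的实体表示:键为实体类型,值为属于该类型的预测概率。"""
    probs = torch.sigmoid(type_logits)
    return {name: float(p) for name, p in zip(type_names, probs)}

print(bier_vector(torch.tensor([2.0, -1.0]), ["drug", "disease"]))
# 输出形如 {'drug': 0.88, 'disease': 0.27},每一维都可直接读出含义
```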

Word2Vec|文本|单词(2篇)

【1】 Modeling Worlds in Text 标题:文本中的建模世界

作者:Prithviraj Ammanabrolu,Mark O. Riedl 机构:School of Interactive Computing, Georgia Institute of Technology 备注:Preprint. Under review. Benchmark can be found at this https URL 链接:https://arxiv.org/abs/2106.09578 摘要:我们提供了一个数据集,使学习代理,可以建立基于知识图的交互式叙事世界模型的创建。互动叙事——或文本冒险游戏——是一种部分可观察的环境,被构造成长时间的谜题或任务,在这种环境中,一个主体纯粹通过文本自然语言感知并与世界互动。每个游戏通常包含数百个位置、角色和对象——每个都有自己独特的描述——提供了一个机会来研究如何为基于语言的代理提供在这样的世界中运行所必需的结构化内存的问题。我们的数据集提供了24198个丰富的自然语言观测值和(1)以地图形式反映世界状态的知识图之间的映射(2) 自然语言的行为,保证会引起特定世界状态的变化。训练数据收集了27个不同类型的游戏,并在测试集中包含了另外9个游戏的7836个heldout实例。除了对数据和相应的学习任务进行分析之外,我们还提供了使用基于规则、问答和序列学习方法的基线模型。 摘要:We provide a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive narratives. Interactive narratives -- or text-adventure games -- are partially observable environments structured as long puzzles or quests in which an agent perceives and interacts with the world purely through textual natural language. Each individual game typically contains hundreds of locations, characters, and objects -- each with their own unique descriptions -- providing an opportunity to study the problem of giving language-based agents the structured memory necessary to operate in such worlds. Our dataset provides 24198 mappings between rich natural language observations and: (1) knowledge graphs that reflect the world state in the form of a map; (2) natural language actions that are guaranteed to cause a change in that particular world state. The training data is collected across 27 games in multiple genres and contains a further 7836 heldout instances over 9 additional games in the test set. We further provide baseline models using rules-based, question-answering, and sequence learning approaches in addition to an analysis of the data and corresponding learning tasks.

【2】 Scalable Approach for Normalizing E-commerce Text Attributes (SANTA) 标题:电子商务文本属性规范化的可伸缩方法(圣诞老人)

作者:Ravi Shankar Mishra,Kartik Mehta,Nikhil Rasiwasia 机构:India Machine Learning, Amazon 备注:Accepted in ECNLP workshop of ACL-IJCNLP 2021 (this https URL) 链接:https://arxiv.org/abs/2106.09493 摘要:在本文中,我们提出了SANTA,一个可扩展的框架,用于自动将电子商务属性值(如"Win 10 Pro")规范化为一组固定的预定义规范值(如"Windows 10")。早期的属性规范化工作主要集中在模糊字符串匹配(本文也称为句法匹配)。在这项工作中,我们首先对9种句法匹配算法进行了广泛研究,并确定"余弦"相似度效果最佳,比常用的Jaccard指数提高了2.7%。接下来,我们指出仅靠字符串相似度不足以完成属性规范化,因为许多表面形式需要超越句法匹配(例如,"720p"和"HD"是同义词)。虽然无监督嵌入(如word2vec/fastText)这类语义技术在词相似度任务中表现很好,但我们观察到它们难以区分相近的规范形式,因为这些相近形式经常出现在相似的上下文中。我们提出使用带三元组损失的孪生网络来学习词元嵌入,并设计了一个利用原始属性值和商品标题、以自监督方式学习这些嵌入的任务。我们表明,使用该任务提供监督,优于基于句法匹配和无监督嵌入的属性规范化技术。在一个包含50个属性的真实属性规范化数据集上的实验表明,用该方法训练的嵌入比最佳字符串匹配提高了2.3%,比最佳无监督嵌入提高了19.3%。 摘要:In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to best results, showing 2.7% improvement over commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3% improvement over best string matching and 19.3% improvement over best unsupervised embeddings.
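"带三元组损失的孪生网络"中的损失部分可以写成如下草图(margin 为假设值;锚点、正例、负例大致对应属性值、同一规范值的另一表面形式、相近但不同的规范值):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """让锚点到正例的距离比到负例的距离至少小 margin,否则产生损失。"""
    d_pos = F.pairwise_distance(anchor, positive)   # 锚点-正例 距离
    d_neg = F.pairwise_distance(anchor, negative)   # 锚点-负例 距离
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```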

其他神经网络|深度学习|模型|建模(3篇)

【1】 An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models 标题:精调预训练语言模型超参数优化的实证研究

作者:Xueqing Liu,Chi Wang 机构:Stevens Institute of Technology, Microsoft Research 备注:To appear in ACL-IJCNLP 2021 链接:https://arxiv.org/abs/2106.09204 摘要:微调预训练语言模型的性能很大程度上取决于超参数配置。本文研究了现代超参数优化方法(HPO)在微调预训练语言模型上的性能。首先,我们研究并报告了三种HPO算法在GLUE数据集上微调两种最新语言模型的性能。我们发现,在使用相同的时间预算时,HPO的性能往往不如网格搜索,原因有两个:时间预算不足和过度拟合。我们提出了两个通用的策略和一个实验程序来系统地排除HPO的故障。通过应用该程序,我们观察到,在搜索空间和时间预算中进行更适当的设置,HPO可以获得成功;然而,在某些情况下,过度装配仍然存在。最后,对今后的工作提出了建议。我们的实现可以在https://github.com/microsoft/FLAML/tree/main/flaml/nlp/. 摘要:The performance of fine-tuning pre-trained language models largely depends on the hyperparameter configuration. In this paper, we investigate the performance of modern hyperparameter optimization methods (HPO) on fine-tuning pre-trained language models. First, we study and report three HPO algorithms' performances on fine-tuning two state-of-the-art language models on the GLUE dataset. We find that using the same time budget, HPO often fails to outperform grid search due to two reasons: insufficient time budget and overfitting. We propose two general strategies and an experimental procedure to systematically troubleshoot HPO's failure cases. By applying the procedure, we observe that HPO can succeed with more appropriate settings in the search space and time budget; however, in certain cases overfitting remains. Finally, we make suggestions for future work. Our implementation can be found in https://github.com/microsoft/FLAML/tree/main/flaml/nlp/.
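论文强调了固定时间预算下 HPO 与网格搜索的对比。下面给出一个带时间预算的随机搜索通用草图,仅示意"预算内尽量多尝试配置"的流程,与论文所用的具体 HPO 算法无关,搜索空间与评估函数均为假设:

```python
import random
import time

def random_search(train_and_eval, search_space, time_budget_s):
    """在给定时间预算(秒)内随机采样超参数配置,返回验证得分最高的配置。"""
    best_cfg, best_score = None, float("-inf")
    deadline = time.time() + time_budget_s
    while time.time() < deadline:
        cfg = {name: random.choice(values) for name, values in search_space.items()}
        score = train_and_eval(cfg)          # 训练并在验证集上评估,返回得分
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# 用法示意:搜索学习率与批大小
space = {"lr": [1e-5, 2e-5, 3e-5, 5e-5], "batch_size": [16, 32]}
```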

【2】 Specializing Multilingual Language Models: An Empirical Study 标题:专业化的多语种语言模式:一项实证研究

作者:Ethan C. Chau,Noah A. Smith 机构:†Paul G. Allen School of Computer Science & Engineering, University of Washington, ⋆Allen Institute for Artificial Intelligence 链接:https://arxiv.org/abs/2106.09063 摘要:预先训练的多语语言模型中的语境化词语表达已经成为处理多种语言中自然语言任务的事实标准,但这种方法的成功还远没有普及。对于这些模型很少或从未见过的语言,直接使用这些模型往往会导致数据的次优表示或使用,从而激发额外的模型调整以实现相当强的性能。在这项工作中,我们研究了在这种低资源环境下两种适应性的性能、可扩展性和交互作用:词汇扩充和脚本音译。我们用九种不同的低资源语言对一组三项任务进行了评估,结果喜忧参半,支持了这些方法的可行性,同时提出了如何使多语言模型最佳地适应低资源环境的新问题。 摘要:Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks in many different languages, but the success of this approach is far from universal. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data, motivating additional model adaptations to achieve reasonably strong performance. In this work, we study the performance, extensibility, and interaction of two such adaptations for this low-resource setting: vocabulary augmentation and script transliteration. Our evaluations on a set of three tasks in nine diverse low-resource languages yield a mixed result, upholding the viability of these approaches while raising new questions around how to optimally adapt multilingual models to low-resource settings.

【3】 Scaling Laws for Acoustic Models 标题:声学模型的标度律

作者:Jasha Droppo,Oguz Elibol 机构:Amazon Alexa 备注:Submitted to Interspeech 2021 链接:https://arxiv.org/abs/2106.09488 摘要:机器学习最近的一个趋势是把模型增大到以前被认为不合理的规模,以提高模型质量。近期工作表明,以交叉熵为目标函数的自回归生成模型表现出平滑的幂律关系,即标度律,可以根据模型大小、训练集大小和可用的计算预算来预测模型质量。这些标度律允许在可用训练数据、模型参数量或训练计算预算的约束下选择接近最优的超参数。在本文中,我们证明了使用自预测编码损失训练的声学模型,其行为表现得如同遵循类似的标度律。我们扩展了先前的工作,联合预测由模型大小、训练集大小以及任务固有的"不可约损失"所造成的损失。我们发现标度律在模型大小和训练集大小各跨越两个数量级的范围内都能精确匹配模型性能,并对模型性能的极限作出预测。 摘要:There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent "irreducible loss" of the task. We find that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.
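这类"模型大小 + 数据量 + 不可约损失"的联合标度律常被写成如下的幂律形式(这是文献中常见的参数化写法,仅作示意,具体函数形式以论文为准):

```latex
L(N, D) \;=\; L_{\infty} \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

其中 N 为模型参数量,D 为训练集大小,L_∞ 为任务固有的不可约损失,A、B、α、β 为拟合常数。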

其他(5篇)

【1】 Topic Modeling and Progression of American Digital News Media During the Onset of the COVID-19 Pandemic 标题:冠状病毒大流行期间美国数字新闻媒体的主题建模与发展

作者:Xiangpeng Wan,Michael C. Lucic,Hakim Ghazzai,Yehia Massoud 链接:https://arxiv.org/abs/2106.09572 摘要:当前,世界正处于一场严重的全球性大流行之中,已经影响到人们生活的方方面面。因此,由于流感大流行的不同影响,美国出现了大量与COVID相关的数字媒体文章。如此庞大的信息量很难在合理的时间内被受众消费掉。在本文中,我们开发了一个自然语言处理(NLP)管道,它能够自动地将各种数字文章提取成可管理的信息片段,同时还可以对随时间推移讨论的主题进行建模,以帮助读者快速获得对紧迫问题的整体观点(即。,COVID-19大流行)来自不同的来源。我们通过在流感大流行期间收集大量与COVID相关的文章来实现这些目标。然后,采用无监督和半监督学习方法对文章进行归纳总结,然后利用社区检测方法对文章进行相似度聚类。接下来,我们使用BART算法确定每个文章簇的主题。最后,我们基于NLP管道输出提供了详细的数字媒体分析,并展示了围绕COVID-19的对话是如何随着时间的推移而演变的。 摘要:Currently, the world is in the midst of a severe global pandemic, which has affected all aspects of people's lives. As a result, there is a deluge of COVID-related digital media articles published in the United States, due to the disparate effects of the pandemic. This large volume of information is difficult to consume by the audience in a reasonable amount of time. In this paper, we develop a Natural Language Processing (NLP) pipeline that is capable of automatically distilling various digital articles into manageable pieces of information, while also modelling the progression topics discussed over time in order to aid readers in rapidly gaining holistic perspectives on pressing issues (i.e., the COVID-19 pandemic) from a diverse array of sources. We achieve these goals by first collecting a large corpus of COVID-related articles during the onset of the pandemic. After, we apply unsupervised and semi-supervised learning procedures to summarize articles, then cluster them based on their similarities using the community detection methods. Next, we identify the topic of each cluster of articles using the BART algorithm. Finally, we provide a detailed digital media analysis based on the NLP-pipeline outputs and show how the conversation surrounding COVID-19 evolved over time.

【2】 Element Intervention for Open Relation Extraction 标题:开放关系抽取中的元素干预

作者:Fangchao Liu,Lingyong Yan,Hongyu Lin,Xianpei Han,Le Sun 机构:Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China 备注:Accepted to ACL2021(main conference) 链接:https://arxiv.org/abs/2106.09558 摘要:开放关系抽取的目的是把指称同一底层关系的关系实例聚类,这是通用关系抽取的关键步骤。现有的OpenRE模型通常在远程监督生成的数据集上训练,这往往导致模型不稳定、容易坍塌。本文从因果关系的角度重新审视OpenRE的过程。通过为OpenRE建立结构因果模型,我们发现上述问题源于实体和上下文与关系类型之间的虚假相关。为了解决这个问题,我们提出了元素干预(Element Intervention),分别对上下文和实体进行干预,以获得它们潜在的因果效应。我们还给出了基于实体排序和上下文对比的两种具体干预实现。在无监督关系抽取数据集上的实验结果表明,我们的方法优于以往最先进的方法,并且在不同数据集上具有很强的鲁棒性。 摘要:Open relation extraction aims to cluster relation instances referring to the same underlying relation, which is a critical step for general relation extraction. Current OpenRE models are commonly trained on the datasets generated from distant supervision, which often results in instability and makes the model easily collapsed. In this paper, we revisit the procedure of OpenRE from a causal view. By formulating OpenRE using a structural causal model, we identify that the above-mentioned problems stem from the spurious correlations from entities and context to the relation type. To address this issue, we conduct Element Intervention, which intervenes on the context and entities respectively to obtain the underlying causal effects of them. We also provide two specific implementations of the interventions based on entity ranking and context contrasting. Experimental results on unsupervised relation extraction datasets show that our methods outperform previous state-of-the-art methods and are robust across different datasets.

【3】 X-FACT: A New Benchmark Dataset for Multilingual Fact Checking 标题:X-FACT:一种新的多语言事实核查基准数据集

作者:Ashim Gupta,Vivek Srikumar 机构:School of Computing, University of Utah 备注:ACL 2021; For data and code, see this https URL 链接:https://arxiv.org/abs/2106.09248 摘要:在这项工作中,我们介绍了X-FACT:最大的公开可用的多语言数据集,用于自然存在的真实世界的事实验证。该数据集包含25种语言的简短语句,并由专家事实核查员对其准确性进行标记。该数据集包括一个多语言评估基准,该基准测量了多语言模型的域外泛化和Zero-Shot能力。使用最先进的基于多语言转换器的模型,我们开发了几个自动事实检查模型,这些模型与文本声明一起,利用了从搜索引擎检索的新闻故事中获得的额外元数据和证据。从经验上看,我们的最佳模型的F值达到了40%左右,这表明我们的数据集是评估多语言事实检查模型的一个具有挑战性的基准。 摘要:In this work, we introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims. The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers. The dataset includes a multilingual evaluation benchmark that measures both out-of-domain generalization, and zero-shot capabilities of the multilingual models. Using state-of-the-art multilingual transformer-based models, we develop several automated fact-checking models that, along with textual claims, make use of additional metadata and evidence from news stories retrieved using a search engine. Empirically, our best model attains an F-score of around 40%, suggesting that our dataset is a challenging benchmark for evaluation of multilingual fact-checking models.

【4】 Disentangling Online Chats with DAG-Structured LSTMs 标题:利用DAG结构的LSTM解开在线聊天

作者:Duccio Pappadopulo,Lisa Bauer,Marco Farina,Ozan İrsoy,Mohit Bansal 机构:Bloomberg, UNC Chapel Hill 备注:8 pages, 1 figure. Accepted at *SEM 2021 链接:https://arxiv.org/abs/2106.09024 摘要:许多现代消息传递系统允许许多用户之间进行快速、同步的文本通信。由此产生的消息序列隐藏了一个更复杂的结构,其中独立的子会话相互交织。这对任何旨在理解聊天日志内容或从中收集信息的任务都是一个挑战。理清这些对话的能力相当于许多下游任务的成功,比如总结和问题回答。伴随着文本的结构化信息,如用户转向、用户提及、时间戳,被参与者自己用作提示,他们需要关注对话,并且已经被证明对解开矛盾很重要。DAG-LSTMs是树LSTMs的一个推广,它可以处理有向无环依赖,是一种自然的方法来整合这些信息及其非序列性。本文将DAG-LSTMs应用于会话解纠缠任务。我们在ubuntuirc数据集上进行实验。我们证明了我们提出的新模型在恢复回复关系的任务上达到了最先进的水平,并且在其他解纠缠度量上具有竞争力。 摘要:Many modern messaging systems allow fast and synchronous textual communication among many users. The resulting sequence of messages hides a more complicated structure in which independent sub-conversations are interwoven with one another. This poses a challenge for any task aiming to understand the content of the chat logs or gather information from them. The ability to disentangle these conversations is then tantamount to the success of many downstream tasks such as summarization and question answering. Structured information accompanying the text such as user turn, user mentions, timestamps, is used as a cue by the participants themselves who need to follow the conversation and has been shown to be important for disentanglement. DAG-LSTMs, a generalization of Tree-LSTMs that can handle directed acyclic dependencies, are a natural way to incorporate such information and its non-sequential nature. In this paper, we apply DAG-LSTMs to the conversation disentanglement task. We perform our experiments on the Ubuntu IRC dataset. We show that the novel model we propose achieves state of the art status on the task of recovering reply-to relations and it is competitive on other disentanglement metrics.

【5】 Layer Pruning on Demand with Intermediate CTC 标题:使用中间CTC按需修剪图层

作者:Jaesong Lee,Jingu Kang,Shinji Watanabe 机构:Naver Corporation, Carnegie Mellon University 备注:Interspeech 2021 链接:https://arxiv.org/abs/2106.09216 摘要:在移动/嵌入式设备上部署端到端的自动语音识别(ASR)模型是一项具有挑战性的任务,因为在实际应用中,设备的计算能力和能耗需求是动态变化的。为了克服这个问题,我们提出了一种基于连接主义时间分类(CTC)的ASR训练和剪枝方法,该方法可以在运行时减少模型深度,而不需要任何额外的微调。为了达到这个目的,我们采用了两种正则化方法:中间CTC和随机深度,来训练一个剪枝后性能不会有太大下降的模型。我们提出了一个使用奇异向量典型相关分析(SVCCA)的层行为的深入分析,以及寻找安全修剪层的有效策略。利用该方法,可以根据需要对TransformerCTC模型进行不同深度的剪枝,使GPU上的实时因子从0.005提高到0.002,而剪枝后的每个子模型保持了相同深度的单独训练模型的精度。 摘要:Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device computational power and energy consumption requirements are dynamically changed in practice. To overcome the issue, we present a training and pruning method for ASR based on the connectionist temporal classification (CTC) which allows reduction of model depth at run-time without any extra fine-tuning. To achieve the goal, we adopt two regularization methods, intermediate CTC and stochastic depth, to train a model whose performance does not degrade much after pruning. We present an in-depth analysis of layer behaviors using singular vector canonical correlation analysis (SVCCA), and efficient strategies for finding layers which are safe to prune. Using the proposed method, we show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU, while each pruned sub-model maintains the accuracy of individually trained model of the same depth.
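"按需取不同深度的子模型"在推理端可以简化为直接截取编码器的前 k 层。下面是一个示意草图(能否不微调而保持精度,取决于训练时是否使用了中间 CTC 与随机深度等正则;encoder 等对象为假设):

```python
import torch.nn as nn

def take_sub_encoder(encoder: nn.ModuleList, keep_layers: int) -> nn.Sequential:
    """按需剪层:只保留编码器的前 keep_layers 层,得到一个更浅、更快的子模型。"""
    return nn.Sequential(*list(encoder)[:keep_layers])

# 用法示意:full_encoder 为 12 层的 nn.ModuleList 时,
# take_sub_encoder(full_encoder, 6) 返回只含前 6 层的子模型
```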

本文分享自微信公众号「arXiv每日学术速递」,原始发表于2021-06-18。