
Natural Language Processing Academic Digest [9.2]

Author: 公众号-arXiv每日学术速递 (WeChat official account) · Published 2021-09-16 15:09:45


cs.CL: 33 papers today

Transformer (3 papers)

[1] $\infty$-former: Infinite Memory Transformer
Link: https://arxiv.org/abs/2109.00301
Authors: Pedro Henrique Martins, Zita Marinho, André F. T. Martins
Affiliations: Instituto de Telecomunicações; DeepMind; Institute of Systems and Robotics; LUMLIS (Lisbon ELLIS Unit), Instituto Superior Técnico; Unbabel
Abstract: Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity, being forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the $\infty$-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show benefits of unbounded long-term memories.
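
The key mechanism here, continuous-space attention over a compressed memory, fits in a few lines. The following is a minimal numpy sketch assuming Gaussian radial basis functions and a ridge-regression signal fit; in the actual model the density parameters (mu, sigma below) are predicted from the query, which is omitted here.

    import numpy as np

    def rbf(t, centers, width):
        # Gaussian radial basis functions evaluated at positions t in [0, 1].
        return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)

    def compress_memory(X, num_basis=32, ridge=1e-3):
        # Fit coefficients B so that rbf(t) @ B approximates the n stored
        # vectors X; memory cost becomes O(num_basis), independent of n.
        n, _ = X.shape
        t = np.linspace(0.0, 1.0, n)
        centers = np.linspace(0.0, 1.0, num_basis)
        Phi = rbf(t, centers, width=1.0 / num_basis)           # (n, num_basis)
        B = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(num_basis), Phi.T @ X)
        return B, centers                                      # B: (num_basis, dim)

    def continuous_attention(B, centers, mu, sigma, grid=128):
        # Attend with a Gaussian density p(t); the result is E_p[x(t)],
        # computed on a fixed grid whose size does not grow with the context.
        t = np.linspace(0.0, 1.0, grid)
        p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
        p /= p.sum()
        Phi = rbf(t, centers, width=1.0 / len(centers))
        return p @ (Phi @ B)                                   # (dim,)

    memory = np.random.randn(10000, 64)    # an arbitrarily long context
    B, centers = compress_memory(memory)
    ctx = continuous_attention(B, centers, mu=0.9, sigma=0.05)  # focus near the end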

[2] MergeBERT: Program Merge Conflict Resolution via Neural Transformers
Link: https://arxiv.org/abs/2109.00084
Authors: Alexey Svyatkovskiy, Todd Mytcowicz, Negar Ghorbani, Sarah Fakhoury, Elizabeth Dinella, Christian Bird, Neel Sundaresan, Shuvendu Lahiri
Affiliations: Microsoft Cloud and AI; Microsoft Research; UC Irvine; Washington State University; University of Pennsylvania
Note: 16 pages, 7 figures
Abstract: Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. In this paper, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. Exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 64--69% precision of merge resolution synthesis, yielding nearly a 2x performance improvement over existing structured and neural program merge tools. Finally, we demonstrate the versatility of our model, which is able to perform program merge in a multilingual setting with Java, JavaScript, TypeScript, and C# programming languages, generalizing zero-shot to unseen languages.

[3] Sentence Bottleneck Autoencoders from Transformer Language Models
Link: https://arxiv.org/abs/2109.00055
Authors: Ivan Montero, Nikolaos Pappas, Noah A. Smith
Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence
Abstract: Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the objective of learning to encode each input as a vector that allows full reconstruction. Autoencoders are attractive because of their latent space structure and generative properties. We therefore explore the construction of a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer (an example of controlled generation), and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.

QA|VQA|Question Answering|Dialogue (1 paper)

[1] Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs
Link: https://arxiv.org/abs/2109.00269
Authors: Pierre Marion, Paweł Krzysztof Nowak, Francesco Piccinno
Affiliations: Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation (LPSM), Paris, France; Google Research
Note: 16 pages, 1 figure. Accepted to EMNLP 2021
Abstract: We tackle the problem of weakly-supervised conversational Question Answering over large Knowledge Graphs using a neural semantic parsing approach. We introduce a new Logical Form (LF) grammar that can model a wide range of queries on the graph while remaining sufficiently simple to generate supervision data efficiently. Our Transformer-based model takes a JSON-like structure as input, allowing us to easily incorporate both Knowledge Graph and conversational contexts. This structured input is transformed to lists of embeddings and then fed to standard attention layers. We validate our approach, both in terms of grammar coverage and LF execution accuracy, on two publicly available datasets, CSQA and ConvQuestions, both grounded in Wikidata. On CSQA, our approach increases the coverage from $80\%$ to $96.2\%$, and the LF execution accuracy from $70.6\%$ to $75.6\%$, with respect to previous state-of-the-art results. On ConvQuestions, we achieve competitive results with respect to the state-of-the-art.

Machine Translation (2 papers)

[1] Survey of Low-Resource Machine Translation
Link: https://arxiv.org/abs/2109.00486
Authors: Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, Alexandra Birch
Affiliations: University of Edinburgh; Inria, France
Abstract: We present a survey covering the state of the art in low-resource machine translation. There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a high level summary of this topical field and provide an overview of best practices.

[2] Masked Adversarial Generation for Neural Machine Translation
Link: https://arxiv.org/abs/2109.00417
Authors: Badr Youbi Idrissi, Stéphane Clinchant
Affiliations: Centrale Supelec; Naver Labs Europe
Note: 5 pages
Abstract: Attacking Neural Machine Translation models is an inherently combinatorial task on discrete sequences, solved with approximate heuristics. Most methods use the gradient to attack the model on each sample independently. Instead of mechanically applying the gradient, could we learn to produce meaningful adversarial attacks? In contrast to existing approaches, we learn to attack a model by training an adversarial generator based on a language model. We propose the Masked Adversarial Generation (MAG) model, which learns to perturb the translation model throughout the training process. The experiments show that it improves the robustness of machine translation models, while being faster than competing methods.

Graph|Knowledge Graph|Knowledge (2 papers)

[1] Discourse Analysis of Covid-19 in Persian Twitter Social Networks Using Graph Mining and Natural Language Processing
Link: https://arxiv.org/abs/2109.00298
Authors: Omid Shokrollahi, Niloofar Hashemi, Mohammad Dehghani
Affiliations: Amirkabir University of Technology, Tehran, Iran; Allame Tabatabaei University, Tehran, Iran; Tarbiat Modarres University, Tehran, Iran
Abstract: One of the new scientific ways of understanding discourse dynamics is analyzing the public data of social networks. This research's aim is Post-structuralist Discourse Analysis (PDA) of the Covid-19 phenomenon (inspired by Laclau and Mouffe's Discourse Theory) by using Intelligent Data Mining for Persian Society. The examined big data is five million tweets from 160,000 users of the Persian Twitter network, used to compare two discourses. Besides analyzing the tweet texts individually, a social network graph database has been created based on retweet relationships. We use the VoteRank algorithm to introduce and rank people whose posts become word of mouth, provided that the total information spreading scope is maximized over the network. These users are also clustered according to their word usage pattern (the Gaussian Mixture Model is used). The constructed discourse of influential spreaders is compared to the most active users. This analysis is done based on Covid-related posts over eight episodes. Also, by relying on the statistical content analysis and polarity of tweet words, discourse analysis is done for the whole mentioned subpopulations, especially for the top individuals. The most important result of this research is that the Twitter subjects' discourse construction is government-based rather than community-based. The analyzed Iranian society does not consider itself responsible for the Covid-19 wicked problem, does not believe in participation, and expects the government to solve all problems. The most active and most influential users' similarity is that political, national, and critical discourse construction is the predominant one. In addition to the advantages of its research methodology, it is necessary to pay attention to the study's limitations. Suggestions for future encounters of Iranian society with similar crises are given.
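
Both building blocks named in the abstract are available off the shelf: networkx implements VoteRank and scikit-learn implements Gaussian mixtures. A minimal sketch with a toy retweet graph and TF-IDF word-usage features (the graph and feature choices are illustrative, not the paper's exact setup):

    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.mixture import GaussianMixture

    # Toy retweet graph: an edge u -> v means user u retweeted user v.
    G = nx.DiGraph([("a", "b"), ("c", "b"), ("d", "b"), ("d", "e"), ("a", "e")])
    spreaders = nx.voterank(G, number_of_nodes=2)   # top word-of-mouth spreaders

    # Cluster users by word-usage pattern (one document = all tweets of a user).
    user_docs = ["vaccine lockdown policy", "government economy policy",
                 "vaccine symptoms hospital", "economy market inflation",
                 "lockdown school hospital"]
    X = TfidfVectorizer().fit_transform(user_docs).toarray()
    clusters = GaussianMixture(n_components=2, random_state=0).fit_predict(X)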

[2] Interactive Machine Comprehension with Dynamic Knowledge Graphs
Link: https://arxiv.org/abs/2109.00077
Authors: Xingdi Yuan
Affiliations: Microsoft Research, Montréal
Note: Accepted at EMNLP 2021
Abstract: Interactive machine reading comprehension (iMRC) is a machine comprehension task where knowledge sources are partially observable. An agent must interact with an environment sequentially to gather necessary knowledge in order to answer a question. We hypothesize that graph representations are good inductive biases, which can serve as an agent's memory mechanism in iMRC tasks. We explore four different categories of graphs that can capture text information at various levels. We describe methods that dynamically build and update these graphs during information gathering, as well as neural models to encode graph representations in RL agents. Extensive experiments on iSQuAD suggest that graph representations can result in significant performance improvements for RL agents.

Reasoning|Analysis|Understanding|Interpretation (3 papers)

[1] Position Masking for Improved Layout-Aware Document Understanding
Link: https://arxiv.org/abs/2109.00442
Authors: Anik Saha, Catherine Finegan-Dollak, Ashish Verma
Affiliations: Rensselaer Polytechnic Institute, Troy, NY, USA; IBM Research, Yorktown Heights, NY, USA
Note: Document Intelligence Workshop at KDD, 2021
Abstract: Natural language processing for document scans and PDFs has the potential to enormously improve the efficiency of business processes. Layout-aware word embeddings such as LayoutLM have shown promise for classification of and information extraction from such documents. This paper proposes a new pre-training task called position masking that can improve performance of layout-aware word embeddings that incorporate 2-D position embeddings. We compare models pre-trained with only language masking against models pre-trained with both language masking and position masking, and we find that position masking improves performance by over 5% on a form understanding task.
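
The abstract states the task at a high level; as one hedged reading, position masking can be realized by replacing a token's 2-D position embedding with a shared learnable [POS_MASK] vector and predicting the original bucketed coordinates, analogous to language masking. A torch sketch in which every module and parameter name is an assumption:

    import torch
    import torch.nn as nn

    class PositionMasking(nn.Module):
        def __init__(self, num_buckets=1024, dim=768, mask_prob=0.15):
            super().__init__()
            self.x_emb = nn.Embedding(num_buckets, dim)      # bucketed x coordinate
            self.y_emb = nn.Embedding(num_buckets, dim)      # bucketed y coordinate
            self.pos_mask = nn.Parameter(torch.zeros(dim))   # shared [POS_MASK] vector
            self.x_head = nn.Linear(dim, num_buckets)        # predicts masked x bucket
            self.y_head = nn.Linear(dim, num_buckets)
            self.mask_prob = mask_prob

        def forward(self, x_ids, y_ids):
            pos = self.x_emb(x_ids) + self.y_emb(y_ids)                # (B, S, dim)
            masked = torch.rand(x_ids.shape, device=x_ids.device) < self.mask_prob
            pos[masked] = self.pos_mask                                # hide true positions
            return pos, masked

        def loss(self, hidden, masked, x_ids, y_ids):
            # Cross-entropy on the masked positions only, analogous to MLM.
            ce = nn.functional.cross_entropy
            return (ce(self.x_head(hidden[masked]), x_ids[masked]) +
                    ce(self.y_head(hidden[masked]), y_ids[masked]))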

[2] Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis
Link: https://arxiv.org/abs/2109.00412
Authors: Wei Han, Hui Chen, Soujanya Poria
Affiliations: Singapore University of Technology and Design, Singapore
Note: Accepted as a long paper at EMNLP 2021
Abstract: In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of synthesized embeddings. These embeddings are generated from the upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric property of feature spaces to produce favorable fusion results, which neglects the preservation of critical task-related information that flows from input to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between multimodal fusion result and unimodal input in order to maintain task-related information through multimodal fusion. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task. To address the intractable issue of MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their truth value. Experimental results on the two widely used datasets demonstrate the efficacy of our approach. The implementation of this work is publicly available at https://github.com/declare-lab/Multimodal-Infomax.
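
The intractable MI terms are typically replaced by tractable lower bounds; one standard parametric choice is the InfoNCE bound over a batch of paired representations. A minimal sketch of that single ingredient (the paper combines several parametric and non-parametric estimators):

    import torch
    import torch.nn.functional as F

    def infonce_lower_bound(x, y, temperature=0.1):
        # x, y: (batch, dim) paired representations, e.g. a unimodal input
        # and the fusion result; maximizing this tightens a lower bound
        # on the mutual information I(x; y).
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        logits = x @ y.t() / temperature                   # (batch, batch)
        targets = torch.arange(x.size(0), device=x.device)
        return -F.cross_entropy(logits, targets)

    # Joint training would then minimize, e.g.:
    #   loss = task_loss - alpha * (infonce_lower_bound(text, fusion) +
    #                               infonce_lower_bound(audio, fusion) +
    #                               infonce_lower_bound(video, fusion))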

[3] FinQA: A Dataset of Numerical Reasoning over Financial Data
Link: https://arxiv.org/abs/2109.00122
Authors: Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, William Yang Wang
Affiliations: University of California, Santa Barbara; J.P. Morgan; Pennsylvania State University; Carnegie Mellon University
Note: EMNLP 2021
Abstract: The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments in our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset -- the first of its kind -- should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA.

GAN|Adversarial|Attacks|Generation (2 papers)

[1] ConRPG: Paraphrase Generation using Contexts as Regularizer
Link: https://arxiv.org/abs/2109.00363
Authors: Yuxian Meng, Xiang Ao, Qing He, Xiaofei Sun, Qinghong Han, Fei Wu, Chun Fan, Jiwei Li
Affiliations: Zhejiang University; Computer Center of Peking University; Peng Cheng Laboratory; Key Lab of Intelligent Information Processing of Chinese Academy of Sciences; Shannon.AI
Note: To appear at EMNLP 2021
Abstract: A long-standing issue with paraphrase generation is how to obtain reliable supervision signals. In this paper, we propose an unsupervised paradigm for paraphrase generation based on the assumption that the probabilities of generating two sentences with the same meaning given the same context should be the same. Inspired by this fundamental idea, we propose a pipelined system which consists of paraphrase candidate generation based on contextual language models, candidate filtering using scoring functions, and paraphrase model training based on the selected candidates. The proposed paradigm offers merits over existing paraphrase generation methods: (1) using the context regularizer on meanings, the model is able to generate massive amounts of high-quality paraphrase pairs; and (2) using human-interpretable scoring functions to select paraphrase pairs from candidates, the proposed framework provides a channel for developers to intervene with the data generation process, leading to a more controllable model. Experimental results across different tasks and datasets demonstrate the effectiveness of the proposed model in both supervised and unsupervised setups.

[2] OptAGAN: Entropy-based finetuning on text VAE-GAN
Link: https://arxiv.org/abs/2109.00239
Authors: Paolo Tirotta, Stefano Lodi
Affiliations: Department of Statistics and Department of Computer Science, Alma Mater Studiorum, University of Bologna
Note: 11 pages, 5 figures, 8 tables
Abstract: Transfer learning through large pre-trained models has changed the landscape of current applications in natural language processing (NLP). Recently Optimus, a variational autoencoder (VAE) which combines two pre-trained models, BERT and GPT-2, has been released, and its combination with generative adversarial networks (GANs) has been shown to produce novel, yet very human-looking text. The Optimus and GANs combination avoids the troublesome application of GANs to the discrete domain of text, and prevents the exposure bias of standard maximum likelihood methods. We combine the training of GANs in the latent space, with the finetuning of the decoder of Optimus for single word generation. This approach lets us model both the high-level features of the sentences, and the low-level word-by-word generation. We finetune using reinforcement learning (RL) by exploiting the structure of GPT-2 and by adding entropy-based intrinsically motivated rewards to balance between quality and diversity. We benchmark the results of the VAE-GAN model, and show the improvements brought by our RL finetuning on three widely used datasets for text generation, with results that greatly surpass the current state-of-the-art for the quality of the generated texts.
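
The entropy-based intrinsic reward can be read straight off the decoder logits. A sketch, assuming a per-sequence reward that adds the scaled mean token entropy to an external quality score (the coefficient and reward shaping are illustrative):

    import torch

    def entropy_bonus(logits):
        # logits: (batch, seq, vocab) from the decoder (e.g. GPT-2).
        # Per-token entropy of the next-token distribution rewards diversity.
        logp = torch.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(dim=-1)            # (batch, seq)

    def rl_reward(quality, logits, beta=0.1):
        # quality: (batch,) external score of each generated text.
        return quality + beta * entropy_bonus(logits).mean(dim=-1)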

Semi-/Weakly-/Unsupervised|Uncertainty (3 papers)

[1] Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation
Link: https://arxiv.org/abs/2109.00194
Authors: Liyan Xu, Xuchao Zhang, Xujiang Zhao, Haifeng Chen, Feng Chen, Jinho D. Choi
Affiliations: Emory University; University of Texas at Dallas
Note: Accepted to EMNLP 2021
Abstract: Recent multilingual pre-trained language models have achieved remarkable zero-shot performance, where the model is only finetuned on one source language and directly evaluated on target languages. In this work, we propose a self-learning framework that further utilizes unlabeled data of target languages, combined with uncertainty estimation in the process to select high-quality silver labels. Three different uncertainties are adapted and analyzed specifically for the cross-lingual transfer: Language Heteroscedastic/Homoscedastic Uncertainty (LEU/LOU), Evidential Uncertainty (EVI). We evaluate our framework with uncertainties on two cross-lingual tasks including Named Entity Recognition (NER) and Natural Language Inference (NLI) covering 40 languages in total, which outperforms the baselines significantly by 10 F1 on average for NER and 2.5 accuracy score for NLI.
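
The selection step boils down to thresholding an uncertainty score over predictions on unlabeled target-language data. A sketch using Monte-Carlo dropout variance as a stand-in score; the paper's LEU/LOU and evidential scores would be drop-in replacements, and the model interface here is an assumption:

    import torch

    @torch.no_grad()
    def mc_dropout(model, batch, passes=8):
        model.train()                        # keep dropout active at inference
        # Assumes model(batch) returns class logits of shape (batch, classes).
        probs = torch.stack([model(batch).softmax(-1) for _ in range(passes)])
        uncertainty = probs.var(0).sum(-1)   # higher variance = less reliable
        return probs.mean(0).argmax(-1), uncertainty

    def select_silver(model, unlabeled_batches, threshold=0.05):
        silver = []
        for batch in unlabeled_batches:      # each batch assumed to be a tensor
            labels, u = mc_dropout(model, batch)
            keep = u < threshold             # keep only confident predictions
            silver.append((batch[keep], labels[keep]))
        return silver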

[2] Evaluating Predictive Uncertainty under Distributional Shift on Dialogue Dataset
Link: https://arxiv.org/abs/2109.00186
Authors: Nyoungwoo Lee, ChaeHun Park, Ho-Jin Choi
Affiliations: KAIST, Daejeon, South Korea
Abstract: In open-domain dialogues, predictive uncertainties are mainly evaluated in a domain shift setting to cope with out-of-distribution inputs. However, in real-world conversations, there could be more extensive distributional shifted inputs than the out-of-distribution. To evaluate this, we first propose two methods, Unknown Word (UW) and Insufficient Context (IC), enabling gradual distributional shifts by corruption on the dialogue dataset. We then investigate the effect of distributional shifts on accuracy and calibration. Our experiments show that the performance of existing uncertainty estimation methods consistently degrades with intensifying the shift. The results suggest that the proposed methods could be useful for evaluating the calibration of dialogue systems under distributional shifts.
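
The two corruptions are simple to state concretely. A sketch treating the corruption rate as the shift intensity and assuming whitespace tokenization (the paper's exact definitions may differ):

    import random

    def unknown_word(tokens, rate):
        # UW: replace a fraction of words with an out-of-vocabulary marker.
        return [t if random.random() > rate else "[UNK]" for t in tokens]

    def insufficient_context(turns, rate):
        # IC: drop the earliest dialogue turns, keeping only recent context.
        keep = max(1, int(len(turns) * (1.0 - rate)))
        return turns[-keep:]

    dialogue = ["hi there", "how are you", "fine thanks", "what about you"]
    print(unknown_word("what about you".split(), rate=0.3))
    print(insufficient_context(dialogue, rate=0.5))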

[3] An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages
Link: https://arxiv.org/abs/2109.00165
Authors: Xinyu Lu, Jipeng Qiang, Yun Li, Yunhao Yuan, Yi Zhu
Affiliations: Department of Computer Science, Yangzhou University, Jiangsu, China
Abstract: The availability of parallel sentence simplification (SS) corpora is scarce for neural SS modeling. We propose an unsupervised method to build SS corpora from large-scale bilingual translation corpora, alleviating the need for SS supervised corpora. Our method is motivated by the following two findings: neural machine translation models usually tend to generate more high-frequency tokens, and a difference in text complexity levels exists between the source and target language of a translation corpus. By taking the pair of the source sentences of a translation corpus and the translations of their references in a bridge language, we can construct large-scale pseudo-parallel SS data. Then, we keep the sentence pairs with a higher complexity difference as SS sentence pairs. Building SS corpora with an unsupervised approach satisfies the expectation that the aligned sentences preserve the same meaning and differ in text complexity levels. Experimental results show that SS methods trained on our corpora achieve state-of-the-art results and significantly outperform the results on the English benchmark WikiLarge.
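
The complexity filter can be approximated with word-frequency statistics, in line with the paper's observation that simpler text uses more high-frequency tokens. A sketch using the third-party wordfreq package's Zipf scale as the complexity proxy (the scoring and threshold are assumptions):

    from wordfreq import zipf_frequency

    def complexity(sentence, lang="en"):
        # Lower average Zipf frequency = rarer words = more complex text,
        # so negate the mean to obtain a complexity score.
        words = sentence.lower().split()
        return -sum(zipf_frequency(w, lang) for w in words) / max(len(words), 1)

    def keep_as_ss_pair(source, pivot_translation, min_gap=0.3):
        # Keep the aligned pair only if the complexity gap is large enough;
        # the simpler side becomes the simplification target.
        return complexity(source) - complexity(pivot_translation) > min_gap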

Recognition/Classification (3 papers)

[1] Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts
Link: https://arxiv.org/abs/2109.00343
Authors: Isabel Segura-Bedmar, David Camino-Perdonas, Sara Guerrero-Aspizua
Affiliations: Human Language and Accessibility Technologies, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Abstract: Although rare diseases are characterized by low prevalence, approximately 300 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who do not have enough knowledge to identify them. In addition to this, rare diseases usually show a wide variety of manifestations, which might make the diagnosis even more difficult. A delayed diagnosis can negatively affect the patient's life. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and Deep Learning can help to extract relevant information about rare diseases to facilitate their diagnosis and treatments. The paper explores the use of several deep learning techniques such as Bidirectional Long Short Term Memory (BiLSTM) networks or deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT) to recognize rare diseases and their clinical manifestations (signs and symptoms) in the RareDis corpus. This corpus contains more than 5,000 rare diseases and almost 6,000 clinical manifestations. BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results. In particular, this model obtains an F1-score of 85.2% for rare diseases, outperforming all the other models.
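
The best-performing setup, BioBERT fine-tuned for token classification, maps directly onto the Hugging Face API. A sketch with an illustrative BIO label set (the RareDis tag inventory is richer, and the classification head below is untrained):

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-RAREDISEASE", "I-RAREDISEASE", "B-SIGN", "I-SIGN",
              "B-SYMPTOM", "I-SYMPTOM"]
    name = "dmis-lab/biobert-base-cased-v1.1"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(
        name, num_labels=len(labels))

    enc = tokenizer("Epidermolysis bullosa causes fragile, blistering skin.",
                    return_tensors="pt")
    pred = model(**enc).logits.argmax(-1)   # one label id per subword token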

[2] Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments
Link: https://arxiv.org/abs/2109.00227
Authors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Rahul Ponnusamy, Prasanna Kumar Kumaresan, Kayalvizhi Sampath, Durairaj Thenmozhi, Sathiyaraj Thangasamy, Rajendran Nallathambi, John Phillip McCrae
Affiliations: Insight SFI Research Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland; ULTRA Arts and Science College, Madurai, Tamil Nadu, India
Note: 44 pages
Abstract: The increased proliferation of abusive content on social media platforms has a negative impact on online users. The dread, dislike, discomfort, or mistrust of lesbian, gay, transgender or bisexual persons is defined as homophobia/transphobia. Homophobic/transphobic speech is a type of offensive language that may be summarized as hate speech directed toward LGBT+ people, and it has been a growing concern in recent years. Online homophobia/transphobia is a severe societal problem that can make online platforms poisonous and unwelcome to LGBT+ people while also attempting to eliminate equality, diversity, and inclusion. We provide a new hierarchical taxonomy for online homophobia and transphobia, as well as an expert-labelled dataset that will allow homophobic/transphobic content to be automatically identified. We educated annotators and supplied them with comprehensive annotation rules because this is a sensitive issue, and we previously discovered that untrained crowdsourcing annotators struggle with diagnosing homophobia due to cultural and other prejudices. The dataset comprises 15,141 annotated multilingual comments. This paper describes the process of building the dataset, qualitative analysis of data, and inter-annotator agreement. In addition, we create baseline models for the dataset. To the best of our knowledge, our dataset is the first such dataset created. Warning: This paper contains explicit statements of homophobia, transphobia, stereotypes which may be distressing to some readers.

[3] What Have Been Learned & What Should Be Learned? An Empirical Study of How to Selectively Augment Text for Classification
Link: https://arxiv.org/abs/2109.00175
Authors: Biyang Guo, Sonqiao Han, Hailiang Huang
Affiliations: AI Lab, School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China
Abstract: Text augmentation techniques are widely used in text classification problems to improve the performance of classifiers, especially in low-resource scenarios. Whilst lots of creative text augmentation methods have been designed, they augment the text in a non-selective manner, which means the less important or noisy words have the same chances to be augmented as the informative words, and thereby limits the performance of augmentation. In this work, we systematically summarize three kinds of role keywords, which have different functions for text classification, and design effective methods to extract them from the text. Based on these extracted role keywords, we propose STA (Selective Text Augmentation) to selectively augment the text, where the informative, class-indicating words are emphasized but the irrelevant or noisy words are diminished. Extensive experiments on four English and Chinese text classification benchmark datasets demonstrate that STA can substantially outperform the non-selective text augmentation methods.
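
Selectivity hinges on scoring how class-indicating each word is and perturbing only the rest. A sketch with a simple per-word purity score and synonym replacement; the paper's three role-keyword definitions and augmentation operations are more elaborate:

    import random
    from collections import Counter

    def class_indicating_scores(texts, labels):
        # Score a word by how concentrated its document frequency is
        # in a single class (1.0 = the word appears in only one class).
        per_class, total = {}, Counter()
        for text, y in zip(texts, labels):
            for w in set(text.split()):
                per_class.setdefault(y, Counter())[w] += 1
                total[w] += 1
        return {w: max(c[w] for c in per_class.values()) / total[w] for w in total}

    def selective_augment(text, scores, synonyms, threshold=0.8):
        out = []
        for w in text.split():
            if scores.get(w, 0.0) >= threshold:
                out.append(w)                                    # keep class-indicating words
            else:
                out.append(random.choice(synonyms.get(w, [w])))  # perturb the rest
        return " ".join(out)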

Zero/Few/One-Shot|Transfer|Adaptation (1 paper)

[1] Adapted End-to-End Coreference Resolution System for Anaphoric Identities in Dialogues
Link: https://arxiv.org/abs/2109.00185
Authors: Liyan Xu, Jinho D. Choi
Affiliations: Computer Science, Emory University, Atlanta, GA
Note: Submitted to CRAC 2021
Abstract: We present an effective system adapted from the end-to-end neural coreference resolution model, targeting on the task of anaphora resolution in dialogues. Three aspects are specifically addressed in our approach, including the support of singletons, encoding speakers and turns throughout dialogue interactions, and knowledge transfer utilizing existing resources. Despite the simplicity of our adaptation strategies, they are shown to bring significant impact to the final performance, with up to 27 F1 improvement over the baseline. Our final system ranks the 1st place on the leaderboard of the anaphora resolution track in the CRAC 2021 shared task, and achieves the best evaluation results on all four datasets.

Representation (3 papers)

[1] Discovering Representation Sprachbund For Multilingual Pre-Training
Link: https://arxiv.org/abs/2109.00271
Authors: Yimin Fan, Yaobo Liang, Alexandre Muzio, Hany Hassan, Houqiang Li, Ming Zhou, Nan Duan
Affiliations: University of Science and Technology of China; Microsoft Research Asia; Microsoft; Sinovation Ventures
Note: To appear in the Findings of EMNLP 2021
Abstract: Multilingual pre-trained models have demonstrated their effectiveness in many multilingual NLP tasks and enabled zero-shot or few-shot transfer from high-resource languages to low resource ones. However, due to significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cross-lingual settings, which shows the difficulty of learning a single model to handle massive diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representation from multilingual pre-trained models and conduct linguistic analysis to show that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics and syntax. Then we cluster all the target languages into multiple groups and name each group as a representation sprachbund. Thus, languages in the same representation sprachbund are supposed to boost each other in both pre-training and fine-tuning as they share rich linguistic similarity. We pre-train one multilingual model for each representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.

[2] Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast
Link: https://arxiv.org/abs/2109.00253
Authors: Liang Wang, Wei Zhao, Jingming Liu
Affiliations: Yuanfudao AI Lab, Beijing, China
Note: Accepted to EMNLP 2021 main conference
Abstract: In this paper, we propose to align sentence representations from different languages into a unified embedding space, where semantic similarities (both cross-lingual and monolingual) can be computed with a simple dot product. Pre-trained language models are fine-tuned with the translation ranking task. Existing work (Feng et al., 2020) uses sentences within the same batch as negatives, which can suffer from the issue of easy negatives. We adapt MoCo (He et al., 2020) to further improve the quality of alignment. As the experimental results show, the sentence representations produced by our model achieve the new state-of-the-art on several tasks, including Tatoeba en-zh similarity search (Artetxe and Schwenk, 2019b), BUCC en-zh bitext mining, and semantic textual similarity on 7 datasets.
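
The MoCo adaptation keeps a large queue of negatives produced by a slowly updated momentum encoder, so the contrastive loss no longer depends on easy in-batch negatives. A minimal torch sketch of that mechanism applied to translation ranking; the two encoders are placeholders and all hyperparameters are illustrative:

    import torch
    import torch.nn.functional as F

    class MomentumAligner:
        def __init__(self, encoder_q, encoder_k, dim, size=4096, m=0.999, tau=0.05):
            self.q, self.k, self.m, self.tau = encoder_q, encoder_k, m, tau
            self.queue = F.normalize(torch.randn(size, dim), dim=-1)  # negatives

        @torch.no_grad()
        def _momentum_update(self):
            # Key encoder trails the query encoder: k <- m*k + (1-m)*q.
            for pq, pk in zip(self.q.parameters(), self.k.parameters()):
                pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)

        def loss(self, src_batch, tgt_batch):
            q = F.normalize(self.q(src_batch), dim=-1)        # source sentences
            with torch.no_grad():
                self._momentum_update()
                k = F.normalize(self.k(tgt_batch), dim=-1)    # their translations
            pos = (q * k).sum(-1, keepdim=True)               # (batch, 1)
            neg = q @ self.queue.t()                          # (batch, size)
            logits = torch.cat([pos, neg], dim=1) / self.tau
            target = torch.zeros(q.size(0), dtype=torch.long)   # positive at index 0
            self.queue = torch.cat([k, self.queue])[:self.queue.size(0)]  # enqueue
            return F.cross_entropy(logits, target)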

[3] Sense representations for Portuguese: experiments with sense embeddings and deep neural language models
Link: https://arxiv.org/abs/2109.00025
Authors: Jessica Rodrigues da Silva, Helena de Medeiros Caseli
Affiliations: Federal University of São Carlos (UFSCar)
Abstract: Sense representations have gone beyond word representations like Word2Vec, GloVe and FastText and achieved innovative performance on a wide range of natural language processing tasks. Although very useful in many applications, the traditional approaches for generating word embeddings have a strict drawback: they produce a single vector representation for a given word ignoring the fact that ambiguous words can assume different meanings. In this paper, we explore unsupervised sense representations which, different from traditional word embeddings, are able to induce different senses of a word by analyzing its contextual semantics in a text. The unsupervised sense representations investigated in this paper are: sense embeddings and deep neural language models. We present the first experiments carried out for generating sense embeddings for Portuguese. Our experiments show that the sense embedding model (Sense2vec) outperformed traditional word embeddings in the syntactic and semantic analogies task, proving that the language resource generated here can improve the performance of NLP tasks in Portuguese. We also evaluated the performance of pre-trained deep neural language models (ELMo and BERT) in two transfer learning approaches: feature based and fine-tuning, in the semantic textual similarity task. Our experiments indicate that the fine-tuned Multilingual and Portuguese BERT language models were able to achieve better accuracy than the ELMo model and baselines.

Word2Vec|Text|Words (1 paper)

[1] Extracting all Aspect-polarity Pairs Jointly in a Text with Relation Extraction Approach
Link: https://arxiv.org/abs/2109.00256
Authors: Lingmei Bu, Li Chen, Yongmei Lu, Zhonghua Yu
Affiliations: Department of Computer Science, Sichuan University, China
Abstract: Extracting aspect-polarity pairs from texts is an important task of fine-grained sentiment analysis. While the existing approaches to this task have made much progress, they are limited in capturing relationships among aspect-polarity pairs in a text, thus degrading extraction performance. Moreover, the existing state-of-the-art approaches, namely token-based sequence tagging and span-based classification, have their own defects, such as polarity inconsistency resulting from separately tagging tokens in the former, and heterogeneous categorization in the latter, where aspect-related and polarity-related labels are mixed. In order to remedy the above defects, inspired by recent advancements in relation extraction, we propose to generate aspect-polarity pairs directly from a text with relation extraction technology, regarding aspect-polarity pairs as unary relations where aspects are entities and the corresponding polarities are relations. Based on this perspective, we present a position- and aspect-aware sequence2sequence model for the joint extraction of aspect-polarity pairs. The model is characterized by its ability to capture not only relationships among aspect-polarity pairs in a text through sequence decoding, but also correlations between an aspect and its polarity through position- and aspect-aware attentions. Experiments performed on three benchmark datasets demonstrate that our model outperforms the existing state-of-the-art approaches, making significant improvements over them.

Other Neural Networks|Deep Learning|Models|Modeling (4 papers)

[1] Chronic Pain and Language: A Topic Modelling Approach to Personal Pain Descriptions
Link: https://arxiv.org/abs/2109.00402
Authors: Diogo A. P. Nunes, David Martins de Matos, Joana Ferreira Gomes, Fani Neto
Affiliations: INESC-ID, Lisbon, Portugal; Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Biomedicina, Unidade de Biologia Experimental, Faculdade de Medicina, Universidade do Porto, Porto, Portugal
Note: 9 pages, 5 figures, 6 tables
Abstract: Chronic pain is recognized as a major health problem, with impacts not only at the economic, but also at the social, and individual levels. Being a private and subjective experience, it is impossible to externally and impartially experience, describe, and interpret chronic pain as a purely noxious stimulus that would directly point to a causal agent and facilitate its mitigation, contrary to acute pain, the assessment of which is usually straightforward. Verbal communication is, thus, key to convey relevant information to health professionals that would otherwise not be accessible to external entities, namely, intrinsic qualities about the painful experience and the patient. We propose and discuss a topic modelling approach to recognize patterns in verbal descriptions of chronic pain, and use these patterns to quantify and qualify experiences of pain. Our approaches allow for the extraction of novel insights on chronic pain experiences from the obtained topic models and latent spaces. We argue that our results are clinically relevant for the assessment and management of chronic pain.
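
The topic-modelling step itself is standard; a sketch with scikit-learn's LDA over toy pain descriptions (vectorizer settings and the topic count are illustrative):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    descriptions = ["burning pain down the left leg at night",
                    "dull ache in the lower back when sitting",
                    "sharp stabbing pain in the shoulder when lifting",
                    "constant throbbing headache behind the eyes"]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(descriptions)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        print(f"topic {i}:", [terms[j] for j in topic.argsort()[-4:]])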

[2] Problem Learning: Towards the Free Will of Machines
Link: https://arxiv.org/abs/2109.00177
Authors: Yongfeng Zhang
Affiliations: Department of Computer Science, Rutgers University, New Brunswick, NJ
Note: 17 pages, 1 figure
Abstract: A machine intelligence pipeline usually consists of six components: problem, representation, model, loss, optimizer and metric. Researchers have worked hard trying to automate many components of the pipeline. However, one key component of the pipeline--problem definition--is still left mostly unexplored in terms of automation. Usually, it requires extensive efforts from domain experts to identify, define and formulate important problems in an area. However, automatically discovering research or application problems for an area is beneficial since it helps to identify valid and potentially important problems hidden in data that are unknown to domain experts, expand the scope of tasks that we can do in an area, and even inspire completely new findings. This paper describes Problem Learning, which aims at learning to discover and define valid and ethical problems from data or from the machine's interaction with the environment. We formalize problem learning as the identification of valid and ethical problems in a problem space and introduce several possible approaches to problem learning. In a broader sense, problem learning is an approach towards the free will of intelligent machines. Currently, machines are still limited to solving the problems defined by humans, without the ability or flexibility to freely explore various possible problems that are even unknown to humans. Though many machine learning techniques have been developed and integrated into intelligent systems, they still focus on the means rather than the purpose in that machines are still solving human defined problems. However, proposing good problems is sometimes even more important than solving problems, because a good problem can help to inspire new ideas and gain deeper understandings. The paper also discusses the ethical implications of problem learning under the background of Responsible AI.

[3] Effectiveness of Deep Networks in NLP using BiDAF as an example architecture
Link: https://arxiv.org/abs/2109.00074
Authors: Soumyendu Sarkar
Affiliations: Stanford Center for Professional Development
Abstract: Question Answering with NLP has progressed through the evolution of advanced model architectures like BERT and BiDAF and earlier word, character, and context-based embeddings. As BERT has leapfrogged the accuracy of models, an element of the next frontier can be the introduction of deep networks and an effective way to train them. In this context, I explored the effectiveness of deep networks focussing on the model encoder layer of BiDAF. BiDAF with its heterogeneous layers provides the opportunity not only to explore the effectiveness of deep networks but also to evaluate whether the refinements made in lower layers are additive to the refinements made in the upper layers of the model architecture. I believe the next greatest model in NLP will in fact fold in a solid language modeling like BERT with a composite architecture which will bring in refinements in addition to generic language modeling and will have a more extensive layered architecture. I experimented with the Bypass network, Residual Highway network, and DenseNet architectures. In addition, I evaluated the effectiveness of ensembling the last few layers of the network. I also studied the difference character embeddings make in adding them to the word embeddings, and whether the effects are additive with deep networks. My studies indicate that deep networks are in fact effective in giving a boost. Also, the refinements in the lower layers like embeddings are passed on additively to the gains made through deep networks.
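
For reference, the residual highway block mentioned in the abstract has this standard form, where a learned gate mixes a transformed representation with the unchanged input; a sketch (the exact variant used in the experiments may differ):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualHighway(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)
            self.gate = nn.Linear(dim, dim)

        def forward(self, x):
            g = torch.sigmoid(self.gate(x))   # how much of x to transform
            h = F.relu(self.transform(x))
            return g * h + (1.0 - g) * x      # carry the rest through unchanged

    # Stacking many such blocks is what "going deep" means here.
    deep_encoder = nn.Sequential(*[ResidualHighway(128) for _ in range(8)])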

[4] Machine-Learning media bias
Link: https://arxiv.org/abs/2109.00024
Authors: Samantha D'Alonzo, Max Tegmark
Affiliations: Dept. of Physics and Institute for AI & Fundamental Interactions
Note: 29 pages, 23 figs; data available at this https URL
Abstract: We present an automated method for measuring media bias. Inferring which newspaper published a given article, based only on the frequencies with which it uses different phrases, leads to a conditional probability distribution whose analysis lets us automatically map newspapers and phrases into a bias space. By analyzing roughly a million articles from roughly a hundred newspapers for bias in dozens of news topics, our method maps newspapers into a two-dimensional bias landscape that agrees well with previous bias classifications based on human judgement. One dimension can be interpreted as traditional left-right bias, the other as establishment bias. This means that although news bias is inherently political, its measurement need not be.
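
The step from phrase frequencies to a two-dimensional bias landscape can be sketched as a rank-2 factorization of the centered log conditional-frequency matrix. The authors' exact estimator may differ, so treat this as an assumption-laden illustration:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # counts[i, j]: how often newspaper i uses phrase j (random stand-in data).
    counts = np.random.default_rng(0).poisson(3.0, size=(100, 5000)).astype(float)

    rates = counts / counts.sum(axis=1, keepdims=True)   # P(phrase | newspaper)
    log_rates = np.log(rates + 1e-9)
    centered = log_rates - log_rates.mean(axis=0)        # remove shared usage

    coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(centered)
    # coords[i] is newspaper i's position in the 2-D bias landscape; the two
    # axes would then be interpreted (e.g. left-right and establishment bias).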

Other (5 papers)

[1] Capturing Stance Dynamics in Social Media: Open Challenges and Research Directions
Link: https://arxiv.org/abs/2109.00475
Authors: Rabab Alkhalifa, Arkaitz Zubiaga
Abstract: Social media platforms provide a goldmine for mining public opinion on issues of wide societal interest. Opinion mining is a problem that can be operationalised by capturing and aggregating the stance of individual social media posts as supporting, opposing or being neutral towards the issue at hand. While most prior work in stance detection has investigated datasets with limited time coverage, interest in investigating longitudinal datasets has recently increased. Evolving dynamics in linguistic and behavioural patterns observed in new data require in turn adapting stance detection systems to deal with the changes. In this survey paper, we investigate the intersection between computational linguistics and the temporal evolution of human communication in digital media. We perform a critical review in emerging research considering dynamics, exploring different semantic and pragmatic factors that impact linguistic data in general, and stance particularly. We further discuss current directions in capturing stance dynamics in social media. We organise the challenges of dealing with stance dynamics, identify open challenges and discuss future directions in three key dimensions: utterance, context and influence.

[2] M^2-MedDialog: A Dataset and Benchmarks for Multi-domain Multi-service Medical Dialogues
Link: https://arxiv.org/abs/2109.00430
Authors: Guojun Yan, Jiahuan Pei, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Huasheng Liang
Affiliations: Shandong University; University of Amsterdam; WeChat, Tencent
Abstract: Medical dialogue systems (MDSs) aim to assist doctors and patients with a range of professional medical services, i.e., diagnosis, consultation, and treatment. However, one-stop MDS is still unexplored because: (1) no existing dataset has large-scale dialogues that contain both multiple medical services and fine-grained medical labels (i.e., intents, slots, values); (2) no model has addressed an MDS based on multiple-service conversations in a unified framework. In this work, we first build a Multiple-domain Multiple-service medical dialogue (M^2-MedDialog) dataset, which contains 1,557 conversations between doctors and patients, covering 276 types of diseases, 2,468 medical entities, and 3 specialties of medical services. To the best of our knowledge, it is the only medical dialogue dataset that includes both multiple medical services and fine-grained medical labels. Then, we formulate a one-stop MDS as a sequence-to-sequence generation problem. We unify an MDS with causal language modeling and conditional causal language modeling, respectively. Specifically, we employ several pretrained models (i.e., BERT-WWM, BERT-MED, GPT2, and MT5) and their variants to get benchmarks on the M^2-MedDialog dataset. We also propose pseudo labeling and natural perturbation methods to expand the M^2-MedDialog dataset and enhance the state-of-the-art pretrained models. We demonstrate the results achieved by the benchmarks so far through extensive experiments on M^2-MedDialog. We release the dataset, the code, as well as the evaluation scripts to facilitate future research in this important research direction.

[3] Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles
Link: https://arxiv.org/abs/2109.00199
Authors: Jennifer D'Souza, Soeren Auer
Affiliations: TIB Leibniz Information Centre for Science and Technology; L3S Research Center, Leibniz University of Hannover
Note: 14 pages, 1 figure. Accepted for publication at ICADL 2021 as a short paper (8 pages)
Abstract: We describe a rule-based approach for the automatic acquisition of scientific entities from scholarly article titles. Two observations motivated the approach: (i) noting the concentration of an article's contribution information in its title; and (ii) capturing information pattern regularities via a system of rules that alleviate the human annotation task in creating gold standards that annotate single instances at a time. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently, and that generally indicate the scientific entity type of interest about the scholarly contribution. A subset of the acquisition algorithm is implemented for article titles in the Computational Linguistics (CL) scholarly domain. The tool called ORKG-Title-Parser, in its first release, identifies the following six concept types of scientific terminology from the CL paper titles, viz. research problem, solution, resource, language, tool, and method. It has been empirically evaluated on a collection of 50,237 titles that cover nearly all articles in the ACL Anthology. It has extracted 19,799 research problems; 18,111 solutions; 20,033 resources; 1,059 languages; 6,878 tools; and 21,687 methods at an average extraction precision of 75%. The code and related data resources are publicly available at https://gitlab.com/TIBHannover/orkg/orkg-title-parser. Finally, in the article, we discuss extensions and applications to areas such as scholarly knowledge graph (SKG) creation.
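
To give a flavor of the rule system, here are lexico-syntactic patterns written as regular expressions over titles; these examples are illustrative stand-ins, not ORKG-Title-Parser's actual rules:

    import re

    PATTERNS = {
        "method":   re.compile(r"\b(?:via|using|with)\s+(?P<e>[\w\s-]+)$", re.I),
        "resource": re.compile(r"^(?P<e>[\w\s-]+):\s+a\s+(?:corpus|dataset)\b", re.I),
        "language": re.compile(r"\bfor\s+(?P<e>\w+)\s+(?:language|texts?)\b", re.I),
    }

    def parse_title(title):
        # Return every entity type whose pattern matches, with the matched span.
        return {etype: m.group("e").strip()
                for etype, p in PATTERNS.items() if (m := p.search(title))}

    print(parse_title("FinQA: A Dataset of Numerical Reasoning over Financial Data"))
    print(parse_title("Program Merge Conflict Resolution via Neural Transformers"))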

[4] It's not Rocket Science: Interpreting Figurative Language in Narratives
Link: https://arxiv.org/abs/2109.00087
Authors: Tuhin Chakrabarty, Yejin Choi, Vered Shwartz
Affiliations: Columbia University; Allen Institute for Artificial Intelligence; Paul G. Allen School of Computer Science & Engineering, University of Washington; University of British Columbia
Abstract: Figurative language is ubiquitous in English. Yet, the vast majority of NLP research focuses on literal language. Existing text representations by design rely on compositionality, while figurative language is often non-compositional. In this paper, we study the interpretation of two non-compositional figurative languages (idioms and similes). We collected datasets of fictional narratives containing a figurative expression along with crowd-sourced plausible and implausible continuations relying on the correct interpretation of the expression. We then trained models to choose or generate the plausible continuation. Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks. We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language: inferring meaning from the context and relying on the constituent words' literal meanings. The knowledge-enhanced models improve the performance on both the discriminative and generative tasks, further bridging the gap from human performance.

[5] Working Memory Connections for LSTM
Link: https://arxiv.org/abs/2109.00020
Authors: Federico Landi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Affiliations: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
Note: Accepted for publication in Neural Networks
Abstract: Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the de facto standard for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influence the gating mechanism directly. In this work, we improve the gate potential by including information coming from the internal cell state. The proposed modification, named Working Memory Connection, consists in adding a learnable nonlinear projection of the cell content into the network gates. This modification can fit into the classical LSTM gates without any assumption on the underlying task, being particularly effective when dealing with longer sequences. Previous research effort in this direction, which goes back to the early 2000s, could not bring a consistent improvement over vanilla LSTM. As part of this paper, we identify a key issue tied to previous connections that heavily limits their effectiveness, hence preventing a successful integration of the knowledge coming from the internal cell state. We show through extensive experimental evaluation that Working Memory Connections constantly improve the performance of LSTMs on a variety of tasks. Numerical results suggest that the cell state contains useful information that is worth including in the gate structure.
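
The proposed change is compact: each gate additionally receives a learnable nonlinear projection of the previous cell state. A sketch of a single cell; the gate arrangement follows the paper's description, but details such as which cell state feeds the output gate may differ:

    import torch
    import torch.nn as nn

    class WMCLSTMCell(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.x2g = nn.Linear(input_size, 4 * hidden_size)
            self.h2g = nn.Linear(hidden_size, 4 * hidden_size)
            self.c2g = nn.Linear(hidden_size, 3 * hidden_size)  # the new connection

        def forward(self, x, state):
            h, c = state
            i, f, g, o = (self.x2g(x) + self.h2g(h)).chunk(4, dim=-1)
            # Working memory connection: a learnable nonlinear projection of
            # the cell content is added to the input, forget and output gates.
            ci, cf, co = torch.tanh(self.c2g(c)).chunk(3, dim=-1)
            i = torch.sigmoid(i + ci)
            f = torch.sigmoid(f + cf)
            o = torch.sigmoid(o + co)
            c = f * c + i * torch.tanh(g)
            h = o * torch.tanh(c)
            return h, c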

