Natural Language Processing Academic Digest [10.19]

WeChat official account: arXiv Daily Academic Digest

Published 2021-10-21 16:24:03


Included in the column: arXiv Daily Academic Digest


cs.CL: 92 papers today

Transformer (7 papers)

【1】 NormFormer: Improved Transformer Pretraining with Extra Normalization
Link: https://arxiv.org/abs/2110.09456

Authors: Sam Shleifer, Jason Weston, Myle Ott
Affiliations: Facebook AI Research
Abstract: During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq: https://github.com/pytorch/fairseq/tree/main/examples/normformer
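The three extra operations named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the fairseq implementation: learned LayerNorm gains and biases, the attention output projection, and residual connections are omitted, and the ReLU activation and the placement of the feed-forward LayerNorm after the activation are assumptions of this sketch. All function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance; gain/bias omitted."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def head_scale(head_outputs, gammas):
    """Head-wise scaling: multiply each head's output vector by its own scalar."""
    return [[g * v for v in head] for head, g in zip(head_outputs, gammas)]

def attention_output(head_outputs, gammas):
    """Scale each head, concatenate the heads, then apply a LayerNorm after attention.
    The output projection of a real layer is omitted (assumption of this sketch)."""
    scaled = head_scale(head_outputs, gammas)
    concat = [v for head in scaled for v in head]
    return layer_norm(concat)

def ffn_with_mid_norm(x, w1, w2):
    """Feed-forward block with a LayerNorm after the first fully connected layer.
    ReLU and the norm-after-activation placement are assumptions."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    hidden = layer_norm(hidden)
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

heads = [[1.0, 2.0], [3.0, 5.0]]          # two heads, two dimensions each
attn = attention_output(heads, gammas=[1.0, 0.5])
ffn = ffn_with_mid_norm([1.0, 2.0], w1=[[1, 0], [0, 1], [1, 1]], w2=[[1, 1, 1]])
```

The per-head scalars add one parameter per head, which is in line with the abstract's remark that the extra operations add very few parameters.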

【2】 SentimentArcs: A Novel Method for Self-Supervised Sentiment Analysis of Time Series Shows SOTA Transformers Can Struggle Finding Narrative Arcs
Link: https://arxiv.org/abs/2110.09454

Authors: Jon Chun
Affiliations: Digital Humanities Colab, Integrated Program for Humane Studies, Kenyon College, Gambier, OH
Comments: 87 pages, 97 figures
Abstract: SOTA Transformer and DNN short text sentiment classifiers report over 97% accuracy on narrow domains like IMDB movie reviews. Real-world performance is significantly lower because traditional models overfit benchmarks and generalize poorly to different or more open domain texts. This paper introduces SentimentArcs, a new self-supervised time series sentiment analysis methodology that addresses the two main limitations of traditional supervised sentiment analysis: limited labeled training datasets and poor generalization. A large ensemble of diverse models provides a synthetic ground truth for self-supervised learning. Novel metrics jointly optimize an exhaustive search across every possible corpus:model combination. The joint optimization over both the corpus and model solves the generalization problem. Simple visualizations exploit the temporal structure in narratives so domain experts can quickly spot trends, identify key features, and note anomalies over hundreds of arcs and millions of data points. To our knowledge, this is the first self-supervised method for time series sentiment analysis and the largest survey directly comparing real-world model performance on long-form narratives.

【3】 Contextual Hate Speech Detection in Code Mixed Text using Transformer Based Approaches
Link: https://arxiv.org/abs/2110.09338

Authors: Ravindra Nayak, Raviraj Joshi
Affiliations: Sri Jayachamarajendra College of Engineering, Mysore; Indian Institute of Technology Madras, Chennai
Comments: Accepted at HASOC @ Forum for Information Retrieval Evaluation (FIRE) 2021
Abstract: In the recent past, social media platforms have helped people in connecting and communicating to a wider audience. But this has also led to a drastic increase in cyberbullying. It is essential to detect and curb hate speech to keep the sanity of social media platforms. Also, code mixed text containing more than one language is frequently used on these platforms. We, therefore, propose automated techniques for hate speech detection in code mixed text from scraped Twitter. We specifically focus on code mixed English-Hindi text and transformer-based approaches. While regular approaches analyze the text independently, we also make use of content text in the form of parent tweets. We try to evaluate the performances of multilingual BERT and Indic-BERT in single-encoder and dual-encoder settings. The first approach is to concatenate the target text and context text using a separator token and get a single representation from the BERT model. The second approach encodes the two texts independently using a dual BERT encoder and the corresponding representations are averaged. We show that the dual-encoder approach using independent representations yields better performance. We also employ simple ensemble methods to further improve the performance. Using these methods we were able to achieve the best F1 score of 73.07% on the HASOC 2021 ICHCL code mixed data set.
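The difference between the single-encoder and dual-encoder settings described in this abstract can be sketched as follows. The `toy_encode` function is a hypothetical stand-in for a BERT encoder (it just buckets tokens into a small count vector), and the `[SEP]` string is an assumed separator; only the way the two texts are combined is the point here.

```python
def toy_encode(text, dim=8):
    """Hypothetical stand-in for a BERT encoder: a deterministic bucketed token-count vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return vec

def single_encoder(target, context, sep="[SEP]"):
    """Concatenate target and context with a separator and encode them once."""
    return toy_encode(f"{target} {sep} {context}")

def dual_encoder(target, context):
    """Encode target and context independently, then average the two representations."""
    t, c = toy_encode(target), toy_encode(context)
    return [(a + b) / 2 for a, b in zip(t, c)]

tweet = "this reply is rude"
parent = "the parent tweet it answers"
joint_repr = single_encoder(tweet, parent)
averaged_repr = dual_encoder(tweet, parent)
```

Either representation would then feed a classification head; the abstract reports that the averaged, independently encoded variant performed better in the authors' experiments.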

【4】 Deep Transfer Learning & Beyond: Transformer Language Models in Information Systems Research
Link: https://arxiv.org/abs/2110.08975

Authors: Ross Gruetzemacher, David Paradice
Affiliations: Wichita State University, Frank Barton School of Business
Comments: Under review (revised once). Section 2, the literature review on deep transfer learning and transformer language models, is a valuable introduction for a broad audience (not just information systems researchers)
Abstract: AI is widely thought to be poised to transform business, yet current perceptions of the scope of this transformation may be myopic. Recent progress in natural language processing involving transformer language models (TLMs) offers a potential avenue for AI-driven business and societal transformation that is beyond the scope of what most currently foresee. We review this recent progress as well as recent literature utilizing text mining in top IS journals to develop an outline for how future IS research can benefit from these new techniques. Our review of existing IS literature reveals that suboptimal text mining techniques are prevalent and that the more advanced TLMs could be applied to enhance and increase IS research involving text data, and to enable new IS research topics, thus creating more value for the research community. This is possible because these techniques make it easier to develop very powerful custom systems and their performance is superior to existing methods for a wide range of tasks and applications. Further, multilingual language models make possible higher quality text analytics for research in multiple languages. We also identify new avenues for IS research, like language user interfaces, that may offer even greater potential for future IS research.

【5】 Transformer with a Mixture of Gaussian Keys
Link: https://arxiv.org/abs/2110.08678

Authors: Tam Nguyen, Tan M. Nguyen, Dung Le, Khuong Nguyen, Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher
Affiliations: FPT Software, Vietnam; University of California, Los Angeles, USA; Rice University, Houston, USA; University of Texas, Austin, USA
Comments: 21 pages, 8 figures, 4 tables
Abstract: Multi-head attention is a driving force behind state-of-the-art transformers which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attentions. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

【6】 On Learning the Transformer Kernel
Link: https://arxiv.org/abs/2110.08323

Authors: Sankalan Pal Chowdhury, Adamos Solomou, Avinava Dubey, Mrinmaya Sachan
Affiliations: Department of Computer Science, ETH Zürich; Google Research, Mountain View, CA
Comments: 26 pages, of which 11 form the appendix. 6 figures of which 2 are part of appendix
Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KERNELIZED TRANSFORMERS achieve performance comparable to existing efficient Transformer architectures, both in terms of accuracy as well as computational efficiency. Our study also demonstrates that the choice of the kernel has a substantial impact on performance, and kernel learning variants are competitive alternatives to fixed kernel Transformers, both in long as well as short sequence tasks.
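Why writing the attention kernel as a dot product of feature maps brings the cost down from quadratic to linear can be seen in a small pure-Python sketch. The fixed feature map used here (elu(x) + 1) is only an assumed placeholder; the paper learns spectral feature maps instead. The sketch covers non-causal attention, and all function names are hypothetical.

```python
import math

def feature_map(x):
    """Assumed positive feature map, elu(x) + 1; the paper learns its feature maps."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def quadratic_attention(queries, keys, values):
    """Reference version: scores every (query, key) pair, O(n^2) in sequence length."""
    out = []
    for q in queries:
        fq = feature_map(q)
        weights = [dot(fq, feature_map(k)) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out

def linear_attention(queries, keys, values):
    """Same output, but keys and values are summarized once, so cost is O(n)."""
    m, dv = len(feature_map(keys[0])), len(values[0])
    kv = [[0.0] * dv for _ in range(m)]   # running sum of phi(k) v^T
    ksum = [0.0] * m                      # running sum of phi(k)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for i in range(m):
            ksum[i] += fk[i]
            for d in range(dv):
                kv[i][d] += fk[i] * v[d]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = dot(fq, ksum)
        out.append([sum(fq[i] * kv[i][d] for i in range(m)) / norm for d in range(dv)])
    return out

Q = [[0.5, -1.0], [1.0, 2.0]]
K = [[0.1, 0.2], [-0.3, 0.4], [1.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
slow, fast = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
```

Both functions compute the same normalized weighted average; the linear version only reorders the sums so that no n-by-n score matrix is ever formed.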

【7】 From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation
Link: https://arxiv.org/abs/2110.08270

Authors: Dhruv Agarwal, Tanay Agrawal, Laura M. Ferrari, François Bremond
Affiliations: INRIA Sophia Antipolis - Méditerranée, France; Indian Institute of Information Technology, Allahabad, India; Université Côte d'Azur, France
Comments: Preprint. Final paper accepted at the 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, AVSS 2021, Virtual, November 16-19, 2021. 8 pages
Abstract: Multimodal Deep Learning has garnered much interest, and transformers have triggered novel approaches, thanks to the cross-attention mechanism. Here we propose an approach to deal with two key existing challenges: the high computational resource demanded and the issue of missing modalities. We introduce for the first time the concept of knowledge distillation in transformers to use only one modality at inference time. We report a full study analyzing multiple student-teacher configurations, levels at which distillation is applied, and different methodologies. With the best configuration, we improved the state-of-the-art accuracy by 3%, we reduced the number of parameters by 2.5 times and the inference time by 22%. Such performance-computation tradeoff can be exploited in many applications and we aim at opening a new research area where the deployment of complex models with limited resources is demanded.
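For readers unfamiliar with knowledge distillation, a generic output-level distillation loss is sketched below: a (multimodal) teacher's softened predictions supervise a (unimodal) student. This is the standard temperature-scaled KL formulation, not necessarily the loss used in the paper, which studies distillation at several levels of the network. Function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [1.0, 2.0, 3.0]   # e.g. logits from a model that sees all modalities
student = [3.0, 2.0, 1.0]   # e.g. logits from a model that sees one modality
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the single-modality student toward the teacher's behavior.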

QA | VQA | Question Answering | Dialogue (2 papers)

【1】 COVIDRead: A Large-scale Question Answering Dataset on COVID-19
Link: https://arxiv.org/abs/2110.09321

Authors: Tanik Saikh, Sovan Kumar Sahoo, Asif Ekbal, Pushpak Bhattacharyya
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, Patna, India; Indian Institute of Technology Bombay, Mumbai, Maharashtra, India
Comments: 20 pages, 7 figures
Abstract: During this pandemic situation, extracting any relevant information related to COVID-19 will be immensely beneficial to the community at large. In this paper, we present a very important resource, COVIDRead, a Stanford Question Answering Dataset (SQuAD) like dataset over more than 100k question-answer pairs. The dataset consists of Context-Answer-Question triples. Primarily the questions from the context are constructed in an automated way. After that, the system-generated questions are manually checked by human annotators. This is a precious resource that could serve many purposes, ranging from common people queries regarding this very uncommon disease to managing articles by editors/associate editors of a journal. We establish several end-to-end neural network based baseline models that attain the lowest F1 of 32.03% and the highest F1 of 37.19%. To the best of our knowledge, we are the first to provide this kind of QA dataset in such a large volume on COVID-19. This dataset creates a new avenue of carrying out research on COVID-19 by providing a benchmark dataset and a baseline model.

【2】 Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text
Link: https://arxiv.org/abs/2110.08417

Authors: Kaixin Ma, Hao Cheng, Xiaodong Liu, Eric Nyberg, Jianfeng Gao
Affiliations: Carnegie Mellon University; Microsoft Research
Abstract: Due to its potential for a universal interface over both data and text, data-to-text generation is becoming increasingly popular recently. However, few previous work has focused on its application to downstream tasks, e.g. using the converted data for grounding or reasoning. In this work, we aim to bridge this gap and use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA). Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources. We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines. Notably, our approach sets the single-model state-of-the-art on Natural Questions. Furthermore, our analyses indicate that verbalized knowledge is preferred for answer reasoning for both adapted and hot-swap settings.

Machine Translation (1 paper)

【1】 Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation
Link: https://arxiv.org/abs/2110.08547

Authors: Guanhua Chen, Shuming Ma, Yun Chen, Dongdong Zhang, Jia Pan, Wenping Wang, Furu Wei
Affiliations: The University of Hong Kong; Microsoft Research; Shanghai University of Finance and Economics; Texas A&M University
Comments: Preprint
Abstract: This paper demonstrates that multilingual pretraining, a proper fine-tuning method and a large-scale parallel dataset from multiple auxiliary languages are all critical for zero-shot translation, where the NMT model is tested on source languages unseen during supervised training. Following this idea, we present SixT++, a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages. SixT++ initializes the decoder embedding and the full encoder with XLM-R large, and then trains the encoder and decoder layers with a simple two-stage training strategy. SixT++ achieves impressive performance on many-to-English translation. It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with an average gain of 7.2 and 5.0 BLEU respectively. Additionally, SixT++ offers a set of model parameters that can be further fine-tuned to develop unsupervised NMT models for low-resource languages. With back-translation on monolingual data of low-resource language, it outperforms all current state-of-the-art unsupervised methods on Nepali and Sinhal for both translating into and from English.

Semantic Parsing (7 papers)

【1】 Ensembling Graph Predictions for AMR Parsing
Link: https://arxiv.org/abs/2110.09131

Authors: Hoang Thanh Lam, Gabriele Picco, Yufang Hou, Young-Suk Lee, Lam M. Nguyen, Dzung T. Phan, Vanessa López, Ramon Fernandez Astudillo
Affiliations: IBM Research, Dublin, Ireland; IBM Research, Thomas J. Watson Research Center, Yorktown Heights, USA
Comments: Accepted at NeurIPS 2021
Abstract: In many machine learning tasks, models are trained to predict structure data such as graphs. For example, in natural language processing, it is very common to parse texts into dependency trees or abstract meaning representation (AMR) graphs. On the other hand, ensemble methods combine predictions from multiple models to create a new one that is more robust and accurate than individual predictions. In the literature, there are many ensembling techniques proposed for classification or regression problems, however, ensemble graph prediction has not been studied thoroughly. In this work, we formalize this problem as mining the largest graph that is the most supported by a collection of graph predictions. As the problem is NP-Hard, we propose an efficient heuristic algorithm to approximate the optimal solution. To validate our approach, we carried out experiments in AMR parsing problems. The experimental results demonstrate that the proposed approach can combine the strength of state-of-the-art AMR parsers to create new predictions that are more accurate than any individual models in five standard benchmark datasets.
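The notion of a graph that is "most supported" by a set of predictions can be illustrated with a deliberately simplified voting sketch. It assumes the nodes of the predicted graphs are already aligned (identical node labels refer to the same node), which sidesteps the hard part of the problem; the paper's heuristic algorithm is not reproduced here. The function name and the example graphs are hypothetical.

```python
from collections import Counter

def ensemble_edges(graphs, min_support):
    """Keep each labeled edge predicted by at least `min_support` of the graphs.
    Assumes node alignment across graphs is given (a strong simplification)."""
    votes = Counter(edge for g in graphs for edge in set(g))
    return {edge for edge, n in votes.items() if n >= min_support}

# Three toy predictions, each a set of (source, relation, target) triples.
g1 = {("want", "ARG0", "boy"), ("want", "ARG1", "go")}
g2 = {("want", "ARG0", "boy"), ("want", "ARG1", "sleep")}
g3 = {("want", "ARG0", "boy"), ("want", "ARG1", "go"), ("go", "ARG0", "boy")}

consensus = ensemble_edges([g1, g2, g3], min_support=2)
```

With a support threshold of two, only the edges that at least two parsers agree on survive; raising the threshold trades graph size for agreement.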

【2】 Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing
Link: https://arxiv.org/abs/2110.08538

Authors: Haoyue Shi, Kevin Gimpel, Karen Livescu
Affiliations: Toyota Technological Institute at Chicago, S Kenwood Ave, Chicago, IL, USA
Abstract: We present substructure distribution projection (SubDP), a technique that projects a distribution over structures in one domain to another, by projecting substructure distributions separately. Models for the target domains can be then trained, using the projected distributions as soft silver labels. We evaluate SubDP on zero-shot cross-lingual dependency parsing, taking dependency arcs as substructures: we project the predicted dependency arc distributions in the source language(s) to target language(s), and train a target language parser to fit the resulting distributions. When an English treebank is the only annotation that involves human effort, SubDP achieves better unlabeled attachment score than all prior work on the Universal Dependencies v2.2 (Nivre et al., 2020) test set across eight diverse target languages, as well as the best labeled attachment score on six out of eight languages. In addition, SubDP improves zero-shot cross-lingual dependency parsing with very few (e.g., 50) supervised bitext pairs, across a broader range of target languages.

【3】 The Power of Prompt Tuning for Low-Resource Semantic Parsing
Link: https://arxiv.org/abs/2110.08525

Authors: Nathan Schucher, Siva Reddy, Harm de Vries
Affiliations: ElementAI, a ServiceNow company; Mila/McGill University; Facebook CIFAR AI Chair
Abstract: Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language tasks. In this paper, we investigate prompt tuning for semantic parsing, the task of mapping natural language utterances onto formal meaning representations. For large T5 models we find (i) that prompt tuning significantly outperforms fine-tuning in the low data regime and (ii) that canonicalization -- i.e. naturalizing the meaning representations -- barely improves performance. This last result is surprising as it suggests that large T5 models can be modulated to generate sequences that are far from the pre-training distribution.

【4】 Controllable Semantic Parsing via Retrieval Augmentation
Link: https://arxiv.org/abs/2110.08458

Authors: Panupong Pasupat, Yuan Zhang, Kelvin Guu
Affiliations: Google Research
Comments: EMNLP 2021
Abstract: In practical applications of semantic parsing, we often want to rapidly change the behavior of the parser, such as enabling it to handle queries in a new domain, or changing its predictions on certain targeted queries. While we can introduce new training examples exhibiting the target behavior, a mechanism for enacting such behavior changes without expensive model re-training would be preferable. To this end, we propose ControllAble Semantic Parser via Exemplar Retrieval (CASPER). Given an input query, the parser retrieves related exemplars from a retrieval index, augments them to the query, and then applies a generative seq2seq model to produce an output parse. The exemplars act as a control mechanism over the generic generative model: by manipulating the retrieval index or how the augmented query is constructed, we can manipulate the behavior of the parser. On the MTOP dataset, in addition to achieving state-of-the-art on the standard setup, we show that CASPER can parse queries in a new domain, adapt the prediction toward the specified patterns, or adapt to new semantic schemas without having to further re-train the model.
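The retrieve-then-augment step described in this abstract can be sketched in a few lines. Everything specific here is an assumption made for illustration: the token-overlap retriever stands in for whatever retriever the authors use, the `@@` and `##` separators are invented, and the example parses are only illustrative of an intent/slot format. The seq2seq model that would consume the augmented query is not shown.

```python
def retrieve(query, index, k=2):
    """Rank (utterance, parse) exemplars by token overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(index,
                    key=lambda ex: len(q & set(ex[0].lower().split())),
                    reverse=True)
    return ranked[:k]

def augment(query, exemplars):
    """Append the retrieved exemplars to the query to form the seq2seq input."""
    parts = [query] + [f"{utt} ## {parse}" for utt, parse in exemplars]
    return " @@ ".join(parts)

# Hypothetical retrieval index; editing it changes parser behavior without re-training.
index = [
    ("set an alarm for 7 am", "[IN:CREATE_ALARM [SL:DATE_TIME for 7 am ] ]"),
    ("play some jazz", "[IN:PLAY_MUSIC [SL:MUSIC_GENRE jazz ] ]"),
    ("set a timer for 5 minutes", "[IN:CREATE_TIMER [SL:DATE_TIME for 5 minutes ] ]"),
]

query = "set an alarm for 6 am"
model_input = augment(query, retrieve(query, index))
```

Because the exemplars are plain data, adding entries for a new domain or removing entries for a deprecated schema alters what the generator sees, which is the control mechanism the abstract describes.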

【5】 Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
Link: https://arxiv.org/abs/2110.08415

Authors: C. M. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral
Affiliations: Department of Linguistics, University of Washington
Abstract: We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our model to a monolingual baseline, and show that the multilingual pre-trained approach yields much more consistent segmentation quality across target dataset sizes, including a zero-shot performance of 20.6 F1, and exceeds the monolingual performance in 9/10 experimental settings. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).

【6】 On The Ingredients of an Effective Zero-shot Semantic Parser
Link: https://arxiv.org/abs/2110.08381

Authors: Pengcheng Yin, John Wieting, Avirup Sil, Graham Neubig
Affiliations: Carnegie Mellon University; Google Research; IBM Research
Abstract: Semantic parsers map natural language utterances into meaning representations (e.g., programs). Such models are typically bottlenecked by the paucity of training data due to the required laborious annotation efforts. Recent studies have performed zero-shot learning by synthesizing training examples of canonical utterances and programs from a grammar, and further paraphrasing these utterances to improve linguistic diversity. However, such synthetic examples cannot fully capture patterns in real data. In this paper we analyze zero-shot parsers through the lenses of the language and logical gaps (Herzig and Berant, 2019), which quantify the discrepancy of language and programmatic patterns between the canonical examples and real-world user-issued ones. We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods using canonical examples that most likely reflect real user intents. Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.

【7】 Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction
Link: https://arxiv.org/abs/2110.08345

Authors: Lingbo Mo, Ashley Lewis, Huan Sun, Michael White
Affiliations: The Ohio State University
Abstract: Existing studies on semantic parsing focus primarily on mapping a natural-language utterance to a corresponding logical form in one turn. However, because natural language can contain a great deal of ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted logical form step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user appropriately trust the final answer. To do so, we construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that the interactive framework with human feedback has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without involving further crowdsourcing effort. The results demonstrate that our interactive semantic parsing framework promises to be effective across such models.

Graph | Knowledge Graph | Knowledge (7 papers)

【1】 HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression
Link: https://arxiv.org/abs/2110.08551

Authors: Chenhe Dong, Yaliang Li, Ying Shen, Minghui Qiu
Affiliations: Sun Yat-sen University; Alibaba Group
Comments: EMNLP 2021
Abstract: On many natural language processing tasks, large pre-trained language models (PLMs) have shown overwhelming performances compared with traditional neural network methods. Nevertheless, their huge model size and low inference speed have hindered the deployment on resource-limited devices in practice. In this paper, we target to compress PLMs with knowledge distillation, and propose a hierarchical relational knowledge distillation (HRKD) method to capture both hierarchical and domain relational information. Specifically, to enhance the model capability and transferability, we leverage the idea of meta-learning and set up domain-relational graphs to capture the relational information across different domains. And to dynamically select the most representative prototypes for each domain, we propose a hierarchical compare-aggregate mechanism to capture hierarchical relationships. Extensive experiments on public multi-domain datasets demonstrate the superior performance of our HRKD method as well as its strong few-shot learning ability. For reproducibility, we release the code at https://github.com/cheneydon/hrkd.

【2】 Think Before You Speak: Using Self-talk to Generate Implicit Commonsense Knowledge for Response Generation Link: https://arxiv.org/abs/2110.08501

Authors: Pei Zhou, Karthik Gopalakrishnan, Behnam Hedayatnia, Seokhwan Kim, Jay Pujara, Xiang Ren, Yang Liu, Dilek Hakkani-Tur Affiliations: Department of Computer Science, University of Southern California, Amazon Alexa AI Comments: 13 pages, 2 figures, 7 tables Abstract: Implicit knowledge, such as common sense, is key to fluid human conversations. Current neural response generation (RG) models are trained end-to-end, omitting unstated implicit knowledge. In this paper, we present a self-talk approach that first generates the implicit commonsense knowledge and then generates a response by referencing the externalized knowledge, all using one generative model. We analyze different choices to collect knowledge-aligned dialogues, represent implicit knowledge, and elicit knowledge and responses. We introduce three evaluation aspects (knowledge quality, knowledge-response connection, and response quality) and perform extensive human evaluations. Our experimental results show that, compared with end-to-end RG models, self-talk models that externalize the knowledge grounding process by explicitly generating implicit knowledge also produce responses that are more informative, more specific, and more consistent with common sense. We also find via human evaluation that self-talk models generate high-quality knowledge around 75% of the time. We hope that our findings encourage further work on different approaches to modeling implicit commonsense knowledge and training knowledgeable RG models.

【3】 Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals Link: https://arxiv.org/abs/2110.08486

Authors: Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng Affiliations: University of California, Los Angeles, Information Sciences Institute, University of Southern California, Sharif University of Technology Abstract: The ability to sequence unordered events is an essential skill for comprehending and reasoning about real-world task procedures. It often requires a thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in significant improvements of more than 5%.

【4】 Leveraging Knowledge in Multilingual Commonsense Reasoning Link: https://arxiv.org/abs/2110.08462

Authors: Yuwei Fang, Shuohang Wang, Yichong Xu, Ruochen Xu, Siqi Sun, Chenguang Zhu, Michael Zeng Affiliations: Microsoft Cognitive Services Research Group Comments: First place in XCSR Leaderboard: this https URL Work in progress Abstract: Commonsense reasoning (CSR) requires the model to be equipped with general world knowledge. While CSR is a language-agnostic process, most comprehensive knowledge sources are in a few popular languages, especially English. Thus, it remains unclear how to effectively conduct multilingual commonsense reasoning (XCSR) for various languages. In this work, we propose to utilize English knowledge sources via a translate-retrieve-translate (TRT) strategy. For multilingual commonsense questions and choices, we collect related knowledge via translation and retrieval from the knowledge sources. The retrieved knowledge is then translated into the target language and integrated into a pre-trained multilingual language model via visible knowledge attention. We then utilize a diverse set of 4 English knowledge sources to provide more comprehensive coverage of knowledge in different formats. Extensive results on the XCSR benchmark demonstrate that TRT with external knowledge can significantly improve multilingual commonsense reasoning in both zero-shot and translate-train settings, outperforming the previous state of the art by 3.3 and 3.6 points on the XCSR benchmark datasets (X-CSQA and X-CODAH).
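The translate-retrieve-translate flow described in the abstract above can be sketched as a three-step pipeline. This is only a toy illustration under stated assumptions, not the authors' system: the word-level lexicon stands in for a real machine translation model, the keyword-overlap retriever stands in for a real knowledge retriever, and the names `TOY_LEXICON`, `KNOWLEDGE`, `translate`, `retrieve`, and `trt` are all hypothetical. The step that integrates knowledge into the multilingual model is not covered.

```python
from typing import Dict, List

# Hypothetical word-level lexicon standing in for a real MT system.
TOY_LEXICON: Dict[str, str] = {"oiseau": "bird", "voler": "fly"}
REVERSE_LEXICON: Dict[str, str] = {v: k for k, v in TOY_LEXICON.items()}

# Hypothetical English knowledge source.
KNOWLEDGE: List[str] = ["bird can fly", "fish can swim"]


def translate(text: str, lexicon: Dict[str, str]) -> str:
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


def retrieve(query: str, corpus: List[str], top_k: int = 1) -> List[str]:
    """Rank knowledge statements by the number of words shared with the query."""
    q = set(query.split())
    ranked = sorted(corpus, key=lambda s: len(q & set(s.split())), reverse=True)
    return ranked[:top_k]


def trt(question: str) -> List[str]:
    english = translate(question, TOY_LEXICON)            # 1. translate to English
    facts = retrieve(english, KNOWLEDGE)                  # 2. retrieve English knowledge
    return [translate(f, REVERSE_LEXICON) for f in facts]  # 3. translate back
```

Calling `trt("oiseau voler")` returns the retrieved statement rendered back in the toy source language.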

【5】 Knowledge Enhanced Pretrained Language Models: A Compreshensive Survey Link: https://arxiv.org/abs/2110.08455

Authors: Xiaokai Wei, Shen Wang, Dejiao Zhang, Parminder Bhatia, Andrew Arnold Abstract: Pretrained Language Models (PLMs) have established a new paradigm through learning informative contextualized representations on large-scale text corpora. This new paradigm has revolutionized the entire field of natural language processing and set new state-of-the-art performance for a wide variety of NLP tasks. However, though PLMs can store certain knowledge/facts from the training corpus, their knowledge awareness is still far from satisfactory. To address this issue, integrating knowledge into PLMs has recently become a very active research area, and a variety of approaches have been developed. In this paper, we provide a comprehensive survey of the literature on this emerging and fast-growing field: Knowledge Enhanced Pretrained Language Models (KE-PLMs). We introduce three taxonomies to categorize existing work. We also survey the various NLU and NLG applications on which KE-PLMs have demonstrated superior performance over vanilla PLMs. Finally, we discuss challenges facing KE-PLMs and promising directions for future research.

【6】 Prix-LM: Pretraining for Multilingual Knowledge Base Construction Link: https://arxiv.org/abs/2110.08443

Authors: Wenxuan Zhou, Fangyu Liu, Ivan Vulić, Nigel Collier, Muhao Chen Affiliations: LUKA Lab, CSD & ISI, University of Southern California, USA, Language Technology Lab, TAL, University of Cambridge, UK Abstract: Knowledge bases (KBs) contain plenty of structured world and commonsense knowledge. As such, they often complement distributional text-based information and facilitate various downstream tasks. Since their manual construction is resource- and time-intensive, recent efforts have tried leveraging large pretrained language models (PLMs) to generate additional monolingual knowledge facts for KBs. However, such methods have not been attempted for building and enriching multilingual KBs. Besides wider application, such multilingual KBs can provide richer combined knowledge than monolingual (e.g., English) KBs. Knowledge expressed in different languages may be complementary and unequally distributed: this implies that the knowledge available in high-resource languages can be transferred to low-resource ones. To achieve this, it is crucial to represent multilingual knowledge in a shared/unified space. To this end, we propose a unified framework, Prix-LM, for multilingual KB construction and completion. We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs, and tune a multilingual language encoder XLM-R via a causal language modeling objective. Prix-LM integrates useful multilingual and KB-based factual knowledge into a single model. Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness, with gains reported over strong task-specialised baselines.

【7】 Generated Knowledge Prompting for Commonsense Reasoning Link: https://arxiv.org/abs/2110.08387

Authors: Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, Hannaneh Hajishirzi Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington, Allen Institute for Artificial Intelligence Abstract: Despite their ability to capture a large amount of knowledge during pretraining, large-scale language models often benefit from incorporating external knowledge bases, especially on commonsense reasoning tasks. This motivates us to explore how we can best leverage knowledge elicited from language models themselves. We propose generating knowledge statements directly from a language model with a generic prompt format, then selecting the knowledge which maximizes prediction probability. Despite its simplicity, this approach improves performance of both off-the-shelf and finetuned language models on four commonsense reasoning tasks, improving the state of the art on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Notably, we find that a model's predictions can improve when using its own generated knowledge, demonstrating the importance of symbolic knowledge representation in neural reasoning processes.
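The selection step in the abstract above ("selecting the knowledge which maximizes prediction probability") can be illustrated with a short sketch. It assumes the knowledge statements have already been generated and that some scoring function returns a probability for an answer given a prompt. The function names, the prompt layout, and the substring-based `toy_score` are hypothetical stand-ins for illustration, not the paper's implementation.

```python
from typing import Callable, List, Optional, Tuple


def predict_with_generated_knowledge(
    question: str,
    choices: List[str],
    knowledge: List[str],
    score: Callable[[str, str], float],
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (answer, knowledge statement) pair with the highest score.

    `score(prompt, choice)` is assumed to return the model's probability of
    `choice` given `prompt`; here it is a pluggable placeholder.
    """
    best_choice: Optional[str] = None
    best_knowledge: Optional[str] = None
    best_p = float("-inf")
    for statement in knowledge:
        # Prepend one knowledge statement to the question (assumed layout).
        prompt = f"{statement} {question}"
        for choice in choices:
            p = score(prompt, choice)
            if p > best_p:
                best_choice, best_knowledge, best_p = choice, statement, p
    return best_choice, best_knowledge


def toy_score(prompt: str, choice: str) -> float:
    """Hypothetical scorer: favors answers that literally appear in the prompt."""
    return 0.9 if choice.lower() in prompt.lower() else 0.1
```

In practice `score` would wrap a language model; the toy scorer only makes the control flow easy to follow.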

Summarization | Information Extraction (4 papers)

【1】 Fine-Grained Opinion Summarization with Minimal Supervision Link: https://arxiv.org/abs/2110.08845

Authors: Suyu Ge, Jiaxin Huang, Yu Meng, Sharon Wang, Jiawei Han Affiliations: University of Illinois Urbana-Champaign Abstract: Opinion summarization aims to profile a target by extracting opinions from multiple documents. Most existing work approaches the task in a semi-supervised manner due to the difficulty of obtaining high-quality annotation from thousands of documents. Among them, some use aspect and sentiment analysis as a proxy for identifying opinions. In this work, we propose a new framework, FineSum, which advances this frontier in three aspects: (1) minimal supervision, where only aspect names and a few aspect/sentiment keywords are available; (2) fine-grained opinion analysis, where sentiment analysis drills down to the sub-aspect level; and (3) phrase-based summarization, where opinion is summarized in the form of phrases. FineSum automatically identifies opinion phrases from the raw corpus, classifies them into different aspects and sentiments, and constructs multiple fine-grained opinion clusters under each aspect/sentiment. Each cluster consists of semantically coherent phrases, expressing uniform opinions towards certain sub-aspects or characteristics (e.g., positive feelings for "burgers" in the "food" aspect). An opinion-oriented spherical word embedding space is trained to provide weak supervision for the phrase classifier, and phrase clustering is performed using the aspect-aware contextualized embedding generated from the phrase classifier. Both automatic evaluation on the benchmark and quantitative human evaluation validate the effectiveness of our approach.

【2】 PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization Link: https://arxiv.org/abs/2110.08499

Authors: Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan Affiliations: University of British Columbia, Vancouver, Canada, Allen Institute for AI, Seattle, WA, USA, Paul G. Allen School of Computer Science & Engineering, University of Washington Abstract: Recently proposed pre-trained generation models achieve strong performance on single-document summarization benchmarks. However, most of them are pre-trained with general-purpose objectives and mainly aim to process single-document inputs. In this paper, we propose PRIMER, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of labeled fine-tuning data. Specifically, we adopt the Longformer architecture with proper input transformation and global attention to fit multi-document inputs, and we use the Gap Sentence Generation objective with a new strategy to select salient sentences for the whole cluster, called Entity Pyramid, to teach the model to select and aggregate information across a cluster of related documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains in the zero-shot, few-shot, and fully supervised settings, our model, PRIMER, outperforms current state-of-the-art models on most of these settings by large margins. Code and pre-trained models are released at https://github.com/allenai/PRIMER

【3】 Training Dynamics for Text Summarization Models Link: https://arxiv.org/abs/2110.08370

Authors: Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett Affiliations: Department of Computer Science, Department of Linguistics, The University of Texas at Austin Comments: preprint Abstract: Pre-trained language models (e.g. BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training or how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics for generation models, focusing on news summarization. Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, such as abstractiveness and hallucination, we study what the model learns at different stages of its fine-tuning process. We find that properties such as copy behavior are learnt earlier in the training process and that these observations are robust across domains. On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains. Based on these observations, we explore complementary approaches for modifying training: first, disregarding high-loss tokens that are challenging to learn and second, disregarding low-loss tokens that are learnt very quickly. This simple training modification allows us to configure our model to achieve different goals, such as improving factuality or improving abstractiveness.
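The two training modifications named in the abstract above (disregarding high-loss tokens, or disregarding low-loss tokens) amount to masking per-token losses before averaging. The sketch below shows that idea on plain Python floats. The function name and the fixed `low`/`high` thresholds are assumptions for illustration; the abstract does not say how the authors choose which tokens to drop.

```python
from typing import List, Optional


def masked_mean_loss(
    token_losses: List[float],
    low: Optional[float] = None,
    high: Optional[float] = None,
) -> float:
    """Average per-token losses, ignoring tokens outside [low, high].

    low  -- if set, tokens with loss below this value are disregarded
            (the "learnt very quickly" tokens).
    high -- if set, tokens with loss above this value are disregarded
            (the "challenging to learn" tokens).
    Both thresholds are hypothetical knobs, not values from the paper.
    """
    kept = [
        loss
        for loss in token_losses
        if (low is None or loss >= low) and (high is None or loss <= high)
    ]
    # If every token was masked out, contribute no loss for this example.
    return sum(kept) / len(kept) if kept else 0.0
```

With per-token losses `[0.1, 0.5, 2.0, 4.0]`, setting `high=1.0` averages only the two easy tokens, while `low=1.0` averages only the two hard ones.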

【4】 Aspect-Oriented Summarization through Query-Focused Extraction Link: https://arxiv.org/abs/2110.08296

Authors: Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, Greg Durrett Affiliations: The University of Texas at Austin, Walmart NexTech Abstract: A reader interested in a particular topic might be interested in summarizing documents on that subject with a particular focus, rather than simply seeing generic summaries produced by most summarization systems. While query-focused summarization has been explored in prior work, this is often approached from the standpoint of document-specific questions or on synthetic data. Real users' needs often fall more closely into aspects, broad topics in a dataset the user is interested in, rather than specific queries. In this paper, we collect a dataset of realistic aspect-oriented test cases, AspectNews, which covers different subtopics about articles in news sub-domains. We then investigate how query-focused methods, for which we can construct synthetic data, can handle this aspect-oriented setting: we benchmark extractive query-focused training schemes, and propose a contrastive augmentation approach to train the model. We evaluate on two aspect-oriented datasets and find this approach yields (a) focused summaries, better than those from a generic summarization system, which go beyond simple keyword matching; (b) a system sensitive to the choice of keywords.

Reasoning | Analysis | Understanding | Explanation (4 papers)

【1】 Analysis of French Phonetic Idiosyncrasies for Accent Recognition Link: https://arxiv.org/abs/2110.09179

Authors: Pierre Berjon, Avishek Nag, Soumyabrata Dev Affiliations: Department de Sciences du Numérique, INP-ENSEEIHT, Toulouse, France, School of Electrical and Electronic Engineering, University College Dublin, Ireland, ADAPT SFI Research Centre, Dublin, Ireland, School of Computer Science, University College Dublin, Ireland Comments: Accepted in Soft Computing Letters, 2021 Abstract: Speech recognition systems have made tremendous progress over the last few decades. They have developed significantly in identifying the speech of the speaker. However, there is scope for improvement in identifying the nuances and accents of a speaker. It is known that any specific natural language may possess at least one accent. Even when a word has the same phonemic composition, pronouncing it in different accents produces different sound waves. Differences in pronunciation, accent, and intonation of speech in general create one of the most common problems in speech recognition. If a language has many accents, a separate acoustic model should be created for each. We carry out a systematic analysis of the problem of accurately classifying accents. We use traditional machine learning techniques and convolutional neural networks, and show that the classical techniques are not sufficiently efficient to solve this problem. Using spectrograms of speech signals, we propose a multi-class classification framework for accent recognition. In this paper, we focus our attention on the French accent. We also identify the framework's limitations by understanding the impact of French idiosyncrasies on its spectrograms.

【2】 MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding Link: https://arxiv.org/abs/2110.08518

Authors: Junlong Li, Yiheng Xu, Lei Cui, Furu Wei Affiliations: Shanghai Jiao Tong University, Microsoft Research Asia Comments: Work in Progress Abstract: Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.

【3】 Case-based Reasoning for Better Generalization in Text-Adventure Games Link: https://arxiv.org/abs/2110.08470

Authors: Mattia Atzeni, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan Affiliations: IBM Research, EPFL, ETH Zürich Abstract: Text-based games (TBGs) have emerged as promising environments for driving research in grounded language understanding and studying problems like generalization and sample efficiency. Several deep reinforcement learning (RL) methods with varying architectures and learning schemes have been proposed for TBGs. However, these methods fail to generalize efficiently, especially under distributional shifts. In a departure from deep RL approaches, in this paper, we propose a general method inspired by case-based reasoning to train agents and generalize out of the training distribution. The case-based reasoner collects instances of positive experiences from the agent's interaction with the world in the past and later reuses the collected experiences to act efficiently. The method can be applied in conjunction with any existing on-policy neural agent in the literature for TBGs. Our experiments show that the proposed approach consistently improves existing methods, obtains good out-of-distribution generalization, and achieves new state-of-the-art results on widely used environments.
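The retain-and-reuse loop that the abstract above attributes to the case-based reasoner can be sketched with a small memory of positive experiences. Everything here is a simplifying assumption rather than the paper's design: contexts are compared by Jaccard overlap of lowercase words, and `CaseMemory`, `act`, and `fallback_policy` are hypothetical names, with the fallback standing in for an existing neural agent.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple


class CaseMemory:
    """Stores (context, action) pairs from positively rewarded interactions."""

    def __init__(self) -> None:
        self.cases: List[Tuple[FrozenSet[str], str]] = []

    def retain(self, context: str, action: str, reward: float) -> None:
        # Only positive experiences are kept, as described in the abstract.
        if reward > 0:
            self.cases.append((frozenset(context.lower().split()), action))

    def retrieve(self, context: str) -> Optional[str]:
        """Return the action of the most similar stored case, if any overlaps."""
        query = set(context.lower().split())
        best_action: Optional[str] = None
        best_sim = 0.0
        for tokens, action in self.cases:
            union = query | tokens
            sim = len(query & tokens) / len(union) if union else 0.0
            if sim > best_sim:
                best_action, best_sim = action, sim
        return best_action


def act(memory: CaseMemory, context: str,
        fallback_policy: Callable[[str], str]) -> str:
    """Reuse a past case when one matches; otherwise defer to the base agent."""
    reused = memory.retrieve(context)
    return reused if reused is not None else fallback_policy(context)
```

A learned similarity over structured state representations would replace the word-overlap measure in any realistic agent.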

【4】 Unsupervised Natural Language Inference Using PHL Triplet Generation Link: https://arxiv.org/abs/2110.08438

Authors: Neeraj Varshney, Pratyay Banerjee, Tejas Gokhale, Chitta Baral Affiliations: Arizona State University Comments: 9 pages, 2 figures, 8 tables Abstract: Transformer-based models have achieved impressive performance on various Natural Language Inference (NLI) benchmarks when trained on their respective training datasets. However, in certain cases, training samples may not be available, or collecting them could be time-consuming and resource-intensive. In this work, we address this challenge and present an explorative study on unsupervised NLI, a paradigm in which no human-annotated training samples are available. We investigate NLI under three challenging settings, PH, P, and NPH, that differ in the extent of unlabeled data available for learning. As a solution, we propose a procedural data generation approach that leverages a set of sentence transformations to collect PHL (Premise, Hypothesis, Label) triplets for training NLI models, bypassing the need for human-annotated training datasets. Comprehensive experiments show that this approach results in accuracies of 66.75%, 65.9%, and 65.39% in the PH, P, and NPH settings respectively, outperforming all existing baselines. Furthermore, fine-tuning our models with as little as ~0.1% of the training dataset (500 samples) leads to 12.2% higher accuracy than the model trained from scratch on the same 500 instances.
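To make the idea of procedurally generating PHL (Premise, Hypothesis, Label) triplets concrete, the sketch below applies two toy sentence transformations to an unlabeled premise. The transformations (hypernym substitution for entailment, inserting "not" after "is" for contradiction), the `HYPERNYMS` table, and the function name are illustrative assumptions only; the abstract does not list the transformations the authors actually use.

```python
from typing import Dict, List, Tuple

# Hypothetical hypernym table used to build entailed hypotheses.
HYPERNYMS: Dict[str, str] = {"dog": "animal", "rose": "flower"}

Triplet = Tuple[str, str, str]  # (premise, hypothesis, label)


def make_phl_triplets(premise: str) -> List[Triplet]:
    """Generate labeled hypotheses from one unlabeled premise sentence."""
    words = premise.rstrip(".").split()
    triplets: List[Triplet] = []

    # Entailment: replace a word with a more general term.
    general = [HYPERNYMS.get(w, w) for w in words]
    if general != words:
        triplets.append((premise, " ".join(general) + ".", "entailment"))

    # Contradiction: negate the first occurrence of "is".
    if "is" in words:
        i = words.index("is")
        negated = words[: i + 1] + ["not"] + words[i + 1 :]
        triplets.append((premise, " ".join(negated) + ".", "contradiction"))

    return triplets
```

For the premise "The dog is running." this yields one entailment pair and one contradiction pair; sentences that match neither rule produce no triplets.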

GAN | Adversarial | Attack | Generation (9 papers)

【1】 Protecting Anonymous Speech: A Generative Adversarial Network Methodology for Removing Stylistic Indicators in Text Link: https://arxiv.org/abs/2110.09495

Authors: Rishi Balakrishnan, Stephen Sloan, Anil Aswani Affiliations: University of California, Berkeley Abstract: With Internet users constantly leaving a trail of text, whether through blogs, emails, or social media posts, the ability to write and protest anonymously is being eroded because artificial intelligence, when given a sample of previous work, can match text with its author out of hundreds of possible candidates. Existing approaches to authorship anonymization, also known as authorship obfuscation, often focus on protecting binary demographic attributes rather than identity as a whole. Even those that do focus on obfuscating identity require manual feedback, lose the coherence of the original sentence, or only perform well given a limited subset of authors. In this paper, we develop a new approach to authorship anonymization by constructing a generative adversarial network that protects identity and optimizes for three different losses corresponding to anonymity, fluency, and content preservation. Our fully automatic method achieves comparable results to other methods in terms of content preservation and fluency, but greatly outperforms baselines with regard to anonymization. Moreover, our approach is able to generalize well to an open-set context and anonymize sentences from authors it has not encountered before.

【2】 Don't Judge Me by My Face: An Indirect Adversarial Approach to Remove Sensitive Information From Multimodal Neural Representation in Asynchronous Job Video Interviews Link: https://arxiv.org/abs/2110.09424

Authors: Léo Hemamou, Arthur Guillon, Jean-Claude Martin, Chloé Clavel Affiliations: EASYRECRUE, Paris, France, LIMSI-LISN, CNRS, Paris-Sud University, Paris-Saclay University F-, Orsay, France, Télécom-Paris, IP-Paris, F-, Paris, France Comments: published in ACII 2021 Abstract: Use of machine learning for automatic analysis of job interview videos has recently seen increased interest. Despite claims of fair output regarding sensitive information such as gender or ethnicity of the candidates, the current approaches rarely provide proof of unbiased decision-making, or that sensitive information is not used. Recently, adversarial methods have been shown to effectively remove sensitive information from the latent representation of neural networks. However, these methods rely on the use of explicitly labeled protected variables (e.g. gender), which cannot be collected in the context of recruiting in some countries (e.g. France). In this article, we propose a new adversarial approach to remove sensitive information from the latent representation of neural networks without the need to collect any sensitive variable. Using only a few frames of the interview, we train our model so that it cannot find the face of the candidate related to the job interview in the inner layers of the model. This, in turn, allows us to remove relevant private information from these layers. Comparing our approach to a standard baseline on a public dataset with gender and ethnicity annotations, we show that it effectively removes sensitive information from the main network. Moreover, to the best of our knowledge, this is the first application of adversarial techniques for obtaining a multimodal fair representation in the context of video job interviews. In summary, our contributions aim at improving the fairness of upcoming automatic systems that process videos of job interviews, for equality in job selection.

【3】 BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation Link: https://arxiv.org/abs/2110.09147

Authors: Thomas Scialom, Felix Hill Affiliations: Sorbonne Université, CNRS, LIP, F-, reciTAL, Paris, France, DeepMind Abstract: Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language (functions that score system output given the context and/or human reference responses) of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource to make research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness etc.), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics, particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: https://github.com/ThomasScialom/BEAMetrics

【4】 FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation Link: https://arxiv.org/abs/2110.08559

Authors: Moussa Kamal Eddine, Guokan Shang, Antoine J.-P. Tixier, Michalis Vazirgiannis
Affiliations: École Polytechnique, Linagora
Abstract: Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more reliable, but require significant computational resources. In this paper, we propose FrugalScore, an approach to learn a fixed, low-cost version of any expensive NLG metric, while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude fewer parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times fewer parameters than the original metrics. We make our trained metrics publicly available, to benefit the entire NLP community and in particular researchers and practitioners with limited resources.

【5】 Multimodal Dialogue Response Generation Link: https://arxiv.org/abs/2110.08515

Authors: Qingfeng Sun, Yujing Wang, Can Xu, Kai Zheng, Yaming Yang, Huang Hu, Fei Xu, Jessica Zhang, Xiubo Geng, Daxin Jiang
Affiliations: Microsoft STC Asia, Microsoft Research Asia
Comments: This paper has been submitted before 15th October @ 11:59pm AOE (UTC -12)
Abstract: Responding with images has been recognized as an important capability for an intelligent conversational agent. Yet existing works focus only on multimodal dialogue models that depend on retrieval-based methods, and neglect generation methods. To fill this gap, we first present a multimodal dialogue generation model, which takes the dialogue history as input and then generates a textual sequence or an image as the response. Learning such a model often requires multimodal dialogues containing both texts and images, which are difficult to obtain. Motivated by this practical challenge, we consider multimodal dialogue generation under the natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, and then all of the parameters can be well fitted using the limited training examples. Extensive experiments demonstrate that our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text and high-resolution image responses.

【6】 Analyzing Dynamic Adversarial Training Data in the Limit Link: https://arxiv.org/abs/2110.08514

Authors: Eric Wallace, Adina Williams, Robin Jia, Douwe Kiela
Affiliations: UC Berkeley, Facebook AI Research, USC
Abstract: To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data. Our analysis shows that DADC yields examples that are more difficult, more lexically and syntactically diverse, and contain fewer annotation artifacts compared to non-adversarial examples.

【7】 Improving Compositional Generalization with Self-Training for Data-to-Text Generation Link: https://arxiv.org/abs/2110.08467

Authors: Sanket Vaibhav Mehta, Jinfeng Rao, Yi Tay, Mihir Kale, Ankur Parikh, Hongtao Zhong, Emma Strubell
Affiliations: Carnegie Mellon University, Google, Google Research
Comments: 10 pages
Abstract: Data-to-text generation focuses on generating fluent natural language responses from structured semantic representations. Such representations are compositional, allowing for the combination of atomic meaning schemata in various ways to express the rich semantics in natural language. Recently, pretrained language models (LMs) have achieved impressive results on data-to-text tasks, though the extent to which these LMs generalize to new semantic representations remains unclear. In this work, we systematically study the compositional generalization of current state-of-the-art generation models in data-to-text tasks. By simulating structural shifts in the compositional Weather dataset, we show that T5 models fail to generalize to unseen structures. Next, we show that template-based input representations greatly improve the model performance, and that model scale does not trivially solve the lack of generalization. To further improve the model's performance, we propose an approach based on self-training using finetuned BLEURT for pseudo-response selection. Extensive experiments on the few-shot Weather and multi-domain SGD datasets demonstrate strong gains of our method.

【8】 How Well Do You Know Your Audience? Reader-aware Question Generation Link: https://arxiv.org/abs/2110.08445

Authors: Ian Stewart, Rada Mihalcea
Affiliations: Computer Science and Engineering, University of Michigan
Abstract: When writing, a person may need to anticipate questions from their readers, but different types of readers may ask very different types of questions. If someone is writing for advice about a problem, what question will a domain expert ask, and is this different from how a novice might react? In this paper, we address the task of reader-aware question generation. We collect a new data set of questions and posts from social media, augmented with background information about the post readers. Based on predictive analysis and descriptive differences, we find that different readers, such as experts and novices, consistently ask different types of questions. We next develop several text generation models that incorporate different types of reader background, including discrete and continuous reader representations based on the readers' prior behavior. We demonstrate that reader-aware models can perform on par with or slightly better than the text-only model in some cases, particularly in cases where a post attracts very different questions from readers of different groups. Our work has the potential to help writers anticipate the information needs of different readers.

【9】 Control Prefixes for Text Generation Link: https://arxiv.org/abs/2110.08329

Authors: Jordan Clive, Kris Cao, Marek Rei
Affiliations: Imperial College London, DeepMind, London, UK
Abstract: Prompt learning methods adapt pre-trained language models to downstream applications by using a task-specific prompt together with the input. Most of the current work on prompt learning in text generation relies on a shared dataset-level prompt for all examples in the dataset. We extend this approach and propose a dynamic method, Control Prefixes, which allows for the inclusion of conditional input-dependent information in each prompt. Control Prefixes is at the intersection of prompt learning and controlled generation, empowering the model to have finer-grained control during text generation. The method incorporates attribute-level learnable representations into different layers of a pre-trained transformer, allowing for the generated text to be guided in a particular direction. We provide a systematic evaluation of the technique and apply it to five datasets from the GEM benchmark for natural language generation (NLG). We present state-of-the-art results on several data-to-text datasets, including WebNLG.

Semi-/Weakly/Unsupervised | Uncertainty (1 paper)

【1】 Prioritization of COVID-19-related literature via unsupervised keyphrase extraction and document representation learning Link: https://arxiv.org/abs/2110.08874

Authors: Blaž Škrlj, Marko Jukič, Nika Eržen, Senja Pollak, Nada Lavrač
Affiliations: Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; University of Maribor, Slovenia
Abstract: The COVID-19 pandemic triggered a wave of novel scientific literature that is impossible to inspect and study manually in a reasonable time frame. Current machine learning methods can project such a body of literature into a vector space where similar documents are located close to each other, offering an insightful exploration of scientific papers and other knowledge sources associated with COVID-19. However, to start searching, such texts need to be appropriately annotated, which is seldom the case due to the lack of human resources. In our system, the current body of COVID-19-related literature is annotated using unsupervised keyphrase extraction, facilitating the initial queries to the latent space containing the learned document embeddings (low-dimensional representations). The solution is accessible through a web server capable of interactive search, term ranking, and exploration of potentially interesting literature. We demonstrate the usefulness of the approach via case studies from the medicinal chemistry domain.

Detection (1 paper)

【1】 Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language Link: https://arxiv.org/abs/2110.09393

Authors: Arushi Sharma, Anubha Kabra, Minni Jain
Affiliations: Optum Global Advantage, Adobe Inc., Delhi Technological University
Comments: Accepted in Elsevier Journal of Information Processing and Management. Sharma and Kabra made equal contribution
Abstract: Social media has become a bedrock for people to voice their opinions worldwide. Due to the greater sense of freedom that anonymity provides, it is possible to disregard social etiquette online and attack others without facing severe consequences, inevitably propagating hate speech. The current measures to sift online content and offset the spread of hatred do not go far enough. One contributing factor is the prevalence of regional languages in social media and the paucity of language-flexible hate speech detectors. The proposed work focuses on analyzing hate speech in Hindi-English code-switched language. Our method explores transformation techniques to capture precise text representation. To retain the structure of the data and yet use it with existing algorithms, we developed MoH or Map Only Hindi, which means "Love" in Hindi. The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words. Finally, it employs fine-tuned Multilingual BERT and MuRIL language models. We conducted several quantitative experiments on three datasets and evaluated performance using Precision, Recall, and F1 metrics. The first experiment studies the performance of MoH-mapped text with classical machine learning models and shows an average increase of 13% in F1 scores. The second compares the proposed work's scores with those of the baseline models and shows a rise in performance of 6%. Finally, the third compares the proposed MoH technique with various data simulations using the existing transliteration library. Here, MoH outperforms the rest by 15%. Our results demonstrate a significant improvement in the state-of-the-art scores on all three datasets.
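The abstract above reports results in terms of Precision, Recall, and F1. As a reminder of how these three numbers relate, here is a minimal sketch for a single positive class. The function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the paper; the paper's own averaging scheme across classes and datasets is not stated in the abstract.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Return (precision, recall, F1) for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators (no predicted or no gold positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration: 1 = hate speech, 0 = not hate speech.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0, 1, 0]
p, r, f = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With these toy labels there are three true positives, one false positive, and one false negative, so all three scores come out to 0.75.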

Recognition/Classification (3 papers)

【1】 Intent Classification Using Pre-Trained Embeddings For Low Resource Languages Link: https://arxiv.org/abs/2110.09264

Authors: Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W Black, Rajiv Ratn Shah
Affiliations: MIDAS, IIIT Delhi, India; J.P.Morgan AI Research, New York, USA; Carnegie Mellon University
Abstract: Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low-resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used in identifying the spoken intent. We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil and achieves competitive results on English. Furthermore, we present a quantitative analysis of how the performance scales with the number of training examples used per intent.

【2】 Sparse Distillation: Speeding Up Text Classification by Using Bigger Models Link: https://arxiv.org/abs/2110.08536

Authors: Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron Jaech
Affiliations: University of Southern California, Facebook AI
Abstract: Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. However, the improved inference speed may still be unsatisfactory for certain time-sensitive applications. In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model. More specifically, we consider distilling a transformer-based text classifier into a billion-parameter, sparsely-activated student model with an embedding-averaging architecture. Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks. Meanwhile, the student model achieves up to 600x speed-up on both GPUs and CPUs, compared to the teacher models. Further investigation shows that our pipeline is also effective in privacy-preserving and domain generalization settings.

【3】 Inconsistent Few-Shot Relation Classification via Cross-Attentional Prototype Networks with Contrastive Learning Link: https://arxiv.org/abs/2110.08254

Authors: Hongru Wang, Zhijing Jin, Jiarun Cao, Gabriel Pui Cheong Fung, Kam-Fai Wong
Affiliations: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; Max Planck Institute for Intelligent Systems & ETH Zürich
Abstract: Standard few-shot relation classification (RC) is designed to learn a robust classifier with only a few labeled examples for each class. However, previous works rarely investigate the effects of a different number of classes (i.e., $N$-way) and number of labeled examples per class (i.e., $K$-shot) during training vs. testing. In this work, we define a new task, inconsistent few-shot RC, where the model needs to handle the inconsistency of $N$ and $K$ between training and testing. To address this new task, we propose Prototype Network-based cross-attention contrastive learning (ProtoCACL) to capture the rich mutual interactions between the support set and query set. Experimental results demonstrate that our ProtoCACL can outperform the state-of-the-art baseline model under both inconsistent $K$ and inconsistent $N$ settings, owing to its more robust and discriminative representations. Moreover, we identify that in the inconsistent few-shot learning setting, models can achieve better performance with less data than the standard few-shot setting with carefully-selected $N$ and $K$. At the end of the paper, we provide further analyses and suggestions to systematically guide the selection of $N$ and $K$ under different scenarios.
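To make the $N$-way $K$-shot terminology concrete, the following is a minimal sketch of a plain prototypical-network classifier: each class prototype is the mean of its $K$ support embeddings, and a query is assigned to the nearest prototype. This is only the basic prototype idea, not ProtoCACL itself (the cross-attention and contrastive components are omitted), and the function names, relation labels, and two-dimensional embeddings are made up for illustration.

```python
import math


def prototypes(support):
    """Average the K support embeddings of each class into one prototype.

    `support` maps a class label to a list of equal-length embedding vectors.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos


def classify(query, protos):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# A 2-way 2-shot episode (N=2, K=2), as might be used during training.
train_episode = {
    "born_in": [[0.0, 1.0], [0.2, 0.8]],
    "works_for": [[1.0, 0.0], [0.8, 0.2]],
}
# A 3-way 1-shot episode (N=3, K=1), as might be seen at test time.
test_episode = {
    "born_in": [[0.0, 1.0]],
    "works_for": [[1.0, 0.0]],
    "capital_of": [[1.0, 1.0]],
}

print(classify([0.1, 0.9], prototypes(train_episode)))
print(classify([0.9, 0.2], prototypes(test_episode)))
```

Because prototypes are recomputed per episode, the same code runs unchanged when $N$ and $K$ differ between the two episodes; the abstract's point is that accuracy, not the mechanics, is what suffers under such a mismatch.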

Zero/Few/One-Shot | Transfer | Adaptation (1 paper)

【1】 A Unified Speaker Adaptation Approach for ASR Link: https://arxiv.org/abs/2110.08545

Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Affiliations: Nanyang Technological University, Singapore; Machine Intelligence Technology, Alibaba Group
Comments: Accepted by EMNLP 2021
Abstract: Transformer models have been used successfully in automatic speech recognition (ASR) and yield state-of-the-art results. However, their performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting of the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which to the best of our knowledge has never been explored in ASR. Specifically, we gradually prune less-contributing parameters in the model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to keep the original model performance. We conduct experiments on the Librispeech dataset. Our proposed approach brings a relative word error rate (WER) reduction of 2.74-6.52% on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction, and surpasses the finetuning method by up to 2.54% relative. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method could improve the WER by a relative 6.53% with only a few epochs of training.
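The gains above are quoted as relative WER reductions, which are easy to confuse with absolute percentage-point drops. A minimal sketch of the arithmetic follows. The function name is hypothetical, and the baseline and adapted WER values are invented solely so that the ratio matches the 20.58% figure; the abstract does not give the underlying WERs.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative reduction in percent: how much of the baseline error was removed."""
    return (baseline_wer - adapted_wer) / baseline_wer * 100.0


# Hypothetical numbers: a 10.00% baseline WER falling to 7.942% is a drop of
# about 2.06 percentage points in absolute terms, but 20.58% in relative terms.
baseline, adapted = 10.0, 7.942
print(f"absolute: {baseline - adapted:.3f} points")
print(f"relative: {relative_wer_reduction(baseline, adapted):.2f}%")
```

The same relative figure would correspond to a much smaller absolute drop on a system whose baseline WER is already low, which is why the two should not be compared directly.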

Corpora (1 paper)

【1】 The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses Link: https://arxiv.org/abs/2110.09216

Authors: Bashar Alhafni, Nizar Habash, Houda Bouamor
Affiliations: Computational Approaches to Modeling Language Lab, New York University Abu Dhabi; Carnegie Mellon University in Qatar
Abstract: Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) -- first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output. This corpus expands on Habash et al. (2019)'s Arabic Parallel Gender Corpus (APGC v1.0) by adding second person targets as well as increasing the total number of sentences over 6.5 times, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.

Representation (3 papers)

【1】 Deep Clustering For General-Purpose Audio Representations Link: https://arxiv.org/abs/2110.08895

Authors: Sreyan Ghosh, Sandesh V Katta, Ashish Seth, S. Umesh
Affiliations: Speech Lab, Dept. of Electrical Engineering, IIT Madras, Chennai, India
Comments: Submitted to ICASSP 2022
Abstract: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it utilizes an offline clustering step to provide target labels that act as pseudo-labels for solving a prediction task. We build on recent advances in self-supervised learning for computer vision and design a lightweight, easy-to-use self-supervised pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset and transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct ablation studies identifying key design choices and also make all our code and pre-trained models publicly available.

【2】 Virtual Augmentation Supported Contrastive Learning of Sentence Representations Link: https://arxiv.org/abs/2110.08552

Authors: Dejiao Zhang, Wei Xiao, Henghui Zhu, Xiaofei Ma, Andrew O. Arnold
Affiliations: AWS AI Labs, New York
Comments: 8 pages, 3 figures, 3 tables
Abstract: Despite profound successes, contrastive representation learning relies on carefully designed data augmentations using domain-specific knowledge. This challenge is magnified in natural language processing, where no general rules exist for data augmentation due to the discrete nature of natural language. We tackle this challenge by presenting a Virtual augmentation Supported Contrastive Learning of sentence representations (VaSCL). Originating from the interpretation that data augmentation essentially constructs the neighborhoods of each training instance, we in turn utilize the neighborhood to generate effective data augmentations. Leveraging the large training batch size of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors in the representation space. We then define an instance discrimination task within this neighborhood, and generate the virtual augmentation in an adversarial training manner. We assess the performance of VaSCL on a wide range of downstream tasks, and set a new state-of-the-art for unsupervised sentence representation learning.
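The neighborhood step described above (approximating an instance's neighborhood by its K-nearest in-batch neighbors) can be sketched as follows. This assumes cosine similarity as the closeness measure, which the abstract does not specify, and uses a made-up batch of two-dimensional vectors in place of real sentence embeddings; the function names are hypothetical and the adversarial augmentation step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_neighbors(batch, k):
    """For each instance, return the indices of its k most similar batch-mates."""
    neighbors = []
    for i, anchor in enumerate(batch):
        # Score every other instance in the batch; an instance is never its own neighbor.
        scored = [(cosine(anchor, other), j) for j, other in enumerate(batch) if j != i]
        scored.sort(reverse=True)
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy batch of "sentence embeddings": two loose clusters of two vectors each.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(in_batch_neighbors(batch, k=1))
```

This brute-force version costs O(B^2) similarity computations for a batch of size B; in practice the same result is usually obtained from a single similarity matrix over the batch.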

【3】 Probing as Quantifying the Inductive Bias of Pre-trained Representations Link: https://arxiv.org/abs/2110.08388

Authors: Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, Ryan Cotterell
Affiliations: ETH Zürich, University of Cambridge
Abstract: Pre-trained contextual representations have led to dramatic performance improvements on a range of downstream tasks. This has motivated researchers to quantify and understand the linguistic information encoded in them. In general, this is done by probing, which consists of training a supervised model to predict a linguistic property from said representations. Unfortunately, this definition of probing has been subject to extensive criticism, and can lead to paradoxical or counter-intuitive results. In this work, we present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task, and provide a practical avenue to do this using Bayesian inference. We apply our framework to a series of token-, arc-, and sentence-level tasks. Our results suggest that our framework solves problems of previous approaches and that fastText can offer a better inductive bias than BERT in certain situations.

Word2Vec | Text | Words (3 papers)

【1】 ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian Link: https://arxiv.org/abs/2110.09086

Authors: Narges Farokhshad, Milad Molazadeh, Saman Jamalabbasi, Hamed Babaei Giglou, Saeed Bibak
Abstract: The Persian language is an inflectional SOV language. This fact makes Persian a more uncertain language. However, using techniques such as ZWNJ recognition, punctuation restoration, and Persian Ezafe construction will lead us to a more understandable and precise language. In most works on Persian, these techniques are addressed individually. Despite that, we believe that for text refinement in Persian, all of these tasks are necessary. In this work, we propose ViraPart, a framework that uses embedded ParsBERT at its core for text clarification. First, we use the BERT variant for Persian followed by a classifier layer for the classification procedures. Next, we combine the models' outputs to produce clear text. In the end, the proposed models for ZWNJ recognition, punctuation restoration, and Persian Ezafe construction achieve macro-averaged F1 scores of 96.90%, 92.13%, and 98.50%, respectively. Experimental results show that our proposed approach is very effective in text refinement for the Persian language.

【2】 Quantifying the Task-Specific Information in Text-Based Classifications Link: https://arxiv.org/abs/2110.08931

Authors: Zining Zhu, Aparna Balagopalan, Marzyeh Ghassemi, Frank Rudzicz
Affiliations: University of Toronto, Vector Institute for Artificial Intelligence, MIT, Unity Health Toronto
Abstract: Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as the "shortcuts" inherent in the datasets, do not contribute to the task-specific information (TSI) of the classification tasks. While it is essential to look at the model performance, it is also important to understand the datasets. In this paper, we consider this question: Apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge -- modulo a set of predefined shortcuts -- that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets, saying that, apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pair.
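The 0.4 nats figure may be easier to read in bits. Information measured with the natural logarithm (nats) converts to base-2 units (bits) by dividing by ln 2. A minimal sketch of that conversion follows; the helper name `nats_to_bits` is hypothetical and is not part of the paper.

```python
import math


def nats_to_bits(nats):
    """Convert an information quantity from nats (base e) to bits (base 2)."""
    return nats / math.log(2)


# The roughly 0.4 nats gap quoted in the abstract, expressed in bits.
print(f"{nats_to_bits(0.4):.3f} bits")
```

So the reported gap between Multi-NLI and Quora Question Pair is a little under 0.6 bits per sample.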

【3】 Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems Link: https://arxiv.org/abs/2110.08464

Authors: Zhongli Li, Wenxuan Zhang, Chao Yan, Qingyu Zhou, Chao Li, Hongzhi Liu, Yunbo Cao
Affiliations: Tencent Cloud Xiaowei, Peking University
Abstract: Math Word Problem (MWP) solving needs to discover the quantitative relationships over natural language narratives. Recent work shows that existing models memorize procedures from context and rely on shallow heuristics to solve MWPs. In this paper, we look at this issue and argue that the cause is a lack of overall understanding of MWP patterns. We first investigate how a neural network understands patterns only from semantics, and observe that, if the prototype equations are the same, most problems get closer representations, and those representations that lie apart from them or close to other prototypes tend to produce wrong solutions. Inspired by this, we propose a contrastive learning approach, where the neural network perceives the divergence of patterns. We collect contrastive examples by converting the prototype equation into a tree and seeking similar tree structures. The solving model is trained with an auxiliary objective on the collected examples, resulting in the representations of problems with similar prototypes being pulled closer. We conduct experiments on the Chinese dataset Math23k and the English dataset MathQA. Our method greatly improves the performance in monolingual and multilingual settings.

Other Neural Networks | Deep Learning | Models | Modeling (22 papers)

【1】 Automatic Learning of Subword Dependent Model Scales Link: https://arxiv.org/abs/2110.09324

Authors: Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany
Notes: submitted to ICASSP 2022
Abstract: To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work we propose to optimize these scaling parameters via automatic differentiation and stochastic gradient descent, similar to the neural network model parameters. We show on the LibriSpeech (LBS) and Switchboard (SWB) corpora that the model scales for a combination of attention-based encoder-decoder acoustic model and language model can be learned as effectively as with manual tuning. We further extend this approach to subword dependent model scales, which could not be tuned manually, and which lead to a 7% improvement on LBS and 3% on SWB. We also show that joint training of scales and model parameters is possible and gives an additional 6% improvement on LBS.
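
The log-linear combination mentioned in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation: the feature values, the function names, and the use of a plain cross-entropy objective over a small hypothesis list are assumptions made for illustration, and only global (not subword-dependent) scales are shown. Each hypothesis has one log score per model, the combined score is the scale-weighted sum, and the scales are updated by gradient descent.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combined_scores(features, scales):
    # features[i][k] is the log score of hypothesis i under model k
    return [sum(l * f for l, f in zip(scales, feats)) for feats in features]

def train_scales(features, correct, scales, lr=0.05, steps=200):
    """Gradient descent on -log softmax(combined score)[correct]."""
    scales = list(scales)
    for _ in range(steps):
        probs = softmax(combined_scores(features, scales))
        for k in range(len(scales)):
            expected = sum(p * feats[k] for p, feats in zip(probs, features))
            # d(loss)/d(scale_k) = E_p[feature_k] - feature_k(correct)
            scales[k] -= lr * (expected - features[correct][k])
    return scales

# Hypothetical (acoustic, language-model) log scores for three hypotheses.
features = [(-1.0, -6.0), (-1.2, -2.0), (-3.0, -1.5)]
initial = [1.0, 0.0]          # acoustic model only
learned = train_scales(features, correct=1, scales=initial)

def best(scales):
    scores = combined_scores(features, scales)
    return scores.index(max(scores))
```

With the acoustic score alone, hypothesis 0 wins; after training, the language-model scale has grown enough for the reference hypothesis 1 to rank first.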

【2】 Efficient Sequence Training of Attention Models using Approximative Recombination Link: https://arxiv.org/abs/2110.09245

Authors: Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney
Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany
Notes: submitted to ICASSP 2022
Abstract: Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obtained from beam search. This work proposes to perform (approximative) recombinations of hypotheses during beam search if they share a common local history. The error that is incurred by the approximation is analyzed, and it is shown that using this technique the effective beam size can be increased by several orders of magnitude without significantly increasing the computational requirements. Lastly, it is shown that this technique can be used to effectively perform sequence discriminative training for attention-based encoder-decoder acoustic models on the LibriSpeech task.
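
A rough sketch of what recombining hypotheses with a common local history could look like is given below. The abstract does not specify the details, so several choices here are assumptions: hypotheses are merged when their last `history` labels match, the merged score is the log of the summed probabilities, and the best-scoring member is kept as the representative. The function names are hypothetical.

```python
import math

def logsumexp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def recombine(hypotheses, history=2):
    """Merge beam entries that share their last `history` labels.

    hypotheses: list of (label_tuple, log_prob) pairs.
    Returns one (representative_labels, summed_log_prob) pair per local history.
    """
    merged = {}
    for labels, log_prob in hypotheses:
        key = labels[-history:]
        if key not in merged:
            merged[key] = (labels, log_prob, log_prob)
        else:
            rep, rep_score, total = merged[key]
            if log_prob > rep_score:
                rep, rep_score = labels, log_prob
            merged[key] = (rep, rep_score, logsumexp(total, log_prob))
    return [(rep, total) for rep, _, total in merged.values()]

beam = [
    (("a", "b", "c"), math.log(0.2)),
    (("x", "b", "c"), math.log(0.1)),  # same last two labels as the first entry
    (("a", "b", "d"), math.log(0.3)),
]
recombined = recombine(beam, history=2)
```

Here three beam entries collapse into two, and the merged entry carries the probability mass of both of its members, which is how a fixed-size beam can stand in for many more full hypotheses.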

【3】 Schrödinger's Tree -- On Syntax and Neural Language Models Link: https://arxiv.org/abs/2110.08887

Authors: Artur Kulmizev, Joakim Nivre
Affiliations: Uppsala University
Notes: preprint, submitted to Frontiers in Artificial Intelligence: Perspectives for Natural Language Processing between AI, Linguistics and Cognitive Science
Abstract: In the last half-decade, the field of natural language processing (NLP) has undergone two major transitions: the switch to neural networks as the primary modeling paradigm and the homogenization of the training regime (pre-train, then fine-tune). Amidst this process, language models have emerged as NLP's workhorse, displaying increasingly fluent generation capabilities and proving to be an indispensable means of knowledge transfer downstream. Due to the otherwise opaque, black-box nature of such models, researchers have employed aspects of linguistic theory in order to characterize their behavior. Questions central to syntax -- the study of the hierarchical structure of language -- have factored heavily into such work, yielding invaluable insights about models' inherent biases and their ability to make human-like generalizations. In this paper, we attempt to take stock of this growing body of literature. In doing so, we observe a lack of clarity across numerous dimensions, which influences the hypotheses that researchers form, as well as the conclusions they draw from their findings. To remedy this, we urge researchers to make careful considerations when investigating coding properties, selecting representations, and evaluating via downstream tasks. Furthermore, we outline the implications of the different types of research questions exhibited in studies on syntax, as well as the inherent pitfalls of aggregate metrics. Ultimately, we hope that our discussion adds nuance to the prospect of studying language models and paves the way for a less monolithic perspective on syntax in this context.

【4】 Predicting the Performance of Multilingual NLP Models Link: https://arxiv.org/abs/2110.08875

Authors: Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury
Affiliations: The University of Texas at Austin
Abstract: Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages. The languages that these models are evaluated on, however, are very few in number, and it is unlikely that evaluation datasets will cover all the languages that these models support. Potential solutions to the costly problem of dataset creation are to translate datasets to new languages or to use template-filling based techniques for creation. This paper proposes an alternative solution for evaluating a model across languages, which makes use of the existing performance scores of the model on languages that a particular task has test sets for. We train a predictor on these performance scores and use this predictor to predict the model's performance in different evaluation settings. Our results show that our method is effective in filling the gaps in the evaluation for an existing set of languages, but might require additional improvements if we want it to generalize to unseen languages.

【5】 Reminding the Incremental Language Model via Data-Free Self-Distillation Link: https://arxiv.org/abs/2110.08745

Authors: Han Wang, Ruiliu Fu, Chengzhang Li, Xuejun Zhang, Jun Zhou, Yonghong Yan
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, China; University of Chinese Academy of Sciences, Beijing, China
Notes: 8 pages, 5 figures
Abstract: Incremental language learning with pseudo-data can alleviate catastrophic forgetting in neural networks. However, to obtain better performance, former methods have higher demands for pseudo-data of the previous tasks. The performance dramatically decreases when fewer pseudo-data are employed. In addition, the distribution of pseudo-data gradually deviates from the real data with the sequential learning of different tasks. The deviation grows as more tasks are learned, which results in more serious catastrophic forgetting. To address these issues, we propose reminding the incremental language model via data-free self-distillation (DFSD), which includes self-distillation based on the Earth Mover's Distance and hidden data augmentation. By estimating the knowledge distribution in all layers of GPT-2 and transferring it from the teacher model to the student model, the self-distillation based on the Earth Mover's Distance can significantly reduce the demand for pseudo-data. Hidden data augmentation can greatly alleviate the catastrophic forgetting caused by deviations by modeling the generation of pseudo-data as a hidden data augmentation process, where each sample is a mixture of all trained task data. The experimental results demonstrate that our DFSD can exceed the previous state-of-the-art methods even if the maximum decrease in pseudo-data is 90%.

【6】 Back to Reality: Leveraging Pattern-driven Modeling to Enable Affordable Sentiment Dependency Learning Link: https://arxiv.org/abs/2110.08604

Authors: Heng Yang, Biqing Zeng, Mayi Xu, Tianxing Wang
Affiliations: School of Computer, South China Normal University, China; School of Software, South China Normal University, China; Linklogis Co., Ltd., Shenzhen, China
Abstract: Aspect-based Sentiment Classification (ABSC) is a challenging sub-task of traditional sentiment analysis. Due to the difficulty of handling potential correlations among sentiment polarities of multiple aspects, i.e., sentiment dependency, recent popular works tend to exploit syntactic information to guide sentiment dependency parsing. However, syntax information (e.g., syntactic dependency trees) usually consumes expensive computational resources because of the operations on the adjacency matrix. Instead, having found that most sentiment dependency occurs between adjacent aspects, we define consecutive aspects with the same sentiment as a sentiment cluster. Motivated by this finding, we propose sentiment patterns (SP) to guide the model's dependency learning. Thereafter, we introduce the local sentiment aggregating (LSA) mechanism to focus on learning the sentiment dependency in the sentiment cluster. The LSA is more efficient than existing dependency tree-based models because no additional dependency matrix has to be constructed and modeled. Furthermore, we propose differential weighting for aggregation window building to measure the importance of sentiment dependency. Experiments on four public datasets show that our models achieve state-of-the-art performance, with particular improvement in learning sentiment clusters.

【7】 On the Robustness of Reading Comprehension Models to Entity Renaming Link: https://arxiv.org/abs/2110.08555

Authors: Jun Yan, Yang Xiao, Sagnik Mukherjee, Bill Yuchen Lin, Robin Jia, Xiang Ren
Affiliations: University of Southern California, Fudan University, IIT Kanpur
Abstract: We study the robustness of machine reading comprehension (MRC) models to entity renaming -- do models make more wrong predictions when answer entities have different names? Such failures would indicate that models are overly reliant on entity knowledge to answer questions, and therefore may generalize poorly when facts about the world change or questions are asked about novel entities. To systematically audit model robustness, we propose a general and scalable method to replace person names with names from a variety of sources, ranging from common English names to names from other languages to arbitrary strings. Across four datasets and three pretrained model architectures, MRC models consistently perform worse when entities are renamed, with particularly large accuracy drops on datasets constructed via distant supervision. We also find large differences between models: SpanBERT, which is pretrained with span-level masking, is more robust than RoBERTa, despite having similar accuracy on unperturbed test data. Inspired by this, we experiment with span-level and entity-level masking as a continual pretraining objective and find that they can further improve the robustness of MRC models.
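
The renaming perturbation described in the abstract above can be pictured with a minimal sketch. This is an illustration under assumptions, not the authors' pipeline: the example record, its field names, and the `rename_entity` helper are hypothetical, and a real audit would also need name detection and coreference handling. The key point is that the name is replaced consistently in the context, the question, and the answer.

```python
import re

def rename_entity(example, old_name, new_name):
    """Replace every whole-word occurrence of old_name in all text fields."""
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    return {field: pattern.sub(new_name, text) for field, text in example.items()}

# Hypothetical MRC example; the field names are assumptions.
example = {
    "context": "Alice Smith founded the lab in 2010. Alice Smith still leads it.",
    "question": "Who founded the lab?",
    "answer": "Alice Smith",
}
perturbed = rename_entity(example, "Alice Smith", "Zhang Wei")
```

A model that truly reads the passage should answer the perturbed example as well as the original one, since the answer is still stated verbatim in the context.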

【8】 PAGnol: An Extra-Large French Generative Model Link: https://arxiv.org/abs/2110.08554

Authors: Julien Launay, E. L. Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah
Affiliations: LightOn; LPENS, École Normale Supérieure; Inria, Paris; lair.lighton.aipagnol
Abstract: Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.

【9】 Learning to Solve Complex Tasks by Talking to Agents Link: https://arxiv.org/abs/2110.08542

Authors: Tushar Khot, Kyle Richardson, Daniel Khashabi, Ashish Sabharwal
Affiliations: Allen Institute for AI, Seattle, WA, U.S.A.
Abstract: Humans often solve complex problems by interacting (in natural language) with existing agents, such as AI assistants, that can solve simpler sub-tasks. These agents themselves can be powerful systems built using extensive resources and privately held data. In contrast, common NLP benchmarks aim for the development of self-sufficient models for every task. To address this gap and facilitate research towards "green" AI systems that build upon existing agents, we propose a new benchmark called CommaQA that contains three kinds of complex reasoning tasks that are designed to be solved by "talking" to four agents with different capabilities. We demonstrate that state-of-the-art black-box models, which are unable to leverage existing agents, struggle on CommaQA (exact match score only reaches 40 points) even when given access to the agents' internal knowledge and gold fact supervision. On the other hand, models using gold question decomposition supervision can indeed solve CommaQA to a high accuracy (over 96% exact match) by learning to utilize the agents. Even these additional supervision models, however, do not solve our compositional generalization test set. Finally, the end goal of learning to solve complex tasks by communicating with existing agents without relying on any additional supervision remains unsolved, and we hope CommaQA serves as a novel benchmark to enable the development of such systems.

【10】 Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora Link: https://arxiv.org/abs/2110.08534

Authors: Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, Xiang Ren
Affiliations: University of Southern California, Amazon Inc.
Notes: 8 pages
Abstract: Pretrained language models (PTLMs) are typically learned over a large, static corpus and further fine-tuned for various downstream tasks. However, when deployed in the real world, a PTLM-based model must deal with data from a new domain that deviates from what the PTLM was initially trained on, or newly emerged data that contains out-of-distribution information. In this paper, we study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data. Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms, and keep track of the downstream task performance (after fine-tuning) to analyze its ability to acquire new knowledge and preserve learned knowledge. Our experiments show that continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach. We further show that continual pretraining improves generalization when training and testing data of downstream tasks are drawn from different time steps, but does not improve it when they are from the same time steps. We believe our problem formulation, methods, and analysis will inspire future studies towards continual pretraining of language models.
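
Logit distillation, named in the abstract above as the most effective approach, is commonly written as a KL divergence between temperature-softened teacher and student output distributions. The sketch below shows that generic form only; the exact loss, temperature, and weighting used in the paper are not given in the abstract, so the function names and the default temperature are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Skip zero-probability teacher entries, which contribute nothing to the KL.
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]   # logits of the model before the update
same = logit_distillation_loss(teacher, [2.0, 0.5, -1.0])
drifted = logit_distillation_loss(teacher, [-1.0, 0.5, 2.0])
```

In a continual pretraining loop, a term like this would be added to the language modeling loss so that the updated model is penalized for drifting away from the predictions of its earlier checkpoint.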

【11】 Sharpness-Aware Minimization Improves Language Model Generalization Link: https://arxiv.org/abs/2110.08529

Authors: Dara Bahri, Hossein Mobahi, Yi Tay
Affiliations: Google Research, Mountain View, CA, USA
Abstract: The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
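
For readers unfamiliar with SAM, one update step can be sketched on a toy quadratic loss. The sketch follows the usual two-gradient form of SAM (perturb the weights along the normalized gradient by a radius `rho`, then descend using the gradient taken at the perturbed point); the loss function, learning rate, and `rho` value are illustrative assumptions and have nothing to do with the language model experiments in the paper.

```python
import math

def loss(w):
    # Toy quadratic with very different curvature along the two axes.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    return [w[0], 10.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # 1) Move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 2) Take the gradient there and apply it to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [2.0, 1.0]
start = loss(w)
for _ in range(100):
    w = sam_step(w)
end = loss(w)
```

Each step costs two gradient evaluations instead of one, which is the main source of SAM's overhead relative to plain gradient descent.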

【12】 An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-Trained Language Models Link: https://arxiv.org/abs/2110.08527

Authors: Nicholas Meade, Elinor Poole-Dayan, Siva Reddy
Affiliations: Mila, McGill University; Facebook CIFAR AI Chair
Abstract: Recent work has shown that pre-trained language models capture social biases from the text corpora they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed debiasing techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three different bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) CDA and Self-Debias are the strongest of the debiasing techniques, obtaining improved scores on most of the bias benchmarks; (2) current debiasing techniques do not generalize well beyond gender bias; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are usually accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation is effective.
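
Of the five techniques listed in the abstract above, Counterfactual Data Augmentation is the easiest to picture in code: each training sentence is paired with a copy in which bias-attribute words are swapped. The sketch below is a deliberately tiny illustration; the word list, the tokenization by whitespace, and the helper names are assumptions, and a practical implementation would need a curated word-pair list and handling of ambiguous forms such as "her".

```python
# Hypothetical, very small swap list for illustration only.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}

def counterfactual(sentence):
    out = []
    for token in sentence.split():
        core = token.strip(".,!?;:")          # word without surrounding punctuation
        swapped = SWAPS.get(core.lower())
        if swapped is None:
            out.append(token)
            continue
        if core[0].isupper():
            swapped = swapped.capitalize()    # keep sentence-initial capitalization
        out.append(token.replace(core, swapped, 1))
    return " ".join(out)

def augment(corpus):
    """Return the corpus plus the counterfactual copies that differ from it."""
    return corpus + [c for c in map(counterfactual, corpus) if c not in corpus]

augmented = augment(["she is a doctor", "the weather is nice"])
```

Sentences without any listed word are left as they are and are not duplicated, so the augmented corpus grows only where a swap actually applies.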

【13】 A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models Link: https://arxiv.org/abs/2110.08484

Authors: Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren
Affiliations: University of Southern California, Microsoft Corporation
Notes: Preprint
Abstract: Large pretrained vision-language (VL) models can learn a new task with a handful of examples or generalize to a new task without fine-tuning. However, these gigantic VL models are hard to deploy for real-world applications due to their impractically huge model size and slow inference speed. In this work, we propose FewVLM, a few-shot prompt-based learner on vision-language tasks. We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM), and introduce simple prompts to improve zero-shot and few-shot performance on VQA and image captioning. Experimental results on five VQA and captioning datasets show that FewVLM outperforms Frozen, which is 31 times larger than ours, by 18.2 percentage points on zero-shot VQAv2 and achieves comparable results to a 246× larger model, PICa. We observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) MaskedLM helps few-shot VQA tasks while PrefixLM boosts captioning performance, and (3) performance significantly increases when training set size is small.

【14】 On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark Link: https://arxiv.org/abs/2110.08466

Authors: Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, Minlie Huang
Affiliations: The CoAI group, DCST, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
Abstract: Dialogue safety problems severely limit the real-world deployment of neural conversational models and have recently attracted great research interest. We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors that are unique to the human-bot dialogue setting, with a focus on context-sensitive unsafety, which is under-explored in prior works. To spur research in this direction, we compile DiaSafety, a dataset of 6 unsafe categories with rich context-sensitive unsafe examples. Experiments show that existing utterance-level safety guarding tools fail catastrophically on our dataset. As a remedy, we train a context-level dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. With our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems are still stuck in context-sensitive safety problems.

【15】 A Short Study on Compressing Decoder-Based Language Models Link: https://arxiv.org/abs/2110.08460

Authors: Tianda Li, Yassir El Mesbahi, Ivan Kobyzev, Ahmad Rashid, Atif Mahmud, Nithin Anchuri, Habib Hajimolahoseini, Yang Liu, Mehdi Rezagholizadeh
Affiliations: Huawei Noah's Ark Lab
Abstract: Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks. State-of-the-art PLMs, however, are too large to be used on edge devices. As a result, the topic of model compression has attracted increasing attention in the NLP community. Most of the existing works focus on compressing encoder-based models (tiny-BERT, distilBERT, distilRoBERTa, etc.); however, to the best of our knowledge, the compression of decoder-based models (such as GPT-2) has not been investigated much. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current state-of-the-art knowledge distillation techniques to improve fine-tuning of DistilGPT-2; 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against the distillation-based method (DistilGPT2). The training time of our compressed model is significantly less than that of DistilGPT-2, but it can achieve better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.

【16】 Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER Link: https://arxiv.org/abs/2110.08454

Authors: Dong-Ho Lee, Mahak Agarwal, Akshen Kadakia, Jay Pujara, Xiang Ren
Affiliations: Department of Computer Science, University of Southern California
Notes: 7 pages, 4 figures, 4 tables
Abstract: Recent advances in prompt-based learning have shown impressive results on few-shot text classification tasks by using cloze-style language prompts. There have been attempts at prompt-based learning for NER which use manually designed templates to predict entity types. However, these two-step methods may suffer from error propagation (from entity span detection), need to prompt for all possible text spans, which is costly, and neglect the interdependency when predicting labels for different spans in a sentence. In this paper, we present a simple demonstration-based learning method for NER, which augments the prompt (learning context) with a few task demonstrations. Such demonstrations help the model learn the task better under low-resource settings and allow for span detection and classification over all tokens jointly. Here, we explore entity-oriented demonstration, which selects an appropriate entity example per entity type, and instance-oriented demonstration, which retrieves a similar instance example. Through extensive experiments, we find empirically that showing an entity example per entity type, along with its example sentence, can improve the performance in both in-domain and cross-domain settings by 1-3 F1 points.

【17】 What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression Link: https://arxiv.org/abs/2110.08419

Authors: Mengnan Du, Subhabrata Mukherjee, Yu Cheng, Milad Shokouhi, Xia Hu, Ahmed Hassan Awadallah Affiliations: Texas A&M University, Microsoft Research, Rice University Abstract: Recent works have focused on compressing pre-trained language models (PLMs) like BERT, where the major focus has been to improve the compressed model performance for downstream tasks. However, there has been no study analyzing the impact of compression on the generalizability and robustness of these models. Towards this end, we study two popular model compression techniques, knowledge distillation and pruning, and show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets, although they obtain similar performance on in-distribution development sets for a task. Further analysis indicates that the compressed models overfit on the easy samples and generalize poorly on the hard ones. We further leverage this observation to develop a regularization strategy for model compression based on sample uncertainty. Experimental results on several natural language understanding tasks demonstrate that our mitigation framework improves both the adversarial generalization and the in-distribution task performance of the compressed models.

【18】 Training Conversational Agents with Generative Conversational Networks Link: https://arxiv.org/abs/2110.08383

Authors: Yen-Ting Lin, Alexandros Papangelis, Seokhwan Kim, Dilek Hakkani-Tur Comments: Accepted at WeCNLP 2021 Abstract: Rich, open-domain textual data available on the web resulted in great advancements for language processing. However, while that data may be suitable for language processing tasks, it is mostly non-conversational, lacking many phenomena that appear in human interactions, and this is one of the reasons why we still have many unsolved challenges in conversational AI. In this work, we attempt to address this by using Generative Conversational Networks to automatically generate data and train social conversational agents. We evaluate our approach on TopicalChat with automatic metrics and human evaluators, showing that with 10% of seed data it performs close to the baseline that uses 100% of the data.

【19】 Learning with Noisy Labels by Targeted Relabeling Link: https://arxiv.org/abs/2110.08355

Authors: Derek Chen, Zhou Yu, Samuel R. Bowman Affiliations: ASAPP Inc., New York, NY; Dept. of Computer Science, Columbia University, NY; New York University, NY Comments: 14 pages, 5 figures Abstract: Crowdsourcing platforms are often used to collect datasets for training deep neural networks, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of this noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. Second, prior works have also considered using the entire annotation budget to label as many examples as possible and subsequently applying denoising algorithms to implicitly clean up the dataset. We propose an approach which instead reserves a fraction of annotations to explicitly relabel highly probable labeling errors. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify specific examples that appear most likely to be incorrect, which we spend the remaining budget to relabel. Experiments across three model variations and four natural language processing tasks show our approach outperforms both label aggregation and advanced denoising methods designed to handle noisy labels when allocated the same annotation budget.
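The relabeling step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_for_relabeling`, the use of the model's probability for the currently assigned label as the error signal, and the toy numbers are all assumptions made here for clarity.

```python
def select_for_relabeling(assigned_label_probs, budget):
    """Pick the examples whose assigned label the trained model trusts least.

    assigned_label_probs: for each example, the probability the model gives
        to the label that annotators originally assigned (hypothetical input).
    budget: number of examples we can afford to send back for relabeling.
    """
    # Rank example indices from least to most trusted assigned label.
    ranked = sorted(range(len(assigned_label_probs)),
                    key=lambda i: assigned_label_probs[i])
    # Spend the remaining annotation budget on the most suspicious examples.
    return ranked[:budget]


# Toy usage: five examples, budget for two relabels.
probs = [0.95, 0.10, 0.80, 0.30, 0.99]
suspects = select_for_relabeling(probs, budget=2)
print(suspects)
```

Ranking by the probability of the assigned label is only one possible criterion; the abstract states that the model identifies likely errors but does not say how.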

【20】 Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet Link: https://arxiv.org/abs/2110.08352

Authors: Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra Affiliation: Facebook AI Abstract: From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs model size, researchers are trapped in a dilemma of optimizing model accuracy by training and fine-tuning models for each individual edge device while keeping the training GPU-hours tractable. In this paper, we propose Omni-sparsity DNN, where a single neural network can be pruned to generate optimized models for a large range of model sizes. We develop training strategies for Omni-sparsity DNN that allow it to find models along the Pareto front of word-error-rate (WER) vs model size while keeping the training GPU-hours to no more than that of training one singular model. We demonstrate the Omni-sparsity DNN with streaming E2E ASR models. Our results show great savings in training time and resources with similar or better accuracy on LibriSpeech compared to individually pruned sparse models: 2%-6.6% better WER on Test-other.

【21】 Boosting coherence of language models Link: https://arxiv.org/abs/2110.08294

Authors: Nikolay Malkin, Zhen Wang, Nebojsa Jojic Affiliations: Mila, Université de Montréal; Ohio State University; Microsoft Research Abstract: Naturality of long-term information structure -- coherence -- remains a challenge in language generation. Large language models have insufficiently learned such structure, as their long-form generations differ from natural text in measures of coherence. To alleviate this divergence, we propose coherence boosting, an inference procedure that increases the effect of distant context on next-token prediction. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. We also find that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
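One way to picture an inference procedure that raises the influence of distant context is to contrast the model's prediction given the full context with its prediction given only a short, recent context. The sketch below is an assumption made for illustration: the abstract does not give a formula, and the log-linear combination, the weight `alpha`, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def boost(full_logits, short_logits, alpha):
    """Up-weight what the full context predicts beyond the short context.

    full_logits:  next-token logits given the whole context (assumed input).
    short_logits: next-token logits given only recent tokens (assumed input).
    alpha:        boosting strength; alpha = 0 recovers the full-context model.
    """
    full_lp = log_softmax(full_logits)
    short_lp = log_softmax(short_logits)
    combined = [(1 + alpha) * f - alpha * s for f, s in zip(full_lp, short_lp)]
    return log_softmax(combined)  # renormalize to a proper distribution


# Toy vocabulary of three tokens: token 1 is supported mainly by distant context.
full = [2.0, 1.0, 0.0]
short = [2.0, 0.0, 1.0]
base = log_softmax(full)
boosted = boost(full, short, alpha=1.0)
print(math.exp(base[1]), math.exp(boosted[1]))
```

In this toy case the token that only the full context favors gains probability after boosting, which is the qualitative effect the abstract describes.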

【22】 ASR4REAL: An extended benchmark for speech models Link: https://arxiv.org/abs/2110.08583

Authors: Morgane Riviere, Jade Copet, Gabriel Synnaeve Affiliation: Facebook AI Research Comments: Submitted to ICASSP 2022 Abstract: Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We have found that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by accent, and even more important ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech, and in this precise context even a language model trained on a dataset as big as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.

Other (16 papers)

【1】 Measuring Cognitive Status from Speech in a Smart Home Environment Link: https://arxiv.org/abs/2110.09421

Authors: Kathleen C. Fraser, Majid Komeili Abstract: The population is aging, and becoming more tech-savvy. The United Nations predicts that by 2050, one in six people in the world will be over age 65 (up from one in 11 in 2019), and this increases to one in four in Europe and Northern America. Meanwhile, the proportion of American adults over 65 who own a smartphone rose 24 percentage points from 2013 to 2017, and the majority have Internet in their homes. Smart devices and smart home technology have profound potential to transform how people age, their ability to live independently in later years, and their interactions with their circle of care. Cognitive health is a key component of independence and well-being in old age, and smart homes present many opportunities to measure cognitive status in a continuous, unobtrusive manner. In this article, we focus on speech as a measurement instrument for cognitive health. Existing methods of cognitive assessment suffer from a number of limitations that could be addressed through smart home speech sensing technologies. We begin with a brief tutorial on measuring cognitive status from speech, including some pointers to useful open-source software toolboxes for the interested reader.
We then present an overview of the preliminary results from pilot studies on active and passive smart home speech sensing for the measurement of cognitive health, and conclude with some recommendations and challenge statements for the next wave of work in this area, to help overcome both technical and ethical barriers to success.

【2】 LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech Link: https://arxiv.org/abs/2110.09103

Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda Affiliations: Nagoya University, Japan; National Institute of Informatics, Japan Comments: Submitted to ICASSP 2022. Code available at: this https URL Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We reflect recent advances in listener-dependent (LD) modeling, including design choices of the model architecture, and propose two inference methods that provide more stable results and efficient computation. We conduct systematic experiments on the voice conversion challenge (VCC) 2018 benchmark and a newly collected large-scale MOS dataset, providing an in-depth analysis of the proposed framework. Results show that the mean listener inference method is a better way to utilize the mean scores, and its effectiveness is more obvious when there are more ratings per sample.

【3】 Using Natural Language Processing to Understand Reasons and Motivators Behind Customer Calls in Financial Domain Link: https://arxiv.org/abs/2110.09094

Authors: Ankit Patil, Ankush Chopra, Sohom Ghosh, Vamshi Vadla Affiliation: Fidelity Investments, AI CoE, Bengaluru, India Comments: Accepted at ICCMDE-2021. To be published in Springer - Lecture Notes on Data Engineering and Communications Technologies Abstract: In this era of abundant digital information, customer satisfaction has become one of the prominent factors in the success of any business. Customers want a one-click solution for almost everything. They tend to become dissatisfied if they have to call about something which they could have done online. Moreover, incoming calls are a high-cost component for any business. Thus, it is essential to develop a framework capable of mining the reasons and motivators behind customer calls. This paper proposes two models. The first is an attention-based stacked bidirectional Long Short Term Memory Network followed by Hierarchical Clustering, for extracting these reasons from transcripts of inbound calls. The second is a set of ensemble models based on probabilities from Support Vector Machines and Logistic Regression, capable of detecting factors that led to these calls. Extensive evaluation proves the effectiveness of these models.

【4】 Ranking Facts for Explaining Answers to Elementary Science Questions Link: https://arxiv.org/abs/2110.09036

Authors: Jennifer D'Souza, Isaiah Onando Mulang', Soeren Auer Affiliations: Germany; University of Bonn Comments: 25 pages, 5 figures, accepted for publication in NLE Abstract: In multiple-choice exams, students select one answer from among typically four choices and can explain why they made that particular choice. Students are good at understanding natural language questions and, based on their domain knowledge, can easily infer the question's answer by 'connecting the dots' across various pertinent facts. Considering automated reasoning for elementary science question answering, we address the novel task of generating explanations for answers from human-authored facts. For this, we examine the practically scalable framework of feature-rich support vector machines leveraging domain-targeted, hand-crafted features. Explanations are created from a human-annotated set of nearly 5,000 candidate facts in the WorldTree corpus. Our aim is to obtain better matches for valid facts of an explanation for the correct answer of a question over the available fact candidates. To this end, our features offer a comprehensive linguistic and semantic unification paradigm. The machine learning problem is the preference ordering of facts, for which we test pointwise regression versus pairwise learning-to-rank.
Our contributions are: (1) a case study in which two preference ordering approaches are systematically compared; (2) it is a practically competent approach that can outperform some variants of BERT-based reranking models; and (3) the human-engineered features make it an interpretable machine learning model for the task.

【5】 GNN-LM: Language Modeling based on Global Contexts via GNN Link: https://arxiv.org/abs/2110.08743

Authors: Yuxian Meng, Shi Zong, Xiaoya Li, Xiaofei Sun, Tianwei Zhang, Fei Wu, Jiwei Li Affiliations: Shannon.AI, Nanjing University, Nanyang Technological University, Zhejiang University Abstract: Inspired by the notion that "to copy is easier than to memorize", in this work, we introduce GNN-LM, which extends the vanilla neural language model (LM) by allowing it to reference similar contexts in the entire training corpus. We build a directed heterogeneous graph between an input context and its semantically related neighbors selected from the training corpus, where nodes are tokens in the input context and retrieved neighbor contexts, and edges represent connections between nodes. Graph neural networks (GNNs) are constructed upon the graph to aggregate information from similar contexts to decode the token. This learning paradigm provides direct access to the reference contexts and helps improve a model's generalization ability. We conduct comprehensive experiments to validate the effectiveness of the GNN-LM: GNN-LM achieves a new state-of-the-art perplexity of 14.8 on WikiText-103 (a 4.5 point improvement over its counterpart of the vanilla LM model) and shows substantial improvement on One Billion Word and Enwiki8 datasets against strong baselines. In-depth ablation studies are performed to understand the mechanics of GNN-LM.

【6】 n-stage Latent Dirichlet Allocation: A Novel Approach for LDA Link: https://arxiv.org/abs/2110.08591

Authors: Zekeriya Anil Guven, Banu Diri, Tolgahan Cakaloglu Affiliations: Department of Computer Engineering, Ege University, Izmir, Turkey; Yildiz Technical University, Istanbul, Turkey; Walmart Global Tech, Dallas, USA Comments: Published in: 2019 4th International Conference on Computer Science and Engineering (UBMK). This study is an extended version of "Comparison of Topic Modeling Methods for Type Detection of Turkish News" this http URL . Please cite this IEEE paper Abstract: Nowadays, data analysis has become a problem as the amount of data is constantly increasing. In order to overcome this problem in textual data, many models and methods are used in natural language processing. The topic modeling field is one of these methods. Topic modeling allows determining the semantic structure of a text document. Latent Dirichlet Allocation (LDA) is the most common method among topic modeling methods. In this article, the proposed n-stage LDA method, which can enable the LDA method to be used more effectively, is explained in detail. The positive effect of the method has been demonstrated by the applied English and Turkish studies. Since the method focuses on reducing the word count in the dictionary, it can be used language-independently. You can access the open-source code of the method and the example: https://github.com/anil1055/n-stage_LDA

【7】 Tackling Multi-Answer Open-Domain Questions via a Recall-then-Verify Framework Link: https://arxiv.org/abs/2110.08544

Authors: Zhihong Shao, Minlie Huang Affiliations: The CoAI group, DCST, Tsinghua University; Institute for Artificial Intelligence; State Key Lab of Intelligent Technology and Systems; Beijing National Research Center for Information Science and Technology; Tsinghua University, Beijing, China Abstract: Open domain questions are likely to be open-ended and ambiguous, leading to multiple valid answers. Existing approaches typically adopt the rerank-then-read framework, where a reader reads top-ranking evidence to predict answers. According to our empirical analyses, this framework is faced with three problems: to leverage the power of a large reader, the reranker is forced to select only a few relevant passages that cover diverse answers, which is non-trivial due to unknown effect on the reader's performance; the small reading budget also prevents the reader from making use of valuable retrieved evidence filtered out by the reranker; besides, as the reader generates predictions all at once based on all selected evidence, it may learn pathological dependencies among answers, i.e., whether to predict an answer may also depend on evidence of the other answers. To avoid these problems, we propose to tackle multi-answer open-domain questions with a recall-then-verify framework, which separates the reasoning process of each answer so that we can make better use of retrieved evidence while also leveraging the power of large models under the same memory constraint.
Our framework achieves new state-of-the-art results on two multi-answer datasets, and predicts significantly more gold answers than a rerank-then-read system with an oracle reranker.

【8】 Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher Link: https://arxiv.org/abs/2110.08532

Authors: Mehdi Rezagholizadeh, Aref Jafari, Puneeth Salad, Pranav Sharma, Ali Saheb Pasand, Ali Ghodsi Affiliations: Huawei Noah's Ark Lab, University of Waterloo Abstract: With the ever growing scale of neural models, knowledge distillation (KD) attracts more attention as a prominent tool for neural model compression. However, there are counterintuitive observations in the literature showing some challenging limitations of KD. A case in point is that the best performing checkpoint of the teacher might not necessarily be the best teacher for training the student in KD. Therefore, one important question would be how to find the best checkpoint of the teacher for distillation. Searching through the checkpoints of the teacher would be a very tedious and computationally expensive process, which we refer to as the checkpoint-search problem. Moreover, another observation is that larger teachers might not necessarily be better teachers in KD, which is referred to as the capacity-gap problem. To address these challenging problems, in this work, we introduce our progressive knowledge distillation (Pro-KD) technique, which defines a smoother training path for the student by following the training footprints of the teacher instead of solely relying on distilling from a single mature fully-trained teacher. We demonstrate that our technique is quite effective in mitigating the capacity-gap problem and the checkpoint-search problem.
We evaluate our technique using a comprehensive set of experiments on different tasks such as image classification (CIFAR-10 and CIFAR-100), natural language understanding tasks of the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) using BERT-based models and consistently got superior results over state-of-the-art techniques.

【9】 A Dataset for Discourse Structure in Peer Review Discussions Link: https://arxiv.org/abs/2110.08520

Authors: Neha Nayak Kennard, Tim O'Gorman, Akshay Sharma, Chhandak Bagchi, Matthew Clinton, Pranay Kumar Yelugam, Rajarshi Das, Hamed Zamani, Andrew McCallum Abstract: At the foundation of scientific evaluation is the labor-intensive process of peer review. This critical task requires participants to consume and interpret vast amounts of highly technical text. We show that discourse cues from rebuttals can shed light on the quality and interpretation of reviews. Further, an understanding of the argumentative strategies employed by the reviewers and authors provides useful signal for area chairs and other decision makers. This paper presents a new labeled dataset of 20k sentences contained in 506 review-rebuttal pairs in English, annotated by experts. While existing datasets annotate a subset of review sentences using various schemes, ours synthesizes existing label sets and extends them to include fine-grained annotation of the rebuttal sentences, characterizing the authors' stance towards the reviewers' criticisms and their commitment to addressing them. Further, we annotate every sentence in both the review and the rebuttal, including a description of the context for each rebuttal sentence.

【10】 Metadata Shaping: Natural Language Annotations for the Tail Link: https://arxiv.org/abs/2110.08430

Authors: Simran Arora, Sen Wu, Enci Liu, Christopher Re Affiliation: Stanford University Abstract: Language models (LMs) have made remarkable progress, but still struggle to generalize beyond the training data to rare linguistic patterns. Since rare entities and facts are prevalent in the queries users submit to popular applications such as search and personal assistant systems, improving the ability of LMs to reliably capture knowledge over rare entities is a pressing challenge studied in significant prior work. Noticing that existing approaches primarily modify the LM architecture or introduce auxiliary objectives to inject useful entity knowledge, we ask to what extent we could match the quality of these architectures using a base LM architecture and only changing the data. We propose metadata shaping, a method in which readily available metadata, such as entity descriptions and categorical tags, are appended to examples based on information theoretic metrics. Intuitively, if metadata corresponding to popular entities overlap with metadata for rare entities, the LM may be able to better reason about the rare entities using patterns learned from similar popular entities. On standard entity-rich tasks (TACRED, FewRel, OpenEntity), with no changes to the LM whatsoever, metadata shaping exceeds the BERT-baseline by up to 5.3 F1 points, and achieves or competes with state-of-the-art results. We further show the improvements are up to 10x larger on examples containing tail versus popular entities.

【11】 EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks Link: https://arxiv.org/abs/2110.08426

Authors: Frederick Liu, Siamak Shakeri, Hongkun Yu, Jing Li Affiliation: Google Abstract: Encoder-decoder transformer architectures have become popular recently with the advent of T5 models. They are also more favorable than architectures like BERT for pre-training on language model tasks when it comes to large scale models, which could take months to train, given their generality. While being able to generalize to more tasks, it is not evident if the proposed encoder-decoder architecture is the most efficient for fine-tuning on classification and regression tasks given the pre-trained model. In this work, we study fine-tuning pre-trained encoder-decoder models such as T5. Particularly, we propose EncT5 as a way to efficiently fine-tune pre-trained encoder-decoder T5 models for classification and regression tasks by using the encoder layers. Our experimental results show that EncT5 with less than half of the parameters of T5 performs similarly to T5 models on the GLUE benchmark. We believe our proposed approach can be easily applied to any pre-trained encoder-decoder model.

【12】 Information-Theoretic Measures of Dataset Difficulty Link: https://arxiv.org/abs/2110.08420

Authors: Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta Affiliations: Stanford University; Allen Institute for Artificial Intelligence; Paul G. Allen School of Computer Science, University of Washington Abstract: Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information. Measuring usable information is as easy as measuring performance, but has certain theoretical advantages. While the latter only allows us to compare different models w.r.t. the same dataset, the former also allows us to compare different datasets w.r.t. the same model. We then introduce pointwise $\mathcal{V}$-information (PVI) for measuring the difficulty of individual instances, where instances with higher PVI are easier for model $\mathcal{V}$. By manipulating the input before measuring usable information, we can understand why a dataset is easy or difficult for a given model, which we use to discover annotation artefacts in widely-used benchmarks.
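To make the per-instance measure concrete, here is a minimal sketch of how a pointwise score of this kind could be computed. The formula is an assumption not stated in the abstract: it takes PVI to be the log-ratio (in bits) between the probability a model assigns to the gold label when it sees the input and the probability assigned by a model trained with an empty input. The function name `pvi` and the example probabilities are hypothetical.

```python
import math


def pvi(p_with_input, p_null_input):
    """Pointwise score in bits: how much seeing the input helps predict the label.

    p_with_input: probability of the gold label from a model given the input.
    p_null_input: probability of the gold label from a model given an empty
                  input (i.e., relying only on label priors).
    Assumed form: -log2(p_null_input) + log2(p_with_input).
    """
    return -math.log2(p_null_input) + math.log2(p_with_input)


# Toy instances: higher score means the instance is easier for the model.
easy = pvi(0.8, 0.4)    # input doubles the gold-label probability
neutral = pvi(0.5, 0.5)  # input adds nothing
hard = pvi(0.25, 0.5)   # input makes the gold label less likely
print(easy, neutral, hard)
```

Under this assumed form, a negative score flags instances where the input actively misleads the model, which fits the abstract's statement that higher PVI means an easier instance.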

【13】 Invariant Language Modeling Link: https://arxiv.org/abs/2110.08413

Authors: Maxime Peyrard, Sarvjeet Singh Ghotra, Martin Josifoski, Vidhan Agarwal, Barun Patra, Dean Carignan, Emre Kiciman, Robert West Affiliations: EPFL, Microsoft Corporation Abstract: Modern pretrained language models are critical components of NLP pipelines. Yet, they suffer from spurious correlations, poor out-of-domain generalization, and biases. Inspired by recent progress in causal machine learning, in particular the invariant risk minimization (IRM) paradigm, we propose invariant language modeling, a framework for learning invariant representations that generalize better across multiple environments. In particular, we adapt a game-theoretic implementation of IRM (IRM-games) to language models, where the invariance emerges from a specific training schedule in which all the environments compete to optimize their own environment-specific loss by updating subsets of the model in a round-robin fashion. In a series of controlled experiments, we demonstrate the ability of our method to (i) remove structured noise, (ii) ignore specific spurious correlations without affecting global performance, and (iii) achieve better out-of-domain generalization. These benefits come with a negligible computational overhead compared to standard training, do not require changing the local loss, and can be applied to any language model architecture. We believe this framework is promising to help mitigate spurious correlations and biases in language models.

【14】 Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining Link: https://arxiv.org/abs/2110.08412

Authors: Andreas Madsen, Nicholas Meade, Vaibhav Adlakha, Siva Reddy Affiliations: Mila – Quebec AI Institute, Polytechnique Montréal, McGill, Facebook CIFAR AI Chair Abstract: To explain NLP models, many methods indicate which input tokens are important for a prediction. However, an open question is whether these methods accurately reflect the model's logic, a property often called faithfulness. In this work, we adapt and improve a recently proposed faithfulness benchmark from computer vision called ROAR (RemOve And Retrain), by Hooker et al. (2019). We improve ROAR by recursively removing dataset redundancies, which otherwise interfere with ROAR. We adapt and apply ROAR to popular NLP importance measures, namely attention, gradient, and integrated gradients. Additionally, we use mutual information as an additional baseline. Evaluation is done on a suite of classification tasks often used in the faithfulness of attention literature. Finally, we propose a scalar faithfulness metric, which makes it easy to compare results across papers. We find that importance measures considered to be unfaithful for computer vision tasks perform favorably for NLP tasks, the faithfulness of an importance measure is task-dependent, and the computational overhead of integrated gradient is rarely justified.
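
The recursive mask-and-retrain loop named in the title can be sketched as follows. This is a minimal illustration under assumptions, not the benchmark's actual code: `train` and `importance` are hypothetical callbacks standing in for model training and for any importance measure, the mask string is an arbitrary choice, and the evaluation step (measuring the retrained model's performance at each masking level) is omitted.

```python
MASK = "[MASK]"  # assumed placeholder token


def mask_top_k(tokens, scores, k):
    """Mask the k highest-scoring tokens that are not already masked."""
    candidates = [i for i, tok in enumerate(tokens) if tok != MASK]
    chosen = set(sorted(candidates, key=lambda i: scores[i], reverse=True)[:k])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


def recursive_mask_and_retrain(dataset, train, importance, k, rounds):
    """Alternate retraining and masking; return the dataset after each round.

    dataset: list of token lists.
    train(dataset) -> model (hypothetical callback).
    importance(model, tokens) -> list of per-token scores (hypothetical callback).
    """
    current = [list(tokens) for tokens in dataset]
    history = [current]
    for _ in range(rounds):
        model = train(current)  # retrain on the currently masked data
        # Importance is recomputed on the already-masked inputs each round.
        current = [mask_top_k(toks, importance(model, toks), k) for toks in current]
        history.append(current)
    return history


# Toy stand-ins: no real model, and token length plays the role of importance.
toy_data = [["a", "bbb", "cc", "dddd"]]
history = recursive_mask_and_retrain(
    toy_data,
    train=lambda data: None,
    importance=lambda model, toks: [len(t) for t in toks],
    k=1,
    rounds=2,
)
```

Recomputing importance after each retraining round is what makes the procedure recursive: tokens that only become important once earlier ones are hidden still get a chance to be masked.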

【15】 DS-TOD: Efficient Domain Specialization for Task Oriented Dialog Link: https://arxiv.org/abs/2110.08395

Authors: Chia-Chien Hung, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš Affiliations: Data and Web Science Group, University of Mannheim, Germany; MilaNLP, Bocconi University, Italy; Center for Information and Language Processing, LMU Munich, Germany Abstract: Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over traditional language modeling (LM) pretraining in downstream task-oriented dialog (TOD). These approaches, however, exploit general dialogic corpora (e.g., Reddit) and thus presumably fail to reliably embed domain-specific knowledge useful for concrete downstream TOD domains. In this work, we investigate the effects of domain specialization of pretrained language models (PLMs) for task-oriented dialog. Within our DS-TOD framework, we first automatically extract salient domain-specific terms, and then use them to construct DomainCC and DomainReddit, resources that we leverage for domain-specific pretraining, based on (i) masked language modeling (MLM) and (ii) response selection (RS) objectives, respectively. We further propose a resource-efficient and modular domain specialization by means of domain adapters: additional parameter-light layers in which we encode the domain knowledge. Our experiments with two prominent TOD tasks, dialog state tracking (DST) and response retrieval (RR), encompassing five domains from the MultiWOZ TOD benchmark demonstrate the effectiveness of our domain specialization approach. Moreover, we show that the light-weight adapter-based specialization (1) performs comparably to full fine-tuning in single-domain setups and (2) is particularly suitable for multi-domain specialization, in which, besides an advantageous computational footprint, it can offer better downstream performance.

【16】 When Combating Hype, Proceed with Caution Link: https://arxiv.org/abs/2110.08300

Author: Samuel R. Bowman Affiliation: New York University Abstract: In an effort to avoid reinforcing widespread hype about the capabilities of state-of-the-art language technology, researchers have developed practices in framing and citation that serve to deemphasize the field's successes. Though well-meaning, these practices often yield misleading or even false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: it limits our ability to mitigate short-term harms from NLP deployments and it limits our ability to prepare for the potentially enormous impacts of more distant future advances. This paper urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.

Machine translation, for reference only.

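This article is part of the Tencent Cloud self-media synchronized exposure program and is shared from a WeChat official account.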

Originally published: 2021-10-19. In case of infringement, please contact cloudcommunity@tencent.com for removal.
