cs.CL 方向,今日共计31篇
Transformer(1篇)
【1】 Leveraging Transformers for Hate Speech Detection in Conversational Code-Mixed Tweets 标题:利用Transformer检测会话式代码混合推文中的仇恨言论 链接:https://arxiv.org/abs/2112.09986
作者:Zaki Mustafa Farooqi,Sreyan Ghosh,Rajiv Ratn Shah 机构:Multimodal Digital Media Analysis Lab, Indraprastha Institute of Information Technology Delhi, India 备注:Accepted at FIRE 2021 - Hate Speech and offensive content detection (HASOC) Track 摘要:在当今互联网时代,人人都可以轻松访问社交媒体平台,人们往往不得不面对威胁、身份攻击、仇恨和欺凌,原因可能是其种姓、信条、性别、宗教,甚至是对某种观点的接受或拒绝。现有的仇恨言论检测工作主要将单条评论的分类作为序列标注任务,并且常常没有考虑会话的上下文。在判断推文背后的作者意图和情感时,对话的上下文往往起着重要作用。本文描述了MIDAS-IIITD团队为HASOC 2021子任务2提出的系统,这是首批专注于检测推特上印地语-英语代码混合对话中仇恨言论的共享任务之一。我们使用神经网络来解决这个问题,利用transformer的跨语言嵌入并进一步微调,以便在音译印地语文本上进行低资源仇恨言论分类。我们表现最好的系统是Indic-BERT、XLM-RoBERTa和多语言BERT的硬投票集成,宏F1得分为0.7253,使我们在整体排行榜上排名第一。 摘要:In the current era of the internet, where social media platforms are easily accessible for everyone, people often have to deal with threats, identity attacks, hate, and bullying due to their association with a cast, creed, gender, religion, or even acceptance or rejection of a notion. Existing works in hate speech detection primarily focus on individual comment classification as a sequence labeling task and often fail to consider the context of the conversation. The context of a conversation often plays a substantial role when determining the author's intent and sentiment behind the tweet. This paper describes the system proposed by team MIDAS-IIITD for HASOC 2021 subtask 2, one of the first shared tasks focusing on detecting hate speech from Hindi-English code-mixed conversations on Twitter. We approach this problem using neural networks, leveraging the transformer's cross-lingual embeddings and further finetuning them for low-resource hate-speech classification in transliterated Hindi text. Our best performing system, a hard voting ensemble of Indic-BERT, XLM-RoBERTa, and Multilingual BERT, achieved a macro F1 score of 0.7253, placing us first on the overall leaderboard standings.
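下面给出硬投票集成思想的一个极简示意(纯Python,仅作说明,并非论文原实现;示例中的标签取值为假设的占位符):

from collections import Counter

def hard_voting(predictions_per_model):
    """对多个模型的标签预测做逐样本多数表决(硬投票)。
    predictions_per_model: [模型1的标签列表, 模型2的标签列表, ...],各列表等长。"""
    ensembled = []
    for labels in zip(*predictions_per_model):      # 逐条样本取各模型的预测
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled

# 示例:三个分类器(如 Indic-BERT、XLM-RoBERTa、mBERT)对 4 条推文的预测
indic_bert = ["HOF", "NONE", "HOF", "NONE"]
xlm_r      = ["HOF", "HOF", "HOF", "NONE"]
mbert      = ["NONE", "NONE", "HOF", "NONE"]
print(hard_voting([indic_bert, xlm_r, mbert]))      # ['HOF', 'NONE', 'HOF', 'NONE']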
BERT(2篇)
【1】 Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages 标题:训练数据集和字典大小在BERT模型中很重要:以波罗的海语言为例 链接:https://arxiv.org/abs/2112.10553
作者:Matej Ulčar,Marko Robnik-Šikonja 机构:University of Ljubljana 备注:12 pages. To be published in proceedings of the AIST 2021 conference 摘要:大型预训练掩码语言模型已经成为许多NLP问题的最新解决方案。虽然研究表明,单语模型比多语模型产生更好的结果,但训练数据集必须足够大。我们为立陶宛语、拉脱维亚语和英语训练了一个类BERT的三语模型LitLat BERT,为爱沙尼亚语训练了一个单语Est-RoBERTa模型。我们评估了它们在四个下游任务上的性能:命名实体识别、依存句法分析、词性标注和词类比。为了分析专注于单一语言的重要性以及大型训练集的重要性,我们将新建的模型与爱沙尼亚语、拉脱维亚语和立陶宛语的现有单语和多语BERT模型进行比较。结果表明,在大多数情况下,新创建的LitLat BERT和Est-RoBERTa模型在所有测试任务上都改进了现有模型的结果。 摘要:Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations.
【2】 Syntactic-GCN Bert based Chinese Event Extraction 标题:基于句法-GCN-BERT的中文事件抽取 链接:https://arxiv.org/abs/2112.09939
作者:Jiangwei Liu,Jingshu Zhang,Xiaohong Huang,Liangyu Min 机构:School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai , China 备注:9 pages, 4 figures, 3 tables. arXiv admin note: text overlap with arXiv:2111.03212 摘要:随着信息技术的快速发展,在线平台(如新闻门户和社交媒体)每时每刻都会产生大量的网络信息。因此,从社会流中提取事件的结构化表示至关重要。通常,现有的事件提取研究利用模式匹配、机器学习或深度学习方法来执行事件提取任务。然而,由于汉语的独特性,汉语事件抽取的性能不如英语。本文提出了一个完整的中文事件提取框架。该方法是一个多通道输入的神经网络框架,集成了语义特征和句法特征。语义特征由BERT体系结构捕获。词性(POS)特征和依赖解析(DP)特征分别通过分析嵌入和图卷积网络(GCN)捕获。我们还将在真实数据集上评估我们的模型。实验结果表明,该方法的性能明显优于基准方法。 摘要:With the rapid development of information technology, online platforms (e.g., news portals and social media) generate enormous web information every moment. Therefore, it is crucial to extract structured representations of events from social streams. Generally, existing event extraction research utilizes pattern matching, machine learning, or deep learning methods to perform event extraction tasks. However, the performance of Chinese event extraction is not as good as English due to the unique characteristics of the Chinese language. In this paper, we propose an integrated framework to perform Chinese event extraction. The proposed approach is a multiple channel input neural framework that integrates semantic features and syntactic features. The semantic features are captured by BERT architecture. The Part of Speech (POS) features and Dependency Parsing (DP) features are captured by profiling embeddings and Graph Convolutional Network (GCN), respectively. We also evaluate our model on a real-world dataset. Experimental results show that the proposed method outperforms the benchmark approaches significantly.
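下面给出依存图上图卷积(GCN)一层计算的极简示意(PyTorch;依存弧与维度均为假设,并非论文的多通道框架实现,仅用于说明句法特征如何通过GCN编码):

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """一层图卷积:H' = ReLU(D^-1 * (A+I) * H * W),A 来自依存弧。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        adj = adj + torch.eye(adj.size(-1))           # 加自环
        deg = adj.sum(-1, keepdim=True).clamp(min=1)  # 节点度数,防止除零
        return torch.relu(self.linear(adj @ h / deg))

# 示例:5 个词的句子,假设的依存弧 (head, dependent) 构成对称邻接矩阵
n, dim = 5, 768
adj = torch.zeros(n, n)
for head, dep in [(1, 0), (1, 2), (1, 4), (4, 3)]:
    adj[head, dep] = adj[dep, head] = 1.0
h = torch.randn(n, dim)                    # 例如来自 BERT 的词表示(语义通道)
print(GCNLayer(dim, 256)(h, adj).shape)    # torch.Size([5, 256])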
QA|VQA|问答|对话(2篇)
【1】 MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding 标题:MuMuQA:基于跨媒体知识抽取和接地的多媒体多跳新闻问答 链接:https://arxiv.org/abs/2112.10728
作者:Revanth Gangi Reddy,Xilin Rui,Manling Li,Xudong Lin,Haoyang Wen,Jaemin Cho,Lifu Huang,Mohit Bansal,Avirup Sil,Shih-Fu Chang,Alexander Schwing,Heng Ji 机构: University of Illinois Urbana-Champaign, Tsinghua University, Columbia University, University of North Carolina at Chapel Hill, Virginia Tech, IBM Research AI 备注:To be presented at AAAI 2022 摘要:最近,人们对建立问答(QA)模型越来越感兴趣,该模型可以跨多种模式进行推理,例如文本和图像。然而,使用图像的QA通常仅限于从预定义的选项集中选择答案。此外,现实世界中的图像,特别是新闻中的图像,具有与文本共同参照的对象,具有来自两种模式的补充信息。在本文中,我们提出了一个新的QA评估基准,针对需要将图像中的对象跨媒体接地到文本的新闻文章提出1384个问题。具体来说,该任务涉及多跳问题,需要对图像标题对进行推理,以识别所指的固定视觉对象,然后预测从新闻正文文本到回答问题的跨度。此外,我们还介绍了一种基于跨媒体知识提取和综合问答生成的多媒体数据扩充框架,以自动扩充能够为该任务提供弱监督的数据。我们在我们的基准上评估了基于管道和基于端到端预训练的多媒体QA模型,并表明它们实现了有希望的性能,同时大大落后于人的性能,因此为这项具有挑战性的新任务的未来工作留有很大的空间。 摘要:Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance hence leaving large room for future work on this challenging new task.
【2】 Cascading Adaptors to Leverage English Data to Improve Performance of Question Answering for Low-Resource Languages 标题:利用英语数据的级联适配器来提高低资源语言的问答性能 链接:https://arxiv.org/abs/2112.09866
作者:Hariom A. Pandya,Bhavik Ardeshna,Dr. Brijesh S. Bhatt 机构:Dharmsinh Desai University, Nadiad-Gujarat(India) 摘要:基于Transformer的体系结构在许多下行任务(包括问题回答)上显示了显著的结果。另一方面,数据的可用性阻碍了低资源语言获得合理的性能。在本文中,我们研究了预训练的多语言模型的适用性,以提高低资源语言的问答性能。我们在类似于MLQA数据集的七种语言上使用多语言转换器架构测试了语言和任务适配器的四种组合。此外,我们还提出了使用语言和任务适配器进行低资源问答的Zero-Shot迁移学习。我们观察到,对于低资源语言,堆叠语言和任务适配器可以显著提高多语言转换器模型的性能。 摘要:Transformer based architectures have shown notable results on many down streaming tasks including question answering. The availability of data, on the other hand, impedes obtaining legitimate performance for low-resource languages. In this paper, we investigate the applicability of pre-trained multilingual models to improve the performance of question answering in low-resource languages. We tested four combinations of language and task adapters using multilingual transformer architectures on seven languages similar to MLQA dataset. Additionally, we have also proposed zero-shot transfer learning of low-resource question answering using language and task adapters. We observed that stacking the language and the task adapters improves the multilingual transformer models' performance significantly for low-resource languages.
Graph|知识图谱|Knowledge(4篇)
【1】 Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching 标题:集成知识的端到端自动语音识别中的汉英代码转换 链接:https://arxiv.org/abs/2112.10202
作者:Chia-Yu Li,Ngoc Thang Vu 机构:Institute for Natural Language Processing (IMS), University of Stuttgart, Stuttgart, Germany 备注:The 2019 International Conference on Asian Language Processing (IALP) 摘要:语码转换(CS)是多语言社区中常见的一种语言现象,包括在说话时在语言之间的转换。本文介绍了我们对汉英CS语音的端到端语音识别的研究。我们分析了不同的CS特定问题,如CS语言对中语言之间的属性不匹配、切换点的不可预测性以及数据稀缺问题。我们通过合并非语言符号,通过使用分层softmax集成语言识别,通过对子单词单元建模,通过人为降低语速,开发和改进最先进的端到端系统,并通过使用速度扰动技术和多个单语数据集来增加数据,从而不仅提高CS语音的最终性能,而且提高单语基准测试的最终性能,从而使系统更适用于实际环境。最后,我们探讨了不同语言模型集成方法对所提出模型性能的影响。我们的实验结果表明,所有提出的技术都提高了识别性能。最佳组合系统在混合错误率方面将基线系统提高了35%,并在单语基准上提供了可接受的性能。 摘要:Code-Switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech. We analyse different CS specific issues such as the properties mismatches between languages in a CS language pair, the unpredictable nature of switching points, and the data scarcity problem. We exploit and improve the state-of-the-art end-to-end system by merging nonlinguistic symbols, by integrating language identification using hierarchical softmax, by modeling sub-word units, by artificially lowering the speaking rate, and by augmenting data using speed perturbed technique and several monolingual datasets to improve the final performance not only on CS speech but also on monolingual benchmarks in order to make the system more applicable on real life settings. Finally, we explore the effect of different language model integration methods on the performance of the proposed model. Our experimental results reveal that all the proposed techniques improve the recognition performance. The best combined system improves the baseline system by up to 35% relatively in terms of mixed error rate and delivers acceptable performance on monolingual benchmarks.
【2】 Continual Learning with Knowledge Transfer for Sentiment Classification 标题:用于情感分类的带知识转移的持续学习 链接:https://arxiv.org/abs/2112.10021
作者:Zixuan Ke,Bing Liu,Hao Wang,Lei Shu 机构: University of Illinois at Chicago, USA, Southwest Jiaotong University, Chengdu, China 备注:None 摘要:本文研究情绪分类(SC)中的连续学习(CL)。在此设置中,CL系统在神经网络中增量学习SC任务序列,其中每个任务构建分类器,以对特定产品类别或领域的评论情绪进行分类。两个自然的问题是:系统能否将过去从以前的任务中学到的知识转移到新任务中,以帮助它为新任务学习更好的模型?而且,以前任务的旧模型也能在过程中得到改进吗?本文提出了一种称为KAN的新技术来实现这些目标。KAN可以通过正向和反向知识转移显著提高新任务和旧任务的SC准确性。通过大量实验证明了KAN的有效性。 摘要:This paper studies continual learning (CL) for sentiment classification (SC). In this setting, the CL system learns a sequence of SC tasks incrementally in a neural network, where each task builds a classifier to classify the sentiment of reviews of a particular product category or domain. Two natural questions are: Can the system transfer the knowledge learned in the past from the previous tasks to the new task to help it learn a better model for the new task? And, can old models for previous tasks be improved in the process as well? This paper proposes a novel technique called KAN to achieve these objectives. KAN can markedly improve the SC accuracy of both the new task and the old tasks via forward and backward knowledge transfer. The effectiveness of KAN is demonstrated through extensive experiments.
【3】 Word Graph Guided Summarization for Radiology Findings 标题:词图引导下的放射学发现总结 链接:https://arxiv.org/abs/2112.09925
作者:Jinpeng Hu,Jianling Li,Zhihong Chen,Yaling Shen,Yan Song,Xiang Wan,Tsung-Hui Chang 机构:♠The Chinese University of Hong Kong (Shenzhen), ♥Shenzhen Research Institute of Big Data, ♦National University of Defense Technology 备注:11 pages, 6 figures, ACL2021 Findings 摘要:放射学报告在向医生传达医学发现方面起着至关重要的作用。在每份报告中,印象部分总结了重要的放射学发现。在临床实践中,对于放射科医生来说,书写印象要求很高,但很耗时,而且容易出错。因此,自动印模生成已成为促进此类临床实践的一个有吸引力的研究方向。现有的研究主要集中在将显著词信息引入到一般的文本摘要框架中,以指导放射学发现中关键内容的选择。然而,对于这项任务,模型不仅需要捕捉发现中的重要词语,还需要准确描述它们之间的关系,以便产生高质量的印象。在本文中,我们提出了一种自动印象生成的新方法,即根据发现构建一个词图来记录关键词及其关系,然后设计一个词图引导的摘要模型(WGSum)来借助词图生成印象。在两个数据集OpenI和MIMIC-CXR上的实验结果证实了我们提出的方法的有效性和有效性,在这两个数据集上都获得了最新的结果。还进行了进一步的实验,以分析不同图形设计对我们的方法性能的影响。 摘要:Radiology reports play a critical role in communicating medical findings to physicians. In each report, the impression section summarizes essential radiology findings. In clinical practice, writing impression is highly demanded yet time-consuming and prone to errors for radiologists. Therefore, automatic impression generation has emerged as an attractive research direction to facilitate such clinical practice. Existing studies mainly focused on introducing salient word information to the general text summarization framework to guide the selection of the key content in radiology findings. However, for this task, a model needs not only capture the important words in findings but also accurately describe their relations so as to generate high-quality impressions. In this paper, we propose a novel method for automatic impression generation, where a word graph is constructed from the findings to record the critical words and their relations, then a Word Graph guided Summarization model (WGSum) is designed to generate impressions with the help of the word graph. Experimental results on two datasets, OpenI and MIMIC-CXR, confirm the validity and effectiveness of our proposed approach, where the state-of-the-art results are achieved on both datasets. Further experiments are also conducted to analyze the impact of different graph designs to the performance of our method.
【4】 The Web Is Your Oyster -- Knowledge-Intensive NLP against a Very Large Web Corpus 标题:网络就是你的牡蛎--知识密集型自然语言处理与庞大的网络语料库 链接:https://arxiv.org/abs/2112.09924
作者:Aleksandra Piktus,Fabio Petroni,Vladimir Karpukhin,Dmytro Okhonko,Samuel Broscheit,Gautier Izacard,Patrick Lewis,Barlas Oğuz,Edouard Grave,Wen-tau Yih,Sebastian Riedel 机构:Facebook AI Research ,University College London, University of Mannheim , ENS, PSL University , Inria 摘要:为了满足现实世界中日益增长的应用需求,知识密集型NLP(KI-NLP)的研究应该通过捕捉真正开放领域环境的挑战来推进:网络规模的知识、缺乏结构、质量不一致和噪音。为此,我们提出了一种评估现有KI-NLP任务的新设置,其中我们将背景语料库概括为通用web快照。我们重新利用KILT,一个最初为维基百科开发的标准KI-NLP基准,并要求系统使用CCNet的一个子集——球体语料库——作为知识源。与维基百科相比,Sphere要大几个数量级,更好地反映了互联网上知识的多样性。我们发现,尽管存在潜在的覆盖范围差距、规模挑战、缺乏结构和较低的质量,但从Sphere检索使最先进的检索和阅读系统能够在几个KILT任务上与基于Wikipedia的模型相匹配,甚至优于基于Wikipedia的模型,即使我们积极过滤看起来像Wikipedia的内容。我们还观察到,虽然Wikipedia上的单一密集通道索引可以优于稀疏BM25版本,但在Sphere上这还不可能。为了促进这一领域的进一步研究,并最大限度地减少社区对专有黑盒搜索引擎的依赖,我们将分享我们的指数、评估指标和基础设施。 摘要:In order to address the increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web scale knowledge, lack of structure, inconsistent quality, and noise. To this end, we propose a new setup for evaluating existing KI-NLP tasks in which we generalize the background corpus to a universal web snapshot. We repurpose KILT, a standard KI-NLP benchmark initially developed for Wikipedia, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the Internet. We find that despite potential gaps of coverage, challenges of scale, lack of structure and lower quality, retrieval from Sphere enables a state-of-the-art retrieve-and-read system to match and even outperform Wikipedia-based models on several KILT tasks - even if we aggressively filter content that looks like Wikipedia. We also observe that while a single dense passage index over Wikipedia can outperform a sparse BM25 version, on Sphere this is not yet possible. To facilitate further research into this area, and minimise the community's reliance on proprietary black box search engines, we will share our indices, evaluation metrics and infrastructure.
推理|分析|理解|解释(1篇)
【1】 Improving Ethical Outcomes with Machine-in-the-Loop: Broadening Human Understanding of Data Annotations 标题:通过机器在环中改善伦理结果:拓宽人类对数据注释的理解 链接:https://arxiv.org/abs/2112.09738
作者:Ashis Kumer Biswas,Geeta Verma,Justin Otto Barber 机构:Computer Science & Engineering, University of Colorado Denver, Colorado, CO , School of Education and Human Development, Denver, CO , Radiology Partners, El Segundo, CA 备注:Accepted and presented at the Human Centered AI workshop at the 35th Conference on Neural Information Processing Systems (NeurIPS), Dec 13th 2021 摘要:我们介绍了一种机器在环管道,旨在解决教育领域基于自然语言的有监督机器学习任务中不必要的偏见的根本原因。从学生的经历中学习是教育研究者和学术管理者的基础。在新的知识经济中,从经验中学到的21世纪技能正在成为大学和职业准备以及招聘过程的核心部分。少数族裔学生在日常生活中展示了这些技能,但记录、评估和验证这些技能对于教育机构来说是一个巨大的问题。作为一个注重公平的在线平台,LivedX将未经认证的学生的生活经历转化为21世纪的技能,颁发微证书,并创建个人21世纪技能组合。为了从学生提交的论文中获取的自然语言文本中自动进行微证书挖掘,我们采用了一个词袋模型来构造一个多输出分类器。尽管有我们的目标,我们的模式最初加剧了对少数民族学生的不同影响。我们使用一个机器在环模型开发管道来解决这个问题,并改进前面提到的模型,以确保其预测的公平性。 摘要:We introduce a machine-in-the-loop pipeline that aims to address root causes of unwanted bias in natural language based supervised machine learning tasks in the education domain. Learning from the experiences of students is foundational for education researchers, and academic administrators. 21st-century skills learned from experience are becoming a core part of college and career readiness as well as the hiring process in the new knowledge economy. Minoritized students demonstrate these skills in their daily lives, but documenting, assessing, and validating these skills is a huge problem for educational institutions. As an equity focused online platform, LivedX translates minoritized students' lived experiences into the 21st century skills, issues micro-credentials, and creates personal 21st century skills portfolio. To automate the micro credential mining from the natural language texts received from the students' submitted essays, we employed a bag-of-word model to construct a multi-output classifier. Despite our goal, our model initially exacerbated disparate impact on minoritized students. We used a machine-in-the-loop model development pipeline to address the problem and refine the aforementioned model to ensure fairness in its prediction.
GAN|对抗|攻击|生成相关(2篇)
【1】 May the Force Be with Your Copy Mechanism: Enhanced Supervised-Copy Method for Natural Language Generation 标题:愿原力与你的复制机制同在:自然语言生成的增强型监督复制方法 链接:https://arxiv.org/abs/2112.10360
作者:Sanghyuk Choi,Jeong-in Hwang,Hyungjong Noh,Yeonsoo Lee 机构:NCSOFT NLP Center 备注:8 pages, 3 figures, 8 tables and 4 pages of appendices 摘要:最近,具有复制机制的神经序列到序列模型在各种文本生成任务中取得了显著的进展。这些模型解决了词汇表外的问题,并促进了稀有词的产生。然而,正如先前的复制模型所观察到的那样,识别需要复制的单词是困难的,这些复制模型存在生成错误和缺乏抽象性的问题。在本文中,我们提出了一种新的有监督的复制网络方法,帮助模型决定哪些单词需要复制,哪些单词需要生成。具体来说,我们重新定义了目标函数,它利用源序列和目标词汇表作为复制的指导。在数据到文本生成和抽象摘要任务上的实验结果验证了我们的方法提高了复制质量和抽象度。 摘要:Recent neural sequence-to-sequence models with a copy mechanism have achieved remarkable progress in various text generation tasks. These models addressed out-of-vocabulary problems and facilitated the generation of rare words. However, the identification of the word which needs to be copied is difficult, as observed by prior copy models, which suffer from incorrect generation and lacking abstractness. In this paper, we propose a novel supervised approach of a copy network that helps the model decide which words need to be copied and which need to be generated. Specifically, we re-define the objective function, which leverages source sequences and target vocabularies as guidance for copying. The experimental results on data-to-text generation and abstractive summarization tasks verify that our approach enhances the copying quality and improves the degree of abstractness.
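下面给出标准指针-生成式复制机制核心计算的极简示意(PyTorch;展示生成分布与复制分布如何按生成概率p_gen混合。这只是复制机制的通用形式,并非本文提出的有监督复制方法的实现):

import torch

def mix_copy_and_generate(p_vocab, attn, src_ids, p_gen):
    """p_vocab: (batch, vocab) 生成分布;attn: (batch, src_len) 对源序列的注意力;
    src_ids: (batch, src_len) 源词 id;p_gen: (batch, 1) 生成概率。"""
    copy_dist = torch.zeros_like(p_vocab)
    copy_dist.scatter_add_(1, src_ids, attn)          # 把注意力权重累加到对应源词上
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist

# 玩具示例:词表大小 6,源序列为 [4, 2, 2]
p_vocab = torch.softmax(torch.randn(1, 6), dim=-1)
attn    = torch.tensor([[0.7, 0.2, 0.1]])
src_ids = torch.tensor([[4, 2, 2]])
p_gen   = torch.tensor([[0.6]])
final = mix_copy_and_generate(p_vocab, attn, src_ids, p_gen)
print(final.sum())   # 混合后仍是合法概率分布,和为 1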
【2】 Investigation of Densely Connected Convolutional Networks with Domain Adversarial Learning for Noise Robust Speech Recognition 标题:基于域对抗学习的密集连接卷积网络抗噪语音识别研究 链接:https://arxiv.org/abs/2112.10108
作者:Chia Yu Li,Ngoc Thang Vu 机构:Institute for Natural Language Processing, University of Stuttgart, Germany 备注:7 pages, 5 figures, The 30th Conference on Electronic Speech Signal Processing (ESSV2019) 摘要:我们研究了稠密连接卷积网络(DenseNets)及其在域对抗训练中的扩展,用于噪声鲁棒语音识别。DenseNets是一种非常深入、紧凑的卷积神经网络,与计算机视觉领域的最新成果相比有了惊人的改进。我们的实验结果表明,DenseNets比其他基于神经网络的模型(如深前馈神经网络和卷积神经网络)对噪声具有更强的鲁棒性。此外,领域对抗性学习可以进一步提高Densenet对已知和未知噪声条件的鲁棒性。 摘要:We investigate densely connected convolutional networks (DenseNets) and their extension with domain adversarial training for noise robust speech recognition. DenseNets are very deep, compact convolutional neural networks which have demonstrated incredible improvements over the state-of-the-art results in computer vision. Our experimental results reveal that DenseNets are more robust against noise than other neural network based models such as deep feed forward neural networks and convolutional neural networks. Moreover, domain adversarial learning can further improve the robustness of DenseNets against both, known and unknown noise conditions.
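下面给出密集连接(dense connectivity)思想的极简示意(PyTorch一维卷积版本;通道数、层数均为假设,仅作概念说明,并非论文所用网络):

import torch
import torch.nn as nn

class DenseBlock1d(nn.Module):
    """每一层的输入是之前所有层输出在通道维上的拼接。"""
    def __init__(self, in_ch, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(in_ch + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):                       # x: (batch, in_ch, time)
        features = [x]
        for conv in self.layers:
            out = torch.relu(conv(torch.cat(features, dim=1)))
            features.append(out)                # 新特征继续供后续所有层复用
        return torch.cat(features, dim=1)

x = torch.randn(2, 40, 100)                     # 例如 40 维滤波器组声学特征
print(DenseBlock1d(40, growth=12, num_layers=4)(x).shape)   # torch.Size([2, 88, 100])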
检测相关(4篇)
【1】 An ensemble deep learning technique for detecting suicidal ideation from posts in social media platforms 标题:从社交媒体平台的帖子中检测自杀意念的集成深度学习技术 链接:https://arxiv.org/abs/2112.10609
作者:Shini Renjith,Annie Abraham,Surya B. Jyothi,Lekshmi Chandran,Jincy Thomson 机构:Department of Computer Science and Engineering, Mar Baselios College of Engineering and Technology, Thiruvananthapuram, Kerala, India, a r t i c l e, i n f o, Article history:, Available online xxxx 备注:None 摘要:社交媒体中的自杀意念检测是一项不断发展的研究,具有很大的挑战性。许多有自杀倾向的人通过社交媒体平台分享他们的想法和观点。作为许多研究的一部分,人们观察到,社交媒体公开发布的帖子包含有效检测有自杀想法的个人的有价值的标准。预防自杀最困难的部分是发现和理解可能导致自杀的复杂危险因素和警告信号。这可以通过自动识别用户行为的突然变化来实现。自然语言处理技术可用于从社交媒体互动中收集行为和文本特征,这些特征可被传递到专门设计的框架中,以检测人类互动中的异常,这些异常是自杀意图的指标。我们可以使用深度学习和/或基于机器学习的分类方法快速检测自杀意念。为此,我们可以使用LSTM和CNN模型的组合,从用户的帖子中检测此类情绪。为了提高准确度,可以采取一些方法,如使用更多的数据进行训练,使用注意模型提高现有模型的效率等。本文提出了一个LSTM-Attention-CNN组合模型来分析社交媒体提交,以检测潜在的自杀意图。在评估过程中,所提出的模型的准确率为90.3%,F1分数为92.6%,高于基准模型。 摘要:Suicidal ideation detection from social media is an evolving research with great challenges. Many of the people who have the tendency to suicide share their thoughts and opinions through social media platforms. As part of many researches it is observed that the publicly available posts from social media contain valuable criteria to effectively detect individuals with suicidal thoughts. The most difficult part to prevent suicide is to detect and understand the complex risk factors and warning signs that may lead to suicide. This can be achieved by identifying the sudden changes in a user behavior automatically. Natural language processing techniques can be used to collect behavioral and textual features from social media interactions and these features can be passed to a specially designed framework to detect anomalies in human interactions that are indicators of suicidal intentions. We can achieve fast detection of suicidal ideation using deep learning and/or machine learning based classification approaches. For such a purpose, we can employ the combination of LSTM and CNN models to detect such emotions from posts of the users. In order to improve the accuracy, some approaches like using more data for training, using attention model to improve the efficiency of existing models etc. could be done. This paper proposes a LSTM-Attention-CNN combined model to analyze social media submissions to detect any underlying suicidal intentions. During evaluations, the proposed model demonstrated an accuracy of 90.3 percent and an F1-score of 92.6 percent, which is greater than the baseline models.
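下面给出LSTM-注意力-CNN组合结构的一个极简示意模型(PyTorch;各层维度与组合细节均为假设值,仅用于说明整体结构,并非论文原实现):

import torch
import torch.nn as nn

class LSTMAttnCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb=128, hidden=64, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # 简单的注意力打分
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64 + 2 * hidden, num_classes)

    def forward(self, ids):                             # ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(ids))                 # (batch, seq, 2*hidden)
        alpha = torch.softmax(self.attn(h), dim=1)      # 每个位置的注意力权重
        context = (alpha * h).sum(dim=1)                # 注意力加权汇聚
        c = torch.relu(self.conv(h.transpose(1, 2))).max(dim=-1).values  # CNN 特征
        return self.fc(torch.cat([context, c], dim=-1))

logits = LSTMAttnCNN()(torch.randint(0, 10000, (4, 50)))
print(logits.shape)   # torch.Size([4, 2])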
【2】 Article Reranking by Memory-Enhanced Key Sentence Matching for Detecting Previously Fact-Checked Claims 标题:通过记忆增强的关键句子匹配进行文章重新排序以检测先前事实核查的主张 链接:https://arxiv.org/abs/2112.10322
作者:Qiang Sheng,Juan Cao,Xueyao Zhang,Xirong Li,Lei Zhong 机构:Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China 备注:ACL-IJCNLP 2021 Main Conference Long Paper 摘要:先前已经过事实核查的虚假声明仍然可以在社交媒体上传播。为了减少它们的持续传播,检测先前经过事实核查的声明是必不可少的。给定一条声明,现有工作的重点是通过对BM25检索到的候选事实核查文章(FC文章)重新排序来为检测提供证据。然而,这些方法的性能可能会受到限制,因为它们忽略了FC文章的以下特征:(1)声明经常被引用来描述被核查的事件,在语义之外提供了词汇信息;(2)用于引入或揭穿声明的句子模板在文章中很常见,提供了模式信息。忽略这两个方面的模型只利用语义相关性,可能会被描述相似但不相关事件的句子误导。在本文中,我们提出了一种新的重排序器MTM(记忆增强匹配Transformer),它使用由事件(词汇和语义)信息与模式信息选出的关键句子对FC文章进行排序。对于事件信息,我们提出了一种ROUGE引导的Transformer,并通过对ROUGE的回归进行微调。对于模式信息,我们生成用于与句子匹配的模式向量。通过融合事件和模式信息,我们选择关键句子来表示文章,然后利用声明、关键句子和模式来预测该文章是否对给定声明进行了事实核查。在两个真实数据集上的实验表明,MTM的性能优于现有方法。人工评估证明,MTM能够捕捉关键句子以作解释。代码和数据集位于https://github.com/ICTMCG/MTM. 摘要:False claims that have been previously fact-checked can still spread on social media. To mitigate their continual spread, detecting previously fact-checked claims is indispensable. Given a claim, existing works focus on providing evidence for detection by reranking candidate fact-checking articles (FC-articles) retrieved by BM25. However, these performances may be limited because they ignore the following characteristics of FC-articles: (1) claims are often quoted to describe the checked events, providing lexical information besides semantics; (2) sentence templates to introduce or debunk claims are common across articles, providing pattern information. Models that ignore the two aspects only leverage semantic relevance and may be misled by sentences that describe similar but irrelevant events. In this paper, we propose a novel reranker, MTM (Memory-enhanced Transformers for Matching) to rank FC-articles using key sentences selected with event (lexical and semantic) and pattern information. For event information, we propose a ROUGE-guided Transformer which is finetuned with regression of ROUGE. For pattern information, we generate pattern vectors for matching with sentences. By fusing event and pattern information, we select key sentences to represent an article and then predict if the article fact-checks the given claim using the claim, key sentences, and patterns. Experiments on two real-world datasets show that MTM outperforms existing methods. Human evaluation proves that MTM can capture key sentences for explanations. The code and the dataset are at https://github.com/ICTMCG/MTM.
【3】 Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We? 标题:使用机器学习及早检测与安全相关的错误报告:我们还有多远? 链接:https://arxiv.org/abs/2112.10123
作者:Arthur D. Sawadogo,Quentin Guimard,Tegawendé F. Bissyandé,Abdoul Kader Kaboré,Jacques Klein,Naouel Moha 机构:Universit´e du Qu´ebec a Montr´eal, †University of Luxembourg 备注:10 pages 摘要:Bug报告是软件开发中常见的人工制品。它们是用户向开发人员传达他们在使用已发布版本的软件程序时遇到的问题的信息的主要渠道。然而,在问题的描述中,用户可能有意或无意地暴露漏洞。在典型的维护场景中,开发团队在准备纠正补丁时会优先考虑此类安全相关的bug报告。然而,当安全相关性没有立即表达(例如,通过标签)或没有迅速被分类团队识别时,开放式安全相关bug报告可能会成为敏感信息的严重泄漏,攻击者可以利用这些敏感信息执行零日攻击。为了支持从业人员对bug报告进行分类,研究社区提出了许多方法来检测与安全相关的bug报告。近年来,基于机器学习的方法在这方面得到了广泛的应用。我们的工作重点是这些方法,并重新审视它们的组成部分,以提供对当前成就的全面看法。为此,我们建立了一个大型实验数据集,并对特征集和学习算法进行了广泛的实验。最后,我们的研究强调了产生最佳性能分类器的不同方法配置。 摘要:Bug reports are common artefacts in software development. They serve as the main channel for users to communicate to developers information about the issues that they encounter when using released versions of software programs. In the descriptions of issues, however, a user may, intentionally or not, expose a vulnerability. In a typical maintenance scenario, such security-relevant bug reports are prioritised by the development team when preparing corrective patches. Nevertheless, when security relevance is not immediately expressed (e.g., via a tag) or rapidly identified by triaging teams, the open security-relevant bug report can become a critical leak of sensitive information that attackers can leverage to perform zero-day attacks. To support practitioners in triaging bug reports, the research community has proposed a number of approaches for the detection of security-relevant bug reports. In recent years, approaches in this respect based on machine learning have been reported with promising performance. Our work focuses on such approaches, and revisits their building blocks to provide a comprehensive view on the current achievements. To that end, we built a large experimental dataset and performed extensive experiments with variations in feature sets and learning algorithms. Eventually, our study highlights different approach configurations that yield best performing classifiers.
【4】 Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati : Dataset & Model 标题:古吉拉特文语素边界检测与语法特征预测:数据集与模型 链接:https://arxiv.org/abs/2112.09860
作者:Jatayu Baxi,Dr. Brijesh Bhatt 机构:Dharmsinh Desai University Nadiad 摘要:为低资源语言开发自然语言处理资源是一项具有挑战性但必不可少的任务。在本文中,我们提出了一个古吉拉特邦形态分析仪。我们使用了一种基于双向LSTM的方法来执行语素边界检测和语法特征标记。我们已经创建了一个古吉拉特语词汇的数据集,具有引理和语法特征。本文讨论的基于Bi-LSTM的词形分析器模型在不知道任何手工制作的后缀规则的情况下有效地处理了语言的词形。据我们所知,这是古吉拉特语的第一个数据集和语素分析器模型,它执行语法特征标记和语素边界检测任务。 摘要:Developing Natural Language Processing resources for a low resource language is a challenging but essential task. In this paper, we present a Morphological Analyzer for Gujarati. We have used a Bi-Directional LSTM based approach to perform morpheme boundary detection and grammatical feature tagging. We have created a data set of Gujarati words with lemma and grammatical features. The Bi-LSTM based model of Morph Analyzer discussed in the paper handles the language morphology effectively without the knowledge of any hand-crafted suffix rules. To the best of our knowledge, this is the first dataset and morph analyzer model for the Gujarati language which performs both grammatical feature tagging and morpheme boundary detection tasks.
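下面给出基于字符的Bi-LSTM序列标注器的极简示意(PyTorch;用假设的B/I标签表示语素边界,维度亦为假设,仅作说明)。语法特征标注可以在同一Bi-LSTM表示上增加另一个输出头来实现:

import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """对词内的每个字符预测一个标签,例如 B(语素开始)/ I(语素内部)。"""
    def __init__(self, num_chars=200, emb=64, hidden=64, num_tags=2):
        super().__init__()
        self.emb = nn.Embedding(num_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids):                 # (batch, word_len)
        h, _ = self.lstm(self.emb(char_ids))
        return self.out(h)                       # 每个字符位置的标签 logits

word = torch.randint(0, 200, (1, 8))             # 一个长度为 8 的词(字符 id)
tags = CharBiLSTMTagger()(word).argmax(-1)       # 预测的边界标签序列
print(tags.shape)                                # torch.Size([1, 8])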
识别/分类(5篇)
【1】 Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction 标题:面向联合文本分类与基本原理抽取的统一模型可解释性与鲁棒性 链接:https://arxiv.org/abs/2112.10424
作者:Dongfang Li,Baotian Hu,Qingcai Chen,Tujie Xu,Jingcong Tao,Yunan Zhang 机构: Harbin Institute of Technology (Shenzhen), Shenzhen, China, Peng Cheng Laboratory, Shenzhen, China 备注:AAAI 2022 摘要:最近的研究表明,可解释性和鲁棒性是可信和可靠的文本分类的两个关键要素。然而,以前的工作通常涉及两个方面中的一个:i)如何在有利于预测的同时提取准确的可解释性理由;ii)如何使预测模型对不同类型的对抗性攻击具有鲁棒性。直觉上,产生有用解释的模型应该对敌对攻击更具鲁棒性,因为我们不能信任输出解释但在小扰动下改变其预测的模型。为此,我们提出了一个联合分类和基本原理提取模型AT-BMC。它包括两个关键机制:混合对抗训练(AT)旨在利用离散空间和嵌入空间中的各种扰动来提高模型的鲁棒性;边界匹配约束(BMC)在边界信息的指导下帮助更精确地定位原理。在基准数据集上的性能表明,所提出的AT-BMC在分类和基本原理提取方面都大大优于基线。鲁棒性分析表明,所提出的AT-BMC有效地降低了69%的攻击成功率。实证结果表明,稳健模型与更好的解释之间存在联系。 摘要:Recent works have shown explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address one of two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust the model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT) is designed to use various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC) helps to locate rationales more precisely with the guidance of boundary information. Performances on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that the proposed AT-BMC decreases the attack success rate effectively by up to 69%. The empirical results indicate that there are connections between robust models and better explanations.
【2】 LUC at ComMA-2021 Shared Task: Multilingual Gender Biased and Communal Language Identification without using linguistic features 标题:LUC在ComMA-2021共享任务:不使用语言学特征的多语言性别偏见与社群语言识别 链接:https://arxiv.org/abs/2112.10189
作者:Rodrigo Cuéllar-Hidalgo,Julio de Jesús Guerrero-Zambrano,Dominic Forest,Gerardo Reyes-Salgado,Juan-Manuel Torres-Moreno 机构:Université de Montréal, EBSI, CENIDET, Université d'Avignon, LIA 备注:None 摘要:这项工作旨在评估概率方法和最先进的向量空间建模(VSM)方法在多大程度上能够帮助常见的机器学习算法,将社交网络文档识别为具有攻击性、性别偏见或社群煽动性。为此,我们首先进行了探索阶段以找到相关的测试配置,即利用训练和开发样本,用多种向量空间建模方法和概率方法训练多个算法,并丢弃信息量较小的配置。这些系统已提交至ComMA@ICON'21多语言性别偏见与社群语言识别研讨会的评测。 摘要:This work aims to evaluate the ability that both probabilistic and state-of-the-art vector space modeling (VSM) methods provide to well known machine learning algorithms to identify social network documents to be classified as aggressive, gender biased or communally charged. To this end, an exploratory stage was performed first in order to find relevant settings to test, i.e. by using training and development samples, we trained multiple algorithms using multiple vector space modeling and probabilistic methods and discarded the less informative configurations. These systems were submitted to the competition of the ComMA@ICON'21 Workshop on Multilingual Gender Biased and Communal Language Identification.
【3】 Unified Named Entity Recognition as Word-Word Relation Classification 标题:作为词-词关系分类的统一命名实体识别 链接:https://arxiv.org/abs/2112.10070
作者:Jingye Li,Hao Fei,Jiang Liu,Shengqiong Wu,Meishan Zhang,Chong Teng,Donghong Ji,Fei Li 机构: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry, of Education, School of Cyber Science and Engineering, Wuhan University, China, Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China 备注:Accepted by AAAI'22 摘要:到目前为止,命名实体识别(NER)主要涉及三种类型,包括平坦、重叠(又称嵌套)和不连续的NER,这些类型大多是单独研究的。最近,人们对统一NER越来越感兴趣,用一个模型同时处理上述三个工作。目前表现最好的方法主要包括基于跨度的模型和序列到序列的模型,不幸的是前者只关注边界识别,而后者可能会受到曝光偏差的影响。在这项工作中,我们提出了一种新的替代方法,将统一的NER建模为词-词关系分类,即W^2NER。该体系结构通过有效地建模实体词与下一个相邻词(NNW)和尾部词-*(THW-*)关系之间的相邻关系,解决了统一NER的核心瓶颈。基于W^2NER方案,我们开发了一个神经框架,其中统一的NER被建模为二维单词对网格。然后,我们提出多粒度二维卷积来更好地细化网格表示。最后,使用一个协预测器对单词关系进行充分推理。我们对14个广泛使用的基准数据集(8个英文数据集和6个中文数据集)进行了广泛的实验,其中我们的模型超过了所有当前性能最好的基线,推动了统一NER的最先进性能。 摘要:So far, named entity recognition (NER) has been involved with three major types, including flat, overlapped (aka. nested), and discontinuous NER, which have mostly been studied individually. Recently, a growing interest has been built for unified NER, tackling the above three jobs concurrently with one single model. Current best-performing methods mainly include span-based and sequence-to-sequence models, where unfortunately the former merely focus on boundary identification and the latter may suffer from exposure bias. In this work, we present a novel alternative by modeling the unified NER as word-word relation classification, namely W^2NER. The architecture resolves the kernel bottleneck of unified NER by effectively modeling the neighboring relations between entity words with Next-Neighboring-Word (NNW) and Tail-Head-Word-* (THW-*) relations. Based on the W^2NER scheme we develop a neural framework, in which the unified NER is modeled as a 2D grid of word pairs. We then propose multi-granularity 2D convolutions for better refining the grid representations. Finally, a co-predictor is used to sufficiently reason the word-word relations. We perform extensive experiments on 14 widely-used benchmark datasets for flat, overlapped, and discontinuous NER (8 English and 6 Chinese datasets), where our model beats all the current top-performing baselines, pushing the state-of-the-art performances of unified NER.
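下面用纯Python示意词-词关系网格的构造方式(依据摘要所述:NNW连接实体内相邻的词,THW-*由尾词指向头词并携带实体类型;例句、实体与标签布局均为假设,仅作说明):

tokens = ["aching", "in", "legs", "and", "shoulders"]
# 两个实体:"aching in legs"(连续)与 "aching shoulders"(不连续),类型均假设为 Symptom
entities = [([0, 1, 2], "Symptom"), ([0, 4], "Symptom")]

n = len(tokens)
grid = [["" for _ in range(n)] for _ in range(n)]     # 词对网格,空串表示无关系
for span, etype in entities:
    for a, b in zip(span, span[1:]):
        grid[a][b] = "NNW"                            # 实体内:当前词 -> 下一个相邻词
    grid[span[-1]][span[0]] = "THW-" + etype          # 尾词 -> 头词,并携带实体类型

for i, row in enumerate(grid):
    for j, rel in enumerate(row):
        if rel:
            print(f"{tokens[i]:>10} -> {tokens[j]:<10} {rel}")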
【4】 Data Augmentation for Mental Health Classification on Social Media 标题:社交媒体上心理健康分类的数据增强 链接:https://arxiv.org/abs/2112.10064
作者:Gunjan Ansari,Muskan Garg,Chandni Saxena 机构:JSS Academy of Technical Education, Noida, India; Thapar Institute of Engineering & Technology, Patiala, Punjab; The Chinese University of Hong Kong, Shatin, NT, Hong Kong 备注:10 摘要:在线用户的精神障碍是通过社交媒体帖子确定的。这一领域的主要挑战是,在社交媒体平台上使用用户生成的文本需要获得道德许可。学术研究者发现了精神健康分类数据不足和未标记的问题。为了解决这个问题,我们研究了数据增强技术对特定领域用户生成的用于心理健康分类的文本的影响。在现有成熟的数据增强技术中,我们已经确定Easy Data Augmentation(EDA)、conditional BERT和Back Translation(BT)是生成额外文本以提高分类器性能的潜在技术。此外,使用三种不同的分类器随机森林(RF)、支持向量机(SVM)和逻辑回归(LR)来分析数据增强对两个公开的社交媒体数据集的影响。实验结果表明,在增强后的数据上进行训练后,分类器的性能有显著提高。 摘要:The mental disorder of online users is determined using social media posts. The major challenge in this domain is to avail the ethical clearance for using the user generated text on social media platforms. Academic researchers identified the problem of insufficient and unlabeled data for mental health classification. To handle this issue, we have studied the effect of data augmentation techniques on domain specific user generated text for mental health classification. Among the existing well established data augmentation techniques, we have identified Easy Data Augmentation (EDA), conditional BERT, and Back Translation (BT) as the potential techniques for generating additional text to improve the performance of classifiers. Further, three different classifiers Random Forest (RF), Support Vector Machine (SVM) and Logistic Regression (LR) are employed for analyzing the impact of data augmentation on two publicly available social media datasets. The experimental results show significant improvements in classifiers' performance when trained on the augmented data.
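下面给出EDA中两种最简单操作(随机交换与随机删除)的纯Python示意(同义词替换与插入需要词典支持,此处省略;仅作概念说明,并非论文所用实现):

import random

def random_swap(words, n_swaps=1):
    """随机交换句中两个词的位置。"""
    words = words[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """以概率 p 删除每个词,至少保留一个词。"""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

random.seed(0)
post = "i have been feeling low and tired for weeks".split()
print(" ".join(random_swap(post)))
print(" ".join(random_deletion(post)))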
【5】 Multi-turn RNN-T for streaming recognition of multi-party speech 标题:用于多方语音流识别的多轮RNN-T 链接:https://arxiv.org/abs/2112.10200
作者:Ilya Sklyar,Anna Piunova,Xianrui Zheng,Yulan Liu 机构:Amazon Alexa,University of Cambridge 备注:Submitted to ICASSP2022 摘要:传统上,说话人数量未知的单通道远场录音的自动语音识别(ASR)是通过级联模块来解决的。最近的研究表明,与模块化系统相比,端到端(E2E)多说话人ASR模型可以实现更高的识别精度。但是,由于这些模型依赖于完整的音频上下文,因此无法确保实时适用性。这项工作将实时适用性作为模型设计的首要任务,并解决了以前关于多说话人循环神经网络转换器(MS-RNN-T)工作中的一些挑战。首先,我们在训练过程中引入即时的重叠语音模拟,在LibriSpeechMix测试集上带来14%的相对词错误率(WER)改善。其次,我们提出了一种新的多轮RNN-T(MT-RNN-T)模型,该模型采用了基于重叠的目标排列策略,可以在不改变模型结构的情况下推广到任意数量的说话人。我们在LibriCSS测试集上研究了训练期间见过的最大说话人数量对MT-RNN-T性能的影响,并报告了相对于双说话人MS-RNN-T 28%的相对WER改善。第三,我们试验了一种用于多方语音联合识别和分割的丰富转录策略。通过深入分析,我们讨论了所提出系统的潜在缺陷以及有前景的未来研究方向。 摘要:Automatic speech recognition (ASR) of single channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximum number of speakers seen during training on MT-RNN-T performance on LibriCSS test set, and report 28% relative WER improvement over the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.
Zero/Few/One-Shot|迁移|自适应(1篇)
【1】 Few-shot Learning with Multilingual Language Models 标题:基于多语言语言模型的少样本学习 链接:https://arxiv.org/abs/2112.10668
作者:Xi Victoria Lin,Todor Mihaylov,Mikel Artetxe,Tianlu Wang,Shuohui Chen,Daniel Simig,Myle Ott,Naman Goyal,Shruti Bhosale,Jingfei Du,Ramakanth Pasunuru,Sam Shleifer,Punit Singh Koura,Vishrav Chaudhary,Brian O'Horo,Jeff Wang,Luke Zettlemoyer,Zornitsa Kozareva,Mona Diab,Veselin Stoyanov,Xian Li 机构:Meta AI 备注:36 pages 摘要:大规模自回归语言模型(如GPT-3)是少样本学习者,无需微调即可执行广泛的语言任务。虽然已知这些模型能够联合表示多种不同的语言,但它们的训练数据以英语为主,这可能会限制其跨语言泛化能力。在这项工作中,我们在覆盖多种语言的平衡语料库上训练多语言自回归语言模型,并研究它们在广泛任务中的少样本和零样本学习能力。我们最大的模型拥有75亿个参数,在20多种代表性语言的少样本学习上创造了新的最先进水平,在多语言常识推理(0-shot设置下绝对准确率提高7.4%,4-shot设置下提高9.4%)和自然语言推理(0-shot和4-shot设置下各提高5.4%)方面优于规模相当的GPT-3。在FLORES-101机器翻译基准上,我们的模型在182个翻译方向中的171个方向上(使用32个训练示例)优于GPT-3,同时在45个方向上超过官方监督基线。我们详细分析了该模型的成功与失败之处,特别是表明它能够在某些任务上实现跨语言的上下文学习,而在表面形式鲁棒性以及对没有自然完形填空形式的任务的适应性方面仍有改进空间。最后,我们在社会价值相关任务(如五种语言的仇恨言论检测)上评估了我们的模型,发现它与规模相当的GPT-3模型有相似的局限性。 摘要:Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models in social value tasks such as hate speech detection in five languages and find it has limitations similar to comparable sized GPT-3 models.
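下面用纯Python示意少样本(few-shot)提示的构造方式:把若干带标签示例与待预测样本拼接成提示,交给语言模型续写标签(示例内容、标签词与提示格式均为假设,仅作说明):

def build_few_shot_prompt(examples, query, label_names):
    """examples: [(文本, 标签索引), ...];query: 待预测文本;label_names: 标签词表。"""
    lines = []
    for text, label in examples:
        lines.append(f"Sentence: {text}\nLabel: {label_names[label]}\n")
    lines.append(f"Sentence: {query}\nLabel:")    # 留空,让语言模型续写标签词
    return "\n".join(lines)

# 跨语言的上下文示例(情感分类,标签与示例均为假设)
demos = [("J'adore ce film.", 1), ("这部电影太无聊了。", 0)]
prompt = build_few_shot_prompt(demos, "Die Schauspieler waren großartig.", ["negative", "positive"])
print(prompt)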
表征(1篇)
【1】 Learning Semi-Structured Representations of Radiology Reports 标题:放射学报告的半结构化表示学习 链接:https://arxiv.org/abs/2112.10746
作者:Tamara Katic,Martin Pavlovski,Danijela Sekulic,Slobodan Vucetic 机构: Temple University, Philadelphia, PA, USA, University Clinical Center of Serbia, Belgrade, Serbia 摘要:除了主要的诊断目的外,放射学报告一直是医学研究中宝贵的信息来源。鉴于放射报告的语料库,研究人员通常对确定描述特定医学发现的报告子集感兴趣。由于放射学报告中的医学发现空间巨大且可能无限,最近的研究建议将放射学报告中的自由文本语句映射为从有限词汇表中提取的半结构化术语字符串。本文旨在提出一种自动生成放射报告半结构化表示的方法。该方法包括将放射报告中的句子匹配到手动创建的半结构化表示,然后学习将匹配句子映射到其半结构化表示的序列到序列神经模型。我们在人工注释的胸部x射线放射学报告的OpenI语料库上评估了所提出的方法。结果表明,无论是在(1)定量测量,如BLEU、ROUGE和METOR方面,还是在(2)放射科医生的定性判断方面,所提出的方法都优于几个基线。研究结果还表明,训练后的模型在来自不同医疗机构的胸部x射线放射学报告样本语料库上产生了合理的半结构化表示。 摘要:Beyond their primary diagnostic purpose, radiology reports have been an invaluable source of information in medical research. Given a corpus of radiology reports, researchers are often interested in identifying a subset of reports describing a particular medical finding. Because the space of medical findings in radiology reports is vast and potentially unlimited, recent studies proposed mapping free-text statements in radiology reports to semi-structured strings of terms taken from a limited vocabulary. This paper aims to present an approach for the automatic generation of semi-structured representations of radiology reports. The approach consists of matching sentences from radiology reports to manually created semi-structured representations, followed by learning a sequence-to-sequence neural model that maps matched sentences to their semi-structured representations. We evaluated the proposed approach on the OpenI corpus of manually annotated chest x-ray radiology reports. The results indicate that the proposed approach is superior to several baselines, both in terms of (1) quantitative measures such as BLEU, ROUGE, and METEOR and (2) qualitative judgment of a radiologist. The results also demonstrate that the trained model produces reasonable semi-structured representations on an out-of-sample corpus of chest x-ray radiology reports from a different medical provider.
Word2Vec|文本|单词(1篇)
【1】 Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP 标题:在词与字之间:自然语言处理中的开放词汇建模与词汇化简史 链接:https://arxiv.org/abs/2112.10508
作者:Sabrina J. Mielke,Zaid Alyafeai,Elizabeth Salesky,Colin Raffel,Manan Dey,Matthias Gallé,Arun Raja,Chenglei Si,Wilson Y. Lee,Benoît Sagot,Samson Tan 机构:BigScience Workshop Tokenization Working Group, Johns Hopkins University, HuggingFace, King Fahd University of Petroleum and Minerals, SAP, Naver Labs Europe, Institute for Infocomm Research, ASTAR Singapore, University of Maryland, Inria Paris 备注:15 page preprint 摘要:我们要建模的文本单位是什么?从字节到多词表达式,文本可以在多种粒度上进行分析和生成。直到最近,大多数自然语言处理(NLP)模型都是对单词进行操作的,将其视为离散的原子标记;但从字节对编码(BPE)开始,基于子词的方法在许多领域占据主导地位,在支持小词表的同时仍然允许快速推理。字符级模型或字节级处理是否就是这条路的终点?在这项综述中,我们通过展示词与字符的混合方法以及基于学习切分的子词方法是如何被提出和评估的,把前神经时代和神经时代的几条研究线联系起来。我们的结论是,目前不存在、而且很可能永远不会有适用于所有应用的万能单一解决方案,对许多应用而言,认真考虑分词(tokenization)问题仍然很重要。 摘要:What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated. We conclude that there is and likely will never be a silver bullet singular solution for all applications and that thinking seriously about tokenization remains important for many applications.
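下面给出字节对编码(BPE)合并学习的极简示意(纯Python,是对Sennrich等人经典做法的简化复述,仅作说明):

import re
from collections import Counter

def get_pair_counts(vocab):
    """统计相邻符号对的加权频次;vocab 的键是以空格分隔的符号序列。"""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """把出现频次最高的符号对合并成一个新符号。"""
    bigram = re.escape(" ".join(pair))
    p = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {p.sub("".join(pair), word): freq for word, freq in vocab.items()}

# 词表:词被拆成字符序列,</w> 标记词尾
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):                        # 学习 5 次合并
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(pair, vocab)
    print("merge:", pair)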
其他(7篇)
【1】 Efficient Large Scale Language Modeling with Mixtures of Experts 标题:高效的混合专家大规模语言建模 链接:https://arxiv.org/abs/2112.10684
作者:Mikel Artetxe,Shruti Bhosale,Naman Goyal,Todor Mihaylov,Myle Ott,Sam Shleifer,Xi Victoria Lin,Jingfei Du,Srinivasan Iyer,Ramakanth Pasunuru,Giri Anantharaman,Xian Li,Shuohui Chen,Halil Akin,Mandeep Baines,Louis Martin,Xing Zhou,Punit Singh Koura,Brian O'Horo,Jeff Wang,Luke Zettlemoyer,Mona Diab,Zornitsa Kozareva,Ves Stoyanov 机构:Meta AI 摘要:混合专家层(MoE)通过条件计算实现语言模型的高效扩展。本文对自回归MoE语言模型与稠密模型在各种设置下的扩展表现进行了详细的实证研究:域内和域外语言建模、零样本和少样本提示,以及完全微调。除微调之外,我们发现MoE的计算效率明显更高。在较小的训练预算下,MoE只需约四分之一的计算量即可达到稠密模型的性能。这一差距随规模增大而缩小,但我们最大的MoE模型(1.1T参数)始终优于计算量相当的稠密模型(6.7B参数)。总体而言,这一性能差距在不同任务和领域之间差异很大,这表明MoE和稠密模型的泛化方式不同,值得未来进一步研究。我们将代码和模型公开供研究使用。 摘要:Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using $\sim$4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
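下面给出top-1门控混合专家(MoE)前馈层的极简示意(PyTorch;为清晰起见逐专家循环计算,未包含负载均衡损失、容量限制或任何分布式优化,仅用于说明条件计算的思想):

import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=32, hidden=64, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)          # 路由器:为每个 token 选一个专家
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, dim)
        scores = torch.softmax(self.gate(x), dim=-1)
        top_val, top_idx = scores.max(dim=-1)            # 每个 token 的 top-1 专家
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():                               # 只对路由到该专家的 token 计算
                out[mask] = expert(x[mask]) * top_val[mask].unsqueeze(-1)
        return out

tokens = torch.randn(10, 32)
print(Top1MoE()(tokens).shape)     # torch.Size([10, 32])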
【2】 Intelligent Online Selling Point Extraction for E-Commerce Recommendation 标题:面向电子商务推荐的智能在线卖点提取 链接:https://arxiv.org/abs/2112.10613
作者:Xiaojie Guo,Shugen Wang,Hanqing Zhao,Shiliang Diao,Jiajia Chen,Zhuoye Ding,Zhen He,Yun Xiao,Bo Long,Han Yu,Lingfei Wu 机构:JD.COM Silicon Valley Research Center, School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore 备注:IAAI 2022 industry award 摘要:在过去的十年中,面向电子商务的产品描述自动生成技术取得了重大进展。随着电子商务平台提供的服务变得多样化,有必要动态调整生成的描述模式。产品卖点是一种重要的产品描述类型,其长度应尽可能短,同时仍传达关键信息。此外,这种产品描述应该吸引读者的眼球。目前,产品卖点通常由人类专家撰写。因此,创建和维护这些内容会产生高昂的成本。如果产品卖点可以由机器自动生成,则这些成本可以显著降低。在本文中,我们报告了我们开发和部署智能在线卖点提取(IOSPE)系统以服务于JD推荐系统的经验。com电子商务平台。自2020年7月以来,IOSPE已成为62个关键类别产品的核心服务(涵盖400多万种产品)。到目前为止,它已经创造了超过1亿个卖点,从而大大扩大了卖点创建操作并节省了人力。与以前的做法相比,这些IOSPE产生的卖点使点击率(CTR)提高了1.89%,客户在产品上花费的平均时间增加了2.03%以上,这对于如此大规模的电子商务平台来说是显著的改进。 摘要:In the past decade, automatic product description generation for e-commerce have witnessed significant advancement. As the services provided by e-commerce platforms become diverse, it is necessary to dynamically adapt the patterns of descriptions generated. The selling point of products is an important type of product description for which the length should be as short as possible while still conveying key information. In addition, this kind of product description should be eye-catching to the readers. Currently, product selling points are normally written by human experts. Thus, the creation and maintenance of these contents incur high costs. These costs can be significantly reduced if product selling points can be automatically generated by machines. In this paper, we report our experience developing and deploying the Intelligent Online Selling Point Extraction (IOSPE) system to serve the recommendation system in the JD.com e-commerce platform. Since July 2020, IOSPE has become a core service for 62 key categories of products (covering more than 4 million products). So far, it has generated more than 0.1 billion selling points, thereby significantly scaling up the selling point creation operation and saving human labour. These IOSPE generated selling points have increased the click-through rate (CTR) by 1.89\% and the average duration the customers spent on the products by more than 2.03\% compared to the previous practice, which are significant improvements for such a large-scale e-commerce platform.
【3】 Spiral Language Modeling 标题:螺旋式语言建模 链接:https://arxiv.org/abs/2112.10543
作者:Yong Cao,Yukun Feng,Shaohui Kuang,Gu Xu 机构:ByteDance 摘要:在几乎所有的文本生成应用中,词序列都是以从左到右(L2R)或从右到左(R2L)的方式构造的,因为自然语言句子就是按L2R或R2L书写的。然而,我们发现自然语言的书写顺序对文本生成而言并非必不可少。本文提出了螺旋式语言建模(SLM),这是一种通用方法,使人们能够以L2R和R2L以外的顺序构造自然语言句子。SLM允许从结果文本中的任意标记开始,并围绕所选标记展开其余标记,从而形成自然语言文本。这使得解码顺序成为语言模型困惑度之外的一个新的优化目标,进一步提高了生成文本的多样性和质量。此外,SLM还可以通过选择合适的起始标记来操控文本构造过程。SLM还引入生成顺序作为额外的正则化,以提高低资源场景下的模型鲁棒性。在8个被广泛研究的神经机器翻译(NMT)任务上的实验表明,与传统的L2R解码方法相比,SLM始终有效,BLEU最多提升4.7。 摘要:In almost all text generation applications, word sequences are constructed in a left-to-right (L2R) or right-to-left (R2L) manner, as natural language sentences are written either L2R or R2L. However, we find that the natural language written order is not essential for text generation. In this paper, we propose Spiral Language Modeling (SLM), a general approach that enables one to construct natural language sentences beyond the L2R and R2L order. SLM allows one to form natural language text by starting from an arbitrary token inside the result text and expanding the rest tokens around the selected ones. It makes the decoding order a new optimization objective besides the language model perplexity, which further improves the diversity and quality of the generated text. Furthermore, SLM makes it possible to manipulate the text construction process by selecting a proper starting token. SLM also introduces generation orderings as additional regularization to improve model robustness in low-resource scenarios. Experiments on 8 widely studied Neural Machine Translation (NMT) tasks show that SLM is constantly effective with up to 4.7 BLEU increase comparing to the conventional L2R decoding approach.
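下面用纯Python示意一种可能的“螺旋式”生成顺序:从任意起始位置出发,向左右两侧交替向外扩展(这只是对摘要思想的示意,并非论文的具体解码算法):

def spiral_order(length, start):
    """返回从 start 开始、左右交替向外扩展的位置序列。"""
    order, left, right = [start], start - 1, start + 1
    while left >= 0 or right < length:
        if right < length:
            order.append(right)
            right += 1
        if left >= 0:
            order.append(left)
            left -= 1
    return order

tokens = ["the", "cat", "sat", "on", "the", "mat"]
order = spiral_order(len(tokens), start=2)        # 假设从 "sat" 开始生成
print(order)                                      # [2, 3, 1, 4, 0, 5]
print([tokens[i] for i in order])                 # 生成时逐步填入的词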
【4】 English-to-Chinese Transliteration with Phonetic Back-transliteration 标题:英汉音译与音译回译 链接:https://arxiv.org/abs/2112.10321
作者:Shi Cheng,Zhuofei Ding,Songpeng Yan 机构:School of Informatics, University of Edinburgh, Crichton Street, Edinburgh EH,AB 摘要:音译是根据语音相似性将命名实体从一种语言翻译成另一种语言的任务。近年来,这项任务采用了深度学习方法,但大多数人忽视了所涉及语言的语音特征。在这项工作中,我们通过两种方式将语音信息纳入神经网络:我们使用前向和后向翻译合成额外数据,但采用语音方式;在学习音译之前,我们对语音任务的模型进行预训练。我们的实验包括三对语言和六个方向,即英语与汉语、希伯来语和泰语之间的转换。结果表明,与现有技术相比,我们提出的方法为模型带来了好处,并取得了更好或类似的性能。 摘要:Transliteration is a task of translating named entities from a language to another, based on phonetic similarity. The task has embraced deep learning approaches in recent years, yet, most ignore the phonetic features of the involved languages. In this work, we incorporate phonetic information into neural networks in two ways: we synthesize extra data using forward and back-translation but in a phonetic manner; and we pre-train models on a phonetic task before learning transliteration. Our experiments include three language pairs and six directions, namely English to and from Chinese, Hebrew and Thai. Results indicate that our proposed approach brings benefits to the model and achieves better or similar performance when compared to state of the art.
【5】 Assessing Post-editing Effort in the English-Hindi Direction 标题:评估英语-印地语方向的译后编辑工作量 链接:https://arxiv.org/abs/2112.09841
作者:Arafat Ahsan,Vandan Mujadia,Dipti Misra Sharma 机构:IIIT, Hyderabad, India 备注:ICON-2021 摘要:我们展示了第一个深入的英文印地语编辑后努力评估研究的结果,以及多个努力指标。我们进行了一项由专业翻译人员参与的对照实验,他们在从头开始的翻译和编辑后的条件下交替完成指定的任务。我们发现,与从头开始翻译相比,后期编辑减少了翻译时间(63%),使用了更少的按键(59%),并减少了暂停次数(63%)。我们通过人工评估任务进一步验证由此产生的翻译质量,在该任务中,我们没有发现任何明显的质量差异。 摘要:We present findings from a first in-depth post-editing effort estimation study in the English-Hindi direction along multiple effort indicators. We conduct a controlled experiment involving professional translators, who complete assigned tasks alternately, in a translation from scratch and a post-edit condition. We find that post-editing reduces translation time (by 63%), utilizes fewer keystrokes (by 59%), and decreases the number of pauses (by 63%) when compared to translating from scratch. We further verify the quality of translations thus produced via a human evaluation task in which we do not detect any discernible quality differences.
【6】 Can we Fix the Scope for Coreference? Problems and Solutions for Benchmarks beyond OntoNotes 标题:我们能确定共指的范围吗?面向OntoNotes之外基准的问题与解决方案 链接:https://arxiv.org/abs/2112.09742
作者:Amir Zeldes 机构:Georgetown University, Department of Linguistics 摘要:由于OntoNotes基准数据集的大小和一致性,目前关于自动共指消解的工作主要集中在OntoNotes基准数据集上。然而,NLP实践者对OntoNotes注释方案的许多方面没有很好的理解,包括对通用名词短语、名词修饰语、不定回指、谓词等的处理。这些通常会导致违反直觉的声明、结果和系统行为。这篇评论文章的目的是强调OntoNotes共指再现的一些问题,并根据三个原则提出前进的方向:1。注重语义学,而不是形态语法;2.跨语言概括性;三,。标识和范围的分离,可以解决涉及时域和模态域一致性的旧问题。 摘要:Current work on automatic coreference resolution has focused on the OntoNotes benchmark dataset, due to both its size and consistency. However many aspects of the OntoNotes annotation scheme are not well understood by NLP practitioners, including the treatment of generic NPs, noun modifiers, indefinite anaphora, predication and more. These often lead to counterintuitive claims, results and system behaviors. This opinion piece aims to highlight some of the problems with the OntoNotes rendition of coreference, and to propose a way forward relying on three principles: 1. a focus on semantics, not morphosyntax; 2. cross-linguistic generalizability; and 3. a separation of identity and scope, which can resolve old problems involving temporal and modal domain consistency.
【7】 Improving scripts with a memory of natural feedback 标题:利用对自然反馈的记忆来改进脚本 链接:https://arxiv.org/abs/2112.09737
作者:Niket Tandon,Aman Madaan,Peter Clark,Yiming Yang 机构:Allen Institute for Artificial Intelligence, Seattle, WA, USA, † Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA 摘要:如果部署的结构化预测模型生成错误的输出,最终用户如何提供反馈?我们的目标是允许用户通过交互直接更正错误,而无需再训练,通过对模型输出进行反馈。我们创建了一个动态内存体系结构,该体系结构具有不断增长的输出错误反馈内存。给定一个新的、看不见的输入,我们的模型可以使用来自类似的、过去错误状态的反馈。在一个脚本生成任务中,我们以经验证明,该模型能够有效地应用反馈(最多提高30分),同时避免部署后出现类似的过去错误(在一个看不见的集合上最多提高10分)。这是加强已部署模型的第一步,有可能扩大其实用性。 摘要:How can an end-user provide feedback if a deployed structured prediction model generates incorrect output? Our goal is to allow users to correct errors directly through interaction, without retraining, by giving feedback on the model's output. We create a dynamic memory architecture with a growing memory of feedbacks about errors in the output. Given a new, unseen input, our model can use feedback from a similar, past erroneous state. On a script generation task, we show empirically that the model learns to apply feedback effectively (up to 30 points improvement), while avoiding similar past mistakes after deployment (up to 10 points improvement on an unseen set). This is a first step towards strengthening deployed models, potentially broadening their utility.
机器翻译,仅供参考