
Natural Language Processing Academic Digest [8.30]

Author: arXiv Daily Academic Digest (WeChat official account)
Published: 2021-09-16 14:49:05

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favorites, and more!

cs.CL: 28 papers today.

Transformer (1 paper)

【1】 Can the Transformer Be Used as a Drop-in Replacement for RNNs in Text-Generating GANs? Link: https://arxiv.org/abs/2108.12275

Authors: Kevin Blin, Andrei Kucharavy
Affiliations: Section of Communication Systems, Bachelor semester, EPFL, Lausanne, Switzerland; Distributed Computing Laboratory
Note: accepted to RANLP 2021
Abstract: In this paper we address the problem of fine-tuned text generation with a limited computational budget. For that, we use a well-performing text generative adversarial network (GAN) architecture - Diversity-Promoting GAN (DPGAN), and attempted a drop-in replacement of the LSTM layer with a self-attention-based Transformer layer in order to leverage their efficiency. The resulting Self-Attention DPGAN (SADPGAN) was evaluated for performance, quality and diversity of generated text and stability. Computational experiments suggested that a transformer architecture is unable to drop-in replace the LSTM layer, under-performing during the pre-training phase and undergoing a complete mode collapse during the GAN tuning phase. Our results suggest that the transformer architecture needs to be adapted before it can be used as a replacement for RNNs in text-generating GANs.

BERT (1 paper)

【1】 A New Sentence Ordering Method Using BERT Pretrained Model Link: https://arxiv.org/abs/2108.11994

Authors: Melika Golestani, Seyedeh Zahra Razavi, Heshaam Faili
Affiliations: Department of ECE, University of Tehran, Tehran, Iran; Department of Computer Science, University of Rochester, Rochester, USA
Note: 7 pages, 4 figures, 2020 11th International Conference on Information and Knowledge Technology (IKT)
Abstract: Building systems with capability of natural language understanding (NLU) has been one of the oldest areas of AI. An essential component of NLU is to detect logical succession of events contained in a text. The task of sentence ordering is proposed to learn succession of events with applications in AI tasks. The performance of previous works employing statistical methods is poor, while the neural networks-based approaches are in serious need of large corpora for model learning. In this paper, we propose a method for sentence ordering which does not need a training phase and consequently a large corpus for learning. To this end, we generate sentence embedding using BERT pre-trained model and measure sentence similarity using cosine similarity score. We suggest this score as an indicator of sequential events' level of coherence. We finally sort the sentences through brute-force search to maximize overall similarities of the sequenced sentences. Our proposed method outperformed other baselines on ROCStories, a corpus of 5-sentence human-made stories. The method is specifically more efficient than neural network-based methods when no huge corpus is available. Among other advantages of this method are its interpretability and needlessness to linguistic knowledge.
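As a rough illustration of the procedure described in this abstract (not the authors' code), the pipeline of BERT sentence embeddings, cosine similarity between consecutive sentences, and a brute-force search over orderings can be sketched as follows; the sentence-transformers package and the encoder name are stand-in assumptions.

```python
# Sketch: order sentences by maximizing cosine similarity of consecutive pairs.
from itertools import permutations

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def order_sentences(sentences):
    model = SentenceTransformer("bert-base-nli-mean-tokens")  # any BERT-based sentence encoder
    emb = model.encode(sentences)                              # (n, d) sentence embeddings
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)     # unit-normalize for cosine similarity

    def coherence(order):
        # total cosine similarity between consecutive sentences in this ordering
        return sum(float(emb[a] @ emb[b]) for a, b in zip(order, order[1:]))

    best = max(permutations(range(len(sentences))), key=coherence)
    return [sentences[i] for i in best]

print(order_sentences([
    "She finally won the race.",
    "Anna trained every morning.",
    "On race day she felt confident.",
]))
```

Brute-force search over permutations is only feasible for short texts such as the 5-sentence ROCStories items the paper evaluates on.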

Machine Translation (2 papers)

【1】 Translation Error Detection as Rationale Extraction Link: https://arxiv.org/abs/2108.12197

Authors: Marina Fomicheva, Lucia Specia, Nikolaos Aletras
Affiliations: Department of Computer Science, University of Sheffield; Department of Computing, Imperial College London, United Kingdom
Abstract: Recent Quality Estimation (QE) models based on multilingual pre-trained representations have achieved very competitive results when predicting the overall quality of translated sentences. Predicting translation errors, i.e. detecting specifically which words are incorrect, is a more challenging task, especially with limited amounts of training data. We hypothesize that, not unlike humans, successful QE models rely on translation errors to predict overall sentence quality. By exploring a set of feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level QE models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect translation errors. We therefore (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e. how interpretable model explanations are to humans.

【2】 Secoco: Self-Correcting Encoding for Neural Machine Translation Link: https://arxiv.org/abs/2108.12137

Authors: Tao Wang, Chengqi Zhao, Mingxuan Wang, Lei Li, Hang Li, Deyi Xiong
Affiliations: ByteDance AI Lab; College of Intelligence and Computing, Tianjin University, Tianjin, China
Note: 6 pages, 2 figures, 3 tables
Abstract: This paper presents Self-correcting Encoding (Secoco), a framework that effectively deals with input noise for robust neural machine translation by introducing self-correcting predictors. Different from previous robust approaches, Secoco enables NMT to explicitly correct noisy inputs and delete specific errors simultaneously with the translation decoding process. Secoco is able to achieve significant improvements over strong baselines on two real-world test sets and a benchmark WMT dataset with good interpretability. We will make our code and dataset publicly available soon.

Semantic Analysis (1 paper)

【1】 Semantic-based Self-Critical Training For Question Generation Link: https://arxiv.org/abs/2108.12026

Authors: Loïc Kwate Dassi
Affiliations: Ensimag, Grenoble, France
Note: 5 pages, 1 figure
Abstract: We present in this work a fully Transformer-based reinforcement learning generator-evaluator architecture for neural question generation. Question generation is a task that consists in generating questions given a context and answer. To improve the quality of the generated question, we came up with a semantic-based self-critical training layout in generator-evaluator architecture, which goes beyond typical maximum likelihood training. Evaluation metrics for language modeling only based on n-gram overlapping do not consider semantic relations between reference and candidate strings. To improve the evaluation step, we assess our model for both n-gram overlap using BLEU and semantically using BERTScore and NUBIA, a novel state-of-the-art evaluation metric for text generation. Question generation could be used in many downstream applications, including in extending question answering datasets, conversational systems, and educational assessment systems.
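To make the evaluation step concrete, the sketch below scores a generated question against a reference with both n-gram overlap (BLEU via sacrebleu) and semantic similarity (BERTScore via the bert-score package); NUBIA is omitted, and the example strings are invented. This illustrates the metrics named in the abstract, not the authors' training loop.

```python
# Sketch: n-gram overlap (BLEU) plus semantic similarity (BERTScore) for a generated question.
import sacrebleu                  # pip install sacrebleu
from bert_score import score      # pip install bert-score

candidates = ["What city was the treaty signed in?"]
references = ["In which city was the treaty signed?"]

bleu = sacrebleu.corpus_bleu(candidates, [references])
P, R, F1 = score(candidates, references, lang="en")   # tensors, one value per candidate

print(f"BLEU: {bleu.score:.2f}")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```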

Graph | Knowledge Graph | Knowledge (1 paper)

【1】 DomiKnowS: A Library for Integration of Symbolic Domain Knowledge in Deep Learning Link: https://arxiv.org/abs/2108.12370

Authors: Hossein Rajaby Faghihi, Quan Guo, Andrzej Uszok, Aliakbar Nafar, Elaheh Raisi, Parisa Kordjamshidi
Affiliations: Michigan State University; Sichuan University; Florida Institute for Human and Machine Cognition
Note: Accepted at EMNLP 2021 demo track
Abstract: We demonstrate a library for the integration of domain knowledge in deep learning architectures. Using this library, the structure of the data is expressed symbolically via graph declarations and the logical constraints over outputs or latent variables can be seamlessly added to the deep models. The domain knowledge can be defined explicitly, which improves the models' explainability in addition to the performance and generalizability in the low-data regime. Several approaches for such an integration of symbolic and sub-symbolic models have been introduced; however, there is no library to facilitate the programming for such an integration in a generic way while various underlying algorithms can be used. Our library aims to simplify programming for such an integration in both training and inference phases while separating the knowledge representation from learning algorithms. We showcase various NLP benchmark tasks and beyond. The framework is publicly available at GitHub (https://github.com/HLR/DomiKnowS).

Summarization | Information Extraction (1 paper)

【1】 Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization Link: https://arxiv.org/abs/2108.11992

Authors: Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan, Zhe Wang
Affiliations: University of Delaware; University of Maryland, College Park; Tongji University; Tezign
Abstract: In this paper, we present a denoising sequence-to-sequence (seq2seq) autoencoder via contrastive learning for abstractive text summarization. Our model adopts a standard Transformer-based architecture with a multi-layer bi-directional encoder and an auto-regressive decoder. To enhance its denoising ability, we incorporate self-supervised contrastive learning along with various sentence-level document augmentation. These two components, seq2seq autoencoder and contrastive learning, are jointly trained through fine-tuning, which improves the performance of text summarization with regard to ROUGE scores and human evaluation. We conduct experiments on two datasets and demonstrate that our model outperforms many existing benchmarks and even achieves comparable performance to the state-of-the-art abstractive systems trained with more complex architecture and extensive computation resources.

Reasoning | Analysis | Understanding | Explanation (1 paper)

【1】 Using GAN-based models to sentimental analysis on imbalanced datasets in education domain Link: https://arxiv.org/abs/2108.12061

Authors: Ru Yang, Maryam Edalati
Affiliations: Department of Computer Science, Norwegian University of Science and Technology, Gjøvik, Norway
Abstract: While the whole world is still struggling with the COVID-19 pandemic, online learning and home office become more common. Many schools transfer their courses teaching to the online classroom. Therefore, it is significant to mine the students' feedback and opinions from their reviews towards studies so that both schools and teachers can know where they need to improve. This paper trains machine learning and deep learning models using both balanced and imbalanced datasets for sentiment classification. Two SOTA category-aware text generation GAN models: CatGAN and SentiGAN, are utilized to synthesize text used to balance the highly imbalanced dataset. Results on three datasets with different imbalance degree from distinct domains show that when using generated text to balance the dataset, the F1-score of machine learning and deep learning model on sentiment classification increases 2.79% ~ 9.28%. Also, the results indicate that the average growth degree for CR100k is higher than CR23k, the average growth degree for deep learning is more increased than machine learning algorithms, and the average growth degree for more complex deep learning models is more increased than simpler deep learning models in experiments.

GAN | Adversarial | Attacks | Generation (4 papers)

【1】 Latent Tree Decomposition Parsers for AMR-to-Text Generation Link: https://arxiv.org/abs/2108.12304

Authors: Lisa Jin, Daniel Gildea
Affiliations: Department of Computer Science, University of Rochester, Rochester, NY
Abstract: Graph encoders in AMR-to-text generation models often rely on neighborhood convolutions or global vertex attention. While these approaches apply to general graphs, AMRs may be amenable to encoders that target their tree-like structure. By clustering edges into a hierarchy, a tree decomposition summarizes graph structure. Our model encodes a derivation forest of tree decompositions and extracts an expected tree. From tree node embeddings, it builds graph edge features used in vertex attention of the graph encoder. Encoding TD forests instead of shortest-pairwise paths in a self-attentive baseline raises BLEU by 0.7 and chrF++ by 0.3. The forest encoder also surpasses a convolutional baseline for molecular property prediction by 1.92% ROC-AUC.

【2】 Tree Decomposition Attention for AMR-to-Text Generation Link: https://arxiv.org/abs/2108.12300

Authors: Lisa Jin, Daniel Gildea
Affiliations: Department of Computer Science, University of Rochester, Rochester, NY
Abstract: Text generation from AMR requires mapping a semantic graph to a string that it annotates. Transformer-based graph encoders, however, poorly capture vertex dependencies that may benefit sequence prediction. To impose order on an encoder, we locally constrain vertex self-attention using a graph's tree decomposition. Instead of forming a full query-key bipartite graph, we restrict attention to vertices in parent, subtree, and same-depth bags of a vertex. This hierarchical context lends both sparsity and structure to vertex state updates. We apply dynamic programming to derive a forest of tree decompositions, choosing the most structurally similar tree to the AMR. Our system outperforms a self-attentive baseline by 1.6 BLEU and 1.8 chrF++.

【3】 Automated Generation of Accurate & Fluent Medical X-ray Reports Link: https://arxiv.org/abs/2108.12126

Authors: Hoang T. N. Nguyen, Dong Nie, Taivanbat Badamdorj, Yujie Liu, Yingying Zhu, Jason Truong, Li Cheng
Affiliations: University of North Carolina at Chapel Hill; University of Alberta; Guangzhou University of Chinese Medicine; University of Texas at Arlington
Note: accepted in EMNLP
Abstract: Our paper focuses on automating the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists. Unlike existing medical report generation efforts that tend to produce human-readable reports, we aim to generate medical reports that are both fluent and clinically accurate. This is achieved by our fully differentiable and end-to-end paradigm containing three complementary modules: taking the chest X-ray images and clinical history document of patients as inputs, our classification module produces an internal check-list of disease-related topics, referred to as enriched disease embedding; the embedding representation is then passed to our transformer-based generator, giving rise to the medical reports; meanwhile, our generator also produces the weighted embedding representation, which is fed to our interpreter to ensure consistency with respect to disease-related topics. Our approach achieved promising results on commonly-used metrics concerning language fluency and clinical accuracy. Moreover, noticeable performance gains are consistently observed when additional input information is available, such as the clinical document and extra scans of different views.

【4】 Lingxi: A Diversity-aware Chinese Modern Poetry Generation System Link: https://arxiv.org/abs/2108.12108

Authors: Xinran Zhang, Maosong Sun, Jiafeng Liu, Xiaobing Li
Affiliations: Department of Music Artificial Intelligence and Music Information Technology, Central Conservatory of Music, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China
Abstract: Poetry generation has been a difficult task in natural language processing. Unlike plain neural text generation tasks, poetry has a high requirement for novelty, since an easily-understood sentence with too many high frequency words might not be considered as poetic, while adequately ambiguous sentences with low frequency words can possibly be novel and creative. Inspired by this, we present Lingxi, a diversity-aware Chinese modern poetry generation system. We propose nucleus sampling with randomized head (NS-RH) algorithm, which randomizes the high frequency part ("head") of the predicted distribution, in order to emphasize on the "comparatively low frequency" words. The proposed algorithm can significantly increase the novelty of generated poetry compared with traditional sampling methods. The permutation of distribution is controllable by tuning the filtering parameter that determines the "head" to permutate, achieving diversity-aware sampling. We find that even when a large portion of filtered vocabulary is randomized, it can actually generate fluent poetry but with notably higher novelty. We also propose a semantic-similarity-based rejection sampling algorithm, which creates longer and more informative context on the basis of the short input poetry title while maintaining high semantic similarity to the title, alleviating the off-topic issue.
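One plausible reading of the NS-RH idea (a loose sketch, not the paper's exact algorithm) is: keep the usual top-p nucleus, but shuffle the probabilities assigned to its highest-probability "head" tokens before sampling, so frequent tokens lose their fixed advantage. The head size and the shuffle below are simplifying assumptions standing in for the paper's filtering parameter.

```python
# Sketch: nucleus sampling with a randomized head (NS-RH-style).
import numpy as np

def ns_rh_sample(probs, top_p=0.9, head_size=5, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]                        # token ids sorted by probability
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]     # smallest set covering top_p mass

    p = probs[nucleus].copy()
    h = min(head_size, len(nucleus))
    p[:h] = rng.permutation(p[:h])                         # randomize the high-frequency "head"
    p /= p.sum()
    return int(rng.choice(nucleus, p=p))

vocab_probs = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
print(ns_rh_sample(vocab_probs))
```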

Semi-/Weakly-/Un-supervised | Uncertainty (1 paper)

【1】 Injecting Text in Self-Supervised Speech Pretraining Link: https://arxiv.org/abs/2108.12226

Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno
Affiliations: Google, Inc.
Note: submitted to ASRU 2021
Abstract: Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with contrastive loss during pretraining. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative on the well-benchmarked Librispeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The proposed method also serves as an effective strategy to compensate for the lack of transcribed speech, effectively matching the performance of 5000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.

Detection (2 papers)

【1】 Detecting Propaganda on the Sentence Level during the COVID-19 Pandemic Link: https://arxiv.org/abs/2108.12269

Authors: Rong-Ching Chang, Chu-Hsing Lin
Affiliations: Department of Computer Science, Tunghai University
Abstract: The spread of misinformation, conspiracy, and questionable content and information manipulation by foreign adversaries on social media has surged along with the COVID-19 pandemic. Such malicious cyber-enabled actions may cause increasing social polarization, health crises, and property loss. In this paper, using fine-tuned contextualized embedding trained on Reddit, we tackle the detection of the propaganda of such user accounts and their targeted issues on Twitter during March 2020 when the COVID-19 epidemic became recognized as a pandemic. Our result shows that the pro-China group appeared to be tweeting 35 to 115 times more than the neutral group. At the same time, neutral groups were tweeting more positive-attitude content and voicing alarm for the COVID-19 situation. The pro-China group was also using more call-for-action words on political issues not necessarily China-related.

【2】 ProtoInfoMax: Prototypical Networks with Mutual Information Maximization for Out-of-Domain Detection Link: https://arxiv.org/abs/2108.12229

Authors: Iftitahu Ni'mah, Meng Fang, Vlado Menkovski, Mykola Pechenizkiy
Affiliations: Eindhoven University of Technology (TUe); Indonesian Institute of Science (LIPI)
Abstract: The ability to detect Out-of-Domain (OOD) inputs has been a critical requirement in many real-world NLP applications since the inclusion of unsupported OOD inputs may lead to catastrophic failure of systems. However, it remains an empirical question whether current algorithms can tackle such problem reliably in a realistic scenario where zero OOD training data is available. In this study, we propose ProtoInfoMax, a new architecture that extends Prototypical Networks to simultaneously process In-Domain (ID) and OOD sentences via Mutual Information Maximization (InfoMax) objective. Experimental results show that our proposed method can substantially improve performance up to 20% for OOD detection in low resource settings of text classification. We also show that ProtoInfoMax is less prone to typical over-confidence Error of Neural Networks, leading to more reliable ID and OOD prediction outcomes.

Recognition / Classification (5 papers)

【1】 Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling Link: https://arxiv.org/abs/2108.12177

Authors: Adeep Hande, Karthik Puranik, Konthala Yasaswini, Ruba Priyadharshini, Sajeetha Thavareesan, Anbukkarasi Sampath, Kogilavani Shanmugavadivel, Durairaj Thenmozhi, Bharathi Raja Chakravarthi
Affiliations: Indian Institute of Information Technology Tiruchirappalli, Tamil Nadu, India; ULTRA Arts and Science College, Madurai, Tamil Nadu, India; Eastern University, Sri Lanka; Kongu Engineering College, Erode, Tamil Nadu, India
Note: 27 pages, 12 figures, 10 tables
Abstract: Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free manifestation of thoughts and facts in text, images and video, there is an extensive need to screen them to protect individuals and groups from offensive content targeted at them. Our work intends to classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam. We intend to improve offensive language identification by generating pseudo-labels on the dataset. A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language, either Kannada, Malayalam, or Tamil and then generating pseudo-labels for the transliterated dataset. The two datasets are combined using the generated pseudo-labels to create a custom dataset called CMTRA. As Dravidian languages are under-resourced, our approach increases the amount of training data for the language models. We fine-tune several recent pretrained language models on the newly constructed dataset. We extract the pretrained language embeddings and pass them onto recurrent neural networks. We observe that fine-tuning ULMFiT on the custom dataset yields the best results on the code-mixed test sets of all three languages. Our approach yields the best results among the benchmarked models on Tamil-English, achieving a weighted F1-Score of 0.7934 while scoring competitive weighted F1-Scores of 0.9624 and 0.7306 on the code-mixed test sets of Malayalam-English and Kannada-English, respectively.
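The pseudo-labeling step itself is generic and can be sketched as below: a seed classifier predicts labels for the unlabeled (here, transliterated) texts, confident predictions are kept as pseudo-labels, and the model is retrained on the combined data. The TF-IDF + logistic regression pipeline, the toy texts, and the 0.6 confidence threshold are stand-ins; the paper fine-tunes pretrained language models instead.

```python
# Sketch: pseudo-labeling with a seed classifier and a confidence threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts   = ["nice video", "you are an idiot", "great song", "shut up fool"]
labels          = ["not_offensive", "offensive", "not_offensive", "offensive"]
unlabeled_texts = ["what a lovely clip", "worst idiot ever"]

seed = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
                     LogisticRegression(max_iter=1000))
seed.fit(labeled_texts, labels)

proba  = seed.predict_proba(unlabeled_texts)
pseudo = [(text, seed.classes_[p.argmax()])
          for text, p in zip(unlabeled_texts, proba) if p.max() >= 0.6]  # keep confident predictions only

augmented_texts  = labeled_texts + [t for t, _ in pseudo]
augmented_labels = labels + [y for _, y in pseudo]
seed.fit(augmented_texts, augmented_labels)   # retrain on labeled + pseudo-labeled data
print(pseudo)
```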

【2】 Grammar Based Identification Of Speaker Role For Improving ATCO And Pilot ASR Link: https://arxiv.org/abs/2108.12175

Authors: Amrutha Prasad, Juan Zuluaga-Gomez, Petr Motlicek, Oliver Ohneiser, Hartmut Helmke, Saeed Sarfjoo, Iuliia Nigmatulina
Affiliations: Idiap Research Institute, Martigny, Switzerland; Brno University of Technology, Brno, Czechia; Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland; German Aerospace Center (DLR), Institute of Flight Guidance, Braunschweig, Germany
Note: Submitted to Interspeech 2021
Abstract: Assistant Based Speech Recognition (ABSR) for air traffic control is generally trained by pooling both Air Traffic Controller (ATCO) and pilot data. In practice, this is motivated by the fact that the proportion of pilot data is lesser compared to ATCO while their standard language of communication is similar. However, due to data imbalance of ATCO and pilot and their varying acoustic conditions, the ASR performance is usually significantly better for ATCOs than pilots. In this paper, we propose to (1) split the ATCO and pilot data using an automatic approach exploiting ASR transcripts, and (2) consider ATCO and pilot ASR as two separate tasks for Acoustic Model (AM) training. For speaker role classification of ATCO and pilot data, a hypothesized ASR transcript is generated with a seed model, subsequently used to classify the speaker role based on the knowledge extracted from grammar defined by International Civil Aviation Organization (ICAO). This approach provides an average speaker role identification accuracy of 83% for ATCO and pilot. Finally, we show that training AMs separately for each task, or using a multitask approach, is well suited for this data compared to AM trained by pooling all data.

【3】 Improving callsign recognition with air-surveillance data in air-traffic communication Link: https://arxiv.org/abs/2108.12156

Authors: Iuliia Nigmatulina, Rudolf Braun, Juan Zuluaga-Gomez, Petr Motlicek
Affiliations: Idiap Research Institute, Martigny, Switzerland; Institute of Computational Linguistics, University of Zürich; Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland
Note: Submitted to Interspeech 2021
Abstract: Automatic Speech Recognition (ASR) can be used as the assistance of speech communication between pilots and air-traffic controllers. Its application can significantly reduce the complexity of the task and increase the reliability of transmitted information. Evidently, high accuracy predictions are needed to minimize the risk of errors. Especially, high accuracy is required in recognition of key information, such as commands and callsigns, used to navigate pilots. Our results prove that the surveillance data containing callsigns can help to considerably improve the recognition of a callsign in an utterance when the weights of probable callsign n-grams are reduced per utterance. In this paper, we investigate two approaches: (1) G-boosting, when callsign weights are adjusted at the language model level (G) and followed by the dynamic decoder with an on-the-fly composition, and (2) lattice rescoring, when callsign information is introduced on top of lattices generated using a conventional decoder. Boosting callsign n-grams with the combination of two methods allowed us to gain 28.4% of absolute improvement in callsign recognition accuracy and up to 74.2% of relative improvement in WER of callsign recognition.

【4】 4-bit Quantization of LSTM-based Speech Recognition Models Link: https://arxiv.org/abs/2108.12074

Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan
Affiliations: IBM Research, USA
Note: 5 pages, 3 figures; Andrea Fasoli and Chia-Yu Chen contributed equally to this work. Paper accepted to Interspeech 2021
Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve <0.5% and <1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits degradation in 4-bit inference to 1.3%.
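The generic arithmetic behind 4-bit integer quantization can be sketched as follows (per-tensor symmetric scaling only); the paper's per-layer quantizer choices and initializations are not reproduced here.

```python
# Sketch: symmetric 4-bit quantization and dequantization of a weight tensor.
import numpy as np

def quantize_int4(w):
    qmax = 7                                                   # signed 4-bit range is [-8, 7]
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)    # int4 values stored in an int8 container
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = (np.random.randn(4, 4) * 0.1).astype(np.float32)
q, s = quantize_int4(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s)).max())
```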

【5】 EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa Link: https://arxiv.org/abs/2108.12009

Authors: Taewoon Kim, Piek Vossen
Affiliations: Vrije Universiteit Amsterdam
Note: 4 pages, not including references and appendix
Abstract: We present EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa, a simple yet expressive scheme of solving the ERC (emotion recognition in conversation) task. By simply prepending speaker names to utterances and inserting separation tokens between the utterances in a dialogue, EmoBERTa can learn intra- and inter-speaker states and context to predict the emotion of a current speaker, in an end-to-end manner. Our experiments show that we reach a new state of the art on the two popular ERC datasets using a basic and straight-forward approach. We've open sourced our code and models at https://github.com/tae898/erc.
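The input construction described in the abstract is simple enough to sketch directly: speaker names are prepended to utterances and separator tokens are inserted between utterances before the dialogue is fed to a RoBERTa classifier. The dialogue below and the use of the tokenizer's default separator are illustrative assumptions, not the authors' exact preprocessing.

```python
# Sketch: speaker-aware input formatting for emotion recognition in conversation.
from transformers import AutoTokenizer   # pip install transformers

dialogue = [
    ("Chandler", "Hey, how was the interview?"),
    ("Monica",   "It went great, I think I got the job!"),
    ("Chandler", "That's amazing!"),
]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text = tokenizer.sep_token.join(f"{speaker}: {utterance}" for speaker, utterance in dialogue)
encoded = tokenizer(text, return_tensors="pt", truncation=True)

print(text)                         # speaker-prefixed utterances joined by separator tokens
print(encoded["input_ids"].shape)
```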

Representation (1 paper)

【1】 Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies Link: https://arxiv.org/abs/2108.12084

Authors: Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff M Phillips, Kai-Wei Chang
Affiliations: UCLA; University of Utah
Abstract: Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.

Word2Vec | Text | Words (1 paper)

【1】 Deep learning models are not robust against noise in clinical text Link: https://arxiv.org/abs/2108.12242

Authors: Milad Moradi, Kathrin Blagec, Matthias Samwald
Affiliations: Institute for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
Abstract: Artificial Intelligence (AI) systems are attracting increasing interest in the medical domain due to their ability to learn complicated tasks that require human intelligence and expert knowledge. AI systems that utilize high-performance Natural Language Processing (NLP) models have achieved state-of-the-art results on a wide variety of clinical text processing benchmarks. They have even outperformed human accuracy on some tasks. However, performance evaluation of such AI systems has been limited to accuracy measures on curated and clean benchmark datasets that may not properly reflect how robustly these systems can operate in real-world situations. In order to address this challenge, we introduce and implement a wide variety of perturbation methods that simulate different types of noise and variability in clinical text data. While noisy samples produced by these perturbation methods can often be understood by humans, they may cause AI systems to make erroneous decisions. Conducting extensive experiments on several clinical text processing tasks, we evaluated the robustness of high-performance NLP models against various types of character-level and word-level noise. The results revealed that the NLP models' performance degrades when the input contains small amounts of noise. This study is a significant step towards exposing vulnerabilities of AI models utilized in clinical text processing systems. The proposed perturbation methods can be used in performance evaluation tests to assess how robustly clinical NLP models can operate on noisy data, in real-world settings.
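As a flavor of the kind of perturbations described (a generic sketch, not the paper's full perturbation suite), the functions below inject character-level typos (neighbor swaps, deletions) and word-level noise (random word drops); the rates and the example note are arbitrary.

```python
# Sketch: simple character-level and word-level perturbations of clinical text.
import random

def char_perturb(text, rate=0.05, rng=random.Random(0)):
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            if rng.random() < 0.5:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]   # swap neighboring characters
            else:
                chars[i] = ""                                     # delete a character
    return "".join(chars)

def word_perturb(text, rate=0.1, rng=random.Random(0)):
    return " ".join(w for w in text.split() if rng.random() >= rate)   # drop random words

note = "Patient denies chest pain and shortness of breath."
print(char_perturb(note))
print(word_perturb(note))
```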

Other Neural Networks | Deep Learning | Models | Modeling (4 papers)

【1】 CAPE: Context-Aware Private Embeddings for Private Language Learning Link: https://arxiv.org/abs/2108.12318

Authors: Richard Plant, Dimitra Gkatzia, Valerio Giuffrida
Affiliations: Edinburgh Napier University
Note: Accepted into EMNLP21 main conference
Abstract: Deep learning-based language models have achieved state-of-the-art results in a number of applications including sentiment analysis, topic labelling, intent classification and others. Obtaining text representations or embeddings using these models presents the possibility of encoding personally identifiable information learned from language and context cues that may present a risk to reputation or privacy. To ameliorate these issues, we propose Context-Aware Private Embeddings (CAPE), a novel approach which preserves privacy during training of embeddings. To maintain the privacy of text representations, CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information. In addition, CAPE employs an adversarial training regime that obscures identified private variables. Experimental results demonstrate that the proposed approach reduces private information leakage better than either single intervention.
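The differential-privacy half of the idea can be sketched as adding calibrated Laplace noise to a representation before it leaves the encoder; the clipping step, sensitivity bound, and epsilon below are generic assumptions, and the adversarial-training component of CAPE is not shown.

```python
# Sketch: clipping an embedding and adding Laplace noise calibrated to sensitivity / epsilon.
import numpy as np

def privatize_embedding(emb, epsilon=1.0, clip_norm=1.0, rng=np.random.default_rng(0)):
    norm = np.abs(emb).sum()                      # L1 norm of the representation
    if norm > clip_norm:
        emb = emb * (clip_norm / norm)            # clip so L1 sensitivity is bounded by 2 * clip_norm
    scale = 2.0 * clip_norm / epsilon
    return emb + rng.laplace(loc=0.0, scale=scale, size=emb.shape)

embedding = np.random.default_rng(1).normal(size=8).astype(np.float32)
print(privatize_embedding(embedding, epsilon=5.0))
```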

【2】 Evaluating the Robustness of Neural Language Models to Input Perturbations Link: https://arxiv.org/abs/2108.12237

Authors: Milad Moradi, Matthias Samwald
Affiliations: Institute for Artificial Intelligence, Medical University of Vienna, Austria
Note: Accepted by EMNLP 2021
Abstract: High-performance neural language models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks. However, results for common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data. In this study, we design and implement various types of character-level and word-level perturbation methods to simulate realistic scenarios in which input texts may be slightly noisy or different from the data distribution on which NLP systems were trained. Conducting comprehensive experiments on different NLP tasks, we investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations. The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced. We highlight that models need to be further improved and that current benchmarks are not reflecting model robustness well. We argue that evaluations on perturbed inputs should routinely complement widely-used benchmarks in order to yield a more realistic understanding of NLP systems' robustness.

【3】 Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors Link: https://arxiv.org/abs/2108.12216

Authors: Ryo Nagata, Manabu Kimura, Kazuaki Hanawa
Affiliations: Konan University; JST PRESTO; GRAS Group, Inc.; RIKEN AIP; Tohoku University
Abstract: In this paper, we explore the capacity of a language model-based method for grammatical error detection in detail. We first show that 5 to 10% of training data are enough for a BERT-based error detection method to achieve performance equivalent to what a non-language model-based method can achieve with the full training data; recall improves much faster with respect to training data size in the BERT-based method than in the non-language model method while precision behaves similarly. These suggest that (i) the BERT-based method should have a good knowledge of grammar required to recognize certain types of error and that (ii) it can transform the knowledge into error detection rules by fine-tuning with a few training samples, which explains its high generalization ability in grammatical error detection. We further show with pseudo error data that it actually exhibits such nice properties in learning rules for recognizing various types of error. Finally, based on these findings, we explore a cost-effective method for detecting grammatical errors with feedback comments explaining relevant grammatical rules to learners.

【4】 A Partition Filter Network for Joint Entity and Relation Extraction Link: https://arxiv.org/abs/2108.12202

Authors: Zhiheng Yan, Chong Zhang, Jinlan Fu, Qi Zhang, Zhongyu Wei
Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing; National University of Singapore, Singapore; School of Data Science, Fudan University, Shanghai, China
Note: To appear in EMNLP 2021 main conference
Abstract: In joint entity and relation extraction, existing work either sequentially encodes task-specific features, leading to an imbalance in inter-task feature interaction where features extracted later have no direct contact with those that come first, or encodes entity features and relation features in a parallel manner, meaning that feature representation learning for each task is largely independent of each other except for input sharing. We propose a partition filter network to model two-way interaction between tasks properly, where feature encoding is decomposed into two steps: partition and filter. In our encoder, we leverage two gates: entity and relation gate, to segment neurons into two task partitions and one shared partition. The shared partition represents inter-task information valuable to both tasks and is evenly shared across two tasks to ensure proper two-way interaction. The task partitions represent intra-task information and are formed through concerted efforts of both gates, making sure that encoding of task-specific features is dependent upon each other. Experiment results on five public datasets show that our model performs significantly better than previous approaches. The source code can be found at https://github.com/Coopercoppers/PFN.
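A deliberately toy reading of "two gates splitting neurons into entity, relation, and shared partitions" is sketched below; it only illustrates the partitioning intuition and is not the paper's actual gating mechanism.

```python
# Toy sketch: two gates carve a hidden vector into entity-only, relation-only, and shared parts.
import torch
import torch.nn as nn

class ToyPartitionFilter(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.entity_gate = nn.Linear(hidden, hidden)
        self.relation_gate = nn.Linear(hidden, hidden)

    def forward(self, h):
        e = torch.sigmoid(self.entity_gate(h))      # how entity-relevant each neuron is
        r = torch.sigmoid(self.relation_gate(h))    # how relation-relevant each neuron is
        shared   = h * e * r                         # information useful to both tasks
        entity   = h * e * (1 - r)                   # entity-only information
        relation = h * r * (1 - e)                   # relation-only information
        return entity + shared, relation + shared    # task-specific feature streams

h = torch.randn(2, 5, 16)                            # (batch, seq_len, hidden)
entity_feat, relation_feat = ToyPartitionFilter(16)(h)
print(entity_feat.shape, relation_feat.shape)
```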

Other (2 papers)

【1】 Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation Link: https://arxiv.org/abs/2108.12409

Authors: Ofir Press, Noah A. Smith, Mike Lewis
Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Facebook AI Research; Allen Institute for AI
Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
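The biasing step itself is small enough to sketch: no positional embeddings are added, and each head's query-key scores are penalized in proportion to the query-key distance, with head-specific slopes. The single-layer attention code below is a generic illustration, not the authors' implementation.

```python
# Sketch: causal attention with ALiBi-style linear distance biases (per-head slopes).
import torch

def alibi_attention(q, k, v, slopes):
    # q, k, v: (heads, seq_len, head_dim); slopes: (heads,)
    heads, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5                 # (heads, n, n)
    pos = torch.arange(n)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)       # how far back each key position is
    scores = scores - slopes[:, None, None] * distance          # linear bias, larger penalty for distant keys
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))          # causal language-model masking
    return torch.softmax(scores, dim=-1) @ v

heads, n, d = 8, 6, 16
slopes = torch.tensor([2.0 ** (-8 * (i + 1) / heads) for i in range(heads)])  # geometric per-head slopes
out = alibi_attention(torch.randn(heads, n, d), torch.randn(heads, n, d),
                      torch.randn(heads, n, d), slopes)
print(out.shape)
```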

【2】 Query-Focused Extractive Summarisation for Finding Ideal Answers to Biomedical and COVID-19 Questions Link: https://arxiv.org/abs/2108.12189

Authors: Diego Mollá, Urvashi Khanna, Dima Galat, Vincent Nguyen, Maciej Rybinski
Affiliations: Macquarie University, Australia; CSIRO Data, Australia; Australian National University, Australia
Note: 12 pages, 2 figures, 6 tables. Accepted at BioASQ workshop, CLEF 2021
Abstract: This paper presents Macquarie University's participation in the BioASQ Synergy Task and BioASQ9b Phase B. In each of these tasks, our participation focused on the use of query-focused extractive summarisation to obtain the ideal answers to medical questions. The Synergy Task is an end-to-end question answering task on COVID-19 where systems are required to return relevant documents, snippets, and answers to a given question. Given the absence of training data, we used a query-focused summarisation system that was trained with the BioASQ8b training data set and we experimented with methods to retrieve the documents and snippets. Considering the poor quality of the documents and snippets retrieved by our system, we observed reasonably good quality in the answers returned. For phase B of the BioASQ9b task, the relevant documents and snippets were already included in the test data. Our system split the snippets into candidate sentences and used BERT variants under a sentence classification setup. The system used the question and candidate sentence as input and was trained to predict the likelihood of the candidate sentence being part of the ideal answer. The runs obtained either the best or second best ROUGE-F1 results of all participants to all batches of BioASQ9b. This shows that using BERT in a classification setup is a very strong baseline for the identification of ideal answers.
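The sentence-classification setup described here (question and candidate sentence encoded as a text pair, with a classifier scoring membership in the ideal answer) can be sketched with a generic BERT pair classifier; the model name and the untrained classification head are assumptions, since the paper fine-tunes its own BERT variants.

```python
# Sketch: scoring a candidate sentence against a question with a BERT pair classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

question  = "What is the mechanism of action of remdesivir?"
candidate = "Remdesivir is a nucleotide analog that inhibits the viral RNA polymerase."

inputs = tokenizer(question, candidate, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# With an untrained head these scores are meaningless; after fine-tuning, the positive-class
# probability would rank candidate sentences for inclusion in the ideal answer.
print(torch.softmax(logits, dim=-1))
```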

