
Natural Language Processing Academic Digest [7.8]

Author: arXiv Daily Academic Digest (WeChat official account)
Published: 2021-07-27 10:29:56
This article is included in the column: arXiv Daily Academic Digest

cs.CL: 25 papers in total today

Transformer (1 paper)

【1】 Efficient Transformer for Direct Speech Translation

Authors: Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà
Affiliations: TALP Research Center, Universitat Politecnica de Catalunya, Barcelona
Note: (c) 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Link: https://arxiv.org/abs/2107.03069
Abstract: The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, a usual approach is adding strided convolutional layers, to reduce the sequence length before using the Transformer. In this paper, we propose a new approach for direct Speech Translation, where thanks to an efficient Transformer we can work with a spectrogram without having to use convolutional layers before the Transformer. This allows the encoder to learn directly from the spectrogram and no information is lost. We have created an encoder-decoder model, where the encoder is an efficient Transformer -- the Longformer -- and the decoder is a traditional Transformer decoder. Our results, which are close to the ones obtained with the standard approach, show that this is a promising research direction.
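To make the architecture concrete, here is a minimal PyTorch sketch (ours, not the authors' code) of the wiring the abstract describes: spectrogram frames are projected and fed to the encoder one frame per token, with no strided convolutions, under a Longformer-style sliding-window attention pattern, and the decoder is a standard Transformer decoder. For clarity the window is imposed with a dense attention mask, which keeps quadratic memory, whereas a real Longformer uses a sparse implementation; all hyperparameters are illustrative and positional encodings are omitted.

```python
import torch
import torch.nn as nn

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Boolean mask; True entries are blocked, so each frame attends only
    # to neighbours within +/- `window` positions (Longformer-style pattern).
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window

class SpeechTranslationSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab=8000, window=64):
        super().__init__()
        # No strided convolutions: each spectrogram frame becomes one token.
        self.in_proj = nn.Linear(n_mels, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)
        self.window = window

    def forward(self, spec, tgt):
        # spec: (batch, frames, n_mels); tgt: (batch, tgt_len) token ids
        x = self.in_proj(spec)
        attn_mask = sliding_window_mask(x.size(1), self.window).to(x.device)
        memory = self.encoder(x, mask=attn_mask)
        tgt_len = tgt.size(1)
        causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        h = self.decoder(self.embed(tgt), memory, tgt_mask=causal)
        return self.out(h)
```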

QA|VQA|Dialogue (2 papers)

【1】 Robustifying Multi-hop QA through Pseudo-Evidentiality Training

Authors: Kyungjae Lee, Seung-won Hwang, Sang-eun Han, Dohyeon Lee
Affiliations: Yonsei University; Seoul National University
Note: Accepted to ACL 2021 (main conference)
Link: https://arxiv.org/abs/2107.03242
Abstract: This paper studies the bias problem of multi-hop question answering models, of answering correctly without correct reasoning. One way to robustify these models is by supervising to not only answer right, but also with right reasoning chains. An existing direction is to annotate reasoning chains to train models, requiring expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidences, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate "pseudo-evidentiality" annotations. We validate our proposed model on an original set and challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.
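The counterfactual labelling step can be sketched as follows (a simplification under our own naming; the threshold and the confidence interface are assumptions, not the paper's exact procedure): an answer counts as supported when deleting the candidate evidence sentences makes the model's confidence in it drop sharply.

```python
from typing import Callable, List

def pseudo_evidentiality_label(
    confidence: Callable[[str, List[str]], float],  # P(answer | question, context)
    question: str,
    context_sentences: List[str],
    evidence_idx: List[int],
    drop_threshold: float = 0.3,   # hypothetical cut-off
) -> bool:
    with_evidence = confidence(question, context_sentences)
    reduced = [s for i, s in enumerate(context_sentences) if i not in evidence_idx]
    without_evidence = confidence(question, reduced)
    # A large confidence drop suggests the answer really relied on the evidence.
    return (with_evidence - without_evidence) >= drop_threshold

# Toy usage with a stub confidence function:
stub = lambda q, ctx: 0.9 if "Paris" in " ".join(ctx) else 0.2
print(pseudo_evidentiality_label(stub, "Where was X born?",
                                 ["X was born in Paris.", "X liked tea."], [0]))
```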

【2】 Question Answering over Knowledge Graphs with Neural Machine Translation and Entity Linking

Authors: Daniel Diomedi, Aidan Hogan
Affiliations: DCC, Universidad de Chile; IMFD
Link: https://arxiv.org/abs/2107.02865
Abstract: The goal of Question Answering over Knowledge Graphs (KGQA) is to find answers for natural language questions over a knowledge graph. Recent KGQA approaches adopt a neural machine translation (NMT) approach, where the natural language question is translated into a structured query language. However, NMT suffers from the out-of-vocabulary problem, where terms in a question may not have been seen during training, impeding their translation. This issue is particularly problematic for the millions of entities that large knowledge graphs describe. We rather propose a KGQA approach that delegates the processing of entities to entity linking (EL) systems. NMT is then used to create a query template with placeholders that are filled by entities identified in an EL phase. Slot filling is used to decide which entity fills which placeholder. Experiments for QA over Wikidata show that our approach outperforms pure NMT: while there remains a strong dependence on having seen similar query templates during training, errors relating to entities are greatly reduced.
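A minimal sketch of the pipeline described above, with all module interfaces hypothetical (the paper's actual EL, NMT, and slot-filling components are learned models): entity linking extracts KG entities, NMT emits a query template with placeholders, and slot filling assigns entities to placeholders.

```python
from typing import Callable, Dict, List, Tuple

def answer_question(
    question: str,
    link_entities: Callable[[str], List[Tuple[str, str]]],  # -> [(surface form, KG id)]
    translate_to_template: Callable[[str], str],  # -> query template with <e1>, <e2> slots
    fill_slots: Callable[[str, List[Tuple[str, str]]], Dict[str, str]],  # slot -> KG id
) -> str:
    entities = link_entities(question)
    template = translate_to_template(question)
    assignment = fill_slots(template, entities)
    query = template
    for slot, kg_id in assignment.items():
        query = query.replace(slot, kg_id)  # instantiate the template
    return query

# e.g. the template "SELECT ?x WHERE { <e1> wdt:P26 ?x }" with <e1> -> wd:Q76
# would yield an executable Wikidata query (slot syntax here is our own).
```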

Machine Translation (2 papers)

【1】 Time-Aware Ancient Chinese Text Translation and Inference

Authors: Ernie Chang, Yow-Ting Shiue, Hui-Syuan Yeh, Vera Demberg
Affiliations: Dept. of Language Science and Technology, Saarland University; Dept. of Computer Science, University of Maryland, College Park
Note: Accepted at LChange at ACL 2021
Link: https://arxiv.org/abs/2107.03179
Abstract: In this paper, we aim to address the challenges surrounding the translation of ancient Chinese text: (1) the linguistic gap due to the difference in eras results in translations that are poor in quality, and (2) most translations are missing the contextual information that is often very crucial to understanding the text. To this end, we improve upon past translation techniques by reframing the task as a multi-label prediction task where the model predicts both the translation and its particular era. We observe that this helps to bridge the linguistic gap as chronological context is also used as auxiliary information. We validate our framework on a parallel corpus annotated with chronology information and show experimentally its efficacy in producing quality translation outputs. We release both the code and the data at https://github.com/orina1123/time-aware-ancient-text-translation for future research.
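One plausible realization of the joint translation-and-era framing, sketched under the assumption that eras are encoded as special target-side tokens; the tag inventory below is hypothetical and not taken from the paper.

```python
ERA_TAGS = ["<pre-qin>", "<tang>", "<song>", "<ming-qing>"]  # hypothetical tag set

def make_target(modern_translation: str, era_tag: str) -> str:
    assert era_tag in ERA_TAGS
    # The seq2seq model is trained to emit the era tag plus the translation,
    # so a single decoder jointly predicts both labels.
    return f"{era_tag} {modern_translation}"

def split_prediction(decoded: str):
    # Recover the era label and the translation from one decoded sequence.
    tag, translation = decoded.split(" ", 1)
    return tag, translation
```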

【2】 Kosp2e: Korean Speech to English Translation Corpus

Authors: Won Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim
Affiliations: Department of Electrical and Computer Engineering and INMC, Seoul National University; PAPAGO, NAVER
Note: Interspeech 2021 camera-ready
Link: https://arxiv.org/abs/2107.02875
Abstract: Most speech-to-text (S2T) translation studies use English speech as a source, which makes it difficult for non-English speakers to take advantage of the S2T technologies. For some languages, this problem was tackled through corpus construction, but the farther a language is linguistically from English, or the more under-resourced it is, the more significant this deficiency and underrepresentedness becomes. In this paper, we introduce kosp2e (read as "kospi"), a corpus that allows Korean speech to be translated into English text in an end-to-end manner. We adopt open-license speech recognition, translation, and spoken language corpora to make our dataset freely available to the public, and check the performance through pipeline and training-based approaches. Using the pipeline and various end-to-end schemes, we obtain the highest BLEU of 21.3 and 18.0 respectively, based on the English hypothesis, validating the feasibility of our data. We plan to supplement annotations for other target languages through community contributions in the future.

Graph|Knowledge Graph (1 paper)

【1】 Trans4E: Link Prediction on Scholarly Knowledge Graphs

Authors: Mojtaba Nayyeri, Gokce Muge Cil, Sahar Vahdati, Francesco Osborne, Mahfuzur Rahman, Simone Angioni, Angelo Salatino, Diego Reforgiato Recupero, Nadezhda Vassilyeva, Enrico Motta, Jens Lehmann
Affiliations: SDA Research Group, University of Bonn (Germany); Institute for Applied Informatics (InfAI), Fraunhofer IAIS, Dresden (Germany); Knowledge Media Institute, The Open University, Milton Keynes (UK)
Link: https://arxiv.org/abs/2107.03297
Abstract: The incompleteness of Knowledge Graphs (KGs) is a crucial issue affecting the quality of AI-based services. In the scholarly domain, KGs describing research publications typically lack important information, hindering our ability to analyse and predict research dynamics. In recent years, link prediction approaches based on Knowledge Graph Embedding models became the first aid for this issue. In this work, we present Trans4E, a novel embedding model that is particularly fit for KGs which include N to M relations with N ≫ M. This is typical for KGs that categorize a large number of entities (e.g., research articles, patents, persons) according to a relatively small set of categories. Trans4E was applied on two large-scale knowledge graphs, the Academia/Industry DynAmics (AIDA) and Microsoft Academic Graph (MAG), for completing the information about Fields of Study (e.g., 'neural networks', 'machine learning', 'artificial intelligence'), and affiliation types (e.g., 'education', 'company', 'government'), improving the scope and accuracy of the resulting data. We evaluated our approach against alternative solutions on AIDA, MAG, and four other benchmarks (FB15k, FB15k-237, WN18, and WN18RR). Trans4E outperforms the other models when using low embedding dimensions and obtains competitive results in high dimensions.

Summarization|Information Extraction (1 paper)

【1】 A Survey on Dialogue Summarization: Recent Advances and New Frontiers

Authors: Xiachong Feng, Xiaocheng Feng, Bing Qin
Affiliations: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Link: https://arxiv.org/abs/2107.03175
Abstract: With the development of dialogue systems and natural language generation techniques, the resurgence of dialogue summarization has attracted significant research attention, as the task aims to condense the original dialogue into a shorter version covering salient information. However, there remains a lack of a comprehensive survey for this task. To this end, we take the first step and present a thorough review of this research field. In detail, we provide an overview of publicly available research datasets, summarize existing works according to the domain of input dialogue, and organize leaderboards under unified metrics. Furthermore, we discuss some future directions and give our thoughts. We hope that this first survey of dialogue summarization can provide the community with quick access and a general picture of this task and motivate future research.

GAN|Adversarial|Attack|Generation (2 papers)

【1】 On Training Instance Selection for Few-Shot Neural Text Generation

Authors: Ernie Chang, Xiaoyu Shen, Hui-Syuan Yeh, Vera Demberg
Affiliations: Dept. of Language Science and Technology, Saarland University
Note: Accepted at ACL 2021
Link: https://arxiv.org/abs/2107.03176
Abstract: Large-scale pretrained language models have led to dramatic improvements in text generation. Impressive performance can be achieved by finetuning only on a small number of instances (few-shot setting). Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. In this work, we present a study on training instance selection in few-shot neural text generation. The selection decision is made based only on the unlabeled data so as to identify the most worthwhile data points that should be annotated under some budget of labeling cost. Based on the intuition that the few-shot training instances should be diverse and representative of the entire data distribution, we propose a simple selection strategy with K-means clustering. We show that even with the naive clustering-based approach, the generation models consistently outperform random sampling on three text generation tasks: data-to-text generation, document summarization and question generation. We hope that this work will call for more attention on this largely unexplored area.
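The selection strategy itself is simple enough to sketch directly (assuming some sentence-embedding function `embed`, which the sketch leaves abstract): cluster the unlabeled pool with K-means, then annotate the instance nearest each centroid, yielding a diverse and representative few-shot set.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(texts, embed, budget: int):
    # embed: text -> fixed-size vector (e.g. averaged word vectors); an assumption here.
    X = np.stack([embed(t) for t in texts])
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(X)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return chosen  # indices of the instances to label under the budget
```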

【2】 Deep Extrapolation for Attribute-Enhanced Generation

Authors: Alvin Chan, Ali Madani, Ben Krause, Nikhil Naik
Affiliations: Salesforce Research; NTU
Link: https://arxiv.org/abs/2107.02968
Abstract: Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation, focusing on natural language and proteins, and propose GENhance, a generative framework that enhances attributes through a learned latent space. Trained on movie reviews and a computed protein stability dataset, GENhance can generate strongly-positive text reviews and highly stable protein sequences without being exposed to similar data during training. We release our benchmark tasks and models to contribute to the study of generative modeling extrapolation and data-driven design in biology and chemistry.

Recognition/Classification (3 papers)

【1】 A Survey on Data Augmentation for Text Classification

Authors: Markus Bayer, Marc-André Kaufhold, Christian Reuter
Affiliations: Technical University of Darmstadt
Note: 35 pages, 6 figures, 8 tables
Link: https://arxiv.org/abs/2107.03158
Abstract: Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data, over regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divided more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).

【2】 Hierarchical Text Classification of Urdu News using Deep Neural Network

Authors: Taimoor Ahmed Javed, Waseem Shahzad, Umair Arshad
Affiliations: Pakistan
Note: 22 pages with 16 figures
Link: https://arxiv.org/abs/2107.03141
Abstract: Digital text is increasing day by day on the internet. It is very challenging to classify a large and heterogeneous collection of data, which requires improved information processing methods to organize text. To classify a large corpus, one common approach is hierarchical text classification, which aims to classify textual data in a hierarchical structure. Several approaches have been proposed to tackle classification of text, but most of the research has been done on the English language. This paper proposes a deep learning model for hierarchical text classification of news in the Urdu language, consisting of 51,325 sentences from 8 online news websites belonging to the following genres: Sports; Technology; and Entertainment. The objectives of this paper are twofold: (1) to develop a large human-annotated dataset of news in Urdu for hierarchical text classification; and (2) to classify Urdu news hierarchically using our proposed model based on an LSTM mechanism, named Hierarchical Multi-layer LSTMs (HMLSTM). Our model consists of two modules: a Text Representing Layer for obtaining text representations, in which we use Word2vec embeddings to transform words into vectors, and the Urdu Hierarchical LSTM Layer (UHLSTML), an end-to-end fully connected deep LSTM network that performs automatic feature learning; we train one LSTM layer for each level of the class hierarchy. We have performed extensive experiments on our self-created dataset, the Urdu News Dataset for Hierarchical Text Classification (UNDHTC). The results show that our proposed method is very effective for hierarchical text classification: it outperforms baseline methods significantly and also achieves good results compared to deep neural models.
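A minimal PyTorch sketch in the spirit of HMLSTM (dimensions and class counts are ours, not the paper's): an embedding layer that would be initialized from Word2vec in practice, one LSTM layer per level of the class hierarchy, and one classification head per level.

```python
import torch
import torch.nn as nn

class HMLSTMSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128,
                 n_coarse=3, n_fine=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # load Word2vec weights here
        self.lstm_l1 = nn.LSTM(emb_dim, hidden, batch_first=True)  # level-1 features
        self.lstm_l2 = nn.LSTM(hidden, hidden, batch_first=True)   # level-2 features
        self.head_coarse = nn.Linear(hidden, n_coarse)  # e.g. Sports/Tech/Entertainment
        self.head_fine = nn.Linear(hidden, n_fine)      # subcategories (hypothetical count)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); classify from the last time step per level
        h1, _ = self.lstm_l1(self.embed(token_ids))
        h2, _ = self.lstm_l2(h1)
        return self.head_coarse(h1[:, -1]), self.head_fine(h2[:, -1])
```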

【3】 Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Authors: Huahuan Zheng, Wenjie Peng, Zhijian Ou, Jinsong Zhang
Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China; School of Information Science, Beijing Language and Culture University, China
Note: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.03007
Abstract: Automatic speech recognition systems have been largely improved in the past few decades and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance the CTC-CRF based ASR technique with explorations on modeling units and neural architectures. Specifically, we investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) Conformer can improve the recognition performance significantly; (ii) wordpiece-based systems perform slightly worse compared with phone-based systems for a target language with a low degree of grapheme-phoneme correspondence (e.g. English), while the two systems can perform equally strong when such degree of correspondence is high for the target language (e.g. German).

Word2Vec|Text|Words (1 paper)

【1】 Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

Authors: Mika Hämäläinen, Niko Partanen, Khalid Alnajjar
Note: Accepted at the 28th Conférence sur le Traitement Automatique des Langues Naturelles (TALN)
Link: https://arxiv.org/abs/2107.03266
Abstract: Texts written in Old Literary Finnish represent the first literary work ever written in Finnish, starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods on such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy in texts written by Agricola and 87.7% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.

Other Neural Networks|Deep Learning|Models|Modeling (1 paper)

【1】 Structured Denoising Diffusion Models in Discrete State-Spaces

Authors: Jacob Austin, Daniel Johnson, Jonathan Ho, Danny Tarlow, Rianne van den Berg
Affiliations: Google Research, Brain Team
Note: 10 pages plus references and appendices. First two authors contributed equally
Link: https://arxiv.org/abs/2107.03006
Abstract: Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
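The role of the transition matrix is easy to see in code. Below is a small NumPy sketch (illustrative betas, not the paper's schedules) of two of the corruption kernels discussed: a uniform-transition matrix and an absorbing-state matrix with a dedicated [MASK] state, applied as q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} Q_t).

```python
import numpy as np

def uniform_transition(K: int, beta: float) -> np.ndarray:
    # Stay with probability 1 - beta, otherwise jump to a uniformly random state.
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_transition(K: int, beta: float, mask_id: int) -> np.ndarray:
    # Each non-mask state moves to [MASK] with probability beta; [MASK] is absorbing.
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_id] += beta
    Q[mask_id] = 0.0
    Q[mask_id, mask_id] = 1.0
    return Q

x = np.eye(5)[2]                              # one-hot current state, K = 5
probs = x @ uniform_transition(5, beta=0.1)   # next-step distribution; rows sum to 1
```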

Other (11 papers)

【1】 DORA: Toward Policy Optimization for Task-oriented Dialogue System with Efficient Context

Authors: Hyunmin Jeon, Gary Geunbae Lee
Affiliations: Computer Science and Engineering, Pohang University of Science and Technology, South Korea; Graduate School of Artificial Intelligence, Pohang University of Science and Technology, South Korea
Note: 23 pages, 9 figures, submitted to the Computer Speech and Language journal
Link: https://arxiv.org/abs/2107.03286
Abstract: Recently, reinforcement learning (RL) has been applied to task-oriented dialogue systems by using latent actions to solve shortcomings of supervised learning (SL). In this paper, we propose a multi-domain task-oriented dialogue system, called Dialogue System with Optimizing a Recurrent Action Policy using Efficient Context (DORA), that uses SL, with subsequently applied RL to optimize dialogue systems using a recurrent dialogue policy. This dialogue policy recurrently generates explicit system actions as both a word-level and high-level policy. As a result, DORA is clearly optimized during both SL and RL steps by using an explicit system action policy that considers an efficient context instead of the entire dialogue history. The system actions are both interpretable and controllable, whereas the latent actions are not. DORA improved the success rate by 6.6 points on MultiWOZ 2.0 and by 10.9 points on MultiWOZ 2.1.

【2】 Linear-time calculation of the expected sum of edge lengths in random projective linearizations of trees

Authors: Lluís Alemany-Puig, Ramon Ferrer-i-Cancho
Affiliations: Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
Link: https://arxiv.org/abs/2107.03277
Abstract: The syntactic structure of a sentence is often represented using syntactic dependency trees. The sum of the distances between syntactically related words has been in the limelight for the past decades. Research on dependency distances led to the formulation of the principle of dependency distance minimization, whereby words in sentences are ordered so as to minimize that sum. Numerous random baselines have been defined to carry out related quantitative studies on languages. The simplest random baseline is the expected value of the sum in unconstrained random permutations of the words in the sentence, namely when all the shufflings of the words of a sentence are allowed and equally likely. Here we focus on a popular baseline: random projective permutations of the words of the sentence, that is, permutations where the syntactic dependency structure is projective, a formal constraint that sentences satisfy often in languages. Thus far, the expectation of the sum of dependency distances in random projective shufflings of a sentence has been estimated approximately with a Monte Carlo procedure whose cost is of the order of Zn, where n is the number of words of the sentence and Z is the number of samples; the larger Z, the lower the error of the estimation but the larger the time cost. Here we present formulae to compute that expectation without error in time of the order of n. Furthermore, we show that star trees maximize it, and devise a dynamic programming algorithm to retrieve the trees that minimize it.
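For contrast with the paper's exact formulae, here is a sketch of the order-Zn Monte Carlo baseline they replace, using one standard way to sample projective linearizations uniformly: at each node, randomly shuffle the head together with the blocks formed by its subtrees.

```python
import random

def random_projective_order(tree, root):
    # tree: {head: [children]}; returns the nodes in a random projective order.
    units = [[root]] + [random_projective_order(tree, c) for c in tree.get(root, [])]
    random.shuffle(units)                       # uniform shuffle at every node
    return [n for unit in units for n in unit]

def expected_sum_mc(tree, root, samples=10_000):
    edges = [(h, c) for h, cs in tree.items() for c in cs]
    total = 0
    for _ in range(samples):
        pos = {n: i for i, n in enumerate(random_projective_order(tree, root))}
        total += sum(abs(pos[h] - pos[c]) for h, c in edges)
    return total / samples                      # error shrinks as samples grow

# e.g. a star tree with head 0 and dependents 1..4 (the maximizing shape):
print(expected_sum_mc({0: [1, 2, 3, 4]}, root=0))
```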

【3】 MedGPT: Medical Concept Prediction from Clinical Narratives

Authors: Zeljko Kraljevic, Anthony Shek, Daniel Bean, Rebecca Bendayan, James Teo, Richard Dobson
Affiliations: King's College London, London, United Kingdom; King's College Hospital NHS Foundation Trust
Note: 6 pages, 2 figures, 3 tables
Link: https://arxiv.org/abs/2107.03134
Abstract: The data available in Electronic Health Records (EHRs) provides the opportunity to transform care, and the best way to provide better care for one patient is through learning from the data available on all other patients. Temporal modelling of a patient's medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder. While most prediction approaches use mostly the structured data in EHRs or a subset of single-domain predictions and outcomes, we present MedGPT, a novel transformer-based pipeline that uses Named Entity Recognition and Linking tools (i.e. MedCAT) to structure and organize the free text portion of EHRs and anticipate a range of future medical events (initially disorders). Since a large portion of EHR data is in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. MedGPT effectively deals with the noise and the added granularity, and achieves a precision of 0.344, 0.552 and 0.640 (vs LSTM 0.329, 0.538 and 0.633) when predicting the top 1, 3 and 5 candidate future disorders on real world hospital data from King's College Hospital, London, UK (~600k patients). We also show that our model captures medical knowledge by testing it on an experimental medical multiple choice question answering task, and by examining the attentional focus of the model using gradient-based saliency methods.

【4】 MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

Authors: Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences, Beijing, China; Beijing University of Technology, Beijing, China
Note: 6 pages. arXiv admin note: text overlap with arXiv:2005.07143 by other authors
Link: https://arxiv.org/abs/2107.03104
Abstract: Most of the recent state-of-the-art results for speaker verification are achieved by X-vector and its subsequent variants. In this paper, we propose a new network architecture which aggregates channel and context interdependence features from multiple aspects based on the Time Delay Neural Network (TDNN). Firstly, we use the SE-Res2Blocks as in ECAPA-TDNN to explicitly model channel interdependence to realize adaptive calibration of channel features, and to process local context features in a multi-scale way at a more granular level compared with conventional TDNN-based methods. Secondly, we explore the use of the encoder structure of the Transformer to model global context interdependence features at the utterance level, which can capture better long-term temporal characteristics. Before the pooling layer, we aggregate the outputs of the SE-Res2Blocks and the Transformer encoder to leverage the complementary channel and context interdependence features each of them learns. Finally, instead of performing a single attentive statistics pooling, we also find it beneficial to extend the pooling method in a multi-head way, which can discriminate features from multiple aspects. The proposed MACCIF-TDNN architecture can outperform most of the state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
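The final pooling stage can be sketched as follows (our simplified reading of "multi-head attentive statistics pooling"; the head count and implementation details are assumptions): each head computes its own attention weights over frames and contributes its own weighted mean and standard deviation.

```python
import torch
import torch.nn as nn

class MultiHeadAttentiveStatsPool(nn.Module):
    def __init__(self, feat_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.Conv1d(feat_dim, heads, kernel_size=1)  # one score track per head
        self.heads = heads

    def forward(self, x):                        # x: (batch, feat_dim, frames)
        w = torch.softmax(self.attn(x), dim=2)   # (batch, heads, frames)
        stats = []
        for h in range(self.heads):
            wh = w[:, h:h + 1]                   # (batch, 1, frames)
            mean = (wh * x).sum(dim=2)           # weighted mean per head
            var = (wh * x ** 2).sum(dim=2) - mean ** 2
            stats += [mean, torch.sqrt(var.clamp_min(1e-8))]  # weighted std per head
        return torch.cat(stats, dim=1)           # (batch, 2 * heads * feat_dim)
```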

【5】 Android Security using NLP Techniques: A Review

Authors: Sevil Sen, Burcu Can
Affiliations: WISE Lab., Dept. of Computer Engineering, Hacettepe University, Ankara, Turkey; Research Institute of Information and Language Processing, University of Wolverhampton, Wolverhampton, UK
Link: https://arxiv.org/abs/2107.03072
Abstract: Android is among the most targeted platforms by attackers. While attackers are improving their techniques, traditional solutions based on static and dynamic analysis have also been evolving. In addition to the application code, Android applications have some metadata that could be useful for security analysis of applications. Unlike traditional application distribution mechanisms, Android applications are distributed centrally in mobile markets. Therefore, besides application packages, such markets contain app information provided by app developers and app users. The availability of such useful textual data, together with the advancement in Natural Language Processing (NLP) used to process and understand textual data, has encouraged researchers to investigate the use of NLP techniques in Android security. In particular, security solutions based on NLP have accelerated in the last 5 years and proven to be useful. This study reviews these proposals and aims to explore possible research directions for future studies by presenting the state of the art in this domain. We mainly focus on NLP-based solutions under four categories: description-to-behaviour fidelity, description generation, privacy, and malware detection.

【6】 EchoEA: Echo Information between Entities and Relations for Entity Alignment

Authors: Xueyuan Lin, Haihong E, Wenyu Song, Haoran Luo
Affiliations: Department of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Link: https://arxiv.org/abs/2107.03054
Abstract: Entity alignment (EA) is to discover entities referring to the same object in the real world from different knowledge graphs (KGs). It plays an important role in automatically integrating KGs from multiple sources. Existing knowledge graph embedding (KGE) methods based on Graph Neural Networks (GNNs) have achieved promising results, which enhance entity representation with relation information unidirectionally. Besides, more and more methods introduce semi-supervision to ask for more labeled training data. However, two challenges still exist in these methods: (1) Insufficient interaction: the interaction between entities and relations is insufficiently utilized. (2) Low-quality bootstrapping: the generated semi-supervised data is of low quality. In this paper, we propose a novel framework, Echo Entity Alignment (EchoEA), which leverages a self-attention mechanism to spread entity information to relations and echo it back to entities. The relation representation is dynamically computed from entity representation. Symmetrically, the next entity representation is dynamically calculated from relation representation, which shows sufficient interaction. Furthermore, we propose an attribute-combined bi-directional global-filtered strategy (ABGS) to improve bootstrapping, reduce false samples and generate high-quality training data. The experimental results on three real-world cross-lingual datasets are stable at around 96% at hits@1 on average, showing that our approach not only significantly outperforms the state-of-the-art methods, but also is universal and transferable for existing KGE methods.

【7】 SinSpell: A Comprehensive Spelling Checker for Sinhala

Authors: Upuli Liyanapathirana, Kaumini Gunasinghe, Gihan Dias
Affiliations: University of Moratuwa
Link: https://arxiv.org/abs/2107.02983
Abstract: We have built SinSpell, a comprehensive spelling checker for the Sinhala language, which is spoken by over 16 million people, mainly in Sri Lanka. However, until recently, Sinhala had no spelling checker with acceptable coverage. SinSpell is still the only open source Sinhala spelling checker. SinSpell identifies possible spelling errors and suggests corrections. It also contains a module which auto-corrects evident errors. To maintain accuracy, SinSpell was designed as a rule-based system based on Hunspell. A set of words was compiled from several sources and verified. These were divided into morphological classes, and the valid roots, suffixes and prefixes for each class were identified, together with lists of irregular words and exceptions. The errors in a corpus of Sinhala documents were analysed, and commonly misspelled words and types of common errors were identified. We found that the most common errors were in vowel length and similar sounding letters. Errors due to incorrect typing and encoding were also found. This analysis was used to develop the suggestion generator and auto-corrector.
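A toy sketch of the identify-and-suggest loop of a dictionary-based checker in the Hunspell tradition (the romanized word list and the single-edit error model are placeholders; SinSpell's actual lexicon, affix rules, and Sinhala-specific confusion sets are far richer).

```python
LEXICON = {"gasa", "gedara"}  # romanized placeholder word list, not SinSpell's lexicon

def edits1(word: str, alphabet: str):
    # All strings one deletion, replacement, or insertion away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + replaces + inserts)

def check(word: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz"):
    if word in LEXICON:
        return None                              # correctly spelled
    return sorted(edits1(word, alphabet) & LEXICON)  # in-lexicon suggestions

print(check("gass"))   # -> ['gasa']
```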

【8】 Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review

Authors: Irene Li, Jessica Pan, Jeremy Goldwasser, Neha Verma, Wai Pan Wong, Muhammed Yavuz Nuzumlalı, Benjamin Rosand, Yixin Li, Matthew Zhang, David Chang, R. Andrew Taylor, Harlan M. Krumholz, Dragomir Radev
Affiliations: Yale University, New Haven, CT
Note: 33 pages, 11 figures
Link: https://arxiv.org/abs/2107.02975
Abstract: Electronic health records (EHRs), digital collections of patient healthcare events and observations, are ubiquitous in medicine and critical to healthcare delivery, operations, and research. Despite this central role, EHRs are notoriously difficult to process automatically. Well over half of the information stored within EHRs is in the form of unstructured text (e.g. provider notes, operation reports) and remains largely untapped for secondary use. Recently, however, newer neural network and deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. In this survey paper, we summarize current neural NLP methods for EHR applications. We focus on a broad scope of tasks, namely, classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, interpretability, etc.

【9】 Answering Chinese Elementary School Social Study Multiple Choice Questions

Authors: Daniel Lee, Chao-Chun Liang, Keh-Yih Su
Affiliations: Dept. of Computer Science Engineering, University of Michigan, Ann Arbor, Michigan, USA; Institute of Information Science, Academia Sinica, Taipei, Taiwan
Note: TAAI-2020
Link: https://arxiv.org/abs/2107.02893
Abstract: We present a novel approach to answer Chinese elementary school Social Study multiple choice questions. Although BERT has demonstrated excellent performance on Reading Comprehension tasks, it is found to be weak at handling some specific types of questions, such as Negation, All-of-the-above, and None-of-the-above. We thus propose a novel framework that cascades BERT with Pre-Processor and Answer-Selector modules to tackle the above challenges. Experimental results show the proposed approach effectively improves the performance of BERT, and thus demonstrate the feasibility of supplementing BERT with additional modules.
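A sketch of the cascade logic (cue strings and module boundaries are hypothetical, not taken from the paper): a pre-processor flags question types BERT mishandles, and the answer selector adjusts the decision, for example inverting the ranking for negation questions.

```python
NEGATION_CUES = ["下列何者錯誤", "不正確"]  # hypothetical surface cues for negation questions

def select_answer(question: str, choices, bert_scores):
    # bert_scores[i] is BERT's support for choices[i] (interface assumed here).
    ranked = sorted(range(len(choices)), key=lambda i: bert_scores[i])
    if any(cue in question for cue in NEGATION_CUES):
        return ranked[0]    # negation: pick the *least* supported choice
    # All-of-the-above / None-of-the-above would get analogous special handling.
    return ranked[-1]       # default: pick the best-supported choice
```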

【10】 Topic Modeling in the Voynich Manuscript

Authors: Rachel Sterneck, Annie Polish, Claire Bowern
Affiliations: Yale University, Department of Computer Science and Department of Linguistics, New Haven, CT
Note: See this https URL for a version that has the Voynich font (and better figure placement), since arXiv does not allow XeLaTeX compilation
Link: https://arxiv.org/abs/2107.02858
Abstract: This article presents the results of investigations using topic modeling of the Voynich Manuscript (Beinecke MS408). Topic modeling is a set of computational methods which are used to identify clusters of subjects within text. We use latent Dirichlet allocation, latent semantic analysis, and non-negative matrix factorization to cluster Voynich pages into 'topics'. We then compare the topics derived from the computational models to clusters derived from the Voynich illustrations and from paleographic analysis. We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations), providing further evidence that the Voynich Manuscript contains meaningful text.
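A sketch of the three topic models on a bag-of-words page matrix, using scikit-learn; the toy page strings below use EVA transliteration words only as placeholders for the actual transcription data.

```python
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

pages = ["daiin shedy qokeedy qokain",   # one string per manuscript page (toy data)
         "chol shol daiin chedy",
         "qokeedy shedy okaiin chol"]
X = CountVectorizer().fit_transform(pages)   # page-by-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # LSA via SVD
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X)
# Each row is a page-by-topic loading; pages can then be grouped by dominant topic
# and compared against illustration- and scribe-based clusters.
```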

【11】 A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Affiliations: Microsoft Corp., USA
Note: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.02852
Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In this paper, we present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings. We develop state-of-the-art SA-ASR systems for both modular and joint approaches by leveraging large-scale training data, including 75 thousand hours of ASR training data and the VoxCeleb corpus for speaker representation learning. We also propose a new pipeline that performs the E2E SA-ASR model after speaker clustering. Our evaluation on the AMI meeting corpus reveals that after fine-tuning with a small amount of real data, the joint system performs 9.2-29.4% better in accuracy compared to the best modular system, while the modular system performs better before such fine-tuning. We also conduct various error analyses to show the remaining issues for monaural SA-ASR.
