
Natural Language Processing Academic Digest [6.28]

Author: arXiv每日学术速递 (WeChat public account) · Published 2021-07-02 17:26:48

Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favorites, and posting features.

cs.CL: 18 papers today.

Machine Translation (1 paper)

【1】 DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders

Authors: Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei
Affiliations: Microsoft Corporation
Note: Work in progress
Link: https://arxiv.org/abs/2106.13736
Abstract: While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these pretrained encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual data, we adopt the span corruption and translation span corruption as the pre-training tasks. Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks, including machine translation, abstractive text summarization, data-to-text, and question generation.
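
As a rough illustration of the span corruption objective mentioned in the abstract, here is a minimal, T5-style sketch on a plain token list. The sentinel naming, masking ratio, and span-length distribution are placeholder assumptions rather than DeltaLM's actual configuration, and the translation variant (applying the same corruption to a concatenated bilingual pair) is not shown.

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span_len=3, seed=0):
    """Build a (source, target) pair with T5-style span corruption.

    Contiguous spans are replaced by sentinel tokens in the source;
    the target lists each sentinel followed by the original span.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = [False] * len(tokens)
    while sum(masked) < n_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            masked[i] = True

    source, target, sentinel_id = [], [], 0
    i = 0
    while i < len(tokens):
        if masked[i]:
            sentinel = f"<extra_id_{sentinel_id}>"
            sentinel_id += 1
            source.append(sentinel)
            target.append(sentinel)
            while i < len(tokens) and masked[i]:
                target.append(tokens[i])
                i += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

src, tgt = span_corrupt("the encoder is augmented with a decoder and pretrained".split())
print(src)  # source with masked spans replaced by sentinels
print(tgt)  # target: each sentinel followed by the tokens it hides
```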

Detection (1 paper)

【1】 Multimodal Emergent Fake News Detection via Meta Neural Process Networks

Authors: Yaqing Wang, Fenglong Ma, Haoyu Wang, Kishlay Jha, Jing Gao
Affiliations: Purdue University, West Lafayette, Indiana, USA; Pennsylvania State University, Pennsylvania, USA; University of Virginia, Charlottesville, Virginia, USA
Note: Accepted by KDD 2021
Link: https://arxiv.org/abs/2106.13711
Abstract: Fake news travels at unprecedented speeds, reaches global audiences and puts users and communities at great risk via social media platforms. Deep learning based models show good performance when trained on large amounts of labeled data on events of interest, whereas the performance of models tends to degrade on other events due to domain shift. Therefore, significant challenges are posed for existing detection approaches to detect fake news on emergent events, where large-scale labeled datasets are difficult to obtain. Moreover, adding the knowledge from newly emergent events requires to build a new model from scratch or continue to fine-tune the model, which can be challenging, expensive, and unrealistic for real-world settings. In order to address those challenges, we propose an end-to-end fake news detection framework named MetaFEND, which is able to learn quickly to detect fake news on emergent events with a few verified posts. Specifically, the proposed model integrates meta-learning and neural process methods together to enjoy the benefits of these approaches. In particular, a label embedding module and a hard attention mechanism are proposed to enhance the effectiveness by handling categorical information and trimming irrelevant posts. Extensive experiments are conducted on multimedia datasets collected from Twitter and Weibo. The experimental results show our proposed MetaFEND model can detect fake news on never-seen events effectively and outperform the state-of-the-art methods.
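
The abstract only names the hard attention mechanism for trimming irrelevant posts; the snippet below is an illustrative top-k selection over support posts, assuming post vectors are already computed, and is not the paper's exact formulation.

```python
import numpy as np

def hard_attention_select(query, support, k=2):
    """Keep only the k support posts most relevant to the query post.

    query:   (d,) vector for the post to classify
    support: (n, d) matrix of verified posts from the same event
    Returns the kept indices and renormalized attention weights.
    """
    scores = support @ query                      # dot-product relevance
    keep = np.argsort(scores)[-k:]                # hard top-k selection, rest are trimmed
    weights = np.exp(scores[keep] - scores[keep].max())
    return keep, weights / weights.sum()

rng = np.random.default_rng(0)
q, S = rng.normal(size=8), rng.normal(size=(5, 8))
idx, w = hard_attention_select(q, S, k=2)
print(idx, w)
```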

Recognition/Classification (1 paper)

【1】 byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings

Authors: Xiang Zhang, Alexandre Drouin, Raymond Li
Affiliations: Element AI, ServiceNow, Montreal, Quebec, Canada
Link: https://arxiv.org/abs/2106.13302
Abstract: This article introduces byteSteady -- a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced using the averaged embedding vectors of byte-level n-grams, with a pre-defined set of n. The hashing trick is used to reduce the number of embedding vectors. This input representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification. For both problems we achieved competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.
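
The pipeline (byte-level n-grams, hashing trick, averaged embeddings, linear classifier) is concrete enough to sketch end to end. The hash function (CRC32), the set of n values, bucket count, and random weights below are illustrative stand-ins rather than the paper's settings.

```python
import zlib
import numpy as np

class ByteSteadySketch:
    """Minimal sketch: hashed byte-level n-gram embeddings, averaged into
    a single vector and fed to a linear classifier."""

    def __init__(self, n_values=(1, 2, 4), n_buckets=2**16, dim=64, n_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.n_values = n_values
        self.n_buckets = n_buckets
        self.emb = rng.normal(0.0, 0.1, size=(n_buckets, dim))  # hashed embedding table
        self.W = rng.normal(0.0, 0.1, size=(dim, n_classes))    # linear classifier
        self.b = np.zeros(n_classes)

    def represent(self, data: bytes) -> np.ndarray:
        grams = [data[i:i + n] for n in self.n_values
                 for i in range(len(data) - n + 1)]
        ids = [zlib.crc32(g) % self.n_buckets for g in grams]   # hashing trick
        return self.emb[ids].mean(axis=0)                       # averaged n-gram embeddings

    def logits(self, data: bytes) -> np.ndarray:
        return self.represent(data) @ self.W + self.b

model = ByteSteadySketch()
print(model.logits("ACGTTGCA".encode()))   # works for text bytes or DNA strings alike
```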

Zero/Few/One-Shot | Transfer | Adaptation (2 papers)

【1】 Privileged Zero-Shot AutoML

Authors: Nikhil Singh, Brandon Kates, Jeff Mentch, Anant Kharkar, Madeleine Udell, Iddo Drori
Affiliations: Cornell University, Harvard University, Columbia University
Note: 16 pages, 4 figures
Link: https://arxiv.org/abs/2106.13743
Abstract: This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions while significantly decreasing computation time from minutes to milliseconds by using a zero-shot approach. Given a new dataset and a well-defined machine learning task, humans begin by reading a description of the dataset and documentation for the algorithms to be used. This work is the first to use these textual descriptions, which we call privileged information, for AutoML. We use a pre-trained Transformer model to process the privileged text and demonstrate that using this information improves AutoML performance. Thus, our approach leverages the progress of unsupervised representation learning in natural language processing to provide a significant boost to AutoML. We demonstrate that using only textual descriptions of the data and functions achieves reasonable classification performance, and adding textual descriptions to data meta-features improves classification across tabular datasets. To achieve zero-shot AutoML we train a graph neural network with these description embeddings and the data meta-features. Each node represents a training dataset, which we use to predict the best machine learning pipeline for a new test dataset in a zero-shot fashion. Our zero-shot approach rapidly predicts a high-quality pipeline for a supervised learning task and dataset. In contrast, most AutoML systems require tens or hundreds of pipeline evaluations. We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets. By speeding up AutoML by orders of magnitude this work demonstrates real-time AutoML.
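
A heavily simplified stand-in for the idea of zero-shot pipeline selection from privileged text: the paper trains a graph neural network over dataset nodes, whereas this sketch only does nearest-neighbor matching on description embeddings plus meta-features. The embedding function, datasets, and pipeline names are invented for illustration; in practice the embeddings would come from a pretrained Transformer.

```python
import numpy as np

def embed_description(text: str, dim: int = 16) -> np.ndarray:
    """Placeholder for a pretrained Transformer sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def featurize(desc: str, meta: np.ndarray) -> np.ndarray:
    # Privileged text embedding concatenated with data meta-features (n_rows, n_columns).
    return np.concatenate([embed_description(desc), np.log1p(meta)])

# Known training datasets and the pipeline that worked best on each (toy examples).
train = [
    ("tabular credit-risk data with categorical features", np.array([1_000.0, 20.0]), "gradient_boosting"),
    ("dense numeric image features, very many rows", np.array([100_000.0, 784.0]), "mlp"),
]

def recommend(desc: str, meta: np.ndarray) -> str:
    """Zero-shot pipeline choice by cosine similarity to known datasets."""
    q = featurize(desc, meta)
    def sim(t):
        k = featurize(t[0], t[1])
        return float(q @ k / (np.linalg.norm(q) * np.linalg.norm(k)))
    return max(train, key=sim)[2]

print(recommend("small tabular loan dataset with categories", np.array([2_000.0, 15.0])))
```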

【2】 Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models

Authors: Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, Sebastian Riedel
Affiliations: UC Irvine, University of Edinburgh, UC Berkeley, Facebook AI Research, University College London
Link: https://arxiv.org/abs/2106.13353
Abstract: Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
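
The 0.1%-of-parameters figure corresponds to updating bias terms only. Below is a minimal PyTorch sketch of that idea on a stand-in encoder; in the paper it is applied to a pretrained masked LM queried with null prompts (just the input text plus a mask token, no template), which this sketch does not reproduce.

```python
import torch
from torch import nn

def freeze_all_but_bias(model: nn.Module):
    """Bias-only tuning: every parameter except bias terms is frozen."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    return [p for p in model.parameters() if p.requires_grad]

# Stand-in for a pretrained LM; in practice this would be loaded from a checkpoint.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
trainable = freeze_all_but_bias(model)

total = sum(p.numel() for p in model.parameters())
tuned = sum(p.numel() for p in trainable)
# Much larger fraction here than the ~0.1% reported for large pretrained LMs.
print(f"tuning {tuned / total:.2%} of parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # only bias terms are updated
```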

Corpora (1 paper)

【1】 Manually Annotated Spelling Error Corpus for Amharic

Authors: Andargachew Mekonnen Gezmu, Tirufat Tesifaye Lema, Binyam Ephrem Seyoum, Andreas Nürnberger
Note: Accepted to 2nd AfricaNLP Workshop at EACL 2021
Link: https://arxiv.org/abs/2106.13521
Abstract: This paper presents a manually annotated spelling error corpus for Amharic, lingua franca in Ethiopia. The corpus is designed to be used for the evaluation of spelling error detection and correction. The misspellings are tagged as non-word and real-word errors. In addition, the contextual information available in the corpus makes it useful in dealing with both types of spelling errors.

Representation (1 paper)

【1】 Exploring the Representation of Word Meanings in Context: A Case Study on Homonymy and Synonymy

Authors: Marcos Garcia
Affiliations: CiTIUS – Research Center in Intelligent Technologies, Universidade de Santiago de Compostela, Galiza
Link: https://arxiv.org/abs/2106.13553
Abstract: This paper presents a multilingual study of word meaning representations in context. We assess the ability of both static and contextualized models to adequately represent different lexical-semantic relations, such as homonymy and synonymy. To do so, we created a new multilingual dataset that allows us to perform a controlled evaluation of several factors such as the impact of the surrounding context or the overlap between words, conveying the same or different senses. A systematic assessment on four scenarios shows that the best monolingual models based on Transformers can adequately disambiguate homonyms in context. However, as they rely heavily on context, these models fail at representing words with different senses when occurring in similar sentences. Experiments are performed in Galician, Portuguese, English, and Spanish, and both the dataset (with more than 3,000 evaluation items) and new models are freely released with this study.

Word2Vec | Text | Words (2 papers)

【1】 Sentiment Progression based Searching and Indexing of Literary Textual Artefacts

Authors: Hrishikesh Kulkarni, Bradly Alicea
Affiliations: SPPU Pune University, Pune, India
Note: 12 pages, 2 figures, accepted at NLDB 2021
Link: https://arxiv.org/abs/2106.13767
Abstract: Literary artefacts are generally indexed and searched based on titles, meta data and keywords over the years. This searching and indexing works well when user/reader already knows about that particular creative textual artefact or document. This indexing and search hardly takes into account interest and emotional makeup of readers and its mapping to books. When a person is looking for a literary textual artefact, he/she might be looking for not only information but also to seek the joy of reading. In case of literary artefacts, progression of emotions across the key events could prove to be the key for indexing and searching. In this paper, we establish clusters among literary artefacts based on computational relationships among sentiment progressions using intelligent text analysis. We have created a database of 1076 English titles + 20 Marathi titles and also used database http://www.cs.cmu.edu/~dbamman/booksummaries.html with 16559 titles and their summaries. We have proposed Sentiment Progression based Indexing for searching and recommending books. This can be used to create personalized clusters of book titles of interest to readers. The analysis clearly suggests better searching and indexing when we are targeting book lovers looking for a particular type of book or creative artefact. This indexing and searching can find many real-life applications for recommending books.
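
An illustrative sketch of the underlying idea: score sentiment over consecutive segments of a text to obtain a progression curve, then compare titles by the distance between curves. The toy lexicon and Euclidean distance are stand-ins for the paper's text analysis, not its actual method.

```python
import numpy as np

POSITIVE = {"joy", "love", "hope", "victory", "happy"}
NEGATIVE = {"loss", "fear", "death", "betrayal", "sad"}

def progression(text: str, n_segments: int = 5) -> np.ndarray:
    """Sentiment score per consecutive segment of the text (the 'arc')."""
    words = text.lower().split()
    segments = np.array_split(words, n_segments)
    def score(seg):
        if len(seg) == 0:
            return 0.0
        return (sum(w in POSITIVE for w in seg) - sum(w in NEGATIVE for w in seg)) / len(seg)
    return np.array([score(seg) for seg in segments])

def curve_distance(a: str, b: str) -> float:
    """Books with similar emotional arcs get a small distance and cluster together."""
    return float(np.linalg.norm(progression(a) - progression(b)))

book_a = "hope and love at the start then betrayal and loss then victory and joy"
book_b = "happy beginnings then fear and death then hope returns with joy"
print(curve_distance(book_a, book_b))
```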

【2】 ParaLaw Nets -- Cross-lingual Sentence-level Pretraining for Legal Text Processing

Authors: Ha-Thanh Nguyen, Vu Tran, Phuong Minh Nguyen, Thi-Hai-Yen Vuong, Quan Minh Bui, Chau Minh Nguyen, Binh Tran Dang, Minh Le Nguyen, Ken Satoh
Affiliations: Japan Advanced Institute of Science and Technology, Ishikawa, Japan; University of Engineering and Technology, VNU, Hanoi, Vietnam; National Institute of Informatics, Tokyo, Japan
Note: Also published in COLIEE 2021's proceedings
Link: https://arxiv.org/abs/2106.13403
Abstract: Ambiguity is a characteristic of natural language, which makes expression ideas flexible. However, in a domain that requires accurate statements, it becomes a barrier. Specifically, a single word can have many meanings and multiple words can have the same meaning. When translating a text into a foreign language, the translator needs to determine the exact meaning of each element in the original sentence to produce the correct translation sentence. From that observation, in this paper, we propose ParaLaw Nets, a pretrained model family using sentence-level cross-lingual information to reduce ambiguity and increase the performance in legal text processing. This approach achieved the best result in the Question Answering task of COLIEE-2021.

Other Neural Networks | Deep Learning | Models | Modeling (5 papers)

【1】 Learning to Sample Replacements for ELECTRA Pre-Training

Authors: Yaru Hao, Li Dong, Hangbo Bao, Ke Xu, Furu Wei
Affiliations: Beihang University; Microsoft Research
Note: Accepted by Findings of ACL 2021
Link: https://arxiv.org/abs/2106.13715
Abstract: ELECTRA pretrains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling. Despite the compelling performance, ELECTRA suffers from the following two issues. First, there is no direct feedback loop from discriminator to generator, which renders replacement sampling inefficient. Second, the generator's prediction tends to be over-confident along with training, making replacements biased to correct tokens. In this paper, we propose two methods to improve replacement sampling for ELECTRA pre-training. Specifically, we augment sampling with a hardness prediction mechanism, so that the generator can encourage the discriminator to learn what it has not acquired. We also prove that efficient sampling reduces the training variance of the discriminator. Moreover, we propose to use a focal loss for the generator in order to relieve oversampling of correct tokens as replacements. Experimental results show that our method improves ELECTRA pre-training on various downstream tasks.
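
Of the two proposed changes, the focal loss on the generator is the easier one to write down. This is a generic focal-loss sketch over a masked-LM vocabulary with a placeholder gamma, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss over the vocabulary for the MLM generator.

    Down-weights tokens the generator already predicts confidently,
    which counteracts oversampling of the correct token as the replacement.
    logits:  (batch, vocab) generator scores at masked positions
    targets: (batch,) original token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t of the true token
    pt = log_pt.exp()
    return -((1.0 - pt) ** gamma * log_pt).mean()

logits = torch.randn(4, 30522)            # e.g. a BERT-sized vocabulary
targets = torch.randint(0, 30522, (4,))
print(focal_loss(logits, targets))
```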

【2】 Language Models are Good Translators

Authors: Shuo Wang, Zhaopeng Tu, Zhixing Tan, Wenxuan Wang, Maosong Sun, Yang Liu
Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua University; Beijing Academy of Artificial Intelligence; Institute for AIR, Tsinghua University; Tencent AI Lab
Note: 12 pages. Work in progress. An earlier version of this manuscript is under review
Link: https://arxiv.org/abs/2106.13627
Abstract: Recent years have witnessed the rapid advance in neural machine translation (NMT), the core of which lies in the encoder-decoder architecture. Inspired by the recent progress of large-scale pre-trained language models on machine translation in a limited scenario, we firstly demonstrate that a single language model (LM4MT) can achieve comparable performance with strong encoder-decoder NMT models on standard machine translation benchmarks, using the same training data and similar amount of model parameters. LM4MT can also easily utilize source-side texts as additional supervision. Though modeling the source- and target-language texts with the same mechanism, LM4MT can provide unified representations for both source and target sentences, which can better transfer knowledge across languages. Extensive experiments on pivot-based and zero-shot translation tasks show that LM4MT can outperform the encoder-decoder NMT model by a large margin.
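
The abstract does not spell out the training recipe. One common way to realize "a single language model for translation", shown here purely as an assumption rather than the paper's exact method, is to pack source and target into one sequence and compute the LM loss only on target tokens.

```python
import torch
import torch.nn.functional as F

def pack_example(src_ids, tgt_ids, sep_id):
    """One decoder-only training example: [src] <sep> [tgt].

    Loss is masked on the source side so the model is trained to
    predict target-language tokens given the source prefix.
    """
    input_ids = torch.cat([src_ids, torch.tensor([sep_id]), tgt_ids])
    labels = input_ids.clone()
    labels[: len(src_ids) + 1] = -100          # ignore source tokens and the separator
    return input_ids, labels

def lm_loss(logits, labels):
    # Shift so that position t predicts token t+1, as in a causal LM.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

vocab = 1000
src = torch.randint(1, vocab, (7,))
tgt = torch.randint(1, vocab, (9,))
input_ids, labels = pack_example(src, tgt, sep_id=0)
logits = torch.randn(len(input_ids), vocab)    # stand-in for the LM's output
print(lm_loss(logits, labels))
```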

【3】 Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Authors: Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, Furu Wei
Affiliations: Shandong University, Jinan, China; Microsoft Research, Beijing, China
Link: https://arxiv.org/abs/2106.13474
Abstract: Large pre-trained models have achieved great success in many natural language processing tasks. However, when they are applied in specific domains, these models suffer from domain shift and bring challenges in fine-tuning and online serving for latency and capacity constraints. In this paper, we present a general approach to developing small, fast and effective pre-trained models for specific domains. This is achieved by adapting the off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains. Specifically, we propose domain-specific vocabulary expansion in the adaptation stage and employ corpus level occurrence probability to choose the size of incremental vocabulary automatically. Then we systematically explore different strategies to compress the large pre-trained models for specific domains. We conduct our experiments in the biomedical and computer science domain. The experimental results demonstrate that our approach achieves better performance over the BERT BASE model in domain-specific tasks while 3.3x smaller and 5.1x faster than BERT BASE. The code and pre-trained models are available at https://aka.ms/adalm.
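
The vocabulary-expansion step chooses the size of the incremental vocabulary from corpus-level occurrence probabilities. The sketch below grows the vocabulary with the most frequent out-of-vocabulary domain words until a chosen coverage of their occurrences is reached; the whitespace tokenization and the threshold are illustrative assumptions, not the paper's algorithm.

```python
from collections import Counter

def expand_vocabulary(domain_corpus, base_vocab, coverage=0.95):
    """Add the most frequent out-of-vocabulary domain words until the
    retained words account for `coverage` of all OOV occurrences."""
    counts = Counter(w for doc in domain_corpus for w in doc.lower().split()
                     if w not in base_vocab)
    total = sum(counts.values())
    added, covered = [], 0
    for word, freq in counts.most_common():
        if total and covered / total >= coverage:
            break
        added.append(word)     # incremental vocabulary entry
        covered += freq
    return added

corpus = ["the braf v600e mutation drives melanoma",
          "braf inhibitors target the mutation in melanoma cells"]
base_vocab = {"the", "in", "drives", "target", "cells"}
print(expand_vocabulary(corpus, base_vocab, coverage=0.8))
```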

【4】 JNLP Team: Deep Learning Approaches for Legal Processing Tasks in COLIEE 2021

Authors: Ha-Thanh Nguyen, Phuong Minh Nguyen, Thi-Hai-Yen Vuong, Quan Minh Bui, Chau Minh Nguyen, Binh Tran Dang, Vu Tran, Minh Le Nguyen, Ken Satoh
Affiliations: Japan Advanced Institute of Science and Technology, Ishikawa, Japan; University of Engineering and Technology, VNU, Hanoi, Vietnam; National Institute of Informatics, Tokyo, Japan
Note: Also published in COLIEE 2021's proceedings
Link: https://arxiv.org/abs/2106.13405
Abstract: COLIEE is an annual competition in automatic computerized legal text processing. Automatic legal document processing is an ambitious goal, and the structure and semantics of the law are often far more complex than everyday language. In this article, we survey and report our methods and experimental results in using deep learning in legal document processing. The results show the difficulties as well as potentials in this family of approaches.

【5】 VOGUE: Answer Verbalization through Multi-Task Learning

Authors: Endri Kacupaj, Shyamnath Premnadh, Kuldeep Singh, Jens Lehmann, Maria Maleshkova
Affiliations: University of Bonn; Zerotha Research; Cerence GmbH; Fraunhofer IAIS
Note: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2021
Link: https://arxiv.org/abs/2106.13316
Abstract: In recent years, there have been significant developments in Question Answering over Knowledge Graphs (KGQA). Despite all the notable advancements, current KGQA systems only focus on answer generation techniques and not on answer verbalization. However, in real-world scenarios (e.g., voice assistants such as Alexa, Siri, etc.), users prefer verbalized answers instead of a generated response. This paper addresses the task of answer verbalization for (complex) question answering over knowledge graphs. In this context, we propose a multi-task-based answer verbalization framework: VOGUE (Verbalization thrOuGh mUlti-task lEarning). The VOGUE framework attempts to generate a verbalized answer using a hybrid approach through a multi-task learning paradigm. Our framework can generate results based on using questions and queries as inputs concurrently. VOGUE comprises four modules that are trained simultaneously through multi-task learning. We evaluate our framework on existing datasets for answer verbalization, and it outperforms all current baselines on both BLEU and METEOR scores.

Other (4 papers)

【1】 Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Authors: Hieu-Thi Luong, Junichi Yamagishi
Affiliations: National Institute of Informatics, Tokyo, Japan
Note: To be presented at SSW11
Link: https://arxiv.org/abs/2106.13479
Abstract: Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representation, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding that takes on different properties while having a similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations, but has a discrete latent space that is useful for reducing the representation bit-rate, which is desirable for data transferring, or limiting the information leaking, which is important for speaker anonymization and other tasks of that nature.
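
The central operation, snapping a continuous linguistic embedding to its nearest codebook entry, can be sketched as follows. Codebook size and dimensionality are placeholders, and training-time details such as straight-through gradients or commitment losses (and the paper's specific latent-space policies) are omitted.

```python
import numpy as np

class VectorQuantizer:
    """Nearest-neighbour lookup into a codebook (inference-time view of VQ)."""

    def __init__(self, codebook_size=64, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(codebook_size, dim))

    def quantize(self, z):
        """Map each continuous latent vector to its closest code."""
        # (T, dim) latents vs (K, dim) codes -> (T, K) squared distances
        d = ((z[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        indices = d.argmin(axis=1)
        return indices, self.codebook[indices]

vq = VectorQuantizer()
z = np.random.default_rng(1).normal(size=(5, 16))   # stand-in for per-frame linguistic latents
idx, zq = vq.quantize(z)
print(idx)   # discrete codes: low bit-rate and limited information leakage by construction
```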

【2】 Fine-grained Geolocation Prediction of Tweets with Human Machine Collaboration

Authors: Florina Dutt, Subhajit Das
Affiliations: Georgia Institute of Technology, Atlanta, GA, USA
Note: 7 pages
Link: https://arxiv.org/abs/2106.13411
Abstract: Twitter is a useful resource to analyze peoples' opinions on various topics. Often these topics are correlated or associated with locations from where these Tweet posts are made. For example, restaurant owners may need to know where their target customers eat with respect to the sentiment of the posts made related to food, policy planners may need to analyze citizens' opinion on relevant issues such as crime, safety, congestion, etc. with respect to specific parts of the city, or county or state. As promising as this is, less than 1% of the crawled Tweet posts come with geolocation tags. That makes accurate prediction of Tweet posts for the non geo-tagged tweets very critical to analyze data in various domains. In this research, we utilized millions of Twitter posts and end-users domain expertise to build a set of deep neural network models using natural language processing (NLP) techniques, that predicts the geolocation of non geo-tagged Tweet posts at various level of granularities such as neighborhood, zipcode, and longitude with latitudes. With multiple neural architecture experiments, and a collaborative human-machine workflow design, our ongoing work on geolocation detection shows promising results that empower end-users to correlate relationship between variables of choice with the location information.

【3】 A Source-Criticism Debiasing Method for GloVe Embeddings

Authors: Hope McGovern
Affiliations: University of Cambridge
Link: https://arxiv.org/abs/2106.13382
Abstract: It is well-documented that word embeddings trained on large public corpora consistently exhibit known human social biases. Although many methods for debiasing exist, almost all fixate on completely eliminating biased information from the embeddings and often diminish training set size in the process. In this paper, we present a simple yet effective method for debiasing GloVe word embeddings (Pennington et al., 2014) which works by incorporating explicit information about training set bias rather than removing biased data outright. Our method runs quickly and efficiently with the help of a fast bias gradient approximation method from Brunet et al. (2019). As our approach is akin to the notion of 'source criticism' in the humanities, we term our method Source-Critical GloVe (SC-GloVe). We show that SC-GloVe reduces the effect size on Word Embedding Association Test (WEAT) sets without sacrificing training data or TOP-1 performance.
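
The reported metric, the WEAT effect size (Caliskan et al., 2017), has a standard definition that can be computed directly from word vectors, as sketched below with placeholder vectors. The SC-GloVe debiasing step itself is not reproduced here.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to attribute set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size for target word sets X, Y and attribute word sets A, B."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled = np.std(s_x + s_y, ddof=1)
    return (np.mean(s_x) - np.mean(s_y)) / pooled

rng = np.random.default_rng(0)
emb = lambda: rng.normal(size=50)                  # stand-in for GloVe vectors
X, Y = [emb() for _ in range(4)], [emb() for _ in range(4)]
A, B = [emb() for _ in range(4)], [emb() for _ in range(4)]
print(weat_effect_size(X, Y, A, B))                # smaller |value| means less measured bias
```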

【4】 Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Authors: Yu Wang, Jinchao Li, Tristan Naumann, Chenyan Xiong, Hao Cheng, Robert Tinn, Cliff Wong, Naoto Usuyama, Richard Rogahn, Zhihong Shen, Yang Qin, Eric Horvitz, Paul N. Bennett, Jianfeng Gao, Hoifung Poon
Affiliations: Microsoft Research, Redmond, WA
Link: https://arxiv.org/abs/2106.13375
Abstract: Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
