Natural Language Processing Academic Digest [6.17]

WeChat official account: arXiv Daily Academic Digest

Published 2021-07-02 19:01:11

Included in the column: arXiv Daily Academic Digest

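Visit www.arxivdaily.com for the digest with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and posting features.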

cs.CL: 34 papers today

Transformer (3 papers)

【1】 Grounding Spatio-Temporal Language with Transformers

Authors: Tristan Karch, Laetitia Teodorescu, Katja Hofmann, Clément Moulin-Frier, Pierre-Yves Oudeyer
Comments: Contains main article and supplementaries
Link: https://arxiv.org/abs/2106.08858
Abstract: Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under open-source license as well as pretrained models and datasets to encourage the wider community to build upon and extend our work in the future.

【2】 Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems

Authors: Hsien-chin Lin, Nurul Lubis, Songbo Hu, Carel van Niekerk, Christian Geishauser, Michael Heck, Shutong Feng, Milica Gašić
Link: https://arxiv.org/abs/2106.08838
Abstract: Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art data-driven user simulators, on the other hand, are still domain-dependent. This means that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of our TUS is not tied to a specific domain, enabling domain generalisation and learning of cross-domain user behaviour from data. We compare TUS with the state of the art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalise to unseen domains in a zero-shot fashion.

【3】 What Context Features Can Transformer Language Models Use?

Authors: Joe O'Connor, Jacob Andreas
Comments: 14 pages, 7 figures, to be published at ACL 2021
Link: https://arxiv.org/abs/2106.08367
Abstract: Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations -- including shuffling word order within sentences and deleting all words other than nouns -- remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.
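
The two context manipulations named in this abstract (shuffling word order within sentences, and deleting all words other than nouns) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the whitespace tokenization, and the hand-written noun set standing in for a part-of-speech tagger are all assumptions made for illustration.

```python
import random

def shuffle_within_sentences(sentences, seed=0):
    """Shuffle word order inside each sentence; sentence order is unchanged."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        words = sentence.split()  # assumption: whitespace tokenization
        rng.shuffle(words)
        shuffled.append(" ".join(words))
    return shuffled

def keep_only_nouns(sentences, nouns):
    """Delete every word that is not in the given (hypothetical) noun set."""
    return [
        " ".join(w for w in sentence.split() if w.lower() in nouns)
        for sentence in sentences
    ]

context = ["the cat sat on the mat", "a dog chased the ball"]
NOUNS = {"cat", "mat", "dog", "ball"}  # stand-in for a real POS tagger

print(shuffle_within_sentences(context))
print(keep_only_nouns(context, NOUNS))
```

In an experiment of this kind, the ablated context would be fed to the language model in place of the original, and the change in predictive loss on the following tokens would be compared across ablations.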

BERT (1 paper)

【1】 RefBERT: Compressing BERT by Referencing to Pre-computed Representations

Authors: Xinyi Wang, Haiqin Yang, Liang Zhao, Yang Mo, Jianping Shen
Comments: 8 pages, 1 figure, 3 tables, in IJCNN'21
Link: https://arxiv.org/abs/2106.08898
Abstract: Recently developed large pre-trained language models, e.g., BERT, have achieved remarkable performance in many downstream natural language processing applications. These pre-trained language models often contain hundreds of millions of parameters and suffer from high computation and latency in real-world applications. It is desirable to reduce the computation overhead of the models for fast training and inference while keeping the model performance in downstream applications. Several lines of work utilize knowledge distillation to compress the teacher model to a smaller student model. However, they usually discard the teacher's knowledge at inference time. Differently, in this paper, we propose RefBERT to leverage the knowledge learned from the teacher, i.e., facilitating the pre-computed BERT representation on the reference sample and compressing BERT into a smaller student model. To guarantee our proposal, we provide theoretical justification on the loss function and the usage of reference samples. Significantly, the theoretical result shows that including the pre-computed teacher's representations on the reference samples indeed increases the mutual information in learning the student model. Finally, we conduct the empirical evaluation and show that our RefBERT can beat the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT$_{\rm BASE}$ on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT$_{\rm BASE}$.

Machine Translation (4 papers)

【1】 Revisiting the Weaknesses of Reinforcement Learning for Neural Machine Translation

Authors: Samuel Kiegeland, Julia Kreutzer
Link: https://arxiv.org/abs/2106.08942
Abstract: Policy gradient algorithms have found wide adoption in NLP, but have recently become subject to criticism, doubting their suitability for NMT. Choshen et al. (2020) identify multiple weaknesses and suspect that their success is determined by the shape of output distributions rather than the reward. In this paper, we revisit these claims and study them under a wider range of configurations. Our experiments on in-domain and cross-domain adaptation reveal the importance of exploration and reward scaling, and provide empirical counter-evidence to these claims.

【2】 Evaluating Gender Bias in Hindi-English Machine Translation

Authors: Gauri Gupta, Krithika Ramesh, Sanjay Singh
Link: https://arxiv.org/abs/2106.08680
Abstract: With language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence, based on the gender of the subject. Additionally, there is sparse work done in the realm of measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.

【3】 Alternated Training with Synthetic and Authentic Data for Neural Machine Translation

Authors: Rui Jiao, Zonghan Yang, Maosong Sun, Yang Liu
Comments: ACL 2021, Short Findings
Link: https://arxiv.org/abs/2106.08582
Abstract: While synthetic bilingual corpora have demonstrated their effectiveness in low-resource neural machine translation (NMT), adding more synthetic data often deteriorates translation performance. In this work, we propose alternated training with synthetic and authentic data for NMT. The basic idea is to alternate synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines. We visualize the BLEU landscape to further investigate the role of authentic and synthetic data during alternated training. From the visualization, we find that authentic data helps to direct the NMT model parameters towards points with higher BLEU scores and leads to consistent translation performance improvement.
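
The basic idea stated in this abstract, alternating synthetic and authentic corpora iteratively during training, can be sketched as a simple schedule. This is an illustrative sketch only: the abstract does not give step counts, stopping criteria, or how batches are drawn, so `alternated_schedule`, `train_alternated`, and their parameters are hypothetical.

```python
def alternated_schedule(num_rounds, synthetic_steps, authentic_steps):
    """Yield (round, corpus_name, step) tuples, alternating the two corpora.

    Assumption: each round runs a fixed number of steps on synthetic data,
    then a fixed number on authentic data. The paper may switch differently.
    """
    for rnd in range(num_rounds):
        for step in range(synthetic_steps):
            yield rnd, "synthetic", step
        for step in range(authentic_steps):
            yield rnd, "authentic", step

def train_alternated(update_fn, schedule, batches):
    """Call the (hypothetical) update function once per scheduled step."""
    for _rnd, corpus, step in schedule:
        corpus_batches = batches[corpus]
        batch = corpus_batches[step % len(corpus_batches)]
        update_fn(batch, corpus)

# Toy usage: record which corpus each update saw instead of training a model.
log = []
train_alternated(
    lambda batch, corpus: log.append(corpus),
    alternated_schedule(num_rounds=2, synthetic_steps=2, authentic_steps=1),
    {"synthetic": ["s0", "s1"], "authentic": ["a0"]},
)
print(log)
```

Ending every round on authentic data reflects the abstract's description of authentic data as guidance against noisy synthetic data; the exact ordering within a round is a design choice of this sketch.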

【4】 Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Authors: Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran
Comments: Accepted to the 2021 NLP4Prog Workshop co-located with The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
Link: https://arxiv.org/abs/2106.08415
Abstract: Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts.
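
Of the reference-based metrics named in this abstract, ROUGE-L is the simplest to show in full: it scores a generated comment by its longest common subsequence (LCS) with the reference. The sketch below is a generic sentence-level ROUGE-L, not the evaluation script used in the paper; whitespace tokenization and the default `beta=1.0` (a plain F1) are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score from LCS precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

generated = "returns the sum of two numbers"
ground_truth = "return the sum of two integers"
print(round(rouge_l(generated, ground_truth), 4))
```

Here the two comments share the subsequence "the sum of two" (4 of 6 tokens on each side), so precision and recall are both 4/6. A score like this cannot tell a harmless wording difference from a factual error in the comment, and that is the gap the paper's qualitative error analysis is meant to address.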

Semantic Analysis (3 papers)

【1】 PRASEMap: A Probabilistic Reasoning and Semantic Embedding based Knowledge Graph Alignment System

Authors: Zhiyuan Qi, Ziheng Zhang, Jiaoyan Chen, Xi Chen, Yefeng Zheng
Link: https://arxiv.org/abs/2106.08801
Abstract: Knowledge Graph (KG) alignment aims at finding equivalent entities and relations (i.e., mappings) between two KGs. The existing approaches utilize either reasoning-based or semantic embedding-based techniques, but few studies explore their combination. In this demonstration, we present PRASEMap, an unsupervised KG alignment system that iteratively computes the Mappings with both Probabilistic Reasoning (PR) And Semantic Embedding (SE) techniques. PRASEMap can support various embedding-based KG alignment approaches as the SE module, and enables easy human computer interaction that additionally provides an option for users to feed the mapping annotations back to the system for better results. The demonstration showcases these features via a stand-alone Web application with user friendly interfaces.

【2】 Semantic sentence similarity: size does not always matter

Authors: Danny Merkx, Stefan L. Frank, Mirjam Ernestus
Comments: This paper has been accepted at Interspeech 2021 where it will be presented and appear in the conference proceedings in September 2021
Link: https://arxiv.org/abs/2106.08648
Abstract: This study addresses the question whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings indicate other characteristics of the database can be just as important.

【3】 Improving Entity Linking through Semantic Reinforced Entity Embeddings

Authors: Feng Hou, Ruili Wang, Jun He, Yi Zhou
Link: https://arxiv.org/abs/2106.08495
Abstract: Entity embeddings, which represent different aspects of each entity with a single vector like word embeddings, are a key component of neural entity linking models. Existing entity embeddings are learned from canonical Wikipedia articles and local contexts surrounding target entities. Such entity embeddings are effective, but too distinctive for linking models to learn contextual commonality. We propose a simple yet effective method, FGS2EE, to inject fine-grained semantic information into entity embeddings to reduce the distinctiveness and facilitate the learning of contextual commonality. FGS2EE first uses the embeddings of semantic type words to generate semantic embeddings, and then combines them with existing entity embeddings through linear aggregation. Extensive experiments show the effectiveness of such embeddings. Based on our entity embeddings, we achieved new state-of-the-art performance on entity linking.
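
The two steps this abstract attributes to FGS2EE (build a semantic embedding from type-word embeddings, then combine it with the existing entity embedding by linear aggregation) can be sketched as follows. The abstract does not specify the formulas, so averaging the type-word vectors and the mixing weight `alpha` are assumptions of this sketch, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def reinforce_entity_embedding(entity_vec, type_word_vecs, alpha=0.5):
    """Linearly aggregate an entity embedding with a semantic embedding.

    Assumption: the semantic embedding is the mean of the type-word
    embeddings, and alpha in [0, 1] weights the semantic part.
    """
    semantic_vec = mean_vector(type_word_vecs)
    return [(1 - alpha) * e + alpha * s for e, s in zip(entity_vec, semantic_vec)]

# Toy 2-d vectors; real embeddings would have hundreds of dimensions.
entity = [1.0, 0.0]
type_words = [[0.0, 1.0], [0.0, 3.0]]  # e.g. vectors for two type words
print(reinforce_entity_embedding(entity, type_words, alpha=0.5))
```

Entities that share type words receive the same semantic component, which pulls their embeddings closer together; this is one way to read the abstract's goal of reducing distinctiveness.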

Graph | Knowledge Graph | Knowledge (1 paper)

【1】 From Discourse to Narrative: Knowledge Projection for Event Relation Extraction

Authors: Jialong Tang, Hongyu Lin, Meng Liao, Yaojie Lu, Xianpei Han, Le Sun, Weijian Xie, Jin Xu
Link: https://arxiv.org/abs/2106.08629
Abstract: Current event-centric knowledge graphs highly rely on explicit connectives to mine relations between events. Unfortunately, due to the sparsity of connectives, these methods severely undermine the coverage of EventKGs. The lack of high-quality labelled corpora further exacerbates that problem. In this paper, we propose a knowledge projection paradigm for event relation extraction: projecting discourse knowledge to narratives by exploiting the commonalities between them. Specifically, we propose Multi-tier Knowledge Projection Network (MKPNet), which can leverage multi-tier discourse knowledge effectively for event relation extraction. In this way, the labelled data requirement is significantly reduced, and implicit event relations can be effectively extracted. Intrinsic experimental results show that MKPNet achieves the new state-of-the-art performance, and extrinsic experimental results verify the value of the extracted event relations.

Summarization | Information Extraction (1 paper)

【1】 Coreference-Aware Dialogue Summarization

Authors: Zhengyuan Liu, Ke Shi, Nancy F. Chen
Comments: Accepted for presentation at SIGDIAL-2021
Link: https://arxiv.org/abs/2106.08556
Abstract: Summarizing conversations via neural approaches has been gaining research traction lately, yet it is still challenging to obtain practical solutions. Examples of such challenges include unstructured information exchange in dialogues, informal interactions between speakers, and dynamic role changes of speakers as the dialogue evolves. Many of such challenges result in complex coreference links. Therefore, in this work, we investigate different approaches to explicitly incorporate coreference information in neural abstractive dialogue summarization models to tackle the aforementioned challenges. Experimental results show that the proposed approaches achieve state-of-the-art performance, implying it is useful to utilize coreference information in dialogue summarization. Evaluation results on factual correctness suggest such coreference-aware models are better at tracing the information flow among interlocutors and associating accurate status/actions with the corresponding interlocutors and person mentions.

Reasoning | Analysis | Understanding | Explanation (3 papers)

【1】 End-to-End Spoken Language Understanding for Generalized Voice Assistants

Authors: Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris
Comments: Accepted to Interspeech 2021; 5 pages, 2 tables, 1 figure
Link: https://arxiv.org/abs/2106.09009
Abstract: End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.

【2】 A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

Authors: Gullal S. Cheema, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
Comments: Accepted in Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT 2021), co-located with ICMR 2021
Link: https://arxiv.org/abs/2106.08829
Abstract: Opinion and sentiment analysis is a vital task to characterize subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison with six state-of-the-art methods, one of which we have re-implemented. In addition, we investigate different textual and visual feature embeddings that cover different aspects of the content, as well as the recently introduced multimodal CLIP embeddings. Experimental results are presented for two different publicly available benchmark datasets of tweets and corresponding images. In contrast to the evaluation methodology of previous work, we introduce a reproducible and fair evaluation scheme to make results comparable. Finally, we conduct an error analysis to outline the limitations of the methods and possibilities for future work.

【3】 On the proper role of linguistically-oriented deep net analysis in linguistic theorizing

Authors: Marco Baroni
Comments: Submitted to collective volume on Algebraic Systems and the Representation of Linguistic Knowledge
Link: https://arxiv.org/abs/2106.08694
Abstract: A lively research field has recently emerged that uses experimental methods to probe the linguistic behavior of modern deep networks. While work in this tradition often reports intriguing results about the grammatical skills of deep nets, it is not clear what their implications for linguistic theorizing should be. As a consequence, linguistically-oriented deep net analysis has had very little impact on linguistics at large. In this chapter, I suggest that deep networks should be treated as theories making explicit predictions about the acceptability of linguistic utterances. I argue that, if we overcome some obstacles standing in the way of seriously pursuing this idea, we will gain a powerful new theoretical tool, complementary to mainstream algebraic approaches.

Semi-/Weakly/Unsupervised | Uncertainty (2 papers)

【1】 Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training

Authors: Li-Ming Zhan, Haowen Liang, Bo Liu, Lu Fan, Xiao-Ming Wu, Albert Y. S. Lam
Comments: ACL 2021
Link: https://arxiv.org/abs/2106.08616
Abstract: Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. Since the distribution of outlier utterances is arbitrary and unknown in the training stage, existing methods commonly rely on strong assumptions on data distribution such as mixture of Gaussians to make inference, resulting in either complex multi-step training procedures or hand-crafted rules such as confidence threshold selection for outlier detection. In this paper, we propose a simple yet effective method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training, which requires no assumption on data distribution and no additional post-processing or threshold setting. Specifically, we construct a set of pseudo outliers in the training stage, by generating synthetic outliers using inlier features via self-supervision and sampling out-of-scope sentences from easily available open-domain datasets. The pseudo outliers are used to train a discriminative classifier that can be directly applied to and generalize well on the test task. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches. Our code has been released at https://github.com/liam0949/DCLOOS.
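
One way to picture "generating synthetic outliers using inlier features" is to mix the feature vectors of two utterances from different intent classes, so that the result belongs cleanly to neither class. The sketch below does this with a random convex combination. The abstract does not state the exact construction, so the mixing rule, the function name, and the toy 2-d features are assumptions; the released repository linked in the abstract is the reference for the actual method.

```python
import random

def synthetic_outliers(features, labels, num_outliers, seed=0):
    """Build pseudo-outlier vectors by mixing inlier features across classes.

    Assumption: an outlier is theta * f_i + (1 - theta) * f_j for two
    inliers i, j with different labels and theta drawn uniformly from [0, 1].
    """
    if len(set(labels)) < 2:
        raise ValueError("need inliers from at least two intent classes")
    rng = random.Random(seed)
    indices = list(range(len(features)))
    outliers = []
    while len(outliers) < num_outliers:
        i, j = rng.sample(indices, 2)
        if labels[i] == labels[j]:
            continue  # only mix features from two different classes
        theta = rng.uniform(0.0, 1.0)
        mixed = [theta * a + (1 - theta) * b for a, b in zip(features[i], features[j])]
        outliers.append(mixed)
    return outliers

feats = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
intents = ["book_flight", "play_music", "book_flight"]
for vec in synthetic_outliers(feats, intents, num_outliers=3):
    print(vec)
```

The generated vectors would then be labeled with an extra "out-of-scope" class and trained alongside the real intents, which matches the abstract's description of a classifier that needs no confidence threshold at test time.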

【2】 Unsupervised Enrichment of Persona-grounded Dialog with Background Stories

Authors: Bodhisattwa Prasad Majumder, Taylor Berg-Kirkpatrick, Julian McAuley, Harsh Jhamtani
Comments: Accepted at ACL 2021 for oral presentation
Link: https://arxiv.org/abs/2106.08364
Abstract: Humans often refer to personal narratives, life experiences, and events to make a conversation more engaging and rich. While persona-grounded dialog models are able to generate responses that follow a given persona, they often miss out on stating detailed experiences or events related to a persona, often leaving conversations shallow and dull. In this work, we equip dialog models with 'background stories' related to a persona by leveraging fictional narratives from existing story datasets (e.g. ROCStories). Since current dialog datasets do not contain such narratives as responses, we perform an unsupervised adaptation of a retrieved story for generating a dialog response using a gradient-based rewriting technique. Our proposed method encourages the generated response to be fluent (i.e., highly likely) with the dialog history, minimally different from the retrieved story to preserve event ordering and consistent with the original persona. We demonstrate that our method can generate responses that are more diverse, and are rated more engaging and human-like by human evaluators, compared to outputs from existing dialog models.

Detection (1 paper)

【1】 Alzheimer's Disease Detection from Spontaneous Speech through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models

Authors: Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2106.08689
Abstract: In this paper, we combined linguistic complexity and (dis)fluency features with pretrained language models for the task of Alzheimer's disease detection of the 2021 ADReSSo (Alzheimer's Dementia Recognition through Spontaneous Speech) challenge. An accuracy of 83.1% was achieved on the test set, which amounts to an improvement of 4.23% over the baseline model. Our best-performing model that integrated component models using a stacking ensemble technique performed equally well on cross-validation and test data, indicating that it is robust against overfitting.

Recognition / Classification (3 papers)

【1】 Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data

Authors: Haoming Jiang, Danqing Zhang, Tianyu Cao, Bing Yin, Tuo Zhao
Comments: The 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021)
Link: https://arxiv.org/abs/2106.08977
Abstract: Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models only with weak supervision, i.e., without any human annotation, and shows that by merely using weakly labeled data, one can achieve good performance, though it still underperforms fully supervised NER with manually/strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, and may even deteriorate, the model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework -- NEEDLE, with three essential ingredients: (1) weak label completion, (2) noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.

【2】 Collaborative Training of Acoustic Encoders for Speech Recognition

Authors: Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra Comments: INTERSPEECH 2021 Link: https://arxiv.org/abs/2106.08960 Abstract: On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in the word error rate on both test partitions.
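The "relative improvement in the word error rate" quoted in the abstract above is a ratio against the baseline error, not a difference in percentage points. A minimal Python sketch of that arithmetic follows; the function name and the WER values are illustrative assumptions and are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent (NOT results from the paper).
baseline_wer = 10.0
collaborative_wer = 8.9
improvement = relative_wer_improvement(baseline_wer, collaborative_wer)
print(f"absolute reduction: {baseline_wer - collaborative_wer:.1f} points")
print(f"relative improvement: {improvement:.1%}")
```

With these made-up numbers, a drop of 1.1 absolute points corresponds to a relative improvement of about 11%, which is why the two ways of reporting should not be confused.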

【3】 SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model

Authors: Zaijing Li, Fengxiao Tang, Tieyu Sun, Yusen Zhu, Ming Zhao Link: https://arxiv.org/abs/2106.08785 Abstract: For the task of conversation emotion recognition, recent works focus on speaker relationship modeling but ignore the role of the utterance's emotional tendency. In this paper, we propose a new expression paradigm of sentence-level emotion orientation vectors to model the potential correlation of emotions between sentence vectors. Based on it, we design an emotion recognition model, which extracts the sentence-level emotion orientation vectors from the language model and jointly learns from the dialogue sentiment analysis model and the extracted sentence-level emotion orientation vectors to identify the speaker's emotional orientation during the conversation. We conduct experiments on two benchmark datasets and compare our model with five baseline models. The experimental results show that our model has better performance on all datasets.

Zero/Few/One-Shot | Transfer | Adaptation (1 paper)

【1】 Pathological voice adaptation with autoencoder-based voice conversion

Authors: Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez, Odette Scharenborg Comments: 6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021) Link: https://arxiv.org/abs/2106.08427 Abstract: In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need to be optimised for speech degradation but only for the speaker change. This change in the optimisation ensures that any degradation found in naturalness is due to the conversion process and not due to the model exaggerating characteristics of a speech pathology. To show a proof of concept of this method, we convert dysarthric speech using the UASpeech database and an autoencoder-based VC technique. Subjective evaluation results show reasonable naturalness for high-intelligibility dysarthric speakers, though lower intelligibility seems to introduce a marginal degradation in naturalness scores for mid- and low-intelligibility speakers compared to ground truth. Conversion of speaker characteristics is successful for low- and high-intelligibility speakers, but not for mid-intelligibility speakers. Whether the differences in the results for the different intelligibility levels are due to the intelligibility levels or to the speakers needs to be further investigated.

Corpus (1 paper)

【1】 RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

Authors: Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian Link: https://arxiv.org/abs/2106.08468 Abstract: This paper introduces RyanSpeech, a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or lack quality male speech data. In order to meet the need for a high-quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech, which contains textual materials from real-world conversational settings. These materials contain over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz. This corpus's design and pipeline make RyanSpeech ideal for developing TTS systems in real-world applications. To provide a baseline for future research, protocols, and benchmarks, we trained 4 state-of-the-art speech models and a vocoder on RyanSpeech. Our best model achieves a mean opinion score (MOS) of 3.36. We have made both the corpus and the trained models available for public use.

Word2Vec | Text | Words (2 papers)

【1】 Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Authors: Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow Comments: Accepted in Interspeech 2021 Link: https://arxiv.org/abs/2106.08686 Abstract: Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.
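The question posed in the abstract above (does embedding distance track phonological dissimilarity?) can be illustrated with a small correlation check. The Python sketch below is only one possible setup: the abstract does not state which distance measures or correlation statistic the authors use, so cosine distance, phoneme-level edit distance, and Pearson correlation are assumptions here, and the three-dimensional "embeddings" and phoneme transcriptions are toy values invented for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g. phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data: word -> (phoneme sequence, made-up embedding). Not from the paper.
words = {
    "haus": (["h", "aʊ", "s"], [0.9, 0.1, 0.0]),
    "maus": (["m", "aʊ", "s"], [0.8, 0.2, 0.1]),
    "hund": (["h", "ʊ", "n", "t"], [0.1, 0.9, 0.3]),
}
pairs = list(combinations(words, 2))
embedding_dists = [cosine_distance(words[a][1], words[b][1]) for a, b in pairs]
phon_dists = [edit_distance(words[a][0], words[b][0]) for a, b in pairs]
r = pearson(embedding_dists, phon_dists)
print(f"correlation between embedding and phonological distance: {r:.2f}")
```

In a real evaluation, the word pairs would come from a held-out vocabulary and the embeddings from a trained AWE model; a rank correlation may also be preferable when the two distance scales differ.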

【2】 Discrete Auto-regressive Variational Attention Models for Text Modeling

Authors: Xianghong Fang, Haoli Bai, Jian Li, Zenglin Xu, Michael Lyu, Irwin King Comments: IJCNN 2021 Link: https://arxiv.org/abs/2106.08571 Abstract: Variational autoencoders (VAEs) have been widely applied for text modeling. In practice, however, they are troubled by two challenges: information underrepresentation and posterior collapse. The former arises because only the last hidden state of the LSTM encoder is transformed into the latent space, which is generally insufficient to summarize the data. The latter is a long-standing problem during the training of VAEs, in which the optimization is trapped in a disastrous local optimum. In this paper, we propose the Discrete Auto-regressive Variational Attention Model (DAVAM) to address these challenges. Specifically, we introduce an auto-regressive variational attention approach to enrich the latent space by effectively capturing the semantic dependency from the input. We further design a discrete latent space for the variational attention and mathematically show that our model is free from posterior collapse. Extensive experiments on language modeling tasks demonstrate the superiority of DAVAM against several VAE counterparts.

Other Neural Networks | Deep Learning | Models | Modeling (4 papers)

【1】 On the long-term learning ability of LSTM LMs

Authors: Wim Boes, Robbe Van Rompaey, Lyan Verwimp, Joris Pelemans, Hugo Van hamme, Patrick Wambacq Link: https://arxiv.org/abs/2106.08927 Abstract: We inspect the long-term learning ability of Long Short-Term Memory language models (LSTM LMs) by evaluating a contextual extension based on the Continuous Bag-of-Words (CBOW) model for both sentence- and discourse-level LSTM LMs and by analyzing its performance. We evaluate on text and speech. Sentence-level models using the long-term contextual module perform comparably to vanilla discourse-level LSTM LMs. On the other hand, the extension does not provide gains for discourse-level models. These findings indicate that discourse-level LSTM LMs already rely on contextual information to perform long-term learning.

【2】 C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

Authors: Hung Le, Nancy F. Chen, Steven C. H. Hoi Comments: 22 pages, 11 figures, 7 tables Link: https://arxiv.org/abs/2106.08914 Abstract: Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning (C^3) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Unlike prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.

【3】 Algorithm to Compilation Codesign: An Integrated View of Neural Network Sparsity

Authors: Fu-Ming Guo, Austin Huang Link: https://arxiv.org/abs/2106.08846 Abstract: Reducing computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on the runtime execution are not often examined in depth. Here we apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model, while also expanding block sparse representation (BSR) operations in the TVM compiler. Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization. This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution. Our main findings are: 1) we validate that performance benefits of structured sparsity block regularization must be enabled by the BSR augmentations to TVM, with 4x speedup relative to vanilla PyTorch and 2.2x speedup relative to standard TVM compilation (without expanded BSR support); 2) for BERT attention weights, the end-to-end optimal block sparsity shape in this CPU inference context is not a square block (as in \cite{gray2017gpu}) but rather a linear 32x1 block; 3) the relationship between performance and block size/shape is suggestive of how model regularization parameters interact with task scheduler optimizations, resulting in the observed end-to-end performance.
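To make the block-shape discussion in the abstract above concrete, the following Python sketch builds a block sparse row (BSR) layout by hand and counts how many values get stored for a 32x1 block versus a square block. It is a generic illustration of the BSR format, not the TVM implementation or the authors' code; the helper names and the toy matrix are assumptions, and stored-value counts alone do not explain the paper's runtime results, which the authors attribute to interaction with the task scheduler.

```python
def to_bsr(dense, block_shape):
    """Convert a dense matrix (list of rows) into BSR arrays.

    Returns (data, indices, indptr): the stored non-zero blocks, their
    block-column indices, and per-block-row offsets into `indices`.
    """
    r, c = block_shape
    n_rows, n_cols = len(dense), len(dense[0])
    if n_rows % r or n_cols % c:
        raise ValueError("matrix shape must be divisible by the block shape")
    data, indices, indptr = [], [], [0]
    for bi in range(n_rows // r):
        for bj in range(n_cols // c):
            block = [row[bj * c:(bj + 1) * c] for row in dense[bi * r:(bi + 1) * r]]
            if any(v != 0 for row in block for v in row):
                data.append(block)   # keep the whole block, zeros included
                indices.append(bj)
        indptr.append(len(indices))
    return data, indices, indptr


def stored_values(dense, block_shape):
    """Number of scalars kept in the BSR `data` array for a block shape."""
    data, _, _ = to_bsr(dense, block_shape)
    return len(data) * block_shape[0] * block_shape[1]


# Toy 64x8 weight matrix whose only non-zeros form a 32-long column segment.
weights = [[0.0] * 8 for _ in range(64)]
for i in range(32):
    weights[i][2] = 1.0

print("32x1 blocks:", stored_values(weights, (32, 1)), "stored values")
print("8x8 blocks: ", stored_values(weights, (8, 8)), "stored values")
```

For this hand-made pattern, the 32x1 layout stores one block while the 8x8 layout must keep four mostly empty blocks. The sparsity pattern here was chosen to favor tall blocks, so it shows only that the block shape and the pruning pattern need to match.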

【4】 Generative Conversational Networks

Authors: Alexandros Papangelis, Karthik Gopalakrishnan, Aishwarya Padmakumar, Seokhwan Kim, Gokhan Tur, Dilek Hakkani-Tur Comments: SIGDial 2021 Link: https://arxiv.org/abs/2106.08484 Abstract: Inspired by recent work in meta-learning and generative teaching networks, we propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data (given some seed data) and then train themselves from that data to perform a given task. We use reinforcement learning to optimize the data generation process where the reward signal is the agent's performance on the task. The task can be any language-related task, from intent detection to full task-oriented conversations. In this work, we show that our approach is able to generalise from seed data and performs well in limited data and limited computation settings, with significant gains for intent detection and slot tagging across multiple datasets: ATIS, TOD, SNIPS, and Restaurants8k. We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data. We also conduct an analysis of the novelty of the generated data and provide generated examples for intent detection, slot tagging, and non-goal oriented conversations.

Other (4 papers)

【1】 Attention-Based Keyword Localisation in Speech using Visual Grounding

Authors: Kayode Olaleye, Herman Kamper Comments: Accepted to Interspeech 2021 Link: https://arxiv.org/abs/2106.08859 Abstract: Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.

【2】 Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking

Authors: Ting Han, Chongxuan Huang, Wei Peng Comments: Accepted by Interspeech 2021 Link: https://arxiv.org/abs/2106.08723 Abstract: Dialogue State Tracking (DST), which is the process of inferring user goals by estimating belief states given the dialogue history, plays a critical role in task-oriented dialogue systems. A coreference phenomenon observed in multi-turn conversations is not addressed by existing DST models, leading to sub-optimal performance. In this paper, we propose the Coreference Dialogue State Tracker (CDST), which explicitly models the coreference feature. In particular, at each turn, the proposed model jointly predicts the coreferred domain-slot pair and extracts the coreference values from the dialogue context. Experimental results on the MultiWOZ 2.1 dataset show that the proposed model achieves a state-of-the-art joint goal accuracy of 56.47%.
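Joint goal accuracy, the metric reported in the abstract above, is commonly computed as the fraction of dialogue turns whose predicted belief state matches the gold state on every domain-slot pair. The Python sketch below shows that standard definition; the function name, slot names, and dialogue are hypothetical, and the paper's own evaluation script may differ in details such as value normalization.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full belief state is predicted exactly.

    Each state is a dict mapping a "domain-slot" string to its value; a turn
    counts as correct only if every slot matches the gold state.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        raise ValueError("no turns to evaluate")
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)


# Hypothetical three-turn dialogue (not taken from MultiWOZ).
gold = [
    {"restaurant-name": "curry garden"},
    {"restaurant-name": "curry garden", "restaurant-book time": "18:00"},
    # "a taxi to the restaurant": the destination corefers with the name above.
    {"restaurant-name": "curry garden", "restaurant-book time": "18:00",
     "taxi-destination": "curry garden"},
]
predicted = [
    {"restaurant-name": "curry garden"},
    {"restaurant-name": "curry garden", "restaurant-book time": "18:00"},
    # The coreferred slot is missed, so the whole turn counts as wrong.
    {"restaurant-name": "curry garden", "restaurant-book time": "18:00"},
]
print(f"joint goal accuracy: {joint_goal_accuracy(predicted, gold):.2%}")
```

Because a single wrong or missing slot invalidates the whole turn, unresolved coreferences like the taxi destination above pull the metric down even when most slots are right.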

【3】 Eider: Evidence-enhanced Document-level Relation Extraction

Authors: Yiqing Xie, Jiaming Shen, Sha Li, Yuning Mao, Jiawei Han Link: https://arxiv.org/abs/2106.08657 Abstract: Document-level relation extraction (DocRE) aims at extracting the semantic relations among entity pairs in a document. In DocRE, a subset of the sentences in a document, called the evidence sentences, might be sufficient for predicting the relation between a specific entity pair. To make better use of the evidence sentences, in this paper, we propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results. We first jointly train an RE model with a simple and memory-efficient evidence extraction model. Then, we construct pseudo documents based on the extracted evidence sentences and run the RE model again. Finally, we fuse the extraction results of the first two stages using a blending layer and make a final prediction. Extensive experiments show that our proposed framework achieves state-of-the-art performance on the DocRED dataset, outperforming the second-best method by 0.76/0.82 Ign F1/F1. In particular, our method significantly improves the performance on inter-sentence relations by 1.23 Inter F1.

【4】 Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

Authors: Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote Comments: Accepted to Interspeech 2021, 5 pages, 3 figures Link: https://arxiv.org/abs/2106.08649 Abstract: This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to a more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap from the original PW to recordings by 10%, and from other state-of-the-art neural vocoding systems by more than 60%. We also demonstrate improvements in objective metrics on the evaluation test set, with L2 Spectral Distance and Cross-Entropy reduced by 3% and 6‰ compared to the affine PW. Furthermore, we extend the probability density distillation procedure proposed by the original PW paper, so that it works with any non-affine invertible and differentiable function.

This article is part of the Tencent Cloud self-media sharing program and is shared from a WeChat official account.

Originally published: 2021-06-17. In case of infringement, contact cloudcommunity@tencent.com for removal.
