金融/语音/音频处理学术速递[6.17]

公众号-arXiv每日学术速递
发布2021-07-02 18:38:48

访问www.arxivdaily.com获取含摘要速递,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏、发帖等功能!点击阅读原文即可访问

q-fin金融,共计8篇

cs.SD语音,共计19篇

eess.AS音频处理,共计19篇

1.q-fin金融:

【1】 The Economic Impact of Critical National Infrastructure Failure Due to Space Weather 标题:空间天气导致的国家重大基础设施故障的经济影响

作者:Edward J. Oughton 链接:https://arxiv.org/abs/2106.08945 摘要:空间天气是对可能对技术产生不利影响的不同太阳或空间现象的统称。然而,与飓风或地震等地面自然灾害相比,目前对空间天气灾害的认识还处于相对初级阶段。事实上,某些类型的空间天气,如大型日冕物质抛射(CME)是低概率、高严重性危险的典型例子。重大事件少、时间序列数据短以及对关键基础设施的潜在影响缺乏共识,妨碍了空间天气的经济影响评估。然而,空间天气有可能破坏一系列重要的国家基础设施(CNI)系统,包括电力传输、卫星通信和定位、航空和铁路运输。最近人们对这些潜在的经济和社会影响越来越感兴趣。相关估计从1989年魁北克磁暴事件造成的数百万美元设备损坏,到一些分析师估计的未来潜在灾难情景可能给整个经济造成的数十亿美元损失不等。因此,这为本文提供了动力:跟踪1989年至2017年空间天气社会经济评估的起源和发展,并阐明该领域未来的研究方向。 摘要:Space weather is a collective term for different solar or space phenomena that can detrimentally affect technology. However, current understanding of space weather hazards is still relatively embryonic in comparison to terrestrial natural hazards such as hurricanes or earthquakes. Indeed, certain types of space weather such as large Coronal Mass Ejections (CMEs) are an archetypal example of a low probability, high severity hazard. Few major events, short time-series data and a lack of consensus regarding the potential impacts on critical infrastructure have hampered the economic impact assessment of space weather. Yet, space weather has the potential to disrupt a wide range of Critical National Infrastructure (CNI) systems including electricity transmission, satellite communications and positioning, aviation and rail transportation. Recently there has been growing interest in these potential economic and societal impacts. Estimates range from millions of dollars of equipment damage from the Quebec 1989 event, to some analysts reporting billions of lost dollars in the wider economy from potential future disaster scenarios. Hence, this provides motivation for this article which tracks the origin and development of the socio-economic evaluation of space weather, from 1989 to 2017, and articulates future research directions for the field.

【2】 What does Network Analysis teach us about International Environmental Cooperation? 标题:关于国际环境合作,网络分析教会了我们什么?

作者:Stefano Carattini,Sam Fankhauser,Jianjian Gao,Caterina Gennaioli,Pietro Panzarasa 链接:https://arxiv.org/abs/2106.08883 摘要:在过去70年中,国际环境协定的数量大幅度增加,突出了它们在环境治理中的突出作用。本文以1948年至2015年签订的546份国际环境协定为基础,运用网络分析工具对国际环境合作的网络属性进行了识别。我们确定了四个典型事实,为国际环境协定文献中的一些关键主题提供了拓扑佐证。首先,我们发现,一个具有统计意义的合作网络直到1970年初才出现,但自那时以来,该网络的实力不断增强,导致签署国之间的连通性和合作强度不断提高。第二,随着时间的推移,网络变得更紧密、更密集、更有凝聚力,使政策协调和知识传播更加有效。第三,这个网络虽然是全球性的,但有一个明显的欧洲印记:最初是英国,最近是法国和德国,是促成环境合作的最具战略意义的角色。第四,国际环境协调始于渔业和海洋的管理,但目前最为密切的是废物和有害物质。空气和大气层条约网络在一些指标上较弱,缺乏在其他网络中发现的层次结构。它是唯一一个拓扑性质受到联合国赞助条约影响的网络。 摘要:Over the past 70 years, the number of international environmental agreements (IEAs) has increased substantially, highlighting their prominent role in environmental governance. This paper applies the toolkit of network analysis to identify the network properties of international environmental cooperation based on 546 IEAs signed between 1948 and 2015. We identify four stylised facts that offer topological corroboration for some key themes in the IEA literature. First, we find that a statistically significant cooperation network did not emerge until early 1970, but since then the network has grown continuously in strength, resulting in higher connectivity and intensity of cooperation between signatory countries. Second, over time the network has become closer, denser and more cohesive, allowing more effective policy coordination and knowledge diffusion. Third, the network, while global, has a noticeable European imprint: initially the United Kingdom and more recently France and Germany have been the most strategic players to broker environmental cooperation. Fourth, international environmental coordination started with the management of fisheries and the sea, but is now most intense on waste and hazardous substances. The network of air and atmosphere treaties is weaker on a number of metrics and lacks the hierarchical structure found in other networks. It is the only network whose topological properties are shaped significantly by UN-sponsored treaties.
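
下面给出一个示意性的小例子(并非论文代码),用 networkx 在一份虚构的共同签署网络上计算摘要中提到的几类拓扑指标(密度、聚类系数、介数中心性);其中的国家与边权均为假设数据,仅用于说明这些指标的含义。

```python
import networkx as nx

# 假设数据:节点为签署国,边权为两国共同签署的协定数量
edges = [
    ("UK", "France", 12), ("UK", "Germany", 10), ("France", "Germany", 11),
    ("UK", "Norway", 6), ("Norway", "Iceland", 4), ("France", "Spain", 7),
]
G = nx.Graph()
G.add_weighted_edges_from(edges)

print("密度:", nx.density(G))                       # 网络"更密"对应该值上升
print("平均聚类系数:", nx.average_clustering(G))      # 反映网络的凝聚力
print("介数中心性:", nx.betweenness_centrality(G))    # 衡量谁在充当合作的"中间人"
```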

【3】 An Information Filtering approach to stress testing: an application to FTSE markets 标题:压力测试的信息过滤方法:在富时指数(FTSE)市场中的应用

作者:Isobel Seabrook,Fabio Caccioli,Tomaso Aste 备注:17 pages, 5 figures 链接:https://arxiv.org/abs/2106.08778 摘要:我们提出了一种新的方法来量化市场冲击的"影响"和"反应"。我们对一部分市场中的一组股票施加冲击,并使用整个市场多元收益分布的稀疏概率椭圆模型,根据市场另一部分的平均损失来量化影响。稀疏性是通过一个$L_0$-范数正则化引入的,该正则化根据从信息过滤网络推断出的依赖结构强制将逆协方差的一些元素归零。我们的研究关注富时100指数和250指数市场,分析了施加于个股和股票组以及来自它们的冲击与响应。我们观察到,冲击模式与股票收益率逆协方差的稀疏结构所对应的网络结构有关。处于中心位置的板块似乎更容易受到冲击的影响,而基础多元化程度高的股票在经历冲击时对市场其他部分的影响更大。通过分别在危机时期和市场相对平静时期分析该系统,我们观察到冲击模式发生变化,并在危机时期表现出趋同行为。 摘要:We present a novel methodology to quantify the "impact" of and "response" to market shocks. We apply shocks to a group of stocks in a part of the market, and we quantify the effects in terms of average losses on another part of the market using a sparse probabilistic elliptical model for the multivariate return distribution of the whole market. Sparsity is introduced with an $L_0$-norm regularization, which forces to zero some elements of the inverse covariance according to a dependency structure inferred from an information filtering network. Our study concerns the FTSE 100 and 250 markets and analyzes impact and response to shocks both applied to and received from individual stocks and group of stocks. We observe that the shock pattern is related to the structure of the network associated with the sparse structure of the inverse covariance of stock returns. Central sectors appear more likely to be affected by shocks, and stocks with a large level of underlying diversification have a larger impact on the rest of the market when experiencing shocks. By analyzing the system during times of crisis and comparative market calmness, we observe changes in the shock patterns with a convergent behavior in times of crisis.
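
为了直观说明"对一部分股票施加冲击、用另一部分股票的平均损失来度量响应"的思路,下面给出一个高度简化的草图:假设收益服从多元高斯分布(椭圆分布的一个特例),并直接手工给定一个稀疏精度矩阵(逆协方差),用高斯条件期望公式计算响应;分组、参数与矩阵均为虚构示例,并非论文中基于信息过滤网络的 $L_0$ 正则化估计流程。

```python
import numpy as np

p = 6                                    # 股票数(玩具规模)
A = np.array([0, 1])                     # 被施加冲击的股票
B = np.array([2, 3, 4, 5])               # 观察响应的股票
mu = np.zeros(p)                         # 收益均值(简化为 0)

# 手工构造一个对称、对角占优(因而正定)的稀疏精度矩阵作为示例
Lam = np.eye(p) * 2.0
Lam[0, 2] = Lam[2, 0] = -0.5
Lam[1, 3] = Lam[3, 1] = -0.4
Lam[2, 4] = Lam[4, 2] = -0.3

shock = np.array([-0.10, -0.08])         # A 组股票分别下跌 10% 和 8%
Lam_BB = Lam[np.ix_(B, B)]
Lam_BA = Lam[np.ix_(B, A)]
# 高斯条件期望: E[x_B | x_A = a] = mu_B - inv(Lam_BB) @ Lam_BA @ (a - mu_A)
response = mu[B] - np.linalg.solve(Lam_BB, Lam_BA @ (shock - mu[A]))
print("B 组股票的条件期望收益(冲击响应):", response)
```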

【4】 Characterization of equilibrium existence and purification in general Bayesian games 标题:一般贝叶斯对策均衡存在与净化的刻画

作者:Wei He,Xiang Sun,Yeneng Sun,Yishu Zeng 链接:https://arxiv.org/abs/2106.08563 摘要:本文研究了具有一般行动空间、相关类型和相互依赖收益的贝叶斯博弈。我们引入了"可分解的粗支付相关信息"条件,证明了该条件是纯策略均衡存在和行为策略净化的充分必要条件。作为我们的净化方法的结果,对于非连续贝叶斯博弈也得到了纯策略均衡的一个新的存在性结果。文中还给出了我们的结果在寡头竞争和全支付拍卖(all-pay auction)中的示例性应用。 摘要:This paper studies Bayesian games with general action spaces, correlated types and interdependent payoffs. We introduce the condition of ``decomposable coarser payoff-relevant information'', and show that this condition is both sufficient and necessary for the existence of pure-strategy equilibria and purification from behavioral strategies. As a consequence of our purification method, a new existence result on pure-strategy equilibria is also obtained for discontinuous Bayesian games. Illustrative applications of our results to oligopolistic competitions and all-pay auctions are provided.

【5】 Deep reinforcement learning on a multi-asset environment for trading 标题:多资产交易环境下的深度强化学习

作者:Ali Hirsa,Joerg Osterrieder,Branka Hadji-Misheva,Jan-Alexander Posth 链接:https://arxiv.org/abs/2106.08437 摘要:几十年来,金融交易被广泛分析,市场参与者和学术界一直在寻找提高交易绩效的先进方法。深度强化学习(DRL)是一种新兴的学习方法,在多个领域都取得了显著的成功,但它在金融市场上仍需发挥其优势。我们利用深度Q网络(DQN)来设计期货合约的多空交易策略。状态空间由波动率归一化的日收益率组成,买入或卖出是强化学习行为,总报酬定义为我们行为的累积利润。我们的交易策略在真实和模拟的价格序列上进行了训练和测试,并将结果与指数基准进行了比较。我们分析了基于人工数据和实际价格序列相结合的训练如何在实际市场中成功实施。将训练好的强化学习代理应用于E-mini标准普尔500连续期货合约的交易。我们在这项研究中的结果是初步的,需要进一步改进。 摘要:Financial trading has been widely analyzed for decades with market participants and academics always looking for advanced methods to improve trading performance. Deep reinforcement learning (DRL), a recently reinvigorated method with significant success in multiple domains, still has to show its benefit in the financial markets. We use a deep Q-network (DQN) to design long-short trading strategies for futures contracts. The state space consists of volatility-normalized daily returns, with buying or selling being the reinforcement learning action and the total reward defined as the cumulative profits from our actions. Our trading strategy is trained and tested both on real and simulated price series and we compare the results with an index benchmark. We analyze how training based on a combination of artificial data and actual price series can be successfully deployed in real markets. The trained reinforcement learning agent is applied to trading the E-mini S&P 500 continuous futures contract. Our results in this study are preliminary and need further improvement.
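
按照摘要对状态、动作与回报的描述,下面给出一个极简的 DQN 交易玩具示例(并非论文实现):状态为波动率归一化的日收益窗口,动作为做多/做空,回报为持仓一天的收益;价格序列、网络结构与超参数均为假设,只用于说明各要素如何对应到代码。

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000)))     # 模拟的价格序列
rets = np.diff(np.log(prices))
vol = np.array([rets[max(0, t - 60):t + 1].std() + 1e-8 for t in range(len(rets))])
norm_rets = rets / vol                                           # 波动率归一化收益

WINDOW, ACTIONS = 20, [1, -1]                                    # 动作:做多 / 做空
qnet = nn.Sequential(nn.Linear(WINDOW, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1

for t in range(80, len(norm_rets) - 1):                          # 跳过波动率估计不稳定的开头
    s = torch.tensor(norm_rets[t - WINDOW:t], dtype=torch.float32)
    q = qnet(s)
    a = rng.integers(2) if rng.random() < eps else int(q.argmax())   # epsilon-贪心
    r = ACTIONS[a] * rets[t]                                     # 持仓一天的收益作为回报
    s2 = torch.tensor(norm_rets[t - WINDOW + 1:t + 1], dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * qnet(s2).max()
    loss = (q[a] - target) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```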

【6】 Pricing and Risk Analysis in Hyperbolic Local Volatility Model with Quasi Monte Carlo 标题:基于拟蒙特卡罗的双曲局部波动率模型定价与风险分析

作者:Julien Hok,Sergei Kucherenko 链接:https://arxiv.org/abs/2106.08421 摘要:局部波动率模型通常比其他方法(如随机波动率模型)更准确地捕捉隐含波动率曲面。本文给出了蒙特卡罗(MC)和拟蒙特卡罗(QMC)方法在基于双曲局部波动率模型的衍生品定价和风险分析中的应用结果。在高维积分中,如果被积函数的有效维数不太大,QMC的性能优于MC。在衍生品定价和希腊值(Greeks)计算的应用中,有效维数依赖于路径离散化算法。对亚式期权的计算结果表明,拟蒙特卡罗方法具有优越的性能,特别是对于布朗桥离散格式。 摘要:Local volatility models usually capture the surface of implied volatilities more accurately than other approaches, such as stochastic volatility models. We present the results of application of Monte Carlo (MC) and Quasi Monte Carlo (QMC) methods for derivative pricing and risk analysis based on Hyperbolic Local Volatility Model. In high-dimensional integration QMC shows a superior performance over MC if the effective dimension of an integrand is not too large. In application to derivative pricing and computation of Greeks effective dimensions depend on path discretization algorithms. The results presented for the Asian option show the superior performance of the Quasi Monte Carlo methods especially for the Brownian Bridge discretization scheme.
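
下面的草图演示 Sobol 序列加布朗桥离散化在算术平均亚式期权定价中的用法;为保持自足,这里用几何布朗运动(Black-Scholes 动态)代替论文中的双曲局部波动率模型,所有参数均为假设值,并依赖 scipy>=1.7 提供的 scipy.stats.qmc。

```python
import numpy as np
from scipy.stats import norm, qmc

S0, K, r, sigma, T, d = 100.0, 100.0, 0.02, 0.2, 1.0, 32   # d 为时间步数(取 2 的幂)
n = 2 ** 13                                                  # QMC 路径数(2 的幂)

def brownian_bridge(z, T):
    """先由 z[0] 定出终点 W(T),再逐层填充区间中点(布朗桥离散化)。"""
    d = len(z)
    dt = T / d
    w = np.zeros(d + 1)
    w[d] = np.sqrt(T) * z[0]
    k, level = 1, d
    while level > 1:
        half = level // 2
        for left in range(0, d, level):
            right, mid = left + level, left + half
            var = (half * dt) * (half * dt) / (level * dt)   # 桥的条件方差
            w[mid] = 0.5 * (w[left] + w[right]) + np.sqrt(var) * z[k]
            k += 1
        level = half
    return w[1:]

z = norm.ppf(qmc.Sobol(d, scramble=True, seed=0).random(n))  # (n, d) 标准正态
t = (T / d) * np.arange(1, d + 1)
payoffs = np.empty(n)
for i in range(n):
    w = brownian_bridge(z[i], T)
    s = S0 * np.exp((r - 0.5 * sigma ** 2) * t + sigma * w)  # GBM 路径
    payoffs[i] = max(s.mean() - K, 0.0)                      # 算术平均亚式看涨
print("QMC 估计的期权价格:", np.exp(-r * T) * payoffs.mean())
```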

【7】 Time Series Momentum Predictability via Dynamic Binary Classification 标题:基于动态二值分类的时间序列动量可预测性

作者:Bruno P. C. Levy,Hedibert F. Lopes 链接:https://arxiv.org/abs/2106.08420 摘要:自Moskowitz,Ooi和Pedersen(2012)的工作以来,时间序列动量策略在定量金融行业中得到了广泛的应用,其学术研究也迅速发展。然而,交易信号通常是通过简单观察过去的收益率测量来获得的。在这篇文章中,我们研究的好处,结合动态经济计量模型,依次学习不同的回顾期的重要性时变的个别资产。通过使用动态二元分类器模型,投资者能够在过去动量和未来收益之间的时变或不变关系之间进行切换,动态组合不同的回望期,提高交易信号的准确性和投资组合的绩效。利用来自56份期货合约的数据,我们发现均值-方差投资者愿意支付相当大的管理费用,从传统的朴素时间序列动量策略转向动态分类器方法。 摘要:Time series momentum strategies are widely applied in the quantitative financial industry and its academic research has grown rapidly since the work of Moskowitz, Ooi and Pedersen (2012). However, trading signals are usually obtained via simple observation of past return measurements. In this article we study the benefits of incorporating dynamic econometric models to sequentially learn the time-varying importance of different look-back periods for individual assets. By the use of a dynamic binary classifier model, the investor is able to switch between time-varying or constant relations between past momentum and future returns, dynamically combining different look-back periods and improving trading signals accuracy and portfolio performance. Using data from 56 future contracts we show that a mean-variance investor will be willing to pay a considerable managment fee to switch from the traditional naive time series momentum strategy to the dynamic classifier approach.
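
论文使用的是动态贝叶斯二值分类器;下面仅给出一个粗略的近似草图,用滚动窗口重新拟合的逻辑回归,把多个回顾期的过去累计收益作为特征来预测下期方向并产生多空信号,收益序列与窗口长度均为假设数据。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
rets = rng.normal(0.0005, 0.01, 1500)                    # 占位的日收益序列
lookbacks = [5, 21, 63, 252]                              # 不同的回顾期
start = max(lookbacks)

X = np.column_stack([[rets[t - lb:t].sum() for t in range(start, len(rets))]
                     for lb in lookbacks])                # 各回顾期的过去累计收益
y = (rets[start:] > 0).astype(int)                        # 当期收益方向作为标签

signals, train_win = [], 500
for t in range(train_win, len(y)):
    clf = LogisticRegression().fit(X[t - train_win:t], y[t - train_win:t])
    prob_up = clf.predict_proba(X[t:t + 1])[0, 1]
    signals.append(1 if prob_up > 0.5 else -1)            # 做多 / 做空信号
print("最近 10 个信号:", signals[-10:])
```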

【8】 Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality 标题:随机特征神经网络学习无维数灾的Black-Scholes型偏微分方程

作者:Lukas Gonon 链接:https://arxiv.org/abs/2106.08900 摘要:本文研究了用随机特征神经网络学习与Black-Scholes和更一般的指数L′evy模型有关的Kolmogorov偏(积分)微分方程。随机特征神经网络是一种仅输出权值可训练的单隐层前馈神经网络。这使得训练特别简单,但是(先验地)降低了表达能力。有趣的是,这不是布莱克-斯科尔斯型偏微分方程的情况,如我们在这里所示。我们推导了学习充分非退化Black-Scholes型模型的随机神经网络预测误差的界。文中给出了一个完整的误差分析,结果表明所导出的边界不受维数灾难的影响。我们还研究了这些结果在篮子期权中的应用,并在数值上验证了边界。这些结果证明了神经网络能够在没有维数灾难的情况下学习Black-Scholes型偏微分方程的解。此外,这提供了一个相关学习问题的例子,其中随机特征神经网络是可证明有效的。 摘要:This article investigates the use of random feature neural networks for learning Kolmogorov partial (integro-)differential equations associated to Black-Scholes and more general exponential L\'evy models. Random feature neural networks are single-hidden-layer feedforward neural networks in which only the output weights are trainable. This makes training particularly simple, but (a priori) reduces expressivity. Interestingly, this is not the case for Black-Scholes type PDEs, as we show here. We derive bounds for the prediction error of random neural networks for learning sufficiently non-degenerate Black-Scholes type models. A full error analysis is provided and it is shown that the derived bounds do not suffer from the curse of dimensionality. We also investigate an application of these results to basket options and validate the bounds numerically. These results prove that neural networks are able to \textit{learn} solutions to Black-Scholes type PDEs without the curse of dimensionality. In addition, this provides an example of a relevant learning problem in which random feature neural networks are provably efficient.
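
下面用一个一维玩具例子说明"随机特征神经网络"的训练方式:隐层权重随机采样后固定不动,只用最小二乘求解输出层权重;此处拟合的目标是用 scipy.stats.norm 算出的闭式 Black-Scholes 看涨期权价格。特征个数与采样分布均为假设,与论文的理论设置无直接对应。

```python
import numpy as np
from scipy.stats import norm

K, r, sigma, T = 100.0, 0.02, 0.2, 1.0

def bs_call(S):
    """闭式 Black-Scholes 看涨期权价格,作为被学习的目标函数。"""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

rng = np.random.default_rng(0)
N = 200                                                # 随机特征(隐层神经元)个数
a, b = rng.normal(0, 0.05, N), rng.uniform(-5, 5, N)   # 随机且不再训练的隐层参数

def features(S):
    return np.maximum(np.outer(S, a) + b, 0.0)         # ReLU(a*S + b)

S_train = rng.uniform(50, 150, 1000)
Phi = features(S_train)
w, *_ = np.linalg.lstsq(Phi, bs_call(S_train), rcond=None)   # 只训练输出权重

S_test = np.linspace(60, 140, 5)
print("随机特征网络预测:", features(S_test) @ w)
print("闭式精确价格:  ", bs_call(S_test))
```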

2.cs.SD语音:

【1】 End-to-End Spoken Language Understanding for Generalized Voice Assistants 标题:通用语音助理的端到端口语理解

作者:Michael Saxon,Samridhi Choudhary,Joseph P. McKenna,Athanasios Mouchtaris 备注:Accepted to Interspeech 2021; 5 pages, 2 tables, 1 figure 链接:https://arxiv.org/abs/2106.09009 摘要:端到端(E2E)口语理解(SLU)系统使用单一模型直接从语音中预测话语语义。这方面的工作主要集中在固定域中的目标任务,其中输出语义结构是先验的,输入语音的复杂性是有限的。在这项工作中,我们提出了我们的方法来开发一个E2E模型的广义SLU在商业语音助理(VAs)。我们提出了一个完全可微的、基于Transformer的、层次化的系统,可以在ASR和NLU两个层次上进行预训练。然后对转录和语义分类损失进行微调,以处理不同的意图和参数组合。这导致SLU系统在复杂的内部广义VA数据集上实现了显著的基线改进,准确率提高了43%,同时仍然满足流行的Fluent Speech Commands数据集上99%的准确率基准。我们在一个硬测试集上进一步评估了我们的模型,该测试集只包含训练中看不到的时隙参数,并展示了近20%的改进,显示了我们的方法在真正苛刻的VA场景中的有效性。 摘要:End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.

【2】 Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments 标题:Voicy:噪声混响环境中的零样本非并行语音转换

作者:Alejandro Mottini,Jaime Lorenzo-Trueba,Sri Vishnu Kumar Karlapati,Thomas Drugman 备注:Presented at the Speech Synthesis Workshops 2021 (SSW11) 链接:https://arxiv.org/abs/2106.08873 摘要:语音转换(Voice Conversion,VC)是一种通过转换源语音的非语言信息来改变感知到的说话人身份的技术。虽然有大量关于VC的文献,但是大多数提出的方法都是在干净的语音记录上进行训练和评估的。然而,许多声学环境是噪声和混响的,严重限制了流行的VC方法对此类场景的适用性。为了解决这个局限性,我们提出了Voicy,一个专门针对含噪语音的新VC框架。我们的方法受去噪自动编码器框架的启发,由四个编码器(说话人、内容、音素和声学ASR)和一个解码器组成。重要的是,Voicy能够执行非平行零样本VC,这对任何需要处理训练中未见过的说话人的VC系统都是一项重要要求。我们已经使用LibriSpeech数据集的一个噪声混响版本验证了我们的方法。实验结果表明,在噪声混响环境中,Voicy在自然度和目标说话人相似度方面优于其他测试的VC技术。 摘要:Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.

【3】 Attention-Based Keyword Localisation in Speech using Visual Grounding 标题:基于视觉基础的语音中基于注意力的关键词定位

作者:Kayode Olaleye,Herman Kamper 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.08859 摘要:基于视觉的语音模型从与语音字幕配对的图像中学习。通过使用一个经过训练的具有固定词汇量的视觉分类器,用软文本标签标记图像,先前的工作已经表明,可以训练一个能够检测特定文本关键字是否出现在言语中的模型。在这里,我们研究了基于视觉的语音模型是否也可以进行关键词定位:在没有任何明确的基于文本或对齐监督的情况下,预测一个给定的文本关键词在话语中出现的位置。我们特别考虑将注意力纳入卷积模型是否有利于定位。尽管视觉监督模型的绝对定位性能仍然不高(与使用无序词袋文本标签进行监督相比),但我们表明,与以前的视觉监督模型相比,注意力带来了很大的性能提升。与其他许多语音-图像研究一样,我们发现许多不正确的定位是由于语义混淆造成的,例如对查询关键词“swimming”定位到了单词“backstroke”(仰泳)。 摘要:Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.

【4】 Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking 标题:基于源分离的数据增强改进的联合差拍和降拍跟踪

作者:Ching-Yu Chiu,Joann Ching,Wen-Yi Hsiao,Yu-Hua Chen,Alvin Wen-Yu Su,Yi-Hsuan Yang 备注:Accepted to European Signal Processing Conference (EUSIPCO 2021) 链接:https://arxiv.org/abs/2106.08703 摘要:近年来,随着深度学习技术的发展,音乐音频信号中自动拍和下拍跟踪的性能有了很大的提高。在训练这种基于深度学习的模型时,数据扩充被认为是一种重要的技术。然而,现有的数据扩充方法主要是为了平衡训练数据的分布与速度之间的关系。在这篇论文中,我们探讨另一种资料扩充的方法,以说明训练资料的组成,在打击和非打击声源。具体地说,我们提出了一种盲鼓分离模型,从每个训练音频信号中分离出鼓声和非鼓声,过滤掉无鼓的训练信号,然后利用得到的鼓声和非鼓声来增强训练数据。我们在四个完全看不见的测试集上进行了实验,验证了所提方法的有效性,并相应地验证了鼓声合成在拍和下拍跟踪训练数据中的重要性。 摘要:Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found an important technique. However, existing data augmentation methods for this task mainly target at balancing the distribution of the training data with respect to their tempo. In this paper, we investigate another approach for data augmentation, to account for the composition of the training data in terms of the percussive and non-percussive sound sources. Specifically, we propose to employ a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal, filtering out training signals that are drumless, and then use the obtained drum and non-drum stems to augment the training data. We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method, and accordingly the importance of drum sound composition in the training data for beat and downbeat tracking.
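
下面的草图说明这种基于源分离的数据增强思路:假设已经用某个鼓声分离模型得到 drum 与 non-drum 两条音轨(分离步骤不在此示例中,相关接口为假设),对近似无鼓的样本进行过滤,并用随机增益重新混音来扩充训练数据;增益范围与能量阈值均为假设值。

```python
import numpy as np

def augment(drum: np.ndarray, non_drum: np.ndarray, rng, n_aug: int = 4):
    """drum / non_drum: 同长度的单声道波形;返回若干重新混合的训练样本。"""
    # 若鼓声能量过低(近似"无鼓"),按论文思路将该样本过滤掉
    if np.mean(drum ** 2) < 1e-6 * np.mean(non_drum ** 2):
        return []
    out = []
    for _ in range(n_aug):
        g_d, g_n = rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5)   # 随机增益
        mix = g_d * drum + g_n * non_drum
        out.append(mix / (np.max(np.abs(mix)) + 1e-9))            # 峰值归一化
    return out

rng = np.random.default_rng(0)
drum = rng.normal(0, 0.1, 44100)        # 这里用噪声占位,实际应为分离得到的鼓声音轨
non_drum = rng.normal(0, 0.1, 44100)
augmented = augment(drum, non_drum, rng)
print(len(augmented), "个增强样本")
```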

【5】 Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study 标题:声学单词嵌入能捕捉到语音相似性吗?一项实证研究

作者:Badr M. Abdullah,Marius Mosbach,Iuliia Zaitova,Bernd Möbius,Dietrich Klakow 备注:Accepted in Interspeech 2021 链接:https://arxiv.org/abs/2106.08686 摘要:深度神经网络的几种变体已成功地用于建立参数化模型,将可变时长的口语词段投影到固定大小的向量表示或声学词嵌入(AWEs)上。然而,我们在多大程度上可以依赖所得AWE嵌入空间中的距离来估计词形相似性,目前还不清楚。本文提出这样一个问题:声学嵌入空间中的距离是否与语音差异相关?为了回答这个问题,我们实证研究了不同神经结构和学习目标的AWEs监督方法的性能。我们在两种语言(德语和捷克语)的受控环境下训练AWE模型,并在两个任务上评估嵌入:单词辨别和语音相似性。我们的实验表明:(1)最佳情况下嵌入空间中的距离仅与语音距离适度相关;(2)提高单词辨别任务的性能并不一定能得到更好地反映单词语音相似性的模型。我们的发现强调了重新思考当前对AWE的内在评价方法的必要性。 摘要:Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.
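
作为评估流程的一个示意(非论文代码),下面计算词对之间的嵌入余弦距离与音素串编辑距离的 Spearman 相关;其中的"AWE"向量用随机数占位,词表与音素标注均为虚构示例,实际应替换为 AWE 模型的输出和语言学标注。

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr
from scipy.spatial.distance import cosine

def edit_distance(a, b):
    """音素串之间的 Levenshtein 距离,作为音系差异的简单代理。"""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

words = {"katze": "k a t s @", "tatze": "t a t s @", "hund": "h U n t"}   # 虚构词表
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=64) for w in words}          # 占位的"AWE"向量

emb_d, phon_d = [], []
for w1, w2 in combinations(words, 2):
    emb_d.append(cosine(emb[w1], emb[w2]))
    phon_d.append(edit_distance(words[w1].split(), words[w2].split()))
print("Spearman 相关:", spearmanr(emb_d, phon_d))
```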

【6】 Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking 标题:用于改进音乐节拍和弱拍联合跟踪的鼓感知集成结构

作者:Ching-Yu Chiu,Alvin Wen-Yu Su,Yi-Hsuan Yang 备注:Accepted to IEEE Signal Processing Letters (May 2021) 链接:https://arxiv.org/abs/2106.08685 摘要:本文提出了一种新的系统结构,它将盲源分离与音乐音频信号的联合拍和下拍跟踪相结合。信源分离模块将输入信号中的冲击分量和非冲击分量分离开来,在这两个分量上分别进行拍频和下拍跟踪,然后利用可学习的融合机制对结果进行聚合。这样,系统可以自适应地确定输入信号的跟踪结果在多大程度上取决于输入的冲击或非冲击分量。对四个测试集的评估表明,新的架构始终优于广泛采用的不采用源分离的基线架构。 摘要:This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are aggregated with a learnable fusion mechanism. This way, the system can adaptively determine how much the tracking result for an input signal should depend on the input's percussive or non-percussive components. Evaluation on four testing sets that feature different levels of presence of drum sounds shows that the new architecture consistently outperforms the widely-adopted baseline architecture that does not employ source separation.
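
论文中"可学习融合机制"的具体形式以原文为准;下面给出一种最简单的可能实现作为示意:用一个可学习的标量权重,对鼓声分支与非鼓声分支输出的节拍激活序列做凸组合。

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))      # 可学习的融合权重(经 sigmoid)

    def forward(self, act_drum, act_nondrum):          # 两路 (batch, frames) 的节拍激活
        w = torch.sigmoid(self.logit)
        return w * act_drum + (1 - w) * act_nondrum    # 凸组合,随训练自适应调整

fuse = LearnableFusion()
fused = fuse(torch.rand(2, 100), torch.rand(2, 100))
print(fused.shape)
```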

【7】 WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution 标题:WSRGlow:一种基于辉光的音频超分辨率波形生成模型

作者:Kexun Zhang,Yi Ren,Changliang Xu,Zhou Zhao 备注:Accepted by INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.08507 摘要:音频超分辨率是通过添加缺失的频带,从低分辨率(LR)音频构造高分辨率(HR)音频的任务。以往基于卷积神经网络和均方误差训练目标的方法性能较低,而对抗生成模型的训练和调整比较困难。近年来,规范化流以其性能高、训练简单、推理速度快等优点引起了人们的广泛关注。本文提出了一种基于辉光的波形生成模型WSRGlow来实现音频的超分辨率。具体来说,1)我们将WaveNet和Glow结合起来,直接最大化目标HR音频在LR信息条件下的准确可能性;为了从低分辨率音频中提取音频信息,我们提出了一种LR音频编码器和一种STFT编码器,分别从时域和频域对LR信息进行编码。实验结果表明,该模型训练简单,在客观质量和感知质量方面均优于前人的研究成果。WSRGlow也是第一个从12kHz LR音频产生48kHz波形的模型。 摘要:Audio super-resolution is the task of constructing a high-resolution (HR) audio from a low-resolution (LR) audio by adding the missing band. Previous methods based on convolutional neural networks and mean squared error training objective have relatively low performance, while adversarial generative models are difficult to train and tune. Recently, normalizing flow has attracted a lot of attention for its high performance, simple training and fast inference. In this paper, we propose WSRGlow, a Glow-based waveform generative model to perform audio super-resolution. Specifically, 1) we integrate WaveNet and Glow to directly maximize the exact likelihood of the target HR audio conditioned on LR information; and 2) to exploit the audio information from low-resolution audio, we propose an LR audio encoder and an STFT encoder, which encode the LR information from the time domain and frequency domain respectively. The experimental results show that the proposed model is easier to train and outperforms the previous works in terms of both objective and perceptual quality. WSRGlow is also the first model to produce 48kHz waveforms from 12kHz LR audio.

【8】 Tonal Frequencies, Consonance, Dissonance: A Math-Bio Intersection 标题:音调频率、协和与不协和:数学与生物学的交叉

作者:Steve Mathew 备注:9 pages, 1 figure, 1 table 链接:https://arxiv.org/abs/2106.08479 摘要:迄今为止,计算音符的频率需要知道某个参考音符的频率。在这项研究中,我们利用一阶常微分方程建立了一个数学模型,仅根据各音符的序号(note index)来确定其音调频率。在研究的下一部分中,我们以基本的音乐频率为基础进行分析,从理论和神经生物学的角度解释半音阶中不同音符所引起的协和与不协和,其出发点是"有规律的声音模式会引发愉悦感"这一事实。文中讨论了和声丰富性的成因,以及和弦中的声波干涉与协和程度。因为人类的思维是相对地分析一切事物的,所以除最协和的音符之外的任何声音听起来都是不协和的。总之,这项研究清楚地解释了为什么音符乃至整个音乐听起来是这样的。 摘要:To date, calculating the frequencies of musical notes requires one to know the frequency of some reference note. In this study, first-order ordinary differential equations are used to arrive at a mathematical model to determine tonal frequencies using their respective note indices. In the next part of the study, an analysis that is based on the fundamental musical frequencies is conducted to theoretically and neurobiologically explain the consonance and dissonance caused by the different musical notes in the chromatic scale which is based on the fact that systematic patterns of sound invoke pleasure. The reason behind the richness of harmony and the sonic interference and degree of consonance in musical chords are discussed. Since a human mind analyses everything relatively, anything other than the most consonant notes sounds dissonant. In conclusion, the study explains clearly why musical notes and in toto, music sounds the way it does.
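
一个与此相关的标准结果是:十二平均律下相邻半音的频率比为 2^(1/12);把"频率随音符序号的变化率与频率本身成正比"写成一阶常微分方程 df/dn = (ln2/12)·f,其解即 f(n) = f_ref·2^(n/12)。论文的目标是不依赖参考音频率的模型,下面的片段仍以 A4 = 440 Hz 为参考,仅用来示意音符序号与频率之间的关系。

```python
def note_freq(midi_number: int, f_ref: float = 440.0, ref_midi: int = 69) -> float:
    """十二平均律:相对参考音(默认 A4 = MIDI 69)按半音数缩放频率。"""
    return f_ref * 2.0 ** ((midi_number - ref_midi) / 12.0)

for name, m in [("C4", 60), ("A4", 69), ("C5", 72)]:
    print(name, round(note_freq(m), 2), "Hz")   # C4≈261.63, A4=440.0, C5≈523.25
```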

【9】 Pathological voice adaptation with autoencoder-based voice conversion 标题:基于自动编码器的语音转换的病理性语音适配

作者:Marc Illa,Bence Mark Halpern,Rob van Son,Laureano Moro-Velazquez,Odette Scharenborg 备注:6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021) 链接:https://arxiv.org/abs/2106.08427 摘要:本文提出了一种新的病态语音合成方法。而不是使用健康的语音作为来源,我们定制一个现有的病理性语音样本,以一个新的说话人的声音特征。这种方法减轻了将典型语音转换为病理性语音时的评估问题,因为在我们的方法中,语音转换(VC)模型不需要针对语音退化进行优化,而只需要针对说话人的变化进行优化。优化中的这种变化确保了在自然度中发现的任何退化都是由于转换过程,而不是由于模型夸大了语音病理学的特征。为了证明该方法的有效性,我们使用UASpeech数据库和基于VC技术的自动编码器对构音障碍语音进行了转换。主观评价结果表明,虽然较低的清晰度可能会导致中、低清晰度说话人的自然度得分与地面真实度相比有一定程度的下降,但对于高清晰度、有构音障碍的说话人来说,自然度是合理的。低清晰度和高清晰度说话人的说话人特征转换是成功的,但中等清晰度说话人的转换不是成功的。不同清晰度水平的转换结果的差异是由于清晰度水平的不同还是由于说话人的不同需要进一步研究。 摘要:In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need to be optimised for speech degradation but only for the speaker change. This change in the optimisation ensures that any degradation found in naturalness is due to the conversion process and not due to the model exaggerating characteristics of a speech pathology. To show a proof of concept of this method, we convert dysarthric speech using the UASpeech database and an autoencoder-based VC technique. Subjective evaluation results show reasonable naturalness for high intelligibility dysarthric speakers, though lower intelligibility seems to introduce a marginal degradation in naturalness scores for mid and low intelligibility speakers compared to ground truth. Conversion of speaker characteristics for low and high intelligibility speakers is successful, but not for mid. Whether the differences in the results for the different intelligibility levels is due to the intelligibility levels or due to the speakers needs to be further investigated.

【10】 A Flow-Based Neural Network for Time Domain Speech Enhancement 标题:一种基于流的时域语音增强神经网络

作者:Martin Strauss,Bernd Edler 备注:Accepted to ICASSP 2021 链接:https://arxiv.org/abs/2106.09008 摘要:语音增强包括区分目标语音信号和入侵背景。近年来,使用变分自动编码器或生成对抗网络的生成方法得到了越来越多的应用,但基于规范化流(NF)的系统尽管在相关领域取得了成功,但仍然比较落后。因此,在本文中,我们提出了一个NF框架来直接模拟增强过程,该框架通过对干净语音的密度估计来实现。语音合成中的WaveGlow模型可以在时域直接增强含噪语音。此外,我们还证明了非线性输入压扩通过均衡输入样本的分布来提高模型的性能。在一个公开的数据集上进行的实验评估显示,与目前最先进的基于GAN的方法具有可比性,同时使用客观的评估指标超越了所选择的基线。 摘要:Speech enhancement involves the distinction of a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarse, despite their success in related fields. Thus, in this paper we propose a NF framework to directly model the enhancement process by density estimation of clean speech utterances conditioned on their noisy counterpart. The WaveGlow model from speech synthesis is adapted to enable direct enhancement of noisy utterances in time domain. In addition, we demonstrate that nonlinear input companding benefits the model performance by equalizing the distribution of input samples. Experimental evaluation on a publicly available dataset shows comparable results to current state-of-the-art GAN-based approaches, while surpassing the chosen baselines using objective evaluation metrics.
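
摘要提到用"非线性输入压扩"来均衡输入样本的分布;具体压扩函数以论文为准,下面给出语音处理中常见的 μ 律压扩作为一个可能的示意实现,并检查其可逆性。按摘要的描述,压扩作用于输入波形样本,是否以及何时做逆变换取决于具体系统设计。

```python
import numpy as np

def mu_law_compress(x: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """μ 律压缩:压缩大幅度样本,使样本分布更均匀。"""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """μ 律压缩的逆变换。"""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1, 1, 5)
assert np.allclose(mu_law_expand(mu_law_compress(x)), x)   # 可逆性检查
```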

【11】 Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition 标题:用于半监督语音识别的动量伪标注

作者:Yosuke Higuchi,Niko Moritz,Jonathan Le Roux,Takaaki Hori 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.08922 摘要:伪标记(PL)在半监督自动语音识别(ASR)中是一种有效的方法,该方法利用未标记数据生成的伪标记对基础模型进行自训练。随着模型的发展,通过迭代更新伪标签,PL可以得到进一步的改进,但是以前的大多数方法都涉及到模型的无效再训练或标签更新的复杂控制。本文提出了一种简单有效的半监督ASR策略&动量伪标记(MPL)。MPL由一对在线和离线模型组成,这些模型受mean-teacher方法的启发,可以相互交流和学习。在线模型被训练来预测离线模型动态生成的伪标签。离线模型保持在线模型基于动量的移动平均值。MPL是在单个训练过程中进行的,两个模型之间的交互有效地帮助它们相互增强,从而提高了ASR的性能。我们将MPL应用到一个基于连接主义时间分类的端到端ASR模型中。实验结果表明,MPL有效地改善了基本模型的性能,并可扩展到不同数据量或域不匹配的半监督场景。 摘要:Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process and the interaction between the two models effectively helps them reinforce each other to improve the ASR performance. We apply MPL to an end-to-end ASR model based on the connectionist temporal classification. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch.
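
下面是离线(教师)模型对在线(学生)模型参数做基于动量的滑动平均(EMA)的最小示意;动量系数、模型结构与更新时机均为假设,仅说明这种参数更新的写法。在训练中,在线模型每在离线模型生成的伪标签上更新一步,就调用一次 momentum_update,使教师模型缓慢跟随学生模型。

```python
import copy
import torch
import torch.nn as nn

online = nn.Linear(80, 500)                  # 在线(学生)模型,此处用单层占位
offline = copy.deepcopy(online)              # 离线(教师)模型,初始为其拷贝

@torch.no_grad()
def momentum_update(online: nn.Module, offline: nn.Module, alpha: float = 0.999):
    """offline <- alpha * offline + (1 - alpha) * online (逐参数 EMA)。"""
    for p_on, p_off in zip(online.parameters(), offline.parameters()):
        p_off.mul_(alpha).add_(p_on, alpha=1.0 - alpha)

momentum_update(online, offline)             # 每个训练步之后调用
```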

【12】 Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion 标题:基于识别合成的非并行语音转换中丰富源风格转换

作者:Zhichao Wang,Xinyong Zhou,Fengyu Yang,Tao Li,Hongqiang Du,Lei Xie,Wendong Gan,Haitao Chen,Hai Li 备注:Accepted by Interspeech 2021 链接:https://arxiv.org/abs/2106.08741 摘要:目前的语音转换(VC)方法可以成功地转换音频的音色。由于对源音频进行有效的韵律建模是一项具有挑战性的任务,因此将源音频的风格转换为转换后的语音仍然存在局限性。本文提出了一种基于识别综合框架的信源风格转换方法。以前在语音生成任务中,韵律可以用韵律特征显式建模,也可以用潜在的韵律抽取器隐式建模。在本文中,我们利用这两种方法的优点,以一种混合的方式对韵律进行建模,在所提出的韵律模块中有效地结合了显式和隐式方法。具体地说,用韵律特征对韵律进行显式建模,用VAE和参考编码器对韵律进行隐式建模,分别以Mel谱和瓶颈特征作为输入。此外,还引入了对抗性训练来去除VAE输出中与说话人相关的信息,避免了在转换风格时泄漏说话人信息。最后,我们使用一个改进的基于自我注意的编码器从瓶颈特征中提取句子上下文,并从分层表示中隐式地聚合源语音的韵律方面。实验表明,该方法在风格转换方面优于基线和竞争系统;同时,保持了语音质量和说话人相似度。 摘要:Current voice conversion (VC) methods can successfully convert timbre of the audio. As modeling source audio's prosody effectively is a challenging task, there are still limitations of transferring source style to the converted speech. This study proposes a source style transfer method based on recognition-synthesis framework. Previously in speech generation task, prosody can be modeled explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantages of both, we model the prosody in a hybrid manner, which effectively combines explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to explicit model prosody, while VAE and reference encoder are used to implicitly model prosody, which take Mel spectrum and bottleneck feature as input respectively. Furthermore, adversarial training is introduced to remove speaker-related information from the VAE outputs, avoiding leaking source speaker information while transferring style. Finally, we use a modified self-attention based encoder to extract sentential context from bottleneck features, which also implicitly aggregates the prosodic aspects of source speech from the layered representations. Experiments show that our approach is superior to the baseline and a competitive system in terms of style transfer; meanwhile, the speech quality and speaker similarity are well maintained.

【13】 Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI 标题:实时MRI中基于声道形状动力学的无声语音和情感识别

作者:Laxmi Pandey,Ahmed Sabbir Arif 备注:8 pages 链接:https://arxiv.org/abs/2106.08706 摘要:口语的语音是通过改变发音器在声道周围的结构来获得的。它们包含了丰富的信息,可以用来更好地理解人类言语产生的潜在机制。我们提出了一种新的基于深度神经网络的学习框架,该框架能够理解语音产生过程中声道形状可变长度序列中的声学信息,并通过实时磁共振成像(rtMRI)捕获这些信息,然后将其翻译成文本。提出的框架包括时空卷积、循环网络和连接主义时间分类损失,完全端到端训练。在USC-TIMIT语料库上,该模型在句子水平上的识别率达到了40.6%,比现有的模型要好得多。据我们所知,这是第一个研究表明,识别整个口语句子的基础上个人的发音运动捕捉到的rtMRI视频。我们还分析了不同情绪和性别的声道各亚区(即咽部、腭部和背侧、硬腭、唇部收缩区)发音几何结构的变化。结果表明,每个子区域的失真都受到情绪和性别的影响。 摘要:Speech sounds of spoken language are obtained by varying configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanism of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translate it into text. The proposed framework comprises of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a 40.6% PER at sentence-level, much better compared to the existing models. To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's articulatory motions captured by rtMRI video. We also performed an analysis of variations in the geometry of articulation in each sub-regions of the vocal tract (i.e., pharyngeal, velar and dorsal, hard palate, labial constriction region) with respect to different emotions and genders. Results suggest that each sub-regions distortion is affected by both emotion and gender.

【14】 DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement 标题:DCCRN+:用于语音增强的信噪比估计信道子带DCCRN

作者:Shubo Lv,Yanxin Hu,Shimin Zhang,Lei Xie 链接:https://arxiv.org/abs/2106.08672 摘要:深复卷积递归网络(DCCRN)是CRN的一种扩展,具有复杂的结构,在interspeech2020深噪声抑制挑战赛(DNS2020)的MOS评估中取得了优异的性能。本文进一步扩展了DCCRN,并做了以下重要的修改。我们首先将该模型扩展到子带处理,在子带处理中,用可学习的神经网络滤波器而不是工程化的FIR滤波器来分割和合并频带,从而得到一个以端到端方式训练的更快的噪声抑制器。然后用复杂的TF-LSTM进一步代替LSTM,以更好地模拟沿时间轴和频率轴的时间相关性。此外,我们不是简单地将每个编码器层的输出连接到相应解码器层的输入,而是使用卷积块首先从编码器输出聚合基本信息,然后再将其馈送到解码器层。为了在去除噪声的同时保持良好的语音质量,我们特别设计了一个额外的先验信噪比估计模块。最后采用后处理模块进一步抑制非自然残余噪声。新模型名为DCCRN+,在PESQ和DNSMOS方面已经超过了原有的DCCRN以及一些竞争模型,并在新的Interspeech 2021 DNS挑战中取得了优异的性能 摘要:Deep complex convolution recurrent network (DCCRN), which extends CRN with complex structure, has achieved superior performance in MOS evaluation in Interspeech 2020 deep noise suppression challenge (DNS2020). This paper further extends DCCRN with the following significant revisions. We first extend the model to sub-band processing where the bands are split and merged by learnable neural network filters instead of engineered FIR filters, leading to a faster noise suppressor trained in an end-to-end manner. Then the LSTM is further substituted with a complex TF-LSTM to better model temporal dependencies along both time and frequency axes. Moreover, instead of simply concatenating the output of each encoder layer to the input of the corresponding decoder layer, we use convolution blocks to first aggregate essential information from the encoder output before feeding it to the decoder layers. We specifically formulate the decoder with an extra a priori SNR estimation module to maintain good speech quality while removing noise. Finally a post-processing module is adopted to further suppress the unnatural residual noise. The new model, named DCCRN+, has surpassed the original DCCRN as well as several competitive models in terms of PESQ and DNSMOS, and has achieved superior performance in the new Interspeech 2021 DNS challenge

【15】 Improving the expressiveness of neural vocoding with non-affine Normalizing Flows 标题:用非仿射归一化流程提高神经声码的表达能力

作者:Adam Gabryś,Yunlong Jiao,Viacheslav Klimkov,Daniel Korzekwa,Roberto Barra-Chicote 备注:Accepted to Interspeech 2021, 5 pages,3 figures 链接:https://arxiv.org/abs/2106.08649 摘要:本文提出了一种通用的神经声码规范化流增强算法。作为一个案例研究,我们改进了表达性语音声码与改进的平行波网(PW)。特别地,我们建议将PW的仿射变换推广到更具表现力的可逆非仿射函数。在波形重建和文本到语音(TTS)任务中,改进后的PW具有更高的表达能力,可以获得更好的感知信号质量和自然度。我们在一个多说话人、多语言的数据集上评估了不同说话风格的模型。在波形重建任务中,提出的模型将原始波形到记录的自然度和信号质量差距缩小了10\%$,将其他最先进的神经声码系统的自然度和信号质量差距缩小了60\%$。我们还展示了在评估测试集上客观度量的改进,与仿射PW相比,L2谱距离和交叉熵分别减少了$3\%$和$6\unicode{x2030}$。此外,我们扩展了PW的概率密度蒸馏方法,使之适用于任何非仿射可逆可微函数。 摘要:This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap from the original PW to recordings by $10\%$, and from other state-of-the-art neural vocoding systems by more than $60\%$. We also demonstrate improvements in objective metrics on the evaluation test set with L2 Spectral Distance and Cross-Entropy reduced by $3\%$ and $6\unicode{x2030}$ comparing to the affine PW. Furthermore, we extend the probability density distillation procedure proposed by the original PW paper, so that it works with any non-affine invertible and differentiable function.

【16】 Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain 标题:结合非自回归构象CTC和条件说话人链的多说话人ASR

作者:Pengcheng Guo,Xuankai Chang,Shinji Watanabe,Lei Xie 备注:Accepted by Interspeech 2021 链接:https://arxiv.org/abs/2106.08595 摘要:非自回归(NAR)模型在不同的序列间任务上都取得了与自回归(AR)模型相当的结果。然而,针对多序列问题,如多说话人自动语音识别(ASR)等,探索NAR序列方法的研究还很有限。在这项研究中,我们将我们提出的条件链模型扩展到NAR多人ASR。具体而言,使用输入混合语音和先前估计的条件说话人特征逐个推断每个说话人的输出。在每一步中,一个NAR连接时间分类(CTC)编码器被用来执行并行计算。在这种设计中,总的推理步骤将被限制为混合说话人的数量。此外,我们还采用了一致性,并加入了中间CTC损耗,以提高性能。在WSJ0-Mix和LibriMix语料库上的实验表明,我们的模型比其他NAR模型的性能要好,只增加了一点点延迟,分别达到了22.3%和24.9%的WERs。此外,通过包含可变说话人数的数据,我们的模型比PIT-Conformer-AR模型更具优势,仅需1/7的延迟,在WSJ0-2mix和WSJ0-3mix集合上的WERs分别为19.9%和34.3%。我们所有的代码都可以在https://github.com/pengchengguo/espnet/tree/conditional-multispk. 摘要:Non-autoregressive (NAR) models have achieved a large inference computation reduction and comparable results with autoregressive (AR) models on various sequence to sequence tasks. However, there has been limited research aiming to explore the NAR approaches on sequence to multi-sequence problems, like multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both the input mixture speech and previously-estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total inference steps will be restricted to the number of mixed speakers. Besides, we also adopt the Conformer and incorporate an intermediate CTC loss to improve the performance. Experiments on WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase of latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including the data of variable numbers of speakers, our model can even better than the PIT-Conformer AR model with only 1/7 latency, obtaining WERs of 19.9% and 34.3% on WSJ0-2mix and WSJ0-3mix sets. All of our codes are publicly available at https://github.com/pengchengguo/espnet/tree/conditional-multispk.

【17】 Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding 标题:基于辅音-元音段嵌入的乱序语音辅音错误检测

作者:Si-Ioi Ng,Cymie Wing-Yee Ng,Jingyu Li,Tan Lee 备注:Accepted to INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.08536 摘要:语音障碍(Speech-sound disorder,SSD)是指幼儿在预期年龄阶段持续出现语音障碍的一种发育障碍。辅音错误是SSD临床评价的主要指标。以往对SSD自动评估的研究表明,短辅音和过渡辅音的语音错误检测效果较差。本文研究了一种基于神经网络的辅音-元音(CV)双音段检测语音混乱中辅音错误的方法。基本假设是CV段的元音部分携带了辅音共同发音的重要信息。利用递归神经网络模型从CV段中提取语音嵌入。计算测试片段和参考片段嵌入之间的相似性分数,以确定测试片段是否为预期辅音。实验结果表明,使用CV片段可以提高对先前研究中所报道的那些“困难”辅音的语音错误的检测性能。 摘要:Speech sound disorder (SSD) refers to a type of developmental disorder in young children who encounter persistent difficulties in producing certain speech sounds at the expected age. Consonant errors are the major indicator of SSD in clinical assessment. Previous studies on automatic assessment of SSD revealed that detection of speech errors concerning short and transitory consonants is less satisfactory. This paper investigates a neural network based approach to detecting consonant errors in disordered speech using consonant-vowel (CV) diphone segment in comparison to using consonant monophone segment. The underlying assumption is that the vowel part of a CV segment carries important information of co-articulation from the consonant. Speech embeddings are extracted from CV segments by a recurrent neural network model. The similarity scores between the embeddings of the test segment and the reference segments are computed to determine if the test segment is the expected consonant or not. Experimental results show that using CV segments achieves improved performance on detecting speech errors concerning those "difficult" consonants reported in the previous studies.
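
下面给出"片段嵌入 + 相似度打分"流程的一个极简示意(并非论文实现):用 GRU 把变长的 CV 片段特征序列编码成定长向量,再与若干参考片段的嵌入计算平均余弦相似度并设阈值判断;特征维度、网络结构与阈值均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    def __init__(self, feat_dim: int = 39, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)
        return F.normalize(h[-1], dim=-1)  # 取末层隐状态并做 L2 归一化

enc = SegmentEncoder()
test_seg = torch.randn(1, 40, 39)          # 假设的测试 CV 片段特征
refs = torch.randn(5, 60, 39)              # 5 个参考片段(同一目标辅音)
score = (enc(test_seg) @ enc(refs).T).mean()       # 平均余弦相似度
print("是否为预期辅音:", bool(score > 0.5))          # 阈值 0.5 仅为示意
```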

【18】 Global Rhythm Style Transfer Without Text Transcriptions 标题:无文本转录的全局节奏风格转移

作者:Kaizhi Qian,Yang Zhang,Shiyu Chang,Jinjun Xiong,Chuang Gan,David Cox,Mark Hasegawa-Johnson 链接:https://arxiv.org/abs/2106.08519 摘要:韵律在表征说话人的风格或情感方面起着重要的作用,但大多数非平行语音或情感风格转换算法并不转换任何韵律信息。韵律的两个主要组成部分是音高和节奏。从语音中分离韵律信息,特别是其中的节奏成分,是一个挑战,因为它涉及到打破输入语音和分离后语音表示之间的同步性。因此,现有的大多数韵律风格转换算法都需要依赖于某种形式的文本转录来识别内容信息,这使得它们的应用仅限于高资源语言。近年来,SpeechSplit在无监督的韵律风格转换方面取得了很大的进展,但它无法以无监督的方式提取出高层次的整体韵律风格。在本文中,我们提出了AutoPST,它可以在不依赖任何文本转录的情况下从语音中分离出整体韵律风格。AutoPST是一个基于自动编码器的韵律风格转换框架,包含一个在自表达表征学习指导下的彻底的节奏去除模块。在不同风格转换任务上的实验表明,AutoPST能有效地转换出正确反映目标领域风格的韵律。 摘要:Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms would need to rely on some form of text transcriptions to identify the content information, which confines their application to high-resource languages only. Recently, SpeechSplit has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by the self-expressive representation learning. Experiments on different style transfer tasks show that AutoPST can effectively convert prosody that correctly reflects the styles of the target domains.

【19】 Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis 标题:Ctrl-P:语音合成中韵律变化的时间控制

作者:Devang S Ram Mohan,Vivian Hu,Tian Huey Teh,Alexandra Torresquintero,Christopher G. R. Wallis,Marlene Staib,Lorenzo Foglianti,Jiameng Gao,Simon King 备注:To be published in Interspeech 2021. 5 pages, 4 figures 链接:https://arxiv.org/abs/2106.08352 摘要:文本不能完全指定口语形式,因此文本到语音模型必须能够从语音数据中学习,这些数据的变化方式不是由相应的文本来解释的。减少训练数据中无法解释的变化量的一种方法是提供声学信息作为额外的学习信号。在生成语音时,修改此声学信息可以为同一文本生成多个不同的演绎版本。由于许多无法解释的变化都发生在韵律中,我们提出了一个模型,它在生成语音时显式地以韵律的三个主要声学相关量(基频$F_{0}$、能量和时长)为条件。该模型在如何指定这些特征的值方面是灵活的:它们可以从外部提供,也可以从文本中预测,或者先预测再进行修改。与采用变分自动编码器学习无监督潜在特征的模型相比,我们的模型提供了更可解释、时间上更精确且更解耦的控制。当从文本中自动预测声学特征时,它产生的语音比带参考编码器的Tacotron 2模型更自然。随后对预测的声学特征进行人在环修改可以显著地进一步提高自然度。 摘要:Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: $F_{0}$, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can significantly further increase naturalness.

3.eess.AS音频处理:

【1】 A Flow-Based Neural Network for Time Domain Speech Enhancement 标题:一种基于流的时域语音增强神经网络

作者:Martin Strauss,Bernd Edler 备注:Accepted to ICASSP 2021 链接:https://arxiv.org/abs/2106.09008 摘要:语音增强包括区分目标语音信号和入侵背景。近年来,使用变分自动编码器或生成对抗网络的生成方法得到了越来越多的应用,但基于规范化流(NF)的系统尽管在相关领域取得了成功,但仍然比较落后。因此,在本文中,我们提出了一个NF框架来直接模拟增强过程,该框架通过对干净语音的密度估计来实现。语音合成中的WaveGlow模型可以在时域直接增强含噪语音。此外,我们还证明了非线性输入压扩通过均衡输入样本的分布来提高模型的性能。在一个公开的数据集上进行的实验评估显示,与目前最先进的基于GAN的方法具有可比性,同时使用客观的评估指标超越了所选择的基线。 摘要:Speech enhancement involves the distinction of a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarse, despite their success in related fields. Thus, in this paper we propose a NF framework to directly model the enhancement process by density estimation of clean speech utterances conditioned on their noisy counterpart. The WaveGlow model from speech synthesis is adapted to enable direct enhancement of noisy utterances in time domain. In addition, we demonstrate that nonlinear input companding benefits the model performance by equalizing the distribution of input samples. Experimental evaluation on a publicly available dataset shows comparable results to current state-of-the-art GAN-based approaches, while surpassing the chosen baselines using objective evaluation metrics.

【2】 Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition 标题:用于半监督语音识别的动量伪标注

作者:Yosuke Higuchi,Niko Moritz,Jonathan Le Roux,Takaaki Hori 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.08922 摘要:伪标记(PL)在半监督自动语音识别(ASR)中是一种有效的方法,该方法利用未标记数据生成的伪标记对基础模型进行自训练。随着模型的发展,通过迭代更新伪标签,PL可以得到进一步的改进,但是以前的大多数方法都涉及到模型的无效再训练或标签更新的复杂控制。本文提出了一种简单有效的半监督ASR策略&动量伪标记(MPL)。MPL由一对在线和离线模型组成,这些模型受mean-teacher方法的启发,可以相互交流和学习。在线模型被训练来预测离线模型动态生成的伪标签。离线模型保持在线模型基于动量的移动平均值。MPL是在单个训练过程中进行的,两个模型之间的交互有效地帮助它们相互增强,从而提高了ASR的性能。我们将MPL应用到一个基于连接主义时间分类的端到端ASR模型中。实验结果表明,MPL有效地改善了基本模型的性能,并可扩展到不同数据量或域不匹配的半监督场景。 摘要:Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process and the interaction between the two models effectively helps them reinforce each other to improve the ASR performance. We apply MPL to an end-to-end ASR model based on the connectionist temporal classification. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch.

【3】 Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion 标题:基于识别合成的非并行语音转换中丰富源风格转换

作者:Zhichao Wang,Xinyong Zhou,Fengyu Yang,Tao Li,Hongqiang Du,Lei Xie,Wendong Gan,Haitao Chen,Hai Li 备注:Accepted by Interspeech 2021 链接:https://arxiv.org/abs/2106.08741 摘要:目前的语音转换(VC)方法可以成功地转换音频的音色。由于对源音频进行有效的韵律建模是一项具有挑战性的任务,因此将源音频的风格转换为转换后的语音仍然存在局限性。本文提出了一种基于识别综合框架的信源风格转换方法。以前在语音生成任务中,韵律可以用韵律特征显式建模,也可以用潜在的韵律抽取器隐式建模。在本文中,我们利用这两种方法的优点,以一种混合的方式对韵律进行建模,在所提出的韵律模块中有效地结合了显式和隐式方法。具体地说,用韵律特征对韵律进行显式建模,用VAE和参考编码器对韵律进行隐式建模,分别以Mel谱和瓶颈特征作为输入。此外,还引入了对抗性训练来去除VAE输出中与说话人相关的信息,避免了在转换风格时泄漏说话人信息。最后,我们使用一个改进的基于自我注意的编码器从瓶颈特征中提取句子上下文,并从分层表示中隐式地聚合源语音的韵律方面。实验表明,该方法在风格转换方面优于基线和竞争系统;同时,保持了语音质量和说话人相似度。 摘要:Current voice conversion (VC) methods can successfully convert timbre of the audio. As modeling source audio's prosody effectively is a challenging task, there are still limitations of transferring source style to the converted speech. This study proposes a source style transfer method based on recognition-synthesis framework. Previously in speech generation task, prosody can be modeled explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantages of both, we model the prosody in a hybrid manner, which effectively combines explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to explicit model prosody, while VAE and reference encoder are used to implicitly model prosody, which take Mel spectrum and bottleneck feature as input respectively. Furthermore, adversarial training is introduced to remove speaker-related information from the VAE outputs, avoiding leaking source speaker information while transferring style. Finally, we use a modified self-attention based encoder to extract sentential context from bottleneck features, which also implicitly aggregates the prosodic aspects of source speech from the layered representations. Experiments show that our approach is superior to the baseline and a competitive system in terms of style transfer; meanwhile, the speech quality and speaker similarity are well maintained.

【4】 Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI 标题:实时MRI中基于声道形状动力学的无声语音和情感识别

作者:Laxmi Pandey,Ahmed Sabbir Arif 备注:8 pages 链接:https://arxiv.org/abs/2106.08706 摘要:口语的语音是通过改变发音器在声道周围的结构来获得的。它们包含了丰富的信息,可以用来更好地理解人类言语产生的潜在机制。我们提出了一种新的基于深度神经网络的学习框架,该框架能够理解语音产生过程中声道形状可变长度序列中的声学信息,并通过实时磁共振成像(rtMRI)捕获这些信息,然后将其翻译成文本。提出的框架包括时空卷积、循环网络和连接主义时间分类损失,完全端到端训练。在USC-TIMIT语料库上,该模型在句子水平上的识别率达到了40.6%,比现有的模型要好得多。据我们所知,这是第一个研究表明,识别整个口语句子的基础上个人的发音运动捕捉到的rtMRI视频。我们还分析了不同情绪和性别的声道各亚区(即咽部、腭部和背侧、硬腭、唇部收缩区)发音几何结构的变化。结果表明,每个子区域的失真都受到情绪和性别的影响。 摘要:Speech sounds of spoken language are obtained by varying configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanism of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translate it into text. The proposed framework comprises of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a 40.6% PER at sentence-level, much better compared to the existing models. To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's articulatory motions captured by rtMRI video. We also performed an analysis of variations in the geometry of articulation in each sub-regions of the vocal tract (i.e., pharyngeal, velar and dorsal, hard palate, labial constriction region) with respect to different emotions and genders. Results suggest that each sub-regions distortion is affected by both emotion and gender.

【5】 DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement 标题:DCCRN+:用于语音增强的信噪比估计信道子带DCCRN

作者:Shubo Lv,Yanxin Hu,Shimin Zhang,Lei Xie 链接:https://arxiv.org/abs/2106.08672 摘要:深复卷积递归网络(DCCRN)是CRN的一种扩展,具有复杂的结构,在interspeech2020深噪声抑制挑战赛(DNS2020)的MOS评估中取得了优异的性能。本文进一步扩展了DCCRN,并做了以下重要的修改。我们首先将该模型扩展到子带处理,在子带处理中,用可学习的神经网络滤波器而不是工程化的FIR滤波器来分割和合并频带,从而得到一个以端到端方式训练的更快的噪声抑制器。然后用复杂的TF-LSTM进一步代替LSTM,以更好地模拟沿时间轴和频率轴的时间相关性。此外,我们不是简单地将每个编码器层的输出连接到相应解码器层的输入,而是使用卷积块首先从编码器输出聚合基本信息,然后再将其馈送到解码器层。为了在去除噪声的同时保持良好的语音质量,我们特别设计了一个额外的先验信噪比估计模块。最后采用后处理模块进一步抑制非自然残余噪声。新模型名为DCCRN+,在PESQ和DNSMOS方面已经超过了原有的DCCRN以及一些竞争模型,并在新的Interspeech 2021 DNS挑战中取得了优异的性能 摘要:Deep complex convolution recurrent network (DCCRN), which extends CRN with complex structure, has achieved superior performance in MOS evaluation in Interspeech 2020 deep noise suppression challenge (DNS2020). This paper further extends DCCRN with the following significant revisions. We first extend the model to sub-band processing where the bands are split and merged by learnable neural network filters instead of engineered FIR filters, leading to a faster noise suppressor trained in an end-to-end manner. Then the LSTM is further substituted with a complex TF-LSTM to better model temporal dependencies along both time and frequency axes. Moreover, instead of simply concatenating the output of each encoder layer to the input of the corresponding decoder layer, we use convolution blocks to first aggregate essential information from the encoder output before feeding it to the decoder layers. We specifically formulate the decoder with an extra a priori SNR estimation module to maintain good speech quality while removing noise. Finally a post-processing module is adopted to further suppress the unnatural residual noise. The new model, named DCCRN+, has surpassed the original DCCRN as well as several competitive models in terms of PESQ and DNSMOS, and has achieved superior performance in the new Interspeech 2021 DNS challenge

【6】 Improving the expressiveness of neural vocoding with non-affine Normalizing Flows 标题:用非仿射归一化流程提高神经声码的表达能力

作者:Adam Gabryś,Yunlong Jiao,Viacheslav Klimkov,Daniel Korzekwa,Roberto Barra-Chicote 备注:Accepted to Interspeech 2021, 5 pages,3 figures 链接:https://arxiv.org/abs/2106.08649 摘要:本文提出了一种通用的神经声码规范化流增强算法。作为一个案例研究,我们改进了表达性语音声码与改进的平行波网(PW)。特别地,我们建议将PW的仿射变换推广到更具表现力的可逆非仿射函数。在波形重建和文本到语音(TTS)任务中,改进后的PW具有更高的表达能力,可以获得更好的感知信号质量和自然度。我们在一个多说话人、多语言的数据集上评估了不同说话风格的模型。在波形重建任务中,提出的模型将原始波形到记录的自然度和信号质量差距缩小了10\%$,将其他最先进的神经声码系统的自然度和信号质量差距缩小了60\%$。我们还展示了在评估测试集上客观度量的改进,与仿射PW相比,L2谱距离和交叉熵分别减少了$3\%$和$6\unicode{x2030}$。此外,我们扩展了PW的概率密度蒸馏方法,使之适用于任何非仿射可逆可微函数。 摘要:This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap from the original PW to recordings by $10\%$, and from other state-of-the-art neural vocoding systems by more than $60\%$. We also demonstrate improvements in objective metrics on the evaluation test set with L2 Spectral Distance and Cross-Entropy reduced by $3\%$ and $6\unicode{x2030}$ comparing to the affine PW. Furthermore, we extend the probability density distillation procedure proposed by the original PW paper, so that it works with any non-affine invertible and differentiable function.

【7】 Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain 标题:结合非自回归Conformer CTC和条件说话人链的多说话人ASR

作者:Pengcheng Guo,Xuankai Chang,Shinji Watanabe,Lei Xie 备注:Accepted by Interspeech 2021 链接:https://arxiv.org/abs/2106.08595 摘要:非自回归(NAR)模型在不同的序列间任务上都取得了与自回归(AR)模型相当的结果。然而,针对多序列问题,如多说话人自动语音识别(ASR)等,探索NAR序列方法的研究还很有限。在这项研究中,我们将我们提出的条件链模型扩展到NAR多人ASR。具体而言,使用输入混合语音和先前估计的条件说话人特征逐个推断每个说话人的输出。在每一步中,一个NAR连接时间分类(CTC)编码器被用来执行并行计算。在这种设计中,总的推理步骤将被限制为混合说话人的数量。此外,我们还采用了一致性,并加入了中间CTC损耗,以提高性能。在WSJ0-Mix和LibriMix语料库上的实验表明,我们的模型比其他NAR模型的性能要好,只增加了一点点延迟,分别达到了22.3%和24.9%的WERs。此外,通过包含可变说话人数的数据,我们的模型比PIT-Conformer-AR模型更具优势,仅需1/7的延迟,在WSJ0-2mix和WSJ0-3mix集合上的WERs分别为19.9%和34.3%。我们所有的代码都可以在https://github.com/pengchengguo/espnet/tree/conditional-multispk. 摘要:Non-autoregressive (NAR) models have achieved a large inference computation reduction and comparable results with autoregressive (AR) models on various sequence to sequence tasks. However, there has been limited research aiming to explore the NAR approaches on sequence to multi-sequence problems, like multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both the input mixture speech and previously-estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total inference steps will be restricted to the number of mixed speakers. Besides, we also adopt the Conformer and incorporate an intermediate CTC loss to improve the performance. Experiments on WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase of latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including the data of variable numbers of speakers, our model can even better than the PIT-Conformer AR model with only 1/7 latency, obtaining WERs of 19.9% and 34.3% on WSJ0-2mix and WSJ0-3mix sets. All of our codes are publicly available at https://github.com/pengchengguo/espnet/tree/conditional-multispk.
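
下面用一个简化的PyTorch骨架示意"条件说话人链 + 非自回归CTC"的流程(非ESPnet官方实现):每一步将混合语音特征与上一步得到的说话人条件特征拼接后送入编码器,并行输出该说话人的CTC logits;为简洁起见,这里用双向LSTM代替论文中的Conformer编码器,隐层维度、词表大小等均为假设值。

```python
import torch
import torch.nn as nn

class ConditionalChainCTC(nn.Module):
    """条件链式多说话人 CTC 的极简示意:每步依赖上一步估计的说话人特征。"""
    def __init__(self, feat_dim=80, hidden=256, vocab=500, max_spk=2):
        super().__init__()
        self.max_spk = max_spk
        self.encoder = nn.LSTM(feat_dim + hidden, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab + 1)   # +1 为 CTC blank
        self.spk_proj = nn.Linear(2 * hidden, hidden)       # 传给下一步的条件特征

    def forward(self, mix_feats):
        # mix_feats: (batch, time, feat_dim)
        B, T, _ = mix_feats.shape
        cond = mix_feats.new_zeros(B, T, self.spk_proj.out_features)
        logits_per_spk = []
        for _ in range(self.max_spk):
            enc, _ = self.encoder(torch.cat([mix_feats, cond], dim=-1))
            logits_per_spk.append(self.ctc_head(enc))  # 每步内部可并行做 CTC 解码(非自回归)
            cond = self.spk_proj(enc)                  # 条件传递给下一位说话人
        return logits_per_spk

if __name__ == "__main__":
    model = ConditionalChainCTC()
    outs = model(torch.randn(2, 100, 80))
    print([o.shape for o in outs])   # 每个说话人一组 (2, 100, 501) 的 CTC logits
```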

【8】 Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding 标题:基于辅音-元音段嵌入的障碍言语辅音错误检测

作者:Si-Ioi Ng,Cymie Wing-Yee Ng,Jingyu Li,Tan Lee 备注:Accepted to INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.08536 摘要:语音障碍(Speech-sound disorder,SSD)是指幼儿在预期年龄阶段持续出现语音障碍的一种发育障碍。辅音错误是SSD临床评价的主要指标。以往对SSD自动评估的研究表明,短辅音和过渡辅音的语音错误检测效果较差。本文研究了一种基于神经网络的辅音-元音(CV)双音段检测语音混乱中辅音错误的方法。基本假设是CV段的元音部分携带了辅音共同发音的重要信息。利用递归神经网络模型从CV段中提取语音嵌入。计算测试片段和参考片段嵌入之间的相似性分数,以确定测试片段是否为预期辅音。实验结果表明,使用CV片段可以提高对先前研究中所报道的那些“困难”辅音的语音错误的检测性能。 摘要:Speech sound disorder (SSD) refers to a type of developmental disorder in young children who encounter persistent difficulties in producing certain speech sounds at the expected age. Consonant errors are the major indicator of SSD in clinical assessment. Previous studies on automatic assessment of SSD revealed that detection of speech errors concerning short and transitory consonants is less satisfactory. This paper investigates a neural network based approach to detecting consonant errors in disordered speech using consonant-vowel (CV) diphone segment in comparison to using consonant monophone segment. The underlying assumption is that the vowel part of a CV segment carries important information of co-articulation from the consonant. Speech embeddings are extracted from CV segments by a recurrent neural network model. The similarity scores between the embeddings of the test segment and the reference segments are computed to determine if the test segment is the expected consonant or not. Experimental results show that using CV segments achieves improved performance on detecting speech errors concerning those "difficult" consonants reported in the previous studies.
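
以下是一个示意性的PyTorch草稿(非论文官方实现),演示"用循环网络把CV片段编码为定长嵌入,再用与参考片段的余弦相似度判断辅音是否符合预期"的思路;特征维度、嵌入维度与阈值策略均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVSegmentEmbedder(nn.Module):
    """用 GRU 把变长的辅音-元音(CV)片段映射为定长嵌入(示意)。"""
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, seg):
        # seg: (batch, frames, feat_dim),例如 MFCC 序列
        _, h = self.rnn(seg)                    # h: (2, batch, emb_dim)
        h = torch.cat([h[0], h[1]], dim=-1)     # 拼接两个方向的末状态
        return F.normalize(self.proj(h), dim=-1)

def consonant_score(test_seg, ref_segs, embedder):
    """测试片段与若干参考片段嵌入的平均余弦相似度;低于阈值可判为发音错误。"""
    test_emb = embedder(test_seg)               # (1, emb_dim)
    ref_emb = embedder(ref_segs)                # (n_ref, emb_dim)
    return (test_emb @ ref_emb.t()).mean().item()

embedder = CVSegmentEmbedder()
score = consonant_score(torch.randn(1, 40, 39), torch.randn(5, 40, 39), embedder)
print("similarity:", score)
```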

【9】 Global Rhythm Style Transfer Without Text Transcriptions 标题:无文本转录的全局节奏风格转移

作者:Kaizhi Qian,Yang Zhang,Shiyu Chang,Jinjun Xiong,Chuang Gan,David Cox,Mark Hasegawa-Johnson 链接:https://arxiv.org/abs/2106.08519 摘要:韵律在表征说话人的风格或情感方面起着重要的作用,但大多数非平行语音或情感风格转换算法并不转换任何韵律信息。韵律的两个主要组成部分是音高和节奏。从语音中分离韵律信息,特别是韵律成分是一个挑战,因为它涉及到打破输入语音和分离语音表示之间的同步性。因此,现有的大多数韵律风格转换算法都需要依赖于某种形式的文本转录来识别文本中的内容信息,这使得它们的应用仅限于高资源语言。近年来,SpeechSplit在无监督的韵律风格转换方面取得了很大的进展,但它无法以无监督的方式提取出高层次的整体韵律风格。在本文中,我们提出了AutoPST,它可以在不依赖任何文本转录的情况下从语音中分离出整体韵律风格。AutoPST是一个基于自动编码器的韵律风格转换框架,在自表达表征学习的指导下,具有一个完整的韵律去除模块。在不同风格转换任务上的实验表明,AutoPST能有效地转换出正确反映目标领域风格的韵律。 摘要:Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms would need to rely on some form of text transcriptions to identify the content information, which confines their application to high-resource languages only. Recently, SpeechSplit has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by the self-expressive representation learning. Experiments on different style transfer tasks show that AutoPST can effectively convert prosody that correctly reflects the styles of the target domains.

【10】 Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis 标题:Ctrl-P:语音合成中韵律变化的时间控制

作者:Devang S Ram Mohan,Vivian Hu,Tian Huey Teh,Alexandra Torresquintero,Christopher G. R. Wallis,Marlene Staib,Lorenzo Foglianti,Jiameng Gao,Simon King 备注:To be published in Interspeech 2021. 5 pages, 4 figures 链接:https://arxiv.org/abs/2106.08352 摘要:文本不能完全指定口语形式,因此文本到语音模型必须能够从语音数据中学习,这些数据的变化方式不是由相应的文本来解释的。减少训练数据中无法解释的变化量的一种方法是提供声学信息作为额外的学习信号。在生成语音时,修改此声学信息可以生成同一文本的多个不同演绎版本。由于许多无法解释的变化都发生在韵律中,我们提出了一个模型,该模型在生成语音时显式地以韵律的三个主要声学相关量($F_{0}$、能量和时长)为条件。该模型在如何指定这些特征的取值方面是灵活的:它们可以从外部提供,也可以从文本中预测,或者先预测再进行修改。与采用变分自动编码器学习无监督潜在特征的模型相比,我们的模型提供了更可解释、时间上更精确、且相互解耦的控制。当从文本中自动预测声学特征时,它产生的语音比带参考编码器的Tacotron 2模型更自然。随后对预测的声学特征进行人在环修改可以进一步显著提高自然度。 摘要:Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: $F_{0}$, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can significantly further increase naturalness.
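
下面的小段代码(非论文官方实现)示意Ctrl-P式的条件机制:先按音素时长把音素级表示上采样到帧级,再与逐帧的F0与能量拼接作为解码器输入;这些声学特征既可外部给定,也可由文本预测后再人工修改。维度与取值均为演示用的假设。

```python
import torch

def condition_on_prosody(phone_emb, durations, f0, energy):
    """把音素级嵌入按时长上采样到帧级,再与 F0/能量逐帧拼接(示意)。
    phone_emb: (n_phones, emb_dim); durations: (n_phones,) 每个音素的帧数;
    f0, energy: (total_frames,),可来自外部指定、文本预测或人工修改。"""
    frame_emb = torch.repeat_interleave(phone_emb, durations, dim=0)  # (total_frames, emb_dim)
    prosody = torch.stack([f0, energy], dim=-1)                       # (total_frames, 2)
    return torch.cat([frame_emb, prosody], dim=-1)                    # 作为解码器输入

phone_emb = torch.randn(4, 64)             # 4 个音素的嵌入
durations = torch.tensor([3, 5, 2, 4])     # 每个音素的帧数
total = int(durations.sum())
f0 = torch.rand(total) * 200 + 100         # 假设的逐帧 F0 轨迹 (Hz)
energy = torch.rand(total)                 # 假设的逐帧能量
dec_in = condition_on_prosody(phone_emb, durations, f0, energy)
print(dec_in.shape)                        # (14, 66)
```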

【11】 End-to-End Spoken Language Understanding for Generalized Voice Assistants 标题:通用语音助理的端到端口语理解

作者:Michael Saxon,Samridhi Choudhary,Joseph P. McKenna,Athanasios Mouchtaris 备注:Accepted to Interspeech 2021; 5 pages, 2 tables, 1 figure 链接:https://arxiv.org/abs/2106.09009 摘要:端到端(E2E)口语理解(SLU)系统使用单一模型直接从语音中预测话语语义。这方面的工作主要集中在固定域中的目标任务,其中输出语义结构是先验的,输入语音的复杂性是有限的。在这项工作中,我们提出了我们的方法来开发一个E2E模型的广义SLU在商业语音助理(VAs)。我们提出了一个完全可微的、基于Transformer的、层次化的系统,可以在ASR和NLU两个层次上进行预训练。然后对转录和语义分类损失进行微调,以处理不同的意图和参数组合。这导致SLU系统在复杂的内部广义VA数据集上实现了显著的基线改进,准确率提高了43%,同时仍然满足流行的Fluent Speech Commands数据集上99%的准确率基准。我们在一个硬测试集上进一步评估了我们的模型,该测试集只包含训练中看不到的时隙参数,并展示了近20%的改进,显示了我们的方法在真正苛刻的VA场景中的有效性。 摘要:End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.

【12】 Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments 标题:Voicy:噪声混响环境中的零样本非平行语音转换

作者:Alejandro Mottini,Jaime Lorenzo-Trueba,Sri Vishnu Kumar Karlapati,Thomas Drugman 备注:Presented at the Speech Synthesis Workshops 2021 (SSW11) 链接:https://arxiv.org/abs/2106.08873 摘要:语音转换(Voice Conversion,VC)是一种通过转换源语的非语言信息来改变说话人身份的技术。虽然有大量关于VC的文献,但是大多数提出的方法都是在干净的语音记录上进行训练和评估的。然而,许多声学环境是噪声和混响的,严重限制了流行的VC方法对此类场景的适用性。为了解决这个局限性,我们提出了Voicy,一个专门针对带噪语音设计的新VC框架。我们的方法受去噪自动编码器框架的启发,由四个编码器(说话人、内容、语音学和声学ASR)和一个解码器组成。重要的是,Voicy能够执行非平行零样本VC,这对任何需要处理训练中未见说话人的VC系统来说都是一项重要要求。我们已经使用LibriSpeech数据集的一个噪声混响版本验证了我们的方法。实验结果表明,在混响噪声环境中,Voicy在自然度和目标说话人相似度方面优于其他测试的VC技术。 摘要:Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.
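
下面给出一个"多编码器 + 单解码器"的极简骨架(非Voicy官方实现),示意说话人、内容、语音学与声学ASR四路编码结果如何拼接后供解码器重建目标说话人的梅尔谱;各编码器在此都用GRU代替,维度均为假设值。

```python
import torch
import torch.nn as nn

class MultiEncoderVC(nn.Module):
    """四编码器 + 单解码器的声音转换骨架(示意,维度为假设值)。"""
    def __init__(self, n_mels=80, spk_dim=64, content_dim=128, phon_dim=64, asr_dim=64):
        super().__init__()
        self.spk_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
        self.phon_enc = nn.GRU(n_mels, phon_dim, batch_first=True)
        self.asr_enc = nn.GRU(n_mels, asr_dim, batch_first=True)
        self.decoder = nn.GRU(spk_dim + content_dim + phon_dim + asr_dim,
                              256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, noisy_mel, target_spk_mel):
        _, spk = self.spk_enc(target_spk_mel)        # 目标说话人的整体嵌入
        content, _ = self.content_enc(noisy_mel)     # 帧级内容表示
        phon, _ = self.phon_enc(noisy_mel)
        asr, _ = self.asr_enc(noisy_mel)
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        dec_in = torch.cat([spk, content, phon, asr], dim=-1)
        dec, _ = self.decoder(dec_in)
        return self.out(dec)                         # 预测目标说话人的干净 mel 谱

model = MultiEncoderVC()
mel = model(torch.randn(2, 120, 80), torch.randn(2, 200, 80))
print(mel.shape)    # (2, 120, 80)
```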

【13】 Attention-Based Keyword Localisation in Speech using Visual Grounding 标题:基于视觉基础的语音中基于注意力的关键词定位

作者:Kayode Olaleye,Herman Kamper 备注:Accepted to Interspeech 2021 链接:https://arxiv.org/abs/2106.08859 摘要:基于视觉的语音模型从与语音字幕配对的图像中学习。通过使用一个经过训练的具有固定词汇量的视觉分类器,用软文本标签标记图像,先前的工作已经表明,可以训练一个能够检测特定文本关键字是否出现在言语中的模型。在这里,我们研究了基于视觉的语音模型是否也可以进行关键词定位:在没有任何明确的基于文本或对齐监督的情况下,预测一个给定的文本关键词在话语中出现的位置。我们特别考虑将注意力纳入卷积模型是否有利于定位。尽管视觉监督模型的绝对定位性能仍然不高(与使用无序的文字标签进行监督相比),但我们表明,与以前的视觉监督模型相比,注意力在性能上有很大的提高。与其他许多语音图像研究一样,我们发现许多不正确的定位是由于语义混淆造成的,例如对于查询关键词"swimming",模型却定位到单词"backstroke"。 摘要:Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
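
以下PyTorch草稿(非论文官方实现)演示"把注意力引入卷积模型以同时做检测与定位"的基本做法:用关键词的查询向量对帧级特征打分,softmax后的注意力权重既可汇聚成检测分数,其峰值位置也可作为关键词出现位置的估计;词表大小、特征维度等均为假设。

```python
import torch
import torch.nn as nn

class AttentionKeywordLocaliser(nn.Module):
    """关键词查询向量对帧级语音特征做注意力,权重峰值即定位结果(示意)。"""
    def __init__(self, feat_dim=80, hidden=128, vocab_size=67):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2)
        self.keyword_emb = nn.Embedding(vocab_size, hidden)   # 每个关键词一个查询向量

    def forward(self, feats, keyword_id):
        # feats: (batch, time, feat_dim)
        h = torch.relu(self.conv(feats.transpose(1, 2))).transpose(1, 2)  # (batch, time, hidden)
        q = self.keyword_emb(keyword_id).unsqueeze(-1)                    # (batch, hidden, 1)
        scores = torch.bmm(h, q).squeeze(-1)                              # (batch, time)
        attn = torch.softmax(scores, dim=-1)
        detect_score = (attn * scores).sum(dim=-1)   # "是否出现"的检测分数
        location = attn.argmax(dim=-1)               # "出现在哪里"的帧索引
        return detect_score, location, attn

model = AttentionKeywordLocaliser()
score, loc, attn = model(torch.randn(2, 300, 80), torch.tensor([12, 3]))
print(score.shape, loc)
```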

【14】 Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking 标题:基于源分离的数据增强改进节拍和下拍联合跟踪

作者:Ching-Yu Chiu,Joann Ching,Wen-Yi Hsiao,Yu-Hua Chen,Alvin Wen-Yu Su,Yi-Hsuan Yang 备注:Accepted to European Signal Processing Conference (EUSIPCO 2021) 链接:https://arxiv.org/abs/2106.08703 摘要:近年来,随着深度学习技术的发展,音乐音频信号中自动拍和下拍跟踪的性能有了很大的提高。在训练这种基于深度学习的模型时,数据扩充被认为是一种重要的技术。然而,现有的数据扩充方法主要是为了平衡训练数据的分布与速度之间的关系。在这篇论文中,我们探讨另一种资料扩充的方法,以说明训练资料的组成,在打击和非打击声源。具体地说,我们提出了一种盲鼓分离模型,从每个训练音频信号中分离出鼓声和非鼓声,过滤掉无鼓的训练信号,然后利用得到的鼓声和非鼓声来增强训练数据。我们在四个完全看不见的测试集上进行了实验,验证了所提方法的有效性,并相应地验证了鼓声合成在拍和下拍跟踪训练数据中的重要性。 摘要:Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found an important technique. However, existing data augmentation methods for this task mainly target at balancing the distribution of the training data with respect to their tempo. In this paper, we investigate another approach for data augmentation, to account for the composition of the training data in terms of the percussive and non-percussive sound sources. Specifically, we propose to employ a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal, filtering out training signals that are drumless, and then use the obtained drum and non-drum stems to augment the training data. We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method, and accordingly the importance of drum sound composition in the training data for beat and downbeat tracking.
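
下面的伪代码式草稿示意摘要描述的增强流程:先分离鼓声与非鼓声stem,过滤掉几乎无鼓声的样本,再以不同增益重新混合生成新的训练样本(节拍标注不变)。其中 separate_drums 是占位函数,能量阈值与增益取值均为假设,并非论文给出的具体数值。

```python
import numpy as np

def separate_drums(wav, sr):
    """占位函数:实际需要一个盲鼓分离模型(例如论文中训练的分离网络)。
    这里仅作示意,约定返回 (鼓声 stem, 非鼓声 stem)。"""
    raise NotImplementedError

def augment_with_stems(wav, sr, energy_thresh=1e-4, gains=(0.5, 1.0, 1.5)):
    """按摘要描述的流程做数据增强(示意):
    1) 分离鼓声/非鼓声;2) 过滤几乎无鼓声的样本;3) 以不同增益重新混合生成新样本。"""
    drums, rest = separate_drums(wav, sr)
    drum_energy = float(np.mean(drums ** 2))
    if drum_energy < energy_thresh:                  # 无鼓声的训练信号被过滤掉
        return []
    augmented = [g * drums + rest for g in gains]    # 节拍/下拍标注保持不变
    augmented += [drums, rest]                       # 分离出的 stem 本身也可作为训练数据
    return augmented
```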

【15】 Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study 标题:声学单词嵌入能捕捉到语音相似性吗?一项实证研究

作者:Badr M. Abdullah,Marius Mosbach,Iuliia Zaitova,Bernd Möbius,Dietrich Klakow 备注:Accepted in Interspeech 2021 链接:https://arxiv.org/abs/2106.08686 摘要:深度神经网络的几种变体已成功地用于建立参数化模型,将可变时长的口语词段投影到固定大小的向量表示或声学词嵌入(AWEs)上。然而,我们在多大程度上可以依赖所得AWE空间中的距离来估计词形相似性,目前还不清楚。本文提出这样一个问题:声学嵌入空间中的距离是否与语音学上的差异相关?为了回答这个问题,我们实证研究了不同神经结构和学习目标的AWEs监督方法的性能。我们在两种语言(德语和捷克语)的受控环境下训练AWE模型,并在两个任务上评估嵌入:单词辨别和语音相似性。我们的实验表明:(1)最佳情况下嵌入空间中的距离仅与语音距离适度相关;(2)提高单词辨别任务的性能并不一定能得到更好地反映单词语音相似性的模型。我们的发现强调了重新思考当前对AWEs的内在评价方式的必要性。 摘要:Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.
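
为说明"嵌入距离与音系差异的相关性"如何度量,下面给出一个简化的NumPy/SciPy示例(非论文官方评测脚本):用余弦距离度量AWE之间的差异,用音素串的编辑距离近似语音学差异,再计算两者的Spearman相关;音素表示和距离定义都是简化假设。

```python
import numpy as np
from scipy.stats import spearmanr

def levenshtein(a, b):
    """音素串之间的编辑距离,作为语音学差异的简单代理。"""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

def embedding_vs_phonological_correlation(embeddings, phone_seqs):
    """embeddings: (n_words, dim) 的 AWE;phone_seqs: 每个词的音素序列。
    返回嵌入空间余弦距离与编辑距离之间的 Spearman 相关(论文问题的简化版)。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_dist, phon_dist = [], []
    n = len(phone_seqs)
    for i in range(n):
        for j in range(i + 1, n):
            cos_dist.append(1.0 - float(emb[i] @ emb[j]))
            phon_dist.append(levenshtein(phone_seqs[i], phone_seqs[j]))
    return spearmanr(cos_dist, phon_dist).correlation

# 玩具示例:随机嵌入通常只会得到接近 0 的相关系数
corr = embedding_vs_phonological_correlation(
    np.random.randn(4, 64),
    [["k", "a", "t"], ["k", "a", "p"], ["d", "o", "g"], ["d", "o", "k"]])
print(corr)
```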

【16】 Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking 标题:用于改进音乐节拍和下拍联合跟踪的鼓感知集成结构

作者:Ching-Yu Chiu,Alvin Wen-Yu Su,Yi-Hsuan Yang 备注:Accepted to IEEE Signal Processing Letters (May 2021) 链接:https://arxiv.org/abs/2106.08685 摘要:本文提出了一种新的系统结构,它将盲源分离与音乐音频信号的联合拍和下拍跟踪相结合。信源分离模块将输入信号中的冲击分量和非冲击分量分离开来,在这两个分量上分别进行拍频和下拍跟踪,然后利用可学习的融合机制对结果进行聚合。这样,系统可以自适应地确定输入信号的跟踪结果在多大程度上取决于输入的冲击或非冲击分量。对四个测试集的评估表明,新的架构始终优于广泛采用的不采用源分离的基线架构。 摘要:This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are aggregated with a learnable fusion mechanism. This way, the system can adaptively determine how much the tracking result for an input signal should depend on the input's percussive or non-percussive components. Evaluation on four testing sets that feature different levels of presence of drum sounds shows that the new architecture consistently outperforms the widely-adopted baseline architecture that does not employ source separation.
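
下面用几行PyTorch示意"可学习融合机制"的一种简单实现(非论文官方实现):一个小的门控网络根据两条支路的逐帧节拍激活输出0~1的权重,对鼓声支路与非鼓声支路的结果做自适应加权;门控结构与维度均为假设。

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """对"鼓声支路"和"非鼓声支路"的逐帧节拍激活做可学习加权融合(示意)。"""
    def __init__(self, hidden=16):
        super().__init__()
        # 门控网络依据两路激活自身决定融合权重,使融合随输入自适应
        self.gate = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, act_perc, act_nonperc):
        # act_*: (batch, time) 的节拍激活概率
        x = torch.stack([act_perc, act_nonperc], dim=-1)   # (batch, time, 2)
        w = self.gate(x).squeeze(-1)                       # (batch, time),取值 0~1
        return w * act_perc + (1.0 - w) * act_nonperc

fusion = LearnableFusion()
fused = fusion(torch.rand(1, 1000), torch.rand(1, 1000))
print(fused.shape)
```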

【17】 WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution 标题:WSRGlow:一种基于Glow的音频超分辨率波形生成模型

作者:Kexun Zhang,Yi Ren,Changliang Xu,Zhou Zhao 备注:Accepted by INTERSPEECH 2021 链接:https://arxiv.org/abs/2106.08507 摘要:音频超分辨率是通过添加缺失的频带,从低分辨率(LR)音频构造高分辨率(HR)音频的任务。以往基于卷积神经网络和均方误差训练目标的方法性能较低,而对抗生成模型的训练和调整比较困难。近年来,规范化流以其性能高、训练简单、推理速度快等优点引起了人们的广泛关注。本文提出了一种基于辉光的波形生成模型WSRGlow来实现音频的超分辨率。具体来说,1)我们将WaveNet和Glow结合起来,直接最大化目标HR音频在LR信息条件下的准确可能性;为了从低分辨率音频中提取音频信息,我们提出了一种LR音频编码器和一种STFT编码器,分别从时域和频域对LR信息进行编码。实验结果表明,该模型训练简单,在客观质量和感知质量方面均优于前人的研究成果。WSRGlow也是第一个从12kHz LR音频产生48kHz波形的模型。 摘要:Audio super-resolution is the task of constructing a high-resolution (HR) audio from a low-resolution (LR) audio by adding the missing band. Previous methods based on convolutional neural networks and mean squared error training objective have relatively low performance, while adversarial generative models are difficult to train and tune. Recently, normalizing flow has attracted a lot of attention for its high performance, simple training and fast inference. In this paper, we propose WSRGlow, a Glow-based waveform generative model to perform audio super-resolution. Specifically, 1) we integrate WaveNet and Glow to directly maximize the exact likelihood of the target HR audio conditioned on LR information; and 2) to exploit the audio information from low-resolution audio, we propose an LR audio encoder and an STFT encoder, which encode the LR information from the time domain and frequency domain respectively. The experimental results show that the proposed model is easier to train and outperforms the previous works in terms of both objective and perceptual quality. WSRGlow is also the first model to produce 48kHz waveforms from 12kHz LR audio.
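
以下是示意性的条件编码草稿(非WSRGlow官方实现),演示"同时从时域与频域提取低分辨率音频的条件信息":时域用带步长的一维卷积,频域用torch.stft的幅度谱再接一层卷积,两路特征拼接后可作为Glow的条件输入;n_fft、hop等参数均为假设值。

```python
import torch
import torch.nn as nn

class LRConditioning(nn.Module):
    """从低分辨率波形同时提取时域与频域条件信息(示意)。"""
    def __init__(self, n_fft=512, hop=128, hidden=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.time_enc = nn.Conv1d(1, hidden, kernel_size=hop * 2, stride=hop, padding=hop // 2)
        self.freq_enc = nn.Conv1d(n_fft // 2 + 1, hidden, kernel_size=3, padding=1)

    def forward(self, lr_wav):
        # lr_wav: (batch, time)
        t_feat = self.time_enc(lr_wav.unsqueeze(1))                 # 时域编码
        spec = torch.stft(lr_wav, self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft), return_complex=True)
        f_feat = self.freq_enc(spec.abs())                          # 频域(幅度谱)编码
        T = min(t_feat.size(-1), f_feat.size(-1))
        return torch.cat([t_feat[..., :T], f_feat[..., :T]], dim=1)  # 作为 Glow 的条件输入

cond = LRConditioning()(torch.randn(2, 12000))   # 1 秒 12 kHz 低分辨率音频
print(cond.shape)
```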

【18】 Tonal Frequencies, Consonance, Dissonance: A Math-Bio Intersection 标题:音调频率、协和、不协和:数学-生物交叉点

作者:Steve Mathew 备注:9 pages, 1 figure, 1 table 链接:https://arxiv.org/abs/2106.08479 摘要:迄今为止,计算音符的频率需要知道一些参考音符的频率。本研究利用一阶常微分方程建立数学模型,由音符各自的音符序号确定其音调频率。在下一部分的研究中,我们以基本的音乐频率为基础进行分析,从理论和神经生物学的角度来解释不同的音符在半音阶上所引起的协和与不协和,这是建立在声音的系统模式激发愉悦这一事实的基础上的。文中还探讨了和声丰富性的来源,以及和弦中的声音干涉与协和程度。因为人类的思维是相对地分析一切事物的,所以除了最协和的音符之外,其他任何声音听起来都是不协和的。总之,这项研究清楚地解释了为什么音符和音乐的声音是这样的。 摘要:To date, calculating the frequencies of musical notes requires one to know the frequency of some reference note. In this study, first-order ordinary differential equations are used to arrive at a mathematical model to determine tonal frequencies using their respective note indices. In the next part of the study, an analysis that is based on the fundamental musical frequencies is conducted to theoretically and neurobiologically explain the consonance and dissonance caused by the different musical notes in the chromatic scale which is based on the fact that systematic patterns of sound invoke pleasure. The reason behind the richness of harmony and the sonic interference and degree of consonance in musical chords are discussed. Since a human mind analyses everything relatively, anything other than the most consonant notes sounds dissonant. In conclusion, the study explains clearly why musical notes and in toto, music sounds the way it does.
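
作为对照示意(并非论文的推导),十二平均律下常见的结论可以写成一阶常微分方程 df/dn = (ln2/12)·f 的解 f(n) = f_ref·2^(n/12);下面的Python片段按此公式计算以A4 = 440 Hz为参考的一个八度内各半音频率,并给出纯五度近似3:2的频率比。注意论文声称不依赖参考音频率,这里的f_ref仅是演示用的假设。

```python
def note_frequency(n, f_ref=440.0):
    """一阶常微分方程 df/dn = (ln2/12)*f 的解:f(n) = f_ref * 2**(n/12)。
    n 为相对参考音(这里假设 A4 = 440 Hz)的半音数,可为负。"""
    return f_ref * 2.0 ** (n / 12.0)

# 打印以 A4 为起点的一个八度内的半音频率
names = ["A4", "A#4", "B4", "C5", "C#5", "D5", "D#5",
         "E5", "F5", "F#5", "G5", "G#5", "A5"]
for i, name in enumerate(names):
    print(f"{name:4s} {note_frequency(i):8.2f} Hz")

# 协和音程对应近似简单整数比,例如纯五度约为 3:2
print("P5 ratio:", note_frequency(7) / note_frequency(0))   # 约 1.498
```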

【19】 Pathological voice adaptation with autoencoder-based voice conversion 标题:基于自动编码器的语音转换的病理性语音适配

作者:Marc Illa,Bence Mark Halpern,Rob van Son,Laureano Moro-Velazquez,Odette Scharenborg 备注:6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021) 链接:https://arxiv.org/abs/2106.08427 摘要:本文提出了一种新的病态语音合成方法。而不是使用健康的语音作为来源,我们定制一个现有的病理性语音样本,以一个新的说话人的声音特征。这种方法减轻了将典型语音转换为病理性语音时的评估问题,因为在我们的方法中,语音转换(VC)模型不需要针对语音退化进行优化,而只需要针对说话人的变化进行优化。优化中的这种变化确保了在自然度中发现的任何退化都是由于转换过程,而不是由于模型夸大了语音病理学的特征。为了证明该方法的有效性,我们使用UASpeech数据库和基于VC技术的自动编码器对构音障碍语音进行了转换。主观评价结果表明,虽然较低的清晰度可能会导致中、低清晰度说话人的自然度得分与地面真实度相比有一定程度的下降,但对于高清晰度、有构音障碍的说话人来说,自然度是合理的。低清晰度和高清晰度说话人的说话人特征转换是成功的,但中等清晰度说话人的转换不是成功的。不同清晰度水平的转换结果的差异是由于清晰度水平的不同还是由于说话人的不同需要进一步研究。 摘要:In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need to be optimised for speech degradation but only for the speaker change. This change in the optimisation ensures that any degradation found in naturalness is due to the conversion process and not due to the model exaggerating characteristics of a speech pathology. To show a proof of concept of this method, we convert dysarthric speech using the UASpeech database and an autoencoder-based VC technique. Subjective evaluation results show reasonable naturalness for high intelligibility dysarthric speakers, though lower intelligibility seems to introduce a marginal degradation in naturalness scores for mid and low intelligibility speakers compared to ground truth. Conversion of speaker characteristics for low and high intelligibility speakers is successful, but not for mid. Whether the differences in the results for the different intelligibility levels is due to the intelligibility levels or due to the speakers needs to be further investigated.

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2021-06-17,如有侵权请联系 cloudcommunity@tencent.com 删除

