
Finance / Speech / Audio Processing arXiv Digest [6.22]

Author: WeChat official account arXiv每日学术速递 (arXiv Daily Academic Digest)
Published 2021-07-02 17:46:45

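Visit www.arxivdaily.com for the digest with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and posting features.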

q-fin (Finance): 9 papers

cs.SD (Sound): 14 papers

eess.AS (Audio and Speech Processing): 14 papers

1. q-fin (Finance):

【1】 Dynamics of Disruption in Science and Technology

Authors: Michael Park, Erin Leahey, Russell Funk
Affiliations: Carlson School of Management, University of Minnesota; School of Sociology, University of Arizona
Link: https://arxiv.org/abs/2106.11184
Abstract: Although the number of new scientific discoveries and technological inventions has increased dramatically over the past century, there have also been concerns of a slowdown in the progress of science and technology. We analyze 25 million papers and 4 million patents across 6 decades and find that science and technology are becoming less disruptive of existing knowledge, a pattern that holds nearly universally across fields. We link this decline in disruptiveness to a narrowing in the utilization of existing knowledge. Diminishing quality of published science and changes in citation practices are unlikely to be responsible for this trend, suggesting that this pattern represents a fundamental shift in science and technology.

【2】 Framing and Social Information Nudges at Wikipedia

Authors: Maximilian Linek, Christian Traxler
Link: https://arxiv.org/abs/2106.11128
Abstract: We analyze a series of trials that randomly assigned Wikipedia users in Germany to different web banners soliciting donations. The trials varied framing or content of social information about how many other users are donating. Framing a given number of donors in a negative way increased donation rates. Variations in the communicated social information had no detectable effects. The findings are consistent with the results from a survey experiment. In line with donations being strategic substitutes, the survey documents that the negative framing lowers beliefs about others' donations. Varying the social information, in contrast, is ineffective in changing average beliefs.

【3】 Doctors and Nurses Social Media Ads Reduced Holiday Travel and COVID-19 infections: A cluster randomized controlled trial in 13 States

Authors: Emily Breza, Fatima Cody Stanford, Marcela Alsan, Burak Alsan, Abhijit Banerjee, Arun G. Chandrasekhar, Sarah Eichmeyer, Traci Glushko, Paul Goldsmith-Pinkham, Kelly Holland, Emily Hoppe, Mohit Karnani, Sarah Liegl, Tristan Loisel, Lucy Ogbu-Nwobodo, Benjamin A. Olken, Carlos Torres, Pierre-Luc Vautrey, Erica Warner, Susan Wootton, Esther Duflo
Affiliations: Harvard University, Department of Economics, Cambridge, MA; Harvard Kennedy School of Government, Cambridge, MA; Online Care Group, Boston, MA; Massachusetts General Hospital, Department of Medicine- Neuroendocrine Unit, Department
Link: https://arxiv.org/abs/2106.11012
Abstract: During the COVID-19 epidemic, many health professionals started using mass communication on social media to relay critical information and persuade individuals to adopt preventative health behaviors. Our group of clinicians and nurses developed and recorded short video messages to encourage viewers to stay home for the Thanksgiving and Christmas Holidays. We then conducted a two-stage clustered randomized controlled trial in 820 counties (covering 13 States) in the United States of a large-scale Facebook ad campaign disseminating these messages. In the first level of randomization, we randomly divided the counties into two groups: high intensity and low intensity.
In the second level, we randomly assigned zip codes to either treatment or control such that 75% of zip codes in high intensity counties received the treatment, while 25% of zip codes in low intensity counties received the treatment. In each treated zip code, we sent the ad to as many Facebook subscribers as possible (11,954,109 users received at least one ad at Thanksgiving and 23,302,290 users received at least one ad at Christmas). The first primary outcome was aggregate holiday travel, measured using mobile phone location data, available at the county level: we find that average distance travelled in high-intensity counties decreased by -0.993 percentage points (95% CI -1.616, -0.371, p-value 0.002) the three days before each holiday. The second primary outcome was COVID-19 infection at the zip-code level: COVID-19 infections recorded in the two-week period starting five days post-holiday declined by 3.5 percent (adjusted 95% CI [-6.2 percent, -0.7 percent], p-value 0.013) in intervention zip codes compared to control zip codes.
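
The two-stage assignment described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' procedure: the function name `two_stage_assign` is hypothetical, counties are split evenly between the two intensity arms (the abstract does not give the split), and the treated share within each county is rounded to the nearest whole zip code.

```python
import random


def two_stage_assign(counties, high_share=0.75, low_share=0.25, seed=0):
    """Two-stage cluster randomization (illustrative sketch).

    counties: dict mapping county id -> list of zip codes.
    Returns (intensity, treated):
      intensity maps county id -> "high" or "low",
      treated maps zip code -> True (treatment) or False (control).
    """
    rng = random.Random(seed)

    # Stage 1: randomly split counties into high- and low-intensity arms.
    # Assumption: an even split; the abstract does not state the ratio.
    ids = sorted(counties)
    rng.shuffle(ids)
    half = len(ids) // 2
    intensity = {c: ("high" if i < half else "low") for i, c in enumerate(ids)}

    # Stage 2: within each county, treat 75% of zip codes (high intensity)
    # or 25% of zip codes (low intensity), chosen at random.
    treated = {}
    for c in ids:
        zips = list(counties[c])
        rng.shuffle(zips)
        share = high_share if intensity[c] == "high" else low_share
        n_treat = round(share * len(zips))
        for j, z in enumerate(zips):
            treated[z] = j < n_treat
    return intensity, treated


if __name__ == "__main__":
    # Toy data: 10 hypothetical counties with 4 zip codes each.
    toy = {f"c{i}": [f"z{i}_{j}" for j in range(4)] for i in range(10)}
    arms, assignment = two_stage_assign(toy)
    print(arms)
    print(sum(assignment.values()), "treated zip codes")
```

Randomizing the intensity at the county level and the treatment at the zip-code level is what lets the county-level travel outcome and the zip-level infection outcome each be compared against a randomized counterpart.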

【4】 Output, Employment, and Price Effects of U.S. Narrative Tax Changes: A Factor-Augmented Vector Autoregression Approach

Authors: Masud Alam
Comments: 46 pages
Link: https://arxiv.org/abs/2106.10844
Abstract: This paper examines the short- and long-run effects of U.S. federal personal income and corporate income tax cuts on a wide array of economic policy variables in a data-rich environment. Using a panel of U.S. macroeconomic data set, made up of 132 quarterly macroeconomic series for 1959-2018, the study estimates factor-augmented vector autoregression (FAVARs) models where an extended narrative tax changes dataset combined with unobserved factors. The narrative approach classifies if tax changes are exogenous or endogenous. This paper identifies narrative tax shocks in the vector autoregression model using the sign restrictions with Uhlig's (2005) penalty function. Empirical findings show a significant expansionary effect of tax cuts on the macroeconomic variables. Cuts in personal and corporate income taxes cause a rise in output, investment, employment, and consumption; however, cuts in personal taxes appear to be a more effective fiscal policy tool than the cut in corporate income taxes. Real GDP, employment, investment, and industrial production increase significantly and reach their maximum response values two years after personal income tax cuts. The effects of corporate tax cuts have relatively smaller effects on output and consumption but show immediate and higher effects on fixed investment and price levels.

【5】 Entitled to Property: Inheritance Laws, Female Bargaining Power, and Child Health in India

Authors: Shahadath Hossain, Plamen Nikolov
Affiliations: State University of New York (Binghamton); IZA Institute of Labor Economics; Harvard Institute for Quantitative Social Science; Global Labor Organization
Link: https://arxiv.org/abs/2106.10841
Abstract: Child height is a significant predictor of human capital and economic status throughout adulthood. Moreover, non-unitary household models of family behavior posit that an increase in women's bargaining power can influence child health. We study the effects of an inheritance policy change, the Hindu Succession Act (HSA), which conferred enhanced inheritance rights to unmarried women in rural India, on child height. We find robust evidence that the HSA improved the height and weight of children. In addition, we find evidence consistent with a channel that the policy improved the women's intrahousehold bargaining power within the household, leading to improved parental investments for children. These study findings are also compatible with the notion that children do better when their mothers control a more significant fraction of the family. Therefore, policies that empower women can have additional positive spillovers for children's human capital.

【6】 Multidimensional linear and nonlinear partial integro-differential equation in Bessel potential spaces with applications in option pricing

Authors: Daniel Sevcovic, Cyril Izuchukwu Udeani
Link: https://arxiv.org/abs/2106.10498
Abstract: The purpose of this paper is to analyze solutions of a non-local nonlinear partial integro-differential equation (PIDE) in multidimensional spaces. Such class of PIDE often arises in financial modeling. We employ the theory of abstract semilinear parabolic equations in order to prove existence and uniqueness of solutions in the scale of Bessel potential spaces. We consider a wide class of Lévy measures satisfying suitable growth conditions near the origin and infinity. The novelty of the paper is the generalization of already known results in the one space dimension to the multidimensional case. We consider Black-Scholes models for option pricing on underlying assets following a Lévy stochastic process with jumps. As an application to option pricing in the one-dimensional space, we consider a general shift function arising from nonlinear option pricing models taking into account a large trader stock-trading strategy. We prove existence and uniqueness of a solution to the nonlinear PIDE in which the shift function may depend on a prescribed large investor stock-trading strategy function.

【7】 The efficient frontiers of mean-variance portfolio rules under distribution misspecification

Authors: Andrew Paskaramoorthy, Tim Gebbie, Terence van Zyl
Affiliations: School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa; Institute for Intelligent Systems
Comments: 8 pages, 2 figures, 4 tables, submitted to Fusion2021
Link: https://arxiv.org/abs/2106.10491
Abstract: Mean-variance portfolio decisions that combine prediction and optimisation have been shown to have poor empirical performance. Here, we consider the performance of various shrinkage methods by their efficient frontiers under different distributional assumptions to study the impact of reasonable departures from Normality. Namely, we investigate the impact of first-order auto-correlation, second-order auto-correlation, skewness, and excess kurtosis. We show that the shrinkage methods tend to re-scale the sample efficient frontier, which can change based on the nature of local perturbations from Normality. This re-scaling implies that the standard approach of comparing decision rules for a fixed level of risk aversion is problematic, and more so in a dynamic market setting. Our results suggest that comparing efficient frontiers has serious implications which oppose the prevailing thinking in the literature. Namely, that sample estimators out-perform Stein type estimators of the mean, and that improving the prediction of the covariance has greater importance than improving that of the means.
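
The re-scaling point in this abstract can be illustrated with a deliberately simple case. The sketch below is not one of the estimators studied in the paper; it assumes two assets, unconstrained mean-variance weights w = Σ⁻¹μ / γ, and a hypothetical shrinkage of the sample mean toward zero with a fixed intensity `delta`. Under those assumptions, shrinking the mean gives exactly the weights an investor with the larger risk aversion γ / (1 − δ) would choose from the unshrunk mean.

```python
def mv_weights(mu, cov, gamma):
    """Unconstrained mean-variance weights w = inv(cov) @ mu / gamma (2 assets)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [sum(inv[i][j] * mu[j] for j in range(2)) / gamma for i in range(2)]


def shrink_to_zero(mu, delta):
    """Hypothetical shrinkage rule: pull every mean toward 0 by a fraction delta."""
    return [(1.0 - delta) * m for m in mu]


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    mu = [0.08, 0.05]
    cov = [[0.04, 0.01], [0.01, 0.09]]
    gamma, delta = 3.0, 0.4

    shrunk = mv_weights(shrink_to_zero(mu, delta), cov, gamma)
    rescaled = mv_weights(mu, cov, gamma / (1.0 - delta))
    print(shrunk)
    print(rescaled)  # same portfolio, reached via a different risk aversion
```

Because the two rules pick the same portfolio at different nominal risk-aversion levels, comparing them at one fixed γ compares different points on the frontier, which is the problem the abstract raises.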

【8】 The Knowledge Mobility of Renewable Energy Technology

Authors: P. G. J. Persoon, R. N. A. Bekkers, F. Alkemade
Link: https://arxiv.org/abs/2106.10474
Abstract: In the race to achieve climate goals, many governments and organizations are encouraging the regional development of Renewable Energy Technology (RET). The spatial dynamics and successful regional development of a technology partly depends on the characteristics of the knowledge base on which this technology builds, in particular the analyticity and cumulativeness of knowledge. In this study we systematically evaluate these knowledge base characteristics for a set of 13 different RETs. We find that, while several RETs (photovoltaics, fuel-cells, energy storage) have a highly analytic knowledge base and develop more widespread, there are also important RETs (wind turbines, solar thermal, geothermal and hydro energy) for which the knowledge base is less analytic and which develop less widespread. Likewise, the technological cumulativeness tends to be lower for the former than for the latter group. This calls for regional policies to be specific for different RETs, taking for a given RET into account both the type of knowledge it builds on as well as the local presence of this knowledge.

【9】 Mechanism Design for Efficient Nash Equilibrium in Oligopolistic Markets

Authors: Kaiying Lin, Beibei Wang, Pengcheng You
Affiliations: Electrical Engineering, Southeast University; Electrical and Computer Engineering, Johns Hopkins University
Link: https://arxiv.org/abs/2106.11120
Abstract: This paper investigates the efficiency loss in social cost caused by strategic bidding behavior of individual participants in a supply-demand balancing market, and proposes a mechanism to fully recover equilibrium social optimum via subsidization and taxation. We characterize the competition among supply-side firms to meet given inelastic demand, with linear supply function bidding and the proposed efficiency recovery mechanism. We show that the Nash equilibrium of such a game exists under mild conditions, and more importantly, it achieves the underlying efficient supply dispatch and the market clearing price that reflects the truthful system marginal production cost. Further, the mechanism can be tuned to guarantee self-sufficiency, i.e., taxes collected counterbalance subsidies needed. Extensive numerical case studies are run to validate the equilibrium analysis, and we employ individual net profit and a modified version of Lerner index as two metrics to evaluate the impact of the mechanism on market outcomes by varying its tuning parameter and firm heterogeneity.
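
To make "linear supply function bidding" with inelastic demand concrete, here is a minimal clearing sketch. It assumes the common textbook form in which firm i submits a slope b_i and supplies b_i · p at price p; whether the paper uses exactly this parameterization is an assumption, and the subsidy/tax mechanism itself is not modeled. The function name `clear_market` is hypothetical.

```python
def clear_market(bids, demand):
    """Clear a market with linear supply bids against fixed (inelastic) demand.

    bids:   list of non-negative slopes b_i; firm i supplies b_i * price.
    demand: total quantity that must be met.
    Returns (price, dispatch) where sum(dispatch) equals demand.
    """
    total_slope = sum(bids)
    if total_slope <= 0:
        raise ValueError("at least one firm must submit a positive bid")
    # Supply equals demand: sum_i b_i * p = D, so p = D / sum_i b_i.
    price = demand / total_slope
    dispatch = [b * price for b in bids]
    return price, dispatch


if __name__ == "__main__":
    price, dispatch = clear_market([1.0, 2.0, 1.0], demand=8.0)
    print(price, dispatch)  # 2.0 [2.0, 4.0, 2.0]
```

A firm that lowers its slope raises the clearing price for everyone, which is the strategic incentive behind the efficiency loss the abstract describes.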

2. cs.SD (Sound):

【1】 Affinity Mixup for Weakly Supervised Sound Event Detection

Authors: Mohammad Rasool Izadi, Robert Stevenson, Laura N. Kloepper
Link: https://arxiv.org/abs/2106.11233
Abstract: The weakly supervised sound event detection problem is the task of predicting the presence of sound events and their corresponding starting and ending points in a weakly labeled dataset. A weak dataset associates each training sample (a short recording) to one or more present sources. Networks that solely rely on convolutional and recurrent layers cannot directly relate multiple frames in a recording. Motivated by attention and graph neural networks, we introduce the concept of an affinity mixup to incorporate time-level similarities and make a connection between frames. This regularization technique mixes up features in different layers using an adaptive affinity matrix. Our proposed affinity mixup network improves over state-of-the-art techniques event-F1 scores by 8.2%.
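
The abstract does not spell out how the affinity matrix is built or applied, so the following is only one plausible reading, with every detail an assumption: the affinity between two frames is a row-wise softmax over dot-product similarities, and each frame's features are blended with the affinity-weighted average of all frames using a hypothetical mixing weight `lam`.

```python
import math


def affinity_matrix(frames):
    """Row-stochastic affinity: softmax over dot-product frame similarities."""
    n = len(frames)
    rows = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(frames[i], frames[j])) for j in range(n)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def affinity_mixup(frames, lam=0.5):
    """Blend each frame with the affinity-weighted average of all frames.

    frames: list of T feature vectors (one per time frame).
    lam:    assumed mixing weight; lam = 0 returns the input unchanged.
    """
    aff = affinity_matrix(frames)
    dim = len(frames[0])
    mixed = []
    for i, row in enumerate(aff):
        agg = [sum(row[j] * frames[j][d] for j in range(len(frames))) for d in range(dim)]
        mixed.append([(1.0 - lam) * frames[i][d] + lam * agg[d] for d in range(dim)])
    return mixed


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    for row in affinity_mixup(toy, lam=0.5):
        print(row)
```

The point of the sketch is the data flow: similar frames receive larger affinities, so information is shared across time in a way that purely convolutional or recurrent layers do not provide directly.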

【2】 EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III

Authors: Omid Ghahabi, Volker Fischer
Affiliations: EML Speech Technology GmbH, Berliner Straße , Heidelberg, Germany
Link: https://arxiv.org/abs/2106.11075
Abstract: Speech Activity Detection (SAD), locating speech segments within an audio recording, is a main part of most speech technology applications. Robust SAD is usually more difficult in noisy conditions with varying signal-to-noise ratios (SNR). The Fearless Steps challenge has recently provided such data from the NASA Apollo-11 mission for different speech processing tasks including SAD. Most audio recordings are degraded by different kinds and levels of noise varying within and between channels. This paper describes the EML online algorithm for the most recent phase of this challenge. The proposed algorithm can be trained both in a supervised and unsupervised manner and assigns speech and non-speech labels at runtime approximately every 0.1 sec. The experimental results show a competitive accuracy on both development and evaluation datasets with a real-time factor of about 0.002 using a single CPU machine.
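
For readers unfamiliar with the metric, the real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean faster than real time. The helper names below are hypothetical; the snippet is just the arithmetic implied by the reported figure.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds, rtf):
    """Time needed to process a recording at a given real-time factor."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # At an RTF of about 0.002, one hour of audio takes roughly 7.2 seconds.
    print(processing_time(3600.0, 0.002))
```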

【3】 Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters

Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
Affiliations: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary; MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary
Comments: 6 pages, 3 figures, International Conference on Artificial Intelligence and Speech Technology (AIST2020)
Link: https://arxiv.org/abs/2106.10481
Abstract: Vocoders received renewed attention as main components in statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Even though there are vocoding techniques give almost accepted synthesized speech, their high computational complexity and irregular structures are still considered challenging concerns, which yield a variety of voice quality degradation. Therefore, this paper presents new techniques in a continuous vocoder, that is all features are continuous and presents a flexible speech synthesis system. First, a new continuous noise masking based on the phase distortion is proposed to eliminate the perceptual impact of the residual noise and letting an accurate reconstruction of noise characteristics. Second, we addressed the need of neural sequence to sequence modeling approach for the task of TTS based on recurrent networks. Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) are studied and applied to model continuous parameters for more natural-sounding like a human. The evaluation results proved that the proposed model achieves the state-of-the-art performance of the speech synthesis compared with the other traditional methods.

【4】 Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Authors: Hongqiang Du, Lei Xie
Affiliations: Northwestern Polytechnical University, China
Link: https://arxiv.org/abs/2106.10406
Abstract: One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling frame-wise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
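
The last two steps in this abstract (attention weights over frames, then statistics pooling) can be sketched as attention-weighted mean and standard deviation pooling. This is a generic formulation, not the authors' exact layer: the attention scores are passed in as plain numbers here, whereas in a real encoder they would come from a learned sub-network, and the function name is hypothetical.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool frame-level features into one utterance-level embedding.

    frames: list of T feature vectors (frame-level speaker information).
    scores: list of T unnormalized attention scores, one per frame.
    Returns the concatenation [weighted mean, weighted std] of length 2 * D.
    """
    # Softmax turns the scores into frame weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(second[d] - mean[d] ** 2, 0.0)) for d in range(dim)]
    return mean + std


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 4.0]]
    # Equal scores reduce this to ordinary mean/std pooling.
    print(attentive_stats_pooling(toy_frames, [0.0, 0.0]))  # [2.0, 3.0, 1.0, 1.0]
```

With unequal scores, frames judged more speaker-relevant dominate both statistics, which is how a single short utterance can still yield a usable embedding.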

【5】 UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control

Authors: Minsu Kang, Sungjae Kim, Injung Kim
Affiliations: Department of Computer Science and Electronic Engineering, Handong Global University
Link: https://arxiv.org/abs/2106.11171
Abstract: We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion, and additionally, predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results exhibit that the proposed methods learned multiple attributes harmoniously in a manner that can be easily separated again. As well, UniTTS synthesized high-fidelity speech signals controlling multiple style attributes. The synthesized speech samples are presented at https://jackson-kang.github.io/paper_works/UniTTS/demos.

【6】 Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge

Authors: Neeraj Kumar Sharma, Ananya Muguli, Prashant Krishnan, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy
Affiliations: Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India
Comments: Manuscript in review in the Elsevier Computer Speech and Language journal
Link: https://arxiv.org/abs/2106.10997
Abstract: The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing and speech sound recordings via worldwide crowdsourcing. With this data, a call for development of diagnostic tools was announced in the Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks.
The Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided for the participants, without a competitive leaderboard. The challenge attracted 85 plus registrations with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system), and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area-under-the-curve of 95.1% on the blind test data. By summarizing the lessons learned, we foresee the challenge overview in this paper to help accelerate technology for acoustic-based POCTs.
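
The abstract reports the area under the curve (AUC) of a score fusion across the top teams but does not say how the scores were combined. The sketch below assumes the simplest option, an unweighted average of per-system scores, and computes AUC as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. The data are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fuse_scores(systems):
    """Assumed fusion rule: unweighted mean of the systems' scores per sample."""
    return [sum(col) / len(col) for col in zip(*systems)]


if __name__ == "__main__":
    # Toy labels (1 = COVID-19 positive) and two hypothetical systems.
    y = [1, 0, 1, 0]
    sys_a = [0.9, 0.8, 0.4, 0.3]
    sys_b = [0.5, 0.1, 0.9, 0.6]
    print(auc(y, sys_a), auc(y, sys_b))        # 0.75 0.75
    print(auc(y, fuse_scores([sys_a, sys_b])))  # 1.0
```

In this toy case each system misranks a different pair, so averaging repairs both errors; that complementarity is the usual reason a fusion outperforms its individual members.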

【7】 Speech prosody and remote experiments: a technical report

Authors: Giuseppe Magistro
Affiliations: Department of linguistics, University of Ghent, Belgium; Department of linguistics, University of Konstanz, Germany
Link: https://arxiv.org/abs/2106.10915
Abstract: The aim of this paper is twofold. First, we present a review of different recording options for gathering prosodic data in the event that fieldwork is impracticable (e.g. due to pandemics). Under this light, we mimic a long-distance reading task experiment using different software and hardware synchronously. In order to evaluate the employed methodologies, we extract noise levels and frequency manipulation of the recordings. Subsequently, we examine the impact of the different recordings onto linguistic variables, such as the pitch curves and values. We also include a discussion on experimental practicalities. After balancing these factors, we decree an online platform, Zencastr, as the most affordable and practical for acoustic data collection. Secondly, we want to open up a debate on the most optimal remote methodology that researchers on speech prosody can deploy.

【8】 Non-native English lexicon creation for bilingual speech synthesis

Authors: Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, Sharath Adavanne
Affiliations: Zapr Media Labs (Red Brick Lane Marketing Solutions Pvt. Ltd.), India
Comments: Accepted for Presentation at Speech Synthesis Workshop (SSW), 2021 (August 2021)
Link: https://arxiv.org/abs/2106.10870
Abstract: Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are of a code-mixed fashion. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequence used by non-native speakers. However, due to the lack of non-native English lexicon, existing bilingual TTS systems employ native English lexicons that are widely available, in addition to their native language lexicon. Due to the inconsistency between the non-native English pronunciation in the audio and native English lexicon in the text, the intelligibility of synthesized speech in such TTS systems is significantly reduced. This paper is motivated by the knowledge that the native language of the speaker highly influences non-native English pronunciation. We propose a generic approach to obtain rules based on letter to phoneme alignment to map native English lexicon to their non-native version. The effectiveness of such mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and without the proposed rules. The subjective evaluation shows that the bilingual TTS system trained with the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.

【9】 Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Authors: Jian Cong, Shan Yang, Lei Xie, Dan Su
Affiliations: Northwestern Polytechnical University, Xi’an, China; Tencent AI Lab, China
Link: https://arxiv.org/abs/2106.10831
Abstract: Current two-stage TTS framework typically integrates an acoustic model with a vocoder -- the acoustic model predicts a low resolution intermediate representation such as Mel-spectrum while the vocoder generates waveform from the intermediate representation. Although the intermediate representation is served as a bridge, there still exists critical mismatch between the acoustic model and the vocoder as they are commonly separately learned and work on different distributions of representation, leading to inevitable artifacts in the synthesized speech. In this work, different from using pre-designed intermediate representation in most previous studies, we propose to use VAE combining with GAN to learn a latent representation directly from speech and then utilize a flow-based acoustic model to model the distribution of the latent representation from text. In this way, the mismatch problem is migrated as the two stages work on the same distribution. Results demonstrate that the flow-based acoustic model can exactly model the distribution of our learned speech representation and the proposed TTS framework, namely Glow-WaveGAN, can produce high fidelity speech outperforming the state-of-the-art GAN-based model.

【10】 Controllable Context-aware Conversational Speech Synthesis

Authors: Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su Affiliations: Northwestern Polytechnical University, Xi'an, China, Tencent AI Lab, China Comments: Accepted to INTERSPEECH 2021 Link: https://arxiv.org/abs/2106.10828 Abstract: In spoken conversations, spontaneous behaviors such as filled pauses and prolongations always happen. Conversational partners tend to align features of their speech with those of their interlocutor, which is known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework to model these two phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model and develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We subsequently develop an algorithm based on the predictor to control the occurrence frequency of the behaviors, making the synthesized speech vary from less disfluent to more disfluent. To model speech entrainment at the acoustic level, we utilize a context acoustic encoder to extract a global style embedding from the previous speech, on which the synthesis of the current speech is conditioned. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module to eliminate speaker-related information in the acoustic encoder while maintaining style-related information. Experiments show that our proposed approach can synthesize realistic conversations and control the occurrences of the spontaneous behaviors naturally.

【11】 Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji Affiliation: Sony Group Corporation, Japan Comments: 5 pages, 3 figures, submitted to DCASE2021 task3 Link: https://arxiv.org/abs/2106.10806 Abstract: This report describes our systems submitted to DCASE2021 Challenge Task 3: sound event localization and detection (SELD) with directional interference. Our previous system, based on the activity-coupled Cartesian direction of arrival (ACCDOA) representation, enables us to solve a SELD task with a single target. This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we build model ensembles by averaging the outputs of several systems trained under different conditions, such as input features, training folds, and model architectures. We also use an event-independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.
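The impulse response simulation (IRS) step in this abstract, convolving one source signal with one room impulse response per microphone, can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `convolve` and `simulate_multichannel` and the toy signals are hypothetical, and real RIRs would have thousands of taps.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def simulate_multichannel(source, rirs):
    """Convolve one dry source with one RIR per microphone channel."""
    return [convolve(source, rir) for rir in rirs]


# Toy (hypothetical) data: a 3-sample source and two 2-tap "RIRs".
source = [1.0, 0.5, 0.25]
rirs = [
    [1.0, 0.0],  # channel 0: direct path only
    [0.0, 0.5],  # channel 1: delayed by one sample and attenuated
]
channels = simulate_multichannel(source, rirs)
print(channels)
```

Each output channel has length `len(source) + len(rir) - 1`, so the second channel carries the same source shifted by one sample and scaled by 0.5.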

【12】 MeshRIR: A Dataset of Room Impulse Responses on Meshed Grid Points For Evaluating Sound Field Analysis and Synthesis Methods

Authors: Shoichi Koyama, Tomoya Nishida, Keisuke Kimura, Takumi Abe, Natsuki Ueno, Jesper Brunnström Affiliation: The University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan Link: https://arxiv.org/abs/2106.10801 Abstract: A new impulse response (IR) dataset called "MeshRIR" is introduced. Currently available datasets usually include IRs at an array of microphones from several source positions under various room conditions, and are basically designed for evaluating speech enhancement and distant speech recognition methods. On the other hand, methods of estimating or controlling spatial sound fields have been extensively investigated in recent years; however, the current IR datasets are not applicable to validating and comparing these methods because of the low spatial resolution of their measurement points. MeshRIR consists of IRs measured at positions obtained by finely discretizing a spatial region. Two subdatasets are currently available: one consists of IRs in a three-dimensional cuboidal region from a single source, and the other consists of IRs in a two-dimensional square region from an array of 32 sources. Therefore, MeshRIR is suitable for evaluating sound field analysis and synthesis methods. This dataset is freely available at https://sh01k.github.io/MeshRIR/ with some sample application code.

【13】 Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia Affiliations: Watanabe is with Carnegie Mellon University, García is with Johns Hopkins University Comments: Submitted to IEEE TASLP. This article is based on our previous conference paper arXiv:2005.09921 Link: https://arxiv.org/abs/2106.10654 Abstract: This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional pipeline approach to speaker diarization, EEND methods are better at handling speaker overlap. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against pipeline approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional pipeline approach.
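The step "diarization results are estimated as dot products of the attractors and embeddings" can be illustrated with a small sketch. Several things here are assumptions for the example rather than details taken from the abstract: the attractors are given directly instead of being generated by an LSTM encoder-decoder, a sigmoid followed by a 0.5 threshold turns each dot product into a speech-activity decision, and the 2-dimensional embeddings are toy values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def diarize(embeddings, attractors, threshold=0.5):
    """Return a frames x speakers matrix of 0/1 activity decisions.

    Each decision is the sigmoid of the dot product between one
    frame embedding and one speaker attractor, compared to `threshold`.
    """
    activity = []
    for emb in embeddings:
        row = []
        for att in attractors:
            score = sum(e * a for e, a in zip(emb, att))
            row.append(1 if sigmoid(score) > threshold else 0)
        activity.append(row)
    return activity


# Toy (hypothetical) frame embeddings and two speaker attractors.
embeddings = [
    [4.0, -4.0],   # only speaker 0 active
    [-4.0, 4.0],   # only speaker 1 active
    [4.0, 4.0],    # overlap: large dot product with both attractors
    [-4.0, -4.0],  # silence
]
attractors = [[1.0, 0.0], [0.0, 1.0]]
activity = diarize(embeddings, attractors)
print(activity)
```

The third frame shows why overlap is handled naturally in this formulation: one embedding can score highly against more than one attractor at once, so both speakers are marked active.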

【14】 GPLA-12: An Acoustic Signal Dataset of Gas Pipeline Leakage

Authors: Jie Li, Lizhong Yao Affiliations: School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, China, School of Electrical Engineering Link: https://arxiv.org/abs/2106.10277 Abstract: In this paper, we introduce a new acoustic leakage dataset of gas pipelines, called GPLA-12, which has 12 categories over 684 training/testing acoustic signals. Unlike massive image and voice datasets, there are relatively few acoustic signal datasets, especially for engineering fault detection. To advance the development of fault diagnosis, we collect acoustic leakage signals from an intact gas pipe system with external artificial leakages, and then preprocess the collected data with structured tailoring to produce GPLA-12. GPLA-12 is intended to serve as a feature learning dataset for time-series tasks and classification. To further understand the dataset, we train both shallow and deep learning algorithms to observe the performance. The dataset as well as the pretrained models have been released at both www.daip.club and github.com/Deep-AI-Application-DAIP

3. eess.AS Audio and Speech Processing:

【1】 UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control

Authors: Minsu Kang, Sungjae Kim, Injung Kim Affiliation: Department of Computer Science and Electronic Engineering, Handong Global University Link: https://arxiv.org/abs/2106.11171 Abstract: We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes while avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion and, additionally, predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results show that the proposed method learned multiple attributes harmoniously in a manner that allows them to be easily separated again. In addition, UniTTS synthesized high-fidelity speech signals while controlling multiple style attributes. The synthesized speech samples are presented at https://jackson-kang.github.io/paper_works/UniTTS/demos.

【2】 Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge

Authors: Neeraj Kumar Sharma, Ananya Muguli, Prashant Krishnan, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy Affiliation: Learning and Extraction of Acoustic Patterns (LEAP) Lab, Electrical Engineering, Indian Institute of Science, Bangalore, India Comments: Manuscript in review in the Elsevier Computer Speech and Language journal Link: https://arxiv.org/abs/2106.10997 Abstract: The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs, can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing and speech sound recordings via worldwide crowdsourcing. With this data, a call for development of diagnostic tools was announced at Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks. Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided to the participants, without a competitive leaderboard. The challenge attracted more than 85 registrations, with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system) and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area-under-the-curve of 95.1% on the blind test data. By summarizing the lessons learned, we expect the challenge overview in this paper to help accelerate technology for acoustic-based POCTs.
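The abstract reports an area-under-the-curve (AUC) for a fusion of team scores. The sketch below shows how such a number can be computed, under assumptions that are not stated in the abstract: fusion is taken to be a plain per-sample average of system scores, AUC is computed as the fraction of positive/negative pairs ranked correctly, and the labels and scores are invented toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def fuse(score_lists):
    """Average the scores of several systems sample by sample."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Toy (hypothetical) data: 1 = COVID-19 positive, 0 = negative.
labels = [1, 1, 0, 0]
system_a = [0.9, 0.4, 0.5, 0.1]
system_b = [0.6, 0.8, 0.3, 0.7]
fused = fuse([system_a, system_b])

print(auc(labels, system_a), auc(labels, system_b), auc(labels, fused))
```

In this toy case each system misranks a different pair, so averaging their scores ranks every positive above every negative. That complementarity is the usual motivation for score fusion.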

【3】 Speech prosody and remote experiments: a technical report

Authors: Giuseppe Magistro Affiliations: Department of linguistics, University of Ghent, Belgium, Department of linguistics, University of Konstanz, Germany Link: https://arxiv.org/abs/2106.10915 Abstract: The aim of this paper is twofold. First, we present a review of different recording options for gathering prosodic data in the event that fieldwork is impracticable (e.g. due to pandemics). In this light, we mimic a long-distance reading task experiment using different software and hardware synchronously. In order to evaluate the employed methodologies, we extract noise levels and frequency manipulation of the recordings. Subsequently, we examine the impact of the different recordings on linguistic variables, such as the pitch curves and values. We also include a discussion on experimental practicalities. After balancing these factors, we identify an online platform, Zencastr, as the most affordable and practical for acoustic data collection. Second, we want to open up a debate on the best remote methodology that researchers on speech prosody can deploy.

【4】 Non-native English lexicon creation for bilingual speech synthesis

Authors: Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, Sharath Adavanne Affiliation: Zapr Media Labs (Red Brick Lane Marketing Solutions Pvt. Ltd.), India Comments: Accepted for presentation at the Speech Synthesis Workshop (SSW), 2021 (August 2021) Link: https://arxiv.org/abs/2106.10870 Abstract: Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are code-mixed. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequences used by non-native speakers. However, due to the lack of non-native English lexicons, existing bilingual TTS systems employ widely available native English lexicons in addition to their native-language lexicon. Because of the inconsistency between the non-native English pronunciation in the audio and the native English lexicon in the text, the intelligibility of synthesized speech in such TTS systems is significantly reduced. This paper is motivated by the knowledge that the native language of the speaker strongly influences non-native English pronunciation. We propose a generic approach to obtain rules, based on letter-to-phoneme alignment, that map a native English lexicon to its non-native version. The effectiveness of this mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and without the proposed rules. The subjective evaluation shows that the bilingual TTS system trained with the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.

【5】 Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Authors: Jian Cong, Shan Yang, Lei Xie, Dan Su Affiliations: Northwestern Polytechnical University, Xi'an, China, Tencent AI Lab, China Link: https://arxiv.org/abs/2106.10831 Abstract: The current two-stage TTS framework typically integrates an acoustic model with a vocoder: the acoustic model predicts a low-resolution intermediate representation such as a Mel-spectrum, while the vocoder generates the waveform from that intermediate representation. Although the intermediate representation serves as a bridge, a critical mismatch still exists between the acoustic model and the vocoder, as they are commonly learned separately and work on different distributions of the representation, leading to inevitable artifacts in the synthesized speech. In this work, instead of using a pre-designed intermediate representation as in most previous studies, we propose to use a VAE combined with a GAN to learn a latent representation directly from speech, and then utilize a flow-based acoustic model to model the distribution of the latent representation from text. In this way, the mismatch problem is mitigated, as the two stages work on the same distribution. Results demonstrate that the flow-based acoustic model can exactly model the distribution of our learned speech representation, and that the proposed TTS framework, namely Glow-WaveGAN, can produce high-fidelity speech that outperforms the state-of-the-art GAN-based model.

【6】 Controllable Context-aware Conversational Speech Synthesis

Authors: Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su Affiliations: Northwestern Polytechnical University, Xi'an, China, Tencent AI Lab, China Comments: Accepted to INTERSPEECH 2021 Link: https://arxiv.org/abs/2106.10828 Abstract: In spoken conversations, spontaneous behaviors such as filled pauses and prolongations always happen. Conversational partners tend to align features of their speech with those of their interlocutor, which is known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework to model these two phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model and develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We subsequently develop an algorithm based on the predictor to control the occurrence frequency of the behaviors, making the synthesized speech vary from less disfluent to more disfluent. To model speech entrainment at the acoustic level, we utilize a context acoustic encoder to extract a global style embedding from the previous speech, on which the synthesis of the current speech is conditioned. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module to eliminate speaker-related information in the acoustic encoder while maintaining style-related information. Experiments show that our proposed approach can synthesize realistic conversations and control the occurrences of the spontaneous behaviors naturally.

【7】 Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji Affiliation: Sony Group Corporation, Japan Comments: 5 pages, 3 figures, submitted to DCASE2021 task3 Link: https://arxiv.org/abs/2106.10806 Abstract: This report describes our systems submitted to DCASE2021 Challenge Task 3: sound event localization and detection (SELD) with directional interference. Our previous system, based on the activity-coupled Cartesian direction of arrival (ACCDOA) representation, enables us to solve a SELD task with a single target. This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we build model ensembles by averaging the outputs of several systems trained under different conditions, such as input features, training folds, and model architectures. We also use an event-independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.

【8】 MeshRIR: A Dataset of Room Impulse Responses on Meshed Grid Points For Evaluating Sound Field Analysis and Synthesis Methods

Authors: Shoichi Koyama, Tomoya Nishida, Keisuke Kimura, Takumi Abe, Natsuki Ueno, Jesper Brunnström Affiliation: The University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan Link: https://arxiv.org/abs/2106.10801 Abstract: A new impulse response (IR) dataset called "MeshRIR" is introduced. Currently available datasets usually include IRs at an array of microphones from several source positions under various room conditions, and are basically designed for evaluating speech enhancement and distant speech recognition methods. On the other hand, methods of estimating or controlling spatial sound fields have been extensively investigated in recent years; however, the current IR datasets are not applicable to validating and comparing these methods because of the low spatial resolution of their measurement points. MeshRIR consists of IRs measured at positions obtained by finely discretizing a spatial region. Two subdatasets are currently available: one consists of IRs in a three-dimensional cuboidal region from a single source, and the other consists of IRs in a two-dimensional square region from an array of 32 sources. Therefore, MeshRIR is suitable for evaluating sound field analysis and synthesis methods. This dataset is freely available at https://sh01k.github.io/MeshRIR/ with some sample application code.

【9】 Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia Affiliations: Watanabe is with Carnegie Mellon University, García is with Johns Hopkins University Comments: Submitted to IEEE TASLP. This article is based on our previous conference paper arXiv:2005.09921 Link: https://arxiv.org/abs/2106.10654 Abstract: This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional pipeline approach to speaker diarization, EEND methods are better at handling speaker overlap. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against pipeline approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional pipeline approach.

【10】 GPLA-12: An Acoustic Signal Dataset of Gas Pipeline Leakage

Authors: Jie Li, Lizhong Yao Affiliations: School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, China, School of Electrical Engineering Link: https://arxiv.org/abs/2106.10277 Abstract: In this paper, we introduce a new acoustic leakage dataset of gas pipelines, called GPLA-12, which has 12 categories over 684 training/testing acoustic signals. Unlike massive image and voice datasets, there are relatively few acoustic signal datasets, especially for engineering fault detection. To advance the development of fault diagnosis, we collect acoustic leakage signals from an intact gas pipe system with external artificial leakages, and then preprocess the collected data with structured tailoring to produce GPLA-12. GPLA-12 is intended to serve as a feature learning dataset for time-series tasks and classification. To further understand the dataset, we train both shallow and deep learning algorithms to observe the performance. The dataset as well as the pretrained models have been released at both www.daip.club and github.com/Deep-AI-Application-DAIP

【11】 Affinity Mixup for Weakly Supervised Sound Event Detection

Authors: Mohammad Rasool Izadi, Robert Stevenson, Laura N. Kloepper Link: https://arxiv.org/abs/2106.11233 Abstract: The weakly supervised sound event detection problem is the task of predicting the presence of sound events and their corresponding starting and ending points in a weakly labeled dataset. A weak dataset associates each training sample (a short recording) with one or more present sources. Networks that rely solely on convolutional and recurrent layers cannot directly relate multiple frames in a recording. Motivated by attention and graph neural networks, we introduce the concept of an affinity mixup to incorporate time-level similarities and make a connection between frames. This regularization technique mixes up features in different layers using an adaptive affinity matrix. Our proposed affinity mixup network improves on the event-F1 scores of state-of-the-art techniques by 8.2%.
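The idea of mixing frame features through an affinity matrix can be sketched as follows. The abstract does not say how the affinity matrix is built or applied, so everything specific here is an assumption made for illustration and should not be read as the authors' formulation: affinities are a row-wise softmax over dot-product similarities between frames, and a hypothetical weight `alpha` blends each frame with its affinity-weighted context.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def affinity_mixup(frames, alpha=0.5):
    """Blend each frame with an affinity-weighted average of all frames.

    frames: list of T feature vectors (each a list of D floats).
    alpha:  hypothetical mixing weight; 0 leaves the frames unchanged.
    """
    # Frame-to-frame similarities, normalised row by row into affinities.
    sims = [[sum(a * b for a, b in zip(f, g)) for g in frames] for f in frames]
    affinity = [softmax(row) for row in sims]

    mixed = []
    for frame, weights in zip(frames, affinity):
        context = [
            sum(w * other[d] for w, other in zip(weights, frames))
            for d in range(len(frame))
        ]
        mixed.append([(1 - alpha) * x + alpha * c for x, c in zip(frame, context)])
    return mixed


# Toy (hypothetical) features: frames 0 and 1 are identical, frame 2 differs.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mixed = affinity_mixup(frames, alpha=0.5)
print(mixed)
```

Because every frame's output depends on all other frames, this kind of layer relates distant frames directly, which is the limitation of purely convolutional and recurrent stacks that the abstract points to.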

【12】 EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III

Authors: Omid Ghahabi, Volker Fischer Affiliation: EML Speech Technology GmbH, Berliner Straße, Heidelberg, Germany Link: https://arxiv.org/abs/2106.11075 Abstract: Speech Activity Detection (SAD), locating speech segments within an audio recording, is a main part of most speech technology applications. Robust SAD is usually more difficult in noisy conditions with varying signal-to-noise ratios (SNR). The Fearless Steps challenge has recently provided such data from the NASA Apollo-11 mission for different speech processing tasks including SAD. Most audio recordings are degraded by different kinds and levels of noise varying within and between channels. This paper describes the EML online algorithm for the most recent phase of this challenge. The proposed algorithm can be trained in both a supervised and an unsupervised manner and assigns speech and non-speech labels at runtime approximately every 0.1 sec. The experimental results show a competitive accuracy on both development and evaluation datasets with a real-time factor of about 0.002 using a single CPU machine.
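To put the reported figures in perspective, the real-time factor (RTF) is the ratio of processing time to audio duration. The short sketch below applies the two numbers from the abstract (RTF of about 0.002, one label roughly every 0.1 s) to a one-hour recording; the one-hour duration and the helper names are assumptions chosen for the example.

```python
def processing_time(audio_seconds, rtf):
    """Seconds of compute implied by a real-time factor."""
    return audio_seconds * rtf


def labels_emitted(audio_seconds, label_interval):
    """Number of speech/non-speech decisions at a fixed label interval."""
    return round(audio_seconds / label_interval)


one_hour = 3600.0  # hypothetical recording length in seconds
compute_seconds = processing_time(one_hour, rtf=0.002)
decisions = labels_emitted(one_hour, label_interval=0.1)

print(f"{compute_seconds:.1f} s of compute, {decisions} decisions")
```

Under these assumptions an hour of audio needs roughly seven seconds of single-CPU compute while producing tens of thousands of frame-level decisions.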

【13】 Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters

Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh Affiliations: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary, MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary Comments: 6 pages, 3 figures, International Conference on Artificial Intelligence and Speech Technology (AIST2020) Link: https://arxiv.org/abs/2106.10481 Abstract: Vocoders have received renewed attention as main components in statistical parametric text-to-speech (TTS) synthesis and speech transformation systems. Even though some vocoding techniques give almost acceptable synthesized speech, their high computational complexity and irregular structures are still considered challenging concerns, which yield a variety of voice quality degradations. Therefore, this paper presents new techniques in a continuous vocoder, in which all features are continuous, and presents a flexible speech synthesis system. First, a new continuous noise masking based on phase distortion is proposed to eliminate the perceptual impact of the residual noise and allow an accurate reconstruction of noise characteristics. Second, we address the need for a neural sequence-to-sequence modeling approach for the TTS task based on recurrent networks. Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model continuous parameters for more natural, human-like speech. The evaluation results showed that the proposed model achieves state-of-the-art speech synthesis performance compared with other traditional methods.

【14】 Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Authors: Hongqiang Du, Lei Xie Affiliation: Northwestern Polytechnical University, China Link: https://arxiv.org/abs/2106.10406 Abstract: One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither speaker needs to be seen during training. However, existing one-shot voice conversion approaches are not stable for unseen speakers, because the speaker embedding extracted from a single utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embeddings from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to frame-level speaker information. Finally, a statistic pooling layer aggregates the weighted frame-level speaker information to form an utterance-level speaker embedding. The experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
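
The encoder stages named in the abstract of 【14】 (channel-wise gating, attention weights over frames, statistic pooling) can be illustrated roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the residual network is omitted, the function names, the weight matrices `w1` and `w2`, and the attention vector `v` are all hypothetical, and the pooling is assumed to concatenate a weighted mean and a weighted standard deviation, which the abstract does not specify.

```python
import numpy as np

def squeeze_excite(frames, w1, w2):
    """Channel-wise gating in the squeeze-and-excitation style.

    frames: (T, C) frame-level features; w1: (C, H) and w2: (H, C) are
    hypothetical bottleneck weights (assumed, not taken from the paper).
    """
    squeezed = frames.mean(axis=0)               # (C,) average over time
    hidden = np.maximum(squeezed @ w1, 0.0)      # ReLU bottleneck, (H,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # sigmoid gate per channel, (C,)
    return frames * gate                         # rescale every channel

def attentive_stats_pooling(frames, v):
    """Aggregate (T, C) frames into one (2C,) utterance-level vector.

    v: (C,) hypothetical attention vector that scores each frame.
    """
    scores = frames @ v                          # (T,) one score per frame
    scores = scores - scores.max()               # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    mean = weights @ frames                      # weighted mean, (C,)
    var = weights @ (frames - mean) ** 2         # weighted variance, (C,)
    std = np.sqrt(np.maximum(var, 1e-8))         # floor avoids sqrt of ~0
    return np.concatenate([mean, std])

# Example with random stand-in data: 50 frames, 8 channels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
gated = squeeze_excite(frames, rng.normal(size=(8, 2)), rng.normal(size=(2, 8)))
embedding = attentive_stats_pooling(gated, rng.normal(size=8))
```

With an all-zero attention vector the weights become uniform, so the pooling reduces to the plain per-channel mean and standard deviation; the attention only changes how much each frame contributes.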

Originally published on 2021-06-22; shared from the arXiv每日学术速递 WeChat official account.
