
Finance / Speech / Audio Processing arXiv Digest [6.28]

By: WeChat public account "arXiv每日学术速递" (arXiv Daily Digest)
Published 2021-07-02 17:25:18 (original WeChat post: 2021-06-28)

Visit www.arxivdaily.com for the full digest with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and posting features. Click "Read the original" to visit.

q-fin (Quantitative Finance): 7 papers in total

cs.SD (Sound / Speech): 9 papers in total

eess.AS (Audio and Speech Processing): 9 papers in total

1. q-fin (Quantitative Finance):

【1】 Sovereign wealth funds: main activity trends

Authors: Oksana Mamina, Alexander Barannikov, Ludmila Gruzdeva
Affiliations: Associate Professor, Russian University of Transport, Moscow, Russia; Associate Professor, Russian Academy of National Economy and Public Administration
Comments: Total pages: 7. Keywords: export, investment, gas, oil, revenue, sovereign wealth fund, trend, commodity prices, market, investor, budget. JEL codes: E-00; E-6
Link: https://arxiv.org/abs/2106.13670
Abstract: Sovereign wealth funds are created in those countries whose budget is highly dependent on market factors, usually world commodity prices. At the same time, these funds are large institutional investors. An analysis of the nature of investments by the State Pension Fund Global of Norway showed that investments of the Fund are based on a seven-level model of diversifying its investments. This model can also be applied to the investments of the National Wealth Fund of Russia to increase its profitability.

【2】 Intergenerational risk sharing in a collective defined contribution pension system: a simulation study with Bayesian optimization

Authors: An Chen, Motonobu Kanagawa, Fangyuan Zhang
Link: https://arxiv.org/abs/2106.13644
Abstract: Pension reform is a crucial societal problem in many countries, and traditional pension schemes, such as Pay-As-You-Go and Defined-Benefit schemes, are being replaced by more sustainable ones. One challenge for a public pension system is the management of a systematic risk that affects all individuals in one generation (e.g., that caused by a worse economic situation). Such a risk cannot be diversified within one generation, but may be reduced by sharing with other (younger and/or older) generations, i.e., by intergenerational risk sharing (IRS). In this work, we investigate IRS in a Collective Defined-Contribution (CDC) pension system. We consider a CDC pension model with overlapping multiple generations, in which a funding-ratio-linked declaration rate is used as a means of IRS. We perform an extensive simulation study to investigate the mechanism of IRS. One of our main findings is that IRS works particularly effectively for protecting pension participants in the worst scenarios of a tough financial market. Apart from these economic contributions, we make a simulation-methodological contribution to pension studies by employing Bayesian optimization, a modern machine learning approach to black-box optimization, to systematically search for optimal parameters in our pension model.
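
The funding-ratio-linked declaration mechanism can be illustrated with a toy Monte Carlo sketch. This is not the authors' model: the fund dynamics, the parameter values, and the plain grid search standing in for Bayesian optimization are all illustrative assumptions.

```python
# Toy illustration (not the paper's model): a collective fund declares a
# return that depends on its funding ratio, and we grid-search the linkage
# parameter by Monte Carlo. The paper replaces this naive search with
# Bayesian optimization over a far richer overlapping-generations model.
import numpy as np

rng = np.random.default_rng(0)

def simulate(alpha, n_years=40, n_paths=2000):
    """Score a funding-ratio-linked declaration rule on the worst scenarios."""
    assets = np.full(n_paths, 1.0)       # hypothetical initial assets
    liabilities = np.full(n_paths, 1.0)  # hypothetical initial liabilities
    for _ in range(n_years):
        market = rng.normal(0.04, 0.12, n_paths)         # risky asset return
        funding_ratio = assets / liabilities
        declared = 0.02 + alpha * (funding_ratio - 1.0)  # rate rises with surplus
        assets *= 1.0 + market
        liabilities *= 1.0 + declared                    # benefits credited to members
    # evaluate on the worst 5% of paths, echoing the focus on tough markets
    return np.quantile(assets / liabilities, 0.05)

alphas = np.linspace(0.0, 0.5, 11)
scores = [simulate(a) for a in alphas]
best = alphas[int(np.argmax(scores))]
print(f"best linkage parameter (toy grid search): {best:.2f}")
```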

【3】 Political Power and Market Power

Authors: Bo Cowgill, Andrea Prat, Tommaso Valletti
Link: https://arxiv.org/abs/2106.13612
Abstract: We study the link between lobbying and industrial concentration. Using data for the past 20 years in the US, we show how lobbying increases when an industry becomes more concentrated, using mergers as shocks to concentration. This holds true both for expenditures on federal lobbying as well as expenditures on campaign contributions. Results are in line with the predictions of a model where lobbying is akin to a public good for incumbents, and thus typically underprovided, while a merger solves the coordination problem.

【4】 Bibliometric Analysis of Herding Behavior in Times of Crisis

Authors: Fenny Marietza, Ridwan Nurazi, Fitri Santi, Saiful
Affiliations: Economic and Business Faculty, University of Bengkulu
Link: https://arxiv.org/abs/2106.13598
Abstract: The social and psychological concept of herding behavior provides a suitable solution for understanding the behavioral biases that often occur in capital markets. The aim of this paper is to provide an overview of the broader bibliometric literature on the term and concept of herding behavior. Articles were collected with the help of software consisting of Publish or Perish (PoP), Google Scholar, Mendeley, and VOSviewer, through a systematic approach with explicit and reproducible methods. In addition, the articles were screened via Scimagojr.com (Q1, Q2, Q3, and Q4), analyzing 83 of 261 related articles from reputable and non-reputable journals from 1996 to 2021. Mendeley software was used to manage references, and classification of this database was performed using the VOSviewer software. Four clusters were reviewed; the words that appear most often in each group are the type of stock market, the type of crisis, and the factors that cause herding. These four clusters thus became the main research themes on the topic of herding in times of crisis, while methodology and strategy are themes for future research.

【5】 Relationship between Cultural Values, Sense of Community and Trust and the Effect of Trust in Workplace

Authors: Nazli Mohammad, Yvonne Stedham
Affiliations: College of Business, University of Nevada, Reno
Link: https://arxiv.org/abs/2106.13347
Abstract: This paper provides a general overview of different perspectives and studies on trust, offers a definition of trust, identifies factors that play a substantial role in developing social trust, and shows from which perspectives it can be fostered. The results show that trust plays an important role in the success of organizations involved in cross-national strategic partnerships. Trust can reduce transaction costs, promote inter-organizational relationships, and improve subordinate relationships with managers.

【6】 Game theory and scholarly publishing: premises for an agreement around open access

Authors: Abdelghani Maddi
Affiliations: Observatoire des Sciences et Techniques, Hcéres, Rue Albert Einstein, Paris, France
Link: https://arxiv.org/abs/2106.13321
Abstract: Actors in research and scientific publishing are gradually joining the Open-Access (OA) movement, which is gaining momentum to become nowadays at the heart of scientific policies in high-income countries. The rise of OA generates profound changes in the chain of production and dissemination of knowledge. Free access to peer-reviewed research methods and results has contributed to the dynamics of science observed in recent years. The modes of publication and access have also evolved; the classic model, based on journal subscriptions, is gradually giving way to new economic models that have appeared with the arrival of OA. The objective of this article is twofold. First, to propose a model for the publishing market based on the literature as well as on changes in open-science policies. Second, to analyze the publishing strategies of publishers and institutions. To do so, we rely on game theory in economics. Results show that in the short term, the publisher's equilibrium strategy is to adopt a hybrid publishing model, while the institutions' equilibrium strategy is to publish in OA. This equilibrium is not stable, and in the medium/long term the two players will converge on an OA publishing strategy. The analysis of the equilibrium in mixed strategies confirms this result.

【7】 Pricing and hedging contingent claims in a multi-asset binomial market

Authors: Jarek Kędra, Assaf Libman, Victoria Steblovskaya
Comments: 35 pages
Link: https://arxiv.org/abs/2106.13283
Abstract: We consider an incomplete multi-asset binomial market model. We prove that for a wide class of contingent claims the extremal multi-step martingale measure is a power of the corresponding single-step extremal martingale measure. This allows for closed-form formulas for the bounds of a no-arbitrage contingent claim price interval. We construct a feasible algorithm for computing those boundaries as well as the corresponding hedging strategies. Our results apply, for example, to European basket call and put options and Asian arithmetic average options.
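
For context, here is a minimal sketch of the standard single-asset Cox-Ross-Rubinstein binomial pricer for a European call. This is the complete-market baseline, not the paper's method; the paper instead derives price bounds for an incomplete multi-asset binomial market via extremal martingale measures. All parameter values below are illustrative.

```python
# Minimal single-asset, multi-step binomial (CRR) pricer for a European call.
import math

def crr_call_price(S0, K, r, sigma, T, n):
    dt = T / n
    u = math.exp(sigma * math.sqrt(dt))   # up factor
    d = 1.0 / u                           # down factor
    q = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-r * T)
    price = 0.0
    for k in range(n + 1):                # sum over number of up-moves
        prob = math.comb(n, k) * q**k * (1 - q)**(n - k)
        payoff = max(S0 * u**k * d**(n - k) - K, 0.0)
        price += prob * payoff
    return disc * price

print(crr_call_price(S0=100.0, K=100.0, r=0.02, sigma=0.2, T=1.0, n=200))
```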

2. cs.SD (Sound / Speech):

【1】 Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets

Authors: Amir Ivry, Baruch Berdugo, Israel Cohen
Affiliations: Technion - Israel Institute of Technology
Link: https://arxiv.org/abs/2106.13763
Abstract: We address voice activity detection in acoustic environments of transients and stationary noises, which often occur in real-life scenarios. We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure. This process is done through a deep encoder-decoder based neural network architecture. This structure involves an encoder that maps spectral features with temporal information to their low-dimensional representations, which are generated by applying the diffusion maps method. The encoder feeds a decoder that maps the embedded data back into the high-dimensional space. A deep neural network, which is trained to separate speech from non-speech frames, is obtained by concatenating the decoder to the encoder, resembling the known Diffusion nets architecture. Experimental results show enhanced performance compared to competing voice activity detection methods. The improvement is achieved in accuracy, robustness, and generalization ability. Our model performs in a real-time manner and can be integrated into audio-based communication systems. We also present a batch algorithm which obtains an even higher accuracy for offline applications.
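
A minimal PyTorch skeleton of the encoder-decoder-plus-classifier arrangement described in the abstract follows, assuming illustrative layer sizes and feature dimensions; the diffusion-maps targets and the paper's actual training procedure are not reproduced here.

```python
# Skeleton of an encoder-decoder VAD, loosely following the Diffusion-nets
# idea: an encoder compresses spectral frames to a low-dimensional embedding
# (trained in the paper against diffusion-map coordinates), a decoder maps
# back, and a classifier separates speech from non-speech frames.
import torch
import torch.nn as nn

class EncoderDecoderVAD(nn.Module):
    def __init__(self, feat_dim=160, embed_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, 1),            # speech / non-speech logit per frame
        )

    def forward(self, x):
        z = self.encoder(x)              # low-dimensional embedding
        x_hat = self.decoder(z)          # reconstruction in feature space
        return self.classifier(x_hat), z, x_hat

model = EncoderDecoderVAD()
frames = torch.randn(4, 160)             # batch of spectral feature frames
logits, _, _ = model(frames)
print(logits.shape)                       # torch.Size([4, 1])
```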

【2】 Nonlinear Acoustic Echo Cancellation with Deep Learning

Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Affiliations: Technion - Israel Institute of Technology, Technion City, Haifa, Israel
Comments: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.13754
Abstract: We propose a nonlinear acoustic echo cancellation system, which aims to model the echo path from the far-end signal to the near-end microphone in two parts. Inspired by the physical behavior of modern hands-free devices, we first introduce a novel neural network architecture that is specifically designed to model the nonlinear distortions these devices induce between receiving and playing the far-end signal. To account for variations between devices, we construct this network with trainable memory length and nonlinear activation functions that are not parameterized in advance, but are rather optimized during the training stage using the training data. Second, the network is succeeded by a standard adaptive linear filter that constantly tracks the echo path between the loudspeaker output and the microphone. During training, the network and filter are jointly optimized to learn the network parameters. This system requires 17 thousand parameters that consume 500 million floating-point operations per second and 40 kilobytes of memory. It also satisfies hands-free communication timing requirements on a standard neural processor, which renders it adequate for embedding on hands-free communication devices. Using 280 hours of real and synthetic data, experiments show advantageous performance compared to competing methods.
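
The second stage of the described system is a standard adaptive linear filter. Below is a generic NLMS echo-canceller sketch in NumPy; the learned nonlinear pre-distortion network is omitted, and the filter length, step size, and toy echo path are assumptions.

```python
# Generic normalized-LMS (NLMS) adaptive filter tracking a linear echo path
# from the far-end (loudspeaker) signal to the microphone.
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=256, mu=0.5, eps=1e-8):
    """Return the error signal (near-end estimate) after linear echo removal."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        e = mic[n] - w @ buf                     # cancel current echo estimate
        w += mu * e * buf / (buf @ buf + eps)    # normalized LMS update
        out[n] = e
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                                   # far-end signal
h = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)       # toy echo path
d = np.convolve(x, h)[: len(x)] + 0.01 * rng.standard_normal(16000)
e = nlms_echo_canceller(x, d)
print(float(np.mean(e[-4000:] ** 2)))            # residual power after adaptation
```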

【3】 Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

Authors: Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu
Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin, China; Tianjin International Engineering Institute, Tianjin University, Tianjin, China; Institute of Linguistics, Chinese Academy of Social Science, Beijing, China
Link: https://arxiv.org/abs/2106.13686
Abstract: Cued Speech (CS) is a visual communication system for deaf or hearing-impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods for automatic CS recognition suffer from a common problem, which is data scarcity. Until now, there are only two public single-speaker datasets, for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited-data problem. Firstly, we pretrain a teacher model for CS recognition with a large amount of open-source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, at the frame level, we exploit multi-task learning to weigh losses automatically and obtain the balance coefficient. Besides, we establish a five-speaker British English CS dataset for the first time. The proposed method is evaluated on the French and British English CS datasets, showing superior CS recognition performance over the state-of-the-art (SOTA) by a large margin.
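
A generic frame-level distillation loss of the kind mentioned in the abstract is sketched below in PyTorch. The temperature, the fixed weighting (the paper learns the balance automatically via multi-task learning), and the tensor shapes are illustrative assumptions, and the sequence-level term is omitted.

```python
# Frame-level knowledge distillation: the student matches the teacher's
# softened posteriors (KL term) while also fitting ground-truth labels (CE).
import torch
import torch.nn.functional as F

def frame_level_distillation_loss(student_logits, teacher_logits, labels,
                                  temperature=2.0, alpha=0.5):
    # student_logits, teacher_logits: (batch, frames, classes); labels: (batch, frames)
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(2, 50, 40, requires_grad=True)   # e.g. 40 phoneme classes
teacher = torch.randn(2, 50, 40)
labels = torch.randint(0, 40, (2, 50))
loss = frame_level_distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```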

【4】 Deep Residual Echo Suppression with a Tunable Tradeoff Between Signal Distortion and Echo Suppression

Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Affiliations: Technion - Israel Institute of Technology, Technion City, Haifa, Israel
Link: https://arxiv.org/abs/2106.13531
Abstract: In this paper, we propose a residual echo suppression method using a UNet neural network that directly maps the outputs of a linear acoustic echo canceler to the desired signal in the spectral domain. This system embeds a design parameter that allows a tunable tradeoff between the desired-signal distortion and residual echo suppression in double-talk scenarios. The system employs 136 thousand parameters, and requires 1.6 Giga floating-point operations per second and 10 Mega-bytes of memory. The implementation satisfies both the timing requirements of the AEC challenge and the computational and memory limitations of on-device applications. Experiments are conducted with 161 h of data from the AEC challenge database and from real independent recordings. We demonstrate the performance of the proposed system in real-life conditions and compare it with two competing methods regarding echo suppression and desired-signal distortion, generalization to various environments, and robustness to high echo levels.
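
A hedged sketch of how a single design parameter can trade desired-signal distortion against residual echo in a training loss is given below; the specific loss terms, tensor shapes, and parameter value are assumptions for illustration, not the paper's implementation.

```python
# Illustrative tradeoff loss: beta -> 0 favors low desired-signal distortion,
# beta -> 1 favors stronger residual echo suppression.
import torch

def tunable_res_loss(pred_spec, desired_spec, echo_spec, beta=0.5):
    distortion = torch.mean((pred_spec - desired_spec) ** 2)
    residual_echo = torch.mean((pred_spec * echo_spec.abs()) ** 2)  # penalize output where echo is strong
    return (1 - beta) * distortion + beta * residual_echo

pred = torch.randn(1, 257, 100)      # predicted spectrum (freq x frames)
desired = torch.randn(1, 257, 100)   # near-end target spectrum
echo = torch.randn(1, 257, 100)      # estimated residual-echo spectrum
print(float(tunable_res_loss(pred, desired, echo, beta=0.3)))
```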

【5】 Phoneme-aware and Channel-wise Attentive Learning for Text-Dependent Speaker Verification

Authors: Yan Liu, Zheng Li, Lin Li, Qingyang Hong
Affiliations: School of Electronic Science and Engineering, Xiamen University, China; School of Informatics, Xiamen University, China
Link: https://arxiv.org/abs/2106.13514
Abstract: This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, frame-level multi-task learning along with segment-level adversarial learning is adopted for speaker embedding extraction. Phoneme-aware attentive pooling is exploited on frame-level features in the main network for the speaker classifier, with the corresponding posterior probability for the phoneme distribution in the auxiliary subnet. Further, the introduction of Squeeze and Excitation (SE-block) performs dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-block from temporal and channel-wise aspects, respectively. The experiments conducted on the RSR2015 Part 1 database confirm that the proposed system achieves outstanding results for text-dependent SV.
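
The channel-wise recalibration mentioned above follows the standard Squeeze-and-Excitation formulation; a generic PyTorch version is sketched below with an assumed channel count and reduction ratio, not the paper's exact configuration.

```python
# Standard Squeeze-and-Excitation (SE) block over time-series feature maps.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):             # x: (batch, channels, time)
        s = x.mean(dim=-1)            # squeeze: global average over time
        w = self.fc(s).unsqueeze(-1)  # excitation: per-channel weights in (0, 1)
        return x * w                  # recalibrate channels

x = torch.randn(4, 256, 200)          # frame-level feature maps
print(SEBlock()(x).shape)             # torch.Size([4, 256, 200])
```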

【6】 Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments

Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Affiliations: Technion - Israel Institute of Technology, Technion City, Haifa, Israel
Comments: Accepted to ICASSP 2020
Link: https://arxiv.org/abs/2106.13511
Abstract: State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data. However, real acoustic environments are generally reverberant, which causes the performance to significantly deteriorate. To mitigate this mismatch between training data and real data, we simulate an augmented training set that contains nearly five million utterances. This extension comprises anechoic utterances and their reverberant modifications, generated by convolutions of the anechoic utterances with a variety of room impulse responses (RIRs). We consider five different models for generating RIRs, and five different VADs that are trained with the augmented training set. We test all trained systems in three different real reverberant environments. Experimental results show a 20% average increase in accuracy, precision, and recall for all detectors and response models, compared to anechoic training. Furthermore, one of the RIR models consistently yields better performance than the other models, for all the tested VADs. Additionally, one of the VADs consistently outperformed the other VADs in all experiments.
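
The augmentation described in the abstract amounts to convolving anechoic speech with a room impulse response. A minimal sketch with a synthetic exponentially decaying RIR follows; in the paper the RIRs come from five different generative models, and the signals here are stand-ins.

```python
# Reverberant data augmentation: convolve an (anechoic) utterance with a RIR.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
fs = 16000
speech = rng.standard_normal(3 * fs)                    # stand-in for an anechoic utterance
t = np.arange(int(0.4 * fs)) / fs
rir = rng.standard_normal(t.size) * np.exp(-t / 0.08)   # toy RIR with a decaying tail
rir /= np.max(np.abs(rir))

reverberant = fftconvolve(speech, rir)[: speech.size]   # reverberant training example
print(reverberant.shape)
```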

【7】 Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Authors: Hieu-Thi Luong, Junichi Yamagishi
Affiliations: National Institute of Informatics, Tokyo, Japan
Comments: To be presented at SSW11
Link: https://arxiv.org/abs/2106.13479
Abstract: Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representations, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart. By enforcing different policies over the latent spaces during training, we are able to obtain latent linguistic embeddings that take on different properties while having similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations, but has a discrete latent space that is useful for reducing the representation bit-rate, which is desirable for data transfer, or limiting information leakage, which is important for speaker anonymization and other tasks of that nature.
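
A minimal vector-quantization layer with a straight-through gradient, the basic mechanism behind a quantized linguistic embedding, is sketched below. Codebook size, embedding dimension, and the commitment weight are illustrative; the paper's TTS/VC training setup is not reproduced.

```python
# VQ-VAE style quantizer: snap each frame embedding to its nearest codebook
# vector, pass gradients straight through, and return the auxiliary losses.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                    # z: (batch, frames, dim)
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)       # distances to all codes
        idx = dist.argmin(dim=-1)
        q = self.codebook(idx).view_as(z)                    # nearest codebook vectors
        commit = ((q.detach() - z) ** 2).mean()
        codebook_loss = ((q - z.detach()) ** 2).mean()
        q = z + (q - z).detach()                             # straight-through estimator
        return q, codebook_loss + self.beta * commit, idx.view(z.shape[:-1])

vq = VectorQuantizer()
z = torch.randn(2, 100, 64)                                  # continuous frame embeddings
q, vq_loss, codes = vq(z)
print(q.shape, float(vq_loss))
```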

【8】 Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition

Authors: Zhengxi Liu, Yanmin Qian
Affiliations: Sun Yat-Sen University; MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Comments: Accepted to INTERSPEECH 2021
Link: https://arxiv.org/abs/2106.13419
Abstract: Recent studies have shown that neural vocoders based on generative adversarial networks (GANs) can generate high-quality audio. While GAN based neural vocoders have been shown to be computationally much more efficient than those based on autoregressive predictions, the real-time generation of the highest quality audio on CPU is still a very challenging task. One major computation of all GAN-based neural vocoders comes from the stacked upsampling layers, which are designed to match the output waveform's length and temporal resolution. Meanwhile, the computational complexity of upsampling networks is closely correlated with the number of samples generated for each window. To reduce the computation of the upsampling layers, we propose a new GAN based neural vocoder called Basis-MelGAN, where the raw audio samples are decomposed with a learned basis and their associated weights. As the prediction targets of Basis-MelGAN are the weight values associated with each learned basis instead of the raw audio samples, the upsampling layers in Basis-MelGAN can be designed with much simpler networks. Compared with other GAN based neural vocoders, the proposed Basis-MelGAN produces comparably high-quality audio while significantly reducing computational complexity, from HiFi-GAN V1's 17.74 GFLOPs to 7.95 GFLOPs.
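
The decomposition idea can be sketched in a few lines: the generator predicts per-frame weights over a learned waveform basis, and the audio is rebuilt as a weighted sum of basis vectors. The basis size, frame length, and the non-overlapping framing below are simplifying assumptions, not the paper's configuration.

```python
# Weighted-basis synthesis: audio is reconstructed from per-frame weights
# over a (here random, in practice learned) waveform basis.
import torch

num_basis, frame_len, num_frames = 64, 256, 100
basis = torch.randn(num_basis, frame_len)          # learned basis (random here)
weights = torch.randn(num_frames, num_basis)       # what the generator would predict

frames = weights @ basis                           # (num_frames, frame_len)
audio = frames.reshape(-1)                         # concatenate non-overlapping frames
print(audio.shape)                                 # torch.Size([25600])
```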

【9】 Online Self-Attentive Gated RNNs for Real-Time Speaker Separation

Authors: Ori Kabeli, Yossi Adi, Zhenyu Tang, Buye Xu, Anurag Kumar
Affiliations: Facebook AI Research, TLV, Israel; Facebook Reality Labs, Redmond, WA, USA; University of Maryland, College Park, MD, USA
Link: https://arxiv.org/abs/2106.13493
Abstract: Deep neural networks have recently shown great success in the task of blind source separation, both under monaural and binaural settings. Although these methods were shown to produce high-quality separations, they were mainly applied under offline settings, in which the model has access to the full input signal while separating the signal. In this study, we convert a non-causal state-of-the-art separation model into a causal and real-time model and evaluate its performance under both online and offline settings. We compare the performance of the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions while exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65. Samples can be found under the following link: https://kwanum.github.io/sagrnnc-stream-results/.
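
Stateful online operation of the kind described above boils down to processing chunks causally while carrying the recurrent state across chunks. The tiny GRU model below is a stand-in for the paper's gated self-attentive separator; sizes and chunking are illustrative.

```python
# Streaming inference sketch: a causal separator applied chunk by chunk,
# with its hidden state carried between chunks.
import torch
import torch.nn as nn

class TinyCausalSeparator(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_sources=2):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)  # unidirectional = causal
        self.head = nn.Linear(hidden, feat_dim * num_sources)
        self.num_sources = num_sources

    def forward(self, x, state=None):               # x: (batch, frames, feat_dim)
        h, state = self.rnn(x, state)
        masks = self.head(h).view(x.shape[0], x.shape[1], self.num_sources, -1)
        return masks.sigmoid(), state               # per-source masks + carried state

model = TinyCausalSeparator()
stream = torch.randn(1, 1000, 64)                   # a long feature sequence
state, outputs = None, []
for chunk in stream.split(100, dim=1):              # process 100-frame chunks online
    masks, state = model(chunk, state)              # hidden state carried across chunks
    outputs.append(masks)
print(torch.cat(outputs, dim=1).shape)              # torch.Size([1, 1000, 2, 64])
```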

3. eess.AS (Audio and Speech Processing):

【1】 Online Self-Attentive Gated RNNs for Real-Time Speaker Separation
Authors: Ori Kabeli, Yossi Adi, Zhenyu Tang, Buye Xu, Anurag Kumar
Link: https://arxiv.org/abs/2106.13493
Abstract: identical to cs.SD entry 【9】 above (cross-listed).

【2】 Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets
Authors: Amir Ivry, Baruch Berdugo, Israel Cohen
Link: https://arxiv.org/abs/2106.13763
Abstract: identical to cs.SD entry 【1】 above (cross-listed).

【3】 Nonlinear Acoustic Echo Cancellation with Deep Learning
Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Link: https://arxiv.org/abs/2106.13754
Abstract: identical to cs.SD entry 【2】 above (cross-listed).

【4】 Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
Authors: Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu
Link: https://arxiv.org/abs/2106.13686
Abstract: identical to cs.SD entry 【3】 above (cross-listed).

【5】 Deep Residual Echo Suppression with a Tunable Tradeoff Between Signal Distortion and Echo Suppression
Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Link: https://arxiv.org/abs/2106.13531
Abstract: identical to cs.SD entry 【4】 above (cross-listed).

【6】 Phoneme-aware and Channel-wise Attentive Learning for Text-Dependent Speaker Verification
Authors: Yan Liu, Zheng Li, Lin Li, Qingyang Hong
Link: https://arxiv.org/abs/2106.13514
Abstract: identical to cs.SD entry 【5】 above (cross-listed).

【7】 Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments
Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Link: https://arxiv.org/abs/2106.13511
Abstract: identical to cs.SD entry 【6】 above (cross-listed).

【8】 Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance
Authors: Hieu-Thi Luong, Junichi Yamagishi
Link: https://arxiv.org/abs/2106.13479
Abstract: identical to cs.SD entry 【7】 above (cross-listed).

【9】 Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition
Authors: Zhengxi Liu, Yanmin Qian
Link: https://arxiv.org/abs/2106.13419
Abstract: identical to cs.SD entry 【8】 above (cross-listed).
