Finance / Sound / Audio Processing arXiv Daily Digest [10.19]

WeChat official account: arXiv Daily Academic Digest

Published 2021-10-21 16:19:50

This article is included in the column: arXiv Daily Academic Digest


q-fin (Quantitative Finance): 17 papers

cs.SD (Sound): 27 papers

eess.AS (Audio and Speech Processing): 27 papers

1. q-fin (Quantitative Finance):

【1】 Sector Volatility Prediction Performance Using GARCH Models and Artificial Neural Networks
Link: https://arxiv.org/abs/2110.09489

Authors: Curtis Nybo
Affiliation: University of London
Comments: 26 pages
Abstract: Recently, artificial neural networks (ANNs) have seen success in volatility prediction, but the literature is divided on where an ANN should be used rather than the common GARCH model. The purpose of this study is to compare the volatility prediction performance of ANN and GARCH models when applied to stocks with low, medium, and high volatility profiles. This approach intends to identify which model should be used for each case. The volatility profiles comprise five sectors that cover all stocks in the U.S. stock market from 2005 to 2020. Three GARCH specifications and three ANN architectures are examined for each sector, and the most adequate model is chosen to move on to forecasting. The results indicate that the ANN model should be used for predicting the volatility of assets with low volatility profiles, and GARCH models should be used when predicting the volatility of medium- and high-volatility assets.
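For readers unfamiliar with the GARCH family compared in this entry, the following is a minimal sketch of the textbook GARCH(1,1) conditional-variance recursion. The abstract does not say which three GARCH specifications the paper examines, so the function name, parameter values, and return series here are illustrative assumptions only.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances from the standard GARCH(1,1) recursion.

    sigma2[t] = omega + alpha * returns[t-1]**2 + beta * sigma2[t-1]
    The recursion starts at the unconditional variance
    omega / (1 - alpha - beta), which requires alpha + beta < 1.
    """
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1")
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2


# Hypothetical daily returns with one large shock on the third day.
returns = [0.001, -0.002, 0.05, 0.001, -0.001]
variances = garch11_variance(returns, omega=1e-6, alpha=0.1, beta=0.85)
```

In this toy series the variance forecast for the day after the shock jumps and then decays slowly, which is the volatility clustering that GARCH models are designed to capture.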

【2】 Understanding jumps in high frequency digital asset markets
Link: https://arxiv.org/abs/2110.09429

Authors: Danial Saef, Odett Nagy, Sergej Sizov, Wolfgang Karl Härdle
Abstract: While attention is a predictor for digital asset prices, and jumps in Bitcoin prices are well known, we know little about its alternatives. Studying high-frequency crypto data gives us the unique possibility to confirm that cross-market digital asset returns are driven by high-frequency jumps clustered around black swan events, resembling volatility and trading volume seasonalities. Regressions show that intra-day jumps significantly influence end-of-day returns in size and direction. This provides fundamental research for crypto option pricing models. However, we need better econometric methods for capturing the specific market microstructure of cryptos. All calculations are reproducible via the quantlet.com technology.

【3】 Mean-Variance Portfolio Selection in Contagious Markets
Link: https://arxiv.org/abs/2110.09417

Authors: Yang Shen, Bin Zou
Comments: Accepted to SIAM Journal on Financial Mathematics
Abstract: We consider a mean-variance portfolio selection problem in a financial market with contagion risk. The risky assets follow a jump-diffusion model, in which jumps are driven by a multivariate Hawkes process with mutual-excitation effect. The mutual-excitation feature of the Hawkes process captures the contagion risk in the sense that each price jump of an asset increases the likelihood of future jumps not only in the same asset but also in other assets. We apply the stochastic maximum principle, backward stochastic differential equation theory, and linear-quadratic control techniques to solve the problem and obtain the efficient strategy and efficient frontier in semi-closed form, subject to a non-local partial differential equation. Numerical examples are provided to illustrate our results.
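The mutual-excitation mechanism described in this abstract can be illustrated with a small sketch of a Hawkes intensity. The exponential decay kernel, the parameter values, and the function name are assumptions made for illustration; the abstract does not specify the kernel used in the paper.

```python
import math


def hawkes_intensity(t, past_jumps, mu, alpha, beta):
    """Intensity of each component of a multivariate Hawkes process at time t.

    lambda_i(t) = mu[i] + sum over assets j and past jumps s of asset j (s < t)
                  of alpha[i][j] * exp(-beta * (t - s))
    past_jumps[j] is the list of jump times of asset j.
    """
    intensities = []
    for i in range(len(mu)):
        lam = mu[i]
        for j, times in enumerate(past_jumps):
            for s in times:
                if s < t:
                    lam += alpha[i][j] * math.exp(-beta * (t - s))
        intensities.append(lam)
    return intensities


mu = [0.1, 0.1]                    # baseline jump rates (assumed)
alpha = [[0.5, 0.3], [0.2, 0.4]]   # alpha[i][j]: impact of a jump in asset j on asset i
beta = 1.0                         # decay rate of the excitation (assumed)

before = hawkes_intensity(1.0, [[], []], mu, alpha, beta)
after = hawkes_intensity(1.0, [[0.5], []], mu, alpha, beta)
```

A single jump in asset 0 at time 0.5 raises the intensity of both assets at time 1.0, which is the contagion effect: the off-diagonal entry `alpha[1][0]` carries the shock from asset 0 to asset 1.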

【4】 Identifying the Effects of Sanctions on the Iranian Economy using Newspaper Coverage
Link: https://arxiv.org/abs/2110.09400

Authors: Dario Laudati, M. Hashem Pesaran
Affiliation: University of Southern California, USA, and Trinity College, Cambridge, UK
Abstract: This paper considers how sanctions affected the Iranian economy using a novel measure of sanctions intensity based on daily newspaper coverage. It finds sanctions to have significant effects on exchange rates, inflation, and output growth, with the Iranian rial over-reacting to sanctions, followed by a rise in inflation and a fall in output. In the absence of sanctions, Iran's average annual growth could have been around 4-5 per cent, as compared to the 3 per cent realized. Sanctions are also found to have adverse effects on employment, labor force participation, and secondary and high-school education, with such effects amplified for females.

【5】 Predicting Status of Pre and Post M&A Deals Using Machine Learning and Deep Learning Techniques
Link: https://arxiv.org/abs/2110.09315

Authors: Tugce Karatas, Ali Hirsa
Affiliation: Department of IEOR, Columbia University
Comments: 21 pages
Abstract: Risk arbitrage or merger arbitrage is a well-known investment strategy that speculates on the success of M&A deals. Prediction of the deal status in advance is of great importance for risk arbitrageurs. If a deal is mistakenly classified as a completed deal, then enormous cost can be incurred as a result of investing in target company shares. Conversely, risk arbitrageurs may lose the opportunity to make a profit. In this paper, we present an ML and DL based methodology for the takeover success prediction problem. We initially apply various ML techniques for data preprocessing, such as kNN for data imputation, PCA for lower-dimensional representation of numerical variables, MCA for categorical variables, and an LSTM autoencoder for sentiment scores. We experiment with different cost functions, different evaluation metrics, and oversampling techniques to address class imbalance in our dataset. We then implement feedforward neural networks to predict the success of the deal status. Our preliminary results indicate that our methodology outperforms benchmark models such as logit and weighted logit models. We also integrate sentiment scores into our methodology using different model architectures, but our preliminary results show that the performance does not change much compared to the simple FFNN framework. We will explore different architectures and employ thorough hyperparameter tuning for sentiment scores as future work.
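One common way to make a cost function sensitive to class imbalance, of the kind this abstract mentions, is a class-weighted cross-entropy. The sketch below is a generic illustration, not the authors' loss: the labels, predicted probabilities, weights, and the choice of which class is the majority are all hypothetical.

```python
import math


def weighted_log_loss(y_true, p_pred, w_pos, w_neg):
    """Class-weighted binary cross-entropy, averaged over samples.

    Each positive example (label 1) contributes -w_pos * log(p) and each
    negative example (label 0) contributes -w_neg * log(1 - p).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical labels: 1 = deal completed (majority), 0 = deal failed (minority).
y_true = [1, 1, 1, 1, 0]
p_pred = [0.9, 0.9, 0.9, 0.9, 0.9]   # a model that always predicts "completed"

unweighted = weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0)
# "Balanced" weights n / (2 * n_class): 5 / (2 * 4) and 5 / (2 * 1).
weighted = weighted_log_loss(y_true, p_pred, w_pos=0.625, w_neg=2.5)
```

With the balanced weights, the single misclassified minority example dominates the loss, so a model that ignores the minority class is penalized more heavily than under the unweighted loss.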

【6】 Prosecutor Politics: The Impact of Election Cycles on Criminal Sentencing in the Era of Rising Incarceration
Link: https://arxiv.org/abs/2110.09169

Authors: Chika O. Okafor
Abstract: I investigate how political incentives affect the behavior of district attorneys (DAs). I develop a theoretical model that predicts DAs will increase sentencing intensity in an election period compared to the period prior. To empirically test this prediction, I compile one of the most comprehensive datasets to date on the political careers of all district attorneys in office during the steepest rise in incarceration in U.S. history (roughly 1986-2006). Using quasi-experimental methods, I find causal evidence that being in a DA election year increases total admissions per capita and total months sentenced per capita. I estimate that the election year effects on admissions are akin to moving 0.85 standard deviations along the distribution of DA behavior within state (e.g., going from the 50th to the 80th percentile in sentencing intensity). I find evidence that election effects are larger (1) when DA elections are contested, (2) in Republican counties, and (3) in the southern United States; all these factors are consistent with the perspective that election effects arise from political incentives influencing DAs. Further, I find that district attorney election effects decline over the period 1986-2006, in tandem with U.S. public opinion softening regarding criminal punishment. These findings suggest DA behavior may respond to voter preferences, in particular to public sentiment regarding the harshness of the court system.
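The correspondence between a 0.85 standard deviation shift and a move from the 50th to roughly the 80th percentile can be checked with a short calculation. The sketch assumes the distribution of DA behavior is approximately normal, which the abstract does not state.

```python
from statistics import NormalDist

# Assumption: sentencing intensity is roughly normally distributed within state.
start_percentile = 0.50
shift_in_sd = 0.85

z_start = NormalDist().inv_cdf(start_percentile)        # z-score of the median
end_percentile = NormalDist().cdf(z_start + shift_in_sd)
```

Under that assumption `end_percentile` comes out at about 0.80, consistent with the example given in the abstract.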

【7】 Study of The Relationship Between Public and Private Venture Capitalists in France: A Qualitative Approach
Link: https://arxiv.org/abs/2110.09098

Authors: Jonathan Labbe
Affiliation: Université de Lorraine - CEREFIGE Laboratory – PhD Student
Comments: 3rd International Conference on Digital, Innovation, Entrepreneurship & Financing, Oct 2021, Lyon, France
Abstract: This research focuses on the study of relationships between public and private equity investors in France. In this regard, we need to understand the formal or informal nature of interactions that can sometimes take place within traditional innovation networks (Djellal & Gallouj, 2018). To this end, our article draws on a public-private partnerships (PPPs) approach and the resource-based view theory. These perspectives emphasize the complementary role of disciplinary and incentive mechanisms as well as the exchange of specific resources as levers for value creation. Moreover, these orientations, crossed with the perspective of a hybrid form of co-investment, allow us to build a coherent and explanatory framework of the mixed syndication phenomenon. Our methodology is based on a qualitative approach with an interpretative aim, which includes twenty-seven semi-structured interviews. These data were subjected to a thematic content analysis using NVivo software. The results suggest that the relationships between public and private venture capitalists (VCs) of a formal or informal nature, more specifically in a syndication context, at a national or regional level, are representative of an "economico-cognitive" (Farrugia, 2014, page 6) approach to networking and innovation. Moreover, the phenomenon of mixed syndication reveals a context of hybridization of public and private actors that would allow the private VCs to benefit from the distribution of wealth when the company develops its innovation. We can also identify a process related to a quest for legitimacy on the part of the public actor, characterized by its controlling role within the public-private partnership (Beuve and Saussier, 2019). Finally, our study has some limitations. One example is the measurement of the effects of relationships on "visible" or "invisible" innovation (Djellal & Gallouj, 2018, page 90).

【8】 Predictable Forward Performance Processes: Infrequent Evaluation and Robo-Advising Applications
Link: https://arxiv.org/abs/2110.08900

Authors: Gechun Liang, Moris S. Strub, Yuwei Wang
Abstract: We study discrete-time predictable forward processes when trading times do not coincide with performance evaluation times in the binomial tree model for the financial market. The key step in the construction of these processes is to solve a linear functional equation of higher order associated with the inverse problem driving the evolution of the predictable forward process. We provide sufficient conditions for the existence and uniqueness of the predictable forward process, together with an explicit construction under these conditions. Furthermore, we show that these processes are time-monotone in the evaluation period. Finally, we argue that predictable forward preferences are a viable framework to model preferences for robo-advising applications and determine an optimal interaction schedule between client and robo-advisor that balances a tradeoff between increasing uncertainty about the client's beliefs on the financial market and an interaction cost.

【9】 Estimating returns to special education: combining machine learning and text analysis to address confounding
Link: https://arxiv.org/abs/2110.08807

Authors: Aurélien Sallin
Abstract: While the number of students with identified special needs is increasing in developed countries, there is little evidence on academic outcomes and labor market integration returns to special education. I present results from the first ever study to examine short- and long-term returns to special education programs using recent methods in causal machine learning and computational text analysis. I find that special education programs in inclusive settings have positive returns on academic performance in math and language as well as on employment and wages. Moreover, I uncover a positive effect of inclusive special education programs in comparison to segregated programs. However, I find that segregation has benefits for some students: students with emotional or behavioral problems, and nonnative students. Finally, using shallow decision trees, I deliver optimal placement rules that increase overall returns for students with special needs and lower special education costs. These placement rules would reallocate most students with special needs from segregation to inclusion, which reinforces the conclusion that inclusion is beneficial to students with special needs.

【10】 Gender identity and relative income within household: Evidence from China
Link: https://arxiv.org/abs/2110.08723

Authors: Han Dongcheng, Kong Fanbo, Wang Zixun
Comments: This is a paper written by three high school students living in Singapore. We look forward to valuable comments and suggestions to improve the paper
Abstract: How does women's obedience to traditional gender roles affect their labour outcomes? To investigate this question, we employ discontinuity tests and fixed effect regressions with time lag to measure how married women in China diminish their labour outcomes so as to maintain the bread-winning status of their husbands. In the first half of this research, our discontinuity test exhibits a missing mass of married women who just out-earn their husbands, which is interpreted as evidence that these women diminish their earnings under the influence of gender norms. In the second half, we use fixed effect regressions with time lag to assess the change in a woman's future labour outcomes if she currently earns more than her husband. Our results suggest that women's future labour participation decisions (whether they still join the workforce) are unaffected, but their yearly incomes and weekly working hours will be reduced in the future. Lastly, heterogeneity analyses are conducted, showing that low-income and less educated married women are more susceptible to the influence of gender norms.

【11】 Star-shaped acceptability indexes
Link: https://arxiv.org/abs/2110.08630

Authors: Marcelo Brutti Righi
Affiliation: Federal University of Rio Grande do Sul
Abstract: We propose the star-shaped acceptability indexes as generalizations of both the approaches of Cherny and Madan (2009) and Rosazza Gianin and Sgarra (2013), in the same vein as star-shaped risk measures generalize both the classes of coherent and convex risk measures. We characterize acceptability indexes through star-shaped risk measures, star-shaped acceptance sets, and as the minimum of some family of quasi-concave acceptability indexes. Further, we introduce concrete examples under our approach linked to Value at Risk, risk-adjusted reward on capital, reward-based gain-loss ratio, monotone reward-deviation ratio, and robust acceptability indexes. We also present an application regarding optimization of performance.

【12】 The elastic origins of tail asymmetry
Link: https://arxiv.org/abs/2110.08612

Authors: Satoshi Nakano, Kazuhiko Nishimura
Abstract: Based on a multisector general equilibrium framework, we show that the sectoral elasticity of substitution plays the key role in the evolution of robustness and the asymmetric tails in the aggregate macroeconomic fluctuations. Non-unitary elasticity of substitution of the production networks renders a nonlinear Domar aggregation where normal sectoral productivity shocks translate into non-normal aggregated shocks. We estimate 100 sectoral elasticities of substitution, using the time-series linked input-output tables for Japan, and find that the production economy is elastic overall. Along with the vested assessment of an inelastic production economy for the US, the contrasting tail asymmetry of the distribution of aggregated shocks between the US and Japan is explored.

【13】 Dropping diversity of products of large US firms: Models and measures
Link: https://arxiv.org/abs/2110.08367

Authors: Ananthan Nambiar, Tobias Rubel, James McCaull, Jon deVries, Mark Bedau
Affiliation: Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA; Department of Philosophy, Reed College, Portland, Oregon, USA
Abstract: It is widely assumed that in our lifetimes the products available in the global economy have become more diverse. This assumption is difficult to investigate directly, however, because it is difficult to collect the necessary data about every product in an economy each year. We solve this problem by mining publicly available textual descriptions of the products of every large US firm each year from 1997 to 2017. Although many aspects of economic productivity have been steadily rising during this period, our text-based measurements show that the diversity of the products of at least large US firms has steadily declined. This downward trend is visible using a variety of product diversity metrics, including some that depend on a measurement of the similarity of the products of every single pair of firms. The current state of the art in comprehensive and detailed firm-similarity measurements is a Boolean word vector model due to Hoberg and Phillips. We measure diversity using firm similarities from this Boolean model and two more sophisticated variants, and we consistently observe a significant dropping trend in product diversity. These results make it possible to frame and start to test specific hypotheses for explaining the dropping product diversity trend.
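To make the idea of a Boolean word vector similarity concrete, here is a minimal sketch that treats each firm's product description as a set of words and compares firms by cosine similarity. The use of cosine similarity, the mean pairwise dissimilarity as a diversity measure, and the toy vocabularies are assumptions for illustration; the abstract does not spell out the exact metrics used.

```python
import math
from itertools import combinations


def boolean_cosine(words_a, words_b):
    """Cosine similarity of two Boolean word vectors, given as word sets.

    For 0/1 vectors the dot product is the number of shared words and each
    norm is the square root of the number of distinct words.
    """
    if not words_a or not words_b:
        return 0.0
    shared = len(words_a & words_b)
    return shared / math.sqrt(len(words_a) * len(words_b))


def mean_pairwise_dissimilarity(firm_words):
    """One possible diversity measure: average of 1 - similarity over firm pairs."""
    pairs = list(combinations(firm_words.values(), 2))
    return sum(1.0 - boolean_cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical product-description vocabularies for three firms.
firms = {
    "A": {"cloud", "software", "storage"},
    "B": {"cloud", "software", "analytics"},
    "C": {"steel", "mining", "rail"},
}
sim_ab = boolean_cosine(firms["A"], firms["B"])
diversity = mean_pairwise_dissimilarity(firms)
```

Firms A and B share two of three words and score 2/3, while C shares nothing with either. A falling value of `diversity` over the years would correspond to the kind of downward trend the abstract reports.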

【14】 Semimartingale and continuous-time Markov chain approximation for rough stochastic local volatility models
Link: https://arxiv.org/abs/2110.08320

Authors: Jingtang Ma, Wensheng Yang, Zhenyu Cui
Comments: 27 pages
Abstract: Rough volatility models have recently been empirically shown to provide a good fit to historical volatility time series and implied volatility smiles of SPX options. They are continuous-time stochastic volatility models whose volatility process is driven by a fractional Brownian motion with Hurst parameter less than one half. Because this process is neither a semimartingale nor a Markov process, there is no unified method that both applies to all rough volatility models and is computationally efficient. This paper proposes a semimartingale and continuous-time Markov chain (CTMC) approximation approach for the general class of rough stochastic local volatility (RSLV) models. In particular, we introduce the perturbed stochastic local volatility (PSLV) model as the semimartingale approximation for the RSLV model and establish its existence, uniqueness and Markovian representation. We propose a fast CTMC algorithm and prove its weak convergence. Numerical experiments demonstrate the accuracy and high efficiency of the method in pricing European, barrier and American options. Compared with the existing literature, a significant reduction in the CPU time needed to arrive at the same level of accuracy is observed.

【15】 Numeraire-invariant quadratic hedging and mean-variance portfolio allocation
Link: https://arxiv.org/abs/2110.09416

Authors: Aleš Černý, Christoph Czichowsky, Jan Kallsen
Affiliation: Bayes Business School; LSE; CAU Kiel
Comments: 35 pages
Abstract: The paper investigates quadratic hedging in a general semimartingale market that does not necessarily contain a risk-free asset. An equivalence result for hedging with and without numeraire change is established. This permits direct computation of the optimal strategy without choosing a reference asset and/or performing a numeraire change. New explicit expressions for optimal strategies are obtained, featuring the use of oblique projections that provide unified treatment of the case with and without a risk-free asset. The main result advances our understanding of the efficient frontier formation in the most general case where a risk-free asset may not be present. Several illustrations of the numeraire-invariant approach are given.

【16】 Persuasion by Dimension Reduction
Link: https://arxiv.org/abs/2110.08884

Authors: Semyon Malamud, Andreas Schrimpf
Comments: arXiv admin note: text overlap with arXiv:2102.10909
Abstract: How should an agent (the sender) observing multi-dimensional data (the state vector) persuade another agent to take the desired action? We show that it is always optimal for the sender to perform a (non-linear) dimension reduction by projecting the state vector onto a lower-dimensional object that we call the "optimal information manifold." We characterize geometric properties of this manifold and link them to the sender's preferences. Optimal policy splits information into "good" and "bad" components. When the sender's marginal utility is linear, revealing the full magnitude of good information is always optimal. In contrast, with concave marginal utility, optimal information design conceals the extreme realizations of good information and only reveals its direction (sign). We illustrate these effects by explicitly solving several multi-dimensional Bayesian persuasion problems.

【17】 Scaling Blockchains: Can Elected Committees Help?
Link: https://arxiv.org/abs/2110.08673

Authors: Alon Benhaim, Brett Hemenway Falk, Gerry Tsoukalas
Abstract: In the high-stakes race to develop more scalable blockchains, some platforms (Cosmos, EOS, TRON, etc.) have adopted committee-based consensus protocols, whereby the blockchain's record-keeping rights are entrusted to a committee of elected block producers. In theory, the smaller the committee, the faster the blockchain can reach consensus and the more it can scale. What is less clear is whether this mechanism ensures that honest committees can be consistently elected, given that voters typically have limited information. Using EOS' Delegated Proof of Stake (DPoS) protocol as a backdrop, we show that identifying the optimal voting strategy is complex and practically out of reach. We empirically characterize some simpler (suboptimal) voting strategies that token holders resort to in practice and show that these nonetheless converge to optimality exponentially quickly. This yields efficiency gains over other PoS protocols that rely on randomized block producer selection. Our results suggest that (elected) committee-based consensus, as implemented in DPoS, can be robust and efficient, despite its complexity.
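As a rough illustration of committee election by stake-weighted voting, the sketch below tallies approval votes weighted by stake and elects the top candidates. This is a simplified, hypothetical model and not the EOS protocol itself: the voter names, stakes, candidate names, and the tie-breaking rule are all assumptions.

```python
def elect_committee(votes, stakes, committee_size):
    """Elect the candidates with the largest stake-weighted approval totals.

    votes[v] is the set of candidates approved by voter v and stakes[v] is
    that voter's stake; every approved candidate receives the full stake.
    Ties are broken alphabetically to keep the result deterministic.
    """
    totals = {}
    for voter, approved in votes.items():
        for candidate in approved:
            totals[candidate] = totals.get(candidate, 0) + stakes[voter]
    ranked = sorted(totals, key=lambda c: (-totals[c], c))
    return ranked[:committee_size]


# Hypothetical token holders, stakes, and block-producer candidates.
stakes = {"alice": 50, "bob": 30, "carol": 20}
votes = {
    "alice": {"bp1", "bp2"},
    "bob": {"bp2", "bp3"},
    "carol": {"bp3", "bp4"},
}
committee = elect_committee(votes, stakes, committee_size=2)
```

Here `bp2` collects 80 units of stake and `bp1` ties with `bp3` at 50, so the committee is `["bp2", "bp1"]` under the alphabetical tie-break. The question the paper studies is whether voters with limited information would reliably pick honest candidates in such a scheme.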

2. cs.SD (Sound):

【1】 FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection
Link: https://arxiv.org/abs/2110.09441

Authors: Zhenyu Zhang, Yewei Gu, Xiaowei Yi, Xianfeng Zhao
Affiliation: State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences
Abstract: With the increasing development of text-to-speech (TTS) and voice conversion (VC) technologies, the detection of synthetic speech has suffered dramatically. In order to promote the development of synthetic speech detection models against Mandarin TTS and VC technologies, we have constructed a challenging Mandarin dataset and organized the accompanying audio track of the first fake media forensic challenge of the China Society of Image and Graphics (FMFCC-A). The FMFCC-A dataset is by far the largest publicly available Mandarin dataset for synthetic speech detection; it contains 40,000 synthesized Mandarin utterances generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterances collected from 58 speakers. The FMFCC-A dataset is divided into training, development and evaluation sets, which are used for research on the detection of synthesized Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations. In addition to describing the construction of the FMFCC-A dataset, we provide a detailed analysis of two baseline methods and the top-performing submissions from the FMFCC-A, which illustrates the usefulness and challenge of the FMFCC-A dataset. We hope that the FMFCC-A dataset can fill the gap left by the lack of Mandarin datasets for synthetic speech detection.

【2】 Automatic Learning of Subword Dependent Model Scales
Link: https://arxiv.org/abs/2110.09324

Authors: Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Affiliation: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany
Comments: submitted to ICASSP 2022
Abstract: To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work we propose to optimize these scaling parameters via automatic differentiation and stochastic gradient descent, similar to the neural network model parameters. We show on the LibriSpeech (LBS) and Switchboard (SWB) corpora that the model scales for a combination of attention-based encoder-decoder acoustic model and language model can be learned as effectively as with manual tuning. We further extend this approach to subword-dependent model scales, which could not be tuned manually; this leads to a 7% improvement on LBS and 3% on SWB. We also show that joint training of scales and model parameters is possible and gives an additional 6% improvement on LBS.
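The log-linear combination with learnable scales described in this abstract can be sketched on a toy n-best list. This is a minimal illustration under stated assumptions: it uses a hand-derived gradient and plain gradient descent instead of automatic differentiation, global rather than subword-dependent scales, and made-up log-probabilities.

```python
import math


def posteriors(hyp_logprobs, scales):
    """Softmax over hypotheses of the log-linear combined score."""
    scores = [sum(s * lp for s, lp in zip(scales, lps)) for lps in hyp_logprobs]
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def train_scales(hyp_logprobs, correct, scales, lr=0.1, steps=200):
    """Gradient descent on -log p(correct) with respect to the model scales."""
    scales = list(scales)
    for _ in range(steps):
        post = posteriors(hyp_logprobs, scales)
        for k in range(len(scales)):
            expected = sum(p * lps[k] for p, lps in zip(post, hyp_logprobs))
            grad = -(hyp_logprobs[correct][k] - expected)
            scales[k] -= lr * grad
    return scales


# Hypothetical 3-best list: (acoustic log-prob, language-model log-prob).
nbest = [(-4.0, -9.0), (-5.0, -3.0), (-6.0, -6.0)]
correct = 1  # index of the reference transcription

initial = [1.0, 0.0]                    # acoustic model only
learned = train_scales(nbest, correct, initial)
```

With the acoustic model alone, hypothesis 0 scores highest. After training, the language-model scale becomes positive and the reference hypothesis receives the largest posterior, which is the effect manual scale tuning is normally used to achieve.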

【3】 Intent Classification Using Pre-Trained Embeddings For Low Resource Languages
Link: https://arxiv.org/abs/2110.09264

Authors: Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W Black, Rajiv Ratn Shah
Affiliation: MIDAS, IIIT Delhi, India; J.P.Morgan AI Research, New York, USA; Carnegie Mellon University
Abstract: Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used in identifying the spoken intent. We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil and achieves competitive results on English. Furthermore, we present a quantitative analysis of how the performance scales with the number of training examples used per intent.

【4】 Efficient Sequence Training of Attention Models using Approximative Recombination Link: https://arxiv.org/abs/2110.09245

Authors: Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany Comments: submitted to ICASSP 2022 Abstract: Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obtained from beam search. This work proposes to perform (approximative) recombinations of hypotheses during beam search if they share a common local history. The error that is incurred by the approximation is analyzed, and it is shown that using this technique the effective beam size can be increased by several orders of magnitude without significantly increasing the computational requirements. Lastly, it is shown that this technique can be used to effectively perform sequence discriminative training for attention-based encoder-decoder acoustic models on the LibriSpeech task.
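
The recombination idea can be sketched as follows. This is only an illustration under stated assumptions: hypotheses are plain (label tuple, log score) pairs, "common local history" is taken to mean the last `history_len` labels, and the function name `recombine` is hypothetical; the paper's actual criterion and bookkeeping may differ.

```python
import math
from collections import defaultdict

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def recombine(hypotheses, history_len=2):
    """Merge beam hypotheses whose last `history_len` labels coincide.

    hypotheses: list of (label_tuple, log_score).
    For each distinct local history, the best-scoring label sequence
    survives and carries the log-sum of all merged scores.
    """
    groups = defaultdict(list)
    for labels, score in hypotheses:
        groups[labels[-history_len:]].append((labels, score))
    merged = []
    for members in groups.values():
        best_labels = max(members, key=lambda m: m[1])[0]
        merged.append((best_labels, logsumexp([s for _, s in members])))
    return merged

beam = [
    (("a", "b", "c"), math.log(0.2)),
    (("x", "b", "c"), math.log(0.1)),
    (("a", "b", "d"), math.log(0.3)),
]
merged_beam = recombine(beam, history_len=2)
```

One beam slot then represents the probability mass of several full histories, which is why the effective beam size can grow without a matching growth in computation.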

【5】 EIHW-MTG: Second DiCOVA Challenge System Report Link: https://arxiv.org/abs/2110.09239

Authors: Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller Affiliations: EIHW (Chair of Embedded Intelligence for Health Care & Wellbeing), University of Augsburg, Germany; MTG (Music Technology Group), Universitat Pompeu Fabra, Spain; Joint Research Centre, European Commission, Spain Abstract: This work presents an outer product-based approach to fuse the embedded representations generated from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of a CNN trained from scratch and a ResNet18 architecture fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06% is obtained on the test partition when using a CNN trained from scratch with contextual attention mechanisms. When using the ResNet18 architecture for feature extraction, the baseline model scores the highest performance with an AUC of 84.26%.
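
A minimal sketch of outer-product fusion, assuming each modality has already been reduced to a fixed-length embedding. The function names and toy vectors are illustrative; the abstract does not specify how the fused tensor is flattened or fed to the classifier.

```python
from functools import reduce

def outer(u, v):
    """Flattened outer product of two vectors."""
    return [a * b for a in u for b in v]

def fuse(*embeddings):
    """Fuse any number of embeddings by chained outer products."""
    return reduce(outer, embeddings)

# Toy embeddings standing in for the breath and speech representations.
breath = [1.0, 2.0]
speech = [3.0, 4.0]
fused = fuse(breath, speech)
```

The fused vector has one entry per pair of input dimensions, so its length is the product of the input lengths; this grows quickly when three modalities are combined.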

【6】 Learning Models for Query by Vocal Percussion: A Comparative Study Link: https://arxiv.org/abs/2110.09223

Authors: Alejandro Delgado, SkoT McDonald, Ning Xu, Charalampos Saitis, Mark Sandler Affiliations: Roli, Queen Mary University of London Comments: Published in proceedings of the International Computer Music Conference (ICMC) 2021 Abstract: The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Thus, the automatic retrieval of drum sounds using vocal percussion can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. Here we explore different strategies to perform this type of query, making use of both traditional machine learning algorithms and recent deep learning techniques. The main hyperparameters of the models involved are carefully selected by feeding performance metrics to a grid search algorithm. We also look into several audio data augmentation techniques, which can potentially regularise deep learning models and improve generalisation. We compare the final performances in terms of effectiveness (classification accuracy), efficiency (computational speed), stability (performance consistency), and interpretability (decision patterns), and discuss the relevance of these results when it comes to the design of successful query-by-vocal-percussion systems.

【7】 SpecTNT: a Time-Frequency Transformer for Music Audio Link: https://arxiv.org/abs/2110.09127

Authors: Wei-Tsung Lu, Ju-Chiang Wang, Minz Won, Keunwoo Choi, Xuchen Song Affiliations: ByteDance, Mountain View, California, United States; Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain Abstract: Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use the Transformer as a temporal feature aggregator that acts similarly to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. Later, the FCTs are linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Then, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and other design choices is further examined through ablation studies.

【8】 KaraTuner: Towards end to end natural pitch correction for singing voice in karaoke Link: https://arxiv.org/abs/2110.09121

Authors: Xiaobin Zhuang, Huiran Yu, Weifeng Zhao, Tao Jiang, Peng Hu, Simon Lui, Wenjiang Zhou Affiliations: Tencent Music Entertainment Lyra Lab, Shenzhen, China; Carnegie Mellon University, Pittsburgh, PA, USA Comments: Submitted to ICASSP 2022 Abstract: An automatic pitch correction system typically includes several stages, such as pitch extraction, deviation estimation, pitch shift processing, and cross-fade smoothing. However, designing these components with strategies often requires domain expertise, and they are likely to fail on corner cases. In this paper, we present KaraTuner, an end-to-end neural architecture that predicts the pitch curve and resynthesizes the singing voice directly from the tuned pitch and vocal spectrum extracted from the original recordings. Several vital technical points have been introduced in KaraTuner to ensure pitch accuracy, pitch naturalness, timbre consistency, and sound quality. A feed-forward Transformer is employed in the pitch predictor to capture long-term dependencies in the vocal spectrum and musical note. We also develop a pitch-controllable vocoder based on a novel source-filter block and the Fre-GAN architecture. KaraTuner obtains a higher preference than the rule-based pitch correction approach through A/B tests, and perceptual experiments show that the proposed vocoder achieves significant advantages in timbre consistency and sound quality compared with the parametric WORLD vocoder and phase vocoder.
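
For context, the rule-based "deviation estimation" stage that the abstract contrasts with KaraTuner can be sketched as snapping each frame's fundamental frequency to the nearest equal-tempered semitone. This is a generic baseline written for illustration, not the system from the paper; the A4 = 440 Hz reference and the function names are assumptions.

```python
import math

A4_HZ = 440.0
A4_MIDI = 69

def hz_to_midi(f0_hz):
    return A4_MIDI + 12.0 * math.log2(f0_hz / A4_HZ)

def midi_to_hz(midi):
    return A4_HZ * 2.0 ** ((midi - A4_MIDI) / 12.0)

def correct_pitch(f0_hz):
    """Return (corrected_hz, deviation_in_cents) for one voiced frame."""
    midi = hz_to_midi(f0_hz)
    target = round(midi)  # nearest semitone
    return midi_to_hz(target), 100.0 * (midi - target)

corrected_hz, cents_off = correct_pitch(450.0)
```

Hard snapping of this kind ignores the intended melody and natural vibrato, which is one reason such rules can fail on corner cases.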

【9】 Real Additive Margin Softmax for Speaker Verification Link: https://arxiv.org/abs/2110.09116

Authors: Lantian Li, Ruiqian Nai, Dong Wang Affiliations: Center for Speech and Language Technologies, BNRist, Tsinghua University, China Comments: Submitted to ICASSP 2022 Abstract: The additive margin softmax (AM-Softmax) loss has delivered remarkable performance in speaker verification. A supposed behavior of AM-Softmax is that it can shrink within-class variation by putting emphasis on target logits, which in turn improves the margin between target and non-target classes. In this paper, we conduct a careful analysis of the behavior of the AM-Softmax loss, and show that this loss does not implement real max-margin training. Based on this observation, we present a Real AM-Softmax loss which involves a true margin function in the softmax training. Experiments conducted on VoxCeleb1, SITW and CNCeleb demonstrate that the corrected AM-Softmax loss consistently outperforms the original one. The code has been released at https://gitlab.com/csltstu/sunine.
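
As background for this entry, the standard AM-Softmax loss (the one being analyzed, not the proposed Real AM-Softmax, whose exact form is not given in the abstract) subtracts a margin from the target-class cosine before a scaled softmax. The sketch below handles a single sample; the default margin and scale are common illustrative values, not settings taken from the paper.

```python
import math

def am_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Standard AM-Softmax loss for one sample.

    cosines: cosine similarity between the embedding and each class weight.
    target: index of the true class.
    """
    logits = [scale * (c - margin) if j == target else scale * c
              for j, c in enumerate(cosines)]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

plain = am_softmax_loss([0.9, 0.1], target=0, margin=0.0)
with_margin = am_softmax_loss([0.9, 0.1], target=0, margin=0.2)
```

The margin lowers only the target logit, so the same embedding incurs a larger loss and must move closer to its class centre to compensate.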

【10】 LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech Link: https://arxiv.org/abs/2110.09103

Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda Affiliations: Nagoya University, Japan; National Institute of Informatics, Japan Comments: Submitted to ICASSP 2022. Code available at: this https URL Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We reflect recent advances in listener-dependent (LD) modeling, including design choices of the model architecture, and propose two inference methods that provide more stable results and efficient computation. We conduct systematic experiments on the voice conversion challenge (VCC) 2018 benchmark and a newly collected large-scale MOS dataset, providing an in-depth analysis of the proposed framework. Results show that the mean listener inference method is a better way to utilize the mean scores, and its effectiveness is more obvious when there are more ratings per sample.

【11】 Deep Clustering For General-Purpose Audio Representations Link: https://arxiv.org/abs/2110.08895

Authors: Sreyan Ghosh, Sandesh V Katta, Ashish Seth, S. Umesh Affiliations: Speech Lab, Dept. of Electrical Engineering, IIT Madras, Chennai, India Comments: Submitted to ICASSP 2022 Abstract: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it utilizes an offline clustering step to provide target labels that act as pseudo-labels for solving a prediction task. We build on recent advances in self-supervised learning for computer vision and design a lightweight, easy-to-use self-supervised pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset and transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct ablation studies identifying key design choices and also make all our code and pre-trained models publicly available.

【12】 Storage and Authentication of Audio Footage for IoAuT Devices Using Distributed Ledger Technology Link: https://arxiv.org/abs/2110.08821

Authors: Srivatsav Chenna, Nils Peters Affiliations: International Audio Laboratories Erlangen, University of Erlangen-Nuremberg, Erlangen, Germany Comments: 11 pages, 3 figures, 1 code listing Abstract: Detection of fabricated or manipulated audio content to prevent, e.g., distribution of forgeries in digital media, is crucial, especially in political and reputational contexts. Better tools for protecting the integrity of media creation are desired. Within the paradigm of the Internet of Audio Things (IoAuT), we discuss the ability of the IoAuT network to verify the authenticity of original audio using distributed ledger technology. By storing audio recordings in combination with associated recording-specific metadata obtained by the IoAuT capturing device, this architecture enables secure distribution of original audio footage, authentication of unknown audio content, and referencing of original audio material in future derivative works. By developing a proof-of-concept system, the feasibility of the proposed architecture is evaluated and discussed.
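
The authentication idea can be illustrated with a toy hash-chained ledger. Everything here is an assumption made for illustration: the record layout, the field names, and the choice to store only a SHA-256 digest of the audio rather than the recording itself. The paper's proof-of-concept system is not described in enough detail in the abstract to reproduce.

```python
import hashlib
import json

def make_block(audio_bytes, metadata, prev_hash):
    """Create a ledger record linking an audio digest to device metadata."""
    record = {
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "metadata": metadata,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["block_hash"] = hashlib.sha256(payload).hexdigest()
    return record

def is_authentic(audio_bytes, ledger):
    """True if the exact audio bytes were registered in the ledger."""
    digest = hashlib.sha256(audio_bytes).hexdigest()
    return any(block["audio_sha256"] == digest for block in ledger)

genesis = "0" * 64
ledger = [make_block(b"original recording", {"device": "mic-1"}, genesis)]
ledger.append(make_block(b"second take", {"device": "mic-1"},
                         ledger[-1]["block_hash"]))
```

Because each block's hash covers the previous block's hash, altering an earlier record changes every later hash. An exact-digest check also rejects any re-encoded copy, so a real system would need a more tolerant fingerprint.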

【13】 Taming Visually Guided Sound Generation Link: https://arxiv.org/abs/2110.08791

Authors: Vladimir Iashin, Esa Rahtu Affiliations: Computing Sciences, Tampere University, Tampere, Finland Comments: Accepted as an oral presentation at BMVC 2021. Code: this https URL. Project page: this https URL Abstract: Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
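
The "spectrogram codebook" in this abstract rests on vector quantization: each encoder output is replaced by the index of its nearest codebook entry, and the transformer then samples sequences of such indices. The lookup step alone can be sketched as below; the toy codebook and the function name are illustrative, and the VQGAN training that produces a real codebook is not shown.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
codes = quantize([[0.1, 0.2], [0.9, 0.8]], toy_codebook)
```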

【14】 Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms Link: https://arxiv.org/abs/2110.08731

Authors: Tien-Hong Lo, Yao-Ting Sung, Berlin Chen Affiliations: National Taiwan Normal University, Taipei City, Taiwan Comments: 7 pages, 2 figures, 4 tables, accepted to Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021) Abstract: Recently, end-to-end (E2E) models, which take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in developing mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data of L2 speakers for model estimation, E2E MD models are more prone to overfitting than conventional ones that are built on DNN-HMM acoustic models. To alleviate this critical issue, in this paper we propose two modeling strategies to enhance the discrimination capability of E2E MD models, each of which can implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within reference transcripts of the training data, respectively. The first one is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second one is label augmentation, which manages to capture more phonological patterns from the transcripts of training data. A series of empirical experiments conducted on the L2-ARCTIC English dataset seem to confirm the efficacy of our E2E MD model when compared to some top-of-the-line E2E MD models and a classic pronunciation-scoring based method built on a DNN-HMM acoustic model.

【15】 Towards Robust Waveform-Based Acoustic Models Link: https://arxiv.org/abs/2110.08634

Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu Affiliations: Sollich is with the Department of Mathematics Abstract: We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).
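
Vicinal risk minimization can be sketched in a few lines: rather than averaging the loss over the training points themselves, average it over samples drawn from a neighbourhood of each point. The sketch below makes a deliberately simplified assumption, a single isotropic Gaussian per training sample, whereas the paper describes mixtures of Gaussians characterized implicitly through data augmentation. Function names and the toy loss are hypothetical.

```python
import random

def vicinal_samples(x, label, sigma, n, rng):
    """Draw n perturbed copies of x from an isotropic Gaussian around it."""
    return [([xi + rng.gauss(0.0, sigma) for xi in x], label) for _ in range(n)]

def vicinal_risk(loss_fn, dataset, sigma=0.1, n=4, seed=0):
    """Average loss over Gaussian neighbourhoods of the training samples."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x, y in dataset:
        for x_aug, y_aug in vicinal_samples(x, y, sigma, n, rng):
            total += loss_fn(x_aug, y_aug)
            count += 1
    return total / count

def toy_loss(x, y):
    # Squared error of a fixed "model" that predicts the sum of the inputs.
    return (sum(x) - y) ** 2

toy_data = [([1.0, 2.0], 3.0)]
empirical = vicinal_risk(toy_loss, toy_data, sigma=0.0)
vicinal = vicinal_risk(toy_loss, toy_data, sigma=0.5)
```

With `sigma=0.0` the estimate reduces to the ordinary empirical risk, which makes the relation between the two principles explicit.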

【16】 Learning velocity model for complex media with deep convolutional neural networks Link: https://arxiv.org/abs/2110.08626

Authors: A. Stankevich, I. Nechepurenko, A. Shevchenko, L. Gremyachikh, A. Ustyuzhanin, A. Vasyukov Affiliations: Moscow Institute of Physics and Technology, Russia; HSE University Comments: 14 pages, 6 figures, 6 tables Abstract: The paper considers the problem of velocity model acquisition for complex media based on boundary measurements. The acoustic model is used to describe the media. We used an open-source dataset of velocity distributions to compare the presented results with previous works directly. Forward modeling is performed using the grid-characteristic numerical method. The inverse problem is solved using deep convolutional neural networks. Modifications to a baseline UNet architecture are proposed to improve both the structural similarity index measure and the quantitative correspondence of the velocity profiles with the ground truth. We evaluate our enhancements and demonstrate the statistical significance of the results.

【17】 Controllable Multichannel Speech Dereverberation based on Deep Neural Networks Link: https://arxiv.org/abs/2110.08439

Authors: Ziteng Wang, Yueyue Na, Biao Tian, Qiang Fu Affiliations: Alibaba Group, China Comments: submitted to ICASSP 2022 Abstract: Neural network based speech dereverberation has achieved promising results in recent studies. Nevertheless, many are focused on recovery of only the direct path sound, and early reflections, which could be beneficial to speech perception, are discarded. The performance of a model trained to recover clean speech degrades when evaluated on early reverberation targets, and vice versa. This paper proposes a novel deep neural network based multichannel speech dereverberation algorithm, in which the dereverberation level is controllable. This is realized by adding a simple floating-point number as target controller of the model. Experiments are conducted using spatially distributed microphones, and the efficacy of the proposed algorithm is confirmed in various simulated conditions.

【18】 NN3A: Neural Network supported Acoustic Echo Cancellation, Noise Suppression and Automatic Gain Control for Real-Time Communications Link: https://arxiv.org/abs/2110.08437

Authors: Ziteng Wang, Yueyue Na, Biao Tian, Qiang Fu Affiliations: Alibaba Group, China Comments: submitted to ICASSP 2022 Abstract: Acoustic echo cancellation (AEC), noise suppression (NS) and automatic gain control (AGC) are three often required modules for real-time communications (RTC). This paper proposes a neural network supported algorithm for RTC, namely NN3A, which incorporates an adaptive filter and a multi-task model for residual echo suppression, noise reduction and near-end speech activity detection. The proposed algorithm is shown to outperform both a method using separate models and an end-to-end alternative. It is further shown that there exists a trade-off in the model between residual suppression and near-end speech distortion, which could be balanced by a novel loss weighting function. Several practical aspects of training the joint model are also investigated to push its performance to the limit.

【19】 Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet Link: https://arxiv.org/abs/2110.08352

Authors: Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra Affiliations: Facebook AI Abstract: From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs model size, researchers are trapped in a dilemma of optimizing model accuracy by training and fine-tuning models for each individual edge device while keeping the training GPU-hours tractable. In this paper, we propose Omni-sparsity DNN, where a single neural network can be pruned to generate optimized models for a large range of model sizes. We develop training strategies for Omni-sparsity DNN that allow it to find models along the Pareto front of word-error-rate (WER) vs model size while keeping the training GPU-hours to no more than that of training one singular model. We demonstrate the Omni-sparsity DNN with streaming E2E ASR models. Our results show great savings in training time and resources with similar or better accuracy on LibriSpeech compared to individually pruned sparse models: 2%-6.6% better WER on Test-other.
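
The idea of deriving several model sizes from one set of shared weights can be illustrated with plain magnitude pruning. This is a generic technique chosen for the sketch, not the training strategy of the paper, which the abstract does not detail; the weight values and sparsity levels are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# One shared weight vector yields a family of sub-models, one per sparsity.
shared = [0.5, -0.05, 0.3, 0.01, -0.8, 0.2]
family = {s: magnitude_prune(shared, s) for s in (0.0, 0.5)}
```

Each entry of `family` would correspond to one point on the accuracy versus size trade-off, selected at deployment time to fit a device's budget.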

【20】 Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information Link: https://arxiv.org/abs/2110.09150

Authors: Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck Affiliations: IDLab, Department of Electronics and Information Systems, Ghent University - imec, Belgium Comments: Submitted to ICASSP 2022 Abstract: This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We show that current speaker embedding extractors consistently underestimate speaker similarity in within-speaker cross-lingual trials. Consequently, the typical training and scoring protocols do not put enough emphasis on the compensation of intra-speaker language variability. We propose two techniques to increase cross-lingual speaker verification robustness. First, we enhance our previously proposed Large-Margin Fine-Tuning (LM-FT) training stage with a mini-batch sampling strategy which increases the amount of intra-speaker cross-lingual samples within the mini-batch. Second, we incorporate language information in the logistic regression calibration stage. We integrate quality metrics based on soft and hard decisions of a VoxLingua107 language identification model. The proposed techniques result in an 11.7% relative improvement over the baseline model on the VoxSRC-21 test set and contributed to our third place finish in the corresponding challenge.

【21】 Similarity-and-Independence-Aware Beamformer with Iterative Casting and Boost Start for Target Source Extraction Using Reference Link: https://arxiv.org/abs/2110.09019

Authors: Atsuo Hiroe Comments: Accepted for publication as a regular paper in the IEEE Open Journal of Signal Processing Abstract: Target source extraction is significant for improving human speech intelligibility and the speech recognition performance of computers. This study describes a method for target source extraction, called the similarity-and-independence-aware beamformer (SIBF). The SIBF extracts the target source using a rough magnitude spectrogram as the reference signal. The advantage of the SIBF is that it can obtain a more accurate signal than the spectrogram generated by target-enhancing methods such as speech enhancement based on deep neural networks. For the extraction, we extend the framework of deflationary independent component analysis (ICA) by considering the similarities between the reference and extracted target sources, in addition to the mutual independence of all the potential sources. To solve the extraction problem by maximum-likelihood estimation, we introduce three source models that can reflect the similarities. The major contributions of this study are as follows. First, the extraction performance is improved using two methods, namely boost start for faster convergence and iterative casting for generating a more accurate reference. The effectiveness of these methods is verified through experiments using the CHiME3 dataset. Second, a concept of a fixed point pertaining to accuracy is developed. This concept facilitates understanding the relationship between the reference and SIBF output in terms of accuracy. Third, a unified formulation of the SIBF and mask-based beamformer is realized to apply the expertise of conventional beamformers (BFs) to the SIBF. The findings of this study can also improve the performance of the SIBF and promote research on ICA and conventional beamformers. Index Terms: beamformer, independent component analysis, source separation, speech enhancement, target source extraction

【22】 Supervised Metric Learning for Music Structure Feature Link: https://arxiv.org/abs/2110.09000

Authors: Ju-Chiang Wang, Jordan B. L. Smith, Wei-Tsung Lu, Xuchen Song Affiliations: ByteDance Comments: Pre-print for an accepted paper by ISMIR 2021 Abstract: Music structure analysis (MSA) methods traditionally search for musically meaningful patterns in audio: homogeneity, repetition, novelty, and segment-length regularity. Hand-crafted audio features such as MFCCs or chromagrams are often used to elicit these patterns. However, with more annotations of section labels (e.g., verse, chorus, and bridge) becoming available, one can use supervised feature learning to make these patterns even clearer and improve MSA performance. To this end, we take a supervised metric learning approach: we train a deep neural network to output embeddings that are near each other for two spectrogram inputs if both have the same section type (according to an annotation), and otherwise far apart. We propose a batch sampling scheme to ensure the labels in a training pair are interpreted meaningfully. The trained model extracts features that can be used in existing MSA algorithms. In evaluations with three datasets (HarmonixSet, SALAMI, and RWC), we demonstrate that using the proposed features can improve a traditional MSA algorithm significantly in both intra- and cross-dataset scenarios.
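
The "near if same section type, otherwise far apart" objective can be illustrated with a classic pairwise contrastive loss. This is one common metric-learning loss used here as a stand-in; the abstract does not say which loss the authors use, and the margin value is an assumption.

```python
import math

def contrastive_loss(emb_a, emb_b, same_section, margin=1.0):
    """Pairwise contrastive loss on two embeddings.

    Pairs with the same section label are pulled together; pairs with
    different labels are pushed apart until they are `margin` away.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_section:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

pull = contrastive_loss([0.0, 0.0], [0.6, 0.8], same_section=True)
push = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_section=False)
```

A different-label pair already farther apart than the margin contributes no loss, so training effort concentrates on pairs that are still confusable.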

【23】 Deep Learning Based EDM Subgenre Classification using Mel-Spectrogram and Tempogram Features
Link: https://arxiv.org/abs/2110.08862

Authors: Wei-Han Hsu, Bo-Yu Chen, Yi-Hsuan Yang
Affiliation: Research Center for IT Innovation, Academia Sinica, Taipei, Taiwan
Abstract: Along with the evolution of music technology, a large number of styles, or "subgenres," of Electronic Dance Music (EDM) have emerged in recent years. While the classification task of distinguishing between EDM and non-EDM has often been studied in the context of music genre classification, little work has been done on the more challenging EDM subgenre classification. The state-of-the-art model is based on extremely randomized trees and could be improved by deep learning methods. In this paper, we extend the state-of-the-art music auto-tagging model "short-chunkCNN+Resnet" to EDM subgenre classification, with the addition of two mid-level tempo-related feature representations, called the Fourier tempogram and autocorrelation tempogram. We also explore two fusion strategies, early fusion and late fusion, to aggregate the two types of tempograms. We evaluate the proposed models using a large dataset consisting of 75,000 songs for 30 different EDM subgenres, and show that the adoption of deep learning models and tempo features indeed leads to higher classification accuracy.
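To give a feel for what an autocorrelation-based tempo feature captures, here is a rough sketch under simplifying assumptions: it autocorrelates a single toy onset envelope over one window, whereas a real tempogram repeats this over sliding windows. The envelope, the lag range, and the function name are hypothetical and not from the paper.

```python
def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

# Toy onset envelope (hypothetical): one onset every 4 frames, 32 frames long.
onsets = [1.0, 0.0, 0.0, 0.0] * 8

ac = autocorrelation(onsets, max_lag=8)
# The strongest non-zero lag is the beat period in frames.
period = max(range(1, 9), key=lambda lag: ac[lag])
```

Given a frame rate in frames per second, the period converts to a tempo of 60 * frame_rate / period beats per minute.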

【24】 VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis
Link: https://arxiv.org/abs/2110.08813

Authors: Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, Mengxiao Bi
Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Fuxi AI Lab, NetEase Inc., Hangzhou, China
Comments: 5 pages, submitted to ICASSP 2022
Abstract: In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveforms from lyrics and a musical score. Our approach is inspired by VITS, which adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS, but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to get the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+Neural-Vocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.

【25】 A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer
Link: https://arxiv.org/abs/2110.08598

Authors: Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee
Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA; Computer Engineering School, University of Enna Kore, Italy
Comments: Submitted to ICASSP 2022
Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation as in conventional maximum a posteriori estimation, which risks the curse of dimensionality when estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.

【26】 ASR4REAL: An extended benchmark for speech models
Link: https://arxiv.org/abs/2110.08583

Authors: Morgane Riviere, Jade Copet, Gabriel Synnaeve
Affiliation: Facebook AI Research
Comments: Submitted to ICASSP 2022
Abstract: Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by accent, and even more important ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech, and in this precise context even a language model trained on a dataset as big as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.

【27】 A Unified Speaker Adaptation Approach for ASR
Link: https://arxiv.org/abs/2110.08545

Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Affiliations: Nanyang Technological University, Singapore; Machine Intelligence Technology, Alibaba Group
Comments: Accepted by EMNLP 2021
Abstract: Transformer models have been used successfully in automatic speech recognition (ASR) and yield state-of-the-art results. However, their performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting of the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which, to the best of our knowledge, has never been explored in ASR. Specifically, we gradually prune less contributing parameters on the model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to keep the original model performance. We conduct experiments on the Librispeech dataset. Our proposed approach brings a relative word error rate (WER) reduction of 2.74-6.52% on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction, and surpasses the finetuning method by up to 2.54% relative. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method could improve the WER by 6.53% relative with only a few epochs of training.
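The prune-then-adapt idea in this abstract can be illustrated with a small sketch. It assumes that "less contributing" means smallest magnitude and prunes in a single step rather than gradually. Both are simplifications of my own, and the weights, gradients, and function names are hypothetical.

```python
def prune_mask(weights, sparsity):
    """Mark the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [i in pruned for i in range(len(weights))]

def adapt_step(weights, grads, mask, lr=0.1):
    """Update only the pruned (freed) positions; unpruned weights stay frozen."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
mask = prune_mask(weights, sparsity=0.5)                   # frees the 3 smallest weights
freed = [0.0 if m else w for w, m in zip(weights, mask)]   # pruned weights reset to zero
grads = [1.0] * len(weights)                               # hypothetical adaptation gradients
adapted = adapt_step(freed, grads, mask)
```

The frozen entries keep their original values, which is how the original model's behaviour is preserved while the freed entries carry the speaker-specific update.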

3. eess.AS Audio Processing:

【1】 Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information
Link: https://arxiv.org/abs/2110.09150

Authors: Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
Affiliation: IDLab, Department of Electronics and Information Systems, Ghent University - imec, Belgium
Comments: Submitted to ICASSP 2022
Abstract: This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We show that current speaker embedding extractors consistently underestimate speaker similarity in within-speaker cross-lingual trials. Consequently, the typical training and scoring protocols do not put enough emphasis on the compensation of intra-speaker language variability. We propose two techniques to increase cross-lingual speaker verification robustness. First, we enhance our previously proposed Large-Margin Fine-Tuning (LM-FT) training stage with a mini-batch sampling strategy which increases the number of intra-speaker cross-lingual samples within the mini-batch. Second, we incorporate language information in the logistic regression calibration stage. We integrate quality metrics based on soft and hard decisions of a VoxLingua107 language identification model. The proposed techniques result in an 11.7% relative improvement over the baseline model on the VoxSRC-21 test set and contributed to our third-place finish in the corresponding challenge.
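As a rough illustration of language-aware calibration, the sketch below adds a single hard-decision feature (whether the two utterances were identified as different languages) to a logistic regression over the raw verification score. The feature choice, the weights, and the function name are assumptions made for illustration; the paper's actual quality metrics and trained parameters are not given in the abstract.

```python
import math

def calibrate(raw_score, cross_lingual, w_score=8.0, w_cross=1.0, bias=-4.0):
    """Map a raw similarity score to a same-speaker probability.

    `cross_lingual` is a hard language-ID decision; its positive weight
    compensates for scores that are underestimated across languages.
    All weights here are hypothetical, not fitted values.
    """
    logit = w_score * raw_score + w_cross * (1.0 if cross_lingual else 0.0) + bias
    return 1.0 / (1.0 + math.exp(-logit))

same_language = calibrate(0.5, cross_lingual=False)
different_language = calibrate(0.5, cross_lingual=True)
```

With identical raw scores, the cross-lingual trial receives the higher calibrated probability, which is the kind of score-shift correction the abstract motivates.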

【2】 Similarity-and-Independence-Aware Beamformer with Iterative Casting and Boost Start for Target Source Extraction Using Reference
Link: https://arxiv.org/abs/2110.09019

Authors: Atsuo Hiroe
Affiliation: T
Comments: Accepted for publication as a regular paper in the IEEE Open Journal of Signal Processing
Abstract: Target source extraction is significant for improving human speech intelligibility and the speech recognition performance of computers. This study describes a method for target source extraction, called the similarity-and-independence-aware beamformer (SIBF). The SIBF extracts the target source using a rough magnitude spectrogram as the reference signal. The advantage of the SIBF is that it can obtain a more accurate signal than the spectrogram generated by target-enhancing methods such as speech enhancement based on deep neural networks. For the extraction, we extend the framework of deflationary independent component analysis (ICA) by considering the similarities between the reference and extracted target sources, in addition to the mutual independence of all the potential sources. To solve the extraction problem by maximum-likelihood estimation, we introduce three source models that can reflect the similarities. The major contributions of this study are as follows. First, the extraction performance is improved using two methods, namely boost start for faster convergence and iterative casting for generating a more accurate reference. The effectiveness of these methods is verified through experiments using the CHiME3 dataset. Second, a concept of a fixed point pertaining to accuracy is developed. This concept facilitates understanding the relationship between the reference and SIBF output in terms of accuracy. Third, a unified formulation of the SIBF and mask-based beamformer is realized to apply the expertise of conventional BFs to the SIBF. The findings of this study can also improve the performance of the SIBF and promote research on ICA and conventional beamformers.
Index Terms: beamformer, independent component analysis, source separation, speech enhancement, target source extraction

【3】 Supervised Metric Learning for Music Structure Feature
Link: https://arxiv.org/abs/2110.09000

Authors: Ju-Chiang Wang, Jordan B. L. Smith, Wei-Tsung Lu, Xuchen Song
Affiliation: ByteDance
Comments: Pre-print for an accepted paper by ISMIR 2021
Abstract: Music structure analysis (MSA) methods traditionally search for musically meaningful patterns in audio: homogeneity, repetition, novelty, and segment-length regularity. Hand-crafted audio features such as MFCCs or chromagrams are often used to elicit these patterns. However, with more annotations of section labels (e.g., verse, chorus, and bridge) becoming available, one can use supervised feature learning to make these patterns even clearer and improve MSA performance. To this end, we take a supervised metric learning approach: we train a deep neural network to output embeddings that are near each other for two spectrogram inputs if both have the same section type (according to an annotation), and otherwise far apart. We propose a batch sampling scheme to ensure the labels in a training pair are interpreted meaningfully. The trained model extracts features that can be used in existing MSA algorithms. In evaluations with three datasets (HarmonixSet, SALAMI, and RWC), we demonstrate that using the proposed features can improve a traditional MSA algorithm significantly in both intra- and cross-dataset scenarios.

【4】 Deep Learning Based EDM Subgenre Classification using Mel-Spectrogram and Tempogram Features
Link: https://arxiv.org/abs/2110.08862

Authors: Wei-Han Hsu, Bo-Yu Chen, Yi-Hsuan Yang
Affiliation: Research Center for IT Innovation, Academia Sinica, Taipei, Taiwan
Abstract: Along with the evolution of music technology, a large number of styles, or "subgenres," of Electronic Dance Music (EDM) have emerged in recent years. While the classification task of distinguishing between EDM and non-EDM has often been studied in the context of music genre classification, little work has been done on the more challenging EDM subgenre classification. The state-of-the-art model is based on extremely randomized trees and could be improved by deep learning methods. In this paper, we extend the state-of-the-art music auto-tagging model "short-chunkCNN+Resnet" to EDM subgenre classification, with the addition of two mid-level tempo-related feature representations, called the Fourier tempogram and autocorrelation tempogram. We also explore two fusion strategies, early fusion and late fusion, to aggregate the two types of tempograms. We evaluate the proposed models using a large dataset consisting of 75,000 songs for 30 different EDM subgenres, and show that the adoption of deep learning models and tempo features indeed leads to higher classification accuracy.

【5】 VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis
Link: https://arxiv.org/abs/2110.08813

Authors: Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, Mengxiao Bi
Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Fuxi AI Lab, NetEase Inc., Hangzhou, China
Comments: 5 pages, submitted to ICASSP 2022
Abstract: In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates audio waveforms from lyrics and a musical score. Our approach is inspired by VITS, which adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS, but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to get the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+Neural-Vocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.

【6】 A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer
Link: https://arxiv.org/abs/2110.08598

Authors: Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee
Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA; Computer Engineering School, University of Enna Kore, Italy
Comments: Submitted to ICASSP 2022
Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation as in conventional maximum a posteriori estimation, which risks the curse of dimensionality when estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.

【7】 ASR4REAL: An extended benchmark for speech models
Link: https://arxiv.org/abs/2110.08583

Authors: Morgane Riviere, Jade Copet, Gabriel Synnaeve
Affiliation: Facebook AI Research
Comments: Submitted to ICASSP 2022
Abstract: Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by accent, and even more important ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech, and in this precise context even a language model trained on a dataset as big as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.

【8】 A Unified Speaker Adaptation Approach for ASR
Link: https://arxiv.org/abs/2110.08545

Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
Affiliations: Nanyang Technological University, Singapore; Machine Intelligence Technology, Alibaba Group
Comments: Accepted by EMNLP 2021
Abstract: Transformer models have been used successfully in automatic speech recognition (ASR) and yield state-of-the-art results. However, their performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting of the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which, to the best of our knowledge, has never been explored in ASR. Specifically, we gradually prune less contributing parameters on the model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to keep the original model performance. We conduct experiments on the Librispeech dataset. Our proposed approach brings a relative word error rate (WER) reduction of 2.74-6.52% on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction, and surpasses the finetuning method by up to 2.54% relative. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method could improve the WER by 6.53% relative with only a few epochs of training.

【9】 FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection
Link: https://arxiv.org/abs/2110.09441

Authors: Zhenyu Zhang, Yewei Gu, Xiaowei Yi, Xianfeng Zhao
Affiliations: State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences
Abstract: With the increasing development of text-to-speech (TTS) and voice conversion (VC) technologies, the detection of synthetic speech has suffered dramatically. In order to promote the development of synthetic speech detection models against Mandarin TTS and VC technologies, we have constructed a challenging Mandarin dataset and organized the accompanying audio track of the first fake media forensic challenge of the China Society of Image and Graphics (FMFCC-A). The FMFCC-A dataset is by far the largest publicly available Mandarin dataset for synthetic speech detection; it contains 40,000 synthesized Mandarin utterances generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterances collected from 58 speakers. The FMFCC-A dataset is divided into training, development, and evaluation sets, which are used for research on the detection of synthesized Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations. In addition to describing the construction of the FMFCC-A dataset, we provide a detailed analysis of two baseline methods and the top-performing submissions from the FMFCC-A, which illustrates the usefulness and challenge of the FMFCC-A dataset. We hope that the FMFCC-A dataset can fill the gap left by the lack of Mandarin datasets for synthetic speech detection.

【10】 Automatic Learning of Subword Dependent Model Scales
Link: https://arxiv.org/abs/2110.09324

Authors: Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany
Comments: Submitted to ICASSP 2022
Abstract: To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work we propose to optimize these scaling parameters via automatic differentiation and stochastic gradient descent, similar to the neural network model parameters. We show on the LibriSpeech (LBS) and Switchboard (SWB) corpora that the model scales for a combination of an attention-based encoder-decoder acoustic model and a language model can be learned as effectively as with manual tuning. We further extend this approach to subword-dependent model scales, which could not be tuned manually, and which lead to a 7% improvement on LBS and 3% on SWB. We also show that joint training of scales and model parameters is possible and gives an additional 6% improvement on LBS.
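The log-linear combination and gradient-based scale learning described above can be sketched on a toy n-best list. This is a minimal illustration with two global scales and a hand-derived cross-entropy gradient, not the paper's subword-dependent setup or its automatic-differentiation pipeline. The scores, learning rate, and function names are all hypothetical.

```python
import math

def combined_scores(hyps, scales):
    """Log-linear combination: sum over models of scale * log-probability."""
    return [sum(s * lp for s, lp in zip(scales, h)) for h in hyps]

def softmax(scores):
    """Numerically stable softmax over hypothesis scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sgd_step(hyps, correct, scales, lr=0.1):
    """One gradient step on the cross-entropy of the correct hypothesis."""
    probs = softmax(combined_scores(hyps, scales))
    new_scales = []
    for k, scale in enumerate(scales):
        expected = sum(p * h[k] for p, h in zip(probs, hyps))
        grad = expected - hyps[correct][k]   # d(-log p_correct) / d(scale_k)
        new_scales.append(scale - lr * grad)
    return new_scales

# Hypothetical 2-best list: (acoustic log-prob, language-model log-prob).
hyps = [(-4.0, -9.0), (-5.0, -3.0)]
correct = 1              # the reference transcript is hypothesis 1
scales = [1.0, 0.0]      # start with the language model switched off
for _ in range(50):
    scales = sgd_step(hyps, correct, scales)

final = combined_scores(hyps, scales)
best = max(range(len(hyps)), key=lambda i: final[i])
```

With the language model switched off, the acoustic score alone prefers the wrong hypothesis; after training, the learned language-model scale is positive and the correct hypothesis ranks first.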

【11】 Intent Classification Using Pre-Trained Embeddings For Low Resource Languages
Link: https://arxiv.org/abs/2110.09264

Authors: Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi, Alan W Black, Rajiv Ratn Shah
Affiliations: MIDAS, IIIT Delhi, India; J.P. Morgan AI Research, New York, USA; Carnegie Mellon University
Abstract: Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low-resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used in identifying the spoken intent. We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high-, medium-, and low-resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil and achieves competitive results on English. Furthermore, we present a quantitative analysis of how the performance scales with the number of training examples used per intent.

【12】 Efficient Sequence Training of Attention Models using Approximative Recombination
Link: https://arxiv.org/abs/2110.09245

Authors: Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney
Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany
Comments: Submitted to ICASSP 2022
Abstract: Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obtained from beam search. This work proposes to perform (approximative) recombinations of hypotheses during beam search if they share a common local history. The error that is incurred by the approximation is analyzed, and it is shown that using this technique the effective beam size can be increased by several orders of magnitude without significantly increasing the computational requirements. Lastly, it is shown that this technique can be used to effectively perform sequence discriminative training for attention-based encoder-decoder acoustic models on the LibriSpeech task.
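One way to picture recombination on a shared local history is the sketch below. It assumes that hypotheses agreeing on their last `history` labels are merged, that the merged entry keeps the best member's labels, and that probabilities are summed in the log domain. These are illustrative assumptions of mine; the abstract does not spell out the paper's exact merge rule.

```python
import math

def logsumexp(a, b):
    """log(exp(a) + exp(b)) computed stably."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def recombine(beam, history=2):
    """Merge hypotheses whose last `history` labels agree.

    Each beam entry is (labels, log_score). A merged entry keeps the labels
    of its best-scoring member and the log-sum of all member scores.
    """
    merged = {}
    for labels, score in beam:
        key = tuple(labels[-history:])
        if key not in merged:
            merged[key] = (labels, score, score)   # (best labels, best score, total)
        else:
            best_labels, best_score, total = merged[key]
            if score > best_score:
                best_labels, best_score = labels, score
            merged[key] = (best_labels, best_score, logsumexp(total, score))
    return [(labels, total) for labels, _, total in merged.values()]

# Hypothetical beam with probabilities 0.4, 0.1, and 0.2.
beam = [
    (["a", "b", "c"], math.log(0.4)),
    (["x", "b", "c"], math.log(0.1)),   # same last two labels as the first entry
    (["a", "b", "d"], math.log(0.2)),
]
merged = recombine(beam)
```

Each surviving entry then stands in for several full label sequences, which is why the effective beam can grow without a matching growth in the number of entries that must be extended.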

【13】 EIHW-MTG: Second DiCOVA Challenge System Report Link: https://arxiv.org/abs/2110.09239

Authors: Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller Affiliations: EIHW – Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany; MTG – Music Technology Group, Universitat Pompeu Fabra, Spain; Joint Research Centre, European Commission, Spain Abstract: This work presents an outer product-based approach to fuse the embedded representations generated from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of a CNN trained from scratch and a ResNet18 architecture fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms is beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06% is obtained on the test partition when using a CNN trained from scratch with contextual attention mechanisms. When using the ResNet18 architecture for feature extraction, the baseline model scores the highest performance with an AUC of 84.26%.
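
As a rough illustration of what outer-product fusion of embeddings can look like, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `outer_product_fusion` and `fuse_modalities`, the flattening order, and the choice to fuse modalities one after another are all assumptions made for this example.

```python
def outer_product_fusion(a, b):
    """Return the flattened outer product of two embedding vectors.

    Every element of `a` is multiplied by every element of `b`, so the
    result has len(a) * len(b) entries (hypothetical helper, for illustration).
    """
    return [x * y for x in a for y in b]


def fuse_modalities(embeddings):
    """Fuse several embeddings by chaining outer products (assumed scheme)."""
    fused = list(embeddings[0])
    for emb in embeddings[1:]:
        fused = outer_product_fusion(fused, emb)
    return fused


# Toy embeddings standing in for cough, breath, and speech representations.
cough, breath, speech = [0.5, 1.0], [2.0, -1.0], [1.0, 0.0, 3.0]
fused = fuse_modalities([cough, breath, speech])
print(len(fused))
```

The fused vector grows multiplicatively with the embedding sizes, which is why such schemes are usually applied to compact embeddings.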

【14】 Learning Models for Query by Vocal Percussion: A Comparative Study Link: https://arxiv.org/abs/2110.09223

Authors: Alejandro Delgado, SkoT McDonald, Ning Xu, Charalampos Saitis, Mark Sandler Affiliations: Roli; Queen Mary University of London Comments: Published in proceedings of the International Computer Music Conference (ICMC) 2021 Abstract: The imitation of percussive sounds via the human voice is a natural and effective tool for communicating rhythmic ideas on the fly. Thus, the automatic retrieval of drum sounds using vocal percussion can help artists prototype drum patterns in a comfortable and quick way, smoothing the creative workflow as a result. Here we explore different strategies to perform this type of query, making use of both traditional machine learning algorithms and recent deep learning techniques. The main hyperparameters from the models involved are carefully selected by feeding performance metrics to a grid search algorithm. We also look into several audio data augmentation techniques, which can potentially regularise deep learning models and improve generalisation. We compare the final performances in terms of effectiveness (classification accuracy), efficiency (computational speed), stability (performance consistency), and interpretability (decision patterns), and discuss the relevance of these results when it comes to the design of successful query-by-vocal-percussion systems.

【15】 SpecTNT: a Time-Frequency Transformer for Music Audio Link: https://arxiv.org/abs/2110.09127

Authors: Wei-Tsung Lu, Ju-Chiang Wang, Minz Won, Keunwoo Choi, Xuchen Song Affiliations: ByteDance, Mountain View, California, United States; Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain Abstract: Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use Transformer as a temporal feature aggregator that acts similar to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. Later, the FCTs are linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Then, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and other design choices are further examined through ablation studies.

【16】 KaraTuner: Towards end to end natural pitch correction for singing voice in karaoke Link: https://arxiv.org/abs/2110.09121

Authors: Xiaobin Zhuang, Huiran Yu, Weifeng Zhao, Tao Jiang, Peng Hu, Simon Lui, Wenjiang Zhou Affiliations: Tencent Music Entertainment Lyra Lab, Shenzhen, China; Carnegie Mellon University, Pittsburgh, PA, USA Comments: Submitted to ICASSP 2022 Abstract: An automatic pitch correction system typically includes several stages, such as pitch extraction, deviation estimation, pitch shift processing, and cross-fade smoothing. However, designing these components with strategies often requires domain expertise and they are likely to fail on corner cases. In this paper, we present KaraTuner, an end-to-end neural architecture that predicts pitch curve and resynthesizes the singing voice directly from the tuned pitch and vocal spectrum extracted from the original recordings. Several vital technical points have been introduced in KaraTuner to ensure pitch accuracy, pitch naturalness, timbre consistency, and sound quality. A feed-forward Transformer is employed in the pitch predictor to capture long-term dependencies in the vocal spectrum and musical note. We also develop a pitch-controllable vocoder based on a novel source-filter block and the Fre-GAN architecture. KaraTuner obtains a higher preference than the rule-based pitch correction approach through A/B tests, and perceptual experiments show that the proposed vocoder achieves significant advantages in timbre consistency and sound quality compared with the parametric WORLD vocoder and phase vocoder.

【17】 Real Additive Margin Softmax for Speaker Verification Link: https://arxiv.org/abs/2110.09116

Authors: Lantian Li, Ruiqian Nai, Dong Wang Affiliations: Center for Speech and Language Technologies, BNRist, Tsinghua University, China Comments: Submitted to ICASSP 2022 Abstract: The additive margin softmax (AM-Softmax) loss has delivered remarkable performance in speaker verification. A supposed behavior of AM-Softmax is that it can shrink within-class variation by putting emphasis on target logits, which in turn improves margin between target and non-target classes. In this paper, we conduct a careful analysis on the behavior of AM-Softmax loss, and show that this loss does not implement real max-margin training. Based on this observation, we present a Real AM-Softmax loss which involves a true margin function in the softmax training. Experiments conducted on VoxCeleb1, SITW and CNCeleb demonstrated that the corrected AM-Softmax loss consistently outperforms the original one. The code has been released at https://gitlab.com/csltstu/sunine.
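
For readers unfamiliar with the loss analysed in this entry, the following is a minimal sketch of the standard AM-Softmax loss for a single sample. It is not the Real AM-Softmax variant the paper proposes, and it is not taken from the released code. The function name, the pure-Python formulation, and the default `scale` and `margin` values are illustrative assumptions.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.35):
    """Standard AM-Softmax loss for one sample (illustrative sketch).

    cosines: cosine similarities between the embedding and each class weight.
    target:  index of the true class.
    scale, margin: assumed hyperparameter values, chosen only for illustration.
    """
    # The additive margin is subtracted from the target-class cosine only.
    logits = [
        scale * (c - margin if j == target else c)
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Negative log-probability of the target class.
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]
print(am_softmax_loss(cosines, target=0))
print(am_softmax_loss(cosines, target=0, margin=0.0))  # plain scaled softmax
```

With `margin=0.0` the function reduces to scaled softmax cross-entropy, so the margin can be read as a penalty that makes the target class harder to satisfy during training.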

【18】 LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech Link: https://arxiv.org/abs/2110.09103

Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda Affiliations: Nagoya University, Japan; National Institute of Informatics, Japan Comments: Submitted to ICASSP 2022. Code is available (see the arXiv page). Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We reflect recent advances in LD modeling, including design choices of the model architecture, and propose two inference methods that provide more stable results and efficient computation. We conduct systematic experiments on the voice conversion challenge (VCC) 2018 benchmark and a newly collected large-scale MOS dataset, providing an in-depth analysis of the proposed framework. Results show that the mean listener inference method is a better way to utilize the mean scores, whose effectiveness is more obvious when having more ratings per sample.

【19】 Deep Clustering For General-Purpose Audio Representations Link: https://arxiv.org/abs/2110.08895

Authors: Sreyan Ghosh, Sandesh V Katta, Ashish Seth, S. Umesh Affiliations: Speech Lab, Dept. of Electrical Engineering, IIT Madras, Chennai, India Comments: Submitted to ICASSP 2022 Abstract: We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it utilizes an offline clustering step to provide target labels that act as pseudo-labels for solving a prediction task. We develop on top of recent advances in self-supervised learning for computer vision and design a lightweight, easy-to-use self-supervised pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale Audioset dataset and transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct ablation studies identifying key design choices and also make all our code and pre-trained models publicly available.

【20】 Storage and Authentication of Audio Footage for IoAuT Devices Using Distributed Ledger Technology Link: https://arxiv.org/abs/2110.08821

Authors: Srivatsav Chenna, Nils Peters Affiliations: International Audio Laboratories Erlangen, University of Erlangen-Nuremberg, Erlangen, Germany Comments: 11 pages, 3 Figures, 1 code listing Abstract: Detection of fabricated or manipulated audio content to prevent, e.g., distribution of forgeries in digital media, is crucial, especially in political and reputational contexts. Better tools for protecting the integrity of media creation are desired. Within the paradigm of the Internet of Audio Things (IoAuT), we discuss the ability of the IoAuT network to verify the authenticity of original audio using distributed ledger technology. By storing audio recordings in combination with associated recording-specific metadata obtained by the IoAuT capturing device, this architecture enables secure distribution of original audio footage, authentication of unknown audio content, and referencing of original audio material in future derivative works. By developing a proof-of-concept system, the feasibility of the proposed architecture is evaluated and discussed.

【21】 Taming Visually Guided Sound Generation Link: https://arxiv.org/abs/2110.08791

Authors: Vladimir Iashin, Esa Rahtu Affiliations: Computing Sciences, Tampere University, Tampere, Finland Comments: Accepted as an oral presentation for the BMVC 2021. Code and project page are linked from the arXiv page. Abstract: Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN

【22】 Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms Link: https://arxiv.org/abs/2110.08731

Authors: Tien-Hong Lo, Yao-Ting Sung, Berlin Chen Affiliations: National Taiwan Normal University, Taipei City, Taiwan Comments: 7 pages, 2 figures, 4 tables, accepted to Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021) Abstract: Recently, end-to-end (E2E) models, which allow to take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in developing mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data of L2 speakers for model estimation, E2E MD models are prone to overfitting in relation to conventional ones that are built on DNN-HMM acoustic models. To alleviate this critical issue, we in this paper propose two modeling strategies to enhance the discrimination capability of E2E MD models, each of which can implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within reference transcripts of the training data, respectively. The first one is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second one is label augmentation, which manages to capture more phonological patterns from the transcripts of training data. A series of empirical experiments conducted on the L2-ARCTIC English dataset seem to confirm the efficacy of our E2E MD model when compared to some top-of-the-line E2E MD models and a classic pronunciation-scoring based method built on a DNN-HMM acoustic model.

【23】 Towards Robust Waveform-Based Acoustic Models Link: https://arxiv.org/abs/2110.08634

Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu Affiliations: Sollich is with the Department of Mathematics Abstract: We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).
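
The contrast between empirical and vicinal risk mentioned in this abstract can be sketched in a few lines. This is a simplified illustration, not the paper's method: it assumes a single isotropic Gaussian around each training point (the paper describes a mixture of Gaussians characterized implicitly through data augmentation), and the names `empirical_risk`, `vicinal_risk`, `sigma`, and `draws` are hypothetical.

```python
import random


def empirical_risk(loss, data):
    """Average loss evaluated exactly at the training points."""
    return sum(loss(x, y) for x, y in data) / len(data)


def vicinal_risk(loss, data, sigma, draws=100, seed=0):
    """Monte Carlo estimate of the risk under Gaussian vicinities.

    Each training input is replaced by `draws` samples from an isotropic
    Gaussian centred on it (an assumed, simplified vicinity).
    """
    rng = random.Random(seed)
    total = 0.0
    for x, y in data:
        for _ in range(draws):
            x_tilde = [xi + rng.gauss(0.0, sigma) for xi in x]
            total += loss(x_tilde, y)
    return total / (len(data) * draws)


def squared_error(x, y):
    # Toy model: the prediction is simply the sum of the input features.
    return (sum(x) - y) ** 2


data = [([1.0, 2.0], 3.5), ([0.0, 1.0], 0.0)]
print(empirical_risk(squared_error, data))
print(vicinal_risk(squared_error, data, sigma=0.5))
```

Setting `sigma=0.0` collapses every vicinity back to its training point, so the vicinal estimate then coincides with the empirical risk.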

【24】 Learning velocity model for complex media with deep convolutional neural networks Link: https://arxiv.org/abs/2110.08626

Authors: A. Stankevich, I. Nechepurenko, A. Shevchenko, L. Gremyachikh, A. Ustyuzhanin, A. Vasyukov Affiliations: Moscow Institute of Physics and Technology, Russia; HSE University Comments: 14 pages, 6 figures, 6 tables Abstract: The paper considers the problem of velocity model acquisition for a complex media based on boundary measurements. The acoustic model is used to describe the media. We used an open-source dataset of velocity distributions to compare the presented results with the previous works directly. Forward modeling is performed using the grid-characteristic numerical method. The inverse problem is solved using deep convolutional neural networks. Modifications for a baseline UNet architecture are proposed to improve both structural similarity index measure quantitative correspondence of the velocity profiles with the ground truth. We evaluate our enhancements and demonstrate the statistical significance of the results.

【25】 Controllable Multichannel Speech Dereverberation based on Deep Neural Networks Link: https://arxiv.org/abs/2110.08439

Authors: Ziteng Wang, Yueyue Na, Biao Tian, Qiang Fu Affiliations: Alibaba Group, China Comments: submitted to ICASSP2022 Abstract: Neural network based speech dereverberation has achieved promising results in recent studies. Nevertheless, many are focused on recovery of only the direct path sound and early reflections, which could be beneficial to speech perception, are discarded. The performance of a model trained to recover clean speech degrades when evaluated on early reverberation targets, and vice versa. This paper proposes a novel deep neural network based multichannel speech dereverberation algorithm, in which the dereverberation level is controllable. This is realized by adding a simple floating-point number as target controller of the model. Experiments are conducted using spatially distributed microphones, and the efficacy of the proposed algorithm is confirmed in various simulated conditions.

【26】 NN3A: Neural Network supported Acoustic Echo Cancellation, Noise Suppression and Automatic Gain Control for Real-Time Communications Link: https://arxiv.org/abs/2110.08437

Authors: Ziteng Wang, Yueyue Na, Biao Tian, Qiang Fu Affiliations: Alibaba Group, China Comments: submitted to ICASSP2022 Abstract: Acoustic echo cancellation (AEC), noise suppression (NS) and automatic gain control (AGC) are three often required modules for real-time communications (RTC). This paper proposes a neural network supported algorithm for RTC, namely NN3A, which incorporates an adaptive filter and a multi-task model for residual echo suppression, noise reduction and near-end speech activity detection. The proposed algorithm is shown to outperform both a method using separate models and an end-to-end alternative. It is further shown that there exists a trade-off in the model between residual suppression and near-end speech distortion, which could be balanced by a novel loss weighting function. Several practical aspects of training the joint model are also investigated to push its performance to limit.

【27】 Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet Link: https://arxiv.org/abs/2110.08352

Authors: Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra Affiliations: Facebook AI Abstract: From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs model size, researchers are trapped in a dilemma of optimizing model accuracy by training and fine-tuning models for each individual edge device while keeping the training GPU-hours tractable. In this paper, we propose Omni-sparsity DNN, where a single neural network can be pruned to generate optimized model for a large range of model sizes. We develop training strategies for Omni-sparsity DNN that allows it to find models along the Pareto front of word-error-rate (WER) vs model size while keeping the training GPU-hours to no more than that of training one singular model. We demonstrate the Omni-sparsity DNN with streaming E2E ASR models. Our results show great saving on training time and resources with similar or better accuracy on LibriSpeech compared to individually pruned sparse models: 2%-6.6% better WER on Test-other.

Machine translation, for reference only.


Originally published: 2021-10-19. For copyright concerns, contact cloudcommunity@tencent.com to request removal.
