
Finance / Speech / Audio Processing arXiv Digest [9.3]

Author: 公众号-arXiv每日学术速递 (arXiv Daily Academic Digest, WeChat official account)
Published: 2021-09-16 15:10:27

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favorites, and more.

q-fin (Quantitative Finance): 8 papers in total

cs.SD (Speech/Sound): 8 papers in total

eess.AS (Audio and Speech Processing): 10 papers in total

1. q-fin (Quantitative Finance):

【1】 Detection of Structural Regimes and Analyzing the Impact of Crude Oil Market on Canadian Stock Market: Markov Regime-Switching Approach
Link: https://arxiv.org/abs/2109.01046

Authors: Mohammadreza Mahmoudi, Hana Ghaneei
Affiliations: Department of Economics, Northern Illinois University, Dekalb, USA; Department of Industrial and Systems Engineering, Northern Illinois University, Dekalb, USA
Note: 14 pages, 10 tables, 2 figures
Abstract: This study aims to analyze the impact of the crude oil market on the Toronto Stock Exchange Index (TSX) based on monthly data from 1979 to 2018 using a nonlinear Markov regime-switching approach. The results indicate that TSX returns contain two regimes: a positive-return regime (regime 1), when the growth rate of the stock index is positive, and a negative-return regime (regime 2), when the growth rate of the stock index is negative. The findings also show that the crude oil market has a positive effect on the stock market in both regimes; however, the effect of the oil price on the stock market is larger in regime 1 than in regime 2. Moreover, a two-period lag of the oil price increases the stock price in regime 1, while it decreases the stock price in regime 2.
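
To illustrate the kind of two-regime Markov regime-switching regression described above, here is a minimal sketch using statsmodels' MarkovRegression. The synthetic monthly data, the regime probabilities, and the two-period lag structure are illustrative assumptions, not the authors' exact specification.

```python
# Minimal sketch of a two-regime Markov regime-switching regression of stock
# returns on oil returns (synthetic data; not the paper's exact specification).
import numpy as np
import pandas as pd
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

rng = np.random.default_rng(0)
n = 480                                            # 40 years of monthly observations
oil = pd.Series(rng.normal(0, 0.08, n), name="oil")
bull = rng.random(n) < 0.7                         # latent bull/bear indicator for the toy data
tsx = 0.01 * bull - 0.01 * ~bull + 0.3 * oil.to_numpy() + rng.normal(0, 0.03, n)

exog = pd.concat([oil, oil.shift(2).rename("oil_lag2")], axis=1).dropna()
endog = pd.Series(tsx, name="tsx").loc[exog.index]

model = MarkovRegression(
    endog,
    k_regimes=2,             # regime 1: positive-return state, regime 2: negative-return state
    exog=exog,               # contemporaneous oil return and its two-period lag
    switching_variance=True, # let volatility differ across regimes
)
res = model.fit()
print(res.summary())
print(res.smoothed_marginal_probabilities[0].head())  # P(regime 0) at each date
```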

【2】 Analysis of taste heterogeneity in commuters travel decisions using joint parking and mode choice model: A case from urban India
Link: https://arxiv.org/abs/2109.01045

Authors: Janak Parmar, Gulnazbanu Saiyed, Sanjaykumar Dave
Note: 36 pages, 2 figures
Abstract: The concept of transportation demand management (TDM) upholds the development of sustainable mobility through an optimally balanced transport modal share in cities. Modal split management directly reflects on the TDM of each transport subsystem, including parking. In developing countries, policy-makers have largely focused on supply-side measures, while demand-side measures have remained unaddressed in policy implications. Ample literature is available presenting responses to TDM strategies, but most studies treat mode choice and parking choice behaviour separately rather than considering the trade-offs between them. Failing to do so may lead to biased model estimates and impropriety in policy implications. This paper seeks to fill this gap by admitting parking choice as an endogenous decision within the model of mode choice behaviour. The study integrates attitudinal factors and built-environment variables in addition to parking and travel attributes to develop comprehensive estimation results. A mixed logit model with random coefficients is estimated using a hierarchical Bayes approach based on the Markov chain Monte Carlo simulation method. The results reveal a significant influence of mode/parking-specific attitudes on commuters' choice behaviour in addition to the built-environment factors and mode/parking-related attributes. It is identified that a considerable shift occurs between parking types in preference to switching travel mode under hypothetical changes in parking attributes. Besides, the study investigates the heterogeneity in willingness-to-pay through a follow-up regression model, which provides important insights for identifying possible sources of this heterogeneity among respondents. The study provides remarkable results which may be beneficial to planning authorities for improving TDM strategies, especially in developing countries.

【3】 Forecasting High-Dimensional Covariance Matrices of Asset Returns with Hybrid GARCH-LSTMs
Link: https://arxiv.org/abs/2109.01044

Authors: Lucien Boulet
Affiliations: MSc Student, Financial Markets, Université Paris-Dauphine - PSL; MSc Student, Pure Mathematics, Sorbonne Université
Note: 30 pages
Abstract: Several academics have studied the ability of hybrid models mixing univariate Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models and neural networks to deliver better volatility predictions than purely econometric models. Despite presenting very promising results, the generalization of such models to the multivariate case has yet to be studied. Moreover, very few papers have examined the ability of neural networks to predict the covariance matrix of asset returns, and all use a rather small number of assets, thus not addressing what is known as the curse of dimensionality. The goal of this paper is to investigate the ability of hybrid models, mixing GARCH processes and neural networks, to forecast covariance matrices of asset returns. To do so, we propose a new model, based on multivariate GARCHs that decompose volatility and correlation predictions. The volatilities are here forecast using hybrid neural networks, while correlations follow a traditional econometric process. After implementing the models in a minimum-variance portfolio framework, our results are as follows. First, the addition of GARCH parameters as inputs is beneficial to the model proposed. Second, the use of one-hot encoding to help the neural network differentiate between each stock improves the performance. Third, the new model proposed is very promising as it not only outperforms the equally weighted portfolio, but also outperforms by a significant margin its econometric counterpart that uses univariate GARCHs to predict the volatilities.
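
A minimal numpy sketch of the volatility/correlation decomposition the abstract describes: per-asset volatility forecasts (here random placeholders standing in for the hybrid GARCH-LSTM) are combined with a correlation forecast to rebuild the covariance matrix and feed a minimum-variance portfolio. This is only the reconstruction step, not the paper's forecasting models.

```python
# Rebuild a covariance forecast from separate volatility and correlation forecasts,
# then compute minimum-variance portfolio weights (illustrative placeholders only).
import numpy as np

def covariance_from_vols_and_corr(vols, corr):
    """Sigma = D R D, with D = diag(predicted volatilities) and R the correlation forecast."""
    D = np.diag(vols)
    return D @ corr @ D

def min_variance_weights(sigma):
    """Unconstrained minimum-variance weights: w = Sigma^-1 1 / (1' Sigma^-1 1)."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)
    return w / w.sum()

rng = np.random.default_rng(0)
n_assets = 5
vols = rng.uniform(0.01, 0.03, n_assets)              # placeholder for GARCH-LSTM volatility forecasts
returns = rng.standard_normal((n_assets, 250))        # placeholder return history
corr = np.corrcoef(returns)                           # placeholder correlation forecast
sigma = covariance_from_vols_and_corr(vols, corr)
print(min_variance_weights(sigma))
```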

【4】 Land use change in agricultural systems: an integrated ecological-social simulation model of farmer decisions and cropping system performance based on the Cell-DEVS formalism
Link: https://arxiv.org/abs/2109.01031

Authors: Diego Ferraro, Daniela Blanco, Sebastián Pessah, Rodrigo Castro
Affiliations: Universidad de Buenos Aires, Argentina
Abstract: Agricultural systems experience land-use changes that are driven by population growth and intensification of technological inputs. This results in land-use and cover change (LUCC) dynamics representing a complex landscape transformation process. In order to study the LUCC process, we developed a spatially explicit agent-based model in the form of a cellular automaton implemented with the Cell-DEVS formalism. The resulting model, called AgroDEVS, is used for predicting LUCC dynamics along with their associated economic and environmental changes. AgroDEVS is structured using behavioral rules and functions representing a) crop yields, b) weather conditions, c) economic profit, d) farmer preferences, e) technology level adoption and f) natural resources consumption based on embodied energy accounting. Using data from a typical location of the Pampa region (Argentina) for the 1988-2015 period, simulation exercises showed that the economic goals were achieved, on average, in 6 out of 10 years, but the environmental thresholds were only achieved in 1.9 out of 10 years. In a set of 50-year simulations, LUCC patterns quickly converge towards the most profitable crop sequences, with no noticeable trade-off between the economic and environmental conditions.

【5】 Precise option pricing by the COS method - How to choose the truncation interval
Link: https://arxiv.org/abs/2109.01030

Authors: Gero Junike, Konstantin Pankrashkin
Affiliations: Carl von Ossietzky Universität, Institut für Mathematik, Oldenburg, Germany
Abstract: The Fourier cosine expansion (COS) method is used to price European options numerically very fast. To apply the COS method, a truncation interval for the density of the log-returns needs to be provided. Using Markov's inequality, we derive a new formula to obtain the truncation interval and prove that the interval is large enough to ensure convergence of the COS method within a predefined error tolerance. We also show by several examples that the classical approach of determining the truncation interval by cumulants may lead to serious mispricing. Usually, the computational time of the COS method is of similar magnitude in both cases.
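
For context, the classical cumulant-based rule criticized in the abstract chooses the truncation interval as [a, b] = c1 ∓ L·sqrt(c2 + sqrt(c4)) (the Fang-Oosterlee rule of thumb). The sketch below evaluates it for the Black-Scholes log-return density, where c1 = (r − σ²/2)T, c2 = σ²T and c4 = 0; this is the baseline heuristic, not the Markov-inequality interval proposed in the paper.

```python
# Classical cumulant-based truncation interval for the COS method
# (Fang & Oosterlee rule of thumb), evaluated for Black-Scholes log-returns.
# This is the baseline heuristic the paper argues can misprice; it is NOT the
# Markov-inequality interval derived by the authors.
import math

def cumulant_truncation_interval(c1, c2, c4, L=10.0):
    half_width = L * math.sqrt(c2 + math.sqrt(c4))
    return c1 - half_width, c1 + half_width

# Black-Scholes log-return cumulants over maturity T
r, sigma, T = 0.03, 0.2, 1.0
c1 = (r - 0.5 * sigma**2) * T   # mean of the log-return
c2 = sigma**2 * T               # variance
c4 = 0.0                        # fourth cumulant is zero under Black-Scholes

a, b = cumulant_truncation_interval(c1, c2, c4)
print(f"truncation interval: [{a:.4f}, {b:.4f}]")
```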

【6】 Bilinear Input Normalization for Neural Networks in Financial Forecasting
Link: https://arxiv.org/abs/2109.00983

Authors: Dat Thanh Tran, Juho Kanniainen, Moncef Gabbouj, Alexandros Iosifidis
Affiliations: Department of Computing Sciences, Tampere University, Finland; Department of Electrical and Computer Engineering, Aarhus University, Denmark
Note: 1 figure, 6 tables
Abstract: Data normalization is one of the most important preprocessing steps when building a machine learning model, especially when the model of interest is a deep neural network. This is because deep neural networks optimized with stochastic gradient descent are sensitive to the input variable range and prone to numerical issues. Unlike other types of signals, financial time-series often exhibit unique characteristics such as high volatility, non-stationarity and multi-modality that make them challenging to work with, often requiring expert domain knowledge for devising a suitable processing pipeline. In this paper, we propose a novel data-driven normalization method for deep neural networks that handle high-frequency financial time-series. The proposed normalization scheme, which takes into account the bimodal characteristic of financial multivariate time-series, requires no expert knowledge to preprocess a financial time-series since this step is formulated as part of the end-to-end optimization process. Our experiments, conducted with state-of-the-art neural networks and high-frequency data from two large-scale limit order books coming from the Nordic and US markets, show significant improvements over other normalization techniques in forecasting future stock price dynamics.
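
The paper's bilinear scheme is not spelled out in the abstract, so the sketch below only illustrates the general idea of a learnable, data-driven input-normalization layer for multivariate time series: an adaptive z-score with trainable shift and scale, trained end to end with the rest of the network. It is an assumption-laden stand-in, not the authors' method.

```python
# Generic learnable normalization layer for multivariate time series
# (adaptive z-score with trainable affine parameters). Illustrative only;
# this is NOT the bilinear normalization proposed in the paper.
import torch
import torch.nn as nn

class LearnableInputNorm(nn.Module):
    def __init__(self, n_features, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_features))   # trainable scale
        self.beta = nn.Parameter(torch.zeros(n_features))   # trainable shift
        self.eps = eps

    def forward(self, x):
        # x: (batch, time, features); normalize each feature over the time axis
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True)
        x_hat = (x - mean) / (std + self.eps)
        return x_hat * self.gamma + self.beta

x = torch.randn(8, 100, 40)          # e.g. 8 order-book windows, 100 time steps, 40 features
norm = LearnableInputNorm(40)
print(norm(x).shape)                 # torch.Size([8, 100, 40])
```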

【7】 Auctions and Prediction Markets for Scientific Peer Review
Link: https://arxiv.org/abs/2109.00923

Authors: Siddarth Srinivasan, Jamie Morgenstern
Affiliations: School of Computer Science and Engineering, University of Washington, Seattle, WA
Abstract: Peer reviewed publications are considered the gold standard in certifying and disseminating ideas that a research community considers valuable. However, we identify two major drawbacks of the current system: (1) the overwhelming demand for reviewers due to a large volume of submissions, and (2) the lack of incentives for reviewers to participate and expend the necessary effort to provide high-quality reviews. In this work, we adopt a mechanism-design approach to propose improvements to the peer review process. We present a two-stage mechanism which ties together the paper submission and review process, simultaneously incentivizing high-quality reviews and high-quality submissions. In the first stage, authors participate in a VCG auction for review slots by submitting their papers along with a bid that represents their expected value for having their paper reviewed. For the second stage, we propose a novel prediction market-style mechanism (H-DIPP) building on recent work in the information elicitation literature, which incentivizes participating reviewers to provide honest and effortful reviews. The revenue raised by the Stage I auction is used in Stage II to pay reviewers based on the quality of their reviews.
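
To make the Stage I idea concrete, here is a toy sketch of a VCG auction for k identical review slots with unit-demand bidders: the k highest bids win, and each winner pays the externality it imposes, which in this simplified setting reduces to the (k+1)-th highest bid. This is a textbook VCG instance under simplifying assumptions, not the paper's full two-stage mechanism, and it omits Stage II entirely.

```python
# Toy VCG auction for k identical review slots with unit-demand bidders.
# Each winner's VCG payment equals the highest losing bid (the (k+1)-th bid).
# Simplified illustration; not the paper's full mechanism.
def vcg_review_slots(bids, k):
    """bids: dict mapping submission -> bid; k: number of review slots available."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = [paper for paper, _ in ranked[:k]]
    # Externality on others: without a given winner, the (k+1)-th bidder would win,
    # so each winner pays the (k+1)-th highest bid (0 if there are fewer than k+1 bidders).
    payment = ranked[k][1] if len(ranked) > k else 0.0
    return winners, {paper: payment for paper in winners}

bids = {"paper_A": 9.0, "paper_B": 7.5, "paper_C": 4.0, "paper_D": 2.0}
winners, payments = vcg_review_slots(bids, k=2)
print(winners)    # ['paper_A', 'paper_B']
print(payments)   # {'paper_A': 4.0, 'paper_B': 4.0}
```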

【8】 Using Temperature Sensitivity to Estimate Shiftable Electricity Demand: Implications for power system investments and climate change
Link: https://arxiv.org/abs/2109.00643

Authors: Michael J. Roberts, Sisi Zhang, Eleanor Yuan, James Jones, Matthias Fripp
Affiliations: Department of Economics, University of Hawai‘i at Mānoa, Honolulu, HI, USA; University of Hawai‘i Economic Research Organization (UHERO); University of Hawai‘i Sea Grant College Program
Note: 23 pages plus a 17-page supplement, 4 figures, 1 table, plus supplementary tables and figures
Abstract: Growth of intermittent renewable energy and climate change make it increasingly difficult to manage electricity demand variability. Transmission and centralized storage technologies can help, but are costly. An alternative to centralized storage is to make better use of shiftable demand, but it is unclear how much shiftable demand exists. A significant share of electricity demand is used for cooling and heating, and low-cost technologies exist to shift these loads. With sufficient insulation, energy used for air conditioning and space heating can be stored in ice or hot water from hours to days. In this study, we combine regional hourly demand with fine-grained weather data across the United States to estimate temperature-sensitive demand, and how much demand variability can be reduced by shifting temperature-sensitive loads within each day, with and without improved transmission. We find that approximately three quarters of within-day demand variability can be eliminated by shifting only half of temperature-sensitive demand. The variability-reducing benefits of employing available shiftable demand complement those gained from improved interregional transmission, and greatly mitigate the challenge of serving higher peaks under climate change.
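
A common way to estimate temperature-sensitive load, in the spirit of the abstract, is to regress hourly demand on heating- and cooling-degree terms. The sketch below fits such a piecewise specification with numpy least squares; the 18 °C balance point and the synthetic data are assumptions for illustration, not the paper's estimation strategy.

```python
# Estimate temperature-sensitive demand with a simple degree-day style regression:
# demand ~ const + b_heat * max(T_base - T, 0) + b_cool * max(T - T_base, 0).
# The 18 degree C balance point and the synthetic data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
T_base = 18.0
temps = rng.uniform(-5, 35, 8760)                        # hourly temperatures for one year
heating = np.maximum(T_base - temps, 0.0)
cooling = np.maximum(temps - T_base, 0.0)
demand = 50 + 1.2 * heating + 2.0 * cooling + rng.normal(0, 3, temps.size)

X = np.column_stack([np.ones_like(temps), heating, cooling])
coef, *_ = np.linalg.lstsq(X, demand, rcond=None)
base, b_heat, b_cool = coef
temp_sensitive = b_heat * heating + b_cool * cooling      # estimated temperature-driven (shiftable) load
print(f"base={base:.1f}, heating slope={b_heat:.2f}, cooling slope={b_cool:.2f}")
print(f"temperature-sensitive share of demand: {temp_sensitive.sum() / demand.sum():.2%}")
```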

2. cs.SD (Speech/Sound):

【1】 Binaural Audio Generation via Multi-task Learning
Link: https://arxiv.org/abs/2109.00748

Authors: Sijia Li, Shiguang Liu, Dinesh Manocha
Affiliations: Tianjin University; University of Maryland at College Park
Abstract: We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information from two related tasks: the binaural audio generation task and the flipped audio classification task. Our learning model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features using ResNet from the video frames. Next, we perform binaural audio generation and flipped audio classification using separate subnetworks based on visual features. Our learning method optimizes the overall loss based on the weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play dataset and the YouTube-ASMR dataset. We perform quantitative and qualitative evaluations to demonstrate the benefits of our approach over prior techniques.
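
The multi-task objective described above, a weighted sum of the binaural generation loss and the flipped-channel classification loss, can be sketched as below. The loss choices (L2 on the predicted channels, binary cross-entropy on the flip label) and the weights are assumptions, not the paper's exact settings.

```python
# Weighted multi-task loss combining binaural generation and flip classification.
# Loss functions and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(pred_channels, true_channels, flip_logit, flip_label,
                   w_gen=1.0, w_cls=0.5):
    # pred_channels / true_channels: (batch, 2, time) predicted vs. ground-truth L/R audio
    gen_loss = F.mse_loss(pred_channels, true_channels)
    # flip_logit: (batch,) raw score for "left/right channels are swapped"
    cls_loss = F.binary_cross_entropy_with_logits(flip_logit, flip_label.float())
    return w_gen * gen_loss + w_cls * cls_loss

pred = torch.randn(4, 2, 16000, requires_grad=True)
target = torch.randn(4, 2, 16000)
flip_logit = torch.randn(4, requires_grad=True)
flip_label = torch.randint(0, 2, (4,))
loss = multitask_loss(pred, target, flip_logit, flip_label)
loss.backward()
print(loss.item())
```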

【2】 Multichannel Audio Source Separation with Independent Deeply Learned Matrix Analysis Using Product of Source Models
Link: https://arxiv.org/abs/2109.00704

Authors: Takuya Hasumi, Tomohiko Nakamura, Norihiro Takamune, Hiroshi Saruwatari, Daichi Kitamura, Yu Takahashi, Kazunobu Kondo
Affiliations: The University of Tokyo, Tokyo, Japan; National Institute of Technology, Kagawa College, Kagawa, Japan; Yamaha Corporation, Shizuoka, Japan
Note: 8 pages, 5 figures, accepted for the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2021 (APSIPA ASC 2021)
Abstract: Independent deeply learned matrix analysis (IDLMA) is one of the state-of-the-art multichannel audio source separation methods using source power estimation based on deep neural networks (DNNs). The DNN-based power estimation works well for sounds having timbres similar to the DNN training data. However, the sounds to which IDLMA is applied do not always have such timbres, and the timbral mismatch causes performance degradation of IDLMA. To tackle this problem, we focus on a blind source separation counterpart of IDLMA, independent low-rank matrix analysis. It uses nonnegative matrix factorization (NMF) as the source model, which can capture source spectral components that only appear in the target mixture, using the low-rank structure of the source spectrogram as a clue. We thus extend the DNN-based source model to encompass the NMF-based source model on the basis of the product-of-expert concept, which we call the product of source models (PoSM). For the proposed PoSM-based IDLMA, we derive a computationally efficient parameter estimation algorithm based on an optimization principle called the majorization-minimization algorithm. Experimental evaluations show the effectiveness of the proposed method.
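
The NMF source model mentioned above factorizes a nonnegative spectrogram V ≈ WH, exploiting its low-rank structure. A minimal sketch with Euclidean multiplicative updates is shown below; the paper itself builds on Itakura-Saito-divergence NMF inside the IDLMA framework, so treat this only as a generic NMF illustration.

```python
# Minimal NMF of a nonnegative spectrogram V into W (spectral bases) and
# H (activations) using Euclidean multiplicative updates (Lee & Seung).
# Generic illustration; the paper uses IS-divergence NMF inside IDLMA.
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-12):
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((513, 100)))  # stand-in magnitude spectrogram
W, H = nmf(V, rank=8)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```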

【3】 Controllable deep melody generation via hierarchical music structure representation
Link: https://arxiv.org/abs/2109.00663

Authors: Shuqi Dai, Zeyu Jin, Celso Gomes, Roger B. Dannenberg
Affiliations: Carnegie Mellon University; Adobe Inc.
Note: 6 pages, 9 figures, in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., Online, 2021
Abstract: Recent advances in deep learning have expanded possibilities to generate music, but generating a customizable full piece of music with consistent long-term structure remains a challenge. This paper introduces MusicFrameworks, a hierarchical music structure representation and a multi-step generative process to create a full-length melody guided by long-term repetitive structure, chord, melodic contour, and rhythm constraints. We first organize the full melody with section and phrase-level structure. To generate melody in each phrase, we generate rhythm and basic melody using two separate transformer-based networks, and then generate the melody conditioned on the basic melody, rhythm and chords in an auto-regressive manner. By factoring music generation into sub-problems, our approach allows simpler models and requires less data. To customize or add variety, one can alter chords, basic melody, and rhythm structure in the music frameworks, letting our networks generate the melody accordingly. Additionally, we introduce new features to encode musical positional information, rhythm patterns, and melodic contours based on musical domain knowledge. A listening test reveals that melodies generated by our method are rated as good as or better than human-composed music in the POP909 dataset about half the time.

【4】 Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition
Link: https://arxiv.org/abs/2109.00627

Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland
Affiliations: Cambridge University Engineering Dept., Trumpington St., Cambridge, U.K.
Note: Submitted to ASRU 2021, 5 pages
Abstract: Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut between the tree and the final ASR output distribution to facilitate recognising biasing words during decoding. Systems were trained and evaluated on the Librispeech corpus, where biasing words were extracted at the scales of an utterance, a chapter, or a book to simulate different application scenarios. Experimental results showed that TCPGen consistently improved word error rates (WERs) compared to the baselines and, in particular, achieved significant WER reductions on the biasing words. TCPGen is highly efficient: it can handle 5,000 biasing words and distractors while adding only a small overhead to memory use and computation cost.
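
The prefix tree over the biasing list, which TCPGen uses to constrain which symbols can extend a partial match during decoding, can be sketched as follows. The character-level granularity here is a simplification; TCPGen operates on subword units.

```python
# Build a prefix tree (trie) over a biasing word list and query which symbols
# can validly extend a given prefix. Character-level for simplicity; TCPGen
# works on subword units.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def valid_extensions(root, prefix):
    """Return the set of next symbols allowed by the biasing list after `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return set()
        node = node.children[ch]
    return set(node.children)

trie = build_trie(["turing", "turner", "transducer"])
print(valid_extensions(trie, "tur"))   # {'i', 'n'} (set order may vary)
```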

【5】 FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection
Link: https://arxiv.org/abs/2109.00577

Authors: Hugo Carneiro, Cornelius Weber, Stefan Wermter
Affiliations: Universität Hamburg, Department of Informatics, Knowledge Technology, Vogt-Koelln-Str., Hamburg
Abstract: The strong relation between face and voice can aid active speaker detection systems when faces are visible, even in difficult settings, when the face of a speaker is not clear or when there are several people in the same scene. By being capable of estimating the frontal facial representation of a person from his/her speech, it becomes easier to determine whether he/she is a potential candidate for being classified as an active speaker, even in challenging cases in which no mouth movement is detected from any person in that same scene. By incorporating a face-voice association neural network into an existing state-of-the-art active speaker detection model, we introduce FaVoA (Face-Voice Association Ambiguous Speaker Detector), a neural network model that can correctly classify particularly ambiguous scenarios. FaVoA not only finds positive associations, but helps to rule out non-matching face-voice associations, where a face does not match a voice. Its use of a gated-bimodal-unit architecture for the fusion of those models offers a way to quantitatively determine how much each modality contributes to the classification.
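
The gated bimodal unit used for fusion computes a sigmoid gate over the concatenated modalities and mixes their tanh projections, so the gate values indicate how much each modality contributes. Below is a minimal PyTorch sketch following Arevalo et al.'s gated multimodal unit; the feature dimensions are placeholders, not FaVoA's actual configuration.

```python
# Gated bimodal unit (after Arevalo et al.'s Gated Multimodal Unit):
# h = z * tanh(W_v v) + (1 - z) * tanh(W_a a), with z = sigmoid(W_z [v; a]).
# The gate z expresses how much each modality contributes to the fused feature.
import torch
import torch.nn as nn

class GatedBimodalUnit(nn.Module):
    def __init__(self, dim_visual, dim_audio, dim_hidden):
        super().__init__()
        self.proj_v = nn.Linear(dim_visual, dim_hidden)
        self.proj_a = nn.Linear(dim_audio, dim_hidden)
        self.gate = nn.Linear(dim_visual + dim_audio, dim_hidden)

    def forward(self, v, a):
        h_v = torch.tanh(self.proj_v(v))
        h_a = torch.tanh(self.proj_a(a))
        z = torch.sigmoid(self.gate(torch.cat([v, a], dim=-1)))
        return z * h_v + (1 - z) * h_a, z   # fused feature and per-unit modality weights

gbu = GatedBimodalUnit(dim_visual=512, dim_audio=128, dim_hidden=256)
fused, gate = gbu(torch.randn(4, 512), torch.randn(4, 128))
print(fused.shape, gate.mean().item())
```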

【6】 You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
Link: https://arxiv.org/abs/2109.00962

Authors: Satvik Venkatesh, David Moffat, Eduardo Reck Miranda
Affiliations: University of Plymouth
Note: 7 pages, 3 figures, 5 tables. Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract: Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. It is useful for audio-content analysis, speech recognition, audio-indexing, and music information retrieval. In recent years, most research articles adopt segmentation-by-classification. This technique divides audio into small frames and individually performs classification on these frames. In this paper, we present a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in Computer Vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. This is done by having separate output neurons to detect the presence of an audio class and predict its start and end points. YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network on multiple datasets. As YOHO is purely a convolutional neural network and has no recurrent layers, it is faster during inference. In addition, as this approach is more end-to-end and predicts acoustic boundaries directly, it is significantly quicker during post-processing and smoothing.
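
The regression formulation above can be sketched as an output head with three values per class per time bin (class presence plus normalized start and end offsets), with the boundary regression penalized only where the class is present. The exact head shape and loss weighting below are assumptions, not the paper's settings.

```python
# YOHO-style loss: for each time bin and acoustic class, the head predicts
# [presence, start, end]; boundary regression is only applied where the class
# is actually present. Head shape and loss weighting are assumptions.
import torch
import torch.nn.functional as F

def yoho_like_loss(pred, target):
    # pred, target: (batch, time_bins, n_classes, 3) with channels [presence, start, end]
    presence_pred, bounds_pred = pred[..., 0], pred[..., 1:]
    presence_true, bounds_true = target[..., 0], target[..., 1:]
    cls_loss = F.binary_cross_entropy_with_logits(presence_pred, presence_true)
    mask = presence_true.unsqueeze(-1)                      # regress boundaries of active events only
    reg_loss = (mask * (torch.sigmoid(bounds_pred) - bounds_true) ** 2).sum() / mask.sum().clamp(min=1)
    return cls_loss + reg_loss

pred = torch.randn(2, 9, 3, 3, requires_grad=True)          # 9 time bins, 3 acoustic classes
target = torch.zeros(2, 9, 3, 3)
target[0, 4, 1] = torch.tensor([1.0, 0.2, 0.8])             # one event of class 1 within bin 4
loss = yoho_like_loss(pred, target)
loss.backward()
print(loss.item())
```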

【7】 ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection
Link: https://arxiv.org/abs/2109.00537

Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado
Affiliations: ASVspoof consortium
Note: Accepted to the ASVspoof 2021 Workshop
Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks, in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection. This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results. Despite the introduction of channel and compression variability, which compound the difficulty, results for the logical access and deepfake tasks are close to those from previous ASVspoof editions. Results for the physical access task show the difficulty in detecting attacks in real, variable physical spaces. With ASVspoof 2021 being the first edition for which participants were not provided with any matched training or development data, reflecting real conditions in which the nature of spoofed and deepfake speech can never be predicted with confidence, the results are extremely encouraging and demonstrate the substantial progress made in the field in recent years.

【8】 ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan
Link: https://arxiv.org/abs/2109.00535

Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi
Affiliations: ASVspoof consortium
Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake speech. This document provides a technical description of the ASVspoof 2021 challenge, including details of training, development and evaluation data, metrics, baselines, evaluation rules, submission procedures and the schedule.

3. eess.AS (Audio and Speech Processing):

【1】 You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
Link: https://arxiv.org/abs/2109.00962

(Cross-listed; see cs.SD 【6】 above for authors, notes, and abstract.)

【2】 Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring
Link: https://arxiv.org/abs/2109.00928

Authors: Yaman Kumar Singla, Avykat Gupta, Shaurya Bagga, Changyou Chen, Balaji Krishnamurthy, Rajiv Ratn Shah
Affiliations: IIIT-Delhi, India; Adobe, India; SUNY-Buffalo, USA
Note: Published in CIKM 2021
Abstract: Automatic Speech Scoring (ASS) is the computer-assisted evaluation of a candidate's speaking proficiency in a language. ASS systems face many challenges like open grammar, variable pronunciations, and unstructured or semi-structured content. Recent deep learning approaches have shown some promise in this domain. However, most of these approaches focus on extracting features from a single audio, making them suffer from the lack of speaker-specific context required to model such a complex task. We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context vectors from these responses and feed them as additional speaker-specific context to our network to score a particular response. We compare our technique with strong baselines and find that such modeling improves the model's average performance by 6.92% (maximum = 12.86%, minimum = 4.51%). We further show both quantitative and qualitative insights into the importance of this additional context in solving the problem of ASS.

【3】 Exploring Retraining-Free Speech Recognition for Intra-sentential Code-Switching
Link: https://arxiv.org/abs/2109.00921

Authors: Zhen Huang, Xiaodan Zhuang, Daben Liu, Xiaoqiang Xiao, Yuchen Zhang, Sabato Marco Siniscalchi
Affiliations: Apple Inc., Broadway, Cambridge, MA
Abstract: In this paper, we present our initial efforts for building a code-switching (CS) speech recognition system leveraging existing acoustic models (AMs) and language models (LMs), i.e., no training required, and specifically targeting intra-sentential switching. To achieve such an ambitious goal, new mechanisms for foreign pronunciation generation and language model (LM) enrichment have been devised. Specifically, we have designed an automatic approach to obtain high quality pronunciation of foreign language (FL) words in the native language (NL) phoneme set using existing acoustic phone decoders and an LSTM-based grapheme-to-phoneme (G2P) model. Improved accented pronunciations have thus been obtained by learning foreign pronunciations directly from data. Furthermore, a code-switching LM was deployed by converting the original NL LM into a CS LM using translated word pairs and borrowing statistics for the NL LM. Experimental evidence clearly demonstrates that our approach better deals with accented foreign pronunciations than techniques based on human labeling. Moreover, our best system achieves a 55.5% relative word error rate reduction from 34.4%, obtained with a conventional monolingual ASR system, to 15.3% on an intra-sentential CS task without harming the monolingual recognition accuracy.

【4】 Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection
Link: https://arxiv.org/abs/2109.00913

Authors: Junxiao Xue, Hao Zhou, Yabo Wang
Affiliations: School of Software, Zhengzhou University; Westlake University
Abstract: Speaker verification systems have been used in many production scenarios in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks such as voice conversion and speech synthesis. In this paper, we propose a new method based on physiological-physical feature fusion to deal with voice spoofing attacks. This method involves feature extraction, a densely connected convolutional neural network with squeeze-and-excitation block (SE-DenseNet), a multi-scale residual neural network with squeeze-and-excitation block (SE-Res2Net) and feature fusion strategies. We first pre-trained a convolutional neural network using the speaker's voice and face in the video as surveillance signals; it can extract physiological features from speech. Then we use SE-DenseNet and SE-Res2Net to extract physical features. Such a dense connection pattern has high parameter efficiency, and the squeeze-and-excitation block can enhance the transmission of the feature. Finally, we integrate the two features into the SE-DenseNet to identify the spoofing attacks. Experimental results on the ASVspoof 2019 dataset show that our model is effective for voice spoofing detection. In the logical access scenario, our model improves the tandem decision cost function (t-DCF) and equal error rate (EER) scores by 4% and 7%, respectively, compared with other methods. In the physical access scenario, our model improved t-DCF and EER scores by 8% and 10%, respectively.
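
The squeeze-and-excitation block referenced in SE-DenseNet and SE-Res2Net recalibrates channels by squeezing global context into per-channel statistics and exciting them through a small bottleneck. Below is a standard PyTorch sketch; the reduction ratio of 16 is a common default, not necessarily the paper's setting.

```python
# Standard squeeze-and-excitation (SE) block: global average pooling ("squeeze"),
# a two-layer bottleneck with sigmoid gating ("excitation"), then channel-wise
# rescaling. Reduction ratio 16 is a common default, not necessarily the paper's.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, freq, time), e.g. a spectrogram feature map
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))            # squeeze: global average pool per channel
        weights = self.fc(squeezed).view(b, c, 1, 1)
        return x * weights                       # excite: reweight each channel

se = SEBlock(channels=64)
feat = torch.randn(2, 64, 80, 200)
print(se(feat).shape)                            # torch.Size([2, 64, 80, 200])
```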

【5】 ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection
Link: https://arxiv.org/abs/2109.00537

(Cross-listed; see cs.SD 【7】 above for authors, notes, and abstract.)

【6】 ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan
Link: https://arxiv.org/abs/2109.00535

(Cross-listed; see cs.SD 【8】 above for authors and abstract.)

【7】 Binaural Audio Generation via Multi-task Learning
Link: https://arxiv.org/abs/2109.00748

(Cross-listed; see cs.SD 【1】 above for authors and abstract.)

【8】 Multichannel Audio Source Separation with Independent Deeply Learned Matrix Analysis Using Product of Source Models
Link: https://arxiv.org/abs/2109.00704

(Cross-listed; see cs.SD 【2】 above for authors, notes, and abstract.)

【9】 Controllable deep melody generation via hierarchical music structure representation
Link: https://arxiv.org/abs/2109.00663

(Cross-listed; see cs.SD 【3】 above for authors, notes, and abstract.)

【10】 FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection
Link: https://arxiv.org/abs/2109.00577

(Cross-listed; see cs.SD 【5】 above for authors and abstract.)

