Finance / Speech / Audio Processing arXiv Digest [7.26]

WeChat official account: arXiv每日学术速递 (arXiv Daily Academic Digest)

Published 2021-07-27 11:17:14

q-fin (Quantitative Finance): 12 papers

cs.SD (Sound): 9 papers

eess.AS (Audio and Speech Processing): 7 papers

1. q-fin (Quantitative Finance):

【1】 Optimum Risk Portfolio and Eigen Portfolio: A Comparative Analysis Using Selected Stocks from the Indian Stock Market

Authors: Jaydip Sen, Sidra Mehtab
Affiliation: Praxis Business School, Kolkata, India
Comments: This is the preprint of our accepted paper in the International Journal of Business Forecasting and Marketing Intelligence, published by Inderscience Publishers, Switzerland. It consists of 35 pages and includes 29 figures and 36 tables.
Link: https://arxiv.org/abs/2107.11371
Abstract: Designing an optimum portfolio that allocates weights to its constituent stocks in a way that achieves the best trade-off between return and risk is a challenging research problem. The classical mean-variance portfolio theory proposed by Markowitz is found to perform sub-optimally on real-world stock market data, since the estimation error in the expected returns adversely affects the performance of the portfolio. This paper presents three approaches to portfolio design, viz. the minimum risk portfolio, the optimum risk portfolio, and the Eigen portfolio, for seven important sectors of the Indian stock market. The daily historical prices of the stocks are scraped from the Yahoo Finance website for the period from January 1, 2016, to December 31, 2020. Three portfolios are built for each of the seven sectors chosen for this study, and the portfolios are analyzed on the training data based on several metrics such as annualized return and risk, weights assigned to the constituent stocks, the correlation heatmaps, and the principal components of the Eigen portfolios. Finally, the optimum risk portfolios and the Eigen portfolios for all sectors are tested on their return over a six-month period. The performances of the portfolios are compared, and the portfolio yielding the higher return for each sector is identified.
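As a rough illustration of two of the portfolio types named in the abstract above, the following sketch computes minimum-variance weights for two assets in closed form and an "eigen portfolio" from the leading eigenvector of a covariance matrix. The covariance values and all function names are hypothetical; the paper's own data, estimation procedure, and choice of matrix (covariance versus correlation) are not given in the abstract, so this is a sketch under stated assumptions rather than the authors' method.

```python
import math


def min_variance_weights_2(var_a, var_b, cov_ab):
    """Closed-form minimum-variance weights for two assets (weights sum to 1)."""
    denom = var_a + var_b - 2.0 * cov_ab
    w_a = (var_b - cov_ab) / denom
    return [w_a, 1.0 - w_a]


def portfolio_variance(weights, cov):
    """Quadratic form w' * cov * w."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


def eigen_portfolio(cov, iterations=500):
    """Leading eigenvector of cov via power iteration, rescaled to sum to 1.

    Assumes the leading eigenvector does not sum to zero.
    """
    n = len(cov)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    total = sum(v)
    return [x / total for x in v]


# Hypothetical annualized covariance matrix for two stocks.
cov = [[0.04, 0.01],
       [0.01, 0.09]]

w_min = min_variance_weights_2(cov[0][0], cov[1][1], cov[0][1])
w_eig = eigen_portfolio(cov)

print("minimum-variance weights:", w_min)
print("eigen-portfolio weights: ", w_eig)
print("variance (min-var vs equal weights):",
      portfolio_variance(w_min, cov),
      portfolio_variance([0.5, 0.5], cov))
```

The minimum-variance portfolio tilts toward the less volatile stock, while the eigen portfolio tilts toward the direction of largest variance, which is why the two can behave very differently out of sample.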

【2】 Deep equal risk pricing of financial derivatives with non-translation invariant risk measures

Authors: Alexandre Carbonneau, Frédéric Godin
Affiliation: Concordia University, Department of Mathematics and Statistics, Montréal, Canada
Link: https://arxiv.org/abs/2107.11340
Abstract: The use of non-translation invariant risk measures within the equal risk pricing (ERP) methodology for the valuation of financial derivatives is investigated. The ability to move beyond the class of convex risk measures considered in several prior studies provides more flexibility within the pricing scheme. In particular, suitable choices for the risk measure embedded in the ERP framework, such as the semi-mean-square-error (SMSE), are shown herein to alleviate the price inflation phenomenon observed under Tail Value-at-Risk based ERP, as documented for instance in Carbonneau and Godin (2021b). The numerical implementation of non-translation invariant ERP is performed through deep reinforcement learning, where a slight modification is applied to the conventional deep hedging training algorithm (see Buehler et al., 2019) so as to enable obtaining a price through a single training run for the two neural networks associated with the respective long and short hedging strategies. The accuracy of the neural network training procedure is shown in simulation experiments not to be materially impacted by such modification of the training algorithm.
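To make "non-translation invariant" concrete: a risk measure ρ is translation invariant if ρ(X + c) = ρ(X) + c for any constant c. The sketch below checks this numerically for the plain expected loss (which satisfies it) and for a semi-mean-square-error. The SMSE is taken here as the mean of squared positive hedging losses, which is an assumption; the abstract does not state the exact normalization (for example, whether a square root is applied), and the loss values are hypothetical.

```python
def smse(losses):
    """Semi-mean-square-error: mean of squared positive parts of the losses.

    Assumed definition for illustration; only losses (x > 0) are penalized.
    """
    return sum(max(x, 0.0) ** 2 for x in losses) / len(losses)


def mean_loss(losses):
    """Plain expected loss, which is translation invariant."""
    return sum(losses) / len(losses)


# Hypothetical hedging errors (positive = loss for the hedger).
losses = [-2.0, -0.5, 0.0, 1.0, 3.0]
shift = 1.0
shifted = [x + shift for x in losses]

mean_gap = mean_loss(shifted) - mean_loss(losses)  # equals the shift
smse_gap = smse(shifted) - smse(losses)            # does not equal the shift

print("shift:", shift)
print("change in expected loss:", mean_gap)
print("change in SMSE:", smse_gap)
```

Because adding cash does not move the SMSE one-for-one, the equal-risk price can no longer be read off from a simple difference of two risk values, which is consistent with the abstract's remark that the training algorithm needs a modification to obtain the price.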

【3】 Dealing with Uncertainty: The Value of Reputation in the Absence of Legal Institutions

Authors: Nicolas Eschenbaum, Helge Liebert
Affiliation: and the University of St, Institute of Economics (FGN)
Link: https://arxiv.org/abs/2107.11314
Abstract: This paper studies reputation in the online market for illegal drugs, in which no legal institutions exist to alleviate uncertainty. Trade takes place on platforms that offer rating systems for sellers, thereby providing an observable measure of reputation. The analysis exploits the fact that one of the two dominant platforms unexpectedly disappeared. Re-entering sellers reset their rating. The results show that on average prices decreased by up to 9% and that a 1% increase in rating causes a price increase of 1%. Ratings and prices recover after about three months. We calculate that identified good types earn 1,650 USD more per week.

【4】 Margin Trading, Short-Selling and Corporate Green Innovation

Authors: Ge-zhi Wu, Da-ming You
Affiliation: Business School, Central South University, Changsha, China; Collaborative Innovation Center of Resource-conserving and Environmentally friendly Society and Ecological Civilization, Central South University, Changsha, China
Link: https://arxiv.org/abs/2107.11255
Abstract: This paper uses panel data on Chinese listed companies from 2007 to 2019, takes the relaxation of China's margin trading and short-selling restrictions as a quasi-experiment, and constructs a difference-in-differences model to analyze whether margin trading and short selling encourage enterprises to engage in green technology innovation. First, our results show that after the implementation of margin trading and short selling, the green technology innovation activity of pilot companies increases significantly. We believe that the threat and pressure that short selling brings to enterprises are the main reasons pilot enterprises engage in green technology innovation. Second, the empirical results show that the relaxation of margin trading and short-selling restrictions significantly increases the quantity of green technology innovation of pilot enterprises, but does not improve its quality. Furthermore, we analyze how the impact of the relaxation on the quantity of green technology innovation of pilot enterprises differs across periods. Finally, we find that the yield gap between financial assets and operating assets, the risk of stock price decline, management shareholding, the institutional shareholding ratio, weak product market competition, and bull markets affect the role of short selling in promoting green technology innovation of pilot enterprises.
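The difference-in-differences idea mentioned in the abstract above can be sketched in its simplest two-group, two-period form: the change in the pilot (treated) group minus the change in the control group. All numbers below are hypothetical, and the paper's actual specification is a panel regression whose controls and fixed effects are not described in the abstract.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences estimate of the treatment effect."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical green-patent counts per firm, before and after the policy.
treated_pre = [2, 3, 1, 2]
treated_post = [5, 6, 4, 5]
control_pre = [2, 2, 3, 1]
control_post = [3, 3, 4, 2]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print("estimated effect on patent count:", effect)
```

The estimate is only interpretable as causal under the parallel-trends assumption, that is, that pilot and non-pilot firms would have followed the same trend without the policy change.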

【5】 Where do I rank? Am I happy?: learning income position and subjective-wellbeing in an internet experiment

Authors: Eiji Yamamura
Affiliation: Seinan Gakuin University, Japan
Link: https://arxiv.org/abs/2107.11185
Abstract: A tailor-made internet survey experiment provides individuals with information on their income positions to examine the effect on subjective well-being. In the first survey, respondents were asked about their household income and subjective well-being. Based on the data collected, three different income positions were obtained for each respondent: within the residential locality, within a group with the same educational background, and within the cohort. In the follow-up survey for the treatment group, respondents were informed of their income positions and then asked about their subjective well-being. The key finding is that, after obtaining the information, a higher income position improves an individual's subjective well-being. The effects varied according to individual characteristics and proxies.

【6】 Reference Class Selection in Similarity-Based Forecasting of Sales Growth

Authors: Etienne Theising, Dominik Wied, Daniel Ziggel
Affiliation: Institute of Econometrics and Statistics, University of Cologne; Data Analytics, Flossbach von Storch AG
Link: https://arxiv.org/abs/2107.11133
Abstract: This paper proposes a method to find appropriate outside views for analysts' sales forecasts. The idea is to find reference classes, i.e. peer groups, for each analyzed company separately. Hence, additional companies are considered that share similarities with the firm of interest with respect to a specific predictor. The classes are regarded as optimal if the forecasted sales distributions match the actual distributions as closely as possible. The forecast quality is measured by applying goodness-of-fit tests to the estimated probability integral transformations and by comparing the predicted quantiles. The method is applied to a data set consisting of 21,808 US firms over the period 1950-2019, which is also analyzed descriptively. It appears that past operating margins in particular are good predictors of the distribution of future sales. A case study comparing our forecasts with actual analysts' estimates emphasizes the relevance of our approach in practice.
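The probability integral transform (PIT) check mentioned in the abstract above works as follows: if the forecast distribution F is right, then the values F(y) for realized outcomes y should look uniform on [0, 1]. The sketch below uses the empirical distribution of a reference class as the forecast and a Kolmogorov-Smirnov distance to the uniform distribution as the goodness-of-fit statistic. It simplifies the paper's setup by using one shared reference class for all firms, and the growth figures are hypothetical.

```python
def empirical_cdf(sample):
    """Return F(x) = share of sample values that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return sum(1 for v in ordered if v <= x) / n

    return cdf


def ks_statistic_uniform(pit_values):
    """Kolmogorov-Smirnov distance between PIT values and Uniform(0, 1)."""
    u = sorted(pit_values)
    n = len(u)
    distance = 0.0
    for i, x in enumerate(u):
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance


# Hypothetical sales growth rates of the peer group (reference class).
reference_class = [-0.10, -0.02, 0.01, 0.03, 0.05, 0.08, 0.12, 0.20, 0.30, 0.50]
# Hypothetical realized growth rates of the firms being forecast.
realized = [0.00, 0.04, 0.10, 0.25]

forecast_cdf = empirical_cdf(reference_class)
pit = [forecast_cdf(y) for y in realized]
ks = ks_statistic_uniform(pit)

print("PIT values:", pit)
print("KS distance to uniform:", ks)
```

A smaller distance indicates that the reference class produces better-calibrated forecast distributions, which is the criterion by which candidate classes could be ranked.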

【7】 COVID-19 and the gig economy in Poland

Authors: Maciej Beręsewicz, Dagmara Nikulin
Affiliation: Poznań University of Economics and Business, Poland; Gdańsk University of Technology
Link: https://arxiv.org/abs/2107.11124
Abstract: We use a dataset covering nearly the entire target population, based on data passively collected from smartphones, to measure the impact of the first COVID-19 wave on the gig economy in Poland. In particular, we focus on transportation (Uber, Bolt) and delivery (Wolt, Takeaway, Glover, DeliGoo) apps, which makes it possible to distinguish between the demand and supply sides of this market. Based on Bayesian structural time-series models, we estimate the causal impact of the first COVID-19 wave on the number of active drivers and couriers. We show a significant relative increase for Wolt and Glover (15% and 24%) and a slight relative decrease for Uber and Bolt (-3% and -7%) in comparison to a counterfactual control. The change for Uber and Bolt can be partially explained by the prospect of a new law (the so-called Uber Lex), which was already announced in 2019 and is intended to regulate the work of platform drivers.

【8】 Economic Recession Prediction Using Deep Neural Network

Authors: Zihao Wang, Kun Li, Steve Q. Xia, Hongfu Liu
Affiliation: Michtom School of Computer Science, Brandeis University, Waltham, MA, United States; Guardian Life, Hudson Yard, New York, NY, United States
Link: https://arxiv.org/abs/2107.10980
Abstract: We investigate the effectiveness of different machine learning methodologies in predicting economic cycles. We identify the deep learning methodology of Bi-LSTM with Autoencoder as the most accurate model to forecast the beginning and end of economic recessions in the U.S. We adopt commonly available macro and market-condition features to compare the ability of different machine learning models to generate good predictions both in-sample and out-of-sample. The proposed model is flexible and dynamic when both predictive variables and model coefficients vary over time. It provided good out-of-sample predictions for the past two recessions and an early warning about the COVID-19 recession.

【9】 A bridge between Local GAAP and Solvency II frameworks to quantify Capital Requirement for demographic risk

Authors: Gian Paolo Clemente, Francesco Della Corte, Nino Savelli
Affiliation: Università Cattolica del Sacro Cuore, Milano; Università La Sapienza, Roma
Comments: 24 pages, 5 figures
Link: https://arxiv.org/abs/2107.10891
Abstract: The paper provides a stochastic model useful for assessing the capital requirement for demographic risk. The model extends classical methodologies developed in a local accounting framework to the market-consistent context. In particular, we provide a unique formulation for different non-participating life insurance contracts, and we prove analytically that the valuation of demographic profit can be significantly affected by the financial conditions in the market. A case study is also developed, considering a portfolio of life insurance contracts. The results prove the effectiveness of the model in highlighting the main drivers of capital requirement evaluation, also in comparison with the local GAAP framework.

【10】 LocalGLMnet: interpretable deep learning for tabular data

Authors: Ronald Richman, Mario V. Wüthrich
Link: https://arxiv.org/abs/2107.11059
Abstract: Deep learning models have gained great popularity in statistical modeling because they lead to very competitive regression models, often outperforming classical statistical models such as generalized linear models. The disadvantage of deep learning models is that their solutions are difficult to interpret and explain, and variable selection is not easily possible because deep learning models solve feature engineering and variable selection internally in a nontransparent way. Inspired by the appealing structure of generalized linear models, we propose a new network architecture that shares similar features with generalized linear models but provides superior predictive power, benefiting from the art of representation learning. This new architecture allows for variable selection of tabular data and for interpretation of the calibrated deep learning model; in fact, our approach provides an additive decomposition in the spirit of Shapley values and integrated gradients.
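One way to picture the architecture described in the abstract above is a generalized linear model whose coefficients are themselves functions of the input: the linear predictor is β0 + Σ_j β_j(x)·x_j, so each term β_j(x)·x_j is the additive contribution of feature j. In the sketch below, `toy_attention` is a hypothetical hand-written stand-in for the neural network that would produce β(x), and the log link (inverse link `exp`) is an assumption chosen for illustration.

```python
import math


def local_glm_predict(x, beta0, attention, inverse_link=math.exp):
    """Return (prediction, per-feature contributions) for one observation.

    attention(x) yields one input-dependent coefficient per feature;
    the prediction is inverse_link(beta0 + sum_j beta_j(x) * x_j).
    """
    betas = attention(x)
    contributions = [b * xi for b, xi in zip(betas, x)]
    return inverse_link(beta0 + sum(contributions)), contributions


def toy_attention(x):
    """Hypothetical coefficients: constant, interaction with x[0], and zero."""
    return [0.5, 0.1 * x[0], 0.0]


x = [1.0, 2.0, 3.0]
prediction, contributions = local_glm_predict(x, beta0=-0.7, attention=toy_attention)

print("prediction:", prediction)
print("additive contributions per feature:", contributions)
```

A coefficient that stays at zero for all inputs (the third feature here) means the feature drops out of the model, which is the intuition behind using such an architecture for variable selection.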

【11】 Stability of backward stochastic differential equations: the general case

Authors: Antonis Papapantoleon, Dylan Possamaï, Alexandros Saplaouras
Comments: 44 pages
Link: https://arxiv.org/abs/2107.11048
Abstract: In this paper, we obtain stability results for backward stochastic differential equations with jumps (BSDEs) in a very general framework. More specifically, we consider a convergent sequence of standard data, each associated with its own filtration, and we prove that the associated sequence of (unique) solutions is also convergent. The current result extends earlier contributions in the literature on the stability of BSDEs and unifies several frameworks for numerical approximations of BSDEs and their implementations.

【12】 Graph-Based Learning for Stock Movement Prediction with Textual and Relational Data

Authors: Qinkai Chen, Christian-Yann Robert
Affiliation: Ecole Polytechnique, Palaiseau, France; Exoduspoint Capital Management France, Paris, France; ENSAE Paris, Palaiseau, France
Comments: 10 pages, 3 figures, 5 tables
Link: https://arxiv.org/abs/2107.10941
Abstract: Predicting stock prices from textual information is a challenging task due to the uncertainty of the market and the difficulty of understanding natural language from a machine's perspective. Previous research focuses mostly on sentiment extraction from single news items. However, stocks on the financial market can be highly correlated, and one news item regarding one stock can quickly impact the prices of other stocks. To take this effect into account, we propose a new stock movement prediction framework: Multi-Graph Recurrent Network for Stock Forecasting (MGRN). This architecture makes it possible to combine the textual sentiment from financial news with multiple kinds of relational information extracted from other financial data. Through an accuracy test and a trading simulation on the stocks in the STOXX Europe 600 index, we demonstrate better performance from our model than from other benchmarks.

2. cs.SD (Sound):

【1】 Multi-Channel Automatic Music Transcription Using Tensor Algebra

Authors: Marmoret Axel, Bertin Nancy, Cohen Jeremy
Comments: 40 pages, 14 figures, 5 tables; code can be found at the URL given on the arXiv page
Link: https://arxiv.org/abs/2107.11250
Abstract: Music is an art, perceived in unique ways by every listener, coming from acoustic signals. At the same time, standards such as musical scores exist to describe it. Even if humans can perform this transcription, it is costly in terms of time and effort, even more so with the explosion of information following the rise of the Internet. For this reason, research is being directed towards Automatic Music Transcription. While this task is considered solved in the case of single notes, it is still open when notes overlap and form chords. This report aims at developing some of the existing techniques for music transcription, particularly matrix factorization, and at introducing the concept of multi-channel automatic music transcription. This concept will be explored with mathematical objects called tensors.

【2】 Multi-channel Speech Enhancement with 2-D Convolutional Time-frequency Domain Features and a Pre-trained Acoustic Model

Authors: Quandong Wang, Junnan Wu, Zhao Yan, Sichong Qian, Liyong Guo, Lichun Fan, Weiji Zhuang, Peng Gao, Yujun Wang
Affiliation: Xiaomi Corporation, Beijing, China
Comments: 7 pages, 3 figures, submitted to APSIPA 2021
Link: https://arxiv.org/abs/2107.11222
Abstract: We propose a multi-channel speech enhancement approach with a novel two-level feature fusion method and a pre-trained acoustic model in a multi-task learning paradigm. In the first fusion level, the time-domain and frequency-domain features are extracted separately. In the time domain, the multi-channel convolution sum (MCS) and the inter-channel convolution differences (ICDs) features are computed and then integrated with a 2-D convolutional layer, while in the frequency domain, the log-power spectra (LPS) features from both the original channels and super-directive beamforming outputs are combined with another 2-D convolutional layer. To fully integrate the rich information of multi-channel speech, i.e. time-frequency domain features and the array geometry, we apply a third 2-D convolutional layer in the second level of fusion to obtain the final convolutional features. Furthermore, we propose to use a fixed clean acoustic model trained with the end-to-end lattice-free maximum mutual information criterion to enforce that the enhanced output has the same distribution as the clean waveform, in order to alleviate the over-estimation problem of the enhancement task and constrain distortion. On the Task1 development dataset of the ConferencingSpeech 2021 challenge, PESQ improvements of 0.24 and 0.19 are attained compared to the official baseline and a recently proposed multi-channel separation method, respectively.

【3】 OLR 2021 Challenge: Datasets, Rules and Baselines

Authors: Binling Wang, Wenxuan Hu, Jing Li, Yiming Zhi, Zheng Li, Qingyang Hong, Lin Li, Dong Wang, Liming Song, Cheng Yang
Affiliation: School of Informatics, Xiamen University; School of Electronic Science and Engineering, Xiamen University; Center for Speech and Language Technologies, Tsinghua University; Beijing National Research Center for Information Science and Technology
Comments: arXiv admin note: text overlap with arXiv:2006.03473, arXiv:1907.07626, arXiv:1806.00616, arXiv:1706.09742
Link: https://arxiv.org/abs/2107.11113
Abstract: This paper introduces the sixth Oriental Language Recognition (OLR) 2021 Challenge, which intends to improve the performance of language recognition systems and speech recognition systems within multilingual scenarios. The data profile, four tasks, two baselines, and the evaluation principles are introduced in this paper. In addition to the Language Identification (LID) tasks, multilingual Automatic Speech Recognition (ASR) tasks are introduced to the OLR 2021 Challenge for the first time. The challenge this year focuses on more practical and challenging problems, with four tasks: (1) constrained LID, (2) unconstrained LID, (3) constrained multilingual ASR, (4) unconstrained multilingual ASR. Baselines are provided for the LID tasks and the multilingual ASR tasks, respectively. The LID baseline system is an extended TDNN x-vector model built with PyTorch. A transformer-based end-to-end model is provided as the multilingual ASR baseline system. These recipes will be published online and made available for participants to construct their own LID or ASR systems. The baseline results demonstrate that these tasks are rather challenging and deserve more effort to achieve better performance.

【4】 SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain

Authors: Pierre-Amaury Grumiaux, Srdan Kitic, Prerak Srivastava, Laurent Girin, Alexandre Guérin
Affiliation: Orange Labs, Cesson-Sévigné, France; Univ. de Lorraine, Inria, Nancy, France; Univ. Grenoble Alpes, GIPSA-lab, Grenoble-INP, CNRS, Grenoble, France
Comments: Accepted to Workshop on Applications of Signal Processing to Audio and Acoustics
Link: https://arxiv.org/abs/2107.11066
Abstract: In this work, we propose a novel self-attention based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network (CRNN), we investigate the benefit of replacing the recurrent layers with self-attention encoders inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data, with up to 3 simultaneous speakers. The obtained results indicate that the majority of the proposed architectures either perform on par with or outperform the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding the recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.

【5】 Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions

Authors: Andres Fernandez, Mark D. Plumbley
Affiliation: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Comments: Submitted for publication
Link: https://arxiv.org/abs/2107.10880
Abstract: The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximations and Projections (UMAPs) for log-STFT, log-mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our exploratory investigation, we look for two qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate diagnosis and the development of further representation and detection approaches. In particular, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available.

【6】 Semantic Communications for Speech Recognition

Authors: Zhenzi Weng, Zhijin Qin, Geoffrey Ye Li
Affiliation: Queen Mary University of London, London, UK; Imperial College London, London, UK
Link: https://arxiv.org/abs/2107.11190
Abstract: Traditional communication systems transmit all of the source data, represented as bits, regardless of the content of the source and the semantic information required by the receiver. However, in some applications the receiver only needs the part of the source data that represents critical semantic information, which motivates transmitting only the application-related information, especially when bandwidth resources are limited. In this paper, we consider a semantic communication system for speech recognition by designing the transceiver as an end-to-end (E2E) system. In particular, a deep learning (DL)-enabled semantic communication system, named DeepSC-SR, is developed to learn and extract text-related semantic features at the transmitter, which allows the system to transmit much less data than the source speech without performance degradation. Moreover, in order to make the proposed DeepSC-SR suitable for dynamic channel environments, we investigate a robust model that copes with various channel environments without requiring retraining. The simulation results demonstrate that our proposed DeepSC-SR outperforms traditional communication systems in terms of speech recognition metrics, such as character-error-rate and word-error-rate, and is more robust to channel variations, especially in the low signal-to-noise ratio (SNR) regime.
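The two metrics named in the abstract above, word-error-rate and character-error-rate, are both edit distances normalized by the reference length. The sketch below computes them with a standard Levenshtein dynamic program; the sentences are hypothetical, and text normalization (casing, punctuation), which real evaluations usually apply first, is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference transcript and recognizer output.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"

wer = word_error_rate(reference, hypothesis)
cer = char_error_rate(reference, hypothesis)
print("WER:", wer)
print("CER:", cer)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so they are error rates rather than probabilities.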

【7】 Put your money where your mouth is: Using AI voice analysis to detect whether spoken arguments reflect the speaker's true convictions

Authors: Fabian Thaler, Stefan Faußer, Heiko Gewald
Affiliation: Neu-Ulm University of Applied Sciences, Center for Research on Service Sciences (CROSS), Wileystr., Neu-Ulm, Germany
Comments: 26 pages, 2 figures, 4 tables
Link: https://arxiv.org/abs/2107.11175
Abstract: Customers' emotions play a vital role in the service industry. The better frontline personnel understand the customer, the better the service they can provide. As human emotions generate certain (unintentional) bodily reactions, such as increase in heart rate, sweating, dilation, blushing and paling, which are measurable, artificial intelligence (AI) technologies can interpret these signals. Great progress has been made in recent years in automatically detecting basic emotions like joy, anger, etc. Complex emotions, consisting of multiple interdependent basic emotions, are more difficult to identify. One complex emotion which is of great interest to the service industry is difficult to detect: whether a customer is telling the truth or just a story. This research presents an AI method for capturing and sensing emotional data. With an accuracy of around 98%, the best trained model was able to detect, using speech analysis, whether a participant in a debating challenge was arguing for or against her/his conviction. The data set was collected in an experimental setting with 40 participants. The findings are applicable to a wide range of service processes and specifically useful for all customer interactions that take place via telephone. The algorithm presented can be applied in any situation where it is helpful for the agent to know whether a customer is speaking to her/his conviction. This could, for example, lead to a reduction in doubtful insurance claims or in untruthful statements in job interviews. This would not only reduce operational losses for service companies, but also encourage customers to be more truthful.

【8】 Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization

Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng
Affiliation: The Chinese University of Hong Kong, Hong Kong SAR, China; Huawei Noah's Ark Lab
Comments: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.10127
Abstract: Dysarthric speech detection (DSD) systems aim to detect characteristics of the neuromotor disorder from speech. Such systems are particularly susceptible to domain mismatch, where the training and testing data come from the source and target domains respectively, but the two domains may differ in terms of speech stimuli, disease etiology, etc. It is hard to acquire labelled data in the target domain, due to the high cost of annotating sizeable datasets. This paper makes a first attempt to formulate cross-domain DSD as an unsupervised domain adaptation (UDA) problem. We use labelled source-domain data and unlabelled target-domain data, and propose a multi-task learning strategy, including dysarthria presence classification (DPC), domain adversarial training (DAT) and mutual information minimization (MIM), which aims to learn dysarthria-discriminative and domain-invariant biomarker embeddings. Specifically, DPC helps biomarker embeddings capture critical indicators of dysarthria; DAT forces biomarker embeddings to be indistinguishable between the source and target domains; and MIM further reduces the correlation between biomarker embeddings and domain-related cues. Treating the UASPEECH and TORGO corpora as the source and target domains respectively, experiments show that the incorporation of UDA attains absolute increases of 22.2% and 20.0% in utterance-level weighted average recall and speaker-level accuracy, respectively.

【9】 Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion

Authors: Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng
Affiliations: Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong SAR, China; SpeechX Limited, Shenzhen, China; Department of Engineering, University of Cambridge, UK
Comments: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2011.01678
Abstract: Though significant progress has been made for the voice conversion (VC) of typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting for atypical prosody while maintaining speaker identity. To address this issue, we propose a VC system with explicit prosodic modelling and deep speaker embedding (DSE) learning. First, a speech-encoder strives to extract robust phoneme embeddings from atypical speech. Second, a prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values. Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE that is learned via speaker encoder or speaker adaptation. Extensive experiments demonstrate that speaker adaptation can achieve higher speaker similarity, and the speaker encoder based conversion model can greatly reduce dysarthric and non-native pronunciation patterns with improved speech intelligibility. A comparison of speech recognition results between the original dysarthric speech and converted speech shows that absolute reductions of 47.6% in character error rate (CER) and 29.3% in word error rate (WER) can be achieved.
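
Several entries in this digest report character error rate (CER) and word error rate (WER). As a reminder of what those metrics measure, here is a minimal sketch that computes both from a Levenshtein edit distance. The function names and the example sentences are hypothetical, and the papers listed here may use different text normalization or scoring tools.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the number of reference characters.

    Spaces are dropped here; that is one possible convention, chosen for this sketch.
    """
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# Hypothetical transcripts, for illustration only.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(word_error_rate(reference, hypothesis))
print(character_error_rate(reference, hypothesis))
```

An "absolute reduction" as quoted in the abstracts is the difference between two such rates in percentage points, not a ratio between them.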

3. eess.AS Audio Processing:

【1】 Semantic Communications for Speech Recognition

Authors: Zhenzi Weng, Zhijin Qin, Geoffrey Ye Li
Affiliations: Queen Mary University of London, London, UK; Imperial College London, London, UK
Link: https://arxiv.org/abs/2107.11190
Abstract: Traditional communication systems transmit all the source data represented by bits, regardless of the content of the source and the semantic information required by the receiver. However, in some applications, the receiver only needs part of the source data that represents critical semantic information, which prompts transmitting the application-related information, especially when bandwidth resources are limited. In this paper, we consider a semantic communication system for speech recognition by designing the transceiver as an end-to-end (E2E) system. Particularly, a deep learning (DL)-enabled semantic communication system, named DeepSC-SR, is developed to learn and extract text-related semantic features at the transmitter, which motivates the system to transmit much less than the source speech data without performance degradation. Moreover, in order to facilitate the proposed DeepSC-SR for dynamic channel environments, we investigate a robust model to cope with various channel environments without requiring retraining. The simulation results demonstrate that our proposed DeepSC-SR outperforms the traditional communication systems in terms of the speech recognition metrics, such as character-error-rate and word-error-rate, and is more robust to channel variations, especially in the low signal-to-noise ratio (SNR) regime.

【2】 Put your money where your mouth is: Using AI voice analysis to detect whether spoken arguments reflect the speaker's true convictions

Authors: Fabian Thaler, Stefan Faußer, Heiko Gewald
Affiliations: Neu-Ulm University of Applied Sciences, Center for Research on Service Sciences (CROSS), Neu-Ulm, Germany
Comments: 26 pages, 2 figures, 4 tables
Link: https://arxiv.org/abs/2107.11175
Abstract: Customers' emotions play a vital role in the service industry. The better frontline personnel understand the customer, the better the service they can provide. As human emotions generate certain (unintentional) bodily reactions, such as increase in heart rate, sweating, dilation, blushing and paling, which are measurable, artificial intelligence (AI) technologies can interpret these signals. Great progress has been made in recent years to automatically detect basic emotions like joy, anger etc. Complex emotions, consisting of multiple interdependent basic emotions, are more difficult to identify. One complex emotion which is of great interest to the service industry is difficult to detect: whether a customer is telling the truth or just a story. This research presents an AI method for capturing and sensing emotional data. With an accuracy of around 98%, the best trained model was able to detect whether a participant of a debating challenge was arguing for or against her/his conviction, using speech analysis. The data set was collected in an experimental setting with 40 participants. The findings are applicable to a wide range of service processes and specifically useful for all customer interactions that take place via telephone. The algorithm presented can be applied in any situation where it is helpful for the agent to know whether a customer is speaking to her/his conviction. This could, for example, lead to a reduction in doubtful insurance claims, or untruthful statements in job interviews. This would not only reduce operational losses for service companies, but also encourage customers to be more truthful.

【3】 Multi-Channel Automatic Music Transcription Using Tensor Algebra

Authors: Marmoret Axel, Bertin Nancy, Cohen Jeremy
Comments: 40 pages, 14 figures, 5 tables, code can be found at: this https URL
Link: https://arxiv.org/abs/2107.11250
Abstract: Music is an art, perceived in unique ways by every listener, coming from acoustic signals. In the meantime, standards such as musical scores exist to describe it. Even if humans can make this transcription, it is costly in terms of time and effort, even more so with the explosion of information following the rise of the Internet. In that sense, research is being driven in the direction of Automatic Music Transcription. While this task is considered solved in the case of single notes, it is still open when notes overlap, forming chords. This report aims at developing some of the existing techniques towards Music Transcription, particularly matrix factorization, and introducing the concept of multi-channel automatic music transcription. This concept will be explored with mathematical objects called tensors.
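
The abstract above names matrix factorization as one of the transcription techniques it builds on. As a rough illustration only, the sketch below runs nonnegative matrix factorization (NMF) with Lee-Seung multiplicative updates on a toy magnitude spectrogram. The choice of NMF with a squared-error objective, the toy data, and all function names are assumptions made for this example; none of it is taken from the report or its code.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (freq x time) as W (freq x rank) times H (rank x time)."""
    rng = random.Random(seed)
    n_freq, n_time = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + 0.1 for _ in range(n_time)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][t] * num[k][t] / (den[k][t] + eps) for t in range(n_time)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[f][k] * num[f][k] / (den[f][k] + eps) for k in range(rank)]
             for f in range(n_freq)]
    return w, h


# Toy data: two hypothetical note templates over 4 frequency bins,
# active in different frames and together (a "chord") in the last frame.
templates = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.5]]
activations = [[1.0, 1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0, 1.0]]
spectrogram = matmul(templates, activations)

w, h = nmf(spectrogram, rank=2)
print("reconstruction error:", frobenius_error(spectrogram, w, h))
```

In this reading, the columns of W play the role of per-note spectral templates and the rows of H their activations over time; a multi-channel extension would replace the matrix V with a tensor.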

【4】 Multi-channel Speech Enhancement with 2-D Convolutional Time-frequency Domain Features and a Pre-trained Acoustic Model

Authors: Quandong Wang, Junnan Wu, Zhao Yan, Sichong Qian, Liyong Guo, Lichun Fan, Weiji Zhuang, Peng Gao, Yujun Wang
Affiliations: Xiaomi Corporation, Beijing, China
Comments: 7 pages, 3 figures, submitted to APSIPA 2021
Link: https://arxiv.org/abs/2107.11222
Abstract: We propose a multi-channel speech enhancement approach with a novel two-level feature fusion method and a pre-trained acoustic model in a multi-task learning paradigm. In the first fusion level, the time-domain and frequency-domain features are extracted separately. In the time domain, the multi-channel convolution sum (MCS) and the inter-channel convolution differences (ICDs) features are computed and then integrated with a 2-D convolutional layer, while in the frequency domain, the log-power spectra (LPS) features from both original channels and super-directive beamforming outputs are combined with another 2-D convolutional layer. To fully integrate the rich information of multi-channel speech, i.e. time-frequency domain features and the array geometry, we apply a third 2-D convolutional layer in the second level of fusion to obtain the final convolutional features. Furthermore, we propose to use a fixed clean acoustic model trained with the end-to-end lattice-free maximum mutual information criterion to enforce the enhanced output to have the same distribution as the clean waveform to alleviate the over-estimation problem of the enhancement task and constrain distortion. On the Task1 development dataset of the ConferencingSpeech 2021 challenge, PESQ improvements of 0.24 and 0.19 are attained compared to the official baseline and a recently proposed multi-channel separation method, respectively.

【5】 OLR 2021 Challenge: Datasets, Rules and Baselines

Authors: Binling Wang, Wenxuan Hu, Jing Li, Yiming Zhi, Zheng Li, Qingyang Hong, Lin Li, Dong Wang, Liming Song, Cheng Yang
Affiliations: School of Informatics, Xiamen University; School of Electronic Science and Engineering, Xiamen University; Center for Speech and Language Technologies, Tsinghua University; Beijing National Research Center for Information Science and Technology
Comments: arXiv admin note: text overlap with arXiv:2006.03473, arXiv:1907.07626, arXiv:1806.00616, arXiv:1706.09742
Link: https://arxiv.org/abs/2107.11113
Abstract: This paper introduces the sixth Oriental Language Recognition (OLR) 2021 Challenge, which intends to improve the performance of language recognition systems and speech recognition systems within multilingual scenarios. The data profile, four tasks, two baselines, and the evaluation principles are introduced in this paper. In addition to the Language Identification (LID) tasks, multilingual Automatic Speech Recognition (ASR) tasks are introduced to OLR 2021 Challenge for the first time. The challenge this year focuses on more practical and challenging problems, with four tasks: (1) constrained LID, (2) unconstrained LID, (3) constrained multilingual ASR, (4) unconstrained multilingual ASR. Baselines for LID tasks and multilingual ASR tasks are provided, respectively. The LID baseline system is an extended TDNN x-vector model constructed with PyTorch. A transformer-based end-to-end model is provided as the multilingual ASR baseline system. These recipes will be published online and made available for participants to construct their own LID or ASR systems. The baseline results demonstrate that those tasks are rather challenging and deserve more effort to achieve better performance.

【6】 SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain

Authors: Pierre-Amaury Grumiaux, Srdan Kitic, Prerak Srivastava, Laurent Girin, Alexandre Guérin
Affiliations: Orange Labs, Cesson-Sévigné, France; Univ. de Lorraine, Inria, Nancy, France; Univ. Grenoble Alpes, GIPSA-lab, Grenoble-INP, CNRS, Grenoble, France
Comments: Accepted to Workshop on Applications of Signal Processing to Audio and Acoustics
Link: https://arxiv.org/abs/2107.11066
Abstract: In this work, we propose a novel self-attention based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network, we investigate the benefit of replacing the recurrent layers with self-attention encoders, inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data, with up to 3 simultaneous speakers. The obtained results indicate that the majority of the proposed architectures either perform on par with, or outperform, the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding the recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.

【7】 Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions

Authors: Andres Fernandez, Mark D. Plumbley
Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Comments: Submitted for publication
Link: https://arxiv.org/abs/2107.10880
Abstract: The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we visually inspect Uniform Manifold Approximations and Projections (UMAPs) for log-STFT, log-mel and pretrained Look, Listen and Learn (L3) representations of the DCASE UAD-S dataset. In our exploratory investigation, we look for two qualities, Separability (SEP) and Discriminative Support (DSUP), and formulate several hypotheses that could facilitate diagnosis and development of further representation and detection approaches. Particularly, we hypothesize that input length and pretraining may regulate a relevant tradeoff between SEP and DSUP. Our code as well as the resulting UMAPs and plots are publicly available.

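This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2021-07-26. In case of infringement, contact cloudcommunity@tencent.com for removal.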
