Finance / Speech / Audio Processing Academic Digest [11.9]

WeChat official account: arXiv Daily Academic Digest

Published 2021-11-17 10:53:58

This article is included in the column: arXiv Daily Academic Digest

q-fin (Quantitative Finance): 13 papers

cs.SD (Sound/Speech): 16 papers

eess.AS (Audio and Speech Processing): 17 papers

1. q-fin (Quantitative Finance):

【1】 Stock Portfolio Optimization Using a Deep Learning LSTM Model
Link: https://arxiv.org/abs/2111.04709

Authors: Jaydip Sen, Abhishek Dutta, Sidra Mehtab
Affiliations: Department of Data Science, Praxis Business School, Kolkata, India
Comments: This is the accepted version of our paper in the international conference, IEEE Mysurucon'21, which was organized in Hassan, Karnataka, India from October 24, 2021 to October 25, 2021. The paper is 9 pages long, and it contains 19 figures and 19 tables. This is the preprint of the conference paper
Abstract: Predicting future stock prices and their movement patterns is a complex problem. Hence, building a portfolio of capital assets using the predicted prices to achieve the optimization between its return and risk is an even more difficult task. This work has carried out an analysis of the time series of the historical prices of the top five stocks from the nine different sectors of the Indian stock market from January 1, 2016, to December 31, 2020. Optimum portfolios are built for each of these sectors. For predicting future stock prices, a long-and-short-term memory (LSTM) model is also designed and fine-tuned. After five months of the portfolio construction, the actual and the predicted returns and risks of each portfolio are computed. The predicted and the actual returns of each portfolio are found to be high, indicating the high precision of the LSTM model.
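
To make "the returns and risks of each portfolio" concrete, here is a minimal sketch that computes the mean return and volatility of a weighted portfolio from a small table of per-period asset returns. The function name, the sample numbers and the equal weights are assumptions made for illustration; the abstract does not specify how the paper computes these quantities, so its procedure may differ.

```python
import math

def portfolio_return_and_risk(returns, weights):
    """Mean return and volatility of a weighted portfolio.

    returns: list of rows, one row per period, one return per asset (hypothetical data).
    weights: one weight per asset, assumed to sum to 1.
    """
    n_assets = len(weights)
    n_periods = len(returns)
    # Average return of each asset over the sample
    means = [sum(row[j] for row in returns) / n_periods for j in range(n_assets)]
    # Sample covariance matrix of asset returns
    cov = [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in returns)
            / (n_periods - 1)
            for j in range(n_assets)
        ]
        for i in range(n_assets)
    ]
    port_mean = sum(w * m for w, m in zip(weights, means))
    port_var = sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(n_assets)
        for j in range(n_assets)
    )
    # Guard against tiny negative values caused by floating-point rounding
    return port_mean, math.sqrt(max(port_var, 0.0))

# Two perfectly offsetting assets: equal weights remove the risk entirely
sample = [[0.01, 0.03], [0.03, 0.01]]
mean_eq, risk_eq = portfolio_return_and_risk(sample, [0.5, 0.5])
mean_one, risk_one = portfolio_return_and_risk(sample, [1.0, 0.0])
```

In this toy sample both portfolios have the same mean return, but only the single-asset portfolio carries volatility, which is the return-versus-risk trade-off that portfolio optimization works with.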

【2】 Procurements with Bidder Asymmetry in Cost and Risk-Aversion
Link: https://arxiv.org/abs/2111.04626

Authors: Gaurab Aryal, Hanna Charankevich, Seungwon Jeong, Dong-Hyuk Kim
Abstract: We propose an empirical method to analyze data from first-price procurements where bidders are asymmetric in their risk-aversion (CRRA) coefficients and distributions of private costs. Our Bayesian approach evaluates the likelihood by solving type-symmetric equilibria using the boundary-value method and integrates out unobserved heterogeneity through data augmentation. We study a new dataset from Russian government procurements focusing on the category of printing papers. We find that there is no unobserved heterogeneity (presumably because the job is routine), but bidders are highly asymmetric in their cost and risk-aversion. Our counterfactual study shows that choosing a type-specific cost-minimizing reserve price marginally reduces the procurement cost; however, inviting one more bidder substantially reduces the cost, by at least 5.5%. Furthermore, incorrectly imposing risk-neutrality would severely mislead inference and policy recommendations, but the bias from imposing homogeneity in risk-aversion is small.

【3】 Revisiting the Properties of Money
Link: https://arxiv.org/abs/2111.04483

Authors: Isaiah Hull, Or Sattath
Affiliations: Research Division, Sveriges Riksbank, Stockholm, Sweden; Department of Computer Science, Ben-Gurion University, Beersheba, Israel
Comments: 27 pages, 1 figure
Abstract: The properties of money commonly referenced in the economics literature were originally identified by Jevons (1876) and Menger (1892) in the late 1800s and were intended to describe physical currencies, such as commodity money, metallic coins, and paper bills. In the digital era, many non-physical currencies have either entered circulation or are under development, including demand deposits, cryptocurrencies, stablecoins, central bank digital currencies (CBDCs), in-game currencies, and quantum money. These forms of money have novel properties that have not been studied extensively within the economics literature, but may be important determinants of the monetary equilibrium that emerges in the forthcoming era of heightened currency competition. This paper makes the first exhaustive attempt to identify and define the properties of all physical and digital forms of money. It reviews both the economics and computer science literatures and categorizes properties within an expanded version of the original functions-and-properties framework of money that includes societal and regulatory objectives.

【4】 Portfolio analysis with mean-CVaR and mean-CVaR-skewness criteria based on mean-variance mixture models
Link: https://arxiv.org/abs/2111.04311

Authors: Nuerxiati Abudurexiti, Kai He, Dongdong Hu, Svetlozar T. Rachev, Hasanjan Sayit, Ruoyu Sun
Affiliations: Department of Financial and Actuarial Mathematics, Xi'an Jiaotong Liverpool University, Suzhou, China; Department of Mathematics and Statistics, Texas Tech University, Lubbock, TX, USA
Comments: 25 pages, 1 figure, 2 tables
Abstract: The paper Zhao et al. (2015) shows that mean-CVaR-skewness portfolio optimization problems based on asymmetric Laplace (AL) distributions can be transformed into quadratic optimization problems under which closed form solutions can be found. In this note, we show that such result also holds for mean-risk-skewness portfolio optimization problems when the underlying distribution is a larger class of normal mean-variance mixture (NMVM) models than the class of AL distributions. We then study the value at risk (VaR) and conditional value at risk (CVaR) risk measures on portfolios of returns with NMVM distributions. They have closed form expressions for portfolios of normal and more generally elliptically distributed returns as discussed in Rockafellar & Uryasev (2000) and in Landsman & Valdez (2003). When the returns have general NMVM distributions, these risk measures do not give closed form expressions. In this note, we give approximate closed form expressions for VaR and CVaR of portfolios of returns with NMVM distributions. Numerical tests show that our closed form formulas give accurate values for VaR and CVaR and shorten the computational time for portfolio optimization problems associated with VaR and CVaR considerably.
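
For reference, the normal-distribution case mentioned in the abstract does admit closed forms, and they are short enough to sketch. The code below assumes a loss that is normally distributed with mean `mu` and standard deviation `sigma`, and uses the standard formulas VaR = mu + sigma * z and CVaR = mu + sigma * pdf(z) / (1 - alpha), where z is the alpha-quantile of the standard normal. This covers only the normal case; it is not the paper's approximation for NMVM distributions, and the function name and defaults are illustrative.

```python
from statistics import NormalDist

def normal_var_cvar(mu, sigma, alpha=0.95):
    """Closed-form VaR and CVaR of a normally distributed loss at level alpha."""
    std = NormalDist()                # standard normal N(0, 1)
    z = std.inv_cdf(alpha)            # alpha-quantile of the standard normal
    var = mu + sigma * z              # value at risk
    cvar = mu + sigma * std.pdf(z) / (1.0 - alpha)  # expected loss beyond VaR
    return var, cvar

var_95, cvar_95 = normal_var_cvar(mu=0.0, sigma=1.0, alpha=0.95)
```

CVaR is never below VaR at the same level, because it averages the losses in the tail beyond the VaR threshold.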

【5】 The Wrong Kind of Information
Link: https://arxiv.org/abs/2111.04172

Authors: Aditya Kuvalekar, João Ramos, Johannes Schneider
Abstract: An agent decides whether to approve a project based on his information, some of which is verified by a court. An unbiased agent wants to implement projects that are likely to succeed; a biased agent wants to implement any project. If the project fails, the court examines the verifiable information and decides the punishment. The court seeks to deter ill-intentioned agents from implementing projects likely to fail while incentivizing the use of the unverifiable information. We show how information of different kinds affects welfare. Improving the verifiable information can reduce welfare, whereas improving the unverifiable information always increases welfare.

【6】 On the Limits of Design: What Are the Conceptual Constraints on Designing Artificial Intelligence for Social Good?
Link: https://arxiv.org/abs/2111.04165

Authors: Jakob Mökander
Affiliations: Oxford Internet Institute, University of Oxford
Abstract: Artificial intelligence (AI) can bring substantial benefits to society by helping to reduce costs, increase efficiency and enable new solutions to complex problems. Using Floridi's notion of how to design the 'infosphere' as a starting point, in this chapter I consider the question: what are the limits of design, i.e. what are the conceptual constraints on designing AI for social good? The main argument of this chapter is that while design is a useful conceptual tool to shape technologies and societies, collective efforts towards designing future societies are constrained by both internal and external factors. Internal constraints on design are discussed by evoking Hardin's thought experiment regarding 'the Tragedy of the Commons'. Further, Hayek's classical distinction between 'cosmos' and 'taxis' is used to demarcate external constraints on design. Finally, five design principles are presented which are aimed at helping policymakers manage the internal and external constraints on design. A successful approach to designing future societies needs to account for the emergent properties of complex systems by allowing space for serendipity and socio-technological coevolution.

【7】 Equity-Linked Life Insurances on Maximum of Several Assets
Link: https://arxiv.org/abs/2111.04038

Authors: Battulga Gankhuu
Affiliations: Department of Applied Mathematics, National University of Mongolia
Comments: 17 pages. arXiv admin note: substantial text overlap with arXiv:2109.05998
Abstract: This paper presents pricing and hedging methods for segregated funds and unit-linked life insurance products that are based on a Bayesian Markov-Switching Vector Autoregressive (MS-VAR) process. Here we assumed that a regime-switching process is generated by a homogeneous Markov process. An advantage of our model is it depends on economic variables and is not complicated.

【8】 Explainable Deep Reinforcement Learning for Portfolio Management: An Empirical Approach
Link: https://arxiv.org/abs/2111.03995

Authors: Mao Guan, Xiao-Yang Liu
Affiliations: Computer Science, Columbia University, New York City, New York; Electrical Engineering, Columbia University
Abstract: Deep reinforcement learning (DRL) has been widely studied in the portfolio management task. However, it is challenging to understand a DRL-based trading strategy because of the black-box nature of deep neural networks. In this paper, we propose an empirical approach to explain the strategies of DRL agents for the portfolio management task. First, we use a linear model in hindsight as the reference model, which finds the best portfolio weights by assuming knowing actual stock returns in foresight. In particular, we use the coefficients of a linear model in hindsight as the reference feature weights. Secondly, for DRL agents, we use integrated gradients to define the feature weights, which are the coefficients between reward and features under a linear regression model. Thirdly, we study the prediction power in two cases, single-step prediction and multi-step prediction. In particular, we quantify the prediction power by calculating the linear correlations between the feature weights of a DRL agent and the reference feature weights, and similarly for machine learning methods. Finally, we evaluate a portfolio management task on Dow Jones 30 constituent stocks during 01/01/2009 to 09/01/2021. Our approach empirically reveals that a DRL agent exhibits a stronger multi-step prediction power than machine learning methods.
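
The integrated-gradients attribution named in the abstract can be illustrated on a toy function. The sketch below approximates the path integral from a baseline to the input with a midpoint rule and estimates gradients by central finite differences; the toy function `toy_model`, the zero baseline and the step counts are assumptions for illustration, and a real DRL agent would use automatic differentiation on its network instead.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Approximate integrated gradients of scalar function f at x.

    Gradients are estimated numerically, so f only needs to accept a list of floats.
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule along the straight path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = point[:]
            up[i] += eps
            down = point[:]
            down[i] -= eps
            # Central finite difference for the i-th partial derivative
            avg_grad[i] += (f(up) - f(down)) / (2 * eps) / steps
    # Scale the averaged gradient by the distance from the baseline
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

def toy_model(v):
    # Hypothetical stand-in for a reward model: quadratic in v[0], linear in v[1]
    return v[0] ** 2 + 3.0 * v[1]

attributions = integrated_gradients(toy_model, [2.0, 1.0], [0.0, 0.0])
```

A useful sanity check is completeness: the attributions should sum to the difference between the function value at the input and at the baseline, which here is 7.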

【9】 Projection of Functionals and Fast Pricing of Exotic Options
Link: https://arxiv.org/abs/2111.03713

Authors: Valentin Tissot-Daguette
Affiliations: Princeton University
Comments: 15 pages, 8 figures
Abstract: This note investigates the projection of functionals in the space of càdlàg paths. In particular, we advocate the Karhunen-Loève (KL) expansion to extract information directly from the image of a functional. While gathering results from approximation theory, we also draw a new parallel between Hilbert projections and the reconstruction of a path from its signature. In the numerical examples, we illustrate how the KL expansion allows fast computation of the price surface of path-dependent options.

【10】 Predicting Mortality from Credit Reports
Link: https://arxiv.org/abs/2111.03662

Authors: Giacomo De Giorgi, Matthew Harding, Gabriel Vasconcelos
Affiliations: Geneva School of Economics and Management, University of Geneva; Department of Economics, University of California, Irvine
Abstract: Data on hundreds of variables related to individual consumer finance behavior (such as credit card and loan activity) is routinely collected in many countries and plays an important role in lending decisions. We postulate that the detailed nature of this data may be used to predict outcomes in seemingly unrelated domains such as individual health. We build a series of machine learning models to demonstrate that credit report data can be used to predict individual mortality. Variable groups related to credit cards and various loans, mostly unsecured loans, are shown to carry significant predictive power. Lags of these variables are also significant, thus indicating that dynamics also matters. Improved mortality predictions based on consumer finance data can have important economic implications in insurance markets but may also raise privacy concerns.

【11】 A McKean-Vlasov game of commodity production, consumption and trading
Link: https://arxiv.org/abs/2111.04391

Authors: René Aïd, Ofelia Bonesini, Giorgia Callegaro, Luciano Campi
Comments: 29 pages, 2 figures
Abstract: We propose a model where a producer and a consumer can affect the price dynamics of some commodity controlling drift and volatility of, respectively, the production rate and the consumption rate. We assume that the producer has a short position in a forward contract on λ units of the underlying at a fixed price F, while the consumer has the corresponding long position. Moreover, both players are risk-averse with respect to their financial position and their risk aversions are modelled through an integrated-variance penalization. We study the impact of risk aversion on the interaction between the producer and the consumer as well as on the derivative price. In mathematical terms, we are dealing with a two-player linear-quadratic McKean-Vlasov stochastic differential game. Using methods based on the martingale optimality principle and BSDEs, we find a Nash equilibrium and characterize the corresponding strategies and payoffs in semi-explicit form. Furthermore, we compute the two indifference prices (one for the producer and one for the consumer) induced by that equilibrium and we determine the quantity λ such that the players agree on the price. Finally, we illustrate our results with some numerics. In particular, we focus on how the risk aversions and the volatility control costs of the players affect the derivative price.

【12】 Optimal Dividends under Markov-Modulated Bankruptcy Level
Link: https://arxiv.org/abs/2111.03724

Authors: Giorgio Ferrari, Patrick Schuhmann, Shihao Zhu
Comments: 34 pages, 16 figures
Abstract: This paper proposes and solves an optimal dividend problem in which a two-state regime-switching environment affects the dynamics of the company's cash surplus and, as a novel feature, also the bankruptcy level. The aim is to maximize the total expected profits from dividends until bankruptcy. The company's optimal dividend payout is therefore influenced by four factors simultaneously: Brownian fluctuations in the cash surplus, as well as regime changes in drift, volatility and bankruptcy levels. In particular, the average profitability can assume different signs in the two regimes. We find a rich structure of the optimal strategy, which, depending on the interaction of the model's parameters, is either of barrier-type or of liquidation-barrier type. Furthermore, we provide explicit expressions of the optimal policies and value functions. Finally, we complement our theoretical results by a detailed numerical study, where also a thorough analysis of the sensitivities of the optimal dividend policy with respect to the problem's parameters is performed.

【13】 Exposure of occupations to technologies of the fourth industrial revolution
Link: https://arxiv.org/abs/2110.13317

Authors: Benjamin Meindl, Morgan R. Frank, Joana Mendonça
Affiliations: Center for Innovation, Technology and Policy Research (IN+), Instituto Superior Técnico, Universidade de Lisboa; Department of Informatics and Networked Systems, University of Pittsburgh; Media Laboratory, Massachusetts Institute of Technology
Comments: 65 pages, 18 figures
Abstract: The fourth industrial revolution (4IR) is likely to have a substantial impact on the economy. Companies need to build up capabilities to implement new technologies, and automation may make some occupations obsolete. However, where, when, and how the change will happen remain to be determined. Robust empirical indicators of technological progress linked to occupations can help to illuminate this change. With this aim, we provide such an indicator based on patent data. Using natural language processing, we calculate patent exposure scores for more than 900 occupations, which represent the technological progress related to them. To provide a lens on the impact of the 4IR, we differentiate between traditional and 4IR patent exposure. Our method differs from previous approaches in that it both accounts for the diversity of task-level patent exposures within an occupation and reflects work activities more accurately. We find that exposure to 4IR patents differs from traditional patent exposure. Manual tasks, and accordingly occupations such as construction and production, are exposed mainly to traditional (non-4IR) patents but have low exposure to 4IR patents. The analysis suggests that 4IR technologies may have a negative impact on job growth; this impact appears 10 to 20 years after patent filing. Further, we compared the 4IR exposure to other automation and AI exposure scores. Whereas many measures refer to theoretical automation potential, our patent-based indicator reflects actual technology diffusion. Our work not only allows analyses of the impact of 4IR technologies as a whole, but also provides exposure scores for more than 300 technology fields, such as AI and smart office technologies. Finally, the work provides a general mapping of patents to tasks and occupations, which enables future researchers to construct individual exposure measures.

2. cs.SD (Sound/Speech):

【1】 SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points
Link: https://arxiv.org/abs/2111.04436

Authors: Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo
Abstract: Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded ungratified performance on regression tasks because the nature between these and classification tasks differs. In this paper, a novel sign-exponent-only floating-point network (SEOFP-NET) technique is proposed to compress the model size and accelerate the inference time for speech enhancement, a regression task of speech signal processing. The proposed method compressed the sizes of deep neural network (DNN)-based speech enhancement models by quantizing the fraction bits of single-precision floating-point parameters during training. Before inference implementation, all parameters in the trained SEOFP-NET model are slightly adjusted to accelerate the inference time by replacing the floating-point multiplier with an integer-adder. For generalization, the SEOFP-NET technique is introduced to different speech enhancement tasks in speech signal processing with different model architectures under various corpora. The experimental results indicate that the size of SEOFP-NET models can be significantly compressed by up to 81.249% without noticeably downgrading their speech enhancement performance, and the inference time can be accelerated to 1.212x compared with the baseline models. The results also verify that the proposed SEOFP-NET can cooperate with other efficiency strategies to achieve a synergy effect for model compression. In addition, the just noticeable difference (JND) was applied to the user study experiment to statistically analyze the effect of speech enhancement on listening. The results indicate that the listeners cannot facilely differentiate between the enhanced speech signals processed by the baseline model and the proposed SEOFP-NET.
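
The general idea of keeping only the sign and exponent of a single-precision float, and then multiplying by integer addition, can be sketched in a few lines. This is one plausible reading of the abstract, not the paper's actual scheme: the code below truncates all 23 fraction bits of an IEEE 754 float32 weight, so the weight becomes a signed power of two, and then multiplies an activation by that weight by adding exponent fields. The helper names are hypothetical, and the integer-add shortcut is only valid for normal, nonzero values whose product neither overflows nor underflows.

```python
import struct

SIGN_MASK = 0x80000000   # 1 sign bit
EXP_MASK = 0x7F800000    # 8 exponent bits
BIAS = 127 << 23         # float32 exponent bias, shifted into the exponent field

def float_to_bits(x):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def quantize_sign_exponent(w):
    """Zero the fraction bits: the result is +/- 2**k (truncation, not rounding)."""
    return bits_to_float(float_to_bits(w) & (SIGN_MASK | EXP_MASK))

def multiply_by_quantized(x, wq):
    """Compute x * wq with an integer add on the bit patterns.

    Assumes wq is a signed power of two and both values are normal and nonzero.
    """
    xb, wb = float_to_bits(x), float_to_bits(wq)
    sign = (xb ^ wb) & SIGN_MASK
    # Adding exponent fields multiplies by the power of two; subtract one bias
    magnitude = (xb & 0x7FFFFFFF) + (wb & EXP_MASK) - BIAS
    return bits_to_float(sign | magnitude)

wq = quantize_sign_exponent(-0.3)           # -0.3 truncates to -0.25
product = multiply_by_quantized(3.0, wq)    # same value as 3.0 * wq
```

Storing only the sign and exponent fields would need 9 bits per weight instead of 32; the paper's reported compression ratio is higher than that alone would give, so its encoding presumably differs in detail from this sketch.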

【2】 Characterizing the adversarial vulnerability of speech self-supervised learning
Link: https://arxiv.org/abs/2111.04330

Authors: Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-yi Lee, Helen Meng
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; Human-Computer Communications Laboratory, The Chinese University of Hong Kong
Abstract: A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.

【3】 Retrieving Speaker Information from Personalized Acoustic Models for Speech Recognition
Link: https://arxiv.org/abs/2111.04194

Authors: Salima Mdhaffar, Jean-François Bonastre, Marc Tommasi, Natalia Tomashenko, Yannick Estève
Affiliations: LIA, Avignon Université, France; Université de Lille, CNRS, Inria, Centrale Lille, UMR , - CRIStAL, Lille, France
Abstract: The widespread use of powerful personal devices capable of collecting the voice of their users has opened the opportunity to build speaker adapted speech recognition systems (ASR) or to participate in collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. fine-tuned AM with specific speaker data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve the gender of the speaker, but also his identity, by just exploiting the weight matrix changes of a neural acoustic model locally adapted to this speaker. Incidentally we observe phenomena that may be useful towards explainability of deep neural networks in the context of speech processing. Gender can be identified almost surely using only the first layers and speaker verification performs well when using middle-up layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection, and an Equal Error Rate of 9.07% for a speaker verification task by only exploiting the weights from personalized models that could be exchanged instead of user data.
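
The Equal Error Rate quoted above is the operating point of a verification system at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from two lists of trial scores follows; the scores are made up, the function name is an assumption, and a simple threshold sweep stands in for the interpolation that evaluation toolkits typically use.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping the decision threshold over observed scores.

    genuine: scores of same-speaker trials (higher means more similar).
    impostor: scores of different-speaker trials.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # Impostor trials accepted at this threshold
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # Genuine trials rejected at this threshold
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Hypothetical scores: one genuine trial scores below one impostor trial
eer_overlap = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
# Perfectly separated scores
eer_clean = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

With the overlapping scores, a threshold of 0.6 accepts one of four impostors and rejects one of four genuine trials, so both error rates meet at 25%.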

【4】 Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer
Link: https://arxiv.org/abs/2111.04093

Authors: Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
Abstract: Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

【5】 Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech 标题：Meta-TTS：面向Few-Shot说话人自适应文语转换的元学习链接：https://arxiv.org/abs/2111.04040

作者：Sung-Feng Huang,Chyi-Jiunn Lin,Hung-yi Lee 机构： Lee are with the Department of Electrical En-gineering 备注：under review 摘要：个性化语音合成系统是一个非常理想的应用，该系统可以用用户的声音生成语音，并记录很少的录音。在最近的工作中，有两种主要的方法来构建这样一个系统：说话人自适应和说话人编码。一方面，说话人自适应方法在样本数较少的情况下对训练好的多说话人文本到语音（TTS）模型进行微调。然而，它们至少需要数千个微调步骤才能实现高质量的自适应，这使得它很难应用到设备上。另一方面，说话人编码方法将注册话语编码为说话人嵌入。训练后的TTS模型可以在相应的说话人嵌入条件下合成用户的语音。然而，说话人编码器在可见说话人和不可见说话人之间存在泛化差距。在本文中，我们提出了一种元学习算法应用于说话人自适应方法。更具体地说，我们使用模型不可知元学习（MAML）作为多说话人TTS模型的训练算法，其目的是找到一个很好的元初始化，使该模型能够快速适应任何少量的镜头说话人自适应任务。因此，我们也可以将元训练的TTS模型有效地应用于看不见的说话人。我们的实验将所提出的方法（Meta-TTS）与两个基线进行了比较：说话人自适应方法基线和说话人编码方法基线。评估结果表明，Meta-TTS可以从少量的注册样本中合成出高说话人相似度的语音，其自适应步骤比说话人自适应基线要少，并且在相同的训练方案下，Meta-TTS的性能优于说话人编码基线。当基线的说话人编码器使用额外的8371个说话人数据进行预训练时，Meta TTS仍然可以在LibriTTS数据集上优于基线，并在VCTK数据集上获得可比结果。摘要：Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user's voice with rare enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between the seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a great meta-initialization to adapt the model to any few-shot speaker adaptation tasks quickly. 
Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation method baseline and a speaker encoding method baseline. The evaluation results show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with extra 8371 speakers of data, Meta-TTS can still outperform the baseline on LibriTTS dataset and achieve comparable results on VCTK dataset.
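
The idea of meta-learning an initialization can be illustrated on a toy problem. What follows is a minimal sketch and not the Meta-TTS implementation: each hypothetical task is a scalar quadratic loss with its own optimum, standing in for one speaker, and the names `adapt` and `maml_train` as well as the step sizes `alpha` and `beta` are assumptions made for illustration. Because the toy loss is quadratic, the gradient through the single inner step can be written in closed form.

```python
def loss(w, target):
    """Toy per-task loss: squared distance between parameter w and the task optimum."""
    return (w - target) ** 2


def grad(w, target):
    """Gradient of the toy loss with respect to w."""
    return 2.0 * (w - target)


def adapt(w, target, alpha=0.1, steps=1):
    """Inner loop: a few plain gradient steps on a single task, starting from w."""
    for _ in range(steps):
        w = w - alpha * grad(w, target)
    return w


def maml_train(task_targets, w0=0.0, alpha=0.1, beta=0.05, iters=500):
    """Outer loop: move the initialization so that ONE inner step works well on every task."""
    w = w0
    for _ in range(iters):
        meta_grad = 0.0
        for target in task_targets:
            w_adapted = adapt(w, target, alpha, steps=1)
            # Chain rule through the inner step: d(w_adapted)/dw = 1 - 2 * alpha.
            meta_grad += grad(w_adapted, target) * (1.0 - 2.0 * alpha)
        w = w - beta * meta_grad / len(task_targets)
    return w


meta_init = maml_train([1.0, 2.0, 3.0])  # hypothetical "seen" tasks
unseen = 2.5                              # hypothetical "unseen" task
loss_from_meta = loss(adapt(meta_init, unseen), unseen)
loss_from_naive = loss(adapt(0.0, unseen), unseen)
print(loss_from_meta, loss_from_naive)
```

In this toy setting a single adaptation step from the meta-learned initialization leaves a much smaller loss on the unseen task than the same step from the naive starting point, which is the behaviour the abstract attributes to a good meta-initialization.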

【6】 Towards noise robust trigger-word detection with contrastive learning pre-task for fast on-boarding of new trigger-words
Link: https://arxiv.org/abs/2111.03971

Authors: Sivakumar Balasubramanian, Aditya Jajodia, Gowtham Srinivasan
Affiliations: Samsung Research America, Mountain View, USA
Comments: submitted to ICASSP 2022
Abstract: Trigger-word detection plays an important role as the entry point of user's communication with voice assistants. But supporting a particular word as a trigger-word involves a huge amount of data collection, augmentation and labelling for that word. This makes supporting new trigger-words a tedious and time-consuming process. To combat this, we explore the use of contrastive learning as a pre-training task that helps the detection model to generalize to different words and noise conditions. We explore supervised contrastive techniques and also propose a self-supervised technique using chunked words from long sentence audios. We show that the contrastive pre-training techniques have comparable results to a traditional classification pre-training on new trigger words with less data availability.
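
The abstract does not say which contrastive objective is used. As a hedged illustration, the sketch below implements an InfoNCE-style loss, one common choice that is assumed here and not taken from the paper. The two-dimensional vectors stand in for audio embeddings of word chunks, and the function names and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Positive aligned with the anchor: small loss.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Negative aligned with the anchor instead: large loss.
confused = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, confused)
```

Minimizing such a loss pulls embeddings of the same word together and pushes different words apart, which is the kind of pre-training signal the abstract describes.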

【7】 Digital Audio Processing Tools for Music Corpus Studies
Link: https://arxiv.org/abs/2111.03895

Authors: Johanna Devaney
Abstract: Digital audio processing tools offer music researchers the opportunity to examine both non-notated music and music as performance. This chapter summarises the types of information that can be extracted from audio as well as currently available audio tools for music corpus studies. The survey of extraction methods includes both a primer on signal processing and background theory on audio feature extraction. The survey of audio tools focuses on widely used tools, including both those with a graphical user interface, namely Audacity and Sonic Visualiser, and code-based tools written in the C/C++, Java, MATLAB, and Python computer programming languages.

【8】 SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines
Link: https://arxiv.org/abs/2111.03811

Authors: Zhang Haozhe, Cai Zexin, Qin Xiaoyi, Li Ming
Affiliations: Duke Kunshan University, China; School of Computer Science, Wuhan University
Abstract: Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice conversion. We aim to obtain intermediate representations for speaker-content disentanglement of speech to better remove speaker information and get pure content information. Accordingly, our proposed framework contains a module that removes the speaker information from the acoustic feature of the source speaker. Moreover, speaker information control is added to our system to maintain the voice cloning performance. The proposed system is evaluated by subjective and objective metrics. Results show that our proposed system significantly reduces the trade-off problem in zero-shot voice conversion, while it also manages to have high spoofing power to the speaker verification system.

【9】 Privacy attacks for automatic speech recognition acoustic models in a federated learning framework
Link: https://arxiv.org/abs/2111.03777

Authors: Natalia Tomashenko, Salima Mdhaffar, Marc Tommasi, Yannick Estève, Jean-François Bonastre
Affiliations: LIA, Avignon Université, France; Université de Lille, CNRS, Inria, Centrale Lille, UMR , - CRIStAL, Lille, France
Comments: Submitted to ICASSP 2022
Abstract: This paper investigates methods to effectively retrieve speaker information from the personalized speaker adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide an equal error rate (EER) of 1-2%.
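
The equal error rate quoted in the abstract is the operating point at which the false-acceptance rate and the false-rejection rate coincide. The following is a minimal sketch of how an EER can be estimated from trial scores by sweeping a threshold; the function name and the example scores are hypothetical, and evaluation toolkits typically interpolate between thresholds, which this sketch does not.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping the decision threshold over all observed scores."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a genuine (target) trial scores below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor (non-target) trial scores at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65])
separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
print(overlapping, separable)
```

An EER of 1-2% therefore means that, at the balanced threshold, only about one or two trials in a hundred are misclassified in either direction.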

【10】 The complex-valued correlation coefficient accounts for binaural detection
Link: https://arxiv.org/abs/2111.04637

Authors: Jörg Encke, Mathias Dietz
Affiliations: Department für Medizinische Physik und Akustik, Universität Oldenburg, 26111 Oldenburg
Abstract: Binaural hearing is one of the principal mechanisms enabling the localization of sound sources in space. In addition, binaural hearing also significantly improves the ability to detect signals in noise. Humans can detect interaurally anti-phasic tones in masking noise at sound levels 15 dB below the detection threshold of the equivalent in-phase tones. Intermediate thresholds result from detecting tones in noise with an interaural time difference (ITD). The ITD dependence has been most accurately accounted for by models using an internal delay-line mechanism. The delay lines, or an equivalent mechanism, however, have not been found in mammals. Alternative coding principles that do not include delay lines can explain many aspects of sound localization but have failed to account for some of the available data on binaural detection. By employing the complex-valued correlation coefficient, we show that a minimum assumption model can explain the outcome of a wide range of binaural detection experiments. The proposed mechanism requires fewer degrees of freedom when compared to delay-line models while arguably improving compatibility with mammalian physiology. The study also shows that the 2-dimensional acoustic feature space of complex correlation coefficients is at the same time a perceptually uniform space for binaural detection.
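
The abstract does not give a formula for the complex-valued correlation coefficient. A common definition, assumed here, normalizes the cross-correlation of the two analytic (complex) ear signals by their powers, so that the magnitude measures interaural coherence and the argument measures the interaural phase difference. The sketch below uses pure complex tones as hypothetical ear signals; the function names, the tone frequency, and the sampling rate are illustrative choices.

```python
import cmath
import math


def complex_correlation(left, right):
    """Normalized complex correlation coefficient of two complex (analytic) signals."""
    cross = sum(a * b.conjugate() for a, b in zip(left, right))
    power_left = sum(abs(a) ** 2 for a in left)
    power_right = sum(abs(b) ** 2 for b in right)
    return cross / math.sqrt(power_left * power_right)


def analytic_tone(freq_hz, phase_rad, n_samples=160, fs_hz=8000.0):
    """Complex exponential, i.e. the analytic signal of a pure tone."""
    return [cmath.exp(1j * (2.0 * math.pi * freq_hz * n / fs_hz + phase_rad))
            for n in range(n_samples)]


left = analytic_tone(500.0, 0.0)
in_phase = complex_correlation(left, analytic_tone(500.0, 0.0))
anti_phase = complex_correlation(left, analytic_tone(500.0, math.pi))
lagging = complex_correlation(left, analytic_tone(500.0, -math.pi / 2.0))
print(in_phase, anti_phase, cmath.phase(lagging))
```

Under this definition in-phase and anti-phasic tones map to opposite points on the unit circle, and intermediate interaural delays land in between, which gives the two-dimensional feature space (real and imaginary part) that the abstract mentions.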

【11】 Learning Filterbanks for End-to-End Acoustic Beamforming
Link: https://arxiv.org/abs/2111.04614

Authors: Samuele Cornell, Manuel Pariente, François Grondin, Stefano Squartini
Affiliations: Italy; Université de Lorraine, CNRS, Inria, LORIA, France; Université de Sherbrooke, Canada
Abstract: Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.

【12】 RawBoost: A Raw Data Boosting and Augmentation Method applied to Automatic Speaker Verification Anti-Spoofing
Link: https://arxiv.org/abs/2111.04433

Authors: Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, Nicholas Evans
Affiliations: EURECOM, Sophia Antipolis, France
Comments: submitted to ICASSP 2022, 5 pages, 2 figures and 3 tables
Abstract: This paper introduces RawBoost, a data boosting and augmentation method for the design of more reliable spoofing detection solutions which operate directly upon raw waveform inputs. While RawBoost requires no additional data sources, e.g. noise recordings or impulse responses and is data, application and model agnostic, it is designed for telephony scenarios. Based upon the combination of linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise, RawBoost models nuisance variability stemming from, e.g., encoding, transmission, microphones and amplifiers, and both linear and non-linear distortion. Experiments performed using the ASVspoof 2021 logical access database show that RawBoost improves the performance of a state-of-the-art raw end-to-end baseline system by 27% relative and is only outperformed by solutions that either depend on external data or that require additional intervention at the model level.

【13】 Inter-channel Conv-TasNet for multichannel speech enhancement
Link: https://arxiv.org/abs/2111.04312

Authors: Dongheon Lee, Seongrae Kim, Jung-Woo Choi
Affiliations: Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Comments: 10 pages, this work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract: Speech enhancement in multichannel settings has been realized by utilizing the spatial information embedded in multiple microphone signals. Moreover, deep neural networks (DNNs) have been recently advanced in this field; however, studies on the efficient multichannel network structure fully exploiting spatial information and inter-channel relationships are still in their early stages. In this study, we propose an end-to-end time-domain speech enhancement network that can facilitate the use of inter-channel relationships at individual layers of a DNN. The proposed technique is based on a fully convolutional time-domain audio separation network (Conv-TasNet), originally developed for speech separation tasks. We extend Conv-TasNet into several forms that can handle multichannel input signals and learn inter-channel relationships. To this end, we modify the encoder-mask-decoder structures of the network to be compatible with 3-D tensors defined over spatial channels, features, and time dimensions.
In particular, we conduct extensive parameter analyses on the convolution structure and propose independent assignment of the depthwise and 1$\times$1 convolution layers to the feature and spatial dimensions, respectively. We demonstrate that the enriched inter-channel information from the proposed network plays a significant role in suppressing noisy signals impinging from various directions. The proposed inter-channel Conv-TasNet outperforms the state-of-the-art multichannel variants of neural networks, even with one-tenth of their parameter size. The performance of the proposed model is evaluated using the CHiME-3 dataset, which exhibits a remarkable improvement in SDR, PESQ, and STOI.
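
As general background on why depthwise and 1x1 convolutions are attractive building blocks, the sketch below compares weight counts for a standard 1-D convolution and a depthwise-separable one. This is a generic calculation, not the configuration used in the paper: the channel count and kernel size are hypothetical, and bias terms are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights of a standard 1-D convolution: every output channel sees every input channel."""
    return c_in * c_out * kernel


def depthwise_separable_params(c_in, c_out, kernel):
    """Weights of a depthwise convolution followed by a 1x1 (pointwise) convolution."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


full = standard_conv1d_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
print(full, separable, full / separable)
```

With these illustrative sizes the separable form needs roughly a third of the weights, which is consistent in spirit with the abstract's emphasis on a small parameter budget.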

【14】 LiMuSE: Lightweight Multi-modal Speaker Extraction
Link: https://arxiv.org/abs/2111.04063

Authors: Qinghua Liu, Yating Huang, Yunzhe Hao, Jiaming Xu, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; Qiushi Honors College, Tianjin University, Tianjin, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China
Abstract: The past several years have witnessed significant progress in modeling the Cocktail Party Problem in terms of speech separation and speaker extraction. In recent years, multi-modal cues, including spatial information, facial expression and voiceprint, are introduced to speaker extraction task to serve as complementary information to each other to achieve better performance. However, the front-end model for speaker extraction becomes large and hard to deploy on a resource-constrained device. In this paper, we address the aforementioned problem with novel model architectures and model compression techniques, and propose a lightweight multi-modal framework for speaker extraction (dubbed LiMuSE), which adopts group communication (GC) to split multi-modal high-dimension features into groups of low-dimension features with smaller width which could be run in parallel, and further uses an ultra-low bit quantization strategy to achieve lower model size.
The experiments on the GRID dataset show that incorporating GC into the multi-modal framework achieves on par or better performance with 24.86 times fewer parameters, and applying the quantization strategy to the GC-equipped model further obtains about 9 times compression ratio while maintaining a comparable performance compared with baselines. Our code will be available at https://github.com/aispeech-lab/LiMuSE.

【15】 Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet
Link: https://arxiv.org/abs/2111.03847

Authors: Ziyi Xu, Maximilian Strake, Tim Fingscheidt
Affiliations: Institute for Communications Technology, Technische Universität Braunschweig
Abstract: Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). During training, DNS methods are typically trained with mean squared error (MSE) type loss functions, which do not guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable, and therefore cannot directly be used as optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQNet DNN to estimate the PESQ scores of the enhanced speech signal. Thus, by providing a reference-free perceptual loss, it serves as a mediator towards the DNS training, allowing to maximize the PESQ score of the enhanced speech signal. We illustrate the potential of our proposed PESQNet-mediated training on the basis of an already strong baseline DNS. As further novelty, we propose to train the DNS and the PESQNet alternatingly to keep the PESQNet up-to-date and perform well specifically for the DNS under training. Our proposed method is compared to the same DNS trained with MSE-based loss for joint denoising and dereverberation, and the Interspeech 2021 DNS Challenge baseline.
Detailed analysis shows that the PESQNet mediation can further increase the DNS performance by about 0.1 PESQ points on synthetic test data and by 0.03 DNSMOS points on real test data, compared to training with the MSE-based loss. Our proposed method also outperforms the Challenge baseline by 0.2 PESQ points on synthetic test data and 0.1 DNSMOS points on real test data.

【16】 Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems
Link: https://arxiv.org/abs/2111.03842

Authors: Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida
Affiliations: Aragón Institute for Engineering Research (I3A), University of Zaragoza
Abstract: This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.

3. eess.AS Audio Processing:

【1】 The complex-valued correlation coefficient accounts for binaural detection
Link: https://arxiv.org/abs/2111.04637

Authors: Jörg Encke, Mathias Dietz
Affiliations: Department für Medizinische Physik und Akustik, Universität Oldenburg, 26111 Oldenburg
Abstract: Binaural hearing is one of the principal mechanisms enabling the localization of sound sources in space. In addition, binaural hearing also significantly improves the ability to detect signals in noise. Humans can detect interaurally anti-phasic tones in masking noise at sound levels 15 dB below the detection threshold of the equivalent in-phase tones. Intermediate thresholds result from detecting tones in noise with an interaural time difference (ITD). The ITD dependence has been most accurately accounted for by models using an internal delay-line mechanism. The delay lines, or an equivalent mechanism, however, have not been found in mammals. Alternative coding principles that do not include delay lines can explain many aspects of sound localization but have failed to account for some of the available data on binaural detection. By employing the complex-valued correlation coefficient, we show that a minimum assumption model can explain the outcome of a wide range of binaural detection experiments. The proposed mechanism requires fewer degrees of freedom when compared to delay-line models while arguably improving compatibility with mammalian physiology. The study also shows that the 2-dimensional acoustic feature space of complex correlation coefficients is at the same time a perceptually uniform space for binaural detection.

【2】 Learning Filterbanks for End-to-End Acoustic Beamforming
Link: https://arxiv.org/abs/2111.04614

Authors: Samuele Cornell, Manuel Pariente, François Grondin, Stefano Squartini
Affiliations: Italy; Université de Lorraine, CNRS, Inria, LORIA, France; Université de Sherbrooke, Canada
Abstract: Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.

【3】 RawBoost: A Raw Data Boosting and Augmentation Method applied to Automatic Speaker Verification Anti-Spoofing
Link: https://arxiv.org/abs/2111.04433

Authors: Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, Nicholas Evans
Affiliations: EURECOM, Sophia Antipolis, France
Comments: submitted to ICASSP 2022, 5 pages, 2 figures and 3 tables
Abstract: This paper introduces RawBoost, a data boosting and augmentation method for the design of more reliable spoofing detection solutions which operate directly upon raw waveform inputs. While RawBoost requires no additional data sources, e.g. noise recordings or impulse responses and is data, application and model agnostic, it is designed for telephony scenarios. Based upon the combination of linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise, RawBoost models nuisance variability stemming from, e.g., encoding, transmission, microphones and amplifiers, and both linear and non-linear distortion. Experiments performed using the ASVspoof 2021 logical access database show that RawBoost improves the performance of a state-of-the-art raw end-to-end baseline system by 27% relative and is only outperformed by solutions that either depend on external data or that require additional intervention at the model level.

【4】 Inter-channel Conv-TasNet for multichannel speech enhancement
Link: https://arxiv.org/abs/2111.04312

Authors: Dongheon Lee, Seongrae Kim, Jung-Woo Choi
Affiliations: Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Comments: 10 pages, this work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract: Speech enhancement in multichannel settings has been realized by utilizing the spatial information embedded in multiple microphone signals. Moreover, deep neural networks (DNNs) have been recently advanced in this field; however, studies on the efficient multichannel network structure fully exploiting spatial information and inter-channel relationships are still in their early stages. In this study, we propose an end-to-end time-domain speech enhancement network that can facilitate the use of inter-channel relationships at individual layers of a DNN. The proposed technique is based on a fully convolutional time-domain audio separation network (Conv-TasNet), originally developed for speech separation tasks. We extend Conv-TasNet into several forms that can handle multichannel input signals and learn inter-channel relationships. To this end, we modify the encoder-mask-decoder structures of the network to be compatible with 3-D tensors defined over spatial channels, features, and time dimensions.
In particular, we conduct extensive parameter analyses on the convolution structure and propose independent assignment of the depthwise and 1$\times$1 convolution layers to the feature and spatial dimensions, respectively. We demonstrate that the enriched inter-channel information from the proposed network plays a significant role in suppressing noisy signals impinging from various directions. The proposed inter-channel Conv-TasNet outperforms the state-of-the-art multichannel variants of neural networks, even with one-tenth of their parameter size. The performance of the proposed model is evaluated using the CHiME-3 dataset, which exhibits a remarkable improvement in SDR, PESQ, and STOI.

【5】 LiMuSE: Lightweight Multi-modal Speaker Extraction
Link: https://arxiv.org/abs/2111.04063

Authors: Qinghua Liu, Yating Huang, Yunzhe Hao, Jiaming Xu, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; Qiushi Honors College, Tianjin University, Tianjin, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China
Abstract: The past several years have witnessed significant progress in modeling the Cocktail Party Problem in terms of speech separation and speaker extraction. In recent years, multi-modal cues, including spatial information, facial expression and voiceprint, are introduced to speaker extraction task to serve as complementary information to each other to achieve better performance. However, the front-end model for speaker extraction becomes large and hard to deploy on a resource-constrained device. In this paper, we address the aforementioned problem with novel model architectures and model compression techniques, and propose a lightweight multi-modal framework for speaker extraction (dubbed LiMuSE), which adopts group communication (GC) to split multi-modal high-dimension features into groups of low-dimension features with smaller width which could be run in parallel, and further uses an ultra-low bit quantization strategy to achieve lower model size.
The experiments on the GRID dataset show that incorporating GC into the multi-modal framework achieves on par or better performance with 24.86 times fewer parameters, and applying the quantization strategy to the GC-equipped model further obtains about 9 times compression ratio while maintaining a comparable performance compared with baselines. Our code will be available at https://github.com/aispeech-lab/LiMuSE.

【6】 Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet
Link: https://arxiv.org/abs/2111.03847

Authors: Ziyi Xu, Maximilian Strake, Tim Fingscheidt
Affiliations: Institute for Communications Technology, Technische Universität Braunschweig
Abstract: Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). During training, DNS methods are typically trained with mean squared error (MSE) type loss functions, which do not guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable, and therefore cannot directly be used as optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQNet DNN to estimate the PESQ scores of the enhanced speech signal. Thus, by providing a reference-free perceptual loss, it serves as a mediator towards the DNS training, allowing to maximize the PESQ score of the enhanced speech signal. We illustrate the potential of our proposed PESQNet-mediated training on the basis of an already strong baseline DNS. As further novelty, we propose to train the DNS and the PESQNet alternatingly to keep the PESQNet up-to-date and perform well specifically for the DNS under training. Our proposed method is compared to the same DNS trained with MSE-based loss for joint denoising and dereverberation, and the Interspeech 2021 DNS Challenge baseline.
Detailed analysis shows that the PESQNet mediation can further increase the DNS performance by about 0.1 PESQ points on synthetic test data and by 0.03 DNSMOS points on real test data, compared to training with the MSE-based loss. Our proposed method also outperforms the Challenge baseline by 0.2 PESQ points on synthetic test data and 0.1 DNSMOS points on real test data.

【7】 Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems Link: https://arxiv.org/abs/2111.03842

Authors: Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida
Affiliation: Aragón Institute for Engineering Research (I3A), University of Zaragoza
Abstract: This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input what is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.
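
The two-token training objective described in this abstract (class token matched to the true label, distillation token matched to the teacher's predictions) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the use of KL divergence for the teacher term, and the mixing weight `alpha`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of the predicted distribution against a hard label index."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def two_token_loss(class_logits, distill_logits, teacher_logits, label, alpha=0.5):
    """Hypothetical combined loss: the class-token head is scored against the
    true label, the distillation-token head against the teacher's soft output.
    alpha (an assumed hyperparameter) balances the two terms."""
    hard = cross_entropy(class_logits, label)
    soft = kl_divergence(softmax(teacher_logits), softmax(distill_logits))
    return (1 - alpha) * hard + alpha * soft

# Toy two-speaker example: the distillation head already matches the teacher,
# so only the hard-label term contributes to the loss.
print(two_token_loss([2.0, 0.0], [1.0, 3.0], [1.0, 3.0], label=0))
```

In this sketch the distillation term vanishes when the student's distillation head reproduces the teacher's distribution exactly, which is the behaviour the abstract describes as "mimicking" the teacher.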

【8】 SEOFP-NET: Compression and Acceleration of Deep Neural Networks for Speech Enhancement Using Sign-Exponent-Only Floating-Points Link: https://arxiv.org/abs/2111.04436

Authors: Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo
Abstract: Numerous compression and acceleration strategies have achieved outstanding results on classification tasks in various fields, such as computer vision and speech signal processing. Nevertheless, the same strategies have yielded ungratified performance on regression tasks because the nature between these and classification tasks differs. In this paper, a novel sign-exponent-only floating-point network (SEOFP-NET) technique is proposed to compress the model size and accelerate the inference time for speech enhancement, a regression task of speech signal processing. The proposed method compressed the sizes of deep neural network (DNN)-based speech enhancement models by quantizing the fraction bits of single-precision floating-point parameters during training. Before inference implementation, all parameters in the trained SEOFP-NET model are slightly adjusted to accelerate the inference time by replacing the floating-point multiplier with an integer-adder. For generalization, the SEOFP-NET technique is introduced to different speech enhancement tasks in speech signal processing with different model architectures under various corpora. The experimental results indicate that the size of SEOFP-NET models can be significantly compressed by up to 81.249% without noticeably downgrading their speech enhancement performance, and the inference time can be accelerated to 1.212x compared with the baseline models. The results also verify that the proposed SEOFP-NET can cooperate with other efficiency strategies to achieve a synergy effect for model compression. In addition, the just noticeable difference (JND) was applied to the user study experiment to statistically analyze the effect of speech enhancement on listening. The results indicate that the listeners cannot facilely differentiate between the enhanced speech signals processed by the baseline model and the proposed SEOFP-NET.
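
To make the idea of "sign-exponent-only" parameters concrete, the sketch below clears the 23 fraction bits of an IEEE 754 single-precision value, leaving a signed power of two, and then multiplies an activation by such a weight using only integer addition on the bit patterns. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, truncation is used where the paper may round differently, and zeros, subnormals, infinities, and exponent overflow are deliberately not handled.

```python
import struct

FRACTION_BITS = 23
SIGN_MASK = 0x80000000      # bit 31 of a float32
EXP_MASK = 0x7F800000       # bits 30..23 of a float32
MAGNITUDE_MASK = 0x7FFFFFFF
BIAS = 127 << FRACTION_BITS  # exponent bias, shifted into the exponent field

def f32_to_bits(x):
    """Reinterpret a value (rounded to float32) as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b):
    """Reinterpret a 32-bit unsigned integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def sign_exponent_only(w):
    """Keep only sign and exponent bits, so the result is +/- 2**k.
    Assumption: plain truncation of the fraction bits."""
    return bits_to_f32(f32_to_bits(w) & (SIGN_MASK | EXP_MASK))

def multiply_by_integer_add(x, w_q):
    """Compute x * w_q for a sign-exponent-only weight w_q without a
    floating-point multiply: add the unbiased exponent of w_q to the bit
    pattern of x and XOR the signs. Valid only for normal, non-zero inputs
    whose product stays within the normal float32 range."""
    xb, wb = f32_to_bits(x), f32_to_bits(w_q)
    sign = (xb ^ wb) & SIGN_MASK
    magnitude = (xb & MAGNITUDE_MASK) + (wb & EXP_MASK) - BIAS
    return bits_to_f32(sign | magnitude)

w_q = sign_exponent_only(0.3)             # 0.3 = 1.2 * 2**-2, so w_q is 0.25
print(w_q, multiply_by_integer_add(3.0, w_q))
```

Because a sign-exponent-only weight carries no fraction bits, it needs far fewer than 32 bits of storage, which is the source of the compression the abstract reports; the exact bit budget used in the paper is not stated here.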

【9】 Characterizing the adversarial vulnerability of speech self-supervised learning Link: https://arxiv.org/abs/2111.04330

Authors: Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-yi Lee, Helen Meng
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; Human-Computer Communications Laboratory, The Chinese University of Hong Kong
Abstract: A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.

【10】 Retrieving Speaker Information from Personalized Acoustic Models for Speech Recognition Link: https://arxiv.org/abs/2111.04194

Authors: Salima Mdhaffar, Jean-François Bonastre, Marc Tommasi, Natalia Tomashenko, Yannick Estève
Affiliations: LIA, Avignon Université, France; Université de Lille, CNRS, Inria, Centrale Lille, UMR , - CRIStAL, Lille, France
Abstract: The widespread of powerful personal devices capable of collecting voice of their users has opened the opportunity to build speaker adapted speech recognition system (ASR) or to participate to collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. fine-tuned AM with specific speaker data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve the gender of the speaker, but also his identity, by just exploiting the weight matrix changes of a neural acoustic model locally adapted to this speaker. Incidentally we observe phenomena that may be useful towards explainability of deep neural networks in the context of speech processing. Gender can be identified almost surely using only the first layers and speaker verification performs well when using middle-up layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection, and an Equal Error Rate of 9.07% for a speaker verification task by only exploiting the weights from personalized models that could be exchanged instead of user data.

【11】 Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer Link: https://arxiv.org/abs/2111.04093

Authors: Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
Abstract: Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

【12】 Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech Link: https://arxiv.org/abs/2111.04040

Authors: Sung-Feng Huang, Chyi-Jiunn Lin, Hung-yi Lee
Affiliation: Department of Electrical Engineering
Comments: under review
Abstract: Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user's voice with rare enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between the seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a great meta-initialization to adapt the model to any few-shot speaker adaptation tasks quickly. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation method baseline and a speaker encoding method baseline. The evaluation results show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with extra 8371 speakers of data, Meta-TTS can still outperform the baseline on LibriTTS dataset and achieve comparable results on VCTK dataset.

【13】 Towards noise robust trigger-word detection with contrastive learning pre-task for fast on-boarding of new trigger-words Link: https://arxiv.org/abs/2111.03971

Authors: Sivakumar Balasubramanian, Aditya Jajodia, Gowtham Srinivasan
Affiliation: Samsung Research America, Mountain View, USA
Comments: submitted to ICASSP 2022
Abstract: Trigger-word detection plays an important role as the entry point of user's communication with voice assistants. But supporting a particular word as a trigger-word involves huge amount of data collection, augmentation and labelling for that word. This makes supporting new trigger-words a tedious and time consuming process. To combat this, we explore the use of contrastive learning as a pre-training task that helps the detection model to generalize to different words and noise conditions. We explore supervised contrastive techniques and also propose a self-supervised technique using chunked words from long sentence audios. We show that the contrastive pre-training techniques have comparable results to a traditional classification pre-training on new trigger words with less data availability.

【14】 Digital Audio Processing Tools for Music Corpus Studies Link: https://arxiv.org/abs/2111.03895

Author: Johanna Devaney
Abstract: Digital audio processing tools offer music researchers the opportunity to examine both non-notated music and music as performance. This chapter summarises the types of information that can be extracted from audio as well as currently available audio tools for music corpus studies. The survey of extraction methods includes both a primer on signal processing and background theory on audio feature extraction. The survey of audio tools focuses on widely used tools, including both those with a graphical user interface, namely Audacity and Sonic Visualiser, and code-based tools written in the C/C++, Java, MATLAB, and Python computer programming languages.

【15】 SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines Link: https://arxiv.org/abs/2111.03811

Authors: Zhang Haozhe, Cai Zexin, Qin Xiaoyi, Li Ming
Affiliations: Duke Kunshan University, China; School of Computer Science, Wuhan University
Abstract: Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice conversion. We aim to obtain intermediate representations for speaker-content disentanglement of speech to better remove speaker information and get pure content information. Accordingly, our proposed framework contains a module that removes the speaker information from the acoustic feature of the source speaker. Moreover, speaker information control is added to our system to maintain the voice cloning performance. The proposed system is evaluated by subjective and objective metrics. Results show that our proposed system significantly reduces the trade-off problem in zero-shot voice conversion, while it also manages to have high spoofing power to the speaker verification system.

【16】 Privacy attacks for automatic speech recognition acoustic models in a federated learning framework Link: https://arxiv.org/abs/2111.03777

Authors: Natalia Tomashenko, Salima Mdhaffar, Marc Tommasi, Yannick Estève, Jean-François Bonastre
Affiliations: LIA, Avignon Université, France; Université de Lille, CNRS, Inria, Centrale Lille, UMR , - CRIStAL, Lille, France
Comments: Submitted to ICASSP 2022
Abstract: This paper investigates methods to effectively retrieve speaker information from the personalized speaker adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide equal error rate (EER) of 1-2%.
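
Since this abstract (and entry 【10】) reports results as an equal error rate, a small sketch of how an EER can be estimated from verification scores may help readers interpret the numbers. The function name, the threshold sweep over observed scores, and the toy scores are assumptions for illustration; evaluation toolkits typically interpolate the ROC curve instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER: the operating point where the false-rejection rate
    (targets scored below the threshold) equals the false-acceptance rate
    (non-targets scored at or above it). The threshold is swept over the
    observed scores and the closest crossing is returned."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one same-speaker trial scores low and one impostor trial scores high.
same_speaker = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, impostor))  # 0.25
```

A lower EER means the attacker separates same-speaker and different-speaker trials more reliably, so an EER of 1-2% indicates that speaker identity is strongly recoverable from the model updates.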

【17】 Oracle Teacher: Towards Better Knowledge Distillation Link: https://arxiv.org/abs/2111.03664

Authors: Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, Nam Soo Kim
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract: Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for KD, namely Oracle Teacher, that utilizes the embeddings of both the source inputs and the output labels to extract a more accurate knowledge to be transferred to the student. The proposed model follows the encoder-decoder attention structure of the Transformer network, which allows the model to attend to related information from the output labels. Extensive experiments are conducted on three different sequence learning tasks: speech recognition, scene text recognition, and machine translation. From the experimental results, we empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.

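This article is part of the Tencent Cloud self-media synchronized exposure program and is shared from a WeChat official account.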

Originally published: 2021-11-09. To report infringement, contact cloudcommunity@tencent.com for removal.
