Finance / Speech / Audio Processing Academic Digest [7.13]

WeChat official account: arXiv Daily Academic Digest

Published 2021-07-27 10:50:08

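This article is part of the column: arXiv Daily Academic Digest

Visit www.arxivdaily.com for digests with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and posting features.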

q-fin (Finance): 17 papers

cs.SD (Speech): 13 papers

eess.AS (Audio Processing): 17 papers

1. q-fin (Finance):

【1】 Investor Behavior Modeling by Analyzing Financial Advisor Notes: A Machine Learning Perspective

Authors: Cynthia Pagliaro, Dhagash Mehta, Han-Tai Shiao, Shaofei Wang, Luwei Xiong
Affiliation: The Vanguard Group, Malvern, PA, USA
Comments: 8 pages, 2 column format, 7 figures+5 tables
Link: https://arxiv.org/abs/2107.05592
Abstract: Modeling investor behavior is crucial to identifying behavioral coaching opportunities for financial advisors. With the help of natural language processing (NLP) we analyze an unstructured (textual) dataset of financial advisors' summary notes, taken after every investor conversation, to gain first-ever insights into advisor-investor interactions. These insights are used to predict investor needs during adverse market conditions, thus allowing advisors to coach investors and help avoid inappropriate financial decision-making. First, we perform topic modeling to gain insight into the emerging topics and trends. Based on this insight, we construct a supervised classification model to predict the probability that an advised investor will require behavioral coaching during volatile market periods. To the best of our knowledge, ours is the first work on exploring the advisor-investor relationship using unstructured data. This work may have far-reaching implications for both traditional and emerging financial advisory service models like robo-advising.

【2】 Predicting Risk-adjusted Returns using an Asset Independent Regime-switching Model

Authors: Nicklas Werge
Affiliation: LPSM, Sorbonne Université, place Jussieu, Paris, France
Link: https://arxiv.org/abs/2107.05535
Abstract: Financial markets tend to switch between various market regimes over time, making stationarity-based models unsustainable. We construct a regime-switching model independent of asset classes for risk-adjusted return predictions based on hidden Markov models. This framework can distinguish between market regimes in a wide range of financial markets such as the commodity, currency, stock, and fixed income market. The proposed method employs sticky features that directly affect the regime stickiness and thereby change turnover levels. An investigation of our metric for risk-adjusted return predictions is conducted by analyzing daily financial market changes for almost twenty years. Empirical demonstrations on out-of-sample observations show accurate detection of bull, bear, and high volatility periods, improving risk-adjusted returns while keeping a preferable turnover level.
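
To make the regime-switching idea concrete, here is a minimal sketch of a two-state hidden Markov filter applied to daily returns. It is not the paper's model: the Gaussian emissions, the single `stickiness` parameter (the probability of staying in the current regime), the sample returns, and all function names are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sticky_transition(stickiness):
    """Two-state transition matrix; `stickiness` is the probability of staying put."""
    return [[stickiness, 1.0 - stickiness],
            [1.0 - stickiness, stickiness]]


def filter_regimes(returns, means, stds, stickiness, prior=(0.5, 0.5)):
    """Forward (filtering) pass: P(regime_t | returns up to day t) for every day."""
    trans = sticky_transition(stickiness)
    belief = list(prior)
    beliefs = []
    for r in returns:
        # Predict: push the previous belief through the transition matrix.
        predicted = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]
        # Update: weight each regime by how well it explains today's return.
        weighted = [predicted[j] * gaussian_pdf(r, means[j], stds[j]) for j in range(2)]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        beliefs.append(belief)
    return beliefs


def count_switches(beliefs):
    """Number of changes in the most likely regime (a crude turnover proxy)."""
    labels = [0 if b[0] >= b[1] else 1 for b in beliefs]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# Hypothetical daily returns: a calm stretch, then turbulence with quiet days mixed in.
returns = [0.001, 0.002, -0.001, 0.03, -0.04, 0.001, 0.035, -0.03, 0.001, 0.0]
means, stds = (0.0005, -0.001), (0.005, 0.03)  # regime 0 = calm, regime 1 = volatile

for s in (0.6, 0.98):
    print(s, count_switches(filter_regimes(returns, means, stds, s)))
```

In this toy setup a higher `stickiness` makes the filter slower to leave a regime, so isolated quiet days inside a turbulent stretch trigger fewer regime flips. This is one simple way to see how stickiness and turnover can be linked.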

【3】 Are Rents Excessive in the Central City?: A Geospatial Analysis

Authors: Scott W. Hegerty
Affiliation: Department of Economics, Northeastern Illinois University, Chicago, IL
Link: https://arxiv.org/abs/2107.05529
Abstract: In many U.S. central cities, property values are relatively low, while rents are closer to those in better-off neighborhoods. This gap can lead to relatively large profits for landlords, and has been referred to as "exploitation" for renters. While much of this gap might be explained by risk, factors such as income and race might play important roles as well. This study calculates Census tract-level measures of the rent-to-property-value (RPV) ratio for 30 large cities and their surrounding metropolitan areas. After examining the spatial distribution of this ratio and relationships with other socioeconomic variables for Milwaukee and three other cities, Z-scores and quantiles are used to identify "extreme" RPV values nationwide. "Rust Belt" cities such as Detroit, Cleveland, and Milwaukee are shown to have higher median and 95% values than do West Coast cities such as Seattle and San Francisco. A spatial lag regression estimation shows that, controlling for income, property values, and vacancy rates, racial characteristics often have the "opposite" signs from what might be expected and that there is little evidence of purely race-based "exploitation" of renters. A significantly negative coefficient for the percentage of Black residents, for example, might suggest that the RPV ratio is lower in a given tract, all else equal. While this study shows where RPV values are highest within as well as between cities, further investigation might uncover the drivers of these spatial differences more fully.
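
The Z-score and quantile screening mentioned in the abstract can be sketched as follows. This is only an illustration: the tract figures are made up, the annualization of monthly rent in `rpv_ratio` and the cutoff of 2 standard deviations are assumptions, and the paper's exact definitions may differ.

```python
import statistics


def rpv_ratio(monthly_rent, property_value):
    """Rent-to-property-value ratio; annualizing the rent is an assumption."""
    return 12.0 * monthly_rent / property_value


def z_scores(values):
    """Standardize values with the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def flag_extremes(values, z_cut=2.0):
    """Indices of values whose Z-score exceeds z_cut (hypothetical threshold)."""
    return [i for i, z in enumerate(z_scores(values)) if z > z_cut]


# Hypothetical tracts: (median monthly rent, median property value) in dollars.
tracts = [(900, 180000), (950, 200000), (1000, 210000), (1100, 240000),
          (1200, 260000), (1050, 220000), (980, 205000), (850, 60000)]
ratios = [rpv_ratio(rent, value) for rent, value in tracts]

print(flag_extremes(ratios))
# 95th-percentile cut point, interpolated within the observed range.
print(statistics.quantiles(ratios, n=20, method="inclusive")[-1])
```

The last tract has rent close to the others but a much lower property value, so its ratio stands out. That is the pattern the abstract describes for some central-city neighborhoods.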

【4】 Characterization of the probability and information entropy of a process with an increasing sample space by different functional forms of expansion, with an application to hyperinflation

Authors: Laurence Francis Lacey
Affiliation: Lacey Solutions Ltd, Skerries, County Dublin, Ireland
Comments: 19 pages, 1 table, 9 figures
Link: https://arxiv.org/abs/2107.05483
Abstract: There is a random variable (X) with a determined outcome (i.e., X = x0), p(x0) = 1. Consider x0 to have a discrete uniform distribution over the integer interval [1, s], where the size of the sample space (s) = 1, in the initial state, such that p(x0) = 1. What is the probability of x0 and the associated information entropy (H), as s increases by means of different functional forms of expansion? Such a process has been characterised in the case of (1) a mono-exponential expansion of the sample space; (2) a power function expansion; (3) double exponential expansion. The double exponential expansion of the sample space with time (from a natural log relationship between t and n) describes a "hyperinflationary" process. Over the period from the middle of 1920 to the end of 1923, the purchasing power of the Weimar Republic paper Mark to purchase one gold Mark became close to zero (1 paper Mark = 10 to the power of -12 gold Mark). From the purchasing power of the paper Mark to purchase one gold Mark, the information entropy of this hyperinflationary process was determined.
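
For a discrete uniform distribution over s outcomes, p(x0) = 1/s and the entropy is H = ln(s). The sketch below evaluates both under three growth laws for s. The specific parameterizations (`k`, `a`, and the "minus one" shift that makes s = 1 at t = 0) are assumptions for illustration, not the paper's formulas, and reading the quoted 10^-12 purchasing power as a probability 1/s is likewise an assumption.

```python
import math


def uniform_entropy(s):
    """Entropy in nats of a discrete uniform distribution over s outcomes."""
    return math.log(s)


def mono_exponential(t, k=1.0):
    """Assumed form: s(t) = exp(k t), so H grows linearly in t."""
    return math.exp(k * t)


def power_function(t, a=2.0):
    """Assumed form: s(t) = (1 + t)^a, so H grows like a * ln(1 + t)."""
    return (1.0 + t) ** a


def double_exponential(t, k=1.0):
    """Assumed form: s(t) = exp(exp(k t) - 1), so H itself grows exponentially."""
    return math.exp(math.exp(k * t) - 1.0)


growth_laws = (("mono-exponential", mono_exponential),
               ("power function", power_function),
               ("double exponential", double_exponential))

for t in (0.0, 1.0, 2.0):
    for name, grow in growth_laws:
        s = grow(t)
        print(f"t={t:.0f} {name:18s} p(x0)={1.0 / s:.4g} H={uniform_entropy(s):.4g}")

# Entropy if 1/s is taken to be 10^-12 (an illustrative reading, see lead-in).
print(uniform_entropy(1e12))
```

All three laws start at s = 1, where p(x0) = 1 and H = 0, which matches the initial state described in the abstract.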

【5】 Debt Swapping for Risk Mitigation in Financial Networks

Authors: Pál András Papp, Roger Wattenhofer
Link: https://arxiv.org/abs/2107.05359
Abstract: We study financial networks where banks are connected by debt contracts. We consider the operation of debt swapping when two creditor banks decide to exchange an incoming payment obligation, thus leading to a locally different network structure. We say that a swap is positive if it is beneficial for both of the banks involved; we can interpret this notion either with respect to the amount of assets received by the banks, or their exposure to different shocks that might hit the system. We analyze various properties of these swapping operations in financial networks. We first show that there can be no positive swap for any pair of banks in a static financial system, or when a shock hits each bank in the network proportionally. We then study worst-case shock models, when a shock of given size is distributed in the worst possible way for a specific bank. If the goal of banks is to minimize their losses in such a worst-case setting, then a positive swap can indeed exist. We analyze the effects of such a positive swap on other banks of the system, the computational complexity of finding a swap, and special cases where a swap can be found efficiently. Finally, we also present some results for more complex swapping operations when the banks swap multiple contracts, or when more than two banks participate in the swap.

【6】 Latent Seasonal and Secular Periodicities Identified in the Dynamics of US FDA Medical Devices (1976-2020)

Authors: Iraj Daizadeh
Link: https://arxiv.org/abs/2107.05347
Abstract: Background: The US Food and Drug Administration (FDA) regulates medical devices (MD), which are predicated on a concoction of economic and policy forces (e.g., supply/demand, crises, patents). Assuming that the number of FDA MD (Premarketing Notifications (PMN), Approvals (PMAs), and their sum) Applications behaves similarly to those of other econometrics, this work explores the hypothesis of the existence (and, if so, the length scale(s)) of economic cycles (periodicities). Methods: Beyond summary statistics, the monthly (May 1976 to December 2020) number of observed FDA MD Applications is investigated via an assortment of time series techniques (including: Discrete Wavelet Transform, Running Moving Average Filter (RMAF), Complete Ensemble Empirical Mode with Adaptive Noise decomposition (CEEMDAN), and Seasonal Trend Loess (STL) decomposition) to exhaustively search and characterize such periodicities. Results: The data were found to be non-normal, non-stationary (fractional order of integration < 1), non-linear, and strongly persistent (Hurst > 0.5). Importantly, periodicities exist and follow seasonal, 1 year short-term, 5-6 year (Juglar), and a single 24-year medium-term (Kuznets) period (when considering the total number of MD Applications). Economic crises (e.g., COVID-19) do not seem to affect the evolution of the periodicities. Conclusions: This work concludes that (1) PMA and PMN data may be viewed as a proxy measure of the MD industry; (2) periodicities exist in the data with time lengths associated with seasonal/1-year, Juglar and Kuznets effects; (4) these metrics do not seem affected by specific crises (such as COVID-19) (similarly with other econometrics used in periodicity assessments); (5) PMNs and PMAs evolve inversely and suggest a structural industrial transformation; (6) Total MDs are predicted to continue their decline into the mid-2020s prior to recovery.

【7】 Can we improve the environmental benefits of biobased PET production through local biomass value chains? A life cycle assessment perspective

Authors: Carlos Garcia-Velasquez, Yvonne van der Meer
Affiliation: Aachen-Maastricht Institute for Biobased Materials (AMIBM), Maastricht University, Brightlands Chemelot Campus, Urmonderbaan, Geleen, The Netherlands
Link: https://arxiv.org/abs/2107.05251
Abstract: The transition to a low-carbon economy is one of the ambitions of the European Union for 2030. Biobased industries play an essential role in this transition. However, there has been an on-going discussion about the actual benefit of using biomass to produce biobased products, specifically the use of agricultural materials (e.g., corn and sugarcane). This paper presents the environmental impact assessment of 30% and 100% biobased PET (polyethylene terephthalate) production using EU biomass supply chains (e.g., sugar beet, wheat, and Miscanthus). An integral assessment between the life cycle assessment methodology and the global sensitivity assessment is presented as an early-stage support tool to propose and select supply chains that improve the environmental performance of biobased PET production. From the results, Miscanthus is the best option for the production of biobased PET: promoting EU local supply chains, reducing greenhouse gas (GHG) emissions (process and land-use change), and generating lower impacts in midpoint categories related to resource depletion, ecosystem quality, and human health. This tool can help improve the environmental performance of processes that could boost the shift to a low-carbon economy.

【8】 Deep Risk Model: A Deep Learning Solution for Mining Latent Risk Factors to Improve Covariance Matrix Estimation

Authors: Hengxu Lin, Dong Zhou, Weiqing Liu, Jiang Bian
Affiliations: Columbia Business School, New York, United States; Microsoft Research, Beijing, China
Link: https://arxiv.org/abs/2107.05201
Abstract: Modeling and managing portfolio risk is perhaps the most important step to achieve growing and preserving investment performance. Within the modern portfolio construction framework built on Markowitz's theory, the covariance matrix of stock returns is required to model the portfolio risk. Traditional approaches to estimate the covariance matrix are based on human-designed risk factors, which often require tremendous time and effort to design better risk factors to improve the covariance estimation. In this work, we formulate the quest of mining risk factors as a learning problem and propose a deep learning solution to effectively "design" risk factors with neural networks. The learning objective is carefully set to ensure the learned risk factors are effective in explaining stock returns as well as have desired orthogonality and stability. Our experiments on the stock market data demonstrate the effectiveness of the proposed method: our method can obtain 1.9% higher explained variance measured by R^2 and also reduce the risk of a global minimum variance portfolio. Incremental analysis further supports our design of both the architecture and the learning objective.

【9】 Recursive Utility with Investment Gains and Losses: Existence, Uniqueness, and Convergence

Authors: Jing Guo, Xue Dong He
Link: https://arxiv.org/abs/2107.05163
Abstract: We consider a generalization of the recursive utility model by adding a new component that represents utility of investment gains and losses. We also study the utility process in this generalized model with constant elasticity of intertemporal substitution and relative risk aversion degree, and with infinite time horizon. In a specific, finite-state Markovian setting, we prove that the utility process uniquely exists when the agent derives nonnegative gain-loss utility, and that it can be non-existent or non-unique otherwise. Moreover, we prove that the utility process, when it uniquely exists, can be computed by starting from any initial guess and applying the recursive equation that defines the utility process repeatedly. We then consider a portfolio selection problem with gain-loss utility and solve it by proving that the corresponding dynamic programming equation has a unique solution. Finally, we extend certain previous results to the case in which the state space is infinite.
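
The "start from any initial guess and iterate" idea can be illustrated on a much simpler recursion than the one studied in the paper. The sketch below uses plain additive discounted utility, U(s) = u(s) + beta * E[U(s') | s], on a two-state Markov chain. This stand-in recursion, the flow utilities, the transition matrix, and the function name are all assumptions; the paper's recursion with gain-loss utility is not reproduced here.

```python
def iterate_utility(flow, trans, beta, guess, tol=1e-12, max_iter=10000):
    """Repeatedly apply U <- flow + beta * trans @ U until successive iterates agree."""
    n = len(flow)
    utility = list(guess)
    for _ in range(max_iter):
        updated = [flow[s] + beta * sum(trans[s][t] * utility[t] for t in range(n))
                   for s in range(n)]
        if max(abs(a - b) for a, b in zip(updated, utility)) < tol:
            return updated
        utility = updated
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical two-state economy (good state, bad state).
flow = [1.0, 0.5]                   # per-period utility in each state
trans = [[0.9, 0.1], [0.2, 0.8]]    # row-stochastic transition matrix
beta = 0.95                         # discount factor

from_zero = iterate_utility(flow, trans, beta, guess=[0.0, 0.0])
from_far = iterate_utility(flow, trans, beta, guess=[1000.0, -1000.0])
print(from_zero)
print(from_far)
```

Because this simplified map is a contraction with modulus `beta`, both starting points lead to the same limit. The paper establishes an analogous convergence property for its own, more general recursion under the conditions stated in the abstract.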

【10】 The Experimenters' Dilemma: Inferential Preferences over Populations

Authors: Neeraja Gupta, Luca Rigott, Alistair Wilson
Link: https://arxiv.org/abs/2107.05064
Abstract: We compare three populations commonly used in experiments by economists and other social scientists: undergraduate students at a physical location (lab), Amazon's Mechanical Turk (MTurk), and Prolific. The comparison is made along three dimensions: the noise in the data due to inattention, the cost per observation, and the elasticity of response. We draw samples from each population, examining decisions in four one-shot games with varying tensions between the individual and socially efficient choices. When there is no tension, where individual and pro-social incentives coincide, noisy behavior accounts for 60% of the observations on MTurk, 19% on Prolific, and 14% for the lab. Taking costs into account, if noisy data is the only concern, Prolific dominates from an inferential power point of view, combining relatively low noise with a cost per observation one fifth of the lab's. However, because the lab population is more sensitive to treatment, across our main PD game comparison the lab still outperforms both Prolific and MTurk.

【11】 E-Learning and its Socioeconomics

Authors: Avni Singh
Comments: 39 pages
Link: https://arxiv.org/abs/2107.05041
Abstract: While controversial, e-learning has become an essential tool for all kinds of education, especially the kindergarten-to-twelfth sector. However, pockets of this sector lack access, mainly economically underserved students. This paper explores the options available to underserved and aptly resourced members of the kindergarten-to-twelfth educational sector: a 250-million-person market, with only 9 million students enrolled in online education. The paper also provides a brief overview of the options and challenges of making e-learning available to everyone in the kindergarten-to-twelfth educational sector. To establish whether e-learning is beneficial, it also discusses the results of a survey conducted on students and educators who have experienced e-learning, with the results showing that it is beneficial, with a general trend of teachers showing more comfort with online learning than students. The paper utilizes primary and secondary resources for this purpose, with information both from the internet and from surveys conducted among people from the system: parents, students, and teachers.

【12】 Sustained cost declines in solar PV and battery storage needed to eliminate coal generation in India

Authors: Aniruddh Mohan, Shayak Sengupta, Parth Vaishnav, Rahul Tongia, Asim Ahmed, Ines L. Azevedo
Affiliations: Department of Engineering and Public Policy, Carnegie Mellon University, Pittsburgh, USA; School for Environment and Sustainability, University of Michigan, Ann Arbor, USA; Centre for Social and Economic Progress, New Delhi, India
Link: https://arxiv.org/abs/2107.04928
Abstract: Unabated coal power in India must be phased out by mid-century to achieve global climate targets under the Paris Agreement. Here we estimate the costs of hybrid power plants - lithium-ion battery storage with wind and solar PV - to replace coal generation. We design least cost mixes of these technologies to supply baseload and load-following generation profiles in three Indian states - Karnataka, Gujarat, and Tamil Nadu. Our analysis shows that availability of low cost capital, solar PV installation costs below $300/kW, and battery storage capacity costs below $75/kWh will be required to phase out existing coal power plants. Phaseout by 2040 requires a 5% annual decline in the cost of hybrid systems over the next two decades. Solar PV is more suited to pairing with short duration storage than wind power. Our results describe the challenging technological and policy advances needed to achieve the goals of the Paris Agreement.
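
The "5% annual decline over two decades" figure compounds to a large cumulative drop, which a few lines of arithmetic make visible. The helper names and the normalized starting cost of 1.0 are assumptions for illustration; no actual cost data from the paper is used.

```python
def cost_after(initial_cost, annual_decline, years):
    """Cost after compounding a constant fractional decline for `years` years."""
    return initial_cost * (1.0 - annual_decline) ** years


def years_to_reach(initial_cost, target_cost, annual_decline):
    """Smallest whole number of years until cost is at or below target_cost."""
    years = 0
    cost = initial_cost
    while cost > target_cost:
        cost *= 1.0 - annual_decline
        years += 1
    return years


# Normalized starting cost of 1.0 (hypothetical), declining 5% per year.
remaining = cost_after(1.0, 0.05, 20)
print(f"fraction of today's cost after 20 years: {remaining:.3f}")
print(f"years to halve the cost: {years_to_reach(1.0, 0.5, 0.05)}")
```

Under these assumptions, a steady 5% annual decline leaves roughly 36% of the starting cost after 20 years, meaning a cumulative reduction of about 64%.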

【13】 Geographic Spillover Effects of Prescription Drug Monitoring Programs (PDMPs)

Authors: Daniel Guth, Shiyu Zhang
Link: https://arxiv.org/abs/2107.04925
Abstract: Prescription Drug Monitoring Programs (PDMPs) seek to potentially reduce opioid misuse by restricting the sale of opioids in a state. We examine discontinuities along state borders, where one side may have a PDMP and the other side may not. We find that electronic PDMP implementation, whereby doctors and pharmacists can observe a patient's opioid purchase history, reduces a state's opioid sales but increases opioid sales in neighboring counties on the other side of the state border. We also find systematic differences in opioid sales and mortality between border counties and interior counties. These differences decrease when neighboring states both have ePDMPs, which is consistent with the hypothesis that individuals cross state lines to purchase opioids. Our work highlights the importance of understanding the opioid market as connected across counties or states, as we show that states are affected by the opioid policies of their neighbors.

【14】 The unreasonable effectiveness of optimal transport in economics

Authors: Alfred Galichon
Comments: Dedicated to the memory of Emmanuel Farhi. Submitted to the proceedings of the 2020 World Congress of the Econometric Society
Link: https://arxiv.org/abs/2107.04700
Abstract: Optimal transport has become part of the standard quantitative economics toolbox. It is the framework of choice to describe models of matching with transfers, but beyond that, it allows one to: extend quantile regression; identify discrete choice models; provide new algorithms for computing the random coefficient logit model; and generalize the gravity model in trade. This paper offers a brief review of the basics of the theory, its applications to economics, and some extensions.

【15】 Waiting to Borrow From a 457(b) Plan

Authors: Alex Garivaltis
Comments: 52 pages, 21 figures
Link: https://arxiv.org/abs/2107.04698
Abstract: This paper formulates and solves the optimal stopping problem for a loan made to one's self from a tax-advantaged retirement account such as a 401(k), 403(b), or 457(b) plan. If the plan participant has access to an external asset with a higher expected rate of return than the investment funds and indices that are available within the retirement account, then he must decide how long to wait before exercising the loan option. On the one hand, taking the loan quickly will result in many years of exponential capital growth at the higher (external) rate; on the other hand, if we wait to accumulate more funds in the 457(b), then we can make a larger deposit into the external asset (albeit for a shorter period of time). I derive a variety of cutoff rules for optimal loan control; in general, the investor must wait until he accumulates a certain amount of money (measured in contribution-years) that depends on the disparate yields, the loan parameters, and the date certain at which he will liquidate the retirement account. Letting the horizon tend to infinity, the optimal (horizon-free) policy gains in elegance, simplicity, and practical robustness to different life outcomes. When asset prices and returns are stochastic, the (continuous time) cutoff rule turns into a "wait region," whereby the mean of terminal wealth is rising and the variance of terminal wealth is falling. After his sojourn through the wait region is over, the participant finds himself on the mean-variance frontier, at which point his subsequent behavior is a matter of personal risk preference.

【16】 End-to-End Risk Budgeting Portfolio Optimization with Neural Networks

Authors: Ayse Sinem Uysal, Xiaoyue Li, John M. Mulvey
Affiliation: Department of Operations Research and Financial Engineering, Princeton University
Link: https://arxiv.org/abs/2107.04636
Abstract: Portfolio optimization has been a central problem in finance, often approached with two steps: calibrating the parameters and then solving an optimization problem. Yet, the two-step procedure sometimes encounters the "error maximization" problem where inaccuracy in parameter estimation translates to unwise allocation decisions. In this paper, we combine the prediction and optimization tasks in a single feed-forward neural network and implement an end-to-end approach, where we learn the portfolio allocation directly from the input features. Two end-to-end portfolio constructions are included: a model-free network and a model-based network. The model-free approach is seen as a black-box, whereas in the model-based approach, we learn the optimal risk contribution on the assets and solve the allocation with an implicit optimization layer embedded in the neural network. The model-based end-to-end framework provides robust performance in the out-of-sample (2017-2021) tests when maximizing Sharpe ratio is used as the training objective function, achieving a Sharpe ratio of 1.16 when nominal risk parity yields 0.79 and equal-weight fix-mix yields 0.83. Noticing that risk-based portfolios can be sensitive to the underlying asset universe, we develop an asset selection mechanism embedded in the neural network with stochastic gates, in order to prevent the portfolio from being hurt by the low-volatility assets with low returns. The gated end-to-end with filter outperforms the nominal risk-parity benchmarks with naive filtering mechanism, boosting the Sharpe ratio of the out-of-sample period (2017-2021) to 1.24 in the market data.
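
Two quantities in this abstract, per-asset risk contributions and the Sharpe ratio, have standard textbook definitions that can be sketched directly. The code below is not the paper's neural network or its implicit optimization layer. The covariance matrix, the return series, the inverse-volatility shortcut (which coincides with risk parity only when assets are uncorrelated), and all function names are assumptions for illustration.

```python
import math
import statistics


def portfolio_variance(weights, cov):
    """w' * Sigma * w for a list of weights and a nested-list covariance matrix."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n))


def risk_contributions(weights, cov):
    """Share of portfolio variance attributable to each asset (shares sum to 1)."""
    n = len(weights)
    variance = portfolio_variance(weights, cov)
    marginal = [sum(cov[i][j] * weights[j] for j in range(n)) for i in range(n)]
    return [weights[i] * marginal[i] / variance for i in range(n)]


def inverse_vol_weights(cov):
    """Weights proportional to 1/volatility; equals risk parity if correlations are zero."""
    inverse = [1.0 / math.sqrt(cov[i][i]) for i in range(len(cov))]
    total = sum(inverse)
    return [v / total for v in inverse]


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


# Hypothetical two-asset universe: uncorrelated, volatilities of 10% and 20%.
cov = [[0.01, 0.0], [0.0, 0.04]]
print(risk_contributions([0.5, 0.5], cov))                 # equal weights
print(risk_contributions(inverse_vol_weights(cov), cov))   # equalized risk shares

# Hypothetical daily returns, for illustration only.
daily = [0.004, -0.002, 0.003, 0.001, -0.001, 0.002]
print(sharpe_ratio(daily))
```

With equal weights the more volatile asset dominates portfolio risk, while the inverse-volatility weights equalize the two contributions. Learning a target risk budget, as the abstract describes, generalizes this equal split.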

【17】 Collective intelligence and the blockchain: Technology, communities and social experiments

Authors: Andrea Baronchelli
Affiliations: City University of London, Department of Mathematics, London, UK; The Alan Turing Institute, British Library, Euston Road, London, UK; UCL Centre for Blockchain Technologies, University College London, London, UK
Comments: Brief "perspective" commentary piece
Link: https://arxiv.org/abs/2107.05527
Abstract: Blockchains are still perceived chiefly as a new technology. But each blockchain is also a community and a social experiment, built around social consensus. Here I discuss three examples showing how collective intelligence can help, threaten or capitalize on blockchain-based ecosystems. They concern the immutability of smart contracts, code transparency and new forms of property. The examples show that more research, new norms and, eventually, laws are needed to manage the interaction between collective behaviour and the blockchain technology. Insights from researchers in collective intelligence can help society rise up to the challenge.

2. cs.SD (Speech):

【1】 Calliope -- A Polyphonic Music Transformer

Authors: Andrea Valenti, Stefano Berti, Davide Bacciu
Affiliation: University of Pisa - Dept. of Computer Science, Largo B. Pontecorvo, Pisa - Italy
Comments: Accepted at ESANN2021
Link: https://arxiv.org/abs/2107.05546
Abstract: The polyphonic nature of music makes the application of deep learning to music modelling a challenging task. On the other hand, the Transformer architecture seems to be a good fit for this kind of data. In this work, we present Calliope, a novel autoencoder model based on Transformers for the efficient modelling of multi-track sequences of polyphonic music. The experiments show that our model is able to improve the state of the art on musical sequence reconstruction and generation, with remarkably good results especially on long sequences.

【2】 DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement

Authors: Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu
Affiliations: Key Laboratory of Modern Acoustics, Nanjing University, Nanjing, China; NJU-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing, China; Nanjing Institute of Advanced Artificial Intelligence, Nanjing, China
Comments: 5 pages, 1 figure, accepted by Interspeech 2021
Link: https://arxiv.org/abs/2107.05429
Abstract: The dual-path RNN (DPRNN) was proposed to more effectively model extremely long sequences for speech separation in the time domain. By splitting long sequences into smaller chunks and applying intra-chunk and inter-chunk RNNs, the DPRNN reached promising performance in speech separation with a limited model size. In this paper, we combine the DPRNN module with Convolution Recurrent Network (CRN) and design a model called Dual-Path Convolution Recurrent Network (DPCRN) for speech enhancement in the time-frequency domain. We replace the RNNs in the CRN with DPRNN modules, where the intra-chunk RNNs are used to model the spectrum pattern in a single frame and the inter-chunk RNNs are used to model the dependence between consecutive frames. With only 0.8M parameters, the submitted DPCRN model achieves an overall mean opinion score (MOS) of 3.57 in the wide band scenario track of the Interspeech 2021 Deep Noise Suppression (DNS) challenge. Evaluations on some other test sets also show the efficacy of our model.
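
The dual-path processing order (first within each chunk, then across chunks at each position) can be sketched without any deep learning library. In the sketch below a toy scalar recurrence stands in for the RNNs, chunks are non-overlapping, and all names are hypothetical. These are simplifying assumptions, not the DPRNN or DPCRN implementation. In the DPCRN setting described above, a "chunk" would correspond to one time frame and a "position" to a frequency bin.

```python
def split_into_chunks(seq, chunk_len, pad_value=0.0):
    """Zero-pad seq to a multiple of chunk_len and split into non-overlapping chunks."""
    padded = list(seq) + [pad_value] * (-len(seq) % chunk_len)
    return [padded[i:i + chunk_len] for i in range(0, len(padded), chunk_len)]


def toy_recurrence(xs, decay=0.5):
    """Stand-in for an RNN: h_t = decay * h_{t-1} + x_t, returning every h_t."""
    hidden, outputs = 0.0, []
    for x in xs:
        hidden = decay * hidden + x
        outputs.append(hidden)
    return outputs


def dual_path_block(chunks):
    """Apply the recurrence inside each chunk, then across chunks at each position."""
    # Intra-chunk path: short-range structure within a chunk.
    intra = [toy_recurrence(chunk) for chunk in chunks]
    n_chunks, chunk_len = len(intra), len(intra[0])
    # Inter-chunk path: long-range structure across chunks, one position at a time.
    columns = [toy_recurrence([intra[c][p] for c in range(n_chunks)])
               for p in range(chunk_len)]
    # Transpose back to the (chunk, position) layout.
    return [[columns[p][c] for p in range(chunk_len)] for c in range(n_chunks)]


signal = [float(i) for i in range(10)]
chunks = split_into_chunks(signal, chunk_len=4)
out = dual_path_block(chunks)
print(len(chunks), len(chunks[0]))
print(out)
```

Each recurrence only ever runs over `chunk_len` or `n_chunks` steps rather than the full sequence length, which is the property that makes the dual-path layout attractive for very long inputs.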

【3】 End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning

Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima
Affiliations: NTT Media Intelligence Laboratories, NTT Corporation
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2107.05382
Abstract: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems from small-scale rich transcription-style and large-scale common transcription-style datasets. In spontaneous speech tasks, various speech phenomena such as fillers, word fragments, laughter and coughs are often included. While common transcriptions do not pay special attention to these phenomena, rich transcriptions explicitly convert them into special phenomenon tokens as well as textual tokens. In previous studies, the textual and phenomenon tokens were simultaneously estimated in an end-to-end manner. However, it is difficult to build accurate RT-ASR systems because large-scale rich transcription-style datasets are often unavailable. To solve this problem, our training method uses a limited rich transcription-style dataset and a common transcription-style dataset simultaneously. The key process in our semi-supervised learning is to convert the common transcription-style dataset into a pseudo-rich transcription-style dataset. To this end, we introduce style tokens, which control whether phenomenon tokens are generated, into Transformer-based autoregressive modeling. We use this modeling for generating the pseudo-rich transcription-style datasets and for building an RT-ASR system from the pseudo and original datasets. Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.

【4】 Oriental Language Recognition (OLR) 2020: Summary and Analysis

Authors: Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li, Qingyang Hong, Dong Wang
Affiliations: School of Informatics, Xiamen University, China; School of Electronic Science and Engineering, Xiamen University, China; Center for Speech and Language Technologies, Tsinghua University, China
Link: https://arxiv.org/abs/2107.05365
Abstract: The fifth Oriental Language Recognition (OLR) Challenge focuses on language recognition in a variety of complex environments to promote its development. The OLR 2020 Challenge includes three tasks: (1) cross-channel language identification, (2) dialect identification, and (3) noisy language identification. We choose Cavg as the principal evaluation metric, and the Equal Error Rate (EER) as the secondary metric. There were 58 teams participating in this challenge and one third of the teams submitted valid results. Compared with the best baseline, the Cavg values of the top-1 system for the three tasks were reduced by 82%, 62% and 48% in relative terms, respectively. This paper describes the three tasks, the database profile, and the final results. We also outline the novel approaches that improve the performance of language recognition systems most significantly, such as the utilization of auxiliary information.
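
The secondary metric named above, the Equal Error Rate, is the operating point at which the false-acceptance and false-rejection rates coincide. Below is a minimal Python sketch, assuming a plain threshold sweep over the observed scores; the challenge's official scoring tools may compute it differently, and the function name and scores are illustrative only.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Made-up detection scores for one language.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

With these toy scores, one of four target trials and one of four non-target trials fall on the wrong side of the best threshold, so both error rates equal 0.25.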

【5】 MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

Authors: Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang
Affiliations: Y.-H. Chou is also affiliated with National Taiwan University; I.-C. Chen with National Tsing Hua University; and Y.-H. Yang with Taiwan AI Labs
Link: https://arxiv.org/abs/2107.05223
Abstract: This paper presents an attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model over 4,166 pieces of polyphonic piano MIDI files for tackling a number of symbolic-domain discriminative music understanding tasks. These include two note-level classification tasks, i.e., melody extraction and velocity prediction, as well as two sequence-level classification tasks, i.e., composer classification and emotion classification. We find that, given a pre-trained Transformer, our models outperform recurrent neural network based baselines with less than 10 epochs of fine-tuning. Ablation studies show that the pre-training remains effective even if none of the MIDI data of the downstream tasks are seen at the pre-training stage, and that freezing the self-attention layers of the Transformer at the fine-tuning stage slightly degrades performance. All five datasets employed in this work are publicly available, as well as checkpoints of our pre-trained and fine-tuned models. As such, our research can be taken as a benchmark for symbolic-domain music understanding.
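
Masked language modeling, as mentioned in this abstract, hides part of the input sequence and trains the model to recover it. The Python sketch below is a simplified illustration, not the paper's pipeline: the note-name tokens and the `mask_tokens` helper are hypothetical, the 15% rate is BERT's default, and BERT's extra step of sometimes substituting a random token or leaving the selected token unchanged is omitted.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted input, labels); labels are None where no loss applies."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            corrupted.append(tok)
            labels.append(None)  # position ignored by the training loss
    return corrupted, labels


# Hypothetical token sequence standing in for an encoded MIDI piece.
tokens = ["C4", "E4", "G4", "C5"] * 25
corrupted, labels = mask_tokens(tokens)
```

During pre-training, the loss would be computed only at positions whose label is not `None`.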

【6】 Neural Waveshaping Synthesis

Authors: Ben Hayes, Charalampos Saitis, György Fazekas
Affiliations: Centre for Digital Music, Queen Mary University of London
Comments: Accepted to ISMIR 2021; see the online supplement linked from the arXiv page
Link: https://arxiv.org/abs/2107.05050
Abstract: We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.
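
The core mechanism described here, a nonlinear transfer function wrapped in affine transformations of its input and output, can be sketched in a few lines of Python. This is an assumption-laden illustration: `shaper` is a fixed sine function standing in for the learned MLP, and the gain and bias parameters are hypothetical names, not the paper's.

```python
import math


def shaper(x):
    """Fixed stand-in for a learned transfer function (not the paper's MLP)."""
    return math.sin(2.0 * x)


def waveshape(signal, in_gain=1.0, in_bias=0.0, out_gain=1.0, out_bias=0.0):
    """Apply an affine map, the shaping function, then another affine map."""
    return [out_gain * shaper(in_gain * x + in_bias) + out_bias for x in signal]


# A short sinusoidal exciter at an assumed 220 Hz, sampled at 16 kHz.
sample_rate, f0 = 16000, 220.0
exciter = [math.sin(2.0 * math.pi * f0 * n / sample_rate) for n in range(64)]
soft = waveshape(exciter, in_gain=0.2)    # nearly linear region of the shaper
bright = waveshape(exciter, in_gain=3.0)  # driven harder, richer in harmonics
```

Changing only `in_gain` moves the exciter across different regions of the same transfer function, which is how one fixed shaper can yield an evolving timbre.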

【7】 Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings

Authors: Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou
Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China
Link: https://arxiv.org/abs/2107.05038
Abstract: The use of phonological features (PFs) potentially allows language-specific phones to remain linked in training, which is highly desirable for information sharing for multilingual and crosslingual speech recognition methods for low-resourced languages. A drawback suffered by previous methods in using phonological features is that the acoustic-to-PF extraction in a bottom-up way is itself difficult. In this paper, we propose to join phonology driven phone embedding (top-down) and deep neural network (DNN) based acoustic feature extraction (bottom-up) to calculate phone probabilities. The new method is called JoinAP (Joining of Acoustics and Phonology). Remarkably, no inversion from acoustics to phonological features is required for speech recognition. For each phone in the IPA (International Phonetic Alphabet) table, we encode its phonological features to a phonological-vector, and then apply linear or nonlinear transformation of the phonological-vector to obtain the phone embedding. A series of multilingual and crosslingual (both zero-shot and few-shot) speech recognition experiments are conducted on the CommonVoice dataset (German, French, Spanish and Italian) and the AISHELL-1 dataset (Mandarin), and demonstrate the superiority of JoinAP with nonlinear phone embeddings over both JoinAP with linear phone embeddings and the traditional method with flat phone embeddings.
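
The linear variant described above (phonological vector, then a linear map to a phone embedding, then a score against the acoustic feature) can be sketched in Python as follows. All numbers are invented for illustration: the three binary features, the matrix `A`, and the `acoustic` vector are hypothetical, and taking a dot product followed by a softmax is an assumption about how the two sides are joined.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical binary phonological vectors: [voiced, nasal, labial].
phones = {"p": [0, 0, 1], "b": [1, 0, 1], "m": [1, 1, 1]}
# Hypothetical 2 x 3 linear map from phonological vector to phone embedding.
A = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
# Stand-in for the DNN's acoustic feature at one frame.
acoustic = [2.0, -1.0]

embeddings = {ph: matvec(A, vec) for ph, vec in phones.items()}
logits = [sum(e * h for e, h in zip(embeddings[ph], acoustic)) for ph in phones]
probs = dict(zip(phones, softmax(logits)))
```

Because every phone's embedding is derived from shared features, a phone unseen in one language still receives a meaningful embedding, which is the property the abstract relies on for zero-shot recognition.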

【8】 PocketVAE: A Two-step Model for Groove Generation and Control

Authors: Kyungyun Lee, Wonil Kim, Juhan Nam
Affiliations: Graduate School of Culture Technology, KAIST
Link: https://arxiv.org/abs/2107.05009
Abstract: Creating a good drum track to imitate a skilled performer in digital audio workstations (DAWs) can be a time-consuming process, especially for those unfamiliar with drums. In this work, we introduce PocketVAE, a groove generation system that applies grooves to users' rudimentary MIDI tracks, i.e., templates. Grooves can be transferred from a reference track, generated randomly, or generated under conditions such as genre. Our system, consisting of different modules for each groove component, takes a two-step approach that is analogous to a music creation process. First, the note module updates the user template through addition and deletion of notes; second, the velocity and microtiming modules add details to this generated note score. In order to model the drum notes, we apply a discrete latent representation method via Vector Quantized Variational Autoencoder (VQ-VAE), as drum notes have a discrete property, unlike velocity and microtiming values. We show that our two-step approach and the usage of a discrete encoding space improve the learning of the original data distribution. Additionally, we discuss the benefit of incorporating control elements (genre, velocity and microtiming patterns) into the model.
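
The discrete latent representation mentioned above rests on the vector-quantization step of a VQ-VAE: each encoder output is replaced by its nearest codebook entry. A minimal Python sketch of that lookup follows; the codebook values and the `quantize` name are made up, and training of the codebook is not shown.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 3-entry codebook in a 2-dimensional latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
```

Here the encoder output `[0.9, 0.2]` snaps to entry 1, so the downstream decoder only ever sees one of a finite set of latent codes.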

【9】 ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

Authors: Kin Wai Cheuk, Dorien Herremans, Li Su
Affiliations: Information Systems Technology and Design, Singapore University of Technology and Design; Institute of Information Science, Academia Sinica, Taiwan
Comments: Accepted in ACMMM 21
Link: https://arxiv.org/abs/2107.04954
Abstract: Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which translates into an improvement of 22.2% and 62.5% compared to the supervised baseline model. Our proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications whereby new data is constantly available.

【10】 Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. A BirdCLEF 2021 Solution

Authors: Marcos V. Conde, Kumar Shubham, Prateek Agnihotri, Nitin D. Movva, Szilard Bessenyei
Affiliations: Universidad de Valladolid, Spain; Jio Saavn, India; Clairvoyant.ai, India (equal contribution noted)
Comments: Proceedings Working Notes CEURWS @ CLEF 2021 - BirdCLEF 2021
Link: https://arxiv.org/abs/2107.04878
Abstract: It is easier to hear birds than to see them. They nevertheless play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to detect and classify bird sounds; by doing this, we can assist researchers in monitoring the status and trends of bird populations and biodiversity in ecosystems. We propose a sound detection and classification pipeline for analyzing complex soundscape recordings and identifying birdcalls in the background. Our pipeline learns from weak labels, classifies fine-grained bird vocalizations in the wild, and is robust against background sounds (e.g., airplanes, rain, etc.). Our solution achieved 10th place of 816 teams at the BirdCLEF 2021 Challenge hosted on Kaggle.

【11】 Variational Information Bottleneck for Effective Low-resource Audio Classification

Authors: Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao
Affiliations: Ping An Technology (Shenzhen) Co., Ltd.; Hong Kong University of Science and Technology; University of Science and Technology of China
Comments: Accepted by InterSpeech 2021
Link: https://arxiv.org/abs/2107.04803
Abstract: Large-scale deep neural networks (DNNs) such as convolutional neural networks (CNNs) have achieved impressive performance in audio classification for their powerful capacity and strong generalization ability. However, when training a DNN model on low-resource tasks, it is usually prone to overfitting the small data and learning too much redundant information. To address this issue, we propose to use variational information bottleneck (VIB) to mitigate overfitting and suppress irrelevant information. In this work, we conduct experiments on a 4-layer CNN. However, the VIB framework is ready-to-use and could be easily utilized with many other state-of-the-art network architectures. Evaluation on a few audio datasets shows that our approach significantly outperforms baseline methods, yielding more than 5.0% improvement in terms of classification accuracy in some low-resource settings.
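
A variational information bottleneck objective is commonly written as a classification loss plus a weighted KL term that pulls the latent Gaussian toward a standard normal prior. The Python sketch below shows that common form under stated assumptions: a diagonal Gaussian latent, a standard normal prior, and an arbitrary `beta` of 1e-3; the paper's exact loss and hyperparameters are not given in the abstract.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def vib_loss(cross_entropy, mu, log_var, beta=1e-3):
    """Classification loss plus the beta-weighted compression penalty."""
    return cross_entropy + beta * kl_to_standard_normal(mu, log_var)


# A latent that already matches the prior adds no penalty.
no_penalty = vib_loss(0.7, mu=[0.0, 0.0], log_var=[0.0, 0.0])
# Shifting the mean by one unit in one dimension costs 0.5 nats of KL.
shifted_kl = kl_to_standard_normal([1.0], [0.0])
```

A larger `beta` compresses the latent more aggressively, which is the lever for suppressing redundant information on small datasets.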

【12】 Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Authors: Tamás Gábor Csapó
Affiliations: Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics
Comments: Accepted at SSW11 (11th Speech Synthesis Workshop). arXiv admin note: text overlap with arXiv:2107.02003
Link: https://arxiv.org/abs/2107.05550
Abstract: In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue image targets. We extend a traditional (vocoder-based) DNN-TTS framework by predicting PCA-compressed ultrasound images, from which the continuous tongue motion can be reconstructed in synchrony with synthesized speech. We use the data of eight speakers, train fully connected and recurrent neural networks, and show that FC-DNNs are more suitable for the prediction of sequential data than LSTMs in the case of limited training data. Objective experiments and visualized predictions show that the proposed solution is feasible and the generated ultrasound videos are close to natural tongue movement. Articulatory movement prediction from text input can be useful for audiovisual speech synthesis or computer-assisted pronunciation training.

【13】 A Deep-Bayesian Framework for Adaptive Speech Duration Modification

Authors: Ravi Shankar, Archana Venkataraman
Affiliations: Department of Electrical and Computer Engineering, Johns Hopkins University
Comments: 6 pages, 7 figures
Link: https://arxiv.org/abs/2107.04973
Abstract: We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to produce this attention map via a stochastic version of the mean absolute error loss function; our model also predicts the length of the target speech signal using the encoder embeddings. The predicted length determines the number of steps for the decoder operation. During inference, we generate the attention map as a proxy for the similarity matrix between the given input speech and an unknown target speech signal. Using this similarity matrix, we compute a warping path of alignment between the two signals. Our experiments demonstrate that this adaptive framework produces similar results to dynamic time warping, which relies on a known target signal, on both voice conversion and emotion conversion tasks. We also show that our technique results in a high quality of generated speech that is on par with state-of-the-art vocoders.
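
The abstract compares against dynamic time warping, which derives a warping path from a pairwise cost matrix between two sequences. For readers unfamiliar with it, here is a minimal textbook DTW in Python. It is the classical baseline, not the paper's attention-based method, and the toy sequences are invented.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic alignment path through `cost`."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                prev = 0.0
            else:
                prev = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = cost[i][j] + prev
    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1][j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i][j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return path[::-1]


# Toy feature sequences: the target holds the middle value one frame longer.
source = [0.0, 1.0, 2.0]
target = [0.0, 1.0, 1.0, 2.0]
cost = [[abs(s - t) for t in target] for s in source]
path = dtw_path(cost)
```

The resulting path maps source frame 1 to target frames 1 and 2, i.e. that frame is stretched. DTW needs the target sequence to build `cost`, which is the dependency the proposed method avoids by predicting the attention map instead.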

3. eess.AS (Audio Processing):

【1】 Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Link: https://arxiv.org/abs/2107.05550
Cross-listed with cs.SD; the full entry appears as cs.SD 【12】 above.

【2】 Sound Event Detection: A Tutorial

Authors: Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley
Affiliations: Tampere University, Finland; University of Surrey, UK
Comments: To appear in IEEE Signal Processing Magazine, Volume 38, Issue 5
Link: https://arxiv.org/abs/2107.05463
Abstract: The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal. This paper gives a tutorial presentation of sound event detection, including its definition, signal processing and machine learning approaches, evaluation, and future perspectives.
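
Recognizing "when" a sound is active usually ends with a post-processing step that turns frame-level probabilities into onset and offset times. The Python sketch below shows one simple way to do that for a single sound class. The 0.5 threshold and the hop size are assumptions chosen for illustration, not values from the tutorial.

```python
def frames_to_events(probs, threshold=0.5, hop_seconds=0.02):
    """Convert per-frame activity probabilities to (onset, offset) pairs."""
    events, start = [], None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # event begins at this frame
        elif not active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # The event is still active when the recording ends.
        events.append((start * hop_seconds, len(probs) * hop_seconds))
    return events


# Made-up classifier outputs for one class, one frame per second.
events = frames_to_events([0.1, 0.8, 0.9, 0.2, 0.7], hop_seconds=1.0)
```

With these values the class is reported active from 1 s to 3 s and again from 4 s to the end of the clip at 5 s.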

【3】 UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

Authors: Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei
Affiliations: Microsoft
Link: https://arxiv.org/abs/2107.05233
Abstract: Recently, there has been a vast interest in self-supervised learning (SSL) where the model is pre-trained on large scale unlabeled data and then fine-tuned on a small labeled dataset. The common wisdom is that SSL helps resource-limited tasks in which only a limited amount of labeled data is available. The benefit of SSL keeps diminishing when the labeled training data amount increases. To the best of our knowledge, at most a few thousand hours of labeled data have been used in the study of SSL. In contrast, the industry usually uses tens of thousands of hours of labeled data to build high-accuracy speech recognition (ASR) systems for resource-rich languages. In this study, we take the challenge to investigate whether and how SSL can improve the ASR accuracy of a state-of-the-art production-scale Transformer-Transducer model, which was built with 65 thousand hours of anonymized labeled EN-US data.

【4】 Perceptual-based deep-learning denoiser as a defense against adversarial attacks on ASR systems

Authors: Anirudh Sreeram, Nicholas Mehlman, Raghuveer Peri, Dillon Knox, Shrikanth Narayanan
Affiliations: Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California
Comments: 5 pages, 4 figures, submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.05222
Abstract: In this paper we investigate speech denoising as a defense against adversarial attacks on automatic speech recognition (ASR) systems. Adversarial attacks attempt to force misclassification by adding small perturbations to the original speech signal. We propose to counteract this by employing a neural-network based denoiser as a pre-processor in the ASR pipeline. The denoiser is independent of the downstream ASR model, and thus can be rapidly deployed in existing systems. We found that training the denoiser using a perceptually motivated loss function resulted in increased adversarial robustness without compromising ASR performance on benign samples. Our defense was evaluated (as a part of the DARPA GARD program) on the 'Kenansville' attack strategy across a range of attack strengths and speech samples. An average improvement in Word Error Rate (WER) of about 7.7% was observed over the undefended model at 20 dB signal-to-noise-ratio (SNR) attack strength.
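
The attack strength above is stated as a signal-to-noise ratio, where the "noise" is the adversarial perturbation. As a reminder of what 20 dB means, the short Python sketch below computes the SNR of a toy signal and perturbation. The samples are invented and the helper name is illustrative.

```python
import math


def snr_db(signal, perturbation):
    """SNR in decibels: 10 * log10(signal power / perturbation power)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(e * e for e in perturbation) / len(perturbation)
    return 10.0 * math.log10(p_signal / p_noise)


# A perturbation at one tenth of the signal amplitude has 1% of its power.
speech = [1.0, -1.0, 1.0, -1.0]
attack = [0.1, -0.1, 0.1, -0.1]
strength = snr_db(speech, attack)
```

At 20 dB the perturbation carries one hundredth of the speech power, so a lower SNR corresponds to a stronger, more audible attack.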

【5】 A Deep-Bayesian Framework for Adaptive Speech Duration Modification

Link: https://arxiv.org/abs/2107.04973
Cross-listed with cs.SD; the full entry appears as cs.SD 【13】 above.

【6】 Direct speech-to-speech translation with discrete units

Authors: Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu
Affiliations: Facebook AI; Johns Hopkins University
Link: https://arxiv.org/abs/2107.05604
Abstract: We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. Previous work addresses the problem by training an attention-based sequence-to-sequence model that maps source speech spectrograms into target spectrograms. To tackle the challenge of modeling continuous spectrogram features of the target speech, we propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead. When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that predicting discrete units and joint speech and text training improve model performance by 11 BLEU compared with a baseline that predicts spectrograms and bridge 83% of the performance gap towards a cascaded system. When trained without any text transcripts, our model achieves similar performance as a baseline that predicts spectrograms and is trained with text data.

【7】 Calliope -- A Polyphonic Music Transformer

Link: https://arxiv.org/abs/2107.05546
Cross-listed with cs.SD; the full entry appears as cs.SD 【1】 above.

【8】 DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement

Link: https://arxiv.org/abs/2107.05429
Cross-listed with cs.SD; the full entry appears as cs.SD 【2】 above.

【9】 End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning

Link: https://arxiv.org/abs/2107.05382
Cross-listed with cs.SD; the full entry appears as cs.SD 【3】 above.

【10】 MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

Authors: Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang Affiliations: YH Chou is also affiliated with National Taiwan University; IC Chen with National Tsing Hua University; and YH Yang with Taiwan AILabs (e-mail Link: https://arxiv.org/abs/2107.05223 Abstract: This paper presents an attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model over 4,166 pieces of polyphonic piano MIDI files for tackling a number of symbolic-domain discriminative music understanding tasks. These include two note-level classification tasks, i.e., melody extraction and velocity prediction, as well as two sequence-level classification tasks, i.e., composer classification and emotion classification. We find that, given a pre-trained Transformer, our models outperform recurrent neural network based baselines with less than 10 epochs of fine-tuning. Ablation studies show that the pre-training remains effective even if none of the MIDI data of the downstream tasks are seen at the pre-training stage, and that freezing the self-attention layers of the Transformer at the fine-tuning stage slightly degrades performance. All five datasets employed in this work are publicly available, as well as checkpoints of our pre-trained and fine-tuned models. As such, our research can be taken as a benchmark for symbolic-domain music understanding.

【11】 Neural Waveshaping Synthesis

Authors: Ben Hayes, Charalampos Saitis, György Fazekas Affiliations: Centre for Digital Music, Queen Mary University of London Comments: Accepted to ISMIR 2021; See online supplement at this https URL Link: https://arxiv.org/abs/2107.05050 Abstract: We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis which operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.

【12】 Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings

Authors: Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou Affiliations: Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, China Link: https://arxiv.org/abs/2107.05038 Abstract: The use of phonological features (PFs) potentially allows language-specific phones to remain linked in training, which is highly desirable for information sharing for multilingual and crosslingual speech recognition methods for low-resourced languages. A drawback suffered by previous methods in using phonological features is that the acoustic-to-PF extraction in a bottom-up way is itself difficult. In this paper, we propose to join phonology-driven phone embedding (top-down) and deep neural network (DNN) based acoustic feature extraction (bottom-up) to calculate phone probabilities. The new method is called JoinAP (Joining of Acoustics and Phonology). Remarkably, no inversion from acoustics to phonological features is required for speech recognition. For each phone in the IPA (International Phonetic Alphabet) table, we encode its phonological features to a phonological-vector, and then apply linear or nonlinear transformation of the phonological-vector to obtain the phone embedding. A series of multilingual and crosslingual (both zero-shot and few-shot) speech recognition experiments are conducted on the CommonVoice dataset (German, French, Spanish and Italian) and the AISHELL-1 dataset (Mandarin), and demonstrate the superiority of JoinAP with nonlinear phone embeddings over both JoinAP with linear phone embeddings and the traditional method with flat phone embeddings.
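
A rough sketch of the linear variant described in this abstract follows. The abstract does not say how the phone embedding and the acoustic features are combined, so the inner product followed by a softmax is an assumption. The three-feature phonological vectors, the weight matrix, and the names `linear_embed` and `phone_probs` are likewise hypothetical toy choices, not taken from the paper.

```python
import math


def linear_embed(phon_vec, weight):
    """Phone embedding as a linear map of the phonological vector."""
    return [sum(w * p for w, p in zip(row, phon_vec)) for row in weight]


def phone_probs(acoustic, embeddings):
    """Assumed joining rule: inner-product logits followed by a softmax."""
    logits = [sum(a * e for a, e in zip(acoustic, emb)) for emb in embeddings]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical binary features: [voiced, nasal, labial].
phones = {"p": [0, 0, 1], "b": [1, 0, 1], "m": [1, 1, 1]}
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy 2x3 projection
embeddings = [linear_embed(v, weight) for v in phones.values()]
probs = phone_probs([2.0, 2.0], embeddings)  # toy acoustic feature vector
best = list(phones)[probs.index(max(probs))]
```

Because phones that share phonological features share embedding components, a phone never seen in one language can still receive an embedding, which is the information-sharing property the abstract emphasises.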

【13】 PocketVAE: A Two-step Model for Groove Generation and Control

Authors: Kyungyun Lee, Wonil Kim, Juhan Nam Affiliations: Graduate School of Culture Technology, KAIST Link: https://arxiv.org/abs/2107.05009 Abstract: Creating a good drum track to imitate a skilled performer in digital audio workstations (DAWs) can be a time-consuming process, especially for those unfamiliar with drums. In this work, we introduce PocketVAE, a groove generation system that applies grooves to users' rudimentary MIDI tracks, i.e., templates. Grooves can be either transferred from a reference track, generated randomly or with conditions, such as genres. Our system, consisting of different modules for each groove component, takes a two-step approach that is analogous to a music creation process. First, the note module updates the user template through addition and deletion of notes; second, the velocity and microtiming modules add details to this generated note score. In order to model the drum notes, we apply a discrete latent representation method via Vector Quantized Variational Autoencoder (VQ-VAE), as drum notes have a discrete property, unlike velocity and microtiming values. We show that our two-step approach and the usage of a discrete encoding space improve the learning of the original data distribution. Additionally, we discuss the benefit of incorporating control elements (genre, velocity and microtiming patterns) into the model.
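
The discrete latent step mentioned in this abstract relies on vector quantization: each continuous encoder output is replaced by its nearest codebook entry. A minimal sketch of that lookup, with a hypothetical two-dimensional codebook and the illustrative name `quantize` (not PocketVAE's own code), might look like this:

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Toy codebook with three entries (assumption, for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
```

In a full VQ-VAE the codebook is learned jointly with the encoder and decoder; this sketch covers only the nearest-neighbour lookup.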

【14】 ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

Authors: Kin Wai Cheuk, Dorien Herremans, Li Su Affiliations: Information Systems Technology and Design, Singapore University of Technology and Design; Institute of Information Science, Academia Sinica, Taiwan Comments: Accepted in ACMMM 21 Link: https://arxiv.org/abs/2107.04954 Abstract: Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which translates into an improvement of 22.2% and 62.5% compared to the supervised baseline model. Our proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications where new data is constantly available.

【15】 Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. A BirdCLEF 2021 Solution

Authors: Marcos V. Conde, Kumar Shubham, Prateek Agnihotri, Nitin D. Movva, Szilard Bessenyei Affiliations: Universidad de Valladolid, Spain; Jio Saavn, India; Clairvoyant.ai, India. Equal contribution. Comments: Proceedings Working Notes CEURWS @ CLEF 2021 - BirdCLEF 2021 Link: https://arxiv.org/abs/2107.04878 Abstract: It is easier to hear birds than to see them; however, they still play an essential role in nature and they are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to detect and classify bird sounds; by doing this, we can assist researchers in monitoring the status and trends of bird populations and biodiversity in ecosystems. We propose a sound detection and classification pipeline for analyzing complex soundscape recordings and identifying birdcalls in the background. Our pipeline learns from weak labels, classifies fine-grained bird vocalizations in the wild, and is robust against background sounds (e.g., airplanes, rain, etc.). Our solution achieved 10th place of 816 teams at the BirdCLEF 2021 Challenge hosted on Kaggle.

【16】 Variational Information Bottleneck for Effective Low-resource Audio Classification

Authors: Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao Affiliations: Ping An Technology (Shenzhen) Co., Ltd.; Hong Kong University of Science and Technology; University of Science and Technology of China Comments: Accepted by InterSpeech 2021 Link: https://arxiv.org/abs/2107.04803 Abstract: Large-scale deep neural networks (DNNs) such as convolutional neural networks (CNNs) have achieved impressive performance in audio classification for their powerful capacity and strong generalization ability. However, when training a DNN model on low-resource tasks, it is usually prone to overfitting the small data and learning too much redundant information. To address this issue, we propose to use variational information bottleneck (VIB) to mitigate overfitting and suppress irrelevant information. In this work, we conduct experiments on a 4-layer CNN. However, the VIB framework is ready-to-use and could be easily utilized with many other state-of-the-art network architectures. Evaluation on a few audio datasets shows that our approach significantly outperforms baseline methods, yielding more than 5.0% improvement in terms of classification accuracy in some low-resource settings.
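
For readers unfamiliar with VIB, the sketch below shows the commonly used Gaussian-encoder form of the objective: a cross-entropy term plus a KL penalty, weighted by `beta`, that pulls the latent code towards a standard normal prior. This is the textbook formulation, assumed here for illustration. The abstract does not state the exact loss or hyperparameters used in the paper, and the function names and `beta` default are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def vib_loss(class_probs, label, mu, log_var, beta=1e-3):
    """Cross-entropy on the prediction plus a beta-weighted KL bottleneck term."""
    cross_entropy = -math.log(class_probs[label])
    return cross_entropy + beta * gaussian_kl(mu, log_var)


# Toy values: a two-class prediction and a one-dimensional latent code.
loss = vib_loss([0.5, 0.5], label=0, mu=[1.0], log_var=[0.0], beta=2.0)
```

A larger `beta` compresses the latent representation more aggressively, trading some fit on the training data for less retained redundant information.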

【17】 Layer-wise Analysis of a Self-supervised Speech Representation Model

Authors: Ankita Pasad, Ju-Chieh Chou, Karen Livescu Affiliations: Toyota Technological Institute at Chicago Link: https://arxiv.org/abs/2107.04734 Abstract: Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2021-07-13. To report infringement, contact cloudcommunity@tencent.com for removal.
