Finance / Speech / Audio Processing Academic Digest [7.16]

WeChat official account: arXiv Daily Academic Digest

Published 2021-07-27 10:59:25

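Filed under the column: arXiv Daily Academic Digest

Visit www.arxivdaily.com for the digest with abstracts, covering CS | Physics | Mathematics | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, bookmarking, and posting features.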

q-fin Finance: 5 papers

cs.SD Speech: 8 papers

eess.AS Audio Processing: 8 papers

1. q-fin Finance:

【1】 Article Processing Charges based publications: to which extent the price explains scientific impact?

Authors: Abdelghani Maddi, David Sapinho
Affiliations: Observatoire des Sciences et Techniques, Hcéres, Rue Albert Einstein, Paris, France; Sorbonne Paris Nord, CEPN, UMR-CNRS, Villetaneuse, France
Comments: ISSI 2021 - 18th International Conference on Scientometrics & Informetrics, Jul 2021, Leuven, Belgium
Link: https://arxiv.org/abs/2107.07348
Abstract: The present study aims to analyze the relationship between the Citations Normalized Score (NCS) of scientific publications and the Article Processing Charges (APCs) amounts of Gold Open access publications. To do so, we use APCs information provided by the OpenAPC database and citations scores of publications in the Web of Science database (WoS). The database covers the period from 2006 to 2019 with 83,752 articles published in 4751 journals belonging to 267 distinct publishers. Results show that contrary to this belief, paying dearly does not necessarily increase the impact of publications. First, large publishers with high impact are not the most expensive. Second, publishers with the highest APCs are not necessarily the best in terms of impact. Correlation between APCs and impact is moderate. Otherwise, in the econometric analysis we have shown that publication quality is strongly determined by the quality of the journal in which it is published. International collaboration also plays an important role in citations score.

【2】 Computing near-optimal Value-at-Risk portfolios using Integer Programming techniques

Authors: Onur Babat, Juan C. Vera, Luis F. Zuluaga
Affiliations: Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA USA; Department of Econometrics and Operations Research, Tilburg University, Tilburg, The Netherlands
Link: https://arxiv.org/abs/2107.07339
Abstract: Value-at-Risk (VaR) is one of the main regulatory tools used for risk management purposes. However, it is difficult to compute optimal VaR portfolios; that is, an optimal risk-reward portfolio allocation using VaR as the risk measure. This is due to VaR being non-convex and of combinatorial nature. In particular, it is well known that the VaR portfolio problem can be formulated as a mixed integer linear program (MILP) that is difficult to solve with current MILP solvers for medium to large-scale instances of the problem. Here, we present an algorithm to compute near-optimal VaR portfolios that takes advantage of this MILP formulation and provides a guarantee of the solution's near-optimality. As a byproduct, we obtain an algorithm to compute tight lower bounds on the VaR portfolio problem that outperform related algorithms proposed in the literature for this purpose. The near-optimality guarantee provided by the proposed algorithm is obtained thanks to the relation between minimum risk portfolios satisfying a reward benchmark and the corresponding maximum reward portfolios satisfying a risk benchmark. These alternate formulations of the portfolio allocation problem have been frequently studied in the case of convex risk measures and concave reward functions. Here, this relationship is considered for general risk measures and reward functions. To illustrate the efficiency of the presented algorithm, numerical results are presented using historical asset returns from the US financial market.
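
To make the objective in the abstract above concrete, here is a minimal sketch of evaluating the empirical (scenario-based) VaR of a fixed portfolio. It is not the paper's algorithm and does not implement the MILP formulation. The function names, the quantile convention, and the scenario data are all assumptions made for illustration.

```python
import math

def portfolio_losses(weights, scenarios):
    """Loss of the portfolio in each scenario (the negative of its return)."""
    return [-sum(w * r for w, r in zip(weights, row)) for row in scenarios]

def historical_var(weights, scenarios, alpha=0.95):
    """Empirical VaR: smallest loss level that is not exceeded in at least
    a fraction alpha of the scenarios (one common convention, assumed here)."""
    losses = sorted(portfolio_losses(weights, scenarios))
    k = math.ceil(alpha * len(losses)) - 1
    return losses[k]

# Hypothetical data: 5 return scenarios for 2 assets.
scenarios = [
    [0.02, -0.01],
    [-0.03, 0.01],
    [0.01, 0.00],
    [-0.05, 0.02],
    [0.00, -0.02],
]
var_equal = historical_var([0.5, 0.5], scenarios, alpha=0.8)
var_single = historical_var([1.0, 0.0], scenarios, alpha=0.8)
print(var_equal, var_single)
```

Because the VaR is an order statistic of the scenario losses, it changes non-smoothly as the weights move, which is one way to see why optimizing over the weights leads to a combinatorial problem.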

【3】 Credit scoring using neural networks and SURE posterior probability calibration

Authors: Matthieu Garcin, Samuel Stéphan
Affiliations: Léonard de Vinci Pôle Universitaire, Research center, Paris La Défense; SAMM, Université Paris , Panthéon-Sorbonne, rue de Tolbiac, Paris, cedex , France
Comments: 22 pages
Link: https://arxiv.org/abs/2107.07206
Abstract: In this article we compare the performances of a logistic regression and a feed forward neural network for credit scoring purposes. Our results show that the logistic regression gives quite good results on the dataset and the neural network can slightly improve the performance. We also consider different sets of features in order to assess their importance in terms of prediction accuracy. We found that temporal features (i.e. repeated measures over time) can be an important source of information resulting in an increase in the overall model accuracy. Finally, we introduce a new technique for the calibration of predicted probabilities based on Stein's unbiased risk estimate (SURE). This calibration technique can be applied to very general calibration functions. In particular, we detail this method for the sigmoid function as well as for the Kumaraswamy function, which includes the identity as a particular case. We show that stacking the SURE calibration technique with the classical Platt method can improve the calibration of predicted probabilities.
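
The two calibration function families named in the abstract above can be sketched as follows. This only shows the shape of the functions under a standard parametrization. The SURE-based fitting procedure is not reproduced, and the parameter names `a` and `b` and the sign convention in `platt` are assumptions.

```python
import math

def platt(score, a, b):
    """Sigmoid (Platt-style) calibration: logistic function of an affine map of the raw score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def kumaraswamy_cdf(p, a, b):
    """Kumaraswamy CDF on [0, 1]: 1 - (1 - p**a)**b, for a > 0 and b > 0."""
    return 1.0 - (1.0 - p ** a) ** b

# With a = b = 1 the Kumaraswamy CDF reduces to the identity map,
# i.e. the predicted probability is left unchanged.
identity_check = kumaraswamy_cdf(0.3, 1.0, 1.0)
reshaped = kumaraswamy_cdf(0.3, 2.0, 1.0)
print(identity_check, reshaped, platt(0.0, 1.0, 0.0))
```

The identity special case is what makes the Kumaraswamy family convenient as a calibration map: a model that is already well calibrated can be left untouched.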

【4】 Market risk factors analysis for an international mining company. Multi-dimensional, heavy-tailed-based modelling

Authors: Łukasz Bielak, Aleksandra Grzesiek, Joanna Janczura, Agnieszka Wyłomańska
Affiliations: KGHM, M. Sklodowskiej-Curie, Lubin, Poland; Hugo Steinhaus Center, Wroclaw University of Science and Technology, Wyspianskiego, Wroclaw, Poland
Comments: 12 pages, 10 figures, 4 tables
Link: https://arxiv.org/abs/2107.07142
Abstract: To properly manage their operations and be ready to make business decisions, mining companies are required to analyze potential scenarios for main market risk factors. The most important risk factors for KGHM, one of the biggest companies active in the metals and mining industry, are the price of copper (Cu), traded in US dollars, and the Polish zloty (PLN) exchange rate (USDPLN). The main scope of the paper is to understand the mid- and long-term dynamics of these two risk factors. For a mining company it might help to properly evaluate potential downside market risk and optimise hedging instruments. From the market risk management perspective, it is also important to analyze the dynamics of these two factors combined with the price of copper in Polish zloty (Cu in PLN), which jointly drive the revenues, cash flows, and financial results of the company. Based on the relation between analyzed risk factors and distribution analysis, we propose to use a two-dimensional vector autoregressive (VAR) model with the $\alpha$-stable distribution. The non-homogeneity of the data is reflected in two identified regimes: the first corresponding to the 2008 crisis and the second to the stable market situation. As a natural implication of the model fitted to market assets, we derive the dynamics of the copper price in PLN, which is not a traded asset but is crucial for the KGHM company risk exposure. A comparative study is performed to demonstrate the effect of including dependencies of the assets and the implications of the regime change. Since for various international companies, risk factors are given in the national rather than the market currency, the approach is universal and can be used in different market contexts, like mining or oil companies, but also other commodities involved in the global trading system.
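
The link between the two traded risk factors and the copper price in PLN can be written as a simple identity. The symbols $C_t$ (copper price in USD), $X_t$ (USDPLN rate), and $P_t$ (copper price in PLN) are introduced here for illustration only and are not necessarily the paper's notation.

```latex
P_t = C_t \, X_t,
\qquad
\log\frac{P_t}{P_{t-1}} = \log\frac{C_t}{C_{t-1}} + \log\frac{X_t}{X_{t-1}}
```

In other words, the log-return of copper in PLN is the sum of the two components of the bivariate series, so its distribution depends on their joint behaviour and not only on each margin.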

【5】 A new class of conditional Markov jump processes with regime switching and path dependence: properties and maximum likelihood estimation

Authors: Budhi Surya
Comments: 28 pages, 3 figures
Link: https://arxiv.org/abs/2107.07026
Abstract: This paper develops a new class of conditional Markov jump processes with regime switching and path dependence. The key novel feature of the developed process lies in its ability to switch the transition rate as it moves from one state to another with switching probability depending on the current state and time of the process as well as its past trajectories. As such, the transition from current state to another depends on the holding time of the process in the state. Distributional properties of the process are given explicitly in terms of the speed regimes represented by a finite number of different transition matrices, the probabilities of selecting regime membership within each state, and past realization of the process. In particular, it has distributional equivalent stochastic representation with a general mixture of Markov jump processes introduced in Frydman and Surya (2020). Maximum likelihood estimates (MLE) of the distribution parameters of the process are derived in closed form. The estimation is done iteratively using the EM algorithm. Akaike information criterion is used to assess the goodness-of-fit of the selected model. An explicit observed Fisher information matrix of the MLE is derived for the calculation of standard errors of the MLE. The information matrix takes on a simplified form of the general matrix formula of Louis (1982). Large sample properties of the MLE are presented. In particular, the covariance matrix for the MLE of transition rates is equal to the Cramér-Rao lower bound, and is less for the MLE of regime membership. The simulation study confirms these findings and shows that the parameter estimates are accurate, consistent, and have asymptotic normality as the sample size increases.

2. cs.SD Speech:

【1】 Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk

Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Affiliations: Technion – Israel Institute of Technology, Technion City, Haifa, Israel
Comments: Accepted to WASPAA
Link: https://arxiv.org/abs/2107.07471
Abstract: Human subjective evaluation is optimal to assess speech quality for human perception. The recently introduced deep noise suppression mean opinion score (DNSMOS) metric was shown to estimate human ratings with great accuracy. The signal-to-distortion ratio (SDR) metric is widely used to evaluate residual-echo suppression (RES) systems by estimating speech quality during double-talk. However, since the SDR is affected by both speech distortion and residual-echo presence, it does not correlate well with human ratings according to the DNSMOS. To address that, we introduce two objective metrics to separately quantify the desired-speech maintained level (DSML) and residual-echo suppression level (RESL) during double-talk. These metrics are evaluated using a deep learning-based RES-system with a tunable design parameter. Using 280 hours of real and simulated recordings, we show that the DSML and RESL correlate well with the DNSMOS with high generalization to various setups. Also, we empirically investigate the relation between tuning the RES-system design parameter and the DSML-RESL tradeoff it creates and offer a practical design scheme for dynamic system requirements.
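
For readers unfamiliar with the SDR mentioned in the abstract above, a minimal sketch of its basic textbook form follows. The exact variant used in the paper, and the definitions of DSML and RESL, are not given in the abstract and are not reproduced here. The function name and the sample values are hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: energy of the reference signal
    divided by the energy of the error (estimate minus reference)."""
    signal = sum(s * s for s in reference)
    distortion = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if distortion == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / distortion)

# Hypothetical samples: the estimate deviates by 0.1 on every sample.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -1.1, 1.1, -0.9]
score = sdr_db(reference, estimate)
print(round(score, 3))
```

The single error term lumps together distortion of the desired speech and leftover echo, which is the ambiguity the abstract points to as motivation for two separate metrics.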

【2】 CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Authors: Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan
Affiliations: Thoughtworks; Ekstep Foundation
Comments: 7 pages, 2 figures
Link: https://arxiv.org/abs/2107.07402
Abstract: We present CLSRIL-23, a self supervised learning based audio pre-trained model which learns cross lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0 which is solved by training a contrastive task over masked latent speech representations and jointly learns the quantization of latents shared across all languages. We compare the language wise loss during pretraining to compare effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared and our experiments show that multilingual pretraining outperforms monolingual training, in terms of learning speech representations that encode phonetic similarity of languages and also in terms of performance on downstream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for finetuning in Hindi. All the code models are also open sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state of the art systems will be created using the self supervised approach, especially for low resources Indic languages.
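
The WER and CER figures quoted in the abstract above are edit-distance-based error rates. A minimal sketch of the standard definitions follows. It is not the authors' evaluation code, and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

example_wer = wer("the cat sat on the mat", "the cat sit on mat")
example_cer = cer("abc", "abd")
print(example_wer, example_cer)
```

In the example, one substituted word and one deleted word against a six-word reference give a WER of 2/6.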

【3】 Sketching sounds: an exploratory study on sound-shape associations

Authors: Sebastian Löbbers, Mathieu Barthet, György Fazekas
Affiliations: Centre for Digital Music, Queen Mary University of London
Comments: accepted for International Computer Music Conference (ICMC) 2021
Link: https://arxiv.org/abs/2107.07360
Abstract: Sound synthesiser controls typically correspond to technical parameters of signal processing algorithms rather than intuitive sound descriptors that relate to human perception of sound. This makes it difficult to realise sound ideas in a straightforward way. Cross-modal mappings, for example between gestures and sound, have been suggested as a more intuitive control mechanism. A large body of research shows consistency in human associations between sounds and shapes. However, the use of drawings to drive sound synthesis has not been explored to its full extent. This paper presents an exploratory study that asked participants to sketch visual imagery of sounds with a monochromatic digital drawing interface, with the aim to identify different representational approaches and determine whether timbral sound characteristics can be communicated reliably through visual sketches. Results imply that the development of a synthesiser exploiting sound-shape associations is feasible, but a larger and more focused dataset is needed in followup studies.

【4】 Leveraging Hierarchical Structures for Few-Shot Musical Instrument Recognition

Authors: Hugo Flores Garcia, Aldo Aguilar, Ethan Manilow, Bryan Pardo
Affiliations: Interactive Audio Lab, Northwestern University, Evanston, IL, USA
Link: https://arxiv.org/abs/2107.07029
Abstract: Deep learning work on musical instrument recognition has generally focused on instrument classes for which we have abundant data. In this work, we exploit hierarchical relationships between instruments in a few-shot learning setup to enable classification of a wider set of musical instruments, given a few examples at inference. We apply a hierarchical loss function to the training of prototypical networks, combined with a method to aggregate prototypes hierarchically, mirroring the structure of a predefined musical instrument hierarchy. These extensions require no changes to the network architecture and new levels can be easily added or removed. Compared to a non-hierarchical few-shot baseline, our method leads to a significant increase in classification accuracy and a significant decrease in mistake severity on instrument classes unseen in training.
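
The idea of prototypes that mirror an instrument hierarchy can be sketched as below. The abstract does not specify how prototypes are aggregated, so taking the mean of child prototypes is an assumption made here. The embeddings, class names, and hierarchy are hypothetical, and the hierarchical loss used for training is not shown.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, prototypes):
    """Name of the prototype closest to the query embedding."""
    return min(prototypes, key=lambda name: sq_dist(query, prototypes[name]))

# Hypothetical 2-D embeddings of a few support examples per instrument.
support = {
    "violin": [[1.0, 0.0], [0.8, 0.2]],
    "cello": [[0.6, 0.4], [0.4, 0.6]],
    "trumpet": [[-1.0, 0.0], [-0.8, -0.2]],
}
# Hypothetical two-level hierarchy: family -> instruments.
hierarchy = {"strings": ["violin", "cello"], "brass": ["trumpet"]}

# Leaf prototypes: mean embedding of each class's support set.
leaf = {name: mean_vector(examples) for name, examples in support.items()}
# Family prototypes: mean of the child prototypes (assumed aggregation rule).
family = {fam: mean_vector([leaf[c] for c in children])
          for fam, children in hierarchy.items()}

query = [0.7, 0.2]
print(nearest(query, leaf), nearest(query, family))
```

Classifying at the family level as well as the leaf level is what allows a mistake to be graded by severity: confusing a violin with a cello stays within the same family, while confusing it with a trumpet does not.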

【5】 FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

Authors: Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal
Affiliations: Facebook AI Research
Comments: Accepted by IWSLT 2021 as a system paper
Link: https://arxiv.org/abs/2107.06959
Abstract: In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.

【6】 VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Authors: Hirofumi Inaguma, Tatsuya Kawahara
Affiliations: Graduate School of Informatics, Kyoto University, Kyoto, Japan
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2107.07509
Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.

【7】 Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Authors: Christian J. Steinmetz, Vamsi Krishna Ithapu, Paul Calamia
Affiliations: Centre for Digital Music, Queen Mary University of London, London, UK; Facebook Reality Labs Research, Redmond, Washington, USA
Comments: Accepted to WASPAA 2021
Link: https://arxiv.org/abs/2107.07503
Abstract: Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it was recorded in the same room as a reference recording, with applications both in audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching utilize either large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates our model not only synthesizes RIRs that match parameters of the target room, such as the $T_{60}$ and DRR, but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.
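
The "decaying noise" view of a room impulse response in the abstract above can be illustrated with a toy, single-band sketch. This is not the FiNS architecture: there is no learned encoder, no filter bank, and no direct-sound or early-reflection component. The function names and parameter values are assumptions.

```python
import random

def decay_envelope(t, t60):
    """Amplitude envelope that has dropped by 60 dB at t = t60 seconds."""
    return 10.0 ** (-3.0 * t / t60)

def decaying_noise_rir(t60, sample_rate, length, seed=0):
    """Toy RIR tail: Gaussian noise shaped by an exponential decay."""
    rng = random.Random(seed)
    return [decay_envelope(n / sample_rate, t60) * rng.gauss(0.0, 1.0)
            for n in range(length)]

def convolve(signal, kernel):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Apply a short synthetic RIR to a dry unit impulse followed by silence.
rir = decaying_noise_rir(t60=0.5, sample_rate=16000, length=100)
dry = [1.0, 0.0, 0.0, 0.0]
wet = convolve(dry, rir)
print(len(wet))
```

Once an RIR is available, applying the room's acoustics to any dry signal is a single convolution, which is the efficiency argument the abstract makes for estimating the RIR directly.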

【8】 DAL: Feature Learning from Overt Speech to Decode Imagined Speech-based EEG Signals with Convolutional Autoencoder

Authors: Dae-Hyeok Lee, Sung-Jin Kim, Seong-Whan Lee
Affiliations: Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-ku, Seoul, Korea; Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-ku
Comments: 14 pages, 6 figures
Link: https://arxiv.org/abs/2107.07064
Abstract: Brain-computer interface (BCI) is one of the tools which enables the communication between humans and devices by reflecting intention and status of humans. With the development of artificial intelligence, the interest in communication between humans and drones using electroencephalogram (EEG) has increased. Especially, in the case of controlling drone swarms such as direction or formation, there are many advantages compared with controlling a drone unit. Imagined speech is one of the endogenous BCI paradigms, which can identify intentions of users. When conducting imagined speech, the users imagine the pronunciation as if actually speaking. In contrast, overt speech is a task in which the users directly pronounce the words. When controlling drone swarms using imagined speech, complex commands can be delivered more intuitively, but decoding performance is lower than that of other endogenous BCI paradigms. We proposed the Deep-autolearner (DAL) to learn EEG features of overt speech for imagined speech-based EEG signals classification. To the best of our knowledge, this study is the first attempt to use EEG features of overt speech to decode imagined speech-based EEG signals with an autoencoder. A total of eight subjects participated in the experiment. When classifying four words, the average accuracy of the DAL was 48.41%. In addition, when comparing the performance without and with EEG features of overt speech, there was a performance improvement of 7.42% when including EEG features of overt speech. Hence, we demonstrated that EEG features of overt speech could improve the decoding performance of imagined speech.

3. eess.AS Audio Processing:

【1】 VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Authors: Hirofumi Inaguma, Tatsuya Kawahara
Link: https://arxiv.org/abs/2107.07509
Abstract: cross-listed; identical to entry 【6】 in the cs.SD section above.

【2】 Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Authors: Christian J. Steinmetz, Vamsi Krishna Ithapu, Paul Calamia
Link: https://arxiv.org/abs/2107.07503
Abstract: cross-listed; identical to entry 【7】 in the cs.SD section above.

【3】 DAL: Feature Learning from Overt Speech to Decode Imagined Speech-based EEG Signals with Convolutional Autoencoder

Authors: Dae-Hyeok Lee, Sung-Jin Kim, Seong-Whan Lee
Link: https://arxiv.org/abs/2107.07064
Abstract: cross-listed; identical to entry 【8】 in the cs.SD section above.

【4】 Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk

Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
Link: https://arxiv.org/abs/2107.07471
Abstract: cross-listed; identical to entry 【1】 in the cs.SD section above.

【5】 CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Authors: Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan Affiliations: Thoughtworks, Ekstep Foundation Comments: 7 pages, 2 figures Link: https://arxiv.org/abs/2107.07402 Abstract: We present CLSRIL-23, a self-supervised audio pre-trained model which learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to compare the effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual training, both in learning speech representations which encode phonetic similarity of languages and in performance on downstream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for finetuning in Hindi. All the code and models are also open sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.
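For readers unfamiliar with the WER and CER figures quoted above, the following is a minimal sketch of how word and character error rates are commonly computed, assuming the standard Levenshtein (edit-distance) definition. The function names and example sentences are hypothetical and are not taken from the CLSRIL-23 code base, whose exact scoring setup the abstract does not describe.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One dropped word out of six reference words.
example_wer = wer("the cat sat on the mat", "the cat sat on mat")
# One substituted character out of four.
example_cer = cer("abcd", "abed")
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the hypothesis contains many insertions.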

【6】 Sketching sounds: an exploratory study on sound-shape associations

Authors: Sebastian Löbbers, Mathieu Barthet, György Fazekas Affiliation: Centre for Digital Music, Queen Mary University of London Comments: Accepted for International Computer Music Conference (ICMC) 2021 Link: https://arxiv.org/abs/2107.07360 Abstract: Sound synthesiser controls typically correspond to technical parameters of signal processing algorithms rather than intuitive sound descriptors that relate to human perception of sound. This makes it difficult to realise sound ideas in a straightforward way. Cross-modal mappings, for example between gestures and sound, have been suggested as a more intuitive control mechanism. A large body of research shows consistency in human associations between sounds and shapes. However, the use of drawings to drive sound synthesis has not been explored to its full extent. This paper presents an exploratory study that asked participants to sketch visual imagery of sounds with a monochromatic digital drawing interface, with the aim of identifying different representational approaches and determining whether timbral sound characteristics can be communicated reliably through visual sketches. Results imply that the development of a synthesiser exploiting sound-shape associations is feasible, but a larger and more focused dataset is needed in follow-up studies.

【7】 Leveraging Hierarchical Structures for Few-Shot Musical Instrument Recognition

Authors: Hugo Flores Garcia, Aldo Aguilar, Ethan Manilow, Bryan Pardo Affiliation: Interactive Audio Lab, Northwestern University, Evanston, IL, USA Link: https://arxiv.org/abs/2107.07029 Abstract: Deep learning work on musical instrument recognition has generally focused on instrument classes for which we have abundant data. In this work, we exploit hierarchical relationships between instruments in a few-shot learning setup to enable classification of a wider set of musical instruments, given a few examples at inference. We apply a hierarchical loss function to the training of prototypical networks, combined with a method to aggregate prototypes hierarchically, mirroring the structure of a predefined musical instrument hierarchy. These extensions require no changes to the network architecture, and new levels can be easily added or removed. Compared to a non-hierarchical few-shot baseline, our method leads to a significant increase in classification accuracy and a significant decrease in mistake severity on instrument classes unseen in training.
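The idea of prototypes aggregated along an instrument hierarchy can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are made-up 2-D vectors rather than outputs of a trained network, parent prototypes are taken as the plain mean of their children's prototypes (the paper's aggregation method may differ), and the hierarchical loss is not shown. All names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def class_prototypes(support):
    """One prototype per class: the mean of that class's support embeddings."""
    return {label: mean_vector(embeddings) for label, embeddings in support.items()}

def parent_prototypes(prototypes, hierarchy):
    """Assumed aggregation: a parent's prototype is the mean of its children's prototypes."""
    return {parent: mean_vector([prototypes[child] for child in children])
            for parent, children in hierarchy.items()}

def nearest(query, prototypes):
    """Label of the prototype closest to the query in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Toy support set: two made-up embeddings per instrument.
support = {
    "violin": [[1.0, 0.0], [1.0, 0.2]],
    "cello": [[0.8, 0.0], [0.8, 0.2]],
    "trumpet": [[-1.0, 0.0], [-1.0, 0.2]],
}
# Hypothetical two-level hierarchy: instrument family -> instruments.
hierarchy = {"strings": ["violin", "cello"], "brass": ["trumpet"]}

leaf = class_prototypes(support)
family = parent_prototypes(leaf, hierarchy)

query = [0.95, 0.1]
predicted_instrument = nearest(query, leaf)
predicted_family = nearest(query, family)
```

Classifying at the family level gives a coarser fallback: even when the leaf prediction is wrong (violin versus cello), the family prediction can still be right, which relates to the lower mistake severity the abstract reports.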

【8】 FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

Authors: Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal Affiliation: Facebook AI Research Comments: Accepted by IWSLT 2021 as a system paper Link: https://arxiv.org/abs/2107.06959 Abstract: In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training the two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show that our system outperforms the reported systems, including both end-to-end and cascaded approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2021-07-16. In case of copyright infringement, please contact cloudcommunity@tencent.com for removal.
