Finance/Speech/Audio Processing arXiv Digest [7.12]

Author: WeChat official account arXiv每日学术速递
Published: 2021-07-27 10:46:26
This article is included in the column: arXiv每日学术速递

Visit www.arxivdaily.com for digests with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, plus search, favorites, and posting features.

q-fin (Quantitative Finance): 4 papers

cs.SD (Sound): 7 papers

eess.AS (Audio and Speech Processing): 11 papers

1. q-fin (Quantitative Finance):

【1】 Endogenous viral mutations, evolutionary selection, and containment policy design

Authors: Patrick Mellacher
Affiliation: University of Graz, Graz Schumpeter Centre, Graz, Austria
Link: https://arxiv.org/abs/2107.04358
Abstract: How will the novel coronavirus evolve? I study a simple SEPAIRD model, in which mutations may change the properties of the virus and its associated disease stochastically and antigenic drifts allow new variants to partially evade immunity. I show analytically that variants with higher infectiousness, longer disease duration, and shorter latency period prove to be fitter. "Smart" containment policies targeting symptomatic individuals may redirect the evolution of the virus, as they give an edge to variants with a longer incubation period and a higher share of asymptomatic infections. Reduced mortality, on the other hand, does not per se prove to be an evolutionary advantage. I then implement this model as an agent-based simulation model in order to explore its aggregate dynamics. Monte Carlo simulations show that a) containment policy design has an impact on both the speed and direction of viral evolution, b) the virus may circulate in the population indefinitely, provided that containment efforts are too relaxed and the propensity of the virus to escape immunity is high enough, and crucially c) it may not be possible to distinguish between a slowly and a rapidly evolving virus looking only at short-term epidemiological outcomes. Thus, what looks like a successful mitigation strategy in the short run may prove to have devastating long-run effects. These results suggest that optimal containment policy must take the propensity of the virus to mutate and escape immunity into account, strengthening the case for genetic and antigenic surveillance even in the early stages of an epidemic.
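The core mechanism, agents carrying a virus whose traits mutate at transmission so that fitter variants spread faster, can be illustrated with a toy simulation. A minimal sketch, not the paper's SEPAIRD model: compartments are reduced to SIR, and all class names, rates, and mutation parameters are illustrative assumptions.

```python
import random

class Strain:
    """Viral variant with mutable traits; fitter strains spread more."""
    def __init__(self, beta, duration):
        self.beta = beta            # per-contact infectiousness
        self.duration = duration    # disease duration in days

    def mutate(self, p_mut=0.01, scale=0.05):
        # On transmission, traits occasionally drift (endogenous mutation).
        if random.random() < p_mut:
            return Strain(self.beta * (1 + random.gauss(0, scale)),
                          max(1.0, self.duration * (1 + random.gauss(0, scale))))
        return self

def simulate(n=10_000, days=200):
    susceptible = n - 1
    infected = [(Strain(beta=0.15, duration=10.0), 0)]   # (strain, days sick)
    for _ in range(days):
        nxt = []
        for strain, age in infected:
            # Probability of infecting a random contact this day.
            if random.random() < strain.beta * susceptible / n:
                susceptible -= 1
                nxt.append((strain.mutate(), 0))
            if age + 1 < strain.duration:                 # else: recovered
                nxt.append((strain, age + 1))
        infected = nxt
    # Surviving strains reveal the direction of evolutionary selection.
    return [(s.beta, s.duration) for s, _ in infected]

print(simulate()[:5])
```

Running the loop repeatedly and averaging the surviving traits is the Monte Carlo step the abstract refers to; selection pushes the surviving (beta, duration) pairs upward, matching the paper's analytical fitness claim.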

【2】 Deep Learning for Mean Field Games and Mean Field Control with Applications to Finance

Authors: René Carmona, Mathieu Laurière
Link: https://arxiv.org/abs/2107.04568
Abstract: Financial markets and more generally macro-economic models involve a large number of individuals interacting through variables such as prices resulting from the aggregate behavior of all the agents. Mean field games have been introduced to study Nash equilibria for such problems in the limit when the number of players is infinite. The theory has been extensively developed in the past decade, using both analytical and probabilistic tools, and a wide range of applications have been discovered, from economics to crowd motion. More recently, the interaction with machine learning has attracted a growing interest. This aspect is particularly relevant for solving very large games with complex structures, in high dimension or with common sources of randomness. In this chapter, we review the literature on the interplay between mean field games and deep learning, with a focus on three families of methods. A special emphasis is given to financial applications.

【3】 Simulation of Multidimensional Diffusions with Sticky Boundaries via Markov Chain Approximation

Authors: Christian Meier, Lingfei Li, Gongqiu Zhang
Link: https://arxiv.org/abs/2107.04260
Abstract: We develop a new simulation method for multidimensional diffusions with sticky boundaries. The challenge comes from simulating the sticky boundary behavior, for which standard methods like the Euler scheme fail. We approximate the sticky diffusion process by a multidimensional continuous-time Markov chain (CTMC), which can be simulated easily. We develop two ways of constructing the CTMC: approximating the infinitesimal generator of the sticky diffusion by finite difference using standard coordinate directions, and matching the local moments using the drift and the eigenvectors of the covariance matrix as transition directions. The first approach does not always guarantee a valid Markov chain, whereas the second one can. We show that both construction methods yield a first-order simulation scheme, which can capture the sticky behavior and is free from the curse of dimensionality. We apply our method to two applications: a multidimensional Brownian motion with all dimensions sticky, which arises as the limit of a queuing system with exceptional service policy, and a multi-factor short rate model for a low interest rate environment in which the stochastic factors are unbounded but the short rate is sticky at zero.
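To see why a CTMC handles stickiness where the Euler scheme fails, consider the one-dimensional case: on a grid of spacing h, interior states get jump rates from a finite-difference approximation of the generator, while the boundary state gets a finite holding rate, so the chain spends strictly positive time stuck at the boundary. A 1D sketch under assumed conventions; the paper's construction is multidimensional and moment-matched, and the boundary escape rate 1/(2*rho*h) used here is one common first-order discretization of a stickiness parameter rho, not necessarily the paper's.

```python
import numpy as np

def simulate_sticky_bm(T=1.0, h=0.01, sigma=1.0, rho=0.5, x0=0.1, seed=0):
    """CTMC approximation of reflected Brownian motion on [0, inf),
    sticky at 0. Interior rates discretize (sigma^2/2) * d^2/dx^2."""
    rng = np.random.default_rng(seed)
    lam = sigma**2 / (2 * h**2)        # interior jump rate per direction
    x, t = round(x0 / h), 0.0
    path = [(0.0, x0)]
    while t < T:
        if x == 0:
            # Sticky boundary: finite escape rate => positive time at 0.
            t += rng.exponential(2 * rho * h)
            x = 1
        else:
            t += rng.exponential(1 / (2 * lam))
            x += rng.choice((-1, 1))   # symmetric nearest-neighbour move
        path.append((t, x * h))
    return path

path = simulate_sticky_bm()
time_at_zero = sum(t2 - t1 for (t1, y), (t2, _) in zip(path, path[1:])
                   if y == 0.0)
print(f"time spent exactly at zero: {time_at_zero:.3f}")
```

The printed occupation time at zero is strictly positive, which is exactly the feature an Euler discretization cannot reproduce.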

【4】 Centralized Matching with Incomplete Information

Authors: Marcelo Ariel Fernandez, Kirill Rudov, Leeat Yariv
Link: https://arxiv.org/abs/2107.04098
Abstract: We study the impacts of incomplete information on centralized one-to-one matching markets. We focus on the commonly used Deferred Acceptance mechanism (Gale and Shapley, 1962). We show that many complete-information results are fragile to a small infusion of uncertainty about others' preferences.
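For reference, the Deferred Acceptance mechanism the paper studies runs in rounds: unmatched proposers propose to their next-preferred receiver, and each receiver tentatively holds the best proposal received so far. A compact sketch of the standard complete-information algorithm (the variable names are ours):

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Gale-Shapley deferred acceptance (proposer-optimal stable matching).
    proposer_prefs[p]: p's receivers, best first.
    receiver_prefs[r]: r's proposers, best first."""
    rank = {r: {p: i for i, p in enumerate(ps)}
            for r, ps in receiver_prefs.items()}
    next_idx = {p: 0 for p in proposer_prefs}   # next receiver to try
    free = list(proposer_prefs)                 # currently unmatched proposers
    match = {}                                  # receiver -> held proposer
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_idx[p]]
        next_idx[p] += 1
        if r not in match:
            match[r] = p                        # tentative acceptance
        elif rank[r][p] < rank[r][match[r]]:
            free.append(match[r])               # r trades up, frees old partner
            match[r] = p
        else:
            free.append(p)                      # r rejects p
    return {p: r for r, p in match.items()}

print(deferred_acceptance(
    {"a": ["x", "y"], "b": ["x", "y"]},
    {"x": ["b", "a"], "y": ["a", "b"]}))        # {'b': 'x', 'a': 'y'}
```

The paper's point is about what happens to the stability and strategy-proofness properties of this procedure once agents are uncertain about the preference lists it takes as input.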

2. cs.SD (Sound):

【1】 Improved Breath Phase and Continuous Adventitious Sound Detection in Lung and Tracheal Sound Using Mixed Set Training and Domain Adaptation

Authors: Fu-Shun Hsu, Shang-Ran Huang, Chang-Fu Su, Chien-Wen Huang, Yuan-Ren Cheng, Chun-Chieh Chen, Chun-Yu Wu, Chung-Wei Chen, Yen-Chun Lai, Tang-Wei Cheng, Nian-Jhen Lin, Wan-Ling Tsai, Ching-Shiang Lu, Chuan Chen, Feipei Lai
Affiliations: (a) Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University; (b) Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei, Taiwan; (c) Heroic Faith Medical Science Co., Ltd., Taipei, Taiwan
Comments: To be submitted; 31 pages, 6 figures, 5 tables
Link: https://arxiv.org/abs/2107.04229
Abstract: Previously, we established a lung sound database, HF_Lung_V2, and proposed convolutional bidirectional gated recurrent unit (CNN-BiGRU) models with adequate ability for inhalation, exhalation, continuous adventitious sound (CAS), and discontinuous adventitious sound detection in the lung sound. In this study, we proceeded to build a tracheal sound database, HF_Tracheal_V1, containing 11107 15-second tracheal sound recordings, 23087 inhalation labels, 16728 exhalation labels, and 6874 CAS labels. The tracheal sound in HF_Tracheal_V1 and the lung sound in HF_Lung_V2 were either combined or used alone to train the CNN-BiGRU models for respective lung and tracheal sound analysis. Different training strategies were investigated and compared: (1) using full training (training from scratch) to train the lung sound models using lung sound alone and train the tracheal sound models using tracheal sound alone, (2) using a mixed set that contains both the lung and tracheal sound to train the models, and (3) using domain adaptation that finetuned the pre-trained lung sound models with the tracheal sound data and vice versa. Results showed that the models trained only by lung sound performed poorly in the tracheal sound analysis and vice versa. However, the mixed set training and domain adaptation can improve the performance of exhalation and CAS detection in the lung sound, and inhalation, exhalation, and CAS detection in the tracheal sound compared to positive controls (lung models trained only by lung sound and vice versa). In particular, a model derived from the mixed set training prevails in the situation of killing two birds with one stone.
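Of the three strategies, domain adaptation (strategy 3) is the easiest to show in code: take a model pretrained on the source domain and continue training it on target-domain batches. A hedged sketch: the model class, data loader, learning rate, and the per-frame multi-label loss are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def adapt_to_domain(pretrained: nn.Module, target_loader,
                    lr=1e-4, epochs=5, device="cpu"):
    """Finetune a source-domain model (e.g. lung sound) on target-domain
    data (e.g. tracheal sound). A small learning rate preserves most of
    the pretrained features while shifting the decision boundaries."""
    model = pretrained.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Frame-wise multi-label targets: inhalation / exhalation / CAS.
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for spectrogram, labels in target_loader:
            spectrogram, labels = spectrogram.to(device), labels.to(device)
            opt.zero_grad()
            loss = criterion(model(spectrogram), labels)
            loss.backward()
            opt.step()
    return model
```

Mixed set training (strategy 2) differs only in the data side: the loader would interleave lung and tracheal batches while the loop stays the same.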

【2】 Multi-path Convolutional Neural Networks Efficiently Improve Feature Extraction in Continuous Adventitious Lung Sound Detection

Authors: Fu-Shun Hsu, Shang-Ran Huang, Chien-Wen Huang, Chun-Chieh Chen, Yuan-Ren Cheng, Feipei Lai
Affiliations: Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan; Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei, Taiwan; Heroic Faith Medical Science Co., Ltd., Taipei, Taiwan
Comments: To be submitted; 32 pages, 8 figures, 2 tables
Link: https://arxiv.org/abs/2107.04226
Abstract: We previously established a large lung sound database, HF_Lung_V2 (Lung_V2). We trained convolutional-bidirectional gated recurrent unit (CNN-BiGRU) networks for detecting inhalation, exhalation, continuous adventitious sound (CAS) and discontinuous adventitious sound at the recording level on the basis of Lung_V2. However, the performance of CAS detection was poor due to many reasons, one of which is the highly diversified CAS patterns. To make the original CNN-BiGRU model learn the CAS patterns more effectively without causing too much computing burden, three strategies involving minimal modifications of the network architecture of the CNN layers were investigated: (1) making the CNN layers a bit deeper by using residual blocks, (2) making the CNN layers a bit wider by increasing the number of CNN kernels, and (3) separating the feature input into multiple paths (the model was denoted by Multi-path CNN-BiGRU). The performance of CAS segment and event detection was evaluated. Results showed that improvement in CAS detection was observed among all the proposed architecture-modified models. The F1 score for CAS event detection of the proposed models increased from 0.445 to 0.491-0.530, which was deemed significant. However, the Multi-path CNN-BiGRU model outperformed the other models in terms of the number of winning titles (five of the total nine evaluation metrics). In addition, the Multi-path CNN-BiGRU model did not cause extra computing burden (0.97-fold inference time) compared to the original CNN-BiGRU model. Conclusively, the Multi-path CNN layers can efficiently improve the effectiveness of feature extraction and subsequently result in better CAS detection.
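Strategy (3) splits the input features along the channel axis and gives each split its own convolutional path before concatenation, diversifying the learned filters at roughly constant cost. A sketch of such a front-end; channel counts, kernel sizes, and the downstream BiGRU wiring are assumptions.

```python
import torch
import torch.nn as nn

class MultiPathCNN(nn.Module):
    """Channel-split multi-path convolutional front-end. Each path sees
    its own slice of the input feature map; the outputs are concatenated
    and would then feed the BiGRU in a CNN-BiGRU detector."""
    def __init__(self, in_channels=4, n_paths=4, out_per_path=16):
        super().__init__()
        assert in_channels % n_paths == 0
        c = in_channels // n_paths
        self.paths = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, out_per_path, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_per_path),
                nn.ReLU(),
            ) for _ in range(n_paths))

    def forward(self, x):                        # x: (batch, C, freq, time)
        chunks = torch.chunk(x, len(self.paths), dim=1)
        return torch.cat([path(c) for path, c in zip(self.paths, chunks)],
                         dim=1)

y = MultiPathCNN()(torch.randn(2, 4, 64, 150))
print(y.shape)                                   # torch.Size([2, 64, 64, 150])
```

Because each path convolves only a slice of the channels, the parameter count stays close to a single-path layer of the same total width, consistent with the reported 0.97-fold inference time.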

【3】 EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments

Authors: Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra
Affiliations: Facebook Reality Labs Research, USA; Facebook AI Applied Research, UK
Comments: Dataset is available at: this https URL
Link: https://arxiv.org/abs/2107.04174
Abstract: Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beamforming and speech enhancement require high-quality representative data. To the best of the authors' knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target of speech and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.

【4】 Improved Language Identification Through Cross-Lingual Self-Supervised Learning

Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli
Affiliation: Facebook AI, USA
Comments: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.04082
Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrained speech in multiple languages and not just on English. We show that models pre-trained on many languages perform better and enable language identification systems that require very little labeled data to perform well. Results on a 25-language setup show that with only 10 minutes of labeled data per language, a cross-lingually pre-trained model can achieve over 93% accuracy.
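The low-label regime works because almost all of the capacity lives in the pretrained encoder; only a small classification head needs the 10 labeled minutes per language. A hedged sketch of that downstream step over pre-extracted frame features; the feature dimension, pooling choice, and head size are assumptions, and the wav2vec 2.0 encoder itself is abstracted away.

```python
import torch
import torch.nn as nn

class LanguageIDHead(nn.Module):
    """Mean-pool frame-level self-supervised features (e.g. wav2vec 2.0
    encoder outputs) into an utterance vector, then classify the language."""
    def __init__(self, feat_dim=768, n_languages=25):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_languages)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        return self.proj(frames.mean(dim=1))

head = LanguageIDHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
xent = nn.CrossEntropyLoss()

def train_step(frames, lang_ids):
    """One supervised step; `frames` come from a frozen pretrained encoder."""
    opt.zero_grad()
    loss = xent(head(frames), lang_ids)
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(4, 300, 768), torch.tensor([0, 3, 17, 24])))
```

In practice the encoder can also be finetuned end-to-end; freezing it is the cheapest variant and illustrates why so little labeled data suffices.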

【5】 Machine Learning for Stuttering Identification: Review, Challenges & Future Directions

Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
Affiliation: Université de Lorraine
Comments: Under review in ACM Computing Surveys
Link: https://arxiv.org/abs/2107.04057
Abstract: Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetitions of sounds. Stuttering identification is an interesting interdisciplinary research problem which involves pathology, psychology, acoustics, and signal processing, which makes it hard and complicated to detect. Recent developments in machine and deep learning have dramatically revolutionized the speech domain, however minimal attention has been given to stuttering identification. This work fills the gap by trying to bring researchers together from interdisciplinary fields. In this paper, we comprehensively review acoustic features, and statistical and deep learning based stuttering/disfluency classification methods. We also present several challenges and possible future directions.

【6】 Training a Deep Neural Network via Policy Gradients for Blind Source Separation in Polyphonic Music Recordings

Authors: Sören Schulze, Johannes Leuschner, Emily J. King
Affiliations: Center for Industrial Mathematics, University of Bremen, Bremen, Germany; Mathematics Department, Colorado State University, Fort Collins, CO, USA
Link: https://arxiv.org/abs/2107.04235
Abstract: We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual STFT time frames. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of different audio samples, both acoustic and synthetic, provided that the sample contains enough data for the training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.
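The policy-gradient trick deserves a small illustration: when a parameter enters the reconstruction through a non-differentiable step, one can place a distribution over it and update the distribution's parameters with REINFORCE, using the negative reconstruction error as the reward. A toy sketch with a Gaussian policy over one scalar; the loss function here is a stand-in, not the paper's spectral model.

```python
import torch

# Policy over a parameter whose effect on the loss is non-differentiable.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=5e-2)

def reconstruction_error(theta):
    # Stand-in for evaluating the parametric tone model against an STFT
    # frame; gradients are deliberately blocked, as in the real setting.
    with torch.no_grad():
        return (theta - 1.7).abs()

for step in range(2000):
    policy = torch.distributions.Normal(mu, log_sigma.exp())
    theta = policy.sample()                      # sample the parameter
    reward = -reconstruction_error(theta)        # lower error = better
    loss = (-policy.log_prob(theta) * reward).mean()  # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item())   # drifts toward ~1.7, the error minimizer
```

The gradient flows through log_prob rather than through the sampled value, which is exactly what makes the update usable when the model evaluation itself is non-differentiable.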

【7】 Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation

Authors: Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao
Affiliation: Ping An Technology (Shenzhen) Co., Ltd.
Comments: Will be presented at INTERSPEECH 2021
Link: https://arxiv.org/abs/2107.04227
Abstract: Predicting the altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose to introduce two dropout regularization methods into the pretraining of the transformer encoder: (1) attention dropout, (2) layer dropout. Both dropout methods encourage the model to utilize global speech information, and avoid just copying local spectrum features when reconstructing the masked frames. We evaluated the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.
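Both regularizers are easy to express on top of a stock transformer encoder: attention dropout is dropout applied inside each layer's attention, and layer dropout skips whole layers at random during training. A hedged sketch using PyTorch building blocks; the rates and sizes are assumptions, and nn.TransformerEncoderLayer's dropout argument covers its attention and feed-forward sublayers rather than attention alone.

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """Transformer encoder with layer dropout: during training, each layer
    is skipped with probability p, so the model cannot rely on any single
    layer's local features and must spread information globally."""
    def __init__(self, d_model=256, n_heads=4, n_layers=6,
                 p_layerdrop=0.2, p_dropout=0.1):
        super().__init__()
        self.p = p_layerdrop
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dropout=p_dropout,   # incl. attention
                                       batch_first=True)
            for _ in range(n_layers))

    def forward(self, x):                        # x: (batch, time, d_model)
        for layer in self.layers:
            if self.training and torch.rand(()).item() < self.p:
                continue                         # layer dropout: skip layer
            x = layer(x)
        return x

enc = LayerDropEncoder()
print(enc(torch.randn(2, 100, 256)).shape)       # torch.Size([2, 100, 256])
```

At evaluation time self.training is False, so every layer runs and the encoder behaves deterministically.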

3. eess.AS (Audio and Speech Processing):

【1】 Representation Learning to Classify and Detect Adversarial Attacks against Speaker and Speech Recognition Systems

Authors: Jesús Villalba, Sonal Joshi, Piotr Żelasko, Najim Dehak
Affiliations: Center for Language and Speech Processing, and Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD
Comments: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2107.04448
Abstract: Adversarial attacks have become a major threat for machine learning applications. There is a growing interest in studying these attacks in the audio domain, e.g., speech and speaker recognition, and finding defenses against them. In this work, we focus on using representation learning to classify/detect attacks w.r.t. the attack algorithm, threat model or signal-to-adversarial-noise ratio. We found that common attacks in the literature can be classified with accuracies as high as 90%. Also, representations trained to classify attacks against speaker identification can be used to classify attacks against speaker verification and speech recognition. We also tested an attack verification task, where we need to decide whether two speech utterances contain the same attack. We observed that our models did not generalize well to attack algorithms not included in the attack representation model training. Motivated by this, we evaluated an unknown attack detection task. We were able to detect unknown attacks with equal error rates of about 19%, which is promising.
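The classification and verification tasks can share one embedding network: train it to classify attack labels, then score verification trials by embedding similarity. A hedged sketch of that pattern; the architecture, dimensions, number of attack classes, and the cosine threshold are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttackClassifier(nn.Module):
    """Learn an attack-signature embedding from acoustic features and
    classify which attack algorithm produced the utterance. The same
    embedding can be reused for attack verification via cosine scoring."""
    def __init__(self, feat_dim=80, emb_dim=128, n_attacks=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # utterance embedding
        )
        self.head = nn.Linear(emb_dim, n_attacks)

    def embed(self, feats):                  # feats: (batch, feat_dim, time)
        return self.encoder(feats)

    def forward(self, feats):
        return self.head(self.embed(feats))  # attack-class logits

def same_attack(model, a, b, threshold=0.7):
    """Attack verification: do two utterances carry the same attack?"""
    ea, eb = model.embed(a), model.embed(b)
    return nn.functional.cosine_similarity(ea, eb) > threshold
```

The abstract's negative finding then corresponds to feeding this pipeline utterances perturbed by an attack algorithm absent from training: the embeddings no longer cluster by attack, motivating the separate unknown-attack detection task.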

【2】 Loss Prediction: End-to-End Active Learning Approach For Speech Recognition

Authors: Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao
Affiliation: Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Comments: Accepted to IJCNN 2021
Link: https://arxiv.org/abs/2107.04289
Abstract: End-to-end speech recognition systems usually require huge amounts of labeling resources, while annotating the speech data is complicated and expensive. Active learning is the solution, by selecting the most valuable samples for annotation. In this paper, we propose to use a predicted loss that estimates the uncertainty of the sample. The CTC (Connectionist Temporal Classification) and attention losses are informative for speech recognition since they are computed based on all decoding paths and alignments. We define an end-to-end active learning pipeline, training an ASR/LP (Automatic Speech Recognition/Loss Prediction) joint model. The proposed approach was validated on an English and a Chinese speech recognition task. The experiments show that our approach achieves competitive results, outperforming random selection, least confidence, and the estimated loss method.
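The selection loop is simple once a loss-prediction (LP) head exists: score every unlabeled utterance with its predicted ASR loss and send the highest-scoring ones for annotation. A hedged sketch of the LP head and the selection step; the feature shapes, pooling, and loader protocol are assumptions, and training the LP head to regress the true ASR loss is omitted.

```python
import torch
import torch.nn as nn

class LossPredictionHead(nn.Module):
    """Predict a sample's ASR loss from intermediate encoder features;
    a high predicted loss signals high uncertainty."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, encoder_features):     # (batch, time, feat_dim)
        return self.net(encoder_features.mean(dim=1)).squeeze(-1)

def select_for_labeling(lp_head, encoder, unlabeled_loader, budget=100):
    """Rank the unlabeled pool by predicted loss; pick the top `budget`."""
    scores = []
    with torch.no_grad():
        for idx, audio_feats in unlabeled_loader:
            pred_loss = lp_head(encoder(audio_feats))
            scores += list(zip(idx.tolist(), pred_loss.tolist()))
    scores.sort(key=lambda s: s[1], reverse=True)
    return [i for i, _ in scores[:budget]]
```

In the joint model, the LP head is trained alongside the ASR network with the observed CTC/attention losses as its regression targets, which is what makes the predicted loss a usable uncertainty proxy at selection time.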

【3】 Training a Deep Neural Network via Policy Gradients for Blind Source Separation in Polyphonic Music Recordings

Authors: Sören Schulze, Johannes Leuschner, Emily J. King
Affiliations: Center for Industrial Mathematics, University of Bremen, Bremen, Germany; Mathematics Department, Colorado State University, Fort Collins, CO, USA
Link: https://arxiv.org/abs/2107.04235
Abstract: We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual STFT time frames. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of different audio samples, both acoustic and synthetic, provided that the sample contains enough data for the training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.

【4】 Incorporating Multi-Target in Multi-Stage Speech Enhancement Model for Better Generalization

Authors: Lu Zhang, Mingjiang Wang, Andong Li, Zehua Zhang, Xuyi Zhuang
Affiliations: Harbin Institute of Technology, Shenzhen, China; Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Comments: Submitted to APSIPA-ASC 2021
Link: https://arxiv.org/abs/2107.04232
Abstract: Recent single-channel speech enhancement methods based on deep neural networks (DNNs) have achieved remarkable results, but there are still generalization problems in real scenes. Like other data-driven methods, DNN-based speech enhancement models produce significant performance degradation on untrained data. In this study, we make full use of the contribution of multi-target joint learning to the model generalization capability, and propose a lightweight and low-computing dilated convolutional network (DCN) model for a more robust speech denoising task. Our goal is to integrate the masking target, the mapping target, and the parameters of the traditional speech enhancement estimator into a DCN model to maximize their complementary advantages. To do this, we build a multi-stage learning framework to deal with multiple targets in stages to achieve their joint learning, namely `MT-in-MS'. Our experimental results show that compared with the state-of-the-art time domain and time-frequency domain models, this proposed low-cost DCN model can achieve better generalization performance in speaker, noise, and channel mismatch cases.
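The stage-wise idea can be sketched with two dilated-convolution stages supervised by different targets: stage 1 predicts a time-frequency mask, and stage 2 maps the masked spectrum toward the clean one. This is a hedged illustration of multi-target, multi-stage learning in general, not the paper's MT-in-MS architecture; the channel counts, dilation schedule, and target choices are assumptions.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
    def forward(self, x):
        return torch.relu(self.conv(x)) + x    # residual dilated conv

class MultiStageEnhancer(nn.Module):
    """Stage 1: masking target; stage 2: mapping target. Both outputs are
    supervised jointly so the targets regularize each other."""
    def __init__(self, freq_bins=161):
        super().__init__()
        ch = freq_bins
        self.stage1 = nn.Sequential(*[DilatedBlock(ch, 2**i) for i in range(4)])
        self.mask_out = nn.Conv1d(ch, freq_bins, kernel_size=1)
        self.stage2 = nn.Sequential(*[DilatedBlock(ch, 2**i) for i in range(4)])
        self.map_out = nn.Conv1d(ch, freq_bins, kernel_size=1)

    def forward(self, noisy_mag):              # (batch, freq_bins, time)
        mask = torch.sigmoid(self.mask_out(self.stage1(noisy_mag)))
        masked = mask * noisy_mag               # masking target (stage 1)
        clean_est = self.map_out(self.stage2(masked))  # mapping target
        return masked, clean_est

masked, clean_est = MultiStageEnhancer()(torch.randn(2, 161, 200).abs())
print(masked.shape, clean_est.shape)
```

A joint loss such as L = L1(masked, clean) + L1(clean_est, clean) would then supervise both stages, which is one plausible reading of how the complementary targets are combined.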

【5】 Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation

Authors: Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao
Affiliation: Ping An Technology (Shenzhen) Co., Ltd.
Comments: Will be presented at INTERSPEECH 2021
Link: https://arxiv.org/abs/2107.04227
Abstract: Predicting the altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose to introduce two dropout regularization methods into the pretraining of the transformer encoder: (1) attention dropout, (2) layer dropout. Both dropout methods encourage the model to utilize global speech information, and avoid just copying local spectrum features when reconstructing the masked frames. We evaluated the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.

【6】 On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

Authors: Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer
Affiliation: Facebook AI, USA
Comments: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.04154
Abstract: Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). In this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, for wordpiece/mono-char/bi-char/chenone units, with both HMM/CTC topologies. From this framework, we propose three novel training schemes: chenone(ch)/wordpiece(wp)-CTC-bMMI, and wordpiece(wp)-HMM-bMMI, with different advantages in training performance, decoding efficiency and decoding time-stamp accuracy. The advantages of the different training schemes are evaluated comprehensively on Librispeech, and wp-CTC-bMMI and ch-CTC-bMMI are evaluated on two real-world ASR tasks to show their effectiveness. Besides, we also show that bi-char(bc) HMM-MMI models can serve as better alignment models than traditional non-neural GMM-HMMs.
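The bMMI numerator/denominator graph machinery is beyond a short snippet, but the CTC half of the framework maps onto a standard library call: CTC marginalizes over all frame-level alignments of a label sequence, playing the role of a numerator graph under the CTC topology. A minimal usage sketch with dummy tensors; shapes follow torch.nn.CTCLoss conventions, and the unit inventory size is an assumption.

```python
import torch
import torch.nn as nn

# Frames T, batch N, output units C (wordpieces/chenones + blank at index 0).
T, N, C, L = 200, 8, 100, 30
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, L))            # label sequences, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), L, dtype=torch.long)

# CTC sums over every valid alignment of each label sequence to the frames.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```

In the paper's terms, swapping this criterion's implicit topology for an HMM one, or adding a denominator graph for bMMI, changes the training scheme while the acoustic model emitting log_probs stays the same.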

【7】 Improved Breath Phase and Continuous Adventitious Sound Detection in Lung and Tracheal Sound Using Mixed Set Training and Domain Adaptation

Authors: Fu-Shun Hsu, Shang-Ran Huang, Chang-Fu Su, Chien-Wen Huang, Yuan-Ren Cheng, Chun-Chieh Chen, Chun-Yu Wu, Chung-Wei Chen, Yen-Chun Lai, Tang-Wei Cheng, Nian-Jhen Lin, Wan-Ling Tsai, Ching-Shiang Lu, Chuan Chen, Feipei Lai
Affiliations: (a) Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University; (b) Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei, Taiwan; (c) Heroic Faith Medical Science Co., Ltd., Taipei, Taiwan
Comments: To be submitted; 31 pages, 6 figures, 5 tables
Link: https://arxiv.org/abs/2107.04229
Abstract: Previously, we established a lung sound database, HF_Lung_V2, and proposed convolutional bidirectional gated recurrent unit (CNN-BiGRU) models with adequate ability for inhalation, exhalation, continuous adventitious sound (CAS), and discontinuous adventitious sound detection in the lung sound. In this study, we proceeded to build a tracheal sound database, HF_Tracheal_V1, containing 11107 15-second tracheal sound recordings, 23087 inhalation labels, 16728 exhalation labels, and 6874 CAS labels. The tracheal sound in HF_Tracheal_V1 and the lung sound in HF_Lung_V2 were either combined or used alone to train the CNN-BiGRU models for respective lung and tracheal sound analysis. Different training strategies were investigated and compared: (1) using full training (training from scratch) to train the lung sound models using lung sound alone and train the tracheal sound models using tracheal sound alone, (2) using a mixed set that contains both the lung and tracheal sound to train the models, and (3) using domain adaptation that finetuned the pre-trained lung sound models with the tracheal sound data and vice versa. Results showed that the models trained only by lung sound performed poorly in the tracheal sound analysis and vice versa. However, the mixed set training and domain adaptation can improve the performance of exhalation and CAS detection in the lung sound, and inhalation, exhalation, and CAS detection in the tracheal sound compared to positive controls (lung models trained only by lung sound and vice versa). In particular, a model derived from the mixed set training prevails in the situation of killing two birds with one stone.

【8】 Multi-path Convolutional Neural Networks Efficiently Improve Feature Extraction in Continuous Adventitious Lung Sound Detection

Authors: Fu-Shun Hsu, Shang-Ran Huang, Chien-Wen Huang, Chun-Chieh Chen, Yuan-Ren Cheng, Feipei Lai
Affiliations: Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan; Department of Critical Care Medicine, Far Eastern Memorial Hospital, New Taipei, Taiwan; Heroic Faith Medical Science Co., Ltd., Taipei, Taiwan
Comments: To be submitted; 32 pages, 8 figures, 2 tables
Link: https://arxiv.org/abs/2107.04226
Abstract: We previously established a large lung sound database, HF_Lung_V2 (Lung_V2). We trained convolutional-bidirectional gated recurrent unit (CNN-BiGRU) networks for detecting inhalation, exhalation, continuous adventitious sound (CAS) and discontinuous adventitious sound at the recording level on the basis of Lung_V2. However, the performance of CAS detection was poor due to many reasons, one of which is the highly diversified CAS patterns. To make the original CNN-BiGRU model learn the CAS patterns more effectively without causing too much computing burden, three strategies involving minimal modifications of the network architecture of the CNN layers were investigated: (1) making the CNN layers a bit deeper by using residual blocks, (2) making the CNN layers a bit wider by increasing the number of CNN kernels, and (3) separating the feature input into multiple paths (the model was denoted by Multi-path CNN-BiGRU). The performance of CAS segment and event detection was evaluated. Results showed that improvement in CAS detection was observed among all the proposed architecture-modified models. The F1 score for CAS event detection of the proposed models increased from 0.445 to 0.491-0.530, which was deemed significant. However, the Multi-path CNN-BiGRU model outperformed the other models in terms of the number of winning titles (five of the total nine evaluation metrics). In addition, the Multi-path CNN-BiGRU model did not cause extra computing burden (0.97-fold inference time) compared to the original CNN-BiGRU model. Conclusively, the Multi-path CNN layers can efficiently improve the effectiveness of feature extraction and subsequently result in better CAS detection.

【9】 EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments

Authors: Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra
Affiliations: Facebook Reality Labs Research, USA; Facebook AI Applied Research, UK
Comments: Dataset is available at: this https URL
Link: https://arxiv.org/abs/2107.04174
Abstract: Augmented Reality (AR) as a platform has the potential to facilitate the reduction of the cocktail party effect. Future AR headsets could potentially leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beamforming and speech enhancement require high-quality representative data. To the best of the authors' knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target of speech and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.

【10】 Improved Language Identification Through Cross-Lingual Self-Supervised Learning

Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli
Affiliation: Facebook AI, USA
Comments: Submitted to ASRU 2021
Link: https://arxiv.org/abs/2107.04082
Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrained speech in multiple languages and not just on English. We show that models pre-trained on many languages perform better and enable language identification systems that require very little labeled data to perform well. Results on a 25-language setup show that with only 10 minutes of labeled data per language, a cross-lingually pre-trained model can achieve over 93% accuracy.

【11】 Machine Learning for Stuttering Identification: Review, Challenges & Future Directions

Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
Affiliation: Université de Lorraine
Comments: Under review in ACM Computing Surveys
Link: https://arxiv.org/abs/2107.04057
Abstract: Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetitions of sounds. Stuttering identification is an interesting interdisciplinary research problem which involves pathology, psychology, acoustics, and signal processing, which makes it hard and complicated to detect. Recent developments in machine and deep learning have dramatically revolutionized the speech domain, however minimal attention has been given to stuttering identification. This work fills the gap by trying to bring researchers together from interdisciplinary fields. In this paper, we comprehensively review acoustic features, and statistical and deep learning based stuttering/disfluency classification methods. We also present several challenges and possible future directions.
