Finance / Speech / Audio Processing arXiv Digest [7.5]

WeChat official account: arXiv每日学术速递

Published 2021-07-27 10:22:10


Included in the column: arXiv每日学术速递

q-fin (Finance): 9 papers

cs.SD (Speech): 4 papers

eess.AS (Audio Processing): 5 papers

1. q-fin (Finance):

【1】 Temporal Analysis of Worldwide War

Authors: Devansh Bajpai, Rishi Ranjan Singh
Affiliation: Department of Electrical Engineering and Computer Science, Indian Institute of Technology, Bhilai, Chhattisgarh, India
Link: https://arxiv.org/abs/2107.01098
Abstract: Analysis of wars and conflicts between regions has been an important topic of interest throughout the history of humankind. In the latter part of the 20th century, in the aftermath of two World Wars and the shadow of nuclear, biological, and chemical holocaust, more was written on the subject than ever before. Wars have a negative impact on a country's economy, social order, infrastructure, and public health. In this paper, we study the wars fought in history and draw conclusions from them. We explore the participation of countries in wars and the nature of relationships between various countries during different timelines. A big part of today's wars is fought against terrorism. Therefore, this study also attempts to shed light on different countries' exposure to terrorist encounters and analyses the impact of wars on a country's economy in terms of change in GDP.

【2】 Reverse Sensitivity Analysis for Risk Modelling

Authors: Silvana M. Pesenti
Affiliation: Department of Statistical Sciences, University of Toronto
Link: https://arxiv.org/abs/2107.01065
Abstract: We consider the problem where a modeller conducts sensitivity analysis of a model consisting of random input factors, a corresponding random output of interest, and a baseline probability measure. The modeller seeks to understand how the model (the distribution of the input factors as well as the output) changes under a stress on the output's distribution. Specifically, for a stress on the output random variable, we derive the unique stressed distribution of the output that is closest in the Wasserstein distance to the baseline output's distribution and satisfies the stress. We further derive the stressed model, including the stressed distribution of the inputs, which can be calculated in a numerically efficient way from a set of baseline Monte Carlo samples. The proposed reverse sensitivity analysis framework is model-free and allows for stresses on the output such as (a) the mean and variance, (b) any distortion risk measure including the Value-at-Risk and Expected-Shortfall, and (c) expected utility type constraints, thus making the reverse sensitivity analysis framework suitable for risk models.
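
As a rough illustration of stress type (a) above: if the Wasserstein distance is taken to be of order 2 (an assumption here, the abstract does not state the order), then in one dimension the distribution closest to the baseline with a prescribed mean and standard deviation is an increasing affine transform of the baseline. The sketch below applies that special case to a set of output samples. The function name `stress_mean_std` and the sample values are hypothetical, and the sketch does not cover the paper's general construction or the stressed input distributions.

```python
import statistics


def stress_mean_std(samples, target_mean, target_std):
    """Affinely rescale baseline samples so they hit a target mean and std.

    Assumption: order-2 Wasserstein distance in one dimension, where this
    affine map gives the closest distribution satisfying both constraints.
    """
    base_mean = statistics.fmean(samples)
    base_std = statistics.pstdev(samples)
    scale = target_std / base_std  # positive, so the sample ordering is kept
    return [target_mean + scale * (x - base_mean) for x in samples]


# Hypothetical baseline Monte Carlo outputs (e.g. simulated losses).
baseline_losses = [1.0, 2.0, 2.5, 4.0, 7.5]

stressed_losses = stress_mean_std(
    baseline_losses,
    target_mean=1.1 * statistics.fmean(baseline_losses),  # mean up by 10%
    target_std=1.2 * statistics.pstdev(baseline_losses),  # std up by 20%
)
print(stressed_losses)
```

Because the map is increasing, each stressed sample keeps the rank of its baseline sample, which is what allows a stress on the output to be traced back to the scenarios that produced it.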

【3】 Effectiveness of Artificial Intelligence in Stock Market Prediction based on Machine Learning

Authors: Sohrab Mokhtari, Kang K. Yen, Jin Liu
Affiliation: Electrical and Computer Engineering, Florida International University, Miami, USA
Link: https://arxiv.org/abs/2107.01031
Abstract: This paper tries to address the problem of stock market prediction leveraging artificial intelligence (AI) strategies. The stock market prediction can be modeled based on two principal analyses called technical and fundamental. In the technical analysis approach, regression machine learning (ML) algorithms are employed to predict the stock price trend at the end of a business day based on the historical price data. In contrast, in the fundamental analysis, classification ML algorithms are applied to classify the public sentiment based on news and social media. In the technical analysis, the historical price data is taken from Yahoo Finance, and in the fundamental analysis, public tweets on Twitter associated with the stock market are investigated to assess the impact of sentiments on the stock market's forecast. The results show a median performance, implying that with the current technology of AI, it is too soon to claim AI can beat the stock markets.

【4】 MegazordNet: combining statistical and machine learning standpoints for time series forecasting

Authors: Angelo Garangau Menezes, Saulo Martiello Mastelini
Affiliation: Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Av. Trabalhador São Carlense, São Carlos, SP, Brasil
Link: https://arxiv.org/abs/2107.01017
Abstract: Forecasting financial time series is considered to be a difficult task due to the chaotic feature of the series. Statistical approaches have shown solid results in some specific problems such as predicting market direction and single-price of stocks; however, with the recent advances in deep learning and big data techniques, new promising options have arisen to tackle financial time series forecasting. Moreover, recent literature has shown that employing a combination of statistics and machine learning may improve accuracy in the forecasts in comparison to single solutions. Taking into consideration the mentioned aspects, in this work, we propose MegazordNet, a framework that explores statistical features within a financial series combined with a structured deep learning model for time series forecasting. We evaluated our approach predicting the closing price of stocks in the S&P 500 using different metrics, and we were able to beat single statistical and machine learning methods.

【5】 No Investment Fee Is Small, Long Term

Authors: Joseph Levine
Link: https://arxiv.org/abs/2107.00837
Abstract: This note presents a simple "back of the envelope" formula estimating that an annual investment fee expense of $\epsilon$%, compounding for $N$ years, consumes almost $N\epsilon$% of an investment's value. For example, an investment with 1% annual fee compounding for 30 years pays almost 30% of its final value to fees over this period. This approximation captures how annual fees can compound over a decade or more to cumulative costs an order of magnitude higher. The formula is simple enough for rapid hand calculation.
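
The rule of thumb can be checked numerically under one plausible reading, which is an assumption here because the abstract does not spell out the fee model: the fee is a fixed fraction of assets deducted once per year, and the cost is measured against the final value the investment would have reached with no fee. Under that reading the exact share lost is 1 - (1 - fee)^N, which stays below N times the fee. The function names are hypothetical.

```python
def fee_share_exact(annual_fee: float, years: int) -> float:
    """Share of the fee-free final value lost to a fee deducted every year."""
    return 1.0 - (1.0 - annual_fee) ** years


def fee_share_rule_of_thumb(annual_fee: float, years: int) -> float:
    """Back-of-the-envelope estimate: years times annual fee."""
    return annual_fee * years


# Compare the two for a 1% annual fee over several horizons.
for years in (10, 20, 30):
    exact = fee_share_exact(0.01, years)
    approx = fee_share_rule_of_thumb(0.01, years)
    print(f"{years:2d} years: exact {exact:.1%}, rule of thumb {approx:.1%}")
```

In this model the rule of thumb is an upper bound, and the gap widens as the product of fee and horizon grows, so it is best read as a quick estimate rather than an exact figure.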

【6】 Who Votes for Library Bonds? A Principal Component Exploration

Authors: Eric Jacobson
Affiliation: Weber State University, retired
Comments: 26 pages, 4 tables
Link: https://arxiv.org/abs/2107.01095
Abstract: Previous research has shown a relationship between voter characteristics and voter support for tax bonds. These findings, however, are difficult to interpret because of the high degree of collinearity across the measures. From 13 demographic measures of voters in a library bond election, seven independent principal components were extracted which accounted for 95 percent of the variance. Whereas the direct demographic measures showed inconsistent relationships with voting, the principal components of low SES, college experience, female and service job were related to affirmative voting, while high home value was related to negative voting.

【7】 Time series models with infinite-order partial copula dependence

Authors: Martin Bladt, Alexander J. McNeil
Affiliations: University of Lausanne; The York Management School, University of York
Comments: 30 pages, 4 figures
Link: https://arxiv.org/abs/2107.00960
Abstract: Stationary and ergodic time series can be constructed using an s-vine decomposition based on sets of bivariate copula functions. The extension of such processes to infinite copula sequences is considered and shown to yield a rich class of models that generalizes Gaussian ARMA and ARFIMA processes to allow both non-Gaussian marginal behaviour and a non-Gaussian description of the serial partial dependence structure. Extensions of classical causal and invertible representations of linear processes to general s-vine processes are proposed and investigated. A practical and parsimonious method for parameterizing s-vine processes using the Kendall partial autocorrelation function is developed. The potential of the resulting models to give improved statistical fits in many applications is indicated with an example using macroeconomic data.

【8】 Formation of coalition structures as a non-cooperative game

Authors: Dmitry Levando
Comments: Submitted to the Dynamic Games and Applications; Special Issue: Group Formation and Farsightedness
Link: https://arxiv.org/abs/2107.00711
Abstract: We study coalition structure formation with intra- and inter-coalition externalities in the introduced family of nested non-cooperative simultaneous finite games. A non-cooperative game embeds a coalition structure formation mechanism, and has two outcomes: an allocation of players over coalitions and a payoff for every player. Coalition structures of a game are described by Young diagrams. They serve to enumerate coalition structures and allocations of players over them. For every coalition structure a player has a finite set of strategies. A player chooses a coalition structure and a strategy. A (social) mechanism eliminates conflicts in individual choices and produces final coalition structures. Every final coalition structure is a non-cooperative game. Mixed equilibrium always exists and consists of a mixed strategy profile, payoffs and equilibrium coalition structures. We use a maximum coalition size to parametrize the family of the games. The non-cooperative game of Nash is a special case of the model. The result is different from the Shapley value, strong Nash and coalition-proof equilibria, core solutions, and other equilibrium concepts. We supply a few non-cooperative coalition structure stability criteria.

【9】 Mixed semimartingales: Volatility estimation in the presence of fractional noise

Authors: Carsten Chong, Thomas Delerue, Guoying Li
Affiliations: Columbia University; Technical University of Munich
Link: https://arxiv.org/abs/2106.16149
Abstract: We consider the problem of estimating volatility for high-frequency data when the observed process is the sum of a continuous Itô semimartingale and a noise process that locally behaves like fractional Brownian motion with Hurst parameter H. The resulting class of processes, which we call mixed semimartingales, generalizes the mixed fractional Brownian motion introduced by Cheridito [Bernoulli 7 (2001) 913-934] to time-dependent and stochastic volatility. Based on central limit theorems for variation functionals, we derive consistent estimators and asymptotic confidence intervals for H and the integrated volatilities of both the semimartingale and the noise part, in all cases where these quantities are identifiable. When applied to recent stock price data, we find strong empirical evidence for the presence of fractional noise, with Hurst parameters H that vary considerably over time and between assets.

2. cs.SD (Speech):

【1】 Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

Authors: Nikita Pavlichenko, Ivan Stelmakh, Dmitry Ustalov
Affiliations: Yandex, Moscow, Russia; Carnegie Mellon University, Pittsburgh, PA, USA; Saint Petersburg, Russia
Link: https://arxiv.org/abs/2107.01091
Abstract: Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. Crowdsourcing has become one of the standard tools for cheap and time-efficient data collection for simple problems such as image classification, thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle towards designing advanced aggregation methods is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech -- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing aggregation methods on our data shows room for improvement, suggesting that our work may entail the design of better algorithms. At a higher level, we also contribute to the more general challenge of collecting high-quality datasets using crowdsourcing: we develop a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability on an under-resourced language by constructing VoxDIY -- a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing.

【2】 Supervised Contrastive Learning for Accented Speech Recognition

Authors: Tao Han, Hantao Huang, Ziang Yang, Wei Han
Affiliation: Mediatek Singapore
Comments: Accented speech recognition, deep neural networks, model adaptation, supervised contrastive learning
Link: https://arxiv.org/abs/2107.00921
Abstract: Neural network based speech recognition systems suffer from performance degradation due to accented speech, especially unfamiliar accents. In this paper, we study the supervised contrastive learning framework for accented speech recognition. To build different views (similar "positive" data samples) for contrastive learning, three data augmentation techniques including noise injection, spectrogram augmentation and TTS-same-sentence generation are further investigated. From the experiments on the Common Voice dataset, we have shown that contrastive learning helps to build data-augmentation invariant and pronunciation invariant representations, and significantly outperforms traditional joint training methods in both zero-shot and full-shot settings. Experiments show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average, compared to the joint training method.
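
For readers unfamiliar with the objective, the following is a minimal sketch of a generic supervised contrastive loss in plain Python. It is not the paper's implementation: the temperature value, the toy embeddings, and the choice of labels (here a hypothetical sentence ID, so that augmented views of the same sentence count as positives) are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Similarity of the anchor to every other sample, scaled by temperature.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total -= sum(logits[p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy example: two views each of two sentences (labels are sentence IDs).
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(views, labels=[0, 0, 1, 1]))
```

The loss is small when views sharing a label sit close together and far from the others, which is the property that encourages augmentation-invariant representations.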

【3】 Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability

Authors: Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee
Affiliation: Digital Futures, and School of Electrical Engg. and Computer Sc., KTH Royal Institute of Technology, Sweden
Comments: 12 pages, 4 figures
Link: https://arxiv.org/abs/2107.00730
Abstract: In pursuit of explainability, we develop generative models for sequential data. The proposed models provide state-of-the-art classification results and robust performance for speech phone classification. We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models - HMMs). Normalizing flow-based mixture models (NMMs) are used to model the conditional probability distribution given the hidden state in the HMMs. Model parameters are learned through judicious combinations of time-tested Bayesian learning methods and contemporary neural network learning methods. We mainly combine expectation-maximization (EM) and mini-batch gradient descent. The proposed generative models can compute the likelihood of data and are hence directly suitable for a maximum-likelihood (ML) classification approach. Due to the structural flexibility of HMMs, we can use different normalizing flow models. This leads to different types of HMMs providing diversity in data modeling capacity. The diversity provides an opportunity for easy decision fusion from different models. For a standard speech phone classification setup involving 39 phones (classes) and the TIMIT dataset, we show that the use of standard features called mel-frequency cepstral coefficients (MFCCs), the proposed generative models, and the decision fusion together can achieve 86.6% accuracy by generative training only. This result is close to state-of-the-art results, for example, 86.2% accuracy of the PyTorch-Kaldi toolkit [1], and 85.1% accuracy using light gated recurrent units [2]. We do not use any discriminative learning approach and related sophisticated features in this article.

【4】 Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

Authors: Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw
Affiliation: Google LLC, USA
Link: https://arxiv.org/abs/2107.01201
Abstract: In this paper, we propose a solution to allow speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it for three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models such as personal VAD and personalized ASR.
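
The idea of collapsing several enrolled-speaker embeddings into one attentive embedding can be sketched as follows. This is an illustration only: the abstract does not specify how the attention scores are computed, so the dot-product scoring against a query vector, the function names, and the toy vectors are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_embedding(query, speaker_embeddings):
    """Weighted sum of speaker embeddings, weighted by similarity to the query.

    Assumption: dot-product scores; the real model may score differently.
    """
    scores = [
        sum(q * e for q, e in zip(query, embedding))
        for embedding in speaker_embeddings
    ]
    weights = softmax(scores)
    dim = len(speaker_embeddings[0])
    combined = [
        sum(w * embedding[d] for w, embedding in zip(weights, speaker_embeddings))
        for d in range(dim)
    ]
    return combined, weights


# Three hypothetical enrolled users and a query resembling the second one.
enrolled = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding, weights = attentive_embedding([0.1, 2.0, 0.1], enrolled)
print(weights)
```

Because the output has the same dimension as a single speaker embedding regardless of how many users are enrolled, it can serve as a fixed-size side input, which matches the single-pass property described in the abstract.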

3. eess.AS (Audio Processing):

【1】 Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

Authors: Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw
Affiliation: Google LLC, USA
Link: https://arxiv.org/abs/2107.01201
Abstract: Cross-listed; identical to entry 【4】 of section 2 (cs.SD).

【2】 Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

Authors: Qiujia Li, Chao Zhang, Philip C. Woodland
Affiliation: Department of Engineering, University of Cambridge
Comments: Submitted to IEEE/ACM Transactions on Audio Speech and Language Processing
Link: https://arxiv.org/abs/2107.00764
Abstract: Commonly used automatic speech recognition (ASR) systems can be classified into frame-synchronous and label-synchronous categories, based on whether the speech is decoded on a per-frame or per-label basis. Frame-synchronous systems, such as traditional hidden Markov model systems, can easily incorporate existing knowledge and can support streaming ASR applications. Label-synchronous systems, based on attention-based encoder-decoder models, can jointly learn the acoustic and language information with a single model, which can be regarded as audio-grounded language models. In this paper, we propose rescoring the N-best hypotheses or lattices produced by a first-pass frame-synchronous system with a label-synchronous system in a second pass. By exploiting the complementary modelling of the different approaches, the combined two-pass systems achieve competitive performance without using any extra speech or text data on two standard ASR tasks. For the 80-hour AMI IHM dataset, the combined system has a 13.7% word error rate (WER) on the evaluation set, which is up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the WERs of the combined system are 5.7% and 12.1% on Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on Switchboard Cellular and Fisher subsets of RT03, up to a 33% relative reduction in WER over the individual systems.
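
Second-pass N-best rescoring can be sketched in a few lines. This is a generic illustration, not the paper's combination rule: the linear interpolation of log scores, the `weight` parameter, and the example hypotheses and scores are assumptions. A small helper for relative WER reduction is included because the abstract reports results in that form.

```python
def rescore_nbest(hypotheses, weight=0.5):
    """Pick the best hypothesis after mixing first- and second-pass scores.

    hypotheses: list of (text, first_pass_logprob, second_pass_logprob).
    weight: hypothetical interpolation weight given to the second pass.
    """
    def combined_score(hypothesis):
        _, first_pass, second_pass = hypothesis
        return (1.0 - weight) * first_pass + weight * second_pass

    return max(hypotheses, key=combined_score)[0]


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, e.g. 20% -> 15% is a 25% relative reduction."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical N-best list: the first pass prefers the first hypothesis,
# the second pass strongly prefers the second one.
nbest = [
    ("recognise speech", -1.0, -5.0),
    ("wreck a nice beach", -1.5, -2.0),
]
print(rescore_nbest(nbest, weight=0.5))
print(f"{relative_reduction(20.0, 15.0):.0%}")
```

Setting `weight` to 0 reproduces the first-pass decision, so the parameter directly controls how much the second-pass system is trusted.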

【3】 Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

Authors: Nikita Pavlichenko, Ivan Stelmakh, Dmitry Ustalov
Affiliations: Yandex, Moscow, Russia; Carnegie Mellon University, Pittsburgh, PA, USA; Saint Petersburg, Russia
Link: https://arxiv.org/abs/2107.01091
Abstract: Cross-listed; identical to entry 【1】 of section 2 (cs.SD).

【4】 Supervised Contrastive Learning for Accented Speech Recognition

Authors: Tao Han, Hantao Huang, Ziang Yang, Wei Han
Affiliation: Mediatek Singapore
Comments: Accented speech recognition, deep neural networks, model adaptation, supervised contrastive learning
Link: https://arxiv.org/abs/2107.00921
Abstract: Cross-listed; identical to entry 【2】 of section 2 (cs.SD).

【5】 Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability

Authors: Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee
Affiliation: Digital Futures, and School of Electrical Engg. and Computer Sc., KTH Royal Institute of Technology, Sweden
Comments: 12 pages, 4 figures
Link: https://arxiv.org/abs/2107.00730
Abstract: Cross-listed; identical to entry 【3】 of section 2 (cs.SD).

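This article is part of the Tencent Cloud self-media synchronized exposure program and is shared from a WeChat official account.

Originally published: 2021-07-05. In case of infringement, contact cloudcommunity@tencent.com for removal.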
