Finance/Speech/Audio Processing arXiv Digest [8.30]

Author: arXiv每日学术速递 (WeChat official account)
Published: 2021-09-16 14:30:29


q-fin (Quantitative Finance): 2 papers

cs.SD (Sound): 7 papers

eess.AS (Audio and Speech Processing): 9 papers

1. q-fin (Quantitative Finance):

【1】 European option pricing under generalized fractional Brownian motion Link: https://arxiv.org/abs/2108.12042

Authors: Axel A. Araneda Affiliation: Institute of Financial Complex Systems, Masaryk University, Lipová, Brno, Czech Republic Abstract: The generalized fractional Brownian motion (gfBm) is a stochastic process that generalizes fractional, sub-fractional, and standard Brownian motion. Here we study its application to the option pricing problem by means of the valuation of a European call option. By deriving the generalized fractional Itô lemma and the related Fokker-Planck equation, closed-form pricing formulas for both the Black-Scholes and CEV models driven by gfBm are obtained.
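For orientation, the sketch below prices a European call under the standard Black-Scholes model, the special case gfBm reduces to when it degenerates to standard Brownian motion; the paper's gfBm/CEV closed forms, with their Hurst-dependent variance terms, are not reproduced here.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(S, K, T, r, sigma):
    """European call under standard Black-Scholes: the limiting case the paper's
    gfBm formulas generalize. The gfBm prices replace sigma**2 * T with a
    Hurst-dependent variance term (see the paper); not reproduced here."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = NormalDist().cdf
    return S * N(d1) - K * exp(-r * T) * N(d2)

print(bs_call(S=100.0, K=100.0, T=1.0, r=0.02, sigma=0.2))  # about 8.92
```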

【2】 Asymptotically optimal strategies in a diffusion approximation of a repeated betting game Link: https://arxiv.org/abs/2108.11998

Authors: Mikhail Zhitlukhin Affiliation: Steklov Mathematical Institute of the Russian Academy of Sciences Abstract: We construct a diffusion approximation of a repeated game in which agents make bets on outcomes of i.i.d. random vectors and their strategies are close to an asymptotically optimal strategy. This model can be interpreted as trading in an asset market with short-lived assets. We obtain sufficient conditions for a strategy to maintain a strictly positive share of total wealth over the infinite time horizon. For the game with two players, we find necessary and sufficient conditions for the wealth share process to be transient or recurrent in this model, and also in its generalization with Markovian regime switching.
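As a generic illustration of the setting being approximated (repeated i.i.d. bets under an asymptotically optimal rule), the sketch below simulates classical log-optimal (Kelly) betting; it is not the paper's diffusion model, and all parameters are illustrative.

```python
import random

def kelly_fraction(p, b):
    """Log-optimal stake for a binary bet paying b per unit staked with prob p."""
    return max(0.0, (p * (b + 1.0) - 1.0) / b)

random.seed(0)
p, b = 0.55, 1.0              # illustrative edge: 55% win prob, even payout
f = kelly_fraction(p, b)      # = 0.10: stake 10% of wealth each round
wealth = 1.0
for _ in range(1000):         # repeated i.i.d. bets
    wealth *= 1.0 + f * b if random.random() < p else 1.0 - f
print(f, wealth)
```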

2. cs.SD (Sound):

【1】 Music Composition with Deep Learning: A Review Link: https://arxiv.org/abs/2108.12290

Authors: Carlos Hernandez-Olivan, Jose R. Beltran Affiliation: Department of Engineering and Communications, Calle María de Luna, Universidad de Zaragoza Abstract: Generating a complex work of art such as a musical composition requires exhibiting true creativity, which depends on a variety of factors related to the hierarchy of musical language. Music generation has been approached with algorithmic methods and, more recently, with Deep Learning models that are also used in other fields such as Computer Vision. In this paper we put into context the existing relationships between AI-based music composition models and human musical composition and creativity processes. We give an overview of recent Deep Learning models for music composition and compare these models to the music composition process from a theoretical point of view. We try to answer some of the most relevant open questions for this task by analyzing, among other aspects, the ability of current Deep Learning models to generate music with creativity and the similarity between AI and human composition processes.

【2】 Injecting Text in Self-Supervised Speech Pretraining Link: https://arxiv.org/abs/2108.12226

Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno Affiliation: Google, Inc. Note: submitted to ASRU 2021 Abstract: Self-supervised pretraining for Automatic Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining. We demonstrate that this novel pretraining method yields a 10% relative Word Error Rate (WER) reduction on the well-benchmarked Librispeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The proposed method also serves as an effective strategy to compensate for a lack of transcribed speech, effectively matching the performance of 5000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.
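A minimal sketch of the loss coupling described: a contrastive loss on untranscribed speech plus a sequence loss that ties TTS-synthesized audio back to its source text. The argument names, batch fields, and the weight `lam` are illustrative assumptions, not the paper's exact recipe.

```python
def tts4pretrain_step(speech_batch, text_batch, encoder, tts,
                      contrastive_loss, seq_loss, lam=0.3):
    """One pretraining step (sketch). Untranscribed speech drives a wav2vec2-style
    contrastive term; unspoken text is synthesized with TTS and drives an auxiliary
    sequence loss, injecting lexical supervision into the shared encoder.
    All names and the weight `lam` are illustrative assumptions."""
    loss_c = contrastive_loss(encoder(speech_batch["audio"]))         # self-supervised
    synth_audio = tts(text_batch["token_ids"])                        # speech from unspoken text
    loss_s = seq_loss(encoder(synth_audio), text_batch["token_ids"])  # lexical/sequence term
    return loss_c + lam * loss_s                                      # coupled objective
```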

【3】 Separable Temporal Convolution plus Temporally Pooled Attention for Lightweight High-performance Keyword Spotting Link: https://arxiv.org/abs/2108.12146

Authors: Shenghua Hu, Jing Wang, Yujun Wang, Wenjing Yang Affiliations: Beijing Institute of Technology, Beijing, China; Xiaomi Inc., Beijing, China Abstract: Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. In this paper, we propose a temporally pooled attention module which can capture global features better than average pooling. Besides, we design a separable temporal convolution network which leverages depthwise-separable and temporal convolution to reduce the number of parameters and calculations. Finally, taking advantage of separable temporal convolution and temporally pooled attention, an efficient neural network (ST-AttNet) is designed for KWS systems. We evaluate the models on the publicly available Google Speech Commands dataset V1. The proposed model has 48K parameters, 1/6 of the state-of-the-art TC-ResNet14-1.5 model (305K), and achieves 96.6% accuracy, comparable to the TC-ResNet14-1.5 model (96.6%).
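A sketch of the temporally pooled attention idea: replace the uniform weights of average pooling with a learned, frame-dependent softmax weighting over time. Layer sizes are illustrative, and this is not the full ST-AttNet architecture.

```python
import torch
import torch.nn as nn

class TemporallyPooledAttention(nn.Module):
    """Sketch: a learned scalar score per frame replaces uniform AveragePool
    weights, so globally salient frames dominate the pooled feature."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                           # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)     # (batch, time, 1) attention weights
        return (w * x).sum(dim=1)                   # weighted sum over time -> (batch, dim)

pooled = TemporallyPooledAttention(64)(torch.randn(2, 100, 64))
print(pooled.shape)  # torch.Size([2, 64])
```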

【4】 Task-aware Warping Factors in Mask-based Speech Enhancement Link: https://arxiv.org/abs/2108.12128

Authors: Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka, Koji Okabe, Hitoshi Yamamoto Affiliation: Biometrics Research Laboratories, NEC Corporation, Japan Note: EUSIPCO 2021 (the 29th European Signal Processing Conference) Abstract: This paper proposes the use of two task-aware warping factors in mask-based speech enhancement (SE). One controls the balance between speech maintenance and noise removal in the training phase, while the other controls the SE power applied to a specific downstream task in the testing phase. Our intention is to alleviate the problem that SE systems trained to improve speech quality often fail to improve other downstream tasks, such as automatic speaker verification (ASV) and automatic speech recognition (ASR), because they do not share the same objectives. The proposed dual-warping-factor approach is easy to apply to any mask-based SE method, and it allows a single SE system to handle multiple tasks without task-dependent training. Its effectiveness has been confirmed on the SITW dataset for ASV evaluation and the LibriSpeech dataset for ASR and speech quality evaluations at 0-20 dB. We show that different warping values are necessary for a single SE system to achieve optimal performance with respect to the three tasks. With task-dependent warping factors, on 0 dB speech, speech quality improved by an 84.7% PESQ increase, ASV had a 22.4% EER reduction, and ASR had a 52.2% WER reduction. The effectiveness of the task-dependent warping factors was also cross-validated on the VoxCeleb-1 test set for ASV and the LibriSpeech dev-clean set for ASR and quality evaluations. The proposed method is highly effective and easy to apply in practice.
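One way to picture the test-phase warping factor is as an exponent on the estimated time-frequency mask, trading speech maintenance against noise removal per task. The exponent form below is an illustrative assumption, not the paper's exact parameterization.

```python
import numpy as np

def apply_warped_mask(noisy_mag, mask, beta):
    """Sketch of a test-phase warping factor: raise the estimated T-F mask to a
    task-dependent power before applying it. beta < 1 keeps more speech (gentler
    SE, plausibly better for ASV/ASR); beta > 1 removes more noise (quality).
    The exponent form is an illustrative assumption, not the paper's definition."""
    warped = np.clip(mask, 1e-8, 1.0) ** beta
    return warped * noisy_mag  # enhanced magnitude spectrogram

mag = np.abs(np.random.randn(257, 100))   # placeholder |STFT| frames
mask = np.random.rand(257, 100)           # placeholder mask estimate in [0, 1]
enhanced_for_asr = apply_warped_mask(mag, mask, beta=0.5)
enhanced_for_quality = apply_warped_mask(mag, mask, beta=1.5)
```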

【5】 Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement Link: https://arxiv.org/abs/2108.12105

Authors: Yuzi Yan, Wei-Qiang Zhang, Michael T. Johnson Affiliations: Beijing National Research Center for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, China; Electrical and Computer Engineering, College of Engineering, University of Kentucky, Lexington, KY, USA Note: 4 pages Abstract: As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a "full" attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ) compared with OM-LSA, CNN-LSTM, T-GSA and a unidirectional attention-based LSTM baseline.
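A sketch of the "full" (non-causal) attention idea over bidirectional features, where each focal frame can attend to frames after it as well as before. Dimensions and layer choices are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FullAttentionBlock(nn.Module):
    """Sketch: BiLSTM context followed by unmasked self-attention, so every
    focal frame sees past and future frames ('full' attention)."""
    def __init__(self, dim):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, x):               # x: (batch, time, dim)
        h, _ = self.bilstm(x)           # bidirectional context, still (batch, time, dim)
        out, _ = self.attn(h, h, h)     # no causal mask: future frames are visible
        return out

y = FullAttentionBlock(64)(torch.randn(2, 100, 64))
print(y.shape)  # torch.Size([2, 100, 64])
```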

【6】 4-bit Quantization of LSTM-based Speech Recognition Models Link: https://arxiv.org/abs/2108.12074

Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan Affiliation: IBM Research, USA Note: 5 pages, 3 figures; Andrea Fasoli and Chia-Yu Chen contributed equally; accepted to Interspeech 2021 Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve <0.5% and <1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits the degradation in 4-bit inference to 1.3%.
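The sketch below shows plain symmetric per-tensor 4-bit quantization, i.e., the kind of naïve baseline the paper improves on with per-layer customized quantizers and initializations (which are omitted here).

```python
import numpy as np

def quantize_int4_symmetric(w):
    """Naïve symmetric 4-bit quantization of a weight tensor: int4 range is
    [-8, 7]. Per-tensor scaling only; the paper's per-layer quantizer choices
    and initializations are not reproduced here."""
    scale = np.max(np.abs(w)) / 7.0               # map max magnitude into int4 range
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_symmetric(w)
print(np.abs(w - dequantize(q, s)).mean())        # mean quantization error
```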

【7】 Classification of Emotions and Evaluation of Customer Satisfaction from Speech in Real World Acoustic Environments Link: https://arxiv.org/abs/2108.11981

Authors: Luis Felipe Parra-Gallego, Juan Rafael Orozco-Arroyave Affiliations: University of Antioquia UdeA, Medellín, Colombia; Konecta Group S.A.S., Medellín, Colombia; Pattern Recognition Lab, Friedrich-Alexander University Erlangen-Nuremberg Abstract: This paper focuses on finding suitable features to robustly recognize emotions and evaluate customer satisfaction from speech in real acoustic scenarios. The classification of emotions is based on standard, well-known corpora, and the evaluation of customer satisfaction is based on recordings of real opinions given by customers about the service received during phone calls with call-center agents. The feature sets considered in this study include two speaker models, namely x-vectors and i-vectors, and also the well-known feature set introduced in the Interspeech 2010 Paralinguistics Challenge (I2010PC). Additionally, we introduce the use of phonation, articulation and prosody features extracted with the DisVoice framework as alternative feature sets to robustly model emotions and customer satisfaction from speech. The results indicate that the I2010PC feature set is the best approach to classify emotions in the standard databases typically used in the literature. When considering the recordings collected in the call center, without any control over the acoustic conditions, the best results are obtained with our articulation features. The I2010PC feature set includes 1584 measures while the articulation approach only includes 488 measures. We think the proposed approach is more suitable for real-world applications where the acoustic conditions are not controlled, and it is potentially more convenient for industrial applications.
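The experimental comparison boils down to training the same classifier on competing feature sets. The sketch below mirrors that protocol with an SVM and random placeholder matrices of the stated dimensionalities (1584 vs. 488); the actual corpora and classifier are not specified here, so everything below is illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real corpora: same labels, two feature sets.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # e.g., satisfied vs. not satisfied
X_i2010pc = rng.normal(size=(200, 1584))         # I2010PC: 1584 measures
X_articulation = rng.normal(size=(200, 488))     # articulation: 488 measures

for name, X in [("I2010PC", X_i2010pc), ("articulation", X_articulation)]:
    acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
    print(name, round(acc, 3))                   # compare feature sets fairly
```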

3. eess.AS (Audio and Speech Processing):

【1】 Music Composition with Deep Learning: A Review Link: https://arxiv.org/abs/2108.12290 (cross-listed; see cs.SD 【1】 above)

【2】 Injecting Text in Self-Supervised Speech Pretraining Link: https://arxiv.org/abs/2108.12226 (cross-listed; see cs.SD 【2】 above)

【3】 Grammar Based Identification Of Speaker Role For Improving ATCO And Pilot ASR Link: https://arxiv.org/abs/2108.12175

Authors: Amrutha Prasad, Juan Zuluaga-Gomez, Petr Motlicek, Oliver Ohneiser, Hartmut Helmke, Saeed Sarfjoo, Iuliia Nigmatulina Affiliations: Idiap Research Institute, Martigny, Switzerland; Brno University of Technology, Brno, Czechia; Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland; German Aerospace Center (DLR), Institute of Flight Guidance, Braunschweig, Germany Note: submitted to Interspeech 2021 Abstract: Assistant Based Speech Recognition (ABSR) for air traffic control is generally trained by pooling both Air Traffic Controller (ATCO) and pilot data. In practice, this is motivated by the fact that the proportion of pilot data is smaller than that of ATCO data, while their standard language of communication is similar. However, due to the data imbalance between ATCO and pilot and their differing acoustic conditions, ASR performance is usually significantly better for ATCOs than for pilots. In this paper, we propose to (1) split the ATCO and pilot data using an automatic approach exploiting ASR transcripts, and (2) treat ATCO and pilot ASR as two separate tasks for Acoustic Model (AM) training. For speaker-role classification of ATCO and pilot data, a hypothesized ASR transcript is generated with a seed model and subsequently used to classify the speaker role based on knowledge extracted from the grammar defined by the International Civil Aviation Organization (ICAO). This approach provides an average speaker-role identification accuracy of 83% for ATCO and pilot. Finally, we show that training AMs separately for each task, or using a multitask approach, is well suited for this data compared to an AM trained by pooling all data.
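A toy sketch of grammar-based role classification from ASR transcripts: score each hypothesis against role-specific phraseology patterns. The patterns below are illustrative stand-ins, not the ICAO grammar used in the paper.

```python
import re

# Illustrative role-specific phraseology patterns; the paper derives its rules
# from the ICAO-defined grammar, which is not reproduced here.
ATCO_PATTERNS = [r"\b(cleared|climb|descend|contact|turn)\b"]
PILOT_PATTERNS = [r"\b(wilco|roger|request|climbing|descending)\b"]

def classify_role(transcript: str) -> str:
    t = transcript.lower()
    if any(re.search(p, t) for p in PILOT_PATTERNS):
        return "pilot"
    if any(re.search(p, t) for p in ATCO_PATTERNS):
        return "ATCO"
    return "unknown"

print(classify_role("lufthansa four two climbing flight level three four zero"))  # pilot
print(classify_role("lufthansa four two descend flight level three one zero"))    # ATCO
```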

【4】 Improving callsign recognition with air-surveillance data in air-traffic communication Link: https://arxiv.org/abs/2108.12156

Authors: Iuliia Nigmatulina, Rudolf Braun, Juan Zuluaga-Gomez, Petr Motlicek Affiliations: Idiap Research Institute, Martigny, Switzerland; Institute of Computational Linguistics, University of Zürich; Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland Note: submitted to Interspeech 2021 Abstract: Automatic Speech Recognition (ASR) can be used to assist speech communication between pilots and air-traffic controllers. Its application can significantly reduce the complexity of the task and increase the reliability of transmitted information. Evidently, high-accuracy predictions are needed to minimize the risk of errors. In particular, high accuracy is required when recognizing key information, such as commands and callsigns, used to navigate pilots. Our results prove that surveillance data containing callsigns can help to considerably improve the recognition of a callsign in an utterance when the weights of probable callsign n-grams are reduced per utterance. In this paper, we investigate two approaches: (1) G-boosting, where callsign weights are adjusted at the language model level (G) and followed by a dynamic decoder with on-the-fly composition, and (2) lattice rescoring, where callsign information is introduced on top of lattices generated using a conventional decoder. Boosting callsign n-grams with the combination of the two methods allowed us to gain 28.4% absolute improvement in callsign recognition accuracy and up to 74.2% relative improvement in the WER of callsign recognition.
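A minimal sketch of the rescoring idea on an n-best list: hypotheses containing a callsign reported by surveillance data for that utterance receive a score boost (equivalently, a reduced n-gram cost). The n-best list, scores, and boost value are all illustrative; real lattice rescoring operates on full lattices, not n-best lists.

```python
def rescore(nbest, surveillance_callsigns, boost=2.0):
    """nbest: list of (hypothesis, score), higher score = better.
    Hypotheses containing a surveillance-confirmed callsign get a score boost."""
    rescored = []
    for hyp, score in nbest:
        if any(cs in hyp for cs in surveillance_callsigns):
            score += boost                       # reduce effective n-gram cost
        rescored.append((hyp, score))
    return max(rescored, key=lambda x: x[1])[0]  # best hypothesis after boosting

nbest = [("swiss two three four descend", -10.2),
         ("swiss two free for descend", -9.8)]
print(rescore(nbest, surveillance_callsigns={"swiss two three four"}))
```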

【5】 Separable Temporal Convolution plus Temporally Pooled Attention for Lightweight High-performance Keyword Spotting Link: https://arxiv.org/abs/2108.12146 (cross-listed; see cs.SD 【3】 above)

【6】 Task-aware Warping Factors in Mask-based Speech Enhancement Link: https://arxiv.org/abs/2108.12128 (cross-listed; see cs.SD 【4】 above)

【7】 Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement Link: https://arxiv.org/abs/2108.12105 (cross-listed; see cs.SD 【5】 above)

【8】 4-bit Quantization of LSTM-based Speech Recognition Models Link: https://arxiv.org/abs/2108.12074 (cross-listed; see cs.SD 【6】 above)

【9】 Classification of Emotions and Evaluation of Customer Satisfaction from Speech in Real World Acoustic Environments Link: https://arxiv.org/abs/2108.11981 (cross-listed; see cs.SD 【7】 above)

Originally published: 2021-08-30
