Finance / Speech / Audio Processing Academic Digest [9.6]

WeChat official account: arXiv每日学术速递 (arXiv Daily Academic Digest)
Published 2021-09-16 15:50:58

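Update: the H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, which covers CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, and offers search, bookmarking, and other features.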

q-fin (Finance): 3 papers

cs.SD (Speech): 4 papers

eess.AS (Audio Processing): 4 papers

1. q-fin (Finance):

【1】 What drives bitcoin? An approach from continuous local transfer entropy and deep learning classification models
Link: https://arxiv.org/abs/2109.01214

Authors: Andrés García-Medina, Toan Luu Duc Huynh
Affiliations: Centro de Investigación en Matemáticas, Unidad Monterrey, Apodaca, Mexico; Consejo Nacional de Ciencia y Tecnología, Cátedras Conacyt, Ciudad de México; WHU - Otto Beisheim School of Management, Germany; University of Economics Ho Chi Minh City, Vietnam
Abstract: Bitcoin has attracted attention from different market participants due to unpredictable price patterns. Sometimes, the price has exhibited big jumps. Bitcoin prices have also had extreme, unexpected crashes. We test the predictive power of a wide range of determinants on bitcoin's price direction under the continuous transfer entropy approach as a feature selection criterion. Accordingly, the statistically significant assets in the sense of permutation test on the nearest neighbour estimation of local transfer entropy are used as features or explanatory variables in a deep learning classification model to predict the price direction of bitcoin. The proposed variable selection methodology excludes the NASDAQ index and Tesla as drivers. Under different scenarios and metrics, the best results are obtained using the significant drivers during the pandemic as validation. In the test, the accuracy increased in the post-pandemic scenario of July 2020 to January 2021 without drivers. In other words, our results indicate that in times of high volatility, Bitcoin seems to autoregulate and does not need additional drivers to improve the accuracy of the price direction.
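
To make the idea of transfer entropy as a feature-selection score more concrete, here is a minimal sketch. It is not the estimator used in the paper: the abstract describes a continuous, nearest-neighbour estimate with a permutation test, whereas this sketch swaps in a simple plug-in estimate for already-discretised series with a history length of 1. The function name `transfer_entropy` and its arguments are illustrative assumptions.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy (in bits) from source to target.

    Simplifying assumptions: both series are discrete symbols of equal
    length, and only one step of history is used.
    TE = sum p(x1, x0, y0) * log2( p(x1 | x0, y0) / p(x1 | x0) )
    """
    # (next target value, current target value, current source value)
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_xxy = Counter(triples)
    c_xy = Counter((x0, y0) for _, x0, y0 in triples)
    c_xx = Counter((x1, x0) for x1, x0, _ in triples)
    c_x = Counter(x0 for _, x0, _ in triples)

    te = 0.0
    for (x1, x0, y0), c in c_xxy.items():
        # Ratio of the two conditional probabilities, written with counts.
        ratio = (c * c_x[x0]) / (c_xy[(x0, y0)] * c_xx[(x1, x0)])
        te += (c / n) * log2(ratio)
    return te


# Toy example: x simply copies y with a one-step delay.
y = [0, 1, 1, 0, 1, 0, 0, 1]
x = [0] + y[:-1]
print(transfer_entropy(y, x))
```

A candidate driver would be kept as a feature only when its score is significantly larger than the scores obtained after shuffling the source series, which is the role the permutation test plays in the abstract.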

【2】 Evacuation Network Modeling for Alternative Fuel Vehicles
Link: https://arxiv.org/abs/2109.01578

Authors: Denissa Sari Darmawi Purba, Eleftheria Kontou, Chrysafis Vogiatzis
Affiliations: Civil and Environmental Engineering, University of Illinois at Urbana-Champaign; Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign
Abstract: As the number of adopted alternative fuel vehicles increases, communities that are susceptible to hazardous events, such as hurricanes and wildfires, need to create new evacuation plans that account for their refueling needs. During emergencies that require preemptive evacuation, drivers using alternative fuel vehicles are left vulnerable under conventional evacuation routes which do not provide access to refueling stations on their way to shelters. In this paper, we formulate a novel evacuation routing problem which considers multiple types of fuel vehicles. Specifically, we introduce a $k$-spanning evacuation tree problem with hop constraints that capture the refueling needs of each vehicle fuel type $k \in K$ as they are routed to a shelter. We provide a mixed integer mathematical formulation for the problem along with a path-based reformulation which allows us to create a column-generation based matheuristic to efficiently solve the problem. Next, we apply the proposed framework to the Sioux Falls transportation network considering that refueling stations for alternative fuel vehicles are placed to serve habitual demands. We present a series of numerical experiments where we discuss optimal travel and refueling times under different driving ranges for each vehicle type. Our findings show that the characteristics of each vehicle fuel type (driving range and infrastructure siting) play a pivotal role in determining the optimal evacuation trees. Evacuation routes that are optimal for one type of vehicles are often infeasible for the remaining vehicles; furthermore, driving range constraints and the need to refuel could force evacuees to detour prior to reaching safety.
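
The claim that a route suited to one vehicle type can be infeasible for another can be illustrated with a small range check. This is only a sketch under stated assumptions, not the paper's mixed-integer formulation: the function `route_is_feasible`, the full tank at the origin, and the toy distances are all hypothetical.

```python
def route_is_feasible(leg_lengths, has_station, driving_range):
    """Check whether a vehicle can complete a route without running dry.

    leg_lengths[i]  -- length of the i-th leg of the route
    has_station[i]  -- True if the node at the end of leg i has a
                       refueling station compatible with this fuel type
    Assumption: the vehicle starts with a full tank and refuels to full
    at every compatible station it passes.
    """
    remaining = driving_range
    for length, station in zip(leg_lengths, has_station):
        if length > remaining:
            return False  # stranded before the end of this leg
        remaining -= length
        if station:
            remaining = driving_range
    return True


legs = [30, 40, 50]
# Same route, two fuel types with different station coverage.
print(route_is_feasible(legs, [False, True, False], driving_range=80))
print(route_is_feasible(legs, [False, False, False], driving_range=80))
```

With the same legs and the same range, the route is feasible only for the fuel type that finds a compatible station along the way, which is why a single evacuation tree cannot serve every vehicle type.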

【3】 Accelerating the Adoption of Disruptive Technologies: The Impact of COVID-19 on Intention to Use Self-Driving Vehicles
Link: https://arxiv.org/abs/2108.01615

Authors: Maher Said, Emma R. Zajdela, Amanda Stathopoulos
Affiliations: Department of Civil and Environmental Engineering, Northwestern University, Technological Institute, Sheridan Road, Evanston, IL, USA; Department of Engineering Sciences and Applied Mathematics, Technological Institute
Notes: Submitted to Transportation Research Board 2022
Abstract: One of the most notable global transportation trends is the accelerated pace of development in vehicle automation technologies. Uncertainty surrounds the future of automated mobility as there is no clear consensus on potential adoption patterns, ownership versus shared use status and travel impacts. Adding to this uncertainty is the impact of the COVID-19 pandemic that has triggered profound changes in mobility behaviors as well as accelerated the adoption of new technologies at an unprecedented rate. This study examines the impact of the COVID-19 pandemic on willingness to adopt the emerging new technology of self-driving vehicles. Using data from a survey disseminated in June 2020 to 700 respondents in the contiguous United States, we perform a difference-in-difference regression to analyze the shift in willingness to use autonomous vehicles as part of a shared fleet before and during the pandemic. The results reveal that the COVID-19 pandemic has a positive and highly significant impact on consideration of autonomous vehicles. This shift is present regardless of tech-savviness, gender or political views. Individuals who are younger, left-leaning and frequent users of shared modes of travel are expected to become more likely to use autonomous vehicles once offered. Understanding the effects of these attributes on the increase in consideration of AVs is important for policy making, as these effects provide a guide to predicting adoption of autonomous vehicles - once available - and to identify segments of the population likely to be more resistant to adopting AVs.
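
For readers unfamiliar with the difference-in-difference idea mentioned in the abstract, the basic two-group, two-period estimate can be sketched as follows. The paper fits a regression on survey data; this sketch only shows the underlying arithmetic, and the group definitions and the willingness scores are made-up toy values, not results from the study.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical willingness scores on an arbitrary scale.
effect = diff_in_diff(
    treated_pre=[2, 4], treated_post=[5, 7],
    control_pre=[1, 3], control_post=[2, 4],
)
print(effect)
```

Subtracting the comparison group's change removes any shift that both groups share, so the remainder is attributed to the event of interest.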

2. cs.SD (Speech):

【1】 Musical Tempo Estimation Using a Multi-scale Network
Link: https://arxiv.org/abs/2109.01607

Authors: Xiaoheng Sun, Qiqi He, Yongwei Gao, Wei Li
Affiliations: School of Computer Science and Technology, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
Notes: Accepted by ISMIR 2021
Abstract: Recently, some single-step systems without onset detection have shown their effectiveness in automatic musical tempo estimation. Following the success of these systems, in this paper we propose a Multi-scale Grouped Attention Network to further explore the potential of such methods. A multi-scale structure is introduced as the overall network architecture where information from different scales is aggregated to strengthen contextual feature learning. Furthermore, we propose a Grouped Attention Module as the key component of the network. The proposed module separates the input feature into several groups along the frequency axis, which makes it capable of capturing long-range dependencies from different frequency positions on the spectrogram. In comparison experiments, the results on public datasets show that the proposed model outperforms existing state-of-the-art methods on Accuracy1.
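
Accuracy1 is the tempo-estimation metric commonly defined as the share of tracks whose estimated tempo falls within 4% of the reference tempo. A minimal sketch of that definition follows; the function name and the toy BPM values are illustrative, and the abstract itself does not spell out the tolerance.

```python
def accuracy1(estimates, references, tolerance=0.04):
    """Fraction of estimates within +/- tolerance of the reference tempo."""
    hits = sum(
        abs(est - ref) <= tolerance * ref
        for est, ref in zip(estimates, references)
    )
    return hits / len(references)


# Toy tempi in beats per minute: the second estimate is too far off.
print(accuracy1([120, 100, 60], [118, 120, 60]))
```

Unlike the related Accuracy2 metric, this check does not forgive octave errors such as estimating half or double the true tempo.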

【2】 Phone Duration Modeling for Speaker Age Estimation in Children
Link: https://arxiv.org/abs/2109.01568

Authors: Prashanth Gurunath Shivakumar, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Affiliations: S. Bishop is with the Department of Psychiatry, University of California
Abstract: Automatic inference of important paralinguistic information such as age from speech is an important area of research with numerous spoken language technology based applications. Speaker age estimation has applications in enabling personalization and age-appropriate curation of information and content. However, research in speaker age estimation in children is especially challenging due to paucity of relevant speech data representing the developmental spectrum, and the high signal variability, especially intra-age variability, that complicates modeling. Most approaches in children's speaker age estimation adopt methods directly from research on adult speech processing. In this paper, we propose features specific to children and focus on the speaker's phone duration as an important biomarker of children's age. We propose phone duration modeling for predicting age from a child's speech. To enable that, children's speech is first force-aligned with the corresponding transcription to derive phone duration distributions. Statistical functionals are computed from phone duration distributions for each phoneme, which are in turn used to train regression models to predict speaker age. Two children's speech datasets are employed to demonstrate the robustness of phone duration features. We perform age regression experiments on age categories ranging from children studying in kindergarten to grade 10. Experimental results suggest phone durations contain important development-related information of children. Phonemes contributing most to estimation of children's speaker age are analyzed and presented.
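
The feature-extraction step described in the abstract (forced alignment, then per-phoneme statistical functionals) might look roughly like the sketch below. The alignment format, the millisecond units, and the particular set of functionals are assumptions made for illustration; the abstract does not list which functionals the authors use.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def duration_functionals(alignment):
    """Summarise phone durations per phoneme.

    alignment: iterable of (phone, start_ms, end_ms) tuples, as a forced
    aligner might produce (hypothetical format).
    """
    durations = defaultdict(list)
    for phone, start_ms, end_ms in alignment:
        durations[phone].append(end_ms - start_ms)
    return {
        phone: {
            "mean": mean(d),
            "median": median(d),
            "std": pstdev(d),  # population std, defined even for one sample
            "min": min(d),
            "max": max(d),
        }
        for phone, d in durations.items()
    }


toy_alignment = [("AA", 0, 100), ("S", 100, 200), ("AA", 500, 700)]
features = duration_functionals(toy_alignment)
print(features["AA"])
```

Flattening these per-phoneme dictionaries into one fixed-length vector per speaker gives the input that a regression model could then map to age.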

【3】 Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development
Link: https://arxiv.org/abs/2109.01164

Authors: Mingkuan Liu, Chi Zhang, Hua Xing, Chao Feng, Monchu Chen, Judith Bishop, Grace Ngapo
Affiliations: Appen
Notes: Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 2)
Abstract: This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrated the HITL pipeline can improve annotation speed and capacity by at least 80% and quality is comparable to or higher than manual double pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper.

【4】 Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition
Link: https://arxiv.org/abs/2109.01163

Authors: Maxime Burchi, Valentin Vielzeuf
Affiliations: Orange Labs, Cesson-Sévigné, France
Abstract: The recently proposed Conformer architecture has shown state-of-the-art performances in Automatic Speech Recognition by combining convolution with attention to model both local and global dependencies. In this paper, we study how to reduce the Conformer architecture complexity with a limited computing budget, leading to a more efficient architecture design that we call Efficient Conformer. We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention, allowing us to reduce attention complexity from $O(n^{2}d)$ to $O(n^{2}d / g)$ for sequence length $n$, hidden dimension $d$ and group size parameter $g$. We also experiment with the use of strided multi-head self-attention as a global downsampling operation. Our experiments are performed on the LibriSpeech dataset with CTC and RNN-Transducer losses. We show that within the same computing budget, the proposed architecture achieves better performances with faster training and decoding compared to the Conformer. Our 13M parameters CTC model achieves competitive WERs of 3.6%/9.0% without using a language model and 2.7%/6.7% with an external n-gram language model on the test-clean/test-other sets while being 29% faster than our CTC Conformer baseline at inference and 36% faster to train.
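
The complexity reduction from $O(n^{2}d)$ to $O(n^{2}d / g)$ can be checked with a little arithmetic. The sketch below assumes that grouping folds $g$ neighbouring frames into a single vector of size $g \cdot d$, so attention runs over $n/g$ positions; that reading is an assumption, since the abstract states only the two complexities. All function names are illustrative.

```python
def attention_cost(n: int, d: int) -> int:
    """Leading-order multiply count for the n x n score matrix: n^2 * d."""
    return n * n * d


def grouped_attention_cost(n: int, d: int, g: int) -> int:
    """Cost when g neighbouring frames are folded into one g*d vector."""
    if n % g != 0:
        raise ValueError("this sketch assumes n is divisible by g")
    # (n / g)^2 * (g * d) = n^2 * d / g
    return attention_cost(n // g, d * g)


def group_frames(frames, g):
    """Reshape n frames of size d into n/g frames of size g*d."""
    return [sum(frames[i:i + g], []) for i in range(0, len(frames), g)]


print(attention_cost(1000, 256))
print(grouped_attention_cost(1000, 256, g=4))
print(group_frames([[1, 2], [3, 4], [5, 6], [7, 8]], g=2))
```

The saving comes entirely from the shorter sequence: the quadratic term shrinks by $g^{2}$ while the feature dimension grows only by $g$.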

3. eess.AS (Audio Processing):

【1】 Musical Tempo Estimation Using a Multi-scale Network
Link: https://arxiv.org/abs/2109.01607


【2】 Phone Duration Modeling for Speaker Age Estimation in Children
Link: https://arxiv.org/abs/2109.01568


【3】 Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development
Link: https://arxiv.org/abs/2109.01164


【4】 Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition
Link: https://arxiv.org/abs/2109.01163


Shared from the WeChat official account arXiv每日学术速递 under the Tencent Cloud self-media sharing program. Originally published 2021-09-06.
