
Finance / Speech / Audio Processing Academic Digest [11.8]

By the WeChat public account arXiv每日学术速递 · Published 2021-11-17 10:50:45 · Column: arXiv每日学术速递

Update! The H5 page now supports collapsible abstracts for a better reading experience! Click "阅读原文" (read the original) to visit arxivdaily.com, which covers CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, favorites, and more!

q-fin (Quantitative Finance): 3 papers

cs.SD (Sound): 6 papers

eess.AS (Audio and Speech Processing): 8 papers

1. q-fin (Quantitative Finance):

【1】 Decrease of capital guarantees in life insurance products: can reinsurance stop it? Link: https://arxiv.org/abs/2111.03603

Authors: Marcos Escobar-Anel, Yevhen Havrylenko, Michel Kschonnek, Rudi Zagst Affiliations: Department of Statistical and Actuarial Sciences, Western University, London, Canada; Chair of Mathematical Finance, Department of Mathematics, Technical University of Munich, Garching bei München, Germany Note: 33 pages Abstract: We analyze the potential of reinsurance for reversing the current trend of decreasing capital guarantees in life insurance products. Providing an insurer with an opportunity to shift part of the financial risk to a reinsurer, we solve the insurer's dynamic investment-reinsurance optimization problem under simultaneous Value-at-Risk and no-short-selling constraints. We introduce the concept of guarantee-equivalent utility gain and use it to compare life insurance products with and without reinsurance. Our numerical studies indicate that optimally managed reinsurance allows the insurer to offer significantly higher capital guarantees to clients without any loss in the insurer's expected utility. The longer the investment horizon and the less risk-averse the insurer, the more prominent the reinsurance benefit.
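
One way to read "guarantee-equivalent utility gain" is as the extra guarantee that reinsurance funds at no utility cost. The notation below is an assumed formalization for illustration, not the paper's own definition.

```latex
% A plausible formalization (notation assumed, not the paper's own).
% V(G):   insurer's maximal expected utility at guarantee level G, no reinsurance.
% V_R(G): the same value when reinsurance is available.
% The guarantee-equivalent gain is the extra guarantee that reinsurance
% supports at no loss of expected utility:
\[
  \Delta G \;=\; \sup\bigl\{\, \delta \ge 0 \;:\; V_R(G + \delta) \,\ge\, V(G) \,\bigr\}.
\]
% The numerical finding then reads: \Delta G > 0, and it grows with the
% investment horizon and with decreasing risk aversion.
```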

【2】 Data-driven Hedging of Stock Index Options via Deep Learning Link: https://arxiv.org/abs/2111.03477

Authors: Jie Chen, Lingfei Li Abstract: We develop deep learning models to learn the hedge ratio for S&P 500 index options directly from options data. We compare different combinations of features and show that a feedforward neural network model with time to maturity, Black-Scholes delta, and a sentiment variable (VIX for calls and index return for puts) as input features performs best in the out-of-sample test. This model significantly outperforms the standard hedging practice that uses the Black-Scholes delta, as well as a recent data-driven model. Our results demonstrate the importance of market sentiment for hedging efficiency, a factor previously ignored in developing hedging strategies.
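
To make the feature setup concrete, here is a minimal PyTorch sketch of such a hedge-ratio network: three inputs in, one hedge ratio out. The layer sizes, activations, and sigmoid output constraint are illustrative assumptions, not the authors' reported architecture.

```python
# A minimal sketch, assuming a small feedforward net; the real training
# objective (minimizing hedging error on options data) is not shown.
import torch
import torch.nn as nn

class HedgeRatioNet(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden),   # inputs: time to maturity, BS delta, sentiment
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),           # hedge ratio for a call constrained to [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = HedgeRatioNet()
features = torch.tensor([[0.25, 0.55, 18.3]])  # tau=0.25y, delta=0.55, VIX=18.3
print(model(features))                          # predicted hedge ratio
```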

【3】 Cyber Risk Frequency, Severity and Insurance Viability Link: https://arxiv.org/abs/2111.03366

Authors: Matteo Malavasi, Gareth W. Peters, Pavel V. Shevchenko, Stefan Trück, Jiwook Jang, Georgy Sofronov Affiliations: Department of Actuarial Studies and Business Analytics, Macquarie University, Australia; Department of Statistics and Applied Probability, University of California Santa Barbara, USA; Department of Mathematics and Statistics, Macquarie University, Australia Note: 42 pages, 14 figures Abstract: In this study, an exploration of insurance risk transfer is undertaken for the cyber insurance industry in the United States of America, based on the leading industry dataset of cyber events provided by Advisen. We seek to address two core unresolved questions. First, what factors are the most significant covariates that may explain the frequency and severity of cyber loss events, and are they heterogeneous over cyber risk categories? Second, is cyber risk insurable with regard to the required premiums and risk pool sizes, and how would this decision vary with the insured company's industry sector and size? We address these questions through a combination of regression models based on the class of Generalized Additive Models for Location, Scale and Shape (GAMLSS) and a class of ordinal regressions. These models then form the basis for our analysis of the frequency and severity of cyber risk loss processes. We investigate the viability of insurance for cyber risk using a utility modelling framework, with premiums calculated by classical certainty-equivalence analysis using the developed regression models. Our results provide several new key insights into the nature of insurability of cyber risk and rigorously address the two insurance questions posed, in a real data-driven case study analysis.
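
The classical certainty-equivalence premium the abstract mentions has a standard closed form under exponential utility: the smallest premium pi the insurer accepts satisfies pi = (1/a) log E[exp(a L)]. The sketch below assumes exponential utility and a gamma loss severity purely for illustration; the paper instead derives the loss distributions from GAMLSS models fitted to the Advisen data.

```python
# A minimal sketch of a certainty-equivalence premium via Monte Carlo,
# assuming U(x) = -exp(-a x) and a gamma severity (both illustrative).
import numpy as np

rng = np.random.default_rng(0)
a = 1e-6                                              # absolute risk aversion (assumed)
losses = rng.gamma(shape=0.5, scale=2.0e5, size=100_000)  # simulated per-event cyber loss

# Indifference premium: the insurer is indifferent between writing the policy
# at this premium and not writing it at all.
premium = np.log(np.mean(np.exp(a * losses))) / a
print(f"expected loss: {losses.mean():,.0f}")
print(f"certainty-equivalent premium: {premium:,.0f}")   # exceeds the mean (risk loading)
```

Insurability then hinges on whether this premium, aggregated over a risk pool, stays below what insured companies are willing to pay, which is how the industry-sector and size comparisons enter.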

2. cs.SD (Sound):

【1】 Objective measurement of pitch extractors' responses to frequency modulated sounds and two reference pitch extraction methods for analyzing voice pitch responses to auditory stimulation Link: https://arxiv.org/abs/2111.03629

Authors: Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise Affiliations: Wakayama University; Waseda University; Health Science University of Hokkaido; Konan University; Meijo University; Meiji University (Japan) Note: 5 pages, 5 figures, submitted to ICASSP 2022 Abstract: We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. The method simultaneously measures the linear and the non-linear time-invariant responses as well as random and time-varying responses. It uses extended time-stretched pulses combined by binary orthogonal sequences. Our recent finding of involuntary voice pitch response to auditory stimulation while voicing motivated this proposal. The involuntary voice pitch response provides means to investigate voice-chain subsystems individually and objectively. This response analysis requires reliable and precise pitch extraction. Using the proposed method, we found that existing pitch extractors failed to correctly analyze the signals used for auditory stimulation. Therefore, we propose two reference pitch extractors based on instantaneous frequency analysis and multi-resolution power spectrum analysis. The proposed extractors correctly analyze the test signals. We open-sourced MATLAB code to measure pitch extractors, along with code for conducting the voice pitch response experiment, on our GitHub repository.
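
A toy version of the measurement setting: synthesize a frequency-modulated tone with a known pitch trajectory, then recover it with an instantaneous-frequency analysis of the analytic signal. The Hilbert-based extractor below is only a simplified stand-in for the paper's reference extractors, and the vibrato tone is far simpler than the extended time-stretched pulses the method actually uses.

```python
# A minimal sketch: FM test tone + instantaneous-frequency pitch readout.
import numpy as np
from scipy.signal import hilbert

fs = 16_000
t = np.arange(fs) / fs                       # 1 s of signal
f0, fm, depth = 120.0, 5.0, 10.0             # 120 Hz carrier, 5 Hz vibrato, +-10 Hz
phase = 2 * np.pi * (f0 * t - depth / (2 * np.pi * fm) * np.cos(2 * np.pi * fm * t))
x = np.sin(phase)                            # instantaneous freq = f0 + depth*sin(2*pi*fm*t)

analytic = hilbert(x)
inst_freq = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)
print(inst_freq.min(), inst_freq.max())      # roughly 110..130 Hz, modulo edge effects
```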

【2】 Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling Link: https://arxiv.org/abs/2111.03333

Authors: Bi-Cheng Yan, Hsin-Wei Wang, Shih-Hsuan Chiu, Hsuan-Sheng Chiu, Berlin Chen Affiliations: National Taiwan Normal University, Taipei, Taiwan; Chunghwa Telecom Laboratories, Taipei, Taiwan Note: 5 pages, 3 figures, submitted to ICASSP 2022 Abstract: Conversational speech is normally embodied with loose syntactic structures at the utterance level but simultaneously exhibits topical coherence relations across consecutive utterances. Prior work has shown that capturing longer context information with a recurrent neural network or long short-term memory language model (LM) may suffer from recency bias while excluding the long-range context. In order to capture the long-term semantic interactions among words and across utterances, we put forward disparate conversation-history fusion methods for language modeling in automatic speech recognition (ASR) of conversational speech. Furthermore, a novel audio-fusion mechanism is introduced, which manages to fuse and utilize the acoustic embeddings of a current utterance and the semantic content of its corresponding conversation history in a cooperative way. To flesh out our ideas, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, an iconic pre-trained LM, as the ingredient vehicle to facilitate selection of the oracle hypothesis from a given N-best hypothesis list. Empirical experiments conducted on the AMI benchmark dataset seem to demonstrate the feasibility and efficacy of our methods in relation to some current top-of-the-line methods.
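
The rescoring framing is easy to sketch: pair the conversation history with each N-best hypothesis, score each pair with BERT, and keep the top-scoring hypothesis. The snippet below is a bare-bones sketch; "bert-base-uncased", the single-logit head, and the toy sentences are assumptions, the head is untrained here (in practice it would be trained to pick the oracle hypothesis), and the paper's history- and audio-fusion mechanisms are not shown.

```python
# A minimal sketch of BERT-based N-best rescoring with conversation history.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.eval()

history = "so what time does the meeting start"
n_best = ["it starts at nine", "it starts at mine", "its dark at nine"]

with torch.no_grad():
    scores = []
    for hyp in n_best:
        enc = tokenizer(history, hyp, return_tensors="pt")  # [CLS] history [SEP] hyp [SEP]
        scores.append(model(**enc).logits.item())

print(n_best[int(torch.tensor(scores).argmax())])           # chosen hypothesis
```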

【3】 Context-Aware Transformer Transducer for Speech Recognition Link: https://arxiv.org/abs/2111.03250

Authors: Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann Affiliation: Amazon Alexa Note: Accepted to ASRU 2021 Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words that appear infrequently in the training data. One promising method to improve recognition accuracy on such rare words is to latch onto personalized/contextual information at inference. In this work, we present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals. Specifically, we propose a multi-head attention-based context-biasing network, which is jointly trained with the rest of the ASR sub-networks. We explore different techniques to encode contextual data and to create the final attention context vectors. We also leverage both BLSTM and pretrained BERT-based models to encode contextual data and guide the network training. Using an in-house far-field dataset, we show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
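
The context-biasing idea can be sketched as cross-attention: acoustic frames attend over embeddings of contextual phrases (e.g. contact names), and the resulting context vectors are fused back into the encoder stream. Dimensions and the concatenate-then-project fusion below are assumptions, not CATT's exact design.

```python
# A minimal sketch of multi-head attention-based context biasing.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
biasing_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
fuse = nn.Linear(2 * d_model, d_model)

audio = torch.randn(1, 100, d_model)    # encoder output: 100 acoustic frames
context = torch.randn(1, 20, d_model)   # 20 embedded context phrases (BLSTM/BERT)

ctx_vec, _ = biasing_attn(query=audio, key=context, value=context)
biased = fuse(torch.cat([audio, ctx_vec], dim=-1))  # context-aware frames
print(biased.shape)                                  # torch.Size([1, 100, 256])
```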

【4】 Generating Diverse Realistic Laughter for Interactive Art Link: https://arxiv.org/abs/2111.03146

Authors: M. Mehdi Afsar, Eric Park, Étienne Paquette, Gauthier Gidel, Kory W. Mathewson, Eilif Muller Affiliations: Mila & University of Calgary; Mila & University of Waterloo; Independent Artist; Mila & University of Montreal; DeepMind Note: Presented at the Machine Learning for Creativity and Design workshop at NeurIPS 2021 Abstract: We propose an interactive art project to make those rendered invisible by the COVID-19 crisis and its concomitant solitude reappear through the welcome melody of laughter, and connections created and explored through advanced laughter synthesis approaches. However, the unconditional generation of the diversity of human emotional responses in high-quality auditory synthesis remains an open problem, with important implications for the application of these approaches in artistic settings. We developed LaughGANter, an approach to reproduce the diversity of human laughter using generative adversarial networks (GANs). When trained on a dataset of diverse laughter samples, LaughGANter generates diverse, high-quality laughter samples, and learns a latent space suitable for emotional analysis and novel artistic applications such as latent mixing/interpolation and emotional transfer.
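
The latent mixing/interpolation mentioned at the end is straightforward to sketch: decode points along a line between two latent codes with the trained generator. The `generator` below is a placeholder module, not LaughGANter's architecture.

```python
# A minimal sketch of latent interpolation with a GAN generator (placeholder).
import torch
import torch.nn as nn

latent_dim = 128
generator = nn.Sequential(nn.Linear(latent_dim, 1024), nn.Tanh())  # stand-in decoder

z_a, z_b = torch.randn(latent_dim), torch.randn(latent_dim)        # two "laughs"
for alpha in torch.linspace(0.0, 1.0, steps=5):
    z = (1 - alpha) * z_a + alpha * z_b    # blend the two latent codes
    waveform = generator(z)                # would be laughter audio in the real model
    print(alpha.item(), waveform.shape)
```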

【5】 Hybrid Spectrogram and Waveform Source Separation Link: https://arxiv.org/abs/2111.03600

Author: Alexandre Défossez Affiliation: Facebook AI Research Note: ISMIR 2021 MDX Workshop, 11 pages, 2 figures Abstract: Source separation models work either in the spectrogram or in the waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention, and singular value regularization. Overall, a 1.4 dB improvement of the signal-to-distortion ratio (SDR) was observed across all sources as measured on the MusDB HQ dataset, an improvement confirmed by human subjective evaluation, with an overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs) and absence of contamination at 3.04 (against 2.37 for the non-hybrid Demucs and 2.44 for the second-ranking model submitted at the competition).
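
The core hybrid idea (one branch on the waveform, one on the spectrogram, fused in the time domain) can be sketched in a few lines of PyTorch. The single convolutions stand in for Demucs' actual U-Net branches; sizes and the simple additive fusion are assumptions.

```python
# A minimal sketch of hybrid waveform/spectrogram separation with time-domain fusion.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
wave_branch = nn.Conv1d(1, 1, kernel_size=3, padding=1)     # stand-in waveform U-Net
spec_branch = nn.Conv2d(2, 2, kernel_size=3, padding=1)     # stand-in spectrogram U-Net

x = torch.randn(1, 1, 16_000)                               # 1 s of mono audio
spec = torch.stft(x[:, 0], n_fft, hop, window=window, return_complex=True)
spec_ri = torch.view_as_real(spec).permute(0, 3, 1, 2)      # (B, 2, freq, time)
spec_out = spec_branch(spec_ri).permute(0, 2, 3, 1).contiguous()
wave_from_spec = torch.istft(torch.view_as_complex(spec_out), n_fft, hop,
                             window=window, length=16_000)

y = wave_branch(x) + wave_from_spec.unsqueeze(1)            # fuse in the time domain
print(y.shape)                                               # torch.Size([1, 1, 16000])
```

Because both branch outputs live in the time domain after the iSTFT, the model can let either branch dominate per source, which is the point of the hybrid design.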

【6】 Blind Extraction of Target Speech Source Guided by Supervised Speaker Identification via X-vectors Link: https://arxiv.org/abs/2111.03482

Authors: Jiri Malek, Jakub Jansky, Zbynek Koldovsky, Tomas Kounovsky, Jaroslav Cmejla, Jindrich Zdansky Affiliation: Technical University of Liberec Note: Submitted for review Abstract: This manuscript proposes a novel robust procedure for extraction of a speaker of interest (SOI) from a mixture of audio sources. The estimation of the SOI is blind, performed via independent vector extraction. A recently proposed constant separating vector (CSV) model is employed, which improves the estimation of moving sources. The blind algorithm is guided towards the SOI via frame-wise speaker identification, which is trained in a supervised manner and is independent of a specific scenario. When processing challenging data, an incorrect speaker may be extracted due to limitations of this guidance. To identify such cases, a criterion that non-intrusively assesses the quality of the estimated SOI is proposed. It utilizes the same model as the speaker identification; no additional training is therefore required. Using this criterion, a "deflation" approach to extraction is presented: if an incorrect source is estimated, it is subtracted from the mixture, and the extraction of the SOI is performed again from the reduced mixture. The proposed procedure is experimentally tested on both artificial and real-world datasets containing challenging phenomena: source movements, reverberation, transient noise, and microphone failures. The presented method is comparable to state-of-the-art blind algorithms on static mixtures; it is more accurate for mixtures containing source movements. Compared to fully supervised methods, the proposed procedure achieves a lower level of accuracy but requires no scenario-specific data for training.
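
The quality-guided "deflation" loop reduces to simple control flow. In this sketch, all three helpers are hypothetical placeholders for the CSV-based extractor and the x-vector-based quality criterion; only the loop structure reflects the described procedure.

```python
# A minimal sketch of the deflation loop, assuming placeholder components.
import numpy as np

def blind_extract(mix: np.ndarray) -> np.ndarray:
    """Placeholder for independent vector extraction with the CSV model."""
    return mix * 0.5

def soi_quality(est: np.ndarray) -> float:
    """Placeholder for the non-intrusive criterion based on the speaker-ID model."""
    return float(np.random.rand())

def extract_with_deflation(mix: np.ndarray, threshold: float = 0.7,
                           max_rounds: int = 3) -> np.ndarray:
    for _ in range(max_rounds):
        est = blind_extract(mix)
        if soi_quality(est) >= threshold:   # looks like the speaker of interest
            return est
        mix = mix - est                     # deflate: remove the wrong source
    return est                              # best effort after max_rounds

print(extract_with_deflation(np.random.randn(16_000)).shape)
```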

3. eess.AS (Audio and Speech Processing):

【1】 Hybrid Spectrogram and Waveform Source Separation Link: https://arxiv.org/abs/2111.03600 (cross-listed; see cs.SD 【5】 above for authors and abstract)

【2】 Blind Extraction of Target Speech Source Guided by Supervised Speaker Identification via X-vectors Link: https://arxiv.org/abs/2111.03482 (cross-listed; see cs.SD 【6】 above)

【3】 PerSpeechNorm: A Persian Toolkit for Speech Processing Normalization Link: https://arxiv.org/abs/2111.03470

Authors: Romina Oji, Seyedeh Fatemeh Razavi, Sajjad Abdi Dehsorkh, Alireza Hariri, Hadi Asheri, Reshad Hosseini Affiliations: School of ECE, College of Engineering, University of Tehran, Tehran, Iran; Hoosh Afzare Rahbare Aryaman (HARA) Company, Tehran, Iran Abstract: In general, speech processing models consist of a language model along with an acoustic model. Regardless of the language model's complexity and variants, three critical pre-processing steps are needed in language models: cleaning, normalization, and tokenization. Among the mentioned steps, normalization is essential for format unification in pure textual applications. However, for embedded language models in speech processing modules, normalization is not limited to format unification; it also has to convert each readable symbol, number, etc., to how it is pronounced. To the best of our knowledge, there is no Persian normalization toolkit for embedded language models in speech processing modules, so in this paper, we propose an open-source normalization toolkit for text processing in speech applications. Briefly, we consider different kinds of readable Persian text, such as symbols (common currencies, #, @, URLs, etc.) and numbers (dates, times, phone numbers, national codes, etc.). Comparison with other available Persian textual normalization tools indicates the superiority of the proposed method for speech processing. Also, comparing the model's performance for one of the proposed functions (sentence separation) with other common natural language libraries, such as HAZM and Parsivar, indicates the proper performance of the proposed method. Besides, its evaluation on some Persian Wikipedia data confirms the proper performance of the proposed method.
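
As a toy illustration of why speech-oriented normalization goes beyond format unification, the snippet below expands digits and symbols into spoken words. English word forms are used for readability; PerSpeechNorm performs the analogous mapping for Persian, including dates, times, and phone numbers.

```python
# A minimal sketch of pronunciation-oriented text normalization (illustrative only).
import re

SYMBOLS = {"%": " percent ", "@": " at ", "#": " hash "}
DIGITS = "zero one two three four five six seven eight nine".split()

def normalize_for_speech(text: str) -> str:
    for sym, spoken in SYMBOLS.items():          # symbols -> how they are pronounced
        text = text.replace(sym, spoken)
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())] + " ", text)  # digits -> words
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

print(normalize_for_speech("call 911 at 5%"))
# -> "call nine one one at five percent"
```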

【4】 Objective measurement of pitch extractors' responses to frequency modulated sounds and two reference pitch extraction methods for analyzing voice pitch responses to auditory stimulation Link: https://arxiv.org/abs/2111.03629 (cross-listed; see cs.SD 【1】 above)

【5】 Conformer-based Hybrid ASR System for Switchboard Dataset Link: https://arxiv.org/abs/2111.03442

Authors: Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Wilfried Michel, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany Note: Submitted to ICASSP 2022 Abstract: The recently proposed conformer architecture has been successfully used in end-to-end automatic speech recognition (ASR) architectures, achieving state-of-the-art performance on different datasets. To the best of our knowledge, the impact of using a conformer acoustic model for hybrid ASR has not been investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve the word error rate as well as to increase training speed. We apply time-downsampling methods for efficient training and use transposed convolutions to upsample the output sequence again. We conduct experiments on the Switchboard 300h dataset, and our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well on the Hub5'01 test set and outperforms the BLSTM-based hybrid model significantly.
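
The downsample/upsample trick is easy to sketch: a strided convolution shortens the frame sequence before the expensive conformer blocks, and a transposed convolution restores the original frame rate for the frame-level outputs a hybrid model needs. The factor of 3 and the feature size below are assumptions.

```python
# A minimal sketch of time downsampling and transposed-convolution upsampling.
import torch
import torch.nn as nn

d_model, factor = 256, 3
down = nn.Conv1d(d_model, d_model, kernel_size=factor, stride=factor)
up = nn.ConvTranspose1d(d_model, d_model, kernel_size=factor, stride=factor)

frames = torch.randn(1, d_model, 300)   # (batch, features, time): 300 acoustic frames
short = down(frames)                    # (1, 256, 100) -- conformer blocks run here
restored = up(short)                    # (1, 256, 300) -- back to the frame level
print(short.shape, restored.shape)
```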

【6】 Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling Link: https://arxiv.org/abs/2111.03333 (cross-listed; see cs.SD 【2】 above)

【7】 Context-Aware Transformer Transducer for Speech Recognition Link: https://arxiv.org/abs/2111.03250 (cross-listed; see cs.SD 【3】 above)

【8】 Generating Diverse Realistic Laughter for Interactive Art Link: https://arxiv.org/abs/2111.03146 (cross-listed; see cs.SD 【4】 above)


