
Finance / Speech / Audio Processing Academic Digest [10.18]

Author: WeChat official account "arXiv每日学术速递" (arXiv Daily Academic Digest)
Published 2021-10-21 16:05:10

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "阅读原文" (read the original article) to visit arxivdaily.com, which covers CS, Physics, Mathematics, Economics, Statistics, Finance, Biology and Electrical Engineering, and offers search, favorites and more.

q-fin (Quantitative Finance): 1 paper

cs.SD (Sound / Speech): 10 papers

eess.AS (Audio and Speech Processing): 10 papers

1. q-fin (Quantitative Finance):

【1】 Credit Union Regulations' Mysterious Hold on Thrifts and Community Banks
Link: https://arxiv.org/abs/2110.07611

Authors: Reka Sundaram-Stukel, Steven C Deller
Affiliations: Research Fellow, Department of Economics, University of Wisconsin-Madison; Professor, Department of Agricultural and Applied Economics, University of Wisconsin-Madison
Note: 33 pages, 3 figures, 3 tables. To be submitted to a finance journal.
Abstract: The continued operation of credit unions in the 2008-2010 regulatory framework remained a source of debate in the financial sector. Competing thrifts argued for fair regulatory treatment while credit unions argued for the relaxation of restrictions on their scale and scope of operation. We provide a fresh perspective by building on central place theory and offering a family of location models for U.S. credit unions. Our results showed that credit unions operated in areas with a low concentration of retail banks. This finding was evidence that credit unions serve niche markets and that they were not a significant source of direct competition for thrifts and community banks. This may signal an increase in credit union formation in a post-pandemic world.

2. cs.SD (Sound / Speech):

【1】 Direct simultaneous speech to speech translation
Link: https://arxiv.org/abs/2110.08250

Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Kenneth Heafield, Phillip Koehn, Juan Pino
Affiliations: Facebook AI, Johns Hopkins University, Maastricht University, University of Edinburgh
Abstract: We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech content and independently from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units. Instead of continuous spectrogram features, a sequence of discrete representations, learned in an unsupervised manner, is predicted by the model and passed directly to a vocoder for speech synthesis. The simultaneous policy then operates on source speech features and target discrete units. Finally, a vocoder synthesizes the target speech from discrete units on-the-fly. We carry out numerical studies to compare the cascaded and direct approaches on the Fisher Spanish-English dataset.
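The abstract describes a policy that alternates between reading source speech and writing discrete target units that a vocoder renders on-the-fly, but it does not spell the loop out. The sketch below shows a generic wait-k style driver for such a pipeline; `unit_model`, `vocoder`, and the wait-k choice are illustrative assumptions, not the authors' components.

```python
from typing import Callable, Iterable, List

def simul_s2st_wait_k(
    source_frames: Iterable,                      # streaming source speech features
    unit_model: Callable[[list], List[int]],      # hypothetical: source prefix -> new target units
    vocoder: Callable[[List[int]], bytes],        # hypothetical: discrete units -> waveform chunk
    k: int = 3,                                   # read k frames before the first write (wait-k)
):
    """Toy wait-k loop: alternate READ (consume a source frame) and WRITE
    (predict discrete units from the prefix and vocode them incrementally)."""
    read, units, audio = [], [], []
    for frame in source_frames:
        read.append(frame)                        # READ action
        if len(read) >= k:                        # once k frames are available, start writing
            new_units = unit_model(read)          # predict units from the source prefix so far
            if new_units:
                units.extend(new_units)
                audio.append(vocoder(new_units))  # synthesize target speech on-the-fly
    return units, audio
```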

【2】 Incremental Speech Synthesis For Speech-To-Speech Translation
Link: https://arxiv.org/abs/2110.08214

Authors: Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino
Affiliations: Maastricht University, Facebook AI, Johns Hopkins University
Note: Work-in-progress
Abstract: In a speech-to-speech translation (S2ST) pipeline, the text-to-speech (TTS) module is an important component for delivering the translated speech to users. To enable incremental S2ST, the TTS module must be capable of synthesizing and playing utterances while its input text is still streaming in. In this work, we focus on improving the incremental synthesis performance of TTS models. With a simple data augmentation strategy based on prefixes, we are able to improve the incremental TTS quality to approach offline performance. Furthermore, we bring our incremental TTS system to the practical scenario in combination with an upstream simultaneous speech translation system, and show that the gains also carry over to this use case. In addition, we propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.
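The abstract mentions a prefix-based data augmentation strategy without detailing it. Below is a minimal sketch of one plausible reading, which expands each training sentence into its word-level prefixes on the text side; pairing each prefix with matching audio (e.g. via forced alignment) is not covered by the abstract and is omitted here.

```python
def prefix_augment(sentences, min_words=2):
    """Expand each training sentence into its word-level prefixes so the TTS
    model also sees partial inputs, like those produced by a streaming MT
    system. Pairing prefixes with audio segments is a separate step."""
    augmented = []
    for text in sentences:
        words = text.split()
        for n in range(min_words, len(words) + 1):
            augmented.append(" ".join(words[:n]))
    return augmented

# Example
print(prefix_augment(["the quick brown fox jumps"]))
# ['the quick', 'the quick brown', 'the quick brown fox', 'the quick brown fox jumps']
```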

【3】 Towards Identity Preserving Normal to Dysarthric Voice Conversion
Link: https://arxiv.org/abs/2110.08213

Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda
Affiliations: Nagoya University, Japan; Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands; University of Amsterdam, Amsterdam, The Netherlands; Netherlands Cancer Institute, Amsterdam, The Netherlands
Note: Submitted to ICASSP 2022
Abstract: We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, and (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarthric speech while being highly natural and possessing the speaker identity of the normal speaker. To this end, we adopted a two-stage framework, which consists of a sequence-to-sequence model and a nonparallel frame-wise model. Objective and subjective evaluations were conducted on the UASpeech dataset, and results showed that the method was able to yield reasonable naturalness and capture severity aspects of the pathological speech. On the other hand, the similarity to the normal source speaker's voice was limited and requires further improvements.

【4】 Using DeepProbLog to perform Complex Event Processing on an Audio Stream
Link: https://arxiv.org/abs/2110.08090

Authors: Marc Roig Vilamala, Tianwei Xing, Harrison Taylor, Luis Garcia, Mani Srivastava, Lance Kaplan, Alun Preece, Angelika Kimmig, Federico Cerutti
Affiliations: Cardiff University; University of California, Los Angeles; Army Research Laboratory; KU Leuven, Department of Computer Science, Leuven.AI; University of Brescia
Note: 8 pages, 3 figures
Abstract: In this paper, we present an approach to Complex Event Processing (CEP) that is based on DeepProbLog. This approach has the following objectives: (i) allowing the use of subsymbolic data as an input, (ii) retaining the flexibility and modularity in the definitions of complex event rules, (iii) allowing the system to be trained in an end-to-end manner, and (iv) being robust against noisily labelled data. Our approach makes use of DeepProbLog to create a neuro-symbolic architecture that combines a neural network to process the subsymbolic data with a probabilistic logic layer to allow the user to define the rules for the complex events. We demonstrate that our approach is capable of detecting complex events from an audio stream. We also demonstrate that our approach is capable of training even with a dataset that has a moderate proportion of noisy data.
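To make the neuro-symbolic setup more concrete, here is a rough, purely illustrative sketch of what a DeepProbLog-style program for an audio complex event could look like. The predicate and network names, the event rule, and the three sound classes are assumptions for exposition, not the program from the paper, and the model-loading and training API is omitted.

```python
# Illustrative only: a DeepProbLog-style program for a toy "break-in" complex
# event built from per-window audio classifications. All names and rules are
# assumptions for exposition; they are not taken from the paper.
CEP_PROGRAM = r"""
% neural predicate: an audio window W is classified into one of three sounds
nn(audio_net, [W], C, [glass_break, footsteps, background]) :: sound(W, C).

% next/2 links consecutive audio windows and would be supplied as facts
% complex event: glass breaking immediately followed by footsteps
break_in(W1, W2) :- next(W1, W2), sound(W1, glass_break), sound(W2, footsteps).
"""

print(CEP_PROGRAM)  # in practice this program would be loaded into DeepProbLog
                    # together with a neural classifier registered as audio_net
```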

【5】 Scribosermo: Fast Speech-to-Text models for German and other Languages
Link: https://arxiv.org/abs/2110.07982

Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Affiliations: University of Augsburg, Institute for Software & Systems Engineering
Abstract: Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with special features: (a) They are small and run in real-time on microcontrollers like a RaspberryPi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine advantages of other approaches, which only include a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets, and shows an optimized way of transfer-learning new languages using a pretrained model from another language with a similar alphabet.

【6】 ESPnet2-TTS: Extending the Edge of TTS Research
Link: https://arxiv.org/abs/2110.07840

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
Affiliations: Human Dataware Lab. Co., Ltd.; Nagoya University; LINE Corp.; Nagoya Institute of Technology; Carnegie Mellon University; The University of Tokyo; AIRS Company, Hyundai Motor Group
Note: Submitted to ICASSP 2022. Demo HP: this https URL
Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.
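As a usage note, the toolkit's unified Python inference interface can be exercised roughly as follows. This is a sketch from memory; the class path, the `from_pretrained` call, and the example model tag should be verified against the ESPnet documentation, and downloading pretrained models requires the espnet_model_zoo package.

```python
# Minimal sketch of pretrained-model inference via ESPnet2's Python interface.
# Class path and model tag are assumptions to be checked against the ESPnet docs.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")  # example model tag
out = tts("Hello from ESPnet2-TTS.")                            # returns a dict with the waveform
sf.write("sample.wav", out["wav"].numpy(), tts.fs)              # save at the model's sampling rate
```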

【7】 Attention-Free Keyword Spotting
Link: https://arxiv.org/abs/2110.07749

Authors: Mashrur M. Morshed, Ahmad Omar Ahsan
Affiliations: Islamic University of Technology (IUT), Department of Computer Science & Engineering (CSE)
Note: Submitted to ICASSP 2022 (5 pages)
Abstract: Till now, attention-based models have been used with great success in the keyword spotting problem domain. However, in light of recent advances in deep learning, the question arises whether self-attention is truly irreplaceable for recognizing speech keywords. We thus explore the usage of gated MLPs -- previously shown to be alternatives to transformers in vision tasks -- for the keyword spotting task. We verify our approach on the Google Speech Commands V2-35 dataset and show that it is possible to obtain performance comparable to the state of the art without any apparent usage of self-attention.
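For readers unfamiliar with gated MLPs, the sketch below shows a generic gMLP block in PyTorch, in which a spatial gating unit replaces self-attention by mixing information across time steps with a plain linear layer. The dimensions are illustrative; this is a sketch of the general technique, not the paper's exact keyword-spotting architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """Gates one half of the channels with a learned linear mixing over time."""
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial = nn.Linear(seq_len, seq_len)  # mixes information across time steps
        nn.init.zeros_(self.spatial.weight)         # near-identity initialization, as in the gMLP paper
        nn.init.ones_(self.spatial.bias)

    def forward(self, x):                            # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial(v.transpose(1, 2)).transpose(1, 2)
        return u * v                                 # (batch, seq_len, d_ffn // 2)

class GMLPBlock(nn.Module):
    """One gMLP block: channel projections plus spatial gating, no self-attention."""
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        y = F.gelu(self.proj_in(self.norm(x)))
        return x + self.proj_out(self.sgu(y))        # residual connection

# Toy usage on a (batch, frames, features) tensor, e.g. log-mel frames of a keyword clip
block = GMLPBlock(d_model=64, d_ffn=256, seq_len=98)
print(block(torch.randn(2, 98, 64)).shape)           # torch.Size([2, 98, 64])
```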

【8】 HumBugDB: A Large-scale Acoustic Mosquito Dataset
Link: https://arxiv.org/abs/2110.07607

Authors: Ivan Kiskin, Marianne Sinka, Adam D. Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy J. Willis, Stephen J. Roberts
Affiliations: University of Oxford; SRI International; Mind Foundry Ltd; University of Surrey; IHI Tanzania; UCC, BEES
Note: Accepted at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. 10 pages main, 39 pages including appendix. This paper accompanies the dataset found at this https URL with corresponding code at this https URL.
Abstract: This paper presents the first large-scale multi-species dataset of acoustic recordings of mosquitoes tracked continuously in free flight. We present 20 hours of audio recordings that we have expertly labelled and tagged precisely in time. Significantly, 18 hours of recordings contain annotations from 36 different species. Mosquitoes are well-known carriers of diseases such as malaria, dengue and yellow fever. Collecting this dataset is motivated by the need to assist applications which utilise mosquito acoustics to conduct surveys to help predict outbreaks and inform intervention policy. The task of detecting mosquitoes from the sound of their wingbeats is challenging due to the difficulty in collecting recordings from realistic scenarios. To address this, as part of the HumBug project, we conducted global experiments to record mosquitoes ranging from those bred in culture cages to mosquitoes captured in the wild. Consequently, the audio recordings vary in signal-to-noise ratio and contain a broad range of indoor and outdoor background environments from Tanzania, Thailand, Kenya, the USA and the UK. In this paper we describe in detail how we collected, labelled and curated the data. The data is provided from a PostgreSQL database, which contains important metadata such as the capture method, age, feeding status and gender of the mosquitoes. Additionally, we provide code to extract features and train Bayesian convolutional neural networks for two key tasks: the identification of mosquitoes from their corresponding background environments, and the classification of detected mosquitoes into species. Our extensive dataset is both challenging to machine learning researchers focusing on acoustic identification, and critical to entomologists, geo-spatial modellers and other domain experts to understand mosquito behaviour, model their distribution, and manage the threat they pose to humans.
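As an illustration of the kind of acoustic front end such a classification pipeline typically uses, the sketch below extracts log-mel features from a recording with librosa. The file name, sample rate, and mel settings are placeholders, not the HumBugDB reference configuration.

```python
# Sketch of a log-mel front end for a mosquito-vs-background classifier.
# File name, sample rate and mel settings are placeholders only.
import librosa
import numpy as np

def logmel_features(wav_path: str, sr: int = 8000, n_mels: int = 64):
    y, sr = librosa.load(wav_path, sr=sr)                  # load and resample to a fixed rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)            # (n_mels, frames), in dB

feats = logmel_features("mosquito_clip.wav")                # hypothetical file
print(feats.shape)
```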

【9】 Neural Dubber: Dubbing for Silent Videos According to Scripts
Link: https://arxiv.org/abs/2110.08243

Authors: Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Affiliations: Tsinghua University, ByteDance, Shanghai Qi Zhi Institute
Note: Accepted by NeurIPS 2021
Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and the LRS2 multi-speaker dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

【10】 Don't speak too fast: The impact of data bias on self-supervised speech models
Link: https://arxiv.org/abs/2110.07957

Authors: Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-yi Lee
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; College of Electrical Engineering and Computer Science, National Taiwan University
Note: Submitted to ICASSP 2022
Abstract: Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR. However, how pre-training data affects S3Ms' downstream behavior remains an unexplored issue. In this paper, we study how pre-training data affects S3Ms by pre-training models on biased datasets targeting different factors of speech, including gender, content, and prosody, and evaluate these pre-trained S3Ms on selected downstream tasks in the SUPERB Benchmark. Our experiments show that S3Ms have tolerance toward gender bias. Moreover, we find that the content of speech has little impact on the performance of S3Ms across downstream tasks, but S3Ms do show a preference toward a slower speech rate.

3. eess.AS (Audio and Speech Processing):

【1】 Neural Dubber: Dubbing for Silent Videos According to Scripts
Link: https://arxiv.org/abs/2110.08243

Authors: Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Affiliations: Tsinghua University, ByteDance, Shanghai Qi Zhi Institute
Note: Accepted by NeurIPS 2021
Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and the LRS2 multi-speaker dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

【2】 Don't speak too fast: The impact of data bias on self-supervised speech models
Link: https://arxiv.org/abs/2110.07957

Authors: Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-yi Lee
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; College of Electrical Engineering and Computer Science, National Taiwan University
Note: Submitted to ICASSP 2022
Abstract: Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR. However, how pre-training data affects S3Ms' downstream behavior remains an unexplored issue. In this paper, we study how pre-training data affects S3Ms by pre-training models on biased datasets targeting different factors of speech, including gender, content, and prosody, and evaluate these pre-trained S3Ms on selected downstream tasks in the SUPERB Benchmark. Our experiments show that S3Ms have tolerance toward gender bias. Moreover, we find that the content of speech has little impact on the performance of S3Ms across downstream tasks, but S3Ms do show a preference toward a slower speech rate.

【3】 Direct simultaneous speech to speech translation
Link: https://arxiv.org/abs/2110.08250

Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Kenneth Heafield, Phillip Koehn, Juan Pino
Affiliations: Facebook AI, Johns Hopkins University, Maastricht University, University of Edinburgh
Abstract: We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech content and independently from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units. Instead of continuous spectrogram features, a sequence of discrete representations, learned in an unsupervised manner, is predicted by the model and passed directly to a vocoder for speech synthesis. The simultaneous policy then operates on source speech features and target discrete units. Finally, a vocoder synthesizes the target speech from discrete units on-the-fly. We carry out numerical studies to compare the cascaded and direct approaches on the Fisher Spanish-English dataset.

【4】 Incremental Speech Synthesis For Speech-To-Speech Translation
Link: https://arxiv.org/abs/2110.08214

Authors: Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino
Affiliations: Maastricht University, Facebook AI, Johns Hopkins University
Note: Work-in-progress
Abstract: In a speech-to-speech translation (S2ST) pipeline, the text-to-speech (TTS) module is an important component for delivering the translated speech to users. To enable incremental S2ST, the TTS module must be capable of synthesizing and playing utterances while its input text is still streaming in. In this work, we focus on improving the incremental synthesis performance of TTS models. With a simple data augmentation strategy based on prefixes, we are able to improve the incremental TTS quality to approach offline performance. Furthermore, we bring our incremental TTS system to the practical scenario in combination with an upstream simultaneous speech translation system, and show that the gains also carry over to this use case. In addition, we propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.

【5】 Towards Identity Preserving Normal to Dysarthric Voice Conversion
Link: https://arxiv.org/abs/2110.08213

Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda
Affiliations: Nagoya University, Japan; Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands; University of Amsterdam, Amsterdam, The Netherlands; Netherlands Cancer Institute, Amsterdam, The Netherlands
Note: Submitted to ICASSP 2022
Abstract: We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, and (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarthric speech while being highly natural and possessing the speaker identity of the normal speaker. To this end, we adopted a two-stage framework, which consists of a sequence-to-sequence model and a nonparallel frame-wise model. Objective and subjective evaluations were conducted on the UASpeech dataset, and results showed that the method was able to yield reasonable naturalness and capture severity aspects of the pathological speech. On the other hand, the similarity to the normal source speaker's voice was limited and requires further improvements.

【6】 Scribosermo: Fast Speech-to-Text models for German and other Languages
Link: https://arxiv.org/abs/2110.07982

Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Affiliations: University of Augsburg, Institute for Software & Systems Engineering
Abstract: Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French, with special features: (a) They are small and run in real-time on microcontrollers like a RaspberryPi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine advantages of other approaches, which only include a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets, and shows an optimized way of transfer-learning new languages using a pretrained model from another language with a similar alphabet.

【7】 Multilingual Speech Recognition using Knowledge Transfer across Learning Processes
Link: https://arxiv.org/abs/2110.07909

Authors: Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian
Affiliations: Signal Analysis and Interpretation Laboratory, University of Southern California, USA; Microsoft Corp., USA
Note: 5 pages
Abstract: Multilingual end-to-end (E2E) models have shown great potential in the expansion of language coverage in the realm of automatic speech recognition (ASR). In this paper, we aim to enhance multilingual ASR performance in two ways: 1) studying the impact of feeding a one-hot vector identifying the language, and 2) formulating the task with a meta-learning objective combined with self-supervised learning (SSL). We associate every language with a distinct task manifold and attempt to improve the performance by transferring knowledge across the learning processes themselves, as compared to transferring through final model parameters. We employ this strategy on a dataset comprising 6 languages for an in-domain ASR task, by minimizing an objective related to expected gradient path length. Experimental results reveal that the best pre-training strategy results in a 3.55% relative reduction in overall WER. A combination of LEAP and SSL yields a 3.51% relative reduction in overall WER when using language ID.
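The abstract does not say where the one-hot language vector is injected; a common and simple choice, sketched below, is to append it to every acoustic feature frame before it reaches the encoder. The six-language list is an assumption for illustration only.

```python
import numpy as np

LANGUAGES = ["en", "de", "fr", "es", "it", "nl"]   # illustrative 6-language setup

def append_language_id(features: np.ndarray, lang: str) -> np.ndarray:
    """Concatenate a one-hot language vector to every acoustic feature frame.
    `features` has shape (frames, feat_dim); the result is (frames, feat_dim + 6)."""
    one_hot = np.zeros(len(LANGUAGES), dtype=features.dtype)
    one_hot[LANGUAGES.index(lang)] = 1.0
    return np.concatenate([features, np.tile(one_hot, (features.shape[0], 1))], axis=1)

x = append_language_id(np.random.randn(100, 80).astype(np.float32), "de")
print(x.shape)  # (100, 86)
```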

【8】 ESPnet2-TTS: Extending the Edge of TTS Research
Link: https://arxiv.org/abs/2110.07840

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
Affiliations: Human Dataware Lab. Co., Ltd.; Nagoya University; LINE Corp.; Nagoya Institute of Technology; Carnegie Mellon University; The University of Tokyo; AIRS Company, Hyundai Motor Group
Note: Submitted to ICASSP 2022. Demo HP: this https URL
Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

【9】 Attention-Free Keyword Spotting
Link: https://arxiv.org/abs/2110.07749

Authors: Mashrur M. Morshed, Ahmad Omar Ahsan
Affiliations: Islamic University of Technology (IUT), Department of Computer Science & Engineering (CSE)
Note: Submitted to ICASSP 2022 (5 pages)
Abstract: Till now, attention-based models have been used with great success in the keyword spotting problem domain. However, in light of recent advances in deep learning, the question arises whether self-attention is truly irreplaceable for recognizing speech keywords. We thus explore the usage of gated MLPs -- previously shown to be alternatives to transformers in vision tasks -- for the keyword spotting task. We verify our approach on the Google Speech Commands V2-35 dataset and show that it is possible to obtain performance comparable to the state of the art without any apparent usage of self-attention.

【10】 HumBugDB: A Large-scale Acoustic Mosquito Dataset
Link: https://arxiv.org/abs/2110.07607

Authors: Ivan Kiskin, Marianne Sinka, Adam D. Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy J. Willis, Stephen J. Roberts
Affiliations: University of Oxford; SRI International; Mind Foundry Ltd; University of Surrey; IHI Tanzania; UCC, BEES
Note: Accepted at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. 10 pages main, 39 pages including appendix. This paper accompanies the dataset found at this https URL with corresponding code at this https URL.
Abstract: This paper presents the first large-scale multi-species dataset of acoustic recordings of mosquitoes tracked continuously in free flight. We present 20 hours of audio recordings that we have expertly labelled and tagged precisely in time. Significantly, 18 hours of recordings contain annotations from 36 different species. Mosquitoes are well-known carriers of diseases such as malaria, dengue and yellow fever. Collecting this dataset is motivated by the need to assist applications which utilise mosquito acoustics to conduct surveys to help predict outbreaks and inform intervention policy. The task of detecting mosquitoes from the sound of their wingbeats is challenging due to the difficulty in collecting recordings from realistic scenarios. To address this, as part of the HumBug project, we conducted global experiments to record mosquitoes ranging from those bred in culture cages to mosquitoes captured in the wild. Consequently, the audio recordings vary in signal-to-noise ratio and contain a broad range of indoor and outdoor background environments from Tanzania, Thailand, Kenya, the USA and the UK. In this paper we describe in detail how we collected, labelled and curated the data. The data is provided from a PostgreSQL database, which contains important metadata such as the capture method, age, feeding status and gender of the mosquitoes. Additionally, we provide code to extract features and train Bayesian convolutional neural networks for two key tasks: the identification of mosquitoes from their corresponding background environments, and the classification of detected mosquitoes into species. Our extensive dataset is both challenging to machine learning researchers focusing on acoustic identification, and critical to entomologists, geo-spatial modellers and other domain experts to understand mosquito behaviour, model their distribution, and manage the threat they pose to humans.

