
Finance / Speech / Audio Processing Academic Digest [7.22]

Author: arXiv每日学术速递 (WeChat official account)
Published on Tencent Cloud: 2021-07-27 (originally published on WeChat: 2021-07-22)
Column: arXiv每日学术速递

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favorites, and posting features. Click "Read the original" to access.

q-fin (Quantitative Finance): 3 papers

cs.SD (Sound): 9 papers

eess.AS (Audio and Speech Processing): 8 papers

1. q-fin (Quantitative Finance):

【1】 Default Distances Based on the KMV-CEV Model

Authors: Wen Su
Link: https://arxiv.org/abs/2107.10226
Abstract: This paper presents a new method to assess default risk by applying non-constant volatility to the KMV model, taking the CEV model as an instance. We find evidence that the classical KMV model cannot distinguish ST companies in the Chinese stock market. Aiming to improve the accuracy of the KMV model, we assume the firm's asset value dynamics are given by the CEV process $\frac{dV_A}{V_A} = \mu_A dt + \delta V_A^{\beta-1}dB$ and use a fixed-effects model and the equivalent volatility method to estimate the parameters. The estimation results show $\beta>1$ for non-ST companies and $\beta<1$ for ST companies, and the equivalent volatility method estimates the parameters much more precisely. Compared with the classical KMV model, the CEV-KMV model fits the market better in forecasting default probabilities. We also note that other volatility models can be applied as well.
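To make the asset dynamics concrete, the following is a minimal Python sketch (not taken from the paper) that simulates the CEV process $\frac{dV_A}{V_A} = \mu_A dt + \delta V_A^{\beta-1}dB$ with an Euler scheme and derives a Merton/KMV-style distance to default from the simulated terminal values. All parameter values, including the default point, are illustrative assumptions rather than the paper's estimates.

```python
import numpy as np

def simulate_cev_assets(v0, mu, delta, beta, T=1.0, n_steps=252, n_paths=10_000, seed=0):
    """Euler-Maruyama simulation of dV/V = mu*dt + delta * V**(beta-1) * dB."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    v = np.full(n_paths, float(v0))
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        v = v * (1.0 + mu * dt + delta * v ** (beta - 1.0) * dB)
        v = np.maximum(v, 1e-8)  # keep asset values positive
    return v

# Illustrative parameters, not estimates from the paper.
v_T = simulate_cev_assets(v0=100.0, mu=0.05, delta=0.12, beta=1.2)
default_point = 70.0  # e.g., short-term debt + 0.5 * long-term debt in the KMV convention
log_v = np.log(v_T)
dd = (log_v.mean() - np.log(default_point)) / log_v.std()  # distance to default on log-assets
pd_mc = (v_T < default_point).mean()                       # Monte Carlo default probability
print(f"distance to default ~ {dd:.2f}, simulated PD ~ {pd_mc:.3%}")
```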

【2】 Pricing Exchange Option Based on Copulas by MCMC Algorithm

Authors: Wen Su
Link: https://arxiv.org/abs/2107.10225
Abstract: This paper focuses on pricing exchange options based on copulas by an MCMC algorithm. We first introduce the relevant methodologies: risk-neutral pricing, copulas, and the MCMC algorithm. We then compare the option prices given by different models; the results show that, except for the Gumbel copula, the models provide similar estimates.
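The abstract does not spell out the implementation, so the sketch below only illustrates the pricing side: a Monte Carlo valuation of an exchange option with payoff max(S1_T - S2_T, 0), where the dependence between the two assets is imposed through a Gaussian copula on lognormal risk-neutral marginals. The MCMC estimation step and the Gumbel and other copulas compared in the paper are not reproduced; all inputs are made-up examples.

```python
import numpy as np
from scipy.stats import norm, lognorm

def price_exchange_option(s1, s2, sigma1, sigma2, rho, r, T, n_paths=200_000, seed=0):
    """Monte Carlo price of an exchange option with payoff max(S1_T - S2_T, 0).

    Dependence between the two terminal prices is imposed with a Gaussian copula
    (correlation rho) on lognormal risk-neutral marginals.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, 2))
    z[:, 1] = rho * z[:, 0] + np.sqrt(1.0 - rho ** 2) * z[:, 1]  # correlated normals
    u = norm.cdf(z)                                              # Gaussian copula: uniforms on [0,1]^2
    # Push the uniforms through each asset's lognormal quantile function.
    s1_T = lognorm.ppf(u[:, 0], s=sigma1 * np.sqrt(T), scale=s1 * np.exp((r - 0.5 * sigma1**2) * T))
    s2_T = lognorm.ppf(u[:, 1], s=sigma2 * np.sqrt(T), scale=s2 * np.exp((r - 0.5 * sigma2**2) * T))
    payoff = np.maximum(s1_T - s2_T, 0.0)
    return np.exp(-r * T) * payoff.mean()

# Illustrative inputs only, not the paper's calibration. A Gumbel or other copula
# would replace the norm.cdf step above.
print(price_exchange_option(s1=100, s2=95, sigma1=0.25, sigma2=0.20, rho=0.5, r=0.02, T=1.0))
```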

【3】 Forecasting Performance of Workforce Reskilling Programmes

Authors: Evan Hurwitz, George Cevora
Link: https://arxiv.org/abs/2107.10001
Abstract: Estimating success rates for programmes aiming to reintegrate the unemployed into the workforce is essential for good stewardship of public finances. At the moment, the methods used for this task are based on the historical performance of comparable programmes. In light of Brexit and Covid-19 simultaneously causing a shock to the UK labour market, we developed an estimation method based on the fundamental factors involved (workforce demand and supply) as opposed to historical values, which are quickly becoming irrelevant. With an average error of 3.9% on the re-integration success rate, our model outperforms the best benchmark known to us by 53%.

2. cs.SD (Sound):

【1】 A Tandem Framework Balancing Privacy and Security for Voice User Interfaces

Authors: Ranya Aloufi, Hamed Haddadi, David Boyle
Affiliation: Imperial College London
Note: 14 pages, 6 figures
Link: https://arxiv.org/abs/2107.10045
Abstract: Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to trigger a spoofing attack using fraudulent biometrics for a legitimate speaker. Conversely, such techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signal, achieving anonymization. Prior works have studied the security and privacy vectors in parallel, which raises an alarm: if a benign user can achieve privacy through a transformation, a malicious user can also break security by bypassing the anti-spoofing mechanism. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating the vulnerabilities inherent in VUIs and for building effective defenses. In this paper, (i) we investigate the applicability of current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and evaluate the performance of these methods; (ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms, as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that to effectively defend against potential attacks on VUIs, it is necessary to investigate the attacks from multiple complementary perspectives (security and privacy).
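The abstract describes a tandem framework that jointly combines anti-spoofing and authentication models. One common way to realize such a tandem (used, for example, in tandem detection-cost evaluations) is to accept a trial only when both subsystems accept it; an anonymizing transformation that fools the countermeasure therefore also weakens the spoofing defence. The sketch below is a hedged illustration of that gating logic only, not the authors' implementation; all scores and thresholds are invented.

```python
def tandem_decision(asv_score, cm_score, asv_threshold, cm_threshold):
    """Accept a trial only if BOTH subsystems accept it.

    asv_score: speaker-verification score (higher = more likely the claimed speaker)
    cm_score:  anti-spoofing countermeasure score (higher = more likely bona fide speech)
    """
    return (asv_score >= asv_threshold) and (cm_score >= cm_threshold)

# Toy trials: (asv_score, cm_score) pairs, illustrative numbers only.
trials = [(2.1, 1.5),   # genuine target speech            -> accepted
          (2.3, -0.8),  # anonymized/converted speech      -> rejected by the countermeasure
          (-1.0, 1.2)]  # bona fide speech, wrong speaker  -> rejected by the verifier
for asv, cm in trials:
    print(tandem_decision(asv, cm, asv_threshold=0.0, cm_threshold=0.0))
```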

【2】 Music Plagiarism Detection via Bipartite Graph Matching

Authors: Tianyao He, Wenxuan Liu, Chen Gong, Junchi Yan, Ning Zhang
Affiliation: Shanghai Jiao Tong University, Minhang District, Shanghai, China
Link: https://arxiv.org/abs/2107.09889
Abstract: Nowadays, with the prevalence of social media and music creation tools, musical pieces spread much more quickly and music creation has become much easier. The increasing number of musical pieces has made the problem of music plagiarism prominent, and there is an urgent need for a tool that can detect music plagiarism automatically. Researchers have proposed various methods to extract low-level and high-level features of music and compute their similarities. However, low-level features such as cepstrum coefficients have a weak relation to the copyright protection of musical pieces, and existing algorithms that consider high-level features fail to detect the case in which two musical pieces are not very similar overall but share some highly similar regions. This paper proposes a new method named MESMF, which converts the music plagiarism detection problem into a bipartite graph matching task solved via a maximum-weight-matching and edit-distance model. We design several kinds of melody representations and similarity computation methods according to music theory. The proposed method can handle shift, swapping, transposition, and tempo-variance problems in music plagiarism, and it can effectively pick out locally similar regions from two musical pieces with relatively low global similarity. We collect a new music plagiarism dataset from real, legally judged music plagiarism cases and conduct detailed ablation studies. Experimental results demonstrate the excellent performance of the proposed algorithm. The source code and our dataset are available at https://anonymous.4open.science/r/a41b8fb4-64cf-4190-a1e1-09b7499a15f5/
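As a rough illustration of the recipe the abstract describes (edit distances between melody segments fed into a maximum-weight bipartite matching), here is a small Python sketch using SciPy's assignment solver. It is not MESMF itself: the melody representation (pitch-interval segments), the similarity definition, and the toy data are all assumptions for demonstration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1, dp[i - 1, j - 1] + cost)
    return dp[len(a), len(b)]

def segment_similarity_matrix(segments_a, segments_b):
    """Similarity = 1 - normalized edit distance between melody segments."""
    sim = np.zeros((len(segments_a), len(segments_b)))
    for i, sa in enumerate(segments_a):
        for j, sb in enumerate(segments_b):
            sim[i, j] = 1.0 - edit_distance(sa, sb) / max(len(sa), len(sb))
    return sim

# Toy melodies represented as pitch-interval segments (hypothetical data).
song_a = [[2, 2, -4], [0, 3, -1, 2], [5, -5, 2]]
song_b = [[2, 2, -4], [5, -5, 1], [1, 1, 1, 1]]
sim = segment_similarity_matrix(song_a, song_b)
row, col = linear_sum_assignment(-sim)  # negate to obtain a maximum-weight matching
print("matched segment pairs:", list(zip(row.tolist(), col.tolist())),
      "mean similarity:", round(float(sim[row, col].mean()), 2))
```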

【3】 Melody Structure Transfer Network: Generating Music with Separable Self-Attention

Authors: Ning Zhang, Junchi Yan
Affiliation: Department of Computer Science and Engineering, Shanghai Jiao Tong University
Link: https://arxiv.org/abs/2107.09877
Abstract: Symbolic music generation has attracted increasing attention, while most methods focus on generating short pieces (mostly fewer than 8 bars, and up to 32 bars). Generating long music calls for effective expression of coherent musical structure. Despite their success on long sequences, self-attention architectures still face challenges in dealing with long-term music, as it requires additional care for subtle musical structure. In this paper, we propose to transfer the structure of training samples for new music generation, and develop a novel separable self-attention based model that enables the learning and transferring of a structure embedding. We show that our transfer model can generate music sequences (up to 100 bars) with interpretable structures, which share similar structures and composition techniques with the template music from the training set. Extensive experiments show its ability to generate music with the target structure and good diversity. The 3,000 generated sets of music are uploaded as supplemental material.

【4】 Human Perception of Audio Deepfakes

Authors: Nicolas M. Müller, Karla Markert, Konstantin Böttinger
Affiliation: Fraunhofer AISEC, Germany (the authors contributed equally)
Link: https://arxiv.org/abs/2107.09667
Abstract: The recent emergence of deepfakes, computer-generated realistic multimedia fakes, has brought the detection of manipulated and generated content to the forefront. While many machine learning models for deepfake detection have been proposed, human detection capabilities have remained far less explored. This is of special importance as human perception differs from machine perception, and deepfakes are generally designed to fool humans. So far, this issue has only been addressed in the area of images and video. To compare the ability of humans and machines in detecting audio deepfakes, we conducted an online gamified experiment in which we asked users to discern bona-fide audio samples from spoofed audio generated with a variety of algorithms. 200 users competed over 8,976 game rounds with an artificial intelligence (AI) algorithm trained for audio deepfake detection. With the collected data we found that the machine generally outperforms humans in detecting audio deepfakes, but that the converse holds for a certain attack type, for which humans are still more accurate. Furthermore, we found that younger participants are on average better at detecting audio deepfakes than older participants, while IT professionals hold no advantage over laymen. We conclude that it is important to combine human and machine knowledge to improve audio deepfake detection.

【5】 Controlling the Remixing of Separated Dialogue with a Non-Intrusive Quality Estimate

Authors: Matteo Torcoli, Jouni Paulus, Thorsten Kastner, Christian Uhle
Affiliation: Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany; International Audio Laboratories Erlangen, Erlangen, Germany
Note: Manuscript accepted for the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Link: https://arxiv.org/abs/2107.10151
Abstract: Remixing separated audio sources trades off interferer attenuation against the amount of audible deterioration. This paper proposes a non-intrusive audio quality estimation method for controlling this trade-off in a signal-adaptive manner. The recently proposed 2f-model is adopted as the underlying quality measure, since it has been shown to correlate strongly with basic audio quality in source separation. An alternative operating mode of the measure is proposed that is more appropriate when considering material with long inactive periods of the target source. The 2f-model requires the reference target source as an input, but this is not available in many applications. Deep neural networks (DNNs) are trained to estimate the 2f-model intrusively using the reference target (iDNN2f), non-intrusively using the input mix as reference (nDNN2f), and reference-free using only the separated output signal (rDNN2f). It is shown that iDNN2f achieves very strong correlation with the original measure on the test data (Pearson r = 0.99), while performance decreases for nDNN2f (r >= 0.91) and rDNN2f (r >= 0.82). The non-intrusive estimate nDNN2f is mapped to select item-dependent remixing gains with the aim of maximizing the interferer attenuation under a constraint on the minimum quality of the remixed output (e.g., audible but not annoying deterioration). A listening test shows that this is successfully achieved even with very different selected gains (up to 23 dB difference).
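The last step the abstract describes, mapping a non-intrusive quality estimate to item-dependent remixing gains that maximize interferer attenuation under a minimum-quality constraint, can be sketched as a simple selection rule. The numbers, the quality scale, and the function below are illustrative assumptions, not the paper's actual mapping.

```python
def select_remix_gain(candidate_gains_db, predicted_quality, quality_floor):
    """Pick the strongest interferer attenuation whose predicted quality stays acceptable.

    candidate_gains_db: attenuation gains to choose from (dB, larger = stronger remix)
    predicted_quality:  non-intrusive quality estimate (e.g., a DNN-based 2f-model score)
                        for the remix produced with each candidate gain
    quality_floor:      minimum acceptable predicted quality
    """
    ok = [g for g, q in zip(candidate_gains_db, predicted_quality) if q >= quality_floor]
    return max(ok) if ok else min(candidate_gains_db)  # fall back to the mildest remix

# Hypothetical per-item predictions from a quality estimator (illustrative numbers).
gains = [0, 3, 6, 9, 12]         # dB of dialogue/interferer rebalancing
quality = [92, 88, 81, 70, 55]   # predicted quality for each candidate remix
print(select_remix_gain(gains, quality, quality_floor=75))  # -> 6
```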

【6】 Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning

Authors: Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Affiliation: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Note: Submitted to MLSP 2021
Link: https://arxiv.org/abs/2107.09998
Abstract: Deep generative models have recently achieved impressive performance in speech synthesis and music generation. However, compared to the generation of those domain-specific sounds, the generation of general sounds (such as car horns, dog barking, and gun shots) has received less attention, despite their wide potential applications. In our previous work, sounds were generated in the time domain using SampleRNN, but it is difficult to capture long-range dependencies within sound recordings using this method. In this work, we propose to generate sounds conditioned on sound classes via neural discrete time-frequency representation learning, which offers an advantage in modelling long-range dependencies and retaining local fine-grained structure within a sound clip. We evaluate the proposed approach on the UrbanSound8K dataset against a SampleRNN baseline, with performance metrics measuring the quality and diversity of the generated sound samples. Experimental results show that our method offers significantly better diversity and comparable quality compared to the baseline.
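The core of a neural discrete time-frequency representation is vector quantization of spectrogram frames into a token sequence that a class-conditional autoregressive prior can then model. The sketch below shows only the nearest-codebook lookup, with random stand-in data; it is not the paper's trained codebook, encoder, or generation model.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Assign each time-frequency frame to its nearest codebook entry.

    frames:   (T, D) array, e.g., D-dimensional mel-spectrogram frames
    codebook: (K, D) array of learned code vectors
    Returns discrete token indices (T,) and the quantized frames (T, D).
    """
    # Squared Euclidean distance between every frame and every code vector.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)
    return tokens, codebook[tokens]

# Toy example: 100 frames of an 80-dim "spectrogram", 256 codes (random stand-ins).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))
codebook = rng.normal(size=(256, 80))
tokens, quantized = vector_quantize(frames, codebook)
print(tokens[:10])  # the discrete sequence a class-conditional prior could then model
```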

【7】 CL4AC: A Contrastive Loss for Audio Captioning

Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang
Affiliation: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Note: The first two authors contributed equally; 5 pages, 3 figures; submitted to the DCASE 2021 Workshop
Link: https://arxiv.org/abs/2107.09990
Abstract: Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder then generates the caption. However, training of an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representations and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which improves the quality of the latent representation and the audio-text alignment when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of the proposed approach.
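The abstract does not give the exact form of the contrastive objective, so the following is a generic InfoNCE-style audio-text contrastive loss of the kind often added as an auxiliary term, offered only as a hedged illustration of "contrasting samples to exploit audio-text correspondences". The embeddings, batch size, and temperature are arbitrary placeholders, not CL4AC's actual formulation.

```python
import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (B, D) tensors; row i of each is a matching pair.
    Matching pairs are pulled together, mismatched pairs in the batch pushed apart.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy batch of 4 paired embeddings (random stand-ins for encoder/decoder states).
audio_emb = torch.randn(4, 256)
text_emb = torch.randn(4, 256)
print(audio_text_contrastive_loss(audio_emb, text_emb))
```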

【8】 Audio Captioning Transformer

Authors: Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Affiliation: Centre for Vision, Speech and Signal Processing (CVSSP), Department of Computer Science, University of Surrey, UK
Note: 5 pages, 1 figure
Link: https://arxiv.org/abs/2107.09817
Abstract: Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling the temporal relationships among time frames in an audio signal, while RNNs can be limited in modelling long-range dependencies among the time frames. In this paper, we propose the Audio Captioning Transformer (ACT), a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free. The proposed method is better able to model the global information within an audio signal and to capture temporal relationships between audio events. We evaluate our model on AudioCaps, the largest publicly available audio captioning dataset. Our model shows competitive performance compared to other state-of-the-art approaches.
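As a rough sketch of a convolution-free encoder-decoder captioner of the kind described (spectrogram frames embedded with a linear layer instead of a CNN, then a Transformer encoder and decoder), here is a heavily reduced PyTorch example. The layer sizes, vocabulary, and the omission of positional encodings are placeholder simplifications, not the authors' ACT configuration.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Convolution-free encoder-decoder captioner over mel-spectrogram frames."""

    def __init__(self, n_mels=64, d_model=128, vocab_size=5000):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)   # frame embedding, no convolution
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        src = self.audio_proj(mel)      # (B, T_audio, d_model); positional encodings omitted
        tgt = self.token_emb(tokens)    # (B, T_text, d_model)
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)        # (B, T_text, vocab_size) word logits

model = TinyCaptioner()
mel = torch.randn(2, 500, 64)            # batch of 2 spectrograms: 500 frames, 64 mel bins
tokens = torch.randint(0, 5000, (2, 12)) # partial caption token ids
print(model(mel, tokens).shape)          # torch.Size([2, 12, 5000])
```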

【9】 VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion

Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng
Affiliation: The Chinese University of Hong Kong, Hong Kong SAR, China; Huawei Noah's Ark Lab
Note: Accepted to Interspeech 2021. Code, pre-trained models, and a demo are available at the URL given in the abstract.
Link: https://arxiv.org/abs/2106.10132
Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker, and pitch representations by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models, and demo are available at https://github.com/Wendison/VQMIVC.
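For the VQ content-encoding step the abstract mentions, a standard VQ-VAE-style quantizer with a straight-through gradient and commitment loss is sketched below; the mutual-information loss that reduces the dependence between content, speaker, and pitch representations is not reproduced. Shapes, codebook size, and the beta weight are assumptions, not the released VQMIVC configuration.

```python
import torch
import torch.nn.functional as F

def vq_straight_through(z_e, codebook, beta=0.25):
    """VQ-VAE-style quantization of content-encoder outputs.

    z_e:      (B, T, D) continuous content-encoder outputs
    codebook: (K, D) learnable code vectors
    Returns the quantized codes (with a straight-through gradient) and the
    codebook/commitment losses added to the training objective.
    """
    d = torch.cdist(z_e.reshape(-1, z_e.size(-1)), codebook)  # (B*T, K) distances
    idx = d.argmin(dim=-1)
    z_q = codebook[idx].view_as(z_e)
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())
    z_q = z_e + (z_q - z_e).detach()  # straight-through: gradients flow back to the encoder
    return z_q, codebook_loss + commitment_loss

# Toy shapes (assumed): 2 utterances, 100 frames, 64-dim content codes, 512 codes.
z_e = torch.randn(2, 100, 64, requires_grad=True)
codebook = torch.randn(512, 64, requires_grad=True)
z_q, vq_loss = vq_straight_through(z_e, codebook)
print(z_q.shape, float(vq_loss))
```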

3. eess.AS (Audio and Speech Processing):

All eight eess.AS papers in this issue are cross-listings of cs.SD entries above; their details and abstracts are not repeated here.

【1】 Controlling the Remixing of Separated Dialogue with a Non-Intrusive Quality Estimate (see cs.SD 【5】) https://arxiv.org/abs/2107.10151

【2】 Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning (see cs.SD 【6】) https://arxiv.org/abs/2107.09998

【3】 CL4AC: A Contrastive Loss for Audio Captioning (see cs.SD 【7】) https://arxiv.org/abs/2107.09990

【4】 Audio Captioning Transformer (see cs.SD 【8】) https://arxiv.org/abs/2107.09817

【5】 A Tandem Framework Balancing Privacy and Security for Voice User Interfaces (see cs.SD 【1】) https://arxiv.org/abs/2107.10045

【6】 Music Plagiarism Detection via Bipartite Graph Matching (see cs.SD 【2】) https://arxiv.org/abs/2107.09889

【7】 Melody Structure Transfer Network: Generating Music with Separable Self-Attention (see cs.SD 【3】) https://arxiv.org/abs/2107.09877

【8】 Human Perception of Audio Deepfakes (see cs.SD 【4】) https://arxiv.org/abs/2107.09667
