Finance / Speech / Audio Processing Academic Digest [12.17]

Author: 公众号-arXiv每日学术速递 (WeChat official account)
Published: 2021-12-17 17:40:50
Column: arXiv每日学术速递

q-fin (Quantitative Finance): 3 papers in total

cs.SD (Sound / Speech): 7 papers in total

eess.AS (Audio and Speech Processing): 10 papers in total

1. q-fin (Quantitative Finance):

【1】 Multivariate Realized Volatility Forecasting with Graph Neural Network
Link: https://arxiv.org/abs/2112.09015

Authors: Qinkai Chen, Christian-Yann Robert
Note: 13 pages, 6 tables, 4 figures
Abstract: The existing publications demonstrate that the limit order book data is useful in predicting short-term volatility in stock markets. Since stocks are not independent, changes on one stock can also impact other related stocks. In this paper, we are interested in forecasting short-term realized volatility in a multivariate approach based on limit order book data and relational data. To achieve this goal, we introduce Graph Transformer Network for Volatility Forecasting. The model allows to combine limit order book features and an unlimited number of temporal and cross-sectional relations from different sources. Through experiments based on about 500 stocks from S&P 500 index, we find a better performance for our model than for other benchmarks.
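
To make the cross-sectional idea concrete, here is a minimal numerical sketch (not the paper's Graph Transformer): realized volatility is computed per stock from intraday returns and then blended with the volatility of related stocks through a relation (adjacency) matrix. All numbers, weights, and the toy graph are illustrative assumptions.

```python
import numpy as np

# Toy sketch of multivariate realized-volatility forecasting with relational data.
rng = np.random.default_rng(0)
n_stocks, n_bars = 4, 78                       # e.g. 78 five-minute bars per session
returns = rng.normal(0.0, 0.001, size=(n_stocks, n_bars))

rv = np.sqrt((returns ** 2).sum(axis=1))       # realized volatility per stock
adj = np.array([[0, 1, 1, 0],                  # hypothetical "related stock" links
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
weights = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1.0)  # row-normalised
forecast = 0.7 * rv + 0.3 * (weights @ rv)     # blend own RV with neighbours' RV
print(forecast)
```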

【2】 Macroscopic properties of buyer-seller networks in online marketplaces
Link: https://arxiv.org/abs/2112.09065

Authors: Alberto Bracci, Jörn Boehnke, Abeer ElBahrawy, Nicola Perra, Alexander Teytelboym, Andrea Baronchelli
Abstract: Online marketplaces are the main engines of legal and illegal e-commerce, yet the aggregate properties of buyer-seller networks behind them are poorly understood. We analyze two datasets containing 245M transactions (16B USD) that took place on online marketplaces between 2010 and 2021. The data cover 28 dark web marketplaces, i.e., unregulated markets whose main currency is Bitcoin, and 144 product markets of one regulated e-commerce platform. We show how transactions in online marketplaces exhibit strikingly similar patterns of aggregate behavior despite significant differences in language, lifetimes, available products, regulation, oversight, and technology. We find remarkable regularities in the distributions of (i) transaction amounts, (ii) number of transactions, (iii) inter-event times, (iv) time between first and last transactions. We then show how buyer behavior is affected by the memory of past interactions, and draw on these observations to propose a model of network formation able to reproduce the main stylized facts of the data. Our findings have implications for understanding market power on online marketplaces as well as inter-marketplace competition.
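
The memory-driven network-formation idea can be illustrated with a short simulation. This is a generic preferential-return sketch under assumed parameters, not the authors' calibrated model: each buyer either returns to a previously used seller (with probability proportional to past interactions) or explores a new one.

```python
import random
import collections

# Toy buyer-seller network formation with "memory" of past interactions.
random.seed(1)
n_buyers, n_sellers, n_steps, p_explore = 50, 20, 5000, 0.2
history = collections.defaultdict(list)          # buyer -> past sellers (with repeats)

for _ in range(n_steps):
    buyer = random.randrange(n_buyers)
    if history[buyer] and random.random() > p_explore:
        seller = random.choice(history[buyer])   # return, weighted by past frequency
    else:
        seller = random.randrange(n_sellers)     # explore a seller at random
    history[buyer].append(seller)

degree = collections.Counter(s for past in history.values() for s in set(past))
print("most connected sellers:", degree.most_common(3))
```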

【3】 Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture
Link: https://arxiv.org/abs/2112.08534

Authors: Kieran Wood, Sven Giegerich, Stephen Roberts, Stefan Zohren
Abstract: Deep learning architectures, specifically Deep Momentum Networks (DMNs) [1904.04912], have been found to be an effective approach to momentum and mean-reversion trading. However, some of the key challenges in recent years involve learning long-term dependencies, degradation of performance when considering returns net of transaction costs and adapting to new market regimes, notably during the SARS-CoV-2 crisis. Attention mechanisms, or Transformer-based architectures, are a solution to such challenges because they allow the network to focus on significant time steps in the past and longer-term patterns. We introduce the Momentum Transformer, an attention-based architecture which outperforms the benchmarks, and is inherently interpretable, providing us with greater insights into our deep learning trading strategy. Our model is an extension to the LSTM-based DMN, which directly outputs position sizing by optimising the network on a risk-adjusted performance metric, such as Sharpe ratio. We find an attention-LSTM hybrid Decoder-Only Temporal Fusion Transformer (TFT) style architecture is the best performing model. In terms of interpretability, we observe remarkable structure in the attention patterns, with significant peaks of importance at momentum turning points. The time series is thus segmented into regimes and the model tends to focus on previous time-steps in alike regimes. We find changepoint detection (CPD) [2105.13727], another technique for responding to regime change, can complement multi-headed attention, especially when we run CPD at multiple timescales. Through the addition of an interpretable variable selection network, we observe how CPD helps our model to move away from trading predominantly on daily returns data. We note that the model can intelligently switch between, and blend, classical strategies - basing its decision on patterns in the data.
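
The DMN-style training objective mentioned in the abstract, optimising position outputs directly on a risk-adjusted metric, can be sketched as a negative Sharpe-ratio loss. The tiny LSTM below is a stand-in, not the paper's Momentum Transformer; the feature dimensions and annualisation factor are illustrative assumptions.

```python
import torch

def negative_sharpe(positions: torch.Tensor, asset_returns: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
    """DMN-style loss: maximise the (annualised) Sharpe ratio of the strategy
    returns implied by the network's position outputs."""
    strategy_returns = positions * asset_returns          # per-step P&L
    return -(strategy_returns.mean() / (strategy_returns.std() + eps)) * (252 ** 0.5)

# Hypothetical usage with a small LSTM standing in for the trading network.
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 1)
features = torch.randn(32, 60, 8)                         # batch, time steps, features
asset_returns = 0.01 * torch.randn(32, 60)
hidden, _ = lstm(features)
positions = torch.tanh(head(hidden)).squeeze(-1)          # position sizes in [-1, 1]
loss = negative_sharpe(positions, asset_returns)
loss.backward()
```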

2. cs.SD (Sound / Speech):

【1】 Towards Robust Real-time Audio-Visual Speech Enhancement
Link: https://arxiv.org/abs/2112.09060

Authors: Mandar Gogate, Kia Dashtipour, Amir Hussain
Abstract: The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, during the cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues in order to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal to noise ratio (SNR) environments as compared to audio-only SE models. However, despite significant research in the area of AV SE, development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low latency speaker-independent AV SE that can generalise on a range of visual and acoustic noises. In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listening tests. Comparative simulation results show that our real-time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN based SE models.
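
As a rough illustration of the audio side of such a system, the sketch below predicts a time-frequency mask from a noisy magnitude spectrogram and applies it before resynthesis. It is a generic mask-based enhancer, not the proposed AV framework; in practice visual features would be concatenated to the mask network's input, and the FFT/hop sizes here are assumptions.

```python
import torch

# Minimal mask-based speech enhancement sketch (audio-only stand-in).
n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
noisy = torch.randn(16000)                           # 1 s of "noisy speech" at 16 kHz
spec = torch.stft(noisy, n_fft, hop_length=hop, window=window, return_complex=True)
mag = spec.abs()                                     # (freq bins, frames)

mask_net = torch.nn.Sequential(
    torch.nn.Linear(n_fft // 2 + 1, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, n_fft // 2 + 1), torch.nn.Sigmoid())
mask = mask_net(mag.T).T                             # frame-wise mask in [0, 1]
enhanced = torch.istft(spec * mask, n_fft, hop_length=hop, window=window,
                       length=noisy.numel())
print(enhanced.shape)
```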

【2】 Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Link: https://arxiv.org/abs/2112.08995

Authors: Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi
Note: Our code is available at this https URL
Abstract: Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
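
The pivot idea boils down to scoring audio against text in a shared embedding space even though the two modalities were never paired during training. Below is a conceptual sketch in which random linear encoders stand in for the pretrained image-audio and image-text towers; the embedding dimensions, feature sizes, and class count are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

# Conceptual zero-shot audio classification via a shared (pivot-aligned) space.
dim = 64
audio_encoder = torch.nn.Linear(128, dim)    # stand-in for the image-audio tower
text_encoder = torch.nn.Linear(300, dim)     # stand-in for the image-text tower

audio_clip = torch.randn(1, 128)             # e.g. a pooled spectrogram feature
class_texts = torch.randn(5, 300)            # e.g. embeddings of "dog bark", ...

a = F.normalize(audio_encoder(audio_clip), dim=-1)
t = F.normalize(text_encoder(class_texts), dim=-1)
similarity = a @ t.T                          # cosine similarity in the shared space
print("predicted class:", similarity.argmax(dim=-1).item())
```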

【3】 Knowledge Distillation Leveraging Alternative Soft Targets from Non-Parallel Qualified Speech Data
Link: https://arxiv.org/abs/2112.08878

Authors: Tohru Nagano, Takashi Fukuda, Gakuto Kurata
Abstract: This paper describes a novel knowledge distillation framework that leverages acoustically qualified speech data included in an existing training data pool as privileged information. In our proposed framework, a student network is trained with multiple soft targets for each utterance that consist of main soft targets from original speakers' utterance and alternative targets from other speakers' utterances spoken under better acoustic conditions as a secondary view. These qualified utterances from other speakers, used to generate better soft targets, are collected from a qualified data pool by using strict constraints in terms of word/phone/state durations. Our proposed method is a form of target-side data augmentation that creates multiple copies of data with corresponding better soft targets obtained from a qualified data pool. We show in our experiments under acoustic model adaptation settings that the proposed method, exploiting better soft targets obtained from various speakers, can further improve recognition accuracy compared with conventional methods using only soft targets from original speakers.
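
A distillation loss with two soft-target views per utterance can be written as a weighted sum of KL divergences, as sketched below. The weighting, temperature, and tensor shapes are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_target_kd_loss(student_logits, primary_soft, alternative_soft,
                         alpha=0.5, temperature=2.0):
    """Distillation sketch with two soft-target views per utterance: a weighted
    sum of KL divergences to the primary (original speaker) targets and to the
    alternative (better acoustic condition) targets."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    kl_primary = F.kl_div(log_p, primary_soft, reduction="batchmean")
    kl_alt = F.kl_div(log_p, alternative_soft, reduction="batchmean")
    return alpha * kl_primary + (1.0 - alpha) * kl_alt

# Hypothetical usage with random tensors standing in for teacher posteriors.
student_logits = torch.randn(8, 500, requires_grad=True)      # 8 frames, 500 senones
primary = F.softmax(torch.randn(8, 500) / 2.0, dim=-1)
alternative = F.softmax(torch.randn(8, 500) / 2.0, dim=-1)
multi_target_kd_loss(student_logits, primary, alternative).backward()
```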

【4】 EmotionBox: a music-element-driven emotional music generation system using Recurrent Neural Network
Link: https://arxiv.org/abs/2112.08561

Authors: Kaitong Zheng, Ruijie Meng, Chengshi Zheng, Xiaodong Li, Jinqiu Sang, Juanjuan Cai, Jie Wang
Abstract: With the development of deep neural networks, automatic music composition has made great progress. Although emotional music can evoke listeners' different emotions and it is important for artistic expression, only few researches have focused on generating emotional music. This paper presents EmotionBox, a music-element-driven emotional music generator that is capable of composing music given a specific emotion, where this model does not require a music dataset labeled with emotions. Instead, pitch histogram and note density are extracted as features that represent mode and tempo respectively to control music emotions. The subjective listening tests show that EmotionBox has a more competitive and balanced performance in arousing a specified emotion than the emotion-label-based method.
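
The two controlling features named in the abstract, a pitch-class histogram as a proxy for mode and note density as a proxy for tempo, are straightforward to compute from a symbolic note list. The toy note sequence below is an assumption for illustration, not data from the paper.

```python
import numpy as np

# Illustrative feature extraction: pitch-class histogram (mode) and note density (tempo).
notes = [(60, 0.0), (64, 0.5), (67, 1.0), (72, 1.5), (64, 2.0), (60, 2.5)]  # (MIDI pitch, onset s)

pitch_classes = [pitch % 12 for pitch, _ in notes]
pitch_hist = np.bincount(pitch_classes, minlength=12).astype(float)
pitch_hist /= pitch_hist.sum()                      # mode-related feature

duration = notes[-1][1] - notes[0][1] or 1.0
note_density = len(notes) / duration                # tempo-related feature (notes/sec)
print(pitch_hist.round(2), note_density)
```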

【5】 Expert and Crowd-Guided Affect Annotation and Prediction
Link: https://arxiv.org/abs/2112.08432

Authors: Ramanathan Subramanian, Yan Yan, Nicu Sebe
Note: Manuscript submitted for review to IEEE Transactions on Affective Computing
Abstract: We employ crowdsourcing to acquire time-continuous affective annotations for movie clips, and refine noisy models trained from these crowd annotations incorporating expert information within a Multi-task Learning (MTL) framework. We propose a novel expert-guided MTL (EG-MTL) algorithm, which minimizes the loss with respect to both crowd and expert labels to learn a set of weights corresponding to each movie clip for which crowd annotations are acquired. We employ EG-MTL to solve two problems, namely, P1: where dynamic annotations acquired from both experts and crowdworkers for the Validation set are used to train a regression model with audio-visual clip descriptors as features, and predict dynamic arousal and valence levels on 5-15 second snippets derived from the clips; and P2: where a classification model trained on the Validation set using dynamic crowd and expert annotations (as features) and static affective clip labels is used for binary emotion recognition on the Evaluation set for which only dynamic crowd annotations are available. Observed experimental results confirm the effectiveness of the EG-MTL algorithm, which is reflected via improved arousal and valence estimation for P1, and higher recognition accuracy for P2.
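
One way to read "a loss with respect to both crowd and expert labels, with a learned weight per clip" is the sketch below: per-clip weights in (0, 1) trade off the two squared-error terms and are optimised jointly with the regressor. This is a hedged interpretation for illustration only, not the published EG-MTL algorithm, and all dimensions and labels are made up.

```python
import torch

# Toy joint optimisation of a regressor and per-clip crowd/expert weights.
n_clips, feat_dim = 10, 32
regressor = torch.nn.Linear(feat_dim, 1)
clip_weights = torch.nn.Parameter(torch.zeros(n_clips))       # one weight per clip

features = torch.randn(n_clips, feat_dim)
crowd_labels = torch.randn(n_clips, 1)
expert_labels = torch.randn(n_clips, 1)

pred = regressor(features)
w = torch.sigmoid(clip_weights).unsqueeze(1)                   # weight in (0, 1)
loss = (w * (pred - expert_labels) ** 2 +
        (1 - w) * (pred - crowd_labels) ** 2).mean()
loss.backward()
```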

【6】 Object-based synthesis of scraping and rolling sounds based on non-linear physical constraints
Link: https://arxiv.org/abs/2112.08984

Authors: Vinayak Agarwal, Maddie Cusimano, James Traer, Josh McDermott
Abstract: Sustained contact interactions like scraping and rolling produce a wide variety of sounds. Previous studies have explored ways to synthesize these sounds efficiently and intuitively but could not fully mimic the rich structure of real instances of these sounds. We present a novel source-filter model for realistic synthesis of scraping and rolling sounds with physically and perceptually relevant controllable parameters constrained by principles of mechanics. Key features of our model include non-linearities to constrain the contact force, naturalistic normal force variation for different motions, and a method for morphing impulse responses within a material to achieve location-dependence. Perceptual experiments show that the presented model is able to synthesize realistic scraping and rolling sounds while conveying physical information similar to that in recorded sounds.
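
A drastically simplified source-filter sketch of a scraping sound is shown below: a noise excitation is modulated by a non-linearly compressed "contact force" trajectory and passed through a fixed two-pole resonator standing in for the material impulse response. The sample rate, force trajectory, and resonance are assumptions; the paper's model is far richer (e.g., location-dependent impulse-response morphing).

```python
import numpy as np
from scipy.signal import lfilter

# Simplified source-filter synthesis of a scraping-like sound.
sr = 16000
t = np.arange(sr) / sr
force = 1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)          # slowly varying normal force
force = np.tanh(force)                                  # non-linearity constrains the force
excitation = np.random.randn(sr) * force                # source: force-modulated noise

# Two-pole resonator near 1.2 kHz as a crude material "filter".
f0, r = 1200.0, 0.995
a = [1.0, -2 * r * np.cos(2 * np.pi * f0 / sr), r ** 2]
scrape = lfilter([1.0], a, excitation)
scrape /= np.max(np.abs(scrape))                        # normalise for playback
```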

【7】 Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification
Link: https://arxiv.org/abs/2112.08929

Authors: Sung Hwan Mun, Min Hyun Han, Dongjune Lee, Jihwan Kim, Nam Soo Kim
Note: Accepted by IEEE Access
Abstract: In this paper, we propose self-supervised speaker representation learning strategies, which comprise of a bootstrap equilibrium speaker representation learning in the front-end and an uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy can effectively help learn the speaker representations and outperforms the conventional methods based on contrastive learning. Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.
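
The back-end's mutual likelihood score between probabilistic (Gaussian) embeddings can be sketched as below. This follows the common diagonal-Gaussian formulation used in probabilistic embedding work; the exact form used in the paper, the embedding dimension, and the variances are assumptions.

```python
import torch

def mutual_likelihood_score(mu1, var1, mu2, var2):
    """Mutual likelihood score between two diagonal-Gaussian embeddings: higher
    when the means are close relative to their combined uncertainty."""
    var = var1 + var2
    return -0.5 * ((mu1 - mu2) ** 2 / var + torch.log(var)).sum(dim=-1)

# Hypothetical usage: same-speaker utterances should score higher than different speakers.
mu_a, var_a = torch.randn(192), torch.full((192,), 0.1)
mu_b, var_b = mu_a + 0.05 * torch.randn(192), torch.full((192,), 0.2)
mu_c, var_c = torch.randn(192), torch.full((192,), 0.1)
print(mutual_likelihood_score(mu_a, var_a, mu_b, var_b).item(),
      mutual_likelihood_score(mu_a, var_a, mu_c, var_c).item())
```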

3. eess.AS (Audio and Speech Processing):

【1】 Low Resource Species Agnostic Bird Activity Detection
Link: https://arxiv.org/abs/2112.09042

Authors: Mark Anderson, John Kennedy, Naomi Harte
Abstract: This paper explores low resource classifiers and features for the detection of bird activity, suitable for embedded Automatic Recording Units which are typically deployed for long term remote monitoring of bird populations. Features include low-level spectral parameters, statistical moments on pitch samples, and features derived from amplitude modulation. Performance is evaluated on several lightweight classifiers using the NIPS4Bplus dataset. Our experiments show that random forest classifiers perform best on this task, achieving an accuracy of 0.721 and an F1-Score of 0.604. We compare the results of our system against both a Convolutional Neural Network based detector, and standard MFCC features. Our experiments show that we can achieve equal or better performance in most metrics using features and models with a smaller computational cost and which are suitable for edge deployment.
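
A minimal stand-in for such a low-resource pipeline is shown below: a handful of hand-crafted spectral statistics (random numbers here in place of real NIPS4Bplus features) feed a small random forest, mirroring the kind of lightweight classifier the paper evaluates. Feature names and dataset sizes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Lightweight bird-activity detector: spectral statistics -> random forest.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))            # e.g. centroid, bandwidth, rolloff, AM stats
y = rng.integers(0, 2, size=400)         # 1 = bird activity present

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```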

【2】 Bioacoustic Event Detection with prototypical networks and data augmentation
Link: https://arxiv.org/abs/2112.09006

Authors: Mark Anderson, Naomi Harte
Note: 5 pages, 2 Figures, 3 Tables, Technical Report for DCASE2021 Challenge Task 5, June 2021
Abstract: This report presents deep learning and data augmentation techniques used by a system entered into the Few-Shot Bioacoustic Event Detection for the DCASE2021 Challenge. The remit was to develop a few-shot learning system for animal (mammal and bird) vocalisations. Participants were tasked with developing a method that can extract information from five exemplar vocalisations, or shots, of mammals or birds and detect and classify sounds in field recordings. In the system described in this report, prototypical networks are used to learn a metric space, from which classification is performed by computing the distance of a query point to class prototypes, classifying based on shortest distance. We describe the architecture of this network, feature extraction methods, and data augmentation performed on the given dataset and compare our work to the challenge's baseline networks.
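
The prototypical-network step described in the abstract, classifying a query by its distance to class prototypes computed from the support shots, is generic enough to sketch directly. The embeddings here are random stand-ins and the 2-way 5-shot episode is an assumption.

```python
import torch

# Core prototypical-network classification step (generic sketch, not the submitted system).
def classify_by_prototype(support_emb, support_labels, query_emb):
    classes = support_labels.unique()
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in classes])
    dists = torch.cdist(query_emb, prototypes)         # Euclidean distance to each prototype
    return classes[dists.argmin(dim=1)]                # nearest prototype wins

support = torch.randn(10, 64)                           # 2 classes x 5 shots (embedded)
labels = torch.tensor([0] * 5 + [1] * 5)
queries = torch.randn(3, 64)
print(classify_by_prototype(support, labels, queries))
```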

【3】 Object-based synthesis of scraping and rolling sounds based on non-linear physical constraints
Link: https://arxiv.org/abs/2112.08984

Authors: Vinayak Agarwal, Maddie Cusimano, James Traer, Josh McDermott
Abstract: Sustained contact interactions like scraping and rolling produce a wide variety of sounds. Previous studies have explored ways to synthesize these sounds efficiently and intuitively but could not fully mimic the rich structure of real instances of these sounds. We present a novel source-filter model for realistic synthesis of scraping and rolling sounds with physically and perceptually relevant controllable parameters constrained by principles of mechanics. Key features of our model include non-linearities to constrain the contact force, naturalistic normal force variation for different motions, and a method for morphing impulse responses within a material to achieve location-dependence. Perceptual experiments show that the presented model is able to synthesize realistic scraping and rolling sounds while conveying physical information similar to that in recorded sounds.

【4】 Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification
Link: https://arxiv.org/abs/2112.08929

Authors: Sung Hwan Mun, Min Hyun Han, Dongjune Lee, Jihwan Kim, Nam Soo Kim
Note: Accepted by IEEE Access
Abstract: In this paper, we propose self-supervised speaker representation learning strategies, which comprise of a bootstrap equilibrium speaker representation learning in the front-end and an uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy can effectively help learn the speaker representations and outperforms the conventional methods based on contrastive learning. Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.

【5】 Self-Supervised Learning for speech recognition with Intermediate layer supervision
Link: https://arxiv.org/abs/2112.08778

Authors: Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang
Note: Submitted to ICASSP 2022
Abstract: Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model is dedicated to audio content information learning. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly, which achieves a 23.5%/11.6% relative word error rate reduction in the w/o language model setting for base/large models. Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.
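
The core trick, attaching the same self-supervised prediction loss to an intermediate layer as well as the top layer, can be sketched schematically as below. The toy encoder, the chosen layer indices, and the cross-entropy over pseudo-label units are assumptions standing in for HuBERT-style masked prediction, not the authors' exact setup.

```python
import torch

# Schematic intermediate-layer supervision: the same prediction loss is applied
# at an intermediate layer and at the top layer of a stacked encoder.
layers = torch.nn.ModuleList([
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)])
head = torch.nn.Linear(64, 100)                       # predicts pseudo-label units

x = torch.randn(2, 50, 64)                            # batch, frames, features
targets = torch.randint(0, 100, (2, 50))
hidden, losses = x, []
for i, layer in enumerate(layers):
    hidden = layer(hidden)
    if i in (2, 5):                                   # intermediate and final layers
        logits = head(hidden)
        losses.append(torch.nn.functional.cross_entropy(
            logits.reshape(-1, 100), targets.reshape(-1)))
total_loss = sum(losses)
total_loss.backward()
```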

【6】 Towards Robust Real-time Audio-Visual Speech Enhancement
Link: https://arxiv.org/abs/2112.09060

Authors: Mandar Gogate, Kia Dashtipour, Amir Hussain
Abstract: The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, during the cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues in order to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal to noise ratio (SNR) environments as compared to audio-only SE models. However, despite significant research in the area of AV SE, development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low latency speaker-independent AV SE that can generalise on a range of visual and acoustic noises. In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listening tests. Comparative simulation results show that our real-time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN based SE models.

【7】 Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Link: https://arxiv.org/abs/2112.08995

Authors: Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi
Note: Our code is available at this https URL
Abstract: Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.

【8】 Knowledge Distillation Leveraging Alternative Soft Targets from Non-Parallel Qualified Speech Data
Link: https://arxiv.org/abs/2112.08878

Authors: Tohru Nagano, Takashi Fukuda, Gakuto Kurata
Abstract: This paper describes a novel knowledge distillation framework that leverages acoustically qualified speech data included in an existing training data pool as privileged information. In our proposed framework, a student network is trained with multiple soft targets for each utterance that consist of main soft targets from original speakers' utterance and alternative targets from other speakers' utterances spoken under better acoustic conditions as a secondary view. These qualified utterances from other speakers, used to generate better soft targets, are collected from a qualified data pool by using strict constraints in terms of word/phone/state durations. Our proposed method is a form of target-side data augmentation that creates multiple copies of data with corresponding better soft targets obtained from a qualified data pool. We show in our experiments under acoustic model adaptation settings that the proposed method, exploiting better soft targets obtained from various speakers, can further improve recognition accuracy compared with conventional methods using only soft targets from original speakers.

【9】 EmotionBox: a music-element-driven emotional music generation system using Recurrent Neural Network
Link: https://arxiv.org/abs/2112.08561

Authors: Kaitong Zheng, Ruijie Meng, Chengshi Zheng, Xiaodong Li, Jinqiu Sang, Juanjuan Cai, Jie Wang
Abstract: With the development of deep neural networks, automatic music composition has made great progress. Although emotional music can evoke listeners' different emotions and it is important for artistic expression, only few researches have focused on generating emotional music. This paper presents EmotionBox, a music-element-driven emotional music generator that is capable of composing music given a specific emotion, where this model does not require a music dataset labeled with emotions. Instead, pitch histogram and note density are extracted as features that represent mode and tempo respectively to control music emotions. The subjective listening tests show that EmotionBox has a more competitive and balanced performance in arousing a specified emotion than the emotion-label-based method.

【10】 Expert and Crowd-Guided Affect Annotation and Prediction
Link: https://arxiv.org/abs/2112.08432

Authors: Ramanathan Subramanian, Yan Yan, Nicu Sebe
Note: Manuscript submitted for review to IEEE Transactions on Affective Computing
Abstract: We employ crowdsourcing to acquire time-continuous affective annotations for movie clips, and refine noisy models trained from these crowd annotations incorporating expert information within a Multi-task Learning (MTL) framework. We propose a novel expert-guided MTL (EG-MTL) algorithm, which minimizes the loss with respect to both crowd and expert labels to learn a set of weights corresponding to each movie clip for which crowd annotations are acquired. We employ EG-MTL to solve two problems, namely, P1: where dynamic annotations acquired from both experts and crowdworkers for the Validation set are used to train a regression model with audio-visual clip descriptors as features, and predict dynamic arousal and valence levels on 5-15 second snippets derived from the clips; and P2: where a classification model trained on the Validation set using dynamic crowd and expert annotations (as features) and static affective clip labels is used for binary emotion recognition on the Evaluation set for which only dynamic crowd annotations are available. Observed experimental results confirm the effectiveness of the EG-MTL algorithm, which is reflected via improved arousal and valence estimation for P1, and higher recognition accuracy for P2.
