Finance / Speech / Audio Processing Academic Digest [12.24]

WeChat official account: arXiv Daily Academic Digest

Published 2021-12-27 17:03:54

This article is included in the column: arXiv Daily Academic Digest

q-fin (Finance): 5 papers

cs.SD (Speech): 7 papers

eess.AS (Audio Processing): 7 papers

1. q-fin (Finance):

【1】 Intra-Household Management of Joint Resources: Evidence from Malawi
Link: https://arxiv.org/abs/2112.12766

Authors: Anna Josephson
Affiliation: Department of Agricultural and Resource Economics, University of Arizona
Abstract: In models of intra-household resource allocation, the earnings from joint work between two or more household members are often omitted. Accounting for income earned jointly by multiple household members, in addition to income earned individually by men and women, I test the assumption of complete pooling of resources within a household. By explicitly including intra-household collaboration, I find evidence of partial income pooling and partial insurance within the household. Pooling is an intentional behavior, in which households pool income for essential and regular expenditures. This indicates that households operate differently and more collaboratively than previous empirical findings suggest.

【2】 Should transparency be (in-)transparent? On monitoring aversion and cooperation in teams
Link: https://arxiv.org/abs/2112.12621

Authors: Michalis Drouvelis, Johannes Jarke-Neuert, Johannes Lohse
Affiliation: University of Hamburg
Notes: 13 pages excluding appendix, 22 pages including appendix, 3 figures
Abstract: Many modern organisations employ methods which involve monitoring of employees' actions in order to encourage teamwork in the workplace. While monitoring promotes a transparent working environment, the effects of making monitoring itself transparent may be ambiguous and have received surprisingly little attention in the literature. Using a novel laboratory experiment, we create a working environment in which first movers can (or cannot) observe the second mover's monitoring at the end of a round. Our framework consists of a standard repeated sequential Prisoner's Dilemma, where the second mover can observe the choices made by first movers either exogenously or endogenously. We show that mutual cooperation occurs significantly more frequently when monitoring is made transparent. Additionally, our results highlight the key role of conditional cooperators (who are more likely to monitor) in promoting teamwork. Overall, the observed cooperation-enhancing effects are due to monitoring actions that carry information about first movers who use it to better screen the type of their co-player and thereby reduce the risk of being exploited.

【3】 A meta-analysis of residential PV adoption: the important role of perceived benefits, intentions and antecedents in solar energy acceptance
Link: https://arxiv.org/abs/2112.12464

Authors: Emily Schulte, Fabian Scheller, Daniel Sloot, Thomas Bruckner
Affiliations: Chair of Energy Management and Sustainability, Institute for Infrastructure and Resources Management (IIRM), Leipzig; Energy Economics and System Analysis, Division of Sustainability, Department of Technology, Management and Economics
Abstract: The adoption of residential photovoltaic (PV) systems is seen as an important part of the sustainable energy transition. To facilitate this process, it is crucial to identify the determinants of solar adoption. This paper follows a meta-analytical structural equation modeling approach, presenting a meta-analysis of studies on residential PV adoption intention, and assessing four behavioral models based on the theory of planned behavior to advance theory development. Of 653 initially identified studies, 110 remained for full-text screening. Only eight studies were sufficiently homogeneous, provided bivariate correlations, and could thus be integrated into the meta-analysis. The pooled correlations across primary studies revealed medium to large correlations between environmental concern, novelty seeking, perceived benefits, subjective norm and intention to adopt a residential PV system, whereas socio-demographic variables were uncorrelated with intention. Meta-analytical structural equation modeling revealed a model (N = 1,714) in which adoption intention was predicted by benefits and perceived behavioral control, and benefits in turn could be explained by environmental concern, novelty seeking, and subjective norm. Our results imply that measures should primarily focus on enhancing the perception of benefits. Based on obstacles we encountered within the analysis, we suggest guidelines to facilitate the future aggregation of scientific evidence, such as the systematic inclusion of key variables and reporting of bivariate correlations.
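
The abstract above mentions pooling bivariate correlations across primary studies but does not say how the pooling was done. As a rough illustration only, the sketch below uses the textbook fixed-effect approach (Fisher z-transform with weights n - 3); this choice, the function name `pooled_correlation`, and the sample values are assumptions for illustration and are not taken from the paper.

```python
import math

def pooled_correlation(studies):
    """Pool (r, n) pairs with Fisher's z-transform and weights n - 3.

    This is a generic fixed-effect pooling scheme, assumed here for
    illustration; it is not claimed to be the paper's procedure.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for r, n in studies:
        z = math.atanh(r)   # Fisher z-transform of the correlation
        w = n - 3           # inverse of the sampling variance of z
        weighted_sum += w * z
        weight_total += w
    # Transform the weighted mean z back to the correlation scale.
    return math.tanh(weighted_sum / weight_total)

# Hypothetical (correlation, sample size) pairs, not data from the paper.
example_studies = [(0.45, 200), (0.52, 310), (0.38, 150)]
print(round(pooled_correlation(example_studies), 3))
```

Larger studies pull the pooled value towards their own correlation, which is why reporting bivariate correlations and sample sizes, as the authors recommend, makes this kind of aggregation possible.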

【4】 Stationarity analysis of the stock market data and its application to mechanical trading
Link: https://arxiv.org/abs/2112.12459

Authors: Kazuki Kanehira, Norikazu Todoroki
Affiliations: Department of Computer Science, Chiba Institute of Technology, Tsudanuma, Narashino, Chiba, Japan; Physics Division, Chiba Institute of Technology, Shibazono, Narashino
Abstract: This study proposes a scheme for stationarity analysis of stock price fluctuations based on KM$_2$O-Langevin theory. Using this scheme, we classify the time-series data of stock price fluctuations into three periods: stationary, non-stationary, and intermediate. We then suggest an example of a low-risk stock trading strategy to demonstrate the usefulness of our scheme by using actual stock index data. Our strategy uses a trend-based indicator, moving averages, for stationary periods and an oscillator-based indicator, psychological lines, for non-stationary periods to make trading decisions. Finally, we confirm that our strategy is a safe trading strategy with small maximum drawdown by backtesting on the Nikkei Stock Average. Our study, the first to apply the stationarity analysis of KM$_2$O-Langevin theory to actual mechanical trading, opens up new avenues for stock price prediction.
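
To make the indicators named in the abstract above concrete, here is a minimal sketch of a simple moving average, a psychological line (percentage of up days in a window), and maximum drawdown. The KM$_2$O-Langevin stationarity test itself is not implemented: the regime label is taken as a given input. The window length, the 25/75 thresholds, the hold rule for intermediate periods, and all function names are assumptions for illustration, not the authors' settings.

```python
def moving_average(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

def psychological_line(prices, window=12):
    """Percentage of up days among the last `window` day-to-day changes."""
    out = [None] * len(prices)
    for i in range(window, len(prices)):
        ups = sum(1 for j in range(i - window + 1, i + 1)
                  if prices[j] > prices[j - 1])
        out[i] = 100.0 * ups / window
    return out

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def signal(regime, price, ma, psy, low=25.0, high=75.0):
    """Return +1 (buy), -1 (sell) or 0 (hold) for one day.

    `regime` is assumed to come from a separate stationarity analysis.
    The thresholds `low` and `high` are hypothetical.
    """
    if regime == "stationary" and ma is not None:
        return 1 if price > ma else -1      # trend-following rule
    if regime == "non-stationary" and psy is not None:
        if psy <= low:
            return 1                        # oversold: buy
        if psy >= high:
            return -1                       # overbought: sell
    return 0                                # intermediate or no data: hold
```

Switching the indicator by regime is the part that follows the abstract; everything about how the signals are turned into positions and how the regime is detected would have to come from the paper itself.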

【5】 Henderson-Chu model extended to two heterogeneous groups
Link: https://arxiv.org/abs/2112.12179

Authors: Oliver Chiriac, Jonathan Hall
Abstract: The goal of this paper is to revise the Henderson-Chu approach by dividing the commuters into two income-based groups: the 'rich' and the 'poor'. The rich are clearly more flexible in their arrival times and would rather incur more schedule delay cost than travel delay. The poor are quite inflexible as they have to arrive at work at a specified time; for them, travel delay cost is preferred over schedule delay cost. We combined multiple models of peak-load bottleneck congestion with and without toll-pricing to generate a Pareto improvement in Lexus lanes.

2. cs.SD (Speech):

【1】 Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition
Link: https://arxiv.org/abs/2112.12522

Authors: Changfeng Gao, Gaofeng Cheng, Yifan Guo, Qingwei Zhao, Pengyuan Zhang
Affiliation: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, University of Chinese Academy of Sciences
Notes: 5 pages, 2 figures
Abstract: Self-supervised acoustic pre-training has achieved amazing results on the automatic speech recognition (ASR) task. Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations by distinguishing the representations from different time steps, ignoring the speaker and environment robustness. As a result, the pre-trained model could show poor performance when meeting out-of-domain data during fine-tuning. In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training. Different kinds of augmentation are applied to the original audio, and the augmented audio is then fed into an encoder. The encoder should not only contrast the representations within one audio but also maximize the measurement of the representations across different augmented audio. In this way, the pre-trained model can learn a text-related representation method which is more robust to changes of the speaker or the environment. Experiments show that by applying the CCL method to Wav2Vec2.0, better results can be realized on both the in-domain data and the out-of-domain data. Especially for noisy out-of-domain data, more than 15% relative improvement can be obtained.
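
The abstract above does not give the exact form of the CCL objective. As a hedged illustration of the general idea (representations of the same frame under two augmentations should agree, while other frames act as negatives), the sketch below computes a generic InfoNCE-style loss across two views. The function names, the cosine similarity, and the temperature value are assumptions; this is not the authors' loss.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_view_info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss across two augmented views of one utterance.

    view_a[t] and view_b[t] are encoder outputs for the same time step
    under different augmentations (assumed to have equal length).
    Frame t of view_b is the positive for frame t of view_a; all other
    frames of view_b serve as negatives.
    """
    total = 0.0
    for t, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[t]
    return total / len(view_a)

# Toy check with hypothetical 2-dimensional "frames".
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(cross_view_info_nce(aligned, aligned) < cross_view_info_nce(aligned, swapped))
```

The loss is small when matching frames agree across views and large when they do not, which is the behaviour a consistency objective needs regardless of its exact formulation.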

【2】 S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation
Link: https://arxiv.org/abs/2112.12389

Authors: Chen Liang, Chong Yang, Jing Xu, Juyang Huang, Yongliang Wang, Yang Dong
Abstract: Emotion recognition in conversation (ERC) has attracted much attention in recent years for its necessity in widespread applications. Existing ERC methods mostly model the self and inter-speaker context separately, which poses a major issue: the two contexts do not interact enough. In this paper, we propose a novel Speaker and Position-Aware Graph neural network model for ERC (S+PAGE), which contains three stages to combine the benefits of both Transformer and relational graph convolution network (R-GCN) for better contextual modeling. Firstly, a two-stream conversational Transformer is presented to extract the coarse self and inter-speaker contextual features for each utterance. Then, a speaker and position-aware conversation graph is constructed, and we propose an enhanced R-GCN model, called PAG, to refine the coarse features guided by a relative positional encoding. Finally, the features from both of the former two stages are input into a conditional random field layer to model the emotion transfer.

【3】 Graph attentive feature aggregation for text-independent speaker verification
Link: https://arxiv.org/abs/2112.12343

Authors: Hye-jin Shim, Jungwoo Heo, Jae-han Park, Ga-hui Lee, Ha-Jin Yu
Affiliations: School of Computer Science, University of Seoul; KT Corporation
Notes: 5 pages, 1 figure, 6 tables, submitted to ICASSP 2022
Abstract: The objective of this paper is to combine multiple frame-level features into a single utterance-level representation considering pairwise relationships. For this purpose, we propose a novel graph attentive feature aggregation module by interpreting each frame-level feature as a node of a graph. The inter-relationship between all possible pairs of features, typically exploited indirectly, can be directly modeled using a graph. The module comprises a graph attention layer and a graph pooling layer followed by a readout operation. The graph attention layer first models the non-Euclidean data manifold between different nodes. Then, the graph pooling layer discards less informative nodes considering the significance of the nodes. Finally, the readout operation combines the remaining nodes into a single representation. We employ two recent systems, SE-ResNet and RawNet2, with different input features and architectures and demonstrate that the proposed feature aggregation module consistently shows a relative improvement of over 10% compared to the baseline.
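
The three steps described in the abstract above (graph attention, graph pooling, readout) can be sketched in a heavily simplified form. The version below has no learned parameters: attention is plain scaled dot-product over a fully connected graph, pooling keeps the top-k nodes by projection onto a given score vector, and the readout is a mean. All of these specific choices and names are assumptions for illustration; the paper's layers are trained and may differ substantially.

```python
import math

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def graph_attention(nodes):
    """Each frame (node) attends to every node of a fully connected graph."""
    dim = len(nodes[0])
    updated = []
    for query in nodes:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in nodes])
        updated.append([sum(w * key[d] for w, key in zip(weights, nodes))
                        for d in range(dim)])
    return updated

def top_k_pool(nodes, score_vector, k):
    """Keep the k nodes with the largest projection onto score_vector."""
    ranked = sorted(nodes, key=lambda n: dot(n, score_vector), reverse=True)
    return ranked[:k]

def readout(nodes):
    """Average the remaining nodes into one utterance-level vector."""
    dim = len(nodes[0])
    return [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]

def aggregate(frames, score_vector, k):
    """Attention, then pooling, then readout, in the order the abstract lists."""
    return readout(top_k_pool(graph_attention(frames), score_vector, k))

# Hypothetical frame-level features (3 frames, 2 dimensions).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(aggregate(frames, score_vector=[1.0, 0.0], k=2))
```

In a trained module the score vector and the attention projections would be learned, so that pooling discards frames that carry little speaker information.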

【4】 Perceptual Evaluation of 360 Audiovisual Quality and Machine Learning Predictions
Link: https://arxiv.org/abs/2112.12273

Authors: Randy Frans Fela, Nick Zacharov, Søren Forchhammer
Affiliations: SenseLab, FORCE Technology, Hørsholm, Denmark; Dept. Photonics Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark
Abstract: In an earlier study, we gathered perceptual evaluations of the audio, video, and audiovisual quality for 360° audiovisual content. This paper investigates perceived audiovisual quality prediction based on objective quality metrics and subjective scores of 360° video and spatial audio content. Thirteen objective video quality metrics and three objective audio quality metrics were evaluated for five stimuli for each coding parameter. Four regression-based machine learning models were trained and tested here, i.e., multiple linear regression, decision tree, random forest, and support vector machine. Each model was constructed using a combination of audio and video quality metrics, and two cross-validation methods (k-Fold and Leave-One-Out) were investigated, producing 312 predictive models. The results indicate that the model based on the evaluation of VMAF and AMBIQUAL is better than other combinations of audio-video quality metrics. In this study, support vector machine provides higher performance using k-Fold (PCC = 0.909, SROCC = 0.914, and RMSE = 0.416). These results can provide insights for the design of multimedia quality metrics and the development of predictive models for audiovisual omnidirectional media.
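
The evaluation protocol named in the abstract above (k-fold cross-validation scored with PCC, SROCC and RMSE) can be illustrated with a small sketch. A one-feature least-squares line stands in for the paper's regressors, and the fold assignment by index is arbitrary; both are assumptions made to keep the example short, as are all function names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def fit_line(x, y):
    """Ordinary least squares for y = a * x + b (stand-in regressor)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def k_fold_predictions(x, y, k):
    """Out-of-fold predictions: each sample is predicted by a model
    fitted without it. Setting k = len(x) gives leave-one-out."""
    preds = [0.0] * len(x)
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in range(len(x)):
            if i % k == fold:
                preds[i] = a * x[i] + b
    return preds
```

A typical use would pass an objective metric score per stimulus as `x` and the mean subjective score as `y`, then report `pearson`, `spearman` and `rmse` between the out-of-fold predictions and `y`.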

【5】 Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios
Link: https://arxiv.org/abs/2112.12743

Authors: Qicong Xie, Tao Li, Xinsheng Wang, Zhichao Wang, Lei Xie, Guoqiao Yu, Guanglu Wan
Affiliations: Northwestern Polytechnical University, Xi'an, China; School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; Meituan-Dianping Group, Beijing, China
Notes: submitted to ICASSP 2022
Abstract: In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task is proposed: producing expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments demonstrate that the proposed method can successfully express the style of one speaker with the timbre of another speaker, bypassing the dependency on a single speaker's multi-style corpus. Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.

【6】 Are E2E ASR models ready for an industrial usage?
Link: https://arxiv.org/abs/2112.12572

Authors: Valentin Vielzeuf, Grigory Antipov
Affiliation: Orange, rue du Clos Courtel, Cesson-Sévigné, France
Abstract: The Automated Speech Recognition (ASR) community experiences a major turning point with the rise of the fully-neural (End-to-End, E2E) approaches. At the same time, the conventional hybrid model remains the standard choice for the practical usage of ASR. According to previous studies, the adoption of E2E ASR in real-world applications was hindered by two main limitations: their ability to generalize to unseen domains and their high operational cost. In this paper, we investigate both above-mentioned drawbacks by performing a comprehensive multi-domain benchmark of several contemporary E2E models and a hybrid baseline. Our experiments demonstrate that E2E models are viable alternatives to the hybrid approach, and even outperform the baseline both in accuracy and in operational efficiency. As a result, our study shows that the generalization and complexity issues are no longer the major obstacle for industrial integration, and draws the community's attention to other potential limitations of the E2E approaches in some specific use-cases.

【7】 Nonnegative OPLS for Supervised Design of Filter Banks: Application to Image and Audio Feature Extraction
Link: https://arxiv.org/abs/2112.12280

Authors: Sergio Muñoz-Romero, Jerónimo Arenas García, Vanessa Gómez-Verdejo
Affiliation: Universidad Politécnica de Madrid
Abstract: Audio or visual data analysis tasks usually have to deal with high-dimensional and nonnegative signals. However, most data analysis methods suffer from overfitting and numerical problems when data have more than a few dimensions, so a dimensionality reduction preprocessing step is needed. Moreover, interpretability about how and why filters work for audio or visual applications is a desired property, especially when energy or spectral signals are involved. In these cases, due to the nature of these signals, the nonnegativity of the filter weights is a desired property to better understand how they work. Because of these two necessities, we propose different methods to reduce the dimensionality of data while the nonnegativity and interpretability of the solution are assured. In particular, we propose a generalized methodology to design filter banks in a supervised way for applications dealing with nonnegative data, and we explore different ways of solving the proposed objective function consisting of a nonnegative version of the orthonormalized partial least-squares method. We analyze the discriminative power of the features obtained with the proposed methods for two different and widely studied applications: texture and music genre classification. Furthermore, we compare the filter banks achieved by our methods with other state-of-the-art methods specifically designed for feature extraction.

3. eess.AS (Audio Processing):

【1】 Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios
Link: https://arxiv.org/abs/2112.12743

Authors: Qicong Xie, Tao Li, Xinsheng Wang, Zhichao Wang, Lei Xie, Guoqiao Yu, Guanglu Wan
Abstract: Identical to item 【5】 in section 2 (cs.SD); the paper appears in both categories.

【2】 Are E2E ASR models ready for an industrial usage?
Link: https://arxiv.org/abs/2112.12572

Authors: Valentin Vielzeuf, Grigory Antipov
Abstract: Identical to item 【6】 in section 2 (cs.SD); the paper appears in both categories.

【3】 Nonnegative OPLS for Supervised Design of Filter Banks: Application to Image and Audio Feature Extraction
Link: https://arxiv.org/abs/2112.12280

Authors: Sergio Muñoz-Romero, Jerónimo Arenas García, Vanessa Gómez-Verdejo
Abstract: Identical to item 【7】 in section 2 (cs.SD); the paper appears in both categories.

【4】 Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition
Link: https://arxiv.org/abs/2112.12522

Authors: Changfeng Gao, Gaofeng Cheng, Yifan Guo, Qingwei Zhao, Pengyuan Zhang
Abstract: Identical to item 【1】 in section 2 (cs.SD); the paper appears in both categories.

【5】 S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation
Link: https://arxiv.org/abs/2112.12389

Authors: Chen Liang, Chong Yang, Jing Xu, Juyang Huang, Yongliang Wang, Yang Dong
Abstract: Identical to item 【2】 in section 2 (cs.SD); the paper appears in both categories.

【6】 Graph attentive feature aggregation for text-independent speaker verification
Link: https://arxiv.org/abs/2112.12343

Authors: Hye-jin Shim, Jungwoo Heo, Jae-han Park, Ga-hui Lee, Ha-Jin Yu
Abstract: Identical to item 【3】 in section 2 (cs.SD); the paper appears in both categories.

【7】 Perceptual Evaluation of 360 Audiovisual Quality and Machine Learning Predictions
Link: https://arxiv.org/abs/2112.12273

Authors: Randy Frans Fela, Nick Zacharov, Søren Forchhammer
Abstract: Identical to item 【4】 in section 2 (cs.SD); the paper appears in both categories.

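The Chinese titles and abstracts in the original digest were machine-translated and are for reference only.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2021-12-24. In case of copyright infringement, contact cloudcommunity@tencent.com for removal.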
