
arXiv Daily Digest: Finance / Speech / Audio Processing [June 23]

By: 公众号-arXiv每日学术速递 (arXiv Daily Digest)
Published: 2021-07-02 18:28:14

Visit www.arxivdaily.com for digests with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, along with search, bookmarking, and posting features.

q-fin (Quantitative Finance): 8 papers

cs.SD (Sound): 10 papers

eess.AS (Audio and Speech Processing): 10 papers

1. q-fin (Quantitative Finance):

【1】 A systems framework for remedying dysfunction in U.S. democracy

Authors: Samuel S.-H. Wang, Jonathan Cervas, Bernard Grofman, Keena Lipsitz
Affiliations: Institute for Politics and Strategy, Carnegie Mellon University; Department of Political Science, University of California
Link: https://arxiv.org/abs/2106.11901
Abstract: Democracy often fails to meet its ideals, and these failures may be made worse by electoral institutions. Unwanted outcomes include polarized institutions, unresponsive representatives, and the ability of a faction of voters to gain power at the expense of the majority. Various reforms have been proposed to address these problems, but their effectiveness is difficult to predict against a backdrop of complex interactions. Here we outline a path for systems-level modeling to help understand and optimize repairs to U.S. democracy. Following the tradition of engineering and biology, models of systems include mechanisms with dynamical properties that include nonlinearities and amplification (voting rules), positive feedback mechanisms (single-party control, gerrymandering), negative feedback (checks and balances), integration over time (lifetime judicial appointments), and low dimensionality (polarization). To illustrate a systems-level approach we analyze three emergent phenomena: low dimensionality, elite polarization, and anti-majoritarianism in legislatures. In each case, long-standing rules now contribute to undesirable outcomes as a consequence of changes in the political environment. Theoretical understanding at a general level will also help evaluate whether a proposed reform's benefits will materialize and be lasting, especially as conditions change again. In this way, rigorous modeling may not only shape new lines of research, but aid in the design of effective and lasting reform.

【2】 Quantifying the Impact of Human Capital, Job History, and Language Factors on Job Seniority with a Large-scale Analysis of Resumes

Authors: Austin P Wright, Caleb Ziems, Haekyu Park, Jon Saad-Falcon, Duen Horng Chau, Diyi Yang, Maria Tomprou
Affiliations: Georgia Institute of Technology, Atlanta, Georgia, USA; Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Note: 9 pages, 5 figures, 8 tables
Link: https://arxiv.org/abs/2106.11846
Abstract: As job markets worldwide have become more competitive and applicant selection criteria have become more opaque, and different (and sometimes contradictory) information and advice is available for job seekers wishing to progress in their careers, it has never been more difficult to determine which factors in a résumé most effectively help career progression. In this work we present a novel, large-scale dataset of over half a million résumés with preliminary analysis to begin to answer empirically which factors help or hurt people wishing to transition to more senior roles as they progress in their career. We find that previous experience forms the most important factor, outweighing other aspects of human capital, and find which language factors in a résumé have significant effects. This lays the groundwork for future inquiry into career trajectories using large-scale data analysis and natural language processing techniques.

【3】 Two Price Regimes in Limit Order Books: Liquidity Cushion and Fragmented Distant Field

Authors: Sebastian M. Krause, Edgar Jungblut, Thomas Guhr
Link: https://arxiv.org/abs/2106.11691
Abstract: The distribution of liquidity within the limit order book is essential for the impact of market orders on the stock price and the emergence of price shocks. Hence it is of great interest to improve the understanding of the time-dependent dynamics of the limit order book. In our analysis we find a broad distribution of limit order lifetimes. Around the quotes we find a densely filled regime with mostly short-living limit orders; far away from the quotes we find a sparse filling with mostly long-living limit orders. We determine the characteristics of those two regimes and point out the main differences. Based on our research we propose a model for simulating the regime around the quotes.

【4】 Information of income position and its impact on perceived tax burden and preference for redistribution: An Internet Survey Experiment

Authors: Eiji Yamamura
Affiliations: Seinan Gakuin University, Japan
Link: https://arxiv.org/abs/2106.11537
Abstract: A customized internet survey experiment is conducted in Japan to examine how individuals' relative income position influences preferences for income redistribution and individual perceptions regarding income tax burden. I first asked respondents about their perceived income position in their country and their preferences for redistribution and perceived tax burden. In the follow-up survey for the treatment group, I provided information on their true income position and asked the same questions as in the first survey. For the control group, I did not provide their true income position and asked the same questions. I gathered a large sample that comprised observations of the treatment group (4,682) and the control group (2,268). The key findings suggest that after being informed of individuals' real income position, (1) individuals who thought their income position was higher than the true one perceived their tax burden to be larger, (2) individuals' preference for redistribution hardly changes, and (3) irreciprocal individuals perceive their tax burden to be larger and are more likely to prefer redistribution. However, the share of irreciprocal ones is small. This leads Japan to be a non-welfare state.

【5】 Sub- and Super-solution Approach to Accuracy Analysis of Portfolio Optimization Asymptotics in Multiscale Stochastic Factor Market

Authors: Jean-Pierre Fouque, Ruimeng Hu, Ronnie Sircar
Link: https://arxiv.org/abs/2106.11510
Abstract: The problem of portfolio optimization when stochastic factors drive returns and volatilities has been studied in previous works by the authors. In particular, they proposed asymptotic approximations for value functions and optimal strategies in the regime where these factors are running on both slow and fast timescales. However, the rigorous justification of the accuracy of these approximations has been limited to power utilities and a single factor. In this paper, we provide an accuracy analysis for cases with general utility functions and two timescale factors by constructing sub- and super-solutions to the fully nonlinear problem such that their difference is at the desired level of accuracy. This approach will be valuable in various related stochastic control problems.

【6】 Sectoral portfolio optimization by judicious selection of financial ratios via PCA

Authors: Vrinda Dhingra, Amita Sharma, Shiv K. Gupta
Note: 26 pages, 12 tables
Link: https://arxiv.org/abs/2106.11484
Abstract: Embedding value investment in portfolio optimization models has always been a challenge. In this paper, we attempt to incorporate it by first employing principal component analysis (PCA) sector wise to filter out dominant financial ratios from each sector and thereafter, use the portfolio optimization model incorporating second order stochastic dominance (SSD) criteria to derive the final optimal investment. We consider a total of 11 well known financial ratios corresponding to each sector representing four categories of ratios, namely liquidity, solvency, profitability, and valuation. PCA is then applied sector wise over a period of 10 years from April 2004 to March 2014 to extract dominant ratios from each sector in two ways, one from the component solution and other from each category on the basis of their communalities. The two step Sectoral Portfolio Optimization (SPO) model integrating the SSD criteria in constraints is then utilized to build an optimal portfolio. The strategy formed using the former extracted ratios is termed as PCA-SPO(A) and the latter one as PCA-SPO(B). The results obtained from the proposed strategies are compared with the SPO model and two nominal SSD models, with and without financial ratios for computational study. Empirical performance of proposed strategies is assessed over the period of 6 years from April 2014 to March 2020 using a rolling window scheme with varying out-of-sample time line of 3, 6, 9, 12 and 24 months for S&P BSE 500 market. We observe that the proposed strategy PCA-SPO(B) outperforms all other models in terms of downside deviation, CVaR, VaR, Sortino ratio, Rachev ratio, and STARR ratios over almost all out-of-sample periods. This highlights the importance of value investment where ratios are carefully selected and embedded quantitatively in portfolio selection process.
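The sector-wise screening step has a compact off-the-shelf form. The snippet below is a rough illustration, not the authors' code: the ratio names and data are invented, and the communality of each ratio is computed from the PCA loadings in the usual way (sum of squared loadings over the retained components).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical monthly observations of 11 financial ratios for one sector
ratio_names = ["current_ratio", "quick_ratio", "debt_equity", "interest_cover",
               "roe", "roa", "net_margin", "asset_turnover",
               "pe", "pb", "ev_ebitda"]
X = rng.normal(size=(120, len(ratio_names)))

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=4).fit(Z)  # retain a few leading components

# Loadings and communalities: communality_j = sum_k loading_{jk}^2
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
communalities = (loadings ** 2).sum(axis=1)

# Rank ratios by how well the retained components explain them
dominant = [ratio_names[i] for i in np.argsort(communalities)[::-1][:4]]
print(dominant)
```

In the paper the selection is further done per ratio category (liquidity, solvency, profitability, valuation); the ranking above collapses that detail for brevity.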

【7】 Bitcoin's Crypto Flow Network

Authors: Yoshi Fujiwara, Rubaiyat Islam
Affiliations: Graduate School of Information Science, University of Hyogo, Kobe, Japan
Note: 39 pages, 18 figures; "Blockchain in Kyoto 2021" conference; forthcoming in JPS Conference Proceedings
Link: https://arxiv.org/abs/2106.11446
Abstract: How crypto flows among Bitcoin users is an important question for understanding the structure and dynamics of the cryptoasset at a global scale. We compiled all the blockchain data of Bitcoin from its genesis to the year 2020, identified users from anonymous addresses of wallets, and constructed monthly snapshots of networks by focusing on regular users as big players. We apply the methods of bow-tie structure and Hodge decomposition in order to locate the users in the upstream, downstream, and core of the entire crypto flow. Additionally, we reveal principal components hidden in the flow by using non-negative matrix factorization, which we interpret as a probabilistic model. We show that the model is equivalent to a probabilistic latent semantic analysis in natural language processing, enabling us to estimate the number of such hidden components. Moreover, we find that the bow-tie structure and the principal components are quite stable among those big players. This study can be a solid basis on which one can further investigate the temporal change of crypto flow, entry and exit of big players, and so forth.
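The factorization step described above can be sketched with scikit-learn. This is a hedged illustration under invented data (a synthetic nonnegative month-by-user flow matrix, not Bitcoin data); normalizing the factors is what yields the probabilistic, PLSA-like reading the paper mentions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
# Hypothetical month x user matrix of nonnegative crypto outflows
V = rng.gamma(shape=2.0, scale=1.0, size=(24, 50))

model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)  # month-to-component weights
H = model.components_       # component-to-user profiles

# Normalizing rows/columns of W and H gives the probabilistic reading:
# P(month, user) ~ sum_k P(month | k) P(k) P(user | k)
reconstruction = W @ H
rel_err = np.linalg.norm(V - reconstruction) / np.linalg.norm(V)
print(rel_err)
```

With real data, the number of hidden components would be chosen by the PLSA-equivalence argument in the paper, not fixed at 3 as here.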

【8】 Learning versus Unlearning: An Experiment on Retractions

Authors: Duarte Gonçalves, Jonathan Libgober, Jack Willis
Affiliations: University College London; University of Southern California; Columbia University
Link: https://arxiv.org/abs/2106.11433
Abstract: Widely discredited ideas nevertheless persist. Why do people fail to "unlearn"? We study one explanation: beliefs are resistant to retractions (the revoking of earlier information). Our experimental design identifies unlearning -- i.e., updating from retractions -- and enables its comparison with learning from equivalent new information. Across different kinds of retractions -- for instance, those consistent or contradictory with the prior, or those occurring when prior beliefs are either extreme or moderate -- subjects do not fully unlearn from retractions and update less from them than from equivalent new information. This phenomenon is not explained by most of the well-studied violations of Bayesian updating, which yield differing predictions in our design. However, it is consistent with difficulties in conditional reasoning, which have been documented in other domains and circumstances.
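For contrast with the experimental findings, the Bayesian benchmark is easy to state: a signal moves the posterior away from the prior, and a retraction declaring that signal uninformative should move beliefs all the way back to the prior. A minimal numeric sketch (illustrative prior and likelihoods, not the paper's parameters):

```python
def bayes_update(prior, p_signal_given_true, p_signal_given_false):
    """Posterior probability of the hypothesis after observing the signal."""
    num = prior * p_signal_given_true
    return num / (num + (1 - prior) * p_signal_given_false)

prior = 0.5
posterior = bayes_update(prior, 0.8, 0.2)  # informative signal raises belief
# A retraction tells the agent the signal was uninformative; a fully
# Bayesian "unlearner" should therefore simply return to the prior (0.5).
print(posterior, prior)
```

The paper's finding is precisely that subjects do not complete this second step: beliefs stay closer to the post-signal posterior than the prior.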

2. cs.SD (Sound):

【1】 Glance and Gaze: A Collaborative Learning Framework for Single-channel Speech Enhancement

Authors: Andong Li, Chengshi Zheng, Lu Zhang, Xiaodong Li
Link: https://arxiv.org/abs/2106.11789
Abstract: The capability of the human to pay attention to both coarse and fine-grained regions has been applied to computer vision tasks. Motivated by that, we propose a collaborative learning framework in the complex domain for monaural noise suppression. The proposed system consists of two principal modules, namely the spectral feature extraction module (FEM) and stacked glance-gaze modules (GGMs). In the FEM, the UNet-block is introduced after each convolution layer, enabling feature recalibration from multiple scales. In each GGM, we decompose the multi-target optimization in the complex spectrum into two sub-tasks. Specifically, the glance path aims to suppress the noise in the magnitude domain to obtain a coarse estimation, and meanwhile, the gaze path attempts to compensate for the lost spectral detail in the complex domain. The two paths work collaboratively and facilitate spectral estimation from complementary perspectives. Besides, by repeatedly unfolding the GGMs, the intermediate result can be iteratively refined across stages and lead to the ultimate estimation of the spectrum. The experiments are conducted on the WSJ0-SI84, DNS-Challenge, and Voicebank+Demand datasets. Results show that the proposed approach achieves state-of-the-art performance over previous advanced systems on the WSJ0-SI84 and DNS-Challenge datasets, while competitive performance is achieved on the Voicebank+Demand corpus.

【2】 Learning to Inference with Early Exit in the Progressive Speech Enhancement

Authors: Andong Li, Chengshi Zheng, Lu Zhang, Xiaodong Li
Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Department of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China
Note: Accepted by EUSIPCO 2021
Link: https://arxiv.org/abs/2106.11730
Abstract: In real scenarios, it is often necessary and significant to control the inference speed of speech enhancement systems under different conditions. To this end, we propose a stage-wise adaptive inference approach with an early exit mechanism for progressive speech enhancement. Specifically, in each stage, once the spectral distance between adjacent stages falls below an empirically preset threshold, the inference will terminate and output the estimation, which can effectively accelerate the inference speed. To further improve the performance of existing speech enhancement systems, PL-CRN++ is proposed, which is an improved version of our preliminary work PL-CRN and combines a stage recurrent mechanism and complex spectral mapping. Extensive experiments are conducted on the TIMIT corpus; the results demonstrate the superiority of our system over state-of-the-art baselines in terms of PESQ, ESTOI and DNSMOS. Moreover, by adjusting the threshold, we can easily control the inference efficiency while sustaining the system performance.
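The early-exit rule can be sketched generically: run stages until the normalized spectral distance between adjacent stage outputs drops below a preset threshold. The toy stages below are stand-ins that shrink a noise component geometrically, not the PL-CRN++ model.

```python
import numpy as np

def progressive_enhance(noisy_spec, stages, threshold=0.05):
    """Run enhancement stages until the relative spectral distance between
    consecutive stage outputs drops below `threshold` (early exit).
    `stages` is a list of callables mapping a spectrum to a refined one."""
    est = noisy_spec
    for i, stage in enumerate(stages):
        new_est = stage(est)
        dist = np.linalg.norm(new_est - est) / (np.linalg.norm(est) + 1e-8)
        est = new_est
        if dist < threshold:  # adjacent stages barely differ: stop here
            return est, i + 1
    return est, len(stages)

# Toy stages whose residual noise decays geometrically toward `target`
target = np.ones((4, 8))
noisy = target + 0.5
stages = [(lambda s, k=k: target + 0.5 * 0.3 ** (k + 1)) for k in range(5)]
est, n_used = progressive_enhance(noisy, stages)
print(n_used)  # early exit: 3 of the 5 stages were run
```

Raising the threshold exits earlier (faster, rougher estimates); lowering it runs more stages, which mirrors the speed/quality knob described in the abstract.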

【3】 Multi-accent Speech Separation with One Shot Learning

Authors: Kuan-Po Huang, Yuan-Kuei Wu, Hung-yi Lee
Affiliations: National Taiwan University, Graduate Institute of Computer Science and Information Engineering; Graduate Institute of Communication Engineering
Note: Accepted at ACL 2021 Meta Learning for NLP
Link: https://arxiv.org/abs/2106.11713
Abstract: Speech separation is a problem in the field of speech processing that has been studied in full swing recently. However, there has not been much work studying a multi-accent speech separation scenario. Unseen speakers with new accents and noise aroused the domain mismatch problem which cannot be easily solved by conventional joint training methods. Thus, we applied MAML and FOMAML to tackle this problem and obtained higher average Si-SNRi values than joint training on almost all the unseen accents. This proved that these two methods do have the ability to generate well-trained parameters for adapting to speech mixtures of new speakers and accents. Furthermore, we found out that FOMAML obtains similar performance compared to MAML while saving a lot of time.

【4】 Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw

Authors: Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski
Note: Published in Interspeech 2021
Link: https://arxiv.org/abs/2106.11603
Abstract: We present a number of low-resource approaches to the tasks of the Zero Resource Speech Challenge 2021. We build on the unsupervised representations of speech proposed by the organizers as a baseline, derived from CPC and clustered with the k-means algorithm. We demonstrate that simple methods of refining those representations can narrow the gap, or even improve upon the solutions which use a high computational budget. The results lead to the conclusion that the CPC-derived representations are still too noisy for training language models, but stable enough for simpler forms of pattern matching and retrieval.
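The baseline quantization the authors build on, clustering CPC frame features with k-means into discrete pseudo-units, has a one-line form with scikit-learn. The random features below are stand-ins for actual CPC embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Stand-in for CPC frame embeddings: 1000 frames of 16-dim features
frames = rng.normal(size=(1000, 16))

# Quantize frames into 50 discrete pseudo-phone units
kmeans = KMeans(n_clusters=50, n_init=4, random_state=0).fit(frames)
units = kmeans.predict(frames)  # one unit id per frame

print(units[:10])
```

Downstream, such unit sequences feed pattern matching and retrieval; the paper's contribution is in refining the representations before this step.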

【5】 Key-Sparse Transformer with Cascaded Cross-Attention Block for Multimodal Speech Emotion Recognition

Authors: Weidong Chen, Xiaofeng Xing, Xiangmin Xu, Jichen Yang, Jianxin Pang
Affiliations: School of Electronic and Information Engineering, South China University of Technology, China; UBTECH Robotics Corp, China
Link: https://arxiv.org/abs/2106.11532
Abstract: Speech emotion recognition is a challenging and important research topic that plays a critical role in human-computer interaction. Multimodal inputs can improve the performance as more emotional information is used for recognition. However, existing studies learnt all the information in the sample while only a small portion of it is about emotion. Moreover, under the multimodal framework, the interaction between different modalities is shallow and insufficient. In this paper, a key-sparse Transformer is proposed for efficient SER by focusing only on emotion-related information. Furthermore, a cascaded cross-attention block, which is specially designed for the multimodal framework, is introduced to achieve deep interaction between different modalities. The proposed method is evaluated on the IEMOCAP corpus and the experimental results show that the proposed method gives better performance than the state-of-the-art approaches.

【6】 Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Authors: Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren
Affiliations: Ghent University, Belgium; ByteDance AI Lab, China
Note: Accepted by INTERSPEECH 2021
Link: https://arxiv.org/abs/2106.11411
Abstract: Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audiovisual information. To integrate information of audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating the cross-modal information fusion based on semantic correlation is sensible and successful.

【7】 Do sound event representations generalize to other audio tasks? A case study in audio transfer learning

Authors: Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen
Affiliations: Facebook Reality Labs Research; Facebook Applied AI Research
Note: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2106.11335
Abstract: Transfer learning is critical for efficient information transfer across multiple related learning problems. A simple, yet effective transfer learning approach utilizes deep neural networks trained on a large-scale task for feature extraction. Such representations are then used to learn related downstream tasks. In this paper, we investigate the transfer learning capacity of audio representations obtained from neural networks trained on a large-scale sound event detection dataset. We build and evaluate these representations across a wide range of other audio tasks, via a simple linear classifier transfer mechanism. We show that such simple linear transfer is already powerful enough to achieve high performance on the downstream tasks. We also provide insights into the attributes of sound event representations that enable such efficient information transfer.
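The "simple linear classifier transfer mechanism" amounts to freezing the pretrained embeddings and fitting only a linear model on top. A minimal sketch with synthetic embeddings standing in for the pretrained network's features (not the authors' models or tasks):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Stand-in for fixed embeddings from a pretrained sound-event network
X_train = rng.normal(size=(200, 128))
y_train = rng.integers(0, 5, size=200)   # 5 hypothetical downstream classes
X_test = rng.normal(size=(50, 128))

# Linear transfer: the backbone stays frozen; only this classifier is fit
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred.shape)
```

The appeal of this protocol is diagnostic: if a linear probe already performs well, the pretrained representation itself carries the task-relevant structure.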

【8】 Deep neural network Based Low-latency Speech Separation with Asymmetric analysis-Synthesis Window Pair

Authors: Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen
Affiliations: Audio Research Group, Tampere University, Tampere, Finland
Note: Accepted to EUSIPCO 2021
Link: https://arxiv.org/abs/2106.11794
Abstract: Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the usage of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low latency during inference suitable for real-time speech enhancement or assisted hearing applications. In order to assess our approach across various model types and datasets, we evaluate it with both a speaker-independent deep clustering (DC) model and a speaker-dependent mask inference (MI) model. We report an improvement in separation performance of up to 1.5 dB in terms of source-to-distortion ratio (SDR) while maintaining an algorithmic latency of 8 ms.

【9】 Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism

Authors: Haiyang Liu, Jihan Zhang
Affiliations: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan; School of Mechanical Engineering, Southeast University, Nanjing, China
Note: Accepted in KDD Workshop (BIOKDD 2021)
Link: https://arxiv.org/abs/2106.11769
Abstract: Speech production is a dynamic procedure involving multiple human organs, including the tongue, jaw, and lips. Modeling the dynamics of vocal tract deformation is a fundamental problem in understanding speech, the most common mode of daily human communication. Researchers employ several sensory streams to describe the process simultaneously, streams which are incontrovertibly statistically related to one another. In this paper, we address the following question: given an observable image sequence of the lips, can we picture the corresponding tongue motion? We formulate this problem as a self-supervised learning problem and employ a two-stream convolutional network and a long short-term memory network for the learning task, with an attention mechanism. We evaluate the performance of the proposed method by leveraging unlabeled lip videos to predict an upcoming ultrasound tongue image sequence. The results show that our model is able to generate images close to the real ultrasound tongue images, achieving a match between the two imaging modalities.

【10】 Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jeffrey Bigham
Affiliations: Apple, Cupertino, CA, USA
Note: 5 pages, 1 page of references, 2 figures
Link: https://arxiv.org/abs/2106.11759
Abstract: Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
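Word error rate, the metric behind isWER, is the word-level edit distance (substitutions + insertions + deletions) divided by the reference length; isWER scores against the intended utterance rather than the literal dysfluent one. A generic sketch (not the authors' evaluation code), scoring a repetition against the intended phrase:

```python
def wer(ref_words, hyp_words):
    """Word error rate: edit distance between word sequences / len(ref)."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[n][m] / max(n, 1)

# A stuttered repetition ("the the") scored against the intended utterance
ref = "what is the weather".split()
hyp = "what is the the weather".split()
print(wer(ref, hyp))  # 1 insertion over 4 reference words -> 0.25
```

This illustrates why repetitions and blocks inflate insertion errors in particular, which matches the baseline error profile the paper reports.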

3. eess.AS (Audio and Speech Processing):

【1】 Deep neural network Based Low-latency Speech Separation with Asymmetric analysis-Synthesis Window Pair

Authors: Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen
Affiliations: Audio Research Group, Tampere University, Tampere, Finland
Note: Accepted to EUSIPCO 2021
Link: https://arxiv.org/abs/2106.11794
Abstract: Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the usage of an asymmetric analysis-synthesis window pair which allows for training with targets with better frequency resolution, while retaining the low latency during inference suitable for real-time speech enhancement or assisted hearing applications. In order to assess our approach across various model types and datasets, we evaluate it with both a speaker-independent deep clustering (DC) model and a speaker-dependent mask inference (MI) model. We report an improvement in separation performance of up to 1.5 dB in terms of source-to-distortion ratio (SDR) while maintaining an algorithmic latency of 8 ms.

【2】 Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism

Authors: Haiyang Liu, Jihan Zhang
Affiliations: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan; School of Mechanical Engineering, Southeast University, Nanjing, China
Note: Accepted in KDD Workshop (BIOKDD 2021)
Link: https://arxiv.org/abs/2106.11769
Abstract: Speech production is a dynamic procedure involving multiple human organs, including the tongue, jaw, and lips. Modeling the dynamics of vocal tract deformation is a fundamental problem in understanding speech, the most common mode of daily human communication. Researchers employ several sensory streams to describe the process simultaneously, streams which are incontrovertibly statistically related to one another. In this paper, we address the following question: given an observable image sequence of the lips, can we picture the corresponding tongue motion? We formulate this problem as a self-supervised learning problem and employ a two-stream convolutional network and a long short-term memory network for the learning task, with an attention mechanism. We evaluate the performance of the proposed method by leveraging unlabeled lip videos to predict an upcoming ultrasound tongue image sequence. The results show that our model is able to generate images close to the real ultrasound tongue images, achieving a match between the two imaging modalities.

【3】 Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham
Affiliation: Apple, Cupertino, CA, USA
Note: 5 pages, 1 page of references, 2 figures
Link: https://arxiv.org/abs/2106.11759
Abstract: Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice-operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and on production-oriented approaches for improving performance on common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors, resulting in intended-speech word error rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
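The insertion/substitution breakdown behind a metric like isWER comes from a standard Levenshtein alignment between the intended transcript and the hypothesis. The sketch below computes word error rate with per-operation counts; it is the textbook alignment, not Apple's internal scoring pipeline.

```python
def wer_counts(ref, hyp):
    """Word error rate plus substitution/insertion/deletion counts via
    dynamic-programming Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (edit distance, substitutions, insertions, deletions)
    dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, 0, i)                      # delete all reference words
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, j, 0)                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # exact match, no cost
            else:
                # Pick the cheapest predecessor: substitute, insert, or delete.
                dist, kind, prev = min((dp[i - 1][j - 1][0], 0, dp[i - 1][j - 1]),
                                       (dp[i][j - 1][0], 1, dp[i][j - 1]),
                                       (dp[i - 1][j][0], 2, dp[i - 1][j]))
                dp[i][j] = (dist + 1,
                            prev[1] + (kind == 0),
                            prev[2] + (kind == 1),
                            prev[3] + (kind == 2))
    dist, subs, ins, dels = dp[len(r)][len(h)]
    return dist / max(len(r), 1), subs, ins, dels

# A stutter-like repetition yields an insertion; a misrecognition a substitution.
wer, subs, ins, dels = wer_counts("what is the weather", "what is is the whether")
print(wer, subs, ins, dels)
```

Here the repeated "is" counts as one insertion and "whether" as one substitution, so WER is 2/4 = 0.5 — the kind of error profile the paper attributes to dysfluent speech at baseline.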

【4】 Glance and Gaze: A Collaborative Learning Framework for Single-channel Speech Enhancement

Authors: Andong Li, Chengshi Zheng, Lu Zhang, Xiaodong Li
Link: https://arxiv.org/abs/2106.11789
Abstract: The human capability of paying attention to both coarse and fine-grained regions has been applied to computer vision tasks. Motivated by this, we propose a collaborative learning framework in the complex domain for monaural noise suppression. The proposed system consists of two principal modules, namely a spectral feature extraction module (FEM) and stacked glance-gaze modules (GGMs). In the FEM, a UNet block is introduced after each convolution layer, enabling feature recalibration at multiple scales. In each GGM, we decompose the multi-target optimization in the complex spectrum into two sub-tasks. Specifically, the glance path aims to suppress the noise in the magnitude domain to obtain a coarse estimate, while the gaze path attempts to compensate for the lost spectral detail in the complex domain. The two paths work collaboratively and facilitate spectral estimation from complementary perspectives. Moreover, by repeatedly unfolding the GGMs, the intermediate result can be iteratively refined across stages, leading to the ultimate estimate of the spectrum. Experiments are conducted on the WSJ0-SI84, DNS-Challenge, and Voicebank+Demand datasets. Results show that the proposed approach achieves state-of-the-art performance over previous advanced systems on the WSJ0-SI84 and DNS-Challenge datasets, while competitive performance is achieved on the Voicebank+Demand corpus.
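The glance/gaze decomposition can be made concrete with oracle quantities standing in for the two learned paths: the glance path applies a magnitude-only mask while keeping the noisy phase, and the gaze path adds a complex residual that restores the lost detail. This is a toy decomposition, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))   # clean complex spectrum
N = 0.3 * (rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8)))
X = S + N                                                    # noisy mixture

# Glance path: magnitude-domain mask, noisy phase retained (coarse estimate).
mask = np.abs(S) / np.abs(X)          # oracle magnitude mask, for illustration only
coarse = mask * np.abs(X) * np.exp(1j * np.angle(X))

# Gaze path: complex-domain residual compensating the remaining detail.
residual = S - coarse                 # oracle residual, stands in for the learned path
refined = coarse + residual

print(np.allclose(np.abs(coarse), np.abs(S)), np.allclose(refined, S))
```

The coarse estimate already has the correct magnitude but the wrong (noisy) phase; only the complex-domain residual recovers the clean spectrum exactly, which is why the two paths are complementary.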

【5】 Learning to Inference with Early Exit in the Progressive Speech Enhancement

Authors: Andong Li, Chengshi Zheng, Lu Zhang, Xiaodong Li
Affiliations: Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China; Department of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China
Note: Accepted by EUSIPCO 2021
Link: https://arxiv.org/abs/2106.11730
Abstract: In real scenarios, it is often necessary and significant to control the inference speed of speech enhancement systems under different conditions. To this end, we propose a stage-wise adaptive inference approach with an early exit mechanism for progressive speech enhancement. Specifically, in each stage, once the spectral distance between adjacent stages drops below an empirically preset threshold, the inference terminates and outputs the estimate, which can effectively accelerate inference. To further improve the performance of existing speech enhancement systems, PL-CRN++ is proposed, an improved version of our preliminary work PL-CRN that combines a stage-recurrent mechanism and complex spectral mapping. Extensive experiments are conducted on the TIMIT corpus, and the results demonstrate the superiority of our system over state-of-the-art baselines in terms of PESQ, ESTOI, and DNSMOS. Moreover, by adjusting the threshold, we can easily control the inference efficiency while sustaining system performance.
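The early-exit loop itself is simple to sketch: run the stages in order and stop as soon as the distance between adjacent stage outputs falls below the preset threshold. The "stages" below are toy functions that each move the estimate halfway toward a target, not the PL-CRN++ network.

```python
def progressive_enhance(spectrum, stages, threshold=1e-3):
    """Run enhancement stages until the mean distance between adjacent
    stage outputs drops below the threshold (early exit)."""
    current = spectrum
    for n, stage in enumerate(stages, start=1):
        refined = stage(current)
        distance = sum(abs(a - b) for a, b in zip(refined, current)) / len(current)
        current = refined
        if distance < threshold:
            break                     # adjacent estimates have converged: stop
    return current, n

# Toy stages: each moves the estimate halfway toward a fixed target "spectrum".
target = [1.0, 2.0, 3.0]
stage = lambda x: [xi + 0.5 * (t - xi) for xi, t in zip(x, target)]
out, used = progressive_enhance([0.0, 0.0, 0.0], [stage] * 20, threshold=1e-3)
print(used, out)
```

With these halving stages, the inter-stage distance shrinks geometrically, so the loop exits after 11 of the 20 available stages; raising the threshold would trade accuracy for fewer stages, mirroring the paper's efficiency knob.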

【6】 Multi-accent Speech Separation with One Shot Learning

Authors: Kuan-Po Huang, Yuan-Kuei Wu, Hung-yi Lee
Affiliations: National Taiwan University, Graduate Institute of Computer Science and Information Engineering; Graduate Institute of Communication Engineering
Note: Accepted at ACL 2021 Meta Learning for NLP
Link: https://arxiv.org/abs/2106.11713
Abstract: Speech separation is a problem in the field of speech processing that has been studied intensively in recent years. However, there has not been much work on the multi-accent speech separation scenario. Unseen speakers with new accents and noise give rise to a domain mismatch problem that cannot easily be solved by conventional joint training methods. We therefore applied MAML and FOMAML to tackle this problem and obtained higher average Si-SNRi values than joint training on almost all of the unseen accents. This shows that the two methods are indeed able to generate well-trained parameters for adapting to speech mixtures of new speakers and accents. Furthermore, we found that FOMAML achieves performance similar to MAML while saving a substantial amount of time.
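The FOMAML update the paper relies on can be shown on a toy problem: tasks are 1-D quadratics f_t(w) = (w - c_t)^2 (each center standing in for an "accent"), the inner loop takes one SGD step, and the meta-update uses the gradient at the adapted parameters, dropping the second-order term that full MAML would keep. Everything here is an assumed toy setup, not the separation model.

```python
import random

random.seed(0)

def fomaml_train(centers, w=0.0, inner_lr=0.1, meta_lr=0.05, steps=500):
    """First-order MAML on toy tasks f_t(w) = (w - c_t)^2: one inner SGD
    step per sampled task, then a meta-update from the gradient at the
    adapted parameters (the first-order approximation)."""
    for _ in range(steps):
        c = random.choice(centers)           # sample a task (an "accent")
        grad = 2 * (w - c)                   # inner-loop gradient at w
        w_adapted = w - inner_lr * grad      # task-specific adaptation
        meta_grad = 2 * (w_adapted - c)      # gradient at the adapted params
        w -= meta_lr * meta_grad             # FOMAML meta-update
    return w

w_meta = fomaml_train(centers=[-1.0, 0.0, 2.0])
print(round(w_meta, 3))
```

The meta-parameters settle near the task centers, so a single inner step from w_meta gets close to any individual task — the "fast adaptation to unseen accents" property the abstract measures with Si-SNRi.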

【7】 Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw

Authors: Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski
Note: Published in Interspeech 2021
Link: https://arxiv.org/abs/2106.11603
Abstract: We present a number of low-resource approaches to the tasks of the Zero Resource Speech Challenge 2021. We build on the unsupervised representations of speech proposed by the organizers as a baseline, derived from CPC and clustered with the k-means algorithm. We demonstrate that simple methods of refining those representations can narrow the gap, or even improve upon solutions that use a high computational budget. The results lead to the conclusion that the CPC-derived representations are still too noisy for training language models, but stable enough for simpler forms of pattern matching and retrieval.
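The clustering step the baseline uses — quantizing continuous frame representations into discrete units with k-means — can be sketched with plain Lloyd's iterations on toy "CPC features". The deterministic initialization (one frame per blob) is a simplification; real pipelines typically use k-means++.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(frames, init_idx, iters=20):
    """Plain Lloyd's k-means, standing in for clustering CPC frame
    representations into discrete pseudo-phone units."""
    centroids = frames[init_idx].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid ...
        d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # ... then move each centroid to the mean of its assigned frames.
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = frames[labels == j].mean(axis=0)
    return labels, centroids

# Toy "CPC features": three well-separated blobs of 50 frames each.
frames = np.concatenate(
    [rng.normal(loc=m, scale=0.1, size=(50, 8)) for m in (-3.0, 0.0, 3.0)])
units, centroids = kmeans(frames, init_idx=[0, 50, 100])
print(sorted(set(units.tolist())))
```

Each frame is replaced by its cluster index, turning the continuous representation into a discrete unit sequence on which pattern matching and retrieval can operate.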

【8】 Key-Sparse Transformer with Cascaded Cross-Attention Block for Multimodal Speech Emotion Recognition

Authors: Weidong Chen, Xiaofeng Xing, Xiangmin Xu, Jichen Yang, Jianxin Pang
Affiliations: School of Electronic and Information Engineering, South China University of Technology, China; UBTECH Robotics Corp, China
Link: https://arxiv.org/abs/2106.11532
Abstract: Speech emotion recognition is a challenging and important research topic that plays a critical role in human-computer interaction. Multimodal inputs can improve performance, as more emotional information is used for recognition. However, existing studies learn from all the information in a sample even though only a small portion of it concerns emotion. Moreover, under the multimodal framework, the interaction between different modalities is shallow and insufficient. In this paper, a key-sparse Transformer is proposed for efficient SER by focusing only on emotion-related information. Furthermore, a cascaded cross-attention block, specially designed for the multimodal framework, is introduced to achieve deep interaction between different modalities. The proposed method is evaluated on the IEMOCAP corpus, and the experimental results show that it outperforms state-of-the-art approaches.
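One common way to make attention "key-sparse" — keep only the top-k scoring keys per query and mask the rest before the softmax — can be sketched as follows. This is a generic top-k sparsification for illustration; the paper's exact key-selection rule may differ.

```python
import numpy as np

rng = np.random.default_rng(3)

def topk_sparse_attention(q, K, V, k=4):
    """Attention that keeps only the top-k scoring keys for the query and
    masks the rest to -inf before the softmax, so the output attends to a
    small subset of positions (e.g. the emotion-relevant ones)."""
    scores = K @ q / np.sqrt(q.shape[-1])        # (T,)
    keep = np.argsort(scores)[-k:]               # indices of the k largest scores
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]
    weights = np.exp(masked - scores[keep].max())  # exp(-inf) = 0 for masked keys
    weights /= weights.sum()
    return weights @ V, weights

T, D = 10, 8
q = rng.normal(size=D)
K = rng.normal(size=(T, D))
V = rng.normal(size=(T, D))
out, w = topk_sparse_attention(q, K, V, k=4)
print(out.shape, int((w > 0).sum()))
```

Exactly k attention weights are nonzero, so the query spends its entire probability mass on the few positions deemed relevant rather than spreading it over the whole sequence.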

【9】 Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Authors: Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren
Affiliations: Ghent University, Belgium; ByteDance AI Lab, China
Note: Accepted by INTERSPEECH 2021
Link: https://arxiv.org/abs/2106.11411
Abstract: Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing such diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate the information of the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, which are fused by attention based on semantic similarity so as to shape the acoustic representations through the probability of anchor vocalization. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and successful.
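A crude stand-in for similarity-based fusion: weight the audio representation by its cosine similarity to the image representation before concatenating the two. The squashing function and the concatenation layout are assumptions made for illustration, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(5)

def similarity_fused(audio_emb, image_emb):
    """Fuse two modality embeddings, scaling the audio branch by a
    similarity-derived weight (a toy version of semantic-similarity
    attention)."""
    cos = float(audio_emb @ image_emb /
                (np.linalg.norm(audio_emb) * np.linalg.norm(image_emb)))
    weight = 1.0 / (1.0 + np.exp(-cos))   # squash similarity into (0, 1)
    fused = np.concatenate([weight * audio_emb, image_emb])
    return fused, weight

audio_emb = rng.normal(size=32)   # hypothetical audio-branch embedding
image_emb = rng.normal(size=32)   # hypothetical image-branch embedding
fused, weight = similarity_fused(audio_emb, image_emb)
print(fused.shape, 0.0 < weight < 1.0)
```

When the two modalities agree semantically (high cosine similarity), the audio branch is passed through nearly unchanged; when they disagree, it is attenuated — the intuition behind shaping acoustic representations with cross-modal evidence.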

【10】 Do sound event representations generalize to other audio tasks? A case study in audio transfer learning

Authors: Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen
Affiliations: Facebook Reality Labs Research; Facebook Applied AI Research
Note: Accepted at Interspeech 2021
Link: https://arxiv.org/abs/2106.11335
Abstract: Transfer learning is critical for efficient information transfer across multiple related learning problems. A simple yet effective transfer learning approach utilizes deep neural networks trained on a large-scale task for feature extraction. Such representations are then used to learn related downstream tasks. In this paper, we investigate the transfer learning capacity of audio representations obtained from neural networks trained on a large-scale sound event detection dataset. We build and evaluate these representations across a wide range of other audio tasks, via a simple linear classifier transfer mechanism. We show that such simple linear transfer is already powerful enough to achieve high performance on the downstream tasks. We also provide insights into the attributes of sound event representations that enable such efficient information transfer.
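The "linear classifier transfer mechanism" amounts to freezing the pretrained network and training only a linear probe on its embeddings. The sketch below fakes the frozen network with a fixed random projection and fits a logistic-regression probe by gradient descent; the data and "pretrained" weights are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
W_frozen = rng.normal(size=(6, 16)) / np.sqrt(6)  # stands in for pretrained weights

def embed(x):
    """Stand-in for the frozen sound-event network's feature extractor."""
    return np.tanh(x @ W_frozen)

def train_linear_probe(X, y, lr=0.2, steps=500):
    """Logistic-regression probe on frozen embeddings: the only trained
    parameters are one weight vector and a bias."""
    Z = embed(X)
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))    # sigmoid predictions
        w -= lr * (Z.T @ (p - y)) / len(y)
        b -= lr * float((p - y).mean())
    return w, b

# Toy downstream task: two classes of "audio clips" with shifted features.
X = np.concatenate([rng.normal(-1.0, 0.5, size=(40, 6)),
                    rng.normal(1.0, 0.5, size=(40, 6))])
y = np.concatenate([np.zeros(40), np.ones(40)])
w, b = train_linear_probe(X, y)
probs = 1.0 / (1.0 + np.exp(-(embed(X) @ w + b)))
acc = float(((probs > 0.5) == (y > 0.5)).mean())
print(acc)
```

Because the embedding already separates the classes, the tiny linear head is sufficient — the same observation the paper makes about sound-event representations transferring to downstream audio tasks.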

This article is part of the Tencent Cloud Self-Media Sharing Plan, shared from a WeChat public account.
Originally published 2021-06-23. For copyright concerns, contact cloudcommunity@tencent.com.

This article is shared from the WeChat public account arXiv每日学术速递.

