
Finance / Speech / Audio Processing Academic Digest [12.21]

Author: WeChat official account arXiv每日学术速递 (arXiv Daily Academic Digest)
Published: 2021-12-24 08:53:02
Included in the column: arXiv每日学术速递

q-fin (Finance): 11 papers in total

cs.SD (Speech): 6 papers in total

eess.AS (Audio Processing): 7 papers in total

1. q-fin (Finance):

【1】 Rainbow Options under Bayesian MS-VAR Process Link: https://arxiv.org/abs/2112.10447

Authors: Battulga Gankhuu Affiliations: Department of Applied Mathematics, National University of Mongolia Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:2111.04038, arXiv:2109.05998 Abstract: This paper presents pricing and hedging methods for rainbow options and lookback options under a Bayesian Markov-Switching Vector Autoregressive (MS-VAR) process. Here we assume that the regime-switching process is generated by a homogeneous Markov process. An advantage of our model is that it depends on economic variables and is simple compared with existing papers.

【2】 Option Pricing Model with Transaction Costs Link: https://arxiv.org/abs/2112.10209

Authors: F. G. Bellora, G. Mazzei, M. Maurette Affiliations: Carnegie Mellon University; Université Paris-Saclay Abstract: The author presents alternatives to the Black-Scholes European call option pricing model by incorporating different transaction cost structures in the replicating strategy. In particular, an exponentially decreasing structure is proposed and developed.
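As a reference point for the entry above, the frictionless Black-Scholes European call price that the paper modifies can be sketched as follows. This is only the textbook baseline with no transaction costs; the function and parameter names are illustrative assumptions and are not taken from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot: float, strike: float, rate: float,
                  vol: float, maturity: float) -> float:
    """Textbook Black-Scholes price of a European call (no transaction costs)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    # Hypothetical inputs: at-the-money call, 5% rate, 20% volatility, 1 year.
    print(round(bs_call_price(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

A transaction-cost model such as the one described would adjust the replicating strategy behind this price; the exponentially decreasing cost structure itself is not reproduced here.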

【3】 Neural Networks for Delta Hedging Link: https://arxiv.org/abs/2112.10084

Authors: Guijin Son, Joocheol Kim Affiliations: Yonsei University, Seoul, South Korea Abstract: The Black-Scholes model, defined under the assumption of a perfect financial market, theoretically creates a flawless hedging strategy allowing the trader to evade risks in a portfolio of options. However, the concept of a "perfect financial market," which requires zero transaction costs and continuous trading, is challenging to meet in the real world. Despite such widely known limitations, academics have failed to develop alternative models successful enough to be long-established. In this paper, we explore the landscape of Deep Neural Network (DNN) based hedging systems by testing the hedging capacity of the following neural architectures: Recurrent Neural Networks, Temporal Convolutional Networks, Attention Networks, and Span Multi-Layer Perceptron Networks. In addition, we attempt to achieve even more promising results by combining traditional derivative hedging models with DNN based approaches. Lastly, we construct NNHedge, a deep learning framework that provides seamless pipelines for model development and assessment for the experiments.
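The point about continuous trading can be illustrated with a small simulation: a Black-Scholes delta hedge that is rebalanced only a finite number of times leaves a residual hedging error, which shrinks as rebalancing becomes more frequent. The sketch below is a plain Monte Carlo under assumed parameters (zero interest rate, 20% volatility, no transaction costs); it is not the NNHedge framework, and all names are hypothetical.

```python
import math
import random


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1(spot, strike, rate, vol, tau):
    return (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * tau) / (vol * math.sqrt(tau))


def bs_call(spot, strike, rate, vol, tau):
    d1 = _d1(spot, strike, rate, vol, tau)
    d2 = d1 - vol * math.sqrt(tau)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * tau) * norm_cdf(d2)


def bs_delta(spot, strike, rate, vol, tau):
    return norm_cdf(_d1(spot, strike, rate, vol, tau))


def hedging_errors(n_steps, n_paths=400, spot0=100.0, strike=100.0,
                   rate=0.0, vol=0.2, maturity=1.0, seed=0):
    """Terminal P&L of a short call hedged with n_steps delta rebalances."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    errors = []
    for _ in range(n_paths):
        spot = spot0
        delta = bs_delta(spot, strike, rate, vol, maturity)
        # Receive the premium, buy delta shares, keep the rest in cash.
        cash = bs_call(spot, strike, rate, vol, maturity) - delta * spot
        for step in range(1, n_steps + 1):
            z = rng.gauss(0.0, 1.0)
            spot *= math.exp((rate - 0.5 * vol ** 2) * dt + vol * math.sqrt(dt) * z)
            cash *= math.exp(rate * dt)
            if step < n_steps:  # rebalance at every date except maturity
                new_delta = bs_delta(spot, strike, rate, vol, maturity - step * dt)
                cash -= (new_delta - delta) * spot
                delta = new_delta
        errors.append(cash + delta * spot - max(spot - strike, 0.0))
    return errors


def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


if __name__ == "__main__":
    for n in (5, 50):
        print(n, "rebalances -> hedging error std:", round(std(hedging_errors(n)), 3))
```

Adding transaction costs to each rebalance would penalise frequent trading, which is the trade-off a learned hedging policy has to balance.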

【4】 Mean-Covariance Robust Risk Measurement Link: https://arxiv.org/abs/2112.09959

Authors: Viet Anh Nguyen, Soroosh Shafieezadeh Abadeh, Damir Filipović, Daniel Kuhn Affiliations: Stanford University and VinAI Research; Tepper School of Business, CMU; EPFL and Swiss Finance Institute; Risk Analytics and Optimization Chair, EPFL Abstract: We introduce a universal framework for mean-covariance robust risk measurement and portfolio optimization. We model uncertainty in terms of the Gelbrich distance on the mean-covariance space, along with prior structural information about the population distribution. Our approach is related to the theory of optimal transport and exhibits superior statistical and computational properties to existing models. We find that, for a large class of risk measures, mean-covariance robust portfolio optimization boils down to the Markowitz model, subject to a regularization term given in closed form. This includes the finance standards, value-at-risk and conditional value-at-risk, and can be solved highly efficiently.
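For readers unfamiliar with the Gelbrich distance named in this abstract, its standard definition between two mean-covariance pairs is written out below. The notation ($\mu_i$, $\Sigma_i$, $\mathbb{G}$) is assumed here for illustration and may differ from the paper's.

```latex
\mathbb{G}\big((\mu_1,\Sigma_1),(\mu_2,\Sigma_2)\big)
  = \sqrt{\,\lVert \mu_1-\mu_2\rVert_2^{2}
  + \operatorname{Tr}\!\Big[\Sigma_1+\Sigma_2
  - 2\big(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\big)^{1/2}\Big]}
```

When both distributions are Gaussian, this quantity coincides with their 2-Wasserstein distance, which is the link to optimal transport mentioned in the abstract.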

【5】 Paternalism, Autonomy, or Both? Experimental Evidence from Energy Saving Programs Link: https://arxiv.org/abs/2112.09850

Authors: Takanori Ida, Takunori Ishihara, Koichiro Ito, Daido Kido, Toru Kitagawa, Shosei Sakaguchi, Shusaku Sasaki Affiliations: Kyoto University of Advanced Science, University of Chicago and NBER, Brown University, University College London, Tohoku Gakuin University Comments: 46 pages, 8 figures Abstract: Identifying who should be treated is a central question in economics. There are two competing approaches to targeting: paternalistic and autonomous. In the paternalistic approach, policymakers optimally target the policy given observable individual characteristics. In contrast, the autonomous approach acknowledges that individuals may possess key unobservable information on heterogeneous policy impacts, and allows them to self-select into treatment. In this paper, we propose a new approach that mixes paternalistic assignment and autonomous choice. Our approach uses individual characteristics and empirical welfare maximization to identify who should be treated, who should be untreated, and who should decide for themselves whether to be treated. We apply this method to design a targeting policy for an energy-saving program using data collected in a randomized field experiment. We show that optimally mixing paternalistic assignments and autonomous choice significantly improves the social welfare gain of the policy. Exploiting random variation generated by the field experiment, we develop a method to estimate average treatment effects for each subgroup of individuals who would make the same autonomous treatment choice. Our estimates confirm that the estimated assignment policy optimally allocates individuals to be treated, untreated, or to choose for themselves, based on the relative merits of paternalistic assignments and autonomous choice for individual types.

【6】 Potential utilization of Battery Energy Storage Systems (BESS) in the major European electricity markets Link: https://arxiv.org/abs/2112.09816

Authors: Yu Hu, Miguel Armada, Maria Jesus Sanchez Affiliations: Simulyde S.L., Madrid, Spain; Escuela Técnica Superior Ingenieros Industriales, Universidad Politecnica de Madrid, Madrid, Spain Abstract: Given the declining cost of battery technology in the last decade, BESS has become a more attractive solution in electrical power systems. The objective of this work is to analyze the potential utilization of BESS in the major European electricity markets. A general payoff model for BESS operation is proposed to correctly address the operational flexibility of battery systems. Utilization factors such as potentially profitable utilization time and rate are calculated for common applications including energy arbitrage and frequency support services using real market information. The result shows that under the current empirical estimation of the battery cost and lifetime, BESS is not feasible for energy arbitrage in most of the European electricity markets. However, BESS shows clearly and significantly higher potential in providing frequency support services. The result suggests that, when the frequency containment reserve is remunerable, the potentially profitable utilization of BESS has already become accretive in most of the European countries. For example, from January to September 2021, the potentially profitable utilization rate reached almost 100% for the FCR-N service in the Danish market. Comparing the regional electricity markets in Europe, BESS has shown significant potential in becoming a feasible solution in Central Western Europe and parts of Northern Europe by providing frequency regulation services. Meanwhile, in the British Isles and some other islanded local markets, a remarkable level of scarcity of flexibility has been revealed by the investigation, and the potential of BESS would also be considerably encouraging.

【7】 Dollar Cost Averaging Returns Estimation Link: https://arxiv.org/abs/2112.09807

Authors: Hayden Brown Affiliations: Department of Mathematics and Statistics, University of Nevada Comments: 23 pages, 12 figures Abstract: Given a geometric Brownian motion wealth process, a log-Normal lower bound is constructed for the returns of a regular investing schedule. The distribution parameters of this bound are computed recursively. For dollar cost averaging (equal amounts in equal time intervals), parameters are computed in closed form. A lump sum (single amount at time 0) investing schedule is described which achieves a terminal wealth distribution that matches the wealth distribution indicated by the lower bound. Results are applied to annual returns of the S&P Composite Index from the last 150 years. Among data analysis results, the probability of negative returns is less than 2.5% when annual dollar cost averaging lasts over 40 years.
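The setting of this entry (equal annual contributions into an asset following geometric Brownian motion) can be explored with a brute-force Monte Carlo. The sketch below does not implement the paper's log-Normal lower bound; it only simulates the quantity being bounded. The drift and volatility values are assumptions made up for illustration, not estimates from the S&P data.

```python
import math
import random


def dca_terminal_wealth(gross_returns):
    """Wealth after investing 1 unit at the start of each period."""
    wealth = 0.0
    for gross in gross_returns:
        wealth = (wealth + 1.0) * gross
    return wealth


def prob_negative_return(years, mu, sigma, n_paths=2000, seed=0):
    """Fraction of simulated GBM paths where terminal wealth < total invested."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(n_paths):
        # Annual gross returns of a GBM with drift mu and volatility sigma.
        gross = [
            math.exp((mu - 0.5 * sigma ** 2) + sigma * rng.gauss(0.0, 1.0))
            for _ in range(years)
        ]
        if dca_terminal_wealth(gross) < years:
            losses += 1
    return losses / n_paths


if __name__ == "__main__":
    # Hypothetical parameters: 6% drift, 18% volatility.
    for horizon in (5, 20, 40):
        print(horizon, "years:", prob_negative_return(horizon, 0.06, 0.18))
```

Because later contributions are exposed to the market for less time, the terminal wealth is a sum of correlated log-Normal terms, which is why a closed-form bound is useful.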

【8】 More Reviews May Not Help: Evidence from Incentivized First Reviews on Airbnb Link: https://arxiv.org/abs/2112.09783

Authors: Andrey Fradkin, David Holtz Abstract: Online reviews are typically written by volunteers and, as a consequence, information about seller quality may be under-provided in digital marketplaces. We study the extent of this under-provision in a large-scale randomized experiment conducted by Airbnb. In this experiment, buyers are offered a coupon to review listings that have no prior reviews. The treatment induces additional reviews and these reviews tend to be more negative than reviews in the control group, consistent with selection bias in reviewing. Reviews induced by the treatment result in a temporary increase in transactions but these transactions are for fewer nights, on average. The effects on transactions and nights per transaction cancel out so that there is no detectable effect on total nights sold and revenue. Measures of transaction quality in the treatment group fall, suggesting that incentivized reviews do not improve matching. We show how market conditions and the design of the reputation system can explain our findings.

【9】 Rational expectations as a tool for predicting failure of weighted k-out-of-n reliability systems Link: https://arxiv.org/abs/2112.10672

Authors: Jorgen Vitting Andersen, Roy Cerqueti, Jessica Riccioni Affiliations: Sapienza University of Rome, Department of Social and Economic Sciences, Rome, Italy; London South Bank University, School of Business, London, UK Comments: 28 pages, 7 figures Abstract: Here we introduce the idea of using rational expectations, a core concept in economics and finance, as a tool to predict the optimal failure time for a wide class of weighted k-out-of-n reliability systems. We illustrate the concept by applying it to systems which have components with heterogeneous failure times. Depending on the heterogeneous distributions of component failure, we find different measures to be optimal for predicting the failure time of the total system. We give examples of how, as a given system deteriorates over time, one can issue different optimal predictions of system failure by choosing among a set of time-dependent measures.
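To make the term concrete: under the usual textbook definition, a weighted k-out-of-n system keeps working as long as the total weight of its working components is at least k. The sketch below computes the system failure time from known component failure times under that definition. Whether the paper uses exactly this convention is an assumption, and the function name is hypothetical; the prediction method based on rational expectations is not reproduced.

```python
def system_failure_time(weights, failure_times, k):
    """Time at which the total weight of working components first drops below k."""
    if len(weights) != len(failure_times):
        raise ValueError("weights and failure_times must have the same length")
    working_weight = sum(weights)
    if working_weight < k:
        return 0.0  # the system is already failed at time zero
    # Process component failures in chronological order.
    for time, weight in sorted(zip(failure_times, weights)):
        working_weight -= weight
        if working_weight < k:
            return time
    return float("inf")  # only reachable when k <= 0


if __name__ == "__main__":
    # Three components with weights 3, 2, 1 failing at times 5, 1, 3.
    print(system_failure_time([3, 2, 1], [5.0, 1.0, 3.0], k=4))
```

In the paper's setting the component failure times are random, so the system failure time is itself a random variable to be predicted.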

【10】 Nonzero-sum stochastic impulse games with an application in competitive retail energy markets Link: https://arxiv.org/abs/2112.10213

Authors: René Aïd, Lamia Ben Ajmia, M'hamed Gaïgi, Mohamed Mnif Abstract: We study a nonzero-sum stochastic differential game with both players adopting impulse controls, on a finite time horizon. The objective of each player is to maximize her total expected discounted profits. The resolution methodology relies on the connection between Nash equilibrium and the corresponding system of quasi-variational inequalities (QVIs in short). We prove, by means of the weak dynamic programming principle for the stochastic differential game, that the value function of each player is a constrained viscosity solution to the associated QVIs system in the class of linear growth functions. We also introduce a family of value functions converging to the value function of each player, which is characterized as the unique constrained viscosity solution of an approximation of our QVIs system. This convergence result is useful for numerical purposes. We apply a probabilistic numerical scheme which approximates the solution of the QVIs system to the case of the competition between two electricity retailers. We show how our model reproduces the qualitative behaviour of electricity retail competition.

【11】 Denoised Labels for Financial Time-Series Data via Self-Supervised Learning Link: https://arxiv.org/abs/2112.10139

Authors: Yanqing Ma, Carmine Ventre, Maria Polukarov Affiliations: King's College London, London, United Kingdom Abstract: The introduction of electronic trading platforms effectively changed the organisation of traditional systemic trading from quote-driven markets into order-driven markets. Its convenience led to an exponentially increasing amount of financial data, which is however hard to use for the prediction of future prices, due to the low signal-to-noise ratio and the non-stationarity of financial time series. Simpler classification tasks, where the goal is to predict the directions of future price movement via supervised learning algorithms, need sufficiently reliable labels to generalise well. Labelling financial data is however less well defined than in other domains: did the price go up because of noise or because of signal? The existing labelling methods have limited countermeasures against noise and limited effects in improving learning algorithms. This work takes inspiration from image classification in trading and success in self-supervised learning. We investigate the idea of applying computer vision techniques to financial time-series to reduce the noise exposure and hence generate correct labels. We look at the label generation as the pretext task of a self-supervised learning approach and compare the naive (and noisy) labels, commonly used in the literature, with the labels generated by a denoising autoencoder for the same downstream classification task. Our results show that our denoised labels improve the performance of the downstream learning algorithm, for both small and large datasets. We further show that the signals we obtain can be used to effectively trade with binary strategies. We suggest that with the proposed techniques, self-supervised learning constitutes a powerful framework for generating "better" financial labels that are useful for studying the underlying patterns of the market.
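The "naive (and noisy) labels" that the abstract uses as its baseline are commonly built from the sign of the future price change. A minimal sketch of that baseline follows; the fixed-horizon rule and the function name are assumptions for illustration, and the denoising autoencoder from the paper is not reproduced.

```python
def naive_direction_labels(prices, horizon=1):
    """Label 1 if the price is strictly higher `horizon` steps ahead, else 0."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [
        1 if prices[t + horizon] > prices[t] else 0
        for t in range(len(prices) - horizon)
    ]


if __name__ == "__main__":
    # Made-up price series for illustration only.
    prices = [100.0, 101.0, 100.5, 100.5, 102.0]
    print(naive_direction_labels(prices))
    print(naive_direction_labels(prices, horizon=2))
```

A tiny fluctuation flips such a label just as easily as a genuine move does, which is the noise problem the entry sets out to reduce.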

2. cs.SD (Speech):

【1】 Context-Based Music Recommendation Algorithm Evaluation Link: https://arxiv.org/abs/2112.10612

Authors: Marissa Baxter, Lisa Ha, Kirill Perfiliev, Natalie Sayre Affiliations: Professor at the University of Technology in Australia Abstract: Artificial Intelligence (AI) has been very successful in creating and predicting music playlists for online users based on their data; data received from users' experience using the app, such as searching for the songs they like. There are lots of current technological advancements in AI due to the competition between music platform owners such as Spotify, Pandora, and more. In this paper, 6 machine learning algorithms and their individual accuracy for predicting whether a user will like a song are explored across 3 different platforms including Weka, SKLearn, and Orange. The algorithms explored include Logistic Regression, Naive Bayes, Sequential Minimal Optimization (SMO), Multilayer Perceptron (Neural Network), Nearest Neighbor, and Random Forest. With the analysis of the specific characteristics of each song provided by the Spotify API [1], Random Forest is the most successful algorithm for predicting whether a user will like a song, with an accuracy of 84%. This is higher than the accuracy of 82.72% found by Mungekar using the Random Forest technique and slightly different characteristics of a song [2]. The characteristics in Mungekar's Random Forest algorithm focus more on the artist and popularity rather than the sonic features of the songs. Removing the popularity aspect and focusing purely on the sonic qualities improves the accuracy of recommendations. Finally, this paper shows how song prediction can be accomplished without any monetary investments, and thus inspires an idea of what amazing results can be accomplished with full financial research.

【2】 Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching Link: https://arxiv.org/abs/2112.10202

Authors: Chia-Yu Li, Ngoc Thang Vu Affiliations: Institute for Natural Language Processing (IMS), University of Stuttgart, Stuttgart, Germany Comments: The 2019 International Conference on Asian Language Processing (IALP) Abstract: Code-Switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech. We analyse different CS specific issues such as the property mismatches between languages in a CS language pair, the unpredictable nature of switching points, and the data scarcity problem. We exploit and improve the state-of-the-art end-to-end system by merging nonlinguistic symbols, by integrating language identification using hierarchical softmax, by modeling sub-word units, by artificially lowering the speaking rate, and by augmenting data using the speed perturbation technique and several monolingual datasets, to improve the final performance not only on CS speech but also on monolingual benchmarks in order to make the system more applicable in real-life settings. Finally, we explore the effect of different language model integration methods on the performance of the proposed model. Our experimental results reveal that all the proposed techniques improve the recognition performance. The best combined system improves the baseline system by up to 35% relative in terms of mixed error rate and delivers acceptable performance on monolingual benchmarks.

【3】 Detect what you want: Target Sound Detection Link: https://arxiv.org/abs/2112.10153

Authors: Dongchao Yang, Helin Wang, Yuexian Zou, Chao Weng Affiliations: ADSPLAB, School of ECE, Peking University, Shenzhen, China; Tencent AI Lab, Shenzhen, China Comments: Submitted to ICASSP2022 Abstract: Human beings can perceive a target sound that we are interested in from a multi-source environment by selective auditory attention; however, such functionality has hardly ever been explored in machine hearing. This paper addresses target sound detection (TSD), which aims to detect the target sound signal from a mixture audio when a target sound's reference audio is given. We present a novel target sound detection network (TSDNet) which consists of two main parts: a conditional network and a detection network. The former aims at generating a sound-discriminative conditional embedding vector representing the global information of the target sound. The latter takes both the mixture audio and the conditional embedding vector as inputs, and produces the detection result. These two networks can be jointly optimized with a multi-task learning approach to further improve the performance. In addition, we study both supervised and weakly supervised strategies to train TSDNet. To evaluate our methods, we build a target sound detection dataset (TSD Dataset) based on the URBAN-SED and URBAN-SOUND8K datasets. Experimental results indicate our system can get better performance than universal sound event detection.

【4】 Soundify: Matching Sound Effects to Video Link: https://arxiv.org/abs/2112.09726

Authors: David Chuan-En Lin, Anastasis Germanidis, Cristóbal Valenzuela, Yining Shi, Nikolas Martelaro Affiliations: Carnegie Mellon University, Runway Comments: NeurIPS 2021 Workshop on Machine Learning for Creativity and Design Abstract: In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space. However, through formative interviews with professional video editors, we found that this process can be extremely tedious and time-consuming. We introduce Soundify, a system that matches sound effects to video. By leveraging labeled, studio-quality sound effects libraries and extending CLIP, a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to the results at https://chuanenlin.com/soundify.

【5】 Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus Link: https://arxiv.org/abs/2112.10358

Authors: Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao Affiliations: Zhejiang University Comments: Accepted by ACM Multimedia 2021 Abstract: High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the singing voice data shortage, limited singer generalization, and large computational cost. Existing open corpora could not meet requirements for high-fidelity singing voice synthesis because of their scale and quality weaknesses. Previous vocoders have difficulty in multi-singer modeling, and a distinct degradation emerges when conducting unseen singer singing voice generation. To accelerate singing voice research in the community, we release a large-scale, multi-singer Chinese singing voice dataset, OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer, a fast multi-singer vocoder with generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both training and inference procedures. 2) To capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram), Multi-Singer adopts a singer conditional discriminator and a conditional adversarial training objective. 3) To supervise the reconstruction of singer identity in the spectrum envelopes in the frequency domain, we propose an auxiliary singer perceptual loss. The joint training approach works effectively in GANs for multi-singer voice modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves unseen singer singing voice modeling in both speed and quality over previous methods. A further experiment shows that, combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline. Samples are available at https://Multi-Singer.github.io/

【6】 Multi-turn RNN-T for streaming recognition of multi-party speech Link: https://arxiv.org/abs/2112.10200

Authors: Ilya Sklyar, Anna Piunova, Xianrui Zheng, Yulan Liu Affiliations: Amazon Alexa, University of Cambridge Comments: Submitted to ICASSP2022 Abstract: Automatic speech recognition (ASR) of single channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability due to their dependency on full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximum number of speakers seen during training on MT-RNN-T performance on the LibriCSS test set, and report 28% relative WER improvement over the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.
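The two metrics quoted in this entry, word error rate (WER) and relative WER improvement, can be computed as sketched below. This is the generic definition (word-level edit distance divided by reference length); the example sentences and WER figures are made up for illustration and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the cat sat down"))
    # Hypothetical WERs: 10.0% baseline versus 7.2% for a new system.
    print(relative_improvement(10.0, 7.2))
```

A "28% relative improvement" therefore means the new WER is 72% of the baseline WER, not that the WER dropped by 28 percentage points.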

3. eess.AS (Audio Processing):

【1】 Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus Link: https://arxiv.org/abs/2112.10358

Authors: Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao Affiliations: Zhejiang University Comments: Accepted by ACM Multimedia 2021 Abstract: Cross-listed paper; the abstract is identical to cs.SD entry 【5】 above.

【2】 Multi-turn RNN-T for streaming recognition of multi-party speech Link: https://arxiv.org/abs/2112.10200

Authors: Ilya Sklyar, Anna Piunova, Xianrui Zheng, Yulan Liu Affiliations: Amazon Alexa; University of Cambridge Comments: Submitted to ICASSP 2022
Abstract: Automatic speech recognition (ASR) of single-channel far-field recordings with an unknown number of speakers is traditionally tackled by cascaded modules. Recent research shows that end-to-end (E2E) multi-speaker ASR models can achieve superior recognition accuracy compared to modular systems. However, these models do not ensure real-time applicability because they depend on the full audio context. This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T). First, we introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set. Second, we propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture. We investigate the impact of the maximum number of speakers seen during training on MT-RNN-T performance on the LibriCSS test set, and report a 28% relative WER improvement over the two-speaker MS-RNN-T. Third, we experiment with a rich transcription strategy for joint recognition and segmentation of multi-party speech. Through an in-depth analysis, we discuss potential pitfalls of the proposed system as well as promising future research directions.
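The relative WER improvements quoted in this abstract (14% and 28%) are ratios against the baseline error, not absolute percentage-point differences. A minimal Python sketch of that arithmetic follows. The function name and the baseline and system WER values are hypothetical, chosen only for illustration, since the abstract does not report absolute WERs.

```python
def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative WER improvement: the fraction of the baseline error that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical values, not taken from the paper.
baseline = 10.0  # baseline WER in percent
system = 8.6     # improved system WER in percent
print(f"relative improvement: {relative_improvement(baseline, system):.0%}")
```

With these made-up numbers, an absolute drop of 1.4 points corresponds to a 14% relative improvement, which is why relative figures can look large even when the absolute change is small.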

【3】 Noisy Speech Based Temporal Decomposition to Improve Fundamental Frequency Estimation Link: https://arxiv.org/abs/2112.09896

Authors: A. Queiroz, R. Coelho Affiliation: Laboratory of Acoustic Signal Processing, Military Institute of Engineering (IME) Comments: 9 pages
Abstract: This paper introduces a novel method that separates noisy speech frames into low-frequency or high-frequency frames in order to improve fundamental frequency (F0) estimation accuracy. In this proposal, the target signal is analyzed by means of ensemble empirical mode decomposition. Next, pitch information is extracted from the first decomposition modes. This feature indicates the frequency region where the F0 of the speech should be located, thus separating the frames into low-frequency (LF) or high-frequency (HF). The separation is applied to correct candidates extracted by a conventional fundamental frequency detection method, thereby improving the accuracy of the F0 estimate. The proposed method is evaluated in experiments with the CSTR and TIMIT databases, considering six acoustic noises under various signal-to-noise ratios. A pitch enhancement algorithm is adopted as the baseline in the evaluation, considering three conventional estimators. Results show that the proposed method outperforms the competing strategies in terms of low/high-frequency separation accuracy. Moreover, the performance metrics of the F0 estimation techniques show that the novel solution improves F0 detection accuracy more than competing approaches under different noisy conditions.

【4】 Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching Link: https://arxiv.org/abs/2112.10202

Authors: Chia-Yu Li, Ngoc Thang Vu Affiliation: Institute for Natural Language Processing (IMS), University of Stuttgart, Stuttgart, Germany Comments: The 2019 International Conference on Asian Language Processing (IALP)
Abstract: Code-switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech. We analyse different CS-specific issues such as the property mismatches between the languages in a CS language pair, the unpredictable nature of switching points, and the data scarcity problem. We exploit and improve the state-of-the-art end-to-end system by merging nonlinguistic symbols, by integrating language identification using hierarchical softmax, by modeling sub-word units, by artificially lowering the speaking rate, and by augmenting data using speed perturbation and several monolingual datasets. These techniques aim to improve the final performance not only on CS speech but also on monolingual benchmarks, in order to make the system more applicable to real-life settings. Finally, we explore the effect of different language model integration methods on the performance of the proposed model. Our experimental results reveal that all the proposed techniques improve the recognition performance. The best combined system improves on the baseline system by up to 35% relative in terms of mixed error rate and delivers acceptable performance on monolingual benchmarks.
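The mixed error rate mentioned in this abstract is commonly computed by scoring Mandarin at the character level and English at the word level. A minimal Python sketch under that assumption follows. The tokenizer, the function names, and the example sentences are illustrative and are not taken from the paper.

```python
import re
from typing import List

# CJK Unified Ideographs are scored one character per token; Latin-script
# runs (with optional apostrophes) are scored as whole words.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mixed_tokens(text: str) -> List[str]:
    """Split a code-switched sentence into Mandarin characters and English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the number of reference tokens."""
    ref, hyp = mixed_tokens(reference), mixed_tokens(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one English word is misrecognized.
print(mixed_error_rate("我想看 movie 今天", "我想看 movies 今天"))
```

In this example the reference has six tokens (five Mandarin characters and one English word), so a single substitution gives an error rate of one sixth. Punctuation and digits are ignored by this simple tokenizer, which a production scorer would need to handle explicitly.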

【5】 Detect what you want: Target Sound Detection Link: https://arxiv.org/abs/2112.10153

Authors: Dongchao Yang, Helin Wang, Yuexian Zou, Chao Weng Affiliations: ADSPLAB, School of ECE, Peking University, Shenzhen, China; Tencent AI Lab, Shenzhen, China Comments: Submitted to ICASSP 2022
Abstract: Human beings can perceive a target sound of interest in a multi-source environment through selective auditory attention; however, such functionality has hardly been explored in machine hearing. This paper addresses target sound detection (TSD), which aims to detect the target sound signal in mixture audio when a reference audio of the target sound is given. We present a novel target sound detection network (TSDNet) which consists of two main parts: a conditional network and a detection network. The former aims at generating a sound-discriminative conditional embedding vector representing the global information of the target sound. The latter takes both the mixture audio and the conditional embedding vector as inputs and produces the detection result. These two networks can be jointly optimized with a multi-task learning approach to further improve the performance. In addition, we study both supervised and weakly supervised strategies to train TSDNet. To evaluate our methods, we build a target sound detection dataset (TSD Dataset) based on the URBAN-SED and URBAN-SOUND8K datasets. Experimental results indicate that our system achieves better performance than universal sound event detection.

【6】 Investigation of Densely Connected Convolutional Networks with Domain Adversarial Learning for Noise Robust Speech Recognition Link: https://arxiv.org/abs/2112.10108

Authors: Chia Yu Li, Ngoc Thang Vu Affiliation: Institute for Natural Language Processing, University of Stuttgart, Germany Comments: 7 pages, 5 figures, The 30th Conference on Electronic Speech Signal Processing (ESSV2019)
Abstract: We investigate densely connected convolutional networks (DenseNets) and their extension with domain adversarial training for noise-robust speech recognition. DenseNets are very deep, compact convolutional neural networks which have demonstrated incredible improvements over the state-of-the-art results in computer vision. Our experimental results reveal that DenseNets are more robust against noise than other neural network based models such as deep feed-forward neural networks and convolutional neural networks. Moreover, domain adversarial learning can further improve the robustness of DenseNets against both known and unknown noise conditions.

【7】 Soundify: Matching Sound Effects to Video Link: https://arxiv.org/abs/2112.09726

Authors: David Chuan-En Lin, Anastasis Germanidis, Cristóbal Valenzuela, Yining Shi, Nikolas Martelaro Affiliations: Carnegie Mellon University; Runway Comments: NeurIPS 2021 Workshop on Machine Learning for Creativity and Design
Abstract: In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space. However, through formative interviews with professional video editors, we found that this process can be extremely tedious and time-consuming. We introduce Soundify, a system that matches sound effects to video. By leveraging labeled, studio-quality sound effects libraries and extending CLIP, a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to the results at https://chuanenlin.com/soundify.

Machine translation, for reference only.

Originally published 2021-12-21 on the WeChat official account arXiv每日学术速递.
