
Finance / Speech / Audio Processing arXiv Daily Digest [9.8]

By arXiv每日学术速递 (WeChat official account)
Published 2021-09-16 16:37:40

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, and offers search, bookmarking, and more.

q-fin (Quantitative Finance): 8 papers

cs.SD (Sound): 7 papers

eess.AS (Audio and Speech Processing): 7 papers

1. q-fin (Quantitative Finance):

【1】 Characterization of flexible electricity in power and energy markets
Link: https://arxiv.org/abs/2109.03000

Authors: Güray Kara, Asgeir Tomasgard, Hossein Farahmand
Note: Submitted to Renewable & Sustainable Energy Reviews in August 2021; currently under review.
Abstract: The authors provide a comprehensive overview of flexibility characterization along the dimensions of time, spatiality, resource, and risk in power systems. These dimensions are discussed in relation to flexibility assets, products, and services, as well as new and existing flexibility market designs. The authors argue that flexibility should be evaluated along the dimensions under discussion. Flexibility products and services can increase the efficiency of power systems and markets if flexibility assets and related services are taken into consideration and used along the time, geography, technology, and risk dimensions. Although it is possible to evaluate flexibility in existing market designs, a local flexibility market may be needed to exploit the value of flexibility, depending on the dimensions of the flexibility products and services. To locate flexibility in power grids and prevent incorrect valuations, the authors also discuss TSO-DSO coordination along the four dimensions and present the interrelations between flexibility dimensions, products, services, and related market designs for the productive use of flexible electricity.

【2】 Examining the Dynamic Asset Market Linkages under the COVID-19 Global Pandemic
Link: https://arxiv.org/abs/2109.02933

Authors: Akihiko Noda
Affiliations: School of Commerce, Meiji University, Tokyo, Japan; Keio Economic Observatory, Keio University, Tokyo, Japan
Abstract: This study examines the dynamic asset market linkages under the COVID-19 global pandemic based on market efficiency, in the sense of Fama (1970). In particular, we estimate the joint degree of market efficiency by applying Ito et al.'s (2014; 2017) Generalized Least Squares-based time-varying vector autoregression model. The empirical results show that (1) the joint degree of market efficiency changes widely over time, consistent with Lo's (2004) adaptive market hypothesis; (2) the COVID-19 pandemic may eliminate arbitrage and improve market efficiency through enhanced linkages between the asset markets; and (3) market efficiency has continued to decline due to the Bitcoin bubble that emerged at the end of 2020.

【3】 Using Satellite Imagery and Machine Learning to Estimate the Livelihood Impact of Electricity Access
Link: https://arxiv.org/abs/2109.02890

Authors: Nathan Ratledge, Gabe Cadamuro, Brandon de la Cuesta, Matthieu Stigler, Marshall Burke
Affiliations: Emmett Interdisciplinary Program in Environment and Resources, Stanford University, Palo Alto, CA, USA; Atlas AI, Palo Alto, CA, USA; King Center for International Development, Stanford University, Palo Alto, CA, USA
Abstract: In many regions of the world, sparse data on key economic outcomes inhibits the development, targeting, and evaluation of public policy. We demonstrate how advancements in satellite imagery and machine learning can help ameliorate these data and inference challenges. In the context of an expansion of the electrical grid across Uganda, we show how a combination of satellite imagery and computer vision can be used to develop local-level livelihood measurements appropriate for inferring the causal impact of electricity access on livelihoods. We then show how ML-based inference techniques deliver more reliable estimates of the causal impact of electrification than traditional alternatives when applied to these data. We estimate that grid access improves village-level asset wealth in rural Uganda by 0.17 standard deviations, more than doubling the growth rate over our study period relative to untreated areas. Our results provide country-scale evidence on the impact of a key infrastructure investment, and provide a low-cost, generalizable approach to future policy evaluation in data-sparse environments.

【4】 Moment Matching Method for Pricing Spread Options with Mean-Variance Mixture Lévy Motions
Link: https://arxiv.org/abs/2109.02872

Authors: Dongdong Hu, Hasanjan Sayit, Svetlozar T. Rachev
Affiliations: Department of Financial Mathematics, Xi'an Jiaotong-Liverpool University, Suzhou, China; Department of Mathematics and Statistics, Texas Tech University, Lubbock, TX, USA
Note: 20 pages
Abstract: The paper by Borovkova et al. [4] uses a moment-matching method to obtain closed-form formulas for spread and basket call option prices under lognormal models. In this note, we also use a moment-matching method to obtain semi-closed-form formulas for the price of spread options under exponential Lévy models with mean-variance mixture. Unlike the semi-closed-form formulas in Caldana and Fusai [5], where spread prices are expressed using a Fourier inversion formula for general price dynamics, our formula expresses spread prices in terms of the mixing distribution. Numerical tests show that our formulas also give accurate spread prices.
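The moment-matching idea behind this line of work can be illustrated with a toy sketch: approximate the spread by a shifted lognormal whose mean and variance match those of the true spread, then price with a Black-Scholes-style formula. The sketch below uses plain correlated geometric Brownian motions, not the paper's mean-variance-mixture Lévy models, and every parameter (spots, volatilities, correlation, strike, and the shift heuristic) is an illustrative assumption; a Monte Carlo estimate serves as a sanity check.

```python
import numpy as np
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical market parameters (illustrative assumptions only)
S1, S2 = 100.0, 95.0            # spot prices
sig1, sig2, rho = 0.2, 0.15, 0.5
r, T, K = 0.0, 1.0, 5.0

# Exact first and second moments of S1(T), S2(T) under correlated GBMs
m1 = S1 * exp(r * T)
m2 = S2 * exp(r * T)
v1 = S1**2 * exp(2 * r * T) * (exp(sig1**2 * T) - 1)
v2 = S2**2 * exp(2 * r * T) * (exp(sig2**2 * T) - 1)
cov = S1 * S2 * exp(2 * r * T) * (exp(rho * sig1 * sig2 * T) - 1)

# Mean and variance of the spread X = S1(T) - S2(T)
mean_X = m1 - m2
var_X = v1 + v2 - 2 * cov

# Fit a shifted lognormal: X + c ~ LN(mu_ln, s_ln^2), shift c is a heuristic
c = 3.0 * sqrt(var_X)           # assumption: keeps X + c comfortably positive
mu_shift = mean_X + c
s2_ln = log(1.0 + var_X / mu_shift**2)
mu_ln = log(mu_shift) - 0.5 * s2_ln
s_ln = sqrt(s2_ln)

# Black-style price: E[(X - K)^+] = E[((X + c) - (K + c))^+]
Ks = K + c
d2 = (mu_ln - log(Ks)) / s_ln
d1 = d2 + s_ln
price = exp(-r * T) * (mu_shift * norm_cdf(d1) - Ks * norm_cdf(d2))

# Monte Carlo sanity check of the spread call price
rng = np.random.default_rng(0)
z1 = rng.standard_normal(200_000)
z2 = rho * z1 + sqrt(1 - rho**2) * rng.standard_normal(200_000)
ST1 = S1 * np.exp(-0.5 * sig1**2 * T + sig1 * sqrt(T) * z1)
ST2 = S2 * np.exp(-0.5 * sig2**2 * T + sig2 * sqrt(T) * z2)
mc = exp(-r * T) * np.maximum(ST1 - ST2 - K, 0.0).mean()
print(round(price, 2), round(mc, 2))
```

Because only two moments are matched, the approximation is best near the money; the paper's contribution is extending this style of formula to mixing distributions where the lognormal shortcut no longer applies.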

【5】 Identifying the Main Factors of Iran's Economic Growth using Growth Accounting Framework
Link: https://arxiv.org/abs/2109.02787

Authors: Mohammadreza Mahmoudi
Affiliations: Department of Economics, Northern Illinois University, DeKalb, USA
Note: 22 pages, 4 tables, 7 figures
Abstract: This paper presents an empirical analysis of Iranian economic growth from 1950 to 2018 using data from the World Bank, the Madison Data Bank, the Statistical Center of Iran, and the Central Bank of Iran. The results show that Gross Domestic Product (GDP) per capita increased by 2 percent annually during this period, although this indicator has fluctuated widely over time. In addition, Iran's economic growth and its oil revenue are closely related: whenever oil crises occurred, large fluctuations in the growth rate and other indicators followed. Even though the shares of other sectors such as industry and services in GDP have increased over time, the oil sector still plays a key role in Iran's economic growth. Moreover, growth accounting analysis shows that the contribution of capital plays a significant role in Iran's economic growth. Furthermore, based on the growth accounting framework, the steady-state level of effective capital for Iran's economy is 4.27.
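The growth-accounting exercise the abstract refers to decomposes output growth into factor contributions plus a Solow residual, g_Y = g_A + α·g_K + (1−α)·g_L. A minimal sketch with purely hypothetical numbers (the capital share, growth rates, saving rate, and effective depreciation below are illustrative assumptions, not the paper's Iranian data):

```python
# Growth accounting: g_Y = g_A + alpha*g_K + (1 - alpha)*g_L
# All numbers below are illustrative assumptions, not the paper's data.
alpha = 0.4                            # capital share (assumed)
g_Y, g_K, g_L = 0.02, 0.03, 0.015      # output, capital, labor growth (assumed)

g_A = g_Y - alpha * g_K - (1 - alpha) * g_L   # Solow residual (TFP growth)
contrib_K = alpha * g_K                        # contribution of capital
contrib_L = (1 - alpha) * g_L                  # contribution of labor

# Steady-state effective capital in a Solow setup:
# k* = (s / (n + g + delta))^(1 / (1 - alpha)), with an assumed saving rate s
# and assumed effective depreciation n + g + delta
s, n_g_delta = 0.25, 0.08
k_star = (s / n_g_delta) ** (1.0 / (1.0 - alpha))

print(round(g_A, 4), round(contrib_K, 4), round(k_star, 2))
```

With these assumed figures the residual is slightly negative, i.e., measured input growth more than accounts for output growth; the paper runs the same arithmetic on actual Iranian series to reach its 4.27 steady-state figure.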

【6】 Net Buying Pressure and the Information in Bitcoin Option Trades
Link: https://arxiv.org/abs/2109.02776

Authors: Carol Alexander, Jun Deng, Jianfen Feng, Huning Wan
Note: 27 pages
Abstract: How do supply and demand from informed traders drive the market prices of bitcoin options? Deribit options tick-level data supports the limits-to-arbitrage hypothesis about market makers' supply. The main demand-side effects are that at-the-money option prices are largely driven by volatility traders, while out-of-the-money options are simultaneously driven by volatility traders and those with proprietary information about the direction of future bitcoin price movements. The demand-side trading results contrast with prior studies on established options markets in the US and Asia, but we also show that Deribit is rapidly evolving into a more efficient channel for aggregating information from informed traders.

【7】 Sorting with Team Formation
Link: https://arxiv.org/abs/2109.02730

Authors: Job Boerma, Aleh Tsyvinski, Alexander P. Zimin
Affiliations: University of Wisconsin-Madison; Yale University and NES; MIT and HSE
Abstract: We fully solve an assignment problem with heterogeneous firms and multiple heterogeneous workers whose skills are imperfect substitutes, that is, when production is submodular. We show that sorting is neither positive nor negative and is characterized sufficiently by two regions. In the first region, mediocre firms sort with mediocre workers and coworkers such that output losses are equal across all these pairings (complete mixing). In the second region, high skill workers sort with a low skill coworker and a high productivity firm, while high productivity firms employ a low skill worker and a high skill coworker (pairwise countermonotonicity). The equilibrium assignment is also necessarily characterized by product countermonotonicity, meaning that sorting is negative for each dimension of heterogeneity with the product of heterogeneity in the other dimensions. The equilibrium assignment, as well as wages and firm values, is completely characterized in closed form. We illustrate our theory with an application showing that our model is consistent with the observed dispersion of earnings within and across U.S. firms. Our counterfactual analysis gives evidence that the change in the firm project distribution between 1981 and 2013 has a larger effect on the observed change in earnings dispersion than the change in the worker skill distribution.

【8】 Information Payoffs: An Interim Perspective
Link: https://arxiv.org/abs/2109.03061

Authors: Laura Doval, Alex Smolin
Abstract: We study the payoffs that can arise under some information structure from an interim perspective. There is a set of types distributed according to some prior distribution and a payoff function that assigns a value to each pair of a type and a belief over the types. Any information structure induces an interim payoff profile, which describes, for each type, the expected payoff under the information structure conditional on that type. We characterize the set of all interim payoff profiles consistent with some information structure. We illustrate our results through applications.

2. cs.SD (Sound):

【1】 Fruit-CoV: An Efficient Vision-based Framework for Speedy Detection and Diagnosis of SARS-CoV-2 Infections Through Recorded Cough Sounds
Link: https://arxiv.org/abs/2109.03219

Authors: Long H. Nguyen, Nhat Truong Pham, Van Huong Do, Liu Tai Nguyen, Thanh Tin Nguyen, Van Dung Do, Hai Nguyen, Ngoc Duy Nguyen
Affiliations: Division of Computational Mechatronics, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; ASICLAND, Suwon, South Korea; Human Computer Interaction Lab, Sejong University, Seoul, South Korea
Note: 4 pages
Abstract: SARS-CoV-2, colloquially known as COVID-19, had an initial outbreak in December 2019. The deadly virus has spread across the world, driving a global pandemic since March 2020. In addition, a recent SARS-CoV-2 variant named Delta is intractably contagious and responsible for more than four million deaths worldwide. It is therefore vital to have an at-home self-testing service for SARS-CoV-2. In this study, we introduce Fruit-CoV, a two-stage vision framework capable of detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, in the first stage we convert sounds into log-Mel spectrograms and use the EfficientNet-V2 network to extract their visual features. In the second stage, we use 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN to aggregate feature representations of the log-Mel spectrograms. Finally, we use the combined features to train a binary classifier. We use a dataset provided by the AICovidVN 115M Challenge, which includes a total of 7,371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results show that our proposed model achieves an AUC score of 92.8% and ranks first on the leaderboard of the AICovidVN Challenge. More importantly, our proposed framework can be integrated into a call center or a VoIP system to speed up the detection of SARS-CoV-2 infections through online or recorded cough sounds.
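The first stage's input representation can be sketched in plain NumPy: frame the waveform, take the power spectrum, apply a triangular Mel filterbank, and take logs. This is a generic log-Mel front-end, not the paper's exact pipeline; the sample rate, FFT size, hop, and number of Mel bands below are assumed values, and the white-noise input is a stand-in for a real cough recording.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Minimal log-Mel spectrogram; parameters are illustrative assumptions."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular Mel filterbank, evenly spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    mel = spec @ fbank.T
    return np.log(mel + 1e-10)   # log compression, floored for stability

# One second of synthetic noise as a stand-in for a recorded cough
rng = np.random.default_rng(0)
feat = log_mel_spectrogram(rng.standard_normal(16000))
print(feat.shape)  # (frames, n_mels)
```

The resulting (frames × mel-bands) matrix is what an image backbone such as EfficientNet-V2 would then consume as a single-channel image.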

【2】 Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors
Link: https://arxiv.org/abs/2109.02993

Authors: Hasam Khalid, Minha Kim, Shahroz Tariq, Simon S. Woo
Affiliations: College of Computing and Informatics, Sungkyunkwan University, South Korea; Department of Applied Data Science
Note: 2 figures, 2 tables; accepted for publication at the 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection (ADGD '21) at ACM MM 2021
Abstract: Significant advancements in the generation of deepfakes have caused security and privacy concerns. Attackers can easily impersonate a person's identity in an image by replacing their face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is emerging: an attacker can now generate a realistic cloned voice using only a few seconds of audio of the target person. In response to the emerging threat of harm that deepfakes can cause, researchers have proposed deepfake detection methods. However, these methods focus on detecting a single modality, i.e., either video or audio. To develop a good deepfake detector that can cope with recent advancements in deepfake generation, we need a detector that can handle multiple modalities, i.e., both video and audio, which in turn requires a dataset containing both video and corresponding audio deepfakes. We use the most recent such dataset, the Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), which contains not only deepfake videos but synthesized fake audio as well. We performed detailed baseline experiments on this multimodal dataset using state-of-the-art unimodal, ensemble-based, and multimodal detection methods. We conclude from detailed experimentation that unimodal methods, which address only a single modality (video or audio), do not perform well compared to ensemble-based methods, whereas purely multimodal baselines perform worst.

【3】 FastAudio: A Learnable Audio Front-End for Spoof Speech Detection
Link: https://arxiv.org/abs/2109.02774

Authors: Quchen Fu, Zhongwei Teng, Jules White, Maria Powell, Douglas C. Schmidt
Abstract: Voice assistants, such as smart speakers, have exploded in popularity. It is currently estimated that smart speaker adoption has exceeded 35% of the US adult population. Manufacturers have integrated speaker identification technology, which attempts to determine the identity of the person speaking, to provide personalized services to different members of the same family. Speaker identification can also play an important role in controlling how a smart speaker is used. For example, correctly identifying the user is not critical when playing music; however, when reading the user's email out loud, it is critical to verify that the speaker making the request is the authorized user. Speaker verification systems, which authenticate the speaker's identity, are therefore needed as a gatekeeper against various spoofing attacks that aim to impersonate the enrolled user. This paper compares popular learnable front-ends, which learn representations of audio by joint training with downstream tasks (end-to-end). We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure performance on the LA track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 27% compared with fixed front-ends, outperforming all other learnable front-ends on this task.

【4】 Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model
Link: https://arxiv.org/abs/2109.02773

Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas C. Schmidt
Abstract: An emerging trend in audio processing is capturing low-level speech representations from raw waveforms. These representations have shown promising results on a variety of tasks, such as speech recognition and speech separation. Compared to handcrafted features, learning speech features via backpropagation theoretically provides the model with greater flexibility in how it represents data for different tasks. However, empirical results show that, in some tasks such as voice spoof detection, handcrafted features remain more competitive than learned features. Instead of evaluating handcrafted features and raw waveforms independently, this paper proposes an Auxiliary Rawnet model to complement handcrafted features with features learned from raw waveforms. A key benefit of the approach is that it can improve accuracy at a relatively low computational cost. The proposed Auxiliary Rawnet model is tested on the ASVspoof 2019 dataset, and the results indicate that a light-weight waveform encoder can potentially boost the performance of handcrafted-features-based encoders in exchange for a small amount of additional computational work.

【5】 Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
Link: https://arxiv.org/abs/2109.02763

Authors: Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool
Note: Journal extension of our ECCV'20 paper; 15 pages. arXiv admin note: substantial text overlap with arXiv:2003.04210
Abstract: Humans can robustly recognize and localize objects using visual and/or auditory cues. While machines can already do the same with visual data, less work has been done with sounds. This work develops an approach for scene understanding based purely on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this end, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method, where the student method is trained to generate the same results as the teacher methods. This way, the auditory system can be trained without human annotations. To further boost performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network to boost overall performance. Experimental results show that 1) our method achieves good results on all four tasks; 2) the four tasks are mutually beneficial, and training them together achieves the best performance; 3) the number and orientation of the microphones are both important; and 4) features learned from the standard spectrogram and features obtained by the classic signal-processing pipeline are complementary for auditory perception tasks. The data and code are released.

【6】 Machine Learning: Challenges, Limitations, and Compatibility for Audio Restoration Processes
Link: https://arxiv.org/abs/2109.02692

Authors: Owen Casey, Rushit Dave, Naeem Seliya, Evelyn R. Sowells Boone
Affiliations: Department of Computer Science, University of Wisconsin-Eau Claire, Eau Claire, WI, United States; Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC, United States
Note: 6 pages, 2 figures
Abstract: In this paper, machine learning networks are explored for their use in restoring degraded and compressed speech audio. The intent of the project was to build a newly trained model from voice data that learns the features of compression-artifact distortion introduced by data loss from lossy compression and resolution loss, using an existing algorithm presented in SEGAN: Speech Enhancement Generative Adversarial Network. The generator produced by the model was then to be used to restore degraded speech audio. This paper details the subsequent compatibility and operational issues presented by working with deprecated code, which prevented the trained model from being successfully developed. The paper further serves as an examination of the challenges, limitations, and compatibility in the current state of machine learning.

【7】 The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge
Link: https://arxiv.org/abs/2109.02853

Authors: Danwei Cai, Ming Li
Affiliations: Data Science Research Center, Duke Kunshan University, Kunshan, China; Department of Electrical and Computer Engineering, Duke University, Durham, USA
Note: arXiv admin note: text overlap with arXiv:2010.14751
Abstract: This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn a self-supervised speaker representation based on a deep neural network (DNN). The framework starts by training a self-supervised speaker embedding network that maximizes the agreement between different segments within an utterance via a contrastive loss. Taking advantage of a DNN's ability to learn from data with label noise, we propose clustering the speaker embeddings obtained from the previous speaker network and using the resulting class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of the DNN. Visual modality data is also incorporated in this self-labeling framework: the visual and audio pseudo labels are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test sets, respectively.
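The clustering-for-pseudo-labels step can be sketched with a tiny k-means on synthetic embeddings. Everything below is a hedged stand-in: the clustering algorithm, the embedding dimensionality, and the two synthetic "speakers" are assumptions for illustration, not the system's actual components.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny k-means used here to turn fixed speaker embeddings into pseudo
    labels -- a stand-in for the clustering step of the iterative labeling
    framework (the paper's actual clustering algorithm is not specified here)."""
    # Deterministic init: spread the initial centers across the data
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared Euclidean distance of every embedding to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                     # cluster assignment
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)   # recompute centers
    return labels

# Two synthetic "speakers": embeddings concentrated around different points
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                 rng.normal(1.0, 0.1, (50, 8))])

pseudo = kmeans(emb, 2)   # pseudo labels that would supervise the next DNN
```

In the actual framework, these pseudo labels would supervise the training of a new embedding network, whose outputs are clustered again in the next iteration, gradually bootstrapping the representation.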

3. eess.AS (Audio and Speech Processing):

【1】 The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge
Link: https://arxiv.org/abs/2109.02853

Authors: Danwei Cai, Ming Li
Affiliations: Data Science Research Center, Duke Kunshan University, Kunshan, China; Department of Electrical and Computer Engineering, Duke University, Durham, USA
Note: arXiv admin note: text overlap with arXiv:2010.14751
Abstract: This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn a self-supervised speaker representation based on a deep neural network (DNN). The framework starts by training a self-supervised speaker embedding network that maximizes the agreement between different segments within an utterance via a contrastive loss. Taking advantage of a DNN's ability to learn from data with label noise, we propose clustering the speaker embeddings obtained from the previous speaker network and using the resulting class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of the DNN. Visual modality data is also incorporated in this self-labeling framework: the visual and audio pseudo labels are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test sets, respectively.

【2】 Fruit-CoV: An Efficient Vision-based Framework for Speedy Detection and Diagnosis of SARS-CoV-2 Infections Through Recorded Cough Sounds
Link: https://arxiv.org/abs/2109.03219

Authors: Long H. Nguyen, Nhat Truong Pham, Van Huong Do, Liu Tai Nguyen, Thanh Tin Nguyen, Van Dung Do, Hai Nguyen, Ngoc Duy Nguyen
Affiliations: Division of Computational Mechatronics, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; ASICLAND, Suwon, South Korea; Human Computer Interaction Lab, Sejong University, Seoul, South Korea
Note: 4 pages
Abstract: SARS-CoV-2, colloquially known as COVID-19, had an initial outbreak in December 2019. The deadly virus has spread across the world, driving a global pandemic since March 2020. In addition, a recent SARS-CoV-2 variant named Delta is intractably contagious and responsible for more than four million deaths worldwide. It is therefore vital to have an at-home self-testing service for SARS-CoV-2. In this study, we introduce Fruit-CoV, a two-stage vision framework capable of detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, in the first stage we convert sounds into log-Mel spectrograms and use the EfficientNet-V2 network to extract their visual features. In the second stage, we use 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN to aggregate feature representations of the log-Mel spectrograms. Finally, we use the combined features to train a binary classifier. We use a dataset provided by the AICovidVN 115M Challenge, which includes a total of 7,371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results show that our proposed model achieves an AUC score of 92.8% and ranks first on the leaderboard of the AICovidVN Challenge. More importantly, our proposed framework can be integrated into a call center or a VoIP system to speed up the detection of SARS-CoV-2 infections through online or recorded cough sounds.

【3】 Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors
Link: https://arxiv.org/abs/2109.02993

Authors: Hasam Khalid, Minha Kim, Shahroz Tariq, Simon S. Woo
Affiliations: College of Computing and Informatics, Sungkyunkwan University, South Korea; Department of Applied Data Science
Note: 2 figures, 2 tables, accepted for publication at the 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection (ADGD '21) at ACM MM 2021
Abstract: Significant advancements made in the generation of deepfakes have caused security and privacy issues. Attackers can easily impersonate a person's identity in an image by replacing his face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is also emerging. Now, an attacker can generate realistic cloned voices of humans using only a few seconds of audio of the target person. With the emerging threat of potential harm deepfakes can cause, researchers have proposed deepfake detection methods. However, they only focus on detecting a single modality, i.e., either video or audio. On the other hand, to develop a good deepfake detector that can cope with the recent advancements in deepfake generation, we need to have a detector that can detect deepfakes of multiple modalities, i.e., videos and audios. To build such a detector, we need a dataset that contains video and respective audio deepfakes. We were able to find a most recent deepfake dataset, Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), that contains not only deepfake videos but synthesized fake audios as well. We used this multimodal deepfake dataset and performed detailed baseline experiments using state-of-the-art unimodal, ensemble-based, and multimodal detection methods to evaluate it. We conclude through detailed experimentation that unimodals, addressing only a single modality, video or audio, do not perform well compared to ensemble-based methods. Whereas purely multimodal-based baselines provide the worst performance.
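The gap the authors report between unimodal and ensemble-based detectors comes down to score fusion: an ensemble combines per-modality scores before thresholding. A minimal sketch of late (score-level) fusion and the rank-based AUC metric typically used in such evaluations, on synthetic scores (the data below is illustrative, not drawn from FakeAVCeleb):

```python
import numpy as np

def auc(scores, labels):
    # Rank-based AUC: probability that a random positive outscores
    # a random negative (ties count half)
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [0] * 50)            # 1 = fake, 0 = real
# Two weak unimodal detectors: noisy scores correlated with the label
video_scores = labels + rng.normal(0, 1.5, 100)
audio_scores = labels + rng.normal(0, 1.5, 100)
fused = 0.5 * (video_scores + audio_scores)       # late fusion by averaging

print(auc(video_scores, labels), auc(audio_scores, labels), auc(fused, labels))
```

Averaging independent noisy scores reduces the score noise, which is the intuition behind why ensembles tend to beat either single modality on this benchmark.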

【4】 FastAudio: A Learnable Audio Front-End for Spoof Speech Detection
Link: https://arxiv.org/abs/2109.02774

Authors: Quchen Fu, Zhongwei Teng, Jules White, Maria Powell, Douglas C. Schmidt
Abstract: Voice assistants, such as smart speakers, have exploded in popularity. It is currently estimated that the smart speaker adoption rate has exceeded 35% in the US adult population. Manufacturers have integrated speaker identification technology, which attempts to determine the identity of the person speaking, to provide personalized services to different members of the same family. Speaker identification can also play an important role in controlling how the smart speaker is used. For example, it is not critical to correctly identify the user when playing music. However, when reading the user's email out loud, it is critical to correctly verify that the speaker making the request is the authorized user. Speaker verification systems, which authenticate the speaker identity, are therefore needed as a gatekeeper to protect against various spoofing attacks that aim to impersonate the enrolled user. This paper compares popular learnable front-ends which learn the representations of audio by joint training with downstream tasks (End-to-End). We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure the performance on the LA track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 27% when compared with fixed front-ends, outperforming all other learnable front-ends on this task.
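The core idea of a learnable front-end is that the filterbank is parameterized by a small set of trainable values instead of being hard-coded. The sketch below (my own illustration, not the FastAudio architecture; the Gaussian filter shape, initialization, and toy objective are all assumptions) shows a filterbank whose center frequencies are parameters that a gradient step can move; in FastAudio the gradient would arrive by backpropagation from the spoof-detection loss:

```python
import numpy as np

N_BINS = 257            # rfft bins for a 512-point FFT
N_FILTERS = 8

def gaussian_filterbank(centers, width=12.0):
    # Each filter is a Gaussian bump over FFT bins; `centers` are the
    # learnable parameters (a fixed front-end would hard-code them)
    bins = np.arange(N_BINS)[None, :]
    return np.exp(-0.5 * ((bins - centers[:, None]) / width) ** 2)

def front_end(power_spec, centers):
    # Project a power spectrum onto the filterbank, log-compress
    return np.log(power_spec @ gaussian_filterbank(centers).T + 1e-8)

rng = np.random.default_rng(1)
spec = rng.random(N_BINS) ** 2                 # one toy spectrum frame
centers = np.linspace(10, 240, N_FILTERS)      # assumed initialization

def loss(c):
    # Toy surrogate objective: capture as much log-energy as possible
    return -front_end(spec[None, :], c).sum()

# One SGD step using a numerical gradient over the filter centers
eps, lr = 1e-4, 1.0
grad = np.array([(loss(centers + eps * np.eye(N_FILTERS)[i]) - loss(centers)) / eps
                 for i in range(N_FILTERS)])
updated = centers - lr * grad
print(updated.round(2))
```

The point of the sketch is the parameter count: the whole front-end here is 8 scalars, so making it trainable adds almost no computational cost relative to the back-end.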

【5】 Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model
Link: https://arxiv.org/abs/2109.02773

Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas C. Schmidt
Abstract: An emerging trend in audio processing is capturing low-level speech representations from raw waveforms. These representations have shown promising results on a variety of tasks, such as speech recognition and speech separation. Compared to handcrafted features, learning speech features via backpropagation provides the model greater flexibility in how it represents data for different tasks theoretically. However, results from empirical study shows that, in some tasks, such as voice spoof detection, handcrafted features are more competitive than learned features. Instead of evaluating handcrafted features and raw waveforms independently, this paper proposes an Auxiliary Rawnet model to complement handcrafted features with features learned from raw waveforms. A key benefit of the approach is that it can improve accuracy at a relatively low computational cost. The proposed Auxiliary Rawnet model is tested using the ASVspoof 2019 dataset and the results from this dataset indicate that a light-weight waveform encoder can potentially boost the performance of handcrafted-features-based encoders in exchange for a small amount of additional computational work.
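The complementary-features idea can be pictured as simple feature-level fusion: one branch computes engineered features, an auxiliary branch encodes the raw waveform, and their outputs are concatenated before classification. The sketch below is illustrative only (the band-energy features and the random-kernel "encoder" are stand-ins I chose, not the Auxiliary Rawnet design):

```python
import numpy as np

rng = np.random.default_rng(0)

def handcrafted_features(wave, n_bands=20):
    # Stand-in for engineered features: log energies of equal-width FFT bands
    spec = np.abs(np.fft.rfft(wave)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.log(np.array([b.sum() for b in bands]) + 1e-8)

def raw_encoder(wave, kernels):
    # Stand-in for a learned waveform encoder: 1-D convolutions followed by
    # global max-pooling. In Auxiliary Rawnet the kernels would be trained;
    # here they are random, which is enough to show the data flow.
    return np.array([np.max(np.abs(np.convolve(wave, k, mode="valid")))
                     for k in kernels])

wave = rng.normal(size=16000)                    # one second of toy audio
kernels = [rng.normal(size=64) for _ in range(8)]

# Feature-level fusion: 20 handcrafted dims + 8 learned dims
fused = np.concatenate([handcrafted_features(wave), raw_encoder(wave, kernels)])
print(fused.shape)
```

Because the auxiliary branch only needs to add information the handcrafted branch misses, it can stay small, which is where the "relatively low computational cost" claim comes from.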

【6】 Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
Link: https://arxiv.org/abs/2109.02763

Authors: Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool
Note: Journal extension of our ECCV'20 paper, 15 pages. arXiv admin note: substantial text overlap with arXiv:2003.04210
Abstract: Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method -- the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial -- training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released.
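The cross-modal distillation setup above reduces to a simple training objective: the audio student is fit to reproduce the vision teacher's outputs, so the teacher's predictions replace human labels. A minimal numeric sketch (a linear student on random features; the dimensions and MSE objective are illustrative assumptions, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a vision "teacher" provides pseudo-label targets, and a linear
# "student" maps binaural-audio features to those same targets.
n_samples, audio_dim, out_dim = 64, 32, 10
audio_feats = rng.normal(size=(n_samples, audio_dim))
teacher_out = rng.normal(size=(n_samples, out_dim))   # teacher predictions

W = np.zeros((audio_dim, out_dim))                    # student parameters

def distill_loss(W):
    # Student is trained to reproduce the teacher's outputs (MSE),
    # so no human annotations are needed anywhere in the loop.
    err = audio_feats @ W - teacher_out
    return (err ** 2).mean()

# Exact gradient descent on the MSE distillation objective
lr = 0.01
losses = [distill_loss(W)]
for _ in range(100):
    grad = 2 * audio_feats.T @ (audio_feats @ W - teacher_out) / (n_samples * out_dim)
    W -= lr * grad
    losses.append(distill_loss(W))

print(losses[0], losses[-1])
```

In the paper the same principle is applied per task (semantics, depth, motion), with each vision teacher supervising the corresponding head of the shared audio student.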

【7】 Machine Learning: Challenges, Limitations, and Compatibility for Audio Restoration Processes
Link: https://arxiv.org/abs/2109.02692

Authors: Owen Casey, Rushit Dave, Naeem Seliya, Evelyn R Sowells Boone
Affiliations: Department of Computer Science, University of Wisconsin-Eau Claire, Eau Claire, WI, United States; Department of Computer Systems Technology, North Carolina A&T State University, Greensboro, NC, United States
Note: 6 pages, 2 figures
Abstract: In this paper machine learning networks are explored for their use in restoring degraded and compressed speech audio. The project intent is to build a new trained model from voice data to learn features of compression artifacting distortion introduced by data loss from lossy compression and resolution loss with an existing algorithm presented in SEGAN: Speech Enhancement Generative Adversarial Network. The resulting generator from the model was then to be used to restore degraded speech audio. This paper details an examination of the subsequent compatibility and operational issues presented by working with deprecated code, which obstructed the trained model from successfully being developed. This paper further serves as an examination of the challenges, limitations, and compatibility in the current state of machine learning.
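For context on the SEGAN objective the project builds on: to the best of my understanding, SEGAN trains its generator with a least-squares adversarial term plus an L1 reconstruction term weighted by a constant (100 in the original paper; treat that value and this formulation as an assumption rather than a verified restatement). A minimal numeric sketch of that generator loss:

```python
import numpy as np

def segan_generator_loss(d_fake, enhanced, clean, l1_weight=100.0):
    # Least-squares adversarial term: the generator wants the discriminator's
    # score on its output to reach 1 (i.e., to be judged "clean").
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    # L1 term pulling the enhanced waveform toward the clean reference.
    l1 = np.mean(np.abs(enhanced - clean))
    return adv + l1_weight * l1

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                       # reference waveform
enhanced = clean + rng.normal(scale=0.1, size=16000) # imperfect restoration
print(segan_generator_loss(np.array([0.3]), enhanced, clean))
```

The heavy L1 weight is what keeps the generator close to the reference signal; the adversarial term only has to remove the residual artifacts that L1 alone would leave behind.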

Machine translations are for reference only.

This article is shared from the WeChat official account arXiv每日学术速递 as part of the Tencent Cloud self-media sharing program. Originally published: 2021-09-08.
