
Visually Guided Self-Supervised Learning of Speech Representations (Multimedia)

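Self-supervised representation learning has recently attracted wide research interest for both the audio and the visual modality. Most work, however, focuses on a single modality or feature, and very little of it studies how the two modalities interact when learning self-supervised representations. We propose a framework that learns audio representations under the guidance of the visual modality, in the setting of audiovisual speech. We use a generative audio-to-video training scheme: a still image corresponding to a given audio clip is animated, and the generated video is optimized to be as close as possible to the real video of the speech segment. Through this process the audio encoder network learns useful speech representations, which we evaluate on emotion recognition and speech recognition. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision as a novel form of self-supervised learning for audio representations, one that has not been explored before. The proposed unsupervised audio features can draw on a virtually unlimited amount of unlabelled audiovisual speech as training data and have many potentially promising applications.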

Original title: Visually Guided Self Supervised Learning of Speech Representations (Multimedia)

Original abstract: Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very limited work that studies the interaction between the two modalities for learning self supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state of the art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel way for self-supervised learning which has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of training data of unlabelled audiovisual speech and have a large number of potentially promising applications.

Authors: Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

Original link: https://arxiv.org/abs/2001.04316
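
The audio-to-video training scheme described in the abstract can be illustrated with a deliberately tiny toy model. The sketch below is not the authors' implementation: the linear encoder and decoder, the mean-squared reconstruction loss, the synthetic data, and every name in it (`init_params`, `encode`, `generate`, `train`, `make_toy_clip`) are assumptions made purely for illustration. It only shows the shape of the idea: audio features are encoded, the code drives an offset that "animates" a still frame, and the reconstruction error against the real frames is the only training signal, so no labels are needed.

```python
import numpy as np


def init_params(audio_dim, code_dim, frame_dim, seed=0):
    """Small random weights for a linear audio encoder and a linear frame decoder."""
    rng = np.random.default_rng(seed)
    return {
        "enc": rng.normal(0.0, 0.1, (code_dim, audio_dim)),
        "dec": rng.normal(0.0, 0.1, (frame_dim, code_dim)),
    }


def encode(params, audio):
    """Map audio features (T, audio_dim) to speech codes (T, code_dim)."""
    return audio @ params["enc"].T


def generate(params, audio, still):
    """Animate the still frame: each output frame is the still plus an audio-driven offset."""
    return still + encode(params, audio) @ params["dec"].T


def loss_and_grads(params, audio, still, real):
    """Mean-squared error between generated and real frames, with analytic gradients."""
    codes = encode(params, audio)
    err = still + codes @ params["dec"].T - real
    loss = float(np.mean(err ** 2))
    d_frames = 2.0 * err / err.size          # dLoss / dGeneratedFrames
    grads = {
        "dec": d_frames.T @ codes,
        "enc": (d_frames @ params["dec"]).T @ audio,
    }
    return loss, grads


def train(params, audio, still, real, lr=0.1, steps=500):
    """Plain gradient descent; returns the loss measured before each update."""
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, audio, still, real)
        losses.append(loss)
        for name in params:
            params[name] -= lr * grads[name]
    return losses


def make_toy_clip(frames=20, audio_dim=8, frame_dim=16, seed=1):
    """Synthetic stand-in for one clip: the 'real video' moves linearly with the audio."""
    rng = np.random.default_rng(seed)
    audio = rng.normal(size=(frames, audio_dim))
    still = rng.normal(size=frame_dim)
    motion = rng.normal(0.0, 0.5, (frame_dim, audio_dim))
    real = still + audio @ motion.T
    return audio, still, real


if __name__ == "__main__":
    audio, still, real = make_toy_clip()
    params = init_params(audio_dim=8, code_dim=4, frame_dim=16)
    losses = train(params, audio, still, real)
    print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
    print("speech codes shape:", encode(params, audio).shape)
```

After training, only `encode` would be kept: its output plays the role of the learned speech representation that the abstract says is evaluated on emotion recognition and speech recognition. A real system would presumably replace the linear maps with deep networks and the synthetic clip with actual audiovisual speech.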

