
A light weekend moment: enjoy video clips imagined entirely by a program

Author: 用户1908973 · Published 2018-07-20 16:07:00 · From the CreateAMind column

Generating Videos with Scene Dynamics

Carl Vondrick Hamed Pirsiavash Antonio Torralba

NIPS 2016

Download Paper

Abstract

We capitalize on large amounts of unlabeled video to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatiotemporal convolutional architecture that untangles the scene's foreground from its background. Experiments suggest this model can generate tiny videos at full frame rate better than simple baselines, and we show its utility for predicting plausible futures of static images. Furthermore, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications, including video understanding and simulation.

Video Generations

Below are videos generated by the model; these videos are not real. Although they are not photo-realistic, the motion is fairly reasonable.

Beach

Golf

Train Station

Conditional Video Generations

We can also train the model to generate an animation of the future given a static image. Of course, the future is uncertain, so the model rarely produces the "correct" future, but we think the predictions have some plausibility.

Input

Output

Input

Output

Input

Output

More generated videos:

For more results, check out the links below. Warning: each page is large (around 100 MB), so please open it on a computer rather than a phone, and wait for the download to finish.

Video Representation Learning

Learning models that generate videos may also be a promising way to learn representations. For example, we can train generators on a large repository of unlabeled videos, then fine-tune the discriminator on a small labeled dataset in order to recognize some actions with minimal supervision. We report accuracy on UCF101 and compare to other unsupervised learning methods for videos:

We visualize the features the network learns for predicting the next frames. Although not all units are semantic, we find some hidden units that activate on objects that tend to move, such as people or train tracks. Since predicting the future requires understanding how objects move, the network may have learned to recognize these objects, even though it was never supervised to do so.

Hidden unit for "people"

Hidden unit for "train tracks"

The images above highlight regions where a particular convolutional hidden unit fires.

Brief Technical Overview

Our approach builds on generative image models that leverage adversarial learning, which we apply to video. The basic idea behind the approach is to compete two deep networks against each other. One network ("the generator") tries to generate a synthetic video, and another network ("the discriminator") tries to discriminate synthetic versus real videos. The generator is trained to fool the discriminator.
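The competition between the two networks can be sketched with the standard GAN objectives. This is a minimal illustrative NumPy snippet, not the paper's implementation; the function name `gan_losses` and the epsilon clamp are assumptions, and it presumes the discriminator outputs probabilities in (0, 1):

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Standard GAN losses.

    d_real: discriminator outputs on a batch of real videos, in (0, 1)
    d_fake: discriminator outputs on a batch of generated videos, in (0, 1)

    The discriminator wants d_real -> 1 and d_fake -> 0; the generator
    wants d_fake -> 1 (i.e. to fool the discriminator).
    """
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

At the equilibrium where the discriminator cannot tell real from fake (outputs of 0.5 everywhere), the discriminator loss is 2·log 2 and the generator loss is log 2.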

For the generator, we use a deep convolutional network that inputs low-dimensional random noise and outputs a video. To model video, we use spatiotemporal up-convolutions (2D for space, 1D for time). The generator also models the background separately from the foreground. The network produces a static background (which is replicated over time) and a moving foreground that is combined using a mask. We illustrate this below:
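The two-stream combination described above is a per-pixel convex combination of foreground and background. A toy NumPy sketch of just the compositing step (the function name and array shapes are illustrative, not taken from the paper's code):

```python
import numpy as np

def composite_video(foreground, mask, background):
    """Combine a moving foreground with a static background.

    foreground: (T, H, W, C) video stream from the foreground pathway
    mask:       (T, H, W, 1) values in [0, 1]; 1 selects foreground
    background: (H, W, C) single image, replicated over all T frames
    """
    # background[None, ...] broadcasts the static image across time
    return mask * foreground + (1.0 - mask) * background[None, ...]
```

Because the background stream produces a single image that is broadcast over time, the network is encouraged to explain camera-static regions with the background and put motion in the masked foreground.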

We simultaneously train a discriminator network to distinguish real videos from fake videos. We use a deep spatiotemporal convolutional network for the discriminator.

We downloaded two years of videos from Flickr, which we stabilize, automatically filter by scene category, and use for training.

To predict the future, one can attach an encoder to the generator, so that the generator's input comes from an observed frame rather than pure noise.
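As a rough sketch of that idea (everything here is hypothetical, not the paper's architecture): an encoder maps the conditioning frame to a latent code that takes the place of the generator's random-noise input.

```python
import numpy as np

def encode_frame(frame, weights):
    """Hypothetical linear encoder: map an input frame to a latent code
    that replaces the generator's random-noise vector."""
    return np.tanh(frame.reshape(-1) @ weights)

rng = np.random.default_rng(0)
frame = rng.standard_normal((64, 64, 3))            # the conditioning image
weights = 0.01 * rng.standard_normal((64 * 64 * 3, 100))
z = encode_frame(frame, weights)                    # 100-d code fed to the generator
```

In the real system the encoder would be convolutional and trained jointly with the generator, but the interface is the same: image in, latent code out.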

Notes on Failures & Limitations

Adversarial learning is tricky to get right, and our model has several limitations we wish to point out.

  • The generations are usually distinguishable from real videos. They are also fairly low resolution: 64x64 for 32 frames.
  • Evaluation of generative models is hard. We used a psychophysical 2AFC test on Mechanical Turk asking workers "Which video is more realistic?" We think this evaluation is okay, but it is important for the community to settle on robust automatic evaluation metrics.

For better generations, we automatically filtered videos by scene category and trained a separate model per category. We used the PlacesCNN on the first few frames to get scene categories.

  • The future extrapolations do not always match the first frame very well, which may happen because the bottleneck is too strong.

Related Works

This project builds upon many great ideas, which we encourage you to check out. They have very cool results!

Adversarial Learning

Generative Adversarial Networks

by Goodfellow et al. 2014.

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

by Denton et al. 2015.

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

by Radford et al. 2015.

Context Encoders: Feature Learning by Inpainting

by Pathak et al. 2016.

Generative Image Modeling using Style and Structure Adversarial Networks

by Wang and Gupta. 2016.

Generative Adversarial Text to Image Synthesis

by Reed et al. 2016.

Generative Models of Video

Recursive estimation of generative models of video

by Petrovic et al. 2006.

Video (language) modeling: a baseline for generative models of natural videos

by Ranzato et al. 2014.

Patch to the Future: Unsupervised Visual Prediction

by Walker et al. 2014.

Unsupervised Learning of Video Representations using LSTMs

by Srivastava et al. 2015.

Deep multi-scale video prediction beyond mean square error

by Mathieu et al. 2015.

An Uncertain Future: Forecasting from Static Images using Variational Autoencoders

by Walker et al. 2016.

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

by Xue et al. 2016.

Please send us any references we may have accidentally missed.

Data & Code

The code is available on Github, implemented in Torch7. You can also download pretrained models (1 GB zip file).

We are working to release the raw data. The stabilized videos are around 7 TB, making them challenging to distribute. Here is a small sample (3 GB). Please contact Carl if you would like all of it.

Acknowledgements

We thank Yusuf Aytar for dataset discussions. We thank MIT TIG, especially Garrett Wollman, for troubleshooting issues on storing the 26 terabytes of video. We are grateful for the Torch7 community for answering many questions. NVidia donated GPUs used for this research. This work was partially supported by NSF grant #1524817 to AT, START program at UMBC to HP, and the Google PhD fellowship to CV.

Originally published 2016-09-10, shared from the CreateAMind WeChat public account.