unsupervised segmentation video motion
Shu Kong, Charless Fowlkes
Last update: March 24, 2019.
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for unsupervised learning on videos. mgPFF takes a pair of frames as input and outputs per-pixel filters that warp one frame to the other. Compared to optical flow used for warping frames, mgPFF is more powerful in modeling sub-pixel movement and dealing with corruption (e.g., motion blur). We develop a multigrid coarse-to-fine modeling strategy that avoids the requirement of learning large filters to capture large displacement. This allows us to train an extremely compact model (4.6MB) which operates in a progressive way over multiple resolutions with shared weights. We train mgPFF on unsupervised, free-form videos and show that mgPFF not only estimates long-range flow for frame reconstruction and detects video shot transitions, but is also readily amenable to video object segmentation and pose tracking, where it substantially outperforms the published state-of-the-art without bells and whistles. Moreover, owing to mgPFF's per-pixel filter prediction, we have the unique opportunity to visualize how each pixel evolves while solving these tasks, thus gaining better interpretability.
keywords: Unsupervised Learning, Multigrid Computing, Long-Range Flow, Video Segmentation, Instance Tracking, Pose Tracking, Video Shot/Transition Detection, Optical Flow, Filter Flow, Low-level Vision.
S. Kong, C. Fowlkes, "Multigrid Predictive Filter Flow for Unsupervised Learning on Videos", arXiv 1904.01693, 2019.
Acknowledgements: This project is supported by NSF grants IIS-1813785, IIS-1618806, IIS-1253538 and a hardware donation from NVIDIA. Shu Kong personally thanks Teng Liu and Etthew Kong who initiated this research, and the academic uncle Alexei A. Efros for the encouragement and discussion.
mgPFF captures long-range flow well, even though we did not train with large frame intervals. This is owing to the excellent reconstruction power of multigrid computing and the filter flow model (which is powerful in capturing sub-pixel movement and dealing with corruption, e.g., motion blur). This is reminiscent of a variety of flow-based applications, such as video compression, unsupervised optical flow learning, and frame interpolation.
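To make the multigrid argument concrete, here is a rough back-of-the-envelope sketch of why small filters applied coarse-to-fine can cover large displacements. It assumes that one k x k per-pixel filter is applied at each pyramid level and that each level is downsampled by a fixed factor; the function names, the kernel size of 11, and the 5 levels are illustrative assumptions, not values taken from the text above.

```python
def max_displacement(kernel_size, num_levels, scale=2):
    """Largest per-axis shift, in full-resolution pixels, reachable when one
    k x k per-pixel filter is applied at every level of a pyramid.

    Assumption: level 0 is full resolution and level l is downsampled by
    scale**l, so a one-pixel shift there equals scale**l full-res pixels.
    """
    radius = kernel_size // 2  # a k x k filter moves a pixel by at most k // 2
    return sum(radius * scale ** level for level in range(num_levels))


def single_scale_kernel_size(kernel_size, num_levels, scale=2):
    """Kernel size a single-resolution filter would need for the same reach."""
    return 2 * max_displacement(kernel_size, num_levels, scale) + 1


# Illustrative numbers only (hypothetical configuration).
reach = max_displacement(kernel_size=11, num_levels=5)
equivalent = single_scale_kernel_size(kernel_size=11, num_levels=5)
print(f"reach: {reach} px, equivalent single-scale kernel: {equivalent} x {equivalent}")
```

Under these assumptions the per-level filter stays small while the reachable displacement grows geometrically with the number of levels, which is the motivation the abstract gives for avoiding large learned filters.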
Shu Kong, Charless Fowlkes
Last update: Nov. 28, 2018.
We propose a simple, interpretable framework for solving a wide range of image reconstruction problems such as denoising and deconvolution. Given a corrupted input image, the model synthesizes a spatially varying linear filter which, when applied to the input image, reconstructs the desired output. The model parameters are learned using supervised or self-supervised training. We test this model on three tasks: non-uniform motion blur removal, lossy-compression artifact reduction and single image super resolution. We demonstrate that our model substantially outperforms state-of-the-art methods on all these tasks and is significantly faster than optimization-based approaches to deconvolution. Unlike models that directly predict output pixel values, the predicted filter flow is controllable and interpretable, which we demonstrate by visualizing the space of predicted filters for different tasks.
keywords: inverse problem, spatially-variant blind deconvolution, low-level vision, non-uniform motion blur removal, compression artifact removal, single image super-resolution, filter flow, interpretable model, per-pixel twist, self-supervised learning, image distribution learning.
Using a CNN to directly predict a well regularized solution is orders of magnitude faster than expensive iterative optimization.
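The core operation described above, applying a spatially varying linear filter to the input image, can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the filters are written by hand here rather than predicted by a CNN, and the function names, the row-major kernel layout, and the replicate-border handling are all assumptions made for the example.

```python
def one_hot_kernel(k, dy, dx):
    """k x k kernel (row-major list) that copies the neighbor at offset (dy, dx)."""
    r = k // 2
    kernel = [0.0] * (k * k)
    kernel[(dy + r) * k + (dx + r)] = 1.0
    return kernel


def apply_filter_flow(image, filters, k):
    """Apply a different k x k linear filter at every pixel.

    image:   H x W nested list of numbers.
    filters: H x W nested list; filters[y][x] is a row-major list of k*k weights.
    Borders are handled by replicating the nearest edge pixel (an assumption).
    """
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    acc += filters[y][x][(dy + r) * k + (dx + r)] * image[yy][xx]
            out[y][x] = acc
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = 3

# Integer shift: every output pixel copies its left neighbor,
# which is what warping with a one-pixel optical flow would do.
shift = [[one_hot_kernel(k, 0, -1) for _ in row] for row in image]
shifted = apply_filter_flow(image, shift, k)

# Half-pixel shift: each output pixel averages itself with its left neighbor.
half = [0.5 * a + 0.5 * b
        for a, b in zip(one_hot_kernel(k, 0, 0), one_hot_kernel(k, 0, -1))]
blend = [[half for _ in row] for row in image]
blended = apply_filter_flow(image, blend, k)
```

A one-hot kernel reproduces an integer-pixel warp, while a kernel with several non-zero weights expresses sub-pixel motion or local blur. Because the output is a linear function of the input through explicit per-pixel weights, the weights themselves can be inspected, which is the sense in which the abstract calls the predicted filter flow interpretable.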
Originally published on the WeChat official account CreateAMind (createamind)