Open Access Proceedings Article

Scaling Autoregressive Video Models

TL;DR
It is shown that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism.
Abstract
Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models attempt to address these issues by combining sometimes complex, often video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high-quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve highly competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism. Furthermore, we find that our models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large-scale action recognition dataset comprising YouTube videos exhibiting phenomena such as camera movement, complex object interactions and diverse human movement. To our knowledge, this is the first promising application of video-generation models to videos of this complexity.
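The core mechanism named in the abstract can be pictured concretely. Below is a minimal sketch of autoregressive three-dimensional self-attention in PyTorch, assuming video features of shape (batch, time, height, width, channels); all names, shapes and projection matrices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def video_self_attention(x, w_q, w_k, w_v, causal=True):
    """x: (batch, T, H, W, C) video features; w_*: (C, C) projections."""
    b, t, h, w, c = x.shape
    seq = x.reshape(b, t * h * w, c)             # flatten space-time into one sequence
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v    # query/key/value projections
    scores = q @ k.transpose(-2, -1) / c ** 0.5  # scaled dot-product scores
    if causal:
        # autoregressive mask: each position attends only to itself and earlier ones
        n = t * h * w
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return (F.softmax(scores, dim=-1) @ v).reshape(b, t, h, w, c)
```

Full space-time attention like this is quadratic in T·H·W, which is why practical models restrict attention to blocks or factorize it along axes (see the axial attention reference below).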



Citations
Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Posted Content

Taming Transformers for High-Resolution Image Synthesis

TL;DR: It is demonstrated how combining the inductive bias of CNNs with the expressivity of transformers enables them to model, and thereby synthesize, high-resolution images.
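A rough sketch of the discrete-code bottleneck this combination relies on: the CNN encoder's continuous features are snapped to the nearest entry of a learned codebook, and the resulting token indices are what the transformer models. Shapes and names below are assumptions for illustration, not the paper's code.

```python
import torch

def quantize(z, codebook):
    """z: (batch, n, d) encoder features; codebook: (K, d) learned codes."""
    dists = torch.cdist(z, codebook.unsqueeze(0))  # distance to every code
    idx = dists.argmin(dim=-1)                     # nearest codebook entry per position
    return codebook[idx], idx                      # quantized features + discrete tokens
```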
Posted Content

Efficient Transformers: A Survey

TL;DR: This paper characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Proceedings Article

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

TL;DR: The Vision Transformer (ViT), as discussed by the authors, applies a pure transformer directly to sequences of image patches and performs very well on image classification tasks, achieving state-of-the-art results on ImageNet, CIFAR-100, VTAB, and other benchmarks.
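The "16x16 words" idea reduces to a simple preprocessing step: cut the image into fixed-size patches, flatten them, and linearly embed each one as a token. A minimal sketch, with patch size, image size and embedding width chosen purely for illustration:

```python
import torch
import torch.nn as nn

def patchify(img, p=16):
    """img: (batch, C, H, W) -> (batch, num_patches, C*p*p)."""
    b, c, h, w = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)           # (b, c, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

embed = nn.Linear(3 * 16 * 16, 768)                   # patch -> token embedding
tokens = embed(patchify(torch.randn(1, 3, 224, 224))) # (1, 196, 768) token sequence
```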
Posted Content

Axial Attention in Multidimensional Transformers

TL;DR: Axial Transformers are proposed: a self-attention-based autoregressive model for images and other data organized as high-dimensional tensors that maintains both full expressiveness over joint data distributions and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation.
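The axial factorization can be sketched in a few lines: rather than attending over all H*W positions jointly, attention runs along one axis at a time, which keeps cost near-linear in the tensor's side lengths. The functions below are an illustrative simplification (queries, keys and values share one tensor), not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def attend_1d(x):
    """Plain self-attention over the second-to-last (sequence) axis. x: (..., n, d)."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

def axial_attention_2d(x):
    """x: (batch, H, W, d). Attend along rows, then along columns."""
    x = attend_1d(x)                                  # attend along W (within each row)
    x = attend_1d(x.transpose(1, 2)).transpose(1, 2)  # attend along H (within each column)
    return x
```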
References
Journal ArticleDOI

Image quality assessment: from error visibility to structural similarity

TL;DR: In this article, a structural similarity (SSIM) index is proposed for image quality assessment based on the degradation of structural information, and is validated against subjective ratings on a database of images compressed with JPEG and JPEG2000.
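For reference, the structural similarity index between two aligned image patches x and y takes its standard published form, with small constants C1 and C2 stabilizing the divisions:

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x\mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```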
Posted Content

Attention Is All You Need

TL;DR: The Transformer, a simple network architecture based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
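The mechanism at the heart of the Transformer is scaled dot-product attention, in its standard form:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```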
Posted Content

GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium

TL;DR: In this article, a two time-scale update rule (TTUR) was proposed for training GANs with stochastic gradient descent on arbitrary GAN loss functions, using separate learning rates for the discriminator and the generator.
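In practice the two time scales amount to giving the discriminator a larger step size than the generator. A minimal sketch with placeholder networks and illustrative learning rates (the 4e-4 / 1e-4 pair is a common choice, not necessarily the paper's setting):

```python
import torch
import torch.nn as nn

generator = nn.Linear(64, 784)     # placeholder networks; real GANs are deeper
discriminator = nn.Linear(784, 1)

# TTUR: separate, unequal learning rates for the two players
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=4e-4)
gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
```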
Proceedings ArticleDOI

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
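The "inflation" trick that lets I3D reuse ImageNet weights can be sketched directly: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so activations keep their magnitude. Function and tensor names here are illustrative:

```python
import torch

def inflate_2d_to_3d(w2d, t):
    """w2d: (out_c, in_c, kH, kW) 2D conv weights -> (out_c, in_c, t, kH, kW)."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t  # repeat over time, rescale by 1/t
```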
Posted Content

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting

TL;DR: This paper proposes the convolutional LSTM (ConvLSTM) and uses it to build an end-to-end trainable model for the precipitation nowcasting problem, showing that it captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-the-art operational ROVER algorithm.
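The ConvLSTM replaces the fully connected transitions of a standard LSTM with convolutions. As commonly stated (with * denoting convolution and ∘ the Hadamard product):

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```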