Proceedings ArticleDOI

Multiview Transformers for Video Recognition

TL;DR
This work presents Multiview Transformers for Video Recognition (MTV), a model that consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views and achieves state-of-the-art results on six standard datasets.
Abstract
Video understanding requires reasoning at multiple spatiotemporal resolutions – from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic.
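The abstract's core idea (separate per-view encoders joined by lateral connections) can be illustrated with a minimal NumPy sketch. This is not the actual scenic implementation: the encoder stand-in, the cross-view-attention fusion, and all shapes and names here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, w):
    """Stand-in for one transformer encoder: a linear map plus ReLU."""
    return np.maximum(tokens @ w, 0.0)

# Two "views" of the same clip: a fine view (many short tubelets) and a
# coarse view (few long tubelets). Token counts and width are illustrative.
d = 64
fine = rng.standard_normal((128, d))     # 128 fine-grained tokens
coarse = rng.standard_normal((16, d))    # 16 coarse tokens

w_fine = rng.standard_normal((d, d))
w_coarse = rng.standard_normal((d, d))
fine_feat = encode(fine, w_fine)
coarse_feat = encode(coarse, w_coarse)

# Lateral connection: the coarse view attends to the fine view's tokens,
# fusing information across views with a residual update.
scores = coarse_feat @ fine_feat.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
coarse_feat = coarse_feat + weights @ fine_feat

# Clip representation: pool each view's tokens and concatenate.
clip_repr = np.concatenate([fine_feat.mean(0), coarse_feat.mean(0)])
print(clip_repr.shape)  # (128,)
```

The key property the paper ablates is where and how these lateral connections fuse the views; the residual cross-attention above is just one plausible variant.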



Citations
Proceedings ArticleDOI

Masked Autoencoders As Spatiotemporal Learners

TL;DR: It is shown that the MAE method can learn strong representations with almost no inductive bias on spacetime, that spacetime-agnostic random masking performs best, and that the general framework of masked autoencoding can be a unified methodology for representation learning with minimal domain knowledge.
Posted ContentDOI

Human Action Recognition from Various Data Modalities: A Review

TL;DR: This paper reviews both the hand-crafted feature-based and deep learning-based methods for single data modalities and also the methods based on multiple modalities, including the fusion-based frameworks and the co-learning-based approaches for HAR.
Proceedings ArticleDOI

MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

TL;DR: This work introduces MERLOT RESERVE, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames, and obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.
Proceedings ArticleDOI

Expanding Language-Image Pretrained Models for General Video Recognition

TL;DR: To capture the long-range dependencies of frames along the temporal dimension, a cross-frame attention mechanism that explicitly exchanges information across frames is proposed that is lightweight and can be plugged into pretrained language-image models seamlessly.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
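The attention mechanism this reference introduces can be sketched in a few lines. This is a hedged NumPy illustration of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, with illustrative shapes; it omits multi-head projections and masking.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 queries of width d_k = 8
k = rng.standard_normal((6, 8))   # 6 keys
v = rng.standard_normal((6, 8))   # 6 values
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value rows, weighted by query-key similarity.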
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
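The residual framework this reference describes reduces to adding an identity shortcut around a learned function F, so the layers fit y = x + F(x). A minimal NumPy sketch (weights and shapes are illustrative, not the paper's actual block):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): layers learn the residual F rather than the full map."""
    h = np.maximum(x @ w1, 0.0)   # first "layer" with ReLU
    return x + h @ w2             # identity shortcut added back

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32))

# With zero weights the residual F(x) vanishes, so the block is exactly
# the identity -- which is why very deep residual stacks stay trainable.
zeros = np.zeros((32, 32))
print(np.allclose(residual_block(x, zeros, zeros), x))  # True
```

The shortcut gives gradients a direct path through the network, which is the empirical reason the paper's much deeper networks remain easy to optimize.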
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieves a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).