Open Access Book Chapter (DOI)

Multi-modal Transformer for Video Retrieval

TL;DR
A multi-modal transformer jointly encodes the different modalities in video, allowing each of them to attend to the others; this novel framework establishes state-of-the-art results for video retrieval on three datasets.
Abstract
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.
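To make the joint encoding concrete, below is a minimal sketch of the core idea, not the authors' released code: the layer sizes, modality count, and use of a stock transformer encoder are illustrative assumptions. Per-modality feature sequences are tagged with learned modality and temporal embeddings, concatenated, and passed through a transformer encoder so that every modality can attend to every other.

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, d_model=512, n_modalities=3, max_len=64,
                 nhead=8, num_layers=4):
        super().__init__()
        self.modality_emb = nn.Embedding(n_modalities, d_model)
        self.temporal_emb = nn.Embedding(max_len, d_model)  # assumes seq_len <= max_len
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, features):
        # features: list of (batch, seq_len_m, d_model) tensors, one per modality
        tagged = []
        for m, feats in enumerate(features):
            b, t, d = feats.shape
            mod = self.modality_emb.weight[m].view(1, 1, d)   # which modality
            pos = self.temporal_emb(torch.arange(t, device=feats.device)).unsqueeze(0)
            tagged.append(feats + mod + pos)                  # tag with modality + time
        x = torch.cat(tagged, dim=1)   # all modalities in one sequence
        return self.encoder(x)         # every token attends to every other token
```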


Citations
Posted Content

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

TL;DR: Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Journal Article (DOI)

A Survey on Vision Transformer

TL;DR: Transformer, as discussed by the authors, is a type of deep neural network based mainly on the self-attention mechanism; it has been applied to the field of natural language processing and has been shown to perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks.
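For reference, a compact sketch of the self-attention operation the survey centres on. This is a simplified single-head version with plain projection matrices, not any specific model from the survey:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq, d); w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # every position builds its output as a weighted sum over all positions
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```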
Posted Content

Support-set bottlenecks for video-text representation learning

TL;DR: This paper proposes a novel method that leverages a generative model to naturally push related samples together, resulting in representations that explicitly encode semantics shared between samples, unlike noise-contrastive learning.
Proceedings Article (DOI)

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

TL;DR: ClipBERT, as presented in this paper, employs sparse sampling: only a single short clip or a few sparsely sampled short clips from a video are used at each training step, enabling affordable end-to-end learning for video-and-language tasks.
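A hedged sketch of what sparse sampling amounts to in practice; the clip count and clip length below are illustrative assumptions, not ClipBERT's actual configuration:

```python
import random

def sample_sparse_clips(num_frames, n_clips=2, clip_len=8):
    """Return frame indices for a few randomly placed short clips,
    instead of densely pre-extracting features from the whole video."""
    clips = []
    for _ in range(n_clips):
        start = random.randint(0, max(0, num_frames - clip_len))
        clips.append(list(range(start, min(start + clip_len, num_frames))))
    return clips
```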
Posted Content

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

TL;DR: This work proposes to avoid manual annotation by generating a large-scale training dataset for video question answering from automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
References
Journal Article (DOI)

Long Short-Term Memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
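A minimal sketch of an LSTM cell in a simplified modern formulation, not the paper's original notation; the additive cell-state update is the "constant error carousel" that lets gradients flow across long time lags:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # one linear map produces all four gate pre-activations
        self.lin = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h, c):
        i, f, g, o = self.lin(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        # additive update: the constant error carousel
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```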
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
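The "one additional output layer" recipe can be sketched as follows, assuming the Hugging Face transformers library is available; the checkpoint name and pooling choice are illustrative assumptions:

```python
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # the single new layer added for the downstream task
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)
```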
Proceedings Article (DOI)

Densely Connected Convolutional Networks

TL;DR: DenseNet as mentioned in this paper proposes to connect each layer to every other layer in a feed-forward fashion, which can alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
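A toy dense block illustrating the connectivity pattern; the growth rate and layer count are arbitrary here, and batch normalisation is omitted for brevity:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for conv in self.layers:
            # every layer consumes all earlier feature maps, concatenated
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return torch.cat(feats, dim=1)
```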
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured on a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
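Of the two proposed architectures (CBOW and skip-gram), the skip-gram scoring idea can be sketched as follows; the embedding dimension and separate input/output tables follow common practice rather than a specific implementation:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out_emb = nn.Embedding(vocab_size, dim)  # context-word vectors

    def score(self, center, context):
        # a higher dot product means the word pair is more likely to co-occur
        return (self.in_emb(center) * self.out_emb(context)).sum(-1)
```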
Journal Article (DOI)

Squeeze-and-Excitation Networks

TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels; SE blocks are found to produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
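A minimal SE block sketch; the reduction ratio of 16 follows the paper's default, while the 1x1 convolutions stand in for the paper's fully connected layers:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global context
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel gates in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)  # recalibrate channel-wise responses
```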