On Pursuit of Designing Multi-modal Transformer for Video Grounding

Open Access, Posted Content

Meng Cao, +4 more

13 Sep 2021

arXiv: Computer Vision and Pattern Recog...

TLDR

This paper reformulates video grounding as a set prediction task and proposes a novel end-to-end multi-modal Transformer model, dubbed GTR.

References

Proceedings Article

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

11 Oct 2018

arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Proceedings Article

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, +5 more

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.

Posted Content

Attention Is All You Need

Ashish Vaswani, +7 more

12 Jun 2017

arXiv: Computation and Language

TL;DR: This work proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.

Posted Content

Decoupled Weight Decay Regularization

Ilya Loshchilov, +1 more

14 Nov 2017

arXiv: Learning

TL;DR: This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.

Related Papers

Open domain video natural language description generation method based on multi-modal feature fusion

Video Multitask Transformer Network

Video semantic analysis method

Spatiotemporal Transformer for Video-based Person Re-identification