Journal ArticleDOI

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

TLDR
In this article, a CLIP4Clip model is proposed to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner.
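The retrieval idea in the TL;DR can be illustrated with a minimal sketch, assuming CLIP-style frame and text embeddings (random stand-in vectors are used here in place of real CLIP features): mean-pool the frame embeddings into a single video embedding, then rank videos by cosine similarity with the text query. This mirrors the parameter-free ("mean pooling") similarity variant studied in the paper, not its full set of similarity heads.

```python
import math
import random

def mean_pool_similarity(frame_embeddings, text_embedding):
    """Score a video against a text query by mean-pooling its frame
    embeddings and taking cosine similarity (a sketch of the
    parameter-free similarity variant studied in CLIP4Clip)."""
    dim = len(text_embedding)
    # Average the frame embeddings dimension-wise into one video embedding.
    video = [sum(f[d] for f in frame_embeddings) / len(frame_embeddings)
             for d in range(dim)]
    v_norm = math.sqrt(sum(x * x for x in video))
    t_norm = math.sqrt(sum(x * x for x in text_embedding))
    return sum(v * t for v, t in zip(video, text_embedding)) / (v_norm * t_norm)

# Toy stand-ins for CLIP features (real CLIP embeddings are 512-d).
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 frames
query = [random.gauss(0, 1) for _ in range(8)]
score = mean_pool_similarity(frames, query)  # cosine similarity in [-1, 1]
```

In a real pipeline, `frames` would come from CLIP's image encoder applied to sampled video frames and `query` from its text encoder; retrieval then sorts candidate videos by this score.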
About
This article was published in Neurocomputing on 2022-10-01. It has received 41 citations to date. The article focuses on the topics: Closed captioning & Computer science.



Citations
Journal ArticleDOI

VLP: A Survey on Vision-language Pre-training

TL;DR: The emergence of pre-training models has recently brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era.
Book ChapterDOI

Multi-query Video Retrieval

TL;DR: In this paper, the authors focus on the less-studied setting of multi-query video retrieval, where multiple descriptions are provided to the model for searching over the video archive.
Journal ArticleDOI

Prompting Visual-Language Models for Efficient Video Understanding

TL;DR: In this article, a few random vectors, termed "continuous prompt vectors", are used to convert video-related tasks into the same format as the pre-training objectives, and temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
Journal ArticleDOI

MotionCLIP: Exposing Human Motion Generation to CLIP Space

TL;DR: MotionCLIP is a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions.
Book ChapterDOI

gScoreCAM: What Objects Is CLIP Looking At?

TL;DR: The authors propose gScoreCAM, a state-of-the-art method for visualizing the main objects that CLIP is looking at in an image.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
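The "constant error carousel" mentioned in the TL;DR can be sketched as a single LSTM step: multiplicative gates decide what enters, leaves, and is exposed from an additively updated cell state, which is what lets error signals survive long time lags. A minimal scalar (1-dimensional) sketch with hypothetical weight names, not a full implementation:

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One scalar LSTM step. The cell state c is updated additively
    (the 'constant error carousel'), with sigmoid gates controlling
    input, forgetting, and output. W holds illustrative gate weights."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sigmoid(W["wi"] * x + W["ui"] * h_prev + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h_prev + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_prev + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h_prev + W["bg"])  # candidate value
    c = f * c_prev + i * g        # additive cell update (the carousel)
    h = o * math.tanh(c)          # hidden state exposed to the next layer
    return h, c

# Unroll over a short input sequence with uniform toy weights.
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, -0.5, 0.25):
    h, c = lstm_step(x, h, c, W)
```

Note the forget gate shown here is the now-standard variant; the original 1997 formulation lacked it and kept the carousel's self-connection fixed at 1.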
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
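The metric in the TL;DR can be sketched as follows: BLEU is a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A minimal single-reference, unsmoothed sentence-level sketch (real BLEU is corpus-level and typically uses n up to 4):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count to its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

score = bleu("the cat sat on the mat".split(),
             "the cat sat on the mat".split())  # identical sentences score 1.0
```

Clipping prevents a candidate from gaming the metric by repeating a correct word, and the brevity penalty prevents gaming it with very short outputs.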
Proceedings ArticleDOI

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Journal ArticleDOI

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

TL;DR: The Visual Genome dataset contains over 108k images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
Proceedings ArticleDOI

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.
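The consensus idea in the TL;DR can be sketched in simplified form: represent captions as TF-IDF-weighted n-gram vectors (so n-grams common across the corpus count less) and score a candidate by its average cosine similarity to the reference captions. This is a toy single-n sketch; real CIDEr averages over n = 1..4 and applies additional scaling:

```python
import math
from collections import Counter

def cider_like(candidate, references, corpus, n=1):
    """Toy consensus metric in the spirit of CIDEr: TF-IDF-weighted
    n-gram vectors compared by cosine similarity, averaged over the
    reference captions. `corpus` is a list of tokenized captions used
    to compute document frequencies."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    def tfidf(counts):
        total = sum(counts.values())
        return {g: (c / total) * math.log(
                    len(corpus) /
                    max(1, sum(1 for doc in corpus if g in ngrams(doc))))
                for g, c in counts.items()}

    def cosine(a, b):
        num = sum(a[g] * b.get(g, 0.0) for g in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    cand_vec = tfidf(ngrams(candidate))
    return sum(cosine(cand_vec, tfidf(ngrams(r)))
               for r in references) / len(references)

corpus = [["a", "cat", "sits", "on", "a", "mat"],
          ["a", "dog", "runs", "fast"]]
score = cider_like(["a", "cat", "sits"],
                   [["a", "cat", "sits", "on", "a", "mat"]], corpus)
```

The IDF weighting is what captures "consensus": n-grams that appear in most captions (like "a") contribute little, while distinctive content words dominate the score.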