Journal ArticleDOI

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

TLDR
In this article, a CLIP4Clip model is proposed to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner.
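The retrieval idea in the TL;DR can be illustrated with a minimal sketch, assuming CLIP-style frame and text embeddings (random stand-in vectors are used here in place of real CLIP features): mean-pool the frame embeddings into a single video embedding, then rank videos by cosine similarity with the text query. This mirrors the parameter-free ("mean pooling") similarity variant studied in the paper, not its full set of similarity heads.

```python
import math
import random

def mean_pool_similarity(frame_embeddings, text_embedding):
    """Score a video against a text query by mean-pooling its frame
    embeddings and taking cosine similarity (a sketch of the
    parameter-free similarity variant studied in CLIP4Clip)."""
    dim = len(text_embedding)
    # Average the frame embeddings dimension-wise into one video embedding.
    video = [sum(f[d] for f in frame_embeddings) / len(frame_embeddings)
             for d in range(dim)]
    v_norm = math.sqrt(sum(x * x for x in video))
    t_norm = math.sqrt(sum(x * x for x in text_embedding))
    return sum(v * t for v, t in zip(video, text_embedding)) / (v_norm * t_norm)

# Toy stand-ins for CLIP features (real CLIP embeddings are 512-d).
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 frames
query = [random.gauss(0, 1) for _ in range(8)]
score = mean_pool_similarity(frames, query)  # cosine similarity in [-1, 1]
```

In a real pipeline, `frames` would come from CLIP's image encoder applied to sampled video frames and `query` from its text encoder; retrieval then sorts candidate videos by this score.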
About
This article was published in Neurocomputing on 2022-10-01. It has received 41 citations to date. The article focuses on the topics: Closed captioning & Computer science.



Citations
Journal ArticleDOI

VLP: A Survey on Vision-language Pre-training

TL;DR: The emergence of pre-training models has recently brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era.
Book ChapterDOI

Multi-query Video Retrieval

TL;DR: In this paper, the authors focus on the less-studied setting of multi-query video retrieval, where multiple descriptions are provided to the model for searching over the video archive.
Journal ArticleDOI

Prompting Visual-Language Models for Efficient Video Understanding

TL;DR: In this article, a few random vectors, termed "continuous prompt vectors", are used to convert video-related tasks into the same format as the pre-training objectives, and temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
Journal ArticleDOI

MotionCLIP: Exposing Human Motion Generation to CLIP Space

TL;DR: MotionCLIP is a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions.
Book ChapterDOI

gScoreCAM: What Objects Is CLIP Looking At?

TL;DR: The authors propose gScoreCAM, a state-of-the-art method for visualizing the main objects that CLIP is looking at in an image.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
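The "constant error carousel" mentioned in the TL;DR can be sketched as a single LSTM step: multiplicative gates decide what enters, leaves, and is exposed from an additively updated cell state, which is what lets error signals survive long time lags. A minimal scalar (1-dimensional) sketch with hypothetical weight names, not a full implementation:

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One scalar LSTM step. The cell state c is updated additively
    (the 'constant error carousel'), with sigmoid gates controlling
    input, forgetting, and output. W holds illustrative gate weights."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sigmoid(W["wi"] * x + W["ui"] * h_prev + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h_prev + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_prev + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h_prev + W["bg"])  # candidate value
    c = f * c_prev + i * g        # additive cell update (the carousel)
    h = o * math.tanh(c)          # hidden state exposed to the next layer
    return h, c

# Unroll over a short input sequence with uniform toy weights.
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, -0.5, 0.25):
    h, c = lstm_step(x, h, c, W)
```

Note the forget gate shown here is the now-standard variant; the original 1997 formulation lacked it and kept the carousel's self-connection fixed at 1.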
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
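The metric in the TL;DR can be sketched as follows: BLEU is a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A minimal single-reference, unsmoothed sentence-level sketch (real BLEU is corpus-level and typically uses n up to 4):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count to its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

score = bleu("the cat sat on the mat".split(),
             "the cat sat on the mat".split())  # identical sentences score 1.0
```

Clipping prevents a candidate from gaming the metric by repeating a correct word, and the brevity penalty prevents gaming it with very short outputs.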
Proceedings ArticleDOI

Learning Spatiotemporal Features with 3D Convolutional Networks

TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Journal ArticleDOI

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

TL;DR: The Visual Genome dataset contains over 108k images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
Proceedings ArticleDOI

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.
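The consensus idea in the TL;DR can be sketched in simplified form: represent captions as TF-IDF-weighted n-gram vectors (so n-grams common across the corpus count less) and score a candidate by its average cosine similarity to the reference captions. This is a toy single-n sketch; real CIDEr averages over n = 1..4 and applies additional scaling:

```python
import math
from collections import Counter

def cider_like(candidate, references, corpus, n=1):
    """Toy consensus metric in the spirit of CIDEr: TF-IDF-weighted
    n-gram vectors compared by cosine similarity, averaged over the
    reference captions. `corpus` is a list of tokenized captions used
    to compute document frequencies."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    def tfidf(counts):
        total = sum(counts.values())
        return {g: (c / total) * math.log(
                    len(corpus) /
                    max(1, sum(1 for doc in corpus if g in ngrams(doc))))
                for g, c in counts.items()}

    def cosine(a, b):
        num = sum(a[g] * b.get(g, 0.0) for g in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    cand_vec = tfidf(ngrams(candidate))
    return sum(cosine(cand_vec, tfidf(ngrams(r)))
               for r in references) / len(references)

corpus = [["a", "cat", "sits", "on", "a", "mat"],
          ["a", "dog", "runs", "fast"]]
score = cider_like(["a", "cat", "sits"],
                   [["a", "cat", "sits", "on", "a", "mat"]], corpus)
```

The IDF weighting is what captures "consensus": n-grams that appear in most captions (like "a") contribute little, while distinctive content words dominate the score.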