What are the most significant papers on Multimodal Retrieval-Augmented Generation? (5 answers)

The most significant papers on Multimodal Retrieval-Augmented Generation include works that leverage human-written references as memory to enhance text generation. One notable paper proposes the Selfmem framework, which iteratively uses a retrieval-augmented generator to build a memory pool and selects one output as the memory for subsequent generations, improving text generation quality. Another essential contribution reviews methods that retrieve multimodal knowledge, such as images, code, tables, and audio, to assist generative models, addressing concerns like factuality and interpretability. Additionally, a paper introduces Retrieval-Augmented Generation (RAG) for automated radiology report writing, combining vision-language models for retrieval with generative models for report generation, resulting in better clinical metrics and customizable report content.
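The retrieve-then-generate pattern described above can be sketched in a few lines. This is a minimal toy illustration, not any cited paper's implementation: the bag-of-characters `embed` function, the corpus, and the prompt template are all assumptions standing in for a learned text encoder and a real generator.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) loop.
# embed(), the corpus, and the prompt format are illustrative toys,
# not a specific paper's method.
import math

def embed(text):
    # Toy bag-of-characters embedding; a real system would use a
    # learned text encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Dot product of unit vectors == cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, corpus, k=2):
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def rag_prompt(query, corpus):
    # Prepend the retrieved passages to the query before generation.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Chest radiograph shows clear lungs.",
    "The cat sat on the mat.",
    "Radiology report: no acute findings in the chest.",
]
print(rag_prompt("chest radiology findings", corpus))
```

The key design point is that retrieval and generation are decoupled: the retriever scores documents in an embedding space, and the generator only ever sees the top-k passages spliced into its prompt.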
Is CLIP effective for mobile UI representation? (5 answers)

CLIP, a deep-learning approach for denoising mobile UI layouts, has been proposed to improve their representation. The CLAY pipeline, which uses CLIP, automatically improves existing mobile UI layout datasets by removing incorrect nodes and assigning a semantically meaningful type to each node. The deep models in CLIP achieve high accuracy both in detecting layout objects that lack a valid visual representation and in recognizing object types, significantly outperforming a heuristic baseline and reducing the need for manual labeling. CLIP is therefore effective for mobile UI representation: it improves the quality of UI layout datasets and enhances the semantic understanding of mobile screens.
Is there a CLIP model trained on radiology images and reports? (5 answers)

Yes. Nazarov et al. and Van Uden et al. propose a machine-learning approach that uses CLIP, a multimodal self-supervised model, for interstitial lung disease (ILD) classification. They integrate CLIP throughout their workflow, from extracting image patches from CT scans to classifying ILD using "patch montages". Santurkar et al. also discuss CLIP's ability to leverage the language information present in existing pre-training datasets; studying its transfer performance, they find that CLIP outperforms image-only methods in certain settings. CLIP has therefore been applied to radiology images and reports for ILD classification.
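The mechanism these papers rely on is CLIP's shared image-text embedding space: a scan's image embedding is matched against candidate text prompts by cosine similarity. The sketch below uses hand-picked toy vectors and hypothetical prompt strings, not real CLIP encoder outputs, to show the zero-shot matching step in isolation.

```python
# Sketch of CLIP-style zero-shot classification: an image embedding is
# matched against text-prompt embeddings in a shared space by cosine
# similarity. The 3-d embeddings below are toy vectors, not CLIP outputs.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_classify(image_emb, prompt_embs):
    # Pick the label whose text embedding is closest to the image embedding.
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))

# Toy shared-space embeddings for two hypothetical candidate prompts.
prompts = {
    "a CT scan showing fibrosis": [0.9, 0.1, 0.2],
    "a CT scan with no abnormality": [0.1, 0.9, 0.3],
}
image_embedding = [0.85, 0.15, 0.25]  # stand-in for an image-encoder output
print(zero_shot_classify(image_embedding, prompts))
```

Because classification reduces to nearest-prompt lookup, new disease labels can be added by writing new prompts, with no retraining, which is what makes CLIP attractive for radiology workflows.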
What are all the papers on CLIP (Learning Transferable Visual Models From Natural Language Supervision)? (3 answers)

CLIP (Learning Transferable Visual Models From Natural Language Supervision) has been explored in multiple papers. One proposes a two-stage approach that maps the video embedding space to natural language, achieving state-of-the-art performance on benchmark datasets. Another demonstrates that CLIP's text encoder has a strong ability for phrase understanding and outperforms popular language models such as BERT. A third explores CLIP's potential for predicting visual object relationships and achieves state-of-the-art performance in predicate estimation. Together, these papers highlight the success and versatility of CLIP across vision-and-language tasks.
What is a sound clip? (3 answers)

A sound clip is a device used for purposes such as sound recording, sound attenuation, sound collection, and sound insulation. It typically consists of a clip-like structure that can be attached to different objects. A sound clip may include a control unit with buttons for recording and playback, as well as a voice pickup/output unit for capturing and outputting voice signals. In some designs, it incorporates a sound attenuator whose elements provide acoustic-attenuation properties. There are also clip-type sound-collection devices that use a piezoelectric element in a clip-like structure to pick up vibration sound from a distant object. Furthermore, a separable clip-type transparent soundproof board uses a resilient clip to fix a transparent board to a frame, providing sound insulation and airtightness, and a related sound-insulating board clip, formed from a continuous extrusion, joins sound-insulating ceiling and wall boards to the same end.
What is clipping in word formation? (4 answers)

Clipping is a word-formation process in which a lexeme is shortened while retaining its meaning and word class (e.g., advertisement → ad, examination → exam). It is a productive process in English, used to create shortened word forms by cutting off part of the original word at either its beginning or its end. Clipping is considered a largely predictable process, with several clipping schemas optimized for processability. Cognitive, discourse-pragmatic, and phonological factors influence the formation of clipped words, including the principle of least effort, the recoverability of the source word, and constraints of stress and syllable structure. Analysis of a large database of English clippings using Hierarchical Configural Frequency Analysis reveals regularities in how clippings are created.