Image Retrieval Using Textual Cues
Anand Mishra, Karteek Alahari, C. V. Jawahar
- pp 3040-3047
TLDR
An approach for the text-to-image retrieval problem based on textual content present in images, where the retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M.
Abstract:
We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method that does not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as on three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.
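The query-driven idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that some character detector has already produced, for each image, a list of hypothetical (character, x-centre, confidence) detections, and it reduces the "spatial constraints" to a simple left-to-right ordering check. The names `Detection`, `score_image` and `rank_images` are made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical detection format (an assumption, not from the paper):
# (character, x-centre in pixels, detector confidence in [0, 1])
Detection = Tuple[str, float, float]


def score_image(query: str, detections: List[Detection]) -> float:
    """Mean confidence of the best set of detections that spell `query`
    in left-to-right order; 0.0 if the query cannot be spelled."""
    query = query.lower()
    n = len(query)
    if n == 0:
        return 0.0
    neg = float("-inf")
    # best[k] = best total confidence for matching the first k query
    # characters using the detections scanned so far (sorted by x).
    best = [0.0] + [neg] * n
    for ch, _x, conf in sorted(detections, key=lambda d: d[1]):
        # Go through k in decreasing order so one detection fills one slot only.
        for k in range(n, 0, -1):
            if query[k - 1] == ch.lower() and best[k - 1] != neg:
                best[k] = max(best[k], best[k - 1] + conf)
    return best[n] / n if best[n] != neg else 0.0


def rank_images(query: str,
                database: Dict[str, List[Detection]]) -> List[Tuple[str, float]]:
    """Return (image name, score) pairs sorted from best to worst match."""
    scored = [(name, score_image(query, dets)) for name, dets in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this sketch, an image whose detections contain "c", "a", "t" from left to right would rank above one containing the same characters in the order "t", "a", "c", which scores zero for the query "cat". A fuller treatment would also constrain character spacing and vertical alignment, which this ordering-only check ignores.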
Citations
Journal Article
Reading Text in the Wild with Convolutional Neural Networks
TL;DR: An end-to-end system for text spotting—localising and recognising text in natural scene images—and text based image retrieval and a real-world application to allow thousands of hours of news footage to be instantly searchable via a text query is demonstrated.
Journal Article
Word Spotting and Recognition with Embedded Attributes
TL;DR: An approach in which both word images and text strings are embedded in a common vectorial subspace, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem; the resulting representation is very fast to compute and, especially, to compare.
Proceedings Article
Scene Text Visual Question Answering
Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas
TL;DR: The ST-VQA dataset is introduced, together with a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer.
Posted Content
Scene Text Visual Question Answering
Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas
TL;DR: A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process and proposes a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module.
Proceedings Article
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
TL;DR: A multimodal transformer architecture accompanied by a rich representation for text in images is proposed, which naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra-modality context.
References
Proceedings Article
Scene Text Recognition Using Part-Based Tree-Structured Character Detection
TL;DR: A novel scene text recognition method using part-based tree-structured character detection that outperforms state-of-the-art methods significantly both for character detection and word recognition.