Image Retrieval Using Textual Cues

doi:10.1109/ICCV.2013.378

Home
/
Papers
/
Image Retrieval Using Textual Cues

Proceedings Article•DOI•

Image Retrieval Using Textual Cues

Anand Mishra¹, Karteek Alahari², C. V. Jawahar¹•Institutions (2)

International Institute of Information Technology, Hyderabad¹, French Institute for Research in Computer Science and Automation²

01 Dec 2013-pp 3040-3047

TL;DR: An approach for the text-to-image retrieval problem based on textual content present in images, where the retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M.

read less

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, we introduce.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Reading Text in the Wild with Convolutional Neural Networks

[...]

Max Jaderberg¹, Karen Simonyan¹, Andrea Vedaldi¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2016-International Journal of Computer Vision

TL;DR: An end-to-end system for text spotting—localising and recognising text in natural scene images—and text based image retrieval and a real-world application to allow thousands of hours of news footage to be instantly searchable via a text query is demonstrated.

...read moreread less

Abstract: In this work we present an end-to-end system for text spotting--localising and recognising text in natural scene images--and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

...read moreread less

1,054 citations

Cites background from "Image Retrieval Using Textual Cues"

...STR IIIT STR (Mishra et al. 2013) text retrieval dataset 50 10k...
[...]
...IIIT Scene Text Retrieval (STR) [40] is a text based image retrieval dataset also collected with Google image search....
[...]
...The IIIT Sports-10k dataset [40] is another text based image retrieval dataset constructed from frames of sports video....
[...]
...2014), or on combining both in end-to-end systems (Alsharif and Pineau 2014; Gordo 2014; Jaderberg et al. 2014; Mishra et al. 2013; Neumann and Matas 2010, 2011, 2012, 2013; Posner et al. 2010; Quack 2009; Wang et al. 2011, 2012; Weinman et al. 2014)....
[...]
...Sports IIIT Sports-10k (Mishra et al. 2013) text retrieval dataset 10 10k...
[...]

Journal Article•DOI•

Word Spotting and Recognition with Embedded Attributes

[...]

Jon Almazan¹, Albert Gordo², Alicia Fornés¹, Ernest Valveny¹•Institutions (2)

Autonomous University of Barcelona¹, Xerox²

17 Jul 2014-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: An approach in which both word images and text strings are embedded in a common vectorial subspace, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem and is very fast to compute and, especially, to compare.

...read moreread less

Abstract: This paper addresses the problems of word spotting and word recognition on images. In word spotting, the goal is to find all instances of a query word in a dataset of images. In recognition, the goal is to recognize the content of the word image, usually aided by a dictionary or lexicon. We describe an approach in which both word images and text strings are embedded in a common vectorial subspace. This is achieved by a combination of label embedding and attributes learning, and a common subspace regression. In this subspace, images and strings that represent the same word are close together, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem. Contrary to most other existing methods, our representation has a fixed length, is low dimensional, and is very fast to compute and, especially, to compare. We test our approach on four public datasets of both handwritten documents and natural images showing results comparable or better than the state-of-the-art on spotting and recognition tasks.

...read moreread less

522 citations

Additional excerpts

...The final PHOC histogram is the concatenation of these partial histograms....
[...]

Proceedings Article•DOI•

Scene Text Visual Question Answering

[...]

Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez¹, Marçal Rusiñol, C. V. Jawahar², Ernest Valveny¹, Dimosthenis Karatzas - Show less +4 more•Institutions (2)

Autonomous University of Barcelona¹, International Institute of Information Technology, Hyderabad²

01 Oct 2019

TL;DR: The ST-VQA dataset as discussed by the authors proposes a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer.

...read moreread less

Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.

...read moreread less

119 citations

Posted Content•

Scene Text Visual Question Answering

[...]

Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas¹ - Show less +4 more•Institutions (1)

Autonomous University of Barcelona¹

31 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process and proposes a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module.

...read moreread less

Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.

...read moreread less

110 citations

Cites methods from "Image Retrieval Using Textual Cues"

...In total, we used six different datasets, namely: ICDAR 2013[27] and ICDAR2015[26], ImageNet [10], VizWiz[18], IIIT Scene Text Retrieval[42], Visual Genome [33] and COCO-Text [53]....
[...]

Proceedings Article•DOI•

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

[...]

Ronghang Hu¹, Amanpreet Singh², Trevor Darrell¹, Marcus Rohrbach²•Institutions (2)

University of California, Berkeley¹, Facebook²

14 Jun 2020

TL;DR: Li et al. as mentioned in this paper propose a multimodal transformer architecture accompanied by a rich representation for text in images, which naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter-and intra-modality context.

...read moreread less

Abstract: Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra- modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.

...read moreread less

104 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

Histograms of oriented gradients for human detection

[...]

Navneet Dalal¹, Bill Triggs¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

20 Jun 2005

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.

...read moreread less

Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

...read moreread less

31,952 citations

Proceedings Article•DOI•

Rapid object detection using a boosted cascade of simple features

[...]

Paul A. Viola¹, Michael Jones•Institutions (1)

Mitsubishi¹

01 Dec 2001

TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.

...read moreread less

Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

...read moreread less

18,620 citations

Book•

Introduction to Information Retrieval

[...]

Christopher D. Manning¹, Prabhakar Raghavan², Hinrich Schütze³•Institutions (3)

Stanford University¹, Google², University of Stuttgart³

01 Jan 2008

TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.

...read moreread less

Abstract: Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. Written from a computer science perspective by three leading experts in the field, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Although originally designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also create a buzz for researchers and professionals alike.

...read moreread less

11,804 citations

Journal Article•DOI•

Object Detection with Discriminatively Trained Part-Based Models

[...]

Pedro F. Felzenszwalb¹, Ross Girshick¹, David McAllester², Deva Ramanan³•Institutions (3)

University of Chicago¹, Toyota², University of California, Irvine³

01 Sep 2010-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.

...read moreread less

Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

...read moreread less

10,501 citations

"Image Retrieval Using Textual Cues" refers methods in this paper

...We analyzed three different variations [10] (13, 31 and 36dimensional) of HOG....
[...]

Proceedings Article•DOI•

Video Google: a text retrieval approach to object matching in videos

[...]

Sivic¹, Zisserman¹•Institutions (1)

University of Oxford¹

13 Oct 2003

TL;DR: An approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video, represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion.

...read moreread less

Abstract: We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieved is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.

...read moreread less

6,938 citations

"Image Retrieval Using Textual Cues" refers methods in this paper

...We take this problem one step further and ask the question:Can we search for query text in a large collection of images and videos, and retrieve all occurrences of the query text? Note that, unlike approaches such as Video Google [21], which retrieve only similar instances of the queried content, our goal is to retrieve instances (text appearing in different places or view points), as well as categories (text in different font styles)....
[...]
...bag of words based image retrieval [21]....
[...]
...Note that, unlike approaches such as Video Google [21], which retrieve only similar instances of the queried content, our goal is to retrieve instances (text appearing in different places or view points), as well as categories (text in different font styles)....
[...]
...An alternate approach like Video Google [21] enables image search using image as a query, by finding visually similar regions in the database....
[...]