Open Access Journal ArticleDOI

Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification

TLDR
In this article, the authors combine word representations and deep visual features in a globally trainable deep convolutional neural network for fine-grained image classification, where the attention mechanism is adopted to compute the relevance between each recognized word and the given image.
Abstract
Text in natural images contains rich semantics that is often highly relevant to objects or scenes. In this paper, we focus on the problem of fully exploiting scene text for visual understanding. The main idea is to combine word representations and deep visual features in a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Next, we combine the word embeddings of the recognized words and the deep visual features into a single representation that is optimized by a convolutional neural network for fine-grained image classification. In our framework, an attention mechanism is adopted to compute the relevance between each recognized word and the given image, which further enhances recognition performance. We have performed experiments on two datasets: the Con-Text dataset and the Drink Bottle dataset, proposed for fine-grained classification of business places and drink bottles, respectively. The experimental results consistently demonstrate that the proposed method of combining textual and visual cues significantly outperforms classification with only the visual representation. Moreover, we show that the learned representation improves retrieval performance on the drink bottle images by a large margin, making it potentially powerful in product search.
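The fusion step described in the abstract — scoring each recognized word against the image and pooling the word embeddings by attention before concatenating with the visual feature — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the bilinear relevance function, the feature dimensions, and the function names are all assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_text_visual(visual_feat, word_embs, W):
    """Attend over recognized-word embeddings conditioned on the
    visual feature, then concatenate the pooled text feature."""
    # relevance of each recognized word to the image (bilinear score)
    scores = word_embs @ W @ visual_feat       # shape: (num_words,)
    attn = softmax(scores)                     # attention weights, sum to 1
    text_feat = attn @ word_embs               # attention-weighted embedding
    return np.concatenate([visual_feat, text_feat])

rng = np.random.default_rng(0)
v = rng.standard_normal(8)          # toy CNN visual feature
E = rng.standard_normal((3, 5))     # embeddings of 3 recognized words
W = rng.standard_normal((5, 8))     # hypothetical learned bilinear weight
fused = fuse_text_visual(v, E, W)
print(fused.shape)                  # (13,)
```

In the paper this joint representation is trained end to end with the classifier, so the attention weights learn to down-weight recognized words that are irrelevant to the image.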


Citations
Proceedings ArticleDOI

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

TL;DR: This paper proposes to detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions and achieves better or comparable results in both accuracy and efficiency.
Posted Content

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

TL;DR: This paper investigates the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images, and proposes an end-to-end trainable neural network model, named as Mask TextSpotter, which is inspired by the newly published work Mask R-CNN.
Journal ArticleDOI

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

TL;DR: The recognition module of the Mask TextSpotter method is investigated separately, which significantly outperforms state-of-the-art methods on both regular and irregular text datasets for scene text recognition.
Posted Content

Text Recognition in the Wild: A Survey

TL;DR: This literature review presents an overall picture of the field of scene text recognition, providing a comprehensive reference for people entering the field and helping to inspire future research.
Journal ArticleDOI

Text Recognition in the Wild: A Survey

TL;DR: A recent literature review as discussed by the authors summarizes the fundamental problems and the state-of-the-art associated with scene text recognition, introduces new insights and ideas, provides a comprehensive review of publicly available resources, and points out directions for future work.
References
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
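Concretely, the "global log-bilinear regression" referred to above is, in the GloVe paper's notation, a weighted least-squares fit of word-vector dot products to log co-occurrence counts:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})\left( w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i, \tilde{b}_j$ are biases, and $f$ is a weighting function that caps the influence of very frequent co-occurrences.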
Proceedings ArticleDOI

You Only Look Once: Unified, Real-Time Object Detection

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Proceedings ArticleDOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN, as discussed in this paper, combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training on an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.