Open Access Journal ArticleDOI

Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification

TLDR
In this article, the authors combine word representations and deep visual features in a globally trainable deep convolutional neural network for fine-grained image classification, where the attention mechanism is adopted to compute the relevance between each recognized word and the given image.
Abstract
Text in natural images contains rich semantics that is often highly relevant to objects or scenes. In this paper, we focus on the problem of fully exploiting scene text for visual understanding. The main idea is to combine word representations and deep visual features in a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Next, we combine the word embeddings of the recognized words and the deep visual features into a single representation that is optimized by a convolutional neural network for fine-grained image classification. In our framework, an attention mechanism is adopted to compute the relevance between each recognized word and the given image, which further enhances recognition performance. We have performed experiments on two datasets: the Con-Text dataset and the Drink Bottle dataset, proposed for fine-grained classification of business places and drink bottles, respectively. The experimental results consistently demonstrate that the proposed method of combining textual and visual cues significantly outperforms classification with only the visual representation. Moreover, we show that the learned representation improves retrieval performance on the drink bottle images by a large margin, making it potentially powerful in product search.
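The fusion step described in the abstract — scoring each recognized word against the image and pooling the word embeddings by attention before concatenating with the visual feature — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the bilinear relevance function, the feature dimensions, and the function names are all assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_text_visual(visual_feat, word_embs, W):
    """Attend over recognized-word embeddings conditioned on the
    visual feature, then concatenate the pooled text feature."""
    # relevance of each recognized word to the image (bilinear score)
    scores = word_embs @ W @ visual_feat       # shape: (num_words,)
    attn = softmax(scores)                     # attention weights, sum to 1
    text_feat = attn @ word_embs               # attention-weighted embedding
    return np.concatenate([visual_feat, text_feat])

rng = np.random.default_rng(0)
v = rng.standard_normal(8)          # toy CNN visual feature
E = rng.standard_normal((3, 5))     # embeddings of 3 recognized words
W = rng.standard_normal((5, 8))     # hypothetical learned bilinear weight
fused = fuse_text_visual(v, E, W)
print(fused.shape)                  # (13,)
```

In the paper this joint representation is trained end to end with the classifier, so the attention weights learn to down-weight recognized words that are irrelevant to the image.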


Citations
Proceedings ArticleDOI

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

TL;DR: This paper proposes to detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions and achieves better or comparable results in both accuracy and efficiency.
Posted Content

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

TL;DR: This paper investigates the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images, and proposes an end-to-end trainable neural network model, named as Mask TextSpotter, which is inspired by the newly published work Mask R-CNN.
Journal ArticleDOI

Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

TL;DR: The recognition module of the Mask TextSpotter method is investigated separately, which significantly outperforms state-of-the-art methods on both regular and irregular text datasets for scene text recognition.
Posted Content

Text Recognition in the Wild: A Survey

TL;DR: This literature review presents an overall picture of the field of scene text recognition, providing a comprehensive reference for people entering the field and helping to inspire future research.
Journal ArticleDOI

Text Recognition in the Wild: A Survey

TL;DR: A recent literature review as discussed by the authors summarizes the fundamental problems and the state-of-the-art associated with scene text recognition, introduces new insights and ideas, provides a comprehensive review of publicly available resources, and points out directions for future work.
References
Proceedings ArticleDOI

Going deeper with convolutions

TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
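Concretely, the "global log-bilinear regression" referred to above is, in the GloVe paper's notation, a weighted least-squares fit of word-vector dot products to log co-occurrence counts:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})\left( w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i, \tilde{b}_j$ are biases, and $f$ is a weighting function that caps the influence of very frequent co-occurrences.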
Proceedings ArticleDOI

You Only Look Once: Unified, Real-Time Object Detection

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Proceedings ArticleDOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN, as discussed in this paper, combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training on an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.