Proceedings ArticleDOI

An MRF Model for Binarization of Natural Scene Text

18 Sep 2011-pp 11-16
TL;DR: This work represents the pixels in a document image as random variables in an MRF, and introduces a new energy function on these variables to find the optimal binarization, using an iterative graph cut scheme.
Abstract: Inspired by the success of MRF models for solving object segmentation problems, we formulate the binarization problem in this framework. We represent the pixels in a document image as random variables in an MRF, and introduce a new energy (or cost) function on these variables. Each variable takes a foreground or background label, and the quality of the binarization (or labelling) is determined by the value of the energy function. We minimize the energy function, i.e. find the optimal binarization, using an iterative graph cut scheme. Our model is robust to variations in foreground and background colours as we use a Gaussian Mixture Model in the energy function. In addition, our algorithm is efficient to compute, and adapts to a variety of document images. We show results on word images from the challenging ICDAR 2003 dataset, and compare our performance with previously reported methods. Our approach shows significant improvement in pixel level accuracy as well as OCR accuracy.
Topics: Image segmentation (55%), Cut (53%), Mixture model (52%), Gaussian process (51%)
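
The formulation in the abstract above (one foreground/background label per pixel, an energy that scores each labelling, and a Gaussian Mixture Model inside that energy) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: grayscale intensities instead of colour, a contrast-sensitive pairwise term, the parameter names `lam` and `beta`, and exhaustive search in place of the iterative graph cut scheme.

```python
import itertools
import math


def gmm_neg_log_likelihood(value, components):
    """Negative log-likelihood of a scalar under a 1-D Gaussian mixture.

    components is a list of (weight, mean, variance) tuples.
    """
    density = sum(
        w * math.exp(-((value - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        for w, mu, var in components
    )
    return -math.log(density + 1e-300)  # guard against log(0)


def energy(image, labels, fg_gmm, bg_gmm, lam=1.0, beta=0.001):
    """Energy of a labelling: GMM data cost plus a penalty on label changes.

    lam and beta are hypothetical weights chosen for this sketch.
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Unary term: how badly the pixel fits the model of its label.
            gmm = fg_gmm if labels[r][c] == 1 else bg_gmm
            total += gmm_neg_log_likelihood(image[r][c], gmm)
            # Pairwise term over right and down neighbours: a label change
            # is cheap across a strong intensity edge, costly otherwise.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    diff = image[r][c] - image[rr][cc]
                    total += lam * math.exp(-beta * diff * diff)
    return total


def brute_force_binarize(image, fg_gmm, bg_gmm):
    """Try every labelling of a tiny image and keep the lowest-energy one."""
    rows, cols = len(image), len(image[0])
    best_labels, best_energy = None, float("inf")
    for bits in itertools.product((0, 1), repeat=rows * cols):
        labels = [list(bits[r * cols:(r + 1) * cols]) for r in range(rows)]
        e = energy(image, labels, fg_gmm, bg_gmm)
        if e < best_energy:
            best_labels, best_energy = labels, e
    return best_labels, best_energy
```

Exhaustive search is only feasible for a handful of pixels; it stands in for the graph cut step so that the sketch stays self-contained.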
Figures (8)
  • Table I. OCR accuracy (in %)
  • Table II. Binarization results with respect to well-known evaluation measures (average)
  • Figure 4. Comparison of thresholding based algorithms and the proposed method (from left to right: Original, Otsu, Sauvola, Niblack, Kittler, Proposed (with edginess difference)).
  • Table III. Effect of number of GMM components
  • Figure 5. Effect of edginess difference term: (a) Original image (b) Without edginess difference (c) With edginess difference.
  • Figure 3. Images where auto-seeding fails
  • Figure 2. (a) Input image (b) Its foreground-background seeds. Red and blue colours show foreground and background seeds respectively (best viewed in colour).
  • Figure 1. Some sample images we considered in this work
Citations

01 Jan 2011-
TL;DR: A new benchmark dataset for research use is introduced containing over 600,000 labeled digits cropped from Street View images, and variants of two recently proposed unsupervised feature learning methods are employed, finding that they are convincingly superior on benchmarks.
Abstract: Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.

3,956 citations


Cites methods from "An MRF Model for Binarization of Na..."

  • ...Finally, we note that in prior work binarization has been an important component in scene text applications, driven partly by efforts to re-use existing OCR machinery in new domains [24, 25]....


Proceedings ArticleDOI
07 Sep 2009-
TL;DR: A framework is presented that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary, and achieves significant improvement in word recognition accuracies without using a restricted word list.
Abstract: The problem of recognizing text in images taken in the wild has gained significant attention from the computer vision community in recent years. Contrary to recognition of printed documents, recognizing scene text is a challenging problem. We focus on the problem of recognizing text extracted from natural scene images and the web. Significant attempts have been made to address this problem in the recent past. However, many of these works benefit from the availability of strong context, which naturally limits their applicability. In this work we present a framework that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary. We show experimental results on publicly available datasets. Furthermore, we introduce a large challenging word dataset with five thousand words to evaluate various steps of our method exhaustively. The main contributions of this work are: (1) We present a framework, which incorporates higher order statistical language models to recognize words in an unconstrained manner (i.e. we overcome the need for restricted word lists, and instead use an English dictionary to compute the priors). (2) We achieve significant improvement (more than 20%) in word recognition accuracies without using a restricted word list. (3) We introduce a large word recognition dataset (at least 5 times larger than other public datasets) with character level annotation and benchmark it.

651 citations


Journal ArticleDOI
TL;DR: This review provides a fundamental comparison and analysis of the remaining problems in the field and summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems.
Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

615 citations


Cites methods from "An MRF Model for Binarization of Na..."

  • ...Mishra et al. [161] presented a framework that utilizes both bottom-up (character) and top-down (language) cues for text recognition....

  • ...Inspired by the success of CRF models for solving image segmentation problems, Mishra [135] and Kim and Lee [185] formulated the text binarization problem in optimal frameworks and used an energy minimization to label text pixels....


Proceedings ArticleDOI
23 Jun 2014-
TL;DR: This paper proposes a novel multi-scale representation for scene text recognition that consists of a set of detectable primitives, termed as strokelets, which capture the essential substructures of characters at different granularities.
Abstract: Driven by the wide range of applications, scene text detection and recognition have become active research topics in computer vision. Though extensively studied, localizing and reading text in uncontrolled environments remain extremely challenging, due to various interference factors. In this paper, we propose a novel multi-scale representation for scene text recognition. This representation consists of a set of detectable primitives, termed as strokelets, which capture the essential substructures of characters at different granularities. Strokelets possess four distinctive advantages: (1) Usability: automatically learned from bounding box labels, (2) Robustness: insensitive to interference factors, (3) Generality: applicable to variant languages, and (4) Expressivity: effective at describing characters. Extensive experiments on standard benchmarks verify the advantages of strokelets and demonstrate the effectiveness of the proposed algorithm for text recognition.

283 citations


Cites methods from "An MRF Model for Binarization of Na..."

  • ...We intentionally avoid the term “character detection” as certain algorithms (such as [17, 29]) utilize binarization to seek character candidates....

  • ...However, binarization based methods [17, 29] are sensitive to noise, blur and nonuniform illumination; connected component based methods [21, 23] are unable to handle connected characters and...

  • ...To tackle these issues, several approaches were proposed, which employed adaptive binarization [17, 29], connected component extraction [21, 23] or direct character detection [27, 18, 25]....


Book ChapterDOI
01 Nov 2014-
TL;DR: This paper presents a novel approach to recognize text in scene images that outperforms the state-of-the-art techniques significantly and is able to recognize the whole word images without character-level segmentation and recognition.
Abstract: Scene text recognition is a useful but very challenging task due to the uncontrolled conditions of text in natural scenes. This paper presents a novel approach to recognize text in scene images. In the proposed technique, a word image is first converted into sequential column vectors based on Histogram of Oriented Gradient (HOG). The Recurrent Neural Network (RNN) is then adapted to classify the sequential feature vectors into the corresponding word. Compared with most of the existing methods that follow a bottom-up approach to form words by grouping the recognized characters, our proposed method is able to recognize the whole word images without character-level segmentation and recognition. Experiments on a number of publicly available datasets show that the proposed method outperforms the state-of-the-art techniques significantly. In addition, the recognition results on publicly available datasets provide a good benchmark for the future research in this area.

205 citations


Cites methods from "An MRF Model for Binarization of Na..."

  • ...Datasets ICDAR03 (Full) ICDAR03 (50) ICDAR11 (Full) ICDAR11 (50) SVT
    MRF [5] 0.67 0.69 - - -
    IR [7] 0.75 0.77 - - -
    NESP [6] 0.66 - 0.73 -
    PLEX [16] 0.62 0.76 - - 0.57
    HOG + CRF [10] - 0.82 - - 0.73
    PBS [9] 0.79 0.87 0.83 0.87 0.74
    WFST [11] 0.83 - 0.56 - 0.73
    CNN [14] 0.84 0.90 - - 0.70
    Proposed 0.82 0.92 0.83 0.91 0.83
    ICDAR03(FULL) and ICDAR11(FULL) in Table 1), as well as with lexicon consisting of 50 random words from the test set (as denoted by ICDAR03(50) and ICDAR11(50) in Table 1)....

  • ...Several systems have been reported that exploit Markov Random Field [5], Nonlinear color enhancement [6] and Inverse Rendering [7] to extract the character regions....

  • ...The text segmentation methods (MRF, IR, and NESP) produce lower recognition accuracy than other methods because robust and accurate scene text segmentation by itself is a very challenging task....

  • ...We compare our proposed method with eight state-of-the-art techniques, including Markov random field method (MRF) [5], inverse rendering method (IR) [7], nonlinear color enhancement method (NESP) [6], pictorial structure method (PLEX) [16], HOG based conditional random field method (HOG+CRF) [10], weighted finite-state transducers method (WFST) [11], part based tree structure method (PBS) [9] and convolutional neural network method (CNN) [14]....


References

Journal ArticleDOI

31,977 citations


"An MRF Model for Binarization of Na..." refers methods in this paper

  • ...We also compare our method with Otsu followed by colour thresholding (CT) [14]....

  • ...Otsu followed by colour thresholding binarization proposed in [14] improves the word recognition accuracy but not significantly....

  • ...Traditional thresholding based binarization methods fall into two categories: those that use a global threshold for the given document (like Otsu [5], Kittler et al. [6]) and those that use local thresholds (like Sauvola [7], Niblack [8])....

  • ...To evaluate the performance of proposed binarization algorithm, we compare it with the well-known thresholding based binarization techniques like Otsu [5], Sauvola [7], Niblack [8], Kittler et al. [6]....

  • ...Since this dataset consists of images of tight word boundaries, global methods (like [5], [6]) perform better than popular local methods....
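
The excerpts above name Otsu as a global thresholding baseline. As a rough illustration of what a global threshold means, the sketch below picks one cut-off for the whole image by maximising the between-class variance of the intensity histogram, which is the standard Otsu criterion. It assumes 8-bit grayscale pixels supplied as a flat list; the function names are hypothetical, and local methods such as Sauvola or Niblack are not shown.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximises between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the low class so far
    sum0 = 0    # sum of intensities in the low class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue        # low class still empty
        w1 = total - w0
        if w1 == 0:
            break           # high class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Label pixels at or below the threshold as 1 (dark text), others as 0."""
    return [1 if p <= threshold else 0 for p in pixels]
```

Treating dark pixels as text is an assumption of this sketch; scene text is often light on dark, in which case the labels would be inverted.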


Journal ArticleDOI
01 Aug 2004-
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,088 citations


Journal ArticleDOI
TL;DR: This paper compares the running times of several standard min-cut/max-flow algorithms with a new algorithm developed by the authors, which in many cases works several times faster than any of the other methods, making near real-time performance possible.
Abstract: Minimum cut/maximum flow algorithms on graphs have emerged as an increasingly useful tool for exact or approximate energy minimization in low-level vision. The combinatorial optimization literature provides many min-cut/max-flow algorithms with different polynomial time complexity. Their practical efficiency, however, has to date been studied mainly outside the scope of computer vision. The goal of this paper is to provide an experimental comparison of the efficiency of min-cut/max-flow algorithms for applications in vision. We compare the running times of several standard algorithms, as well as a new algorithm that we have recently developed. The algorithms we study include both Goldberg-Tarjan style "push-relabel" methods and algorithms based on Ford-Fulkerson style "augmenting paths." We benchmark these algorithms on a number of typical graphs in the contexts of image restoration, stereo, and segmentation. In many cases, our new algorithm works several times faster than any of the other methods, making near real-time performance possible. An implementation of our max-flow/min-cut algorithm is available upon request for research purposes.

4,298 citations
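
The abstract above mentions Ford-Fulkerson style "augmenting paths" algorithms. The sketch below shows that family in its simplest textbook form, with breadth-first search to find each path (the Edmonds-Karp variant). It is not the new algorithm the abstract describes; the graph encoding as a dict of dicts and the function name `max_flow` are assumptions made for this illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Return (flow value, source side of a minimum cut).

    capacity maps each node u to a dict {v: capacity of edge u -> v}.
    """
    # Residual graph: every edge gets a reverse edge with capacity 0.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, cap in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left, so the flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from the source form one side of a minimum cut.
    return flow, set(parent)
```

In a segmentation graph the source and sink would stand for the two labels, and the returned node set would be the pixels assigned to the source label.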



Proceedings ArticleDOI
07 Jul 2001-
Abstract: In this paper we describe a new technique for general purpose interactive segmentation of N-dimensional images. The user marks certain pixels as "object" or "background" to provide hard constraints for segmentation. Additional soft constraints incorporate both boundary and region information. Graph cuts are used to find the globally optimal segmentation of the N-dimensional image. The obtained solution gives the best balance of boundary and region properties among all segmentations satisfying the constraints. The topology of our segmentation is unrestricted and both "object" and "background" segments may consist of several isolated parts. Some experimental results are presented in the context of photo/video editing and medical image segmentation. We also demonstrate an interesting Gestalt example. A fast implementation of our segmentation method is possible via a new max-flow algorithm.

3,504 citations


Journal ArticleDOI
01 Jan 2004-
TL;DR: This work gives a precise characterization of what energy functions can be minimized using graph cuts, among the energy functions that can be written as a sum of terms containing three or fewer binary variables.
Abstract: In the last few years, several new algorithms based on graph cuts have been developed to solve energy minimization problems in computer vision. Each of these techniques constructs a graph such that the minimum cut on the graph also minimizes the energy. Yet, because these graph constructions are complex and highly specific to a particular energy function, graph cuts have seen limited application to date. In this paper, we give a characterization of the energy functions that can be minimized by graph cuts. Our results are restricted to functions of binary variables. However, our work generalizes many previous constructions and is easily applicable to vision problems that involve large numbers of labels, such as stereo, motion, image restoration, and scene reconstruction. We give a precise characterization of what energy functions can be minimized using graph cuts, among the energy functions that can be written as a sum of terms containing three or fewer binary variables. We also provide a general-purpose construction to minimize such an energy function. Finally, we give a necessary condition for any energy function of binary variables to be minimized by graph cuts. Researchers who are considering the use of graph cuts to optimize a particular energy function can use our results to determine if this is possible and then follow our construction to create the appropriate graph. A software implementation is freely available.

2,984 citations
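
The abstract above does not state the characterization itself. For energies built from pairwise terms of binary variables, the condition usually quoted from this line of work is regularity (submodularity): E(0,0) + E(1,1) <= E(0,1) + E(1,0) for every pairwise term. The sketch below checks that inequality; the 2x2 table layout and the function name are assumptions made for illustration.

```python
def is_regular(pairwise):
    """Check the regularity condition for one pairwise term.

    pairwise[a][b] is the cost of giving the two variables labels a and b.
    """
    agree = pairwise[0][0] + pairwise[1][1]
    disagree = pairwise[0][1] + pairwise[1][0]
    return agree <= disagree


# A smoothness term that charges for label changes satisfies the condition.
smoothness = [[0.0, 1.0],
              [1.0, 0.0]]

# A term that rewards neighbouring pixels for disagreeing does not.
repulsive = [[1.0, 0.0],
             [0.0, 1.0]]

print(is_regular(smoothness), is_regular(repulsive))
```

Binarization energies that penalise label changes between neighbouring pixels fall into the regular case, which is why a minimum cut can minimise them exactly.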


Network Information
Related Papers (5)
13 Jun 2010

Boris Epshtein, Eyal Ofek +1 more

01 Jan 1986

Wayne Niblack

06 Nov 2011

Kai Wang, Boris Babenko +1 more

03 Aug 2003

Simon M. Lucas, A. Panaretos +4 more

Performance
Metrics
No. of citations received by the Paper in previous years
Year  Citations
2021  3
2020  3
2019  4
2018  5
2017  10
2016  12