TL;DR: A novel framework for learning optimal parameters for text graphic separation in the presence of the complex layouts of Indian newspapers is proposed.
Abstract: Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the presence of different font sizes and font styles and the random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text graphic separation in the presence of complex layouts. The learning problem is formulated as an optimization problem, solved using the EM algorithm, that learns optimal parameters depending on the nature of the document content.
The process of converting a physical document page into digital format is important for its preservation and archival.
Text graphic separation has long been one of the most challenging problems in document image analysis.
Scripts like Telugu, Tamil, and Malayalam lack such standards: their components overlap with the boundaries of neighbouring lines, which makes block- and line-level segmentation complex and inefficient.
Hence, estimating optimal values instead helps generalize the algorithm and provides accurate results.
Section 3 discusses their approach in detail.
2. RELATED WORK
Over the years several document layout analysis algorithms have been proposed in [4, 19, 21, 26].
Many methods have been proposed to address the problem of text-graphics separation in document images [3, 23, 27, 25, 22].
Gatos et al. [8] proposed a two-stage technique for layout analysis of newspaper pages.
The blocks are then classified based on statistical textural features and feature space decision techniques.
Due to the presence of the shiro-rekha (the horizontal head-stroke of the script), the horizontal profile possesses regularity in frequency, orientation, and spatial cohesion for text blocks.
3. TEXT GRAPHIC SEPARATION IN NEWSPAPER ARTICLES
The authors propose an approach for text graphic separation in newspaper articles that adaptively learns parameters based on the content of the document.
The approach utilizes spatial texture properties over a local neighbourhood, whose dimensions are among the parameters learned during optimization.
Each script is represented as a distribution over edge direction histogram (EDH) features [24].
These distributions are used to model the script characteristics that are learned from the training data using Expectation Maximization (EM) algorithm.
The authors now describe the parameter optimization and text graphic separation methodologies in detail in the following sections.
3.1 Parameter Optimization
The authors' EM-based parameter estimation framework relies on the nature of the document image content.
The documents in the training set are ground-truth images in which the text regions are marked.
Ei is the global EDH of the i-th ground-truth image and is responsible for modelling the document content.
The edge direction features are computed by convolving the document image with Sobel masks in the horizontal and vertical directions.
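As an illustration of this step, here is a minimal sketch of computing a global edge direction histogram from Sobel gradients; the bin count, magnitude threshold, and normalisation are assumptions, since the excerpt does not specify them.

```python
import numpy as np
import cv2

def edge_direction_histogram(gray, n_bins=36, mag_thresh=30.0):
    """Global EDH: histogram of gradient orientations at strong edge pixels.
    n_bins and mag_thresh are illustrative choices, not the paper's values."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal Sobel mask
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical Sobel mask
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # orientation in [0, 180)
    strong = mag > mag_thresh                         # keep only clear edges
    hist, _ = np.histogram(ang[strong], bins=n_bins, range=(0.0, 180.0))
    return hist / max(hist.sum(), 1)                  # normalise to a distribution
```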
ri and Wi(m,n) are the parameters to be optimized, where ri is the P/N ratio computed over a local neighbourhood of dimensions Wi(m,n).
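The excerpt does not give the E- and M-step equations, so the following is only a rough sketch of fitting content distributions over EDH features with EM, using scikit-learn's GaussianMixture; the file name, the component count, and the Gaussian assumption are all illustrative, not the authors' formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # fits mixtures via EM

# Hypothetical input: one EDH feature vector per ground-truth image
# (shape: n_images x n_bins), e.g. produced by edge_direction_histogram().
edh_features = np.load("edh_features.npy")   # placeholder data source

# Fit content distributions with EM. Two components (text-like vs.
# graphics-like content) is an illustrative choice.
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      max_iter=100, random_state=0)
gmm.fit(edh_features)

# E-step output for new data: posterior responsibility of each component.
resp = gmm.predict_proba(edh_features)
```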
3.2 Adaptive Segmentation
A given gray-scale document image is first binarized using the method described in [1].
The pseudo-periodic pattern is captured by the autocorrelation of the horizontal projection profile over a local neighbourhood.
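To make this concrete, here is a minimal sketch of that measure, assuming the profile is the row-wise sum of foreground pixels and the autocorrelation is the standard mean-removed lag correlation; regularly spaced text lines then show up as periodic peaks at lags equal to the line pitch.

```python
import numpy as np

def profile_autocorrelation(binary_patch):
    """Autocorrelation of the horizontal projection profile of a binary
    neighbourhood; text regions yield a pseudo-periodic response."""
    profile = binary_patch.sum(axis=1).astype(float)   # row-wise ink counts
    profile -= profile.mean()                          # remove the DC component
    ac = np.correlate(profile, profile, mode="full")   # full autocorrelation
    ac = ac[ac.size // 2:]                             # keep non-negative lags
    return ac / ac[0] if ac[0] != 0 else ac            # normalise to lag 0
```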
It is worth noting that the complex layout of the newspaper image is segmented properly.
The method used for evaluating the performance of their algorithm is based on counting the number of matches between the pixels segmented by the algorithm and the pixels in the ground truth [11].
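As an illustration of such pixel-level matching (the exact match criteria of [11] are not reproduced here), a simple match count with precision and recall over binary masks:

```python
import numpy as np

def pixel_match_scores(pred_mask, gt_mask):
    """Count matches between the segmented pixels and the ground truth,
    and summarise them as precision and recall."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    matches = np.logical_and(pred, gt).sum()   # correctly labelled text pixels
    precision = matches / max(pred.sum(), 1)   # fraction of predictions that match
    recall = matches / max(gt.sum(), 1)        # fraction of ground truth recovered
    return matches, precision, recall
```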
5. CONCLUSION
This paper contributes a unique technique for separating text from graphics in Indian newspapers.
The proposed technique can be easily adapted for a variety of complex layouts and scripts.
The technique has been tested on a variety of document images from different newspapers and books with different page layouts.
TL;DR: A solution to segment the headline from the body text and the body text into columns is proposed, and experiments are carried out on Gurumukhi-script newspaper article images.
Abstract: Newspapers are a vital source of information, and it is necessary to store them in digital form. To search for information in digital newspapers, the text should be in a computer-processable form. The first step in converting any newspaper into a computer-processable form is to detect the headline and segment it from the body text. The next step is to segment the body text into columns if multiple columns are present in an article. In this paper, we propose a solution to segment the headline from the body text and the body text into columns. Experiments are carried out on Gurumukhi-script newspaper article images.
TL;DR: This work chooses several features to distinguish graphics from text and tries to reduce the noise; the techniques are applied to Indian newspapers written in Roman script with satisfactory results.
Abstract: Identifying graphics in newspaper pages and then separating them from text is a challenging task. Very few works have been reported in this field. In general, newspapers are printed on low-quality paper that tends to change color with time; this color change generates noise that accumulates in the document. In this work we chose several features to distinguish graphics from text and also tried to reduce the noise. First, the minimum bounding box around each object is identified by connected component analysis of the binary image. Each object is then cropped and passed through a geometric feature extraction system. We then performed two different frequency analyses of each object. In this way we collected both spatial- and frequency-domain features from the objects, which are used for training and testing with different classifiers. We applied the techniques to Indian newspapers written in Roman script and obtained satisfactory results.
TL;DR: A two-stage graphic removal process for document images is proposed: an aggressive first stage removes all graphics, temporarily removing some text as well, and a second stage recovers the lost text with a text recovery technique.
Abstract: A graphic removal process for document images involves two stages: first, removal of graphics in the document image based on heuristic text analyses; and second, text recovery to restore text that was accidentally removed during the first stage. The first stage uses a relatively aggressive strategy to ensure that all graphics components are removed, which also temporarily leads to the removal of some text; the lost text is then recovered using the text recovery technique. The heuristic text analyses utilize the geometric properties of text characters and consider the properties of text characters in relation to their neighbors. The text recovery technique starts from the text that remains after the first stage and recovers any connected component that is at least partially located within a pre-defined neighbouring area around any of the text components in the intermediate document image.
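A rough sketch of the recovery rule just described, assuming a square fixed-radius neighbourhood realised by morphological dilation (the paper's actual neighbourhood definition may differ; the radius is illustrative):

```python
import numpy as np
import cv2

def recover_text(kept_text_mask, removed_mask, radius=15):
    """Recover removed connected components that fall at least partially
    inside a dilated neighbourhood of the surviving text."""
    size = 2 * radius + 1
    kernel = np.ones((size, size), np.uint8)
    neighbourhood = cv2.dilate(kept_text_mask.astype(np.uint8), kernel)
    n, labels = cv2.connectedComponents(removed_mask.astype(np.uint8))
    recovered = np.zeros_like(removed_mask, dtype=np.uint8)
    for label in range(1, n):                  # label 0 is the background
        comp = labels == label
        if neighbourhood[comp].any():          # partially inside the neighbourhood
            recovered[comp] = 1
    return recovered
```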
TL;DR: The development and implementation of an algorithm for automated text string separation that is relatively independent of changes in text font style and size and of string orientation are described; the algorithm showed superior performance compared to other techniques.
Abstract: The development and implementation of an algorithm for automated text string separation that is relatively independent of changes in text font style and size and of string orientation are described. It is intended for use in an automated system for document analysis. The principal parts of the algorithm are the generation of connected components and the application of the Hough transform in order to group components into logical character strings that can then be separated from the graphics. The algorithm outputs two images, one containing text strings and the other graphics. These images can then be processed by suitable character recognition and graphics recognition systems. The performance of the algorithm, both in terms of its effectiveness and computational efficiency, was evaluated using several test images and showed superior performance compared to other techniques.
658 citations
"Text graphic separation in Indian n..." refers methods in this paper
...ument images [5, 6] is based on connected component analysis....
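As a rough illustration of the Hough-based grouping described in the abstract above, the sketch below votes over connected-component centroids to find collinear character strings; the discretisation steps and the vote threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import cv2

def collinear_strings(binary, rho_step=5.0, n_thetas=180, min_votes=4):
    """Group connected-component centroids into candidate text strings by
    voting in (theta, rho) Hough space."""
    n, _, _, centroids = cv2.connectedComponentsWithStats(binary.astype(np.uint8))
    pts = centroids[1:]                        # skip the background component
    thetas = np.deg2rad(np.arange(n_thetas))
    diag = np.hypot(*binary.shape)             # offset keeps rho bins positive
    votes = {}
    for i, (x, y) in enumerate(pts):
        for t_idx, t in enumerate(thetas):
            rho = x * np.cos(t) + y * np.sin(t)
            key = (t_idx, int(round((rho + diag) / rho_step)))
            votes.setdefault(key, []).append(i)
    # Each sufficiently populated bin is a candidate collinear string.
    return [idx for idx in votes.values() if len(idx) >= min_votes]
```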
TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
Abstract: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
526 citations
"Text graphic separation in Indian n..." refers background in this paper
...Over the years several document layout analysis algorithms have been proposed in [4, 19, 21, 26]....
TL;DR: It is shown that a constrained run length algorithm is well suited to partition most documents into areas of text lines, solid black lines, and rectangular boxes enclosing graphics and halftone images.
Abstract: The segmentation and classification of digitized printed documents into regions of text and images is a necessary first processing step in document analysis systems. It is shown that a constrained run length algorithm is well suited to partition most documents into areas of text lines, solid black lines, and rectangular boxes enclosing graphics and halftone images. During the processing these areas are labeled and meaningful features are calculated. By making use of the regular appearance of text lines as textured stripes, a linear adaptive classification scheme is constructed to discriminate text regions from others.
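For reference, a minimal sketch of a constrained run-length smoothing step in the spirit of this algorithm, assuming a 0/1 image where background gaps shorter than a threshold are filled and the horizontal and vertical results are combined with a logical AND; the thresholds are illustrative.

```python
import numpy as np

def run_length_smooth(line, max_gap):
    """Fill background (0) runs no longer than max_gap with foreground (1)."""
    out = line.copy()
    gap_start = None
    for i, v in enumerate(line):
        if v == 0 and gap_start is None:
            gap_start = i                          # a background run begins
        elif v == 1 and gap_start is not None:
            if i - gap_start <= max_gap:
                out[gap_start:i] = 1               # bridge the short gap
            gap_start = None
    return out

def rlsa(binary, h_gap=30, v_gap=15):
    """Constrained run-length smoothing: smooth rows and columns
    separately, then AND the two results (thresholds are illustrative)."""
    horiz = np.array([run_length_smooth(row, h_gap) for row in binary])
    vert = np.array([run_length_smooth(col, v_gap) for col in binary.T]).T
    return horiz & vert
```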
TL;DR: In this paper, two-dimensional Gabor filters are used to extract texture features for each text region in a given document image, and the text in the document is considered as a textured region.
Abstract: There is a considerable interest in designing automatic systems that will scan a given paper document and store it on electronic media for easier storage, manipulation, and access. Most documents contain graphics and images in addition to text. Thus, the document image has to be segmented to identify the text regions, so that OCR techniques may be applied only to those regions. In this paper, we present a simple method for document image segmentation in which text regions in a given document image are automatically identified. The proposed segmentation method for document images is based on a multichannel filtering approach to texture segmentation. The text in the document is considered as a textured region. Nontext contents in the document, such as blank spaces, graphics, and pictures, are considered as regions with different textures. Thus, the problem of segmenting document images into text and nontext regions can be posed as a texture segmentation problem. Two-dimensional Gabor filters are used to extract texture features for each of these regions. These filters have been extensively used earlier for a variety of texture segmentation tasks. Here we apply the same filters to the document image segmentation problem. Our segmentation method does not assume any a priori knowledge about the content or font styles of the document, and is shown to work even for skewed images and handwritten text. Results of the proposed segmentation method are presented for several test images which demonstrate the robustness of this technique.
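As an illustration of the multichannel filtering idea, a small Gabor filter bank built with OpenCV; the kernel size, frequency, and orientations are illustrative choices, not the parameters used in the paper.

```python
import numpy as np
import cv2

def gabor_texture_features(gray, thetas=(0, 45, 90, 135), lambd=8.0):
    """Per-pixel texture features: magnitude responses of a bank of
    Gabor filters at several orientations."""
    gray = gray.astype(np.float32)
    responses = []
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0,
                                    theta=np.deg2rad(theta),
                                    lambd=lambd, gamma=0.5, psi=0)
        responses.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, kernel)))
    return np.stack(responses, axis=-1)   # H x W x len(thetas) feature map
```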
TL;DR: This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches.
Abstract: Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
266 citations
"Text graphic separation in Indian n..." refers background in this paper
...Over the years several document layout analysis algorithms have been proposed in [4, 19, 21, 26]....