
Showing papers on "Document layout analysis published in 2013"


Journal ArticleDOI
TL;DR: This paper proposes a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents.
Abstract: Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

62 citations
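As a rough illustration of the segment-based idea described above (clustering segment-level portions of documents and inducing a document-level organization from them), a minimal sketch might look like the following. The paragraph splitting, TF-IDF features, KMeans and the majority-vote assignment are all assumptions of this sketch, not the paper's actual framework.

```python
# Minimal sketch of segment-based document clustering (not the paper's exact method).
# Assumptions: documents are plain strings, segments are paragraphs, features are TF-IDF,
# and a document is assigned to the cluster that most of its segments fall into.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents_by_segments(documents, n_clusters=5):
    # Split every document into paragraph-level segments.
    segments, owner = [], []
    for doc_id, text in enumerate(documents):
        for part in text.split("\n\n"):
            if part.strip():
                segments.append(part)
                owner.append(doc_id)

    # Cluster the segments, not the whole documents.
    features = TfidfVectorizer(stop_words="english").fit_transform(segments)
    seg_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

    # Induce a document-level organization: majority vote over each document's segment clusters.
    doc_labels = {}
    for doc_id in set(owner):
        votes = [label for label, o in zip(seg_labels, owner) if o == doc_id]
        doc_labels[doc_id] = Counter(votes).most_common(1)[0][0]
    return doc_labels
```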


Proceedings ArticleDOI
25 Aug 2013
TL;DR: An objective comparative evaluation of layout analysis methods for scanned historical newspapers indicates that there is a convergence to a certain methodology with some variations in the approach, but there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical newspapers.
Abstract: This paper presents an objective comparative evaluation of layout analysis methods for scanned historical newspapers. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013), presenting the results of the evaluation of five submitted methods. Two state-of-the-art systems, one commercial and one open-source, are also evaluated for comparison. Two scenarios are reported in this paper, one evaluating the ability of methods to accurately segment regions and the other evaluating the whole pipeline of segmentation and region classification (with a text extraction goal). The results indicate that there is a convergence to a certain methodology with some variations in the approach. However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical newspapers.

47 citations


01 Jan 2013
TL;DR: Noise in scanned document images, which reduces the accuracy of subsequent OCR (Optical Character Recognition) tasks, is reviewed, and some noise removal methods are discussed.
Abstract: Document images may be contaminated with noise during transmission, scanning, or conversion to digital form. We can categorize noises by identifying their features and can search for similar patterns in a document image to choose appropriate methods for their removal. After a brief introduction, this paper reviews the types of noise that might appear in scanned document images and discusses some noise removal methods. Nowadays, with the increasing use of computers in everyday life, the ability to convert documents into digital, readable formats has become a necessity. Scanning documents is a way of changing printed documents into digital format. A common problem encountered when scanning documents is noise, which can occur in an image because of paper quality or the typing machine used, or can be introduced by scanners during the scanning process. Noise removal is one of the steps in preprocessing. Among other things, noise reduces the accuracy of subsequent tasks of OCR (Optical Character Recognition) systems. It can appear in the foreground or background of an image and can be generated before or after scanning. Examples of noise in scanned document images are as follows. Ruled page lines are a source of noise that interferes with text objects. Marginal noise usually appears as a large dark region around the document image and can be textual or non-textual. Some forms of clutter noise appear in an image because of document skew during scanning or holes punched in the document; background noise includes uneven contrast, show-through effects, interfering strokes, and background spots. Next, we discuss each type in detail.

39 citations
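For illustration only, a couple of the generic noise-removal steps alluded to above (filtering salt-and-pepper spots and dropping small clutter components) might be sketched as follows; the kernel size and area threshold are arbitrary assumptions, not values from the paper.

```python
# Rough illustration of two common noise-removal steps for scanned pages
# (median filtering for salt-and-pepper noise, area filtering for small clutter).
# The thresholds below are arbitrary and would need tuning per collection.
import cv2
import numpy as np

def remove_simple_noise(gray_page, min_component_area=15):
    # Salt-and-pepper style background spots: a small median filter.
    smoothed = cv2.medianBlur(gray_page, 3)

    # Binarize (text as white foreground) and drop tiny connected components.
    _, binary = cv2.threshold(smoothed, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = np.zeros_like(binary)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_component_area:
            cleaned[labels == i] = 255
    return cleaned
```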


Proceedings ArticleDOI
25 Aug 2013
TL;DR: The proposed method is a bottom-up approach that fuses words so as to globally minimize their fusing distance; to improve processing time and further layout analysis, text lines are represented by oriented rectangles.
Abstract: Text line detection is a pre-processing step for automated document analysis such as word spotting or OCR. It is additionally used for document structure analysis or layout analysis. Considering mixed layouts, degraded documents and handwritten documents, text line detection is still challenging. We present a novel approach that targets torn documents having varying layouts and writing. The proposed method is a bottom-up approach that fuses words so as to globally minimize their fusing distance. In order to improve processing time and further layout analysis, text lines are represented by oriented rectangles. Even though the method was designed for modern handwritten and printed documents, tests on medieval manuscripts give promising results. Additionally, the text line detection was evaluated on the ICDAR 2009 and ICFHR 2010 Handwriting Segmentation Contest datasets.

27 citations
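A much reduced sketch of the bottom-up idea (fusing nearby words and describing each resulting line by an oriented rectangle) could look like this; the greedy centroid-distance fusion and the fixed threshold are simplifications, not the paper's global minimization.

```python
# Simplified bottom-up word fusion into text lines, represented as oriented rectangles.
# This greedy, threshold-based fusion is only an approximation of the idea; the paper
# minimizes the fusing distance globally, which is not reproduced here.
import cv2
import numpy as np

def fuse_words_into_lines(word_boxes, max_gap=40.0):
    """word_boxes: list of (x, y, w, h) word bounding boxes."""
    centers = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in word_boxes])
    parent = list(range(len(word_boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Fuse any two words whose centres are closer than max_gap (union-find).
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < max_gap:
                parent[find(i)] = find(j)

    # Represent each fused group by an oriented rectangle over its corner points.
    groups = {}
    for idx, (x, y, w, h) in enumerate(word_boxes):
        pts = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
        groups.setdefault(find(idx), []).extend(pts)
    return [cv2.minAreaRect(np.array(pts, dtype=np.float32)) for pts in groups.values()]
```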


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper proposes a novel text line extraction method for historical documents that takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach.
Abstract: This paper proposes a novel text line extraction method for historical documents. The method works in two steps. In the first step, layout analysis is performed to recognize the physical structure of a given document using a classification technique, more precisely the pixels of a coloured document image are classified into five classes: text-block, core-text-line, decoration, background, and periphery. This layout recognition is achieved by a cascade of two Dynamic Multilayer Perceptron (DMLP) classifiers and works without binarisation. In the second step, an algorithm takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach. Finally, the algorithm refines the boundaries of the text lines using the binary image and the layout recognition results. Our system is evaluated on three historical manuscripts with a test set of 49 pages. The best obtained hit rate for text lines is 96.3%.

26 citations
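The second step (grouping a per-pixel layout labelling into text lines via connected components) might be sketched as below; the integer label codes and the use of scipy.ndimage are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the second step: extract text lines from a per-pixel layout labelling
# by running connected-component analysis on the "core-text-line" class.
# The integer label code is an arbitrary assumption for this illustration.
import numpy as np
from scipy import ndimage

CORE_TEXT_LINE = 2  # hypothetical code for the core-text-line class

def extract_text_line_boxes(label_map):
    """label_map: 2-D array of per-pixel class labels."""
    core_mask = (label_map == CORE_TEXT_LINE)
    components, n = ndimage.label(core_mask)
    boxes = []
    for sl in ndimage.find_objects(components):
        if sl is not None:
            y, x = sl
            boxes.append((x.start, y.start, x.stop - x.start, y.stop - y.start))
    return boxes  # one (x, y, w, h) box per detected text line
```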


01 Jan 2013
TL;DR: A framework for classifying document image retrieval approaches is proposed, and these approaches are then evaluated based on important measures.
Abstract: During the last decades, due to advances in information technology and communication and the increase in the volume of printed documents in many applications, document image databases have become increasingly important. Document images are documents that normally begin on paper and are then scanned electronically, moving towards a paperless office in which documents are stored as images. Document image retrieval is an important research area in the field of document image databases. Many approaches have been proposed for indexing and retrieving document images. Traditionally, optical character recognition (OCR) has been used to completely convert the manuscript into an electronic version which can be indexed automatically. Keyword spotting has then been proposed for indexing in document image retrieval; it has a lower cost than OCR. However, both methods have problems when indexing document images with non-text components. Three approaches have been presented to solve this problem: the signature-based approach, the layout-structural approach, and the logo-based approach. In this paper we propose a framework for classifying document image retrieval approaches, and we then evaluate these approaches based on important measures.

19 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: The details of the ICDAR2013 Document Image Skew Estimation Contest are described including the evaluation measures used as well as the performance of the twelve methods submitted by ten different groups along with a short description of each method.
Abstract: The detection and correction of document skew is one of the most important document image analysis steps. The ICDAR2013 Document Image Skew Estimation Contest (DISEC'13) is the first contest which is dedicated to record recent advances in the field of skew estimation using well established evaluation performance measures on a variety of printed document images. The benchmarking dataset that is used contains 1550 images that were obtained from various sources such as newspapers, scientific books and dictionaries. The document images contain figures, tables, diagrams, architectural plans, electrical circuits and they are written in various languages such as English, Chinese and Greek. This paper describes the details of the contest including the evaluation measures used as well as the performance of the twelve methods submitted by ten different groups along with a short description of each method.

19 citations
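The contest paper describes the benchmark rather than a particular algorithm, but for context a classical projection-profile skew estimator, unrelated to any submitted method, can be sketched as follows; the angle range and step are arbitrary.

```python
# Classical projection-profile skew estimation, shown only as a generic baseline
# for the task benchmarked by DISEC'13 (it is not one of the submitted methods).
import numpy as np
from scipy import ndimage

def estimate_skew(binary_page, angle_range=5.0, step=0.1):
    """binary_page: 2-D array with text pixels as 1/True. Returns angle in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = ndimage.rotate(binary_page.astype(float), angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # horizontal projection profile
        score = np.var(profile)         # sharp peaks/valleys when text lines are level
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```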


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents a method to separate the textual and non-textual components in document images using graph-based modeling and structural analysis, a fast and efficient way to adequately separate the graphical and textual parts of a document.
Abstract: Page segmentation into text and non-text elements is an essential preprocessing step before the optical character recognition (OCR) operation. In case of poor segmentation, an OCR classification engine produces garbage characters due to the presence of non-text elements. This paper presents a method to separate the textual and non-textual components in document images using graph-based modeling and structural analysis. This is a fast and efficient method to adequately separate the graphical and the textual parts of a document. We have evaluated our method on two well-known subsets: the UW-III dataset and the ICDAR 2009 page segmentation competition dataset. Comparisons are made with two state-of-the-art methods, and the results show that our method achieves better performance on this task.

19 citations
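A heavily simplified sketch of the graph-based intuition (connected components as nodes, links between nearby components, and a text/non-text decision from neighbourhood regularity) is given below; the heuristics and thresholds are assumptions and do not reproduce the paper's structural analysis.

```python
# Highly simplified sketch of graph-based text / non-text separation: connected
# components become nodes, nearby components are linked, and a component is called
# "text" if it has enough similarly-sized neighbours. The thresholds are arbitrary
# and this is not the structural analysis used in the paper.
import cv2
import numpy as np

def split_text_nontext(binary, max_link_dist=50.0, min_text_neighbours=2):
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
    heights = stats[1:, cv2.CC_STAT_HEIGHT].astype(float)
    pts = centroids[1:]
    text_flags = np.zeros(len(pts), dtype=bool)
    for i in range(len(pts)):
        d = np.linalg.norm(pts - pts[i], axis=1)
        neighbours = np.where((d > 0) & (d < max_link_dist))[0]
        similar = [j for j in neighbours if 0.5 < heights[j] / max(heights[i], 1.0) < 2.0]
        text_flags[i] = len(similar) >= min_text_neighbours
    return text_flags  # True = text component, False = non-text (component i+1 in `labels`)
```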


Proceedings ArticleDOI
06 Dec 2013
TL;DR: Novel features based on centroid fluctuation information of non-homogeneous regions are proposed to more appropriately characterize both displayed formulas and embedded formulas in heterogeneous document images that may contain figures, tables, text, and math formulas.
Abstract: This paper presents mathematical formula detection in heterogeneous document images that may contain figures, tables, text, and math formulas. We adopt the method originally proposed for sign detection in natural images to detect non-homogeneous regions and accordingly achieve text line detection and segmentation. Novel features based on centroid fluctuation information of non-homogeneous regions are proposed to more appropriately characterize both displayed formulas and embedded formulas. By comparing the proposed method with previous works, we demonstrate the effectiveness of the proposed features.

18 citations
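One possible reading of a centroid-fluctuation feature, measuring how much the vertical centroids of a line's symbols deviate from the line centroid, is sketched below; the normalisation by median symbol height is an assumption of this illustration, not the paper's exact feature.

```python
# Math-heavy lines (sub/superscripts, fraction bars) tend to have symbol centroids
# that fluctuate vertically more than plain text lines. This is an illustration of
# that idea, not the feature definition used in the paper.
import numpy as np

def centroid_fluctuation(component_centroids_y, component_heights):
    """Inputs: per-symbol vertical centroids and heights for one text line."""
    ys = np.asarray(component_centroids_y, dtype=float)
    hs = np.asarray(component_heights, dtype=float)
    line_centroid = ys.mean()
    # Normalise offsets by the median symbol height so the feature is scale-free.
    offsets = (ys - line_centroid) / max(np.median(hs), 1.0)
    return float(np.std(offsets))
```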


Proceedings ArticleDOI
25 Aug 2013
TL;DR: A novel evaluation approach for reading order results generated by layout analysis methods, incorporating region correspondence analysis, is proposed, and a sophisticated reading order representation scheme is presented and used by the system.
Abstract: Reading order detection and representation is an important task in many digitisation scenarios involving the preservation of the logical structure of a document. The corresponding need for the evaluation of reading order results generated by layout analysis methods poses a particular challenge due to potential deviations between ground truth and actually detected segmentation of the page. To this end a novel evaluation approach that responds to this problem by incorporating region correspondence analysis is proposed. Furthermore, a sophisticated reading order representation scheme is presented and used by the system allowing the grouping of objects with ordered and/or unordered relations. This is a typical requirement for documents with complex layouts such as magazines and newspapers. The evaluation method has been validated using the results of two state-of-the-art OCR / layout analysis systems and a basic top-to-bottom reading order detection algorithm applied on representative samples from the PRImA contemporary and the IMPACT historical document datasets.

16 citations


Proceedings ArticleDOI
04 Feb 2013
TL;DR: A fast automatic layout segmentation of old document images based on five descriptors, which is parameter-free since it automatically adapts to the image content, and defines a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of the method.
Abstract: Recent progress in the digitization of heterogeneous collections of ancient documents has rekindled new challenges in information retrieval in digital libraries and document layout analysis. Therefore, in order to control the quality of historical document image digitization and to meet the need for a characterization of their content using intermediate-level metadata (between image and document structure), we propose a fast automatic layout segmentation of old document images based on five descriptors. Those descriptors, based on the autocorrelation function, are obtained by multiresolution analysis and used afterwards in a specific clustering method. The method proposed in this article has the advantage that it is performed without any hypothesis on the document structure, either about the document model (physical structure) or the typographical parameters (logical structure). It is also parameter-free since it automatically adapts to the image content. In this paper, firstly, we detail our proposal to characterize the content of old documents by extracting the autocorrelation features in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find the homogeneous regions defined by similar indices of autocorrelation, without knowledge of the number of clusters, using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our method, we apply our algorithm on 316 old document images, which encompass six centuries (1200-1900) of French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%. These results help to represent a document by a hierarchy of layout structure and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of homogeneous blocks and their topology.
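A rough sketch of autocorrelation-based block descriptors computed at several resolutions, in the spirit of the multiresolution analysis described above, is given below; the pooling of the autocorrelation surface into a fixed-length vector is an assumption of this illustration, not the paper's five descriptors.

```python
# Autocorrelation-based block descriptors at multiple resolutions (illustrative only).
import numpy as np
import cv2

def autocorrelation(block):
    # 2-D autocorrelation via the Wiener-Khinchin theorem, normalised by the zero lag.
    f = np.fft.fft2(block - block.mean())
    acf = np.fft.ifft2(f * np.conj(f)).real
    zero_lag = acf[0, 0] + 1e-9
    return np.fft.fftshift(acf) / zero_lag

def block_descriptor(gray_block, levels=3):
    feats = []
    img = gray_block.astype(float)
    for _ in range(levels):
        acf = autocorrelation(img)
        h, w = acf.shape
        # Keep a few coarse statistics of the central row/column of the ACF.
        feats += [acf[h // 2, :].std(), acf[:, w // 2].std(), acf.mean(), acf.std()]
        img = cv2.pyrDown(img)  # next (coarser) resolution
    return np.array(feats)
```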

Patent
19 Mar 2013
TL;DR: In this paper, a PDF document recognition method is proposed, which comprises the following steps: S1: analyzing path objects in a PDF document and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document and recognizing text content; S3: writing the recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment.
Abstract: The invention discloses a PDF document recognition method. The method comprises the steps as follows: S1: analyzing path objects in a PDF document, and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document, and recognizing text content in the PDF document; S3: writing recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment. By the aid of the PDF document recognition method, objects such as the forms, paragraphs, titles, lists and the like in the PDF document can be recognized, so that the PDF document can be edited with one paragraph as the unit, labels can be added to the PDF document conveniently, the reading order can be determined, and persons with dysopia can read conveniently; meanwhile, documents in other formats can be exported according to the recognition results, so that users can read and edit the PDF document conveniently.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: It is argued that a key-region detector designed to take into account the special characteristics of document images can result in the detection of fewer and more meaningful key-regions.
Abstract: In this paper we argue that a key-region detector designed to take into account the special characteristics of document images can result in the detection of fewer and more meaningful key-regions. We propose a fast key-region detector able to capture aspects of the structural information of the document, and demonstrate its efficiency by comparing against standard detectors in an administrative document retrieval scenario. We show that using the proposed detector results in a smaller number of detected key-regions and higher performance, without any drop in speed, compared to standard state-of-the-art detectors.

Proceedings ArticleDOI
04 Feb 2013
TL;DR: This paper proposes a machine learning based super resolution framework for low resolution document image OCR, using a document page segmentation algorithm and a modified K-means clustering algorithm to reconstruct from a low-resolution document image a better resolution image and improve OCR results.
Abstract: Optical character recognition is widely used for converting document images into digital media. Existing OCR algorithms and tools produce good results from high resolution, good quality, document images. In this paper, we propose a machine learning based super resolution framework for low resolution document image OCR. Two main techniques are used in our proposed approach: a document page segmentation algorithm and a modified K-means clustering algorithm. Using this approach, by exploiting coherence in the document, we reconstruct from a low resolution document image a better resolution image and improve OCR results. Experimental results show substantial gain in low resolution documents such as the ones captured from video.
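The coherence idea (repeated glyphs in a document can be pooled to build cleaner exemplars) can be sketched as follows with plain K-means; this is not the modified K-means algorithm of the paper, and the crop size, cluster count and upsampling factor are arbitrary assumptions.

```python
# Rough sketch of exploiting in-document coherence for super resolution: crop the
# glyphs (connected components), cluster them with K-means, and average the
# upsampled members of each cluster to obtain a cleaner exemplar per cluster.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def coherent_glyph_exemplars(glyph_crops, scale=2, n_clusters=64, size=(16, 16)):
    """glyph_crops: list of small grayscale glyph images from the low-resolution page."""
    big = (size[0] * scale, size[1] * scale)
    low = np.stack([cv2.resize(g, size).astype(float).ravel() for g in glyph_crops])
    high = np.stack([cv2.resize(g, big, interpolation=cv2.INTER_CUBIC).astype(float)
                     for g in glyph_crops])
    labels = KMeans(n_clusters=min(n_clusters, len(glyph_crops)),
                    n_init=10, random_state=0).fit_predict(low)
    # Average the upsampled members of each cluster to get a cleaner exemplar.
    exemplars = {k: high[labels == k].mean(axis=0) for k in np.unique(labels)}
    return labels, exemplars  # paste exemplars[labels[i]] back at glyph i's position
```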

Proceedings ArticleDOI
Canhui Xu, Zhi Tang, Xin Tao, Yun Li, Cao Shi
TL;DR: To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents.
Abstract: To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Digital-born documents have inherent advantages, such as representing text and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis with the inherent metadata provided by a PDF parser, the page primitives including text, image and path elements are processed to produce text and non-text layers for respective analysis. The graph-based method is developed at the superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal's algorithm, and edge orientation is then used in a bottom-up manner to extract text lines from each subtree. On the other hand, non-textual objects are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purposes. Experimental results on selected pages from PDF books are presented.
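The top-down step described above (a spanning tree over text elements cut by Euclidean distance) can be sketched as follows; scipy's MST routine stands in for Kruskal's algorithm, and the cut threshold is an arbitrary assumption.

```python
# Sketch of the top-down step: build a graph over text-element centres, take its
# minimum spanning tree, and cut edges longer than a threshold to obtain groups.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def group_text_elements(centres, cut_distance=60.0):
    """centres: (N, 2) array of text-element centre coordinates."""
    dist = squareform(pdist(np.asarray(centres, dtype=float)))
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    mst[mst > cut_distance] = 0.0                  # cut long edges (top-down split)
    adjacency = csr_matrix((mst + mst.T) > 0)
    n_groups, labels = connected_components(adjacency, directed=False)
    return labels  # group index per text element
```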

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A new method of ground-truth estimation using multispectral (MS) imaging representation space for the sake of document image binarization and based on the cooperation of multiple classifiers under some constraints is proposed.
Abstract: Human ground-truthing is the manual labelling of samples (pixels for example) to generate reference data without any automatic algorithm help. Although a manual ground-truth is more accurate than a machine ground-truth, it still suffers from mislabeling and/or judgement errors. In this paper we propose a new method of ground-truth estimation using multispectral (MS) imaging representation space for the sake of document image binarization. Starting from the initial manual ground-truth, the proposed classification method aims to select automatically some samples with correct labels (well-labeled pixels) from each class for the training phase, then reassign new labels to the document image pixels. The classification scheme is based on the cooperation of multiple classifiers under some constraints. A real data set of MS historical document images and their ground-truth is created to demonstrate the effectiveness of the proposed method of ground-truth estimation.

Proceedings ArticleDOI
27 Dec 2013
TL;DR: The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text and allow for hierarchical clustering with different feature subsets in different layers.
Abstract: In this paper a semi-automated document image clustering and retrieval approach is presented to create links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe the visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text and allow for hierarchical clustering with different feature subsets in different layers.

Proceedings ArticleDOI
01 Sep 2013
TL;DR: An incremental learning method for document image and zone classification based on a reject utility in order to reject ambiguous zones or documents in an industrial context where the system faces a large variability of digitized administrative documents.
Abstract: We present an incremental learning method for document image and zone classification. We consider an industrial context where the system faces a large variability of digitized administrative documents that become available progressively over time. Each new incoming document is segmented into physical regions (zones) which are classified according to a zone-model. We represent the document by means of its classified zones and we classify the document according to a document-model. The classification relies on a reject utility in order to reject ambiguous zones or documents. Models are updated by incrementally learning each new document and its extracted zones. We validate the method on real administrative document images and we achieve a recognition rate of more than 92%.
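A minimal sketch of a reject rule of the kind described above, rejecting a zone or document when the best class probability is too low, is given below; the threshold and the probabilistic classifier are placeholders, not the paper's reject utility or incremental models.

```python
# Minimal reject-option classification on top of any probabilistic zone/document classifier.
import numpy as np

def classify_with_reject(model, features, reject_threshold=0.7):
    """model: any classifier exposing predict_proba (e.g. a scikit-learn estimator)."""
    proba = model.predict_proba(features)
    best = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    labels = np.where(confidence >= reject_threshold, best, -1)  # -1 means "rejected"
    return labels, confidence
```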

Proceedings ArticleDOI
25 Aug 2013
TL;DR: It is shown that the different algorithms yield very different results depending on the type of documents, that two of them are consistently better than the others, and that the Zone Map metric provides greater detail on the error types.
Abstract: Even though numerous text line detection algorithms have been proposed, the algorithms are usually compared on a single database and according to a single metric. In this paper, we study the performance of four different text line detection algorithms, on four databases containing very different documents, and according to three metrics (Zone Map, ICDAR and recognition error rate). Our goal is to provide a more comprehensive empirical evaluation of handwritten text line detection methods and to identify what the key points in the evaluation are. We show that the different algorithms yield very different results depending on the type of documents and that two of them are consistently better than the others. We also show that the Zone Map and the ICDAR metrics are strongly correlated, but the Zone Map metric provides greater detail on the error types. Finally we show that the geometric metrics are correlated with the recognition error rate on easy-to-segment databases, but this has to be confirmed on difficult documents.

Proceedings ArticleDOI
10 Sep 2013
TL;DR: This paper presents an improved approach for automatically laying out content onto a document page where the number and size of the items are unknown in advance; an analytical approximation for text placement is presented, refined by curve fitting over TeX-generated data.
Abstract: This paper presents an improved approach for automatically laying out content onto a document page, where the number and size of the items are unknown in advance. Our solution leverages earlier results from Oliveira (2008) wherein layouts are modeled by a guillotine partitioning of the page. The benefit of such method is its efficiency and ability to place as many items on a page as desired. In our model, items have flexible representations and texts may freely change their font sizes to fit a particular area of the page. As a consequence, the optimization goal is to find a layout that produces the least noticeable difference between font sizes, in order to obtain the most aesthetically pleasing layout. Finding the best areas for text requires knowledge of how typesetting engines actually render text for a particular setting. As such, we also model the behavior of the TeX typesetting engine when computing the height to be occupied by a text block as a function of the font size, text length and line width. An analytical approximation for text placement is then presented, refined by using curve fitting over TeX-generated data. As a practical result, the resulting layouts for a newspaper generation application are also presented. Finally, we discuss these results and directions for further research.
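The height-modelling step can be sketched as a curve fit over measurements taken from TeX-rendered samples; the smooth functional form below is an assumption chosen for illustration, not the analytical approximation derived in the paper.

```python
# Fit the height occupied by a text block as a function of font size, text length
# and line width. The functional form (height ≈ a·n·f²/W + b·f, i.e. a smoothed
# "number of lines × line height") is an illustrative assumption.
import numpy as np
from scipy.optimize import curve_fit

def block_height(x, a, b):
    n_chars, font_size, line_width = x
    return a * n_chars * font_size**2 / line_width + b * font_size

def fit_height_model(samples):
    """samples: iterable of (n_chars, font_size_pt, line_width_pt, measured_height_pt),
    measured for example from TeX-rendered test documents."""
    n, f, w, h = (np.array(col, dtype=float) for col in zip(*samples))
    (a, b), _ = curve_fit(block_height, (n, f, w), h)
    return a, b  # predicted height: block_height((n, f, w), a, b)
```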

Journal ArticleDOI
TL;DR: The warped text and figures in documents are restored successfully by the proposed method, which is efficient and fast enough to be implemented in the module of a digital photocopier.
Abstract: Warp problems usually make documents hard to recognize. Specifically, when we copy a page of a thick book or bound document with a digital photocopier, the resulting image is usually warped because of the thickness of the document. We focus on this problem and propose a fast method to restore the warped document image in this paper. The text rectangle area is one of the features of a document. A morphological operation is utilized for text rectangle area segmentation. The DLT method is used to compute the mapping relations between the warped document and the non-warped document. In the experimental results, the proposed method works on high-resolution images very quickly. The warped text and figures in documents are restored by the proposed method successfully. The method is efficient and fast enough to be implemented in the module of a digital photocopier.
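The mapping step can be illustrated with a standard DLT-style homography between the corners of a detected warped text rectangle and an axis-aligned target; the corner detection and output size are assumed inputs, and a production photocopier module would need a finer curl model than a single homography.

```python
# Estimate a DLT homography from the four corners of a detected (warped) text
# rectangle to an axis-aligned target rectangle and unwarp the page with it.
import cv2
import numpy as np

def unwarp_text_area(image, warped_corners, out_w, out_h):
    """warped_corners: 4x2 array ordered top-left, top-right, bottom-right, bottom-left."""
    src = np.asarray(warped_corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w - 1, 0], [out_w - 1, out_h - 1], [0, out_h - 1]],
                   dtype=np.float32)
    H, _ = cv2.findHomography(src, dst, method=0)   # plain DLT, no robust estimator
    return cv2.warpPerspective(image, H, (out_w, out_h))
```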

Proceedings ArticleDOI
10 Sep 2013
TL;DR: The Conditional Random Field (CRF) model is used to perform OFR, and the effectiveness of this approach is demonstrated on a set of 616 fonts.
Abstract: Automated publishing systems require large databases containing document page layout templates. Most of these layout templates are created manually. A lower cost alternative is to extract document page layouts from existing documents. In order to extract the layout from a scanned document image, it is necessary to perform Optical Font Recognition (OFR) since the font is an important element in layout design. In this paper, we use the Conditional Random Field (CRF) model to perform OFR. First, we extract typographical features of the text. Then, we train the probabilistic model using a log-linear parameterization of CRF. The advantage of using CRF is that it does not assume that the typographical features are independent of each other. We demonstrate the effectiveness of this approach on a set of 616 fonts.
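A sketch of a log-linear CRF over the words of a text line, labelled with font classes, is shown below; sklearn-crfsuite is used here as a convenient stand-in implementation, and the typographical features are hypothetical placeholders rather than the paper's feature set.

```python
# Log-linear CRF over word sequences labelled with font classes (illustrative only).
# The feature names below (x_height, stroke_width, slant, has_serifs) are hypothetical.
import sklearn_crfsuite

def word_features(word):
    # `word` is assumed to be a dict of measurements for one word image.
    return {
        "x_height": word["x_height"],
        "stroke_width": word["stroke_width"],
        "slant": word["slant"],
        "has_serifs": float(word["has_serifs"]),
    }

def train_font_crf(lines, labels):
    """lines: list of word-measurement sequences; labels: list of font-name sequences."""
    X = [[word_features(w) for w in line] for line in lines]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    return crf  # crf.predict(X_new) yields one font label per word
```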

Book ChapterDOI
01 Jan 2013
TL;DR: A novel approach to segmenting text and non-text components in Malayalam handwritten document images using the Simplified Fuzzy ARTMAP (SFAM) classifier is proposed; the results are promising, and the approach can be extended to other scripts.
Abstract: Segmentation of a document image into text and non-text regions is an essential process in document layout analysis, which is one of the preprocessing steps in optical character recognition. Handwritten documents usually have no specific layout and may contain non-text regions such as diagrams, graphics, and tables. In this work we propose a novel approach to segmenting text and non-text components in Malayalam handwritten document images using a Simplified Fuzzy ARTMAP (SFAM) classifier. The binarized document image is dilated horizontally and vertically, and the results are merged. Connected component labelling is then performed on the smeared image. A set of geometrical and statistical features is extracted from each component and given to SFAM to classify it into a text or non-text component. Experimental results are promising, and the approach can be extended to other scripts.
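The preprocessing and feature-extraction pipeline described above (binarize, dilate horizontally and vertically, merge, label connected components, compute geometric and statistical features) might be sketched as follows; the structuring-element sizes are arbitrary, and the SFAM classifier itself is not reproduced here.

```python
# Pipeline up to feature extraction: binarize, dilate horizontally and vertically,
# merge the two smeared images, label connected components, and compute a few
# geometric/statistical features per component. Any classifier could consume these.
import cv2
import numpy as np

def component_features(gray):
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    horiz = cv2.dilate(binary, cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1)))
    vert = cv2.dilate(binary, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25)))
    smeared = cv2.bitwise_or(horiz, vert)

    n, labels, stats, _ = cv2.connectedComponentsWithStats(smeared, connectivity=8)
    features = []
    for i in range(1, n):
        x, y, w, h, area = stats[i]
        density = binary[y:y + h, x:x + w].mean() / 255.0   # ink density inside the box
        aspect = w / float(h)
        features.append((x, y, w, h, area, aspect, density))
    return features  # e.g. feed to a classifier to decide text vs non-text per component
```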

Patent
28 Mar 2013
TL;DR: In this paper, a system for recording a document with a camera-based mobile radio device and for converting textual information in the document into a format for suitable presentation on the mobile device is described.
Abstract: Systems may be provided for recording a document with a camera-based mobile radio device and for converting textual information in the document into a format for suitable presentation on the mobile device. A document may be recorded by the mobile device in an image. A layout structure may be recognized with a text block in the image. Character text in the text block may be recognized by OCR. An order of the text blocks may be determined by taking into account the layout structure. A suitable format for presenting the character texts on the mobile device's display may be selected. The format may be adapted to a width of the display so that during reading of the character texts on the display, substantially only vertical scrolling is necessary. A file may be generated and displayed in the format with the character texts in the determined order of the text blocks.

Proceedings ArticleDOI
21 Mar 2013
TL;DR: E-VSM, an Enhanced Vector Space Model, is proposed to overcome the limitations of the original Vector Space Model, together with a new 'Density-based Clustering' approach to calculate context-based closeness between two text documents, which outperforms the state of the art in terms of accuracy.
Abstract: In many applications of Information Retrieval and Text Mining, there is a need for an intelligent system to calculate the closeness between two text documents. Here, the representation of a text document as a mathematical object plays a vital role. The Vector Space Model is the most popular method for representing a text document in mathematical form, but it is lossy: it loses the ordering of terms in the text document and, in turn, its context. Existing measures of closeness between two text documents, such as Cosine Similarity and Euclidean Distance, are efficient but do not take the context of the document into account. In this paper we propose E-VSM (Enhanced Vector Space Model) to overcome the limitations of the original Vector Space Model, and a new 'Density-based Clustering' approach to calculate context-based closeness between two text documents, which outperforms the state of the art in terms of accuracy. Experiments show good results, especially when the text document to be compared is very close to a particular region of the target text document.

Book ChapterDOI
Dan Bloomberg, Luc Vincent
13 Feb 2013
TL;DR: This chapter is concerned with robust and efficient methods for extracting useful information from document images using maximum a posteriori (MAP) inference, which depend on the accuracy of the statistical models representing the collection of images.
Abstract: The analysis of document images is a difficult and ill-defined task. Unlike the graphics operation of rendering a document into a pixmap, using a structured page-level description such as pdf, the analysis problem starts with the pixmap and attempts to generate a structured description. This description is hierarchical, and typically consists of two interleaved trees, one giving the physical layout of the elements and the other affixing semantic tags. Tag assignment is ambiguous unless the rules determining structure and rendering are tightly constrained and known in advance. Although the graphical rendering process invariably loses structural information, much useful information can be extracted from the pixmaps. Some of that information, such as skew, warp and text orientation detection, is related to the digitization process and is useful for improving the rendering on a screen or paper. The layout hierarchy can be used to reflow the text for small displays or magnified printing. Other information is useful for organizing the information in an index, or for compressing the image data. This chapter is concerned with robust and efficient methods for extracting such useful data. What representation(s) should be used for image analysis? Empirically, a very large set of document image analysis (DIA) problems can be accurately and efficiently addressed with image morphology and related image processing methods. When the image is used as the fundamental representation, and analysis (decisions) are based on nonlinear image operations, many benefits accrue: (1) analysis is very fast, especially if carried out at relevant image scales; (2) analysis retains the image geometry, so that processing errors are obvious, the accuracy of results is visually evident, and the operations are easily improved; (3) alignment between different renderings and resolutions is maintained; (4) pixel labelling is made in parallel by neighbors; (5) sequential (e.g., filling) operations are used where pixels can have arbitrarily long-range effects; (6) pixel groupings are easily determined; (7) segmentation output is naturally represented using masks; (8) implementation is simplified because only a relatively small number of imaging operations must be implemented efficiently; (9) applications can use both shape and texture, at multiple resolutions, to label pixels; and (10) the statistical properties of pixels and sets of pixels can be used to make robust estimation. Table 1 depicts document image analysis (DIA) as occupying a high to intermediate position in terms of constraints, which depend on the accuracy of the statistical models representing the collection of images. Bayesian statistical models are the most constrained. Analysis is performed by generation from the models, using maximum a posteriori (MAP) inference. These techniques have been used for OCR [7] and for locating textlines [6], and can be implemented efficiently using heuristics despite the fact that they require matching all templates at all possible locations [9].
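As a small, generic example of the morphological, image-as-representation style the chapter advocates (not code from the chapter), text blocks can be pulled out of a binarized page with a couple of closings; the structuring-element sizes are arbitrary and content-dependent.

```python
# Morphological block extraction: close the binarized page with a wide structuring
# element so words merge into line masks, then close vertically to merge lines into
# block masks. Segmentation output is naturally represented as a mask.
import cv2

def text_block_mask(gray_page):
    _, binary = cv2.threshold(gray_page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Horizontal closing joins characters into words/lines; vertical closing joins lines.
    line_mask = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                                 cv2.getStructuringElement(cv2.MORPH_RECT, (31, 1)))
    block_mask = cv2.morphologyEx(line_mask, cv2.MORPH_CLOSE,
                                  cv2.getStructuringElement(cv2.MORPH_RECT, (1, 15)))
    return block_mask
```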

Book ChapterDOI
TL;DR: This chapter describes several approaches that have been proposed to use learning algorithms to analyze the layout of digitized documents, using supervised classifiers to label the objects in the document image according to physical or logical categories.
Abstract: In this chapter we describe several approaches that have been proposed to use learning algorithms to analyze the layout of digitized documents. Layout analysis encompasses all the techniques that are used to infer the organization of the page layout of document images. From a physical point of view, the layout can be described as composed of blocks, in most cases rectangular, that are arranged in the page and contain homogeneous content, such as text, vectorial graphics, or illustrations. From a logical point of view, text blocks can have a different meaning on the basis of their content and their position in the page. For instance, in the case of technical papers, blocks can correspond to the title, author, or abstract of the paper. The learning algorithms adopted in this domain are often related to supervised classifiers that are used at various processing levels to label the objects in the document image according to physical or logical categories. The classification can be performed for individual pixels, for regions, or even for whole pages. The different approaches adopted for using supervised classifiers in layout analysis are analyzed in this chapter.

Patent
06 Dec 2013
TL;DR: In this article, a document processing device (1) includes a character information extracting unit (13) that extracts character information from document image data; a feature character string extracting unit that extracts, as document name candidate character strings, a given number of character strings indicative of features of the document image data from the character information extracted by the character information extracting unit (13); and an output condition acquiring unit (15) that, when the document image data is processed by one of multiple processing methods involving an output of a document name of the document image data, acquires an output condition required for the output of the document name.
Abstract: A document processing device (1) includes: a character information extracting unit (13) that extracts character information from document image data; a feature character string extracting unit (14) that extracts, as document name candidate character strings, a given number of character strings indicative of features of the document image data from the character information extracted by the character information extracting unit (13); an output condition acquiring unit (15) that, when the document image data is processed by one of multiple processing methods involving an output of a document name of the document image data, acquires an output condition required for the output of the document name of the document image data; and a document name generating unit (15) that generates the document name complying with a character condition corresponding to the output condition from the document name candidate character strings.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper proposes a generalized scheme for detection and removal of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves from a scanned document page.
Abstract: The performance of an OCR system is badly affected by the presence of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves. Such annotation lines are usually drawn by a reader in free hand in order to summarize some text or to mark the keywords within a document page. In this paper, we propose a generalized scheme for detection and removal of these hand-drawn annotations from a scanned document page. An underline drawn by hand is roughly horizontal or has a tolerable undulation, whereas for a hand-drawn curved line, the slope usually changes at a gradual pace. Based on this observation, we detect the cover of an annotation object, be it straight or curved, as a sequence of straight edge segments. The novelty of the proposed method lies in its ability to compute the exact cover of the annotation object, even when it touches or passes through any text character. After getting the annotation cover, an effective method of inpainting is used to quantify the regions where text reconstruction is needed. We have experimented with various documents written in English, and some results are presented here to show the efficiency and robustness of the proposed method.
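For the special case of roughly horizontal underlines only, the detect-and-inpaint idea can be sketched as below with a morphological opening and OpenCV's inpainting; the paper's method instead computes an exact cover from straight edge segments and also handles curved annotations, which this sketch does not.

```python
# Simplified underline removal: extract long thin horizontal runs with a morphological
# opening, use them as a mask, and inpaint the masked region. Assumes an 8-bit
# grayscale page; the minimum line length is an arbitrary assumption.
import cv2

def remove_underlines(gray_page, min_line_length=80):
    _, binary = cv2.threshold(gray_page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (min_line_length, 1))
    underline_mask = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Slightly grow the mask so inpainting covers the full stroke thickness.
    underline_mask = cv2.dilate(underline_mask,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    return cv2.inpaint(gray_page, underline_mask, 3, cv2.INPAINT_TELEA)
```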

Proceedings ArticleDOI
Zongyi Liu, Ray Smith
25 Aug 2013
TL;DR: This paper presents an equation detector built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, and it has been built into the open source Tesseract that can be accessed and used by the OCR community.
Abstract: Detecting equation regions from scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts so we can not simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so that the OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, (ii) it has been built into the open source Tesseract that can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books/magazines/newspapers of over thirty languages. And we show that Tesseract performance is improved after enabling the detector.
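The density criterion can be sketched as follows: a region whose recognized text contains a high enough proportion of math-specific symbols is flagged as an equation region; the symbol set and threshold below are illustrative assumptions, not Tesseract's actual implementation.

```python
# Flag a region as an equation region when the density of math-specific symbols in
# its recognized text exceeds a threshold (illustrative symbol set and threshold).
MATH_SYMBOLS = set("=+-*/^_{}()[]<>|∑∫√±×÷≤≥≠∞αβγλπθ")

def is_equation_region(ocr_text, density_threshold=0.25):
    chars = [c for c in ocr_text if not c.isspace()]
    if not chars:
        return False
    density = sum(c in MATH_SYMBOLS for c in chars) / len(chars)
    return density >= density_threshold
```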