
Showing papers on "Document layout analysis published in 2013"


Journal ArticleDOI
TL;DR: This paper proposes a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents.
Abstract: Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

62 citations
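As a rough illustration of the segment-based idea described above (clustering segment-level portions of documents and inducing a document-level organization from them), a minimal sketch might look like the following. The paragraph splitting, TF-IDF features, KMeans and the majority-vote assignment are all assumptions of this sketch, not the paper's actual framework.

```python
# Minimal sketch of segment-based document clustering (not the paper's exact method).
# Assumptions: documents are plain strings, segments are paragraphs, features are TF-IDF,
# and a document is assigned to the cluster that most of its segments fall into.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents_by_segments(documents, n_clusters=5):
    # Split every document into paragraph-level segments.
    segments, owner = [], []
    for doc_id, text in enumerate(documents):
        for part in text.split("\n\n"):
            if part.strip():
                segments.append(part)
                owner.append(doc_id)

    # Cluster the segments, not the whole documents.
    features = TfidfVectorizer(stop_words="english").fit_transform(segments)
    seg_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

    # Induce a document-level organization: majority vote over each document's segment clusters.
    doc_labels = {}
    for doc_id in set(owner):
        votes = [label for label, o in zip(seg_labels, owner) if o == doc_id]
        doc_labels[doc_id] = Counter(votes).most_common(1)[0][0]
    return doc_labels
```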


Proceedings ArticleDOI
25 Aug 2013
TL;DR: An objective comparative evaluation of layout analysis methods for scanned historical newspapers indicates that there is a convergence to a certain methodology with some variations in the approach, but there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical newspapers.
Abstract: This paper presents an objective comparative evaluation of layout analysis methods for scanned historical newspapers. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2013 and the 2nd International Workshop on Historical Document Imaging and Processing (HIP2013), presenting the results of the evaluation of five submitted methods. Two state-of-the-art systems, one commercial and one open-source, are also evaluated for comparison. Two scenarios are reported in this paper, one evaluating the ability of methods to accurately segment regions and the other evaluating the whole pipeline of segmentation and region classification (with a text extraction goal). The results indicate that there is a convergence to a certain methodology with some variations in the approach. However, there is still a considerable need to develop robust methods that deal with the idiosyncrasies of historical newspapers.

47 citations


01 Jan 2013
TL;DR: Noise in scanned document images, which reduces the accuracy of subsequent OCR (Optical Character Recognition) tasks, is reviewed, and some noise removal methods are discussed.
Abstract: Document images may be contaminated with noise during transmission, scanning, or conversion to digital form. We can categorize noises by identifying their features and can search for similar patterns in a document image to choose appropriate methods for their removal. After a brief introduction, this paper reviews the types of noise that might appear in scanned document images and discusses some noise removal methods. Nowadays, with the increasing use of computers in everyday life, the ability to convert documents into digital, readable formats has become a necessity. Scanning documents is a way of changing printed documents into digital format. A common problem encountered when scanning documents is noise, which can occur in an image because of paper quality or the typing machine used, or can be introduced by scanners during the scanning process. Noise removal is one of the steps in preprocessing. Among other things, noise reduces the accuracy of subsequent tasks of OCR (Optical Character Recognition) systems. It can appear in the foreground or background of an image and can be generated before or after scanning. Examples of noise in scanned document images are as follows. Ruled page lines are a source of noise that interferes with text objects. Marginal noise usually appears as a large dark region around the document image and can be textual or non-textual. Some forms of clutter noise appear in an image because of document skew during scanning or holes punched in the document; background noise includes uneven contrast, show-through effects, interfering strokes, and background spots. Next, we discuss each type in detail.

39 citations
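For illustration only, a couple of the generic noise-removal steps alluded to above (filtering salt-and-pepper spots and dropping small clutter components) might be sketched as follows; the kernel size and area threshold are arbitrary assumptions, not values from the paper.

```python
# Rough illustration of two common noise-removal steps for scanned pages
# (median filtering for salt-and-pepper noise, area filtering for small clutter).
# The thresholds below are arbitrary and would need tuning per collection.
import cv2
import numpy as np

def remove_simple_noise(gray_page, min_component_area=15):
    # Salt-and-pepper style background spots: a small median filter.
    smoothed = cv2.medianBlur(gray_page, 3)

    # Binarize (text as white foreground) and drop tiny connected components.
    _, binary = cv2.threshold(smoothed, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = np.zeros_like(binary)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_component_area:
            cleaned[labels == i] = 255
    return cleaned
```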


Proceedings ArticleDOI
25 Aug 2013
TL;DR: The proposed method is a bottom-up approach that fuses words so as to globally minimize their fusing distance; to improve processing time and further layout analysis, text lines are represented by oriented rectangles.
Abstract: Text line detection is a pre-processing step for automated document analysis such as word spotting or OCR. It is additionally used for document structure analysis or layout analysis. Considering mixed layouts, degraded documents and handwritten documents, text line detection is still challenging. We present a novel approach that targets torn documents having varying layouts and writing. The proposed method is a bottom-up approach that fuses words so as to globally minimize their fusing distance. In order to improve processing time and further layout analysis, text lines are represented by oriented rectangles. Even though the method was designed for modern handwritten and printed documents, tests on medieval manuscripts give promising results. Additionally, the text line detection was evaluated on the ICDAR 2009 and ICFHR 2010 Handwriting Segmentation Contest datasets.

27 citations
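A much reduced sketch of the bottom-up idea (fusing nearby words and describing each resulting line by an oriented rectangle) could look like this; the greedy centroid-distance fusion and the fixed threshold are simplifications, not the paper's global minimization.

```python
# Simplified bottom-up word fusion into text lines, represented as oriented rectangles.
# This greedy, threshold-based fusion is only an approximation of the idea; the paper
# minimizes the fusing distance globally, which is not reproduced here.
import cv2
import numpy as np

def fuse_words_into_lines(word_boxes, max_gap=40.0):
    """word_boxes: list of (x, y, w, h) word bounding boxes."""
    centers = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in word_boxes])
    parent = list(range(len(word_boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Fuse any two words whose centres are closer than max_gap (union-find).
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < max_gap:
                parent[find(i)] = find(j)

    # Represent each fused group by an oriented rectangle over its corner points.
    groups = {}
    for idx, (x, y, w, h) in enumerate(word_boxes):
        pts = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
        groups.setdefault(find(idx), []).extend(pts)
    return [cv2.minAreaRect(np.array(pts, dtype=np.float32)) for pts in groups.values()]
```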


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper proposes a novel text line extraction method for historical documents that takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach.
Abstract: This paper proposes a novel text line extraction method for historical documents. The method works in two steps. In the first step, layout analysis is performed to recognize the physical structure of a given document using a classification technique, more precisely the pixels of a coloured document image are classified into five classes: text-block, core-text-line, decoration, background, and periphery. This layout recognition is achieved by a cascade of two Dynamic Multilayer Perceptron (DMLP) classifiers and works without binarisation. In the second step, an algorithm takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach. Finally, the algorithm refines the boundaries of the text lines using the binary image and the layout recognition results. Our system is evaluated on three historical manuscripts with a test set of 49 pages. The best obtained hit rate for text lines is 96.3%.

26 citations
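The second step (grouping a per-pixel layout labelling into text lines via connected components) might be sketched as below; the integer label codes and the use of scipy.ndimage are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the second step: extract text lines from a per-pixel layout labelling
# by running connected-component analysis on the "core-text-line" class.
# The integer label code is an arbitrary assumption for this illustration.
import numpy as np
from scipy import ndimage

CORE_TEXT_LINE = 2  # hypothetical code for the core-text-line class

def extract_text_line_boxes(label_map):
    """label_map: 2-D array of per-pixel class labels."""
    core_mask = (label_map == CORE_TEXT_LINE)
    components, n = ndimage.label(core_mask)
    boxes = []
    for sl in ndimage.find_objects(components):
        if sl is not None:
            y, x = sl
            boxes.append((x.start, y.start, x.stop - x.start, y.stop - y.start))
    return boxes  # one (x, y, w, h) box per detected text line
```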


01 Jan 2013
TL;DR: A framework for classifying document image retrieval approaches is proposed, and these approaches are then evaluated based on important measures.
Abstract: During the last decades, due to advances in information technology and communication and the increase in the volume of printed documents in many applications, document image databases have become increasingly important. Document images are documents that normally begin on paper and are then scanned electronically, moving towards a paperless office in which documents are stored as images. Document image retrieval is an important research area in the field of document image databases. Many approaches have been proposed for indexing and retrieving document images. Traditionally, optical character recognition (OCR) has been used to completely convert the manuscript into an electronic version which can be indexed automatically. Keyword spotting has then been proposed for indexing in document image retrieval; it has a lower cost than OCR. However, both methods have problems when indexing document images with non-text components. Three approaches have been presented to solve this problem: the signature-based approach, the layout-structural approach, and the logo-based approach. In this paper we propose a framework for classifying document image retrieval approaches, and we then evaluate these approaches based on important measures.

19 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: The details of the ICDAR2013 Document Image Skew Estimation Contest are described including the evaluation measures used as well as the performance of the twelve methods submitted by ten different groups along with a short description of each method.
Abstract: The detection and correction of document skew is one of the most important document image analysis steps. The ICDAR2013 Document Image Skew Estimation Contest (DISEC'13) is the first contest which is dedicated to record recent advances in the field of skew estimation using well established evaluation performance measures on a variety of printed document images. The benchmarking dataset that is used contains 1550 images that were obtained from various sources such as newspapers, scientific books and dictionaries. The document images contain figures, tables, diagrams, architectural plans, electrical circuits and they are written in various languages such as English, Chinese and Greek. This paper describes the details of the contest including the evaluation measures used as well as the performance of the twelve methods submitted by ten different groups along with a short description of each method.

19 citations
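The contest paper describes the benchmark rather than a particular algorithm, but for context a classical projection-profile skew estimator, unrelated to any submitted method, can be sketched as follows; the angle range and step are arbitrary.

```python
# Classical projection-profile skew estimation, shown only as a generic baseline
# for the task benchmarked by DISEC'13 (it is not one of the submitted methods).
import numpy as np
from scipy import ndimage

def estimate_skew(binary_page, angle_range=5.0, step=0.1):
    """binary_page: 2-D array with text pixels as 1/True. Returns angle in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = ndimage.rotate(binary_page.astype(float), angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # horizontal projection profile
        score = np.var(profile)         # sharp peaks/valleys when text lines are level
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```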


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents a method to separate the textual and non-textual components in document images using graph-based modeling and structural analysis, a fast and efficient way to adequately separate the graphical and textual parts of a document.
Abstract: Page segmentation into text and non-text elements is an essential preprocessing step before the optical character recognition (OCR) operation. In case of poor segmentation, an OCR classification engine produces garbage characters due to the presence of non-text elements. This paper presents a method to separate the textual and non-textual components in document images using graph-based modeling and structural analysis. This is a fast and efficient method to adequately separate the graphical and the textual parts of a document. We have evaluated our method on two well-known subsets: the UW-III dataset and the ICDAR 2009 page segmentation competition dataset. Comparisons are made with two state-of-the-art methods, and the results show that our method achieves better performance on this task.

19 citations
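A heavily simplified sketch of the graph-based intuition (connected components as nodes, links between nearby components, and a text/non-text decision from neighbourhood regularity) is given below; the heuristics and thresholds are assumptions and do not reproduce the paper's structural analysis.

```python
# Highly simplified sketch of graph-based text / non-text separation: connected
# components become nodes, nearby components are linked, and a component is called
# "text" if it has enough similarly-sized neighbours. The thresholds are arbitrary
# and this is not the structural analysis used in the paper.
import cv2
import numpy as np

def split_text_nontext(binary, max_link_dist=50.0, min_text_neighbours=2):
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
    heights = stats[1:, cv2.CC_STAT_HEIGHT].astype(float)
    pts = centroids[1:]
    text_flags = np.zeros(len(pts), dtype=bool)
    for i in range(len(pts)):
        d = np.linalg.norm(pts - pts[i], axis=1)
        neighbours = np.where((d > 0) & (d < max_link_dist))[0]
        similar = [j for j in neighbours if 0.5 < heights[j] / max(heights[i], 1.0) < 2.0]
        text_flags[i] = len(similar) >= min_text_neighbours
    return text_flags  # True = text component, False = non-text (component i+1 in `labels`)
```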


Proceedings ArticleDOI
06 Dec 2013
TL;DR: Novel features based on centroid fluctuation information of non-homogeneous regions are proposed to more appropriately characterize both displayed formulas and embedded formulas in heterogeneous document images that may contain figures, tables, text, and math formulas.
Abstract: This paper presents mathematical formula detection in heterogeneous document images that may contain figures, tables, text, and math formulas. We adopt the method originally proposed for sign detection in natural images to detect non-homogeneous regions and accordingly achieve text line detection and segmentation. Novel features based on centroid fluctuation information of non-homogeneous regions are proposed to more appropriately characterize both displayed formulas and embedded formulas. By comparing the proposed method with previous works, we demonstrate the effectiveness of the proposed features.

18 citations
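One possible reading of a centroid-fluctuation feature, measuring how much the vertical centroids of a line's symbols deviate from the line centroid, is sketched below; the normalisation by median symbol height is an assumption of this illustration, not the paper's exact feature.

```python
# Math-heavy lines (sub/superscripts, fraction bars) tend to have symbol centroids
# that fluctuate vertically more than plain text lines. This is an illustration of
# that idea, not the feature definition used in the paper.
import numpy as np

def centroid_fluctuation(component_centroids_y, component_heights):
    """Inputs: per-symbol vertical centroids and heights for one text line."""
    ys = np.asarray(component_centroids_y, dtype=float)
    hs = np.asarray(component_heights, dtype=float)
    line_centroid = ys.mean()
    # Normalise offsets by the median symbol height so the feature is scale-free.
    offsets = (ys - line_centroid) / max(np.median(hs), 1.0)
    return float(np.std(offsets))
```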


Proceedings ArticleDOI
25 Aug 2013
TL;DR: A novel evaluation approach for reading order results generated by layout analysis methods, incorporating region correspondence analysis, is proposed, and a sophisticated reading order representation scheme is presented and used by the system.
Abstract: Reading order detection and representation is an important task in many digitisation scenarios involving the preservation of the logical structure of a document. The corresponding need for the evaluation of reading order results generated by layout analysis methods poses a particular challenge due to potential deviations between ground truth and actually detected segmentation of the page. To this end a novel evaluation approach that responds to this problem by incorporating region correspondence analysis is proposed. Furthermore, a sophisticated reading order representation scheme is presented and used by the system allowing the grouping of objects with ordered and/or unordered relations. This is a typical requirement for documents with complex layouts such as magazines and newspapers. The evaluation method has been validated using the results of two state-of-the-art OCR / layout analysis systems and a basic top-to-bottom reading order detection algorithm applied on representative samples from the PRImA contemporary and the IMPACT historical document datasets.

16 citations


Proceedings ArticleDOI
04 Feb 2013
TL;DR: A fast automatic layout segmentation of old document images based on five descriptors, which is parameter-free since it automatically adapts to the image content, and defines a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of the method.
Abstract: Recent progress in the digitization of heterogeneous collections of ancient documents has rekindled new challenges in information retrieval in digital libraries and document layout analysis. Therefore, in order to control the quality of historical document image digitization and to meet the need for a characterization of their content using intermediate-level metadata (between image and document structure), we propose a fast automatic layout segmentation of old document images based on five descriptors. Those descriptors, based on the autocorrelation function, are obtained by multiresolution analysis and used afterwards in a specific clustering method. The method proposed in this article has the advantage that it is performed without any hypothesis on the document structure, either about the document model (physical structure) or the typographical parameters (logical structure). It is also parameter-free since it automatically adapts to the image content. In this paper, firstly, we detail our proposal to characterize the content of old documents by extracting the autocorrelation features in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find the homogeneous regions defined by similar indices of autocorrelation, without knowledge of the number of clusters, using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our method, we apply our algorithm on 316 old document images, which encompass six centuries (1200-1900) of French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%. These results help to represent a document by a hierarchy of layout structure and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of homogeneous blocks and their topology.
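A rough sketch of autocorrelation-based block descriptors computed at several resolutions, in the spirit of the multiresolution analysis described above, is given below; the pooling of the autocorrelation surface into a fixed-length vector is an assumption of this illustration, not the paper's five descriptors.

```python
# Autocorrelation-based block descriptors at multiple resolutions (illustrative only).
import numpy as np
import cv2

def autocorrelation(block):
    # 2-D autocorrelation via the Wiener-Khinchin theorem, normalised by the zero lag.
    f = np.fft.fft2(block - block.mean())
    acf = np.fft.ifft2(f * np.conj(f)).real
    zero_lag = acf[0, 0] + 1e-9
    return np.fft.fftshift(acf) / zero_lag

def block_descriptor(gray_block, levels=3):
    feats = []
    img = gray_block.astype(float)
    for _ in range(levels):
        acf = autocorrelation(img)
        h, w = acf.shape
        # Keep a few coarse statistics of the central row/column of the ACF.
        feats += [acf[h // 2, :].std(), acf[:, w // 2].std(), acf.mean(), acf.std()]
        img = cv2.pyrDown(img)  # next (coarser) resolution
    return np.array(feats)
```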

Patent
19 Mar 2013
TL;DR: In this paper, a PDF document recognition method is proposed, which comprises the following steps: S1: analyzing path objects in a PDF document and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document and recognizing text content; S3: writing the recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment.
Abstract: The invention discloses a PDF document recognition method. The method comprises the steps as follows: S1: analyzing path objects in a PDF document, and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document, and recognizing text content in the PDF document; S3: writing recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment. By the aid of the PDF document recognition method, objects such as the forms, paragraphs, titles, lists and the like in the PDF document can be recognized, so that the PDF document can be edited with one paragraph as the unit, labels can be added to the PDF document conveniently, the reading order can be determined, and persons with dysopia can read conveniently; meanwhile, documents in other formats can be exported according to the recognition results, so that users can read and edit the PDF document conveniently.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: It is argued that a key-region detector designed to take into account the special characteristics of document images can result in the detection of fewer and more meaningful key-regions.
Abstract: In this paper we argue that a key-region detector designed to take into account the special characteristics of document images can result in the detection of fewer and more meaningful key-regions. We propose a fast key-region detector able to capture aspects of the structural information of the document, and demonstrate its efficiency by comparing against standard detectors in an administrative document retrieval scenario. We show that using the proposed detector results in a smaller number of detected key-regions and higher performance, without any drop in speed, compared to standard state-of-the-art detectors.

Proceedings ArticleDOI
04 Feb 2013
TL;DR: This paper proposes a machine learning based super resolution framework for low resolution document image OCR, using a document page segmentation algorithm and a modified K-means clustering algorithm to reconstruct from a low-resolution document image a better resolution image and improve OCR results.
Abstract: Optical character recognition is widely used for converting document images into digital media. Existing OCR algorithms and tools produce good results from high resolution, good quality, document images. In this paper, we propose a machine learning based super resolution framework for low resolution document image OCR. Two main techniques are used in our proposed approach: a document page segmentation algorithm and a modified K-means clustering algorithm. Using this approach, by exploiting coherence in the document, we reconstruct from a low resolution document image a better resolution image and improve OCR results. Experimental results show substantial gain in low resolution documents such as the ones captured from video.
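The coherence idea (repeated glyphs in a document can be pooled to build cleaner exemplars) can be sketched as follows with plain K-means; this is not the modified K-means algorithm of the paper, and the crop size, cluster count and upsampling factor are arbitrary assumptions.

```python
# Rough sketch of exploiting in-document coherence for super resolution: crop the
# glyphs (connected components), cluster them with K-means, and average the
# upsampled members of each cluster to obtain a cleaner exemplar per cluster.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def coherent_glyph_exemplars(glyph_crops, scale=2, n_clusters=64, size=(16, 16)):
    """glyph_crops: list of small grayscale glyph images from the low-resolution page."""
    big = (size[0] * scale, size[1] * scale)
    low = np.stack([cv2.resize(g, size).astype(float).ravel() for g in glyph_crops])
    high = np.stack([cv2.resize(g, big, interpolation=cv2.INTER_CUBIC).astype(float)
                     for g in glyph_crops])
    labels = KMeans(n_clusters=min(n_clusters, len(glyph_crops)),
                    n_init=10, random_state=0).fit_predict(low)
    # Average the upsampled members of each cluster to get a cleaner exemplar.
    exemplars = {k: high[labels == k].mean(axis=0) for k in np.unique(labels)}
    return labels, exemplars  # paste exemplars[labels[i]] back at glyph i's position
```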

Proceedings ArticleDOI
Canhui Xu, Zhi Tang, Xin Tao, Yun Li, Cao Shi
TL;DR: To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents.
Abstract: To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Digital-born documents have inherent advantages, such as representing text and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis with the inherent metadata provided by a PDF parser, the page primitives including text, image and path elements are processed to produce text and non-text layers for respective analysis. The graph-based method is developed at the superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal's algorithm, and edge orientation is then used in a bottom-up manner to extract text lines from each subtree. On the other hand, non-textual objects are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purposes. Experimental results on selected pages from PDF books are presented.
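The top-down step described above (a spanning tree over text elements cut by Euclidean distance) can be sketched as follows; scipy's MST routine stands in for Kruskal's algorithm, and the cut threshold is an arbitrary assumption.

```python
# Sketch of the top-down step: build a graph over text-element centres, take its
# minimum spanning tree, and cut edges longer than a threshold to obtain groups.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def group_text_elements(centres, cut_distance=60.0):
    """centres: (N, 2) array of text-element centre coordinates."""
    dist = squareform(pdist(np.asarray(centres, dtype=float)))
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    mst[mst > cut_distance] = 0.0                  # cut long edges (top-down split)
    adjacency = csr_matrix((mst + mst.T) > 0)
    n_groups, labels = connected_components(adjacency, directed=False)
    return labels  # group index per text element
```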

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A new method of ground-truth estimation using multispectral (MS) imaging representation space for the sake of document image binarization and based on the cooperation of multiple classifiers under some constraints is proposed.
Abstract: Human ground-truthing is the manual labelling of samples (pixels for example) to generate reference data without any automatic algorithm help. Although a manual ground-truth is more accurate than a machine ground-truth, it still suffers from mislabeling and/or judgement errors. In this paper we propose a new method of ground-truth estimation using multispectral (MS) imaging representation space for the sake of document image binarization. Starting from the initial manual ground-truth, the proposed classification method aims to select automatically some samples with correct labels (well-labeled pixels) from each class for the training phase, then reassign new labels to the document image pixels. The classification scheme is based on the cooperation of multiple classifiers under some constraints. A real data set of MS historical document images and their ground-truth is created to demonstrate the effectiveness of the proposed method of ground-truth estimation.

Proceedings ArticleDOI
27 Dec 2013
TL;DR: The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text and allow for hierarchical clustering with different feature subsets in different layers.
Abstract: In this paper a semi-automated document image clustering and retrieval approach is presented to create links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe the visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text and allow for hierarchical clustering with different feature subsets in different layers.

Proceedings ArticleDOI
01 Sep 2013
TL;DR: An incremental learning method for document image and zone classification based on a reject utility in order to reject ambiguous zones or documents in an industrial context where the system faces a large variability of digitized administrative documents.
Abstract: We present an incremental learning method for document image and zone classification. We consider an industrial context where the system faces a large variability of digitized administrative documents that become available progressively over time. Each new incoming document is segmented into physical regions (zones) which are classified according to a zone-model. We represent the document by means of its classified zones and we classify the document according to a document-model. The classification relies on a reject utility in order to reject ambiguous zones or documents. Models are updated by incrementally learning each new document and its extracted zones. We validate the method on real administrative document images and we achieve a recognition rate of more than 92%.
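A minimal sketch of a reject rule of the kind described above, rejecting a zone or document when the best class probability is too low, is given below; the threshold and the probabilistic classifier are placeholders, not the paper's reject utility or incremental models.

```python
# Minimal reject-option classification on top of any probabilistic zone/document classifier.
import numpy as np

def classify_with_reject(model, features, reject_threshold=0.7):
    """model: any classifier exposing predict_proba (e.g. a scikit-learn estimator)."""
    proba = model.predict_proba(features)
    best = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    labels = np.where(confidence >= reject_threshold, best, -1)  # -1 means "rejected"
    return labels, confidence
```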

Proceedings ArticleDOI
25 Aug 2013
TL;DR: It is shown that the different algorithms yield very different results depending on the type of documents, that two of them are consistently better than the others, and that the Zone Map metric provides greater detail on the error types.
Abstract: Even though numerous text line detection algorithms have been proposed, the algorithms are usually compared on a single database and according to a single metric. In this paper, we study the performance of four different text line detection algorithms, on four databases containing very different documents, and according to three metrics (Zone Map, ICDAR and recognition error rate). Our goal is to provide a more comprehensive empirical evaluation of handwritten text line detection methods and to identify what the key points in the evaluation are. We show that the different algorithms yield very different results depending on the type of documents and that two of them are consistently better than the others. We also show that the Zone Map and the ICDAR metrics are strongly correlated, but the Zone Map metric provides greater detail on the error types. Finally we show that the geometric metrics are correlated with the recognition error rate on easy-to-segment databases, but this has to be confirmed on difficult documents.

Proceedings ArticleDOI
10 Sep 2013
TL;DR: This paper presents an improved approach for automatically laying out content onto a document page where the number and size of the items are unknown in advance; an analytical approximation for text placement is presented, refined by curve fitting over TeX-generated data.
Abstract: This paper presents an improved approach for automatically laying out content onto a document page, where the number and size of the items are unknown in advance. Our solution leverages earlier results from Oliveira (2008) wherein layouts are modeled by a guillotine partitioning of the page. The benefit of such method is its efficiency and ability to place as many items on a page as desired. In our model, items have flexible representations and texts may freely change their font sizes to fit a particular area of the page. As a consequence, the optimization goal is to find a layout that produces the least noticeable difference between font sizes, in order to obtain the most aesthetically pleasing layout. Finding the best areas for text requires knowledge of how typesetting engines actually render text for a particular setting. As such, we also model the behavior of the TeX typesetting engine when computing the height to be occupied by a text block as a function of the font size, text length and line width. An analytical approximation for text placement is then presented, refined by using curve fitting over TeX-generated data. As a practical result, the resulting layouts for a newspaper generation application are also presented. Finally, we discuss these results and directions for further research.
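The height-modelling step can be sketched as a curve fit over measurements taken from TeX-rendered samples; the smooth functional form below is an assumption chosen for illustration, not the analytical approximation derived in the paper.

```python
# Fit the height occupied by a text block as a function of font size, text length
# and line width. The functional form (height ≈ a·n·f²/W + b·f, i.e. a smoothed
# "number of lines × line height") is an illustrative assumption.
import numpy as np
from scipy.optimize import curve_fit

def block_height(x, a, b):
    n_chars, font_size, line_width = x
    return a * n_chars * font_size**2 / line_width + b * font_size

def fit_height_model(samples):
    """samples: iterable of (n_chars, font_size_pt, line_width_pt, measured_height_pt),
    measured for example from TeX-rendered test documents."""
    n, f, w, h = (np.array(col, dtype=float) for col in zip(*samples))
    (a, b), _ = curve_fit(block_height, (n, f, w), h)
    return a, b  # predicted height: block_height((n, f, w), a, b)
```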

Journal ArticleDOI
TL;DR: The warped text and figures in documents are restored successfully by the proposed method, which is efficient and fast enough to be implemented in the module of a digital photocopier.
Abstract: Warp problems usually make documents hard to recognize. Specifically, when we copy a page of a thick book or bound document with a digital photocopier, the resulting image is usually warped because of the thickness of the document. We focus on this problem and propose a fast method to restore the warped document image in this paper. The text rectangle area is one of the features of a document. A morphological operation is utilized for text rectangle area segmentation. The DLT method is used to compute the mapping relations between the warped document and the non-warped document. In the experimental results, the proposed method works on high-resolution images very quickly. The warped text and figures in documents are restored by the proposed method successfully. The method is efficient and fast enough to be implemented in the module of a digital photocopier.
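The mapping step can be illustrated with a standard DLT-style homography between the corners of a detected warped text rectangle and an axis-aligned target; the corner detection and output size are assumed inputs, and a production photocopier module would need a finer curl model than a single homography.

```python
# Estimate a DLT homography from the four corners of a detected (warped) text
# rectangle to an axis-aligned target rectangle and unwarp the page with it.
import cv2
import numpy as np

def unwarp_text_area(image, warped_corners, out_w, out_h):
    """warped_corners: 4x2 array ordered top-left, top-right, bottom-right, bottom-left."""
    src = np.asarray(warped_corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w - 1, 0], [out_w - 1, out_h - 1], [0, out_h - 1]],
                   dtype=np.float32)
    H, _ = cv2.findHomography(src, dst, method=0)   # plain DLT, no robust estimator
    return cv2.warpPerspective(image, H, (out_w, out_h))
```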

Proceedings ArticleDOI
10 Sep 2013
TL;DR: The Conditional Random Field (CRF) model is used to perform OFR, and the effectiveness of this approach is demonstrated on a set of 616 fonts.
Abstract: Automated publishing systems require large databases containing document page layout templates. Most of these layout templates are created manually. A lower cost alternative is to extract document page layouts from existing documents. In order to extract the layout from a scanned document image, it is necessary to perform Optical Font Recognition (OFR) since the font is an important element in layout design. In this paper, we use the Conditional Random Field (CRF) model to perform OFR. First, we extract typographical features of the text. Then, we train the probabilistic model using a log-linear parameterization of CRF. The advantage of using CRF is that it does not assume that the typographical features are independent of each other. We demonstrate the effectiveness of this approach on a set of 616 fonts.
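A sketch of a log-linear CRF over the words of a text line, labelled with font classes, is shown below; sklearn-crfsuite is used here as a convenient stand-in implementation, and the typographical features are hypothetical placeholders rather than the paper's feature set.

```python
# Log-linear CRF over word sequences labelled with font classes (illustrative only).
# The feature names below (x_height, stroke_width, slant, has_serifs) are hypothetical.
import sklearn_crfsuite

def word_features(word):
    # `word` is assumed to be a dict of measurements for one word image.
    return {
        "x_height": word["x_height"],
        "stroke_width": word["stroke_width"],
        "slant": word["slant"],
        "has_serifs": float(word["has_serifs"]),
    }

def train_font_crf(lines, labels):
    """lines: list of word-measurement sequences; labels: list of font-name sequences."""
    X = [[word_features(w) for w in line] for line in lines]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    return crf  # crf.predict(X_new) yields one font label per word
```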

Book ChapterDOI
01 Jan 2013
TL;DR: A novel approach to segmenting text and non-text components in Malayalam handwritten document images using the Simplified Fuzzy ARTMAP (SFAM) classifier is proposed; the results are promising, and the approach can be extended to other scripts.
Abstract: Segmentation of a document image into text and non-text regions is an essential process in document layout analysis, which is one of the preprocessing steps in optical character recognition. Handwritten documents usually have no specific layout and may contain non-text regions such as diagrams, graphics, and tables. In this work we propose a novel approach to segmenting text and non-text components in Malayalam handwritten document images using a Simplified Fuzzy ARTMAP (SFAM) classifier. The binarized document image is dilated horizontally and vertically, and the results are merged. Connected component labelling is then performed on the smeared image. A set of geometrical and statistical features is extracted from each component and given to SFAM to classify it into a text or non-text component. Experimental results are promising, and the approach can be extended to other scripts.
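The preprocessing and feature-extraction pipeline described above (binarize, dilate horizontally and vertically, merge, label connected components, compute geometric and statistical features) might be sketched as follows; the structuring-element sizes are arbitrary, and the SFAM classifier itself is not reproduced here.

```python
# Pipeline up to feature extraction: binarize, dilate horizontally and vertically,
# merge the two smeared images, label connected components, and compute a few
# geometric/statistical features per component. Any classifier could consume these.
import cv2
import numpy as np

def component_features(gray):
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    horiz = cv2.dilate(binary, cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1)))
    vert = cv2.dilate(binary, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25)))
    smeared = cv2.bitwise_or(horiz, vert)

    n, labels, stats, _ = cv2.connectedComponentsWithStats(smeared, connectivity=8)
    features = []
    for i in range(1, n):
        x, y, w, h, area = stats[i]
        density = binary[y:y + h, x:x + w].mean() / 255.0   # ink density inside the box
        aspect = w / float(h)
        features.append((x, y, w, h, area, aspect, density))
    return features  # e.g. feed to a classifier to decide text vs non-text per component
```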

Patent
28 Mar 2013
TL;DR: In this paper, a system for recording a document with a camera-based mobile radio device and for converting textual information in the document into a format for suitable presentation on the mobile device is described.
Abstract: Systems may be provided for recording a document with a camera-based mobile radio device and for converting textual information in the document into a format for suitable presentation on the mobile device. A document may be recorded by the mobile device in an image. A layout structure may be recognized with a text block in the image. Character text in the text block may be recognized by OCR. An order of the text blocks may be determined by taking into account the layout structure. A suitable format for presenting the character texts on the mobile device's display may be selected. The format may be adapted to a width of the display so that during reading of the character texts on the display, substantially only vertical scrolling is necessary. A file may be generated and displayed in the format with the character texts in the determined order of the text blocks.

Proceedings ArticleDOI
21 Mar 2013
TL;DR: E-VSM, an Enhanced Vector Space Model, is proposed to overcome the limitations of the original Vector Space Model, together with a new 'Density-based Clustering' approach to calculate context-based closeness between two text documents, which outperforms the state of the art in terms of accuracy.
Abstract: In many applications of Information Retrieval and Text Mining, there is a need for an intelligent system to calculate the closeness between two text documents. Here, the representation of a text document as a mathematical object plays a vital role. The Vector Space Model is the most popular method for representing a text document in mathematical form, but it is lossy: it loses the ordering of terms in the text document and, in turn, its context. Existing measures of closeness between two text documents, such as Cosine Similarity and Euclidean Distance, are efficient but do not take the context of the document into account. In this paper we propose E-VSM (Enhanced Vector Space Model) to overcome the limitations of the original Vector Space Model, and a new 'Density-based Clustering' approach to calculate context-based closeness between two text documents, which outperforms the state of the art in terms of accuracy. Experiments show good results, especially when the text document to be compared is very close to a particular region of the target text document.

Book ChapterDOI
Dan Bloomberg, Luc Vincent
13 Feb 2013
TL;DR: This chapter is concerned with robust and efficient methods for extracting useful information from document images using maximum a posteriori (MAP) inference, which depend on the accuracy of the statistical models representing the collection of images.
Abstract: The analysis of document images is a difficult and ill-defined task. Unlike the graphics operation of rendering a document into a pixmap, using a structured page-level description such as pdf, the analysis problem starts with the pixmap and attempts to generate a structured description. This description is hierarchical, and typically consists of two interleaved trees, one giving the physical layout of the elements and the other affixing semantic tags. Tag assignment is ambiguous unless the rules determining structure and rendering are tightly constrained and known in advance. Although the graphical rendering process invariably loses structural information, much useful information can be extracted from the pixmaps. Some of that information, such as skew, warp and text orientation detection, is related to the digitization process and is useful for improving the rendering on a screen or paper. The layout hierarchy can be used to reflow the text for small displays or magnified printing. Other information is useful for organizing the information in an index, or for compressing the image data. This chapter is concerned with robust and efficient methods for extracting such useful data. What representation(s) should be used for image analysis? Empirically, a very large set of document image analysis (DIA) problems can be accurately and efficiently addressed with image morphology and related image processing methods. When the image is used as the fundamental representation, and analysis (decisions) are based on nonlinear image operations, many benefits accrue: (1) analysis is very fast, especially if carried out at relevant image scales; (2) analysis retains the image geometry, so that processing errors are obvious, the accuracy of results is visually evident, and the operations are easily improved; (3) alignment between different renderings and resolutions is maintained; (4) pixel labelling is made in parallel by neighbors; (5) sequential (e.g., filling) operations are used where pixels can have arbitrarily long-range effects; (6) pixel groupings are easily determined; (7) segmentation output is naturally represented using masks; (8) implementation is simplified because only a relatively small number of imaging operations must be implemented efficiently; (9) applications can use both shape and texture, at multiple resolutions, to label pixels; and (10) the statistical properties of pixels and sets of pixels can be used to make robust estimation. Table 1 depicts document image analysis (DIA) as occupying a high to intermediate position in terms of constraints, which depend on the accuracy of the statistical models representing the collection of images. Bayesian statistical models are the most constrained. Analysis is performed by generation from the models, using maximum a posteriori (MAP) inference. These techniques have been used for OCR [7] and for locating textlines [6], and can be implemented efficiently using heuristics despite the fact that they require matching all templates at all possible locations [9].
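As a small, generic example of the morphological, image-as-representation style the chapter advocates (not code from the chapter), text blocks can be pulled out of a binarized page with a couple of closings; the structuring-element sizes are arbitrary and content-dependent.

```python
# Morphological block extraction: close the binarized page with a wide structuring
# element so words merge into line masks, then close vertically to merge lines into
# block masks. Segmentation output is naturally represented as a mask.
import cv2

def text_block_mask(gray_page):
    _, binary = cv2.threshold(gray_page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Horizontal closing joins characters into words/lines; vertical closing joins lines.
    line_mask = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                                 cv2.getStructuringElement(cv2.MORPH_RECT, (31, 1)))
    block_mask = cv2.morphologyEx(line_mask, cv2.MORPH_CLOSE,
                                  cv2.getStructuringElement(cv2.MORPH_RECT, (1, 15)))
    return block_mask
```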

Book ChapterDOI
TL;DR: This chapter describes several approaches that have been proposed to use learning algorithms to analyze the layout of digitized documents, using supervised classifiers to label the objects in the document image according to physical or logical categories.
Abstract: In this chapter we describe several approaches that have been proposed to use learning algorithms to analyze the layout of digitized documents. Layout analysis encompasses all the techniques that are used to infer the organization of the page layout of document images. From a physical point of view, the layout can be described as composed of blocks, in most cases rectangular, that are arranged in the page and contain homogeneous content, such as text, vectorial graphics, or illustrations. From a logical point of view, text blocks can have a different meaning on the basis of their content and their position in the page. For instance, in the case of technical papers, blocks can correspond to the title, author, or abstract of the paper. The learning algorithms adopted in this domain are often related to supervised classifiers that are used at various processing levels to label the objects in the document image according to physical or logical categories. The classification can be performed for individual pixels, for regions, or even for whole pages. The different approaches adopted for using supervised classifiers in layout analysis are analyzed in this chapter.

Patent
06 Dec 2013
TL;DR: In this article, a document processing device (1) includes a character information extracting unit (13) that extracts character information from document image data; a feature character string extracting unit that extracts, as document name candidate character strings, a given number of character strings indicative of features of the document image data from the character information extracted by the character information extracting unit (13); and an output condition acquiring unit (15) that, when the document image data is processed by one of multiple processing methods involving an output of a document name of the document image data, acquires an output condition required for the output of the document name.
Abstract: A document processing device (1) includes: a character information extracting unit (13) that extracts character information from document image data; a feature character string extracting unit (14) that extracts, as document name candidate character strings, a given number of character strings indicative of features of the document image data from the character information extracted by the character information extracting unit (13); an output condition acquiring unit (15) that, when the document image data is processed by one of multiple processing methods involving an output of a document name of the document image data, acquires an output condition required for the output of the document name of the document image data; and a document name generating unit (15) that generates the document name complying with a character condition corresponding to the output condition from the document name candidate character strings.

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper proposes a generalized scheme for detection and removal of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves from a scanned document page.
Abstract: The performance of an OCR system is badly affected by the presence of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves. Such annotation lines are usually drawn by a reader in free hand in order to summarize some text or to mark the keywords within a document page. In this paper, we propose a generalized scheme for detection and removal of these hand-drawn annotations from a scanned document page. An underline drawn by hand is roughly horizontal or has a tolerable undulation, whereas for a hand-drawn curved line, the slope usually changes at a gradual pace. Based on this observation, we detect the cover of an annotation object, be it straight or curved, as a sequence of straight edge segments. The novelty of the proposed method lies in its ability to compute the exact cover of the annotation object, even when it touches or passes through any text character. After getting the annotation cover, an effective method of inpainting is used to quantify the regions where text reconstruction is needed. We have experimented with various documents written in English, and some results are presented here to show the efficiency and robustness of the proposed method.
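For the special case of roughly horizontal underlines only, the detect-and-inpaint idea can be sketched as below with a morphological opening and OpenCV's inpainting; the paper's method instead computes an exact cover from straight edge segments and also handles curved annotations, which this sketch does not.

```python
# Simplified underline removal: extract long thin horizontal runs with a morphological
# opening, use them as a mask, and inpaint the masked region. Assumes an 8-bit
# grayscale page; the minimum line length is an arbitrary assumption.
import cv2

def remove_underlines(gray_page, min_line_length=80):
    _, binary = cv2.threshold(gray_page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (min_line_length, 1))
    underline_mask = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Slightly grow the mask so inpainting covers the full stroke thickness.
    underline_mask = cv2.dilate(underline_mask,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    return cv2.inpaint(gray_page, underline_mask, 3, cv2.INPAINT_TELEA)
```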

Proceedings ArticleDOI
Zongyi Liu, Ray Smith
25 Aug 2013
TL;DR: This paper presents an equation detector built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, and it has been built into the open source Tesseract that can be accessed and used by the OCR community.
Abstract: Detecting equation regions from scanned books has received attention in the document image research community in the past few years. Compared with regular text blocks, equation regions have more complicated layouts so we can not simply use text lines to model them. On the other hand, these regions consist of text symbols that can be reflowed, so that the OCR engines should parse them instead of rasterizing them like image regions. In this paper, we present an equation detector with two major contributions: (i) it is built on a simple algorithm that uses the density of special symbols, such that no additional classifier is required, (ii) it has been built into the open source Tesseract that can be accessed and used by the OCR community. The algorithm is tested on the Google Books database with 1534 entries sampled from books/magazines/newspapers of over thirty languages. And we show that Tesseract performance is improved after enabling the detector.
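The density criterion can be sketched as follows: a region whose recognized text contains a high enough proportion of math-specific symbols is flagged as an equation region; the symbol set and threshold below are illustrative assumptions, not Tesseract's actual implementation.

```python
# Flag a region as an equation region when the density of math-specific symbols in
# its recognized text exceeds a threshold (illustrative symbol set and threshold).
MATH_SYMBOLS = set("=+-*/^_{}()[]<>|∑∫√±×÷≤≥≠∞αβγλπθ")

def is_equation_region(ocr_text, density_threshold=0.25):
    chars = [c for c in ocr_text if not c.isspace()]
    if not chars:
        return False
    density = sum(c in MATH_SYMBOLS for c in chars) / len(chars)
    return density >= density_threshold
```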