Showing papers on "Document layout analysis published in 2014"

PDF

Open Access

Proceedings Article•DOI•

Convolutional Neural Networks for Document Image Classification

[...]

Le Kang¹, Jayant Kumar¹, Peng Ye¹, Yi Li², David Doermann¹ - Show less +1 more•Institutions (2)

University of Maryland, College Park¹, NICTA²

24 Aug 2014

TL;DR: Equipped with rectified linear units and trained with dropout, this CNN performs well even when document layouts present large inner-class variations, and experiments on public challenging datasets demonstrate the effectiveness of the proposed approach.

...read moreread less

Abstract: This paper presents a Convolutional Neural Network (CNN) for document image classification. In particular, document image classes are defined by the structural similarity. Previous approaches rely on hand-crafted features for capturing structural information. In contrast, we propose to learn features from raw image pixels using CNN. The use of CNN is motivated by the the hierarchical nature of document layout. Equipped with rectified linear units and trained with dropout, our CNN performs well even when document layouts present large inner-class variations. Experiments on public challenging datasets demonstrate the effectiveness of the proposed approach.

...read moreread less

137 citations

Proceedings Article•DOI•

Page Segmentation for Historical Handwritten Document Images Using Color and Texture Features

[...]

Kai Chen¹, Hao Wei¹, Jean Hennebert¹, Rolf Ingold¹, Marcus Liwicki¹ - Show less +1 more•Institutions (1)

University of Fribourg¹

01 Sep 2014

TL;DR: A physical structure detection method for historical handwritten document images by classifying each pixel as either periphery, background, text block, or decoration, which achieves high quality segmentation without any assumption of specific topologies and shapes.

...read moreread less

Abstract: In this paper we present a physical structure detection method for historical handwritten document images. We considered layout analysis as a pixel labeling problem. By classifying each pixel as either periphery, background, text block ,o rdecoration, we achieve high quality segmentation without any assumption of specific topologies and shapes. Various color and texture features such as color variance, smoothness, Laplacian, Local Binary Patterns, and Gabor Dominant Orientation Histogram are used for classification. Some of these features have so far not got many attentions for document image layout analysis. By applying an Improved Fast Correlation-Based Filter feature selection algorithm, the redundant and irrelevant features are removed. Finally, the segmentation results are refined by a smoothing post-processing procedure. The proposed method is demonstrated by exper- iments conducted on three different historical handwritten document image datasets. Experiments show the benefit of combining various color and texture features for classification. The results also show the advantage of using a feature selection method to choose optimal feature subset. By applying the proposed method we achieve superior accuracy compared with earlier work on several datasets, e.g., we achieved 93% accuracy compared with 91% of the previous method on the Parzival dataset which contains about 100 million pixels. Keywords-page segmentation; historical document; layout analysis; feature selection;

...read moreread less

44 citations

Journal Article•DOI•

Unsupervised document structure analysis of digital scientific articles

[...]

Stefan Klampfl, Michael Granitzer¹, Kris Jack, Roman Kern²•Institutions (2)

University of Passau¹, Graz University of Technology²

08 Jun 2014-International Journal on Digital Libraries

TL;DR: This work has developed a processing pipeline that analyses the structure a PDF document using a number of unsupervised machine learning techniques and heuristics and shows that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents.

...read moreread less

Abstract: Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles is highly varying across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.

...read moreread less

36 citations

Proceedings Article•DOI•

A Coarse-to-Fine Approach for Layout Analysis of Ancient Manuscripts

[...]

Abedelkadir Asi¹, Rafi Cohen¹, Klara Kedem¹, Jihad El-Sana¹, Its'hak Dinstein¹ - Show less +1 more•Institutions (1)

Ben-Gurion University of the Negev¹

15 Dec 2014

TL;DR: A learning-free approach to detect the main text area in ancient manuscripts that outperforms another state-of-the-art page segmentation method in terms of segmentation quality and time performance.

...read moreread less

Abstract: Many applications along the manuscript analysis pipeline rely on the accuracy of pre-processing steps. Perfectly detecting the main text area in ancient historical documents is of great importance for these applications. We propose a learning-free approach to detect the main text area in ancient manuscripts. First, we coarsely segment the main text area by using a texture-based filter. Then, we refine the segmentation by formulating the problem as an energy minimization task and achieving the minimum using graph cuts. The energy function is derived from properties of the text components. Spatial coherence of the segmented text regions is explicitly encouraged by the energy function. We evaluate the suggested method on a publicly available dataset of 38 historical document images. Experiments show that the suggested approach outperforms another state-of-the-art page segmentation method in terms of segmentation quality and time performance.

...read moreread less

35 citations

Patent•

Automated document recognition, identification, and data extraction

[...]

Jorgen van Deventer, Michael Hagen, Istvan Mandak

25 Aug 2014

TL;DR: A method for automated document recognition, identification, and data extraction is described in this paper, where an image of a document associated with a user is analyzed using optical character recognition to obtain image data, wherein the image data includes text zones.

...read moreread less

Abstract: A method for automated document recognition, identification, and data extraction is described herein The method comprises receiving, by the processor, an image of a document associated with a user The image is analyzed using optical character recognition to obtain image data, wherein the image data includes text zones Based on the image data, the image is compared to one or more document templates Based on the comparison, a document template having the highest degree of coincidence with the image is determined The text zones of the image are associated with text zones of the document template to determine a type of data in each text zone The data is structured into a standard format to obtain structured data

...read moreread less

33 citations

Journal Article•DOI•

A new thresholding algorithm for document images based on the perception of objects by distance

[...]

Rafael G. Mesquita¹, Carlos A. B. Mello¹, L. H. E. V. Almeida¹•Institutions (1)

Federal University of Pernambuco¹

01 Apr 2014-Computer-Aided Engineering

TL;DR: A new method to enhance and binarize document images with several kind of degradation is proposed, based on the idea that by the absolute difference between a document image and its background it is possible to effectively emphasize the text and attenuate degraded regions.

...read moreread less

Abstract: In this work a new method to enhance and binarize document images with several kind of degradation is proposed. The method is based on the idea that by the absolute difference between a document image and its background it is possible to effectively emphasize the text and attenuate degraded regions. To generate the background of a document our work was inspired on the human visual system and on the perception of objects by distance. Snellen's visual acuity notation was used to define how far an image must be from an observer so that the details of the characters are not perceived anymore, remaining just the background. A scheme that combines k-means clustering algorithm and Otsu's thresholding method is also used to perform binarization. The proposed method has been tested on two different datasets of document images DIBCO 2011 and a real historical document image dataset with very satisfactory results.

...read moreread less

29 citations

Proceedings Article•DOI•

A Typed and Handwritten Text Block Segmentation System for Heterogeneous and Complex Documents

[...]

Panos Barlas, Sébastien Adam, Clément Chatelain, Thierry Paquet

07 Apr 2014

TL;DR: A Document Image Analysis system able to extract homogeneous typed and handwritten text regions from complex layout documents of various types based on two connected component classification stages that successively discriminate text/non text and typed/handwritten shapes.

...read moreread less

Abstract: This paper presents a Document Image Analysis (DIA) system able to extract homogeneous typed and handwritten text regions from complex layout documents of various types. The method is based on two connected component classification stages that successively discriminate text/non text and typed/handwritten shapes, followed by an original block segmentation method based on white rectangles detection. We present the results obtained by the system during the first competition round of the MAURDOR campaign.

...read moreread less

29 citations

Proceedings Article•DOI•

Hybrid Feature Selection for Historical Document Layout Analysis

[...]

Hao Wei¹, Kai Chen¹, Rolf Ingold¹, Marcus Liwicki¹•Institutions (1)

University of Fribourg¹

15 Dec 2014

TL;DR: A novel hybrid feature selection method for historical Document Image Analysis (DIA) using Adapted greedy forward selection and genetic selection are used in a cascading way and some features, e.g., Gradient, Laplacian, and local binary patterns (LBP), are selected by most of the feature selection methods.

...read moreread less

Abstract: In this paper we propose a novel hybrid feature selection method for historical Document Image Analysis (DIA). Adapted greedy forward selection and genetic selection are used in a cascading way. We apply the proposed method to the task of historical document layout analysis on three handwritten datasets of diverse nature. The documents contain complex layouts, different handwriting styles, and several results of decay. The task is to segment each page into four areas: periphery, background, text block, and decoration. The proposed method selected significantly less features and resulted in significantly lower error rates than using all features. Compared to several conventional feature selection methods, the proposed method is competitive with respect to the number of selected features and the resultant error rates. In addition, we found that some features, e.g., Gradient, Laplacian, and local binary patterns (LBP), are selected by most of the feature selection methods and we give some explanations. This finding suggests a clue for the layout analysis on handwritten documents in general.

...read moreread less

25 citations

Proceedings Article•DOI•

Automatic article extraction in old newspapers digitized collections

[...]

David Hebert¹, Thomas Palfray¹, Stéphane Nicolas¹, Pierrick Tranouez¹, Thierry Paquet¹ - Show less +1 more•Institutions (1)

University of Rouen¹

19 May 2014

TL;DR: This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, allowing analyzing any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities.

...read moreread less

Abstract: We present a complete method for article segmentation in old newspapers, which deals with complex layouts analysis of degraded documents. The designed workflow can process large amounts of documents and generates digital objects in METS/ALTO format in order to facilitate the indexing and the browsing of information in digital libraries. The analysis of the document image is performed by a two stages scheme. Pixels are labeled in a first stage with a Conditional Random Field model in order to intent to label the areas of interest with a low logical level. Then this first logical representation of the document content is analyzed in a second stage to get a higher logical representation including article segmentation and reading order. This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, allowing analyzing any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities. This method which benefits from both a local analysis using a probabilistic model trained using machine learning procedures, and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts, to prove its effectiveness.

...read moreread less

24 citations

Proceedings Article•DOI•

Text Line Segmentation for Handwritten Documents Using Constrained Seam Carving

[...]

Xi Zhang¹, Chew Lim Tan¹•Institutions (1)

National University of Singapore¹

15 Dec 2014

TL;DR: A constrained seam carving method is proposed, which can constrain energy to be passed along the connected components in the same text line as much as possible, and tries to extract all the text lines by computing the energy map only once.

...read moreread less

Abstract: This paper proposes a language-independent method for segmenting text lines from handwritten document images. Our method is based on the seam carving, which has been already used for text line segmentation, but in order to tolerate multi-skewed text lines even in the same document image, we propose a constrained seam carving method, which can constrain energy to be passed along the connected components in the same text line as much as possible. Moreover, our proposed method tries to extract all the text lines by computing the energy map only once. In the experiments, our method is tested on the Greek, English and Indian document images, and get 98.41% FM score.

...read moreread less

24 citations

Proceedings Article•DOI•

Table Extraction from Document Images using Fixed Point Model

[...]

Anukriti Bansal¹, Gaurav Harit², Sumantra Dutta Roy¹•Institutions (2)

Indian Institute of Technology Delhi¹, Indian Institute of Technology, Jodhpur²

14 Dec 2014

TL;DR: A novel learning-based framework to identify tables from scanned document images as a structured labeling problem, which learns the layout of the document and labels its various entities as table header, table trailer, table cell and non-table region is presented.

...read moreread less

Abstract: The paper presents a novel learning-based framework to identify tables from scanned document images. The approach is designed as a structured labeling problem, which learns the layout of the document and labels its various entities as table header, table trailer, table cell and non-table region. We develop features which encode the foreground block characteristics and the contextual information. These features are provided to a fixed point model which learns the inter-relationship between the blocks. The fixed point model attains a contraction mapping and provides a unique label to each block. We compare the results with Condition Random Fields(CRFs). Unlike CRFs, the fixed point model captures the context information in terms of the neighbourhood layout more efficiently. Experiments on the images picked from UW-III (University of Washington) dataset, UNLV dataset and our dataset consisting of document images with multicolumn page layout, show the applicability of our algorithm in layout analysis and table detection.

...read moreread less

Journal Article•DOI•

Texture sparseness for pixel classification of business document images

[...]

Melissa Cote¹, Alexandra Branzan Albu¹•Institutions (1)

University of Victoria¹

01 Sep 2014-International Journal on Document Analysis and Recognition

TL;DR: A novel document image classification approach that distributes individual pixels into four fundamental classes (text, image, graphics, and background) through support vector machines using a novel low-dimensional feature descriptor based on textural properties.

...read moreread less

Abstract: Contemporary business documents contain diverse, multi-layered mixtures of textual, graphical, and pictorial elements. Existing methods for document segmentation and classification do not handle well the complexity and variety of contents, geometric layout, and elemental shapes. This paper proposes a novel document image classification approach that distributes individual pixels into four fundamental classes (text, image, graphics, and background) through support vector machines. This approach uses a novel low-dimensional feature descriptor based on textural properties. The proposed feature vector is constructed by considering the sparseness of the document image responses to a filter bank on a multi-resolution and contextual basis. Qualitative and quantitative evaluations on business document images show the benefits of adopting a contextual and multi-resolution approach. The proposed approach achieves excellent results; it is able to handle varied contents and complex document layouts, without imposing any constraint or making assumptions about the shape and spatial arrangement of document elements.

...read moreread less

Analysis of the Logical Layout of Documents.

[...]

Andreas Dengel, Faisal Shafait

01 Jan 2014

TL;DR: This chapter provides a comprehensive review of the state of the art in the field of automated document understanding, highlights key methods developed for different target applications, and provides practical recommendations for designing a document understanding system for the problem at hand.

...read moreread less

Abstract: Automatic document understanding is one of the most important tasks when dealing with printed documents since all post-ordered systems require the captured but process-relevant data. Analysis of the logical layout of documents not only enables an automatic conversion into a semantically marked-up electronic representation but also reveals options for developing higher-level functionality like advanced search (e.g., limiting search to titles only), automatic routing of business letters, automatic processing of invoices, and developing link structures to facilitate navigation through books. Over the last three decades, a number of techniques have been proposed to address the challenges arising in logical layout analysis of documents originating from many different domains. This chapter provides a comprehensive review of the state of the art in the field of automated document understanding, highlights key methods developed for different target applications, and provides practical recommendations for designing a document understanding system for the problem at hand.

...read moreread less

Patent•

Method and system of pre-analysis and automated classification of documents

[...]

Irina Filimonova, Sergey Zlobin, Andrey Myakutin

25 Jun 2014

TL;DR: In this article, an image of a form or document is captured and an automatic classifier determines possible features and calculates a range of feature values and possible other feature parameters for each type or class of document.

...read moreread less

Abstract: Automatic classification of different types of documents is disclosed. An image of a form or document is captured. The document is assigned to one or more type definitions by identifying one or more objects within the image of the document. A matching model is selected via identification of the document image. In the case of multiple identifications, a profound analysis of the document type is performed—either automatically or manually. An automatic classifier may be trained with document samples of each of a plurality of document classes or document types where the types are known in advance or a system of classes may be formed automatically without a priori information about types of samples. An automatic classifier determines possible features and calculates a range of feature values and possible other feature parameters for each type or class of document. A decision tree, based on rules specified by a user, may be used for classifying documents. Processing, such as optical character recognition (OCR), may be used in the classification process.

...read moreread less

Patent•

Detection and Reconstruction of East Asian Layout Features in a Fixed Format Document

[...]

Drazen Zaric¹, Milan Sesum¹, Milos Lazarevic¹, Milos Raskovic¹•Institutions (1)

Microsoft¹

28 Feb 2014

TL;DR: In this article, a fixed format document is detected and rotated for layout analysis, and the rotated text is rotated back and restructured in a flow format document, which is used to detect East Asian layout features.

...read moreread less

Abstract: Detection of East Asian layout features and reconstruction of East Asian layout features is provided. Vertically written text in the fixed format document is detected and rotated for layout analysis. After layout analysis, the rotated text is rotated back and restructured in a flow format document. When a plurality of characters is written horizontally in a vertical line of text, vertically overlapping text runs are detected, designated as horizontal-in-vertical text, and are restructured as horizontal-in-vertical text in a flow format document. Lines of text are analyzed for attributes of a ruby line and are designated as ruby text, associated with corresponding text in a ruby base line, and restructured as ruby text in a flow format document. Text in a fixed format document is analyzed for detection of a particular East Asian language so that a font for the language is designated in a flow format document.

...read moreread less

Proceedings Article•DOI•

Document image skew detection and correction method based on extreme points

[...]

Marian Wagdy, Ibrahima Faye, Dayang Rohaya

03 Jun 2014

TL;DR: The main idea of this method is based on the concept that any document image has objects with rectangular shape such as paragraphs, text lines, tables and figures that can be bounded by rectangles, which represents the angle of document skew.

...read moreread less

Abstract: In this paper we present a method for estimating the document image skew angle. The main idea of this method is based on the concept that any document image has objects with rectangular shape such as paragraphs, text lines, tables and figures. These objects can be bounded by rectangles. We use the extreme point's properties to obtain the corners of the rectangle which fits the largest connected component of the document image. The angle of this rectangle represents the angle of document skew. The experimental results show the high performance of the algorithm in detecting the angle of skew for a variety of documents with different levels of complexity.

...read moreread less

Patent•

Methods, systems, and media for guiding user reading on a screen

[...]

Mark Sybren Tigchelaar¹•Institutions (1)

Chamber of commerce¹

06 May 2014

TL;DR: In this article, a method for guiding user reading on a screen is provided, the method comprising: determining a reading speed of a user, receiving a selection of a document having an original layout to be read, setting a reading time for the document, formatting the selected document for presentation to the user on the screen, presenting the formatted document using the original document layout, dividing text in the formatted text into blocks that include a plurality of words, reformatting the blocks based on the layout of the formatted documents and punctuation of the document to include at least one word, such

...read moreread less

Abstract: In some embodiments, a method for guiding user reading on a screen is provided, the method comprising: determining a reading speed of a user; receiving a selection of a document having an original layout to be read; setting a reading speed for the document; formatting the selected document for presentation to the user on a screen; presenting the formatted document using the original document layout; dividing text in the formatted document into blocks that include a plurality of words; reformatting the blocks based on the layout of the formatted document and punctuation of the document to include at least one word, such that each reformatted block includes less than a predetermined number of characters and the at least one word of the reformatted block is on a single line in the text of the document; and presenting guidance to the user within the formatted document at the set reading speed.

...read moreread less

Proceedings Article•DOI•

First maurdor 2013 evaluation campaign in scanned document image processing

[...]

Ilya Oparin, Juliette Kahn, Olivier Galibert

04 May 2014

TL;DR: This campaign aims at evaluating the complete chain of scanned document image processing, which has a modular structure that includes page segmentation and zone classification, identification of writing type and language, optical character recognition and revealing logical structure of a document.

...read moreread less

Abstract: This paper presents the results of the first Maurdor evaluation campaign. This campaign aims at evaluating the complete chain of scanned document image processing. It has a modular structure that includes page segmentation and zone classification, identification of writing type and language, optical character recognition and revealing logical structure of a document. This campaign is based on a unique corpus of 8,000 images of scanned documents annotated at different levels. Presentation of the results of the first campaign is important to assess the state-of-the-art and create common references both for participants in future campaigns and, as the scoring tools are publicly available, for independent tests.

...read moreread less

Patent•

Content-based document image classification

[...]

Anatoly Smirnov, Vasily Vladimirovich Panferov, Andrey Anatolievich Isaev

16 Dec 2014

TL;DR: In this article, a method of classifying one or more document images based on content thereof using a device with a processor is presented, which includes accessing a set of features stored in memory and analysing the document image to determine the arrangement of blocks.

...read moreread less

Abstract: FIELD: physics.SUBSTANCE: disclosed is a method of classifying one or more document images based on content thereof using a device with a processor. The method includes a step of obtaining a document image. Further, the method includes accessing a set of features stored in memory and analysing the document image to determine the arrangement of blocks. The method also includes recognising a document image using an optical symbol recognition technique to obtain digital content data representing textual content or potential graphical content.EFFECT: high efficiency of classifying documents based on predetermined features.27 cl, 3 dwg

...read moreread less

Proceedings Article•DOI•

Investigation of feature selection for historical document layout analysis

[...]

Hao Wei¹, Kai Chen¹, Anguelos Nicolaou¹, Marcus Liwicki¹, Rolf Ingold¹ - Show less +1 more•Institutions (1)

University of Fribourg¹

01 Oct 2014

TL;DR: It is found that L BP features are consistently selected by all feature selection methods on all three datasets, which indicates that LBP correlate highly with the pixel classes much more than any other type of features does.

...read moreread less

Abstract: In this paper we investigate the importance of individual features for the task of document layout analysis, in particular for the classification of the document pixels. The feature set consists of numerous state-of-the-art features, including color, gradient, and local binary patterns (LBP). To deal with the high dimensionality of the feature set, we propose a cascade of an adapted forward selection and a genetic selection. We have evaluated our feature selection method on three historical document datasets. For the classification we used machine learning methods which classify each pixel into either periphery, background, text block, or decoration. The proposed cascading feature selection method reduced the number of features significantly while preserving the cross-validation performance. Furthermore, it selected less features with comparable performance, compared with the conventional feature selection methods. In our analysis we found that LBP features are consistently selected by all feature selection methods on all three datasets. This indicates that LBP correlate highly with the pixel classes much more than any other type of features does. These findings suggest a clue in paradigm for document layout analysis in general.

...read moreread less

Patent•

System and method for document classification based on semantic analysis of the document

[...]

Venkat Srinivasan

24 Dec 2014

TL;DR: In this paper, a computer-based method and system for classifying a document into one or more categories is presented. But the method is not suitable for the classification of large documents.

...read moreread less

Abstract: A computer based method and system for classifying a document into one or more categories. The method and system can be configured to identify one or more cluster of clauses or sentences from a plurality of semantically similar clauses of the document and determine one or more representative concepts for each cluster of the document. Accordingly, one or more categories for the document are determined from the one or more representative concepts and the document is classified into the one or more categories.

...read moreread less

Proceedings Article•DOI•

Logical Labeling of Fixed Layout PDF Documents Using Multiple Contexts

[...]

Xin Tao¹, Zhi Tang¹, Canhui Xu¹, Yongtao Wang¹•Institutions (1)

Peking University¹

07 Apr 2014

TL;DR: The modeling of contextual information based on 2D Conditional Random Fields is proposed to learn page structure for born-digital fixed-layout documents by integrating local and contextual observations obtained from PDF attributes, and the ambiguities of semantic labels are better resolved.

...read moreread less

Abstract: The task of logical structure recovery is known to be of crucial importance, yet remains unsolved not only for image based document but also for born-digital document system. In this work, the modeling of contextual information based on 2D Conditional Random Fields is proposed to learn page structure for born-digital fixed-layout documents. Heuristic prior knowledge of Portable Document Format (PDF) content and layout are interpreted to construct neighborhood graphs and various pair wise clique templates for the modeling of multiple contexts. By integrating local and contextual observations obtained from PDF attributes, the ambiguities of semantic labels are better resolved. Experimental comparisons for six types of clique templates has demonstrated the benefits of contextual information in logical labeling of 16 finely defined categories.

...read moreread less

Patent•

Systems and Methods for Determining Text Complexity

[...]

Michael Flor¹, Beata Beigman Klebanov¹•Institutions (1)

Princeton University¹

14 Feb 2014

TL;DR: In this article, computer-implemented systems and methods are provided for determining a document's complexity, and an association profile can be created for the document using the association measures and the computer can use the association profile to determine the complexity of the document.

...read moreread less

Abstract: Computer-implemented systems and methods are provided for determining a document's complexity. For example, a computer performing the complexity analysis can receive a document. The computer can determine the content words within the document and determine an association measure for each group of content words. An association profile can be created for the document using the association measures. The computer can use the association profile to determine the complexity of the document. The complexity of the document may correspond to the document's suitable reading level or, if the document is an essay, an essay score.

...read moreread less

Text extraction from images: a survey

[...]

Kumary R Soumya, Athulya Chacko

01 Jan 2014

TL;DR: This paper presents a comparative study and performance evaluation of various text extraction techniques and concludes that text extraction without characters recognition capabilities is to extract regions just contains text.

...read moreread less

Abstract: Text extraction is one of the key tasks in document image analysis. Automatic text extraction without characters recognition capabilities is to extract regions just contains text. The text extraction process includes detection, localization, segmentation and enhancement of the text from the given input image. In this paper we present a comparative study and performance evaluation of various text extraction techniques.

...read moreread less

Proceedings Article•DOI•

Document layout analysis for Indian newspapers using contour based symbiotic approach

[...]

Vijay Singh¹, Bhupendra Kumar¹•Institutions (1)

Centre for Development of Advanced Computing¹

16 Oct 2014

TL;DR: In this methodology various image morphological operations, contour analysis, connected component analysis, projection analysis are employed for the realization and the results have been evaluated by number of blocks detected and taking their correct ordering information into account.

...read moreread less

Abstract: Document layout analysis is necessary process for automated document recognition systems. Document layout analysis identifies, categorizes and labels the semantics of text blocks for meaningful information retrieval from document images. Our primary target document includes various newspaper and magazine pages which are having complex layout without following any static rules. We propose an effective approach for document layout analysis where power of bottom up approach and top-down approach i.e. region growing and segmentation respectively, have been utilized simultaneously. In this methodology various image morphological operations, contour analysis, connected component analysis, projection analysis are employed for the realization. The proposed algorithm has been successfully implemented and applied over a large number of Indian script newspaper and magazine pages. The results have been evaluated by number of blocks detected and taking their correct ordering information into account.

...read moreread less

Patent•

Cloud service for hospital form auto filling system

[...]

Kaoru Watanabe¹•Institutions (1)

Ricoh¹

28 Feb 2014

TL;DR: In this paper, the authors present a technique for acquiring form document data representing a form document, parsing the form documents to extract one or more data information components, identifying a database that stores text strings in association with data text labels, and generating filled form documents data of a filled form document.

...read moreread less

Abstract: Techniques are provided for acquiring form document data representing a form document; parsing the form document data of the form document to extract one or more data information components; identifying a database that stores one or more text strings in association with one or more data text labels; and generating filled form document data of a filled form document. The generating is performed by: for each data information component, of the one or more data information components of the form document, having an associated data text label: based on, at least in part, the associated data text label, retrieving, from the database, a text string that is associated with the data text label; and inserting the text string into a data text field of the data information component at the data text field location.

...read moreread less

Proceedings Article•DOI•

Computer-assisted transcription of a historical botanical specimen book: organization and process overview

[...]

Vicent Bosch¹, Isabel Bordes-Cabrera, Paloma Cuenca Muñoz², Celio Hernández-Tornero¹, Luis A. Leiva¹, Moisés Pastor¹, Verónica Romero¹, Alejandro Héctor Toselli¹, Enrique Vidal¹ - Show less +5 more•Institutions (2)

Polytechnic University of Valencia¹, Complutense University of Madrid²

19 May 2014

TL;DR: A protocol designed for computer-assisted transcribing a XVII century botanical specimen book, based on Handwritten Text Recognition (HTR) technology, is described and all the data produced will be made publicly available for research and development.

...read moreread less

Abstract: We describe a protocol designed for computer-assisted transcribing a XVII century botanical specimen book, based on Handwritten Text Recognition (HTR) technology. Here we focus on the organization and coordination aspects of this protocol and outline related technical issues. Using the proposed protocol, full ground truth data has been produced for the first book chapter and high-quality transcripts are being cost-effectively obtained for the rest of the approximately 1000 pages of the book. The process encompasses two main, computer-assisted steps; namely, image layout analysis and transcription. Layout analysis is based on a semi-supervised incremental approach and transcription makes use of an interactive-predictive HTR prototype known as CATTI. Currently, the first step of this procedure has been completed for the full book and the second step is close to be finished. Ultimately, all the data produced will be made publicly available for research and development.

...read moreread less

Proceedings Article•DOI•

EM-Based Layout Analysis Method for Structured Documents

[...]

Francisco Cruz¹, Oriol Ramos Terrades¹•Institutions (1)

Autonomous University of Barcelona¹

24 Aug 2014

TL;DR: This paper proposed an EM-based algorithm to fit a set of Gaussian mixtures to the different regions according to the logical distribution along the page and evaluated its method in the task of record detection in a collection of historical structured documents.

...read moreread less

Abstract: In this paper we present a method to perform layout analysis in structured documents. We proposed an EM-based algorithm to fit a set of Gaussian mixtures to the different regions according to the logical distribution along the page. After the convergence, we estimate the final shape of the regions according to the parameters computed for each component of the mixture. We evaluated our method in the task of record detection in a collection of historical structured documents and performed a comparison with other previous works in this task.

...read moreread less

Journal Article•DOI•

Illustrations Segmentation in Digitized Documents Using Local Correlation Features

[...]

Dalia Coppi¹, Costantino Grana¹, Rita Cucchiara¹•Institutions (1)

University of Modena and Reggio Emilia¹

01 Jan 2014

TL;DR: An approach for Document Layout Analysis based on local correlation features that identifies and extracts illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions.

...read moreread less

Abstract: In this paper we propose an approach for Document Layout Analysis based on local correlation features. We identify and extract illustrations in digitized documents by learning the discriminative patterns of textual and pictorial regions. The proposal has been demonstrated to be effective on historical datasets and to outperform the state-of-the-art in presence of challenging documents with a large variety of pictorial elements.

...read moreread less

Proceedings Article•DOI•

A TaLISMAN: automatic text and line segmentation of historical manuscripts

[...]

Ruggero Pintus, Ying Yang¹, Enrico Gobbetti, Holly Rushmeier¹•Institutions (1)

Yale University¹

06 Oct 2014

TL;DR: A completely automatic algorithm is presented to perform a robust text segmentation of old handwritten manuscripts on a per-book basis, and it is shown how to exploit this outcome to find two layout elements, i.e., text blocks and text lines.

...read moreread less

Abstract: Historical and artistic handwritten books are valuable cultural heritage (CH) items, as they provide information about tangible and intangible cultural aspects from the past. Massive digitization projects have made these kind of data available to a world-wide population, and pose real challenges for automatic processing. In this scenario, document layout analysis plays a significant role, being a fundamental step of any document image understanding system. In this paper, we present a completely automatic algorithm to perform a robust text segmentation of old handwritten manuscripts on a per-book basis, and we show how to exploit this outcome to find two layout elements, i.e., text blocks and text lines. Our proposed technique have been evaluated on a large and heterogeneous corpus content, and our experimental results demonstrate that this approach is efficient and reliable, even when applied to very noisy and damaged books.

...read moreread less