
Showing papers on "Document layout analysis published in 2009"


Journal ArticleDOI
10 Jun 2009
TL;DR: DocuBurst is a radial, space-filling layout of hyponymy (the IS-A relation), overlaid with occurrence counts of words in a document of interest to provide visual summaries at varying levels of granularity.
Abstract: Textual data is at the forefront of information management problems today. One response has been the development of visualizations of text data. These visualizations, commonly based on simple attributes such as relative word frequency, have become increasingly popular tools. We extend this direction, presenting the first visualization of document content which combines word frequency with the human-created structure in lexical databases to create a visualization that also reflects semantic content. DocuBurst is a radial, space-filling layout of hyponymy (the IS-A relation), overlaid with occurrence counts of words in a document of interest to provide visual summaries at varying levels of granularity. Interactive document analysis is supported with geometric and semantic zoom, selectable focus on individual words, and linked access to source text.

171 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents, with strong emphasis on comprehensive and detailed representation of both complex and simple layouts, and on colour originals.
Abstract: There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.

117 citations


Patent
08 Jan 2009
TL;DR: In this article, a combined image and text document is described, where a scanned image of a document can be generated utilizing a scanning application, and text representations of text that is included in the document can also be generated using a character recognition application.
Abstract: A combined image and text document is described. In embodiment(s), a scanned image of a document can be generated utilizing a scanning application, and text representations of text that is included in the document can be generated utilizing a character recognition application. Position data of the text representations can be correlated with locations of corresponding text in the scanned image of the document. The scanned image can then be rendered for display overlaid with the text representations as a transparent overlay, where the scanned image and the text representations are independently user-selectable for display. A user-selectable input can be received to display the text representations without the scanned image, the scanned image without the text representations, or to display the text representations adjacent the scanned image.

30 citations


Proceedings ArticleDOI
07 Mar 2009
TL;DR: A simple but efficient technique of language identification for Kannada, Hindi and English text lines from a printed document is presented, based on the characteristic features of top-profile and bottom-profile of individual text lines of the input document image.
Abstract: In India, a document may contain text lines in more than one language. For Optical Character Recognition (OCR) of such a multilingual document, it is necessary to identify the different languages present in the input document before feeding it to the OCRs of the individual languages. In this paper, a simple but efficient technique of language identification for Kannada, Hindi and English text lines from a printed document is presented. The proposed system is based on the characteristic features of the top and bottom profiles of individual text lines of the input document image. Feature extraction is achieved by characterizing the behavior of the top and bottom profiles of individual text lines. The system is trained to learn this behavior on a training set of 800 text lines. Ranges of feature values of the top and bottom profiles for all three languages are obtained and stored in a knowledge base for later use during decision-making. For a new text line, the necessary features are extracted from its top and bottom profiles and compared with the stored knowledge base; the line is classified as the language whose range the feature values fall within. The proposed system is tested on 600 text lines and an overall classification accuracy of 96.6% is achieved.

27 citations
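The top- and bottom-profile features the paper above relies on can be illustrated with a short sketch (a simplified reading of the idea, not the authors' code): for a binarized text line, the top profile records, per column, the row of the first ink pixel from the top, and the bottom profile the first from the bottom.

```python
def top_bottom_profiles(line_img):
    """Compute top and bottom profiles of a binarized text line.

    line_img: 2D list of 0/1 values (1 = ink). For each column, the top
    profile records the row of the first ink pixel from the top, and the
    bottom profile the first from the bottom; None marks blank columns.
    """
    rows, cols = len(line_img), len(line_img[0])
    top, bottom = [], []
    for c in range(cols):
        column = [line_img[r][c] for r in range(rows)]
        if 1 in column:
            top.append(column.index(1))
            bottom.append(rows - 1 - column[::-1].index(1))
        else:
            top.append(None)
            bottom.append(None)
    return top, bottom

# A crude "text line" with ink in rows 1-2 of the middle columns:
line = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(top_bottom_profiles(line))  # ([None, 1, 1, None], [None, 2, 2, None])
```

Statistics of these two profiles (how often they touch the headline or baseline, for instance) are what separate scripts such as Devanagari, with its continuous top bar, from Latin text.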


Patent
19 Jun 2009
TL;DR: In this paper, a received markup language document including a structured list of elements is transcoded by a method which includes analyzing the structure of the document, generating a virtual rendering of a layout, and identifying one or more rectangles each containing at least one element from the virtual rendering.
Abstract: A received markup language document including a structured list of elements is transcoded by a method which includes analyzing the structure of the document, generating a virtual rendering of a layout of the document, and identifying one or more rectangles each containing at least one element from the virtual rendering. Data representative of the markup language document is generated, including a list of rectangles and their positions in the layout. The thus transcoded document can be displayed on a device which receives the generated data. When a position or a direction within the document is selected, such device may analyze the layout of the document to select at least one of the rectangles based on the position or direction. The device may then display at least a portion of the document selected such that the identified rectangle is given a predefined position on the display.

23 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A way to semi-automatically generate ground-truthed datasets for newspapers and provide a comprehensive dataset for layout analysis ground truth is proposed.
Abstract: In document image understanding, public datasets with ground truth are an important part of scientific work. They are not only helpful for developing new methods, but also provide a way of comparing performance. Generating these datasets, however, is time-consuming and cost-intensive work, requiring a lot of manual effort. In this paper we both propose a way to semi-automatically generate ground-truthed datasets for newspapers and provide a comprehensive dataset. The focus of this paper is layout analysis ground truth. The proposed two-step approach consists of a module which automatically creates layouts and an image matching module which allows the ground truth information to be mapped from the synthetic layout to the scanned version. In the first step, layouts are generated automatically from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed, scanned and aligned with the synthetic image obtained by rendering the PDF. Finally, the geometric and logical layout ground truth is mapped onto the scanned image.

19 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources are proposed.
Abstract: Layout analysis is a fundamental step in automatic document processing. Many different techniques have been proposed to perform this task. Some follow a top-down approach: they start by identifying the high-level components of the page structure and then recursively split them until basic blocks are found. Bottom-up approaches, on the other hand, start with the smallest elements (e.g., the pixels of a digitized document) and recursively merge them into higher-level components. A first limitation of such methods is that most of them are designed to deal only with digitized documents, and hence are not applicable to native digital documents, which are nowadays pervasive. Furthermore, top-down and most bottom-up methods are able to process only Manhattan-layout documents. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. It was successfully embedded and tested in the DOMINUS document management system.

18 citations


Proceedings ArticleDOI
08 Mar 2009
TL;DR: The problem of recognizing the language of the text content is addressed, although it is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages.
Abstract: In order to reach a larger cross-section of people, it is necessary that a document be composed of text contents in different languages. On the other hand, this causes practical difficulty in OCRing such a document, because the language of the text must be pre-determined before employing a particular OCR. In this research work, this problem of recognizing the language of the text content is addressed. Since it is perhaps impossible to design a single recognizer which can identify a large number of scripts/languages, as a via media we have proposed to work on the prioritized requirements of a particular region. For instance, in Karnataka state in India, generally any document, including official ones, would contain text in three languages: English, the language of general importance; Hindi, the language of national importance; and Kannada, the language of state/regional importance. We have proposed to learn to identify the language of the text by thoroughly understanding the nature of the top and bottom profiles of printed text lines in these three languages. Experimentation involved 800 text lines for learning and 600 text lines for testing. The performance has turned out to be 95.4%.

18 citations


Book ChapterDOI
15 Dec 2009
TL;DR: A model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields, motivated for an automated layout analyser and machine translator for technical papers, and can also be used for other applications such as search, indexing and information retrieval.
Abstract: We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods to classify a pixel of a document image into classes - text, background and image - are often noisy and error-prone, requiring post-processing through heuristic methods. The input to the system is a pixel-wise classification produced by a Fisher classifier operating on the output of a set of Globally Matched Wavelet (GMW) filters. The system extracts features which encode contextual information and spatial configurations of a given document image, and learns relations between these layout entities using hierarchical CRFs. The hierarchical CRF enables learning at various levels: (1) local features for text, background and image areas; (2) contextual features for further classifying region blocks - title, author block, heading, paragraph, etc.; and (3) a probabilistic layout model for encoding global relations between the above blocks for a particular class of documents. Although the work has been motivated by an automated layout analyser and machine translator for technical papers, it can also be used for other applications such as search, indexing and information retrieval.

17 citations


Patent
29 Apr 2009
TL;DR: In this paper, an image matching device includes a section calculating feature points on an input document image, a section that calculates features of the image in accordance with a relative position between the calculated feature points, and sections for comparing the calculated features with features of a reference document image to determine whether the image is similar to the reference image.
Abstract: An image matching device includes a section that calculates feature points on an input document image, a section that calculates features of the input document image in accordance with a relative position between the calculated feature points, and sections for comparing the calculated features of the input document image with features of a reference document image to determine whether the input document image is similar to the reference document image. When it is determined that the input and reference documents are similar, a document discrimination section determines the position of an image that is on the input document and similar to the reference document image, in accordance with the coordinates of the feature points on the input document and on the reference document.

16 citations


Proceedings ArticleDOI
04 Feb 2009
TL;DR: A hybrid approach which combines connected component analysis and an unsupervised thresholding for separation of text from the complex background is proposed.
Abstract: Reading the foreground text is difficult in documents having a multi-colored complex background. Automatic foreground text separation in such document images is essential for smooth reading of the document contents. In this paper we propose a hybrid approach which combines connected component analysis and unsupervised thresholding to separate text from the complex background. The proposed approach identifies candidate text regions based on edge detection followed by connected component analysis. Because of background complexity, it is also possible that a non-text region is identified as a text region. To overcome this problem we extract texture features of the connected components and analyze the feature values. Finally, the threshold value for each detected text region is derived automatically from the data of the corresponding image region to perform foreground separation. The proposed approach can handle document images with varying backgrounds of multiple colors. It can also handle foreground text of any color, font and size. Experimental results show that the proposed algorithm detects on average 97.8% of the text regions in the source documents. Readability of the extracted foreground text is illustrated through OCRing.
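The connected component analysis stage that this and several of the other methods above rely on can be sketched as follows (a minimal 4-connected labeling pass, not the authors' implementation; the texture filtering and per-region thresholding steps are omitted):

```python
from collections import deque

def connected_components(img):
    """Label 4-connected foreground regions in a binary image and return
    each component's bounding box (min_row, min_col, max_row, max_col).
    A real text-separation system would follow this with texture-based
    filtering of the candidate regions."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                q = deque([(r, c)])          # BFS over the new component
                seen[r][c] = True
                box = [r, c, r, c]
                while q:
                    y, x = q.popleft()
                    box = [min(box[0], y), min(box[1], x),
                           max(box[2], y), max(box[3], x)]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append(tuple(box))
    return boxes

img = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
print(connected_components(img))  # [(0, 0, 1, 1), (1, 3, 2, 3)]
```

Each bounding box is a candidate text region; the paper's contribution is in deciding which candidates are actually text and in binarizing each one with its own locally derived threshold.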

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper proposes a Bayesian generative model for document images which automatically discovers different regions present in a document image in a completely unsupervised fashion and illustrates its robustness by providing results on a highly degraded version of the test set.
Abstract: Segmentation of document images remains a challenging vision problem. Although document images have a structured layout, capturing enough of it for segmentation can be difficult. Most current methods combine text extraction and heuristics for segmentation, but text extraction is prone to failure and measuring accuracy remains a difficult challenge. Furthermore, when presented with significant degradation many common heuristic methods fall apart. In this paper, we propose a Bayesian generative model for document images which seeks to overcome some of these drawbacks. Our model automatically discovers different regions present in a document image in a completely unsupervised fashion. We attempt no text extraction, but rather use discrete patch-based codebook learning to make our probabilistic representation feasible. Each latent region topic is a distribution over these patch indices. We capture rough document layout with an MRF Potts model. We take an analysis by synthesis approach to examine the model, and provide quantitative segmentation results on a manually labeled document image data set. We illustrate our model's robustness by providing results on a highly degraded version of our test set.

Patent
Eunyee Koh1, Walter Chang1
26 May 2009
TL;DR: In this paper, a layout-preserved text generation method was proposed to transform the PDF (X, Y) document space into a text file grid space while preserving a similar global view of the text and layout of the original PDF input document.
Abstract: Methods and apparatus for generating layout-preserved text output from portable document format (PDF) input are described. A layout-preserved text generation method may generate layout-preserved text output from PDF input that includes the text along with indentations, spaces, newlines, and paging and that thus preserves the global document layout view of the original PDF input document. The layout-preserved text generation method may transform the PDF (X, Y) document space into a text file grid space while preserving a similar global view of the text and layout from the PDF (X, Y) document space. This transformation may include determining a base size per grid that may produce accurate layout in the text output from the PDF input.
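The grid-space transformation this patent describes can be approximated as follows (a hypothetical sketch: the cell dimensions stand in for the "base size per grid" the patent derives from the document, and a real PDF measures y from the bottom of the page, which this toy ignores):

```python
def to_grid(spans, cell_w=6.0, cell_h=12.0):
    """Map PDF text spans, given as (x, y, text) in points, onto a
    character grid so the text file preserves the page's global layout.
    cell_w/cell_h are assumed per-grid base sizes; y is treated as
    growing downward for simplicity."""
    grid = {}
    for x, y, text in spans:
        row, col = int(y // cell_h), int(x // cell_w)
        for i, ch in enumerate(text):
            grid[(row, col + i)] = ch
    if not grid:
        return ""
    max_row = max(r for r, _ in grid)
    max_col = max(c for _, c in grid)
    lines = []
    for r in range(max_row + 1):
        row_text = "".join(grid.get((r, c), " ") for c in range(max_col + 1))
        lines.append(row_text.rstrip())
    return "\n".join(lines)

spans = [(0.0, 0.0, "Title"), (30.0, 24.0, "body")]
print(to_grid(spans))
```

Indentation, blank lines and horizontal placement thus survive as spaces and newlines in the text output, which is the layout-preserving property the patent is after.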

Patent
Ray Smith1
21 Jan 2009
TL;DR: In this article, a physical page layout analysis for optical character recognition is performed, where tab-stops are detected from groups of edge-aligned connected components, which are used to deduce the column layout of the page by finding column partitions.
Abstract: Physical page layout analysis for optical character recognition is performed. A physical page layout analysis method finds constituent parts of an image and gives an initial data-type label, such as text or non-text. Within the text data, connected components are identified and analyzed. Tab-stops are detected from groups of edge-aligned connected components. The detected tab-stops are used to deduce the column layout of the page by finding column partitions. The column layout is then applied to find the polygonal boundaries of and a reading order of regions containing flowing text, headings, and pull-outs.
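The tab-stop detection step above can be caricatured as clustering the left edges of connected-component bounding boxes (a simplified stand-in for the edge-alignment grouping; the tolerance and support thresholds are invented for illustration):

```python
def find_tab_stops(left_edges, tolerance=2, min_support=3):
    """Group the left x-coordinates of connected-component bounding
    boxes into candidate tab-stops: clusters of edges within
    `tolerance` pixels of each other with at least `min_support`
    members. Returns the mean x of each surviving cluster."""
    clusters = []
    for x in sorted(left_edges):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Only well-supported alignments count as tab-stops.
    return [round(sum(c) / len(c)) for c in clusters if len(c) >= min_support]

edges = [10, 11, 10, 12, 100, 101, 99, 250]
print(find_tab_stops(edges))  # [11, 100]
```

In the actual method the detected tab-stops (left and right) then partition the page into columns, from which region polygons and reading order are derived.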

Proceedings ArticleDOI
01 Dec 2009
TL;DR: An overview of the different steps involved in a document image analysis system is presented and illustrated with examples from OCRopus, one of the leading open source document analysis systems with a modular and pluggable architecture.
Abstract: Document image analysis is the field of converting paper documents into an editable electronic representation by performing optical character recognition (OCR). In recent years, there has been a tremendous amount of progress in the development of open source OCR systems. OCRopus is one of the leading open source document analysis systems, with a modular and pluggable architecture. This paper presents an overview of the different steps involved in a document image analysis system and illustrates them with examples from OCRopus.

Patent
21 Jul 2009
TL;DR: In this paper, a method of storing a document and one or more related images of alterations made to the document, comprising capturing an image of the document and storing the image in memory, is described.
Abstract: Methods for storing and managing hard copy documents and their modified versions are disclosed. Specifically, a method of storing a document and one or more related images of alterations made to the document, comprising capturing an image of the document; storing the image of the document in memory; capturing an image of an altered version of the document; comparing the image of the document to the image of the altered version of the document; extracting the differences between the image of the document and the image of the altered version of the document; creating an image of the extracted differences between the image of the document and the image of the altered version of the document; and storing the image of the extracted differences in memory.

Patent
Tsuyoshi Sasano1, Atsuya Baba1
17 Aug 2009
TL;DR: In this paper, a document reading device includes a document image acquisition unit, a memory that holds the document image obtained by the document image acquisition unit, and a determination unit that determines whether or not a size or orientation of the document matches a set parameter.
Abstract: A document reading device includes: a document image acquisition unit that reads a document and obtains a document image; a memory that holds the document image obtained by the document image acquisition unit; a determination unit that determines whether or not a size or orientation of the document, the image of which is obtained by the document image acquisition unit, matches a set parameter; and a setting change unit that, when it is determined that the size or orientation of the document does not match the set parameter, changes a setting of the parameter so as to match the size or orientation of the document.

Dissertation
01 Jan 2009
TL;DR: In this thesis, a hierarchical graph coloring-based model for business documents and mail sorting is proposed, composed of three main sections: low-level segmentation (binarisation and connected component labeling), physical layout extraction by hierarchical graph coloring, and address block location and document sorting.
Abstract: This thesis deals with the development of industrial vision systems for automatic business document and mail sorting. These systems need very high processing speed and high accuracy and precision of results. Current systems are mostly made of sequential modules needing fast and efficient algorithms throughout the processing line, from low- to high-level stages of analysis and content recognition. The existing architectures, which we describe in the first three chapters of the thesis, have shown their weaknesses, expressed as reading errors and OCR rejections. The modules responsible for these rejections and reading errors are mostly the first to occur in the processes of image segmentation and interest-region location. Indeed, these two processes, each involving the other, are fundamental to system performance and the efficiency of automatic sorting lines. In this thesis, we focus on different aspects of mail image segmentation and of locating relevant zones (such as the address block). We develop a model based on a new pyramidal approach using hierarchical graph coloring; until now, graph coloring had never been exploited in such a context. It is introduced in our contribution at every stage of document layout analysis for the recognition and decision tasks (recognition of the kind of document or of the address block). The recognition stage relies on a training process with a unique model of graph b-coloring. Our architecture is designed to guarantee good cooperation between the different modules of decision and analysis for the layout analysis and recognition stages. It is composed of three main sections: low-level segmentation (binarisation and connected component labeling), physical layout extraction by hierarchical graph coloring, and address block location and document sorting.
The algorithms involved in the system have been designed for their execution speed (matching real-time constraints), their robustness, and their compatibility. The experiments carried out in this context are very encouraging and lead us to investigate a wider diversity of document images.

Patent
Rajinderjeet Singh Minhas1
23 Nov 2009
TL;DR: A method and system for translating documents using a multifunctional printer machine is presented; the system, however, is not suitable for translating large documents.
Abstract: A method and system for translating documents with the use of a multifunctional printer machine, including capturing an image of a document; determining regions of the document captured that include original text; performing optical character recognition of the regions of the document captured that include the original text; specifying a source language corresponding to the original text; specifying one or more target languages corresponding to translated text; performing language translation of the original text into translated text; selecting one or more page layout templates having multiple pre-designated areas for receiving the original text and the translated text; and outputting one or more printouts in accordance with the one or more page layout templates selected, including at least an area designating (i) the original text in the source language and (ii) the translated text in the one or more target languages.

Patent
Tetsuya Izu1, Masahiko Takenaka1
29 Jan 2009
TL;DR: In this paper, a normal random number or a pseudo random number is assigned to each of the constituent parts according to the order in which the constituent part appears in the digital document, and verification of the authenticity of a digital document is enabled even when an alteration, such as a change of the order of the partial documents or a copy thereof, has been made to the digital documents.
Abstract: In verifying a digital document, an input of a digital document is received and the digital document is divided into arbitrary constituent parts. A normal random number or a pseudo random number is assigned to each of the constituent parts according to the order in which the constituent parts appear in the digital document. Thus, verification of the authenticity of a digital document is enabled even when an alteration, such as a change of the order of the partial documents or a copy thereof, has been made to the digital document.

01 Jan 2009
TL;DR: In this paper, a symbol classification technique for identifying the expected locations of neighboring symbols in mathematical expressions is described and it is found that the size of the symbol neighborhood, and number of key points representing a symbol affect performance significantly.
Abstract: We describe a symbol classification technique for identifying the expected locations of neighboring symbols in mathematical expressions. We use the seven symbol layout classes of the DRACULAE math notation parser (Zanibbi, Blostein, and Cordy, 2002) to represent expected locations for neighboring symbols: Ascender, Descender, Centered, Open Bracket, Non-Script, Variable Range (e.g. integrals) and Square Root. A new feature based on shape contexts (Belongie et al., 2002), named the layout context, is used to describe the arrangement of neighboring symbol bounding boxes relative to a reference symbol, and the nearest neighbor rule is used for classification. 1917 mathematical symbols from the University of Washington III document database are used in our experiments. Using a leave-one-out estimate, our best classification rate reaches nearly 80%. In our experiments, we find that the size of the symbol neighborhood, and the number and arrangement of key points representing a symbol, affect performance significantly.
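The nearest neighbor rule this paper applies to its layout-context features is plain 1-NN; a generic sketch (the two-dimensional toy features and labels below are invented for illustration, while the real features are shape-context-style histograms of neighboring bounding boxes):

```python
import math

def nearest_neighbor(query, training):
    """1-NN classification: return the label of the training example
    whose feature vector is closest to `query` under Euclidean
    distance. Feature vectors are plain lists of floats."""
    best_label, best_dist = None, math.inf
    for vector, label in training:
        d = math.dist(vector, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy training set: (feature vector, symbol layout class)
training = [
    ([0.9, 0.1], "Ascender"),
    ([0.1, 0.9], "Descender"),
    ([0.5, 0.5], "Centered"),
]
print(nearest_neighbor([0.8, 0.2], training))  # Ascender
```

With a leave-one-out estimate, each of the 1917 symbols would in turn serve as the query against the remaining 1916 as training data.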

Proceedings Article
01 Jan 2009
TL;DR: This work aims at introducing into the document processing framework of DOMINUS qualitative techniques based on the lexical taxonomy WordNet and its extension WordNet Domains for text categorization and keyword extraction, which can support the currently embedded techniques based on quantitative approaches.
Abstract: The large availability of documents in digital format has posed the problem of efficient and effective retrieval mechanisms. This involves the ability to process natural language, which is a significantly complex task. Traditional algorithms based on term matching between the document and the query, although efficient, are not able to catch the intended meaning of either, and hence cannot ensure effectiveness. To step toward semantics, problems such as polysemy and synonymy must be tackled automatically by text processing systems. This work aims at introducing into the document processing framework of DOMINUS qualitative techniques based on the lexical taxonomy WordNet and its extension WordNet Domains for text categorization and keyword extraction, which can support the currently embedded techniques based on quantitative approaches. In particular, a density function is exploited to assign the proper importance to the involved concepts and domains. Preliminary results on texts of different subjects confirm its effectiveness.

Patent
Kosei Fume1
22 Jan 2009
TL;DR: A document processing apparatus includes an extracting unit that extracts text document information from document data; an analyzing unit that analyzes a modification relation of a character string included in the text document; and an attribute unit that assigns an attribute indicating details of the modification relation to the character string and embeds the attribute in the text document, as discussed by the authors.
Abstract: A document processing apparatus includes an extracting unit that extracts text document information from document data; an analyzing unit that analyzes a modification relation of a character string included in the text document information; an attribute unit that assigns an attribute indicating details of the modification relation to the character string, and embeds the attribute in the text document information; a document specifying unit that specifies a document-specifying character string that specifies other text document information, using the text document information in which the attribute is embedded by the attribute unit; and a document-identification unit that assigns document identification information to the document-specifying character string, and embeds the document identification information in the text document information.

Proceedings ArticleDOI
23 Jul 2009
TL;DR: Experimental results prove that using the proposed technique, the percentage of time saved for the text line, word and character segmentation ground truth creation is more than 90%.
Abstract: In this paper, we propose a new comprehensive methodology in order to evaluate the performance of noisy historical document recognition techniques. We aim to evaluate not only the final noisy recognition result but also the main intermediate stages of text line, word and character segmentation. For this purpose, we efficiently create the text line, word and character segmentation ground truth guided by the transcription of the historical documents. The proposed methodology consists of (i) a semiautomatic procedure in order to detect the text line, word and character segmentation ground truth regions making use of the correct document transcription, (ii) calculation of proper evaluation metrics in order to measure the performance of the final OCR result as well as of the intermediate segmentation stages. The semi-automatic procedure for detecting the ground truth regions has been evaluated and proved efficient and time saving. Experimental results prove that using the proposed technique, the percentage of time saved for the text line, word and character segmentation ground truth creation is more than 90%. An analytic experiment using a commercial OCR engine applied to a historical book is also presented.

Book ChapterDOI
29 Aug 2009
TL;DR: An algorithm for document margin removal, based upon the detection of document corners from projection profiles, which was successfully applied to all document images in the authors' databases of French and Arabic document images.
Abstract: Document images obtained from scanners or photocopiers usually have a black margin which interferes with subsequent stages of page segmentation algorithms. Thus, the margins must be removed at the initial stage of a document processing application. This paper presents an algorithm which we have developed for document margin removal based upon the detection of document corners from projection profiles. The algorithm does not make any restrictive assumptions regarding the input document image to be processed. It neither needs all four margins to be present nor needs the corners to be right angles. In the case of the tilted documents, it is able to detect and correct the skew. In our experiments, the algorithm was successfully applied to all document images in our databases of French and Arabic document images which contain more than two hundred images with different types of layouts, noise, and intensity levels.
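The projection profiles on which the margin-removal algorithm above operates can be sketched as follows (a naive density-threshold trim under assumed conventions; the paper's corner-based method additionally handles skew, partial margins and non-right-angle corners, which this toy does not):

```python
def projection_profiles(img):
    """Row and column projection profiles of a binary image (1 = black).
    Dense rows/columns near the borders indicate scanner margins."""
    row_profile = [sum(row) for row in img]
    col_profile = [sum(row[c] for row in img) for c in range(len(img[0]))]
    return row_profile, col_profile

def trim_black_margins(img, density=0.9):
    """Drop border rows/columns whose black-pixel density exceeds
    `density` - a simplified margin-removal rule for illustration only."""
    rows, cols = projection_profiles(img)
    w, h = len(img[0]), len(img)
    top = next(i for i, v in enumerate(rows) if v < density * w)
    bottom = len(rows) - next(i for i, v in enumerate(reversed(rows)) if v < density * w)
    left = next(j for j, v in enumerate(cols) if v < density * h)
    right = len(cols) - next(j for j, v in enumerate(reversed(cols)) if v < density * h)
    return [row[left:right] for row in img[top:bottom]]

# A page with solid black margins on the top, bottom and left edges:
page = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
print(trim_black_margins(page))  # [[0, 1, 0, 0], [0, 0, 1, 0]]
```

Note that the sketch assumes at least one non-margin row and column exist; the published algorithm instead locates corners on the profiles, which is what lets it cope with tilted pages and absent margins.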

Patent
06 Nov 2009
TL;DR: In this article, the translation words are disposed in positions corresponding to positions between lines close to the words or collocations to create a document image with the supplementary explanations, while maintaining the layout of the document.
Abstract: PROBLEM TO BE SOLVED: To provide an apparatus, a method and a computer program for creating document images that make a document easier to understand by adding supplementary explanations, such as translation words, to words and collocations in the document while maintaining the layout of the document. SOLUTION: Characters are recognized from an original document image obtained using an image reader or the like, and natural language processing of the document composed of the recognized characters is carried out to determine translation words (supplementary explanations) for words or collocations included in the text. A supplementary-explanation text layer, in which the translation words are placed between the lines close to the corresponding words or collocations, is laid on top of an original document image layer made of the original document image, to create a document image with the supplementary explanations. For discontinuous collocations, underlines are added along with the translation words. The layout of the document is maintained to facilitate comparison between the original document and the document with the translation words.

Patent
15 Jun 2009
TL;DR: A document translation apparatus includes a document processing module for analyzing associative relations between nouns or noun phrases within an input document to be translated to generate analysis information on texts; and a document translation module for selecting target words for the respective texts in reference to the text analysis information to generate morphemes corresponding to the target words, thereby producing a translated document corresponding to input document.
Abstract: A document translation apparatus includes a document processing module that analyzes associative relations between nouns or noun phrases within an input document to be translated and generates text analysis information, and a document translation module that selects target words for the respective texts with reference to the text analysis information and generates morphemes corresponding to the target words, thereby producing a translated document corresponding to the input document.

Patent
Tsuyoshi Itami1
11 Aug 2009
TL;DR: In this paper, text strings are extracted from an electronic document containing layout information and a baseline of the extracted text string is detected, and a line segment A extending forward from the baseline and another line segment B of a different type from the line segments A and B extending backward from a baseline are provided.
Abstract: An electronic document processing apparatus and an electronic document processing method are provided that can perform a sophisticated search, such as concept search, with high precision even on images including decorated text strings. Text strings are extracted from an electronic document containing layout information, and the baseline of each extracted text string is detected. Subsequently, a line segment A extending forward from the baseline and a line segment B, of a different type from line segment A, extending backward from the baseline are provided. Two different text strings are determined to be concatenated if the line segments A and B provided for them overlap with each other.
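The concatenation test the abstract describes, a forward segment A from one baseline end and a backward segment B from another baseline start, joined on overlap, can be sketched roughly as follows. The tuple layout, the segment `reach` and the baseline tolerance `y_tol` are illustrative assumptions, not the patent's parameters.

```python
def overlaps(a, b):
    # 1D interval overlap test: a and b are (start, end)
    return a[0] < b[1] and b[0] < a[1]

def concatenate_strings(strings, reach=20, y_tol=2):
    # strings: (y_baseline, x_start, x_end, text). Two strings are joined
    # when segment A, extending forward from one baseline's end, overlaps
    # segment B, extending backward from the next baseline's start.
    merged = []
    for s in sorted(strings, key=lambda t: (t[0], t[1])):
        if merged:
            y, x0, x1, text = merged[-1]
            sy, sx0, sx1, stext = s
            seg_a = (x1, x1 + reach)      # forward from the baseline end
            seg_b = (sx0 - reach, sx0)    # backward from the baseline start
            if abs(sy - y) <= y_tol and overlaps(seg_a, seg_b):
                merged[-1] = (y, x0, sx1, text + stext)
                continue
        merged.append(s)
    return merged
```

Joining fragments this way reassembles words that decoration or layout split apart, which is what makes the subsequent concept search workable.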

Patent
Hyung-soo Ohk1
25 Sep 2009
TL;DR: In this article, a document processing apparatus is described, including a symbol-related information acquirement unit which identifies a text area in a scanned document, extracts symbols from the identified text area, and acquires symbol-related information regarding each extracted symbol; a symbol division unit which divides the extracted symbols into several groups based on a preset reference value regarding the symbol-related information; and a key index generation unit which generates a key index by arranging one group of symbols from among the divided groups.
Abstract: A document processing apparatus, including a symbol-related information acquirement unit which identifies a text area in a scanned document, extracts symbols from the identified text area, and acquires symbol-related information regarding each extracted symbol, a symbol division unit which divides the extracted symbols into several groups based on a preset reference value regarding the symbol-related information, and a key index generation unit which generates a key index by arranging one group of symbols from among the divided groups. Accordingly, a user can look for a desired document more easily and conveniently.
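The division-and-arrangement step can be sketched as follows, using symbol height as the symbol-related information and a preset reference value to split the groups. The height heuristic and the function name are assumptions for illustration, not the patent's method.

```python
def build_key_index(symbols, ref_height=20):
    # symbols: (glyph_text, height) pairs from the scanned text area.
    # Divide symbols into groups against a preset reference value, then
    # arrange one group (here the larger symbols, e.g. headings) as the key index.
    groups = {"large": [], "small": []}
    for text, height in symbols:
        groups["large" if height >= ref_height else "small"].append(text)
    return sorted(groups["large"])
```

An index built from one distinguished group keeps the lookup structure small, which is what lets a user find a desired document quickly.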

Patent
06 Jun 2009
TL;DR: In this article, a method and a device for processing the structure of a layout file are presented, comprising: obtaining document content structure information and/or document layout exhibition information of the layout file.
Abstract: Disclosed are a method and a device for processing the structure of a layout file, comprising: obtaining document content structure information and/or document layout exhibition information of the layout file; dividing document contents of the layout file into content blocks according to the document content structure information and/or the document layout exhibition information; and creating document flow information of the layout file according to the divided content blocks.
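The dividing-and-flow step can be sketched with a toy element stream: structure information (here, heading markers) decides where a content block starts, and the document flow information is the resulting block order. The element representation is an assumption for illustration, not the patent's file format.

```python
def to_content_blocks(elements):
    # elements: (kind, text) pairs in layout order, kind in {"heading", "para"}.
    # A heading opens a new content block; the returned list is the
    # document flow information over the divided blocks.
    blocks = []
    for kind, text in elements:
        if kind == "heading" or not blocks:
            blocks.append([text])
        else:
            blocks[-1].append(text)
    return [" ".join(b) for b in blocks]
```

Deriving a linear flow from fixed-layout blocks is what allows a layout file to be reflowed, e.g. on a small screen.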