
Showing papers on "Document layout analysis" published in 2005


Patent
10 Jun 2005
TL;DR: An intelligent document recognition-based document management system includes modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.
Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.

233 citations


Patent
Kathrin Berkner
30 Jun 2005
TL;DR: A method, article of manufacture, and apparatus for content-adaptive scaling of document images are described. The method identifies spatial relationships between document objects of a document image, determines the space separating pairs of neighboring document objects, and determines at least one scaling factor based on that space and on display device characteristics.
Abstract: A method, article of manufacture, and apparatus for content-adaptive scaling of document images is described. In one embodiment, the method comprises identifying spatial relationships between document objects of a document image, determining space separating pairs of neighboring document objects, and determining at least one scaling factor based on the space separating the document objects in the document image and based on display device characteristics.

61 citations
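
The scaling idea in the abstract above can be illustrated with a minimal sketch. It assumes objects are axis-aligned boxes, looks only at horizontal gaps between boxes sorted left to right, and treats the display constraint as a target width plus a minimum visible gap. The box format, the `min_gap_px` threshold, and both function names are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in page units.
Box = Tuple[float, float, float, float]


def horizontal_gaps(boxes: List[Box]) -> List[float]:
    """Return the positive gaps between boxes that are adjacent when sorted left to right."""
    ordered = sorted(boxes, key=lambda b: b[0])
    gaps = []
    for current, nxt in zip(ordered, ordered[1:]):
        gap = nxt[0] - current[2]  # left edge of next box minus right edge of current box
        if gap > 0:
            gaps.append(gap)
    return gaps


def choose_scale(boxes: List[Box], page_width: float, display_width: float,
                 min_gap_px: float = 2.0) -> float:
    """Pick a scale that fits the display width unless that would collapse the narrowest gap."""
    fit = display_width / page_width
    gaps = horizontal_gaps(boxes)
    if not gaps:
        return fit
    # Smallest scale at which the narrowest gap is still min_gap_px wide (assumed criterion).
    floor = min_gap_px / min(gaps)
    return max(fit, floor)
```

When the floor wins, the scaled page no longer fits the display width; a real system would have to trade that off against readability, which this sketch does not attempt.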


Patent
24 Jun 2005
TL;DR: A text layout boundary is generated from extended and unextended edges of the bounding rectangles of the text lines in a text block, and a description of the boundary is stored in a machine-readable medium.
Abstract: Methods, systems and machine-readable instructions for processing an electronic document are described. In one aspect, logical blocks that were extracted from the electronic document, including a text block comprising text lines each encompassed by a respective bounding rectangle, are received. Edges of ones of the bounding rectangles are extended to at least one boundary without changing layout relationships among the logical blocks in the electronic document. A text layout boundary is generated from extended and unextended edges of the bounding rectangles. A description of the text layout boundary is stored in a machine-readable medium.

56 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function and applies this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation.
Abstract: We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX

50 citations


Proceedings ArticleDOI
Jean-Luc Meunier
31 Aug 2005
TL;DR: A fast method for determining the human reading order of the layout elements of a document page, including a computationally tractable optimization approach to the problem.
Abstract: In this paper, we propose a fast method for determining the human reading order of the layout elements of a document page. The proposal includes a computationally tractable optimization approach to the problem. We also report on the performance of the method and discuss it in light of related work.

49 citations


Patent
25 Feb 2005
TL;DR: In this paper, a method and apparatus for utilizing a document object model to manage content regions for use in an electronic document is provided, where a content region is a predefined area which may be inserted in a document and which serves as a placeholder for receiving and displaying specific types of content such as text, graphics data, calendar data, or tabular data.
Abstract: A method and apparatus are provided for utilizing a document object model to manage content regions for use in an electronic document. A content region is a predefined area which may be inserted in an electronic document and which serves as a placeholder for receiving and displaying specific types of content such as text, graphics data, calendar data, or tabular data. The document object model enables a user to create, modify, and delete content regions from an electronic document using an application programming interface from within a computer application program.

47 citations


Journal ArticleDOI
TL;DR: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, so that it can be appropriately rotated and processed by an optical character recognition (OCR) engine.

46 citations


Patent
John C. Handley
31 Mar 2005
TL;DR: A system for classifying the genre of an electronic document converts the document to rich text format (RTF), parses it into lines of text ordered from top to bottom and left to right, and assigns tokens to each line of text based on the content of the line and to line separators based on the space between blocks of lines.
Abstract: A system for classifying a genre of an electronic document may include a network processor configured to receive an electronic document and convert the electronic document to rich text format (RTF). The processor may be configured to parse the RTF document into lines of text ordered from top to bottom and left to right and assign tokens to each line of text based on content of the line and to line separators based on space between blocks of lines. The network processor may be configured to sequence the tokens, parse the tokenized document with a number of pre-defined document grammars, determine a probability for each genre corresponding to the electronic document, and classify the electronic document as the genre with the highest probability.

42 citations


Proceedings ArticleDOI
31 Aug 2005
TL;DR: Xed mixes electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form that is universal and independent of the document type.
Abstract: This article presents Xed, a reverse engineering tool for PDF documents, which extracts the original document layout structure. Xed mixes electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form that is universal and independent of the document type. This article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed along with its main modules and, in particular, the algorithm for extracting the document's physical structure. A canonical format is then proposed and discussed with an example. Finally, the results of a practical evaluation are presented, followed by an outline of future work on logical structure extraction.

36 citations


Patent
01 Jun 2005
TL;DR: A table version of a document is generated by computing a table layout of the document and generating the table version from that layout. The table version can be used to export the document to one or more computers in a distributed network while maintaining the visual fidelity and text content of the original document.
Abstract: A table version of a document is generated by computing a table layout of the document and generating the table version based on the table layout. Computing the table layout can include recording the positions of each object in the document while recording the position of text by recording the position of each line, dividing the document into sections, and grouping the sections based on their object content while compensating for overlapping objects. Generating the table version of the document from the table layout can include creating table code that represents information in the table layout of the document. The table version of the document can be used to export the document to one or more computers in a distributed network while maintaining the visual fidelity and text content of the original document.

36 citations


Proceedings ArticleDOI
31 Aug 2005
TL;DR: The proposed skew detection algorithm has no restriction on detectable angle range and does not rely on large blocks of text, and works well on textual document images, graphical images and mixed text and graphic images.
Abstract: Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (optical character recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flat-bed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realizing a practical document reader. The proposed skew detection algorithm has no restriction on detectable angle range and does not rely on large blocks of text. It works well on textual document images, graphical images and mixed text and graphic images. The performance of the systems was evaluated using over 60 images that consist of real life documents like envelopes and artificial mixed text/graphic icons. The skew detection algorithm is robust when compared with other methods when very few text lines are present in the document image.

Patent
Hiroki Kanno
02 Aug 2005
TL;DR: A system and a method for providing characteristic data associated with a scanned document are provided. The method includes analyzing a bitmapped image file of a document, determining at least one characteristic data of the document, and linking the characteristic data to the image file, wherein the characteristic data is useable by a document management system to identify the document in a search.
Abstract: A system and a method for providing characteristic data associated with a scanned document are provided. The characteristic data of the document may include a title, a creation date, a scan date, an author, a subject matter, a total page count, a starting page number, an ending page number, a color type, a document type, a language, and/or a document direction. The method includes analyzing a bitmapped image file of a document, determining at least one characteristic data of the document based on the analysis of the bitmapped image file, and linking the characteristic data to the bitmapped image file, wherein the characteristic data is useable by a document management system to identify the document in a search. Analyzing the bitmapped image of the document may include a natural language analysis technique, an optical character recognition analysis technique, an image layout analysis technique, and/or a color analysis technique.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: Results show that a conversion strategy that is aided by (expert) user-specified semantic information, and that enables individual parts of the document to be processed in a specialised way, produces better results than document analysis and understanding techniques devised for contemporary documents.
Abstract: This paper presents a flexible approach to extracting content from scanned historical documents using semantic information. The final electronic document is the result of a "digital historical document lifecycle" process, where the expert knowledge of the historian/archivist user is incorporated at different stages. Results show that a conversion strategy that is aided by (expert) user-specified semantic information, and that enables individual parts of the document to be processed in a specialised way, produces better results (in a variety of significant ways) than document analysis and understanding techniques devised for contemporary documents.

Patent
Xiaofan Lin
30 Mar 2005
TL;DR: A static layout description assigns to each logical block at least one static layout attribute with a fixed value. An adaptive layout template generated from it includes adaptive layout attributes corresponding to the static layout attributes and assigns to each of them a symbolic expression representing a variable value.
Abstract: Methods, systems and machine-readable instructions for processing electronic documents are described. In one aspect, a description of a static layout of logical blocks of the electronic document is received. The static layout description assigns to each of the logical blocks at least one associated static layout attribute. Each of the static layout attributes is assigned a fixed value. An adaptive layout template is generated from the static layout description. The adaptive layout template includes adaptive layout attributes corresponding to the static layout attributes and assigns to each of the adaptive layout attributes a respective symbolic expression representing a variable value. The adaptive layout template is stored in a memory.

Patent
14 Nov 2005
TL;DR: A relationship representation is generated for the components of a document. A user can navigate the relationships between the components to quickly understand the nature of the document and its components and to locate particular portions of the document that are important to the user.
Abstract: Methods and systems provide for breaking a computer-generated document into a number of components where the components have explicit relationships with each other. A relationship representation is generated for the components of the document. A user may then navigate the relationships between the components to quickly understand the nature of the document and its components and to locate particular portions of the document that are important to the user. In addition, the user may open, edit and reuse particular components of the document apart from the rest of the document and without having to open or edit the document.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: The PerfectDoc tool is presented, a ground-truthing and document correction tool that provides the post-processing correction capabilities required after complex document analysis and understanding tasks.
Abstract: In this paper, we present PerfectDoc, a ground-truthing and document correction tool. The tool provides the post-processing correction capabilities that are required after complex document analysis and understanding tasks. The tool has the advantage of being comprehensive (integration of most common correction tasks), easy to use (minimal clicks for corrections), and configurable (can be used for different types of documents), and it provides separate correction views. We used the tool to correct the output from a document understanding system used to extract articles from an 80-year archive of Time weekly magazine.

Patent
04 Aug 2005
TL;DR: A translation device comprises a character recognition unit that recognizes text data in a text region of an input image, a translator that translates the text data in the text region, and a layout configuration processor that generates data containing the translated text data and the graphics of the input image while maintaining the layout of the input image.
Abstract: A translation device comprises a character recognition unit that recognizes text data in a text region of an input image; a translator that translates the text data in the text region; and a layout configuration processor that generates data containing the translated text data in the text region and graphics in the input image, wherein a layout of the input image is maintained in a layout of the image of the data generated by the layout configuration processor.

Proceedings ArticleDOI
Hung-Ming Sun
31 Aug 2005
TL;DR: The proposed method, named selective CRLA, has been successfully applied to extraction of text from commercial magazine pages with complicated layouts and is capable of processing documents with both Manhattan and non-Manhattan layouts.
Abstract: The constrained run-length algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is fast and can be used to partition documents with Manhattan layouts. It is not, however, suited to deal with pages with layouts beyond the Manhattan format, e.g. irregular halftone images embedded in text paragraphs. A modified version of the CRLA, named selective CRLA, is presented in this paper. The selective CRLA is capable of processing documents with both Manhattan and non-Manhattan layouts. The selective CRLA is performed twice with different sets of parameters on a label image derived from the input document image. After both of its executions, the yielded text regions are extracted. The proposed method has been successfully applied to extraction of text from commercial magazine pages with complicated layouts.
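
For readers unfamiliar with run-length smoothing, the sketch below shows the basic step as it is commonly described for the classic algorithm: short background runs between foreground pixels are filled, once along rows and once along columns, and the two results are combined with a logical AND. This is an assumption-laden illustration of the baseline technique only; it does not reproduce the selective CRLA of the paper, and the function names and thresholds are hypothetical.

```python
from typing import List


def smooth_row(row: List[int], max_gap: int) -> List[int]:
    """Fill background runs (0s) of length <= max_gap that lie between two foreground pixels (1s)."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_fg is not None and 0 < i - last_fg - 1 <= max_gap:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def smooth_image(img: List[List[int]], max_gap_h: int, max_gap_v: int) -> List[List[int]]:
    """Smooth rows and columns independently, then AND the two results pixel by pixel."""
    horiz = [smooth_row(r, max_gap_h) for r in img]
    # Transpose, smooth each column as if it were a row, then transpose back.
    cols = [smooth_row(list(c), max_gap_v) for c in zip(*img)]
    vert = [list(r) for r in zip(*cols)]
    return [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
```

Connected regions of the smoothed image can then be treated as candidate blocks. The thresholds depend on font size and scan resolution, which is one reason fixed-parameter variants struggle with irregular layouts.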

Proceedings ArticleDOI
31 Aug 2005
TL;DR: The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way.
Abstract: This paper presents our recent work on associating the recognition results for the textual and graphical information contained in scientific chart images. Text components are first located in the input image and then recognized using OCR. The graphical objects, in turn, are segmented and grouped into high-level symbols. Both logical and semantic correspondences between text and graphical symbols are identified. The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way. The result of scientific chart image understanding is presented using XML documents.

Patent
07 Mar 2005
TL;DR: An apparatus and method are disclosed for easily generating document data (tag file) in a form that makes it possible to perform various processes upon the document data.
Abstract: An apparatus and method are disclosed for easily generating document data (tag file) in a form that makes it possible to perform various processes upon the document data. An original document (plain text) is divided into morphological elements, and morphological information is added thereto. Information representing the hierarchical document structures is also added. Furthermore, information indicating referential relations between portions in the original document is also added.

Patent
20 May 2005
TL;DR: A mixed text and image layout algorithm supports Unicode text and arbitrary content definitions for geometric layout. Layout is achieved in a single pass in the best case and in two passes in the worst case.
Abstract: A mixed text and image layout algorithm is disclosed that supports Unicode text and arbitrary content definitions for geometric layout, with a worst-case two-pass layout placement procedure. Layout of Unicode text requires a number of distinct processing steps: classification of input characters into contiguous groups of identical directionality, writing system and possibly script (and language); mapping of character groups to glyphs for display purposes; and a layout step that takes into account font display characteristics, embedded directionality level and the shape of the container for the layout contents. Layout is best-case achieved in a single layout pass and worst-case in two passes. During layout, information is cached to facilitate incremental changes to an existing layout in order to minimize refresh operations for editing display purposes. An optional two-pass operation on the layout result may be used to generate ordered rendering operations to support so-called Z-index display. An optimized Unicode character classification method utilizing reduced memory is also disclosed. Additionally, a method to selectively display caret location for mixed font and/or directional text is disclosed.
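
The first processing step named in the abstract, grouping characters into contiguous runs of identical directionality, can be sketched with Python's standard `unicodedata` module. The collapse of bidirectional classes into three buckets (L, R, N) and the function names are simplifying assumptions for illustration. The patent's own classification method and the full Unicode Bidirectional Algorithm are considerably more involved.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def direction(ch: str) -> str:
    """Collapse a character's Unicode bidirectional class to L, R, or N (neutral or other)."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "L"
    if bidi in ("R", "AL"):  # right-to-left letters, including Arabic letters
        return "R"
    return "N"


def directional_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into maximal contiguous runs that share the same collapsed direction."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=direction)]
```

For instance, `directional_runs("abc \u05d0\u05d1\u05d2")` yields a left-to-right run, a neutral run for the space, and a right-to-left run for the three Hebrew letters. Resolving neutral runs against their neighbours is left out of this sketch.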

Patent
Xing Xie, Gengxin Miao, Guomao Xin, Ruihua Song, Ji-Rong Wen, Wei-Ying Ma
26 Sep 2005
TL;DR: Document content is analyzed with respect to multiple block function criteria, each block is assigned a block function, and these assignments are used to generate one or more customized document layouts for browsing by a user.
Abstract: Categorizing page block functionality to improve document layout for browsing is described. In one aspect, document content is analyzed with respect to multiple block function criteria. Results of this analysis are used to assign a respective block function to blocks of the document content. These block function assignments are used to generate one or more customized document layouts for browsing by a user.

Patent
Yuushin Tatsumi
25 Oct 2005
TL;DR: A document analysis system is provided that can execute a layout analysis as intended by the document provider together with an exhaustive title analysis, and that outputs the analysis result in a form usable by a third person.
Abstract: A document analysis system which can execute a layout analysis intended by a document provider and an exhaustive title analysis and output the analysis result which can be used by a third person is provided by the present invention. The input unit (11) obtains a structured or semi-structured document and renders it. The basic layout analysis unit (14) obtains the rendering result and analyzes the layout by grouping document description elements juxtaposed in a determined direction by referencing an arrangement of the document description elements. The title analysis unit (15) obtains the rendering result and a title analysis rule from the title analysis rule storing unit (23) and analyzes the title by comparing the name, attribute, style or content of the document description elements with the title analysis rule. The layout analysis unit (16) obtains the layout components, their hierarchical relationship and the titles, and generates a new layout by grouping the layout components. The output unit (13) obtains the layout components, their hierarchical relationship, and the relationship between the components and the titles, shapes them into a format whose expressions use references to the document description elements, and outputs them.

Patent
01 Apr 2005
TL;DR: In this paper, a reading machine applies text-to-speech to a text file that corresponds to the selected section of the document, to read the selected sections of a document aloud to the user.
Abstract: Controlling a reading machine while reading a document to a user by receiving an image of a document, accessing a knowledge base that provides data that identifies sections in the document and processing user commands to select a section of the document. The reading machine applies text-to-speech to a text file that corresponds to the selected section of the document, to read the selected section of the document aloud to the user.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper presents a novel approach for document page segmentation using a transform based multi-scale method and shows that the algorithm is robust for variations of document parameters.
Abstract: Page segmentation algorithms found in the published literature often rely on predetermined parameters such as general font sizes, distances between text lines and document scan resolutions. Variations of these parameters in real document images greatly affect the performance of the algorithms. In this paper, we present a novel approach for document page segmentation using a multi-scale technique. An efficient implementation of a local connectivity algorithm transforms a document image into a parameter domain in which the parameter value at a pixel location represents a connectivity property of its neighboring foreground pixels in the original document image. A top-down approach with a linear search then reveals the document regions at each scale level as text blocks, text lines and graphics. We consider our algorithm a transform-based multi-scale method. Our ongoing research shows that the algorithm is robust to variations of document parameters.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: A document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars, and which may be adapted to a new genre simply by replacing the input grammar.
Abstract: We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions and font sizes. These features are combined to form a document representation of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
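
The token-labeling step mentioned in the abstract (lines of text labeled as tokens using regular expression matching) might look roughly like the sketch below. The token names, the patterns, and the convention that a blank line stands in for a block separator are all assumptions made for illustration. The paper's actual token set and its stochastic grammar parser are not reproduced here.

```python
import re
from typing import List

# Hypothetical token patterns for a business-card-like genre; the first match wins.
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("PHONE", re.compile(r"^\+?[\d\s().-]{7,}$")),
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
]


def label_line(line: str) -> str:
    """Map one line of recognized text to a token name, or SEP for an empty line."""
    text = line.strip()
    if not text:
        return "SEP"  # assumed stand-in for a block separator
    for name, pattern in TOKEN_PATTERNS:
        if pattern.match(text):
            return name
    return "TEXT"


def tokenize(lines: List[str]) -> List[str]:
    """Turn a sequence of text lines into the token stream a grammar parser would consume."""
    return [label_line(line) for line in lines]
```

The resulting token stream is what a genre grammar would then parse. Swapping the pattern table is the analogue, at the token level, of swapping the grammar for a new genre.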

Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper describes document images in terms of frequently occurring symbols. The description is created in an unsupervised manner and can be related to domain knowledge.
Abstract: Document image classification is an important step in document image analysis. Based on classification results we can tackle other tasks such as indexing, understanding or navigation in document collections. Using a document representation and an unsupervised classification method, we may group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper, we describe document images in terms of frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Using data mining techniques applied to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of "symbols" represents the description of a document. We present results obtained on a corpus of 60 graphical document images.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: Experimental results on a large scale document image database, which contains 10385 document images, show that the proposed method is efficient and robust to retrieve different kinds of document images in real time.
Abstract: Document image retrieval is an important part of many document image processing systems such as paperless office systems and digital libraries. Its task is to help users find the most similar document images in a document image database. To support retrieval across document images of different resolutions and formats that contain mixed characters from multiple languages, this paper proposes a new retrieval method based on document image density distribution features and key block features. First, the density distribution and key block features of a document image are defined and extracted based on the document's print-core. Second, candidate document images are obtained based on the density distribution features. Third, to improve the reliability of the retrieval results, a confirmation procedure using key block features is applied to those candidates. Experimental results on a large-scale document image database containing 10385 document images show that the proposed method is efficient and robust in retrieving different kinds of document images in real time.
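
One simple way to picture a density distribution feature is as the fraction of foreground pixels in each cell of a fixed grid, which stays roughly stable when the same page is scanned at a different resolution. The sketch below is an assumption made for illustration: the grid layout, the L1 distance, and the function names are hypothetical, and the paper's print-core extraction and key block features are not modeled.

```python
from typing import List


def density_grid(img: List[List[int]], rows: int, cols: int) -> List[float]:
    """Return the fraction of foreground pixels (1s) in each cell of a rows x cols grid."""
    height, width = len(img), len(img[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / area if area else 0.0)
    return feats


def l1_distance(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Candidates could be ranked by `l1_distance` between the query's features and each stored image's features, with a more expensive check applied only to the closest ones, in the spirit of the two-stage procedure the abstract describes.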

Patent
28 Apr 2005
TL;DR: The content of a document is segmented into one or more original document structures, the structures to be localized are replaced with new content, and the layout is automatically adjusted to generate a more aesthetically pleasing document.
Abstract: A method which includes segmenting the content of a document into one or more original document structures, determining which of the one or more original document structures are to be localized, replacing the original document structures to be localized with new content, and automatically adjusting the layout of the document with new content to generate a more aesthetically pleasing document.

Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper develops a new software environment for manual page image segmentation and labeling, and uses it to create a dataset containing 932 page images from academic journals, and develops a physical layout analysis algorithm based on a logistic regression classifier, which is found to outperform existing algorithms of comparable complexity.
Abstract: In the field of computer analysis of document images, the problems of physical and logical layout analysis have been approached through a variety of heuristic, rule-based, and grammar-based techniques. In this paper we investigate the effectiveness of statistical pattern recognition algorithms for solving these two problems, and report results suggesting that these more complex and powerful techniques are worth pursuing. First, we developed a new software environment for manual page image segmentation and labeling, and used it to create a dataset containing 932 page images from academic journals. Next, a physical layout analysis algorithm based on a logistic regression classifier was developed, and found to outperform existing algorithms of comparable complexity. Finally, three statistical classifiers were applied to the logical layout analysis problem, also with encouraging results.