Showing papers on "Document layout analysis" published in 2005

Patent•

Document management system with enhanced intelligent document recognition capabilities

Suresh S. Pandian, Thyagarajan Swaminathan, Subramaniyan Neelagandan, Krishna K. Srinivasan, Randal J. Martin

10 Jun 2005

TL;DR: An intelligent document recognition-based document management system with modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.

Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.

233 citations

Patent•

White space graphs and trees for content-adaptive scaling of document images

Kathrin Berkner¹•Institutions (1)

Ricoh¹

30 Jun 2005

TL;DR: A method, article of manufacture, and apparatus for content-adaptive scaling of document images. The method identifies spatial relationships between document objects of a document image, determines the space separating pairs of neighboring document objects, and determines at least one scaling factor based on that spacing and on display device characteristics.

Abstract: A method, article of manufacture, and apparatus for content-adaptive scaling of document images is described. In one embodiment, the method comprises identifying spatial relationships between document objects of a document image, determining space separating pairs of neighboring document objects, and determining at least one scaling factor based on the space separating the document objects in the document image and based on display device characteristics.

61 citations

Patent•

Generating a text layout boundary from a text block in an electronic document

Hui Chao¹, Xiaofan Lin¹, Charles Nelson¹•Institutions (1)

Hewlett-Packard¹

24 Jun 2005

TL;DR: A text layout boundary is generated from the extended and unextended edges of the bounding rectangles that encompass the text lines of a text block, and a description of the boundary is stored in a machine-readable medium.

Abstract: Methods, systems and machine-readable instructions for processing an electronic document are described. In one aspect, logical blocks that were extracted from the electronic document, including a text block comprising text lines each encompassed by a respective bounding rectangle, are received. Edges of ones of the bounding rectangles are extended to at least one boundary without changing layout relationships among the logical blocks in the electronic document. A text layout boundary is generated from extended and unextended edges of the bounding rectangles. A description of the text layout boundary is stored in a machine-readable medium.

56 citations

Proceedings Article•DOI•

Learning nongenerative grammatical models for document analysis

M. Shilman¹, Percy Liang¹, Paul A. Viola¹•Institutions (1)

Microsoft¹

17 Oct 2005

TL;DR: This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. It is applied to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation.

Abstract: We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX

50 citations

Proceedings Article•DOI•

Optimized XY-cut for determining a page reading order

Jean-Luc Meunier¹•Institutions (1)

Xerox¹

31 Aug 2005

TL;DR: A fast method for determining the human reading order of the layout elements of a document page, including a computationally tractable optimization approach to the problem.

Abstract: In this paper, we propose a fast method for determining the human reading order of the layout elements of a document page. The proposal includes a computationally tractable optimization approach to the problem. We also report on the performance of the method and discuss it in light of related work.

49 citations
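
The abstract above does not describe the optimization itself. For orientation only, the following is a minimal sketch of the classic recursive XY-cut that the title refers to, not the optimized variant proposed in the paper. The box format `(x0, y0, x1, y1)`, the `min_gap` parameter, and the rule of always cutting at the widest gap are assumptions made for this illustration.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def _gaps(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (gap_width, cut_position) for each empty gap in a projection."""
    gaps = []
    ordered = sorted(intervals)
    reach = ordered[0][1]
    for lo, hi in ordered[1:]:
        if lo > reach:
            gaps.append((lo - reach, (lo + reach) / 2))
        reach = max(reach, hi)
    return gaps


def xy_cut(boxes: List[Box], min_gap: float = 1.0) -> List[Box]:
    """Order boxes by recursively cutting the page at its widest gap."""
    if len(boxes) <= 1:
        return list(boxes)
    best_x = max(_gaps([(b[0], b[2]) for b in boxes]), default=(0.0, 0.0))
    best_y = max(_gaps([(b[1], b[3]) for b in boxes]), default=(0.0, 0.0))
    if max(best_x[0], best_y[0]) < min_gap:
        # No usable cut: fall back to top-to-bottom, left-to-right order.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if best_y[0] >= best_x[0]:
        cut = best_y[1]  # horizontal cut: upper part is read first
        first = [b for b in boxes if b[3] <= cut]
        second = [b for b in boxes if b[1] >= cut]
    else:
        cut = best_x[1]  # vertical cut: left part is read first
        first = [b for b in boxes if b[2] <= cut]
        second = [b for b in boxes if b[0] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A header above two columns, given in arbitrary order.
header = (0, 0, 100, 10)
left1, left2 = (0, 20, 45, 50), (0, 55, 45, 90)
right1, right2 = (55, 20, 100, 60), (55, 65, 100, 90)
order = xy_cut([right2, left1, header, right1, left2])
```

On this sample page the sketch returns the header first, then the left column from top to bottom, then the right column. A plain XY-cut like this is known to struggle when no clean gap separates regions, which is the kind of case an optimized approach would need to handle.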

Patent•

Method and apparatus for utilizing an object model for managing content regions in an electronic document

Brian M. Jones¹, E. Mark Sunderland¹, Marcin Sawicki¹, Robert A. Little¹, Tristan A. Davis¹•Institutions (1)

Microsoft¹

25 Feb 2005

TL;DR: A method and apparatus for utilizing a document object model to manage content regions in an electronic document, where a content region is a predefined area that may be inserted in a document and serves as a placeholder for receiving and displaying specific types of content such as text, graphics data, calendar data, or tabular data.

Abstract: A method and apparatus are provided for utilizing a document object model to manage content regions for use in an electronic document. A content region is a predefined area which may be inserted in an electronic document and which serves as a placeholder for receiving and displaying specific types of content such as text, graphics data, calendar data, or tabular data. The document object model enables a user to create, modify, and delete content regions from an electronic document using an application programming interface from within a computer application program.

47 citations

Journal Article•DOI•

A generic method for determining up/down orientation of text in roman and non-roman scripts

Hrishikesh Aradhye¹•Institutions (1)

SRI International¹

01 Nov 2005-Pattern Recognition

TL;DR: This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, so that it can be appropriately rotated and processed by an optical character recognition (OCR) engine.

46 citations

Patent•

Systems and methods for electronic document genre classification using document grammars

John C. Handley¹•Institutions (1)

Xerox¹

31 Mar 2005

TL;DR: A system for classifying the genre of an electronic document converts the document to rich text format (RTF), parses the RTF document into lines of text ordered from top to bottom and left to right, and assigns tokens to each line of text based on its content and to line separators based on the space between blocks of lines.

Abstract: A system for classifying a genre of an electronic document may include a network processor configured to receive an electronic document and convert the electronic document to rich text format (RTF). The processor may be configured to parse the RTF document into lines of text ordered from top to bottom and left to right and assign tokens to each line of text based on content of the line and to line separators based on space between blocks of lines. The network processor may be configured to sequence the tokens, parse the tokenized document with a number of pre-defined document grammars, determine a probability for each genre corresponding to the electronic document, and classify the electronic document as the genre with the highest probability.

42 citations

Proceedings Article•DOI•

Towards a canonical and structured representation of PDF documents through reverse engineering

Maurizio Rigamonti, Jean-Luc Bloechle, Karim Hadjar, Denis Lalanne, Rolf Ingold

31 Aug 2005

TL;DR: Xed mixes electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, i.e. which is universal and independent of the document type.

Abstract: This article presents Xed, a reverse engineering tool for PDF documents, which extracts the original document layout structure. Xed mixes electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in a hierarchical canonical form, i.e. which is universal and independent of the document type. This article first reviews the major traps and tricks of the PDF format. It then introduces the architecture of Xed along with its main modules, and, in particular, the document physical structure extraction algorithm. Later on, a canonical format is proposed and discussed with an example. Finally the results of a practical evaluation are presented, followed by an outline of future works on the logical structure extraction.

36 citations

Patent•

Method and system for creating a table version of a document

Tara M. Kraft¹, Uladzislau Sudzilouski¹, Jacqui J. Salerno¹•Institutions (1)

Microsoft¹

01 Jun 2005

TL;DR: A table version of a document is generated by computing a table layout of the document and generating the table version from that layout. The table version can be used to export the document to one or more computers in a distributed network while maintaining the visual fidelity and text content of the original document.

Abstract: A table version of a document is generated by computing a table layout of the document and generating the table version based on the table layout. Computing the table layout can include recording the positions of each object in the document while recording the position of text by recording the position of each line, dividing the document into sections, and grouping the sections based on their object content while compensating for overlapping objects. Generating the table version of the document from the table layout can include creating table code that represents information in the table layout of the document. The table version of the document can be used to export the document to one or more computers in a distributed network while maintaining the visual fidelity and text content of the original document.

36 citations

Proceedings Article•DOI•

Robust skew detection in mixed text/graphics documents

Adnan Amin¹, Sue Wu¹•Institutions (1)

University of New South Wales¹

31 Aug 2005

TL;DR: The proposed skew detection algorithm has no restriction on detectable angle range and does not rely on large blocks of text, and works well on textual document images, graphical images and mixed text and graphic images.

Abstract: Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (optical character recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flat-bed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realizing a practical document reader. The proposed skew detection algorithm has no restriction on detectable angle range and does not rely on large blocks of text. It works well on textual document images, graphical images and mixed text and graphic images. The performance of the systems was evaluated using over 60 images that consist of real life documents like envelopes and artificial mixed text/graphic icons. The skew detection algorithm is robust when compared with other methods when very few text lines are present in the document image.

Patent•

System and method for defining characteristic data of a scanned document

Hiroki Kanno¹•Institutions (1)

Toshiba¹

02 Aug 2005

TL;DR: A system and a method for providing characteristic data associated with a scanned document. The method includes analyzing a bitmapped image file of a document, determining at least one characteristic data of the document, and linking the characteristic data to the image file, wherein the characteristic data is useable by a document management system to identify the document in a search.

Abstract: A system and a method for providing characteristic data associated with a scanned document is provided. The characteristic data of the document may include a title, a creation date, a scan date, an author, a subject matter, a total page count, a starting page number, an ending page number, a color type, a document type, a language, and/or a document direction. The method includes analyzing a bitmapped image file of a document, determining at least one characteristic data of the document based on the analysis of the bitmapped image file, and linking the characteristic data to the bitmapped image file, wherein the characteristic data is useable by a document management system to identify the document in a search. Analyzing the bitmapped image of the document may include a natural language analysis technique, an optical character recognition analysis technique, an image layout analysis technique, and/or a color analysis technique.

Proceedings Article•DOI•

Semantics-based content extraction in typewritten historical documents

Apostolos Antonacopoulos¹, Dimosthenis Karatzas¹•Institutions (1)

University of Salford¹

31 Aug 2005

TL;DR: Results show that a conversion strategy aided by (expert) user-specified semantic information, which enables individual parts of the document to be processed in a specialised way, produces results superior to those of document analysis and understanding techniques devised for contemporary documents.

Abstract: This paper presents a flexible approach to extracting content from scanned historical documents using semantic information. The final electronic document is the result of a "digital historical document lifecycle" process, where the expert knowledge of the historian/archivist user is incorporated at different stages. Results show that such a conversion strategy aided by (expert) user-specified semantic information and which enables the processing of individual parts of the document in a specialised way, produces superior (in a variety of significant ways) results than document analysis and understanding techniques devised for contemporary documents.

Patent•

Adaptive layout templates for generating electronic documents with variable content

Xiaofan Lin¹•Institutions (1)

Hewlett-Packard¹

30 Mar 2005

TL;DR: The static layout description assigns to each logical block at least one static layout attribute with a fixed value, while the adaptive layout template includes adaptive layout attributes corresponding to the static layout attributes and assigns to each of them a symbolic expression representing a variable value.

Abstract: Methods, systems and machine-readable instructions for processing electronic documents are described. In one aspect, a description of a static layout of logical blocks of the electronic document is received. The static layout description assigns to each of the logical blocks at least one associated static layout attribute. Each of the static layout attributes is assigned a fixed value. An adaptive layout template is generated from the static layout description. The adaptive layout template includes adaptive layout attributes corresponding to the static layout attributes and assigns to each of the adaptive layout attributes a respective symbolic expression representing a variable value. The adaptive layout template is stored in a memory.

Patent•

Management and use of data in a computer-generated document

Bishop Andrew K¹, Ashley Morgan¹, Brian M. Jones¹, Chad Rothschiller¹, Charles S. Walker¹, Eoin Burke¹, Josh Pollock¹, Robert A. Little¹, Garg Sharad K¹, Shawn A. Villaron¹, Su Piao Bill Wu¹•Institutions (1)

Microsoft¹

14 Nov 2005

TL;DR: A relationship representation is generated for the components of a document, and a user can navigate the relationships between the components to quickly understand the nature of the document and its components and to locate particular portions of the document that are important to the user.

Abstract: Methods and systems provide for breaking a computer-generated document into a number of components where the components have explicit relationships with each other. A relationship representation is generated for the components of the document. A user may then navigate the relationships between the components to quickly understand the nature of the document and its components and to locate particular portions of the document that are important to the user. In addition, the user may open, edit and reuse particular components of the document apart from the rest of the document and without having to open or edit the document.

Proceedings Article•DOI•

PerfectDoc: a ground truthing environment for complex documents

Sherif Yacoub¹, Vinay Saxena¹, S.N. Sami¹•Institutions (1)

Hewlett-Packard¹

31 Aug 2005

TL;DR: The PerfectDoc tool is presented: a ground truthing and document correction tool that provides the post-processing correction capabilities required after complex document analysis and understanding tasks.

Abstract: In this paper, we present PerfectDoc; a ground truthing and document correction tool. The tool provides post processing correction capabilities that are required after complex document analysis and understanding tasks. The tool has the advantage of being comprehensive (integration of most common correction tasks), easy to use (minimal clicks for corrections), configurable (can be used for different types of documents), and provides separate correction views. We used the tool to correct the output from a document understanding system used to extract articles from 80-years archive of Time weekly magazine.

Patent•

Translation device, image processing device, translation method, and recording medium

Toshiya Koyama¹, Teruka Saito¹, Masakazu Tateno¹, Kei Tanaka¹, Takashi Nagao¹, Masayoshi Sakakibara¹, Xinyu Peng¹, Kotaro Nakamura¹, Atsushi Itoh¹, Masatoshi Tagawa¹, Michihiro Tamune¹, Hiroshi Masuichi¹, Sato Naoko¹, Kiyoshi Tashiro¹•Institutions (1)

Fuji Xerox¹

04 Aug 2005

TL;DR: A translation device comprising a character recognition unit that recognizes text data in a text region of an input image, a translator that translates the text data in the text region, and a layout configuration processor that generates data containing the translated text data and the graphics of the input image while maintaining the layout of the input image.

Abstract: A translation device comprises a character recognition unit that recognizes text data in a text region of an input image; a translator that translates the text data in the text region; and a layout configuration processor that generates data containing the translated text data in the text region and graphics in the input image, wherein a layout of the input image is maintained in a layout of the image of the data generated by the layout configuration processor.

Proceedings Article•DOI•

Page segmentation for Manhattan and non-Manhattan layout documents via selective CRLA

Hung-Ming Sun¹•Institutions (1)

Kainan University¹

31 Aug 2005

TL;DR: The proposed method, named selective CRLA, has been successfully applied to extraction of text from commercial magazine pages with complicated layouts and is capable of processing documents with both Manhattan and non-Manhattan layouts.

Abstract: The constrained run-length algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is fast and can be used to partition documents with Manhattan layouts. It is not, however, suited to deal with pages with layouts beyond the Manhattan format, e.g. irregular halftone images embedded in text paragraphs. A modified version of the CRLA, named selective CRLA, is presented in this paper. The selective CRLA is capable of processing documents with both Manhattan and non-Manhattan layouts. The selective CRLA is performed twice with different sets of parameters on a label image derived from the input document image. After both of its executions, the yielded text regions are extracted. The proposed method has been successfully applied to extraction of text from commercial magazine pages with complicated layouts.

Proceedings Article•DOI•

Associating text and graphics for scientific chart understanding

Weihua Huang¹, Chew Lim Tan¹, Wee Kheng Leow¹•Institutions (1)

National University of Singapore¹

31 Aug 2005

TL;DR: The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way.

Abstract: This paper presents our recent work that aims at associating the recognition results of textual and graphical information contained in the scientific chart images. Text components are first located in the input image and then recognized using OCR. On the other hand, the graphical objects are segmented and form high level symbols. Both logical and semantic correspondence between text and graphical symbols are identified. The association of text and graphics allows us to capture the semantic meaning carried by scientific chart images in a more complete way. The result of scientific chart image understanding is presented using XML documents.

Patent•

Document processing apparatus having an authoring capability for describing a document structure

Katashi Nagao¹•Institutions (1)

Sony Broadcast & Professional Research Laboratories¹

07 Mar 2005

TL;DR: An apparatus and method for easily generating document data (tag file) in a form that makes it possible to perform various processes upon the document data.

Abstract: An apparatus and method are disclosed for easily generating document data (tag file) in a form that makes it possible to perform various processes upon the document data. An original document (plain text) is divided into morphological elements, and morphological information is added thereto. Information representing the hierarchical document structures is also added. Furthermore information indicating referential relations between portions in the original document is also added.

Patent•

Method and apparatus for layout of text and image documents

Alexander Vincent Danilo

20 May 2005

TL;DR: A mixed text and image layout algorithm capable of supporting Unicode text and arbitrary content definitions for geometric layout. Layout is achieved in a single pass in the best case and in two passes in the worst case.

Abstract: A mixed text and image layout algorithm capable of supporting Unicode text and arbitrary content definitions for geometric layout with worst case two-pass layout placement procedure. Layout of Unicode text requires a number of distinct processing steps commencing with classification of input characters into contiguous groups of identical directionality, writing system and possibly script (and language) followed by mapping of character groups to glyphs for display purposes followed by a layout taking into account font display characteristics, embedded directionality level and shape of container for layout contents. Layout is best-case achieved in a single layout pass and worst-case in two passes. During layout information is cached to facilitate incremental changes to an existing layout in order to minimize refresh operations for editing display purposes. An optional two-pass operation on the layout result may be used to generate ordered rendering operation to support so-called Z-index display. An optimized Unicode character classification method utilizing reduced memory is also disclosed. Additionally a method to selectively display caret location for mixed font and/or directional text is disclosed.
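
The first processing step named in the abstract, classifying input characters into contiguous groups of identical directionality, can be illustrated with Python's standard `unicodedata` module. This is a deliberately simplified sketch and not the patented method: it ignores writing system, script, and embedding levels, and the rule of attaching neutral characters to the preceding run is an assumption made here rather than the behaviour of the full Unicode Bidirectional Algorithm.

```python
import unicodedata
from typing import List, Tuple


def direction(ch: str) -> str:
    """Map a character's Unicode bidi class to a coarse direction label."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):  # Hebrew-like and Arabic letters
        return "rtl"
    if bidi == "L":
        return "ltr"
    return "neutral"  # spaces, digits, punctuation, and so on


def directional_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into contiguous (direction, substring) runs."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        d = direction(ch)
        if d == "neutral" and runs:
            d = runs[-1][0]  # simplification: neutrals join the previous run
        if runs and runs[-1][0] == d:
            runs[-1] = (d, runs[-1][1] + ch)
        else:
            runs.append((d, ch))
    return runs


# Latin text, three Hebrew letters (written as escapes), Latin text.
sample = "abc \u05d0\u05d1\u05d2 def"
runs = directional_runs(sample)
```

For the sample string this yields three runs: left-to-right, right-to-left, left-to-right. Each run would then be mapped to glyphs and laid out in the later steps the abstract lists.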

Patent•

Categorizing page block functionality to improve document layout for browsing

Xing Xie¹, Gengxin Miao¹, Guomao Xin¹, Ruihua Song¹, Ji-Rong Wen¹, Wei-Ying Ma¹•Institutions (1)

Microsoft¹

26 Sep 2005

TL;DR: Document content is analyzed with respect to multiple block function criteria, each block of the content is assigned a block function, and these assignments are used to generate one or more customized document layouts for browsing by a user.

Abstract: Categorizing page block functionality to improve document layout for browsing is described. In one aspect, document content is analyzed with respect to multiple block function criteria. Results of this analysis are used to assign a respective block function to blocks of the document content. These block function assignments are used to generate one or more customized document layouts for browsing by a user.

Patent•

Document Analysis System and Document Adaptation System

Yuushin Tatsumi¹•Institutions (1)

NEC¹

25 Oct 2005

TL;DR: A document analysis system that can execute a layout analysis as intended by a document provider together with an exhaustive title analysis, and output an analysis result that can be used by a third person.

Abstract: A document analysis system which can execute a layout analysis intended by a document provider and an exhaustive title analysis and output the analysis result which can be used by a third person is provided by the present invention. The input unit (11) obtains a structured or semi-structured document and renders it. The basic layout analysis unit (14) obtains the rendering result and analyzes the layout by grouping document description elements juxtaposed in a determined direction by referencing an arrangement of the document description elements. The title analysis unit (15) obtains the rendering result and a title analysis rule from the title analysis rule storing unit (23) and analyzes the title by comparing the name, attribute, style or the content of the document analysis elements with the title analysis rule. The layout analysis unit (16) obtains the layout components and the hierarchical relationship thereof and the titles for generating a new layout by grouping the layout components. The output unit (13) obtains the layout components and the hierarchical relationship thereof, the relationship between the components and the titles, shapes them into a format having an expression which uses the reference to the document description elements and outputs them.

Patent•

Document mode processing for portable reading machine enabling document navigation

Raymond C. Kurzweil, Paul Albrecht, Lucy Gibson

01 Apr 2005

TL;DR: A reading machine applies text-to-speech to a text file that corresponds to a user-selected section of a document, reading that section aloud to the user.

Abstract: Controlling a reading machine while reading a document to a user by receiving an image of a document, accessing a knowledge base that provides data that identifies sections in the document and processing user commands to select a section of the document. The reading machine applies text-to-speech to a text file that corresponds to the selected section of the document, to read the selected section of the document aloud to the user.

Proceedings Article•DOI•

Multi-scale techniques for document page segmentation

Zhixin Shi¹, Venu Govindaraju¹•Institutions (1)

University at Buffalo¹

31 Aug 2005

TL;DR: This paper presents a novel approach for document page segmentation using a transform based multi-scale method and shows that the algorithm is robust for variations of document parameters.

Abstract: Page segmentation algorithms found in published literatures often rely on some predetermined parameters such as general font sizes, distances between text lines and document scan resolutions. Variations of these parameters in real document images greatly affect the performance of the algorithms. In this paper, we present a novel approach for document page segmentation using a multi-scale technique. An efficient implementation of a local connectivity algorithm transforms a document image into a parameter domain in which a parameter value at a pixel location represents a connectivity property for its neighboring foreground pixels in the original document image. Then a top-down approach with a linear search reveals the document regions at each scale levels as text block, text lines and graphics. We consider our algorithm a transform based multi-scale method. Our ongoing research shows that the algorithm is robust for variations of document parameters.

Proceedings Article•DOI•

Document understanding system using stochastic context-free grammars

John C. Handley¹, Anoop M. Namboodiri, Richard Zanibbi²•Institutions (2)

Xerox¹, Concordia University²

31 Aug 2005

TL;DR: A document understanding system in which the arrangement of lines of text and block separators within a document are modeled by stochastic context free grammars, which may be adapted to a new genre simply by replacing the input grammar.

Abstract: We present a document understanding system in which the arrangement of lines of text and block separators within a document are modeled by stochastic context free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions and font sizes. These features are combined to form a document representation of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
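
The step "lines of text are labeled as tokens using regular expression matching" can be sketched as follows. The token names and patterns here are hypothetical, chosen only to suggest what a business-letter token set might look like; the paper's actual token set and expressions are not given in the abstract.

```python
import re
from typing import List

# Hypothetical token set; order matters because the first match wins.
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{6,}\d")),
    ("DATE", re.compile(
        r"\b\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"[a-z]*\s+\d{4}\b")),
    ("SALUTATION", re.compile(r"^\s*Dear\b", re.IGNORECASE)),
    ("CLOSING", re.compile(r"^\s*(?:Sincerely|Regards|Yours)\b", re.IGNORECASE)),
]


def label_line(line: str) -> str:
    """Return the name of the first pattern that matches, else TEXT."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.search(line):
            return name
    return "TEXT"


def tokenize(lines: List[str]) -> List[str]:
    """Turn OCR text lines into the token stream a grammar would parse."""
    return [label_line(line) for line in lines]


letter = [
    "31 Aug 2005",
    "Dear Ms. Smith,",
    "Thank you for your order.",
    "Sincerely,",
    "jane.doe@example.com",
]
tokens = tokenize(letter)
```

In the system described above, a token stream like this one, interleaved with block separators, is what the stochastic context-free grammar parses to assign a functional label to each line.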

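The pipeline in this abstract (regular-expression tokens, then a maximum-likelihood parse) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the token patterns, the toy business card grammar, and its probabilities are hypothetical, and block separators are left out for brevity. The parser is a standard Viterbi CKY over a grammar in Chomsky normal form, not necessarily the parser the authors used.

```python
import math
import re

# Hypothetical token patterns; the first match wins.
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"^\S+@\S+\.\S+$")),
    ("PHONE", re.compile(r"^\+?[\d\s().-]{7,}$")),
    ("TEXT", re.compile(r".+")),
]

# Toy stochastic grammar in Chomsky normal form (assumed probabilities).
LEXICAL = {  # nonterminal -> {token: probability}
    "Name": {"TEXT": 1.0},
    "Title": {"TEXT": 1.0},
    "Phone": {"PHONE": 1.0},
    "Email": {"EMAIL": 1.0},
}
BINARY = {  # (parent, left child, right child) -> probability
    ("Card", "Name", "Rest"): 1.0,
    ("Rest", "Title", "Contact"): 0.6,
    ("Rest", "Phone", "Email"): 0.25,
    ("Rest", "Email", "Phone"): 0.15,
    ("Contact", "Phone", "Email"): 0.7,
    ("Contact", "Email", "Phone"): 0.3,
}


def tokenize(lines):
    """Label each text line with the first token type whose pattern matches."""
    return [next(name for name, pat in TOKEN_PATTERNS if pat.match(line))
            for line in lines]


def viterbi_parse(tokens, start="Card"):
    """Return one functional label per token from the most likely parse, or None."""
    n = len(tokens)
    best = {}  # best[(i, j)][nonterminal] = (log probability, backpointer)
    for i, tok in enumerate(tokens):
        best[(i, i + 1)] = {nt: (math.log(em[tok]), None)
                            for nt, em in LEXICAL.items() if tok in em}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        score = (math.log(p) + best[(i, k)][left][0]
                                 + best[(k, j)][right][0])
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (k, left, right))
            best[(i, j)] = cell
    if n == 0 or start not in best[(0, n)]:
        return None
    labels = [None] * n

    def walk(i, j, nt):
        # Follow backpointers down to single lines and record their labels.
        back = best[(i, j)][nt][1]
        if back is None:
            labels[i] = nt
        else:
            k, left, right = back
            walk(i, k, left)
            walk(k, j, right)

    walk(0, n, start)
    return labels


card = ["Jane Doe", "Research Engineer", "+1 (555) 010-0000", "jane@example.org"]
print(viterbi_parse(tokenize(card)))
```

Note that the first two lines share the same token type (`TEXT`); only the grammar's positional structure distinguishes the name from the title, which is the point of parsing rather than classifying lines independently. Under this design, adapting to another genre would mean swapping `LEXICAL` and `BINARY`.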

Proceedings Article•DOI•

Clustering document images using a bag of symbols representation

Eugen Barbu, Pierre Héroux, Sébastien Adam, Eric Trupin

31 Aug 2005

TL;DR: This paper describes document images in terms of frequently occurring symbols, a document description that is created in an unsupervised manner and can be related to the domain knowledge.

Abstract: Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we may group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper, we describe document images based on frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Using data mining techniques applied to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of "symbols" represents the description of a document. We present results obtained on a corpus of 60 graphical document images.

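The bag construction can be sketched as follows. This is a simplified illustration, not the authors' method: frequent subgraph mining is replaced by a plain document-frequency threshold over symbol labels that are assumed to have been extracted already, and the names `frequent_symbols`, `bag_of_symbols`, and `cosine` are hypothetical. Cosine similarity is one possible way to compare bags before clustering.

```python
import math
from collections import Counter


def frequent_symbols(docs, min_support):
    """Symbols that occur in at least min_support documents."""
    support = Counter(sym for doc in docs for sym in set(doc))
    return {sym for sym, count in support.items() if count >= min_support}


def bag_of_symbols(doc, vocabulary):
    """Count only the frequent symbols found in one document."""
    return Counter(sym for sym in doc if sym in vocabulary)


def cosine(a, b):
    """Cosine similarity between two bags (0.0 if either bag is empty)."""
    dot = sum(a[s] * b[s] for s in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Made-up symbol labels standing in for mined subgraphs.
docs = [
    ["door", "door", "window", "stair"],
    ["door", "window", "window"],
    ["resistor", "capacitor", "resistor"],
    ["resistor", "capacitor", "noise"],
]
vocab = frequent_symbols(docs, min_support=2)
bags = [bag_of_symbols(doc, vocab) for doc in docs]
print(cosine(bags[0], bags[1]), cosine(bags[0], bags[2]))
```

Symbols seen in only one document (`stair`, `noise`) drop out of the vocabulary, so the two floor-plan-like documents end up similar to each other and orthogonal to the circuit-like ones.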

Proceedings Article•DOI•

Document image retrieval based on density distribution feature and key block feature

Hong Liu¹, Suoqian Feng¹, Hongbin Zha¹, Xueping Liu²•Institutions (2)

Peking University¹, Ricoh²

31 Aug 2005

TL;DR: Experimental results on a large-scale document image database containing 10385 document images show that the proposed method retrieves different kinds of document images efficiently and robustly in real time.

Abstract: Document image retrieval is an important part of many document image processing systems, such as paperless office systems and digital libraries. Its task is to help users find the most similar document images in a document image database. To support retrieval across document images of different resolutions and formats that mix characters from multiple languages, this paper proposes a new retrieval method based on density distribution features and key block features of document images. First, the density distribution and key block features of a document image are defined and extracted based on the document's print-core. Second, candidate document images are obtained using the density distribution features. Third, to improve the reliability of the retrieval results, a confirmation procedure using key block features is applied to those candidates. Experimental results on a large-scale document image database containing 10385 document images show that the proposed method retrieves different kinds of document images efficiently and robustly in real time.

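The two-stage structure (a coarse shortlist, then confirmation) can be sketched as below. The abstract does not define either feature, so this is only an illustration under assumptions: the density feature is taken to be the fraction of foreground pixels per cell of a relative grid, the key block feature is treated as an opaque vector supplied by the caller, and `shortlist` and `confirm_threshold` are hypothetical parameters.

```python
def density_grid(image, rows, cols):
    """Fraction of foreground (1) pixels in each cell of a rows x cols grid."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def l1(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def retrieve(query, database, shortlist=2, confirm_threshold=0.5):
    """Stage 1: shortlist by density distance. Stage 2: confirm by key block distance.

    query is (density, key_block); database holds (doc_id, density, key_block).
    """
    q_density, q_key = query
    candidates = sorted(database, key=lambda e: l1(q_density, e[1]))[:shortlist]
    confirmed = [(l1(q_key, key), doc_id) for doc_id, _, key in candidates
                 if l1(q_key, key) <= confirm_threshold]
    return [doc_id for _, doc_id in sorted(confirmed)]


# The same page content at two resolutions yields the same density feature.
small = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
large = [[v for v in row for _ in range(2)] for row in small for _ in range(2)]

database = [
    ("a", [1.0, 0.0, 0.0, 0.0], [0.2, 0.7]),
    ("b", [0.9, 0.1, 0.0, 0.0], [0.9, 0.1]),
    ("c", [0.0, 0.0, 1.0, 0.0], [0.2, 0.8]),
]
print(retrieve((density_grid(large, 2, 2), [0.2, 0.8]), database))
```

Computing density over a relative grid rather than fixed pixel blocks is what makes the first stage insensitive to scan resolution in this sketch; the cheap first stage keeps the more expensive confirmation limited to a few candidates.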

Patent•

Automated document localization and layout method

Robert G. Campbell¹, Lisa S. Purvis¹, Steven J. Harrington¹, Jonas Karlsson¹, Christopher J. Regruit¹•Institutions (1)

Xerox¹

28 Apr 2005

TL;DR: In this patent, the content of a document is segmented into one or more original document structures, the structures to be localized are replaced with new content, and the layout is automatically adjusted to generate a more aesthetically pleasing document.

Abstract: A method which includes segmenting the content of a document into one or more original document structures, determining which of the one or more original document structures are to be localized, replacing the original document structures to be localized with new content, and automatically adjusting the layout of the document with new content to generate a more aesthetically pleasing document.

Proceedings Article•DOI•

A statistical learning approach to document image analysis

Kevin Laven¹, S. Leishman¹, Sam T. Roweis¹•Institutions (1)

University of Toronto¹

31 Aug 2005

TL;DR: This paper develops a new software environment for manual page image segmentation and labeling, and uses it to create a dataset containing 932 page images from academic journals, and develops a physical layout analysis algorithm based on a logistic regression classifier, which is found to outperform existing algorithms of comparable complexity.

Abstract: In the field of computer analysis of document images, the problems of physical and logical layout analysis have been approached through a variety of heuristic, rule-based, and grammar-based techniques. In this paper we investigate the effectiveness of statistical pattern recognition algorithms for solving these two problems, and report results suggesting that these more complex and powerful techniques are worth pursuing. First, we developed a new software environment for manual page image segmentation and labeling, and used it to create a dataset containing 932 page images from academic journals. Next, a physical layout analysis algorithm based on a logistic regression classifier was developed, and found to outperform existing algorithms of comparable complexity. Finally, three statistical classifiers were applied to the logical layout analysis problem, also with encouraging results.

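As a rough illustration of what a logistic regression classifier for physical layout analysis could look like, the sketch below trains a two-class model (text versus non-text regions) with batch gradient descent. The two features, the training data, the learning rate, and the epoch count are all made up for illustration; the abstract does not state which features or training procedure the authors used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    dim = len(samples[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict(model, x):
    """Return 1 (text region) or 0 (non-text region)."""
    weights, bias = model
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5)


# Hypothetical region features: (normalized height, ink density).
regions = [[0.20, 0.30], [0.25, 0.35], [0.30, 0.25],   # text blocks
           [0.80, 0.70], [0.90, 0.80], [0.75, 0.90]]   # figures
labels = [1, 1, 1, 0, 0, 0]

model = train(regions, labels)
print(predict(model, [0.22, 0.30]), predict(model, [0.85, 0.80]))
```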