Showing papers on "Document layout analysis" published in 2003


Proceedings Article
13 Jan 2003
TL;DR: This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches.
Abstract: Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.

278 citations
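
As an illustration of the ordered-tree view described in the survey abstract above, here is a minimal Python sketch. It is not the survey's own notation: the `Block` class, the `(left, top, right, bottom)` box convention, the helper names, and the sample page are all assumptions made for this example. Child order stands for reading order, and containment is checked geometrically.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (left, top, right, bottom) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Block:
    """One node of the ordered layout tree (a page, region, or block)."""
    label: str
    bbox: BBox
    children: List["Block"] = field(default_factory=list)


def contains(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def is_well_formed(node: Block) -> bool:
    """Containment relation: every child must sit inside its parent."""
    return all(contains(node.bbox, child.bbox) and is_well_formed(child)
               for child in node.children)


def reading_order(node: Block) -> List[str]:
    """Order relation: leaves listed in depth-first, left-to-right child order."""
    if not node.children:
        return [node.label]
    labels: List[str] = []
    for child in node.children:
        labels.extend(reading_order(child))
    return labels


# Illustrative two-column page (made-up coordinates).
page = Block("page", (0, 0, 600, 800), [
    Block("title", (50, 40, 550, 90)),
    Block("body", (50, 110, 550, 760), [
        Block("column-1", (50, 110, 290, 760)),
        Block("column-2", (310, 110, 550, 760)),
    ]),
])

print(is_well_formed(page))
print(reading_order(page))
```

A tree grammar, as mentioned in the abstract, would additionally constrain which labels may appear as children of which parents; that part is left out of this sketch.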


01 Jan 2003
TL;DR: This paper summarizes research in document layout analysis carried out over the last few years in the author's laboratory, which has developed a number of novel geometric algorithms and statistical methods that are applicable to a wide variety of languages and layouts.
Abstract: In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture conversions into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured to be adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications.

114 citations


Patent
30 May 2003
TL;DR: A number of possible document layouts and document designs are stored, and a document is based on the combination of one of the stored layouts and one of the stored designs; while viewing the electronic document, the user has controls for viewing and selecting among alternate layouts and designs for that document.
Abstract: Electronic document design methods and computer programs allowing a user to separately control and modify layout and the design components of an electronic document. A number of possible document layouts and possible document designs are stored. A document is based on the combination of one of the stored layouts and one of the stored designs. While the user is viewing the electronic document, controls are available to the user allowing the user to view and select among alternate layouts and designs for that document. Color schemes and font schemes for the document may also be separately controlled.

96 citations


Patent
20 May 2003
TL;DR: In this article, a method of automated document structure identification based on visual cues is proposed, which can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.
Abstract: A method of automated document structure identification based on visual cues is disclosed herein. The two-dimensional layout of the document is analyzed to discern visual cues related to the structure of the document, and the text of the document is tokenized so that similarly structured elements are treated similarly. The method can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.

91 citations


Proceedings Article
Michael Shilman, Zile Wei, Sashi Raghupathy, Patrice Y. Simard, D. Jones
03 Aug 2003
TL;DR: This paper presents an integrated approach to parsing textual structure in freeform handwritten notes that solves the layout analysis and classification problems simultaneously: the problems are so tightly coupled that it is not possible to solve one without the other for real user notes.
Abstract: This paper presents an integrated approach to parsing textual structure in freeform handwritten notes. Text-graphics classification and text layout analysis are classical problems in printed document analysis, but the irregularity in handwriting and content in freeform notes reveals limitations in existing approaches. We advocate an integrated technique that solves the layout analysis and classification problems simultaneously: the problems are so tightly coupled that it is not possible to solve one without the other for real user notes. We tune and evaluate our approach on a large corpus of unscripted user files and reflect on the difficult recognition scenarios that we have encountered in practice.

87 citations


Journal Article
TL;DR: Preliminary experimental results on document images captured from students’ theses show that the proposed approach to retrieving documents from CCITT Group 4 compressed document images achieves promising performance.

85 citations


Patent
13 Mar 2003
TL;DR: A method for document manipulation that embeds additional information in document data in which the layout and position of an element have been defined: a region of the document where the information is to be embedded is rendered, the information is embedded in a part of the rendered image data, and an image of that part is merged with a predetermined region in the original document data.
Abstract: A method for document manipulation which embeds additional information in document data in which the layout and position of an element have been defined comprises a process of generating rendered image data by rendering a region where additional information is to be embedded in the document, a process of embedding additional information in a part of the rendered image data, and a process of merging an image of the part in which the additional information is embedded in the rendered image data with a predetermined region in the original document data.

59 citations


Patent
20 Aug 2003
TL;DR: A method, system, apparatus, and graphical user interface (GUI) for generating structured document files from a document image are disclosed; the process begins by segmenting the document image into one or more zones containing respective text images.
Abstract: A method, system, apparatus, and graphical user interface (GUI) for generating structured document files from a document image is disclosed. Structured document files are generated by segmenting the document image into one or more zones containing respective text images, converting the respective text images to digital text, automatically identifying layout information for each of the one or more zones, labeling each of the one or more zones in accordance with a schema, and automatically associating mark-up language tags with the labeled zones to generate the structured document files responsive to the identified layout information and a model file.

55 citations


Proceedings Article
10 Mar 2003
TL;DR: This paper proposes preprocessing the input document images to compensate for variations due to writing style, thereby making them suitable for analysis on the basis of their visual appearance; denoising, thinning, pruning, m-connectivity and text size normalization are applied in sequence.
Abstract: Script-based text document classification is an important field of research in the context of multilingual textual document processing. However, the script identification techniques available in the literature so far do not consider handwritten documents. Variations in the writing style, character size, inter-line and inter-word spacings, etc. make the recognition process difficult and unreliable when these script identification algorithms, more specifically visual-appearance-based approaches, are applied directly to handwritten documents. Therefore, in this paper, we propose to preprocess the input document images so as to compensate for the variations due to writing style, thereby making them suitable for analysis on the basis of their visual appearances. Accordingly, we apply denoising, thinning, pruning, m-connectivity and text size normalization in sequence. Multi-channel Gabor filtering is used to extract texture features that characterize the visual appearances of the document images. Experimental results demonstrate the potential of our proposed method of script identification for handwritten text document classification.

51 citations


Proceedings Article
03 Aug 2003
TL;DR: This paper describes the performance of segmentation algorithms and their adaptation to complex structured Arabic documents such as newspapers, and reports promising experimental results.
Abstract: The aim of layout analysis is to extract the geometric structure from a document image. It consists of labeling homogeneous regions of a document image. This paper describes the performance of segmentation algorithms and their adaptation in order to treat complex structured Arabic documents such as newspapers. Experimental tests have been carried out on four different phases of newspaper image analysis: thread recognition, frame recognition, image text separation, text line recognition, and line merging into blocks. Some promising experimental results are reported.

47 citations


Patent
05 Jun 2003
TL;DR: Information in a distributed data processing system is indexed by providing document structure templates comprising model document structures and semantics for those structures, identifying the structure of a document, selecting a template in dependence upon the structure of the document and the model document structures in the templates, and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected template.
Abstract: Indexing information in a distributed data processing system, including providing document structure templates comprising model document structures and semantics for the model document structures; identifying the structure of a document; selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates; and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected document structure template. Selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates typically further comprises comparing the structure of the document and the model document structures in the templates; and selecting a template whose model document structure matches the structure of the document.

Proceedings Article
03 Aug 2003
TL;DR: This work presents a fully automatic identification and segmentation of a table of contents (TOC) page from a scanned document to help develop a digital document library.
Abstract: With an aim to extract the structural information from the table of contents (TOC) to help develop a digital document library, the requirement of identifying/segmenting the TOC page is obvious. The objective to create a digital document library is to provide a non-labour intensive, cheap and flexible way of storing, representing and managing the paper document in electronic form to facilitate indexing, viewing, printing and extracting the intended portions. Information from the TOC pages is to be extracted for use in a document database for effective retrieval of the required pages. We present a fully automatic identification and segmentation of a table of contents (TOC) page from a scanned document.

Proceedings Article
03 Aug 2003
TL;DR: This paper presents fully automatic segmentation of displayed-math zones from the document image, using only the spatial layout information of math-formulas and equations, so as to help commercial OCR systems which cannot discern math-zones and also for the identification and arrangement of math symbols by others.
Abstract: With an aim to high-level understanding of the mathematical contents in a document image, the requirement of a math-zone extraction and recognition technique is obvious. In this paper we present fully automatic segmentation of displayed-math zones from the document image, using only the spatial layout information of math-formulas and equations, so as to help commercial OCR systems which cannot discern math-zones and also for the identification and arrangement of math symbols by others.

Proceedings Article
03 Aug 2003
TL;DR: An algorithm is presented that finds globally maximal whitespace rectangles on page images at arbitrary orientations and eliminates the need for page rotation correction prior to background analysis and can be applied to considerably more complex page layouts than previously possible.
Abstract: The analysis of the background structure (whitespace) of page images has become an important technique for physical document layout analysis. Globally maximal whitespace rectangles have been previously demonstrated to constitute a concise representation of the major layout features of documents. However, previous methods for computing maximal whitespace rectangles were limited to axis-aligned rectangles. This paper presents an algorithm that finds globally maximal whitespace rectangles on page images at arbitrary orientations. The new algorithm eliminates the need for page rotation correction prior to background analysis and can be applied to considerably more complex page layouts than previously possible. The algorithm is resolution independent and takes as input a list of foreground shapes (e.g., character or word bounding boxes or polygons) and a set of parameter ranges; it outputs the N largest non-overlapping maximal whitespace rectangles whose parameters (location, width, height, orientation) fall within the required parameter ranges. Examples of applications of the method to severely skewed documents, as well as the UW3 database, are presented.
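
To make the notion of a globally maximal whitespace rectangle concrete, here is a rough Python sketch of the simpler axis-aligned case that the abstract attributes to earlier methods. It is not the arbitrary-orientation algorithm of this paper. It uses a best-first (branch-and-bound) search over sub-rectangles; the function names, the `(x0, y0, x1, y1)` convention, the pivot choice, and the sample obstacles are assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple

# Hypothetical convention: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors of a and b intersect (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def largest_whitespace(bound: Rect, obstacles: List[Rect]) -> Optional[Rect]:
    """Largest axis-aligned rectangle inside `bound` that overlaps no obstacle."""
    # Max-heap on area (negated for heapq); area is an upper bound on any
    # empty rectangle contained in the candidate.
    heap = [(-area(bound), bound, [o for o in obstacles if overlaps(o, bound)])]
    while heap:
        _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # first obstacle-free candidate popped is optimal
        # Pivot: the obstacle whose center is closest to the candidate's center.
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        p = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2
                                   + ((o[1] + o[3]) / 2 - cy) ** 2)
        # Any empty rectangle must lie fully on one side of the pivot.
        candidates = [
            (r[0], r[1], p[0], r[3]),  # left of pivot
            (p[2], r[1], r[2], r[3]),  # right of pivot
            (r[0], r[1], r[2], p[1]),  # above pivot (smaller y)
            (r[0], p[3], r[2], r[3]),  # below pivot (larger y)
        ]
        for s in candidates:
            if area(s) > 0:
                remaining = [o for o in obs if overlaps(o, s)]
                heapq.heappush(heap, (-area(s), s, remaining))
    return None


# Made-up page of 10 x 10 units with two word boxes.
boxes = [(2, 2, 4, 4), (6, 6, 8, 8)]
best = largest_whitespace((0, 0, 10, 10), boxes)
print(best, area(best))
```

The N largest non-overlapping rectangles mentioned in the abstract could be obtained by adding each result to the obstacle list and searching again, though that repeated-search strategy is also an assumption of this sketch.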

Proceedings Article
09 Nov 2003
TL;DR: This paper discusses how image-processing techniques can be used to perform document layout analysis of complex multiple-column PDF documents, which allows the conversion of these documents into the HTML format keeping the logical and physical layout intact.
Abstract: Portable document format (PDF) has become the de facto standard in many fields because of its independence of local formatting restrictions and its accurate reproducibility. On the other hand, HTML documents are becoming an integral form of our lives by being the dominant form for information exchange within the World Wide Web environment. This paper discusses how image-processing techniques can be used to perform document layout analysis of complex multiple-column PDF documents. This analysis allows the conversion of these documents into the HTML format keeping the logical and physical layout intact.

Patent
Osamu Imaichi, Tetsuo Nishikawa, Toru Hisamitsu, Makoto Iwayama, Masakazu Fujio
27 Feb 2003
TL;DR: Two document units are extracted from a document database, relevance degrees between individual elements of the group of document units are calculated, and the results are displayed on a two-dimensional coordinate plane depending on the relevance degree.
Abstract: The present invention visualizes the contents of a plurality of documents without a lack of the listing property. Two document units are extracted from a document database and relevance degrees between individual elements of a group of the document units are calculated. The results are displayed on a two-dimensional coordinate plane depending on the relevance degree.

Proceedings Article
03 Aug 2003
TL;DR: A machine learning approach is proposed to support the user during the correction of the layout analysis: the user corrects the results of the global analysis, and rules for layout correction are then learned from the sequence of user actions.
Abstract: In this paper, a machine learning approach to support the user during the correction of the layout analysis is proposed. Layout analysis is the process of extracting a hierarchical structure describing the layout of a page. In our approach, the layout analysis is performed in two steps: firstly, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables, and secondly, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis process strongly depends on the quality of the results of the first step. We investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multi-page documents are reported and discussed.

Patent
Atsuko Yagi
25 Mar 2003
TL;DR: A presentation data-generating device acquires document data sets, each containing contents to be presented, together with transformation data sets and layout data, and transforms the document data sets based on the layout data and the transformation data sets to generate a unified presentation data set.
Abstract: A presentation data-generating device includes a first part for acquiring document data sets, each containing contents to be presented, a second part for acquiring transformation data sets, each defining a transformation rule between the document data sets and a presentation data set, and a third part for acquiring layout data containing layout information that defines a layout of the document data sets. The device further includes a fourth part for transforming the document data sets based on the layout data and the transformation data sets to generate a unified presentation data set that presents contents of the document data sets based on the layout defined in the layout data.

Proceedings Article
19 Dec 2003
TL;DR: In this article, a novel image representation of compound document images, called SmartNails, is presented to overcome poor readability of text and recognizability of image features in low resolution thumbnails.
Abstract: In order to overcome poor readability of text and recognizability of image features in low resolution thumbnails, a novel image representation of compound document images - a SmartNail representation - is presented. SmartNails are replacements or supplements to traditional thumbnails for compound documents and contain cropped and scaled image and text segments. Image- and text-based analysis are merged to generate a layout for a particular display size with selected readable text and recognizable image regions. The analysis is efficiently performed by using information from document layout analysis and JPEG 2000 compressed file headers.

Journal Article
TL;DR: This paper proposes a method of text retrieval from document images using a similarity measure based on word shape analysis that directly extracts image features instead of using optical character recognition.
Abstract: In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units and then features called vertical bar patterns are extracted from these word units through local extrema points detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images using this method was compared with the result of ASCII version of those documents based on the N-gram algorithm for text documents.

Patent
21 Jul 2003
TL;DR: Electronic equipment is provided that can display input data from an external interface in a display region defined by a layout script text and that enables a user to easily select the layout to be adopted.
Abstract: Electronic equipment is provided that can display input data from an external interface in a display region defined by a layout script text and that enables a user to easily select the layout to be adopted. The electronic equipment, such as a television set, takes in a layout script text from outside, for example through a network, and stores it. The layout script text defines at least media information and a display region for that media information, which is input through various interfaces such as a video input terminal, broadcast reception, or a reader of a detachably mountable storage medium. The user selects an arbitrary layout script text among the stored layout script texts. Then, the electronic equipment reproduces media information in accordance with the selected layout script text to display the media information.

01 Jan 2003
TL;DR: This paper presents a computationally efficient procedure for skew detection in digitized text documents which is based on Linear Regression Analysis and provides good and accurate results.
Abstract: This paper presents a computationally efficient procedure for skew detection in digitized text documents which is based on Linear Regression Analysis. The determination of the skew angle in text documents is essential in Optical Character Recognition (OCR) systems and Document Image Mosaicing (DIM). We use the Linear Regression formula to estimate a skew angle for each text line segment of the given skewed text document. A part of each text line is extracted using static and dynamic thresholds in a projection-profile-based method. The proposed method is tested on a variety of text documents and it provides good and accurate results.
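
The regression step in the abstract above can be sketched in Python: fit a least-squares line through points sampled along one text line segment and take the arctangent of its slope as the skew angle. The point format, the function names, and the use of a median to combine per-line angles are assumptions made for this sketch; the projection-profile extraction of line segments is not shown.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates, hypothetical input format


def line_skew_degrees(points: Sequence[Point]) -> float:
    """Skew of one text line segment from the least-squares slope of y on x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same x; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # ordinary least-squares estimate
    return math.degrees(math.atan(slope))


def page_skew_degrees(lines: List[Sequence[Point]]) -> float:
    """Combine per-line estimates; the median is an assumption of this sketch."""
    return median(line_skew_degrees(points) for points in lines)


# Synthetic line segments lying exactly on lines of slope 0.10, 0.10 and 0.12.
line_a = [(x, 0.10 * x + 5) for x in range(0, 100, 10)]
line_b = [(x, 0.10 * x + 40) for x in range(0, 100, 10)]
line_c = [(x, 0.12 * x + 75) for x in range(0, 100, 10)]

print(line_skew_degrees(line_a))
print(page_skew_degrees([line_a, line_b, line_c]))
```

In image coordinates the y axis usually points down, so the sign of the returned angle depends on the coordinate convention of the input points.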

Patent
18 Nov 2003
TL;DR: In this paper, the authors propose a system for processing a structured document that can reduce labor for programming, in a software program for handling structured documents, by eliminating data conversion processing for processing of document contents and data extraction processing for selection of necessary data.
Abstract: PROBLEM TO BE SOLVED: To provide a system for processing a structured document that can reduce labor for programming, in a software program for handling a structured document, by eliminating data conversion processing for processing of document contents and data extraction processing for selection of necessary data. SOLUTION: A structured document holds a document structure definition represented by declarations of document elements forming document contents and by a set of semantic relations defined between the document elements, and a set of instances of the respective document elements matching the document structure definition. The system for processing a structured document, which comprises data reading means for reading the instances of document elements from the structured document and editing them into data processible by a software program to provide them, comprises basic data structure selecting means for selecting and specifying a data structure of the data provided by the data reading means, from known basic data structures or object structures, such as array, set, list, tree, graph and table structures. COPYRIGHT: (C)2005,JPO&NCIPI

Journal Article
TL;DR: A novel automatic text reading system is introduced that uses an active camera focused on text regions already located in the scene (using the authors' recent work); a number of images are captured over the text region to construct a high-resolution mosaic composite of the whole region.
Abstract: Document recognition is a lively research area with much effort concentrated on optical character recognition. Less attention is paid to locating and extracting text from the general (non-desktop, non-scanner) environment. Such contact-free extraction of text from a general scene has applications in the context of wearable computing, robotic vision, point and click document capture, or as an aid for visually handicapped people. Here, a novel automatic text reading system is introduced using an active camera focused on text regions already located in the scene (using our recent work). Initially, a located region of text is analysed to determine the optimal zoom that would foveate onto it. Then a number of images are captured over the text region to construct a high-resolution mosaic composite of the whole region. This magnified image of the text is suitable for reading by humans or for recognition by OCR, or even for text-to-speech synthesis. Although we employed a low resolution camera, we still obtained very good results.

Patent
29 Oct 2003
TL;DR: In this article, the authors present a method that enables a search and browse of document image groups through the application of a document structure analysis technique and a character recognition technique as searching/browsing means for paper documents and document images.
Abstract: PROBLEM TO BE SOLVED: To provide a method that enables a search and browse of a document image group through the application of a document structure analysis technique and a character recognition technique as searching/browsing means for paper documents and document images. SOLUTION: A highly functional document image search/browse system separates an OCR and a document processing apparatus, adopts as OCR output formats data (reading hypothesis data) holding multiple hypotheses of character line extraction, character segmentation and character recognition, and document structure data having ruled line information, frame information, character line information, browse attribute information and the like about a document image, and provides a function of important keyword extraction and document search from typed and handwritten character strings using OCR-added data, and of document display intended by a browser using the document structure data. COPYRIGHT: (C)2005,JPO&NCIPI

Patent
27 Mar 2003
TL;DR: A system for designing and printing original form layouts in small lots, in which the layout information for printing ruled lines or symbols on a form is designed so that it can be handled independently of the writing data obtained by converting handwriting on the form into electronic positional information.
Abstract: PROBLEM TO BE SOLVED: To provide a system capable of performing original layout designing of forms and printing thereof, which are used, when making electron data of contents input by handwriting by a small lot. SOLUTION: The layout design of the layout information is performed so as to be treatable independently of the writing data obtained by converting the information, in which the layout information for printing ruled lines or symbols on the forms is filled in onto the surface of the form by handwriting, into electronic positional information. The system is provided with a layout information receiving means for receiving the input of the layout information created to be treatable, independently of the writing data and to be set on the form, and an identifier giving means for giving an identifier by each layout information received by the layout information receiving means. COPYRIGHT: (C)2004,JPO

Proceedings Article
17 Sep 2003
TL;DR: A new system for document recognition is presented that follows open source methodologies and uses an XML description for document segmentation and classification, which turns out to be beneficial in terms of classification precision and general-purpose availability.
Abstract: Paper document recognition is fundamental for office automation, becoming every day a more powerful tool in those fields where information is still on paper. Document recognition follows from data acquisition, from both journals and entire books, in order to transform them into digital objects. We present a new system for document recognition that follows open source methodologies and uses an XML description for document segmentation and classification, which turns out to be beneficial in terms of classification precision and general-purpose availability.

Journal Article
TL;DR: A semantic document management system XBASE is designed and implemented based on the semantics and the functions of three main modules, X-Loader, X-Explorer and X-Query.
Abstract: In this paper, a system is presented where documents are no longer identified by their file names. Instead, a document is represented by its semantics in terms of descriptor and content vector. The descriptor of a document consists of a set of attributes, such as date of creation, its type, its size, annotations, etc. The content vector of a document consists of a set of terms extracted from the document. In this paper, a semantic document management system XBASE is designed and implemented based on the semantics and the functions of three main modules, X-Loader, X-Explorer and X-Query.

Proceedings Article
Hui Chao
20 Jan 2003
TL;DR: This paper presents a bottom-up approach to recognizing graphic illustrations in PDF documents; the technique can be used in automatic figure extraction, document re-flow and document transformation.
Abstract: PDF is a document format for final presentation. It preserves the original document layout but often not the document logical structure. Graphic illustrations such as figures and tables in PDF often consist of ungrouped graphic primitives such as lines, curves and small text elements. In this paper, we present a bottom-up approach to recognize graphic illustrations in PDF documents. Vicinities of page elements in both 2D space and indexes in layer are used to understand the logical connection between elements. Graphics recognition and element grouping for illustrations is an important part of understanding the document logical structure. This technique can be used in automatic figure extraction, document re-flow and document transformation.

Patent
30 Apr 2003
TL;DR: Text feature information based on the text included in each document and image feature information based on the document images are held for each of a plurality of documents, so that the original of an electronic document can be retrieved with practical response performance without imposing a special burden on the user.
Abstract: PROBLEM TO BE SOLVED: To retrieve an original of an electronic document with practical response performance without imposing a special burden on a user. SOLUTION: Text feature information based on texts included in the documents and image feature information based on document images are held for each of a plurality of documents. Character recognition processing is executed on the image data of the retrieved document to acquire the text feature information based on the obtained text, and to acquire the image feature information (lay-out information) based on the image data of the retrieved document (S92-S94). A memory is searched using the text feature information and the image feature information acquired for the retrieved document to retrieve the document corresponding to the retrieved document out of the plurality of documents (S98-S101). COPYRIGHT: (C)2005,JPO&NCIPI