
Showing papers on "Document layout analysis" published in 1998


Journal ArticleDOI
TL;DR: A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis.
Abstract: Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on the connected component extraction to efficiently implement page segmentation and region identification. A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has a high accuracy and takes approximately 1.4 seconds on an SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler) for a 2550 × 3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise.
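The bottom-up step named in this abstract, connected component extraction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `connected_components`, the 4-connectivity choice, and the list-of-lists binary image are assumptions made for illustration only.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of the 4-connected
    foreground regions in a binary image given as a list of rows (1 = ink)."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

A segmentation stage would then group these boxes into lines and blocks and label each region; that grouping is not shown here because the abstract does not describe it.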

239 citations


Patent
30 Sep 1998
TL;DR: In this article, a new document is automatically stored in one or more mirror directories in which the new document would most likely be stored by the user of the device if the new document were placed manually.
Abstract: A method and apparatus for automatic document classification using text and images. The present invention provides a method and apparatus for automatic document classification based on text and image. A new document is analyzed based on textual content as well as visual appearance. The new document is automatically stored in one or more mirror directories in which the new document would most likely be stored by the user of the device if the new document were placed manually. Determination of the most likely directories is based on an analysis of multiple documents stored by the user in various directories. The mirror directories are components of a mirror directory structure, which is a copy of a pre-existing directory structure, such as the user's hard drive. By storing the new document automatically, the user is relieved of the duty of manually selecting a directory for the new document.

148 citations


Journal ArticleDOI
Zhaoyang Lu1
TL;DR: An algorithm for text/graphics separation is presented that can be used to extract both Chinese and Western characters, dimensions, and symbols and has few limitations on the kind of engineering drawings and noise level.
Abstract: An algorithm for text/graphics separation is presented in this paper. The basic principle of the algorithm is to erase nontext regions from mixed text and graphics engineering drawings, rather than extract text regions directly. This algorithm can be used to extract both Chinese and Western characters, dimensions, and symbols and has few limitations on the kind of engineering drawings and noise level. It is robust to text-graphics touching, text fonts, and written orientations.

74 citations


Patent
Robert Cooperman1
30 Apr 1998
TL;DR: In this paper, a method for detecting insets in the structure of a document page so as to further complement the document layout and textual information provided in an optical character recognition system is presented.
Abstract: The present invention is a method for detecting insets in the structure of a document page so as to further complement the document layout and textual information provided in an optical character recognition system. A system employing the present method preferably includes a document layout analysis system wherein the inset detection methodology is used to extend the capability of an associated character recognition package to more accurately recreate the document being processed.

73 citations


Book
01 Jan 1998
TL;DR: The key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure, which can provide the basis for flexible document manipulation tools in heterogeneous collections.
Abstract: The availability of large, heterogeneous repositories of electronic documents is increasing rapidly, and the need for flexible, sophisticated document manipulation tools is growing correspondingly. These tools can benefit greatly by exploiting logical structure, a hierarchy of visually observable organizational components of a document, such as paragraphs, lists, sections, etc. Knowledge of this structure can enable a multiplicity of applications, including hierarchical browsing, structural hyperlinking, logical component-based retrieval, and style translation. Most work on the problem of deriving logical structure from document layout either relies on knowledge of the particular document style or finds a single flat set of text blocks. This thesis describes an implemented approach to discovering a full logical hierarchy in generic text documents, based primarily on layout information. Since the styles of the documents are not known a priori, the precise layout effects of the logical structure are unknown. Nonetheless, typographical capabilities and conventions provide cues that can be used to deduce a logical structure for a generic document. In particular, the key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure. The problem of logical structure discovery is divided into problems of segmentation, which separates the text into logical pieces, and classification, which labels the pieces with structure types. The segmentation algorithm relies entirely on layout-based cues, and the classification algorithm uses word-based information only when this is demonstrably unavoidable. Thus, this approach is particularly appropriate for scanned-in documents, since it is more robust with respect to OCR errors than a content-oriented approach would be. 
It is applicable, however, to the problem of analyzing any electronic document whose original formatting style rules remain unknown; thus, it can provide the basis for flexible document manipulation tools in heterogeneous collections.

49 citations


Patent
31 Dec 1998
TL;DR: In this article, a document assembly system assembles and prints one or more documents in response to input data describing the nature and circumstances of a transaction to be documented and describing the parties to the transaction.
Abstract: A document assembly system assembles and prints one or more documents in response to input data describing the nature and circumstances of a transaction to be documented and describing the parties to the transaction. The document assembly system initially produces a separate document definition object for each document to be produced and a separate party definition object for each party to the transaction. The document definition object includes procedures for generating “document-related” text that a document may use when referring to itself. The party definition object includes procedures for generating party-related text that the document is to use when referring to a party. The nature of the text each document definition or party definition object procedure produces depends on the nature of the document or the party as indicated by the input data. The system also includes a set of “text generators”, blocks of source code which when compiled and executed, generate the text that may be included in a document. When the nature of a word or phrase to be included in a document depends on the nature of the document or on the nature of a party, the text generator refers to the word or phrase by referring to a procedure of the document or party definition object which generates the word or phrase.

37 citations


Patent
20 Aug 1998
TL;DR: A document identifying means automatically identifies the kind of a document image inputted at random by using plural logical models. The identified result is read and edited on an interface screen, where dropping an input document icon onto the logical model folder of a certain category causes the input document to be identified using only that model.
Abstract: PROBLEM TO BE SOLVED: To provide an interface which is convenient to a user, and to easily construct document folder constitution in a structure which is the most convenient to the user while operating not only simple classification but also an identifying work at the time of automatically identifying the kind of a document image inputted at random, and then reading or editing the identified result. SOLUTION: A document identifying means 104 identifies the kind of an input document image 101 by using plural logical models 106, and outputs the identified result to a document data base 107. At the time of reading and editing the identified result, this processing is operated on an interface screen 108. An input document icon is dropped to the logical model folder of a certain category so that the input document can be identified by using only the model.

24 citations


17 Dec 1998
TL;DR: This dissertation shows how to improve the online manipulation capabilities of potentially all formats, media types, and genres of existing and future digital documents with the Multivalent Document Model, which extensively opens to enhancement all aspects of a digital document system.
Abstract: Digital documents are important. Whatever else computer workers do, they expend a considerable time working with digital documents, whether as e-mail, word processing files, presentation slides, web pages, discussion groups, or help systems, among many other ways. This dissertation shows how to improve the online manipulation capabilities of potentially all formats, media types, and genres of existing and future digital documents. The Multivalent Document Model extensively opens to enhancement all aspects of a digital document system. Document content is constructed from layers of often heterogeneous type, each with specialized purpose, all semantically aligned. All user-visible document functionality is constructed from stylized program components called behaviors. Document system operations, such as drawing a representation of the document on the screen and saving an edited version, derive from the fundamental operation found to some degree in every digital document system, newly codified as extensible programmatic protocols. This diverse open content, open functionality, and open operation are woven together by numerous mechanisms to produce a final composition that appears built from the ground up as a unified whole. A prototype of the Model, called the Multivalent Document System, has been realized in Java, deployed, and built upon by several third party developers. The System has been tested by and has significantly contributed to the development of three sample applications. The first application allows paper scanned into a computer as images to be manipulated as a live, semantic object, with text copy and paste, text search, and a "lens" operation that displays the corresponding ASCII translation of the region. 
In the second application, HTML, the lingua franca of the web and a very different document type than scanned page images, has been extended with new functionality including outline displays, a speed reading window, and tables sorted on demand. As a third application, both of the above document types can be annotated in situ with hyperlinks, highlights, floating note windows, a new display mode called Notemarks, and executable copy editor markup. Annotations and behaviors in general can be distributed across the network, augment documents on read-only media, and operate on potentially any document format with a single, format-neutral implementation. This dissertation describes the design of the Multivalent Document Model, its implementation as the Multivalent Document System, and the specialization of the Model in each of the three example applications.

21 citations


Patent
04 Dec 1998
TL;DR: In this article, a feature word extracting part 33 extracts words from document data designated with a document identification ID of dated document data and accumulates the number of words of each word in every area and period and obtains an appearance frequency.
Abstract: PROBLEM TO BE SOLVED: To provide a user with document data having common feature among many document data. SOLUTION: A feature word extracting part 33 extracts words from document data designated with a document identification ID of dated document data 22, accumulates the number of words of each word in every area and period and obtains an appearance frequency. It further extracts the fixed number of words having large appearance frequencies of every area and period as feature words. When a user designates an area and a period, feature words of document data in the period are displayed, and when a specified feature word is selected, the document headline, etc., of the document data including the feature word are displayed. COPYRIGHT: (C)2000,JPO
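The frequency-based selection described in this abstract can be sketched as follows. The record fields (`area`, `period`, `text`), the whitespace tokenizer, and the function name `feature_words` are hypothetical; the patent abstract does not specify how words are extracted or how ties are broken.

```python
from collections import Counter, defaultdict


def feature_words(documents, top_n=3):
    """Count words per (area, period) and keep the top_n most frequent ones.

    Each document is a dict with hypothetical keys "area", "period", "text".
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = defaultdict(Counter)
    for doc in documents:
        key = (doc["area"], doc["period"])
        counts[key].update(doc["text"].lower().split())
    result = {}
    for key, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[key] = [word for word, _ in ranked[:top_n]]
    return result
```

Selecting a feature word would then amount to filtering the documents of that area and period for ones containing the word.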

19 citations


Proceedings ArticleDOI
05 Apr 1998
TL;DR: The method presented in this paper, despite its simplicity, has been shown to be effective and robust and requires less computation than the segmentation methods using texture described in other papers.
Abstract: This paper describes a document segmentation method based on segmentation by texture using low-resolution gray-level images. The method is derived from the theory of human visual perception. The concepts used from this theory are global-to-local processing and low-resolution information. If a document is viewed from a certain distance, the person sees a blurred image of the document but is still able to detect the different blocks of the document. Detection is possible since each block has a specific texture pattern. These patterns correspond to regions of text, regions of graphics, and regions of pictures. Thus the hypothesis to prove is that a document image can be segmented into regions of text and regions of graphics and/or pictures using the texture of low-resolution images. The method presented in this paper, despite its simplicity, has been shown to be effective and robust. It was designed to work with free-format documents, text on backgrounds other than white, and skew greater than 10 degrees. It requires less computation than the segmentation methods using texture described in other papers.

18 citations


Patent
17 Sep 1998
TL;DR: In this article, an interactive document image data base programming/retrieving system with which a document image can be retrieved through an intuitive method easy to comprehend is presented.
Abstract: PROBLEM TO BE SOLVED: To provide an interactive document image data base programming/retrieving system with which a document image can be retrieved through an intuitive method easy to comprehend. SOLUTION: A user inputs a keyword (102), and a system retrieves a document image containing that keyword (104). The system classifies these document images into a plurality of clusters based on visual features (108) and automatically selects the representative document image of each cluster (110), and that image is displayed for the user. When further continuing retrieval, the user selects one of these representative document images (114). The system classifies the document images in the cluster corresponding to the selected representative document image again (108), selects the representative document image of each cluster and displays it for the user (110 and 112). COPYRIGHT: (C)1999,JPO

Patent
15 Jan 1998
TL;DR: An apparatus for autodiscrimination of multi-level signal representative of a document having text and image content is described in this article, which includes a digital signal processor and random access memory.
Abstract: An apparatus for autodiscrimination of multi-level signal representative of a document having text and image content is described. A content module processes the multi-level signal serially. Portions of the signal corresponding to text and fine line features are processed to enhance contrast. Portions of the signal corresponding to images are processed to enhance image quality. In one embodiment, the content processor includes a digital signal processor and random access memory. The digital signal processor can be programmable, allowing the methods used to process the document data to be adapted to different types of documents. Text orientation in the document is not critical due to the random accessibility of the memory. The invention also relates to a method of processing a document image.

Book ChapterDOI
30 Mar 1998
TL;DR: A novel formalism called generalized n-gram is presented, which is shown to be accurate for the recognition task and well adapted to automatic learning by examples and the thorny problem of integrating models for document analysis with existing standards used for document manipulation and production is addressed.
Abstract: This paper deals with the representation of document models used in the field of document recognition. A novel formalism called generalized n-gram is presented, which is shown to be accurate for the recognition task and well adapted to automatic learning by examples. The paper addresses also the thorny problem of integrating models for document analysis with existing standards used for document manipulation and production.

Patent
22 Sep 1998
TL;DR: In this paper, the problem of extracting a magazine item by using only layout information, and automatically identifying a document type at the time of recognizing a logical structure was solved by dividing an input document picture into elements such as character areas, and detecting the layout feature of a document.
Abstract: PROBLEM TO BE SOLVED: To extract a magazine item by using only layout information, and to automatically identify a document type at the time of recognizing a logical structure. SOLUTION: A picture dividing means 104 divides an input document picture into elements such as character areas, and detects the layout feature of a document. A logical structure model preparing means 107 prepares a logical structure model for each of plural mode documents 102. An element extraction processing means 105 extracts the logical element from the document picture by using one model in the logical structure model, and calculates the similarity of the layout feature of the model to the layout feature of the document corresponding to the extracted logical element, and when a value obtained by multiplying the similarity by a prescribed value is equal to a threshold value or more, an outputting means 106 outputs the extracted logical element to a document data base 108. COPYRIGHT: (C)1999,JPO

Book ChapterDOI
04 Nov 1998
TL;DR: The system consists of three main components namely detection of mathematical expressions in a document, recognition of the symbols present in the expression and meaningful arrangement of the recognized symbols.
Abstract: In this paper, we propose an approach for understanding mathematical expressions in printed documents. The system consists of three main components, namely (i) detection of mathematical expressions in a document, (ii) recognition of the symbols present in the expression, and (iii) meaningful arrangement of the recognized symbols. However, detection of mathematical expressions is done through recognition of symbols. Moreover, some structural features of the expressions are also used for this purpose. For recognition of the symbols, a hybrid of feature-based and template-based recognition techniques is used. The bounding-box coordinates and the size information of the symbols help to determine the spatial relationships among the symbols. A set of predefined grammar rules is used to form the meaningful symbol groups to properly arrange the symbols. Experiments conducted using these approaches on a large number of documents show high accuracy.
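How bounding-box coordinates and size can indicate a spatial relationship between two symbols might look like the sketch below. The rules and the `size_ratio` threshold are assumptions chosen for illustration; the abstract does not state the actual criteria. Boxes are `(left, top, right, bottom)` with y increasing downward.

```python
def relation(base, other, size_ratio=0.8):
    """Classify `other` relative to `base` using only bounding boxes.

    Hypothetical rules: a smaller symbol to the right that sits entirely
    above (below) the base's vertical midline is a superscript (subscript).
    """
    b_left, b_top, b_right, b_bottom = base
    o_left, o_top, o_right, o_bottom = other
    b_height = b_bottom - b_top
    o_height = o_bottom - o_top
    b_mid = (b_top + b_bottom) / 2
    if o_left >= b_right:  # other starts to the right of base
        if o_height <= size_ratio * b_height:
            if o_bottom <= b_mid:
                return "superscript"
            if o_top >= b_mid:
                return "subscript"
        return "adjacent"
    if o_bottom <= b_top:
        return "above"
    if o_top >= b_bottom:
        return "below"
    return "overlapping"
```

Grammar rules of the kind the abstract mentions would consume these pairwise labels to group symbols into larger expressions.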

Patent
27 May 1998
TL;DR: In this paper, the authors proposed a logical model detecting approach to accurately extract bibliographic items by using only layout information, rather than a character recognition result, at the time of recognizing logical structure.
Abstract: PROBLEM TO BE SOLVED: To accurately extract bibliographic items by using not a character recognition result, but only its layout information at the time of recognizing logical structure. SOLUTION: A layout feature extracting means 103 divides an input document image into elements such as areas and detects features regarding document layout structure. A logical model detecting means 104 detects a model matching the type of a document to be processed out of plural models and an element extracting process means 105 extracts bibliographic items from the document image by using the detected logical model. When it is decided (106) that the model need not be updated, the extracted bibliographic items are outputted (107) and when it is updated, the logical model is updated by using the decided document, a sample document, etc.

Patent
Yoshiharu Konishi1
17 Dec 1998
TL;DR: The character image layout method includes a reference line-selecting process for selecting a desired one of n lines of character string images as a reference line and an other line layout process for determining the range of the length of the character string image on the reference line as a layout range to lay out each character string image on each of the other lines within the layout range.
Abstract: A character image layout method lays out n lines (2≦n≦m) out of m lines of character string images (m≧2) through any of uniform layout processing, left end alignment processing, right end alignment processing, center alignment processing, scale-up processing and scale-down processing. The character image layout method includes a reference line-selecting process for selecting a desired one of n lines of character string images as a reference line and an other line layout process for determining the range of the length of a character string image on the reference line as a layout range to lay out each character string image on each of other lines, within the layout range. In the other line layout process, when the length of a character string image on any of the n lines to be laid out is longer than that of the layout range, the any of the n lines to be laid out is reduced in size through the scale-down processing and laid out in the layout range.
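The reference-line idea in this abstract can be sketched with line widths alone. This is an illustrative reading, not the patented procedure: the function name `layout_lines` and the `(offset, scale)` return shape are hypothetical, and only left, right, and center alignment plus scale-down are covered (uniform layout and scale-up are omitted).

```python
def layout_lines(widths, reference_index, align="left"):
    """Place each line inside the layout range set by the reference line.

    Returns one (offset, scale) pair per line. A line longer than the
    layout range is scaled down to fit; shorter lines keep scale 1.0.
    """
    layout_range = widths[reference_index]
    placed = []
    for width in widths:
        if width <= 0:
            raise ValueError("line widths must be positive")
        scale = min(1.0, layout_range / width)  # scale-down only
        laid_out = width * scale
        if align == "left":
            offset = 0.0
        elif align == "right":
            offset = layout_range - laid_out
        elif align == "center":
            offset = (layout_range - laid_out) / 2
        else:
            raise ValueError(f"unsupported alignment: {align}")
        placed.append((offset, scale))
    return placed
```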

Journal ArticleDOI
TL;DR: This paper incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage and proposes a knowledge-based query-preprocessing algorithm, which reduces the search space.
Abstract: This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.

Patent
23 Apr 1998
TL;DR: In this paper, a document processing apparatus is used to avoid inadvertent errors or omissions concerning temporal phrases, including dates, made when a new document is created using a pre-existing document.
Abstract: A document processing apparatus includes a text processor which compares temporal words and phrases of a first document which is used to generate a second, modified document, with temporal words and phrases in the second, modified document, and with temporal metrics. The result of the comparison operation is used to determine whether any temporal words in the modified document have not been updated or should be changed. The document processing apparatus helps to avoid inadvertent errors or omissions concerning temporal phrases, including dates, made when a new document is created using a pre-existing document.

Book ChapterDOI
08 Jan 1998
TL;DR: This paper proposes a categorization method on the basis of the classification and verification paradigm that divides various kinds of documents into appropriate document types step by step.
Abstract: In knowledge-based document image understanding, it is important to distinguish the layout structures of individual documents exactly in order to make use of an adaptable document model. Document models that are characterized heuristically by application-specific layout structures are not always applicable to every document. In this paper, we propose a categorization method for various kinds of documents. Our categorization method, based on the classification and verification paradigm, divides various kinds of documents into appropriate document types step by step. First, the classification procedure divides the given documents using rough features of the documents, and then the verification procedure is applied to the globally categorized document sets, using the detailed features.

Book ChapterDOI
04 Nov 1998
TL;DR: An Extended Split Detection Method that can hierarchically segment a machine-printed page image with a complex layout into smaller layout elements and represents an analyzed layout of a hierarchical structure in a tree data structure.
Abstract: This paper describes an Extended Split Detection Method that can hierarchically segment a machine-printed page image with a complex layout into smaller layout elements. The method performs piecewise-linear segmentation using many kinds of separator elements such as field separators, lines, edges of figures, and edges of white background areas. Furthermore, this method represents an analyzed layout of a hierarchical structure in a tree data structure, in which all nodes are traversed according to the simple rules for generating the reading sequence. We demonstrated that the new method increases the correct character line segmentation rate by 15.5%, to 95.5%, and we achieved a correct reading sequence generation of 88.1%.
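A layout tree whose traversal yields a reading sequence can be sketched as below. The abstract does not state the traversal rules, so the ones used here are assumptions: children of a node split into rows are read top to bottom, children of a node split into columns are read left to right, and leaves are emitted depth-first. The `Node` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    box: tuple                 # (left, top, right, bottom)
    split: str = ""            # "rows" or "columns"; empty for a leaf
    children: list = field(default_factory=list)


def reading_order(node):
    """Return leaf labels in reading order under the assumed rules."""
    if not node.children:
        return [node.label]
    if node.split == "rows":
        ordered = sorted(node.children, key=lambda n: n.box[1])  # by top
    else:
        ordered = sorted(node.children, key=lambda n: n.box[0])  # by left
    order = []
    for child in ordered:
        order.extend(reading_order(child))
    return order
```

For a page with a title above a two-column body, this produces the title first, then the left column, then the right column, regardless of the order in which the children were stored.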

Journal ArticleDOI
01 Feb 1998
TL;DR: A novel computational paradigm for extracting text/graphic blocks from Chinese document images, which is based on a notion of distributed autonomous agents that are adaptive to the locality of given images and hence efficient in locating the homogeneous image blocks.
Abstract: In Chinese document image processing, text and/or graphical block detection serves as an essential step in document layout analysis that in turn permits the effective reasoning about the logical relationships among various text paragraphs and graphical entities for the purpose of document understanding. This paper presents a novel computational paradigm for extracting text/graphic blocks from Chinese document images, which is based on a notion of distributed autonomous agents. The primary features of the agents lie in that they are (1) adaptive to the locality of given images and hence efficient in locating the homogeneous image blocks, (2) reliable in performing image processing as the computation proceeds simultaneously from different image locations, (3) less sensitive to the noise in the given images as the computation disperses gracefully when it is moving away from the homogeneous blocks, and (4) easy to represent in their behaviors and evolvable in their performance. The paper, first of all, describes the formalisms as well as the behavioral characteristics of the agents, which is followed by a demonstration of the agents in detecting document blocks from some real-life images.

Proceedings ArticleDOI
TL;DR: An integrated IDU system that processes documents all the way from recognition to full utilization using standard generalized markup language (SGML), the means for filling the gap between document recognition and seamless document reuse.
Abstract: Intelligent document understanding (IDU) systems convert scanned document pages into an electronic format which preserves layout and logical document structure in addition to document content. Most of the IDU experimental systems, however, lack the capability of full exploitation of recognition results. In this paper we present an integrated IDU system that processes documents all the way from recognition to full utilization using standard generalized markup language (SGML). The standardization and widespread use of SGML-based tools provides the means for filling the gap between document recognition and seamless document reuse. The conversion process involves OCR of a multipage document, document structure analysis, processing of tabular data and mathematical expressions, and generation of the final SGML description. Document structure analysis is reduced here to parsing OCR results and recreating document structure by performing fuzzy searches for standard phrases and format analysis. Tabular data processing utilizes OCR results with positional data, horizontal lines and heuristic rules to determine cell boundaries and contents. Recognition of mathematical expressions involves OCR on an extended symbol set, and equation structure recognition via transformations on a tree representation. The transformations are ordered and involve connecting of separated symbols, context-sensitive OCR correction, extraction of horizontally aligned subexpressions, subscript and superscript processing, and a general processing of symbols detected above or below the target symbol.

Proceedings ArticleDOI
18 May 1998
TL;DR: This paper addresses the step of contextual postprocessing using the extended information received from the pattern classifiers as well as the information about the pattern preclassification according to its shape.
Abstract: The process of recognizing characters from a scanned image can be divided into three main operational steps: document layout analysis, character recognition, and contextual postprocessing. This paper addresses the step of contextual postprocessing using the extended information received from the pattern classifiers as well as the information about the pattern preclassification according to its shape. Every pattern is examined by several classifiers, and their decisions are combined in a list of character candidates along with their levels of confidence. Combining the character candidates at different letter positions generates a list of word candidates, and the lexicon is checked for their existence. Word candidates are generated one at a time, in order of descending word confidence, and the first candidate found in the lexicon is accepted.
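Generating word candidates one at a time in descending confidence and accepting the first lexicon hit can be sketched with a best-first search. Two points are assumptions not taken from the abstract: word confidence is modeled as the product of per-position character confidences, and each position's candidate list is already sorted by descending confidence. The function name `first_lexicon_match` is hypothetical.

```python
import heapq
from math import prod


def first_lexicon_match(candidates, lexicon):
    """Return (word, confidence) for the most confident word found in the
    lexicon, or None if no combination of candidates is in the lexicon.

    candidates: one list per letter position of (char, confidence) pairs,
    each list sorted by descending confidence.
    """
    def score(indices):
        return prod(candidates[p][i][1] for p, i in enumerate(indices))

    start = (0,) * len(candidates)
    heap = [(-score(start), start)]  # max-heap via negated scores
    seen = {start}
    while heap:
        negated, indices = heapq.heappop(heap)
        word = "".join(candidates[p][i][0] for p, i in enumerate(indices))
        if word in lexicon:
            return word, -negated
        # Successors demote exactly one position to its next candidate,
        # so their scores never exceed the score of the popped word.
        for p in range(len(indices)):
            if indices[p] + 1 < len(candidates[p]):
                nxt = indices[:p] + (indices[p] + 1,) + indices[p + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))
    return None
```

The heap avoids materializing every combination up front, which matters because the number of word candidates grows multiplicatively with word length.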

Patent
02 Jun 1998
TL;DR: In this article, the authors propose a method and device for extracting document information by which even new information can be extracted without preparing patterns, heuristics, etc. in advance, and provide a storage medium in which a document information extracting program is stored.
Abstract: PROBLEM TO BE SOLVED: To provide a method and device for extracting document information by which even new information can be extracted without preparing patterns, heuristics, etc. in advance, and to provide a storage medium in which a document information extracting program is stored. SOLUTION: First characteristic information, which characterizes an object document set with respect to a standard document set, is calculated from the object document set (S1). Second characteristic information, which characterizes each separate document in the object document set with respect to the other separate documents, is calculated from each separate document in the object document set (S2). Based on the first and second characteristic information, the information which better characterizes the object document set and also characterizes each separate document with respect to the other separate documents is extracted from each separate document (S3), and the extracted information is output as the information that characterizes each separate document. COPYRIGHT: (C)1999,JPO

Patent
20 Mar 1998
TL;DR: In this paper, the layout analysis of a document is performed by analyzing the layout structure of the document according to the input image data, and the layout information storage part stores layout information showing the relation between the analyzed layout structure and information regarding the input image data.
Abstract: PROBLEM TO BE SOLVED: To easily perform more kinds of processing on image data of a document. SOLUTION: The processor is equipped with an image input part 10 which receives a document in the form of image data, a layout analysis part 14 which analyzes the layout structure of the document according to the inputted image data, a layout information storage part 16 which stores layout information showing the relation between the analyzed layout structure of the document and information regarding the inputted image data, an image processing part 20 which displays an image of the document based upon the inputted image data and the area showing the layout structure according to the layout information, and a correction part 24 which receives an indication for correction of the area displayed by the image processing part 20 and corrects the layout information according to the indication. COPYRIGHT: (C)1999,JPO

Dissertation
01 Jan 1998
TL;DR: A system that exploits, on the one hand, information about the content of the document, such as its physical and logical structures, and, on the other hand, two-level grammars to convert paper documents into HTML documents, because such documents are more widely used on the Internet.
Abstract: This work is part of the "Document Analysis" theme of the Laboratory Reconnaissance de Forme et Vision (RFV). To build an analysis system able to interpret documents and restore their structure, the methodologies we have chosen rely on several approaches, particularly the syntactic and structural approach of pattern recognition. The aim of this work is to convert paper documents into HTML documents, because such documents are more widely used on the Internet. The application domain of such systems could be general; however, we concentrate on a particular type of document with rich typography: summaries. In this context, we have built a system that exploits, on the one hand, information about the content of the document, such as its physical and logical structures, and, on the other hand, two-level grammars. A two-level grammar is composed of two grammars: a meta-grammar and a hyper-grammar. In our system, the role of the meta-grammar is to describe the physical and logical structures of the document. The hyper-grammar consists of a set of calculus rules and describes the processing needed to convert the document into HTML. The summary analysis is done in two steps: analysis and identification of the document, and then translation into HTML. During the first step, the system constructs a learning base by using grammatical inference. This base contains several patterns of synopses to identify. An unknown document submitted to the system is identified by matching it against the patterns of the base, using all the attributes obtained in the analysis step. The construction of the HTML document's layout is based on the grammatical analysis of the hyper-grammar. The latter is obtained by translating the logical labels and some typographic parameters into HTML commands. The grammatical analysis of the hyper-grammar produces the structured HTML document corresponding to the studied document, which can then be viewed in a web browser.

Patent
26 Mar 1998
TL;DR: In this paper, a text document inputted through an inputting part 1 is shown on a screen of a displaying part 2 and it is transferred to a text analyzing part 3 by performing a prescribed instruction operation on the screen.
Abstract: PROBLEM TO BE SOLVED: To prepare a text document that is more suitable for machine processing by detecting language errors in the text document in the process of associating character strings that are extracted from the text document and are grammatically related. SOLUTION: A text document inputted through an inputting part 1 is shown on a screen of a displaying part 2 and is transferred to a text analyzing part 3 when a prescribed instruction operation is performed on the screen. The part 3 performs morphological analysis, syntax analysis, etc., of the text document while referring to a text analytical dictionary 7 stored in a prescribed memory, and the analytical result is transferred to a correspondence relation extracting part 4. The part 4 extracts character strings that are related to each other based on the analytical result and produces additional information that associates a character string of a reference destination with a character string of a reference source based on a correspondence relation selected by a user. An editing part 5 reedits the text document based on the produced additional information and stores it in a storing part 6. COPYRIGHT: (C)1999,JPO

Journal ArticleDOI
Hirobumi Nishida
TL;DR: This work addresses the following problems: (1) efficient multi-stage systems for character recognition and (2) feature extraction from gray-scale document images of low resolution, including illustrative examples for the two problems.

Patent
19 Mar 1998
TL;DR: In this article, a scanner part 110 reads the whole image of the document video and stores the read document video in a 1st memory part 120, a text area and a non-text area are analyzed from the read image, and a document angle recognition part 130 detects a certain text part in the text area or character area.
Abstract: PROBLEM TO BE SOLVED: To automatically correct a video document in a wrong direction by detecting a certain part of a character area in document video, determining the inclination of the document according to the recognition reliability of the certain part, and rotating the document by the determined inclination and performing character recognition. SOLUTION: A scanner part 110 reads the whole image of the document video and stores the read document video in a 1st memory part 120. The whole image of the document video is read out of the 1st memory part 120 by a document structure analysis part 124, a text area and a non-text area are analyzed from the read image, and a document angle recognition part 130 detects a certain text part in the text area or character area. According to the recognition reliability of the certain part, the inclination of the document is determined, the document is rotated by the determined inclination, and characters are recognized and stored in a 2nd memory part 140. Consequently, even a visually handicapped person who is unable to recognize the document video can correctly recognize the characters.
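The orientation step in this abstract (choose the rotation whose sample region is recognized most reliably) can be sketched as follows. This is a simplified illustration, not the patented method: it assumes only the four right-angle rotations are tried, the image is a plain nested list of pixels, and `recognize` is a hypothetical caller-supplied function returning a reliability score for a text sample.

```python
def rotate90(grid):
    """Rotate a 2D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def best_orientation(sample, recognize):
    """Return (angle, reliability) for the clockwise rotation that reads best.

    sample: 2D list of pixels cut from a text area.
    recognize: hypothetical callable mapping a pixel grid to a recognition
    reliability score (higher means more reliable); stands in for a real OCR engine.
    """
    best_angle, best_score = 0, float("-inf")
    current = sample
    for angle in (0, 90, 180, 270):
        score = recognize(current)
        if score > best_score:
            best_angle, best_score = angle, score
        current = rotate90(current)
    return best_angle, best_score

if __name__ == "__main__":
    upright = [[1, 2], [3, 4]]
    # Stub recognizer: fully reliable only when the sample is upright.
    stub = lambda grid: 1.0 if grid == upright else 0.1
    # Sample that needs one clockwise quarter turn to become upright.
    tilted = rotate90(rotate90(rotate90(upright)))
    print(best_orientation(tilted, stub))
```

Once the angle is chosen from the small sample, the whole page would be rotated by that angle before full character recognition, which avoids running recognition on the entire page four times.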