Showing papers on "Document processing" published in 1998


Journal Article
TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented, and extension of the work to Devnagari, the third most popular script in the world, is discussed.

381 citations


Journal Article
TL;DR: A survey of methods developed by researchers to access and manipulate document images without the need for complete and accurate conversion is provided.

319 citations


Journal Article
01 Aug 1998
TL;DR: An automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy is described.
Abstract: We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of “noise” words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities.

292 citations
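
The multilevel classifier described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` class, the `select_features`, `train` and `classify` helpers, the variance-style score used to rank words, and the add-one-smoothed word model are not taken from the paper, which does not spell out its statistic in this abstract. The sketch only shows the general idea of keeping a small set of discriminating words at each taxonomy node and ignoring all other words when routing a document downward.

```python
import math
from collections import Counter


class Node:
    """A taxonomy node; leaves have no children (hypothetical structure)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.features = set()   # discriminating words kept at this node
        self.word_counts = {}   # child name -> Counter over feature words


def leaf_names(node):
    """Return the names of all leaves below (or equal to) this node."""
    if not node.children:
        return {node.name}
    return set().union(*(leaf_names(c) for c in node.children))


def select_features(docs_by_child, k):
    """Keep the k words whose relative frequency varies most across children.

    This spread score is an illustrative stand-in, not the paper's statistic.
    """
    rates = {}
    for child, docs in docs_by_child.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values()) or 1
        rates[child] = {w: c / total for w, c in counts.items()}
    vocab = {w for r in rates.values() for w in r}

    def spread(word):
        vals = [r.get(word, 0.0) for r in rates.values()]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(vocab, key=lambda w: (-spread(w), w))[:k])


def train(node, labelled, k=4):
    """labelled is a list of (leaf_name, tokens) pairs assigned by hand."""
    if not node.children:
        return
    docs_by_child = {
        child.name: [toks for leaf, toks in labelled if leaf in leaf_names(child)]
        for child in node.children
    }
    node.features = select_features(docs_by_child, k)
    for child in node.children:
        node.word_counts[child.name] = Counter(
            w for d in docs_by_child[child.name] for w in d if w in node.features
        )
        train(child, labelled, k)


def classify(node, tokens):
    """Walk from the root to a leaf, scoring only each node's feature words."""
    while node.children:
        kept = [w for w in tokens if w in node.features]  # noise words dropped

        def score(child):
            counts = node.word_counts[child.name]
            total = sum(counts.values()) + len(node.features)
            return sum(math.log((counts[w] + 1) / total) for w in kept)

        node = max(node.children, key=score)
    return node.name
```

Because each node stores counts only for its own feature words, the per-node model stays small, which is the property the abstract credits for the classifier's speed. Class priors are deliberately left out of this sketch to keep it short.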


Patent
13 Aug 1998
TL;DR: A document processing system includes a database generation unit for generating predetermined bases from documents and a user-defined search template generator for generating a user-defined search template, which is used to extract a predetermined set of information from substantially similar documents.
Abstract: The document processing system includes a database generation unit for generating predetermined bases from documents and a user-defined search template generation unit for generating a user-defined search template which is used to extract a predetermined set of information from substantially similar documents. The user-defined search templates are efficiently generated without any intervention from technical support personnel.

101 citations


Patent
Hideo Honma
08 Dec 1998
TL;DR: In this article, a document is divided into blocks and each block is sensed by a CCD, and perspective correction is performed on the image data of each of a plurality of images obtained by divisionally sensing the document.
Abstract: A digital camera which performs accurate document reading, and is used in a document processing system. In the document processing system, a document is divided into blocks and each block is sensed by a CCD, and perspective correction is performed on the image data of each of a plurality of images obtained by divisionally sensing the document. An OCR process is performed on the corrected image data to convert the corrected image data to text data. The converted text data, corresponding to the image data of each of the plurality of images, is combined into one text data, and the combined text data is output for printing.

85 citations


Journal Article
TL;DR: A classification method is presented that automatically identifies whether texts segmented from a document image are machine-printed or handwritten, to facilitate the later optical character recognition task.

78 citations


Patent
27 Feb 1998
TL;DR: In this article, the problem of generating a structured document by setting in appropriate places document logic elements other than sentence such as graphs, and tables, contained in a printing document consisting of a plurality of pages is addressed.
Abstract: PROBLEM TO BE SOLVED: To generate a structured document such as an XML (extensible markup language) document and an HTML (hypertext markup language) document, by setting in appropriate places document logic elements other than sentence such as graphs, and tables, contained in a printing document consisting of a plurality of pages. SOLUTION: The device extracts a paragraph area and a graph area by analyzing document graphs in layout corresponding to a printing document with a layout analyzing part 11 while segmenting characters in the paragraph area to recognize and process with a character recognizing part 12. It extracts a document logic element area from the paragraph area by providing a character recognizing result and a layout analyzing result to a document logic element extracting part 13, and carries out order setting respectively to a document logic element area and a graph/table area with a reading order setting part 14. Then, it extracts a document structure by grouping respectively the document logic element area and the graph/table area with a document structure analyzing part 16, and generates the structure document by changing the appearance position of an area corresponding to the document logic elements other than sentence in the document structure and providing to a document output part 17. COPYRIGHT: (C)2004,JPO

66 citations


01 Jan 1998
TL;DR: The SummarizerTool, a Java-implemented prototype that generates human-readable summaries of text documents such as news, technical reports, government documents, and court records, is described along with its applications in various document processing tasks.
Abstract: We present an automated method of generating human-readable summaries from text documents such as news, technical reports, government documents, and even court records. Our approach exploits an empirical observation that much of the written text displays certain regularities of organization and style, which we call the Discourse Macro Structure (DMS). A summary is therefore created to reflect the components of a given DMS. In order to produce a coherent and readable summary we select continuous, well-formed passages from the source document and assemble them into a mini-document within a DMS template. In this paper we describe the SummarizerTool, a Java-implemented prototype, and its applications in various document processing tasks.

64 citations


Journal Article
TL;DR: Two techniques for speeding up character recognition are presented, including the candidate-cluster selection and modified branch-and-bound detail-matching modules, which are integrated in the Windows-based document reading system, which provides a user-friendly environment.

48 citations


Patent
27 Feb 1998
TL;DR: A system and method for selectable encryption of documents is described, employing a document generator and a user-controlled document encryptor operative to encrypt user-selected portions of a document generated on the document generator.
Abstract: A system and method for selectable encryption of documents employing a document generator and a user-controlled document encryptor operative to encrypt user-selected portions of a document generated on the document generator.

45 citations


Patent
01 Jul 1998
TL;DR: In this article, a method of processing documents in an image-based document processing system was proposed to associate recognition results from a primary source results list with corresponding results from the secondary source results lists.
Abstract: A method of processing documents in an image-based document processing system to associate recognition results from a primary source results list with corresponding recognition results from a secondary source results list to improve assistance to an operator of the image-based document processing system during operation of the image-based document processing system comprises the steps of (a) scanning a first type of document to obtain scanned data representative thereof, (b) scanning a second type of document to obtain scanned data representative thereof, (c) processing scanned data representative of the first type of document to provide recognition results associated with the first type of document, (d) processing scanned data representative of the second type of document to provide recognition results associated with the second document, (e) storing recognition results associated with the first type of document in a primary list, (f) storing recognition results associated with the second type of document in a secondary list, (g) comparing recognition results from the primary list with recognition results from the secondary list to determine if an exact match occurs and thereby to associate a first set of recognition results from the primary list and a first set of recognition results from the secondary list, and (h) comparing recognition results from the primary list with recognition results from the secondary list to determine if an approximate match occurs when an exact match fails to occur in step (g) and thereby to associate a second set of recognition results from the primary list and a second set of recognition results from the secondary list.
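
Steps (g) and (h) of the abstract above, an exact-match pass followed by an approximate-match pass over the leftovers, can be sketched as follows. This is a hedged illustration only: the `associate` function, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are assumptions introduced here, since the abstract does not say how an approximate match is determined.

```python
from difflib import SequenceMatcher


def associate(primary, secondary, threshold=0.8):
    """Pair recognition results from two lists: exact first, then approximate.

    The similarity measure and threshold are illustrative assumptions.
    """
    remaining = list(secondary)   # secondary results not yet paired
    exact, approximate, leftovers = [], [], []

    # Step (g): associate results that match exactly.
    for p in primary:
        if p in remaining:
            remaining.remove(p)
            exact.append((p, p))
        else:
            leftovers.append(p)

    # Step (h): for results with no exact match, try an approximate match.
    for p in leftovers:
        best, best_score = None, threshold
        for s in remaining:
            score = SequenceMatcher(None, p, s).ratio()
            if score >= best_score:
                best, best_score = s, score
        if best is not None:
            remaining.remove(best)
            approximate.append((p, best))

    return exact, approximate
```

Running the exact pass to completion before the approximate pass keeps a clean exact pair from being consumed by a fuzzy match to a different item.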

Patent
05 Oct 1998
TL;DR: A transmission document edition device edits a transmission document to be transmitted to a variety of mobile communication terminals from a document described in a markup language; a document content storage unit stores the document, which includes a plurality of document elements.
Abstract: A transmission document edition device edits a transmission document to be transmitted to a variety of mobile communication terminals from a document described in a markup language. A document content storage unit stores a document including a plurality of document elements to be transmitted. A device input/output information storage unit stores a plurality of pieces of device input/output information that indicate the document elements to be transmitted for a plurality of types of mobile communication terminal. A transmission document creation unit creates a transmission document including the document and the plurality of pieces of device input/output information.

Proceedings Article
16 Sep 1998
TL;DR: A system for reliably establishing correspondences between printed words and their electronic counterparts, without performing optical character recognition, which might have interesting applications in document database retrieval, since it allows an electronic document to be indexed by a printed version of itself.
Abstract: A common authoring technique involves making annotations on a printed draft and then typing the corrections into a computer at a later date. In this paper, we describe a system that goes some way towards automating this process. The author simply passes the annotated documents through a sheetfeed scanner and then brings up the electronic document in a text editor. The system then works out where the annotated words are and allows the author to skip from one annotation to the next at the touch of a key. At the heart of the system lies a procedure for reliably establishing correspondences between printed words and their electronic counterparts, without performing optical character recognition. This procedure might have interesting applications in document database retrieval, since it allows an electronic document to be indexed by a printed version of itself.

Patent
27 Mar 1998
TL;DR: In this article, a form document processing apparatus for entry of data described in form documents is described, which includes a main processing section for performing predetermined processing operations for entered form data, and an auxiliary processing section which, upon receiving a request from the main processing, performs, in an auxiliary manner, a specific processing operation determined by the contents of form data.
Abstract: There is disclosed a form document processing apparatus for entry of data described in form documents. The form document processing apparatus includes a main processing section for performing predetermined processing operations for entered form data, and an auxiliary processing section which, upon receipt of a request from the main processing section, performs, in an auxiliary manner, among the processing operations to be performed by the main processing section, a specific processing operation determined by the contents of form data. Therefore, specific processing operations determined by the contents of form data can be performed without a need to develop a program for each type of task.

Proceedings Article
16 Aug 1998
TL;DR: A statistical study reveals that the detection of italic, bold and all-capital words may play a key role in automatic information retrieval from documents and can be used to improve the recognition accuracy of a text recognition system.
Abstract: We propose simple and fast algorithms for detection of italic, bold and all-capital words without doing actual character recognition. We present a statistical study which reveals that the detection of such words may play a key role in automatic information retrieval from documents. Moreover, detection of italic words can be used to improve the recognition accuracy of a text recognition system. A considerable number of document images have been tested, and our algorithms give accurate results on all the tested images; the algorithms are also very easy to implement.

Patent
06 Jul 1998
TL;DR: A computer-implemented method and system is presented for processing a document such as a structured document, in which information such as a term, name and belonging department is used as shared information and word consistency or modification can be automatically and easily reflected on all documents.
Abstract: A computer-implemented method and system for processing a document such as a structured document in which information such as a term, name and belonging department is used as shared information and word consistency or modification can be automatically and easily reflected on all documents. In the document processing method, a shared information editing program edits shared information frequently described in a plurality of documents, a shared information storage program stores the edited shared information in a secondary memory, a shared information list-up program lists up the shared information for each information type, a structured document editing program edits a structured document to describe a link to the shared information selected from the edited shared information listed up, a structured document storage program stores the structured document in the secondary memory, and a structured document output program reads out the shared information and structured document from the secondary memory and embeds the contents of the shared information in the structured document for its display or printout.
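
The output step in the abstract above, where links in a structured document are replaced by the current contents of the shared information, can be sketched in a few lines. The `[[shared:key]]` link syntax, the `LINK` pattern and the `embed_shared` function are hypothetical names chosen for this illustration; the patent abstract does not specify how links are written.

```python
import re

# Hypothetical link syntax: [[shared:key]] marks a link to shared information.
LINK = re.compile(r"\[\[shared:(\w+)\]\]")


def embed_shared(structured_document, shared_information):
    """Return the document text with each link replaced by its shared content."""

    def resolve(match):
        key = match.group(1)
        # Unknown links are left visible rather than silently dropped.
        return shared_information.get(key, match.group(0))

    return LINK.sub(resolve, structured_document)
```

Since documents store only the link, editing one entry in the shared store changes every document at the next display or printout, which is how a modification could be reflected on all documents without touching them individually.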

Journal Article
TL;DR: A novel encoding scheme is provided that facilitates scalable lossy compression and progressive transmission and supports document image analysis in the compressed domain and a class of document image understanding tasks that operate on the compressed representation.

Patent
Donald T. Tang, Li Qin Shen, Xiao Jin Zhu
28 Aug 1998
TL;DR: In this article, a Chinese speech recognition method and system for single or un-correlated Chinese character(s) is presented. The method uses various types of Character Description Language (CDL) to describe the single or un-correlated Chinese character(s) to be inputted, and the system uses a CDL grammar directed speech recognizer to accept CDLs which are inputted by voice.
Abstract: A Chinese speech recognition (SR) method and system for single or un-correlated Chinese character(s). The method uses various types of Character Description Language (CDL) to describe the single or un-correlated Chinese character(s) to be inputted. The SR system uses CDL grammar directed speech recognizer to accept CDLs, which are inputted by voice. On the basis of analysis of CDL parser, the character generator gives a corresponding character. Therefore, recognition of single or un-correlated Chinese character(s) out of context can be made reliably.

Patent
30 Jan 1998
TL;DR: In this article, a computerized document processing apparatus for creating an abstract includes document storage for storing a computerized document, keyword storage for storing keywords, and an abstract creation section for creating an abstract by extracting at least a character string containing a keyword stored in the keyword storage section from the computerized document stored in the document storage section.
Abstract: A computerized document processing apparatus for creating an abstract includes document storage for storing a computerized document, keyword storage for storing keywords, an abstract creation section for creating an abstract by extracting at least a character string containing a keyword stored in the keyword storage section from the computerized document stored in the document storage section, a document modification section for modifying the computerized document to link the keyword in the computerized document with the same keyword in the abstract, and a display section for displaying the abstract and the modified document that is linked with the abstract. The modified document is displayed when the linked keyword in the abstract is selected.
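
The extraction and linking described in the abstract above can be sketched as follows. This is an illustrative reading, not the patented apparatus: treating a sentence as the extracted "character string", the naive sentence splitter, case-insensitive matching, and the `build_abstract` function itself are all assumptions introduced here.

```python
import re


def build_abstract(document, keywords):
    """Collect sentences that contain a stored keyword, plus keyword links.

    Returns (abstract_sentences, links), where links maps each matched
    keyword to the index of the first document sentence containing it.
    """
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    abstract, links = [], {}
    for index, sentence in enumerate(sentences):
        hits = [k for k in keywords if k.lower() in sentence.lower()]
        if hits:
            abstract.append(sentence)
            for keyword in hits:
                links.setdefault(keyword, index)  # keep the first occurrence
    return abstract, links
```

A display layer could use `links` to jump from a keyword selected in the abstract to the corresponding position in the full document, in the spirit of the linked display the abstract describes.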

Proceedings Article
11 Oct 1998
TL;DR: A recognition-based Arabic OCR system that consists of the image acquisition, preprocessing, segmentation, character fragmentation, combination of character fragments, feature extraction, and classification.
Abstract: Optical character recognition systems improve human-machine interaction and are widely used in many government and commercial departments. After forty years of intensive research, OCR systems for most scripts are well developed. However, this is not the case for Arabic script. Since Arabic is a popular script, Arabic OCR systems should have great commercial value. Thus a recognition-based Arabic OCR system is proposed in this paper. It consists of image acquisition, preprocessing, segmentation, character fragmentation, combination of character fragments, feature extraction, and classification. A signal is fed back to improve and determine the segmentation/recognition result. The system has been implemented and it has 90% recognition accuracy with a 20 chars/sec recognition rate.

Patent
Keiichi Imamura
27 Jan 1998
TL;DR: In this article, a preselected decoration is made on document data using a CPU that analyzes the document structures of the overall document in units of document structural elements, and extracts a predetermined structural element from these analyzed structural elements as a design element to be designed.
Abstract: In a document processing apparatus equipped with a computer program storage medium, a preselected decoration is made on document data. A CPU analyzes document structures of the overall document in units of document structural elements, and extracts a predetermined structural element from these analyzed structural elements as a design element to be designed. Then, the CPU retrieves a table contained in a RAM based on an attribute of this design element. This table fixedly stores specific decoration information with respect to each of the attributes of the design elements. The CPU retrieves the decoration information corresponding to the attribute of the extracted design element, and then decorates the design element based on this decoration information. As a result, a predetermined decoration can be made on the document data.

13 Mar 1998
TL;DR: The multivalent document model enables one to better use digital documents for tasks in which paper documents are still otherwise superior to digital documents, such as annotating someone else's document.
Abstract: "Multivalent documents" is a model of documents that addresses some of the shortcomings one currently encounters when manipulating documents in digital form. In the multivalent document model, a document is composed out of distributed data and program resources, called layers and behaviors, respectively. The model exposes virtually all aspects of document processing to behaviors, and provides the means to compose these components into a single coherent document. Behaviors allow the model to be highly extensible, including the capability to be extended to work with arbitrary document formats. We have implemented the model in Java, and developed behaviors that support multiple document types (scanned page images, HTML, and ASCII) and a number of different user-interface metaphors (e.g., "lenses" and "Notemarks"). The multivalent document model enables one to better use digital documents for tasks in which paper documents are still otherwise superior to digital documents, such as annotating someone else's document. We have shown how the model is naturally conducive to realizing powerful forms of distributed, open annotation by implementing a variety of annotation types, some familiar and some novel.

Journal Article
TL;DR: In this article, the authors introduce the concept of document functionality, which attempts to describe the roles of documents and their components in the process of transferring information, and demonstrate how functional descriptions can be used to reverse-engineer the intentions of the author, to navigate in document space, and to provide important contextual information to aid in interpretation.

Patent
Leigh L. Klotz, Glen W. Petrie, Robert S. Bauer, Daniel Davies, Julia A. Craig
13 Nov 1998
TL;DR: In this article, a tag-based user interface scheme for digitizing and processing hardcopy documents utilizes a sticker that includes a printed data code representative of a user identity code and a service code.
Abstract: A tag-based user interface scheme for digitizing and processing hardcopy documents utilizes a sticker that includes a printed data code representative of a user identity code and a service code. When the sticker is applied to a hardcopy document and scanned, the sticker is located, the data code is parsed, and a desired service is performed based upon the information stored in the data code.

Patent
Naoko Ito
08 May 1998
TL;DR: In this paper, a document processing apparatus is implemented in a client/server system to add and modify a document conversion function without modification for either client or server, consisting of a client requesting acquisition or storage of a document, a server performing management such as transfer and storage of the document, network connecting the client and the server, and a proxy server existing on the network and relaying interaction between client and server.
Abstract: A document processing apparatus implemented in a client/server system to add and modify a document conversion function without modification for either client or server. The apparatus comprises a client requesting acquisition or storage of a document, a server performing management such as transfer and storage of the document, a network connecting the client and the server, and a proxy server existing on the network and relaying interaction between the client and the server. The proxy server has a document data conversion section for performing conversion of the document based on the structure of document.

Journal Article
TL;DR: This paper demonstrates how recent progress in the area of multiple-expert classification can be exploited to provide new approaches to the processing of printed data.

Patent
27 Oct 1998
TL;DR: In this article, a secret area is designated in the document information via an area designation part 12, and the partial document information in the designated area is enciphered at an encipherment part 13 to obtain the enciphered information.
Abstract: PROBLEM TO BE SOLVED: To make quickly and properly processable secret document information, to make clear the secret part of the secret document information and to make properly manageable the secret of the secret document information by its providing side. SOLUTION: A secret area is designated in the document information via an area designation part 12, and a partial document information on the designated area is enciphered at an encipherment part 13 to obtain the enciphered information. At a management information generation part 14, the address information on the designated area and the key information on the encipherment are stored in a management table and the management information is generated. Then, the part 14 replaces an object area included in the text of the document information, i.e., the designated area with the enciphered information via the part 13 and transmits or stores the enciphered information together with the management information.

01 Jan 1998
TL;DR: MANICURE is a document processing system that provides integrated facilities for creating electronic forms of printed materials, including modules for automatic detection and correction of OCR errors and automatic markup of logical components of the text.
Abstract: MANICURE is a document processing system that provides integrated facilities for creating electronic forms of printed materials. In this paper the functionalities supported by MANICURE and their implementations are described. In particular, we provide information on specific modules dealing with automatic detection and correction of OCR errors and automatic markup of logical components of the text. We further show that the various text formats produced by MANICURE can be used by web browsers and/or be manipulated by search routines to highlight the requested information on document images.

Patent
22 Sep 1998
TL;DR: In this article, a transmission document editing device edits a general document described in language with a mark into transmission document to be transmitted to various mobile communication terminals, and a simulation operation executing part 213 obtains the equipment input and output information of the designated terminal from the transmission document, selects the document elements suited to the selection condition, and generates display data.
Abstract: PROBLEM TO BE SOLVED: To synthetically edit transmission documents to be transmitted to each kind of mobile communication terminal. SOLUTION: A transmission document editing device edits a general document described in language with a mark into a transmission document to be transmitted to various mobile communication terminals. A document content temporary storing part 201 stores a document constituted of plural document elements to be transmitted. An equipment input and output information storing part 202 stores equipment input and output information including the selection condition of the document elements for each kind of each terminal. A transmission document generating part 208 generates the transmission document in which the plural equipment input and output information is added to the general document. A simulation operation executing part 213 obtains the equipment input and output information of the designated terminal from the transmission document, selects the document elements suited to the selection condition, and generates display data. COPYRIGHT: (C)1999,JPO

Book Chapter
TL;DR: It is shown how the matrices produced by the SVD calculation can be interpreted, allowing us to spot patterns of characters that indicate particular topics in a corpus.
Abstract: The singular value decomposition, or SVD, has been studied in the past as a tool for detecting and understanding patterns in a collection of documents. We show how the matrices produced by the SVD calculation can be interpreted, allowing us to spot patterns of characters that indicate particular topics in a corpus. A test collection, consisting of two days of AP newswire traffic, is used as a running example.