
Showing papers on "Document processing" published in 1995


Patent
Gregory J. Wolff, David G. Stork
01 Nov 1995
TL;DR: A pen-like instrument with a writing point for making written entries upon a physical document and sensing the three-dimensional forces exerted on the writing tip as well as the motion associated with the act of writing is described in this article.
Abstract: A manual entry interactive paper and electronic document handling and process system uses a pen-like instrument (PI) with a writing point for making written entries upon a physical document and sensing the three-dimensional forces exerted on the writing tip as well as the motion associated with the act of writing. The PI is also equipped with a CCD array for reading pre-printed bar codes used for identifying document pages and other application defined areas on the page, as well as for providing optical character recognition data. A communication link between the PI and an associated base unit transfers the transducer data from the PI. The base unit includes a programmable processor, a display, and a communication link receiver. The processor includes programs for written character and word recognition, memory for storage of an electronic version of the physical document and any hand-written additions to the document. The display unit displays the corresponding electronic version of the physical document on a CRT or LCD as a means of feedback to the user and for use by authorized electronic agents.

1,024 citations


Journal Article
10 Feb 1995-Science
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Abstract: A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.

630 citations
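
The n-gram vector-space idea in the abstract above can be illustrated with a minimal sketch. It assumes plain character trigram counts and cosine similarity; the function names, the choice of n = 3, and the whitespace normalization are illustrative assumptions, and the context-handling procedure the abstract mentions is not reproduced here.

```python
from collections import Counter
from math import sqrt
from typing import List, Tuple


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(exemplar: str, documents: List[str], n: int = 3) -> List[Tuple[float, str]]:
    """Sort documents from most to least similar to the exemplar document."""
    target = ngram_profile(exemplar, n)
    scored = [(cosine_similarity(target, ngram_profile(doc, n)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the profile is built from raw characters rather than words, the same code runs unchanged on text in any language, which is the property the abstract emphasizes.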


Book
01 Jan 1995
TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other layout methods.

628 citations
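
The nearest-neighbor step behind the skew measurement in the abstract above can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: it assumes page components are already reduced to centroid coordinates, uses a brute-force neighbor search, and takes the fullest bin of a coarse angle histogram as the skew. The names `nearest_neighbor_pairs` and `estimate_skew` and the parameters `k` and `bin_width` are hypothetical.

```python
from collections import Counter
from math import atan2, degrees, hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest_neighbor_pairs(centroids: Sequence[Point], k: int = 2) -> List[Tuple[float, float]]:
    """For every centroid, return (distance, angle in degrees) to its k nearest neighbors.

    Angles are folded into [-90, 90) so that a left neighbor and a right
    neighbor on the same text line report the same direction.
    """
    pairs = []
    for i, (x1, y1) in enumerate(centroids):
        others = sorted(
            (hypot(x2 - x1, y2 - y1), x2 - x1, y2 - y1)
            for j, (x2, y2) in enumerate(centroids)
            if j != i
        )
        for dist, dx, dy in others[:k]:
            angle = degrees(atan2(dy, dx))
            if angle >= 90:
                angle -= 180
            elif angle < -90:
                angle += 180
            pairs.append((dist, angle))
    return pairs


def estimate_skew(centroids: Sequence[Point], k: int = 2, bin_width: float = 1.0) -> float:
    """Return the center of the most populated angle bin as the skew estimate."""
    pairs = nearest_neighbor_pairs(centroids, k)
    bins = Counter(round(angle / bin_width) for _, angle in pairs)
    best_bin, _ = bins.most_common(1)[0]
    return best_bin * bin_width
```

The distances collected alongside the angles could be histogrammed in the same way to look for within-line and between-line spacing peaks. The sign of the reported angle depends on whether the y axis points up or down in the coordinate system used.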


Patent
23 Jan 1995
TL;DR: In this paper, the authors describe a document processing system in which a server subsystem stores information corresponding to a document containing human readable and machine readable information and a client subsystem receives the document and interprets the machine-readable information.
Abstract: A document processing system in which a server subsystem stores information corresponding to a document containing human readable and machine readable information and a client subsystem receives the document and interprets the machine readable information. The client subsystem contacts the server to verify validity of information in the document using a communications network that allows information to be exchanged between the server and the client.

285 citations


Journal Article
Yi Lu
TL;DR: An overview of character segmentation techniques for machine-printed documents is presented, covering techniques for segmenting uniform or proportional fonts and broken and touching characters, techniques based on text image features, and techniques based on recognition results.

206 citations


Book
01 Jan 1995
TL;DR: This paper presents a simple method for document image segmentation in which text regions in a given document image are automatically identified and is shown to work even for skewed images and handwritten text.
Abstract: There is a considerable interest in designing automatic systems that will scan a given paper document and store it on electronic media for easier storage, manipulation, and access. Most documents contain graphics and images in addition to text. Thus, the document image has to be segmented to identify the text regions, so that OCR techniques may be applied only to those regions. In this paper, we present a simple method for document image segmentation in which text regions in a given document image are automatically identified. The proposed segmentation method for document images is based on a multichannel filtering approach to texture segmentation. The text in the document is considered as a textured region. Nontext contents in the document, such as blank spaces, graphics, and pictures, are considered as regions with different textures. Thus, the problem of segmenting document images into text and nontext regions can be posed as a texture segmentation problem. Two-dimensional Gabor filters are used to extract texture features for each of these regions. These filters have been extensively used earlier for a variety of texture segmentation tasks. Here we apply the same filters to the document image segmentation problem. Our segmentation method does not assume any a priori knowledge about the content or font styles of the document, and is shown to work even for skewed images and handwritten text. Results of the proposed segmentation method are presented for several test images which demonstrate the robustness of this technique.

197 citations
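
To make the Gabor-filter texture features in the abstract above more concrete, here is a minimal pure-Python sketch of one even-symmetric Gabor kernel and a simple energy feature. The kernel size, wavelength, and sigma values are illustrative assumptions, and the paper's own filter bank and its later processing steps are not reproduced.

```python
from math import cos, exp, pi, sin
from typing import List

Grid = List[List[float]]


def gabor_kernel(size: int, wavelength: float, theta: float, sigma: float) -> Grid:
    """Real (even-symmetric) Gabor kernel as a size x size grid; size should be odd."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * cos(theta) + y * sin(theta)
            yr = -x * sin(theta) + y * cos(theta)
            envelope = exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * cos(2.0 * pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_energy(image: Grid, kernel: Grid) -> float:
    """Mean squared filter response over all positions where the kernel fits entirely."""
    size = len(kernel)
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for top in range(rows - size + 1):
        for left in range(cols - size + 1):
            response = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(size)
                for j in range(size)
            )
            total += response * response
            count += 1
    return total / count if count else 0.0
```

A multichannel scheme would evaluate several such kernels at different orientations and wavelengths and use the resulting energies as a per-region feature vector; a region with stripes of matching period and orientation produces a much larger energy than one without.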


Patent
26 May 1995
TL;DR: In this article, a user extensible document processing system includes a document processing platform such as a digital copier, control forms for specifying requested services and instructions and one or more document service cards.
Abstract: A user extensible document processing system. The extensible document processing system includes a document processing platform such as a digital copier, control forms for specifying requested services and instructions and one or more document service cards. User provided document processing services are contained on document service cards. A set of basic document processing services are provided by the document processing platform. The document processing platform includes one or more ports for coupling to document service cards, a registration device for registering services into a service taxonomy, a deregistration device for deregistering services from the service taxonomy, a service dispatcher for identifying the service to process a control form using the service taxonomy, and a scanner for creating a digital representation of a paper based document. The document processing platform registers document services upon detecting the coupling of a document services card.

130 citations


Book
01 Jan 1995
TL;DR: In this article, the authors describe a text reading system consisting of three major components: document analysis, document understanding, and character segmentation/recognition, which is used for multicolumned and multi-article documents.
Abstract: The document image processes used in a recently developed text reading system are described. The system consists of three major components: document analysis, document understanding, and character segmentation/recognition. The document analysis component extracts lines of text from a page for recognition. The document understanding component extracts logical relationships between the document constituents. The character segmentation/recognition component extracts characters from a text line and recognizes them. Experiments on more than a hundred documents have proved that the proposed approaches to document analysis and document understanding are robust even for multicolumned and multiarticle documents containing graphics and photographs, and that the proposed character segmentation/recognition method is robust enough to cope with omnifont characters which frequently touch each other.

128 citations


Patent
28 Nov 1995
TL;DR: In this article, a document server is provided for processing a distribution job in a document processing system, which includes a document manager, communicating with first and second virtual services, for coordinating the storing or processing of first or second copies of an image data set at the first or the second virtual service, respectively.
Abstract: A document server is provided for processing a distribution job in a document processing system. The document processing system includes a document manager, communicating with first and second virtual services, for coordinating the storing or processing of first and second copies of an image data set at the first and second virtual services, respectively. The document processing system further includes a distribution agent, communicating with the document manager, for receiving a first job ticket, including attributes for controlling the storing or processing of the first copy of the image data set at the first virtual service and a second job ticket for controlling the storing or processing of the second copy of the image data set at the second virtual service. The document manager and the distribution agent function cooperatively to provide the document server with the capability to halt the processing of the first and second copies of the image data set in order for altering attributes of the first and second job tickets, and to determine the status of the processing of the first and second copies of the image data set. Moreover, the distribution agent can provide values for attributes of the first and second job tickets when a user does not program those values in the first and second job tickets.

112 citations


Book
01 Jan 1995
TL;DR: In this article, a conceptual framework for solving the task of document analysis, which consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document content and layout, is presented.
Abstract: The authors present a conceptual framework for solving the task of document analysis, which, in essence, consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document's content and layout. Starting on the pixel level, the formation of elementary geometric objects on which layout analysis as well as the definition of character objects is based is described. Character recognition accomplishes the mapping from geometric object to character meaning in ASCII representation. On the next level of abstraction words are formed and verified by contextual processing. Modeled knowledge about complete documents and about how their constituents are related to the application forms the highest level of abstraction. The various problems arising at each stage are discussed. The dependencies between the different levels are exemplified and technical solutions put forward.

89 citations


Patent
19 Jun 1995
TL;DR: In this article, an image-based dual path, document processing system including an imaging unit, a character recognition unit, dual path module, and an encoder is presented, where a document received by the system is sequentially processed through the imaging unit.
Abstract: An image based, dual path, document processing system including an imaging unit, a character recognition unit, a dual path module and an encoder. A document received by the system is sequentially processed through the imaging unit, the character recognition unit, the dual path module and the encoder. The imaging unit images the front face of the document and attempts to identify character data appearing on the face of said document, such as a hand-written courtesy amount appearing on a bank check, while the character recognition unit is utilized to read machine-readable data, such as MICR data, printed on the face of the document. The dual path module includes a first document path for directly delivering the document through the dual path module to the encoder upon successful processing of the document by the imaging and character recognition units, and a second document path including an action window wherein an operator may perform corrective action on the document upon unsuccessful processing of the document by either the imaging unit or the character recognition unit.

Patent
Makoto Murata
03 Apr 1995
TL;DR: In this article, a content layout unit calls a document structure layout unit to embed in the content portion the document structure laid out by the called unit, whereby layout processing of embedding a medium into another already embedded medium can be realized (e.g., a mathematical formula into a text in a graphic frame).
Abstract: A document processing system in which, when it is desired to lay out a document structure embedded in a content portion to be laid out, a content layout unit calls a document structure layout unit to embed in the content portion the document structure laid out by the called document structure layout unit, whereby layout processing of embedding a medium into another already embedded medium can be realized (for example, embedding a mathematical formula into a text in a graphic frame). Further, such data (layout of a boxed item) that is difficult to express in terms of content portion can be expressed in terms of document structures appearing in the middle of the processing.

Patent
30 Aug 1995
TL;DR: In this article, a document processing system stores document forms which can be chosen by the user based on the specification of document attributes such as the purpose and distributees of document or on the finished styles of document, and displays the selected document form.
Abstract: The document processing system stores document forms which can be chosen by the user based on the specification of document attributes such as the purpose and distributees of document or on the finished styles of document, and displays the selected document form. The system enables the user to determine a proper document form easily and swiftly without the need of instructing a detailed document design and write the intended document in the displayed form.

Patent
20 Jun 1995
TL;DR: In this article, a system consisting of a plurality of terminals connected through communication is sometimes used to acquire information input from the terminals to a host terminal and edit/arrange the information in the host terminal when information is exchanged by inputting handwritten characters from the users.
Abstract: A system consisting of a plurality of terminals connected through communication is sometimes used to acquire information input from the terminals to a host terminal and edit/arrange the information in the host terminal when information is exchanged by inputting handwritten characters from the terminals. In this case, the host terminal must recognize the handwritten characters and thereafter process the characters to arrange the handwritten character information. Each terminal can perform character recognition by a unique character recognizing method. The terminals also individually prepare dictionaries for recognition. The host terminal returns the acquired handwritten character information to the terminals and causes the terminals to perform character recognition. Alternatively, the host terminal concentratively manages the character recognizing methods and dictionaries of all the terminals and performs character recognition.


Journal Article
01 May 1995
TL;DR: In this paper, a financial document recognition prototype system which can process bank cheques, payment slips and bills is presented, and numerous experimental results are presented and discussed.
Abstract: Millions of financial transactions take place every day. Associated with them are documents such as bank cheques, payment slips and bills which have to be processed. A great deal of time, effort and money will be saved if they can be entered into the computer and processed automatically. According to the specific characteristics of financial documents, it can be concluded that it is possible to build a system for recognizing specific types of financial documents, instead of a complex and general one aiming at different kinds of documents. In this paper, a financial document recognition prototype system which can process bank cheques, payment slips and bills, is presented. It consists of four major parts: (a) document image acquisition including scanning and binarization, (b) fixed document processing subsystem based on the detection of staff lines, (c) flexible document processing subsystem operating in a form description language (FDL), and (d) character recognition. Numerous experimental results are presented and discussed.

Proceedings Article
30 Mar 1995
TL;DR: This paper describes a system developed for the detection of isolated words, word portions, as well as multi-word phrases in images of documents and provides for automated training of desired keywords and creation of indexing filters to speed matching.
Abstract: With the advent of on-line access to very large collections of document images, electronic classification into areas of interest has become possible. A first approach to classification might be the use of OCR on each document followed by analysis of the resulting ASCII text. But if the quality of a document is poor, the format unconstrained, or time is critical, complete OCR of each image is not appropriate. An alternative approach is the use of word shape recognition (as opposed to individual character recognition) and the subsequent classification of documents by the presence or absence of selected keywords. Use of word shape recognition not only provides a more robust collection of features but also eliminates the need for character segmentation (a leading cause of error in OCR). In this paper we describe a system we have developed for the detection of isolated words, word portions, as well as multi-word phrases in images of documents. It is designed to be used with large, changeable, keyword sets and very large document sets. The system provides for automated training of desired keywords and creation of indexing filters to speed matching. © (1995) COPYRIGHT SPIE--The International Society for Optical Engineering. Downloading of the abstract is permitted for personal use only.

Patent
09 Feb 1995
TL;DR: In this paper, an input handwritten character pattern is subjected to character recognition processing, and a recognition reliability of the character as a standard characteristic feature pattern is determined from the recognition result.
Abstract: In the present invention, an input handwritten character pattern is subjected to character recognition processing, and a recognition reliability of the character as a standard characteristic feature pattern is determined from the recognition result. If the recognition reliability is low, a warning is issued. In response to the warning, a user or operator can decide whether the character pattern should be registered in the user dictionary (106). If it is decided that the character pattern should be registered in the user dictionary, the character pattern is stored in the user dictionary with the information representing that the character pattern has low recognition reliability. When character patterns registered in the user dictionary are displayed on a screen, these characters are displayed in such a manner that it is possible to distinguish characters having low recognition reliability from characters having high recognition reliability. There is also provided a user name index file (5309) for storing information regarding characteristic features of a handwritten character pattern peculiar to a specific user. Furthermore, there is also provided a password input-and-decision part (5103) for deciding whether or not to allow access to the user dictionary based on the information of the handwritten character pattern input by a specific user.

Proceedings Article
14 Aug 1995
TL;DR: A page segmentation method called block selection is presented which not only segments the page image into categorized blocks but also provides a novel tree structure to represent the page blocks for selection that can be efficiently used for further storage, retrieval or other manipulation purposes.
Abstract: This paper presents a page segmentation method called block selection which not only segments the page image into categorized blocks but also provides a novel tree structure to represent the page blocks for selection. Block selection, more than classifying the text and nontext areas only, can identify the major document elements, such as text, picture, table, frame and line. This ability fits block selection into a wider range of document processing applications. In order to make the usage of block selection more practical to various document styles, many restrictions set on the document by some existing technologies are freed. The language on the document could be English-like, Kanji-like or both. The direction of text could be horizontal, vertical, slanted, or mixed. The editing style of the document is unconstrained. No skew correction is involved regardless of the document style. The formed blocks are described by a hierarchical tree to reflect the page arrangement in the "object" sense. This structural result can be efficiently used for further storage, retrieval or other manipulation purposes. The possible applications using this proposed method are discussed.

Journal Article
08 Jan 1995
TL;DR: An approach to handprinted word recognition is described, based on generating multiple possible segmentations of a word image into characters and matching these segmentations to a lexicon of candidate strings.
Abstract: An approach to handprinted word recognition is described. The approach is based on generating multiple possible segmentations of a word image into characters and matching these segmentations to a lexicon of candidate strings. The segmentation process uses a combination of connected component analysis and distance transform-based, connected character splitting. Neural networks are used to assign character confidence values to potential characters within word images. Experimental results are provided for both character and word recognition modules on data extracted from the NIST handprinted character database.
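
One common way to match candidate segmentations against a lexicon is dynamic programming over primitive image pieces, sketched below. This is an illustrative assumption rather than the paper's exact procedure: `confidence` stands in for the neural network scores, and `max_merge` (how many adjacent pieces one character may span) is a hypothetical parameter.

```python
from typing import Callable, Iterable

# confidence(start, end, char): score that pieces start..end-1 together form `char`.
Confidence = Callable[[int, int, str], float]
NEG_INF = float("-inf")


def match_score(word: str, n_pieces: int, confidence: Confidence, max_merge: int = 3) -> float:
    """Best average character confidence when `word` is laid over all primitive pieces.

    Each character must cover between 1 and max_merge consecutive pieces,
    and every piece must be used exactly once.
    """
    # best[i][j]: best total confidence for the first i characters covering the first j pieces.
    best = [[NEG_INF] * (n_pieces + 1) for _ in range(len(word) + 1)]
    best[0][0] = 0.0
    for i, ch in enumerate(word, start=1):
        for j in range(1, n_pieces + 1):
            for width in range(1, min(max_merge, j) + 1):
                prev = best[i - 1][j - width]
                if prev > NEG_INF:
                    candidate = prev + confidence(j - width, j, ch)
                    if candidate > best[i][j]:
                        best[i][j] = candidate
    total = best[len(word)][n_pieces]
    return total / len(word) if word and total > NEG_INF else NEG_INF


def recognize(lexicon: Iterable[str], n_pieces: int, confidence: Confidence, max_merge: int = 3) -> str:
    """Return the lexicon entry whose best segmentation scores highest."""
    return max(lexicon, key=lambda w: match_score(w, n_pieces, confidence, max_merge))
```

Averaging by word length keeps long and short lexicon entries comparable; a word with more characters than there are pieces cannot be matched at all and scores negative infinity.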

Patent
Tutomu Watanabe
19 Sep 1995
TL;DR: Copies of a document having the same text part are read out from a document storage unit, and a separation unit separates the comment parts from the text part.
Abstract: Copies of a document having the same text part among documents stored in a document storage unit are read out from the document storage unit and a separation unit separates the comment parts from the text part. Resultantly, one text part and a plurality of comment parts are provided. A layout unit again lays out the comment parts thus separated for the same text part.

Proceedings Article
14 Aug 1995
TL;DR: This paper describes recent efforts to develop a document classification system that uses two steps: first, the document is sorted by the number of columns and second, functional landmarks are detected to determine the class.
Abstract: This paper describes recent efforts to develop a document classification system. Our classification approach uses two steps: first, the document is sorted by the number of columns and second, functional landmarks are detected to determine the class. Results for detecting and classifying business class documents are included.

Proceedings Article
T.A. Bayer, H. Walischewski
14 Aug 1995
TL;DR: The document model is used to extract the sender, date, recipient, opening and closing formula from a business letter, and results show that the computational effort can be limited to O(n^2) given n primitive objects for matching.
Abstract: Extracting structural information from paper documents supports the daily document processing by, for example, automatically finding index terms, document topics, etc. Knowledge about such components is modeled in a semantic net, which describes geometric properties, spatial relationships, lexical entities as well as lexical relationships. The document model is used to extract the sender, date, recipient, opening and closing formula from a business letter. 181 business letters have been processed, divided into a training set of 20 and the remaining ones for testing. The error rates for the test set range from 0.022 to 0.049 at an average rejection rate of 0.4. Results show that the computational effort can be limited to O(n^2) given n primitive objects for matching.

Proceedings Article
Yuan Yan Tang, Hong Ma, Xiaogang Mao, Dan Liu, Ching Y. Suen
14 Aug 1995
TL;DR: This paper presents a new approach to document analysis based on modified fractal signature that can divide a document into blocks in only one step and be used to process documents with high geometrical complexity.
Abstract: This paper presents a new approach to document analysis. The proposed approach is based on modified fractal signature. Instead of the time-consuming traditional approaches (top-down and bottom-up approaches) where iterative operations are necessary to break a document into blocks to extract its geometric (layout) structure, this new approach can divide a document into blocks in only one step. This approach can be used to process documents with high geometrical complexity. Experiments have been conducted to prove the proposed new approach for document processing.

Journal Article
TL;DR: In this paper, a clustering-based technique for extracting characters from form documents is presented, which treats the character extraction process as a pattern clustering problem and reveals the feasibility of the novel method.

Book Chapter
15 May 1995
TL;DR: This chapter illustrates some of the possible methods that cope with the uncertainty of the database entries and add fuzziness to precisely formulated queries in order to increase their recall.
Abstract: Though the quality of optical character recognition software is steadily improving, it is still far from being perfect. As a result, full-text databases that are filled by means of OCR software contain many errors. These errors have to be taken into consideration if such databases are examined by means of full-text searches. In this chapter, we will illustrate some of the possible methods that, to a certain extent, cope with the uncertainty of the database entries. These methods add fuzziness to precisely formulated queries in order to increase their recall. In addition, the described methods are compared to the method of matching query terms exactly: the preliminary results of tests that show their effects on recall and precision are given.
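
The abstract does not name the specific methods it compares, so the following is only one plausible illustration of adding fuzziness to a query: accepting indexed terms within a small Levenshtein edit distance of the query term. The function names and the `max_errors` threshold are assumptions made for this sketch.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(query: str, indexed_terms: Iterable[str], max_errors: int = 1) -> List[str]:
    """Return indexed terms within max_errors edits of the query, in their original order."""
    return [term for term in indexed_terms if edit_distance(query, term) <= max_errors]
```

Raising `max_errors` retrieves more OCR-damaged variants of a term (higher recall) at the cost of also admitting unrelated words (lower precision), which is the trade-off the chapter evaluates.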

Patent
11 Dec 1995
TL;DR: In this article, a document image processor reads data on the image of an original document and automatically sets a format for the document in a document processor such as a word processor or personal computer.
Abstract: A document image processor reads data on the image of an original document and automatically sets a format for the document in a document processor such as a word processor or personal computer. First, an image scanner reads a document as image data. A character recognizer recognizes and encodes a character from the read image data, and a document processor stores the encoded document data into a document memory. When the character recognizer encodes the image data, it analyzes the structure of the document. The document processor produces format information on the basis of the result of the analysis, and sets the produced format information in the document format memory in correspondence to document data in the document memory. Thus, the document image processor reads a document image from the document, encodes the image, and sets a format for a document conforming exactly to the document image in correspondence to the encoded document data.

Proceedings Article
22 Oct 1995
TL;DR: The Devnagari document processing system discussed here makes use of various knowledge sources at all levels, and various algorithms are used to further segment each character into its constituent symbols instead of treating the character as a unit.
Abstract: The Devnagari document processing system discussed here makes use of various knowledge sources at all levels. Extraction of the text zone from a document is a preprocessing stage which uses document layout knowledge represented syntactically. The text zone is then segmented into lines, lines into words and words into characters. Since a Devnagari character is a complex composition of symbols, various algorithms are used to further segment the character into its constituent symbols instead of treating the character as a unit. The symbols are then recognized using various features which are extracted and saved during the training phase. The recognized symbols are composed back and sent for validation through a partitioned dictionary.

Patent
15 May 1995
TL;DR: In this paper, a document format setting operation which was cumbersome in prior art can be readily performed by selecting a desirable format sample from the format samples which have been categorized based upon the usage thereof, and also by adding an item capable of freely setting a format into the format sample.
Abstract: In a document processing apparatus wherein a format of a document or the like is set, a plurality of possible format patterns are simultaneously displayed, and a selection of a desirable format pattern among said format patterns is made so as to perform a format setting operation. Displays of the format patterns are subdivided into a plurality of screens based on a purpose or usage of a document to be formed. After selecting or designating the purpose or usage of the document, the above-described format pattern is displayed. The document format setting operation which was cumbersome in prior art can be readily performed by selecting a desirable format sample from the format samples which have been categorized based upon the usage thereof, and also by adding an item capable of freely setting a format into the format samples.

20 Nov 1995
TL;DR: This research develops a new, medium-independent model of presentation and shows that a specification-driven presentation system based on this model can form the basis of a software environment supporting multiple presentations and a variety of media.
Abstract: The many different documents produced by a large software project are typically created and maintained by a variety of incompatible software tools, such as programming environments, document processing systems, and specialized editors for non-textual media. The incompatibility of these tools hinders communication within the project by making it difficult to share the documents that record the project's plans, design history, implementations, and experiences. An important factor underlying this incompatibility is the diversity of presentation models that have been adopted. Each system's presentation model is well-suited to the document types and media it supports, but is difficult to adapt to other types and media. This dissertation describes a new model of presentation that is designed to be applied to a diverse array of documents drawn from many different media. The model is based on four simple services: attribute propagation, box layout, tree elaboration, and interface functions. Together, these services are powerful enough to support the creation of many novel and visually rich document presentations. Furthermore, because the model is based on a new understanding of the fundamental parameters defining media, the four services can be adapted for use with all media in common use. The utility of this presentation model has been explored through the design and implementation of Proteus, a system for handling presentation specifications that is part of Ensemble, an environment for developing both software and multimedia documents. Proteus interprets specifications that describe how the four presentation services should be applied to individual documents. Proteus has a medium-independent kernel that provides the specification interpreter and runtime support for the four presentation services. The kernel is adapted to different media via the addition of a shell specifying the medium's parameters. 
Proteus's adaptability significantly eases the task of extending Ensemble to support new media. Proteus is also an important part of Ensemble's support for multiple, synchronized presentations. In summary, this research develops a new, medium-independent model of presentation and shows that a specification-driven presentation system based on this model can form the basis of a software environment supporting multiple presentations and a variety of media.