Showing papers on "Document processing" published in 2002


Patent
James Cox, Oliver M. Dain
31 May 2002
TL;DR: A computer-implemented system and method for processing text-based documents is described, in which singular value decomposition of a frequency-of-terms data set projects terms and documents into a reduced dimensional subspace, and the normalized projections are used to analyze the documents.
Abstract: A computer-implemented system and method for processing text-based documents. A frequency of terms data set is generated for the terms appearing in the documents. Singular value decomposition is performed upon the frequency of terms data set in order to form projections of the terms and documents into a reduced dimensional subspace. The projections are normalized, and the normalized projections are used to analyze the documents.

233 citations
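
The pipeline in this abstract (term frequencies, singular value decomposition, normalized projections) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, whitespace tokenization is assumed, and a simple power iteration stands in for a proper SVD routine so the example needs no third-party library. It assumes k does not exceed the rank of the matrix.

```python
import math
from collections import Counter


def term_frequency_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[term]) for c in counts] for term in vocab]


def top_singular_pairs(matrix, k, iters=500):
    """Approximate the k largest singular values and right singular vectors.

    Power iteration with deflation on the Gram matrix A^T A. This is a
    teaching stand-in for a library SVD, not a robust solver.
    """
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    pairs = []
    for idx in range(k):
        v = [1.0 + 0.1 * (i + idx) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            for _, u in pairs:  # deflation: remove directions already found
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        quad = sum(v[i] * gram[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((math.sqrt(max(0.0, quad)), v))
    return pairs


def normalized_document_projections(matrix, k):
    """Project each document into the k-dimensional subspace, then scale to unit length."""
    pairs = top_singular_pairs(matrix, k)
    projections = []
    for j in range(len(matrix[0])):
        coords = [sigma * v[j] for sigma, v in pairs]
        norm = math.sqrt(sum(x * x for x in coords))
        projections.append([x / norm for x in coords] if norm else coords)
    return projections


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Once the projections are normalized, comparing two documents reduces to a dot product, which is one plausible reason for the normalization step.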


Patent
23 Jul 2002
TL;DR: A document processing device has an evaluation region disposed along a transport path between an input and an output receptacle and can process both currency bills and barcoded media having at least two barcodes.
Abstract: A document processing device having an evaluation region disposed along a transport path between an input and output receptacle capable of processing both currency bills and barcoded media having at least two barcodes. One of the barcodes encodes a ticket number and another barcode encodes a payout amount associated with that ticket number. The evaluation region includes detectors for detecting predetermined characteristics of currency bills and a barcode reader for scanning the barcodes printed on the barcoded media. A controller coupled to the evaluation region controls the operation of the document processing device and receives input from and provides information to a user via a control unit. In some embodiments, the document processing device may have any number of output receptacles, and the control unit allows the user to specify which output receptacle receives which type of document. An optional coin sorter may be coupled to the document processing device to allow document and coin processing. The document processing device may be coupled to a network to communicate information to devices linked to the network.

181 citations


Patent
Grace T. Brewington
16 Dec 2002
TL;DR: A secure document processing system receives an original document and prints a secure hardcopy version of it, wherein the secure hardcopy version includes a machine-readable encoded image signature which represents an image segment of the original document.
Abstract: A secure document processing system for receiving an original document and for printing a secure hardcopy version of the original document, wherein the secure hardcopy version includes a machine-readable encoded image signature which represents an image segment of the original document. Such hardcopy secure documents can be validated by inputting them to a secure document validation system operable to identify and process the machine-readable encoded representation and in response to determine whether the recovered image signature indicates that the document is counterfeit or has been altered.

157 citations
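
The validation idea in this abstract, comparing a signature recovered from the hardcopy against the image segment it represents, might be sketched as follows. The abstract does not say how the signature is computed. A SHA-256 digest over an exact pixel region is an assumption made here for illustration, and a real scanner pipeline would need a noise-tolerant signature instead. The names `image_signature` and `validate` are hypothetical.

```python
import hashlib


def image_signature(pixels, region):
    """Digest of the grayscale pixels (0-255) inside region = (top, left, bottom, right).

    Assumption: an exact cryptographic hash stands in for the patent's
    unspecified image signature.
    """
    top, left, bottom, right = region
    digest = hashlib.sha256()
    for row in pixels[top:bottom]:
        digest.update(bytes(row[left:right]))
    return digest.hexdigest()


def validate(scanned_pixels, region, encoded_signature):
    """Return True when the scanned segment still matches the encoded signature."""
    return image_signature(scanned_pixels, region) == encoded_signature
```

A mismatch indicates only that the protected segment differs from the one that was signed. Changes outside the signed region go unnoticed in this sketch.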


Journal ArticleDOI
TL;DR: This paper briefly describes various components of a document analysis system and provides the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.
Abstract: Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems, irrespective of the particular domain or language to which they are applied. We hope that this paper will help the reader by providing the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.

143 citations


Patent
15 Nov 2002
TL;DR: A target document (25) is generated by merging four source documents (2-5); merging is performed by parsing a document (6, 8, 11, 20) into a hierarchical tree if it is not already in that form, and merging the trees.
Abstract: A target document (25) is generated by merging four source documents (2-5). There are three merge operations, an operation (10) for two source documents (2, 13), an operation (13) for an intermediate target document and a source document (4), and an operation (21) for a second intermediate target document and a final source document (5). In each merge operation one source document inherits from the other. An inheriting instruction is embedded within the inheriting document. Merging is performed by parsing a document (6, 8, 11, 20) into a hierarchical tree if it is not already in that form, and merging the trees. Matching nodes are identified and are combined or replaced according to a policy.

140 citations
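
The tree merge described in this abstract, where matching nodes are combined or replaced according to a policy, can be illustrated with a small recursive function. Everything concrete here is an assumption rather than the patent's method: nodes are plain dicts, siblings match when their `name` fields are equal, sibling names are unique, and an optional `policy` key on a node of the inheriting tree plays the role of the embedded inheriting instruction.

```python
import copy


def merge_trees(base, inheriting, policy="combine"):
    """Merge the inheriting tree onto the base tree and return a new tree.

    policy == "replace": the inheriting subtree wins wholesale.
    policy == "combine": children are merged recursively by name.
    """
    if policy == "replace":
        return copy.deepcopy(inheriting)

    merged = {
        "name": base["name"],
        # The inheriting value overrides the base value when it is present.
        "value": inheriting.get("value", base.get("value")),
        "children": [],
    }
    pending = {child["name"]: child for child in inheriting.get("children", [])}
    for child in base.get("children", []):
        match = pending.pop(child["name"], None)
        if match is None:
            merged["children"].append(copy.deepcopy(child))  # only in base
        else:
            child_policy = match.get("policy", policy)  # per-node instruction
            merged["children"].append(merge_trees(child, match, child_policy))
    # Nodes that exist only in the inheriting tree are appended in order.
    merged["children"].extend(copy.deepcopy(c) for c in pending.values())
    return merged
```

Merging four source documents would then be three successive calls, each feeding the previous result in as `base`, which mirrors the three merge operations listed in the abstract.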


Patent
18 Mar 2002
TL;DR: In this article, an encrypted symbol is imprinted on the document using ink that is not visible in invisible light, and the symbol includes information used to authenticate the document and to identify the bearer of the document.
Abstract: A bearer document processing system includes preparation, verification, redeeming and depositing of the document. An encrypted symbol is imprinted on the document using ink that is not visible in invisible light. The symbol includes information used to authenticate the document and to identify the bearer of the document. The document is scanned at a transaction point. The symbol can be decoded at transaction points or at a remote central processing station. Accounts involved in transactions are credited and debited using information contained in the encoded symbol and other information provided by the bearer and the acceptor of the document. Transactions are performed in essentially real-time, and the bearer is provided with evidence of a successful transaction. Although applicable to any type of bearer document such as stock certificates, money orders, the system is particularly applicable to processing bank checks in real-time and with the possibility of fraudulent transactions being minimized.

134 citations


Journal ArticleDOI
TL;DR: This analysis demonstrates how detection of various high-level features of the Bengali character set might help formulate successful multistage OCR design.

94 citations


Journal ArticleDOI
TL;DR: This work uses writer-independent writing style models (lexemes) to identify the styles present in a particular writer's training data and updates these models using the writer's data, demonstrating the feasibility of this approach on both isolated handwritten character recognition and unconstrained word recognition tasks.
Abstract: Writer-adaptation is the process of converting a writer-independent handwriting recognition system into a writer-dependent system. It can greatly increase recognition accuracy, given adequate writer models. The limited amount of data a writer provides during training constrains the models' complexity. We show how appropriate use of writer-independent models is important for the adaptation. Our approach uses writer-independent writing style models (lexemes) to identify the styles present in a particular writer's training data. These models are then updated using the writer's data. Lexemes in the writer's data for which an inadequate number of training examples is available are replaced with the writer-independent models. We demonstrate the feasibility of this approach on both isolated handwritten character recognition and unconstrained word recognition tasks. Our results show an average reduction in error rate of 16.3 percent for lowercase characters as compared against representing each of the writer's character classes with a single model. In addition, an average error rate reduction of 9.2 percent is shown on handwritten words using only a small amount of data for adaptation.

94 citations
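
The fallback rule in this abstract, using the writer-independent model wherever a lexeme has too few writer examples, is easy to show in miniature. The sketch below is a deliberate simplification and not the paper's procedure: each model is assumed to be a single mean feature vector, "updating" is reduced to taking the mean of the writer's samples, and the threshold `min_examples` is a hypothetical parameter.

```python
def adapt_models(independent_models, writer_samples, min_examples=3):
    """Build writer-dependent lexeme models with a writer-independent fallback.

    independent_models: {lexeme: feature_vector}
    writer_samples:     {lexeme: [feature_vector, ...]} from the writer's training data
    """
    adapted = {}
    for lexeme, prototype in independent_models.items():
        samples = writer_samples.get(lexeme, [])
        if len(samples) >= min_examples:
            # Enough data: estimate the model from the writer's own samples.
            adapted[lexeme] = [sum(column) / len(samples) for column in zip(*samples)]
        else:
            # Too little data: keep the writer-independent model.
            adapted[lexeme] = list(prototype)
    return adapted
```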


Patent
13 Aug 2002
TL;DR: A document processing device (100) has an evaluation region (104) disposed along a transport path between an input (102) and an output receptacle (108) and is capable of processing both currency bills and barcoded media.
Abstract: A document processing device (100) having an evaluation region (104) disposed along a transport path between an input (102) and output receptacle (108) capable of processing both currency bills and barcoded media. The evaluation region includes detectors (110) for detecting predetermined characteristics of currency bills and a barcode reader (112) for scanning the barcoded media. A controller (114) coupled to the evaluation region controls the operation of the document processing device and receives input from and provides information to a user via a control unit (216). In some embodiments, the document processing device may have any number of output receptacles (708a and 708b), and the control unit allows the user to specify which output receptacle receives which type of document. In some embodiments, an optional coin sorter (1048) may be coupled to the document processing device to allow document and coin processing. The document processing device may be coupled to a network (1192) to communicate information to devices (1100a-1100n) linked to the network.

94 citations


Journal ArticleDOI
TL;DR: Intuitive, easy-to-implement evaluation schemes for the related problems of table detection and table structure recognition are introduced, and a new paradigm, “graph probing,” is described for comparing the results returned by the recognition system and the representation created during ground-truthing.
Abstract: While techniques for evaluating the performance of lower-level document analysis tasks such as optical character recognition have gained acceptance in the literature, attempts to formalize the problem for higher-level algorithms, while receiving a fair amount of attention in terms of theory, have generally been less successful in practice, perhaps owing to their complexity. In this paper, we introduce intuitive, easy-to-implement evaluation schemes for the related problems of table detection and table structure recognition. We also present the results of several small experiments, demonstrating how well the methodologies work and the useful sorts of feedback they provide. We first consider the table detection problem. Here algorithms can yield various classes of errors, including non-table regions improperly labeled as tables (insertion errors), tables missed completely (deletion errors), larger tables broken into a number of smaller ones (splitting errors), and groups of smaller tables combined to form larger ones (merging errors). This leads naturally to the use of an edit distance approach for assessing the results of table detection. Next we address the problem of evaluating table structure recognition. Our model is based on a directed acyclic attribute graph, or table DAG. We describe a new paradigm, “graph probing,” for comparing the results returned by the recognition system and the representation created during ground-truthing. Probing is in fact a general concept that could be applied to other document recognition tasks as well.

92 citations
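
The four table detection error classes named in this abstract (insertion, deletion, splitting, merging) can be counted from region overlaps. The sketch below assumes each table is a one-dimensional interval of line numbers, `(first_line, last_line)` inclusive, and it only tallies the error classes. It does not reproduce the paper's edit distance measure or graph probing, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) intervals share at least one line."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_table_errors(ground_truth, detected):
    """Count detection errors by comparing detected regions with ground truth."""
    errors = {"insertion": 0, "deletion": 0, "splitting": 0, "merging": 0}
    for region in detected:
        hits = sum(overlaps(region, truth) for truth in ground_truth)
        if hits == 0:
            errors["insertion"] += 1   # non-table region labeled as a table
        elif hits > 1:
            errors["merging"] += 1     # several true tables combined into one
    for truth in ground_truth:
        hits = sum(overlaps(truth, region) for region in detected)
        if hits == 0:
            errors["deletion"] += 1    # table missed completely
        elif hits > 1:
            errors["splitting"] += 1   # one true table broken into pieces
    return errors
```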


Proceedings ArticleDOI
06 Aug 2002
TL;DR: A new character segmentation algorithm (ACSA) for Arabic scripts is presented, which segments isolated handwritten words into perfectly separated characters based on morphological rules constructed at the feature extraction phase.
Abstract: Character segmentation is a necessary preprocessing step for character recognition in many OCR systems. It is an important step because incorrectly segmented characters are unlikely to be recognized correctly. The most difficult case in character segmentation is the cursive script. The scripted nature of Arabic written language poses significant challenges for automatic character segmentation and recognition. In this paper, a new character segmentation algorithm (ACSA) for Arabic scripts is presented. The developed segmentation algorithm segments isolated handwritten words into perfectly separated characters. It is based on morphological rules, which are constructed at the feature extraction phase. Finally, ACSA is combined with an existing handwritten Arabic character recognition system (RECAM).

Journal ArticleDOI
TL;DR: A neural network-based script identification system is described which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts; the reported results are very encouraging.
Abstract: The paper describes a neural network-based script identification system which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts. Script identification is a basic requirement in automation of document processing in multi-script, multi-lingual environments. The system developed includes a feature extractor and a modular neural network. The feature extractor consists of two stages. In the first stage the document image is dilated using 3 × 3 masks in horizontal, vertical, right diagonal, and left diagonal directions. In the next stage, average pixel distribution is found in these resulting images. The modular network is a combination of separately trained feedforward neural network classifiers for each script. The system recognizes 64 × 64 pixel document images. In the next level, the system is modified to perform on single word-document images in the same three scripts. The modified system includes a pre-processor, a modified feature extractor and a probabilistic neural network classifier. The pre-processor segments the multi-script multi-lingual document into individual words. The feature extractor receives these word-document images of variable size and still produces the discriminative features employed by the probabilistic neural classifier. Experiments are conducted on a manually developed database of document images of size 64 × 64 pixels and on a database of individual words in the three scripts. The results are very encouraging and prove the effectiveness of the approach.
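
The two-stage feature extractor in this abstract (directional dilation, then an average over each dilated image) can be sketched on a binary image stored as a list of rows. Several details are assumptions, since the abstract does not give them: the exact mask shapes, which diagonal counts as "right" or "left", and reading "average pixel distribution" as the fraction of foreground pixels. The names `MASKS`, `dilate` and `directional_features` are hypothetical.

```python
# Each 3 x 3 mask is written as the (row, column) offsets of its non-zero cells.
MASKS = {
    "horizontal":     [(0, -1), (0, 0), (0, 1)],
    "vertical":       [(-1, 0), (0, 0), (1, 0)],
    "right_diagonal": [(1, -1), (0, 0), (-1, 1)],
    "left_diagonal":  [(-1, -1), (0, 0), (1, 1)],
}


def dilate(image, offsets):
    """Binary dilation: every foreground pixel is stamped with the mask."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if image[r][c]:
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        out[rr][cc] = 1
    return out


def directional_features(image):
    """One feature per direction: the foreground fraction of the dilated image."""
    area = len(image) * len(image[0])
    return {name: sum(map(sum, dilate(image, offsets))) / area
            for name, offsets in MASKS.items()}
```

A stroke grows least when dilated along its own direction, so a short horizontal stroke yields a smaller horizontal feature than vertical feature. That difference is the kind of directional cue a script classifier could use.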

Patent
10 May 2002
TL;DR: The text format of input data is checked and converted into a system-manipulated format; the converted data is then divided into blocks in a simple manner such that elements in the blocks can be checked based on repetition of predetermined character patterns.
Abstract: The text format of input data is checked, and is converted into a system-manipulated format. It is further determined if the input data is in an HTML or e-mail format using tags, heading information, and the like. The converted data is divided into blocks in a simple manner such that elements in the blocks can be checked based on repetition of predetermined character patterns. Each block section is tagged with a tag indicating a block. The data divided into blocks is parsed based on tags, character patterns, etc., and is structured. A table in text is also parsed, and is segmented into cells. Finally, tree-structured data having a hierarchical structure is generated based on the sentence-structured data. A sentence-extraction template paired with the tree-structured data is used to extract sentences.

Patent
Yasuo Mori
11 Sep 2002
TL;DR: A document processing method and system implement a display that improves the efficiency and usability of edit operations when inserting, moving, or copying & pasting data, by taking full advantage of the feature of retaining data and set values hierarchically in the system.
Abstract: The present invention provides a document processing method and system which implement display that improves efficiency and usability of edit operations when inserting, moving, or copying & pasting data, by taking full advantage of the feature of retaining data and set values hierarchically in the system. In document processing for editing a document consisting of multiple sets of original data, when a user moves a graphic object which represents a desired original by dragging it on the document in order to move or copy the desired original data to a certain position on the document, the present invention detects the boundary between originals in the document, nearest to the position of the cursor dragging the graphic object which represents the desired original, and displays an identifiable mark on the boundary between originals in the document.
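
The boundary detection step in this abstract, finding the boundary between originals that lies nearest to the drag cursor, reduces to a nearest-value lookup. The sketch assumes the document is a linear sequence of originals measured in pages and that the cursor position is expressed on the same scale. The patent states neither, and the function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def boundaries(original_lengths):
    """Offsets before the first original, between originals, and after the last."""
    return [0] + list(accumulate(original_lengths))


def nearest_boundary(original_lengths, cursor):
    """Return the boundary offset closest to the cursor (ties go to the earlier one)."""
    marks = boundaries(original_lengths)
    i = bisect_left(marks, cursor)
    candidates = marks[max(i - 1, 0): i + 1]  # neighbours on either side of the cursor
    return min(candidates, key=lambda mark: abs(mark - cursor))
```

For three originals of 3, 5 and 2 pages the boundaries are 0, 3, 8 and 10, so a cursor at position 4 would place the insertion mark at 3.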

Patent
13 Nov 2002
TL;DR: A document processing system for use in identifying a segmented document includes a data store of layout graph models that are classified and/or labeled, and a matching module that determines a match between a layout graph sample for the segmented document and a particular layout graph model.
Abstract: A document processing system for use in identifying a segmented document includes a data store of layout graph models that are classified and/or labeled. A matching module makes a determination of a match between a layout graph sample for the segmented document and a particular layout graph model. The matching module uses a correlator to generate an identified, segmented document that is classified and/or labeled based on the segmented document, the layout graph model, and the determination of a match.

Patent
06 Sep 2002
TL;DR: In this article, a system and methods enable the gathering and transferring of usage information for an electronic document so that the document's usage history can be tracked, through the execution of a tracking module located within the electronic document.
Abstract: A system and methods enable the gathering and transferring of usage information for an electronic document so that the document's usage history can be tracked. A document history is recorded into an electronic document through the execution of a tracking module located within the electronic document. When the electronic document is accessed, the tracking module executes to record document history information into the electronic document. The disclosed system and methods provide a convenient way to track secured documents, maintain document databases, and offer feedback to authors on how documents are used so that document contents can be tailored to better suit the needs of an audience.

Journal ArticleDOI
Dar-Shyang Lee
TL;DR: This work proposes a new solution to substitution deciphering based on hidden Markov models that is more accurate than relaxation and much more robust in the presence of noise, making it useful for applications in compressed document processing.
Abstract: It has been shown that simple substitution ciphers can be solved using statistical methods such as probabilistic relaxation. However, the utility of such solutions has been limited by their inability to cope with noise encountered in practical applications. We propose a new solution to substitution deciphering based on hidden Markov models. We show that our algorithm is more accurate than relaxation and much more robust in the presence of noise, making it useful for applications in compressed document processing. Recovering character interpretations from the sequence of cluster identifiers in a symbolically compressed document can be treated as a cipher problem. Although a significant amount of noise is present in the cluster sequence, enough information can be recovered with a robust deciphering algorithm to accomplish certain document analysis tasks. The feasibility of this approach is demonstrated in a multilingual document duplicate detection system.

Patent
03 Sep 2002
TL;DR: In this article, the electronic document information determined for a paper document may include information identifying an electronic document corresponding to the paper document and identifying a location where the electronic documents are stored or a pointer or reference to the e-document.
Abstract: Techniques for determining electronic document information for a paper document. The electronic document information determined for a paper document may include information identifying an electronic document corresponding to the paper document. The electronic document information may also include information identifying a location where the electronic document is stored or a pointer or reference to the electronic document. The electronic document information determined for a paper document may be stored along with identification code information read from an identification tag that is physically associated with the paper document. The electronic document information for a paper document may also be stored in an identification tag that is physically associated with the paper document or physically associated with another paper document generated based upon the paper document.

Book ChapterDOI
19 Aug 2002
TL;DR: The system smartFIX which is a document analysis and understanding system developed by the DFKI spin-off INSIDERS permits the processing of documents ranging from fixed format forms to unstructured letters of any format.
Abstract: Although the internet offers a wide-spread platform for information interchange, day-to-day work in large companies still means the processing of tens of thousands of printed documents every day. This paper presents the system smartFIX which is a document analysis and understanding system developed by the DFKI spin-off INSIDERS. It permits the processing of documents ranging from fixed format forms to unstructured letters of any format. Apart from the architecture, the main components and system characteristics, we also show some results when applying smartFIX to medical bills and prescriptions.

Patent
11 Sep 2002
TL;DR: In this article, a system and method for providing electronic document processing via a network such as the Internet is described, where electronic documents are generated, processed, and reviewed by different users fulfilling different roles within a loan documentation process.
Abstract: A system and method for providing electronic document processing via a network such as the Internet. A superuser defines access rules by which other users can access the system. Electronic documents are generated, processed, and reviewed by different users fulfilling different roles within a loan documentation process. An originator initiates electronic document processing by transmitting electronic documents to a document server. An electronic document processor evaluates the electronic documents and determines their applicability to defined documentation processes. An electronic document manager defines the documentation processes and balances processing activities for a plurality of electronic document processors. Ultimately, the electronic documents are made available to a plurality of electronic document recipients in specific formats specifiable by each of the electronic document recipients.

Patent
04 Jan 2002
TL;DR: A system and method for processing communications between a sender computing device (202) and at least one recipient computing device (204) are provided, in which a document processing server (206) can implement sender-specified recipient identity verification.
Abstract: A system and method for processing communications between a sender computing device (202) and at least one recipient computing device (204) are provided. A sender (202) establishes a secure communication with a document processing server (206) and requests the processing of an electronic document, which can include the appending of a digital signature. The document processing server (206) processes the electronic document and establishes secure communications with one or more designated recipients (204). The document processing server (206) can implement sender (202) specified recipient identity verification and provide further processing of the electronic document as designated by the recipients (204).

Patent
03 Sep 2002
TL;DR: A surface suitable for placement of documents is configured for monitoring RFID-tagged documents; such documents can also be monitored in a document processing device to control access to the document processing functions.
Abstract: Document monitoring provides a measure of document security. Documents incorporating radio frequency identification (RFID) tags can be monitored by appropriate interrogation components for movement activity. A surface suitable for placement of documents is configured for monitoring RFID tagged documents. Such documents can be monitored in a document processing device to control access to the document processing functions.

Patent
27 Nov 2002
TL;DR: In this paper, a plurality of document definition information for identifying documents, and format control information for recognizing a character recorded on a document corresponding to each of the plurality of definition information are held beforehand.
Abstract: A plurality of document definition information for identifying documents, and format control information for recognizing a character recorded on a document corresponding to each of the plurality of document definition information are held beforehand, documents targeted for character recognition are identified as specific documents based on document images of the entered documents targeted for character recognition and the document definition information and, based on a result of the identification, character recognition is executed by using corresponding format control information. A document definition device adds a plane area of each of documents to be identified to the document definition. An OCR device checks the plane area on the document by using the document definition before check of a preprint accompanied by character recognition.

Book ChapterDOI
TL;DR: This paper presents a preliminary investigation of applying a homogeneous multi-agent clustering system based on the self-organization behavior of the ants to the high-dimensional problem of web document categorization.
Abstract: The self-organizing and autonomous behavior of social insects such as ants presents an interesting and powerful metaphor for applications in the retrieval and management of large and fast growing amount of online information. The explosive growth of web documents has increasingly made more difficult and costly the manual task of organizing the documents into meaningful categories by human experts. Hence, it is desirable that some degree of automation be incorporated into the classification process to enable better scalability and prevent human classifiers from being overwhelmed by the deluge of information. This paper presents a preliminary investigation of applying a homogeneous multi-agent clustering system based on the self-organization behavior of the ants to the high-dimensional problem of web document categorization. A description of the text processing needed to obtain significant document features is included. The system will be evaluated on multi-class online English documents obtained from a popularly used search engine.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: This work combines information from a language model and character image pattern matching to iteratively reduce ambiguity in document images and at least partially resolve the character content without optical character recognition.
Abstract: We combine information from a language model and character image pattern matching to iteratively reduce ambiguity in document images. Combining word shape information and lists of similar bitmap patterns in a document at least partially resolves the character content without optical character recognition. We present the output in various ways suitable for human readers or for differing downstream processes.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The primary concern of the approach is the modeling of human motor functionality while writing characters by looking at the whole pen trajectory, where the time evolution of the pen coordinates plays a crucial role.
Abstract: This paper presents an online handwriting recognition approach for Indian scripts. The primary concern of the approach is the modeling of human motor functionality while writing characters. This is achieved by looking at the whole pen trajectory, where the time evolution of the pen coordinates plays a crucial role. A low complexity classifier was designed and the proposed similarity measure appears to be quite robust against wide variations in writing styles. Initially, the approach was applied for online recognition of handwritten characters in Devnagari and Bangla, the two major Indian scripts. A test on a dataset of considerable size shows promising recognition rates: 97.29% for Devnagari and 96.34% for Bangla.

Patent
20 Nov 2002
TL;DR: When the same contents (content data) of a document appear in several versions, they are shared so as to reduce the storage area: instead of accumulating content data separately for each version, each version is related to the content data accumulated in a storage area shared among the versions.
Abstract: When there are the same contents (content data) of a document among versions, the contents of the document are shared so as to reduce the storage area. Accordingly, instead of accumulating content data separately for each version, each version is related to the content data accumulated in the storage area shared among the versions. When a document α has versions 1 through 3 and each of versions 1 through 3 has sections 1 and 2, three sections 1 of the versions 1 through 3 share content data 1 and the section 2 of each of the versions 1 through 3 has different content data 2, 3, or 4. The content data 1, 2, 3, or 4 indicated by version information are searched for from a content data DB, to be edited. Only when the content data 1, 2, 3, or 4 are changed, new content data are registered.
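
The sharing scheme in this abstract, where versions point at content data held once in a shared store, can be sketched as a small content-addressed store. Identifying equal content by a SHA-256 hash is an assumption made for illustration, because the abstract does not say how identical content is detected. The class and method names are hypothetical.

```python
import hashlib


class VersionedStore:
    """Versions reference shared content data instead of holding private copies."""

    def __init__(self):
        self.content_db = {}  # content id -> content data (stored once)
        self.versions = {}    # (document, version) -> {section: content id}

    def _register(self, data):
        """Store the data only if it is new, and return its content id."""
        content_id = hashlib.sha256(data.encode("utf-8")).hexdigest()
        self.content_db.setdefault(content_id, data)
        return content_id

    def save_version(self, document, version, sections):
        """Record a version as a mapping from section name to content id."""
        self.versions[(document, version)] = {
            name: self._register(text) for name, text in sections.items()
        }

    def read(self, document, version, section):
        """Look up the content data that a version's section points at."""
        return self.content_db[self.versions[(document, version)][section]]
```

Replaying the abstract's example, three versions whose section 1 is identical and whose section 2 differs leave four entries in `content_db` rather than six.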

Patent
11 Jan 2002
TL;DR: In this paper, the authors proposed a method to automatically detect plural documents and to output images onto sheets of paper in different modes such as for alignment of inclination when the plural documents are placed on a document table.
Abstract: PROBLEM TO BE SOLVED: To automatically detect plural documents and to output images onto sheets of paper in different modes such as for alignment of inclination when the plural documents are placed on a document table. SOLUTION: When read image data D1 are data of plural documents OR1, OR2, etc., which are placed on a document table 9 and are read at the same time, the document edges E1, E2, etc., of document image data d1, d2, etc., in the read image data D1 are detected by a document edge detecting section 5 according to difference in density. The plural document image data d1, d2, etc., are cut down by a plural document detecting section 6 according to the detected document edges E1, E2, etc., and, after processing such as individual alignment of inclination by a plural document processing section 7, images are output onto sheets 10 of paper by an image output device 3.


Patent
12 Apr 2002
TL;DR: In this paper, digital watermarks are embedded in documents to create a communication channel between document handling devices such as copiers (24, 26), printers (64, 66), scanners (52, 54) and fax machines.
Abstract: Digital watermarks are embedded in documents (44, 60) to create a communication channel between document handling devices such as copiers (24, 26), printers (64, 66), scanners (52, 54) and fax machines (24, 26). The digital watermarks are used to control document reproduction and transmission operations. The digital watermarks are also used to embed transaction information in documents (44, 60), to link the document (44, 60) to an original, electronic version stored on a network (48, 50), or to trace the document handling history of a document (44, 60).