scispace - formally typeset

Showing papers on "Document layout analysis" published in 2008


01 Jan 2008
TL;DR: The main purpose of the present report is to describe the current status of DIU with particular attention to two subprocesses: document skew angle estimation and page decomposition.
Abstract: Document Image Understanding (DIU) is an interesting research area with a large variety of challenging applications. Researchers have worked for decades on this topic, as witnessed by the scientific literature. The main purpose of the present report is to describe the current status of DIU with particular attention to two subprocesses: document skew angle estimation and page decomposition. Several algorithms proposed in the literature are synthetically described. They are included in a novel classification scheme. Some methods proposed for the evaluation of page decomposition algorithms are described. Critical discussions are reported about the current status of the field and about the open problems. Some considerations about the logical layout analysis are also reported.

128 citations


Journal ArticleDOI
TL;DR: The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code.
Abstract: This paper presents a document retrieval technique that is capable of searching document images without optical character recognition (OCR). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.

111 citations


Patent
Yiwu Lei, James E Maclean
12 Nov 2008
TL;DR: In this article, a dynamic document identification framework is proposed for identifying and validating security documents according to a dynamic data structure, and a document processing engine traverses the data structure by selectively invoking one or more of the plurality of processes to identify the captured images as one of the document type objects.
Abstract: Techniques are described for identifying and validating security documents according to a dynamic document identification framework. For example, a security document authentication device includes an image capture interface that receives the captured images of a document and a memory that stores a plurality of document type objects within a data structure according to the dynamic document identification framework. The security document authentication device also includes a document processing engine that traverses the data structure by selectively invoking one or more of the plurality of processes to identify the captured images as one of the plurality of document type objects. Contrary to conventional identification techniques, this identification method performed by traversing the data structure stored according to the dynamic document identification framework may provide a more accurate identification result in a more efficient manner, as only applicable processes may be applied to identify the captured images. Upon identifying the document type, a set of one or more validators are applied to further confirm its authenticity.

85 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed identification technique is accurate, easy for extension, and tolerant to noise and various types of document degradation.
Abstract: This paper reports an identification technique that detects scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through the document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the contained character or word images. Document images are vectorized by using vertical component cuts and character extremum points, which are both tolerant to the variation in text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. Scripts and languages of document images are then determined according to the distances between converted document vectors and the preconstructed script and language templates. Experimental results show that the proposed technique is accurate, easy for extension, and tolerant to noise and various types of document degradation.

72 citations


Journal ArticleDOI
TL;DR: A geometric matching algorithm is used to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property and shows that by removing characters outside the computed page frame, the OCR error rate is reduced.
Abstract: When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3 to 1.7% on the UW-III dataset. The use of page frame detection in layout-based document image retrieval application decreases the retrieval error rates by 30%.

66 citations


Patent
23 Sep 2008
TL;DR: In this paper, the authors propose a system for determining a logical structure of a document, which stores a collection of models, each of which describes one or more possible logical structures.
Abstract: The invention relates to methods for determining a logical structure of a document. The system stores a collection of models, each of which describes one or more possible logical structures. At least one document hypothesis is generated for the whole document. For each document hypothesis, the system verifies the document hypothesis on each page, for example, by generating at least one block hypothesis for each block in the document based on the document hypothesis, selecting a best block hypothesis for each block, selecting the model that corresponds to a best document hypothesis, namely the document hypothesis that has a best degree of correspondence with the selected best block hypotheses for the document, and forming a representation of the document based on the best document hypothesis described.

42 citations


Patent
05 Jun 2008
TL;DR: A document processing system and method for using image quality to sort documents are described, in which an image quality analysis system analyzes the captured image of each document and causes any document with an unacceptable image to be redirected to a designated destination pocket.
Abstract: A document processing system and method for using image quality to sort documents. The document processing system comprises: a document sorting system that designates a destination pocket for each document based on data gathered from each document; a document imaging system that captures an image of each document; and an image quality analysis system that analyzes each image and causes any document having an unacceptable image to be redirected to an unacceptable destination pocket.

41 citations


Patent
03 Dec 2008
TL;DR: In this paper, a method of classifying segmented contents of a scanned image of a document is proposed, which comprises partitioning the scanned image into colour segmented tiles at pixel level.
Abstract: Disclosed is a method of classifying segmented contents of a scanned image of a document. The method comprises partitioning the scanned image into colour segmented tiles at pixel level. The method then generates superpositioned segmented contents, each segmented content representing related colour segments in at least one colour segmented tile. Statistics are then calculated for each segmented content using pixel level statistics from each of the tile colour segments included in the segmented content, and a classification is then determined for each segmented content based on the calculated statistics. The segmented content may be macroregions. The macroregions may form part of a multi-layered document representation of the document. Each of a plurality of tiles of predetermined size of the image is converted into a representation having a plurality of layers, the representation corresponding to at least one of said tiles comprising multiple coloured layers, each tile comprising a superposition of the corresponding coloured layers. For each of the coloured layers, merging is performed with adjacent ones of the tiles, thereby generating a multi-layered document representation.

34 citations


Patent
Jasmine Novak
26 Mar 2008
TL;DR: In this paper, a system and a method for detecting relationships described in unstructured text-based electronic documents are described, which incorporate the use of an input file that contains one or more text patterns that represent particular relationships; each text pattern includes regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship.
Abstract: Disclosed are embodiments of a system and a method for detecting relationships described in unstructured text-based electronic documents. The system and method incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of proper names within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in the relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document.

25 citations
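
The pattern-with-slots idea in the relationship detection patent above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `{A}`/`{B}` slot syntax, the two sample patterns, and the capitalized-word regular expression that stands in for the proper noun tagger. The patent abstract does not specify any of these details.

```python
import re

# Crude stand-in for a proper noun tagger (assumption): runs of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical contents of the pattern input file: each relationship type
# maps to a text pattern with two entity slots, {A} and {B}.
PATTERNS = {
    "works_for": "{A} works for {B}",
    "acquired": "{A} acquired {B}",
}


def detect_relationships(doc_id, text):
    """Return one record per pattern match: entities, type, document, offset."""
    records = []
    for rel_type, pattern in PATTERNS.items():
        # Turn each slot into a named group that accepts a tagged proper name.
        regex = re.compile(
            pattern.replace("{A}", f"(?P<a>{NAME})").replace("{B}", f"(?P<b>{NAME})")
        )
        for match in regex.finditer(text):
            records.append({
                "type": rel_type,
                "a": match.group("a"),
                "b": match.group("b"),
                "document": doc_id,
                "offset": match.start(),  # where the matched text begins
            })
    return records
```

Calling `detect_relationships("doc-1", text)` yields records that carry the same kinds of fields the abstract lists as output: the entity names, the relationship type, the document identity, and a location within the document.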


Book ChapterDOI
01 Jan 2008
TL;DR: The problem of detecting the reading order relationship between components of a logical structure is investigated; the domain-specific knowledge required for this task is acquired automatically from training examples by a machine learning method that exploits only spatial information on the page layout.
Abstract: Document image understanding refers to logical and semantic analysis of document images in order to extract information understandable to humans and codify it into machine-readable form. Most of the studies on document image understanding have targeted the specific problem of associating layout components with logical labels, while less attention has been paid to the problem of extracting relationships between logical components, such as cross-references. In this chapter, we investigate the problem of detecting the reading order relationship between components of a logical structure. The domain specific knowledge required for this task is automatically acquired from a set of training examples by applying a machine learning method. The input of the learning method is the description of “chains” of layout components defined by the user. The output is a logical theory which defines two predicates, first_to_read/1 and succ_in_reading/2, useful for consistently reconstructing all chains in the training set. Only spatial information on the page layout is exploited for both single and multiple chain reconstruction. The proposed approach has been evaluated on a set of document images processed by the system WISDOM++. Documents are characterized by two important structures: the layout structure and the logical structure. Both are the results of repeatedly dividing the content of a document into increasingly smaller parts, and are typically represented by means of a tree structure. The difference between them is the criteria adopted for structuring the document content: the layout structure is based on the presentation of the content, while the logical structure is based on the human-perceptible meaning of the content.
The extraction of the layout structures from images of scanned paper documents is a complex process, typically denoted as document layout analysis, which involves several steps including preprocessing, page decomposition (or segmentation), classification of segments according to content type (e.g., text, graphics, pictures) and hierarchical organization on the basis of perceptual

25 citations



Patent
25 Apr 2008
TL;DR: In this paper, the present subject matter relates to controlling of mail processing equipment and allows for unique recognition of a printed document from all other similar documents, without the inclusion of additional purposeful identifying marks, data or barcodes.
Abstract: The present subject matter relates to controlling of mail processing equipment. More specifically, the present subject matter allows for unique recognition of a printed document from all other similar documents, without the inclusion of additional purposeful identifying marks, data or barcodes. A document processing system, such as an inserter, printer, postage meter, sorter or other document processing system is controlled based on document identification which does not depend on unique identifiers. Similarly if a document is identified with a unique identifying mark on the first page, the present subject matter allows for identification of each subsequent page in the document without requiring identifying marks on each page. The identification data is then used to control the processing of the printed document based upon the recognition and enables the performance of quality checks. Further, each subsequent page in the document, as part of a quality check, can be verified without requiring identifying marks on each page.

Patent
Zhigang Fan, Ramesh Nagarajan
05 Aug 2008
TL;DR: In this paper, a system and methods are described that facilitate determining an original document format for a scanned document by analyzing a bitmap thereof, and text objects are extracted from the document, binarized, and segmented to identify text.
Abstract: Systems and methods are described that facilitate determining an original document format for a scanned document by analyzing a bitmap thereof. Text objects are extracted from the document, binarized, and segmented to identify text. Page orientation and text size are used to distinguish between a slideshow-type document, and a word processing or spreadsheet-type document. To further distinguish between the word processing and spreadsheet types, text column structure and count is analyzed.

Patent
Hervé Déjean, Jean-Luc Meunier
28 Jan 2008
TL;DR: In this article, a method for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document is provided.
Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
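
The enumeration and longest-increasing-sub-sequence steps in the abstract above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented method: the rule that a fragment's "term" is its leading integer, and the helper names `leading_numbers` and `numbered_headings`, are hypothetical.

```python
import re
from bisect import bisect_left


def leading_numbers(fragments):
    """Enumerate fragments: keep (fragment index, number) where a fragment starts with an integer."""
    terms = []
    for i, frag in enumerate(fragments):
        m = re.match(r"\s*(\d+)\b", frag)
        if m:
            terms.append((i, int(m.group(1))))
    return terms


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: smallest tail value of an increasing run of length k + 1
    tails_idx = []  # position in `values` of that tail
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the tail of the longest run.
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def numbered_headings(fragments):
    """Fragments whose leading numbers form the longest increasing sequence."""
    terms = leading_numbers(fragments)
    keep = longest_increasing_subsequence([n for _, n in terms])
    return [fragments[terms[k][0]] for k in keep]
```

On a fragment list in which stray numbers such as a year or a sample count interrupt a run of section numbers, `numbered_headings` keeps only the fragments that fit the increasing sequence, which is the kind of ordered identifier sequence the abstract uses to annotate the document.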

Patent
06 Nov 2008
TL;DR: A document analysis system that automatically classifies documents by recognizing in each document distinctive features comprises a document acquisition system, a document recognition training system, document classification system and a job organization system as mentioned in this paper.
Abstract: A document analysis system that automatically classifies documents by recognizing in each document distinctive features comprises a document acquisition system, a document recognition training system, a document classification system, a document recognition system, and a job organization system. The document acquisition system receives jobs, wherein each job contains at least one electronic document. The document feature recognition system automatically extracts image and text features from each received document. The document classification system automatically classifies recognized electronic documents by finding the best match between the extracted features of each document and the feature sets associated with each category of document. The document recognition training system automatically trains the feature set for each corresponding category of documents, wherein the training system, using extracted features of unrecognized documents, automatically modifies the feature set for a document category. The job organization system automatically organizes each job according to the document categories it contains.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: An algorithm for ruling estimation of Glagolitic texts is presented that is based on text line extraction and is suitable for degraded manuscripts, extrapolating the baselines with the a priori knowledge of the ruling.
Abstract: In order to preserve our cultural heritage and for automated document processing, libraries and national archives have started digitizing historical documents. In the case of degraded manuscripts (e.g. by mold, humidity, bad storage conditions) the text or parts of it can disappear. The remaining parts of the text can be segmented and the ruling can be extrapolated with the a priori knowledge. Since the ruling defines the position of the text within a page, it can be used for layout analysis and as a basis for the enhancement of the readability. Furthermore, information about the scribe (hand) of the manuscript and its spatiotemporal origin can be gained by analyzing the ruling. This paper presents an algorithm for ruling estimation of Glagolitic texts based on text line extraction and is suitable for degraded manuscripts by extrapolating the baselines with the a priori knowledge of the ruling. The algorithm was tested on 30 pages of the Missale Sinaiticum and the evaluation was based on visual criteria.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: This work proposes a statistically motivated model-based trainable layout analysis system that allows assumption-free adaptation to different layout types and produces likelihood estimates of the correctness of the computed page segmentation.
Abstract: Geometric layout analysis plays an important role in document image understanding. Many algorithms known in literature work well on standard document images, achieving high text line segmentation accuracy on the UW-III dataset. These algorithms rely on certain assumptions about document layouts, and fail when their underlying assumptions are not met. Also, they do not provide confidence scores for their output. These two problems limit the usefulness of general purpose layout analysis methods in large scale applications. In this contribution, we propose a statistically motivated model-based trainable layout analysis system that allows assumption-free adaptation to different layout types and produces likelihood estimates of the correctness of the computed page segmentation. The performance of our approach is tested on a subset of the Google 1000 books dataset where it achieved a text line segmentation accuracy of 98.4% on layouts where other general-purpose algorithms failed to do a correct segmentation.

Patent
06 Nov 2008
TL;DR: In this paper, a method in a document analysis system automatically extracts from each received electronic document image and text features, in which the image features are indicative of how the document is laid out or textually-organized and therefore indicative of a corresponding document category.
Abstract: A method in a document analysis system automatically extracts from each received electronic document image and text features, in which the image features are indicative of how the document is laid out or textually-organized and therefore indicative of a corresponding document category, next compares the extracted image and text features with feature sets associated with each document category, and then classifies each document to a document category, the feature set of which best matches the extracted features of the document.

Dissertation
01 Jan 2008
TL;DR: This thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as that of state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size.
Abstract: Layout analysis--the division of page images into text blocks, lines, and determination of their reading order--is a major performance limiting step in large scale document digitization projects. This thesis addresses this problem in several ways: it presents new performance measures to identify important classes of layout errors, evaluates the performance of state-of-the-art layout analysis algorithms, presents a number of methods to reduce the error rate and catastrophic failures occurring during layout analysis, and develops a statistically motivated, trainable layout analysis system that addresses the needs of large-scale document analysis applications. An overview of the key contributions of this thesis is as follows. First, this thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as that of state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to traditional local thresholding techniques. Then, this thesis presents a new perspective for document image cleanup. Instead of trying to explicitly detect and remove marginal noise, the approach focuses on locating the page frame, i.e. the actual page contents area. A geometric matching algorithm is presented to extract the page frame of a structured document. It is demonstrated that incorporating a page frame detection step into the document processing chain results in a reduction in OCR error rates from 4.3% to 1.7% (n=4,831,618 characters) on the UW-III dataset and layout-based retrieval error rates from 7.5% to 5.3% (n=815 documents) on the MARG dataset.
The performance of six widely used page segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database is evaluated in this work using a state-of-the-art evaluation methodology. It is shown that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. Thus, a vectorial score is introduced that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and what page components (lines, blocks, etc.) are affected. Unlike previous schemes, this evaluation method has a canonical representation of ground truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. Based on a detailed analysis of the errors made by different page segmentation algorithms, this thesis presents a novel combination of the line-based approach by Breuel with the area-based approach of Baird which solves the over-segmentation problem in area-based approaches. This new approach achieves a mean text-line extraction error rate of 4.4% (n=878 documents) on the UW-III dataset, which is the lowest among the analyzed algorithms. This thesis also describes a simple, fast, and accurate system for document image zone classification that results from a detailed comparative analysis of performance of widely used features in document analysis and content-based image retrieval. Using a novel combination of known algorithms, an error rate of 1.46% (n=13,811 zones) is achieved on the UW-III dataset in comparison to a state-of-the-art system that reports an error rate of 1.55% (n=24,177 zones) using more complicated techniques. In addition to layout analysis of Roman script documents, this work also presents the first high-performance layout analysis method for Urdu script. 
For that purpose a geometric text-line model for Urdu script is presented. It is shown that the method can accurately extract Urdu text-lines from documents of different layouts like prose books, poetry books, magazines, and newspapers. Finally, this thesis presents a novel algorithm for probabilistic layout analysis that specifically addresses the needs of large-scale digitization projects. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of input layout with associated probabilities. An algorithm based on A* search is presented for finding the most likely layout of a page, given its structural layout model. For training layout models, an EM-like algorithm is presented that is capable of learning the geometric variability of layout structures from data, without the need for a page segmentation ground-truth. Evaluation of the algorithm on documents from the MARG dataset shows an accuracy of above 95% for geometric layout analysis.
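
The thesis abstract above states that its local thresholding runs in time independent of the local window size but does not say how. One well-known way to obtain that property is an integral image (summed-area table), which returns the sum over any rectangular window in four lookups. The Python sketch below rests on that assumption: the mean-based decision rule and the `ratio` parameter are illustrative choices, not the thesis algorithm.

```python
def integral_image(img):
    """Summed-area table with a zero border: ii[y][x] = sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def local_mean_threshold(img, radius, ratio=0.85):
    """Mark a pixel as ink (1) if it is darker than ratio * mean of its local window.

    The cost per pixel is constant regardless of `radius`, because each
    window sum comes from four table lookups.
    """
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            area = (y1 - y0) * (x1 - x0)
            # Compare pixel * area with ratio * total to avoid a division.
            out[y][x] = 1 if img[y][x] * area < ratio * total else 0
    return out
```

Increasing `radius` changes which pixels are averaged but not the number of operations per pixel, which is the behaviour the abstract describes as running in time close to that of global thresholding.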

Patent
13 Feb 2008
TL;DR: In this article, a method is described that compares a first document and a second document: an electronic image of each is obtained by scanning or in other ways and segmented into basic units such as words, lines or paragraphs, differences between the matched basic units are determined, and a document representing the differences is created and output.
Abstract: The method compares a first document 10 and a second document 20. The documents may be scanned in 110, 112 or an electronic image formed in other ways 114, 116. Each electronic image is then segmented into basic units 14, 24 such as words, lines or paragraphs. Differences between the matched basic units 14, 24 are determined and a document 30 representing the differences is created 130 and output 132.
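
The match-then-diff step in the document comparison patent above can be sketched with Python's standard `difflib` module. The sketch assumes the basic units are words that have already been recovered as text strings. The patent itself works on segmented image units, so the text representation and the function name `compare_units` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def compare_units(first_text, second_text):
    """Align two word sequences and return only the spans that differ."""
    a, b = first_text.split(), second_text.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete" or "insert"
            diffs.append((tag, a[i1:i2], b[j1:j2]))
    return diffs
```

For example, `compare_units("the quick brown fox", "the quick red fox")` reports a single replaced unit, which could then be rendered into a difference document.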

Patent
Tapas Kanungo, James J. Rhodes
15 Dec 2008
TL;DR: In this article, a document or multiple documents is analyzed to identify entities of interest within that document, by constructing n-gram or bi-gram models that correspond to different kinds of text entities such as chemistry-related words and generic English words.
Abstract: A document (or multiple documents) is analyzed to identify entities of interest within that document. This is accomplished by constructing n-gram or bi-gram models that correspond to different kinds of text entities, such as chemistry-related words and generic English words. The models can be constructed from training text selected to reflect a particular kind of text entity. The document is tokenized, and the tokens are run against the models to determine, for each token, which kind of text entity is most likely to be associated with that token. The entities of interest in the document can then be annotated accordingly.
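
The token classification described in the abstract above can be sketched in Python with two small bi-gram models. The abstract does not say whether the n-grams are over characters or words. This sketch assumes character bi-grams with add-one smoothing, and the tiny training lists are hypothetical toy data.

```python
import math
from collections import Counter

ALPHABET_SIZE = 28  # a-z plus start (^) and end ($) markers; an assumption


def bigrams(word):
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train_bigram_model(words):
    """Count character bi-grams over the training words for one kind of entity."""
    counts = Counter()
    for word in words:
        counts.update(bigrams(word))
    return counts, sum(counts.values())


def log_likelihood(token, model):
    """Add-one smoothed log-probability of the token's bi-grams under a model."""
    counts, total = model
    vocab = ALPHABET_SIZE * ALPHABET_SIZE
    return sum(math.log((counts[bg] + 1) / (total + vocab)) for bg in bigrams(token))


def classify(token, models):
    """Return the name of the model under which the token is most likely."""
    return max(models, key=lambda name: log_likelihood(token, models[name]))


# Toy training text (hypothetical), one list per kind of text entity.
MODELS = {
    "chemistry": train_bigram_model(
        ["ethanol", "methanol", "propanol", "butanol",
         "benzene", "toluene", "methane", "ethane"]
    ),
    "english": train_bigram_model(
        ["the", "house", "water", "table", "people", "under", "river", "garden"]
    ),
}
```

Each token of a tokenized document would be passed through `classify(token, MODELS)` and annotated with the returned label. Real models would be trained on far larger text than these toy lists.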

Patent
06 Nov 2008
TL;DR: In this paper, a method of enhancing electronic documents received from a plurality of users by a document analysis system for improving automatic recognition and classification of the received electronic documents, is provided.
Abstract: A method of enhancing electronic documents received from a plurality of users by a document analysis system for improving automatic recognition and classification of the received electronic documents, is provided. For each page of a received electronic document, the method filters the page to infer binarized-background artifacts resulting from the binarization of the original grayscale or color image source document and which reside in the vicinity of binarized text and binarized image features in the page, so that the binarized text and binarized images may be distinguished from the binarized-background artifacts and extracted from the document. The method then uses the extracted features from the filtered document to automatically recognize and classify a document into a document category.

Patent
11 Feb 2008
TL;DR: In this article, a system and method are proposed for producing semantically rich representations of texts that amplify and sharpen the interpretation of the texts, drawing on semantic content that is associated with most text strings but is not explicit in the strings or in the mere statistical co-occurrence of the strings with other strings.
Abstract: A system and method for producing semantically-rich representations of texts which amplify and sharpen the interpretation of the texts. The method relies on the fact that there is a substantial amount of semantic content associated with most text strings that is not explicit in the strings, or in the mere statistical co-occurrence of the strings with other strings, but which is nevertheless relevant to the text. This additional information is used to sharpen the representations derived directly from the text string, and also augment the representation with content that, while not explicitly mentioned in the string, can be used to support the performance of text processing applications including document indexing and retrieval, document classification, document routing, document summarization, and document tagging.

Patent
17 Mar 2008
TL;DR: In this paper, a novel method is disclosed for embedding hidden information in a document comprising characters, including: determining hidden information to be embedded in each class of layout transformation, acquiring a code sequence for each class, and performing layout transformation on characters from the document according to the acquired code sequence.
Abstract: A novel method is disclosed for embedding hidden information in a document comprising characters, including: determining hidden information to be embedded in each class of layout transformation respectively; acquiring a code sequence for each class of layout transformation by coding the hidden information to be embedded in the class of layout transformation; performing layout transformation on characters from the document according to the acquired code sequence for each class of layout transformation respectively.
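
The embedding steps described in this abstract can be illustrated with a minimal sketch. It is not the patented method: the two transformation classes (a horizontal shift and a size scaling), the step sizes, the `Glyph` type, and the use of plain binary as the "code sequence" are all assumptions made for illustration. Each class carries its own payload, and one bit is applied per character.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical step sizes for the two assumed transformation classes.
SHIFT_STEP = 0.25   # horizontal displacement that encodes a 1 bit
SCALE_STEP = 0.02   # relative size increase that encodes a 1 bit


@dataclass
class Glyph:
    """A character plus the layout attributes that carry hidden bits."""
    char: str
    x_offset: float = 0.0
    scale: float = 1.0


def to_bits(data: bytes) -> List[int]:
    """Code the payload as a bit sequence, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def from_bits(bits: List[int]) -> bytes:
    """Inverse of to_bits; ignores a trailing partial byte."""
    out = bytearray()
    for start in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def embed(text: str, shift_payload: bytes, scale_payload: bytes) -> List[Glyph]:
    """Apply each class of layout transformation according to its own code sequence."""
    shift_bits = to_bits(shift_payload)
    scale_bits = to_bits(scale_payload)
    if max(len(shift_bits), len(scale_bits)) > len(text):
        raise ValueError("text is too short to carry the payloads")
    glyphs = []
    for i, ch in enumerate(text):
        glyph = Glyph(ch)
        if i < len(shift_bits) and shift_bits[i]:
            glyph.x_offset = SHIFT_STEP
        if i < len(scale_bits) and scale_bits[i]:
            glyph.scale = 1.0 + SCALE_STEP
        glyphs.append(glyph)
    return glyphs


def extract(glyphs: List[Glyph], n_shift: int, n_scale: int) -> Tuple[bytes, bytes]:
    """Recover both payloads by thresholding each layout attribute."""
    shift_bits = [int(g.x_offset > SHIFT_STEP / 2) for g in glyphs[:8 * n_shift]]
    scale_bits = [int(g.scale > 1.0 + SCALE_STEP / 2) for g in glyphs[:8 * n_scale]]
    return from_bits(shift_bits), from_bits(scale_bits)
```

Because the two classes modify independent attributes, their payloads do not interfere; a practical scheme would also need error correction to survive printing and scanning, which this sketch omits.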

Patent
08 Sep 2008
TL;DR: In this paper, a document image is processed to identify at least one repetitive structure; a plurality of instances of that structure is then created from once-described structure properties in a document template, and each instance is populated with corresponding data from the document image.
Abstract: In one embodiment, there is disclosed a method 300 of (and system for) capturing data from a document image 100. The method 300 comprises processing the document image 100 to identify at least one repetitive structure 102, 104.2 and performing a capturing operation including creating a plurality of instances of the repetitive structure 102, 104.2 based on once-described structure properties (Table 1), 400 of the repetitive structure 102, 104.2 in a document template, and populating each instance with corresponding data from the document image 100. There is also disclosed a method 200 of (and system for) creating a document template for capturing data from a document image 100.

Journal ArticleDOI
TL;DR: This work uses support vector machines to learn whether or not to apply the previously mentioned operations from training documents in which all textlines and text regions have been located and their identities labeled, and views document layout analysis as a matter of solving a series of binary decision problems.

Proceedings ArticleDOI
12 Dec 2008
TL;DR: This research focuses on the classification of non-text blocks in technical documents into table, graph, and figure, and shows that a support vector machine classifies better than a backpropagation neural network.
Abstract: Text and non-text segmentation and classification is very important in a document layout analysis system before the document is presented to an OCR system. Heuristic rules have been used in segmenting and classifying the text and non-text blocks. This research focuses on the classification of non-text blocks in technical documents into table, graph, and figure. A comparative study is conducted between a backpropagation neural network and a support vector machine, and the result shows that the support vector machine classifies better than the backpropagation neural network.

Patent
06 Nov 2008
TL;DR: A document analysis system that automatically classifies documents by recognizing in each document distinctive features comprises a document acquisition system, a document recognition training system, document classification system and a job organization system.
Abstract: A document analysis system that automatically classifies documents by recognizing in each document distinctive features comprises a document acquisition system, a document recognition training system, a document classification system, a document recognition system, and a job organization system. The document acquisition system receives jobs, wherein each job contains at least one electronic document. The document feature recognition system automatically extracts image and text features from each received document. The document classification system automatically classifies recognized electronic documents by finding the best match between the extracted features of each document and the feature sets associated with each category of document. The document recognition training system automatically trains the feature set for each corresponding category of documents, wherein the training system uses extracted features of unrecognized documents to automatically modify the feature set for a document category. The job organization system automatically organizes each job according to the document categories it contains.

Patent
31 Jul 2008
TL;DR: In this paper, an e-document can be searched and found by capturing an image of the printed document. Instead of typing in a file name or searching through multiple directories, the user simply takes a picture of the document with a camera, and the system uses the document image to locate the e-document.
Abstract: In an embodiment of the invention, an electronic document (e-document) can be searched and found by capturing an image of the printed document. Instead of typing in a file name or searching through multiple directories, the user simply takes a picture of the document with a camera and the system uses the document image to locate the e-document. In an alternative embodiment of the invention, an image of a printed document can be useful for remote document sharing. In various embodiments of the invention, sharing an image of a printed document can be used to email a high quality paper document, send a high quality fax, or open a document to a page containing an annotation. Through co-design of the feature extraction and search algorithm in the system, the image feature detection robustness and search speed are improved at the same time.

Patent
06 Nov 2008
TL;DR: In this paper, a method in a document analysis system automatically extracts image and text features from each received electronic document and compares the extracted features with feature sets associated with each category of document to determine whether the document is recognizable as belonging to a document category.
Abstract: A method in a document analysis system automatically extracts image and text features from each received electronic document and compares the extracted features with feature sets associated with each category of document to determine whether the document is recognizable as belonging to a document category. If an electronic document is recognized as belonging to one of the document categories, the method classifies the electronic document as belonging to that document category. If, however, an electronic document is unrecognized, the method submits the unrecognized document to a learning phase, in which the unrecognized document is presented to a human trainer for manual classification of the unrecognized electronic document into a document category, and automatically modifies at least one of the features and the weights of the feature set of the document category corresponding to the manually-classified electronic document using the automatically extracted features of the manually-classified document.
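
The recognize-or-learn loop described in this abstract can be sketched as follows. This is an illustrative reading, not the patented implementation: the weighted-overlap score, the `THRESHOLD` value, the additive weight update, and the names `Category`, `classify`, `learn`, and `process` are all assumptions. Features are modeled as plain strings, and the human trainer as a callback.

```python
from typing import Callable, Dict, Optional, Set

THRESHOLD = 0.5  # hypothetical minimum score for a document to count as recognized


class Category:
    """A document category with a weighted feature set."""

    def __init__(self, name: str, weights: Dict[str, float]):
        self.name = name
        self.weights = dict(weights)

    def score(self, features: Set[str]) -> float:
        """Fraction of the category's total weight matched by the document."""
        total = sum(self.weights.values())
        if total == 0:
            return 0.0
        matched = sum(w for f, w in self.weights.items() if f in features)
        return matched / total


def classify(features: Set[str], categories: Dict[str, Category]) -> Optional[Category]:
    """Return the best-matching category, or None if the document is unrecognized."""
    best = max(categories.values(), key=lambda c: c.score(features), default=None)
    if best is None or best.score(features) < THRESHOLD:
        return None
    return best


def learn(features: Set[str], category: Category, step: float = 1.0) -> None:
    """Modify the category's feature set using the manually classified document."""
    for feature in features:
        category.weights[feature] = category.weights.get(feature, 0.0) + step


def process(features: Set[str],
            categories: Dict[str, Category],
            ask_trainer: Callable[[Set[str]], str]) -> str:
    """Classify a document, falling back to the learning phase when unrecognized."""
    match = classify(features, categories)
    if match is not None:
        return match.name
    name = ask_trainer(features)  # manual classification by a human trainer
    category = categories.setdefault(name, Category(name, {}))
    learn(features, category)
    return name
```

In this sketch a document that the trainer has labeled once raises the weights of its own features, so a later document with the same features scores above the threshold without further manual input.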