Showing papers by "Paul A. Viola" published in 2005


Proceedings Article
05 Dec 2005
TL;DR: MILBoost, a boosting variant that combines cost functions from the Multiple Instance Learning literature with the AnyBoost framework, is used to train a Viola-Jones detector cascade, showing the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier.
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier.

808 citations
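
The MILBoost abstract above mentions Multiple Instance Learning cost functions without naming one. A minimal sketch, assuming the noisy-OR cost (one common choice in the MIL literature) and a logistic link from instance score to instance probability: a bag (an image with several candidate windows) is positive if at least one instance is positive, and each instance's boosting weight is the derivative of the bag log-likelihood with respect to that instance's score. The function names are hypothetical and this is not presented as the authors' implementation.

```python
import math


def sigmoid(y):
    """Numerically stable logistic function."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)
    return z / (1.0 + z)


def noisy_or_bag_probability(instance_scores):
    """Probability that a bag is positive: 1 - prod_j (1 - p_ij)."""
    prob_all_negative = 1.0
    for y in instance_scores:
        prob_all_negative *= 1.0 - sigmoid(y)
    return 1.0 - prob_all_negative


def instance_weights(instance_scores, bag_label):
    """Derivative of the bag log-likelihood w.r.t. each instance score.

    For the noisy-OR cost (an assumption here) this works out to
    w_ij = (t_i - p_i) / p_i * p_ij, where t_i is the bag label (0 or 1),
    p_i the bag probability and p_ij the instance probability.
    """
    p_bag = max(noisy_or_bag_probability(instance_scores), 1e-12)  # avoid /0
    return [(bag_label - p_bag) / p_bag * sigmoid(y) for y in instance_scores]
```

In an AnyBoost-style round, weights of this kind would steer the choice of the next weak classifier: instances in positive bags receive positive weight, with more weight going to instances that already look like the object, while instances in negative bags receive negative weight.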


Journal ArticleDOI
TL;DR: A vision-based performance interface for controlling animated human characters that interactively combines information about the user's motion contained in silhouettes from three viewpoints with domain knowledge contained in a motion capture database to produce an animation of high quality.
Abstract: We present a vision-based performance interface for controlling animated human characters. The system interactively combines information about the user's motion contained in silhouettes from three viewpoints with domain knowledge contained in a motion capture database to produce an animation of high quality. Such an interactive system might be useful for authoring, for teleconferencing, or as a control interface for a character in a game. In our implementation, the user performs in front of three video cameras; the resulting silhouettes are used to estimate his orientation and body configuration based on a set of discriminative local features. Those features are selected by a machine-learning algorithm during a preprocessing step. Sequences of motions that approximate the user's actions are extracted from the motion database and scaled in time to match the speed of the user's motion. We use swing dancing, a complex human motion, to demonstrate the effectiveness of our approach. We compare our results to those obtained with a set of global features, Hu moments, and ground truth measurements from a motion capture system.

110 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: A statistical parsing approach based on a discriminatively trained context free grammar is shown to reduce the error rate by 50% relative to linear-chain CMMs; the system is also interactive, similar to the system described in [9].
Abstract: In recent work, conditional Markov chain models (CMM) have been used to extract information from semi-structured text (one example is the Conditional Random Field [10]). Applications range from finding the author and title in research papers to finding the phone number and street address in a web page. The CMM framework combines a priori knowledge encoded as features with a set of labeled training data to learn an efficient extraction process. We will show that similar problems can be solved more effectively by learning a discriminative context free grammar from training data. The grammar has several distinct advantages: long range, even global, constraints can be used to disambiguate entity labels; training data is used more efficiently; and a set of new more powerful features can be introduced. The grammar based approach also results in semantic information (encoded in the form of a parse tree) which could be used for IR applications like question answering. The specific problem we consider is of extracting personal contact, or address, information from unstructured sources such as documents and emails. While linear-chain CMMs perform reasonably well on this task, we show that a statistical parsing approach results in a 50% reduction in error rate. This system also has the advantage of being interactive, similar to the system described in [9]. In cases where there are multiple errors, a single user correction can be propagated to correct multiple errors automatically. Using a discriminatively trained grammar, 93.71% of all tokens are labeled correctly (compared to 88.43% for a CMM) and 72.87% of records have all tokens labeled correctly (compared to 45.29% for the CMM).

98 citations


Patent
29 Apr 2005
TL;DR: A discriminative grammar framework utilizing a machine learning algorithm is employed to learn scoring functions for parsing unstructured information; the framework includes a discriminative context free grammar that is trained based on features of an example input.
Abstract: A discriminative grammar framework utilizing a machine learning algorithm is employed to facilitate in learning scoring functions for parsing of unstructured information. The framework includes a discriminative context free grammar that is trained based on features of an example input. The flexibility of the framework allows information features and/or features output by arbitrary processes to be utilized as the example input as well. Myopic inside scoring is circumvented in the parsing process because contextual information is utilized to facilitate scoring function training.

79 citations


Patent
31 Mar 2005
TL;DR: A boosted classifier and a transductive classifier are combined to detect text: connected components are extracted from unlabeled data and converted into feature vectors, which the initial boosted classifier uses to classify the components; the inferred labels generate properties that, together with the initial feature vectors, are used by the transductive classifier to classify each connected component.
Abstract: The subject invention relates to facilitating text detection. The invention employs a boosted classifier and a transductive classifier to provide accurate and efficient text detection systems and/or methods. The boosted classifier is trained through features generated from a set of training connected components and labels. The boosted classifier utilizes the features to classify the training connected components, wherein inferred labels are conveyed to a transductive classifier, which generates additional properties. The initial set of features and the properties are utilized to train the transductive classifier. Upon training, the system and/or methods can be utilized to detect text in data under text detection, wherein unlabeled data is received, and connected components are extracted therefrom and utilized to generate corresponding feature vectors, which are employed to classify the connected components using the initial boosted classifier. Inferred labels are utilized to generate properties, which are utilized along with the initial feature vectors to classify each connected component using the transductive classifier.

74 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function and applies this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation.
Abstract: We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX

50 citations


Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper uses a machine learning approach based on a convolutional neural network to achieve maximum robustness in OCR, and when combined with a language model using dynamic programming, the overall performance is in the vicinity of 80-95% word accuracy on pages captured with a 1024×768 webcam and 10-point text.
Abstract: Cheap and versatile cameras make it possible to easily and quickly capture a wide variety of documents. However, low resolution cameras present a challenge to OCR because it is virtually impossible to do character segmentation independently from recognition. In this paper we solve these problems simultaneously by applying methods borrowed from cursive handwriting recognition. To achieve maximum robustness, we use a machine learning approach based on a convolutional neural network. When our system is combined with a language model using dynamic programming, the overall performance is in the vicinity of 80-95% word accuracy on pages captured with a 1024×768 webcam and 10-point text.

46 citations
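
The abstract above combines character recognition with a language model using dynamic programming. A simplified sketch of that idea follows, under strong assumptions that are not from the paper: segmentation is taken as already fixed (the paper explicitly solves segmentation and recognition jointly), the recognizer output is a hypothetical per-position table of character log scores, and the language model is a toy character bigram table. It shows only how dynamic programming lets the language model override a locally preferred character.

```python
def viterbi_decode(char_log_scores, bigram_log_prob, default=-5.0):
    """Best character string under recognizer scores plus a bigram model.

    char_log_scores: list (one entry per position) of {char: log score}.
    bigram_log_prob: {(prev_char, char): log probability}; pairs that are
    missing get the (assumed) penalty `default`.
    """
    if not char_log_scores:
        return ""
    # best[ch] = best total log score of any path ending in ch
    best = dict(char_log_scores[0])
    back_pointers = []
    for scores in char_log_scores[1:]:
        new_best, pointers = {}, {}
        for ch, score in scores.items():
            prev, total = max(
                ((p, ps + bigram_log_prob.get((p, ch), default))
                 for p, ps in best.items()),
                key=lambda pair: pair[1],
            )
            new_best[ch] = total + score
            pointers[ch] = prev
        best = new_best
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final character.
    path = [max(best, key=best.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Toy input: the recognizer alone prefers "e" in the first position.
scores = [{"c": -0.5, "e": -0.3}, {"a": -0.1, "o": -0.1}, {"t": -0.1, "l": -0.1}]
bigrams = {("c", "a"): -0.1, ("a", "t"): -0.1}
print(viterbi_decode(scores, bigrams))  # prints "cat"
```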


Proceedings ArticleDOI
31 Aug 2005
TL;DR: This paper introduces (and unifies) several types of geometrical data structures that can significantly accelerate parsing time, each embodying a different geometrical constraint on the set of valid parses; the same parsing framework can then be tested with various geometric constraints to determine the most effective combination.
Abstract: Grammars are a powerful technique for modeling and extracting the structure of documents. One large challenge, however, is computational complexity. The computational cost of grammatical parsing is related to both the complexity of the input and the ambiguity of the grammar. For programming languages, where the terminals appear in a linear sequence and the grammar is unambiguous, parsing is O(N). For natural languages, which are linear yet have an ambiguous grammar, parsing is O(N^3). For documents, where the terminals are arranged in two dimensions and the grammar is ambiguous, parsing time can be exponential in the number of terminals. In this paper we introduce (and unify) several types of geometrical data structures which can be used to significantly accelerate parsing time. Each data structure embodies a different geometrical constraint on the set of possible valid parses. These data structures are very general, in that they can be used by any type of grammatical model, and a wide variety of document understanding tasks, to limit the set of hypotheses examined and tested. Assuming a clean design for the parsing software, the same parsing framework can be tested with various geometric constraints to determine the most effective combination.

26 citations
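
The cubic cost quoted in the abstract above for linear input with an ambiguous grammar can be seen in the classic CYK chart recognizer, sketched here for a grammar in Chomsky normal form. This is a standard textbook algorithm, not the paper's two-dimensional method; the toy grammar and all names are illustrative assumptions. The three nested loops over span length, start position, and split point give the O(N^3) behavior, because every chart cell covers a contiguous span of a linear sequence. In two dimensions that contiguity no longer limits which terminals can be grouped, which is the problem the geometric constraints address.

```python
def cyk_recognize(tokens, lexical_rules, binary_rules, start="S"):
    """Return True if `tokens` can be derived from `start`.

    lexical_rules: {token: set of nonterminals A with a rule A -> token}
    binary_rules:  {(B, C): set of nonterminals A with a rule A -> B C}
    """
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][j] = set of nonterminals deriving tokens[i..j] (inclusive)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        chart[i][i] = set(lexical_rules.get(token, ()))
    for span in range(2, n + 1):             # span length
        for i in range(n - span + 1):        # span start
            j = i + span - 1
            for k in range(i, j):            # split point
                for left in chart[i][k]:
                    for right in chart[k + 1][j]:
                        chart[i][j] |= binary_rules.get((left, right), set())
    return start in chart[0][n - 1]


# Toy grammar (hypothetical): S -> NP VP, VP -> V NP
lexical = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
print(cyk_recognize("she eats fish".split(), lexical, binary))  # prints True
```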


Patent
Ming Ye1, Paul A. Viola1
18 Oct 2005
TL;DR: A system and method that use the Collins model to parse non-textual information into hierarchical content, assigning labels to lines that indicate how the lines relate to one another.
Abstract: A system and method for determining hierarchical information is described. Aspects include using the Collins model for parsing non-textual information into hierarchical content. The system and process assign labels to lines that indicate how the lines relate to one another.

24 citations


Proceedings ArticleDOI
31 Aug 2005
TL;DR: A promising new framework for improving boosting performance with transductive inference when training an automatic text detector is presented; the resulting detector is fast and efficient, and it exhibits high accuracy on a large test set.
Abstract: We present a promising new framework for improving boosting performance with transductive inference when training an automatic text detector. The resulting detector is fast and efficient, and it exhibits high accuracy on a large test set.

16 citations


Patent
13 Jun 2005
TL;DR: Image recognition is utilized to score parse trees for two-dimensional recognition tasks: trees and subtrees are rendered as images, which are then used to determine parsing scores.
Abstract: Image recognition is utilized to facilitate in scoring parse trees for two-dimensional recognition tasks. Trees and subtrees are rendered as images and then utilized to determine parsing scores. Other instances of the subject invention can incorporate additional features such as stroke curvature and/or nearby white space as rendered images as well. Geometric constraints can also be employed to increase performance of a parsing process, substantially improving parsing speed, some even resolvable in polynomial time. Additional performance enhancements can be achieved in yet other instances of the subject invention by employing constellations of integral images and/or integral images of document features.
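
The abstract above mentions integral images without describing them. A minimal sketch of the standard integral image (summed-area table) follows; how the patent arranges them into "constellations" is not specified here, so this shows only the basic data structure. Once the table is built, the sum over any axis-aligned rectangle costs four lookups regardless of rectangle size. Function names are illustrative.

```python
def integral_image(image):
    """Build a (h+1) x (w+1) table where ii[r][c] = sum of image[:r][:c]."""
    height = len(image)
    width = len(image[0]) if height else 0
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


pixels = [[1, 2, 3],
          [4, 5, 6]]
table = integral_image(pixels)
print(rect_sum(table, 0, 1, 2, 3))  # prints 16 (2 + 3 + 5 + 6)
```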

Patent
29 Apr 2005
TL;DR: Grammatical parsing is utilized to parse structured layouts that are modeled as grammars. This type of parsing provides an optimal parse tree for the structured layout based on a grammatical cost function associated with a global search. Machine learning techniques facilitate discriminatively selecting features and setting parameters in the grammatical parsing process.
Abstract: Grammatical parsing is utilized to parse structured layouts that are modeled as grammars. This type of parsing provides an optimal parse tree for the structured layout based on a grammatical cost function associated with a global search. Machine learning techniques facilitate in discriminatively selecting features and setting parameters in the grammatical parsing process. In one instance, labeled examples are parsed and a chart is generated. The chart is then converted into a subsequent set of labeled learning examples. Classifiers are then trained utilizing conventional machine learning and the subsequent example set. The classifiers are then employed to facilitate scoring of succedent sub-parses. A global reference grammar can also be established to facilitate in completing varying tasks without requiring additional grammar learning, substantially increasing the efficiency of the structured layout analysis techniques.

Patent
24 May 2005
TL;DR: A computer-implemented word processing system comprises an interface component that receives a features vector associated with an electronic document, and an analysis component, communicatively coupled to the interface component, that analyzes the features vector and determines a viewing mode in which to display the electronic document.
Abstract: A computer-implemented word processing system comprises an interface component that receives a features vector associated with an electronic document. An analysis component communicatively coupled to the interface component analyzes the features vector and determines a viewing mode in which to display the electronic document. In accordance with one aspect of the subject invention, the viewing mode can be one of a conventional viewing mode and a viewing mode associated with enhanced readability.

Patent
19 May 2005
TL;DR: A global optimization framework for optical character recognition (OCR) of low-resolution photographed documents that combines a binarization-type process, segmentation, and recognition into a single process.
Abstract: A global optimization framework for optical character recognition (OCR) of low-resolution photographed documents that combines a binarization-type process, segmentation, and recognition into a single process. The framework includes a machine learning approach trained on a large amount of data. A convolutional neural network can be employed to compute a classification function at multiple positions and take grey-level input which eliminates binarization. The framework utilizes preprocessing, layout analysis, character recognition, and word recognition to output high recognition rates. The framework also employs dynamic programming and language models to arrive at the desired output.

Patent
20 May 2005
TL;DR: A global optimization framework for optical character recognition (OCR) of low-resolution photographed documents that combines a binarization-type process, segmentation, and recognition into a single process.
Abstract: PROBLEM TO BE SOLVED: To provide a global optimization framework for optical character recognition (OCR) of low-resolution photographed documents that combines a binarization-type process, segmentation, and recognition into a single process. SOLUTION: The framework includes a machine learning approach trained on a large amount of data. A convolutional neural network can be employed, to compute a classification function at multiple positions and take gray-level input which eliminates binarization. The framework utilizes preprocessing, layout analysis, character recognition, and word recognition to output high recognition rates. The framework also employs dynamic programming and language models, to arrive at the desired output. COPYRIGHT: (C)2006,JPO&NCIPI