Author

Paul A. Viola

Bio: Paul A. Viola is an academic researcher from Microsoft. The author has contributed to research in topics: Parsing & Boosting (machine learning). The author has an h-index of 52 and has co-authored 115 publications receiving 59,853 citations. Previous affiliations of Paul A. Viola include IBM & Wilmington University.


Papers
Patent
Cha Zhang, Paul A. Viola
13 Jul 2007
TL;DR: A combination classifier and intermediate rejection thresholds are learned using a pruning process that ensures objects detected by the original combination classifier are also detected by the pruned classifier, thereby guaranteeing the same detection rate on the training set after pruning.
Abstract: A “Classifier Trainer” trains a combination classifier for detecting specific objects in signals (e.g., faces in images, words in speech, patterns in signals, etc.). In one embodiment “multiple instance pruning” (MIP) is introduced for training weak classifiers or “features” of the combination classifier. Specifically, a trained combination classifier and associated final threshold for setting false positive/negative operating points are combined with learned intermediate rejection thresholds to construct the combination classifier. Rejection thresholds are learned using a pruning process which ensures that objects detected by the original combination classifier are also detected by the combination classifier, thereby guaranteeing the same detection rate on the training set after pruning. The only parameter required throughout training is a target detection rate for the final cascade system. In additional embodiments, combination classifiers are trained using various combinations of weight trimming, bootstrapping, and a weak classifier termed a “fat stump” classifier.

13 citations
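The pruning idea lends itself to a compact illustration. Below is a minimal sketch (not the patent's implementation) of how intermediate rejection thresholds can be set so that every positive window the full combination classifier accepts also survives early rejection; the weak classifiers, weights, and score arrays are assumed inputs.

```python
# A minimal sketch of multiple-instance-style pruning: each intermediate
# rejection threshold is the minimum partial score reached, at that stage,
# by the positive windows the full classifier accepts, so pruning cannot
# reject anything the original classifier detected on the training set.
import numpy as np

def learn_rejection_thresholds(partial_scores, final_scores, final_threshold):
    """partial_scores: (n_pos, T) cumulative scores of positive examples
    after each weak classifier; final_threshold: the trained classifier's
    operating point."""
    retained = final_scores >= final_threshold   # windows the full classifier detects
    return partial_scores[retained].min(axis=0)  # per-stage minima -> thetas

def classify(x, weak_classifiers, weights, thetas, final_threshold):
    score = 0.0
    for h, w, theta in zip(weak_classifiers, weights, thetas):
        score += w * h(x)
        if score < theta:                        # early rejection saves computation
            return False
    return score >= final_threshold
```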

Patent
29 Apr 2005
TL;DR: Grammatical parsing is utilized to parse structured layouts that are modeled as grammars; this provides an optimal parse tree for the structured layout based on a grammatical cost function associated with a global search, while machine learning techniques facilitate discriminatively selecting features and setting parameters in the grammatical parsing process.
Abstract: Grammatical parsing is utilized to parse structured layouts that are modeled as grammars. This type of parsing provides an optimal parse tree for the structured layout based on a grammatical cost function associated with a global search. Machine learning techniques facilitate discriminative feature selection and parameter setting in the grammatical parsing process. In one instance, labeled examples are parsed and a chart is generated. The chart is then converted into a subsequent set of labeled learning examples. Classifiers are then trained utilizing conventional machine learning and the subsequent example set. The classifiers are then employed to facilitate scoring of subsequent sub-parses. A global reference grammar can also be established to facilitate completing varying tasks without requiring additional grammar learning, substantially increasing the efficiency of the structured layout analysis techniques.

13 citations
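For intuition, here is a hypothetical, heavily simplified sketch of globally optimal parsing under a cost function, using the standard one-dimensional CYK recursion. The patent covers two-dimensional layout grammars and learned classifier scores; both are abstracted here into an assumed `cost` function and `rules` table.

```python
# A simplified sketch of chart parsing with a cost function: the minimum-cost
# derivation is found by a global dynamic-programming search over all spans.
import math
from functools import lru_cache

def best_parse_cost(tokens, rules, cost):
    """rules: {lhs: [(rhs1, rhs2), ...]} binary grammar rules;
    cost(symbol, span) -> float, e.g. the negative log-score of a
    trained classifier for labeling that span with that symbol."""
    @lru_cache(maxsize=None)
    def chart(lhs, i, j):                 # min cost to derive tokens[i:j] from lhs
        if j - i == 1:
            return cost(lhs, tokens[i:j])
        best = math.inf
        for r1, r2 in rules.get(lhs, []):
            for k in range(i + 1, j):     # try every split point
                best = min(best,
                           chart(r1, i, k) + chart(r2, k, j) + cost(lhs, tokens[i:j]))
        return best
    return chart('ROOT', 0, len(tokens))
```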

Proceedings ArticleDOI
Ming Ye, Paul A. Viola
26 Oct 2004
TL;DR: A system is presented that automatically recognizes lists and hierarchical outlines in handwritten notes and computes the correct structure, providing the foundation for new user interfaces and facilitating the importation of handwritten notes into conventional editing tools.
Abstract: Handwritten notes are complex structures that include blocks of text, drawings, and annotations. The main challenge for the newly emerging tablet computer is to provide high-level tools for editing and authoring handwritten documents using a natural interface. One frequent component of natural notes is the list or hierarchical outline, which corresponds directly to the bulleted and itemized structures in conventional text editing tools. We present a system that automatically recognizes lists and hierarchical outlines in handwritten notes and then computes the correct structure. This inferred structure provides the foundation for new user interfaces and facilitates the importation of handwritten notes into conventional editing tools.

13 citations
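One ingredient of such a system can be sketched simply: grouping the left x-coordinates of handwritten lines into indentation levels. The function below is hypothetical and far cruder than the paper's learned models; the `tolerance` value and ink-space pixel units are assumptions.

```python
# A hypothetical sketch: infer outline nesting levels from line indentation
# by clustering left x-coordinates that fall within a tolerance of each other.
def outline_levels(line_x_starts, tolerance=20.0):
    anchors = []                          # distinct indentation positions seen so far
    for x in line_x_starts:
        if not any(abs(x - a) < tolerance for a in anchors):
            anchors.append(x)
    anchors.sort()                        # leftmost indentation = level 0
    return [min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
            for x in line_x_starts]

# Example: three indentation stops map to levels 0, 1, 1, 2, 0.
print(outline_levels([12.0, 55.0, 58.0, 101.0, 10.0]))
```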

Proceedings Article
01 Jan 1989
TL;DR: This Artificial-Eye (A-eye) combines the signals generated by two rate gyroscopes with motion information extracted from visual analysis to stabilize its camera and learns a system model that can be incrementally modified to adapt to changes in its structure, performance and environment.
Abstract: We have constructed a two axis camera positioning system which is roughly analogous to a single human eye. This Artificial-Eye (A-eye) combines the signals generated by two rate gyroscopes with motion information extracted from visual analysis to stabilize its camera. This stabilization process is similar to the vestibulo-ocular response (VOR); like the VOR, A-eye learns a system model that can be incrementally modified to adapt to changes in its structure, performance and environment. A-eye is an example of a robust sensory system that performs computations that can be of significant use to the designers of mobile robots.

12 citations
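The adaptive loop the abstract describes can be caricatured in a few lines: the gyroscope signal drives counter-rotation of the camera, and residual image motion ("retinal slip") adapts the gain online, analogous to VOR adaptation. The sketch below is hypothetical; the gain, learning rate, and signal conventions are assumptions, not the system's actual model.

```python
# A hypothetical sketch of gyro-driven stabilization with online gain
# adaptation: retinal slip acts as the error signal in an LMS-style update.
def stabilize(gyro_rates, retinal_slips, gain=1.0, lr=0.01):
    commands = []
    for rate, slip in zip(gyro_rates, retinal_slips):
        commands.append(-gain * rate)     # counter-rotate the camera
        gain += lr * slip * rate          # adapt gain to drive slip toward zero
    return commands, gain
```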

Book ChapterDOI
08 Sep 2004
TL;DR: A system for automatic FAX routing that processes incoming FAX images and forwards them to the correct email alias by combining the quality of recipient matches with the relevance of the words.
Abstract: We present a system for automatic FAX routing which processes incoming FAX images and forwards them to the correct email alias. The system first performs optical character recognition to find words and in some cases parts of words (we have observed error rates as high as 10 to 20 percent). For all these “noisy” words, a set of features is computed which include internal text features, location features, and relationship features. These features are combined to estimate the relevance of the word in the context of the page and the recipient database. The parameters of the word relevance function are learned from training data using the AdaBoost learning algorithm. Words are then compared to the database of recipients to find likely matches. The recipients are finally ranked by combining the quality of the matches and the relevance of the words. Experiments are presented which demonstrate the effectiveness of this system on a large set of real data.

11 citations
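The final ranking step combines two quantities, which the following hypothetical sketch makes concrete: a learned per-word relevance (the paper trains this with AdaBoost) and a fuzzy match score against the recipient database. Here `difflib` and the 0.7 threshold stand in for the paper's actual matching method.

```python
# A hypothetical sketch of the ranking step: recipients are scored by
# match quality weighted by learned word relevance, then sorted.
import difflib
from collections import defaultdict

def rank_recipients(words, relevance, recipients):
    """words: OCR tokens from the FAX page; relevance: dict word -> learned
    relevance in [0, 1]; recipients: list of names from the database."""
    scores = defaultdict(float)
    for w in words:
        for r in recipients:
            match = max(difflib.SequenceMatcher(None, w.lower(), part.lower()).ratio()
                        for part in r.split())
            if match > 0.7:                       # assumed match threshold
                scores[r] += match * relevance.get(w, 0.0)
    return sorted(scores, key=scores.get, reverse=True)
```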


Cited by
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations
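The descriptor is easy to reproduce with scikit-image's `hog`, shown below in the parameter regime the paper found effective (9 orientation bins, 8x8-pixel cells, 2x2-cell blocks, L2-Hys contrast normalization); the random input window is just a placeholder.

```python
# A minimal example of computing a HOG descriptor with scikit-image,
# using the Dalal-Triggs parameter settings for human detection.
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)          # 64x128 detection window, as in the paper
descriptor = hog(window,
                 orientations=9,          # fine orientation binning
                 pixels_per_cell=(8, 8),  # relatively coarse spatial binning
                 cells_per_block=(2, 2),  # overlapping normalization blocks
                 block_norm='L2-Hys')     # high-quality local contrast normalization
# `descriptor` is the feature vector a linear SVM would classify.
print(descriptor.shape)
```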

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

27,256 citations
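The abstract's "regression problem" can be made concrete with a hypothetical decoding sketch: each cell of an SxS grid predicts B boxes with confidences plus C class probabilities, and a box's class-specific score is confidence times class probability. The tensor layout below is an assumption, not the paper's released code.

```python
# A hypothetical sketch of decoding a YOLO-style grid prediction into
# scored detections; non-maximum suppression would follow in practice.
import numpy as np

def decode(pred, S=7, B=2, C=20, thresh=0.2):
    """pred: (S, S, B*5 + C) network output; returns (score, box, class)."""
    detections = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            class_probs = cell[B * 5:]               # shared across the cell's boxes
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                scores = conf * class_probs          # class-specific confidence
                c = int(np.argmax(scores))
                if scores[c] > thresh:
                    cx, cy = (j + x) / S, (i + y) / S  # cell-relative -> image-relative
                    detections.append((float(scores[c]), (cx, cy, w, h), c))
    return detections
```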

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly while achieving high detection rates, built on a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations
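The integral image is simple enough to show in full: after one cumulative-sum pass over the image, the sum of any rectangle costs four array lookups, which is what makes the detector's rectangle features so cheap to evaluate. A minimal sketch:

```python
# A minimal sketch of the integral image and constant-time rectangle sums.
import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] via four lookups in `ii`."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
assert rect_sum(integral_image(img), 1, 1, 3, 3) == img[1:3, 1:3].sum()
```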

Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection is reviewed, along with whether the methods are statistically different, what they are learning from the images, and what they find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to the present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations
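The core of the evaluation protocol is compact: a detection counts as correct when its intersection-over-union (IoU) with a ground-truth box exceeds 0.5, and methods are ranked by average precision. A minimal IoU sketch, with boxes assumed to be corner-coordinate tuples:

```python
# A minimal sketch of the IoU overlap criterion used to match detections
# to ground-truth boxes.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7   # overlaps 1, union 7
```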

Proceedings ArticleDOI
Ross Girshick
07 Dec 2015
TL;DR: Fast R-CNN, a Fast Region-based Convolutional Network method for object detection, employs several innovations to improve training and testing speed while also increasing detection accuracy, and achieves a higher mAP on PASCAL VOC 2012.
Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

14,824 citations
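The RoI pooling operation at the heart of Fast R-CNN is available in `torchvision.ops`; the snippet below shows its use with illustrative shapes (the feature map size, image scale, and boxes are all assumptions, not values from the paper).

```python
# A minimal example of RoI pooling: variable-sized region proposals are
# pooled from a shared conv feature map into fixed 7x7 grids, so one
# forward pass over the image serves every proposal.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)           # conv feature map for one image
# RoIs as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 0.0, 0.0, 400.0, 400.0],
                     [0, 100.0, 100.0, 300.0, 250.0]])
pooled = roi_pool(features, rois, output_size=(7, 7),
                  spatial_scale=50 / 800)        # assumed 800px image -> 50px features
print(pooled.shape)                              # torch.Size([2, 256, 7, 7])
```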