scispace - formally typeset
Search or ask a question
Author

Paul A. Viola

Bio: Paul A. Viola is an academic researcher from Microsoft. The author has contributed to research in topics: Parsing & Boosting (machine learning). The author has an hindex of 52, co-authored 115 publications receiving 59853 citations. Previous affiliations of Paul A. Viola include IBM & Wilmington University.


Papers
More filters
Patent
24 Mar 2004
TL;DR: In this paper, a computer-implemented method and apparatus are provided for populating an electronic form from an electronic image, identifying a size, orientation and position of an object within the electronic image and identifying information elements from pixels within the image that correspond to the object.
Abstract: A computer-implemented method and apparatus are provided for populating an electronic form from an electronic image. The method and apparatus identify a size, orientation and position of an object within the electronic image, and identify information elements from pixels within the image that correspond to the object. Fields of the electronic form are displayed to a user along with the identified information elements through a graphical user interface. The information elements are parsed into tagged groups of different information types. At least some of the fields of the electronic form are populated with the tagged groups to produce a populated form. The user is allowed to edit the populated fields through the graphical user interface.

204 citations

Proceedings Article
Cha Zhang1, Paul A. Viola1
03 Dec 2007
TL;DR: The multiple instance pruning (MIP) algorithm for soft cascades is proposed, which computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset.
Abstract: Cascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection Driven by this success, cascade learning has been an area of active research in recent years Nevertheless, there are still challenging technical problems during the training process of cascade detectors In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classifier can be safely pruned early; ii) face detection is a multiple instance learning problem The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets Experimental results on the MIT+CMU dataset demonstrate significant performance advantages

183 citations

Proceedings Article
25 Jul 2004
TL;DR: This work applies a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted.
Abstract: Information Extraction methods can be used to automatically "fill-in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections, and immediately propagates these constraints such that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds the entry of data--providing a 23% reduction in error due to automated corrections.

176 citations

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A novel method that simultaneously performs feature selection and feature extraction is proposed, which is computationally much more efficient than previous approaches, without sacrificing accuracy.
Abstract: In computer vision, the bag-of-visual words image representation has been shown to yield good results. Recent work has shown that modeling the spatial relationship between visual words further improves performance. Previous work extracts higher-order spatial features exhaustively. However, these spatial features are expensive to compute. We propose a novel method that simultaneously performs feature selection and feature extraction. Higher-order spatial features are progressively extracted based on selected lower order ones, thereby avoiding exhaustive computation. The method can be based on any additive feature selection algorithm such as boosting. Experimental results show that the method is computationally much more efficient than previous approaches, without sacrificing accuracy.

161 citations

01 Jan 2003
TL;DR: A new method for face recognition which learns a face similarity measure from example image pairs using a set of computationally efficient “rectangle” features which act on pairs of input images.
Abstract: This paper presents a new method for face recognition which learns a face similarity measure from example image pairs. A set of computationally efficient “rectangle” features are described which act on pairs of input images. The features compare regions within the input images at different locations, scales, and orientations. The AdaBoost algorithm is used to train the face similarity function by selecting features. Given a large face database, the set of face pairs is too large for effective training. We present a sampling procedure which selects a training subset based on the AdaBoost example weights. Finally, we show state of the art results on the FERET set of faces as well as a more challenging set of faces collected at our lab.

156 citations


Cited by
More filters
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

27,256 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations

Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations

Proceedings ArticleDOI
Ross Girshick1
07 Dec 2015
TL;DR: Fast R-CNN as discussed by the authors proposes a Fast Region-based Convolutional Network method for object detection, which employs several innovations to improve training and testing speed while also increasing detection accuracy and achieves a higher mAP on PASCAL VOC 2012.
Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

14,824 citations