
Showing papers by "Andrew Y. Ng published in 2007"


Proceedings ArticleDOI
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng
20 Jun 2007
TL;DR: An approach to self-taught learning that uses sparse coding to construct higher-level features from the unlabeled data; these features form a succinct input representation and significantly improve classification performance.
Abstract: We present a new machine learning framework called "self-taught learning" for using unlabeled data in supervised classification tasks. We do not assume that the unlabeled data follows the same class labels or generative distribution as the labeled data. Thus, we would like to use a large number of unlabeled images (or audio samples, or text documents) randomly downloaded from the Internet to improve performance on a given image (or audio, or text) classification task. Such unlabeled data is significantly easier to obtain than in typical semi-supervised or transfer learning settings, making self-taught learning widely applicable to many practical learning problems. We describe an approach to self-taught learning that uses sparse coding to construct higher-level features using the unlabeled data. These features form a succinct input representation and significantly improve classification performance. When using an SVM for classification, we further show how a Fisher kernel can be learned for this representation.
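
The recipe in the abstract (learn sparse-coding bases from unlabeled data, re-encode the labeled examples as sparse activations, then train an ordinary classifier) can be sketched in a few lines. The sketch below is illustrative only: scikit-learn's DictionaryLearning and LinearSVC stand in for the paper's own sparse-coding solver and classifier, and all data arrays are random placeholders.

```python
# Minimal sketch of the self-taught learning recipe described above:
# (1) learn a sparse-coding dictionary from unlabeled data,
# (2) re-encode the labeled data as sparse activations,
# (3) train a standard classifier on the new features.
# scikit-learn stands in for the paper's own solver; the data is random.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(500, 64)                        # e.g. 8x8 patches, flattened
X_train, y_train = rng.randn(100, 64), rng.randint(0, 2, 100)
X_test = rng.randn(20, 64)

# (1) learn basis functions from *unlabeled* data only
coder = DictionaryLearning(n_components=128, alpha=1.0,
                           transform_algorithm="lasso_lars",
                           transform_alpha=1.0, max_iter=50,
                           random_state=0)
coder.fit(X_unlabeled)

# (2) sparse activations become the new feature representation
Z_train = coder.transform(X_train)
Z_test = coder.transform(X_test)

# (3) ordinary supervised learning on top of the learned features
clf = LinearSVC(C=1.0).fit(Z_train, y_train)
print(clf.predict(Z_test))
```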

1,731 citations


Proceedings Article
03 Dec 2007
TL;DR: An unsupervised learning model is presented that faithfully mimics certain properties of visual area V2, and the encoding of these more complex "corner" features matches well with the results of Ito & Komatsu's study of biological V2 responses, suggesting that this sparse variant of deep belief networks holds promise for modeling higher-order features.
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or "deep," structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear ("contour") features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex "corner" features matches well with the results of Ito & Komatsu's study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling higher-order features.
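
As a rough illustration of the "sparse variant" idea, the sketch below shows one contrastive-divergence update for a restricted Boltzmann machine with a penalty that pushes the mean hidden activation toward a small target. It is a generic stand-in, not the paper's exact training procedure or regularizer; all hyperparameters and shapes are made up.

```python
# Toy sketch of one training step for a sparsity-regularized RBM, the building
# block behind the "sparse deep belief network" idea above.  Generic CD-1 plus
# a penalty pushing the mean hidden activation toward a small target; it is
# illustrative only, not the paper's exact regularizer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_rbm_step(v0, W, b_vis, b_hid, lr=0.01, rho=0.02, sparsity_cost=0.1):
    """One CD-1 step on a batch v0 of visible vectors (shape: batch x n_vis)."""
    # positive phase
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (np.random.rand(*h0_prob.shape) < h0_prob).astype(float)
    # negative phase (one Gibbs step)
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # contrastive-divergence gradient
    dW = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    db_vis = (v0 - v1_prob).mean(axis=0)
    db_hid = (h0_prob - h1_prob).mean(axis=0)
    # sparsity penalty: push the average hidden activation toward target rho
    db_hid += sparsity_cost * (rho - h0_prob.mean(axis=0))
    return W + lr * dW, b_vis + lr * db_vis, b_hid + lr * db_hid

# usage on random data, just to show the shapes involved
n_vis, n_hid = 64, 100
W = 0.01 * np.random.randn(n_vis, n_hid)
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
batch = (np.random.rand(32, n_vis) > 0.5).astype(float)
W, b_vis, b_hid = sparse_rbm_step(batch, W, b_vis, b_hid)
```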

1,048 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work considers the problem of estimating detailed 3D structure from a single still image of an unstructured environment and uses a Markov random field (MRF) to infer a set of "plane parameters" that capture both the 3D location and 3D orientation of the patch.
Abstract: We consider the problem of estimating detailed 3D structure from a single still image of an unstructured environment. Our goal is to create 3D models which are both quantitatively accurate as well as visually pleasing. For each small homogeneous patch in the image, we use a Markov random field (MRF) to infer a set of "plane parameters" that capture both the 3D location and 3D orientation of the patch. The MRF, trained via supervised learning, models both image depth cues as well as the relationships between different parts of the image. Inference in our model is tractable, and requires only solving a convex optimization problem. Other than assuming that the environment is made up of a number of small planes, our model makes no explicit assumptions about the structure of the scene; this enables the algorithm to capture much more detailed 3D structure than does prior art (such as Saxena et al., 2005; Delage et al., 2005; and Hoiem et al., 2005), and also give a much richer experience in the 3D flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. Using this approach, we have created qualitatively correct 3D models for 64.9% of 588 images downloaded from the Internet, as compared to Hoiem et al.'s performance of 33.1%. Further, our models are quantitatively more accurate than either Saxena et al. or Hoiem et al.
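
The abstract notes that inference reduces to a convex optimization over per-patch plane parameters. The toy cvxpy sketch below is a deliberately simplified illustration of that idea, with an L1 data term per patch and an L1 smoothness term between neighboring patches; it is not the paper's MRF energy, and every variable, weight, and cue in it is hypothetical.

```python
# Deliberately simplified illustration of "convex inference over plane
# parameters": each patch i gets a 3-vector alpha_i, a data term ties
# ray_i . alpha_i to a (hypothetical) cue predicted from image features,
# and an L1 term encourages neighboring patches to agree.  Not the paper's
# exact MRF energy; names and weights are made up.
import numpy as np
import cvxpy as cp

n_patches = 4
rays = np.random.randn(n_patches, 3)          # viewing ray per patch
cue = np.abs(np.random.randn(n_patches)) + 1  # predicted inverse-depth-like cue
neighbors = [(0, 1), (1, 2), (2, 3)]          # patch adjacency
smoothness = 0.5

alpha = cp.Variable((n_patches, 3))           # plane parameters per patch
data_term = cp.sum(cp.abs(cp.sum(cp.multiply(rays, alpha), axis=1) - cue))
smooth_term = sum(cp.norm1(alpha[i] - alpha[j]) for i, j in neighbors)

prob = cp.Problem(cp.Minimize(data_term + smoothness * smooth_term))
prob.solve()
print(alpha.value)
```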

352 citations


Proceedings Article
06 Jan 2007
TL;DR: This paper shows that by adding monocular cues to stereo (triangulation) cues, significantly more accurate depth estimates are obtained than is possible using either monocular or stereo cues alone.
Abstract: Depth estimation in computer vision and robotics is most commonly done via stereo vision (stereopsis), in which images from two cameras are used to triangulate and estimate distances. However, there are also numerous monocular visual cues--such as texture variations and gradients, defocus, color/haze, etc. --that have heretofore been little exploited in such systems. Some of these cues apply even in regions without texture, where stereo would work poorly. In this paper, we apply a Markov Random Field (MRF) learning algorithm to capture some of these monocular cues, and incorporate them into a stereo system. We show that by adding monocular cues to stereo (triangulation) ones, we obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone. This holds true for a large variety of environments, including both indoor environments and unstructured outdoor environments containing trees/forests, buildings, etc. Our approach is general, and applies to incorporating monocular cues together with any off-the-shelf stereo system.

233 citations


Proceedings Article
03 Dec 2007
TL;DR: This paper proposes a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task, and achieves results superior to any previously published work.
Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work.

156 citations


Proceedings Article
19 Jul 2007
TL;DR: Shift-invariant sparse coding (SISC) is an extension of sparse coding which reconstructs a (usually time-series) input using all of the basis functions in all possible shifts; this paper presents an efficient algorithm for learning SISC bases.
Abstract: Sparse coding is an unsupervised learning algorithm that learns a succinct high-level representation of the inputs given only unlabeled data; it represents each input as a sparse linear combination of a set of basis functions. Originally applied to modeling the human visual cortex, sparse coding has also been shown to be useful for self-taught learning, in which the goal is to solve a supervised classification task given access to additional unlabeled data drawn from different classes than that in the supervised learning problem. Shift-invariant sparse coding (SISC) is an extension of sparse coding which reconstructs a (usually time-series) input using all of the basis functions in all possible shifts. In this paper, we present an efficient algorithm for learning SISC bases. Our method is based on iteratively solving two large convex optimization problems: The first, which computes the linear coefficients, is an L1-regularized linear least squares problem with potentially hundreds of thousands of variables. Existing methods typically use a heuristic to select a small subset of the variables to optimize, but we present a way to efficiently compute the exact solution. The second, which solves for bases, is a constrained linear least squares problem. By optimizing over complex-valued variables in the Fourier domain, we reduce the coupling between the different variables, allowing the problem to be solved efficiently. We show that SISC's learned high-level representations of speech and music provide useful features for classification tasks within those domains. When applied to classification, under certain conditions the learned features outperform state of the art spectral and cepstral features.
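
The first subproblem described above is an L1-regularized least-squares (lasso) fit of the coefficients given fixed, shifted bases. The paper contributes an exact solver tailored to the SISC structure; the generic iterative soft-thresholding (ISTA) loop below is only meant to show what that subproblem looks like, on a toy dense problem rather than shifted bases.

```python
# Generic solver for the lasso-shaped subproblem:
#   minimize_x  0.5 * ||A x - y||^2 + lam * ||x||_1
# ISTA is a stand-in here; the paper uses an exact method for its setting.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, y, lam, n_iters=500):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

# toy problem: y is a sparse combination of the columns of A plus noise;
# the solver should typically recover the support {3, 17, 42}
rng = np.random.RandomState(0)
A = rng.randn(200, 50)
x_true = np.zeros(50); x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.01 * rng.randn(200)
print(np.nonzero(np.abs(lasso_ista(A, y, lam=5.0)) > 1e-3)[0])
```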

153 citations


Proceedings Article
03 Dec 2007
TL;DR: This paper derives an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs).
Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter.
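
The "implicit differentiation trick" can be written compactly. The identities below follow the standard derivation consistent with the abstract, using a separate Gaussian precision per weight purely for illustration; the paper's grouping of hyperparameters and exact notation may differ.

```latex
% Standard implicit-differentiation identities; notation chosen for
% illustration, not taken from the paper.
\[
w^\star(\lambda) \;=\; \arg\min_w \; \ell_{\mathrm{train}}(w)
      \;+\; \tfrac{1}{2}\textstyle\sum_k \lambda_k\, w_k^2,
\qquad
J(\lambda) \;=\; \ell_{\mathrm{val}}\!\bigl(w^\star(\lambda)\bigr).
\]
% Differentiating the optimality condition
% \nabla_w \ell_{\mathrm{train}}(w^\star) + \mathrm{diag}(\lambda)\, w^\star = 0
% with respect to \lambda_k gives
\[
\frac{\partial w^\star}{\partial \lambda_k}
  \;=\; -\,H^{-1} e_k\, w^\star_k ,
\qquad
H \;=\; \nabla^2_w \ell_{\mathrm{train}}(w^\star) + \mathrm{diag}(\lambda),
\]
\[
\frac{\partial J}{\partial \lambda_k}
  \;=\; \Bigl(\frac{\partial w^\star}{\partial \lambda_k}\Bigr)^{\!\top}
        \nabla_w \ell_{\mathrm{val}}(w^\star)
  \;=\; -\,w^\star_k\,\bigl(H^{-1}\nabla_w \ell_{\mathrm{val}}(w^\star)\bigr)_k ,
\]
% so every hyperparameter gradient is available from a single linear solve
% against the regularized training Hessian.
```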

119 citations


Book ChapterDOI
29 Mar 2007
TL;DR: In this paper, a Markov random field (MRF) model is used to identify the different planes and edges in the scene, as well as their orientations, and an iterative optimization algorithm is applied to infer the most probable position of all the planes, and thereby obtain a 3D reconstruction.
Abstract: 3d reconstruction from a single image is inherently an ambiguous problem. Yet when we look at a picture, we can often infer 3d information about the scene. Humans perform single-image 3d reconstructions by using a variety of single-image depth cues, for example, by recognizing objects and surfaces, and reasoning about how these surfaces are connected to each other. In this paper, we focus on the problem of automatic 3d reconstruction of indoor scenes, specifically ones (sometimes called “Manhattan worlds”) that consist mainly of orthogonal planes. We use a Markov random field (MRF) model to identify the different planes and edges in the scene, as well as their orientations. Then, an iterative optimization algorithm is applied to infer the most probable position of all the planes, and thereby obtain a 3d reconstruction. Our approach is fully automatic—given an input image, no human intervention is necessary to obtain an approximate 3d reconstruction.

110 citations


Proceedings Article
01 Jun 2007
TL;DR: A discriminative classifier is trained over a wide variety of features derived from WordNet structure, corpus-based evidence, and evidence from other lexical resources, and a learned similarity measure outperforms previously proposed automatic methods for sense clustering on the task of predicting human sense merging judgments.
Abstract: It has been widely observed that different NLP applications require different sense granularities in order to best exploit word sense distinctions, and that for many applications WordNet senses are too fine-grained. In contrast to previously proposed automatic methods for sense clustering, we formulate sense merging as a supervised learning problem, exploiting human-labeled sense clusterings as training data. We train a discriminative classifier over a wide variety of features derived from WordNet structure, corpus-based evidence, and evidence from other lexical resources. Our learned similarity measure outperforms previously proposed automatic methods for sense clustering on the task of predicting human sense merging judgments, yielding an absolute F-score improvement of 4.1% on nouns, 13.6% on verbs, and 4.0% on adjectives. Finally, we propose a model for clustering sense taxonomies using the outputs of our classifier, and we make available several automatically sense-clustered WordNets of various sense granularities.

105 citations


Proceedings Article
06 Jan 2007
TL;DR: This paper presents a novel method for identifying and tracking objects in multiresolution digital video of partially cluttered environments and uses a learned "attentive" interest map on a low resolution data stream to direct a high resolution "fovea".
Abstract: Human object recognition in a physical 3-d environment is still far superior to that of any robotic vision system. We believe that one reason (out of many) for this--one that has not heretofore been significantly exploited in the artificial vision literature--is that humans use a fovea to fixate on, or near an object, thus obtaining a very high resolution image of the object and rendering it easy to recognize. In this paper, we present a novel method for identifying and tracking objects in multiresolution digital video of partially cluttered environments. Our method is motivated by biological vision systems and uses a learned "attentive" interest map on a low resolution data stream to direct a high resolution "fovea." Objects that are recognized in the fovea can then be tracked using peripheral vision. Because object recognition is run only on a small foveal image, our system achieves performance in real-time object recognition and tracking that is well beyond simpler systems.
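
A minimal sketch of the foveation loop described above: compute a cheap interest score on the low-resolution stream, then crop the highest-scoring region from the full-resolution frame for recognition. The learned "attentive" interest map from the paper is replaced here by a trivial local-variance score, and all sizes are arbitrary.

```python
# Sketch of the fovea idea: score patches of a downsampled frame, then crop
# the most interesting region from the full-resolution frame.  The learned
# interest map is replaced by a placeholder local-variance score.
import numpy as np

def interest_map(low_res, patch=8):
    """Local variance per patch as a placeholder for the learned interest map."""
    h, w = low_res.shape[0] // patch, low_res.shape[1] // patch
    blocks = low_res[:h * patch, :w * patch].reshape(h, patch, w, patch)
    return blocks.var(axis=(1, 3))            # (h, w) grid of scores

def crop_fovea(full_res, low_res, fovea=128, patch=8):
    scores = interest_map(low_res, patch)
    iy, ix = np.unravel_index(np.argmax(scores), scores.shape)
    scale = full_res.shape[0] // low_res.shape[0]
    cy, cx = (iy * patch + patch // 2) * scale, (ix * patch + patch // 2) * scale
    y0 = np.clip(cy - fovea // 2, 0, full_res.shape[0] - fovea)
    x0 = np.clip(cx - fovea // 2, 0, full_res.shape[1] - fovea)
    return full_res[y0:y0 + fovea, x0:x0 + fovea]

frame = np.random.rand(960, 1280)             # full-resolution frame
low = frame[::8, ::8]                          # cheap low-resolution stream
print(crop_fovea(frame, low).shape)            # (128, 128) foveal image
```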

87 citations


Proceedings Article
06 Jan 2007
TL;DR: This paper proposes a unified approach to these two problems that dynamically models the objects to be manipulated and localizes the robot at the same time, and applies this approach to the task of navigating from one office to another (including manipulating doors).
Abstract: In recent years, probabilistic approaches have found many successful applications to mobile robot localization, and to object state estimation for manipulation. In this paper, we propose a unified approach to these two problems that dynamically models the objects to be manipulated and localizes the robot at the same time. Our approach applies in the common setting where only a low-resolution (10 cm) grid-map of a building is available, but we also have a high-resolution (0.1 cm) model of the object to be manipulated. Our method is based on defining a unifying probabilistic model over these two representations. The resulting algorithm works in real-time, and estimates the position of objects with sufficient precision for manipulation tasks. We apply our approach to the task of navigating from one office to another (including manipulating doors). Our approach, successfully tested on multiple doors, allows the robot to navigate through a hallway to an office door, grasp and turn the door handle, and continuously manipulate the door as it moves into the office.

Proceedings Article
01 Jan 2007
TL;DR: The hardware and software integration frameworks used to facilitate the development of these components and to bring them together for the demonstration of the STAIR 1 robot responding to a verbal command to fetch an item are described.
Abstract: The STanford Artificial Intelligence Robot (STAIR) project is a long-term group effort aimed at producing a viable home and office assistant robot. As a small concrete step towards this goal, we showed a demonstration video at the 2007 AAAI Mobile Robot Exhibition of the STAIR 1 robot responding to a verbal command to fetch an item. Carrying out this task involved the integration of multiple components, including spoken dialog, navigation, computer visual object detection, and robotic grasping. This paper describes the hardware and software integration frameworks used to facilitate the development of these components and to bring them together for the demonstration.

Patent
21 Nov 2007
TL;DR: In this paper, a set of monocular images and their corresponding ground-truth depth maps are used to determine a relationship between monocular image features and the depth of image points.
Abstract: Three-dimensional image data is generated. According to an example embodiment, three-dimensional depth information is estimated from a still image. A set of monocular images and their corresponding ground-truth depth maps are used to determine a relationship between monocular image features and the depth of image points. For different points in a particular image, the determined relationship is used together with local and global image features including monocular cues to determine relative depths of the points.

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper shows how monocular image cues can be combined with triangulation cues to build a photo-realistic model of a scene given only a few images, even ones taken from very different viewpoints or with little overlap.
Abstract: We consider the task of creating a 3-d model of a large novel environment, given only a small number of images of the scene. This is a difficult problem, because if the images are taken from very different viewpoints or if they contain similar-looking structures, then most geometric reconstruction methods will have great difficulty finding good correspondences. Further, the reconstructions given by most algorithms include only points in 3-d that were observed in two or more images; a point observed only in a single image would not be reconstructed. In this paper, we show how monocular image cues can be combined with triangulation cues to build a photo-realistic model of a scene given only a few images, even ones taken from very different viewpoints or with little overlap. Our approach begins by over-segmenting each image into small patches (superpixels). It then simultaneously tries to infer the 3-d position and orientation of every superpixel in every image. This is done using a Markov random field (MRF) which simultaneously reasons about monocular cues and about the relations between multiple image patches, both within the same image and across different images (triangulation cues). MAP inference in our model is efficiently approximated using a series of linear programs, and our algorithm scales well to a large number of images.

Proceedings ArticleDOI
27 Jun 2007
TL;DR: A method is proposed that uses a (possibly inaccurate) simulator to identify a low-dimensional subspace of policies that spans the variations in model dynamics and can be learned on the real system using much less data than would be required to learn a policy in the original class.
Abstract: We consider the task of omnidirectional path following for a quadruped robot: moving a four-legged robot along any arbitrary path while turning in any arbitrary manner. Learning a controller capable of such motion requires learning the parameters of a very high-dimensional policy, a difficult task on a real robot. Although learning such a policy can be much easier in a model (or “simulator”) of the system, it can be extremely difficult to build a sufficiently accurate simulator. In this paper we propose a method that uses a (possibly inaccurate) simulator to identify a low-dimensional subspace of policies that spans the variations in model dynamics. This subspace will be robust to variations in the model, and can be learned on the real system using much less data than would be required to learn a policy in the original class. In our approach, we sample several models from a distribution over the kinematic and dynamics parameters of the simulator, then formulate an optimization problem that can be solved using the Reduced Rank Regression (RRR) algorithm to construct a low-dimensional class of policies that spans the major axes of variation in the space of controllers. We present a successful application of this technique to the task of omnidirectional path following, and demonstrate improvement over a number of alternative methods, including a hand-tuned controller. We present, to the best of our knowledge, the first controller capable of omnidirectional path following with parameters optimized simultaneously for all directions of motion and turning rates.
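
As a hedged sketch of the low-dimensional policy subspace idea, the code below samples simulator models, obtains a policy parameter vector for each (via a hypothetical per-model tuner, optimize_policy_for_model), and takes the principal directions of variation as the reduced policy class. A plain SVD stands in here for the paper's Reduced Rank Regression formulation.

```python
# Sketch: sample simulator models, get a policy vector per model, and keep
# only the leading directions of variation as the reduced policy class.
# optimize_policy_for_model is a made-up placeholder, and SVD stands in for
# the paper's Reduced Rank Regression formulation.
import numpy as np

def optimize_policy_for_model(model_params, dim=40):
    """Placeholder: pretend policy parameters vary smoothly with the model."""
    rng = np.random.RandomState(0)
    basis = rng.randn(len(model_params), dim)
    return model_params @ basis + 0.01 * rng.randn(dim)

rng = np.random.RandomState(1)
models = rng.uniform(0.8, 1.2, size=(30, 5))        # sampled kinematic/dynamic params
policies = np.stack([optimize_policy_for_model(m) for m in models])

mean = policies.mean(axis=0)
U, S, Vt = np.linalg.svd(policies - mean, full_matrices=False)
k = 3
subspace = Vt[:k]                                    # k directions spanning the variation

# a new policy is now searched over only k coefficients instead of 40 raw parameters
def policy_from_coeffs(c):
    return mean + c @ subspace

print(policy_from_coeffs(np.zeros(k)).shape)         # (40,)
```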

Proceedings Article
06 Jan 2007
TL;DR: A method is presented that combines factor graphs and static program analysis to automatically infer specifications directly from programs; the inferred specifications are highly accurate and have led to the discovery of numerous bugs.
Abstract: Automatic tools for finding software errors require knowledge of the rules a program must obey, or "specifications," before they can identify bugs. We present a method that combines factor graphs and static program analysis to automatically infer specifications directly from programs. We illustrate the approach on inferring functions in C programs that allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and the OS kernel for Mac OS X (XNU). The inferred specifications are highly accurate and with them we have discovered numerous bugs.

Journal ArticleDOI
TL;DR: In this paper, a self-surveying camera array (SSCA) is used to track a target helicopter in each camera frame and to localize the helicopter in an array-fixed frame.
Abstract: A Self-surveying Camera Array (SSCA) is a vision-based local-area positioning system consisting of multiple ground-deployed cameras that are capable of self-surveying their extrinsic parameters while tracking and localizing a moving target. This paper presents the self-surveying algorithm being used to track a target helicopter in each camera frame and to localize the helicopter in an array-fixed frame. Three cameras are deployed independently in an arbitrary arrangement that allows each camera to view the helicopter's flight volume. The helicopter then flies an unplanned path that allows the cameras to calibrate their relative locations and orientations by utilizing a self-surveying algorithm that is extended from the well-known structure from motion algorithm and the bundle adjustment technique. This yields the cameras' extrinsic parameters, enabling real-time helicopter positioning via triangulation. This paper also presents results from field trials, which verify the feasibility of the SSCA as a readily-deployable system applicable to helicopter tracking and localization. The results demonstrate that, compared to the differential GPS solution as true reference, the SSCA alone is capable of positioning the helicopter with meter-level accuracy. The SSCA has been integrated with onboard inertial sensors providing a reliable positioning system to enable successful autonomous hovering.
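
Once the cameras' extrinsic parameters are recovered, per-frame localization reduces to triangulating the tracked target from its pixel coordinates in each calibrated camera. The sketch below uses standard linear (DLT) triangulation, assumed here rather than taken from the paper; the projection matrices and pixel measurements are toy values.

```python
# Standard linear (DLT) triangulation of one 3-D point from calibrated views;
# a generic sketch, not the paper's specific implementation.  P_i are 3x4
# projection matrices and pixels are (u, v) measurements from tracking.
import numpy as np

def triangulate(projections, pixels):
    """Least-squares 3-D point from >= 2 views (P: 3x4, pixel: (u, v))."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                                # null-space direction
    return X[:3] / X[3]                       # dehomogenize

# toy check: three cameras looking at a known point
X_true = np.array([1.0, 2.0, 8.0, 1.0])
Ps = [np.hstack([np.eye(3), np.array([[t], [0.0], [0.0]])]) for t in (0.0, -1.0, -2.0)]
pix = [((P @ X_true)[0] / (P @ X_true)[2], (P @ X_true)[1] / (P @ X_true)[2]) for P in Ps]
print(triangulate(Ps, pix))                   # ~ [1.0, 2.0, 8.0]
```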

28 Sep 2007
TL;DR: A highly portable multi-antenna datalogging system that can log several minutes of multi-channel raw GPS L1 baseband data and interleaves two serial data streams with the baseband data, allowing, e.g., inertial data to remain synchronized with the data stream.
Abstract: We present the design and implementation of a highly portable multi-antenna datalogging system which can log several minutes of multi-channel raw GPS L1 baseband data. In addition, our design interleaves two serial data streams with the baseband data, allowing, e.g., inertial data to remain synchronized with the data stream. The system is FPGA-based and uses two CompactFlash cards for storage. The data is extracted by imaging the CompactFlash cards onto a PC and recovering synchronization codes interleaved with the baseband data. The resulting system is useful for multi- (and single-) antenna receiver development, as it allows for simple data collection. The data files can then be used as inputs to software receivers for algorithm development, testing, and quantitative comparisons. Because the digital logging section is cleanly separated from the RF front-end section, it is possible to substitute any manner of RF front-end (e.g. Galileo) in place of our GPS L1 section, so long as the total data rate stays within the 250 megabit/sec capacity of the logging section.