
Showing papers by "Andrew Y. Ng" published in 2014


Posted Content
TL;DR: Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Abstract: We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

1,761 citations
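
A minimal sketch of the end-to-end recipe the abstract describes: spectrogram frames go into a recurrent network that emits per-frame character probabilities, trained with a sequence loss (CTC). This assumes PyTorch; the layer types, sizes, and names are illustrative stand-ins, not Deep Speech's actual architecture or multi-GPU training system.

```python
# Sketch only: an RNN over spectrogram frames trained with CTC, in the
# spirit of the paper. Sizes, layer choices, and names are assumptions.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_features=161, n_hidden=256, n_chars=29):
        super().__init__()
        # n_chars = alphabet plus the CTC blank (index 0 by convention here)
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, x):                    # x: (batch, time, n_features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)   # per-frame character log-probs

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)

# Toy batch: 2 utterances of 100 spectrogram frames, 20-character targets.
x = torch.randn(2, 100, 161)
targets = torch.randint(1, 29, (2, 20))       # 0 is reserved for blank
log_probs = model(x).transpose(0, 1)          # CTCLoss expects (time, batch, chars)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100),
           target_lengths=torch.full((2,), 20))
loss.backward()                               # end-to-end gradient to all layers
```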


Journal ArticleDOI
TL;DR: The DT-RNN model, which uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences, outperforms other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa.
Abstract: Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.

916 citations
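
A much-simplified sketch of the idea: compose a sentence vector bottom-up over a dependency tree, then retrieve images by inner product. The single shared child transform and fan-in normalization below are illustrative assumptions; the paper's composition is conditioned on the dependency structure more finely, and maps image features into the same space with a learned layer.

```python
# Sketch only: dependency-tree composition with one shared child matrix
# (an assumption, not the paper's parameterization), then image ranking
# by inner-product similarity.
import numpy as np

rng = np.random.default_rng(0)
d = 50                                        # embedding dimension
W_word = rng.normal(scale=0.1, size=(d, d))   # transform for the head word
W_child = rng.normal(scale=0.1, size=(d, d))  # shared transform for dependents

def compose(node, embeddings):
    """node = (word_index, [child nodes]); returns the node's vector."""
    word, children = node
    total = W_word @ embeddings[word]
    for child in children:
        total += W_child @ compose(child, embeddings)
    return np.tanh(total / (1 + len(children)))  # normalize by fan-in

def rank_images(sentence_vec, image_vecs):
    scores = image_vecs @ sentence_vec           # inner-product similarity
    return np.argsort(-scores)                   # best-matching images first

# Toy example: "dog chases ball", with "chases" as the root.
embeddings = rng.normal(size=(3, d))             # rows: dog=0, chases=1, ball=2
tree = (1, [(0, []), (2, [])])
sentence_vec = compose(tree, embeddings)
image_vecs = rng.normal(size=(10, d))            # stand-ins for mapped image features
print(rank_images(sentence_vec, image_vecs)[:3])
```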


Posted Content
TL;DR: This paper demonstrates that a straightforward recurrent neural network architecture can achieve a high level of accuracy and proposes and evaluates a modified prefix-search decoding algorithm that enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
Abstract: We present a method to perform first-pass large vocabulary continuous speech recognition using only a neural network and language model. Deep neural network acoustic models are now commonplace in HMM-based speech recognition systems, but building such systems is a complex, domain-specific task. Recent work demonstrated the feasibility of discarding the HMM sequence modeling framework by directly predicting transcript text from audio. This paper extends this approach in two ways. First, we demonstrate that a straightforward recurrent neural network architecture can achieve a high level of accuracy. Second, we propose and evaluate a modified prefix-search decoding algorithm. This approach to decoding enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems. Experiments on the Wall Street Journal corpus demonstrate fairly competitive word error rates, and the importance of bi-directional network recurrence.

172 citations
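
A compact sketch of CTC prefix beam search with a pluggable character-level language model, in the spirit of the first-pass decoder the abstract describes. The `lm` hook, alphabet layout, and pruning rule are illustrative assumptions; the paper's decoder differs in its LM integration details.

```python
# Sketch only: simplified CTC prefix beam search. Each prefix keeps two
# probabilities: ending in blank (p_b) and ending in a non-blank (p_nb).
import numpy as np
from collections import defaultdict

def prefix_beam_search(probs, alphabet, blank=0, beam_width=8,
                       lm=lambda prefix, c: 1.0, alpha=0.8):
    """probs: (T, V) per-frame label probabilities; column `blank` is the
    CTC blank; alphabet[v] is the character for a non-blank index v."""
    p_b, p_nb = defaultdict(float), defaultdict(float)
    p_b[""] = 1.0
    beams = [""]
    for t in range(probs.shape[0]):
        b_, nb_ = defaultdict(float), defaultdict(float)
        for prefix in beams:
            for v in range(probs.shape[1]):
                p = probs[t, v]
                if v == blank:
                    b_[prefix] += p * (p_b[prefix] + p_nb[prefix])
                    continue
                c = alphabet[v]
                ext = prefix + c
                lm_w = lm(prefix, c) ** alpha     # LM-weighted extension
                if prefix and c == prefix[-1]:
                    nb_[prefix] += p * p_nb[prefix]      # repeat collapses
                    nb_[ext] += p * p_b[prefix] * lm_w   # blank separated it
                else:
                    nb_[ext] += p * (p_b[prefix] + p_nb[prefix]) * lm_w
        candidates = set(b_) | set(nb_)
        beams = sorted(candidates, key=lambda s: b_[s] + nb_[s],
                       reverse=True)[:beam_width]
        p_b, p_nb = b_, nb_
    return beams[0]

# Toy run over 4 frames and alphabet {blank, 'a', 'b'}.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.6, 0.2, 0.2],
                  [0.1, 0.2, 0.7]])
print(prefix_beam_search(probs, alphabet={1: "a", 2: "b"}))
```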


Posted Content
TL;DR: An empirical investigation into which aspects of DNN acoustic model design are most important for speech recognition system performance, suggesting that a relatively simple DNN architecture and optimization technique produces strong results.
Abstract: Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.

82 citations
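
A minimal sketch of the DNN hybrid acoustic model this study varies: a plain feedforward network mapping a context window of acoustic frames to senone (tied HMM state) posteriors, trained with cross-entropy against forced-alignment targets. This assumes PyTorch, and the sizes are illustrative; the paper sweeps depth, width, architecture, and loss, up to far larger models.

```python
# Sketch only: a feedforward hybrid acoustic model. All sizes are
# assumptions, not the configurations swept in the paper.
import torch
import torch.nn as nn

def make_dnn(context_frames=15, n_mel=40, hidden=2048,
             layers=5, n_senones=9000):
    dims = [context_frames * n_mel] + [hidden] * layers
    net = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        net += [nn.Linear(d_in, d_out), nn.ReLU()]
    net.append(nn.Linear(hidden, n_senones))   # senone logits
    return nn.Sequential(*net)

model = make_dnn()
x = torch.randn(32, 15 * 40)               # batch of stacked frame windows
labels = torch.randint(0, 9000, (32,))     # forced-alignment senone targets
loss = nn.functional.cross_entropy(model(x), labels)
loss.backward()
```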


Journal ArticleDOI
TL;DR: This paper presents a hand designed for minimalistic dexterous manipulation, in which every stage of the design process also considered manufacturing cost.
Abstract: Historically, robotic hand research has tended to focus on two areas: severely underactuated hands, and high-degree-of-freedom fully actuated hands. Comparatively little research has been done in between those spaces. Furthermore, despite the large number of robotic hand designs that have been proposed in the past few decades, very few robot hands are available for purchase on the commercial market. In this paper, we present a hand designed for minimalistic dexterous manipulation, in which every stage of the design process also considered its manufacturing cost. We discuss the various trade-offs made in the design. Finally, we present the results of experiments in which the robotic hand was affixed to a manipulator arm and teleoperated to grasp and manipulate a variety of objects.

50 citations


Journal ArticleDOI
01 May 2014-Ubiquity
TL;DR: This article outlines the process and biometrics Coursera uses to establish and verify student identity during a course, and presents data suggesting that verified certificate programs help increase student success rates in courses.
Abstract: Massive open online courses (MOOCs) enable the delivery of high-quality educational experiences to large groups of students. Coursera, one of the largest MOOC providers, developed a program to provide students with verified credentials as a record of their MOOC performance. Such credentials help students convey achievements in MOOCs to future employers and academic programs. This article outlines the process and biometrics Coursera uses to establish and verify student identity during a course. We additionally present data that suggest verified certificate programs help increase student success rates in courses.

37 citations


Patent
11 Aug 2014
TL;DR: In this patent, a user is prompted to provide authentication information of at least one of a plurality of types, and the received authentication information is compared to at least a portion of stored enrollment information associated with that user.
Abstract: Performing identity verification for online education is disclosed. In response to receiving a notification of a submission event, a user is prompted to provide authentication information including at least one of a plurality of types of information. Authentication information received is compared to at least a portion of stored enrollment information associated with the user with which the received authentication information is associated. The stored enrollment information includes at least two different types of information collected during an enrollment phase, including the at least one type of information solicited during the user prompting. In the event that matching criteria are met based at least in part on the comparison, a first action is performed. In the event that matching criteria are not met based at least in part on the comparison, a second action that is different from the first action is performed.

34 citations
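
A hypothetical sketch of the claimed flow: on a submission event, solicit one enrolled type of authentication information, compare it against stored enrollment data, and branch on whether matching criteria are met. All names, feature types, and the similarity threshold are illustrative, not the patent's embodiment.

```python
# Sketch only: branch on matching criteria against enrollment data.
# Feature types, the metric, and the threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Enrollment:
    # At least two types of information collected during the enrollment phase.
    typing_profile: list[float]   # e.g., keystroke timing features
    photo_features: list[float]

def similarity(a: list[float], b: list[float]) -> float:
    # Stand-in metric; a real system would use a biometric matcher.
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

def on_submission(received: list[float], kind: str,
                  enrolled: Enrollment, threshold: float = 0.9) -> str:
    stored = getattr(enrolled, kind)       # enrolled sample of the same type
    if similarity(received, stored) >= threshold:
        return "attach verified identity to submission"   # first action
    return "flag submission for manual review"            # second action

enrolled = Enrollment(typing_profile=[0.12, 0.30, 0.22],
                      photo_features=[0.5, 0.1, 0.9])
print(on_submission([0.13, 0.29, 0.23], "typing_profile", enrolled))
```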


Posted Content
30 Jun 2014
TL;DR: The results show that with sufficient training data, increasing DNN model size is an effective, direct path to performance improvements, and that even smaller DNNs benefit from a larger training corpus.

22 citations


Patent
01 Aug 2014

22 citations


01 Jan 2014
TL;DR: It is shown that when represented in the information form, map posteriors are dominated by a small number of links that tie together nearby features in the map, an insight that is developed into a sparse variant of the EIF, called the sparse extended information filter (SEIF).
Abstract: In this paper we describe a scalable algorithm for the simultaneous localization and mapping (SLAM) problem. SLAM is the problem of acquiring a map of a static environment with a mobile robot. The vast majority of SLAM algorithms are based on the extended Kalman filter (EKF). In this paper we advocate an algorithm that relies on the dual of the EKF, the extended information filter (EIF). We show that when represented in the information form, map posteriors are dominated by a small number of links that tie together nearby features in the map. This insight is developed into a sparse variant of the EIF, called the sparse extended information filter (SEIF). SEIFs represent maps by graphical networks of features that are locally interconnected, where links represent relative information between pairs of nearby features, as well as information about the robot's pose relative to the map. We show that all essential update equations in SEIFs can be executed in constant time, irrespective of the size of the map. We also provide empirical results obtained for a benchmark data set collected in an outdoor environment, and using a multi-robot mapping simulation.

2 citations
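
An illustrative sketch of why information-form updates are local, the property the abstract exploits: with a measurement Jacobian H that is nonzero only in the robot-pose and observed-feature columns, the update Omega += H^T Q^-1 H touches only those blocks of the information matrix. Dimensions and values below are toy assumptions.

```python
# Sketch only: a measurement update in information form, showing that
# only the robot-pose and observed-feature blocks change.
import numpy as np

n_features, pose_dim, feat_dim, meas_dim = 4, 3, 2, 2
dim = pose_dim + n_features * feat_dim

omega = np.eye(dim)            # information matrix
xi = np.zeros(dim)             # information vector

# Observe feature j: H is nonzero only in the pose and feature-j columns.
j = 2
H = np.zeros((meas_dim, dim))
H[:, :pose_dim] = np.random.randn(meas_dim, pose_dim)
col = pose_dim + j * feat_dim
H[:, col:col + feat_dim] = np.random.randn(meas_dim, feat_dim)

Q_inv = np.eye(meas_dim) / 0.01          # measurement information
innovation = np.random.randn(meas_dim)   # stand-in for z - h(mu) + H mu

omega += H.T @ Q_inv @ H                 # only pose/feature-j blocks change
xi += H.T @ Q_inv @ innovation

# Confirm locality: which rows/cols of omega were actually modified.
changed = np.argwhere(H.T @ Q_inv @ H != 0)
print(sorted(set(changed.ravel())))      # indices confined to pose and feature j
```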


ReportDOI
01 Nov 2014
TL;DR: This research project addressed the problem of learning useful deep representations from unlabeled data by developing new unsupervised deep learning algorithms capable of learning important semantic structure in the input data in a domain-general way.
Abstract: This research project addressed the problem of learning useful deep representations from unlabeled data. The major goal was to innovate new unsupervised deep learning algorithms capable of learning important semantic structure in the input data in a domain-general way. At the conclusion of this project, these goals stand fulfilled. The lab produced a variety of new and influential learning algorithms including Independent Subspace Analysis (ISA); Reconstruction Independent Components Analysis (RICA); recursive neural networks; and recursive tensor networks, among others. These algorithms have posted state-of-the-art results across a number of domains and tasks, and have had impact on both academia and industry.
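
A sketch of the RICA objective named above (reconstruction ICA): an L1 sparsity penalty on the features Wx plus a soft reconstruction cost ||W^T W x - x||^2 that replaces ICA's hard orthonormality constraint, enabling overcomplete features. The smoothing epsilon, weighting, and optimizer below are illustrative choices.

```python
# Sketch only: the RICA objective on toy whitened data, minimized with
# a generic optimizer (numerical gradients; fine at this scale).
import numpy as np
from scipy.optimize import minimize

def rica_objective(w_flat, X, k, lam=0.1, eps=1e-8):
    n, d = X.shape
    W = w_flat.reshape(k, d)
    Z = X @ W.T                                  # (n, k) features Wx
    recon = Z @ W                                # (n, d) reconstruction W^T W x
    rec_cost = np.sum((recon - X) ** 2) / n      # soft reconstruction penalty
    sparsity = np.sum(np.sqrt(Z ** 2 + eps)) / n # smoothed L1 sparsity
    return rec_cost + lam * sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                   # toy stand-in for whitened data
k = 8
w0 = rng.normal(scale=0.1, size=k * 16)
res = minimize(rica_objective, w0, args=(X, k), method="L-BFGS-B")
W = res.x.reshape(k, 16)                         # learned filter matrix
```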