
Showing papers by "Dong Yu" published in 2011


Proceedings Article
Frank Seide, Gang Li, Dong Yu
01 Aug 2011
TL;DR: Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training to greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs.

822 citations


Proceedings ArticleDOI
Frank Seide, Gang Li, Xie Chen, Dong Yu
01 Dec 2011
TL;DR: This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Abstract: We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
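For reference, the quoted improvement works out to (27.4 − 18.5) / 27.4 ≈ 32.5% relative, consistent with the claimed reduction of one third.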

702 citations


Journal ArticleDOI
Dong Yu, Li Deng
TL;DR: The purpose of this article is to introduce the readers to the emerging technologies enabled by deep learning and to review the research work conducted in this area that is of direct relevance to signal processing.
Abstract: The purpose of this article is to introduce the readers to the emerging technologies enabled by deep learning and to review the research work conducted in this area that is of direct relevance to signal processing. We also point out, in our view, the future research directions that may attract the interest of, and require effort from, more signal processing researchers and practitioners in this emerging area for advancing signal and information processing technology and applications.

387 citations


Proceedings Article
Dong Yu, Michael L. Seltzer
01 Aug 2011
TL;DR: This paper shows how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates, and shows that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states.
Abstract: Bottleneck features have been shown to be effective in improving the accuracy of automatic speech recognition (ASR) systems. Conventionally, bottleneck features are extracted from a multi-layer perceptron (MLP) trained to predict context-independent monophone states. The MLP typically has three hidden layers and is trained using the backpropagation algorithm. In this paper, we propose two improvements to the training of bottleneck features motivated by recent advances in the use of deep neural networks (DNNs) for speech recognition. First, we show how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates. Second, we show that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states. Bottleneck features trained using the proposed methods produced a 16% relative reduction in sentence error rate over conventional bottleneck features on a large vocabulary business search task.
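As a rough sketch of the bottleneck idea described above (not the authors' implementation; all layer sizes and names below are illustrative assumptions), a DNN with one deliberately narrow hidden layer is trained to predict senone targets, and that narrow layer's activations are then used as features:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical layer sizes: stacked acoustic frames in, wide hidden
    # layers, a narrow 39-unit bottleneck, then senone targets out.
    sizes = [429, 2048, 2048, 39, 2048, 9304]
    rng = np.random.default_rng(0)
    weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    def bottleneck_features(x, bottleneck_layer=3):
        """Forward-propagate to the narrow layer and return its activations."""
        h = x
        for i, (W, b) in enumerate(zip(weights, biases), start=1):
            h = sigmoid(h @ W + b)
            if i == bottleneck_layer:
                return h  # 39-dim bottleneck features for the GMM-HMM system
        return h

    feats = bottleneck_features(rng.normal(size=429))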

317 citations


Journal Article
TL;DR: A property common to these shallow learning models is the simple architecture that consists of only one layer responsible for transforming the raw input signals or features into a problem-specific feature space, which may be unobservable.
Abstract: Today, signal processing research has significantly widened its scope compared with just a few years ago [4], and machine learning has been an important technical area of the signal processing society. Since 2006, deep learning—a new area of machine learning research—has emerged [7], impacting a wide range of signal and information processing work within both the traditional and the newly widened scopes. Various workshops (the 2009 ICML Workshop on Learning Feature Hierarchies; the 2008 NIPS Deep Learning Workshop: Foundations and Future Directions; and the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications), as well as an upcoming special issue on deep learning for speech and language processing in IEEE Transactions on Audio, Speech, and Language Processing (2010), have been devoted exclusively to deep learning and its applications to classical signal processing areas. We have also seen the government sponsor research on deep learning.

260 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: This work proposes a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task.
Abstract: The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively, which translate to relative error reductions of 16.0% and 23.2%.
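A standard ingredient of such hybrid systems (a generic textbook step, not code from this paper) is converting the network's senone posteriors into scaled likelihoods for HMM decoding by dividing out the senone priors:

    import numpy as np

    def scaled_log_likelihoods(log_posteriors, log_priors):
        """Convert frame-level senone log-posteriors log p(s|x) from the
        network into scaled log-likelihoods log p(x|s) ~ log p(s|x) - log p(s),
        which the HMM decoder consumes in place of GMM likelihoods.
        log_posteriors: (T, S) array; log_priors: (S,) array, typically
        counted from the training alignment."""
        return log_posteriors - log_priors  # division becomes subtraction in log space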

213 citations


Patent
Dong Yu, Li Deng, Frank Seide, Gang Li
26 Nov 2011
TL;DR: In this article, a discriminative pretraining method is proposed to bring the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively.
Abstract: Discriminative pretraining technique embodiments are presented that pretrain the hidden layers of a Deep Neural Network (DNN). In general, a one-hidden-layer neural network is trained first using labels discriminatively with error back-propagation (BP). Then, after discarding an output layer in the previous one-hidden-layer neural network, another randomly initialized hidden layer is added on top of the previously trained hidden layer along with a new output layer that represents the targets for classification or recognition. The resulting multiple-hidden-layer DNN is then discriminatively trained using the same strategy, and so on until the desired number of hidden layers is reached. This produces a pretrained DNN. The discriminative pretraining technique embodiments have the advantage of bringing the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively.
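The growth loop in the abstract can be sketched as follows; train_with_bp is a hypothetical placeholder for discriminative error back-propagation, and all sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def train_with_bp(hidden_weights, output_weights, data, labels):
        # Placeholder: a real implementation would run error back-propagation
        # over all listed weight matrices for a small number of epochs.
        pass

    def discriminative_pretrain(layer_sizes, n_targets, data, labels):
        """Grow the DNN one hidden layer at a time: train with BP, discard the
        temporary output layer, add a new randomly initialized hidden layer
        plus a fresh output layer, and repeat until deep enough."""
        hidden = []
        for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
            hidden.append(rng.normal(0.0, 0.05, (in_dim, out_dim)))
            output = rng.normal(0.0, 0.05, (out_dim, n_targets))
            train_with_bp(hidden, output, data, labels)
            # `output` is discarded before the next hidden layer is added
        return hidden  # pretrained weights, ready for full fine-tuning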

112 citations


Patent
Li Deng, Dong Yu, George E. Dahl
06 Sep 2011
TL;DR: In this article, a Deep Belief Network (DBN) consisting of many layers of nonlinear units with connecting weights between layers is trained by a pretraining step followed by a fine-tuning step.
Abstract: A method is disclosed herein that includes an act of causing a processor to receive a sample, wherein the sample is one of a spoken utterance, an online handwriting sample, or a moving image sample. The method also comprises the act of causing the processor to decode the sample based at least in part upon an output of a combination of a deep structure and a context-dependent Hidden Markov Model (HMM), wherein the deep structure is configured to output a posterior probability of a context-dependent unit. The deep structure is a Deep Belief Network consisting of many layers of nonlinear units with connecting weights between layers, trained by a pretraining step followed by a fine-tuning step.

90 citations


Journal ArticleDOI
Dong Yu, Jinyu Li, Li Deng
TL;DR: Three confidence calibration methods have been developed, and the importance of three key features is demonstrated: the generic confidence score, the application-dependent word distribution, and the rule coverage ratio.
Abstract: Most speech recognition applications in use today rely heavily on confidence measure for making optimal decisions. In this paper, we aim to answer the question: what can be done to improve the quality of confidence measure if we cannot modify the speech recognition engine? The answer provided in this paper is a post-processing step called confidence calibration, which can be viewed as a special adaptation technique applied to confidence measure. Three confidence calibration methods have been developed in this work: the maximum entropy model with distribution constraints, the artificial neural network, and the deep belief network. We compare these approaches and demonstrate the importance of key features exploited: the generic confidence-score, the application-dependent word distribution, and the rule coverage ratio. We demonstrate the effectiveness of confidence calibration on a variety of tasks with significant normalized cross entropy increase and equal error rate reduction.
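As a deliberately simplified sketch of what calibration does (a Platt-style logistic fit, not one of the paper's three models), one can fit a monotone mapping from raw confidence to the probability of correctness on held-out data:

    import numpy as np

    def fit_calibration(raw_scores, correct, lr=0.1, epochs=500):
        """Fit p(correct | score) = sigmoid(a * score + b) by gradient descent
        on the log loss; `correct` holds 0/1 labels from held-out data."""
        a, b = 1.0, 0.0
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(a * raw_scores + b)))
            g = p - correct                      # gradient of log loss w.r.t. logit
            a -= lr * np.mean(g * raw_scores)
            b -= lr * np.mean(g)
        return lambda s: 1.0 / (1.0 + np.exp(-(a * s + b)))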

84 citations


Patent
07 Sep 2011
TL;DR: In this paper, the authors proposed a method to enable a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep structured model comprises a plurality of layers with weights assigned to each layer, transition probabilities between states, and language model scores.
Abstract: A method is disclosed herein that includes an act of causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto, transition probabilities between states, and language model scores. The method can further include the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.

64 citations


Patent
Dong Yu, Li Deng, Frank Seide, Gang Li
28 Nov 2011
TL;DR: In this paper, the sparseness of non-zero hidden-layer interconnection weights is exploited: a fully connected DNN is first trained by sweeping through the full training set a number of times, after which only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training.
Abstract: Deep Neural Network (DNN) training technique embodiments are presented that train a DNN while exploiting the sparseness of non-zero hidden layer interconnection weight values. Generally, a fully connected DNN is initially trained by sweeping through a full training set a number of times. Then, for the most part, only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training. This minimum weight threshold can be established as a value that results in only a prescribed maximum number of interconnections being considered when setting interconnection weight values via an error back-propagation procedure during the training. It is noted that the continued DNN training tends to converge much faster than the initial training.
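The thresholding step in the abstract might look like the following sketch (illustrative only; the real training interleaves this with continued back-propagation):

    import numpy as np

    def sparsify(weight_matrices, max_nonzero):
        """Zero out all but the `max_nonzero` largest-magnitude weights in
        each matrix and return the masks; reapplying a mask after every BP
        update keeps the pruned interconnections at zero."""
        masks = []
        for W in weight_matrices:
            flat = np.abs(W).ravel()
            # magnitude of the max_nonzero-th largest entry is the threshold
            thresh = np.partition(flat, -max_nonzero)[-max_nonzero]
            mask = np.abs(W) >= thresh
            W *= mask            # prune small weights in place
            masks.append(mask)
        return masks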

Patent
Li Deng, Dong Yu, Alejandro Acero
31 Mar 2011
TL;DR: In this article, the authors proposed a method to jointly optimize the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.
Abstract: A method is disclosed herein that includes an act of causing a processor to access a deep-structured, layered or hierarchical model, called a deep convex network, retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto. This layered model can produce the output serving as the scores to combine with transition probabilities between states in a hidden Markov model and language model scores to form a full speech recognizer. The method makes joint use of nonlinear random projections and RBM weights, and it stacks a lower module's output with the raw data to establish its immediately higher module. Batch-based, convex optimization is performed to learn a portion of the deep convex network's weights, rendering it appropriate for parallel computation to accomplish the training. The method can further include the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.

Proceedings ArticleDOI
Dong Yu1, Li Deng1
27 Aug 2011
TL;DR: A set of novel, batch-mode algorithms developed recently is described as one key component in scalable, deep neural network based speech recognition; the essence is to structure the single-hidden-layer neural network so that the upper layer’s weights can be written as a deterministic function of the lower layer’s weights.
Abstract: We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the single-hidden-layer neural network so that the upper layer’s weights can be written as a deterministic function of the lower layer’s weights. This structure is effectively exploited during training by plugging the deterministic function into the least-square-error objective function while calculating the gradients. Accelerating techniques are further exploited to make the weight updates move along the most promising directions. The experiments on TIMIT frame-level phone and phone-state classification show strong results. In particular, the error rate drops strictly monotonically as the minibatch size increases. This demonstrates the potential of the proposed batch-mode algorithms for large-scale speech recognition, since they are easily parallelizable across computers.
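The structural trick can be made concrete: with the lower-layer weights W fixed, the least-square-error upper-layer weights have a closed form, so they are a deterministic function of W. The sketch below is a generic illustration with a ridge term for numerical stability, not the paper's exact formulation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def upper_weights(W, X, T, reg=1e-3):
        """For hidden activations H = sigmoid(X W) and targets T, the
        least-squares-optimal upper weights solve
        U = argmin_U ||H U - T||^2 + reg ||U||^2, i.e. a linear system."""
        H = sigmoid(X @ W)                                       # (N, h)
        U = np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ T)
        return U, H

    # Gradients with respect to W are then computed with U(W) substituted
    # into the objective, so W is the only free parameter during training.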

Journal ArticleDOI
TL;DR: A recent government study concluded that drivers performing complex secondary tasks, such as operating or viewing a mobile device or personal digital assistant (PDA), were between 1.7 and 5.5 times more likely to be involved in a crash or near crash.
Abstract: Over the last decade, our ability to access, store, and consume huge amounts of media and information on mobile devices has skyrocketed. While this has allowed people who are on the go to be more entertained, informed, and connected, the small form factor of mobile devices makes managing all of this content a difficult task. This difficulty is significantly amplified when we consider how many people are using these devices while driving in automobiles and the high risk of driver distraction such devices present. A recent government study concluded that drivers performing complex secondary tasks such as operating or viewing a mobile device or personal digital assistant (PDA) were between 1.7 and 5.5 times more likely to be involved in a crash or near crash.

Li Deng, Dong Yu
01 Jun 2011
TL;DR: Experimental results on handwriting image recognition task (MNIST) and on phone state classification (TIMIT) demonstrate superior performance of DCN over DBN not only in training efficiency but also in classification accuracy.
Abstract: To overcome the scalability challenge associated with the Deep Belief Network (DBN), we have designed a novel deep learning architecture, the deep convex network (DCN). The learning problem in DCN is convex within each layer. Additional structure-exploiting fine-tuning further improves the quality of DCN. The full learning in DCN is batch-mode based rather than stochastic, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on a handwriting image recognition task (MNIST) and on phone-state classification (TIMIT) demonstrate the superior performance of DCN over DBN not only in training efficiency but also in classification accuracy. DCN gives an error rate of 0.83%, the lowest obtained without additional training data produced by elastic distortion. The corresponding error rate of the best DBN we have carefully tuned is 1.06%. On the TIMIT task, DCN also outperforms DBN, though by a smaller relative margin so far.
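The stacking rule from the abstract (each module sees the raw input concatenated with the output of the module below) can be sketched as follows; the per-module weights are assumed to have been learned already by the convex per-layer solve:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dcn_forward(X, modules):
        """modules: list of (W, U) pairs, one per module, where W holds the
        lower (nonlinear) layer weights and U the linear output weights."""
        inp, out = X, None
        for W, U in modules:
            H = sigmoid(inp @ W)           # nonlinear hidden layer
            out = H @ U                    # linear output (convex given H)
            inp = np.hstack([X, out])      # stack raw data with this output
        return out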

Proceedings Article
Jinyu Li, Dong Yu, Li Deng, Yifan Gong
01 Jun 2011
TL;DR: This paper presents the authors’ recent study on using a structured model of physical distortion for robust automatic speech recognition, and shows that online updating of all the noise and channel distortion parameters is critical to the success of the proposed JAC algorithms.
Abstract: It is well known that distorted speech can be considered as generated from clean speech passed through a convolutive channel with additive noise, i.e., y[t] = x[t] ⊗ h[t] + n[t], where x is the clean speech, h the channel impulse response, n the additive noise, and ⊗ denotes convolution. In this paper, we present our recent study on using this structured model of physical distortion for robust automatic speech recognition. Three methods are introduced for joint compensation of additive and convolutive distortions (JAC), with different online computation costs: JAC model adaptation, GMM-based JAC model adaptation, and JAC feature enhancement. All these algorithms consist of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment distortion model in the cepstral domain together with the vector-Taylor-series (VTS) linearization technique. Second, the estimated noise and channel parameters are used to adapt the hidden Markov model (HMM) parameters or to clean the distorted speech features. In the experimental evaluation on the standard Aurora 2 task, the proposed JAC algorithms all achieve around 89% accuracy using the clean-trained complex HMM backend, comparing favorably with previously developed techniques. Meanwhile, the JAC feature enhancement method has a much smaller computation cost than the other two methods and can be used as a high-accuracy, low-cost noise-robust front end. Detailed analysis of the experimental results shows that online updating of all the noise and channel distortion parameters is critical to the success of our algorithms.
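For concreteness, the nonlinear cepstral-domain distortion model that VTS linearizes is, in its standard textbook form, y = x + h + C·log(1 + exp(C⁺(n − x − h))), where C⁺ is the pseudoinverse of the truncated DCT matrix C. The sketch below builds it with illustrative dimensions (generic, not the paper's code):

    import numpy as np

    def distorted_cepstrum(x, h, n, n_filters=23):
        """Standard cepstral-domain environment model behind VTS-based JAC:
        x = clean-speech, h = channel, n = noise cepstra (all length n_ceps)."""
        n_ceps = x.shape[0]
        m = np.arange(n_ceps)[:, None]
        k = np.arange(n_filters)[None, :]
        C = np.sqrt(2.0 / n_filters) * np.cos(np.pi * m * (k + 0.5) / n_filters)
        C[0] /= np.sqrt(2.0)               # orthonormal DCT-II scaling
        C_pinv = np.linalg.pinv(C)         # back to the log-filterbank domain
        return x + h + C @ np.log1p(np.exp(C_pinv @ (n - x - h)))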