
Showing papers by "Dong Yu" published in 2011


Proceedings Article
Frank Seide, Gang Li, Dong Yu
01 Aug 2011
TL;DR: Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training to greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs.

822 citations


Proceedings ArticleDOI
Frank Seide, Gang Li, Xie Chen, Dong Yu
01 Dec 2011
TL;DR: This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Abstract: We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
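For reference, the quoted improvement works out to (27.4 − 18.5) / 27.4 ≈ 32.5% relative, consistent with the claimed reduction of one third.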

702 citations


Journal ArticleDOI
Dong Yu, Li Deng
TL;DR: The purpose of this article is to introduce the readers to the emerging technologies enabled by deep learning and to review the research work conducted in this area that is of direct relevance to signal processing.
Abstract: The purpose of this article is to introduce the readers to the emerging technologies enabled by deep learning and to review the research work conducted in this area that is of direct relevance to signal processing. We also point out, in our view, the future research directions that may attract the interest of, and require effort from, more signal processing researchers and practitioners in this emerging area for advancing signal and information processing technology and applications.

387 citations


Proceedings Article
Dong Yu, Michael L. Seltzer
01 Aug 2011
TL;DR: This paper shows how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates, and shows that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states.
Abstract: Bottleneck features have been shown to be effective in improving the accuracy of automatic speech recognition (ASR) systems. Conventionally, bottleneck features are extracted from a multi-layer perceptron (MLP) trained to predict context-independent monophone states. The MLP typically has three hidden layers and is trained using the backpropagation algorithm. In this paper, we propose two improvements to the training of bottleneck features motivated by recent advances in the use of deep neural networks (DNNs) for speech recognition. First, we show how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates. Second, we show that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states. Bottleneck features trained using the proposed methods produced a 16% relative reduction in sentence error rate over conventional bottleneck features on a large vocabulary business search task.
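As a rough sketch of the bottleneck idea described above (not the authors' implementation; all layer sizes and names below are illustrative assumptions), a DNN with one deliberately narrow hidden layer is trained to predict senone targets, and that narrow layer's activations are then used as features:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical layer sizes: stacked acoustic frames in, wide hidden
    # layers, a narrow 39-unit bottleneck, then senone targets out.
    sizes = [429, 2048, 2048, 39, 2048, 9304]
    rng = np.random.default_rng(0)
    weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    def bottleneck_features(x, bottleneck_layer=3):
        """Forward-propagate to the narrow layer and return its activations."""
        h = x
        for i, (W, b) in enumerate(zip(weights, biases), start=1):
            h = sigmoid(h @ W + b)
            if i == bottleneck_layer:
                return h  # 39-dim bottleneck features for the GMM-HMM system
        return h

    feats = bottleneck_features(rng.normal(size=429))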

317 citations


Journal Article
TL;DR: A property common to these shallow learning models is the simple architecture that consists of only one layer responsible for transforming the raw input signals or features into a problem-specific feature space, which may be unobservable.
Abstract: Today, signal processing research has significantly widened its scope compared with just a few years ago [4], and machine learning has been an important technical area of the signal processing society. Since 2006, deep learning—a new area of machine learning research—has emerged [7], impacting a wide range of signal and information processing work within both the traditional and the newly widened scopes. Various workshops (the 2009 ICML Workshop on Learning Feature Hierarchies; the 2008 NIPS Deep Learning Workshop: Foundations and Future Directions; and the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications), as well as an upcoming special issue on deep learning for speech and language processing in IEEE Transactions on Audio, Speech, and Language Processing (2010), have been devoted exclusively to deep learning and its applications to classical signal processing areas. We have also seen the government sponsor research on deep learning.

260 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: This work proposes a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task.
Abstract: The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively, which translate to relative error reductions of 16.0% and 23.2%.
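A standard ingredient of such hybrid systems (a generic textbook step, not code from this paper) is converting the network's senone posteriors into scaled likelihoods for HMM decoding by dividing out the senone priors:

    import numpy as np

    def scaled_log_likelihoods(log_posteriors, log_priors):
        """Convert frame-level senone log-posteriors log p(s|x) from the
        network into scaled log-likelihoods log p(x|s) ~ log p(s|x) - log p(s),
        which the HMM decoder consumes in place of GMM likelihoods.
        log_posteriors: (T, S) array; log_priors: (S,) array, typically
        counted from the training alignment."""
        return log_posteriors - log_priors  # division becomes subtraction in log space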

213 citations


Patent
Dong Yu, Li Deng, Frank Seide, Gang Li
26 Nov 2011
TL;DR: In this article, a discriminative pretraining method is proposed to bring the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively.
Abstract: Discriminative pretraining technique embodiments are presented that pretrain the hidden layers of a Deep Neural Network (DNN). In general, a one-hidden-layer neural network is trained first using labels discriminatively with error back-propagation (BP). Then, after discarding an output layer in the previous one-hidden-layer neural network, another randomly initialized hidden layer is added on top of the previously trained hidden layer along with a new output layer that represents the targets for classification or recognition. The resulting multiple-hidden-layer DNN is then discriminatively trained using the same strategy, and so on until the desired number of hidden layers is reached. This produces a pretrained DNN. The discriminative pretraining technique embodiments have the advantage of bringing the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively.
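The growth loop in the abstract can be sketched as follows; train_with_bp is a hypothetical placeholder for discriminative error back-propagation, and all sizes are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def train_with_bp(hidden_weights, output_weights, data, labels):
        # Placeholder: a real implementation would run error back-propagation
        # over all listed weight matrices for a small number of epochs.
        pass

    def discriminative_pretrain(layer_sizes, n_targets, data, labels):
        """Grow the DNN one hidden layer at a time: train with BP, discard the
        temporary output layer, add a new randomly initialized hidden layer
        plus a fresh output layer, and repeat until deep enough."""
        hidden = []
        for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
            hidden.append(rng.normal(0.0, 0.05, (in_dim, out_dim)))
            output = rng.normal(0.0, 0.05, (out_dim, n_targets))
            train_with_bp(hidden, output, data, labels)
            # `output` is discarded before the next hidden layer is added
        return hidden  # pretrained weights, ready for full fine-tuning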

112 citations


Patent
Li Deng, Dong Yu, George E. Dahl
06 Sep 2011
TL;DR: In this article, a Deep Belief Network (DBN) consisting of many layers of nonlinear units with connecting weights between layers is trained by a pretraining step followed by a fine-tuning step.
Abstract: A method is disclosed herein that includes an act of causing a processor to receive a sample, wherein the sample is one of a spoken utterance, an online handwriting sample, or a moving image sample. The method also comprises the act of causing the processor to decode the sample based at least in part upon an output of a combination of a deep structure and a context-dependent Hidden Markov Model (HMM), wherein the deep structure is configured to output a posterior probability of a context-dependent unit. The deep structure is a Deep Belief Network consisting of many layers of nonlinear units with connecting weights between layers, trained by a pretraining step followed by a fine-tuning step.

90 citations


Journal ArticleDOI
Dong Yu, Jinyu Li, Li Deng
TL;DR: Three confidence calibration methods have been developed, and the importance of three key features is demonstrated: the generic confidence score, the application-dependent word distribution, and the rule coverage ratio.
Abstract: Most speech recognition applications in use today rely heavily on confidence measure for making optimal decisions. In this paper, we aim to answer the question: what can be done to improve the quality of confidence measure if we cannot modify the speech recognition engine? The answer provided in this paper is a post-processing step called confidence calibration, which can be viewed as a special adaptation technique applied to confidence measure. Three confidence calibration methods have been developed in this work: the maximum entropy model with distribution constraints, the artificial neural network, and the deep belief network. We compare these approaches and demonstrate the importance of key features exploited: the generic confidence-score, the application-dependent word distribution, and the rule coverage ratio. We demonstrate the effectiveness of confidence calibration on a variety of tasks with significant normalized cross entropy increase and equal error rate reduction.
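As a deliberately simplified sketch of what calibration does (a Platt-style logistic fit, not one of the paper's three models), one can fit a monotone mapping from raw confidence to the probability of correctness on held-out data:

    import numpy as np

    def fit_calibration(raw_scores, correct, lr=0.1, epochs=500):
        """Fit p(correct | score) = sigmoid(a * score + b) by gradient descent
        on the log loss; `correct` holds 0/1 labels from held-out data."""
        a, b = 1.0, 0.0
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(a * raw_scores + b)))
            g = p - correct                      # gradient of log loss w.r.t. logit
            a -= lr * np.mean(g * raw_scores)
            b -= lr * np.mean(g)
        return lambda s: 1.0 / (1.0 + np.exp(-(a * s + b)))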

84 citations


Patent
07 Sep 2011
TL;DR: In this paper, the authors proposed a method to enable a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep structured model comprises a plurality of layers with weights assigned to each layer, transition probabilities between states, and language model scores.
Abstract: A method is disclosed herein that includes an act of causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto, transition probabilities between states, and language model scores. The method can further include the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.

64 citations


Patent
Dong Yu, Li Deng, Frank Seide, Gang Li
28 Nov 2011
TL;DR: In this paper, the sparseness of non-zero hidden-layer interconnection weights is exploited: a fully connected DNN is first trained by sweeping through the full training set a number of times, after which only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training.
Abstract: Deep Neural Network (DNN) training technique embodiments are presented that train a DNN while exploiting the sparseness of non-zero hidden layer interconnection weight values. Generally, a fully connected DNN is initially trained by sweeping through a full training set a number of times. Then, for the most part, only the interconnections whose weight magnitudes exceed a minimum weight threshold are considered in further training. This minimum weight threshold can be established as a value that results in only a prescribed maximum number of interconnections being considered when setting interconnection weight values via an error back-propagation procedure during the training. It is noted that the continued DNN training tends to converge much faster than the initial training.
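The thresholding step in the abstract might look like the following sketch (illustrative only; the real training interleaves this with continued back-propagation):

    import numpy as np

    def sparsify(weight_matrices, max_nonzero):
        """Zero out all but the `max_nonzero` largest-magnitude weights in
        each matrix and return the masks; reapplying a mask after every BP
        update keeps the pruned interconnections at zero."""
        masks = []
        for W in weight_matrices:
            flat = np.abs(W).ravel()
            # magnitude of the max_nonzero-th largest entry is the threshold
            thresh = np.partition(flat, -max_nonzero)[-max_nonzero]
            mask = np.abs(W) >= thresh
            W *= mask            # prune small weights in place
            masks.append(mask)
        return masks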

Patent
Li Deng, Dong Yu, Alejandro Acero
31 Mar 2011
TL;DR: In this article, the authors proposed a method to jointly optimize the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.
Abstract: A method is disclosed herein that includes an act of causing a processor to access a deep-structured, layered or hierarchical model, called a deep convex network, retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto. This layered model can produce the output serving as the scores to combine with transition probabilities between states in a hidden Markov model and language model scores to form a full speech recognizer. The method makes joint use of nonlinear random projections and RBM weights, and it stacks a lower module's output with the raw data to establish its immediately higher module. Batch-based, convex optimization is performed to learn a portion of the deep convex network's weights, rendering it appropriate for parallel computation to accomplish the training. The method can further include the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using the optimization criterion based on a sequence rather than a set of unrelated frames.

Proceedings ArticleDOI
Dong Yu1, Li Deng1
27 Aug 2011
TL;DR: A set of novel, batch-mode algorithms developed recently is described as one key component in scalable, deep neural network based speech recognition; the essence is to structure the single-hidden-layer neural network so that the upper layer’s weights can be written as a deterministic function of the lower layer’s weights.
Abstract: We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the single-hidden-layer neural network so that the upper layer’s weights can be written as a deterministic function of the lower layer’s weights. This structure is effectively exploited during training by plugging the deterministic function into the least-square-error objective function while calculating the gradients. Accelerating techniques are further exploited to make the weight updates move along the most promising directions. The experiments on TIMIT frame-level phone and phone-state classification show strong results. In particular, the error rate drops strictly monotonically as the minibatch size increases. This demonstrates the potential of the proposed batch-mode algorithms for large-scale speech recognition, since they are easily parallelizable across computers.
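The structural trick can be made concrete: with the lower-layer weights W fixed, the least-square-error upper-layer weights have a closed form, so they are a deterministic function of W. The sketch below is a generic illustration with a ridge term for numerical stability, not the paper's exact formulation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def upper_weights(W, X, T, reg=1e-3):
        """For hidden activations H = sigmoid(X W) and targets T, the
        least-squares-optimal upper weights solve
        U = argmin_U ||H U - T||^2 + reg ||U||^2, i.e. a linear system."""
        H = sigmoid(X @ W)                                       # (N, h)
        U = np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ T)
        return U, H

    # Gradients with respect to W are then computed with U(W) substituted
    # into the objective, so W is the only free parameter during training.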

Journal ArticleDOI
TL;DR: A recent government study concluded that drivers performing complex secondary tasks, such as operating or viewing a mobile device or personal digital assistant (PDA), were between 1.7 and 5.5 times more likely to be involved in a crash or near crash.
Abstract: Over the last decade, our ability to access, store, and consume huge amounts of media and information on mobile devices has skyrocketed. While this has allowed people who are on the go to be more entertained, informed, and connected, the small form factor of mobile devices makes managing all of this content a difficult task. This difficulty is significantly amplified when we consider how many people are using these devices while driving in automobiles and the high risk of driver distraction such devices present. A recent government study concluded that drivers performing complex secondary tasks such as operating or viewing a mobile device or personal digital assistant (PDA) were between 1.7 and 5.5 times more likely to be involved in a crash or near crash.

Li Deng, Dong Yu
01 Jun 2011
TL;DR: Experimental results on handwriting image recognition task (MNIST) and on phone state classification (TIMIT) demonstrate superior performance of DCN over DBN not only in training efficiency but also in classification accuracy.
Abstract: To overcome the scalability challenge associated with the Deep Belief Network (DBN), we have designed a novel deep learning architecture, the deep convex network (DCN). The learning problem in DCN is convex within each layer. Additional structure-exploiting fine-tuning further improves the quality of DCN. The full learning in DCN is batch-mode based rather than stochastic, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on a handwriting image recognition task (MNIST) and on phone-state classification (TIMIT) demonstrate the superior performance of DCN over DBN not only in training efficiency but also in classification accuracy. DCN gives an error rate of 0.83%, the lowest obtained without additional training data produced by elastic distortion. The corresponding error rate of the best DBN we have carefully tuned is 1.06%. On the TIMIT task, DCN also outperforms DBN, though by a smaller relative margin so far.
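The stacking rule from the abstract (each module sees the raw input concatenated with the output of the module below) can be sketched as follows; the per-module weights are assumed to have been learned already by the convex per-layer solve:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dcn_forward(X, modules):
        """modules: list of (W, U) pairs, one per module, where W holds the
        lower (nonlinear) layer weights and U the linear output weights."""
        inp, out = X, None
        for W, U in modules:
            H = sigmoid(inp @ W)           # nonlinear hidden layer
            out = H @ U                    # linear output (convex given H)
            inp = np.hstack([X, out])      # stack raw data with this output
        return out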

Proceedings Article
Jinyu Li, Dong Yu, Li Deng, Yifan Gong
01 Jun 2011
TL;DR: This paper presents the authors’ recent study on using a structured model of physical distortion for robust automatic speech recognition, and shows that online updating of all the noise and channel distortion parameters is critical to the success of the proposed JAC algorithms.
Abstract: It is well known that distorted speech can be considered as generated from clean speech passed through a convolutive channel with additive noise, i.e., y[t] = x[t] ⊗ h[t] + n[t], where x is the clean speech, h the channel impulse response, n the additive noise, and ⊗ denotes convolution. In this paper, we present our recent study on using this structured model of physical distortion for robust automatic speech recognition. Three methods are introduced for joint compensation of additive and convolutive distortions (JAC), with different online computation costs: JAC model adaptation, GMM-based JAC model adaptation, and JAC feature enhancement. All these algorithms consist of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment distortion model in the cepstral domain together with the vector-Taylor-series (VTS) linearization technique. Second, the estimated noise and channel parameters are used to adapt the hidden Markov model (HMM) parameters or to clean the distorted speech features. In the experimental evaluation on the standard Aurora 2 task, the proposed JAC algorithms all achieve around 89% accuracy using the clean-trained complex HMM backend, comparing favorably with previously developed techniques. Meanwhile, the JAC feature enhancement method has a much smaller computation cost than the other two methods and can be used as a high-accuracy, low-cost noise-robust front end. Detailed analysis of the experimental results shows that online updating of all the noise and channel distortion parameters is critical to the success of our algorithms.
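For concreteness, the nonlinear cepstral-domain distortion model that VTS linearizes is, in its standard textbook form, y = x + h + C·log(1 + exp(C⁺(n − x − h))), where C⁺ is the pseudoinverse of the truncated DCT matrix C. The sketch below builds it with illustrative dimensions (generic, not the paper's code):

    import numpy as np

    def distorted_cepstrum(x, h, n, n_filters=23):
        """Standard cepstral-domain environment model behind VTS-based JAC:
        x = clean-speech, h = channel, n = noise cepstra (all length n_ceps)."""
        n_ceps = x.shape[0]
        m = np.arange(n_ceps)[:, None]
        k = np.arange(n_filters)[None, :]
        C = np.sqrt(2.0 / n_filters) * np.cos(np.pi * m * (k + 0.5) / n_filters)
        C[0] /= np.sqrt(2.0)               # orthonormal DCT-II scaling
        C_pinv = np.linalg.pinv(C)         # back to the log-filterbank domain
        return x + h + C @ np.log1p(np.exp(C_pinv @ (n - x - h)))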