Showing papers by "Dong Yu published in 2012"

PDF

Open Access

Journal Article•DOI•

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

[...]

Geoffrey E. Hinton¹, Li Deng², Dong Yu², George E. Dahl¹, Abdelrahman Mohamed¹, Navdeep Jaitly¹, Andrew W. Senior³, Vincent Vanhoucke³, Patrick Nguyen³, Tara N. Sainath⁴, Brian Kingsbury⁴ - Show less +7 more•Institutions (4)

University of Toronto¹, Microsoft², Google³, IBM⁴

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

9,091 citations

Journal Article•DOI•

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

[...]

George E. Dahl¹, Dong Yu², Li Deng², Alex Acero²•Institutions (2)

University of Toronto¹, Microsoft²

01 Jan 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.

...read moreread less

Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

...read moreread less

3,120 citations

Journal Article•

Deep Neural Networks for Acoustic Modeling in Speech Recognition

[...]

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury - Show less +7 more

01 Nov 2012-IEEE Signal Processing Magazine

TL;DR: This paper provides an overview of this progress and repres nts the shared views of four research groups who have had recent successes in using deep neural networks for a coustic modeling in speech recognition.

...read moreread less

2,527 citations

Proceedings Article•

Conversational speech transcription using context-dependent deep neural networks

[...]

Dong Yu¹, Frank Seide¹, Gang Li¹•Institutions (1)

Microsoft¹

26 Jun 2012

TL;DR: Context-Dependent Deep-Neural-Network (CD-DNN-HMMs) as mentioned in this paper combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training.

...read moreread less

Abstract: Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training. CD-DNN-HMMs greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs: The word error rate is reduced by up to one third on the difficult benchmarking task of speaker-independent single-pass transcription of telephone conversations.

...read moreread less

792 citations

Proceedings Article•DOI•

Adaptation of context-dependent deep neural networks for automatic speech recognition

[...]

Kaisheng Yao¹, Dong Yu¹, Frank Seide¹, Hang Su¹, Li Deng¹, Yifan Gong¹ - Show less +2 more•Institutions (1)

Microsoft¹

01 Dec 2012

TL;DR: On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performances.

...read moreread less

Abstract: In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method with the affine transformations on the input layer is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performances. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that using bias shift performs as well as doing scaling plus bias shift.

...read moreread less

244 citations

Proceedings Article•DOI•

Scalable stacking and learning for building deep architectures

[...]

Li Deng¹, Dong Yu¹, John Platt¹•Institutions (1)

Microsoft¹

25 Mar 2012

TL;DR: The Deep Stacking Network (DSN) is presented, which overcomes the problem of parallelizing learning algorithms for deep architectures and provides a method of stacking simple processing modules in buiding deep architectures, with a convex learning problem in each module.

...read moreread less

Abstract: Deep Neural Networks (DNNs) have shown remarkable success in pattern recognition tasks. However, parallelizing DNN training across computers has been difficult. We present the Deep Stacking Network (DSN), which overcomes the problem of parallelizing learning algorithms for deep architectures. The DSN provides a method of stacking simple processing modules in buiding deep architectures, with a convex learning problem in each module. Additional fine tuning further improves the DSN, while introducing minor non-convexity. Full learning in the DSN is batch-mode, making it amenable to parallel training over many machines and thus be scalable over the potentially huge size of the training data. Experimental results on both the MNIST (image) and TIMIT (speech) classification tasks demonstrate that the DSN learning algorithm developed in this work is not only parallelizable in implementation but it also attains higher classification accuracy than the DNN.

...read moreread less

208 citations

Proceedings Article•DOI•

Exploiting sparseness in deep neural networks for large vocabulary speech recognition

[...]

Dong Yu¹, Frank Seide¹, Gang Li¹, Li Deng¹•Institutions (1)

Microsoft¹

25 Mar 2012

TL;DR: The goal of enforcing sparseness as soft regularization and convex constraint optimization problems is formulated, solutions under the stochastic gradient ascent setting are proposed, and novel data structures are proposed to exploit the randomSparseness patterns to reduce model size and computation time.

...read moreread less

Abstract: Recently, we developed context-dependent deep neural network (DNN) hidden Markov models for large vocabulary speech recognition. While reducing errors by 33% compared to its discriminatively trained Gaussian-mixture counterpart on the switchboard benchmark task, DNN requires much more parameters. In this paper, we report our recent work on DNN for improved generalization, model size, and computation speed by exploiting parameter sparseness. We formulate the goal of enforcing sparseness as soft regularization and convex constraint optimization problems, and propose solutions under the stochastic gradient ascent setting. We also propose novel data structures to exploit the random sparseness patterns to reduce model size and computation time. The proposed solutions have been evaluated on the voice-search and switchboard datasets. They have decreased the number of nonzero connections to one third while reducing the error rate by 0.2–0.3% over the fully connected model on both datasets. The nonzero connections have been further reduced to only 12% and 19% on the two respective datasets without sacrificing speech recognition performance. Under these conditions we can reduce the model size to 18% and 29%, and computation to 14% and 23%, respectively, on these two datasets.

...read moreread less

190 citations

Proceedings Article•DOI•

Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM

[...]

Jinyu Li¹, Dong Yu¹, Jui-Ting Huang¹, Yifan Gong¹•Institutions (1)

Microsoft¹

01 Dec 2012

TL;DR: This paper presents the strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework, and shows that DNNs provide the flexibility of using arbitrary features.

...read moreread less

Abstract: Context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using the Mel-scale log-filter bank features we not only achieve higher recognition accuracy than using MFCCs, but also can formulate the mixed-bandwidth training problem as a missing feature problem, in which several feature dimensions have no value when narrowband speech is presented. This treatment makes training CD-DNN-HMMs with mixed-bandwidth data an easy task since no bandwidth extension is needed. Our experiments on voice search data indicate that the proposed solution not only provides higher recognition accuracy for the wideband speech but also allows the same CD-DNN-HMM to recognize mixed-bandwidth speech. By exploiting mixed-bandwidth training data CD-DNN-HMM outperforms fMPE+BMMI trained GMM-HMM, which cannot benefit from using narrowband data, by 18.4%.

...read moreread less

143 citations

Proceedings Article•DOI•

Pipelined Back-Propagation for Context-Dependent Deep Neural Networks.

[...]

Xie Chen¹, Adam Eversole¹, Gang Li¹, Dong Yu¹, Frank Seide¹ - Show less +1 more•Institutions (1)

Microsoft¹

09 Sep 2012

TL;DR: It is shown that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server.

...read moreread less

Abstract: The Context-Dependent Deep-Neural-Network HMM, or CDDNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional Gaussian-mixture based HMMs For example, a CD-DNN-HMM trained on the 2000h Fisher corpus achieves 144% word error rate on the Hub5’00-FSH speakerindependent phone-call transcription task, compared to 196% obtained by a state-of-the-art, conventional discriminatively trained GMM-based HMM That CD-DNN-HMM, however, took 59 days to train on a modern GPGPU—the immense computational cost of the minibatch based back-propagation (BP) training is a major roadblock Unlike the familiar Baum-Welch training for conventional HMMs, BP cannot be efficiently parallelized across data In this paper we show that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server Using 2 and 4 GPGPUs, we achieve a 19 and 33 times end-to-end speed-up, at parallelization efficiency of 095 and 082, respectively, at no loss of recognition accuracy

...read moreread less

91 citations

The shared views of four research groups )

[...]

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury - Show less +7 more

01 Jan 2012

84 citations

Journal Article•DOI•

Introduction to the Special Section on Deep Learning for Speech and Language Processing

[...]

Dong Yu¹, Geoffrey E. Hinton², Nelson Morgan, Jen-Tzung Chien³, Shigeki Sagayama⁴ - Show less +1 more•Institutions (4)

Microsoft¹, University of Toronto², National Cheng Kung University³, University of Tokyo⁴

01 Jan 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Current speech recognition systems, for example, typically use Gaussian mixture models (GMMs), to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables.

...read moreread less

Abstract: Current speech recognition systems, for example, typically use Gaussian mixture models (GMMs), to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables. Instead of developing more powerful models, most of the research effort has gone into finding better ways of estimating the GMM parameters so that error rates are decreased or the margin between different classes is increased. The same observation holds for natural language processing (NLP) in which maximum entropy (MaxEnt) models and conditional random fields (CRFs) have been popular for the last decade. Both of these approaches use shallow models whose success largely depends on the use of carefully handcrafted features.

...read moreread less

Proceedings Article•DOI•

Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition

[...]

Dong Yu¹, Sabato Marco Siniscalchi², Li Deng¹, Chin-Hui Lee³•Institutions (3)

Microsoft¹, Kore University of Enna², Georgia Institute of Technology³

25 Mar 2012

TL;DR: Deep neural networks are employed to improve detection accuracy over conventional shallow MLPs (multi-layer perceptrons) with one hidden layer to open the door to a new family of flexible speech recognition system design for both top-down and bottom-up, lattice-based search strategies and knowledge integration.

...read moreread less

Abstract: Generation of high-precision sub-phonetic attribute (also known as phonological features) and phone lattices is a key frontend component for detection-based bottom-up speech recognition. In this paper we employ deep neural networks (DNNs) to improve detection accuracy over conventional shallow MLPs (multi-layer perceptrons) with one hidden layer. A range of DNN architectures with five to seven hidden layers and up to 2048 hidden units per layer have been explored. Training on the SI84 and testing on the Nov92 WSJ data, the proposed DNNs achieve significant improvements over the shallow MLPs, producing greater than 90% frame-level attribute estimation accuracies for all 21 attributes tested for the full system. On the phone detection task, we also obtain excellent frame-level accuracy of 86.6%. With this level of high-precision detection of basic speech units we have opened the door to a new family of flexible speech recognition system design for both top-down and bottom-up, lattice-based search strategies and knowledge integration.

...read moreread less

Patent•

Deep neural networks training for speech and pattern recognition

[...]

Frank Seide¹, Gang Li¹, Dong Yu¹, Adam Eversole¹, Xie Chen¹ - Show less +1 more•Institutions (1)

Microsoft¹

20 Nov 2012

TL;DR: In this paper, the use of a pipelined algorithm that performs parallelized computations to train deep neural networks (DNNs) for performing data analysis may reduce training time.

...read moreread less

Abstract: The use of a pipelined algorithm that performs parallelized computations to train deep neural networks (DNNs) for performing data analysis may reduce training time. The DNNs may be one of context-independent DNNs or context-dependent DNNs. The training may include partitioning training data into sample batches of a specific batch size. The partitioning may be performed based on rates of data transfers between processors that execute the pipelined algorithm, considerations of accuracy and convergence, and the execution speed of each processor. Other techniques for training may include grouping layers of the DNNs for processing on a single processor, distributing a layer of the DNNs to multiple processors for processing, or modifying an execution order of steps in the pipelined algorithm.

...read moreread less

Journal Article•DOI•

Efficient and effective algorithms for training single-hidden-layer neural networks

[...]

Dong Yu¹, Li Deng¹•Institutions (1)

Microsoft¹

01 Apr 2012-Pattern Recognition Letters

TL;DR: Experiments show that the algorithms proposed in this paper obtain significantly better classification accuracy than ELM when the same number of hidden units is used, and at the expense of 5 times or less the training cost incurred by the ELM training.

...read moreread less

Proceedings Article•DOI•

A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition

[...]

Brian Hutchinson¹, Li Deng², Dong Yu²•Institutions (2)

University of Washington¹, Microsoft²

25 Mar 2012

TL;DR: A novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higher-order statistics of the input features.

...read moreread less

Abstract: We develop and describe a novel deep architecture, the Tensor Deep Stacking Network (T-DSN), where multiple blocks are stacked one on top of another and where a bilinear mapping from hidden representations to the output in each block is used to incorporate higher-order statistics of the input features. A learning algorithm for the T-DSN is presented, in which the main parameter estimation burden is shifted to a convex sub-problem with a closed-form solution. Using an efficient and scalable parallel implementation, we train a T-DSN to discriminate standard three-state monophones in the TIMIT database. The T-DSN outperforms an alternative pretrained Deep Neural Network (DNN) architecture in frame-level classification (both state and phone) and in the cross-entropy measure. For continuous phonetic recognition, T-DSN performs equivalently to a DNN but without the need for a hard-to-scale, sequential fine-tuning step.

...read moreread less

Factorized deep neural networks for adaptive speech recognition

[...]

Dong Yu¹, Xin Chen², Li Deng¹•Institutions (2)

Microsoft¹, University of Missouri²

01 Mar 2012

TL;DR: Two types of factorized adaptive DNNs are proposed and described, improving the earlier versions of CD-DNN-HMMs and providing new ways of modeling speaker and environment factors and insight onto how environment invariant DNN models may be constructed and subsequently trained.

...read moreread less

Abstract: Recently, we have shown that context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) can achieve very promising recognition results on large vocabulary speech recognition tasks, as evidenced by over one third fewer word errors than the discriminatively trained conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we propose and describe two types of factorized adaptive DNNs, improving the earlier versions of CD-DNN-HMMs. In the first model, the hidden speaker and environment factors and tied triphone states are jointly approximated; while in the second model, the factors are firstly estimated and then fed into the main DNN to predict tied triphone states. We evaluated these models on the small 30hr Switchboard task. The preliminary results indicate that more training data are needed to show the full potential of these models. However, these models provide new ways of modeling speaker and environment factors and offer insight onto how environment invariant DNN models may be constructed and subsequently trained. Index Terms — automatic speech recognition, deep neural networks, factorized DNN, CD-DNN-HMM

...read moreread less

Patent•

Computer-implemented deep tensor neural network

[...]

Dong Yu¹, Li Deng¹, Frank Seide¹•Institutions (1)

Microsoft¹

29 Aug 2012

TL;DR: In this article, a deep tensor neural network (DTNN) is described, wherein the DTNN is suitable for employment in a computer-implemented recognition/classification system.

...read moreread less

Abstract: A deep tensor neural network (DTNN) is described herein, wherein the DTNN is suitable for employment in a computer-implemented recognition/classification system. Hidden layers in the DTNN comprise at least one projection layer, which includes a first subspace of hidden units and a second subspace of hidden units. The first subspace of hidden units receives a first nonlinear projection of input data to a projection layer and generates the first set of output data based at least in part thereon, and the second subspace of hidden units receives a second nonlinear projection of the input data to the projection layer and generates the second set of output data based at least in part thereon. A tensor layer, which can converted into a conventional layer of a DNN, generates the third set of output data based upon the first set of output data and the second set of output data.

...read moreread less

Proceedings Article•DOI•

Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks.

[...]

Dong Yu¹, Li Deng¹, Frank Seide¹•Institutions (1)

Microsoft¹

01 Sep 2012

TL;DR: Evaluation on 30hr Switchboard task indicates that DTNNs can outperform DNNs with similar number of parameters with 5% relative word error reduction, and is extended to deep tensor neural networks (DTNNs) in which one or more layers are double-projection and tensor layers.

...read moreread less

Abstract: Recently, we proposed and developed the context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) for large vocabulary speech recognition and achieved highly promising recognition results including over one third fewer word errors than the discriminatively trained, conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we extend DNNs to deep tensor neural networks (DTNNs) in which one or more layers are double-projection and tensor layers. The basic idea of the DTNN comes from our realization that many factors interact with each other to predict the output. To represent these interactions, we project the input to two nonlinear subspaces through the double-projection layer and model the interactions between these two subspaces and the output neurons through a tensor with three-way connections. Evaluation on 30hr Switchboard task indicates that DTNNs can outperform DNNs with similar number of parameters with 5% relative word error reduction

...read moreread less

Patent•

Tensor deep stacked neural network

[...]

Dong Yu¹, Li Deng¹, Brian Hutchinson¹•Institutions (1)

Microsoft¹

15 Feb 2012

TL;DR: The tensor deep stacked neural (TDSN) network as discussed by the authors is a tensor neural network that uses a bilinear representation to map a hidden layer to the predication layer.

...read moreread less

Abstract: A tensor deep stacked neural (T-DSN) network for obtaining predictions for discriminative modeling problems. The T-DSN network and method use bilinear modeling with a tensor representation to map a hidden layer to the predication layer. The T-DSN network is constructed by stacking blocks of a single hidden layer tensor neural network (SHLTNN) on top of each other. The single hidden layer for each block then is separated or divided into a plurality of two or more sections. In some embodiments, the hidden layer is separated into a first hidden layer section and a second hidden layer section. These multiple sections of the hidden layer are combined using a product operator to obtain an implicit hidden layer having a single section. In some embodiments the product operator is a Khatri-Rao product. A prediction is made using the implicit hidden layer and weights, and the output prediction layer is consequently obtained.

...read moreread less

Proceedings Article•DOI•

Context-dependent Deep Neural Networks for audio indexing of real-life data

[...]

Gang Li¹, Huifeng Zhu¹, Gong Cheng¹, Kit Thambiratnam², Behrooz Chitsaz¹, Dong Yu¹, Frank Seide¹ - Show less +3 more•Institutions (2)

Microsoft¹, Search Technologies²

01 Dec 2012

TL;DR: It is found that for the best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data, and that DNN likelihood evaluation is a sizeable runtime factor even in the wide-beam context of generating rich lattices.

...read moreread less

Abstract: We apply Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to the real-life problem of audio indexing of data across various sources. Recently, we had shown that on the Switchboard benchmark on speaker-independent transcription of phone calls, CD-DNN-HMMs with 7 hidden layers reduce the word error rate by as much as one-third, compared to discriminatively trained Gaussian-mixture HMMs, and by one-fourth if the GMM-HMM also uses fMPE features. This paper takes CD-DNN-HMM based recognition into a real-life deployment for audio indexing. We find that for our best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data (video podcasts and talks). Compared to a speaker-adaptive GMM system, the relative improvement is 18%, at very similar end-to-end runtime. In system building, we find that DNNs can benefit from a larger number of senones than the GMM-HMM; and that DNN likelihood evaluation is a sizeable runtime factor even in our wide-beam context of generating rich lattices: Cutting the model size by 60% reduces runtime by one-third at a 5% relative WER loss.

...read moreread less