Proceedings Article

Improved Bottleneck Features Using Pretrained Deep Neural Networks.

Dong Yu1, Michael L. Seltzer1
01 Aug 2011-pp 237-240
TL;DR: This paper shows how unsupervised pretraining of a DNN enhances the network's discriminative power and improves the bottleneck features it generates, and that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states.
Abstract: Bottleneck features have been shown to be effective in improving the accuracy of automatic speech recognition (ASR) systems. Conventionally, bottleneck features are extracted from a multi-layer perceptron (MLP) trained to predict context-independent monophone states. The MLP typically has three hidden layers and is trained using the backpropagation algorithm. In this paper, we propose two improvements to the training of bottleneck features motivated by recent advances in the use of deep neural networks (DNNs) for speech recognition. First, we show how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates. Second, we show that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states. Bottleneck features trained using the proposed methods produced a 16% relative reduction in sentence error rate over conventional bottleneck features on a large vocabulary business search task.
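As a rough sketch of the pipeline the abstract describes, the snippet below builds a feed-forward network with a narrow interior layer and reads out that layer's activations as bottleneck features for a GMM-HMM. All sizes (a 429-dimensional spliced input, a 39-unit bottleneck, 2000 output targets) and the random weights are illustrative placeholders, not the paper's configuration; in the proposed method the output layer would be trained against senone targets after unsupervised pretraining.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes: spliced acoustic input -> hidden layers ->
# narrow bottleneck -> hidden layer -> output targets (e.g. senones).
sizes = [429, 1024, 1024, 39, 1024, 2000]   # placeholders, not from the paper
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.05, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, upto=None):
    """Propagate through the first `upto` layers (all layers if None)."""
    h = x
    for W, b in list(zip(weights, biases))[:upto]:
        h = sigmoid(h @ W + b)
    return h

frame = rng.normal(size=429)         # one spliced feature frame (placeholder)
bottleneck = forward(frame, upto=3)  # activations of the 39-unit layer
print(bottleneck.shape)              # (39,): used as features by a GMM-HMM
```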


Citations
Book
Li Deng1, Dong Yu1
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations

Journal ArticleDOI
TL;DR: This paper provides a thorough examination of the studies on deep learning for speech applications conducted since 2006, when deep learning first arose as a new area of machine learning.
Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the past few years, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results when compared to others in a variety of applications including speech, and thus became a very attractive area of research. This paper provides a thorough examination of the different studies that have been conducted since 2006, when deep learning first arose as a new area of machine learning, for speech applications. A thorough statistical analysis is provided in this review which was conducted by extracting specific information from 174 papers published between the years 2006 and 2018. The results provided in this paper shed light on the trends of research in this area as well as bring focus to new research topics.

701 citations

Proceedings ArticleDOI
Jian Xue1, Jinyu Li1, Yifan Gong1
25 Aug 2013
TL;DR: This paper applies singular value decomposition (SVD) to the weight matrices in a DNN and then restructures the model based on the inherent sparseness of the original matrices, reducing the DNN model size significantly with negligible accuracy loss.
Abstract: The recently proposed deep neural network (DNN) obtains significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, a DNN requires many more parameters than traditional systems, which brings huge cost during online evaluation and limits the application of DNNs in many scenarios. In this paper we present our new effort on DNNs aimed at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices in the DNN, and then restructure the model based on the inherent sparseness of the original matrices. After restructuring we can reduce the DNN model size significantly with negligible accuracy loss. We also fine-tune the restructured model using regular back-propagation to recover accuracy when the model size is reduced heavily. The proposed method has been evaluated on two LVCSR tasks with context-dependent DNN hidden Markov models (CD-DNN-HMMs). Experimental results show that the proposed approach dramatically reduces the DNN model size by more than 80% without losing any accuracy. Index Terms: deep neural network, singular value decomposition, model restructuring
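The restructuring step itself is compact enough to show concretely. Below is a minimal numpy illustration, not the authors' implementation: one weight matrix is factored by SVD into two thinner layers, cutting parameters from m*n to k*(m+n). The 2048x2048 size and rank 192 are round numbers chosen only to make the savings visible.

```python
import numpy as np

def svd_restructure(W, k):
    """Split an m x n weight matrix into an m x k and a k x n layer.

    Keeping only the top-k singular values gives W ~= A @ B, reducing
    parameters from m*n to k*(m+n) when k << min(m, n).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # m x k, singular values folded into the left factor
    B = Vt[:k, :]          # k x n
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 2048))     # stand-in for a trained weight matrix
A, B = svd_restructure(W, k=192)
print(W.size, A.size + B.size)        # 4194304 vs 786432, roughly 81% fewer
```

On a random matrix such a truncation discards most of the signal; the paper's point is that trained DNN weight matrices have low effective rank, so a small k loses little accuracy and fine-tuning recovers the rest.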

464 citations


Cites background or methods from "Improved Bottleneck Features Using ..."

  • ...The work in [10] shows that DNN-trained bottleneck features reduce word error rate by 16% relative on a large vocabulary business search task...


  • ...Besides CD-DNN-HMMs, DNN can also be used to provide the bottleneck feature vectors for the GMM in a GMM-HMM system [10][11]. Both applications of DNN in ASR achieved significant accuracy improvement....


01 Aug 2014
TL;DR: The computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU, is introduced; the architecture and key components of CNTK, the command-line options for using it, and the network definition and model editing language are described.
Abstract: We introduce computational network (CN), a unified framework for describing arbitrary learning machines, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM), logistic regression, and maximum entropy models, that can be illustrated as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each non-leaf node represents a matrix operation upon its children. We describe algorithms to carry out forward computation and gradient calculation in a CN and introduce the most popular computation node types used in a typical CN. We further introduce the computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU. We describe the architecture and key components of CNTK, the command-line options for using it, and the network definition and model editing language, and provide sample setups for acoustic models, language models, and spoken language understanding. We also describe the Argon speech recognition decoder as an example of integration with CNTK.
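To make the "directed graph of matrix operations" concrete, here is a toy node type with forward computation and gradient backpropagation. This is a generic sketch of the idea only; CNTK's actual node types, API, and network definition language are not reproduced here.

```python
import numpy as np

class Node:
    """Toy computational-network node: leaves hold values, non-leaves an op."""
    def __init__(self, op=None, children=(), value=None):
        self.op, self.children = op, children
        self.value, self.grad = value, None

    def forward(self):
        if self.op == "matmul":
            a, b = [c.forward() for c in self.children]
            self.value = a @ b
        elif self.op == "sigmoid":
            self.value = 1.0 / (1.0 + np.exp(-self.children[0].forward()))
        return self.value            # leaves simply return their stored value

    def backward(self, grad):
        self.grad = grad
        if self.op == "matmul":
            a, b = self.children
            a.backward(grad @ b.value.T)   # dL/da = dL/dy @ b^T
            b.backward(a.value.T @ grad)   # dL/db = a^T @ dL/dy
        elif self.op == "sigmoid":
            self.children[0].backward(grad * self.value * (1.0 - self.value))

rng = np.random.default_rng(0)
x = Node(value=rng.normal(size=(1, 4)))          # input leaf
W = Node(value=rng.normal(size=(4, 3)))          # parameter leaf
y = Node("sigmoid", (Node("matmul", (x, W)),))   # y = sigmoid(x @ W)
y.forward()
y.backward(np.ones((1, 3)))   # seed gradient from a notional loss
print(W.grad.shape)           # (4, 3): gradient ready for a parameter update
```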

408 citations

Proceedings ArticleDOI
07 Nov 2016
TL;DR: This paper designs and implements Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs, with a key focus on bandwidth optimization through memory access reorganization not studied in prior work.
Abstract: With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy-efficiency of the computation-demanding CNN, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the underlying FPGA computing and bandwidth resource utilization, with a key focus on bandwidth optimization through memory access reorganization not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configurations. Finally, we integrate Caffeine into the industry-standard deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on a Xilinx KU060 FPGA and 636 GOPS on a Virtex7 690t FPGA, the best published result to our knowledge. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3× and 43.5× performance and energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy-efficiency than a GPU implementation on a medium-sized FPGA (KU060). Performance projections to a system with a high-end FPGA (Virtex7 690t) show even higher gains.
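The unified matrix-multiplication view of convolution that the abstract mentions is in the spirit of the standard im2col transformation, sketched below in numpy for stride 1 and no padding. All sizes are arbitrary, and Caffeine's actual FPGA tiling and bandwidth optimizations are not represented.

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of a (C, H, W) input into one column, so a
    convolution becomes a single matrix multiplication (stride 1, no pad)."""
    C, H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))               # toy input feature map
filters = rng.normal(size=(16, 3 * 3 * 3))   # 16 filters, flattened to rows
y = filters @ im2col(x, 3)                   # the conv layer as one GEMM
print(y.shape)                               # (16, 36) -> reshape to (16, 6, 6)
```

Fully connected layers are already matrix multiplications, which is why a single representation can serve both layer types on the same hardware.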

360 citations


Cites background or methods from "Improved Bottleneck Features Using ..."

  • ...However, in many other areas such as speech and auto-encoders, the fully connected neural network is also a major type of workload, such as the networks presented in [39][40][41][42][43][44]....


  • ...Figure 17 presents a design space of a bottleneck network, which is also frequently used in prior work [40][41][42]....


References
Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights is described that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool for reducing the dimensionality of data.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
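A stripped-down version of the autoencoder idea: a network with a small central code layer trained by gradient descent to reconstruct its input. This sketch trains from random weights with plain backprop; the paper's contribution, the generative layer-wise initialization that makes much deeper autoencoders trainable, is omitted, and all sizes and the learning rate are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))            # placeholder 64-dimensional data

# Encoder 64 -> 8 and linear decoder 8 -> 64; the 8-unit layer is the "code".
W1 = rng.normal(0, 0.1, (64, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.1, (8, 64)); b2 = np.zeros(64)

lr = 0.1
for _ in range(500):                      # gradient descent on squared error
    H = sigmoid(X @ W1 + b1)              # low-dimensional codes
    R = H @ W2 + b2                       # reconstruction
    err = R - X
    dW2 = H.T @ err / len(X); db2 = err.mean(0)
    dH = err @ W2.T * H * (1 - H)         # backprop through the sigmoid
    dW1 = X.T @ dH / len(X); db1 = dH.mean(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print(((sigmoid(X @ W1 + b1) @ W2 + b2 - X) ** 2).mean())   # reconstruction MSE
```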

16,717 citations

Journal ArticleDOI
TL;DR: A product of experts (PoE) is an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary; training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule.
Abstract: It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence" whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
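For a PoE of the restricted-Boltzmann-machine form, one contrastive-divergence (CD-1) update looks roughly like the numpy sketch below: the intractable model expectation is replaced by statistics gathered after a single Gibbs step started at the data. Biases are omitted, and the sizes, learning rate, and random binary data are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nv, nh = 16, 8                                     # visible and hidden units
W = rng.normal(0, 0.1, (nv, nh))
v0 = (rng.random((32, nv)) < 0.5).astype(float)    # a batch of binary data

def cd1_step(v0, W, lr=0.05):
    """One CD-1 update (weights only; biases omitted for brevity)."""
    ph0 = sigmoid(v0 @ W)                          # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                        # one-step reconstruction
    ph1 = sigmoid(pv1 @ W)
    pos = v0.T @ ph0                               # <v h> under the data
    neg = pv1.T @ ph1                              # <v h> after one Gibbs step
    return W + lr * (pos - neg) / len(v0)

for _ in range(100):
    W = cd1_step(v0, W)
```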

5,150 citations


"Improved Bottleneck Features Using ..." refers methods in this paper

  • ...Because $\langle v_i h_j \rangle_{\text{model}}$ is extremely expensive to compute exactly, the contrastive divergence (CD) approximation to the gradient is used, where $\langle v_i h_j \rangle_{\text{model}}$ is replaced by running the Gibbs sampler initialized at the data for one full step [12]....


Journal ArticleDOI
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture is described that trains the DNN to produce a distribution over senones (tied triphone states) as its output and can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
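At decoding time, hybrid systems of this kind typically convert the DNN's senone posteriors into scaled likelihoods for the HMM by dividing out the senone priors, since by Bayes' rule p(x|s) is proportional to p(s|x)/p(s). A minimal sketch, with random posteriors and uniform priors standing in for real values:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Turn senone posteriors p(s|x) into log emission scores for an HMM
    decoder: log p(x|s) = log p(s|x) - log p(s) up to an additive constant."""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(2000))   # fake posteriors over 2000 senones
priors = np.full(2000, 1 / 2000)      # uniform priors, for illustration only
print(scaled_log_likelihoods(post, priors)[:3])
```

In practice the priors are estimated from senone counts in the training alignment rather than assumed uniform.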

3,120 citations


"Improved Bottleneck Features Using ..." refers methods in this paper

  • ...For example, the DNN-HMM which exploits the discriminative learning ability of pretrained DNNs and the sequential modeling ability of hidden Markov models (HMMs) outperformed the conventional Gaussian mixture model (GMM)-HMM for both phoneme recognition [6][7] and large vocabulary speech recognition [8] tasks....

  • ...We are also interested in knowing how bottleneck features perform compared to the context-dependent DNN-HMMs developed recently [8]. For these purposes, we conducted a series of experiments using the Windows Live Search for Mobile (WLS4M) corpus collected from real users of a smartphone application for business search [13][14]....

  • ...The lexicon and trigram language model (LM) used for decoding were the same as used in our previous work [8]....

  • ...In all the results reported here, we followed the DNN training recipe in [8]....


Journal Article
TL;DR: In this paper, the authors empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples, and they suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization.
Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

2,036 citations

Journal ArticleDOI
TL;DR: It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Abstract: Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.
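The generative pre-training described here is greedy and layer-wise: each layer is trained as an RBM on the activations of the layer below, and the stacked weights then initialize the network for discriminative fine-tuning with backpropagation (not shown). A compressed sketch, assuming binary units throughout (a real front end on spectral features would use a Gaussian-Bernoulli first layer) and placeholder sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

def train_rbm(data, nh, epochs=50, lr=0.05):
    """Tiny CD-1 trainer (weights only); returns the learned weight matrix."""
    W = rng.normal(0, 0.1, (data.shape[1], nh))
    for _ in range(epochs):
        ph0 = sigmoid(data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T)                    # one-step reconstruction
        W += lr * (data.T @ ph0 - pv1.T @ sigmoid(pv1 @ W)) / len(data)
    return W

# Greedy layer-wise stacking: each RBM sees the hidden activations of the
# previous one; the resulting weights initialize the DNN before fine-tuning.
X = (rng.random((128, 39)) < 0.5).astype(float)   # binary placeholder frames
weights, h = [], X
for nh in (128, 128, 128):                        # illustrative layer sizes
    W = train_rbm(h, nh)
    weights.append(W)
    h = sigmoid(h @ W)                            # input for the next RBM
```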

1,767 citations


"Improved Bottleneck Features Using ..." refers methods in this paper

  • ...For example, the DNN-HMM which exploits the discriminative learning ability of pretrained DNNs and the sequential modeling ability of hidden Markov models (HMMs) outperformed the conventional Gaussian mixture model (GMM)-HMM for both phoneme recognition [6][7] and large vocabulary speech recognition [8] tasks....
