
Showing papers on "TIMIT published in 2014"


Journal ArticleDOI
TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Abstract: Recently, the hybrid deep neural network (DNN)- hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
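A minimal numpy sketch of the limited-weight-sharing idea, assuming separate filters per frequency band with max-pooling inside each band; the band size, filter counts, and toy usage below are illustrative, not the paper's configuration:

```python
import numpy as np

def limited_weight_sharing_conv(spectrogram, filters_per_band, band_size=8, filter_width=4):
    """Convolution with limited weight sharing: the frequency axis is split into
    bands and each band has its own filter weights (shared only within that band),
    followed by max-pooling over the frequency shifts inside the band."""
    n_freq, n_frames = spectrogram.shape
    n_bands = n_freq // band_size
    outputs = []
    for b in range(n_bands):
        band = spectrogram[b * band_size:(b + 1) * band_size, :]    # (band_size, n_frames)
        W = filters_per_band[b]                                     # (n_filters, filter_width)
        acts = [np.maximum(W @ band[f0:f0 + filter_width, :], 0.0)  # ReLU activation per shift
                for f0 in range(band_size - filter_width + 1)]
        outputs.append(np.max(np.stack(acts, axis=0), axis=0))      # pool within the band
    return np.concatenate(outputs, axis=0)

# toy usage with random weights (shapes are illustrative only)
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))                               # 40 mel bins, 100 frames
filters = [rng.standard_normal((16, 4)) * 0.1 for _ in range(5)]    # one filter set per band
print(limited_weight_sharing_conv(spec, filters).shape)             # (80, 100)
```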

1,948 citations


Proceedings Article
08 Dec 2014
TL;DR: This paper empirically demonstrates that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models.
Abstract: Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.
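A hedged sketch of one way such a mimic can be trained: a single-hidden-layer net regressed onto the logits of a (hypothetical) deep teacher with a squared-error loss; the layer size, learning rate, and epoch count are placeholders:

```python
import numpy as np

def train_shallow_mimic(X, teacher_logits, n_hidden=2000, lr=1e-3, epochs=10, seed=0):
    """Fit a single-hidden-layer net to regress the logits of a deep teacher model
    (squared error on logits rather than cross-entropy on hard labels)."""
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], teacher_logits.shape[1]
    W1 = rng.standard_normal((d, n_hidden)) * 0.01
    W2 = rng.standard_normal((n_hidden, k)) * 0.01
    for _ in range(epochs):
        H = np.maximum(X @ W1, 0.0)                # hidden ReLU activations
        G = (H @ W2 - teacher_logits) / len(X)     # gradient of 0.5 * mean squared error
        dW2 = H.T @ G
        dW1 = X.T @ ((G @ W2.T) * (H > 0))
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2
```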

1,526 citations


Posted Content
TL;DR: Initial results demonstrate that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
Abstract: We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
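A minimal sketch of one attention step, assuming additive scoring (the paper's exact scoring function may differ); W_q, W_k, and v are hypothetical projection parameters:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, W_q, W_k, v):
    """One decoding step of additive attention: score each encoder frame against
    the current decoder state, normalise with a softmax, and return the context
    vector used to emit the next phoneme together with the soft alignment."""
    scores = np.tanh(decoder_state @ W_q + encoder_states @ W_k) @ v   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                           # attention over input frames
    return weights @ encoder_states, weights                           # context vector, alignment
```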

483 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint, is proposed to enhance the separation performance of monaural speech separation models.
Abstract: Monaural source separation is useful for many real-world applications though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8~4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.
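One common form such a masking layer can take is a soft ratio mask that forces the two estimates to sum back to the mixture; whether this matches the authors' exact formulation is an assumption:

```python
import numpy as np

def soft_mask_layer(y1_tilde, y2_tilde, mixture_mag):
    """Turn the two network output streams into a time-frequency soft mask and
    re-weight the mixture magnitude spectrogram, so that the two source
    estimates sum back to the mixture (the reconstruction constraint)."""
    eps = 1e-8
    mask1 = np.abs(y1_tilde) / (np.abs(y1_tilde) + np.abs(y2_tilde) + eps)
    s1_hat = mask1 * mixture_mag
    s2_hat = (1.0 - mask1) * mixture_mag     # s1_hat + s2_hat is approximately mixture_mag
    return s1_hat, s2_hat
```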

445 citations


Proceedings Article
21 Jun 2014
TL;DR: This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
Abstract: Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when long-term memory is required. This paper introduces a simple, yet powerful modification to the simple RNN (SRN) architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate. Rather than making the standard RNN models more complex, CW-RNN reduces the number of SRN parameters, improves the performance significantly in the tasks tested, and speeds up the network evaluation. The network is demonstrated in preliminary experiments involving three tasks: audio signal generation, TIMIT spoken word classification, where it outperforms both SRN and LSTM networks, and online handwriting recognition, where it outperforms SRNs.
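A sketch of a single Clockwork-RNN update, assuming exponentially spaced periods and a recurrent matrix already masked so that each module reads only from modules with an equal or longer period:

```python
import numpy as np

def cw_rnn_step(h, x, t, W_h, W_x, periods, block_size):
    """One Clockwork-RNN update. The hidden state is partitioned into modules
    with clock periods `periods` (e.g. 1, 2, 4, 8, ...); module i is recomputed
    only when t is divisible by its period, otherwise it keeps its previous
    value. W_h is assumed pre-masked so each module reads only from modules
    with an equal or longer period."""
    h_new = h.copy()
    for i, period in enumerate(periods):
        if t % period == 0:
            rows = slice(i * block_size, (i + 1) * block_size)
            h_new[rows] = np.tanh(W_h[rows] @ h + W_x[rows] @ x)
    return h_new
```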

335 citations


Journal ArticleDOI
TL;DR: A general adaptation scheme for DNNs based on discriminant condition codes is proposed; the codes are fed directly to various layers of a pre-trained DNN through a new set of connection weights and are quite effective for adapting large DNN models using only a small amount of adaptation data.
Abstract: Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we have proposed a general adaptation scheme for DNN based on discriminant condition codes, which are directly fed to various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn connection weights from training data as well as the corresponding adaptation methods to learn a new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either the frame-level cross-entropy or the sequence-level maximum mutual information training criterion. We have proposed three different ways to apply this adaptation scheme based on the so-called speaker codes: i) nonlinear feature normalization in feature space; ii) direct model adaptation of the DNN based on speaker codes; iii) joint speaker adaptive training with speaker codes. We have evaluated the proposed adaptation methods in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that all three methods are quite effective for adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we have achieved very good performance in Switchboard (12.1% WER) after speaker adaptation using the sequence training criterion, which is very close to the best performance reported in this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE Acoust., Speech, Signal Process., 2013).
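A toy sketch of the speaker-code idea: a low-dimensional code injected into each hidden layer through extra connection weights; the layer structure and ReLU nonlinearity are assumptions, not the authors' exact setup:

```python
import numpy as np

def forward_with_speaker_code(x, speaker_code, layers, code_weights):
    """Forward pass of a pre-trained DNN whose hidden layers each receive an
    extra input: a low-dimensional speaker code projected through a new set of
    connection weights B. Adaptation tunes the code (and/or B) on a few
    utterances while the original weights (W, b) stay fixed."""
    h = x
    for (W, b), B in zip(layers, code_weights):
        h = np.maximum(W @ h + B @ speaker_code + b, 0.0)   # ReLU hidden layer
    return h
```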

157 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression are developed and it is demonstrated that these schemes enable kernel methods to match the performance of state of the art Deep Neural Networks on TIMIT for speech recognition and classification tasks.
Abstract: Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale-up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature space data matrix, while the second scheme gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state of the art Deep Neural Networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
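As a generic stand-in for the randomized approximate feature maps discussed above (not the paper's distributed block coordinate descent or ensemble schemes), a random Fourier feature map followed by ridge regression:

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, seed=0):
    """Random Fourier feature map approximating a Gaussian kernel
    exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def kernel_ridge_fit(Phi, Y, lam=1e-3):
    """Ridge regression in the randomized feature space (Y holds one-hot targets)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
```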

135 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined; the combined model achieves an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Abstract: Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes - time and frequency - should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, along with their advantages. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
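A hypothetical PyTorch topology illustrating the combination of frequency-axis and time-axis convolution; the layer sizes, pooling, and classifier head are invented for illustration and are not the authors' architecture:

```python
import torch
import torch.nn as nn

class TimeFreqConvNet(nn.Module):
    """Convolve and pool along the frequency axis first, then convolve along
    time over the resulting feature maps, before a small classifier head."""
    def __init__(self, n_classes=39):
        super().__init__()
        self.freq_conv = nn.Conv2d(1, 64, kernel_size=(8, 1))    # frequency-axis filters
        self.freq_pool = nn.MaxPool2d(kernel_size=(3, 1))        # pool over frequency shifts
        self.time_conv = nn.Conv2d(64, 128, kernel_size=(1, 5))  # time-axis filters
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, n_classes))

    def forward(self, spec):                      # spec: (batch, 1, n_mel, n_frames)
        h = torch.relu(self.freq_conv(spec))
        h = self.freq_pool(h)
        h = torch.relu(self.time_conv(h))
        return self.head(h)
```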

87 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Results show that this novel and entirely unsupervised approach to selecting subsets of acoustic data for training phone recognizers consistently outperforms a number of baseline methods while being computationally very efficient and requiring no labeling.
Abstract: We conduct a comparative study on selecting subsets of acoustic data for training phone recognizers. The data selection problem is approached as a constrained submodular optimization problem. Previous applications of this approach required transcriptions or acoustic models trained in a supervised way. In this paper we develop and evaluate a novel and entirely unsupervised approach, and apply it to TIMIT data. Results show that our method consistently outperforms a number of baseline methods while being computationally very efficient and requiring no labeling.
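A sketch of greedy selection under a facility-location submodular objective, one common instantiation of the constrained submodular optimization mentioned above; the choice of objective and the use of a precomputed acoustic similarity matrix are assumptions:

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedy maximisation of the facility-location objective
    f(S) = sum_i max_{j in S} similarity[i, j] under a size budget,
    returning the indices of the selected utterances."""
    n = similarity.shape[0]
    selected, cover = [], np.zeros(n)
    for _ in range(budget):
        gains = np.maximum(similarity, cover[:, None]).sum(axis=0) - cover.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, similarity[:, j])   # best coverage of each item so far
    return selected
```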

50 citations


Journal ArticleDOI
TL;DR: A rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates and gives a performance comparable to or better than the state-of-the-art methods.
Abstract: Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages.
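A heavily hedged sketch of a plosion-index-style measure as described above, i.e. the sample amplitude at a candidate instant divided by the average amplitude in a preceding window; the window and skip lengths are assumptions and may differ from the authors' definition:

```python
import numpy as np

def plosion_index(signal, n, fs, win_ms=16.0, skip_ms=6.0):
    """Ratio of the absolute sample value at candidate instant n to the average
    absolute amplitude in a short window preceding it (after skipping a small
    gap), so an abrupt rise in energy after a closure yields a large value."""
    skip = int(skip_ms * 1e-3 * fs)
    win = int(win_ms * 1e-3 * fs)
    start, end = max(n - skip - win, 0), n - skip
    if end <= start:
        return 0.0
    return np.abs(signal[n]) / (np.mean(np.abs(signal[start:end])) + 1e-12)
```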

41 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: This paper identifies effective normalization, neural network topology and regularization techniques for modeling the higher-order scatter, resulting in a relative improvement of 7% over log-mel features on TIMIT and a phonetic error rate of 17.4%, one of the lowest PERs reported to date on this task.
Abstract: State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modelling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work investigates techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models and finds that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps.
Abstract: Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone boundary HMM models, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results by using more powerful statistical models for boundary correction that are conditioned on phonetic context and duration features. Furthermore, we find that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps. Overall, we reduce segmentation errors on the TIMIT corpus by almost one half, from 93.9% to 96.8% boundary accuracy with a 20-ms tolerance.

Journal ArticleDOI
TL;DR: The development of an RC-HMM hybrid that provides good recognition on the Wall Street Journal benchmark is described, and given that RC-based acoustic modeling is a fairly new approach, these results open up promising perspectives.
Abstract: Thanks to research in neural network based acoustic modeling, progress in Large Vocabulary Continuous Speech Recognition (LVCSR) seems to have gained momentum recently. In search for further progress, the present letter investigates Reservoir Computing (RC) as an alternative new paradigm for acoustic modeling. RC unifies the appealing dynamical modeling capacity of a Recurrent Neural Network (RNN) with the simplicity and robustness of linear regression as a model for training the weights of that network. In previous work, an RC-HMM hybrid yielding very good phone recognition accuracy on TIMIT could be designed, but no proof was offered yet that this success would also transfer to LVCSR. This letter describes the development of an RC-HMM hybrid that provides good recognition on the Wall Street Journal benchmark. For the WSJ0 5k word task, word error rates of 6.2% (bigram language model) and 3.9% (trigram) are obtained on the Nov-92 evaluation set. Given that RC-based acoustic modeling is a fairly new approach, these results open up promising perspectives.
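A minimal reservoir-computing sketch: a fixed random recurrent network produces states and only a linear ridge-regression readout is trained; the leak rate and regularizer are illustrative:

```python
import numpy as np

def reservoir_states(inputs, W_in, W_res, leak=0.3):
    """Drive a fixed random reservoir with the acoustic feature sequence;
    only the linear readout on top of these states is trained."""
    n = W_res.shape[0]
    H = np.zeros((len(inputs), n))
    h = np.zeros(n)
    for t, x in enumerate(inputs):
        h = (1.0 - leak) * h + leak * np.tanh(W_in @ x + W_res @ h)
        H[t] = h
    return H

def train_readout(H, targets, lam=1e-6):
    """Ridge-regression readout mapping reservoir states to HMM state targets."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ targets)
```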

Journal ArticleDOI
TL;DR: This frontend feature processing technique employs an equivalent rectangular bandwidth (ERB) filter-like wavelet speech feature extraction method called Wavelet ERB Sub-band based Periodicity and Aperiodicity Decomposition (WERB-SPADE), and its validity is examined on the TIMIT phone recognition task in noisy environments.
Abstract: In recent years, the wavelet transform has been found to be an effective tool for the time-frequency analysis of non-stationary and quasi-stationary signals such as speech, and it has been used for feature extraction in speech recognition applications. Here we propose a wavelet-based feature extraction technique that captures both the periodic and aperiodic information, along with the sub-band instantaneous frequency of the speech signal, for robust speech recognition in noisy environments. This technique is based on a parallel distributed processing scheme inspired by the human speech perception process. The frontend feature processing employs an equivalent rectangular bandwidth (ERB) filter-like wavelet feature extraction method called Wavelet ERB Sub-band based Periodicity and Aperiodicity Decomposition (WERB-SPADE), and we examine its validity on the TIMIT phone recognition task in noisy environments. The speech is filtered by a 24-band ERB-like wavelet filter bank, and the equal-loudness pre-emphasized output of each band is processed through a comb filter. Each comb filter is designed individually for its frequency sub-band to decompose the signal into periodic and aperiodic features. The technique thus retains the robustness of periodic features without losing important information, such as formant transitions, carried by the aperiodic features. Speech recognition experiments with a standard HMM recognizer are conducted under both clean-training and multi-condition training. The proposed technique shows more robustness than other features, especially in noisy conditions.

Proceedings ArticleDOI
04 May 2014
TL;DR: In this article, a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus is presented, which can jointly capture the characteristics of the spoken terms.
Abstract: This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered on different points properly distributed over this three-dimensional space are complementary to one another, thus can jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only the on-line computation load can be reduced, but the signal distributions caused by different speakers and acoustic conditions can be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus.

Proceedings Article
01 Apr 2014
TL;DR: In this article, a primal-dual training method was proposed that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics.
Abstract: We present an architecture of a recurrent neural network (RNN) with a fully-connected deep neural network (DNN) as its feature extractor. The RNN is equipped with both causal temporal prediction and non-causal look-ahead, via auto-regression (AR) and moving-average (MA), respectively. The focus of this paper is a primal-dual training method that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics. Experimental results demonstrate the effectiveness of this new method, which achieves 18.86% phone recognition error on the TIMIT benchmark for the core test set. The result approaches the best result of 17.7%, which was obtained by using RNN with long short-term memory (LSTM). The results also show that the proposed primal-dual training method produces lower recognition errors than the popular RNN methods developed earlier based on the carefully tuned threshold parameter that heuristically prevents the gradient from exploding.

Proceedings ArticleDOI
14 Sep 2014
TL;DR: Experimental results on the TIMIT speech corpus show that the ISA features can provide a relative 13.5% improvement in mean average precision over the baseline features, when the temporal context information is used.
Abstract: We investigate the use of intrinsic spectral analysis (ISA) for query-by-example spoken term detection (QbE-STD). In the task, spoken queries and test utterances in an audio archive are converted to ISA features, and dynamic time warping is applied to match the feature sequence in each query with those in test utterances. Motivated by manifold learning, ISA has been proposed to recover from untranscribed utterances a set of nonlinear basis functions for the speech manifold, and shown with improved phonetic separability and inherent speaker independence. Due to the coarticulation phenomenon in speech, we propose to use temporal context information to obtain the ISA features. Gaussian posteriorgram, as an efficient acoustic representation usually used in QbE-STD, is considered a baseline feature. Experimental results on the TIMIT speech corpus show that the ISA features can provide a relative 13.5% improvement in mean average precision over the baseline features, when the temporal context information is used. Index Terms: spoken term detection, intrinsic spectral analysis, Gaussian posteriorgram, dynamic time warping
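A plain DTW distance between two feature sequences, used here only as a generic stand-in for the matcher; the paper's segmental or subsequence variant may differ:

```python
import numpy as np

def dtw_distance(query, utterance):
    """Dynamic-time-warping distance between two feature sequences
    (rows are frames), with the usual three-way recursion."""
    Q, U = len(query), len(utterance)
    cost = np.full((Q + 1, U + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            d = np.linalg.norm(query[i - 1] - utterance[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Q, U]
```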

Proceedings ArticleDOI
14 Sep 2014
TL;DR: A simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems.
Abstract: We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame from the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14-hour subset of Wall Street Journal (WSJ), using a context-dependent DNN-HMM system, it leads to a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
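A sketch of the test-time averaging step described above, assuming one analysis window per frame and the indexing convention that head h of the window centred on frame w predicts frame w + h - k//2:

```python
import numpy as np

def average_multiframe_predictions(window_posteriors, n_frames, k):
    """window_posteriors has shape (n_frames, k, n_states): the window centred
    on frame w carries k softmax heads, head h predicting frame w + h - k // 2.
    Average all predictions each frame receives from the different windows."""
    n_states = window_posteriors.shape[2]
    summed = np.zeros((n_frames, n_states))
    counts = np.zeros(n_frames)
    for w in range(window_posteriors.shape[0]):
        for h in range(k):
            t = w + h - k // 2
            if 0 <= t < n_frames:
                summed[t] += window_posteriors[w, h]
                counts[t] += 1
    return summed / counts[:, None]
```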

Journal ArticleDOI
TL;DR: The results show that the best SV accuracy was obtained with the MFCC + pH feature fusion, the MS/SS speech enhancement, and the Colored-MT multi-style training.
Abstract: This paper investigates the fusion of Mel-frequency cepstral coefficients (MFCC) and statistical pH features to improve the performance of speaker verification (SV) in non-stationary noise conditions. The α-integrated Gaussian Mixture Model (α-GMM) classifier is adopted for speaker modeling. Two different approaches are applied to reduce the effects of noise corruption in the SV task: speech enhancement and multi-style training (MT). The spectral subtraction with minimum statistics (MS/SS) and the optimally-modified log-spectral amplitude with improved minima controlled recursive averaging (IMCRA/OMLSA) are examined for the speech enhancement procedure. The MT techniques are based on colored (Colored-MT), white (White-MT) and narrow-band (Narrow-MT) noises. Six real non-stationary noises, collected from different acoustic sources, are used to corrupt the TIMIT speech database in four different signal-to-noise ratios (SNR). The index of non-stationarity (INS) is chosen for the stationarity tests of the acoustic noises. Complementary SV experiments are conducted in realistic noisy conditions using the MIT database. The results show that the best SV accuracy was obtained with the MFCC + pH features fusion, the MS/SS and the Colored-MT.

Proceedings ArticleDOI
14 Sep 2014
TL;DR: This paper explores the optimal multi-resolution time and frequency scattering operations for LVCSR tasks, along with techniques to reduce the dimension of the DSS features; the DSS features are shown to be similar to multi-resolution log-mel + MFCCs, with which similar improvements can be obtained.
Abstract: Log-mel filterbank features, which are commonly used features for CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as Deep Scattering Spectrum (DSS), addresses this issue and looks to preserve this information. DSS features have shown promise on TIMIT, both for classification and recognition. In this paper, we extend the use of DSS features for LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on a 50 and 430 hour English Broadcast News task show that the DSS features provide between a 4-7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework which incorporates speaker-adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCCs, and similar improvements can be obtained with this representation.

Journal ArticleDOI
TL;DR: A new filter structure using admissible wavelet packets is analyzed for English phoneme recognition; it has the benefit of frequency band spacing similar to the auditory Equivalent Rectangular Bandwidth (ERB) scale.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: In this article, a speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD) is proposed, which applies SVD on the weight matrices in trained DNNs, and then tunes diagonal matrices with the adaptation data.
Abstract: Recently several speaker adaptation methods have been proposed for deep neural network (DNN) in many large vocabulary continuous speech recognition (LVCSR) tasks. However, only a few methods rely on tuning the weight matrices in trained DNNs to optimize system performance since it is very prone to over-fitting especially when some class labels are missing in the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD on the weight matrices in trained DNNs, and then tune diagonal matrices with the adaptation data. This solves the over-fitting problem since we can change the weight matrices slightly by only modifying the singular values. We evaluate the proposed adaptation method in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that it is effective to adapt large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed SVD-based adaptation method may achieve up to 3-6% relative error reduction using only a few dozens of adaptation utterances per speaker.
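A minimal sketch of the SVD-based adaptation: factor a trained weight matrix, keep the singular vectors fixed, and adapt only the singular values; representing the per-speaker update as a simple scaling vector is an assumption:

```python
import numpy as np

def svd_adapt(W, singular_value_scales):
    """Factor a trained weight matrix as W = U diag(s) V^T and adapt only the
    singular values with a learned per-speaker scaling, keeping U and V fixed."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * (s * singular_value_scales)) @ Vt
```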

Proceedings ArticleDOI
04 May 2014
TL;DR: Experiments show that the proposed TNMF-based methods outperform traditional NMF-based methods for separating the monophonic mixtures of speech signals of known speakers.
Abstract: Regarding the non-negativity property of the magnitude spectrogram of speech signals, nonnegative matrix factorization (NMF) has obtained promising performance for speech separation by independently learning a dictionary on the speech signals of each known speaker. However, traditional NMF fails to represent the mixture signals accurately because the dictionaries for speakers are learned in the absence of mixture signals. In this paper, we propose a new transductive NMF algorithm (TNMF) to jointly learn a dictionary on both speech signals of each speaker and the mixture signals to be separated. Since TNMF learns a more descriptive dictionary by encoding the mixture signals than that learned by NMF, it significantly boosts the separation performance. Experimental results on a popular TIMIT dataset show that the proposed TNMF-based methods outperform traditional NMF-based methods for separating the monophonic mixtures of speech signals of known speakers.
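For reference, a sketch of the traditional NMF separator the paper improves on: activations are inferred against fixed per-speaker dictionaries and the mixture is split with a Wiener-style mask (TNMF would additionally adapt the dictionaries on the mixture itself):

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, seed=0):
    """Multiplicative-update NMF activations H for a fixed dictionary W (V is approximated by W @ H)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + 1e-12)
    return H

def separate(mixture_mag, W1, W2):
    """Split the mixture magnitude spectrogram with a Wiener-style mask built
    from the two speakers' dictionaries."""
    W = np.hstack([W1, W2])
    H = nmf_activations(mixture_mag, W)
    V1 = W1 @ H[:W1.shape[1]]
    V2 = W2 @ H[W1.shape[1]:]
    total = V1 + V2 + 1e-12
    return mixture_mag * V1 / total, mixture_mag * V2 / total
```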

Proceedings ArticleDOI
04 May 2014
TL;DR: A hybrid architecture that combines a phonetic model with an arbitrary frame-level acoustic model is introduced and it is suggested that this approach can readily replace HMMs in current state-of-the-art systems.
Abstract: In this paper, we investigate phone sequence modeling with recurrent neural networks in the context of speech recognition. We introduce a hybrid architecture that combines a phonetic model with an arbitrary frame-level acoustic model and we propose efficient algorithms for training, decoding and sequence alignment. We evaluate the advantage of our phonetic model on the TIMIT and Switchboard-mini datasets in complementarity to a powerful context-dependent deep neural network (DNN) acoustic classifier and a higher-level 3-gram language model. Consistent improvements of 2-10% in phone accuracy and 3% in word error rate suggest that our approach can readily replace HMMs in current state-of-the-art systems.

Proceedings ArticleDOI
Hagai Aronowitz, Asaf Rendel
14 Sep 2014
TL;DR: This work investigates the ability to build high accuracy text-dependent systems using no data at all from the target domain, and introduces several techniques addressing both lexical mismatch and channel mismatch.
Abstract: Recently we have investigated the use of state-of-the-art text-dependent speaker verification algorithms for user authentication and obtained satisfactory results mainly by using a fair amount of text-dependent development data from the target domain. In this work we investigate the ability to build high accuracy text-dependent systems using no data at all from the target domain. Instead of using target domain data, we use resources such as TIMIT, Switchboard, and NIST data. We introduce several techniques addressing both lexical mismatch and channel mismatch. These techniques include synthesizing a universal background model according to lexical content, automatic filtering of irrelevant phonetic content, exploiting information in residual supervectors (usually discarded in the i-vector framework), and inter-dataset variability modeling. These techniques reduce verification error significantly, and also improve accuracy when target domain data is available.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: This research arrives at the 'yes' answer by presenting a forward-backward learning algorithm for the DNN-HMM framework, and a training procedure is proposed in which the training of the context-independent (CI) DNN-HMM is treated as the pre-training for the context-dependent (CD) DNN-HMM.
Abstract: Recently, deep neural network (DNN) with hidden Markov model (HMM) has turned out to be a superior sequence learning framework, based on which significant improvements were achieved in many application tasks, such as automatic speech recognition (ASR). However, the training of DNN-HMM requires the pre-segmented training data, which can be generated using Gaussian Mixture Model (GMM) in ASR tasks. Thus, questions are raised by many researchers: can we train the DNN-HMM without GMM seeding, and what does it suggest if the answer is yes? In this research, we come up with the ‘yes’ answer by presenting forward-backward learning algorithm for DNN-HMM framework. Besides, a training procedure is proposed, in which, the training for context independent (CI) DNN-HMM is treated as the pre-training for context dependent (CD) DNN-HMM. To evaluate the contribution of this work, experiments on ASR task with the benchmark corpus TIMIT are performed, and the results demonstrate the effectiveness of this research.

Proceedings ArticleDOI
04 May 2014
TL;DR: In the experiments for non-stationary noisy observations using the 26 multi-condition TIMIT parallel speech corpus, the proposed search method found the segments almost in real-time without degrading the quality of the enhanced speech.
Abstract: Corpus-based speech enhancement has received increasing attention recently since it shows high enhancement performance in highly non-stationary noisy environments by precisely modeling the long-term temporal dynamics of speech. However, it has a disadvantage in that the cost is very high for searching the longest matching clean speech segments from a multi-condition parallel speech corpus. This paper proposes a fast segment search method for corpus-based speech enhancement. It is mainly based on two techniques derived from speech recognition technology. The first is an A* search like segment evaluation function for accurately finding the longest matching segments. The second is a tree and linear connected search space for efficiently sharing the segment likelihood calculations. In the experiments for non-stationary noisy observations using the 26 multi-condition TIMIT parallel speech corpus, the proposed search method found the segments almost in real-time without degrading the quality of the enhanced speech. Our method was about 7 to 13 times faster than the conventional segment search method.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: It is demonstrated that the proposed RF-PDT+CD-DNN based EAM significantly outperforms the CD- DNN based single acoustic model (SAM) in phone and word recognition accuracies.
Abstract: We propose an RF-PDT+CD-DNN approach to generate an ensemble of context-dependent pre-trained deep neural networks (CD-DNNs) using random forests of phonetic decision trees (RF-PDTs) and constructing a CD-DNN-HMM-based ensemble acoustic model (EAM). We present evaluation results on the TIMIT dataset and a telemedicine automatic captioning dataset and demonstrate that the proposed RF-PDT+CD-DNN based EAM significantly outperforms the CD-DNN based single acoustic model (SAM) in phone and word recognition accuracies.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work proposes a new type of auto-encoder for feature learning called contrastive auto- Encoder, able to leverage class labels in constructing its representation layer by modeling two autoencoders together and making their differences contribute to the total loss function.
Abstract: Speech data typically contains task-irrelevant information lying within the features. Specifically, phonetic information, speaker characteristics, emotional information and noise are mixed together and tend to impair one another for a given task. We propose a new type of auto-encoder for feature learning called the contrastive auto-encoder. Unlike other variants of auto-encoders, the contrastive auto-encoder is able to leverage class labels in constructing its representation layer. We achieve this by modeling two auto-encoders together and making their differences contribute to the total loss function. The transformation built with the contrastive auto-encoder can be seen as a task-specific and invariant feature learner. Our experiments on TIMIT clearly show the superiority of features extracted from the contrastive auto-encoder over the original acoustic features, features extracted from a deep auto-encoder, and features extracted from the model that the contrastive auto-encoder originates from.

Proceedings ArticleDOI
03 Apr 2014
TL;DR: The aim of the proposed method is to reduce the background noise present in the speech signal by using spectral subtraction techniques, and enhanced speech is obtained.
Abstract: Speech enhancement is a technique used to reduce the background noise present in the speech signal; it improves the intelligibility and quality of degraded speech. The noises present in a speech signal include additive noise, echo, reverberation and speaker interference. The aim of the proposed method is to reduce the background noise present in the speech signal using spectral subtraction techniques: the magnitude spectrum of the estimated noise is subtracted from the spectrum of the noisy speech signal. Five clean speech samples are used as test speech, and sample noises such as pink noise, white noise and Volvo noise are taken from the TIMIT and NOIZEUS corpora. Enhanced speech is obtained using non-linear spectral subtraction and multiband spectral subtraction techniques. The performance of the two methods is compared using two parameters, namely signal-to-noise ratio and log spectral distance.
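A minimal sketch of magnitude spectral subtraction with an over-subtraction factor and a spectral floor; parameter values are illustrative, and the multiband variant would apply band-dependent factors:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_estimate, alpha=2.0, beta=0.02):
    """Magnitude spectral subtraction with over-subtraction factor alpha and a
    spectral floor beta, applied to STFT magnitude frames."""
    clean_est = noisy_mag - alpha * noise_mag_estimate
    return np.maximum(clean_est, beta * noisy_mag)   # floor prevents negative magnitudes
```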