
Showing papers on "TIMIT published in 2014"


Journal ArticleDOI
TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Abstract: Recently, the hybrid deep neural network (DNN)- hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure such as local connectivity, weight sharing, and pooling in CNNs exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important to deal with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
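A minimal numpy sketch of the limited-weight-sharing idea, assuming separate filters per frequency band with max-pooling inside each band; the band size, filter counts, and toy usage below are illustrative, not the paper's configuration:

```python
import numpy as np

def limited_weight_sharing_conv(spectrogram, filters_per_band, band_size=8, filter_width=4):
    """Convolution with limited weight sharing: the frequency axis is split into
    bands and each band has its own filter weights (shared only within that band),
    followed by max-pooling over the frequency shifts inside the band."""
    n_freq, n_frames = spectrogram.shape
    n_bands = n_freq // band_size
    outputs = []
    for b in range(n_bands):
        band = spectrogram[b * band_size:(b + 1) * band_size, :]    # (band_size, n_frames)
        W = filters_per_band[b]                                     # (n_filters, filter_width)
        acts = [np.maximum(W @ band[f0:f0 + filter_width, :], 0.0)  # ReLU activation per shift
                for f0 in range(band_size - filter_width + 1)]
        outputs.append(np.max(np.stack(acts, axis=0), axis=0))      # pool within the band
    return np.concatenate(outputs, axis=0)

# toy usage with random weights (shapes are illustrative only)
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))                               # 40 mel bins, 100 frames
filters = [rng.standard_normal((16, 4)) * 0.1 for _ in range(5)]    # one filter set per band
print(limited_weight_sharing_conv(spec, filters).shape)             # (80, 100)
```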

1,948 citations


Proceedings Article
08 Dec 2014
TL;DR: This paper empirically demonstrates that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models.
Abstract: Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.
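A hedged sketch of one way such a mimic can be trained: a single-hidden-layer net regressed onto the logits of a (hypothetical) deep teacher with a squared-error loss; the layer size, learning rate, and epoch count are placeholders:

```python
import numpy as np

def train_shallow_mimic(X, teacher_logits, n_hidden=2000, lr=1e-3, epochs=10, seed=0):
    """Fit a single-hidden-layer net to regress the logits of a deep teacher model
    (squared error on logits rather than cross-entropy on hard labels)."""
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], teacher_logits.shape[1]
    W1 = rng.standard_normal((d, n_hidden)) * 0.01
    W2 = rng.standard_normal((n_hidden, k)) * 0.01
    for _ in range(epochs):
        H = np.maximum(X @ W1, 0.0)                # hidden ReLU activations
        G = (H @ W2 - teacher_logits) / len(X)     # gradient of 0.5 * mean squared error
        dW2 = H.T @ G
        dW1 = X.T @ ((G @ W2.T) * (H > 0))
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2
```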

1,526 citations


Posted Content
TL;DR: Initial results demonstrate that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
Abstract: We replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
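A minimal sketch of one attention step, assuming additive scoring (the paper's exact scoring function may differ); W_q, W_k, and v are hypothetical projection parameters:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, W_q, W_k, v):
    """One decoding step of additive attention: score each encoder frame against
    the current decoder state, normalise with a softmax, and return the context
    vector used to emit the next phoneme together with the soft alignment."""
    scores = np.tanh(decoder_state @ W_q + encoder_states @ W_k) @ v   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                           # attention over input frames
    return weights @ encoder_states, weights                           # context vector, alignment
```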

483 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint, is proposed to enhance the separation performance of monaural speech separation models.
Abstract: Monaural source separation is useful for many real-world applications though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8~4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.
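One common form such a masking layer can take is a soft ratio mask that forces the two estimates to sum back to the mixture; whether this matches the authors' exact formulation is an assumption:

```python
import numpy as np

def soft_mask_layer(y1_tilde, y2_tilde, mixture_mag):
    """Turn the two network output streams into a time-frequency soft mask and
    re-weight the mixture magnitude spectrogram, so that the two source
    estimates sum back to the mixture (the reconstruction constraint)."""
    eps = 1e-8
    mask1 = np.abs(y1_tilde) / (np.abs(y1_tilde) + np.abs(y2_tilde) + eps)
    s1_hat = mask1 * mixture_mag
    s2_hat = (1.0 - mask1) * mixture_mag     # s1_hat + s2_hat is approximately mixture_mag
    return s1_hat, s2_hat
```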

445 citations


Proceedings Article
21 Jun 2014
TL;DR: This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
Abstract: Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when long-term memory is required. This paper introduces a simple, yet powerful modification to the simple RNN (SRN) architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate. Rather than making the standard RNN models more complex, CW-RNN reduces the number of SRN parameters, improves the performance significantly in the tasks tested, and speeds up the network evaluation. The network is demonstrated in preliminary experiments involving three tasks: audio signal generation, TIMIT spoken word classification, where it outperforms both SRN and LSTM networks, and online handwriting recognition, where it outperforms SRNs.
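A sketch of a single Clockwork-RNN update, assuming exponentially spaced periods and a recurrent matrix already masked so that each module reads only from modules with an equal or longer period:

```python
import numpy as np

def cw_rnn_step(h, x, t, W_h, W_x, periods, block_size):
    """One Clockwork-RNN update. The hidden state is partitioned into modules
    with clock periods `periods` (e.g. 1, 2, 4, 8, ...); module i is recomputed
    only when t is divisible by its period, otherwise it keeps its previous
    value. W_h is assumed pre-masked so each module reads only from modules
    with an equal or longer period."""
    h_new = h.copy()
    for i, period in enumerate(periods):
        if t % period == 0:
            rows = slice(i * block_size, (i + 1) * block_size)
            h_new[rows] = np.tanh(W_h[rows] @ h + W_x[rows] @ x)
    return h_new
```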

335 citations


Journal ArticleDOI
TL;DR: A general adaptation scheme for DNNs based on discriminant condition codes is proposed; the codes are fed directly to various layers of a pre-trained DNN through a new set of connection weights and are quite effective for adapting large DNN models using only a small amount of adaptation data.
Abstract: Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we have proposed a general adaptation scheme for DNN based on discriminant condition codes, which are directly fed to various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn connection weights from training data as well as the corresponding adaptation methods to learn a new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either the frame-level cross-entropy or the sequence-level maximum mutual information training criterion. We have proposed three different ways to apply this adaptation scheme based on the so-called speaker codes: i) nonlinear feature normalization in feature space; ii) direct model adaptation of the DNN based on speaker codes; iii) joint speaker adaptive training with speaker codes. We have evaluated the proposed adaptation methods in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that all three methods are quite effective for adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we have achieved very good performance in Switchboard (12.1% WER) after speaker adaptation using the sequence training criterion, which is very close to the best performance reported in this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE Acoust., Speech, Signal Process., 2013).
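A toy sketch of the speaker-code idea: a low-dimensional code injected into each hidden layer through extra connection weights; the layer structure and ReLU nonlinearity are assumptions, not the authors' exact setup:

```python
import numpy as np

def forward_with_speaker_code(x, speaker_code, layers, code_weights):
    """Forward pass of a pre-trained DNN whose hidden layers each receive an
    extra input: a low-dimensional speaker code projected through a new set of
    connection weights B. Adaptation tunes the code (and/or B) on a few
    utterances while the original weights (W, b) stay fixed."""
    h = x
    for (W, b), B in zip(layers, code_weights):
        h = np.maximum(W @ h + B @ speaker_code + b, 0.0)   # ReLU hidden layer
    return h
```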

157 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression are developed and it is demonstrated that these schemes enable kernel methods to match the performance of state of the art Deep Neural Networks on TIMIT for speech recognition and classification tasks.
Abstract: Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale-up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature space data matrix, while the second scheme gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state of the art Deep Neural Networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
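As a generic stand-in for the randomized approximate feature maps discussed above (not the paper's distributed block coordinate descent or ensemble schemes), a random Fourier feature map followed by ridge regression:

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, seed=0):
    """Random Fourier feature map approximating a Gaussian kernel
    exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def kernel_ridge_fit(Phi, Y, lam=1e-3):
    """Ridge regression in the randomized feature space (Y holds one-hot targets)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
```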

135 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined; the combined model achieves an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Abstract: Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes - time and frequency - should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, along with their advantages. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
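A hypothetical PyTorch topology illustrating the combination of frequency-axis and time-axis convolution; the layer sizes, pooling, and classifier head are invented for illustration and are not the authors' architecture:

```python
import torch
import torch.nn as nn

class TimeFreqConvNet(nn.Module):
    """Convolve and pool along the frequency axis first, then convolve along
    time over the resulting feature maps, before a small classifier head."""
    def __init__(self, n_classes=39):
        super().__init__()
        self.freq_conv = nn.Conv2d(1, 64, kernel_size=(8, 1))    # frequency-axis filters
        self.freq_pool = nn.MaxPool2d(kernel_size=(3, 1))        # pool over frequency shifts
        self.time_conv = nn.Conv2d(64, 128, kernel_size=(1, 5))  # time-axis filters
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, n_classes))

    def forward(self, spec):                      # spec: (batch, 1, n_mel, n_frames)
        h = torch.relu(self.freq_conv(spec))
        h = self.freq_pool(h)
        h = torch.relu(self.time_conv(h))
        return self.head(h)
```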

87 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Results show that this novel and entirely unsupervised approach to selecting subsets of acoustic data for training phone recognizers consistently outperforms a number of baseline methods while being computationally very efficient and requiring no labeling.
Abstract: We conduct a comparative study on selecting subsets of acoustic data for training phone recognizers. The data selection problem is approached as a constrained submodular optimization problem. Previous applications of this approach required transcriptions or acoustic models trained in a supervised way. In this paper we develop and evaluate a novel and entirely unsupervised approach, and apply it to TIMIT data. Results show that our method consistently outperforms a number of baseline methods while being computationally very efficient and requiring no labeling.
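A sketch of greedy selection under a facility-location submodular objective, one common instantiation of the constrained submodular optimization mentioned above; the choice of objective and the use of a precomputed acoustic similarity matrix are assumptions:

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedy maximisation of the facility-location objective
    f(S) = sum_i max_{j in S} similarity[i, j] under a size budget,
    returning the indices of the selected utterances."""
    n = similarity.shape[0]
    selected, cover = [], np.zeros(n)
    for _ in range(budget):
        gains = np.maximum(similarity, cover[:, None]).sum(axis=0) - cover.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, similarity[:, j])   # best coverage of each item so far
    return selected
```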

50 citations


Journal ArticleDOI
TL;DR: A rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates and gives a performance comparable to or better than the state-of-the-art methods.
Abstract: Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages.
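A heavily hedged sketch of a plosion-index-style measure as described above, i.e. the sample amplitude at a candidate instant divided by the average amplitude in a preceding window; the window and skip lengths are assumptions and may differ from the authors' definition:

```python
import numpy as np

def plosion_index(signal, n, fs, win_ms=16.0, skip_ms=6.0):
    """Ratio of the absolute sample value at candidate instant n to the average
    absolute amplitude in a short window preceding it (after skipping a small
    gap), so an abrupt rise in energy after a closure yields a large value."""
    skip = int(skip_ms * 1e-3 * fs)
    win = int(win_ms * 1e-3 * fs)
    start, end = max(n - skip - win, 0), n - skip
    if end <= start:
        return 0.0
    return np.abs(signal[n]) / (np.mean(np.abs(signal[start:end])) + 1e-12)
```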

41 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: This paper identifies effective normalization, neural network topology and regularization techniques for modeling the higher-order scatter, resulting in a relative improvement of 7% over log-mel features on TIMIT and a phonetic error rate of 17.4%, one of the lowest PERs reported to date on this task.
Abstract: State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modelling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work investigates techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models and finds that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps.
Abstract: Accurate phone-level segmentation of speech remains an important task for many subfields of speech research. We investigate techniques for boosting the accuracy of automatic phonetic segmentation based on HMM acoustic-phonetic models. In prior work [25] we were able to improve on state-of-the-art alignment accuracy by employing special phone boundary HMM models, trained on phonetically segmented training data, in conjunction with a simple boundary-time correction model. Here we present further improved results by using more powerful statistical models for boundary correction that are conditioned on phonetic context and duration features. Furthermore, we find that combining multiple acoustic front-ends gives additional gains in accuracy, and that conditioning the combiner on phonetic context and side information helps. Overall, we reduce segmentation errors on the TIMIT corpus by almost one half, from 93.9% to 96.8% boundary accuracy with a 20-ms tolerance.

Journal ArticleDOI
TL;DR: The development of an RC-HMM hybrid that provides good recognition on the Wall Street Journal benchmark is described, and given that RC-based acoustic modeling is a fairly new approach, these results open up promising perspectives.
Abstract: Thanks to research in neural network based acoustic modeling, progress in Large Vocabulary Continuous Speech Recognition (LVCSR) seems to have gained momentum recently. In search for further progress, the present letter investigates Reservoir Computing (RC) as an alternative new paradigm for acoustic modeling. RC unifies the appealing dynamical modeling capacity of a Recurrent Neural Network (RNN) with the simplicity and robustness of linear regression as a model for training the weights of that network. In previous work, an RC-HMM hybrid yielding very good phone recognition accuracy on TIMIT could be designed, but no proof was offered yet that this success would also transfer to LVCSR. This letter describes the development of an RC-HMM hybrid that provides good recognition on the Wall Street Journal benchmark. For the WSJ0 5k word task, word error rates of 6.2% (bigram language model) and 3.9% (trigram) are obtained on the Nov-92 evaluation set. Given that RC-based acoustic modeling is a fairly new approach, these results open up promising perspectives.
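A minimal reservoir-computing sketch: a fixed random recurrent network produces states and only a linear ridge-regression readout is trained; the leak rate and regularizer are illustrative:

```python
import numpy as np

def reservoir_states(inputs, W_in, W_res, leak=0.3):
    """Drive a fixed random reservoir with the acoustic feature sequence;
    only the linear readout on top of these states is trained."""
    n = W_res.shape[0]
    H = np.zeros((len(inputs), n))
    h = np.zeros(n)
    for t, x in enumerate(inputs):
        h = (1.0 - leak) * h + leak * np.tanh(W_in @ x + W_res @ h)
        H[t] = h
    return H

def train_readout(H, targets, lam=1e-6):
    """Ridge-regression readout mapping reservoir states to HMM state targets."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ targets)
```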

Journal ArticleDOI
TL;DR: This frontend feature processing technique employs an equivalent rectangular bandwidth (ERB) filter-like wavelet speech feature extraction method called Wavelet ERB Sub-band based Periodicity and Aperiodicity Decomposition (WERB-SPADE), and its validity is examined on the TIMIT phone recognition task in noisy environments.
Abstract: In recent years, the wavelet transform has been found to be an effective tool for the time-frequency analysis of non-stationary and quasi-stationary signals such as speech, and it has been used for feature extraction in speech recognition applications. Here we propose a wavelet-based feature extraction technique that captures both the periodic and aperiodic information, along with the sub-band instantaneous frequency of the speech signal, for robust speech recognition in noisy environments. This technique is based on a parallel distributed processing scheme inspired by the human speech perception process. The frontend feature processing employs an equivalent rectangular bandwidth (ERB) filter-like wavelet feature extraction method called Wavelet ERB Sub-band based Periodicity and Aperiodicity Decomposition (WERB-SPADE), and we examine its validity on the TIMIT phone recognition task in noisy environments. The speech is filtered by a 24-band ERB-like wavelet filter bank, and the equal-loudness pre-emphasized output of each band is processed through a comb filter. Each comb filter is designed individually for its frequency sub-band to decompose the signal into periodic and aperiodic features. The technique thus retains the robustness of periodic features without losing important information, such as formant transitions, carried by the aperiodic features. Speech recognition experiments with a standard HMM recognizer are conducted under both clean-training and multi-condition training. The proposed technique shows more robustness than other features, especially in noisy conditions.

Proceedings ArticleDOI
04 May 2014
TL;DR: In this article, a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus is presented, which can jointly capture the characteristics of the spoken terms.
Abstract: This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered on different points properly distributed over this three-dimensional space are complementary to one another, thus can jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only the on-line computation load can be reduced, but the signal distributions caused by different speakers and acoustic conditions can be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus.

Proceedings Article
01 Apr 2014
TL;DR: In this article, a primal-dual training method was proposed that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics.
Abstract: We present an architecture of a recurrent neural network (RNN) with a fully-connected deep neural network (DNN) as its feature extractor. The RNN is equipped with both causal temporal prediction and non-causal look-ahead, via auto-regression (AR) and moving-average (MA), respectively. The focus of this paper is a primal-dual training method that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics. Experimental results demonstrate the effectiveness of this new method, which achieves 18.86% phone recognition error on the TIMIT benchmark for the core test set. The result approaches the best result of 17.7%, which was obtained by using RNN with long short-term memory (LSTM). The results also show that the proposed primal-dual training method produces lower recognition errors than the popular RNN methods developed earlier based on the carefully tuned threshold parameter that heuristically prevents the gradient from exploding.

Proceedings ArticleDOI
14 Sep 2014
TL;DR: Experimental results on the TIMIT speech corpus show that the ISA features can provide a relative 13.5% improvement in mean average precision over the baseline features, when the temporal context information is used.
Abstract: We investigate the use of intrinsic spectral analysis (ISA) for query-by-example spoken term detection (QbE-STD). In the task, spoken queries and test utterances in an audio archive are converted to ISA features, and dynamic time warping is applied to match the feature sequence in each query with those in test utterances. Motivated by manifold learning, ISA has been proposed to recover from untranscribed utterances a set of nonlinear basis functions for the speech manifold, and shown with improved phonetic separability and inherent speaker independence. Due to the coarticulation phenomenon in speech, we propose to use temporal context information to obtain the ISA features. Gaussian posteriorgram, as an efficient acoustic representation usually used in QbE-STD, is considered a baseline feature. Experimental results on the TIMIT speech corpus show that the ISA features can provide a relative 13.5% improvement in mean average precision over the baseline features, when the temporal context information is used. Index Terms: spoken term detection, intrinsic spectral analysis, Gaussian posteriorgram, dynamic time warping
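A plain DTW distance between two feature sequences, used here only as a generic stand-in for the matcher; the paper's segmental or subsequence variant may differ:

```python
import numpy as np

def dtw_distance(query, utterance):
    """Dynamic-time-warping distance between two feature sequences
    (rows are frames), with the usual three-way recursion."""
    Q, U = len(query), len(utterance)
    cost = np.full((Q + 1, U + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            d = np.linalg.norm(query[i - 1] - utterance[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Q, U]
```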

Proceedings ArticleDOI
14 Sep 2014
TL;DR: A simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems.
Abstract: We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame from the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14-hour subset of Wall Street Journal (WSJ), using a context-dependent DNN-HMM system, it leads to a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
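A sketch of the test-time averaging step described above, assuming one analysis window per frame and the indexing convention that head h of the window centred on frame w predicts frame w + h - k//2:

```python
import numpy as np

def average_multiframe_predictions(window_posteriors, n_frames, k):
    """window_posteriors has shape (n_frames, k, n_states): the window centred
    on frame w carries k softmax heads, head h predicting frame w + h - k // 2.
    Average all predictions each frame receives from the different windows."""
    n_states = window_posteriors.shape[2]
    summed = np.zeros((n_frames, n_states))
    counts = np.zeros(n_frames)
    for w in range(window_posteriors.shape[0]):
        for h in range(k):
            t = w + h - k // 2
            if 0 <= t < n_frames:
                summed[t] += window_posteriors[w, h]
                counts[t] += 1
    return summed / counts[:, None]
```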

Journal ArticleDOI
TL;DR: The results show that the best SV accuracy was obtained with the MFCC + pH feature fusion, the MS/SS speech enhancement, and the Colored-MT multi-style training.
Abstract: This paper investigates the fusion of Mel-frequency cepstral coefficients (MFCC) and statistical pH features to improve the performance of speaker verification (SV) in non-stationary noise conditions. The α-integrated Gaussian Mixture Model (α-GMM) classifier is adopted for speaker modeling. Two different approaches are applied to reduce the effects of noise corruption in the SV task: speech enhancement and multi-style training (MT). The spectral subtraction with minimum statistics (MS/SS) and the optimally-modified log-spectral amplitude with improved minima controlled recursive averaging (IMCRA/OMLSA) are examined for the speech enhancement procedure. The MT techniques are based on colored (Colored-MT), white (White-MT) and narrow-band (Narrow-MT) noises. Six real non-stationary noises, collected from different acoustic sources, are used to corrupt the TIMIT speech database in four different signal-to-noise ratios (SNR). The index of non-stationarity (INS) is chosen for the stationarity tests of the acoustic noises. Complementary SV experiments are conducted in realistic noisy conditions using the MIT database. The results show that the best SV accuracy was obtained with the MFCC + pH features fusion, the MS/SS and the Colored-MT.

Proceedings ArticleDOI
14 Sep 2014
TL;DR: This paper explores the optimal multi-resolution time and frequency scattering operations for LVCSR tasks, along with techniques to reduce the dimension of the DSS features; the DSS features are shown to be similar to multi-resolution log-mel + MFCCs, with which similar improvements can be obtained.
Abstract: Log-mel filterbank features, which are commonly used features for CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as Deep Scattering Spectrum (DSS), addresses this issue and looks to preserve this information. DSS features have shown promise on TIMIT, both for classification and recognition. In this paper, we extend the use of DSS features for LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on a 50 and 430 hour English Broadcast News task show that the DSS features provide between a 4-7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework which incorporates speaker-adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCCs, and similar improvements can be obtained with this representation.

Journal ArticleDOI
TL;DR: A new filter structure using admissible wavelet packets is analyzed for English phoneme recognition; it has the benefit of frequency band spacing similar to the auditory Equivalent Rectangular Bandwidth (ERB) scale.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: In this article, a speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD) is proposed, which applies SVD on the weight matrices in trained DNNs, and then tunes diagonal matrices with the adaptation data.
Abstract: Recently several speaker adaptation methods have been proposed for deep neural network (DNN) in many large vocabulary continuous speech recognition (LVCSR) tasks. However, only a few methods rely on tuning the weight matrices in trained DNNs to optimize system performance since it is very prone to over-fitting especially when some class labels are missing in the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD on the weight matrices in trained DNNs, and then tune diagonal matrices with the adaptation data. This solves the over-fitting problem since we can change the weight matrices slightly by only modifying the singular values. We evaluate the proposed adaptation method in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that it is effective to adapt large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed SVD-based adaptation method may achieve up to 3-6% relative error reduction using only a few dozens of adaptation utterances per speaker.
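A minimal sketch of the SVD-based adaptation: factor a trained weight matrix, keep the singular vectors fixed, and adapt only the singular values; representing the per-speaker update as a simple scaling vector is an assumption:

```python
import numpy as np

def svd_adapt(W, singular_value_scales):
    """Factor a trained weight matrix as W = U diag(s) V^T and adapt only the
    singular values with a learned per-speaker scaling, keeping U and V fixed."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * (s * singular_value_scales)) @ Vt
```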

Proceedings ArticleDOI
04 May 2014
TL;DR: Experiments show that the proposed TNMF-based methods outperform traditional NMF-based methods for separating the monophonic mixtures of speech signals of known speakers.
Abstract: Regarding the non-negativity property of the magnitude spectrogram of speech signals, nonnegative matrix factorization (NMF) has obtained promising performance for speech separation by independently learning a dictionary on the speech signals of each known speaker. However, traditional NMF fails to represent the mixture signals accurately because the dictionaries for speakers are learned in the absence of mixture signals. In this paper, we propose a new transductive NMF algorithm (TNMF) to jointly learn a dictionary on both speech signals of each speaker and the mixture signals to be separated. Since TNMF learns a more descriptive dictionary by encoding the mixture signals than that learned by NMF, it significantly boosts the separation performance. Experimental results on a popular TIMIT dataset show that the proposed TNMF-based methods outperform traditional NMF-based methods for separating the monophonic mixtures of speech signals of known speakers.
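For reference, a sketch of the traditional NMF separator the paper improves on: activations are inferred against fixed per-speaker dictionaries and the mixture is split with a Wiener-style mask (TNMF would additionally adapt the dictionaries on the mixture itself):

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, seed=0):
    """Multiplicative-update NMF activations H for a fixed dictionary W (V is approximated by W @ H)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + 1e-12)
    return H

def separate(mixture_mag, W1, W2):
    """Split the mixture magnitude spectrogram with a Wiener-style mask built
    from the two speakers' dictionaries."""
    W = np.hstack([W1, W2])
    H = nmf_activations(mixture_mag, W)
    V1 = W1 @ H[:W1.shape[1]]
    V2 = W2 @ H[W1.shape[1]:]
    total = V1 + V2 + 1e-12
    return mixture_mag * V1 / total, mixture_mag * V2 / total
```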

Proceedings ArticleDOI
04 May 2014
TL;DR: A hybrid architecture that combines a phonetic model with an arbitrary frame-level acoustic model is introduced and it is suggested that this approach can readily replace HMMs in current state-of-the-art systems.
Abstract: In this paper, we investigate phone sequence modeling with recurrent neural networks in the context of speech recognition. We introduce a hybrid architecture that combines a phonetic model with an arbitrary frame-level acoustic model and we propose efficient algorithms for training, decoding and sequence alignment. We evaluate the advantage of our phonetic model on the TIMIT and Switchboard-mini datasets in complementarity to a powerful context-dependent deep neural network (DNN) acoustic classifier and a higher-level 3-gram language model. Consistent improvements of 2-10% in phone accuracy and 3% in word error rate suggest that our approach can readily replace HMMs in current state-of-the-art systems.

Proceedings ArticleDOI
Hagai Aronowitz, Asaf Rendel
14 Sep 2014
TL;DR: This work investigates the ability to build high accuracy text-dependent systems using no data at all from the target domain, and introduces several techniques addressing both lexical mismatch and channel mismatch.
Abstract: Recently we have investigated the use of state-of-the-art text-dependent speaker verification algorithms for user authentication and obtained satisfactory results mainly by using a fair amount of text-dependent development data from the target domain. In this work we investigate the ability to build high accuracy text-dependent systems using no data at all from the target domain. Instead of using target domain data, we use resources such as TIMIT, Switchboard, and NIST data. We introduce several techniques addressing both lexical mismatch and channel mismatch. These techniques include synthesizing a universal background model according to lexical content, automatic filtering of irrelevant phonetic content, exploiting information in residual supervectors (usually discarded in the i-vector framework), and inter-dataset variability modeling. These techniques reduce verification error significantly, and also improve accuracy when target domain data is available.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: This research arrives at the 'yes' answer by presenting a forward-backward learning algorithm for the DNN-HMM framework, and a training procedure is proposed in which the training of the context-independent (CI) DNN-HMM is treated as the pre-training for the context-dependent (CD) DNN-HMM.
Abstract: Recently, deep neural network (DNN) with hidden Markov model (HMM) has turned out to be a superior sequence learning framework, based on which significant improvements were achieved in many application tasks, such as automatic speech recognition (ASR). However, the training of DNN-HMM requires the pre-segmented training data, which can be generated using Gaussian Mixture Model (GMM) in ASR tasks. Thus, questions are raised by many researchers: can we train the DNN-HMM without GMM seeding, and what does it suggest if the answer is yes? In this research, we come up with the ‘yes’ answer by presenting forward-backward learning algorithm for DNN-HMM framework. Besides, a training procedure is proposed, in which, the training for context independent (CI) DNN-HMM is treated as the pre-training for context dependent (CD) DNN-HMM. To evaluate the contribution of this work, experiments on ASR task with the benchmark corpus TIMIT are performed, and the results demonstrate the effectiveness of this research.

Proceedings ArticleDOI
04 May 2014
TL;DR: In the experiments for non-stationary noisy observations using the 26 multi-condition TIMIT parallel speech corpus, the proposed search method found the segments almost in real-time without degrading the quality of the enhanced speech.
Abstract: Corpus-based speech enhancement has received increasing attention recently since it shows high enhancement performance in highly non-stationary noisy environments by precisely modeling the long-term temporal dynamics of speech. However, it has a disadvantage in that the cost is very high for searching the longest matching clean speech segments from a multi-condition parallel speech corpus. This paper proposes a fast segment search method for corpus-based speech enhancement. It is mainly based on two techniques derived from speech recognition technology. The first is an A* search like segment evaluation function for accurately finding the longest matching segments. The second is a tree and linear connected search space for efficiently sharing the segment likelihood calculations. In the experiments for non-stationary noisy observations using the 26 multi-condition TIMIT parallel speech corpus, the proposed search method found the segments almost in real-time without degrading the quality of the enhanced speech. Our method was about 7 to 13 times faster than the conventional segment search method.

Proceedings ArticleDOI
27 Oct 2014
TL;DR: It is demonstrated that the proposed RF-PDT+CD-DNN based EAM significantly outperforms the CD- DNN based single acoustic model (SAM) in phone and word recognition accuracies.
Abstract: We propose an RF-PDT+CD-DNN approach to generate an ensemble of context-dependent pre-trained deep neural networks (CD-DNNs) using random forests of phonetic decision trees (RF-PDTs) and constructing a CD-DNN-HMM-based ensemble acoustic model (EAM). We present evaluation results on the TIMIT dataset and a telemedicine automatic captioning dataset and demonstrate that the proposed RF-PDT+CD-DNN based EAM significantly outperforms the CD-DNN based single acoustic model (SAM) in phone and word recognition accuracies.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work proposes a new type of auto-encoder for feature learning called contrastive auto- Encoder, able to leverage class labels in constructing its representation layer by modeling two autoencoders together and making their differences contribute to the total loss function.
Abstract: Speech data typically contains task-irrelevant information lying within the features. Specifically, phonetic information, speaker characteristics, emotional information and noise are mixed together and tend to impair one another for a given task. We propose a new type of auto-encoder for feature learning called the contrastive auto-encoder. Unlike other variants of auto-encoders, the contrastive auto-encoder is able to leverage class labels in constructing its representation layer. We achieve this by modeling two auto-encoders together and making their differences contribute to the total loss function. The transformation built with the contrastive auto-encoder can be seen as a task-specific and invariant feature learner. Our experiments on TIMIT clearly show the superiority of features extracted from the contrastive auto-encoder over the original acoustic features, features extracted from a deep auto-encoder, and features extracted from the model that the contrastive auto-encoder originates from.

Proceedings ArticleDOI
03 Apr 2014
TL;DR: The aim of the proposed method is to reduce the background noise present in the speech signal by using spectral subtraction techniques, and enhanced speech is obtained.
Abstract: Speech enhancement is a technique used to reduce the background noise present in the speech signal; it improves the intelligibility and quality of degraded speech. The noises present in a speech signal include additive noise, echo, reverberation and speaker interference. The aim of the proposed method is to reduce the background noise present in the speech signal using spectral subtraction techniques: the magnitude spectrum of the estimated noise is subtracted from the spectrum of the noisy speech signal. Five clean speech samples are used as test speech, and sample noises such as pink noise, white noise and Volvo noise are taken from the TIMIT and NOIZEUS corpora. Enhanced speech is obtained using non-linear spectral subtraction and multiband spectral subtraction techniques. The performance of the two methods is compared using two parameters, namely signal-to-noise ratio and log spectral distance.
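A minimal sketch of magnitude spectral subtraction with an over-subtraction factor and a spectral floor; parameter values are illustrative, and the multiband variant would apply band-dependent factors:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_estimate, alpha=2.0, beta=0.02):
    """Magnitude spectral subtraction with over-subtraction factor alpha and a
    spectral floor beta, applied to STFT magnitude frames."""
    clean_est = noisy_mag - alpha * noise_mag_estimate
    return np.maximum(clean_est, beta * noisy_mag)   # floor prevents negative magnitudes
```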