
Showing papers on "TIMIT" published in 2016


Posted Content
TL;DR: Stochastic recurrent neural networks are introduced which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model.
Abstract: How can we efficiently propagate uncertainty in a latent state representation with recurrent neural networks? This paper introduces stochastic recurrent neural networks which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model. The clear separation of deterministic and stochastic layers allows a structured variational inference network to track the factorization of the model's posterior distribution. By retaining both the nonlinear recursive structure of a recurrent neural network and averaging over the uncertainty in a latent path, like a state space model, we improve the state of the art results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performances to competing methods on polyphonic music modeling.

269 citations
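
As a rough illustration of the idea above, here is a minimal PyTorch sketch of the generative path of such a model: a deterministic GRU produces d_t, and a Gaussian latent z_t depends on both d_t and the previous latent z_{t-1}. The dimensions, the single-linear-layer prior and decoder, and the class name are illustrative assumptions; the paper's structured variational inference network is not shown.

```python
# Minimal sketch (not the authors' code) of the SRNN generative path:
# a deterministic GRU provides d_t, and a Gaussian latent z_t depends on
# both d_t and z_{t-1}, mimicking the "RNN glued to a state space model" idea.
import torch
import torch.nn as nn

class SRNNGenerativeSketch(nn.Module):
    def __init__(self, x_dim=88, d_dim=128, z_dim=32):
        super().__init__()
        self.gru = nn.GRU(x_dim, d_dim, batch_first=True)   # deterministic layer
        self.prior = nn.Linear(d_dim + z_dim, 2 * z_dim)     # p(z_t | z_{t-1}, d_t)
        self.decoder = nn.Linear(d_dim + z_dim, x_dim)       # p(x_t | z_t, d_t)

    def forward(self, x):
        # x: (batch, time, x_dim); conditioning on the previous inputs via the GRU
        d, _ = self.gru(x)
        z_prev = torch.zeros(x.size(0), self.prior.out_features // 2)
        recons = []
        for t in range(x.size(1)):
            stats = self.prior(torch.cat([d[:, t], z_prev], dim=-1))
            mu, logvar = stats.chunk(2, dim=-1)
            z_t = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterised sample
            recons.append(self.decoder(torch.cat([d[:, t], z_t], dim=-1)))
            z_prev = z_t
        return torch.stack(recons, dim=1)

model = SRNNGenerativeSketch()
out = model(torch.randn(4, 20, 88))   # e.g. 4 sequences of 20 frames
```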


Proceedings Article
01 Jan 2016
TL;DR: In this paper, a stochastic recurrent neural network (SRNN) is proposed that glues a deterministic RNN and a state space model together in order to efficiently propagate uncertainty in a latent state representation.
Abstract: How can we efficiently propagate uncertainty in a latent state representation with recurrent neural networks? This paper introduces stochastic recurrent neural networks which glue a deterministic recurrent neural network and a state space model together to form a stochastic and sequential neural generative model. The clear separation of deterministic and stochastic layers allows a structured variational inference network to track the factorization of the model’s posterior distribution. By retaining both the nonlinear recursive structure of a recurrent neural network and averaging over the uncertainty in a latent path, like a state space model, we improve the state of the art results on the Blizzard and TIMIT speech modeling data sets by a large margin, while achieving comparable performances to competing methods on polyphonic music modeling.

205 citations


Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper uses simple spectrograms as input to a CNN, studies the optimal design of such networks for speaker identification and clustering, and demonstrates the approach on the well-known TIMIT dataset, achieving results comparable with the state of the art without the need for handcrafted features.
Abstract: Deep learning, especially in the form of convolutional neural networks (CNNs), has triggered substantial improvements in computer vision and related fields in recent years. This progress is attributed to the shift from designing features and subsequent individual sub-systems towards learning features and recognition systems end to end from nearly unprocessed data. For speaker clustering, however, it is still common to use handcrafted processing chains such as MFCC features and GMM-based models. In this paper, we use simple spectrograms as input to a CNN and study the optimal design of those networks for speaker identification and clustering. Furthermore, we elaborate on the question of how to transfer a network, trained for speaker identification, to speaker clustering. We demonstrate our approach on the well-known TIMIT dataset, achieving results comparable with the state of the art, without the need for handcrafted features.

73 citations
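
A minimal sketch of the kind of pipeline described above: a log-spectrogram computed with SciPy fed into a small PyTorch CNN for speaker identification. The layer sizes and architecture are illustrative assumptions, not the paper's optimal design.

```python
# Minimal sketch (assumed architecture, not the paper's exact CNN) of feeding a
# log-spectrogram into a small CNN for speaker identification.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import spectrogram

def log_spectrogram(wave, fs=16000):
    _, _, sxx = spectrogram(wave, fs=fs, nperseg=400, noverlap=240)
    return np.log(sxx + 1e-10).astype(np.float32)

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_speakers)

    def forward(self, spec):                      # spec: (batch, 1, freq, time)
        h = self.features(spec)
        return self.classifier(h.flatten(1))

spec = torch.from_numpy(log_spectrogram(np.random.randn(16000)))  # 1 s of audio
logits = SpeakerCNN(n_speakers=630)(spec[None, None])             # 630 TIMIT speakers
```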


Proceedings ArticleDOI
12 Sep 2016
TL;DR: In this paper, a segmental recurrent neural network (RNN) is used for feature extraction in an end-to-end acoustic model that does not rely on an external system to provide features or segmentation boundaries.
Abstract: We study the segmental recurrent neural network for end-to-end acoustic modelling. This model connects the segmental conditional random field (CRF) with a recurrent neural network (RNN) used for feature extraction. Compared to most previous CRF-based acoustic models, it does not rely on an external system to provide features or segmentation boundaries. Instead, this model marginalises out all the possible segmentations, and features are extracted from the RNN trained together with the segmental CRF. In essence, this model is self-contained and can be trained end-to-end. In this paper, we discuss practical training and decoding issues as well as the method to speed up the training in the context of speech recognition. We performed experiments on the TIMIT dataset. We achieved a 17.3% phone error rate (PER) from first-pass decoding, the best reported result using CRFs, despite the fact that we only used a zeroth-order CRF and no language model.

68 citations
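
The marginalisation over segmentations described above can be sketched with a simple forward dynamic program for a zeroth-order segmental CRF. The `segment_score` function below is a stand-in for the RNN-derived segment scores, and the duration cap is an assumption.

```python
# Sketch (under assumed notation) of marginalising over all segmentations for a
# zeroth-order segmental CRF: alpha[t] sums the scores of all segmentations of
# frames [0, t). A real system would score segments with the jointly trained RNN;
# here `segment_score` is a stand-in.
import numpy as np
from scipy.special import logsumexp

def log_partition(T, n_labels, segment_score, max_dur=30):
    """segment_score(s, t, y) -> log-score of a segment covering frames [s, t) with label y."""
    alpha = np.full(T + 1, -np.inf)
    alpha[0] = 0.0
    for t in range(1, T + 1):
        terms = []
        for s in range(max(0, t - max_dur), t):
            label_scores = [segment_score(s, t, y) for y in range(n_labels)]
            terms.append(alpha[s] + logsumexp(label_scores))
        alpha[t] = logsumexp(terms)
    return alpha[T]   # log Z, the normaliser of the segmental CRF

# Toy usage with random segment scores:
rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 51, 48))             # segment starts x segment ends x 48 phone labels
print(log_partition(50, 48, lambda s, t, y: scores[s, t, y]))
```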


Journal ArticleDOI
TL;DR: A new stereo speech separation system is proposed in which deep neural networks generate a soft T-F mask for separation; the orientation of the dominant source at each T-F unit is estimated from low-level features such as the mixing vector, interaural level difference, and interaural phase difference.
Abstract: Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberations are present in the mixtures. In this paper, we propose a new stereo speech separation system where deep neural networks are used to generate a soft T-F mask for separation. More specifically, the deep neural network, which is composed of two sparse autoencoders and a softmax regression, is used to estimate the orientation of the dominant source at each T-F unit, based on low-level features such as the mixing vector (MV), interaural level difference (ILD), and interaural phase difference (IPD). The dataset for training the networks was generated by the convolution of binaural room impulse responses (RIRs) and clean speech signals positioned at different angles with respect to the sensors. With the training dataset, we use unsupervised learning to extract high-level features from the low-level features and use supervised learning to find the nonlinear functions between the high-level features and the orientation of the dominant source. By using the trained networks, the probability that each T-F unit belongs to different sources (target and interferers) can be estimated based on the localization cues, which is further used to generate the soft mask for source separation. Experiments based on real binaural RIRs and the TIMIT dataset are provided to show the performance of the proposed system for reverberant speech mixtures, as compared with a model-based T-F masking technique proposed recently.

50 citations
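
A hedged sketch of the binaural front end described above: interaural level and phase differences computed per T-F unit from a stereo STFT, with a placeholder soft mask applied to the mixture. The DNN that actually predicts the mask from the MV/ILD/IPD features is not shown.

```python
# Minimal sketch of the front end: compute interaural level and phase differences
# (ILD/IPD) per T-F unit from a stereo STFT, and apply a soft mask to the mixture.
# The DNN that maps features to the mask is not shown; `soft_mask` is a placeholder.
import numpy as np
from scipy.signal import stft, istft

def binaural_features(left, right, fs=16000, nperseg=512):
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    ild = 20.0 * np.log10((np.abs(L) + 1e-8) / (np.abs(R) + 1e-8))   # level difference (dB)
    ipd = np.angle(L * np.conj(R))                                    # phase difference (rad)
    return L, R, ild, ipd

left, right = np.random.randn(16000), np.random.randn(16000)
L, R, ild, ipd = binaural_features(left, right)

# In the paper the mask comes from a DNN over (MV, ILD, IPD); here a dummy mask:
soft_mask = 1.0 / (1.0 + np.exp(-ild / 10.0))        # placeholder values in [0, 1]
_, target_est = istft(soft_mask * L, fs=16000, nperseg=512)
```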


Posted Content
TL;DR: In this article, a segmental recurrent neural network (RNN) is used for feature extraction in an end-to-end acoustic model that does not rely on an external system to provide features or segmentation boundaries.
Abstract: We study the segmental recurrent neural network for end-to-end acoustic modelling. This model connects the segmental conditional random field (CRF) with a recurrent neural network (RNN) used for feature extraction. Compared to most previous CRF-based acoustic models, it does not rely on an external system to provide features or segmentation boundaries. Instead, this model marginalises out all the possible segmentations, and features are extracted from the RNN trained together with the segmental CRF. In essence, this model is self-contained and can be trained end-to-end. In this paper, we discuss practical training and decoding issues as well as the method to speed up the training in the context of speech recognition. We performed experiments on the TIMIT dataset. We achieved a 17.3% phone error rate (PER) from first-pass decoding, the best reported result using CRFs, despite the fact that we only used a zeroth-order CRF and no language model.

42 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: An intriguing correlation between ASR and SER is found: features learned in some layers, particularly towards the initial layers of the network, for either task were found to be applicable to the other task to a varying degree.
Abstract: The correlation between Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) is poorly understood. Studying such correlation may pave the way for integrating both tasks into a single system or may provide insights that can aid in advancing both systems, such as improving ASR in dealing with emotional speech or embedding linguistic input into SER. In this paper, we quantify the relation between ASR and SER by studying the relevance of features learned between both tasks in deep convolutional neural networks using transfer learning. Experiments are conducted using the TIMIT and IEMOCAP databases. Results reveal an intriguing correlation between both tasks: features learned in some layers, particularly towards the initial layers of the network, for either task were found to be applicable to the other task to a varying degree.

35 citations
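
A rough sketch of this transfer-learning setup, assuming a small convolutional network: the early convolutional layers trained on one task are copied and frozen, and only a new task-specific head is fine-tuned. The architecture and class counts are illustrative, not taken from the paper.

```python
# Hedged sketch of the transfer-learning setup: reuse (freeze) the early
# convolutional layers of a network trained on one task and fine-tune a new
# classifier head for the other task. Layer sizes and names are illustrative.
import torch
import torch.nn as nn

def build_cnn(n_classes):
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # early layers
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        nn.Flatten(), nn.Linear(32 * 4 * 4, n_classes),               # task-specific head
    )

asr_net = build_cnn(n_classes=48)          # e.g. trained for phone classification (ASR)
ser_net = build_cnn(n_classes=4)           # target task: 4 emotion classes (SER)

# Transfer the first convolutional layer from ASR to SER and freeze it.
ser_net[0].load_state_dict(asr_net[0].state_dict())
for p in ser_net[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam([p for p in ser_net.parameters() if p.requires_grad], lr=1e-3)
```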


Posted Content
TL;DR: The authors use hard binary stochastic decisions to select the timesteps at which outputs will be produced, and the model is trained to produce these binary decisions using a standard policy gradient method.
Abstract: Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

35 citations


Journal ArticleDOI
Huy Phan1, Lars Hertel1, Marco Maass1, Radoslaw Mazur1, Alfred Mertins1 
TL;DR: This work considers speech patterns as basic acoustic concepts, which embody and represent the target nonspeech signal, and proposes an algorithm to select a sufficient subset, which provides an approximate representation capability of the entire set of available speech patterns.
Abstract: The human auditory system is very well matched to both human speech and environmental sounds. Therefore, the question arises whether human speech material may provide useful information for training systems for analyzing nonspeech audio signals, e.g., in a classification task. In order to answer this question, we consider speech patterns as basic acoustic concepts, which embody and represent the target nonspeech signal. To find out how similar the nonspeech signal is to speech, we classify it with a classifier trained on the speech patterns and use the classification posteriors to represent the closeness to the speech bases. The speech similarities are finally employed as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. Furthermore, these descriptors are generic. That is, once the speech classifier has been learned, it can be employed as a feature extractor for different datasets without retraining. Lastly, we propose an algorithm to select a sufficient subset, which provides an approximate representation capability of the entire set of available speech patterns. We conduct experiments for the application of audio event analysis. Phone triplets from the TIMIT dataset were used as speech patterns to learn the descriptors for audio events of three different datasets with different complexity, including UPC-TALP, Freiburg-106, and NAR. The experimental results on the event classification task show that a good performance can be easily obtained even if a simple linear classifier is used. Furthermore, fusion of the learned descriptors as an additional source leads to state-of-the-art performance on all the three target datasets.

33 citations
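
A minimal sketch of the descriptor idea: a classifier trained on speech patterns produces posteriors for a nonspeech sound, and that posterior vector serves as the sound's descriptor. The features and the logistic-regression classifier are illustrative stand-ins for the paper's pipeline.

```python
# Sketch of the descriptor idea: a classifier trained on speech patterns (e.g.
# phone triplets) produces posteriors for a nonspeech sound, and those posteriors
# are used as its feature vector. Features and model are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(2000, 39))        # e.g. MFCC stats of phone triplets
speech_labels = rng.integers(0, 100, size=2000)   # 100 speech-pattern classes

speech_clf = LogisticRegression(max_iter=200).fit(speech_feats, speech_labels)

def speech_similarity_descriptor(event_feats):
    """Posterior over speech patterns = descriptor of the nonspeech event."""
    return speech_clf.predict_proba(event_feats.reshape(1, -1)).ravel()

event_descriptor = speech_similarity_descriptor(rng.normal(size=39))  # 100-dim descriptor
```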


Proceedings ArticleDOI
31 Mar 2016
TL;DR: The developed ConvRBM with sampling from noisy rectified linear units (NReLUs) is trained in an unsupervised way to model speech signals of arbitrary length, and the weights of the model can represent an auditory-like filterbank.
Abstract: Convolutional Restricted Boltzmann Machine (ConvRBM) as a model for the speech signal is presented in this paper. We have developed ConvRBM with sampling from noisy rectified linear units (NReLUs). ConvRBM is trained in an unsupervised way to model speech signals of arbitrary length. The weights of the model can represent an auditory-like filterbank. Our proposed learned filterbank is also nonlinear with respect to the center frequencies of the subband filters, similar to standard filterbanks (such as Mel, Bark, ERB, etc.). We have used our proposed model as a front-end to learn features and applied it to the speech recognition task. The performance of ConvRBM features is improved compared to MFCC, with a relative improvement of 5% on the TIMIT test set and 7% on the WSJ0 database for both Nov'92 test sets using GMM-HMM systems. With DNN-HMM systems, we achieved a relative improvement of 3% on the TIMIT test set over MFCC and Mel filterbank (FBANK) features. On the WSJ0 Nov'92 test sets, we achieved relative improvements of 4–14% using ConvRBM features over MFCC features and 3.6–5.6% using the ConvRBM filterbank over FBANK features.

28 citations


Journal ArticleDOI
01 Feb 2016
TL;DR: A new speaker adaptation method for the hybrid NN/HMM speech recognition model, based on singular value decomposition (SVD), is proposed; it alleviates the over-fitting problem by updating the weight matrices only slightly, modifying just the singular values.
Abstract: Recently several speaker adaptation methods have been proposed for deep neural networks (DNNs) in many large vocabulary continuous speech recognition (LVCSR) tasks. However, only a few methods rely on tuning the connection weights in trained DNNs directly to optimize system performance, since this is very prone to over-fitting, especially when some class labels are missing in the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD on the weight matrices in trained DNNs and then tune the rectangular diagonal matrices with the adaptation data. This alleviates the over-fitting problem by updating the weight matrices only slightly, modifying just the singular values. We evaluate the proposed adaptation method in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition on the Switchboard task. Experimental results have shown that it is effective to adapt large DNN models using only a small amount of adaptation data. For example, recognition results on the Switchboard task have shown that the proposed SVD-based adaptation method may achieve up to a 3-6% relative error reduction using only a few dozen adaptation utterances per speaker.
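
A minimal PyTorch sketch of the adaptation idea, under the assumption of a single fully connected layer: factor the trained weight matrix as W = U diag(s) Vᵀ, freeze U and V, and update only the singular values on the adaptation data. The dimensions and the loss are illustrative.

```python
# Minimal sketch of SVD-based adaptation: factor a trained weight matrix as
# W = U diag(s) V^T, freeze U and V, and update only the singular values.
import torch

W = torch.randn(1024, 1024)                      # a trained DNN layer's weights
U, s, Vh = torch.linalg.svd(W, full_matrices=False)

s_adapt = torch.nn.Parameter(s.clone())          # only this is speaker-dependent
optimizer = torch.optim.SGD([s_adapt], lr=1e-3)

def adapted_layer(x):
    # Reassemble the layer with the (few) adapted singular values.
    return x @ (U @ torch.diag(s_adapt) @ Vh).T

x, target = torch.randn(8, 1024), torch.randn(8, 1024)   # toy adaptation batch
loss = torch.nn.functional.mse_loss(adapted_layer(x), target)
loss.backward()
optimizer.step()
```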

Proceedings ArticleDOI
20 Mar 2016
TL;DR: An improvement of the kernel ridge regression studied in Huang et al., ICASSP 2014, is presented and it is shown that the proposal is computationally advantageous and the speech recognition accuracy is highly comparable.
Abstract: Recent evidence suggests that the performance of kernel methods may match that of deep neural networks (DNNs), which have been the state-of-the-art approach for speech recognition. In this work, we present an improvement of the kernel ridge regression studied in Huang et al., ICASSP 2014, and show that our proposal is computationally advantageous. Our approach performs classification using the one-vs-one scheme, which, under certain assumptions, reduces the costs of the one-vs-rest scheme asymptotically by a factor of c² in training time and c in memory consumption. Here, c is the number of classes, and it is typically on the order of hundreds or thousands for speech recognition. We demonstrate empirical results on the benchmark corpus TIMIT. In particular, the classification accuracy is one to two percentage points higher (in absolute terms) than the best of the kernel methods and of the DNNs reported by Huang et al., and the speech recognition accuracy is highly comparable.
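
A generic sketch of the one-vs-one scheme: one kernel ridge regressor per class pair, each trained only on that pair's data, with majority voting at test time. This illustrates why the pairwise scheme touches much less data per model; it is not the authors' optimised implementation.

```python
# Sketch of one-vs-one kernel ridge classification with majority voting.
import numpy as np
from itertools import combinations
from sklearn.kernel_ridge import KernelRidge

def train_one_vs_one(X, y, gamma=0.1, alpha=1.0):
    models = {}
    for a, b in combinations(np.unique(y), 2):
        idx = np.isin(y, [a, b])                          # only this pair's data
        t = np.where(y[idx] == a, 1.0, -1.0)              # +1 for class a, -1 for class b
        models[(a, b)] = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(X[idx], t)
    return models

def predict_one_vs_one(models, X, n_classes):
    votes = np.zeros((len(X), n_classes))
    for (a, b), m in models.items():
        pred = m.predict(X)
        votes[:, a] += (pred > 0)
        votes[:, b] += (pred <= 0)
    return votes.argmax(axis=1)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 40)), rng.integers(0, 5, size=300)   # toy 5-class data
models = train_one_vs_one(X, y)
print(predict_one_vs_one(models, X[:10], n_classes=5))
```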

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper studies how to conduct effective speaker code based speaker adaptation on RNN-BLSTM and demonstrates that the speaker code based adaptation method is also a valid adaptation method for RNN-BLSTM.
Abstract: Recently, the recurrent neural network with bidirectional Long Short-Term Memory (RNN-BLSTM) acoustic model has been shown to give great performance on TIMIT [1] and other speech recognition tasks. Meanwhile, the speaker code based adaptation method has been demonstrated as a valid adaptation method for Deep Neural Network (DNN) acoustic models [2]. However, whether the speaker code based adaptation method is also valid for RNN-BLSTM has not been reported, to the best of our knowledge. In this paper, we study how to conduct effective speaker code based speaker adaptation on RNN-BLSTM and demonstrate that the speaker code based adaptation method is also valid for RNN-BLSTM. Experimental results on TIMIT have shown that the adaptation of RNN-BLSTM can achieve over 10% relative reduction in phone error rate (PER) compared to no adaptation. Then, a set of comparative experiments is implemented to analyze the different contributions of the adaptation on the cell input and on each gate activation function of the BLSTM. It is found that the adaptation on the cell input activation function is more effective than the adaptation on each gate activation function.

Journal ArticleDOI
TL;DR: A novel method is proposed for estimating the long-term SNR of speech signals based on features from which regions of speech presence in a noisy signal can be approximately detected; the method is shown to outperform other SNR estimation methods.
Abstract: Many speech processing algorithms and applications rely on explicit knowledge of the signal-to-noise ratio (SNR) in their design and implementation. Estimating the SNR of a signal can enhance the performance of such technologies. We propose a novel method for estimating the long-term SNR of speech signals based on features from which we can approximately detect regions of speech presence in a noisy signal. By measuring the energy in these regions, we create sets of energy ratios, from which we train regression models for different types of noise. If the type of noise that corrupts a signal is known, we use the corresponding regression model to estimate the SNR. When the noise is unknown, we use a deep neural network to find the “closest” regression model to estimate the SNR. Evaluations were done based on the TIMIT speech corpus, using noises from the NOISEX-92 noise database. Furthermore, we performed cross-corpora experiments by training on TIMIT and NOISEX-92 and testing on the Wall Street Journal speech corpus and the DEMAND noise database. Our results show that our system provides accurate SNR estimates across different noise types and corpora, and that it outperforms other SNR estimation methods.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: The proposed algorithm is evaluated on the TIMIT core test set using the perceptual evaluation of speech quality (PESQ) measure and segmental SNR measure and is shown to give a consistent improvement over a wide range of SNRs when compared to competitive algorithms.
Abstract: In this paper, we propose a minimum mean square error spectral estimator for clean speech spectral amplitudes that uses a Kalman filter to model the temporal dynamics of the spectral amplitudes in the modulation domain. Using a two-parameter Gamma distribution to model the prior distribution of the speech spectral amplitudes, we derive closed form expressions for the posterior mean and variance of the spectral amplitudes as well as for the associated update step of the Kalman filter. The performance of the proposed algorithm is evaluated on the TIMIT core test set using the perceptual evaluation of speech quality (PESQ) measure and segmental SNR measure and is shown to give a consistent improvement over a wide range of SNRs when compared to competitive algorithms.
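
For intuition only, here is a scalar Kalman filter tracking the temporal trajectory of each frequency bin's noisy spectral amplitude with a random-walk state model. The paper's actual estimator uses a two-parameter Gamma prior and a closed-form MMSE update, which is not reproduced here.

```python
# Simple sketch of tracking spectral-amplitude trajectories per frequency bin with
# a scalar Kalman filter (random-walk state model). The paper's posterior update,
# based on a two-parameter Gamma prior, is not reproduced; this only illustrates
# the modulation-domain temporal modelling.
import numpy as np
from scipy.signal import stft

def kalman_smooth_amplitudes(noisy, fs=16000, q=0.01, r=0.1):
    _, _, Z = stft(noisy, fs=fs, nperseg=512)
    A = np.abs(Z)                                # noisy spectral amplitudes (freq x time)
    est = np.zeros_like(A)
    x, p = A[:, 0].copy(), np.ones(A.shape[0])   # state and variance per frequency bin
    for t in range(A.shape[1]):
        p = p + q                                # predict (random-walk dynamics)
        k = p / (p + r)                          # Kalman gain
        x = x + k * (A[:, t] - x)                # update with the new noisy amplitude
        p = (1.0 - k) * p
        est[:, t] = x
    return est

smoothed = kalman_smooth_amplitudes(np.random.randn(16000))
```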

Journal ArticleDOI
TL;DR: It is shown from single-talk and double-talk scenarios using speech signals from the TIMIT database that the proposed algorithm achieves better performance, with more than 3 dB of attenuation in the misalignment evaluation compared to the GSVSS-NLMS, non-parametric VSS-NLMS, and standard NLMS algorithms for a non-stationary input in noisy environments.

Proceedings ArticleDOI
30 Apr 2016
TL;DR: The resulting LA-DNN model eliminates the need for pre-training, addresses the gradient vanishing problem for deep networks, has higher capacity in modeling linear transformations, trains significantly faster than normal DNN, and produces better acoustic models.
Abstract: Deep neural networks (DNN) are a powerful tool for many large vocabulary continuous speech recognition (LVCSR) tasks. Training a very deep network is a challenging problem and pre-training techniques are needed in order to achieve the best results. In this paper, we propose a new type of network architecture, Linear Augmented Deep Neural Network (LA-DNN). This type of network augments each non-linear layer with a linear connection from layer input to layer output. The resulting LA-DNN model eliminates the need for pre-training, addresses the gradient vanishing problem for deep networks, has higher capacity in modeling linear transformations, trains significantly faster than normal DNN, and produces better acoustic models. The proposed model has been evaluated on TIMIT phoneme recognition and AMI speech recognition tasks. Experimental results show that the LA-DNN models can have 70% fewer parameters than a DNN, while still improving accuracy. On the TIMIT phoneme recognition task, the smaller LA-DNN model improves TIMIT phone accuracy by 2% absolute, and AMI word accuracy by 1.7% absolute.
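
A minimal sketch of a linear augmented layer as described above: the usual non-linear transform plus a linear connection from the layer input to the layer output. The dimensions and the sigmoid non-linearity are illustrative assumptions.

```python
# Minimal sketch of a linear augmented (LA) layer: output = non-linear transform
# of the input plus a linear shortcut from the layer input. Dimensions are illustrative.
import torch
import torch.nn as nn

class LinearAugmentedLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.nonlinear = nn.Linear(in_dim, out_dim)
        self.linear = nn.Linear(in_dim, out_dim, bias=False)   # linear shortcut

    def forward(self, x):
        return torch.sigmoid(self.nonlinear(x)) + self.linear(x)

net = nn.Sequential(*[LinearAugmentedLayer(512, 512) for _ in range(8)],
                    nn.Linear(512, 1944))   # e.g. senone outputs
out = net(torch.randn(16, 512))
```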

Journal ArticleDOI
TL;DR: The proposed multichannel spectral enhancement method for reverberation-robust ASR using distributed microphones uses the techniques of nonnegative tensor factorization in order to identify the clean speech component from a set of observed reverberant spectrograms from the different channels.
Abstract: Automatic speech recognition (ASR) using distant (far-field) microphones is a challenging task, in which room reverberation is one of the primary causes of performance degradation. This study proposes a multichannel spectral enhancement method for reverberation-robust ASR using distributed microphones. The proposed method uses the techniques of nonnegative tensor factorization in order to identify the clean speech component from a set of observed reverberant spectrograms from the different channels. The general family of alpha–beta divergences is used for the tensor decomposition task which provides increased flexibility for the algorithm and is shown to provide improvements in highly reverberant scenarios. Unlike many conventional array processing solutions, the proposed method does not require closely-spaced microphones and is independent of source and microphone locations. The algorithm can automatically adapt to unbalanced direct-to-reverberation ratios among different channels, which is useful in blind scenarios in which no information is available about source-to-microphone distances. For a medium vocabulary distant ASR task based on TIMIT utterances, and using clean-trained deep neural network acoustic models, absolute WER improvements of +17.2%, +20.7%, and +23.2% are achieved in single-channel, two-channel, and four-channel scenarios.

Journal Article
TL;DR: Experimental results have shown that the HOPE framework yields significant performance gains over the current state-of-the-art methods in various types of NN learning problems, including unsupervised feature learning and supervised or semi-supervised learning.
Abstract: In this paper, we propose a novel model for high-dimensional data, called the Hybrid Orthogonal Projection and Estimation (HOPE) model, which combines a linear orthogonal projection and a finite mixture model under a unified generative modeling framework. The HOPE model itself can be learned unsupervised from unlabelled data based on maximum likelihood estimation, as well as discriminatively from labelled data. More interestingly, we have shown that the proposed HOPE models are closely related to neural networks (NNs), in the sense that each hidden layer can be reformulated as a HOPE model. As a result, the HOPE framework can be used as a novel tool to probe why and how NNs work and, more importantly, to learn NNs in either supervised or unsupervised ways. In this work, we have investigated the HOPE framework to learn NNs for several standard tasks, including image recognition on MNIST and speech recognition on TIMIT. Experimental results have shown that the HOPE framework yields significant performance gains over the current state-of-the-art methods in various types of NN learning problems, including unsupervised feature learning and supervised or semi-supervised learning.

Journal ArticleDOI
TL;DR: This work performs mode-shape classification, formulated as a supervised binary classification problem, with mode-shapes representing the syllabic nuclei as one class and the remaining mode-shapes as the other, using the temporal correlation and selected sub-band correlation in the TCSSBC feature contour.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The experimental results show that, under different noises and SNRs, the proposed improved GFCC has the lowest equal error rate and the best robustness, and it has a greater advantage over other algorithms especially when the SNR is lower than 10 dB.
Abstract: Focused on the issue that the robustness of the traditional Mel Frequency Cepstral Coefficients (MFCC) feature degrades drastically in speaker recognition systems, an algorithm based on improved Gammatone Frequency Cepstral Coefficients (GFCC) is proposed. The difference between traditional MFCC and GFCC is that GFCC replaces the Mel filter bank with a Gammatone filter bank to improve robustness. On this basis, this paper proposes to use multitaper estimation, MVA (Mean Subtraction, Variance Normalization and Autoregressive Moving Average Filtering) and other techniques to further enhance robustness, and the method is tested with the TIMIT speech database. The experimental results show that, under different noises and SNRs, the improved GFCC proposed in this paper has the lowest equal error rate and the best robustness, and it has a greater advantage over other algorithms especially when the SNR is lower than 10 dB.

Posted Content
TL;DR: In this article, an unsupervised phonemic segmentation algorithm is proposed that uses sequence prediction models such as Markov chains and recurrent neural networks to predict speech features frame-by-frame and hypothesizes boundaries by analyzing the prediction error profile of the model.
Abstract: Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists of analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.
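
A hedged sketch of the prediction-error idea, with a linear autoregressive predictor standing in for the Markov chain or RNN: predict each MFCC frame from the previous k frames and hypothesize boundaries at local maxima of the prediction error.

```python
# Sketch of boundary hypothesis from prediction error, with a linear autoregressive
# predictor standing in for the Markov chain / RNN used in the paper.
import numpy as np
from scipy.signal import find_peaks
from sklearn.linear_model import Ridge

def boundaries_from_prediction_error(mfcc, k=3, min_gap=5):
    # mfcc: (n_frames, n_coeffs)
    X = np.hstack([mfcc[i:len(mfcc) - k + i] for i in range(k)])    # k previous frames
    y = mfcc[k:]
    err = np.linalg.norm(Ridge(alpha=1.0).fit(X, y).predict(X) - y, axis=1)
    peaks, _ = find_peaks(err, distance=min_gap)                    # local maxima of error
    return peaks + k                                                # boundary frame indices

mfcc = np.random.randn(500, 13)            # stand-in for real MFCC frames
print(boundaries_from_prediction_error(mfcc)[:10])
```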

Proceedings ArticleDOI
21 Mar 2016
TL;DR: A reduced feature vector employing new information extracted from the speaker's voice is proposed for performing text-independent speaker verification with GMMs, and the power spectrum density of the speech signal is used to improve the system's performance.
Abstract: The Gaussian mixture models (GMM) represent an efficient model that was broadly used in most of speaker recognition applications. This study introduces a novel method for speaker verification task. We propose a reduced feature vector employing new information detected from the speaker's voice for performing text-independent speaker verification applications using GMM. We use the power spectrum density of the speech signal to improve the system's performance. Speaker verification experiments were evaluated with the TIMIT dataset. The suggested system performance is evaluated against the baseline systems. The decrease in the error rate is well observed and the results have demonstrated the effectiveness of the new approach which avoids the use of more complex algorithms or the combination of different approaches.

Proceedings Article
19 Jun 2016
TL;DR: An expectation-maximization (EM) based online CTC algorithm is introduced that enables unidirectional RNNs to learn sequences that are longer than the amount of unrolling and can also be trained to process an infinitely long input sequence without pre-segmentation or external reset.
Abstract: Connectionist temporal classification (CTC) based supervised sequence training of recurrent neural networks (RNNs) has shown great success in many machine learning areas including end-to-end speech and handwritten character recognition. For the CTC training, however, it is required to unroll (or unfold) the RNN by the length of an input sequence. This unrolling requires a lot of memory and hinders a small footprint implementation of online learning or adaptation. Furthermore, the length of training sequences is usually not uniform, which makes parallel training with multiple sequences inefficient on shared memory models such as graphics processing units (GPUs). In this work, we introduce an expectation-maximization (EM) based online CTC algorithm that enables unidirectional RNNs to learn sequences that are longer than the amount of unrolling. The RNNs can also be trained to process an infinitely long input sequence without pre-segmentation or external reset. Moreover, the proposed approach allows efficient parallel training on GPUs. Our approach achieves 20.7% phoneme error rate (PER) on the very long input sequence that is generated by concatenating all 192 utterances in the TIMIT core test set. In the end-to-end speech recognition task on the Wall Street Journal corpus, a network can be trained with only 64 times of unrolling with little performance loss.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper introduces a novel method for blind speech segmentation at a phone level based on image processing that considers the spectrogram of the waveform of an utterance as an image and hypothesizes that its striping defects, i.e. discontinuities, appear due to phone boundaries.
Abstract: This paper introduces a novel method for blind speech segmentation at the phone level based on image processing. We consider the spectrogram of the waveform of an utterance as an image and hypothesize that its striping defects, i.e. discontinuities, appear due to phone boundaries. Using a simple image destriping algorithm these discontinuities are found. To discover phone transitions which are not as salient in the image, we compute spectral changes derived from the time evolution of the Mel cepstral parametrisation of speech. These so-called image-based and acoustic features are then combined to form a mixed probability function, whose values indicate the likelihood of a phone boundary being located at the corresponding time frame. The method is completely unsupervised and achieves an accuracy of 75.59% at a −3.26% over-segmentation rate, yielding an F-measure of 0.76 and a 0.80 R-value on the TIMIT dataset.

Proceedings ArticleDOI
TL;DR: This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results.
Abstract: Various algorithms for text-independent speaker recognition have been developed through the decades, aiming to improve both accuracy and efficiency. This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results. First, the performance based on only PCA and only LDA is measured; then a mixed model, taking advantage of both methods, is introduced. A subset of the TIMIT corpus composed of 200 male speakers is used for enrollment, validation and testing. The best results achieve 100%, 96% and 95% classification rates at population levels of 50, 100 and 200, using 39-dimensional MFCC features with delta and double delta. These results are based on 12 seconds of text-independent speech for training and 4 seconds of data for testing. They are comparable to conventional MFCC-GMM methods, but require significantly less time to train and operate.
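
A minimal sketch of a PCA+LDA pipeline of this kind using scikit-learn, assuming fixed-length MFCC-based utterance vectors have already been extracted; the dimensions and speaker counts are placeholders.

```python
# Minimal sketch of the PCA+LDA pipeline: reduce MFCC-based utterance vectors with
# PCA, then classify with LDA. Feature extraction is assumed to have happened already.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 39))         # 39-dim MFCC+delta+delta-delta vectors
y_train = rng.integers(0, 50, size=1000)      # 50 enrolled speakers
X_test = rng.normal(size=(20, 39))

model = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print(model.predict(X_test))                  # predicted speaker identities
```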

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper empirically measures the performance of using the softmax outputs connected to different hidden layers of an already fine-tuned deep neural network (DNN) and explores decoding strategies that do not require computing all the hidden layers of the DNN.
Abstract: In speech recognition, a trade-off can be made between transcription accuracy and computation time. In this paper, we empirically measure the performance of using the softmax outputs connected to different hidden layers of an already fine-tuned deep neural network (DNN) and explore decoding strategies that do not require computing all the hidden layers of the DNN. We find that selecting the specific outputs from a variable-depth DNN achieves better Phoneme Error Rates (PER) on the TIMIT task than directly training a fixed-depth DNN with the same number of layers. We experimented with different ways of stopping the forward propagation early, first by using a threshold on the entropy of the respective outputs, and then by formulating a ‘gating’ system on the hidden layers to predict when to stop the forward propagation.
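
A hedged sketch of entropy-gated early exit: assuming each hidden layer has its own softmax head (an assumption for illustration), forward propagation stops once the output entropy falls below a threshold.

```python
# Sketch of entropy-gated early exit with per-layer softmax heads (an assumption
# for illustration): stop forward propagation once the output is confident enough.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = nn.ModuleList([nn.Linear(440 if i == 0 else 1024, 1024) for i in range(6)])
heads = nn.ModuleList([nn.Linear(1024, 1944) for _ in range(6)])   # per-layer output heads

def early_exit_forward(x, threshold=1.0):
    for layer, head in zip(hidden, heads):
        x = torch.sigmoid(layer(x))
        probs = F.softmax(head(x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.mean() < threshold:          # confident enough: stop here
            return probs
    return probs                                # fall back to the deepest head

out = early_exit_forward(torch.randn(1, 440))
```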

Posted Content
TL;DR: In this paper, the authors demonstrate the application of nonparametric Bayesian models to acoustic unit discovery and show that the discovered units are correlated with phonemes and therefore are linguistically meaningful.
Abstract: State-of-the-art speech recognition systems use data-intensive context-dependent phonemes as acoustic units. However, these approaches do not translate well to low-resourced languages where large amounts of training data are not available. For such languages, automatic discovery of acoustic units is critical. In this paper, we demonstrate the application of nonparametric Bayesian models to acoustic unit discovery. We show that the discovered units are correlated with phonemes and therefore are linguistically meaningful. We also present a spoken term detection (STD) by example query algorithm based on these automatically learned units. We show that our proposed system produces a P@N of 61.2% and an EER of 13.95% on the TIMIT dataset. The improvement in the EER is 5%, while the P@N is only slightly lower than that of the best reported system in the literature.

Proceedings ArticleDOI
01 Aug 2016
TL;DR: DNNs trained on ConvRBM with rectified units provide significant complementary information in terms of temporal modulation features for unsupervised representation learning in the speech recognition task.
Abstract: There has been significant research attention on unsupervised representation learning to learn features for speech processing applications. In this paper, we investigate unsupervised representation learning using a Convolutional Restricted Boltzmann Machine (ConvRBM) with rectified units for the speech recognition task. A temporal modulation representation is learned using the log Mel-spectrogram as input to the ConvRBM. DNNs were trained separately on ConvRBM modulation features and filterbank spectral features, and system combination was then used. With our proposed setup, ConvRBM features were applied to the speech recognition task on the TIMIT and WSJ0 databases. On the TIMIT database, we achieved a relative improvement of 5.93% in PER on the test set compared to filterbank features alone. For the WSJ0 database, we achieved relative improvements of 3.63–4.3% in WER on the test sets compared to filterbank features. Hence, DNNs trained on ConvRBM with rectified units provide significant complementary information in terms of temporal modulation features.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper presents a robust voice activity detection (VAD) method based on a combination of gammatone filtering and entropy as an information-theoretic measure in the detection algorithm, which outperforms other existing methods in terms of detection accuracy.
Abstract: A voice activity detector (VAD) is used to detect the presence or absence of human voice in a signal. A robust VAD algorithm is essential to distinguish human voice in a noisy acoustic signal. There have been many recent works on the development of robust VAD which focus on unsupervised feature extraction, such as temporal variation and signal-to-noise ratio [1]. However, these methods are typically sensitive to nonstationary noise, especially under low SNR. To overcome these problems, this paper presents a robust voice activity detection (VAD) method based on a combination of gammatone filtering and entropy as an information-theoretic measure in the detection algorithm. The performance of the proposed algorithm is tested using speech signals from the TIMIT test corpus with additive noise at varying degrees of signal-to-noise ratio. The results show that the proposed robust VAD outperforms other existing methods in terms of detection accuracy.
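
A simple sketch of an entropy-based VAD decision: compute the normalised spectral entropy of each frame and mark low-entropy frames as speech-like. The gammatone filterbank front end used in the paper is omitted here; a plain STFT stands in.

```python
# Sketch of an entropy-based VAD decision: frames whose spectrum is peaky
# (low normalised spectral entropy) are marked as speech-like. The gammatone
# front end of the paper is replaced by a plain STFT for brevity.
import numpy as np
from scipy.signal import stft

def entropy_vad(signal, fs=16000, threshold=0.85):
    _, _, Z = stft(signal, fs=fs, nperseg=400, noverlap=240)
    power = np.abs(Z) ** 2 + 1e-12
    p = power / power.sum(axis=0, keepdims=True)                  # per-frame spectral distribution
    entropy = -(p * np.log(p)).sum(axis=0) / np.log(p.shape[0])   # normalised to [0, 1]
    return entropy < threshold                                     # True = speech-like frame

decisions = entropy_vad(np.random.randn(32000))   # 2 s of (noise-only) audio
```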