
Showing papers on "TIMIT published in 2013"


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

7,316 citations
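
The architecture above stacks several bidirectional LSTM layers and trains them end-to-end with a CTC output layer. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation; the feature dimension, layer sizes, and the 61-phone-plus-blank output are assumptions.

```python
# Hedged sketch: a deep bidirectional LSTM phoneme recogniser trained with CTC.
# This is NOT the authors' code; sizes and feature dimensions are illustrative.
import torch
import torch.nn as nn

class DeepBiLSTMCTC(nn.Module):
    def __init__(self, n_feats=123, n_hidden=250, n_layers=3, n_phones=61):
        super().__init__()
        # Stacked bidirectional LSTM: deep in both time and layer depth.
        self.rnn = nn.LSTM(n_feats, n_hidden, num_layers=n_layers,
                           bidirectional=True, batch_first=True)
        # Output layer over phones plus one CTC blank symbol (last index).
        self.fc = nn.Linear(2 * n_hidden, n_phones + 1)

    def forward(self, x):                      # x: (batch, time, n_feats)
        h, _ = self.rnn(x)
        return self.fc(h).log_softmax(dim=-1)  # (batch, time, n_phones + 1)

model = DeepBiLSTMCTC()
ctc = nn.CTCLoss(blank=61)                     # blank index = last class

feats = torch.randn(4, 300, 123)               # dummy batch of utterances
log_probs = model(feats).transpose(0, 1)       # CTCLoss wants (time, batch, classes)
targets = torch.randint(0, 61, (4, 40))        # dummy phone label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 300),
           target_lengths=torch.full((4,), 40))
loss.backward()
```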


Proceedings ArticleDOI
01 Dec 2013
TL;DR: The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.
Abstract: Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However, the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.

1,619 citations


01 Jan 2013
TL;DR: Improvements in phone error rate are achieved without increasing the number of training epochs, suggesting that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
Abstract: Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of the state-of-the-art methods for object recognition using neural networks. However, this approach has (to our knowledge) not been exploited successfully in speech recognition (with or without neural networks). In this paper we lay the foundation for this approach, and show one way of augmenting speech datasets by transforming spectrograms, using a random linear warping along the frequency dimension. In practice this can be achieved by using warping techniques that are used for vocal tract length normalization (VTLN) - with the difference that a warp factor is generated randomly each time, during training, rather than fitting a single warp factor to each training and test speaker (or utterance). At test time, a prediction is made by averaging the predictions over multiple warp factors. When this technique is applied to TIMIT using Deep Neural Networks (DNN) of different depths, the Phone Error Rate (PER) improved by an average of 0.65% on the test set. For a Convolutional Neural Network (CNN) with a convolutional layer at the bottom, a gain of 1.0% was observed. These improvements were achieved without increasing the number of training epochs, and suggest that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.

351 citations
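
The augmentation step itself amounts to a random remapping of the frequency axis of each training spectrogram. A hedged numpy sketch follows; the linear warp, the interpolation, and the 0.9-1.1 warp-factor range are illustrative choices rather than the paper's exact VTLN warping function.

```python
# Hedged sketch of random frequency warping for data augmentation. The simple
# linear warp and the 0.9-1.1 range are illustrative, not the paper's settings.
import numpy as np

def random_freq_warp(spec, warp_range=(0.9, 1.1), rng=np.random):
    """spec: (n_freq_bins, n_frames) spectrogram. Returns a warped copy."""
    n_bins = spec.shape[0]
    alpha = rng.uniform(*warp_range)          # one random warp factor per example
    src_bins = np.arange(n_bins)
    # Read each output bin from position bin/alpha in the original spectrogram,
    # clipping at the highest bin; linear interpolation between bins.
    sample_pos = np.clip(src_bins / alpha, 0, n_bins - 1)
    warped = np.empty_like(spec)
    for t in range(spec.shape[1]):
        warped[:, t] = np.interp(sample_pos, src_bins, spec[:, t])
    return warped

# Usage: apply a fresh random warp to every training example in every epoch,
# and average predictions over several warp factors at test time.
spec = np.abs(np.random.randn(40, 300))       # dummy 40-bin mel spectrogram
augmented = random_freq_warp(spec)
```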


Proceedings ArticleDOI
26 May 2013
TL;DR: A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Abstract: In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes (one per speaker). The joint training method uses all training data along with speaker labels to update adaptation NN weights and speaker codes based on the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can be done simply by learning a new speaker code using the same back-propagation algorithm without changing any NN weights. In this method, a separate speaker code is learned for each speaker while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the size of speaker codes is very small. As a result, it is possible to conduct a very fast adaptation of the hybrid NN/HMM model for each speaker based on only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT have shown that it can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.

269 citations
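
The central trick is that the adaptation network weights are shared across all speakers while each speaker contributes only a small code vector, so adapting to a new speaker means backpropagating into the code alone. A hedged PyTorch sketch of that structure follows; the layer sizes, code dimension, and the way the code is concatenated with the features are assumptions.

```python
# Hedged sketch of speaker-code adaptation: a shared adaptation network takes
# acoustic features plus a small per-speaker code and maps them into a
# speaker-normalised feature space. Sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

n_feats, code_dim, n_speakers = 440, 50, 462           # e.g. TIMIT training speakers

adapt_net = nn.Sequential(                              # shared across all speakers
    nn.Linear(n_feats + code_dim, 1000), nn.Sigmoid(),
    nn.Linear(1000, n_feats))
speaker_codes = nn.Embedding(n_speakers, code_dim)      # one small code per speaker
recogniser = nn.Sequential(                             # stand-in for the NN-HMM DNN
    nn.Linear(n_feats, 2000), nn.Sigmoid(), nn.Linear(2000, 183))

def forward(feats, spk_ids):
    codes = speaker_codes(spk_ids)                      # (batch, code_dim)
    normed = adapt_net(torch.cat([feats, codes], dim=1))
    return recogniser(normed)

out = forward(torch.randn(16, n_feats), torch.randint(0, n_speakers, (16,)))

# Training: update adaptation net, speaker codes and recogniser jointly.
train_opt = torch.optim.SGD(
    list(adapt_net.parameters()) + list(speaker_codes.parameters())
    + list(recogniser.parameters()), lr=0.1)

# Adapting to a new speaker: freeze all network weights and learn only a fresh
# code from a few adaptation utterances.
new_code = nn.Parameter(torch.zeros(1, code_dim))
adapt_opt = torch.optim.SGD([new_code], lr=0.1)
```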


Proceedings ArticleDOI
26 May 2013
TL;DR: It is demonstrated that, even on a strong baseline, multi-task learning can provide a significant decrease in error rate, and this paper explores three natural choices for the secondary task: the phone label, the phone context, and the state context.
Abstract: In this paper we demonstrate how to improve the performance of deep neural network (DNN) acoustic models using multi-task learning. In multi-task learning, the network is trained to perform both the primary classification task and one or more secondary tasks using a shared representation. The additional model parameters associated with the secondary tasks represent a very small increase in the number of trained parameters, and can be discarded at runtime. In this paper, we explore three natural choices for the secondary task: the phone label, the phone context, and the state context. We demonstrate that, even on a strong baseline, multi-task learning can provide a significant decrease in error rate. Using phone context, the phonetic error rate (PER) on TIMIT is reduced from 21.63% to 20.25% on the core test set, surpassing the best performance in the literature for a DNN that uses a standard feed-forward network architecture.

256 citations
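
Concretely, each secondary task adds only a small extra softmax head on top of the shared hidden layers; it is trained jointly with the primary head and discarded at test time. A hedged PyTorch sketch follows; the layer sizes, the monophone secondary target, and the 0.3 task weight are illustrative assumptions.

```python
# Hedged sketch of multi-task training for a DNN acoustic model: shared hidden
# layers with a primary head (tied states) and a secondary head (e.g. monophone
# label) that is thrown away at runtime. Sizes and loss weight are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(440, 2048), nn.ReLU(),
                       nn.Linear(2048, 2048), nn.ReLU())
primary_head = nn.Linear(2048, 2000)     # senone / tied-state targets
secondary_head = nn.Linear(2048, 61)     # monophone targets (discarded at test time)

def loss(feats, state_targets, phone_targets, secondary_weight=0.3):
    h = shared(feats)
    primary = F.cross_entropy(primary_head(h), state_targets)
    secondary = F.cross_entropy(secondary_head(h), phone_targets)
    return primary + secondary_weight * secondary

feats = torch.randn(32, 440)
l = loss(feats, torch.randint(0, 2000, (32,)), torch.randint(0, 61, (32,)))
l.backward()
```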


Proceedings ArticleDOI
26 May 2013
TL;DR: A novel, data-driven approach to voice activity detection based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features clearly outperforming three state-of-the-art reference algorithms under the same conditions.
Abstract: A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye corpora, and adding real long-term recordings of diverse noise types. The approach is evaluated on unseen synthetically mixed test data as well as a real-life test set consisting of four full-length Hollywood movies. A frame-wise Equal Error Rate (EER) of 33.2% is obtained for the four movies and an EER of 9.6% is obtained for the synthetic test data at a peak SNR of 0 dB, clearly outperforming three state-of-the-art reference algorithms under the same conditions.

236 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: A novel deep convolutional neural network architecture is developed, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance.
Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, the lowest reported in the literature on this standard task with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.

185 citations


Journal ArticleDOI
TL;DR: A sufficient depth of the T-DSN, the symmetry of the two-hidden-layer structure in each T-DSN block, the model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to have contributed to the low error rates observed in the experiments for all three tasks.
Abstract: A novel deep architecture, the tensor deep stacking network (T-DSN), is presented. The T-DSN consists of multiple, stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher-order statistics of the hidden binary ([0,1]) features. A learning algorithm for the T-DSN's weight matrices and tensors is developed and described in which the main parameter estimation burden is shifted to a convex subproblem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs in three popular tasks in increasing order of the data size: handwritten digit recognition using MNIST (60k), isolated state/phone classification and continuous phone recognition using TIMIT (1.1 m), and isolated phone classification using WSJ0 (5.2 m). Experimental results in all three tasks demonstrate the effectiveness of the T-DSN and the associated learning methods in a consistent manner. In particular, a sufficient depth of the T-DSN, the symmetry of the two-hidden-layer structure in each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of the T-DSN are shown to have all contributed to the low error rates observed in the experiments for all three tasks.

164 citations
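
The "convex subproblem with a closed-form solution" is the estimation of the upper-layer weights given fixed hidden activations, which reduces to regularised least squares. A hedged numpy sketch of that step for a single, simplified (non-tensor) stacking block follows; the bilinear T-DSN version uses two hidden branches, but the same closed-form solve applies.

```python
# Hedged sketch of the closed-form convex subproblem in a stacking network:
# given fixed hidden activations H, the upper-layer weights minimising the
# regularised squared error to the targets have a ridge-regression solution.
# (The tensor variant combines two hidden branches bilinearly; the same kind of
# closed-form solve applies to the combined representation.)
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 39))            # input frames
Y = np.eye(49)[rng.integers(0, 49, 2000)]      # one-hot targets (e.g. phone states)

W = rng.standard_normal((39, 512)) * 0.1       # lower-layer weights (tuned separately)
H = 1.0 / (1.0 + np.exp(-X @ W))               # sigmoid hidden activations

lam = 1e-2
U = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)   # closed form
predictions = H @ U                            # block output; the next block stacks
                                               # [X, predictions] as its input
```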


Posted Content
TL;DR: This paper investigates a novel approach, where the input to the ANN is raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the rawspeech signal.
Abstract: In hybrid hidden Markov model/artificial neural networks (HMM/ANN) automatic speech recognition (ASR) system, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge such as, speech perception or/and speech production knowledge, and, then modeling the acoustic features with an ANN. Recent advances in machine learning techniques, more specifically in the field of image processing and text processing, have shown that such divide and conquer strategy (i.e., separating feature extraction and modeling steps) may not be necessary. Motivated from these studies, in the framework of convolutional neural networks (CNNs), this paper investigates a novel approach, where the input to the ANN is raw speech signal and the output is phoneme class conditional probability estimates. On TIMIT phoneme recognition task, we study different ANN architectures to show the benefit of CNNs and compare the proposed approach against conventional approach where, spectral-based feature MFCC is extracted and modeled by a multilayer perceptron. Our studies show that the proposed approach can yield comparable or better phoneme recognition performance when compared to the conventional approach. It indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.

155 citations
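
A hedged sketch of the overall idea: one-dimensional convolutions consume the raw waveform directly and the network outputs phoneme class posteriors. Filter widths, strides, and channel counts below are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: a CNN that maps raw speech samples (16 kHz) to phoneme class
# conditional probabilities, replacing hand-crafted features such as MFCCs.
# Filter widths, strides and channel counts are illustrative only.
import torch
import torch.nn as nn

raw_cnn = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=250, stride=10),   # first layer sees raw samples
    nn.ReLU(),
    nn.Conv1d(80, 60, kernel_size=10, stride=5),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                         # pool over time
    nn.Flatten(),
    nn.Linear(60, 61),                               # 61 TIMIT phone classes
)

wave = torch.randn(8, 1, 4000)                       # 8 windows of 250 ms at 16 kHz
posteriors = raw_cnn(wave).log_softmax(dim=-1)       # (8, 61) class log-posteriors
```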


Proceedings ArticleDOI
Raman Arora, Karen Livescu
26 May 2013
TL;DR: The behavior of CCA-based acoustic features is studied on the task of phonetic recognition, investigating to what extent they are speaker-independent or domain-independent.
Abstract: Canonical correlation analysis (CCA) and kernel CCA can be used for unsupervised learning of acoustic features when a second view (e.g., articulatory measurements) is available for some training data, and such projections have been used to improve phonetic frame classification. Here we study the behavior of CCA-based acoustic features on the task of phonetic recognition, and investigate to what extent they are speaker-independent or domain-independent. The acoustic features are learned using data drawn from the University of Wisconsin X-ray Microbeam Database (XRMB). The features are evaluated within and across speakers on XRMB data, as well as on out-of-domain TIMIT and MOCHA-TIMIT data. Experimental results show consistent improvement with the learned acoustic features over baseline MFCCs and PCA projections. In both speaker-dependent and cross-speaker experiments, phonetic error rates are improved by 4-9% absolute (10-23% relative) using CCA-based features over baseline MFCCs. In cross-domain phonetic recognition (training on XRMB and testing on MOCHA or TIMIT), the learned projections provide smaller improvements.
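
A hedged sketch of the feature-learning step, using scikit-learn's linear CCA as a stand-in (the paper also uses kernel CCA, and the real second view comes from XRMB articulatory measurements rather than the random data used here):

```python
# Hedged sketch: learn a CCA projection from paired acoustic/articulatory views,
# then project acoustics only (the articulatory view is not needed at test time).
# Linear CCA from scikit-learn stands in for the paper's kernel CCA.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((5000, 39))      # view 1: e.g. MFCC frames
articulatory = rng.standard_normal((5000, 16))  # view 2: e.g. XRMB pellet positions

cca = CCA(n_components=10)
cca.fit(acoustic, articulatory)                 # needs paired frames for training

test_acoustic = rng.standard_normal((100, 39))
cca_feats = cca.transform(test_acoustic)        # projected acoustic features only
```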

Proceedings ArticleDOI
25 Aug 2013
TL;DR: Experimental results on the TIMIT dataset demonstrate that both methods are quite effective at adapting CNN-based acoustic models and that even better performance can be achieved by combining the two methods.
Abstract: Recently, we have proposed a novel fast adaptation method for the hybrid DNN-HMM models in speech recognition [1]. This method relies on learning an adaptation NN that is capable of transforming input speech features for a certain speaker into a more speaker-independent space given a suitable speaker code. Speaker codes are learned for each speaker during adaptation. The whole multi-speaker training dataset is used to learn the adaptation NN weights. Our previous work has shown that this method is quite effective in adapting DNNs even when only a very small amount of adaptation data is available. However, the proposed method does not work well in the case of convolutional neural networks (CNNs). In this paper, we investigate the fast adaptation of CNN models. We first modify the speaker-code-based adaptation method to better suit the CNN structure. Moreover, we investigate a new adaptation scheme using speaker-specific adaptive node output weights. These weights scale the outputs of different nodes to optimize the model for new speakers. Experimental results on the TIMIT dataset demonstrate that both methods are quite effective in adapting CNN-based acoustic models and that even better performance can be achieved by combining the two methods.

Journal ArticleDOI
TL;DR: It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.
Abstract: Accurate acoustic modeling is an essential requirement of a state-of-the-art continuous speech recognizer. The Acoustic Model (AM) describes the relation between the observed speech signal and the non-observable sequence of phonetic units uttered by the speaker. Nowadays, most recognizers use Hidden Markov Models (HMMs) in combination with Gaussian Mixture Models (GMMs) to model the acoustics, but neural-based architectures are on the rise again. In this work, the recently introduced Reservoir Computing (RC) paradigm is used for acoustic modeling. A reservoir is a fixed - and thus non-trained - Recurrent Neural Network (RNN) that is combined with a trained linear model. This approach combines the ability of an RNN to model the recent past of the input sequence with a simple and reliable training procedure. It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.
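
The appeal of the approach is that only the linear readout is trained while the recurrent weights stay fixed. A hedged numpy sketch of one reservoir layer with a ridge-regression readout follows; reservoir size, spectral radius, and leak rate are illustrative choices, not those tuned in the paper.

```python
# Hedged sketch of a reservoir-computing acoustic model: a fixed random recurrent
# network produces states, and only a linear readout is trained (ridge regression).
# Reservoir size, spectral radius and leak rate are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 39, 400, 41               # features, reservoir units, phone classes

W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))   # fixed, never trained
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius 0.9

def reservoir_states(frames, leak=0.4):
    x = np.zeros(n_res)
    states = []
    for u in frames:                           # frames: (n_frames, n_in)
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)                    # (n_frames, n_res)

# Train only the readout with ridge regression on (state, phone-label) pairs.
frames = rng.standard_normal((1000, n_in))
labels = rng.integers(0, n_out, 1000)
S = reservoir_states(frames)
Y = np.eye(n_out)[labels]                      # one-hot targets
ridge = 1e-2
W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ Y)
phone_scores = S @ W_out                       # frame-level class scores
```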

Proceedings ArticleDOI
25 Aug 2013
TL;DR: Results show that the combination of special one-state phone boundary models and monophone HMMs can significantly improve forced alignment accuracy, and that HMM-based forced alignment systems can benefit from using precise phonetic segmentation for training HMMs.
Abstract: This study attempts to improve automatic phonetic segmentation within the HMM framework. Experiments were conducted to investigate the use of phone boundary models, the use of precise phonetic segmentation for training HMMs, and the difference between context-dependent and context-independent phone models in terms of forced alignment performance. Results show that the combination of special one-state phone boundary models and monophone HMMs can significantly improve forced alignment accuracy. HMM-based forced alignment systems can also benefit from using precise phonetic segmentation for training HMMs. Context-dependent phone models are not better than context-independent models when combined with phone boundary models. The proposed system achieves 93.92% agreement (of phone boundaries) within 20 ms compared to manual segmentation on the TIMIT corpus. This is the best reported result on TIMIT to our knowledge.

Journal ArticleDOI
TL;DR: A novel technique is proposed that exploits LSTM in combination with Connectionist Temporal Classification in order to improve performance by using a self-learned amount of contextual information.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper applies graph-based learning to variable-length segments rather than to the fixed-length vector representations that have been used previously, and finds that the best learning algorithms are those that can incorporate prior knowledge.
Abstract: This paper presents several novel contributions to the emerging framework of graph-based semi-supervised learning for speech processing. First, we apply graph-based learning to variable-length segments rather than to the fixed-length vector representations that have been used previously. As part of this work we compare various graph-based learners, and we utilize an efficient feature selection technique for high-dimensional feature spaces that alleviates computational costs and improves the performance of graph-based learners. Finally, we present a method to improve regularization during the learning process. Experimental evaluation on the TIMIT frame and segment classification tasks demonstrates that the graph-based classifiers outperform standard baseline classifiers; furthermore, we find that the best learning algorithms are those that can incorporate prior knowledge.
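
As a hedged illustration of the graph-based semi-supervised setup (using scikit-learn's generic LabelSpreading on a k-NN similarity graph rather than the learners compared in the paper):

```python
# Hedged sketch of graph-based semi-supervised frame classification: build a
# k-NN similarity graph over labelled + unlabelled frames and propagate labels.
# scikit-learn's LabelSpreading is a stand-in for the paper's graph learners.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 39))          # frame-level feature vectors
labels = rng.integers(0, 10, 500)               # 10 phone classes (illustrative)
y = labels.copy()
y[100:] = -1                                    # -1 marks unlabelled frames

model = LabelSpreading(kernel='knn', n_neighbors=10)
model.fit(feats, y)
predicted = model.transduction_[100:]           # labels inferred for unlabelled frames
```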

Journal ArticleDOI
TL;DR: A set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries is presented that requires neither speech recognition nor annotated data, demonstrating the usefulness of ASMs for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing.
Abstract: We present a set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries that requires neither speech recognition nor annotated data. This work shows the possibilities in migrating from DTW-based to model-based approaches for unsupervised STD. The proposed approach consists of three components: self-organizing models, query matching, and query modeling. To construct the self-organizing models, repeated patterns are captured and modeled using acoustic segment models (ASMs). In the query matching phase, a document state matching (DSM) approach is proposed to represent documents as ASM sequences, which are matched to the query frames. In this way, not only do the ASMs better model the signal distributions and time trajectories of speech, but the much smaller number of states than frames for the documents leads to a much lower computational load. A novel duration-constrained Viterbi (DC-Vite) algorithm is further proposed for the above matching process to handle the speaking rate distortion problem. In the query modeling phase, a pseudo likelihood ratio (PLR) approach is proposed in the pseudo relevance feedback (PRF) framework. A likelihood ratio evaluated with query/anti-query HMMs trained with pseudo relevant/irrelevant examples is used to verify the detected spoken term hypotheses. The proposed framework demonstrates the usefulness of ASMs for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing. The best performance is achieved by integrating DTW-based approaches into the rescoring steps in the proposed framework. Experimental results show an absolute 14.2% improvement in mean average precision with a 77% reduction in CPU time compared with the segmental DTW approach on a Mandarin broadcast news corpus. Consistent improvements were found on the TIMIT and MediaEval 2011 Spoken Web Search corpora.

Proceedings ArticleDOI
26 May 2013
TL;DR: The results support the hypothesis that visual features are less affected by whisper speech, and that the lips' articulation between whisper and neutral speech is similar, providing a valuable whisper-invariant modality.
Abstract: Current automatic speech recognition (ASR) systems cannot recognize whisper speech with high accuracy. ASR systems are trained with neutral speech, which has significant acoustic differences from whisper speech (i.e., energy, duration, harmonic structure, and spectral slope). Given the limitations of speech-based systems in processing whisper speech, we propose to explore the benefits of visual features describing the orofacial area. We hypothesize that the lips' articulation between whisper and neutral speech is similar, providing a valuable whisper-invariant modality. This paper introduces the first audiovisual corpus of whisper speech. While we are targeting over 40 speakers, the current corpus has recordings from eleven subjects who were asked to read TIMIT sentences and isolated digits, alternating between neutral and whisper speech. The corpus also includes spontaneous recordings, in which the subject answered a series of general questions. The paper also analyzes an exhaustive set of audiovisual features, including action units (AUs), lip spreading, fundamental frequency, intensity, MFCCs, and formants. We study the differences in the features' distributions between whisper and neutral speech using Kullback-Leibler divergence (KLD). Then, we conducted statistical tests to determine whether the differences in the features are statistically significant. The results support our hypothesis that visual features are less affected by whisper speech.

01 Jan 2013
TL;DR: This thesis presents two posteriorgram-based speech representations which enable speaker-independent and noisy spoken term matching, and shows two lower-bounding-based methods for Dynamic Time Warping (DTW) based pattern matching algorithms.
Abstract: This thesis is motivated by the challenge of searching and extracting useful information from speech data in a completely unsupervised setting. In many real-world speech processing problems, obtaining annotated data is not cost- or time-effective. We therefore ask how much we can learn from speech data without any transcription. To address this question, in this thesis, we chose query-by-example spoken term detection as a specific scenario to demonstrate that this task can be done in the unsupervised setting without any annotations. To build the unsupervised spoken term detection framework, we contributed three main techniques to form a complete working flow. First, we present two posteriorgram-based speech representations which enable speaker-independent and noisy spoken term matching. The feasibility and effectiveness of both posteriorgram features are demonstrated through a set of spoken term detection experiments on different datasets. Second, we show two lower-bounding-based methods for Dynamic Time Warping (DTW) based pattern matching algorithms. Both algorithms greatly outperform conventional DTW in a single-threaded computing environment. Third, we describe the parallel implementation of the lower-bounded DTW search algorithm. Experimental results indicate that the total running time of the entire spoken term detection system grows linearly with corpus size. We also present the training of large Deep Belief Networks (DBNs) on Graphical Processing Units (GPUs). The phonetic classification experiment on the TIMIT corpus showed a speed-up of 36x for pre-training and 45x for back-propagation for a two-layer DBN trained on the GPU platform compared to the CPU platform. Thesis Supervisor: James R. Glass. Title: Senior Research Scientist
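
As a hedged illustration of the lower-bounding idea for DTW search: a cheap bound is computed first and the full DTW is run only when the bound could still beat the best match so far. The sketch below uses a generic LB_Keogh-style envelope bound with band-constrained DTW, not the exact bound developed in the thesis.

```python
# Hedged sketch of lower-bounded DTW search: a cheap LB_Keogh-style bound is
# used to skip full DTW computations that cannot beat the best match found so
# far. This is a generic illustration, not the thesis' exact bound or features.
import numpy as np

def lb_keogh(query, candidate, radius=5):
    """Lower bound on band-constrained DTW, via an envelope around the candidate."""
    lb = 0.0
    for i, q in enumerate(query):
        lo, hi = max(0, i - radius), min(len(candidate), i + radius + 1)
        upper, lower = candidate[lo:hi].max(), candidate[lo:hi].min()
        if q > upper:
            lb += (q - upper) ** 2
        elif q < lower:
            lb += (lower - q) ** 2
    return lb

def dtw(a, b, radius=5):
    """DTW restricted to a Sakoe-Chiba band, so lb_keogh is a valid lower bound."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def search(query, candidates):
    best_dist, best_idx = np.inf, -1
    for idx, cand in enumerate(candidates):
        if lb_keogh(query, cand) >= best_dist:   # prune: cannot beat current best
            continue
        d = dtw(query, cand)
        if d < best_dist:
            best_dist, best_idx = d, idx
    return best_idx, best_dist

rng = np.random.default_rng(0)
query = rng.standard_normal(80)                  # e.g. one posteriorgram dimension
candidates = [rng.standard_normal(80) for _ in range(50)]
print(search(query, candidates))
```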

Posted Content
TL;DR: The focus of this paper is a primal-dual training method that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics.
Abstract: We present an architecture of a recurrent neural network (RNN) with a fully-connected deep neural network (DNN) as its feature extractor. The RNN is equipped with both causal temporal prediction and non-causal look-ahead, via auto-regression (AR) and moving-average (MA), respectively. The focus of this paper is a primal-dual training method that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics. Experimental results demonstrate the effectiveness of this new method, which achieves 18.86% phone recognition error on the TIMIT benchmark for the core test set. The result approaches the best result of 17.7%, which was obtained by using an RNN with long short-term memory (LSTM). The results also show that the proposed primal-dual training method produces lower recognition errors than the popular RNN methods developed earlier, which rely on a carefully tuned threshold parameter that heuristically prevents the gradient from exploding.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper shows how to discover non-linear features of frames of spectrograms using a novel autoencoder that is used in a Deep Neural Network Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition.
Abstract: In this paper we show how we can discover non-linear features of frames of spectrograms using a novel autoencoder. The autoencoder uses a neural network encoder that predicts how a set of prototypes called templates need to be transformed to reconstruct the data, and a decoder that is a function that performs this operation of transforming prototypes and reconstructing the input. We demonstrate this method on spectrograms from the TIMIT database. The features are used in a Deep Neural Network Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition. On the TIMIT monophone recognition task we were able to achieve gains of 0.5% over Mel log spectra, by augmenting the traditional spectra with the predicted transformation parameters. Further, using the recently introduced ‘dropout’ training, we were able to achieve a phone error rate (PER) of 17.9% on the dev set and 19.5% on the test set, which, to our knowledge, is the best reported number on this task using a hybrid system.

Journal ArticleDOI
TL;DR: The classical problem of delta feature computation is addressed and interpreted in terms of Savitzky-Golay (SG) filtering; viewing dynamic feature computation as SG filtering with a fixed impulse response brings significantly lower latency with no loss in accuracy.
Abstract: We address the classical problem of delta feature computation, and interpret the operation involved in terms of Savitzky-Golay (SG) filtering. Features such as the mel-frequency cepstral coefficients (MFCCs), obtained based on short-time spectra of the speech signal, are commonly used in speech recognition tasks. In order to incorporate the dynamics of speech, auxiliary delta and delta-delta features, which are computed as temporal derivatives of the original features, are used. Typically, the delta features are computed in a smooth fashion using local least-squares (LS) polynomial fitting on each feature vector component trajectory. In the light of the original work of Savitzky and Golay, and a recent article by Schafer in IEEE Signal Processing Magazine, we interpret the dynamic feature vector computation for arbitrary derivative orders as SG filtering with a fixed impulse response. This filtering equivalence brings in significantly lower latency with no loss in accuracy, as validated by results on a TIMIT phoneme recognition task. The SG filters involved in dynamic parameter computation can be viewed as modulation filters, as proposed by Hermansky.
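
The equivalence is easy to check numerically: the standard regression-based delta weights are exactly the first-derivative Savitzky-Golay coefficients for the same window. A hedged scipy sketch follows; the half-window of 2 is just a typical choice, not prescribed by the paper.

```python
# Hedged sketch: the regression-based delta coefficients equal the Savitzky-Golay
# first-derivative filter for the same window, so deltas are plain FIR filtering.
# The half-window N = 2 is a typical choice, not the paper's prescription.
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

N = 2                                          # delta half-window
n = np.arange(-N, N + 1)
delta_weights = n / np.sum(n ** 2)             # classical regression-based deltas
sg_weights = savgol_coeffs(2 * N + 1, polyorder=2, deriv=1, use='dot')
print(np.allclose(delta_weights, sg_weights))  # True: same fixed impulse response

# Applying the deltas to a cepstral-coefficient trajectory is then just filtering.
track = np.cumsum(np.random.randn(200))        # one MFCC component over time
deltas = savgol_filter(track, 2 * N + 1, polyorder=2, deriv=1)
```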

Journal ArticleDOI
TL;DR: This paper proposes a new method for feature extraction from the trajectory of the speech signal in the reconstructed phase space (RPS) using the multivariate autoregressive (MVAR) method, with linear discriminant analysis (LDA) used for dimension reduction.

Journal ArticleDOI
TL;DR: A novel Self-Adjustable Neural Network is presented that enables the network to adjust itself according to different input data sizes, and is benchmarked against the standard, state-of-the-art recogniser, the Hidden Markov Model.

Proceedings ArticleDOI
M. Cutajar, Edward Gatt, Ivan Grech, Owen Casha, Joseph Micallef
01 Jul 2013
TL;DR: This paper presents the design of a digital hardware implementation based on Support Vector Machines (SVMs), for the task of multi-speaker phoneme recognition, and a priority scheme was also included in the architecture, in order to forecast the three most likely phonemes.
Abstract: This paper presents the design of a digital hardware implementation based on Support Vector Machines (SVMs), for the task of multi-speaker phoneme recognition. The One-against-one multiclass SVM method, with the Radial Basis Function (RBF) kernel was considered. Furthermore, a priority scheme was also included in the architecture, in order to forecast the three most likely phonemes. The designed system was synthesised on a Xilinx Virtex-II XC2V3000 FPGA, and evaluated with the TIMIT corpus. This phoneme recognition system is intended to be implemented on a dedicated chip, along with the Discrete Wavelet Transforms (DWTs) for feature extraction, to further improve the resultant performance.
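
As a hedged software analogue of the classification scheme (the paper implements it in FPGA hardware; scikit-learn, the feature dimension, and the class count below are stand-ins): a one-against-one RBF-kernel SVM whose outputs are used to report the three most likely phonemes.

```python
# Hedged software analogue of the one-against-one RBF-SVM phoneme classifier
# with a top-3 "priority" output. The paper's version is a hardware design;
# features, class count and scikit-learn itself are stand-ins here.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 24))              # e.g. DWT-based feature vectors
y = rng.integers(0, 20, 600)                    # 20 phoneme classes (illustrative)

svm = SVC(kernel='rbf', decision_function_shape='ovo', probability=True)
svm.fit(X, y)

probs = svm.predict_proba(X[:5])                # per-class probabilities
top3 = np.argsort(probs, axis=1)[:, -3:][:, ::-1]   # indices into svm.classes_:
                                                    # the three most likely phonemes
```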

Dissertation
01 Apr 2013
TL;DR: Analysts should have an understanding of the basis of LPC analysis and know how it is applied to obtain formant measurements in the software that they use, and understand the influence of LPC order and the other analysis parameters concerning formant tracking.
Abstract: The aim of this thesis is to provide guidance and information that will assist forensic speech scientists, and phoneticians generally, in making more accurate formant measurements, using commonly available speech analysis tools. Formant measurements are an important speech feature that are often examined in forensic casework, and are used widely in many other areas within the field of phonetics. However, the performance of software currently used by analysts has not been subject to detailed investigation. This thesis reports on a series of experiments that examine the influence that the analysis tools, analysis settings and speakers have on formant measurements. The influence of these three factors was assessed by examining formant measurement errors and their behaviour. This was done using both synthetic and real speech. The synthetic speech was generated with known formant values so that the measurement errors could be calculated precisely. To investigate the influence of different speakers on measurement performance, synthetic speakers were created with different third formant structures and with different glottal source signals. These speakers’ synthetic vowels were analysed using Praat’s normal formant measuring tool across a range of LPC orders. The real speech was from a subset of 186 speakers from the TIMIT corpus. The measurements from these speakers were compared with a set of hand-corrected reference formant values to establish the performance of four measurement tools across a range of analysis parameters and measurement strategies. The analysis of the measurement errors explored the relationships between the analysis tools, the analysis parameters and the speakers, and also examined how the errors varied over the vowel space. LPC order was found to have the greatest influence on the magnitude of the errors and their overall behaviour was closely associated with the underlying measurement process used by the tools. The performance of the formant trackers tended to be better than the simple Praat measuring tool, and allowing the LPC order to vary across tokens improved the performance for all tools. The performance was found to differ across speakers, and for each real speaker, the best performance was obtained when the measurements were made with a range of LPC orders, rather than being restricted to just one. The most significant guidance that arises from the results is that analysts should have an understanding of the basis of LPC analysis and know how it is applied to obtain formant measurements in the software that they use. They should also understand the influence of LPC order and the other analysis parameters concerning formant tracking. This will enable them to select the most appropriate settings and avoid making unreliable measurements.
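
As a hedged illustration of the LPC-based formant measurement process the thesis examines: fit an all-pole model to a pre-emphasised, windowed frame, then read candidate formants off the angles of the prediction-polynomial roots. The LPC order, bandwidth threshold, and pre-emphasis coefficient below are typical values, not any particular tool's settings.

```python
# Hedged sketch of LPC-based formant estimation: autocorrelation LPC on a
# windowed, pre-emphasised frame, then formants from the complex pole angles.
# Order and bandwidth threshold are typical values only, not any tool's defaults.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC: polynomial [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])    # Toeplitz normal equations
    return np.concatenate(([1.0], -a))

def formants(frame, fs=16000, order=12, max_bw=400.0):
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0]                # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)       # pole frequencies in Hz
    bws = -np.log(np.abs(roots)) * fs / np.pi        # pole bandwidths in Hz
    cands = sorted(f for f, b in zip(freqs, bws) if b < max_bw and f > 90)
    return cands[:4]                                 # F1-F4 candidates

fs = 16000
t = np.arange(0, 0.025, 1 / fs)                      # one 25 ms frame
frame = (np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
         + 0.01 * np.random.default_rng(0).standard_normal(t.size))
print(formants(frame, fs))                           # expect peaks near 500/1500 Hz
```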

Journal ArticleDOI
TL;DR: The proposed methods have led to significant improvements in recognition accuracy over conventional Hidden Markov Model (HMM) baseline systems, and the integration of EAMs with CVEM, DT, and MLP has also significantly improved the accuracy of the single-model systems based on CVEM, DT, and MLP, where the increased inter-model diversity is shown to have played an important role in the performance gain.
Abstract: We propose a novel approach of using Cross Validation (CV) and Speaker Clustering (SC) based data samplings to construct an ensemble of acoustic models for speech recognition. We also investigate the effects of the existing techniques of Cross Validation Expectation Maximization (CVEM), Discriminative Training (DT), and Multiple Layer Perceptron (MLP) features on the quality of the proposed ensemble acoustic models (EAMs). We have evaluated the proposed methods on the TIMIT phoneme recognition task as well as on a telemedicine automatic captioning task. The proposed methods have led to significant improvements in recognition accuracy over conventional Hidden Markov Model (HMM) baseline systems, and the integration of EAMs with CVEM, DT, and MLP has also significantly improved the accuracy of the single-model systems based on CVEM, DT, and MLP, where the increased inter-model diversity is shown to have played an important role in the performance gain.

Proceedings ArticleDOI
26 May 2013
TL;DR: A speech enhancement algorithm that applies a Kalman filter in the modulation domain to the output of a conventional enhancer operating in the time-frequency domain is proposed and it is demonstrated that it gives consistent performance improvements over the baseline enhancer.
Abstract: We propose a speech enhancement algorithm that applies a Kalman filter in the modulation domain to the output of a conventional enhancer operating in the time-frequency domain. The speech model required by the Kalman filter is obtained by performing linear predictive analysis in each frequency bin of the modulation domain signal. We show, however, that the corresponding speech synthesis filter can have a very high gain at low frequencies and may approach instability. To improve the stability of the synthesis filter, we propose two alternative methods of limiting its low frequency gain. We evaluate the performance of the speech enhancement algorithm on the core TIMIT test set and demonstrate that it gives consistent performance improvements over the baseline enhancer.

Proceedings ArticleDOI
26 May 2013
TL;DR: This paper attempts to apply the deep Boltzmann machine (DBM) to acoustic modeling, with the advantages that top-down feedback is incorporated and the parameters of all layers can be jointly optimized.
Abstract: In the past few years, deep neural networks (DNNs) have achieved great successes in speech recognition. The layer-wise pre-trained deep belief network (DBN) is known as one of the critical factors in optimizing the DNN. However, the DBN has one shortcoming: the pre-training procedure is a greedy forward pass. The top-down influences on the inference process are ignored, thus the pre-trained DBN is suboptimal. In this paper, we attempt to apply the deep Boltzmann machine (DBM) to acoustic modeling. The DBM has the advantages that top-down feedback is incorporated and the parameters of all layers can be jointly optimized. Experiments are conducted on the TIMIT phone recognition task to investigate the DBM-DNN acoustic model. Compared with a DBN-DNN with the same number of parameters, the phone error rate on the core test set is reduced by 3.8% relatively, and by an additional 5.1% with dropout fine-tuning.