
Showing papers on "TIMIT published in 2010"


Proceedings Article
06 Dec 2010
TL;DR: This work uses the mean-covariance restricted Boltzmann machine (mcRBM) to learn features of speech data that serve as input into a standard DBN, and achieves a phone error rate superior to all published results on speaker-independent TIMIT to date.
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
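
To make the covariance-modeling idea concrete, here is a minimal numpy sketch, under assumed sizes and weight scales, of the property the abstract emphasizes: each binary configuration of the precision units induces a different full precision matrix for the conditional distribution over the acoustic space. The matrices C and P are stand-ins, not the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_fact, n_prec = 39, 64, 32                # hypothetical sizes for one acoustic frame

C = rng.normal(scale=0.1, size=(n_vis, n_fact))   # visible-to-factor loadings
P = rng.uniform(size=(n_fact, n_prec))            # factor-to-precision-unit pooling

def conditional_precision(h_c):
    """Precision matrix of p(v | h_c): a different full matrix for
    every binary setting of the precision units h_c."""
    d = P @ h_c                                    # per-factor scale from active units
    return C @ np.diag(d) @ C.T + np.eye(n_vis)    # identity term keeps it positive definite

h_c = rng.integers(0, 2, size=n_prec)              # one configuration of precision units
Sigma_inv = conditional_precision(h_c)             # conditional Gaussian over the frame
```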

326 citations


Proceedings Article
06 Dec 2010
TL;DR: A theorem is stated showing that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss.
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.
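
A compact sketch of the kind of update the theorem concerns, assuming a small enumerable label set with precomputed feature vectors (feats, loss, eps and lr are all placeholders): the feature difference between loss-adjusted inference and ordinary inference, scaled by 1/eps, approximates the gradient of the task loss.

```python
import numpy as np

def direct_loss_update(w, feats, y_star, labels, loss, eps=0.1, lr=0.01):
    """One perceptron-like step: compare the ordinary argmax with the
    loss-adjusted argmax; their feature difference (scaled by 1/eps)
    approximates the gradient of the task loss."""
    score = lambda y: w @ feats[y]
    y_hat = max(labels, key=score)                                       # standard inference
    y_adj = max(labels, key=lambda y: score(y) + eps * loss(y, y_star))  # loss-adjusted
    return w - lr * (feats[y_adj] - feats[y_hat]) / eps

# toy usage with one-hot features and 0/1 task loss
feats = {y: np.eye(3)[y] for y in range(3)}
w = direct_loss_update(np.zeros(3), feats, y_star=1, labels=range(3),
                       loss=lambda y, t: float(y != t))
```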

144 citations


Proceedings ArticleDOI
26 Sep 2010
TL;DR: The evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data and show that the variety of noise available will allow for better evaluation of VAD systems than existing approaches in the literature.
Abstract: The QUT-NOISE-TIMIT corpus consists of 600 hours of noisy speech sequences designed to enable a thorough evaluation of voice activity detection (VAD) algorithms across a wide variety of common background noise scenarios. In order to construct the final mixed-speech database, over 10 hours of background noise was first collected across 10 unique locations covering 5 common noise scenarios, creating the QUT-NOISE corpus. This background noise corpus was then mixed with speech events chosen from the TIMIT clean speech corpus over a wide variety of noise lengths, signal-to-noise ratios (SNRs) and active speech proportions to form the mixed-speech QUT-NOISE-TIMIT corpus. An evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data and to show that the variety of noise available allows for better evaluation of VAD systems than existing approaches in the literature.
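
As a rough illustration of the mixing step, here is a minimal sketch that scales a noise segment so the mixture hits a target SNR; the actual corpus construction also varies noise length and active-speech proportion, which this omits.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech + noise has the requested SNR in dB
    (a simplified sketch of corpus-style speech/noise mixing)."""
    noise = noise[:len(speech)]                   # assume noise is at least as long
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```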

135 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: Conditional Restricted Boltzmann Machines (CRBMs) have recently proved to be very effective for modeling motion capture sequences and this paper investigates the application of this more powerful type of generative model to acoustic modeling.
Abstract: For decades, Hidden Markov Models (HMMs) have been the state-of-the-art technique for acoustic modeling despite their unrealistic independence assumptions and the very limited representational capacity of their hidden states. Conditional Restricted Boltzmann Machines (CRBMs) have recently proved to be very effective for modeling motion capture sequences and this paper investigates the application of this more powerful type of generative model to acoustic modeling. On the standard TIMIT corpus, one type of CRBM outperforms HMMs and is comparable with the best other methods, achieving a phone error rate (PER) of 26.7% on the TIMIT core test set.
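
A minimal sketch of what makes an RBM "conditional" in this sense, with hypothetical sizes and weights: the hidden units receive a dynamic bias computed from the recent frames, so inference over the hidden state depends on acoustic history.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, n_hist = 39, 100, 3              # frame dim, hidden units, history frames

W = rng.normal(scale=0.01, size=(n_vis, n_hid))           # RBM weights
B = rng.normal(scale=0.01, size=(n_hist * n_vis, n_hid))  # history -> hidden dynamic bias

def hidden_probs(v, history):
    """p(h = 1 | v, history): the usual W^T v term plus a dynamic bias
    computed from the concatenated previous frames."""
    return 1.0 / (1.0 + np.exp(-(v @ W + history @ B)))

v = rng.normal(size=n_vis)                     # current frame
history = rng.normal(size=n_hist * n_vis)      # previous frames, concatenated
p_h = hidden_probs(v, history)
```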

121 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: The viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus is demonstrated.
Abstract: In this paper, we explore the use of a Gaussian posteriorgram based representation for unsupervised discovery of speech patterns. Compared with our previous work, the new approach provides significant improvement towards speaker independence. The framework consists of three main procedures: a Gaussian posteriorgram generation procedure which learns an unsupervised Gaussian mixture model and labels each speech frame with a Gaussian posteriorgram representation; a segmental dynamic time warping procedure which locates pairs of similar sequences of Gaussian posteriorgram vectors; and a graph clustering procedure which groups similar sequences into clusters. We demonstrate the viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus.
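
A minimal sketch of the first procedure, using scikit-learn's GMM in place of the paper's own training, with random frames standing in for MFCCs; the negative-log inner-product frame distance shown for the segmental DTW stage is one common choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

frames = np.random.randn(5000, 13)                 # stand-in for MFCC frames
gmm = GaussianMixture(n_components=50, covariance_type='diag').fit(frames)
posteriorgram = gmm.predict_proba(frames)          # one posterior vector per frame

def frame_distance(p, q, floor=1e-6):
    """Distance between two posteriorgram vectors, as used inside
    segmental DTW: -log of their inner product, floored to avoid log 0."""
    return -np.log(max(p @ q, floor))
```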

90 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: A novel Bayesian compressive sensing (CS) technique for phonetic classification is introduced, and it is found that this method outperforms the SVM, kNN and Gaussian Mixture Model (GMM) methods on the TIMIT phonetic classification task.
Abstract: In this paper, we introduce a novel Bayesian compressive sensing (CS) technique for phonetic classification. CS is often used to characterize a signal from a few support training examples, similar to k-nearest neighbor (kNN) and Support Vector Machines (SVMs). However, unlike SVMs and kNNs, CS allows the number of supports to be adapted to the specific signal being characterized. On the TIMIT phonetic classification task, we find that our CS method outperforms the SVM, kNN and Gaussian Mixture Model (GMM) methods. Our CS method achieves an accuracy of 80.01%, one of the best results reported in the literature to date.

77 citations


Proceedings ArticleDOI
23 Aug 2010
TL;DR: The closed-set problem of speaker identification is addressed by presenting a novel sparse representation classification algorithm that uses the GMM mean supervector kernel over all the training utterances to generate a naturally sparse representation.
Abstract: We address the closed-set problem of speaker identification by presenting a novel sparse representation classification algorithm. We propose to develop an overcomplete dictionary using the GMM mean supervector kernel for all the training utterances. A given test utterance corresponds to only a small fraction of the whole training database. We therefore propose to represent a given test utterance as a linear combination of all the training utterances, thereby generating a naturally sparse representation. Using this sparsity, the unknown vector of coefficients is computed via l1 minimization, which also yields the sparsest solution [12]. Ideally, the vector of coefficients so obtained has nonzero entries representing the class index of the given test utterance. Experiments have been conducted on the standard TIMIT [14] database, and a comparison with state-of-the-art speaker identification algorithms yields a favorable performance index for the proposed algorithm.
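
A small sketch of the classification idea, with an l1-regularized least-squares solver standing in for exact l1 minimization; the dictionary, labels and sizes are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.normal(size=(512, 200))                   # columns: supervectors of train utterances
D /= np.linalg.norm(D, axis=0)                    # unit-norm dictionary atoms
labels = np.repeat(np.arange(20), 10)             # 20 speakers x 10 utterances each
y = D[:, 3] + 0.01 * rng.normal(size=512)         # test supervector near speaker 0

x = Lasso(alpha=1e-3, max_iter=10000).fit(D, y).coef_   # sparse coefficient vector

# the predicted speaker is the class whose training columns carry the most mass
scores = [np.abs(x[labels == s]).sum() for s in range(20)]
print("predicted speaker:", int(np.argmax(scores)))
```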

74 citations


Proceedings ArticleDOI
06 Dec 2010
TL;DR: The results obtained show that the PSO-SVM approach gives better classification accuracy even though execution time increases; PSO makes it possible to optimize the performance of the SVM classifier.
Abstract: In many classification problems, the performance of a classifier is often evaluated by a single factor (the error rate). This factor is not well adapted to complex real-world problems, in particular multiclass problems. Our contribution consists in adapting an evolutionary method to optimize this factor. Among the available optimization methods, we chose Particle Swarm Optimization (PSO), which makes it possible to optimize the performance of a Support Vector Machine (SVM) classifier. The experiments are carried out on the TIMIT corpus. The results obtained show that the PSO-SVM approach gives better classification accuracy even though the execution time is increased.

58 citations


Journal ArticleDOI
TL;DR: In this paper, the authors apply the PDF projection theorem to generalize the hidden Markov model (HMM) to accommodate multiple simultaneous segmentations of the raw data and multiple feature extraction transformations.
Abstract: We apply the PDF projection theorem to generalize the hidden Markov model (HMM) to accommodate multiple simultaneous segmentations of the raw data and multiple feature extraction transformations. Different segment sizes and feature transformations are assigned to each state. The algorithm averages over all allowable segmentations by mapping the segmentations to a “proxy” HMM and using the forward procedure. A by-product of the algorithm is the set of a posteriori state probability estimates that serve as a description of the input data. These probabilities have simultaneously the temporal resolution of the smallest processing windows and the processing gain and frequency resolution of the largest processing windows. The method is demonstrated on the problem of precisely modeling the consonant “T” in order to detect the presence of a distinct “burst” component. We compare the algorithm against standard speech analysis methods using data from the TIMIT corpus.
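
The recursion behind the averaging step is the standard forward procedure; a generic log-domain sketch follows (the paper's mapping of segmentations onto the "proxy" HMM's states and observation scores is its contribution and is not reproduced here).

```python
import numpy as np

def forward_loglik(log_A, log_pi, log_obs):
    """Standard forward procedure in the log domain.
    log_A: (N, N) log transition matrix; log_pi: (N,) log initial probs;
    log_obs: (T, N) per-state log observation scores."""
    T, N = log_obs.shape
    alpha = log_pi + log_obs[0]
    for t in range(1, T):
        alpha = log_obs[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)          # total log-likelihood
```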

52 citations


Proceedings ArticleDOI
Dong Yu1, Li Deng1
26 Sep 2010
TL;DR: Both the three-layer and two-layer DHCRFs are superior to the discriminatively trained tri-phone hidden Markov model using identical input features; the paper investigates the use of this new sequential deep-learning model for phonetic recognition.
Abstract: We extend our earlier work on deep-structured conditional random field (DCRF) and develop deep-structured hidden conditional random field (DHCRF). We investigate the use of this new sequential deep-learning model for phonetic recognition. DHCRF is a hierarchical model in which the final layer is a hidden conditional random field (HCRF) and the intermediate layers are zeroth-order conditional random fields (CRFs). Parameter estimation and sequence inference in the DHCRF are developed in this work. They are carried out layer by layer so that the time complexity is linear in the number of layers. In the DHCRF, the training label is available only at the final layer and the state boundary is unknown. This difficulty is addressed by using unsupervised learning for the intermediate layers and lattice-based supervised learning for the final layer. Experiments on the standard TIMIT phone recognition task show small performance improvement of a three-layer DHCRF over a two-layer DHCRF; both are significantly better than the single-layer DHCRF and are superior to the discriminatively trained tri-phone hidden Markov model (HMM) using identical input features.

47 citations


Proceedings Article
01 Dec 2010
TL;DR: A conditional entropy minimizer is added to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion, in a semi-supervised manner, for Gaussian Mixture Models.
Abstract: In this paper, we propose a new semi-supervised training method for Gaussian Mixture Models. We add a conditional entropy minimizer to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion. The training method is simple but surprisingly effective. The preconditioned conjugate gradient method provides a reasonable convergence rate for the parameter updates. Phonetic classification experiments on the TIMIT corpus demonstrate significant improvements due to unlabeled data via our training criterion.
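
A sketch of the criterion as described, assuming per-frame log joint likelihoods under class-conditional GMMs have already been computed; the trade-off weight lam is a placeholder.

```python
import numpy as np

def semi_supervised_objective(log_joint_lab, y, log_joint_unlab, lam=0.5):
    """MMI on labeled data minus a conditional-entropy penalty on
    unlabeled data. log_joint_*: (n, n_classes) log p(x, c) under the GMMs."""
    def log_post(lj):
        return lj - np.logaddexp.reduce(lj, axis=1, keepdims=True)
    mmi = log_post(log_joint_lab)[np.arange(len(y)), y].mean()   # log-posterior of truth
    lp = log_post(log_joint_unlab)
    cond_ent = -(np.exp(lp) * lp).sum(axis=1).mean()             # H(c | x) on unlabeled
    return mmi - lam * cond_ent                                  # quantity to maximize
```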

Journal ArticleDOI
TL;DR: In this article, a 2D analysis framework using 2D transformations of the time-frequency space is proposed to obtain an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency.
Abstract: This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological studies implicating the use of pitch dynamics in speech by humans. We develop and assess signal processing schemes aimed at exploiting temporal change of pitch to address the high-pitch formant frequency estimation problem. Specifically, we propose a 2-D analysis framework using 2-D transformations of the time-frequency space. In one approach, we project changing spectral harmonics over time to a 1-D function of frequency. In a second approach, we draw upon previous work of Quatieri and Ezzat, with similarities to the auditory modeling efforts of Chi, where localized 2-D Fourier transforms of the time-frequency space provide improved source-filter separation when pitch is changing. Our methods show quantitative improvements for synthesized vowels with stationary formant structure in comparison to traditional and homomorphic linear prediction. We also demonstrate the feasibility of applying our methods on stationary vowel regions of natural speech spoken by high-pitch females of the TIMIT corpus. Finally, we show improvements afforded by the proposed analysis framework in formant tracking on examples of stationary and time-varying formant structure.
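
A bare-bones sketch of the localized 2-D transform idea, with arbitrary STFT sizes and white noise standing in for a vowel segment: harmonics that move with pitch concentrate energy along oriented components of the 2-D spectrum of a time-frequency patch.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
x = np.random.randn(fs)                            # stand-in for a voiced segment
f, t, S = stft(x, fs=fs, nperseg=400, noverlap=350)
logmag = np.log(np.abs(S) + 1e-8)                  # time-frequency surface

patch = logmag[:80, :40]                           # localized patch of the surface
patch = patch - patch.mean()                       # remove DC before transforming
F2d = np.fft.fftshift(np.fft.fft2(patch))          # localized 2-D Fourier transform
```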

Proceedings ArticleDOI
26 Sep 2010
TL;DR: The goal of this paper is to answer the above two questions, both through mathematically analyzing different sparseness methods and also comparing these approaches for phonetic classification in TIMIT.
Abstract: The use of exemplar-based techniques for both speech classification and recognition tasks has become increasingly popular in recent years. However, the notion of why sparseness is important for exemplar-based speech processing has been relatively unexplored. In addition, little analysis has been done in speech processing on the appropriateness of different types of sparsity regularization constraints. The goal of this paper is to answer the above two questions, both through mathematically analyzing different sparseness methods and also comparing these approaches for phonetic classification in TIMIT.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: Two nonlinear feature dimensionality reduction methods based on neural networks for an HMM-based phone recognition system are presented; recognition accuracies with the transformed features are slightly higher than those obtained with the original features and considerably higher than those obtained with linear dimensionality reduction methods.
Abstract: This paper presents two nonlinear feature dimensionality reduction methods based on neural networks for an HMM-based phone recognition system. The neural networks are trained as feature classifiers to reduce feature dimensionality as well as maximize discrimination among speech features. The outputs of different network layers are used for obtaining transformed features. Moreover, the training of the neural networks uses the category information that corresponds to a state in HMMs so that the trained networks can better accommodate the temporal variability of features and obtain more discriminative features in a low dimensional space. Experimental evaluation using the TIMIT database shows that recognition accuracies with the transformed features are slightly higher than those obtained with the original features and considerably higher than those obtained with linear dimensionality reduction methods. The highest phone accuracy obtained with 39 phone classes and TIMIT was 74.9%, using a large number of training iterations based on the state-specific targets.

Proceedings ArticleDOI
15 Mar 2010
TL;DR: The focus of this paper is to develop a knowledge-based robust syllable segmentation algorithm and to establish the importance of accurate segmentation in both the training and testing phases of a speech recognition system.
Abstract: The focus of this paper is two-fold: (a) to develop a knowledge-based robust syllable segmentation algorithm and (b) to establish the importance of accurate segmentation in both the training and testing phases of a speech recognition system. A robust segmentation algorithm for segmenting the speech signal into syllables is first developed. This uses a non-statistical technique that is based on group delay (GD) segmentation and Vowel Onset Point (VOP) detection. The transcription corresponding to the utterance is syllabified using rules. This produces an annotation for the train data. The annotated train data is then used to train a syllable-based speech recognition system. The test signal is also segmented using the proposed algorithm. This segmentation information is then incorporated into the linguistic search space to reduce both computational complexity and word error rate (WER). WERs of 4.4% and 21.2% are reported on the TIMIT and NTIMIT databases respectively.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: The results indicate that the proposed multi-SNR multi-environment speaker models and speech enhancement preprocessing methods have enhanced the speaker recognition performance in the presence of different noisy environments.
Abstract: In this paper, we explore different models and methods for improving the performance of a text-independent speaker identification system for mobile devices. The major issues in speaker recognition for mobile devices are (i) the presence of varying background environments, (ii) the effect of speech coding introduced by the mobile device, and (iii) impairments due to the wireless channel. We propose multi-SNR multi-environment speaker models and speech enhancement (preprocessing) methods for improving the performance of a speaker recognition system in the mobile environment. For this study, we simulated five different background environments (car, factory, high-frequency, pink noise and white Gaussian noise) using NOISEX data. Speaker recognition studies are carried out on TIMIT, cellular, and microphone speech databases. Autoassociative neural network models are explored for developing these multi-SNR multi-environment speaker models. The results indicate that the proposed multi-SNR multi-environment speaker models and speech enhancement preprocessing methods enhance speaker recognition performance in the presence of different noisy environments.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: A new feature parameter extraction algorithm is proposed for speaker recognition systems; it combines the traditional MFCC and the dynamic MFCC into a new series of coefficients that are weighted as front-end parameters of the GMM, decreasing the dimension of the mixed weighted GMM and reducing the computational complexity.
Abstract: In this paper, a new feature parameter extraction algorithm is proposed for use in speaker recognition systems; it combines the traditional MFCC and the dynamic MFCC into a new series of coefficients. Based on a statistical analysis of the different contributions of the dynamic MFCC and the traditional MFCC, these coefficients are weighted as front-end parameters of the GMM, which decreases the dimension of the mixed weighted GMM and reduces the computational complexity. Experiments based on the TIMIT and VOA speech databases were implemented in a MATLAB environment, and the results showed that a speaker recognition system with the weighted dynamic MFCC can obtain better performance, with a high recognition rate and low computational complexity.
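
A short sketch of the feature construction using librosa; the bundled example clip and the stream weights are stand-ins, since the paper derives its weights from contribution statistics.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example('trumpet'))   # stand-in for a TIMIT utterance
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13) # traditional MFCC
d1 = librosa.feature.delta(mfcc)                   # dynamic (first-order) MFCC

w_static, w_dynamic = 0.7, 0.3                     # hypothetical contribution weights
features = np.vstack([w_static * mfcc, w_dynamic * d1]).T   # frames x 26 dims
```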

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed temporal-frequency based reconstruction method is more effective at increasing speech recognition performance in adverse conditions.
Abstract: This paper proposes a novel missing-feature reconstruction method to improve speech recognition in background noise environments. The existing missing-feature reconstruction method utilizes log-spectral correlation across frequency bands. In this paper, we propose to employ a temporal spectral feature analysis to improve the missing-feature reconstruction performance by leveraging temporal correlation across neighboring frames. As in the conventional method, a Gaussian mixture model is obtained by training over the obtained temporal spectral feature set. The final estimates for missing-feature reconstruction are obtained by a selective combination of the original frequency correlation based method and the proposed temporal correlation-based method. Performance of the proposed method is evaluated on the TIMIT speech corpus using various types of background noise conditions and the CU-Move in-vehicle speech corpus. Experimental results demonstrate that the proposed method is more effective at increasing speech recognition performance in adverse conditions. By employing the proposed temporal-frequency based reconstruction method, a +17.71% average relative improvement in word error rate (WER) is obtained for white, car, speech babble, and background music conditions over 5-, 10-, and 15-dB SNR, compared to the original frequency correlation-based method. We also obtain a +16.72% relative improvement in real-life in-vehicle conditions using data from the CU-Move corpus.
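
The estimate at the core of such reconstruction methods is the conditional mean of a Gaussian; here is a single-component sketch (a full system would mix such estimates across GMM components and combine the frequency and temporal models).

```python
import numpy as np

def reconstruct(x, reliable, mu, Sigma):
    """Fill unreliable spectral dimensions with their conditional mean
    given the reliable ones, under one Gaussian: a single-component
    sketch of GMM-based missing-feature reconstruction.
    reliable: boolean mask over the dimensions of x."""
    r = np.asarray(reliable)
    m = ~r
    S_rr = Sigma[np.ix_(r, r)]
    S_mr = Sigma[np.ix_(m, r)]
    x_hat = x.astype(float).copy()
    # conditional mean: mu_m + S_mr S_rr^{-1} (x_r - mu_r)
    x_hat[m] = mu[m] + S_mr @ np.linalg.solve(S_rr, x[r] - mu[r])
    return x_hat
```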

Journal ArticleDOI
TL;DR: An empirical comparison of kernel selection for SVM was conducted for text-independent speaker identification using the TIMIT corpus; the best performance was achieved with the polynomial kernel, giving a speaker identification rate of 82.47%.
Abstract: Support vector machine (SVM) was the first proposed kernel-based method. It uses a kernel function to transform data from the input space into a high-dimensional feature space in which it searches for a separating hyperplane. SVM aims to maximise the generalisation ability, which depends on the empirical risk and the complexity of the machine. SVM has been widely adopted in real-world applications including speech recognition. In this paper, an empirical comparison of kernel selection for SVM is presented and discussed with respect to performance on text-independent speaker identification using the TIMIT corpus. We focused on SVMs trained using linear, polynomial and radial basis function (RBF) kernels. Results showed that the best performance was achieved with the polynomial kernel, giving a speaker identification rate of 82.47%.
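
The comparison itself is easy to reproduce in outline with scikit-learn; the synthetic data below merely stands in for utterance-level speaker features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# stand-in data; in the paper each sample would be an utterance-level
# feature vector and each class a TIMIT speaker
X, y = make_classification(n_samples=600, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:>6}: mean CV accuracy = {acc:.3f}")
```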

Proceedings ArticleDOI
26 Sep 2010
TL;DR: This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification; the promising results outperform other approaches based on glottal features.
Abstract: Most current speaker recognition systems are based on features extracted from the magnitude spectrum of speech. However, the excitation signal produced by the glottis is expected to convey complementary relevant information about the speaker identity. This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification. Experiments using these signatures are performed on both the TIMIT and YOHO databases. The promising results outperform other approaches based on glottal features. It is also highlighted that the signatures can be used for text-independent speaker recognition and that only a few seconds of voiced speech are sufficient for estimating them reliably.

Journal ArticleDOI
TL;DR: This paper demonstrates the application of the Laplacian eigenmaps latent variable model (LELVM) to the task of speech recognition and shows the superiority of the proposed method over the usual PCA methods.

Proceedings ArticleDOI
01 Nov 2010
TL;DR: The word-based stress detection method proposed in this paper employs the SVM to classify the stress patterns; the frame-averaged features and intra-word contextual information can be input to the classifiers without normalization.
Abstract: This paper investigates lexical stress detection for Chinese learners of English, where a combined differential acoustic feature is developed to represent the lexical stress of polysyllabic words in continuous speech. The frame-averaged features and the intra-word contextual information can be input to the classifiers without normalization. The word-based stress detection method proposed in this paper employs the SVM to classify the stress patterns. In the experiments, a subset of the TIMIT corpus is used, with carefully selected target polysyllabic words and sentences. Multiple SVMs are trained with the combined differential acoustic features. Speech from both native and non-native speakers is tested for lexical stress detection. The detection system obtained an average word accuracy of 89.78% on speech from native speakers, and 77.37% on speech from Chinese learners.

Book ChapterDOI
07 Apr 2010
TL;DR: An effective algorithm based on particle swarm optimization is proposed for discovering the best feature combinations; with the optimized feature subset, the performance of the system is improved and the speed of verification is significantly increased.
Abstract: The problem addressed in this paper concerns feature subset selection for an automatic speaker verification system. An effective algorithm based on particle swarm optimization is proposed here for discovering the best feature combinations. After the feature reduction phase, feature vectors are applied to a Gaussian mixture model, which is a text-independent speaker verification model. The performance of the proposed system is compared to that of a genetic algorithm-based system and the baseline algorithm. Experimentation is carried out using the TIMIT corpus. The results of the experiments indicate that with the optimized feature subset, the performance of the system is improved. Moreover, the speed of verification is significantly increased, since by use of PSO the number of features is reduced by over 85%, which consequently decreases the complexity of our ASV system.
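
A minimal binary-PSO sketch of the feature-selection loop; in the paper the fitness would wrap the GMM-based verifier's performance with a subset-size penalty, which is left abstract here.

```python
import numpy as np

def binary_pso(fitness, n_feats, n_particles=20, iters=30, seed=0):
    """Binary PSO over 0/1 feature masks (a sketch): velocities are
    squashed through a sigmoid to give per-bit probabilities of 1."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, (n_particles, n_feats))      # particle positions (masks)
    V = rng.normal(size=(n_particles, n_feats))         # particle velocities
    pbest = X.copy()
    pbest_f = np.array([fitness(x) for x in X])
    g = pbest[pbest_f.argmax()].copy()                  # global best mask
    for _ in range(iters):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (g - X)
        X = (rng.random(X.shape) < 1 / (1 + np.exp(-V))).astype(int)
        f = np.array([fitness(x) for x in X])
        better = f > pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        g = pbest[pbest_f.argmax()].copy()
    return g

# toy fitness that prefers masks with about 5 selected features
best_mask = binary_pso(lambda m: -abs(int(m.sum()) - 5), n_feats=39)
```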

Journal ArticleDOI
TL;DR: A two-stage speech activity detection system is presented which first takes advantage of a voice activity detector to discard pause segments from the audio signals; this is done even in the presence of stationary background noises.

Proceedings Article
01 Jan 2010
TL;DR: The experimental results show that the proposed training method for a large-scale linear classifier employed in WFST-based decoding, using a distributed perceptron algorithm, was successfully applied to a large vocabulary continuous speech recognition task and achieved an improvement over the performance of minimum phone error based discriminative training of acoustic models.
Abstract: This paper describes a discriminative approach that further advances the framework for Weighted Finite State Transducer (WFST) based decoding. The approach introduces additional linear models for adjusting the scores of a decoding graph composed of conventional information source models (e.g., hidden Markov models and N-gram models), and views the WFST-based decoding process as a linear classifier for structured data (e.g., sequential multiclass data). The difficulty with the approach is that the number of dimensions of the additional linear models becomes very large in proportion to the number of arcs in a WFST, and our previous study only applied it to a small task (TIMIT phoneme recognition). This paper proposes a training method for a large-scale linear classifier employed in WFST-based decoding by using a distributed perceptron algorithm. The experimental results show that the proposed approach was successfully applied to a large vocabulary continuous speech recognition task, and achieved an improvement compared with the performance of the minimum phone error based discriminative training of acoustic models. Index Terms: speech recognition, weighted finite state transducer, linear classifier, distributed perceptron, large vocabulary continuous speech recognition
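
The distributed-perceptron idea can be sketched with iterative parameter mixing on a plain binary perceptron; the paper applies the structured version to decoding-graph arc features, which this simplification omits.

```python
import numpy as np

def distributed_perceptron(shards, n_dim, epochs=5):
    """Iterative parameter mixing (a sketch): run a perceptron on each
    data shard starting from the current average weights, then average
    the per-shard weights after every epoch."""
    w = np.zeros(n_dim)
    for _ in range(epochs):
        per_shard = []
        for shard in shards:                  # in practice: run in parallel
            w_s = w.copy()
            for x, y in shard:                # y in {-1, +1}
                if y * (w_s @ x) <= 0:        # mistake-driven update
                    w_s += y * x
            per_shard.append(w_s)
        w = np.mean(per_shard, axis=0)        # parameter mixing step
    return w
```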

Proceedings ArticleDOI
14 Mar 2010
TL;DR: An initial attempt for phoneme recognition using structured SVM is presented, which was able to offer an absolute performance improvement of 1.33% over HMMs even with a highly simplified initial approach, probably because of the concept of maximized margin of SVM.
Abstract: Structured Support Vector Machine (SVM) is a recently developed extension of the very successful SVM approach, which can efficiently classify structured patterns with maximized margin. This paper presents an initial attempt at phoneme recognition using structured SVM. We simply learn the basic framework of HMMs in configuring the structured SVM. In preliminary experiments with the TIMIT corpus, the proposed approach was able to offer an absolute performance improvement of 1.33% over HMMs even with a highly simplified initial approach, probably because of the concept of maximized margin of SVM. We see the potential of this approach because of the high generality, high flexibility, and high power of structured SVM.

Journal ArticleDOI
TL;DR: Phoneme recognition experiments on TIMIT show that the features derived from the combined set of discriminative filters outperform conventional speech recognition features, and also contain significant complementary information.
Abstract: This paper proposes novel data-driven and feedback based discriminative spectro-temporal filters for feature extraction in automatic speech recognition (ASR). Initially, a first set of spectro-temporal filters is designed to separate each phoneme from the rest of the phonemes. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters. As a feedback to the feature extraction stage, top confusions of this system are identified, and a second set of filters is designed specifically to address these confusions. Phoneme recognition experiments on TIMIT show that the features derived from the combined set of discriminative filters outperform conventional speech recognition features, and also contain significant complementary information.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A comparative evaluation of speech enhancement algorithms for robust automatic speech recognition on a core test set of the TIMIT speech corpus and mean objective speech quality and ASR correctness scores under two noise conditions are given.
Abstract: A comparative evaluation of speech enhancement algorithms for robust automatic speech recognition is presented. The evaluation is performed on a core test set of the TIMIT speech corpus. Mean objective speech quality scores as well as ASR correctness scores under two noise conditions are given.

Proceedings ArticleDOI
13 Dec 2010
TL;DR: An effective algorithm for classification of one group of phonemes, namely the unvoiced fricatives, which are characterized by a relatively large amount of spectral energy in the high frequency range, is presented.
Abstract: Classification of phonemes is the process of assigning a phonetic category to a short section of speech signal. It is a key stage in various applications such as Spoken Term Detection, continuous speech recognition and music-to-lyrics synchronization, but it can also be useful on its own, for example in the professional music industry and in applications for the hearing impaired. In this study we present an effective algorithm for classification of one group of phonemes, namely the unvoiced fricatives, which are characterized by a relatively large amount of spectral energy in the high frequency range. Classification between individual phonemes within this group is fairly difficult because their acoustic-phonetic characteristics are quite similar. A three-stage classification algorithm for the unvoiced fricatives is utilized. In the first, preprocessing stage, each phoneme segment is divided into consecutive non-overlapping short windowed frames, each of which is represented by a 15-dimensional feature vector. In the second stage a support vector machine (SVM) is trained, using a radial basis kernel function and an automatic grid search to optimize the SVM parameters. A tree-based algorithm is used in the classification stage, where the phonemes are first classified into two subgroups according to their articulation: sibilants (/s/ and /sh/) and non-sibilants (/f/ and /th/). Each subgroup is further classified using another SVM. To evaluate the performance of the algorithm we used more than 11000 phonemes extracted from the TIMIT speech database. Using a majority vote over the feature vectors of the same phoneme, an overall accuracy of 85% is obtained (91% for the subset /s/, /sh/ and /f/). These results are comparable to, and somewhat better than, those achieved in other studies. The efficiency and robustness of the algorithm make it implementable in real-time applications for the hearing impaired or in recording studios.
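
A sketch of the tree-structured stage with stand-in features and labels: one SVM separates sibilants from non-sibilants, and a dedicated SVM refines each subgroup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 15))                       # stand-in 15-dim frame features
phones = rng.choice(['s', 'sh', 'f', 'th'], 400)     # stand-in labels
sib = np.isin(phones, ['s', 'sh'])                   # sibilant subgroup mask

top = SVC(kernel='rbf').fit(X, sib)                  # stage 1: sibilant vs non-sibilant
svm_sib = SVC(kernel='rbf').fit(X[sib], phones[sib]) # stage 2, within sibilants
svm_non = SVC(kernel='rbf').fit(X[~sib], phones[~sib])

def classify_frame(x):
    x = x.reshape(1, -1)
    return (svm_sib if top.predict(x)[0] else svm_non).predict(x)[0]
```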

Journal ArticleDOI
TL;DR: Different organization measures of a Kohonen map are proposed, some of which evaluate a map without referring to the data manifold, and it is shown how these organization measures can act as fitness measures.
Abstract: Unsupervised learning schemes like the self-organizing map (SOM) have been used to classify speech sounds in an ordered manner. The SOM is able to extract the most salient features of the input signal and provides a simple way of visualizing them. The distance between two units on the map is used as an objective measure of their perceptual similarity. This paper presents a study of the evaluation of a SOM trained by a sequential learning algorithm integrating an information enrichment principle. Two complementary analyses are proposed: a quantitative analysis and a qualitative one. The SOM is used to visualize a speech database as a phenotypic map, which is then used to generate quantitative measures of the input space. In this paper, we propose different organization measures of a Kohonen map. Some of them evaluate a map without referring to the data manifold; others quantify map organization with respect to the data manifold. We also propose how these organization measures can act as fitness measures. These organization measures are evaluated on the case study of phoneme classification using the TIMIT acoustic-phonetic continuous speech corpus. The experimental results show that the proposed combined organization measures provide more significant values across map sizes.
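
For reference, a plain SOM training loop (sizes, schedules and the Gaussian neighbourhood are generic choices, not the paper's sequential learning algorithm with information enrichment); map distances between trained units can then be read as perceptual similarity.

```python
import numpy as np

def train_som(data, rows=8, cols=8, iters=2000, seed=0):
    """Plain SOM: best-matching unit plus a shrinking Gaussian
    neighbourhood pulls nearby units toward each sample."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(rows * cols, data.shape[1]))
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))         # best-matching unit
        frac = t / iters
        lr = 0.5 * (1 - frac)                               # decaying learning rate
        sigma = max(rows, cols) / 2 * (1 - frac) + 0.5      # shrinking neighbourhood
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)
    return W.reshape(rows, cols, -1)
```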