
Showing papers on "TIMIT published in 2001"


Journal ArticleDOI
TL;DR: A new training method combining genetic algorithms (GA) with the Baum–Welch algorithm is proposed to obtain an HMM with an optimised number of states and optimised model parameters.

87 citations


01 Jan 2001
TL;DR: By selecting an appropriate distance measure, an automated procedure to map phonemes from a source language (English) to a target language (Afrikaans) can be applied, with recognition results comparable to a manual mapping process undertaken by a phonetic expert.
Abstract: This paper explores an automated approach to mapping one phoneme set to another, based on the acoustic distances of the individual phonemes. The main goal of this investigation is to automate the technique for creating initial/baseline acoustic models for a new language. Using this technique, it would be possible to rapidly build speech recognition systems for a variety of languages. A subsidiary objective of this investigation is to compare different acoustic distance measures and to assess their ability to quantify the acoustic similarity between phonemes. The distance measures that were considered for this investigation are the Kullback-Leibler measure, the Bhattacharyya distance metric, the Mahalanobis measure, the Euclidean measure, the L2 metric and the Jeffreys-Matusita distance. Both the TIMIT and SUN Speech corpora were used. It was found that by selecting an appropriate distance measure, an automated procedure to map phonemes from a source language (English) to a target language (Afrikaans) can be applied, with recognition results comparable to a manual mapping process undertaken by a phonetic expert.

29 citations
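As an illustration of the distance measures compared in the paper above, here is a minimal sketch of the Bhattacharyya distance between two Gaussian phoneme models and its use for automatic phoneme mapping. The single-Gaussian-per-phoneme representation and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians,
    a common way to compare phoneme acoustic models when each is
    summarised as a single Gaussian (mean vector, covariance matrix)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mahalanobis-like term between the means
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Log-determinant term comparing the covariances
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

def map_phonemes(source_models, target_models):
    """Map each source-language phoneme to the acoustically
    nearest target-language phoneme under the chosen distance."""
    return {s: min(target_models,
                   key=lambda t: bhattacharyya_distance(
                       mu_s, cov_s, *target_models[t]))
            for s, (mu_s, cov_s) in source_models.items()}
```

The same mapping loop works with any of the other distances the paper compares (Kullback-Leibler, Mahalanobis, Euclidean) by swapping out the distance function.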


Proceedings Article
01 Jan 2001
TL;DR: The algorithm presented here is applied to plosive detection, but can easily be adapted to any class of phonemes; it uses loss-based multi-class decisions.
Abstract: This paper presents a novel algorithm for precise spotting of plosives. The algorithm is based on a pattern matching technique implemented with margin classifiers, such as support vector machines (SVMs). A special hierarchical treatment to overcome the problem of fricative and false silence detection is presented; it uses loss-based multi-class decisions. Furthermore, a method for smoothing the overall decisions by sequential linear programming is described. The proposed algorithm was tested on the TIMIT corpus and produced very high spotting accuracy. The algorithm presented here is applied to plosive detection, but can easily be adapted to any class of phonemes.

22 citations


Proceedings ArticleDOI
C. Antoniou1
07 May 2001
TL;DR: This work proposes decomposing the network into modular components, each estimating a phone posterior, and shows that using broad-class posteriors along with the phone posteriors greatly enhances acoustic modelling.
Abstract: Traditionally, neural networks such as multi-layer perceptrons handle acoustic context by increasing the dimensionality of the observation vector to include information from the neighbouring acoustic vectors on either side of the current frame. As a result, the monolithic network is trained on a high-dimensional space. The trend is to use the same fixed-size observation vector across the one network that estimates the posterior probabilities for all phones simultaneously. We propose a decomposition of the network into modular components, where each component estimates a phone posterior. The size of the observation vector is not fixed across the modularised networks, but rather accounts for the phone that each network is trained to classify. For each observation vector, we estimate very large acoustic context through broad-class posteriors. The use of the broad-class posteriors along with the phone posteriors greatly enhances acoustic modelling. We report significant improvements in phone classification and word recognition on the TIMIT corpus. Our results are also better than the best context-dependent system in the literature.

9 citations


Proceedings ArticleDOI
10 Mar 2001
TL;DR: It was found that error-free data recovery was achieved in voiced and unvoiced frames, while high bit-error rates occurred in frames containing voiced/unvoiced boundaries; modifying the phase in accordance with the data led to more successful retrieval than modifying the spectral density of the cover audio.
Abstract: This paper presents results of two methods of embedding digital audio data into another audio signal for secure communication. The data-embedded, or stego, signal is created for transmission by modifying the power spectral density or the phase spectrum of the cover audio at the perceptually masked frequencies in each frame, in accordance with the covert audio data. Embedded data in each frame is recovered from the quantized frames of the received stego signal without synchronization or reference to the original cover signal. Using utterances from the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, it was found that error-free data recovery was achieved in voiced and unvoiced frames, while high bit-error rates occurred in frames containing voiced/unvoiced boundaries. Modifying the phase in accordance with the data led to more successful retrieval than modifying the spectral density of the cover audio. In both cases, no difference was detected in perceived speech quality between the cover signal and the received stego signal.

8 citations
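The phase-spectrum embedding described above can be sketched with a toy quantisation scheme: each data bit forces the phase of a chosen FFT bin to ±π/2, and the receiver reads the bit back from the sign of the phase. The bin selection and the binary phase quantiser are illustrative assumptions; the paper additionally restricts embedding to perceptually masked frequencies, which this sketch omits.

```python
import numpy as np

def embed_bits(frame, bits, bins):
    """Embed one bit per chosen FFT bin by quantising its phase:
    bit 0 -> phase -pi/2, bit 1 -> phase +pi/2.  Magnitudes are
    preserved, so only the phase spectrum carries the data."""
    spec = np.fft.rfft(frame)
    for bit, k in zip(bits, bins):
        mag = np.abs(spec[k])
        phase = np.pi / 2 if bit else -np.pi / 2
        spec[k] = mag * np.exp(1j * phase)
    return np.fft.irfft(spec, n=len(frame))

def recover_bits(stego_frame, bins):
    """Blind recovery: read the sign of the phase at each data bin,
    with no reference to the original cover signal."""
    spec = np.fft.rfft(stego_frame)
    return [1 if np.angle(spec[k]) > 0 else 0 for k in bins]
```

Because recovery only inspects the received frame, it needs no copy of the cover audio, mirroring the blind-recovery property claimed in the abstract.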


Proceedings ArticleDOI
07 May 2001
TL;DR: A parameter estimation algorithm based on the extended Kalman filter (EKF) for the statistical coarticulatory hidden dynamic model (HDM) is presented and it is shown that the HDM is capable of generating speech vectors close to those from the corresponding real data.
Abstract: Presents a parameter estimation algorithm based on the extended Kalman filter (EKF) for the statistical coarticulatory hidden dynamic model (HDM). We show how the EKF parameter estimation algorithm unifies and simplifies the estimation of both the state and parameter vectors. Experiments based on N-best rescoring demonstrate superior performance of the (context-independent) HDM over a triphone baseline HMM in the TIMIT phonetic recognition task. We also show that the HDM is capable of generating speech vectors close to those from the corresponding real data.

7 citations


Proceedings Article
01 Jan 2001
TL;DR: It is shown that some non-parametric classifiers have considerable advantages over traditional hidden Markov models; among them, support vector machines were found the most suitable and the easiest to tune.
Abstract: This paper addresses the problem of classifying speech transition sounds. A number of non-parametric classifiers are compared, and it is shown that some have considerable advantages over traditional hidden Markov models. Among the non-parametric classifiers, support vector machines were found the most suitable and the easiest to tune. Some of the reasons for the superiority of non-parametric classifiers are discussed. The algorithm was tested on the voiced stop consonant phones extracted from the TIMIT corpus and achieved very low error rates.

6 citations


Journal ArticleDOI
TL;DR: This paper used the TIMIT corpus of spoken sentences produced by talkers from a number of distinct dialect regions in the United States, and found that several phonetic features distinguish between the dialects.
Abstract: The perception of phonological differences between regional dialects of American English by naive listeners is poorly understood. Using the TIMIT corpus of spoken sentences produced by talkers from a number of distinct dialect regions in the United States, an acoustic analysis conducted in Experiment I confirmed that several phonetic features distinguish between the dialects. In Experiment II recordings of the sentences were played back to naive listeners who were asked to categorize each talker into one of six geographical dialect regions. Results suggested that listeners are able to reliably categorize talkers into three broad dialect clusters, but have more difficulty accurately categorizing talkers into six smaller regions. Correlations between the acoustic measures and both actual dialect affiliation of the talkers and dialect categorization of the talkers by the listeners revealed that the listeners in this study were sensitive to acoustic‐phonetic features of the dialects in categorizing the talker...

5 citations


Proceedings Article
01 Jan 2001
TL;DR: The results show that while beamforming alone can suppress background noise levels, the combination of beamforming and constrained enhancement can provide as much as a 63% improvement in objective quality, suggesting a potential single comprehensive solution for in-vehicle speech systems.
Abstract: In this paper, we investigate the integration of two processing methods to improve speech quality for in-vehicle speech systems: multi-sensor beamforming and constrained iterative (Auto-LSP) speech enhancement. The intent is to establish an intelligent microphone array processing scheme in high noise environments by considering the effectiveness of a multi-sensor beamformer method and the Auto-LSP single channel speech enhancement method. The goal therefore is to design a system where the strengths of one method help compensate for any potential weaknesses of the other. The noise cancellation method is an acoustic beamformer designed and constructed using a linear microphone array. The speech enhancement method is the constrained iterative Auto-LSP approach, previously considered for single channel enhancement. After establishing the combined processing scheme, evaluations are performed using speech and acoustic noise data collected in vehicles. Noise suppression levels achieved by the beamformer are established for different road noise conditions. Quality improvement from the enhancement scheme is assessed using objective speech quality measures over a test speech corpus using TIMIT data. The results show that while beamforming alone can suppress background noise levels, the combination of beamforming and constrained enhancement can provide as much as a 63% improvement in objective quality, suggesting a potential single comprehensive solution for in-vehicle speech systems.

5 citations


Journal ArticleDOI
TL;DR: Word recognition experiments on TIMIT and NON-TIMIT with discrete and continuous-density hidden Markov models (HMMs) showed steady performance improvement in open-set testing, proving the effectiveness of the proposed adaptive frame length feature extraction scheme.
Abstract: We propose an adaptive frame speech analysis scheme that divides the speech signal into stationary and dynamic regions. Long frame analysis is used for stationary speech, and short frame analysis for dynamic speech. For computational convenience, the feature vector of a short frame is designed to be identical to that of a long frame, and two expressions are derived to represent the short frame feature vector. Word recognition experiments on TIMIT and NON-TIMIT with discrete hidden Markov models (DHMM) and continuous density HMMs (CHMM) showed that steady performance improvement could be achieved in open-set testing. On the TIMIT database, the adaptive frame length (AFL) approach raises the error reduction rate from 4.47% to 11.21% for DHMM and from 4.54% to 9.58% for CHMM. On the NON-TIMIT database, AFL raises the error reduction rate from 1.91% to 11.55% for DHMM and from 2.63% to 9.5% for CHMM. These results prove the effectiveness of the proposed adaptive frame length feature extraction scheme, especially for open-set testing, which is a practical measure of the performance of a speech recognition system.

4 citations
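A minimal sketch of the stationary/dynamic frame-selection idea above, using normalised spectral flux between consecutive short frames as the stationarity test: low flux means the spectrum is steady, so a long frame is used; high flux means the signal is changing, so a short frame is used. The flux criterion, the threshold, and the frame sizes are illustrative assumptions, not the paper's actual segmentation rule.

```python
import numpy as np

def choose_frame_lengths(signal, short=160, long=400, hop=160, thresh=0.2):
    """Label each analysis position with the frame length to use:
    `short` samples where the spectrum changes quickly (dynamic
    speech), `long` samples where it is steady (stationary speech)."""
    lengths = []
    prev_mag = None
    for start in range(0, len(signal) - short, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + short]))
        if prev_mag is None:
            flux = 0.0  # first frame: no previous spectrum to compare
        else:
            # normalised spectral flux between consecutive short frames
            flux = (np.linalg.norm(mag - prev_mag)
                    / (np.linalg.norm(prev_mag) + 1e-9))
        lengths.append(short if flux > thresh else long)
        prev_mag = mag
    return lengths
```

On a steady tone every frame is classified stationary; a transition from tone to noise triggers at least one short-frame (dynamic) decision near the boundary.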


Proceedings ArticleDOI
02 May 2001
TL;DR: Experimental evaluations based on 258 speakers of the TIMIT and NTIMIT corpora suggest that the feature mappers improve the verification performance remarkably.
Abstract: The performance of speaker verification systems is often compromised under real-world environments. For example, variations in handset characteristics could cause severe performance degradation. This paper presents a novel method to overcome this problem by using a non-linear handset mapper. Under this method, a mapper is constructed by training an elliptical basis function network using distorted speech features as inputs and the corresponding clean features as the desired outputs. During feature recuperation, clean features are recovered by feeding the distorted features to the feature mapper. The recovered features are then presented to a speaker model as if they were derived from clean speech. Experimental evaluations based on 258 speakers of the TIMIT and NTIMIT corpora suggest that the feature mappers improve the verification performance remarkably.

Dissertation
01 Jan 2001
TL;DR: This thesis describes a speech recognition system built to support spontaneous speech understanding, which achieved a word recognition accuracy of 67.6% using a task-specific bigram statistical language model and context-dependent acoustic models.
Abstract: This thesis describes a speech recognition system that was built to support spontaneous speech understanding. The system is composed of (1) a front end acoustic analyzer which computes Mel-frequency cepstral coefficients, (2) acoustic models of context-dependent phonemes (triphones), (3) a back-off bigram statistical language model, and (4) a beam search decoder based on the Viterbi algorithm. The context-dependent acoustic models resulted in 67.9% phoneme recognition accuracy on the standard TIMIT speech database. Spontaneous speech was collected using a "Wizard of Oz" simulation of a simple spatial manipulation game. Naive subjects were instructed to manipulate blocks on a computer screen in order to solve a series of geometric puzzles using only spoken commands. A hidden human operator performed actions in response to each spoken command. The speech from thirteen subjects formed the corpus for the speech recognition results reported here. Using a task-specific bigram statistical language model and context-dependent acoustic models, the system achieved a word recognition accuracy of 67.6%. The recognizer operated using a vocabulary of 523 words. The recognition task had a word perplexity of 36. Thesis Supervisor: Deb Roy. Title: Assistant Professor
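The beam search decoder in the thesis above is built on the Viterbi algorithm; a minimal sketch of the core Viterbi recursion (without the bigram language model or beam pruning the thesis adds on top) might look like this:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence given a log transition matrix
    log_A (S x S), per-frame log emission scores log_B (T x S),
    and log initial state probabilities log_pi (S,)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (prev_state, next_state)
        back[t] = np.argmax(scores, axis=0)      # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_B[t]
    # Trace the best path backwards through the backpointers
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

A full decoder would run this over triphone HMM states with MFCC-based emission scores and prune low-scoring hypotheses at each frame (the "beam").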

Proceedings ArticleDOI
01 May 2001
TL;DR: A multiple mixture segmental hidden Markov model (MMSHMM) is presented, extended from the linear probabilistic-trajectory segmental HMM, which uses multiple mixture components for the model parameters to represent both within-speaker variation and between-speaker differences.
Abstract: A multiple mixture segmental hidden Markov model (MMSHMM) is presented. This model extends the linear probabilistic-trajectory segmental HMM. Each segment is characterized by a linear trajectory with slope and mid-point parameters, and also by the residual error covariances around the trajectory, so that both extra-segmental and intra-segmental variation are represented. Instead of modeling a single distribution for each model parameter as in earlier work, we use multiple mixture components to represent the variability due to variation within each speaker as well as differences between speakers. The model is evaluated on two applications. One is a phonetic classification task on the TIMIT corpus, which shows that the MMSHMM has advantages over the conventional HMM. The other is a speaker-independent keyword spotting task on the Road Rally database. By rescoring putative events hypothesized by a primary HMM keyword spotter, the experiments show that performance is improved by distinguishing true hits from false alarms.

Proceedings ArticleDOI
19 Aug 2001
TL;DR: This paper presents an adaptation approach based on the Baum-Welch algorithm, in which all the parameters of the hidden Markov models (HMMs) are adapted using the adaptation data.
Abstract: This paper presents an adaptation approach based on the Baum-Welch algorithm. This method applies the same framework as is used for training speech recognizers with abundant training data. The Baum-Welch adaptation method adapts all the parameters of the hidden Markov models (HMMs) to the adaptation data. If a large amount of adaptation data is available, the method gradually approximates speaker-dependent models. The approach is evaluated on the phoneme recognition task on the TIMIT corpus. In the speaker adaptation experiments, a recognition rate of up to 91.48% is achieved.

Proceedings ArticleDOI
07 May 2001
TL;DR: Statistical methods for reconstructing speech at the phoneme level are used to find missing phonemes that are removed from sentences in the TIMIT corpus and the most likely candidate is selected to reconstruct the sentence.
Abstract: Statistical methods for reconstructing speech at the phoneme level are used to find missing phonemes that are removed from sentences in the TIMIT corpus. Probabilities for the occurrence of the missing phoneme(s) are generated and the most likely candidate(s) selected to reconstruct the sentence. The method includes symmetric and asymmetric 'confidence windowing' around the missing phoneme(s) for determination of the most likely candidates. The reconstruction rates for one or more phonemes missing in a sequence can exceed 85%.
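A toy version of the phoneme-reconstruction idea above: score each candidate for a one-phoneme gap using a symmetric window of bigram probabilities on either side. The add-alpha smoothing and all function names are illustrative assumptions, not the paper's method, and the paper's asymmetric windowing and multi-phoneme gaps are omitted.

```python
from collections import Counter

def train_bigrams(sequences):
    """Count phoneme unigrams and bigrams from training sequences."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    return uni, bi

def fill_gap(left, right, candidates, uni, bi, alpha=0.1):
    """Pick the candidate c maximising P(c | left) * P(right | c),
    i.e. a symmetric one-phoneme window on each side of the gap.
    Add-alpha smoothing keeps unseen bigrams from scoring zero."""
    V = len(uni)  # vocabulary size for smoothing
    def p(a, b):
        return (bi[(a, b)] + alpha) / (uni[a] + alpha * V)
    return max(candidates, key=lambda c: p(left, c) * p(c, right))
```

With enough training data the highest-scoring candidate usually matches the deleted phoneme, which is the quantity the paper's reconstruction rate measures.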

Proceedings Article
01 Jan 2001
TL;DR: The iterative maximum-likelihood procedure employed to train both parts of the model is described, and the first unsupervised adaptation and self-adaptation results for the new model are given, showing that it outperforms standard techniques when small amounts of adaptation data are available.
Abstract: In a recent paper, we described a compact, context-dependent acoustic model incorporating strong a priori knowledge and designed to support extremely rapid speaker adaptation [9]. The two parts of this “bipartite” model are: 1. A speaker-dependent, context-independent (SDCI) part with a small number of parameters, called the “eigencentroid”. 2. A speaker-independent, context-dependent (SICD) part with a large number of parameters, called the “delta trees”. In the current paper, we describe for the first time the iterative maximum-likelihood procedure employed to train both parts of the model. This paper also gives the first unsupervised adaptation and self-adaptation results for the new model, showing that it outperforms standard techniques when small amounts of adaptation data (10 sec. or less of speech) are available. Relative error rate reduction (ERR) is 12.1% for supervised adaptation and 11.2% for unsupervised adaptation on three TIMIT sentences; it is 10.4% for self-adaptation on a single TIMIT sentence. Finally, the paper analyzes the correlation between sex and the SDCI part of the model, and shows how modeling of acoustic variability is affected by the explicit separation into SD and CD components.