
Showing papers on "TIMIT published in 2003"


Journal ArticleDOI
TL;DR: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise that takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise.
Abstract: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrated significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.
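The masking-based idea above can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: the masking threshold here is a crude fixed margin below the plain subtraction estimate, and the function name and parameters are invented for the example.

```python
import numpy as np

def perceptual_spectral_subtract(noisy_mag, noise_mag, mask_margin_db=10.0):
    """Illustrative perceptually motivated spectral subtraction.

    Residual noise falling below a crude masking floor (the mild estimate
    minus a fixed dB margin) is considered inaudible and left alone;
    where the noise would exceed that floor, a more aggressive
    oversubtraction is applied instead.
    """
    base = np.maximum(noisy_mag - noise_mag, 1e-6)        # mild subtraction
    mask = base * 10 ** (-mask_margin_db / 20.0)          # crude masking floor
    aggressive = np.maximum(noisy_mag - 2.0 * noise_mag, 1e-6)
    # Use the aggressive estimate only where residual noise would be audible.
    return np.where(noise_mag > mask, aggressive, base)
```

A real implementation would derive the masking threshold per critical band from the auditory model rather than from a fixed margin.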

102 citations


Proceedings ArticleDOI
31 May 2003
TL;DR: PLASER is a multimedia tool with instant feedback designed to teach English pronunciation for high-school students of Hong Kong whose mother tongue is Cantonese Chinese, and shows that the students' pronunciation skill was improved.
Abstract: PLASER is a multimedia tool with instant feedback designed to teach English pronunciation to high-school students in Hong Kong whose mother tongue is Cantonese Chinese. The objective is to teach correct pronunciation, not to assess a student's overall pronunciation quality. Major challenges related to speech recognition technology include allowance for non-native accents, reliable and corrective feedback, and visualization of errors. PLASER employs hidden Markov models to represent position-dependent English phonemes. They are discriminatively trained using the standard American English TIMIT corpus together with a set of TIMIT utterances collected from "good" local English speakers. There are two kinds of speaking exercises: minimal-pair exercises and word exercises. In the word exercises, PLASER computes a confidence-based score for each phoneme of the given word and paints each vowel or consonant segment in the word using a novel 3-color scheme to indicate its pronunciation accuracy. PLASER was used by 900 students of grades 7 and 8 over a period of 2--3 months. About 80% of the students said that they preferred using PLASER over traditional English classes to learn pronunciation. A pronunciation test was also conducted before and after they used PLASER. The result from 210 students shows that the students' pronunciation skills improved. (The improvement is statistically significant at the 99% confidence level.)

77 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: The paper presents a novel method for speech recognition that utilizes nonlinear/chaotic signal processing techniques to extract time-domain phase space features, and conjectures that phase space and MFCC features used in combination within a classifier may yield increased accuracy for various speech recognition tasks.
Abstract: The paper presents a novel method for speech recognition by utilizing nonlinear/chaotic signal processing techniques to extract time-domain based phase space features. By exploiting the theoretical results derived in nonlinear dynamics, a processing space called a reconstructed phase space can be generated where a salient model (the natural distribution of the attractor) can be extracted for speech recognition. To discover the discriminatory power of these features, isolated phoneme classification experiments were performed using the TIMIT corpus and compared to a baseline classifier that uses MFCC (Mel frequency cepstral coefficient) features. The results demonstrate that phase space features contain substantial discriminatory power, even though MFCC features outperformed the phase space features on direct comparisons. The authors conjecture that phase space and MFCC features used in combination within a classifier may yield increased accuracy for various speech recognition tasks.
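The reconstructed phase space the paper builds on is obtained by time-delay embedding of the waveform. A minimal sketch (the embedding dimension and delay below are illustrative choices, not the paper's values):

```python
import numpy as np

def delay_embed(x, dim=3, tau=6):
    """Reconstruct a phase space from a scalar time series by time-delay
    embedding: row i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    n = len(x) - (dim - 1) * tau          # number of embedded points
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)
```

The natural distribution of the attractor mentioned in the abstract would then be estimated over the rows of this matrix.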

51 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: This work applies linear and nonlinear data-driven feature transformations to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task, finding that the four methods outperform the baseline system.
Abstract: Feature extraction is the key element when aiming at robust speech recognition. Both linear and nonlinear data-driven feature transformations are applied to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task. Transformations are based on principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA) and multilayer perceptron network based nonlinear discriminant analysis (NLDA). All four methods outperform the baseline system which consists of the standard feature representation based on MFCCs (mel-frequency cepstral coefficients) with the first-order deltas, using a mixture-of-Gaussians HMM recognizer. Further improvement is gained by forming the feature vector as a concatenation of the outputs of all four feature transformations.
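Of the four transformations compared, PCA is the simplest to sketch. A minimal, self-contained version operating on a matrix of frame-level feature vectors (illustrative only, not the paper's exact pipeline):

```python
import numpy as np

def pca_transform(X, n_components):
    """Fit PCA on feature vectors X (frames x dims) and project onto the
    top n_components principal directions."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / (len(X) - 1)             # sample covariance
    vals, vecs = np.linalg.eigh(cov)           # eigendecomposition (ascending)
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]                 # project onto top components
```

LDA and ICA replace the eigendecomposition step with class-separability and independence criteria respectively, while NLDA learns the mapping with a multilayer perceptron.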

44 citations


Journal ArticleDOI
TL;DR: Improvement of the HMM performance is expected from the introduction of a very effective non-parametric probability density function estimate: the k-nearest neighbors (k-nn) estimate.

31 citations


Journal ArticleDOI
TL;DR: This paper shows how a layered modular/ensemble neural network architecture provides good acoustic modelling in a series of experiments on the TIMIT speech corpus, and how information from a wide context can be employed within the architectural framework, providing performance equivalent to the best context-dependent acoustic modelling systems.

29 citations


Proceedings Article
01 Jan 2003
TL;DR: Two novel variants of TRAPS, developed to address some shortcomings of the TRAPS classifiers, are experimented with, and it is found that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance.
Abstract: Motivated by the temporal processing properties of human hearing, researchers have explored various methods to incorporate temporal and contextual information in ASR systems. One such approach, TempoRAl PatternS (TRAPS), takes temporal processing to the extreme and analyzes the energy pattern over long periods of time (500 ms to 1000 ms) within separate critical bands of speech. In this paper we extend the work on TRAPS by experimenting with two novel variants developed to address some shortcomings of the TRAPS classifiers. Both the Hidden Activation TRAPS (HATS) and Tonotopic MultiLayer Perceptrons (TMLP) require 84% fewer parameters than TRAPS but achieve significant phone recognition error reduction when tested on the TIMIT corpus under clean, reverberant, and several noise conditions. In addition, the TMLP performs training in a single stage and does not require critical-band-level training targets. Using these variants, we find that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance. In combination with a conventional PLP system, these TRAPS variants achieve significant additional performance improvements.
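The core TRAPS input described above — roughly one second of log energy in a single critical band, mean- and variance-normalised — can be sketched as follows (the 10 ms frame rate and the span are assumptions for the example, within the 500–1000 ms range the abstract gives):

```python
import numpy as np

def trap_vector(band_energy, center, half_span=50):
    """Extract a TRAP-style input: ~1 s of log energy in one critical band
    (half_span frames on each side of the center frame, ~10 ms per frame),
    mean- and variance-normalised."""
    seg = band_energy[center - half_span : center + half_span + 1]
    return (seg - seg.mean()) / (seg.std() + 1e-8)
```

In TRAPS proper, one such vector per critical band feeds a band-specific classifier whose outputs are merged by a second-stage network; HATS and TMLP restructure those stages.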

25 citations


Book ChapterDOI
TL;DR: This work proposes an iterative speaker pruning algorithm for speeding up the identification in the context of real-time systems by dropping out unlikely speakers as more data arrives into the processing buffer.
Abstract: Speaker identification is a computationally expensive task. In this work, we propose an iterative speaker pruning algorithm for speeding up identification in the context of real-time systems. The proposed algorithm reduces computational load by dropping out unlikely speakers as more data arrives in the processing buffer. The process is repeated until just one speaker is left in the candidate set. Care must be taken in designing the pruning heuristics so that the correct speaker is not pruned. Two variants of the pruning algorithm are presented, and simulations with the TIMIT corpus show that an error rate of 10% can be achieved in 10 seconds for 630 speakers.
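One pruning iteration of the kind described can be sketched as below; the match scores and the fixed keep-fraction are placeholders for illustration, not the paper's actual pruning heuristics:

```python
import numpy as np

def prune_speakers(scores, candidates, keep_fraction=0.5):
    """One pruning iteration: keep only the best-scoring fraction of the
    remaining candidate speakers (higher score = better match).

    `scores` maps speaker id -> accumulated match score on the audio
    buffered so far; the loop would re-score survivors as more data arrives
    and repeat until one candidate remains.
    """
    k = max(1, int(len(candidates) * keep_fraction))
    order = np.argsort([-scores[c] for c in candidates])
    return [candidates[i] for i in order[:k]]
```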

18 citations


Journal ArticleDOI
TL;DR: In analyzing the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database, this method revealed the intrinsic structures of vowels and consonants; its usefulness is demonstrated by superior classification performance on the most difficult phonemes.
Abstract: In most hidden Markov model-based automatic speech recognition systems, one of the fundamental questions is to determine the intrinsic speech feature dimensionality and the number of clusters used on the Gaussian mixture model. We analyzed mel-frequency band energies using a variational Bayesian principal component analysis method to estimate the feature dimensionality as well as the number of Gaussian mixtures by learning a maximum lower bound of the evidence instead of maximizing the likelihood function as used in conventional speech recognition systems. In analyzing the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database, our method revealed the intrinsic structures of vowels and consonants. The usefulness of this method is demonstrated in the superior classification performance for the most difficult phonemes /b/, /d/, and /g/.

17 citations


Book ChapterDOI
TL;DR: A heuristic weighted distance is introduced to linearly lift higher-order MFCC feature vector components, and two new algorithms are proposed that combine it and the partition normalized distance measure with group vector quantization to take full advantage of both directions.
Abstract: Weighted distance measures and discriminative training are two different directions for enhancing VQ-based solutions for speaker identification. In the first direction, the partition normalized distance measure successfully used normalized feature components to account for the varying importance of the LPC coefficients. In the second direction, group vector quantization sped up discriminative training by randomly selecting a group of vectors as the training unit in each learning step. This paper introduces an alternative, called the heuristic weighted distance, which linearly lifts higher-order MFCC feature vector components. Two new algorithms are then proposed that combine the heuristic weighted distance and the partition normalized distance measure with group vector quantization to take full advantage of both directions. Testing on the TIMIT and NTIMIT corpora showed that the proposed methods are superior to current VQ-based solutions and are in a comparable range to the Gaussian Mixture Model using Wavelet or MFCC features.
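A sketch of a weighted distance that linearly lifts higher-order MFCC components is shown below; the specific weighting `1 + alpha * i` is an invented illustration of the idea, not the paper's actual heuristic weights:

```python
import numpy as np

def heuristic_weighted_distance(x, code_vector, alpha=0.05):
    """Weighted Euclidean distance between an MFCC vector and a codebook
    entry, with weight 1 + alpha*i on component i so that higher-order
    coefficients contribute more (illustrative weighting only)."""
    w = 1.0 + alpha * np.arange(len(x))   # linearly increasing weights
    d = x - code_vector
    return float(np.sqrt(np.sum(w * d * d)))
```

With `alpha = 0` this reduces to the ordinary Euclidean distance used in plain VQ matching.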

17 citations


Proceedings ArticleDOI
01 Jan 2003
TL;DR: This work uses mutual information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks and shows that selecting input features according to the mutual information criterion can provide a significant increase in classification accuracy.
Abstract: Information concerning the identity of subword units such as phones cannot easily be pinpointed because it is broadly distributed in time and frequency. Continuing earlier work, we use mutual information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks, using the hand annotations of the TIMIT database as our ground truth. Since different broad phonetic classes such as vowels and stops have such different temporal characteristics, we examine mutual information separately for each class, revealing structure that was not uncovered in earlier work; further structure is revealed by aligning the time-frequency displays of each phone at the center of their hand-marked segments, rather than averaging across all possible alignments within each segment. Based on these results, we evaluate a range of vowel classifiers over the TIMIT test set and show that selecting input features according to the mutual information criterion can provide a significant increase in classification accuracy.
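The selection criterion can be sketched with a histogram-based estimate of the mutual information between one feature (e.g. the energy in a single time-frequency cell) and the class label; the binning scheme below is an assumption for the example:

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Estimate I(feature; label) in bits by discretising the feature into
    equal-width histogram bins and summing p(f,c) log2 p(f,c)/(p(f)p(c))."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f = np.digitize(feature, edges[1:-1])        # bin index per sample
    mi = 0.0
    for fv in np.unique(f):
        for c in np.unique(labels):
            p_joint = np.mean((f == fv) & (labels == c))
            if p_joint > 0:
                mi += p_joint * np.log2(p_joint / (np.mean(f == fv) * np.mean(labels == c)))
    return mi
```

Cells would then be ranked by this score and the top-ranked ones fed to the classifier.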

Journal ArticleDOI
TL;DR: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) is introduced; thanks to its adjustable root index parameter (γ), it shows promising results compared with the original TDC.
Abstract: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) is introduced. Because of the adjustable root index parameter (γ), it has some advantages over the original two-dimensional cepstrum (TDC). Experiments on isolated word recognition using the TIMIT database show promising results compared with the original TDC.
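The root-cepstrum idea — replacing the logarithmic magnitude compression of the ordinary cepstrum with a power law |X|^γ — can be sketched in one dimension (the paper's method is two-dimensional; this shows only the core compression step, with γ = 0.5 as an arbitrary example value):

```python
import numpy as np

def root_cepstrum(frame, gamma=0.5):
    """One-dimensional root cepstrum of a speech frame: the magnitude
    spectrum is compressed by |X|**gamma instead of log|X|, then inverse
    transformed back to the quefrency domain."""
    spec = np.abs(np.fft.rfft(frame))
    return np.fft.irfft(spec ** gamma, n=len(frame))
```

The adjustable γ trades off between spectral peaks and valleys; small γ approaches log-cepstrum behaviour up to scale.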

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new algorithm for speaker verification called OSCILLO, which can verify a speaker's ID without background speaker models by applying tolerance interval analysis in statistics, greatly reduces the space requirement of the system and the time for both training and verification.
Abstract: Speaker verification concerns the problem of verifying whether a given utterance has been pronounced by a claimed authorized speaker. This problem is important because an accurate speaker verification system can be applied to many security applications. We present a new algorithm for speaker verification called OSCILLO. By applying tolerance interval analysis from statistics, OSCILLO can verify a speaker's ID without background speaker models. This greatly reduces the space requirement of the system and the time for both training and verification. Experimental results show that OSCILLO can achieve error rates comparable to or better than those of the GMM-based system with background speaker models on three benchmark databases: TCC-300, TIMIT and NIST 2000.
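The tolerance-interval style of decision can be sketched as follows; the k-sigma interval below is a crude stand-in for illustration, not OSCILLO's actual tolerance-factor computation:

```python
import numpy as np

def tolerance_interval(samples, k=2.0):
    """Crude tolerance interval [mean - k*std, mean + k*std] built from
    enrolment feature statistics of the claimed speaker."""
    m, s = samples.mean(), samples.std(ddof=1)
    return m - k * s, m + k * s

def accept(test_value, interval):
    """Accept the identity claim if the test statistic falls inside the
    enrolment tolerance interval -- no background speaker model needed."""
    lo, hi = interval
    return lo <= test_value <= hi
```

Because only per-speaker interval bounds are stored, both the model footprint and the verification-time computation stay small.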

Proceedings Article
01 Jan 2003
TL;DR: The preliminary experimental results show that vowel classification by nonlinear dynamics analysis can produce results similar to those of a classifier using Mel frequency cepstral coefficient (MFCC) features.
Abstract: An approach is presented in this paper for vowel classification by analyzing the dynamics of speech production in a reconstructed phase space. The proposed approach has the ability to capture nonlinearities that may exist in speech production. Global flow reconstruction is used to generate a quantitative description of the structure and trajectory of vowel attractors in a reconstructed phase space. A distance measure is defined to quantify the dynamic similarity between phoneme attractors. Templates of the dynamics for each vowel class are selected by cluster analysis. Out-of-sample vowel phonemes are classified using a nearest neighbor classifier. Experiments are conducted on both speaker-dependent and speaker-independent vowel classification tasks using the TIMIT corpus. The preliminary experimental results show that vowel classification by nonlinear dynamics analysis can produce results similar to those of a classifier using Mel frequency cepstral coefficient (MFCC) features.

Proceedings ArticleDOI
02 Jul 2003
TL;DR: Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.
Abstract: A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: Experiments with phone classification using TIMIT and a total of 760 features indicated that the proposed method automatically discovered important information in the data; with only 25 selected features per SVM, the accuracy was higher than with a homogeneous set of 118 features based on PLP (perceptual linear prediction) coefficients.
Abstract: We investigate feature selection applied to automatic speech recognition (ASR) systems. We focus on systems based on support vector machines (SVM), which can naturally use features optimized for each classifier. We present a new method for feature selection based on the AdaBoost algorithm. This method was an order of magnitude faster than a similar one, while leading to equivalent accuracy. Experiments with phone classification using TIMIT and a total of 760 features (PLP, MFCC, Seneff's, formants, etc.) indicated that the proposed method automatically discovered important information in the data. When using only 25 selected features per SVM, the accuracy was higher than when using a homogeneous set of 118 features based on PLP (perceptual linear prediction) coefficients.

01 Jan 2003
TL;DR: In this article, preemphasis, prewhitening, and joint linear prediction of the common component of speech sources are proposed to improve blind separation of speech sources based on ADF.
Abstract: Whitening processing methods are proposed to improve the effectiveness of blind separation of speech sources based on ADF. The proposed methods include preemphasis, prewhitening, and joint linear prediction of the common component of the speech sources. The effect of ADF filter length on source separation performance was also investigated. Experimental data were generated by convolving TIMIT speech with acoustic path impulse responses measured in a real acoustic environment, where microphone-source distances were approximately 2 m and the initial target-to-interference ratio was 0 dB. The proposed methods significantly sped up the convergence rate, increased the target-to-interference ratio in the separated speech, and improved the accuracy of automatic phone recognition on the target speech. The preemphasis and prewhitening methods alone produced a large impact on system performance, and combining preemphasis with joint prediction yielded the highest phone recognition accuracy.
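Of the proposed whitening steps, preemphasis is the simplest: a first-order difference that flattens the long-term speech spectrum before adaptive decorrelation filtering. A standard sketch (the coefficient 0.97 is a conventional choice, not necessarily the paper's):

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order preemphasis y[n] = x[n] - alpha * x[n-1], a common
    whitening step that boosts high frequencies and flattens the
    spectral tilt of speech."""
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y
```

Prewhitening generalizes this with a higher-order linear prediction filter estimated from the data.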

Book ChapterDOI
08 Sep 2003
TL;DR: Among the compared techniques, the technique based on TRAP achieves the best results on clean speech, with about a 10% relative improvement over the baseline system; its advantage is also observed in the presence of mismatch between training and testing conditions.
Abstract: We investigate and compare several techniques for automatic recognition of unconstrained context-independent phoneme strings from the TIMIT and NTIMIT databases. Among the compared techniques, the technique based on TempoRAl Patterns (TRAP) achieves the best results on clean speech, with about a 10% relative improvement over the baseline system. Its advantage is also observed in the presence of mismatch between training and testing conditions. Issues such as the optimal length of temporal patterns in the TRAP technique and the effectiveness of mean and variance normalization of the patterns and of the multi-band input to the TRAP estimators are also explored.

01 Jan 2003
TL;DR: A method for reducing the training time and the number of networks required to achieve a desired performance level is developed along with another method to improve the accuracy to compensate for the expected loss resulting from the network reduction.
Abstract: AUTOMATIC SPEAKER IDENTIFICATION USING REUSABLE AND RETRAINABLE BINARY-PAIR PARTITIONED NEURAL NETWORKS Ashutosh Mishra Old Dominion University May 2003 Director: Dr. Stephen A. Zahorian This thesis presents an extension of the work previously done on speaker identification using Binary Pair Partitioned (BPP) neural networks. In the previous work, a separate network was used for each pair of speakers in the speaker population. Although the basic BPP approach did perform well and had a simple underlying algorithm, it had the obvious disadvantage of requiring an extremely large number of networks for speaker identification with large speaker populations: the number of networks to be trained is proportional to the square of the number of speakers, leading to correspondingly large training and evaluation times. In the present work, the concepts of clustered speakers and reusable binary networks are investigated. Systematic methods are explored for using a network originally trained to separate only two specific speakers to also separate the speakers of other speaker pairs. For example, it seems quite likely that a network trained to separate a particular female speaker from a particular male speaker would also reliably separate many other male speakers from many other female speakers. The focal point of the research is to develop a method for reducing the training time and the number of networks required to achieve a desired performance level. A new method of reducing the network requirement is developed, along with another method to improve accuracy to compensate for the expected loss resulting from the network reduction (compared to the BPP approach). The two methods investigated are: reusable binary-pair partitioned neural networks (RBPP) and retrained and reusable binary-pair partitioned neural networks (RRBPP).
Both methods explored and described in this thesis work very well for clean (studio-quality) speech but do not provide the desired level of performance with bandwidth-limited (telephone-quality) speech. In this thesis, a detailed description of both methods and the experimental results is provided. All experimental results reported are based on either the Texas Instruments Massachusetts Institute of Technology (TIMIT) or Nynex TIMIT (NTIMIT) databases, using 8 sentences (approximately 24 seconds) for training and up to two sentences (approximately 6 seconds) for testing. The best results obtained with TIMIT, using 102 speakers, for BPP, RBPP, and RRBPP respectively (for 2 sentences, i.e. ~6 seconds of test data) are 99.02%, 99.02%, and 99.02% of speakers correctly identified. Corresponding recognition rates for NTIMIT, again using 102 speakers, are 84.3%, 75.5%, and 77.5%. Using all 630 speakers, the accuracy rates for TIMIT are 99%, 97%, and 96%, and the accuracy rates for NTIMIT are ~72%, 48%, and 41%.