
Showing papers on "TIMIT published in 2003"


Journal ArticleDOI
TL;DR: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise that takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise.
Abstract: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrated significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.
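The masking-based idea above can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: the masking threshold here is a crude fixed margin below the plain subtraction estimate, and the function name and parameters are invented for the example.

```python
import numpy as np

def perceptual_spectral_subtract(noisy_mag, noise_mag, mask_margin_db=10.0):
    """Illustrative perceptually motivated spectral subtraction.

    Residual noise falling below a crude masking floor (the mild estimate
    minus a fixed dB margin) is considered inaudible and left alone;
    where the noise would exceed that floor, a more aggressive
    oversubtraction is applied instead.
    """
    base = np.maximum(noisy_mag - noise_mag, 1e-6)        # mild subtraction
    mask = base * 10 ** (-mask_margin_db / 20.0)          # crude masking floor
    aggressive = np.maximum(noisy_mag - 2.0 * noise_mag, 1e-6)
    # Use the aggressive estimate only where residual noise would be audible.
    return np.where(noise_mag > mask, aggressive, base)
```

A real implementation would derive the masking threshold per critical band from the auditory model rather than from a fixed margin.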

102 citations


Proceedings ArticleDOI
31 May 2003
TL;DR: PLASER is a multimedia tool with instant feedback designed to teach English pronunciation for high-school students of Hong Kong whose mother tongue is Cantonese Chinese, and shows that the students' pronunciation skill was improved.
Abstract: PLASER is a multimedia tool with instant feedback designed to teach English pronunciation to high-school students in Hong Kong whose mother tongue is Cantonese Chinese. The objective is to teach correct pronunciation, not to assess a student's overall pronunciation quality. Major challenges related to speech recognition technology include allowance for non-native accents, reliable and corrective feedback, and visualization of errors. PLASER employs hidden Markov models to represent position-dependent English phonemes. They are discriminatively trained using the standard American English TIMIT corpus together with a set of TIMIT utterances collected from "good" local English speakers. There are two kinds of speaking exercises: minimal-pair exercises and word exercises. In the word exercises, PLASER computes a confidence-based score for each phoneme of the given word and paints each vowel or consonant segment in the word using a novel 3-color scheme to indicate its pronunciation accuracy. PLASER was used by 900 students of grades 7 and 8 over a period of 2--3 months. About 80% of the students said that they preferred using PLASER over traditional English classes to learn pronunciation. A pronunciation test was also conducted before and after they used PLASER. The result from 210 students shows that the students' pronunciation skills improved. (The improvement is statistically significant at the 99% confidence level.)

77 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: The paper presents a novel method for speech recognition that utilizes nonlinear/chaotic signal processing techniques to extract time-domain phase space features, and conjectures that phase space and MFCC features used in combination within a classifier may yield increased accuracy for various speech recognition tasks.
Abstract: The paper presents a novel method for speech recognition by utilizing nonlinear/chaotic signal processing techniques to extract time-domain based phase space features. By exploiting the theoretical results derived in nonlinear dynamics, a processing space called a reconstructed phase space can be generated where a salient model (the natural distribution of the attractor) can be extracted for speech recognition. To discover the discriminatory power of these features, isolated phoneme classification experiments were performed using the TIMIT corpus and compared to a baseline classifier that uses MFCC (Mel frequency cepstral coefficient) features. The results demonstrate that phase space features contain substantial discriminatory power, even though MFCC features outperformed the phase space features on direct comparisons. The authors conjecture that phase space and MFCC features used in combination within a classifier may yield increased accuracy for various speech recognition tasks.
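The reconstructed phase space the paper builds on is obtained by time-delay embedding of the waveform. A minimal sketch (the embedding dimension and delay below are illustrative choices, not the paper's values):

```python
import numpy as np

def delay_embed(x, dim=3, tau=6):
    """Reconstruct a phase space from a scalar time series by time-delay
    embedding: row i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    n = len(x) - (dim - 1) * tau          # number of embedded points
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)
```

The natural distribution of the attractor mentioned in the abstract would then be estimated over the rows of this matrix.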

51 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: This work applies linear and nonlinear data-driven feature transformations to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task, finding that the four methods outperform the baseline system.
Abstract: Feature extraction is the key element when aiming at robust speech recognition. Both linear and nonlinear data-driven feature transformations are applied to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task. Transformations are based on principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA) and multilayer perceptron network based nonlinear discriminant analysis (NLDA). All four methods outperform the baseline system which consists of the standard feature representation based on MFCCs (mel-frequency cepstral coefficients) with the first-order deltas, using a mixture-of-Gaussians HMM recognizer. Further improvement is gained by forming the feature vector as a concatenation of the outputs of all four feature transformations.
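Of the four transformations compared, PCA is the simplest to sketch. A minimal, self-contained version operating on a matrix of frame-level feature vectors (illustrative only, not the paper's exact pipeline):

```python
import numpy as np

def pca_transform(X, n_components):
    """Fit PCA on feature vectors X (frames x dims) and project onto the
    top n_components principal directions."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / (len(X) - 1)             # sample covariance
    vals, vecs = np.linalg.eigh(cov)           # eigendecomposition (ascending)
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order]                 # project onto top components
```

LDA and ICA replace the eigendecomposition step with class-separability and independence criteria respectively, while NLDA learns the mapping with a multilayer perceptron.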

44 citations


Journal ArticleDOI
TL;DR: Improvement of the HMM performance is expected from the introduction of a very effective non-parametric probability density function estimate: the k-nearest neighbors (k-nn) estimate.

31 citations


Journal ArticleDOI
TL;DR: This paper shows how a layered modular/ensemble neural network architecture provides good acoustic modelling in a series of experiments on the TIMIT speech corpus, and how information from a wide context can be employed within the architectural framework, providing performance equivalent to the best context-dependent acoustic modelling systems.

29 citations


Proceedings Article
01 Jan 2003
TL;DR: Two novel variants of TRAPS, developed to address some shortcomings of the TRAPS classifiers, are experimented with, and it is found that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance.
Abstract: Motivated by the temporal processing properties of human hearing, researchers have explored various methods to incorporate temporal and contextual information in ASR systems. One such approach, TempoRAl PatternS (TRAPS), takes temporal processing to the extreme and analyzes the energy pattern over long periods of time (500 ms to 1000 ms) within separate critical bands of speech. In this paper we extend the work on TRAPS by experimenting with two novel variants developed to address some shortcomings of the TRAPS classifiers. Both the Hidden Activation TRAPS (HATS) and Tonotopic MultiLayer Perceptrons (TMLP) require 84% fewer parameters than TRAPS but achieve significant phone recognition error reduction when tested on the TIMIT corpus under clean, reverberant, and several noise conditions. In addition, the TMLP performs training in a single stage and does not require critical-band-level training targets. Using these variants, we find that approximately 20 discriminative temporal patterns per critical band are sufficient for good recognition performance. In combination with a conventional PLP system, these TRAPS variants achieve significant additional performance improvements.
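The core TRAPS input described above — roughly one second of log energy in a single critical band, mean- and variance-normalised — can be sketched as follows (the 10 ms frame rate and the span are assumptions for the example, within the 500–1000 ms range the abstract gives):

```python
import numpy as np

def trap_vector(band_energy, center, half_span=50):
    """Extract a TRAP-style input: ~1 s of log energy in one critical band
    (half_span frames on each side of the center frame, ~10 ms per frame),
    mean- and variance-normalised."""
    seg = band_energy[center - half_span : center + half_span + 1]
    return (seg - seg.mean()) / (seg.std() + 1e-8)
```

In TRAPS proper, one such vector per critical band feeds a band-specific classifier whose outputs are merged by a second-stage network; HATS and TMLP restructure those stages.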

25 citations


Book ChapterDOI
TL;DR: This work proposes an iterative speaker pruning algorithm for speeding up the identification in the context of real-time systems by dropping out unlikely speakers as more data arrives into the processing buffer.
Abstract: Speaker identification is a computationally expensive task. In this work, we propose an iterative speaker pruning algorithm for speeding up identification in the context of real-time systems. The proposed algorithm reduces computational load by dropping out unlikely speakers as more data arrives in the processing buffer. The process is repeated until just one speaker is left in the candidate set. Care must be taken in designing the pruning heuristics so that the correct speaker is not pruned. Two variants of the pruning algorithm are presented, and simulations with the TIMIT corpus show that an error rate of 10% can be achieved in 10 seconds for 630 speakers.
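One pruning iteration of the kind described can be sketched as below; the match scores and the fixed keep-fraction are placeholders for illustration, not the paper's actual pruning heuristics:

```python
import numpy as np

def prune_speakers(scores, candidates, keep_fraction=0.5):
    """One pruning iteration: keep only the best-scoring fraction of the
    remaining candidate speakers (higher score = better match).

    `scores` maps speaker id -> accumulated match score on the audio
    buffered so far; the loop would re-score survivors as more data arrives
    and repeat until one candidate remains.
    """
    k = max(1, int(len(candidates) * keep_fraction))
    order = np.argsort([-scores[c] for c in candidates])
    return [candidates[i] for i in order[:k]]
```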

18 citations


Journal ArticleDOI
TL;DR: In analyzing the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database, this method revealed the intrinsic structures of vowels and consonants; its usefulness is demonstrated by superior classification performance on the most difficult phonemes.
Abstract: In most hidden Markov model-based automatic speech recognition systems, one of the fundamental questions is to determine the intrinsic speech feature dimensionality and the number of clusters used on the Gaussian mixture model. We analyzed mel-frequency band energies using a variational Bayesian principal component analysis method to estimate the feature dimensionality as well as the number of Gaussian mixtures by learning a maximum lower bound of the evidence instead of maximizing the likelihood function as used in conventional speech recognition systems. In analyzing the Texas Instruments/Massachusetts Institute of Technology (TIMIT) speech database, our method revealed the intrinsic structures of vowels and consonants. The usefulness of this method is demonstrated in the superior classification performance for the most difficult phonemes /b/, /d/, and /g/.

17 citations


Book ChapterDOI
TL;DR: A heuristic weighted distance is introduced to linearly lift higher-order MFCC feature vector components, and two new algorithms are proposed that combine it and the partition normalized distance measure with group vector quantization to take full advantage of both directions.
Abstract: Weighted distance measures and discriminative training are two different directions for enhancing VQ-based solutions for speaker identification. In the first direction, the partition normalized distance measure successfully used normalized feature components to account for the varying importance of the LPC coefficients. In the second direction, group vector quantization sped up discriminative training by randomly selecting a group of vectors as the training unit in each learning step. This paper introduces an alternative, called the heuristic weighted distance, which linearly lifts higher-order MFCC feature vector components. Two new algorithms are then proposed that combine the heuristic weighted distance and the partition normalized distance measure with group vector quantization to take full advantage of both directions. Testing on the TIMIT and NTIMIT corpora showed that the proposed methods are superior to current VQ-based solutions and are in a comparable range to the Gaussian Mixture Model using Wavelet or MFCC features.
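A sketch of a weighted distance that linearly lifts higher-order MFCC components is shown below; the specific weighting `1 + alpha * i` is an invented illustration of the idea, not the paper's actual heuristic weights:

```python
import numpy as np

def heuristic_weighted_distance(x, code_vector, alpha=0.05):
    """Weighted Euclidean distance between an MFCC vector and a codebook
    entry, with weight 1 + alpha*i on component i so that higher-order
    coefficients contribute more (illustrative weighting only)."""
    w = 1.0 + alpha * np.arange(len(x))   # linearly increasing weights
    d = x - code_vector
    return float(np.sqrt(np.sum(w * d * d)))
```

With `alpha = 0` this reduces to the ordinary Euclidean distance used in plain VQ matching.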

17 citations


Proceedings ArticleDOI
01 Jan 2003
TL;DR: This work uses mutual information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks and shows that selecting input features according to the mutual information criterion can provide a significant increase in classification accuracy.
Abstract: Information concerning the identity of subword units such as phones cannot easily be pinpointed because it is broadly distributed in time and frequency. Continuing earlier work, we use mutual information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks, using the hand annotations of the TIMIT database as our ground truth. Since different broad phonetic classes such as vowels and stops have such different temporal characteristics, we examine mutual information separately for each class, revealing structure that was not uncovered in earlier work; further structure is revealed by aligning the time-frequency displays of each phone at the center of their hand-marked segments, rather than averaging across all possible alignments within each segment. Based on these results, we evaluate a range of vowel classifiers over the TIMIT test set and show that selecting input features according to the mutual information criterion can provide a significant increase in classification accuracy.
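The selection criterion can be sketched with a histogram-based estimate of the mutual information between one feature (e.g. the energy in a single time-frequency cell) and the class label; the binning scheme below is an assumption for the example:

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Estimate I(feature; label) in bits by discretising the feature into
    equal-width histogram bins and summing p(f,c) log2 p(f,c)/(p(f)p(c))."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f = np.digitize(feature, edges[1:-1])        # bin index per sample
    mi = 0.0
    for fv in np.unique(f):
        for c in np.unique(labels):
            p_joint = np.mean((f == fv) & (labels == c))
            if p_joint > 0:
                mi += p_joint * np.log2(p_joint / (np.mean(f == fv) * np.mean(labels == c)))
    return mi
```

Cells would then be ranked by this score and the top-ranked ones fed to the classifier.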

Journal ArticleDOI
TL;DR: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) is introduced; thanks to its adjustable root index parameter (γ), it shows promising results compared with the original TDC.
Abstract: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) is introduced. Because of the adjustable root index parameter (γ), it has some advantages over the original two-dimensional cepstrum (TDC). Experiments on isolated word recognition using the TIMIT database show promising results compared with the original TDC.
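The root-cepstrum idea — replacing the logarithmic magnitude compression of the ordinary cepstrum with a power law |X|^γ — can be sketched in one dimension (the paper's method is two-dimensional; this shows only the core compression step, with γ = 0.5 as an arbitrary example value):

```python
import numpy as np

def root_cepstrum(frame, gamma=0.5):
    """One-dimensional root cepstrum of a speech frame: the magnitude
    spectrum is compressed by |X|**gamma instead of log|X|, then inverse
    transformed back to the quefrency domain."""
    spec = np.abs(np.fft.rfft(frame))
    return np.fft.irfft(spec ** gamma, n=len(frame))
```

The adjustable γ trades off between spectral peaks and valleys; small γ approaches log-cepstrum behaviour up to scale.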

Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new algorithm for speaker verification called OSCILLO, which can verify a speaker's ID without background speaker models by applying tolerance interval analysis in statistics, greatly reduces the space requirement of the system and the time for both training and verification.
Abstract: Speaker verification concerns the problem of verifying whether a given utterance has been pronounced by a claimed authorized speaker. This problem is important because an accurate speaker verification system can be applied to many security applications. We present a new algorithm for speaker verification called OSCILLO. By applying tolerance interval analysis from statistics, OSCILLO can verify a speaker's ID without background speaker models. This greatly reduces the space requirement of the system and the time for both training and verification. Experimental results show that OSCILLO can achieve error rates comparable to or better than those of the GMM-based system with background speaker models on three benchmark databases: TCC-300, TIMIT and NIST 2000.
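The tolerance-interval style of decision can be sketched as follows; the k-sigma interval below is a crude stand-in for illustration, not OSCILLO's actual tolerance-factor computation:

```python
import numpy as np

def tolerance_interval(samples, k=2.0):
    """Crude tolerance interval [mean - k*std, mean + k*std] built from
    enrolment feature statistics of the claimed speaker."""
    m, s = samples.mean(), samples.std(ddof=1)
    return m - k * s, m + k * s

def accept(test_value, interval):
    """Accept the identity claim if the test statistic falls inside the
    enrolment tolerance interval -- no background speaker model needed."""
    lo, hi = interval
    return lo <= test_value <= hi
```

Because only per-speaker interval bounds are stored, both the model footprint and the verification-time computation stay small.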

Proceedings Article
01 Jan 2003
TL;DR: The preliminary experimental results show that vowel classification by nonlinear dynamics analysis can produce results similar to those of a classifier using Mel frequency cepstral coefficient (MFCC) features.
Abstract: An approach is presented in this paper for vowel classification by analyzing the dynamics of speech production in a reconstructed phase space. The proposed approach has the ability to capture nonlinearities that may exist in speech production. Global flow reconstruction is used to generate a quantitative description of the structure and trajectory of vowel attractors in a reconstructed phase space. A distance measure is defined to quantify the dynamic similarity between phoneme attractors. Templates of the dynamics for each vowel class are selected by cluster analysis. Out-of-sample vowel phonemes are classified using a nearest neighbor classifier. Experiments are conducted on both speaker-dependent and speaker-independent vowel classification tasks using the TIMIT corpus. The preliminary experimental results show that vowel classification by nonlinear dynamics analysis can produce results similar to those of a classifier using Mel frequency cepstral coefficient (MFCC) features.

Proceedings ArticleDOI
02 Jul 2003
TL;DR: Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.
Abstract: A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: Experiments with phone classification using TIMIT and a total of 760 features indicated that the proposed method automatically discovered important information in the data; with only 25 selected features per SVM, the accuracy was higher than with a homogeneous set of 118 features based on PLP (perceptual linear prediction) coefficients.
Abstract: We investigate feature selection applied to automatic speech recognition (ASR) systems. We focus on systems based on support vector machines (SVM), which can naturally use features optimized for each classifier. We present a new method for feature selection based on the AdaBoost algorithm. This method was an order of magnitude faster than a similar one, while leading to equivalent accuracy. Experiments with phone classification using TIMIT and a total of 760 features (PLP, MFCC, Seneff's, formants, etc.) indicated that the proposed method automatically discovered important information in the data. When using only 25 selected features per SVM, the accuracy was higher than when using a homogeneous set of 118 features based on PLP (perceptual linear prediction) coefficients.

01 Jan 2003
TL;DR: In this article, preemphasis, prewhitening, and joint linear prediction of the common component of speech sources are proposed to improve blind separation of speech sources based on ADF.
Abstract: Whitening processing methods are proposed to improve the effectiveness of blind separation of speech sources based on ADF. The proposed methods include preemphasis, prewhitening, and joint linear prediction of the common component of the speech sources. The effect of ADF filter length on source separation performance was also investigated. Experimental data were generated by convolving TIMIT speech with acoustic path impulse responses measured in a real acoustic environment, where microphone-source distances were approximately 2 m and the initial target-to-interference ratio was 0 dB. The proposed methods significantly sped up the convergence rate, increased the target-to-interference ratio in the separated speech, and improved the accuracy of automatic phone recognition on the target speech. The preemphasis and prewhitening methods alone produced a large impact on system performance, and combining preemphasis with joint prediction yielded the highest phone recognition accuracy.
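Of the proposed whitening steps, preemphasis is the simplest: a first-order difference that flattens the long-term speech spectrum before adaptive decorrelation filtering. A standard sketch (the coefficient 0.97 is a conventional choice, not necessarily the paper's):

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order preemphasis y[n] = x[n] - alpha * x[n-1], a common
    whitening step that boosts high frequencies and flattens the
    spectral tilt of speech."""
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y
```

Prewhitening generalizes this with a higher-order linear prediction filter estimated from the data.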

Book ChapterDOI
08 Sep 2003
TL;DR: Among the compared techniques, the technique based on TRAP achieves the best results on clean speech, with about a 10% relative improvement over the baseline system; its advantage is also observed in the presence of mismatch between training and testing conditions.
Abstract: We investigate and compare several techniques for automatic recognition of unconstrained context-independent phoneme strings from the TIMIT and NTIMIT databases. Among the compared techniques, the technique based on TempoRAl Patterns (TRAP) achieves the best results on clean speech, with about a 10% relative improvement over the baseline system. Its advantage is also observed in the presence of mismatch between training and testing conditions. Issues such as the optimal length of temporal patterns in the TRAP technique and the effectiveness of mean and variance normalization of the patterns and of the multi-band input to the TRAP estimators are also explored.

01 Jan 2003
TL;DR: A method for reducing the training time and the number of networks required to achieve a desired performance level is developed along with another method to improve the accuracy to compensate for the expected loss resulting from the network reduction.
Abstract: AUTOMATIC SPEAKER IDENTIFICATION USING REUSABLE AND RETRAINABLE BINARY-PAIR PARTITIONED NEURAL NETWORKS Ashutosh Mishra Old Dominion University May 2003 Director: Dr. Stephen A. Zahorian This thesis presents an extension of the work previously done on speaker identification using Binary Pair Partitioned (BPP) neural networks. In the previous work, a separate network was used for each pair of speakers in the speaker population. Although the basic BPP approach did perform well and had a simple underlying algorithm, it had the obvious disadvantage of requiring an extremely large number of networks for speaker identification with large speaker populations: the number of networks to be trained is proportional to the square of the number of speakers, leading to correspondingly large training and evaluation times. In the present work, the concepts of clustered speakers and reusable binary networks are investigated. Systematic methods are explored for using a network originally trained to separate only two specific speakers to also separate the speakers of other speaker pairs. For example, it seems quite likely that a network trained to separate a particular female speaker from a particular male speaker would also reliably separate many other male speakers from many other female speakers. The focal point of the research is to develop a method for reducing the training time and the number of networks required to achieve a desired performance level. A new method of reducing the network requirement is developed, along with another method to improve accuracy to compensate for the expected loss resulting from the network reduction (compared to the BPP approach). The two methods investigated are: reusable binary-pair partitioned neural networks (RBPP) and retrained and reusable binary-pair partitioned neural networks (RRBPP).
Both methods explored and described in this thesis work very well for clean (studio-quality) speech but do not provide the desired level of performance with bandwidth-limited (telephone-quality) speech. In this thesis, a detailed description of both methods and the experimental results is provided. All experimental results reported are based on either the Texas Instruments Massachusetts Institute of Technology (TIMIT) or Nynex TIMIT (NTIMIT) databases, using 8 sentences (approximately 24 seconds) for training and up to two sentences (approximately 6 seconds) for testing. The best results obtained with TIMIT, using 102 speakers, for BPP, RBPP, and RRBPP respectively (for 2 sentences, i.e. ~6 seconds of test data) are 99.02%, 99.02%, and 99.02% of speakers correctly identified. Corresponding recognition rates for NTIMIT, again using 102 speakers, are 84.3%, 75.5%, and 77.5%. Using all 630 speakers, the accuracy rates for TIMIT are 99%, 97%, and 96%, and the accuracy rates for NTIMIT are ~72%, 48%, and 41%.