Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
25 Mar 2012
TL;DR: This paper introduces a two-phase segmentation process: first, forced alignment is performed using an HMM-GMM model; the resulting segmentation is then locally refined using an SVM-based boundary model.
Abstract: Phonetic segmentation is an important step in the development of a concatenative TTS voice. This paper introduces a segmentation process consisting of two phases. First, forced alignment is performed using an HMM-GMM model. The resulting segmentation is then locally refined using an SVM-based boundary model. Both models are derived from multi-speaker data using a speaker-adaptive training procedure. Evaluation results are obtained on the TIMIT corpus and on a proprietary single-speaker TTS corpus.
10 citations
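The second phase of the paper above (SVM-based local boundary refinement) can be sketched as follows. Everything here is a toy stand-in: synthetic 4-dimensional frames replace real acoustic features, and a single hand-placed coarse boundary replaces the HMM-GMM forced alignment; only the refinement step is illustrated.

```python
# Hedged sketch: a coarse phone boundary from forced alignment is refined by
# an SVM that scores candidate boundary positions in a small local window.
# Feature extraction and the HMM-GMM aligner are mocked with synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic "frames": a spectral jump at the true boundary, index 50.
true_boundary = 50
frames = np.concatenate([rng.normal(0.0, 0.1, (true_boundary, 4)),
                         rng.normal(1.0, 0.1, (100 - true_boundary, 4))])

def boundary_features(frames, t, ctx=2):
    """Concatenate the frames around candidate boundary t (the SVM's input)."""
    return frames[t - ctx:t + ctx].ravel()

# Train the boundary model: positive at the true boundary, negatives elsewhere.
X = [boundary_features(frames, t) for t in range(10, 90)]
y = [1 if t == true_boundary else 0 for t in range(10, 90)]
svm = SVC(kernel="rbf").fit(X, y)

# Refinement: the forced-alignment boundary is (deliberately) 3 frames early;
# pick the candidate in a +/-5 frame window with the highest SVM score.
coarse = 47
window = list(range(coarse - 5, coarse + 6))
scores = [svm.decision_function([boundary_features(frames, t)])[0] for t in window]
refined = window[int(np.argmax(scores))]
print(refined)
```

In the real system the boundary model is trained on many speakers with speaker-adaptive training; here a single synthetic utterance stands in for that data.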
07 May 2001
TL;DR: This work proposes decomposing the network into modular components, each estimating a phone posterior, and shows that using broad-class posteriors alongside the phone posteriors greatly enhances acoustic modelling.
Abstract: Traditionally, neural networks such as multi-layer perceptrons handle acoustic context by increasing the dimensionality of the observation vector to include information from the neighbouring acoustic vectors on either side of the current frame. As a result, the monolithic network is trained on a high-dimensional space. The trend is to use the same fixed-size observation vector across a single network that estimates the posterior probabilities for all phones simultaneously. We propose a decomposition of the network into modular components, where each component estimates a phone posterior. The size of the observation vector is not fixed across the modularised networks, but rather reflects the phone that each network is trained to classify. For each observation vector, we estimate very large acoustic context through broad-class posteriors. Using the broad-class posteriors along with the phone posteriors greatly enhances acoustic modelling. We report significant improvements in phone classification and word recognition on the TIMIT corpus. Our results are also better than the best context-dependent system in the literature.
9 citations
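The modular idea above (one binary module per phone, each with its own context width) can be sketched with toy data. Logistic regressions stand in for the per-phone networks, and the two-phone set with its per-phone context radii is entirely made up for illustration.

```python
# Hedged sketch of modular phone-posterior estimation: one binary classifier
# per phone, each seeing a different amount of acoustic context, with the
# per-module outputs renormalised into a posterior distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
frames_per_class = 200
phones = {"aa": 1, "s": 3}   # hypothetical phone -> context radius (frames)

# Toy 4-dim frames: class "aa" centred at 0, class "s" centred at 1.
data = {"aa": rng.normal(0.0, 0.3, (frames_per_class, 4)),
        "s":  rng.normal(1.0, 0.3, (frames_per_class, 4))}

def with_context(x, radius):
    """Stack each frame with its +/-radius neighbours (edges clamped)."""
    idx = np.clip(np.arange(len(x))[:, None] + np.arange(-radius, radius + 1),
                  0, len(x) - 1)
    return x[idx].reshape(len(x), -1)

# Train one binary module per phone, on that phone's own context width.
modules = {}
for p, r in phones.items():
    X = np.vstack([with_context(data[q], r) for q in phones])
    y = np.concatenate([np.full(frames_per_class, int(q == p)) for q in phones])
    modules[p] = (LogisticRegression(max_iter=1000).fit(X, y), r)

def phone_posteriors(x):
    """Query every module at its own context size, then renormalise."""
    scores = {p: m.predict_proba(with_context(x, r))[:, 1]
              for p, (m, r) in modules.items()}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

post = phone_posteriors(data["s"][:10])
print(post["s"].mean())
```

The design point the paper makes survives even in this sketch: because each module is trained separately, nothing forces the modules to share an input dimensionality.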
01 Jan 2002
TL;DR: A novel approach to integrating formant frequency and conventional MFCC data in phone recognition experiments on TIMIT; by exploiting the relationship between formant frequencies and vocal tract geometry, it reduces the error rate by 6.1% relative to a conventional representation alone.
Abstract: This paper presents a novel approach to integration of formant frequency and conventional MFCC data in phone recognition experiments on TIMIT. Naive use of formant data introduces classification errors if formant frequency estimates are poor, resulting in a net drop in performance. However, by exploiting a measure of confidence in the formant frequency estimates, formant data can contribute to classification in parts of a speech signal where it is reliable, and be replaced by conventional MFCC data when it is not. In this way an improvement of 4.7% is achieved. Moreover, by exploiting the relationship between formant frequencies and vocal tract geometry, simple formant-based vocal tract length normalisation reduces the error rate by 6.1% relative to a conventional representation alone.
9 citations
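The confidence-gated idea in the abstract can be sketched in a few lines. This is a simplification of the paper's scheme: here both streams have the same dimensionality and a single hard threshold is used, where a real system would combine features of different sizes and could blend rather than switch. All values and the threshold are illustrative.

```python
# Hedged sketch of confidence-gated feature fusion: formant features replace
# MFCCs only in frames where the formant tracker's confidence is high.
import numpy as np

def fuse(mfcc, formant, confidence, threshold=0.7):
    """Per-frame selection between two feature streams.

    mfcc, formant : (T, D) arrays of per-frame features (same D for simplicity)
    confidence    : (T,) formant-tracker confidence in [0, 1]
    """
    use_formant = confidence[:, None] >= threshold
    return np.where(use_formant, formant, mfcc)

# Tiny worked example: frames 0 and 2 have reliable formant estimates.
mfcc    = np.zeros((3, 2))
formant = np.ones((3, 2))
conf    = np.array([0.9, 0.3, 0.8])
fused = fuse(mfcc, formant, conf)
print(fused)
# frames 0 and 2 come from the formant stream, frame 1 from the MFCC stream
```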
29 May 2017
TL;DR: A speech biometric I-vector with a low, fixed dimension of 100 is used to identify speakers, showing an identification-rate improvement over the classical Gaussian Mixture Model-Universal Background Model (GMM-UBM) with a Maximum Likelihood (ML) classifier system.
Abstract: Physiological and behavioural human characteristics are exploited in biometrics, where measurements of an individual are matched either one-to-one (authentication) or one-from-N (identification). In this paper, we exploit a speech biometric I-vector with a low, fixed dimension of 100 to identify speakers. The main structure of the system consists of an I-vector with three fusion methods. It has low complexity and is efficient due to its use of an Extreme Learning Machine (ELM) classifier. The system is evaluated with 120 speakers from dialect regions one and four of both the TIMIT and NTIMIT databases, in order to provide a fair comparison with our previous study based on the traditional Gaussian Mixture Model-Universal Background Model (GMM-UBM) with a Maximum Likelihood (ML) classifier system. The system shows an identification rate improvement compared with the classical GMM-UBM.
9 citations
03 Oct 1996
TL;DR: The goal is to mimic the resolution properties of the human auditory system, but using a computationally efficient FFT-based front end rather than a more complex auditory model.
Abstract: The authors present an approach for efficiently computing a compact temporal/spectral feature set for representing a segment of speech, with effective resolution depending on both frequency and time position within the segment. The goal is to mimic the resolution properties of the human auditory system, but using a computationally efficient FFT-based front end rather than a more complex auditory model. In particular they apply both frequency and time "warping" to FFT spectra to obtain good frequency resolution at low frequencies and good time resolution at high frequencies. Time resolution is also varied so that the center of the segment is better represented than the endpoints. The resolution can be varied by the selection of "warping" functions controlled using a small number of parameters. The method was experimentally verified for the classification of six stops extracted from the TIMIT continuous speech database. The best classification rate obtained was 81.2% for test data using 50 features computed with the method presented.
9 citations
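The warping scheme described above can be sketched by resampling an FFT magnitude spectrogram on two warped axes: a mel-like frequency axis that is dense at low frequencies, and a time axis that over-samples the segment centre relative to the endpoints. The specific warping functions below are common stand-ins, not the parameterised functions of the paper, and the segment is random toy data.

```python
# Hedged sketch of frequency/time "warping" applied to FFT spectra to get a
# compact feature set with non-uniform resolution.
import numpy as np

def warp_axis(n_out, n_in, warp):
    """Map n_out uniformly spaced warped positions back to input indices."""
    u = np.linspace(0.0, 1.0, n_out)
    return np.clip((warp(u) * (n_in - 1)).round().astype(int), 0, n_in - 1)

# Frequency warp: grows slowly at first, so the output samples crowd into the
# low-frequency bins (mel-like behaviour).
freq_warp = lambda u: np.expm1(np.log(11.0) * u) / 10.0
# Time warp: flat around u=0.5, so the output samples crowd around the centre.
time_warp = lambda u: 0.5 + 0.5 * np.sign(2 * u - 1) * np.abs(2 * u - 1) ** 1.5

# Toy segment: 40 frames x 64 FFT magnitude bins.
seg = np.abs(np.random.default_rng(3).normal(size=(40, 64)))
fi = warp_axis(16, 64, freq_warp)   # 16 warped frequency samples
ti = warp_axis(10, 40, time_warp)   # 10 warped time samples
features = seg[np.ix_(ti, fi)]
print(features.shape)               # compact (10, 16) feature set

# The frequency warp spends most of its 16 samples on the lower 32 bins:
print((fi < 32).sum())
```

Changing the constants inside the two lambdas plays the role of the paper's small set of warping-control parameters: a steeper frequency warp trades high-frequency detail for low-frequency detail, and a larger time-warp exponent concentrates resolution more tightly at the segment centre.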