Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: Proposes the complex signal approximation (cSA), which operates in the complex domain to utilize the phase information of the desired speech signal and improve separation performance.
Abstract: In recent research, deep neural network (DNN) has been used to solve the monaural source separation problem. According to the training objectives, DNN-based monaural speech separation is categorized into three aspects, namely masking, mapping, and signal approximation based techniques. However, the performance of the traditional methods is not robust due to variations in real-world environments. Moreover, vanilla DNN-based methods cannot fully exploit temporal information. Therefore, in this paper, the long short-term memory (LSTM) neural network is applied to exploit long-term speech contexts. We then propose the complex signal approximation (cSA), which operates in the complex domain to utilize the phase information of the desired speech signal and improve the separation performance. The IEEE and the TIMIT corpora are used to generate mixtures with noise and speech interferences to evaluate the efficacy of the proposed method. The experimental results demonstrate the advantages of the proposed cSA-based LSTM recurrent neural network method in terms of different objective performance measures.

29 citations
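As a rough illustration of the cSA idea described in the abstract above, the sketch below (assuming PyTorch; layer sizes, STFT shapes, and the loss form are illustrative assumptions, not the paper's configuration) has an LSTM predict the real and imaginary parts of a complex mask, applies that mask to the mixture's complex STFT, and penalizes the difference from the clean complex spectrum, so phase errors contribute to the loss.

```python
# Minimal sketch of complex signal approximation (cSA) with an LSTM,
# assuming PyTorch; layer sizes and STFT settings are illustrative only.
import torch
import torch.nn as nn

class CSALstm(nn.Module):
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        # Two heads: real and imaginary parts of a complex ratio mask.
        self.mask_re = nn.Linear(hidden, n_freq)
        self.mask_im = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag):          # mix_mag: (batch, frames, n_freq)
        h, _ = self.lstm(mix_mag)
        return self.mask_re(h), self.mask_im(h)

def csa_loss(model, mix_stft, clean_stft):
    """Complex signal approximation: compare the masked mixture with the
    clean complex spectrum, so phase errors are penalized as well."""
    m_re, m_im = model(mix_stft.abs())
    est = torch.complex(m_re, m_im) * mix_stft          # complex masking
    return (est - clean_stft).abs().pow(2).mean()

# Toy usage with random tensors standing in for STFTs (batch=4, frames=100, bins=257).
if __name__ == "__main__":
    model = CSALstm()
    mix = torch.randn(4, 100, 257, dtype=torch.cfloat)
    clean = torch.randn(4, 100, 257, dtype=torch.cfloat)
    loss = csa_loss(model, mix, clean)
    loss.backward()
```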

Journal ArticleDOI
TL;DR: This paper presents an artificial neural network (ANN) for speaker-independent isolated word speech recognition built from three subnets in concatenation, the last of which is a multilayer perceptron (MLP); the architectures of the subnets are described and the associated adaptive learning algorithms are derived.
Abstract: This paper presents an artificial neural network (ANN) for speaker-independent isolated word speech recognition. The network consists of three subnets in concatenation. The static information within one frame of speech signal is processed in the probabilistic mapping subnet that converts an input vector of acoustic features into a probability vector whose components are estimated probabilities of the feature vector belonging to the phonetic classes that constitute the words in the vocabulary. The dynamics capturing subnet computes the first-order cross correlation between the components of the probability vectors to serve as the discriminative feature derived from the interframe temporal information of the speech signal. These dynamic features are passed for decision-making to the classification subnet, which is a multilayer perceptron (MLP). The architectures of these three subnets are described, and the associated adaptive learning algorithms are derived. The recognition results for a subset of the DARPA TIMIT speech database are reported. The correct recognition rate of the proposed ANN system is 95.5%, whereas that of the best of continuous hidden Markov model (HMM)-based systems is only 91.0%.

29 citations
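The dynamics-capturing step of the three-subnet design above can be sketched as follows. This is a hypothetical NumPy reading of "first-order cross correlation between the components of the probability vectors" (averaging outer products of adjacent frame pairs), not the authors' implementation; the frame count and number of phonetic classes are made up.

```python
# Illustrative sketch (not the paper's code) of the dynamics-capturing step:
# first-order cross correlation between components of per-frame probability
# vectors, flattened into a fixed-length feature for an MLP classifier.
import numpy as np

def dynamic_features(prob_frames: np.ndarray) -> np.ndarray:
    """prob_frames: (T, C) array, each row a probability vector over C
    phonetic classes. Returns the C*C cross-correlation feature averaged
    over adjacent frame pairs (one common reading of 'first-order')."""
    T, C = prob_frames.shape
    corr = np.zeros((C, C))
    for t in range(T - 1):
        corr += np.outer(prob_frames[t], prob_frames[t + 1])
    return (corr / max(T - 1, 1)).ravel()

# Example: 20 frames over 10 phonetic classes -> a 100-dimensional dynamic feature.
probs = np.random.dirichlet(np.ones(10), size=20)
features = dynamic_features(probs)
print(features.shape)   # (100,)
```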

Proceedings ArticleDOI
15 Mar 1999
TL;DR: Neural network based adaptation methods are applied to telephone speech recognition, and a new unsupervised model adaptation method is proposed that does not require transcriptions and can be used with the neural networks.
Abstract: The performance of well-trained speech recognizers using high quality full bandwidth speech data is usually degraded when used in real world environments. In particular, telephone speech recognition is extremely difficult due to the limited bandwidth of the transmission channels. In this paper, neural network based adaptation methods are applied to telephone speech recognition and a new unsupervised model adaptation method is proposed. The advantage of the neural network based approach is that the retraining of speech recognizers for telephone speech is avoided. Furthermore, because the multi-layer neural network is able to compute nonlinear functions, it can accommodate the nonlinear mapping between full bandwidth speech and telephone speech. The new unsupervised model adaptation method does not require transcriptions and can be used with the neural networks. Experimental results on the TIMIT/NTIMIT corpora show that the performance of the proposed methods is comparable to that of recognizers retrained on telephone speech.

29 citations
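A minimal sketch of the feature-mapping idea in the entry above, assuming PyTorch and assuming 13-dimensional cepstral features: a small MLP learns a nonlinear mapping from telephone-band features toward full-bandwidth features, so the original recognizer need not be retrained. The network size, feature choice, and training loop below are illustrative, not the paper's configuration.

```python
# Sketch (assumed setup, not the paper's code): an MLP maps telephone-band
# features to full-bandwidth features learned from parallel frames, e.g. a
# TIMIT utterance and its NTIMIT telephone counterpart.
import torch
import torch.nn as nn

mapper = nn.Sequential(
    nn.Linear(13, 128), nn.Tanh(),      # 13-dim cepstral features (assumption)
    nn.Linear(128, 13),
)

opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(tel_feats, wide_feats):
    """tel_feats / wide_feats: parallel (batch, 13) feature frames."""
    opt.zero_grad()
    loss = loss_fn(mapper(tel_feats), wide_feats)
    loss.backward()
    opt.step()
    return loss.item()

# One toy step with random stand-in features.
print(train_step(torch.randn(32, 13), torch.randn(32, 13)))
```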

Journal ArticleDOI
TL;DR: In this article, a 2D analysis framework using 2D transformations of the time-frequency space is proposed to obtain an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency.
Abstract: This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological studies implicating the use of pitch dynamics in speech by humans. We develop and assess signal processing schemes aimed at exploiting temporal change of pitch to address the high-pitch formant frequency estimation problem. Specifically, we propose a 2-D analysis framework using 2-D transformations of the time-frequency space. In one approach, we project changing spectral harmonics over time to a 1-D function of frequency. In a second approach, we draw upon previous work of Quatieri and Ezzat, with similarities to the auditory modeling efforts of Chi, where localized 2-D Fourier transforms of the time-frequency space provide improved source-filter separation when pitch is changing. Our methods show quantitative improvements for synthesized vowels with stationary formant structure in comparison to traditional and homomorphic linear prediction. We also demonstrate the feasibility of applying our methods on stationary vowel regions of natural speech spoken by high-pitch females of the TIMIT corpus. Finally, we show improvements afforded by the proposed analysis framework in formant tracking on examples of stationary and time-varying formant structure.

29 citations
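The localized 2-D transform idea in the abstract above can be pictured with the sketch below (NumPy/SciPy): compute a narrowband log-spectrogram, then take windowed 2-D FFTs of small time-frequency patches. The STFT settings, patch sizes, and windows are assumptions for illustration, not the parameters used in the paper.

```python
# Sketch of localized 2-D Fourier analysis of the time-frequency space;
# analysis parameters are illustrative assumptions.
import numpy as np
from scipy.signal import stft, get_window

def localized_2d_transforms(x, fs, patch=(40, 40), hop=(20, 20)):
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=480)   # narrowband analysis
    S = np.log(np.abs(Z) + 1e-8)                          # log-spectrogram
    wf = get_window("hann", patch[0])[:, None]            # frequency window
    wt = get_window("hann", patch[1])[None, :]            # time window
    patches = []
    for f0 in range(0, S.shape[0] - patch[0], hop[0]):
        for t0 in range(0, S.shape[1] - patch[1], hop[1]):
            block = S[f0:f0 + patch[0], t0:t0 + patch[1]] * wf * wt
            patches.append(np.fft.fftshift(np.fft.fft2(block)))
    return patches

# Toy usage on one second of noise standing in for a vowel segment.
patches = localized_2d_transforms(np.random.randn(16000), fs=16000)
print(len(patches), patches[0].shape)
```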

01 Jan 2001
TL;DR: By selecting an appropriate distance measure, an automated procedure to map phonemes from a source language (English) to a target language (Afrikaans) can be applied, with recognition results comparable to a manual mapping process undertaken by a phonetic expert.
Abstract: This paper explores an automated approach to mapping one phoneme set to another, based on the acoustic distances of the individual phonemes. The main goal of this investigation is to automate the technique for creating initial/baseline acoustic models for a new language. Using this technique, it would be possible to rapidly build speech recognition systems for a variety of languages. A subsidiary objective of this investigation is to compare different acoustic distance measures and to assess their ability to quantify the acoustic similarity between phonemes. The distance measures that were considered for this investigation are the Kullback-Leibler measure, the Bhattacharyya distance metric, the Mahalanobis measure, the Euclidean measure, the L2 metric and the Jeffreys-Matusita distance. Both the TIMIT and SUN Speech corpora were used. It was found that by selecting an appropriate distance measure, an automated procedure to map phonemes from a source language (English) to a target language (Afrikaans) can be applied, with recognition results comparable to a manual mapping process undertaken by a phonetic expert.

29 citations
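One way to picture the distance-based mapping described above is the sketch below: model each phoneme as a Gaussian over its acoustic features and assign every target-language phoneme to the nearest source-language phoneme under the Bhattacharyya distance. This is a hypothetical NumPy illustration; the feature dimensionality, Gaussian modeling, and phoneme labels are assumed, and the paper also evaluates several other distance measures.

```python
# Illustrative sketch (not the paper's code) of distance-based phoneme mapping
# using the Bhattacharyya distance between Gaussian phoneme models.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def map_phonemes(source_models, target_models):
    """source_models/target_models: dict name -> (mean vector, covariance)."""
    mapping = {}
    for t_name, (t_mu, t_cov) in target_models.items():
        dists = {s: bhattacharyya(t_mu, t_cov, s_mu, s_cov)
                 for s, (s_mu, s_cov) in source_models.items()}
        mapping[t_name] = min(dists, key=dists.get)
    return mapping

# Toy usage with random 13-dimensional Gaussian phoneme models.
rng = np.random.default_rng(0)
def random_model(dim=13):
    return rng.normal(size=dim), np.diag(rng.uniform(0.5, 1.5, dim))

eng = {p: random_model() for p in ["aa", "iy", "s", "t"]}   # source (English)
afr = {p: random_model() for p in ["a", "i", "s"]}          # target (Afrikaans)
print(map_phonemes(eng, afr))
```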


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95