
TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: The incorporation of spectral voicing information of the speech signal to improve the accuracy of automatic phoneme alignment under noisy conditions is presented, and significant performance improvements are reported.
Abstract: The incorporation of spectral voicing information of the speech signal to improve the accuracy of automatic phoneme alignment under noisy conditions is presented. Experiments are conducted on the TIMIT speech corpus corrupted by additive noise at various signal-to-noise ratios. Significant performance improvements are reported.
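The abstract does not say how voicing is measured; a common proxy (not necessarily the paper's measure) is the peak of the normalized autocorrelation within the plausible pitch-lag range. The sketch below is illustrative only; all function names and parameters are ours, not the paper's.

```python
import numpy as np

def voicing_degree(frame, sr, fmin=60, fmax=400):
    """Crude voicing measure: peak of the normalized autocorrelation
    within the plausible pitch-lag range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if ac[0] <= 0:          # silent frame: no energy, call it unvoiced
        return 0.0
    ac = ac / ac[0]
    lo, hi = int(sr / fmax), int(sr / fmin)
    return float(ac[lo:hi].max())

sr = 16000
t = np.arange(480) / sr                          # one 30 ms frame
voiced_frame = np.sin(2 * np.pi * 150 * t)       # periodic -> strongly voiced
unvoiced_frame = np.random.default_rng(0).normal(size=480)  # noise-like

v = voicing_degree(voiced_frame, sr)
u = voicing_degree(unvoiced_frame, sr)
```

A periodic frame scores close to 1 (reduced slightly by the finite frame length), while a noise frame scores near 0, so thresholding this value gives a simple voiced/unvoiced decision.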

1 citation

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A phase correction method based on the joint optimization of clean speech and noise by a deep neural network (DNN); the ideal ratio mask (IRM) is employed to estimate the clean speech and noise, and phase correction is applied to obtain the final clean speech.
Abstract: Speech enhancement is an important issue in the field of speech signal processing. With the development of deep learning, speech enhancement technology combined with neural networks has provided more diverse solutions for this field. In this paper, we present a new approach to enhance noisy speech recorded by a single channel. We propose a phase correction method based on the joint optimization of clean speech and noise by a deep neural network (DNN). In this method, the ideal ratio mask (IRM) is employed to estimate the clean speech and noise, and phase correction is combined to obtain the final clean speech. Experiments are conducted using the TIMIT corpus combined with four types of noise at three different signal-to-noise ratio (SNR) levels. The results show that the proposed method yields a significant improvement over the reference DNN-based enhancement method on both objective and subjective evaluation criteria.
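The paper's DNN architecture and phase-correction details are not given in the abstract; the sketch below only illustrates the ideal ratio mask itself, as commonly defined per time-frequency bin, applied to toy magnitude spectrograms. All names and values are illustrative assumptions.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag):
    """Per time-frequency-bin IRM: sqrt(S^2 / (S^2 + N^2))."""
    return np.sqrt(clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-12))

def apply_mask(noisy_mag, mask):
    """Estimate the clean magnitude by element-wise masking."""
    return mask * noisy_mag

# Toy spectrogram: 2 frequency bins x 3 frames
clean = np.array([[1.0, 2.0, 0.5],
                  [0.0, 1.0, 1.0]])
noise = np.array([[1.0, 0.0, 0.5],
                  [2.0, 1.0, 0.0]])
noisy = clean + noise   # magnitudes add only approximately in practice

irm = ideal_ratio_mask(clean, noise)   # 1 where noise-free, 0 where pure noise
est = apply_mask(noisy, irm)
```

In the actual system a DNN is trained to predict this mask from the noisy features, since the clean and noise magnitudes are unknown at test time.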

1 citation

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This work focuses on phoneme modeling in the English language using Artificial Neural Networks (ANN) with RASTA-PLP features, and compares results across different input sizes, optimization algorithms, numbers of hidden layers, and numbers of hidden nodes for the ANN.
Abstract: The human speech signal is rich with information such as the identity of the speaker, the spoken message, the emotional and physical state of the speaker, the spoken language, gender, and age. Automatic Speech Recognition (ASR) involves complex tasks aimed at the recognition and translation of human speech to text by computers. Phoneme recognition means recognizing the phonemes associated with a speech utterance and is a part of ASR. Developing a phonetic engine and enhancing its performance can lead to significant improvement in ASR. In this paper we propose Artificial Neural Network (ANN) based phoneme modeling. We compare the performance of speech features such as Inner Hair Cell Coefficients (IHCC) and Mel-Frequency Cepstral Coefficients (MFCC) using various neural network architectures, different optimization algorithms, changes in input data vector dimensionality (corresponding to different amounts of contextual information), and increasing numbers of epochs. Experiments were carried out on the TIMIT database. Our experimental results indicate that MFCC performs much better than the IHCC features; across the different input vector sizes, numbers of training iterations, and training algorithms tried for the various neural network architectures, the best optimization algorithm is found to be SGD for IHCC and Adagrad for MFCC features.
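IHCC extraction is not standard, but the MFCC pipeline the paper compares against is well documented: power spectrum, mel filterbank, log, then DCT. A compact single-frame sketch follows, assuming typical parameter choices (26 filters, 13 cepstra); these numbers are our assumptions, not the paper's.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: power spectrum -> mel energies -> log -> DCT."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame))**2
    fb = mel_filterbank(n_filters, n_fft, sr)
    energies = np.log(fb @ spec + 1e-10)      # floor avoids log(0)
    return dct(energies, type=2, norm='ortho')[:n_ceps]

# Toy frame: 25 ms of a 440 Hz tone at 16 kHz, Hamming-windowed
sr = 16000
t = np.arange(400) / sr
frame = np.sin(2 * np.pi * 440 * t) * np.hamming(400)
ceps = mfcc_frame(frame, sr)
```

Stacking several neighboring frames' coefficients into one input vector is what varies the "input data vector dimensionality" the abstract refers to.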

1 citation

Dissertation
01 Jan 2012
TL;DR: Advancing in this direction, speech signal stationarity has been exploited to a greater extent than in the previously proposed technique of cluster-size-based sorting of code vectors to speed up PDE.
Abstract: Telephony networks are frequently connected to computers for speech processing to extract useful information, such as automatic speaker identification (ASI). Matching feature vectors extracted from the speech sample of an unknown speaker against models of registered speakers is the most time-consuming component in real-time speaker identification systems. The time-controlling parameters are the size d and count T of the extracted test feature vectors, as well as the size M, complexity, and count N of the models of registered speakers. Reported speedup techniques for vector quantization (VQ) and Gaussian mixture model (GMM) based ASI systems reduce the test feature vector count T by pre-quantization and reduce the candidate registered speakers N by pruning unlikely models, which introduces accuracy degradation. Vantage point tree (VPT) indexing of code vectors has also been used to decrease the effect of parameter M on ASI speed for VQ-based systems. Parameter d, however, has remained unexplored in ASI speedup studies. Speedup techniques for VQ-based and GMM-based real-time ASI without loss of accuracy are presented in this thesis. For VQ-based systems, the focus is on speeding up the closest code vector search (CCS). The capability of partial distortion elimination (PDE), through reducing the d parameter of the codebook, was found more promising than VPT for speeding up CCS. Advancing in this direction, speech signal stationarity has been exploited to a greater extent than in the previously proposed technique of cluster-size-based sorting of code vectors to speed up PDE. The proximity relationship among code vectors established through the Linde-Buzo-Gray (LBG) process of codebook generation has been substantiated. Based upon the high correlation of proximate code vectors, circular partial distortion elimination (CPDE) and toggling-CPDE algorithms have been proposed to speed up CCS. Further ASI speedup is proposed through test feature vector sequence pruning (VSP) when a codebook proves unlikely during the search for the best-match speaker.
Empirical results presented in this thesis show that average speedup factors of up to 5.8 for the 630 registered speakers of the TIMIT 8 kHz corpus and 6.6 for the 230 speakers of the NIST-1999 database have been achieved by integrating VSP and TCPDE. The speedup potential of hierarchical speaker pruning (HSP) for faster ASI has also been demonstrated in this thesis. HSP prunes unlikely candidate speakers based on the ranking results of coarse speaker models; the best match is then found from the detailed models of the remaining speakers. VQ-based and GMM-based ASI systems are explored in depth for the parameters governing the speedup performance of HSP. Using the smallest possible coarse model and pruning the largest number of detailed candidate models is the key objective for speedup through HSP. The city block distance (CBD) is proposed instead of the Euclidean distance (EUD) for ranking speakers in VQ-based systems. This allows the use of a smaller codebook for ranking and the pruning of a greater number of speakers. HSP had been dismissed by previous authors for GMM-based ASI systems due to discouraging speedup results in their studies of VQ-based systems. However, we achieved speedup factors of up to 6.61 and 10.40 for GMM-based ASI systems using HSP for 230 speakers from NIST-1999 and 630 speakers from TIMIT data, respectively, while speedup factors of up to 22.46 and 34.78 are achieved on TIMIT and NIST-1999 data for VQ-based systems, respectively. All the speedup factors reported are without any accuracy loss.
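The thesis's CPDE/TCPDE orderings are not reproduced in the abstract; the sketch below shows only baseline partial distortion elimination, which the proposed methods accelerate further: a code vector is abandoned as soon as its accumulated squared distance exceeds the best distance found so far, yet the search provably returns the same nearest code vector as an exhaustive scan.

```python
import numpy as np

def nearest_code_vector_pde(x, codebook):
    """Closest-code-vector search with partial distortion elimination (PDE):
    abandon a candidate once its partial squared distance can no longer win."""
    best_idx, best_dist = -1, np.inf
    for i, c in enumerate(codebook):
        dist = 0.0
        for d in range(len(x)):
            dist += (x[d] - c[d])**2
            if dist >= best_dist:      # early abandon: cannot beat current best
                break
        else:                          # loop finished: new best candidate
            best_idx, best_dist = i, dist
    return best_idx, best_dist

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 12))   # 64 code vectors of dimension d = 12
x = rng.normal(size=12)                # one test feature vector

idx, dist = nearest_code_vector_pde(x, codebook)
full = ((codebook - x)**2).sum(axis=1)  # exhaustive distances, for comparison
```

The savings grow with the vector dimension d, which is why the thesis targets d; ordering the dimensions or the code vectors well (as CPDE/TCPDE do) makes the early abandon trigger sooner.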

1 citation

Journal ArticleDOI
TL;DR: In this article, a robust method for vowel region detection from multimode speech, based on continuous wavelet transform coefficients and phone boundaries, is proposed for detecting vowel regions from different modes of the speech signal.
Abstract: The aim of this paper is to explore a robust method for vowel region detection from multimode speech. In a realistic scenario, speech can be classified into three modes, namely conversation, extempore, and read. Existing methods detect vowels from speech recorded in a clean environment, which may not be appropriate for multimode speech tasks. To address this issue, we propose an approach based on continuous wavelet transform coefficients and phone boundaries for detecting the vowel regions from different modes of the speech signal. For evaluation of the proposed vowel region (VR) detection technique, the TIMIT (read speech) and Bengali (read, extempore, and conversation speech) corpora are used. The proposed VR detection technique is compared to the state-of-the-art methods, and the experiments record a significant performance gain for the proposed technique over those methods. The efficiency of the proposed technique is further shown by extracting vocal tract and excitation source features from automatically detected VRs to develop a multilingual speech mode classification (MSMC) model. The evaluation results show that the performance of the MSMC model improves significantly when features are extracted from the vowel regions rather than from the entire speech utterance.
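The paper's exact use of CWT coefficients and phone boundaries is not detailed in the abstract. As a rough illustration only, the sketch below convolves a toy signal with Ricker wavelets at a few scales and thresholds the summed coefficient energy, which rises over the voiced (vowel-like) stretch; the wavelet choice, scales, and threshold are all our assumptions.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet of scale a, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2
    amp = 2 / (np.sqrt(3 * a) * np.pi**0.25)
    return amp * (1 - (t / a)**2) * np.exp(-t**2 / (2 * a**2))

def cwt_energy(signal, scales, points=101):
    """Summed squared CWT coefficients across scales, via plain convolution."""
    energy = np.zeros(len(signal))
    for a in scales:
        coeffs = np.convolve(signal, ricker(points, a), mode='same')
        energy += coeffs**2
    return energy

def candidate_regions(energy, thresh_ratio=0.5):
    """Flag samples whose CWT energy exceeds a fraction of the maximum."""
    return energy > thresh_ratio * energy.max()

# Toy signal at 8 kHz: silence, a voiced (periodic) stretch, silence again
sr = 8000
t = np.arange(sr) / sr
sig = np.where((t > 0.3) & (t < 0.7), np.sin(2 * np.pi * 120 * t), 0.0)
sig += 0.01 * np.random.default_rng(1).normal(size=sr)

energy = cwt_energy(sig, scales=[8, 16])   # scales tuned near the 120 Hz tone
mask = candidate_regions(energy)           # True over high-energy samples
```

In the actual system, such candidate regions would additionally be intersected with phone boundaries to isolate the vowel segments.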

1 citation


Network Information
Related Topics (5)
Recurrent neural network
29.2K papers, 890K citations
76% related
Feature (machine learning)
33.9K papers, 798.7K citations
75% related
Feature vector
48.8K papers, 954.4K citations
74% related
Natural language
31.1K papers, 806.8K citations
73% related
Deep learning
79.8K papers, 2.1M citations
72% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95