Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Dissertation
01 Jan 2016
TL;DR: This thesis aims to enhance the robustness of ASR systems in complex auditory environments by developing new front-end acoustic feature extraction techniques, complemented by a one-way Analysis of Variance (ANOVA)-based camera fusion strategy for audio-visual recognition.
Abstract: In this era of smart applications, Automatic Speech Recognition (ASR) has established itself as an emerging technology that is becoming more popular by the day. However, the accuracy and reliability of these systems are restricted by acoustic conditions such as background noise and channel noise. Thus, there is a considerable gap in human-machine communication due to the lack of robustness in complex auditory scenes. The objective of this thesis is to enhance the robustness of the system in complex auditory environments by developing new front-end acoustic feature extraction techniques. The pros and cons of the different techniques are also highlighted. In recent years, wavelet-based acoustic features have become popular for speech recognition applications. The wavelet transform is an excellent tool for time-frequency analysis with good signal-denoising properties. New auditory-based Wavelet Packet (WP) features are proposed to enhance system performance across different types of noisy conditions. The proposed technique is designed so that it mimics the frequency response of the human ear according to the Equivalent Rectangular Bandwidth (ERB) scale. In subsequent chapters, further developments of the proposed technique are discussed using Sub-band based Periodicity and Aperiodicity Decomposition (SPADE) and harmonic analysis. The TIMIT (English) and CSIR-TIFR (Hindi) phoneme recognition tasks are carried out to evaluate the performance of the proposed technique. The simulation results demonstrate the potential of the proposed techniques to enhance system accuracy over a wide range of SNRs. Further, the visual modality plays a vital role in computer vision systems when the acoustic modality is disturbed by background noise. However, most systems rarely address visual-domain problems, which limits their use in real-world conditions. A multiple-camera protocol gives the system more flexibility by allowing speakers to move freely. In the last chapter, consideration is given to Audio-Visual Speech Recognition (AVSR) in vehicular environments, which resulted in one novel contribution: the one-way Analysis of Variance (ANOVA)-based camera fusion strategy. Multiple-camera fusion is an essential part of multi-camera computer vision applications. The ANOVA-based approach is proposed to study the relative contribution of each camera for AVSR experiments in in-vehicle environments. The four-camera automotive audio-visual corpus is used to investigate the performance of the proposed technique. Speech is a primary medium of communication for humans, and various speech-based applications can work reliably only by improving the performance of ASR across different environments. In the modern era, there is vast potential and immense possibility for using speech effectively as a communication medium between human and machine. Robust and reliable speech technology enables people to experience the full benefits of Information and Communication Technology (ICT).
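To make the ERB-scale idea concrete, here is a minimal sketch, not the thesis implementation, of laying out analysis bands on the Equivalent Rectangular Bandwidth scale using the standard Glasberg and Moore formulas; the band count, frequency range, and the MFCC-style pooling step mentioned in the comments are assumptions for illustration only.

```python
# A minimal sketch (assumed parameters, not the thesis code) of ERB-scale band layout.
import numpy as np

def hz_to_erb_rate(f_hz):
    """Glasberg & Moore (1990) ERB-rate scale (in Cams)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_band_edges(f_low=100.0, f_high=8000.0, n_bands=24):
    """Band edges equally spaced on the ERB-rate scale between f_low and f_high."""
    erb_edges = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_bands + 1)
    return erb_rate_to_hz(erb_edges)

# Uniform wavelet-packet subbands would then be pooled into these auditory bands
# (e.g., before log compression and a DCT) to mimic the ear's frequency resolution.
print(np.round(erb_band_edges(), 1))
```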

3 citations

Proceedings ArticleDOI
25 Oct 1994
TL;DR: In this paper, a Gaussian mixture speaker model was used for speaker identification and experiments were conducted on the TIMIT and NTIMIT databases, achieving accuracies of 99.5% and 60.7% for clean, wideband speech and telephone speech, respectively.
Abstract: The two largest factors affecting automatic speaker identification performance are the size of the population to be distinguished among and the degradations introduced by noisy communication channels (e.g., telephone transmission). To experimentally examine these two factors, this paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identification and experiments are conducted on the TIMIT and NTIMIT databases. The aims of this study are to (1) establish how well text-independent speaker identification can perform under near-ideal conditions for very large populations (using the TIMIT database), (2) gauge the performance loss incurred by transmitting the speech over the telephone network (using the NTIMIT database), and (3) examine the validity of current models of telephone degradations commonly used in developing compensation techniques (using the NTIMIT calibration signals). These are believed to be the first speaker identification experiments on the complete 630-speaker TIMIT and NTIMIT databases and the largest text-independent speaker identification task reported to date. Identification accuracies of 99.5% and 60.7% are achieved on the TIMIT and NTIMIT databases, respectively.
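As an illustration of the kind of system described, the following sketch trains one Gaussian mixture model per enrolled speaker and identifies a test utterance by the highest average frame log-likelihood. The choice of MFCC features via librosa and diagonal-covariance GMMs via scikit-learn is an assumption; the paper does not prescribe these libraries.

```python
# A minimal sketch of text-independent GMM speaker identification (assumed tooling).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=20):
    """Load audio and return MFCC feature vectors, one row per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_speaker_models(enrollment, n_components=32):
    """enrollment: dict mapping speaker_id -> list of enrollment wav paths."""
    models = {}
    for spk, paths in enrollment.items():
        X = np.vstack([mfcc_frames(p) for p in paths])
        models[spk] = GaussianMixture(n_components=n_components,
                                      covariance_type='diag').fit(X)
    return models

def identify(models, wav_path):
    """Return the speaker whose GMM gives the highest average frame log-likelihood."""
    X = mfcc_frames(wav_path)
    scores = {spk: gmm.score(X) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```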

3 citations

01 Jan 2010
TL;DR: The speedup technique used in this paper partially prunes VQ codebook mean vectors using partial distortion elimination (PDE) to accelerate the bottleneck component of speaker identification.
Abstract: Matching the feature vectors extracted from an unknown speaker's speech sample against the models of registered speakers is the most time-consuming component of real-time speaker identification systems. The controlling parameters are the size and count of the extracted test feature vectors as well as the size, complexity, and count of the registered speakers' models. We study vector quantization (VQ), which is less investigated than the Gaussian mixture model (GMM), for accelerating this bottleneck component of speaker identification. Previously reported acceleration techniques for the VQ approach reduce the test feature vector count by pre-quantization and reduce the candidate registered speakers by pruning unlikely ones, thereby introducing a risk of accuracy degradation. The speedup technique used in this paper partially prunes VQ codebook mean vectors using partial distortion elimination (PDE). Acceleration factors of up to 3.29 on the 630 registered speakers of the TIMIT 8 kHz speech data and 4 on the 91 registered speakers of the CSLU speech data are achieved, respectively.
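For clarity, here is a minimal sketch, not the paper's code, of partial distortion elimination in a VQ nearest-codeword search: the running squared-error sum for a candidate codeword is abandoned as soon as it exceeds the best distortion found so far, which is the pruning idea the abstract refers to.

```python
# A minimal sketch of partial distortion elimination (PDE) in VQ search.
import numpy as np

def nearest_codeword_pde(x, codebook):
    """x: (dim,) test vector; codebook: (n_codewords, dim). Returns (index, distortion)."""
    best_idx, best_dist = -1, np.inf
    for i, c in enumerate(codebook):
        partial = 0.0
        for d in range(x.shape[0]):
            partial += (x[d] - c[d]) ** 2
            if partial >= best_dist:      # PDE: this codeword can no longer win, stop early
                break
        else:                             # loop finished without break -> new best match
            best_idx, best_dist = i, partial
    return best_idx, best_dist

def vq_distortion(test_vectors, codebook):
    """Average VQ distortion of a test utterance against one speaker's codebook."""
    return np.mean([nearest_codeword_pde(x, codebook)[1] for x in test_vectors])
```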

3 citations

Book ChapterDOI
11 Sep 2017
TL;DR: In this paper, a novel modification of the ladder network was proposed for semi-supervised learning of recurrent neural networks, which was evaluated with a phoneme recognition task on the TIMIT corpus.
Abstract: Ladder networks are a notable new concept in the field of semi-supervised learning, showing state-of-the-art results in image recognition tasks while being compatible with many existing neural architectures. We present the recurrent ladder network, a novel modification of the ladder network, for semi-supervised learning of recurrent neural networks, which we evaluate with a phoneme recognition task on the TIMIT corpus. Our results show that the model is able to consistently outperform the baseline and achieve fully-supervised baseline performance with only 75% of all labels, which demonstrates that the model is capable of using unsupervised data as an effective regulariser.
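The following sketch illustrates only the core training idea behind ladder-style semi-supervised learning for a recurrent model: a supervised phoneme-classification loss on labelled batches combined with an unsupervised denoising/reconstruction loss on all batches. It is not the authors' architecture; the layer-wise lateral connections and per-layer denoising costs of the full ladder network are omitted, and all sizes and the use of PyTorch are assumptions.

```python
# A minimal sketch of combining supervised and denoising losses for an RNN (assumed setup).
import torch
import torch.nn as nn

class RecurrentLadderSketch(nn.Module):
    def __init__(self, n_features=40, n_hidden=128, n_phonemes=39, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.GRU(n_features, n_hidden, batch_first=True)
        self.classifier = nn.Linear(n_hidden, n_phonemes)   # supervised head
        self.decoder = nn.Linear(n_hidden, n_features)      # reconstruction head

    def forward(self, x):
        # The corrupted input drives both heads; the clean input is the reconstruction target.
        x_noisy = x + self.noise_std * torch.randn_like(x)
        h, _ = self.encoder(x_noisy)
        return self.classifier(h), self.decoder(h)

def semi_supervised_loss(model, x, labels=None, recon_weight=0.1):
    logits, recon = model(x)
    loss = recon_weight * nn.functional.mse_loss(recon, x)   # unsupervised term
    if labels is not None:                                    # supervised term, labelled data only
        loss = loss + nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    return loss

# Example: one labelled and one unlabelled batch of 16 utterances, 100 frames each.
model = RecurrentLadderSketch()
x_lab, y_lab = torch.randn(16, 100, 40), torch.randint(0, 39, (16, 100))
x_unlab = torch.randn(16, 100, 40)
loss = semi_supervised_loss(model, x_lab, y_lab) + semi_supervised_loss(model, x_unlab)
loss.backward()
```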

3 citations

Proceedings ArticleDOI
26 Jul 2021
TL;DR: In this paper, a deep neural network (DNN) system is proposed for the automatic detection of speech in audio signals, otherwise known as voice activity detection (VAD); of the DNN types investigated, convolutional neural networks performed best.
Abstract: In this paper, we propose a deep neural network (DNN) system for the automatic detection of speech in audio signals, otherwise known as voice activity detection (VAD). Several DNN types were investigated, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), with the best performance being obtained for the latter. Additional postprocessing techniques, i.e., hysteresis thresholds, minimum duration filtering, and bilateral extension, were employed in order to boost performance. The systems were trained and tested using several data subsets of the CENSREC-1-C database, with different simulated ambient noise conditions, and additional testing was done on a different CENSREC-1-C data subset containing actual ambient noise, as well as on a subset of the TIMIT database. An accuracy of up to 99.13% was obtained for the CENSREC-1-C datasets, and 97.60% for the TIMIT dataset.
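To show what the three postprocessing steps do, here is a minimal sketch, not the paper's system, that applies hysteresis thresholding, minimum-duration filtering, and bilateral extension to per-frame speech probabilities produced by any DNN; all thresholds and frame counts are assumed values.

```python
# A minimal sketch of the three VAD postprocessing steps named in the abstract.
import numpy as np

def hysteresis(probs, t_on=0.7, t_off=0.3):
    """Enter speech above t_on, leave only below t_off (reduces rapid flickering)."""
    active, out = False, np.zeros(len(probs), dtype=bool)
    for i, p in enumerate(probs):
        active = p > t_on if not active else p > t_off
        out[i] = active
    return out

def min_duration(mask, min_frames=10):
    """Drop speech segments shorter than min_frames."""
    out, start = mask.copy(), None
    for i, v in enumerate(np.append(mask, False)):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start < min_frames:
                out[start:i] = False
            start = None
    return out

def bilateral_extension(mask, pad_frames=5):
    """Extend each detected segment by pad_frames on both sides."""
    out = mask.copy()
    for i in np.flatnonzero(mask):
        out[max(0, i - pad_frames): i + pad_frames + 1] = True
    return out

# Usage: vad = bilateral_extension(min_duration(hysteresis(frame_probs)))
```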

3 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95