Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
TL;DR: In this article, the authors provide a comparative analysis of three approaches to speech activity detection: classification of still images (CNN model), classification based on previous images (CRNN model), and classification of sequences of images (Seq2Seq model).
Abstract: Existing literature on speech activity detection (SAD) highlights different neural-network approaches but does not provide a comprehensive comparison of these methods. This matters because such neural approaches often require hardware-intensive resources. In this article, we provide a comparative analysis of three different approaches: classification with still images (CNN model), classification based on previous images (CRNN model), and classification of sequences of images (Seq2Seq model). Our experimental results on the Vid-TIMIT dataset show that the CNN model achieves an accuracy of 97%, whereas the CRNN and Seq2Seq models raise classification accuracy to 99%. Further experiments show that the CRNN model is almost as accurate as the Seq2Seq model (99.1% vs. 99.6% classification accuracy, respectively) but 57% faster to train (326 vs. 761 secs. per epoch).
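As a quick sanity check (not from the paper's code), the quoted "57% faster to train" figure follows directly from the reported per-epoch times; the helper name below is an illustrative assumption:

```python
def relative_speedup(slower_secs: float, faster_secs: float) -> float:
    """Fraction by which per-epoch training time is reduced."""
    return (slower_secs - faster_secs) / slower_secs

# Per-epoch times reported in the abstract: Seq2Seq 761 s, CRNN 326 s.
speedup = relative_speedup(761, 326)  # about 0.57, i.e. the quoted "57% faster"
```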
01 Jan 2022
TL;DR: This article explores three different types of methods for DNN-based speaker meta-information estimation and compares the estimation results between the original speech and the anonymized speech using the TIMIT dataset.
Abstract: There have been concerns about how speech data is collected and shared in the real world, since human speech itself carries personally identifiable information about the speaker and can be used to reliably estimate speaker meta-information. In this paper, we explore three different types of methods for DNN-based speaker meta-information estimation and compare the estimation results between the original speech and the anonymized speech. We used a McAdams-coefficient-based signal processing technique to produce the anonymized, privacy-preserving speech data. Experiments on the TIMIT dataset show a slight degradation in performance on anonymized speech relative to the original. Experiments reveal that a model employing both DNN-based embeddings and voice anonymization can achieve performance comparable to the model using the original speech.
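The McAdams-coefficient technique mentioned above warps the phase angles of a frame's LPC poles, shifting formant positions while preserving pole radii. A minimal sketch of that pole-warping step (the function name and the default alpha=0.8 are illustrative assumptions, not taken from the paper):

```python
import cmath
import math

def mcadams_warp(poles, alpha=0.8):
    """Warp LPC pole phase angles phi -> phi**alpha (McAdams coefficient).
    alpha != 1 moves formant frequencies while keeping pole magnitudes,
    changing perceived speaker identity; near-real poles are left alone."""
    warped = []
    for p in poles:
        r, phi = abs(p), cmath.phase(p)
        if abs(phi) > 1e-6:  # only warp complex (formant-bearing) poles
            phi = math.copysign(abs(phi) ** alpha, phi)
        warped.append(cmath.rect(r, phi))
    return warped
```

In a full pipeline this step would sit between LPC analysis and resynthesis of each frame; only the pole angles change, so the residual signal can be reused.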
05 Apr 2022
TL;DR: The authors propose a complex recurrent VAE framework in which a complex-valued recurrent neural network and an L1 reconstruction loss are used to account for the temporal properties of speech signals.
Abstract: As an extension of the variational autoencoder (VAE), the complex VAE uses complex Gaussian distributions to model latent variables and data. This work proposes a complex recurrent VAE framework in which a complex-valued recurrent neural network and an L1 reconstruction loss are used. First, to account for the temporal property of speech signals, this work introduces a complex-valued recurrent neural network into the complex VAE framework. In addition, the L1 loss is used as the reconstruction loss in this framework. To exemplify the use of the complex generative model in speech processing, we choose speech enhancement as the specific application in this paper. Experiments are based on the TIMIT dataset. The results show that the proposed method improves objective metrics of speech intelligibility and signal quality.
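For complex-valued data, an L1 reconstruction loss reduces to the mean complex modulus of the residual, so real and imaginary errors are penalized jointly. A minimal, framework-free sketch (the function name is an assumption for illustration):

```python
def complex_l1_loss(x, x_hat):
    """Mean absolute error for complex-valued sequences: abs(a - b) is the
    complex modulus of the residual, not a componentwise absolute value."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

# e.g. two complex STFT coefficients vs. a rough reconstruction
loss = complex_l1_loss([1 + 1j, 2 - 1j], [1 + 0j, 2 + 0j])  # -> 1.0
```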
01 Jan 1994
TL;DR: Clean speech databases such as TIMIT (Garofolo et al., 1988) have been available, but a telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research.
Abstract: It is difficult to implement talker recognition on the telephone network because of normal variation in the channel characteristics. The primary component of variation is due to the different telephone handsets or microphone frequency characteristics (Rosenberg and Soong, 1992). Lack of availability of telephone speech databases has also contributed to slow progress in the solution of these problems, though clean speech databases such as TIMIT (Garofolo et al., 1988) have been available. A telephone speech database suitable for talker identification research (Godfrey, 1992) was not generally available at the time of this research.
01 Dec 2017
TL;DR: A DNN trained on a system combination of Mel-filterbank energies and SBAE features exploits complementary information present in the speech signal to aid representation learning.
Abstract: Recently, unsupervised representation learning of features from speech signals has seen a tremendous upsurge in speech processing applications. In this paper, we investigate a modified autoencoder architecture, the subband autoencoder (SBAE), for representation learning. Features were learned from the spectrogram given as input to the SBAE for the speech recognition task. SBAE features and Mel-filterbank energies as spectral features were trained separately on DNNs, and system combination was then applied. This technique was applied to speech recognition on the TIMIT and WSJ0 databases. On the TIMIT database, we achieved an absolute improvement of 3% in PER on the test set over Mel-filterbank energies alone. For the WSJ0 database, we achieved a relative improvement of 9.89% in WER on the test sets compared to filterbank energies. Hence, a DNN trained on the system combination of Mel-filterbank energies and SBAE features captures complementary information present in the speech signal.
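The system combination described above can be read as frame-level fusion of the two DNN posterior streams. A minimal sketch, assuming simple linear score averaging (the abstract does not specify the exact combination rule, and the function name is illustrative):

```python
def combine_posteriors(fbank_post, sbae_post, w=0.5):
    """Linearly fuse per-frame class posteriors from the Mel-filterbank DNN
    and the SBAE-feature DNN; w weights the filterbank stream."""
    return [[w * a + (1 - w) * b for a, b in zip(fa, fb)]
            for fa, fb in zip(fbank_post, sbae_post)]

# one frame, two classes: filterbank says [0.8, 0.2], SBAE says [0.4, 0.6]
fused = combine_posteriors([[0.8, 0.2]], [[0.4, 0.6]])  # approx [[0.6, 0.4]]
```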