Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: In this paper, a new audio-visual deepfake dataset containing multimodal video forgeries is proposed; it uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks.
Abstract: With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material have become increasingly simple. Current technology enables the creation of videos in which both the visual and audio contents are falsified. The multimedia forensics community has begun to address this threat by developing fake media detectors; however, the vast majority of existing forensic techniques analyze only one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and visual tracks). One important reason for the lack of multimodal detectors is a corresponding lack of research datasets containing multimodal forgeries. Existing datasets typically contain only one falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic audio with no associated video. Datasets are therefore needed that can be used to develop, train, and test these forensic algorithms. In this paper, we propose a new audio-visual deepfake dataset containing multimodal video forgeries. We present a general pipeline for synthesizing deepfake speech content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset built with the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets for multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio-only) and multimodal (i.e., audio and video) conditions. These highlight the need for multimodal forensic detectors and more multimodal deepfake data.
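The DTW alignment step is easy to illustrate. The following is a minimal, hypothetical sketch of warping a TTS utterance onto the timing of the original speech track, assuming MFCC features and the librosa library; it is not the authors' pipeline, only the general technique the abstract names.

```python
# Hypothetical sketch: align a synthetic (TTS) utterance to the original
# speech track with Dynamic Time Warping. Feature choice (MFCCs) and the
# librosa calls are assumptions, not the TIMIT-TTS authors' implementation.
import librosa

def align_tts_to_reference(ref_path, tts_path, sr=16000, n_mfcc=13):
    ref, _ = librosa.load(ref_path, sr=sr)
    tts, _ = librosa.load(tts_path, sr=sr)
    # Compare utterances in a spectral feature space rather than raw samples.
    ref_feat = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    tts_feat = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=n_mfcc)
    # dtw returns the accumulated cost matrix and the optimal warping path,
    # i.e. which TTS frame should be mapped onto each reference frame.
    cost, warp_path = librosa.sequence.dtw(X=ref_feat, Y=tts_feat)
    return warp_path[::-1]  # (ref_frame, tts_frame) pairs, start to end
```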

1 citation

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The experimental results show that the GMM-based likelihood-ratio method is effective, and its performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
Abstract: Implicit speech segmentation essentially finds the time instants at which spectral distortion is large. The spectral variation function (SVF) is a widely used measure of spectral distortion; however, the SVF is a data-dependent measure. In order to make the measurement data-independent, a likelihood ratio is constructed to measure the spectral distortion. This ratio can be computed efficiently with a Bayesian predictive model. The prior of the Bayesian predictive model is estimated from unlabeled data via an unsupervised machine learning technique, the Gaussian mixture model (GMM). The experimental results show the effectiveness of this novel method, and the performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
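The likelihood-ratio idea can be sketched compactly. Below is a minimal, hypothetical maximum-likelihood variant: frames on either side of a candidate boundary are scored under one- versus two-Gaussian hypotheses, so a large ratio signals a spectral change. The paper itself computes the ratio with a Bayesian predictive model whose prior comes from a GMM trained on unlabeled data; that machinery is not reproduced here.

```python
# Hypothetical likelihood-ratio boundary score for implicit segmentation.
# H0: the frames around time t come from one Gaussian; H1: the left and
# right windows come from two separate Gaussians. This ML variant only
# illustrates the idea; the paper uses a Bayesian predictive model with a
# GMM-estimated prior instead.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def loglik(x):
    # Fit a single Gaussian to the window; the jitter keeps the covariance
    # invertible (the window length should comfortably exceed the feature
    # dimension in practice).
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])
    return mvn.logpdf(x, mu, cov).sum()

def llr_boundary_score(frames, t, win=25):
    # frames: (T, D) array of spectral feature vectors (e.g. cepstra).
    left, right = frames[t - win:t], frames[t:t + win]
    both = frames[t - win:t + win]
    # Large score: two models explain the data much better than one,
    # i.e. the spectrum changes at time t.
    return loglik(left) + loglik(right) - loglik(both)
```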

1 citation

01 Jan 2007
TL;DR: The results show that both RBF and EBF networks are very robust in detecting impostors for clean speech, with the EBF networks being significantly better than the RBF networks in this respect.
Abstract: This paper presents several text-independent speaker verification experiments based on Radial Basis Function (RBF) and Elliptical Basis Function (EBF) networks. The experiments involve 76 speakers from dialect region 2 of the TIMIT and NTIMIT databases. Each speaker was modelled by a 12-input, 2-output network in which one output represents the speaker class while the other represents the anti-speaker class. The results show that both RBF and EBF networks are very robust in detecting impostors for clean speech, with the EBF networks being significantly better than the RBF networks in this respect. For clean speech, a false acceptance rate of 0.06% and a false rejection rate of 0.19% have been achieved. However, for telephone speech, the false acceptance rate and the false rejection rate are increased to 11.7% and 8.71%, respectively. It is concluded that better pre-processing techniques are required to reduce the effects of noise and channel variations.
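The network structure described above is simple enough to sketch. The following is a minimal, hypothetical RBF verifier in the spirit of the paper's 12-input, 2-output setup: Gaussian basis functions on k-means centres and a linear output layer fit by least squares. Centre count, basis width, and all library choices are assumptions; an EBF network would replace the spherical distance with a per-unit Mahalanobis distance.

```python
# Hypothetical sketch of an RBF-network speaker verifier with one output
# for the speaker class and one for the anti-speaker class, loosely
# following the paper's 12-input, 2-output configuration.
import numpy as np
from sklearn.cluster import KMeans

class RBFVerifier:
    def __init__(self, n_centres=20, width=1.0):
        self.n_centres, self.width = n_centres, width

    def _hidden(self, X):
        # Spherical Gaussian activations; an EBF net would use full
        # covariances (Mahalanobis distances) per hidden unit.
        d2 = ((X[:, None, :] - self.centres[None]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def fit(self, X, Y):
        # X: (N, 12) cepstral vectors; Y: (N, 2) one-hot targets with
        # columns (speaker, anti-speaker).
        self.centres = KMeans(self.n_centres, n_init=5).fit(X).cluster_centers_
        H = self._hidden(X)
        self.W, *_ = np.linalg.lstsq(H, Y, rcond=None)
        return self

    def score(self, X):
        out = self._hidden(X) @ self.W
        # Accept when the speaker output dominates the anti-speaker output.
        return out[:, 0] - out[:, 1]
```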

1 citation

Book ChapterDOI
23 Jul 2010
TL;DR: The F-ratio is computed as a theoretical measure on the features of the training speech to validate the experimental results for perceptual features with pitch; the % false rejection rate is lower for the mel frequency linear predictive cepstrum.
Abstract: The main objective of this paper is to explore the effectiveness of perceptual features combined with pitch for text-independent speaker recognition. In this algorithm, these features are captured and Gaussian mixture models are developed representing L feature vectors of speech for every speaker. Speakers are identified by first finding the posterior probability density between the mixtures of the speaker models and the test speech vectors; a speaker is then classified by the maximum probability density, which corresponds to one speaker model. The algorithm gives a good overall accuracy of 98% for the mel frequency perceptual linear predictive cepstrum combined with pitch when identifying a speaker among 8 speakers chosen randomly from 8 different dialect regions of the "TIMIT" database, using GMM speaker models with 12 mixtures. It also gives a good average accuracy of 95.75% for the same feature with 8 speakers chosen randomly from the same dialect region, again with 12-mixture GMM speaker models. The mel frequency linear predictive cepstrum gives accuracies of 96.75% and 96.125% for 16-mixture GMM speaker models, considering speakers from different dialect regions and from the same dialect region, respectively. The algorithm is also evaluated for 4-, 8-, and 32-mixture GMM speaker models. The 12-mixture GMM speaker models are tested on a population of 20 speakers, and the accuracy is found to be slightly lower than for the population of 8 speakers. A noteworthy feature of the speaker identification algorithm is that the testing procedure is evaluated on identical messages for all speakers. The work is extended to speaker verification, whose performance is measured in terms of % false rejection rate, % false acceptance rate, and % equal error rate. The % false acceptance rate and % equal error rate are lower for the mel frequency perceptual linear predictive cepstrum with pitch, and the % false rejection rate is lower for the mel frequency linear predictive cepstrum. In this work, the F-ratio is computed as a theoretical measure on the features of the training speech to validate the experimental results for perceptual features with pitch. A χ² test is used to provide statistical justification of the experimental results for all features, for both speaker identification and verification.
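The identification rule itself reduces to a few lines. Below is a minimal, hypothetical sketch of the GMM step, assuming per-speaker mixture models scored by average log-likelihood; the 12-mixture setting follows the paper, while feature extraction and all library choices are assumptions.

```python
# Hypothetical sketch of GMM-based speaker identification: one mixture
# model per speaker, identity decided by the highest average per-frame
# log-likelihood on the test utterance.
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_feats, n_mix=12):
    # train_feats: dict speaker_id -> (N_i, D) array of feature vectors
    # (e.g. perceptual cepstra plus pitch).
    return {spk: GaussianMixture(n_components=n_mix).fit(X)
            for spk, X in train_feats.items()}

def identify(models, test_feats):
    # GaussianMixture.score returns the average log-likelihood per frame.
    scores = {spk: m.score(test_feats) for spk, m in models.items()}
    return max(scores, key=scores.get)
```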

1 citation

Journal ArticleDOI
TL;DR: It is confirmed that the modified YL (Yi and Loizou) algorithm, which uses a VAD (voice activity detector) to enhance speech corrupted by colored noise, outperforms the two previous approaches in terms of SNR (signal-to-noise ratio) and SSD (speech spectral distortion).
Abstract: In this paper, we propose a modified YL (Yi and Loizou) algorithm that uses a VAD (voice activity detector) to enhance speech corrupted by colored noise. The performance of the proposed algorithm has been compared to the YL algorithm and the LS (Lee and Son, et al.) algorithm by computer simulation. The colored noises used in the experiments were car noise and multi-talker babble from the AURORA database, and the speech material came from the TIMIT database. It is confirmed that the proposed algorithm outperforms the previous two approaches in terms of SNR (signal-to-noise ratio) and SSD (speech spectral distortion).
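The SNR figure used in the comparison is straightforward to compute. The sketch below is a minimal, hypothetical version that measures clean-signal power against the power of the residual error in the enhanced output; the SSD measure and the enhancement algorithms themselves are not reproduced.

```python
# Hypothetical sketch of the SNR (signal-to-noise ratio) evaluation:
# higher is better, with the "noise" taken as whatever of the enhanced
# signal deviates from the clean reference. Both inputs are assumed to be
# time-aligned arrays of equal length.
import numpy as np

def snr_db(clean, enhanced):
    residual = clean - enhanced  # leftover noise plus speech distortion
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))
```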

1 citation


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95