Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: In this paper, a new audio-visual deepfake dataset containing multimodal video forgeries is proposed; it uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks.
Abstract: With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material have become increasingly simple. Current technology enables the creation of videos in which both the visual and audio contents are falsified. The multimedia forensics community has begun to address this threat by developing fake media detectors; however, the vast majority of existing forensic techniques analyze only one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and visual tracks). One important reason for the lack of multimodal detectors is a corresponding lack of research datasets containing multimodal forgeries. Existing datasets typically contain only one falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic audio with no associated video. Datasets are therefore needed that can be used to develop, train, and test these forensic algorithms. In this paper, we propose a new audio-visual deepfake dataset containing multimodal video forgeries. We present a general pipeline for synthesizing deepfake speech content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset built with the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets for multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio-only) and multimodal (i.e., audio and video) conditions. These highlight the need for multimodal forensic detectors and more multimodal deepfake data.
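The DTW alignment step is easy to illustrate. The following is a minimal, hypothetical sketch of warping a TTS utterance onto the timing of the original speech track, assuming MFCC features and the librosa library; it is not the authors' pipeline, only the general technique the abstract names.

```python
# Hypothetical sketch: align a synthetic (TTS) utterance to the original
# speech track with Dynamic Time Warping. Feature choice (MFCCs) and the
# librosa calls are assumptions, not the TIMIT-TTS authors' implementation.
import librosa

def align_tts_to_reference(ref_path, tts_path, sr=16000, n_mfcc=13):
    ref, _ = librosa.load(ref_path, sr=sr)
    tts, _ = librosa.load(tts_path, sr=sr)
    # Compare utterances in a spectral feature space rather than raw samples.
    ref_feat = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    tts_feat = librosa.feature.mfcc(y=tts, sr=sr, n_mfcc=n_mfcc)
    # dtw returns the accumulated cost matrix and the optimal warping path,
    # i.e. which TTS frame should be mapped onto each reference frame.
    cost, warp_path = librosa.sequence.dtw(X=ref_feat, Y=tts_feat)
    return warp_path[::-1]  # (ref_frame, tts_frame) pairs, start to end
```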

1 citation

Proceedings ArticleDOI
20 Aug 2006
TL;DR: The experimental results show that the GMM-based likelihood-ratio method is effective, and its performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
Abstract: Implicit speech segmentation essentially finds the time instants at which spectral distortion is large. The spectral variation function (SVF) is a widely used measure of spectral distortion; however, the SVF is a data-dependent measure. In order to make the measurement data-independent, a likelihood ratio is constructed to measure the spectral distortion. This ratio can be computed efficiently with a Bayesian predictive model. The prior of the Bayesian predictive model is estimated from unlabeled data via an unsupervised machine learning technique, the Gaussian mixture model (GMM). The experimental results show the effectiveness of this novel method, and the performance on the TIMIT corpus indicates potential applications in speech recognition, synthesis, and coding.
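The likelihood-ratio idea can be sketched compactly. Below is a minimal, hypothetical maximum-likelihood variant: frames on either side of a candidate boundary are scored under one- versus two-Gaussian hypotheses, so a large ratio signals a spectral change. The paper itself computes the ratio with a Bayesian predictive model whose prior comes from a GMM trained on unlabeled data; that machinery is not reproduced here.

```python
# Hypothetical likelihood-ratio boundary score for implicit segmentation.
# H0: the frames around time t come from one Gaussian; H1: the left and
# right windows come from two separate Gaussians. This ML variant only
# illustrates the idea; the paper uses a Bayesian predictive model with a
# GMM-estimated prior instead.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def loglik(x):
    # Fit a single Gaussian to the window; the jitter keeps the covariance
    # invertible (the window length should comfortably exceed the feature
    # dimension in practice).
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])
    return mvn.logpdf(x, mu, cov).sum()

def llr_boundary_score(frames, t, win=25):
    # frames: (T, D) array of spectral feature vectors (e.g. cepstra).
    left, right = frames[t - win:t], frames[t:t + win]
    both = frames[t - win:t + win]
    # Large score: two models explain the data much better than one,
    # i.e. the spectrum changes at time t.
    return loglik(left) + loglik(right) - loglik(both)
```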

1 citation

01 Jan 2007
TL;DR: The results show that both RBF and EBF networks are very robust in detecting impostors for clean speech, with the EBF networks being significantly better than the RBF networks in this respect.
Abstract: This paper presents several text-independent speaker verification experiments based on Radial Basis Function (RBF) and Elliptical Basis Function (EBF) networks. The experiments involve 76 speakers from dialect region 2 of the TIMIT and NTIMIT databases. Each speaker was modelled by a 12-input, 2-output network in which one output represents the speaker class while the other represents the anti-speaker class. The results show that both RBF and EBF networks are very robust in detecting impostors for clean speech, with the EBF networks being significantly better than the RBF networks in this respect. For clean speech, a false acceptance rate of 0.06% and a false rejection rate of 0.19% have been achieved. However, for telephone speech, the false acceptance rate and the false rejection rate are increased to 11.7% and 8.71%, respectively. It is concluded that better pre-processing techniques are required to reduce the effects of noise and channel variations.
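The network structure described above is simple enough to sketch. The following is a minimal, hypothetical RBF verifier in the spirit of the paper's 12-input, 2-output setup: Gaussian basis functions on k-means centres and a linear output layer fit by least squares. Centre count, basis width, and all library choices are assumptions; an EBF network would replace the spherical distance with a per-unit Mahalanobis distance.

```python
# Hypothetical sketch of an RBF-network speaker verifier with one output
# for the speaker class and one for the anti-speaker class, loosely
# following the paper's 12-input, 2-output configuration.
import numpy as np
from sklearn.cluster import KMeans

class RBFVerifier:
    def __init__(self, n_centres=20, width=1.0):
        self.n_centres, self.width = n_centres, width

    def _hidden(self, X):
        # Spherical Gaussian activations; an EBF net would use full
        # covariances (Mahalanobis distances) per hidden unit.
        d2 = ((X[:, None, :] - self.centres[None]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def fit(self, X, Y):
        # X: (N, 12) cepstral vectors; Y: (N, 2) one-hot targets with
        # columns (speaker, anti-speaker).
        self.centres = KMeans(self.n_centres, n_init=5).fit(X).cluster_centers_
        H = self._hidden(X)
        self.W, *_ = np.linalg.lstsq(H, Y, rcond=None)
        return self

    def score(self, X):
        out = self._hidden(X) @ self.W
        # Accept when the speaker output dominates the anti-speaker output.
        return out[:, 0] - out[:, 1]
```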

1 citation

Book ChapterDOI
23 Jul 2010
TL;DR: The F-ratio is computed as a theoretical measure on the features of the training speech to validate the experimental results for perceptual features with pitch; the % false rejection rate is lower for the mel frequency linear predictive cepstrum.
Abstract: The main objective of this paper is to explore the effectiveness of perceptual features combined with pitch for text-independent speaker recognition. In this algorithm, these features are captured and Gaussian mixture models are developed representing L feature vectors of speech for every speaker. Speakers are identified by first finding the posterior probability density between the mixtures of the speaker models and the test speech vectors; a speaker is then classified by the maximum probability density, which corresponds to one speaker model. The algorithm gives a good overall accuracy of 98% for the mel frequency perceptual linear predictive cepstrum combined with pitch when identifying a speaker among 8 speakers chosen randomly from 8 different dialect regions of the "TIMIT" database, using GMM speaker models with 12 mixtures. It also gives a good average accuracy of 95.75% for the same feature with 8 speakers chosen randomly from the same dialect region, again with 12-mixture GMM speaker models. The mel frequency linear predictive cepstrum gives accuracies of 96.75% and 96.125% for 16-mixture GMM speaker models, considering speakers from different dialect regions and from the same dialect region, respectively. The algorithm is also evaluated for 4-, 8-, and 32-mixture GMM speaker models. The 12-mixture GMM speaker models are tested on a population of 20 speakers, and the accuracy is found to be slightly lower than for the population of 8 speakers. A noteworthy feature of the speaker identification algorithm is that the testing procedure is evaluated on identical messages for all speakers. The work is extended to speaker verification, whose performance is measured in terms of % false rejection rate, % false acceptance rate, and % equal error rate. The % false acceptance rate and % equal error rate are lower for the mel frequency perceptual linear predictive cepstrum with pitch, and the % false rejection rate is lower for the mel frequency linear predictive cepstrum. In this work, the F-ratio is computed as a theoretical measure on the features of the training speech to validate the experimental results for perceptual features with pitch. A χ² test is used to provide statistical justification of the experimental results for all features, for both speaker identification and verification.
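The identification rule itself reduces to a few lines. Below is a minimal, hypothetical sketch of the GMM step, assuming per-speaker mixture models scored by average log-likelihood; the 12-mixture setting follows the paper, while feature extraction and all library choices are assumptions.

```python
# Hypothetical sketch of GMM-based speaker identification: one mixture
# model per speaker, identity decided by the highest average per-frame
# log-likelihood on the test utterance.
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_feats, n_mix=12):
    # train_feats: dict speaker_id -> (N_i, D) array of feature vectors
    # (e.g. perceptual cepstra plus pitch).
    return {spk: GaussianMixture(n_components=n_mix).fit(X)
            for spk, X in train_feats.items()}

def identify(models, test_feats):
    # GaussianMixture.score returns the average log-likelihood per frame.
    scores = {spk: m.score(test_feats) for spk, m in models.items()}
    return max(scores, key=scores.get)
```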

1 citation

Journal ArticleDOI
TL;DR: It is confirmed that the modified YL (Yi and Loizou) algorithm, which uses a VAD (voice activity detector) to enhance speech corrupted by colored noise, outperforms the two previous approaches in terms of SNR (signal-to-noise ratio) and SSD (speech spectral distortion).
Abstract: In this paper, we propose a modified YL (Yi and Loizou) algorithm that uses a VAD (voice activity detector) to enhance speech corrupted by colored noise. The performance of the proposed algorithm has been compared to the YL algorithm and the LS (Lee and Son, et al.) algorithm by computer simulation. The colored noises used in the experiments were car noise and multi-talker babble from the AURORA database, and the speech material came from the TIMIT database. It is confirmed that the proposed algorithm outperforms the previous two approaches in terms of SNR (signal-to-noise ratio) and SSD (speech spectral distortion).
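The SNR figure used in the comparison is straightforward to compute. The sketch below is a minimal, hypothetical version that measures clean-signal power against the power of the residual error in the enhanced output; the SSD measure and the enhancement algorithms themselves are not reproduced.

```python
# Hypothetical sketch of the SNR (signal-to-noise ratio) evaluation:
# higher is better, with the "noise" taken as whatever of the enhanced
# signal deviates from the clean reference. Both inputs are assumed to be
# time-aligned arrays of equal length.
import numpy as np

def snr_db(clean, enhanced):
    residual = clean - enhanced  # leftover noise plus speech distortion
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))
```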

1 citation


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95