scispace - formally typeset

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
18 Jul 2022
TL;DR: In this article, the authors propose a dynamic scene perception module (DSPM) consisting of two parts, one for dynamic scene estimation and the other for adaptive region perception, where the scene estimator utilizes a spectrum-energy-based attention mechanism to obtain the coefficients of each convolution kernel.
Abstract: Speech enhancement aims to recover clean speech from complex noise backgrounds. This paper proposes a novel information processing module dubbed dynamic scene perception module (DSPM) that can help existing systems to accommodate various complex scenarios. The inspiration of DSPM is based on the observation that different regions of the noisy spectrum in different scenarios have different enhancing requirements. Concretely, DSPM consists of two parts, one for dynamic scene estimation, and the other for adaptive region perception. In particular, the scene estimator utilizes a spectrum-energy-based attention mechanism to obtain the coefficients of each convolution kernel. Then, at each position, the region perceptron chooses the corresponding kernels by considering the requirements of the current region (preserve vocals or suppress noise). Systematic evaluations on the TIMIT corpus and Voice Bank + DEMAND demonstrate the effectiveness of our method. Compared with the existing systems, our proposed method achieved better performance under various SNR conditions and complex noise scenarios.
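As a rough illustration of the idea described in this abstract (not the authors' implementation), a spectrum-energy-based attention step can be sketched as deriving softmax mixing coefficients over a small bank of convolution kernels; all function names, shapes, and the band-energy heuristic here are hypothetical:

```python
import numpy as np

def scene_attention(spectrum, n_kernels=4):
    """Hypothetical sketch: derive per-kernel mixing coefficients from the
    energy distribution of a noisy magnitude spectrum, in the spirit of a
    spectrum-energy-based scene estimator."""
    # Split the frequency axis into n_kernels bands and compute band energies.
    bands = np.array_split(spectrum ** 2, n_kernels)
    energies = np.array([b.sum() for b in bands])
    # Softmax turns band energies into convex kernel-mixing coefficients.
    logits = energies - energies.max()
    coeffs = np.exp(logits) / np.exp(logits).sum()
    return coeffs

def dynamic_kernel(kernels, coeffs):
    """Mix a bank of convolution kernels with the estimated coefficients,
    yielding one scene-adapted kernel per position/region."""
    return np.tensordot(coeffs, kernels, axes=1)
```

A region perceptron could then apply a differently mixed kernel per spectral region, depending on whether that region mostly needs vocal preservation or noise suppression.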
Posted ContentDOI
16 Sep 2022
TL;DR: In this paper, the authors present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material using Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques.
Abstract: With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Also, forged media are getting more and more complex, with manipulated videos taking over the scene from still images. The multimedia forensic community has addressed the possible threats that this situation could imply by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools only analyze one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now, since manipulated videos are becoming customary, performing monomodal analyses could be reductive. Nonetheless, there is a lack in the literature regarding multimodal detectors, mainly due to the scarcity of datasets containing forged multimodal data to train and test the designed algorithms. In this paper, we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with other state-of-the-art sets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both mono and multimodal conditions, showing the need for multimodal forensic detectors and more suitable data.
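The DTW alignment step such a pipeline relies on can be sketched independently of the paper's code: this is the minimal textbook dynamic-programming recurrence for the cost of the best monotonic alignment between two 1-D feature sequences (real pipelines would align multi-dimensional acoustic features, not raw scalars).

```python
def dtw_distance(a, b):
    """Minimal DTW sketch (pure Python): accumulated cost of the best
    monotonic alignment between two 1-D sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allowed moves: insertion, deletion, or a diagonal match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because DTW tolerates local stretching and compression, a synthetic TTS track can be warped onto the timing of the original video's speech.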
Journal ArticleDOI
03 Jun 2021
TL;DR: The outcomes of this study show that the decrease in quality with NID is twice that of DN-based steganography as the embedding capacity increases.
Abstract: Aim: The main motive of this study is to perform Adaptive Multi Rate Wideband (AMR-WB) speech steganography in network security to produce stego speech with less loss of quality while increasing embedding capacities. Materials and Methods: The TIMIT Acoustic-Phonetic Continuous Speech Corpus dataset consists of about 16000 speech samples, out of which 1000 samples are taken, with 80% pretest power for analyzing the speech steganography. AMR-WB speech steganography is performed by the Diameter Neighbor (DN) codebook partition algorithm (Group 1) and the Neighbor Index Division (NID) codebook division algorithm (Group 2). Results: The AMR-WB speech steganography using DN codebook partition obtained an average quality rate of 2.8893 and the NID codebook division algorithm obtained an average quality rate of 2.4196 in the range of 300 bps embedding capacity. Conclusion: The outcomes of this study show that the decrease in quality with NID is twice that of DN-based steganography as the embedding capacity increases.
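The codebook-partition idea behind schemes like DN and NID can be illustrated with a toy example: the codebook is split into groups, one per message symbol, and embedding quantizes each sample using only the codewords of the group matching the hidden bit. The codebook, the two-group split, and the function names below are invented for illustration and are far simpler than AMR-WB's actual codebooks.

```python
def embed_bit(sample, codebook, groups, bit):
    """Hypothetical sketch of codebook-partition steganography: hide one
    bit by quantizing `sample` with only that bit's codeword group."""
    candidates = groups[bit]
    # Pick the nearest codeword within the allowed group.
    return min(candidates, key=lambda i: abs(codebook[i] - sample))

def extract_bit(index, groups):
    """Recover the hidden bit from the transmitted codeword index."""
    return 0 if index in groups[0] else 1
```

The quality/capacity trade-off the abstract measures comes from this restriction: forcing the quantizer into half the codebook raises the average quantization error, and finer partitions (more bits per index) raise it further.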
Book ChapterDOI
12 Aug 2017
TL;DR: Experimental results showed that the proposed AFK-SVD method can improve the quality of the reconstructed speech signal by 0.8 in PESQ and by 3 to 7 dB SNR on average, and a two-level feedback filter measure is developed for removal of speech distortion caused by over-representation.
Abstract: Sparse representation is a common issue in many signal processing problems. In speech signal processing, how to sparsely represent a speech signal by dictionary learning for improving transmission efficiency has attracted considerable attention in recent years. The K-SVD algorithm for dictionary learning is a typical method, but it requires the dictionary size to be known prior to dictionary training. A suitable dictionary size can effectively avoid the problem of under-representation or over-representation, which affects the quality of the reconstructed speech significantly. To tackle this problem, an Adaptive dictionary size Feedback filtering K-SVD (AFK-SVD) approach is presented in this paper for dictionary learning. The proposed method first selects the dictionary size adaptively based on the speech signal features prior to dictionary learning, and then filters out the noise caused by over-representation. The approach has two unique features: (1) a learning model is constructed based on the training set specifically for adaptive determination of a range of the dictionary size; and (2) a two-level feedback filter measure is developed for removal of speech distortion caused by over-representation. The speech signals from the TIMIT speech data sets are used to demonstrate the presented AFK-SVD approach. Experimental results showed that, in comparison with K-SVD, the proposed AFK-SVD method can improve the quality of the reconstructed speech signal by 0.8 in PESQ and by 3 to 7 dB SNR on average.
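The baseline K-SVD step that AFK-SVD builds on can be sketched as follows. This shows only the standard single-atom update (refit one dictionary atom and its coefficients via a rank-1 SVD of the residual restricted to the signals that use it); the adaptive-size selection and two-level feedback filtering of AFK-SVD are not reproduced, and the function name is illustrative.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Sketch of one standard K-SVD atom update: Y (signals), D (dictionary,
    columns are atoms), X (sparse codes). Refits atom k and its nonzero
    coefficients by a rank-1 SVD of the residual."""
    users = np.nonzero(X[k])[0]          # signals that actually use atom k
    if users.size == 0:
        return D, X
    # Residual with atom k's contribution removed, on those signals only.
    E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                    # best rank-1 atom (unit norm)
    X[k, users] = s[0] * Vt[0]           # matching coefficients
    return D, X
```

K-SVD sweeps this update over all atoms, alternating with a sparse-coding stage; the dictionary size (number of columns of D) is the quantity AFK-SVD chooses adaptively instead of fixing in advance.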
Proceedings Article
01 Dec 2012
TL;DR: The essence of PCBA is to create a transformation strategy that makes the distribution of phoneme classes of distant noisy speech similar to those of the close-microphone acoustic model in thirteen-dimensional MFCC space (mostly in the c0-c1 plane).
Abstract: A new adaptation strategy for distant noisy speech is created by phoneme-class-based approaches for context-independent acoustic models. Unlike previous approaches such as MLLR-MAP adaptation, which adapt the acoustic model to the features, our phoneme-class based adaptation (PCBA) adapts the distant data features to our acoustic model, which was trained on close-microphone TIMIT sentences. The essence of PCBA is to create a transformation strategy that makes the distribution of phoneme classes of distant noisy speech similar to those of the close-microphone acoustic model in thirteen-dimensional MFCC space (mostly in the c0-c1 plane). It creates a mean, orientation, and variance adaptation scheme for each phoneme class to compensate for the mismatch. The adapted features and the new and improved acoustic models produced by PCBA outperform those created by MLLR-MAP adaptation for ASR and KWS. PCBA also offers a powerful new understanding of the acoustic modeling of distant speech.
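The mean-and-variance portion of such a per-class adaptation can be sketched as follows (hypothetical names; the orientation/rotation component the abstract also describes is omitted): each phoneme class's distant-microphone features are shifted and scaled so their per-class statistics match the close-microphone model's.

```python
import numpy as np

def class_mean_var_adapt(feats, labels, target_stats):
    """Sketch of a PCBA-style mean/variance adaptation. feats: (N, 13) MFCC
    frames; labels: (N,) phoneme-class ids; target_stats: class -> (mean, std)
    statistics of the close-microphone acoustic model (assumed given)."""
    out = feats.copy()
    for c, (t_mu, t_sd) in target_stats.items():
        mask = labels == c
        if not mask.any():
            continue
        mu = feats[mask].mean(axis=0)
        sd = feats[mask].std(axis=0) + 1e-8   # avoid division by zero
        # Standardize within the class, then map onto the target statistics.
        out[mask] = (feats[mask] - mu) / sd * t_sd + t_mu
    return out
```

Applying a separate transform per phoneme class is what distinguishes this from a single global CMVN-style normalization, at the cost of needing class labels (or hypotheses) for the distant data.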

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance
Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95