Topic
TIMIT
About: TIMIT is a research topic. Over its lifetime, 1401 publications have been published within this topic, receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.
Papers published on a yearly basis
Papers
18 Jul 2022
TL;DR: In this paper, the authors propose a dynamic scene perception module (DSPM) consisting of two parts, one for dynamic scene estimation and the other for adaptive region perception, where the scene estimator uses a spectrum-energy-based attention mechanism to obtain the coefficients of each convolution kernel.
Abstract: Speech enhancement aims to recover clean speech from complex noise backgrounds. This paper proposes a novel information processing module dubbed the dynamic scene perception module (DSPM) that can help existing systems accommodate various complex scenarios. The inspiration for DSPM is the observation that different regions of the noisy spectrum in different scenarios have different enhancement requirements. Concretely, DSPM consists of two parts, one for dynamic scene estimation and the other for adaptive region perception. In particular, the scene estimator utilizes a spectrum-energy-based attention mechanism to obtain the coefficients of each convolution kernel. Then, at each position, the region perceptron chooses the corresponding kernels by considering the requirements of the current region (preserve vocals or suppress noise). Systematic evaluations on the TIMIT corpus and Voice Bank + DEMAND demonstrate the effectiveness of our method. Compared with existing systems, our proposed method achieves better performance under various SNR conditions and complex noise scenarios.
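The abstract describes attention-derived coefficients that blend several convolution kernels per spectral region, but gives no equations. The following is a hypothetical sketch of that idea: the spectrum energy of a region drives a softmax over candidate kernels, and the weighted mixture becomes the kernel applied there. The attention logits here (a simple linear function of energy) are an invented placeholder, not the paper's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_kernel(spectrum, kernels):
    """Blend K candidate convolution kernels using a
    spectrum-energy-driven softmax attention.

    spectrum: (F, T) magnitude spectrogram region
    kernels:  (K, kh, kw) candidate kernels
    Returns a single (kh, kw) kernel for this region.
    """
    # Summarize the region by its mean energy (placeholder statistic)
    energy = spectrum.mean()
    # Hypothetical attention logits: each kernel responds to energy differently
    logits = energy * np.arange(1, len(kernels) + 1)
    coeffs = np.exp(logits - logits.max())
    coeffs /= coeffs.sum()                        # softmax: coefficients sum to 1
    # Weighted combination of the K kernels
    return np.tensordot(coeffs, kernels, axes=1)

spec = rng.random((64, 32))                       # toy spectrogram region
kernels = rng.random((4, 3, 3))                   # 4 candidate 3x3 kernels
k = dynamic_kernel(spec, kernels)
print(k.shape)  # (3, 3)
```

A high-energy region would thus weight the kernels differently from a low-energy one, which is the intuition behind region-adaptive enhancement.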
16 Sep 2022
TL;DR: In this paper, the authors present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material using Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques.
Abstract: With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward to perform. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos overtaking still images. The multimedia forensic community has addressed the possible threats that this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools analyze only one modality at a time. This was not a problem as long as still images were the most widely edited media, but now that manipulated videos are becoming customary, performing monomodal analyses can be reductive. Multimodal detectors are nonetheless lacking in the literature, mainly due to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms. In this paper, we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with other state-of-the-art sets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both mono- and multimodal conditions, showing the need for multimodal forensic detectors and more suitable data.
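The pipeline above relies on Dynamic Time Warping to align synthetic TTS speech with the timing of the original track. The paper does not publish its alignment code; as a minimal illustration of the underlying technique, the sketch below computes the classic DTW accumulated-cost recursion over two 1-D feature contours (e.g., frame energies), where a low total cost indicates that one sequence can be warped onto the other.

```python
import numpy as np

def dtw_cost(x, y):
    """Dynamic Time Warping between two 1-D feature sequences.

    Fills the accumulated-cost matrix D, where each cell takes the
    local distance plus the cheapest of the three allowed predecessor
    moves (match, insertion, deletion). Returns the total alignment cost.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])           # local distance
            D[i, j] = cost + min(D[i - 1, j],         # insertion
                                 D[i, j - 1],         # deletion
                                 D[i - 1, j - 1])     # match
    return D[n, m]

# A "synthetic" energy contour that is a time-stretched copy of the
# "reference" contour: DTW absorbs the stretch, so the cost is zero.
ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
syn = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
cost = dtw_cost(ref, syn)
print(cost)  # 0.0
```

In a real TTS alignment pipeline the sequences would be MFCC or spectrogram frames and the local distance a vector norm, but the recursion is the same.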
03 Jun 2021
TL;DR: The outcomes of this study show that the loss of quality with NID is twice that of DN-based steganography as embedding capacity increases.
Abstract: Aim: The main motive of this study is to perform Adaptive Multi-Rate Wideband (AMR-WB) speech steganography in network security to produce stego speech with less loss of quality while increasing embedding capacity. Materials and Methods: The TIMIT Acoustic-Phonetic Continuous Speech Corpus dataset consists of about 16000 speech samples, of which 1000 samples are taken, with 80% pretest power, for analyzing the speech steganography. AMR-WB speech steganography is performed with the Diameter Neighbor (DN) codebook partition algorithm (Group 1) and the Neighbor Index Division (NID) codebook division algorithm (Group 2). Results: AMR-WB speech steganography using DN codebook partition obtained an average quality rate of 2.8893, and the NID codebook division algorithm obtained an average quality rate of 2.4196, at an embedding capacity of 300 bps. Conclusion: The outcomes of this study show that the loss of quality with NID is twice that of DN-based steganography as embedding capacities increase.
12 Aug 2017
TL;DR: Experimental results showed that the proposed AFK-SVD method can improve the quality of the reconstructed speech signal by 0.8 in PESQ and by 3 to 7 dB in SNR on average; a two-level feedback filter measure is also developed to remove speech distortion caused by over-representation.
Abstract: Sparse representation is a common issue in many signal processing problems. In speech signal processing, how to sparsely represent a speech signal by dictionary learning to improve transmission efficiency has attracted considerable attention in recent years. The K-SVD algorithm is a typical method for dictionary learning, but it requires the dictionary size to be known prior to training. A suitable dictionary size can effectively avoid under-representation or over-representation, which significantly affects the quality of the reconstructed speech. To tackle this problem, an Adaptive dictionary size Feedback filtering K-SVD (AFK-SVD) approach to dictionary learning is presented in this paper. The proposed method first selects the dictionary size adaptively based on features of the speech signal prior to dictionary learning, and then filters out the noise caused by over-representation. The approach has two unique features: (1) a learning model is constructed from the training set specifically for adaptive determination of a range for the dictionary size; and (2) a two-level feedback filter measure is developed to remove speech distortion caused by over-representation. Speech signals from the TIMIT speech datasets are used to demonstrate the presented AFK-SVD approach. Experimental results showed that, in comparison with K-SVD, the proposed AFK-SVD method can improve the quality of the reconstructed speech signal by 0.8 in PESQ and by 3 to 7 dB in SNR on average.
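K-SVD alternates between sparse-coding the signal against the current dictionary and updating the dictionary atoms. The sparse-coding half is typically done with Orthogonal Matching Pursuit (OMP), which the sketch below illustrates on a toy orthonormal dictionary; this is a generic illustration of sparse coding, not the paper's AFK-SVD implementation, and the identity dictionary is chosen only so that recovery is exact and easy to follow.

```python
import numpy as np

def omp(D, x, sparsity):
    """Orthogonal Matching Pursuit: approximate x as a sparse
    combination of dictionary atoms (columns of D)."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(sparsity):
        # Greedily pick the atom most correlated with the residual
        idx = int(np.argmax(np.abs(D.T @ residual)))
        support.append(idx)
        # Least-squares fit of x on the selected atoms
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs

# Toy orthonormal dictionary: 16 atoms = standard basis vectors
D = np.eye(16)
x = np.zeros(16)
x[3], x[10] = 2.0, -1.5          # a 2-sparse signal
c = omp(D, x, sparsity=2)
print(np.count_nonzero(c))       # 2
```

In K-SVD proper, the dictionary update step then refines each atom (via an SVD of the residual restricted to the signals using that atom), and AFK-SVD's contribution is choosing the number of atoms adaptively and filtering the over-representation noise.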
01 Dec 2012
TL;DR: The essence of PCBA is a transformation strategy that makes the distribution of phoneme classes of distant noisy speech similar to that of the close-microphone acoustic model in thirteen-dimensional MFCC space (mostly in the c0-c1 plane).
Abstract: A new adaptation strategy for distant noisy speech is created through phoneme-class-based approaches for context-independent acoustic models. Unlike previous approaches such as MLLR-MAP adaptation, which adapt the acoustic model to the features, our phoneme-class-based adaptation (PCBA) adapts the distant data features to our acoustic model, which was trained on close-microphone TIMIT sentences. The essence of PCBA is a transformation strategy that makes the distribution of phoneme classes of distant noisy speech similar to that of the close-microphone acoustic model in thirteen-dimensional MFCC space (mostly in the c0-c1 plane). It creates a mean, orientation, and variance adaptation scheme for each phoneme class to compensate for the mismatch. The adapted features and the new, improved acoustic models produced by PCBA outperform those created by MLLR-MAP adaptation for ASR and KWS. PCBA also offers a powerful new perspective on acoustic modeling of distant speech.
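The mean-and-variance part of a per-phoneme-class feature transformation can be sketched as follows: frames of each class are standardized with the distant-speech statistics of that class and then rescaled to the close-microphone statistics. This is a simplified illustration under assumed inputs (the paper's full PCBA also adapts orientation, which is omitted here); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def class_adapt(feats, labels, target_stats):
    """Per-phoneme-class mean/variance adaptation of MFCC frames.

    feats:        (N, 13) MFCC frames from distant noisy speech
    labels:       (N,) phoneme-class index for each frame
    target_stats: {class: (mean, std)} statistics of the
                  close-microphone acoustic model, per class
    """
    out = feats.copy()
    for c, (t_mean, t_std) in target_stats.items():
        mask = labels == c
        if not mask.any():
            continue
        mu = feats[mask].mean(axis=0)
        sd = feats[mask].std(axis=0) + 1e-8      # avoid division by zero
        # Shift and scale this class toward the close-mic statistics
        out[mask] = (feats[mask] - mu) / sd * t_std + t_mean
    return out

rng = np.random.default_rng(2)
feats = rng.normal(5.0, 2.0, size=(100, 13))     # toy "distant" frames
labels = np.zeros(100, dtype=int)                # all frames in one class
stats = {0: (np.zeros(13), np.ones(13))}         # close-mic target: N(0, 1)
adapted = class_adapt(feats, labels, stats)
print(adapted.shape)  # (100, 13)
```

After adaptation, each class's frames match the target mean and variance, which is the sense in which the distant-speech distribution is pulled toward the close-microphone model.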