Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
Hang Chen, Jun Du, Bao-Cai Yin, Jia Pan, Chin-Hui Lee 
04 Jun 2023
TL;DR: In this article, a mask-based audio-visual progressive learning speech enhancement (AVPL) framework with visual information reconstruction (VIR) is proposed to increase SNRs gradually: each stage of AVPL takes a concatenation of a pre-trained visual embedding and the previous stage's representation as input and predicts a mask from the intermediate representation of the current stage.
Abstract: Video information has been widely introduced into speech enhancement because of its contribution at low signal-to-noise ratios (SNRs). Conventional audio-visual speech enhancement networks take noisy speech and video as input and learn features of clean speech directly. To reduce the large SNR gap between the learning target and the input noisy speech, we propose a novel mask-based audio-visual progressive learning speech enhancement (AVPL) framework with visual information reconstruction (VIR) that increases SNRs gradually. Each stage of AVPL takes a concatenation of a pre-trained visual embedding and the previous stage's representation as input and predicts a mask from the intermediate representation of the current stage. To extract more visual information and counter performance distortion, the AVPL-VIR model reconstructs the visual embedding fed into each stage. Experiments on the TCD-TIMIT dataset show that the progressive learning method significantly outperforms direct learning for both audio-only and audio-visual models. Moreover, by reconstructing video information, the VIR module provides a more accurate and comprehensive representation of the data, which in turn improves the performance of both AVDL and AVPL.
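To make the staged masking concrete, here is a minimal sketch of a single progressive-learning stage, assuming PyTorch; the dimensions and layer choices are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of one AVPL stage; all sizes are assumptions.
import torch
import torch.nn as nn

class AVPLStage(nn.Module):
    """One progressive-learning stage: concatenate a pre-trained visual
    embedding with the previous stage's representation and predict a
    time-frequency mask for an intermediate (higher-SNR) target."""
    def __init__(self, audio_dim=257, visual_dim=128, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(
            nn.Linear(hidden_dim, audio_dim),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, prev_repr, visual_emb):
        # prev_repr: (batch, frames, audio_dim)
        # visual_emb: (batch, frames, visual_dim)
        h = self.net(torch.cat([prev_repr, visual_emb], dim=-1))
        mask = self.mask_head(h)
        # Return the masked (enhanced) representation for the next
        # stage, plus the intermediate features of this stage.
        return mask * prev_repr, h
```

Several such stages would be chained, each trained toward a progressively less noisy target; the VIR variant would additionally reconstruct the visual embedding from each stage's features.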
Posted Content
TL;DR: This article proposes a data-driven pronunciation estimation and acoustic modeling method that takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary, and shows that the method, based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.
Abstract: Phonemic or phonetic sub-word units are the most commonly used atomic elements to represent speech signals in modern ASR systems. However, they are not the optimal choice for several reasons: the large amount of effort required to handcraft a pronunciation dictionary, pronunciation variations, human mistakes, and under-resourced dialects and languages. Here, we propose a data-driven pronunciation estimation and acoustic modeling method that takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary. Experimental results show that the proposed method, which is based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.
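The abstract does not spell out the joint estimation procedure. As a loose, hypothetical illustration of the general idea of data-driven sub-word units, the sketch below clusters acoustic frames into pseudo-units that could then serve as DNN training targets; the clustering choice, function names, and unit count are all assumptions, not the paper's method.

```python
# Loose sketch: derive pseudo sub-word units by clustering acoustic
# frames; these labels could bootstrap semi-supervised DNN training.
import numpy as np
from sklearn.cluster import KMeans

def discover_subword_units(frame_features, n_units=64):
    """frame_features: (n_frames, feat_dim) array of acoustic features
    (e.g., MFCCs). Returns one pseudo sub-word unit label per frame."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0)
    return km.fit_predict(frame_features)

# In a full system, the unit inventory and the word-to-unit dictionary
# would be re-estimated jointly over training iterations.
```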
Proceedings ArticleDOI
27 Jul 2021
TL;DR: In this article, the authors propose to incorporate features derived from the analytic phase of speech signals into speech recognition with recurrent neural networks (RNNs) and their variants.
Abstract: Recurrent neural networks (RNNs) and their variants have achieved significant success in speech recognition. Long short-term memory (LSTM) and gated recurrent units (GRUs) are the two most popular variants, which overcome the vanishing gradient problem of RNNs and effectively learn long-term dependencies. Light gated recurrent units (Li-GRUs) are more compact versions of standard GRUs and have been shown to provide better recognition accuracy with significantly faster training. These RNN-inspired architectures invariably use magnitude-based features, and the phase information is generally ignored. We propose to incorporate features derived from the analytic phase of the speech signals for speech recognition using these RNN variants. Instantaneous frequency filter-bank (IFFB) features derived from Fourier-transform relations performed on par with standard MFCC features for recurrent-unit-based acoustic models despite being derived from phase information only. System combinations of IFFB features with magnitude-based features provided the lowest PER of 12.9%, a relative improvement of up to 16.8% over standalone MFCC features on TIMIT phone recognition with a Li-GRU based architecture. IFFB features significantly outperformed modified group delay coefficient (MGDC) features in all our experiments.
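As a rough illustration of phase-derived features, the sketch below computes a per-band mean instantaneous frequency from the analytic signal obtained with a Hilbert transform, assuming NumPy/SciPy. Note the paper derives its IFFB features from Fourier-transform relations instead, and the band edges and frame parameters here are assumptions.

```python
# Sketch: instantaneous-frequency features from per-band analytic
# signals (Hilbert transform), pooled into frames.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def inst_freq_features(x, fs, band_edges, frame_len=400, hop=160):
    """Mean instantaneous frequency per band and frame.
    x: waveform; band_edges: list of (low_hz, high_hz) sub-bands."""
    feats = []
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        # Instantaneous frequency = derivative of the unwrapped phase.
        phase = np.unwrap(np.angle(hilbert(band)))
        inst_f = np.diff(phase) * fs / (2 * np.pi)   # Hz per sample
        frames = [inst_f[i:i + frame_len].mean()
                  for i in range(0, len(inst_f) - frame_len, hop)]
        feats.append(frames)
    return np.array(feats).T                          # (n_frames, n_bands)

# Example call at 16 kHz with three assumed sub-bands:
# f = inst_freq_features(x, 16000, [(100, 400), (400, 1000), (1000, 3000)])
```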
Journal ArticleDOI
TL;DR: In this paper, a two-stage approach is proposed for accurate detection of vowel onset points (VOPs): the first stage detects VOPs using continuous wavelet transform coefficients, and the positions of the detected VOPs are corrected using phone boundaries in the second stage.
Abstract: In this paper, we propose a novel approach for accurate detection of vowel onset points (VOPs). A VOP is the instant at which a vowel begins in a speech signal. Precise identification of VOPs is important for speech applications such as speech segmentation and speech rate modification. Existing methods detect the majority of VOPs to an accuracy of 40 ms deviation, which may not be adequate for these applications. To address this issue, we propose a two-stage approach for accurate detection of VOPs. In the first stage, VOPs are detected using continuous wavelet transform coefficients; in the second stage, the positions of the detected VOPs are corrected using phone boundaries, which are detected by the spectral transition measure method. Experiments are conducted using the TIMIT and Bengali speech corpora. Performance of the proposed approach is compared with two standard signal-processing-based methods as well as with a recent VOP detection technique. The evaluation results show that the proposed method performs better than the existing methods.
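A hypothetical sketch of the first stage follows, assuming PyWavelets and SciPy: candidate onsets are taken as peaks in the rise of pooled continuous-wavelet energy. The wavelet, scales, smoothing window, and thresholds are all assumptions; the paper's second stage, which snaps detections to spectral-transition-measure phone boundaries, is not shown.

```python
# Sketch: first-stage VOP candidates from continuous wavelet energy.
import numpy as np
import pywt
from scipy.signal import find_peaks

def detect_vop_candidates(x, fs, scales=(8, 16, 32), min_gap_s=0.05):
    """Return candidate vowel-onset sample indices in waveform x."""
    coefs, _ = pywt.cwt(x, scales=np.array(scales), wavelet="morl")
    evidence = np.abs(coefs).sum(axis=0)          # pooled wavelet energy
    win = max(int(0.01 * fs), 1)                  # ~10 ms smoothing
    evidence = np.convolve(evidence, np.ones(win) / win, mode="same")
    # Keep only rising energy: onsets show up as positive derivatives.
    onset = np.maximum(np.diff(evidence, prepend=evidence[0]), 0.0)
    peaks, _ = find_peaks(onset, distance=int(min_gap_s * fs),
                          height=onset.max() * 0.2)
    return peaks
```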
Journal ArticleDOI
TL;DR: In this article, a cooperative structure of deep autoencoders (DAEs) as generative models and a deep neural network (DNN) is proposed for speech enhancement, achieving an average perceptual evaluation of speech quality (PESQ) improvement of up to about 0.3 on the TIMIT dataset.
Abstract: In this paper, we present a new supervised speech enhancement approach based on a cooperative structure of deep autoencoders (DAEs) as generative models and a deep neural network (DNN). The DAEs serve as a nonlinear alternative to nonnegative matrix factorization (NMF) for extracting harmonic structures and encoded features of the noise, clean, and noisy signals, and the DNN is deployed as a nonlinear mapper. We introduce a deep network that imitates NMF in a nonlinear manner to overcome the problems of a simple linear model, such as performance degradation in non-stationary environments. In contrast to combinations of NMF and DNN methods, we perform all of the decomposition, enhancement, and reconstruction in a nonlinear framework via a cooperative structure of encoder, DNN, and decoders, and jointly optimize them. We also propose a supervised hierarchical multi-target training approach, performed in two steps, in which the DNN not only predicts the low-level encoded features as primary targets but also predicts the high-level actual spectral signals as secondary targets; the first step acts as pretraining for the second and improves the learning strategy. Moreover, to exploit a more discriminative model for noise reduction, a DNN-based noise classification and fusion strategy (NCF) is proposed. Experiments on the TIMIT dataset reveal that the proposed methods outperform previous approaches and achieve an average perceptual evaluation of speech quality (PESQ) improvement of up to about 0.3 for speech enhancement.
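To illustrate the overall encoder-mapper-decoder flow and the two-level targets, here is a simplified sketch assuming PyTorch; the layer sizes and structure are invented for illustration and omit the noise-side DAE and the NCF strategy.

```python
# Simplified sketch of the cooperative encoder / DNN mapper / decoder
# structure with hierarchical (encoded + spectral) targets.
import torch
import torch.nn as nn

class CooperativeEnhancer(nn.Module):
    def __init__(self, spec_dim=257, code_dim=64, hidden=512):
        super().__init__()
        # Encoder of the noisy-speech DAE (assumed pre-trained).
        self.noisy_encoder = nn.Sequential(
            nn.Linear(spec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim))
        # DNN mapper: noisy code -> clean code (low-level target).
        self.mapper = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim))
        # Decoder of the clean-speech DAE (high-level target).
        self.clean_decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, spec_dim))

    def forward(self, noisy_spec):
        code = self.mapper(self.noisy_encoder(noisy_spec))
        clean_spec = self.clean_decoder(code)
        return code, clean_spec

# Training would combine losses on both targets, e.g.
# loss = mse(code, clean_code_target) + mse(clean_spec, clean_spec_target),
# matching the two-step hierarchical multi-target idea in spirit.
```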

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95