Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings Article
01 Jan 2001
TL;DR: The iterative maximum-likelihood procedure employed to train both parts of the model is described, and the first unsupervised adaptation and self-adaptation results for the new model are given, showing that it outperforms standard techniques when small amounts of adaptation data are available.
Abstract: In a recent paper, we described a compact, context-dependent acoustic model incorporating strong a priori knowledge and designed to support extremely rapid speaker adaptation [9]. The two parts of this “bipartite” model are: 1. A speaker-dependent, context-independent (SDCI) part with a small number of parameters called the “eigencentroid”. 2. A speaker-independent, context-dependent (SICD) part with a large number of parameters called the “delta trees”. For the first time, we describe in the current paper the iterative maximum-likelihood procedure employed to train both parts of the model. This paper also gives the first unsupervised adaptation and self-adaptation results for the new model, showing that it outperforms standard techniques when small amounts of adaptation data (10 sec. or less of speech) are available. Relative error rate reduction (ERR) is 12.1% for supervised adaptation and 11.2% for unsupervised adaptation on three TIMIT sentences; it is 10.4% for self-adaptation on a single TIMIT sentence. Finally, the paper analyzes the correlation between sex and the SDCI part of the model, and shows how modeling of acoustic variability is affected by the explicit separation into SD and CD components.
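The abstract describes the eigencentroid as a low-dimensional speaker representation estimated by maximum likelihood from a few seconds of speech, but gives no formulas. The sketch below is a simplified illustration of that idea only: it assumes the speaker-dependent centroid is a global mean plus a weighted sum of eigenvoice basis vectors, and it estimates the weights under a unit-covariance Gaussian, where maximum likelihood reduces to least squares. The function names, array shapes, and this reduction are assumptions, not the authors' actual bipartite training procedure.

```python
# Hedged sketch of eigenvoice-style weight estimation (hypothetical names/shapes):
# the speaker-dependent centroid is modelled as mean + E @ w, and w is estimated
# from a few seconds of adaptation frames by maximum likelihood under a
# unit-covariance Gaussian, which reduces to ordinary least squares.
import numpy as np

def estimate_eigencentroid_weights(frames, global_mean, eigenvoices):
    """frames: (T, D) adaptation features; global_mean: (D,);
    eigenvoices: (K, D) basis vectors. Returns the K-dim weight vector w."""
    residual = frames.mean(axis=0) - global_mean       # average deviation from the global mean
    E = eigenvoices.T                                   # (D, K) basis matrix
    w, *_ = np.linalg.lstsq(E, residual, rcond=None)    # ML estimate under unit covariance
    return w

def adapted_centroid(global_mean, eigenvoices, w):
    """Reconstruct the speaker-dependent, context-independent centroid."""
    return global_mean + eigenvoices.T @ w
```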
Proceedings ArticleDOI
30 Jun 2022
TL;DR: In this paper, the self-supervised pre-training model wav2vec2.0 is applied to the MDD task, yielding satisfactory results.
Abstract: Mispronunciation Detection and Diagnosis (MDD) is one of the key components of a Computer Assisted Pronunciation Training (CAPT) system. Current mainstream MDD systems are built as DNN-HMM-based automatic speech recognition (ASR) systems, which require a large amount of labeled data for training. In this paper, the self-supervised pre-training model wav2vec2.0 is applied to the MDD task. Self-supervised pre-training uses a large amount of unlabeled data to learn common features, so only a small amount of labeled data is required for training in downstream applications. To utilize the prior text information, the audio features are combined with the text features through an attention mechanism, and the information from both is used in the decoding process. The experiments are conducted on the publicly available L2-Arctic and TIMIT datasets, yielding satisfactory results.
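As a rough sketch of how wav2vec2.0 features might be fused with prior text through attention, the snippet below uses a pretrained encoder from the transformers library and a single cross-attention layer. The model name, vocabulary size, the residual sum, and the CTC-style classifier head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: wav2vec2.0 acoustic features cross-attend to prior text features.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioTextFusionMDD(nn.Module):
    def __init__(self, num_phones, text_vocab=100, d_model=768):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # self-supervised acoustic encoder
        self.text_emb = nn.Embedding(text_vocab, d_model)                        # embeds the canonical (prior) phone text
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_phones + 1)                     # +1 for a CTC blank (assumed head)

    def forward(self, waveform, canonical_phone_ids):
        audio = self.encoder(waveform).last_hidden_state             # (B, T, 768) acoustic features
        text = self.text_emb(canonical_phone_ids)                    # (B, L, 768) prior text features
        fused, _ = self.cross_attn(query=audio, key=text, value=text)
        return self.classifier(audio + fused)                        # frame-level phone logits
```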
Proceedings ArticleDOI
04 Jun 2023
TL;DR: SpeakerAugment, as discussed by the authors, is a data augmentation method for generalizable speech separation that aims to increase the diversity of speaker identity in training data and thereby mitigate the speaker-mismatch component of domain mismatch.
Abstract: Existing speech separation models based on deep learning typically generalize poorly due to domain mismatch. In this paper, we propose SpeakerAugment (SA), a data augmentation method for generalizable speech separation that aims to increase the diversity of speaker identity in training data and thereby mitigate the speaker-mismatch component of domain mismatch. SA consists of two sub-policies: (1) SA-Vocoder, which uses a vocoder to manipulate the pitch and formant parameters of speakers; (2) SA-Spectrum, which directly performs pitch-shift and time-stretch on the spectrum of each speech signal. SA is simple and effective. Experimental results show that using SA can significantly improve the generalization ability of models, especially for: 1) training sets with fewer speakers, e.g., WSJ0-2mix, or 2) target test sets with complex linguistic conditions, e.g., the TIMIT-based test set. Moreover, as a data augmentation method, SA has good potential to be applicable to other speech-related tasks. We validate this by applying SA to speech recognition, and experimental results show that generalization ability is also improved.
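For illustration, a minimal version of the SA-Spectrum sub-policy could look like the following. The paper applies pitch-shift and time-stretch on the spectrum, so the waveform-domain librosa operations and the parameter ranges used here are stand-in assumptions rather than the authors' settings.

```python
# Rough illustration of the SA-Spectrum idea using waveform-domain librosa ops.
import random
import librosa

def speaker_augment(y, sr, max_semitones=3.0, stretch_range=(0.9, 1.1)):
    """Randomly pitch-shift and time-stretch one speaker's signal
    to increase apparent speaker diversity before mixing."""
    n_steps = random.uniform(-max_semitones, max_semitones)   # assumed range, in semitones
    rate = random.uniform(*stretch_range)                      # assumed time-stretch factor
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)
    return y

# Example usage (hypothetical file and sampling rate):
# y1, sr = librosa.load("spk1.wav", sr=8000)
# y1_aug = speaker_augment(y1, sr)
```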
Posted ContentDOI
29 Jul 2022
TL;DR: In this paper, a two-stage iterative framework is proposed to learn ASR from unpaired phone sequences and speech utterances: GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences, and in the second stage an HMM model is introduced and trained from the generator's output.
Abstract: ASR systems have recently been shown to achieve great performance. However, most of them rely on massive amounts of paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn ASR directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences. In the second stage, an HMM model is introduced and trained from the generator's output, which boosts performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different choices of model design. Then we compare the framework against different types of baselines: (i) supervised methods, (ii) acoustic unit discovery based methods, and (iii) methods learning from unpaired data. On the TIMIT dataset, our framework performs consistently better than all acoustic unit discovery methods and previous methods learning from unpaired data.
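The following is a very compressed sketch of the first (GAN) stage only, where a generator maps speech features to phone posteriors and a discriminator scores them against unpaired real phone sequences. The network sizes, the one-hot encoding of real phone sequences, and the plain GAN loss are assumptions for illustration; the HMM second stage and the iteration are omitted.

```python
# Hedged sketch of the GAN stage for unpaired speech-to-phone mapping.
import torch
import torch.nn as nn

N_PHONES, D_FEAT = 48, 39   # assumed TIMIT phone set size and feature dimension

generator = nn.Sequential(             # speech features -> per-frame phone posteriors
    nn.Linear(D_FEAT, 256), nn.ReLU(),
    nn.Linear(256, N_PHONES), nn.Softmax(dim=-1))

discriminator = nn.Sequential(         # phone-posterior sequences -> real/fake logit
    nn.Conv1d(N_PHONES, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 1))

def gan_step(speech_feats, real_phone_onehot, opt_g, opt_d, bce=nn.BCEWithLogitsLoss()):
    """speech_feats: (B, T, D_FEAT) features; real_phone_onehot: (B, T, N_PHONES)
    one-hot sequences drawn from the unpaired phone text."""
    fake = generator(speech_feats)                                    # (B, T, N_PHONES)
    # discriminator update: real phone sequences vs. generated posteriors
    d_real = discriminator(real_phone_onehot.transpose(1, 2))
    d_fake = discriminator(fake.detach().transpose(1, 2))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator update: fool the discriminator
    d_fake = discriminator(fake.transpose(1, 2))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```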
Proceedings ArticleDOI
26 Nov 2022
TL;DR: Wang et al., as discussed by the authors, proposed an encoder-decoder-based convolutional neural network for audio-visual speech enhancement with deep multi-modality fusion, which uses temporal attention to align multi-modality features selectively and preserves temporal correlation through linear interpolation.
Abstract: In daily interactions, human speech perception is inherently a multi-modality process. Audio-visual speech enhancement (AV-SE) aims to aid speech enhancement with the help of visual information. However, the fusion strategy of most AV-SE approaches is too simple, resulting in the dominance of the audio modality. The visual modality is usually ignored, especially when the signal-to-noise ratio (SNR) is medium or high. This paper proposes an encoder-decoder-based convolutional neural network for AV-SE with deep multi-modality fusion. The deep multi-modality fusion uses temporal attention to align multi-modality features selectively and preserves the temporal correlation by linear interpolation. The novel fusion strategy can take full advantage of video features, leading to a balanced multi-modality representation. To further improve the performance of AV-SE, a mixed deep feature loss is introduced. Two neural networks are applied to model the characteristics of speech and noise signals, respectively. The experiment conducted on NTCD-TIMIT demonstrates the effectiveness of our proposed model. Compared to the audio-only baseline and simple fusion approaches, our model achieves better performance in objective metrics under all SNR conditions.
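A minimal sketch of the fusion step described above, assuming hypothetical feature dimensions: video features are upsampled to the audio frame rate by linear interpolation and then fused with the audio stream through temporal (cross-) attention. The layer sizes and the residual sum are assumptions, not the paper's exact design.

```python
# Illustrative audio-visual fusion: interpolate video to audio frame rate, then cross-attend.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusion(nn.Module):
    def __init__(self, d_audio=256, d_video=512, d_model=256, heads=4):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, a_feats, v_feats):
        """a_feats: (B, Ta, d_audio) audio frames; v_feats: (B, Tv, d_video) video frames."""
        # Linear interpolation preserves temporal correlation while matching frame rates.
        v_up = F.interpolate(v_feats.transpose(1, 2), size=a_feats.size(1),
                             mode="linear", align_corners=False).transpose(1, 2)
        q = self.proj_a(a_feats)
        kv = self.proj_v(v_up)
        fused, _ = self.attn(q, kv, kv)      # temporal attention selects useful video frames
        return q + fused                      # balanced audio-visual representation
```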

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95