Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings Article
01 Jan 2001
TL;DR: The iterative maximum-likelihood procedure employed to train both parts of the model is described, and the first unsupervised adaptation and self-adaptation results for the new model are given, showing that it outperforms standard techniques when small amounts of adaptation data are available.
Abstract: In a recent paper, we described a compact, context-dependent acoustic model incorporating strong a priori knowledge and designed to support extremely rapid speaker adaptation [9]. The two parts of this “bipartite” model are: 1. A speaker-dependent, context-independent (SDCI) part with a small number of parameters called the “eigencentroid”. 2. A speaker-independent, context-dependent (SICD) part with a large number of parameters called the “delta trees”. For the first time, we describe in the current paper the iterative maximum-likelihood procedure employed to train both parts of the model. This paper also gives the first unsupervised adaptation and self-adaptation results for the new model, showing that it outperforms standard techniques when small amounts of adaptation data (10 sec. or less of speech) are available. Relative error rate reduction (ERR) is 12.1% for supervised adaptation and 11.2% for unsupervised adaptation on three TIMIT sentences; it is 10.4% for self-adaptation on a single TIMIT sentence. Finally, the paper analyzes the correlation between sex and the SDCI part of the model, and shows how modeling of acoustic variability is affected by the explicit separation into SD and CD components.
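The abstract describes the eigencentroid as a low-dimensional speaker representation estimated by maximum likelihood from a few seconds of speech, but gives no formulas. The sketch below is a simplified illustration of that idea only: it assumes the speaker-dependent centroid is a global mean plus a weighted sum of eigenvoice basis vectors, and it estimates the weights under a unit-covariance Gaussian, where maximum likelihood reduces to least squares. The function names, array shapes, and this reduction are assumptions, not the authors' actual bipartite training procedure.

```python
# Hedged sketch of eigenvoice-style weight estimation (hypothetical names/shapes):
# the speaker-dependent centroid is modelled as mean + E @ w, and w is estimated
# from a few seconds of adaptation frames by maximum likelihood under a
# unit-covariance Gaussian, which reduces to ordinary least squares.
import numpy as np

def estimate_eigencentroid_weights(frames, global_mean, eigenvoices):
    """frames: (T, D) adaptation features; global_mean: (D,);
    eigenvoices: (K, D) basis vectors. Returns the K-dim weight vector w."""
    residual = frames.mean(axis=0) - global_mean       # average deviation from the global mean
    E = eigenvoices.T                                   # (D, K) basis matrix
    w, *_ = np.linalg.lstsq(E, residual, rcond=None)    # ML estimate under unit covariance
    return w

def adapted_centroid(global_mean, eigenvoices, w):
    """Reconstruct the speaker-dependent, context-independent centroid."""
    return global_mean + eigenvoices.T @ w
```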
Proceedings ArticleDOI
30 Jun 2022
TL;DR: In this paper, the self-supervised pre-training model wav2vec2.0 is applied to the MDD task, yielding satisfactory results.
Abstract: Mispronunciation Detection and Diagnosis (MDD) is one of the key components of a Computer Assisted Pronunciation Training (CAPT) system. Current mainstream MDD systems are built as DNN-HMM-based automatic speech recognition (ASR) systems, which require a large amount of labeled data for training. In this paper, the self-supervised pre-training model wav2vec2.0 is applied to the MDD task. Self-supervised pre-training uses a large amount of unlabeled data to learn common features, so only a small amount of labeled data is required for training in downstream applications. To utilize the prior text information, the audio features are combined with the text features through an attention mechanism, and the information from both is used in the decoding process. The experiments are conducted on the publicly available L2-Arctic and TIMIT datasets, yielding satisfactory results.
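As a rough sketch of how wav2vec2.0 features might be fused with prior text through attention, the snippet below uses a pretrained encoder from the transformers library and a single cross-attention layer. The model name, vocabulary size, the residual sum, and the CTC-style classifier head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: wav2vec2.0 acoustic features cross-attend to prior text features.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioTextFusionMDD(nn.Module):
    def __init__(self, num_phones, text_vocab=100, d_model=768):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # self-supervised acoustic encoder
        self.text_emb = nn.Embedding(text_vocab, d_model)                        # embeds the canonical (prior) phone text
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_phones + 1)                     # +1 for a CTC blank (assumed head)

    def forward(self, waveform, canonical_phone_ids):
        audio = self.encoder(waveform).last_hidden_state             # (B, T, 768) acoustic features
        text = self.text_emb(canonical_phone_ids)                    # (B, L, 768) prior text features
        fused, _ = self.cross_attn(query=audio, key=text, value=text)
        return self.classifier(audio + fused)                        # frame-level phone logits
```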
Proceedings ArticleDOI
04 Jun 2023
TL;DR: SpeakerAugment, as discussed by the authors, is a data augmentation method for generalizable speech separation that aims to increase the diversity of speaker identity in training data and thereby mitigate the speaker-mismatch component of domain mismatch.
Abstract: Existing speech separation models based on deep learning typically generalize poorly due to domain mismatch. In this paper, we propose SpeakerAugment (SA), a data augmentation method for generalizable speech separation that aims to increase the diversity of speaker identity in training data and thereby mitigate the speaker-mismatch component of domain mismatch. SA consists of two sub-policies: (1) SA-Vocoder, which uses a vocoder to manipulate the pitch and formant parameters of speakers; (2) SA-Spectrum, which directly performs pitch-shift and time-stretch on the spectrum of each speech signal. SA is simple and effective. Experimental results show that using SA can significantly improve the generalization ability of models, especially for: 1) training sets with fewer speakers, e.g., WSJ0-2mix, or 2) target test sets with complex linguistic conditions, e.g., the TIMIT-based test set. Moreover, as a data augmentation method, SA has good potential to be applicable to other speech-related tasks. We validate this by applying SA to speech recognition, and experimental results show that generalization ability is also improved.
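For illustration, a minimal version of the SA-Spectrum sub-policy could look like the following. The paper applies pitch-shift and time-stretch on the spectrum, so the waveform-domain librosa operations and the parameter ranges used here are stand-in assumptions rather than the authors' settings.

```python
# Rough illustration of the SA-Spectrum idea using waveform-domain librosa ops.
import random
import librosa

def speaker_augment(y, sr, max_semitones=3.0, stretch_range=(0.9, 1.1)):
    """Randomly pitch-shift and time-stretch one speaker's signal
    to increase apparent speaker diversity before mixing."""
    n_steps = random.uniform(-max_semitones, max_semitones)   # assumed range, in semitones
    rate = random.uniform(*stretch_range)                      # assumed time-stretch factor
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)
    return y

# Example usage (hypothetical file and sampling rate):
# y1, sr = librosa.load("spk1.wav", sr=8000)
# y1_aug = speaker_augment(y1, sr)
```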
Posted ContentDOI
29 Jul 2022
TL;DR: In this paper, a two-stage iterative framework is proposed to learn ASR from unpaired phone sequences and speech utterances: GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences, and in the second stage an HMM model is introduced and trained from the generator's output.
Abstract: ASR systems have recently been shown to achieve great performance. However, most of them rely on massive amounts of paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn ASR directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences. In the second stage, an HMM model is introduced and trained from the generator's output, which boosts performance and provides a better segmentation for the next iteration. In the experiments, we first investigate different choices of model design. Then we compare the framework against different types of baselines: (i) supervised methods, (ii) acoustic unit discovery based methods, and (iii) methods learning from unpaired data. On the TIMIT dataset, our framework performs consistently better than all acoustic unit discovery methods and previous methods learning from unpaired data.
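The following is a very compressed sketch of the first (GAN) stage only, where a generator maps speech features to phone posteriors and a discriminator scores them against unpaired real phone sequences. The network sizes, the one-hot encoding of real phone sequences, and the plain GAN loss are assumptions for illustration; the HMM second stage and the iteration are omitted.

```python
# Hedged sketch of the GAN stage for unpaired speech-to-phone mapping.
import torch
import torch.nn as nn

N_PHONES, D_FEAT = 48, 39   # assumed TIMIT phone set size and feature dimension

generator = nn.Sequential(             # speech features -> per-frame phone posteriors
    nn.Linear(D_FEAT, 256), nn.ReLU(),
    nn.Linear(256, N_PHONES), nn.Softmax(dim=-1))

discriminator = nn.Sequential(         # phone-posterior sequences -> real/fake logit
    nn.Conv1d(N_PHONES, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 1))

def gan_step(speech_feats, real_phone_onehot, opt_g, opt_d, bce=nn.BCEWithLogitsLoss()):
    """speech_feats: (B, T, D_FEAT) features; real_phone_onehot: (B, T, N_PHONES)
    one-hot sequences drawn from the unpaired phone text."""
    fake = generator(speech_feats)                                    # (B, T, N_PHONES)
    # discriminator update: real phone sequences vs. generated posteriors
    d_real = discriminator(real_phone_onehot.transpose(1, 2))
    d_fake = discriminator(fake.detach().transpose(1, 2))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator update: fool the discriminator
    d_fake = discriminator(fake.transpose(1, 2))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```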
Proceedings ArticleDOI
26 Nov 2022
TL;DR: Wang et al., as discussed by the authors, proposed an encoder-decoder-based convolutional neural network for audio-visual speech enhancement with deep multi-modality fusion, which uses temporal attention to align multi-modality features selectively and preserves temporal correlation through linear interpolation.
Abstract: In daily interactions, human speech perception is inherently a multi-modality process. Audio-visual speech enhancement (AV-SE) aims to aid speech enhancement with the help of visual information. However, the fusion strategy of most AV-SE approaches is too simple, resulting in the dominance of the audio modality. The visual modality is usually ignored, especially when the signal-to-noise ratio (SNR) is medium or high. This paper proposes an encoder-decoder-based convolutional neural network for AV-SE with deep multi-modality fusion. The deep multi-modality fusion uses temporal attention to align multi-modality features selectively and preserves the temporal correlation by linear interpolation. The novel fusion strategy can take full advantage of video features, leading to a balanced multi-modality representation. To further improve the performance of AV-SE, a mixed deep feature loss is introduced. Two neural networks are applied to model the characteristics of speech and noise signals, respectively. The experiment conducted on NTCD-TIMIT demonstrates the effectiveness of our proposed model. Compared to the audio-only baseline and simple fusion approaches, our model achieves better performance in objective metrics under all SNR conditions.
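A minimal sketch of the fusion step described above, assuming hypothetical feature dimensions: video features are upsampled to the audio frame rate by linear interpolation and then fused with the audio stream through temporal (cross-) attention. The layer sizes and the residual sum are assumptions, not the paper's exact design.

```python
# Illustrative audio-visual fusion: interpolate video to audio frame rate, then cross-attend.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusion(nn.Module):
    def __init__(self, d_audio=256, d_video=512, d_model=256, heads=4):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, a_feats, v_feats):
        """a_feats: (B, Ta, d_audio) audio frames; v_feats: (B, Tv, d_video) video frames."""
        # Linear interpolation preserves temporal correlation while matching frame rates.
        v_up = F.interpolate(v_feats.transpose(1, 2), size=a_feats.size(1),
                             mode="linear", align_corners=False).transpose(1, 2)
        q = self.proj_a(a_feats)
        kv = self.proj_v(v_up)
        fused, _ = self.attn(q, kv, kv)      # temporal attention selects useful video frames
        return q + fused                      # balanced audio-visual representation
```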

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95