Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
24 Feb 2023
TL;DR: In this paper, a Densely Connected Feedforward Convolutional Network (DCFCN) model is proposed to address the overfitting of deep neural networks trained on small datasets.
Abstract: Neural networks overfit easily, and speech recognition accuracy suffers, when they are trained on small datasets in multi-person conversation scenarios. This paper proposes a Densely Connected Feedforward Convolutional Network (DCFCN) model. Deep features are extracted gradually by a densely connected compression network, and shallow and deep features are combined to mitigate the overfitting of deep neural networks trained on small datasets. A feedforward neural channel is added to feed the initial text-content information into the decoding structure, addressing the error-chain propagation caused by character prediction errors at the LAS starting position. Finally, the model is tested on synthesized multi-person conversation scenes and TIMIT speech. The results show that the DCFCN algorithm effectively mitigates overfitting, and recognition accuracy improves by 20% compared with the traditional LAS algorithm.
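As a rough illustration of the dense-connection idea described above, here is a minimal PyTorch sketch (not the authors' implementation; the layer count, channel sizes, and the DenseConvBlock name are assumptions) of a 1-D convolutional block that keeps concatenating shallow feature maps with deeper ones:

```python
# Minimal sketch of a densely connected 1-D conv block over acoustic features.
# Hypothetical sizes; the DCFCN paper's exact architecture is not reproduced.
import torch
import torch.nn as nn

class DenseConvBlock(nn.Module):
    def __init__(self, in_channels: int, growth: int = 32, layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList()
        channels = in_channels
        for _ in range(layers):
            self.convs.append(nn.Sequential(
                nn.Conv1d(channels, growth, kernel_size=3, padding=1),
                nn.BatchNorm1d(growth),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # each layer sees every earlier feature map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) acoustic feature maps
        features = [x]
        for conv in self.convs:
            out = conv(torch.cat(features, dim=1))  # reuse shallow features
            features.append(out)
        return torch.cat(features, dim=1)           # shallow + deep combined

block = DenseConvBlock(in_channels=80)
mels = torch.randn(8, 80, 200)      # a hypothetical batch of log-mel frames
print(block(mels).shape)            # torch.Size([8, 208, 200])
```

Concatenating shallow and deep features in this way lets later layers fall back on low-level evidence when the deep path overfits a small training set.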
Posted ContentDOI
15 Jun 2022
TL;DR: In this article, a gating strategy is proposed that assigns more importance to relevant audio features while suppressing irrelevant text information, together with a contrastive loss that reduces the gap between the learning objectives of phoneme recognition and MDD.
Abstract: Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. When assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have made full use of the prior text for model construction or for improving system performance, e.g. forced alignment and extended recognition networks. Recently, some end-to-end methods have attempted to incorporate the prior text into model training and have preliminarily shown its effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objective of phoneme recognition and MDD. We conducted experiments on two publicly available datasets (TIMIT and L2-Arctic), and our best model improved the F1 score from 57.51% to 61.75% compared to the baselines. We also provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning for MDD.
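A minimal sketch of the gating idea, assuming the text representation has already been aligned to the audio frames (e.g. by attention over the transcription); the GatedFusion name, dimensions, and residual form are illustrative, not the paper's implementation:

```python
# Sketch of a gate that keeps relevant text evidence and suppresses
# mismatched text when fusing it with audio features (assumed shapes).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, time, dim); text already aligned to audio frames
        g = torch.sigmoid(self.gate(torch.cat([audio, text], dim=-1)))
        # g near 1 keeps the text evidence; g near 0 suppresses mismatched text
        return audio + g * text

fusion = GatedFusion(dim=256)
audio = torch.randn(4, 120, 256)
text = torch.randn(4, 120, 256)
fused = fusion(audio, text)         # (4, 120, 256)
```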
Posted Content
TL;DR: Building on a model in which higher-order thalamic nuclei provide a locale for the temporal difference between top-down predictions and actual event outcomes, this paper hypothesizes that implicit statistical learning is driven by predictive error-driven learning.
Abstract: Infants, adults, non-human primates and non-primates all learn patterns implicitly, and they do so across modalities. The biological evidence supports the hypothesis that the mechanism for this learning is general but computationally local. We hypothesize that the mechanism itself is predictive error-driven learning. We build on recent work that advanced a biologically plausible model of error-backpropagation learning, which proposes that higher-order thalamic nuclei provide a locale for a temporal difference between top-down predictions and an actual event outcome. Our neural network, based on that work, also models the auditory cortex hierarchy of core, belt and parabelt and the caudal-rostral axis within regions. We simulated two studies showing statistical learning in infants: a seminal study using synthesized speech and a more recent study using human speech. Before simulating these studies, the network was trained on spoken sentences from the TIMIT corpus to emulate infants' experience listening to random speech. The implemented neural network, learning only by predicting the next brief speech segment, learned in both simulations to predict in-word syllables better than next-word syllables, showing that prediction could be the basis for word segmentation and thus statistical learning.
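The core training signal in the abstract, predicting the next brief speech segment and learning from the prediction error, can be sketched as follows; this toy GRU predictor is an assumption for illustration, not the authors' thalamocortical model:

```python
# Toy next-segment predictor: the prediction error is the learning signal
# the simulations above rely on (hypothetical model and sizes).
import torch
import torch.nn as nn

class NextSegmentPredictor(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(segments)   # segments: (batch, time, feat_dim)
        return self.out(h)          # prediction for the next segment

model = NextSegmentPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
speech = torch.randn(2, 50, 40)     # stand-in for TIMIT feature frames
pred = model(speech[:, :-1])        # predict frame t+1 from frames up to t
loss = nn.functional.mse_loss(pred, speech[:, 1:])
loss.backward()
opt.step()
```

Lower prediction error within words than across word boundaries is the signature the two simulations look for.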
Proceedings ArticleDOI
23 Oct 2019
TL;DR: A probabilistic model describes the probability of observing noisy phoneme-level labels given an utterance, and a training algorithm is proposed that simultaneously learns the parameters of the neural network and the mismatch model.
Abstract: The Connectionist Temporal Classification (CTC) technique can be used to train a neural-network based speech recognizer. When the technique is used to train a phoneme recognizer, the training data must be annotated with phoneme-level labels, which is not feasible for large speech databases. One approach to making use of such speech data is to convert the word-level transcriptions into phoneme-level labels, followed by CTC training. The problem with this approach is that the converted phoneme-level labels may not match the audio content of the speech data. This paper uses a probabilistic model to describe the probability of observing the noisy phoneme-level labels given an utterance. The model consists of a neural network which predicts the probability of any phoneme sequence, and a so-called mismatch model which describes the probability of one phoneme sequence being disturbed into another. Based on the Expectation-Maximization (EM) framework, we propose a training algorithm which simultaneously learns the parameters of the neural network and the mismatch model. The effectiveness of our method is verified by comparing its recognition performance with a conventional training method on the TIMIT corpus.
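For context, the baseline this method starts from, CTC training of a phoneme recognizer on labels converted from word transcriptions, can be sketched as below; the encoder, sizes, and data are placeholders, and the paper's EM-trained mismatch model is not shown:

```python
# Baseline sketch: CTC training on (possibly noisy) phoneme-level labels.
import torch
import torch.nn as nn

num_phonemes = 40                         # index 0 is the CTC blank
encoder = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, num_phonemes + 1)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)                         # (batch, frames, mels)
labels = torch.randint(1, num_phonemes + 1, (4, 30))    # converted labels
feat_lens = torch.full((4,), 200)
label_lens = torch.full((4,), 30)

h, _ = encoder(feats)
log_probs = classifier(h).log_softmax(-1).transpose(0, 1)  # (T, batch, C)
loss = ctc(log_probs, labels, feat_lens, label_lens)
loss.backward()
```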
Journal ArticleDOI
TL;DR: The vocal source and the pulse shape of the glottal flow are determined through a regularized ratio of the speech signal spectra over the open-glottis and closed-glottis intervals within each period of the fundamental tone.
Abstract: The vocal source and the pulse shape of the glottal flow are determined through a regularized ratio of the speech signal spectra over the open-glottis and closed-glottis intervals within each period of the fundamental tone. Three databases were used: Russian numerals spoken by 216 men and 177 women, a version of the Russian database passed through a 9.2 kbps codec, and the TIMIT database. The pitch period and 7 principal-component coefficients of the glottal flow give an average recognition error for male speakers below 8% on a sequence of 6 vowels. The minimum average recognition error is about 15% for female speakers on the original Russian numerals database, about 15% for male speakers on the codec-processed database, and about 44% for male speakers on TIMIT. In the space of the 7 principal-component coefficients, the minimum average error for male speakers on the Russian database is about 26%, although about 27% of the speakers have an average error below 10%.
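A rough numerical sketch of the spectral-ratio idea, with an assumed split of each pitch period into open-phase and closed-phase samples and a simple additive regularizer (the paper's exact regularization and principal-component step are not shown):

```python
# Regularized ratio of open-phase to closed-phase magnitude spectra
# within one pitch period (illustrative, not the paper's formulation).
import numpy as np

def glottal_ratio_spectrum(period: np.ndarray, open_end: int,
                           n_fft: int = 512, eps: float = 1e-3) -> np.ndarray:
    """Return |S_open| / (|S_closed| + eps) for one pitch period,
    where samples [0, open_end) are the open-glottis interval."""
    open_part = period[:open_end] * np.hanning(open_end)
    closed_part = period[open_end:] * np.hanning(len(period) - open_end)
    s_open = np.abs(np.fft.rfft(open_part, n_fft))
    s_closed = np.abs(np.fft.rfft(closed_part, n_fft))
    return s_open / (s_closed + eps)

# Synthetic pitch period at 16 kHz (~125 Hz fundamental), open phase first
period = np.random.randn(128)
ratio = glottal_ratio_spectrum(period, open_end=80)   # shape (257,)
```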

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95