Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
24 Feb 2023
TL;DR: In this paper, a Densely Connected Feedforward Convolutional Network (DCFCN) model is proposed to address the overfitting of deep neural networks trained on small datasets.
Abstract: Neural networks overfit easily, and speech recognition accuracy suffers, when they are trained on small datasets in multi-person conversation scenarios. This paper proposes a Densely Connected Feedforward Convolutional Network (DCFCN) model. Deep features are extracted gradually by a densely connected compression network, and shallow and deep features are combined to mitigate the overfitting of deep neural networks trained on small datasets. A feedforward neural channel is added to feed the initial text-content information into the decoding structure, addressing the error-chain propagation caused by character prediction errors at the LAS starting position. Finally, the model is tested on synthesized multi-person conversation scenes and TIMIT speech. The results show that the DCFCN algorithm effectively mitigates overfitting, and recognition accuracy improves by 20% compared with the traditional LAS algorithm.
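As a rough illustration of the dense-connection idea described above, here is a minimal PyTorch sketch (not the authors' implementation; the layer count, channel sizes, and the DenseConvBlock name are assumptions) of a 1-D convolutional block that keeps concatenating shallow feature maps with deeper ones:

```python
# Minimal sketch of a densely connected 1-D conv block over acoustic features.
# Hypothetical sizes; the DCFCN paper's exact architecture is not reproduced.
import torch
import torch.nn as nn

class DenseConvBlock(nn.Module):
    def __init__(self, in_channels: int, growth: int = 32, layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList()
        channels = in_channels
        for _ in range(layers):
            self.convs.append(nn.Sequential(
                nn.Conv1d(channels, growth, kernel_size=3, padding=1),
                nn.BatchNorm1d(growth),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # each layer sees every earlier feature map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) acoustic feature maps
        features = [x]
        for conv in self.convs:
            out = conv(torch.cat(features, dim=1))  # reuse shallow features
            features.append(out)
        return torch.cat(features, dim=1)           # shallow + deep combined

block = DenseConvBlock(in_channels=80)
mels = torch.randn(8, 80, 200)      # a hypothetical batch of log-mel frames
print(block(mels).shape)            # torch.Size([8, 208, 200])
```

Concatenating shallow and deep features in this way lets later layers fall back on low-level evidence when the deep path overfits a small training set.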
Posted ContentDOI
15 Jun 2022
TL;DR: In this article, a gating strategy is proposed that assigns more importance to relevant audio features while suppressing irrelevant text information, together with a contrastive loss that reduces the gap between the learning objectives of phoneme recognition and MDD.
Abstract: Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. When assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have made full use of the prior text for model construction or for improving system performance, e.g. forced alignment and extended recognition networks. Recently, some end-to-end methods have attempted to incorporate the prior text into model training and have preliminarily shown its effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objective of phoneme recognition and MDD. We conducted experiments on two publicly available datasets (TIMIT and L2-Arctic), and our best model improved the F1 score from 57.51% to 61.75% compared to the baselines. We also provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning for MDD.
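A minimal sketch of the gating idea, assuming the text representation has already been aligned to the audio frames (e.g. by attention over the transcription); the GatedFusion name, dimensions, and residual form are illustrative, not the paper's implementation:

```python
# Sketch of a gate that keeps relevant text evidence and suppresses
# mismatched text when fusing it with audio features (assumed shapes).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, time, dim); text already aligned to audio frames
        g = torch.sigmoid(self.gate(torch.cat([audio, text], dim=-1)))
        # g near 1 keeps the text evidence; g near 0 suppresses mismatched text
        return audio + g * text

fusion = GatedFusion(dim=256)
audio = torch.randn(4, 120, 256)
text = torch.randn(4, 120, 256)
fused = fusion(audio, text)         # (4, 120, 256)
```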
Posted Content
TL;DR: Building on a model in which higher-order thalamic nuclei provide a locale for the temporal difference between top-down predictions and actual event outcomes, this paper hypothesizes that implicit statistical learning is driven by predictive error-driven learning.
Abstract: Infants, adults, non-human primates and non-primates all learn patterns implicitly, and they do so across modalities. The biological evidence supports the hypothesis that the mechanism for this learning is general but computationally local. We hypothesize that the mechanism itself is predictive error-driven learning. We build on recent work that advanced a biologically plausible model of error-backpropagation learning, which proposes that higher-order thalamic nuclei provide a locale for a temporal difference between top-down predictions and an actual event outcome. Our neural network, based on that work, also models the auditory cortex hierarchy of core, belt and parabelt and the caudal-rostral axis within regions. We simulated two studies showing statistical learning in infants: a seminal study using synthesized speech and a more recent study using human speech. Before simulating these studies, the network was trained on spoken sentences from the TIMIT corpus to emulate infants' experience listening to random speech. The implemented neural network, learning only by predicting the next brief speech segment, learned in both simulations to predict in-word syllables better than next-word syllables, showing that prediction could be the basis for word segmentation and thus statistical learning.
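The core training signal in the abstract, predicting the next brief speech segment and learning from the prediction error, can be sketched as follows; this toy GRU predictor is an assumption for illustration, not the authors' thalamocortical model:

```python
# Toy next-segment predictor: the prediction error is the learning signal
# the simulations above rely on (hypothetical model and sizes).
import torch
import torch.nn as nn

class NextSegmentPredictor(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(segments)   # segments: (batch, time, feat_dim)
        return self.out(h)          # prediction for the next segment

model = NextSegmentPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
speech = torch.randn(2, 50, 40)     # stand-in for TIMIT feature frames
pred = model(speech[:, :-1])        # predict frame t+1 from frames up to t
loss = nn.functional.mse_loss(pred, speech[:, 1:])
loss.backward()
opt.step()
```

Lower prediction error within words than across word boundaries is the signature the two simulations look for.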
Proceedings ArticleDOI
23 Oct 2019
TL;DR: A probabilistic model describes the probability of observing noisy phoneme-level labels given an utterance, and a training algorithm is proposed that simultaneously learns the parameters of the neural network and the mismatch model.
Abstract: The Connectionist Temporal Classification (CTC) technique can be used to train a neural-network based speech recognizer. When the technique is used to train a phoneme recognizer, the training data must be annotated with phoneme-level labels, which is not feasible for large speech databases. One approach to making use of such speech data is to convert the word-level transcriptions into phoneme-level labels, followed by CTC training. The problem with this approach is that the converted phoneme-level labels may not match the audio content of the speech data. This paper uses a probabilistic model to describe the probability of observing the noisy phoneme-level labels given an utterance. The model consists of a neural network which predicts the probability of any phoneme sequence, and a so-called mismatch model which describes the probability of one phoneme sequence being disturbed into another. Based on the Expectation-Maximization (EM) framework, we propose a training algorithm which simultaneously learns the parameters of the neural network and the mismatch model. The effectiveness of our method is verified by comparing its recognition performance with a conventional training method on the TIMIT corpus.
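For context, the baseline this method starts from, CTC training of a phoneme recognizer on labels converted from word transcriptions, can be sketched as below; the encoder, sizes, and data are placeholders, and the paper's EM-trained mismatch model is not shown:

```python
# Baseline sketch: CTC training on (possibly noisy) phoneme-level labels.
import torch
import torch.nn as nn

num_phonemes = 40                         # index 0 is the CTC blank
encoder = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, num_phonemes + 1)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)                         # (batch, frames, mels)
labels = torch.randint(1, num_phonemes + 1, (4, 30))    # converted labels
feat_lens = torch.full((4,), 200)
label_lens = torch.full((4,), 30)

h, _ = encoder(feats)
log_probs = classifier(h).log_softmax(-1).transpose(0, 1)  # (T, batch, C)
loss = ctc(log_probs, labels, feat_lens, label_lens)
loss.backward()
```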
Journal ArticleDOI
TL;DR: The vocal source and the pulse shape of the glottal flow are determined through a regularized ratio of the speech signal spectra over the open-glottis and closed-glottis intervals within each period of the fundamental tone.
Abstract: The vocal source and the pulse shape of the glottal flow are determined through a regularized ratio of the speech signal spectra over the open-glottis and closed-glottis intervals within each period of the fundamental tone. Three databases were used: Russian numerals spoken by 216 men and 177 women, a version of the Russian database passed through a 9.2 kbps codec, and the TIMIT database. The pitch period and 7 principal-component coefficients of the glottal flow give an average recognition error for male speakers below 8% on a sequence of 6 vowels. The minimum average recognition error is about 15% for female speakers on the original Russian numerals database, about 15% for male speakers on the codec-processed database, and about 44% for male speakers on TIMIT. In the space of the 7 principal-component coefficients, the minimum average error for male speakers on the Russian database is about 26%, although about 27% of the speakers have an average error below 10%.
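A rough numerical sketch of the spectral-ratio idea, with an assumed split of each pitch period into open-phase and closed-phase samples and a simple additive regularizer (the paper's exact regularization and principal-component step are not shown):

```python
# Regularized ratio of open-phase to closed-phase magnitude spectra
# within one pitch period (illustrative, not the paper's formulation).
import numpy as np

def glottal_ratio_spectrum(period: np.ndarray, open_end: int,
                           n_fft: int = 512, eps: float = 1e-3) -> np.ndarray:
    """Return |S_open| / (|S_closed| + eps) for one pitch period,
    where samples [0, open_end) are the open-glottis interval."""
    open_part = period[:open_end] * np.hanning(open_end)
    closed_part = period[open_end:] * np.hanning(len(period) - open_end)
    s_open = np.abs(np.fft.rfft(open_part, n_fft))
    s_closed = np.abs(np.fft.rfft(closed_part, n_fft))
    return s_open / (s_closed + eps)

# Synthetic pitch period at 16 kHz (~125 Hz fundamental), open phase first
period = np.random.randn(128)
ratio = glottal_ratio_spectrum(period, open_end=80)   # shape (257,)
```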

Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 76% related
Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
Feature vector: 48.8K papers, 954.4K citations, 74% related
Natural language: 31.1K papers, 806.8K citations, 73% related
Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year    Papers
2023    24
2022    62
2021    67
2020    86
2019    77
2018    95