Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper presents cross-language automatic phonetic segmentation using Hidden Markov Models (HMMs) so as to provide extensive models that will be applicable across languages.
Abstract: Annotation of large multilingual corpora remains a challenge for the data-driven approach to speech research, especially for under-resourced languages. This paper presents cross-language automatic phonetic segmentation using Hidden Markov Models (HMMs). The underlying notion is segmentation based on articulation (manner and place), so as to provide extensive models that will be applicable across languages. A test on the Appen Spanish speech corpus gives a phone recognition accuracy of 61.15% when bootstrapped with acoustic models trained on TIMIT, compared with a baseline result of 54.63% for flat-start initialization of the monophone models.
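
To make the bootstrapping idea above concrete, here is a minimal sketch, assuming MFCC features plus the librosa and hmmlearn Python packages (neither is specified by the paper, and the helper names are illustrative): a monophone HMM is either flat-started from the pooled target-language data or seeded from a model trained on TIMIT before re-estimation.

# Sketch: flat-start vs. TIMIT-bootstrapped monophone training (assumed tools).
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    # 13-dim MFCCs with a 10 ms hop; a common front end, assumed here.
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=160).T

def train_monophone(feature_list, init_model=None, n_states=3):
    # Train one 3-state monophone HMM. With init_model (e.g. trained on TIMIT),
    # its Gaussian means seed the new model instead of a flat start.
    X = np.vstack(feature_list)
    lengths = [len(f) for f in feature_list]
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    if init_model is not None:
        hmm.init_params = "stc"               # keep the bootstrapped means below
        hmm.means_ = init_model.means_.copy()
    hmm.fit(X, lengths)                       # Baum-Welch re-estimation on target data
    return hmm

Flat-start training corresponds to calling train_monophone(features) with no init_model; the 61.15% vs. 54.63% gap reported above comes from this kind of initialization difference at full system scale, not from this toy code.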

8 citations

Journal ArticleDOI
TL;DR: The experiments on TIMIT phone recognition and the Wall Street Journal 5K-vocabulary continuous speech recognition show that eigentriphones estimated from state clusters defined by the nodes in the same phonetic regression class tree used in state tying result in further performance gain.
Abstract: Most automatic speech recognizers employ tied-state triphone hidden Markov models (HMM), in which the corresponding triphone states of the same base phone are tied. State tying is commonly performed with the use of a phonetic regression class tree, which renders robust context-dependent modeling possible by carefully balancing the amount of training data with the degree of tying. However, tying inevitably introduces quantization error: triphones tied to the same state are not distinguishable in that state. Recently we proposed a new triphone modeling approach called eigentriphone modeling, in which all triphone models are, in general, distinct. The idea is to create an eigenbasis for each base phone (or phone state) and to represent all its triphones (or triphone states) as distinct points in the space spanned by the basis. We have shown that triphone HMMs trained using model-based or state-based eigentriphones perform at least as well as conventional tied-state HMMs. In this paper, we further generalize the definition of eigentriphones over clusters of acoustic units. Our experiments on TIMIT phone recognition and Wall Street Journal 5K-vocabulary continuous speech recognition show that eigentriphones estimated from state clusters defined by the nodes of the same phonetic regression class tree used in state tying result in a further performance gain.
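
A toy numpy/scikit-learn sketch of the eigentriphone idea, under strong simplifications (Gaussian mean supervectors as the per-triphone representation and plain PCA as the eigenbasis; real systems operate on HMM state parameters and weight the basis by training counts, and all names below are illustrative):

# Toy eigentriphone sketch: build an eigenbasis per base phone from its
# triphone mean supervectors; each triphone becomes a distinct point
# (its coordinates) in the spanned space.
import numpy as np
from sklearn.decomposition import PCA

def eigentriphone_basis(triphone_means, n_eigen=10):
    # triphone_means: (n_triphones, dim) mean supervectors for one base phone.
    n_eigen = min(n_eigen, len(triphone_means) - 1)
    pca = PCA(n_components=n_eigen)
    coords = pca.fit_transform(triphone_means)   # one distinct point per triphone
    return pca, coords

def reconstruct(pca, coords):
    # Map eigen-coordinates back to smoothed triphone means.
    return pca.inverse_transform(coords)

# Example with fake data: 50 triphones of one base phone, 39-dim means.
rng = np.random.default_rng(0)
means = rng.normal(size=(50, 39))
pca, coords = eigentriphone_basis(means)
smoothed_first_triphone = reconstruct(pca, coords[:1])

Because every triphone keeps its own coordinates, no two triphones are forced to share identical parameters, which is exactly the quantization error that state tying introduces.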

8 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this article, a text-independent speaker classifier is trained using in-domain speaker data, and features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e., similarity scores compared to the in-domain speakers.
Abstract: The mechanism proposed here is for real-time speaker change detection in conversations. It first trains a neural-network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e., similarity scores comparing them to the in-domain speakers. These transformed features show very distinctive patterns, which facilitates differentiating speakers and enables speaker change detection with straightforward distance metrics. The speaker classifier and the speaker change detector are trained/tested using speech of the first 200 (in-domain) and the remaining 126 (out-of-domain) male speakers in TIMIT, respectively. For speaker classification, 100% accuracy at a 200-speaker size is achieved on any testing file, given that the speech duration is at least 0.97 seconds. For speaker change detection using the speaker classification outputs, performance based on 0.5, 1, and 2 second inspection intervals was evaluated in terms of error rate and F1 score, using data synthesized by concatenating speech from various speakers. The detector captures close to 97% of the changes by comparing the current second of speech with the previous second, which is very competitive with other methods reported in the literature.
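
A condensed sketch of the pipeline described above, assuming scikit-learn and cosine distance as the straightforward metric (the paper's actual network architecture and metric may differ, and the function names are made up for illustration):

# Sketch: speaker change detection from classifier posterior (likelihood) vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier
from scipy.spatial.distance import cosine

def train_speaker_classifier(frames, speaker_ids):
    # frames: (n_frames, dim) acoustic features from in-domain speakers.
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
    clf.fit(frames, speaker_ids)
    return clf

def likelihood_vector(clf, segment_frames):
    # Average per-frame posteriors over a segment (e.g. one second of speech).
    return clf.predict_proba(segment_frames).mean(axis=0)

def detect_changes(clf, segments, threshold=0.5):
    # Flag a change wherever consecutive segments' posterior vectors diverge.
    vecs = [likelihood_vector(clf, seg) for seg in segments]
    return [cosine(vecs[i - 1], vecs[i]) > threshold for i in range(1, len(vecs))]

The second-by-second comparison in the abstract corresponds to passing one-second segments to detect_changes; the threshold would be tuned on held-out synthesized conversations.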

8 citations

Journal ArticleDOI
TL;DR: In this paper, an unsupervised speech separation method was proposed that combines Convolutional Non-Negative Matrix Factorization with Joint Approximate Diagonalization of Eigenmatrices.
Abstract: As network-supporting devices and sensors in the Internet of Things leap forward, vast amounts of real-world data will be generated for human intelligent applications. Speech sensor networks, an important part of the Internet of Things, have numerous application needs. Indeed, the sensor data can further help intelligent applications to provide higher-quality services, although this data may contain considerable noise. Accordingly, speech signal processing methods are urgently needed to acquire low-noise and effective speech data. Blind source separation and enhancement are among the representative techniques. However, in an unsupervised complex environment where only a single-channel signal is present, many technical challenges stand in the way of single-channel, multi-person mixed speech separation. For this reason, this study develops an unsupervised speech separation method, CNMF+JADE, i.e., a hybrid method combining Convolutional Non-Negative Matrix Factorization and Joint Approximate Diagonalization of Eigenmatrices. Moreover, an adaptive wavelet transform-based speech enhancement technique is proposed, capable of adaptively and effectively enhancing the separated speech signal. The proposed method aims to yield a general and efficient speech processing algorithm for data acquired by speech sensors. The experimental results show that, on the TIMIT speech sources, the proposed method can effectively extract the target speaker from mixed speech with a tiny training sample. The algorithm is highly general and robust, and can technically support the processing of speech signals acquired by most speech sensors.
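
As a rough illustration only, the sketch below substitutes plain NMF on a magnitude spectrogram (scikit-learn and librosa, both assumed) for the paper's convolutional NMF, and omits the JADE and adaptive-wavelet-enhancement stages entirely:

# Simplified single-channel separation sketch: plain NMF plus a soft mask.
# A stand-in for the CNMF+JADE pipeline, not a reproduction of it.
import numpy as np
import librosa
from sklearn.decomposition import NMF

def separate_two_sources(wav_path, n_components=40, n_fft=1024, hop=256):
    y, sr = librosa.load(wav_path, sr=16000)
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)

    nmf = NMF(n_components=n_components, init="nndsvd", max_iter=400)
    W = nmf.fit_transform(mag)              # spectral bases   (freq x comps)
    H = nmf.components_                     # activations      (comps x time)

    # Naive grouping: first half of the components -> source 1, rest -> source 2.
    half = n_components // 2
    est1 = W[:, :half] @ H[:half]
    est2 = W[:, half:] @ H[half:]
    mask1 = est1 / (est1 + est2 + 1e-8)     # soft, Wiener-like mask

    src1 = librosa.istft(mask1 * mag * np.exp(1j * phase), hop_length=hop)
    src2 = librosa.istft((1 - mask1) * mag * np.exp(1j * phase), hop_length=hop)
    return src1, src2, sr

In practice the components would be grouped with speaker-trained bases (or, as in the paper, disentangled further with JADE) rather than by the arbitrary split used here.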

8 citations

Journal ArticleDOI
TL;DR: In this article, a hybrid approach for automatic speech recognition, TriNNOnto, is proposed. It integrates several components: a Language Model combined with a dynamic Triune Ontology generation scheme, and an Acoustic Model and feature modelling hybridised through a Tribonacci-based Deep Neural Network that decides the number of layers depending on the size of the samples and their count.
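
The TL;DR above gives few implementation details; purely as one hypothetical reading of a Tribonacci-based rule that decides the number of layers from the sample count, the sketch below picks a layer count from the Tribonacci sequence. Every name and constant here is an assumption, not the paper's method.

# Hypothetical sketch only: choose a hidden-layer count from the Tribonacci
# sequence (1, 1, 2, 4, 7, 13, ...) based on the number of training samples.
def tribonacci(n):
    # First n Tribonacci numbers, starting 1, 1, 2.
    seq = [1, 1, 2]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2] + seq[-3])
    return seq[:n]

def choose_num_layers(num_samples, samples_per_layer=1000, max_terms=10):
    # Largest Tribonacci number k with samples_per_layer * k <= num_samples;
    # the 1000-samples-per-layer ratio is an arbitrary illustrative choice.
    fits = [k for k in tribonacci(max_terms) if samples_per_layer * k <= num_samples]
    return fits[-1] if fits else 1

print(choose_num_layers(5000))   # -> 4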

8 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance
Metrics
No. of papers in the topic in previous years:
2023: 24
2022: 62
2021: 67
2020: 86
2019: 77
2018: 95