Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: In this paper, a CNN-based speaker recognition model was proposed for extracting robust speaker embeddings; the embeddings can be extracted efficiently with linear activation in the embedding layer from text-independent input.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and it has the potential to support other analyses that improve speaker recognition.
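As a hedged illustration of the probe described above (not the paper's code), the sketch below computes within-class cosine similarity over hypothetical frame-level embeddings; the embedding width, the broad-class grouping, and the random vectors are all invented stand-ins for activations extracted from a trained CNN.

```python
# A minimal sketch, assuming hypothetical frame-level embeddings, of the
# cosine-distance comparison the abstract describes. Dimensions and data are
# invented stand-ins, not the paper's implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 256  # assumed embedding width

# Hypothetical frame-level embeddings for one speaker, all from the same broad
# phonetic class (the paper finds such frames are similar per speaker).
vowel_frames = [rng.normal(size=dim) for _ in range(3)]

within_class = np.mean([
    cosine_similarity(vowel_frames[i], vowel_frames[j])
    for i in range(len(vowel_frames))
    for j in range(i + 1, len(vowel_frames))
])
print(f"mean within-class cosine similarity: {within_class:.3f}")
```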

44 citations

Proceedings ArticleDOI
25 Oct 2020
TL;DR: Speech-XLNet, as discussed by the authors, is an XLNet-like pretraining scheme for unsupervised acoustic model pretraining that learns speech representations with a self-attention network, which is then finetuned under the hybrid SAN/HMM framework.
Abstract: Self-attention networks (SAN) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining to learn speech representations with a SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structure through its attention weights. In addition, Speech-XLNet allows the model to explore bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves SAN/HMM performance in both convergence speed and recognition accuracy compared to training from randomly initialized weights. Our best systems achieve relative improvements of 11.9% and 8.3% on the TIMIT and WSJ tasks, respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to the best of our knowledge, is the lowest PER obtained by a single system.
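A toy sketch of the frame-shuffling idea follows. It only permutes a feature matrix along the time axis, whereas the actual Speech-XLNet permutes the autoregressive factorization order through attention masks, so treat this as an illustration of the regularization intuition rather than the method itself; shapes are assumptions.

```python
# A minimal sketch of frame-order shuffling on one utterance's features.
# The (100 frames x 40 log-mel bins) shape is an assumption; real Speech-XLNet
# permutes the factorization order inside attention masks, not the raw input.
import numpy as np

def permute_frames(features: np.ndarray, rng: np.random.Generator):
    """Randomly permute the time axis; return permuted features and the order."""
    order = rng.permutation(features.shape[0])
    return features[order], order

rng = np.random.default_rng(0)
utterance = rng.normal(size=(100, 40))   # hypothetical log-mel feature matrix
shuffled, order = permute_frames(utterance, rng)
print(shuffled.shape, order[:5])
```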

44 citations

Proceedings ArticleDOI
03 Oct 1996
TL;DR: A new rate-of-speech (ROS) detector that operates independently from the recognition process is presented, and the ROS estimate is subsequently used to compensate for the effects of unusual speech rates on continuous speech recognition.
Abstract: In this paper, we present a new rate-of-speech (ROS) detector that operates independently from the recognition process. This detector is evaluated on the TIMIT corpus and positioned with respect to other ROS detectors. The ROS estimate is subsequently used to compensate for the effects of unusual speech rates on continuous speech recognition. We report on results obtained with two ROS compensation techniques on a speaker-independent acoustic-phonetic decoding task.
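The abstract does not give the detector's algorithm; as a hedged sketch of the quantity being estimated, the snippet below computes a naive ROS value (phones per second) from a known phone segmentation. The boundary values are toy placeholders.

```python
# A naive rate-of-speech estimate: phones per second over an utterance, given
# (start, end) phone boundaries in seconds. A stand-in definition only, not
# the paper's recognizer-independent detector.
def phones_per_second(phone_boundaries):
    """phone_boundaries: list of (start_s, end_s) tuples, one per phone."""
    if not phone_boundaries:
        return 0.0
    duration = phone_boundaries[-1][1] - phone_boundaries[0][0]
    return len(phone_boundaries) / duration if duration > 0 else 0.0

# Five phones spanning 0.8 s -> 6.25 phones/s.
print(phones_per_second([(0.0, 0.1), (0.1, 0.3), (0.3, 0.5),
                         (0.5, 0.65), (0.65, 0.8)]))
```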

44 citations

Proceedings ArticleDOI
06 Apr 2003
TL;DR: This work applies linear and nonlinear data-driven feature transformations to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task, finding that all four methods outperform the baseline system.
Abstract: Feature extraction is the key element when aiming at robust speech recognition. Both linear and nonlinear data-driven feature transformations are applied to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task. Transformations are based on principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA) and multilayer perceptron network based nonlinear discriminant analysis (NLDA). All four methods outperform the baseline system which consists of the standard feature representation based on MFCCs (mel-frequency cepstral coefficients) with the first-order deltas, using a mixture-of-Gaussians HMM recognizer. Further improvement is gained by forming the feature vector as a concatenation of the outputs of all four feature transformations.
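A minimal scikit-learn sketch of the linear transforms and the concatenation step appears below; the context size, dimensionalities, and random data are assumptions, and the MLP-based NLDA branch is omitted. This is an illustration of the pipeline shape, not the paper's implementation.

```python
# Scikit-learn stand-ins for three of the paper's four transforms (PCA, ICA,
# LDA) applied to log-mel context vectors, with outputs concatenated.
# All shapes, labels, and data here are invented assumptions.
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9 * 24))  # 9-frame context of 24 log-mel bins (assumed)
y = rng.integers(0, 39, size=1000)   # hypothetical phone labels

pca = PCA(n_components=30).fit(X)
ica = FastICA(n_components=30, random_state=0).fit(X)
lda = LinearDiscriminantAnalysis(n_components=30).fit(X, y)

# Concatenate the transform outputs into one feature vector per frame, in the
# spirit of the paper's combined representation (NLDA omitted here).
features = np.hstack([pca.transform(X), ica.transform(X), lda.transform(X)])
print(features.shape)  # (1000, 90)
```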

44 citations

Book ChapterDOI
17 Dec 2008
TL;DR: A new approach for emotion recognition is described, based on the extraction and characterization of phonemic units such as vowels and consonants, which are provided by a pseudo-phonetic speech segmentation phase combined with a vowel detector.
Abstract: This paper is dedicated to the description and study of a new feature extraction approach for emotion recognition. Our contribution is based on the extraction and characterization of phonemic units such as vowels and consonants, which are provided by a pseudo-phonetic speech segmentation phase combined with a vowel detector. The segmentation algorithm is evaluated on both emotional (Berlin) and non-emotional (TIMIT, NTIMIT) databases. For the emotion recognition task, we propose to extract MFCC acoustic features from these pseudo-phonetic segments (vowels, consonants), and we compare this approach with one based on traditional voiced and unvoiced segments. The classification is achieved by the well-known k-NN (k-nearest neighbors) classifier on the Berlin corpus.
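As a hedged sketch of the recognition stage (librosa and scikit-learn as stand-ins for the original tooling), the snippet below averages MFCCs over given pseudo-phonetic segments and fits a k-NN classifier; the segment boundaries, audio, and labels are toy placeholders, and the segmentation itself is assumed already done.

```python
# Mean MFCC vector per pseudo-phonetic segment, classified with k-NN.
# Boundaries, audio, and labels are invented; the paper obtains segments from
# its segmentation phase plus a vowel detector.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def segment_mfcc(y: np.ndarray, sr: int, segments) -> np.ndarray:
    """Mean 13-dim MFCC vector for each (start_s, end_s) segment."""
    feats = []
    for start, end in segments:
        chunk = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    return np.vstack(feats)

rng = np.random.default_rng(0)
sr = 16000
audio = rng.normal(size=sr).astype(np.float32)       # 1 s of noise as stand-in audio
segments = [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)]      # hypothetical boundaries
X = segment_mfcc(audio, sr, segments)
labels = np.array(["anger", "neutral", "anger"])     # toy emotion labels

knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(knn.predict(X[:1]))
```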

43 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95