Topic

TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Posted Content
TL;DR: In this paper, a CNN-based speaker recognition model was proposed for extracting robust speaker embeddings; the embeddings can be extracted efficiently with linear activation in the embedding layer from text-independent input.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and it has the potential to support other analyses that improve speaker recognition.
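As a hedged illustration of the probe described above (not the paper's code), the sketch below computes within-class cosine similarity over hypothetical frame-level embeddings; the embedding width, the broad-class grouping, and the random vectors are all invented stand-ins for activations extracted from a trained CNN.

```python
# A minimal sketch, assuming hypothetical frame-level embeddings, of the
# cosine-distance comparison the abstract describes. Dimensions and data are
# invented stand-ins, not the paper's implementation.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 256  # assumed embedding width

# Hypothetical frame-level embeddings for one speaker, all from the same broad
# phonetic class (the paper finds such frames are similar per speaker).
vowel_frames = [rng.normal(size=dim) for _ in range(3)]

within_class = np.mean([
    cosine_similarity(vowel_frames[i], vowel_frames[j])
    for i in range(len(vowel_frames))
    for j in range(i + 1, len(vowel_frames))
])
print(f"mean within-class cosine similarity: {within_class:.3f}")
```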

44 citations

Proceedings ArticleDOI
25 Oct 2020
TL;DR: Speech-XLNet, as discussed by the authors, is an XLNet-like pretraining scheme for unsupervised acoustic model pretraining that learns speech representations with a self-attention network, which is then finetuned under the hybrid SAN/HMM framework.
Abstract: Self-attention networks (SAN) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining to learn speech representations with a SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structure through its attention weights. In addition, Speech-XLNet allows the model to explore bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves SAN/HMM performance in both convergence speed and recognition accuracy compared to training from randomly initialized weights. Our best systems achieve relative improvements of 11.9% and 8.3% on the TIMIT and WSJ tasks, respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to the best of our knowledge, is the lowest PER obtained by a single system.
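A toy sketch of the frame-shuffling idea follows. It only permutes a feature matrix along the time axis, whereas the actual Speech-XLNet permutes the autoregressive factorization order through attention masks, so treat this as an illustration of the regularization intuition rather than the method itself; shapes are assumptions.

```python
# A minimal sketch of frame-order shuffling on one utterance's features.
# The (100 frames x 40 log-mel bins) shape is an assumption; real Speech-XLNet
# permutes the factorization order inside attention masks, not the raw input.
import numpy as np

def permute_frames(features: np.ndarray, rng: np.random.Generator):
    """Randomly permute the time axis; return permuted features and the order."""
    order = rng.permutation(features.shape[0])
    return features[order], order

rng = np.random.default_rng(0)
utterance = rng.normal(size=(100, 40))   # hypothetical log-mel feature matrix
shuffled, order = permute_frames(utterance, rng)
print(shuffled.shape, order[:5])
```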

44 citations

Proceedings ArticleDOI
03 Oct 1996
TL;DR: A new rate-of-speech (ROS) detector that operates independently from the recognition process is presented, and the ROS estimate is subsequently used to compensate for the effects of unusual speech rates on continuous speech recognition.
Abstract: In this paper, we present a new rate-of-speech (ROS) detector that operates independently from the recognition process. This detector is evaluated on the TIMIT corpus and positioned with respect to other ROS detectors. The ROS estimate is subsequently used to compensate for the effects of unusual speech rates on continuous speech recognition. We report on results obtained with two ROS compensation techniques on a speaker-independent acoustic-phonetic decoding task.
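The abstract does not give the detector's algorithm; as a hedged sketch of the quantity being estimated, the snippet below computes a naive ROS value (phones per second) from a known phone segmentation. The boundary values are toy placeholders.

```python
# A naive rate-of-speech estimate: phones per second over an utterance, given
# (start, end) phone boundaries in seconds. A stand-in definition only, not
# the paper's recognizer-independent detector.
def phones_per_second(phone_boundaries):
    """phone_boundaries: list of (start_s, end_s) tuples, one per phone."""
    if not phone_boundaries:
        return 0.0
    duration = phone_boundaries[-1][1] - phone_boundaries[0][0]
    return len(phone_boundaries) / duration if duration > 0 else 0.0

# Five phones spanning 0.8 s -> 6.25 phones/s.
print(phones_per_second([(0.0, 0.1), (0.1, 0.3), (0.3, 0.5),
                         (0.5, 0.65), (0.65, 0.8)]))
```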

44 citations

Proceedings ArticleDOI
06 Apr 2003
TL;DR: This work applies linear and nonlinear data-driven feature transformations to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task, finding that all four methods outperform the baseline system.
Abstract: Feature extraction is the key element when aiming at robust speech recognition. Both linear and nonlinear data-driven feature transformations are applied to the logarithmic mel-spectral context feature vectors in the TIMIT phone recognition task. Transformations are based on principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA) and multilayer perceptron network based nonlinear discriminant analysis (NLDA). All four methods outperform the baseline system which consists of the standard feature representation based on MFCCs (mel-frequency cepstral coefficients) with the first-order deltas, using a mixture-of-Gaussians HMM recognizer. Further improvement is gained by forming the feature vector as a concatenation of the outputs of all four feature transformations.
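A minimal scikit-learn sketch of the linear transforms and the concatenation step appears below; the context size, dimensionalities, and random data are assumptions, and the MLP-based NLDA branch is omitted. This is an illustration of the pipeline shape, not the paper's implementation.

```python
# Scikit-learn stand-ins for three of the paper's four transforms (PCA, ICA,
# LDA) applied to log-mel context vectors, with outputs concatenated.
# All shapes, labels, and data here are invented assumptions.
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9 * 24))  # 9-frame context of 24 log-mel bins (assumed)
y = rng.integers(0, 39, size=1000)   # hypothetical phone labels

pca = PCA(n_components=30).fit(X)
ica = FastICA(n_components=30, random_state=0).fit(X)
lda = LinearDiscriminantAnalysis(n_components=30).fit(X, y)

# Concatenate the transform outputs into one feature vector per frame, in the
# spirit of the paper's combined representation (NLDA omitted here).
features = np.hstack([pca.transform(X), ica.transform(X), lda.transform(X)])
print(features.shape)  # (1000, 90)
```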

44 citations

Book ChapterDOI
17 Dec 2008
TL;DR: A new approach for emotion recognition is described, based on the extraction and characterization of phonemic units such as vowels and consonants, which are provided by a pseudo-phonetic speech segmentation phase combined with a vowel detector.
Abstract: This paper is dedicated to the description and study of a new feature extraction approach for emotion recognition. Our contribution is based on the extraction and characterization of phonemic units such as vowels and consonants, which are provided by a pseudo-phonetic speech segmentation phase combined with a vowel detector. The segmentation algorithm is evaluated on both emotional (Berlin) and non-emotional (TIMIT, NTIMIT) databases. For the emotion recognition task, we propose to extract MFCC acoustic features from these pseudo-phonetic segments (vowels, consonants), and we compare this approach with one based on traditional voiced and unvoiced segments. The classification is achieved by the well-known k-NN (k-nearest neighbors) classifier on the Berlin corpus.
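As a hedged sketch of the recognition stage (librosa and scikit-learn as stand-ins for the original tooling), the snippet below averages MFCCs over given pseudo-phonetic segments and fits a k-NN classifier; the segment boundaries, audio, and labels are toy placeholders, and the segmentation itself is assumed already done.

```python
# Mean MFCC vector per pseudo-phonetic segment, classified with k-NN.
# Boundaries, audio, and labels are invented; the paper obtains segments from
# its segmentation phase plus a vowel detector.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def segment_mfcc(y: np.ndarray, sr: int, segments) -> np.ndarray:
    """Mean 13-dim MFCC vector for each (start_s, end_s) segment."""
    feats = []
    for start, end in segments:
        chunk = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    return np.vstack(feats)

rng = np.random.default_rng(0)
sr = 16000
audio = rng.normal(size=sr).astype(np.float32)       # 1 s of noise as stand-in audio
segments = [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)]      # hypothetical boundaries
X = segment_mfcc(audio, sr, segments)
labels = np.array(["anger", "neutral", "anger"])     # toy emotion labels

knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(knn.predict(X[:1]))
```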

43 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations (76% related)
Feature (machine learning): 33.9K papers, 798.7K citations (75% related)
Feature vector: 48.8K papers, 954.4K citations (74% related)
Natural language: 31.1K papers, 806.8K citations (73% related)
Deep learning: 79.8K papers, 2.1M citations (72% related)
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95