TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal Article (DOI)
TL;DR: The kurtosis maximization ideas for source separation are extended to include delays in the mixing model to at least account for propagation delays from speakers to microphones.
Abstract: Blind source separation of mixtures of speech signals has received considerable attention in the research community over the last two years. One computationally efficient method employs a gradient search algorithm to maximize the kurtosis of the outputs, thereby achieving separation of the source signals. While this method has reported excellent separation results (30–50 dB SIR), it assumes a simple linear mixing model. In the general case, convolutional mixing models are used; however, this is a rather difficult problem due to causality and stability restrictions on the inverse, not to mention length requirements in the FIR approximation. Research results with the general problem are modest at best. In this paper, we extend the kurtosis maximization ideas for source separation to include delays in the mixing model, so as to at least account for propagation delays from speakers to microphones. The algorithm first estimates the relative delays of the sources within each mixture using a standard autocorrelation technique. These delay estimates are then used in the kurtosis maximization algorithm, where the separation matrix is modified to include these delays. Simulation results (using the TIMIT speech corpus) generally indicate good separation quality (10–20 dB) with little additional computational overhead.

3 citations
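For context, the core separation step this abstract builds on can be sketched in a few lines of NumPy: gradient ascent on the kurtosis of a projected output of a whitened instantaneous mixture. This is a minimal illustration of kurtosis maximization only; the paper's contribution, estimating relative delays by autocorrelation and folding them into the separation matrix, is not reproduced here, and all names and parameter values are illustrative.

```python
import numpy as np

def normalized_kurtosis(y):
    """Excess kurtosis of a signal; large values suggest a well-separated speech source."""
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3.0

def kurtosis_maximization(X, n_iter=500, lr=0.01, seed=0):
    """Estimate one source from an instantaneous mixture X (channels x samples)
    by gradient ascent on E[(w^T x)^4] over unit-norm w, using whitened data."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten so that a unit-norm w keeps the output variance fixed near 1,
    # making E[y^4] a proxy for the (excess) kurtosis.
    d, E = np.linalg.eigh(np.cov(X))
    Xw = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Xw
        grad = 4.0 * (Xw * y**3).mean(axis=1)   # d/dw of E[(w^T x)^4]
        w += lr * grad
        w /= np.linalg.norm(w)
    return w @ Xw
```

In the delayed-mixture setting the abstract describes, the same objective is applied after the per-source propagation delays have been estimated and built into the separation matrix.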

Proceedings Article (DOI)
01 Oct 2019
TL;DR: A set of long short-term memory (LSTM) deep neural networks is used to distinguish a particular speaker from the rest of the speakers in single-channel recorded speech, with the network structure modified to provide suitable results.
Abstract: In this paper, we utilized a set of long short-term memory (LSTM) deep neural networks to distinguish a particular speaker from the rest of the speakers in single-channel recorded speech. The structure of the network is modified to provide suitable results. The proposed architecture models the sequence of spectral data in each frame as the key feature. Each network has two memory cells and accepts an 8-band spectral window as input. The reconstructions of the different bands are merged to rebuild the speaker's utterance. We evaluated the reconstruction performance of the proposed system for the intended speaker with the PESQ and MSE measures. Using all utterances of each speaker in the TIMIT dataset as training data to build an LSTM-based attention auto-encoder model, we achieved a PESQ of 3.66 when rebuilding the intended speaker. In contrast, the PESQ averaged 1.92 for other speakers passed through the same speaker's network. This test was successfully repeated for different utterances of different speakers.

3 citations
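As a rough picture of the per-speaker modeling idea, the PyTorch sketch below builds an LSTM auto-encoder over 8-band spectral frames and trains it with an MSE reconstruction loss on one speaker's utterances, so that other speakers reconstruct poorly (lower PESQ). The 8-band input window comes from the abstract; the layer sizes, the attention-style frame weighting, and all names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BandAutoEncoder(nn.Module):
    """Per-speaker LSTM auto-encoder over an 8-band spectral window (sketch)."""
    def __init__(self, bands=8, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(bands, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, bands)

    def forward(self, x):                      # x: (batch, frames, bands)
        h, _ = self.encoder(x)
        # Attention-style frame weighting, a stand-in for the paper's
        # attention mechanism.
        weights = torch.softmax(h.mean(dim=-1), dim=1).unsqueeze(-1)
        y, _ = self.decoder(h * weights)
        return self.out(y)                     # reconstructed 8-band frames

# One training step on placeholder spectral frames of the target speaker.
model = BandAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(4, 100, 8)
opt.zero_grad()
loss = nn.functional.mse_loss(model(frames), frames)
loss.backward()
opt.step()
```

Band-wise reconstructions from several such windows would then be merged to rebuild the full utterance, as the abstract describes.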

Posted Content
TL;DR: Six different speaker embedding de-mixing architectures are investigated, and the results show that one of the proposed architectures achieves performance close to that on clean embeddings, reaching 96.9% identification accuracy and 0.89 cosine similarity.
Abstract: Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space, as in speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space and consists of two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN-based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network. The de-mixing network is trained to generate the embedding of the other speaker via a reconstruction loss. Speaker identification accuracy and the cosine similarity between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are run on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world two-speaker recordings (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance on the clean speaker embeddings, one of the proposed architectures achieves close performance, reaching 96.9% identification accuracy and 0.89 cosine similarity.

3 citations
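The de-mixing step can be pictured with the hedged PyTorch sketch below: a small network receives the two-speaker embedding concatenated with one clean speaker embedding and is trained with a reconstruction loss to produce the other speaker's embedding, with cosine similarity used at evaluation time. The embedding dimension, hidden sizes, and the specific architecture are placeholders; the paper extracts embeddings with a residual TDNN and compares six de-mixing architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeMixNet(nn.Module):
    """Map (two-speaker embedding, known speaker embedding) -> other speaker embedding."""
    def __init__(self, emb_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, mix_emb, known_emb):
        return self.net(torch.cat([mix_emb, known_emb], dim=-1))

# Reconstruction loss against the other speaker's clean embedding;
# cosine similarity scores the de-mixed embeddings at evaluation time.
model = DeMixNet()
mix, known, target = (torch.randn(8, 512) for _ in range(3))
pred = model(mix, known)
loss = F.mse_loss(pred, target)
cos = F.cosine_similarity(pred, target, dim=-1).mean()
```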

Proceedings Article (DOI)
TL;DR: AM-SincNet, as discussed by the authors, introduces a margin of separation between classes that forces samples from the same class to be closer to each other and also maximizes the distance between classes.
Abstract: Speaker recognition is a challenging task with essential applications such as authentication, automation, and security. SincNet is a recent deep-learning-based model that has produced promising results on this task. When training deep learning systems, the loss function is essential to network performance. The Softmax loss is widely used in deep learning methods, but it is not the best choice for all kinds of problems. For distance-based problems, a Softmax-based loss function called Additive Margin Softmax (AM-Softmax) has proven to be a better choice than the traditional Softmax. AM-Softmax introduces a margin of separation between classes that forces samples from the same class to be closer to each other and also maximizes the distance between classes. In this paper, we propose a new approach for speaker recognition called AM-SincNet, which is based on SincNet but uses an improved AM-Softmax layer. The proposed method is evaluated on the TIMIT dataset and obtains an improvement of approximately 40% in frame error rate compared to SincNet.

3 citations
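The AM-Softmax loss the abstract refers to is standard: logits are cosine similarities between L2-normalized features and class weights, a margin m is subtracted from the target-class cosine, and everything is scaled by s before cross-entropy. The PyTorch sketch below shows the loss on its own; the scale, margin, feature dimension, and class count are placeholder values, not those used by AM-SincNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive Margin Softmax loss (sketch with typical hyperparameters)."""
    def __init__(self, feat_dim, n_classes, s=30.0, m=0.4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, n_classes))
        self.s, self.m = s, m

    def forward(self, x, labels):
        x = F.normalize(x, dim=1)                     # unit-norm features
        W = F.normalize(self.W, dim=0)                # unit-norm class weights
        cos = x @ W                                   # (batch, n_classes) cosines
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)

# Usage sketch: this loss would replace the final softmax of a speaker
# classifier such as SincNet (dimensions here are placeholders).
loss_fn = AMSoftmaxLoss(feat_dim=2048, n_classes=462)
feats, labels = torch.randn(16, 2048), torch.randint(0, 462, (16,))
loss = loss_fn(feats, labels)
```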

Proceedings Article (DOI)
15 Apr 2018
TL;DR: This work develops methods for acoustic feature learning in the setting where an external, domain-mismatched dataset of paired speech and articulatory measurements is available, either with or without labels, based on deep variational CCA and extensions that use both source- and target-domain data and labels.
Abstract: Previous work has shown that it is possible to improve speech recognition by learning acoustic features from paired acoustic-articulatory data, for example by using canonical correlation analysis (CCA) or its deep extensions. One limitation of this prior work is that the learned feature models are difficult to port to new datasets or domains, and articulatory data is not available for most speech corpora. In this work we study the problem of acoustic feature learning in the setting where we have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels. We develop methods for acoustic feature learning in these settings, based on deep variational CCA and extensions that use both source and target domain data and labels. Using this approach, we improve phonetic recognition accuracies on both TIMIT and Wall Street Journal and analyze a number of design choices.

3 citations
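As background for the multi-view objective, the NumPy sketch below implements classical linear CCA, the simpler ancestor of the deep variational CCA used in the paper: it finds projections of acoustic and articulatory feature matrices whose projected views are maximally correlated. The regularizer and dimensions are illustrative, and the paper's models are deep and variational rather than linear.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Top-k canonical directions for views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        d, E = np.linalg.eigh(C)
        return E @ np.diag(1.0 / np.sqrt(d)) @ E.T

    # SVD of the whitened cross-covariance yields the canonical directions.
    U, s, Vt = np.linalg.svd(inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy))
    A = inv_sqrt(Cxx) @ U[:, :k]      # projection for the acoustic view
    B = inv_sqrt(Cyy) @ Vt[:k].T      # projection for the articulatory view
    return A, B, s[:k]                # s[:k] are the canonical correlations
```

In the paper's setting, the articulatory view is available only in a mismatched external corpus, so it is the learned acoustic-side transform (the role A plays in this linear analogue) that carries over to recognition on TIMIT and Wall Street Journal.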


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations, 76% related
- Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
- Feature vector: 48.8K papers, 954.4K citations, 74% related
- Natural language: 31.1K papers, 806.8K citations, 73% related
- Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95