TIMIT

About: TIMIT is a research topic. Over its lifetime, 1,401 publications have been published within this topic, receiving 59,888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal Article (DOI)
TL;DR: The kurtosis maximization ideas for source separation are extended to include delays in the mixing model to at least account for propagation delays from speakers to microphones.
Abstract: Blind source separation of mixtures of speech signals has received considerable attention in the research community over the last two years. One computationally efficient method employs a gradient search algorithm to maximize the kurtosis of the outputs, thereby achieving separation of the source signals. While this method has reported excellent separation results (30–50 dB SIR), it assumes a simple linear mixing model. In the general case, convolutional mixing models are used; however, this is a rather difficult problem due to causality and stability restrictions on the inverse, not to mention length requirements in the FIR approximation. Research results with the general problem are modest at best. In this paper, we extend the kurtosis maximization ideas for source separation to include delays in the mixing model, so as to at least account for propagation delays from speakers to microphones. The algorithm first estimates the relative delays of the sources within each mixture using a standard autocorrelation technique. These delay estimates are then used in the kurtosis maximization algorithm, where the separation matrix is modified to include these delays. Simulation results (using the TIMIT speech corpus) generally indicate good separation quality (10–20 dB) with little additional computational overhead.

3 citations
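For context, the core separation step this abstract builds on can be sketched in a few lines of NumPy: gradient ascent on the kurtosis of a projected output of a whitened instantaneous mixture. This is a minimal illustration of kurtosis maximization only; the paper's contribution, estimating relative delays by autocorrelation and folding them into the separation matrix, is not reproduced here, and all names and parameter values are illustrative.

```python
import numpy as np

def normalized_kurtosis(y):
    """Excess kurtosis of a signal; large values suggest a well-separated speech source."""
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3.0

def kurtosis_maximization(X, n_iter=500, lr=0.01, seed=0):
    """Estimate one source from an instantaneous mixture X (channels x samples)
    by gradient ascent on E[(w^T x)^4] over unit-norm w, using whitened data."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten so that a unit-norm w keeps the output variance fixed near 1,
    # making E[y^4] a proxy for the (excess) kurtosis.
    d, E = np.linalg.eigh(np.cov(X))
    Xw = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Xw
        grad = 4.0 * (Xw * y**3).mean(axis=1)   # d/dw of E[(w^T x)^4]
        w += lr * grad
        w /= np.linalg.norm(w)
    return w @ Xw
```

In the delayed-mixture setting the abstract describes, the same objective is applied after the per-source propagation delays have been estimated and built into the separation matrix.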

Proceedings Article (DOI)
01 Oct 2019
TL;DR: A set of long short-term memory (LSTM) deep neural networks is used to distinguish a particular speaker from the rest of the speakers in single-channel recorded speech, with the network structure modified to provide suitable results.
Abstract: In this paper, we utilized a set of long short-term memory (LSTM) deep neural networks to distinguish a particular speaker from the rest of the speakers in single-channel recorded speech. The structure of the network is modified to provide suitable results. The proposed architecture models the sequence of spectral data in each frame as the key feature. Each network has two memory cells and accepts an 8-band spectral window as input. The reconstructions of the different bands are merged to rebuild the speaker's utterance. We evaluated the reconstruction performance of the proposed system for the intended speaker with the PESQ and MSE measures. Using all utterances of each speaker in the TIMIT dataset as training data to build an LSTM-based attention auto-encoder model, we achieved a PESQ of 3.66 when rebuilding the intended speaker. In contrast, the PESQ averaged 1.92 for other speakers passed through the same speaker's network. This test was successfully repeated for different utterances of different speakers.

3 citations
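As a rough picture of the per-speaker modeling idea, the PyTorch sketch below builds an LSTM auto-encoder over 8-band spectral frames and trains it with an MSE reconstruction loss on one speaker's utterances, so that other speakers reconstruct poorly (lower PESQ). The 8-band input window comes from the abstract; the layer sizes, the attention-style frame weighting, and all names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BandAutoEncoder(nn.Module):
    """Per-speaker LSTM auto-encoder over an 8-band spectral window (sketch)."""
    def __init__(self, bands=8, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(bands, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, bands)

    def forward(self, x):                      # x: (batch, frames, bands)
        h, _ = self.encoder(x)
        # Attention-style frame weighting, a stand-in for the paper's
        # attention mechanism.
        weights = torch.softmax(h.mean(dim=-1), dim=1).unsqueeze(-1)
        y, _ = self.decoder(h * weights)
        return self.out(y)                     # reconstructed 8-band frames

# One training step on placeholder spectral frames of the target speaker.
model = BandAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(4, 100, 8)
opt.zero_grad()
loss = nn.functional.mse_loss(model(frames), frames)
loss.backward()
opt.step()
```

Band-wise reconstructions from several such windows would then be merged to rebuild the full utterance, as the abstract describes.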

Posted Content
TL;DR: Six different speaker embedding de-mixing architectures are investigated, and the results show that one of the proposed architectures achieves performance close to that on clean embeddings, reaching 96.9% identification accuracy and 0.89 cosine similarity.
Abstract: Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space, as in speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space and consists of two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN-based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network. The de-mixing network is trained to generate the embedding of the other speaker via a reconstruction loss. Speaker identification accuracy and the cosine similarity between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are run on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world two-speaker recordings (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance on the clean speaker embeddings, one of the proposed architectures achieves close performance, reaching 96.9% identification accuracy and 0.89 cosine similarity.

3 citations
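The de-mixing step can be pictured with the hedged PyTorch sketch below: a small network receives the two-speaker embedding concatenated with one clean speaker embedding and is trained with a reconstruction loss to produce the other speaker's embedding, with cosine similarity used at evaluation time. The embedding dimension, hidden sizes, and the specific architecture are placeholders; the paper extracts embeddings with a residual TDNN and compares six de-mixing architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeMixNet(nn.Module):
    """Map (two-speaker embedding, known speaker embedding) -> other speaker embedding."""
    def __init__(self, emb_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, mix_emb, known_emb):
        return self.net(torch.cat([mix_emb, known_emb], dim=-1))

# Reconstruction loss against the other speaker's clean embedding;
# cosine similarity scores the de-mixed embeddings at evaluation time.
model = DeMixNet()
mix, known, target = (torch.randn(8, 512) for _ in range(3))
pred = model(mix, known)
loss = F.mse_loss(pred, target)
cos = F.cosine_similarity(pred, target, dim=-1).mean()
```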

Proceedings Article (DOI)
TL;DR: AM-SincNet, as discussed by the authors, introduces a margin of separation between classes that forces samples from the same class to be closer to each other and also maximizes the distance between classes.
Abstract: Speaker recognition is a challenging task with essential applications such as authentication, automation, and security. SincNet is a recent deep-learning-based model that has produced promising results on this task. When training deep learning systems, the loss function is essential to network performance. The Softmax loss is widely used in deep learning methods, but it is not the best choice for all kinds of problems. For distance-based problems, a Softmax-based loss function called Additive Margin Softmax (AM-Softmax) has proven to be a better choice than the traditional Softmax. AM-Softmax introduces a margin of separation between classes that forces samples from the same class to be closer to each other and also maximizes the distance between classes. In this paper, we propose a new approach for speaker recognition called AM-SincNet, which is based on SincNet but uses an improved AM-Softmax layer. The proposed method is evaluated on the TIMIT dataset and obtains an improvement of approximately 40% in frame error rate compared to SincNet.

3 citations
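The AM-Softmax loss the abstract refers to is standard: logits are cosine similarities between L2-normalized features and class weights, a margin m is subtracted from the target-class cosine, and everything is scaled by s before cross-entropy. The PyTorch sketch below shows the loss on its own; the scale, margin, feature dimension, and class count are placeholder values, not those used by AM-SincNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive Margin Softmax loss (sketch with typical hyperparameters)."""
    def __init__(self, feat_dim, n_classes, s=30.0, m=0.4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, n_classes))
        self.s, self.m = s, m

    def forward(self, x, labels):
        x = F.normalize(x, dim=1)                     # unit-norm features
        W = F.normalize(self.W, dim=0)                # unit-norm class weights
        cos = x @ W                                   # (batch, n_classes) cosines
        margin = F.one_hot(labels, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), labels)

# Usage sketch: this loss would replace the final softmax of a speaker
# classifier such as SincNet (dimensions here are placeholders).
loss_fn = AMSoftmaxLoss(feat_dim=2048, n_classes=462)
feats, labels = torch.randn(16, 2048), torch.randint(0, 462, (16,))
loss = loss_fn(feats, labels)
```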

Proceedings Article (DOI)
15 Apr 2018
TL;DR: This work develops methods for acoustic feature learning in the setting where an external, domain-mismatched dataset of paired speech and articulatory measurements is available, either with or without labels, based on deep variational CCA and extensions that use both source- and target-domain data and labels.
Abstract: Previous work has shown that it is possible to improve speech recognition by learning acoustic features from paired acoustic-articulatory data, for example by using canonical correlation analysis (CCA) or its deep extensions. One limitation of this prior work is that the learned feature models are difficult to port to new datasets or domains, and articulatory data is not available for most speech corpora. In this work we study the problem of acoustic feature learning in the setting where we have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels. We develop methods for acoustic feature learning in these settings, based on deep variational CCA and extensions that use both source and target domain data and labels. Using this approach, we improve phonetic recognition accuracies on both TIMIT and Wall Street Journal and analyze a number of design choices.

3 citations
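As background for the multi-view objective, the NumPy sketch below implements classical linear CCA, the simpler ancestor of the deep variational CCA used in the paper: it finds projections of acoustic and articulatory feature matrices whose projected views are maximally correlated. The regularizer and dimensions are illustrative, and the paper's models are deep and variational rather than linear.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Top-k canonical directions for views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        d, E = np.linalg.eigh(C)
        return E @ np.diag(1.0 / np.sqrt(d)) @ E.T

    # SVD of the whitened cross-covariance yields the canonical directions.
    U, s, Vt = np.linalg.svd(inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy))
    A = inv_sqrt(Cxx) @ U[:, :k]      # projection for the acoustic view
    B = inv_sqrt(Cyy) @ Vt[:k].T      # projection for the articulatory view
    return A, B, s[:k]                # s[:k] are the canonical correlations
```

In the paper's setting, the articulatory view is available only in a mismatched external corpus, so it is the learned acoustic-side transform (the role A plays in this linear analogue) that carries over to recognition on TIMIT and Wall Street Journal.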


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations, 76% related
- Feature (machine learning): 33.9K papers, 798.7K citations, 75% related
- Feature vector: 48.8K papers, 954.4K citations, 74% related
- Natural language: 31.1K papers, 806.8K citations, 73% related
- Deep learning: 79.8K papers, 2.1M citations, 72% related
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  24
2022  62
2021  67
2020  86
2019  77
2018  95