Topic

TIMIT

About: TIMIT is a research topic. Over the lifetime, 1401 publications have been published within this topic receiving 59888 citations. The topic is also known as: TIMIT Acoustic-Phonetic Continuous Speech Corpus.


Papers
Journal ArticleDOI
TL;DR: A novel source cell-phone identification system suitable for both clean and noisy environments is proposed, using spectral distribution features of the constant Q transform (CQT) domain and a multi-scene training method; experimental results show that the proposed features have superior performance.
Abstract: With the widespread availability of cell-phone recording devices, source cell-phone identification has become a hot topic in multimedia forensics. Research on source cell-phone identification in clean conditions has achieved good results, but performance in noisy environments is still unsatisfactory. This paper proposes a novel source cell-phone identification system suitable for both clean and noisy environments, using spectral distribution features of the constant Q transform (CQT) domain and a multi-scene training method. Analysis shows that the main identification difficulty lies in distinguishing different models of cell-phones of the same brand, whose tiny differences appear mainly in the middle and low frequency bands. Therefore, this paper extracts spectral distribution features from the CQT domain, which has higher frequency resolution at mid-low frequencies. To evaluate the effectiveness of the proposed features, four classification techniques, Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Network (CNN), and Recurrent Neural Network with Bidirectional Long Short-Term Memory (RNN-BLSTM), are used to identify the source recording device. Experimental results show that the proposed features have superior performance. Compared with Mel frequency cepstral coefficients (MFCC) and linear frequency cepstral coefficients (LFCC), they improve identification accuracy for cell-phones of the same brand, whether the test speech consists of clean or noisy files. In addition, the CNN classifier performs particularly well. The model is built with the multi-scene training method, which improves its discriminative ability in noisy environments compared with single-scenario training. The average CNN accuracy for clean speech files on the CKC speech database (CKC-SD) and the TIMIT Recaptured Database (TIMIT-RD) increased from 95.47% and 97.89% to 97.08% and 99.29%, respectively. For noisy speech files with both seen and unseen noise types, performance improved greatly, with most recognition rates exceeding 90%. The source identification system in this paper is therefore robust to noise.
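
For concreteness, the sketch below shows how CQT-domain band statistics might be pooled into a per-recording feature vector and fed to one of the classifiers mentioned above (an SVM). The pooling scheme, file paths, labels, and CQT settings are illustrative assumptions, not the paper's exact feature definition.

```python
# A minimal, illustrative sketch of CQT-based spectral features for device
# classification. The exact "spectral distribution feature" of the paper is
# not specified in the abstract; here we simply pool per-band magnitude
# statistics as a stand-in, and the file paths/labels are hypothetical.
import numpy as np
import librosa
from sklearn.svm import SVC

def cqt_band_stats(path, sr=16000, n_bins=84, bins_per_octave=12):
    """Load audio and summarize the CQT magnitude in each frequency band."""
    y, sr = librosa.load(path, sr=sr)
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins,
                           bins_per_octave=bins_per_octave))
    # Per-band mean and standard deviation over time -> 2 * n_bins features.
    return np.concatenate([C.mean(axis=1), C.std(axis=1)])

# Hypothetical training data: one recording per (file, device-label) pair.
files = ["phoneA_001.wav", "phoneB_001.wav"]   # assumed paths
labels = [0, 1]                                # assumed device ids
X = np.stack([cqt_band_stats(f) for f in files])
clf = SVC(kernel="rbf").fit(X, labels)
```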

21 citations

01 Jan 2013
TL;DR: This thesis presents two posteriorgram-based speech representations that enable speaker-independent spoken term matching in noisy conditions, and two lower-bounding methods for Dynamic Time Warping (DTW) based pattern matching algorithms.
Abstract: This thesis is motivated by the challenge of searching and extracting useful information from speech data in a completely unsupervised setting. In many real-world speech processing problems, obtaining annotated data is neither cost- nor time-effective. We therefore ask how much we can learn from speech data without any transcription. To address this question, this thesis uses query-by-example spoken term detection as a specific scenario to demonstrate that the task can be done in an unsupervised setting without any annotations. To build the unsupervised spoken term detection framework, we contribute three main techniques that form a complete workflow. First, we present two posteriorgram-based speech representations that enable speaker-independent spoken term matching in noisy conditions. The feasibility and effectiveness of both posteriorgram features are demonstrated through a set of spoken term detection experiments on different datasets. Second, we show two lower-bounding methods for Dynamic Time Warping (DTW) based pattern matching algorithms. Both algorithms greatly outperform conventional DTW in a single-threaded computing environment. Third, we describe a parallel implementation of the lower-bounded DTW search algorithm. Experimental results indicate that the total running time of the entire spoken term detection system grows linearly with corpus size. We also present the training of large Deep Belief Networks (DBNs) on Graphics Processing Units (GPUs). A phonetic classification experiment on the TIMIT corpus showed a speed-up of 36x for pre-training and 45x for back-propagation for a two-layer DBN trained on the GPU platform compared to the CPU platform. Thesis Supervisor: James R. Glass. Title: Senior Research Scientist.
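
As a rough illustration of the matching step, the following sketch computes a plain DTW alignment cost between two posteriorgrams; the frame distance used here and the omission of lower bounding and GPU parallelization are simplifying assumptions rather than the thesis's exact formulation.

```python
# A minimal sketch of DTW matching between two posteriorgrams (frames x
# phone-classes), as used in query-by-example spoken term detection. The
# inner-product distance below is one common choice, not necessarily the
# thesis's exact definition.
import numpy as np

def posteriorgram_dtw(P, Q, eps=1e-10):
    """Return the DTW alignment cost between posteriorgrams P (m x d) and Q (n x d)."""
    m, n = P.shape[0], Q.shape[0]
    # Frame-pair distance: -log of the inner product of posterior vectors.
    dist = -np.log(np.clip(P @ Q.T, eps, None))
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    return D[m, n]
```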

21 citations

Proceedings ArticleDOI
18 Jul 2010
TL;DR: A new feature-extraction algorithm for speaker recognition is proposed that combines traditional MFCCs and dynamic MFCCs into a new series of coefficients; the coefficients are weighted as front-end parameters of the GMM, decreasing the dimension of the weighted GMM and reducing computational complexity.
Abstract: In this paper, a new feature parameter extraction algorithm is proposed for speaker recognition systems, which combines the traditional MFCC and the dynamic MFCC into a new series of coefficients. Based on a statistical analysis of the different contributions of the dynamic and traditional MFCC, these coefficients are weighted as front-end parameters of the GMM, which decreases the dimension of the mixed weighted GMM and reduces computational complexity. Experiments on the TIMIT and VOA speech databases were implemented in the MATLAB environment, and the results showed that a speaker recognition system with the Weighted Dynamic MFCC can obtain better performance, with a high recognition rate and low computational complexity.
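
The sketch below illustrates the general idea under stated assumptions: static MFCCs and their delta (dynamic) coefficients are stacked with per-stream weights and modeled by a per-speaker GMM. The weights, model size, and file paths are placeholders, not the paper's values.

```python
# A minimal sketch: weighted combination of static and delta MFCCs with a
# GMM speaker model. The 0.6/0.4 weights and the paths are hypothetical.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def weighted_dynamic_mfcc(path, sr=16000, n_mfcc=13, w_static=0.6, w_delta=0.4):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    # Weight the two streams and stack them frame by frame.
    return np.vstack([w_static * mfcc, w_delta * delta]).T  # frames x 2*n_mfcc

# Hypothetical enrollment: fit one GMM per speaker on that speaker's audio.
feats = weighted_dynamic_mfcc("speaker1_train.wav")          # assumed path
speaker_model = GaussianMixture(n_components=16, covariance_type="diag").fit(feats)

# Scoring a test utterance: higher average log-likelihood -> better match.
score = speaker_model.score(weighted_dynamic_mfcc("test.wav"))  # assumed path
```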

21 citations

Journal ArticleDOI
TL;DR: This study aims to establish a small text-independent speaker recognition system for a relatively small group of speakers on a sound stage, using a direct Deep Neural Network (DNN)-based approach in which the posterior probabilities of the output layer are used to determine the speaker's presence.
Abstract: This study aims to establish a small text-independent speaker recognition system for a relatively small group of speakers on a sound stage. The problem is motivated by the International Space Station (ISS) requirement to detect whether astronauts are speaking at a specific time. In this work, we employ machine learning; specifically, we use a direct Deep Neural Network (DNN)-based approach in which the posterior probabilities of the output layer are used to determine the speaker's presence. In line with the small-footprint design objective, a simple DNN model with only as many hidden layers and hidden units per layer as necessary was designed, reducing the parameter cost while avoiding the usual overfitting problem and optimizing algorithmic aspects such as context-based training, activation functions, validation, and learning rate. The reference model was tested on two commercially available databases, the clean-speech TIMIT corpus and the HTIMIT multi-handset communication database, together with a noise-added TIMIT framework using four sound categories at three distinct signal-to-noise ratios. Briefly, we used a dynamic pruning method in which all layers are pruned simultaneously and the pruning mechanism is reassigned. The usefulness of this approach was evaluated on all of the above databases.
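
A minimal sketch of the general recipe is shown below, under stated assumptions: a small fully connected DNN whose linear layers are pruned jointly by global magnitude pruning. The layer sizes, input dimension, and sparsity level are illustrative, and PyTorch's built-in pruning utility stands in loosely for the paper's dynamic pruning method.

```python
# A minimal sketch of a small-footprint DNN for frame-level speaker detection,
# with magnitude pruning applied jointly to all layers. Layer sizes, the 40-dim
# input features, and the 30% sparsity are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

class SmallSpeakerDNN(nn.Module):
    def __init__(self, n_in=40, n_hidden=128, n_speakers=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_speakers),   # posteriors over speakers
        )

    def forward(self, x):
        return self.net(x)

model = SmallSpeakerDNN()
# Prune 30% of the smallest weights across all linear layers at once,
# loosely mirroring the "all layers pruned simultaneously" idea.
params = [(m, "weight") for m in model.net if isinstance(m, nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.3)
```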

21 citations

Proceedings ArticleDOI
17 May 2004
TL;DR: A detector is described that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech, and the results are compared to a baseline system.
Abstract: In this paper, the states in the speech production process are defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. We extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for the place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
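
To make the decoding step concrete, the sketch below runs a simple Viterbi search over per-frame class log-posteriors with a single class-switch penalty; this penalty is only a loose, assumed stand-in for the language and duration constraints applied during the paper's lattice rescoring.

```python
# A minimal sketch of Viterbi decoding over per-frame class posteriors from an
# articulatory-feature classifier. The switch penalty is an assumption, not
# the paper's language/duration model.
import numpy as np

def viterbi_decode(log_post, switch_penalty=2.0):
    """log_post: (T frames x K classes) log posteriors; returns the best class path."""
    T, K = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # Staying in the same class costs nothing; switching costs a penalty.
        trans = np.full((K, K), -switch_penalty) + np.eye(K) * switch_penalty
        cand = score[:, None] + trans            # K x K: prev class -> new class
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```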

21 citations


Network Information
Related Topics (5)
Recurrent neural network
29.2K papers, 890K citations
76% related
Feature (machine learning)
33.9K papers, 798.7K citations
75% related
Feature vector
48.8K papers, 954.4K citations
74% related
Natural language
31.1K papers, 806.8K citations
73% related
Deep learning
79.8K papers, 2.1M citations
72% related
Performance
Metrics
No. of papers in the topic in previous years
Year | Papers
2023 | 24
2022 | 62
2021 | 67
2020 | 86
2019 | 77
2018 | 95