
Showing papers on "TIMIT published in 2017"


Proceedings Article
15 Feb 2017
TL;DR: It is found that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
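
Both regularizers described above can be written down compactly. The sketch below (PyTorch; the weight names `beta` and `epsilon` are illustrative, not taken from the paper) shows a confidence-penalty loss that subtracts the output-distribution entropy from the cross-entropy, and a label-smoothing loss that mixes the one-hot target with the uniform distribution.

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy minus beta times the entropy of the output distribution."""
    ce = F.cross_entropy(logits, targets)
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - beta * entropy  # low-entropy (over-confident) outputs raise the loss

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against (1 - epsilon) * one-hot + epsilon * uniform targets."""
    n_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, epsilon / n_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / n_classes)
    return -(smooth * log_p).sum(dim=-1).mean()

logits, targets = torch.randn(16, 10), torch.randint(0, 10, (16,))
print(confidence_penalty_loss(logits, targets).item(),
      label_smoothing_loss(logits, targets).item())
```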

617 citations


Posted Content
TL;DR: In this article, the authors explore regularizing neural networks by penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning and acts as a strong regularizer in supervised learning.
Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.

519 citations


Proceedings Article
15 Nov 2017
TL;DR: This work unifies successful ideas from recently proposed architectures into a stochastic recurrent model that achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST.
Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortised variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables.

105 citations


Posted Content
TL;DR: In this paper, the authors propose an end-to-end speech framework for sequence labeling that combines hierarchical CNNs with Connectionist Temporal Classification (CTC) directly, without recurrent connections.
Abstract: Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
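
As a rough illustration of the recurrence-free design described above, the sketch below (PyTorch) stacks 1-D convolutions over the frame axis and trains with CTC; the layer widths, kernel sizes, and the 62-symbol output (61 TIMIT phones plus the CTC blank) are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvCTC(nn.Module):
    def __init__(self, n_feats=40, n_symbols=62):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feats, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_symbols, kernel_size=1),
        )

    def forward(self, feats):                      # feats: (batch, n_feats, frames)
        logits = self.net(feats)                   # (batch, n_symbols, frames)
        return logits.permute(2, 0, 1).log_softmax(-1)   # (frames, batch, n_symbols)

model, ctc = ConvCTC(), nn.CTCLoss(blank=0)
log_probs = model(torch.randn(8, 40, 300))         # dummy batch of feature frames
targets = torch.randint(1, 62, (8, 40))            # dummy phone label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 300, dtype=torch.long),
           target_lengths=torch.full((8,), 40, dtype=torch.long))
```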

57 citations


Proceedings ArticleDOI
Andrew Rosenberg, Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, Michael Picheny
05 Mar 2017
TL;DR: This work investigates the use of Connectionist Temporal Classification networks and recurrent encoder-decoders with attention, two end-to-end ASR systems, for keyword search and speech recognition on low-resource languages.
Abstract: In recent years, so-called “end-to-end” speech recognition systems have emerged as viable alternatives to traditional ASR frameworks. Keyword search, localizing an orthographic query in a speech corpus, is typically performed by using automatic speech recognition (ASR) to generate an index. Previous work has evaluated the use of end-to-end systems for ASR on well-known corpora (WSJ, Switchboard, TIMIT, etc.) in high-resource languages like English and Mandarin. In this work, we investigate the use of Connectionist Temporal Classification (CTC) networks and recurrent encoder-decoders with attention, two end-to-end ASR systems, for keyword search and speech recognition on low-resource languages. We find end-to-end systems can generate high-quality 1-best transcripts on low-resource languages, but, because they generate very sharp posteriors, their utility is limited for KWS. We explore a number of ways to address this limitation with modest success. Experimental results reported are based on the IARPA BABEL OP3 languages and evaluation framework. This paper represents the first results using “end-to-end” techniques for speech recognition and keyword search on low-resource languages.

52 citations


Posted Content
TL;DR: This paper proposes a stochastic recurrent model in which each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps; training is performed with amortized variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence.
Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables. Source Code: \url{this https URL}

42 citations



Proceedings ArticleDOI
07 Mar 2017
TL;DR: This work proposes to use a feature representation obtained by pairwise learning in a low-resource language for query-by-example spoken term detection (QbE-STD), and extracts features from an internal hidden layer of the pairwise-trained AE to perform acoustic pattern matching for QbE-STD.
Abstract: We propose to use a feature representation obtained by pairwise learning in a low-resource language for query-by-example spoken term detection (QbE-STD). We assume that word pairs identified by humans are available in the low-resource target language. The word pairs are parameterized by a multi-lingual bottleneck feature (BNF) extractor that is trained using transcribed data in high-resource languages. The multi-lingual BNFs of the word pairs are used as an initial feature representation to train an autoencoder (AE). We extract features from an internal hidden layer of the pairwise trained AE to perform acoustic pattern matching for QbE-STD. Our experiments on the TIMIT and Switchboard corpora show that the pairwise learning brings 7.61% and 8.75% relative improvements in mean average precision (MAP) respectively over the initial feature representation.

31 citations


Proceedings Article
01 Dec 2017
TL;DR: In this paper, a deterministic feature map is proposed to approximate the kernel in the frequency domain using Gaussian quadrature, which is faster to generate and achieves accuracy comparable to state-of-the-art kernel methods based on random Fourier features.
Abstract: Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that O(ε^-2) samples are required to achieve an approximation error of at most ε. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any γ > 0, to achieve error ε with O(e^(eγ) + ε^(-1/γ)) samples as ε goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.
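
A minimal numerical illustration of the contrast drawn above: random Fourier features sample frequencies from the Gaussian spectral density, while a deterministic map places them on Gauss-Hermite quadrature nodes. The tensor-product grid below only scales to a few input dimensions and is an illustrative simplification of the paper's construction.

```python
import itertools
import numpy as np

def random_fourier_features(X, n_features, rng=np.random.default_rng(0)):
    W = rng.standard_normal((X.shape[1], n_features))      # w ~ N(0, I)
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features)

def gauss_hermite_features(X, nodes_per_dim=5):
    d = X.shape[1]
    x1, w1 = np.polynomial.hermite.hermgauss(nodes_per_dim)
    nodes = np.sqrt(2.0) * x1              # rescale nodes for the N(0, 1) weight
    weights = w1 / np.sqrt(np.pi)
    W, A = [], []
    for idx in itertools.product(range(nodes_per_dim), repeat=d):
        W.append([nodes[i] for i in idx])
        A.append(np.prod([weights[i] for i in idx]))
    W, A = np.array(W).T, np.sqrt(np.array(A))             # (d, m) grid, (m,) weights
    proj = X @ W
    return np.hstack([A * np.cos(proj), A * np.sin(proj)])

X = np.random.default_rng(1).standard_normal((4, 2))
K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
Z = gauss_hermite_features(X)
print(np.max(np.abs(Z @ Z.T - K_exact)))    # approximation error of the feature map
```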

31 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: A recently developed deep learning model, the recurrent convolutional neural network (RCNN), is proposed for speech processing; it inherits some merits of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) and is competitive with previous methods in terms of accuracy and efficiency.
Abstract: Different neural networks have exhibited excellent performance on various speech processing tasks, and they usually have specific advantages and disadvantages. We propose to use a recently developed deep learning model, the recurrent convolutional neural network (RCNN), for speech processing, which inherits some merits of recurrent neural networks (RNN) and convolutional neural networks (CNN). The core module can be viewed as a convolutional layer embedded with an RNN, which enables the model to capture both temporal and frequency dependencies in the spectrogram of the speech in an efficient way. The model is tested on the TIMIT corpus for phoneme recognition and on IEMOCAP for emotion recognition. Experimental results show that the model is competitive with previous methods in terms of accuracy and efficiency.

30 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This work presents a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments, which achieves encouraging performance on TIMIT and Wall Street Journal speech recognition datasets.
Abstract: Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

Journal ArticleDOI
TL;DR: An effective algorithm is proposed for the automatic speech recognition task using speech trajectories reconstructed in the phase space; useful features are extracted from the recurrence plot (RP) of the speech signals embedded in the reconstructed phase space (RPS) by applying a two-dimensional wavelet transform to the resulting RP diagrams.
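
For reference, the two ingredients named in this summary, time-delay embedding into a reconstructed phase space and the recurrence plot computed from it, can be sketched in a few lines (numpy); the embedding dimension, delay, and distance threshold below are illustrative.

```python
import numpy as np

def time_delay_embed(x, dim=3, delay=2):
    """Embed a 1-D signal into `dim`-dimensional phase-space points."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

def recurrence_plot(x, dim=3, delay=2, eps=0.1):
    Y = time_delay_embed(x, dim, delay)                  # points in phase space
    dists = np.linalg.norm(Y[:, None] - Y[None], axis=-1)
    return (dists < eps).astype(np.uint8)                # binary recurrence matrix

rp = recurrence_plot(np.sin(np.linspace(0, 20 * np.pi, 400)))
print(rp.shape)   # (396, 396) with the defaults above
```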

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this article, an unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks is proposed for phonemic segmentation of speech, which consists in analyzing the error profile of a model trained to predict speech features frame-by-frame.
Abstract: Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists in analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.
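
A minimal sketch of the boundary-hypothesis step described above, assuming frame-wise prediction errors have already been computed by some predictor: smooth the error curve and take its local maxima as candidate phone boundaries (numpy/scipy; the smoothing width and peak spacing are illustrative).

```python
import numpy as np
from scipy.signal import find_peaks

def hypothesize_boundaries(frame_errors, smooth=3, min_distance=3):
    # moving-average smoothing, then peak picking on the error curve
    err = np.convolve(frame_errors, np.ones(smooth) / smooth, mode="same")
    peaks, _ = find_peaks(err, distance=min_distance)
    return peaks    # frame indices of hypothesized boundaries

# toy error curve standing in for the predictor's frame-wise errors
errors = np.abs(np.random.default_rng(0).standard_normal(200)).cumsum() % 5
print(hypothesize_boundaries(errors)[:10])
```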

Journal ArticleDOI
TL;DR: A theoretical framework and an experimental evaluation are presented showing that reducing the dimension of features by applying the discrete Karhunen–Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance compared to conventional MFCC features.
Abstract: Speaker identification plays a crucial role in biometric person identification as systems based on human speech are increasingly used for the recognition of people. Mel frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture the speech-specific characteristics with a reduced dimensionality. However, although their ability to decorrelate the vocal source and the vocal tract filter makes them suitable for speech recognition, they greatly mitigate the speaker variability, a specific characteristic that distinguishes different speakers. This paper presents a theoretical framework and an experimental evaluation showing that reducing the dimension of features by applying the discrete Karhunen–Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance compared to conventional MFCC features. In particular, with short sequences of speech frames, with typical duration of less than 2 s, the performance of the truncated DKLT representation for the identification of five speakers is always better than that achieved with MFCCs in the experiments we performed. Additionally, the framework was tested on up to 100 TIMIT speakers with sequences of less than 3.5 s, showing very good recognition capabilities.
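
A compact sketch of the truncated DKLT feature extraction described above (numpy): estimate the covariance of frame log-spectra, keep the leading eigenvectors, and project each frame onto them. The FFT size, frame length, and number of retained components are illustrative.

```python
import numpy as np

def truncated_dklt(frames, n_keep=20, n_fft=512):
    """frames: (n_frames, frame_len) array of windowable speech frames."""
    log_spec = np.log(np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]),
                                         n=n_fft)) + 1e-10)
    centered = log_spec - log_spec.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1][:, :n_keep]        # leading eigenvectors (largest eigenvalues)
    return centered @ basis                     # (n_frames, n_keep) DKLT features

feats = truncated_dklt(np.random.default_rng(0).standard_normal((100, 400)))
print(feats.shape)   # (100, 20)
```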

Proceedings ArticleDOI
01 Dec 2017
TL;DR: Experimental results show that the proposed method can detect the end-points of a voice signal more accurately and outperforms conventional VAD algorithms.
Abstract: In this paper, an efficient algorithm for classifying voiced segments from silence and unvoiced segments is proposed, which is both more accurate and easier to implement than some previous algorithms. The proposed algorithm uses spectral entropy together with short-time features such as the zero-crossing rate, short-time energy, and linear prediction error for voice activity detection (VAD). A compound parameter, D, is calculated from these four parameters, and Dmax is calculated over all frames of the signal. The value of D/Dmax is then used to classify frames as speech, non-speech, or silence. The threshold values have to be obtained empirically. Experimental results show that the method can detect the end-points of a voice signal more accurately and outperforms conventional VAD algorithms. The method was evaluated on the TIMIT Acoustic-Phonetic Continuous Speech Corpus, which is mostly used for speech recognition applications and contains clean speech data, and is compared with some of the most recently proposed algorithms.
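
The abstract names the four frame features but not the formula that combines them into D, so the sketch below (numpy) computes the features and uses an assumed combination and threshold `theta` purely for illustration.

```python
import numpy as np

def frame_features(frame, n_fft=512):
    energy = np.mean(frame ** 2)                               # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2         # zero-crossing rate
    p = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    p = p / (p.sum() + 1e-12)
    entropy = -np.sum(p * np.log(p + 1e-12))                   # spectral entropy
    a = np.dot(frame[1:], frame[:-1]) / (np.dot(frame[:-1], frame[:-1]) + 1e-12)
    lp_error = np.mean((frame[1:] - a * frame[:-1]) ** 2)      # 1st-order LP error
    return energy, zcr, entropy, lp_error

def vad(frames, theta=0.1):
    f = np.array([frame_features(fr) for fr in frames])
    # Assumed combination: high energy with low entropy/ZCR/LP error favours speech.
    D = f[:, 0] / ((f[:, 2] + 1e-12) * (f[:, 3] + 1e-12) * (1.0 + f[:, 1]))
    return (D / D.max()) > theta                               # speech / non-speech mask

mask = vad(np.random.default_rng(0).standard_normal((50, 400)))
print(int(mask.sum()), "of", mask.size, "frames flagged as speech")
```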

Posted Content
TL;DR: In this article, a feed-forward neural network framework is proposed for text-independent speaker classification and verification; it achieves a 100% classification rate and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively.
Abstract: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Activity Detection (VAD) than the regular one for speech recognition ensure that a stronger voiced portion is extracted for speaker recognition, while speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search and dynamically reduced regularization parameters are used to avoid training terminating in a local minimum, enabling training to go further at lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with an 8 kHz sampling rate is used here. The first 200 male speakers are used to train and test the classification performance. Their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This work studies the influence of various activation functions on speech recognition systems, and it is observed that the performance of ReLU networks is superior to the other networks for the smaller dataset (i.e., the TIMIT dataset).
Abstract: Significant developments in deep learning methods have been achieved with the capability to train deeper networks. The performance of speech recognition systems has been greatly improved by the use of deep learning techniques. Most of the developments in deep learning are associated with the development of new activation functions and the corresponding initializations. The development of Rectified Linear Units (ReLU) has revolutionized the use of supervised deep learning methods for speech recognition. Recently there has been a great deal of research interest in the development of the activation functions Leaky-ReLU (LReLU), Parametric-ReLU (PReLU), Exponential Linear Units (ELU), and Parametric-ELU (PELU). This work is aimed at studying the influence of various activation functions on speech recognition systems. In this work, a hidden Markov model-deep neural network (HMM-DNN) based speech recognition system is used, where deep neural networks with different activation functions are employed to obtain the emission probabilities of the hidden Markov model. Two datasets, i.e., TIMIT and WSJ, are employed to study the behavior of the speech recognition systems on datasets of different sizes. It is observed that the performance of ReLU networks is superior to the other networks for the smaller dataset (i.e., TIMIT), while for datasets of sufficiently large size (i.e., WSJ) the performance of ELU networks is superior.
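
For reference, the activation functions compared above are written out below (numpy); `a` and `b` are the slope/scale hyperparameters, fixed here but learned in the parametric variants (PReLU, PELU).

```python
import numpy as np

def relu(x):                return np.maximum(0.0, x)
def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)
def prelu(x, a=0.25):       return np.where(x > 0, x, a * x)                 # `a` is learned
def elu(x, a=1.0):          return np.where(x > 0, x, a * (np.exp(x) - 1))
def pelu(x, a=1.0, b=1.0):  return np.where(x > 0, (a / b) * x,
                                            a * (np.exp(x / b) - 1))         # `a`, `b` learned

x = np.linspace(-3.0, 3.0, 7)
for f in (relu, leaky_relu, prelu, elu, pelu):
    print(f.__name__, np.round(f(x), 2))
```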

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this paper, a convolutional auto-encoder model was proposed as a neural network alternative to convolutive non-negative matrix factorization (CNMF), and a recurrent neural network (RNN) was explored in the encoder.
Abstract: The convolutive non-negative matrix factorization (NMF) model factorizes a given audio spectrogram using frequency templates with a temporal dimension. In this paper, we present a convolutional auto-encoder model that acts as a neural network alternative to convolutive NMF. Using the modeling flexibility granted by neural networks, we also explore the idea of using a recurrent neural network in the encoder. Experimental results on speech mixtures from the TIMIT dataset indicate that the convolutive architecture provides a significant improvement in separation performance in terms of BSS Eval metrics.

Journal ArticleDOI
TL;DR: The developed MFA model is able to enhance the security of the systems against spoofing and communication attacks while improving recognition performance.
Abstract: In this paper, a Multi-Factor Authentication (MFA) method is developed by combining a Personal Identification Number (PIN), a One Time Password (OTP), and speaker biometrics through speech watermarks. For this purpose, multipurpose digital speech watermarking is applied to embed semi-fragile and robust watermarks simultaneously in the speech signal, to provide tamper detection and proof of ownership, respectively. For the blind semi-fragile watermark, the Discrete Wavelet Packet Transform (DWPT) and Quantization Index Modulation (QIM) are used to embed the watermark in an angle of the wavelet sub-bands where more speaker-specific information is available. For copyright protection of the speech, blind and robust speech watermarking is applied using DWPT and multiplication; where less speaker-specific information is available, the robust watermark is embedded by manipulating the amplitude of the wavelet sub-bands. Experimental results on TIMIT, MIT, and MOBIO demonstrate that there is a trade-off among the recognition performance of speaker recognition systems, robustness, and capacity, which is illustrated by various triangles. Furthermore, a threat model and attack analysis are used to evaluate the feasibility of the developed MFA model. Accordingly, the developed MFA model is able to enhance the security of the systems against spoofing and communication attacks while improving recognition performance.
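
As a point of reference for the semi-fragile embedding mentioned above, the sketch below (numpy) shows plain quantization index modulation (QIM): each coefficient is snapped onto one of two interleaved lattices depending on the bit to embed. Applying it directly to raw coefficients, rather than to DWPT sub-band angles as in the paper, and the step size `delta` are simplifications.

```python
import numpy as np

def qim_embed(coeffs, bits, delta=0.05):
    # bit 0 -> lattice {k * delta}, bit 1 -> lattice shifted by delta / 2
    offsets = bits * (delta / 2.0)
    return delta * np.round((coeffs - offsets) / delta) + offsets

def qim_extract(coeffs, delta=0.05):
    d0 = np.abs(coeffs - delta * np.round(coeffs / delta))
    d1 = np.abs(coeffs - (delta * np.round((coeffs - delta / 2) / delta) + delta / 2))
    return (d1 < d0).astype(int)          # pick the closer lattice

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 64)
marked = qim_embed(rng.standard_normal(64), bits, delta=0.05)
print(np.array_equal(qim_extract(marked), bits))   # True in the noise-free case
```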

Journal ArticleDOI
TL;DR: Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly across all noise conditions in mismatched situations.
Abstract: This letter investigates the use of subglottal resonances (SGRs) for noise-robust speaker identification (SID). It is motivated by the speaker specificity and stationarity of subglottal acoustics, and the development of noise-robust SGR estimation algorithms which are reliable at low signal-to-noise ratios for large datasets. A two-stage framework is proposed which combines the SGRs with different cepstral features. The cepstral features are used in the first stage to reduce the number of target speakers for a test utterance, and then SGRs are used as complementary second-stage features to conduct identification. Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly (2%–6% absolute accuracy improvement) across all noise conditions in mismatched situations.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition, that achieves a 100% classification rate and less than 6% Equal Error Rate (EER) using merely about 1 second and 5 seconds of data respectively.
Abstract: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Activity Detection (VAD) than the regular one for speech recognition ensure that a stronger voiced portion is extracted for speaker recognition, while speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search and dynamically reduced regularization parameters are used to avoid training terminating in a local minimum, enabling training to go further at lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with an 8 kHz sampling rate is used here. The first 200 male speakers were used to train and test the classification performance. Their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.

Journal ArticleDOI
TL;DR: As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
Abstract: In this study, a speaker identification system is considered consisting of a feature extraction stage which utilizes both power normalized cepstral coefficients (PNCCs) and Mel frequency cepstral coefficients (MFCC). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712 type handset) upon identification performance. In particular, three NSN types with varying signal to noise ratios (SNRs) were tested corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely, mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; and 120 speakers were selected from each database to yield 3600 speech utterances. As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
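
A minimal sketch of the late (score-level) fusion rules compared above (numpy), assuming each system has already produced a calibrated score per enrolled speaker; the weights in the weighted-sum rule are illustrative.

```python
import numpy as np

def fuse(scores, rule="mean", weights=None):
    """scores: (n_systems, n_speakers) array of per-system scores."""
    if rule == "mean":
        fused = scores.mean(axis=0)
    elif rule == "max":
        fused = scores.max(axis=0)
    elif rule == "weighted":
        w = np.asarray(weights, dtype=float)
        fused = (w[:, None] * scores).sum(axis=0) / w.sum()
    else:
        raise ValueError(rule)
    return int(np.argmax(fused)), fused    # identified speaker index, fused scores

scores = np.array([[0.2, 0.7, 0.1],        # e.g. an MFCC-based system
                   [0.3, 0.5, 0.2]])       # e.g. a PNCC-based system
print(fuse(scores, "weighted", weights=[0.6, 0.4])[0])
```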

Proceedings ArticleDOI
01 May 2017
TL;DR: Four systems based on different speech features were combined at the score level to improve verification accuracy under clean and noisy speech conditions, reducing the equal error rate in some cases by up to 44%.
Abstract: So far, many methods have been proposed for speaker verification which provide good results, but their performance degrades in actual noisy environments. A common approach to partially alleviate this problem is the fusion of several methods. In this paper, four systems based on different speech features, i.e., MFCC, IMFCC, LFCC, and PNCC, were combined at the score level to improve verification accuracy under clean and noisy speech conditions. Pairwise and four-way fusion of the features were evaluated in a speaker verification system based on speaker modeling with a Gaussian mixture model (GMM). The TIMIT and NOISEX-92 databases were used as the speech and noise datasets, respectively. The experimental results show that score-level fusion of different feature vectors enhances the accuracy of the speaker verification system, reducing the equal error rate in some cases by up to 44%.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this article, the authors proposed several types of residual LSTM methods for acoustic modeling, and compared with classic LSTMs, their architecture showed more than 8% relative reduction in Phone Error Rate (PER) on TIMIT tasks.
Abstract: Long Short-Term Memory (LSTM) is the primary recurrent neural network architecture for acoustic modeling in automatic speech recognition systems. Residual learning is an efficient method to help neural networks converge more easily and quickly. In this paper, we propose several types of residual LSTM methods for acoustic modeling. Our experiments indicate that, compared with classic LSTM, our architecture shows more than 8% relative reduction in Phone Error Rate (PER) on TIMIT tasks. At the same time, our residual fast LSTM approach shows a 4% relative reduction in PER on the same task. In addition, we find that this architecture also achieves good results on the THCHS-30, LibriSpeech, and Switchboard corpora.
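
One plausible reading of a residual LSTM layer, sketched below in PyTorch: the layer's output is its LSTM output plus a (projected) shortcut of the input. Published residual-LSTM variants place the shortcut differently, so this is an illustrative instantiation rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualLSTMLayer(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.proj = (nn.Identity() if input_size == hidden_size
                     else nn.Linear(input_size, hidden_size))

    def forward(self, x):                  # x: (batch, time, input_size)
        out, _ = self.lstm(x)
        return out + self.proj(x)          # residual (shortcut) connection

layer = ResidualLSTMLayer(40, 40)
print(layer(torch.randn(2, 100, 40)).shape)   # torch.Size([2, 100, 40])
```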

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work trains a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data and exceeds the clustering performance of all previous approaches on the well-known TIMIT dataset.
Abstract: Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification are able to extract features from spectrograms which can be used for speaker clustering. These features are represented by the activations of a certain hidden layer and are called embeddings. However, previous approaches require plenty of additional speaker data to learn the embedding, and although the clustering results are then on par with more traditional approaches using MFCC features etc., room for improvement stems from the fact that these embeddings are trained with a surrogate task that is rather far away from segregating unknown voices, namely identifying a few specific speakers. We address both problems by training a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data. We demonstrate our approach on the well-known TIMIT dataset that has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches, but require just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.
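
A rough sketch of the weak-label training signal described above (PyTorch): pairs of spectrogram snippets labelled only as same/different speaker drive a contrastive loss on the embeddings, with no speaker identities involved. The toy encoder and the margin value are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy spectrogram encoder producing 64-dimensional embeddings
encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    d = F.pairwise_distance(emb_a, emb_b)
    # pull same-speaker pairs together, push different-speaker pairs apart
    return torch.mean(same * d ** 2 + (1 - same) * F.relu(margin - d) ** 2)

specs_a, specs_b = torch.randn(8, 1, 128, 100), torch.randn(8, 1, 128, 100)
same = torch.randint(0, 2, (8,)).float()          # weak pair labels only
loss = contrastive_loss(encoder(specs_a), encoder(specs_b), same)
```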

Proceedings ArticleDOI
26 Oct 2017
TL;DR: The results show that the I-vector approach is better than the Gaussian Mixture Model-Universal Background Model for both clean and AWGN conditions without a handset; however, the GMM-UBM had better accuracy for the NSN types.
Abstract: In this paper, two models, the I-vector and the Gaussian Mixture Model-Universal Background Model (GMM-UBM), are compared for the speaker identification task. Four feature combinations of I-vectors with seven fusion techniques are considered: maximum, mean, weighted sum, cumulative, interleaving and concatenated for both two and four features. In addition, an Extreme Learning Machine (ELM) is exploited to identify speakers, and then Speaker Identification Accuracy (SIA) is calculated. Both systems are evaluated for 120 speakers from the TIMIT and NIST 2008 databases for clean speech. Furthermore, a comprehensive evaluation is made under Additive White Gaussian Noise (AWGN) conditions and with three types of Non Stationary Noise (NSN), both with and without handset effects for the TIMIT database. The results show that the I-vector approach is better than the GMM-UBM for both clean and AWGN conditions without a handset. However, the GMM-UBM had better accuracy for NSN types.

Journal ArticleDOI
TL;DR: A new model is proposed based on components called matched filters (MFs): instead of using a fixed filter bank for the entire speech signal, the proposed TOC is generated by adopting a pair of vowel and consonant MFs for each voiced speech frame.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This work proposes to improve the non-parametric Bayesian phone-loop model by incorporating a Hierarchical Pitman-Yor based bigram Language Model on top of the units' transitions, which shows an absolute improvement of 1–2% on the Normalized Mutual Information (NMI) metric.
Abstract: Recent work on Acoustic Unit Discovery (AUD) has led to the development of a non-parametric Bayesian phone-loop model where the prior over the probability of the phone-like units is assumed to be sampled from a Dirichlet Process (DP). In this work, we propose to improve this model by incorporating a Hierarchical Pitman-Yor based bigram Language Model on top of the units' transitions. This new model makes use of phonotactic context information but assumes a fixed number of units. To remedy this limitation we first train a DP phone-loop model to infer the number of units; then, the bigram phone-loop is initialized from the DP phone-loop and trained until convergence of its parameters. Results show an absolute improvement of 1–2% on the Normalized Mutual Information (NMI) metric. Furthermore, we show that, combined with Multilingual Bottleneck (MBN) features, the model yields the same or a higher NMI than an English phone recogniser trained on TIMIT.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work introduces an approach to calculate the parameter in the discriminative term adaptively from the discrepancy between target features, and shows improved robustness and better separation performance than the previous approach.
Abstract: Monaural source separation is an important research area which can help to improve the performance of several real-world applications, such as speech recognition and assisted living systems. Huang et al. proposed deep recurrent neural networks (DRNNs) with a discriminative objective function to improve the performance of source separation. However, the penalty factor in that objective function is selected randomly and empirically. We therefore introduce an approach to calculate the parameter in the discriminative term adaptively from the discrepancy between target features, so that the penalty factor changes with the inputs to improve separation performance. The proposed method is evaluated with different settings and architectures of neural networks. In these experiments, the TIMIT corpus is used as the database and the signal-to-distortion ratio (SDR) as the measurement. Compared with the previous approach, our method has improved robustness and better separation performance.
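
The discriminative separation objective referred to above has the general form below (numpy); the abstract does not give the mapping from target discrepancy to the penalty factor, so the normalized form of `gamma` here is an assumption for illustration only.

```python
import numpy as np

def discriminative_loss(est1, est2, tgt1, tgt2):
    diff = np.mean((tgt1 - tgt2) ** 2)
    # assumed adaptive penalty: larger when the two targets differ more
    gamma = diff / (diff + np.mean(tgt1 ** 2) + np.mean(tgt2 ** 2))
    recon = np.mean((est1 - tgt1) ** 2) + np.mean((est2 - tgt2) ** 2)
    confusion = np.mean((est1 - tgt2) ** 2) + np.mean((est2 - tgt1) ** 2)
    return recon - gamma * confusion       # reward reconstruction, penalize cross-talk

rng = np.random.default_rng(0)
t1, t2 = rng.standard_normal((2, 257))     # toy target spectra for the two sources
print(discriminative_loss(t1 + 0.1, t2 - 0.1, t1, t2))
```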

Journal ArticleDOI
TL;DR: The results of the experiments show that the proposed model obtains good performance in clean and noisy environments and is insensitive to low-quality speech, but the time cost of the model is high.
Abstract: Today, more and more people have benefited from speaker recognition. However, the accuracy of speaker recognition often drops off rapidly because of low-quality speech and noise. This paper proposes a new speaker recognition model based on wavelet packet entropy (WPE), i-vectors, and cosine distance scoring (CDS). In the proposed model, WPE transforms the speech into short-term spectral feature vectors (short vectors) and resists noise. An i-vector is generated from those short vectors and characterizes the speech to improve recognition accuracy. CDS quickly compares two i-vectors to give the recognition result. The proposed model is evaluated on the TIMIT speech database. The results of the experiments show that the proposed model obtains good performance in clean and noisy environments and is insensitive to low-quality speech, but the time cost of the model is high. To reduce the time cost, parallel computation is used.
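
For completeness, cosine distance scoring between a test i-vector and enrolled speaker i-vectors is a one-liner; the sketch below (numpy) scores a test vector against a toy set of enrolled vectors (dimensions and names are illustrative).

```python
import numpy as np

def cds_score(w_test, w_model):
    # cosine similarity between two i-vectors
    return float(w_test @ w_model /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_model) + 1e-12))

rng = np.random.default_rng(0)
enrolled = {"spk1": rng.standard_normal(400), "spk2": rng.standard_normal(400)}
test = enrolled["spk1"] + 0.1 * rng.standard_normal(400)   # noisy copy of spk1
scores = {spk: cds_score(test, w) for spk, w in enrolled.items()}
print(max(scores, key=scores.get), scores)   # best-scoring enrolled speaker
```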