
Showing papers on "TIMIT published in 2017"


Proceedings Article
15 Feb 2017
TL;DR: It is found that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
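
Both regularizers described above can be written down compactly. The sketch below (PyTorch; the weight names `beta` and `epsilon` are illustrative, not taken from the paper) shows a confidence-penalty loss that subtracts the output-distribution entropy from the cross-entropy, and a label-smoothing loss that mixes the one-hot target with the uniform distribution.

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy minus beta times the entropy of the output distribution."""
    ce = F.cross_entropy(logits, targets)
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - beta * entropy  # low-entropy (over-confident) outputs raise the loss

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against (1 - epsilon) * one-hot + epsilon * uniform targets."""
    n_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, epsilon / n_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / n_classes)
    return -(smooth * log_p).sum(dim=-1).mean()

logits, targets = torch.randn(16, 10), torch.randint(0, 10, (16,))
print(confidence_penalty_loss(logits, targets).item(),
      label_smoothing_loss(logits, targets).item())
```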

617 citations


Posted Content
TL;DR: In this article, the authors explore regularizing neural networks by penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning and acts as a strong regularizer in supervised learning.
Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.

519 citations


Proceedings Article
15 Nov 2017
TL;DR: This work unifies successful ideas from recently proposed architectures into a stochastic recurrent model that achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST.
Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortised variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables.

105 citations


Posted Content
TL;DR: In this paper, the authors propose an end-to-end speech framework for sequence labeling that combines hierarchical CNNs with Connectionist Temporal Classification (CTC) directly, without recurrent connections.
Abstract: Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
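
As a rough illustration of the recurrence-free design described above, the sketch below (PyTorch) stacks 1-D convolutions over the frame axis and trains with CTC; the layer widths, kernel sizes, and the 62-symbol output (61 TIMIT phones plus the CTC blank) are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvCTC(nn.Module):
    def __init__(self, n_feats=40, n_symbols=62):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feats, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_symbols, kernel_size=1),
        )

    def forward(self, feats):                      # feats: (batch, n_feats, frames)
        logits = self.net(feats)                   # (batch, n_symbols, frames)
        return logits.permute(2, 0, 1).log_softmax(-1)   # (frames, batch, n_symbols)

model, ctc = ConvCTC(), nn.CTCLoss(blank=0)
log_probs = model(torch.randn(8, 40, 300))         # dummy batch of feature frames
targets = torch.randint(1, 62, (8, 40))            # dummy phone label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 300, dtype=torch.long),
           target_lengths=torch.full((8,), 40, dtype=torch.long))
```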

57 citations


Proceedings ArticleDOI
Andrew Rosenberg, Kartik Audhkhasi, Abhinav Sethy, Bhuvana Ramabhadran, Michael Picheny
05 Mar 2017
TL;DR: This work investigates the use of Connectionist Temporal Classification networks and recurrent encoder-decoders with attention, two end-to-end ASR systems, for keyword search and speech recognition on low-resource languages.
Abstract: In recent years, so-called “end-to-end” speech recognition systems have emerged as viable alternatives to traditional ASR frameworks. Keyword search, localizing an orthographic query in a speech corpus, is typically performed by using automatic speech recognition (ASR) to generate an index. Previous work has evaluated the use of end-to-end systems for ASR on well-known corpora (WSJ, Switchboard, TIMIT, etc.) in high-resource languages like English and Mandarin. In this work, we investigate the use of Connectionist Temporal Classification (CTC) networks and recurrent encoder-decoders with attention, two end-to-end ASR systems, for keyword search and speech recognition on low-resource languages. We find end-to-end systems can generate high-quality 1-best transcripts on low-resource languages, but, because they generate very sharp posteriors, their utility is limited for KWS. We explore a number of ways to address this limitation with modest success. Experimental results reported are based on the IARPA BABEL OP3 languages and evaluation framework. This paper represents the first results using “end-to-end” techniques for speech recognition and keyword search on low-resource languages.

52 citations


Posted Content
TL;DR: This paper proposes a stochastic recurrent model in which each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps; training is performed with amortized variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence.
Abstract: Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables. Source Code: \url{this https URL}

42 citations



Proceedings ArticleDOI
07 Mar 2017
TL;DR: This work proposes to use a feature representation obtained by pairwise learning in a low-resource language for query-by-example spoken term detection (QbE-STD), and extracts features from an internal hidden layer of the pairwise-trained AE to perform acoustic pattern matching for QbE-STD.
Abstract: We propose to use a feature representation obtained by pairwise learning in a low-resource language for query-by-example spoken term detection (QbE-STD). We assume that word pairs identified by humans are available in the low-resource target language. The word pairs are parameterized by a multi-lingual bottleneck feature (BNF) extractor that is trained using transcribed data in high-resource languages. The multi-lingual BNFs of the word pairs are used as an initial feature representation to train an autoencoder (AE). We extract features from an internal hidden layer of the pairwise trained AE to perform acoustic pattern matching for QbE-STD. Our experiments on the TIMIT and Switchboard corpora show that the pairwise learning brings 7.61% and 8.75% relative improvements in mean average precision (MAP) respectively over the initial feature representation.

31 citations


Proceedings Article
01 Dec 2017
TL;DR: In this paper, a deterministic feature map is proposed to approximate the kernel in the frequency domain using Gaussian quadrature, which is faster to generate and achieves accuracy comparable to state-of-the-art kernel methods based on random Fourier features.
Abstract: Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that O(ε^-2) samples are required to achieve an approximation error of at most ε. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any γ > 0, to achieve error ε with O(e^(eγ) + ε^(-1/γ)) samples as ε goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.
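
A minimal numerical illustration of the contrast drawn above: random Fourier features sample frequencies from the Gaussian spectral density, while a deterministic map places them on Gauss-Hermite quadrature nodes. The tensor-product grid below only scales to a few input dimensions and is an illustrative simplification of the paper's construction.

```python
import itertools
import numpy as np

def random_fourier_features(X, n_features, rng=np.random.default_rng(0)):
    W = rng.standard_normal((X.shape[1], n_features))      # w ~ N(0, I)
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features)

def gauss_hermite_features(X, nodes_per_dim=5):
    d = X.shape[1]
    x1, w1 = np.polynomial.hermite.hermgauss(nodes_per_dim)
    nodes = np.sqrt(2.0) * x1              # rescale nodes for the N(0, 1) weight
    weights = w1 / np.sqrt(np.pi)
    W, A = [], []
    for idx in itertools.product(range(nodes_per_dim), repeat=d):
        W.append([nodes[i] for i in idx])
        A.append(np.prod([weights[i] for i in idx]))
    W, A = np.array(W).T, np.sqrt(np.array(A))             # (d, m) grid, (m,) weights
    proj = X @ W
    return np.hstack([A * np.cos(proj), A * np.sin(proj)])

X = np.random.default_rng(1).standard_normal((4, 2))
K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
Z = gauss_hermite_features(X)
print(np.max(np.abs(Z @ Z.T - K_exact)))    # approximation error of the feature map
```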

31 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: A recently developed deep learning model, the recurrent convolutional neural network (RCNN), is proposed for speech processing; it inherits some merits of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) and is competitive with previous methods in terms of accuracy and efficiency.
Abstract: Different neural networks have exhibited excellent performance on various speech processing tasks, and they usually have specific advantages and disadvantages. We propose to use a recently developed deep learning model, the recurrent convolutional neural network (RCNN), for speech processing, which inherits some merits of recurrent neural networks (RNN) and convolutional neural networks (CNN). The core module can be viewed as a convolutional layer embedded with an RNN, which enables the model to capture both temporal and frequency dependencies in the spectrogram of the speech in an efficient way. The model is tested on the TIMIT corpus for phoneme recognition and on IEMOCAP for emotion recognition. Experimental results show that the model is competitive with previous methods in terms of accuracy and efficiency.

30 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This work presents a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments, which achieves encouraging performance on TIMIT and Wall Street Journal speech recognition datasets.
Abstract: Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

Journal ArticleDOI
TL;DR: An effective algorithm is proposed for the automatic speech recognition task using speech trajectories reconstructed in the phase space; useful features are extracted from the recurrence plot (RP) of the speech signals embedded in the reconstructed phase space (RPS) by applying a two-dimensional wavelet transform to the resulting RP diagrams.
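
For reference, the two ingredients named in this summary, time-delay embedding into a reconstructed phase space and the recurrence plot computed from it, can be sketched in a few lines (numpy); the embedding dimension, delay, and distance threshold below are illustrative.

```python
import numpy as np

def time_delay_embed(x, dim=3, delay=2):
    """Embed a 1-D signal into `dim`-dimensional phase-space points."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

def recurrence_plot(x, dim=3, delay=2, eps=0.1):
    Y = time_delay_embed(x, dim, delay)                  # points in phase space
    dists = np.linalg.norm(Y[:, None] - Y[None], axis=-1)
    return (dists < eps).astype(np.uint8)                # binary recurrence matrix

rp = recurrence_plot(np.sin(np.linspace(0, 20 * np.pi, 400)))
print(rp.shape)   # (396, 396) with the defaults above
```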

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this article, an unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks is proposed for phonemic segmentation of speech, which consists in analyzing the error profile of a model trained to predict speech features frame-by-frame.
Abstract: Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists in analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.
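
A minimal sketch of the boundary-hypothesis step described above, assuming frame-wise prediction errors have already been computed by some predictor: smooth the error curve and take its local maxima as candidate phone boundaries (numpy/scipy; the smoothing width and peak spacing are illustrative).

```python
import numpy as np
from scipy.signal import find_peaks

def hypothesize_boundaries(frame_errors, smooth=3, min_distance=3):
    # moving-average smoothing, then peak picking on the error curve
    err = np.convolve(frame_errors, np.ones(smooth) / smooth, mode="same")
    peaks, _ = find_peaks(err, distance=min_distance)
    return peaks    # frame indices of hypothesized boundaries

# toy error curve standing in for the predictor's frame-wise errors
errors = np.abs(np.random.default_rng(0).standard_normal(200)).cumsum() % 5
print(hypothesize_boundaries(errors)[:10])
```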

Journal ArticleDOI
TL;DR: A theoretical framework and an experimental evaluation are presented showing that reducing the dimension of features by applying the discrete Karhunen–Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance compared to conventional MFCC features.
Abstract: Speaker identification plays a crucial role in biometric person identification as systems based on human speech are increasingly used for the recognition of people. Mel frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture the speech-specific characteristics with a reduced dimensionality. However, although their ability to decorrelate the vocal source and the vocal tract filter makes them suitable for speech recognition, they greatly mitigate the speaker variability, a specific characteristic that distinguishes different speakers. This paper presents a theoretical framework and an experimental evaluation showing that reducing the dimension of features by applying the discrete Karhunen–Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance compared to conventional MFCC features. In particular, with short sequences of speech frames, with typical duration of less than 2 s, the performance of the truncated DKLT representation for the identification of five speakers is always better than that achieved with MFCCs in the experiments we performed. Additionally, the framework was tested on up to 100 TIMIT speakers with sequences of less than 3.5 s, showing very good recognition capabilities.
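
A compact sketch of the truncated DKLT feature extraction described above (numpy): estimate the covariance of frame log-spectra, keep the leading eigenvectors, and project each frame onto them. The FFT size, frame length, and number of retained components are illustrative.

```python
import numpy as np

def truncated_dklt(frames, n_keep=20, n_fft=512):
    """frames: (n_frames, frame_len) array of windowable speech frames."""
    log_spec = np.log(np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]),
                                         n=n_fft)) + 1e-10)
    centered = log_spec - log_spec.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1][:, :n_keep]        # leading eigenvectors (largest eigenvalues)
    return centered @ basis                     # (n_frames, n_keep) DKLT features

feats = truncated_dklt(np.random.default_rng(0).standard_normal((100, 400)))
print(feats.shape)   # (100, 20)
```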

Proceedings ArticleDOI
01 Dec 2017
TL;DR: Experimental results show that the proposed method can detect the end-points of a voice signal more accurately and outperforms conventional VAD algorithms.
Abstract: In this paper, an efficient algorithm for classifying voiced segments from silence and unvoiced segments is proposed, which is both more accurate and easier to implement than some previous algorithms. The proposed algorithm uses spectral entropy together with short-time features such as the zero-crossing rate, short-time energy, and linear prediction error for voice activity detection (VAD). A compound parameter, D, is calculated from these four parameters, and Dmax is calculated over all frames of the signal. The value of D/Dmax is then used to classify frames as speech, non-speech, or silence. The threshold values have to be obtained empirically. Experimental results show that the method can detect the end-points of a voice signal more accurately and outperforms conventional VAD algorithms. The method was evaluated on the TIMIT Acoustic-Phonetic Continuous Speech Corpus, which is mostly used for speech recognition applications and contains clean speech data, and is compared with some of the most recently proposed algorithms.
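
The abstract names the four frame features but not the formula that combines them into D, so the sketch below (numpy) computes the features and uses an assumed combination and threshold `theta` purely for illustration.

```python
import numpy as np

def frame_features(frame, n_fft=512):
    energy = np.mean(frame ** 2)                               # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2         # zero-crossing rate
    p = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    p = p / (p.sum() + 1e-12)
    entropy = -np.sum(p * np.log(p + 1e-12))                   # spectral entropy
    a = np.dot(frame[1:], frame[:-1]) / (np.dot(frame[:-1], frame[:-1]) + 1e-12)
    lp_error = np.mean((frame[1:] - a * frame[:-1]) ** 2)      # 1st-order LP error
    return energy, zcr, entropy, lp_error

def vad(frames, theta=0.1):
    f = np.array([frame_features(fr) for fr in frames])
    # Assumed combination: high energy with low entropy/ZCR/LP error favours speech.
    D = f[:, 0] / ((f[:, 2] + 1e-12) * (f[:, 3] + 1e-12) * (1.0 + f[:, 1]))
    return (D / D.max()) > theta                               # speech / non-speech mask

mask = vad(np.random.default_rng(0).standard_normal((50, 400)))
print(int(mask.sum()), "of", mask.size, "frames flagged as speech")
```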

Posted Content
TL;DR: In this article, a feed-forward neural network framework is proposed for text-independent speaker classification and verification; it achieves a 100% classification rate and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively.
Abstract: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Activity Detection (VAD) than the regular one for speech recognition ensure that a stronger voiced portion is extracted for speaker recognition, while speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search and dynamically reduced regularization parameters are used to avoid training terminating in a local minimum, enabling training to go further at lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with an 8 kHz sampling rate is used here. The first 200 male speakers are used to train and test the classification performance. Their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This work studies the influence of various activation functions on speech recognition systems, and it is observed that the performance of ReLU networks is superior to the other networks for the smaller dataset (i.e., the TIMIT dataset).
Abstract: Significant developments in deep learning methods have been achieved with the capability to train deeper networks. The performance of speech recognition systems has been greatly improved by the use of deep learning techniques. Most of the developments in deep learning are associated with the development of new activation functions and the corresponding initializations. The development of Rectified Linear Units (ReLU) has revolutionized the use of supervised deep learning methods for speech recognition. Recently there has been a great deal of research interest in the development of the activation functions Leaky-ReLU (LReLU), Parametric-ReLU (PReLU), Exponential Linear Units (ELU), and Parametric-ELU (PELU). This work is aimed at studying the influence of various activation functions on speech recognition systems. In this work, a hidden Markov model-deep neural network (HMM-DNN) based speech recognition system is used, where deep neural networks with different activation functions are employed to obtain the emission probabilities of the hidden Markov model. Two datasets, i.e., TIMIT and WSJ, are employed to study the behavior of the speech recognition systems on datasets of different sizes. It is observed that the performance of ReLU networks is superior to the other networks for the smaller dataset (i.e., TIMIT), while for datasets of sufficiently large size (i.e., WSJ) the performance of ELU networks is superior.
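
For reference, the activation functions compared above are written out below (numpy); `a` and `b` are the slope/scale hyperparameters, fixed here but learned in the parametric variants (PReLU, PELU).

```python
import numpy as np

def relu(x):                return np.maximum(0.0, x)
def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)
def prelu(x, a=0.25):       return np.where(x > 0, x, a * x)                 # `a` is learned
def elu(x, a=1.0):          return np.where(x > 0, x, a * (np.exp(x) - 1))
def pelu(x, a=1.0, b=1.0):  return np.where(x > 0, (a / b) * x,
                                            a * (np.exp(x / b) - 1))         # `a`, `b` learned

x = np.linspace(-3.0, 3.0, 7)
for f in (relu, leaky_relu, prelu, elu, pelu):
    print(f.__name__, np.round(f(x), 2))
```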

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this paper, a convolutional auto-encoder model was proposed as a neural network alternative to convolutive non-negative matrix factorization (CNMF), and a recurrent neural network (RNN) was explored in the encoder.
Abstract: The convolutive non-negative matrix factorization (NMF) model factorizes a given audio spectrogram using frequency templates with a temporal dimension. In this paper, we present a convolutional auto-encoder model that acts as a neural network alternative to convolutive NMF. Using the modeling flexibility granted by neural networks, we also explore the idea of using a recurrent neural network in the encoder. Experimental results on speech mixtures from the TIMIT dataset indicate that the convolutive architecture provides a significant improvement in separation performance in terms of BSS Eval metrics.

Journal ArticleDOI
TL;DR: The developed MFA model is able to enhance the security of the systems against spoofing and communication attacks while improving recognition performance.
Abstract: In this paper, a Multi-Factor Authentication (MFA) method is developed by combining a Personal Identification Number (PIN), a One Time Password (OTP), and speaker biometrics through speech watermarks. For this purpose, multipurpose digital speech watermarking is applied to embed semi-fragile and robust watermarks simultaneously in the speech signal, to provide tamper detection and proof of ownership, respectively. For the blind semi-fragile watermark, the Discrete Wavelet Packet Transform (DWPT) and Quantization Index Modulation (QIM) are used to embed the watermark in an angle of the wavelet sub-bands where more speaker-specific information is available. For copyright protection of the speech, blind and robust speech watermarking is applied using DWPT and multiplication; where less speaker-specific information is available, the robust watermark is embedded by manipulating the amplitude of the wavelet sub-bands. Experimental results on TIMIT, MIT, and MOBIO demonstrate that there is a trade-off among the recognition performance of speaker recognition systems, robustness, and capacity, which is illustrated by various triangles. Furthermore, a threat model and attack analysis are used to evaluate the feasibility of the developed MFA model. Accordingly, the developed MFA model is able to enhance the security of the systems against spoofing and communication attacks while improving recognition performance.
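
As a point of reference for the semi-fragile embedding mentioned above, the sketch below (numpy) shows plain quantization index modulation (QIM): each coefficient is snapped onto one of two interleaved lattices depending on the bit to embed. Applying it directly to raw coefficients, rather than to DWPT sub-band angles as in the paper, and the step size `delta` are simplifications.

```python
import numpy as np

def qim_embed(coeffs, bits, delta=0.05):
    # bit 0 -> lattice {k * delta}, bit 1 -> lattice shifted by delta / 2
    offsets = bits * (delta / 2.0)
    return delta * np.round((coeffs - offsets) / delta) + offsets

def qim_extract(coeffs, delta=0.05):
    d0 = np.abs(coeffs - delta * np.round(coeffs / delta))
    d1 = np.abs(coeffs - (delta * np.round((coeffs - delta / 2) / delta) + delta / 2))
    return (d1 < d0).astype(int)          # pick the closer lattice

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 64)
marked = qim_embed(rng.standard_normal(64), bits, delta=0.05)
print(np.array_equal(qim_extract(marked), bits))   # True in the noise-free case
```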

Journal ArticleDOI
TL;DR: Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly across all noise conditions in mismatched situations.
Abstract: This letter investigates the use of subglottal resonances (SGRs) for noise-robust speaker identification (SID). It is motivated by the speaker specificity and stationarity of subglottal acoustics, and the development of noise-robust SGR estimation algorithms which are reliable at low signal-to-noise ratios for large datasets. A two-stage framework is proposed which combines the SGRs with different cepstral features. The cepstral features are used in the first stage to reduce the number of target speakers for a test utterance, and then SGRs are used as complementary second-stage features to conduct identification. Experiments with the TIMIT and NIST 2008 databases show that SGRs, when used in conjunction with power-normalized cepstral coefficients and linear prediction cepstral coefficients, can improve the performance significantly (2%–6% absolute accuracy improvement) across all noise conditions in mismatched situations.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition, that achieves a 100% classification rate and less than 6% Equal Error Rate (EER) using merely about 1 second and 5 seconds of data respectively.
Abstract: This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related systems of speaker recognition. With optimized features and model training, it achieves a 100% classification rate in classification and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data respectively. Features with stricter Voice Activity Detection (VAD) than the regular one for speech recognition ensure that a stronger voiced portion is extracted for speaker recognition, while speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker. Both are proven to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search and dynamically reduced regularization parameters are used to avoid training terminating in a local minimum, enabling training to go further at lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes the weak ones with high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER in the ROC curve. The TIMIT corpus with an 8 kHz sampling rate is used here. The first 200 male speakers were used to train and test the classification performance. Their testing files are used as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. imposters, in speaker verification.

Journal ArticleDOI
TL;DR: As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
Abstract: In this study, a speaker identification system is considered consisting of a feature extraction stage which utilizes both power normalized cepstral coefficients (PNCCs) and Mel frequency cepstral coefficients (MFCC). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712 type handset) upon identification performance. In particular, three NSN types with varying signal to noise ratios (SNRs) were tested corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely, mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; and 120 speakers were selected from each database to yield 3600 speech utterances. As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
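
A minimal sketch of the late (score-level) fusion rules compared above (numpy), assuming each system has already produced a calibrated score per enrolled speaker; the weights in the weighted-sum rule are illustrative.

```python
import numpy as np

def fuse(scores, rule="mean", weights=None):
    """scores: (n_systems, n_speakers) array of per-system scores."""
    if rule == "mean":
        fused = scores.mean(axis=0)
    elif rule == "max":
        fused = scores.max(axis=0)
    elif rule == "weighted":
        w = np.asarray(weights, dtype=float)
        fused = (w[:, None] * scores).sum(axis=0) / w.sum()
    else:
        raise ValueError(rule)
    return int(np.argmax(fused)), fused    # identified speaker index, fused scores

scores = np.array([[0.2, 0.7, 0.1],        # e.g. an MFCC-based system
                   [0.3, 0.5, 0.2]])       # e.g. a PNCC-based system
print(fuse(scores, "weighted", weights=[0.6, 0.4])[0])
```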

Proceedings ArticleDOI
01 May 2017
TL;DR: Four systems based on different speech features were combined at the score level to improve verification accuracy under clean and noisy speech conditions, reducing the equal error rate in some cases by up to 44%.
Abstract: So far, many methods have been proposed for speaker verification which provide good results, but their performance degrades in actual noisy environments. A common approach to partially alleviate this problem is the fusion of several methods. In this paper, four systems based on different speech features, i.e., MFCC, IMFCC, LFCC, and PNCC, were combined at the score level to improve verification accuracy under clean and noisy speech conditions. Pairwise and four-way fusion of the features were evaluated in a speaker verification system based on speaker modeling with a Gaussian mixture model (GMM). The TIMIT and NOISEX-92 databases were used as the speech and noise datasets, respectively. The experimental results show that score-level fusion of different feature vectors enhances the accuracy of the speaker verification system, reducing the equal error rate in some cases by up to 44%.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this article, the authors proposed several types of residual LSTM methods for acoustic modeling, and compared with classic LSTMs, their architecture showed more than 8% relative reduction in Phone Error Rate (PER) on TIMIT tasks.
Abstract: Long Short-Term Memory (LSTM) is the primary recurrent neural network architecture for acoustic modeling in automatic speech recognition systems. Residual learning is an efficient method to help neural networks converge more easily and quickly. In this paper, we propose several types of residual LSTM methods for acoustic modeling. Our experiments indicate that, compared with classic LSTM, our architecture shows more than 8% relative reduction in Phone Error Rate (PER) on TIMIT tasks. At the same time, our residual fast LSTM approach shows a 4% relative reduction in PER on the same task. In addition, we find that this architecture also achieves good results on the THCHS-30, LibriSpeech, and Switchboard corpora.
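
One plausible reading of a residual LSTM layer, sketched below in PyTorch: the layer's output is its LSTM output plus a (projected) shortcut of the input. Published residual-LSTM variants place the shortcut differently, so this is an illustrative instantiation rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualLSTMLayer(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.proj = (nn.Identity() if input_size == hidden_size
                     else nn.Linear(input_size, hidden_size))

    def forward(self, x):                  # x: (batch, time, input_size)
        out, _ = self.lstm(x)
        return out + self.proj(x)          # residual (shortcut) connection

layer = ResidualLSTMLayer(40, 40)
print(layer(torch.randn(2, 100, 40)).shape)   # torch.Size([2, 100, 40])
```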

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work trains a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data and exceeds the clustering performance of all previous approaches on the well-known TIMIT dataset.
Abstract: Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification are able to extract features from spectrograms which can be used for speaker clustering. These features are represented by the activations of a certain hidden layer and are called embeddings. However, previous approaches require plenty of additional speaker data to learn the embedding, and although the clustering results are then on par with more traditional approaches using MFCC features etc., room for improvement stems from the fact that these embeddings are trained with a surrogate task that is rather far away from segregating unknown voices, namely identifying a few specific speakers. We address both problems by training a CNN to extract embeddings that are similar for equal speakers (regardless of their specific identity) using weakly labeled data. We demonstrate our approach on the well-known TIMIT dataset that has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches, but require just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.
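
A rough sketch of the weak-label training signal described above (PyTorch): pairs of spectrogram snippets labelled only as same/different speaker drive a contrastive loss on the embeddings, with no speaker identities involved. The toy encoder and the margin value are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy spectrogram encoder producing 64-dimensional embeddings
encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    d = F.pairwise_distance(emb_a, emb_b)
    # pull same-speaker pairs together, push different-speaker pairs apart
    return torch.mean(same * d ** 2 + (1 - same) * F.relu(margin - d) ** 2)

specs_a, specs_b = torch.randn(8, 1, 128, 100), torch.randn(8, 1, 128, 100)
same = torch.randint(0, 2, (8,)).float()          # weak pair labels only
loss = contrastive_loss(encoder(specs_a), encoder(specs_b), same)
```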

Proceedings ArticleDOI
26 Oct 2017
TL;DR: The results show that the I-vector approach is better than the Gaussian Mixture Model-Universal Background Model for both clean and AWGN conditions without a handset; however, the GMM-UBM had better accuracy for the NSN types.
Abstract: In this paper, two models, the I-vector and the Gaussian Mixture Model-Universal Background Model (GMM-UBM), are compared for the speaker identification task. Four feature combinations of I-vectors with seven fusion techniques are considered: maximum, mean, weighted sum, cumulative, interleaving and concatenated for both two and four features. In addition, an Extreme Learning Machine (ELM) is exploited to identify speakers, and then Speaker Identification Accuracy (SIA) is calculated. Both systems are evaluated for 120 speakers from the TIMIT and NIST 2008 databases for clean speech. Furthermore, a comprehensive evaluation is made under Additive White Gaussian Noise (AWGN) conditions and with three types of Non Stationary Noise (NSN), both with and without handset effects for the TIMIT database. The results show that the I-vector approach is better than the GMM-UBM for both clean and AWGN conditions without a handset. However, the GMM-UBM had better accuracy for NSN types.

Journal ArticleDOI
TL;DR: A new model is proposed based on components called matched filters (MFs): instead of using a fixed filter bank for the entire speech signal, the proposed TOC is generated by adopting a pair of vowel and consonant MFs for each voiced speech frame.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This work proposes to improve the non-parametric Bayesian phone-loop model by incorporating a Hierarchical Pitman-Yor based bigram Language Model on top of the units' transitions, which shows an absolute improvement of 1–2% on the Normalized Mutual Information (NMI) metric.
Abstract: Recent work on Acoustic Unit Discovery (AUD) has led to the development of a non-parametric Bayesian phone-loop model where the prior over the probability of the phone-like units is assumed to be sampled from a Dirichlet Process (DP). In this work, we propose to improve this model by incorporating a Hierarchical Pitman-Yor based bigram Language Model on top of the units' transitions. This new model makes use of phonotactic context information but assumes a fixed number of units. To remedy this limitation we first train a DP phone-loop model to infer the number of units; then, the bigram phone-loop is initialized from the DP phone-loop and trained until convergence of its parameters. Results show an absolute improvement of 1–2% on the Normalized Mutual Information (NMI) metric. Furthermore, we show that, combined with Multilingual Bottleneck (MBN) features, the model yields the same or a higher NMI than an English phone recogniser trained on TIMIT.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work introduces an approach to calculate the parameter in the discriminative term adaptively from the discrepancy between target features, and shows improved robustness and better separation performance than the previous approach.
Abstract: Monaural source separation is an important research area which can help to improve the performance of several real-world applications, such as speech recognition and assisted living systems. Huang et al. proposed deep recurrent neural networks (DRNNs) with a discriminative objective function to improve the performance of source separation. However, the penalty factor in that objective function is selected randomly and empirically. We therefore introduce an approach to calculate the parameter in the discriminative term adaptively from the discrepancy between target features, so that the penalty factor changes with the inputs to improve separation performance. The proposed method is evaluated with different settings and architectures of neural networks. In these experiments, the TIMIT corpus is used as the database and the signal-to-distortion ratio (SDR) as the measurement. Compared with the previous approach, our method has improved robustness and better separation performance.
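
The discriminative separation objective referred to above has the general form below (numpy); the abstract does not give the mapping from target discrepancy to the penalty factor, so the normalized form of `gamma` here is an assumption for illustration only.

```python
import numpy as np

def discriminative_loss(est1, est2, tgt1, tgt2):
    diff = np.mean((tgt1 - tgt2) ** 2)
    # assumed adaptive penalty: larger when the two targets differ more
    gamma = diff / (diff + np.mean(tgt1 ** 2) + np.mean(tgt2 ** 2))
    recon = np.mean((est1 - tgt1) ** 2) + np.mean((est2 - tgt2) ** 2)
    confusion = np.mean((est1 - tgt2) ** 2) + np.mean((est2 - tgt1) ** 2)
    return recon - gamma * confusion       # reward reconstruction, penalize cross-talk

rng = np.random.default_rng(0)
t1, t2 = rng.standard_normal((2, 257))     # toy target spectra for the two sources
print(discriminative_loss(t1 + 0.1, t2 - 0.1, t1, t2))
```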

Journal ArticleDOI
TL;DR: The results of the experiments show that the proposed model obtains good performance in clean and noisy environments and is insensitive to low-quality speech, but the time cost of the model is high.
Abstract: Today, more and more people have benefited from speaker recognition. However, the accuracy of speaker recognition often drops off rapidly because of low-quality speech and noise. This paper proposes a new speaker recognition model based on wavelet packet entropy (WPE), i-vectors, and cosine distance scoring (CDS). In the proposed model, WPE transforms the speech into short-term spectral feature vectors (short vectors) and resists noise. An i-vector is generated from those short vectors and characterizes the speech to improve recognition accuracy. CDS quickly compares two i-vectors to give the recognition result. The proposed model is evaluated on the TIMIT speech database. The results of the experiments show that the proposed model obtains good performance in clean and noisy environments and is insensitive to low-quality speech, but the time cost of the model is high. To reduce the time cost, parallel computation is used.
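
For completeness, cosine distance scoring between a test i-vector and enrolled speaker i-vectors is a one-liner; the sketch below (numpy) scores a test vector against a toy set of enrolled vectors (dimensions and names are illustrative).

```python
import numpy as np

def cds_score(w_test, w_model):
    # cosine similarity between two i-vectors
    return float(w_test @ w_model /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_model) + 1e-12))

rng = np.random.default_rng(0)
enrolled = {"spk1": rng.standard_normal(400), "spk2": rng.standard_normal(400)}
test = enrolled["spk1"] + 0.1 * rng.standard_normal(400)   # noisy copy of spk1
scores = {spk: cds_score(test, w) for spk, w in enrolled.items()}
print(max(scores, key=scores.get), scores)   # best-scoring enrolled speaker
```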