
Showing papers on "TIMIT" published in 2019


Proceedings ArticleDOI
15 Sep 2019
TL;DR: A novel noise adaptive speech enhancement (SE) system that employs a domain adversarial training (DAT) approach to tackle the mismatch in noise type between training and testing conditions, providing significant improvements in PESQ, SSNR, and STOI over the SE system without adaptation.
Abstract: In this study, we propose a novel noise adaptive speech enhancement (SE) system, which employs a domain adversarial training (DAT) approach to tackle the issue of a noise type mismatch between the training and testing conditions. Such a mismatch is a critical problem in deep-learning-based SE systems, and a large mismatch can seriously degrade SE performance. Because a well-trained SE system is generally used to handle various unseen noise types, a noise type mismatch commonly occurs in real-world scenarios. The proposed noise adaptive SE system contains an encoder-decoder-based enhancement model and a domain discriminator model. During adaptation, the DAT approach encourages the encoder to produce noise-invariant features based on the information from the discriminator model and consequently increases the robustness of the enhancement model to unseen noise types. Herein, we regard stationary noises as the source domain (with the ground truth of clean speech) and non-stationary noises as the target domain (without the ground truth). We evaluated the proposed system on TIMIT sentences. The experimental results show that the proposed noise adaptive SE system provides significant improvements in PESQ (19.0%), SSNR (39.3%), and STOI (27.0%) over the SE system without adaptation.
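The adversarial part of such a setup is typically implemented with a gradient-reversal layer between the encoder and the domain discriminator. Below is a minimal, hedged sketch of that trick in PyTorch; the names and the scaling factor lam are illustrative assumptions, not the authors' code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the encoder to *confuse* the domain
        # discriminator, encouraging noise-invariant features.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

During adaptation, the domain discriminator would be fed grad_reverse(encoder_features), so minimizing its stationary-vs-non-stationary classification loss drives the encoder toward noise-invariant representations.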

31 citations


Posted Content
TL;DR: This paper proposes vq-wav2vec, which learns discrete representations of audio segments through a wav2vec-style self-supervised context prediction task, using either a gumbel softmax or online k-means clustering to quantize the dense representations.
Abstract: We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
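As a rough illustration of the quantization step only (the codebook size, feature dimension and nearest-neighbour assignment below are assumptions; the gumbel-softmax variant and the context-prediction training are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(320, 512))     # V codewords of dimension d (assumed sizes)
dense = rng.normal(size=(100, 512))        # T dense encoder outputs for one utterance

# k-means-style assignment: replace each dense vector by its nearest codeword.
d2 = ((dense[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (T, V)
codes = d2.argmin(axis=1)                  # discrete token per frame
quantized = codebook[codes]                # quantized vectors / tokens fed to BERT
print(codes[:10], quantized.shape)
```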

29 citations


Journal ArticleDOI
TL;DR: Proposes the complex signal approximation (cSA), which operates in the complex domain to utilize the phase information of the desired speech signal and improve separation performance.
Abstract: In recent research, deep neural network (DNN) has been used to solve the monaural source separation problem. According to the training objectives, DNN-based monaural speech separation is categorized into three aspects, namely masking, mapping, and signal approximation based techniques. However, the performance of the traditional methods is not robust due to variations in real-world environments. Besides, in the vanilla DNN-based methods, the temporal information cannot be fully utilized. Therefore, in this paper, the long short-term memory (LSTM) neural network is applied to exploit the long-term speech contexts. Then, we propose the complex signal approximation (cSA), which is operated in the complex domain to utilize the phase information of the desired speech signal to improve the separation performance. The IEEE and the TIMIT corpora are used to generate mixtures with noise and speech interferences to evaluate the efficacy of the proposed method. The experimental results demonstrate the advantages of the proposed cSA-based LSTM recurrent neural network method in terms of different objective performance measures.
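A hedged sketch of what a complex-domain signal approximation objective can look like: the error is measured between the masked noisy spectrum and the clean spectrum in the complex plane, so phase mismatch is penalised along with magnitude error. The mask form and shapes are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=257) + 1j * rng.normal(size=257)   # noisy STFT frame
S = rng.normal(size=257) + 1j * rng.normal(size=257)   # clean/desired STFT frame
M = rng.normal(size=257) + 1j * rng.normal(size=257)   # complex mask predicted by the LSTM

# complex signal approximation loss: squared error of M*Y against S in the complex domain
csa_loss = np.mean(np.abs(M * Y - S) ** 2)
print(csa_loss)
```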

29 citations


Journal ArticleDOI
TL;DR: A two-stage approach with two DNN-based methods addresses dereverberation and separation in the monaural source separation problem and outperforms the state of the art, particularly in highly reverberant room environments.
Abstract: Deep neural networks (DNNs) have been used for dereverberation and separation in the monaural source separation problem. However, the performance of current state-of-the-art methods is limited, particularly when applied in highly reverberant room environments. In this paper, we propose a two-stage approach with two DNN-based methods to address this problem. In the first stage, the dereverberation of the speech mixture is achieved with the proposed dereverberation mask (DM). In the second stage, the dereverberant speech mixture is separated with the ideal ratio mask (IRM). To realize this two-stage approach, in the first DNN-based method, the DM is integrated with the IRM to generate the enhanced time-frequency (T-F) mask, namely the ideal enhanced mask (IEM), as the training target for the single DNN. In the second DNN-based method, the DM and the IRM are predicted with two individual DNNs. The IEEE and the TIMIT corpora with real room impulse responses and noise from the NOISEX dataset are used to generate speech mixtures for evaluations. The proposed methods outperform the state-of-the-art specifically in highly reverberant room environments.
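The masks involved can be sketched roughly as below. The IRM follows its standard definition; the DM and the combined IEM are written here as plausible stand-ins (a magnitude ratio and an element-wise product), since the paper's exact formulations may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(257, 100)))        # target (dereverberated) speech magnitudes
N = np.abs(rng.normal(size=(257, 100)))        # interference magnitudes
R = S + np.abs(rng.normal(size=(257, 100)))    # reverberant speech magnitudes

irm = np.sqrt(S**2 / (S**2 + N**2))                # ideal ratio mask
dm = np.clip(S / np.maximum(R, 1e-8), 0.0, 1.0)    # dereverberation mask (assumed form)
iem = dm * irm                                     # combined "enhanced" mask (assumed form)
print(iem.shape)
```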

28 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: The first paper to describe these various flavors of TDNN in depth, providing details regarding the speech features used at input, the constituent layers and their dimensionality, regularization techniques, etc.
Abstract: Kaldi NNET3 is at the moment the leading speech recognition toolkit on many well-known tasks such as LibriSpeech, TED-LIUM or TIMIT. Several versions of the time-delay neural network (TDNN) architecture were recently proposed, implemented and evaluated for acoustic modeling with Kaldi: plain TDNN, convolutional TDNN (CNN-TDNN), long short-term memory TDNN (TDNN-LSTM) and TDNN-LSTM with attention. To the best of our knowledge, this is the first paper to describe in-depth these various flavors of TDNN, providing details regarding the speech features used at input, constituent layers and their dimensionality, regularization techniques etc. The various acoustic models were evaluated in conjunction with n-gram and recurrent language models in an automatic speech recognition (ASR) experiment for the Romanian language. We report significantly better results over the previous ASR systems for the same Romanian ASR tasks.

25 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, the output-label permutation is considered as a discrete latent random variable with a uniform prior distribution and a log-likelihood function is defined based on the prior distributions and the separation errors of all permutations.
Abstract: Single-microphone, speaker-independent speech separation is normally performed through two steps: (i) separating the specific speech sources, and (ii) determining the best output-label assignment to find the separation error. The second step is the main obstacle in training neural networks for speech separation. Recently proposed Permutation Invariant Training (PIT) addresses this problem by determining the output-label assignment which minimizes the separation error. In this study, we show that a major drawback of this technique is the overconfident choice of the output-label assignment, especially in the initial steps of training when the network generates unreliable outputs. To solve this problem, we propose Probabilistic PIT (Prob-PIT) which considers the output-label permutation as a discrete latent random variable with a uniform prior distribution. Prob-PIT defines a log-likelihood function based on the prior distributions and the separation errors of all permutations; it trains the speech separation networks by maximizing the log-likelihood function. Prob-PIT can be easily implemented by replacing the minimum function of PIT with a soft-minimum function. We evaluate our approach for speech separation on both TIMIT and CHiME datasets. The results show that the proposed method significantly outperforms PIT in terms of Signal to Distortion Ratio and Signal to Interference Ratio.
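The key change can be written in a few lines: the hard minimum over per-permutation separation errors is replaced by a soft-minimum (a log-sum-exp under a uniform prior over permutations). The temperature and example error values below are illustrative assumptions.

```python
import numpy as np

def pit_loss(perm_errors):
    return min(perm_errors)                       # standard PIT: hard assignment

def prob_pit_loss(perm_errors, gamma=1.0):
    e = np.asarray(perm_errors, dtype=float)
    # negative log-likelihood of a uniform mixture over all permutations
    return -gamma * np.log(np.mean(np.exp(-e / gamma)))

errors = [4.2, 9.7]                               # 2 speakers -> 2 possible assignments
print(pit_loss(errors), prob_pit_loss(errors))    # compare hard and soft objectives
```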

23 citations


Journal ArticleDOI
TL;DR: A statistical analysis reveals the super-Gaussian and heteroscedastic properties of the prediction errors in nonlinear regression deep neural network (DNN)-based speech enhancement when estimating clean log-power spectral (LPS) components at DNN outputs with noisy LPS features in DNN input vectors.
Abstract: From a statistical perspective, the conventional minimum mean squared error (MMSE) criterion can be considered as the maximum likelihood (ML) solution under an assumed homoscedastic Gaussian error model. However, in this paper, a statistical analysis reveals the super-Gaussian and heteroscedastic properties of the prediction errors in nonlinear regression deep neural network (DNN)-based speech enhancement when estimating clean log-power spectral (LPS) components at DNN outputs with noisy LPS features in DNN input vectors. Accordingly, we propose treating all dimensions of the prediction error vector as statistically independent random variables and model them with generalized Gaussian distributions (GGDs). Then, the objective function with the GGD error model is derived according to the ML criterion. Experiments on the TIMIT corpus corrupted by simulated additive noises show consistent improvements of our proposed DNN framework over the conventional DNN framework in terms of various objective quality measures under 14 unseen noise types evaluated and at various signal-to-noise ratio levels. Furthermore, the ML optimization objective with GGD outperforms the conventional MMSE criterion, achieving improved generalization and robustness.
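For reference, a sketch of the per-sample negative log-likelihood under a generalized Gaussian error model, with scale alpha and shape beta (beta = 2 and beta = 1 give Gaussian- and Laplacian-shaped error models). This is an illustration, not the paper's training objective or code.

```python
import numpy as np
from math import lgamma, log

def ggd_nll(errors, alpha, beta):
    # GGD density: p(e) = beta / (2*alpha*Gamma(1/beta)) * exp(-(|e|/alpha)**beta)
    errors = np.asarray(errors, dtype=float)
    log_norm = log(beta) - log(2.0 * alpha) - lgamma(1.0 / beta)
    return np.mean((np.abs(errors) / alpha) ** beta - log_norm)

e = np.random.default_rng(0).standard_t(df=3, size=1000)   # heavy-tailed "prediction errors"
print(ggd_nll(e, alpha=1.0, beta=2.0))   # Gaussian-style fit
print(ggd_nll(e, alpha=1.0, beta=1.0))   # Laplacian-style fit (lower NLL on heavy tails here)
```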

21 citations


Proceedings ArticleDOI
08 May 2019
TL;DR: A unified DNN architecture to predict both the height and age of a speaker from short durations of speech is proposed, and a novel initialization scheme for the deep neural architecture is introduced that avoids the requirement for a large training dataset.
Abstract: Automatic height and age prediction of a speaker has a wide variety of applications in speaker profiling, forensics, etc. Often in such applications only a few seconds of speech data are available to reliably estimate the speaker parameters. Traditionally, age and height were predicted separately using different estimation algorithms. In this work, we propose a unified DNN architecture to predict both height and age of a speaker for short durations of speech. A novel initialization scheme for the deep neural architecture is introduced that avoids the requirement for a large training dataset. We evaluate the system on the TIMIT dataset, where the mean duration of speech segments is around 2.5 s. The DNN system is able to improve the age RMSE by at least 0.6 years compared to a conventional support vector regression system trained on Gaussian Mixture Model mean supervectors. The system achieves an RMSE of 6.85 cm and 6.29 cm for male and female height prediction. For age estimation, the RMSEs are 7.60 and 8.63 years for males and females respectively. Analysis of shorter speech segments reveals that even with a 1 second speech input the performance degradation is at most 3% compared to the full-duration speech files.

20 citations


Journal ArticleDOI
TL;DR: A new deep architecture in which two heterogeneous classification techniques, a CNN and support vector machines (SVMs), are combined, improving results by 13.33% and 2.31% over a baseline CNN and segmental recurrent neural networks respectively.
Abstract: Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance on automatic speech recognition. A softmax activation function for prediction and minimization of the cross-entropy loss are employed by most CNNs. This paper proposes a new deep architecture in which two heterogeneous classification techniques, CNNs and support vector machines (SVMs), are combined. In this proposed model, features are learned using convolution layers and classified by SVMs. The last layer of the CNN, i.e. the softmax layer, is replaced by SVMs to efficiently deal with high-dimensional features. This model should be interpreted as a special form of structured SVM and is named the convolutional support vector machine (CSVM). Instead of training each component separately, the parameters of the CNN and SVMs are jointly trained using frame-level max-margin, sequence-level max-margin, and state-level minimum Bayes risk criteria. The performance of the CSVM is evaluated on the TIMIT and Wall Street Journal datasets for phone recognition. By incorporating the features of both CNNs and SVMs, the CSVM improves the result by 13.33% and 2.31% over the baseline CNN and segmental recurrent neural networks respectively.
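A minimal sketch of the frame-level max-margin idea, i.e. scoring CNN features with a linear classifier and applying a multiclass hinge loss instead of softmax cross-entropy; shapes and the class count are illustrative assumptions.

```python
import numpy as np

def multiclass_hinge(scores, labels, margin=1.0):
    # scores: (N, C) linear outputs on CNN features, labels: (N,) target classes
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]
    loss = np.maximum(0.0, scores - correct + margin)
    loss[np.arange(n), labels] = 0.0               # no penalty for the true class
    return loss.sum(axis=1).mean()

rng = np.random.default_rng(0)
print(multiclass_hinge(rng.normal(size=(8, 39)), rng.integers(0, 39, size=8)))
```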

20 citations


Proceedings ArticleDOI
21 Oct 2019
TL;DR: This work proposes an utterance-level classification-aided non-intrusive (UCAN) assessment approach that combines the task of quality score classification with the regression task ofquality score estimation, and uses a categorical quality ranking as an auxiliary constraint to assist with quality score estimation.
Abstract: Objective metrics, such as the perceptual evaluation of speech quality (PESQ) have become standard measures for evaluating speech. These metrics enable efficient and costless evaluations, where ratings are often computed by comparing a degraded speech signal to its underlying clean reference signal. Reference-based metrics, however, cannot be used to evaluate real-world signals that have inaccessible references. This project develops a nonintrusive framework for evaluating the perceptual quality of noisy and enhanced speech. We propose an utterance-level classification-aided non-intrusive (UCAN) assessment approach that combines the task of quality score classification with the regression task of quality score estimation. Our approach uses a categorical quality ranking as an auxiliary constraint to assist with quality score estimation, where we jointly train a multi-layered convolutional neural network in a multi-task manner. This approach is evaluated using the TIMIT speech corpus and several noises under a wide range of signal-to-noise ratios. The results show that the proposed system significantly improves quality score estimation as compared to several state-of-the-art approaches.
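A rough sketch of such a multi-task objective: a shared network predicts a continuous quality score (regression) and a coarse quality category (classification), and the two losses are combined. The weighting and the number of categories are assumptions for illustration.

```python
import numpy as np

def ucan_style_loss(pred_score, true_score, class_logits, class_label, alpha=1.0):
    mse = (pred_score - true_score) ** 2                       # regression term
    logits = class_logits - class_logits.max()                 # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    ce = -log_probs[class_label]                               # classification term
    return mse + alpha * ce

print(ucan_style_loss(3.1, 3.4, np.array([0.2, 1.5, -0.3, 0.1, 0.0]), 1))
```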

19 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: This work presents a novel network design based on a combination of many convolutional and recurrent layers that resolves the trade-off between receptive field size, number of parameters and spatial resolution of features in deeper layers of the network.
Abstract: When designing a fully-convolutional neural network, there is a trade-off between receptive field size, number of parameters and spatial resolution of features in deeper layers of the network. In this work we present a novel network design based on a combination of many convolutional and recurrent layers that resolves this trade-off. We compare our solution with U-net based models known from the literature and other baseline models on a speech enhancement task. We test our solution on TIMIT speech utterances combined with noise segments extracted from the NOISEX-92 database and show a clear advantage of the proposed solution in terms of SDR (signal-to-distortion ratio), SIR (signal-to-interference ratio) and STOI (short-time objective intelligibility) metrics compared to the current state-of-the-art.

Proceedings ArticleDOI
17 Apr 2019
TL;DR: This paper proposes a fully-trainable windowed attention and provides a detailed analysis of the factors which affect the performance of such an attention mechanism, resulting in a 17% performance improvement over the traditional attention model.
Abstract: The usual attention mechanisms used for encoder-decoder models do not constrain the relationship between input and output sequences to be monotonic. To address this we explore windowed attention mechanisms which restrict attention to a block of source hidden states. Rule-based windowing restricts attention to a (typically large) fixed-length window. The performance of such methods is poor if the window size is small. In this paper, we propose a fully-trainable windowed attention and provide a detailed analysis on the factors which affect the performance of such an attention mechanism. Compared to the rule-based window methods, the learned window size is significantly smaller yet the model’s performance is competitive. On the TIMIT corpus this approach has resulted in a 17% (relative) performance improvement over the traditional attention model. Our model also yields comparable accuracies to the joint CTC-attention model on the Wall Street Journal corpus.
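The windowing idea can be sketched as masking attention scores outside a block around a predicted centre. The hard rectangular window and fixed centre/width below are simplifications for illustration; in the paper the window is learned rather than set by rule.

```python
import numpy as np

def windowed_attention(query, keys, values, centre, width):
    scores = keys @ query / np.sqrt(query.shape[-1])      # dot-product scores
    pos = np.arange(keys.shape[0])
    scores = np.where(np.abs(pos - centre) <= width, scores, -np.inf)  # window mask
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                               # context vector

rng = np.random.default_rng(0)
K, V = rng.normal(size=(200, 64)), rng.normal(size=(200, 32))
q = rng.normal(size=64)
print(windowed_attention(q, K, V, centre=120, width=8).shape)
```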

Posted Content
TL;DR: Speech-XLNet, an XLNet-like pretraining scheme for unsupervised acoustic model pretraining to learn speech representations with a SAN, greatly improves SAN/HMM performance in terms of both convergence speed and recognition accuracy compared to a model trained from randomly initialized weights.
Abstract: Self-attention network (SAN) can benefit significantly from the bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme "Speech-XLNet" for unsupervised acoustic model pretraining to learn speech representations with SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer to encourage the SAN to make inferences by focusing on global structures through its attention weights. In addition, Speech-XLNet also allows the model to explore the bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves the SAN/HMM performance in terms of both convergence speed and recognition accuracy compared to the one trained from randomly initialized weights. Our best systems achieve a relative improvement of 11.9% and 8.3% on the TIMIT and WSJ tasks respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to the best of our knowledge, is the lowest PER obtained from a single system.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, a Bayesian Subspace Hidden Markov Model (SHMM) is used to learn the notion of acoustic units from the labeled data and then the model uses its knowledge to find new acoustic units on the target language.
Abstract: This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given a set of labeled recordings from other languages. Our approach may be described as a two-step procedure: first the model learns the notion of acoustic units from the labelled data, and then the model uses its knowledge to find new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) where each low-dimensional embedding represents an acoustic unit rather than just an HMM state. The subspace is trained on 3 languages from the GlobalPhone corpus (German, Polish and Spanish) and the acoustic units (AUs) are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational Auto Encoder-HMM.


Journal ArticleDOI
TL;DR: A new pooling technique, multilevel region of interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers and improves the extracted features using additional information from the multilevel convolutional neural network layers.
Abstract: Efficient and robust automatic speech recognition (ASR) systems are in high demand in the present scenario. ASR systems are generally fed with cepstral features like mel-frequency cepstral coefficients and perceptual linear prediction. However, some attempts have also been made in speech recognition to shift to simple features like critical band energies or spectrograms using deep learning models. These approaches always claim that they have the ability to train directly on the raw signal. Such systems depend highly on the discriminative power of the ConvNet layers to separate two phonemes having nearly similar accents, but they do not offer a high recognition rate. The main reason for the limited recognition rate is stride-based pooling methods that perform a sharp reduction in output dimensionality, i.e. at least 75%. To improve the performance, region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed, but their performance did not meet the expected level. Therefore, a new pooling technique, multilevel region of interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The newly proposed architecture is named the multilevel RoI convolutional neural network (MR-CNN). It is designed by simply placing RoI pooling layers after up to four coarsest layers. It improves the extracted features using additional information from the multilevel ConvNet layers. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. The phoneme error rate offered by this model on raw speech is 16.4% and 17.1% on the TIMIT and WSJ datasets respectively, which is slightly better than with spectral features.
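A toy sketch of the pooling idea: the same region of interest, expressed as a fraction of the utterance, is max-pooled from feature maps of several conv layers at different temporal resolutions, and the pooled vectors are concatenated. Layer sizes are assumed for illustration.

```python
import numpy as np

def roi_pool(feature_map, start_frac, end_frac):
    T = feature_map.shape[0]
    s = int(start_frac * T)
    e = max(int(end_frac * T), s + 1)          # keep the region non-empty
    return feature_map[s:e].max(axis=0)        # max-pool over the RoI frames

rng = np.random.default_rng(0)
layers = [rng.normal(size=(400, 64)), rng.normal(size=(200, 128)),
          rng.normal(size=(100, 256)), rng.normal(size=(50, 512))]   # 4 coarsest layers
pooled = np.concatenate([roi_pool(f, 0.25, 0.40) for f in layers])
print(pooled.shape)    # one fixed-length descriptor built from multiple ConvNet levels
```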

Posted Content
TL;DR: It is shown that a large improvement in the accuracy of deep speech models can be achieved with effective Neural Architecture Optimization at a very low computational cost.
Abstract: Deep neural networks (DNNs) have been demonstrated to outperform many traditional machine learning algorithms in Automatic Speech Recognition (ASR). In this paper, we show that a large improvement in the accuracy of deep speech models can be achieved with effective Neural Architecture Optimization at a very low computational cost. Phone recognition tests with the popular LibriSpeech and TIMIT benchmarks demonstrated this by discovering and training novel candidate models within a few hours (less than a day), many times faster than attention-based seq2seq models. Our method achieves a test error of 7% Word Error Rate (WER) on the LibriSpeech corpus and 13% Phone Error Rate (PER) on the TIMIT corpus, on par with state-of-the-art results.


Journal ArticleDOI
TL;DR: A novel method is proposed for source smartphone identification that uses encoding characteristics as the intrinsic fingerprint of recording devices, with Variance Threshold and SVM-RFE used to choose the optimal features.

Journal ArticleDOI
TL;DR: In this paper, a modification to DTW that performs individual and independent pairwise alignment of feature trajectories is proposed, termed feature trajectory dynamic time warping (FTDTW), which is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments.
Abstract: Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments. Experiments using MFCC and PLP parametrisations extracted from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and statistically significant improvements in the quality of the resulting clusters in terms of F-measure and normalised mutual information (NMI).
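A small sketch of the per-trajectory alignment: a basic DTW is applied independently to each feature dimension and the per-dimension costs are summed. This is only an illustration under assumed inputs; the actual FTDTW may normalise or combine the costs differently.

```python
import numpy as np

def dtw_1d(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def ftdtw(X, Y):
    # X: (Tx, d), Y: (Ty, d), e.g. MFCC matrices of two speech segments
    return sum(dtw_1d(X[:, k], Y[:, k]) for k in range(X.shape[1]))

rng = np.random.default_rng(0)
print(ftdtw(rng.normal(size=(40, 13)), rng.normal(size=(55, 13))))
```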

Proceedings ArticleDOI
15 Sep 2019
TL;DR: It is argued that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments; the proposed framework obtains a notable performance improvement in terms of perceptual evaluation of speech quality and short-time objective intelligibility on the TIMIT dataset.
Abstract: In a noisy environment, a lossy speech signal can be automatically restored by a listener if he/she knows the language well. That is, with the built-in knowledge of a "language model", a listener may effectively suppress noise interference and retrieve the target speech signals. Accordingly, we argue that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments. In this study, in addition to the conventional modeling for learning the acoustic noisy-clean speech mapping, an abstract symbolic sequential modeling is incorporated into the SE framework. This symbolic sequential modeling can be regarded as a "linguistic constraint" in learning the acoustic noisy-clean speech mapping function. In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder algorithm. The obtained symbols are able to capture high-level phoneme-like content from speech signals. The experimental results demonstrate that the proposed framework can obtain notable performance improvement in terms of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) on the TIMIT dataset.

Journal ArticleDOI
TL;DR: The bounded generalized Gaussian distribution (BGGD) is adopted for modeling the data and its mixture is extended to an ICA mixture model, employing gradient ascent along with expectation maximization for parameter estimation; by inferring the shape parameter of the BGGD, the Gaussian and Laplace distributions can be characterized as special cases.
Abstract: In this paper, we propose a bounded generalized Gaussian mixture model with independent component analysis (ICA). One limitation of ICA is that it assumes the sources to be independent of each other. This assumption can be relaxed by employing a mixture model. In our proposed model, the bounded generalized Gaussian distribution (BGGD) is adopted for modeling the data, and we further extend its mixture to an ICA mixture model by employing gradient ascent along with expectation maximization for parameter estimation. By inferring the shape parameter of the BGGD, the Gaussian distribution and the Laplace distribution can be characterized as special cases. In order to validate the effectiveness of this algorithm, experiments are performed on blind source separation (BSS) and on BSS as preprocessing for unsupervised keyword spotting. For BSS, the TIMIT, TSP and Noizeus speech corpora are selected and results are compared with ICA. For keyword spotting, the TIMIT speech corpus is selected and recognition results are compared before and after BSS is applied as preprocessing when speech utterances are affected by mixing with noise or other speech utterances. The mixing of noise or speech utterances with a particular or target speech utterance can greatly affect the intelligibility of a speech signal. The results achieved from the presented experiments on different applications demonstrate the effectiveness of the ICA mixture model in statistical learning.

Proceedings ArticleDOI
14 Jul 2019
TL;DR: AM-SincNet combines SincNet with an Additive Margin Softmax (AM-Softmax) layer, which introduces a margin of separation between the classes that forces samples from the same class to be closer to each other while maximizing the distance between classes.
Abstract: Speaker recognition is a challenging task with essential applications such as authentication, automation, and security. SincNet is a new deep-learning-based model which has produced promising results on this task. When training deep learning systems, the loss function is essential to the network performance. The softmax loss function is widely used in deep learning methods, but it is not the best choice for all kinds of problems. For distance-based problems, a new softmax-based loss function called Additive Margin Softmax (AM-Softmax) is proving to be a better choice than the traditional softmax. AM-Softmax introduces a margin of separation between the classes that forces samples from the same class to be closer to each other and also maximizes the distance between classes. In this paper, we propose a new approach for speaker recognition systems called AM-SincNet, which is based on SincNet but uses an improved AM-Softmax layer. The proposed method is evaluated on the TIMIT dataset and obtains an improvement of approximately 40% in the Frame Error Rate compared to SincNet.
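For illustration, a numpy sketch of the AM-Softmax loss: cosine scores on L2-normalised features and class weights, with a margin m subtracted from the target-class cosine and a scale s applied. The values of s and m are typical choices, not necessarily those used in the paper.

```python
import numpy as np

def am_softmax_loss(features, weights, labels, s=30.0, m=0.4):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = f @ w                                    # (N, C) cosine similarities
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)   # additive margin on the target class
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()

rng = np.random.default_rng(0)
print(am_softmax_loss(rng.normal(size=(8, 128)), rng.normal(size=(128, 462)),
                      rng.integers(0, 462, size=8)))
```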

Journal ArticleDOI
TL;DR: This work introduces a joint model trained with nonnegative matrix factorization (NMF)-based high-level features and puts forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs.
Abstract: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments. To explore better the end-to-end models, we propose improvements to the feature extraction and attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and it beats all present end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.

Journal ArticleDOI
TL;DR: This work adopts the One-Class SVM as its anomaly detection model, with a specific model built for each phoneme, and reduces the false-acceptance and false-rejection rates by 26% and 39% respectively compared to the conventional Goodness-of-Pronunciation algorithm.

Posted Content
TL;DR: In this article, a Generative Adversarial Network (GAN) was proposed to generate a large annotated speech corpus for training ASR systems, in which a generator and discriminator learn from each other iteratively to improve the performance.
Abstract: Producing a large annotated speech corpus for training ASR systems remains difficult for more than 95% of languages all over the world, which are low-resourced, but collecting a relatively large unlabeled data set for such languages is more achievable. This is why some initial efforts have been reported on completely unsupervised speech recognition learned from unlabeled data only, although with relatively high error rates. In this paper, we develop a Generative Adversarial Network (GAN) to achieve this purpose, in which a Generator and a Discriminator learn from each other iteratively to improve the performance. We further use a set of Hidden Markov Models (HMMs), iteratively refined from the machine-generated labels, to work in harmony with the GAN. The initial experiments on the TIMIT data set achieve a phone error rate of 33.1%, which is 8.5% lower than the previous state-of-the-art.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: Indic TIMIT and Indic English lexicon could be useful for a number of potential applications in Indian context including automatic speech recognition, mispronunciation detection & diagnosis, native language identification, accent adaptation, accent conversion, voice conversion, speech synthesis, grapheme-to-phoneme conversion, automatic phoneme unit discovery and pronunciation error analysis.
Abstract: With the advancements in speech technology, the demand for larger speech corpora is increasing, particularly for those from non-native English speakers. In order to cater to this demand in the Indian context, we acquire a database named Indic TIMIT, a phonetically rich Indian English speech corpus. It contains ~240 hours of speech recordings from 80 subjects, in which each subject has spoken a set of 2342 stimuli available in the TIMIT corpus. Further, the corpus also contains phoneme transcriptions for a subset of recordings, which are manually annotated by two linguists to reflect each speaker's pronunciation. Considering these, Indic TIMIT is unique with respect to the existing corpora that are available in the Indian context. Along with Indic TIMIT, a lexicon named Indic English lexicon is provided, which is constructed by incorporating pronunciation variations specific to Indians, obtained from their errors, into the word pronunciations of an existing native English lexicon. In this paper, the effectiveness of Indic TIMIT and the Indic English lexicon is shown in comparison with the data from TIMIT and a lexicon augmented with all the word pronunciations from CMU, Beep and the lexicon available in the TIMIT corpus, respectively. Indic TIMIT and the Indic English lexicon could be useful for a number of potential applications in the Indian context, including automatic speech recognition, mispronunciation detection & diagnosis, native language identification, accent adaptation, accent conversion, voice conversion, speech synthesis, grapheme-to-phoneme conversion, automatic phoneme unit discovery and pronunciation error analysis.

Journal ArticleDOI
TL;DR: Modules for authenticating persons using speech as a biometric against recorded playback attacks are presented; the system is found to be robust against playback attacks and performs well in authenticating the sixteen speakers considered in this work.
Abstract: This work presents modules for authenticating persons using speech as a biometric against recorded playback attacks. It involves the implementation of feature extraction, a modeling technique and a testing procedure for authenticating speakers. Playback attacks are simulated by recording the original speech utterances through a laptop's speakers and microphone using Audacity software. This work mainly involves the process of distinguishing original from recorded speech and authenticating speakers based on voice as a biometric. Features extracted from the original and recorded speech are used to develop models for them. Voice passwords are assigned to the speakers, and features are extracted from the training speech created by fusing the password-specific original speech utterances. These features are applied to the training algorithm to generate password-specific speaker models. The testing procedure involves feature extraction and the application of features to the models for recorded and original speech. If the test speech belongs to the recorded class, it is prevented from undergoing further processing. If it is original speech, feature vectors of the test speech are applied to the password-specific speaker models and, based on the classification criteria, a speaker is identified and authenticated. Our system is found to be robust against playback attacks and gives good performance in authenticating the sixteen speakers considered in our work. Passwords are isolated words and digits chosen from the “TIMIT” speech database. This work is also extended to the AVSpoof database for authenticating 44 speakers against replay attacks, and the performance is analyzed in terms of rejection rate.

Journal ArticleDOI
TL;DR: Preliminary results suggest that non-native speakers of English fail to produce flaps and reduced vowels, insert or delete segments, engage in more self-correction, and place pauses in different locations from native speakers.
Abstract: This study investigates differences in sentence and story production between native and non-native speakers of English for use with a system of Automatic Speech Recognition (ASR). Previous studies have shown that production errors by non-native speakers of English include misproduced segments (Flege, 1995), longer pause duration (Anderson-Hsieh and Venkatagiri, 1994), abnormal pause location within clauses (Kang, 2010), and non-reduction of function words (Jang, 2009). The present study uses phonemically balanced sentences from TIMIT (Garofolo et al., 1993) and a story to provide an additional comparison of the differences in production by native and non-native speakers of English. Consistent with previous research, preliminary results suggest that non-native speakers of English fail to produce flaps and reduced vowels, insert or delete segments, engage in more self-correction, and place pauses in different locations from native speakers. Non-native English speakers furthermore produce different patterns of intonation from native speakers and produce errors indicative of transfer from their L1 phonology, such as coda deletion and vowel epenthesis. Native speaker productions also contained errors, the majority of which were content-related. These results indicate that difficulties posed by English ASR systems in recognizing non-native speech are due largely to the heterogeneity of non-native production.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: A convolutional neural network with a Mel-spectrogram as input is used for speaker identification, and the proposed CNN architecture is evaluated and compared with state-of-the-art systems on clean and noisy speech samples.
Abstract: Conventional speaker identification systems require features that are carefully designed to achieve high identification accuracy rates. With deep learning, these features are learned rather than specifically designed. Improvements in deep neural network algorithms and techniques have led to an increased use of deep neural networks for speaker identification systems in place of conventional systems. In this paper, we use a convolutional neural network with a Mel-spectrogram as input for the identification task. The experiments are carried out on the TIMIT dataset to evaluate the proposed CNN architecture and to compare it with state-of-the-art systems on clean and noisy speech samples.
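A minimal sketch of the described front end, assuming librosa is available: a log mel-spectrogram that would be fed to a 2-D CNN speaker classifier. The synthetic waveform stands in for a TIMIT utterance, and the frame settings and mel-band count are illustrative assumptions.

```python
import numpy as np
import librosa   # assumed available; any log-mel front end would do

sr = 16000
y = np.random.default_rng(0).normal(size=3 * sr).astype(np.float32)   # 3 s stand-in "utterance"
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)     # shape (n_mels, frames): the CNN input
print(log_mel.shape)
```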