Journal ArticleDOI

FMLLR Speaker Normalization With i-Vector: In Pseudo-FMLLR and Distillation Framework

TL;DR: Two unsupervised speaker normalization techniques are proposed—one at the feature level and the other at the model level of acoustic modeling—to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios.
Abstract: When an automatic speech recognition (ASR) system is deployed for real-world applications, it often receives only one utterance at a time for decoding. This single utterance could be of short duration depending on the ASR task. In these cases, robust estimation of speaker normalizing methods like feature-space maximum likelihood linear regression (FMLLR) and i-vectors may not be feasible. In this paper, we propose two unsupervised speaker normalization techniques—one at the feature level and the other at the model level of acoustic modeling—to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios. At the feature level, we propose the use of deep neural networks (DNN) to generate pseudo-FMLLR features from time-synchronous pairs of filterbank and FMLLR features. These pseudo-FMLLR features can then be used for DNN acoustic model training and decoding. At the model level, we propose a generalized distillation framework, where a teacher DNN trained on FMLLR features guides the training and optimization of a student DNN trained on filterbank features. In both the proposed methods, the ambiguity in choosing the speaker-specific FMLLR transform can be reduced by augmenting i-vectors to the input filterbank features. Experiments conducted on 33-h and 110-h subsets of the Switchboard corpus show that the proposed methods provide significant gains over DNNs trained on FMLLR, i-vector-appended FMLLR, filterbank, and i-vector-appended filterbank features in real-time scenarios.
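The feature-level method lends itself to a compact sketch: a regression DNN maps each spliced filterbank frame (optionally augmented with the speaker's i-vector) to the corresponding FMLLR frame, and the predicted pseudo-FMLLR frames then feed acoustic model training and single-pass decoding. The following PyTorch-style code is a minimal illustration under assumed dimensions and layer sizes (40-dimensional filterbank features with ±2 splicing, a 100-dimensional i-vector, 40-dimensional FMLLR targets); it is not the authors' exact configuration.

import torch
import torch.nn as nn

class PseudoFmllrNet(nn.Module):
    # Regression network: spliced filterbank (+ i-vector) frame -> pseudo-FMLLR frame.
    # All dimensions below are illustrative assumptions, not taken from the paper.
    def __init__(self, fbank_dim=40, context=5, ivector_dim=100, fmllr_dim=40):
        super().__init__()
        in_dim = fbank_dim * context + ivector_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, fmllr_dim),
        )

    def forward(self, fbank_spliced, ivector):
        return self.net(torch.cat([fbank_spliced, ivector], dim=-1))

model = PseudoFmllrNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

def train_step(fbank_spliced, ivector, fmllr_target):
    # Frame-level regression toward the true FMLLR features of the training speaker.
    optimizer.zero_grad()
    loss = mse(model(fbank_spliced, ivector), fmllr_target)
    loss.backward()
    optimizer.step()
    return loss.item()

At test time only the filterbank features (and, if available, an i-vector) are required, so pseudo-FMLLR features can be produced in a single pass, avoiding the two-pass decoding that true FMLLR estimation needs.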
Citations
Journal ArticleDOI
TL;DR: This article unifies neural modeling results that illustrate several basic design principles and mechanisms used by advanced brains to develop cortical maps with multiple psychological functions and concerns the role of Adaptive Resonance Theory top-down matching and attentional circuits in the dynamic stabilization of early development and adult learning.
Abstract: This article unifies neural modeling results that illustrate several basic design principles and mechanisms that are used by advanced brains to develop cortical maps with multiple psychological functions. One principle concerns how brains use a strip map that simultaneously enables one feature to be represented throughout its extent, as well as an ordered array of another feature at different positions of the strip. Strip maps include circuits to represent ocular dominance and orientation columns, place-value numbers, auditory streams, speaker-normalized speech, and cognitive working memories that can code repeated items. A second principle concerns how feature detectors for multiple functions develop in topographic maps, including maps for optic flow navigation, reinforcement learning, motion perception, and category learning at multiple organizational levels. A third principle concerns how brains exploit a spatial gradient of cells that respond at an ordered sequence of different rates. Such a rate gradient is found along the dorsoventral axis of the entorhinal cortex, whose lateral branch controls the development of time cells, and whose medial branch controls the development of grid cells. Populations of time cells can be used to learn how to adaptively time behaviors for which a time interval of hundreds of milliseconds, or several seconds, must be bridged, as occurs during trace conditioning. Populations of grid cells can be used to learn hippocampal place cells that represent the large spaces in which animals navigate. A fourth principle concerns how and why all neocortical circuits are organized into layers, and how functionally distinct columns develop in these circuits to enable map development. A final principle concerns the role of Adaptive Resonance Theory top-down matching and attentional circuits in the dynamic stabilization of early development and adult learning. Cortical maps are modeled in visual, auditory, temporal, parietal, prefrontal, entorhinal, and hippocampal cortices.

10 citations

Journal ArticleDOI
TL;DR: A new feature engineering approach for deep learning-based acoustic modeling, which utilizes input feature contributions through an auxiliary deep neural network called a feature contribution network (FCN) whose output layer is composed of sigmoid-based contribution gates.
Abstract: In this paper, we introduce a new feature engineering approach for deep learning-based acoustic modeling, which utilizes input feature contributions. For this purpose, we propose an auxiliary deep neural network (DNN) called a feature contribution network (FCN) whose output layer is composed of sigmoid-based contribution gates. In our framework, the FCN tries to learn element-level discriminative contributions of input features, and an acoustic model network (AMN) is trained on gated features generated by element-wise multiplication between contribution gate outputs and input features. In addition, we also propose a regularization method for the FCN, which helps the FCN to activate the minimum number of gates. The proposed methods were evaluated on the TED-LIUM release 1 corpus. We applied the proposed methods to DNN- and long short-term memory-based AMNs. Experimental results showed that AMNs with FCNs consistently improved recognition performance compared with AMN-only frameworks.
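The gating mechanism described above is straightforward to sketch. The code below is an illustrative PyTorch-style fragment with assumed dimensions, not the paper's exact FCN/AMN configuration: the auxiliary network emits per-element sigmoid gates, the acoustic model consumes the element-wise product of gates and input features, and a penalty on the gate activations encourages only a minimal number of gates to stay open.

import torch
import torch.nn as nn

class FeatureContributionGate(nn.Module):
    # Auxiliary FCN producing element-level contribution gates (illustrative sketch).
    def __init__(self, feat_dim=440, hidden_dim=512):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim), nn.Sigmoid(),  # gates in (0, 1)
        )

    def forward(self, x):
        gates = self.fcn(x)
        return gates * x, gates  # gated features for the AMN, plus the gates themselves

def gate_penalty(gates, weight=1e-3):
    # Regularizer pushing gate activations toward zero so that only the most
    # useful feature elements remain active (one plausible reading of the idea).
    return weight * gates.mean()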

4 citations


Cites background from "FMLLR Speaker Normalization With i-..."

  • ...In [17], the authors introduced an auxiliary network to generate pseudo-FMLLR features from filterbank...


Proceedings ArticleDOI
01 Jul 2019
TL;DR: A novel method is proposed for long-term prediction of small time-series data, designed in the framework of generalized distillation to utilize the middle-time data between the input and output times as "privileged information," which is available only in the training phase and not in the test phase.
Abstract: The recent increase of "big data" in our society has led to major impacts of machine learning and data mining technologies in various fields ranging from marketing to science. On the other hand, there still exist areas where only small-sized data are available for various reasons, for example, high data acquisition costs or the rarity of target events. Machine learning tasks using such small data are usually difficult because of the lack of information available for training accurate prediction models. In particular, for long-term time-series prediction, the data size tends to be small because of the unavailability of the data between input and output times in training. Such limitations on the size of time-series data further make long-term prediction tasks quite difficult; in addition, there is the difficulty that the far future is more uncertain than the near future. In this paper, we propose a novel method for long-term prediction of small time-series data designed in the framework of generalized distillation. The key idea of the proposed method is to utilize the middle-time data between the input and output times as "privileged information," which is available only in the training phase and not in the test phase. We demonstrate the effectiveness of the proposed method on both synthetic data and real-world data. The experimental results show the proposed method performs well, particularly when the task is difficult and has high input dimensions.
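Generalized distillation, which this citing paper shares with the cited work, trains a student on the regular inputs while a teacher trained with privileged information available only at training time (middle-time data here, FMLLR features in the cited ASR work) supplies soft targets. A minimal sketch of such a loss, with temperature and imitation weight chosen purely for illustration:

import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, hard_targets,
                                  temperature=2.0, imitation_weight=0.5):
    # Hard-label cross-entropy plus soft cross-entropy against the
    # temperature-smoothed teacher outputs (the common T^2 gradient
    # rescaling is omitted here for brevity).
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = -(soft_teacher * log_soft_student).sum(dim=-1).mean()
    return (1.0 - imitation_weight) * hard_loss + imitation_weight * soft_loss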

4 citations


Cites methods from "FMLLR Speaker Normalization With i-..."

  • ...Generalized distillation is extended and also applied to speech normalization tasks [25], [26]....


Journal ArticleDOI
30 Jan 2020
TL;DR: It is verified that the i-vector recognition model achieves a lower error rate and is more efficient than the GMM-UBM model; the main problems that need to be dealt with are also studied.
Abstract: The commonly used text-independent voiceprint recognition models are the Gaussian mixture model (GMM) and the GMM with universal background model (GMM-UBM). The GMM mean supervector contains both speaker information and channel information, which results in unstable performance of GMM and GMM-UBM recognition systems. In addition, cross-channel recognition ability is poor; moreover, both models are limited by the maximum likelihood criterion, so their ability to distinguish categories is weak. The i-vector, also known as the identity authentication vector, has been proposed in recent years on the basis of the Gaussian supervector. The method uses a single space in place of two separate spaces for speaker differences and channel differences, and it is regarded as the most advanced speaker modeling technique available today. Therefore, this paper adopted the i-vector framework as the speaker recognition model and studied the main problems that need to be dealt with. The recognition performance of the GMM-UBM model and the i-vector model was also investigated experimentally. Through comparison experiments, it is verified that the i-vector recognition model achieves a lower error rate and is more efficient. In the recognition phase, only two seconds of speech need to be recorded to recognize the speaker's identity, and the system's recognition accuracy reaches 97%.

2 citations


Cites background from "FMLLR Speaker Normalization With i-..."

  • ...Where S denotes the supervector associated with the speaker and the channel; m denotes the supervector independent of the speaker and the channel; the total-variability subspace matrix T completes the mapping from the high-dimensional space to the low-dimensional space, thereby making the vector after dimensionality reduction more conducive to further classification and recognition; ω represents the vector associated with the speaker and channel, a total-variability factor containing speaker information and channel information [14]....

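For reference, the total-variability relation verbalized in the excerpt above is conventionally written as

    S = m + Tω,

where S is the speaker- and channel-dependent supervector, m is the speaker- and channel-independent supervector, T is the total-variability matrix that maps the high-dimensional supervector space to a low-dimensional subspace, and ω is the i-vector, the latent factor carrying both speaker and channel information.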

Proceedings ArticleDOI
30 Aug 2021
TL;DR: This paper shows that the proposed joint VAE based mapping achieves large improvements over ASR models trained using filterbank SI features, and that joint VAE outperforms DA by a large margin.
Abstract: Speaker adaptation is known to provide significant improvement in speech recognition accuracy. However, in practical scenarios, only a few seconds of audio are available, due to which it may be infeasible to apply speaker adaptation methods such as i-vector and fMLLR robustly. Also, decoding with fMLLR transformation happens in two passes, which is impractical for real-time applications. In the recent past, mapping speech features from speaker independent (SI) space to fMLLR normalized space using a denoising autoencoder (DA) has been explored. To the best of our knowledge, such mapping generally does not yield consistent improvement. In this paper, we show that our proposed joint VAE based mapping achieves large improvements over ASR models trained using filterbank SI features. We also show that joint VAE outperforms DA by a large margin. We observe a relative improvement of 17% in word error rate (WER) compared to an ASR model trained using filterbank features with i-vectors and 23% without i-vectors.

Cites background or methods from "FMLLR Speaker Normalization With i-..."

  • ...Recently these issues were addressed in [9], where the authors proposed to learn a DNN based mapping from filterbank (speaker independent space) to fMLLR-normalized features in a regression framework, and then use these normalized features predicted by the DNN to train acoustic model....


  • ...To the best of our knowledge, the DA based mapping does not show consistent improvements [9] for speaker adaptation....


  • ...From the results, it may be concluded that the generalization power of DA to unseen speakers [9] is rather poor....


  • ...The third model is trained on 41-dimensional log-mel filterbank features with ±2 splicing concatenated with 40-dimensional i-vectors [9] for speaker aware training....


  • ...Following [9], the baseline DA is trained to map filterbank features to fMLLRnormalized features....


References
Posted Content
TL;DR: This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Abstract: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

12,857 citations

Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...Kaldi toolkit [27] was used for feature extraction, GMM-HMM training and DNN modeling....


Journal ArticleDOI
TL;DR: The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
Abstract: In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.

1,250 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...The speech enhancing DNN proposed in [14]–[16] is the inspiration for the proposed method....


Journal ArticleDOI
TL;DR: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.
Abstract: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises successfully without the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer DNN-based enhanced speech to that obtained with other conventional technique.

860 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...The speech enhancing DNN proposed in [14]–[16] is the inspiration for the proposed method....


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a Switchboard 300 hours corpus show that DNNs trained on speaker independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR) with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after hessian-free sequence training over networks trained on speaker-adapted features only.
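The i-vector augmentation described here amounts to copying one per-speaker vector onto every frame of that speaker and concatenating it with the acoustic features. A short sketch with assumed shapes (300 frames of 40-dimensional filterbank features, a 100-dimensional i-vector), purely for illustration:

import torch

fbank = torch.randn(300, 40)      # frames of filterbank features (assumed shapes)
ivector = torch.randn(100)        # a single i-vector for the whole speaker
ivector_tiled = ivector.unsqueeze(0).expand(fbank.size(0), -1)   # repeat across frames
augmented = torch.cat([fbank, ivector_tiled], dim=-1)            # (300, 140) network input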

714 citations


"FMLLR Speaker Normalization With i-..." refers background in this paper

  • ...i-vectors are estimated for each training speaker and are then concatenated with all the filterbank feature frames belonging to that speaker [12]....


  • ...Speaker codes [6], [9], eigenvectors in speaker space [10], speaker separation bottleneck features [11], and i-vectors [12], [13] are a few examples of auxiliary speaker-specific codes....
