Journal ArticleDOI

FMLLR Speaker Normalization With i-Vector: In Pseudo-FMLLR and Distillation Framework

TL;DR: Two unsupervised speaker normalization techniques are proposed—one at the feature level and the other at the model level of acoustic modeling—to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios.
Abstract: When an automatic speech recognition (ASR) system is deployed for real-world applications, it often receives only one utterance at a time for decoding. This single utterance could be of short duration depending on the ASR task. In these cases, robust estimation of speaker normalizing methods like feature-space maximum likelihood linear regression (FMLLR) and i-vectors may not be feasible. In this paper, we propose two unsupervised speaker normalization techniques—one at the feature level and the other at the model level of acoustic modeling—to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios. At the feature level, we propose the use of deep neural networks (DNN) to generate pseudo-FMLLR features from time-synchronous pairs of filterbank and FMLLR features. These pseudo-FMLLR features can then be used for DNN acoustic model training and decoding. At the model level, we propose a generalized distillation framework, where a teacher DNN trained on FMLLR features guides the training and optimization of a student DNN trained on filterbank features. In both the proposed methods, the ambiguity in choosing the speaker-specific FMLLR transform can be reduced by augmenting i-vectors to the input filterbank features. Experiments conducted on 33-h and 110-h subsets of the Switchboard corpus show that the proposed methods provide significant gains over DNNs trained on FMLLR, i-vector-appended FMLLR, filterbank, and i-vector-appended filterbank features in real-time scenarios.
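The feature-level method lends itself to a compact sketch: a regression DNN maps each spliced filterbank frame (optionally augmented with the speaker's i-vector) to the corresponding FMLLR frame, and the predicted pseudo-FMLLR frames then feed acoustic model training and single-pass decoding. The following PyTorch-style code is a minimal illustration under assumed dimensions and layer sizes (40-dimensional filterbank features with ±2 splicing, a 100-dimensional i-vector, 40-dimensional FMLLR targets); it is not the authors' exact configuration.

import torch
import torch.nn as nn

class PseudoFmllrNet(nn.Module):
    # Regression network: spliced filterbank (+ i-vector) frame -> pseudo-FMLLR frame.
    # All dimensions below are illustrative assumptions, not taken from the paper.
    def __init__(self, fbank_dim=40, context=5, ivector_dim=100, fmllr_dim=40):
        super().__init__()
        in_dim = fbank_dim * context + ivector_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, fmllr_dim),
        )

    def forward(self, fbank_spliced, ivector):
        return self.net(torch.cat([fbank_spliced, ivector], dim=-1))

model = PseudoFmllrNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

def train_step(fbank_spliced, ivector, fmllr_target):
    # Frame-level regression toward the true FMLLR features of the training speaker.
    optimizer.zero_grad()
    loss = mse(model(fbank_spliced, ivector), fmllr_target)
    loss.backward()
    optimizer.step()
    return loss.item()

At test time only the filterbank features (and, if available, an i-vector) are required, so pseudo-FMLLR features can be produced in a single pass, avoiding the two-pass decoding that true FMLLR estimation needs.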
Citations
Journal ArticleDOI
TL;DR: This article unifies neural modeling results that illustrate several basic design principles and mechanisms used by advanced brains to develop cortical maps with multiple psychological functions and concerns the role of Adaptive Resonance Theory top-down matching and attentional circuits in the dynamic stabilization of early development and adult learning.
Abstract: This article unifies neural modeling results that illustrate several basic design principles and mechanisms that are used by advanced brains to develop cortical maps with multiple psychological functions. One principle concerns how brains use a strip map that simultaneously enables one feature to be represented throughout its extent, as well as an ordered array of another feature at different positions of the strip. Strip maps include circuits to represent ocular dominance and orientation columns, place-value numbers, auditory streams, speaker-normalized speech, and cognitive working memories that can code repeated items. A second principle concerns how feature detectors for multiple functions develop in topographic maps, including maps for optic flow navigation, reinforcement learning, motion perception, and category learning at multiple organizational levels. A third principle concerns how brains exploit a spatial gradient of cells that respond at an ordered sequence of different rates. Such a rate gradient is found along the dorsoventral axis of the entorhinal cortex, whose lateral branch controls the development of time cells, and whose medial branch controls the development of grid cells. Populations of time cells can be used to learn how to adaptively time behaviors for which a time interval of hundreds of milliseconds, or several seconds, must be bridged, as occurs during trace conditioning. Populations of grid cells can be used to learn hippocampal place cells that represent the large spaces in which animals navigate. A fourth principle concerns how and why all neocortical circuits are organized into layers, and how functionally distinct columns develop in these circuits to enable map development. A final principle concerns the role of Adaptive Resonance Theory top-down matching and attentional circuits in the dynamic stabilization of early development and adult learning. Cortical maps are modeled in visual, auditory, temporal, parietal, prefrontal, entorhinal, and hippocampal cortices.

10 citations

Journal ArticleDOI
TL;DR: A new feature engineering approach for deep learning-based acoustic modeling, which utilizes input feature contributions through an auxiliary deep neural network called a feature contribution network (FCN) whose output layer is composed of sigmoid-based contribution gates.
Abstract: In this paper, we introduce a new feature engineering approach for deep learning-based acoustic modeling, which utilizes input feature contributions. For this purpose, we propose an auxiliary deep neural network (DNN) called a feature contribution network (FCN) whose output layer is composed of sigmoid-based contribution gates. In our framework, the FCN tries to learn element-level discriminative contributions of input features, and an acoustic model network (AMN) is trained on gated features generated by element-wise multiplication between contribution gate outputs and input features. In addition, we also propose a regularization method for the FCN, which helps the FCN to activate the minimum number of gates. The proposed methods were evaluated on the TED-LIUM release 1 corpus. We applied the proposed methods to DNN- and long short-term memory-based AMNs. Experimental results showed that AMNs with FCNs consistently improved recognition performance compared with AMN-only frameworks.
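The gating mechanism described above is straightforward to sketch. The code below is an illustrative PyTorch-style fragment with assumed dimensions, not the paper's exact FCN/AMN configuration: the auxiliary network emits per-element sigmoid gates, the acoustic model consumes the element-wise product of gates and input features, and a penalty on the gate activations encourages only a minimal number of gates to stay open.

import torch
import torch.nn as nn

class FeatureContributionGate(nn.Module):
    # Auxiliary FCN producing element-level contribution gates (illustrative sketch).
    def __init__(self, feat_dim=440, hidden_dim=512):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim), nn.Sigmoid(),  # gates in (0, 1)
        )

    def forward(self, x):
        gates = self.fcn(x)
        return gates * x, gates  # gated features for the AMN, plus the gates themselves

def gate_penalty(gates, weight=1e-3):
    # Regularizer pushing gate activations toward zero so that only the most
    # useful feature elements remain active (one plausible reading of the idea).
    return weight * gates.mean()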

4 citations


Cites background from "FMLLR Speaker Normalization With i-..."

  • ...In [17], the authors introduced an auxiliary network to generate pseudo-FMLLR features from filterbank...


Proceedings ArticleDOI
01 Jul 2019
TL;DR: A novel method is proposed for long-term prediction of small time-series data, designed in the framework of generalized distillation to utilize the middle-time data between the input and output times as "privileged information," which is available only in the training phase and not in the test phase.
Abstract: The recent increase of "big data" in our society has led to major impacts of machine learning and data mining technologies in various fields ranging from marketing to science. On the other hand, there still exist areas where only small-sized data are available for various reasons, for example, high data acquisition costs or the rarity of target events. Machine learning tasks using such small data are usually difficult because of the lack of information available for training accurate prediction models. In particular, for long-term time-series prediction, the data size tends to be small because of the unavailability of the data between input and output times in training. Such limitations on the size of time-series data further make long-term prediction tasks quite difficult; in addition, there is the difficulty that the far future is more uncertain than the near future. In this paper, we propose a novel method for long-term prediction of small time-series data designed in the framework of generalized distillation. The key idea of the proposed method is to utilize the middle-time data between the input and output times as "privileged information," which is available only in the training phase and not in the test phase. We demonstrate the effectiveness of the proposed method on both synthetic data and real-world data. The experimental results show the proposed method performs well, particularly when the task is difficult and has high input dimensions.
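Generalized distillation, which this citing paper shares with the cited work, trains a student on the regular inputs while a teacher trained with privileged information available only at training time (middle-time data here, FMLLR features in the cited ASR work) supplies soft targets. A minimal sketch of such a loss, with temperature and imitation weight chosen purely for illustration:

import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, hard_targets,
                                  temperature=2.0, imitation_weight=0.5):
    # Hard-label cross-entropy plus soft cross-entropy against the
    # temperature-smoothed teacher outputs (the common T^2 gradient
    # rescaling is omitted here for brevity).
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = -(soft_teacher * log_soft_student).sum(dim=-1).mean()
    return (1.0 - imitation_weight) * hard_loss + imitation_weight * soft_loss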

4 citations


Cites methods from "FMLLR Speaker Normalization With i-..."

  • ...Generalized distillation is extended and also applied to speech normalization tasks [25], [26]....


Journal ArticleDOI
30 Jan 2020
TL;DR: It is verified that the i-vector recognition model achieves a lower error rate and is more efficient than the GMM-UBM model; the main problems that need to be dealt with are also studied.
Abstract: The commonly used text-independent voiceprint recognition models are the Gaussian mixture model (GMM) and the GMM with universal background model (GMM-UBM). The GMM mean supervector contains both speaker information and channel information, which results in unstable performance of GMM and GMM-UBM recognition systems. In addition, cross-channel recognition ability is poor; moreover, both models are limited by the maximum likelihood criterion, so their ability to distinguish categories is weak. The i-vector, also known as the identity authentication vector, has been proposed in recent years on the basis of the Gaussian supervector. The method uses a single space in place of two separate spaces for speaker differences and channel differences, and it is regarded as the most advanced speaker modeling technique available today. Therefore, this paper adopted the i-vector framework as the speaker recognition model and studied the main problems that need to be dealt with. The recognition performance of the GMM-UBM model and the i-vector model was also investigated experimentally. Through comparison experiments, it is verified that the i-vector recognition model achieves a lower error rate and is more efficient. In the recognition phase, only two seconds of speech need to be recorded to recognize the speaker's identity, and the system's recognition accuracy reaches 97%.

2 citations


Cites background from "FMLLR Speaker Normalization With i-..."

  • ...Where S denotes the supervector associated with the speaker and the channel; m denotes the supervector independent of the speaker and the channel; the total-variability subspace matrix T completes the mapping from the high-dimensional space to the low-dimensional space, thereby making the vector after dimensionality reduction more conducive to further classification and recognition; ω represents the vector associated with the speaker and channel, a total-variability factor containing speaker information and channel information [14]....

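For reference, the total-variability relation verbalized in the excerpt above is conventionally written as

    S = m + Tω,

where S is the speaker- and channel-dependent supervector, m is the speaker- and channel-independent supervector, T is the total-variability matrix that maps the high-dimensional supervector space to a low-dimensional subspace, and ω is the i-vector, the latent factor carrying both speaker and channel information.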

Proceedings ArticleDOI
30 Aug 2021
TL;DR: This paper shows that the proposed joint VAE based mapping achieves large improvements over ASR models trained using filterbank SI features, and that joint VAE outperforms DA by a large margin.
Abstract: Speaker adaptation is known to provide significant improvement in speech recognition accuracy. However, in practical scenarios, only a few seconds of audio are available, due to which it may be infeasible to apply speaker adaptation methods such as i-vector and fMLLR robustly. Also, decoding with fMLLR transformation happens in two passes, which is impractical for real-time applications. In the recent past, mapping speech features from speaker independent (SI) space to fMLLR normalized space using a denoising autoencoder (DA) has been explored. To the best of our knowledge, such mapping generally does not yield consistent improvement. In this paper, we show that our proposed joint VAE based mapping achieves large improvements over ASR models trained using filterbank SI features. We also show that joint VAE outperforms DA by a large margin. We observe a relative improvement of 17% in word error rate (WER) compared to an ASR model trained using filterbank features with i-vectors and 23% without i-vectors.

Cites background or methods from "FMLLR Speaker Normalization With i-..."

  • ...Recently these issues were addressed in [9], where the authors proposed to learn a DNN based mapping from filterbank (speaker independent space) to fMLLR-normalized features in a regression framework, and then use these normalized features predicted by the DNN to train acoustic model....


  • ...To the best of our knowledge, the DA based mapping does not show consistent improvements [9] for speaker adaptation....


  • ...From the results, it may be concluded that the generalization power of DA to unseen speakers [9] is rather poor....


  • ...The third model is trained on 41-dimensional log-mel filterbank features with ±2 splicing concatenated with 40-dimensional i-vectors [9] for speaker aware training....


  • ...Following [9], the baseline DA is trained to map filterbank features to fMLLRnormalized features....


References
Posted Content
TL;DR: This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Abstract: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

12,857 citations

Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...Kaldi toolkit [27] was used for feature extraction, GMM-HMM training and DNN modeling....


Journal ArticleDOI
TL;DR: The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
Abstract: In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.

1,250 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...The speech enhancing DNN proposed in [14]–[16] is the inspiration for the proposed method....


Journal ArticleDOI
TL;DR: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.
Abstract: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises successfully without the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer DNN-based enhanced speech to that obtained with other conventional technique.

860 citations


"FMLLR Speaker Normalization With i-..." refers methods in this paper

  • ...The speech enhancing DNN proposed in [14]–[16] is the inspiration for the proposed method....


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a Switchboard 300 hours corpus show that DNNs trained on speaker independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR) with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after hessian-free sequence training over networks trained on speaker-adapted features only.
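The i-vector augmentation described here amounts to copying one per-speaker vector onto every frame of that speaker and concatenating it with the acoustic features. A short sketch with assumed shapes (300 frames of 40-dimensional filterbank features, a 100-dimensional i-vector), purely for illustration:

import torch

fbank = torch.randn(300, 40)      # frames of filterbank features (assumed shapes)
ivector = torch.randn(100)        # a single i-vector for the whole speaker
ivector_tiled = ivector.unsqueeze(0).expand(fbank.size(0), -1)   # repeat across frames
augmented = torch.cat([fbank, ivector_tiled], dim=-1)            # (300, 140) network input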

714 citations


"FMLLR Speaker Normalization With i-..." refers background in this paper

  • ...i-vectors are estimated for each training speaker and are then concatenated with all the filterbank feature frames belonging to that speaker [12]....


  • ...Speaker codes [6], [9], eigenvectors in speaker space [10], speaker separation bottleneck features [11], and i-vectors [12], [13] are a few examples of auxiliary speaker-specific codes....
