Proceedings ArticleDOI

Investigation of different acoustic modeling techniques for low resource Indian language data

TL;DR: This paper proposes to train a DNN containing a bottleneck layer in two stages, which shows improved performance compared to baseline SGMM and DNN models for limited training data.
Abstract: In this paper, we investigate the performance of the deep neural network (DNN) and the subspace Gaussian mixture model (SGMM) under low-resource conditions. Even though the DNN outperforms SGMM and continuous density hidden Markov models (CDHMM) when training data are plentiful, its performance degrades when modeling low-resource data. Our experimental results show that SGMM outperforms DNN for limited transcribed data. To resolve this problem, we propose to train a DNN containing a bottleneck layer in two stages: the first stage extracts bottleneck features; in the second stage, the extracted bottleneck features are used to train a second DNN that also has a bottleneck layer. All our experiments are performed on two Indian languages (Tamil and Hindi) from the Mandi database. Our proposed method shows improved performance compared to the baseline SGMM and DNN models for limited training data.
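As a concrete illustration of the two-stage scheme, the sketch below builds a bottleneck DNN twice: stage one trains it on the raw acoustic features and keeps only its bottleneck outputs, and stage two trains a second bottleneck DNN on those extracted features. This is a minimal PyTorch sketch; all layer sizes, activations, and training details are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal PyTorch sketch of the two-stage bottleneck-DNN idea described
# above. All dimensions and hyperparameters are illustrative assumptions,
# not the values used in the paper.
import torch
import torch.nn as nn

FEAT_DIM, BN_DIM, HID, NUM_STATES = 440, 40, 1024, 2000  # assumed sizes

class BottleneckDNN(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, HID), nn.Sigmoid(),
            nn.Linear(HID, BN_DIM),          # bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(BN_DIM, HID), nn.Sigmoid(),
            nn.Linear(HID, NUM_STATES),      # HMM-state (senone) outputs
        )

    def forward(self, x):
        bn = self.encoder(x)                 # bottleneck activations
        return self.classifier(bn), bn

# Stage 1: train a bottleneck DNN on the raw acoustic features, then use
# its encoder purely as a feature extractor.
stage1 = BottleneckDNN(FEAT_DIM)
# ... train stage1 with cross-entropy against HMM-state targets ...
with torch.no_grad():
    _, bn_feats = stage1(torch.randn(8, FEAT_DIM))  # dummy batch

# Stage 2: train a second DNN, itself containing a bottleneck layer,
# on the features extracted in stage 1.
stage2 = BottleneckDNN(BN_DIM)
logits, _ = stage2(bn_feats)
```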
Citations
Dissertation
01 Jul 2016
TL;DR: This research applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance, and proposed a new approach for visualisation of the AID feature space that is helpful in analysing AID recognition accuracies and confusion matrices.
Abstract: Accent is cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate is up to seven times greater for accented speech than for standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker to choose an accent-dependent model via an accent identification (AID) system, or building a model using the data from the N neighbouring speakers in AID space, results in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. In addition, we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model improves recognition accuracy considerably; even with two stages of accent adaptation followed by speaker adaptation, the GMM-HMM based system does not outperform the baseline DNN-HMM based system. For more contemporary DNN-HMM based ASR systems, we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space, which is helpful in analysing AID recognition accuracies and confusion matrices.
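The neighbouring-speaker idea can be sketched as a nearest-neighbour search in the AID embedding space. The snippet below is a toy illustration only; the embedding dimensionality, the Euclidean metric, and all data are assumptions, not the thesis's actual AID features.

```python
# Toy sketch: pick the N nearest training speakers to a test speaker in an
# accent-identification (AID) embedding space, as a basis for building an
# accent-matched acoustic model. Embeddings and metric are illustrative.
import numpy as np

def nearest_speakers(test_emb, train_embs, speaker_ids, n=10):
    """Return the IDs of the n training speakers closest to test_emb."""
    dists = np.linalg.norm(train_embs - test_emb, axis=1)  # Euclidean
    order = np.argsort(dists)[:n]
    return [speaker_ids[i] for i in order]

# Dummy data: 100 training speakers with 32-dimensional AID embeddings.
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(100, 32))
ids = [f"spk{i:03d}" for i in range(100)]
print(nearest_speakers(rng.normal(size=32), train_embs, ids, n=5))
```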

12 citations


Cites result from "Investigation of different acoustic..."

  • ...Recent studies showed that SGMM and DNN based systems provide similar results [168] and the LSTM RNN based systems provide the highest accuracy of all [167, 169, 170]....


Journal ArticleDOI
TL;DR: The results show that DNN is an effective acoustic modeling technique for the Amharic language, and that the hybrid acoustic units perform best under both sufficient and limited training datasets, achieving the highest accuracy.
Abstract: Multitask learning (MTL) is helpful for improving the performance of related tasks when the training dataset is limited and sparse, especially for low-resource languages. Amharic is a low-resource language and suffers from training data scarcity, sparsity, and unevenness. Consequently, speech recognizers based on its fundamental acoustic units perform worse than those of technologically favored languages. This paper presents our contributions on the use of various hybrid acoustic modeling units for the Amharic language. Deep neural network (DNN) models based on the fundamental acoustic units, namely syllable, phone, and rounded-phone units, have been developed. Various hybrid acoustic units have been investigated by jointly training the fundamental acoustic units via the MTL technique; these hybrid units are discussed and compared with the fundamental units. The experimental results demonstrate that all the fundamental units-based DNN models outperform Gaussian mixture models (GMM), with relative performance improvements of 14.14%-23.31%. All the hybrid units outperform the fundamental acoustic units, with relative performance improvements of 1.33%-4.27%, and are useful with both sufficient and limited training datasets. The syllable and phone units exhibit higher performance under sufficient and limited training datasets, respectively. Overall, our results show that DNN is an effective acoustic modeling technique for the Amharic language. The context-dependent (CD) syllable is the more suitable unit if a sufficient training corpus is available and recognizer accuracy is prioritized. The CD phone is superior if the available training dataset is limited, realizing the highest accuracy and fast recognition speed. The hybrid acoustic units perform best under both sufficient and limited training datasets and achieve the highest accuracy.
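A minimal sketch of the MTL setup described above: a shared trunk with separate classification heads for two acoustic unit inventories, trained with a joint loss. Layer sizes, unit counts, and the loss weighting are assumptions for illustration, not the paper's configuration.

```python
# Minimal PyTorch sketch of multitask acoustic modeling: a shared trunk
# with separate softmax heads for phone and syllable targets, trained
# jointly. All sizes and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class MTLAcousticModel(nn.Module):
    def __init__(self, in_dim=440, hid=1024, n_phones=2000, n_syls=3000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
        )
        self.phone_head = nn.Linear(hid, n_phones)   # phone-unit task
        self.syl_head = nn.Linear(hid, n_syls)       # syllable-unit task

    def forward(self, x):
        h = self.trunk(x)                            # shared representation
        return self.phone_head(h), self.syl_head(h)

model = MTLAcousticModel()
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 440)                        # dummy feature batch
phone_t = torch.randint(0, 2000, (8,))         # dummy phone targets
syl_t = torch.randint(0, 3000, (8,))           # dummy syllable targets
p_logits, s_logits = model(x)
loss = ce(p_logits, phone_t) + 0.5 * ce(s_logits, syl_t)  # weighted sum
loss.backward()
```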

9 citations

Journal ArticleDOI
TL;DR: This paper combines CNNs and conventional RNNs with gate, highway, and residual networks to mitigate these problems, and explores the optimal neural network structures and training strategies for the proposed models.
Abstract: Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition. Among these networks, the convolutional neural network (CNN) is effective at representing the local properties of speech formants. However, CNNs are not well suited to modeling the long-term context dependencies between speech signal frames. Recurrent neural networks (RNNs) have recently shown great ability to model such long-term dependencies, but their performance on low-resource speech recognition tasks is poor, sometimes even worse than that of conventional feed-forward neural networks, and they often overfit severely on the training corpus. This paper presents our contributions in combining CNNs and conventional RNNs with gate, highway, and residual networks to mitigate these problems. The optimal neural network structures and training strategies for the proposed models are explored. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited (10-hour) language packages of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for building better-performing acoustic models for low-resource speech recognition tasks.
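A minimal sketch of the kind of CNN/RNN combination described above: a convolutional front end for local spectral structure, a bidirectional GRU for long-term context, and a residual connection around the recurrent block. The exact architecture here is an assumption for illustration, not the paper's.

```python
# Illustrative PyTorch sketch: CNN front end + bidirectional GRU with a
# residual connection around the recurrent block. Sizes are assumptions.
import torch
import torch.nn as nn

class ConvBGRU(nn.Module):
    def __init__(self, feat_dim=40, hid=256, n_states=2000):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hid, kernel_size=5, padding=2)
        self.bgru = nn.GRU(hid, hid, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hid, hid)   # back to residual width
        self.out = nn.Linear(hid, n_states)

    def forward(self, x):                     # x: (batch, time, feat)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        r, _ = self.bgru(h)                   # long-term context
        h = h + self.proj(r)                  # residual connection
        return self.out(h)

logits = ConvBGRU()(torch.randn(4, 100, 40))  # -> (4, 100, 2000)
```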

8 citations

Proceedings Article
TL;DR: This study characterises a low-resourced (LR) Austrian German conversational task and shows that fine-tuning a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included.
Abstract: Conversational speech represents one of the most complex automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an LR Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training arises from the larger database rather than from the self-supervision itself. Further, using a leave-one-conversation-out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state of the art.
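A sketch of this kind of fine-tuning setup, using the Hugging Face transformers library as one possible implementation; the checkpoint name, vocabulary size, and training details are placeholders rather than the study's configuration.

```python
# Sketch of fine-tuning a self-supervised pre-trained model for ASR with
# the Hugging Face transformers library. The checkpoint, vocabulary size,
# and training details are placeholders, not the paper's setup.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",       # pre-trained via self-supervision
    vocab_size=32,                  # assumed target character-set size
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()      # keep the CNN feature encoder fixed

wave = torch.randn(1, 16000)        # 1 s of dummy 16 kHz audio
labels = torch.randint(1, 32, (1, 12))   # dummy character targets
loss = model(input_values=wave, labels=labels).loss  # CTC loss
loss.backward()
```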

2 citations

References
Proceedings Article
01 Jan 2011
TL;DR: This paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...The DNN training for all datasets is performed using the standard recipe as in Kaldi [8]....


  • ...These parameters are fixed as per Kaldi’s DNN implementation....


  • ...Dropout method is performed only during the DNN training phase and implemented using Kaldi+PDNN [9]....


  • ...The experiments are performed using Kaldi Speech recognition toolkit [8]....

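The dropout excerpt above reflects standard practice: dropout is active only while training and is disabled at decoding time. A minimal PyTorch sketch of that behaviour (the Kaldi+PDNN implementation itself is not shown here):

```python
# Dropout is applied only in the training phase; at test/decoding time it
# is a no-op. PyTorch switches this via the module's train/eval mode.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.2))
x = torch.ones(1, 10)

layer.train()                 # training phase: units dropped at random
print(layer(x))
layer.eval()                  # decoding phase: dropout disabled
print(layer(x))
```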

ReportDOI
01 Feb 1993

1,238 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...The TIMIT database [10] is recorded on eight principal dialects of American English for 490 speakers with a sampling rate of 16 kHz....


Proceedings ArticleDOI
26 May 2013
TL;DR: DNN-based acoustic models can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation, and can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training.
Abstract: Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.
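The noise-aware training input can be sketched as follows: each frame's acoustic features are augmented with a per-utterance noise estimate. The estimator below (a mean over assumed leading non-speech frames) and all sizes are assumptions for illustration; the paper's own estimator may differ.

```python
# Sketch of a noise-aware training input: append an utterance-level noise
# estimate to every frame's features, so the network can condition on the
# environment. The leading-frame estimator is an illustrative assumption.
import numpy as np

def add_noise_estimate(feats, n_noise_frames=10):
    """feats: (T, D) filterbank features; returns (T, 2D) noise-aware input."""
    noise = feats[:n_noise_frames].mean(axis=0)      # crude noise estimate
    return np.hstack([feats, np.tile(noise, (feats.shape[0], 1))])

utt = np.random.randn(300, 40)                       # dummy 3 s utterance
print(add_noise_estimate(utt).shape)                 # (300, 80)
```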

690 citations


"Investigation of different acoustic..." refers background in this paper

  • ...This is due to the noise-robust characteristics of DNN models [4]....


Journal ArticleDOI
TL;DR: A new approach to speech recognition, in which all Hidden Markov Model states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state, appears to give better results than a conventional model.

304 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...SGMM [3] is one such technique to model low-resource data efficiently by estimating only a small number of parameters....


  • ...In SGMM, the shared parameters are projected using state specific subspace vectors to obtain state specific GMM parameters [3]....

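The SGMM projection quoted above can be written compactly. In the standard SGMM formulation (Povey et al.; not quoted from this paper), each HMM state j has a low-dimensional subspace vector v_j, and globally shared projections M_i and w_i map it to the state-specific GMM means and mixture weights:

```latex
\mu_{ji} = M_i \, v_j, \qquad
w_{ji} = \frac{\exp\!\left(w_i^{\top} v_j\right)}
              {\sum_{i'=1}^{I} \exp\!\left(w_{i'}^{\top} v_j\right)}
```

Only the vectors v_j are estimated per state, which is why SGMM needs far fewer state-specific parameters than a conventional GMM-HMM and models low-resource data efficiently.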