Proceedings ArticleDOI

Investigation of different acoustic modeling techniques for low resource Indian language data

TL;DR: This paper proposes to train a DNN containing a bottleneck layer in two stages, which shows improved performance compared to baseline SGMM and DNN models for limited training data.
Abstract: In this paper, we investigate the performance of the deep neural network (DNN) and the subspace Gaussian mixture model (SGMM) under low-resource conditions. Even though the DNN outperforms SGMM and continuous density hidden Markov models (CDHMM) when training data are plentiful, its performance degrades when modeling low-resource data. Our experimental results show that SGMM outperforms DNN for limited transcribed data. To resolve this problem, we propose to train a DNN containing a bottleneck layer in two stages: the first stage extracts bottleneck features; in the second stage, the extracted bottleneck features are used to train a second DNN that also has a bottleneck layer. All our experiments are performed on two Indian languages (Tamil and Hindi) from the Mandi database. Our proposed method shows improved performance compared to the baseline SGMM and DNN models for limited training data.
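As a concrete illustration of the two-stage scheme, the sketch below builds a bottleneck DNN twice: stage one trains it on the raw acoustic features and keeps only its bottleneck outputs, and stage two trains a second bottleneck DNN on those extracted features. This is a minimal PyTorch sketch; all layer sizes, activations, and training details are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal PyTorch sketch of the two-stage bottleneck-DNN idea described
# above. All dimensions and hyperparameters are illustrative assumptions,
# not the values used in the paper.
import torch
import torch.nn as nn

FEAT_DIM, BN_DIM, HID, NUM_STATES = 440, 40, 1024, 2000  # assumed sizes

class BottleneckDNN(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, HID), nn.Sigmoid(),
            nn.Linear(HID, BN_DIM),          # bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(BN_DIM, HID), nn.Sigmoid(),
            nn.Linear(HID, NUM_STATES),      # HMM-state (senone) outputs
        )

    def forward(self, x):
        bn = self.encoder(x)                 # bottleneck activations
        return self.classifier(bn), bn

# Stage 1: train a bottleneck DNN on the raw acoustic features, then use
# its encoder purely as a feature extractor.
stage1 = BottleneckDNN(FEAT_DIM)
# ... train stage1 with cross-entropy against HMM-state targets ...
with torch.no_grad():
    _, bn_feats = stage1(torch.randn(8, FEAT_DIM))  # dummy batch

# Stage 2: train a second DNN, itself containing a bottleneck layer,
# on the features extracted in stage 1.
stage2 = BottleneckDNN(BN_DIM)
logits, _ = stage2(bn_feats)
```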
Citations
Dissertation
01 Jul 2016
TL;DR: This research applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance, and proposed a new approach for visualisation of the AID feature space that is helpful in analysing AID recognition accuracies and confusion matrices.
Abstract: Accent is cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate is up to seven times greater for accented speech than for standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker to choose an accent-dependent model via an accent identification (AID) system, or building a model using the data from the N neighbouring speakers in AID space, results in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. In addition, we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model improves recognition accuracy considerably; even with two stages of accent adaptation followed by speaker adaptation, the GMM-HMM based system does not outperform the baseline DNN-HMM based system. For more contemporary DNN-HMM based ASR systems, we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space, which is helpful in analysing AID recognition accuracies and confusion matrices.
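The neighbouring-speaker idea can be sketched as a nearest-neighbour search in the AID embedding space. The snippet below is a toy illustration only; the embedding dimensionality, the Euclidean metric, and all data are assumptions, not the thesis's actual AID features.

```python
# Toy sketch: pick the N nearest training speakers to a test speaker in an
# accent-identification (AID) embedding space, as a basis for building an
# accent-matched acoustic model. Embeddings and metric are illustrative.
import numpy as np

def nearest_speakers(test_emb, train_embs, speaker_ids, n=10):
    """Return the IDs of the n training speakers closest to test_emb."""
    dists = np.linalg.norm(train_embs - test_emb, axis=1)  # Euclidean
    order = np.argsort(dists)[:n]
    return [speaker_ids[i] for i in order]

# Dummy data: 100 training speakers with 32-dimensional AID embeddings.
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(100, 32))
ids = [f"spk{i:03d}" for i in range(100)]
print(nearest_speakers(rng.normal(size=32), train_embs, ids, n=5))
```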

12 citations


Cites result from "Investigation of different acoustic..."

  • ...Recent studies showed that SGMM and DNN based systems provide similar results [168] and the LSTM RNN based systems provide the highest accuracy of all [167, 169, 170]....


Journal ArticleDOI
TL;DR: The results show that DNN is an effective acoustic modeling technique for the Amharic language, and that the hybrid acoustic units perform best under both sufficient and limited training datasets, achieving the highest accuracy.
Abstract: Multitask learning (MTL) is helpful for improving the performance of related tasks when the training dataset is limited and sparse, especially for low-resource languages. Amharic is a low-resource language and suffers from training data scarcity, sparsity, and unevenness. Consequently, speech recognizers based on its fundamental acoustic units perform worse than those of technologically favored languages. This paper presents our contributions on the use of various hybrid acoustic modeling units for the Amharic language. Deep neural network (DNN) models based on the fundamental acoustic units, namely syllable, phone, and rounded-phone units, have been developed. Various hybrid acoustic units have been investigated by jointly training the fundamental acoustic units via the MTL technique; these hybrid units are discussed and compared with the fundamental units. The experimental results demonstrate that all the fundamental units-based DNN models outperform Gaussian mixture models (GMM), with relative performance improvements of 14.14%-23.31%. All the hybrid units outperform the fundamental acoustic units, with relative performance improvements of 1.33%-4.27%, and are useful with both sufficient and limited training datasets. The syllable and phone units exhibit higher performance under sufficient and limited training datasets, respectively. Overall, our results show that DNN is an effective acoustic modeling technique for the Amharic language. The context-dependent (CD) syllable is the more suitable unit if a sufficient training corpus is available and recognizer accuracy is prioritized. The CD phone is superior if the available training dataset is limited, realizing the highest accuracy and fast recognition speed. The hybrid acoustic units perform best under both sufficient and limited training datasets and achieve the highest accuracy.
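A minimal sketch of the MTL setup described above: a shared trunk with separate classification heads for two acoustic unit inventories, trained with a joint loss. Layer sizes, unit counts, and the loss weighting are assumptions for illustration, not the paper's configuration.

```python
# Minimal PyTorch sketch of multitask acoustic modeling: a shared trunk
# with separate softmax heads for phone and syllable targets, trained
# jointly. All sizes and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn

class MTLAcousticModel(nn.Module):
    def __init__(self, in_dim=440, hid=1024, n_phones=2000, n_syls=3000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(),
            nn.Linear(hid, hid), nn.ReLU(),
        )
        self.phone_head = nn.Linear(hid, n_phones)   # phone-unit task
        self.syl_head = nn.Linear(hid, n_syls)       # syllable-unit task

    def forward(self, x):
        h = self.trunk(x)                            # shared representation
        return self.phone_head(h), self.syl_head(h)

model = MTLAcousticModel()
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 440)                        # dummy feature batch
phone_t = torch.randint(0, 2000, (8,))         # dummy phone targets
syl_t = torch.randint(0, 3000, (8,))           # dummy syllable targets
p_logits, s_logits = model(x)
loss = ce(p_logits, phone_t) + 0.5 * ce(s_logits, syl_t)  # weighted sum
loss.backward()
```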

9 citations

Journal ArticleDOI
TL;DR: This paper combines CNNs and conventional RNNs with gate, highway, and residual networks to mitigate these problems, and explores the optimal neural network structures and training strategies for the proposed models.
Abstract: Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition. Among these networks, the convolutional neural network (CNN) is effective at representing the local properties of speech formants. However, CNNs are not well suited to modeling the long-term context dependencies between speech signal frames. Recurrent neural networks (RNNs) have recently shown great ability to model such long-term dependencies, but their performance on low-resource speech recognition tasks is poor, sometimes even worse than that of conventional feed-forward neural networks, and they often overfit severely on the training corpus. This paper presents our contributions in combining CNNs and conventional RNNs with gate, highway, and residual networks to mitigate these problems. The optimal neural network structures and training strategies for the proposed models are explored. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited (10-hour) language packages of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for building better-performing acoustic models for low-resource speech recognition tasks.
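A minimal sketch of the kind of CNN/RNN combination described above: a convolutional front end for local spectral structure, a bidirectional GRU for long-term context, and a residual connection around the recurrent block. The exact architecture here is an assumption for illustration, not the paper's.

```python
# Illustrative PyTorch sketch: CNN front end + bidirectional GRU with a
# residual connection around the recurrent block. Sizes are assumptions.
import torch
import torch.nn as nn

class ConvBGRU(nn.Module):
    def __init__(self, feat_dim=40, hid=256, n_states=2000):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hid, kernel_size=5, padding=2)
        self.bgru = nn.GRU(hid, hid, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hid, hid)   # back to residual width
        self.out = nn.Linear(hid, n_states)

    def forward(self, x):                     # x: (batch, time, feat)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        r, _ = self.bgru(h)                   # long-term context
        h = h + self.proj(r)                  # residual connection
        return self.out(h)

logits = ConvBGRU()(torch.randn(4, 100, 40))  # -> (4, 100, 2000)
```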

8 citations

Proceedings Article
TL;DR: This study characterises a low-resourced (LR) Austrian German conversational task and shows that fine-tuning a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included.
Abstract: Conversational speech represents one of the most complex automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an LR Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training arises from the larger database rather than from the self-supervision itself. Further, using a leave-one-conversation-out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state of the art.
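A sketch of this kind of fine-tuning setup, using the Hugging Face transformers library as one possible implementation; the checkpoint name, vocabulary size, and training details are placeholders rather than the study's configuration.

```python
# Sketch of fine-tuning a self-supervised pre-trained model for ASR with
# the Hugging Face transformers library. The checkpoint, vocabulary size,
# and training details are placeholders, not the paper's setup.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",       # pre-trained via self-supervision
    vocab_size=32,                  # assumed target character-set size
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()      # keep the CNN feature encoder fixed

wave = torch.randn(1, 16000)        # 1 s of dummy 16 kHz audio
labels = torch.randint(1, 32, (1, 12))   # dummy character targets
loss = model(input_values=wave, labels=labels).loss  # CTC loss
loss.backward()
```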

2 citations

References
Proceedings Article
01 Jan 2011
TL;DR: This paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...The DNN training for all datasets is performed using the standard recipe as in Kaldi [8]....


  • ...These parameters are fixed as per Kaldi’s DNN implementation....


  • ...Dropout method is performed only during the DNN training phase and implemented using Kaldi+PDNN [9]....


  • ...The experiments are performed using Kaldi Speech recognition toolkit [8]....

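The dropout excerpt above reflects standard practice: dropout is active only while training and is disabled at decoding time. A minimal PyTorch sketch of that behaviour (the Kaldi+PDNN implementation itself is not shown here):

```python
# Dropout is applied only in the training phase; at test/decoding time it
# is a no-op. PyTorch switches this via the module's train/eval mode.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.2))
x = torch.ones(1, 10)

layer.train()                 # training phase: units dropped at random
print(layer(x))
layer.eval()                  # decoding phase: dropout disabled
print(layer(x))
```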

ReportDOI
01 Feb 1993

1,238 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...The TIMIT database [10] is recorded on eight principal dialects of American English for 490 speakers with a sampling rate of 16 kHz....


Proceedings ArticleDOI
26 May 2013
TL;DR: DNN-based acoustic models can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation, and can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training.
Abstract: Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.
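The noise-aware training input can be sketched as follows: each frame's acoustic features are augmented with a per-utterance noise estimate. The estimator below (a mean over assumed leading non-speech frames) and all sizes are assumptions for illustration; the paper's own estimator may differ.

```python
# Sketch of a noise-aware training input: append an utterance-level noise
# estimate to every frame's features, so the network can condition on the
# environment. The leading-frame estimator is an illustrative assumption.
import numpy as np

def add_noise_estimate(feats, n_noise_frames=10):
    """feats: (T, D) filterbank features; returns (T, 2D) noise-aware input."""
    noise = feats[:n_noise_frames].mean(axis=0)      # crude noise estimate
    return np.hstack([feats, np.tile(noise, (feats.shape[0], 1))])

utt = np.random.randn(300, 40)                       # dummy 3 s utterance
print(add_noise_estimate(utt).shape)                 # (300, 80)
```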

690 citations


"Investigation of different acoustic..." refers background in this paper

  • ...This is due to the noise-robust characteristics of DNN models [4]....


Journal ArticleDOI
TL;DR: A new approach to speech recognition, in which all Hidden Markov Model states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state, appears to give better results than a conventional model.

304 citations


"Investigation of different acoustic..." refers methods in this paper

  • ...SGMM [3] is one such technique to model low-resource data efficiently by estimating only a small number of parameters....


  • ...In SGMM, the shared parameters are projected using state specific subspace vectors to obtain state specific GMM parameters [3]....

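The SGMM projection quoted above can be written compactly. In the standard SGMM formulation (Povey et al.; not quoted from this paper), each HMM state j has a low-dimensional subspace vector v_j, and globally shared projections M_i and w_i map it to the state-specific GMM means and mixture weights:

```latex
\mu_{ji} = M_i \, v_j, \qquad
w_{ji} = \frac{\exp\!\left(w_i^{\top} v_j\right)}
              {\sum_{i'=1}^{I} \exp\!\left(w_{i'}^{\top} v_j\right)}
```

Only the vectors v_j are estimated per state, which is why SGMM needs far fewer state-specific parameters than a conventional GMM-HMM and models low-resource data efficiently.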