Proceedings ArticleDOI

Cross-lingual acoustic modeling for Indian languages based on Subspace Gaussian Mixture Models

TL;DR: It is observed that the word accuracy of the cross-lingual acoustic model for Bengali was approximately 2.5% above that of its CDHMM model, and equivalent to that of its monolingual SGMM model.
Abstract: Cross-lingual acoustic modeling using Subspace Gaussian Mixture Models (SGMMs) for low-resource languages of Indian origin is investigated. The focus is on building an acoustic model for a low-resource language with a limited vocabulary by leveraging resources from another language with comparatively larger resources. Experiments were done on the Bengali and Tamil corpora of the MANDI database, with Tamil having greater resources than Bengali. We observed that the word accuracy of the cross-lingual acoustic model for Bengali was approximately 2.5% above that of its CDHMM model, and equivalent to that of its monolingual SGMM model.
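
For context, the SGMM parameterization (following the formulation in [3]) that underpins this cross-lingual sharing: each tied state j is represented by low-dimensional state vectors, while the expensive parameters are global and can therefore be borrowed across languages. In the standard notation,

    p(x \mid j) = \sum_m c_{jm} \sum_{i=1}^{I} w_{jmi} \, \mathcal{N}(x;\, \mu_{jmi}, \Sigma_i)
    \mu_{jmi} = M_i v_{jm}
    w_{jmi} = \exp(w_i^{T} v_{jm}) \Big/ \sum_{i'} \exp(w_{i'}^{T} v_{jm})

The subspace matrices M_i, weight projections w_i and covariances \Sigma_i are globally shared (and, in the cross-lingual setting, taken from the resource-rich language); only the state vectors v_{jm} and mixture weights c_{jm} are estimated per language.
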
Citations
Journal ArticleDOI
TL;DR: In this article, signal-processing cues such as short-term energy (STE) and sub-band spectral flux (SBSF) are used in tandem with HMM-based forced alignment for automatic speech segmentation.

8 citations

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The experimental result showed that the acoustic model with the shared phoneme set improved ASR performance for a few languages in comparison with the language-specific ASR system in which language identification was perfectly performed.
Abstract: This paper proposes a new acoustic modeling method for the automatic speech recognition (ASR) of data in which multilingual utterances are mixed, without using any language identification technologies. To perform ASR of an unknown-language utterance, first, language identification is performed to determine the language. Then, a language-specific ASR system is used to recognize the utterance. Our proposed method does not train language-specific acoustic models, but instead trains a single acoustic model that can recognize utterances spoken in any of the covered languages. To realize multilingual acoustic modeling, we create a new phoneme set by sharing a part of the language-specific phonemes with the other languages. The shared phoneme set effectively increases the amount of training data per phoneme. Therefore, the acoustic model with the shared phoneme set can perform ASR for a minor (low-resource) language utterance. The experimental results showed that the acoustic model with the shared phoneme set improved ASR performance for a few languages in comparison with the language-specific ASR system in which language identification was performed perfectly.
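
To make the shared-phoneme-set idea concrete, here is a minimal Python sketch; the inventories and the merge rule are illustrative assumptions, not the paper's actual phoneme tables:

    from collections import Counter

    # Hypothetical language-specific phoneme inventories.
    inventories = {
        "lang1": ["a", "i", "u", "k", "s"],
        "lang2": ["a", "i", "k", "s", "th"],
    }

    # Phones occurring in more than one inventory become shared symbols,
    # so their training data pools across languages.
    counts = Counter(p for inv in inventories.values() for p in set(inv))
    shared = {p for p, c in counts.items() if c > 1}

    def map_phone(lang, phone):
        # Shared phones keep one symbol; rare phones stay language-tagged.
        return phone if phone in shared else f"{phone}_{lang}"

    for lang, inv in inventories.items():
        print(lang, [map_phone(lang, p) for p in inv])
    # lang1 ['a', 'i', 'u_lang1', 'k', 's']
    # lang2 ['a', 'i', 'k', 's', 'th_lang2']

Under such a merge, one set of acoustic states per shared phone is trained on data from all languages, which is what lets the low-resource language benefit.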

7 citations


Cites background from "Cross-lingual acoustic modeling for..."

  • ...Some studies [6], [7] have shown the feasibility of utilizing the speech data of another language for acoustic modeling....


Proceedings ArticleDOI
01 Dec 2014
TL;DR: This paper proposes multiple cross-lingual techniques to address the problem of data insufficiency in acoustic modeling of three Indian languages namely Bengali, Hindi and Tamil, and uses the principles of phone-cluster adaptive training (phone-CAT) to map phonemes between two languages.
Abstract: One of the major problems in acoustic modeling for a low-resource language is data sparsity. In recent years, cross-lingual acoustic modeling techniques have been employed to overcome this problem. In this paper we propose multiple cross-lingual techniques to address the problem of data insufficiency. The first method, which we call the cross-lingual phone-CAT, uses the principles of phone-cluster adaptive training (phone-CAT), where the parameters of context-dependent states are obtained by linear interpolation of monophone cluster models. The second method uses the interpolation vectors of phone-CAT, which are known to capture phonetic context information, to map phonemes between two languages. Finally, the data-driven phoneme-mapping technique is incorporated into the cross-lingual phone-CAT, to obtain what we call the phoneme-mapped cross-lingual phone-CAT. The proposed techniques are employed in acoustic modeling of three Indian languages, namely Bengali, Hindi and Tamil. The phoneme-mapped cross-lingual phone-CAT gave relative improvements of 15.14% for Bengali, 1.64% for Hindi and 1.13% for Tamil over the conventional cross-lingual subspace Gaussian mixture model (SGMM) in the low-resource scenario.
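
The data-driven phoneme-mapping step can be pictured as a nearest-neighbour search in the space of phone-CAT interpolation vectors. The sketch below is our reading of that idea, with random vectors standing in for trained ones and cosine similarity as an assumed matching criterion:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical interpolation vectors over monophone clusters;
    # real ones come out of phone-CAT training.
    src_vecs = {p: rng.random(40) for p in ["a", "k", "s", "t"]}  # source language
    tgt_vecs = {p: rng.random(40) for p in ["aa", "kh", "sh"]}    # target language

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Map each target phoneme to its most similar source phoneme.
    mapping = {t: max(src_vecs, key=lambda s: cosine(tv, src_vecs[s]))
               for t, tv in tgt_vecs.items()}
    print(mapping)  # with trained vectors, the map follows phonetic similarity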

6 citations


Cites background or methods from "Cross-lingual acoustic modeling for..."

  • ..., Tamil, Hindi and Bengali, from the Mandi database [18, 19], were selected for our experiments....


  • ...“Speech-based access to agricultural commodity prices” is a Government of India initiative to build an ASR system for farmers to access the prices of agricultural commodities in various districts across India using mobile phones [18, 19]....


Proceedings ArticleDOI
01 Mar 2016
TL;DR: This paper investigates methods to improve the recognition performance of low-resource languages with limited training data by borrowing subspace parameters from a high-resource language in the subspace Gaussian mixture model (SGMM) framework, and obtains consistent improvements over the conventional monolingual SGMM of the low-resource language.
Abstract: In this paper, we investigate methods to improve the recognition performance of low-resource languages with limited training data by borrowing subspace parameters from a high-resource language in the subspace Gaussian mixture model (SGMM) framework. As a first step, only the state-specific vectors are updated using the low-resource language, while retaining all the globally shared parameters from the high-resource language. This approach gave improvements only in some cases. However, when both the state-specific and the weight projection vectors are re-estimated with the low-resource language, we get consistent improvement in performance over the conventional monolingual SGMM of the low-resource language. Further, we conducted experiments to investigate the effect of the different shared parameters on the acoustic model built using the proposed method. Experiments were done on the Tamil, Hindi and Bengali corpora of the MANDI database. Relative improvements of 16.17% for Tamil, 13.74% for Hindi and 12.5% for Bengali over the respective monolingual SGMMs were obtained.
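
Schematically, the borrowing works as follows; a minimal numpy sketch with assumed shapes (the variable names are ours, not Kaldi's API):

    import numpy as np

    # Illustrative sizes: I shared Gaussians, D-dim features,
    # S-dim subspace, J_low tied states in the low-resource language.
    I, D, S, J_low = 400, 39, 40, 500
    rng = np.random.default_rng(0)

    # Globally shared parameters, trained on the HIGH-resource language:
    M = rng.standard_normal((I, D, S))   # subspace matrices M_i: borrowed, kept fixed
    w = rng.standard_normal((I, S))      # weight projections w_i: borrowed, then re-estimated
    # (the covariances Sigma_i and the UBM are borrowed the same way)

    # State-specific vectors, re-estimated on the LOW-resource data:
    v = rng.standard_normal((J_low, S))  # one v_j per tied state

    # A low-resource state's Gaussian means live in the borrowed subspace:
    mu_0 = M @ v[0]                      # (I, D): means for state 0
    wts_0 = np.exp(w @ v[0])
    wts_0 /= wts_0.sum()                 # softmax over the shared Gaussians
    print(mu_0.shape, round(wts_0.sum(), 6))

The paper's finding, in these terms, is that re-estimating both v and the weight projections w on the low-resource language works consistently better than re-estimating v alone.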

1 citation

References
Proceedings Article
01 Jan 2011
TL;DR: The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"Cross-lingual acoustic modeling for..." refers methods in this paper

  • ...The open-source Kaldi software [9] was used for building both CDHMM and SGMM acoustic models....


Journal ArticleDOI
TL;DR: It is shown that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types.
Abstract: We propose a new approach to the problem of estimating the hyperparameters which define the interspeaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10%-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.
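
For reference, the JFA model whose hyperparameters are being estimated decomposes a speaker- and channel-dependent GMM mean supervector, in the standard formulation, as

    M = m + V y + U x + D z

where m is the speaker- and channel-independent supervector, V spans the speaker subspace (eigenvoices) with speaker factors y, U spans the channel subspace (eigenchannels) with channel factors x, and D is a diagonal residual term with factors z; the hyperparameters V, U and D are what the proposed technique estimates.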

671 citations


"Cross-lingual acoustic modeling for..." refers methods in this paper

  • ...This method has similarities to Joint Factor Analysis (JFA) [5] used in speaker recognition and Eigenvoices [6] and Cluster Adaptive Training (CAT) [7] proposed for speech recognition....


Journal ArticleDOI
TL;DR: A new model-based speaker adaptation algorithm called the eigenvoice approach, which constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data.
Abstract: This paper describes a new model-based speaker adaptation algorithm called the eigenvoice approach. The approach constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data. These "eigenvoice" basis vectors are orthogonal to each other and guaranteed to represent the most important components of variation between the reference speakers. Experimental results for a small-vocabulary task (letter recognition) given in the paper show that the approach yields major improvements in performance for tiny amounts of adaptation data. For instance, we obtained 16% relative improvement in error rate with one letter of supervised adaptation data, and 26% relative improvement with four letters of supervised adaptation data. After a comparison of the eigenvoice approach with other speaker adaptation algorithms, the paper concludes with a discussion of future work.
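
In symbols, the eigenvoice constraint restricts the adapted mean supervector to a K-dimensional subspace (written here with an explicit mean-offset term; exact parameterizations vary),

    \hat{\mu} = \bar{\mu} + \sum_{k=1}^{K} w_k e_k

so only the K weights w_k need to be estimated from adaptation data by maximum likelihood, with the basis vectors e_k obtained offline from the reference speakers. This is why very small amounts of adaptation data, such as a single letter, already help.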

554 citations

Journal ArticleDOI
TL;DR: Different methods for multilingual acoustic model combination and a polyphone decision tree specialization procedure are introduced for estimating acoustic models for a new target language using speech data from varied source languages, but only limited data from the target language.

427 citations


"Cross-lingual acoustic modeling for..." refers background in this paper

  • ...In [1], pooling of phones and data from different languages to build a multilingual acoustic model was proposed....


Journal ArticleDOI
TL;DR: A new approach to speech recognition, in which all Hidden Markov Model states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state, appears to give better results than a conventional model.

304 citations


"Cross-lingual acoustic modeling for..." refers background in this paper

  • ...But for large-vocabulary ASR systems, the number of tied states J increases, and in those cases CDHMM systems will have more parameters than their SGMM counterparts [3].


  • ...working principles of SGMM refer to [3]....

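To see why the parameter counts compare this way, a back-of-the-envelope calculation (illustrative numbers, not the paper's configuration): a diagonal-covariance CDHMM stores means and variances for every Gaussian of every tied state, while an SGMM adds only a small S-dimensional vector per state on top of its shared parameters.

    # Rough parameter counts for the claim quoted above.
    D, S, I = 39, 40, 400            # feature dim, subspace dim, shared Gaussians
    J, M_per_state = 4000, 16        # tied states, Gaussians per CDHMM state

    cdhmm = J * M_per_state * 2 * D                   # diagonal-cov means + variances
    sgmm_shared = I * (D * S + S + D * (D + 1) // 2)  # M_i, w_i, full Sigma_i
    sgmm = sgmm_shared + J * S                        # plus one v_j per tied state
    print(f"CDHMM ~{cdhmm:,}  SGMM ~{sgmm:,}")        # CDHMM ~4,992,000  SGMM ~1,112,000

The CDHMM count grows as J times the full per-state Gaussian size, while the SGMM adds only S parameters per extra tied state, so for large-vocabulary systems the CDHMM side dominates, as the quoted passage from [3] observes.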