Proceedings ArticleDOI

Language identification using shifted delta cepstra

04 Aug 2002-Vol. 3
TL;DR: A method for finding the optimal parameters for identifying a set of languages, specifies these parameters for a language identification task, and provides a performance comparison.
Abstract: A variety of speech identification technologies currently use Gaussian mixture models. Until recently, however, they were considered inferior to parallel phone recognition language modeling for identifying the language of a speaker. Experiments in the last year have shown that Gaussian mixture models can provide high-performance language identification when shifted delta cepstra are used as the feature set. Not only can they achieve comparable or even superior performance to parallel phone recognition language modeling, Gaussian mixture models also require less computation. Performance can be further improved by altering the shifted delta cepstra parameters and the number of mixtures. The optimal parameter set varies depending on the languages to be identified. This paper describes a method for finding the optimal parameters for identifying a set of languages, specifies these parameters for a language identification task, and provides a performance comparison.
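Shifted delta cepstra stack several delta-cepstral blocks taken at spaced frame offsets into one long-span feature vector, conventionally described by the four parameters N-d-P-k (cepstral order, delta spread, block shift, number of blocks). The following is a minimal sketch of that stacking, assuming the per-frame cepstra are already available as a NumPy array; it illustrates the standard construction and is not code from the paper.

import numpy as np

def shifted_delta_cepstra(cepstra, N=7, d=1, P=3, k=7):
    """Stack k delta-cepstral blocks, each computed with spread d and
    separated by a shift of P frames (the usual N-d-P-k convention).

    cepstra : array of shape (num_frames, >= N); only the first N
              coefficients per frame are used.
    Returns an array of shape (num_frames, k * N).
    """
    c = np.asarray(cepstra)[:, :N]
    T = c.shape[0]
    # Pad at both ends so every frame has the neighbours it needs.
    pad = d + (k - 1) * P
    padded = np.pad(c, ((pad, pad), (0, 0)), mode="edge")
    sdc = np.empty((T, k * N))
    for t in range(T):
        blocks = []
        for i in range(k):
            centre = t + pad + i * P
            blocks.append(padded[centre + d] - padded[centre - d])
        sdc[t] = np.concatenate(blocks)
    return sdc

# Example: 200 frames of 13-dimensional cepstra -> 200 x 49 SDC features.
if __name__ == "__main__":
    frames = np.random.randn(200, 13)
    print(shifted_delta_cepstra(frames).shape)  # (200, 49)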
Citations
Proceedings Article
01 Jan 2002
TL;DR: Two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems are described.
Abstract: Published results indicate that automatic language identification (LID) systems that rely on multiple-language phone recognition and n-gram language modeling produce the best performance in formal LID evaluations. By contrast, Gaussian mixture model (GMM) systems, which measure acoustic characteristics, are far more efficient computationally but have tended to provide inferior levels of performance. This paper describes two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems. The approaches include both acoustic scoring and a recently developed GMM tokenization system that is based on a variation of phonetic recognition and language modeling. System performance is evaluated on both the CallFriend and OGI corpora.
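The acoustic-scoring approach described above can be sketched as follows: one Gaussian mixture model is trained per language on SDC features, and a test utterance is assigned to the language whose model gives the highest average frame log-likelihood. This minimal sketch uses scikit-learn's GaussianMixture as a stand-in for the systems evaluated in the paper; real systems use far larger models and corpora.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_gmms(features_by_language, n_components=64):
    """Fit one diagonal-covariance GMM per language on its training features."""
    return {
        lang: GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(feats)
        for lang, feats in features_by_language.items()
    }

def identify_language(gmms, utterance_features):
    """Return the language whose GMM gives the highest average
    frame log-likelihood for the utterance."""
    scores = {lang: gmm.score(utterance_features) for lang, gmm in gmms.items()}
    return max(scores, key=scores.get)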

459 citations

Journal ArticleDOI
TL;DR: This tutorial presents an overview of the progression of spoken language identification (LID) systems and current developments, and evaluates LID systems using NIST language recognition evaluation tasks.
Abstract: This tutorial presents an overview of the progression of spoken language identification (LID) systems and current developments. The introduction provides a background on automatic language identification systems using syntactic, morphological, and in particular, acoustic, phonetic, phonotactic and prosodic level information. Different frontend features that are used in LID systems are presented. Several normalization and language modelling techniques have also been presented. We also discuss different LID system architectures that embrace a variety of front-ends and back-ends, and configurations such as hierarchical and fusion classifiers. Evaluations of the LID system are presented using NIST language recognition evaluation tasks.

127 citations


Cites methods from "Language identification using shift..."

  • ...In [32], initial investigations are made to assess the viability of an automated technique for determining the optimal parameters using hill-climbing algorithms....

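The hill-climbing idea referred to in the quote above can be sketched as a greedy search over the (N, d, P, k) configuration space. The evaluate_accuracy callable below is a hypothetical stand-in for a full train-and-evaluate cycle on held-out data, and the neighbourhood definition is an assumption, not taken from the cited paper.

import itertools

def hill_climb_sdc(evaluate_accuracy, start=(7, 1, 3, 7), max_iters=20):
    """Greedy hill climbing over the SDC parameters (N, d, P, k).

    evaluate_accuracy : callable mapping an (N, d, P, k) tuple to a
        LID accuracy on held-out data (hypothetical; supplied by the caller).
    """
    current = start
    best_score = evaluate_accuracy(current)
    for _ in range(max_iters):
        # Neighbours differ from the current point by +/- 1 in one parameter.
        neighbours = []
        for idx, step in itertools.product(range(4), (-1, 1)):
            cand = list(current)
            cand[idx] += step
            if all(v >= 1 for v in cand):
                neighbours.append(tuple(cand))
        scored = [(evaluate_accuracy(n), n) for n in neighbours]
        top_score, top = max(scored)
        if top_score <= best_score:
            break  # local optimum reached
        best_score, current = top_score, top
    return current, best_score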

Proceedings ArticleDOI
12 May 2008
TL;DR: This work proposes a novel automatic, speaker-independent classification approach to monitor, in real-time, the person's cognitive load level by using speech features, and achieves high accuracy on two different tasks, each of which has three discrete cognitive load levels.
Abstract: Monitoring cognitive load is important for the prevention of errors in task-critical operations, and for the development of adaptive user interfaces, to maintain productivity and efficiency in work performance. Speech, as an objective and non-intrusive measure, is a suitable method for monitoring cognitive load. Existing approaches for cognitive load monitoring are limited to speaker-dependent recognition and require manually labeled data. We propose a novel automatic, speaker-independent classification approach to monitor, in real-time, the person's cognitive load level by using speech features. In this approach, a Gaussian mixture model (GMM) based classifier is created with unsupervised training. Channel and speaker normalization are deployed to improve robustness. Different delta techniques are investigated for capturing temporal information, and a background model is introduced to reduce the impact of insufficient training data. The final system achieves 71.1% and 77.5% accuracy on two different tasks, each of which has three discrete cognitive load levels. This performance shows great potential for real-world applications.
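The channel and speaker normalization mentioned above is commonly realized as per-utterance cepstral mean and variance normalization (CMVN); whether the cited system uses exactly this variant is not stated, so the following is only a minimal sketch of the usual technique.

import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    Subtracting the utterance mean removes stationary channel effects;
    dividing by the standard deviation reduces speaker-level scale differences.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)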

83 citations


Cites background from "Language identification using shift..."

  • ...By capturing extra long-term feature patterns, Shifted Delta Coefficients (SDC) [6] have been reported to be superior to acceleration and delta in speech recognition tasks....


  • ...To capture the temporal information of features, three different approaches are investigated in this paper: Delta cepstrum, Acceleration (delta-delta) and Shifted Delta Cepstra....


  • ...The best-performing configuration utilized Shifted Delta Coefficients, channel and speaker normalization, and background model....

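The three temporal-feature variants compared in the quotes above differ only in how neighbouring frames are combined (shifted delta cepstra are sketched earlier on this page). Below is a minimal sketch of first-order delta and acceleration (delta-delta) coefficients using the common regression formula over a window of +/- d frames; the window size is an assumption, not the exact configuration of the cited paper.

import numpy as np

def delta(features, d=2):
    """First-order delta coefficients using the usual regression formula
    over a window of +/- d frames."""
    feats = np.pad(features, ((d, d), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, d + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, d + 1):
        out += n * (feats[d + n: d + n + len(features)]
                    - feats[d - n: d - n + len(features)])
    return out / denom

def acceleration(features, d=2):
    """Second-order (delta-delta) coefficients: deltas of the deltas."""
    return delta(delta(features, d), d)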

Journal ArticleDOI
TL;DR: This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR) and concludes that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches.
Abstract: This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained with several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained with languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task.
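The score-level fusion referred to above can be as simple as a weighted combination of the per-language scores produced by the individual systems; in practice the weights are usually trained with logistic regression on a development set. A minimal, hypothetical sketch:

import numpy as np

def fuse_scores(system_scores, weights=None):
    """Weighted score-level fusion.

    system_scores : list of arrays, each of shape (num_languages,),
        one score vector per subsystem for the same test utterance.
    Returns the index of the winning language.
    """
    scores = np.stack(system_scores)
    if weights is None:
        weights = np.ones(len(system_scores)) / len(system_scores)
    fused = weights @ scores
    return int(np.argmax(fused))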

81 citations


Cites background from "Language identification using shift..."

  • ...For example, for the three-second condition, the best reported Cavg results we could find for this task were 9.71, for a DBF-based system [10], and 10.2, for a fusion of 15 different systems [35]....


Journal ArticleDOI
01 Jul 2014-PLOS ONE
TL;DR: This work proposes using Deep Bottleneck Features (DBF) for spoken LID, and shows that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability.
Abstract: A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed.
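The equal error rate (EER) quoted above is the operating point at which the false-rejection and false-acceptance rates coincide. Below is a minimal sketch of estimating it with a simple threshold sweep over target and non-target scores; it is an illustration, not the official NIST scoring procedure.

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores
    and returning the point where false rejection and false acceptance
    are closest to equal."""
    targets = np.asarray(target_scores)
    nontargets = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([targets, nontargets]))
    eer, gap = 1.0, np.inf
    for thr in thresholds:
        frr = np.mean(targets < thr)        # true trials wrongly rejected
        far = np.mean(nontargets >= thr)    # impostor trials wrongly accepted
        if abs(frr - far) < gap:
            gap, eer = abs(frr - far), (frr + far) / 2
    return eer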

71 citations

References
Journal ArticleDOI
TL;DR: Four approaches for automatic language identification of speech utterances are compared: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR).
Abstract: We have compared the performance of four approaches for automatic language identification of speech utterances: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR). These approaches, which span a wide range of training requirements and levels of recognition complexity, were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. Systems containing phone recognizers performed better than the simpler GMM classifier. The top-performing system was parallel PRLM, which exhibited an error rate of 2% for 45-s utterances and 5% for 10-s utterances in two-language, closed-set, forced-choice classification. The error rate for 11-language, closed-set, forced-choice classification was 11% for 45-s utterances and 21% for 10-s utterances.
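The PRLM idea compared above can be illustrated with a toy bigram model over decoded phone sequences: a single phone recognizer tokenizes the utterance, one n-gram model per language scores the resulting phone string, and the best-scoring language wins. The sketch below assumes the phone strings are already available and uses add-one smoothing; it is a hypothetical illustration, not the system evaluated in the paper.

from collections import Counter
import math

def train_bigram_lm(phone_sequences):
    """Train an add-one-smoothed bigram model over phone labels."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in phone_sequences:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams, len(vocab)

def score(seq, lm):
    """Log-probability of a decoded phone sequence under a bigram model."""
    bigrams, unigrams, V = lm
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(seq, seq[1:]))

def identify(seq, lms_by_language):
    """PRLM decision: pick the language whose model scores the phone string best."""
    return max(lms_by_language, key=lambda lang: score(seq, lms_by_language[lang]))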

710 citations

Proceedings Article
01 Jan 2002
TL;DR: Two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems are described.
Abstract: Published results indicate that automatic language identification (LID) systems that rely on multiple-language phone recognition and n-gram language modeling produce the best performance in formal LID evaluations. By contrast, Gaussian mixture model (GMM) systems, which measure acoustic characteristics, are far more efficient computationally but have tended to provide inferior levels of performance. This paper describes two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems. The approaches include both acoustic scoring and a recently developed GMM tokenization system that is based on a variation of phonetic recognition and language modeling. System performance is evaluated on both the CallFriend and OGI corpora.

459 citations