Proceedings ArticleDOI

Brno University of Technology System for NIST 2005 Language Recognition Evaluation

TL;DR: The language identification (LID) system developed by the Speech@FIT group at Brno University of Technology (BUT) for the NIST 2005 Language Recognition Evaluation is presented, along with a discussion of performance on the LRE 2005 recognition task.
Abstract: This paper presents the language identification (LID) system developed by the Speech@FIT group at Brno University of Technology (BUT) for the NIST 2005 Language Recognition Evaluation. The system consists of two parts: phonotactic and acoustic. The phonotactic system is based on hybrid phoneme recognizers trained on the SpeechDat-E database. Phoneme lattices are used to train and test phonotactic language models, and a further improvement is obtained by using anti-models. The acoustic system is based on GMM modeling trained under the Maximum Mutual Information framework. We describe both parts and discuss performance on the LRE 2005 recognition task.
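The phonotactic branch follows the PRLM idea: a phone recognizer tokenizes the utterance, and a per-language phone n-gram model scores the token stream. Below is a minimal Python sketch of that scoring step only; add-one smoothing stands in for the paper's Witten-Bell/SRILM setup, 1-best phone strings stand in for the lattices, and all names and toy data are illustrative.

```python
from collections import Counter
import math

def train_phonotactic_lm(phone_seqs, n=3):
    """Count phone n-grams and their (n-1)-gram histories for one language."""
    ngrams, hists = Counter(), Counter()
    for seq in phone_seqs:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(len(padded) - n + 1):
            ngrams[tuple(padded[i:i + n])] += 1
            hists[tuple(padded[i:i + n - 1])] += 1
    return ngrams, hists

def log_likelihood(seq, lm, n=3, vocab_size=50):
    """Score a decoded phone string; add-one smoothing is a stand-in for
    the Witten-Bell discounting the paper uses via SRILM."""
    ngrams, hists = lm
    padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
    ll = 0.0
    for i in range(len(padded) - n + 1):
        g = tuple(padded[i:i + n])
        ll += math.log((ngrams[g] + 1) / (hists[g[:-1]] + vocab_size))
    return ll

# Toy usage: pick the language whose model scores the phone string highest.
train = {"en": [["h", "e", "l", "ou"]], "cz": [["a", "h", "o", "j"]]}
lms = {lang: train_phonotactic_lm(seqs) for lang, seqs in train.items()}
test = ["h", "e", "l", "ou"]
print(max(lms, key=lambda lang: log_likelihood(test, lms[lang])))  # "en"
```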
Citations
Journal ArticleDOI
TL;DR: It seems that the state-of-the-art LID system performs much better on the standard 12-class NIST 2003 Language Recognition Evaluation task or the two-class ethnic group recognition task than on the 14-class regional accent recognition task.

109 citations


Cites methods from "Brno University of Technology Syste..."

  • ...The result of the BUT fused system (row 11) was obtained by fusing phonotactic and acoustic systems [108], where the phonotactic system is very similar to our system (same phone recognizers) except they used phone lattice and antimodel technique....

    [...]

Journal ArticleDOI
TL;DR: This paper compares channel variability modeling in the usual Gaussian mixture model domain with the proposed feature-domain compensation technique, and shows that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data with a reduced computation cost.
Abstract: The variability of the channel and environment is one of the most important factors affecting the performance of text-independent speaker verification systems. The best techniques for channel compensation are model based. Most of them have been proposed for Gaussian mixture models, while in the feature domain blind channel compensation is usually performed. The aim of this work is to explore techniques that allow more accurate intersession compensation in the feature domain. Compensating the features rather than the models has the advantage that the transformed parameters can be used with models of a different nature and complexity and for different tasks. In this paper, we evaluate the effects of the compensation of the intersession variability obtained by means of the channel factors approach. In particular, we compare channel variability modeling in the usual Gaussian mixture model domain, and our proposed feature domain compensation technique. We show that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data with a reduced computation cost. We also report the results of a system, based on the intersession compensation technique in the feature space that was among the best participants in the NIST 2006 Speaker Recognition Evaluation. Moreover, we show how we obtained significant performance improvement in language recognition by estimating and compensating, in the feature domain, the distortions due to interspeaker variability within the same language.
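The heart of the feature-domain technique is subtracting a low-rank, posterior-weighted channel offset from every frame. A minimal numpy sketch of just that compensation step, assuming the per-utterance channel factor x and the loading matrix U have already been estimated against a UBM (that estimation is omitted; shapes and names are illustrative):

```python
import numpy as np

def compensate_features(feats, gammas, U, x):
    """Shift each frame by its posterior-weighted channel offset.

    feats:  (T, F) cepstral features of one utterance
    gammas: (T, C) per-frame occupation posteriors of the C UBM Gaussians
    U:      (C, F, R) low-rank channel loading blocks, one per Gaussian
    x:      (R,) channel factor estimated for this utterance
    """
    offsets = U @ x                  # (C, F) channel offset per Gaussian
    return feats - gammas @ offsets  # (T, F) compensated features

# Toy shapes only; real systems estimate gammas and x from a trained UBM.
T, F, C, R = 300, 13, 512, 20
rng = np.random.default_rng(0)
feats = rng.normal(size=(T, F))
gammas = rng.dirichlet(np.ones(C), size=T)
U = rng.normal(scale=0.01, size=(C, F, R))
x = rng.normal(size=R)
compensated = compensate_features(feats, gammas, U, x)
```

Because the correction lands on the features themselves, the same compensated frames can feed models of a different nature and complexity, which is exactly the flexibility the abstract argues for.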

97 citations

DOI
01 Jan 2011
TL;DR: This thesis describes a variety of approaches that make use of multiple streams of information in the acoustic signal to build a system that recognizes the regional dialect and accent of a speaker, and demonstrates how such a technology can be employed to improve Automatic Speech Recognition (ASR).
Abstract: A fundamental challenge for current research on speech science and technology is understanding and modeling individual variation in spoken language. Individuals have their own speaking styles, depending on many factors, such as their dialect and accent as well as their socioeconomic background. These individual differences typically introduce modeling difficulties for large-scale speaker-independent systems designed to process input from any variant of a given language. This dissertation focuses on automatically identifying the dialect or accent of a speaker given a sample of their speech, and demonstrates how such a technology can be employed to improve Automatic Speech Recognition (ASR). In this thesis, we describe a variety of approaches that make use of multiple streams of information in the acoustic signal to build a system that recognizes the regional dialect and accent of a speaker. In particular, we examine frame-based acoustic, phonetic, and phonotactic features, as well as high-level prosodic features, comparing generative and discriminative modeling techniques. We first analyze the effectiveness of approaches to language identification that have been successfully employed by that community, applying them here to dialect identification. We next show how we can improve upon these techniques. Finally, we introduce several novel modeling approaches: Discriminative Phonotactics and kernel-based methods. We test our best-performing approach on four broad Arabic dialects, ten Arabic sub-dialects, American English vs. Indian English accents, American English Southern vs. Non-Southern, American dialects at the state level plus Canada, and three Portuguese dialects. Our experiments demonstrate that our novel approach, which relies on the hypothesis that certain phones are realized differently across dialects, achieves new state-of-the-art performance on most dialect recognition tasks. This approach achieves an Equal Error Rate (EER) of 4% for four broad Arabic dialects, an EER of 6.3% for American vs. Indian English accents, 14.6% for American English Southern vs. Non-Southern dialects, and 7.9% for three Portuguese dialects. Our framework can also be used to automatically extract linguistic knowledge, specifically the context-dependent phonetic cues that may distinguish one dialect from another. We illustrate the efficacy of our approach by demonstrating the correlation of our results with the geographical proximity of the various dialects. As a final measure of the utility of our studies, we also show that it is possible to improve ASR. Employing our dialect identification system prior to ASR to identify the Levantine Arabic dialect in mixed speech of a variety of dialects allows us to optimize the engine's language model and use Levantine-specific acoustic models where appropriate. This procedure improves the Word Error Rate (WER) for Levantine by 4.6% absolute (9.3% relative). In addition, we demonstrate in this thesis that, using a linguistically motivated pronunciation modeling approach, we can improve the WER of a state-of-the-art ASR system by 2.2% absolute (11.5% relative) on Modern Standard Arabic.
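Since the thesis reports results as Equal Error Rates, a small self-contained sketch of how an EER is read off a set of detection scores may help (a plain threshold sweep; evaluation toolkits typically interpolate on the DET curve instead):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate equals
    the false-rejection rate. labels: 1 = target trial, 0 = non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptances
        frrs.append(np.mean(~accept[labels == 1]))  # false rejections
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))              # closest crossing
    return (fars[i] + frrs[i]) / 2

print(equal_error_rate([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 0.0
```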

95 citations


Cites methods from "Brno University of Technology Syste..."

  • ...Following [Matejka et al., 2006], they further discriminatively train the dialect GMMs with the Maximum Mutual Information (MMI) criterion, where the objective function is the posterior probability of correctly classifying all training utterances, using the extended Baum-Welch algorithm [Povey, 2004]....

    [...]

  • ...We use these posterior probabilities to represent the detection scores used in plotting DET curves, similar to [Matejka et al., 2006]; where D is the set of dialects of interest, p(Or|λDx) represents the likelihood of Or given the model λDx of dialect Dx, and τr normalizes duration differences across trials....

    [...]
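Both excerpts refer to equations that did not survive extraction. A plausible LaTeX reconstruction, using the textbook form of the MMI criterion and a duration-normalized posterior consistent with the surrounding description (the cited works' exact notation may differ):

```latex
% MMI objective: log posterior probability of the correct class c_r of
% each training utterance O_r, maximized with extended Baum-Welch.
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r} \log
    \frac{p(O_r \mid \lambda_{c_r})\, P(c_r)}
         {\sum_{c} p(O_r \mid \lambda_{c})\, P(c)}

% Duration-normalized posterior used as the detection score for dialect
% D_x on trial r; \tau_r normalizes duration differences across trials.
P(D_x \mid O_r)
  = \frac{p(O_r \mid \lambda_{D_x})^{1/\tau_r}}
         {\sum_{D \in \mathcal{D}} p(O_r \mid \lambda_{D})^{1/\tau_r}}
```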

Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE), which consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization.
Abstract: This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations, along with a discussion of the language-pair task. On the 30 second 23-language closed-set detection task, the system achieved a 1.64% average error rate.
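The submission is described as a fusion of three core recognizers. A common recipe for that step is linear logistic-regression fusion of the per-recognizer detection scores (FoCal-style); the sketch below runs on synthetic scores and is only an illustration, not the paper's actual fusion backend:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One detection score per recognizer per trial, stacked as columns.
rng = np.random.default_rng(1)
n_trials = 1000
labels = rng.integers(0, 2, n_trials)        # 1 = target-language trial
sep = np.array([1.5, 1.0, 0.8])              # per-recognizer separability
scores = labels[:, None] * sep + rng.normal(size=(n_trials, 3))

# Learn fusion weights and a bias on held-out trials, then use the
# resulting linear combination as the calibrated fused score.
fuser = LogisticRegression().fit(scores, labels)
fused = fuser.decision_function(scores)
print(fuser.coef_)                           # per-recognizer weights
```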

90 citations

Journal ArticleDOI
TL;DR: Experimental results on the AP17-OLR database demonstrate that the proposed end-to-end speech language identification (SLI) approach improves performance, especially on short utterances.
Abstract: Conversations in intelligent vehicles usually consist of short utterances. Because the durations of these utterances are small (e.g., less than 3 s), it is difficult to learn enough information to distinguish between languages. In this paper, we propose an end-to-end speech language identification (SLI) approach that is especially suitable for short-utterance language identification and is designed for the SLI application in intelligent vehicles. The approach is implemented with a long short-term memory (LSTM) neural network. The features used for LSTM learning are generated by a transfer learning method: the bottleneck features of a deep neural network, trained as a Mandarin acoustic-phonetic classifier, are used for the LSTM training. In order to improve SLI accuracy on short utterances, a phase-vocoder-based time-scale modification method is used to reduce or increase the speech rate of the test utterance. By concatenating the normal, rate-reduced, and rate-increased utterances, we extend the length of the test utterance and thereby improve the performance of the SLI system. Experimental results on the AP17-OLR database demonstrate that the proposed method improves SLI performance, especially on short utterances, and that the proposed system is robust in noisy vehicular environments.
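The utterance-extension trick is straightforward to sketch: time-stretch the test signal up and down with a phase vocoder and concatenate the three versions. The Python example below uses librosa's stretcher as a stand-in for the paper's phase-vocoder implementation, on synthetic audio with illustrative rate settings:

```python
import numpy as np
import librosa

def extend_utterance(y, slow=0.8, fast=1.2):
    """Concatenate the original signal with a slowed-down (longer) and a
    sped-up (shorter) time-stretched copy to lengthen a short utterance."""
    slowed = librosa.effects.time_stretch(y, rate=slow)  # rate < 1: longer
    sped = librosa.effects.time_stretch(y, rate=fast)    # rate > 1: shorter
    return np.concatenate([y, slowed, sped])

# Stand-in audio: a 3-second tone in place of a real short utterance.
sr = 16000
t = np.linspace(0, 3, 3 * sr, endpoint=False)
y = (0.1 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
extended = extend_utterance(y)
print(len(y) / sr, len(extended) / sr)  # ~3.0 s -> ~9.25 s
```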

81 citations


Cites methods from "Brno University of Technology Syste..."

  • ...Different variants of PRLM method based on parallel phone recognition and phone selection on multilingual phone set have been discussed in [2], [26], [27]....

    [...]

References
Proceedings Article
01 Jan 2002
TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Abstract: SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.

4,904 citations


"Brno University of Technology Syste..." refers methods in this paper

  • ...We used standard Witten-Bell discounting [12] implemented in the SRILM toolkit [13]....

    [...]

16 Sep 1995
TL;DR: The Fundamentals of HTK: General Principles of HMMs, Recognition and Viterbi Decoding, and Continuous Speech Recognition.
Abstract: 1 The Fundamentals of HTK: 1.1 General Principles of HMMs; 1.2 Isolated Word Recognition; 1.3 Output Probability Specification; 1.4 Baum-Welch Re-Estimation; 1.5 Recognition and Viterbi Decoding; 1.6 Continuous Speech Recognition; 1.7 Speaker Adaptation.
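As a concrete illustration of section 1.5 in the outline above, here is a minimal log-domain Viterbi decoder in Python (a toy sketch, not HTK code):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state sequence of a discrete-time HMM.

    log_A:  (S, S) log transition matrix
    log_B:  (T, S) per-frame state log-likelihoods
    log_pi: (S,)   log initial state distribution
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]               # best partial-path scores
    psi = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_A       # cand[i, j]: end in i, go to j
        psi[t] = np.argmax(cand, axis=0)    # best predecessor of each j
        delta = cand[psi[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]          # backtrace from the best end
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(np.max(delta))

# Toy two-state example.
log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_B = np.log([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])
log_pi = np.log([0.7, 0.3])
print(viterbi(log_A, log_B, log_pi))
```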

2,095 citations

Journal ArticleDOI
TL;DR: The authors propose the application of a Poisson process model of novelty, whose ability to predict novel tokens is evaluated; it consistently outperforms existing methods and offers a small improvement in the coding efficiency of text compression over the best method previously known.
Abstract: Approaches to the zero-frequency problem in adaptive text compression are discussed. This problem concerns estimating the likelihood of a novel event occurring. Although several methods have been used, their suitability has been assessed by empirical evaluation rather than grounded in a well-founded model. The authors propose the application of a Poisson process model of novelty. Its ability to predict novel tokens is evaluated, and it consistently outperforms existing methods. It is applied to a practical statistical coding scheme, where a slight modification is required to avoid divergence. The result is a well-founded zero-frequency model that explains observed differences in the performance of existing methods and offers a small improvement in the coding efficiency of text compression over the best method previously known.

835 citations


"Brno University of Technology Syste..." refers methods in this paper

  • ...We used standard Witten-Bell discounting [12] implemented in the SRILM toolkit [13]....

    [...]
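The citing paper uses this reference for Witten-Bell discounting as implemented in SRILM. For orientation, the standard Witten-Bell estimates in LaTeX, where c(h) is the token count of history h, c(h, w) the count of w after h, and T(h) the number of distinct types seen after h (backoff redistributes the reserved mass over unseen events):

```latex
% Escape probability reserved for novel events after history h:
P(\mathrm{novel} \mid h) = \frac{T(h)}{c(h) + T(h)}

% Discounted estimate for each seen event w:
P_{\mathrm{WB}}(w \mid h) = \frac{c(h, w)}{c(h) + T(h)}
```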

Journal ArticleDOI
TL;DR: Four approaches for automatic language identification of speech utterances are compared: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR).
Abstract: We have compared the performance of four approaches for automatic language identification of speech utterances: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR). These approaches, which span a wide range of training requirements and levels of recognition complexity, were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. Systems containing phone recognizers performed better than the simpler GMM classifier. The top-performing system was parallel PRLM, which exhibited an error rate of 2% for 45-s utterances and 5% for 10-s utterances in two-language, closed-set, forced-choice classification. The error rate for 11-language, closed-set, forced-choice classification was 11% for 45-s utterances and 21% for 10-s utterances.

710 citations


"Brno University of Technology Syste..." refers methods in this paper

  • ...o_h(t) denotes the h-th element of feature vector o(t); filtering of cepstral trajectories is used to alleviate channel mismatch [6] and vocal-tract length normalization (VTLN) [7] performs simple speaker adaptation....

    [...]

Journal ArticleDOI
TL;DR: This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.
Abstract: The minimum classification error (MCE) framework for discriminative training is a simple and general formalism for directly optimizing recognition accuracy in pattern recognition problems. The framework applies directly to the optimization of hidden Markov models (HMMs) used for speech recognition problems. However, few if any studies have reported results for the application of MCE training to large-vocabulary, continuous-speech recognition tasks. This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary (up to 100k-word) speech recognition tasks: the Corpus of Spontaneous Japanese lecture speech transcription task, a telephone-based name recognition task, and the MIT Jupiter telephone-based conversational weather information task. On these tasks, starting from maximum likelihood (ML) baselines, MCE training yielded relative reductions in word error ranging from 7% to 20%. Furthermore, this paper evaluates the use of different methods for optimizing the MCE criterion function, as well as the use of precomputed recognition lattices to speed up training. An overview of the MCE framework is given, with an emphasis on practical implementation issues.
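For reference, the standard MCE ingredients in the Juang-Katagiri form, written in LaTeX (the article's exact variants and smoothing constants may differ): a misclassification measure pits the discriminant of the correct class i against a soft maximum over its M-1 competitors, and a sigmoid turns it into a smooth, differentiable loss:

```latex
% Misclassification measure (\eta controls the competitor soft-max):
d_i(x; \Lambda) = -g_i(x; \Lambda)
  + \frac{1}{\eta} \log \Bigl[ \frac{1}{M - 1}
      \sum_{j \neq i} e^{\eta\, g_j(x; \Lambda)} \Bigr]

% Smooth 0-1 loss whose gradient drives the parameter updates:
\ell_i(x; \Lambda) = \frac{1}{1 + e^{-\gamma\, d_i(x; \Lambda) + \theta}}
```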

581 citations


"Brno University of Technology Syste..." refers background in this paper

  • ...If the silence is longer than 5 sec., the system flushes the previous segment and starts a new one....

    [...]