
Showing papers by Dong Yu published in 2009


Proceedings ArticleDOI
19 Apr 2009
TL;DR: The results demonstrate that when context coverage in language-specific training is poor, one tenth of the adaptation data suffices to achieve equivalent performance in cross-lingual speech recognition.
Abstract: We study key issues in multilingual acoustic modeling for automatic speech recognition (ASR) through a series of large-scale ASR experiments. Our study explores shared structures embedded in a large collection of speech data spanning a number of spoken languages in order to establish a common set of universal phone models that can be used for large-vocabulary ASR of all the languages, seen or unseen during training. Language-universal and language-adaptive models are compared with language-specific models; the comparison shows that in many cases general-purpose language-universal and language-adaptive acoustic models can outperform language-specific ones if the set of shared units, the structure of shared states, and the acoustic-phonetic properties shared among languages are properly utilized. Specifically, our results demonstrate that when context coverage in language-specific training is poor, one tenth of the adaptation data suffices to achieve equivalent performance in cross-lingual speech recognition.

127 citations


Journal ArticleDOI
Jinyu Li, Li Deng, Dong Yu, Yifan Gong, Alex Acero
TL;DR: A model-domain environment-robust adaptation algorithm that demonstrates high performance on the standard Aurora 2 speech recognition task without discriminative training of the HMM system, using the clean-trained complex HMM backend as the baseline system for unsupervised model adaptation.

104 citations


Journal ArticleDOI
TL;DR: It is shown that under the well-matched condition the proposed discriminatively trained VPHMM outperforms the conventional HMM trained in the same way, with relative word error rate (WER) reductions of 19% and 15% when only the means are updated and when both the means and variances are updated, respectively.
Abstract: We propose a new framework and the associated maximum-likelihood and discriminative training algorithms for the variable-parameter hidden Markov model (VPHMM), whose mean and variance parameters vary as functions of additional environment-dependent conditioning parameters. Our framework differs from the VPHMM proposed by Cui and Gong (2007) in that piecewise spline interpolation, instead of global polynomial regression, is used to represent the dependency of the HMM parameters on the conditioning parameters, and a more effective functional form is used to model the variances. Our framework unifies and extends the conventional discrete VPHMM: it no longer requires quantization in estimating the model parameters, and it naturally supports both parameter sharing and instantaneous conditioning parameters. We investigate the strengths and weaknesses of the model on the Aurora-3 corpus and show that under the well-matched condition the proposed discriminatively trained VPHMM outperforms the conventional HMM trained in the same way, with relative word error rate (WER) reductions of 19% and 15% when only the means are updated and when both the means and variances are updated, respectively.
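
To make the idea concrete, here is a minimal sketch of a variable-parameter Gaussian in this spirit: a state's mean is stored as spline knots over an environment conditioning parameter (instantaneous frame SNR here) and interpolated at evaluation time. The knot positions and values are illustrative assumptions, not numbers from the paper.

    import numpy as np
    from scipy.interpolate import CubicSpline
    from scipy.stats import norm

    # Hypothetical knots: conditioning parameter (frame SNR in dB) -> mean of
    # one cepstral dimension, as spline-based VPHMM training might estimate.
    snr_knots = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
    mean_knots = np.array([-1.2, -0.6, 0.1, 0.4, 0.5])
    mean_fn = CubicSpline(snr_knots, mean_knots)  # piecewise spline interpolation

    def state_loglik(x, snr, sigma=1.0):
        """Log-likelihood under a state whose mean depends on the frame SNR."""
        mu = float(mean_fn(np.clip(snr, snr_knots[0], snr_knots[-1])))
        return norm.logpdf(x, loc=mu, scale=sigma)

    print(state_loglik(x=0.2, snr=12.0))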

48 citations


Journal ArticleDOI
Dong Yu, Li Deng, Alex Acero
TL;DR: A spline-based solution to the MaxEnt model with non-linear continuous weighting functions is proposed, and it is illustrated that the optimization problem can be converted into a standard log-linear one in a higher-dimensional feature space.
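
The conversion works by writing each unknown weight function as a linear combination of fixed spline basis functions, so the model becomes log-linear in the basis coefficients. A minimal sketch, using a piecewise-linear basis in place of the paper's cubic splines and random stand-ins for trained weights:

    import numpy as np

    knots = np.linspace(0.0, 1.0, 6)  # hypothetical knots on the feature range
    h = knots[1] - knots[0]

    def basis(f):
        # Piecewise-linear "hat" basis; a cubic-spline basis lifts the same way.
        return np.maximum(0.0, 1.0 - np.abs(f - knots) / h)

    # With w_y(f) = alpha[y] . basis(f), the class score is linear in alpha[y],
    # so ordinary log-linear (MaxEnt) training applies to the lifted features.
    rng = np.random.default_rng(0)
    alpha = rng.standard_normal((3, len(knots)))  # stands in for trained weights
    f = 0.37                                      # a continuous feature value
    scores = alpha @ basis(f)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # MaxEnt posterior over 3 classes
    print(probs)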

46 citations


Proceedings Article
01 Dec 2009
TL;DR: A focus on the new strategy that combines layer-wise unsupervised pre-training using entropy-based multi-objective optimization with conditional-likelihood-based back-propagation fine-tuning, as inspired by recent developments in learning deep belief networks.
Abstract: We have recently proposed deep-structured conditional random fields (CRFs) for sequential labeling and classification. The core of this model is its deep structure and its discriminative nature. This paper outlines the learning strategies and algorithms we have developed for deep-structured CRFs, with a focus on the new strategy that combines layer-wise unsupervised pre-training using entropy-based multi-objective optimization with conditional-likelihood-based back-propagation fine-tuning, as inspired by recent developments in learning deep belief networks.
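
A minimal sketch of the layer-wise structure: each layer emits per-frame label posteriors, which are appended to the raw observations as input to the next layer. A plain softmax classifier with random stand-in weights is used below in place of a trained linear-chain CRF layer:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    T, D, K, L = 50, 13, 40, 3          # frames, feature dim, labels, layers
    x = rng.standard_normal((T, D))     # an observation sequence

    feats = x
    for layer in range(L):
        W = 0.1 * rng.standard_normal((feats.shape[1], K))  # stand-in weights
        post = softmax(feats @ W)       # per-frame posteriors from this layer
        feats = np.hstack([x, post])    # next layer sees raw input + posteriors
    print(post.argmax(axis=1)[:10])     # final-layer frame labels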

43 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: The new discriminative pronunciation learning technique overcomes a limitation of traditional ways of introducing alternative pronunciations, which often enlarge confusability across different lexical items, and is used to improve the pronunciation-modeling component of a speech recognition system designed for mobile voice search.
Abstract: In this paper, we report our recent research aimed at improving the pronunciation-modeling component of a speech recognition system designed for mobile voice search. Our new discriminative learning technique overcomes the limitation of traditional ways of introducing alternative pronunciations, which often enlarge confusability across different lexical items. Instead, we use a phonetic recognizer to generate pronunciation candidates, which are then evaluated and selected using the global minimum-classification-error measure, guaranteeing a reduction of the training-set error rate after the alternative pronunciations are introduced. A maximum entropy approach is subsequently used to learn the weight parameters of the selected pronunciation candidates. Our experimental results demonstrate the effectiveness of the discriminative pronunciation learning technique in a real-world speech recognition task in which the pronunciation of business names presents special difficulty for high-accuracy speech recognition.
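
A self-contained toy of the selection criterion: candidate pronunciations are admitted one at a time and kept only when they reduce training-set error, so confusability cannot grow. The toy recognizer and data below are illustrative stand-ins for the paper's global MCE measure:

    def toy_error(lexicon, train_set):
        """Fraction of (phones, word) pairs the lexicon decodes incorrectly."""
        wrong = 0
        for phones, word in train_set:
            hyp = next((w for w, ps in lexicon.items() if phones in ps), None)
            wrong += hyp != word
        return wrong / len(train_set)

    def select_pronunciations(lexicon, candidates, train_set):
        best = toy_error(lexicon, train_set)
        for word, pron in candidates:
            trial = {w: list(ps) for w, ps in lexicon.items()}
            trial.setdefault(word, []).append(pron)
            err = toy_error(trial, train_set)
            if err < best:               # keep only error-reducing additions
                lexicon, best = trial, err
        return lexicon

    lexicon = {"data": ["d ey t ah"]}
    train = [("d ae t ah", "data"), ("d ey t ah", "data")]
    print(select_pronunciations(lexicon, [("data", "d ae t ah")], train))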

32 citations


Proceedings Article
Dong Yu, Li Deng, Alex Acero
01 Sep 2009
TL;DR: It is demonstrated that a 20.8% classification error rate can be achieved on the TIMIT phone classification task using the HCRF-DC model, which is superior to any published single-system result on this heavily evaluated task.
Abstract: We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with distribution constraints (DCs). We point out that the distribution constraints are equivalent to the traditional moment constraints for binary features, but are able to better regularize the probability distribution of continuous-valued features than the moment constraints. We show that under the distribution constraints the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting, more difficult optimization problem by converting it to the traditional log-linear form in a higher-dimensional feature space using cubic splines. We demonstrate that a 20.8% classification error rate (CER) can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this heavily evaluated task, including the HCRF-MC model, discriminatively trained HMMs, and large-margin HMMs using the same features. Index Terms: hidden conditional random field, maximum entropy, moment constraint, distribution constraint, phone classification, TIMIT, cubic spline

29 citations


Proceedings ArticleDOI
Dong Yu, Li Deng, Peng Liu, Jian Wu, Yifan Gong, Alex Acero
19 Apr 2009
TL;DR: The results show that the AM-merging technique performs best, achieving a 60% relative WER reduction over the IPA-based technique.
Abstract: This paper proposes and compares four cross-lingual and bilingual automatic speech recognition techniques under the constraint that only the acoustic model (AM) of the native language is used at runtime. The first three techniques fall into the category of lexicon conversion, where each phoneme sequence (PHS) in the foreign-language (FL) lexicon is mapped into a native-language (NL) phoneme sequence. The first technique determines the PHS mapping through International Phonetic Alphabet (IPA) features; the second and third techniques are data-driven, determining the mapping by converting the PHSs into corresponding context-independent and context-dependent hidden Markov models (HMMs), respectively, and searching for the NL PHS with the least Kullback-Leibler divergence (KLD) between the HMMs. The fourth technique falls into the category of AM merging, where the FL's AM is merged into the NL's AM by mapping each senone in the FL's AM to the senone in the NL's AM with the minimum KLD. We discuss the strengths and limitations of each technique, report empirical evaluation results on recognizing English utterances with a Korean recognizer, and demonstrate the high correlation between the average KLD and the word error rate (WER). The results show that the AM-merging technique performs best, achieving a 60% relative WER reduction over the IPA-based technique.
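
A minimal sketch of the AM-merging step: each FL senone is mapped to the NL senone with the smallest KLD. Senones are reduced here to single diagonal Gaussians, for which the KLD has a closed form; the paper's senones are mixture models, so this is an illustrative simplification with random stand-in parameters:

    import numpy as np

    def kld_diag_gauss(mu_p, var_p, mu_q, var_q):
        """Closed-form KL(p||q) between diagonal Gaussians."""
        return 0.5 * np.sum(np.log(var_q / var_p)
                            + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

    rng = np.random.default_rng(1)
    D, n_fl, n_nl = 39, 5, 8            # feature dim, FL/NL senone counts
    fl = [(rng.standard_normal(D), rng.uniform(0.5, 2.0, D)) for _ in range(n_fl)]
    nl = [(rng.standard_normal(D), rng.uniform(0.5, 2.0, D)) for _ in range(n_nl)]

    mapping = {i: min(range(n_nl), key=lambda j: kld_diag_gauss(*fl[i], *nl[j]))
               for i in range(n_fl)}
    print(mapping)                      # FL senone index -> nearest NL senone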

23 citations


Journal ArticleDOI
Dong Yu, Li Deng
TL;DR: The use of splines is described for solving nonlinear model estimation problems, in which nonlinear functions with unknown shapes and values are involved, by converting the nonlinear estimation problems into linear ones in a higher-dimensional space.
Abstract: We describe the use of splines for solving nonlinear model estimation problems, in which nonlinear functions with unknown shapes and values are involved, by converting the nonlinear estimation problems into linear ones in a higher-dimensional space. This contrasts with the typical use of splines for function interpolation, where the functional values at some input points are given and the values at other input points are sought via interpolation. The technique described in this column applies to arbitrary nonlinear estimation problems in which one or more one-dimensional nonlinear functions are involved, and it can be extended to cases involving higher-dimensional nonlinear functions.
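
A minimal sketch of the estimation-versus-interpolation distinction: the knot values of an unknown one-dimensional function are treated as free parameters and fit by least squares, rather than being given in advance. The target function, knot grid, and piecewise-linear basis are illustrative choices:

    import numpy as np

    knots = np.linspace(0.0, 1.0, 8)
    h = knots[1] - knots[0]

    def design(x):
        # Piecewise-linear basis: the spline is linear in its knot values.
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / h)

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 1.0, 200)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)  # unknown shape

    B = design(x)                                 # 200 x 8, linear in knot values
    vals, *_ = np.linalg.lstsq(B, y, rcond=None)  # estimate knot values linearly
    print(np.round(vals, 2))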

19 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: A new active learning algorithm is proposed to address the problem of selecting a limited subset of utterances for transcription from a large pool of unlabeled utterances so that the accuracy of the automatic speech recognition system can be maximized.
Abstract: We propose a new active learning algorithm to address the problem of selecting a limited subset of utterances for transcription from a large pool of unlabeled utterances so that the accuracy of the automatic speech recognition system can be maximized. Our algorithm differentiates itself from earlier work in that it uses a criterion that maximizes the lattice entropy reduction over the whole dataset. We introduce the criterion, show how it can be simplified and approximated, and describe the detailed algorithm for optimizing it. We demonstrate the effectiveness of the new algorithm on directory assistance data collected under real usage scenarios and show that it consistently outperforms the confidence-based approach by a significant margin. Using the algorithm cuts the number of utterances needed for transcription by 50% to achieve the same recognition accuracy obtained with the confidence-based approach, and by 60% compared with the random sampling approach.
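
For flavor, a toy selection loop: utterances are ranked by the entropy of their hypothesis posteriors and the most confusable ones are sent for transcription. This local entropy is only a crude proxy; the paper's criterion scores the expected lattice-entropy reduction over the whole dataset. The posteriors below are made up:

    import numpy as np

    def hyp_entropy(posteriors):
        """Entropy of an utterance's n-best (or lattice-path) posteriors."""
        p = np.asarray(posteriors)
        return float(-(p * np.log(p + 1e-12)).sum())

    pool = {                          # hypothetical n-best posteriors
        "utt1": [0.90, 0.07, 0.03],   # confident: little value in transcribing
        "utt2": [0.40, 0.35, 0.25],   # confused: high value in transcribing
        "utt3": [0.55, 0.30, 0.15],
    }
    budget = 2
    selected = sorted(pool, key=lambda u: hyp_entropy(pool[u]), reverse=True)[:budget]
    print(selected)                   # utterances to send for transcription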

16 citations


Patent
Dong Yu, Li Deng, Jinyu Li
10 Dec 2009
TL;DR: In this article, a calibration model for use in a speech recognition system is described, which is one that has been trained for a specific usage scenario, based upon a calibration training set obtained from a previous similar/corresponding usage scenario or scenarios.
Abstract: Described is a calibration model for use in a speech recognition system. The calibration model adjusts the confidence scores output by a speech recognition engine, thereby providing an improved, calibrated confidence score for use by an application. The calibration model is one that has been trained for a specific usage scenario, e.g., for that application, based upon a calibration training set obtained from one or more previous similar/corresponding usage scenarios. Different calibration models may be used with different usage scenarios, e.g., during different conditions. The calibration model may comprise a maximum entropy classifier with distribution constraints, trained with continuous raw confidence scores and multi-valued word tokens, and/or other distributions and extracted features.
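
A minimal sketch of the calibration idea: a model maps raw engine confidences to calibrated probabilities, trained on (raw score, correct?) pairs from a matching usage scenario. Plain logistic regression stands in for the patent's MaxEnt classifier with distribution constraints, and the training pairs are hypothetical:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical calibration set from a prior deployment of the same scenario:
    # raw engine confidences and whether the recognized word was actually correct.
    raw = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30])
    correct = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

    w, b = 0.0, 0.0
    for _ in range(2000):             # gradient ascent on the log-likelihood
        p = sigmoid(w * raw + b)
        w += 0.5 * np.mean((correct - p) * raw)
        b += 0.5 * np.mean(correct - p)

    print(sigmoid(w * 0.80 + b))      # calibrated confidence for a raw 0.80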

Patent
Dong Yu, Li Deng, Alejandro Acero
01 Apr 2009
TL;DR: In this paper, a spline-based solution is proposed to transform the optimization problem into a standard log-linear optimization problem without continuous weights at a higher-dimensional space, where the continuous weights may be approximated by a single-valued weight.
Abstract: Described is a technology by which a maximum entropy (MaxEnt) model, such as one used as a classifier or embedded in a conditional random field or hidden conditional random field, uses continuous features with continuous weights that are continuous functions of the feature values (instead of single-valued weights). The continuous weights may be approximated by a spline-based solution. In general, this converts the optimization problem into a standard log-linear optimization problem, without continuous weights, in a higher-dimensional space.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: A novel semi-supervised learning algorithm for automatic speech recognition that decides whether a hypothesized transcription should be used in training by taking into consideration collective information from all available utterances, rather than relying solely on the confidence of the utterance itself.
Abstract: Training accurate acoustic models typically requires a large amount of transcribed data, which can be expensive to obtain. In this paper, we describe a novel semi-supervised learning algorithm for automatic speech recognition. The algorithm decides whether a hypothesized transcription should be used in training by taking into consideration collective information from all available utterances, rather than relying solely on the confidence of the utterance itself. It estimates the expected entropy reduction each utterance-and-transcription pair may cause on the whole unlabeled dataset and chooses the pairs with positive gains. We compare our algorithm with an existing confidence-based semi-supervised learning algorithm and show that the former consistently outperforms the latter when the same number of utterances is selected into the training set. We also show that our algorithm can determine the cutoff point in a principled way by demonstrating that the point it finds is very close to the achievable peak point.
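
A toy illustration of the principled cutoff: (utterance, hypothesis) pairs are ranked by estimated expected entropy reduction on the unlabeled set, and only pairs with positive gain enter training. The gains below are made-up numbers, not the paper's estimator:

    pairs = [                  # (utterance, hypothesized transcription, est. gain)
        ("utt7", "call home", 0.41),
        ("utt3", "weather in seattle", 0.18),
        ("utt9", "pizza near me", 0.02),
        ("utt5", "noisy hypothesis", -0.07),   # negative gain: excluded
    ]
    selected = [(u, hyp) for u, hyp, gain in sorted(pairs, key=lambda p: -p[2])
                if gain > 0]
    print(selected)            # training additions; the cutoff falls at zero gain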