Showing papers by Dong Yu published in 2006


Journal ArticleDOI
Li Deng1, Dong Yu1, Alejandro Acero1
TL;DR: This paper shows how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects and demonstrates superior recognizer performance over a modern hidden Markov model-based system.
Abstract: Modeling the dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers the potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models, where syntactic structure is exploited to represent long-distance relationships among words, the structured speech model described in this paper makes use of the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. A general overview is provided first on hierarchically classified types of dynamic speech models in the literature. A detailed account is then given for a specific model type called the hidden trajectory model, and we describe detailed steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Experiments on phonetic recognition evaluation demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class.
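
As a rough illustration of the target-filtering idea (the kernel shape, decay value, and segment durations below are assumptions for the sketch, not values from the paper), per-frame resonance targets can be smoothed with a symmetric, normalized FIR kernel so that each phone's trajectory is pulled toward its neighbors, producing coarticulation and, for short segments, target undershoot (reduction).

    import numpy as np

    def bidirectional_fir_smooth(targets, half_span=7, gamma=0.6):
        """Smooth a per-frame target sequence with a symmetric exponential FIR kernel.

        targets:   (T,) or (T, D) array of per-frame resonance targets
                   (piecewise constant over phone segments).
        half_span: kernel reach in frames on each side (hypothetical value).
        gamma:     kernel decay; larger gamma means stronger coarticulation.
        """
        taps = np.arange(-half_span, half_span + 1)
        kernel = gamma ** np.abs(taps)
        kernel /= kernel.sum()                       # unity gain keeps long targets reachable
        targets = np.atleast_2d(targets.T).T         # ensure shape (T, D)
        padded = np.pad(targets, ((half_span, half_span), (0, 0)), mode="edge")
        out = np.empty_like(targets, dtype=float)
        for d in range(targets.shape[1]):
            out[:, d] = np.convolve(padded[:, d], kernel, mode="valid")
        return out

    # Toy example: three phone segments with F1 targets 500, 300, 700 Hz.
    f1_targets = np.concatenate([np.full(20, 500.0), np.full(6, 300.0), np.full(20, 700.0)])
    f1_track = bidirectional_fir_smooth(f1_targets)
    print(f1_track[20:26])   # the short middle segment never reaches 300 Hz (undershoot)

Because the kernel is symmetric and normalized, each frame is influenced by both earlier and later targets, combining the anticipatory and regressive effects the abstract attributes to forward and backward filtering, and shorter segments undershoot their targets more.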

95 citations


Patent
Xiaodong He1, Alex Acero1, Dong Yu1, Li Deng1
25 Aug 2006
TL;DR: In this paper, a method and apparatus for training an acoustic model are disclosed, where a training corpus is accessed and converted into an initial acoustic model, and scores are calculated for a correct class and competitive classes, respectively, for each token given the acoustic model.
Abstract: A method and apparatus for training an acoustic model are disclosed. A training corpus is accessed and converted into an initial acoustic model. Scores are calculated for a correct class and for competitive classes for each token given the acoustic model. From these scores a misclassification measure is calculated, and a loss function is then calculated from the misclassification measure. The loss function also includes a margin value that varies over each iteration of the training. Based on the calculated loss function, the acoustic model is updated so that the loss function with the margin value is minimized. This process repeats until an empirical convergence criterion is met.
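
A minimal sketch of the training criterion described above (the smoothing constants, the exact form of the misclassification measure, and the margin schedule are illustrative assumptions, not the patent's formulation): the misclassification measure compares the correct-class score against a soft maximum over competing classes, and a sigmoid loss with an iteration-dependent margin is formed from it.

    import numpy as np

    def margin_mce_loss(correct_score, competitor_scores, margin, eta=5.0, alpha=1.0):
        """Sigmoid MCE-style loss with an additive margin (illustrative form only).

        correct_score:     discriminant score of the correct class for one token
        competitor_scores: scores of the competing classes for the same token
        margin:            margin added to the misclassification measure; in the
                           patent this value varies over training iterations
        """
        competitors = np.asarray(competitor_scores, dtype=float)
        # Soft maximum over competitors (large eta approaches the hard max).
        soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
        d = -correct_score + soft_max           # misclassification measure
        return 1.0 / (1.0 + np.exp(-alpha * (d + margin)))

    # Toy usage: the margin grows with the iteration, demanding a larger score gap.
    for it, margin in enumerate([0.0, 0.5, 1.0]):
        loss = margin_mce_loss(correct_score=2.0, competitor_scores=[0.5, 1.2], margin=margin)
        print(f"iteration {it}: loss = {loss:.3f}")

Updating the acoustic-model parameters in the direction that decreases this loss, with the margin schedule changing across iterations, corresponds to the update step the abstract describes.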

54 citations


Journal ArticleDOI
Li Deng1, Dong Yu1, Alejandro Acero1
TL;DR: The new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite-impulse response (FIR) filter is applied to the segmental target sequence as the filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurement results in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
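
Stage two maps the filtered VTR values to cepstra. As a rough sketch (using the standard closed-form cepstrum of a resonance pole; the paper's exact parameterization and its trained residual variances may differ, and the values below are placeholders), each resonance with frequency f_k and bandwidth b_k contributes an exponentially damped cosine to the cepstral coefficients, and a Gaussian residual covers what the deterministic mapping misses.

    import numpy as np

    def vtr_to_cepstrum(freqs_hz, bands_hz, n_coeffs=12, fs=8000.0):
        """Closed-form cepstrum of a set of vocal tract resonances (poles):

            C_n = sum_k (2/n) * exp(-pi * n * b_k / fs) * cos(2*pi * n * f_k / fs)

        freqs_hz, bands_hz: resonance frequencies and bandwidths for one frame.
        """
        n = np.arange(1, n_coeffs + 1)[:, None]          # (n_coeffs, 1)
        f = np.asarray(freqs_hz, dtype=float)[None, :]    # (1, K)
        b = np.asarray(bands_hz, dtype=float)[None, :]
        terms = (2.0 / n) * np.exp(-np.pi * n * b / fs) * np.cos(2.0 * np.pi * n * f / fs)
        return terms.sum(axis=1)                           # (n_coeffs,)

    # Predicted cepstrum for one frame; a Gaussian residual (trainable in the paper,
    # fixed here purely for illustration) models the prediction error.
    c_pred = vtr_to_cepstrum(freqs_hz=[500.0, 1500.0, 2500.0], bands_hz=[60.0, 90.0, 120.0])
    c_obs = c_pred + np.random.default_rng(0).normal(scale=0.05, size=c_pred.shape)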

37 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A technique for rapid speech application development that generates robust semantic context-free grammars (CFG) given rigid CFGs as input that is written in the W3C SRGS format and thus can run in many standard automatic speech recognition engines.
Abstract: We propose a technique for rapid speech application development that generates robust semantic context-free grammars (CFGs) given rigid CFGs as input. Users' speech does not always conform to rigid CFGs, so robust grammars improve the caller's experience. Our system takes a simple CFG and generates a hybrid n-gram/CFG that is written in the W3C SRGS format and thus can run in many standard automatic speech recognition engines. The hybrid network leverages an application-independent word n-gram which can be shared across different applications. In addition, our tool allows developers to provide a few example sentences to adapt the n-gram for improved accuracy. Our experiments show the robust CFG has no loss in accuracy for test utterances that can be covered by the rigid CFG, but offers large improvements for cases where the user's sentence cannot be covered by the rigid CFG. It also provides much better rejection of utterances that contain no slot at all. With a few example sentences for adaptation, our robust CFG can achieve recognition accuracy close to that of a class-based n-gram LM customized for the application.
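
One way to picture the rigid-to-robust conversion (the rule notation, the filler symbol, and the specific transformation below are simplifications for the sketch, not the tool's actual SRGS output): allow optional filler nonterminals, backed by the shared word n-gram, around the slot-bearing phrases of each rigid rule, so off-grammar carrier words are absorbed instead of causing a recognition failure.

    FILLER = "<ngram_filler>"   # hypothetical nonterminal backed by the shared word n-gram

    def make_robust_rule(rigid_tokens, slot_names):
        """Turn a rigid CFG rule body into a robust one by making non-slot words
        optional and allowing filler between tokens (illustrative transformation).
        Square brackets here denote optional items (an assumed notation)."""
        robust = [f"[{FILLER}]"]                    # optional leading carrier words
        for tok in rigid_tokens:
            robust.append(tok if tok in slot_names else f"[{tok}]")
            robust.append(f"[{FILLER}]")            # filler allowed between tokens
        return " ".join(robust)

    # Toy usage: "flights from <city> to <city>" becomes tolerant of extra words
    # and of dropped carrier words such as "flights" or "from".
    print(make_robust_rule(["flights", "from", "<city>", "to", "<city>"],
                           slot_names={"<city>"}))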

30 citations


01 Jan 2006
TL;DR: In this article, a structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation, where the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in absence of acoustic data.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite-impulse response (FIR) filter is applied to the segmental target sequence as the filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurement results in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.

23 citations


Journal ArticleDOI
Dong Yu1, Li Deng1, Alex Acero1
TL;DR: Improved likelihood score computation in the HTM and a novel A∗-based time-asynchronous lattice-constrained decoding algorithm for HTM evaluation are described, and the new search algorithm is shown to improve recognition accuracy on recognition lattices over the traditional N-best rescoring paradigm.

20 citations


Patent
14 Mar 2006
TL;DR: In this paper, a method of forming a shareable filler model (a shareable model for garbage words) from a word n-gram model is presented: the n-gram model is converted into a probabilistic context-free grammar (PCFG), which is then modified into a substantially application-independent PCFG.
Abstract: A method of forming a shareable filler model (shareable model for garbage words) from a word n-gram model is provided. The word n-gram model is converted into a probabilistic context free grammar (PCFG). The PCFG is modified into a substantially application-independent PCFG, which constitutes the shareable filler model.
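
The n-gram-to-PCFG conversion can be sketched as follows (a minimal sketch assuming a bigram model stored as a nested dictionary; the rule notation and symbol names are hypothetical): every word history becomes a nonterminal, and each bigram probability becomes a weighted right-linear rule that emits the next word and moves to that word's nonterminal.

    def bigram_to_pcfg(bigram_probs, end_token="</s>"):
        """Convert a bigram model P(next | prev) into weighted CFG productions.

        bigram_probs: {prev_word: {next_word: prob}}
        Returns (lhs, rhs, prob) productions of the form N_prev -> next N_next,
        or N_prev -> <eps> when the word sequence ends.
        """
        rules = []
        for prev, successors in bigram_probs.items():
            lhs = f"N_{prev}"
            for nxt, p in successors.items():
                if nxt == end_token:
                    rules.append((lhs, [], p))                 # terminate the filler
                else:
                    rules.append((lhs, [nxt, f"N_{nxt}"], p))  # emit a word, continue
        return rules

    # Toy filler model over carrier words.
    bigrams = {"<s>": {"please": 0.6, "uh": 0.4},
               "please": {"</s>": 1.0},
               "uh": {"please": 0.5, "</s>": 0.5}}
    for lhs, rhs, p in bigram_to_pcfg(bigrams):
        print(f"{lhs} -> {' '.join(rhs) or '<eps>'}   [{p}]")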

16 citations


Patent
13 Dec 2006
TL;DR: In this article, a computer-implemented method for improving the accuracy of a directory assistance system is described, which includes constructing a prefix tree based on a collection of alphabetically organized words.
Abstract: A computer-implemented method is disclosed for improving the accuracy of a directory assistance system. The method includes constructing a prefix tree based on a collection of alphabetically organized words. The prefix tree is utilized as a basis for generating splitting rules for a compound word included in an index associated with the directory assistance system. A language model check and a pronunciation check are conducted in order to determine which of the generated splitting rules are most likely correct. The compound word is split into word components based on the most likely correct rule or rules. The word components are incorporated into a data set associated with the directory assistance system, such as a recognition grammar and/or the index.
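
A minimal sketch of the prefix-tree step (the trie layout and splitting recursion are generic illustrations; the language-model and pronunciation checks that rank the candidate splits are not shown): known words are inserted into a trie, and a compound word is split at every position where a prefix matches a complete word and the remainder can itself be split.

    def build_trie(words):
        """Build a prefix tree; the key '$' marks the end of a valid word."""
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True
        return root

    def split_compound(compound, trie):
        """Return every way of splitting `compound` into known words."""
        if not compound:
            return [[]]
        splits, node = [], trie
        for i, ch in enumerate(compound):
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:                               # the prefix is a complete word
                for rest in split_compound(compound[i + 1:], trie):
                    splits.append([compound[:i + 1]] + rest)
        return splits

    trie = build_trie(["auto", "haus", "autohaus", "bahn"])
    print(split_compound("autobahn", trie))   # [['auto', 'bahn']]
    print(split_compound("autohaus", trie))   # [['auto', 'haus'], ['autohaus']]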

13 citations


Patent
17 Feb 2006
TL;DR: In this paper, a time-synchronous lattice-constrained search algorithm is developed and used to process a linguistic model of speech that has a long contextual-span capability.
Abstract: A time-synchronous lattice-constrained search algorithm is developed and used to process a linguistic model of speech that has a long-contextual-span capability. In the algorithm, hypotheses are represented as traces that include an indication of a current frame, previous frames and future frames. Each frame can include an associated linguistic unit such as a phone or units that are derived from a phone. Additionally, pruning strategies can be applied to speed up the search. Further, word-ending recombination methods are developed to speed up the computation. These methods can effectively deal with an exponentially increased search space.
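
The trace idea can be pictured roughly as follows (the data-structure fields and the pruning rule are illustrative guesses at the general scheme, not the patent's exact representation): each hypothesis records the units already decoded, the unit at the current frame, and the units assumed for upcoming frames, so a long-contextual-span model can score the current frame in context; beam pruning then keeps only the best traces per frame.

    from dataclasses import dataclass, field
    import heapq

    @dataclass(order=True)
    class Trace:
        score: float                                         # accumulated log score
        current_unit: str = field(compare=False)             # unit at the current frame
        history: tuple = field(compare=False, default=())    # units in previous frames
        lookahead: tuple = field(compare=False, default=())  # units assumed for future frames

    def prune(traces, beam_size=3):
        """Keep the best-scoring traces for the next frame (simple beam pruning)."""
        return heapq.nlargest(beam_size, traces)

    frame_traces = [
        Trace(-12.3, "ih", ("s",), ("t",)),
        Trace(-15.9, "iy", ("s",), ("t",)),
        Trace(-21.4, "eh", ("s",), ("t",)),
        Trace(-25.0, "ah", ("s",), ("t",)),
    ]
    for t in prune(frame_traces):
        print(t.score, t.history, t.current_unit, t.lookahead)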

9 citations


Patent
19 Dec 2006
TL;DR: In this paper, a statistical language model is trained for use in a directory assistance system using the data in the directory assistance listing corpus, which is used to determine how important words in the corpus are in distinguishing a listing from other listings and how likely words are to be omitted or added by a user.
Abstract: A statistical language model is trained for use in a directory assistance system using the data in a directory assistance listing corpus. Calculations are made to determine how important words in the corpus are in distinguishing a listing from other listings, and how likely words are to be omitted or added by a user. The language model is trained using these calculations.
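
A rough sketch of the word-importance calculation (the inverse-listing-frequency weighting and the omission estimate below are common heuristics used as stand-in assumptions, not the patent's formulas): words that occur in few listings get high weight because they distinguish a listing, while words common across listings are treated as more likely to be dropped by a caller.

    import math
    from collections import Counter

    listings = [
        "joe's pizza and pasta",
        "main street pizza",
        "city medical center",
    ]

    # How distinctive is each word? Fewer listings containing it means higher weight
    # (an IDF-style heuristic, assumed here for illustration).
    doc_freq = Counter(w for listing in listings for w in set(listing.split()))
    num_listings = len(listings)
    importance = {w: math.log(num_listings / df) for w, df in doc_freq.items()}

    # How likely is a word to be omitted by a caller? Here common words are assumed
    # more droppable (a toy stand-in for statistics estimated from call logs).
    omission_prob = {w: min(0.9, df / num_listings) for w, df in doc_freq.items()}

    for w in sorted(importance, key=importance.get, reverse=True):
        print(f"{w:10s} importance={importance[w]:.2f} p(omit)={omission_prob[w]:.2f}")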

7 citations


Patent
17 Feb 2006
TL;DR: In this article, the parameters for distributions of a hidden trajectory model including means and variances are estimated using an acoustic likelihood function for observation vectors as an objection function for optimization, which includes only acoustic data and not any intermediate estimate on hidden dynamic variables.
Abstract: Parameters for distributions of a hidden trajectory model, including means and variances, are estimated using an acoustic likelihood function for observation vectors as the objective function for optimization. The estimation uses only acoustic data and not any intermediate estimate of hidden dynamic variables. Gradient ascent methods can be developed for optimizing the acoustic likelihood function.
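
The optimization step can be illustrated generically (the toy diagonal-Gaussian likelihood and the learning rate are placeholders; the actual HTM likelihood involves the filtered trajectories described in the papers above): the means and variances are moved along the gradient of the acoustic log-likelihood of the observation vectors, with no intermediate hidden-variable estimates required.

    import numpy as np

    def log_likelihood(obs, mean, var):
        """Log-likelihood of observations under a diagonal Gaussian (toy stand-in
        for the HTM's acoustic likelihood of cepstral observation vectors)."""
        return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (obs - mean) ** 2 / var)))

    def gradient_ascent(obs, mean, var, lr=0.05, steps=200):
        for _ in range(steps):
            grad_mean = np.sum((obs - mean) / var, axis=0)
            grad_var = np.sum(0.5 * ((obs - mean) ** 2 / var ** 2 - 1.0 / var), axis=0)
            mean = mean + lr * grad_mean / len(obs)
            var = np.maximum(var + lr * grad_var / len(obs), 1e-4)   # keep variances positive
        return mean, var

    rng = np.random.default_rng(0)
    obs = rng.normal(loc=3.0, scale=2.0, size=(500, 1))
    mean, var = gradient_ascent(obs, mean=np.zeros(1), var=np.ones(1))
    print(mean, var)   # should approach the sample mean and variance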

Proceedings Article
01 Sep 2006
TL;DR: A novel, effective, and efficient utterance verification (UV) technology for access control in the interactive voice response (IVR) systems is proposed by using the secret answer to a question and a word N-gram based filler model to construct a context-free grammar.
Abstract: In this paper we propose a novel, effective, and efficient utterance verification (UV) technology for access control in interactive voice response (IVR) systems. The key to our approach is to construct a context-free grammar by using the secret answer to a question and a word N-gram based filler model. The N-gram filler provides rich alternatives to the secret answer and can potentially improve the accuracy of the UV task. It can also absorb carrier words used by callers and thus can improve robustness. We also propose using a predictor based on the best alternative to calculate the confidence. We show detailed experimental results on a tough UV test set that contains 930 positive and 930 negative cases and discuss the types of questions that are suitable for the UV task. We demonstrate that our approach can achieve a 2.14% equal error rate (EER) on average and a 0.8% false accept rate when the false reject rate is 2.6% or above. This is a 49% EER reduction compared with approaches using acoustic fillers, and a 72% EER reduction compared with the posterior probability based confidence measure. Index Terms: utterance verification, filler model, word spotting, confidence measure
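
The decision rule can be sketched as follows (the scores, the sigmoid confidence predictor, and the threshold are placeholders; the paper's grammar combines the secret answer with the word N-gram filler inside the recognizer itself): after recognition, a confidence is derived from the gap between the score of the path through the secret answer and the score of the best competing filler-only alternative, and the caller is accepted only if that confidence clears a threshold.

    import math

    def verify_utterance(answer_score, best_filler_score, threshold=0.7):
        """Accept or reject a spoken answer based on the score gap between the
        answer path and the best filler alternative (illustrative predictor)."""
        confidence = 1.0 / (1.0 + math.exp(-(answer_score - best_filler_score)))
        return confidence >= threshold, confidence

    # Toy usage: log-scores from a recognizer run against the CFG containing the
    # secret answer and against the N-gram filler alternatives.
    accepted, conf = verify_utterance(answer_score=-42.0, best_filler_score=-45.5)
    print(accepted, round(conf, 3))   # True only if the answer path clearly wins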

Proceedings Article
19 Sep 2006
TL;DR: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented.
Abstract: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented. The HTM is a recently developed acoustic model aimed at capturing the underlying dynamic structure of speech coarticulation and reduction using a compact set of parameters. The long-span nature of the HTM has posed a great technical challenge for developing efficient search algorithms for full evaluation of the model. Taking on this challenge, the decoding algorithm is developed to deal effectively with the exponentially increased search space through HTM-specific techniques for hypothesis representation, word-ending recombination, and hypothesis pruning. Experimental results obtained on the TIMIT phonetic recognition task are reported, extending our earlier HTM evaluation paradigms based on N-best and A* lattice rescoring. Index Terms: Hidden Trajectory Model, time-synchronous decoding, trace-based hypothesis, TIMIT