Showing papers by Dong Yu published in 2006


Journal ArticleDOI
Li Deng1, Dong Yu1, Alejandro Acero1
TL;DR: This paper shows how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects and demonstrates superior recognizer performance over a modern hidden Markov model-based system.
Abstract: Modeling the dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers the potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models, where syntactic structure is exploited to represent long-distance relationships among words, the structured speech model described in this paper makes use of the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. A general overview is provided first on hierarchically classified types of dynamic speech models in the literature. A detailed account is then given for a specific model type called the hidden trajectory model, and we describe detailed steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Experiments on phonetic recognition evaluation demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class.
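
As a rough illustration of the target-filtering idea (the kernel shape, decay value, and segment durations below are assumptions for the sketch, not values from the paper), per-frame resonance targets can be smoothed with a symmetric, normalized FIR kernel so that each phone's trajectory is pulled toward its neighbors, producing coarticulation and, for short segments, target undershoot (reduction).

    import numpy as np

    def bidirectional_fir_smooth(targets, half_span=7, gamma=0.6):
        """Smooth a per-frame target sequence with a symmetric exponential FIR kernel.

        targets:   (T,) or (T, D) array of per-frame resonance targets
                   (piecewise constant over phone segments).
        half_span: kernel reach in frames on each side (hypothetical value).
        gamma:     kernel decay; larger gamma means stronger coarticulation.
        """
        taps = np.arange(-half_span, half_span + 1)
        kernel = gamma ** np.abs(taps)
        kernel /= kernel.sum()                       # unity gain keeps long targets reachable
        targets = np.atleast_2d(targets.T).T         # ensure shape (T, D)
        padded = np.pad(targets, ((half_span, half_span), (0, 0)), mode="edge")
        out = np.empty_like(targets, dtype=float)
        for d in range(targets.shape[1]):
            out[:, d] = np.convolve(padded[:, d], kernel, mode="valid")
        return out

    # Toy example: three phone segments with F1 targets 500, 300, 700 Hz.
    f1_targets = np.concatenate([np.full(20, 500.0), np.full(6, 300.0), np.full(20, 700.0)])
    f1_track = bidirectional_fir_smooth(f1_targets)
    print(f1_track[20:26])   # the short middle segment never reaches 300 Hz (undershoot)

Because the kernel is symmetric and normalized, each frame is influenced by both earlier and later targets, combining the anticipatory and regressive effects the abstract attributes to forward and backward filtering, and shorter segments undershoot their targets more.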

95 citations


Patent
Xiaodong He1, Alex Acero1, Dong Yu1, Li Deng1
25 Aug 2006
TL;DR: In this paper, a method and apparatus for training an acoustic model are disclosed, where a training corpus is accessed and converted into an initial acoustic model, and scores are calculated for a correct class and competitive classes, respectively, for each token given the acoustic model.
Abstract: A method and apparatus for training an acoustic model are disclosed. A training corpus is accessed and converted into an initial acoustic model. Scores are calculated for a correct class and for competitive classes for each token given the acoustic model. From these scores a misclassification measure is calculated, and a loss function is then calculated from the misclassification measure. The loss function also includes a margin value that varies over each iteration of the training. Based on the calculated loss function, the acoustic model is updated so that the loss function with the margin value is minimized. This process repeats until an empirical convergence criterion is met.
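
A minimal sketch of the training criterion described above (the smoothing constants, the exact form of the misclassification measure, and the margin schedule are illustrative assumptions, not the patent's formulation): the misclassification measure compares the correct-class score against a soft maximum over competing classes, and a sigmoid loss with an iteration-dependent margin is formed from it.

    import numpy as np

    def margin_mce_loss(correct_score, competitor_scores, margin, eta=5.0, alpha=1.0):
        """Sigmoid MCE-style loss with an additive margin (illustrative form only).

        correct_score:     discriminant score of the correct class for one token
        competitor_scores: scores of the competing classes for the same token
        margin:            margin added to the misclassification measure; in the
                           patent this value varies over training iterations
        """
        competitors = np.asarray(competitor_scores, dtype=float)
        # Soft maximum over competitors (large eta approaches the hard max).
        soft_max = np.log(np.mean(np.exp(eta * competitors))) / eta
        d = -correct_score + soft_max           # misclassification measure
        return 1.0 / (1.0 + np.exp(-alpha * (d + margin)))

    # Toy usage: the margin grows with the iteration, demanding a larger score gap.
    for it, margin in enumerate([0.0, 0.5, 1.0]):
        loss = margin_mce_loss(correct_score=2.0, competitor_scores=[0.5, 1.2], margin=margin)
        print(f"iteration {it}: loss = {loss:.3f}")

Updating the acoustic-model parameters in the direction that decreases this loss, with the margin schedule changing across iterations, corresponds to the update step the abstract describes.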

54 citations


Journal ArticleDOI
Li Deng1, Dong Yu1, Alejandro Acero1
TL;DR: The new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite-impulse response (FIR) filter is applied to the segmental target sequence as the filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurement results in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
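
Stage two maps the filtered VTR values to cepstra. As a rough sketch (using the standard closed-form cepstrum of a resonance pole; the paper's exact parameterization and its trained residual variances may differ, and the values below are placeholders), each resonance with frequency f_k and bandwidth b_k contributes an exponentially damped cosine to the cepstral coefficients, and a Gaussian residual covers what the deterministic mapping misses.

    import numpy as np

    def vtr_to_cepstrum(freqs_hz, bands_hz, n_coeffs=12, fs=8000.0):
        """Closed-form cepstrum of a set of vocal tract resonances (poles):

            C_n = sum_k (2/n) * exp(-pi * n * b_k / fs) * cos(2*pi * n * f_k / fs)

        freqs_hz, bands_hz: resonance frequencies and bandwidths for one frame.
        """
        n = np.arange(1, n_coeffs + 1)[:, None]          # (n_coeffs, 1)
        f = np.asarray(freqs_hz, dtype=float)[None, :]    # (1, K)
        b = np.asarray(bands_hz, dtype=float)[None, :]
        terms = (2.0 / n) * np.exp(-np.pi * n * b / fs) * np.cos(2.0 * np.pi * n * f / fs)
        return terms.sum(axis=1)                           # (n_coeffs,)

    # Predicted cepstrum for one frame; a Gaussian residual (trainable in the paper,
    # fixed here purely for illustration) models the prediction error.
    c_pred = vtr_to_cepstrum(freqs_hz=[500.0, 1500.0, 2500.0], bands_hz=[60.0, 90.0, 120.0])
    c_obs = c_pred + np.random.default_rng(0).normal(scale=0.05, size=c_pred.shape)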

37 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A technique for rapid speech application development that generates robust semantic context-free grammars (CFG) given rigid CFGs as input that is written in the W3C SRGS format and thus can run in many standard automatic speech recognition engines.
Abstract: We propose a technique for rapid speech application development that generates robust semantic context-free grammars (CFGs) given rigid CFGs as input. Users' speech does not always conform to rigid CFGs, so robust grammars improve the caller's experience. Our system takes a simple CFG and generates a hybrid n-gram/CFG that is written in the W3C SRGS format and thus can run in many standard automatic speech recognition engines. The hybrid network leverages an application-independent word n-gram which can be shared across different applications. In addition, our tool allows developers to provide a few example sentences to adapt the n-gram for improved accuracy. Our experiments show the robust CFG has no loss in accuracy for test utterances that can be covered by the rigid CFG, but offers large improvements for cases where the user's sentence cannot be covered by the rigid CFG. It also provides much better rejection of utterances that contain no slot at all. With a few example sentences for adaptation, our robust CFG can achieve recognition accuracy close to that of a class-based n-gram LM customized for the application.
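
One way to picture the rigid-to-robust conversion (the rule notation, the filler symbol, and the specific transformation below are simplifications for the sketch, not the tool's actual SRGS output): allow optional filler nonterminals, backed by the shared word n-gram, around the slot-bearing phrases of each rigid rule, so off-grammar carrier words are absorbed instead of causing a recognition failure.

    FILLER = "<ngram_filler>"   # hypothetical nonterminal backed by the shared word n-gram

    def make_robust_rule(rigid_tokens, slot_names):
        """Turn a rigid CFG rule body into a robust one by making non-slot words
        optional and allowing filler between tokens (illustrative transformation).
        Square brackets here denote optional items (an assumed notation)."""
        robust = [f"[{FILLER}]"]                    # optional leading carrier words
        for tok in rigid_tokens:
            robust.append(tok if tok in slot_names else f"[{tok}]")
            robust.append(f"[{FILLER}]")            # filler allowed between tokens
        return " ".join(robust)

    # Toy usage: "flights from <city> to <city>" becomes tolerant of extra words
    # and of dropped carrier words such as "flights" or "from".
    print(make_robust_rule(["flights", "from", "<city>", "to", "<city>"],
                           slot_names={"<city>"}))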

30 citations


01 Jan 2006
TL;DR: In this article, a structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation, where the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in absence of acoustic data.
Abstract: A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information of resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite-impulse response (FIR) filter is applied to the segmental target sequence as the filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurement results in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N = 2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.

23 citations


Journal ArticleDOI
Dong Yu1, Li Deng1, Alex Acero1
TL;DR: Improved likelihood score computation in the HTM and a novel A∗-based time-asynchronous lattice-constrained decoding algorithm for HTM evaluation are described, and the new search algorithm is shown to improve recognition accuracy on recognition lattices over the traditional N-best rescoring paradigm.

20 citations


Patent
14 Mar 2006
TL;DR: In this paper, a method of forming a shareable filler model (a shareable model for garbage words) from a word n-gram model is presented: the n-gram model is converted into a probabilistic context-free grammar (PCFG), which is then modified into a substantially application-independent PCFG.
Abstract: A method of forming a shareable filler model (shareable model for garbage words) from a word n-gram model is provided. The word n-gram model is converted into a probabilistic context free grammar (PCFG). The PCFG is modified into a substantially application-independent PCFG, which constitutes the shareable filler model.
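
The n-gram-to-PCFG conversion can be sketched as follows (a minimal sketch assuming a bigram model stored as a nested dictionary; the rule notation and symbol names are hypothetical): every word history becomes a nonterminal, and each bigram probability becomes a weighted right-linear rule that emits the next word and moves to that word's nonterminal.

    def bigram_to_pcfg(bigram_probs, end_token="</s>"):
        """Convert a bigram model P(next | prev) into weighted CFG productions.

        bigram_probs: {prev_word: {next_word: prob}}
        Returns (lhs, rhs, prob) productions of the form N_prev -> next N_next,
        or N_prev -> <eps> when the word sequence ends.
        """
        rules = []
        for prev, successors in bigram_probs.items():
            lhs = f"N_{prev}"
            for nxt, p in successors.items():
                if nxt == end_token:
                    rules.append((lhs, [], p))                 # terminate the filler
                else:
                    rules.append((lhs, [nxt, f"N_{nxt}"], p))  # emit a word, continue
        return rules

    # Toy filler model over carrier words.
    bigrams = {"<s>": {"please": 0.6, "uh": 0.4},
               "please": {"</s>": 1.0},
               "uh": {"please": 0.5, "</s>": 0.5}}
    for lhs, rhs, p in bigram_to_pcfg(bigrams):
        print(f"{lhs} -> {' '.join(rhs) or '<eps>'}   [{p}]")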

16 citations


Patent
13 Dec 2006
TL;DR: In this article, a computer-implemented method for improving the accuracy of a directory assistance system is described, which includes constructing a prefix tree based on a collection of alphabetically organized words.
Abstract: A computer-implemented method is disclosed for improving the accuracy of a directory assistance system. The method includes constructing a prefix tree based on a collection of alphabetically organized words. The prefix tree is utilized as a basis for generating splitting rules for a compound word included in an index associated with the directory assistance system. A language model check and a pronunciation check are conducted in order to determine which of the generated splitting rules are most likely correct. The compound word is split into word components based on the most likely correct rule or rules. The word components are incorporated into a data set associated with the directory assistance system, such as a recognition grammar and/or the index.
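
A minimal sketch of the prefix-tree step (the trie layout and splitting recursion are generic illustrations; the language-model and pronunciation checks that rank the candidate splits are not shown): known words are inserted into a trie, and a compound word is split at every position where a prefix matches a complete word and the remainder can itself be split.

    def build_trie(words):
        """Build a prefix tree; the key '$' marks the end of a valid word."""
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True
        return root

    def split_compound(compound, trie):
        """Return every way of splitting `compound` into known words."""
        if not compound:
            return [[]]
        splits, node = [], trie
        for i, ch in enumerate(compound):
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:                               # the prefix is a complete word
                for rest in split_compound(compound[i + 1:], trie):
                    splits.append([compound[:i + 1]] + rest)
        return splits

    trie = build_trie(["auto", "haus", "autohaus", "bahn"])
    print(split_compound("autobahn", trie))   # [['auto', 'bahn']]
    print(split_compound("autohaus", trie))   # [['auto', 'haus'], ['autohaus']]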

13 citations


Patent
17 Feb 2006
TL;DR: In this paper, a time-synchronous lattice-constrained search algorithm is developed and used to process a linguistic model of speech that has a long contextual-span capability.
Abstract: A time-synchronous lattice-constrained search algorithm is developed and used to process a linguistic model of speech that has a long-contextual-span capability. In the algorithm, hypotheses are represented as traces that include an indication of a current frame, previous frames and future frames. Each frame can include an associated linguistic unit such as a phone or units that are derived from a phone. Additionally, pruning strategies can be applied to speed up the search. Further, word-ending recombination methods are developed to speed up the computation. These methods can effectively deal with an exponentially increased search space.
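
The trace idea can be pictured roughly as follows (the data-structure fields and the pruning rule are illustrative guesses at the general scheme, not the patent's exact representation): each hypothesis records the units already decoded, the unit at the current frame, and the units assumed for upcoming frames, so a long-contextual-span model can score the current frame in context; beam pruning then keeps only the best traces per frame.

    from dataclasses import dataclass, field
    import heapq

    @dataclass(order=True)
    class Trace:
        score: float                                         # accumulated log score
        current_unit: str = field(compare=False)             # unit at the current frame
        history: tuple = field(compare=False, default=())    # units in previous frames
        lookahead: tuple = field(compare=False, default=())  # units assumed for future frames

    def prune(traces, beam_size=3):
        """Keep the best-scoring traces for the next frame (simple beam pruning)."""
        return heapq.nlargest(beam_size, traces)

    frame_traces = [
        Trace(-12.3, "ih", ("s",), ("t",)),
        Trace(-15.9, "iy", ("s",), ("t",)),
        Trace(-21.4, "eh", ("s",), ("t",)),
        Trace(-25.0, "ah", ("s",), ("t",)),
    ]
    for t in prune(frame_traces):
        print(t.score, t.history, t.current_unit, t.lookahead)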

9 citations


Patent
19 Dec 2006
TL;DR: In this paper, a statistical language model is trained for use in a directory assistance system using the data in the directory assistance listing corpus, which is used to determine how important words in the corpus are in distinguishing a listing from other listings and how likely words are to be omitted or added by a user.
Abstract: A statistical language model is trained for use in a directory assistance system using the data in a directory assistance listing corpus. Calculations are made to determine how important words in the corpus are in distinguishing a listing from other listings, and how likely words are to be omitted or added by a user. The language model is trained using these calculations.
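
A rough sketch of the word-importance calculation (the inverse-listing-frequency weighting and the omission estimate below are common heuristics used as stand-in assumptions, not the patent's formulas): words that occur in few listings get high weight because they distinguish a listing, while words common across listings are treated as more likely to be dropped by a caller.

    import math
    from collections import Counter

    listings = [
        "joe's pizza and pasta",
        "main street pizza",
        "city medical center",
    ]

    # How distinctive is each word? Fewer listings containing it means higher weight
    # (an IDF-style heuristic, assumed here for illustration).
    doc_freq = Counter(w for listing in listings for w in set(listing.split()))
    num_listings = len(listings)
    importance = {w: math.log(num_listings / df) for w, df in doc_freq.items()}

    # How likely is a word to be omitted by a caller? Here common words are assumed
    # more droppable (a toy stand-in for statistics estimated from call logs).
    omission_prob = {w: min(0.9, df / num_listings) for w, df in doc_freq.items()}

    for w in sorted(importance, key=importance.get, reverse=True):
        print(f"{w:10s} importance={importance[w]:.2f} p(omit)={omission_prob[w]:.2f}")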

7 citations


Patent
17 Feb 2006
TL;DR: In this article, the parameters for distributions of a hidden trajectory model including means and variances are estimated using an acoustic likelihood function for observation vectors as an objection function for optimization, which includes only acoustic data and not any intermediate estimate on hidden dynamic variables.
Abstract: Parameters for distributions of a hidden trajectory model, including means and variances, are estimated using an acoustic likelihood function for observation vectors as the objective function for optimization. The estimation uses only acoustic data and not any intermediate estimate of hidden dynamic variables. Gradient ascent methods can be developed for optimizing the acoustic likelihood function.
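
The optimization step can be illustrated generically (the toy diagonal-Gaussian likelihood and the learning rate are placeholders; the actual HTM likelihood involves the filtered trajectories described in the papers above): the means and variances are moved along the gradient of the acoustic log-likelihood of the observation vectors, with no intermediate hidden-variable estimates required.

    import numpy as np

    def log_likelihood(obs, mean, var):
        """Log-likelihood of observations under a diagonal Gaussian (toy stand-in
        for the HTM's acoustic likelihood of cepstral observation vectors)."""
        return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (obs - mean) ** 2 / var)))

    def gradient_ascent(obs, mean, var, lr=0.05, steps=200):
        for _ in range(steps):
            grad_mean = np.sum((obs - mean) / var, axis=0)
            grad_var = np.sum(0.5 * ((obs - mean) ** 2 / var ** 2 - 1.0 / var), axis=0)
            mean = mean + lr * grad_mean / len(obs)
            var = np.maximum(var + lr * grad_var / len(obs), 1e-4)   # keep variances positive
        return mean, var

    rng = np.random.default_rng(0)
    obs = rng.normal(loc=3.0, scale=2.0, size=(500, 1))
    mean, var = gradient_ascent(obs, mean=np.zeros(1), var=np.ones(1))
    print(mean, var)   # should approach the sample mean and variance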

Proceedings Article
01 Sep 2006
TL;DR: A novel, effective, and efficient utterance verification (UV) technology for access control in the interactive voice response (IVR) systems is proposed by using the secret answer to a question and a word N-gram based filler model to construct a context-free grammar.
Abstract: In this paper we propose a novel, effective, and efficient utterance verification (UV) technology for access control in interactive voice response (IVR) systems. The key to our approach is to construct a context-free grammar by using the secret answer to a question and a word N-gram based filler model. The N-gram filler provides rich alternatives to the secret answer and can potentially improve the accuracy of the UV task. It can also absorb carrier words used by callers and thus can improve robustness. We also propose using a predictor based on the best alternative to calculate the confidence. We show detailed experimental results on a tough UV test set that contains 930 positive and 930 negative cases and discuss the types of questions that are suitable for the UV task. We demonstrate that our approach can achieve a 2.14% equal error rate (EER) on average and a 0.8% false accept rate when the false reject rate is 2.6% or above. This is a 49% EER reduction compared with approaches using acoustic fillers, and a 72% EER reduction compared with the posterior probability based confidence measure. Index Terms: utterance verification, filler model, word spotting, confidence measure
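
The decision rule can be sketched as follows (the scores, the sigmoid confidence predictor, and the threshold are placeholders; the paper's grammar combines the secret answer with the word N-gram filler inside the recognizer itself): after recognition, a confidence is derived from the gap between the score of the path through the secret answer and the score of the best competing filler-only alternative, and the caller is accepted only if that confidence clears a threshold.

    import math

    def verify_utterance(answer_score, best_filler_score, threshold=0.7):
        """Accept or reject a spoken answer based on the score gap between the
        answer path and the best filler alternative (illustrative predictor)."""
        confidence = 1.0 / (1.0 + math.exp(-(answer_score - best_filler_score)))
        return confidence >= threshold, confidence

    # Toy usage: log-scores from a recognizer run against the CFG containing the
    # secret answer and against the N-gram filler alternatives.
    accepted, conf = verify_utterance(answer_score=-42.0, best_filler_score=-45.5)
    print(accepted, round(conf, 3))   # True only if the answer path clearly wins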

Proceedings Article
19 Sep 2006
TL;DR: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented.
Abstract: A novel time-synchronous decoder, designed specifically for a Hidden Trajectory Model (HTM) whose likelihood score computation depends on long-span phonetic contexts, is presented. The HTM is a recently developed acoustic model aimed at capturing the underlying dynamic structure of speech coarticulation and reduction using a compact set of parameters. The long-span nature of the HTM has posed a great technical challenge for developing efficient search algorithms for full evaluation of the model. Taking on this challenge, the decoding algorithm is developed to deal effectively with the exponentially increased search space through HTM-specific techniques for hypothesis representation, word-ending recombination, and hypothesis pruning. Experimental results obtained on the TIMIT phonetic recognition task are reported, extending our earlier HTM evaluation paradigms based on N-best and A* lattice rescoring. Index Terms: Hidden Trajectory Model, time-synchronous decoding, trace-based hypothesis, TIMIT