Showing papers by "Dong Yu" published in 2007

Patent•DOI•

Speech-centric multimodal user interface design in mobile technology

Dong Yu¹, Li Deng¹•Institutions (1)

15 Mar 2007

TL;DR: In this article, a multi-modal human computer interface (HCI) receives a plurality of available information inputs concurrently, or serially, and employs a subset of the inputs to determine or infer user intent with respect to a communication or information goal.

Abstract: A multi-modal human computer interface (HCI) receives a plurality of available information inputs concurrently, or serially, and employs a subset of the inputs to determine or infer user intent with respect to a communication or information goal. Received inputs are respectively parsed, and the parsed inputs are analyzed and optionally synthesized with respect to one or more of each other. In the event sufficient information is not available to determine user intent or goal, feedback can be provided to the user in order to facilitate clarifying, confirming, or augmenting the information inputs.

297 citations

Proceedings Article•DOI•

High-performance HMM adaptation with joint compensation of additive and convolutive distortions via Vector Taylor Series

Jinyu Li¹, Li Deng¹, Dong Yu¹, Yifan Gong¹, Alejandro Acero¹•Institutions (1)

Microsoft¹

01 Apr 2007

TL;DR: A model-domain environment-robust adaptation algorithm is presented that demonstrates high performance on the standard Aurora 2 speech recognition task; adaptation of the dynamic portion of the HMM mean and variance parameters is shown to be critical to its success.

Abstract: In this paper, we present our recent development of a model-domain environment-robust adaptation algorithm, which demonstrates high performance in the standard Aurora 2 speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment distortion model in the cepstral domain, the speech recognizer's "feedback" information, and the vector-Taylor-series (VTS) linearization technique collectively. Second, the estimated noise and channel parameters are used to adapt the static and dynamic portions of the HMM means and variances. This two-step algorithm enables joint compensation of both additive and convolutive distortions (JAC). In the experimental evaluation using the standard Aurora 2 task, the proposed JAC/VTS algorithm achieves 91.11% accuracy using the clean-trained simple HMM backend as the baseline system for the model adaptation. This represents high recognition performance on this task without discriminative training of the HMM system. Detailed analysis on the experimental results shows that adaptation of the dynamic portion of the HMM mean and variance parameters is critical to the success of our algorithm.
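The second step of the algorithm (adapting static and dynamic HMM parameters from estimated noise and channel parameters) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes log-spectral features (so no DCT matrix is needed), diagonal covariances, a zero-mean delta noise, and the commonly used first-order VTS mismatch function y = x + h + log(1 + exp(n - x - h)); none of these details are stated in the abstract. The function name `vts_adapt` and its parameters are hypothetical, and the noise and channel estimation of step one is not shown.

```python
import math


def vts_adapt(mu_x, var_x, mu_dx, var_dx, mu_n, var_n, var_dn, mu_h):
    """Adapt one Gaussian's static and delta parameters, one dimension at a time.

    Assumptions (not from the abstract): log-spectral features, diagonal
    covariances, zero-mean delta noise, first-order VTS expansion.
    """
    out = {"mu_y": [], "var_y": [], "mu_dy": [], "var_dy": []}
    for d in range(len(mu_x)):
        clean = mu_x[d] + mu_h[d]          # channel shifts the clean mean
        gap = mu_n[d] - clean              # noise level relative to speech
        # Mismatch function evaluated at the expansion point
        out["mu_y"].append(clean + math.log1p(math.exp(gap)))
        # Jacobian dy/dx at the expansion point; (1 - g) is dy/dn
        g = 1.0 / (1.0 + math.exp(gap))
        out["var_y"].append(g * g * var_x[d] + (1 - g) ** 2 * var_n[d])
        # Dynamic (delta) parameters are scaled by the same Jacobian
        out["mu_dy"].append(g * mu_dx[d])
        out["var_dy"].append(g * g * var_dx[d] + (1 - g) ** 2 * var_dn[d])
    return out


# Speech and noise at the same level: the Jacobian is 0.5 in this dimension.
adapted = vts_adapt(
    mu_x=[10.0], var_x=[1.0], mu_dx=[2.0], var_dx=[0.5],
    mu_n=[10.0], var_n=[4.0], var_dn=[1.0], mu_h=[0.0],
)
```

When the noise is far below the speech level the Jacobian approaches 1 and the adapted Gaussian reduces to the channel-shifted clean one, which is the behaviour one would expect from a compensation scheme of this kind.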

136 citations

Proceedings Article•DOI•

Automated Directory Assistance System - from Theory to Practice

Dong Yu¹, Yun-Cheng Ju¹, Ye-Yi Wang¹, Geoffrey Zweig¹, Alex Acero¹•Institutions (1)

Microsoft¹

27 Aug 2007

TL;DR: It is shown that many theoretical and practical issues need to be resolved when applying the basic idea of voice search to the development of ADAS, and the experiences in addressing these issues are shared, especially in pre-processing the listing database, generating a high performance LM, and developing efficient, accurate, and robust search algorithms.

Abstract: The automated directory assistance system (ADAS) is traditionally formulated as an automatic speech recognition (ASR) problem. Recently, it has been formulated as a voice search problem, where a spoken utterance is first converted into text, which in turn is used to search for the listing. In this paper, we focus on the design and development of the utterance-to-listing component of ADAS. We show that many theoretical and practical issues need to be resolved when applying the basic idea of voice search to the development of ADAS. We share our experiences in addressing these issues, especially in pre-processing the listing database, generating a high-performance LM, and developing efficient, accurate, and robust search algorithms. Field tests of our prototype system indicate that an 81% task completion rate can be achieved.

Index Terms: speech recognition, directory assistance, voice search, TFIDF, spoken dialog system, vector space model

1. Introduction: An automated directory assistance system (ADAS) [1] [2] [3] [5] [6] is a spoken dialog system that provides the caller with the phone number and/or address of the business or residential listing he/she requests. It is a very complicated system that involves automatic speech recognition (ASR), listing lookup, disambiguation, and dialog design. The core element of the ADAS is the utterance-to-listing (U2L) component that maps an utterance K
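The voice search formulation (recognized text used as a query against the listing database) can be sketched with a plain TF-IDF vector space model, which the index terms mention. This is an illustrative sketch only, not the system described in the paper: the whitespace tokenizer, the smoothed IDF formula, and the function names `build_tfidf_index`, `cosine`, and `search` are all assumptions, and the example listings are made up.

```python
import math
from collections import Counter


def build_tfidf_index(listings):
    """Return (idf, vectors): one sparse TF-IDF vector per listing name."""
    docs = [Counter(name.lower().split()) for name in listings]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption; many variants exist)
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, listings, idf, vectors):
    """Rank every listing against the recognized text, best match first."""
    counts = Counter(query.lower().split())
    # Query terms never seen in the database carry no weight
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), name) for vec, name in zip(vectors, listings)]
    return sorted(scored, reverse=True)


listings = ["Kung Ho Chinese Restaurant", "Kung Fu Studio", "Seattle Pizza House"]
idf, vectors = build_tfidf_index(listings)
results = search("kung ho restaurant", listings, idf, vectors)
```

Because the match is computed over term weights rather than exact strings, a caller who omits or reorders words in a listing name can still retrieve it, which is one motivation for treating directory assistance as a search problem.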

67 citations

Proceedings Article•DOI•

Large-Margin Minimum Classification Error Training for Large-Scale Speech Recognition Tasks

Dong Yu¹, Li Deng¹, Xiaodong He¹, Alejandro Acero¹•Institutions (1)

Microsoft¹

15 Apr 2007

TL;DR: This work successfully applies the LM-MCE training approach to the Microsoft internal large-vocabulary telephony speech recognition task and achieves significant recognition accuracy improvement across the board.

Abstract: Recently, we have developed a novel discriminative training method named large-margin minimum classification error (LM-MCE) training that incorporates the idea of discriminative margin into the conventional minimum classification error (MCE) training method. In our previous work, this novel approach was formulated specifically for the MCE training using the sigmoid loss function and its effectiveness was demonstrated on the TIDIGITS task alone. In this paper, two additional contributions are made. First, we formulate LM-MCE as a Bayes risk minimization problem whose loss function not only includes empirical error rates but also a margin-bound risk. This new formulation allows us to extend the same technique to a wide variety of MCE-based training. Second, we have successfully applied the LM-MCE training approach to the Microsoft internal large vocabulary telephony speech recognition task (with 2000 hours of training data and a 120K-word vocabulary) and achieved significant recognition accuracy improvement across the board. To the best of our knowledge, this is the first time that the large-margin approach has been demonstrated to be successful in large-scale speech recognition tasks.

62 citations

Proceedings Article•DOI•

Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition

Li Deng¹, Dong Yu¹•Institutions (1)

Microsoft¹

15 Apr 2007

TL;DR: The earlier version of the hidden trajectory model (HTM) for speech dynamics, which predicts the "static" cepstra as the observed acoustic feature, is generalized to one that jointly predicts static and delta cepstra, enabling efficient computation of the joint likelihood of both static and delta cepstral sequences given the model.

Abstract: The earlier version of the hidden trajectory model (HTM) for speech dynamics which predicts the "static" cepstra as the observed acoustic feature is generalized to one which predicts joint static cepstra and their temporal differentials (i.e., delta cepstra). The formulation of this generalized HTM is presented in the generative-modeling framework, enabling efficient computation of the joint likelihood for both static and delta cepstral sequences as the acoustic features given the model. The parameter estimation techniques for the new model are developed and presented, giving closed-form estimation formulas after the use of vector Taylor series approximation. We show principled generalization from the earlier static-cepstra HTM to the new static/delta-cepstra HTM not only in terms of model formulations but also in terms of their respective analytical forms in (monophone) parameter estimation. Experimental results on the standard TIMIT phonetic recognition task demonstrate recognition accuracy improvement over the earlier best HTM system, both significantly better than state-of-the-art triphone HMM systems.
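To make the notion of joint static and delta cepstral features concrete, the sketch below computes delta cepstra with the regression formula commonly used in speech front ends and appends them to the static frames. The abstract only says the deltas are temporal differentials of the static cepstra, so the regression window, the edge handling (repeating the first and last frame), and the names `delta` and `joint_features` are assumptions rather than the paper's definition.

```python
def delta(frames, window=2):
    """Regression-based delta features for a list of cepstral frames.

    frames: list of equal-length lists of floats (one list per time frame).
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[0])):
            num = sum(
                k * (frames[min(t + k, n - 1)][d] - frames[max(t - k, 0)][d])
                for k in range(1, window + 1)
            )
            row.append(num / denom)
        deltas.append(row)
    return deltas


def joint_features(frames, window=2):
    """Concatenate each static frame with its delta frame."""
    return [static + dyn for static, dyn in zip(frames, delta(frames, window))]


# A one-dimensional linear ramp has a constant slope of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
features = joint_features(ramp)
```

A model that predicts these concatenated vectors has to account for how the static trajectory changes over time, not just where it is at each frame.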

53 citations

Proceedings Article•DOI•

VoicePedia: Towards Speech-based Access to Unstructured Information

Jahanzeb Sherwani¹, Dong Yu², Tim Paek², Mary Czerwinski², Yun-Cheng Ju², Alex Acero²•Institutions (2)

Carnegie Mellon University¹, Microsoft²

01 Aug 2007

TL;DR: VoicePedia, a telephone-based dialog system for searching and browsing Wikipedia, is developed, and a user study comparing VoicePedia to SmartPedia, a Smartphone GUI-based alternative, is presented.

Abstract: Currently there are no dialog systems that enable purely voice-based access to the unstructured information on websites such as Wikipedia. Such systems could be revolutionary for non-literate users in the developing world. To investigate interface issues in such a system, we developed VoicePedia, a telephone-based dialog system for searching and browsing Wikipedia. In this paper, we present the system, as well as a user study comparing the use of VoicePedia to SmartPedia, a Smartphone GUI-based alternative. Keyword entry through the voice interface was significantly faster, while search result navigation and page browsing were significantly slower. Although users preferred the GUI-based interface, task success rates between both systems were comparable – a promising result for regions where Smartphones and data plans are not viable.

Index Terms: dialog system, information access

41 citations

Patent•

High performance HMM adaptation with joint compensation of additive and convolutive distortions

Dong Yu¹, Li Deng¹, Alejandro Acero¹, Yifan Gong¹, Jinyu Li¹•Institutions (1)

Microsoft¹

03 Dec 2007

TL;DR: In this article, a method of compensating for additive and convolutive distortions applied to a signal indicative of an utterance is discussed, which includes receiving a signal and initializing noise mean and channel mean vectors.

Abstract: A method of compensating for additive and convolutive distortions applied to a signal indicative of an utterance is discussed. The method includes receiving a signal and initializing noise mean and channel mean vectors. Gaussian dependent matrix and Hidden Markov Model (HMM) parameters are calculated or updated to account for additive noise from the noise mean vector or convolutive distortion from the channel mean vector. The HMM parameters are adapted by decoding the utterance using the previously calculated HMM parameters and adjusting the Gaussian dependent matrix and the HMM parameters based upon data received during the decoding. The adapted HMM parameters are applied to decode the input utterance and provide a transcription of the utterance.

31 citations

Patent•

Confidence measure generation for speech related searching

Ye-Yi Wang¹, Yun-Cheng Ju¹, Dong Yu¹•Institutions (1)

Microsoft¹

03 Aug 2007

TL;DR: In this paper, a confidence measure generator was proposed to calculate an overall confidence measure for voice search results based upon the features received from the speech recognizer, search component, and dialog manager.

Abstract: A voice search system has a speech recognizer, a search component, and a dialog manager. A confidence measure generator receives speech recognition features from the speech recognizer, search features from the search component, and dialog features from the dialog manager, and calculates an overall confidence measure for voice search results based upon the features received. The invention can be extended to include the generation of additional features, based on those received from the individual components of the voice search system.

28 citations

Patent•

Searching a database of listings

Ye-Yi Wang¹, Dong Yu¹, Yun-Cheng Ju¹, Alejandro Acero¹, Geoffrey Zweig¹•Institutions (1)

Microsoft¹

10 May 2007

TL;DR: In this article, a database having listings rather than long documents is searched using a term frequency-inverse document frequency (Tf/Idf) algorithm.

Abstract: A database having listings rather than long documents is searched using a term frequency-inverse document frequency (Tf/Idf) algorithm.

26 citations

Proceedings Article•DOI•

Voice-Rate: A Dialog System for Consumer Ratings

Geoffrey Zweig¹, Yun-Cheng Ju¹, Patrick Nguyen¹, Dong Yu¹, Ye-Yi Wang¹, Alex Acero¹•Institutions (1)

Microsoft¹

23 Apr 2007

TL;DR: Voice-Rate is an automated dialog system that provides access to over one million ratings of products and businesses and can be accessed by dialing 1-877-456-DATA.

Abstract: Voice-Rate is an automated dialog system which provides access to over one million ratings of products and businesses. By calling a toll-free number, consumers can access ratings for products, national businesses such as airlines, and local businesses such as restaurants. Voice-Rate also has a facility for recording and analyzing ratings that are given over the phone. The service has been primed with ratings taken from a variety of web sources, and we are augmenting these with user ratings. Voice-Rate can be accessed by dialing 1-877-456-DATA.

19 citations

Patent•

Generic framework for large-margin MCE training in speech recognition

Dong Yu¹, Alejandro Acero¹, Li Deng¹, Xiaodong He¹•Institutions (1)

Microsoft¹

20 Feb 2007

TL;DR: In this article, a method and apparatus for training an acoustic model are disclosed, where a training corpus is accessed and converted into an initial acoustic model, and scores are calculated for a correct class and competitive classes, respectively, for each token given the initial model.

Abstract: A method and apparatus for training an acoustic model are disclosed. A training corpus is accessed and converted into an initial acoustic model. Scores are calculated for a correct class and competitive classes, respectively, for each token given the initial acoustic model. Also, a sample-adaptive window bandwidth is calculated for each training token. From the calculated scores and the sample-adaptive window bandwidth values, loss values are calculated based on a loss function. The loss function, which may be derived from a Bayesian risk minimization viewpoint, can include a margin value that moves a decision boundary such that token-to-boundary distances for correct tokens that are near the decision boundary are maximized. The margin can either be a fixed margin or can vary monotonically as a function of algorithm iterations. The acoustic model is updated based on the calculated loss values. This process can be repeated until an empirical convergence is met.
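The loss computation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented method: the misclassification measure uses only the best competitor, the loss is a sigmoid whose decision boundary is shifted by the margin, the margin grows linearly with the iteration count up to a cap, and the sample-adaptive window bandwidth is omitted. All function names and default values are hypothetical.

```python
import math


def misclassification_measure(correct_score, competitor_scores):
    """d > 0 means the best competing class outscores the correct class."""
    return max(competitor_scores) - correct_score


def lm_mce_loss(d, margin, alpha=1.0):
    """Sigmoid loss with the decision boundary moved from d = 0 to d = -margin.

    With margin > 0, correctly classified tokens that sit close to the
    boundary still incur a noticeable loss, which pushes them further away.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (d + margin)))


def margin_schedule(iteration, start=0.0, step=0.1, cap=1.0):
    """A margin that increases monotonically with the training iteration."""
    return min(start + step * iteration, cap)


def average_loss(tokens, iteration):
    """tokens: list of (correct_score, [competitor_scores]) pairs."""
    margin = margin_schedule(iteration)
    losses = [
        lm_mce_loss(misclassification_measure(c, comp), margin)
        for c, comp in tokens
    ]
    return sum(losses) / len(losses)


# Two correctly classified tokens: one barely correct, one comfortably correct.
tokens = [(-10.0, [-10.2, -14.0]), (-8.0, [-15.0, -16.0])]
loss_early = average_loss(tokens, iteration=0)
loss_late = average_loss(tokens, iteration=10)
```

In a full training loop the acoustic model parameters would be updated from the gradient of this loss and the process repeated until an empirical convergence criterion is met; the update rule is left out here because the abstract does not specify it.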

Proceedings Article•DOI•

The Voice-Rate dialog system for consumer ratings.

Geoffrey Zweig¹, Patrick Nguyen¹, Yun-Cheng Ju¹, Ye-Yi Wang¹, Dong Yu¹, Alex Acero¹•Institutions (1)

Microsoft¹

27 Aug 2007

TL;DR: Voice-Rate is an automated dialog system that provides access, via a toll-free number, to over one million ratings of products and businesses; it has been primed with ratings taken from a variety of web sources, which are being augmented with user ratings.

Abstract: Voice-Rate is an experimental dialog system that makes product and business ratings available to consumers via a toll-free phone number. By calling Voice-Rate, users can access the ratings of more than one million products, a quarter million local businesses (restaurants), and three thousand national businesses. This paper describes the Voice-Rate system, and solutions to three key technical challenges: robust name-matching, efficient disambiguation, and review synthesis for telephone playback. Voice-Rate can be accessed by calling 1-877-456-DATA (toll-free) within the U.S.

Proceedings Article•DOI•

Confidence Measures for Voice Search Applications

Ye-Yi Wang¹, Dong Yu¹, Yun-Cheng Ju¹, Geoffrey Zweig¹, Alex Acero¹•Institutions (1)

Microsoft¹

27 Aug 2007

TL;DR: This paper reviews voice search technology, and proposes a new and effective method for computing semantic confidence measures that explores the use of maximum entropy classifiers as confidence models, and investigates a feature selection algorithm that leads to an effective subset of prominent features for the classifier.

Abstract: Voice search is the technology underlying many spoken dialog applications that enable users to access information using spoken queries. This paper reviews voice search technology, and proposes a new and effective method for computing semantic confidence measures. It explores the use of maximum entropy classifiers as confidence models, and investigates a feature selection algorithm that leads to an effective subset of prominent features for the classifier. The experimental results on a directory assistance application show that the reduced feature set not only makes the model more effective in handling different recognition and search engine combinations, but also results in a very informative confidence measure that is closely correlated with the actual voice search accuracy.

Index Terms: voice search, directory assistance, confidence measure, Tf-Idf vector space model, maximum entropy model
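A two-class maximum entropy classifier is equivalent to logistic regression, so the idea of a confidence model can be sketched as a weighted combination of features passed through a sigmoid. The sketch below is illustrative only: the two features (a recognizer score and a search score), the toy training data, the stochastic gradient training loop, and the names `confidence` and `train` are assumptions, and the paper's feature selection algorithm is not shown.

```python
import math


def confidence(weights, bias, features):
    """Probability that the voice search result is correct, given its features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, learning_rate=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - confidence(weights, bias, features)
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical features per query: [recognizer score, search score], both in [0, 1].
samples = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.3], [0.3, 0.1]]
labels = [1, 1, 0, 0]  # 1 = the returned listing was the one requested
weights, bias = train(samples, labels)
high = confidence(weights, bias, [0.9, 0.9])
low = confidence(weights, bias, [0.1, 0.1])
```

A dialog manager could compare such a score against thresholds to decide whether to accept a result, ask for confirmation, or re-prompt the caller.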

Proceedings Article•DOI•

A Discriminative Training Framework using N-Best Speech Recognition Transcriptions and Scores for Spoken Utterance Classification

Sibel Yaman¹, Li Deng², Dong Yu², Ye-Yi Wang², Alejandro Acero²•Institutions (2)

Georgia Institute of Technology¹, Microsoft²

15 Apr 2007

TL;DR: In this paper, a novel discriminative training approach to spoken utterance classification is proposed, in which the classification error rate is approximated as differentiable functions of the language and classifier model parameters.

Abstract: In this paper, we propose a novel discriminative training approach to spoken utterance classification (SUC). The ultimate objective of the SUC task, originally developed to map a spoken speech utterance into the most appropriate semantic class, is to minimize the classification error rate (CER). Conventionally, a two-phase approach is adopted, in which the first phase is the ASR transcription phase, and the second phase is the semantic classification phase. In the proposed framework, the classification error rate is approximated as differentiable functions of the language and classifier model parameters. Furthermore, in order to exploit all the available information from the first phase, class-specific discriminant functions are defined based on score functions derived from the N-best lists. Our experimental results on the standard ATIS database indicate a notable reduction in CER from the earlier best result on the identical task. The proposed framework achieved a reduction of CER from 4.92% to 4.04%.

Patent•DOI•

Hidden trajectory modeling with differential cepstra for speech recognition

Li Deng¹, Dong Yu¹•Institutions (1)

Microsoft¹

19 Jan 2007-Journal of the Acoustical Society of America

TL;DR: In this paper, a system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition.

Abstract: A novel system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition. According to one illustrative embodiment, an automatic speech recognition method includes receiving a speech input, generating an interpretation of the speech, and providing an output based at least in part on the interpretation of the speech input. The interpretation of the speech uses hidden trajectory modeling with observation vectors that are based on cepstra and on differential cepstra derived from the cepstra. A method is developed that can automatically train the hidden trajectory model's parameters that correspond to the components of the differential cepstra in the full acoustic feature vectors.

Patent•

Indexing and ranking processes for directory assistance services

Dong Yu¹, Alejandro Acero¹, Yun-Cheng Ju¹, Ye-Yi Wang¹•Institutions (1)

Microsoft¹

12 Jan 2007

TL;DR: A computer-implemented method is disclosed for providing a directory assistance service, which includes generating an indexing file that is a representation of information associated with a collection of listings stored in an index.

Abstract: A computer-implemented method is disclosed for providing a directory assistance service. The method includes generating an indexing file that is a representation of information associated with a collection of listings stored in an index. The indexing file is utilized as a basis for ranking listings in an index based on the strength of association with a query. Based at least in part on the ranking, an output is provided and is indicative of listings in the index that are likely to correspond to the query. At least one particular listing in the index is excluded from the output without there ever being a comparison of features in the query with features in the one particular listing.
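One common way to rank listings while never comparing the query against most of them is an inverted index, sketched below. The abstract does not say what form the indexing file takes, so the inverted index, the shared-term count used as the strength of association, and the names `build_index` and `rank` are assumptions; the example listings are made up.

```python
from collections import Counter, defaultdict


def build_index(listings):
    """Map each term to the set of listing ids whose name contains it."""
    index = defaultdict(set)
    for listing_id, name in enumerate(listings):
        for term in set(name.lower().split()):
            index[term].add(listing_id)
    return index


def rank(query, index, listings, top_k=3):
    """Rank listings by the number of distinct query terms they share.

    Only listings reachable through the index are ever touched; a listing
    that shares no term with the query is excluded without being examined.
    """
    hits = Counter()
    for term in set(query.lower().split()):
        for listing_id in index.get(term, ()):
            hits[listing_id] += 1
    return [listings[i] for i, _ in hits.most_common(top_k)]


listings = ["Joe's Pizza", "Joe's Auto Repair", "City Library"]
index = build_index(listings)
results = rank("joe's pizza", index, listings)
```

Here "City Library" never enters the candidate set, so the cost of a query depends on how many listings share its terms rather than on the size of the whole database.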

Proceedings Article•

Commute UX: Telephone Dialog System for Location-based Services.

Ivan Tashev¹, Michael L. Seltzer¹, Yun-Cheng Ju¹, Dong Yu¹, Alex Acero¹•Institutions (1)

Microsoft¹

01 Sep 2007

TL;DR: The strategies employed by the system were evaluated through user studies, and a system employing the best strategies was deployed and evaluated through an analysis of 700 calls over a two-month period.

Abstract: In this paper, we describe a telephone dialog system for location-based services. In such systems, the effectiveness with which both the user can input location information to the system and the system delivers location information to the user is critical. We describe strategies for both of these issues in the context of a dialog system for real-time information about traffic, gas prices, and weather. The strategies employed by our system were evaluated through user studies and a system employing the best strategies was deployed. The system is evaluated through an analysis of 700 calls over a two-month period.

Proceedings Article•DOI•

Handling phonetic context and speaker variation in a structure-based speech recognizer.

Dong Yu¹, Li Deng¹, Alex Acero¹•Institutions (1)

Microsoft¹

27 Aug 2007

TL;DR: A structure-based recognizer is described in which phonetic context provides structured context dependency for speech acoustics without the context-dependent parameters required by HMMs, together with an effective target-value normalization algorithm that can be applied to both training and unknown test speakers.

Abstract: Recently we have developed a novel type of structure-based speech recognizer, which uses a parameterized, non-recursive hidden trajectory model (HTM) of vocal tract resonances (VTR) or formants to capture the dynamic structure of long-range speech coarticulation and reduction. The underlying model of this recognizer carries out bi-directional FIR filtering on the piecewise constant sequences of the VTR targets. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the movement direction and thus the formation of the VTR trajectories. This provides structured context dependency for speech acoustics without using context-dependent parameters as required by HMMs. Second, VTR targets as the key context-independent parameters of the model vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training and unknown test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition. We also provide computational analysis on the HTM-based speech decoder.
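The bi-directional FIR filtering of piecewise-constant target sequences can be illustrated with a symmetric, normalized filter whose taps decay exponentially away from the current frame. This is a simplified sketch rather than the recognizer's actual model: it uses a single decay factor for all frames, repeats the edge frames at the boundaries, and handles one resonance at a time. The name `fir_smooth`, the parameters `gamma` and `half_width`, and the target values are hypothetical.

```python
def fir_smooth(targets, gamma=0.6, half_width=3):
    """Filter a piecewise-constant target sequence in both time directions.

    Each output frame is a weighted average of the targets within
    half_width frames on either side, with weight gamma ** |offset|.
    """
    offsets = range(-half_width, half_width + 1)
    taps = [gamma ** abs(k) for k in offsets]
    norm = sum(taps)  # normalizing keeps a constant target unchanged
    n = len(targets)
    trajectory = []
    for t in range(n):
        acc = 0.0
        for k, weight in zip(offsets, taps):
            idx = min(max(t + k, 0), n - 1)  # repeat edge frames
            acc += weight * targets[idx]
        trajectory.append(acc / norm)
    return trajectory


# Two phone segments with different resonance targets (values in Hz, made up).
targets = [500.0] * 5 + [1500.0] * 5
trajectory = fir_smooth(targets)
```

Because the filter looks both backward and forward, the trajectory starts moving toward the next segment's target before the segment boundary, which is how a model of this kind can express anticipatory coarticulation without separate parameters for each phonetic context.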
