
Showing papers by "Dong Yu" published in 2007


PatentDOI
Dong Yu, Deng Li
15 Mar 2007
TL;DR: In this article, a multi-modal human computer interface (HCI) receives a plurality of available information inputs concurrently, or serially, and employs a subset of the inputs to determine or infer user intent with respect to a communication or information goal.
Abstract: A multi-modal human computer interface (HCI) receives a plurality of available information inputs concurrently, or serially, and employs a subset of the inputs to determine or infer user intent with respect to a communication or information goal. Received inputs are respectively parsed, and the parsed inputs are analyzed and optionally synthesized with respect to one or more of each other. In the event sufficient information is not available to determine user intent or goal, feedback can be provided to the user in order to facilitate clarifying, confirming, or augmenting the information inputs.

297 citations


Proceedings ArticleDOI
Jinyu Li, Li Deng, Dong Yu, Yifan Gong, Alejandro Acero
01 Apr 2007
TL;DR: A model-domain environment-robust adaptation algorithm that demonstrates high performance in the standard Aurora 2 speech recognition task; adaptation of the dynamic portion of the HMM mean and variance parameters is critical to the success of the algorithm.
Abstract: In this paper, we present our recent development of a model-domain environment-robust adaptation algorithm, which demonstrates high performance in the standard Aurora 2 speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment distortion model in the cepstral domain, the speech recognizer's "feedback" information, and the vector-Taylor-series (VTS) linearization technique collectively. Second, the estimated noise and channel parameters are used to adapt the static and dynamic portions of the HMM means and variances. This two-step algorithm enables joint compensation of both additive and convolutive distortions (JAC). In the experimental evaluation using the standard Aurora 2 task, the proposed JAC/VTS algorithm achieves 91.11% accuracy using the clean-trained simple HMM backend as the baseline system for the model adaptation. This represents high recognition performance on this task without discriminative training of the HMM system. Detailed analysis on the experimental results shows that adaptation of the dynamic portion of the HMM mean and variance parameters is critical to the success of our algorithm.

136 citations
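
The nonlinear distortion model and its vector-Taylor-series linearization mentioned in the abstract above can be illustrated with a minimal scalar sketch. It assumes the commonly used form y = x + h + log(1 + exp(n - x - h)) in the log-spectral domain, that is, with the DCT matrix of the cepstral-domain model taken as the identity. The function names are hypothetical, and this is not the authors' implementation.

```python
import math

def distorted_log_spectrum(x, n, h):
    """Noisy-speech log-spectrum for clean speech x, additive noise n, channel h.

    Assumed scalar model: y = x + h + log(1 + exp(n - x - h)),
    which equals log(exp(x + h) + exp(n)).
    """
    return x + h + math.log1p(math.exp(n - x - h))

def vts_linearize(x0, n0, h0):
    """Return a first-order Taylor approximation of the model around (x0, n0, h0)."""
    y0 = distorted_log_spectrum(x0, n0, h0)
    # Partial derivatives: dy/dx = dy/dh = g and dy/dn = 1 - g.
    g = 1.0 / (1.0 + math.exp(n0 - x0 - h0))

    def approx(x, n, h):
        return y0 + g * (x - x0) + g * (h - h0) + (1.0 - g) * (n - n0)

    return approx

# Usage: linearize around an operating point, then evaluate nearby.
approx = vts_linearize(1.0, 0.0, 0.0)
exact_nearby = distorted_log_spectrum(1.01, 0.0, 0.0)
approx_nearby = approx(1.01, 0.0, 0.0)
```

The linearized form is what makes closed-form updates of Gaussian means and variances tractable, since the adapted parameters become linear functions of the clean-speech, noise, and channel parameters.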


Proceedings ArticleDOI
Dong Yu, Yun-Cheng Ju, Ye-Yi Wang, Geoffrey Zweig, Alex Acero
27 Aug 2007
TL;DR: It is shown that many theoretical and practical issues need to be resolved when applying the basic idea of voice search to the development of ADAS, and the experiences in addressing these issues are shared, especially in pre-processing the listing database, generating a high performance LM, and developing efficient, accurate, and robust search algorithms.
Abstract: The automated directory assistance system (ADAS) is traditionally formulated as an automatic speech recognition (ASR) problem. Recently, it has been formulated as a voice search problem, where a spoken utterance is first converted into text, which in turn is used to search for the listing. In this paper, we focus on the design and development of the utterance-to-listing component of ADAS. We show that many theoretical and practical issues need to be resolved when applying the basic idea of voice search to the development of ADAS. We share our experiences in addressing these issues, especially in pre-processing the listing database, generating a high-performance LM, and developing efficient, accurate, and robust search algorithms. Field tests of our prototype system indicate that an 81% task completion rate can be achieved. Index Terms: speech recognition, directory assistance, voice search, TFIDF, spoken dialog system, vector space model. 1. Introduction: An automated directory assistance system (ADAS) [1] [2] [3] [5] [6] is a spoken dialog system that provides the caller with the phone number and/or address of the business or residential listing he/she requests. It is a very complicated system that involves automatic speech recognition (ASR), listing lookup, disambiguation, and dialog design. The core element of the ADAS is the utterance-to-listing (U2L) component that maps an utterance K

67 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This work has successfully applied the LM-MCE training approach to the Microsoft internal large vocabulary telephony speech recognition task and achieved significant recognition accuracy improvement across the board.
Abstract: Recently, we have developed a novel discriminative training method named large-margin minimum classification error (LM-MCE) training that incorporates the idea of discriminative margin into the conventional minimum classification error (MCE) training method. In our previous work, this novel approach was formulated specifically for the MCE training using the sigmoid loss function and its effectiveness was demonstrated on the TIDIGITS task alone. In this paper two additional contributions are made. First, we formulate LM-MCE as a Bayes risk minimization problem whose loss function not only includes empirical error rates but also a margin-bound risk. This new formulation allows us to extend the same technique to a wide variety of MCE based training. Second, we have successfully applied the LM-MCE training approach to the Microsoft internal large vocabulary telephony speech recognition task (with 2000 hours of training data and a 120K-word vocabulary) and achieved significant recognition accuracy improvement across the board. To the best of our knowledge, this is the first time that the large-margin approach is demonstrated to be successful in large-scale speech recognition tasks.

62 citations


Proceedings ArticleDOI
Li Deng, Dong Yu
15 Apr 2007
TL;DR: The earlier version of the hidden trajectory model (HTM) for speech dynamics, which predicts the "static" cepstra as the observed acoustic feature, is generalized to one which predicts joint static and delta cepstra, enabling efficient computation of the joint likelihood for both static and delta cepstral sequences as the acoustic features given the model.
Abstract: The earlier version of the hidden trajectory model (HTM) for speech dynamics which predicts the "static" cepstra as the observed acoustic feature is generalized to one which predicts joint static cepstra and their temporal differentials (i.e., delta cepstra). The formulation of this generalized HTM is presented in the generative-modeling framework, enabling efficient computation of the joint likelihood for both static and delta cepstral sequences as the acoustic features given the model. The parameter estimation techniques for the new model are developed and presented, giving closed-form estimation formulas after the use of vector Taylor series approximation. We show principled generalization from the earlier static-cepstra HTM to the new static/delta-cepstra HTM not only in terms of model formulations but also in terms of their respective analytical forms in (monophone) parameter estimation. Experimental results on the standard TIMIT phonetic recognition task demonstrate recognition accuracy improvement over the earlier best HTM system, both significantly better than state-of-the-art triphone HMM systems.

53 citations


Proceedings ArticleDOI
01 Aug 2007
TL;DR: VoicePedia, a telephone-based dialog system for searching and browsing Wikipedia, is developed, and a user study comparing VoicePedia to SmartPedia, a Smartphone GUI-based alternative, is presented.
Abstract: Currently there are no dialog systems that enable purely voice-based access to the unstructured information on websites such as Wikipedia. Such systems could be revolutionary for non-literate users in the developing world. To investigate interface issues in such a system, we developed VoicePedia, a telephone-based dialog system for searching and browsing Wikipedia. In this paper, we present the system, as well as a user study comparing the use of VoicePedia to SmartPedia, a Smartphone GUI-based alternative. Keyword entry through the voice interface was significantly faster, while search result navigation, and page browsing were significantly slower. Although users preferred the GUI-based interface, task success rates between both systems were comparable – a promising result for regions where Smartphones and data plans are not viable. Index Terms: dialog system, information access

41 citations


Patent
Dong Yu, Li Deng, Alejandro Acero, Yifan Gong, Jinyu Li
03 Dec 2007
TL;DR: In this article, a method of compensating for additive and convolutive distortions applied to a signal indicative of an utterance is discussed, which includes receiving a signal and initializing noise mean and channel mean vectors.
Abstract: A method of compensating for additive and convolutive distortions applied to a signal indicative of an utterance is discussed. The method includes receiving a signal and initializing noise mean and channel mean vectors. Gaussian dependent matrix and Hidden Markov Model (HMM) parameters are calculated or updated to account for additive noise from the noise mean vector or convolutive distortion from the channel mean vector. The HMM parameters are adapted by decoding the utterance using the previously calculated HMM parameters and adjusting the Gaussian dependent matrix and the HMM parameters based upon data received during the decoding. The adapted HMM parameters are applied to decode the input utterance and provide a transcription of the utterance.

31 citations


Patent
Ye-Yi Wang, Yun-Cheng Ju, Dong Yu
03 Aug 2007
TL;DR: In this paper, a confidence measure generator was proposed to calculate an overall confidence measure for voice search results based upon the features received from the speech recognizer, search component, and dialog manager.
Abstract: A voice search system has a speech recognizer, a search component, and a dialog manager. A confidence measure generator receives speech recognition features from the speech recognizer, search features from the search component, and dialog features from the dialog manager, and calculates an overall confidence measure for voice search results based upon the features received. The invention can be extended to include the generation of additional features, based on those received from the individual components of the voice search system.

28 citations


Patent
Ye-Yi Wang, Dong Yu, Yun-Cheng Ju, Alejandro Acero, Geoffrey Zweig
10 May 2007
TL;DR: In this article, a database having listings rather than long documents is searched using a term frequency-inverse document frequency (Tf/Idf) algorithm.
Abstract: A database having listings rather than long documents is searched using a term frequency-inverse document frequency (Tf/Idf) algorithm.

26 citations
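
As an illustration of searching short listings with Tf/Idf, the following is a minimal vector-space sketch. It is not the patented method: the smoothed IDF formula, the cosine ranking, the function names, and the toy listings are all assumptions made for the example.

```python
import math
from collections import Counter

def build_index(listings):
    """Return (idf, vectors) for a list of short listing strings."""
    docs = [Counter(text.lower().split()) for text in listings]
    n = len(docs)
    # Document frequency: number of listings that contain each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, listings, idf, vectors):
    """Return the best-matching listing and its score for a text query."""
    counts = Counter(query.lower().split())
    qvec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(qvec, v) for v in vectors]
    best = max(range(len(listings)), key=scores.__getitem__)
    return listings[best], scores[best]

# Hypothetical toy listing database.
LISTINGS = [
    "kung ho chinese restaurant",
    "calabria italian restaurant",
    "main street pharmacy",
]
IDF, VECTORS = build_index(LISTINGS)
best_listing, best_score = search("calabria restaurant", LISTINGS, IDF, VECTORS)
```

Because listings are only a few words long, term frequency is almost always 0 or 1, so the ranking is driven mostly by the IDF weights: the rare word "calabria" counts for more than the common word "restaurant".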


Proceedings ArticleDOI
Geoffrey Zweig, Yun-Cheng Ju, Patrick Nguyen, Dong Yu, Ye-Yi Wang, Alex Acero
23 Apr 2007
TL;DR: Voice-Rate is an automated dialog system which provides access to over one million ratings of products and businesses and can be accessed by dialing 1-877-456-DATA.
Abstract: Voice-Rate is an automated dialog system which provides access to over one million ratings of products and businesses. By calling a toll-free number, consumers can access ratings for products, national businesses such as airlines, and local businesses such as restaurants. Voice-Rate also has a facility for recording and analyzing ratings that are given over the phone. The service has been primed with ratings taken from a variety of web sources, and we are augmenting these with user ratings. Voice-Rate can be accessed by dialing 1-877-456-DATA.

19 citations


Patent
20 Feb 2007
TL;DR: In this article, a method and apparatus for training an acoustic model are disclosed, where a training corpus is accessed and converted into an initial acoustic model, and scores are calculated for a correct class and competitive classes, respectively, for each token given the initial model.
Abstract: A method and apparatus for training an acoustic model are disclosed. A training corpus is accessed and converted into an initial acoustic model. Scores are calculated for a correct class and competitive classes, respectively, for each token given the initial acoustic model. Also, a sample-adaptive window bandwidth is calculated for each training token. From the calculated scores and the sample-adaptive window bandwidth values, loss values are calculated based on a loss function. The loss function, which may be derived from a Bayesian risk minimization viewpoint, can include a margin value that moves a decision boundary such that token-to-boundary distances for correct tokens that are near the decision boundary are maximized. The margin can either be a fixed margin or can vary monotonically as a function of algorithm iterations. The acoustic model is updated based on the calculated loss values. This process can be repeated until an empirical convergence is met.
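
The idea of a loss function with a margin that shifts the decision boundary can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method: the misclassification measure uses only the best competitor, the margin enters as a shift inside a sigmoid, the sample-adaptive window bandwidth is omitted, and all names and default values are hypothetical.

```python
import math

def misclassification_measure(correct_score, competitor_scores):
    """Return d = best competing score minus correct-class score.

    d < 0 means the token is classified correctly; d > 0 means an error.
    """
    return max(competitor_scores) - correct_score

def margin_sigmoid_loss(d, margin=0.0, alpha=1.0):
    """Smoothed 0/1 loss; a positive margin also penalizes near-boundary tokens."""
    return 1.0 / (1.0 + math.exp(-alpha * (d + margin)))

# A token that is correct, but only by 0.5 in score.
d = misclassification_measure(2.0, [1.5, 0.0])
loss_without_margin = margin_sigmoid_loss(d, margin=0.0)
loss_with_margin = margin_sigmoid_loss(d, margin=1.0)
```

With a zero margin the correctly classified token contributes a loss below 0.5. With a margin of 1.0 the same token is treated as if it were on the wrong side of the boundary, so training keeps pushing it further away. Increasing the margin over iterations corresponds to the monotonically varying margin mentioned in the abstract.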

Proceedings ArticleDOI
Geoffrey Zweig, Patrick Nguyen, Yun-Cheng Ju, Ye-Yi Wang, Dong Yu, Alex Acero
27 Aug 2007
TL;DR: Voice-Rate is an automated dialog system which provides access to over one million ratings of products and businesses through a toll-free number; it has been primed with ratings taken from a variety of web sources, which are being augmented with user ratings.
Abstract: Voice-Rate is an experimental dialog system that makes product and business ratings available to consumers via a toll-free phone number. By calling Voice-Rate, users can access the ratings of more than one million products, a quarter million local businesses (restaurants), and three thousand national businesses. This paper describes the Voice-Rate system, and solutions to three key technical challenges: robust name-matching, efficient disambiguation, and review synthesis for telephone playback. Voice-Rate can be accessed by calling 1-877-456-DATA (toll-free) within the U.S.

Proceedings ArticleDOI
Ye-Yi Wang, Dong Yu, Yun-Cheng Ju, Geoffrey Zweig, Alex Acero
27 Aug 2007
TL;DR: This paper reviews voice search technology, and proposes a new and effective method for computing semantic confidence measures that explores the use of maximum entropy classifiers as confidence models, and investigates a feature selection algorithm that leads to an effective subset of prominent features for the classifier.
Abstract: Voice search is the technology underlying many spoken dialog applications that enable users to access information using spoken queries. This paper reviews voice search technology, and proposes a new and effective method for computing semantic confidence measures. It explores the use of maximum entropy classifiers as confidence models, and investigates a feature selection algorithm that leads to an effective subset of prominent features for the classifier. The experimental results on a directory assistance application show that the reduced feature set not only makes the model more effective in handling different recognition and search engine combinations, but also results in a very informative confidence measure that is closely correlated with the actual voice search accuracy. Index Terms: voice search, directory assistance, confidence measure, Tf-Idf vector space model, maximum entropy model.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: In this paper, a novel discriminative training approach to spoken utterance classification is proposed, in which the classification error rate is approximated as differentiable functions of the language and classifier model parameters.
Abstract: In this paper, we propose a novel discriminative training approach to spoken utterance classification (SUC). The ultimate objective of the SUC task, originally developed to map a spoken speech utterance into the most appropriate semantic class, is to minimize the classification error rate (CER). Conventionally, a two-phase approach is adopted, in which the first phase is the ASR transcription phase, and the second phase is the semantic classification phase. In the proposed framework, the classification error rate is approximated as differentiable functions of the language and classifier model parameters. Furthermore, in order to exploit all the available information from the first phase, class-specific discriminant functions are defined based on score functions derived from the N-best lists. Our experimental results on the standard ATIS database indicate a notable reduction in CER from the earlier best result on the identical task. The proposed framework achieved a reduction of CER from 4.92% to 4.04%.

PatentDOI
Li Deng, Dong Yu
TL;DR: In this paper, a system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition.
Abstract: A novel system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition. According to one illustrative embodiment, an automatic speech recognition method includes receiving a speech input, generating an interpretation of the speech, and providing an output based at least in part on the interpretation of the speech input. The interpretation of the speech uses hidden trajectory modeling with observation vectors that are based on cepstra and on differential cepstra derived from the cepstra. A method is developed that can automatically train the hidden trajectory model's parameters that correspond to the components of the differential cepstra in the full acoustic feature vectors.

Patent
12 Jan 2007
TL;DR: A computer-implemented method is disclosed for providing a directory assistance service, which includes generating an indexing file that is a representation of information associated with a collection of listings stored in an index.
Abstract: A computer-implemented method is disclosed for providing a directory assistance service. The method includes generating an indexing file that is a representation of information associated with a collection of listings stored in an index. The indexing file is utilized as a basis for ranking listings in an index based on the strength of association with a query. Based at least in part on the ranking, an output is provided and is indicative of listings in the index that are likely to correspond to the query. At least one particular listing in the index is excluded from the output without there ever being a comparison of features in the query with features in the one particular listing.

Proceedings Article
Ivan Tashev, Michael L. Seltzer, Yun-Cheng Ju, Dong Yu, Alex Acero
01 Sep 2007
TL;DR: The strategies employed by the system were evaluated through user studies, and a system employing the best strategies was deployed and evaluated through an analysis of 700 calls over a two-month period.
Abstract: In this paper, we describe a telephone dialog system for location-based services. In such systems, the effectiveness with which the user can input location information to the system, and with which the system delivers location information to the user, is critical. We describe strategies for both of these issues in the context of a dialog system for real-time information about traffic, gas prices, and weather. The strategies employed by our system were evaluated through user studies and a system employing the best strategies was deployed. The system is evaluated through an analysis of 700 calls over a two-month period.

Proceedings ArticleDOI
Dong Yu, Li Deng, Alex Acero
27 Aug 2007
TL;DR: An effective target-value normalization algorithm that can be applied to both training and unknown test speakers and which provides structured context dependency for speech acoustics without using context dependent parameters as required by HMMs is described.
Abstract: Recently we have developed a novel type of structure-based speech recognizer, which uses a parameterized, non-recursive hidden trajectory model (HTM) of vocal tract resonances (VTR) or formants to capture the dynamic structure of long-range speech coarticulation and reduction. The underlying model of this recognizer carries out bi-directional FIR filtering on the piecewise constant sequences of the VTR targets. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the movement direction and thus the formation of the VTR trajectories. This provides structured context dependency for speech acoustics without using context dependent parameters as required by HMMs. Second, VTR targets as the key context-independent parameters of the model vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training and unknown test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition. We also provide computational analysis on the HTM-based speech decoder.
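
The bi-directional FIR filtering of piecewise-constant targets described above can be sketched in a few lines. This is a minimal illustration, not the authors' model: it assumes a single symmetric kernel with weights gamma^|k - tau| over a fixed window, normalized per frame, whereas the actual model may use segment-dependent filter parameters. The function name, parameter values, and target frequencies are hypothetical.

```python
def fir_trajectory(targets, gamma=0.6, half_width=7):
    """Smooth a piecewise-constant target sequence with a symmetric FIR kernel.

    Each output frame is a weighted average of targets from both past and
    future frames, with weights gamma ** |k - tau| (assumed kernel shape).
    """
    n = len(targets)
    trajectory = []
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n - 1, k + half_width)
        frames = range(lo, hi + 1)
        weights = [gamma ** abs(k - tau) for tau in frames]
        # Normalize so that a constant target sequence is left unchanged.
        total = sum(weights)
        value = sum(w * targets[tau] for w, tau in zip(weights, frames)) / total
        trajectory.append(value)
    return trajectory

# Two hypothetical phone segments with resonance targets of 500 Hz and 1500 Hz.
step_targets = [500.0] * 10 + [1500.0] * 10
smooth = fir_trajectory(step_targets)
```

Because the kernel looks in both directions, the trajectory starts moving toward the second target before the segment boundary and keeps moving after it, which is how the context shapes the trajectory without context-dependent parameters.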