Author

Dong Yu

Bio: Dong Yu is an academic researcher from Tencent. The author has contributed to research in topics: Artificial neural network & Word error rate. The author has an h-index of 72 and has co-authored 339 publications receiving 39,098 citations. Previous affiliations of Dong Yu include Peking University & Microsoft.


Papers
Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper presents recent work on improving word confidence scores by calibrating them with a maximum entropy model with distribution constraints, using a small set of calibration data, when only the recognized word sequence and the associated raw confidence scores are available.
Abstract: It is widely known that the quality of confidence measures is critical for speech applications. In this paper, we present our recent work on improving word confidence scores by calibrating them using a small set of calibration data when only the recognized word sequence and the associated raw confidence scores are available. The core of our technique is the maximum entropy model with distribution constraints, which naturally and effectively makes use of the word distribution, the raw confidence-score distribution, and the context information. We demonstrate the effectiveness of our approach by showing that it achieves relative reductions of 38% in mean square error (MSE), 39% in negative normalized likelihood (NNLL), and 23% in equal error rate (EER) on a voice mail transcription data set, and relative reductions of 35% in MSE, 45% in NNLL, and 35% in EER on a command-and-control data set.
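
Below is a minimal sketch of the general idea of confidence-score calibration with a maximum-entropy (logistic) model over binned raw confidence scores. It is an illustration only, not the paper's model, which additionally exploits the word distribution and context information through distribution constraints; the bin count, training procedure, and toy data here are all assumptions.

```python
# Sketch: calibrate raw word confidence scores with a logistic (max-ent) model.
import numpy as np

def bin_features(scores, n_bins=10):
    """One-hot bin indicators for raw confidence scores in [0, 1]."""
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    feats = np.zeros((len(scores), n_bins))
    feats[np.arange(len(scores)), bins] = 1.0
    return feats

def train_calibrator(raw_scores, labels, lr=0.5, epochs=200):
    """Fit logistic weights by gradient ascent on the log-likelihood."""
    X = bin_features(raw_scores)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # calibrated probabilities
        w += lr * (X.T @ (labels - p)) / len(labels)
        b += lr * np.mean(labels - p)
    return w, b

def calibrate(raw_scores, w, b):
    X = bin_features(raw_scores)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Toy calibration set: raw scores and whether each recognized word was correct.
rng = np.random.default_rng(0)
raw = rng.uniform(size=500)
correct = (rng.uniform(size=500) < raw ** 2).astype(float)  # deliberately miscalibrated
w, b = train_calibrator(raw, correct)
print(calibrate(np.array([0.2, 0.5, 0.9]), w, b))
```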

4 citations

PatentDOI
Li Deng, Dong Yu
TL;DR: In this paper, a system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition.
Abstract: A novel system for speech recognition uses differential cepstra over time frames as acoustic features, together with the traditional static cepstral features, for hidden trajectory modeling, and provides greater accuracy and performance in automatic speech recognition. According to one illustrative embodiment, an automatic speech recognition method includes receiving a speech input, generating an interpretation of the speech, and providing an output based at least in part on the interpretation of the speech input. The interpretation of the speech uses hidden trajectory modeling with observation vectors that are based on cepstra and on differential cepstra derived from the cepstra. A method is developed that can automatically train the hidden trajectory model's parameters corresponding to the components of the differential cepstra in the full acoustic feature vectors.
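
As a rough illustration of the kind of dynamic acoustic feature the patent describes, the sketch below computes standard delta (differential) cepstra from static cepstral frames using the usual regression window. The window length, frame counts, and data are assumptions for illustration; this is not the patent's hidden trajectory model itself.

```python
# Sketch: differential (delta) cepstra from static cepstral frames.
import numpy as np

def delta(cepstra, N=2):
    """Standard delta regression over +/- N frames.
    cepstra: (num_frames, num_coeffs) array of static cepstral features."""
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(cepstra)
    for t in range(cepstra.shape[0]):
        acc = np.zeros(cepstra.shape[1])
        for n in range(1, N + 1):
            acc += n * (padded[t + N + n] - padded[t + N - n])
        deltas[t] = acc / denom
    return deltas

static = np.random.randn(100, 13)               # e.g. 13 cepstral coefficients per frame
features = np.hstack([static, delta(static)])   # static + differential cepstra
print(features.shape)                           # (100, 26)
```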

4 citations

Book ChapterDOI
Dong Yu, Li Deng
01 Jan 2015
TL;DR: This chapter covers several key aspects of the HMM, including its parametric characterization, its simulation by random number generators, its likelihood evaluation, its parameter estimation via the EM algorithm, and its state decoding via the Viterbi algorithm or a dynamic programming procedure.
Abstract: This chapter builds upon the reviews in the previous chapter on aspects of probability theory and statistics, including random variables and Gaussian mixture models, and extends the reviews to the Markov chain and the hidden Markov sequence or model (HMM). Central to the HMM is the concept of state, which is itself a random variable typically taking discrete values. Extending from a Markov chain to an HMM involves adding uncertainty or a statistical distribution on each of the states in the Markov chain. Hence, an HMM is a doubly-stochastic process, or probabilistic function of a Markov chain. When the state of the Markov sequence or HMM is confined to be discrete and the distributions associated with the HMM states do not overlap, we reduce it to a Markov chain. This chapter covers several key aspects of the HMM, including its parametric characterization, its simulation by random number generators, its likelihood evaluation, its parameter estimation via the EM algorithm, and its state decoding via the Viterbi algorithm or a dynamic programming procedure. We then provide discussions on the use of the HMM as a generative model for speech feature sequences and its use as the basis for speech recognition. Finally, we discuss the limitations of the HMM, leading to its various extended versions, where each state is associated with a dynamic system or a hidden time-varying trajectory instead of with a temporally independent stationary distribution such as a Gaussian mixture. These variants of the HMM with state-conditioned dynamic systems expressed in the state-space formulation are introduced as a generative counterpart of the recurrent neural networks to be described in detail in Chap. 13.
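
To make one of the chapter's key HMM operations concrete, here is a minimal sketch of Viterbi decoding for a discrete-observation HMM in log space. The toy initial, transition, and emission probabilities are illustrative values, not taken from the book.

```python
# Sketch: Viterbi decoding for a discrete-observation HMM.
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, V) emission log-probs; obs: sequence of symbol indices."""
    S, T = len(log_pi), len(obs)
    delta = np.full((T, S), -np.inf)   # best path log-prob ending in state s at time t
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # (S, S): prev -> cur
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                         # backtrace
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(log_pi, log_A, log_B, [0, 1, 2, 2]))
```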

4 citations

Posted Content
Zhao You, Dan Su, Jie Chen, Chao Weng, Dong Yu
TL;DR: Experiments on large-scale LVCSR tasks show that, on four individual test sets, the DFSMN-SAN architecture outperforms the vanilla SAN encoder by a relative 5% in character error rate (CER), and the additional memory structure provides a further 5% to 11% relative improvement in CER.
Abstract: Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and have achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies. One of the key ingredients is the self-attention mechanism, which can be effectively performed over the whole utterance. In this paper, we investigate whether information beyond the whole utterance level can be exploited and is beneficial. We propose to apply self-attention layers with augmented memory to ASR. Specifically, we first propose a variant model architecture that combines the deep feed-forward sequential memory network (DFSMN) with self-attention layers to form a stronger baseline than a purely self-attention network. Then, we propose and compare two kinds of additional memory structures added to the self-attention layers. Experiments on large-scale LVCSR tasks show that, on four individual test sets, the DFSMN-SAN architecture outperforms the vanilla SAN encoder by a relative 5% in character error rate (CER). More importantly, the additional memory structure provides a further 5% to 11% relative improvement in CER.
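
The sketch below illustrates, in plain numpy, the general idea of a self-attention layer whose keys and values are augmented with extra memory slots attended to alongside the utterance. It is a simplified stand-in, not the authors' DFSMN-SAN architecture; every shape, initialization, and name in it is an assumption for illustration.

```python
# Sketch: one self-attention layer with augmented memory slots.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(X, memory, Wq, Wk, Wv):
    """X: (T, d) utterance frames; memory: (M, d) extra slots that the
    utterance can attend to in addition to itself."""
    KV_in = np.concatenate([X, memory], axis=0)     # (T + M, d)
    Q = X @ Wq                                      # queries come from the utterance only
    K, V = KV_in @ Wk, KV_in @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])          # (T, T + M)
    return softmax(scores, axis=-1) @ V             # (T, d)

rng = np.random.default_rng(0)
d, T, M = 64, 50, 8
X = rng.standard_normal((T, d))                     # one utterance of 50 frames
memory = rng.standard_normal((M, d))                # 8 augmented memory slots
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
print(attention_with_memory(X, memory, Wq, Wk, Wv).shape)   # (50, 64)
```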

3 citations

Book ChapterDOI
Dong Yu, Li Deng
01 Jan 2015
TL;DR: This chapter introduces techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs), and describes the tandem and bottleneck approaches, in which DNNs are used as feature extractors for GMM systems.
Abstract: In this chapter, we introduce techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs). We first describe the tandem and bottleneck approaches, in which DNNs are used as feature extractors. The hidden-layer activations, which form a better representation than the raw input features, are used as features in the GMM systems. We then introduce techniques that fuse the recognition results and frame-level scores of the DNN-HMM hybrid system with those of the GMM-HMM system.
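
A minimal sketch of the tandem/bottleneck idea described above: activations from a narrow hidden layer of a DNN (here randomly initialized; in practice a trained network) are extracted and concatenated with the original acoustic features to serve as the observations of a GMM-HMM system. Layer sizes, weights, and data are illustrative assumptions.

```python
# Sketch: bottleneck-layer activations as tandem features for a GMM-HMM.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bottleneck_features(frames, weights):
    """Forward the frames through the DNN and return the activations of the
    designated bottleneck layer (here: the last, narrowest layer)."""
    h = frames
    for W, b in weights:
        h = relu(h @ W + b)
    return h

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 39))          # 200 frames of 39-dim acoustic features
layer_sizes = [39, 512, 512, 40]                 # 40-dim bottleneck layer
weights = [(rng.standard_normal((i, o)) * 0.05, np.zeros(o))
           for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]

bn = bottleneck_features(frames, weights)        # (200, 40) bottleneck features
tandem = np.hstack([frames, bn])                 # features handed to the GMM-HMM
print(bn.shape, tandem.shape)
```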

3 citations


Cited by
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Connections to related algorithms that inspired Adam are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
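
For reference, here is a compact numpy sketch of the Adam update rule summarized in the abstract: exponential moving averages of the gradient and of its elementwise square, bias-corrected, scale the step size per parameter. The toy objective and the learning-rate choice below are illustrative.

```python
# Sketch: the Adam parameter update on a toy objective.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta||^2 as a toy objective.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)  # close to [0, 0]
```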

111,197 citations

Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
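
As a concrete illustration of the abstract's description of backpropagation, the toy sketch below trains a two-layer network on synthetic data: the forward pass computes each layer's representation from the previous layer's representation, and the backward pass propagates gradients from the output back to the first layer to indicate how the parameters should change. All sizes and data are made up for illustration.

```python
# Sketch: backpropagation through a tiny two-layer network.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)    # toy binary targets

W1, b1 = rng.standard_normal((3, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.5, np.zeros(1)
lr = 0.5

for _ in range(500):
    # Forward pass: each layer's representation is computed from the previous one.
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    # Backward pass: gradients flow from the output layer back to the first layer.
    dlogit = (p - y) / len(X)                 # d(cross-entropy)/d(logit)
    dW2, db2 = h.T @ dlogit, dlogit.sum(axis=0)
    dh = dlogit @ W2.T * (1 - h ** 2)         # back through the tanh layer
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(float(((p > 0.5) == y).mean()))         # training accuracy on the toy data
```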

46,982 citations

Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
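
The minimax two-player game described in the abstract is commonly written as the following value function, with generator G, discriminator D, data distribution p_data, and noise prior p_z:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]

At the optimum of this game, G recovers the training data distribution and D outputs 1/2 everywhere, as the abstract states.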

38,211 citations

Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at first: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations