Author

Dong Yu

Bio: Dong Yu is an academic researcher from Tencent. The author has contributed to research in topics: Artificial neural network & Word error rate. The author has an h-index of 72 and has co-authored 339 publications receiving 39,098 citations. Previous affiliations of Dong Yu include Peking University & Microsoft.


Papers
Proceedings ArticleDOI
22 May 2011
TL;DR: This work proposes a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task.
Abstract: The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively, which translate to relative error reductions of 16.0% and 23.2%.
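For reference, the relative reductions follow from the absolute gains by a simple identity (a worked check using only the numbers reported above; the implied baseline sentence error rates are derived here, not quoted from the paper):

```latex
\text{relative error reduction} = \frac{\text{absolute accuracy gain}}{\text{baseline sentence error rate}}
\quad\Rightarrow\quad
E_{\text{MPE}} \approx \frac{5.8\%}{0.160} \approx 36.3\%,
\qquad
E_{\text{ML}} \approx \frac{9.2\%}{0.232} \approx 39.7\%
```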

213 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.
Abstract: We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight. The performances of existing propose-and-rank two-stage methods are capped by the quality of the region candidates they propose in the first stage --- if none of the candidates could cover the ground truth region, there is no hope in the second stage to rank the right region to the top. To avoid this caveat, we propose a one-stage model that enables end-to-end joint optimization. The main idea is as straightforward as fusing a text query's embedding into the YOLOv3 object detector, augmented by spatial features so as to account for spatial mentions in the query. Despite being simple, this one-stage approach shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension, according to our experiments. Given these results along with careful investigations into some popular region proposals, we advocate for visual grounding a paradigm shift from the conventional two-stage methods to the one-stage framework.
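For illustration only, a minimal numpy sketch of the fusion idea described above: broadcast a sentence-level query embedding over a detector's spatial feature map and append normalized coordinate channels. The shapes and the plain concatenation are assumptions for the sketch, not the authors' exact YOLOv3-based architecture.

```python
import numpy as np

def fuse_text_into_feature_map(visual_feats, text_embedding):
    """Broadcast a text-query embedding over a spatial feature map and append
    normalized (x, y) coordinate channels, so a per-cell predictor can condition
    on both the query and spatial position.

    visual_feats:   (C, H, W) feature map from a visual backbone
    text_embedding: (D,) sentence-level embedding of the query
    returns:        (C + D + 2, H, W) fused map
    """
    C, H, W = visual_feats.shape
    D = text_embedding.shape[0]
    # Tile the query embedding at every spatial location.
    text_map = np.broadcast_to(text_embedding[:, None, None], (D, H, W))
    # Normalized coordinate channels help ground spatial mentions ("left", "top").
    ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
    coord_map = np.stack([xs, ys])                      # (2, H, W)
    return np.concatenate([visual_feats, text_map, coord_map], axis=0)

# Illustrative shapes only.
fused = fuse_text_into_feature_map(np.random.rand(256, 13, 13), np.random.rand(512))
print(fused.shape)  # (770, 13, 13)
```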

212 citations

Proceedings ArticleDOI
Li Deng1, Dong Yu1, John Platt1
25 Mar 2012
TL;DR: The Deep Stacking Network (DSN) is presented, which overcomes the problem of parallelizing learning algorithms for deep architectures and provides a method of stacking simple processing modules in building deep architectures, with a convex learning problem in each module.
Abstract: Deep Neural Networks (DNNs) have shown remarkable success in pattern recognition tasks. However, parallelizing DNN training across computers has been difficult. We present the Deep Stacking Network (DSN), which overcomes the problem of parallelizing learning algorithms for deep architectures. The DSN provides a method of stacking simple processing modules in building deep architectures, with a convex learning problem in each module. Additional fine tuning further improves the DSN, while introducing minor non-convexity. Full learning in the DSN is batch-mode, making it amenable to parallel training over many machines and thus scalable to the potentially huge size of the training data. Experimental results on both the MNIST (image) and TIMIT (speech) classification tasks demonstrate that the DSN learning algorithm developed in this work is not only parallelizable in implementation but also attains higher classification accuracy than the DNN.
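A minimal numpy sketch of one stacking module under the convex view described above: with the lower-layer weights held fixed (random here, purely for illustration), the upper-layer weights solve a ridge-regression problem in closed form, and the next module consumes the raw input concatenated with the previous module's predictions. The sizes and the regularizer are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def dsn_module(X, Y, n_hidden=128, reg=1e-3, rng=np.random.default_rng(0)):
    """One Deep Stacking Network module (sketch).

    With lower-layer weights W fixed, the hidden layer H = sigmoid(X @ W) is a
    fixed feature map, so learning the upper-layer weights U reduces to a convex
    ridge-regression problem with a closed-form solution.

    X: (n_samples, n_in) inputs, Y: (n_samples, n_out) targets (e.g. one-hot).
    Returns the module's predictions and the learned (W, U).
    """
    W = rng.standard_normal((X.shape[1], n_hidden)) * 0.1     # fixed lower weights
    H = 1.0 / (1.0 + np.exp(-X @ W))                          # hidden activations
    # Convex problem: min_U ||H U - Y||^2 + reg ||U||^2  =>  normal equations.
    U = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return H @ U, (W, U)

# Stacking: feed the raw input concatenated with the previous module's output.
X = np.random.rand(500, 20)
Y = np.eye(3)[np.random.randint(0, 3, 500)]
pred1, _ = dsn_module(X, Y)
pred2, _ = dsn_module(np.hstack([X, pred1]), Y)   # second module of the stack
```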

208 citations

Journal ArticleDOI
TL;DR: A human error rate on the widely used NIST 2000 test set for commercial bulk transcription is measured, suggesting that, given sufficient matched training data, conversational speech transcription engines are approximating human parity in both quantitative and qualitative terms.
Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure a human error rate on the widely used NIST 2000 test set for commercial bulk transcription. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and long short-term memory acoustic model architectures, combined with a novel spatial smoothing method and lattice-free discriminative acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination. Comparing frequent errors in our human and machine transcripts, we find them to be remarkably similar, and highly correlated as a function of the speaker. Human subjects find it very difficult to tell which errorful transcriptions come from humans. Overall, this suggests that, given sufficient matched training data, conversational speech transcription engines are approximating human parity in both quantitative and qualitative terms.
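The figures above are word error rates; as a reminder of the metric, here is a minimal sketch of the standard WER computation (an edit-distance alignment over words), not the NIST sclite scoring pipeline used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```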

194 citations

Proceedings ArticleDOI
Jian Xue, Jinyu Li, Dong Yu, Michael L. Seltzer, Yifan Gong
04 May 2014
TL;DR: This paper addresses DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD), which can obtain similar accuracy improvements as the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters.
Abstract: The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN with the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between the two low-rank matrices. In the second method, we adapt the full weight matrix but only store the delta matrix, the difference between the original and adapted weight matrices. We decrease the footprint of the adapted model by storing a reduced-rank version of the delta matrix via an SVD. The proposed methods were evaluated on a short message dictation task. Experimental results show that we can obtain accuracy improvements similar to the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters, requiring only 0.89% of the original model's storage.
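A numpy sketch of the two SVD-based ideas, with illustrative layer sizes rather than the paper's actual configuration: (1) factor the weight matrix into two low-rank pieces and adapt only a small square matrix inserted between them; (2) adapt the full matrix but store only a reduced-rank version of the delta.

```python
import numpy as np

def svd_bottleneck(W, k):
    """Replace W (m x n) by two low-rank factors A (m x k) and B (k x n),
    W ~= A @ B, leaving a k x k slot for a per-speaker adaptation matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]          # m x k
    B = Vt[:k, :]                 # k x n
    return A, B

m, n, k = 2048, 2048, 192         # illustrative layer sizes, not the paper's
W = np.random.randn(m, n)
A, B = svd_bottleneck(W, k)

# Method 1: adapt only a k x k matrix S inserted between the factors.
S = np.eye(k)                     # initialized to identity; updated per speaker
W_adapted = A @ S @ B             # speaker-adapted weights

# Method 2: adapt W fully, then store only a low-rank version of the delta.
delta = np.random.randn(m, n) * 0.01              # stands in for W_adapted - W
U, s, Vt = np.linalg.svd(delta, full_matrices=False)
r = 64
delta_low_rank = (U[:, :r] * s[:r]) @ Vt[:r, :]   # what is actually stored

full_params = m * n
per_speaker_params = k * k        # method 1 storage per adapted layer
print(per_speaker_params / full_params)  # ~0.0088, i.e. under 1% (illustrative)
```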

194 citations


Cited by
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
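A minimal numpy sketch of the update rule described above, using the paper's default hyper-parameters: exponential moving averages of the gradient and its element-wise square, bias-corrected, then a per-parameter scaled step. The toy objective is an assumption for the demo.

```python
import numpy as np

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: biased first/second moment estimates, bias correction,
    then a step scaled per parameter by the square root of the second moment."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second raw moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Toy objective: minimize f(theta) = ||theta||^2, so grad = 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
state = (np.zeros_like(theta), np.zeros_like(theta), 0)
for _ in range(5000):
    theta, state = adam_step(theta, 2 * theta, state)
print(theta)  # approximately zero after enough steps
```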

111,197 citations

Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
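As a toy illustration of the backpropagation step described above (a sketch with random data and a two-layer network, not tied to any system in the review): the output error is propagated backwards through the layers by the chain rule to give each layer's weight gradient.

```python
import numpy as np

# Minimal backpropagation sketch for a two-layer regression network.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))            # toy inputs
Y = rng.standard_normal((64, 1))             # toy targets
W1 = rng.standard_normal((10, 32)) * 0.1     # bottom-layer weights
W2 = rng.standard_normal((32, 1)) * 0.1      # top-layer weights

for _ in range(200):
    H = np.maximum(0, X @ W1)                # hidden representation (ReLU)
    Y_hat = H @ W2                           # output layer
    err = Y_hat - Y                          # dLoss/dY_hat for squared error
    grad_W2 = H.T @ err                      # gradient at the top layer
    grad_H = err @ W2.T                      # error propagated back one layer
    grad_W1 = X.T @ (grad_H * (H > 0))       # gradient at the bottom layer
    W1 -= 1e-3 * grad_W1
    W2 -= 1e-3 * grad_W2
```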

46,982 citations

Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
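The two-player game described above is the paper's minimax objective; at its unique equilibrium G reproduces the data distribution and D outputs ½ everywhere:

```latex
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\log D(x)\right] +
\mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```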

38,211 citations

Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI


08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at first: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations