Home
/
Authors
/
Kaisheng Yao

Author

Kaisheng Yao

Bio: Kaisheng Yao is an academic researcher from Microsoft. The author has contributed to research in topics: Recurrent neural network & Artificial neural network. The author has an hindex of 27, co-authored 69 publications receiving 4628 citations. Previous affiliations of Kaisheng Yao include Texas Instruments.

Papers published on a yearly basis

2023
2021
2020
2019
2017
2016
2015
2014
2013
2012
2011
2010
2009
2007
2006
2001

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Recent advances in deep learning for speech research at Microsoft

[...]

Li Deng¹, Jinyu Li¹, Jui-Ting Huang¹, Kaisheng Yao¹, Dong Yu¹, Frank Seide¹, Michael L. Seltzer¹, Geoff Zweig¹, Xiaodong He¹, Jason D. Williams¹, Yifan Gong¹, Alex Acero¹ - Show less +8 more•Institutions (1)

Microsoft¹

26 May 2013

TL;DR: An overview of the work by Microsoft speech researchers since 2009 is provided, focusing on more recent advances which shed light to the basic capabilities and limitations of the current deep learning technology.

...read moreread less

Abstract: Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light to the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvement of these techniques and future research directions are discussed.

...read moreread less

798 citations

Journal Article•DOI•

Using recurrent neural networks for slot filling in spoken language understanding

[...]

Grégoire Mesnil¹, Yann N. Dauphin¹, Kaisheng Yao², Yoshua Bengio¹, Li Deng², Dilek Hakkani-Tur², Xiaodong He², Larry Heck³, Gokhan Tur⁴, Dong Yu², Geoffrey Zweig² - Show less +7 more•Institutions (4)

Université de Montréal¹, Microsoft², Google³, Apple Inc.⁴

01 Mar 2015-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants, and implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark.

...read moreread less

Abstract: Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.

...read moreread less

562 citations

Proceedings Article•DOI•

KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition

[...]

Dong Yu¹, Kaisheng Yao¹, Hang Su¹, Gang Li¹, Frank Seide¹ - Show less +1 more•Institutions (1)

Microsoft¹

26 May 2013

TL;DR: Experiments demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.

...read moreread less

Abstract: We propose a novel regularized adaptation technique for context dependent deep neural network hidden Markov models (CD-DNN-HMMs). The CD-DNN-HMM has a large output layer and many large hidden layers, each with thousands of neurons. The huge number of parameters in the CD-DNN-HMM makes adaptation a challenging task, esp. when the adaptation set is small. The technique developed in this paper adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that from the unadapted model. This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation algorithm. Experiments on Xbox voice search, short message dictation, and Switchboard and lecture speech transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.

...read moreread less

479 citations

An Introduction to Computational Networks and the Computational Network Toolkit

[...]

Dong Yu, Adam Eversole, Michael L. Seltzer, Kaisheng Yao, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Zhiheng Huang, Brian Guenter, Huaming Wang, Jasha Droppo, Geoffrey Zweig, Christopher J. Rossbach, Jie Gao, Andreas Stolcke, Jon Currey, Malcolm Slaney, Guoguo Chen, Amit Kumar Agarwal, Christopher H. Basoglu, Marko Padmilac, Alexey Kamenev, Vladimir Ivanov, Scott Cypher, Hari Parthasarathi, Bhaskar Mitra, Baolin Peng, Xuedong Huang - Show less +24 more

01 Aug 2014

TL;DR: The computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU, is introduced and the architecture and the key components of the CNTK are described, the command line options to use C NTK, and the network definition and model editing language are described.

...read moreread less

Abstract: We introduce computational network (CN), a unified framework for describing arbitrary learning machines, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short term memory (LSTM), logistic regression, and maximum entropy model, that can be illustrated as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each non-leaf node represents a matrix operation upon its children. We describe algorithms to carry out forward computation and gradient calculation in CN and introduce most popular computation node types used in a typical CN. We further introduce the computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU. We describe the architecture and the key components of the CNTK, the command line options to use CNTK, and the network definition and model editing language, and provide sample setups for acoustic model, language model, and spoken language understanding. We also describe the Argon speech recognition decoder as an example to integrate with CNTK.

...read moreread less

408 citations

Proceedings Article•DOI•

Spoken language understanding using long short-term memory neural networks

[...]

Kaisheng Yao¹, Baolin Peng¹, Yu Zhang¹, Dong Yu¹, Geoffrey Zweig¹, Yangyang Shi¹ - Show less +2 more•Institutions (1)

Microsoft¹

01 Dec 2014

TL;DR: This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output and forgetting gates and are more advanced than simple RNN, for the word labeling task and proposes a regression model on top of the LSTM un-normalized scores to explicitly model output-label dependence.

...read moreread less

Abstract: Neural network based approaches have recently produced record-setting performances in natural language understanding tasks such as word labeling. In the word labeling task, a tagger is used to assign a label to each word in an input sequence. Specifically, simple recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have shown to significantly outperform the previous state-of-the-art - conditional random fields (CRFs). This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output and forgetting gates and are more advanced than simple RNN, for the word labeling task. To explicitly model output-label dependence, we propose a regression model on top of the LSTM un-normalized scores. We also propose to apply deep LSTM to the task. We investigated the relative importance of each gate in the LSTM by setting other gates to a constant and only learning particular gates. Experiments on the ATIS dataset validated the effectiveness of the proposed models.

...read moreread less

350 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15

Collapse

Cited by

PDF

Open Access

More filters

Proceedings Article•

Adam: A Method for Stochastic Optimization

[...]

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

...read moreread less

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

...read moreread less

111,197 citations

Posted Content•

Adam: A Method for Stochastic Optimization

[...]

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

22 Dec 2014-arXiv: Learning

TL;DR: In this article, the adaptive estimates of lower-order moments are used for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimate of lowerorder moments.

...read moreread less

23,486 citations

Posted Content•

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

[...]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek G. Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay K. Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, Xiaoqiang Zheng - Show less +36 more

01 Jan 2015-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

...read moreread less

Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

...read moreread less

10,447 citations

Posted Content•

Bidirectional LSTM-CRF Models for Sequence Tagging

[...]

Zhiheng Huang, Wei Xu, Kai Yu

09 Aug 2015-arXiv: Computation and Language

TL;DR: This work is the first to apply a bidirectional LSTM CRF model to NLP benchmark sequence tagging data sets and it is shown that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a biddirectional L STM component.

...read moreread less

Abstract: In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations.

...read moreread less

3,158 citations

Journal Article•DOI•

Recent advances in convolutional neural networks

[...]

Jiuxiang Gu¹, Zhenhua Wang¹, Jason Kuen¹, Lianyang Ma¹, Amir Shahroudy¹, Bing Shuai¹, Ting Liu¹, Xingxing Wang¹, Gang Wang¹, Jianfei Cai¹, Tsuhan Chen¹ - Show less +7 more•Institutions (1)

Nanyang Technological University¹

01 May 2018-Pattern Recognition

TL;DR: A broad survey of the recent advances in convolutional neural networks can be found in this article, where the authors discuss the improvements of CNN on different aspects, namely, layer design, activation function, loss function, regularization, optimization and fast computation.

...read moreread less

3,125 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse