A fast learning algorithm for deep belief nets

doi:10.1162/NECO.2006.18.7.1527

Journal Article•DOI•

Geoffrey E. Hinton¹, Simon Osindero¹, Yee Whye Teh²•Institutions (2)

University of Toronto¹, National University of Singapore²

01 Jul 2006-Neural Computation (MIT Press)-Vol. 18, Iss: 7, pp 1527-1554

TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
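
The greedy, layer-at-a-time procedure described in the abstract can be sketched as a stack of restricted Boltzmann machines, each trained with one step of contrastive divergence (CD-1) and then frozen while its hidden activities become the training data for the next layer. This is an illustrative sketch only: the class `RBM`, the function `train_greedy`, the learning rate, the epoch count, and the toy data are assumptions rather than values from the paper. It omits the label units, the top-level associative memory, and the contrastive wake-sleep fine-tuning stage, and it uses reconstruction probabilities instead of sampled states in the negative phase, which is a common simplification.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    # Draw binary states from independent Bernoulli probabilities.
    return [1.0 if rng.random() < p else 0.0 for p in probs]


class RBM:
    """Hypothetical minimal binary RBM; W[i][j] links visible i to hidden j."""

    def __init__(self, n_vis, n_hid, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)]
                  for _ in range(n_vis)]
        self.b_vis = [0.0] * n_vis
        self.b_hid = [0.0] * n_hid

    def hid_probs(self, v):
        return [sigmoid(self.b_hid[j]
                        + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def vis_probs(self, h):
        return [sigmoid(self.b_vis[i]
                        + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden probabilities driven by the data vector.
        h0_p = self.hid_probs(v0)
        h0 = sample(h0_p, self.rng)
        # Negative phase: one reconstruction step (probabilities, not samples).
        v1_p = self.vis_probs(h0)
        h1_p = self.hid_probs(v1_p)
        for i in range(len(v0)):
            for j in range(len(h0_p)):
                self.W[i][j] += lr * (v0[i] * h0_p[j] - v1_p[i] * h1_p[j])
        for i in range(len(v0)):
            self.b_vis[i] += lr * (v0[i] - v1_p[i])
        for j in range(len(h0_p)):
            self.b_hid[j] += lr * (h0_p[j] - h1_p[j])


def train_greedy(data, layer_sizes, epochs=50, seed=0):
    """Train one RBM per entry of layer_sizes, bottom layer first."""
    rng = random.Random(seed)
    rbms, layer_input = [], data
    for n_hid in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hid, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v)
        rbms.append(rbm)
        # Freeze this layer; its hidden probabilities feed the next layer.
        layer_input = [rbm.hid_probs(v) for v in layer_input]
    return rbms


toy_data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
stack = train_greedy(toy_data, layer_sizes=[3, 2], epochs=5)
print(len(stack), "layers trained")
```

Each layer only ever sees the output of the layer below it, which is what lets the stack be learned one layer at a time.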

Citations

Journal Article•DOI•

Deep learning

Yann LeCun¹,², Yoshua Bengio³, Geoffrey E. Hinton⁴,⁵•Institutions (5)

Facebook¹, New York University², Université de Montréal³, University of Toronto⁴, Google⁵

28 May 2015-Nature

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.

Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations

Journal Article•DOI•

Generative Adversarial Nets

Ian Goodfellow¹, Jean Pouget-Abadie¹, Mehdi Mirza¹, Bing Xu¹, David Warde-Farley¹, Sherjil Ozair², Aaron Courville¹, Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, Indian Institute of Technology Delhi²

08 Dec 2014

TL;DR: A new framework is proposed for estimating generative models via an adversarial process, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.

Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
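
The minimax game in the abstract can be checked numerically on a finite sample space. The sketch below evaluates the value function V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))] and the pointwise optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)), both as stated in the GAN paper. The function names and the three-outcome distributions are assumptions for illustration, and every probability is assumed strictly positive so that the logarithms are defined.

```python
import math


def gan_value(p_data, p_gen, d):
    """V(D, G) on a finite sample space, with d[k] = D(x_k)."""
    return sum(pd * math.log(dk) + pg * math.log(1.0 - dk)
               for pd, pg, dk in zip(p_data, p_gen, d))


def optimal_discriminator(p_data, p_gen):
    """Pointwise maximiser of V for a fixed generator distribution."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]


p_data = [0.5, 0.3, 0.2]      # toy data distribution (assumed)
p_matched = [0.5, 0.3, 0.2]   # generator that has recovered p_data
p_off = [0.2, 0.3, 0.5]       # generator that has not

d_matched = optimal_discriminator(p_data, p_matched)
d_off = optimal_discriminator(p_data, p_off)

# With a perfect generator the best discriminator outputs 1/2 everywhere
# and the value of the game is -log 4.
print(d_matched, gan_value(p_data, p_matched, d_matched))
# With an imperfect generator the best discriminator achieves a higher value.
print(d_off, gan_value(p_data, p_off, d_off))
```

The gap between the two printed values is what the generator reduces by moving its distribution toward the data distribution.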

38,211 citations

Cites background from "A fast learning algorithm for deep ..."

...Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers....
[...]
...An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants....
[...]

Book•

Deep Learning

Ian Goodfellow¹, Yoshua Bengio², Aaron Courville²•Institutions (2)

Google¹, Université de Montréal²

18 Nov 2016

TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.

Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Book•

Reinforcement Learning: An Introduction

Richard S. Sutton¹, Andrew G. Barto•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 1988

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, with discussion ranging from the history of the field's intellectual foundations to the most recent developments and applications.

Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations

Journal Article•

Dropout: a simple way to prevent neural networks from overfitting

Nitish Srivastava¹, Geoffrey E. Hinton¹, Alex Krizhevsky¹, Ilya Sutskever¹, Ruslan Salakhutdinov¹•Institutions (1)

University of Toronto¹

01 Jan 2014-Journal of Machine Learning Research

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

33,597 citations

Cites methods from "A fast learning algorithm for deep ..."

...2 Learning Dropout RBMs Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al., 2006) can be directly applied for learning Dropout RBMs....
[...]

References

Journal Article•DOI•

Knowledge transfer in deep convolutional neural nets

Steven Gutstein¹, Olac Fuentes¹, Eric Freudenthal¹•Institutions (1)

University of Texas at El Paso¹

01 Jun 2008-International Journal on Artificial Intelligence Tools

TL;DR: In this paper, the authors apply knowledge transfer to deep convolutional neural nets, which they argue are particularly well suited for knowledge transfer, and demonstrate that components of a trained deep CNN can constructively transfer information to another such CNN.

Abstract: Knowledge transfer is widely held to be a primary mechanism that enables humans to quickly learn new complex concepts when given only small training sets. In this paper, we apply knowledge transfer to deep convolutional neural nets, which we argue are particularly well suited for knowledge transfer. Our initial results demonstrate that components of a trained deep convolutional neural net can constructively transfer information to another such net. Furthermore, this transfer is completed in such a way that one can envision creating a net that could learn new concepts throughout its lifetime. The experiments we performed involved training a Deep Convolutional Neural Net (DCNN) on a large training set containing 20 different classes of handwritten characters from the NIST Special Database 19. This net was then used as a foundation for training a new net on a set of 20 different character classes from the NIST Special Database 19. The new net would keep the bottom layers of the old net (i.e. those nearest to the input) and only allow the top layers to train on the new character classes. We purposely used small training sets for the new net to force it to rely as much as possible upon transferred knowledge as opposed to a large and varied training set to learn the new set of hand written characters. Our results show a clear advantage in relying upon transferred knowledge to learn new tasks when given small training sets, if the new tasks are sufficiently similar to the previously mastered one. However, this advantage decreases as training sets increase in size.

54 citations

Journal Article•DOI•

Intuition, Insight, Imagination and Creativity

Włodzisław Duch¹•Institutions (1)

Nicolaus Copernicus University in Toruń¹

01 Aug 2007-IEEE Computational Intelligence Magazine

TL;DR: Three factors are essential for creativity in invention of novel words: knowledge of word morphology captured in network connections, imagination constrained by this knowledge, and filtering of results that selects the most interesting novel words.

Abstract: Can computers have intuition and insights, and be creative? Neurocognitive models inspired by the putative processes in the brain show that these mysterious features are a consequence of information processing in complex networks. Intuition is manifested in categorization based on evaluation of similarity, when decision borders are too complex to be reduced to logical rules. It is also manifested in heuristic reasoning based on partial observations, where network activity selects only those paths that may lead to solution, excluding all bad moves. Insight results from reasoning at the higher, non-verbal level of abstraction that comes from involvement of the right hemisphere networks forming large "linguistic receptive fields." Three factors are essential for creativity in invention of novel words: knowledge of word morphology captured in network connections, imagination constrained by this knowledge, and filtering of results that selects the most interesting novel words. These principles have been implemented using a simple correlation-based algorithm for auto-associative memory. Results are surprisingly similar to those created by humans.

54 citations

Journal Article•DOI•

Visual Recognition and Inference Using Dynamic Overcomplete Sparse Learning

Joseph F. Murray¹, Kenneth Kreutz-Delgado²•Institutions (2)

Massachusetts Institute of Technology¹, University of California, San Diego²

01 Sep 2007-Neural Computation

TL;DR: It is shown that increasing the degree of overcompleteness improves recognition performance in difficult scenes with occluded objects in clutter.

Abstract: We present a hierarchical architecture and learning algorithm for visual recognition and other visual inference tasks such as imagination, reconstruction of occluded images, and expectation-driven segmentation. Using properties of biological vision for guidance, we posit a stochastic generative world model and from it develop a simplified world model (SWM) based on a tractable variational approximation that is designed to enforce sparse coding. Recent developments in computational methods for learning overcomplete representations (Lewicki & Sejnowski, 2000; Teh, Welling, Osindero, & Hinton, 2003) suggest that overcompleteness can be useful for visual tasks, and we use an overcomplete dictionary learning algorithm (Kreutz-Delgado, et al., 2003) as a preprocessing stage to produce accurate, sparse codings of images. Inference is performed by constructing a dynamic multilayer network with feedforward, feedback, and lateral connections, which is trained to approximate the SWM. Learning is done with a variant of the back-propagation-through-time algorithm, which encourages convergence to desired states within a fixed number of iterations. Vision tasks require large networks, and to make learning efficient, we take advantage of the sparsity of each layer to update only a small subset of elements in a large weight matrix at each iteration. Experiments on a set of rotated objects demonstrate various types of visual inference and show that increasing the degree of overcompleteness improves recognition performance in difficult scenes with occluded objects in clutter.

40 citations

Journal Article•DOI•

The Bayesian revolution approaches psychological development.

Thomas R. Shultz¹•Institutions (1)

McGill University¹

01 May 2007-Developmental Science

TL;DR: The reviewed work extends the current Bayesian revolution into tasks often studied in children, such as causal learning and word learning, and provides evidence that children's performance can be optimal in a Bayesian sense.

Abstract: This commentary reviews five articles that apply Bayesian ideas to psychological development, some with psychology experiments, some with computational modeling, and some with both experiments and modeling. The reviewed work extends the current Bayesian revolution into tasks often studied in children, such as causal learning and word learning, and provides evidence that children's performance can be optimal in a Bayesian sense. There remains much to be done in terms of understanding how representations are created, how development occurs, how Bayesian computation might be neurally implemented, and in reconciling the new work with older evidence that even skilled adults are incompetent Bayesians.

33 citations

Diffusion Networks, Products of Experts, and Factor Analysis

Tim K. Marks¹, Javier R. Movellan¹•Institutions (1)

University of California, San Diego¹

01 Jan 2001

TL;DR: It is shown that when the unit activation functions are linear, this PoE architecture is equivalent to a factor analyzer, which suggests novel non-linear generalizations of factor analysis and independent component analysis that could be implemented using interactive neural circuitry.

Abstract: Hinton (in press) recently proposed a learning algorithm called contrastive divergence learning for a class of probabilistic models called product of experts (PoE). Whereas in standard mixture models the “beliefs” of individual experts are averaged, in PoEs the “beliefs” are multiplied together and then renormalized. One advantage of this approach is that the combined beliefs can be much sharper than the individual beliefs of each expert. It has been shown that a restricted version of the Boltzmann machine, in which there are no lateral connections between hidden units or between observation units, is a PoE. In this paper we generalize these results to diffusion networks, a continuous-time, continuous-state version of the Boltzmann machine. We show that when the unit activation functions are linear, this PoE architecture is equivalent to a factor analyzer. This result suggests novel non-linear generalizations of factor analysis and independent component analysis that could be implemented using interactive neural circuitry.
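
The contrast the abstract draws between averaging beliefs (a mixture) and multiplying then renormalizing them (a product of experts) can be seen on a small discrete example. The two expert distributions and the function names below are assumptions chosen for illustration; they are not taken from the paper.

```python
def mixture(experts):
    """Average the experts' distributions outcome by outcome."""
    n = len(experts)
    return [sum(e[k] for e in experts) / n for k in range(len(experts[0]))]


def product_of_experts(experts):
    """Multiply the experts' probabilities, then renormalize to sum to 1."""
    unnorm = [1.0] * len(experts[0])
    for e in experts:
        unnorm = [u * p for u, p in zip(unnorm, e)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Two experts that agree on outcome 0 but disagree about the alternatives.
experts = [
    [0.6, 0.3, 0.1],
    [0.6, 0.1, 0.3],
]

mix = mixture(experts)
poe = product_of_experts(experts)
print("mixture:", mix)
print("product:", poe)
# The product puts more mass on outcome 0 than either expert does alone,
# while the mixture can never exceed the most confident expert.
print(poe[0] > max(e[0] for e in experts), mix[0] <= max(e[0] for e in experts))
```

An outcome keeps high probability under the product only if no expert assigns it low probability, which is the sense in which the combined belief is sharper.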

32 citations

"A fast learning algorithm for deep ..." refers background in this paper

...Marks and Movellan (2001) describe a way of using contrastive divergence to perform factor analysis and Welling, Rosen-Zvi, and Hinton (2005) show that a network with logistic, binary visible units and linear, gaussian hidden units can be used for rapid document retrieval....
[...]