
Learning with Hierarchical-Deep Models

TL;DR: Efficient learning and inference algorithms for the HDP-DBM model are presented and it is shown that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian (HB) models. Specifically, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a deep Boltzmann machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

Summary

1 INTRODUCTION

  • THE ability to learn abstract representations that support transfer to novel but related tasks lies at the core of many problems in computer vision, natural language processing, cognitive science, and machine learning.
  • In contrast, the authors argue that learning new classes from a handful of training examples will be easier in architectures that can explicitly identify only a small number of degrees of freedom (latent variables and parameters) that are relevant to the new concept being learned, and thereby achieve more appropriate and flexible transfer of learned representations to new tasks.
  • Unlike deep networks, these HB models explicitly represent category hierarchies that admit sharing the appropriate abstract knowledge about the new class’s parameters via a prior abstracted from related classes.
  • They typically rely on domain-specific hand-crafted features [2], [11] (e.g., GIST, SIFT features in computer vision, MFCC features in speech perception domains).
  • Their approach was not ideal as a generic approach to transfer learning with few examples.

2 DEEP BOLTZMANN MACHINES

  • There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer.
  • The probability that the model assigns to a visible vector $\mathbf{v}$ is given by the Boltzmann distribution: $P(\mathbf{v};\theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)};\theta)\big)$ (1). Observe that setting both $W^{(2)} = 0$ and $W^{(3)} = 0$ recovers the simpler Restricted Boltzmann Machine (RBM) model.
  • The exact computation of the data-dependent expectation takes time that is exponential in the number of hidden units, whereas the exact computation of the model's expectation takes time that is exponential in the number of hidden and visible units.
  • The authors omit the bias terms for clarity of presentation.

2.1 Approximate Learning

  • The original learning algorithm for Boltzmann machines used randomly initialized Markov chains to approximate both expectations to estimate gradients of the likelihood function [14].
  • Recently, Salakhutdinov and Hinton [29] proposed a variational approach, where mean-field inference is used to estimate data-dependent expectations and an MCMC-based stochastic approximation procedure is used to approximate the model's expected sufficient statistics.

2.1.1 A Variational Approach to Estimating the Data-Dependent Statistics

  • Then the log-likelihood of the DBM model has the following variational lower bound: $\log P(\mathbf{v};\theta) \ge \sum_{\mathbf{h}} Q(\mathbf{h}\mid\mathbf{v};\mu)\log P(\mathbf{v},\mathbf{h};\theta) + \mathcal{H}(Q) = \log P(\mathbf{v};\theta) - \mathrm{KL}\big(Q(\mathbf{h}\mid\mathbf{v};\mu)\,\|\,P(\mathbf{h}\mid\mathbf{v};\theta)\big)$ (4), where $\mathcal{H}(\cdot)$ is the entropy functional and $\mathrm{KL}(Q\|P)$ denotes the Kullback-Leibler divergence between the two distributions.
  • The bound becomes tight if and only if $Q(\mathbf{h}\mid\mathbf{v};\mu) = P(\mathbf{h}\mid\mathbf{v};\theta)$. Variational learning has the nice property that in addition to maximizing the log-likelihood of the data, it also attempts to find parameters that minimize the Kullback-Leibler divergence between the approximating and true posteriors.
  • To solve these fixed-point equations, the authors simply cycle through layers, updating the mean-field parameters within a single layer.
  • Note the close connection between the form of the mean-field fixed-point updates and the form of the conditional distribution defined by (2).

2.1.2 A Stochastic Approximation Approach for Estimating the Data-Independent Statistics

  • Given the variational parameters $\mu$, the model parameters $\theta$ are then updated to maximize the variational bound using an MCMC-based stochastic approximation [29], [39], [46].
  • Implementing the mean-field requires no extra work beyond implementing the Gibbs sampler.
  • Given $x_t$, sample a new state $x_{t+1}$ from the transition operator $T_{\theta_t}(x_{t+1} \leftarrow x_t)$ that leaves $P(\cdot;\theta_t)$ invariant.
  • The overall learning procedure for DBMs is summarized in Algorithm 1.
  • Together with the condition on the learning rate, this ensures almost sure convergence of the stochastic approximation algorithm to an asymptotically stable point.

2.1.3 Greedy Layerwise Pretraining of DBMs

  • The learning procedure for DBMs described above can be used by starting with randomly initialized weights, but it works much better if the weights are initialized sensibly.
  • The authors therefore use a greedy layerwise pretraining strategy by learning a stack of modified RBMs (for details see [29]).
  • This fast approximate inference is then used to initialize the mean-field, which then converges much faster than mean-field with random initialization.

2.2 Gaussian-Bernoulli DBMs

  • The authors now briefly describe a Gaussian-Bernoulli DBM model, which they will use to model real-valued data, such as images of natural scenes and motion capture data.
  • Gaussian-Bernoulli DBMs represent a generalization of a simpler class of models, called Gaussian-Bernoulli RBMs, which have been successfully applied to various tasks, including image classification, video action recognition, and speech recognition [17], [20], [23], [35].
  • In practice, however, instead of learning $\sigma^2$, one would typically use a fixed, predetermined value for $\sigma^2$ [13], [24].

2.3 Multinomial DBMs

  • To allow DBMs to express more information and introduce more structured hierarchical priors, the authors will use a conditional multinomial distribution to model activities of the top-level units $\mathbf{h}^{(3)}$.
  • The code for pretraining and generative learning of the DBM model is available at http://www.utstat.toronto.edu/~rsalakhu/DBM.html.
  • A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times from the conditional distribution of (13).
  • A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.

3 COMPOUND HDP-DBM MODEL

  • After a DBM model has been learned, the authors have an undirected model that defines the joint distribution $P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)})$.
  • One way to express what has been learned is the conditional model $P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}\mid\mathbf{h}^{(3)})$ and a complicated prior term $P(\mathbf{h}^{(3)})$, defined by the DBM model.
  • The authors can therefore rewrite the variational bound as $\log P(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)}} Q(\mathbf{h}\mid\mathbf{v};\mu)\log P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}\mid\mathbf{h}^{(3)}) + \mathcal{H}(Q) + \sum_{\mathbf{h}^{(3)}} Q(\mathbf{h}^{(3)}\mid\mathbf{v};\mu)\log P(\mathbf{h}^{(3)})$ (14).
  • This particular decomposition lies at the core of the greedy recursive pretraining algorithm: Instead of adding an additional undirected layer (e.g., an RBM) to model $P(\mathbf{h}^{(3)})$, the authors can place an HDP prior over $\mathbf{h}^{(3)}$ that will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.

3.1 A Hierarchical Bayesian Prior

  • In their compound HDP-DBM model, the authors will use a hierarchical topic model as a prior over the activities of the DBM’s top-level features.
  • $h^{(3)}_{in} \mid x_{in}, \theta_{x_{in}} \sim \mathrm{Mult}(1, \theta_{x_{in}})$ for each word $i = 1,\ldots,M$, where $\pi^{(g)}$ is the global distribution over topics, $\theta^{(g)}$ is the global distribution over K words, and $\alpha$ and $\beta$ are concentration parameters.
  • Let us further assume that their model is presented with a fixed two-level category hierarchy.
  • These high-level features in turn define a topic-specific distribution over $\mathbf{h}^{(3)}$ features, or “words,” in their DBM model.
  • For a fixed number of topics T , the above model represents a hierarchical extension of the latent Dirichlet allocation (LDA) model [4].

3.2 Modeling the Number of Supercategories

  • So far the authors have assumed that their model is presented with a two-level partition $z = \{z^s, z^b\}$ that defines a fixed two-level tree hierarchy.
  • The authors note that this model corresponds to a standard HDP model that assumes a fixed hierarchy for sharing parameters.
  • The authors place a nonparametric two-level nested Chinese restaurant prior (CRP) [5] over z, which defines a prior over tree structures and is flexible enough to learn arbitrary hierarchies.
  • The main building block of the nested CRP is the Chinese restaurant process (CRP), a distribution over partitions of integers; a minimal sampling sketch follows this list.
  • As the authors show in the experimental results section, both sharing higher level features and forming coherent hierarchies play a crucial role in the ability of the model to generalize well from one or few examples of a novel category.
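A minimal sketch of the single-level Chinese restaurant process may help make this concrete; the nested CRP simply applies the same construction recursively to build a tree of supercategories and basic-level categories. This is an illustrative sketch, not the authors' code, and `gamma` is a generic concentration parameter.

```python
import numpy as np

def crp_partition(n_customers, gamma=1.0, seed=0):
    """Sample a partition of 1..n from a Chinese restaurant process:
    customer i joins an existing table with probability proportional to
    its size, or opens a new table with probability proportional to gamma."""
    rng = np.random.default_rng(seed)
    tables = []            # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        probs = np.array(tables + [gamma], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # open a new table (a brand-new category)
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments

# Example: partition 10 "objects" into categories.
print(crp_partition(10))
```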

4 INFERENCE

  • Inferences about model parameters at all levels of hierarchy can be performed by MCMC.
  • The sampler alternates between: 1) sampling cluster indices $x_{in}$ using Gibbs updates in the Chinese restaurant franchise (CRF) representation of the HDP; and 2) sampling the weights at all three levels conditioned on $x$ using the usual posterior of a DP (see the simplified sketch after this list).
  • The speedup could be substantial, particularly as the number of the basic-level categories becomes large.
  • In their conjugate setting, parameters can be further integrated out.
  • Finally, conditioned on the states of $\mathbf{h}^{(3)}$, the authors can further fine-tune the low-level DBM parameters $\{W^{(1)}, W^{(2)}, W^{(3)}\}$ by applying approximate maximum likelihood learning (see Section 2) to the conditional DBM model of (15).
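The Chinese-restaurant-franchise updates above are intricate; the following is a deliberately simplified, runnable stand-in that shows only the flavor of step 1: collapsed Gibbs resampling of topic indices in a finite topic model for a single document. All names and hyperparameters are illustrative, and the full HDP-DBM sampler additionally resamples the DP weights, the category tree, and the DBM parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_topic_assignments(words, T, K, alpha=1.0, beta=0.1, n_sweeps=50):
    """Collapsed Gibbs sampling of topic indices for one document in a
    simplified finite topic model (a stand-in for the CRF updates of the
    HDP described above). `words` is a list of word ids in {0..K-1}."""
    x = rng.integers(0, T, len(words))             # initial topic of each word
    doc_topic = np.bincount(x, minlength=T).astype(float)
    topic_word = np.full((T, K), beta)             # smoothed topic-word counts
    for i, w in enumerate(words):
        topic_word[x[i], w] += 1
    for _ in range(n_sweeps):
        for i, w in enumerate(words):
            doc_topic[x[i]] -= 1                   # remove word i's assignment
            topic_word[x[i], w] -= 1
            p = (doc_topic + alpha) * topic_word[:, w] / topic_word.sum(axis=1)
            p /= p.sum()
            x[i] = rng.choice(T, p=p)              # resample its topic
            doc_topic[x[i]] += 1
            topic_word[x[i], w] += 1
    return x

x = gibbs_topic_assignments(words=[0, 3, 3, 7, 1, 0], T=3, K=10)
```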

4.1 Making Predictions

  • Given a test input $\mathbf{v}_t$, the authors can quickly infer the approximate posterior over $\mathbf{h}^{(3)}_t$ using the mean-field of (6), followed by running the full Gibbs sampler to get approximate samples from the posterior over the category assignments.
  • Instead of integrating out the document-specific DP, combining this likelihood term with the nCRP prior $P(z_t \mid z_{-t})$ of (19) allows the authors to efficiently infer an approximate posterior over category assignments (a toy version is sketched after this list).
  • In all of their experimental results, computing this approximate posterior takes a fraction of a second, which is crucial for applications, such as object recognition or information retrieval.
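For intuition only, a hedged toy sketch of this prediction step is given below: it scores a test input's top-level feature counts under each known category's word distribution and combines the scores with a CRP-style prior that also reserves mass for a brand-new category. The array names and the uniform treatment of the new category are simplifying assumptions, not the paper's exact computation.

```python
import numpy as np

def category_posterior(test_counts, cat_word_probs, cat_sizes, gamma=1.0):
    """Approximate posterior over categories for one test 'document'.
    test_counts: length-K vector of top-level feature counts.
    cat_word_probs: (C, K) word distribution per known category.
    cat_sizes: number of training objects per known category."""
    log_like = test_counts @ np.log(cat_word_probs.T)      # multinomial log-lik.
    log_prior = np.log(np.append(cat_sizes, gamma))        # CRP-style prior
    # A new category is scored with a uniform word distribution here; the
    # full model integrates over the HDP prior instead.
    K = cat_word_probs.shape[1]
    log_like = np.append(log_like, test_counts.sum() * -np.log(K))
    log_post = log_like + log_prior
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: 3 known categories over a vocabulary of K = 5 top-level features.
rng = np.random.default_rng(0)
cat_word_probs = rng.dirichlet(np.ones(5), size=3)
post = category_posterior(np.array([4., 0., 1., 2., 0.]), cat_word_probs,
                          cat_sizes=np.array([20., 5., 2.]))
```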

5 EXPERIMENTS

  • The authors present experimental results on the CIFAR-100 [17], handwritten character [18], and human motion capture recognition datasets.
  • For all datasets, the authors first pretrain a DBM model in unsupervised fashion on raw sensory input (e.g., pixels, or three-dimensional joint angles), followed by fitting an HDP prior which is run for 200 Gibbs sweeps.
  • This was sufficient to obtain good performance.
  • Across all datasets, the authors also assume that the basic-level category labels are given, but no supercategory labels are available.
  • The first two models, stand-alone DBMs and DBNs [12], used three layers of hidden variables and were pretrained using a stack of RBMs.

5.1 CIFAR-100 Data Set

  • Fig. 3 displays a random subset of the training data, first and second layer DBM features, as well as higher level class-sensitive features, or topics, learned by the HDP model.
  • The results are averaged over 100 classes using “leave-one-out” test format.
  • Table 1 also shows that fine-tuning parameters of all layers jointly as well as learning supercategory hierarchy significantly improves model performance.
  • Despite variations in viewpoint and cluttered backgrounds, the model is able to capture the overall structure of each class.

5.2 Handwritten Characters

  • The handwritten characters dataset [18] can be viewed as the “transpose” of the standard MNIST dataset.
  • The results are averaged over 200 characters chosen at random, using the “leave-one-out” test format.
  • This result demonstrates that the HDP-DBM model is able to successfully transfer an appropriate prior over higher-level “strokes” from previously learned categories.
  • Each panel shows three figures: 1) three training examples of a novel character class, 2) 12 synthesized examples of that class, and 3) samples of the training characters in the same supercategory that the novel character has been grouped under.

5.3 Motion Capture

  • Results on the CIFAR and Character datasets show that the HDP-DBM model can significantly outperform many other models on object and character recognition tasks.
  • Features at all levels of the hierarchy were learned without assuming any image-specific priors, and the proposed model can be applied in a wide variety of application domains.
  • The authors show that the HDP-DBM model can be applied to modeling human motion capture data.
  • There are 2,500 frames of each style at 60fps, where each time step was represented by a vector of 58 real-valued numbers.
  • Using the “leave-one-out” test format, Table 1 shows that the HDP-DBM model performs much better than other models when discriminating between the nine existing walking styles and a novel walking style.

6 CONCLUSIONS

  • The authors developed a compositional architecture that learns an HDP prior over the activities of top-level features of the DBM model.
  • The resulting compound HDP-DBM model is able to learn low-level features from raw, high-dimensional sensory input, high-level features, as well as a category hierarchy for parameter sharing.
  • The experimental results show that the proposed model can acquire new concepts from very few examples in a diverse set of application domains.
  • The compositional model considered in this paper was directly inspired by the architecture of the DBM and HDP, but it need not be.
  • Indeed, any other deep learning module, including DBNs, sparse autoencoders, or any other HB model, can be adapted.


Learning with Hierarchical-Deep Models
Ruslan Salakhutdinov, Joshua B. Tenenbaum, and Antonio Torralba, Member, IEEE
Abstract—We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning
models with structured hierarchical Bayesian (HB) models. Specifically, we show how we can learn a hierarchical Dirichlet process
(HDP) prior over the activities of the top-level features in a deep Boltzmann machine (DBM). This compound HDP-DBM model learns
to learn novel concepts from very few training examples by learning low-level generic features, high-level features that capture
correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of
different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to
learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion
capture datasets.
Index Terms—Deep networks, deep Boltzmann machines, hierarchical Bayesian models, one-shot learning
1 INTRODUCTION

THE ability to learn abstract representations that support
transfer to novel but related tasks lies at the core of
many problems in computer vision, natural language
processing, cognitive science, and machine learning. In
typical applications of machine classification algorithms
today, learning a new concept requires tens, hundreds, or
thousands of training examples. For human learners,
however, just one or a few examples are often sufficient
to grasp a new category and make meaningful general-
izations to novel instances [15], [25], [31], [44]. Clearly, this
requires very strong but also appropriately tuned inductive
biases. The architecture we describe here takes a step
toward this ability by learning several forms of abstract
knowledge at different levels of abstraction that support
transfer of useful inductive biases from previously learned
concepts to novel ones.
We call our architectures compound HD models, where
“HD” stands for “Hierarchical-Deep,” because they are
derived by composing hierarchical nonparametric Bayesian
models with deep networks, two influential approaches
from the recent unsupervised learning literature with
complementary strengths. Recently introduced deep learn-
ing models, including deep belief networks (DBNs) [12],
deep Boltzmann machines (DBM) [29], deep autoencoders
[19], and many others [9], [10], [21], [22], [26], [32], [34], [43],
have been shown to learn useful distributed feature
representations for many high-dimensional datasets. The
ability to automatically learn in multiple layers allows deep
models to construct sophisticated domain-specific features
without the need to rely on precise human-crafted input
representations, increasingly important with the prolifera-
tion of datasets and application domains.
While the features learned by deep models can enable
more rapid and accurate classification learning, deep
networks themselves are not well suited to learning novel
classes from few examples. All units and parameters at all
levels of the network are engaged in representing any given
input (“distributed representations”), and are adjusted
together during learning. In contrast, we argue that learning
new classes from a handful of training examples will be
easier in architectures that can explicitly identify only a
small number of degrees of freedom (latent variables and
parameters) that are relevant to the new concept being
learned, and thereby achieve more appropriate and flexible
transfer of learned representations to new tasks. This ability
is the hallmark of hierarchical Bayesian (HB) models,
recently proposed in computer vision, statistics, and
cognitive science [8], [11], [15], [28], [44] for learning from
few examples. Unlike deep networks, these HB models
explicitly represent category hierarchies that admit sharing
the appropriate abstract knowledge about the new class’s
parameters via a prior abstracted from related classes. HB
approaches, however, have complementary weaknesses
relative to deep networks. They typically rely on domain-
specific hand-crafted features [2], [11] (e.g., GIST, SIFT
features in computer vision, MFCC features in speech
perception domains). Committing to the a-priori defined
feature representations, instead of learning them from data,
can be detrimental. This is especially important when
learning complex tasks, as it is often difficult to hand-craft
high-level features explicitly in terms of raw sensory input.
Moreover, many HB approaches often assume a fixed
hierarchy for sharing parameters [6], [33] instead of
discovering how parameters are shared among classes in
an unsupervised fashion.
In this paper, we propose compound HD architectures
that integrate these deep models with structured HB
models.
. R. Salakhutdinov is with the Department of Statistics and Computer
Science, University of Toronto, Toronto, ON M5S 3G3, Canada.
E-mail: rsalakhu@utstat.toronto.edu.
. J.B. Tenenbaum is with the Department of Brain and Cognitive Sciences,
Massachusetts Institute of Technology, Cambridge, MA 02139.
E-mail: jbt@mit.edu.
. A. Torralba is with the Computer Science and Artificial Intelligence
Laboratory, Massachusetts Institute of Technology, Cambridge, MA
02139. E-mail: torralba@mit.edu.
Manuscript received 18 Apr. 2012; revised 30 Aug. 2012; accepted 30 Nov.
2012; published online 19 Dec. 2012.
Recommended for acceptance by M. Welling.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2012-04-0302.
Digital Object Identifier no. 10.1109/TPAMI.2012.269.

In particular, we show how we can learn a
hierarchical Dirichlet process (HDP) prior over the activities
of the top-level features in a DBM, coming to represent both
a layered hierarchy of increasingly abstract features and a
tree-structured hierarchy of classes. Our model depends
minimally on domain-specific representations and achieves
state-of-the-art performance by unsupervised discovery of
three components: 1) low-level features that abstract from
the raw high-dimensional sensory input (e.g., pixels, or
three-dimensional joint angles) and provide a useful first
representation for all concepts in a given domain; 2) high-
level part-like features that express the distinctive percep-
tual structure of a specific class, in terms of class-specific
correlations over low-level features; and 3) a hierarchy of
superclasses for sharing abstract knowledge among related
classes via a prior on which higher level features are likely
to be distinctive for classes of a certain kind and are thus
likely to support learning new concepts of that kind.
We evaluate the compound HDP-DBM model on three
different perceptual domains. We also illustrate the
advantages of having a full generative model, extending
from highly abstract concepts all the way down to sensory
inputs: We cannot only generalize class labels but also
synthesize new examples in novel classes that look reason-
ably natural, and we can significantly improve classification
performance by learning parameters at all levels jointly by
maximizing a joint log-probability score.
There have also been several approaches in the computer
vision community addressing the problem of learning with
few examples. Torralba et al. [42] proposed using several
boosted detectors in a multitask setting, where features are
shared between several categories. Bart and Ullman [3]
further proposed a cross-generalization framework for
learning with few examples. Their key assumption is that
new features for a novel category are selected from the pool of
features that was useful for previously learned classification
tasks. In contrast to our work, the above approaches are
discriminative by nature and do not attempt to identify
similar or relevant categories. Babenko et al. [1] used a
boosting approach that simultaneously groups together
categories into several supercategories, sharing a similarity
metric within these classes. They, however, did not attempt to
address transfer learning problem, and primarily focused on
large-scale image retrieval tasks. Finally, Fei-Fei et al. [11]
used an HB approach, with a prior on the parameters of
new categories that was induced from other categories.
However, their approach was not ideal as a generic
approach to transfer learning with few examples. They
learned only a single prior shared across all categories.
The prior was learned from only three categories, chosen
by hand. Compared to our work, they used a more
elaborate visual object model, based on multiple parts
with separate appearance and shape components.
2 DEEP BOLTZMANN MACHINES

A DBM is a network of symmetrically coupled stochastic binary units. It contains a set of visible units $\mathbf{v} \in \{0,1\}^D$ and a sequence of layers of hidden units $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}, \mathbf{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \mathbf{h}^{(L)} \in \{0,1\}^{F_L}$. There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer. Consider a DBM with three hidden layers¹ (i.e., $L = 3$). The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}\}$ is defined as

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k,$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ represents the set of hidden units and $\theta = \{W^{(1)}, W^{(2)}, W^{(3)}\}$ are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms.²

The probability that the model assigns to a visible vector $\mathbf{v}$ is given by the Boltzmann distribution:

$$P(\mathbf{v}; \theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}; \theta)\big). \qquad (1)$$

Observe that setting both $W^{(2)} = 0$ and $W^{(3)} = 0$ recovers the simpler Restricted Boltzmann Machine (RBM) model.
The conditional distributions over the visible and the three sets of hidden units are given by

$$p\big(h^{(1)}_j = 1 \mid \mathbf{v}, \mathbf{h}^{(2)}\big) = g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} h^{(2)}_l\right),$$
$$p\big(h^{(2)}_l = 1 \mid \mathbf{h}^{(1)}, \mathbf{h}^{(3)}\big) = g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} h^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} h^{(3)}_k\right),$$
$$p\big(h^{(3)}_k = 1 \mid \mathbf{h}^{(2)}\big) = g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} h^{(2)}_l\right),$$
$$p\big(v_i = 1 \mid \mathbf{h}^{(1)}\big) = g\left(\sum_{j=1}^{F_1} W^{(1)}_{ij} h^{(1)}_j\right), \qquad (2)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function.
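To make the conditionals in (2) concrete, the following is a minimal NumPy sketch (not the authors' released code; sizes and variable names are illustrative, and bias terms are omitted as in the text) of one Gibbs sweep through a three-hidden-layer binary DBM.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Logistic function g(x) = 1 / (1 + exp(-x)) from (2).
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, h3, W1, W2, W3):
    """One Gibbs sweep over a 3-hidden-layer binary DBM, using the
    conditionals in (2). Bias terms are omitted, as in the text."""
    # Odd layers h1 and h3 depend on the even layers (v, h2).
    h1 = rng.random(h1.shape) < sigmoid(v @ W1 + h2 @ W2.T)
    h3 = rng.random(h3.shape) < sigmoid(h2 @ W3)
    # Even layers: visible units and the middle hidden layer.
    v  = rng.random(v.shape)  < sigmoid(h1 @ W1.T)
    h2 = rng.random(h2.shape) < sigmoid(h1 @ W2 + h3 @ W3.T)
    return v.astype(float), h1.astype(float), h2.astype(float), h3.astype(float)

# Toy sizes: D visible units, F1/F2/F3 hidden units per layer.
D, F1, F2, F3 = 16, 12, 8, 4
W1, W2, W3 = (0.01 * rng.standard_normal(s)
              for s in [(D, F1), (F1, F2), (F2, F3)])
v = rng.integers(0, 2, D).astype(float)
h1, h2, h3 = np.zeros(F1), np.zeros(F2), np.zeros(F3)
v, h1, h2, h3 = gibbs_sweep(v, h1, h2, h3, W1, W2, W3)
```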
The derivative of the log-likelihood with respect to the model parameters can be obtained from (1):

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(1)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{v}\,\mathbf{h}^{(1)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{v}\,\mathbf{h}^{(1)\top}\big],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(2)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{h}^{(1)}\mathbf{h}^{(2)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{h}^{(1)}\mathbf{h}^{(2)\top}\big],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(3)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{h}^{(2)}\mathbf{h}^{(3)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{h}^{(2)}\mathbf{h}^{(3)\top}\big], \qquad (3)$$

where $\mathbb{E}_{P_{\mathrm{data}}}[\cdot]$ denotes an expectation with respect to the completed data distribution

$$P_{\mathrm{data}}(\mathbf{h}, \mathbf{v}; \theta) = P(\mathbf{h} \mid \mathbf{v}; \theta)\, P_{\mathrm{data}}(\mathbf{v}),$$

with $P_{\mathrm{data}}(\mathbf{v}) = \frac{1}{N}\sum_n \delta(\mathbf{v} - \mathbf{v}_n)$ representing the empirical distribution, and $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ is an expectation with respect to the distribution defined by the model (1). We will sometimes refer to $\mathbb{E}_{P_{\mathrm{data}}}[\cdot]$ as the data-dependent expectation and $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ as the model's expectation.
Exact maximum likelihood learning in this model is intractable. The exact computation of the data-dependent expectation takes time that is exponential in the number of hidden units, whereas the exact computation of the model's expectation takes time that is exponential in the number of hidden and visible units.

¹ For clarity, we use three hidden layers. Extensions to models with more than three layers are trivial.
² We have omitted the bias terms for clarity of presentation. Biases are equivalent to weights on a connection to a unit whose state is fixed at 1.
2.1 Approximate Learning
The original learning algorithm for Boltzmann machines
used randomly initialized Markov chains to approximate
both expectations to estimate gradients of the likelihood
function [14]. However, this learning procedure is too slow
to be practical. Recently, Salakhutdinov and Hinton [29]
proposed a variational approach, where mean-field infer-
ence is used to estimate data-dependent expectations and an
MCMC-based stochastic approximation procedure is used
to approximate the model's expected sufficient statistics.
2.1.1 A Variational Approach to Estimating the
Data-Dependent Statistics
Consider any approximating distribution $Q(\mathbf{h} \mid \mathbf{v}; \mu)$, parametrized by a vector of parameters $\mu$, for the posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$. Then the log-likelihood of the DBM model has the following variational lower bound:

$$\log P(\mathbf{v}; \theta) \;\ge\; \sum_{\mathbf{h}} Q(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}(Q) \;=\; \log P(\mathbf{v}; \theta) - \mathrm{KL}\big(Q(\mathbf{h} \mid \mathbf{v}; \mu) \,\|\, P(\mathbf{h} \mid \mathbf{v}; \theta)\big), \qquad (4)$$

where $\mathcal{H}(\cdot)$ is the entropy functional and $\mathrm{KL}(Q \| P)$ denotes the Kullback-Leibler divergence between the two distributions. The bound becomes tight if and only if $Q(\mathbf{h} \mid \mathbf{v}; \mu) = P(\mathbf{h} \mid \mathbf{v}; \theta)$.

Variational learning has the nice property that in addition to maximizing the log-likelihood of the data, it also attempts to find parameters that minimize the Kullback-Leibler divergence between the approximating and true posteriors.

For simplicity and speed, we approximate the true posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$ with a fully factorized approximating distribution over the three sets of hidden units, which corresponds to the so-called mean-field approximation:

$$Q^{MF}(\mathbf{h} \mid \mathbf{v}; \mu) = \prod_{j=1}^{F_1} \prod_{l=1}^{F_2} \prod_{k=1}^{F_3} q\big(h^{(1)}_j \mid \mathbf{v}\big)\, q\big(h^{(2)}_l \mid \mathbf{v}\big)\, q\big(h^{(3)}_k \mid \mathbf{v}\big), \qquad (5)$$

where $\mu = \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}\}$ are the mean-field parameters with $q(h^{(l)}_i = 1) = \mu^{(l)}_i$ for $l = 1, 2, 3$. In this case, the variational lower bound on the log-probability of the data takes a particularly simple form:

$$\log P(\mathbf{v}; \theta) \ge \sum_{\mathbf{h}} Q^{MF}(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}(Q^{MF})$$
$$= \mathbf{v}^\top W^{(1)} \mu^{(1)} + \mu^{(1)\top} W^{(2)} \mu^{(2)} + \mu^{(2)\top} W^{(3)} \mu^{(3)} - \log \mathcal{Z}(\theta) + \mathcal{H}(Q^{MF}). \qquad (6)$$
Learning proceeds as follows: For each training example, we maximize this lower bound with respect to the variational parameters $\mu$ for fixed parameters $\theta$, which results in the mean-field fixed-point equations:

$$\mu^{(1)}_j \leftarrow g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} \mu^{(2)}_l\right), \qquad (7)$$
Algorithm 1. Learning Procedure for a Deep Boltzmann Machine with Three Hidden Layers.
1: Given: a training set of N binary data vectors $\{\mathbf{v}_n\}_{n=1}^{N}$, and M, the number of persistent Markov chains (i.e., particles).
2: Randomly initialize the parameter vector $\theta_0$ and M samples: $\{\tilde{\mathbf{v}}^{0,1}, \tilde{\mathbf{h}}^{0,1}\}, \ldots, \{\tilde{\mathbf{v}}^{0,M}, \tilde{\mathbf{h}}^{0,M}\}$, where $\tilde{\mathbf{h}} = \{\tilde{\mathbf{h}}^{(1)}, \tilde{\mathbf{h}}^{(2)}, \tilde{\mathbf{h}}^{(3)}\}$.
3: for t = 0 to T (number of iterations) do
4:   // Variational Inference:
5:   for each training example $\mathbf{v}_n$, n = 1 to N do
6:     Randomly initialize $\mu = \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}\}$ and run mean-field updates until convergence, using (7), (8), (9).
7:     Set $\mu_n = \mu$.
8:   end for
9:   // Stochastic Approximation:
10:  for each sample m = 1 to M (number of persistent Markov chains) do
11:    Sample $(\tilde{\mathbf{v}}^{t+1,m}, \tilde{\mathbf{h}}^{t+1,m})$ given $(\tilde{\mathbf{v}}^{t,m}, \tilde{\mathbf{h}}^{t,m})$ by running a Gibbs sampler for one step (2).
12:  end for
13:  // Parameter Update:
14:  $W^{(1)}_{t+1} = W^{(1)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mathbf{v}_n \big(\mu^{(1)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{v}}^{t+1,m} \big(\tilde{\mathbf{h}}^{(1)}_{t+1,m}\big)^\top \Big)$.
15:  $W^{(2)}_{t+1} = W^{(2)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mu^{(1)}_n \big(\mu^{(2)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{(1)}_{t+1,m} \big(\tilde{\mathbf{h}}^{(2)}_{t+1,m}\big)^\top \Big)$.
16:  $W^{(3)}_{t+1} = W^{(3)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mu^{(2)}_n \big(\mu^{(3)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{(2)}_{t+1,m} \big(\tilde{\mathbf{h}}^{(3)}_{t+1,m}\big)^\top \Big)$.
17:  Decrease $\alpha_t$.
18: end for
$$\mu^{(2)}_l \leftarrow g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} \mu^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} \mu^{(3)}_k\right), \qquad (8)$$

$$\mu^{(3)}_k \leftarrow g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} \mu^{(2)}_l\right), \qquad (9)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. To solve these fixed-point equations, we simply cycle through layers, updating the mean-field parameters within a single layer. Note the close connection between the form of the mean-field fixed-point updates and the form of the conditional distribution³ defined by (2).
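A compact sketch of these fixed-point updates is given below, assuming the weight matrices $W^{(1)}, W^{(2)}, W^{(3)}$ are stored as NumPy arrays of shapes (D, F1), (F1, F2), and (F2, F3). It runs a fixed number of sweeps for simplicity; in practice one would iterate until the parameters stop changing. Names are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, W3, n_iters=25):
    """Run the mean-field fixed-point updates (7)-(9) for a single data
    vector v, cycling through layers. Returns mu1, mu2, mu3."""
    rng = np.random.default_rng(0)
    mu1 = rng.random(W1.shape[1])   # mu^(1), one value per h^(1) unit
    mu2 = rng.random(W2.shape[1])   # mu^(2)
    mu3 = rng.random(W3.shape[1])   # mu^(3)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)    # eq. (7)
        mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)  # eq. (8)
        mu3 = sigmoid(mu2 @ W3)               # eq. (9)
    return mu1, mu2, mu3
```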
2.1.2 A Stochastic Approximation Approach for Estimating the Data-Independent Statistics

Given the variational parameters $\mu$, the model parameters $\theta$ are then updated to maximize the variational bound using an MCMC-based stochastic approximation [29], [39], [46].

³ Implementing the mean-field requires no extra work beyond implementing the Gibbs sampler.

Learning with stochastic approximation is straightforward. Let $\theta_t$ and $x_t = \{\mathbf{v}_t, \mathbf{h}^{(1)}_t, \mathbf{h}^{(2)}_t, \mathbf{h}^{(3)}_t\}$ be the current parameters and the state. Then $x_t$ and $\theta_t$ are updated sequentially as follows:

• Given $x_t$, sample a new state $x_{t+1}$ from the transition operator $T_{\theta_t}(x_{t+1} \leftarrow x_t)$ that leaves $P(\cdot; \theta_t)$ invariant. This can be accomplished by using Gibbs sampling (see (2)).
• A new parameter $\theta_{t+1}$ is then obtained by making a gradient step, where the intractable model's expectation $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ in the gradient is replaced by a point estimate at sample $x_{t+1}$.

In practice, we typically maintain a set of M “persistent” sample particles $X_t = \{x_{t,1}, \ldots, x_{t,M}\}$, and use an average over those particles. The overall learning procedure for DBMs is summarized in Algorithm 1.

Stochastic approximation provides asymptotic convergence guarantees and belongs to the general class of Robbins-Monro approximation algorithms [27], [46]. Precise sufficient conditions that ensure almost sure convergence to an asymptotically stable point are given in [45], [46], and [47]. One necessary condition requires the learning rate to decrease with time so that $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. This condition can, for example, be satisfied simply by setting $\alpha_t = a/(b + t)$, for positive constants $a > 0$, $b > 0$. Other conditions ensure that the speed of convergence of the Markov chain, governed by the transition operator $T_\theta$, does not decrease too fast as $\theta$ tends to infinity. Typically, in practice the sequence $|\theta_t|$ is bounded, and the Markov chain, governed by the transition kernel $T_\theta$, is ergodic. Together with the condition on the learning rate, this ensures almost sure convergence of the stochastic approximation algorithm to an asymptotically stable point.
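The following sketch ties the pieces of Algorithm 1 together for a single update: mean-field statistics on a data batch, one Gibbs step on the persistent chains, and a Robbins-Monro learning rate $\alpha_t = a/(b+t)$. It assumes the `mean_field` and `gibbs_sweep` helpers sketched earlier in this section, and it is an illustration under those assumptions rather than a faithful reimplementation.

```python
import numpy as np

def sa_update(batch, chains, W1, W2, W3, t, a=1.0, b=1000.0):
    """One stochastic-approximation step in the spirit of Algorithm 1.
    `batch` is an (N, D) array of data vectors; `chains` is a list of M
    persistent particles (v, h1, h2, h3). Uses the mean_field and
    gibbs_sweep helpers sketched above."""
    lr = a / (b + t)  # Robbins-Monro schedule: sum(lr)=inf, sum(lr^2)<inf

    # Data-dependent statistics from mean-field (variational) inference.
    mus = [mean_field(v, W1, W2, W3) for v in batch]
    d1 = np.mean([np.outer(v, m[0]) for v, m in zip(batch, mus)], axis=0)
    d2 = np.mean([np.outer(m[0], m[1]) for m in mus], axis=0)
    d3 = np.mean([np.outer(m[1], m[2]) for m in mus], axis=0)

    # Data-independent statistics from the M persistent Gibbs chains.
    chains = [gibbs_sweep(*state, W1, W2, W3) for state in chains]
    m1 = np.mean([np.outer(c[0], c[1]) for c in chains], axis=0)
    m2 = np.mean([np.outer(c[1], c[2]) for c in chains], axis=0)
    m3 = np.mean([np.outer(c[2], c[3]) for c in chains], axis=0)

    # Gradient step (3): data-dependent minus model expectation.
    return W1 + lr * (d1 - m1), W2 + lr * (d2 - m2), W3 + lr * (d3 - m3), chains
```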
2.1.3 Greedy Layerwise Pretraining of DBMs
The learning procedure for DBMs described above can be
used by starting with randomly initialized weights, but it
works much better if the weights are initialized sensibly.
We therefore use a greedy layerwise pretraining strategy by
learning a stack of modified RBMs (for details see [29]).
This pretraining procedure is quite similar to the
pretraining procedure of DBNs [12], and it allows us to
perform approximate inference by a single bottom-up pass.
This fast approximate inference is then used to initialize the
mean-field, which then converges much faster than mean-
field with random initialization.⁴
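The modified-RBM details are given in [29]; as a rough illustration only, the sketch below stacks plain RBMs trained with one-step contrastive divergence (CD-1), using each layer's hidden activations as the next layer's data. It is a generic greedy-pretraining sketch under those simplifying assumptions, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05):
    """Train a plain binary RBM with one-step contrastive divergence (CD-1).
    The DBM pretraining in [29] uses *modified* RBMs that compensate for
    double-counting; this generic version omits that detail."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(n_epochs):
        for v in data:
            ph = sigmoid(v @ W)                       # P(h = 1 | v)
            h = (rng.random(n_hidden) < ph).astype(float)
            v_neg = (rng.random(v.shape) < sigmoid(h @ W.T)).astype(float)
            ph_neg = sigmoid(v_neg @ W)
            W += lr * (np.outer(v, ph) - np.outer(v_neg, ph_neg))
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy layerwise pretraining: train an RBM on the data, then use its
    hidden activations as 'data' for the next RBM, and so on."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)  # deterministic up-pass to the next layer
    return weights

weights = pretrain_stack(rng.integers(0, 2, (20, 16)).astype(float), [12, 8, 4])
```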
2.2 Gaussian-Bernoulli DBMs
We now briefly describe a Gaussian-Bernoulli DBM model, which we will use to model real-valued data, such as images of natural scenes and motion capture data. Gaussian-Bernoulli DBMs represent a generalization of a simpler class of models, called Gaussian-Bernoulli RBMs, which have been successfully applied to various tasks, including image classification, video action recognition, and speech recognition [17], [20], [23], [35].

In particular, consider modeling visible real-valued units $\mathbf{v} \in \mathbb{R}^D$ and let $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$, $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$, and $\mathbf{h}^{(3)} \in \{0,1\}^{F_3}$ be binary stochastic hidden units. The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ of the three-hidden-layer Gaussian-Bernoulli DBM is defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{2}\sum_i \frac{v_i^2}{\sigma_i^2} - \sum_{ij} W^{(1)}_{ij} h^{(1)}_j \frac{v_i}{\sigma_i} - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k, \qquad (10)$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ represents the set of hidden units, $\theta = \{W^{(1)}, W^{(2)}, W^{(3)}, \sigma^2\}$ are the model parameters, and $\sigma^2_i$ is the variance of input $i$. The marginal distribution over the visible vector $\mathbf{v}$ takes form

$$P(\mathbf{v}; \theta) = \frac{\sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}; \theta)\big)}{\int_{\mathbf{v}'} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}', \mathbf{h}; \theta)\big)\, d\mathbf{v}'}. \qquad (11)$$

From (10), it is straightforward to derive the following conditional distributions:

$$p\big(v_i = x \mid \mathbf{h}^{(1)}\big) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{\big(x - \sigma_i \sum_j h^{(1)}_j W^{(1)}_{ij}\big)^2}{2\sigma_i^2}\right),$$
$$p\big(h^{(1)}_j = 1 \mid \mathbf{v}\big) = g\left(\sum_i W^{(1)}_{ij} \frac{v_i}{\sigma_i}\right), \qquad (12)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. Conditional distributions over $\mathbf{h}^{(2)}$ and $\mathbf{h}^{(3)}$ remain the same as in the standard DBM model (see (2)).

Observe that conditioned on the states of the hidden units (12), each visible unit is modeled by a Gaussian distribution whose mean is shifted by the weighted combination of the hidden unit activations. The derivative of the log-likelihood with respect to $W^{(1)}$ takes form

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(1)}_{ij}} = \mathbb{E}_{P_{\mathrm{data}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right] - \mathbb{E}_{P_{\mathrm{model}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right].$$

The derivatives with respect to parameters $W^{(2)}$ and $W^{(3)}$ remain the same as in (3).

As described in the previous section, learning of the model parameters, including the variances $\sigma^2$, can be carried out using variational learning together with the stochastic approximation procedure. In practice, however, instead of learning $\sigma^2$, one would typically use a fixed, predetermined value for $\sigma^2$ [13], [24].
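As a concrete illustration of the conditionals in (12), the sketch below samples real-valued visibles given $\mathbf{h}^{(1)}$ and binary $\mathbf{h}^{(1)}$ given the visibles, with a fixed variance vector as suggested above. Names and sizes are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_visible_given_h1(h1, W1, sigma):
    """Sample real-valued visibles from the Gaussian conditional in (12):
    v_i ~ N(sigma_i * sum_j W1_ij h1_j, sigma_i^2)."""
    mean = sigma * (W1 @ h1)
    return mean + sigma * rng.standard_normal(mean.shape)

def sample_h1_given_visible(v, W1, sigma):
    """Sample binary h^(1) from p(h1_j = 1 | v) = g(sum_i W1_ij v_i / sigma_i)."""
    p = sigmoid((v / sigma) @ W1)
    return (rng.random(p.shape) < p).astype(float)

# Toy example with D = 5 visibles, F1 = 3 hidden units, fixed unit variances.
D, F1 = 5, 3
W1 = 0.1 * rng.standard_normal((D, F1))
sigma = np.ones(D)        # sigma^2 is often fixed in practice, as noted above
h1 = rng.integers(0, 2, F1).astype(float)
v = sample_visible_given_h1(h1, W1, sigma)
h1 = sample_h1_given_visible(v, W1, sigma)
```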
2.3 Multinomial DBMs
To allow DBMs to express more information and introduce more structured hierarchical priors, we will use a conditional multinomial distribution to model activities of the top-level units $\mathbf{h}^{(3)}$. Specifically, we will use M softmax units, each with “1-of-K” encoding, so that each unit contains a set of K weights. We represent the kth discrete value of a hidden unit by a vector containing 1 at the kth location and zeros elsewhere. The conditional probability of a softmax top-level unit is

$$P\big(h^{(3)}_k \mid \mathbf{h}^{(2)}\big) = \frac{\exp\big(\sum_l W^{(3)}_{lk} h^{(2)}_l\big)}{\sum_{s=1}^{K} \exp\big(\sum_l W^{(3)}_{ls} h^{(2)}_l\big)}. \qquad (13)$$

In our formulation, all M separate softmax units will share the same set of weights, connecting them to binary hidden units at the lower level (see Fig. 1). The energy of the state $\{\mathbf{v}, \mathbf{h}\}$ is then defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l \hat{h}^{(3)}_k,$$

where $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$ and $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$ represent stochastic binary units. The top layer is represented by the M softmax units $\mathbf{h}^{(3,m)}$, $m = 1, \ldots, M$, with $\hat{h}^{(3)}_k = \sum_{m=1}^{M} h^{(3,m)}_k$ denoting the count for the kth discrete value of a hidden unit.

A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times from the conditional distribution of (13). This gives us a familiar “bag-of-words” representation [30], [36]. A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.

⁴ The code for pretraining and generative learning of the DBM model is available at http://www.utstat.toronto.edu/~rsalakhu/DBM.html.
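A small sketch of this view of the top layer: compute the softmax of (13) once, then draw M samples from it and keep only the resulting count vector $\hat{\mathbf{h}}^{(3)}$, i.e., the bag-of-words representation. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_counts(h2, W3, M):
    """Sample the multinomial top layer: compute the softmax in (13) over the
    K discrete values, then draw M i.i.d. samples that share the same weights.
    Returns the count vector hhat^(3) of length K (a 'bag of words')."""
    logits = h2 @ W3                     # sum_l W3_lk h2_l, for each k
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax probabilities of (13)
    return rng.multinomial(M, p)         # counts hhat_k = sum_m h^(3,m)_k

# Toy example: F2 = 6 second-layer units, K = 10 top-level values, M = 100.
F2, K, M = 6, 10, 100
W3 = 0.1 * rng.standard_normal((F2, K))
h2 = rng.integers(0, 2, F2).astype(float)
hhat3 = sample_top_counts(h2, W3, M)     # length-K vector summing to M
```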
3 COMPOUND HDP-DBM MODEL

After a DBM model has been learned, we have an undirected model that defines the joint distribution $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)})$. One way to express what has been learned is the conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$ and a complicated prior term $P(\mathbf{h}^{(3)})$, defined by the DBM model. We can therefore rewrite the variational bound as

$$\log P(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}} Q(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}) + \mathcal{H}(Q) + \sum_{\mathbf{h}^{(3)}} Q(\mathbf{h}^{(3)} \mid \mathbf{v}; \mu) \log P(\mathbf{h}^{(3)}). \qquad (14)$$

This particular decomposition lies at the core of the greedy recursive pretraining algorithm: We keep the learned conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, but maximize the variational lower bound of (14) with respect to the last term [12]. This maximization amounts to replacing $P(\mathbf{h}^{(3)})$ by a prior that is closer to the average, over all the data vectors, of the approximate conditional posterior $Q(\mathbf{h}^{(3)} \mid \mathbf{v})$.

Instead of adding an additional undirected layer (e.g., an RBM) to model $P(\mathbf{h}^{(3)})$, we can place an HDP prior over $\mathbf{h}^{(3)}$ that will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.

The part we keep, $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, represents a conditional DBM model:⁵

$$P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}) = \frac{1}{\mathcal{Z}(\theta, \mathbf{h}^{(3)})} \exp\left(\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j + \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l + \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k\right), \qquad (15)$$

which can be viewed as a two-layer DBM but with bias terms given by the states of $\mathbf{h}^{(3)}$.
3.1 A Hierarchical Bayesian Prior
In a typical hierarchical topic model, we observe a set of N documents, each of which is modeled as a mixture over topics that are shared among documents. Let there be K words in the vocabulary. A topic t is a discrete distribution over K words with probability vector $\theta_t$. Each document n has its own distribution over topics given by probabilities $\pi_n$.

In our compound HDP-DBM model, we will use a hierarchical topic model as a prior over the activities of the DBM's top-level features. Specifically, the term “document” will refer to the top-level multinomial unit $\mathbf{h}^{(3)}$, and M “words” in the document will represent the M samples, or active DBM's top-level features, generated by this multinomial unit. Words in each document are drawn by choosing a topic t with probability $\pi_{nt}$, and then choosing a word w with probability $\theta_{tw}$. We will often refer to topics as our learned higher level features, each of which defines a topic-specific distribution over DBM's $\mathbf{h}^{(3)}$ features. Let $h^{(3)}_{in}$ be the ith word in document n, and $x_{in}$ be its topic. We can specify the following prior over $\mathbf{h}^{(3)}$:

$$\pi_n \mid \alpha, \pi^{(g)} \sim \mathrm{Dir}\big(\alpha \pi^{(g)}\big), \quad \text{for each document } n = 1, \ldots, N,$$
$$\theta_t \mid \beta, \theta^{(g)} \sim \mathrm{Dir}\big(\beta \theta^{(g)}\big), \quad \text{for each topic } t = 1, \ldots, T,$$
$$x_{in} \mid \pi_n \sim \mathrm{Mult}(1, \pi_n), \quad \text{for each word } i = 1, \ldots, M,$$
$$h^{(3)}_{in} \mid x_{in}, \theta_{x_{in}} \sim \mathrm{Mult}(1, \theta_{x_{in}}),$$

where $\pi^{(g)}$ is the global distribution over topics, $\theta^{(g)}$ is the global distribution over K words, and $\alpha$ and $\beta$ are concentration parameters.
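For intuition, the sketch below samples from this finite topic-model prior using the symbol names adopted in the reconstruction above (which are themselves assumptions about the original notation): a global topic distribution and a global word distribution, per-document topic proportions, and M "words" per document.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_documents(N=4, T=5, K=20, M=100, alpha=2.0, beta=1.0):
    """Generative sketch of the finite topic-model prior over h^(3) described
    above (illustrative names). Returns per-document word-count vectors."""
    pi_g = rng.dirichlet(np.ones(T))            # global distribution over topics
    theta_g = rng.dirichlet(np.ones(K))         # global distribution over K words
    theta = rng.dirichlet(beta * theta_g + 1e-6, size=T)   # topic-word dists
    docs = np.zeros((N, K), dtype=int)
    for n in range(N):
        pi_n = rng.dirichlet(alpha * pi_g + 1e-6)          # document's topics
        for _ in range(M):                      # M "words" = active h^(3) features
            t = rng.choice(T, p=pi_n)           # x_in ~ Mult(1, pi_n)
            w = rng.choice(K, p=theta[t])       # h^(3)_in ~ Mult(1, theta_t)
            docs[n, w] += 1
    return docs

docs = sample_documents()
```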
Let us further assume that our model is presented with a fixed two-level category hierarchy. In particular, suppose that N documents, or objects, are partitioned into C basic-level categories (e.g., cow, sheep, car). We represent such a partition by a vector $z^b$ of length N, each entry of which is $z^b_n \in \{1, \ldots, C\}$. We also assume that our C basic-level

Fig. 1. Left: Multinomial DBM model: The top layer represents M softmax hidden units $\mathbf{h}^{(3)}$ which share the same set of weights. Right: A different interpretation: M softmax units are replaced by a single multinomial unit which is sampled M times.

⁵ Our experiments reveal that using DBNs instead of DBMs decreased model performance.


References

C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Transactions on Intelligent Systems and Technology, 2011.
D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Advances in Neural Information Processing Systems (NIPS), 2001.
G.E. Hinton and R.R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, pp. 504-507, 2006.
G.E. Hinton, S. Osindero, and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.