
Learning with Hierarchical-Deep Models

TL;DR: Efficient learning and inference algorithms for the HDP-DBM model are presented and it is shown that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian (HB) models. Specifically, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a deep Boltzmann machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

Summary

1 INTRODUCTION

  • THE ability to learn abstract representations that support transfer to novel but related tasks lies at the core of many problems in computer vision, natural language processing, cognitive science, and machine learning.
  • In contrast, the authors argue that learning new classes from a handful of training examples will be easier in architectures that can explicitly identify only a small number of degrees of freedom (latent variables and parameters) that are relevant to the new concept being learned, and thereby achieve more appropriate and flexible transfer of learned representations to new tasks.
  • Unlike deep networks, these HB models explicitly represent category hierarchies that admit sharing the appropriate abstract knowledge about the new class’s parameters via a prior abstracted from related classes.
  • They typically rely on domain-specific hand-crafted features [2], [11] (e.g., GIST, SIFT features in computer vision, MFCC features in speech perception domains).
  • Their approach was not ideal as a generic approach to transfer learning with few examples.

2 DEEP BOLTZMANN MACHINES

  • There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer.
  • The probability that the model assigns to a visible vector $\mathbf{v}$ is given by the Boltzmann distribution: $P(\mathbf{v};\theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)};\theta)\big)$ (1). Observe that setting both $W^{(2)} = 0$ and $W^{(3)} = 0$ recovers the simpler Restricted Boltzmann Machine (RBM) model.
  • The exact computation of the data-dependent expectation takes time that is exponential in the number of hidden units, whereas the exact computation of the model's expectation takes time that is exponential in the number of hidden and visible units.
  • The authors omit the bias terms for clarity of presentation.

2.1 Approximate Learning

  • The original learning algorithm for Boltzmann machines used randomly initialized Markov chains to approximate both expectations to estimate gradients of the likelihood function [14].
  • Recently, Salakhutdinov and Hinton [29] proposed a variational approach, where mean-field inference is used to estimate data-dependent expectations and an MCMC-based stochastic approximation procedure is used to approximate the model's expected sufficient statistics.

2.1.1 A Variational Approach to Estimating the Data-Dependent Statistics

  • Then the log-likelihood of the DBM model has the following variational lower bound: $\log P(\mathbf{v};\theta) \ge \sum_{\mathbf{h}} Q(\mathbf{h}\mid\mathbf{v};\mu)\log P(\mathbf{v},\mathbf{h};\theta) + \mathcal{H}(Q) = \log P(\mathbf{v};\theta) - \mathrm{KL}\big(Q(\mathbf{h}\mid\mathbf{v};\mu)\,\|\,P(\mathbf{h}\mid\mathbf{v};\theta)\big)$ (4), where $\mathcal{H}(\cdot)$ is the entropy functional and $\mathrm{KL}(Q\|P)$ denotes the Kullback-Leibler divergence between the two distributions.
  • The bound becomes tight if and only if $Q(\mathbf{h}\mid\mathbf{v};\mu) = P(\mathbf{h}\mid\mathbf{v};\theta)$. Variational learning has the nice property that in addition to maximizing the log-likelihood of the data, it also attempts to find parameters that minimize the Kullback-Leibler divergence between the approximating and true posteriors.
  • To solve these fixed-point equations, the authors simply cycle through layers, updating the mean-field parameters within a single layer.
  • Note the close connection between the form of the mean-field fixed-point updates and the form of the conditional distribution defined by (2).

2.1.2 A Stochastic Approximation Approach for Estimating the Data-Independent Statistics

  • Given the variational parameters $\mu$, the model parameters $\theta$ are then updated to maximize the variational bound using an MCMC-based stochastic approximation [29], [39], [46].
  • Implementing the mean-field requires no extra work beyond implementing the Gibbs sampler.
  • Given $x_t$, sample a new state $x_{t+1}$ from the transition operator $T_{\theta_t}(x_{t+1} \leftarrow x_t)$ that leaves $P(\cdot;\theta_t)$ invariant.
  • The overall learning procedure for DBMs is summarized in Algorithm 1.
  • Together with the condition on the learning rate, this ensures almost sure convergence of the stochastic approximation algorithm to an asymptotically stable point.

2.1.3 Greedy Layerwise Pretraining of DBMs

  • The learning procedure for DBMs described above can be used by starting with randomly initialized weights, but it works much better if the weights are initialized sensibly.
  • The authors therefore use a greedy layerwise pretraining strategy by learning a stack of modified RBMs (for details see [29]).
  • This fast approximate inference is then used to initialize the mean-field, which then converges much faster than mean-field with random initialization.

2.2 Gaussian-Bernoulli DBMs

  • The authors now briefly describe a Gaussian-Bernoulli DBM model, which they will use to model real-valued data, such as images of natural scenes and motion capture data.
  • Gaussian-Bernoulli DBMs represent a generalization of a simpler class of models, called Gaussian-Bernoulli RBMs, which have been successfully applied to various tasks, including image classification, video action recognition, and speech recognition [17], [20], [23], [35].
  • In practice, however, instead of learning $\sigma^2$, one would typically use a fixed, predetermined value for $\sigma^2$ [13], [24].

2.3 Multinomial DBMs

  • To allow DBMs to express more information and introduce more structured hierarchical priors, the authors will use a conditional multinomial distribution to model activities of the top-level units $\mathbf{h}^{(3)}$.
  • The code for pretraining and generative learning of the DBM model is available at http://www.utstat.toronto.edu/~rsalakhu/DBM.html.
  • A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times from the conditional distribution of (13).
  • A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.

3 COMPOUND HDP-DBM MODEL

  • After a DBM model has been learned, the authors have an undirected model that defines the joint distribution $P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)})$.
  • One way to express what has been learned is the conditional model $P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}\mid\mathbf{h}^{(3)})$ and a complicated prior term $P(\mathbf{h}^{(3)})$, defined by the DBM model.
  • The authors can therefore rewrite the variational bound as $\log P(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)}} Q(\mathbf{h}\mid\mathbf{v};\mu)\log P(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}\mid\mathbf{h}^{(3)}) + \mathcal{H}(Q) + \sum_{\mathbf{h}^{(3)}} Q(\mathbf{h}^{(3)}\mid\mathbf{v};\mu)\log P(\mathbf{h}^{(3)})$ (14).
  • This particular decomposition lies at the core of the greedy recursive pretraining algorithm: Instead of adding an additional undirected layer (e.g., an RBM) to model $P(\mathbf{h}^{(3)})$, the authors can place an HDP prior over $\mathbf{h}^{(3)}$ that will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.

3.1 A Hierarchical Bayesian Prior

  • In their compound HDP-DBM model, the authors will use a hierarchical topic model as a prior over the activities of the DBM’s top-level features.
  • $h^{(3)}_{in} \mid x_{in}, \theta_{x_{in}} \sim \mathrm{Mult}(1, \theta_{x_{in}})$ for each word $i = 1,\ldots,M$, where $\pi^{(g)}$ is the global distribution over topics, $\theta^{(g)}$ is the global distribution over K words, and $\alpha$ and $\beta$ are concentration parameters.
  • Let us further assume that their model is presented with a fixed two-level category hierarchy.
  • These high-level features in turn define a topic-specific distribution over $\mathbf{h}^{(3)}$ features, or “words,” in their DBM model.
  • For a fixed number of topics T , the above model represents a hierarchical extension of the latent Dirichlet allocation (LDA) model [4].

3.2 Modeling the Number of Supercategories

  • So far the authors have assumed that their model is presented with a two-level partition $z = \{z^s, z^b\}$ that defines a fixed two-level tree hierarchy.
  • The authors note that this model corresponds to a standard HDP model that assumes a fixed hierarchy for sharing parameters.
  • The authors place a nonparametric two-level nested Chinese restaurant prior (CRP) [5] over z, which defines a prior over tree structures and is flexible enough to learn arbitrary hierarchies.
  • The main building block of the nested CRP is the Chinese restaurant process (CRP), a distribution over partitions of integers; a minimal sampling sketch follows this list.
  • As the authors show in the experimental results section, both sharing higher level features and forming coherent hierarchies play a crucial role in the ability of the model to generalize well from one or few examples of a novel category.
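A minimal sketch of the single-level Chinese restaurant process may help make this concrete; the nested CRP simply applies the same construction recursively to build a tree of supercategories and basic-level categories. This is an illustrative sketch, not the authors' code, and `gamma` is a generic concentration parameter.

```python
import numpy as np

def crp_partition(n_customers, gamma=1.0, seed=0):
    """Sample a partition of 1..n from a Chinese restaurant process:
    customer i joins an existing table with probability proportional to
    its size, or opens a new table with probability proportional to gamma."""
    rng = np.random.default_rng(seed)
    tables = []            # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        probs = np.array(tables + [gamma], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # open a new table (a brand-new category)
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments

# Example: partition 10 "objects" into categories.
print(crp_partition(10))
```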

4 INFERENCE

  • Inferences about model parameters at all levels of hierarchy can be performed by MCMC.
  • The sampler alternates between: 1) sampling cluster indices $x_{in}$ using Gibbs updates in the Chinese restaurant franchise (CRF) representation of the HDP; and 2) sampling the weights at all three levels conditioned on $x$ using the usual posterior of a DP (see the simplified sketch after this list).
  • The speedup could be substantial, particularly as the number of the basic-level categories becomes large.
  • In their conjugate setting, parameters can be further integrated out.
  • Finally, conditioned on the states of $\mathbf{h}^{(3)}$, the authors can further fine-tune the low-level DBM parameters $\{W^{(1)}, W^{(2)}, W^{(3)}\}$ by applying approximate maximum likelihood learning (see Section 2) to the conditional DBM model of (15).
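The Chinese-restaurant-franchise updates above are intricate; the following is a deliberately simplified, runnable stand-in that shows only the flavor of step 1: collapsed Gibbs resampling of topic indices in a finite topic model for a single document. All names and hyperparameters are illustrative, and the full HDP-DBM sampler additionally resamples the DP weights, the category tree, and the DBM parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_topic_assignments(words, T, K, alpha=1.0, beta=0.1, n_sweeps=50):
    """Collapsed Gibbs sampling of topic indices for one document in a
    simplified finite topic model (a stand-in for the CRF updates of the
    HDP described above). `words` is a list of word ids in {0..K-1}."""
    x = rng.integers(0, T, len(words))             # initial topic of each word
    doc_topic = np.bincount(x, minlength=T).astype(float)
    topic_word = np.full((T, K), beta)             # smoothed topic-word counts
    for i, w in enumerate(words):
        topic_word[x[i], w] += 1
    for _ in range(n_sweeps):
        for i, w in enumerate(words):
            doc_topic[x[i]] -= 1                   # remove word i's assignment
            topic_word[x[i], w] -= 1
            p = (doc_topic + alpha) * topic_word[:, w] / topic_word.sum(axis=1)
            p /= p.sum()
            x[i] = rng.choice(T, p=p)              # resample its topic
            doc_topic[x[i]] += 1
            topic_word[x[i], w] += 1
    return x

x = gibbs_topic_assignments(words=[0, 3, 3, 7, 1, 0], T=3, K=10)
```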

4.1 Making Predictions

  • Given a test input $\mathbf{v}_t$, the authors can quickly infer the approximate posterior over $\mathbf{h}^{(3)}_t$ using the mean-field of (6), followed by running the full Gibbs sampler to get approximate samples from the posterior over the category assignments.
  • Instead of integrating out the document-specific DP, combining this likelihood term with the nCRP prior $P(z_t \mid z_{-t})$ of (19) allows the authors to efficiently infer an approximate posterior over category assignments (a toy version is sketched after this list).
  • In all of their experimental results, computing this approximate posterior takes a fraction of a second, which is crucial for applications, such as object recognition or information retrieval.
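For intuition only, a hedged toy sketch of this prediction step is given below: it scores a test input's top-level feature counts under each known category's word distribution and combines the scores with a CRP-style prior that also reserves mass for a brand-new category. The array names and the uniform treatment of the new category are simplifying assumptions, not the paper's exact computation.

```python
import numpy as np

def category_posterior(test_counts, cat_word_probs, cat_sizes, gamma=1.0):
    """Approximate posterior over categories for one test 'document'.
    test_counts: length-K vector of top-level feature counts.
    cat_word_probs: (C, K) word distribution per known category.
    cat_sizes: number of training objects per known category."""
    log_like = test_counts @ np.log(cat_word_probs.T)      # multinomial log-lik.
    log_prior = np.log(np.append(cat_sizes, gamma))        # CRP-style prior
    # A new category is scored with a uniform word distribution here; the
    # full model integrates over the HDP prior instead.
    K = cat_word_probs.shape[1]
    log_like = np.append(log_like, test_counts.sum() * -np.log(K))
    log_post = log_like + log_prior
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Toy example: 3 known categories over a vocabulary of K = 5 top-level features.
rng = np.random.default_rng(0)
cat_word_probs = rng.dirichlet(np.ones(5), size=3)
post = category_posterior(np.array([4., 0., 1., 2., 0.]), cat_word_probs,
                          cat_sizes=np.array([20., 5., 2.]))
```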

5 EXPERIMENTS

  • The authors present experimental results on the CIFAR-100 [17], handwritten character [18], and human motion capture recognition datasets.
  • For all datasets, the authors first pretrain a DBM model in unsupervised fashion on raw sensory input (e.g., pixels, or three-dimensional joint angles), followed by fitting an HDP prior which is run for 200 Gibbs sweeps.
  • This was sufficient to obtain good performance.
  • Across all datasets, the authors also assume that the basic-level category labels are given, but no supercategory labels are available.
  • The first two models, stand-alone DBMs and DBNs [12], used three layers of hidden variables and were pretrained using a stack of RBMs.

5.1 CIFAR-100 Data Set

  • Fig. 3 displays a random subset of the training data, first and second layer DBM features, as well as higher level class-sensitive features, or topics, learned by the HDP model.
  • The results are averaged over 100 classes using “leave-one-out” test format.
  • Table 1 also shows that fine-tuning parameters of all layers jointly as well as learning supercategory hierarchy significantly improves model performance.
  • Despite variations in viewpoint and cluttered backgrounds, the model is able to capture the overall structure of each class.

5.2 Handwritten Characters

  • The handwritten characters dataset [18] can be viewed as the “transpose” of the standard MNIST dataset.
  • The results are averaged over 200 characters chosen at random, using the “leave-one-out” test format.
  • This result demonstrates that the HDP-DBM model is able to successfully transfer an appropriate prior over higher-level “strokes” from previously learned categories.
  • Each panel shows three figures: 1) three training examples of a novel character class, 2) 12 synthesized examples of that class, and 3) samples of the training characters in the same supercategory that the novel character has been grouped under.

5.3 Motion Capture

  • Results on the CIFAR and Character datasets show that the HDP-DBM model can significantly outperform many other models on object and character recognition tasks.
  • Features at all levels of the hierarchy were learned without assuming any image-specific priors, and the proposed model can be applied in a wide variety of application domains.
  • The authors show that the HDP-DBM model can be applied to modeling human motion capture data.
  • There are 2,500 frames of each style at 60fps, where each time step was represented by a vector of 58 real-valued numbers.
  • Using the “leave-one-out” test format, Table 1 shows that the HDP-DBM model performs much better than other models when discriminating between the nine existing walking styles and a novel walking style.

6 CONCLUSIONS

  • The authors developed a compositional architecture that learns an HDP prior over the activities of top-level features of the DBM model.
  • The resulting compound HDP-DBM model is able to learn low-level features from raw, high-dimensional sensory input, high-level features, as well as a category hierarchy for parameter sharing.
  • The experimental results show that the proposed model can acquire new concepts from very few examples in a diverse set of application domains.
  • The compositional model considered in this paper was directly inspired by the architecture of the DBM and HDP, but it need not be.
  • Indeed, any other deep learning module, including DBNs, sparse autoencoders, or any other HB model, can be adapted.


Learning with Hierarchical-Deep Models
Ruslan Salakhutdinov, Joshua B. Tenenbaum, and Antonio Torralba, Member, IEEE
Abstract—We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning
models with structured hierarchical Bayesian (HB) models. Specifically, we show how we can learn a hierarchical Dirichlet process
(HDP) prior over the activities of the top-level features in a deep Boltzmann machine (DBM). This compound HDP-DBM model learns
to learn novel concepts from very few training examples by learning low-level generic features, high-level features that capture
correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of
different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to
learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion
capture datasets.
Index Terms—Deep networks, deep Boltzmann machines, hierarchical Bayesian models, one-shot learning
1 INTRODUCTION

THE ability to learn abstract representations that support
transfer to novel but related tasks lies at the core of
many problems in computer vision, natural language
processing, cognitive science, and machine learning. In
typical applications of machine classification algorithms
today, learning a new concept requires tens, hundreds, or
thousands of training examples. For human learners,
however, just one or a few examples are often sufficient
to grasp a new category and make meaningful general-
izations to novel instances [15], [25], [31], [44]. Clearly, this
requires very strong but also appropriately tuned inductive
biases. The architecture we describe here takes a step
toward this ability by learning several forms of abstract
knowledge at different levels of abstraction that support
transfer of useful inductive biases from previously learned
concepts to novel ones.
We call our architectures compound HD models, where
“HD” stands for “Hierarchical-Deep,” because they are
derived by composing hierarchical nonparametric Bayesian
models with deep networks, two influential approaches
from the recent unsupervised learning literature with
complementary strengths. Recently introduced deep learn-
ing models, including deep belief networks (DBNs) [12],
deep Boltzmann machines (DBM) [29], deep autoencoders
[19], and many others [9], [10], [21], [22], [26], [32], [34], [43],
have been shown to learn useful distributed feature
representations for many high-dimensional datasets. The
ability to automatically learn in multiple layers allows deep
models to construct sophisticated domain-specific features
without the need to rely on precise human-crafted input
representations, increasingly important with the prolifera-
tion of datasets and application domains.
While the features learned by deep models can enable
more rapid and accurate classification learning, deep
networks themselves are not well suited to learning novel
classes from few examples. All units and parameters at all
levels of the network are engaged in representing any given
input (“distributed representations”), and are adjusted
together during learning. In contrast, we argue that learning
new classes from a handful of training examples will be
easier in architectures that can explicitly identify only a
small number of degrees of freedom (latent variables and
parameters) that are relevant to the new concept being
learned, and thereby achieve more appropriate and flexible
transfer of learned representations to new tasks. This ability
is the hallmark of hierarchical Bayesian (HB) models,
recently proposed in computer vision, statistics, and
cognitive science [8], [11], [15], [28], [44] for learning from
few examples. Unlike deep networks, these HB models
explicitly represent category hierarchies that admit sharing
the appropriate abstract knowledge about the new class’s
parameters via a prior abstracted from related classes. HB
approaches, however, have complementary weaknesses
relative to deep networks. They typically rely on domain-
specific hand-crafted features [2], [11] (e.g., GIST, SIFT
features in computer vision, MFCC features in speech
perception domains). Committing to the a-priori defined
feature representations, instead of learning them from data,
can be detrimental. This is especially important when
learning complex tasks, as it is often difficult to hand-craft
high-level features explicitly in terms of raw sensory input.
Moreover, many HB approaches often assume a fixed
hierarchy for sharing parameters [6], [33] instead of
discovering how parameters are shared among classes in
an unsupervised fashion.
In this paper, we propose compound HD architectures
that integrate these deep models with structured HB
models.
. R. Salakhutdinov is with the Department of Statistics and Computer
Science, University of Toronto, Toronto, ON M5S 3G3, Canada.
E-mail: rsalakhu@utstat.toronto.edu.
. J.B. Tenenbaum is with the Department of Brain and Cognitive Sciences,
Massachusetts Institute of Technology, Cambridge, MA 02139.
E-mail: jbt@mit.edu.
. A. Torralba is with the Computer Science and Artificial Intelligence
Laboratory, Massachusetts Institute of Technology, Cambridge, MA
02139. E-mail: torralba@mit.edu.
Manuscript received 18 Apr. 2012; revised 30 Aug. 2012; accepted 30 Nov.
2012; published online 19 Dec. 2012.
Recommended for acceptance by M. Welling.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2012-04-0302.
Digital Object Identifier no. 10.1109/TPAMI.2012.269.

In particular, we show how we can learn a
hierarchical Dirichlet process (HDP) prior over the activities
of the top-level features in a DBM, coming to represent both
a layered hierarchy of increasingly abstract features and a
tree-structured hierarchy of classes. Our model depends
minimally on domain-specific representations and achieves
state-of-the-art performance by unsupervised discovery of
three components: 1) low-level features that abstract from
the raw high-dimensional sensory input (e.g., pixels, or
three-dimensional joint angles) and provide a useful first
representation for all concepts in a given domain; 2) high-
level part-like features that express the distinctive percep-
tual structure of a specific class, in terms of class-specific
correlations over low-level features; and 3) a hierarchy of
superclasses for sharing abstract knowledge among related
classes via a prior on which higher level features are likely
to be distinctive for classes of a certain kind and are thus
likely to support learning new concepts of that kind.
We evaluate the compound HDP-DBM model on three
different perceptual domains. We also illustrate the
advantages of having a full generative model, extending
from highly abstract concepts all the way down to sensory
inputs: We cannot only generalize class labels but also
synthesize new examples in novel classes that look reason-
ably natural, and we can significantly improve classification
performance by learning parameters at all levels jointly by
maximizing a joint log-probability score.
There have also been several approaches in the computer
vision community addressing the problem of learning with
few examples. Torralba et al. [42] proposed using several
boosted detectors in a multitask setting, where features are
shared between several categories. Bart and Ullman [3]
further proposed a cross-generalization framework for
learning with few examples. Their key assumption is that
new features for a novel category are selected from the pool of
features that was useful for previously learned classification
tasks. In contrast to our work, the above approaches are
discriminative by nature and do not attempt to identify
similar or relevant categories. Babenko et al. [1] used a
boosting approach that simultaneously groups together
categories into several supercategories, sharing a similarity
metric within these classes. They, however, did not attempt to
address transfer learning problem, and primarily focused on
large-scale image retrieval tasks. Finally, Fei-Fei et al. [11]
used an HB approach, with a prior on the parameters of
new categories that was induced from other categories.
However, their approach was not ideal as a generic
approach to transfer learning with few examples. They
learned only a single prior shared across all categories.
The prior was learned from only three categories, chosen
by hand. Compared to our work, they used a more
elaborate visual object model, based on multiple parts
with separate appearance and shape components.
2 DEEP BOLTZMANN MACHINES

A DBM is a network of symmetrically coupled stochastic binary units. It contains a set of visible units $\mathbf{v} \in \{0,1\}^D$ and a sequence of layers of hidden units $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}, \mathbf{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \mathbf{h}^{(L)} \in \{0,1\}^{F_L}$. There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer. Consider a DBM with three hidden layers¹ (i.e., $L = 3$). The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}\}$ is defined as

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k,$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ represents the set of hidden units and $\theta = \{W^{(1)}, W^{(2)}, W^{(3)}\}$ are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms.²

The probability that the model assigns to a visible vector $\mathbf{v}$ is given by the Boltzmann distribution:

$$P(\mathbf{v}; \theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}; \theta)\big). \qquad (1)$$

Observe that setting both $W^{(2)} = 0$ and $W^{(3)} = 0$ recovers the simpler Restricted Boltzmann Machine (RBM) model.
The conditional distributions over the visible and the three sets of hidden units are given by

$$p\big(h^{(1)}_j = 1 \mid \mathbf{v}, \mathbf{h}^{(2)}\big) = g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} h^{(2)}_l\right),$$
$$p\big(h^{(2)}_l = 1 \mid \mathbf{h}^{(1)}, \mathbf{h}^{(3)}\big) = g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} h^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} h^{(3)}_k\right),$$
$$p\big(h^{(3)}_k = 1 \mid \mathbf{h}^{(2)}\big) = g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} h^{(2)}_l\right),$$
$$p\big(v_i = 1 \mid \mathbf{h}^{(1)}\big) = g\left(\sum_{j=1}^{F_1} W^{(1)}_{ij} h^{(1)}_j\right), \qquad (2)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function.
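To make the conditionals in (2) concrete, the following is a minimal NumPy sketch (not the authors' released code; sizes and variable names are illustrative, and bias terms are omitted as in the text) of one Gibbs sweep through a three-hidden-layer binary DBM.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Logistic function g(x) = 1 / (1 + exp(-x)) from (2).
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, h3, W1, W2, W3):
    """One Gibbs sweep over a 3-hidden-layer binary DBM, using the
    conditionals in (2). Bias terms are omitted, as in the text."""
    # Odd layers h1 and h3 depend on the even layers (v, h2).
    h1 = rng.random(h1.shape) < sigmoid(v @ W1 + h2 @ W2.T)
    h3 = rng.random(h3.shape) < sigmoid(h2 @ W3)
    # Even layers: visible units and the middle hidden layer.
    v  = rng.random(v.shape)  < sigmoid(h1 @ W1.T)
    h2 = rng.random(h2.shape) < sigmoid(h1 @ W2 + h3 @ W3.T)
    return v.astype(float), h1.astype(float), h2.astype(float), h3.astype(float)

# Toy sizes: D visible units, F1/F2/F3 hidden units per layer.
D, F1, F2, F3 = 16, 12, 8, 4
W1, W2, W3 = (0.01 * rng.standard_normal(s)
              for s in [(D, F1), (F1, F2), (F2, F3)])
v = rng.integers(0, 2, D).astype(float)
h1, h2, h3 = np.zeros(F1), np.zeros(F2), np.zeros(F3)
v, h1, h2, h3 = gibbs_sweep(v, h1, h2, h3, W1, W2, W3)
```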
The derivative of the log-likelihood with respect to the model parameters can be obtained from (1):

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(1)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{v}\,\mathbf{h}^{(1)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{v}\,\mathbf{h}^{(1)\top}\big],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(2)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{h}^{(1)}\mathbf{h}^{(2)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{h}^{(1)}\mathbf{h}^{(2)\top}\big],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(3)}} = \mathbb{E}_{P_{\mathrm{data}}}\big[\mathbf{h}^{(2)}\mathbf{h}^{(3)\top}\big] - \mathbb{E}_{P_{\mathrm{model}}}\big[\mathbf{h}^{(2)}\mathbf{h}^{(3)\top}\big], \qquad (3)$$

where $\mathbb{E}_{P_{\mathrm{data}}}[\cdot]$ denotes an expectation with respect to the completed data distribution

$$P_{\mathrm{data}}(\mathbf{h}, \mathbf{v}; \theta) = P(\mathbf{h} \mid \mathbf{v}; \theta)\, P_{\mathrm{data}}(\mathbf{v}),$$

with $P_{\mathrm{data}}(\mathbf{v}) = \frac{1}{N}\sum_n \delta(\mathbf{v} - \mathbf{v}_n)$ representing the empirical distribution, and $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ is an expectation with respect to the distribution defined by the model (1). We will sometimes refer to $\mathbb{E}_{P_{\mathrm{data}}}[\cdot]$ as the data-dependent expectation and $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ as the model's expectation.
Exact maximum likelihood learning in this model is intractable. The exact computation of the data-dependent expectation takes time that is exponential in the number of hidden units, whereas the exact computation of the model's expectation takes time that is exponential in the number of hidden and visible units.

¹ For clarity, we use three hidden layers. Extensions to models with more than three layers are trivial.
² We have omitted the bias terms for clarity of presentation. Biases are equivalent to weights on a connection to a unit whose state is fixed at 1.
2.1 Approximate Learning
The original learning algorithm for Boltzmann machines
used randomly initialized Markov chains to approximate
both expectations to estimate gradients of the likelihood
function [14]. However, this learning procedure is too slow
to be practical. Recently, Salakhutdinov and Hinton [29]
proposed a variational approach, where mean-field infer-
ence is used to estimate data-dependent expectations and an
MCMC-based stochastic approximation procedure is used
to approximate the model's expected sufficient statistics.
2.1.1 A Variational Approach to Estimating the
Data-Dependent Statistics
Consider any approximating distribution $Q(\mathbf{h} \mid \mathbf{v}; \mu)$, parametrized by a vector of parameters $\mu$, for the posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$. Then the log-likelihood of the DBM model has the following variational lower bound:

$$\log P(\mathbf{v}; \theta) \;\ge\; \sum_{\mathbf{h}} Q(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}(Q) \;=\; \log P(\mathbf{v}; \theta) - \mathrm{KL}\big(Q(\mathbf{h} \mid \mathbf{v}; \mu) \,\|\, P(\mathbf{h} \mid \mathbf{v}; \theta)\big), \qquad (4)$$

where $\mathcal{H}(\cdot)$ is the entropy functional and $\mathrm{KL}(Q \| P)$ denotes the Kullback-Leibler divergence between the two distributions. The bound becomes tight if and only if $Q(\mathbf{h} \mid \mathbf{v}; \mu) = P(\mathbf{h} \mid \mathbf{v}; \theta)$.

Variational learning has the nice property that in addition to maximizing the log-likelihood of the data, it also attempts to find parameters that minimize the Kullback-Leibler divergence between the approximating and true posteriors.

For simplicity and speed, we approximate the true posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$ with a fully factorized approximating distribution over the three sets of hidden units, which corresponds to the so-called mean-field approximation:

$$Q^{MF}(\mathbf{h} \mid \mathbf{v}; \mu) = \prod_{j=1}^{F_1} \prod_{l=1}^{F_2} \prod_{k=1}^{F_3} q\big(h^{(1)}_j \mid \mathbf{v}\big)\, q\big(h^{(2)}_l \mid \mathbf{v}\big)\, q\big(h^{(3)}_k \mid \mathbf{v}\big), \qquad (5)$$

where $\mu = \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}\}$ are the mean-field parameters with $q(h^{(l)}_i = 1) = \mu^{(l)}_i$ for $l = 1, 2, 3$. In this case, the variational lower bound on the log-probability of the data takes a particularly simple form:

$$\log P(\mathbf{v}; \theta) \ge \sum_{\mathbf{h}} Q^{MF}(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}(Q^{MF})$$
$$= \mathbf{v}^\top W^{(1)} \mu^{(1)} + \mu^{(1)\top} W^{(2)} \mu^{(2)} + \mu^{(2)\top} W^{(3)} \mu^{(3)} - \log \mathcal{Z}(\theta) + \mathcal{H}(Q^{MF}). \qquad (6)$$
Learning proceeds as follows: For each training example, we maximize this lower bound with respect to the variational parameters $\mu$ for fixed parameters $\theta$, which results in the mean-field fixed-point equations:

$$\mu^{(1)}_j \leftarrow g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} \mu^{(2)}_l\right), \qquad (7)$$
Algorithm 1. Learning Procedure for a Deep Boltzmann Machine with Three Hidden Layers.
1: Given: a training set of N binary data vectors $\{\mathbf{v}_n\}_{n=1}^{N}$, and M, the number of persistent Markov chains (i.e., particles).
2: Randomly initialize the parameter vector $\theta_0$ and M samples: $\{\tilde{\mathbf{v}}^{0,1}, \tilde{\mathbf{h}}^{0,1}\}, \ldots, \{\tilde{\mathbf{v}}^{0,M}, \tilde{\mathbf{h}}^{0,M}\}$, where $\tilde{\mathbf{h}} = \{\tilde{\mathbf{h}}^{(1)}, \tilde{\mathbf{h}}^{(2)}, \tilde{\mathbf{h}}^{(3)}\}$.
3: for t = 0 to T (number of iterations) do
4:   // Variational Inference:
5:   for each training example $\mathbf{v}_n$, n = 1 to N do
6:     Randomly initialize $\mu = \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}\}$ and run mean-field updates until convergence, using (7), (8), (9).
7:     Set $\mu_n = \mu$.
8:   end for
9:   // Stochastic Approximation:
10:  for each sample m = 1 to M (number of persistent Markov chains) do
11:    Sample $(\tilde{\mathbf{v}}^{t+1,m}, \tilde{\mathbf{h}}^{t+1,m})$ given $(\tilde{\mathbf{v}}^{t,m}, \tilde{\mathbf{h}}^{t,m})$ by running a Gibbs sampler for one step (2).
12:  end for
13:  // Parameter Update:
14:  $W^{(1)}_{t+1} = W^{(1)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mathbf{v}_n \big(\mu^{(1)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{v}}^{t+1,m} \big(\tilde{\mathbf{h}}^{(1)}_{t+1,m}\big)^\top \Big)$.
15:  $W^{(2)}_{t+1} = W^{(2)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mu^{(1)}_n \big(\mu^{(2)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{(1)}_{t+1,m} \big(\tilde{\mathbf{h}}^{(2)}_{t+1,m}\big)^\top \Big)$.
16:  $W^{(3)}_{t+1} = W^{(3)}_{t} + \alpha_t \Big( \frac{1}{N} \sum_{n=1}^{N} \mu^{(2)}_n \big(\mu^{(3)}_n\big)^\top - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{(2)}_{t+1,m} \big(\tilde{\mathbf{h}}^{(3)}_{t+1,m}\big)^\top \Big)$.
17:  Decrease $\alpha_t$.
18: end for
$$\mu^{(2)}_l \leftarrow g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} \mu^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} \mu^{(3)}_k\right), \qquad (8)$$

$$\mu^{(3)}_k \leftarrow g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} \mu^{(2)}_l\right), \qquad (9)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. To solve these fixed-point equations, we simply cycle through layers, updating the mean-field parameters within a single layer. Note the close connection between the form of the mean-field fixed-point updates and the form of the conditional distribution³ defined by (2).
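A compact sketch of these fixed-point updates is given below, assuming the weight matrices $W^{(1)}, W^{(2)}, W^{(3)}$ are stored as NumPy arrays of shapes (D, F1), (F1, F2), and (F2, F3). It runs a fixed number of sweeps for simplicity; in practice one would iterate until the parameters stop changing. Names are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, W3, n_iters=25):
    """Run the mean-field fixed-point updates (7)-(9) for a single data
    vector v, cycling through layers. Returns mu1, mu2, mu3."""
    rng = np.random.default_rng(0)
    mu1 = rng.random(W1.shape[1])   # mu^(1), one value per h^(1) unit
    mu2 = rng.random(W2.shape[1])   # mu^(2)
    mu3 = rng.random(W3.shape[1])   # mu^(3)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)    # eq. (7)
        mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)  # eq. (8)
        mu3 = sigmoid(mu2 @ W3)               # eq. (9)
    return mu1, mu2, mu3
```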
2.1.2 A Stochastic Approximation Approach for Estimating the Data-Independent Statistics

Given the variational parameters $\mu$, the model parameters $\theta$ are then updated to maximize the variational bound using an MCMC-based stochastic approximation [29], [39], [46].

³ Implementing the mean-field requires no extra work beyond implementing the Gibbs sampler.

Learning with stochastic approximation is straightforward. Let $\theta_t$ and $x_t = \{\mathbf{v}_t, \mathbf{h}^{(1)}_t, \mathbf{h}^{(2)}_t, \mathbf{h}^{(3)}_t\}$ be the current parameters and the state. Then $x_t$ and $\theta_t$ are updated sequentially as follows:

• Given $x_t$, sample a new state $x_{t+1}$ from the transition operator $T_{\theta_t}(x_{t+1} \leftarrow x_t)$ that leaves $P(\cdot; \theta_t)$ invariant. This can be accomplished by using Gibbs sampling (see (2)).
• A new parameter $\theta_{t+1}$ is then obtained by making a gradient step, where the intractable model's expectation $\mathbb{E}_{P_{\mathrm{model}}}[\cdot]$ in the gradient is replaced by a point estimate at sample $x_{t+1}$.

In practice, we typically maintain a set of M “persistent” sample particles $X_t = \{x_{t,1}, \ldots, x_{t,M}\}$, and use an average over those particles. The overall learning procedure for DBMs is summarized in Algorithm 1.

Stochastic approximation provides asymptotic convergence guarantees and belongs to the general class of Robbins-Monro approximation algorithms [27], [46]. Precise sufficient conditions that ensure almost sure convergence to an asymptotically stable point are given in [45], [46], and [47]. One necessary condition requires the learning rate to decrease with time so that $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. This condition can, for example, be satisfied simply by setting $\alpha_t = a/(b + t)$, for positive constants $a > 0$, $b > 0$. Other conditions ensure that the speed of convergence of the Markov chain, governed by the transition operator $T_\theta$, does not decrease too fast as $\theta$ tends to infinity. Typically, in practice the sequence $|\theta_t|$ is bounded, and the Markov chain, governed by the transition kernel $T_\theta$, is ergodic. Together with the condition on the learning rate, this ensures almost sure convergence of the stochastic approximation algorithm to an asymptotically stable point.
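The following sketch ties the pieces of Algorithm 1 together for a single update: mean-field statistics on a data batch, one Gibbs step on the persistent chains, and a Robbins-Monro learning rate $\alpha_t = a/(b+t)$. It assumes the `mean_field` and `gibbs_sweep` helpers sketched earlier in this section, and it is an illustration under those assumptions rather than a faithful reimplementation.

```python
import numpy as np

def sa_update(batch, chains, W1, W2, W3, t, a=1.0, b=1000.0):
    """One stochastic-approximation step in the spirit of Algorithm 1.
    `batch` is an (N, D) array of data vectors; `chains` is a list of M
    persistent particles (v, h1, h2, h3). Uses the mean_field and
    gibbs_sweep helpers sketched above."""
    lr = a / (b + t)  # Robbins-Monro schedule: sum(lr)=inf, sum(lr^2)<inf

    # Data-dependent statistics from mean-field (variational) inference.
    mus = [mean_field(v, W1, W2, W3) for v in batch]
    d1 = np.mean([np.outer(v, m[0]) for v, m in zip(batch, mus)], axis=0)
    d2 = np.mean([np.outer(m[0], m[1]) for m in mus], axis=0)
    d3 = np.mean([np.outer(m[1], m[2]) for m in mus], axis=0)

    # Data-independent statistics from the M persistent Gibbs chains.
    chains = [gibbs_sweep(*state, W1, W2, W3) for state in chains]
    m1 = np.mean([np.outer(c[0], c[1]) for c in chains], axis=0)
    m2 = np.mean([np.outer(c[1], c[2]) for c in chains], axis=0)
    m3 = np.mean([np.outer(c[2], c[3]) for c in chains], axis=0)

    # Gradient step (3): data-dependent minus model expectation.
    return W1 + lr * (d1 - m1), W2 + lr * (d2 - m2), W3 + lr * (d3 - m3), chains
```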
2.1.3 Greedy Layerwise Pretraining of DBMs
The learning procedure for DBMs described above can be
used by starting with randomly initialized weights, but it
works much better if the weights are initialized sensibly.
We therefore use a greedy layerwise pretraining strategy by
learning a stack of modified RBMs (for details see [29]).
This pretraining procedure is quite similar to the
pretraining procedure of DBNs [12], and it allows us to
perform approximate inference by a single bottom-up pass.
This fast approximate inference is then used to initialize the
mean-field, which then converges much faster than mean-
field with random initialization.⁴
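The modified-RBM details are given in [29]; as a rough illustration only, the sketch below stacks plain RBMs trained with one-step contrastive divergence (CD-1), using each layer's hidden activations as the next layer's data. It is a generic greedy-pretraining sketch under those simplifying assumptions, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05):
    """Train a plain binary RBM with one-step contrastive divergence (CD-1).
    The DBM pretraining in [29] uses *modified* RBMs that compensate for
    double-counting; this generic version omits that detail."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(n_epochs):
        for v in data:
            ph = sigmoid(v @ W)                       # P(h = 1 | v)
            h = (rng.random(n_hidden) < ph).astype(float)
            v_neg = (rng.random(v.shape) < sigmoid(h @ W.T)).astype(float)
            ph_neg = sigmoid(v_neg @ W)
            W += lr * (np.outer(v, ph) - np.outer(v_neg, ph_neg))
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy layerwise pretraining: train an RBM on the data, then use its
    hidden activations as 'data' for the next RBM, and so on."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)  # deterministic up-pass to the next layer
    return weights

weights = pretrain_stack(rng.integers(0, 2, (20, 16)).astype(float), [12, 8, 4])
```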
2.2 Gaussian-Bernoulli DBMs
We now briefly describe a Gaussian-Bernoulli DBM model, which we will use to model real-valued data, such as images of natural scenes and motion capture data. Gaussian-Bernoulli DBMs represent a generalization of a simpler class of models, called Gaussian-Bernoulli RBMs, which have been successfully applied to various tasks, including image classification, video action recognition, and speech recognition [17], [20], [23], [35].

In particular, consider modeling visible real-valued units $\mathbf{v} \in \mathbb{R}^D$ and let $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$, $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$, and $\mathbf{h}^{(3)} \in \{0,1\}^{F_3}$ be binary stochastic hidden units. The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ of the three-hidden-layer Gaussian-Bernoulli DBM is defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{2}\sum_i \frac{v_i^2}{\sigma_i^2} - \sum_{ij} W^{(1)}_{ij} h^{(1)}_j \frac{v_i}{\sigma_i} - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k, \qquad (10)$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ represents the set of hidden units, $\theta = \{W^{(1)}, W^{(2)}, W^{(3)}, \sigma^2\}$ are the model parameters, and $\sigma^2_i$ is the variance of input $i$. The marginal distribution over the visible vector $\mathbf{v}$ takes form

$$P(\mathbf{v}; \theta) = \frac{\sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}; \theta)\big)}{\int_{\mathbf{v}'} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}', \mathbf{h}; \theta)\big)\, d\mathbf{v}'}. \qquad (11)$$

From (10), it is straightforward to derive the following conditional distributions:

$$p\big(v_i = x \mid \mathbf{h}^{(1)}\big) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{\big(x - \sigma_i \sum_j h^{(1)}_j W^{(1)}_{ij}\big)^2}{2\sigma_i^2}\right),$$
$$p\big(h^{(1)}_j = 1 \mid \mathbf{v}\big) = g\left(\sum_i W^{(1)}_{ij} \frac{v_i}{\sigma_i}\right), \qquad (12)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. Conditional distributions over $\mathbf{h}^{(2)}$ and $\mathbf{h}^{(3)}$ remain the same as in the standard DBM model (see (2)).

Observe that conditioned on the states of the hidden units (12), each visible unit is modeled by a Gaussian distribution whose mean is shifted by the weighted combination of the hidden unit activations. The derivative of the log-likelihood with respect to $W^{(1)}$ takes form

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(1)}_{ij}} = \mathbb{E}_{P_{\mathrm{data}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right] - \mathbb{E}_{P_{\mathrm{model}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right].$$

The derivatives with respect to parameters $W^{(2)}$ and $W^{(3)}$ remain the same as in (3).

As described in the previous section, learning of the model parameters, including the variances $\sigma^2$, can be carried out using variational learning together with the stochastic approximation procedure. In practice, however, instead of learning $\sigma^2$, one would typically use a fixed, predetermined value for $\sigma^2$ [13], [24].
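As a concrete illustration of the conditionals in (12), the sketch below samples real-valued visibles given $\mathbf{h}^{(1)}$ and binary $\mathbf{h}^{(1)}$ given the visibles, with a fixed variance vector as suggested above. Names and sizes are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_visible_given_h1(h1, W1, sigma):
    """Sample real-valued visibles from the Gaussian conditional in (12):
    v_i ~ N(sigma_i * sum_j W1_ij h1_j, sigma_i^2)."""
    mean = sigma * (W1 @ h1)
    return mean + sigma * rng.standard_normal(mean.shape)

def sample_h1_given_visible(v, W1, sigma):
    """Sample binary h^(1) from p(h1_j = 1 | v) = g(sum_i W1_ij v_i / sigma_i)."""
    p = sigmoid((v / sigma) @ W1)
    return (rng.random(p.shape) < p).astype(float)

# Toy example with D = 5 visibles, F1 = 3 hidden units, fixed unit variances.
D, F1 = 5, 3
W1 = 0.1 * rng.standard_normal((D, F1))
sigma = np.ones(D)        # sigma^2 is often fixed in practice, as noted above
h1 = rng.integers(0, 2, F1).astype(float)
v = sample_visible_given_h1(h1, W1, sigma)
h1 = sample_h1_given_visible(v, W1, sigma)
```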
2.3 Multinomial DBMs
To allow DBMs to express more information and introduce more structured hierarchical priors, we will use a conditional multinomial distribution to model activities of the top-level units $\mathbf{h}^{(3)}$. Specifically, we will use M softmax units, each with “1-of-K” encoding, so that each unit contains a set of K weights. We represent the kth discrete value of a hidden unit by a vector containing 1 at the kth location and zeros elsewhere. The conditional probability of a softmax top-level unit is

$$P\big(h^{(3)}_k \mid \mathbf{h}^{(2)}\big) = \frac{\exp\big(\sum_l W^{(3)}_{lk} h^{(2)}_l\big)}{\sum_{s=1}^{K} \exp\big(\sum_l W^{(3)}_{ls} h^{(2)}_l\big)}. \qquad (13)$$

In our formulation, all M separate softmax units will share the same set of weights, connecting them to binary hidden units at the lower level (see Fig. 1). The energy of the state $\{\mathbf{v}, \mathbf{h}\}$ is then defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l \hat{h}^{(3)}_k,$$

where $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$ and $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$ represent stochastic binary units. The top layer is represented by the M softmax units $\mathbf{h}^{(3,m)}$, $m = 1, \ldots, M$, with $\hat{h}^{(3)}_k = \sum_{m=1}^{M} h^{(3,m)}_k$ denoting the count for the kth discrete value of a hidden unit.

A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times from the conditional distribution of (13). This gives us a familiar “bag-of-words” representation [30], [36]. A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.

⁴ The code for pretraining and generative learning of the DBM model is available at http://www.utstat.toronto.edu/~rsalakhu/DBM.html.
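A small sketch of this view of the top layer: compute the softmax of (13) once, then draw M samples from it and keep only the resulting count vector $\hat{\mathbf{h}}^{(3)}$, i.e., the bag-of-words representation. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_counts(h2, W3, M):
    """Sample the multinomial top layer: compute the softmax in (13) over the
    K discrete values, then draw M i.i.d. samples that share the same weights.
    Returns the count vector hhat^(3) of length K (a 'bag of words')."""
    logits = h2 @ W3                     # sum_l W3_lk h2_l, for each k
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax probabilities of (13)
    return rng.multinomial(M, p)         # counts hhat_k = sum_m h^(3,m)_k

# Toy example: F2 = 6 second-layer units, K = 10 top-level values, M = 100.
F2, K, M = 6, 10, 100
W3 = 0.1 * rng.standard_normal((F2, K))
h2 = rng.integers(0, 2, F2).astype(float)
hhat3 = sample_top_counts(h2, W3, M)     # length-K vector summing to M
```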
3 COMPOUND HDP-DBM MODEL

After a DBM model has been learned, we have an undirected model that defines the joint distribution $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)})$. One way to express what has been learned is the conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$ and a complicated prior term $P(\mathbf{h}^{(3)})$, defined by the DBM model. We can therefore rewrite the variational bound as

$$\log P(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}} Q(\mathbf{h} \mid \mathbf{v}; \mu) \log P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}) + \mathcal{H}(Q) + \sum_{\mathbf{h}^{(3)}} Q(\mathbf{h}^{(3)} \mid \mathbf{v}; \mu) \log P(\mathbf{h}^{(3)}). \qquad (14)$$

This particular decomposition lies at the core of the greedy recursive pretraining algorithm: We keep the learned conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, but maximize the variational lower bound of (14) with respect to the last term [12]. This maximization amounts to replacing $P(\mathbf{h}^{(3)})$ by a prior that is closer to the average, over all the data vectors, of the approximate conditional posterior $Q(\mathbf{h}^{(3)} \mid \mathbf{v})$.

Instead of adding an additional undirected layer (e.g., an RBM) to model $P(\mathbf{h}^{(3)})$, we can place an HDP prior over $\mathbf{h}^{(3)}$ that will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.

The part we keep, $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, represents a conditional DBM model:⁵

$$P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}) = \frac{1}{\mathcal{Z}(\theta, \mathbf{h}^{(3)})} \exp\left(\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j + \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l + \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k\right), \qquad (15)$$

which can be viewed as a two-layer DBM but with bias terms given by the states of $\mathbf{h}^{(3)}$.
3.1 A Hierarchical Bayesian Prior
In a typical hierarchical topic model, we observe a set of N documents, each of which is modeled as a mixture over topics that are shared among documents. Let there be K words in the vocabulary. A topic t is a discrete distribution over K words with probability vector $\theta_t$. Each document n has its own distribution over topics given by probabilities $\pi_n$.

In our compound HDP-DBM model, we will use a hierarchical topic model as a prior over the activities of the DBM's top-level features. Specifically, the term “document” will refer to the top-level multinomial unit $\mathbf{h}^{(3)}$, and M “words” in the document will represent the M samples, or active DBM's top-level features, generated by this multinomial unit. Words in each document are drawn by choosing a topic t with probability $\pi_{nt}$, and then choosing a word w with probability $\theta_{tw}$. We will often refer to topics as our learned higher level features, each of which defines a topic-specific distribution over DBM's $\mathbf{h}^{(3)}$ features. Let $h^{(3)}_{in}$ be the ith word in document n, and $x_{in}$ be its topic. We can specify the following prior over $\mathbf{h}^{(3)}$:

$$\pi_n \mid \alpha, \pi^{(g)} \sim \mathrm{Dir}\big(\alpha \pi^{(g)}\big), \quad \text{for each document } n = 1, \ldots, N,$$
$$\theta_t \mid \beta, \theta^{(g)} \sim \mathrm{Dir}\big(\beta \theta^{(g)}\big), \quad \text{for each topic } t = 1, \ldots, T,$$
$$x_{in} \mid \pi_n \sim \mathrm{Mult}(1, \pi_n), \quad \text{for each word } i = 1, \ldots, M,$$
$$h^{(3)}_{in} \mid x_{in}, \theta_{x_{in}} \sim \mathrm{Mult}(1, \theta_{x_{in}}),$$

where $\pi^{(g)}$ is the global distribution over topics, $\theta^{(g)}$ is the global distribution over K words, and $\alpha$ and $\beta$ are concentration parameters.
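For intuition, the sketch below samples from this finite topic-model prior using the symbol names adopted in the reconstruction above (which are themselves assumptions about the original notation): a global topic distribution and a global word distribution, per-document topic proportions, and M "words" per document.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_documents(N=4, T=5, K=20, M=100, alpha=2.0, beta=1.0):
    """Generative sketch of the finite topic-model prior over h^(3) described
    above (illustrative names). Returns per-document word-count vectors."""
    pi_g = rng.dirichlet(np.ones(T))            # global distribution over topics
    theta_g = rng.dirichlet(np.ones(K))         # global distribution over K words
    theta = rng.dirichlet(beta * theta_g + 1e-6, size=T)   # topic-word dists
    docs = np.zeros((N, K), dtype=int)
    for n in range(N):
        pi_n = rng.dirichlet(alpha * pi_g + 1e-6)          # document's topics
        for _ in range(M):                      # M "words" = active h^(3) features
            t = rng.choice(T, p=pi_n)           # x_in ~ Mult(1, pi_n)
            w = rng.choice(K, p=theta[t])       # h^(3)_in ~ Mult(1, theta_t)
            docs[n, w] += 1
    return docs

docs = sample_documents()
```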
Let us further assume that our model is presented with a fixed two-level category hierarchy. In particular, suppose that N documents, or objects, are partitioned into C basic-level categories (e.g., cow, sheep, car). We represent such a partition by a vector $z^b$ of length N, each entry of which is $z^b_n \in \{1, \ldots, C\}$. We also assume that our C basic-level

Fig. 1. Left: Multinomial DBM model: The top layer represents M softmax hidden units $\mathbf{h}^{(3)}$ which share the same set of weights. Right: A different interpretation: M softmax units are replaced by a single multinomial unit which is sampled M times.

⁵ Our experiments reveal that using DBNs instead of DBMs decreased model performance.


References

C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Transactions on Intelligent Systems and Technology, 2011.
D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Advances in Neural Information Processing Systems (NIPS), 2001.
G.E. Hinton and R.R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, pp. 504-507, 2006.
G.E. Hinton, S. Osindero, and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.