Under review as a conference paper at ICLR 2017
AN EMPIRICAL ANALYSIS OF DEEP NETWORK LOSS SURFACES

Daniel Jiwoong Im^{1,2}, Michael Tao^{3}, & Kristin Branson^{2}
^{1} Janelia Research Campus, HHMI   ^{2} AIFounded Inc.   ^{3} University of Toronto
{imd, bransonk}@janelia.hhmi.org
mtao@dgp.toronto.edu

Work done during an internship at Janelia Research Campus.
ABSTRACT
The training of deep neural networks is a high-dimensional optimization problem with respect to the loss function of a model. Unfortunately, these functions are high-dimensional and non-convex and hence difficult to characterize. In this paper, we empirically investigate the geometry of the loss functions for state-of-the-art networks with multiple stochastic optimization methods. We do this through several experiments that are visualized on polygons to understand how and when these stochastic optimization methods find local minima.
1 INTRODUCTION
Deep neural networks are trained by optimizing an extremely high-dimensional loss function with
respect to the weights of the network’s linear layers. The objective function minimized is some
measure of the error of the network’s predictions based on these weights compared to training data.
This loss function is non-convex and has many local minima. These loss functions are usually
minimized using first-order gradient descent (Robbins & Monro, 1951; Polyak, 1964) algorithms
such as stochastic gradient descent (SGD) (Bottou, 1991). The success of deep learning critically
depends on how well we can minimize this loss function, both in terms of the quality of the local
minima found and the time to find them. Understanding the geometry of this loss function and how
well optimization algorithms can find good local minima is thus of vital importance.
Several works have theoretically analyzed and characterized the geometry of deep network loss
functions. However, to make these analyses tractable, they have relied on simplifications of the
network structures, including that the networks are linear (Saxe et al., 2014), or assuming the path
and variable independence of the neural networks (Choromanska et al., 2015). Orthogonally, the
performance of various gradient descent algorithms has been theoretically characterized (Nesterov,
1983). Again, these analyses make simplifying assumptions, in particular that the loss function is
strictly convex, i.e. there is only a single local minimum.
In this work, we empirically investigated the geometry of the real loss functions for state-of-the-art
networks and data sets. In addition, we investigated how popular optimization algorithms interact
with these real loss surfaces. To do this, we plotted low-dimensional projections of the loss function
in subspaces chosen to investigate properties of the local minima selected by different algorithms.
We chose these subspaces to address the following questions:
What types of changes to the optimization procedure result in different local minima?
Do different optimization algorithms find qualitatively different types of local minima?
2 RELATED WORK
2.1 LOSS SURFACES
There have been several attempts to understand the loss surfaces of deep neural networks. Some
have studied the critical points of the deep linear neural networks (Baldi, 1989; Baldi & Hornik,
1989; Baldi & Lu, 2012). Others further investigated the learning dynamics of the deep linear
neural networks (Saxe et al., 2014). More recently, several others have attempted to study the loss
surfaces of deep non-linear neural networks (Choromanska et al., 2015; Kawaguchi, 2016; Soudry
& Carmon, 2016).
One approach is to analogize the states of neurons to the magnetic dipoles used in spherical spin-glass Ising models from statistical physics (Parisi, 2016; Fyodorov & Williams, 2007; Bray & Dean, 2007). Choromanska et al. (2015) attempted to understand the loss function of neural networks by studying the random Gaussian error functions of Ising models. Recent results (Kawaguchi, 2016; Soudry & Carmon, 2016) have provided cursory evidence in agreement with the theory of Choromanska et al. (2015), finding that there are no “poor” local minima in neural networks, though still under strong assumptions.
There is some potential disconnect between these theoretical results and what is found in practice, due to several strong assumptions, such as the activations of the hidden units and the output being independent of the previous hidden units and the input data. Dauphin et al. (2014) empirically investigated properties of the critical points of neural network loss functions and demonstrated that these critical points behave similarly to the critical points of random Gaussian error functions in high-dimensional space. We will provide further evidence along this trajectory.
2.2 OPTIMIZATION
In practice, the local minima of deep network loss functions are for the most part decent. This implies that we probably do not need to take many precautions to avoid bad local minima in practice. If all local minima are decent, then the task of finding a decent local minimum quickly is reduced to the task of finding any local minimum quickly. From an optimization perspective, this implies that focusing on designing fast methods is of key importance for training deep networks.
In the literature, the common way to measure the performance of optimization methods is to analyze them on nice convex quadratic functions (Polyak, 1964; Broyden, 1970; Nesterov, 1983; Martens, 2010; Erdogdu & Montanari, 2015), even though the methods are applied to non-convex problems. For non-convex problems, however, if two methods converge to different local minima, their performance will be dictated by how those methods solve the two corresponding convex subproblems. It is challenging to show that one method will beat another without knowledge of the sort of convex subproblems encountered, which is generally not known a priori. What we will explore is whether there are indeed some characteristics that can be found experimentally. If so, perhaps one could identify where these analytical results are valid, or even improve methods for training neural networks.
2.2.1 LEARNING PHASES
Figure 1: An example learning curve of a neural network, with a fast-decaying phase followed by a slowly-decaying phase.
One interesting empirical observation is that the incremental improvement of optimization methods often decreases rapidly, even on non-convex problems. This behavior has been discussed as a “transient” phase followed by a “minimization” phase (Sutskever et al., 2013)
where the former finds the neighborhood of a decent local minimum and the latter finds the local minimum within that neighborhood. The existence of these phases implies that, if certain methods are better during different phases, one could create novel methods that schedule when to apply each method.
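As a toy illustration of that idea, the sketch below (plain NumPy on a synthetic objective; the switching rule, thresholds, and all names are hypothetical, not a method from this paper) schedules an adaptive update for the transient phase and plain SGD for the minimization phase:

```python
import numpy as np

def loss_and_grad(w):
    """Synthetic non-convex objective (a stand-in for a network loss)."""
    loss = float(np.sum(0.5 * w ** 2 + np.sin(3.0 * w)))
    grad = w + 3.0 * np.cos(3.0 * w)
    return loss, grad

def rmsprop_step(w, g, state, lr=1e-2, decay=0.9, eps=1e-8):
    """Adaptive per-parameter step, used here for the 'transient' phase."""
    state["v"] = decay * state["v"] + (1 - decay) * g ** 2
    return w - lr * g / (np.sqrt(state["v"]) + eps)

def sgd_step(w, g, lr=1e-2):
    """Plain SGD step, used here for the 'minimization' phase."""
    return w - lr * g

rng = np.random.default_rng(0)
w = rng.normal(size=100)
state = {"v": np.zeros_like(w)}
prev_loss = np.inf
phase = "transient"

for step in range(2000):
    loss, g = loss_and_grad(w)
    # Hypothetical schedule: switch once per-step progress becomes small.
    if phase == "transient" and prev_loss - loss < 1e-4:
        phase = "minimization"
    w = rmsprop_step(w, g, state) if phase == "transient" else sgd_step(w, g)
    prev_loss = loss
```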
3 EXPERIMENTAL SETUP AND TOOLS
3.1 NETWORK ARCHITECTURES AND DATA SETS
We conducted experiments on three state-of-the-art neural network architectures. Network-in-
Network (NIN) (Lin et al., 2014) and the VGG (Simonyan & Zisserman, 2015) network are feed-
forward convolutional networks developed for image classification, and have excellent performance
on the Imagenet (Russakovsky et al., 2014) and CIFAR10 (Krizhevsky, 2009) data sets. The long
short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997) is a recurrent neural net-
work that has been successful in tasks that take variable-length sequences as input and/or produce
variable-length sequences as output, such as speech recognition and image caption generation. These
are large networks currently used in many machine vision and learning tasks, and the loss functions
minimized by each are highly non-convex.
All results using the feed-forward convolutional networks (NIN and VGG) are on the CIFAR10
image classification data set, while the LSTM was tested on the Penn Treebank next-word prediction
data set.
3.2 OPTIMIZATION METHODS
We analyzed the performance of five popular gradient-descent optimization methods for these learn-
ing frameworks: Stochastic gradient descent (SGD) (Robbins & Monro, 1951), stochastic gradient
descent with momentum (SGDM), RMSprop (Tieleman & Hinton, 2012), Adadelta (Zeiler et al.,
2011), and ADAM (Kingma & Ba, 2014). These are all first-order gradient descent algorithms that
estimate the gradients based on randomly-grouped minibatches of training examples. One of the
major differences between these algorithms is how they select the weight-update step-size at each
iteration, with SGD and SGDM using fixed schedules, and RMSprop, Adadelta, and ADAM using
adaptive, per-parameter step-sizes. Details are provided in Section A.2.
In addition to these five existing optimization methods, we compare to a new gradient descent method we developed based on the family of Runge-Kutta integrators. In our experiments, we tested a second-order Runge-Kutta integrator in combination with SGD (RK2) and in combination with ADAM (ADAM&RK2). Details are provided in Section A.3.
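The paper's exact formulation lives in its Section A.3, which is not reproduced here; the following is only a rough, hypothetical sketch of a generic second-order (midpoint) Runge-Kutta step applied to stochastic gradients, not the authors' method:

```python
import numpy as np

def rk2_sgd_step(w, stochastic_grad, lr):
    """Generic midpoint (second-order Runge-Kutta) update on the loss.

    w               : 1-D numpy array of weights
    stochastic_grad : function mapping weights -> gradient estimate
                      (e.g. computed on the current minibatch)
    lr              : learning rate

    Note the two gradient evaluations per update, consistent with the
    paper's remark that its Runge-Kutta variants need two stochastic
    gradient computations.
    """
    g1 = stochastic_grad(w)        # gradient at the current weights
    w_mid = w - 0.5 * lr * g1      # tentative half step
    g2 = stochastic_grad(w_mid)    # gradient at the midpoint
    return w - lr * g2             # full step using the midpoint gradient

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
w = np.ones(10)
for _ in range(100):
    w = rk2_sgd_step(w, lambda v: v, lr=0.1)
```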
3.3 ANALYSIS METHODS
Several of our empirical analyses are based on the technique of Goodfellow et al. (2015). They visualize the loss function by projecting it down to one carefully chosen dimension and plotting its value at a set of samples along this dimension. The projection space is chosen based on important weight configurations: they plot the value of the loss function at linear interpolations between two weight configurations. They perform two such analyses: one in which they interpolate between the initialization weights and the final learned weights, and one in which they interpolate between two sets of final weights, each learned from a different initialization.
In this work, we use a similar visualization technique, but choose different low-dimensional subspaces for the projection of the loss function. These subspaces are based on the initial weights as well as the final weights learned using the different optimization algorithms and combinations of them, and are chosen to answer a variety of questions about the loss function and how the different optimization algorithms interact with it. In contrast, Goodfellow et al. only looked at SGDM. In addition, we explore two-dimensional projections of the loss function, allowing us to better visualize the space between local minima. We do this via barycentric and bilinear interpolation for triplets and quartets of points, respectively (details in Section A.1).
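To make the interpolation procedure concrete, here is a minimal NumPy sketch of the one-dimensional and bilinear variants (the toy loss and all names are illustrative, not the paper's code; barycentric interpolation for triplets is analogous):

```python
import numpy as np

def loss_on_segment(loss_fn, theta_a, theta_b, num=50):
    """Loss at weights linearly interpolated between two configurations.

    theta_a, theta_b : 1-D numpy arrays (e.g. initial and final weights,
                       or final weights found by two optimizers).
    Returns the interpolation coefficients and the loss at each point.
    """
    alphas = np.linspace(0.0, 1.0, num)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def loss_on_bilinear_patch(loss_fn, t00, t01, t10, t11, num=25):
    """Loss on a 2-D patch spanned by four weight configurations,
    sampled via bilinear interpolation (for quartets of points)."""
    grid = np.linspace(0.0, 1.0, num)
    surface = np.empty((num, num))
    for i, u in enumerate(grid):
        for j, v in enumerate(grid):
            theta = ((1 - u) * (1 - v) * t00 + (1 - u) * v * t01
                     + u * (1 - v) * t10 + u * v * t11)
            surface[i, j] = loss_fn(theta)
    return surface

# Toy usage with a synthetic non-convex loss standing in for a network loss.
rng = np.random.default_rng(0)
loss_fn = lambda w: float(np.sum(np.sin(3 * w) + 0.1 * w ** 2))
theta_init, theta_sgd = rng.normal(size=100), rng.normal(size=100)
alphas, losses = loss_on_segment(loss_fn, theta_init, theta_sgd)
```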
We refer to the critical points found using these variants of SGD, for which the gradient is approximately 0, as local minima. Our evidence that these are local minima as opposed to saddle points is similar to that presented by Goodfellow et al. (2015): if we interpolate beyond the critical point in this one-dimensional projection, the loss increases (Fig. 10).

Figure 2: Visualization of the loss surface at weights interpolated between two initial configurations and the final weight vectors learned using SGD from these initializations; panels (a) NIN and (b) VGG.

Figure 3: Visualization of the loss surface at weights interpolated between the weights learned by four different algorithms (SGD, RK2, ADAM, ADAM&RK2) from the same initialization; panels (a) NIN and (b) VGG.
3.4 TECHNICAL DETAILS
We used the VGG and NIN implementations from https://github.com/szagoruyko/cifar.torch.git.
The batch size was set to 128 and the number of epochs was set to 200. The learning rate was chosen from the discrete set [0.2, 0.1, 0.05, 0.01] for SGD and [0.002, 0.001, 0.0005, 0.0001] for the adaptive learning-rate methods. We doubled the learning rates when we ran our augmented versions with Runge-Kutta because they required two stochastic gradient computations per epoch. We used batch normalization and dropout to regularize our networks. All experiments were run on a 6-core Intel(R)
Xeon(R) CPU @ 2.40GHz with a TITAN X.
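For reference, these settings can be summarized as follows; this is a hypothetical configuration dictionary whose key names are illustrative, with only the values taken from this section:

```python
# Hypothetical summary of the training setup described above.
train_config = {
    "datasets": {"NIN": "CIFAR10", "VGG": "CIFAR10", "LSTM": "Penn Treebank"},
    "batch_size": 128,
    "epochs": 200,
    "learning_rate_grid": {
        "sgd": [0.2, 0.1, 0.05, 0.01],
        "adaptive": [0.002, 0.001, 0.0005, 0.0001],  # RMSprop, Adadelta, ADAM
    },
    # Runge-Kutta variants take two stochastic gradient computations,
    # so their learning rates were doubled.
    "rk_learning_rate_multiplier": 2.0,
    "regularization": ["batch_normalization", "dropout"],
}
```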
4 EXPERIMENTAL RESULTS
4.1 DIFFERENT OPTIMIZATION METHODS FIND DIFFERENT LOCAL MINIMA
We trained the neural networks described above using each optimization method starting from the
same initial weights and with the same minibatching. We computed the value of the loss function
for weight vectors interpolated between the initial weights, the final weights for one algorithm, and
the final weights for a second algorithm for several pairings of algorithms. The results are shown in
the lower triangle of Table 1.
For every pair of optimization algorithms, we observe that the training loss shows a sharp increase along the path interpolated between the final weights of the two algorithms. This suggests that each optimization algorithm found a different critical point, despite starting at the same initialization. We investigated the space between other triplets and quadruples of weight vectors (Figures 2 and 3), and even in these projections of the loss function, we still see that the local minima returned by different algorithms are separated by regions of high loss.
Deep networks are overparameterized. For example, if we switch all corresponding weights for a
pair of nodes in our network, we will obtain effectively the same network, with both the original
and permuted networks outputting the same prediction for a given input. To ensure that the weight
vectors returned by the different algorithms were functionally different, we compared the outputs of
the networks on each example in a validation data set:
dist(θ_1, θ_2) = \sqrt{ \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \| F(x_i, θ_1) − F(x_i, θ_2) \|^2 },

where θ_1 and θ_2 are the weights learned by two different optimization algorithms, x_i is the input for a validation example, and F(x, θ) is the output of the network for weights θ on input x.
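As a concrete illustration of this distance, and of the permutation symmetry mentioned above, here is a minimal NumPy sketch on a toy two-layer network (the architecture, data, and names are illustrative, not the paper's models):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Toy two-layer network: x -> ReLU(x W1 + b1) W2 + b2."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

def functional_dist(X, params1, params2):
    """Root-mean-squared output difference over a validation set X,
    matching dist(theta_1, theta_2) defined above."""
    out1 = forward(X, *params1)
    out2 = forward(X, *params2)
    return np.sqrt(np.mean(np.sum((out1 - out2) ** 2, axis=1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                        # validation inputs
W1, b1 = rng.normal(size=(20, 50)), rng.normal(size=50)
W2, b2 = rng.normal(size=(50, 10)), rng.normal(size=10)

# An independently drawn second weight vector: large functional distance.
V1, c1 = rng.normal(size=(20, 50)), rng.normal(size=50)
V2, c2 = rng.normal(size=(50, 10)), rng.normal(size=10)
print(functional_dist(X, (W1, b1, W2, b2), (V1, c1, V2, c2)))  # > 0

# Permuting hidden units changes the weights but not the function:
perm = rng.permutation(50)
print(functional_dist(X, (W1, b1, W2, b2),
                      (W1[:, perm], b1[perm], W2[perm, :], b2)))  # ≈ 0.0
```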
Table 1 grid (see caption below): rows and columns correspond to SGD, Adadelta, RMSprop, Adam, RK2, and Adam&RK2. Lower-triangle panels show loss surfaces over the initial weights ("Init") and the two learned weight vectors; upper-triangle panels plot functional distance against the interpolation coefficient from 0 to 1.
Table 1: Visualization of the loss surface near and between local minima found by different opti-
mization methods. Each box corresponds to a pair of optimization methods. In the lower triangle,
we plot the projection of the loss surface at weight vectors between the initial weight and the learned
weights found by the two optimization methods. Color as well as height of the surface indicate the
loss function value. In the upper triangle, we plot the functional difference between the network
corresponding to the learned weights for the first algorithm and networks corresponding to weights
linearly interpolated between the first and second algorithm’s learned weights. (Best viewed in
zoom)
References

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory.
Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization.
Russakovsky, O., et al. (2014). ImageNet Large Scale Visual Recognition Challenge.
Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition.