Under review as a conference paper at ICLR 2017
AN EMPIRICAL ANALYSIS OF DEEP NETWORK LOSS
SURFACES
Daniel Jiwoong Im^{1,2,∗}, Michael Tao^{3}, & Kristin Branson^{2}
^{1} Janelia Research Campus, HHMI, ^{2} AIFounded Inc., ^{3} University of Toronto
{imd, bransonk}@janelia.hhmi.org
{mtao}@dgp.toronto.edu
ABSTRACT
The training of deep neural networks is a high-dimensional optimization problem
with respect to the loss function of a model. Unfortunately, these functions are
high-dimensional and non-convex, and hence difficult to characterize. In this paper,
we empirically investigate the geometry of the loss functions for state-of-the-art
networks with multiple stochastic optimization methods. We do this through several
experiments that are visualized on polygons to understand how and when
these stochastic optimization methods find local minima.
1 INTRODUCTION
Deep neural networks are trained by optimizing an extremely high-dimensional loss function with
respect to the weights of the network’s linear layers. The objective function minimized is some
measure of the error of the network’s predictions based on these weights compared to training data.
This loss function is non-convex and has many local minima. These loss functions are usually
minimized using first-order gradient descent (Robbins & Monro, 1951; Polyak, 1964) algorithms
such as stochastic gradient descent (SGD) (Bottou, 1991). The success of deep learning critically
depends on how well we can minimize this loss function, both in terms of the quality of the local
minima found and the time to find them. Understanding the geometry of this loss function and how
well optimization algorithms can find good local minima is thus of vital importance.
Several works have theoretically analyzed and characterized the geometry of deep network loss
functions. However, to make these analyses tractable, they have relied on simplifications of the
network structures, including that the networks are linear (Saxe et al., 2014), or assuming the path
and variable independence of the neural networks (Choromanska et al., 2015). Orthogonally, the
performance of various gradient descent algorithms has been theoretically characterized (Nesterov,
1983). Again, these analyses make simplifying assumptions, in particular that the loss function is
strictly convex, i.e. there is only a single local minimum.
In this work, we empirically investigated the geometry of the real loss functions for state-of-the-art
networks and data sets. In addition, we investigated how popular optimization algorithms interact
with these real loss surfaces. To do this, we plotted low-dimensional projections of the loss function
in subspaces chosen to investigate properties of the local minima selected by different algorithms.
We chose these subspaces to address the following questions:
• What types of changes to the optimization procedure result in different local minima?
• Do different optimization algorithms find qualitatively different types of local minima?
2 RELATED WORK
2.1 LOSS SURFACES
There have been several attempts to understand the loss surfaces of deep neural networks. Some
have studied the critical points of deep linear neural networks (Baldi, 1989; Baldi & Hornik,
1989; Baldi & Lu, 2012). Others further investigated the learning dynamics of deep linear
neural networks (Saxe et al., 2014). More recently, several others have attempted to study the loss
surfaces of deep non-linear neural networks (Choromanska et al., 2015; Kawaguchi, 2016; Soudry
& Carmon, 2016).
∗ Work done during an internship at Janelia Research Campus
One approach is to treat the states of neurons as analogous to the magnetic dipoles used in spherical
spin-glass Ising models from statistical physics (Parisi, 2016; Fyodorov & Williams, 2007; Bray & Dean,
2007). Choromanska et al. (2015) attempted to understand the loss function of neural networks
by studying the random Gaussian error functions of Ising models. Recent results (Kawaguchi,
2016; Soudry & Carmon, 2016) have provided cursory evidence in agreement with the theory provided
by Choromanska et al. (2015), in that they found that there are no “poor” local minima in
neural networks, albeit still under strong assumptions.
There is a potential disconnect between these theoretical results and what is found in practice,
due to several strong assumptions, such as the activations of the hidden units and the output being
independent of the previous hidden units and input data. Dauphin et al. (2014) empirically
investigated properties of the critical points of neural network loss functions and demonstrated that
these critical points behave similarly to the critical points of random Gaussian error functions in
high-dimensional space. We present further evidence along these lines.
2.2 OPTIMIZATION
In practice, the local minima of deep network loss functions are for the most part decent. This
implies that we probably do not need to take many precautions to avoid bad local minima in practice.
If all local minima are decent, then the task of finding a decent local minimum quickly is reduced to
the task of finding any local minimum quickly. From an optimization perspective, this implies that
focusing solely on designing fast methods is of key importance for training deep networks.
In the literature, the common way to measure the performance of optimization methods is to
analyze them on well-behaved convex quadratic functions (Polyak, 1964; Broyden, 1970; Nesterov, 1983;
Martens, 2010; Erdogdu & Montanari, 2015), even though the methods are applied to non-convex
problems. For non-convex problems, however, if two methods converge to different local minima,
their performance will be dictated by how those methods solve those two convex subproblems. It
is challenging to show that one method will beat another without knowledge of the sort of convex
subproblems encountered, which is generally not known a priori. What we will explore is whether
there are indeed some characteristics that can be found experimentally. If so, perhaps one could
determine where these analytical results are valid, or even improve methods for training neural networks.
2.2.1 LEARNING PHASES
Figure 1: An example of a learning curve of a neural network (plot annotations: “Fast decaying”,
“Slowly decaying”).
One interesting empirical observation is that the incremental improvement
of optimization methods often decreases rapidly, even in non-convex problems. This behavior has
been discussed as a “transient” phase followed by a “minimization” phase (Sutskever et al., 2013),
where the former finds the neighborhood of a decent local minimum and the latter finds the local
minimum within that neighborhood. The existence of these phases implies that if certain methods are
better at different phases, one could create novel methods that schedule when to apply each method.
3 EXPERIMENTAL SETUP AND TOOLS
3.1 NETWORK ARCHITECTURES AND DATA SETS
We conducted experiments on three state-of-the-art neural network architectures. Network-in-Network
(NIN) (Lin et al., 2014) and the VGG (Simonyan & Zisserman, 2015) network are feed-forward
convolutional networks developed for image classification, and have excellent performance
on the ImageNet (Russakovsky et al., 2014) and CIFAR10 (Krizhevsky, 2009) data sets. The long
short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997) is a recurrent neural network
that has been successful in tasks that take variable-length sequences as input and/or produce
variable-length sequences as output, such as speech recognition and image caption generation. These
are large networks currently used in many machine vision and learning tasks, and the loss functions
minimized by each are highly non-convex.
All results using the feed-forward convolutional networks (NIN and VGG) are on the CIFAR10
image classification data set, while the LSTM was tested on the Penn Treebank next-word prediction
data set.
3.2 OPTIMIZATION METHODS
We analyzed the performance of five popular gradient-descent optimization methods for these learning
frameworks: stochastic gradient descent (SGD) (Robbins & Monro, 1951), stochastic gradient
descent with momentum (SGDM), RMSprop (Tieleman & Hinton, 2012), Adadelta (Zeiler et al.,
2011), and ADAM (Kingma & Ba, 2014). These are all first-order gradient descent algorithms that
estimate the gradients based on randomly-grouped minibatches of training examples. One of the
major differences between these algorithms is how they select the weight-update step-size at each
iteration, with SGD and SGDM using fixed schedules, and RMSprop, Adadelta, and ADAM using
adaptive, per-parameter step-sizes. Details are provided in Section A.2.
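To make the distinction between fixed and adaptive, per-parameter step sizes concrete, the following is a minimal NumPy sketch of textbook update rules for SGD, SGD with momentum, and RMSprop. The function names, default hyperparameters, and the toy gradient are assumptions chosen for illustration; the exact variants used in our experiments are those described in Section A.2.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Plain SGD: one global step size shared by every parameter."""
    return theta - lr * grad

def sgdm_step(theta, velocity, grad, lr, momentum=0.9):
    """SGD with (heavy-ball) momentum: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

def rmsprop_step(theta, sq_avg, grad, lr, decay=0.9, eps=1e-8):
    """RMSprop: divide each coordinate by a running RMS of its own gradients."""
    sq_avg = decay * sq_avg + (1.0 - decay) * grad ** 2
    return theta - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg

# Hypothetical gradient whose two coordinates differ in scale by a factor of 100.
theta = np.array([1.0, 1.0])
grad = np.array([100.0, 1.0])

sgd_move = theta - sgd_step(theta, grad, lr=0.01)
rms_theta, _ = rmsprop_step(theta, np.zeros(2), grad, lr=0.01)
rms_move = theta - rms_theta
```

In this toy case the SGD displacement inherits the 100:1 scale of the gradient, while the first RMSprop displacement is nearly identical in both coordinates, which is the sense in which the step size is per-parameter.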
In addition to these five existing optimization methods, we compare to a new gradient descent
method we developed based on the family of Runge-Kutta integrators. In our experiments, we
tested a second-order Runge-Kutta integrator in combination with SGD (RK2) and in combination
with ADAM (ADAM&RK2). Details are provided in Section A.3.
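As a rough illustration of how a second-order Runge-Kutta integrator can be turned into a descent step, the sketch below applies the generic explicit-midpoint rule to the gradient flow dθ/dt = −∇L(θ). This is an assumed textbook formulation, not necessarily the exact update of Section A.3; `rk2_midpoint_step`, `euler_step`, and the toy quadratic are hypothetical names and choices.

```python
import numpy as np

def euler_step(theta, grad_fn, lr):
    """Ordinary gradient descent, i.e. a forward-Euler step on the gradient flow."""
    return theta - lr * grad_fn(theta)

def rk2_midpoint_step(theta, grad_fn, lr):
    """Explicit-midpoint (second-order Runge-Kutta) step on the gradient flow.

    Uses two gradient evaluations per step: one at the current point and
    one at a half-step look-ahead point.
    """
    k1 = grad_fn(theta)                    # gradient at the current weights
    k2 = grad_fn(theta - 0.5 * lr * k1)    # gradient at the half-step point
    return theta - lr * k2

# Toy loss L(theta) = 0.5 * theta^2, whose exact gradient flow is theta * exp(-t).
grad_fn = lambda theta: theta
theta0 = np.array([1.0])
exact = theta0 * np.exp(-0.1)
euler = euler_step(theta0, grad_fn, lr=0.1)
rk2 = rk2_midpoint_step(theta0, grad_fn, lr=0.1)
```

On this toy problem the midpoint step tracks the exact flow more closely than the Euler step at the same step size, at the cost of a second gradient evaluation. In a stochastic setting each evaluation would use a minibatch gradient; whether both evaluations share the same minibatch is left open here.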
3.3 ANALYSIS METHODS
Several of our empirical analyses are based on the technique of Goodfellow et al. (2015).
They visualize the loss function by projecting it down to one carefully chosen dimension and
plotting the value of the loss function at a set of samples along this dimension. The projection
space is chosen based on important weight configurations: they plot the value of the loss
function at linear interpolations between two weight configurations. They perform two such analyses:
one in which they interpolate between the initialization weights and the final learned weights,
and one in which they interpolate between two sets of final weights, each learned from different
initializations.
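The one-dimensional analysis described above can be sketched in a few lines of NumPy. The sketch assumes the weights are flattened into a single vector and that a callable `loss_fn` evaluates the loss at such a vector; `interpolate_loss` and the two-minimum `toy_loss` are hypothetical stand-ins for a real network and data set.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, alphas):
    """Evaluate loss_fn at (1 - alpha) * theta_a + alpha * theta_b for each alpha."""
    return np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])

def toy_loss(theta):
    """Hypothetical non-convex loss with minima at (-1, 0) and (+1, 0)."""
    return (theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2

theta_a = np.array([-1.0, 0.0])   # stands in for one set of learned weights
theta_b = np.array([1.0, 0.0])    # stands in for a second set of learned weights

# Coefficients outside [0, 1] extrapolate beyond the two endpoints.
alphas = np.linspace(-0.25, 1.25, 7)
losses = interpolate_loss(toy_loss, theta_a, theta_b, alphas)
```

Here the loss is zero at both endpoints and rises between them and beyond them, which is the qualitative signature of two distinct minima separated by a barrier along the interpolated path.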
In this work, we use a similar visualization technique, but choose different low-dimensional
subspaces for the projection of the loss function. These subspaces are based on the initial weights as
well as the final weights learned using the different optimization algorithms and combinations of
them, and are chosen to answer a variety of questions about the loss function and how the different
optimization algorithms interact with this loss function. In contrast, Goodfellow et al. only looked
at SGDM. In addition, we explore the use of two-dimensional projections of the loss function,
allowing us to better visualize the space between local minima. We do this via barycentric and bilinear
interpolation for triplets and quartets of points, respectively (details in Section A.1).
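For readers unfamiliar with the two interpolation schemes, the following sketch shows the standard formulas applied to flattened weight vectors. It assumes the usual definitions (a convex combination for three points, a blend over the unit square for four); the helper names and the grid construction are hypothetical, and the precise parameterization we use is the one given in Section A.1.

```python
import numpy as np

def barycentric_point(thetas, coeffs):
    """Convex combination of three weight vectors; coeffs are >= 0 and sum to 1."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.tensordot(coeffs, np.stack(thetas), axes=1)

def bilinear_point(t00, t10, t01, t11, u, v):
    """Blend four weight vectors placed at the corners of the unit square."""
    return ((1 - u) * (1 - v) * t00 + u * (1 - v) * t10
            + (1 - u) * v * t01 + u * v * t11)

def triangle_grid(n):
    """Barycentric coefficient triples on a regular triangular grid of resolution n."""
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]

# Hypothetical weight vectors, e.g. an initialization and two learned solutions.
a, b, c, d = (np.array(v, dtype=float) for v in ([0, 0], [2, 0], [0, 2], [2, 2]))
centroid = barycentric_point([a, b, c], (1 / 3, 1 / 3, 1 / 3))
center = bilinear_point(a, b, c, d, 0.5, 0.5)
```

Evaluating the loss at every point of `triangle_grid(n)` (or of a regular grid in `u` and `v`) yields the surface that is plotted over the triangle (or square) spanned by the chosen weight vectors.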
We refer to the critical points found using these variants of SGD, for which the gradient is
approximately 0, as local minima. Our evidence that these are local minima as opposed to saddle points is
similar to that presented in Goodfellow et al. (2015). If we interpolate beyond the
critical point, in this one-dimensional projection, the loss increases (Fig. 10).
Figure 2: Visualization of the loss surface at weights interpolated between two initial
configurations and the final weight vectors learned using SGD from these initializations.
Panels: (a) NIN, (b) VGG.
Figure 3: Visualization of the loss surface at weights interpolated between the weights learned
by four different algorithms (SGD, RK2, ADAM, ADAM&RK2) from the same initialization.
Panels: (a) NIN, (b) VGG.
3.4 TECHNICAL DETAILS
We used the VGG and NIN implementations from https://github.com/szagoruyko/cifar.torch.git.
The batch size was set to 128 and the number of epochs was set to 200. The learning rate was chosen
from the set {0.2, 0.1, 0.05, 0.01} for SGD and {0.002, 0.001, 0.0005, 0.0001} for
adaptive learning methods. We doubled the learning rates when we ran our augmented versions with
Runge-Kutta because they required two stochastic gradient computations per epoch. We used batch
normalization and dropout to regularize our networks. All experiments were run on a 6-core Intel(R)
Xeon(R) CPU @ 2.40GHz with a TITAN X.
4 EXPERIMENTAL RESULTS
4.1 DIFFERENT OPTIMIZATION METHODS FIND DIFFERENT LOCAL MINIMA
We trained the neural networks described above using each optimization method starting from the
same initial weights and with the same minibatching. We computed the value of the loss function
for weight vectors interpolated between the initial weights, the final weights for one algorithm, and
the final weights for a second algorithm for several pairings of algorithms. The results are shown in
the lower triangle of Table 1.
For every pair of optimization algorithms, we observe that the training loss between the final weights
for different algorithms shows a sharp increase along the interpolated path. This suggests that each
optimization algorithm found a different critical point, despite starting at the same initialization. We
investigated the space between other triplets and quadruples of weight vectors (Figures 2 and 3), and
even in these projections of the loss function, we still see that the local minima returned by different
algorithms are separated by regions of high loss.
Deep networks are overparameterized. For example, if we switch all corresponding weights for a
pair of nodes in our network, we will obtain effectively the same network, with both the original
and permuted networks outputting the same prediction for a given input. To ensure that the weight
vectors returned by the different algorithms were functionally different, we compared the outputs of
the networks on each example in a validation data set:
$$\mathrm{dist}(\theta_1, \theta_2) = \sqrt{\frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \left\| F(x_i, \theta_1) - F(x_i, \theta_2) \right\|^2},$$
where $\theta_1$ and $\theta_2$ are the weights learned by two different optimization algorithms, $x_i$ is the input for
a validation example, and $F(x, \theta)$ is the output of the network for weights $\theta$ on input $x$.
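The functional distance and the permutation argument can be checked numerically. The sketch below assumes the distance is the root-mean-square of the per-example output differences, and uses a hypothetical one-hidden-layer ReLU network (`mlp`) with random weights as a stand-in for F; none of these names or sizes come from our experiments.

```python
import numpy as np

def functional_distance(outputs_1, outputs_2):
    """Root-mean-square distance between two networks' outputs on the same N examples."""
    diff = np.asarray(outputs_1) - np.asarray(outputs_2)   # shape (N, num_outputs)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

def mlp(x, w1, w2):
    """Hypothetical one-hidden-layer ReLU network standing in for F(x, theta)."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))                       # 32 stand-in validation inputs
w1, w2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 3))

# Relabel the 8 hidden units: permute the columns of w1 and the rows of w2 together.
perm = rng.permutation(8)
w1_perm, w2_perm = w1[:, perm], w2[perm, :]

same = functional_distance(mlp(x, w1, w2), mlp(x, w1_perm, w2_perm))
other = functional_distance(mlp(x, w1, w2), mlp(x, rng.normal(size=(5, 8)), w2))
```

The permuted network computes the same function, so `same` is zero up to floating-point error even though the weight matrices are arranged differently, whereas an independently drawn first layer gives a clearly non-zero `other`. This is why a distance in output space, rather than in weight space, is used to decide whether two solutions are functionally different.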
[Table 1 graphic. Rows and columns are indexed by optimization method: SGD, Adadelta, RMSprop, Adam,
RK2, Adam&RK2. Lower-triangle panels: loss surfaces whose corners are labeled Init and the two
methods of the pair. Upper-triangle panels: curves with a Distance axis, with endpoints labeled by
the two methods of the pair.]
Table 1: Visualization of the loss surface near and between local minima found by different optimization
methods. Each box corresponds to a pair of optimization methods. In the lower triangle,
we plot the projection of the loss surface at weight vectors between the initial weight and the learned
weights found by the two optimization methods. Color as well as height of the surface indicate the
loss function value. In the upper triangle, we plot the functional difference between the network
corresponding to the learned weights for the first algorithm and networks corresponding to weights
linearly interpolated between the first and second algorithm’s learned weights. (Best viewed in
zoom)