An Empirical Analysis of Deep Network Loss Surfaces

Under review as a conference paper at ICLR 2017

AN EMPIRICAL ANALYSIS OF DEEP NETWORK LOSS SURFACES

Daniel Jiwoong Im (1,2,∗), Michael Tao (3), & Kristin Branson (2)
(1) Janelia Research Campus, HHMI, (2) AIFounded Inc., (3) University of Toronto
{imd, bransonk}@janelia.hhmi.org
{mtao}@dgp.toronto.edu

ABSTRACT

The training of deep neural networks is a high-dimensional optimization problem with respect to the loss function of a model. Unfortunately, these loss functions are non-convex as well as high-dimensional, and hence difficult to characterize. In this paper, we empirically investigate the geometry of the loss functions for state-of-the-art networks with multiple stochastic optimization methods. We do this through several experiments that are visualized on polygons to understand how and when these stochastic optimization methods find local minima.

1 INTRODUCTION

Deep neural networks are trained by optimizing an extremely high-dimensional loss function with

respect to the weights of the network’s linear layers. The objective function minimized is some

measure of the error of the network’s predictions based on these weights compared to training data.

This loss function is non-convex and has many local minima. These loss functions are usually

minimized using ﬁrst-order gradient descent (Robbins & Monro, 1951; Polyak, 1964) algorithms

such as stochastic gradient descent (SGD) (Bottou, 1991). The success of deep learning critically

depends on how well we can minimize this loss function, both in terms of the quality of the local

minima found and the time to ﬁnd them. Understanding the geometry of this loss function and how

well optimization algorithms can ﬁnd good local minima is thus of vital importance.

Several works have theoretically analyzed and characterized the geometry of deep network loss

functions. However, to make these analyses tractable, they have relied on simplifications of the

network structures, including that the networks are linear (Saxe et al., 2014), or assuming the path

and variable independence of the neural networks (Choromanska et al., 2015). Orthogonally, the

performance of various gradient descent algorithms has been theoretically characterized (Nesterov,

1983). Again, these analyses make simplifying assumptions, in particular that the loss function is

strictly convex, i.e. there is only a single local minimum.

In this work, we empirically investigated the geometry of the real loss functions for state-of-the-art

networks and data sets. In addition, we investigated how popular optimization algorithms interact

with these real loss surfaces. To do this, we plotted low-dimensional projections of the loss function

in subspaces chosen to investigate properties of the local minima selected by different algorithms.

We chose these subspaces to address the following questions:

• What types of changes to the optimization procedure result in different local minima?

• Do different optimization algorithms ﬁnd qualitatively different types of local minima?

2 RELATED WORK

2.1 LOSS SURFACES

There have been several attempts to understand the loss surfaces of deep neural networks. Some have studied the critical points of deep linear neural networks (Baldi, 1989; Baldi & Hornik, 1989; Baldi & Lu, 2012). Others further investigated the learning dynamics of deep linear neural networks (Saxe et al., 2014). More recently, several others have attempted to study the loss surfaces of deep non-linear neural networks (Choromanska et al., 2015; Kawaguchi, 2016; Soudry & Carmon, 2016).

∗ Work done during an internship at Janelia Research Campus

One approach is to treat the states of neurons as analogous to the magnetic dipoles used in spherical spin-glass Ising models from statistical physics (Parisi, 2016; Fyodorov & Williams, 2007; Bray & Dean, 2007). Choromanska et al. (2015) attempted to understand the loss function of neural networks through studying the random Gaussian error functions of Ising models. Recent results (Kawaguchi, 2016; Soudry & Carmon, 2016) have provided cursory evidence in agreement with the theory provided by Choromanska et al. (2015), in that they found that there are no “poor” local minima in neural networks, albeit still under strong assumptions.

There is some potential disconnect between these theoretical results and what is found in practice due to several strong assumptions, such as the activations of the hidden units and the output being independent of the previous hidden units and the input data. The work of Dauphin et al. (2014) empirically investigated properties of the critical points of neural network loss functions and demonstrated that their critical points behave similarly to the critical points of random Gaussian error functions in high-dimensional space. We present further evidence along these lines.

2.2 OPTIMIZATION

In practice, the local minima of deep network loss functions are for the most part decent. This

implies that we probably do not need to take many precautions to avoid bad local minima in practice.

If all local minima are decent, then the task of ﬁnding a decent local minimum quickly is reduced to

the task of finding any local minimum quickly. From an optimization perspective this implies that focusing solely on designing fast methods is of key importance for training deep networks.

In the literature, the common way to measure the performance of optimization methods is to analyze them on nice convex quadratic functions (Polyak, 1964; Broyden, 1970; Nesterov, 1983; Martens, 2010; Erdogdu & Montanari, 2015), even though the methods are applied to non-convex problems. For non-convex problems, however, if two methods converge to different local minima, their performance will be dictated by how those methods solve those two convex subproblems. It is challenging to show that one method will beat another without knowledge of the sort of convex subproblems encountered, which is generally not known a priori. What we will explore is whether there are indeed some characteristics that can be found experimentally. If so, perhaps one could validate where these analytical results hold or even improve methods for training neural networks.

2.2.1 LEARNING PHASES

Figure 1: An example of a learning curve of a neural network. Plot annotations: “Fast decaying” and “Slowly decaying”.

One interesting empirical observation is that the incremental improvement of optimization methods often decreases rapidly, even in non-convex problems. This behavior has been discussed as a “transient” phase followed by a “minimization” phase (Sutskever et al., 2013), where the former finds the neighborhood of a decent local minimum and the latter finds the local minimum within that neighborhood. The existence of these phases implies that if certain methods are better at different phases, one could create novel methods that schedule when to apply each method.

3 EXPERIMENTAL SETUP AND TOOLS

3.1 NETWORK ARCHITECTURES AND DATA SETS

We conducted experiments on three state-of-the-art neural network architectures. Network-in-Network (NIN) (Lin et al., 2014) and the VGG (Simonyan & Zisserman, 2015) network are feed-forward convolutional networks developed for image classification, and have excellent performance on the ImageNet (Russakovsky et al., 2014) and CIFAR10 (Krizhevsky, 2009) data sets. The long short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997) is a recurrent neural network that has been successful in tasks that take variable-length sequences as input and/or produce variable-length sequences as output, such as speech recognition and image caption generation. These are large networks currently used in many machine vision and learning tasks, and the loss functions minimized by each are highly non-convex.

All results using the feed-forward convolutional networks (NIN and VGG) are on the CIFAR10

image classiﬁcation data set, while the LSTM was tested on the Penn Treebank next-word prediction

data set.

3.2 OPTIMIZATION METHODS

We analyzed the performance of five popular gradient-descent optimization methods for these learning frameworks: stochastic gradient descent (SGD) (Robbins & Monro, 1951), stochastic gradient

descent with momentum (SGDM), RMSprop (Tieleman & Hinton, 2012), Adadelta (Zeiler et al.,

2011), and ADAM (Kingma & Ba, 2014). These are all ﬁrst-order gradient descent algorithms that

estimate the gradients based on randomly-grouped minibatches of training examples. One of the

major differences between these algorithms is how they select the weight-update step-size at each

iteration, with SGD and SGDM using ﬁxed schedules, and RMSprop, Adadelta, and ADAM using

adaptive, per-parameter step-sizes. Details are provided in Section A.2.
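To make the contrast between a fixed step size and adaptive, per-parameter step sizes concrete, the following is a minimal NumPy sketch of one SGD-with-momentum update and one ADAM update. It follows the textbook forms of these rules; the function names and default hyperparameters are illustrative assumptions, not the settings or code used in our experiments.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update: a single global step size lr for all weights."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update (t counts from 1): each weight gets its own effective step."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

On the first ADAM step the update to every weight has magnitude close to lr regardless of the gradient scale, whereas the momentum update scales directly with the gradient.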

In addition to these five existing optimization methods, we compare to a new gradient descent method we developed based on the family of Runge-Kutta integrators. In our experiments, we tested a second-order Runge-Kutta integrator in combination with SGD (RK2) and in combination with ADAM (ADAM&RK2). Details are provided in Section A.3.
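As a rough illustration of how a second-order Runge-Kutta integrator can be turned into a descent step, the sketch below applies Heun's method (one standard second-order scheme) to the gradient flow dθ/dt = −∇L(θ). This is an assumption made for illustration: the exact variant is specified in Section A.3, and `rk2_sgd_step` and `grad_fn` are hypothetical names. In a stochastic setting `grad_fn` would return a minibatch gradient.

```python
import numpy as np

def rk2_sgd_step(theta, grad_fn, lr):
    """One Heun (second-order Runge-Kutta) step on the gradient flow.

    Uses two gradient evaluations per update: one at the current point and
    one at the point a plain gradient step would reach.
    """
    g1 = grad_fn(theta)             # gradient at the current weights
    g2 = grad_fn(theta - lr * g1)   # gradient at the Euler-predicted weights
    return theta - 0.5 * lr * (g1 + g2)

# Toy check on L(theta) = 0.5 * theta^2, whose gradient is theta itself.
theta_next = rk2_sgd_step(np.array([1.0]), lambda th: th, lr=0.1)
```

For this quadratic the step gives 1 − lr + lr²/2, which matches the second-order Taylor expansion of the exact gradient-flow solution exp(−lr).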

3.3 ANALYSIS METHODS

Several of our empirical analyses are based on the technique of Goodfellow et al. (2015). They visualize the loss function by projecting it down to one carefully chosen dimension and plotting its value at a set of samples along this dimension. The projection space is chosen based on important weight configurations: they plot the value of the loss function at linear interpolations between two weight configurations. They perform two such analyses: one in which they interpolate between the initialization weights and the final learned weights, and one in which they interpolate between two sets of final weights, each learned from a different initialization.
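The one-dimensional analysis described above can be sketched as follows. This is a minimal NumPy illustration in which `toy_loss` is a hypothetical stand-in for a network's training loss and the weight vectors are placeholders; it is not the code used for our figures.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, alphas):
    """Evaluate loss_fn at (1 - alpha) * theta_a + alpha * theta_b for each alpha."""
    return np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])

def toy_loss(theta):
    """Hypothetical loss: a quadratic bowl with its minimum at the all-ones vector."""
    return float(np.sum((theta - 1.0) ** 2))

theta_init = np.zeros(3)             # plays the role of the initial weights
theta_final = np.ones(3)             # plays the role of the learned weights
alphas = np.linspace(0.0, 1.5, 7)    # alpha > 1 extrapolates beyond the end point
losses = interpolate_loss(toy_loss, theta_init, theta_final, alphas)
```

Values of alpha greater than 1 probe the loss beyond the learned weights, which is how one can check that the loss rises again on the far side of a candidate minimum along this direction.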

In this work, we use a similar visualization technique, but choose different low-dimensional subspaces for the projection of the loss function. These subspaces are based on the initial weights as well as the final weights learned using the different optimization algorithms and combinations of them, and are chosen to answer a variety of questions about the loss function and how the different optimization algorithms interact with it. In contrast, Goodfellow et al. only looked at SGDM. In addition, we explore the use of two-dimensional projections of the loss function, allowing us to better visualize the space between local minima. We do this via barycentric and bilinear interpolation for triplets and quartets of points, respectively (details in Section A.1).
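A minimal sketch of the two-dimensional interpolations is given below, assuming the standard definitions of barycentric and bilinear interpolation; the precise construction we use is given in Section A.1, and all function names here are hypothetical.

```python
import numpy as np

def barycentric_point(thetas, weights):
    """Convex combination of three weight vectors; weights are >= 0 and sum to 1."""
    w = np.asarray(weights, dtype=float)
    return w[0] * thetas[0] + w[1] * thetas[1] + w[2] * thetas[2]

def bilinear_point(thetas, u, v):
    """Bilinear blend of four weight vectors at the corners of the unit square."""
    t00, t10, t01, t11 = thetas
    return ((1 - u) * (1 - v) * t00 + u * (1 - v) * t10
            + (1 - u) * v * t01 + u * v * t11)

def triangle_grid(n):
    """All barycentric weight triples (i, j, k) / n with i + j + k = n."""
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]
```

Evaluating the loss at `barycentric_point(thetas, w)` for every `w` in `triangle_grid(n)` yields a surface over the triangle spanned by three weight vectors, for example the initial weights and the final weights of two algorithms.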

We refer to the critical points found using these variants of SGD, for which the gradient is approximately 0, as local minima. Our evidence that these are local minima as opposed to saddle points is similar to that presented in Goodfellow et al. (2015). If we interpolate beyond the critical point, in this one-dimensional projection, the loss increases (Fig. 10).

Figure 2: Visualization of the loss surface at weights interpolated between two initial configurations and the final weight vectors learned using SGD from these initializations. Panels: (a) NIN, (b) VGG. Plot labels: Initial Config., SGD.

Figure 3: Visualization of the loss surface at weights interpolated between the weights learned by four different algorithms from the same initialization. Panels: (a) NIN, (b) VGG. Plot labels: SGD, RK2, ADAM, ADAM&RK2.

3.4 TECHNICAL DETAILS

We used the VGG and NIN implementations from https://github.com/szagoruyko/cifar.torch.git.

The batch size was set to 128 and the number of epochs was set to 200. The learning rate was chosen from the set [0.2, 0.1, 0.05, 0.01] for SGD and [0.002, 0.001, 0.0005, 0.0001] for adaptive learning methods. We doubled the learning rates when we ran our augmented versions with Runge-Kutta because they required two stochastic gradient computations per epoch. We used batch normalization and dropout to regularize our networks. All experiments were run on a 6-core Intel(R) Xeon(R) CPU @ 2.40GHz with a TITAN X.

4 EXPERIMENTAL RESULTS

4.1 DIFFERENT OPTIMIZATION METHODS FIND DIFFERENT LOCAL MINIMA

We trained the neural networks described above using each optimization method starting from the

same initial weights and with the same minibatching. We computed the value of the loss function

for weight vectors interpolated between the initial weights, the ﬁnal weights for one algorithm, and

the ﬁnal weights for a second algorithm for several pairings of algorithms. The results are shown in

the lower triangle of Table 1.

For every pair of optimization algorithms, we observe that the training loss between the final weights for different algorithms shows a sharp increase along the interpolated path. This suggests that each optimization algorithm found a different critical point, despite starting from the same initialization. We investigated the space between other triplets and quadruples of weight vectors (Figures 2 and 3), and even in these projections of the loss function, we still see that the local minima returned by different algorithms are separated by regions of high loss.

Deep networks are overparameterized. For example, if we switch all corresponding weights for a

pair of nodes in our network, we will obtain effectively the same network, with both the original

and permuted networks outputting the same prediction for a given input. To ensure that the weight

vectors returned by the different algorithms were functionally different, we compared the outputs of

the networks on each example in a validation data set:

$$\mathrm{dist}(\theta_1, \theta_2) = \sqrt{\frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \left\| F(x_i, \theta_1) - F(x_i, \theta_2) \right\|^2},$$

where $\theta_1$ and $\theta_2$ are the weights learned by two different optimization algorithms, $x_i$ is the input for a validation example, and $F(x, \theta)$ is the output of the network for weights $\theta$ on input $x$.
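The functional distance above can be computed as in the following sketch. The one-hidden-layer network `mlp_forward` and the random weights are hypothetical and serve only to illustrate the permutation argument: swapping two hidden units changes the weight vector but leaves the function, and hence the distance, unchanged.

```python
import numpy as np

def functional_distance(forward, theta_1, theta_2, inputs):
    """Root-mean-square difference between two networks' outputs over a set of inputs."""
    diffs = np.array([forward(x, theta_1) - forward(x, theta_2) for x in inputs])
    return float(np.sqrt(np.mean(np.sum(diffs ** 2, axis=1))))

def mlp_forward(x, theta):
    """Hypothetical one-hidden-layer network used only for illustration."""
    w1, w2 = theta
    return w2 @ np.tanh(w1 @ x)

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
perm = [1, 0, 2, 3]                      # swap hidden units 0 and 1
theta_a = (w1, w2)
theta_b = (w1[perm, :], w2[:, perm])     # different weights, same function
inputs = rng.normal(size=(5, 3))

d_same = functional_distance(mlp_forward, theta_a, theta_b, inputs)
d_diff = functional_distance(mlp_forward, theta_a, (2 * w1, w2), inputs)
```

Here `d_same` is zero up to floating-point error even though `theta_a` and `theta_b` differ as weight vectors, while `d_diff` is strictly positive because rescaling the first layer changes the network's outputs.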

[Table 1 is a grid of plots whose rows and columns are indexed by the optimization methods SGD, Adadelta, RMSprop, Adam, RK2, and Adam&RK2. Panels recoverable from the extracted labels include the pairs RK2/SGD, SGD/Adam, Adam&RK2/RK2, Adam&RK2/Adam, SGD/RMSprop, and Adam/RMSprop, each with an axis labeled “Distance”. The surface plots carry the point labels Init, Adam, SGD, RK2, Adam&RK2, Adadelta, and RMSprop.]

Table 1: Visualization of the loss surface near and between local minima found by different optimization methods. Each box corresponds to a pair of optimization methods. In the lower triangle, we plot the projection of the loss surface at weight vectors between the initial weights and the learned weights found by the two optimization methods. Color as well as height of the surface indicate the loss function value. In the upper triangle, we plot the functional difference between the network corresponding to the learned weights for the first algorithm and networks corresponding to weights linearly interpolated between the first and second algorithm’s learned weights. (Best viewed zoomed in.)
