scispace - formally typeset
Open access · Journal Article · DOI: 10.1073/PNAS.2015617118

The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima

02 Mar 2021 · Proceedings of the National Academy of Sciences of the United States of America (National Academy of Sciences) · Vol. 118, Iss. 9
Abstract: Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
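
To make the measurement concrete, here is a minimal sketch (not the authors' implementation) of how one might probe the variance-flatness relation from an SGD weight trajectory: record flattened weights during training, compute PCA directions, and compare the weight variance in each direction with the width of the low-loss region along that direction. The loss callable and the flatness threshold below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): probe the inverse variance-flatness
# relation along PCA directions of an SGD weight trajectory.
# `loss_of_weights` (flat weight vector -> scalar training loss) is a
# hypothetical callable supplied by the user.
import numpy as np

def pca_of_trajectory(weight_traj):
    """weight_traj: (T, D) array of flattened weights sampled during SGD.
    Returns variances sigma_i^2 and unit directions v_i, largest variance first."""
    centered = weight_traj - weight_traj.mean(axis=0)
    cov = centered.T @ centered / (len(weight_traj) - 1)
    variances, directions = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return variances[::-1], directions[:, ::-1]

def flatness_along(loss_of_weights, w_star, direction, span=1.0, n_pts=201, ratio=2.0):
    """One possible flatness proxy: width of the interval around w_star where the
    1-D loss profile along `direction` stays below `ratio` times its minimum."""
    thetas = np.linspace(-span, span, n_pts)
    profile = np.array([loss_of_weights(w_star + t * direction) for t in thetas])
    low = thetas[profile <= ratio * profile.min()]
    return low.max() - low.min()
```

Plotting log variance against log flatness over all PCA directions would then expose the inverse relation described in the abstract.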


Citations

11 results found


Open access · Posted Content
04 Jun 2021 · arXiv: Learning
Abstract: The representation of functions by artificial neural networks depends on a large number of parameters in a non-linear fashion. Suitable parameters are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous-time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function, in a sense which is different from the flat-minimum selection of continuous-time SGD with homogeneous noise.
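
For orientation, a generic continuous-time model of SGD with state-dependent (inhomogeneous) noise is sketched below; this is only a common textbook form, and the exact 'machine learning scaling' used in the cited work may differ in its prefactors.

```latex
% Generic continuous-time SGD model (a sketch; the cited paper's scaling may differ):
% the learning rate \eta sets the noise amplitude and the minibatch-gradient
% covariance \Sigma(\theta) makes the noise state-dependent rather than homogeneous.
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\,dW_t
```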


Topics: Stochastic gradient descent (65%), Artificial neural network (54%), Noise (54%)

3 Citations


Open access · Journal Article · DOI: 10.1073/PNAS.2110239118
Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine-learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide a connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.
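
As a rough illustration of one ingredient of this analysis, the sketch below (an assumption for illustration, not the authors' code) projects a network displacement onto vibrational normal modes obtained from the Hessian of the mechanical energy; the Hessian and displacement arrays are hypothetical placeholders.

```python
# Sketch: vibrational normal modes from the Hessian of the mechanical energy,
# and the projection of a filament-network displacement onto those modes.
# `hessian` ((3N, 3N)) and `displacement` ((3N,)) are hypothetical inputs.
import numpy as np

def normal_mode_projection(hessian, displacement):
    eigvals, eigvecs = np.linalg.eigh(hessian)   # mode stiffnesses and mode shapes
    amplitudes = eigvecs.T @ displacement        # how much of the motion lies in each mode
    return eigvals, amplitudes
```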


Topics: Active matter (53%), Mechanical energy (50%)

2 Citations


Open access · Posted Content
01 Oct 2021 · arXiv: Learning
Abstract: Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that escape the bias-variance predictions of statistical learning and pose conceptual challenges for non-convex optimization. In this paper, we use methods from statistical physics of disordered systems to analytically study the computational fallout of overparameterization in nonconvex neural network models. As the number of connection weights increases, we follow the changes of the geometrical structure of different minima of the error loss function and relate them to learning and generalisation performance. We find that there exists a gap between the SAT/UNSAT interpolation transition where solutions begin to exist and the point where algorithms start to find solutions, i.e. where accessible solutions appear. This second phase transition coincides with the discontinuous appearance of atypical solutions that are locally extremely entropic, i.e., flat regions of the weight space that are particularly solution-dense and have good generalization properties. Although exponentially rare compared to typical solutions (which are narrower and extremely difficult to sample), entropic solutions are accessible to the algorithms used in learning. We can characterize the generalization error of different solutions and optimize the Bayesian prediction for data generated from a structurally different network. Numerical tests on observables suggested by the theory confirm that the scenario extends to realistic deep networks.
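
A crude way to probe whether a trained solution sits in a flat, 'entropic' region is to measure how the loss responds to random weight perturbations. The sketch below is an illustrative assumption, not the authors' method; `loss_of_weights` is a hypothetical callable.

```python
# Sketch: compare minima by the average loss increase under random Gaussian
# weight perturbations; flatter (more solution-dense) regions show a smaller increase.
import numpy as np

def perturbation_flatness(loss_of_weights, weights, sigma=0.01, n_samples=20, seed=0):
    rng = np.random.default_rng(seed)
    base = loss_of_weights(weights)
    perturbed = np.mean([loss_of_weights(weights + sigma * rng.standard_normal(weights.shape))
                         for _ in range(n_samples)])
    return float(perturbed - base)   # smaller value -> flatter neighborhood
```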


1 Citation


Open access · Posted Content
Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide the first connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.


Topics: Mechanical energy (50%)


References

42 results found


Journal Article · DOI: 10.1126/SCIENCE.220.4598.671
13 May 1983 · Science
Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.
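
To make the annealing analogy concrete, here is a minimal, generic simulated-annealing loop (an illustrative sketch; the cost function and proposal move are placeholders supplied by the user).

```python
# Minimal simulated annealing sketch in the spirit of the annealing analogy:
# accept uphill moves with Boltzmann probability at a slowly decreasing temperature.
import math
import random

def simulated_annealing(cost, propose, x0, t0=1.0, cooling=0.995, n_steps=10000):
    x, e = x0, cost(x0)
    t = t0
    for _ in range(n_steps):
        x_new = propose(x)
        e_new = cost(x_new)
        # Metropolis criterion: always accept improvements, sometimes accept worse moves
        if e_new < e or random.random() < math.exp(-(e_new - e) / max(t, 1e-12)):
            x, e = x_new, e_new
        t *= cooling  # anneal the temperature toward zero
    return x, e
```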


Topics: Optimization problem (61%), Continuous optimization (61%), Extremal optimization (59%)

38,868 Citations


Journal Article · DOI: 10.1038/NATURE14539
Yann LeCun, Yoshua Bengio, Geoffrey E. Hinton
28 May 2015 · Nature
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
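
As a self-contained illustration of backpropagation (not code from the paper), the toy example below trains a two-layer network by propagating the output error back through each layer to obtain that layer's parameter update; the data and dimensions are arbitrary.

```python
# Illustrative sketch: backpropagation through a tiny two-layer network, showing
# how each layer's internal parameters receive a gradient computed from the layer above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))             # toy inputs
y = rng.standard_normal((32, 1))             # toy regression targets
W1 = 0.1 * rng.standard_normal((4, 8))
W2 = 0.1 * rng.standard_normal((8, 1))

for step in range(100):
    # forward pass
    h = np.tanh(X @ W1)
    pred = h @ W2
    loss = ((pred - y) ** 2).mean()
    # backward pass: propagate the error from the output layer to the first layer
    g_pred = 2 * (pred - y) / len(y)
    g_W2 = h.T @ g_pred
    g_h = g_pred @ W2.T
    g_W1 = X.T @ (g_h * (1 - h ** 2))        # chain rule through the tanh nonlinearity
    # gradient-descent update of each layer's parameters
    W1 -= 0.1 * g_W1
    W2 -= 0.1 * g_W2
```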


33,931 Citations


Open access · Journal Article · DOI: 10.1214/AOMS/1177729586
Herbert Robbins, Sutton Monro
Abstract: Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant. We give a method for making successive experiments at levels x_1, x_2, ... in such a way that x_n will tend to θ in probability.
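
A minimal sketch of the iteration described above (the target function and noise model here are illustrative): step x_n against the observed excess of the noisy response over α, with a gain that decreases as 1/n.

```python
# Sketch of the Robbins-Monro stochastic approximation scheme.
# `observe(x)` returns a noisy measurement of M(x); M and the noise are placeholders.
import numpy as np

def robbins_monro(observe, alpha, x0=0.0, a=1.0, n_steps=5000):
    x = x0
    for n in range(1, n_steps + 1):
        y = observe(x)
        x -= (a / n) * (y - alpha)   # move opposite the observed excess over alpha
    return x

# Example: M(x) = 2x + 1 observed with Gaussian noise; the root of M(x) = 0 is x = -0.5.
rng = np.random.default_rng(1)
estimate = robbins_monro(lambda x: 2 * x + 1 + 0.5 * rng.standard_normal(), alpha=0.0)
```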


7,621 Citations


Open access · Posted Content
06 Nov 2014 · arXiv: Learning
Abstract: Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
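
A minimal PyTorch-style sketch of the kind of setup described (an assumed arrangement, not the paper's code): transfer the first layers of a pretrained network, freeze them, and fine-tune the remaining layers plus a new head on the target task. `pretrained_layers` and `new_head` are hypothetical modules.

```python
# Sketch: reuse the first `n_frozen` layers of a pretrained network as fixed,
# "general" features and fine-tune the rest on a new task.
import torch.nn as nn

def build_transfer_model(pretrained_layers, new_head, n_frozen):
    layers = list(pretrained_layers)
    for layer in layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False          # transferred features stay fixed
    return nn.Sequential(*layers, new_head)  # only unfrozen layers and the head are trained
```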


4,661 Citations


Open access · Book Chapter · DOI: 10.1007/978-3-7908-2604-3_16
01 Jan 2010
Abstract: During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the cases of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
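
As one concrete instance, averaged SGD (Polyak-Ruppert averaging) simply keeps a running mean of the SGD iterates over a single pass through the data. The sketch below assumes a user-supplied minibatch gradient oracle and data stream; it is illustrative, not the chapter's code.

```python
# Sketch of averaged SGD: plain SGD steps plus a running average of the iterates,
# which is returned as the final solution.
import numpy as np

def averaged_sgd(grad, batches, w0, lr=0.01):
    """grad(w, batch): minibatch gradient at w; batches: iterable over one data pass."""
    w = np.asarray(w0, dtype=float)
    w_bar = w.copy()
    for t, batch in enumerate(batches, start=1):
        w = w - lr * grad(w, batch)      # ordinary SGD step on one minibatch
        w_bar += (w - w_bar) / t         # running average of all iterates so far
    return w_bar
```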


Topics: Stochastic gradient descent (72%), Gradient method (69%), Gradient descent (65%)

4,576 Citations


Performance Metrics
No. of citations received by the paper in previous years:
Year | Citations
2021 | 9
2020 | 2
Network Information
Related Papers (5)
Stochastic Gradient Descent with Large Learning Rate (07 Dec 2020) · Kangqiao Liu, Ziyin Liu +1 more

Flat minima (01 Jan 1997) · Sepp Hochreiter, Jürgen Schmidhuber