Journal ArticleDOI

The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima

Yu Feng, Yuhai Tu
02 Mar 2021 - Proceedings of the National Academy of Sciences of the United States of America - Vol. 118, Iss. 9
TL;DR: In this article, the authors investigate the connection between SGD learning dynamics and the loss function landscape and find a robust inverse relation between weight variance and landscape flatness, the opposite of the fluctuation-response (Einstein) relation in equilibrium statistical physics; the results suggest that SGD serves as a landscape-dependent annealing algorithm that prefers flat minima.
Abstract: Despite tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite to the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
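
The measurement protocol sketched in the abstract can be made concrete in a few lines: run minibatch SGD past convergence, record the weight trajectory, apply PCA to the weight fluctuations, and compare the variance along each PCA direction with the flatness of the loss profile in that direction. The snippet below is a hypothetical, minimal illustration on a toy logistic-regression problem; the flatness measure used here (the width of the interval over which the loss rises by less than a fixed amount) is a stand-in for the paper's exact definition.

```python
# Minimal sketch: measure weight variance and loss flatness along PCA directions
# of an SGD trajectory. Toy logistic regression; not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) + 0.5 * rng.normal(size=500) > 0).astype(float)

def loss(w, Xb, yb):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))

def grad(w, Xb, yb):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

w = np.zeros(10)
lr, batch = 0.5, 25
traj = []
for step in range(20000):
    idx = rng.integers(0, len(X), size=batch)
    w -= lr * grad(w, X[idx], y[idx])
    if step > 10000:                      # record only after reaching the minimum region
        traj.append(w.copy())
traj = np.array(traj)

# PCA of the weight fluctuations around the SGD solution.
w_bar = traj.mean(axis=0)
cov = np.cov((traj - w_bar).T)
var_pca, dirs = np.linalg.eigh(cov)       # eigenvalues = variance along each PCA direction

def flatness(direction, delta=0.1):
    """Width of the interval around w_bar where the full-batch loss rises by < delta."""
    base = loss(w_bar, X, y)
    thetas = np.linspace(-2, 2, 401)
    ok = [t for t in thetas if loss(w_bar + t * direction, X, y) - base < delta]
    return max(ok) - min(ok) if ok else 0.0

for i in range(10):
    print(f"PC{i}: variance = {var_pca[i]:.2e}, flatness = {flatness(dirs[:, i]):.2f}")
```

Note that in a simple convex toy like this, the measured relation need not match the inverse variance-flatness relation reported for deep networks; the snippet only shows how the two quantities are extracted from a trajectory.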
Citations
Journal ArticleDOI
18 Jul 2022
TL;DR: In this paper, an electrical network made of identical resistive edges is proposed that self-adjusts based on local conditions in order to minimize an energy-based global cost function when shown training examples.
Abstract: Leveraging physical processes rather than a central processor is key to building machine learning systems that are massively scalable, robust to damage, and energy-efficient, like the brain. To achieve these features, the authors build an electrical network made of identical resistive edges that self-adjust based on local conditions in order to minimize an energy-based global cost function when shown training examples. Problems like regression and data classification are successfully solved by this network. Due to their energy efficiency and scaling advantages, future versions may one day compete with computational neural networks.

19 citations
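
The entry above describes the learning architecture only qualitatively. The toy below is a hypothetical simulation sketch of a contrastive, coupled-learning-style update on a small resistor network, in which each edge conductance is nudged using only the voltage drop across it in a free state and a weakly clamped state; it is an illustration of the idea, not the authors' physical circuit or exact update rule.

```python
# Toy simulation sketch of local, contrastive learning in a resistor network.
# The free/clamped update rule follows the coupled-learning idea; details are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 6
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]   # complete graph: always connected
k = rng.uniform(0.5, 1.5, size=len(edges))                    # edge conductances

def node_voltages(k, fixed):
    """Solve Kirchhoff's current law with the node voltages in `fixed` held constant."""
    L = np.zeros((n, n))                                      # weighted graph Laplacian
    for (i, j), kij in zip(edges, k):
        L[i, i] += kij; L[j, j] += kij
        L[i, j] -= kij; L[j, i] -= kij
    clamped = list(fixed)
    free = [u for u in range(n) if u not in fixed]
    V = np.zeros(n)
    V[clamped] = list(fixed.values())
    V[free] = np.linalg.solve(L[np.ix_(free, free)],
                              -L[np.ix_(free, clamped)] @ V[clamped])
    return V

src, gnd, out = 0, 1, n - 1         # input, ground, and output nodes (arbitrary choice)
V_in, V_target = 1.0, 0.3           # training example: 1 V in -> 0.3 V out
eta, lr = 0.05, 0.05                # nudge amplitude and learning rate

for step in range(500):
    Vf = node_voltages(k, {src: V_in, gnd: 0.0})                       # free state
    nudge = Vf[out] + eta * (V_target - Vf[out])
    Vc = node_voltages(k, {src: V_in, gnd: 0.0, out: nudge})           # weakly clamped state
    for e, (i, j) in enumerate(edges):
        dF, dC = Vf[i] - Vf[j], Vc[i] - Vc[j]
        k[e] = max(k[e] + (lr / eta) * 0.5 * (dF**2 - dC**2), 1e-3)    # local update, keep k > 0

print("output voltage after training:", node_voltages(k, {src: V_in, gnd: 0.0})[out])
```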

Journal ArticleDOI
TL;DR: It is shown that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation and, in experiment, actually improves performance by allowing the system to better explore the discretized state space of solutions.
Abstract: In a neuron network, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network are typically updated simultaneously using a central processor. Here, we investigate the feasibility and effect of desynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade the performance for a variety of tasks in an idealized simulation. In experiment, desynchronization actually improves the performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between desynchronization and mini-batching in stochastic gradient descent and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.

11 citations
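
The analogy drawn in the abstract between desynchronized local updates and mini-batching can be illustrated with a toy gradient-descent experiment: instead of updating all parameters simultaneously from one gradient, update a single randomly chosen coordinate at a time from a freshly computed gradient. The sketch below is a hypothetical illustration of that analogy on a least-squares problem, not the physics-driven network studied in the paper.

```python
# Toy comparison: synchronous full-parameter updates vs. desynchronized,
# one-coordinate-at-a-time updates on a least-squares loss.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 20))
b = A @ rng.normal(size=20) + 0.1 * rng.normal(size=200)

def loss(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def grad(w):
    return A.T @ (A @ w - b) / len(b)

lr, steps = 0.05, 2000

# Synchronous: every parameter moves at once, like a centrally clocked update.
w_sync = np.zeros(20)
for _ in range(steps):
    w_sync -= lr * grad(w_sync)

# Desynchronized: each update touches a single randomly chosen parameter,
# using only that component of the current gradient (a noisier, local-looking rule).
w_desync = np.zeros(20)
for _ in range(steps * 20):                 # same total number of coordinate updates
    i = rng.integers(20)
    w_desync[i] -= lr * grad(w_desync)[i]

print("synchronous loss:   ", loss(w_sync))
print("desynchronized loss:", loss(w_desync))
```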

Journal ArticleDOI
TL;DR: In this article, the authors use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy and find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation.
Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine-learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide a connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.

10 citations
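
One analysis step mentioned in the abstract, projecting the network motion onto vibrational normal modes, can be made concrete with a small example: build the Hessian (dynamical matrix) of a harmonic spring network, diagonalize it, and project a displacement field onto the resulting mode basis. The sketch below does this for a hypothetical 2D spring network, not the agent-based cytoskeletal model used in the paper.

```python
# Sketch: vibrational normal modes of a small 2D harmonic spring network,
# and projection of a displacement field onto those modes.
import numpy as np

rng = np.random.default_rng(3)
pos = rng.uniform(0, 1, size=(8, 2))                       # node positions
bonds = [(i, j) for i in range(8) for j in range(i + 1, 8)
         if np.linalg.norm(pos[i] - pos[j]) < 0.6]         # springs between nearby nodes
k_spring = 1.0

# Hessian of the harmonic energy at the rest state: each bond contributes
# a rank-one block along its bond direction.
H = np.zeros((16, 16))
for i, j in bonds:
    nvec = pos[i] - pos[j]
    nvec /= np.linalg.norm(nvec)
    block = k_spring * np.outer(nvec, nvec)
    for a, b, sign in [(i, i, 1), (j, j, 1), (i, j, -1), (j, i, -1)]:
        H[2*a:2*a+2, 2*b:2*b+2] += sign * block

eigvals, modes = np.linalg.eigh(H)          # columns of `modes` are the normal modes

# Project an arbitrary displacement of the network onto the mode basis.
displacement = rng.normal(scale=0.01, size=16)
projections = modes.T @ displacement
print("lowest mode frequencies^2:", np.round(eigvals[:4], 4))
print("largest-|projection| modes:", np.argsort(-np.abs(projections))[:3])
```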

Journal ArticleDOI
TL;DR: A standardized parameterization is developed in which all symmetries are removed, resulting in a toroidal topology; on this space it is shown that in function space flatter minima are closer to each other and that the barriers along the geodesic paths connecting them are small.
Abstract: We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness we study the mode connectivity and relative distances. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths composed of two straight lines in parameter space, i.e. polygonal chains with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.

6 citations
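
One concrete claim in the abstract, that independently found minimizers can be connected by a zero-error polygonal path with a single bend, suggests a simple check: evaluate the training error along the two straight segments joining minimizer A to a bend point and the bend point to minimizer B. The snippet below sketches that evaluation for a toy one-hidden-layer network; the bend point here is a naive midpoint-plus-offset placeholder, not the optimized bend used in the paper, so it will generally not reproduce the zero-error result.

```python
# Sketch: training error along a polygonal (two-segment) path in weight space
# connecting two independently trained minimizers of a tiny one-hidden-layer net.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (np.sin(X[:, 0]) + X[:, 1] > 0).astype(float)

def unpack(w):
    W1 = w[:80].reshape(5, 16); b1 = w[80:96]
    W2 = w[96:112];             b2 = w[112]
    return W1, b1, W2, b2

def error(w):
    W1, b1, W2, b2 = unpack(w)
    pred = (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(float)
    return np.mean(pred != y)

def train(seed, steps=3000, lr=0.1):
    r = np.random.default_rng(seed)
    w = 0.5 * r.normal(size=113)
    for _ in range(steps):
        # analytic gradient of the logistic loss through the small network
        W1, b1, W2, b2 = unpack(w)
        h = np.tanh(X @ W1 + b1)
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))
        dlogit = (p - y) / len(y)
        gW2, gb2 = h.T @ dlogit, dlogit.sum()
        dh = np.outer(dlogit, W2) * (1 - h**2)
        gW1, gb1 = X.T @ dh, dh.sum(axis=0)
        w -= lr * np.concatenate([gW1.ravel(), gb1, gW2, [gb2]])
    return w

w_A, w_B = train(0), train(1)
bend = 0.5 * (w_A + w_B) + 0.1 * rng.normal(size=w_A.size)   # placeholder bend point

path = [(1 - t) * w_A + t * bend for t in np.linspace(0, 1, 11)] + \
       [(1 - t) * bend + t * w_B for t in np.linspace(0, 1, 11)]
print("max training error along the polygonal path:", max(error(w) for w in path))
print("endpoint errors:", error(w_A), error(w_B))
```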

References
Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations
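
The abstract's central technical statement, that backpropagation indicates how a machine should change its internal parameters layer by layer, reduces to the chain rule. Below is a minimal, self-contained sketch of backpropagation through a two-layer network on synthetic data; it is a generic illustration, not tied to any specific architecture discussed in the review.

```python
# Minimal backpropagation sketch: a two-layer network trained on synthetic data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(256, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # simple nonlinear target

W1, b1 = 0.5 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for epoch in range(2000):
    # Forward pass: each layer computes its representation from the previous one.
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))

    # Backward pass: propagate the loss gradient layer by layer (chain rule).
    dz2 = (p - y) / len(y)                 # gradient of cross-entropy w.r.t. output logits
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T * (1 - h ** 2)         # gradient through the tanh hidden layer
    dW1, db1 = X.T @ dh, dh.sum(axis=0)

    # Gradient-descent update of the internal parameters.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", np.mean((p > 0.5) == y))
```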

Journal ArticleDOI
13 May 1983-Science
TL;DR: There is a deep and useful connection between statistical mechanics and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters), and a detailed analogy with annealing in solids provides a framework for optimization of very large and complex systems.
Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

41,772 citations
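
The annealing analogy described in the abstract maps directly onto the Metropolis algorithm run with a slowly decreasing temperature. The snippet below is a generic simulated-annealing sketch on a rugged one-dimensional cost function, an illustrative toy rather than the combinatorial problems treated in the paper.

```python
# Generic simulated-annealing sketch: Metropolis moves with a decreasing temperature.
import numpy as np

rng = np.random.default_rng(6)

def cost(x):
    # A rugged function with many local minima and a global minimum near x = 2.
    return 0.1 * (x - 2.0) ** 2 + np.sin(5 * x)

x = 8.0                                    # start far from the global minimum
T, cooling = 2.0, 0.999                    # initial temperature and geometric cooling rate

for step in range(20000):
    x_new = x + rng.normal(scale=0.5)      # propose a random move
    dE = cost(x_new) - cost(x)
    # Metropolis rule: always accept downhill moves; accept uphill moves
    # with probability exp(-dE / T), which shrinks as the system cools.
    if dE < 0 or rng.random() < np.exp(-dE / T):
        x = x_new
    T *= cooling

print(f"final x = {x:.3f}, cost = {cost(x):.3f}")
```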

Journal ArticleDOI
TL;DR: In this article, a method is presented for making successive experiments at levels x_1, x_2, ... in such a way that x_n will tend to θ in probability.
Abstract: Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant. We give a method for making successive experiments at levels x_1, x_2, ... in such a way that x_n will tend to θ in probability.

9,312 citations
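
The procedure described above is the Robbins-Monro stochastic-approximation recursion x_{n+1} = x_n + a_n (α - y_n), where y_n is a noisy observation of M(x_n) and a_n is a step-size sequence such as 1/n. The sketch below applies the recursion to a hypothetical monotone response curve purely as an illustration.

```python
# Robbins-Monro sketch: find x with M(x) = alpha from noisy observations of M.
import numpy as np

rng = np.random.default_rng(7)

def M(x):
    return 2.0 * np.tanh(x) + 1.0          # unknown monotone response curve (illustrative)

def noisy_observation(x):
    return M(x) + rng.normal(scale=0.3)    # each "experiment" returns M(x) plus noise

alpha = 2.0                                # target response level
x = 0.0                                    # initial level
for n in range(1, 5001):
    a_n = 1.0 / n                          # step sizes with sum a_n = inf, sum a_n^2 < inf
    y_n = noisy_observation(x)
    x = x + a_n * (alpha - y_n)            # move up if the response was too low, down if too high

# True solution: 2 tanh(x) + 1 = 2  =>  x = arctanh(0.5)
print("estimate:", x, " true root:", np.arctanh(0.5))
```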

Book ChapterDOI
01 Jan 2010
TL;DR: A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems.
Abstract: During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by computing time rather than sample size. A more precise analysis uncovers qualitatively different tradeoffs for the cases of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely candidates such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass over the training set.

5,561 citations
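
The last claim of the abstract, that averaged stochastic gradient is asymptotically efficient after a single pass, can be illustrated on a least-squares problem: run plain SGD once through the data and report the Polyak-Ruppert average of the iterates alongside the last iterate. The snippet below is a minimal illustration under those assumptions (the step-size schedule is an arbitrary stable choice), not a reproduction of the paper's analysis.

```python
# Single-pass SGD with Polyak-Ruppert iterate averaging on a least-squares problem.
import numpy as np

rng = np.random.default_rng(8)
d, n = 10, 50000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.5 * rng.normal(size=n)

w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(n):                           # exactly one pass over the training set
    lr = 0.02 / (1.0 + 5e-4 * t)             # slowly decaying step size
    g = (X[t] @ w - y[t]) * X[t]             # stochastic gradient from a single example
    w -= lr * g
    w_avg += (w - w_avg) / (t + 1)           # running (Polyak-Ruppert) average of the iterates

print("error of last iterate:    ", np.linalg.norm(w - w_true))
print("error of averaged iterate:", np.linalg.norm(w_avg - w_true))
```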

Posted Content
TL;DR: This paper quantifies the generality versus specificity of neurons in each layer of a deep convolutional neural network and reports a few surprising results, including that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
Abstract: Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

4,663 citations
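
The transfer experiment described in the abstract, copying the first layers from a base network and retraining the rest on a target task, can be mimicked at toy scale: train a small network on task A, freeze its first layer, and retrain only the output layer on a related task B. The sketch below is a minimal numpy illustration of that protocol under simplified assumptions, not the ImageNet-scale experiments in the paper.

```python
# Toy transfer-learning sketch: reuse (freeze) the first layer trained on task A,
# retrain only the top layer on a related task B, and compare with training from scratch.
import numpy as np

rng = np.random.default_rng(9)

def make_task(shift):
    X = rng.normal(size=(1000, 8))
    y = ((X[:, 0] + X[:, 1] + shift) > 0).astype(float).reshape(-1, 1)
    return X, y

XA, yA = make_task(0.0)        # base task
XB, yB = make_task(0.5)        # related target task (same relevant features, shifted threshold)

def train(X, y, W1, b1, W2, b2, freeze_first, steps=2000, lr=0.3):
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))
        dz = (p - y) / len(y)
        gW2, gb2 = h.T @ dz, dz.sum(axis=0)
        if not freeze_first:                    # frozen features: skip the first-layer update
            dh = dz @ W2.T * (1 - h ** 2)
            W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def accuracy(X, y, W1, b1, W2, b2):
    p = 1 / (1 + np.exp(-(np.tanh(X @ W1 + b1) @ W2 + b2)))
    return np.mean((p > 0.5) == y)

init = lambda: (0.5 * rng.normal(size=(8, 16)), np.zeros(16),
                0.5 * rng.normal(size=(16, 1)), np.zeros(1))

# 1) Train the base network on task A.
W1, b1, _, _ = train(XA, yA, *init(), freeze_first=False)

# 2) Transfer: keep the base first layer frozen, train only a fresh head on task B.
_, _, W2, b2 = init()
transferred = train(XB, yB, W1.copy(), b1.copy(), W2, b2, freeze_first=True)

# 3) Baseline: train from scratch on task B.
scratch = train(XB, yB, *init(), freeze_first=False)

print("task B accuracy, transferred features:", accuracy(XB, yB, *transferred))
print("task B accuracy, trained from scratch:", accuracy(XB, yB, *scratch))
```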