02 Mar 2021 - Proceedings of the National Academy of Sciences of the United States of America (National Academy of Sciences) - Vol. 118, Iss. 9

Abstract: Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (aka Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm: the effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
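
To make the PCA step concrete, here is a minimal sketch (our own toy setup, not the paper's networks or datasets): run minibatch SGD on a small logistic-regression problem, record the weight trajectory after a burn-in, and measure the variance of the SGD fluctuations along each principal direction, the quantity the abstract relates inversely to landscape flatness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (assumption: any smooth loss would do).
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = (X @ true_w + 0.5 * rng.normal(size=500) > 0).astype(float)

def grad(w, idx):
    """Minibatch gradient of the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
    return X[idx].T @ (p - y[idx]) / len(idx)

# Run SGD and record the weight trajectory after a burn-in phase.
w = np.zeros(10)
traj = []
for step in range(5000):
    idx = rng.choice(500, size=32, replace=False)
    w -= 0.1 * grad(w, idx)
    if step >= 2000:
        traj.append(w.copy())
traj = np.array(traj)

# PCA of the trajectory: variance of the SGD fluctuations per direction.
centered = traj - traj.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
variances = s**2 / len(traj)
print("variance along each PCA direction:", np.round(variances, 6))
```

In the paper's analysis, each of these variances would be compared against the flatness of the loss landscape measured along the same PCA direction.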


Topics: Stochastic gradient descent (57%), Flatness (systems theory) (55%), Maxima and minima (51%)


11 results found


Abstract: The representation of functions by artificial neural networks depends on a large number of parameters in a nonlinear fashion. Suitable parameters are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous-time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function, in a sense that differs from the flat-minimum selection of continuous-time SGD with homogeneous noise.
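
The flat-minimum preference under inhomogeneous noise can be illustrated with a toy Euler-Maruyama simulation (our own construction, not the paper's model): a 1D loss with two equally deep minima of different curvature, driven by noise that is stronger on the sharp side. The diffusion then spends most of its time in the flat basin.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D loss f(x) = (1 - x^2)^2 (1 + 0.8 tanh 3x): two equally deep minima,
# flat at x = -1 and sharp at x = +1 (our own landscape, for illustration only).
def dloss(x):
    A, B = (1 - x**2) ** 2, 1 + 0.8 * np.tanh(3 * x)
    dA, dB = -4 * x * (1 - x**2), 2.4 / np.cosh(3 * x) ** 2
    return dA * B + A * dB

def sigma(x):
    # Inhomogeneous noise amplitude: stronger where the landscape is sharper.
    return 0.3 + 0.6 * (1 + np.tanh(3 * x)) / 2

# Euler-Maruyama for dx = -f'(x) dt + sigma(x) dW, started on the barrier.
dt, n_steps = 1e-3, 400_000
xi = rng.normal(size=n_steps) * np.sqrt(dt)
x, time_flat = 0.0, 0
for i in range(n_steps):
    x += -dloss(x) * dt + sigma(x) * xi[i]
    time_flat += x < 0
frac_flat = time_flat / n_steps
print(f"fraction of time spent in the flat basin: {frac_flat:.2f}")
```

With homogeneous noise (sigma constant), the two equally deep basins would be occupied much more symmetrically; the asymmetric occupancy here comes entirely from the state-dependent noise.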


Topics: Stochastic gradient descent (65%), Artificial neural network (54%), Noise (54%)

3 Citations


Carlos Floyd¹, Herbert Levine², Christopher Jarzynski¹, Garegin A. Papoian¹

Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine-learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide a connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.
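
A cartoon of the increment statistics described here (synthetic data, not the authors' agent-based model): an energy trace built from slow stochastic loading punctuated by rare, large releases yields strongly negatively skewed, non-Gaussian energy increments, and the release events stand out as outliers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "mechanical energy" trace: slow accumulation punctuated by rapid,
# large releases (a cartoon of avalanche dynamics, not the cytoskeletal model).
n, E, energy = 20_000, 0.0, []
for _ in range(n):
    E += 0.01 + 0.005 * rng.normal()      # slow stochastic energy storage
    if rng.random() < 0.002:               # rare avalanche-like release
        E -= rng.uniform(2.0, 6.0)
    energy.append(E)
energy = np.array(energy)

# Slow storage / fast release shows up as a heavy, asymmetric left tail
# (strongly negative skewness) in the distribution of energy increments.
dE = np.diff(energy)
skew = np.mean((dE - dE.mean()) ** 3) / dE.std() ** 3
releases = int(np.sum(dE < dE.mean() - 5 * dE.std()))
print(f"skewness of energy increments: {skew:.1f}; large release events: {releases}")
```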


Topics: Active matter (53%), Mechanical energy (50%)

2 Citations


Abstract: Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that escape the bias-variance predictions of statistical learning and pose conceptual challenges for non-convex optimization. In this paper, we use methods from the statistical physics of disordered systems to analytically study the computational fallout of overparameterization in nonconvex neural network models. As the number of connection weights increases, we follow the changes in the geometrical structure of different minima of the error loss function and relate them to learning and generalization performance. We find that there exists a gap between the SAT/UNSAT interpolation transition, where solutions begin to exist, and the point where algorithms start to find solutions, i.e., where accessible solutions appear. This second phase transition coincides with the discontinuous appearance of atypical solutions that are locally extremely entropic, i.e., flat regions of the weight space that are particularly solution-dense and have good generalization properties. Although exponentially rare compared to typical solutions (which are narrower and extremely difficult to sample), entropic solutions are accessible to the algorithms used in learning. For data generated from a structurally different network, we characterize the generalization error of different solutions and optimize the Bayesian prediction. Numerical tests on observables suggested by the theory confirm that the scenario extends to realistic deep networks.
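
The notion of a locally entropic, solution-dense region can be sketched numerically (a toy perceptron of our own, not the paper's analysis): compare how many random weight perturbations preserve zero training error around a high-margin solution versus around a solution tilted to the edge of the version space.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linearly separable toy data from a random teacher, kept at margin > 0.5.
d = 5
teacher = rng.normal(size=d)
teacher /= np.linalg.norm(teacher)
X = rng.normal(size=(400, d))
m = X @ teacher
keep = np.abs(m) > 0.5
X, yl = X[keep], np.sign(m[keep])

def survival(w, sigma=0.05, trials=500):
    """Local-entropy proxy: fraction of Gaussian weight perturbations that
    still classify every training point correctly."""
    w = w / np.linalg.norm(w)
    hits = sum(np.all(yl * (X @ (w + sigma * rng.normal(size=d))) > 0)
               for _ in range(trials))
    return hits / trials

# A "flat" solution (the teacher, far from every decision boundary) versus a
# "sharp" one, tilted until one training point is barely classified.
margins = yl * (X @ teacher)
i0 = np.argmin(margins)
x0 = yl[i0] * X[i0] / np.linalg.norm(X[i0])
sharp = teacher.copy()
for a in np.linspace(0, 2, 200):
    w_try = teacher - a * x0
    if np.all(yl * (X @ w_try) > 0):
        sharp = w_try
s_flat, s_sharp = survival(teacher), survival(sharp)
print(f"perturbation survival, flat: {s_flat:.2f}, sharp: {s_sharp:.2f}")
```

The survival fraction is a crude stand-in for the local entropy the abstract describes: solutions surrounded by many other solutions tolerate far larger weight perturbations.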


Topics: Artificial neural network (55%), Overfitting (54%), Gradient descent (51%)

1 Citation



42 results found


Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.
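
A minimal simulated-annealing sketch in the spirit of this analogy (the test function and cooling schedule are our own choices): Metropolis moves on a multimodal function under a slowly decreasing temperature, so that uphill moves are accepted early and the system settles into a deep minimum as it cools.

```python
import numpy as np

rng = np.random.default_rng(4)

def rastrigin(p):
    """Standard multimodal test function; global minimum 0 at the origin."""
    return 20 + np.sum(p**2 - 10 * np.cos(2 * np.pi * p))

# Simulated annealing: Metropolis moves under geometric cooling.
p = rng.uniform(-4, 4, size=2)
best_p, best_f = p.copy(), rastrigin(p)
T = 5.0
for step in range(20_000):
    q = p + 0.5 * rng.normal(size=2)        # random local proposal
    df = rastrigin(q) - rastrigin(p)
    if df < 0 or rng.random() < np.exp(-df / T):   # accept uphill moves at high T
        p = q
        if rastrigin(p) < best_f:
            best_p, best_f = p.copy(), rastrigin(p)
    T *= 0.9995                              # geometric cooling schedule
print(f"best value found: {best_f:.3f} at {np.round(best_p, 2)}")
```

At fixed temperature this is just Metropolis sampling of exp(-f/T); the annealing analogy of the abstract is exactly the slow reduction of T toward zero.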


Topics: Optimization problem (61%), Continuous optimization (61%), Extremal optimization (59%)

38,868 Citations


Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech, and audio, whereas recurrent nets have shone light on sequential data such as text and speech.


Topics: Deep learning (59%), Object detection (53%), Cognitive neuroscience of visual object recognition (51%)

33,931 Citations


Abstract: Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant. We give a method for making successive experiments at levels x_1, x_2, … in such a way that x_n will tend to θ in probability.
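
This is the Robbins-Monro stochastic approximation scheme. A minimal sketch (the response curve M(x) = 2x + 1 and the noise level are illustrative assumptions, not from the paper): iterate x_{n+1} = x_n + a_n (α − y_n) with step sizes a_n = 1/n, where y_n is a noisy observation of M(x_n).

```python
import numpy as np

rng = np.random.default_rng(5)

# Unknown monotone response curve (assumption: chosen for illustration).
def noisy_response(x):
    return 2 * x + 1 + rng.normal(scale=0.5)   # M(x) = 2x + 1 plus noise

alpha = 5.0          # solve M(x) = alpha; here the true root is theta = 2

# Robbins-Monro iteration: x_{n+1} = x_n + a_n (alpha - y_n), a_n = 1/n.
x = 0.0
for n in range(1, 5001):
    y = noisy_response(x)
    x += (1.0 / n) * (alpha - y)
print(f"estimate after 5000 steps: {x:.3f} (true root 2.0)")
```

The step sizes satisfy the classical conditions (sum a_n diverges, sum a_n² converges), which is what guarantees convergence in probability despite the observation noise.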


Topics: Minimax approximation algorithm (64%), Stochastic approximation (60%), Constant (mathematics) (57%)

7,621 Citations


Abstract: Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
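
The transfer protocol described (copy the layer-1 features learned on a base task, freeze them, and retrain only the readout on the target task) can be sketched with a tiny two-layer network of our own; the toy tasks and architecture are illustrative assumptions, not the paper's ImageNet setup.

```python
import numpy as np

rng = np.random.default_rng(6)

def make_task(w_latent, n=400):
    """Toy tasks that share low-dimensional structure in the first 3 inputs."""
    X = rng.normal(size=(n, 10))
    y = (X[:, :3] @ w_latent > 0).astype(float)
    return X, y

X_a, y_a = make_task(np.array([1.0, 1.0, 1.0]))    # base task
X_b, y_b = make_task(np.array([1.0, -1.0, 0.5]))   # related target task

def train(X, y, W1, w2, lr=0.5, steps=3000, freeze_first=False):
    """Two-layer net with manual backprop; optionally freeze layer 1."""
    for _ in range(steps):
        H = np.tanh(X @ W1)                  # hidden features
        p = 1 / (1 + np.exp(-(H @ w2)))      # sigmoid readout
        g = (p - y) / len(y)                 # dLoss/dlogit (logistic loss)
        if not freeze_first:
            W1 -= lr * X.T @ (np.outer(g, w2) * (1 - H**2))
        w2 -= lr * (H.T @ g)
    return W1, w2

# Train on task A, then transfer the first layer to task B, freeze it,
# and fine-tune only the readout (the frozen-feature regime of the paper).
W1 = 0.3 * rng.normal(size=(10, 8))
W1, w2 = train(X_a, y_a, W1, np.zeros(8))
W1_b, w2_b = train(X_b, y_b, W1.copy(), np.zeros(8), freeze_first=True)
acc_b = np.mean((np.tanh(X_b @ W1_b) @ w2_b > 0) == (y_b > 0.5))
print(f"task-B training accuracy with transferred frozen features: {acc_b:.2f}")
```

Unfreezing `W1` in the second call corresponds to the fine-tuning condition whose generalization boost the abstract highlights.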


Topics: Convolutional neural network (54%)

4,661 Citations


01 Jan 2010

Abstract: During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by computing time rather than sample size. A more precise analysis uncovers qualitatively different tradeoffs for small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in nontrivial ways. Unlikely candidates such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass over the training set.
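
The single-pass efficiency claim can be illustrated on a toy least-squares problem (problem sizes and step schedule are our assumptions): one sweep of SGD with Polyak-Ruppert iterate averaging lands close to the exact batch solution, at a cost of one gradient per example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Large synthetic least-squares problem, standing in for a "large-scale"
# learning task where full passes over the data are expensive.
n, d = 50_000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# One pass of SGD with iterate averaging: each example is visited exactly
# once, yet the averaged iterate approaches the exact batch solution.
w, w_avg = np.zeros(d), np.zeros(d)
for t in range(n):
    g = (X[t] @ w - y[t]) * X[t]            # single-example gradient
    w -= 0.1 / (1 + 0.001 * t) * g          # decaying step size
    w_avg += (w - w_avg) / (t + 1)          # running average of iterates

w_batch = np.linalg.lstsq(X, y, rcond=None)[0]
gap = float(np.linalg.norm(w_avg - w_batch))
print(f"distance from averaged SGD iterate to batch solution: {gap:.4f}")
```

The batch solution requires solving the full normal equations; the averaged single-pass iterate gets close to it while touching each example only once, which is the asymptotic-efficiency point of the abstract.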


Topics: Stochastic gradient descent (72%), Gradient method (69%), Gradient descent (65%)

4,576 Citations