Author

Yu Feng

Other affiliations: Duke University
Bio: Yu Feng is an academic researcher from IBM. The author has contributed to research in the topics of stochastic gradient descent and artificial neural networks. The author has an h-index of 2 and has co-authored 5 publications receiving 15 citations. Previous affiliations of Yu Feng include Duke University.

Papers
Journal ArticleDOI
Yu Feng, Yuhai Tu
TL;DR: In this article, the authors investigated the connection between SGD learning dynamics and the loss function landscape and found a robust inverse relation between weight variance and landscape flatness, the opposite of the fluctuation-response relation in equilibrium statistical physics, indicating that SGD serves as a landscape-dependent annealing algorithm.
Abstract: Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (a.k.a. the Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
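To make the analysis pipeline described above concrete, the sketch below trains a small logistic-regression model with minibatch SGD, runs PCA on the recorded weight trajectory, and compares the weight variance with a simple flatness measure along each principal direction. It is a minimal illustration, not the authors' code; the model, the flatness proxy, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical illustration: PCA of an SGD weight trajectory and a simple
# variance-vs-flatness comparison along the principal directions.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (sizes and noise level are illustrative choices).
X = rng.normal(size=(2000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + 0.5 * rng.normal(size=2000) > 0).astype(float)

def loss(w, Xb, yb):
    z = np.clip(Xb @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))

def grad(w, Xb, yb):
    z = np.clip(Xb @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return Xb.T @ (p - yb) / len(yb) + 1e-3 * w   # small weight decay keeps the minimum finite

# Minibatch SGD; record weight snapshots after an initial transient.
w, lr, batch = np.zeros(20), 0.5, 25
snapshots = []
for step in range(20000):
    idx = rng.integers(0, len(X), size=batch)
    w -= lr * grad(w, X[idx], y[idx])
    if step >= 10000:
        snapshots.append(w.copy())
W = np.array(snapshots)

# PCA of the weight fluctuations around the average SGD solution.
w_bar = W.mean(axis=0)
variance, pcs = np.linalg.eigh(np.cov((W - w_bar).T))   # variance along each PCA direction

# Flatness along a PCA direction: width of the interval around the solution in which
# the full-batch loss stays within a fixed margin of its minimum (a simple proxy).
def flatness(direction, margin=0.05, span=3.0, npts=121):
    ts = np.linspace(-span, span, npts)
    ls = np.array([loss(w_bar + t * direction, X, y) for t in ts])
    inside = ts[ls <= ls.min() + margin]
    return inside.max() - inside.min() if len(inside) > 1 else 0.0

flat = np.array([flatness(pcs[:, i]) for i in range(pcs.shape[1])])
for i in np.argsort(variance)[::-1][:5]:
    print(f"PCA direction {i}: variance = {variance[i]:.2e}, flatness = {flat[i]:.2f}")
# The abstract reports an inverse trend: the variance tends to be larger along the
# *sharper* (less flat) directions, opposite to the equilibrium Einstein relation.
```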

25 citations

Posted Content
TL;DR: This study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.
Abstract: Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the learning dynamics and loss function landscape, we discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain the inverse variance-flatness relation, we develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Our study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape. Finally, we demonstrate how these new theoretical insights lead to more efficient algorithms, e.g., for avoiding catastrophic forgetting.
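The preprint only states that landscape-dependent constraints can mitigate catastrophic forgetting; the exact algorithm is not given here. The toy sketch below, on two quadratic "tasks", implements one constraint in that spirit: when learning task B, gradient components along the sharp (low-flatness) directions of task A's loss are projected out. Everything in it (the quadratic tasks, the median-eigenvalue threshold, the learning rate) is an illustrative assumption, not the authors' method.

```python
# Hypothetical sketch of a landscape-dependent constraint for sequential learning.
import numpy as np

rng = np.random.default_rng(1)
d = 10

# Task A and task B: random positive-semidefinite quadratic losses.
def random_quadratic():
    M = rng.normal(size=(d, d))
    return M @ M.T / d, rng.normal(size=d)        # Hessian and minimizer

H_A, c_A = random_quadratic()
H_B, c_B = random_quadratic()
loss_A = lambda w: 0.5 * (w - c_A) @ H_A @ (w - c_A)
grad_B = lambda w: H_B @ (w - c_B)

# After "learning" task A, measure its landscape: sharp directions = top Hessian modes.
evals, evecs = np.linalg.eigh(H_A)
sharp = evecs[:, evals > np.median(evals)]        # low-flatness directions of task A

def train_B(constrained, steps=500, lr=0.1):
    w = c_A.copy()                                # start from task A's solution
    for _ in range(steps):
        g = grad_B(w)
        if constrained:
            g = g - sharp @ (sharp.T @ g)         # forbid motion along sharp task-A directions
        w -= lr * g
    return w

for constrained in (False, True):
    w = train_B(constrained)
    print(f"constrained={constrained}: task-A loss after learning task B = {loss_A(w):.4f}")
# The constrained run keeps task A's loss lower, at the price of restricting how far
# task B can be optimized along those protected directions.
```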

5 citations

Journal ArticleDOI
TL;DR: The results provide a unified framework that reveals how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling one or both of two geometric determinants of generalization.
Abstract: One of the fundamental problems in machine learning is generalization. In neural network models with a large number of weights (parameters), many solutions can be found to fit the training data equally well. The key question is which solution can describe testing data not in the training set. Here, we report the discovery of an exact duality (equivalence) between changes in activities in a given layer of neurons and changes in weights that connect to the next layer of neurons in a densely connected layer in any feedforward neural network. The activity-weight (A-W) duality allows us to map variations in inputs (data) to variations of the corresponding dual weights. By using this mapping, we show that the generalization loss can be decomposed into a sum of contributions from different eigendirections of the Hessian matrix of the loss function at the solution in weight space. The contribution from a given eigendirection is the product of two geometric factors (determinants): the sharpness of the loss landscape and the standard deviation of the dual weights, which is found to scale with the weight norm of the solution. Our results provide a unified framework, which we use to reveal how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling either one or both of these two geometric determinants for generalization. These insights can be used to guide the development of algorithms for finding more generalizable solutions in overparametrized neural networks.
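The activity-weight duality can be checked numerically for a single dense layer. The rank-one construction below is one concrete mapping with the stated property (the paper's exact dual-weight definition may differ): feeding a perturbed activity vector through the original weights gives the same pre-activations as feeding the original activity through correspondingly perturbed "dual" weights.

```python
# Hypothetical numerical check of an activity-to-dual-weight mapping for one dense layer.
import numpy as np

rng = np.random.default_rng(2)

n_in, n_out = 8, 5
W = rng.normal(size=(n_out, n_in))      # weights connecting layer l to layer l+1
a = rng.normal(size=n_in)               # activities of layer l for one input
da = 0.01 * rng.normal(size=n_in)       # variation of the activities (e.g. input variation)

# Dual-weight variation: a rank-one update built from the activity variation.
dW = np.outer(W @ da, a) / (a @ a)

lhs = W @ (a + da)                      # perturbed activity, original weights
rhs = (W + dW) @ a                      # original activity, dual weights
print(np.allclose(lhs, rhs))            # True: the two descriptions are equivalent
print(np.linalg.norm(dW))               # scale of the dual-weight fluctuation
```

Per the abstract, collecting such dual-weight fluctuations over the data and projecting them onto the Hessian eigendirections is what yields the decomposition of the generalization loss into sharpness and dual-weight-spread factors.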

1 citation

DOI
18 Jul 2023 - bioRxiv
TL;DR: The authors explored a model in which each synapse is described by a continuous variable that evolves in a potential with multiple minima, where external inputs to the network can switch synapses from one potential well to another.
Abstract: It is widely believed that memory storage depends on activity-dependent synaptic modifications. Classical studies of learning and memory in neural networks describe synaptic efficacy either as continuous [1, 2] or discrete [2–4]. However, recent results suggest an intermediate scenario in which synaptic efficacy can be described by a continuous variable, but whose distribution is peaked around a small set of discrete values [5, 6]. Motivated by these results, we explored a model in which each synapse is described by a continuous variable that evolves in a potential with multiple minima. External inputs to the network can switch synapses from one potential well to another. Our analytical and numerical results show that this model can interpolate between models with discrete synapses which correspond to the deep potential limit [7], and models in which synapses evolve in a single quadratic potential [8]. We find that the storage capacity of the network with double-well synapses exhibits a power law dependence on the network size, rather than the logarithmic dependence observed in models with single well synapses [9]. In addition, synapses with deeper potential wells lead to more robust information storage in the presence of noise. When memories are sparsely encoded, the scaling of the capacity with network size is similar to previously studied network models in the sparse coding limit [2, 10–13].
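As a concrete picture of the model class described above, the sketch below integrates a single synaptic variable in a double-well potential U(w) = A(w^2 - 1)^2 with a transient external input that switches it between wells; the potential shape, input protocol, and noise level are illustrative assumptions rather than the paper's actual parameters.

```python
# Hypothetical sketch (not the paper's code) of one synapse in a double-well potential,
# so its efficacy distribution is peaked near the two discrete values w = -1 and w = +1.
import numpy as np

rng = np.random.default_rng(3)

A = 1.0                   # well depth (deeper wells -> more discrete-like, more robust)
dt, T = 1e-3, 40.0
noise = 0.3
steps = int(T / dt)

def dU(w):                # derivative of the double-well potential U(w) = A (w^2 - 1)^2
    return 4.0 * A * w * (w**2 - 1.0)

w = -1.0                  # start in the "depressed" well
trace = np.empty(steps)
for i in range(steps):
    t = i * dt
    # A strong potentiating input between t = 10 and t = 12 (illustrative choice).
    inp = 3.0 if 10.0 < t < 12.0 else 0.0
    w += (-dU(w) + inp) * dt + noise * np.sqrt(dt) * rng.normal()   # Euler-Maruyama step
    trace[i] = w

print(f"mean w before input: {trace[:int(10/dt)].mean():+.2f}")   # ~ -1
print(f"mean w after  input: {trace[int(15/dt):].mean():+.2f}")   # ~ +1 (switched wells)
```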

Cited by
Journal ArticleDOI
18 Jul 2022
TL;DR: In this paper, the authors propose an electrical network made of identical resistive edges that self-adjust based on local conditions in order to minimize an energy-based global cost function when shown training examples.
Abstract: Leveraging physical processes rather than a central processor is key to building machine learning systems that are massively scalable, robust to damage, and energy-efficient, like the brain. To achieve these features, the authors build an electrical network made of identical resistive edges that self-adjust based on local conditions in order to minimize an energy-based global cost function when shown training examples. Problems like regression and data classification are successfully solved by this network. Due to their energy efficiency and scaling advantages, future versions may one day compete with computational neural networks.
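A rough software caricature of such a self-adjusting resistor network is sketched below (it is not the authors' hardware or code): each edge conductance is updated from purely local voltage drops by comparing a free state, with only the inputs applied, against a clamped state in which the output is nudged toward its target, in the style of coupled-learning rules for physical networks. The graph, task, and learning parameters are all illustrative assumptions.

```python
# Hypothetical toy resistor network trained with a coupled-learning-style local rule.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

n_nodes = 8
edges = list(combinations(range(n_nodes), 2))        # complete graph, so it is connected
k = rng.uniform(0.5, 1.5, size=len(edges))           # learnable edge conductances

inputs, v_in = [0, 1], np.array([1.0, 0.0])          # input nodes held at fixed voltages
output, v_target = 2, 0.3                            # desired voltage at the output node

def solve(k, fixed_nodes, fixed_vals):
    """Node voltages of the linear resistor network with some nodes held fixed."""
    L = np.zeros((n_nodes, n_nodes))                  # weighted graph Laplacian
    for (i, j), ke in zip(edges, k):
        L[i, i] += ke; L[j, j] += ke
        L[i, j] -= ke; L[j, i] -= ke
    free = [n for n in range(n_nodes) if n not in fixed_nodes]
    V = np.zeros(n_nodes)
    V[fixed_nodes] = fixed_vals
    V[free] = np.linalg.solve(L[np.ix_(free, free)],
                              -L[np.ix_(free, fixed_nodes)] @ fixed_vals)
    return V

eta, alpha = 0.5, 0.05                                # nudge amplitude, learning rate
print(f"before: output = {solve(k, inputs, v_in)[output]:.3f} (target {v_target})")
for _ in range(500):
    VF = solve(k, inputs, v_in)                                # free state
    nudge = VF[output] + eta * (v_target - VF[output])         # clamp output part-way
    VC = solve(k, inputs + [output], np.append(v_in, nudge))   # clamped state
    for e, (i, j) in enumerate(edges):                         # purely local updates
        dVF, dVC = VF[i] - VF[j], VC[i] - VC[j]
        k[e] = max(1e-3, k[e] + (alpha / eta) * (dVF**2 - dVC**2))
    # Updates vanish once the free state already produces the target (VC equals VF).
print(f"after:  output = {solve(k, inputs, v_in)[output]:.3f} (target {v_target})")
```

With these settings the free-state output should move toward the target over training, with every conductance change computed from voltages measured across that edge alone, which is the "local conditions" property emphasized in the abstract.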

19 citations

Journal ArticleDOI
TL;DR: This work explains the theoretical underpinnings of a novel fully-parallel training algorithm that is compatible with asymmetric crosspoint elements and shows how device asymmetry can be exploited as a useful feature for analog deep learning processors.
Abstract: Analog crossbar arrays comprising programmable non-volatile resistors are under intense investigation for acceleration of deep neural network training. However, the ubiquitous asymmetric conductance modulation of practical resistive devices critically degrades the classification performance of networks trained with conventional algorithms. Here we first describe the fundamental reasons behind this incompatibility. Then, we explain the theoretical underpinnings of a novel fully-parallel training algorithm that is compatible with asymmetric crosspoint elements. By establishing a powerful analogy with classical mechanics, we explain how device asymmetry can be exploited as a useful feature for analog deep learning processors. Instead of conventionally tuning weights in the direction of the error function gradient, network parameters can be programmed to successfully minimize the total energy (Hamiltonian) of the system that incorporates the effects of device asymmetry. Our technique enables immediate realization of analog deep learning accelerators based on readily available device technologies.
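To see why asymmetric conductance modulation breaks conventional training, the toy below uses a common soft-bounds device model (an assumption, not necessarily the devices characterized in this work): up and down increments shrink near the respective conductance limits, so a balanced stream of up/down pulses does not average to zero but instead drags the weight toward the device's symmetric point. This systematic drift is the kind of error conventional gradient updates cannot tolerate, and the kind of device property that an energy-minimization scheme like the one described above reinterprets as a feature.

```python
# Hypothetical soft-bounds device model illustrating asymmetric conductance updates.
import numpy as np

rng = np.random.default_rng(5)

def pulse(w, direction, dw=0.02, w_min=-1.0, w_max=1.0):
    """Apply one up (+1) or down (-1) programming pulse to a soft-bounds device."""
    if direction > 0:
        return w + dw * (w_max - w) / (w_max - w_min)   # increment shrinks near w_max
    else:
        return w - dw * (w - w_min) / (w_max - w_min)   # decrement shrinks near w_min

w = 0.8                                                 # start near the upper bound
for _ in range(5000):
    w = pulse(w, rng.choice([+1, -1]))                  # zero-mean pulse stream
print(f"weight after balanced pulses: {w:+.3f}")        # drifts toward ~0, the symmetric point
# An ideal symmetric device would stay near 0.8 on average; the drift toward the
# symmetric point is the systematic bias that degrades conventionally trained networks.
```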

11 citations

Journal ArticleDOI
TL;DR: It is shown that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation and, in experiment, actually improves performance by allowing the system to better explore the discretized state space of solutions.
Abstract: In a network of neurons, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network are typically updated simultaneously using a central processor. Here, we investigate the feasibility and effect of desynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade the performance for a variety of tasks in an idealized simulation. In experiment, desynchronization actually improves the performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between desynchronization and mini-batching in stochastic gradient descent and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.
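The analogy between desynchronization and mini-batching can be illustrated with a deliberately simple toy (no physical network involved; the problem size, update probability, and step size are arbitrary choices): on a least-squares problem, "synchronized" descent updates every parameter at each step, while "desynchronized" descent lets each parameter update independently with some probability, injecting the same kind of stochasticity that random minibatch selection does.

```python
# Hypothetical toy comparison of synchronized vs. desynchronized parameter updates.
import numpy as np

rng = np.random.default_rng(6)

n, d = 200, 30
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)    # noisy linear targets

def train(p_update, steps=3000, lr=0.05):
    """Gradient descent where each parameter updates only with probability p_update."""
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n                 # full gradient of the squared error
        mask = rng.random(d) < p_update           # which "elements" fire an update now
        w -= lr * mask * g
    return 0.5 * np.mean((X @ w - y) ** 2)

print(f"synchronized   (p = 1.0): final loss = {train(1.0):.4f}")
print(f"desynchronized (p = 0.3): final loss = {train(0.3):.4f}")
# Both runs approach the same noise floor; the desynchronized one simply follows a
# noisier path, much as minibatch SGD does relative to full-batch gradient descent.
```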

11 citations

Journal ArticleDOI
TL;DR: In this article, the authors use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy and find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation.
Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine-learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide a connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.

10 citations