Author

Yu Feng

Other affiliations: Duke University
Bio: Yu Feng is an academic researcher from IBM. The author has contributed to research in the topics of stochastic gradient descent and artificial neural networks. The author has an h-index of 2 and has co-authored 5 publications receiving 15 citations. Previous affiliations of Yu Feng include Duke University.

Papers
Journal ArticleDOI
Yu Feng, Yuhai Tu
TL;DR: In this article, the authors investigated the connection between SGD learning dynamics and the loss function landscape and found a robust inverse relation between weight variance and landscape flatness, the opposite of the fluctuation-response relation in equilibrium statistical physics, indicating that SGD serves as a landscape-dependent annealing algorithm.
Abstract: Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (a.k.a. the Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
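To make the analysis pipeline described above concrete, the sketch below trains a small logistic-regression model with minibatch SGD, runs PCA on the recorded weight trajectory, and compares the weight variance with a simple flatness measure along each principal direction. It is a minimal illustration, not the authors' code; the model, the flatness proxy, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical illustration: PCA of an SGD weight trajectory and a simple
# variance-vs-flatness comparison along the principal directions.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (sizes and noise level are illustrative choices).
X = rng.normal(size=(2000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + 0.5 * rng.normal(size=2000) > 0).astype(float)

def loss(w, Xb, yb):
    z = np.clip(Xb @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))

def grad(w, Xb, yb):
    z = np.clip(Xb @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return Xb.T @ (p - yb) / len(yb) + 1e-3 * w   # small weight decay keeps the minimum finite

# Minibatch SGD; record weight snapshots after an initial transient.
w, lr, batch = np.zeros(20), 0.5, 25
snapshots = []
for step in range(20000):
    idx = rng.integers(0, len(X), size=batch)
    w -= lr * grad(w, X[idx], y[idx])
    if step >= 10000:
        snapshots.append(w.copy())
W = np.array(snapshots)

# PCA of the weight fluctuations around the average SGD solution.
w_bar = W.mean(axis=0)
variance, pcs = np.linalg.eigh(np.cov((W - w_bar).T))   # variance along each PCA direction

# Flatness along a PCA direction: width of the interval around the solution in which
# the full-batch loss stays within a fixed margin of its minimum (a simple proxy).
def flatness(direction, margin=0.05, span=3.0, npts=121):
    ts = np.linspace(-span, span, npts)
    ls = np.array([loss(w_bar + t * direction, X, y) for t in ts])
    inside = ts[ls <= ls.min() + margin]
    return inside.max() - inside.min() if len(inside) > 1 else 0.0

flat = np.array([flatness(pcs[:, i]) for i in range(pcs.shape[1])])
for i in np.argsort(variance)[::-1][:5]:
    print(f"PCA direction {i}: variance = {variance[i]:.2e}, flatness = {flat[i]:.2f}")
# The abstract reports an inverse trend: the variance tends to be larger along the
# *sharper* (less flat) directions, opposite to the equilibrium Einstein relation.
```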

25 citations

Posted Content
TL;DR: This study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape.
Abstract: Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the learning dynamics and loss function landscape, we discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain the inverse variance-flatness relation, we develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Our study indicates that SGD attains a self-tuned landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape. Finally, we demonstrate how these new theoretical insights lead to more efficient algorithms, e.g., for avoiding catastrophic forgetting.
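The preprint only states that landscape-dependent constraints can mitigate catastrophic forgetting; the exact algorithm is not given here. The toy sketch below, on two quadratic "tasks", implements one constraint in that spirit: when learning task B, gradient components along the sharp (low-flatness) directions of task A's loss are projected out. Everything in it (the quadratic tasks, the median-eigenvalue threshold, the learning rate) is an illustrative assumption, not the authors' method.

```python
# Hypothetical sketch of a landscape-dependent constraint for sequential learning.
import numpy as np

rng = np.random.default_rng(1)
d = 10

# Task A and task B: random positive-semidefinite quadratic losses.
def random_quadratic():
    M = rng.normal(size=(d, d))
    return M @ M.T / d, rng.normal(size=d)        # Hessian and minimizer

H_A, c_A = random_quadratic()
H_B, c_B = random_quadratic()
loss_A = lambda w: 0.5 * (w - c_A) @ H_A @ (w - c_A)
grad_B = lambda w: H_B @ (w - c_B)

# After "learning" task A, measure its landscape: sharp directions = top Hessian modes.
evals, evecs = np.linalg.eigh(H_A)
sharp = evecs[:, evals > np.median(evals)]        # low-flatness directions of task A

def train_B(constrained, steps=500, lr=0.1):
    w = c_A.copy()                                # start from task A's solution
    for _ in range(steps):
        g = grad_B(w)
        if constrained:
            g = g - sharp @ (sharp.T @ g)         # forbid motion along sharp task-A directions
        w -= lr * g
    return w

for constrained in (False, True):
    w = train_B(constrained)
    print(f"constrained={constrained}: task-A loss after learning task B = {loss_A(w):.4f}")
# The constrained run keeps task A's loss lower, at the price of restricting how far
# task B can be optimized along those protected directions.
```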

5 citations

Journal ArticleDOI
TL;DR: The results provide a unified framework that reveals how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling one or both of two geometric determinants of generalization.
Abstract: One of the fundamental problems in machine learning is generalization. In neural network models with a large number of weights (parameters), many solutions can be found to fit the training data equally well. The key question is which solution can describe testing data not in the training set. Here, we report the discovery of an exact duality (equivalence) between changes in activities in a given layer of neurons and changes in weights that connect to the next layer of neurons in a densely connected layer in any feedforward neural network. The activity-weight (A-W) duality allows us to map variations in inputs (data) to variations of the corresponding dual weights. By using this mapping, we show that the generalization loss can be decomposed into a sum of contributions from different eigendirections of the Hessian matrix of the loss function at the solution in weight space. The contribution from a given eigendirection is the product of two geometric factors (determinants): the sharpness of the loss landscape and the standard deviation of the dual weights, which is found to scale with the weight norm of the solution. Our results provide a unified framework, which we use to reveal how different regularization schemes (weight decay, stochastic gradient descent with different batch sizes and learning rates, dropout), training data size, and labeling noise affect generalization performance by controlling either one or both of these two geometric determinants for generalization. These insights can be used to guide the development of algorithms for finding more generalizable solutions in overparametrized neural networks.
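The activity-weight duality can be checked numerically for a single dense layer. The rank-one construction below is one concrete mapping with the stated property (the paper's exact dual-weight definition may differ): feeding a perturbed activity vector through the original weights gives the same pre-activations as feeding the original activity through correspondingly perturbed "dual" weights.

```python
# Hypothetical numerical check of an activity-to-dual-weight mapping for one dense layer.
import numpy as np

rng = np.random.default_rng(2)

n_in, n_out = 8, 5
W = rng.normal(size=(n_out, n_in))      # weights connecting layer l to layer l+1
a = rng.normal(size=n_in)               # activities of layer l for one input
da = 0.01 * rng.normal(size=n_in)       # variation of the activities (e.g. input variation)

# Dual-weight variation: a rank-one update built from the activity variation.
dW = np.outer(W @ da, a) / (a @ a)

lhs = W @ (a + da)                      # perturbed activity, original weights
rhs = (W + dW) @ a                      # original activity, dual weights
print(np.allclose(lhs, rhs))            # True: the two descriptions are equivalent
print(np.linalg.norm(dW))               # scale of the dual-weight fluctuation
```

Per the abstract, collecting such dual-weight fluctuations over the data and projecting them onto the Hessian eigendirections is what yields the decomposition of the generalization loss into sharpness and dual-weight-spread factors.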

1 citation

DOI
18 Jul 2023 - bioRxiv
TL;DR: The authors explored a model in which each synapse is described by a continuous variable that evolves in a potential with multiple minima, where external inputs to the network can switch synapses from one potential well to another.
Abstract: It is widely believed that memory storage depends on activity-dependent synaptic modifications. Classical studies of learning and memory in neural networks describe synaptic efficacy either as continuous [1, 2] or discrete [2–4]. However, recent results suggest an intermediate scenario in which synaptic efficacy can be described by a continuous variable, but whose distribution is peaked around a small set of discrete values [5, 6]. Motivated by these results, we explored a model in which each synapse is described by a continuous variable that evolves in a potential with multiple minima. External inputs to the network can switch synapses from one potential well to another. Our analytical and numerical results show that this model can interpolate between models with discrete synapses which correspond to the deep potential limit [7], and models in which synapses evolve in a single quadratic potential [8]. We find that the storage capacity of the network with double-well synapses exhibits a power law dependence on the network size, rather than the logarithmic dependence observed in models with single well synapses [9]. In addition, synapses with deeper potential wells lead to more robust information storage in the presence of noise. When memories are sparsely encoded, the scaling of the capacity with network size is similar to previously studied network models in the sparse coding limit [2, 10–13].
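As a concrete picture of the model class described above, the sketch below integrates a single synaptic variable in a double-well potential U(w) = A(w^2 - 1)^2 with a transient external input that switches it between wells; the potential shape, input protocol, and noise level are illustrative assumptions rather than the paper's actual parameters.

```python
# Hypothetical sketch (not the paper's code) of one synapse in a double-well potential,
# so its efficacy distribution is peaked near the two discrete values w = -1 and w = +1.
import numpy as np

rng = np.random.default_rng(3)

A = 1.0                   # well depth (deeper wells -> more discrete-like, more robust)
dt, T = 1e-3, 40.0
noise = 0.3
steps = int(T / dt)

def dU(w):                # derivative of the double-well potential U(w) = A (w^2 - 1)^2
    return 4.0 * A * w * (w**2 - 1.0)

w = -1.0                  # start in the "depressed" well
trace = np.empty(steps)
for i in range(steps):
    t = i * dt
    # A strong potentiating input between t = 10 and t = 12 (illustrative choice).
    inp = 3.0 if 10.0 < t < 12.0 else 0.0
    w += (-dU(w) + inp) * dt + noise * np.sqrt(dt) * rng.normal()   # Euler-Maruyama step
    trace[i] = w

print(f"mean w before input: {trace[:int(10/dt)].mean():+.2f}")   # ~ -1
print(f"mean w after  input: {trace[int(15/dt):].mean():+.2f}")   # ~ +1 (switched wells)
```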

Cited by
Journal ArticleDOI
18 Jul 2022
TL;DR: In this paper, the authors propose an electrical network made of identical resistive edges that self-adjust based on local conditions in order to minimize an energy-based global cost function when shown training examples.
Abstract: Leveraging physical processes rather than a central processor is key to building machine learning systems that are massively scalable, robust to damage, and energy-efficient, like the brain. To achieve these features, the authors build an electrical network made of identical resistive edges that self-adjust based on local conditions in order to minimize an energy-based global cost function when shown training examples. Problems like regression and data classification are successfully solved by this network. Due to their energy efficiency and scaling advantages, future versions may one day compete with computational neural networks.
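A rough software caricature of such a self-adjusting resistor network is sketched below (it is not the authors' hardware or code): each edge conductance is updated from purely local voltage drops by comparing a free state, with only the inputs applied, against a clamped state in which the output is nudged toward its target, in the style of coupled-learning rules for physical networks. The graph, task, and learning parameters are all illustrative assumptions.

```python
# Hypothetical toy resistor network trained with a coupled-learning-style local rule.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

n_nodes = 8
edges = list(combinations(range(n_nodes), 2))        # complete graph, so it is connected
k = rng.uniform(0.5, 1.5, size=len(edges))           # learnable edge conductances

inputs, v_in = [0, 1], np.array([1.0, 0.0])          # input nodes held at fixed voltages
output, v_target = 2, 0.3                            # desired voltage at the output node

def solve(k, fixed_nodes, fixed_vals):
    """Node voltages of the linear resistor network with some nodes held fixed."""
    L = np.zeros((n_nodes, n_nodes))                  # weighted graph Laplacian
    for (i, j), ke in zip(edges, k):
        L[i, i] += ke; L[j, j] += ke
        L[i, j] -= ke; L[j, i] -= ke
    free = [n for n in range(n_nodes) if n not in fixed_nodes]
    V = np.zeros(n_nodes)
    V[fixed_nodes] = fixed_vals
    V[free] = np.linalg.solve(L[np.ix_(free, free)],
                              -L[np.ix_(free, fixed_nodes)] @ fixed_vals)
    return V

eta, alpha = 0.5, 0.05                                # nudge amplitude, learning rate
print(f"before: output = {solve(k, inputs, v_in)[output]:.3f} (target {v_target})")
for _ in range(500):
    VF = solve(k, inputs, v_in)                                # free state
    nudge = VF[output] + eta * (v_target - VF[output])         # clamp output part-way
    VC = solve(k, inputs + [output], np.append(v_in, nudge))   # clamped state
    for e, (i, j) in enumerate(edges):                         # purely local updates
        dVF, dVC = VF[i] - VF[j], VC[i] - VC[j]
        k[e] = max(1e-3, k[e] + (alpha / eta) * (dVF**2 - dVC**2))
    # Updates vanish once the free state already produces the target (VC equals VF).
print(f"after:  output = {solve(k, inputs, v_in)[output]:.3f} (target {v_target})")
```

With these settings the free-state output should move toward the target over training, with every conductance change computed from voltages measured across that edge alone, which is the "local conditions" property emphasized in the abstract.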

19 citations

Journal ArticleDOI
TL;DR: This work explains the theoretical underpinnings of a novel fully-parallel training algorithm that is compatible with asymmetric crosspoint elements and shows how device asymmetry can be exploited as a useful feature for analog deep learning processors.
Abstract: Analog crossbar arrays comprising programmable non-volatile resistors are under intense investigation for acceleration of deep neural network training. However, the ubiquitous asymmetric conductance modulation of practical resistive devices critically degrades the classification performance of networks trained with conventional algorithms. Here we first describe the fundamental reasons behind this incompatibility. Then, we explain the theoretical underpinnings of a novel fully-parallel training algorithm that is compatible with asymmetric crosspoint elements. By establishing a powerful analogy with classical mechanics, we explain how device asymmetry can be exploited as a useful feature for analog deep learning processors. Instead of conventionally tuning weights in the direction of the error function gradient, network parameters can be programmed to successfully minimize the total energy (Hamiltonian) of the system that incorporates the effects of device asymmetry. Our technique enables immediate realization of analog deep learning accelerators based on readily available device technologies.
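To see why asymmetric conductance modulation breaks conventional training, the toy below uses a common soft-bounds device model (an assumption, not necessarily the devices characterized in this work): up and down increments shrink near the respective conductance limits, so a balanced stream of up/down pulses does not average to zero but instead drags the weight toward the device's symmetric point. This systematic drift is the kind of error conventional gradient updates cannot tolerate, and the kind of device property that an energy-minimization scheme like the one described above reinterprets as a feature.

```python
# Hypothetical soft-bounds device model illustrating asymmetric conductance updates.
import numpy as np

rng = np.random.default_rng(5)

def pulse(w, direction, dw=0.02, w_min=-1.0, w_max=1.0):
    """Apply one up (+1) or down (-1) programming pulse to a soft-bounds device."""
    if direction > 0:
        return w + dw * (w_max - w) / (w_max - w_min)   # increment shrinks near w_max
    else:
        return w - dw * (w - w_min) / (w_max - w_min)   # decrement shrinks near w_min

w = 0.8                                                 # start near the upper bound
for _ in range(5000):
    w = pulse(w, rng.choice([+1, -1]))                  # zero-mean pulse stream
print(f"weight after balanced pulses: {w:+.3f}")        # drifts toward ~0, the symmetric point
# An ideal symmetric device would stay near 0.8 on average; the drift toward the
# symmetric point is the systematic bias that degrades conventionally trained networks.
```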

11 citations

Journal ArticleDOI
TL;DR: It is shown that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation and, in experiment, actually improves performance by allowing the system to better explore the discretized state space of solutions.
Abstract: In a network of neurons, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network are typically updated simultaneously using a central processor. Here, we investigate the feasibility and effect of desynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade the performance for a variety of tasks in an idealized simulation. In experiment, desynchronization actually improves the performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between desynchronization and mini-batching in stochastic gradient descent and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.
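The analogy between desynchronization and mini-batching can be illustrated with a deliberately simple toy (no physical network involved; the problem size, update probability, and step size are arbitrary choices): on a least-squares problem, "synchronized" descent updates every parameter at each step, while "desynchronized" descent lets each parameter update independently with some probability, injecting the same kind of stochasticity that random minibatch selection does.

```python
# Hypothetical toy comparison of synchronized vs. desynchronized parameter updates.
import numpy as np

rng = np.random.default_rng(6)

n, d = 200, 30
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)    # noisy linear targets

def train(p_update, steps=3000, lr=0.05):
    """Gradient descent where each parameter updates only with probability p_update."""
    w = np.zeros(d)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n                 # full gradient of the squared error
        mask = rng.random(d) < p_update           # which "elements" fire an update now
        w -= lr * mask * g
    return 0.5 * np.mean((X @ w - y) ** 2)

print(f"synchronized   (p = 1.0): final loss = {train(1.0):.4f}")
print(f"desynchronized (p = 0.3): final loss = {train(0.3):.4f}")
# Both runs approach the same noise floor; the desynchronized one simply follows a
# noisier path, much as minibatch SGD does relative to full-batch gradient descent.
```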

11 citations

Journal ArticleDOI
TL;DR: In this article, the authors use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy and find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation.
Abstract: Eukaryotic cells are mechanically supported by a polymer network called the cytoskeleton, which consumes chemical energy to dynamically remodel its structure. Recent experiments in vivo have revealed that this remodeling occasionally happens through anomalously large displacements, reminiscent of earthquakes or avalanches. These cytoskeletal avalanches might indicate that the cytoskeleton's structural response to a changing cellular environment is highly sensitive, and they are therefore of significant biological interest. However, the physics underlying "cytoquakes" is poorly understood. Here, we use agent-based simulations of cytoskeletal self-organization to study fluctuations in the network's mechanical energy. We robustly observe non-Gaussian statistics and asymmetrically large rates of energy release compared to accumulation in a minimal cytoskeletal model. The large events of energy release are found to correlate with large, collective displacements of the cytoskeletal filaments. We also find that the changes in the localization of tension and the projections of the network motion onto the vibrational normal modes are asymmetrically distributed for energy release and accumulation. These results imply an avalanche-like process of slow energy storage punctuated by fast, large events of energy release involving a collective network rearrangement. We further show that mechanical instability precedes cytoquake occurrence through a machine-learning model that dynamically forecasts cytoquakes using the vibrational spectrum as input. Our results provide a connection between the cytoquake phenomenon and the network's mechanical energy and can help guide future investigations of the cytoskeleton's structural susceptibility.

10 citations