
Showing papers on "Maxima and minima" published in 2019


Journal ArticleDOI
TL;DR: The numerical simulations of the proposed ChASO-FOPID and ASO-FOPID controllers for the dc motor speed control system demonstrated the superior performance of both the chaotic and the original ASO.
Abstract: In this paper, the atom search optimization (ASO) algorithm and a novel chaotic version of it [chaotic ASO (ChASO)] are proposed to determine the optimal parameters of the fractional-order proportional+integral+derivative (FOPID) controller for dc motor speed control. The ASO algorithm, which mathematically models and mimics the atomic motion model in nature, is simple and easy to implement, and is developed to address a diverse set of optimization problems. The proposed ChASO algorithm, on the other hand, is based on logistic map chaotic sequences, which enable the original algorithm to escape from local-minima stagnation and improve its convergence rate and resulting precision. First, the proposed ChASO algorithm is applied to six unimodal and multimodal benchmark optimization problems and the results are compared with other algorithms. Second, the proposed ChASO-FOPID, ASO-FOPID, and ASO-PID controllers are compared with GWO-FOPID, GWO-PID, IWO-PID, and SFS-PID controllers using the integral of time multiplied absolute error (ITAE) objective function for a fair comparison. Comparisons were also made for the integral of time multiplied squared error (ITSE) and Zwe-Lee Gaing's (ZLG) objective function as the most commonly used objective functions in the literature. Transient response analysis, frequency response (Bode) analysis, and robustness analysis were all carried out. The simulation results are promising and validate the effectiveness of the proposed approaches. The numerical simulations of the proposed ChASO-FOPID and ASO-FOPID controllers for the dc motor speed control system demonstrated the superior performance of both the chaotic and the original ASO.
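
As a rough illustration of the chaotic component described above, the sketch below generates a logistic-map sequence and uses it in place of uniform random draws to perturb a candidate parameter vector. This is a minimal sketch of the general idea only; the variable names, perturbation scale, and the way ChASO actually injects the chaotic sequence are assumptions, not taken from the paper.

```python
import numpy as np

def logistic_map_sequence(length, x0=0.7, r=4.0):
    """Chaotic sequence in (0, 1) from the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    seq = np.empty(length)
    x = x0
    for k in range(length):
        x = r * x * (1.0 - x)
        seq[k] = x
    return seq

# Use the chaotic sequence instead of uniform random numbers when perturbing a
# candidate solution (e.g., hypothetical FOPID gains Kp, Ki, Kd, lambda, mu).
chaos = logistic_map_sequence(5)
candidate = np.array([1.0, 0.5, 0.1, 0.9, 1.1])     # illustrative controller parameters
perturbed = candidate + 0.1 * (2.0 * chaos - 1.0)   # chaotic perturbation in [-0.1, 0.1]
print(perturbed)
```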

156 citations


Journal ArticleDOI
TL;DR: In this article, a probabilistic approach to quantify convergence to equilibrium for (kinetic) Langevin processes is introduced, based on a specific combination of reflection and synchronous coupling of two solutions of the Langevin equation.
Abstract: We introduce a new probabilistic approach to quantify convergence to equilibrium for (kinetic) Langevin processes. In contrast to previous analytic approaches that focus on the associated kinetic Fokker-Planck equation, our approach is based on a specific combination of reflection and synchronous coupling of two solutions of the Langevin equation. It yields contractions in a particular Wasserstein distance, and it provides rather precise bounds for convergence to equilibrium at the borderline between the overdamped and the underdamped regime. In particular, we are able to recover kinetic behavior in terms of explicit lower bounds for the contraction rate. For example, for a rescaled double-well potential with local minima at distance a, we obtain a lower bound for the contraction rate of order Ω(a^{-1}) provided the friction coefficient is of order Θ(a^{-1}).
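
For reference, the (kinetic) Langevin dynamics the abstract refers to can be written in the following standard form; the notation below is generic and not necessarily the paper's:

```latex
dX_t = V_t \, dt, \qquad
dV_t = -\gamma V_t \, dt - \nabla U(X_t) \, dt + \sqrt{2\gamma} \, dB_t
```

Here γ is the friction coefficient, U the potential, and B_t a Brownian motion; the abstract's claim is a contraction rate of order Ω(a^{-1}) for the rescaled double-well potential when γ = Θ(a^{-1}).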

152 citations


Proceedings Article
24 May 2019
TL;DR: This work studies a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics, and shows that the anisotropic noise in SGD helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well.
Abstract: Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in terms of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We systematically design various experiments to verify the benefits of the anisotropic noise, compared with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).
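
The toy experiment below contrasts isotropic noise with curvature-aligned (anisotropic) noise when escaping a quadratic basin with one sharp and one flat direction, in the spirit of the escaping-efficiency indicator described above. It is a hedged sketch under simplified assumptions (a fixed quadratic loss and hand-chosen noise covariances), not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])                     # one sharp and one flat curvature direction
eta, max_steps, trials, barrier = 0.01, 500, 200, 1.0

def mean_escape_time(noise_cov):
    """Noisy GD in the basin 0.5*theta^T H theta; steps until the loss exceeds the barrier."""
    chol = np.linalg.cholesky(noise_cov)
    times = []
    for _ in range(trials):
        theta = np.zeros(2)
        for t in range(1, max_steps + 1):
            theta = theta - eta * (H @ theta) + eta * (chol @ rng.standard_normal(2))
            if 0.5 * theta @ H @ theta > barrier:
                times.append(t)
                break
        else:
            times.append(max_steps)
    return float(np.mean(times))

isotropic = np.eye(2) * (np.trace(H) / 2.0)    # isotropic noise with matched total variance
anisotropic = H.copy()                         # curvature-aligned (SGD-like) noise
print("mean escape time, isotropic noise:  ", mean_escape_time(isotropic))
print("mean escape time, anisotropic noise:", mean_escape_time(anisotropic))
```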

143 citations


Journal ArticleDOI
TL;DR: It is argued that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved, and observed that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data.
Abstract: Deep learning has been immensely successful at a variety of tasks, ranging from classification to artificial intelligence. Learning corresponds to fitting training data, which is implemented by descending a very high-dimensional loss function. Understanding under which conditions neural networks do not get stuck in poor minima of the loss, and how the landscape of that loss evolves as depth is increased, remains a challenge. Here we predict, and test empirically, an analogy between this landscape and the energy landscape of repulsive ellipses. We argue that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved. In the vicinity of this transition, properties of the curvature of the minima of the loss (the spectrum of the Hessian) are critical. This transition shares direct similarities with the jamming transition by which particles form a disordered solid as the density is increased, which also occurs in certain classes of computational optimization and learning problems such as the perceptron. Our analysis gives a simple explanation as to why poor minima of the loss cannot be encountered in the overparametrized regime. Interestingly, we observe that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data. We also study a quantity Δ which characterizes how well (Δ < 0) or badly (Δ > 0) a datum is learned. At the critical point it is power-law distributed on several decades, P_{+}(Δ)∼Δ^{θ} for Δ>0 and P_{-}(Δ)∼(-Δ)^{-γ} for Δ<0, with exponents that depend on the choice of activation function. This observation suggests that near the transition the loss landscape has a hierarchical structure and that the learning dynamics is prone to avalanche-like dynamics, with abrupt changes in the set of patterns that are learned.

136 citations


Journal ArticleDOI
TL;DR: This paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.
Abstract: Neural networks with rectified linear unit (ReLU) activation functions (a.k.a. ReLU networks) have achieved great empirical success in various domains. Nonetheless, existing results for learning ReLU networks either pose assumptions on the underlying data distribution being, e.g., Gaussian, or require the network size and/or training size to be sufficiently large. In this context, the problem of learning a two-layer ReLU network is approached in a binary classification setting, where the data are linearly separable and a hinge loss criterion is adopted. Leveraging the power of random noise perturbation, this paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general. This result is the first of its kind, requiring no assumptions on the data distribution, training/network size, or initialization. Convergence of the resultant iterative algorithm to a global minimum is analyzed by establishing both an upper bound and a lower bound on the number of non-zero updates to be performed. Moreover, generalization guarantees are developed for ReLU networks trained with the novel SGD leveraging classic compression bounds. These guarantees highlight a key difference (at least in the worst case) between reliably learning a ReLU network and a leaky ReLU network in terms of sample complexity. Numerical tests using both synthetic data and real images validate the effectiveness of the algorithm and the practical merits of the theory.

132 citations


Posted Content
TL;DR: In this paper, the gradient noise in SGD is considered in a more general context and the generalized central limit theorem (GCLT) is invoked to analyze SGD as an SDE driven by a Levy motion.
Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Levy motion. Such SDEs can incur `jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
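
To get a feel for the α-stable assumption, the sketch below compares samples of Gaussian noise with samples from a heavy-tailed α-stable distribution using scipy. The crude extreme-value ratio printed here is only an illustration; it is not the tail-index estimator used in the paper.

```python
import numpy as np
from scipy.stats import levy_stable, norm

n = 50_000
gaussian = norm.rvs(size=n, random_state=0)
alpha, beta = 1.5, 0.0                      # alpha < 2 gives heavy tails; alpha = 2 is Gaussian
stable = levy_stable.rvs(alpha, beta, size=n, random_state=0)

for name, x in [("gaussian", gaussian), ("alpha-stable (alpha=1.5)", stable)]:
    bulk = np.std(x[np.abs(x) < np.quantile(np.abs(x), 0.9)])   # scale of the central 90%
    print(name, "max|x| / bulk scale:", np.max(np.abs(x)) / bulk)
```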

111 citations


Proceedings Article
27 May 2019
TL;DR: For two-layer ReLU networks (i.e., with one hidden layer), the authors show that natural gradient descent (NGD) converges to global minima under the assumptions that the inputs do not degenerate and the network is over-parameterized.
Abstract: Natural gradient descent has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, especially for \emph{non-linear} networks. In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. We identify two conditions which guarantee the global convergence: (1) the Jacobian matrix (of the network's output for all training cases w.r.t. the parameters) is full row rank and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e. with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized. We further extend our analysis to more general loss functions with similar convergence properties. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
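
The numpy sketch below implements a damped Gauss-Newton step for a tiny ReLU network with squared error, i.e. an update of the form (J^T J + λI)^{-1} J^T r built from the Jacobian of the network outputs, which is the natural-gradient-style object the conditions above are stated in terms of. It is a generic illustration (numerical Jacobian, hand-picked damping), not the K-FAC approximation analyzed in the paper.

```python
import numpy as np

def forward(params, X):
    """Tiny two-layer ReLU net: params = (W1, w2)."""
    W1, w2 = params
    return np.maximum(X @ W1, 0.0) @ w2

def jacobian(params, X, eps=1e-6):
    """Numerical Jacobian of outputs w.r.t. flattened parameters (fine for a toy example)."""
    W1, w2 = params
    flat = np.concatenate([W1.ravel(), w2.ravel()])
    def f(v):
        return forward((v[:W1.size].reshape(W1.shape), v[W1.size:].reshape(w2.shape)), X)
    base = f(flat)
    J = np.zeros((base.size, flat.size))
    for j in range(flat.size):
        pert = flat.copy()
        pert[j] += eps
        J[:, j] = (f(pert) - base) / eps
    return J, flat, base

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
params = (0.5 * rng.standard_normal((3, 8)), 0.5 * rng.standard_normal(8))

for step in range(20):
    J, flat, out = jacobian(params, X)
    residual = out - y
    damping = 1e-3
    # Damped Gauss-Newton / natural-gradient-style step: (J^T J + damping*I)^{-1} J^T r
    flat = flat - np.linalg.solve(J.T @ J + damping * np.eye(flat.size), J.T @ residual)
    params = (flat[:params[0].size].reshape(params[0].shape),
              flat[params[0].size:].reshape(params[1].shape))
    if step % 5 == 0:
        print("loss:", 0.5 * np.mean((forward(params, X) - y) ** 2))
```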

86 citations


Journal ArticleDOI
TL;DR: It is shown that it is possible to destabilize trapping sets of analog states that correspond to local minima of the binary spin Hamiltonian by extending the phase space to include error signals that correct amplitude inhomogeneity of the analog spin states and controlling the divergence of their velocity.
Abstract: The relaxation of binary spins to analog values has been the subject of much debate in the field of statistical physics, neural networks, and more recently quantum computing, notably because the benefits of using an analog state for finding lower energy spin configurations are usually offset by the negative impact of the improper mapping of the energy function that results from the relaxation. We show that it is possible to destabilize trapping sets of analog states that correspond to local minima of the binary spin Hamiltonian by extending the phase space to include error signals that correct amplitude inhomogeneity of the analog spin states and controlling the divergence of their velocity. Performance of the proposed analog spin system in finding lower energy states is competitive against state-of-the-art heuristics.
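
The toy sketch below imitates the mechanism described in the abstract: analog spins evolve under an Ising coupling, while auxiliary error variables grow whenever a spin's amplitude deviates from a target value, destabilizing states that would otherwise be trapped. The specific equations, coefficients, and the clipping used here are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
J = rng.standard_normal((N, N))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)                      # random symmetric Ising couplings

def ising_energy(s):
    return -0.5 * s @ J @ s

x = 0.01 * rng.standard_normal(N)             # analog spin amplitudes
e = np.ones(N)                                # error signals correcting amplitude inhomogeneity
dt, p, eps, beta, a = 0.01, 1.1, 0.5, 0.2, 1.0
best = np.inf
for _ in range(20000):
    x = x + dt * (x * (p - 1.0 - x**2) + eps * e * (J @ x))
    e = np.clip(e + dt * (-beta * e * (x**2 - a)), 0.0, 10.0)   # push |x_i|^2 toward a
    best = min(best, ising_energy(np.sign(x)))
print("best Ising energy found:", best)
```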

84 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: The authors design a single network with both multiple instance learning and bounding-box regression branches that share the same backbone, and add a guided attention module driven by the classification loss to the backbone to effectively extract the implicit location information in the features.
Abstract: It is challenging for weakly supervised object detection network to precisely predict the positions of the objects, since there are no instance-level category annotations. Most existing methods tend to solve this problem by using a two-phase learning procedure, i.e., multiple instance learning detector followed by a fully supervised learning detector with bounding-box regression. Based on our observation, this procedure may lead to local minima for some object categories. In this paper, we propose to jointly train the two phases in an end-to-end manner to tackle this problem. Specifically, we design a single network with both multiple instance learning and bounding-box regression branches that share the same backbone. Meanwhile, a guided attention module using classification loss is added to the backbone for effectively extracting the implicit location information in the features. Experimental results on public datasets show that our method achieves state-of-the-art performance.
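
The PyTorch-style sketch below mirrors the described architecture at a toy scale: a shared backbone, a guided-attention module, a two-stream multiple-instance-learning (MIL) branch producing image-level scores, and a bounding-box regression branch. Layer sizes, the attention form, and the WSDDN-style two-stream MIL head are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class WSODHead(nn.Module):
    """Toy sketch: shared features -> guided attention -> MIL branch + box-regression branch."""
    def __init__(self, feat_dim=512, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())  # stands in for CNN + RoI pooling
        self.attention = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())     # guided attention (trained with cls loss)
        self.mil_cls = nn.Linear(feat_dim, num_classes)   # classification stream of the MIL branch
        self.mil_det = nn.Linear(feat_dim, num_classes)   # detection (proposal-scoring) stream
        self.bbox_reg = nn.Linear(feat_dim, 4)            # bounding-box regression branch

    def forward(self, roi_feats):                         # roi_feats: (num_proposals, feat_dim)
        f = self.backbone(roi_feats)
        f = f * self.attention(f)                         # re-weight features by attention
        cls = torch.softmax(self.mil_cls(f), dim=1)       # per-proposal class scores
        det = torch.softmax(self.mil_det(f), dim=0)       # per-class proposal scores
        image_scores = (cls * det).sum(dim=0)             # image-level prediction for the MIL loss
        boxes = self.bbox_reg(f)                          # per-proposal box refinement
        return image_scores, cls * det, boxes

head = WSODHead()
scores, proposal_scores, boxes = head(torch.randn(100, 512))
print(scores.shape, proposal_scores.shape, boxes.shape)   # (20,), (100, 20), (100, 4)
```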

67 citations


Proceedings Article
01 Jan 2019
TL;DR: In this paper, a PAC-Bayesian framework is proposed to provide a bound on the original network learned, a network that is deterministic and uncompressed, which is a key novelty in our approach.
Abstract: The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters. So far this observation has been used to provide generalization guarantees only for neural networks whose parameters are either \textit{stochastic} or \textit{compressed}. In this work, we present a general PAC-Bayesian framework that leverages this observation to provide a bound on the original network learned -- a network that is deterministic and uncompressed. What enables us to do this is a key novelty in our approach: our framework allows us to show that if on training data, the interactions between the weight matrices satisfy certain conditions that imply a wide training loss minimum, these conditions themselves {\em generalize} to the interactions between the matrices on test data, thereby implying a wide test loss minimum. We then apply our general framework in a setup where we assume that the pre-activation values of the network are not too small (although we assume this only on the training data). In this setup, we provide a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.

61 citations


Proceedings Article
24 May 2019
TL;DR: Every sublevel set of the loss function of a class of deep over-parameterized neural nets with piecewise linear activation functions is connected and unbounded, implying that the loss has no bad local valleys and all of its global minima are connected within a unique and potentially very large global valley.
Abstract: This paper shows that every sublevel set of the loss function of a class of deep over-parameterized neural nets with piecewise linear activation functions is connected and unbounded. This implies that the loss has no bad local valleys and all of its global minima are connected within a unique and potentially very large global valley.

Proceedings Article
24 May 2019
TL;DR: In this article, the authors consider the Byzantine setting where some worker machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used.
Abstract: We study robust distributed learning that involves minimizing a non-convex loss function with saddle points. We consider the Byzantine setting where some worker machines have abnormal or even arbitrary and adversarial behavior. In this setting, the Byzantine machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used. We develop ByzantinePGD, a robust first-order algorithm that can provably escape saddle points and fake local minima, and converge to an approximate true local minimizer with low iteration complexity. As a by-product, we give a simpler algorithm and analysis for escaping saddle points in the usual non-Byzantine setting. We further discuss three robust gradient estimators that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in concrete statistical settings, and argue for their near-optimality in low and high dimensional regimes.
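
The sketch below shows the two simplest robust aggregators mentioned (coordinate-wise median and trimmed mean) on a toy set of worker gradients that includes adversarial values; the ByzantinePGD escape mechanism itself is not reproduced here.

```python
import numpy as np

def coordinate_median(grads):
    """Coordinate-wise median of worker gradients; grads has shape (num_workers, dim)."""
    return np.median(grads, axis=0)

def trimmed_mean(grads, trim_fraction=0.2):
    """Coordinate-wise trimmed mean: drop the largest and smallest values in each coordinate."""
    k = int(trim_fraction * grads.shape[0])
    sorted_grads = np.sort(grads, axis=0)
    return sorted_grads[k:grads.shape[0] - k].mean(axis=0)

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 5))   # 8 honest workers, true gradient ~ 1
byzantine = np.full((2, 5), -50.0)                     # 2 Byzantine workers send adversarial values
grads = np.vstack([honest, byzantine])

print("naive mean:       ", grads.mean(axis=0))
print("coordinate median:", coordinate_median(grads))
print("trimmed mean:     ", trimmed_mean(grads))
```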

Journal ArticleDOI
TL;DR: In this article, a convexification numerical method for a coefficient inverse scattering problem for the 3D Helmholtz equation is developed analytically and tested numerically; the method converges globally.
Abstract: A version of the so-called “convexification” numerical method for a coefficient inverse scattering problem for the 3D Helmholtz equation is developed analytically and tested numerically. Backscattering data are used, which result from a single direction of the propagation of the incident plane wave on an interval of frequencies. The method converges globally. The idea is to construct a weighted Tikhonov-like functional. The key element of this functional is the presence of the so-called Carleman Weight Function (CWF). This is the function which is involved in the Carleman estimate for the Laplace operator. This functional is strictly convex on any appropriate ball in a Hilbert space for an appropriate choice of the parameters of the CWF. Thus, both the absence of local minima and convergence of minimizers to the exact solution are guaranteed. Numerical tests demonstrate a good performance of the resulting algorithm. Unlike the previous globally convergent method based on so-called tail functions, we neither impose a smallness assumption on the interval of wavenumbers nor iterate with respect to the tail functions.
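
Schematically, a Carleman-weighted Tikhonov-like functional of the kind described above has the following shape; the notation is generic (chosen here purely for illustration) and not taken from the paper:

```latex
J_{\lambda,\alpha}(v) \;=\; \int_{\Omega} e^{2\lambda\psi(x)} \left| A(v)(x) \right|^{2} \, dx \;+\; \alpha \, \| v \|_{H}^{2}
```

where e^{2λψ} plays the role of the Carleman Weight Function, A(v) = 0 encodes the equation derived from the Helmholtz problem, and λ, α are chosen so that the functional is strictly convex on the relevant ball.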

Journal ArticleDOI
Honggui Han, Xiaolong Wu, Lu Zhang, Yu Tian, Junfei Qiao
TL;DR: An adaptive gradient multiobjective particle swarm optimization (AGMOPSO) algorithm is designed to optimize both the structure and parameters of RBF neural networks; it achieves much better generalization capability and a more compact network structure than some other existing methods.
Abstract: One of the major obstacles in using radial basis function (RBF) neural networks is the convergence toward local minima instead of the global minima. For this reason, an adaptive gradient multiobjective particle swarm optimization (AGMOPSO) algorithm is designed to optimize both the structure and parameters of RBF neural networks in this paper. First, the AGMOPSO algorithm, based on a multiobjective gradient method and a self-adaptive flight parameters mechanism, is developed to improve the computation performance. Second, the AGMOPSO-based self-organizing RBF neural network (AGMOPSO-SORBF) can optimize the parameters (centers, widths, and weights), as well as determine the network size. The goal of AGMOPSO-SORBF is to find a tradeoff between the accuracy and the complexity of RBF neural networks. Third, the convergence analysis of AGMOPSO-SORBF is detailed to ensure the prerequisite of any successful applications. Finally, the merits of our proposed approach are verified on multiple numerical examples. The results indicate that the proposed AGMOPSO-SORBF achieves much better generalization capability and compact network structure than some other existing methods.
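
For context, the sketch below is a minimal Gaussian RBF network forward pass; the centers, widths, and weights (and the number of hidden units) are exactly the quantities AGMOPSO-SORBF is described as optimizing. The optimizer itself is not reproduced, and all sizes here are illustrative.

```python
import numpy as np

def rbf_forward(X, centers, widths, weights):
    """Gaussian RBF network: y = sum_j w_j * exp(-||x - c_j||^2 / (2 * sigma_j^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n_samples, n_hidden)
    phi = np.exp(-d2 / (2.0 * widths[None, :] ** 2))
    return phi @ weights

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
centers = rng.uniform(-1, 1, size=(4, 2))   # centers, widths and weights are the parameters
widths = np.full(4, 0.5)                    # (together with the number of hidden units) that
weights = rng.standard_normal(4)            # AGMOPSO-SORBF would tune
print(rbf_forward(X, centers, widths, weights))
```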

Journal ArticleDOI
TL;DR: In this article, the authors prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the values of all local minima are no worse than the global minimum value of corresponding classical machine-learning models, and are guaranteed to further improve via residual representations.

Posted Content
TL;DR: A general PAC-Bayesian framework that provides a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.
Abstract: The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters. So far this observation has been used to provide generalization guarantees only for neural networks whose parameters are either \textit{stochastic} or \textit{compressed}. In this work, we present a general PAC-Bayesian framework that leverages this observation to provide a bound on the original network learned -- a network that is deterministic and uncompressed. What enables us to do this is a key novelty in our approach: our framework allows us to show that if on training data, the interactions between the weight matrices satisfy certain conditions that imply a wide training loss minimum, these conditions themselves {\em generalize} to the interactions between the matrices on test data, thereby implying a wide test loss minimum. We then apply our general framework in a setup where we assume that the pre-activation values of the network are not too small (although we assume this only on the training data). In this setup, we provide a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.

Proceedings Article
18 Jan 2019
TL;DR: In this article, the gradient noise in SGD is considered in a more general context and the generalized central limit theorem (GCLT) is invoked, which suggests that SGD converges to a heavy-tailed stable random variable.
Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a L\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.

Journal ArticleDOI
TL;DR: The authors theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase, and they empirically support their theoretical observation with a synthetic data set, as well as MNIST, CIFAR-10 and SVHN data sets.
Abstract: In this paper, we analyze the effects of depth and width on the quality of local minima, without strong overparameterization and simplification assumptions in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic data set, as well as MNIST, CIFAR-10, and SVHN data sets. When compared to previous studies with strong overparameterization assumptions, the results in this letter do not require overparameterization and instead show the gradual effects of overparameterization as consequences of general results.

Journal ArticleDOI
TL;DR: A weighted edge-based level set method based on multi-local statistical information is proposed to better segment noisy images; it provides more accurate segmentation results, demonstrating its effectiveness and robustness.

Journal ArticleDOI
TL;DR: In this paper, the authors propose two approaches to locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activations, which improve the performance of deep and physics-informed neural networks.
Abstract: We propose two approaches of locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of the activation function is achieved by introducing a scalable parameter in each layer (layer-wise) and for every neuron (neuron-wise) separately, and then optimizing it using a variant of the stochastic gradient descent algorithm. In order to further increase the training speed, a slope recovery term based on the activation slope is added to the loss function, which further accelerates convergence, thereby reducing the training cost. On the theoretical side, we prove that in the proposed method, the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate, and that the gradient dynamics of the proposed method is not achievable by base methods with any (adaptive) learning rates. We further show that the adaptive activation methods accelerate the convergence by implicitly multiplying conditioning matrices to the gradient of the base method without any explicit computation of the conditioning matrix and the matrix-vector product. The different adaptive activation functions are shown to induce different implicit conditioning matrices. Furthermore, the proposed methods with the slope recovery are shown to accelerate the training process.
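
The sketch below shows the layer-wise variant at a toy scale: each hidden layer gets a trainable slope a[k] inside the activation, and a slope-recovery-style penalty is added to the loss. The exact form of the penalty and all hyperparameters here are illustrative choices, not necessarily the paper's definitions.

```python
import numpy as np

def forward(X, Ws, bs, a, n=10.0):
    """Feed-forward net with layer-wise adaptive activations tanh(n * a[k] * z)."""
    h = X
    for k, (W, b) in enumerate(zip(Ws[:-1], bs[:-1])):
        h = np.tanh(n * a[k] * (h @ W + b))     # a[k] is a trainable slope for layer k
    return h @ Ws[-1] + bs[-1]

def loss(X, y, Ws, bs, a, slope_weight=0.01):
    mse = np.mean((forward(X, Ws, bs, a) - y) ** 2)
    # Slope-recovery-style term: keeps the trainable slopes from collapsing to zero
    # (this particular form is an illustrative choice, not the paper's exact definition).
    slope_recovery = 1.0 / np.mean(np.exp(a))
    return mse + slope_weight * slope_recovery

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 1)), rng.standard_normal((32, 1))
Ws = [0.5 * rng.standard_normal((1, 16)), 0.5 * rng.standard_normal((16, 1))]
bs = [np.zeros(16), np.zeros(1)]
a = np.full(1, 0.1)                 # one adaptive slope for the single hidden layer
print("loss:", loss(X, y, Ws, bs, a))
```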

Journal Article
TL;DR: In this article, the authors decompose the game Jacobian into two components: a symmetric component related to potential games, and an antisymmetric component related to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems.
Abstract: Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood -- and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in n-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs -- while at the same time being applicable to, and having guarantees in, much more general cases.
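
The sketch below applies the Symplectic Gradient Adjustment idea to the simplest bilinear two-player game, where the game Jacobian is purely antisymmetric: plain simultaneous gradient descent cycles (and slowly diverges), while the adjusted dynamics converge to the equilibrium. The learning rate and adjustment coefficient are illustrative.

```python
import numpy as np

# Two-player bilinear game: player 1 minimizes f1(x, y) = x*y, player 2 minimizes f2(x, y) = -x*y.
# The simultaneous gradient is xi = (df1/dx, df2/dy) = (y, -x); its Jacobian is antisymmetric,
# so plain simultaneous gradient descent cycles around the equilibrium at the origin.

def xi(z):
    x, y = z
    return np.array([y, -x])

def sga_step(z, lr=0.05, lam=1.0):
    J = np.array([[0.0, 1.0],          # Jacobian of xi (constant for this bilinear game)
                  [-1.0, 0.0]])
    A = 0.5 * (J - J.T)                # antisymmetric ("Hamiltonian") component
    adjusted = xi(z) + lam * A.T @ xi(z)   # Symplectic Gradient Adjustment
    return z - lr * adjusted

z_plain, z_sga = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z_plain = z_plain - 0.05 * xi(z_plain)
    z_sga = sga_step(z_sga)
print("plain simultaneous GD, distance to equilibrium:", np.linalg.norm(z_plain))
print("SGA, distance to equilibrium:                  ", np.linalg.norm(z_sga))
```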

Proceedings ArticleDOI
01 May 2019
TL;DR: LVIS is introduced, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy, and is applied to a fundamentally hard problem in feedback control: control through contact.
Abstract: Guided policy search is a popular approach for training controllers for high-dimensional systems, but it has a number of pitfalls. Non-convex trajectory optimization has local minima, and non-uniqueness in the optimal policy itself can mean that independently-optimized samples do not describe a coherent policy from which to train. We introduce LVIS, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy. To avoid the expense of solving the mixed-integer programs to full global optimality, we instead solve them only partially, extracting intervals containing the true cost-to-go from early termination of the branch-and-bound algorithm. These interval samples are used to weakly supervise the training of a neural net which approximates the true cost-to-go. Online, we use that learned cost-to-go as the terminal cost of a one-step model-predictive controller, which we solve via a small mixed-integer optimization. We demonstrate LVIS on piecewise affine models of a cart-pole system with walls and a planar humanoid robot, and show that it can be applied to a fundamentally hard problem in feedback control: control through contact.

Journal ArticleDOI
TL;DR: This large database of water cluster minima spanning quite dissimilar hydrogen bonding networks is expected to influence the development and assessment of the accuracy of interaction potentials for water as well as lower scaling electronic structure methods (such as different density functionals).
Abstract: We report a database consisting of the putative minima and ∼3.2 × 10^6 local minima lying within 5 kcal/mol from the putative minima for water clusters of sizes n = 3-25 using an improved version of the Monte Carlo temperature basin paving (MCTBP) global optimization procedure in conjunction with the ab initio based, flexible, polarizable Thole-Type Model (TTM2.1-F, version 2.1) interaction potential for water. Several of the low-lying structures, as well as low-lying penta-coordinated water networks obtained with the TTM2.1-F potential, were further refined at the Møller-Plesset second order perturbation (MP2)/aug-cc-pVTZ level of theory. In total, we have identified 3 138 303 networks corresponding to local minima of the clusters n = 3-25, whose Cartesian coordinates and relative energies can be obtained from the webpage https://sites.uw.edu/wdbase/. Networks containing penta-coordinated water molecules start to appear at n = 11 and, quite surprisingly, are energetically close (within 1-3 kcal/mol) to the putative minima, a fact that has been confirmed from the MP2 calculations. This large database of water cluster minima spanning quite dissimilar hydrogen bonding networks is expected to influence the development and assessment of the accuracy of interaction potentials for water as well as lower scaling electronic structure methods (such as different density functionals). Furthermore, it can also be used in conjunction with data science approaches (including but not limited to neural networks and machine and deep learning) to understand the properties of water, nature's most important substance.

Posted Content
TL;DR: In this paper, a novel gradient-based random sampling technique is proposed to visualise basins of attraction together with the associated stationary points, which can be used for fitness landscape analysis of neural networks.
Abstract: Quantification of the stationary points and the associated basins of attraction of neural network loss surfaces is an important step towards a better understanding of neural network loss surfaces at large. This work proposes a novel method to visualise basins of attraction together with the associated stationary points via gradient-based random sampling. The proposed technique is used to perform an empirical study of the loss surfaces generated by two different error metrics: quadratic loss and entropic loss. The empirical observations confirm the theoretical hypothesis regarding the nature of neural network attraction basins. Entropic loss is shown to exhibit stronger gradients and fewer stationary points than quadratic loss, indicating that entropic loss has a more searchable landscape. Quadratic loss is shown to be more resilient to overfitting than entropic loss. Both losses are shown to exhibit local minima, but the number of local minima is shown to decrease with an increase in dimensionality. Thus, the proposed visualisation technique successfully captures the local minima properties exhibited by the neural network loss surfaces, and can be used for the purpose of fitness landscape analysis of neural networks.
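
A minimal version of the sampling loop described above (random starting points followed by gradient descent, recording where each run converges) looks like the sketch below on a toy two-dimensional loss; the paper's visualisation and loss-surface metrics are not reproduced here.

```python
import numpy as np

def loss(w):
    """Toy separable non-convex loss with several local minima."""
    return np.sum(w**4 - 3.0 * w**2 + 0.5 * w)

def grad(w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
dim, n_samples, steps, lr = 2, 200, 300, 0.01
stationary_points = []
for _ in range(n_samples):
    w = rng.uniform(-2.0, 2.0, size=dim)       # random start inside the region of interest
    for _ in range(steps):
        w = w - lr * grad(w)                   # follow the gradient into a basin of attraction
    if np.linalg.norm(grad(w)) < 1e-2:         # record (approximate) stationary points
        stationary_points.append(np.round(w, 2))

unique = np.unique(np.array(stationary_points), axis=0)
print("distinct local minima found:", len(unique))
print(unique)
```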

Journal ArticleDOI
Feng Feng, Chao Zhang, Weicong Na, Jianan Zhang, Wei Zhang, Qi-Jun Zhang
TL;DR: A feature zero-adaptation approach to enlarge the surrogate range by overcoming the problem of varying orders of the transfer function w.r.t. the changes in design variables is proposed.
Abstract: Feature-based electromagnetic (EM) optimization techniques can help avoid local minima in microwave design. Zeros of the transfer functions are recently used to help extract the features when the features of filter responses are not explicitly identifiable. This letter proposes a feature zero-adaptation approach to enlarge the surrogate range by overcoming the problem of varying orders of the transfer function w.r.t. the changes in design variables. In this way, the proposed technique allows larger step sizes for optimization, thereby speeding up the overall EM optimization process. During each optimization iteration, parallel techniques are used to generate multiple EM geometrical samples simultaneously for creating the feature-based surrogate model. The proposed technique is demonstrated using two microwave filter examples.

Proceedings Article
01 Jan 2019
TL;DR: This paper theoretically proves that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions.
Abstract: In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.

Posted Content
TL;DR: It is proved that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact minimizer, which provides a theoretical explanation for the intriguing phenomenon observed by Izmailov et al. (2018).
Abstract: Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper configuration is able to find wide and flat local minima, which have been proposed to be associated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than being flat or sharp. Specifically, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side, and slowly along the opposite side--we formally define such minima as asymmetric valleys. Under mild assumptions, we prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact minimizer. Further, we show that simply averaging the weights along the SGD trajectory gives rise to such biased solutions implicitly. This provides a theoretical explanation for the intriguing phenomenon observed by Izmailov et al. (2018). In addition, we empirically find that batch normalization (BN) appears to be a major cause for asymmetric valleys.
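
The sketch below illustrates the trajectory-averaging ingredient (averaging the weights along the SGD trajectory, as in Izmailov et al.'s stochastic weight averaging) on a simple least-squares problem; in this convex toy the averaging mainly suppresses the noise of the final iterate, whereas the paper's argument concerns the bias toward the flat side of asymmetric valleys in deep networks. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, batch):
    """Gradient of a least-squares loss on a minibatch (X, y)."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

X = rng.standard_normal((512, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(512)

w = np.zeros(10)
lr, n_steps, avg_start = 0.05, 2000, 1000
w_avg, n_avg = np.zeros(10), 0
for t in range(n_steps):
    idx = rng.integers(0, len(y), size=32)
    w = w - lr * loss_grad(w, (X[idx], y[idx]))
    if t >= avg_start:                      # average the weights along the SGD trajectory
        n_avg += 1
        w_avg += (w - w_avg) / n_avg        # running mean of the iterates

def full_loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

print("loss at final SGD iterate:", full_loss(w))
print("loss at averaged iterate: ", full_loss(w_avg))
```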

Posted Content
TL;DR: In this article, the authors theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions.
Abstract: In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.

Journal ArticleDOI
TL;DR: In this article, the authors studied the structure of extreme level sets of a standard one-dimensional branching Brownian motion, namely the sets of particles whose height is within a fixed distance from the order of the global maximum.
Abstract: We study the structure of extreme level sets of a standard one-dimensional branching Brownian motion, namely the sets of particles whose height is within a fixed distance from the order of the global maximum. It is well known that such particles congregate at large times in clusters of order-one genealogical diameter around local maxima which form a Cox process in the limit. We add to these results by finding the asymptotic size of extreme level sets and the typical height of the local maxima whose clusters carry such level sets. We also find the right tail decay of the distribution of the distance between the two highest particles. These results confirm two conjectures of Brunet and Derrida (J. Stat. Phys. 143 (2011) 420–446). The proofs rely on a careful study of the cluster distribution.

Posted Content
TL;DR: The geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.
Abstract: The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through a `permutation point' where the input and output weight vectors of two neurons in the same hidden layer $k$ collide and interchange. We show that such permutation points are critical points with at least $n_{k+1}$ vanishing eigenvalues of the Hessian matrix of second derivatives indicating a local plateau of the loss function. We find that a permutation point for the exchange of neurons $i$ and $j$ transits into a flat valley (or generally, an extended plateau of $n_{k+1}$ flat dimensions) that enables all $n_k!$ permutations of neurons in a given layer $k$ at the same loss value. Moreover, we introduce high-order permutation points by exploiting the recursive structure in neural network functions, and find that the number of $K^{\text{th}}$-order permutation points is at least by a factor $\sum_{k=1}^{d-1}\frac{1}{2!^K}{n_k-K \choose K}$ larger than the (already huge) number of equivalent global minima. In two tasks, we illustrate numerically that some of the permutation points correspond to first-order saddles (`permutation saddles'): first, in a toy network with a single hidden layer on a function approximation task and, second, in a multilayer network on the MNIST task. Our geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.