
Showing papers on "Maxima and minima" published in 2019


Journal ArticleDOI
TL;DR: The numerical simulations of the proposed ChASO-FOPID and ASO-FOPID controllers for the dc motor speed control system demonstrated the superior performance of both the chaotic and the original ASO.
Abstract: In this paper, the atom search optimization (ASO) algorithm and a novel chaotic version of it [chaotic ASO (ChASO)] are proposed to determine the optimal parameters of the fractional-order proportional+integral+derivative (FOPID) controller for dc motor speed control. The ASO algorithm, which mathematically models and mimics the atomic motion model in nature, is simple and easy to implement, and is developed to address a diverse set of optimization problems. The proposed ChASO algorithm, on the other hand, is based on logistic map chaotic sequences, which enable the original algorithm to escape from local-minima stagnation and improve its convergence rate and resulting precision. First, the proposed ChASO algorithm is applied to six unimodal and multimodal benchmark optimization problems and the results are compared with other algorithms. Second, the proposed ChASO-FOPID, ASO-FOPID, and ASO-PID controllers are compared with GWO-FOPID, GWO-PID, IWO-PID, and SFS-PID controllers using the integral of time multiplied absolute error (ITAE) objective function for a fair comparison. Comparisons were also made for the integral of time multiplied squared error (ITSE) and Zwe-Lee Gaing's (ZLG) objective function as the most commonly used objective functions in the literature. Transient response analysis, frequency response (Bode) analysis, and robustness analysis were all carried out. The simulation results are promising and validate the effectiveness of the proposed approaches. The numerical simulations of the proposed ChASO-FOPID and ASO-FOPID controllers for the dc motor speed control system demonstrated the superior performance of both the chaotic and the original ASO.
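
As a rough illustration of the chaotic component described above, the sketch below generates a logistic-map sequence and uses it in place of uniform random draws to perturb a candidate parameter vector. This is a minimal sketch of the general idea only; the variable names, perturbation scale, and the way ChASO actually injects the chaotic sequence are assumptions, not taken from the paper.

```python
import numpy as np

def logistic_map_sequence(length, x0=0.7, r=4.0):
    """Chaotic sequence in (0, 1) from the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    seq = np.empty(length)
    x = x0
    for k in range(length):
        x = r * x * (1.0 - x)
        seq[k] = x
    return seq

# Use the chaotic sequence instead of uniform random numbers when perturbing a
# candidate solution (e.g., hypothetical FOPID gains Kp, Ki, Kd, lambda, mu).
chaos = logistic_map_sequence(5)
candidate = np.array([1.0, 0.5, 0.1, 0.9, 1.1])     # illustrative controller parameters
perturbed = candidate + 0.1 * (2.0 * chaos - 1.0)   # chaotic perturbation in [-0.1, 0.1]
print(perturbed)
```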

156 citations


Journal ArticleDOI
TL;DR: In this article, a probabilistic approach to quantify convergence to equilibrium for (kinetic) Langevin processes is introduced, based on a specific combination of reflection and synchronous coupling of two solutions of the Langevin equation.
Abstract: We introduce a new probabilistic approach to quantify convergence to equilibrium for (kinetic) Langevin processes. In contrast to previous analytic approaches that focus on the associated kinetic Fokker-Planck equation, our approach is based on a specific combination of reflection and synchronous coupling of two solutions of the Langevin equation. It yields contractions in a particular Wasserstein distance, and it provides rather precise bounds for convergence to equilibrium at the borderline between the overdamped and the underdamped regime. In particular, we are able to recover kinetic behavior in terms of explicit lower bounds for the contraction rate. For example, for a rescaled double-well potential with local minima at distance a, we obtain a lower bound for the contraction rate of order Ω(a^{-1}) provided the friction coefficient is of order Θ(a^{-1}).
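
For reference, the (kinetic) Langevin dynamics the abstract refers to can be written in the following standard form; the notation below is generic and not necessarily the paper's:

```latex
dX_t = V_t \, dt, \qquad
dV_t = -\gamma V_t \, dt - \nabla U(X_t) \, dt + \sqrt{2\gamma} \, dB_t
```

Here γ is the friction coefficient, U the potential, and B_t a Brownian motion; the abstract's claim is a contraction rate of order Ω(a^{-1}) for the rescaled double-well potential when γ = Θ(a^{-1}).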

152 citations


Proceedings Article
24 May 2019
TL;DR: This work studies a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics, and shows that the anisotropic noise in SGD helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well.
Abstract: Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in terms of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We systematically design various experiments to verify the benefits of the anisotropic noise, compared with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).
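
The toy experiment below contrasts isotropic noise with curvature-aligned (anisotropic) noise when escaping a quadratic basin with one sharp and one flat direction, in the spirit of the escaping-efficiency indicator described above. It is a hedged sketch under simplified assumptions (a fixed quadratic loss and hand-chosen noise covariances), not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])                     # one sharp and one flat curvature direction
eta, max_steps, trials, barrier = 0.01, 500, 200, 1.0

def mean_escape_time(noise_cov):
    """Noisy GD in the basin 0.5*theta^T H theta; steps until the loss exceeds the barrier."""
    chol = np.linalg.cholesky(noise_cov)
    times = []
    for _ in range(trials):
        theta = np.zeros(2)
        for t in range(1, max_steps + 1):
            theta = theta - eta * (H @ theta) + eta * (chol @ rng.standard_normal(2))
            if 0.5 * theta @ H @ theta > barrier:
                times.append(t)
                break
        else:
            times.append(max_steps)
    return float(np.mean(times))

isotropic = np.eye(2) * (np.trace(H) / 2.0)    # isotropic noise with matched total variance
anisotropic = H.copy()                         # curvature-aligned (SGD-like) noise
print("mean escape time, isotropic noise:  ", mean_escape_time(isotropic))
print("mean escape time, anisotropic noise:", mean_escape_time(anisotropic))
```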

143 citations


Journal ArticleDOI
TL;DR: It is argued that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved, and observed that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data.
Abstract: Deep learning has been immensely successful at a variety of tasks, ranging from classification to artificial intelligence. Learning corresponds to fitting training data, which is implemented by descending a very high-dimensional loss function. Understanding under which conditions neural networks do not get stuck in poor minima of the loss, and how the landscape of that loss evolves as depth is increased, remains a challenge. Here we predict, and test empirically, an analogy between this landscape and the energy landscape of repulsive ellipses. We argue that in fully connected deep networks a phase transition delimits the over- and underparametrized regimes where fitting can or cannot be achieved. In the vicinity of this transition, properties of the curvature of the minima of the loss (the spectrum of the Hessian) are critical. This transition shares direct similarities with the jamming transition by which particles form a disordered solid as the density is increased, which also occurs in certain classes of computational optimization and learning problems such as the perceptron. Our analysis gives a simple explanation as to why poor minima of the loss cannot be encountered in the overparametrized regime. Interestingly, we observe that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data. We also study a quantity Δ which characterizes how well (Δ < 0) or badly (Δ > 0) a datum is learned. At the critical point it is power-law distributed on several decades, P_{+}(Δ)∼Δ^{θ} for Δ>0 and P_{-}(Δ)∼(-Δ)^{-γ} for Δ<0, with exponents that depend on the choice of activation function. This observation suggests that near the transition the loss landscape has a hierarchical structure and that the learning dynamics is prone to avalanche-like dynamics, with abrupt changes in the set of patterns that are learned.

136 citations


Journal ArticleDOI
TL;DR: This paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.
Abstract: Neural networks with rectified linear unit (ReLU) activation functions (a.k.a. ReLU networks) have achieved great empirical success in various domains. Nonetheless, existing results for learning ReLU networks either pose assumptions on the underlying data distribution being, e.g., Gaussian, or require the network size and/or training size to be sufficiently large. In this context, the problem of learning a two-layer ReLU network is approached in a binary classification setting, where the data are linearly separable and a hinge loss criterion is adopted. Leveraging the power of random noise perturbation, this paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general. This result is the first of its kind, requiring no assumptions on the data distribution, training/network size, or initialization. Convergence of the resultant iterative algorithm to a global minimum is analyzed by establishing both an upper bound and a lower bound on the number of non-zero updates to be performed. Moreover, generalization guarantees are developed for ReLU networks trained with the novel SGD leveraging classic compression bounds. These guarantees highlight a key difference (at least in the worst case) between reliably learning a ReLU network and a leaky ReLU network in terms of sample complexity. Numerical tests using both synthetic data and real images validate the effectiveness of the algorithm and the practical merits of the theory.

132 citations


Posted Content
TL;DR: In this paper, the gradient noise in SGD is considered in a more general context and the generalized central limit theorem (GCLT) is invoked to analyze SGD as an SDE driven by a Levy motion.
Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Levy motion. Such SDEs can incur `jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
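
To get a feel for the α-stable assumption, the sketch below compares samples of Gaussian noise with samples from a heavy-tailed α-stable distribution using scipy. The crude extreme-value ratio printed here is only an illustration; it is not the tail-index estimator used in the paper.

```python
import numpy as np
from scipy.stats import levy_stable, norm

n = 50_000
gaussian = norm.rvs(size=n, random_state=0)
alpha, beta = 1.5, 0.0                      # alpha < 2 gives heavy tails; alpha = 2 is Gaussian
stable = levy_stable.rvs(alpha, beta, size=n, random_state=0)

for name, x in [("gaussian", gaussian), ("alpha-stable (alpha=1.5)", stable)]:
    bulk = np.std(x[np.abs(x) < np.quantile(np.abs(x), 0.9)])   # scale of the central 90%
    print(name, "max|x| / bulk scale:", np.max(np.abs(x)) / bulk)
```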

111 citations


Proceedings Article
27 May 2019
TL;DR: For two-layer ReLU networks (i.e., with one hidden layer), the authors show that natural gradient descent (NGD) converges to global minima under the assumptions that the inputs do not degenerate and the network is over-parameterized.
Abstract: Natural gradient descent has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, especially for \emph{non-linear} networks. In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. We identify two conditions which guarantee the global convergence: (1) the Jacobian matrix (of the network's output for all training cases w.r.t. the parameters) is full row rank and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e. with one hidden layer), we prove that these two conditions do hold throughout the training under the assumptions that the inputs do not degenerate and the network is over-parameterized. We further extend our analysis to more general loss functions with similar convergence properties. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
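
The numpy sketch below implements a damped Gauss-Newton step for a tiny ReLU network with squared error, i.e. an update of the form (J^T J + λI)^{-1} J^T r built from the Jacobian of the network outputs, which is the natural-gradient-style object the conditions above are stated in terms of. It is a generic illustration (numerical Jacobian, hand-picked damping), not the K-FAC approximation analyzed in the paper.

```python
import numpy as np

def forward(params, X):
    """Tiny two-layer ReLU net: params = (W1, w2)."""
    W1, w2 = params
    return np.maximum(X @ W1, 0.0) @ w2

def jacobian(params, X, eps=1e-6):
    """Numerical Jacobian of outputs w.r.t. flattened parameters (fine for a toy example)."""
    W1, w2 = params
    flat = np.concatenate([W1.ravel(), w2.ravel()])
    def f(v):
        return forward((v[:W1.size].reshape(W1.shape), v[W1.size:].reshape(w2.shape)), X)
    base = f(flat)
    J = np.zeros((base.size, flat.size))
    for j in range(flat.size):
        pert = flat.copy()
        pert[j] += eps
        J[:, j] = (f(pert) - base) / eps
    return J, flat, base

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
params = (0.5 * rng.standard_normal((3, 8)), 0.5 * rng.standard_normal(8))

for step in range(20):
    J, flat, out = jacobian(params, X)
    residual = out - y
    damping = 1e-3
    # Damped Gauss-Newton / natural-gradient-style step: (J^T J + damping*I)^{-1} J^T r
    flat = flat - np.linalg.solve(J.T @ J + damping * np.eye(flat.size), J.T @ residual)
    params = (flat[:params[0].size].reshape(params[0].shape),
              flat[params[0].size:].reshape(params[1].shape))
    if step % 5 == 0:
        print("loss:", 0.5 * np.mean((forward(params, X) - y) ** 2))
```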

86 citations


Journal ArticleDOI
TL;DR: It is shown that it is possible to destabilize trapping sets of analog states that correspond to local minima of the binary spin Hamiltonian by extending the phase space to include error signals that correct amplitude inhomogeneity of the analog spin states and controlling the divergence of their velocity.
Abstract: The relaxation of binary spins to analog values has been the subject of much debate in the field of statistical physics, neural networks, and more recently quantum computing, notably because the benefits of using an analog state for finding lower energy spin configurations are usually offset by the negative impact of the improper mapping of the energy function that results from the relaxation. We show that it is possible to destabilize trapping sets of analog states that correspond to local minima of the binary spin Hamiltonian by extending the phase space to include error signals that correct amplitude inhomogeneity of the analog spin states and controlling the divergence of their velocity. Performance of the proposed analog spin system in finding lower energy states is competitive against state-of-the-art heuristics.
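
The toy sketch below imitates the mechanism described in the abstract: analog spins evolve under an Ising coupling, while auxiliary error variables grow whenever a spin's amplitude deviates from a target value, destabilizing states that would otherwise be trapped. The specific equations, coefficients, and the clipping used here are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
J = rng.standard_normal((N, N))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)                      # random symmetric Ising couplings

def ising_energy(s):
    return -0.5 * s @ J @ s

x = 0.01 * rng.standard_normal(N)             # analog spin amplitudes
e = np.ones(N)                                # error signals correcting amplitude inhomogeneity
dt, p, eps, beta, a = 0.01, 1.1, 0.5, 0.2, 1.0
best = np.inf
for _ in range(20000):
    x = x + dt * (x * (p - 1.0 - x**2) + eps * e * (J @ x))
    e = np.clip(e + dt * (-beta * e * (x**2 - a)), 0.0, 10.0)   # push |x_i|^2 toward a
    best = min(best, ising_energy(np.sign(x)))
print("best Ising energy found:", best)
```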

84 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: The authors design a single network with both multiple instance learning and bounding-box regression branches that share the same backbone, and add a guided attention module driven by the classification loss to the backbone to effectively extract the implicit location information in the features.
Abstract: It is challenging for weakly supervised object detection network to precisely predict the positions of the objects, since there are no instance-level category annotations. Most existing methods tend to solve this problem by using a two-phase learning procedure, i.e., multiple instance learning detector followed by a fully supervised learning detector with bounding-box regression. Based on our observation, this procedure may lead to local minima for some object categories. In this paper, we propose to jointly train the two phases in an end-to-end manner to tackle this problem. Specifically, we design a single network with both multiple instance learning and bounding-box regression branches that share the same backbone. Meanwhile, a guided attention module using classification loss is added to the backbone for effectively extracting the implicit location information in the features. Experimental results on public datasets show that our method achieves state-of-the-art performance.
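
The PyTorch-style sketch below mirrors the described architecture at a toy scale: a shared backbone, a guided-attention module, a two-stream multiple-instance-learning (MIL) branch producing image-level scores, and a bounding-box regression branch. Layer sizes, the attention form, and the WSDDN-style two-stream MIL head are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class WSODHead(nn.Module):
    """Toy sketch: shared features -> guided attention -> MIL branch + box-regression branch."""
    def __init__(self, feat_dim=512, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())  # stands in for CNN + RoI pooling
        self.attention = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())     # guided attention (trained with cls loss)
        self.mil_cls = nn.Linear(feat_dim, num_classes)   # classification stream of the MIL branch
        self.mil_det = nn.Linear(feat_dim, num_classes)   # detection (proposal-scoring) stream
        self.bbox_reg = nn.Linear(feat_dim, 4)            # bounding-box regression branch

    def forward(self, roi_feats):                         # roi_feats: (num_proposals, feat_dim)
        f = self.backbone(roi_feats)
        f = f * self.attention(f)                         # re-weight features by attention
        cls = torch.softmax(self.mil_cls(f), dim=1)       # per-proposal class scores
        det = torch.softmax(self.mil_det(f), dim=0)       # per-class proposal scores
        image_scores = (cls * det).sum(dim=0)             # image-level prediction for the MIL loss
        boxes = self.bbox_reg(f)                          # per-proposal box refinement
        return image_scores, cls * det, boxes

head = WSODHead()
scores, proposal_scores, boxes = head(torch.randn(100, 512))
print(scores.shape, proposal_scores.shape, boxes.shape)   # (20,), (100, 20), (100, 4)
```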

67 citations


Proceedings Article
01 Jan 2019
TL;DR: In this paper, a PAC-Bayesian framework is proposed to provide a bound on the original network learned, a network that is deterministic and uncompressed, which is a key novelty in our approach.
Abstract: The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters. So far this observation has been used to provide generalization guarantees only for neural networks whose parameters are either \textit{stochastic} or \textit{compressed}. In this work, we present a general PAC-Bayesian framework that leverages this observation to provide a bound on the original network learned -- a network that is deterministic and uncompressed. What enables us to do this is a key novelty in our approach: our framework allows us to show that if on training data, the interactions between the weight matrices satisfy certain conditions that imply a wide training loss minimum, these conditions themselves {\em generalize} to the interactions between the matrices on test data, thereby implying a wide test loss minimum. We then apply our general framework in a setup where we assume that the pre-activation values of the network are not too small (although we assume this only on the training data). In this setup, we provide a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.

61 citations


Proceedings Article
24 May 2019
TL;DR: Every sublevel set of the loss function of a class of deep over-parameterized neural nets with piecewise linear activation functions is connected and unbounded, implying that the loss has no bad local valleys and all of its global minima are connected within a unique and potentially very large global valley.
Abstract: This paper shows that every sublevel set of the loss function of a class of deep over-parameterized neural nets with piecewise linear activation functions is connected and unbounded. This implies that the loss has no bad local valleys and all of its global minima are connected within a unique and potentially very large global valley.

Proceedings Article
24 May 2019
TL;DR: In this article, the authors consider the Byzantine setting where some worker machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used.
Abstract: We study robust distributed learning that involves minimizing a non-convex loss function with saddle points. We consider the Byzantine setting where some worker machines have abnormal or even arbitrary and adversarial behavior. In this setting, the Byzantine machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used. We develop ByzantinePGD, a robust first-order algorithm that can provably escape saddle points and fake local minima, and converge to an approximate true local minimizer with low iteration complexity. As a by-product, we give a simpler algorithm and analysis for escaping saddle points in the usual non-Byzantine setting. We further discuss three robust gradient estimators that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in concrete statistical settings, and argue for their near-optimality in low and high dimensional regimes.
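
The sketch below shows the two simplest robust aggregators mentioned (coordinate-wise median and trimmed mean) on a toy set of worker gradients that includes adversarial values; the ByzantinePGD escape mechanism itself is not reproduced here.

```python
import numpy as np

def coordinate_median(grads):
    """Coordinate-wise median of worker gradients; grads has shape (num_workers, dim)."""
    return np.median(grads, axis=0)

def trimmed_mean(grads, trim_fraction=0.2):
    """Coordinate-wise trimmed mean: drop the largest and smallest values in each coordinate."""
    k = int(trim_fraction * grads.shape[0])
    sorted_grads = np.sort(grads, axis=0)
    return sorted_grads[k:grads.shape[0] - k].mean(axis=0)

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 5))   # 8 honest workers, true gradient ~ 1
byzantine = np.full((2, 5), -50.0)                     # 2 Byzantine workers send adversarial values
grads = np.vstack([honest, byzantine])

print("naive mean:       ", grads.mean(axis=0))
print("coordinate median:", coordinate_median(grads))
print("trimmed mean:     ", trimmed_mean(grads))
```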

Journal ArticleDOI
TL;DR: In this article, a convexification numerical method for a coefficient inverse scattering problem for the 3D Helmholtz equation is developed analytically and tested numerically; the method converges globally.
Abstract: A version of the so-called “convexification” numerical method for a coefficient inverse scattering problem for the 3D Helmholtz equation is developed analytically and tested numerically. Backscattering data are used, which result from a single direction of the propagation of the incident plane wave on an interval of frequencies. The method converges globally. The idea is to construct a weighted Tikhonov-like functional. The key element of this functional is the presence of the so-called Carleman Weight Function (CWF). This is the function which is involved in the Carleman estimate for the Laplace operator. This functional is strictly convex on any appropriate ball in a Hilbert space for an appropriate choice of the parameters of the CWF. Thus, both the absence of local minima and convergence of minimizers to the exact solution are guaranteed. Numerical tests demonstrate a good performance of the resulting algorithm. Unlike the previous globally convergent method based on so-called tail functions, we neither impose a smallness assumption on the interval of wavenumbers nor iterate with respect to the tail functions.
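
Schematically, a Carleman-weighted Tikhonov-like functional of the kind described above has the following shape; the notation is generic (chosen here purely for illustration) and not taken from the paper:

```latex
J_{\lambda,\alpha}(v) \;=\; \int_{\Omega} e^{2\lambda\psi(x)} \left| A(v)(x) \right|^{2} \, dx \;+\; \alpha \, \| v \|_{H}^{2}
```

where e^{2λψ} plays the role of the Carleman Weight Function, A(v) = 0 encodes the equation derived from the Helmholtz problem, and λ, α are chosen so that the functional is strictly convex on the relevant ball.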

Journal ArticleDOI
Honggui Han, Xiaolong Wu, Lu Zhang, Yu Tian, Junfei Qiao
TL;DR: An adaptive gradient multiobjective particle swarm optimization (AGMOPSO) algorithm is designed to optimize both the structure and parameters of RBF neural networks; it achieves much better generalization capability and a more compact network structure than some other existing methods.
Abstract: One of the major obstacles in using radial basis function (RBF) neural networks is the convergence toward local minima instead of the global minima. For this reason, an adaptive gradient multiobjective particle swarm optimization (AGMOPSO) algorithm is designed to optimize both the structure and parameters of RBF neural networks in this paper. First, the AGMOPSO algorithm, based on a multiobjective gradient method and a self-adaptive flight parameters mechanism, is developed to improve the computation performance. Second, the AGMOPSO-based self-organizing RBF neural network (AGMOPSO-SORBF) can optimize the parameters (centers, widths, and weights), as well as determine the network size. The goal of AGMOPSO-SORBF is to find a tradeoff between the accuracy and the complexity of RBF neural networks. Third, the convergence analysis of AGMOPSO-SORBF is detailed to ensure the prerequisite of any successful applications. Finally, the merits of our proposed approach are verified on multiple numerical examples. The results indicate that the proposed AGMOPSO-SORBF achieves much better generalization capability and compact network structure than some other existing methods.
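
For context, the sketch below is a minimal Gaussian RBF network forward pass; the centers, widths, and weights (and the number of hidden units) are exactly the quantities AGMOPSO-SORBF is described as optimizing. The optimizer itself is not reproduced, and all sizes here are illustrative.

```python
import numpy as np

def rbf_forward(X, centers, widths, weights):
    """Gaussian RBF network: y = sum_j w_j * exp(-||x - c_j||^2 / (2 * sigma_j^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n_samples, n_hidden)
    phi = np.exp(-d2 / (2.0 * widths[None, :] ** 2))
    return phi @ weights

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
centers = rng.uniform(-1, 1, size=(4, 2))   # centers, widths and weights are the parameters
widths = np.full(4, 0.5)                    # (together with the number of hidden units) that
weights = rng.standard_normal(4)            # AGMOPSO-SORBF would tune
print(rbf_forward(X, centers, widths, weights))
```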

Journal ArticleDOI
TL;DR: In this article, the authors prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the values of all local minima are no worse than the global minimum value of corresponding classical machine-learning models, and are guaranteed to further improve via residual representations.

Posted Content
TL;DR: A general PAC-Bayesian framework that provides a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.
Abstract: The ability of overparameterized deep networks to generalize well has been linked to the fact that stochastic gradient descent (SGD) finds solutions that lie in flat, wide minima in the training loss -- minima where the output of the network is resilient to small random noise added to its parameters. So far this observation has been used to provide generalization guarantees only for neural networks whose parameters are either \textit{stochastic} or \textit{compressed}. In this work, we present a general PAC-Bayesian framework that leverages this observation to provide a bound on the original network learned -- a network that is deterministic and uncompressed. What enables us to do this is a key novelty in our approach: our framework allows us to show that if on training data, the interactions between the weight matrices satisfy certain conditions that imply a wide training loss minimum, these conditions themselves {\em generalize} to the interactions between the matrices on test data, thereby implying a wide test loss minimum. We then apply our general framework in a setup where we assume that the pre-activation values of the network are not too small (although we assume this only on the training data). In this setup, we provide a generalization guarantee for the original (deterministic, uncompressed) network, that does not scale with product of the spectral norms of the weight matrices -- a guarantee that would not have been possible with prior approaches.

Proceedings Article
18 Jan 2019
TL;DR: In this article, the gradient noise in SGD is considered in a more general context and the generalized central limit theorem (GCLT) is invoked, which suggests that SGD converges to a heavy-tailed stable random variable.
Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a L\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.

Journal ArticleDOI
TL;DR: The authors theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase, and they empirically support their theoretical observation with a synthetic data set, as well as MNIST, CIFAR-10 and SVHN data sets.
Abstract: In this paper, we analyze the effects of depth and width on the quality of local minima, without strong overparameterization and simplification assumptions in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic data set, as well as MNIST, CIFAR-10, and SVHN data sets. When compared to previous studies with strong overparameterization assumptions, the results in this letter do not require overparameterization and instead show the gradual effects of overparameterization as consequences of general results.

Journal ArticleDOI
TL;DR: A weighted edge-based level set method based on multi-local statistical information is proposed to better segment noisy images; it provides more accurate segmentation results, demonstrating its effectiveness and robustness.

Journal ArticleDOI
TL;DR: In this paper, the authors propose two approaches to locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activations, which improve the performance of deep and physics-informed neural networks.
Abstract: We propose two approaches of locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of the activation function is achieved by introducing a scalable parameter in each layer (layer-wise) and for every neuron (neuron-wise) separately, and then optimizing it using a variant of the stochastic gradient descent algorithm. In order to further increase the training speed, a slope recovery term based on the activation slope is added to the loss function, which further accelerates convergence, thereby reducing the training cost. On the theoretical side, we prove that in the proposed method, the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate, and that the gradient dynamics of the proposed method is not achievable by base methods with any (adaptive) learning rates. We further show that the adaptive activation methods accelerate the convergence by implicitly multiplying conditioning matrices to the gradient of the base method without any explicit computation of the conditioning matrix and the matrix-vector product. The different adaptive activation functions are shown to induce different implicit conditioning matrices. Furthermore, the proposed methods with the slope recovery are shown to accelerate the training process.
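
The sketch below shows the layer-wise variant at a toy scale: each hidden layer gets a trainable slope a[k] inside the activation, and a slope-recovery-style penalty is added to the loss. The exact form of the penalty and all hyperparameters here are illustrative choices, not necessarily the paper's definitions.

```python
import numpy as np

def forward(X, Ws, bs, a, n=10.0):
    """Feed-forward net with layer-wise adaptive activations tanh(n * a[k] * z)."""
    h = X
    for k, (W, b) in enumerate(zip(Ws[:-1], bs[:-1])):
        h = np.tanh(n * a[k] * (h @ W + b))     # a[k] is a trainable slope for layer k
    return h @ Ws[-1] + bs[-1]

def loss(X, y, Ws, bs, a, slope_weight=0.01):
    mse = np.mean((forward(X, Ws, bs, a) - y) ** 2)
    # Slope-recovery-style term: keeps the trainable slopes from collapsing to zero
    # (this particular form is an illustrative choice, not the paper's exact definition).
    slope_recovery = 1.0 / np.mean(np.exp(a))
    return mse + slope_weight * slope_recovery

rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 1)), rng.standard_normal((32, 1))
Ws = [0.5 * rng.standard_normal((1, 16)), 0.5 * rng.standard_normal((16, 1))]
bs = [np.zeros(16), np.zeros(1)]
a = np.full(1, 0.1)                 # one adaptive slope for the single hidden layer
print("loss:", loss(X, y, Ws, bs, a))
```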

Journal Article
TL;DR: In this article, the authors decompose the game Jacobian into two components: a symmetric component related to potential games, and an antisymmetric component related to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems.
Abstract: Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood -- and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in n-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs -- while at the same time being applicable to, and having guarantees in, much more general cases.
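
The sketch below applies the Symplectic Gradient Adjustment idea to the simplest bilinear two-player game, where the game Jacobian is purely antisymmetric: plain simultaneous gradient descent cycles (and slowly diverges), while the adjusted dynamics converge to the equilibrium. The learning rate and adjustment coefficient are illustrative.

```python
import numpy as np

# Two-player bilinear game: player 1 minimizes f1(x, y) = x*y, player 2 minimizes f2(x, y) = -x*y.
# The simultaneous gradient is xi = (df1/dx, df2/dy) = (y, -x); its Jacobian is antisymmetric,
# so plain simultaneous gradient descent cycles around the equilibrium at the origin.

def xi(z):
    x, y = z
    return np.array([y, -x])

def sga_step(z, lr=0.05, lam=1.0):
    J = np.array([[0.0, 1.0],          # Jacobian of xi (constant for this bilinear game)
                  [-1.0, 0.0]])
    A = 0.5 * (J - J.T)                # antisymmetric ("Hamiltonian") component
    adjusted = xi(z) + lam * A.T @ xi(z)   # Symplectic Gradient Adjustment
    return z - lr * adjusted

z_plain, z_sga = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z_plain = z_plain - 0.05 * xi(z_plain)
    z_sga = sga_step(z_sga)
print("plain simultaneous GD, distance to equilibrium:", np.linalg.norm(z_plain))
print("SGA, distance to equilibrium:                  ", np.linalg.norm(z_sga))
```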

Proceedings ArticleDOI
01 May 2019
TL;DR: LVIS is introduced, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy, and is applied to a fundamentally hard problem in feedback control: control through contact.
Abstract: Guided policy search is a popular approach for training controllers for high-dimensional systems, but it has a number of pitfalls. Non-convex trajectory optimization has local minima, and non-uniqueness in the optimal policy itself can mean that independently-optimized samples do not describe a coherent policy from which to train. We introduce LVIS, which circumvents the issue of local minima through global mixed-integer optimization and the issue of non-uniqueness through learning the optimal value function rather than the optimal policy. To avoid the expense of solving the mixed-integer programs to full global optimality, we instead solve them only partially, extracting intervals containing the true cost-to-go from early termination of the branch-and-bound algorithm. These interval samples are used to weakly supervise the training of a neural net which approximates the true cost-to-go. Online, we use that learned cost-to-go as the terminal cost of a one-step model-predictive controller, which we solve via a small mixed-integer optimization. We demonstrate LVIS on piecewise affine models of a cart-pole system with walls and a planar humanoid robot, and show that it can be applied to a fundamentally hard problem in feedback control: control through contact.

Journal ArticleDOI
TL;DR: This large database of water cluster minima spanning quite dissimilar hydrogen bonding networks is expected to influence the development and assessment of the accuracy of interaction potentials for water as well as lower scaling electronic structure methods (such as different density functionals).
Abstract: We report a database consisting of the putative minima and ∼3.2 × 10^6 local minima lying within 5 kcal/mol from the putative minima for water clusters of sizes n = 3-25 using an improved version of the Monte Carlo temperature basin paving (MCTBP) global optimization procedure in conjunction with the ab initio based, flexible, polarizable Thole-Type Model (TTM2.1-F, version 2.1) interaction potential for water. Several of the low-lying structures, as well as low-lying penta-coordinated water networks obtained with the TTM2.1-F potential, were further refined at the Møller-Plesset second order perturbation (MP2)/aug-cc-pVTZ level of theory. In total, we have identified 3 138 303 networks corresponding to local minima of the clusters n = 3-25, whose Cartesian coordinates and relative energies can be obtained from the webpage https://sites.uw.edu/wdbase/. Networks containing penta-coordinated water molecules start to appear at n = 11 and, quite surprisingly, are energetically close (within 1-3 kcal/mol) to the putative minima, a fact that has been confirmed from the MP2 calculations. This large database of water cluster minima spanning quite dissimilar hydrogen bonding networks is expected to influence the development and assessment of the accuracy of interaction potentials for water as well as lower scaling electronic structure methods (such as different density functionals). Furthermore, it can also be used in conjunction with data science approaches (including but not limited to neural networks and machine and deep learning) to understand the properties of water, nature's most important substance.

Posted Content
TL;DR: In this paper, a novel gradient-based random sampling technique is proposed to visualise basins of attraction together with the associated stationary points, which can be used for fitness landscape analysis of neural networks.
Abstract: Quantification of the stationary points and the associated basins of attraction of neural network loss surfaces is an important step towards a better understanding of neural network loss surfaces at large. This work proposes a novel method to visualise basins of attraction together with the associated stationary points via gradient-based random sampling. The proposed technique is used to perform an empirical study of the loss surfaces generated by two different error metrics: quadratic loss and entropic loss. The empirical observations confirm the theoretical hypothesis regarding the nature of neural network attraction basins. Entropic loss is shown to exhibit stronger gradients and fewer stationary points than quadratic loss, indicating that entropic loss has a more searchable landscape. Quadratic loss is shown to be more resilient to overfitting than entropic loss. Both losses are shown to exhibit local minima, but the number of local minima is shown to decrease with an increase in dimensionality. Thus, the proposed visualisation technique successfully captures the local minima properties exhibited by the neural network loss surfaces, and can be used for the purpose of fitness landscape analysis of neural networks.
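
A minimal version of the sampling loop described above (random starting points followed by gradient descent, recording where each run converges) looks like the sketch below on a toy two-dimensional loss; the paper's visualisation and loss-surface metrics are not reproduced here.

```python
import numpy as np

def loss(w):
    """Toy separable non-convex loss with several local minima."""
    return np.sum(w**4 - 3.0 * w**2 + 0.5 * w)

def grad(w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
dim, n_samples, steps, lr = 2, 200, 300, 0.01
stationary_points = []
for _ in range(n_samples):
    w = rng.uniform(-2.0, 2.0, size=dim)       # random start inside the region of interest
    for _ in range(steps):
        w = w - lr * grad(w)                   # follow the gradient into a basin of attraction
    if np.linalg.norm(grad(w)) < 1e-2:         # record (approximate) stationary points
        stationary_points.append(np.round(w, 2))

unique = np.unique(np.array(stationary_points), axis=0)
print("distinct local minima found:", len(unique))
print(unique)
```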

Journal ArticleDOI
Feng Feng, Chao Zhang, Weicong Na, Jianan Zhang, Wei Zhang, Qi-Jun Zhang
TL;DR: A feature zero-adaptation approach to enlarge the surrogate range by overcoming the problem of varying orders of the transfer function w.r.t. the changes in design variables is proposed.
Abstract: Feature-based electromagnetic (EM) optimization techniques can help avoid local minima in microwave design. Zeros of the transfer functions are recently used to help extract the features when the features of filter responses are not explicitly identifiable. This letter proposes a feature zero-adaptation approach to enlarge the surrogate range by overcoming the problem of varying orders of the transfer function w.r.t. the changes in design variables. In this way, the proposed technique allows larger step sizes for optimization, thereby speeding up the overall EM optimization process. During each optimization iteration, parallel techniques are used to generate multiple EM geometrical samples simultaneously for creating the feature-based surrogate model. The proposed technique is demonstrated using two microwave filter examples.

Proceedings Article
01 Jan 2019
TL;DR: This paper theoretically proves that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions.
Abstract: In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.

Posted Content
TL;DR: It is proved that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact minimizer, which provides a theoretical explanation for the intriguing phenomenon observed by Izmailov et al. (2018).
Abstract: Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with proper configuration is able to find wide and flat local minima, which have been proposed to be associated with good generalization performance. In this paper, we observe that local minima of modern deep networks are more than being flat or sharp. Specifically, at a local minimum there exist many asymmetric directions such that the loss increases abruptly along one side, and slowly along the opposite side--we formally define such minima as asymmetric valleys. Under mild assumptions, we prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact minimizer. Further, we show that simply averaging the weights along the SGD trajectory gives rise to such biased solutions implicitly. This provides a theoretical explanation for the intriguing phenomenon observed by Izmailov et al. (2018). In addition, we empirically find that batch normalization (BN) appears to be a major cause for asymmetric valleys.
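
The sketch below illustrates the trajectory-averaging ingredient (averaging the weights along the SGD trajectory, as in Izmailov et al.'s stochastic weight averaging) on a simple least-squares problem; in this convex toy the averaging mainly suppresses the noise of the final iterate, whereas the paper's argument concerns the bias toward the flat side of asymmetric valleys in deep networks. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, batch):
    """Gradient of a least-squares loss on a minibatch (X, y)."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

X = rng.standard_normal((512, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(512)

w = np.zeros(10)
lr, n_steps, avg_start = 0.05, 2000, 1000
w_avg, n_avg = np.zeros(10), 0
for t in range(n_steps):
    idx = rng.integers(0, len(y), size=32)
    w = w - lr * loss_grad(w, (X[idx], y[idx]))
    if t >= avg_start:                      # average the weights along the SGD trajectory
        n_avg += 1
        w_avg += (w - w_avg) / n_avg        # running mean of the iterates

def full_loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

print("loss at final SGD iterate:", full_loss(w))
print("loss at averaged iterate: ", full_loss(w_avg))
```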

Posted Content
TL;DR: In this article, the authors theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions.
Abstract: In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. At every local minimum of any deep neural network with these added neurons, the set of parameters of the original neural network (without added neurons) is guaranteed to be a global minimum of the original neural network. The effects of the added neurons are proven to automatically vanish at every local minimum. Moreover, we provide a novel theoretical characterization of a failure mode of eliminating suboptimal local minima via an additional theorem and several examples. This paper also introduces a novel proof technique based on the perturbable gradient basis (PGB) necessary condition of local minima, which provides new insight into the elimination of local minima and is applicable to analyze various models and transformations of objective functions beyond the elimination of local minima.

Journal ArticleDOI
TL;DR: In this article, the authors studied the structure of extreme level sets of a standard one-dimensional branching Brownian motion, namely the sets of particles whose height is within a fixed distance from the order of the global maximum.
Abstract: We study the structure of extreme level sets of a standard one-dimensional branching Brownian motion, namely the sets of particles whose height is within a fixed distance from the order of the global maximum. It is well known that such particles congregate at large times in clusters of order-one genealogical diameter around local maxima which form a Cox process in the limit. We add to these results by finding the asymptotic size of extreme level sets and the typical height of the local maxima whose clusters carry such level sets. We also find the right tail decay of the distribution of the distance between the two highest particles. These results confirm two conjectures of Brunet and Derrida (J. Stat. Phys. 143 (2011) 420–446). The proofs rely on a careful study of the cluster distribution.

Posted Content
TL;DR: The geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.
Abstract: The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through a `permutation point' where the input and output weight vectors of two neurons in the same hidden layer $k$ collide and interchange. We show that such permutation points are critical points with at least $n_{k+1}$ vanishing eigenvalues of the Hessian matrix of second derivatives indicating a local plateau of the loss function. We find that a permutation point for the exchange of neurons $i$ and $j$ transits into a flat valley (or generally, an extended plateau of $n_{k+1}$ flat dimensions) that enables all $n_k!$ permutations of neurons in a given layer $k$ at the same loss value. Moreover, we introduce high-order permutation points by exploiting the recursive structure in neural network functions, and find that the number of $K^{\text{th}}$-order permutation points is at least by a factor $\sum_{k=1}^{d-1}\frac{1}{2!^K}{n_k-K \choose K}$ larger than the (already huge) number of equivalent global minima. In two tasks, we illustrate numerically that some of the permutation points correspond to first-order saddles (`permutation saddles'): first, in a toy network with a single hidden layer on a function approximation task and, second, in a multilayer network on the MNIST task. Our geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.