
Showing papers on "Maxima and minima" published in 2020


Journal ArticleDOI
TL;DR: It is proved that in the proposed method, the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate, and that the gradient dynamics of the proposed method is not achievable by base methods with any (adaptive) learning rates.
Abstract: We propose two approaches of locally adaptive activation functions namely, layer-wise and neuron-wise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of activation function is achieved by introducing a scalable parameter in each layer (layer-wise) and for every neuron (neuron-wise) separately, and then optimizing it using a variant of stochastic gradient descent algorithm. In order to further increase the training speed, an activation slope-based slope recovery term is added in the loss function, which further accelerates convergence, thereby reducing the training cost. On the theoretical side, we prove that in the proposed method, the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate, and that the gradient dynamics of the proposed method is not achievable by base methods with any (adaptive) learning rates. We further show that the adaptive activation methods accelerate the convergence by implicitly multiplying conditioning matrices to the gradient of the base method without any explicit computation of the conditioning matrix and the matrix-vector product. The different adaptive activation functions are shown to induce different implicit conditioning matrices. Furthermore, the proposed methods with the slope recovery are shown to accelerate the training process.
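As a rough illustration of the layer-wise adaptive activation idea described above, the following PyTorch sketch attaches one trainable slope parameter per layer and adds a slope-recovery penalty to the loss; the scaling factor `n`, the initial value of `a`, and the exact form of the penalty are assumptions for illustration, not the authors' formulation.

```python
import torch
import torch.nn as nn

class AdaptiveTanhLayer(nn.Module):
    """Linear layer followed by tanh with a trainable, layer-wise slope."""
    def __init__(self, in_dim, out_dim, n=10.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.n = n                                  # fixed scaling factor (assumed)
        self.a = nn.Parameter(torch.tensor(0.1))    # trainable slope, one per layer

    def forward(self, x):
        return torch.tanh(self.n * self.a * self.linear(x))

def slope_recovery(layers):
    """Illustrative slope-recovery term: encourages larger average slopes."""
    s = torch.stack([torch.exp(layer.a) for layer in layers]).mean()
    return 1.0 / s

# usage: total loss = data/physics loss + slope-recovery term
layers = nn.ModuleList([AdaptiveTanhLayer(2, 20), AdaptiveTanhLayer(20, 1)])
x = torch.randn(8, 2)
out = x
for layer in layers:
    out = layer(out)
loss = out.pow(2).mean() + slope_recovery(layers)
loss.backward()
```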

159 citations


Journal ArticleDOI
TL;DR: This paper introduces Point2Mesh, a technique for reconstructing a surface mesh from an input point cloud that is robust to non-ideal conditions, and shows that shrink-wrapping a point cloud with a self-prior converges to a desirable solution.
Abstract: In this paper, we introduce Point2Mesh, a technique for reconstructing a surface mesh from an input point cloud. Instead of explicitly specifying a prior that encodes the expected shape properties, the prior is defined automatically using the input point cloud, which we refer to as a self-prior. The self-prior encapsulates reoccurring geometric repetitions from a single shape within the weights of a deep neural network. We optimize the network weights to deform an initial mesh to shrink-wrap a single input point cloud. This explicitly considers the entire reconstructed shape, since shared local kernels are calculated to fit the overall object. The convolutional kernels are optimized globally across the entire shape, which inherently encourages local-scale geometric self-similarity across the shape surface. We show that shrink-wrapping a point cloud with a self-prior converges to a desirable solution; compared to a prescribed smoothness prior, which often becomes trapped in undesirable local minima. While the performance of traditional reconstruction approaches degrades in non-ideal conditions that are often present in real world scanning, i.e., unoriented normals, noise and missing (low density) parts, Point2Mesh is robust to non-ideal conditions. We demonstrate the performance of Point2Mesh on a large variety of shapes with varying complexity.
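A heavily simplified sketch of the shrink-wrapping loop described above: a small MLP (standing in for the paper's convolutional self-prior network) predicts vertex offsets for an initial mesh, and its weights are optimized to minimize a Chamfer distance to the input point cloud. The network architecture, loss, and placeholder tensors are assumptions for illustration.

```python
import torch
import torch.nn as nn

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)                           # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# assumed inputs: initial mesh vertices (e.g., a coarse convex hull) and the scan
init_verts = torch.rand(500, 3)                     # placeholder initial mesh vertices
point_cloud = torch.rand(2000, 3)                   # placeholder input point cloud

# the "self-prior": a network whose weights are optimized for this single shape
net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    verts = init_verts + net(init_verts)            # deform the initial mesh
    loss = chamfer(verts, point_cloud)              # fit the deformed surface to the scan
    loss.backward()
    opt.step()
```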

137 citations


Journal ArticleDOI
TL;DR: A novel convolutional neural network architecture, termed the contrast source network, is proposed to learn the noise space components of the radiation operator; together with a multiresolution strategy, it helps produce high-resolution solutions without any significant increase in computational cost.
Abstract: In this paper, we introduce a deep-learning-based framework to solve electromagnetic inverse scattering problems. This framework builds on and extends the capabilities of existing physics-based inversion algorithms. These algorithms, such as the contrast source inversion, subspace-optimization method, and their variants face a problem of getting trapped in false local minima when recovering objects with high permittivity. We propose a novel convolutional neural network architecture, termed the contrast source network, that learns the noise space components of the radiation operator. Together with the signal space components directly estimated from the data, we iteratively refine the solution and show convergence to the correct solution in cases where traditional techniques fail without any significant increase in computational time. We also propose a novel multiresolution strategy that helps in producing high resolution solutions without any significant increase in computational costs. Through extensive numerical experiments, we demonstrate the ability to recover high permittivity objects that include homogeneous, heterogeneous, and lossy scatterers.
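The signal/noise-space split mentioned above can be illustrated with a plain SVD of a (random, placeholder) measurement operator; in the paper the noise-space component is predicted by the contrast source network, which is replaced here by a zero placeholder in this hedged sketch.

```python
import numpy as np

# assumed setup: A is a discretized radiation operator mapping contrast sources
# to scattered-field data; y is the measured data vector
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
y = rng.standard_normal(64)

U, s, Vt = np.linalg.svd(A, full_matrices=True)
k = 20                                   # assumed signal-space dimension (truncation)
V_sig, V_noise = Vt[:k].T, Vt[k:].T      # signal- and noise-space bases

# signal-space component estimated directly from the data (minimum-norm solution)
w_sig = V_sig @ ((U[:, :k].T @ y) / s[:k])

# noise-space component: in the paper this is predicted by the contrast source
# network; here a zero placeholder stands in for that learned correction
w_noise = V_noise @ np.zeros(V_noise.shape[1])

contrast_source = w_sig + w_noise
```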

109 citations


Journal ArticleDOI
17 Nov 2020
TL;DR: In this paper, the convergence of the natural gradient optimizer for the variational quantum eigensolver is demonstrated across multiple spin chain systems.
Abstract: This paper shows the convergence of the natural gradient optimizer for the variational quantum eigensolver across multiple spin chain systems.
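A generic sketch of a (quantum) natural-gradient update of the kind used by such optimizers: the gradient is preconditioned by a regularized metric tensor. The quadratic toy energy and the diagonal shift are illustrative assumptions, not the paper's VQE setup.

```python
import numpy as np

def natural_gradient_step(theta, grad, metric, lr=0.1, eps=1e-6):
    """One natural-gradient update: theta <- theta - lr * (F + eps*I)^-1 grad.

    `metric` is the (Fubini-Study / Fisher) metric tensor F evaluated at theta;
    the small diagonal shift eps keeps the linear solve well conditioned.
    """
    F = metric + eps * np.eye(len(theta))
    return theta - lr * np.linalg.solve(F, grad)

# toy usage with a quadratic "energy" E(theta) = 0.5 * theta^T H theta
H = np.diag([10.0, 0.1])
theta = np.array([1.0, 1.0])
for _ in range(50):
    grad = H @ theta
    metric = H            # placeholder metric; a real VQE evaluates the metric on-device
    theta = natural_gradient_step(theta, grad, metric)
```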

104 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider regularity issues for minima of non-autonomous functionals in the Calculus of Variations exhibiting non-uniform ellipticity features.
Abstract: We consider regularity issues for minima of non-autonomous functionals in the Calculus of Variations exhibiting non-uniform ellipticity features. We provide a few sharp regularity results for local minimizers that also cover the case of functionals with nearly linear growth. The analysis is carried out provided certain necessary approximation-in-energy conditions are satisfied. These are related to the occurrence of the so-called Lavrentiev phenomenon that non-autonomous functionals might exhibit, and which is a natural obstruction to regularity. In the case of vector valued problems, we concentrate on higher gradient integrability of minima. Instead, in the scalar case, we prove local Lipschitz estimates. We also present an approach via a variant of Moser’s iteration technique that allows to reduce the analysis of several non-uniformly elliptic problems to that for uniformly elliptic ones.
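As context, two standard model functionals with the non-uniform ellipticity features discussed above (nearly linear growth, and the non-autonomous double-phase energy associated with the Lavrentiev phenomenon); these are illustrative examples and not necessarily the exact class treated in the paper.

```latex
% Nearly linear growth:
\mathcal{F}_1(w) \;=\; \int_{\Omega} |Dw|\,\log\!\left(1+|Dw|\right)\,dx ,
\qquad
% Double-phase (non-autonomous) energy, with 0 \le a(\cdot) Hölder continuous and 1 < p < q:
\mathcal{F}_2(w) \;=\; \int_{\Omega} \left(|Dw|^{p} + a(x)\,|Dw|^{q}\right) dx .
```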

85 citations


Journal ArticleDOI
TL;DR: In this paper, a linear reduction diversity technique (LRD) and a local minima elimination method (MEM) are used to improve the best-so-far solution of the problem.

79 citations


Journal ArticleDOI
TL;DR: In this article, a global optimization with first-principle energy expressions of atomistic structure is proposed to identify initial stages of the edge oxidation and oxygen intercalation of graphene sheets on the Ir(111) surface.
Abstract: We propose a scheme for global optimization with first-principles energy expressions of atomistic structure. While unfolding its search, the method actively learns a surrogate model of the potential energy landscape on which it performs a number of local relaxations (exploitation) and further structural searches (exploration). Assuming Gaussian processes, deploying two separate kernel widths to better capture rough features of the energy landscape while retaining a good resolution of local minima, an acquisition function is used to decide on which of the resulting structures is the more promising and should be treated at the first-principles level. The method is demonstrated to outperform by 2 orders of magnitude a well established first-principles based evolutionary algorithm in finding surface reconstructions. Finally, global optimization with first-principles energy expressions is utilized to identify initial stages of the edge oxidation and oxygen intercalation of graphene sheets on the Ir(111) surface.
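A hedged sketch of the surrogate-plus-acquisition step described above, using scikit-learn: a Gaussian process with two RBF kernel widths is fitted to previously evaluated energies, and a lower-confidence-bound acquisition selects the next structure for a first-principles evaluation. The kernel form, length scales, and κ are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# surrogate with two kernel widths: a broad kernel for rough landscape features
# plus a narrow one to resolve individual local minima (length scales assumed)
kernel = (ConstantKernel(1.0) * RBF(length_scale=2.0)
          + ConstantKernel(1.0) * RBF(length_scale=0.2))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(15, 1))              # structures already evaluated at DFT level
E = np.sin(3 * X[:, 0]) + 0.1 * X[:, 0] ** 2      # stand-in for first-principles energies
gp.fit(X, E)

# acquisition: lower confidence bound, E_pred - kappa * sigma (kappa assumed)
Xc = np.linspace(-3, 3, 400).reshape(-1, 1)       # candidate structures
mu, sigma = gp.predict(Xc, return_std=True)
kappa = 2.0
best = Xc[np.argmin(mu - kappa * sigma)]          # next structure for a DFT evaluation
```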

75 citations


Journal ArticleDOI
TL;DR: A unified identification framework called constrained subspace method for structured state-space models (COSMOS) is presented, where the structure is defined by a user-specified linear or polynomial parametrization.
Abstract: In this article, a unified identification framework called constrained subspace method for structured state-space models (COSMOS) is presented, where the structure is defined by a user-specified linear or polynomial parametrization. The new approach operates directly on the input and output data, which differs from the traditional two-step method that first obtains a state-space realization and then estimates the system parameters. The new identification framework relies on a subspace-inspired linear regression problem which may not yield a consistent estimate in the presence of process noise. To alleviate this problem, structured and low-rank constraints are imposed on the linear regression formulation in terms of a finite set of system Markov parameters and the user-specified model parameters. The nonconvex nature of the constrained optimization problem is dealt with by transforming the problem into a difference-of-convex optimization problem, which is then handled by a sequential convex programming strategy. Numerical simulation examples show that the proposed identification method is less prone to converging to poor local minima than the classical prediction-error method initialized with random values, but at the cost of a heavier computational burden.

74 citations


Journal ArticleDOI
TL;DR: In this article, a robust design for federated learning that mitigates the effect of noise is proposed: the training problem is formulated as a parallel optimization for each node under an expectation-based model and a worst-case model, and a sampling-based successive convex approximation algorithm is used to develop a feasible training scheme that handles the unavailable maximum/minimum noise condition and the non-convexity of the objective function.

Abstract: Federated learning is a communication-efficient training process that alternates between local training at the edge devices and averaging of the updated local models at the central server. Nevertheless, perfect acquisition of the local models over wireless links is impractical due to noise, which also seriously degrades federated learning. To tackle this challenge, in this paper we propose a robust design for federated learning that mitigates the effect of noise. Considering the noise in the two aforementioned steps, we first formulate the training problem as a parallel optimization for each node under an expectation-based model and a worst-case model. Due to the non-convexity of the problem, a regularizer approximation method is proposed to make it tractable. Regarding the worst-case model, we utilize a sampling-based successive convex approximation algorithm to develop a feasible training scheme that handles the unavailable maximum/minimum noise condition and the non-convexity of the objective function. Furthermore, the convergence rates of both new designs are analyzed from a theoretical point of view. Finally, the improvement in prediction accuracy and the reduction in loss function value of the proposed designs are demonstrated via simulation.
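A toy sketch of the noisy federated-averaging setting considered above (local training, noisy model uploads, averaging at the server); the regularizer approximation and successive convex approximation steps of the proposed designs are omitted, and the linear-regression data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim, rounds, lr, noise_std = 5, 10, 20, 0.1, 0.05

# per-node synthetic linear-regression data
A = [rng.standard_normal((50, dim)) for _ in range(num_nodes)]
b = [a @ np.ones(dim) + 0.1 * rng.standard_normal(50) for a in A]

w_global = np.zeros(dim)
for _ in range(rounds):
    uploads = []
    for a, y in zip(A, b):
        w = w_global.copy()
        for _ in range(5):                           # local gradient steps
            w -= lr * a.T @ (a @ w - y) / len(y)
        # imperfect acquisition: the server receives a noisy version of the model
        uploads.append(w + noise_std * rng.standard_normal(dim))
    w_global = np.mean(uploads, axis=0)              # noisy averaging at the server
```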

74 citations


Proceedings ArticleDOI
TL;DR: Using the statistical properties from Monte-Carlo Markov chains of images, it is shown how this code can place statistical limits on image features such as unseen binary companions.
Abstract: We present a flexible code created for imaging from the bispectrum and visibility-squared. By using a simulated annealing method, we limit the probability of converging to local chi-squared minima as can occur when traditional imaging methods are used on data sets with limited phase information. We present the results of our code used on a simulated data set utilizing a number of regularization schemes including maximum entropy. Using the statistical properties from Monte-Carlo Markov chains of images, we show how this code can place statistical limits on image features such as unseen binary companions.
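A generic simulated-annealing skeleton of the kind described above, minimizing a chi-squared misfit over image pixels; the cooling schedule, proposal move, and forward model are placeholders rather than the actual code's implementation.

```python
import numpy as np

def chi2(image, data, forward):
    """Chi-squared misfit between observables predicted from `image` and `data`."""
    return np.sum((forward(image) - data) ** 2)

def simulated_annealing(data, forward, shape, steps=20000, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    img = rng.random(shape)
    cost = chi2(img, data, forward)
    for k in range(steps):
        T = T0 * (1.0 - k / steps) + 1e-6               # linear cooling schedule (assumed)
        trial = img.copy()
        idx = tuple(rng.integers(0, s) for s in shape)  # perturb one random pixel
        trial[idx] = max(trial[idx] + 0.1 * rng.standard_normal(), 0.0)
        c = chi2(trial, data, forward)
        # accept downhill moves always, uphill moves with probability exp(-dC/T);
        # the resulting Metropolis chain also provides MCMC samples of images
        if c < cost or rng.random() < np.exp((cost - c) / T):
            img, cost = trial, c
    return img

# toy usage: the "observables" are a simple linear projection of the image
rng = np.random.default_rng(1)
truth = rng.random((8, 8))
forward = lambda im: im.mean(axis=0)                    # placeholder forward model
recon = simulated_annealing(forward(truth), forward, truth.shape)
```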

65 citations


Journal ArticleDOI
TL;DR: This paper proposes a unified nonlocal Laplace operator, which converges to the classical Laplacian as one of the operator parameters, the nonlocal interaction radius δ, goes to zero, and to the fractional Laplacian as δ goes to infinity; the operator thus forms a super-set of the classical Laplacian and fractional Laplace operators and has the potential to fit a broad spectrum of data sets.

Journal ArticleDOI
TL;DR: A general numerical method is introduced that constructs the pathway map, which guides the understanding of how a physical system moves on the energy landscape.
Abstract: How do we search for the entire family tree of possible intermediate states, without unwanted random guesses, starting from a stationary state on the energy landscape all the way down to energy minima? Here we introduce a general numerical method that constructs the pathway map, which guides our understanding of how a physical system moves on the energy landscape. The method identifies the transition state between energy minima and the energy barrier associated with such a state. As an example, we solve the Landau--de Gennes energy incorporating the Dirichlet boundary conditions to model a liquid crystal confined in a square box; we illustrate the basic concepts by examining the multiple stationary solutions and the connected pathway maps of the model.

Posted Content
TL;DR: This work develops a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters, and is the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima.
Abstract: Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that the number of iterations required to escape from minima grows exponentially with the ratio of the batch size to the learning rate, so either a small learning rate or large-batch training requires exponentially many iterations to escape. Thus, large-batch training cannot search flat minima efficiently in a realistic computational time.
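A crude one-dimensional toy, not the paper's derivation, that illustrates the qualitative claim: with Hessian-dependent (SGD-like) gradient noise, a sharp quadratic well is escaped far faster than a flat one, whereas with curvature-independent injected noise the preference is much weaker. All parameters here are assumptions.

```python
import numpy as np

def mean_escape_steps(curvature, sgd_like, lr=0.02, base_noise=1.0,
                      barrier=0.05, trials=100, max_steps=100000):
    """Average number of noisy-gradient steps needed to leave a quadratic well
    0.5 * curvature * x^2, with escape declared at |x| = sqrt(2 * barrier / curvature).

    If sgd_like is True, the gradient-noise scale grows with the curvature
    (a crude stand-in for SGD's Hessian-dependent noise covariance); otherwise
    the noise is curvature-independent, as for GD with injected white noise.
    """
    rng = np.random.default_rng(0)
    noise = base_noise * (np.sqrt(curvature) if sgd_like else 1.0)
    x_escape = np.sqrt(2 * barrier / curvature)
    steps = []
    for _ in range(trials):
        x, t = 0.0, 0
        while abs(x) < x_escape and t < max_steps:
            x -= lr * (curvature * x + noise * rng.standard_normal())
            t += 1
        steps.append(t)
    return float(np.mean(steps))

for curvature in (1.0, 25.0):                 # flat vs sharp minimum
    print(curvature,
          mean_escape_steps(curvature, sgd_like=True),
          mean_escape_steps(curvature, sgd_like=False))
```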

Journal ArticleDOI
TL;DR: A novel machine-learning approach based on an evolutionary algorithm, namely Cuckoo search (CS), is proposed to solve the local minimum problem of ML in a radical way; it outperforms CS, ML, and other hybrid ML methods in terms of accuracy and considerably reduces computational costs compared to CS.

Journal ArticleDOI
TL;DR: For the random over-complete tensor decomposition problem, the authors show that for any small constant ε > 0, among the set of points with function values a (1+ε)-factor larger than the expectation of the function, all local maxima are approximate global maxima.
Abstract: Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that “all local optima are (approximately) global optima”, and thus they can be solved efficiently by local search algorithms. However, establishing such property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant $\varepsilon > 0$, among the set of points with function values a $(1+\varepsilon)$-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem. However, achieving such an initialization with a random guess would still require a super-polynomial number of attempts. Our main technique uses the Kac–Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac–Rice formula has been successfully applied to counting the number of local optima of a highly structured random polynomial with dependent coefficients.
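A minimal sketch of the non-convex objective in question: projected gradient ascent of T(x, x, x) = Σ_i ⟨a_i, x⟩³ over the unit sphere for a random over-complete set of components; the dimensions and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 150                                   # dimension and number of components (n > d: over-complete)
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # random unit components a_i

def f(x):
    """Tensor objective T(x, x, x) = sum_i <a_i, x>^3 for T = sum_i a_i^{(x)3}."""
    return np.sum((A @ x) ** 3)

def grad(x):
    return 3 * A.T @ ((A @ x) ** 2)

# projected gradient ascent on the unit sphere from a random start
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
for _ in range(500):
    x = x + 0.05 * grad(x)
    x /= np.linalg.norm(x)

# a good local maximum typically aligns with one of the hidden components
print("best alignment with a component:", np.max(np.abs(A @ x)))
```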

Proceedings ArticleDOI
01 Nov 2020
TL;DR: Analyzing the loss landscape, it is shown that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy, confirming that masking can be utilized as an efficient alternative to finetuning.
Abstract: We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred. Intrinsic evaluations show that representations computed by our binary masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.
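A hedged sketch of learning binary masks over frozen pretrained weights with a straight-through estimator; the score initialization, threshold, and wrapping of a single linear layer are assumptions rather than the paper's exact masking scheme for BERT-style models.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Frozen pretrained linear layer whose weights are selected by a learned binary mask."""
    def __init__(self, pretrained: nn.Linear, threshold: float = 0.0):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))  # real-valued mask logits
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        # straight-through estimator: forward uses the hard 0/1 mask,
        # backward passes gradients through as if the mask were the raw scores
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)

# usage: wrap a "pretrained" layer and train only the mask scores
layer = MaskedLinear(nn.Linear(16, 4))
opt = torch.optim.Adam([layer.scores], lr=1e-2)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
loss = nn.functional.cross_entropy(layer(x), y)
loss.backward()
opt.step()
```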

Journal ArticleDOI
TL;DR: In this article, the authors show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points, and that the minimizers of the cross-entropy loss overlap with the WFM of error loss.
Abstract: Learning in deep neural networks takes place by minimizing a nonconvex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points and such minimizers are often satisfactory at avoiding overfitting. How these 2 features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic nonconvex 1- and 2-layer neural network models that learn random patterns and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy-driven greedy and message-passing algorithms that focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian, and their generalization performance on real data.

Proceedings Article
12 Jul 2020
TL;DR: A modified notion of flatness, normalized flat minima, is introduced that does not suffer from the known scale-dependence issues and might provide a better hierarchy in the hypothesis class; the paper also highlights a similar scale dependence in existing matrix-norm-based generalization error bounds.
Abstract: The notion of flat minima has played a key role in generalization studies of deep learning models. However, existing definitions of flatness are known to be sensitive to the rescaling of parameters. This issue suggests that previous definitions of flatness might not be a good measure of generalization, because generalization is invariant to such rescalings. In this paper, from the PAC-Bayesian perspective, we scrutinize the discussion concerning flat minima and introduce the notion of normalized flat minima, which is free from the known scale-dependence issues. Additionally, we highlight that existing matrix-norm-based generalization error bounds exhibit a scale dependence similar to that of the existing flat minima definitions. Our modified notion of flatness does not suffer from this insufficiency either, suggesting it might provide a better hierarchy in the hypothesis class.
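A small numerical illustration of the scale-dependence issue discussed above: rescaling the two layers of a ReLU network leaves the function unchanged but changes a naive weight-space flatness proxy. The proxy used here (average loss increase under random weight perturbations) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((32, 8)), rng.standard_normal((1, 32))
X = rng.standard_normal((200, 8))
y = np.sin(X[:, 0])

def predict(W1, W2, X):
    return (np.maximum(X @ W1.T, 0.0) @ W2.T).ravel()   # 2-layer ReLU network

def loss(W1, W2):
    return np.mean((predict(W1, W2, X) - y) ** 2)

def sharpness(W1, W2, eps=1e-2, trials=50):
    """Naive flatness proxy: average loss increase under random weight perturbations."""
    base = loss(W1, W2)
    rises = [loss(W1 + eps * rng.standard_normal(W1.shape),
                  W2 + eps * rng.standard_normal(W2.shape)) - base
             for _ in range(trials)]
    return np.mean(rises)

alpha = 10.0                                            # rescaling that preserves the function
W1s, W2s = alpha * W1, W2 / alpha                       # ReLU is positively homogeneous
print(np.allclose(predict(W1, W2, X), predict(W1s, W2s, X)))  # True: same function
print(sharpness(W1, W2), sharpness(W1s, W2s))           # different "flatness" values
```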

Book ChapterDOI
23 Aug 2020
TL;DR: In this article, a non-linear quadratic program is proposed to identify regions in the parameter space that contain unique minima with guarantees that at least one of them will be the global minimum.
Abstract: An approach for estimating the pose of a camera given a set of 3D points and their corresponding 2D image projections is presented. It formulates the problem as a non-linear quadratic program and identifies regions in the parameter space that contain unique minima with guarantees that at least one of them will be the global minimum. Each regional minimum is computed with a sequential quadratic programming scheme. These premises result in an algorithm that always determines the global minima of the perspective-n-point problem for any number of input correspondences, regardless of possible coplanar arrangements of the imaged 3D points. For its implementation, the algorithm merely requires ordinary operations available in any standard off-the-shelf linear algebra library. Comparative evaluation demonstrates that the algorithm achieves state-of-the-art results at a consistently low computational cost.

Journal ArticleDOI
TL;DR: New light is shed on the smoothness of optimization problems arising in prediction error parameter estimation of linear and nonlinear systems and the use of multiple shooting as a viable solution is proposed.

Journal ArticleDOI
TL;DR: The proposed visualisation technique successfully captures the local minima properties exhibited by the neural network loss surfaces, and can be used for the purpose of fitness landscape analysis of neural networks.

Journal ArticleDOI
TL;DR: In this paper, an active learning method is proposed to automatically sample data points for constructing globally accurate reactive potential energy surfaces (PESs) using neural networks (NNs), which can alternatively minimize the negative of the squared difference surface (NSDS) given by two different NN models to actively locate the point where the PES is least confident.
Abstract: An efficient and trajectory-free active learning method is proposed to automatically sample data points for constructing globally accurate reactive potential energy surfaces (PESs) using neural networks (NNs). Although NNs do not provide the predictive variance as the Gaussian process regression does, we can alternatively minimize the negative of the squared difference surface (NSDS) given by two different NN models to actively locate the point where the PES is least confident. A batch of points in the minima of this NSDS can be iteratively added into the training set to improve the PES. The configuration space is gradually and globally covered without the need to run classical trajectory (or equivalently molecular dynamics) simulations. Through refitting the available analytical PESs of H3 and OH3 reactive systems, we demonstrate the efficiency and robustness of this new strategy, which enables fast convergence of the reactive PESs with respect to the number of points in terms of quantum scattering probabilities.
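A toy sketch of the NSDS-based active learning loop: two neural-network fits that differ only in initialization act as the committee, and the candidate geometry where their squared difference is largest (i.e., where the NSDS is minimal) is added to the training set. The 1-D surrogate "PES", scikit-learn models, and grid-based candidate pool are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
true_pes = lambda r: (1 - np.exp(-(r - 1.5))) ** 2        # toy 1-D Morse-like "PES"

X = rng.uniform(0.8, 4.0, size=(20, 1))                   # initial training geometries
y = true_pes(X[:, 0])

for it in range(5):
    # two NN fits differing only in initialization form the committee
    nets = [MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                         random_state=s).fit(X, y) for s in (0, 1)]
    cand = np.linspace(0.8, 4.0, 500).reshape(-1, 1)       # candidate pool of geometries
    nsds = -(nets[0].predict(cand) - nets[1].predict(cand)) ** 2
    pick = cand[np.argmin(nsds)]                           # least-confident geometry
    X = np.vstack([X, pick[None, :]])                      # add it to the training set
    y = np.append(y, true_pes(pick[0]))                    # with its (ab initio) energy
```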

Journal ArticleDOI
TL;DR: The experimental results show that ISECNN can increase generalization ability without decreasing prediction accuracy; compared with traditional DL and machine learning methods, the results validate the potential of ISECNN in the fault diagnosis field.

Journal ArticleDOI
TL;DR: A new human intelligence-based metaheuristic optimization technique, aggrandized class topper optimization (CTO), is proposed to solve ELD and CEED problems; the result analysis shows that the proposed algorithm provides better and more effective results in almost every test case.
Abstract: Optimization techniques are widely used to solve large and complex economic load dispatch (ELD) and combined emission economic dispatch (CEED) problems in power systems, and they can solve these problems in a short computational time. In this article, a new human intelligence-based metaheuristic optimization technique, aggrandized class topper optimization (CTO), is proposed to solve ELD and CEED problems. The proposed algorithm is an upgraded form of classical CTO in which the concept of remedial classes is incorporated to enhance the learning ability of the weak students of a class. To validate the exploration, exploitation, convergence, and local minima avoidance capabilities of the proposed algorithm, 29 benchmark functions are considered. Furthermore, seven test cases for the ELD problem and four test cases for the CEED problem are considered to test the effectiveness of the proposed algorithm on these complex problems. The result analysis shows that the proposed algorithm provides better and more effective results in almost every test case.

Journal ArticleDOI
TL;DR: This work analytically shows that the multilayered structure holds the key to optimizability: Fixing the number of parameters and increasing network depth, theNumber of stationary points in the loss function decreases, minima become more clustered in parameter space, and the trade-off between the depth and width of minima becomes less severe.
Abstract: Deep neural networks are workhorse models in machine learning with multiple layers of nonlinear functions composed in series. Their loss function is highly nonconvex, yet empirically even gradient descent minimization is sufficient to arrive at accurate and predictive models. It is hitherto unknown why deep neural networks are easily optimizable. We analyze the energy landscape of a spin glass model of deep neural networks using random matrix theory and algebraic geometry. We analytically show that the multilayered structure holds the key to optimizability: Fixing the number of parameters and increasing network depth, the number of stationary points in the loss function decreases, minima become more clustered in parameter space, and the trade-off between the depth and width of minima becomes less severe. Our analytical results are numerically verified through comparison with neural networks trained on a set of classical benchmark datasets. Our model uncovers generic design principles of machine learning models.

Journal ArticleDOI
TL;DR: It is not known how many modes a mixture of $k$ Gaussians in $d$ dimensions can have; this paper gives improved lower bounds and the first upper bound on the maximum number of modes, provided it is finite.
Abstract: Gaussian mixture models are widely used in Statistics. A fundamental aspect of these distributions is the study of the local maxima of the density, or modes. In particular, it is not known how many modes a mixture of $k$ Gaussians in $d$ dimensions can have. We give a brief account of this problem's history. Then, we give improved lower bounds and the first upper bound on the maximum number of modes, provided it is finite.

Proceedings Article
01 Jan 2020
TL;DR: It is shown that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.
Abstract: Several recent works have aimed to explain why severely overparameterized models, generalize well when trained by Stochastic Gradient Descent (SGD). The emergent consensus explanation has two parts: the first is that there are "no bad local minima", while the second is that SGD performs implicit regularization by having a bias towards low complexity models. We revisit both of these ideas in the context of image classification with common deep neural network architectures. Our first finding is that there exist bad global minima, i.e., models that fit the training set perfectly, yet have poor generalization. Our second finding is that given only unlabeled training data, we can easily construct initializations that will cause SGD to quickly converge to such bad global minima. For example, on CIFAR, CINIC10, and (Restricted) ImageNet, this can be achieved by starting SGD at a model derived by fitting random labels on the training data: while subsequent SGD training (with the correct labels) will reach zero training error, the resulting model will exhibit a test accuracy degradation of up to 40% compared to training from a random initialization. Finally, we show that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.
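A minimal sketch of the adversarial-initialization procedure described above (pretrain on random labels, then train on the true labels, and compare with training from scratch). The toy data and model are assumptions and need not reproduce the large accuracy gap reported on the image benchmarks.

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 2))

def train(model, X, y, epochs=200, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(model(X), y).backward()
        opt.step()
    return model

torch.manual_seed(0)
X = torch.randn(512, 20)
y_true = (X[:, 0] > 0).long()                      # toy ground-truth labels
y_rand = torch.randint(0, 2, (512,))               # random labels

# adversarial initialization: first fit random labels, then train on the true ones
adv = train(train(make_model(), X, y_rand), X, y_true)
ref = train(make_model(), X, y_true)               # baseline: train from a random init

X_test = torch.randn(512, 20)
y_test = (X_test[:, 0] > 0).long()
for name, m in (("adversarial init", adv), ("random init", ref)):
    acc = (m(X_test).argmax(1) == y_test).float().mean().item()
    print(name, acc)
```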

Proceedings Article
01 Jan 2020
TL;DR: This work proposes a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected by introducing an inexpensive heuristic referred to as neuron alignment.
Abstract: The loss landscapes of deep neural networks are not well understood due to their high nonconvexity. Empirically, the local minima of these loss functions can be connected by a learned curve in model space, along which the loss remains nearly constant; a feature known as mode connectivity. Yet, current curve finding algorithms do not consider the influence of symmetry in the loss surface created by model weight permutations. We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected. To approximate the optimal permutation, we introduce an inexpensive heuristic referred to as neuron alignment. Neuron alignment promotes similarity between the distribution of intermediate activations of models along the curve. We provide theoretical analysis establishing the benefit of alignment to mode connectivity based on this simple heuristic. We empirically verify that the permutation given by alignment is locally optimal via a proximal alternating minimization scheme. Empirically, optimizing the weight permutation is critical for efficiently learning a simple, planar, low-loss curve between networks that successfully generalizes. Our alignment method can significantly alleviate the recently identified robust loss barrier on the path connecting two adversarial robust models and find more robust and accurate models on the path.
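A hedged, simplified version of activation-based neuron alignment: the hidden units of two networks are matched by maximizing the correlation of their activations with the Hungarian algorithm. Random weights stand in for trained networks, and only one hidden layer is aligned.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# two "trained" one-hidden-layer ReLU networks (random placeholders) and shared inputs
d, h = 10, 16
W1a, W1b = rng.standard_normal((h, d)), rng.standard_normal((h, d))
X = rng.standard_normal((500, d))

acts_a = np.maximum(X @ W1a.T, 0.0)                 # hidden activations of network A
acts_b = np.maximum(X @ W1b.T, 0.0)                 # hidden activations of network B

# cross-correlation of activations between every pair of hidden neurons
a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
corr = a.T @ b / len(X)                             # (h, h) correlation matrix

# neuron alignment: permutation of B's neurons maximizing total correlation
row, col = linear_sum_assignment(-corr)             # Hungarian algorithm (maximize)
W1b_aligned = W1b[col]                              # permute B's first-layer rows
# (the second-layer columns of B would be permuted by the same `col`)
```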

Proceedings Article
12 Jul 2020
TL;DR: In this paper, the authors show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization.
Abstract: The optimization of multilayer neural networks typically leads to a solution with zero training error, yet the landscape can exhibit spurious local minima and the minima can be disconnected. In this paper, we shed light on this phenomenon: we show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization. More specifically, we prove that SGD solutions are connected via a piecewise linear path, and the increase in loss along this path vanishes as the number of neurons grows large. This result is a consequence of the fact that the parameters found by SGD are increasingly dropout stable as the network becomes wider. We show that, if we remove part of the neurons (and suitably rescale the remaining ones), the change in loss is independent of the total number of neurons, and it depends only on how many neurons are left. Our results exhibit a mild dependence on the input dimension: they are dimension-free for two-layer networks and depend linearly on the dimension for multilayer networks. We validate our theoretical findings with numerical experiments for different architectures and classification tasks.