
Showing papers on "Maxima and minima" published in 2016


Posted Content
TL;DR: In this paper, the authors investigate the cause of the generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minima of the training and testing functions.
Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
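The following is a minimal sketch of the kind of measurement the abstract describes: train with a small and a large batch and probe "sharpness" by how much the training loss rises under small random parameter perturbations. It uses a toy convex logistic-regression problem in NumPy, so it only illustrates the mechanics; the batch-size effect reported in the paper arises in deep networks, and all constants here are ad hoc choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def loss(w, Xb, yb):
    s = 2 * yb - 1                      # labels in {-1, +1}
    return np.mean(np.log1p(np.exp(-s * (Xb @ w))))

def grad(w, Xb, yb):
    s = 2 * yb - 1
    p = 1.0 / (1.0 + np.exp(s * (Xb @ w)))
    return -(Xb * (s * p)[:, None]).mean(axis=0)

def sgd(batch, steps=4000, lr=0.3):
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

def sharpness(w, eps=1e-2, trials=50):
    # largest training-loss increase over random perturbations of norm eps
    base = loss(w, X, y)
    worst = 0.0
    for _ in range(trials):
        u = rng.normal(size=d)
        u *= eps / np.linalg.norm(u)
        worst = max(worst, loss(w + u, X, y) - base)
    return worst

for batch in (32, 1024):
    w = sgd(batch)
    print(f"batch={batch}: train loss={loss(w, X, y):.4f}, sharpness={sharpness(w):.2e}")
```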

925 citations


Proceedings Article
15 Sep 2016
TL;DR: In this article, the authors investigate the cause of the generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minima of the training and testing functions.
Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

845 citations


Posted Content
TL;DR: In this article, the squared loss function of deep linear neural networks with any depth and any widths is shown to be non-convex and non-concave; every local minimum is a global minimum, every critical point that is not a global minimum is a saddle point, and there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for deep networks with more than three layers.
Abstract: In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.
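As a small numerical companion to statement 2) for deep linear networks, the hedged sketch below trains a three-layer linear network with full widths by plain gradient descent and compares the reached loss with the global optimum of the equivalent convex least-squares problem; the two printed values should essentially coincide. This is an illustration, not the paper's proof, and the sizes, initialization, and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out, n = 5, 5, 3, 200
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, d_in)) @ X + 0.1 * rng.normal(size=(d_out, n))

# global optimum of the equivalent (convex) linear least-squares problem
W_star = Y @ X.T @ np.linalg.inv(X @ X.T)
best = 0.5 / n * np.linalg.norm(W_star @ X - Y) ** 2

# 3-layer deep linear network, plain gradient descent on the (non-convex) squared loss
W1, W2 = np.eye(d_hid, d_in), np.eye(d_hid)
W3 = 0.1 * rng.normal(size=(d_out, d_hid))
lr = 0.05
for _ in range(20000):
    R = W3 @ W2 @ W1 @ X - Y                      # residual
    g3 = R @ (W2 @ W1 @ X).T / n
    g2 = W3.T @ R @ (W1 @ X).T / n
    g1 = (W3 @ W2).T @ R @ X.T / n
    W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3

deep = 0.5 / n * np.linalg.norm(W3 @ W2 @ W1 @ X - Y) ** 2
print(best, deep)   # the two loss values should essentially coincide
```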

609 citations


Proceedings Article
23 May 2016
TL;DR: This paper proves a conjecture published in 1989 and partially addresses an open problem announced at the Conference on Learning Theory (COLT) 2015, and it presents an instance for which it can answer the following question: how difficult is it to directly train a deep model in theory?
Abstract: In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. For an expected loss function of a deep nonlinear neural network, we prove the following statements under the independence assumption adopted from recent work: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) the property of saddle points differs for shallow networks (with three layers) and deeper networks (with more than three layers). Moreover, we prove that the same four statements hold for deep linear neural networks with any depth, any widths and no unrealistic assumptions. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima and the property of the saddle points). We note that even though we have advanced the theoretical foundations of deep learning, there is still a gap between theory and practice.

562 citations


Proceedings Article
24 May 2016
TL;DR: It is proved that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima --- all local minima must also be global.
Abstract: Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for matrix completion has no spurious local minima --- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve matrix completion with arbitrary initialization in polynomial time.
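A hedged sketch of the setting: gradient descent on the factorized objective f(U) = Σ over observed (i,j) of ((UU^T)_{ij} - M_{ij})^2 for a positive semidefinite ground truth, started from an arbitrary random U. The observation pattern, step size, and iteration count below are ad hoc, and the snippet only illustrates the claim empirically; it is not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 50, 3
U_true = rng.normal(size=(n, r))
M = U_true @ U_true.T                                  # PSD ground truth, rank r
upper = np.triu(rng.random((n, n)) < 0.5)
mask = upper | upper.T                                 # symmetric observation pattern

U = rng.normal(size=(n, r))                            # arbitrary initialization
lr = 0.2
for _ in range(10000):
    R = (U @ U.T - M) * mask                           # residual on observed entries
    U -= lr * 4 * (R @ U) / mask.sum()                 # gradient of the (averaged) objective

rel = np.linalg.norm((U @ U.T - M) * mask) / np.linalg.norm(M * mask)
print(rel)   # relative residual on observed entries; should be small
```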

281 citations


Posted Content
TL;DR: It is proved that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and that the results extend to the case of more than one hidden layer.
Abstract: We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.

267 citations


Journal ArticleDOI
TL;DR: The results suggest that the biophysical interpretation of dMRI model parameters crucially depends on establishing which of the minima is closer to the biophysical reality and the size of the uncertainty associated with each parameter.
Abstract: The ultimate promise of diffusion MRI (dMRI) models is specificity to neuronal microstructure, which may lead to distinct clinical biomarkers using noninvasive imaging. While multi-compartment models are a common approach to interpret water diffusion in the brain in vivo, the estimation of their parameters from the dMRI signal remains an unresolved problem. Practically, even when q space is highly oversampled, nonlinear fit outputs suffer from heavy bias and poor precision. So far, this has been alleviated by fixing some of the model parameters to a priori values, for improved precision at the expense of accuracy. Here we use a representative two-compartment model to show that fitting fails to determine the five model parameters from over 60 measurement points. For the first time, we identify the reasons for this poor performance. The first reason is the existence of two local minima in the parameter space for the objective function of the fitting procedure. These minima correspond to qualitatively different sets of parameters, yet they both lie within biophysically plausible ranges. We show that, at realistic signal-to-noise ratio values, choosing between the two minima based on the associated objective function values is essentially impossible. Second, there is an ensemble of very low objective function values around each of these minima in the form of a pipe. The existence of such a direction in parameter space, along which the objective function profile is very flat, explains the bias and large uncertainty in parameter estimation, and the spurious parameter correlations: in the presence of noise, the minimum can be randomly displaced by a very large amount along each pipe. Our results suggest that the biophysical interpretation of dMRI model parameters crucially depends on establishing which of the minima is closer to the biophysical reality and the size of the uncertainty associated with each parameter.
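The snippet below is a generic multi-start fitting sketch on a toy biexponential decay, not the paper's two-compartment dMRI model; it only illustrates how nonlinear least-squares fits started from different points can land on distinct parameter sets with nearly identical objective values. The model, noise level, and bounds are invented for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
b = np.linspace(0, 3, 30)
f_true, d1_true, d2_true = 0.6, 2.0, 0.4
signal = f_true * np.exp(-b * d1_true) + (1 - f_true) * np.exp(-b * d2_true)
signal += 0.01 * rng.normal(size=b.size)                 # modest noise level

def residuals(theta):
    f, d1, d2 = theta
    return f * np.exp(-b * d1) + (1 - f) * np.exp(-b * d2) - signal

fits = []
for _ in range(20):
    x0 = rng.uniform([0.0, 0.0, 0.0], [1.0, 3.0, 3.0])   # random starting point
    res = least_squares(residuals, x0, bounds=([0, 0, 0], [1, 5, 5]))
    fits.append((res.cost, tuple(np.round(res.x, 3))))

for cost, params in sorted(fits)[:6]:
    print(cost, params)   # distinct parameter sets with nearly identical costs
```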

265 citations


Posted Content
TL;DR: It is shown that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements, which yields a polynomial time global convergence guarantee for stochastic gradient descent.
Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

260 citations


Posted Content
TL;DR: In this article, it was shown that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima, and that all local minima must also be global.
Abstract: Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima --- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve positive semidefinite matrix completion with arbitrary initialization in polynomial time. The result can be generalized to the setting when the observed entries contain noise. We believe that our main proof strategy can be useful for understanding geometric properties of other statistical problems involving partial or noisy observations.

228 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of line spectral estimation in the continuum of a bounded interval with one snapshot of array measurement; they applied the MUSIC algorithm, which finds the null space (the noise space) of the adjoint of the Hankel matrix formed from the measurements, forms the noise-space correlation function, and identifies the s smallest local minima of the correlation as the frequency set.
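A hedged sketch of the MUSIC procedure summarized above: build a Hankel matrix from a single snapshot, take the noise space from its SVD, evaluate the noise-space correlation function on a frequency grid, and keep the s smallest local minima as frequency estimates. The signal model, Hankel shape, grid, and noise level are ad hoc choices for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 64
freqs_true = np.array([0.12, 0.31, 0.47])
t = np.arange(N)
x = sum(np.exp(2j * np.pi * f * t) for f in freqs_true)
x = x + 0.05 * (rng.normal(size=N) + 1j * rng.normal(size=N))

L = N // 2
H = np.array([x[i:i + N - L + 1] for i in range(L)])       # L x (N-L+1) Hankel matrix
U, _, _ = np.linalg.svd(H)
s = len(freqs_true)
noise_space = U[:, s:]                                     # orthogonal complement of the signal space

grid = np.linspace(0, 1, 2000, endpoint=False)
phi = np.exp(2j * np.pi * np.outer(np.arange(L), grid))    # steering vectors on the grid
R = np.linalg.norm(noise_space.conj().T @ phi, axis=0) / np.linalg.norm(phi, axis=0)

is_min = (R < np.roll(R, 1)) & (R < np.roll(R, -1))        # local minima of the correlation
cand = grid[is_min]
est = np.sort(cand[np.argsort(R[is_min])[:s]])             # the s smallest local minima
print(est)                                                 # compare with freqs_true
```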

184 citations


Proceedings Article
05 Dec 2016
TL;DR: In this paper, it was shown that all local minima are very close to a global optimum and a curvature bound at saddle points yields a polynomial time global convergence guarantee for stochastic gradient descent.
Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

Journal ArticleDOI
TL;DR: In this paper, the use of a distance based on the Kantorovich-Rubinstein norm is introduced in full waveform inversion to overcome the local minima of the conventional L2 misfit function, which correspond to velocity models matching the data up to one or several phase shifts.
Abstract: The use of optimal transport distance has recently yielded significant progress in image processing for pattern recognition, shape identification, and histogram matching. In this study, the use of this distance is investigated for a seismic tomography problem exploiting the complete waveform: the full waveform inversion. In its conventional formulation, this high-resolution seismic imaging method is based on the minimization of the L2 distance between predicted and observed data. Application of this method is generally hampered by the local minima of the associated L2 misfit function, which correspond to velocity models matching the data up to one or several phase shifts. Conversely, the optimal transport distance appears as a more suitable tool to compare the misfit between oscillatory signals, for its ability to detect shifted patterns. However, its application to the full waveform inversion is not straightforward, as the mass conservation between the compared data cannot be guaranteed, a crucial assumption for optimal transport. In this study, the use of a distance based on the Kantorovich–Rubinstein norm is introduced to overcome this difficulty. Its mathematical link with the optimal transport distance is made clear. An efficient numerical strategy for its computation, based on a proximal splitting technique, is introduced. We demonstrate that each iteration of the corresponding algorithm requires solving the Poisson equation, for which fast solvers can be used, relying either on the fast Fourier transform or on multigrid techniques. The development of this numerical method makes possible applications to industrial-scale data, involving tens of millions of discrete unknowns. The results we obtain on such large-scale synthetic data illustrate the potentialities of the optimal transport for seismic imaging. Starting from crude initial velocity models, optimal transport based inversion yields significantly better velocity reconstructions than those based on the L2 distance, in 2D and 3D contexts.
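A crude 1-D illustration of why a transport-based misfit helps with cycle skipping. This sketch uses the plain Wasserstein-1 distance between squared, normalized signals, not the paper's Kantorovich–Rubinstein formulation: for two identical oscillatory wavelets separated by a time shift, the L2 misfit oscillates with the shift, while the transport distance grows smoothly with it (here it is essentially |shift|). The wavelet, frequency, and grid are invented for the demonstration.

```python
import numpy as np

t = np.linspace(-4, 4, 1600)

def wavelet(t0):
    # oscillatory Gabor-like wavelet centred at t0
    return np.cos(2 * np.pi * 3 * (t - t0)) * np.exp(-(t - t0) ** 2 / (2 * 0.5 ** 2))

def to_density(s):
    d = s ** 2                      # squared signal as a positive "mass"
    return d / d.sum()

def w1(p, q):
    # 1-D Wasserstein-1 distance via cumulative distribution functions
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * (t[1] - t[0])

obs = wavelet(0.0)
for shift in np.linspace(-1.0, 1.0, 13):
    pred = wavelet(shift)
    l2 = np.sum((pred - obs) ** 2)
    ot = w1(to_density(pred), to_density(obs))
    print(f"shift={shift:+.2f}   L2={l2:8.2f}   W1={ot:.4f}")
```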

Proceedings Article
09 Nov 2016
TL;DR: In this article, the authors analyze one-hidden-layer neural networks with ReLU activation and show that, despite the non-convexity, networks with diverse units have no spurious local minima, and that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough.
Abstract: Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.

Journal ArticleDOI
TL;DR: This work introduces a mixed-integer formulation whose standard relaxation still has the same solutions as the underlying cardinality-constrained problem; the relation between the local minima is also discussed in detail.
Abstract: Optimization problems with cardinality constraints are very difficult mathematical programs which are typically solved by global techniques from discrete optimization. Here we introduce a mixed-integer formulation whose standard relaxation still has the same solutions (in the sense of global minima) as the underlying cardinality-constrained problem; the relation between the local minima is also discussed in detail. Since our reformulation is a minimization problem in continuous variables, it allows us to apply ideas from that field to cardinality-constrained problems. Here, in particular, we therefore also derive suitable stationarity conditions and suggest an appropriate regularization method for the solution of optimization problems with cardinality constraints. This regularization method is shown to be globally convergent to a Mordukhovich-stationary point. Extensive numerical results are given to illustrate the behavior of this method.
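The sketch below encodes a relaxed reformulation of the kind described in the abstract, with auxiliary variables y in [0,1]^n, the constraint sum_i y_i >= n - kappa, and the complementarity condition x_i * y_i = 0; the paper's exact formulation and notation may differ. The check verifies on random vectors that feasibility of this system matches the cardinality condition ||x||_0 <= kappa.

```python
import numpy as np

def cardinality_ok(x, kappa):
    return np.count_nonzero(x) <= kappa

def relaxed_feasible(x, kappa):
    # y_i = 1 where x_i == 0, else 0: the natural certificate for the relaxed system
    y = (x == 0).astype(float)
    return bool(y.sum() >= x.size - kappa and np.all(x * y == 0))

rng = np.random.default_rng(4)
n, kappa = 8, 3
for _ in range(1000):
    x = rng.normal(size=n) * (rng.random(n) < 0.4)      # random sparse-ish vector
    assert cardinality_ok(x, kappa) == relaxed_feasible(x, kappa)
print("relaxed feasibility matches the cardinality constraint on all samples")
```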

Posted Content
TL;DR: The effectiveness of this heuristic is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
Abstract: A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
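A hedged sketch of noisy normalized gradient descent on a toy two-dimensional function with a saddle at the origin; the step size and noise level are ad hoc, not the carefully chosen parameters the abstract refers to.

```python
import numpy as np

def grad(v):
    x, y = v
    return np.array([2 * x, -2 * y + y ** 3])   # gradient of f(x, y) = x^2 - y^2 + y^4/4

rng = np.random.default_rng(5)
v = np.array([1e-8, 0.0])                        # start essentially at the saddle point (0, 0)
eta, noise = 0.01, 1e-3
for _ in range(5000):
    g = grad(v)
    direction = g / (np.linalg.norm(g) + 1e-12)  # only the direction of the gradient is used
    v = v - eta * (direction + noise * rng.normal(size=2))
print(v)   # should end near one of the local minima at (0, +sqrt(2)) or (0, -sqrt(2))
```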

Proceedings Article
01 Jan 2016
TL;DR: It is established that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points.
Abstract: We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.
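A small informal experiment in the spirit of the abstract (not its construction or proof): fit a well-separated spherical Gaussian mixture repeatedly with randomly initialized EM and look at the spread of the achieved log-likelihoods. The number of components, data layout, and EM settings are arbitrary choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
M, per = 8, 200
centers = 10.0 * np.arange(M)                    # well-separated 1-D means
X = np.concatenate([c + rng.normal(size=per) for c in centers]).reshape(-1, 1)

scores = []
for seed in range(20):
    gm = GaussianMixture(n_components=M, covariance_type="spherical",
                         init_params="random", n_init=1, max_iter=500,
                         random_state=seed).fit(X)
    scores.append(gm.score(X))                   # mean log-likelihood per sample
print(np.round(np.sort(scores), 3))              # typically shows a clear spread across runs
```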

Proceedings ArticleDOI
01 Mar 2016
TL;DR: In this article, a branch-and-bound approach was proposed to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation.
Abstract: Gaussian mixture alignment is a family of approaches that are frequently used for robustly solving the point-set registration problem. However, since they use local optimisation, they are susceptible to local minima and can only guarantee local optimality. Consequently, their accuracy is strongly dependent on the quality of the initialisation. This paper presents the first globally-optimal solution to the 3D rigid Gaussian mixture alignment problem under the L2 distance between mixtures. The algorithm, named GOGMA, employs a branch-and-bound approach to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation. The geometry of SE(3) was used to find novel upper and lower bounds for the objective function and local optimisation was integrated into the scheme to accelerate convergence without voiding the optimality guarantee. The evaluation empirically supported the optimality proof and showed that the method performed much more robustly on two challenging datasets than an existing globally-optimal registration solution.

Posted Content
03 Nov 2016
TL;DR: A non-convex second-order optimization algorithm is presented that is guaranteed to return an approximate local minimum in time linear in the input representation and applies to a general class of optimization problems, including training a neural network and other non-convex objectives arising in machine learning.
Abstract: We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which is linear in the input representation. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

Posted Content
TL;DR: It is demonstrated that in this scenario one can construct counter-examples (datasets or initialization schemes) for which the network does become susceptible to bad local minima over the weight space.
Abstract: There has been a lot of recent interest in trying to characterize the error surface of deep models. This stems from a long standing question. Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima? It is widely believed that training of deep models using gradient methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minimum. It is known that such results hold under very strong assumptions which are not satisfied by real models. In this paper we present examples showing that for such theorem to be true additional assumptions on the data, initialization schemes and/or the model classes have to be made. We look at the particular case of finite size datasets. We demonstrate that in this scenario one can construct counter-examples (datasets or initialization schemes) when the network does become susceptible to bad local minima over the weight space.

Journal ArticleDOI
TL;DR: A density-based clustering method is proposed that is deterministic, computationally efficient, and self-consistent in its parameter choice, and it is embedded in a complete workflow to robustly generate Markov state models from molecular dynamics trajectories.
Abstract: A density-based clustering method is proposed that is deterministic, computationally efficient, and self-consistent in its parameter choice. By calculating a geometric coordinate space density for every point of a given data set, a local free energy is defined. On the basis of these free energy estimates, the frames are lumped into local free energy minima, ultimately forming microstates separated by local free energy barriers. The algorithm is embedded into a complete workflow to robustly generate Markov state models from molecular dynamics trajectories. It consists of (i) preprocessing of the data via principal component analysis in order to reduce the dimensionality of the problem, (ii) proposed density-based clustering to generate microstates, and (iii) dynamical clustering via the most probable path algorithm to construct metastable states. To characterize the resulting state-resolved conformational distribution, dihedral angle content color plots are introduced which identify structural differences ...
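A hedged sketch of the central idea (density, then free energy, then lumping into local minima) on 1-D toy data with a simple histogram density estimate; the paper's actual workflow (PCA preprocessing, its specific density estimator, most-probable-path dynamical clustering) is not reproduced, and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
# toy "trajectory" samples drawn from two metastable basins
x = np.concatenate([rng.normal(-1.5, 0.3, 5000), rng.normal(1.0, 0.4, 3000)])

counts, edges = np.histogram(x, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
free_energy = -np.log(counts + 1e-12)            # F = -kT ln(rho), with kT = 1

# lump each bin into a local free-energy minimum by walking downhill
assignment = np.empty(len(centers), dtype=int)
for i in range(len(centers)):
    j = i
    while True:
        nb = [k for k in (j - 1, j + 1) if 0 <= k < len(centers)]
        best = min(nb, key=lambda k: free_energy[k])
        if free_energy[best] >= free_energy[j]:
            break
        j = best
    assignment[i] = j                            # index of the minimum this bin drains into

print("local free-energy minima (bin centers):", np.round(centers[np.unique(assignment)], 2))
```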

Journal ArticleDOI
TL;DR: Numerical results show the ability of the algorithm to create new holes, identify stress concentrations, and provide stable optimization sequences converging to local minima defined by stress-saturated designs.

Journal ArticleDOI
TL;DR: It is shown that the proposed formulation is more suitable for beam contact than possible alternatives based on mortar-type contact discretizations or constraint enforcement by means of Lagrange multipliers, and that it is enhanced by a consistently linearized integration-interval segmentation that avoids numerical integration across strong discontinuities.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a misfit function based on adaptive matching filtering (AMF) to tackle cycle skipping and local minima in full-waveform inversion (FWI).
Abstract: We have proposed a misfit function based on adaptive matching filtering (AMF) to tackle challenges associated with cycle skipping and local minima in full-waveform inversion (FWI). This AMF is designed to measure time-varying phase differences between observations and predictions. Compared with classical least-squares waveform differences, our misfit function behaves as a smooth, quadratic function with a broad basin of attraction. These characters are important because local gradient-based optimization approaches used in most FWI schemes cannot guarantee convergence toward true models if misfit functions include local minima or if the starting model is far away from the global minimum. The 1D and 2D synthetic experiments illustrate the advantages of the proposed misfit function compared with the classical least-squares waveform misfit. Furthermore, we have derived adjoint sources associated with the proposed misfit function and applied them in several 2D time-domain acoustic FWI experiments. Nume...

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a multi-objective inversion scheme that uses the conventional least-squares functional along with an auxiliary data-domain objective, called the bump functional.
Abstract: Least-squares inversion of seismic arrivals can provide remarkably detailed models of the Earth's subsurface. However, cycle skipping associated with these oscillatory arrivals is the main cause for local minima in the least-squares objective function. Therefore, it is often difficult for descent methods to converge to the solution without an accurate initial large-scale velocity estimate. The low frequencies in the arrivals, needed to update the large-scale components in the velocity model, are usually unreliable or absent. To overcome this difficulty, we propose a multi-objective inversion scheme that uses the conventional least-squares functional along with an auxiliary data-domain objective. As the auxiliary objective effectively replaces the seismic arrivals by bumps, we call it the bump functional. The bump functional minimization can be made far less sensitive to cycle skipping and can deal with multiple arrivals in the data. However, it can only be used as an auxiliary objective since it usually does not provide a unique model after minimization even when the regularized-least-squares functional has a unique global minimum and hence a unique solution. The role of the bump functional during the multi-objective inversion is to guide the optimization towards the global minimum by pulling the trapped solution out of the local minima associated with the least-squares functional whenever necessary. The computational complexity of the bump functional is equivalent to that of the least-squares functional. In this paper, we describe various characteristics of the bump functional using simple and illustrative numerical examples. We also demonstrate the effectiveness of the proposed multi-objective inversion scheme by considering more realistic examples. These include synthetic and field data from a cross-well experiment, surface-seismic synthetic data with reflections and synthetic data with refracted arrivals at long offsets.

Posted Content
TL;DR: In this article, a branch-and-bound approach was proposed to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation.
Abstract: Gaussian mixture alignment is a family of approaches that are frequently used for robustly solving the point-set registration problem. However, since they use local optimisation, they are susceptible to local minima and can only guarantee local optimality. Consequently, their accuracy is strongly dependent on the quality of the initialisation. This paper presents the first globally-optimal solution to the 3D rigid Gaussian mixture alignment problem under the L2 distance between mixtures. The algorithm, named GOGMA, employs a branch-and-bound approach to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation. The geometry of SE(3) was used to find novel upper and lower bounds for the objective function and local optimisation was integrated into the scheme to accelerate convergence without voiding the optimality guarantee. The evaluation empirically supported the optimality proof and showed that the method performed much more robustly on two challenging datasets than an existing globally-optimal registration solution.

Posted Content
TL;DR: A general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures is proposed and the locations of stationary points and the null space of the associated Hessian matrices are characterized via the lens of invariant groups.
Abstract: We propose a general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures. In specific, we characterize the locations of stationary points and the null space of the associated Hessian matrices via the lens of invariant groups. As a major motivating example, we apply the proposed general theory to characterize the global geometry of the low-rank matrix factorization problem. In particular, we illustrate how the rotational symmetry group gives rise to infinitely many non-isolated strict saddle points and equivalent global minima of the objective function. By explicitly identifying all stationary points, we divide the entire parameter space into three regions: ($\mathcal{R}_1$) the region containing the neighborhoods of all strict saddle points, where the objective has negative curvatures; ($\mathcal{R}_2$) the region containing neighborhoods of all global minima, where the objective enjoys strong convexity along certain directions; and ($\mathcal{R}_3$) the complement of the above regions, where the gradient has sufficiently large magnitudes. We further extend our result to the matrix sensing problem. This allows us to establish strong global convergence guarantees for popular iterative algorithms with arbitrary initial solutions.

Journal ArticleDOI
TL;DR: This paper proposes a new global optimization algorithm that incorporates a new multimutation scheme into a differential evolution algorithm; the resulting method has a very good ability to explore the search space and converges very fast.
Abstract: Seismic inversion problems often involve nonlinear relationships between data and model and usually have many local minima. Linearized inversion methods have been widely used to solve such problems. However, these kinds of methods often strongly depend on the initial model and are easily trapped in a local minimum. Global optimization methods, on the other hand, do not require a very good initial model and can approach a global minimum. However, global optimization methods are exhaustive search techniques that can be very time consuming. When the model dimension or the search space becomes large, these methods can be very slow to converge. In this paper, we propose a new global optimization algorithm by incorporating a new multimutation scheme into a differential evolution algorithm. Because mutation operation with the new multimutation scheme can generate better mutant vectors, the new global optimization algorithm has a very good ability of exploring the search space and can converge very fast. We apply the proposed algorithm to both synthetic and field data to test its performance. The results have clearly indicated that the new global optimization algorithm provides faster convergence and yields better results compared with the conventional global optimization methods in seismic inversion.
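For orientation, the sketch below is a plain differential evolution loop (DE/rand/1/bin) on a standard multimodal test function; the paper's contribution, the multimutation scheme, is not reproduced here, and the control parameters F, CR, population size, and generation count are generic defaults rather than the paper's settings.

```python
import numpy as np

def rastrigin(x):
    return 10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(8)
dim, pop_size, F, CR = 5, 40, 0.7, 0.9
pop = rng.uniform(-5.12, 5.12, size=(pop_size, dim))
fit = np.array([rastrigin(p) for p in pop])

for _ in range(500):
    for i in range(pop_size):
        others = [j for j in range(pop_size) if j != i]
        a, b, c = pop[rng.choice(others, 3, replace=False)]
        mutant = a + F * (b - c)                       # single mutation (DE/rand/1)
        cross = rng.random(dim) < CR
        if not cross.any():
            cross[rng.integers(dim)] = True            # ensure at least one mutated component
        trial = np.where(cross, mutant, pop[i])
        f_trial = rastrigin(trial)
        if f_trial <= fit[i]:                          # greedy selection
            pop[i], fit[i] = trial, f_trial
print(fit.min())                                       # best objective value found
```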

Journal ArticleDOI
TL;DR: Under the assumptions of local weak sharp minima of order $p$ ($p \in [1,2]$) and a quasi-regularity condition, a local superlinear convergence rate is established for the linearized proximal algorithm (LPA).
Abstract: In the present paper, we investigate a linearized proximal algorithm (LPA) for solving a convex composite optimization problem. Each iteration of the LPA is a proximal minimization of the convex composite function with the inner function being linearized at the current iterate. The LPA has the attractive computational advantage that the solution of each subproblem is a singleton, which avoids the difficulty as in the Gauss–Newton method (GNM) of finding a solution with minimum norm among the set of minima of its subproblem, while still maintaining the same local convergence rate as that of the GNM. Under the assumptions of local weak sharp minima of order $p$ ($p \in [1,2]$) and a quasi-regularity condition, we establish a local superlinear convergence rate for the LPA. We also propose a globalization strategy for the LPA based on a backtracking line-search and an inexact version of the LPA. We further apply the LPA to solve a (possibly nonconvex) feasibility problem, as well as a sensor network localiza...

Posted Content
TL;DR: The proposed general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures describes how the rotational symmetry group gives rise to infinitely many nonisolated strict saddle points and equivalent global minima of the objective function.
Abstract: We propose a general theory for studying the landscape of nonconvex optimization with underlying symmetric structures for a class of machine learning problems (e.g., low-rank matrix factorization, phase retrieval, and deep linear neural networks). In specific, we characterize the locations of stationary points and the null space of Hessian matrices of the objective function via the lens of invariant groups. As a major motivating example, we apply the proposed general theory to characterize the global landscape of the nonconvex optimization in the low-rank matrix factorization problem. In particular, we illustrate how the rotational symmetry group gives rise to infinitely many nonisolated strict saddle points and equivalent global minima of the objective function. By explicitly identifying all stationary points, we divide the entire parameter space into three regions: ($\mathcal{R}_1$) the region containing the neighborhoods of all strict saddle points, where the objective has negative curvatures; ($\mathcal{R}_2$) the region containing neighborhoods of all global minima, where the objective enjoys strong convexity along certain directions; and ($\mathcal{R}_3$) the complement of the above regions, where the gradient has sufficiently large magnitudes. We further extend our result to the matrix sensing problem. Such global landscape implies strong global convergence guarantees for popular iterative algorithms with arbitrary initial solutions.

Journal ArticleDOI
TL;DR: In this paper, a piecewise deterministic Markov process is designed to sample the corresponding Gibbs measure; in dimension one, an Eyring-Kramers formula is obtained for the exit time of the domain of a local minimum at low temperature, and a necessary and sufficient condition is given on the cooling schedule in a simulated annealing algorithm to ensure the process converges to the set of global minima.
Abstract: Given an energy potential on the Euclidean space, a piecewise deterministic Markov process is designed to sample the corresponding Gibbs measure. In dimension one, an Eyring-Kramers formula is obtained for the exit time of the domain of a local minimum at low temperature, and a necessary and sufficient condition is given on the cooling schedule in a simulated annealing algorithm to ensure the process converges to the set of global minima. This condition is similar to the classical one for diffusions and involves the critical depth of the potential. In higher dimensions a non-optimal sufficient condition is obtained.
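A hedged sketch of a classical Metropolis simulated-annealing loop with a logarithmic cooling schedule T_k = c / log(k + 2) on a 1-D double well, illustrating the role of the constant c relative to the critical depth of the potential; the paper's sampler is a piecewise deterministic Markov process, which is not reproduced here, and the potential, proposal width, and run length are invented for illustration.

```python
import numpy as np

def U(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x       # double well; global minimum near x = -1

rng = np.random.default_rng(9)
x = 1.0                                      # start in the shallower (local) well
c = 1.0                                      # should exceed the critical depth of the potential
for k in range(200000):
    T = c / np.log(k + 2)                    # logarithmic cooling schedule
    prop = x + 0.2 * rng.normal()            # random-walk proposal
    if rng.random() < np.exp(-(U(prop) - U(x)) / T):
        x = prop                             # Metropolis acceptance
print(x)   # expected to settle near the global minimum around x = -1
```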