
Showing papers on "Maxima and minima" published in 2016


Posted Content
TL;DR: In this paper, the authors investigate the cause of the generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minima of the training and testing functions.
Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
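The following is a minimal sketch of the kind of measurement the abstract describes: train with a small and a large batch and probe "sharpness" by how much the training loss rises under small random parameter perturbations. It uses a toy convex logistic-regression problem in NumPy, so it only illustrates the mechanics; the batch-size effect reported in the paper arises in deep networks, and all constants here are ad hoc choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def loss(w, Xb, yb):
    s = 2 * yb - 1                      # labels in {-1, +1}
    return np.mean(np.log1p(np.exp(-s * (Xb @ w))))

def grad(w, Xb, yb):
    s = 2 * yb - 1
    p = 1.0 / (1.0 + np.exp(s * (Xb @ w)))
    return -(Xb * (s * p)[:, None]).mean(axis=0)

def sgd(batch, steps=4000, lr=0.3):
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

def sharpness(w, eps=1e-2, trials=50):
    # largest training-loss increase over random perturbations of norm eps
    base = loss(w, X, y)
    worst = 0.0
    for _ in range(trials):
        u = rng.normal(size=d)
        u *= eps / np.linalg.norm(u)
        worst = max(worst, loss(w + u, X, y) - base)
    return worst

for batch in (32, 1024):
    w = sgd(batch)
    print(f"batch={batch}: train loss={loss(w, X, y):.4f}, sharpness={sharpness(w):.2e}")
```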

925 citations


Proceedings Article
15 Sep 2016
TL;DR: In this article, the authors investigate the cause of the generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minima of the training and testing functions.
Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

845 citations


Posted Content
TL;DR: In this article, the squared loss function of deep linear neural networks with any depth and any widths is shown to be non-convex and non-concave; every local minimum is a global minimum, every critical point that is not a global minimum is a saddle point, and there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for deep networks with more than three layers.
Abstract: In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.
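As a small numerical companion to statement 2) for deep linear networks, the hedged sketch below trains a three-layer linear network with full widths by plain gradient descent and compares the reached loss with the global optimum of the equivalent convex least-squares problem; the two printed values should essentially coincide. This is an illustration, not the paper's proof, and the sizes, initialization, and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out, n = 5, 5, 3, 200
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, d_in)) @ X + 0.1 * rng.normal(size=(d_out, n))

# global optimum of the equivalent (convex) linear least-squares problem
W_star = Y @ X.T @ np.linalg.inv(X @ X.T)
best = 0.5 / n * np.linalg.norm(W_star @ X - Y) ** 2

# 3-layer deep linear network, plain gradient descent on the (non-convex) squared loss
W1, W2 = np.eye(d_hid, d_in), np.eye(d_hid)
W3 = 0.1 * rng.normal(size=(d_out, d_hid))
lr = 0.05
for _ in range(20000):
    R = W3 @ W2 @ W1 @ X - Y                      # residual
    g3 = R @ (W2 @ W1 @ X).T / n
    g2 = W3.T @ R @ (W1 @ X).T / n
    g1 = (W3 @ W2).T @ R @ X.T / n
    W1, W2, W3 = W1 - lr * g1, W2 - lr * g2, W3 - lr * g3

deep = 0.5 / n * np.linalg.norm(W3 @ W2 @ W1 @ X - Y) ** 2
print(best, deep)   # the two loss values should essentially coincide
```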

609 citations


Proceedings Article
23 May 2016
TL;DR: This paper proves a conjecture published in 1989 and partially addresses an open problem announced at the Conference on Learning Theory (COLT) 2015, and it presents an instance for which it can answer the following question: how difficult is it to directly train a deep model in theory?
Abstract: In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. For an expected loss function of a deep nonlinear neural network, we prove the following statements under the independence assumption adopted from recent work: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) the property of saddle points differs for shallow networks (with three layers) and deeper networks (with more than three layers). Moreover, we prove that the same four statements hold for deep linear neural networks with any depth, any widths and no unrealistic assumptions. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima and the property of the saddle points). We note that even though we have advanced the theoretical foundations of deep learning, there is still a gap between theory and practice.

562 citations


Proceedings Article
24 May 2016
TL;DR: It is proved that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima --- all local minima must also be global.
Abstract: Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for matrix completion has no spurious local minima --- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve matrix completion with arbitrary initialization in polynomial time.
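A hedged sketch of the setting: gradient descent on the factorized objective f(U) = Σ over observed (i,j) of ((UU^T)_{ij} - M_{ij})^2 for a positive semidefinite ground truth, started from an arbitrary random U. The observation pattern, step size, and iteration count below are ad hoc, and the snippet only illustrates the claim empirically; it is not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 50, 3
U_true = rng.normal(size=(n, r))
M = U_true @ U_true.T                                  # PSD ground truth, rank r
upper = np.triu(rng.random((n, n)) < 0.5)
mask = upper | upper.T                                 # symmetric observation pattern

U = rng.normal(size=(n, r))                            # arbitrary initialization
lr = 0.2
for _ in range(10000):
    R = (U @ U.T - M) * mask                           # residual on observed entries
    U -= lr * 4 * (R @ U) / mask.sum()                 # gradient of the (averaged) objective

rel = np.linalg.norm((U @ U.T - M) * mask) / np.linalg.norm(M * mask)
print(rel)   # relative residual on observed entries; should be small
```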

281 citations


Posted Content
TL;DR: It is proved that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and that the results extend to the case of more than one hidden layer.
Abstract: We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.

267 citations


Journal ArticleDOI
TL;DR: The results suggest that the biophysical interpretation of dMRI model parameters crucially depends on establishing which of the minima is closer to the biophysical reality and the size of the uncertainty associated with each parameter.
Abstract: The ultimate promise of diffusion MRI (dMRI) models is specificity to neuronal microstructure, which may lead to distinct clinical biomarkers using noninvasive imaging. While multi-compartment models are a common approach to interpret water diffusion in the brain in vivo, the estimation of their parameters from the dMRI signal remains an unresolved problem. Practically, even when q space is highly oversampled, nonlinear fit outputs suffer from heavy bias and poor precision. So far, this has been alleviated by fixing some of the model parameters to a priori values, for improved precision at the expense of accuracy. Here we use a representative two-compartment model to show that fitting fails to determine the five model parameters from over 60 measurement points. For the first time, we identify the reasons for this poor performance. The first reason is the existence of two local minima in the parameter space for the objective function of the fitting procedure. These minima correspond to qualitatively different sets of parameters, yet they both lie within biophysically plausible ranges. We show that, at realistic signal-to-noise ratio values, choosing between the two minima based on the associated objective function values is essentially impossible. Second, there is an ensemble of very low objective function values around each of these minima in the form of a pipe. The existence of such a direction in parameter space, along which the objective function profile is very flat, explains the bias and large uncertainty in parameter estimation, and the spurious parameter correlations: in the presence of noise, the minimum can be randomly displaced by a very large amount along each pipe. Our results suggest that the biophysical interpretation of dMRI model parameters crucially depends on establishing which of the minima is closer to the biophysical reality and the size of the uncertainty associated with each parameter.
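The snippet below is a generic multi-start fitting sketch on a toy biexponential decay, not the paper's two-compartment dMRI model; it only illustrates how nonlinear least-squares fits started from different points can land on distinct parameter sets with nearly identical objective values. The model, noise level, and bounds are invented for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
b = np.linspace(0, 3, 30)
f_true, d1_true, d2_true = 0.6, 2.0, 0.4
signal = f_true * np.exp(-b * d1_true) + (1 - f_true) * np.exp(-b * d2_true)
signal += 0.01 * rng.normal(size=b.size)                 # modest noise level

def residuals(theta):
    f, d1, d2 = theta
    return f * np.exp(-b * d1) + (1 - f) * np.exp(-b * d2) - signal

fits = []
for _ in range(20):
    x0 = rng.uniform([0.0, 0.0, 0.0], [1.0, 3.0, 3.0])   # random starting point
    res = least_squares(residuals, x0, bounds=([0, 0, 0], [1, 5, 5]))
    fits.append((res.cost, tuple(np.round(res.x, 3))))

for cost, params in sorted(fits)[:6]:
    print(cost, params)   # distinct parameter sets with nearly identical costs
```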

265 citations


Posted Content
TL;DR: It is shown that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements, which yields a polynomial time global convergence guarantee for stochastic gradient descent.
Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

260 citations


Posted Content
TL;DR: In this article, it was shown that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima, and that all local minima must also be global.
Abstract: Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima --- all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve positive semidefinite matrix completion with arbitrary initialization in polynomial time. The result can be generalized to the setting when the observed entries contain noise. We believe that our main proof strategy can be useful for understanding geometric properties of other statistical problems involving partial or noisy observations.

228 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of line spectral estimation in the continuum of a bounded interval with one snapshot of array measurement; they applied the MUSIC algorithm, which finds the null space (the noise space) of the adjoint of the Hankel matrix formed from the measurements, forms the noise-space correlation function, and identifies the s smallest local minima of the correlation as the frequency set.
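A hedged sketch of the MUSIC procedure summarized above: build a Hankel matrix from a single snapshot, take the noise space from its SVD, evaluate the noise-space correlation function on a frequency grid, and keep the s smallest local minima as frequency estimates. The signal model, Hankel shape, grid, and noise level are ad hoc choices for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 64
freqs_true = np.array([0.12, 0.31, 0.47])
t = np.arange(N)
x = sum(np.exp(2j * np.pi * f * t) for f in freqs_true)
x = x + 0.05 * (rng.normal(size=N) + 1j * rng.normal(size=N))

L = N // 2
H = np.array([x[i:i + N - L + 1] for i in range(L)])       # L x (N-L+1) Hankel matrix
U, _, _ = np.linalg.svd(H)
s = len(freqs_true)
noise_space = U[:, s:]                                     # orthogonal complement of the signal space

grid = np.linspace(0, 1, 2000, endpoint=False)
phi = np.exp(2j * np.pi * np.outer(np.arange(L), grid))    # steering vectors on the grid
R = np.linalg.norm(noise_space.conj().T @ phi, axis=0) / np.linalg.norm(phi, axis=0)

is_min = (R < np.roll(R, 1)) & (R < np.roll(R, -1))        # local minima of the correlation
cand = grid[is_min]
est = np.sort(cand[np.argsort(R[is_min])[:s]])             # the s smallest local minima
print(est)                                                 # compare with freqs_true
```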

184 citations


Proceedings Article
05 Dec 2016
TL;DR: In this paper, it was shown that all local minima are very close to a global optimum and a curvature bound at saddle points yields a polynomial time global convergence guarantee for stochastic gradient descent.
Abstract: We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

Journal ArticleDOI
TL;DR: In this paper, the use of a distance based on the Kantorovich-Rubinstein norm is introduced in full waveform inversion to overcome the local minima of the conventional L2 misfit function, which correspond to velocity models matching the data up to one or several phase shifts.
Abstract: The use of optimal transport distance has recently yielded significant progress in image processing for pattern recognition, shape identification, and histogram matching. In this study, the use of this distance is investigated for a seismic tomography problem exploiting the complete waveform: the full waveform inversion. In its conventional formulation, this high-resolution seismic imaging method is based on the minimization of the L2 distance between predicted and observed data. Application of this method is generally hampered by the local minima of the associated L2 misfit function, which correspond to velocity models matching the data up to one or several phase shifts. Conversely, the optimal transport distance appears as a more suitable tool to compare the misfit between oscillatory signals, for its ability to detect shifted patterns. However, its application to the full waveform inversion is not straightforward, as the mass conservation between the compared data cannot be guaranteed, a crucial assumption for optimal transport. In this study, the use of a distance based on the Kantorovich–Rubinstein norm is introduced to overcome this difficulty. Its mathematical link with the optimal transport distance is made clear. An efficient numerical strategy for its computation, based on a proximal splitting technique, is introduced. We demonstrate that each iteration of the corresponding algorithm requires solving the Poisson equation, for which fast solvers can be used, relying either on the fast Fourier transform or on multigrid techniques. The development of this numerical method makes possible applications to industrial-scale data, involving tens of millions of discrete unknowns. The results we obtain on such large-scale synthetic data illustrate the potentialities of the optimal transport for seismic imaging. Starting from crude initial velocity models, optimal transport based inversion yields significantly better velocity reconstructions than those based on the L2 distance, in 2D and 3D contexts.
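A crude 1-D illustration of why a transport-based misfit helps with cycle skipping. This sketch uses the plain Wasserstein-1 distance between squared, normalized signals, not the paper's Kantorovich–Rubinstein formulation: for two identical oscillatory wavelets separated by a time shift, the L2 misfit oscillates with the shift, while the transport distance grows smoothly with it (here it is essentially |shift|). The wavelet, frequency, and grid are invented for the demonstration.

```python
import numpy as np

t = np.linspace(-4, 4, 1600)

def wavelet(t0):
    # oscillatory Gabor-like wavelet centred at t0
    return np.cos(2 * np.pi * 3 * (t - t0)) * np.exp(-(t - t0) ** 2 / (2 * 0.5 ** 2))

def to_density(s):
    d = s ** 2                      # squared signal as a positive "mass"
    return d / d.sum()

def w1(p, q):
    # 1-D Wasserstein-1 distance via cumulative distribution functions
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * (t[1] - t[0])

obs = wavelet(0.0)
for shift in np.linspace(-1.0, 1.0, 13):
    pred = wavelet(shift)
    l2 = np.sum((pred - obs) ** 2)
    ot = w1(to_density(pred), to_density(obs))
    print(f"shift={shift:+.2f}   L2={l2:8.2f}   W1={ot:.4f}")
```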

Proceedings Article
09 Nov 2016
TL;DR: In this article, the authors analyze one-hidden-layer neural networks with ReLU activation and show that, despite the non-convexity, networks with diverse units have no spurious local minima, and that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough.
Abstract: Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.

Journal ArticleDOI
TL;DR: This work introduces a mixed-integer formulation whose standard relaxation still has the same solutions as the underlying cardinality-constrained problem; the relation between the local minima is also discussed in detail.
Abstract: Optimization problems with cardinality constraints are very difficult mathematical programs which are typically solved by global techniques from discrete optimization. Here we introduce a mixed-integer formulation whose standard relaxation still has the same solutions (in the sense of global minima) as the underlying cardinality-constrained problem; the relation between the local minima is also discussed in detail. Since our reformulation is a minimization problem in continuous variables, it allows us to apply ideas from that field to cardinality-constrained problems. Here, in particular, we therefore also derive suitable stationarity conditions and suggest an appropriate regularization method for the solution of optimization problems with cardinality constraints. This regularization method is shown to be globally convergent to a Mordukhovich-stationary point. Extensive numerical results are given to illustrate the behavior of this method.
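The sketch below encodes a relaxed reformulation of the kind described in the abstract, with auxiliary variables y in [0,1]^n, the constraint sum_i y_i >= n - kappa, and the complementarity condition x_i * y_i = 0; the paper's exact formulation and notation may differ. The check verifies on random vectors that feasibility of this system matches the cardinality condition ||x||_0 <= kappa.

```python
import numpy as np

def cardinality_ok(x, kappa):
    return np.count_nonzero(x) <= kappa

def relaxed_feasible(x, kappa):
    # y_i = 1 where x_i == 0, else 0: the natural certificate for the relaxed system
    y = (x == 0).astype(float)
    return bool(y.sum() >= x.size - kappa and np.all(x * y == 0))

rng = np.random.default_rng(4)
n, kappa = 8, 3
for _ in range(1000):
    x = rng.normal(size=n) * (rng.random(n) < 0.4)      # random sparse-ish vector
    assert cardinality_ok(x, kappa) == relaxed_feasible(x, kappa)
print("relaxed feasibility matches the cardinality constraint on all samples")
```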

Posted Content
TL;DR: The effectiveness of this heuristic is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
Abstract: A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
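A hedged sketch of noisy normalized gradient descent on a toy two-dimensional function with a saddle at the origin; the step size and noise level are ad hoc, not the carefully chosen parameters the abstract refers to.

```python
import numpy as np

def grad(v):
    x, y = v
    return np.array([2 * x, -2 * y + y ** 3])   # gradient of f(x, y) = x^2 - y^2 + y^4/4

rng = np.random.default_rng(5)
v = np.array([1e-8, 0.0])                        # start essentially at the saddle point (0, 0)
eta, noise = 0.01, 1e-3
for _ in range(5000):
    g = grad(v)
    direction = g / (np.linalg.norm(g) + 1e-12)  # only the direction of the gradient is used
    v = v - eta * (direction + noise * rng.normal(size=2))
print(v)   # should end near one of the local minima at (0, +sqrt(2)) or (0, -sqrt(2))
```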

Proceedings Article
01 Jan 2016
TL;DR: It is established that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points.
Abstract: We provide two fundamental results on the population (infinite-sample) likelihood function of Gaussian mixture models with $M \geq 3$ components. Our first main result shows that the population likelihood function has bad local maxima even in the special case of equally-weighted mixtures of well-separated and spherical Gaussians. We prove that the log-likelihood value of these bad local maxima can be arbitrarily worse than that of any global optimum, thereby resolving an open question of Srebro (2007). Our second main result shows that the EM algorithm (or a first-order variant of it) with random initialization will converge to bad critical points with probability at least $1-e^{-\Omega(M)}$. We further establish that a first-order variant of EM will not converge to strict saddle points almost surely, indicating that the poor performance of the first-order method can be attributed to the existence of bad local maxima rather than bad saddle points. Overall, our results highlight the necessity of careful initialization when using the EM algorithm in practice, even when applied in highly favorable settings.
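A small informal experiment in the spirit of the abstract (not its construction or proof): fit a well-separated spherical Gaussian mixture repeatedly with randomly initialized EM and look at the spread of the achieved log-likelihoods. The number of components, data layout, and EM settings are arbitrary choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
M, per = 8, 200
centers = 10.0 * np.arange(M)                    # well-separated 1-D means
X = np.concatenate([c + rng.normal(size=per) for c in centers]).reshape(-1, 1)

scores = []
for seed in range(20):
    gm = GaussianMixture(n_components=M, covariance_type="spherical",
                         init_params="random", n_init=1, max_iter=500,
                         random_state=seed).fit(X)
    scores.append(gm.score(X))                   # mean log-likelihood per sample
print(np.round(np.sort(scores), 3))              # typically shows a clear spread across runs
```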

Proceedings ArticleDOI
01 Mar 2016
TL;DR: In this article, a branch-and-bound approach was proposed to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation.
Abstract: Gaussian mixture alignment is a family of approaches that are frequently used for robustly solving the point-set registration problem. However, since they use local optimisation, they are susceptible to local minima and can only guarantee local optimality. Consequently, their accuracy is strongly dependent on the quality of the initialisation. This paper presents the first globally-optimal solution to the 3D rigid Gaussian mixture alignment problem under the L2 distance between mixtures. The algorithm, named GOGMA, employs a branch-and-bound approach to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation. The geometry of SE(3) was used to find novel upper and lower bounds for the objective function and local optimisation was integrated into the scheme to accelerate convergence without voiding the optimality guarantee. The evaluation empirically supported the optimality proof and showed that the method performed much more robustly on two challenging datasets than an existing globally-optimal registration solution.

Posted Content
03 Nov 2016
TL;DR: A non-convex second-order optimization algorithm is presented that is guaranteed to return an approximate local minimum in time linear in the input representation and applies to a general class of optimization problems, including training a neural network and other non-convex objectives arising in machine learning.
Abstract: We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which is linear in the input representation. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

Posted Content
TL;DR: It is demonstrated that in this scenario one can construct counter-examples (datasets or initialization schemes) for which the network does become susceptible to bad local minima over the weight space.
Abstract: There has been a lot of recent interest in trying to characterize the error surface of deep models. This stems from a long standing question. Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima? It is widely believed that training of deep models using gradient methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minimum. It is known that such results hold under very strong assumptions which are not satisfied by real models. In this paper we present examples showing that for such theorem to be true additional assumptions on the data, initialization schemes and/or the model classes have to be made. We look at the particular case of finite size datasets. We demonstrate that in this scenario one can construct counter-examples (datasets or initialization schemes) when the network does become susceptible to bad local minima over the weight space.

Journal ArticleDOI
TL;DR: A density-based clustering method is proposed that is deterministic, computationally efficient, and self-consistent in its parameter choice, and it is embedded in a complete workflow to robustly generate Markov state models from molecular dynamics trajectories.
Abstract: A density-based clustering method is proposed that is deterministic, computationally efficient, and self-consistent in its parameter choice. By calculating a geometric coordinate space density for every point of a given data set, a local free energy is defined. On the basis of these free energy estimates, the frames are lumped into local free energy minima, ultimately forming microstates separated by local free energy barriers. The algorithm is embedded into a complete workflow to robustly generate Markov state models from molecular dynamics trajectories. It consists of (i) preprocessing of the data via principal component analysis in order to reduce the dimensionality of the problem, (ii) proposed density-based clustering to generate microstates, and (iii) dynamical clustering via the most probable path algorithm to construct metastable states. To characterize the resulting state-resolved conformational distribution, dihedral angle content color plots are introduced which identify structural differences ...
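A hedged sketch of the central idea (density, then free energy, then lumping into local minima) on 1-D toy data with a simple histogram density estimate; the paper's actual workflow (PCA preprocessing, its specific density estimator, most-probable-path dynamical clustering) is not reproduced, and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
# toy "trajectory" samples drawn from two metastable basins
x = np.concatenate([rng.normal(-1.5, 0.3, 5000), rng.normal(1.0, 0.4, 3000)])

counts, edges = np.histogram(x, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
free_energy = -np.log(counts + 1e-12)            # F = -kT ln(rho), with kT = 1

# lump each bin into a local free-energy minimum by walking downhill
assignment = np.empty(len(centers), dtype=int)
for i in range(len(centers)):
    j = i
    while True:
        nb = [k for k in (j - 1, j + 1) if 0 <= k < len(centers)]
        best = min(nb, key=lambda k: free_energy[k])
        if free_energy[best] >= free_energy[j]:
            break
        j = best
    assignment[i] = j                            # index of the minimum this bin drains into

print("local free-energy minima (bin centers):", np.round(centers[np.unique(assignment)], 2))
```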

Journal ArticleDOI
TL;DR: Numerical results show the ability of the algorithm to create new holes, identify stress concentrations, and provide stable optimization sequences converging to local minima defined by stress-saturated designs.

Journal ArticleDOI
TL;DR: It is shown that the proposed formulation is more suitable for beam contact than possible alternatives based on mortar-type contact discretizations or constraint enforcement by means of Lagrange multipliers, and that it is enhanced by a consistently linearized integration-interval segmentation that avoids numerical integration across strong discontinuities.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a misfit function based on adaptive matching filtering (AMF) to tackle cycle skipping and local minima in full-waveform inversion (FWI).
Abstract: We have proposed a misfit function based on adaptive matching filtering (AMF) to tackle challenges associated with cycle skipping and local minima in full-waveform inversion (FWI). This AMF is designed to measure time-varying phase differences between observations and predictions. Compared with classical least-squares waveform differences, our misfit function behaves as a smooth, quadratic function with a broad basin of attraction. These characters are important because local gradient-based optimization approaches used in most FWI schemes cannot guarantee convergence toward true models if misfit functions include local minima or if the starting model is far away from the global minimum. The 1D and 2D synthetic experiments illustrate the advantages of the proposed misfit function compared with the classical least-squares waveform misfit. Furthermore, we have derived adjoint sources associated with the proposed misfit function and applied them in several 2D time-domain acoustic FWI experiments. Nume...

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a multi-objective inversion scheme that uses the conventional least-squares functional along with an auxiliary data-domain objective, called the bump functional.
Abstract: Least-squares inversion of seismic arrivals can provide remarkably detailed models of the Earth's subsurface. However, cycle skipping associated with these oscillatory arrivals is the main cause for local minima in the least-squares objective function. Therefore, it is often difficult for descent methods to converge to the solution without an accurate initial large-scale velocity estimate. The low frequencies in the arrivals, needed to update the large-scale components in the velocity model, are usually unreliable or absent. To overcome this difficulty, we propose a multi-objective inversion scheme that uses the conventional least-squares functional along with an auxiliary data-domain objective. As the auxiliary objective effectively replaces the seismic arrivals by bumps, we call it the bump functional. The bump functional minimization can be made far less sensitive to cycle skipping and can deal with multiple arrivals in the data. However, it can only be used as an auxiliary objective since it usually does not provide a unique model after minimization even when the regularized-least-squares functional has a unique global minimum and hence a unique solution. The role of the bump functional during the multi-objective inversion is to guide the optimization towards the global minimum by pulling the trapped solution out of the local minima associated with the least-squares functional whenever necessary. The computational complexity of the bump functional is equivalent to that of the least-squares functional. In this paper, we describe various characteristics of the bump functional using simple and illustrative numerical examples. We also demonstrate the effectiveness of the proposed multi-objective inversion scheme by considering more realistic examples. These include synthetic and field data from a cross-well experiment, surface-seismic synthetic data with reflections and synthetic data with refracted arrivals at long offsets.

Posted Content
TL;DR: In this article, a branch-and-bound approach was proposed to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation.
Abstract: Gaussian mixture alignment is a family of approaches that are frequently used for robustly solving the point-set registration problem. However, since they use local optimisation, they are susceptible to local minima and can only guarantee local optimality. Consequently, their accuracy is strongly dependent on the quality of the initialisation. This paper presents the first globally-optimal solution to the 3D rigid Gaussian mixture alignment problem under the L2 distance between mixtures. The algorithm, named GOGMA, employs a branch-and-bound approach to search the space of 3D rigid motions SE(3), guaranteeing global optimality regardless of the initialisation. The geometry of SE(3) was used to find novel upper and lower bounds for the objective function and local optimisation was integrated into the scheme to accelerate convergence without voiding the optimality guarantee. The evaluation empirically supported the optimality proof and showed that the method performed much more robustly on two challenging datasets than an existing globally-optimal registration solution.

Posted Content
TL;DR: A general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures is proposed and the locations of stationary points and the null space of the associated Hessian matrices are characterized via the lens of invariant groups.
Abstract: We propose a general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures. In specific, we characterize the locations of stationary points and the null space of the associated Hessian matrices via the lens of invariant groups. As a major motivating example, we apply the proposed general theory to characterize the global geometry of the low-rank matrix factorization problem. In particular, we illustrate how the rotational symmetry group gives rise to infinitely many non-isolated strict saddle points and equivalent global minima of the objective function. By explicitly identifying all stationary points, we divide the entire parameter space into three regions: ($\mathcal{R}_1$) the region containing the neighborhoods of all strict saddle points, where the objective has negative curvatures; ($\mathcal{R}_2$) the region containing neighborhoods of all global minima, where the objective enjoys strong convexity along certain directions; and ($\mathcal{R}_3$) the complement of the above regions, where the gradient has sufficiently large magnitudes. We further extend our result to the matrix sensing problem. This allows us to establish strong global convergence guarantees for popular iterative algorithms with arbitrary initial solutions.

Journal ArticleDOI
TL;DR: This paper proposes a new global optimization algorithm that incorporates a new multimutation scheme into a differential evolution algorithm; the resulting method has a very good ability to explore the search space and converges very fast.
Abstract: Seismic inversion problems often involve nonlinear relationships between data and model and usually have many local minima. Linearized inversion methods have been widely used to solve such problems. However, these kinds of methods often strongly depend on the initial model and are easily trapped in a local minimum. Global optimization methods, on the other hand, do not require a very good initial model and can approach a global minimum. However, global optimization methods are exhaustive search techniques that can be very time consuming. When the model dimension or the search space becomes large, these methods can be very slow to converge. In this paper, we propose a new global optimization algorithm by incorporating a new multimutation scheme into a differential evolution algorithm. Because mutation operation with the new multimutation scheme can generate better mutant vectors, the new global optimization algorithm has a very good ability of exploring the search space and can converge very fast. We apply the proposed algorithm to both synthetic and field data to test its performance. The results have clearly indicated that the new global optimization algorithm provides faster convergence and yields better results compared with the conventional global optimization methods in seismic inversion.
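For orientation, the sketch below is a plain differential evolution loop (DE/rand/1/bin) on a standard multimodal test function; the paper's contribution, the multimutation scheme, is not reproduced here, and the control parameters F, CR, population size, and generation count are generic defaults rather than the paper's settings.

```python
import numpy as np

def rastrigin(x):
    return 10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(8)
dim, pop_size, F, CR = 5, 40, 0.7, 0.9
pop = rng.uniform(-5.12, 5.12, size=(pop_size, dim))
fit = np.array([rastrigin(p) for p in pop])

for _ in range(500):
    for i in range(pop_size):
        others = [j for j in range(pop_size) if j != i]
        a, b, c = pop[rng.choice(others, 3, replace=False)]
        mutant = a + F * (b - c)                       # single mutation (DE/rand/1)
        cross = rng.random(dim) < CR
        if not cross.any():
            cross[rng.integers(dim)] = True            # ensure at least one mutated component
        trial = np.where(cross, mutant, pop[i])
        f_trial = rastrigin(trial)
        if f_trial <= fit[i]:                          # greedy selection
            pop[i], fit[i] = trial, f_trial
print(fit.min())                                       # best objective value found
```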

Journal ArticleDOI
TL;DR: Under the assumptions of local weak sharp minima of order $p$ ($p \in [1,2]$) and a quasi-regularity condition, a local superlinear convergence rate is established for the linearized proximal algorithm (LPA).
Abstract: In the present paper, we investigate a linearized proximal algorithm (LPA) for solving a convex composite optimization problem. Each iteration of the LPA is a proximal minimization of the convex composite function with the inner function being linearized at the current iterate. The LPA has the attractive computational advantage that the solution of each subproblem is a singleton, which avoids the difficulty as in the Gauss–Newton method (GNM) of finding a solution with minimum norm among the set of minima of its subproblem, while still maintaining the same local convergence rate as that of the GNM. Under the assumptions of local weak sharp minima of order $p$ ($p \in [1,2]$) and a quasi-regularity condition, we establish a local superlinear convergence rate for the LPA. We also propose a globalization strategy for the LPA based on a backtracking line-search and an inexact version of the LPA. We further apply the LPA to solve a (possibly nonconvex) feasibility problem, as well as a sensor network localiza...

Posted Content
TL;DR: The proposed general theory for studying the geometry of nonconvex objective functions with underlying symmetric structures describes how the rotational symmetry group gives rise to infinitely many nonisolated strict saddle points and equivalent global minima of the objective function.
Abstract: We propose a general theory for studying the landscape of nonconvex optimization with underlying symmetric structures for a class of machine learning problems (e.g., low-rank matrix factorization, phase retrieval, and deep linear neural networks). In specific, we characterize the locations of stationary points and the null space of Hessian matrices of the objective function via the lens of invariant groups. As a major motivating example, we apply the proposed general theory to characterize the global landscape of the nonconvex optimization in the low-rank matrix factorization problem. In particular, we illustrate how the rotational symmetry group gives rise to infinitely many nonisolated strict saddle points and equivalent global minima of the objective function. By explicitly identifying all stationary points, we divide the entire parameter space into three regions: ($\mathcal{R}_1$) the region containing the neighborhoods of all strict saddle points, where the objective has negative curvatures; ($\mathcal{R}_2$) the region containing neighborhoods of all global minima, where the objective enjoys strong convexity along certain directions; and ($\mathcal{R}_3$) the complement of the above regions, where the gradient has sufficiently large magnitudes. We further extend our result to the matrix sensing problem. Such global landscape implies strong global convergence guarantees for popular iterative algorithms with arbitrary initial solutions.

Journal ArticleDOI
TL;DR: In this paper, a piecewise deterministic Markov process is designed to sample the corresponding Gibbs measure; in dimension one, an Eyring-Kramers formula is obtained for the exit time of the domain of a local minimum at low temperature, and a necessary and sufficient condition is given on the cooling schedule in a simulated annealing algorithm to ensure the process converges to the set of global minima.
Abstract: Given an energy potential on the Euclidean space, a piecewise deterministic Markov process is designed to sample the corresponding Gibbs measure. In dimension one, an Eyring-Kramers formula is obtained for the exit time of the domain of a local minimum at low temperature, and a necessary and sufficient condition is given on the cooling schedule in a simulated annealing algorithm to ensure the process converges to the set of global minima. This condition is similar to the classical one for diffusions and involves the critical depth of the potential. In higher dimensions a non-optimal sufficient condition is obtained.
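A hedged sketch of a classical Metropolis simulated-annealing loop with a logarithmic cooling schedule T_k = c / log(k + 2) on a 1-D double well, illustrating the role of the constant c relative to the critical depth of the potential; the paper's sampler is a piecewise deterministic Markov process, which is not reproduced here, and the potential, proposal width, and run length are invented for illustration.

```python
import numpy as np

def U(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x       # double well; global minimum near x = -1

rng = np.random.default_rng(9)
x = 1.0                                      # start in the shallower (local) well
c = 1.0                                      # should exceed the critical depth of the potential
for k in range(200000):
    T = c / np.log(k + 2)                    # logarithmic cooling schedule
    prop = x + 0.2 * rng.normal()            # random-walk proposal
    if rng.random() < np.exp(-(U(prop) - U(x)) / T):
        x = prop                             # Metropolis acceptance
print(x)   # expected to settle near the global minimum around x = -1
```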