
Showing papers on "Maxima and minima" published in 2017


Posted Content
TL;DR: Through this analysis, it is found that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size.
Abstract: We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.
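
As a quick illustration of the learning-rate-to-batch-size ratio discussed above, here is a minimal numpy sketch (a toy quadratic of our own, not the authors' experiments): two settings that share the same ratio lr/B produce a similar stationary spread of the SGD iterates, while a smaller ratio produces a visibly smaller one.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)          # 1-D "dataset"; per-example loss is (w - x)^2 / 2

def run_sgd(lr, batch, steps=20_000):
    w, tail = 0.0, []
    for t in range(steps):
        x = rng.choice(data, size=batch)
        w -= lr * (w - x.mean())         # minibatch gradient step
        if t > steps // 2:
            tail.append(w)               # record the second half of the trajectory
    return np.var(tail)                  # spread of the iterates around the minimum

print(run_sgd(lr=0.02, batch=8))         # ratio lr/B = 0.0025
print(run_sgd(lr=0.08, batch=32))        # same ratio -> similar spread
print(run_sgd(lr=0.02, batch=32))        # smaller ratio -> noticeably smaller spread
```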

386 citations


Posted Content
TL;DR: It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries these architectures exhibit is exploited to build equivalent models corresponding to arbitrarily sharper minima.
Abstract: Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we are allowed to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.
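
The symmetry being exploited is the positive rescaling invariance of rectifier networks; in our notation, for a two-layer slice and any $\alpha > 0$,

$$ \tfrac{1}{\alpha}\, W_2\, \mathrm{relu}(\alpha W_1 x) = W_2\, \mathrm{relu}(W_1 x), $$

so moving along this orbit leaves the function (and hence its generalization) unchanged while rescaling the Hessian, which is why Hessian-based sharpness measures can be made arbitrarily large or small.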

341 citations


Proceedings Article
06 Aug 2017
TL;DR: The authors argue that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization, and exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima.
Abstract: Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we are allowed to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.

323 citations


Posted Content
TL;DR: In this paper, the authors developed a new framework that captures the landscape common to several non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA.
Abstract: In this paper we develop a new framework that captures the landscape common to several non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA. In particular, we show for all of the above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses of optimization landscapes for matrix sensing and symmetric matrix completion. The framework naturally leads to new results for asymmetric matrix completion and robust PCA.

295 citations


Proceedings Article
06 Aug 2017
TL;DR: In this article, the authors show that perturbed gradient descent can escape saddle points almost for free, in a number of iterations which depends only poly-logarithmically on dimension.
Abstract: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
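
A toy numpy sketch of the perturbation idea (ours, not the paper's implementation): plain gradient descent started on the stable manifold of the saddle of f(x, y) = (x^2 - y^2)/2 stalls there, while adding a small random perturbation whenever the gradient is tiny lets the iterate escape along the negative-curvature direction.

```python
import numpy as np

def grad(p):                      # gradient of f(x, y) = (x^2 - y^2) / 2
    return np.array([p[0], -p[1]])

def descend(perturb, steps=200, eta=0.1, radius=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array([1.0, 0.0])      # on the stable manifold of the saddle at (0, 0)
    for _ in range(steps):
        g = grad(p)
        if perturb and np.linalg.norm(g) < 1e-6:
            p = p + rng.uniform(-radius, radius, size=2)   # small ball perturbation
        p = p - eta * g
    return p

print(descend(perturb=False))     # stays stuck near the saddle (0, 0)
print(descend(perturb=True))      # |y| grows: the saddle has been escaped
```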

280 citations


Posted Content
TL;DR: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension, showing that perturbed gradient descent can escape saddle points almost for free.
Abstract: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.

259 citations


Proceedings Article
06 Aug 2017
TL;DR: A new framework that captures the landscape common to several non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA, shows that all local minima are also globally optimal and that no high-order saddle points exist.
Abstract: In this paper we develop a new framework that captures the landscape common to several non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA. In particular, we show for all of the above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses of optimization landscapes for matrix sensing and symmetric matrix completion. The framework naturally leads to new results for asymmetric matrix completion and robust PCA.

202 citations


Proceedings Article
04 Dec 2017
TL;DR: This work proposes a learning based motion capture model that optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video and shows that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.
Abstract: Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections etc.). Optimization models are susceptible to local minima. This has been the bottleneck that has forced the use of clean, green-screen-like backgrounds at capture time, manual initialization, or switching to multiple cameras as the input source. In this work, we propose a learning based motion capture model for single camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. Our model is trained using a combination of strong supervision from synthetic data, and self-supervision from differentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and (c) human-background segmentation, in an end-to-end framework. Empirically we show our model combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the model parameters in the right regime, ensuring good pose and surface initialization at test time, without manual effort. Self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers a much tighter fit than a pretrained fixed model. We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.

201 citations


Posted Content
TL;DR: The underlying reasons why deep neural networks often generalize well are investigated, and it is shown that the characteristic of the loss landscape that explains the good generalization capability is the volume of the basin of attraction of good minima.
Abstract: It is widely observed that deep learning models with learned parameters generalize well, even with many more model parameters than the number of training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between the minima (with the same training error) that generalize well and those that don't. We show that it is the characteristics of the landscape of the loss function that explain the good generalization capability. For the loss landscape of deep networks, the volume of the basin of attraction of good minima dominates that of poor minima, which guarantees that optimization methods with random initialization converge to good minima. We theoretically justify our findings by analyzing 2-layer neural networks, and show that the low-complexity solutions have a small norm of the Hessian matrix with respect to the model parameters. For deeper networks, extensive numerical evidence helps to support our arguments.

167 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: There are sufficient conditions to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization.
Abstract: The past few years have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. A key issue is that the neural network training problem is nonconvex, hence optimization algorithms may not return a global minimum. This paper provides sufficient conditions to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization. Our conditions require both the network output and the regularization to be positively homogeneous functions of the network parameters, with the regularization being designed to control the network size. Our results apply to networks with one hidden layer, where size is measured by the number of neurons in the hidden layer, and multiple deep subnetworks connected in parallel, where size is measured by the number of subnetworks.
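
Concretely, positive homogeneity of degree p means that scaling all network parameters θ by any α > 0 rescales both the network output Φ and the regularizer Θ by the same power (our notation):

$$ \Phi(x;\alpha\theta) = \alpha^{p}\,\Phi(x;\theta), \qquad \Theta(\alpha\theta) = \alpha^{p}\,\Theta(\theta), \qquad \alpha > 0, $$

which holds, for instance, for ReLU networks paired with products of per-layer weight norms as the regularizer.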

132 citations


Posted Content
TL;DR: This paper presents a range estimation algorithm that uses a combination of local search and linear programming to efficiently find the maximum and minimum values taken by the outputs of the NN over the given input set, and demonstrates the effectiveness of the proposed approach for verification of NNs used in automated control as well as those used in classification.
Abstract: Deep neural networks (NN) are extensively used for machine learning tasks such as image classification, perception and control of autonomous systems. Increasingly, these deep NNs are also being deployed in high-assurance applications. Thus, there is a pressing need for developing techniques to verify neural networks to check whether certain user-expected properties are satisfied. In this paper, we study a specific verification problem of computing a guaranteed range for the output of a deep neural network given a set of inputs represented as a convex polyhedron. Range estimation is a key primitive for verifying deep NNs. We present an efficient range estimation algorithm that uses a combination of local search and linear programming problems to efficiently find the maximum and minimum values taken by the outputs of the NN over the given input set. In contrast to recently proposed "monolithic" optimization approaches, we use local gradient descent to repeatedly find and eliminate local minima of the function. The final global optimum is certified using a mixed integer programming instance. We implement our approach and compare it with Reluplex, a recently proposed solver for deep neural networks. We demonstrate the effectiveness of the proposed approach for verification of NNs used in automated control as well as those used in classification.
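
A hedged sketch of the local-search phase only (the linear-programming and mixed-integer certification steps described above are omitted): projected gradient descent with random restarts inside a box-shaped input set, applied to a tiny ReLU network with made-up weights, to estimate the minimum and maximum outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)     # made-up 2-8-1 ReLU network
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])    # input set: a box

def forward(x):
    return float(W2 @ np.maximum(W1 @ x + b1, 0.0) + b2)

def num_grad(x, eps=1e-5):                               # finite-difference gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (forward(x + e) - forward(x - e)) / (2 * eps)
    return g

def local_search(sign, restarts=20, steps=200, eta=0.05):
    best = np.inf                                        # best value of sign * output
    for _ in range(restarts):
        x = rng.uniform(lo, hi)
        for _ in range(steps):
            x = np.clip(x - eta * sign * num_grad(x), lo, hi)
        best = min(best, sign * forward(x))
    return sign * best

print("estimated min:", local_search(+1))
print("estimated max:", local_search(-1))
```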

Posted Content
TL;DR: It is proved that without nonlinearity, depth alone does not create bad local minima, although it induces a non-convex loss surface.
Abstract: In deep learning, depth, as well as nonlinearity, creates non-convex loss surfaces. Does depth alone, then, create bad local minima? In this paper, we prove that without nonlinearity, depth alone does not create bad local minima, although it induces a non-convex loss surface. Using this insight, we greatly simplify a recently proposed proof to show that all of the local minima of feedforward deep linear neural networks are global minima. Our theoretical results generalize previous results with fewer assumptions, and this analysis provides a method to show similar results beyond square loss in deep linear models.
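
For concreteness, the squared-loss objective of a depth-H linear network referred to above is (our notation)

$$ L(W_1,\dots,W_H) = \tfrac{1}{2}\,\big\lVert W_H W_{H-1} \cdots W_1 X - Y \big\rVert_F^2 , $$

which is non-convex in the weights (for example, the rescaling $(W_1, W_2) \mapsto (c\,W_1, c^{-1} W_2)$ leaves the product unchanged), and yet, by the result above, every local minimum is a global minimum.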

Proceedings Article
01 Aug 2017
TL;DR: In this article, an efficient block-diagonal approximation to the Gauss-Newton matrix for feedforward neural networks is presented, which is competitive with state-of-the-art first-order optimisation methods.
Abstract: We present an efficient block-diagonal approximation to the Gauss-Newton matrix for feedforward neural networks. Our resulting algorithm is competitive against state-of-the-art first-order optimisation methods, with sometimes significant improvement in optimisation performance. Unlike first-order methods, for which hyperparameter tuning of the optimisation parameters is often a laborious process, our approach can provide good performance even when used with default settings. A side result of our work is that for piecewise linear transfer functions, the network objective function can have no differentiable local maxima, which may partially explain why such transfer functions facilitate effective optimisation.
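
For reference (our notation, not the paper's derivation), the generalized Gauss-Newton matrix being approximated is

$$ G = \mathbb{E}\!\left[\, J_\theta^{\top} H_{\mathcal{L}}\, J_\theta \,\right] \;\approx\; \operatorname{blockdiag}\big(G_1,\dots,G_L\big), $$

where $J_\theta$ is the Jacobian of the network outputs with respect to the parameters, $H_{\mathcal{L}}$ is the Hessian of the loss with respect to the outputs, and the block-diagonal approximation keeps only the blocks associated with each layer's weights.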

Journal ArticleDOI
25 Feb 2017
TL;DR: In this paper, the authors consider the minimization of a general objective function over the set of rectangular matrices that have rank at most r. Despite the nonconvexity that results from factorizing the variable, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the strict saddle property.
Abstract: This paper considers the minimization of a general objective function $f(\boldsymbol{X})$ over the set of rectangular $n\times m$ matrices that have rank at most $r$ . To reduce the computational burden, we factorize the variable $\boldsymbol{X}$ into a product of two smaller matrices and optimize over these two matrices instead of $\boldsymbol{X}$ . Despite the resulting nonconvexity, recent studies in matrix completion and sensing have shown that the factored problem has no spurious local minima and obeys the so-called strict saddle property (the function has a directional negative curvature at all critical points but local minima). We analyze the global geometry for a general and yet well-conditioned objective function $f(\boldsymbol{X})$ whose restricted strong convexity and restricted strong smoothness constants are comparable. In particular, we show that the reformulated objective function has no spurious local minima and obeys the strict saddle property. These geometric properties imply that a number of iterative optimization algorithms (such as gradient descent) can provably solve the factored problem with global convergence.
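
In the abstract's notation, the factored problem being analyzed typically takes the form (the balancing term is the standard choice in this literature; the paper's exact regularizer may differ)

$$ \min_{U\in\mathbb{R}^{n\times r},\; V\in\mathbb{R}^{m\times r}} \; f\big(UV^{\top}\big) \;+\; \frac{\mu}{4}\,\big\lVert U^{\top}U - V^{\top}V \big\rVert_F^2 , $$

where the second term penalizes imbalance between the two factors without changing the set of products $UV^{\top}$ that can be represented.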

Posted Content
TL;DR: This article showed that almost all local minima are globally optimal for a fully connected network with squared loss and analytic activation function, given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.
Abstract: While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

Proceedings Article
06 Aug 2017
TL;DR: The authors showed that almost all local minima are globally optimal for a fully connected network with squared loss and analytic activation function, given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.
Abstract: While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

Proceedings Article
18 Feb 2017
TL;DR: It is proved that for empirical risk minimization, if the empirical risk is point-wise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that only exist in the empirical risk.
Abstract: We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. The algorithm performs stochastic gradient descent, where in each step it injects appropriately scaled Gaussian noise to the update. We analyze the algorithm's hitting time to an arbitrary subset of the parameter space. Two results follow from our general theory: First, we prove that for empirical risk minimization, if the empirical risk is point-wise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that only exist in the empirical risk. Second, we show that SGLD improves on one of the best known learnability results for learning linear classifiers under the zero-one loss.
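
A minimal sketch of the SGLD update on a toy double-well loss standing in for an empirical risk (step size and temperature are arbitrary choices of ours): each step is a gradient step plus Gaussian noise scaled by the step size and the inverse temperature, which is what lets the iterate hop out of the shallower minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):                          # d/dw of f(w) = (w^2 - 1)^2 + 0.3 * w
    return 4.0 * w * (w * w - 1.0) + 0.3

w, eta, beta = 1.0, 1e-3, 5.0         # start in the shallower well near w = +1
for _ in range(200_000):
    noise = np.sqrt(2.0 * eta / beta) * rng.normal()     # Langevin noise
    w = w - eta * grad(w) + noise

print(w)   # with these settings the chain usually ends near the deeper minimum, w ≈ -1
```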

Journal ArticleDOI
TL;DR: The Tsinghua Global Minimum (TGMin) algorithm, as discussed by the authors, is based on the basin-hopping algorithm and is used to find the global minima of nanoclusters as well as periodic systems.

Journal ArticleDOI
TL;DR: This work proposes a high-dimensional bias potential method (NN2B) based on two machine learning algorithms: the nearest neighbor density estimator (NNDE) and the artificial neural network (ANN) for the bias potential approximation.
Abstract: The free energy calculations of complex chemical and biological systems with molecular dynamics (MD) are inefficient due to multiple local minima separated by high-energy barriers. The minima can be escaped using an enhanced sampling method such as metadynamics, which applies bias (i.e., importance sampling) along a set of collective variables (CV), but the maximum number of CVs (or dimensions) is severely limited. We propose a high-dimensional bias potential method (NN2B) based on two machine learning algorithms: the nearest neighbor density estimator (NNDE) and the artificial neural network (ANN) for the bias potential approximation. The bias potential is constructed iteratively from short biased MD simulations, accounting for correlation among CVs. Our method is capable of achieving ergodic sampling and calculating the free energy of polypeptides with up to an 8-dimensional bias potential.

Proceedings Article
01 Jun 2017
TL;DR: For the random over-complete tensor decomposition problem, this article showed that for any small constant ε > 0, among the set of points with function values a (1+ε)-factor larger than the expectation of the function, all the local maxima are approximate global maxima.
Abstract: Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that ``all local optima are (approximately) global optima'', and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant $\epsilon > 0$, among the set of points with function values $(1+\epsilon)$-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem. Our main technique uses the Kac-Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local minima of a highly-structured random polynomial with dependent coefficients.
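
In this setting, the non-convex objective maximized by gradient ascent is (for a random over-complete 4th-order tensor with components $a_1,\dots,a_n$; notation ours)

$$ \max_{\lVert x \rVert = 1} \; T(x,x,x,x) \;=\; \sum_{i=1}^{n} \langle a_i, x \rangle^4 , $$

and the result above says that every local maximum lying above the stated function-value threshold is an approximate global maximum.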

Journal ArticleDOI
TL;DR: A new discrete upwinding strategy leading to local-extremum-bounded low order approximations with compact stencils, high order variational stabilization based on the difference between two gradient approximations, and new localized limiting techniques for antidiffusive element contributions are proposed.

Journal ArticleDOI
TL;DR: This paper considers polygonal meshes and employs the virtual element method (VEM) to solve two classes of paradigmatic topology optimization problems, one governed by nearly-incompressible and compressible linear elasticity and the other by Stokes equations.
Abstract: It is well known that the solution of topology optimization problems may be affected both by the geometric properties of the computational mesh, which can steer the minimization process towards local (and non-physical) minima, and by the accuracy of the method employed to discretize the underlying differential problem, which may not be able to correctly capture the physics of the problem. In light of the above remarks, in this paper we consider polygonal meshes and employ the virtual element method (VEM) to solve two classes of paradigmatic topology optimization problems, one governed by nearly-incompressible and compressible linear elasticity and the other by Stokes equations. Several numerical results show the virtues of our polygonal VEM based approach with respect to more standard methods.

Journal ArticleDOI
TL;DR: The dynamics of driven-dissipative systems is shown to be well suited for achieving efficient combinatorial optimization, and the heterogeneity in amplitude can be reduced by setting the parameters of the driving signal near a regime, called the dynamic phase transition, where the analog spins' DC components map the global minima of the Ising Hamiltonian more accurately, which, in turn, increases the quality of the solutions found.
Abstract: The dynamics of driven-dissipative systems is shown to be well suited for achieving efficient combinatorial optimization. The proposed method can be applied to solve any combinatorial optimization problem that is equivalent to minimizing an Ising Hamiltonian. Moreover, the dynamics considered can be implemented using various physical systems, as it is based on generic dynamics---the normal form of the supercritical pitchfork bifurcation. The computational principle of the proposed method relies on a hybrid analog-digital representation of the binary Ising spins obtained by considering the gradient descent of a Lyapunov function that is the sum of an analog Ising Hamiltonian and archetypal single- or double-well potentials. By gradually changing the shape of the latter potentials from a single- to a double-well shape, it can be shown that the first nonzero steady states to become stable are associated with global minima of the Ising Hamiltonian, under the approximation that all analog spins have the same amplitude. In the more general case, the heterogeneity in amplitude between analog spins induces the stabilization of local minima, which reduces the quality of solutions to combinatorial optimization problems. However, we show that the heterogeneity in amplitude can be reduced by setting the parameters of the driving signal near a regime, called the dynamic phase transition, where the analog spins' DC components map the global minima of the Ising Hamiltonian more accurately, which, in turn, increases the quality of the solutions found. Lastly, we discuss the possibility of a physical implementation of the proposed method using networks of degenerate optical parametric oscillators.
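
Schematically (our notation, not the paper's exact equations), the gradient dynamics described above descend a Lyapunov function of the form

$$ V(x) = -\frac{1}{2}\sum_{i\ne j} J_{ij}\, x_i x_j \;+\; \sum_i \left( -\frac{a}{2}\, x_i^2 + \frac{1}{4}\, x_i^4 \right) , $$

where the first term is the analog Ising Hamiltonian, the on-site term is the single/double-well potential associated with the supercritical pitchfork normal form, gradually increasing a deforms the wells from single to double, and the binary spins are read out as sign(x_i).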

Posted Content
Zeyuan Allen-Zhu
TL;DR: A stochastic algorithm is designed to train any smooth neural network to $\varepsilon$-approximate local minima using $O(\varepsilon^{-3.25})$ backpropagations; more broadly, it finds $\varepsilon$-approximate local minima of any smooth nonconvex function at rate $O(\varepsilon^{-3.25})$, with only oracle access to stochastic gradients.
Abstract: We design a stochastic algorithm to train any smooth neural network to $\varepsilon$-approximate local minima, using $O(\varepsilon^{-3.25})$ backpropagations. The best result was essentially $O(\varepsilon^{-4})$ by SGD. More broadly, it finds $\varepsilon$-approximate local minima of any smooth nonconvex function at rate $O(\varepsilon^{-3.25})$, with only oracle access to stochastic gradients.
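
For reference, the usual definition of an $\varepsilon$-approximate (second-order) local minimum in this line of work, for a function with $\rho$-Lipschitz Hessian, is (conventions for the constants vary slightly between papers)

$$ \lVert \nabla f(x) \rVert \le \varepsilon , \qquad \lambda_{\min}\!\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\,\varepsilon} . $$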

Journal ArticleDOI
TL;DR: The phase diagrams of numerous materials of technological importance feature high-symmetry high-temperature phases that exhibit phonon instabilities, and the authors propose to compute the free energy of such phases by exploring the system's potential-energy surface through discrete sampling of local minima by means of a lattice gas Monte Carlo approach.
Abstract: The phase diagram of numerous materials of technological importance features high-symmetry high-temperature phases that exhibit phonon instabilities. Leading examples include shape-memory alloys, as well as ferroelectric, refractory, and structural materials. The thermodynamics of these phases have proven challenging to handle by atomistic computational thermodynamic techniques due to the occurrence of constant anharmonicity-driven hopping between local low-symmetry distortions, while maintaining a high-symmetry time-averaged structure. To compute the free energy in such phases, we propose to explore the system's potential-energy surface by discrete sampling of local minima by means of a lattice gas Monte Carlo approach and by continuous sampling by means of a lattice dynamics approach in the vicinity of each local minimum. Given the proximity of the local minima, it is necessary to carefully partition phase space by using a Voronoi tessellation to constrain the domain of integration of the partition function, in order to avoid double-counting artifacts and enable an accurate harmonic treatment near each local minimum. We consider the bcc phase of titanium as a prototypical example to illustrate our approach.
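
In equation form (our notation), the construction described above approximates the partition function as a sum over local minima, each integrated only over its own Voronoi cell $\Omega_i$:

$$ Z \;\approx\; \sum_{i} \int_{\Omega_i} e^{-\beta E(\mathbf{r})}\, d\mathbf{r} , $$

with each cell integral evaluated in the harmonic (lattice-dynamics) approximation around its minimum, so that no region of configuration space is counted twice.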

Journal ArticleDOI
TL;DR: Pearson residual error optimization (PRO) optimizes least mean square error (LSE) reduction while reducing the probability of under- or over-fitting, which ensures optimal PV design parameter identification.

Journal ArticleDOI
TL;DR: A novel technique is described that renders theories of N axions tractable and can be used to efficiently analyze a large class of periodic potentials of arbitrary dimension, and it is found that in a broad class of random theories, the potential is smooth over diameters enhanced by N^{3/2} compared to the typical scale of the potential.
Abstract: We describe a novel technique that renders theories of N axions tractable, and more generally can be used to efficiently analyze a large class of periodic potentials of arbitrary dimension. Such potentials are complex energy landscapes with a number of local minima that scales as $$ \sqrt{N!} $$ , and so for large N appear to be analytically and numerically intractable. Our method is based on uncovering a set of approximate symmetries that exist in addition to the N periods. These approximate symmetries, which are exponentially close to exact, allow us to locate the minima very efficiently and accurately and to analyze other characteristics of the potential. We apply our framework to evaluate the diameters of flat regions suitable for slow-roll inflation, which unifies, corrects and extends several forms of “axion alignment” previously observed in the literature. We find that in a broad class of random theories, the potential is smooth over diameters enhanced by $N^{3/2}$ compared to the typical scale of the potential. A Mathematica implementation of our framework is available online.
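
A typical potential in this class takes the form (notation ours; the paper's conventions may differ)

$$ V(\boldsymbol{\theta}) = \sum_{i=1}^{P} \Lambda_i^4 \left[ 1 - \cos\!\Big( \sum_{j=1}^{N} \mathcal{Q}_{ij}\, \theta_j \Big) \right] , $$

with $P \ge N$ cosine terms coupling the N axions through the charge matrix $\mathcal{Q}$.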

Posted Content
24 Apr 2017
TL;DR: This paper empirically investigates the geometry of the loss functions for state-of-the-art networks with multiple stochastic optimization methods through several experiments that are visualized on polygons, to understand how and when these stochastic optimization methods find local minima.
Abstract: The training of deep neural networks is a high-dimensional optimization problem with respect to the loss function of a model. Unfortunately, these functions are of high dimension and non-convex and hence difficult to characterize. In this paper, we empirically investigate the geometry of the loss functions for state-of-the-art networks with multiple stochastic optimization methods. We do this through several experiments that are visualized on polygons to understand how and when these stochastic optimization methods find minima.
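
A hedged sketch of the kind of slice used in such visualizations (not the authors' code): evaluate a loss along straight-line interpolations between parameter vectors found by different optimizers; the edges of the resulting "polygon", whose vertices are the located minima, reveal barriers and the local flatness around each minimum.

```python
import numpy as np

def loss(w):                                   # stand-in loss; a real network's loss goes here
    return float(np.sum((w ** 2 - 1.0) ** 2))

w_a = np.array([1.0, -1.0, 1.0])               # pretend: minimum found by optimizer A
w_b = np.array([-1.0, 1.0, 1.0])               # pretend: minimum found by optimizer B

for alpha in np.linspace(0.0, 1.0, 11):
    w = (1.0 - alpha) * w_a + alpha * w_b      # one edge of the polygon
    print(f"alpha={alpha:.1f}  loss={loss(w):.3f}")
```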

Posted Content
24 Apr 2017
TL;DR: It is demonstrated that in this scenario one can construct counter-examples (datasets or initialization schemes) for which the network does become susceptible to bad local minima over the weight space.
Abstract: There has been a lot of recent interest in trying to characterize the error surface of deep models. This stems from a long-standing question. Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima? It is widely believed that training of deep models using gradient methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minimum. It is known that such results hold under strong assumptions which are not satisfied by real models. In this paper we present examples showing that for such theorems to be true, additional assumptions on the data, initialization schemes and/or the model classes have to be made. We look at the particular case of finite size datasets. We demonstrate that in this scenario one can construct counter-examples (datasets or initialization schemes) for which the network does become susceptible to bad local minima over the weight space.

Journal ArticleDOI
TL;DR: The results of numerical experiments show that, although the proposed parallel EGO algorithm needs more evaluations to find the optimum compared to the standard EGO algorithm, it is able to reduce the number of optimization cycles.
Abstract: Most parallel efficient global optimization (EGO) algorithms focus only on the parallel architectures for producing multiple updating points, but pay little attention to the balance between the global search (i.e., sampling in different areas of the search space) and local search (i.e., sampling more intensely in one promising area of the search space) of the updating points. In this study, a novel approach is proposed to apply this idea to further accelerate the search of parallel EGO algorithms. In each cycle of the proposed algorithm, all local maxima of the expected improvement (EI) function are identified by a multi-modal optimization algorithm. Then the local EI maxima with values greater than a threshold are selected, and candidates are sampled around these selected EI maxima. The results of numerical experiments show that, although the proposed parallel EGO algorithm needs more evaluations to find the optimum compared to the standard EGO algorithm, it is able to reduce the number of optimization cycles. Moreover, the proposed parallel EGO algorithm achieves better results in terms of both the number of cycles and the number of evaluations compared to a state-of-the-art parallel EGO algorithm over six test problems.
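
For reference, the expected improvement criterion whose local maxima are enumerated above is the standard one for minimization, with Kriging posterior mean $\mu(x)$ and standard deviation $\sigma(x)$, best observed value $f_{\min}$, and $\Phi$, $\phi$ the standard normal CDF and PDF:

$$ \mathrm{EI}(x) = \big(f_{\min}-\mu(x)\big)\,\Phi\!\left(\frac{f_{\min}-\mu(x)}{\sigma(x)}\right) + \sigma(x)\,\phi\!\left(\frac{f_{\min}-\mu(x)}{\sigma(x)}\right) . $$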