Posted Content

Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time

TL;DR: A convex analytic framework utilizing semi-infinite duality is developed to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures, and it is proved that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program.
Abstract: We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with polynomial complexity in the number of data samples, the number of neurons, and the data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program. We then show that multi-layer circular CNN training problems with a single ReLU layer are equivalent to an $\ell_1$ regularized convex program that encourages sparsity in the spectral domain. We also extend these results to three-layer CNNs with two ReLU layers. Furthermore, we present extensions of our approach to different pooling methods, which elucidate the implicit architectural bias as convex regularizers.
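
To make the flavor of these convex programs concrete, the sketch below solves the simpler fully-connected, scalar-output two-layer case from the same line of work (Pilanci & Ergen, 2020), which this paper extends to CNNs. It is an illustration rather than the paper's CNN formulation: the hyperplane-arrangement patterns $D_i = \mathrm{diag}(\mathbb{1}[X u \ge 0])$ are subsampled rather than enumerated, and the dimensions and regularization strength are placeholders.

```python
# Sketch: l2-norm (group lasso) regularized convex program for a two-layer
# scalar-output ReLU network, with subsampled ReLU activation patterns.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, n_patterns = 20, 3, 50        # samples, features, sampled patterns
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
beta = 0.1                          # regularization strength

# Sample diagonal arrangement patterns D_i = diag(1[X u >= 0]).
U = rng.standard_normal((d, n_patterns))
D = np.unique((X @ U >= 0).astype(float), axis=1)    # n x P, one column per D_i
P = D.shape[1]

V = cp.Variable((d, P))             # weights for the positive branch
W = cp.Variable((d, P))             # weights for the negative branch
y_hat = cp.sum(cp.multiply(D, X @ (V - W)), axis=1)  # sum_i D_i X (v_i - w_i)
reg = cp.sum(cp.norm(V, 2, axis=0) + cp.norm(W, 2, axis=0))
cons = [cp.multiply(2 * D - 1, X @ V) >= 0,          # (2 D_i - I) X v_i >= 0
        cp.multiply(2 * D - 1, X @ W) >= 0]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y_hat - y) + beta * reg),
                  cons)
prob.solve()
print("optimal objective:", prob.value)
```

At an optimum, each nonzero column pair $(v_i, w_i)$ maps back to a pair of ReLU neurons of the original network, and the group-$\ell_2$ penalty is the convex counterpart of weight decay.
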
Citations
Posted Content
TL;DR: It is shown that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set, and it is proved that each optimal weight matrix is rank-$K$ and aligns with the previous layers via duality.
Abstract: We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks, we prove that each optimal weight matrix aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove that the same weight alignment holds. As a corollary, we also prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, a result previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when data is rank-one or whitened. The same analysis also applies to architectures with batch normalization, even for arbitrary data. Therefore, we obtain a complete explanation for a recent empirical observation termed Neural Collapse, in which class means collapse to the vertices of a simplex equiangular tight frame.

45 citations


Cites background from "Implicit Convex Regularizers of CNN..."

  • ...In addition, a recent series of work (Pilanci & Ergen, 2020; Ergen & Pilanci, 2021; Sahiner et al., 2021; Gupta et al., 2021) showed that regularized two-layer ReLU network training problems exhibit a convex loss landscape in a higher dimensional space, which was previously attributed to the benign…...


Posted Content
TL;DR: A convex analytic framework for ReLU neural networks is developed that elucidates the inner workings of hidden neurons and their function space characteristics, and establishes an $\ell_0$-$\ell_1$ equivalence for neural networks analogous to minimal cardinality solutions in compressed sensing.
Abstract: We develop a convex analytic approach to analyze finite width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as the extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yields linear spline interpolation for regression problems involving one-dimensional or rank-one data. We also characterize the classification decision regions in terms of a kernel matrix and minimum $\ell_1$-norm solutions. This is in contrast to the Neural Tangent Kernel, which is unable to explain predictions of finite width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cutting-plane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for convergence to a global optimum. Our analysis also shows that optimal network parameters admit interpretable closed-form formulas in some practically relevant special cases.

34 citations
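
For orientation, the "finite dimensional convex problem with infinitely many constraints" mentioned in the abstract can be sketched as follows. This is the standard semi-infinite dual from this line of work (Pilanci & Ergen, 2020), stated up to sign and scaling conventions rather than as this paper's exact statement:

\[
\max_{v \in \mathbb{R}^n} \; -\ell^*(v)
\quad \text{subject to} \quad
\bigl| v^\top (X u)_+ \bigr| \le \beta
\quad \text{for all } u \text{ with } \|u\|_2 \le 1,
\]

where $\ell^*$ is the Fenchel conjugate of the training loss, $X \in \mathbb{R}^{n \times d}$ is the data matrix, and $\beta$ is the regularization strength. A cutting-plane scheme alternates between solving this problem with only the constraints collected so far and searching for a violating direction $u$; each added violating $u$ corresponds to inserting a new hidden neuron.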

Posted Content
TL;DR: A convex duality framework is advocated that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization; the convex dual network not only offers optimal training with convex solvers, but also facilitates interpreting training and prediction.
Abstract: Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only offers optimal training with convex solvers, but also facilitates interpreting training and prediction. In particular, it implies that training neural networks with weight decay regularization induces path sparsity, while prediction amounts to piecewise linear filtering. A range of experiments with the MNIST and fastMRI datasets confirms the efficacy of the dual network optimization problem.

20 citations
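
The weight-decay-to-path-sparsity claim rests on a standard rescaling argument for positively homogeneous activations, not anything specific to this paper. For a two-layer ReLU network $x \mapsto \sum_j \alpha_j (u_j^\top x)_+$, scaling $(u_j, \alpha_j) \to (c_j u_j, \alpha_j / c_j)$ leaves the function unchanged, and minimizing the weight-decay penalty over these scalings gives

\[
\min_{c_j > 0} \; \frac{1}{2} \sum_j \left( c_j^2 \|u_j\|_2^2 + \frac{\alpha_j^2}{c_j^2} \right)
\;=\; \sum_j \|u_j\|_2 \, |\alpha_j|,
\]

by the AM-GM inequality, with the minimum attained at $c_j^2 = |\alpha_j| / \|u_j\|_2$. Weight decay is therefore equivalent, at the optimum, to the sparsity-promoting path norm on the right.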


Cites background or methods from "Implicit Convex Regularizers of CNN..."

  • ...The most relevant to our work are (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020b) which put forth a convex duality framework for two-layer ReLU networks with a single output....


  • ...It is however restricted to scalar-output networks, and considers either fully-connected networks (Pilanci & Ergen, 2020), or, CNNs with average pooling (Ergen & Pilanci, 2020b)....


  • ...Convex neural networks were introduced in (Bach, 2017; Bengio et al., 2006), and later in (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020a;b)....


Proceedings ArticleDOI
17 May 2022
TL;DR: In this paper, the authors derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality for non-linear dot-product self-attention, as well as for alternative mechanisms such as the MLP-Mixer and the Fourier Neural Operator (FNO).
Abstract: Vision transformers using self-attention or its proposed alternatives have demonstrated promising results in many image related tasks. However, the underpinning inductive bias of attention is not well understood. To address this issue, this paper analyzes attention through the lens of convex duality. For the non-linear dot-product self-attention, and alternative mechanisms such as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality. The convex programs lead to block nuclear-norm regularization that promotes low rank in the latent feature and token dimensions. In particular, we show how self-attention networks implicitly cluster the tokens based on their latent similarity. We conduct experiments for transferring a pre-trained transformer backbone to CIFAR-100 classification by fine-tuning a variety of convex attention heads. The results indicate the merits of the bias induced by attention compared with existing MLP or linear heads.

14 citations
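
As background on the regularizer named in the abstract (standard convex analysis rather than a result of the paper): the nuclear norm of a matrix $Z$ with singular values $\sigma_j(Z)$ is

\[
\|Z\|_* \;=\; \sum_j \sigma_j(Z),
\]

the convex envelope of $\operatorname{rank}(Z)$ over the spectral-norm unit ball, so penalizing it promotes low rank. A block nuclear norm $\sum_k \|Z_k\|_*$ applies this penalty to prescribed blocks $Z_k$ of a larger variable, here promoting low rank separately along the latent-feature and token dimensions.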

Proceedings Article
27 Jan 2022
TL;DR: The authors theoretically analyze the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks; overcoming challenges associated with hierarchy, they establish implicit regularization towards low hierarchical tensor rank.
Abstract: In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.

14 citations

References
Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations
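
The core design rule of the abstract, depth built from repeated 3x3 convolutions with occasional 2x2 max pooling, is easy to sketch. The snippet below is illustrative rather than an exact VGG configuration (channel widths, stage depths, and the input size are placeholders), and assumes PyTorch:

```python
# Illustrative VGG-style feature extractor: each stage stacks 3x3
# convolutions (stride 1, padding 1) with ReLU, then halves the
# spatial resolution with 2x2 max pooling.
import torch
import torch.nn as nn

def vgg_stage(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

features = nn.Sequential(
    vgg_stage(3, 64, 2),      # deeper VGG variants add stages and convs
    vgg_stage(64, 128, 2),
    vgg_stage(128, 256, 3),
)

x = torch.randn(1, 3, 32, 32)   # placeholder input
print(features(x).shape)        # torch.Size([1, 256, 4, 4])
```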

Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it underlies applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: In this paper, instead of selecting factors by stepwise backward elimination, the authors focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection.
Abstract: We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations with the multifactor analysis-of-variance problem as the most important and well-known example. Instead of selecting factors by stepwise backward elimination, we focus on the accuracy of estimation and consider extensions of the lasso, the LARS algorithm and the non-negative garrotte for factor selection. The lasso, the LARS algorithm and the non-negative garrotte are recently proposed regression methods that can be used to select individual variables. We study and propose efficient algorithms for the extensions of these methods for factor selection and show that these extensions give superior performance to the traditional stepwise backward elimination method in factor selection problems. We study the similarities and the differences between these methods. Simulations and real examples are used to illustrate the methods.

7,400 citations
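
For reference, the group lasso studied in this paper solves the following convex program, where $\beta_g \in \mathbb{R}^{p_g}$ collects the coefficients of the $p_g$ variables in group $g$ (notation follows the standard formulation):

\[
\min_{\beta} \; \frac{1}{2} \Bigl\| y - \sum_{g=1}^{G} X_g \beta_g \Bigr\|_2^2
\;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|\beta_g\|_2 .
\]

Because the per-group $\ell_2$ penalty is not squared, it acts like an $\ell_1$ penalty across groups and sets entire blocks $\beta_g$ exactly to zero; group-norm penalties of exactly this type appear as the implicit convex regularizers in the CNN formulations above.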

Book
01 Jan 1964
TL;DR: Rudin's classic text on mathematical analysis, covering the real and complex number systems, basic topology, sequences and series, continuity, differentiation, the Riemann-Stieltjes integral, sequences and series of functions, functions of several variables, integration of differential forms, and the Lebesgue theory.
Abstract: Table of contents:
Chapter 1: The Real and Complex Number Systems (Introduction; Ordered Sets; Fields; The Real Field; The Extended Real Number System; The Complex Field; Euclidean Spaces; Appendix; Exercises)
Chapter 2: Basic Topology (Finite, Countable, and Uncountable Sets; Metric Spaces; Compact Sets; Perfect Sets; Connected Sets; Exercises)
Chapter 3: Numerical Sequences and Series (Convergent Sequences; Subsequences; Cauchy Sequences; Upper and Lower Limits; Some Special Sequences; Series; Series of Nonnegative Terms; The Number e; The Root and Ratio Tests; Power Series; Summation by Parts; Absolute Convergence; Addition and Multiplication of Series; Rearrangements; Exercises)
Chapter 4: Continuity (Limits of Functions; Continuous Functions; Continuity and Compactness; Continuity and Connectedness; Discontinuities; Monotonic Functions; Infinite Limits and Limits at Infinity; Exercises)
Chapter 5: Differentiation (The Derivative of a Real Function; Mean Value Theorems; The Continuity of Derivatives; L'Hospital's Rule; Derivatives of Higher Order; Taylor's Theorem; Differentiation of Vector-valued Functions; Exercises)
Chapter 6: The Riemann-Stieltjes Integral (Definition and Existence of the Integral; Properties of the Integral; Integration and Differentiation; Integration of Vector-valued Functions; Rectifiable Curves; Exercises)
Chapter 7: Sequences and Series of Functions (Discussion of Main Problem; Uniform Convergence; Uniform Convergence and Continuity; Uniform Convergence and Integration; Uniform Convergence and Differentiation; Equicontinuous Families of Functions; The Stone-Weierstrass Theorem; Exercises)
Chapter 8: Some Special Functions (Power Series; The Exponential and Logarithmic Functions; The Trigonometric Functions; The Algebraic Completeness of the Complex Field; Fourier Series; The Gamma Function; Exercises)
Chapter 9: Functions of Several Variables (Linear Transformations; Differentiation; The Contraction Principle; The Inverse Function Theorem; The Implicit Function Theorem; The Rank Theorem; Determinants; Derivatives of Higher Order; Differentiation of Integrals; Exercises)
Chapter 10: Integration of Differential Forms (Integration; Primitive Mappings; Partitions of Unity; Change of Variables; Differential Forms; Simplexes and Chains; Stokes' Theorem; Closed Forms and Exact Forms; Vector Analysis; Exercises)
Chapter 11: The Lebesgue Theory (Set Functions; Construction of the Lebesgue Measure; Measure Spaces; Measurable Functions; Simple Functions; Integration; Comparison with the Riemann Integral; Integration of Complex Functions; Functions of Class L2; Exercises)
Bibliography; List of Special Symbols; Index

6,681 citations


"Implicit Convex Regularizers of CNN..." refers methods in this paper

  • ...Next, we use real-valued Radon measures with the uniform norms (Rudin, 1964)....

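For context on the snippet above (this is the standard infinite-width construction in the convex neural network literature, e.g. Bach, 2017, rather than a detail of this paper): a two-layer network of unbounded width can be written as an integral against a signed Radon measure $\mu$ supported on unit-norm hidden weights, and training becomes a convex problem in $\mu$:

\[
f_\mu(x) = \int_{\|u\|_2 \le 1} (u^\top x)_+ \, d\mu(u),
\qquad
\min_{\mu} \; \ell\bigl(f_\mu(X), y\bigr) + \beta \,\|\mu\|_{\mathrm{TV}} ,
\]

where the total variation norm $\|\mu\|_{\mathrm{TV}}$ generalizes the $\ell_1$ penalty to measures; Rudin (1964) supplies the measure-theoretic background for working with this normed space.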