Journal ArticleDOI

Error bounds for approximations with deep ReLU networks.

01 Oct 2017-Neural Networks (Neural Netw)-Vol. 94, pp 103-114
TL;DR: It is proved that deep ReLU networks approximate smooth functions more efficiently than shallow networks, and adaptive depth-6 network architectures that are more efficient than the standard shallow architecture are described.
About: This article is published in Neural Networks. The article was published on 2017-10-01 and is currently open access. It has received 693 citations to date. The article focuses on the topics: Lipschitz continuity & Network complexity.
Citations
Posted Content
TL;DR: In particular, this article showed that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data.
Abstract: We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization for deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.

467 citations
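
A side observation from the abstract above — that with Gaussian initialization the (stochastic) gradient descent iterates of a heavily over-parameterized ReLU network stay inside a small perturbation region around the initial weights — can be checked numerically. The following is a minimal NumPy sketch, not the paper's construction: the width, step size, loss, and data below are illustrative assumptions, and the script simply tracks the relative distance of the hidden-layer weights from their initialization while training on a toy binary classification problem.

import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative assumption, not from the paper).
n, d, m = 200, 10, 4096           # samples, input dimension, hidden width (over-parameterized)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = np.sign(X[:, 0] + 0.5 * X[:, 1])             # labels in {-1, +1}

# Gaussian initialization of the hidden layer; fixed +/-1 output weights.
W0 = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def forward(W):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def gradient(W):
    """Gradient of the mean logistic loss log(1 + exp(-y * f(x))) w.r.t. W."""
    pre = X @ W.T                                  # (n, m) pre-activations
    f = np.maximum(pre, 0.0) @ a / np.sqrt(m)
    s = -y / (1.0 + np.exp(y * f))                 # d(loss)/d(f), shape (n,)
    dpre = np.outer(s, a / np.sqrt(m)) * (pre > 0) # (n, m)
    return dpre.T @ X / n                          # (m, d)

W = W0.copy()
lr = 1.0
for t in range(1, 201):
    W -= lr * gradient(W)
    if t % 50 == 0:
        rel = np.linalg.norm(W - W0) / np.linalg.norm(W0)
        acc = np.mean(np.sign(forward(W)) == y)
        print(f"step {t:4d}  ||W - W0|| / ||W0|| = {rel:.2e}  train accuracy = {acc:.3f}")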

Journal ArticleDOI
TL;DR: It is theoretically proved that in the proposed method, gradient descent algorithms are not attracted to suboptimal critical points or local minima, and the proposed adaptive activation functions are shown to accelerate the minimization process of the loss values in standard deep learning benchmarks with and without data augmentation.

405 citations
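
The TL;DR above concerns activation functions with trainable parameters. As a rough, hedged illustration of the general idea only — the exact parameterization and results of the paper are not reproduced here — the sketch below inserts a single trainable slope a into a ReLU-type activation, relu(a * z), and updates a by gradient descent together with the ordinary weights of a tiny regression network.

import numpy as np

rng = np.random.default_rng(1)

# 1-D toy regression problem (illustrative assumption, not from the paper).
x = np.linspace(-1.0, 1.0, 128)
y = np.sin(3.0 * x)

# One-hidden-layer network whose activation relu(a * z) contains a trainable
# slope a shared by all hidden units (an assumed, generic parameterization).
m = 32
W = rng.standard_normal(m) * 0.5    # input weights
b = rng.standard_normal(m) * 0.5    # biases
v = rng.standard_normal(m) * 0.1    # output weights
a = 1.0                             # adaptive slope, trained jointly

lr = 0.05
n = len(x)
for step in range(2001):
    z = np.outer(x, W) + b                  # (n, m) pre-activations
    mask = (a * z > 0.0)
    h = np.where(mask, a * z, 0.0)          # adaptive activation relu(a * z)
    pred = h @ v
    err = pred - y

    # Backpropagation of the mean squared error through relu(a * z):
    # d/dz = a on the active set, d/da = z on the active set.
    dh = np.outer(err, v)                   # d(loss)/d(h), up to the 1/n factor
    dv = h.T @ err / n
    da = np.sum(dh * z * mask) / n
    dW = ((dh * mask).T @ x) * a / n
    db = np.sum(dh * mask, axis=0) * a / n

    v -= lr * dv
    a -= lr * da
    W -= lr * dW
    b -= lr * db
    if step % 500 == 0:
        print(f"step {step:4d}  mse = {np.mean(err ** 2):.4f}  a = {a:.3f}")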

Journal ArticleDOI
01 Apr 2018
TL;DR: There are increasing gaps between the computational complexity and energy efficiency required for the continued scaling of deep neural networks and the hardware capacity actually available with current CMOS technology scaling, in situations where edge inference is required.
Abstract: Deep neural networks offer considerable potential across a range of applications, from advanced manufacturing to autonomous cars. A clear trend in deep neural networks is the exponential growth of network size and the associated increases in computational complexity and memory consumption. However, the performance and energy efficiency of edge inference, in which the inference (the application of a trained network to new data) is performed locally on embedded platforms that have limited area and power budget, is bounded by technology scaling. Here we analyse recent data and show that there are increasing gaps between the computational complexity and energy efficiency required by data scientists and the hardware capacity made available by hardware architects. We then discuss various architecture and algorithm innovations that could help to bridge the gaps. This Perspective highlights the existence of gaps between the computational complexity and energy efficiency required for the continued scaling of deep neural networks and the hardware capacity actually available with current CMOS technology scaling, in situations where edge inference is required; it then discusses various architecture and algorithm innovations that could help to bridge these gaps.

354 citations

Journal ArticleDOI
TL;DR: It is shown that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough.

345 citations


Cites background, methods, or results from "Error bounds for approximations with deep ReLU networks"

  • ...If we take r = (d+1)/2 + 2 as in our previous discussion, we see that for d ≥ 6, the deep net constructed in Theorem 1 of [29] for achieving an accuracy ε ∈ (0, 1) for approximating f ∈ C^r([0, 1]^d) has at least 2ε^(−d/r) free parameters and at least C_0 d^4 (log(1/ε) + d) layers....

    [...]

  • ...To compare this result with ours when d is large, we need to derive an explicit lower bound for the number of parameters in the above net from the analysis in [29] which is based on Taylor polynomials of f and trapezoid functions defined by σ....

    [...]

  • ...When the activation function is ReLU, explicit rates of approximation by fully connected neural networks were obtained recently in [13] for shallow nets, in [24] for nets with 3 hidden layers, and in [29,2,22] for nets with more layers....

    [...]

  • ...Thus, to achieve an accuracy ε ∈ (0, 1) for approximating f by a ReLU deep net, one takes N = (2^(d+1) d^r / (r! ε))^(1/r) and δ = ε / (2^(d+1) d^r (d + r)) as in [29] and knows that the depth of the net is at least C_0 d^2 (log(1/ε) + (d + 1) log 2 + r log d + log(d + r)), and the total number of parameters for the net is more than the number of coefficients D^α f(m/N)/α!, which is...

    [...]

  • ...In particular, it was shown in Theorem 1 of [29] that for f ∈ C^r([0, 1]^d), the approximation accuracy ε ∈ (0, 1) can be achieved by a ReLU deep net with at most c(log(1/ε) + 1) layers and at most c ε^(−d/r) (log(1/ε) + 1) weights and computation units with a constant c = c(d, r)....

    [...]
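
To make the scaling in the bounds quoted above concrete, here is a small worked evaluation in display form. The specific values d = 8, r = 4, c = 1 and the accuracies below are illustrative assumptions, not taken from either paper; the point is only that the weight count grows like ε^(−d/r) up to a logarithmic factor.

% Weight bound of the form c * eps^{-d/r} (ln(1/eps) + 1), evaluated for the
% illustrative choice d = 8, r = 4 (so d/r = 2) and c = 1:
\[
  \varepsilon = 10^{-2}: \qquad
  \varepsilon^{-d/r}\bigl(\ln(1/\varepsilon) + 1\bigr)
  = 10^{4}\,(\ln 10^{2} + 1) \approx 5.6 \times 10^{4},
\]
\[
  \varepsilon = 10^{-3}: \qquad
  \varepsilon^{-d/r}\bigl(\ln(1/\varepsilon) + 1\bigr)
  = 10^{6}\,(\ln 10^{3} + 1) \approx 7.9 \times 10^{6}.
\]
% Each additional decimal digit of accuracy therefore multiplies the bound by
% roughly 10^{d/r} = 100, which is the curse-of-dimensionality effect the
% comparison above is concerned with.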

Journal ArticleDOI
TL;DR: It is proved that one cannot approximate a general function f ∈ E^β(R^d) using neural networks that are less complex than those produced by the construction, which partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions.

307 citations

References
Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations
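
The abstract above describes the mechanism in one sentence: each layer's representation is computed from the previous layer's, and backpropagation indicates how the internal parameters of every layer should change. Purely as a self-contained illustration of that sentence (none of this code comes from the paper; the sizes, input, and target are assumptions), the sketch below runs one explicit forward and backward pass through a small ReLU network.

import numpy as np

rng = np.random.default_rng(2)

# A small ReLU network on a single input, written so that the forward pass
# (each layer's representation computed from the previous layer's) and the
# backward pass (backpropagation of the error signal) are both explicit.
sizes = [16, 32, 32, 1]
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) / np.sqrt(sizes[i])
      for i in range(len(sizes) - 1)]

x = rng.standard_normal(sizes[0])
target = 1.0

# Forward pass: keep every intermediate representation.
acts = [x]
for i, W in enumerate(Ws):
    z = W @ acts[-1]
    acts.append(np.maximum(z, 0.0) if i < len(Ws) - 1 else z)  # linear top layer

loss = 0.5 * (acts[-1][0] - target) ** 2

# Backward pass: propagate the loss gradient from the top layer down,
# producing d(loss)/dW for every layer on the way.
delta = acts[-1] - target                         # gradient at the output
grads = [None] * len(Ws)
for i in reversed(range(len(Ws))):
    grads[i] = np.outer(delta, acts[i])           # how layer i's weights should change
    delta = Ws[i].T @ delta                       # gradient w.r.t. the previous representation
    if i > 0:
        delta *= (acts[i] > 0.0)                  # ReLU gate of the layer below

print(f"loss = {loss:.4f}")
for i, g in enumerate(grads):
    print(f"layer {i}: gradient shape {g.shape}, norm {np.linalg.norm(g):.3f}")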

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...Recently, multiple successful applications of deep neural networks to pattern recognition problems (Schmidhuber [2015], LeCun et al. [2015]) have revived active interest in theoretical properties of such networks, in particular their expressive power....

    [...]

Book ChapterDOI
TL;DR: This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in a draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady.
Abstract: This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in a draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady. The paper was first published in Russian as Вапник В. Н. and Червоненкис А. Я. О равномерной сходимости частот появления событий к их вероятностям. Теория вероятностей и ее применения 16(2), 264–279 (1971) ("On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and Its Applications).

3,939 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...Bianchini and Scarselli 2014 give bounds for Betti numbers characterizing topological properties of functions represented by networks....

    [...]

Book
01 Nov 1999
TL;DR: The authors explain the role of scale-sensitive versions of the Vapnik Chervonenkis dimension in large margin classification, and in real prediction, and discuss the computational complexity of neural network learning.
Abstract: This important work describes recent theoretical advances in the study of artificial neural networks. It explores probabilistic models of supervised learning problems, and addresses the key statistical and computational questions. Chapters survey research on pattern classification with binary-output networks, including a discussion of the relevance of the Vapnik Chervonenkis dimension, and of estimates of the dimension for several neural network models. In addition, Anthony and Bartlett develop a model of classification by real-output networks, and demonstrate the usefulness of classification with a "large margin." The authors explain the role of scale-sensitive versions of the Vapnik Chervonenkis dimension in large margin classification, and in real prediction. Key chapters also discuss the computational complexity of neural network learning, describing a variety of hardness results, and outlining two efficient, constructive learning algorithms. The book is self-contained and accessible to researchers and graduate students in computer science, engineering, and mathematics.

1,757 citations


"Error bounds for approximations wit..." refers background or methods in this paper

  • ...For any d, n and ε ∈ (0, 1), there is a ReLU network architecture that 1. is capable of expressing any function from F_{d,n} with error ε; 2. has the depth at most c(ln(1/ε) + 1) and at most c ε^(−d/n) (ln(1/ε) + 1) weights and computation units, with some constant c = c(d, n)....

    [...]

  • ...Let us obtain a condition ensuring that such f ∈ F_{d,n}....

    [...]

  • ...…conclude c), observe that computation (4) consists of three instances of f̃_{sq,δ} and finitely many linear and ReLU operations, so, using Proposition 2, we can implement ×̃ by a ReLU network such that its depth and the number of computation units and weights are O(ln(1/δ)), i.e. are O(ln(1/ε) + ln M)....

    [...]

  • ...Namely, let f_m be the piece-wise linear interpolation of f with 2^m + 1 uniformly distributed breakpoints k/2^m, k = 0, ..., 2^m: f_m(k/2^m) = (k/2^m)^2, k = 0, ..., 2^m (see Fig.

    [...]

  • ...Namely, given f ∈ F_{1,1} and ε > 0, set T = ⌈1/ε⌉ and let f̃ be the piece-wise interpolation of f with T + 1 uniformly spaced breakpoints (t/T)_{t=0}^T (i.e., f̃(t/T) = f(t/T), t = 0, ..., T)....

    [...]
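
The excerpts above describe the central construction of [29]: the squaring function on [0, 1] is approximated by its piece-wise linear interpolation f_m at the 2^m + 1 breakpoints k/2^m, that interpolation is built from composed "tooth" (sawtooth) functions realizable by a few ReLU units, and approximate multiplication ×̃ is then obtained from three approximate squarings. The following NumPy sketch checks this numerically; the sawtooth g(x) = 2 min(x, 1 − x) and the error bound 2^(−2m−2) are the standard form of the construction, and everything else (grid, printing) is illustrative.

import numpy as np

def sawtooth(x):
    """g(x) = 2 * min(x, 1 - x) on [0, 1]; as a ReLU expression,
    g(x) = 2*relu(x) - 4*relu(x - 1/2) for x in [0, 1]."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def f_m(x, m):
    """Piece-wise linear interpolation of x**2 at the 2**m + 1 breakpoints
    k / 2**m, written as f_m(x) = x - sum_{s=1}^{m} g_s(x) / 2**(2s),
    where g_s is the s-fold composition of the sawtooth g."""
    out = x.copy()
    g = x.copy()
    for s in range(1, m + 1):
        g = sawtooth(g)
        out = out - g / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 100001)
for m in range(1, 7):
    err = np.max(np.abs(f_m(x, m) - x ** 2))
    # The interpolation error of x**2 on intervals of length 2**-m is 2**(-2m-2).
    print(f"m = {m}:  max |f_m(x) - x^2| = {err:.3e}   2^(-2m-2) = {2.0 ** (-2 * m - 2):.3e}")

# Approximate multiplication is then obtained from squarings via
# x*y = ((x + y)**2 - x**2 - y**2) / 2, which is why the excerpt above speaks
# of three instances of the approximate squaring network.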

Book
31 May 2002
Abstract: Introduction Classical computation Quantum computation Solutions Elementary number theory Bibliography Index.

1,209 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...…network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay-Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates, see Kitaev et al. ...

    [...]