Journal ArticleDOI

Error bounds for approximations with deep ReLU networks.

01 Oct 2017-Neural Networks (Neural Netw)-Vol. 94, pp 103-114
TL;DR: It is proved that deep ReLU networks approximate smooth functions more efficiently than shallow networks, and adaptive depth-6 network architectures that are more efficient than the standard shallow architecture are described.
About: This article is published in Neural Networks. The article was published on 2017-10-01 and is currently open access. It has received 693 citations to date. The article focuses on the topics: Lipschitz continuity & Network complexity.
Citations
Posted Content
TL;DR: In particular, this article showed that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data.
Abstract: We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization for deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.

467 citations
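
A side observation from the abstract above — that with Gaussian initialization the (stochastic) gradient descent iterates of a heavily over-parameterized ReLU network stay inside a small perturbation region around the initial weights — can be checked numerically. The following is a minimal NumPy sketch, not the paper's construction: the width, step size, loss, and data below are illustrative assumptions, and the script simply tracks the relative distance of the hidden-layer weights from their initialization while training on a toy binary classification problem.

import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative assumption, not from the paper).
n, d, m = 200, 10, 4096           # samples, input dimension, hidden width (over-parameterized)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = np.sign(X[:, 0] + 0.5 * X[:, 1])             # labels in {-1, +1}

# Gaussian initialization of the hidden layer; fixed +/-1 output weights.
W0 = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def forward(W):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def gradient(W):
    """Gradient of the mean logistic loss log(1 + exp(-y * f(x))) w.r.t. W."""
    pre = X @ W.T                                  # (n, m) pre-activations
    f = np.maximum(pre, 0.0) @ a / np.sqrt(m)
    s = -y / (1.0 + np.exp(y * f))                 # d(loss)/d(f), shape (n,)
    dpre = np.outer(s, a / np.sqrt(m)) * (pre > 0) # (n, m)
    return dpre.T @ X / n                          # (m, d)

W = W0.copy()
lr = 1.0
for t in range(1, 201):
    W -= lr * gradient(W)
    if t % 50 == 0:
        rel = np.linalg.norm(W - W0) / np.linalg.norm(W0)
        acc = np.mean(np.sign(forward(W)) == y)
        print(f"step {t:4d}  ||W - W0|| / ||W0|| = {rel:.2e}  train accuracy = {acc:.3f}")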

Journal ArticleDOI
TL;DR: It is theoretically proved that in the proposed method, gradient descent algorithms are not attracted to suboptimal critical points or local minima, and the proposed adaptive activation functions are shown to accelerate the minimization process of the loss values in standard deep learning benchmarks with and without data augmentation.

405 citations
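
The TL;DR above concerns activation functions with trainable parameters. As a rough, hedged illustration of the general idea only — the exact parameterization and results of the paper are not reproduced here — the sketch below inserts a single trainable slope a into a ReLU-type activation, relu(a * z), and updates a by gradient descent together with the ordinary weights of a tiny regression network.

import numpy as np

rng = np.random.default_rng(1)

# 1-D toy regression problem (illustrative assumption, not from the paper).
x = np.linspace(-1.0, 1.0, 128)
y = np.sin(3.0 * x)

# One-hidden-layer network whose activation relu(a * z) contains a trainable
# slope a shared by all hidden units (an assumed, generic parameterization).
m = 32
W = rng.standard_normal(m) * 0.5    # input weights
b = rng.standard_normal(m) * 0.5    # biases
v = rng.standard_normal(m) * 0.1    # output weights
a = 1.0                             # adaptive slope, trained jointly

lr = 0.05
n = len(x)
for step in range(2001):
    z = np.outer(x, W) + b                  # (n, m) pre-activations
    mask = (a * z > 0.0)
    h = np.where(mask, a * z, 0.0)          # adaptive activation relu(a * z)
    pred = h @ v
    err = pred - y

    # Backpropagation of the mean squared error through relu(a * z):
    # d/dz = a on the active set, d/da = z on the active set.
    dh = np.outer(err, v)                   # d(loss)/d(h), up to the 1/n factor
    dv = h.T @ err / n
    da = np.sum(dh * z * mask) / n
    dW = ((dh * mask).T @ x) * a / n
    db = np.sum(dh * mask, axis=0) * a / n

    v -= lr * dv
    a -= lr * da
    W -= lr * dW
    b -= lr * db
    if step % 500 == 0:
        print(f"step {step:4d}  mse = {np.mean(err ** 2):.4f}  a = {a:.3f}")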

Journal ArticleDOI
01 Apr 2018
TL;DR: There are increasing gaps between the computational complexity and energy efficiency required for the continued scaling of deep neural networks and the hardware capacity actually available with current CMOS technology scaling, in situations where edge inference is required.
Abstract: Deep neural networks offer considerable potential across a range of applications, from advanced manufacturing to autonomous cars. A clear trend in deep neural networks is the exponential growth of network size and the associated increases in computational complexity and memory consumption. However, the performance and energy efficiency of edge inference, in which the inference (the application of a trained network to new data) is performed locally on embedded platforms that have limited area and power budget, is bounded by technology scaling. Here we analyse recent data and show that there are increasing gaps between the computational complexity and energy efficiency required by data scientists and the hardware capacity made available by hardware architects. We then discuss various architecture and algorithm innovations that could help to bridge the gaps. This Perspective highlights the existence of gaps between the computational complexity and energy efficiency required for the continued scaling of deep neural networks and the hardware capacity actually available with current CMOS technology scaling, in situations where edge inference is required; it then discusses various architecture and algorithm innovations that could help to bridge these gaps.

354 citations

Journal ArticleDOI
TL;DR: It is shown that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough.

345 citations


Cites background, methods, or results from "Error bounds for approximations with deep ReLU networks"

  • ...If we take r = (d+1)/2 + 2 as in our previous discussion, we see that for d ≥ 6, the deep net constructed in Theorem 1 of [29] for achieving an accuracy ε ∈ (0, 1) for approximating f ∈ C^r([0, 1]^d) has at least 2ε^(−d/r) free parameters and at least C_0 d^4 (log(1/ε) + d) layers....

    [...]

  • ...To compare this result with ours when d is large, we need to derive an explicit lower bound for the number of parameters in the above net from the analysis in [29] which is based on Taylor polynomials of f and trapezoid functions defined by σ....

    [...]

  • ...When the activation function is ReLU, explicit rates of approximation by fully connected neural networks were obtained recently in [13] for shallow nets, in [24] for nets with 3 hidden layers, and in [29,2,22] for nets with more layers....

    [...]

  • ...Thus, to achieve an accuracy ε ∈ (0, 1) for approximating f by a ReLU deep net, one takes N = (2^(d+1) d^r / (r! ε))^(1/r) and δ = ε / (2^(d+1) d^r (d + r)) as in [29] and knows that the depth of the net is at least C_0 d^2 (log(1/ε) + (d + 1) log 2 + r log d + log(d + r)), and the total number of parameters for the net is more than the number of coefficients D^α f(m/N)/α!, which is...

    [...]

  • ...In particular, it was shown in Theorem 1 of [29] that for f ∈ C^r([0, 1]^d), the approximation accuracy ε ∈ (0, 1) can be achieved by a ReLU deep net with at most c(log(1/ε) + 1) layers and at most c ε^(−d/r) (log(1/ε) + 1) weights and computation units with a constant c = c(d, r)....

    [...]
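
To make the scaling in the bounds quoted above concrete, here is a small worked evaluation in display form. The specific values d = 8, r = 4, c = 1 and the accuracies below are illustrative assumptions, not taken from either paper; the point is only that the weight count grows like ε^(−d/r) up to a logarithmic factor.

% Weight bound of the form c * eps^{-d/r} (ln(1/eps) + 1), evaluated for the
% illustrative choice d = 8, r = 4 (so d/r = 2) and c = 1:
\[
  \varepsilon = 10^{-2}: \qquad
  \varepsilon^{-d/r}\bigl(\ln(1/\varepsilon) + 1\bigr)
  = 10^{4}\,(\ln 10^{2} + 1) \approx 5.6 \times 10^{4},
\]
\[
  \varepsilon = 10^{-3}: \qquad
  \varepsilon^{-d/r}\bigl(\ln(1/\varepsilon) + 1\bigr)
  = 10^{6}\,(\ln 10^{3} + 1) \approx 7.9 \times 10^{6}.
\]
% Each additional decimal digit of accuracy therefore multiplies the bound by
% roughly 10^{d/r} = 100, which is the curse-of-dimensionality effect the
% comparison above is concerned with.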

Journal ArticleDOI
TL;DR: It is proved that one cannot approximate a general function f ∈ E^β(R^d) using neural networks that are less complex than those produced by the construction, which partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions.

307 citations

References
Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations
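
The abstract above describes the mechanism in one sentence: each layer's representation is computed from the previous layer's, and backpropagation indicates how the internal parameters of every layer should change. Purely as a self-contained illustration of that sentence (none of this code comes from the paper; the sizes, input, and target are assumptions), the sketch below runs one explicit forward and backward pass through a small ReLU network.

import numpy as np

rng = np.random.default_rng(2)

# A small ReLU network on a single input, written so that the forward pass
# (each layer's representation computed from the previous layer's) and the
# backward pass (backpropagation of the error signal) are both explicit.
sizes = [16, 32, 32, 1]
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) / np.sqrt(sizes[i])
      for i in range(len(sizes) - 1)]

x = rng.standard_normal(sizes[0])
target = 1.0

# Forward pass: keep every intermediate representation.
acts = [x]
for i, W in enumerate(Ws):
    z = W @ acts[-1]
    acts.append(np.maximum(z, 0.0) if i < len(Ws) - 1 else z)  # linear top layer

loss = 0.5 * (acts[-1][0] - target) ** 2

# Backward pass: propagate the loss gradient from the top layer down,
# producing d(loss)/dW for every layer on the way.
delta = acts[-1] - target                         # gradient at the output
grads = [None] * len(Ws)
for i in reversed(range(len(Ws))):
    grads[i] = np.outer(delta, acts[i])           # how layer i's weights should change
    delta = Ws[i].T @ delta                       # gradient w.r.t. the previous representation
    if i > 0:
        delta *= (acts[i] > 0.0)                  # ReLU gate of the layer below

print(f"loss = {loss:.4f}")
for i, g in enumerate(grads):
    print(f"layer {i}: gradient shape {g.shape}, norm {np.linalg.norm(g):.3f}")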

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...Recently, multiple successful applications of deep neural networks to pattern recognition problems (Schmidhuber [2015], LeCun et al. [2015]) have revived active interest in theoretical properties of such networks, in particular their expressive power....

    [...]

Book ChapterDOI
TL;DR: This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in a draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady.
Abstract: This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in a draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady. The paper was first published in Russian as Вапник В. Н. and Червоненкис А. Я. О равномерной сходимости частот появления событий к их вероятностям. Теория вероятностей и ее применения 16(2), 264–279 (1971) ("On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and Its Applications).

3,939 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...Bianchini and Scarselli 2014 give bounds for Betti numbers characterizing topological properties of functions represented by networks....

    [...]

Book
01 Nov 1999
TL;DR: The authors explain the role of scale-sensitive versions of the Vapnik Chervonenkis dimension in large margin classification, and in real prediction, and discuss the computational complexity of neural network learning.
Abstract: This important work describes recent theoretical advances in the study of artificial neural networks. It explores probabilistic models of supervised learning problems, and addresses the key statistical and computational questions. Chapters survey research on pattern classification with binary-output networks, including a discussion of the relevance of the Vapnik Chervonenkis dimension, and of estimates of the dimension for several neural network models. In addition, Anthony and Bartlett develop a model of classification by real-output networks, and demonstrate the usefulness of classification with a "large margin." The authors explain the role of scale-sensitive versions of the Vapnik Chervonenkis dimension in large margin classification, and in real prediction. Key chapters also discuss the computational complexity of neural network learning, describing a variety of hardness results, and outlining two efficient, constructive learning algorithms. The book is self-contained and accessible to researchers and graduate students in computer science, engineering, and mathematics.

1,757 citations


"Error bounds for approximations wit..." refers background or methods in this paper

  • ...For any d, n and ε ∈ (0, 1), there is a ReLU network architecture that 1. is capable of expressing any function from F_{d,n} with error ε; 2. has the depth at most c(ln(1/ε) + 1) and at most c ε^(−d/n) (ln(1/ε) + 1) weights and computation units, with some constant c = c(d, n)....

    [...]

  • ...Let us obtain a condition ensuring that such f ∈ F_{d,n}....

    [...]

  • ...…conclude c), observe that computation (4) consists of three instances of f̃_{sq,δ} and finitely many linear and ReLU operations, so, using Proposition 2, we can implement ×̃ by a ReLU network such that its depth and the number of computation units and weights are O(ln(1/δ)), i.e. are O(ln(1/ε) + ln M)....

    [...]

  • ...Namely, let f_m be the piece-wise linear interpolation of f with 2^m + 1 uniformly distributed breakpoints k/2^m, k = 0, ..., 2^m: f_m(k/2^m) = (k/2^m)^2, k = 0, ..., 2^m (see Fig.

    [...]

  • ...Namely, given f ∈ F_{1,1} and ε > 0, set T = ⌈1/ε⌉ and let f̃ be the piece-wise interpolation of f with T + 1 uniformly spaced breakpoints (t/T)_{t=0}^T (i.e., f̃(t/T) = f(t/T), t = 0, ..., T)....

    [...]
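
The excerpts above describe the central construction of [29]: the squaring function on [0, 1] is approximated by its piece-wise linear interpolation f_m at the 2^m + 1 breakpoints k/2^m, that interpolation is built from composed "tooth" (sawtooth) functions realizable by a few ReLU units, and approximate multiplication ×̃ is then obtained from three approximate squarings. The following NumPy sketch checks this numerically; the sawtooth g(x) = 2 min(x, 1 − x) and the error bound 2^(−2m−2) are the standard form of the construction, and everything else (grid, printing) is illustrative.

import numpy as np

def sawtooth(x):
    """g(x) = 2 * min(x, 1 - x) on [0, 1]; as a ReLU expression,
    g(x) = 2*relu(x) - 4*relu(x - 1/2) for x in [0, 1]."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def f_m(x, m):
    """Piece-wise linear interpolation of x**2 at the 2**m + 1 breakpoints
    k / 2**m, written as f_m(x) = x - sum_{s=1}^{m} g_s(x) / 2**(2s),
    where g_s is the s-fold composition of the sawtooth g."""
    out = x.copy()
    g = x.copy()
    for s in range(1, m + 1):
        g = sawtooth(g)
        out = out - g / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 100001)
for m in range(1, 7):
    err = np.max(np.abs(f_m(x, m) - x ** 2))
    # The interpolation error of x**2 on intervals of length 2**-m is 2**(-2m-2).
    print(f"m = {m}:  max |f_m(x) - x^2| = {err:.3e}   2^(-2m-2) = {2.0 ** (-2 * m - 2):.3e}")

# Approximate multiplication is then obtained from squarings via
# x*y = ((x + y)**2 - x**2 - y**2) / 2, which is why the excerpt above speaks
# of three instances of the approximate squaring network.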

Book
31 May 2002
Abstract: Introduction Classical computation Quantum computation Solutions Elementary number theory Bibliography Index.

1,209 citations


"Error bounds for approximations wit..." refers background in this paper

  • ...…network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay-Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates, see Kitaev et al. ...

    [...]